Roughly 2/3 of requests are coming back as 504 errors

Posted over 7 years ago by stlproinc

Answered
stlproinc

Hello, when I test my scrapers without Crawlera I get a 200 response almost every time. But when I make the same requests through Crawlera, I see a large number of 504 errors. My jobs weren't finishing before the next scheduled run, so I lowered DOWNLOAD_TIMEOUT in my settings.py from 600 to 30. The target site generally responds in under 10 seconds, though.


Any idea what could be causing this? Thanks in advance for the help.



nestor posted about 7 years ago Admin Best Answer

Some websites check HTTP headers such as Accept, Accept-Encoding, and Accept-Language. Check what common browsers send for these and add the same headers manually to your spider's requests.
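
As a rough sketch of that suggestion, Scrapy's `DEFAULT_REQUEST_HEADERS` setting in settings.py can carry browser-like headers. The header values below are illustrative assumptions, not values known to work for any particular site; copy the real ones from your browser's network inspector.

```python
# settings.py (sketch): the header values are illustrative assumptions.
# Replace them with what your own browser actually sends.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-US,en;q=0.9",
}
```

The same dictionary can also be passed for a single request with `scrapy.Request(url, headers=...)` if only some pages need it.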



2 Comments


stlproinc posted over 7 years ago

I've been digging into this issue since posting and found that the site I'm scraping uses a service called Iovation to block illegitimate use. I don't know the criteria that service uses to identify a valid user, but if you don't meet them it generates a random failing response. I've seen 404 pages, the wrong page being served (funny in hindsight, but I pulled my hair out for a few hours over that), timeouts, etc. This isn't a fault with Crawlera or my settings at all. The reason I was getting valid responses on my dev machine is that I was being seen as a valid user there.


I was able to see this by using the `scrapy shell` command. As an example:

scrapy shell "https://www.google.com"

And once that gave me a prompt, I typed

view(response)

to show the response in the browser. I noticed odd behavior right away: the page was flashing between two different pages. When I searched through the source I found a JavaScript file named `https://mpsnare.iesnare.com/snare.js` along with timer libraries that were rewriting the page based on results from `snare.js`.
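
For anyone who wants to check for this without eyeballing the page source, here is a minimal sketch. The helper name `find_fingerprint_scripts` and the constants are made up for illustration, and the regex is a simple match on `<script src=...>` tags, not a full HTML parser. It only looks for the `iesnare.com` host from the URL above; other sites may load the script from elsewhere.

```python
import re

# Host fragment taken from the script URL mentioned above.
FINGERPRINT_HOST = "iesnare.com"

# Captures the src attribute of <script> tags (simple regex, not a full parser).
SCRIPT_SRC_RE = re.compile(r'<script[^>]*\bsrc=["\']([^"\']+)["\']', re.IGNORECASE)


def find_fingerprint_scripts(html, host=FINGERPRINT_HOST):
    """Return the script URLs in `html` whose src contains `host`."""
    return [src for src in SCRIPT_SRC_RE.findall(html) if host in src.lower()]
```

Inside `scrapy shell`, after defining the function, it could be called as `find_fingerprint_scripts(response.text)`; a non-empty list suggests the page is loading the fingerprinting script.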


tl;dr -> My target site doesn't want me scraping them and has tools in place to detect and hinder bots.
