
Roughly 2/3 of requests are coming back as 504 errors

Hello, when I test my scrapers without Crawlera I get a 200 response almost every time, but when I make the same requests through Crawlera I see a large number of these 504 errors. My jobs weren't finishing before the next scheduled run, so I lowered DOWNLOAD_TIMEOUT in my settings.py from 600 to 30; when I access the target site directly, the response generally arrives in under 10 seconds.


Any idea what could be causing this? Thanks in advance for the help.
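For reference, the timeout change described above would look something like this in settings.py. `DOWNLOAD_TIMEOUT` is the real Scrapy setting mentioned in the post; the retry settings are an illustrative assumption about how one might tolerate intermittent 504s, not something from the original question:

```python
# settings.py -- sketch of the timeout change described in the question.
# The retry values below are illustrative assumptions, not from the post.

DOWNLOAD_TIMEOUT = 30  # lowered from 600, since direct responses take < 10s

# When a proxy layer returns intermittent 504s, letting Scrapy retry
# those responses a few times can help jobs finish on schedule:
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408]
```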


Best Answer

Some websites check HTTP headers such as Accept, Accept-Encoding, and Accept-Language. Check what common browsers send and add those headers manually to your spider's requests.
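A minimal sketch of that suggestion: the header values below are assumptions copied from what a typical desktop browser might send, so verify them against your own browser's requests (e.g. in its developer tools) before relying on them.

```python
# Browser-like request headers, as the answer suggests. The exact
# values are assumptions -- copy what your own browser actually sends.
BROWSER_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-US,en;q=0.5",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # placeholder UA
}

# In a Scrapy project these could be applied globally in settings.py:
#     DEFAULT_REQUEST_HEADERS = BROWSER_HEADERS
# or per request in a spider:
#     yield scrapy.Request(url, headers=BROWSER_HEADERS, callback=self.parse)
```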


I've been digging into this issue since posting and found that the site I'm scraping uses a service called Iovation to block illegitimate use. I don't know the criteria that service uses to identify a valid user, but if you don't fit them it generates a random failing response. I've seen 404 pages, the wrong page being served (funny in hindsight, but I pulled my hair out for a few hours over that), timeouts, etc. This isn't a fault with Crawlera or my settings at all. The reason I was getting valid responses on my dev machine is that I was being seen as a valid user.


I was able to see this by using the `scrapy shell` command. As an example:

scrapy shell "https://www.google.com"

Once that gave me a prompt, I typed

view(response)

to show the response in the browser. I noticed odd behavior right away: the page was flashing between two different pages. When I searched through the source I found a JavaScript file named `https://mpsnare.iesnare.com/snare.js`, along with timer libraries that were rewriting the page based on results from `snare.js`.
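One way to confirm this kind of fingerprinting without eyeballing the page in a browser is to scan the response body for the script tag. `detect_iovation` is a hypothetical helper written for this thread, not part of Scrapy or any library:

```python
import re

# Hypothetical helper: scan raw HTML for the Iovation fingerprinting
# script observed in the page source (snare.js served from iesnare.com).
IOVATION_RE = re.compile(
    r"""<script[^>]+src=["'][^"']*iesnare\.com/snare\.js""", re.IGNORECASE
)

def detect_iovation(html: str) -> bool:
    """Return True if the page appears to load the snare.js fingerprinter."""
    return bool(IOVATION_RE.search(html))

# Inside `scrapy shell` you could call it on the downloaded page:
#     detect_iovation(response.text)
```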


tl;dr -> My target site doesn't want me scraping them and has tools in place to detect and hinder bots.

