
Roughly 2/3 of requests are coming back as 504 errors

Hello, when I test my scrapers without Crawlera I get a 200 response almost every time, but when I make the same requests through Crawlera I see a large number of these 504 errors. My jobs weren't finishing before the next scheduled run, so I lowered DOWNLOAD_TIMEOUT in my settings.py from 600 to 30; when I access the target site directly, the response generally arrives in under 10 seconds.


Any idea what could be causing this? Thanks in advance for the help.
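For reference, the timeout change described above would look something like this in settings.py. `DOWNLOAD_TIMEOUT` is the real Scrapy setting mentioned in the post; the retry settings are an illustrative assumption about how one might tolerate intermittent 504s, not something from the original question:

```python
# settings.py -- sketch of the timeout change described in the question.
# The retry values below are illustrative assumptions, not from the post.

DOWNLOAD_TIMEOUT = 30  # lowered from 600, since direct responses take < 10s

# When a proxy layer returns intermittent 504s, letting Scrapy retry
# those responses a few times can help jobs finish on schedule:
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408]
```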


Best Answer

Some websites check HTTP headers such as Accept, Accept-Encoding, and Accept-Language. Check what common browsers send and add those headers manually to your spider's requests.
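A minimal sketch of that suggestion: the header values below are assumptions copied from what a typical desktop browser might send, so verify them against your own browser's requests (e.g. in its developer tools) before relying on them.

```python
# Browser-like request headers, as the answer suggests. The exact
# values are assumptions -- copy what your own browser actually sends.
BROWSER_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-US,en;q=0.5",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # placeholder UA
}

# In a Scrapy project these could be applied globally in settings.py:
#     DEFAULT_REQUEST_HEADERS = BROWSER_HEADERS
# or per request in a spider:
#     yield scrapy.Request(url, headers=BROWSER_HEADERS, callback=self.parse)
```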


I've been digging into this issue since posting and found that the site I'm scraping uses a service called Iovation to block illegitimate use. I don't know the criteria that service uses to identify a valid user, but if you don't fit them it generates a random failing response. I've seen 404 pages, the wrong page being served (funny in hindsight, but I pulled my hair out for a few hours over that), timeouts, etc. This isn't a fault with Crawlera or my settings at all. The reason I was getting valid responses on my dev machine is that I was being seen as a valid user.


I was able to see this by using the `scrapy shell` command. As an example:

scrapy shell "https://www.google.com"

Once that gave me a prompt, I typed

view(response)

to show the response in the browser. I noticed odd behavior right away: the page was flashing between two different pages. When I searched through the source I found a JavaScript file named `https://mpsnare.iesnare.com/snare.js`, along with timer libraries that were rewriting the page based on results from `snare.js`.
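One way to confirm this kind of fingerprinting without eyeballing the page in a browser is to scan the response body for the script tag. `detect_iovation` is a hypothetical helper written for this thread, not part of Scrapy or any library:

```python
import re

# Hypothetical helper: scan raw HTML for the Iovation fingerprinting
# script observed in the page source (snare.js served from iesnare.com).
IOVATION_RE = re.compile(
    r"""<script[^>]+src=["'][^"']*iesnare\.com/snare\.js""", re.IGNORECASE
)

def detect_iovation(html: str) -> bool:
    """Return True if the page appears to load the snare.js fingerprinter."""
    return bool(IOVATION_RE.search(html))

# Inside `scrapy shell` you could call it on the downloaded page:
#     detect_iovation(response.text)
```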


tl;dr -> My target site doesn't want me scraping them and has tools in place to detect and hinder bots.

