Hello, when I test my scrapers without Crawlera I get a 200 response almost every time, but when I make the same requests through Crawlera I'm seeing a large number of these 504 errors. My jobs weren't finishing before the next scheduled run, so I lowered the DOWNLOAD_TIMEOUT in my settings.py from 600 to 30. When I access the target site directly, though, the response generally comes back in under 10 seconds.
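For reference, the timeout change is just this one line (Scrapy's built-in default is 180 seconds):

```python
# settings.py -- the change described above
DOWNLOAD_TIMEOUT = 30  # lowered from 600; Scrapy's default is 180
```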
Any idea what could be causing this? Thanks in advance for the help.
0 Votes
nestor posted about 7 years ago · Admin · Best Answer
Some websites check HTTP headers like Accept, Accept-Encoding, Accept-Language, etc. Check what common browsers send and add those headers manually to your spider's requests.
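A minimal sketch of what that looks like in Scrapy, via the DEFAULT_REQUEST_HEADERS and USER_AGENT settings (the values below are illustrative, taken from a typical Chrome request; copy what your own browser actually sends to the target site):

```python
# settings.py -- browser-like headers applied to every request.
# Values are illustrative; capture the real ones from your browser's
# network tab for the target site.
# Note: Scrapy's HttpCompressionMiddleware already sets Accept-Encoding,
# so that header is usually best left out here.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
)
```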
0 Votes
2 Comments
stlproinc posted about 7 years ago
I've been digging into this since posting and found that the site I'm scraping uses a service called Iovation to block illegitimate use. I don't know the criteria the service uses to identify a valid user, but if you don't fit them it serves a random failing response. I've seen 404 pages, the wrong page being served (funny in hindsight; I pulled my hair out for a few hours over that), timeouts, etc. This isn't a fault with Crawlera or my settings at all. The reason I was getting valid responses on my dev machine is that I was being seen as a valid user.
I was able to see this by using the `scrapy shell` command. As an example:

```
scrapy shell "https://www.google.com"
```

Once that gave me a prompt, I typed

```
view(response)
```

to show the response in the browser. I noticed odd behavior right away: the page was flashing between two different pages. When I searched through the source I found a JavaScript file named `https://mpsnare.iesnare.com/snare.js` along with timer libraries that were rewriting the page based on results from `snare.js`.
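If you want to check your own target for the same kind of script, here's a quick sketch you can run at the shell prompt (assuming a recent Scrapy where selectors have `.getall()`):

```python
# at the scrapy shell prompt: list every external script the page loads
# and flag anything served from iesnare.com (Iovation's domain)
for src in response.css("script::attr(src)").getall():
    if "iesnare" in src:
        print("fingerprinting script:", src)
```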
tl;dr -> My target site doesn't want me scraping them and has tools in place to detect and hinder bots.