Roughly 2/3 of requests are coming back as 504 errors
stlproinc
started a topic
almost 7 years ago
Hello, when I test my scrapers without Crawlera I get a 200 response almost every time, but when I make the same requests through Crawlera I see a large number of these 504 errors. My jobs weren't finishing before the next scheduled run, so I lowered DOWNLOAD_TIMEOUT in my settings.py from 600 to 30. When I access the target site directly, though, the response generally comes back in under 10 seconds.
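Roughly what the relevant settings look like (a sketch only; the Crawlera lines assume the standard scrapy-crawlera middleware setup, and the API key is a placeholder):

# settings.py (sketch)
DOWNLOAD_TIMEOUT = 30  # lowered from 600 as described above

# route requests through Crawlera via the scrapy-crawlera middleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<your api key>'  # placeholder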
Any idea what could be causing this? Thanks in advance for the help.
Best Answer
nestor
said
almost 7 years ago
Some websites check HTTP headers like Accept, Accept-Encoding, Accept-Language, etc. Check what common browsers send and add those headers manually to the requests in your spider.
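For example, something along these lines in the spider (the header values are only an illustration; copy whatever your own browser actually sends, e.g. from its network inspector):

import scrapy

class MySpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://www.example.com']

    # DEFAULT_REQUEST_HEADERS is a built-in Scrapy setting; the values
    # below mimic a typical desktop browser
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'en-US,en;q=0.5',
        },
    }

    def parse(self, response):
        self.logger.info('Got %s from %s', response.status, response.url)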
stlproinc
said
I've been digging into this issue since posting and found that the site I'm scraping is using a service called Iovation to block illegitimate use. I don't know the criteria that service uses to identify a valid user, but if you don't fit them it generates a random failing response. I've seen 404 pages, the wrong page being served (funny in hindsight; I pulled my hair out for a few hours over that one), timeouts, etc. This isn't a fault with Crawlera or my settings at all. The reason I was getting valid responses on my dev machine is that I was being seen as a valid user.
I was able to see this by using the `scrapy shell` command. As an example:
scrapy shell "https://www.google.com"
And once that gave me a prompt I typed
view(response)
to show the response in the browser. I noticed odd behavior right away: the page was flashing between two different pages. When I searched through the source I found a JavaScript file named `https://mpsnare.iesnare.com/snare.js` along with timer libraries that were rewriting the page based on results from `snare.js`.
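You can also grep for the script right in the shell; something like this (the XPath here is just an illustration, not what I originally ran):

>>> [src for src in response.xpath('//script/@src').extract() if 'iesnare' in src]
['https://mpsnare.iesnare.com/snare.js']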
tl;dr -> My target site doesn't want me scraping them and has tools in place to detect and hinder bots.