
Unable to get robots.txt from Scrapinghub

I have 6 spiders set up locally and they all work perfectly. But when I upload them to run on Scrapinghub, only 4 work. For the two that fail, it looks like all the requests return 403 responses; the first request, to robots.txt, also yields a 403.

I've set up the following custom_settings in each spider:

custom_settings = {
    'ROBOTSTXT_OBEY': False,
    'USER_AGENT': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
}
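One way to confirm which values a job is actually using is to log the merged settings from inside the spider. This is a minimal sketch; the spider name and start URL are placeholders, not part of the original project:

import scrapy


class SettingsCheckSpider(scrapy.Spider):
    # Hypothetical spider used only to show how to inspect the effective settings.
    name = "settings_check"
    start_urls = ["https://example.com/"]

    custom_settings = {
        'ROBOTSTXT_OBEY': False,
    }

    def start_requests(self):
        # self.settings is the crawler's merged settings object, so this shows
        # whether ROBOTSTXT_OBEY or USER_AGENT was overridden at a higher priority.
        self.logger.info("ROBOTSTXT_OBEY=%s USER_AGENT=%s",
                         self.settings.getbool('ROBOTSTXT_OBEY'),
                         self.settings.get('USER_AGENT'))
        return super().start_requests()

    def parse(self, response):
        self.logger.info("Got HTTP %s from %s", response.status, response.url)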

Best Answer

Check log line 10 of your job. ROBOTSTXT_OBEY is being overridden, and the 403s are most likely the website blocking requests from the IP; you might need a proxy service like Crawlera.
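If it is IP-based blocking, Crawlera is usually wired in through the scrapy-crawlera downloader middleware. A minimal sketch of the relevant settings, assuming the scrapy-crawlera package is installed; the API key is a placeholder:

# settings.py (or the spider's custom_settings)
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<your Crawlera API key>'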



Here's a log from Scrapinghub:

Attachment: robots403.png (79.5 KB)

Request 0 (2018-12-12 17:57:19 UTC)
Duration: 82 ms
Fingerprint: 109aa2f7fa3f609e74579c88b593f1fdbbbc7837
HTTP Method: GET
Response Size: 2043 bytes
HTTP Status: 403
Last Seen: 2018-12-12 17:57:19 UTC
URL: https://www.@#$%^&*.com/robots.txt

 

Request 1 (2018-12-12 17:57:28 UTC)
Duration: 68 ms
Fingerprint: 49dddb7196bff0efdfc43250ce81610150696e3f
HTTP Method: GET
Response Size: 1914 bytes
HTTP Status: 403
Last Seen: 2018-12-12 17:57:28 UTC
URL: https://www.@#$%^&*(.com/#$%^&*(*&^%$#$%^&*().html
