
Unable to get robots.txt from Scrapinghub

I have 6 spiders set up locally and they all work perfectly. But when I upload them to run on Scrapinghub, only 4 work. For the two that fail, it looks like all the requests return 403 responses; the first request, to robots.txt, also yields a 403.

I've set up the following custom_settings in each spider:

custom_settings = {
    'ROBOTSTXT_OBEY': False,
    'USER_AGENT': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
}
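One way to confirm which values a job is actually using is to log the merged settings from inside the spider. This is a minimal sketch; the spider name and start URL are placeholders, not part of the original project:

import scrapy


class SettingsCheckSpider(scrapy.Spider):
    # Hypothetical spider used only to show how to inspect the effective settings.
    name = "settings_check"
    start_urls = ["https://example.com/"]

    custom_settings = {
        'ROBOTSTXT_OBEY': False,
    }

    def start_requests(self):
        # self.settings is the crawler's merged settings object, so this shows
        # whether ROBOTSTXT_OBEY or USER_AGENT was overridden at a higher priority.
        self.logger.info("ROBOTSTXT_OBEY=%s USER_AGENT=%s",
                         self.settings.getbool('ROBOTSTXT_OBEY'),
                         self.settings.get('USER_AGENT'))
        return super().start_requests()

    def parse(self, response):
        self.logger.info("Got HTTP %s from %s", response.status, response.url)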

Best Answer

Check log line 10 of your job. ROBOTSTXT_OBEY is being overridden, and the 403s are most likely the website blocking requests from the IP; you might need a proxy service like Crawlera.
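If it is IP-based blocking, Crawlera is usually wired in through the scrapy-crawlera downloader middleware. A minimal sketch of the relevant settings, assuming the scrapy-crawlera package is installed; the API key is a placeholder:

# settings.py (or the spider's custom_settings)
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<your Crawlera API key>'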



Here's a log from Scrapinghub:

Attachment: robots403.png (79.5 KB)

Request 0 (2018-12-12 17:57:19 UTC)
Duration: 82 ms
Fingerprint: 109aa2f7fa3f609e74579c88b593f1fdbbbc7837
HTTP Method: GET
Response Size: 2043 bytes
HTTP Status: 403
Last Seen: 2018-12-12 17:57:19 UTC
URL: https://www.@#$%^&*.com/robots.txt

 

Request 1 (2018-12-12 17:57:28 UTC)
Duration: 68 ms
Fingerprint: 49dddb7196bff0efdfc43250ce81610150696e3f
HTTP Method: GET
Response Size: 1914 bytes
HTTP Status: 403
Last Seen: 2018-12-12 17:57:28 UTC
URL: https://www.@#$%^&*(.com/#$%^&*(*&^%$#$%^&*().html
