Unable to get robots.txt from Scrapinghub

Posted over 6 years ago by Nikhil

Answered

I have 6 spiders that I have set up locally, and they all work perfectly. But when I upload them to run on Scrapinghub, only 4 work. For the other two, it looks like all the requests return 403 responses; the first request to robots.txt also yields a 403 error.

  

   

I've set up the following custom_settings in each spider:

custom_settings = {
    'ROBOTSTXT_OBEY': False,
    'USER_AGENT': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"
}
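For reference, a minimal sketch of how such settings might sit inside a spider class (the spider name and start URL below are placeholders, not from the actual project); logging the resolved value in the callback is one way to confirm whether custom_settings is actually being applied on Scrapy Cloud:

import scrapy

class ExampleSpider(scrapy.Spider):
    # Placeholder name and URL, for illustration only
    name = "example"
    start_urls = ["https://www.example.com/"]

    custom_settings = {
        'ROBOTSTXT_OBEY': False,
        'USER_AGENT': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    }

    def parse(self, response):
        # Log the setting Scrapy actually resolved, to check whether
        # something (e.g. a project-level setting) is overriding it
        self.logger.info("ROBOTSTXT_OBEY = %s", self.settings.getbool("ROBOTSTXT_OBEY"))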



nestor (Admin) posted over 6 years ago - Best Answer

Check log line 10 of your job: ROBOTSTXT_OBEY is being overridden, and the 403 is most likely the website blocking requests from the IP, so you might need a proxy service like Crawlera.
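For example, a minimal sketch of enabling Crawlera via the scrapy-crawlera middleware (the API key below is a placeholder); these settings can go in the project settings or in a spider's custom_settings:

DOWNLOADER_MIDDLEWARES = {
    # scrapy-crawlera's downloader middleware, at its documented priority
    'scrapy_crawlera.CrawleraMiddleware': 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<your-crawlera-api-key>'  # placeholder

With the middleware enabled, requests are routed through Crawlera's proxy pool rather than the Scrapy Cloud IP, which usually gets past IP-based 403 blocks.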




3 Comments


Nikhil posted over 6 years ago

Here's a log from Scrapinghub.

Attachments (1)



Nikhil posted over 6 years ago

Request 0 (2018-12-12 17:57:19 UTC)
Duration: 82 ms
Fingerprint: 109aa2f7fa3f609e74579c88b593f1fdbbbc7837
HTTP Method: GET
Response Size: 2043 bytes
HTTP Status: 403
Last Seen: 2018-12-12 17:57:19 UTC
URL: https://www.@#$%^&*.com/robots.txt

Request 1 (2018-12-12 17:57:28 UTC)
Duration: 68 ms
Fingerprint: 49dddb7196bff0efdfc43250ce81610150696e3f
HTTP Method: GET
Response Size: 1914 bytes
HTTP Status: 403
Last Seen: 2018-12-12 17:57:28 UTC
URL: https://www.@#$%^&*(.com/#$%^&*(*&^%$#$%^&*().html


