I have 6 spiders that I have set up locally and they all work perfectly. But when I upload them to run on Scrapinghub, only 4 work. It looks like all the requests return 403 responses. The first request to robots.txt also yields a 403 error.
I've set up:

custom_settings = {
    'ROBOTSTXT_OBEY': False,
    'USER_AGENT': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"
}
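For context, a minimal sketch of how per-spider custom_settings are usually declared in Scrapy; the spider name, start URL, and parse callback below are placeholders, not taken from the original post:

import scrapy

class ExampleSpider(scrapy.Spider):
    # Placeholder name and URL; substitute the failing spider's own values.
    name = "example"
    start_urls = ["https://example.com/"]

    # Scrapy applies these at spider priority, above the project settings.py;
    # the answer below notes they can still be overridden when the job runs
    # on Scrapinghub.
    custom_settings = {
        "ROBOTSTXT_OBEY": False,
        "USER_AGENT": (
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
            "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"
        ),
    }

    def parse(self, response):
        # Log what was fetched; 403 responses never reach this callback by
        # default because HttpErrorMiddleware filters them out.
        self.logger.info("Fetched %s (%s)", response.url, response.status)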
nestor posted over 6 years ago · Admin · Best Answer
Check log line 10 of your job. ROBOTSTXT_OBEY is being overridden, and the 403 is most likely the website blocking requests from your IP; you might need a proxy service like Crawlera.
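If the block is IP-based, one way to route traffic through Crawlera is the scrapy-crawlera downloader middleware; a sketch of the project settings, assuming that package is installed (the API key is a placeholder):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # 610 is the priority commonly shown in the scrapy-crawlera docs.
    'scrapy_crawlera.CrawleraMiddleware': 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<your Crawlera API key>'  # placeholder credential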
3 Comments
Nikhil posted over 6 years ago
Here's a log from Scrapinghub:
Attachments (1): robots403.png (79.5 KB)
Nikhil posted over 6 years ago
109aa2f7fa3f609e74579c88b593f1fdbbbc7837  GET  2043 bytes  403  2018-12-12 17:57:19 UTC  https://www.@#$%^&*.com/robots.txt
49dddb7196bff0efdfc43250ce81610150696e3f  GET  1914 bytes  403  2018-12-12 17:57:28 UTC  https://www.@#$%^&*(.com/#$%^&*(*&^%$#$%^&*().html
nestor posted over 6 years ago · Admin · Answer
Check log line 10 of your job. ROBOTSTXT_OBEY is being overridden, and the 403 is most likely the website blocking requests from your IP; you might need a proxy service like Crawlera.
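A quick way to confirm whether ROBOTSTXT_OBEY is really being overridden on Scrapinghub is to log the effective value from inside the failing spider; a sketch, assuming the spider subclasses scrapy.Spider:

def start_requests(self):
    # self.settings reflects the merged settings the running job actually uses.
    self.logger.info(
        "Effective ROBOTSTXT_OBEY: %s",
        self.settings.getbool("ROBOTSTXT_OBEY"),
    )
    yield from super().start_requests()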