Different page served to Scrapinghub IPs

Posted almost 7 years ago by ma

Answered

I'm starting to suspect that some websites shadowban Scrapinghub.

An example is

If I fetch the page by hand with Scrapy and read the content of the JSON-LD script, I get a certain JSON. If I let Scrapinghub fetch it, I get something else.

The simplest item I can yield,

yield {
    'url': response.url,
    'body': response.body,
}

shows that the body fetched in scrapy shell contains a long JSON-LD block.

This selector makes the issue easier to find:

results = response.css("script[type='application/ld+json']").extract()
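To compare what each environment receives without diffing whole bodies, one option is to yield a small fingerprint of the response instead of the raw body. The following is only a sketch using the Python standard library; `JsonLdCollector` and `fingerprint` are hypothetical helper names, not part of Scrapy or Scrapinghub.

```python
import hashlib
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = None  # list of text chunks while inside a JSON-LD script

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blocks.append("".join(self._buffer))
            self._buffer = None


def fingerprint(body: bytes) -> dict:
    """Summarize a response body so two fetches can be compared at a glance."""
    collector = JsonLdCollector()
    collector.feed(body.decode("utf-8", errors="replace"))
    collector.close()

    # Count how many of the collected blocks actually parse as JSON.
    valid = 0
    for block in collector.blocks:
        try:
            json.loads(block)
            valid += 1
        except ValueError:
            pass

    return {
        "body_length": len(body),
        "body_sha256": hashlib.sha256(body).hexdigest(),
        "jsonld_blocks": len(collector.blocks),
        "jsonld_valid": valid,
    }
```

In a spider callback the result of `fingerprint(response.body)` could be yielded next to `response.url`. If the local run and the Scrapinghub run report different hashes or a different number of JSON-LD blocks for the same URL, the site is returning different content to the two sources.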

What can I do? IMHO it's not a matter of the user agent.

nestor posted almost 7 years ago Admin Best Answer

You probably need a proxy like Crawlera: https://scrapinghub.com/crawlera
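As a rough sketch of what that could look like in a Scrapy project, assuming the `scrapy-crawlera` plugin is installed, the project's `settings.py` might contain something like the following. The API key is a placeholder, and the exact settings should be checked against the plugin's documentation for the version in use.

```python
# settings.py (assumption: the scrapy-crawlera plugin is installed)

# Register the Crawlera downloader middleware so requests go through the proxy.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_crawlera.CrawleraMiddleware": 610,
}

# Turn the middleware on and supply the account's API key.
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = "<your API key>"  # placeholder, not a real key
```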
