
Different page served to Scrapinghub IPs

I'm starting to suspect that some websites shadowban Scrapinghub.

An example is:

https://www.myrecipes.com/recipe/chocolate-cream-martini


If I fetch the page by hand with Scrapy and extract the JSON-LD content, I get one JSON document. If I let Scrapinghub fetch it, I get something else.


Even the simplest item:

yield {
    'url': response.url,
    'body': response.body,
}

shows that the body fetched in the Scrapy shell contains a long JSON-LD block, while the one fetched on Scrapinghub does not.

This selector makes the difference easier to spot:

results = response.css("script[type='application/ld+json']").extract()
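To compare what each environment actually receives, it may help to yield a compact summary of the body rather than the raw bytes. The following is only a minimal sketch using the Python standard library; `JsonLdCollector` and `summarize_body` are hypothetical helper names, not part of Scrapy, and the sketch assumes the page is UTF-8 encoded.

```python
import hashlib
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        # Start buffering when a JSON-LD script element opens.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Store the buffered text when the JSON-LD script element closes.
        if tag == "script" and self._in_json_ld:
            self.blocks.append("".join(self._buffer))
            self._in_json_ld = False


def summarize_body(body: bytes) -> dict:
    """Return size, hash, and parsed JSON-LD blocks for a response body."""
    collector = JsonLdCollector()
    # Assumption: UTF-8; undecodable bytes are replaced rather than raising.
    collector.feed(body.decode("utf-8", errors="replace"))
    collector.close()

    parsed = []
    for raw in collector.blocks:
        try:
            parsed.append(json.loads(raw))
        except ValueError:
            parsed.append(None)  # block was present but not valid JSON

    return {
        "body_length": len(body),
        "body_sha256": hashlib.sha256(body).hexdigest(),
        "json_ld_count": len(parsed),
        "json_ld": parsed,
    }
```

In a spider callback this could be used as `yield {'url': response.url, **summarize_body(response.body)}`. Running it both locally and on Scrapinghub should show at a glance whether the two bodies differ in length, hash, or number of JSON-LD blocks.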


What can I do? I don't think it's a matter of the user agent.


Best Answer

You probably need a proxy like Crawlera: https://scrapinghub.com/crawlera
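If you go that route, the project settings might look roughly like the sketch below. It assumes the scrapy-crawlera plugin is installed (`pip install scrapy-crawlera`) and that you have a Crawlera API key; the key shown is a placeholder, not a real value.

```python
# settings.py (sketch; assumes the scrapy-crawlera plugin is installed)

# Route requests through Crawlera via the plugin's downloader middleware.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_crawlera.CrawleraMiddleware": 610,
}

CRAWLERA_ENABLED = True

# Placeholder: replace with the API key from your own Crawlera account.
CRAWLERA_APIKEY = "<your-api-key>"
```

Whether this changes the page the site serves depends on how the site treats the proxy's IPs, so it is worth re-running the JSON-LD check afterwards.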
