I'm starting to suspect that some websites shadowban Scrapinghub.
An example is
https://www.myrecipes.com/recipe/chocolate-cream-martini
If I fetch the page by hand with Scrapy and extract the JSON-LD content, I get one JSON document. If I let Scrapinghub fetch the same page, I get something else.
The simplest test:
yield {
    'url': response.url,
    'body': response.body,
}
It shows that the body in scrapy shell contains a long JSON-LD block.
This selector should make the issue easier to find:
results = response.css("script[type='application/ld+json']").extract()
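To compare the two environments without dumping the whole body, something like the following might help. It is only a sketch using the standard library: `JsonLdCollector` and `summarize_body` are names made up for this example, not part of Scrapy, and it assumes the body is UTF-8.

```python
import hashlib
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        # Start collecting only when the script is a JSON-LD block.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.blocks.append("".join(self._chunks))
            self._in_json_ld = False


def summarize_body(body: bytes) -> dict:
    """Return a small fingerprint of a response body (hypothetical helper)."""
    parser = JsonLdCollector()
    # Assumption: the page is UTF-8; undecodable bytes are replaced.
    parser.feed(body.decode("utf-8", errors="replace"))
    parser.close()
    return {
        "length": len(body),
        "sha256": hashlib.sha256(body).hexdigest(),
        "json_ld_blocks": len(parser.blocks),
        "json_ld_chars": sum(len(block) for block in parser.blocks),
    }
```

Yielding `summarize_body(response.body)` next to `response.url` from both scrapy shell and the Scrapinghub job would show at a glance whether the two bodies differ and whether the JSON-LD block is missing or just shorter.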
What can I do? I don't think it's a matter of the user agent.
nestor
You probably need a proxy like Crawlera: https://scrapinghub.com/crawlera
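A rough sketch of what that could look like in the project's settings.py, assuming the scrapy-crawlera plugin is installed (pip install scrapy-crawlera). The API key below is a placeholder, and the middleware priority is the value commonly used with the plugin, so check it against the plugin's documentation.

```python
# settings.py (sketch): route requests through Crawlera.
# Assumption: the scrapy-crawlera plugin is installed in the project.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_crawlera.CrawleraMiddleware": 610,
}

CRAWLERA_ENABLED = True
# Placeholder, replace with the key from your own Crawlera account.
CRAWLERA_APIKEY = "<your Crawlera API key>"
```

With this in place the spider code itself should not need to change, which makes it easy to check whether the JSON-LD comes back once requests go through the proxy.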