Hey,
We must use Crawlera for this site because otherwise our requests are blocked, but it goes very slowly: we average 1 request per minute and 0.4 items per minute, which is too slow.

def parse(self, response):
    # Follow each article link to the content parser.
    for href in response.css('...::attr(href)'):
        yield response.follow(self.base_url + href.extract(), self.parse_content)
    # Follow the pagination links back into parse.
    for href in response.css('...::attr(href)'):
        yield response.follow(href, self.parse)

The code is pretty straightforward; let's say it's a bunch of articles with pagination.
Also, on every article we need to visit a redirection link to get the final URL.
The idea is to run this every day, but right now the spider has not finished after 7 days. What can we do? And why is it so slow? Are we missing something?
All suggestions are appreciated.
Best regards, Joacim
0 Votes
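About the redirection-link step: Scrapy's built-in RedirectMiddleware follows 3xx responses by default, so by the time a callback runs, response.url is already the final URL. A minimal sketch under that assumption; the spider name, item field, and the elided selectors are illustrative, not the poster's actual code:

import scrapy

class ArticlesSpider(scrapy.Spider):
    name = 'articles'  # hypothetical name
    start_urls = ['https://www.example.com/']

    def parse(self, response):
        # Follow article links (selector elided, as in the original post).
        for href in response.css('...::attr(href)'):
            yield response.follow(href, self.parse_content)

    def parse_content(self, response):
        # The redirection link found on the article page (selector elided).
        redirect_href = response.css('...::attr(href)').get()
        if redirect_href:
            yield response.follow(redirect_href, self.parse_final_url)

    def parse_final_url(self, response):
        # RedirectMiddleware has already followed any 3xx hops,
        # so response.url is the final, resolved URL.
        yield {'final_url': response.url}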
nestor posted almost 7 years ago Admin Best Answer
You should disable Autothrottle from the settings when using Crawlera, and then restart your job for the change in settings to take effect; Autothrottle comes enabled by default in Scrapy Cloud.
About the 3xxs: you're requesting URLs such as https://example.com/ and they are being redirected to https://www.example.com/, so you might want to revise the URL scheme.
1 Votes
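For a local project, the equivalent of this answer is a one-line settings change, and the 3xx point amounts to starting from the www form of the URL. A sketch, with example.com standing in for the real site:

# settings.py
# AutoThrottle deliberately paces requests down; Crawlera already manages
# throttling, so turn it off when the two are combined.
AUTOTHROTTLE_ENABLED = False

# In the spider: request the www form directly to avoid a 301 hop per request.
start_urls = ['https://www.example.com/']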
9 Comments
Aaron Cowper posted about 6 years ago
Hi, I'm having the same issue with Crawlera: it's only averaging around 4 items/min for a scrape of ~4000 items, despite having the C100 plan with CONCURRENT_REQUESTS set to 50 and Autothrottle turned off.
Could you let me know how this can be sped up?
0 Votes
nestor posted almost 7 years ago Admin
Try increasing CONCURRENT_REQUESTS (it can be added in the UI settings too) to something within your plan's limits; C50 plans can have up to 50. But if you put the maximum in your Scrapy spider, you won't be able to use Crawlera simultaneously in other applications, as you will receive 429 errors for exceeding the connection limit.
0 Votes
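In a local project, these knobs live in settings.py. A sketch, assuming the scrapy-crawlera downloader middleware and a C50-class plan; the API key is a placeholder:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<API_KEY>'  # placeholder, not a real key

# Match concurrency to the plan limit (e.g. up to 50 on a C50 plan), leaving
# headroom if other applications share the same Crawlera account.
CONCURRENT_REQUESTS = 50
CONCURRENT_REQUESTS_PER_DOMAIN = 50

# Crawlera paces requests itself, so AutoThrottle should stay disabled.
AUTOTHROTTLE_ENABLED = False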
Ostapp posted almost 7 years ago
Hi Nestor, how do I disable Autothrottle in Scrapy Cloud settings?
0 Votes
nestor posted almost 7 years ago Admin
On your project settings page in the UI: https://app.scrapinghub.com/p/projectid/job-settings/standard. Just select it from the dropdown, make sure it is unchecked, and then click Save.
0 Votes
Ostapp posted almost 7 years ago
Thank you Nestor! It worked!
However, I am still facing the same issue: Crawlera is very slow. I don't upload the spider to Scrapy Cloud; it uses Selenium + Polipo, and it's not trivial to build a correct Docker image of it for Scrapy Cloud, so I run it locally. I disabled Autothrottle in my project settings, but I am still getting about 10 items per minute on average with Crawlera and around 150 items per minute without it. Can you please advise whether there is any way to speed Crawlera up? Maybe I should purchase some extra package?
P.S. I am crawling this website: https://www.jameda.de/
0 Votes
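One way to sanity-check raw Crawlera throughput outside the Selenium + Polipo stack is to call it directly as an HTTP proxy. A minimal sketch, assuming the standard proxy.crawlera.com:8010 endpoint with the API key as the proxy username (placeholder below); certificate verification is disabled here because Crawlera re-signs HTTPS traffic:

import time
import requests

# Crawlera acts as a regular HTTP proxy: API key as username, empty password.
CRAWLERA_PROXY = 'http://<API_KEY>:@proxy.crawlera.com:8010'
proxies = {'http': CRAWLERA_PROXY, 'https': CRAWLERA_PROXY}

# Time a few sequential requests to estimate per-request latency.
start = time.time()
for _ in range(5):
    r = requests.get('https://www.jameda.de/', proxies=proxies, verify=False)
    print(r.status_code)
print('avg seconds per request:', (time.time() - start) / 5)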
nestor posted almost 7 years ago Admin
You're welcome :)
0 Votes
joacimgunnarsson posted almost 7 years ago
Thanks, nestor!
Disabling Autothrottle seems to have sped things up.
0 Votes
joacimgunnarsson posted almost 7 years ago
Also, we get a bunch of 301 and 5xx responses. We don't understand why.
0 Votes
nestor posted almost 7 years ago Admin Answer
You should disable Autothrottle from the settings when using Crawlera, and then restart your job for the change in settings to take effect; Autothrottle comes enabled by default in Scrapy Cloud.
About the 3xxs: you're requesting URLs such as https://example.com/ and they are being redirected to https://www.example.com/, so you might want to revise the URL scheme.
1 Votes
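To confirm which hops are producing the 301s, RedirectMiddleware records the chain it followed in the request meta. A small sketch of logging it from a callback (the callback name is illustrative):

def parse_content(self, response):
    # 'redirect_urls' is populated by RedirectMiddleware whenever 3xx hops occurred.
    chain = response.request.meta.get('redirect_urls', [])
    if chain:
        self.logger.info('Redirected: %s -> %s', chain[0], response.url)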