We have to use Crawlera for one site because our requests are blocked otherwise, but the crawl is very slow. We average 1 request per minute and 0.4 items per minute, which is too slow.
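    def parse(self, response):
        # Follow each article link and parse its content.
        for href in response.css('...::attr(href)'):
            yield response.follow(self.base_url+href.extract(), self.parse_content)
        # Follow the pagination links and parse the next listing page.
        for href in response.css('...::attr(href)'):
            yield response.follow(href, self.parse)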
The code is pretty straightforward: let's say it's a bunch of articles with pagination.
Also, on every article we need to visit a redirection link to get the final URL.
The idea is to run this every day, but right now the spider has not finished after 7 days. What can we do? And why is it so slow? Are we missing something?
All suggestions are appreciated.
Best regards, Joacim
Best Answer
nestor said over 5 years ago:
You should disable Autothrottle in the settings when using Crawlera, and then restart your job for the settings change to take effect. Autothrottle is enabled by default in Scrapy Cloud.
joacimgunnarsson
Also, we get a bunch of 301s and 5xx responses. I don't understand why.
nestor
You should disable Autothrottle in the settings when using Crawlera, and then restart your job for the settings change to take effect. Autothrottle is enabled by default in Scrapy Cloud.
As for the 3xx responses: you're requesting URLs such as https://example.com/, which are redirected to https://www.example.com/, so you might want to revise the URLs you request.
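One way to avoid the extra redirect hop described above is to rewrite the host before the request is made. The following is a minimal sketch, not the poster's code: the helper name `to_canonical_host` and its default hosts (taken from the example.com placeholders in the reply) are assumptions for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def to_canonical_host(url, bare_host="example.com", canonical_host="www.example.com"):
    """Return url with bare_host swapped for canonical_host.

    Hypothetical helper: the host names are placeholders and must be
    replaced with the hosts the target site actually redirects between.
    URLs on any other host are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.netloc == bare_host:
        # Only the host changes; scheme, path, query and fragment are kept.
        parts = parts._replace(netloc=canonical_host)
    return urlunsplit(parts)
```

In a spider, the result could be passed to `response.follow(...)` in place of the raw link, so that each article costs one request instead of a 301 followed by a second request.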
joacimgunnarsson
Thanks nestor!
Disabling Autothrottle seems to have sped things up.
nestor
You're welcome :)
Ostapp
nestor
In your project settings in the UI: https://app.scrapinghub.com/p/projectid/job-settings/standard. Select the setting from the dropdown, make sure it is unchecked, and then click Save.
Ostapp
Thank you Nestor! It worked!
However, I am still facing the same issue: Crawlera is very slow. I don't upload the spider to Scrapy Cloud. My spider uses Selenium + Polipo, and it's not trivial to build a correct Docker image of it for upload to Scrapy Cloud, so I run it locally. I disabled Autothrottle in my project settings, but I am still getting 10 items per minute on average with Crawlera, versus around 150 items per minute without it. Can you please advise whether there is any way to speed Crawlera up? Maybe I should purchase an extra package?
P.S. I am crawling this website: https://www.jameda.de/
nestor
Try increasing CONCURRENT_REQUESTS (it can be added in the UI settings too) to something within your plan's limits; C50 plans can have up to 50. But if you use the maximum in your Scrapy spider, you won't be able to use Crawlera from other applications at the same time, as you will receive a 429 error for exceeding the connection limit.
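For a project configured through a local settings.py rather than the UI, the advice in this thread might look like the sketch below. The numbers are illustrative assumptions, not values from the thread: the two lowercase helper variables are hypothetical (Scrapy only reads uppercase names from a settings module), and the headroom of 10 connections for other applications is arbitrary. Setting CONCURRENT_REQUESTS_PER_DOMAIN is an addition not mentioned above; it is included because Scrapy's default for it is 8, which would otherwise cap a single-site crawl regardless of CONCURRENT_REQUESTS.

```python
# settings.py (sketch): illustrative values, adjust to your own plan.

# Stop Scrapy from adding its own adaptive delays on top of the proxy.
AUTOTHROTTLE_ENABLED = False

# Hypothetical helper values (lowercase, so Scrapy ignores them as settings).
plan_connection_limit = 50    # assumption: a plan allowing up to 50 connections
reserved_for_other_apps = 10  # assumption: headroom to avoid 429s elsewhere

# Stay below the plan limit so other applications can still connect.
CONCURRENT_REQUESTS = plan_connection_limit - reserved_for_other_apps

# With a single target site, the per-domain cap (Scrapy default: 8)
# would otherwise limit concurrency well below CONCURRENT_REQUESTS.
CONCURRENT_REQUESTS_PER_DOMAIN = CONCURRENT_REQUESTS
```

Whether this is enough depends on the target site and the plan, so it is worth comparing the requests-per-minute figure in the job stats before and after the change.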
Aaron Cowper
Hi, I'm having the same issue with Crawlera: it's only averaging around 4 items/min for a scrape of ~4000 items, despite having the C100 plan with CONCURRENT_REQUESTS set to 50 and Autothrottle turned off.
Could you let me know how this can be sped up?