I'm new to web scraping and have read a lot of tutorials, but some questions are still open. I want to scrape about 1M housing offers daily from 7 domains (= sources), using Scrapy Cloud, Splash for screenshots, and Crawlera. Scrapy comes with an HTTP cache middleware, and my questions concern this cache mechanism and the pricing:
1.) Only a small percentage of the 1M housing offers are new or changed in a daily crawl. With the cache middleware (RFC2616 policy) enabled, the crawler first checks the ETag or other headers from the server (and then uses either the cache or a fresh response from the server). Do these ETag or header-only requests via Crawlera count as full/successful requests towards the quota (for pricing)? Or isn't it necessary to send the ETag / header-only requests via Crawlera?
2.) A screenshot is only necessary if the housing offer page is new or changed. As I understand it, a second request must be sent by Splash via Crawlera. Does this mean that a second request via Crawlera is required? Or do Splash and Scrapy Cloud share a cache, so that the second request is answered from it? Or does Crawlera cache a request for a short time, so that a second request is answered from this cache and isn't counted towards the quota?
Thanks in advance Christian
Best Answer
nestor
said
almost 6 years ago
1) Yes, every request routed through Crawlera is counted, but only successful requests (HTTP status codes 200, 301, 302) count towards the monthly quota.
2) If you request the website with Scrapy + Crawlera to check whether the page has changed and, if it has, make a second request via Splash that also goes through Crawlera, then yes, that is 2 requests. However, you don't have to route the Splash request for the screenshot through Crawlera.
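As a rough illustration of the two points in this answer, here is a minimal sketch in plain Python. It assumes the status codes listed above are the only ones that count towards the quota, and it uses a content hash to decide whether a page is new or changed (and so needs a screenshot). The names `counts_toward_quota`, `needs_screenshot`, and the `seen` dictionary are hypothetical helpers for illustration, not part of Scrapy, Splash, or Crawlera.

```python
import hashlib

# Status codes the answer lists as counting towards the monthly quota.
BILLABLE_STATUSES = {200, 301, 302}


def counts_toward_quota(status: int) -> bool:
    """Return True if a response with this status would be billed (per the answer)."""
    return status in BILLABLE_STATUSES


def needs_screenshot(url: str, body: bytes, seen: dict) -> bool:
    """Return True if the page is new or its body changed since the last crawl.

    `seen` maps URL -> SHA-256 hex digest of the last body; it is updated in place.
    """
    fingerprint = hashlib.sha256(body).hexdigest()
    changed = seen.get(url) != fingerprint
    seen[url] = fingerprint
    return changed
```

In a real spider, `seen` would have to be persisted between daily runs, and the Splash request would only be issued when `needs_screenshot` returns True.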
chops
said
almost 6 years ago
Thanks. So I have to calculate: (1.1M [1M housing offers plus 100k following link pages] * 31 [days in a month]) + (100k additional screenshot requests [probably 100k changed/new housing offers in a month]) = 34.2M requests in a month. Phew!
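The estimate above can be checked with a short sketch. It assumes every request succeeds and therefore counts towards the quota; the function name and parameters are hypothetical, and the numbers are the ones from this post.

```python
def monthly_requests(
    daily_offers: int = 1_000_000,       # housing offer pages per day
    daily_link_pages: int = 100_000,     # following link pages per day
    days: int = 31,                      # days in a month
    monthly_screenshots: int = 100_000,  # changed/new offers needing a screenshot
    screenshots_via_crawlera: bool = True,
) -> int:
    """Estimate requests per month routed through Crawlera."""
    crawl = (daily_offers + daily_link_pages) * days
    extra = monthly_screenshots if screenshots_via_crawlera else 0
    return crawl + extra


print(monthly_requests())                                # 34200000
print(monthly_requests(screenshots_via_crawlera=False))  # 34100000
```

Sending the Splash screenshot requests directly rather than through Crawlera only saves the 100k screenshot requests, so the daily crawl dominates the total either way.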
nestor
said
almost 6 years ago
I would suggest Crawlera Enterprise for that amount of requests in a month.
Rebecca Foley
I am glad to read this article.