I'm new to web scraping and have read a lot of tutorials, but some questions are still open. I want to scrape about 1M housing offers daily from 7 domains (= sources) and want to use Scrapy Cloud, Splash for screenshots, and Crawlera for this job. Scrapy comes with an HTTP cache middleware, and my questions relate to this cache mechanism and the pricing:
1.) Only a small percentage of the 1M housing offers are new or changed in a daily crawl. With the cache middleware (RFC2616 policy) enabled, the crawler first checks the ETag or other headers from the server (and then consults the cache or the server for a fresh response). Do these ETag or header-only requests via Crawlera count as full/successful requests towards the quota (for pricing)? Or isn't it necessary to make the ETag/header-only request via Crawlera?
2.) A screenshot is only necessary if the housing offer page is new or changed. As I understand it, a second request must be sent by Splash via Crawlera. Does this mean that a second request via Crawlera is required? Or do Splash and Scrapy Cloud use the same cache, so that the second request is answered from the cache? Or does Crawlera cache a response for a short time, so that a second request is answered from this cache and isn't counted towards the quota?
Thanks in advance
Christian
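For readers following along, the cache setup described in the question corresponds roughly to the Scrapy settings below. This is a minimal sketch, not the poster's actual configuration; the cache directory name is an assumption. With the RFC2616 policy, Scrapy revalidates a stale cached page with a conditional GET (sending `If-None-Match` / `If-Modified-Since`) rather than a separate header-only request, and that only avoids a full download when the site returns validators such as `ETag` or `Last-Modified`.

```python
# settings.py (sketch): enable Scrapy's built-in HTTP cache with RFC 2616 semantics.
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
HTTPCACHE_DIR = "httpcache"  # assumption: any directory name relative to the project data dir works
```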
nestor posted about 7 years ago Admin Best Answer
1) Yes. Every request routed through Crawlera is counted, but only successful requests (HTTP status codes 200, 301, 302) count towards the monthly quota.
2) If you request the website with Scrapy + Crawlera to check whether the page has changed and, if it has, make a second request via Splash that is also routed through Crawlera, then yes, that is 2 requests. However, you don't have to route the Splash request for the screenshot through Crawlera.
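To make both points concrete, here is a small Python sketch. It is only an illustration under stated assumptions: `count_billable` takes the list of status codes above literally (the thread does not say how other codes, such as a 304 from a conditional request, are treated), and `needs_screenshot` is a hypothetical helper that detects new or changed pages by hashing the response body, which is one possible approach and not something Scrapy, Splash, or Crawlera provides under that name.

```python
import hashlib

# Status codes the answer lists as counting towards the monthly quota.
BILLABLE_STATUSES = frozenset({200, 301, 302})


def count_billable(statuses):
    """Count the responses whose status code is in BILLABLE_STATUSES."""
    return sum(1 for status in statuses if status in BILLABLE_STATUSES)


def fingerprint(body: bytes) -> str:
    """Return a SHA-256 hex digest of the page body."""
    return hashlib.sha256(body).hexdigest()


def needs_screenshot(url: str, body: bytes, seen: dict) -> bool:
    """Return True if the page is new or its body changed since the last crawl.

    `seen` maps URL -> last known fingerprint and is updated in place
    (hypothetical in-memory store; a real crawl would persist it).
    """
    digest = fingerprint(body)
    changed = seen.get(url) != digest
    seen[url] = digest
    return changed
```

With this sketch, `count_billable([200, 301, 404, 503, 200])` returns 3, and only pages for which `needs_screenshot` returns `True` would trigger the extra Splash request.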
chops posted about 7 years ago
Thanks. So I have to calculate: (1.1M [1M housing offers plus 100k follow-up link pages] * 31 [days in a month]) + (probably 100k changed/new housing offers in a month: 100k additional requests for screenshots) = 34.2M requests in a month. Phew!
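The same estimate as a quick Python check. The function name and parameters are made up for illustration, and the figure assumes every crawl request succeeds and that the screenshot requests are also routed through Crawlera:

```python
def monthly_requests(offers_per_day, link_pages_per_day, days, screenshots_per_month):
    """Estimate monthly requests: daily crawl volume times days, plus screenshot requests."""
    return (offers_per_day + link_pages_per_day) * days + screenshots_per_month


total = monthly_requests(1_000_000, 100_000, 31, 100_000)
print(f"{total:,}")  # 34,200,000
```

If the Splash screenshot requests bypass Crawlera, as the answer says they can, the last argument becomes 0 and the estimate drops to 34,100,000.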
nestor posted about 7 years ago Admin
I would suggest Crawlera Enterprise for that number of requests per month.
Rebecca Foley posted over 5 years ago
I am glad to read this article.