Cache and Pricing

Posted over 7 years ago by chops

Answered

chops

Hello,

I'm new to web scraping and have read a lot of tutorials, but some questions are still open. I want to scrape about 1M housing offers daily from 7 domains (= sources) and want to use Scrapy Cloud, Splash for screenshots, and Crawlera for this job. Scrapy comes with an HTTP cache middleware, and my questions relate to this cache mechanism and the pricing:

1.) Only a small percentage of the 1M housing offers are new or changed in a daily crawl. With the HTTP cache middleware (RFC2616 policy) enabled, the crawler first checks the ETag or other headers from the server (and then consults the cache or the server for a fresh response). Do these ETag or header-only requests via Crawlera count as full/successful requests towards the quota (for pricing)? Or isn't it necessary to send the ETag / header-only requests via Crawlera?

2.) A screenshot is only necessary if the housing offer page is new or changed. As I understand it, a second request must be sent by Splash via Crawlera. Does this mean that a second request via Crawlera is required? Or do Splash and Scrapy Cloud use the same cache, so that the second request is answered from the cache? Or does Crawlera cache a request for a short time, so that a second request is answered from this cache and isn't counted towards the quota?

Thanks in advance
Christian

0 Votes

nestor posted over 7 years ago Admin Best Answer

1) Yeah, every request routed through Crawlera is counted; however, only successful requests (HTTP status codes 200, 301, 302) are counted towards the monthly quota.

2) If you're requesting the website with Scrapy+Crawlera to check whether the page has changed, and, when it has, you then make a request via Splash that is also routed through Crawlera, then yeah, that would be 2 requests. However, you don't have to route the Splash request for the screenshot through Crawlera.
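
To make the counting rule in 1) concrete, here is a rough sketch that tallies which responses would count towards the quota, taking the rule literally as stated above (only 200, 301 and 302 count). The function name and the sample status list are made up for illustration; this is not part of any Crawlera API, and the real billing logic may differ.

```python
# Status codes that count towards the monthly quota, per the answer above.
BILLABLE_STATUSES = {200, 301, 302}


def count_billable(statuses):
    """Return how many of the given HTTP status codes would count towards the quota."""
    return sum(1 for status in statuses if status in BILLABLE_STATUSES)


# Hypothetical sample of response codes from one crawl run.
sample = [200, 200, 301, 404, 503, 302, 200]
billable = count_billable(sample)
print(f"{billable} of {len(sample)} requests count towards the quota")
```

Failed requests (404, 503 in the sample) are still routed through Crawlera, but under this rule they would not use up quota.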

0 Votes

4 Comments

Rebecca Foley posted about 6 years ago

I am glad to read this article.

0 Votes

nestor posted over 7 years ago Admin

I would suggest Crawlera Enterprise for that number of requests in a month.

0 Votes

chops posted over 7 years ago

Thanks. So I have to calculate: (1.1M [1M housing offers plus 100k following link pages] * 31 [days in a month]) + (probably 100k changed/new housing offers in a month: 100k additional requests for screenshots) = 34.2M requests in a month. Phew!
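
The same estimate as a small script, in case the numbers need to be adjusted later. This is only a sketch of the arithmetic above; the function and parameter names are made up, and the inputs are the rough figures from this thread, not measured values.

```python
def monthly_requests(daily_requests, days, screenshot_requests):
    """Estimate requests per month: daily crawl volume times days, plus screenshot requests."""
    return daily_requests * days + screenshot_requests


# Figures from the post: 1M offers + 100k link pages per day, 31 days,
# and roughly 100k extra screenshot requests per month.
total = monthly_requests(
    daily_requests=1_100_000,
    days=31,
    screenshot_requests=100_000,
)
print(f"{total / 1_000_000:.1f}M requests per month")
```

If the screenshot requests are not routed through Crawlera, as suggested in the answer, the last term can be set to 0.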

0 Votes
