Do file downloads count as separate requests?

nathanwailes

started a topic almost 5 years ago

Background:

I've been hired to scrape ~123,000 text files from a government website.
To get the URL for a file, I need to submit a search request to that same website while providing a unique ID corresponding to the file, and then scrape the URL of the file from the HTML the website sends me.
To accomplish my goal, I signed up for Scrapy Cloud for one month, and I signed up for the 150,000 requests per month plan on Crawlera to avoid getting IP blocked.
My initial idea was to first crawl the target website to create a CSV containing the URLs for the files I want to download, and then to do a separate job that actually downloads the files.
I sent ~123,000 requests to get the URLs and successfully created the CSV.
I now want to download the files corresponding to the URLs I have in the CSV.

Problem:

After researching how to download files with Scrapy Cloud, I realized that the normal way of getting the files would have been to have the files downloaded with the initial job by using a FilesPipeline, rather than getting the file's URL and downloading the URL as two separate jobs.
I'm now close to hitting my 150,000 request per month limit for Crawlera, and I want to know what plan I should sign up for to be able to download the files.
If a file download will count as the same request as the search request (that finds the file URL), I'd prefer to just modify my original job (that retrieved the URLs) and re-run it while having it download the files, because that looks like it will be less work for me.
However, if the search query and the file download will count as two separate requests, that will increase the cost of the plan I'll need to use enough that I might want to try just having a scraper directly download the URLs I have in my CSV file rather than re-querying for the URLs.

Best Answer

peixoto said over 4 years ago

Hello,

1. Both approaches would work OK. There's no silver bullet. Regardless, both approaches will need 2 requests per file.

2. You can subscribe to Crawlera C50, which is one tier higher than your current plan.

3. No. You have to issue one request to find the file URL and then another request to download the file from that URL, so they count as two separate requests.

4. If you have already built the CSV file, there's not really a reason to go through all the search links again. You may provide that file along with the project and use the list of file URLs to get every single file.
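As a rough sketch of point 4, the download job could start by loading the URLs from the CSV you already built. The column name `file_url` and the sample rows are assumptions for illustration, since the actual layout of your CSV isn't shown here. Dropping duplicate URLs is an optional extra so that no quota is spent fetching the same file twice.

```python
import csv
import io


def load_file_urls(csv_source, url_column="file_url"):
    """Return the URLs found in the CSV, de-duplicated, in original order.

    `url_column` is a hypothetical column name; adjust it to your CSV header.
    """
    reader = csv.DictReader(csv_source)
    seen = set()
    urls = []
    for row in reader:
        url = (row.get(url_column) or "").strip()
        # Skip blank cells and URLs that were already collected.
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Made-up sample data standing in for the real CSV file.
sample = io.StringIO(
    "id,file_url\n"
    "1,https://example.gov/a.txt\n"
    "2,https://example.gov/b.txt\n"
    "1,https://example.gov/a.txt\n"
)
urls = load_file_urls(sample)
```

In a real job you would pass an open file handle instead of the in-memory sample, and issue one download request per URL in the returned list.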

The extra download requests also need not mean a direct increase in costs for your project: if you can wait for the next billing cycle, your quota will have been renewed.
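To make the numbers concrete, here is a small back-of-the-envelope calculation using the figures from the question (about 123,000 files, a 150,000 request monthly quota, and about 123,000 search requests already sent). It assumes exactly one request per search and one per download, with no retries or redirects, so treat the results as lower bounds.

```python
FILES = 123_000          # approximate number of files, from the question
MONTHLY_QUOTA = 150_000  # requests per month on the current plan
ALREADY_USED = 123_000   # search requests already sent to build the CSV

# Option A: re-run the original job, searching again and downloading each file.
rerun_requests = 2 * FILES

# Option B: download directly from the URLs already stored in the CSV.
csv_requests = FILES

# Quota left in the current billing cycle.
remaining = MONTHLY_QUOTA - ALREADY_USED

print(f"Re-run search + download: {rerun_requests:,} new requests")
print(f"Download from CSV only:   {csv_requests:,} new requests")
print(f"Left this cycle:          {remaining:,} requests")
print(f"CSV job fits a fresh monthly quota: {csv_requests <= MONTHLY_QUOTA}")
```

Under these assumptions, neither option fits in what is left of the current cycle, but the CSV-only job fits within a single renewed 150,000 request quota, while re-running the search does not.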

Hope this is the info you were looking for.

Thanks!
