
Scraping using the API Only

Hello,

I'm looking into the viability of using Scrapinghub from the API only.

Is it possible to fully define & configure scrapers using the API only: tell them what to scrape and what not to, how to authenticate, where to drop content, etc.?

Thanks!


Best Answer

Not sure if this is exactly what you're after, but you can schedule jobs with specific settings and arguments: https://doc.scrapinghub.com/api/jobs.html#run-json.

URLs to scrape can be passed as arguments, and other things like authentication can be passed as settings, depending on the middleware(s) you're using. Where to drop content can also be passed as a setting, e.g. via the feed export settings (FEED_URI and FEED_FORMAT).
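As a rough sketch of what such a call could look like from Python (standard library only): the endpoint and the project/spider/job_settings parameter names follow the jobs API doc linked above, while the API key, project ID, spider name, and the start_url spider argument are placeholders you would substitute with your own.

```python
# Sketch: scheduling a job via the run.json endpoint with per-job
# settings and spider arguments. Nothing here hits the network unless
# you uncomment the final urlopen() call.
import base64
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://app.scrapinghub.com/api/run.json"


def build_run_payload(project, spider, job_settings=None, **spider_args):
    """Build the form-encoded body for a run.json request.

    job_settings is a dict of Scrapy settings (sent as a JSON string);
    any extra keyword arguments become spider arguments.
    """
    params = {"project": project, "spider": spider}
    if job_settings:
        params["job_settings"] = json.dumps(job_settings)
    params.update(spider_args)
    return urlencode(params).encode()


if __name__ == "__main__":
    body = build_run_payload(
        12345,                       # placeholder project ID
        "my_spider",                 # placeholder spider name
        job_settings={"FEED_URI": "s3://my-bucket/%(name)s.json"},
        start_url="https://example.com",  # assumes the spider reads this argument
    )
    req = Request(API_URL, data=body)
    # Authenticate with your API key via HTTP Basic auth (key as username):
    req.add_header(
        "Authorization",
        "Basic " + base64.b64encode(b"APIKEY:").decode(),
    )
    # urlopen(req)  # uncomment to actually schedule the job
```

The spider itself still has to read `start_url` (or whatever arguments you choose) in its own code; the API only delivers them.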


Anyone?


Thanks!

It does look promising, though it would be nice to have explicit APIs to configure jobs.
Would make discovery much easier.

What kind of configuration are you looking for? Also, please elaborate on what you mean by making discovery easier.

I mean making the features easier to discover.

I am working on designing an app that needs to scrape configured websites (among many other things). So, I need to be able to programmatically:

1. configure multiple spiders

2. set auth options for each spider

3. set rules for which websites to scrape, which URL patterns to include/exclude, etc.

4. tell it where to securely drop content, or give me a way to securely fetch the output (the output needs to be a file of some sort)

5. tell it what to do with embedded/linked files.

6. set the other usual spidering options (like a schedule)
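From the accepted answer, most of these look expressible as job settings and spider arguments on a single run.json call. A sketch of how I'd map them: FEED_URI and FEED_FORMAT are real Scrapy settings, but the spider arguments (start_urls, allow, deny, http_user, http_pass) are names I'm inventing here, so the spider code would have to read them and build its crawl rules from them.

```python
import json


def job_params(project, spider, *, start_urls, allow, deny,
               http_user=None, http_pass=None, feed_uri=None):
    """Translate a per-site scraping config into run.json parameters.

    The spider argument names are illustrative: the spider itself must
    read them (e.g. in __init__) and honour them.
    """
    settings = {}
    if feed_uri:
        # 4. where to drop content: Scrapy's feed exports, e.g. an S3 URI
        settings["FEED_URI"] = feed_uri
        settings["FEED_FORMAT"] = "jsonlines"
    params = {
        "project": project,
        "spider": spider,
        "job_settings": json.dumps(settings),
        # 3. which sites / URL patterns to crawl
        "start_urls": ",".join(start_urls),
        "allow": allow,
        "deny": deny,
    }
    if http_user:
        # 2. auth options, assuming the spider forwards these to
        # Scrapy's HttpAuthMiddleware via its http_user/http_pass attributes
        params["http_user"] = http_user
        params["http_pass"] = http_pass
    return params
```

Point 1 then becomes one such call per spider, and point 6 is whatever scheduler you run yourself, hitting the endpoint on a timer.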

Maybe I'm missing something, but what I would have expected is an endpoint like PUT /spiders/ that takes the spider configuration as a JSON body.

Then I would use the jobs API to request execution of the spider.
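For illustration, something like this (every field name here is invented, not part of any real API):

```json
{
  "name": "news-site",
  "start_urls": ["https://example.com"],
  "include_patterns": ["/articles/.*"],
  "exclude_patterns": ["/login"],
  "auth": {"type": "basic", "user": "…", "password_secret": "…"},
  "output": {"format": "jsonlines", "uri": "s3://my-bucket/news/"},
  "linked_files": {"download": true},
  "schedule": "0 * * * *"
}
```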


Having all of this in the run.json endpoint works, but it's not easy to spot when reading the docs.
Maybe a sample would help.
