Scraping using the API Only

Posted over 5 years ago by Catalin Adler


Hello,

I'm looking into the viability of using scrapinghub from the API only.

Is it possible to fully define & configure scrapers using the API only: tell them what to scrape and what not to, how to authenticate, where to drop content, etc.?

Thanks!


5 Comments


Catalin Adler posted over 5 years ago

Anyone?


nestor posted over 5 years ago · Admin · Best Answer

Not sure if this is exactly what you're after, but you can schedule jobs with specific settings and arguments: https://doc.scrapinghub.com/api/jobs.html#run-json.

URLs to scrape can be passed as arguments, and other things like authentication can be passed as settings, depending on the middleware(s) you're using. Where to drop content can also be passed as a setting, e.g. via the feed exports settings (FEED_URI / FEED_FORMAT).
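For illustration, a minimal sketch of such a call using Python and requests; the project ID, spider name and start_url argument are placeholders, and the exact parameter names and base URL should be checked against the run.json documentation linked above:

# Rough sketch only: scheduling a job via the run.json endpoint.
# The project ID, spider name and start_url value are placeholders;
# see https://doc.scrapinghub.com/api/jobs.html#run-json for the exact API.
import json
import requests

response = requests.post(
    "https://app.scrapinghub.com/api/run.json",   # base URL per the docs
    auth=("APIKEY", ""),                          # API key as the basic-auth username
    data={
        "project": 12345,                         # your project ID
        "spider": "site",                         # spider to run
        "start_url": "https://example.com/",      # extra parameters become spider arguments
        # Job-level settings, e.g. where feed exports should drop the output.
        "job_settings": json.dumps({
            "FEED_URI": "s3://my-bucket/%(name)s/%(time)s.json",
            "FEED_FORMAT": "json",
        }),
    },
)
print(response.json())                            # includes the new job's ID on success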



Catalin Adler posted over 5 years ago

Thanks!

It does look promising, though it would be nice to have explicit APIs for configuring jobs.
That would make discovery much easier.


nestor posted over 5 years ago · Admin

What kind of configurations are you looking for? Also, please elaborate on what you mean by making discovery easier.



Catalin Adler posted over 5 years ago

I mean making it easier to discover the features.

I am working on designing an app that needs to scrape configured websites (among many other things), so I need to be able to do the following programmatically (a rough sketch of how some of these might map onto a spider follows the list):

1. Configure multiple spiders.

2. Set auth options for each spider.

3. Set rules for which website to scrape, which URL patterns to include/exclude, etc.

4. Tell it where to securely drop content, or give me a way to securely get the output (the output needs to be a file of some sort).

5. Tell it what to do with embedded/linked files.

6. Set the other usual spidering settings (like a scheduler).
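For illustration, a minimal sketch of how points 2-4 might map onto a plain Scrapy CrawlSpider; the domain, credentials, URL patterns and S3 bucket are placeholders, and FEED_URI/FEED_FORMAT are the older feed export setting names:

# Sketch only: placeholder domain, credentials and bucket.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SiteSpider(CrawlSpider):
    name = "site"
    allowed_domains = ["example.com"]        # 3. which website to scrape
    start_urls = ["https://example.com/"]

    # 2. HTTP basic auth, handled by Scrapy's built-in HttpAuthMiddleware
    http_user = "user"
    http_pass = "secret"

    # 3. URL patterns to include/exclude
    rules = (
        Rule(LinkExtractor(allow=r"/articles/", deny=r"/login"),
             callback="parse_item"),
    )

    # 4. where to drop content: feed export settings, which can also be
    # overridden per job via job settings
    custom_settings = {
        "FEED_URI": "s3://my-bucket/%(name)s/%(time)s.json",
        "FEED_FORMAT": "json",
    }

    def parse_item(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}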

Maybe I'm missing something, but what I would have expected to see is an API like PUT /spiders/ with a JSON body containing the spider configuration.

Then use the jobs API to request execution of that spider.
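Purely to illustrate that suggestion (no such endpoint exists in the current API; the URL and every field name below are made up), the call might look something like:

# Hypothetical only: the API does not expose a PUT /spiders/ endpoint like this.
# The URL and all field names are invented to illustrate the idea.
import requests

spider_config = {
    "name": "site",
    "start_urls": ["https://example.com/"],
    "include": ["/articles/.*"],
    "exclude": ["/login"],
    "auth": {"type": "basic", "user": "user", "password": "secret"},
    "output": {"format": "json", "uri": "s3://my-bucket/site.json"},
}

requests.put(
    "https://app.scrapinghub.com/api/spiders/site",   # invented endpoint
    json=spider_config,
    auth=("APIKEY", ""),
)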


Having all of this in the run.json endpoint works, but it's not easy to spot when reading the docs.
Maybe a sample would help.

