Catalin Adler posted over 5 years ago
Hello,
I'm looking into the viability of using Scrapinghub from the API only.
Is it possible to fully define & configure scrapers using the API only: tell them what to scrape and what not to, how to authenticate, where to drop content, etc.?
Thanks!
5 Comments
Catalin Adler posted over 5 years ago
Anyone?
nestor posted over 5 years ago Admin Best Answer
Not sure if this is exactly what you're after, but you can schedule jobs with specific settings and arguments: https://doc.scrapinghub.com/api/jobs.html#run-json.
URLs to scrape can be passed as arguments, and other things like authentication can be passed as settings, depending on the middleware(s) you're using. Where to drop content can also be passed as a setting, e.g. via Scrapy's feed export settings (FEED_URI, FEED_FORMAT).
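For example, a minimal sketch using the Python requests library (the project id, spider name and feed URI are placeholders, and the http_user/http_pass arguments assume your spider relies on Scrapy's HttpAuthMiddleware, which picks them up as spider attributes):

import json
import requests

API_KEY = "YOUR_API_KEY"  # Scrapy Cloud API key, sent as the HTTP Basic username

response = requests.post(
    "https://app.scrapinghub.com/api/run.json",
    auth=(API_KEY, ""),
    data={
        "project": 123456,       # placeholder project id
        "spider": "myspider",    # placeholder spider name
        # Extra fields are passed through as spider arguments:
        "start_urls": "https://example.com",
        "http_user": "scraper",  # read by HttpAuthMiddleware (spider attributes)
        "http_pass": "secret",
        # Scrapy settings for this job go in job_settings as a JSON object:
        "job_settings": json.dumps({
            "FEED_URI": "s3://my-bucket/%(name)s/%(time)s.json",  # placeholder bucket
            "FEED_FORMAT": "json",
        }),
    },
)
print(response.json())  # e.g. {"status": "ok", "jobid": "123456/1/1"}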
Catalin Adler posted over 5 years ago
Thanks!
It does look promising, though it would be nice to have explicit APIs for configuring jobs; that would make discovery much easier.
nestor posted over 5 years ago Admin
What kind of configurations are you looking for? Also, please elaborate on what you mean by making discovery easier.
Catalin Adler posted over 5 years ago
I mean making the features easier to discover.
I am working on designing an app that needs to scrape configured websites (among many other things). So, I need to be able to programmatically:
1. Configure multiple spiders.
2. Set auth options for each spider.
3. Set rules for which websites to scrape, which URL patterns to include/exclude, etc.
4. Tell it where to securely drop content, or give me a way to securely fetch the output (the output needs to be a file of some sort).
5. Tell it what to do with embedded/linked files.
6. Set the other usual spidering options (like scheduling).
Maybe I'm missing something, but what I would have expected to see is an API like PUT /spiders/ taking a JSON body with the spider configuration, and then using the Jobs API to request the execution of that spider.
Having this all in the run.json endpoint works, but it's not easy to spot when reading the docs. Maybe a sample would help; a sketch of the kind of request I had in mind follows below.
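For instance, something along these lines (purely illustrative: the endpoint and every field are invented, nothing like this exists in the current API):

import requests

# Hypothetical spider-configuration call. The endpoint, payload fields and
# values are all made up to illustrate the shape of the API I'd expect.
spider_config = {
    "name": "news_site",
    "auth": {"type": "http_basic", "user": "scraper", "pass": "secret"},
    "allowed_domains": ["example.com"],
    "include_url_patterns": ["/articles/.*"],
    "exclude_url_patterns": ["/tag/.*"],
    "output": {"uri": "s3://my-bucket/output/", "format": "json"},
    "linked_files": "download",
    "schedule": {"cron": "0 6 * * *"},
}
requests.put(
    "https://app.scrapinghub.com/api/spiders/news_site",  # invented endpoint
    json=spider_config,
    auth=("YOUR_API_KEY", ""),
)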