Catalin Adler
Hello,
I'm looking into the viability of using scrapinghub from the API only.
Is it possible to fully define & configure scrapers using the API only: tell them what to scrape and what not to, how to authenticate, where to drop content, etc.?
Thanks!
Catalin Adler
Anyone?
nestor
Not sure if this is exactly what you're after, but you can schedule jobs with specific settings and arguments: https://doc.scrapinghub.com/api/jobs.html#run-json.
URLs to scrape can be passed as arguments, and other things like authentication can be passed as settings, depending on the middleware(s) you're using. Where to drop content can also be passed as a setting, e.g. via the feed export settings (FEED_URI, FEED_FORMAT).
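For reference, scheduling a run through that endpoint might look roughly like the sketch below. This is only an illustration: the project ID, spider name, the start_urls argument, and the settings shown are placeholders, and which arguments and settings actually take effect depends entirely on how your spider and middlewares are written.

```python
# Rough sketch of scheduling a job via the run.json endpoint linked above.
import json
import requests

APIKEY = "YOUR_SCRAPINGHUB_APIKEY"    # from your account settings

response = requests.post(
    "https://app.scrapinghub.com/api/run.json",
    auth=(APIKEY, ""),                # API key is the basic-auth username
    data={
        "project": 123456,            # placeholder project ID
        "spider": "myspider",         # placeholder spider name
        # extra POST parameters are handed to the spider as arguments
        "start_urls": "https://example.com/catalog",
        # job_settings is a JSON object of Scrapy settings for this run
        "job_settings": json.dumps({
            "FEED_FORMAT": "json",
            "FEED_URI": "s3://my-bucket/%(name)s/%(time)s.json",
        }),
    },
)
print(response.json())                # e.g. {"status": "ok", "jobid": "123456/1/1"}
```

Passing feed export settings in job_settings, as above, is one way to point the output at an external location such as an S3 bucket.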
Catalin Adler
Thanks!
It does look promising, though it would be nice to have explicit APIs for configuring jobs; that would make discovery much easier.
nestor
What kind of configurations are you looking for? Also, please elaborate on what you mean by making discovery easier.
Catalin Adler
Making it easier to discover the features.
I am working on designing an app that needs to scrape configured websites (among many other things). So, I need to be able to programmatically:
1. configure multiple spiders
2. set auth options for each spider
3. set rules for which websites to scrape, which URL patterns to include/exclude, etc.
4. tell it where to securely drop content, or provide a way to securely fetch the output (the output needs to be a file of some sort)
5. tell it what to do with embedded/linked files.
6. and the other usual spidering settings (like a scheduler)
Maybe I'm missing something, but what I would have expected to see is an API like /spiders/ PUT with a JSON body taking the spider configuration.
Then use the job API to request execution of the spider.
Having all of this in the run.json endpoint works, but it's not easy to spot when reading the docs.
Maybe a sample would help.
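As a rough illustration of the kind of sample being asked for here, points 2 and 3 in the list above (per-spider auth and include/exclude rules) can be approximated with a single generic spider whose behaviour is driven entirely by the arguments passed at schedule time (e.g. via run.json). This is only a sketch, not an official Scrapinghub pattern, and the argument names (start_urls, allow, deny, http_user, http_pass) are made up for illustration.

```python
# Minimal sketch of a spider configured entirely through scheduling arguments.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ConfigurableSpider(CrawlSpider):
    name = "configurable"

    def __init__(self, start_urls="", allow="", deny="",
                 http_user=None, http_pass=None, **kwargs):
        # Comma-separated list of URLs to start crawling from
        self.start_urls = start_urls.split(",") if start_urls else []
        # Include/exclude rules built from the URL patterns passed as arguments
        self.rules = (
            Rule(
                LinkExtractor(
                    allow=allow.split(",") if allow else (),
                    deny=deny.split(",") if deny else (),
                ),
                callback="parse_item",
                follow=True,
            ),
        )
        # Picked up by Scrapy's HttpAuthMiddleware for HTTP basic auth
        self.http_user = http_user
        self.http_pass = http_pass
        super().__init__(**kwargs)    # CrawlSpider compiles self.rules here

    def parse_item(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```

Point 4 could then be covered by passing feed export settings (FEED_URI, FEED_FORMAT) in job_settings when scheduling, as in the earlier example.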