Noah Cinquini
said
almost 6 years ago
How can I manually add URLs to Crawlera, so that a preset list of URLs is not crawled?
nestor
said
almost 6 years ago
This is not something you set at the Crawlera level, but in your spider. If you use Scrapy, you can manually set dont_proxy in request.meta for the URLs that you don't want to send through the proxy.
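For illustration, a minimal sketch of what nestor describes, assuming the scrapy-crawlera downloader middleware is enabled on the project; the spider name and URL list below are placeholders:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"  # placeholder spider name

    # Placeholder list of URLs that should be fetched directly, not via Crawlera.
    DIRECT_URLS = [
        "https://example.com/page1",
        "https://example.com/page2",
    ]

    def start_requests(self):
        for url in self.DIRECT_URLS:
            # dont_proxy tells the Crawlera middleware to skip this request,
            # so it is downloaded without going through the proxy.
            yield scrapy.Request(url, meta={"dont_proxy": True})

    def parse(self, response):
        self.logger.info("Fetched %s directly", response.url)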
Noah Cinquini
said
almost 6 years ago
DeltaFetch and DotScrapy Persistence are blocking re-crawled data, so I am surprised I can't do this manually, wherever that data is being kept.
nestor
said
almost 6 years ago
I'm sorry, but I don't quite understand what you want to do. You mention Crawlera first, but there is nothing to set there, because it is a proxy API. The point of DeltaFetch is to not crawl the URLs you've already crawled. Please provide more details about what you want to do, so I can provide assistance.
Noah Cinquini
said
almost 6 years ago
Sorry Nestor, that should read:
Add URLs to DeltaFetch manually.
Currently we have DeltaFetch added, and it is working for any new URL, but I have 250,000 old URLs that I do not want crawled again. I am looking to add these to the list of crawled URLs so they are not crawled again.
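For reference, a sketch of how DeltaFetch is typically enabled in a project's settings, assuming the scrapy-deltafetch package (setting names per its README); project-specific values may differ:

# settings.py
SPIDER_MIDDLEWARES = {
    "scrapy_deltafetch.DeltaFetch": 100,
}
DELTAFETCH_ENABLED = True

With this enabled, requests for pages that already yielded items in an earlier run are skipped on later runs.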
nestor
said
almost 6 years ago
I renamed and moved the topic to the appropriate section.
Where do those old URLs come from? If they were crawled in previous jobs of the spider while DeltaFetch was enabled, then they should already be in the list of URLs not to crawl; DF adds them automatically.
Noah Cinquini
said
almost 6 years ago
1) Jobs crawled before DF was active
2) Another crawler
Is there a way to manually add these URLs?
nestor
said
almost 6 years ago
You can try writing to the .scrapy folder using: https://support.scrapinghub.com/support/solutions/articles/22000200401-dotscrapy-persistence-addon. But DF doesn't store URLs; it stores a key (an identifier based on the URL) for requests that produced items on previous runs.
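A rough sketch of how those old URLs could be pre-seeded, assuming a scrapy-deltafetch version that keeps its keys in a dbm database at .scrapy/deltafetch/<spider name>.db, uses Scrapy's request fingerprint as the default key, and that scrapy.utils.request.request_fingerprint is still available in the installed Scrapy; the storage backend and key format have varied between releases, so check the installed versions before relying on this. The spider name and old_urls.txt file below are placeholders, and on Scrapy Cloud the .scrapy folder only persists across jobs if the DotScrapy Persistence addon linked above is enabled.

import dbm
import os
import time

from scrapy import Request
from scrapy.utils.request import request_fingerprint

SPIDER_NAME = "myspider"  # placeholder: must match the spider's name attribute
DB_PATH = os.path.join(".scrapy", "deltafetch", "%s.db" % SPIDER_NAME)

def seed(urls):
    # Create the deltafetch directory if it doesn't exist yet.
    os.makedirs(os.path.dirname(DB_PATH), exist_ok=True)
    db = dbm.open(DB_PATH, "c")
    try:
        for url in urls:
            # Assumption: DeltaFetch keys requests by fingerprint (or by a
            # custom deltafetch_key meta value); the stored value is only
            # informational, so a timestamp string is used here.
            key = request_fingerprint(Request(url)).encode("utf-8")
            db[key] = str(time.time()).encode("utf-8")
    finally:
        db.close()

if __name__ == "__main__":
    # old_urls.txt: placeholder file with one already-crawled URL per line.
    with open("old_urls.txt") as f:
        seed(line.strip() for line in f if line.strip())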