How can I manually add URLs to Crawlera, so that a preset list of URLs is not crawled?
0 Votes
nestor posted about 7 years ago Admin Best Answer
You can try writing to the .scrapy folder using: https://support.scrapinghub.com/support/solutions/articles/22000200401-dotscrapy-persistence-addon. But DF doesn't store URLs; it stores a key (identifier) based on the URLs that produced items on previous runs.
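If you need to mark those old URLs as already seen, one possible approach is to pre-seed the store DeltaFetch keeps under .scrapy. This is only a sketch: it assumes the scrapy-deltafetch middleware stores request fingerprints as keys in a Berkeley DB file at .scrapy/deltafetch/<spider_name>.db, and the spider name and URL file below are made up for illustration, so check your installed version's source before relying on it.

# Sketch only: pre-seed the DeltaFetch store so old URLs are treated as seen.
# Assumptions to verify against your installed scrapy-deltafetch version:
#   - the store is a Berkeley DB file at .scrapy/deltafetch/<spider_name>.db
#   - each key is the Scrapy request fingerprint (unless you set
#     request.meta['deltafetch_key'] yourself)
import os
import time

from bsddb3 import db as bdb  # bsddb3 is what older scrapy-deltafetch releases use
from scrapy import Request
from scrapy.utils.request import request_fingerprint  # deprecated in newer Scrapy

SPIDER_NAME = "myspider"        # hypothetical spider name
OLD_URLS_FILE = "old_urls.txt"  # hypothetical file, one URL per line

store_dir = os.path.join(".scrapy", "deltafetch")
os.makedirs(store_dir, exist_ok=True)

store = bdb.DB()
store.open(os.path.join(store_dir, "%s.db" % SPIDER_NAME),
           dbtype=bdb.DB_HASH, flags=bdb.DB_CREATE)

with open(OLD_URLS_FILE) as fh:
    for line in fh:
        url = line.strip()
        if not url:
            continue
        key = request_fingerprint(Request(url)).encode("utf-8")
        # the middleware stores a timestamp string as the value
        store.put(key, str(time.time()).encode("utf-8"))

store.close()

On Scrapy Cloud the pre-seeded file only helps if it ends up in the .scrapy folder that the DotScrapy Persistence addon syncs between jobs.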
0 Votes
7 Comments
Noah Cinquini posted about 7 years ago
1) Previously crawled jobs BEFORE DF was active
2) Another crawler
Is there a way to manually add these URLs?
0 Votes
nestor posted about 7 years ago Admin
I renamed and moved the topic to the appropriate section.
Where do those old URLs come from? If they were crawled in previous jobs of the spider while DeltaFetch was enabled, they are already on the list of URLs not to crawl; DF adds them automatically.
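For reference, a minimal settings sketch of how DeltaFetch is typically enabled together with DotScrapy Persistence; the middleware path and setting names follow the scrapy-deltafetch and scrapy-dotpersistence packages, so verify them against the versions used in your project.

# settings.py - sketch; verify names against the scrapy-deltafetch and
# scrapy-dotpersistence versions in your project
SPIDER_MIDDLEWARES = {
    "scrapy_deltafetch.DeltaFetch": 100,
}
DELTAFETCH_ENABLED = True   # skip requests whose key was recorded on earlier runs
# DELTAFETCH_RESET = True   # uncomment for one run to wipe the stored keys

# DotScrapy Persistence keeps the .scrapy folder (and with it the DeltaFetch
# store) between Scrapy Cloud jobs; it is usually enabled from the project's
# Addons page rather than in settings.py.
EXTENSIONS = {
    "scrapy_dotpersistence.DotScrapyPersistence": 0,
}
DOTSCRAPY_ENABLED = True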
0 Votes
Noah Cinquini posted about 7 years ago
Sorry Nestor, should read:
Add URLs to Delta Fetch manually.
Currently, we have Delta Fetch added and it's working for any new URL, but I have 250,000 old URLs that I do not want crawled again. I am looking to add these to the list of crawled URLs so they are not crawled again.
0 Votes
nestor posted about 7 years ago Admin
I'm sorry, but I don't quite understand what you want to do. You mention Crawlera first, but nothing can be set there because it is a proxy API. The point of DeltaFetch is to not crawl the URLs which you've already crawled. Please provide more details about what you want to do, so I can assist.
0 Votes
Noah Cinquini posted about 7 years ago
DeltaFetch and DotScrapy Persistence are blocking data from being re-crawled, so I am surprised I cannot do it myself, wherever that data is being kept.
0 Votes
nestor posted about 7 years ago Admin
This is not something you set at the Crawlera level, but in your spider. If you use Scrapy, you could manually set dont_proxy in request.meta for the URLs that you don't want to go through the proxy.
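A minimal sketch of that, assuming the project uses the Crawlera middleware (which honours dont_proxy in request.meta); the spider and the URL list below are hypothetical.

import scrapy


class ExampleSpider(scrapy.Spider):
    # hypothetical spider and URL list, purely for illustration
    name = "example"
    start_urls = ["https://example.com/"]

    # preset URLs that should bypass Crawlera (the proxy)
    no_proxy_urls = {
        "https://example.com/already-seen-1",
        "https://example.com/already-seen-2",
    }

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            url = response.urljoin(href)
            # dont_proxy tells the Crawlera middleware to skip the proxy for this request
            meta = {"dont_proxy": True} if url in self.no_proxy_urls else {}
            yield scrapy.Request(url, meta=meta, callback=self.parse)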
0 Votes