Noah Cinquini
said
almost 6 years ago
How can I manually add URLs to Crawlera, so that a preset list of URLs is not crawled?
nestor
said
almost 6 years ago
This is not something you set at the Crawlera level, but in your spider. If you use Scrapy, you can manually set dont_proxy in request.meta for the URLs that you don't want to send through the proxy.
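For illustration, a minimal sketch of what nestor describes, assuming the scrapy-crawlera downloader middleware is enabled on the project; the spider name and URL list below are placeholders:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"  # placeholder spider name

    # Placeholder list of URLs that should be fetched directly, not via Crawlera.
    DIRECT_URLS = [
        "https://example.com/page1",
        "https://example.com/page2",
    ]

    def start_requests(self):
        for url in self.DIRECT_URLS:
            # dont_proxy tells the Crawlera middleware to skip this request,
            # so it is downloaded without going through the proxy.
            yield scrapy.Request(url, meta={"dont_proxy": True})

    def parse(self, response):
        self.logger.info("Fetched %s directly", response.url)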
Noah Cinquini
said
almost 6 years ago
DeltaFetch and DotScrapy Persistence are blocking re-crawled data, so I am surprised I can't do this manually, wherever that data is being kept.
nestor
said
almost 6 years ago
I'm sorry, but I don't quite understand what you want to do. You mention Crawlera first, but there is nothing to set there, because it is a proxy API. The point of DeltaFetch is to not crawl the URLs you've already crawled. Please provide more details about what you want to do, so I can provide assistance.
Noah Cinquini
said
almost 6 years ago
Sorry Nestor, that should read:
Add URLs to DeltaFetch manually.
Currently we have DeltaFetch added, and it is working for any new URL, but I have 250,000 old URLs that I do not want crawled again. I am looking to add these to the list of crawled URLs so they are not crawled again.
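For reference, a sketch of how DeltaFetch is typically enabled in a project's settings, assuming the scrapy-deltafetch package (setting names per its README); project-specific values may differ:

# settings.py
SPIDER_MIDDLEWARES = {
    "scrapy_deltafetch.DeltaFetch": 100,
}
DELTAFETCH_ENABLED = True

With this enabled, requests for pages that already yielded items in an earlier run are skipped on later runs.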
nestor
said
almost 6 years ago
I renamed and moved the topic to the appropriate section.
Where do those old URLs come from? If they were crawled in previous jobs of the spider while DeltaFetch was enabled, then they should already be in the list of URLs not to crawl; DF adds them automatically.
Noah Cinquini
said
almost 6 years ago
1) Jobs crawled before DF was active
2) Another crawler
Is there a way to manually add these URLs?
nestor
said
almost 6 years ago
You can try writing to the .scrapy folder using: https://support.scrapinghub.com/support/solutions/articles/22000200401-dotscrapy-persistence-addon. But DF doesn't store URLs; it stores a key (an identifier based on the URL) for requests that produced items on previous runs.
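A rough sketch of how those old URLs could be pre-seeded, assuming a scrapy-deltafetch version that keeps its keys in a dbm database at .scrapy/deltafetch/<spider name>.db, uses Scrapy's request fingerprint as the default key, and that scrapy.utils.request.request_fingerprint is still available in the installed Scrapy; the storage backend and key format have varied between releases, so check the installed versions before relying on this. The spider name and old_urls.txt file below are placeholders, and on Scrapy Cloud the .scrapy folder only persists across jobs if the DotScrapy Persistence addon linked above is enabled.

import dbm
import os
import time

from scrapy import Request
from scrapy.utils.request import request_fingerprint

SPIDER_NAME = "myspider"  # placeholder: must match the spider's name attribute
DB_PATH = os.path.join(".scrapy", "deltafetch", "%s.db" % SPIDER_NAME)

def seed(urls):
    # Create the deltafetch directory if it doesn't exist yet.
    os.makedirs(os.path.dirname(DB_PATH), exist_ok=True)
    db = dbm.open(DB_PATH, "c")
    try:
        for url in urls:
            # Assumption: DeltaFetch keys requests by fingerprint (or by a
            # custom deltafetch_key meta value); the stored value is only
            # informational, so a timestamp string is used here.
            key = request_fingerprint(Request(url)).encode("utf-8")
            db[key] = str(time.time()).encode("utf-8")
    finally:
        db.close()

if __name__ == "__main__":
    # old_urls.txt: placeholder file with one already-crawled URL per line.
    with open("old_urls.txt") as f:
        seed(line.strip() for line in f if line.strip())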