Add URLs to Delta Fetch manually

Posted almost 7 years ago by Noah Cinquini

Answered
Noah Cinquini

How can I manually add URLs to Crawlera, so that a preset list of URLs is not crawled?

0 Votes


nestor posted almost 7 years ago Admin Best Answer

You can try writing to the .scrapy folder using: https://support.scrapinghub.com/support/solutions/articles/22000200401-dotscrapy-persistence-addon. But DF doesn't store URLs; it stores a key (identifier) based on the URLs that got items on previous runs.
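
If you do want to seed that store yourself, a rough, untested sketch is below. It assumes scrapy-deltafetch's default Berkeley DB backend (bsddb3) and its default key scheme (request_fingerprint); the database path, spider name and old_urls.txt file are placeholders for your own project.

    # Rough sketch: seed DeltaFetch's .scrapy store with keys for URLs crawled elsewhere.
    # Assumes scrapy-deltafetch's bsddb3 (Berkeley DB) backend and its default key
    # scheme (request_fingerprint); DB_PATH and old_urls.txt are placeholders.
    import time

    from bsddb3 import db
    from scrapy import Request
    from scrapy.utils.request import request_fingerprint

    DB_PATH = ".scrapy/deltafetch/myspider.db"  # <project>/.scrapy/deltafetch/<spider name>.db

    seen = db.DB()
    seen.open(DB_PATH, dbtype=db.DB_HASH, flags=db.DB_CREATE)

    with open("old_urls.txt") as fh:  # one URL per line
        for line in fh:
            url = line.strip()
            if not url:
                continue
            # The same key DeltaFetch would compute for a plain GET request to this URL
            key = request_fingerprint(Request(url)).encode()
            seen.put(key, str(time.time()).encode())

    seen.close()

Another option (also just a sketch) is to set request.meta['deltafetch_key'] on your requests so that the stored keys are plain strings you control instead of computed fingerprints.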

0 Votes


7 Comments


Noah Cinquini posted almost 7 years ago

These URLs come from:

1) Previously crawled jobs BEFORE DF was active

2) Another crawler


Is there a way to manually add these URLs?

0 Votes


nestor posted almost 7 years ago Admin

I renamed and moved the topic to the appropriate section.

Where do those old URLs come from? If they were already crawled in previous jobs of the spider while DeltaFetch was enabled, then they should already be on the list of URLs not to crawl; DF adds them automatically.

0 Votes


Noah Cinquini posted almost 7 years ago

Sorry Nestor, it should read:


Add URLs to Delta Fetch manually.


Currently, we have Delta Fetch added and it's working for any new URL, but I have 250,000 old URLs that I do not want crawled again. I am looking to add these to the list of crawled URLs so they are not crawled again.
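
For context, the two add-ons mentioned in this thread are typically enabled with settings along these lines (just a sketch; the exact priorities and extra options depend on the versions in use):

    # Sketch of typical settings.py entries for scrapy-deltafetch and
    # scrapy-dotpersistence; exact priorities/options depend on your versions.
    SPIDER_MIDDLEWARES = {
        "scrapy_deltafetch.DeltaFetch": 100,
    }
    DELTAFETCH_ENABLED = True

    EXTENSIONS = {
        "scrapy_dotpersistence.DotScrapyPersistence": 0,
    }
    DOTSCRAPY_ENABLED = True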


0 Votes


nestor posted almost 7 years ago Admin

I'm sorry, but I don't quite understand what you want to do. You mention Crawlera first, but there is nothing to set there, because it is a proxy API. The point of DeltaFetch is to not crawl URLs which you've already crawled. Please provide more details about what you want to do, so I can provide assistance.

0 Votes


Noah Cinquini posted almost 7 years ago

DeltaFetch and DotScrapy Persistence are already blocking re-crawls of previously crawled data, so I am surprised I can't do this, wherever that data is being kept.

0 Votes


nestor posted almost 7 years ago Admin

This is not something you set at the Crawlera level, but in your spider. If you use Scrapy, you could manually set dont_proxy in request.meta for those URLs that you don't want to use the proxy for.
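
A minimal sketch of that suggestion (spider name and URL list are hypothetical; dont_proxy is the meta key the scrapy-crawlera middleware checks):

    # Sketch: skip Crawlera for specific requests by setting dont_proxy in meta.
    # Spider name and URLs are placeholders.
    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"
        start_urls = [
            "https://example.com/new-page",
            "https://example.com/already-seen",
        ]
        no_proxy_urls = {"https://example.com/already-seen"}

        def start_requests(self):
            for url in self.start_urls:
                # Requests flagged with dont_proxy bypass the Crawlera middleware
                meta = {"dont_proxy": True} if url in self.no_proxy_urls else {}
                yield scrapy.Request(url, meta=meta, callback=self.parse)

        def parse(self, response):
            self.logger.info("Fetched %s", response.url)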

0 Votes
