Recrawling a page

I'm trying to test my spiders locally, but I can't seem to recrawl the same page within a certain window of time. My project uses Frontera for scheduling, and when I try to recrawl a page, the slots in the frontier show 0 requests. The producer spider's logs distinguish between new_links and total_links, and the number of requests in the frontier slots matches the new_links count for that particular crawl.
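
To make the observation concrete, here is a toy model of the behavior I think I'm seeing. It is only a sketch of my guess, not Frontera's actual code: `ToyFrontier`, `add_links`, and the SHA-1 URL fingerprint are all hypothetical names and choices made for illustration.

```python
import hashlib


class ToyFrontier:
    """Toy model (assumption, not Frontera): remembers every URL it was ever sent."""

    def __init__(self):
        self.seen = set()   # fingerprints of links already scheduled once
        self.queue = []     # links waiting in the slots to be crawled

    def add_links(self, urls):
        """Queue only URLs not seen before; return (new_links, total_links)."""
        new_links = 0
        for url in urls:
            # Hypothetical fingerprint: SHA-1 of the raw URL string.
            fingerprint = hashlib.sha1(url.encode("utf-8")).hexdigest()
            if fingerprint not in self.seen:
                self.seen.add(fingerprint)
                self.queue.append(url)
                new_links += 1
        return new_links, len(urls)


frontier = ToyFrontier()
urls = ["http://example.com/a", "http://example.com/b"]
first = frontier.add_links(urls)    # first crawl: every link is new
second = frontier.add_links(urls)   # recrawl: nothing is new, nothing is queued
```

In this model the second call reports zero new links even though total_links is unchanged, which is the pattern that shows up in my logs and slots.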


My (perhaps overlapping) questions are:

  1. Is there a time window during which Frontera will not allow a link previously sent to it to be recrawled? If so, how long is that period (e.g. 24 hours), and is there a way to change it in the settings?
  2. Is the default behavior really that Frontera will only crawl links it identifies as new? Can this be overridden?
  3. What is Frontera's definition of "new_links"?
