
Is there a way to block certain URLs while a spider is running?

I am writing a spider (using CrawlSpider to crawl every link in a domain) to pull certain files from a given domain. I want to block certain URLs where the spider is not finding files. For example, if the spider visits URLs with the /news/ path in them one hundred times and doesn't find a file, I want it to stop looking in /news/.

I have already tried updating the self.rules variable when the spider finds a path that doesn't yield files, but this did not work and it continued crawling URLs with that path.


This is the function I am trying to use to update the rules:

 

def add_block_rule(self, match):
    # Escape the path so it is safe to use inside a regex
    new_rule = f'/(.*{re.escape(match)}.*)'
    if new_rule in self.deny_rules:
        return
    print(f'visited {match} too many times without finding a file')
    self.deny_rules.append(new_rule)
    self.rules = (
        Rule(
            LinkExtractor(
                allow_domains=self.allowed_domains,
                unique=True,
                deny=self.deny_rules,
            ),
            callback='parse_page',
            follow=True,
        ),
    )
    print(self.deny_rules)

I know that this function is being called when certain paths are visited one hundred times without finding a file, but the new rule is not being used. I also know that the regex works, as I tried defining one in __init__ and it blocked the desired path.


I would expect all paths that are visited over one hundred times without finding a file to be blocked and not visited further.
