
Is there a way to block certain URLs while a spider is running?

I am writing a spider (using CrawlSpider to crawl every link in a domain) to pull certain files from a given domain. I want to block certain URLs where the spider is not finding files. For example, if the spider visits URLs with the /news/ path in them one hundred times and doesn't find a file, I want it to stop looking in /news/.

I have already tried updating the self.rules variable when the spider finds a path that doesn't yield files, but this did not work and it continued crawling URLs with that path.


This is the function I am trying to use to update the rules:

 

def add_block_rule(self, match):
    # Escape the path so it is safe to use inside a regex
    new_rule = f'/(.*{re.escape(match)}.*)'
    if new_rule in self.deny_rules:
        return
    print(f'visited {match} too many times without finding a file')
    self.deny_rules.append(new_rule)
    self.rules = (
        Rule(
            LinkExtractor(
                allow_domains=self.allowed_domains,
                unique=True,
                deny=self.deny_rules,
            ),
            callback='parse_page',
            follow=True,
        ),
    )
    print(self.deny_rules)

I know that this function is being called when certain paths are visited one hundred times without finding a file, but the new rule is not being used. I also know that the regex works, as I tried defining one in __init__ and it blocked the desired path.


I would expect all paths that are visited over one hundred times without finding a file to be blocked and not visited further.
