
Got memusage_exceeded error with a simple spider

Hi, 


I'm trying to discover why this extremely simple spider is returning a memusage_exceeded error:


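import scrapy
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.spiders import CrawlSpider


class RefreshCache(CrawlSpider):
    # Spider's name
    name = "refreshcache"

    # Start URLs
    start_urls = [
        'URL1',
        'URL2'
    ]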

    # Allowed domains for the crawl
    allowed_domains = [domain[len('https://'):] for domain in start_urls]

    def parse(self, response):
        linked_urls = [link.url for link in LxmlLinkExtractor().extract_links(response)]

        # Extract and request valid links
        for url in linked_urls:
            yield scrapy.Request(url)

        # Create item
        yield {'url': response.url, 'download_latency': response.meta['download_latency']}
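For clarity, the allowed_domains line only strips the leading 'https://' from each start URL. Here is a minimal standalone sketch of what that expression produces, using hypothetical example.com / example.org URLs as stand-ins for URL1 and URL2 (they are not my real start URLs). The urlparse variant is just shown for comparison and is not what the spider currently does:

```python
from urllib.parse import urlparse

# Hypothetical stand-ins for URL1 and URL2 (not the real start URLs).
start_urls = [
    'https://example.com',
    'https://example.org/some/path',
]

# Same expression as in the spider: drop the leading 'https://'.
# Anything after the host (such as a path) is left in place.
stripped = [url[len('https://'):] for url in start_urls]

# For comparison only: keep just the host part, ignoring any path.
hosts = [urlparse(url).netloc for url in start_urls]

print(stripped)
print(hosts)
```

With bare host URLs both lists are identical; they only differ when a start URL carries a path.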

These are my current settings:

CONCURRENT_REQUESTS = 60
CONCURRENT_REQUESTS_PER_DOMAIN = 15
AUTOTHROTTLE_ENABLED = False

When I run it on Scrapinghub with 1 unit, it only crawls approximately 1,000 items before throwing the error.


Any ideas? Thanks!
