Got memusage_exceeded error with a simple spider

Posted almost 6 years ago by Alfonso Moure

Hi,

I'm trying to discover why this extremely simple spider is returning a memusage_exceeded error:

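import scrapy
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.spiders import CrawlSpider


class RefreshCache(CrawlSpider):
    # Spider's name
    name = "refreshcache"

    # Start URLs
    start_urls = [
        'URL1',
        'URL2'
    ]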

    # Allowed domains for the crawl
    allowed_domains = [domain[len('https://'):] for domain in start_urls]

    def parse(self, response):
        linked_urls = [link.url for link in LxmlLinkExtractor().extract_links(response)]

        # Extract and request valid links
        for url in linked_urls:
            yield scrapy.Request(url)

        # Create item
        yield {'url': response.url, 'download_latency': response.meta['download_latency']}
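
For reference, a minimal standalone sketch of what the allowed_domains line in the spider computes. The URLs here are hypothetical placeholders (the real start URLs are not shown in the post), and the urlparse variant is only an illustrative alternative, not part of the spider. It shows that slicing off 'https://' keeps any path that follows the host, while urlparse keeps only the host:

```python
from urllib.parse import urlparse

# Hypothetical placeholder URLs; the real start URLs are not shown in the post.
start_urls = [
    "https://example.com",
    "https://sub.example.org/some/path",
]

# Slicing, as in the spider: drops the scheme but keeps any path.
sliced = [url[len("https://"):] for url in start_urls]

# Alternative (an assumption, not the spider's code): keep only the host.
hosts = [urlparse(url).netloc for url in start_urls]

print(sliced)
print(hosts)
```

If the start URLs are bare hosts with no path, both approaches should give the same list.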

These are my current settings in place:

CONCURRENT_REQUESTS = 60
CONCURRENT_REQUESTS_PER_DOMAIN = 15
AUTOTHROTTLE_ENABLED = False

When run on Scrapinghub with 1 unit, it only crawls approx. 1000 items before throwing the error.

Any ideas? Thanks!
