I'm trying to discover why this extremely simple spider is returning a memusage_exceeded error:
import scrapy
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.spiders import CrawlSpider


class RefreshCache(CrawlSpider):
    # Spider's name
    name = "refreshcache"

    # Start URLs
    start_urls = [
        'URL1',
        'URL2'
    ]

    # Allowed domains for the crawl
    allowed_domains = [domain[len('https://'):] for domain in start_urls]

    def parse(self, response):
        # Extract and request valid links
        linked_urls = [link.url for link in LxmlLinkExtractor().extract_links(response)]
        for url in linked_urls:
            yield scrapy.Request(url)
        # Create item
        yield {'url': response.url, 'download_latency': response.meta['download_latency']}
These are my current settings in place:
When running it on Scrapinghub with 1 unit, it only crawls approximately 1,000 items before throwing the error.
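For context, the memusage_exceeded close reason comes from Scrapy's built-in MemoryUsage extension, which is controlled by settings like the ones below. This is only an illustrative sketch of those settings; the values are placeholders, not my actual configuration:

# Illustrative sketch of the Scrapy settings behind the MemoryUsage extension.
# Placeholder values only, not my real settings.py.
MEMUSAGE_ENABLED = True                  # enable the memory watchdog
MEMUSAGE_LIMIT_MB = 512                  # hard limit; the crawl closes with memusage_exceeded above this
MEMUSAGE_WARNING_MB = 400                # soft limit; only logs a warning
MEMUSAGE_CHECK_INTERVAL_SECONDS = 60.0   # how often memory usage is sampled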
Any ideas? Thanks!