I'm trying to discover why this extremely simple spider is returning a memusage_exceeded error:
import scrapy
from scrapy.spiders import CrawlSpider
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor


class RefreshCache(CrawlSpider):
    # Spider's name
    name = "refreshcache"

    # Start URLs
    start_urls = [
        'URL1',
        'URL2'
    ]

    # Allowed domains for the crawl
    allowed_domains = [domain[len('https://'):] for domain in start_urls]

    def parse(self, response):
        # Extract the links found on the page
        linked_urls = [link.url for link in LxmlLinkExtractor().extract_links(response)]

        # Request every extracted link
        for url in linked_urls:
            yield scrapy.Request(url)

        # Create an item for the current page
        yield {'url': response.url, 'download_latency': response.meta['download_latency']}
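From what I understand, the memusage_exceeded close reason comes from Scrapy's built-in MemoryUsage extension, which is governed by a few settings. The values below are only placeholders to show which knobs I mean, not my actual project configuration:

# settings.py (placeholder values for illustration only)

# The MemoryUsage extension samples the process memory while the crawl runs
MEMUSAGE_ENABLED = True

# Hard limit in MB; the spider is closed with memusage_exceeded once it is reached
# (on Scrapinghub the limit is tied to the unit size, roughly 1 GB per unit)
MEMUSAGE_LIMIT_MB = 950

# Soft threshold that only logs a warning instead of stopping the crawl
MEMUSAGE_WARNING_MB = 800

# How often, in seconds, the memory check runs
MEMUSAGE_CHECK_INTERVAL_SECONDS = 60.0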
With my current settings in place, when running it on Scrapinghub with 1 unit, the spider only crawls approximately 1,000 items before the memusage_exceeded error is thrown.
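Locally I can watch the memory grow with something like the sketch below (it assumes the RefreshCache class above is importable and uses placeholder limits, and the memory extension only works on platforms with the resource module); the memusage/startup and memusage/max entries in the final stats dump show how far memory climbs:

from scrapy.crawler import CrawlerProcess

# Placeholder settings for a local test run; the limit here is deliberately low
process = CrawlerProcess(settings={
    "MEMUSAGE_ENABLED": True,
    "MEMUSAGE_LIMIT_MB": 512,   # hypothetical local cap, not the Scrapinghub value
    "LOG_LEVEL": "INFO",
})

process.crawl(RefreshCache)
process.start()  # memusage/startup and memusage/max appear in the stats logged at the end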
Any ideas? Thanks!