Could someone upload a sample showing how to use Splash with the CrawlSpider class?
Alternatively, I wrote a normal spider that does a similar job to a CrawlSpider, but the handy LinkExtractor rules are missing. I replaced them with a custom-built link extractor, but I am missing a separate rule to parse only specific links:
# Working manual replacement for a CrawlSpider
# https://stackoverflow.com/questions/15500281/scrapy-crawl-whole-website
import re

from scrapy.spiders import Spider
from scrapy_splash import SplashRequest
from w3lib.http import basic_auth_header

from CrawlSpiderSplashTest.items import CrawlspidersplashtestItem


class MySpider(Spider):
    name = 'reccrawler'
    allowed_domains = ["toscrape.com"]
    start_urls = ["http://quotes.toscrape.com/js/page/2/"]

    def start_requests(self):
        # Render every start URL through Splash, authenticating with the API key
        for url in self.start_urls:
            yield SplashRequest(
                url,
                splash_headers={
                    'Authorization': basic_auth_header(self.settings['APIKEY'], ''),
                },
            )

    def parse(self, response):
        links = response.xpath('//a/@href').extract()
        # Links already yielded from this page. The list is rebuilt on every
        # call, so it only de-duplicates within a single response.
        crawledLinks = []
        # Pattern a link must match to be followed
        linkPattern = re.compile(r".*/js/.*")
        for link in links:
            # If the link matches and has not been yielded yet, request it
            if linkPattern.match(link) and link not in crawledLinks:
                crawledLinks.append(link)
                yield SplashRequest(
                    response.urljoin(link),
                    splash_headers={
                        'Authorization': basic_auth_header(self.settings['APIKEY'], ''),
                    },
                )
        # Extract one item per quote on the rendered page
        for quote in response.css('div.quote'):
            item = CrawlspidersplashtestItem()
            item["text"] = quote.css('span.text::text').extract_first()
            yield item
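
To make clearer what I mean by a separate rule, here is a minimal, library-free sketch of the behaviour I am after: a small rule table that decides which links get requested and which callback, if any, parses them. The names LinkRule, RULES, match_rule, plan_links and parse_quotes are hypothetical, made up for this illustration and not part of Scrapy or scrapy-splash; the two patterns are only assumed examples for the quotes site.

```python
import re
from typing import NamedTuple, Optional, Pattern


class LinkRule(NamedTuple):
    """Hypothetical stand-in for a crawl rule: a URL pattern plus the name of
    the callback that should parse matching pages (None = only keep crawling)."""
    pattern: Pattern[str]
    callback: Optional[str]


# Assumed example rules: request every /js/ link, but only parse pagination pages.
RULES = [
    LinkRule(re.compile(r"/js/page/\d+/?$"), callback="parse_quotes"),
    LinkRule(re.compile(r"/js/"), callback=None),
]


def match_rule(url, rules=RULES):
    """Return the first rule whose pattern is found in the URL, or None."""
    for rule in rules:
        if rule.pattern.search(url):
            return rule
    return None


def plan_links(links, seen):
    """Yield (url, callback_name) for links that match a rule and are new."""
    for url in links:
        rule = match_rule(url)
        if rule is None or url in seen:
            continue
        seen.add(url)
        yield url, rule.callback
```

Inside parse, each (url, callback_name) pair could then be turned into a SplashRequest for response.urljoin(url), with the callback looked up on the spider when a name is given. Keeping seen as an attribute of the spider instead of a local variable would make the check hold across pages.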
Sebastian Pachl
Hi there,
I coded a normal spider using Splash and your great samples on GitHub (https://github.com/scrapinghub/sample-projects), but I couldn't get a CrawlSpider to work with Splash.