CrawlSpider and Splash

Posted over 7 years ago by Sebastian Pachl


Hi there,


I wrote a normal spider using Splash and your great samples on GitHub (https://github.com/scrapinghub/sample-projects), but I couldn't get a CrawlSpider to work with Splash.


Could someone upload a sample showing how to use Splash with the CrawlSpider class?


As an alternative, I wrote a normal spider that does a job similar to a CrawlSpider, but without the handy LinkExtractor rules. I replaced the rules with a custom-built link extractor, but I'm missing a separate rule that parses only specific links:


# Working manual alternative to a CrawlSpider
# https://stackoverflow.com/questions/15500281/scrapy-crawl-whole-website

from scrapy.spiders import Spider
from scrapy_splash import SplashRequest
from w3lib.http import basic_auth_header
from CrawlSpiderSplashTest.items import CrawlspidersplashtestItem

import re


class MySpider(Spider):
    name = 'reccrawler'
    allowed_domains = ["toscrape.com"]
    start_urls = ["http://quotes.toscrape.com/js/page/2/"]



    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                splash_headers={
                    'Authorization': basic_auth_header(self.settings['APIKEY'], ''),
                },
            )


    def parse(self, response):
        links = response.xpath('//a/@href').extract()

        # Links already seen on this page. This only removes duplicates
        # within one response; repeats across pages are left to Scrapy's
        # duplicate request filter.
        seen_links = set()

        # Pattern a link must match to be followed
        link_pattern = re.compile(r".*/js/.*")

        for link in links:
            # Follow the link only if it matches and was not seen yet on this page
            if link_pattern.match(link) and link not in seen_links:
                seen_links.add(link)

                yield SplashRequest(
                    response.urljoin(link),
                    splash_headers={
                        'Authorization': basic_auth_header(self.settings['APIKEY'], ''),
                    },
                )

        for quote in response.css('div.quote'):
            item = CrawlspidersplashtestItem()
            item["text"] = quote.css('span.text::text').extract_first()
            yield item
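
One possible way to get the "separate rule" behaviour back in a plain Spider is to keep the link rules in a small table and classify each extracted link against it. The following is only a minimal sketch using the standard library: `LinkRule`, `RULES` and `classify_links` are hypothetical names invented for illustration, not part of Scrapy or scrapy-splash, and the patterns are assumptions based on the quotes.toscrape.com URLs above.

```python
import re
from dataclasses import dataclass
from urllib.parse import urljoin


@dataclass(frozen=True)
class LinkRule:
    """Hypothetical stand-in for a CrawlSpider Rule: a pattern plus two flags."""
    pattern: re.Pattern
    follow: bool  # keep crawling from pages that match this rule
    parse: bool   # extract items from pages that match this rule


# Assumed rules: parse only the paginated pages, but follow every /js/ link.
RULES = (
    LinkRule(re.compile(r"/js/page/\d+/"), follow=True, parse=True),
    LinkRule(re.compile(r"/js/"), follow=True, parse=False),
)


def classify_links(base_url, hrefs, rules=RULES):
    """Return (absolute_url, rule) pairs; the first matching rule wins.

    Links that match no rule are dropped, and each URL is returned once.
    """
    seen = set()
    matched = []
    for href in hrefs:
        url = urljoin(base_url, href)
        if url in seen:
            continue
        for rule in rules:
            if rule.pattern.search(url):
                seen.add(url)
                matched.append((url, rule))
                break
    return matched
```

Inside `parse`, the returned pairs could then drive the requests: for example, yield a `SplashRequest` for each URL whose rule has `follow` set, and pick an item-extracting callback only when `rule.parse` is true. How that is wired into the spider is left open here.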

 



3 Comments


Sebastian Pachl posted over 7 years ago

Thanks in advance.

 


Nickolas Verdegem posted almost 6 years ago

I would also appreciate this: a CrawlSpider (with rules) using the Splash and Crawlera combination. I couldn't get it working. An example would help a lot.


Alessandro Eren posted about 3 years ago

Looking forward to an example as well.
