This guide shows how to create your first Scrapy spider. For a more in-depth walkthrough, we strongly recommend reading the Scrapy tutorial as well.
It assumes Scrapy is already installed; if it is not, please refer to the Scrapy installation guide.
For this example, we will build a spider to scrape famous quotes from this website: http://quotes.toscrape.com/
We begin by creating a Scrapy project, which we will call quotes_crawler:
$ scrapy startproject quotes_crawler
Then we create a spider for quotes.toscrape.com:
$ scrapy genspider quotes-toscrape quotes.toscrape.com
Created spider 'quotes-toscrape' using template 'basic' in module:
  quotes_crawler.spiders.quotes_toscrape
Then we edit the spider:
$ scrapy edit quotes-toscrape
Here is the code:
import scrapy


class QuotesToScrapeSpider(scrapy.Spider):
    name = "quotes-toscrape"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Each matching div.quote element yields one item.
        for quote in response.css("div.quote"):
            yield {
                'text': quote.css("span.text ::text").extract_first(),
                'author': quote.css("small.author ::text").extract_first(),
                'tags': quote.css("div.tags > a.tag ::text").extract(),
            }

        # Follow the "next" link, if present. With no explicit callback,
        # the response is handled by parse() again.
        next_page_url = response.css("nav > ul > li.next > a ::attr(href)").extract_first()
        if next_page_url:
            yield scrapy.Request(response.urljoin(next_page_url))
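The pagination step in parse() depends on response.urljoin to turn the href of the "next" link into an absolute URL before requesting it. The following is a minimal sketch of that resolution step using only the standard library's urllib.parse.urljoin, not Scrapy itself. The helper name resolve_next_page and the example href values are assumptions made for illustration; the real links on the site may differ.

```python
from urllib.parse import urljoin


def resolve_next_page(current_url, href):
    """Return an absolute URL for a pagination link, or None when there is no link.

    Hypothetical helper: it mirrors the `if next_page_url:` check in the spider
    and resolves the href against the URL of the page it was found on.
    """
    if not href:
        return None
    return urljoin(current_url, href)


# Assumed href values, for illustration only.
print(resolve_next_page("http://quotes.toscrape.com/", "/page/2/"))
print(resolve_next_page("http://quotes.toscrape.com/page/2/", "/page/3/"))
print(resolve_next_page("http://quotes.toscrape.com/page/10/", None))
```

An href that starts with a slash is resolved against the site root, while one without a leading slash is resolved against the current page's path. This is why the spider joins the link with the response URL instead of concatenating strings by hand.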
For more information about Scrapy please refer to the Scrapy documentation.