Adding a variable in yield

I have a set of keywords that I run through a search engine one by one to get the list of all search results. I then follow each search result link and extract some information from it. I am able to obtain the results through yield.


How do I add a variable to the yielded item that stores the keyword I searched with, so that I can tell which results belong to which keyword?


1. Have a set of keywords.

2. Need to use them in a search engine one by one.

3. Go through each search result link from the keyword search.

4. Extract the required data from each link.

It works up to here.

 

Now I need to know which items are linked to the keyword that I have used to search.



You could try adding metadata to your request, using something like:

yield scrapy.Request(
    href, callback=self.parse_search, meta={"entry": search_query}
)

and then retrieve it in the parse_search function like this:

def parse_search(self, response):
    entry = response.meta["entry"]

Let me know if this helps you!



Here's my code


import scrapy
from scrapy.selector import Selector


class QuotesSpider(scrapy.Spider):
    name = "Sep_test"
    start_urls = ['https://searchengine.com/']
    key_url = 'key={0}'
    key_words = ['abc','xyz']
    var_txt = 'https://'

    custom_settings = {
        'FEED_EXPORT_ENCODING': 'utf-8'
    }

    #looping through keyword list
    def parse(self, response):
        for word in self.key_words:
            url = response.urljoin(self.key_url.format(word))
            self.log('Visited ' + url)
            yield scrapy.Request(url=url, callback=self.parse_keyword)

    #extracting all search result links href
    def parse_keyword(self, response):
        article_urls = response.xpath('//*[@id="searchTable"]/tr/td/a/@href').extract()

        for article_url in article_urls:
            yield scrapy.Request(url=(self.var_txt + article_url), callback=self.parse_article)

        #pagination
        next_page_url = response.css('a.next::attr(href)').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse_keyword)

    #extracting details from each search result link
    def parse_article(self, response):
        yield {
            'title' : Selector(response=response).xpath('//*[@id="basicinfo"]/div[1]/h1/text()').extract(),
            'name': Selector(response=response).xpath('//*[@id="basicinfo"]/div/div/h2/text()').extract(),
            'date': Selector(response=response).xpath('//*[@id="index_show"]/ul[1]/li[1]/text()').extract(),
            'url' : response.url,
        }


I need to attach the corresponding keyword to each yielded item. Right now I cannot tell which keyword each item came from.

Also, I am not sure where to use the metadata in this code.


Best Answer

You should put the variable 'word' in the meta of the request you yield in your first for loop:

 

def parse(self, response):
    for word in self.key_words:
        url = response.urljoin(self.key_url.format(word))
        self.log('Visited ' + url)
        yield scrapy.Request(url=url, callback=self.parse_keyword, meta={"word": word})

and then pass it on in the next function, parse_keyword:

for article_url in article_urls:
    yield scrapy.Request(url=(self.var_txt + article_url), callback=self.parse_article, meta={'word': response.meta['word']})

and finally add it to the item you yield:

def parse_article(self, response):
    yield {
        'title' : Selector(response=response).xpath('//*[@id="basicinfo"]/div[1]/h1/text()').extract(),
        'name': Selector(response=response).xpath('//*[@id="basicinfo"]/div/div/h2/text()').extract(),
        'date': Selector(response=response).xpath('//*[@id="index_show"]/ul[1]/li[1]/text()').extract(),
        'url' : response.url,
        'word': response.meta['word']
    }

I did not test it, but I assume something like this should work. Let me know!



It worked well. Thank you so much!!

