Answered

Adding a variable in yield

I have a set of keywords that I run through a search engine one by one to get the list of all search results. After that, I go through each search result and extract some information from each link. I am able to obtain results through yield.


How do I add a variable to the yield to store the keyword I used for the search, so that I can tell which results are linked to which keyword?


1. Have a set of keywords.

2. Use them in a search engine one by one.

3. Go through each search result link from the keyword search.

4. Extract the required data from each link.

Everything works up to this point.

Now I need to know which items are linked to the keyword I used for the search.


Best Answer

You should put the variable 'word' in the meta of the request in your first for loop:

 

def parse(self, response):
    for word in self.key_words:
        url = response.urljoin(self.key_url.format(word))
        self.log('Visited ' + url)
        yield scrapy.Request(url=url, callback=self.parse_keyword, meta={"word": word})

and then pass it along in the next function:

 

for article_url in article_urls:
    yield scrapy.Request(url=(self.var_txt + article_url), callback=self.parse_article,
                         meta={'word': response.meta['word']})

and finally add it to your yielded item:

def parse_article(self, response):
    yield {
        'title': response.xpath('//*[@id="basicinfo"]/div[1]/h1/text()').extract(),
        'name': response.xpath('//*[@id="basicinfo"]/div/div/h2/text()').extract(),
        'date': response.xpath('//*[@id="index_show"]/ul[1]/li[1]/text()').extract(),
        'url': response.url,
        'word': response.meta['word'],
    }

I did not test it, but I assume something like this should work. Let me know!
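As a side note, in case it helps: on Scrapy 1.7 or newer, cb_kwargs is the documented alternative to meta for passing values into callbacks, and the value arrives as a plain keyword argument. A minimal sketch of the same idea (untested, reusing the attribute names from your spider):

def parse(self, response):
    for word in self.key_words:
        url = response.urljoin(self.key_url.format(word))
        # cb_kwargs hands 'word' straight to the callback as an argument
        yield scrapy.Request(url=url, callback=self.parse_keyword, cb_kwargs={"word": word})

def parse_keyword(self, response, word):
    # 'word' is now a normal parameter; forward it the same way
    for article_url in response.xpath('//*[@id="searchTable"]/tr/td/a/@href').extract():
        yield scrapy.Request(url=self.var_txt + article_url, callback=self.parse_article,
                             cb_kwargs={"word": word})

def parse_article(self, response, word):
    yield {'url': response.url, 'word': word}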


It worked well. Thank you so much!!


1 person likes this

Here's my code:


import scrapy


class QuotesSpider(scrapy.Spider):

    name = "Sep_test"
    start_urls = ['https://searchengine.com/']
    key_url = 'key={0}'
    key_words = ['abc', 'xyz']
    var_txt = 'https://'

    custom_settings = {
        'FEED_EXPORT_ENCODING': 'utf-8'
    }

    # looping through the keyword list
    def parse(self, response):
        for word in self.key_words:
            url = response.urljoin(self.key_url.format(word))
            self.log('Visited ' + url)
            yield scrapy.Request(url=url, callback=self.parse_keyword)

    # extracting the href of every search result link
    def parse_keyword(self, response):
        article_urls = response.xpath('//*[@id="searchTable"]/tr/td/a/@href').extract()

        for article_url in article_urls:
            yield scrapy.Request(url=(self.var_txt + article_url), callback=self.parse_article)

        # pagination
        next_page_url = response.css('a.next::attr(href)').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse_keyword)

    # extracting details from each search result link
    def parse_article(self, response):
        yield {
            'title': response.xpath('//*[@id="basicinfo"]/div[1]/h1/text()').extract(),
            'name': response.xpath('//*[@id="basicinfo"]/div/div/h2/text()').extract(),
            'date': response.xpath('//*[@id="index_show"]/ul[1]/li[1]/text()').extract(),
            'url': response.url,
        }


I need to attach the corresponding keyword to each yielded result set. Right now I cannot tell which keyword is linked to each set of results.

Also, I am not sure where to use the metadata in this code.


You could try adding metadata to your request. You would use something like:

yield scrapy.Request(
    href, callback=self.parse_search, meta={"entry": search_query}
)

and then retrieve it in the parse_search function like this:

 

def parse_search(self, response):
    entry = response.meta["entry"]

Let me know if this helps you!


1 person likes this
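One detail neither snippet above spells out: the pagination request inside parse_keyword also has to forward the keyword, otherwise items scraped from the second results page onwards arrive without it. A minimal sketch (untested, using the same names as the spider above):

def parse_keyword(self, response):
    word = response.meta['word']  # keyword attached by parse()

    for article_url in response.xpath('//*[@id="searchTable"]/tr/td/a/@href').extract():
        yield scrapy.Request(url=self.var_txt + article_url, callback=self.parse_article,
                             meta={'word': word})

    # pagination: re-attach the keyword, or page 2+ results lose it
    next_page_url = response.css('a.next::attr(href)').extract_first()
    if next_page_url:
        yield scrapy.Request(url=response.urljoin(next_page_url),
                             callback=self.parse_keyword,
                             meta={'word': word})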