Adding a variable in yield

Posted almost 7 years ago by Vinay C

Answered

Vinay C

I have a set of keywords that I run through a search engine one by one to get the list of all search results. I then go through each search result and extract some information from each link. I am able to obtain the results through yield.

How do I add a variable to the yield to store the keyword that I searched with, so that I can tell which results belong to which keyword?

1. Have a set of keywords.

2. Need to use them in a search engine one by one.

3. Go through each search result link from the keyword search.

4. Extract the required data from each link.

It works up to here.

Now I need to know which items are linked to the keyword that I have used to search.

0 Votes

jwaterschoot posted almost 7 years ago Best Answer

You should put the variable 'word' in the meta of the request in your first for loop:

def parse(self, response):
    for word in self.key_words:
        url = response.urljoin(self.key_url.format(word))
        self.log('Visited ' + url)
        yield scrapy.Request(url=url, callback=self.parse_keyword, meta={"word": word})

Then pass it on to the next callback from inside parse_keyword:

for article_url in article_urls:
    yield scrapy.Request(url=(self.var_txt + article_url), callback=self.parse_article, meta={'word': response.meta['word']})

Finally, add it to the item you yield:

def parse_article(self, response):
    yield {
        'title': Selector(response=response).xpath('//*[@id="basicinfo"]/div[1]/h1/text()').extract(),
        'name': Selector(response=response).xpath('//*[@id="basicinfo"]/div/div/h2/text()').extract(),
        'date': Selector(response=response).xpath('//*[@id="index_show"]/ul[1]/li[1]/text()').extract(),
        'url': response.url,
        'word': response.meta['word']
    }

I did not test it, but I assume something like this should work. Let me know!

1 Votes

4 Comments

Vinay C posted almost 7 years ago

It worked well. Thank you so much!!

1 Votes

Vinay C posted almost 7 years ago

Here's my code

import scrapy
from scrapy.selector import Selector


class QuotesSpider(scrapy.Spider):
    name = "Sep_test"
    start_urls = ['https://searchengine.com/']
    key_url = 'key={0}'
    key_words = ['abc', 'xyz']
    var_txt = 'https://'

    custom_settings = {
        'FEED_EXPORT_ENCODING': 'utf-8'
    }

    # looping through keyword list
    def parse(self, response):
        for word in self.key_words:
            url = response.urljoin(self.key_url.format(word))
            self.log('Visited ' + url)
            yield scrapy.Request(url=url, callback=self.parse_keyword)

    # extracting all search result links href
    def parse_keyword(self, response):
        article_urls = response.xpath('//*[@id="searchTable"]/tr/td/a/@href').extract()
        for article_url in article_urls:
            yield scrapy.Request(url=(self.var_txt + article_url), callback=self.parse_article)

        # pagination
        next_page_url = response.css('a.next::attr(href)').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse_keyword)

    # extracting details from each search result link
    def parse_article(self, response):
        yield {
            'title': Selector(response=response).xpath('//*[@id="basicinfo"]/div[1]/h1/text()').extract(),
            'name': Selector(response=response).xpath('//*[@id="basicinfo"]/div/div/h2/text()').extract(),
            'date': Selector(response=response).xpath('//*[@id="index_show"]/ul[1]/li[1]/text()').extract(),
            'url': response.url,
        }

I need to attach the corresponding keyword to each yielded result set. Right now I cannot tell which keyword each result set belongs to.

Also, I am not sure where to use the metadata in this code.

0 Votes

jwaterschoot posted almost 7 years ago

You could try adding metadata to your request, with something like:

yield scrapy.Request(
    href, callback=self.parse_search, meta={"entry": search_query}
)

and then retrieve it in the parse_search function like this:

def parse_search(self, response):
    entry = response.meta["entry"]

Let me know if this helps you!

1 Votes