I have a set of keywords that I run through a search engine one by one to get the list of search results for each keyword. After that, I go through each search result and extract some information from the linked page. I am able to obtain the results through yield.
How do I add a variable to the yield that stores the keyword I used to search, so that I can tell which results belong to which keyword?
1. Have a set of keywords.
2. Run them through a search engine one by one.
3. Go through each search result link returned for the keyword.
4. Extract the required data from each link.
Everything works up to this point.
Now I need to know which items are linked to the keyword that I used to search.
0 Votes
jwaterschoot posted over 6 years ago, Best Answer
You should put the variable 'word' in the meta of the request in your first for loop:

    def parse(self, response):
        for word in self.key_words:
            url = response.urljoin(self.key_url.format(word))
            self.log('Visited ' + url)
            # Attach the keyword to the request so the callback can read it back
            yield scrapy.Request(url=url, callback=self.parse_keyword, meta={"word": word})

and then feed it to the next function, reading it back from response.meta:

    for article_url in article_urls:
        yield scrapy.Request(url=(self.var_txt + article_url), callback=self.parse_article, meta={'word': response.meta['word']})
and finally add it to your yield
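A minimal sketch of that last step, assuming the item fields and XPath expressions from the question's spider and the 'word' key used in the requests above. It is written as the parse_article method of the spider, and it uses response.xpath, which is the shortcut for Selector(response=response).xpath:

```python
def parse_article(self, response):
    # Callback for each search-result page; belongs inside the spider class.
    yield {
        'title': response.xpath('//*[@id="basicinfo"]/div[1]/h1/text()').extract(),
        'name': response.xpath('//*[@id="basicinfo"]/div/div/h2/text()').extract(),
        'date': response.xpath('//*[@id="index_show"]/ul[1]/li[1]/text()').extract(),
        'url': response.url,
        # Keyword carried along in the request meta by the earlier callbacks
        'word': response.meta['word'],
    }
```

Every exported item then carries the keyword that produced it, so the output can be grouped or filtered by 'word'.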
I did not test it, but I assume something like this should work. Let me know!
1 Votes
4 Comments
Vinay C posted over 6 years ago
It worked well. Thank you so much!!
1 Votes
Vinay C posted over 6 years ago
Here's my code
    import scrapy
    from scrapy.selector import Selector

    class QuotesSpider(scrapy.Spider):
        name = "Sep_test"
        start_urls = ['https://searchengine.com/']
        key_url = 'key={0}'
        key_words = ['abc', 'xyz']
        var_txt = 'https://'
        custom_settings = {
            'FEED_EXPORT_ENCODING': 'utf-8'
        }

        # looping through keyword list
        def parse(self, response):
            for word in self.key_words:
                url = response.urljoin(self.key_url.format(word))
                self.log('Visited ' + url)
                yield scrapy.Request(url=url, callback=self.parse_keyword)

        # extracting all search result links href
        def parse_keyword(self, response):
            article_urls = response.xpath('//*[@id="searchTable"]/tr/td/a/@href').extract()
            for article_url in article_urls:
                yield scrapy.Request(url=(self.var_txt + article_url), callback=self.parse_article)

            # pagination
            next_page_url = response.css('a.next::attr(href)').extract_first()
            if next_page_url:
                next_page_url = response.urljoin(next_page_url)
                yield scrapy.Request(url=next_page_url, callback=self.parse_keyword)

        # extracting details from each search result link
        def parse_article(self, response):
            yield {
                'title': Selector(response=response).xpath('//*[@id="basicinfo"]/div[1]/h1/text()').extract(),
                'name': Selector(response=response).xpath('//*[@id="basicinfo"]/div/div/h2/text()').extract(),
                'date': Selector(response=response).xpath('//*[@id="index_show"]/ul[1]/li[1]/text()').extract(),
                'url': response.url,
            }
I need to attach the corresponding keyword to each yielded item. Right now I cannot tell which keyword each yielded item belongs to.
Also, I am not sure where to use the metadata in this code.
0 Votes
jwaterschoot posted over 6 years ago
You could try to add meta data to your request. You have to use something like
and then in the parse_search function retrieve it like
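A rough sketch of both halves, assuming the spider attributes from the code above (key_words, key_url, var_txt) and its callback names parse_keyword and parse_article rather than parse_search. It uses response.follow, Scrapy's shortcut for building a request relative to the current page, in place of scrapy.Request plus urljoin; that swap is only for brevity. The keyword is also forwarded to the next-page request, since otherwise response.meta['word'] would be missing on later result pages:

```python
def parse(self, response):
    for word in self.key_words:
        # Set: attach the keyword to the search request
        yield response.follow(self.key_url.format(word),
                              callback=self.parse_keyword,
                              meta={'word': word})


def parse_keyword(self, response):
    # Retrieve: the meta dict of the request is exposed as response.meta
    word = response.meta['word']

    # Forward the keyword to every article request
    for article_url in response.xpath('//*[@id="searchTable"]/tr/td/a/@href').extract():
        yield response.follow(self.var_txt + article_url,
                              callback=self.parse_article,
                              meta={'word': word})

    # Forward it to the next results page as well
    next_page_url = response.css('a.next::attr(href)').extract_first()
    if next_page_url:
        yield response.follow(next_page_url,
                              callback=self.parse_keyword,
                              meta={'word': word})
```

Both functions are meant to replace the methods of the same name inside the spider class; parse_article can then read response.meta['word'] in the same way.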
Let me know if this helps you!
1 Votes