I have a set of keywords that I run through a search engine one by one to get the list of all search results. After that, I go through each search result and extract some information from each link. I am able to obtain results through yield.
How do I add a variable in yield to store the keyword that I used to search, so that I can tell which results are linked to which keyword?
1. Have a set of keywords.
2. Use them in a search engine one by one.
3. Go through each search result link from the keyword search.
4. Extract the required data from each link.
Everything works up to here.
Now I need to know which items are linked to the keyword that I used to search.
0 Votes
jwaterschoot posted about 7 years ago Best Answer
You should put the variable 'word' in the meta of the request in your first for loop:

def parse(self, response):
    for word in self.key_words:
        url = response.urljoin(self.key_url.format(word))
        self.log('Visited ' + url)
        yield scrapy.Request(url=url, callback=self.parse_keyword, meta={"word": word})

and then feed it to the next function:

    for article_url in article_urls:
        yield scrapy.Request(url=(self.var_txt + article_url), callback=self.parse_article, meta={'word': response.meta['word']})

and finally add it to your yield:

def parse_article(self, response):
    yield {
        'title': Selector(response=response).xpath('//*[@id="basicinfo"]/div[1]/h1/text()').extract(),
        'name': Selector(response=response).xpath('//*[@id="basicinfo"]/div/div/h2/text()').extract(),
        'date': Selector(response=response).xpath('//*[@id="index_show"]/ul[1]/li[1]/text()').extract(),
        'url': response.url,
        'word': response.meta['word'],
    }

I did not test it, but I assume something like this should work. Let me know!
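The meta dict simply travels with each Request and resurfaces on the matching response, which is what lets the keyword survive two hops of callbacks. Here is a Scrapy-free sketch of that hand-off, using plain dicts as stand-ins for Request and Response (all names and URLs are illustrative, not from the spider above):

```python
# Each "request" is a plain dict standing in for scrapy.Request;
# the engine would normally turn it into a "response" carrying the same meta.

def parse(key_words):
    # First hop: attach the search keyword to every request's meta.
    for word in key_words:
        yield {"url": "key=" + word, "callback": parse_keyword,
               "meta": {"word": word}}

def parse_keyword(response, article_urls):
    # Second hop: copy the keyword from the incoming meta to each article request.
    for article_url in article_urls:
        yield {"url": article_url, "callback": parse_article,
               "meta": {"word": response["meta"]["word"]}}

def parse_article(response):
    # Final hop: the keyword ends up in the scraped item.
    yield {"url": response["url"], "word": response["meta"]["word"]}

# Walk one keyword through all three callbacks by hand:
search = next(parse(["abc"]))                   # request for keyword 'abc'
article = next(parse_keyword(search, ["/a1"]))  # article request, meta copied over
item = next(parse_article(article))             # scraped item carries the keyword
print(item)  # {'url': '/a1', 'word': 'abc'}
```

The only thing Scrapy adds on top of this picture is that it fetches each URL and calls the named callback for you; the meta dict rides along unchanged.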
1 Votes
4 Comments
Vinay C posted about 7 years ago
It worked well. Thank you so much!!
1 Votes
Vinay C posted about 7 years ago
Here's my code
import scrapy
from scrapy.selector import Selector

class QuotesSpider(scrapy.Spider):
    name = "Sep_test"
    start_urls = ['https://searchengine.com/']
    key_url = 'key={0}'
    key_words = ['abc', 'xyz']
    var_txt = 'https://'
    custom_settings = {
        'FEED_EXPORT_ENCODING': 'utf-8'
    }

    # looping through keyword list
    def parse(self, response):
        for word in self.key_words:
            url = response.urljoin(self.key_url.format(word))
            self.log('Visited ' + url)
            yield scrapy.Request(url=url, callback=self.parse_keyword)

    # extracting all search result links' href
    def parse_keyword(self, response):
        article_urls = response.xpath('//*[@id="searchTable"]/tr/td/a/@href').extract()
        for article_url in article_urls:
            yield scrapy.Request(url=(self.var_txt + article_url), callback=self.parse_article)

        # pagination
        next_page_url = response.css('a.next::attr(href)').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse_keyword)

    # extracting details from each search result link
    def parse_article(self, response):
        yield {
            'title': Selector(response=response).xpath('//*[@id="basicinfo"]/div[1]/h1/text()').extract(),
            'name': Selector(response=response).xpath('//*[@id="basicinfo"]/div/div/h2/text()').extract(),
            'date': Selector(response=response).xpath('//*[@id="index_show"]/ul[1]/li[1]/text()').extract(),
            'url': response.url,
        }
I need to attach the corresponding keyword to each yielded result. Right now I cannot tell which keyword produced each result, and I am not sure where to use the meta dict in this code.
0 Votes
jwaterschoot posted about 7 years ago
You could try adding metadata to your request. You have to use something like

yield scrapy.Request(
    href,
    callback=self.parse_search,
    meta={"entry": search_query}
)

and then in the parse_search function retrieve it like

def parse_search(self, response):
    entry = response.meta["entry"]

Let me know if this helps you!
1 Votes