I have a set of keywords that I run through a search engine one by one to get the list of all search results. After that, I go through each search result and extract some information from each link. I am able to obtain results through yield.
How do I add a variable to the yield to store the keyword that I used for the search, so that I can tell which results are linked to which keyword?
1. Have a set of keywords.
2. Need to use them in a search engine one by one.
3. Go through each search result link from the keyword search.
4. Extract the required data from each link.
Working till here.
Now I need to know which items are linked to the keyword that I used for the search.
Best Answer
jwaterschoot said about 5 years ago
You should put the variable 'word' in the meta data of the request in your first for loop:
def parse(self, response):
    for word in self.key_words:
        url = response.urljoin(self.key_url.format(word))
        self.log('Visited ' + url)
        yield scrapy.Request(url=url, callback=self.parse_keyword, meta={"word": word})
and then feed it to the next function:

for article_url in article_urls:
    yield scrapy.Request(url=(self.var_txt + article_url), callback=self.parse_article, meta={'word': response.meta['word']})
and finally add it to your yield:
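A minimal sketch of that last step (the snippet itself is not reproduced in this copy of the thread), reusing the parse_article callback from the spider posted below; the 'keyword' field name is just an example:

def parse_article(self, response):
    yield {
        'title': Selector(response=response).xpath('//*[@id="basicinfo"]/div[1]/h1/text()').extract(),
        'url': response.url,
        # the search keyword carried through both requests via meta
        'keyword': response.meta['word'],
    }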
I did not test it, but I assume something like this should work. Let me know!
jwaterschoot
You could try to add meta data to your request. You have to use something like
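(The inline snippet is missing from this copy; presumably something along these lines, passing the keyword through the meta argument of scrapy.Request — the parse_search callback name comes from the next sentence:)

yield scrapy.Request(url=url, callback=self.parse_search, meta={'word': word})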
and then in the parse_search function retrieve it like
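(Again a guess at the missing snippet:)

word = response.meta['word']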
Let me know if this helps you!
Vinay C
Here's my code
import scrapy
from scrapy.selector import Selector

class QuotesSpider(scrapy.Spider):
    name = "Sep_test"
    start_urls = ['https://searchengine.com/']
    key_url = 'key={0}'
    key_words = ['abc', 'xyz']
    var_txt = 'https://'
    custom_settings = {
        'FEED_EXPORT_ENCODING': 'utf-8'
    }

    # looping through keyword list
    def parse(self, response):
        for word in self.key_words:
            url = response.urljoin(self.key_url.format(word))
            self.log('Visited ' + url)
            yield scrapy.Request(url=url, callback=self.parse_keyword)

    # extracting all search result links' href
    def parse_keyword(self, response):
        article_urls = response.xpath('//*[@id="searchTable"]/tr/td/a/@href').extract()
        for article_url in article_urls:
            yield scrapy.Request(url=(self.var_txt + article_url), callback=self.parse_article)

        # pagination
        next_page_url = response.css('a.next::attr(href)').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse_keyword)

    # extracting details from each search result link
    def parse_article(self, response):
        yield {
            'title': Selector(response=response).xpath('//*[@id="basicinfo"]/div[1]/h1/text()').extract(),
            'name': Selector(response=response).xpath('//*[@id="basicinfo"]/div/div/h2/text()').extract(),
            'date': Selector(response=response).xpath('//*[@id="index_show"]/ul[1]/li[1]/text()').extract(),
            'url': response.url,
        }
I need to attach the corresponding keyword to each yielded result set. Right now I cannot tell which keyword produced which result.
Also, I am not sure where to use the meta data in this code.
Vinay C
It worked well. Thank you so much!!