Posted about 4 years ago by A.
Hi,
While scraping a site I noticed many redirects to domains outside allowed_domains. They mostly go to sites that require authentication, such as api.twitter.com, Facebook, etc. Other off-site links, however, do get filtered by the OffsiteMiddleware.
This is my spider:
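from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ScriptScrapy(CrawlSpider):
    name = 'scriptscrapy'
    allowed_domains = ['eldorado.ru']
    start_urls = ['http://eldorado.ru']
    # Follow every extracted link and pass each response to parse_item
    rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)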
And this is a sample redirect I get:
2020-11-04 15:00:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://api.twitter.com/oauth/authorize?oauth_token=2NS7MgAAAAAA7tOrAAABdZNYb9Q> (referer: https://www.eldorado.ru/cat/detail/smartfon-apple-iphone-12-pro-256gb-pacific-blue-mgmt3ru-a/?show=response)
When I visit the same URL in my browser, no redirect happens.
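For reference, the check I would expect to be applied to the redirect target can be sketched in plain Python. This is only an illustration and not Scrapy's actual code: the function name is_offsite is made up for this post, and the matching rule (the host equals an allowed domain or is a subdomain of it) is my assumption about how allowed_domains is meant to behave.

```python
from urllib.parse import urlparse


def is_offsite(url, allowed_domains):
    """Return True if the URL's host is outside every allowed domain.

    Hypothetical helper for illustration; assumes a host is allowed when it
    equals an allowed domain or is a subdomain of one.
    """
    host = (urlparse(url).hostname or "").lower()
    for domain in allowed_domains:
        domain = domain.lower()
        if host == domain or host.endswith("." + domain):
            return False
    return True


allowed = ["eldorado.ru"]
# The redirect target from the log above is off-site under this rule.
print(is_offsite("https://api.twitter.com/oauth/authorize", allowed))
# Subdomains of the allowed domain are treated as on-site.
print(is_offsite("https://www.eldorado.ru/cat/detail/", allowed))
```

Under this rule the api.twitter.com URL from the log would count as off-site, so I would have expected it to be dropped like the other off-site links.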