Posted about 5 years ago by darren.thorpe
Hi
We are interested in explicitly tracking HTTP 3xx redirects during our web scraping.
An example URL that returns a 302 redirect in the browser is https://www.reg-body.gov/financial/default.htm, which redirects to https://www.reg-body.gov.
But when scraping via a Scrapinghub Splash server, we can't yet access the 302 response; we only see the final 200 response. Log messages:
2019-09-17 17:14:27,044 - scrapy.core.engine - DEBUG - Crawled (200) <GET https://www.reg-body.gov/financial/default.htm via https://t5oxntro-splash.scrapinghub.com/execute> (referer: None)
2019-09-17 17:14:27,160 - scrapy.core.scraper - DEBUG - Scraped from <200 https://www.reg-body.gov/financial/default.htm>
We have tried changing a couple of arguments/settings, but to no effect:
- Adding {'dont_redirect': True, 'handle_httpstatus_list': [301, 302]} to the SplashRequest meta.
- Setting REDIRECT_ENABLED to False.
Is there a way, either through config or code, that we can gain access to redirects from Splash requests?
Thanks
Darren
darren.thorpe posted about 5 years ago
Actually, we just figured this out (unless anyone has a better solution):
First, we changed our Lua script to include splash:har() in the Splash response data.
We were then able to access the 302 response, including the original URL, status code and Location header, at:
response.data['har']['log']['entries'][0].
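For anyone who wants to try the same thing, here is a minimal sketch of the approach, not our exact code. The Lua script and the helper name find_redirects are illustrative assumptions; the HAR field names (log, entries, request.url, response.status, response.headers, response.redirectURL) follow the HAR 1.2 layout. Rather than relying on the redirect always being entries[0], the helper scans every entry for a redirect status.

```python
# Hypothetical Lua script: returns the HAR log alongside the rendered HTML.
# It would be passed to the Splash "execute" endpoint, for example via
# SplashRequest(url, callback, endpoint="execute", args={"lua_source": LUA_SOURCE}).
LUA_SOURCE = """
function main(splash)
  assert(splash:go(splash.args.url))
  return {html = splash:html(), har = splash:har()}
end
"""

# Status codes treated as redirects (304 Not Modified is deliberately excluded).
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def find_redirects(har):
    """Return one dict per redirect response found in a HAR log.

    `har` is the decoded HAR object, e.g. response.data['har'] in a
    Scrapy callback that receives a Splash JSON response.
    """
    redirects = []
    for entry in har.get("log", {}).get("entries", []):
        response = entry.get("response", {})
        status = response.get("status")
        if status not in REDIRECT_STATUSES:
            continue
        # HAR stores headers as a list of {"name": ..., "value": ...} objects;
        # header names are case-insensitive, so normalise them to lower case.
        headers = {
            h["name"].lower(): h["value"] for h in response.get("headers", [])
        }
        redirects.append(
            {
                "url": entry.get("request", {}).get("url"),
                "status": status,
                # Prefer the Location header, fall back to HAR's redirectURL.
                "location": headers.get("location") or response.get("redirectURL"),
            }
        )
    return redirects
```

In a spider callback this would be called as find_redirects(response.data['har']), assuming the Lua script returns the HAR under the 'har' key as above.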
Posting in case anyone else has the same problem, or in case someone recommends a different solution.