Actually, we just figured this out, though we'd welcome a better solution:
First, we changed our Lua script to include splash:har() in the data it returns from Splash.
We were then able to access the 302 response (including the original URL, status code and Location header) at:
response.data['har']['log']['entries'][0]
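For anyone who wants something to start from, here is a rough sketch of the approach, not our exact code. The Lua script is a minimal guess at what such a script could look like. The helper name find_redirects and the sample HAR data are made up for illustration. The field names (request.url, response.status, response.headers, response.redirectURL) follow the HAR 1.2 format, which we assume Splash's HAR output matches.

```python
# Minimal Lua script for Splash's "execute" endpoint (an assumed example,
# not our production script). Returning splash:har() adds the full
# request/response log, including intermediate redirects, to the result.
LUA_SOURCE = """
function main(splash, args)
  assert(splash:go(args.url))
  return {html = splash:html(), har = splash:har()}
end
"""


def find_redirects(har):
    """Return (url, status, location) for every 3xx entry in a HAR dict.

    Hypothetical helper: scans all entries rather than assuming the
    redirect is always entries[0].
    """
    redirects = []
    for entry in har.get("log", {}).get("entries", []):
        response = entry.get("response", {})
        status = response.get("status")
        if not isinstance(status, int) or not 300 <= status < 400:
            continue
        # Fall back to HAR's redirectURL if no Location header is present.
        location = response.get("redirectURL") or None
        for header in response.get("headers", []):
            if header.get("name", "").lower() == "location":
                location = header.get("value")
                break
        url = entry.get("request", {}).get("url")
        redirects.append((url, status, location))
    return redirects


# Hand-written sample data shaped like a HAR log, for illustration only.
sample_har = {
    "log": {
        "entries": [
            {
                "request": {"url": "https://www.reg-body.gov/financial/default.htm"},
                "response": {
                    "status": 302,
                    "headers": [{"name": "Location", "value": "https://www.reg-body.gov"}],
                },
            },
            {
                "request": {"url": "https://www.reg-body.gov"},
                "response": {"status": 200, "headers": []},
            },
        ]
    }
}

for url, status, location in find_redirects(sample_har):
    print(status, url, "->", location)
```

In a spider, the script would be passed along the lines of SplashRequest(url, callback, endpoint='execute', args={'lua_source': LUA_SOURCE}), and the callback would call find_redirects(response.data['har']).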
Posting in case anyone else has the same problem, or in case someone recommends a different solution.
darren.thorpe
Hi
We are interested in explicitly tracking HTTP 3xx redirects during our web scraping.
An example URL that returns a 302 redirect in the browser is https://www.reg-body.gov/financial/default.htm, which redirects to https://www.reg-body.gov.
But when scraping via a Scrapinghub Splash server, we can't yet access the 302 response; we only see the final 200 response. Log messages:
2019-09-17 17:14:27,044 - scrapy.core.engine - DEBUG - Crawled (200) <GET https://www.reg-body.gov/financial/default.htm via https://t5oxntro-splash.scrapinghub.com/execute> (referer: None)
2019-09-17 17:14:27,160 - scrapy.core.scraper - DEBUG - Scraped from <200 https://www.reg-body.gov/financial/default.htm>
We have tried changing a couple of arguments/settings, but to no effect:
- Adding {'dont_redirect': True, 'handle_httpstatus_list': [301, 302]} to the SplashRequest meta.
- Setting REDIRECT_ENABLED to False.
Is there a way, either through config or code, that we can gain access to redirects from Splash requests?
Thanks
Darren