
Need to capture 302 redirects from Splash

Hi


We are interested in explicitly tracking HTTP 3xx redirects during our web scraping.

 

An example URL that returns a 302 redirect in the browser is https://www.reg-body.gov/financial/default.htm, which redirects to https://www.reg-body.gov.

 

But when scraping via a Scrapinghub Splash server, we can't yet access the 302 response; we only see the final 200 response. Log messages:

 

2019-09-17 17:14:27,044 - scrapy.core.engine - DEBUG - Crawled (200) <GET https://www.reg-body.gov/financial/default.htm via https://t5oxntro-splash.scrapinghub.com/execute> (referer: None)

2019-09-17 17:14:27,160 - scrapy.core.scraper - DEBUG - Scraped from <200 https://www.reg-body.gov/financial/default.htm>

 

We have tried changing a couple of arguments/settings, but to no effect:

- Adding {'dont_redirect': True, 'handle_httpstatus_list': [301, 302]} to the SplashRequest meta.

- Setting REDIRECT_ENABLED to False.

 

Is there a way, either through config or code, that we can gain access to redirects from Splash requests?


Thanks


Darren

1 Comment

Actually, we just figured this out (unless anyone has a better solution):


 

We were able to access the original request and 302 response within Splash HAR logs.

 

First, we used our Lua script to include splash:har() in our Splash response data.
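For anyone who wants a starting point, here is a rough sketch of what such a Lua script might look like, wrapped in Python so it can be passed to a SplashRequest. This is not our exact script: the names LUA_SCRIPT and build_splash_kwargs are made up for illustration, and the 0.5 second wait is an arbitrary assumption. The 'har' key in the returned table is what later shows up as response.data['har'].

```python
# Hypothetical Lua script: load the page, then return the HTML together with
# the HAR log that Splash recorded while loading it.
LUA_SCRIPT = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))  -- arbitrary wait, adjust as needed
  return {
    html = splash:html(),
    har = splash:har(),
  }
end
"""


def build_splash_kwargs(url):
    """Build keyword arguments for a scrapy_splash SplashRequest (illustrative helper)."""
    return {
        "url": url,
        "endpoint": "execute",  # the endpoint that runs Lua scripts
        "args": {"lua_source": LUA_SCRIPT},
    }
```

In a spider this would be used roughly as SplashRequest(callback=self.parse, **build_splash_kwargs(url)), assuming scrapy_splash is installed and configured.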

 

We were then able to access the 302 response, including the original URL, status code and Location header, at:


response.data['har']['log']['entries'][0].
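Indexing entry 0 worked for our case, but a slightly more general approach might be to scan all HAR entries for 3xx responses. The sketch below assumes the standard HAR layout (log, entries, request.url, response.status, response.headers as a list of name/value pairs, response.redirectURL); find_redirects is a hypothetical helper name, not part of Scrapy or Splash.

```python
from typing import Any, Dict, List


def find_redirects(har: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the URL, status and redirect target of every 3xx entry in a HAR dict."""
    redirects = []
    for entry in har.get("log", {}).get("entries", []):
        response = entry.get("response", {})
        status = response.get("status")
        if isinstance(status, int) and 300 <= status < 400:
            # HAR headers are a list of {"name": ..., "value": ...} dicts;
            # header names are matched case-insensitively.
            headers = {
                h.get("name", "").lower(): h.get("value")
                for h in response.get("headers", [])
            }
            redirects.append({
                "url": entry.get("request", {}).get("url"),
                "status": status,
                # Fall back to redirectURL if there is no Location header.
                "location": headers.get("location") or response.get("redirectURL"),
            })
    return redirects
```

In a spider callback this could be called as find_redirects(response.data['har']), given a Lua script that returns splash:har() under the 'har' key.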

 

Posting in case anyone else has the same problem, or in case someone recommends a different solution.
