Need to capture 302 redirects from Splash

Posted almost 6 years ago by darren.thorpe

darren.thorpe

We are interested in explicitly tracking HTTP 3xx redirects during our web scraping.

An example URL that returns a 302 redirect in the browser is https://www.reg-body.gov/financial/default.htm, which redirects to https://www.reg-body.gov.

But when scraping via a Scrapinghub Splash server, we can't yet access the 302 response; we only see the final 200 response. Log messages:

2019-09-17 17:14:27,044 - scrapy.core.engine - DEBUG - Crawled (200) <GET https://www.reg-body.gov/financial/default.htm via https://t5oxntro-splash.scrapinghub.com/execute> (referer: None)

2019-09-17 17:14:27,160 - scrapy.core.scraper - DEBUG - Scraped from <200 https://www.reg-body.gov/financial/default.htm>

We have tried changing a couple of arguments/settings, but to no effect:

- Adding {'dont_redirect': True, 'handle_httpstatus_list': [301, 302]} to the SplashRequest meta.

- Setting REDIRECT_ENABLED to False.

Is there a way, either through config or code, that we can gain access to redirects from Splash requests?

Thanks

Darren

1 Comment

darren.thorpe posted almost 6 years ago

Actually, we just figured this out (unless anyone has a better solution):

We were able to access the original request and 302 response within Splash HAR logs.

First, we used our Lua script to include splash:har() in our Splash response data.

We were then able to access the 302 response - including the original URL, status code and Location header - at:

response.data['har']['log']['entries'][0].
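
For anyone who wants something to start from, here is a minimal sketch of the approach described above. It is an illustration, not the exact script we run. The Lua script returns `splash:har()` alongside the HTML, and the Python helper walks the HAR entries looking for redirect responses. The names `LUA_SCRIPT`, `REDIRECT_STATUSES` and `extract_redirects` are made up for this example, and the 0.5 second wait is an arbitrary choice. The helper assumes the standard HAR layout (`log.entries[].request.url`, `response.status`, `response.headers`).

```python
from typing import Any, Dict, List, Optional

# Lua script for Splash's /execute endpoint (illustrative; the wait time is
# an arbitrary assumption). Returning splash:har() makes the full network
# log, including intermediate redirect responses, part of the response data.
LUA_SCRIPT = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))
  return {
    html = splash:html(),
    har = splash:har(),
  }
end
"""

# Status codes treated as redirects here (304 Not Modified is deliberately
# excluded, since it is not a redirect).
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def _header_value(headers: List[Dict[str, Any]], name: str) -> Optional[str]:
    """Return the first header value matching `name`, case-insensitively."""
    for header in headers:
        if str(header.get("name", "")).lower() == name.lower():
            return header.get("value")
    return None


def extract_redirects(har: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect URL, status and Location for every redirect entry in a HAR dict."""
    redirects = []
    for entry in har.get("log", {}).get("entries", []):
        response = entry.get("response", {})
        status = response.get("status")
        if status not in REDIRECT_STATUSES:
            continue
        # Prefer the Location header; fall back to HAR's redirectURL field.
        location = (
            _header_value(response.get("headers", []), "Location")
            or response.get("redirectURL")
            or None
        )
        redirects.append(
            {
                "url": entry.get("request", {}).get("url"),
                "status": status,
                "location": location,
            }
        )
    return redirects
```

With scrapy-splash, the script would be passed as `args={'lua_source': LUA_SCRIPT}` on a `SplashRequest` with `endpoint='execute'`, and the callback could then call `extract_redirects(response.data['har'])`. Scanning all entries instead of reading only `entries[0]` also covers chains of more than one redirect.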

Posting in case anyone else has the same problem, or in case someone recommends a different solution.
