I deployed Splash (in Docker) about a month ago on my dedicated server.
I am trying to scrape [this site](https://www.businesswire.com/portal/site/home/news/subject/?vnsId=31333) with Scrapy Splash, but I get the following error no matter how many times I try that URL:
`([scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.businesswire.com/portal/site/home/news/subject/?vnsId=31333 via http://localhost:8050/render.html> (failed 1 times): User timeout caused connection failure: Getting http://localhost:8050/render.html took longer than 80.0 seconds..)`
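For reference, the request goes through scrapy-splash against the `render.html` endpoint shown in the error. A minimal sketch of the kind of spider I'm running (the spider name and args are placeholders, not my exact code):

```python
import scrapy
from scrapy_splash import SplashRequest


class BusinesswireSpider(scrapy.Spider):
    name = "businesswire_splash"  # placeholder name

    def start_requests(self):
        # Render the page through the local Splash instance (render.html endpoint)
        yield SplashRequest(
            "https://www.businesswire.com/portal/site/home/news/subject/?vnsId=31333",
            callback=self.parse,
            endpoint="render.html",
            args={"wait": 2},  # illustrative arg; the 80s timeout comes from Scrapy settings
        )

    def parse(self, response):
        self.logger.info("Rendered %s (%d bytes)", response.url, len(response.body))
```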
Meanwhile, the same Splash server successfully scrapes every other site I try.
If I cURL or `scrapy.Request` the above URL from my server, it works; the site does not block me no matter how many times I scrape it via cURL or `scrapy.Request`.
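For example, a plain Scrapy spider (no Splash) against the same URL comes back fine every time. Roughly (spider name is a placeholder):

```python
import scrapy


class BusinesswirePlainSpider(scrapy.Spider):
    name = "businesswire_plain"  # placeholder name

    def start_requests(self):
        # Same URL, but fetched directly by Scrapy's downloader instead of Splash
        yield scrapy.Request(
            "https://www.businesswire.com/portal/site/home/news/subject/?vnsId=31333",
            callback=self.parse,
        )

    def parse(self, response):
        # This returns a response consistently, unlike the Splash-rendered request
        self.logger.info("Got %s with status %s", response.url, response.status)
```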
Then I had the idea to see if there are some headers Splash is sending. I debugged Splash's request headers via http://httpbin.org/get and found out that it automatically adds a few headers.
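The debugging itself was just pointing Splash's `render.html` endpoint at httpbin, which echoes back whatever headers it receives. Something like this (a sketch using the `requests` library; not my exact call):

```python
import requests

# Ask the local Splash instance to render httpbin.org/get;
# httpbin echoes the request headers it received, so the rendered
# page shows the headers Splash adds on its own.
resp = requests.get(
    "http://localhost:8050/render.html",
    params={"url": "http://httpbin.org/get"},
)
print(resp.text)
```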
So now I know that Splash is sending `"Host": "businesswire.com"` to the target site, and that seems to be what makes the website refuse to be scraped.
Question is, how do I make Splash not send any headers automatically? Or at least, how do I stop Splash from sending the `Host` header?