I've opened this GitHub issue about a website identifying Splash as a robot and asking for a CAPTCHA.
To recap, this is the issue:
I'm trying to scrape a URL like this one with Splash and Scrapy, but somehow hotelscombined is able to identify Splash and asks for a CAPTCHA to be solved.
The code below is used to make the request:
def start_requests(self):
    script = """
    function main(splash)
        assert(splash:go(splash.args.url))
        assert(splash:wait(1.25))
        -- return the result as a JSON object
        return {
            html = splash:html()
        }
    end
    """
    for url in self.start_urls:
        yield scrapy.Request(url, self.parse, meta={
            'splash': {
                'args': {'lua_source': script},
                'endpoint': 'execute',
            }
        })
I've tried changing the USER_AGENT, but there is no way to make it work correctly. What can I do to stop Splash being detected as an automated browser?
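For reference, this is roughly the override I tried (a sketch only; the UA string below is just an example desktop Chrome value, not necessarily the exact one I used):

def start_requests(self):
    # Same request as above, but with the User-Agent overridden in the
    # Lua script before navigation via splash:set_user_agent.
    script = """
    function main(splash)
        -- replace the default Splash UA with a regular browser string
        -- (example value; swap in whichever UA you want to test)
        splash:set_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36")
        assert(splash:go(splash.args.url))
        assert(splash:wait(1.25))
        return { html = splash:html() }
    end
    """
    for url in self.start_urls:
        yield scrapy.Request(url, self.parse, meta={
            'splash': {
                'args': {'lua_source': script},
                'endpoint': 'execute',
            }
        })

Even with a browser-like User-Agent set this way, the site still responds with the CAPTCHA page.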
If I use Chrome or Firefox, the URL works correctly.
The Splash version I'm using is 3.2 (Qt 5.9.1, PyQt 5.9, WebKit 602.1, sip 4.19.3, Twisted 16.1.1, Lua 5.2), from the latest Docker container published as of today.
Can anyone suggest a workaround to avoid the CAPTCHA request?
Thanks.