Splash identified as a robot, resulting in a captcha request

Posted over 7 years ago by Alessio Pollero


I've opened this GitHub bug about a website identifying Splash as a robot and asking for a captcha.

To recap, this is the issue:

I'm trying to scrape a URL like this one with Splash and Scrapy, but somehow hotelscombined is able to identify Splash and asks for a captcha to be solved.

The code below is used to make the request:

def start_requests(self):
    # Lua script run by Splash's "execute" endpoint: load the page,
    # wait for it to render, then return the rendered HTML.
    script = """
    function main(splash)
        assert(splash:go(splash.args.url))
        assert(splash:wait(1.25))

        -- return result as a JSON object
        return {
            html = splash:html()
        }
    end
    """
    for url in self.start_urls:
        yield scrapy.Request(url, self.parse, meta={
            'splash': {
                'args': {'lua_source': script},
                'endpoint': 'execute',
            }
        })

I've tried changing the USER_AGENT, but I can't get it to work correctly.
What can I do to keep Splash from being detected as an automated browser?
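
For reference, here is a minimal sketch of one way the user agent could be changed from inside the Lua script itself, using Splash's `splash:set_user_agent`, instead of relying only on Scrapy's USER_AGENT setting. The helper name `build_splash_meta` and the `user_agent` argument key are hypothetical names chosen for illustration, and nothing here guarantees the site stops asking for a captcha, since it may detect automation by other means.

```python
# Lua script for Splash's "execute" endpoint. Extra keys passed in
# 'args' are available to the script as fields of splash.args.
LUA_SCRIPT = """
function main(splash)
    -- apply the user agent passed in from Scrapy before loading the page
    splash:set_user_agent(splash.args.user_agent)
    assert(splash:go(splash.args.url))
    assert(splash:wait(1.25))
    return {
        html = splash:html()
    }
end
"""


def build_splash_meta(user_agent):
    """Build the request meta dict (hypothetical helper, not part of scrapy-splash)."""
    return {
        'splash': {
            'args': {
                'lua_source': LUA_SCRIPT,
                # 'user_agent' is an assumed argument name read by the script above
                'user_agent': user_agent,
            },
            'endpoint': 'execute',
        }
    }
```

Inside `start_requests`, this would be used as `yield scrapy.Request(url, self.parse, meta=build_splash_meta(ua))`, where `ua` is whatever browser user-agent string is being tested.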

If I use Chrome or Firefox, the URL works correctly.

I'm using Splash version 3.2 (Qt 5.9.1, PyQt 5.9, WebKit 602.1, sip 4.19.3, Twisted 16.1.1, Lua 5.2), from the latest Docker container published as of today.

Can anyone suggest a workaround to avoid the captcha request?

Thanks.
