
Scrapy and Splash time out for a specific site

Hi there!


I have an issue with Scrapy and Splash when trying to fetch responses from this site (https://www.arrow.com).


I tried the following without luck:

  • pure Scrapy - times out
  • Scrapy + Crawlera - times out
  • Splash - times out

However, I can scrape the site with Selenium's Firefox webdriver, but I want to move away from that and over to Splash.

Is there a workaround to avoid these timeouts?



Haven't checked in Scrapy, but the website does render in the Splash UI when the Docker image is run with the --max-timeout 300 option. It's also reachable via Crawlera, provided the Referer header is sent.

@surge thanks for your reply! I'm using a Scrapinghub-hosted Splash instance, so I need an approach other than a Docker argument; can you suggest one? Also, could you share the exact steps for setting the Referer header in Crawlera? Regards, kamfor
Also, why would one need to set a 300-second timeout for a site that normally loads in about a second? It just doesn't make much sense to me. Regards, kamfor

I've updated your Splash instance's max-timeout setting. The timeout is needed to give Splash enough time to fetch and render the website's resources.
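For reference, if you drive Splash from Scrapy through scrapy-splash, the per-request timeout can also be raised via the args dict. A minimal sketch (the spider name and the wait value are placeholders, not from this thread):

import scrapy
from scrapy_splash import SplashRequest

class ArrowSpider(scrapy.Spider):
    name = 'arrow'  # hypothetical spider name

    def start_requests(self):
        # 'timeout' is capped by the instance's max-timeout setting;
        # 'wait' gives the page time to finish rendering before the snapshot.
        yield SplashRequest(
            'https://www.arrow.com',
            callback=self.parse,
            args={'timeout': 300, 'wait': 5},
        )

    def parse(self, response):
        self.logger.info('Rendered %d bytes', len(response.body))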


As for the Referer header, you should be able to pass it on the Request, e.g. yield Request(url, headers={'Referer': 'https://www.arrow.com'}, callback=self.parse), or add it as part of DEFAULT_REQUEST_HEADERS, as sketched below.
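A minimal settings.py sketch for the project-wide variant (the Accept values are Scrapy's documented defaults):

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'Referer': 'https://www.arrow.com',
}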


If you want to test locally just with Crawlera, run a cURL call, e.g.: 


curl -U $CRAWLERA_APIKEY: -vx proxy.crawlera.com:8010 "https://www.arrow.com" -H "Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8" --compressed -H "Accept-Language:en-US,en;q=0.9,ru;q=0.8,uk;q=0.7" -H "Host: www.arrow.com" -H "Referer: https://www.arrow.com"
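
If you'd rather run that check from Python than cURL, a rough equivalent with the requests library (an illustration, not something from this thread):

import requests

APIKEY = '<crawlera apikey>'  # placeholder, as above
proxies = {
    'http': 'http://%s:@proxy.crawlera.com:8010/' % APIKEY,
    'https': 'http://%s:@proxy.crawlera.com:8010/' % APIKEY,
}
headers = {'Referer': 'https://www.arrow.com'}

# verify=False because Crawlera re-signs HTTPS traffic with its own certificate.
r = requests.get('https://www.arrow.com', headers=headers, proxies=proxies, verify=False)
print(r.status_code, len(r.text))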

And if you use Crawlera in a Lua script, it would look like:


splash:on_request(function (request)
    request:set_header('Referer', 'https://www.arrow.com')
    request:set_header(session_header, session_id)
    request:set_proxy{host, port, username=user, password=''}
end)

Hi surge,


No success so far; I wonder how you managed without timeouts.


My script is like this: 

function use_crawlera(splash)
    local user = '<crawlera apikey>'

    local host = 'proxy.crawlera.com'
    local port = 8010
    local session_header = 'X-Crawlera-Session'
    local session_id = 'create'

    splash:on_request(function (request)
        -- Discard requests to advertising and tracking domains.
        if string.find(request.url, 'doubleclick%.net') or
           string.find(request.url, 'analytics%.google%.com') then
            request:abort()
            return
        end

        -- Avoid using Crawlera for subresources fetching to increase crawling speed.
        if string.find(request.url, '://static%.') ~= nil or
           string.find(request.url, '%.png$') ~= nil then
            return
        end

        request:set_header('X-Crawlera-Cookies', 'disable')
        request:set_timeout(90.0)
        request:set_header('Referer', 'https://www.arrow.com')
        request:set_header(session_header, session_id)
        request:set_proxy{host, port, username=user, password=''}
        
    end)

    splash:on_response_headers(function (response)
        -- Reuse the session ID Crawlera returns on the first response.
        if response.headers[session_header] ~= nil then
            session_id = response.headers[session_header]
        end
    end)
end

function main(splash)
    use_crawlera(splash)
    splash:go(splash.args.url)

    return {
        html = splash:html(),
        png = splash:png(),
    }
end

 

I render https://www.arrow.com and after some time all I get back as a response is either a 504 Gateway Timeout or this:

[image: screenshot of the Splash error]

 

Any thoughts?

Again, that's in the Splash UI -- I've re-used your script with two changes: added a request:set_header('X-Crawlera-UA', 'pass') line under the X-Crawlera-Cookies one, and added splash:wait(30) before the return statement in the main function. Got several successful renders in a row.


Check if you have CRAWLERA_ENABLED = True in the spider's settings, and flip it to False. This is required for the correct flow of requests when using Splash with Crawlera -- the latter should only be called from the Lua script itself.
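
A sketch of the relevant settings.py lines, assuming scrapy-crawlera and scrapy-splash are installed (the middleware priorities are the ones from the scrapy-splash README; the Splash URL is a placeholder):

CRAWLERA_ENABLED = False  # Crawlera is applied inside the Lua script instead
SPLASH_URL = '<your Splash instance URL>'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}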

Tried it in the Splash UI again, without luck.
By the way, what kind of Splash instance do you use? I bet you tried it with something other than Scrapinghub's small Splash instance, right?

Locally, with a Docker image.
