
Scrapy and Splash time out for a specific site

Hi there!


I have an issue with Scrapy and Splash when trying to fetch responses from this site (https://www.arrow.com).


I tried the following without luck:

  • pure Scrapy - times out
  • Scrapy + Crawlera - times out
  • Splash - times out

However, I can scrape the site with Selenium's Firefox webdriver, but I want to move away from that and over to Splash.

Is there a workaround to avoid these timeouts?



Haven't checked in Scrapy, but the website does render in the Splash UI when the Docker image is run with the --max-timeout 300 option. It's also reachable via Crawlera, provided the Referer header is sent.

@surge thanks for your reply! I'm using a Scrapinghub-hosted Splash instance, so I need an approach other than a Docker argument; can you suggest one? Also, could you share the exact steps for setting the Referer header in Crawlera? Regards, kamfor
Also, why would one need to set a 300-second timeout for a site that normally loads in about a second? It just doesn't make much sense to me. Regards, kamfor

I've updated your Splash instance's max-timeout setting. The timeout is needed to give Splash enough time to fetch and render the website's resources.
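For reference, if you drive Splash from Scrapy through scrapy-splash, the per-request timeout can also be raised via the args dict. A minimal sketch (the spider name and the wait value are placeholders, not from this thread):

import scrapy
from scrapy_splash import SplashRequest

class ArrowSpider(scrapy.Spider):
    name = 'arrow'  # hypothetical spider name

    def start_requests(self):
        # 'timeout' is capped by the instance's max-timeout setting;
        # 'wait' gives the page time to finish rendering before the snapshot.
        yield SplashRequest(
            'https://www.arrow.com',
            callback=self.parse,
            args={'timeout': 300, 'wait': 5},
        )

    def parse(self, response):
        self.logger.info('Rendered %d bytes', len(response.body))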


As for the Referer header, you should be able to pass it on the Request, e.g. yield Request(url, headers={'Referer': 'https://www.arrow.com'}, callback=self.parse), or add it as part of DEFAULT_REQUEST_HEADERS, as sketched below.
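A minimal settings.py sketch for the project-wide variant (the Accept values are Scrapy's documented defaults):

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'Referer': 'https://www.arrow.com',
}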


If you want to test locally just with Crawlera, run a cURL call, e.g.: 


curl -U $CRAWLERA_APIKEY: -vx proxy.crawlera.com:8010 "https://www.arrow.com" -H "Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8" --compressed -H "Accept-Language:en-US,en;q=0.9,ru;q=0.8,uk;q=0.7" -H "Host: www.arrow.com" -H "Referer: https://www.arrow.com"
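
If you'd rather run that check from Python than cURL, a rough equivalent with the requests library (an illustration, not something from this thread):

import requests

APIKEY = '<crawlera apikey>'  # placeholder, as above
proxies = {
    'http': 'http://%s:@proxy.crawlera.com:8010/' % APIKEY,
    'https': 'http://%s:@proxy.crawlera.com:8010/' % APIKEY,
}
headers = {'Referer': 'https://www.arrow.com'}

# verify=False because Crawlera re-signs HTTPS traffic with its own certificate.
r = requests.get('https://www.arrow.com', headers=headers, proxies=proxies, verify=False)
print(r.status_code, len(r.text))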

And if you use Crawlera in a Lua script, it would look like:


splash:on_request(function (request)
    request:set_header('Referer', 'https://www.arrow.com')
    request:set_header(session_header, session_id)
    request:set_proxy{host, port, username=user, password=''}
end)

Hi surge,


No success so far; I wonder how you managed without timeouts.


My script is like this: 

function use_crawlera(splash)
    local user = '<crawlera apikey>'

    local host = 'proxy.crawlera.com'
    local port = 8010
    local session_header = 'X-Crawlera-Session'
    local session_id = 'create'

    splash:on_request(function (request)
        -- Discard requests to advertising and tracking domains.
        if string.find(request.url, 'doubleclick%.net') or
           string.find(request.url, 'analytics%.google%.com') then
            request:abort()
            return
        end

        -- Avoid using Crawlera for subresources fetching to increase crawling speed.
        if string.find(request.url, '://static%.') ~= nil or
           string.find(request.url, '%.png$') ~= nil then
            return
        end

        request:set_header('X-Crawlera-Cookies', 'disable')
        request:set_timeout(90.0)
        request:set_header('Referer', 'https://www.arrow.com')
        request:set_header(session_header, session_id)
        request:set_proxy{host, port, username=user, password=''}
        
    end)

    splash:on_response_headers(function (response)
        -- Reuse the session ID Crawlera returns on the first response.
        if response.headers[session_header] ~= nil then
            session_id = response.headers[session_header]
        end
    end)
end

function main(splash)
    use_crawlera(splash)
    splash:go(splash.args.url)

    return {
        html = splash:html(),
        png = splash:png(),
    }
end

 

I render https://www.arrow.com and after some time all I get back as a response is either a 504 Gateway Timeout or this:

[image: screenshot of the Splash error]

 

Any thoughts?

Again, that's in the Splash UI -- I've re-used your script with two changes: added a request:set_header('X-Crawlera-UA', 'pass') line under the X-Crawlera-Cookies one, and added splash:wait(30) before the return statement in the main function. Got several successful renders in a row.


Check if you have CRAWLERA_ENABLED = True in the spider's settings, and flip it to False. This is required for the correct flow of requests when using Splash with Crawlera -- the latter should only be called from the Lua script itself.
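
A sketch of the relevant settings.py lines, assuming scrapy-crawlera and scrapy-splash are installed (the middleware priorities are the ones from the scrapy-splash README; the Splash URL is a placeholder):

CRAWLERA_ENABLED = False  # Crawlera is applied inside the Lua script instead
SPLASH_URL = '<your Splash instance URL>'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}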

Tried it in the Splash UI again, without luck.
By the way, what kind of Splash instance do you use? I bet you tried it with something other than Scrapinghub's small Splash instance, right?

Locally, with a Docker image.
