Haven't checked in Scrapy, but the website does render in the Splash UI when the Docker image is run with the --max-timeout 300 option. It's also reachable via Crawlera, provided the Referer header is sent.
I've updated your Splash instance's max-timeout setting. The longer timeout gives Splash enough time to fetch and render the website's resources.
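If you drive the rendering from Scrapy through scrapy-splash, you can also request a per-render budget within that limit. A minimal sketch, assuming scrapy-splash is installed; the spider name and the wait/timeout values are placeholders:

import scrapy
from scrapy_splash import SplashRequest

class ArrowSpider(scrapy.Spider):
    name = 'arrow'  # hypothetical spider name

    def start_requests(self):
        # 'timeout' is the render budget Splash gets for this request; it has
        # to stay below the --max-timeout the Splash instance was started with.
        yield SplashRequest(
            'https://www.arrow.com',
            self.parse,
            args={'timeout': 290, 'wait': 30},
        )

    def parse(self, response):
        self.logger.info('rendered %d bytes', len(response.body))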
As for the Referer header, you should be able to pass it on the Request, e.g. yield Request(url, headers={'Referer': 'https://www.arrow.com'}, callback=self.parse), or add it as part of DEFAULT_REQUEST_HEADERS.
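For the DEFAULT_REQUEST_HEADERS route, a minimal sketch of the settings.py entry; the headers besides Referer are assumptions mirroring the cURL call below:

# settings.py
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Referer': 'https://www.arrow.com',  # attached to every request the spider sends
}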
If you want to test locally with just Crawlera, make a cURL call, e.g.:
curl -U $CRAWLERA_APIKEY: -vx proxy.crawlera.com:8010 "https://www.arrow.com" --compressed \
    -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8" \
    -H "Accept-Language: en-US,en;q=0.9,ru;q=0.8,uk;q=0.7" \
    -H "Host: www.arrow.com" \
    -H "Referer: https://www.arrow.com"
And if you use Crawlera in a Lua script, it would look like:
splash:on_request(function (request)
    -- host, port, user, session_header and session_id are defined in the
    -- surrounding script (see the full example further down the thread).
    request:set_header('Referer', 'https://www.arrow.com')
    request:set_header(session_header, session_id)
    request:set_proxy{host, port, username=user, password=''}
end)
Hi surge,
No success so far; I wonder how you managed without timeouts.
My script is like this:
function use_crawlera(splash)
    local user = '<crawlera apikey>'
    local host = 'proxy.crawlera.com'
    local port = 8010
    local session_header = 'X-Crawlera-Session'
    local session_id = 'create'

    splash:on_request(function (request)
        -- Discard requests to advertising and tracking domains.
        if string.find(request.url, 'doubleclick%.net') or
                string.find(request.url, 'analytics%.google%.com') then
            request:abort()
            return
        end

        -- Avoid using Crawlera for subresource fetching to increase crawling speed.
        if string.find(request.url, '://static%.') ~= nil or
                string.find(request.url, '%.png$') ~= nil then
            return
        end

        request:set_header('X-Crawlera-Cookies', 'disable')
        request:set_timeout(90.0)
        request:set_header('Referer', 'https://www.arrow.com')
        request:set_header(session_header, session_id)
        request:set_proxy{host, port, username=user, password=''}
    end)

    -- Remember the session id Crawlera assigns, so follow-up requests reuse it.
    splash:on_response_headers(function (response)
        if response.headers[session_header] ~= nil then
            session_id = response.headers[session_header]
        end
    end)
end

function main(splash)
    use_crawlera(splash)
    splash:go(splash.args.url)
    return {
        html = splash:html(),
        png = splash:png(),
    }
end
I render `https://www.arrow.com` and after some time all I get back as a response is either a 504 Gateway Timeout or:
Any thoughts?
Again, that's in the Splash UI -- I've re-used your script with two changes: added a request:set_header('X-Crawlera-UA', 'pass') line under the X-Crawlera-Cookies one, and a splash:wait(30) call before the return statement in the main function. I got several successful renders in a row.
Check if you have CRAWLERA_ENABLED = True in the spider's settings, and flip it to False. This is required for the correct flow of requests when using Splash with Crawlera -- the latter should only be called from the Lua script itself.
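A minimal sketch of that per-spider override, assuming the scrapy-crawlera plugin is installed; the spider class itself is hypothetical:

import scrapy

class ArrowSpider(scrapy.Spider):
    name = 'arrow'  # hypothetical
    custom_settings = {
        # Keep the scrapy-crawlera middleware out of the way: Crawlera is
        # engaged from inside the Lua script, so Scrapy's requests must go
        # straight to Splash.
        'CRAWLERA_ENABLED': False,
    }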
Tried it in the Splash UI again, without luck.
By the way, what kind of Splash instance do you use? I bet you tried it with something other than Scrapinghub's small Splash instance, right?
Locally, with a Docker image.
kamfor
Hi there!
I have an issue with Scrapy and Splash when trying to fetch responses from this site.
I tried the following without luck: