Hi there!
I have an issue with Scrapy and Splash when trying to fetch responses from this site.
I tried the following without luck:
pure scrapy - times out
scrapy + crawlera - times out
splash - times out
However, I can scrape the site with Selenium's Firefox webdriver. But I want to move away from that and over to Splash.
Is there a workaround to avoid these timeouts?
0 Votes
9 Comments
surge posted almost 7 years ago (Admin)
Haven't checked in Scrapy, but the website does render in the Splash UI when the docker image is run with the --max-timeout 300 option. It's also reachable via Crawlera, provided the Referer header is sent.
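For a local setup, that option is appended to the docker run command; a minimal sketch (-p 8050:8050 exposes Splash's default port, and --max-timeout raises the ceiling on the timeout values Splash will accept per request):
docker run -it -p 8050:8050 scrapinghub/splash --max-timeout 300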
0 Votes
kamfor posted almost 7 years ago
@surge thanks for your reply!
I'm using a Splash instance from Scrapinghub, so I need an approach other than a docker argument; can you suggest one?
Also, can you please share the exact steps for setting up the Referer header in Crawlera?
Regards,
kamfor
0 Votes
kamfor posted almost 7 years ago
Also, why would one need to set up a 360-second timeout for a site that normally loads in a second or so? It just doesn't make much sense to me.
Regards,
kamfor
0 Votes
surge posted almost 7 years ago (Admin)
I've updated your Splash instance max-timeout setting. The timeout is needed for Splash to have enough time to fetch and render the website's resources.
As for the Referer header, you should be able to pass it on the Request, e.g. yield Request(url, headers={'Referer': 'https://www.arrow.com'}, callback=self.parse). Or add it as part of DEFAULT_REQUEST_HEADERS.
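For the settings route, a minimal sketch for settings.py (the Accept and Accept-Language entries mirror Scrapy's defaults, since overriding this setting replaces the whole dict; only the Referer line is new):
# settings.py -- send the Referer header with every request by default
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'Referer': 'https://www.arrow.com',
}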
If you want to test locally just with Crawlera, run a cURL call, e.g.:
curl -U $CRAWLERA_APIKEY: -vx proxy.crawlera.com:8010 "https://www.arrow.com" -H "Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8" --compressed -H "Accept-Language:en-US,en;q=0.9,ru;q=0.8,uk;q=0.7" -H "Host: www.arrow.com" -H "Referer: https://www.arrow.com"
0 Votes
surge posted almost 7 years ago (Admin)
And if you use Crawlera in a Lua script, it would look like:
splash:on_request(function (request)
    request:set_header('Referer', 'https://www.arrow.com')
    request:set_header(session_header, session_id)
    request:set_proxy{host, port, username=user, password=''}
end)
0 Votes
kamfor posted almost 7 years ago
Hi surge,
No success so far; I wonder how you managed without timeouts.
My script is like this:
function use_crawlera(splash)
    local user = '<crawlera apikey>'
    local host = 'proxy.crawlera.com'
    local port = 8010
    local session_header = 'X-Crawlera-Session'
    local session_id = 'create'

    splash:on_request(function (request)
        -- Discard requests to advertising and tracking domains.
        if string.find(request.url, 'doubleclick%.net') or
           string.find(request.url, 'analytics%.google%.com') then
            request.abort()
            return
        end
        -- Avoid using Crawlera for subresource fetching to increase crawling speed.
        if string.find(request.url, '://static%.') ~= nil or
           string.find(request.url, '%.png$') ~= nil then
            return
        end
        request:set_header('X-Crawlera-Cookies', 'disable')
        request:set_timeout(90.0)
        request:set_header('Referer', 'https://www.arrow.com')
        request:set_header(session_header, session_id)
        request:set_proxy{host, port, username=user, password=''}
    end)

    splash:on_response_headers(function (response)
        -- Reuse the session id that Crawlera returns in the response headers.
        if response.headers[session_header] ~= nil then
            session_id = response.headers[session_header]
        end
    end)
end

function main(splash)
    use_crawlera(splash)
    splash:go(splash.args.url)
    return {
        html = splash:html(),
        png = splash:png(),
    }
end
I render `https://www.arrow.com`, and after some time all I get back as a response is either a 504 Gateway Timeout or:
Any thoughts?
0 Votes
surge posted almost 7 years ago (Admin)
Again, that's in the Splash UI -- I've re-used your script with two changes: added a request:set_header('X-Crawlera-UA', 'pass') line under X-Crawlera-Cookies, and added splash:wait(30) before the return statement in the main function. Got several successful renders in a row.
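Concretely, the two changes applied to the script above look roughly like this:
-- inside use_crawlera's on_request handler, right under the X-Crawlera-Cookies line:
request:set_header('X-Crawlera-UA', 'pass')  -- tell Crawlera to pass the browser's User-Agent through

-- and in main, wait before capturing the render:
function main(splash)
    use_crawlera(splash)
    splash:go(splash.args.url)
    splash:wait(30)  -- give the site time to finish loading its resources
    return {
        html = splash:html(),
        png = splash:png(),
    }
end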
Check if you have CRAWLERA_ENABLED = True in the spider's settings, and flip it to False. This is required for the correct flow of requests when using Splash with Crawlera -- the latter should only be called from the Lua script itself.
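For reference, a minimal sketch of the spider side under that setup, assuming the scrapy-splash plugin is installed and its middlewares configured; the LUA_SOURCE placeholder stands for the script above, and the spider name is illustrative:
import scrapy
from scrapy_splash import SplashRequest

LUA_SOURCE = """ ...the use_crawlera script above... """  # placeholder

class ArrowSpider(scrapy.Spider):
    name = 'arrow'
    custom_settings = {
        'CRAWLERA_ENABLED': False,  # Crawlera is driven from the Lua script instead
    }

    def start_requests(self):
        yield SplashRequest(
            'https://www.arrow.com',
            callback=self.parse,
            endpoint='execute',  # run the Lua script instead of a plain render
            args={'lua_source': LUA_SOURCE, 'timeout': 300},
        )

    def parse(self, response):
        self.logger.info('rendered %d bytes', len(response.body))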
0 Votes
kamfor posted almost 7 years ago
Tried it in the Splash UI again, without luck.
By the way, what kind of Splash instance do you use? I bet you tried it with something other than Scrapinghub's small Splash instance, right?
0 Votes
surge posted almost 7 years ago (Admin)
Locally, with a docker image.
0 Votes