
Not working against Incapsula sites

Hello, when we make a request to an Incapsula-protected site, for example:


curl -U <API_KEY> -x proxy.crawlera.com:8010 https://www.whoscored.com/ -k

We are being blocked and we can't get the content:


<html>
<head>
<META NAME="robots" CONTENT="noindex,nofollow">
<script src="/_Incapsula_Resource?SWJIYLWA=2977d8d74f63d7f8fedbea018b7a1d05">
</script>
<script>
(function() {
var z="";var bfor (var i=0;i<b.length;i+=2){z=z+parseInt(b.substring(i, i+2), 16)+",";}z = z.substring(0,z.length-1); eval(eval('String.fromCharCode('+z+')'));})();
</script></head>
<body>
<iframe style="display:none;visibility:hidden;" src="//content.incapsula.com/jsTest.html" id="gaIframe"></iframe>



Is there something that can be done in the command to bypass this protection?


Thanks and regards


Best Answer

I added the ban rule, so you will no longer get these false 200 responses. To actually get past Incapsula's JavaScript check, though, you will need a JavaScript-rendering service such as Splash (http://splash.readthedocs.io/en/stable/).
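
For readers who have not used Splash before, here is a minimal sketch of how a request to its render.html endpoint could be built. It assumes a Splash instance running locally on its default port 8050; the helper name build_splash_render_url and the 2-second wait are illustrative choices, not values from this thread. Whether the rendered page actually passes the Incapsula check is not guaranteed.

```python
from urllib.parse import urlencode


def build_splash_render_url(target_url, splash_base="http://localhost:8050", wait=2.0):
    """Build a URL for Splash's render.html endpoint (hypothetical helper).

    splash_base and wait are assumptions: a local Splash instance on its
    default port, and a short pause so the page's scripts can run.
    """
    query = urlencode({"url": target_url, "wait": wait})
    return f"{splash_base}/render.html?{query}"


# The target site is the one from the original question.
print(build_splash_render_url("https://www.whoscored.com/"))
# prints http://localhost:8050/render.html?url=https%3A%2F%2Fwww.whoscored.com%2F&wait=2.0
```

Fetching the resulting URL with any HTTP client returns the HTML as rendered by Splash after JavaScript has run. Routing Splash's own traffic through Crawlera needs additional configuration that is not shown here.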


Could you post more of the body than just the script, or the entire HTML? Preferably as an attachment or on https://pastebin.com/.

It might be possible to add this as a ban rule. Also, please tell us the HTTP status code of this response.
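
Until such a rule is in place, the same check can be approximated on the client side. The sketch below flags a 200 response whose body contains the markers visible in the blocked page quoted above. The function name is_false_200 and the choice of markers are assumptions for illustration; this is not how Crawlera's ban rules are implemented internally.

```python
# Markers copied from the blocked response quoted in this thread.
INCAPSULA_MARKERS = ("_Incapsula_Resource", "content.incapsula.com")


def is_false_200(status_code, body):
    """Return True when a 200 response is really the Incapsula challenge page.

    Hypothetical helper: it only does a substring match on the body and
    ignores responses that already carry a non-200 status.
    """
    if status_code != 200:
        return False
    return any(marker in body for marker in INCAPSULA_MARKERS)


blocked = '<script src="/_Incapsula_Resource?SWJIYLWA=2977d8d74f63d7f8fedbea018b7a1d05"></script>'
print(is_false_200(200, blocked))                           # prints True
print(is_false_200(200, "<html><body>Scores</body></html>"))  # prints False
```

A caller could retry or discard any response for which this returns True instead of treating it as page content.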

Sorry, I forgot to copy the closing body tag; apart from that, this is the whole response we get. Here is the pastebin with the source code: https://pastebin.com/nUL9AD40 and here are the headers: https://pastebin.com/rrfkinfa


It seems the page runs some JavaScript verification, and our request fails it.

