
Having Trouble With A Strange Request Pattern

I'm writing a Scrapy spider for a set of websites that all seem to use the same interaction/control pattern. The problem is that while it makes perfect sense what the sites are doing, and why they're doing it that way, I can't figure out an even remotely elegant or reliable way to spider it. It goes like this:


1) A user's browser starts a "search session" by making a [GET] request to the page where the search form lives

2) The page host responds with an HTML page that contains obfuscated, encrypted JavaScript and a decryptor object

3) The user's browser runs the embedded JS, which decrypts, loads, and executes the real payload

4) The loaded JS makes a [POST] request to the same URL the [GET] request was made to

5) The host responds with the actual form data, which is then inserted into the DOM
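In HTTP terms, the decrypted JS simply replays a request to the same endpoint with a different method. A minimal sketch of the two requests (the URL, body, and cookie value are all hypothetical placeholders):

```python
from urllib.request import Request

SEARCH_URL = "https://example.com/search"  # hypothetical form page

# Step 1: the browser's initial [GET] to the page hosting the search form
get_req = Request(SEARCH_URL, method="GET")

# Step 4: the decrypted JS [POST]s to the *same* URL, attaching the
# cookie it generated client-side (placeholder value here)
post_req = Request(
    SEARCH_URL,
    data=b"",  # whatever body the JS sends; unknown without decrypting it
    method="POST",
    headers={"Cookie": "phantom=VALUE"},
)
```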


All of that is fairly standard (obviously), but enter now the problems:


1) The decrypted JS object makes its [POST] request with a Cookie header whose value is never passed back via Set-Cookie and never actually set in the browser itself. This makes it irritating to "catch".

2) The host absolutely will not respond favorably unless this "phantom cookie" value is included in the headers, and it must be present both in the initial request that actually sets the "search values" (i.e. the submission of the search form) and in every subsequent session-specific request (like pagination)

3) The host's response to the [POST] request does include more than one Set-Cookie directive, all of which are also required to be included in all subsequent requests

4) All subsequent requests appear to work the same way regardless of method: only some (if any) of the actual page data is provided in the host's initial response; the rest is loaded by a self-decrypting, obfuscated JS object (which also appears to include the initial "phantom cookie" value)
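Mechanically, problems 2 and 3 mean every session request needs one merged cookie set: the phantom value plus everything the host's Set-Cookie headers delivered. A stdlib sketch of that merge (all cookie names and values here are invented):

```python
from http.cookies import SimpleCookie

def build_cookie_jar(set_cookie_headers, phantom_name, phantom_value):
    """Fold the host's Set-Cookie headers plus the JS-generated
    "phantom" value into one name -> value dict for later requests."""
    jar = {phantom_name: phantom_value}
    for header in set_cookie_headers:
        parsed = SimpleCookie()
        parsed.load(header)
        for name, morsel in parsed.items():
            jar[name] = morsel.value
    return jar

# Made-up example values:
jar = build_cookie_jar(
    ["sessid=s1; Path=/", "token=t2; HttpOnly"],
    "phantom", "abc123",
)
```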


I've got the spider working (via scrapy-splash) to the point that it will correctly load the search page and catch three of the four cookie values required to correctly "submit" the form. The issue I'm running into is finding a not-terrible way to:


1) "Catch" the phantom cookie value so I can add it to the spider's cookiejar

2) Pass all the necessary cookie values to Splash, and also get the additional cookie values created by subsequent responses back from Splash
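For item 2, the round-trip scrapy-splash documents for session handling is to seed Splash's cookie store from `splash.args.cookies` and return `splash:get_cookies()` so the cookies come back with the response. A sketch along those lines (the Lua calls are documented Splash API; the merging helper and names are mine):

```python
# Lua script for Splash's /execute endpoint: seed Splash with the
# spider's cookies, load the page, hand every accumulated cookie back.
LUA_SOURCE = """
function main(splash)
    splash:init_cookies(splash.args.cookies)
    assert(splash:go(splash.args.url))
    splash:wait(1.0)
    return {
        html = splash:html(),
        cookies = splash:get_cookies(),
    }
end
"""

def absorb_splash_cookies(jar, splash_cookies):
    """Merge the list of cookie dicts splash:get_cookies() returns
    back into the spider's name -> value jar."""
    for cookie in splash_cookies:
        jar[cookie["name"]] = cookie["value"]
    return jar

jar = {"phantom": "abc123"}
absorb_splash_cookies(jar, [{"name": "sessid", "value": "s1"}])
```

With `SplashCookiesMiddleware` enabled, a `SplashRequest(url, endpoint="execute", args={"lua_source": LUA_SOURCE})` should pick up the `cookies` field of the returned table automatically; the helper above is only needed if you manage the jar by hand.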


Can anyone help shed some light on this for me?


I'm a hair's breadth away from trying to see if it's possible to snag the phantom cookie, "populate" the form, submit it, and hand back the response along with its set cookies and the phantom one, all in Lua. That feels like the wrong idea, though, even if it might work.
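For what it's worth, the pure-Lua route would likely hinge on `splash:on_request`, which lets the script inspect outgoing requests and so sniff the phantom value off the JS's own [POST]. A rough sketch, assuming the phantom rides in the outgoing Cookie header (the wait time, cookie names, and the parsing helper are all hypothetical):

```python
# Splash Lua: watch outgoing requests for the JS-generated POST and
# capture the raw Cookie header it carries.
PHANTOM_CATCHER = """
function main(splash)
    local phantom_header = nil
    splash:on_request(function(request)
        if request.method == "POST" then
            phantom_header = request.headers["Cookie"]
        end
    end)
    assert(splash:go(splash.args.url))
    splash:wait(2.0)  -- let the decrypted JS fire its POST
    return { phantom = phantom_header, cookies = splash:get_cookies() }
end
"""

def extract_phantom(cookie_header, known_names):
    """Pick out the one pair we didn't set ourselves -- the phantom --
    from a raw Cookie header string."""
    for pair in cookie_header.split(";"):
        name, _, value = pair.strip().partition("=")
        if name not in known_names:
            return name, value
    return None
```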
