Since a good number of our customers use our serverless platform to more easily deploy and scale their web bots and scrapers, I thought I’d write about a fun scraping challenge I recently encountered. Solving it required thinking a little outside the box, and it demonstrates a fairly reusable approach to scraping heavily obfuscated sites. This post will dive into how you can use request interception in Puppeteer to beat sites that are built to be resistant to scraping.
NOTE: This post is about the technique, not about actually scraping the given site. For that reason the site is redacted throughout. We don't advocate scraping any individual site; always use your best judgement when engaging in any scraping project.
Background: The Problem
Note: Feel free to skip this section if you don’t care about the background for why the scraping was necessary.
Recently, I was working on a project to calculate the optimal route for spinning Pokestops in the mobile game Pokemon Go (a topic for another post). For those unfamiliar with Pokemon Go, you can obtain in-game items by physically visiting “Pokestop” locations in the real world and “spinning” them in the app.
Since the entire game hinges on having enough resources (Pokeballs, eggs, berries, etc.), spinning as many Pokestops as possible is a shared goal for basically all players.
Pokestops are designed to encourage players to constantly be on the move instead of standing around. For example, once you spin a Pokestop you can’t spin it again for five minutes, so you always want to be heading to the next-nearest stop. Even if you wait around to re-spin the same stop, you are only awarded one fifth of the experience points for doing so (50 XP instead of the 250 XP for a new stop). Suffice it to say, to get the most bang for your buck you should minimize the time spent walking between Pokestops and maximize the number of unique Pokestops spun.
Given that routing problems are something computers have gotten pretty good at, I figured that if I had the GPS coordinates of all the Pokestops near me, I could calculate the route that hits all of them in the shortest distance possible. However, this is easier said than done, as Niantic (the creators of the game) are extremely aggressive about banning scrapers from the game.
Risking a ban was not something I was interested in doing. Luckily, there are community efforts to build out an effective map of all Pokestops in the game. PokestopMap (the codeword we'll use to reference the site) is one such website; it allows players to mark and confirm Pokestops that they see in the app. This community-built map offers a great data source for where to find Pokestops:
In order to get the data for the routing portion of the project, we need to scrape this website for the GPS coordinates of all of these stops. How hard could that be?
Obfuscation, Obfuscation Everywhere!
The first thing I looked at was simply automating the HTTP requests made by the webpage to the backend API. Digging through the XMLHttpRequests made by the page, the following request/response seemed to be what I was looking for:
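Stripped down, the response body looked something like this (the keys and values below are made up purely for illustration; the real ones differ on every request):

```json
{
  "realrand": "f3a9c2",
  "kqzx81ha": "NDUuNTIzMDk3",
  "p0wml4yt": "LTEyMi42NzYz",
  "jc7rr2bn": "MzEuMjA0NQ=="
}
```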
The fields which should contain the GPS coordinates appear to be obfuscated. The keys are seemingly random strings and the values are base64-encoded floats which do not match up with the coordinates of the points on the map. The “realrand” field also gives us a hint that this response data is intentionally obfuscated to prevent scraping. Bummer.
The rabbit hole is for the rabbits; let’s keep it that way.
Reverse engineering the site’s de-obfuscation logic is far too much work for us; let’s work smart instead of hard (OK, reverse engineering requires plenty of smarts, but regardless…). We’ll do this by tackling the problem at a different layer: the browser.
Full Browser Scraping with Puppeteer vs Raw HTTP Requests
When working on a scraping project, you often have to choose between doing raw HTTP requests and driving a full web browser.
The pros of going with raw HTTP requests are generally:
Speed of scraping: Generally, scraping via raw HTTP requests is much faster because you can write a script that requests only the endpoints necessary to get the data you need, skipping all of the unnecessary steps (loading webpage resources, rendering DOM elements, etc.).
The pros of going with headless browser scraping are generally:
Stealth: A regular browser loading all of the resources of the page is often much more stealthy than just doing raw HTTP requests. When doing raw HTTP requests, are you being careful to ensure your headers are ordered like a browser? Are you making sure to request the same resources as the browser? Are your cookies handled correctly? Doing any of these things differently than a browser can lead to detection, and there are endless tricks to fingerprint your HTTP client*.
Visual debugging: Using a full web browser is often easier just because you get to see the full picture of what’s happening in the browser much more clearly. You also have Chrome’s wide variety of built-in developer tools, which are extremely helpful for scraping and debugging.
*To be fair, fingerprinting a browser to figure out if it’s a bot is also entirely possible (there are even more data points to fingerprint if it’s a full browser vs a low-level HTTP client). However, this is less common and ultimately emulating the “real case” is almost always going to be the most stealthy method possible.
Generally speaking, the more complicated/obfuscated the web app, the quicker I’ll just reach for utilizing a full web browser.
“What would be the most painful to change?”: Choosing a Stable Scraping Technique
Doing a quick assessment of the site, I weighed a couple of potential routes for scraping it:
Doing raw HTTP requests: Painful. The API requests and responses are obfuscated, as discussed earlier in the post. Additionally, detection and blocking were implemented for HTTP clients not doing proper request formatting.
Extracting data from the webpage DOM: Painful. The app appears to take pretty careful measures to ensure the GPS coordinates of all of the points are not inserted into the DOM.
All things considered, the developers of PokestopMap did a really good job making it hard to scrape their site. All of the low-hanging fruit seems to be well trimmed; looks like we’ll have to get creative!
PokestopMap clearly uses Leaflet to draw its interactive Pokestop map. This is good, since it means the data we’re trying to extract must be handed over to Leaflet in un-obfuscated form at some point. A quick search through the Leaflet API documentation turns up the function for adding markers:
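From the documented Leaflet API, marker creation looks roughly like this; `L.marker(latlng)` is the factory that constructs an `L.Marker` instance (the coordinates here are just placeholders):

```javascript
// Standard Leaflet usage: the real, un-obfuscated lat/lng has to be passed
// in here for the marker to land in the right spot on the map.
var map = L.map('map').setView([45.5231, -122.6765], 13);
L.marker([45.5231, -122.6765]).addTo(map);
```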
Perfect, so if we hook the L.Marker function we’ll be able to intercept all of the metadata used to create the map markers. This will have the GPS coordinates we need to scrape (and likely extra metadata as well).
Popping open the developer tools and doing some quick searches for “marker” confirms this. After setting a breakpoint in the minified function, we see the data we’re looking for:
Great! But it’s one thing to see the data in a breakpoint in the Chrome developer console and another to automate scraping it. How can we use this to automate scraping all of the coordinates for our area?
One of Puppeteer’s underrated APIs is page.setRequestInterception. It allows you to intercept HTTP requests made by the browser and modify both the request and the response data. This is something you can’t do even in Chrome extensions, since the Chrome extension API for working with web requests is brutally limited.
By using this API we can replace the Leaflet.js library with a version that is modified to do a little extra. Of course, by “extra” I mean record all of the data passed to the L.Marker calls. To avoid having to build the Leaflet.js project from source, I just downloaded the existing minified leaflet.js file and changed the initialize call we breakpointed earlier to the following:
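Shown un-minified for readability, the change amounts to something like this (in the real minified file the surrounding names are mangled, but the structure is the same):

```javascript
// L.Marker's constructor, with a few lines added: every latlng handed to a
// marker is also pushed into a global array our Puppeteer script reads later.
initialize: function (latlng, options) {
    L.Util.setOptions(this, options);
    this._latlng = L.latLng(latlng);

    // --- added for scraping ---
    window.dumpedPoints = window.dumpedPoints || [];
    window.dumpedPoints.push({
        lat: this._latlng.lat,
        lng: this._latlng.lng,
        options: options
    });
    // --------------------------
},
```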
Now, when our Chrome browser loads the website, the Leaflet.js script will be replaced with our modified version, which dumps the GPS coordinates and other metadata into the global variable window.dumpedPoints. We can then just do a page.evaluate() to pull out the results:
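A sketch of the Puppeteer side, assuming the patched file is saved as leaflet-patched.js (the file name, URL, and seed coordinates below are placeholders):

```javascript
const fs = require('fs');
const puppeteer = require('puppeteer');

(async () => {
  // Our patched copy of the minified leaflet.js (see the snippet above).
  const patchedLeaflet = fs.readFileSync('./leaflet-patched.js', 'utf8');

  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.setRequestInterception(true);
  page.on('request', (request) => {
    // Whenever the page requests leaflet.js, answer with our patched copy
    // instead of letting the request go out to the network.
    if (request.url().includes('leaflet.js')) {
      request.respond({
        status: 200,
        contentType: 'application/javascript',
        body: patchedLeaflet,
      });
    } else {
      request.continue();
    }
  });

  // Placeholder URL: load the map centered on the area we want to scrape.
  await page.goto('https://pokestopmap.example/#45.5231,-122.6765', {
    waitUntil: 'networkidle2',
  });

  // Read back everything our patched L.Marker initialize collected.
  const points = await page.evaluate(() => window.dumpedPoints);
  console.log(points);

  await browser.close();
})();
```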
We’ve now scraped all of the data we’re looking for. All that’s required to get the Pokestops in our area is to launch the script with some seed GPS coordinates as parameters, and we’ll receive back an array of GPS points for all nearby Pokestops.
For those interested in seeing the final code for this, check out this Refinery project.
Bonus Technique: Speed Up Browser Scraping By Stubbing Out Assets
In addition to being useful for scraping, the page.setRequestInterception API is also useful for speeding up browser-level scraping. Using this API you can intercept requests for assets and third-party resources and skip loading them, either by calling interceptedRequest.abort() to simply return a network error, or by calling interceptedRequest.respond() to serve the asset's data (or a stub) yourself without touching the network.
The following code snippet demonstrates using the API to skip loading all PNG and JPG images for a given page:
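A minimal sketch of what that looks like (the URL is a placeholder, and the exact filtering logic will vary with your target site):

```javascript
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.setRequestInterception(true);
  page.on('request', (request) => {
    // Abort anything that looks like a PNG or JPG so the page never spends
    // time downloading images we don't care about.
    if (/\.(png|jpe?g)(\?.*)?$/i.test(request.url())) {
      request.abort();
    } else {
      request.continue();
    }
  });

  await page.goto('https://example.com'); // placeholder URL
  // ...do the actual scraping here...

  await browser.close();
})();
```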
Since most of a page's load time is generally spent pulling external assets, this can be a great way to speed up and stabilize your browser-level scraping. You can use it to implement an effective "whitelist" of the requests your scraping actually needs and make all of the unnecessary requests resolve instantly.