A post on using request interception in Puppeteer to scrape heavily-obfuscated websites
Since a good number of our customers use our serverless platform to more easily deploy and scale their web bots and scrapers, I thought I'd write up a fun scraping challenge I encountered. Solving it required thinking a little outside the box, and it demonstrates a fairly reusable approach to scraping heavily-obfuscated sites. This post will dive into how you can use request interception in Puppeteer to beat sites that are built to be resistant to scraping.
NOTE: This post is meant to be about the technical content, not about actually scraping the given site. For that reason, the site is redacted throughout this post. We don't advocate scraping any individual site; always use your best judgment when engaging in any scraping project.
Note: Feel free to skip this section if you don’t care about the background for why the scraping was necessary.
Recently, I was working on a project to calculate the optimal route for spinning Pokestops in the mobile game Pokemon Go (a topic for another post). For those unfamiliar with Pokemon Go, you can obtain in-game items by physically visiting "Pokestop" locations in the real world and "spinning" them in the app.
Since the entire game hinges on having enough resources (Pokeballs, eggs, berries, etc.), spinning as many Pokestops as possible is a shared goal for basically all players.
Pokestops are designed to encourage players to constantly be on the move instead of standing around. For example, once you spin a Pokestop you can't spin it again for five minutes, so you always want to be heading to the next-nearest stop. Even if you wait around to re-spin the same stop, you are only awarded one fifth of the experience points for doing so (50 XP instead of the 250 XP for a new stop). Suffice it to say, to get the most bang for your buck you should be minimizing the time spent walking between Pokestops and maximizing the number of unique Pokestops spun.
Given that routing problems are something computers have gotten pretty good at, I figured that if I had the GPS coordinates of all the Pokestops near me, I could calculate the best route that hits all of them in the shortest distance possible. However, this is easier said than done, as Niantic (the creator of the game) is extremely aggressive about banning scrapers from the game.
Risking a ban was not something I was interested in doing. Luckily, there are community efforts to build an effective map of all Pokestops in the game. PokestopMap (the codeword we'll use to reference the site) is one such website, which allows players to mark and confirm Pokestops that they see in the app. This community-built map offers a great data source for where to find Pokestops.
In order to get the data for the routing portion of the project, we need to scrape this website for the GPS coordinates of all of these stops. How hard could that be?
The first thing I looked at was simply automating the HTTP requests made by the webpage to the backend API. Examining the XMLHttpRequests made by the page, the following request/response seemed to be what I was looking for:
Code: https://gist.github.com/mandatoryprogrammer/c7248325318c15a140d14b881643029b.js
This seems easy enough, but taking a closer look at the response JSON we notice something troubling:
Code: https://gist.github.com/mandatoryprogrammer/96a3e7df38ef4985feb28086c848c615.js
The fields which should contain the GPS coordinates appear to be obfuscated. The keys are seemingly random strings and the values are base64-encoded floats which do not match up with the coordinates of the points on the map. The “realrand” field also gives us a hint that this response data is intentionally obfuscated to prevent scraping. Bummer.
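To make the shape of the problem concrete, here's a hypothetical reconstruction of that response (these keys and values are made up; the real ones are redacted):

```javascript
// Hypothetical reconstruction of the obfuscated response shape -- NOT the
// site's actual keys or values:
const response = {
  realrand: "af83kd0q1",
  kd92jq0xm: "MzcuNzc0OTI5",     // base64 of "37.774929"
  qq01mma7z: "LTEyMi40MTk0MTY=", // base64 of "-122.419416"
};

// Decoding the base64 is trivial in Node...
const decoded = Buffer.from(response.kd92jq0xm, "base64").toString("utf8");
console.log(parseFloat(decoded)); // 37.774929

// ...but the decoded floats deliberately don't line up with the markers on
// the map, so decoding alone gets us nowhere.
```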
Often when encountering obfuscation, the first instinct is to go down the rabbit hole that is reverse-engineering the app's client-side JavaScript. From my experience, this should be an absolute last resort for a couple of reasons: it's extremely time-consuming, and the site owners can change the obfuscation at any time, instantly invalidating all of that work.
That's far too much work for us; let's work smart instead of hard (OK, reverse engineering requires plenty of smarts, but regardless…). We'll do this by tackling the problem at a different layer: the browser.
When working on a scraping project, you often have to choose between doing raw HTTP requests and driving a full web browser.
The pros of going with raw HTTP requests are generally speed and efficiency: there's no heavyweight browser to spin up, each request is cheap, and it's easy to run many scrapes in parallel.
The pros of going with headless browser scraping are generally fidelity and stealth: the browser executes all of the client-side JavaScript for you, and your traffic looks like that of a regular user, making it much harder to fingerprint as a bot.*
*To be fair, fingerprinting a browser to figure out if it’s a bot is also entirely possible (there are even more data points to fingerprint if it’s a full browser vs a low-level HTTP client). However, this is less common and ultimately emulating the “real case” is almost always going to be the most stealthy method possible.
Generally speaking, the more complicated/obfuscated the web app, the quicker I’ll just reach for utilizing a full web browser.
Doing a quick assessment of the site, I considered a couple of potential routes for browser-level scraping, but none of the obvious ones panned out.
All things being equal, the developers of PokestopMap did a really good job making their site hard to scrape. All of the low-hanging fruit seems to be well trimmed; looks like we'll have to get creative!
Taking a closer look at the page HTML, I noticed that while their main web application JavaScript is heavily obfuscated, the JavaScript mapping library they use, Leaflet.js, was not (although it was minified):
Code: https://gist.github.com/mandatoryprogrammer/0615880dada2dc682f6b6d0c0dbaffc8.js
Leaflet.js is an open-source JavaScript library for making mobile-friendly interactive maps. It provides an easy API for adding map markers and other map-related functionality.
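For reference, plain Leaflet usage looks something like this (the coordinates below are just example values):

```javascript
// Standard Leaflet: create a map, add a tile layer, then add a marker.
const map = L.map('map').setView([37.7749, -122.4194], 13);

L.tileLayer('https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png').addTo(map);

// Every point drawn on the map passes through the L.Marker class, which is
// exactly where we'll look for the un-obfuscated coordinates.
L.marker([37.7749, -122.4194]).addTo(map);
```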
PokestopMap clearly uses Leaflet to draw its interactive Pokestop map. This is good, since it means that the data we're trying to extract must be handed over to Leaflet in un-obfuscated form at some point. A quick search through the API documentation turns up the function for adding markers: L.Marker.
Perfect, so if we hook the L.Marker function we’ll be able to intercept all of the metadata used to create the map markers. This will have the GPS coordinates we need to scrape (and likely extra metadata as well).
Popping open the developer tools and doing some quick searches for "marker" confirms this. After setting a breakpoint on the minified function, we can see the data we're looking for.
Great! But it’s one thing to see the data in a breakpoint in the Chrome developer console and another to automate scraping it. How can we use this to automate scraping all of the coordinates for our area?
For those unfamiliar with Puppeteer, it's an awesome JavaScript library for controlling and automating the Chrome web browser. Every time I read through the Puppeteer API docs, it seems they've added some new API or way to control the browser. When it comes to doing full browser scraping, it's hard to recommend any other library for the job.
One of Puppeteer's underrated APIs is page.setRequestInterception. It allows you to intercept HTTP requests made by the browser and modify both the request and the response data. This is something you can't do even in Chrome extensions, since the Chrome extension API for working with web requests is brutally limited.
By using this API, we can replace the Leaflet.js library with a version that is modified to do a little extra. Of course, by "extra" I mean record all of the data passed to the marker-creation calls. To avoid having to build the Leaflet.js project from source, I just downloaded the existing minified leaflet.js file and changed the initialize function we breakpointed earlier to the following:
Code: https://gist.github.com/mandatoryprogrammer/98bed159798fe56fa39189d99e78be5d.js
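The gist above has the actual patched file; conceptually, the edit to the Marker class's initialize function looks something like this sketch (names in the real minified source will differ):

```javascript
// Inside the minified leaflet.js, Marker's initialize is edited to record
// every marker's data before doing its normal work:
initialize: function (latlng, options) {
  // Our addition: stash the coordinates and metadata in a global so the
  // Puppeteer script controlling the page can read them out later.
  window.dumpedPoints = window.dumpedPoints || [];
  window.dumpedPoints.push({ latlng: latlng, options: options });

  // ...the original initialize logic continues unchanged...
  L.Util.setOptions(this, options);
  this._latlng = L.latLng(latlng);
}
```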
Now, using the page.setRequestInterception API, we can swap in our modified JavaScript library when Chrome loads the webpage. The following code snippet demonstrates this:
Code: https://gist.github.com/mandatoryprogrammer/09355d32f29da8ca251429cd98dacc51.js
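Here's a sketch of what that swap can look like, assuming the instrumented library has been saved locally as modified-leaflet.js (a hypothetical filename):

```javascript
const fs = require('fs');
const puppeteer = require('puppeteer');

const modifiedLeaflet = fs.readFileSync('./modified-leaflet.js', 'utf8');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Turn on interception, then decide per-request what to do.
  await page.setRequestInterception(true);
  page.on('request', (interceptedRequest) => {
    if (interceptedRequest.url().includes('leaflet.js')) {
      // Serve our instrumented copy instead of the real library.
      interceptedRequest.respond({
        status: 200,
        contentType: 'application/javascript; charset=utf-8',
        body: modifiedLeaflet,
      });
      return;
    }
    // Everything else loads normally.
    interceptedRequest.continue();
  });

  await page.goto('https://[REDACTED]/', { waitUntil: 'networkidle0' });
})();
```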
Now when our Chrome browser loads the website, the Leaflet.js script will be replaced with our modified version, which dumps the GPS coordinates and other metadata into the global variable window.dumpedPoints. We can then just do a page.evaluate() to get the results:
Code: https://gist.github.com/mandatoryprogrammer/16fe7c0c634edcc0e31e42db91724682.js
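With the same page object as above, that step is essentially a one-liner:

```javascript
// Pull the accumulated marker data back out of the page context.
const points = await page.evaluate(() => window.dumpedPoints);
console.log(`Scraped ${points.length} markers.`);
```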
We’ve now scraped all of the data we’re looking for. All that’s required to get the GPS points for Pokestops in our area is to launch our script with some seed GPS coordinates as parameters and we’ll receive an array of GPS points of all nearby Pokestops.
For those interested in seeing the final code for this, check out this Refinery project.
In addition to swapping out scripts, the page.setRequestInterception API is also useful for speeding up browser-level scraping. Using it, you can intercept requests for assets and third-party resources and skip loading them, either by calling interceptedRequest.abort() to return a network error, or by returning a response containing the asset data yourself.
The following code snippet demonstrates using the API to skip loading all PNG and JPG images for a given page:
Code: https://gist.github.com/mandatoryprogrammer/3d0aa7b26f379ed4b424ecd9f274e820.js
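A minimal version of that filter (in the same spirit as the gist) looks like this:

```javascript
// Abort any request whose URL ends in .png or .jpg; let the rest through.
await page.setRequestInterception(true);
page.on('request', (interceptedRequest) => {
  const url = interceptedRequest.url();
  if (url.endsWith('.png') || url.endsWith('.jpg')) {
    // Fail the image request immediately with a network error.
    interceptedRequest.abort();
  } else {
    interceptedRequest.continue();
  }
});
```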
Since most of a page's load time is generally spent pulling external assets, this can be a great way to speed up and increase the stability of your browser-level scraping. You can use this to implement an effective "whitelist" of the requests that are necessary for your scraping and make all non-necessary requests load instantly.
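One way to build that whitelist (allowedSubstrings below is a hypothetical list you'd tailor to your target):

```javascript
// Only requests matching the whitelist are actually fetched; everything
// else "completes" instantly with an empty response.
const allowedSubstrings = ['[REDACTED]', 'leaflet.js']; // hypothetical entries

await page.setRequestInterception(true);
page.on('request', (interceptedRequest) => {
  const url = interceptedRequest.url();
  if (allowedSubstrings.some((s) => url.includes(s))) {
    interceptedRequest.continue();
  } else {
    interceptedRequest.respond({ status: 200, body: '' });
  }
});
```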
----
By Matthew Bryant (@IAmMandatory)
CEO @ Refinery.io
Interested in using Refinery? Sign up now and get a $5 credit for your first month of usage!