
Development diaries -- scraping the Toronto lobbyist registry


The Toronto lobbyist registry has been a thorn in my side for as long as I've been privileged to be a developer.

For four years, I've irregularly bashed my head against the wall trying to scrape this ancient, monster Java Server Page. The application's markup lacks basic modern web development best practices: there are no URLs for specific registry entries, anchor tags execute JavaScript instead of navigating to a new page, and most of the markup doesn't even have distinguishing classes or IDs to select.

It's unfortunate, because the registry is also an incredible tool to help understand the unseen, unelected forces at work in Toronto lobbying politicians and city staff behind the scenes. So, in the name of opening up Toronto's data, I've finally buckled down and started building a scraper that could eventually power a larger web application.

How it works

I'm using Puppeteer, a headless browser API run in Node. It has some excellent functionality to reduce flaky and unpredictable results -- because it's built upon Chromium, you can wait for a network idle event to ensure that a page has fully navigated and finished loading before running any kind of assertions or HTML querying. Very beneficial for an ancient single-page application!
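As a rough sketch, waiting for network idle looks something like this (the registry URL and function names here are my own placeholders, not anything official):

```javascript
// Minimal sketch: open the registry and wait for the network to go idle.
// REGISTRY_URL is an assumption, not a confirmed endpoint.
const REGISTRY_URL = 'http://lobbyistsearch.toronto.ca/';

async function openRegistry() {
  // Required lazily so the sketch's constants can be inspected without Chromium.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  // 'networkidle0' resolves once there have been no network connections
  // for 500 ms -- i.e. the JSP app has finished its background requests.
  await page.goto(REGISTRY_URL, { waitUntil: 'networkidle0' });
  return { browser, page };
}
```

That one `waitUntil` option does most of the anti-flakiness work: no arbitrary `setTimeout` sleeps, just "tell me when the page has actually settled."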

Now, there are some specific challenges with Puppeteer that I quickly discovered. For example, best practice is to collect a list of URLs from anchor tags, then navigate to those links in a new page. Not possible when your anchor tags execute JavaScript!



So, traditional strategies for scraping wouldn't work. Instead, it was time to get ✨creative✨.

Since all anchor tags execute JavaScript to navigate between pages, it became pretty clear that we needed to execute JavaScript in the browser environment ourselves. Lucky for us, Puppeteer lets us do this with a fun little method on the browser page called evaluate -- you can execute any ol' JavaScript inside the page, and pass in any data you need.

So, this small but mighty function jump-started my scraper. Instead of trying to click links and navigate to pages, I could hoover up the data passed to the doSubmit function, then navigate to those pages by calling doSubmit myself and returning with the browser's back functionality.
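Roughly, the workaround looks like this. doSubmit is the registry's own navigation function; the argument format I'm parsing here is an assumption for illustration, not the registry's actual signature:

```javascript
// Pull the arguments out of an href like "javascript:doSubmit('12345','view')".
// The quoted-argument format is an assumption about the markup.
function parseDoSubmitArgs(jsHref) {
  const match = jsHref.match(/doSubmit\(([^)]*)\)/);
  if (!match) return [];
  return match[1]
    .split(',')
    .map((arg) => arg.trim().replace(/^['"]|['"]$/g, ''))
    .filter((arg) => arg.length > 0);
}

// Call the page's own doSubmit instead of clicking the anchor,
// then come back to the results list with the browser's back button.
async function visitEntry(page, args) {
  await Promise.all([
    page.waitForNavigation({ waitUntil: 'networkidle0' }),
    page.evaluate((a) => window.doSubmit(...a), args),
  ]);
  // ...scrape the entry page here...
  await page.goBack({ waitUntil: 'networkidle0' });
}
```

Pairing waitForNavigation with the evaluate call in a Promise.all is the usual Puppeteer trick for script-triggered navigation: start listening for the navigation before you trigger it.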

Where before I was struggling to even navigate between pages, I can now dynamically navigate between n pages, collect data from those pages using HTML selectors, and spit it into memory. Next steps...

The future

My ideal roadmap for this kind of application:

1) Get the data -- we're close! Just need to save the data!
2) Fill a database with scraped data
3) Make an API off the scraped data
4) Build a web app powered by the API

Easy, right? 😂

Learn more

Want to see where I'm at? Take a look at the repo here -- it's still a work in progress, but if you're interested in seeing how I build things, it's a great example of how I try to get things working first, then get them working right. It's a little bit messy -- still -- but it's getting to a place where I can be proud of it.