The Toronto lobbyist registry has been a thorn in my side for as long as I've been privileged to be a developer.
It's unfortunate, because the registry is also an incredible tool for understanding the unseen, unelected forces at work in Toronto, lobbying politicians and city staff behind the scenes. So, in the name of opening up Toronto's data, I've finally buckled down and started building a scraper that could eventually power a larger web application.
How it works
I'm using Puppeteer, a headless browser API that runs in Node. It has some excellent functionality for reducing flaky, unpredictable results -- because it's built on Chromium, you can wait for a network-idle event to ensure a page has fully navigated and finished loading before running any assertions or HTML queries. Very beneficial for an ancient single-page application!
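Here's a minimal sketch of that wait-for-network-idle pattern. The function name and launch options are my own assumptions for illustration, not code from the scraper:

```javascript
// Sketch: render a page in headless Chromium and wait for the network
// to go idle before reading the DOM. (Function name and options are
// illustrative assumptions, not the actual scraper code.)
async function fetchRenderedHtml(url) {
  // Lazy require so this file can be loaded without Puppeteer installed.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // 'networkidle0' resolves once there have been no in-flight network
  // connections for at least 500 ms -- i.e. the SPA has settled.
  await page.goto(url, { waitUntil: 'networkidle0' });
  const html = await page.content();
  await browser.close();
  return html;
}
```

The same `waitUntil` option also applies to click-triggered navigations via `page.waitForNavigation()`.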
The registry, though, doesn't navigate through ordinary links -- pages are reached via JavaScript form submissions -- so traditional scraping strategies (fetch the HTML, follow the hrefs) wouldn't work. Instead, it was time to get ✨creative✨.
So, this small but mighty function jump-started my scraper. Instead of trying to click links to navigate between pages, I could hoover up the data embedded in the doSubmit calls, then navigate to those pages by invoking doSubmit directly, in combination with the browser's back functionality.
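To make that concrete, here's a hypothetical helper for the first half of that dance: pulling the arguments out of the `doSubmit(...)` calls embedded in the page markup. The attribute shape it parses (`href="javascript:doSubmit('V','12345')"`) is an assumption for illustration -- the registry's real arguments may differ:

```javascript
// Hypothetical sketch: the registry's "links" call a page-level
// doSubmit() instead of carrying real hrefs, so we scrape the call
// arguments out of the raw HTML with a regex. The argument shape
// ('V', '12345') is assumed for illustration.
function extractDoSubmitArgs(html) {
  const calls = [];
  // Match doSubmit(...) and capture the raw argument list.
  const re = /doSubmit\(([^)]*)\)/g;
  let match;
  while ((match = re.exec(html)) !== null) {
    const args = match[1]
      .split(',')
      .map((a) => a.trim().replace(/^['"]|['"]$/g, ''))
      .filter((a) => a.length > 0);
    calls.push(args);
  }
  return calls;
}
```

Each argument list can then be replayed inside the page with something like `await page.evaluate((args) => doSubmit(...args), args)`, followed by `await page.goBack()` to return to the listing.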
Where before I was struggling even to navigate between pages, I can now dynamically move through any number of pages, collect data from them using HTML selectors, and spit it all into memory.
My ideal roadmap for this kind of application:
1) Get the data -- we're close! Just need to save the data!
2) Fill a database with scraped data
3) Make an API off the scraped data
4) Build a web app powered by the API
Easy, right? 😂
Want to see where I'm at? Take a look at the repo here -- it's still a work in progress, but if you're interested in how I build things, it's a great example of how I try to get things working first, then get them working right. It's still a little messy, but it's getting to a place where I can be proud of it.