Scrapey Scrapey

Recently I have been scraping a lot of data from the web. Scraping is an invaluable skill to have as a modern netizen. I use the term scraping to refer to the act of programmatically accessing published data.

Most normal people’s interaction with the web these days is probably through an app. And they probably never think about how the stuff™︎ they see ends up on their screen. There is such a wide variety of methods for 1s and 0s to get from Alice to Bob. Some kinds I’ve recently encountered:

  • JSON delivered through a web API to be read by a web app
  • XML delivered from a URL to be read by a feed reader
  • HTML delivered from a URL to be read by a web browser
  • VDF delivered from a command line shell to be read by a script

There are so many ways data gets shunted around, and once you learn how to find it, you can do so much:

  • Interrupt it to stop it from reaching its end destination
  • Redirect it to a place where it can be put to better use
  • Modify it into a form that serves your purposes more directly
  • Store it to keep a backup in case it disappears from the web
  • Use it in any way you please, because you are in charge

In this post I’m going to summarise some of the different personal projects I’ve used scraping in.

Making Feeds

Feeds are an excellent way of getting updates from people you follow on the web. You can get feeds delivered through an app, and maybe for some cases like social media the app might be the most enjoyable method. Places like Mastodon and Pixelfed give you access to reverse-chronological feeds, which I think is the best kind of algorithmic sorting and delivery. But what about the places that don’t offer a reasonable app or accessible feeds? How can you stay up to date without having to bookmark hundreds of URLs and check them every day, just in case something’s updated? Well, with scraping, you can make your own feeds.

I built a browser extension that, once a day, runs through a list of URLs, processes their outputs, and turns the results into a feed that can be loaded onto a localhost server and then read by my feed reader. Running it as an extension means it gets the browser context, like cookies, so I don’t need to faff around with reading them from the disk. Having access within the browser also means I’m less likely to trigger anti-scraping mechanisms; to the server, it just looks like I’ve opened a tab. Web extensions can also directly access DOM methods, which is easier than dealing with a third-party library when handling webpages.

Most sites will deliver data in one of two ways:

  1. Sites that deliver data embedded in an annoying webpage.
  2. Sites that deliver data for client-side processing, through something like XML or JSON.

When dealing with annoying feed-less webpages, what I do is download the entire page I’m interested in, then look for the common selector that describes what I’m after. For example, a local news site has news elements listed in a list that I can grab using .news-featured li. It’s even easier if the site offers a JSON API. I spot these by looking at the network inspector for a page; if it contacts one of these open APIs, I can get the data directly without having to download the whole page.
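
Roughly, both approaches look like this in Python (a sketch only: the URLs and JSON keys are placeholders, my real version runs inside the extension using DOM methods, and this assumes the requests and beautifulsoup4 packages):

    # A rough Python equivalent of what the extension does (URLs and JSON
    # keys are placeholders; requires requests and beautifulsoup4).
    import requests
    from bs4 import BeautifulSoup

    # Option 1: an annoying feed-less webpage - grab items by CSS selector.
    page = requests.get("https://example-local-news.example/")
    soup = BeautifulSoup(page.text, "html.parser")
    for item in soup.select(".news-featured li"):
        title = item.get_text(strip=True)
        link = item.find("a")["href"] if item.find("a") else None
        print(title, link)

    # Option 2: the page quietly calls an open JSON API - skip the page and
    # hit the endpoint directly (spotted in the network inspector).
    articles = requests.get("https://example-local-news.example/api/articles").json()
    for article in articles:
        print(article["title"], article["url"])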

Once I’ve got the rough data I want to get updates from, I need to find a title, update timestamp, and the content itself (or a decent summary), and then I can form a valid XML feed document. I put this somewhere my feed reader can find it (like a localhost web server), and voila! Easy to read feeds, and just the updates I care about.
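
That last step, turning the scraped bits into a feed document, can be as small as this standard-library sketch (the titles, links, and output path are placeholders):

    # Turn scraped entries into a minimal RSS 2.0 document that a feed
    # reader can poll from a localhost web server (all values are placeholders).
    import xml.etree.ElementTree as ET
    from datetime import datetime, timezone
    from email.utils import format_datetime

    entries = [
        {"title": "Example update", "link": "https://example.com/post",
         "updated": datetime.now(timezone.utc), "summary": "Scraped summary text."},
    ]

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "My scraped feed"
    ET.SubElement(channel, "link").text = "http://localhost:8000/feed.xml"
    ET.SubElement(channel, "description").text = "Updates I actually care about"

    for entry in entries:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = entry["title"]
        ET.SubElement(item, "link").text = entry["link"]
        ET.SubElement(item, "pubDate").text = format_datetime(entry["updated"])
        ET.SubElement(item, "description").text = entry["summary"]

    ET.ElementTree(rss).write("feed.xml", encoding="utf-8", xml_declaration=True)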

Additionally, blog platforms like WordPress often have feed URLs that can be used directly, but they might be hidden. Guessing the right URL is often enough; it can be as simple as setting the URL to /feed/.
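
A quick way to automate the guessing, sketched below; nothing here is WordPress-specific, it just tries a handful of common conventions until one of them looks like XML:

    # Try a few common feed paths and return the first one that responds
    # with something feed-shaped (the candidate paths are just conventions).
    import requests

    CANDIDATES = ["/feed/", "/rss", "/rss.xml", "/atom.xml", "/index.xml"]

    def find_feed(base_url):
        for path in CANDIDATES:
            try:
                resp = requests.get(base_url.rstrip("/") + path, timeout=10)
            except requests.RequestException:
                continue
            if resp.ok and ("xml" in resp.headers.get("Content-Type", "")
                            or resp.text.lstrip().startswith("<?xml")):
                return resp.url
        return None

    print(find_feed("https://example-blog.example"))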

Avoiding Annoying Sites

Sometimes there are sites you occasionally want to check, but where it doesn’t make sense to have a daily updating feed. If you’re lucky, the people behind that site will have made it fast and one-click easy to use. If you’re most people, the site will probably forget who you are every time, require clicking through a bunch of annoying popups, and be slow.

Take my local government’s recycling calendar page. I need to visit it every month or so to get the days when kerbside collections for our Paper, Packaging, Compostable, and Landfill bins will take place. If I use the website, there are so many annoying nags:

  • I don’t care about cookies
  • It demands I type my postcode, then select my address from a drop-down
  • It forgets who I am every time

This is annoying, slow, and difficult to use. But by looking at the page with the browser’s inspector, I can see that it uses an open JSON API. I created a shell script that runs through these steps automatically: it gets a free API token, enters my house’s unique address point, and prints out a nice list of upcoming collection dates.
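
The real thing is a shell script, but the flow looks roughly like this in Python; every URL, parameter, and field name below is hypothetical, because the real ones are specific to my council’s API:

    # Hypothetical sketch of the bin-collection script: every URL and field
    # name is made up; the real ones come from the site's network inspector.
    import requests

    BASE = "https://example-council.example/api"
    ADDRESS_POINT = "000000000000"  # my house's unique address point ID (placeholder)

    token = requests.get(f"{BASE}/token").json()["token"]
    collections = requests.get(
        f"{BASE}/collections",
        params={"addressPoint": ADDRESS_POINT},
        headers={"Authorization": f"Bearer {token}"},
    ).json()

    for c in collections:
        print(c["date"], c["binType"])  # e.g. a date plus Paper / Landfill / etc.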

So much easier than having to log in to the website, and with a tool like Termux, I can even get it quickly on my phone.

I also use youtube-dl and yt-dlp a lot to download media from webpages where accessing it can be tricky. There’s nothing worse than trying to listen to a podcast only to find it wants you to jump through hoops when it should be as easy as tapping an mp3 file to start playing it. I’ve helped contribute to these projects to ensure that this media, especially when it’s from sources funded by public money, remains accessible to all.
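
yt-dlp can also be driven from Python rather than the command line, which is handy when a download needs to sit inside a bigger script. A minimal sketch (the URL is a placeholder, and ffmpeg needs to be installed for the mp3 step):

    # Download the audio from a page and convert it to mp3 using yt-dlp's
    # Python API (requires yt-dlp and ffmpeg; the URL is a placeholder).
    from yt_dlp import YoutubeDL

    opts = {
        "format": "bestaudio/best",
        "postprocessors": [{
            "key": "FFmpegExtractAudio",
            "preferredcodec": "mp3",
        }],
    }

    with YoutubeDL(opts) as ydl:
        ydl.download(["https://example.com/some-podcast-episode"])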

Reading Blog Posts

When I’m reading, I find it easier to read very long articles on my e-reader device. I don’t mind reading short articles on an LED or LCD screen, but for longer pieces the e-ink display makes for a much more traditional, paper-like experience. Knowing that epub files are basically just HTML in a zip folder, I often scrape copies for myself. This also has the handy benefit of acting as a backup in case the original article goes missing and I want to read it again.
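
You can see the “HTML in a zip” part for yourself with a couple of lines of Python (the filename is a placeholder, and the contents vary from book to book):

    # Peek inside an epub: it's just a zip archive full of (X)HTML, CSS and images.
    import zipfile

    with zipfile.ZipFile("some-book.epub") as book:
        for name in book.namelist():
            print(name)  # typically things like content.opf, chapter01.xhtml, styles.css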

The process for most articles is simple enough. I almost hesitate to call it “scraping” because it’s as simple as pressing Ctrl+S and saving as an HTML file.

From here I usually run it through a program called Calibre. Calibre has a configurable environment that lets me do much of the same thing my feed-scraper extension does – get rid of all the fluff and nonsense that might be sitting around the article, ensuring that it will display properly on my ereader. Most pages would render fine on an ereader without this, but because some pages do funny things with scrolling and margins, I find it a necessary extra step.
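
This step can be scripted too: Calibre ships a command-line tool called ebook-convert, so a saved article can become an epub without opening the GUI. A minimal sketch, with placeholder filenames:

    # Convert a saved article to epub using Calibre's ebook-convert CLI
    # (filenames are placeholders; Calibre must be installed and on the PATH).
    import subprocess

    subprocess.run(
        ["ebook-convert", "saved-article.html", "saved-article.epub"],
        check=True,
    )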

Calibre also has a lot of plugins that can be useful in different cases. If you read a lot of fan fiction, for example, you might want to keep an up-to-date collection of ebooks. But keeping up to date can be hard, and not all online indie publishing sites offer auto-generated ebooks. The FanFicFare plugin for Calibre does an excellent job of automating the downloading and processing of works into ebooks. Another useful plugin is EpubMerge: say you’ve downloaded a bunch of tutorials from a webpage and want to merge them into one, this plugin speeds that process up and keeps things neat and tidy within the Calibre library.

Learning how to make personal ebook copies of online published works is a very empowering experience – it brings the feel of owning a book to browsing the web.

Building Datasets

With a lot of my online life these days being spread across so many different services, it’s difficult to keep track of the things I care about in one central place. Some things, like my music library and my ebooks, I can keep on disk, and apps like VLC or Calibre allow easy access. But what about all the movies that I stream, and how do I track the ones that I watch on TV or in a cinema? Ditto video games, which might exist across any number of platforms.

There are some sites that you can sign up to that will let you aggregate lists and share reviews, but what if I just want something local and offline that I can use for my own benefit? All I really want is a spreadsheet that I can fill out, with a few columns for metadata and my thoughts, in case some day I want to come back and find something to reminisce about. The digital equivalent of disc cases on a shelf.

It’s easy enough to build a spreadsheet that I can put all the films I’ve watched or games I’ve played into, but trickier if I want it to be a usable source of truth for everything. When adding single new entries, it’s easy enough to add the metadata right then and there, but I can’t do that when I have thousands of past entries. Most services offer some kind of data export, and for those that don’t, scraping your profile page is a quick way to get the data you need.

Of course, those only give you a basic summary. What if you want some extra stuff, like genres, release dates, banner images? Again, scraping is your friend. Over a few days I got all of the raw data into a CSV file, and then set about finding ways to bulk out the metadata.

  • Sources like GOG.com have data in a neat little box on the side of each game page; grabbing this can be easily automated.
  • When looking at films, Amazon just embeds all the relevant JSON for a film right in an entry’s IMDb HTML page, so there’s no need to pay Amazon for an API key (see the sketch after this list).
  • Almost all of Steam’s data is available via the Steam command line, though it is formatted in a really annoying proprietary VDF format. It’s line-separated nested text though, so it could be worse.
  • When some data isn’t available, other sources like SteamDB have also aggregated data that can be queried; common tag IDs, for example, are all listed there.
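
As an illustration of the IMDb case mentioned above: the metadata sits in a script tag of type application/ld+json, so pulling it out looks roughly like this (the title URL is a placeholder, and the exact fields available are an assumption on my part):

    # Pull the JSON embedded in an IMDb title page (the URL is a placeholder;
    # a browser-like User-Agent tends to be needed to get the full page).
    import json
    import requests
    from bs4 import BeautifulSoup

    page = requests.get(
        "https://www.imdb.com/title/tt0000000/",
        headers={"User-Agent": "Mozilla/5.0"},
    )
    soup = BeautifulSoup(page.text, "html.parser")
    blob = soup.find("script", type="application/ld+json")
    data = json.loads(blob.string)

    # Field names follow schema.org conventions; treat them as assumptions.
    print(data.get("name"), data.get("datePublished"), data.get("genre"))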

Get Scraping

In doing all my scraping I’ve learned more than a few things:

  • How to properly format things. CSV files are in theory very flexible, but in practice you want to wrap strings in quotes, especially when libraries like Python’s csv get upset by lone apostrophes (see the sketch after this list).
  • Picking up someone else’s old language parser and fixing it up to get it working on new data is a challenge, but a very nice feeling when it works.
  • How to spot data. Sometimes you want to get some data and it can be frustrating knowing it’s right there, just out of reach. The more practised I become with scraping, the fewer the times when I feel tech is working against me. Unfortunately it has the side effect of making the times when tech doesn’t work right all the more noticeable
  • Gaining an appreciation for what I actually want to read. I can filter and focus on exactly how I want to spend my time on the web, rather than being at the mercy of an algorithm’s (or an annoying editor’s) decision
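
As a small example of that first point: Python’s csv module can be told to quote every field, which sidesteps most of the grief (the output filename is a placeholder):

    # Write rows with every field quoted, so commas, apostrophes and quotes
    # inside titles can't break the column structure.
    import csv

    rows = [
        ("Baldur's Gate", 1998, "RPG"),
        ("Papers, Please", 2013, "Puzzle"),
    ]

    with open("games.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)
        writer.writerow(("title", "year", "genre"))
        writer.writerows(rows)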

If you’ve never done it before, I would highly recommend doing some scraping. Even if you are not technically minded, try pressing Ctrl+S and saving a document. Open it up and have a look inside. Hell, just right click > inspect a webpage every now and then.

The stuff™︎ on the websites and apps you use is not abstract and intangible. If you want it, you can take it. Happy Scraping!
