Anwar Ziani

Welcome to my 127.0.0.1

Automating Scraping with GitHub Actions

Published on 2019-10-02

A problem I often run into is having to check a status on a website that doesn’t offer an API, for example a visa application status or a product price on an e-commerce site.

Instead of manually checking the website, it would be nice to automate that and only get notified when the corresponding DOM value changes.

While looking for tools or libraries to help with this, I considered using an HTTP client like axios along with cheerio to get a jQuery-like syntax for querying the DOM. The problem is that many websites are single-page applications (SPAs) that rely on JavaScript to construct the DOM. The HTML you get back from a plain HTTP request has not been evaluated yet, so cheerio wouldn’t find the DOM node I want.

To get around this, I chose a headless browser, which behaves like a real browser and executes the page’s JavaScript. Popular tools for this are Selenium, PhantomJS, and Puppeteer. I went with Puppeteer since it has a friendly API and is hassle-free to set up.

I called this project DOMCheck. The idea is to have some Node.js scripts act as DOM checkers that notify me when a given value has changed. To do that, we need to:

  1. Maintain a history of those values to be able to check against them when scraping:
    I didn’t want to use a database for such a small need, so I decided to write the values to a CSV file. The file is loaded first to compare the scraped value against old values, and it is updated on every check.

  2. Provide a notification mechanism that lets me know when a value has changed:
    I didn’t want to hard-code one specific way of delivering notifications, since that may vary, so I just provided a callback function called notify. It receives the parameters needed to construct the notification:

    • name: the name of the checker,
    • value: the newly scraped value, if it differs from the recorded history,
    • error: an error if something went wrong since I need to make sure my script has been run.
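As a rough illustration of the two pieces above, here is a minimal sketch using only Node.js built-ins. It is not the actual DOMCheck implementation: the function names (`checkValue`, `readLastValue`, `toCsvField`) and the row layout (an ISO timestamp followed by a quoted value) are assumptions made for this example. It appends a row on every check and calls `notify(name, value, error)` only when the value differs from the last recorded one.

```javascript
const fs = require("fs");
const path = require("path");

// Quote a value as a single CSV field: collapse whitespace (so a row never
// spans lines) and double any embedded quotes.
const toCsvField = (text) =>
  `"${String(text).replace(/\s+/g, " ").trim().replace(/"/g, '""')}"`;

// Return the most recent value in the history file, or null if there is none.
function readLastValue(file) {
  if (!fs.existsSync(file)) return null;
  const rows = fs.readFileSync(file, "utf8").split("\n").filter(Boolean);
  if (rows.length === 0) return null;
  const last = rows[rows.length - 1];
  // Assumed row layout: ISO timestamp, comma, quoted value.
  const field = last.slice(last.indexOf(",") + 1);
  return field.slice(1, -1).replace(/""/g, '"');
}

// Append the scraped value to the history and call notify(name, value, error)
// when it differs from the last recorded value. Returns true if it changed.
function checkValue({ name, file, notify }, scraped) {
  const value = String(scraped).replace(/\s+/g, " ").trim();
  let changed;
  try {
    changed = readLastValue(file) !== value;
    fs.mkdirSync(path.dirname(file), { recursive: true });
    fs.appendFileSync(file, `${new Date().toISOString()},${toCsvField(value)}\n`);
  } catch (err) {
    // Report file-system problems through the same callback.
    notify(name, null, err.message);
    return false;
  }
  if (changed) notify(name, value, null);
  return changed;
}
```

With this layout, the very first run counts as a change, because there is no previous row to compare against.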

Combining all this logic, here’s an example of a checker that will track the top link in Hacker News:

// hackernews.checker.js

const domcheck = require("./domcheck");
const axios = require("axios");

domcheck({
  /**
   * name {string} [required] Name of the checker.
   */
  name: "hackernews",

  /**
   * url {string} [required] The URL of the website to scrape using Puppeteer.
   */
  url: "https://news.ycombinator.com/",

  /**
   * history {string} [optional] The path to the history file
   * that records DOM node values, this file is checked every time the code runs
   * to compare the scraped value against old values,
   * and notify in case it has changed.
   */
  history: "hackernews.csv",

  /**
   * historyDir {string} [optional] This defines where the `history` file
   * will be stored, by default it's `history`.
   */
  historyDir: "history",

  /**
   * waitForSelector {string} [required] DOM node selector to wait for
   * to load before scraping.
   */
  waitForSelector: ".itemlist tr:first-child .title a",

  /**
   * onDocument {function} [optional] Function that specifies how to get
   * the DOM data from the url, by default it will just query the text value
   * of the selector defined in `waitForSelector` using `document.querySelectorAll`.
   *
   * @param {string} selector This is the same as the `waitForSelector`.
   */
  onDocument: (selector) => {
    const nodeList = document.querySelectorAll(selector);
    return nodeList[0] && nodeList[0].innerText.trim();
  },

  /**
   * notify {function} [required] Function that defines
   * how you will get notified with the result,
   *
   * @param {string} name The name of this checker, same as the `name` property.
   * @param {string} value The new value of the DOM node.
   * @param {string} error The error message if any error happened.
   */
  notify: (name, value, error) => {
    const success = `✓ ${name} status updated to '${value}'`;
    const failure = `𝗑  Error running checker ${name}: '${error}'`;
    const message = error ? failure : success;
    console.log("Notification:", message);
  },
});

When hackernews.checker.js runs, it will scrape the url for the given selector defined in the config file, and it will record values in a CSV file defined by history.
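The scraping step itself could look roughly like the sketch below. This is an assumption about how such a step might be written, not the code inside DOMCheck: `scrapeValue` and `defaultOnDocument` are hypothetical names, and `page` is expected to be a Puppeteer Page object. The three calls used on it (`page.goto`, `page.waitForSelector`, `page.evaluate`) are part of Puppeteer's Page API.

```javascript
// Default extractor: text of the first node matching the selector.
// It is evaluated inside the page, so it can only use browser globals
// and the arguments passed to it.
const defaultOnDocument = (selector) => {
  const nodeList = document.querySelectorAll(selector);
  return nodeList[0] ? nodeList[0].innerText.trim() : null;
};

// Load the URL, wait until the selector is present, then run the extractor
// in the page context and return whatever it returns.
async function scrapeValue(
  page,
  { url, waitForSelector, onDocument = defaultOnDocument }
) {
  await page.goto(url, { waitUntil: "networkidle2" });
  await page.waitForSelector(waitForSelector, { timeout: 30000 });
  return page.evaluate(onDocument, waitForSelector);
}
```

With Puppeteer installed, `page` would typically come from `puppeteer.launch()` followed by `browser.newPage()`, and the browser should be closed with `browser.close()` once the value has been read. Taking the page as a parameter also makes the function easy to exercise with a stub object.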

These scripts have no big resource requirements in production, so a simple cron job to trigger them is ideal. I didn’t want to dedicate a whole VM to running a couple of scripts, so I figured I could use GitHub Actions. I had never used them before, so this would be a good learning experience as well.

GitHub Actions can run code on a schedule defined with cron syntax, which works perfectly in my case. I only need to write some YAML and we are good to go:

# .github/workflows/action.yml

name: "Domcheck GitHub Action"

on:
  schedule:
    - cron: "0 19 * * *" # 7:00pm UTC = 11am PST

jobs:
  bot:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@master

      - name: Setup Node
        uses: actions/setup-node@v1
        with:
          node-version: "12.13.x"

      - name: Install npm dependencies
        run: yarn install

      # Run your checkers at this step
      - name: Run Hackernews domcheck Script
        run: node hackernews.checker.js

      - name: Commit Files
        run: |
          git config --local user.email "[email protected]"
          git config --local user.name "GitHub Action"
          git add .
          git commit -m "Set history files"
      - name: Push Changes
        uses: ad-m/github-push-action@master
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}

One thing to notice in the action is that every time the script runs, it updates the history files, which are then pushed automatically thanks to github-push-action. This makes GitHub not only host our code but also run it for us. Thank you GitHub!

Github Action Bot

In the hackernews.checker.js example above, I only logged the change to the console and didn’t do anything useful with it. In my case I use Telegram and IFTTT. It turns out IFTTT lets you connect an HTTP webhook to Telegram, so that when you send an HTTP request to that webhook, the IFTTT bot messages you on Telegram.

All I need to make this happen is to change the notify function to this:

notify: (name, value, error) => {
  const success = `✓ ${name} status updated to '${value}'`;
  const failure = `𝗑  Error running checker ${name}: '${error}'`;
  const message = error ? failure : success;

  const iftttWebhook = (event, key) =>
    `https://maker.ifttt.com/trigger/${event}/with/key/${key}`;
  const url = new URL(iftttWebhook("_EVENT_", "_KEY_"));
  url.searchParams.append("value1", message);

  return axios.get(url.toString());
},

_EVENT_ and _KEY_ are placeholders that you need to fill in with the values you get when you set up the IFTTT webhook.
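One way to avoid committing those values to the repository is to read them from environment variables. The sketch below is only an illustration under that assumption: the variable names `IFTTT_EVENT` and `IFTTT_KEY` and both helper functions are hypothetical, and only the URL shape comes from the snippet above.

```javascript
// Build the IFTTT webhook URL for a message. The event and key are
// percent-encoded so unexpected characters cannot alter the path.
function buildIftttUrl(event, key, message) {
  const url = new URL(
    `https://maker.ifttt.com/trigger/${encodeURIComponent(event)}` +
      `/with/key/${encodeURIComponent(key)}`
  );
  url.searchParams.append("value1", message);
  return url.toString();
}

// Read the event and key from the environment (hypothetical variable names)
// and fail early if either is missing.
function iftttUrlFromEnv(message, env = process.env) {
  const { IFTTT_EVENT, IFTTT_KEY } = env;
  if (!IFTTT_EVENT || !IFTTT_KEY) {
    throw new Error("IFTTT_EVENT and IFTTT_KEY must be set");
  }
  return buildIftttUrl(IFTTT_EVENT, IFTTT_KEY, message);
}
```

Inside notify, the request would then become `axios.get(iftttUrlFromEnv(message))`. In a GitHub Actions workflow, the two variables could be supplied through a step's `env:` block from repository secrets.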

The code is on GitHub at zianwar/domcheck-github-action. Feel free to fork the repo and play with it. Any feedback is welcome!