Google sheets authenticator for Jupyterhub

Back in November I was again involved in running the programming course for the Casimir graduate school of the universities of Delft and Leiden. In addition to the usual tweaks to the material in response to previous year’s feedback, we also wanted to tweak the setup of our programming environment.

The course is taught in Python and we provide a Jupyter-based environment for our learners for the duration of the course by running our own deployment of JupyterHub. We’ve found that it’s very effective in getting everyone up and running as quickly as possible, as everyone has the same environment and it’s super easy to push updates to the course materials (though that’s more due to the fact that we use Docker).

When we ran the course in 2016 we were still relative noobs when it came to Jupyterhub deployments, but after a year of experience setting up around 10 different Jupyterhubs (with the help of our ever-evolving Ansible role!) we were starting to get the hang of things. One thing in particular that we wanted to streamline was the signup process.

When signing up for the course people give their Github username (which we use in the Git portion of the course). This means that we can use the OAuthenticator module. However, we still need to whitelist the usernames of participants, otherwise we’d be letting anyone with a Github account access to our environment!

We had a few options as to how to do this. Last year we just manually added the names to a whitelist file, but this is not optimal because the file is only read when the hub starts, meaning that any people who sign up late need to be added manually (or we’d have to bounce the hub just to update the whitelist). In addition we wanted to be able to give people access to the hub as soon as they signed up, so they could have time to get used to it and work through some of the preliminary material if they wanted. Manually adding people just wasn’t going to cut it. Another possibility was to make all the participants request access to a Github organization (which we would set up specifically for the course) and use the new “group whitelisting” functionality of OAuthenticator to whitelist everyone in that organization. This was not ideal either, as we would need to manually accept each participant’s request to join the organization, and the whole point was to avoid O(N_participants) effort!

The solution that we came up with with was pretty hacky, but actually ended up working perfectly for us. Learners would sign up using a google form that we had prepared and the submitted form data is magically added to a google docs spreadsheet set up for the purpose. Our idea was to “share” the google sheet via a web link, which we could then fetch from within out whitelisting logic. While this might seem insanely insecure (it seems like we’re making private data public by sharing using the web link), it’s actually not that bad. The URLs that google docs generates contain a random string of 20 or so alphanumeric characters that’s probably got as much entropy as a reasonable passphrase (sounds like a good topic for a future blog post!). It goes without saying that we only hit this URL using HTTPS and don’t ever share it around in non-secure channels.

The following 50(ish) line snippet is the whole thing! (also available as a gist).

import csv
import subprocess

from tornado import gen, AsyncHTTPClient


@gen.coroutine
def get_whitelist(sheets_url, usernames_field):
    # Get CSV from sheet
    client = AsyncHTTPClient()
    resp = yield client.fetch(sheets_url)
    raw_csv = resp.body.decode('utf-8', 'replace').split('\n')

    reader = csv.reader(raw_csv)

    # Extract column index of usernames
    headers = next(reader)
    try:
        username_column = headers.index(usernames_field)
    except ValueError:
        raise ValueError('header field "{}" not found in sheet {}'
                         .format(usernames_field, sheets_url))

    usernames = [row[username_column] for row in reader]
    return usernames


class SheetWhitelister:

    sheets_url = 'https://docs.google.com/spreadsheets/d/xxxxxx'
    usernames_column = 'Github username'

    @gen.coroutine
    def check_whitelist(self, username):
        if super().check_whitelist(username):
            return True
        try:
            whitelist = yield get_whitelist(self.sheets_url,
                                            self.usernames_column)
            self.log.info('Retrieved users from spreadsheet: {}'
                          .format(whitelist))
            self.whitelist.update(whitelist)
        except Exception:
            self.log.error('Failed to fetch usernames from spreadsheet',
                           exc_info=True)
        return (username in self.whitelist)

The above defines a mixin class, SheetWhitelister, that we can use with an existing Jupyterhub authenticator to “plug in” the custom whitelisting logic. To actually use it in the Jupyterhub config we’d need to combine it with an existing authenticator (e.g. Github), as below:

from oauthenticator.github import GithubOAuthenticator

class GithubWithSheets(SheetWhitelister, GithubOAuthenticator):
    pass

c.JupyterHub.authenticator_class = GithubWithSheets

I’m really not a fan of the mixin class pattern because you always need to make these boilerplate classes that combine all the required functionality, and combining these behaviours at runtime it’s more cumbersome. Give me a nice functional strategy pattern any day! But hey, it works so I can’t complain, and hopefully somebody on the internet will find this useful.