GitHub analytics with Mathematica
Introduction
Why Mathematica and not Python? Well, for starters, there are already tons of examples in Python, so adding one more to the pile wouldn’t make much of a difference. Plus, although I do program in Python, I don’t enjoy it as much as I enjoy Mathematica. Finally, in my opinion, Jupyter notebooks are nowhere near as polished as Mathematica’s.
REST API
REST API stands for “Representational State Transfer Application Programming Interface”. In simple terms, it’s a set of agreed rules on how to retrieve data when you connect to a specific URL. To make a REST API call, you need to know the following ingredients of such a request:
- The endpoint, which is basically the URL you send the request to. For example, the root endpoint of GitHub’s API is https://api.github.com.
- The path, which determines the specific resource you are asking for. For example, in the URL https://api.github.com/user/repos, the path is /user/repos, which expresses our intention to have the user’s repositories returned. When the documentation shows an expression like /repos/:owner/:repo/, the owner and repo are variables that you need to replace with actual values. E.g., write /repos/ekamperi/rteval if you are interested in the repository named rteval of the user ekamperi.
- Query parameters. Sometimes a request is accompanied by a list of parameters that modify it. The list begins with a question mark “?”, and the parameter=value pairs are separated by ampersands “&”. E.g., in /repos/ekamperi/rteval/commits?per_page=100&sha=master, we inform the server that we want up to 100 commits per page to be returned, and we want the listing of commits to start from the HEAD of the master branch.
- The method defines the kind of request that we are submitting to the server. It may be one of GET, POST, PUT, PATCH, or DELETE. Together they cover the following operations: Create, Read, Update, and Delete (the so-called CRUD). In short, GET performs the READ operation (we ask the server to send us back some data). POST performs the CREATE operation (we ask the server to create a new resource). PUT and PATCH perform an UPDATE operation, and DELETE, well, you know what DELETE does.
- The headers are used to exchange metadata between client and server. For example, they are used to perform authentication by injecting some authorization token into the HTTP header.
- The data (or body) carries the information that the client sends to the server, and it is used with the POST, PUT, PATCH, and DELETE methods.
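To see how these ingredients map to Mathematica, here is a minimal sketch that assembles (but does not send) a request for the commits of the rteval repository. The Accept header is optional and only included to illustrate headers; the variable names are mine.

```mathematica
(* Path variables :owner and :repo are filled in with a string template. *)
pathTemplate = StringTemplate["/repos/`owner`/`repo`/commits"];
path = pathTemplate[<|"owner" -> "ekamperi", "repo" -> "rteval"|>];

(* Endpoint + path, method, query parameters and headers in one object. *)
request = HTTPRequest[
   "https://api.github.com" <> path,
   <|
    "Method" -> "GET",
    "Query" -> {"per_page" -> "100", "sha" -> "master"},
    "Headers" -> {"Accept" -> "application/vnd.github+json"}
    |>];
```

Nothing goes over the wire until the request object is handed to URLRead[request].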
Authentication
To experiment with GitHub’s REST API, we need to authenticate to the service. User-to-server requests are rate-limited at 5,000 requests per hour per authenticated user, whereas unauthenticated requests are limited to 60 requests per hour per originating IP. So, for any serious experimentation, authentication is a must. The best way to proceed is to create a personal access token (PAT), which serves as an alternative to a password when using the GitHub API or the command line. With curl, for example, you pass the token as an extra HTTP header via the “-H” flag, e.g. curl -H "Authorization: token YOUR_TOKEN" https://api.github.com/user, where YOUR_TOKEN stands for your own token.
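The same header can be attached from within Mathematica. The sketch below assumes that the token has been exported as an environment variable named GITHUB_TOKEN (my choice of name), and authRequest is a hypothetical helper, not a built-in function.

```mathematica
ClearAll[authRequest];
(* Builds a GET request that carries the token in the Authorization header. *)
authRequest[url_String, token_String] := HTTPRequest[url,
  <|"Method" -> "GET",
   "Headers" -> {"Authorization" -> "token " <> token}|>]

(* Environment returns $Failed when the variable is not set. *)
token = Environment["GITHUB_TOKEN"];
If[StringQ[token],
 URLRead[authRequest["https://api.github.com/user", token]]["StatusCode"],
 "GITHUB_TOKEN is not set"]
```

Keeping the token out of the notebook itself makes it harder to leak it by accident when sharing the notebook.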
A simple example of a REST API call
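As a first, hedged example, we can ask for the public metadata of a repository. I am assuming here the rteval repository mentioned earlier; since it is public, no token is needed for this particular call.

```mathematica
(* GET is the default method, so the URL alone is enough. *)
response = URLRead[HTTPRequest["https://api.github.com/repos/ekamperi/rteval"]]
```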
URLRead[] sends the request, and Mathematica responds with an HTTPResponse object.
We can request the properties of the response object returned by URLRead[] and then print the value of some property:
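A short sketch of how this might look, again assuming the public rteval repository as the target:

```mathematica
response = URLRead[HTTPRequest["https://api.github.com/repos/ekamperi/rteval"]];

(* The names of all properties that the response object exposes. *)
response["Properties"]

(* The value of one of them; 200 signals success. *)
response["StatusCode"]
```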
We extract the data from the HTTP message body (the bytes transmitted immediately after the HTTP headers), import them as JSON, and list the associated keys:
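One way to do this is sketched below. The helper name bodyKeys is hypothetical; I import with "RawJSON" so that JSON objects come back as associations.

```mathematica
ClearAll[bodyKeys];
(* Decodes the JSON body of a response and returns its top-level keys. *)
bodyKeys[resp_HTTPResponse] := Keys[ImportString[resp["Body"], "RawJSON"]]
```

For instance, bodyKeys[URLRead[HTTPRequest["https://api.github.com/repos/ekamperi/rteval"]]] would list the fields that GitHub reports for that repository.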
More involved examples
How to get the weekly commit count
We will issue a GET /repos/:owner/:repo/stats/participation request, which returns the weekly commit counts for the last 52 weeks, both for the repository owner and for everyone combined (“all”, which includes the owner). The arrays are ordered from the oldest week (index 0) to the most recent week.
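A minimal sketch of such a call follows. The function names are mine, and the request is unauthenticated, which assumes a public repository. Note that GitHub may answer with status 202 while it is still computing the statistics, in which case the sketch returns a Failure object and the call should simply be repeated a bit later.

```mathematica
ClearAll[participation, plotParticipation];

(* Returns <|"all" -> {...}, "owner" -> {...}|>, oldest week first. *)
participation[owner_String, repo_String] := Module[{url, resp},
  url = "https://api.github.com/repos/" <> owner <> "/" <> repo <>
    "/stats/participation";
  resp = URLRead[HTTPRequest[url]];
  If[resp["StatusCode"] === 200,
   ImportString[resp["Body"], "RawJSON"],
   Failure["GitHubRequestFailed", <|"StatusCode" -> resp["StatusCode"]|>]]
  ]

(* Plots both weekly series on the same axes. *)
plotParticipation[stats_Association] := ListLinePlot[
  {stats["all"], stats["owner"]},
  PlotLegends -> {"all", "owner"},
  AxesLabel -> {"week", "commits"}]
```

Usage would be along the lines of plotParticipation[participation["ekamperi", "rteval"]].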
How to get the list of repositories
In order to get the list of repositories, we send a request to the https://api.github.com/user/repos endpoint.
However, we need to add our personal access token to the list of headers that will be sent to the server.
The header that we send must be of the form “Authorization: token YOUR_TOKEN”, where YOUR_TOKEN stands for the personal access token.
We send a request to the url, read back the response, interpret the body message as JSON and then display the results:
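These steps can be sketched as follows. Both authHeaders and listRepos are hypothetical helper names, and for brevity only the first page of results (up to 100 repositories) is fetched.

```mathematica
ClearAll[authHeaders, listRepos];
authHeaders[token_String] := {"Authorization" -> "token " <> token}

(* Returns the "owner/name" strings of the repositories visible to the token. *)
listRepos[token_String] := Module[{resp, repos},
  resp = URLRead[HTTPRequest["https://api.github.com/user/repos",
     <|"Method" -> "GET",
      "Headers" -> authHeaders[token],
      "Query" -> {"per_page" -> "100"}|>]];
  repos = ImportString[resp["Body"], "RawJSON"];
  Lookup[repos, "full_name"]
  ]
```

Calling listRepos[token] // Column then displays one repository per line.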
How to get the size of all repositories broken down by language
We start by creating a function that talks to the /repos/:owner/:repo/languages path. Same as before, we pass our personal access token to the header of the request:
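A possible shape for such a function is shown below; the name repoLanguages and its argument order are my own choices.

```mathematica
ClearAll[repoLanguages];
(* Returns an association of the form <|language -> bytes of code, ...|>. *)
repoLanguages[owner_String, repo_String, token_String] := Module[{url, resp},
  url = "https://api.github.com/repos/" <> owner <> "/" <> repo <> "/languages";
  resp = URLRead[HTTPRequest[url,
     <|"Method" -> "GET",
      "Headers" -> {"Authorization" -> "token " <> token}|>]];
  ImportString[resp["Body"], "RawJSON"]
  ]
```

It would be called as repoLanguages["ekamperi", "rteval", token].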
Let’s test what data the server returns:
So, the repository named rteval of the user ekamperi contains 6440911 bytes of Python, 28787 bytes of R, 1800 bytes of CSS and 1096 bytes of MATLAB code. Let’s collect the data for all languages:
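One way to collect the data is to map the per-repository function over the list of “owner/name” strings. In the sketch below, collectLanguages is a hypothetical helper that receives the fetching function as an argument, so it works with any function of the form fetch[owner, repo].

```mathematica
ClearAll[collectLanguages];
(* fetch[owner, repo] must return an association language -> bytes. *)
collectLanguages[fetch_, repos_List] := AssociationMap[
  Function[fullName, fetch @@ StringSplit[fullName, "/"]],
  repos]
```

The result is an association whose keys are the repository names and whose values are the per-language byte counts.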
Now we’d like to calculate the aggregate data:
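Merge with Total does the job here. In the sketch below, the first entry uses the numbers reported above for rteval, while the second repository and its numbers are made up purely for illustration.

```mathematica
perRepo = <|
   "ekamperi/rteval" -> <|"Python" -> 6440911, "R" -> 28787,
     "CSS" -> 1800, "MATLAB" -> 1096|>,
   (* hypothetical second repository, invented for this example *)
   "someuser/somerepo" -> <|"Python" -> 1000, "C" -> 5000|>
   |>;

(* Sum the bytes per language across repositories, largest first. *)
totals = Reverse[Sort[Merge[Values[perRepo], Total]]]
```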
And then plot the results:
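A bar chart is one reasonable choice. The sketch below uses the rteval numbers quoted earlier as stand-in data and a logarithmic scale, since the byte counts span several orders of magnitude.

```mathematica
totals = <|"Python" -> 6440911, "R" -> 28787, "CSS" -> 1800, "MATLAB" -> 1096|>;

BarChart[Values[totals],
 ChartLabels -> Keys[totals],
 ScalingFunctions -> "Log",
 AxesLabel -> {None, "bytes"}]
```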
How to get the dates of the commits in a repository
First, we create a function that, given a commit’s SHA, returns a list of (commit, date) tuples.
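A sketch of such a function follows; all three names are hypothetical. Dates are converted to absolute times (seconds), which is enough for sorting and for taking differences. The commit date is read from the commit → author → date field of each entry in the response.

```mathematica
ClearAll[isoToAbsoluteTime, parseCommits, fetchCommits];

(* GitHub reports dates as ISO 8601 strings such as "2020-05-17T09:30:00Z". *)
isoToAbsoluteTime[s_String] := AbsoluteTime[
  {s, {"Year", "-", "Month", "-", "Day", "T",
    "Hour", ":", "Minute", ":", "Second", "Z"}}]

(* Turns the decoded JSON array into a list of {sha, time} pairs. *)
parseCommits[json_List] := Map[
  {#["sha"], isoToAbsoluteTime[#["commit"]["author"]["date"]]} &, json]

(* Up to 100 commits, newest first, starting at the given SHA or branch. *)
fetchCommits[owner_String, repo_String, sha_String, token_String] :=
 Module[{resp},
  resp = URLRead[HTTPRequest[
     "https://api.github.com/repos/" <> owner <> "/" <> repo <> "/commits",
     <|"Method" -> "GET",
      "Query" -> {"per_page" -> "100", "sha" -> sha},
      "Headers" -> {"Authorization" -> "token " <> token}|>]];
  parseCommits[ImportString[resp["Body"], "RawJSON"]]
  ]
```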
We then apply the function above repeatedly (via FixedPointList) and accumulate the results:
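The idea, as I would sketch it, is to use the SHA of the last commit of each page as the starting point of the next page. Once we reach the very first commit of the repository, the page contains only that commit, the SHA no longer changes, and FixedPointList stops. The helper allCommits is hypothetical and takes the page-fetching function as an argument; it assumes that every page is non-empty.

```mathematica
ClearAll[allCommits];
(* fetchPage[sha] must return {sha, date} pairs, newest first, starting at sha. *)
allCommits[fetchPage_, startSha_String] := Module[{pages},
  pages = Reap[
     FixedPointList[
      Function[sha,
       With[{page = fetchPage[sha]},
        Sow[page];          (* accumulate the page *)
        Last[page][[1]]]],  (* SHA from which the next page starts *)
      startSha]
     ][[2, 1]];
  (* Consecutive pages overlap by one commit, so drop the duplicates. *)
  DeleteDuplicates[Join @@ pages]
  ]
```

With a function like fetchCommits[owner, repo, sha, token], the call would look like allCommits[fetchCommits["ekamperi", "rteval", #, token] &, "master"].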
We sort the commits by their date, take the differences between consecutive dates, and plot the results:
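A minimal sketch with made-up data: each pair below stands for a commit and its time in seconds, so the values are invented for illustration and are not real commits.

```mathematica
(* Made-up {sha, absolute time in seconds} pairs standing in for real commits. *)
commits = {{"c3", 300000}, {"c1", 0}, {"c2", 86400}};

(* Oldest commit first. *)
sorted = SortBy[commits, Last];

(* Time elapsed between consecutive commits, converted from seconds to days. *)
gapsInDays = Differences[sorted[[All, 2]]]/86400.;

ListPlot[gapsInDays, Filling -> Axis,
 AxesLabel -> {"commit", "days since previous commit"}]
```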
GraphQL
GraphQL is a data query and manipulation language for APIs. It was initially developed by Facebook for internal use and was later released to the public. GraphQL provides an approach to developing web APIs that is similar to REST, yet different in that it allows clients to describe the structure of the data they require. Other features include a type system, a query language, and type introspection. In GraphQL, there is only one endpoint, here https://api.github.com/graphql. The user submits a query, wrapped in a JSON payload, that describes exactly what data the server should return. We can experiment with GraphiQL, a graphical user interface for submitting GraphQL requests and getting back the answers. For instance, to get the currently authenticated user, we need to issue the following query: query { viewer { login } }.
Should you want to do the same thing programmatically, you’d have to escape the double quotes (") inside the request string, i.e., write \"query\": \"query { viewer { login } }\".
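In Mathematica, one way to sidestep the manual escaping is to let ExportString produce the JSON payload. The sketch below is an assumption-laden example: graphqlRequest and viewerLogin are hypothetical helpers, and the token is expected to be passed in by the caller.

```mathematica
ClearAll[graphqlRequest, viewerLogin];

(* Wraps a GraphQL query in a JSON body and builds the POST request. *)
graphqlRequest[query_String, token_String] := HTTPRequest[
  "https://api.github.com/graphql",
  <|"Method" -> "POST",
   "ContentType" -> "application/json",
   "Headers" -> {"Authorization" -> "bearer " <> token},
   "Body" -> ExportString[<|"query" -> query|>, "RawJSON", "Compact" -> True]|>]

(* Sends the query and extracts data -> viewer -> login from the answer. *)
viewerLogin[token_String] := Module[{resp},
  resp = URLRead[graphqlRequest["query { viewer { login } }", token]];
  ImportString[resp["Body"], "RawJSON"]["data"]["viewer"]["login"]
  ]
```

Calling viewerLogin[token] should then return the login name of the token’s owner.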