Computational lab notebooks using git and git-annex
Hi! I'm Ryan Moore, NBA fan & PhD candidate in Eric Wommack's viral ecology lab @ UD. Follow me on Twitter!
Disclaimer: if you need a lab notebook for legal records, copyright, patent rights, or anything like that, then this article probably isn’t for you. This post is not providing any recommendations for those cases.
Too long; didn’t read: Check out the cln app on GitHub. It helps you manage a computational lab notebook using git and git-annex. You can find the documentation here.
Overview
Keeping a good lab notebook for your computational work is important, but it can be challenging. A quick Google search will show you lots of examples of people talking about it:
- Ten Simple Rules for a Computational Biologist’s Laboratory Notebook
- Notebook & Data Management
- Lab Notebooks for Computational Science
- How to Keep a Lab Notebook for Bioinformatic Analyses
- Keeping a good lab notebook in a computational field?
I have tried a lot of different methods, but they all more or less boil down to a workflow sort of like this:
- Write down some summary of what I’m about to do and why.
- Run some commands, programs, or bash stuff.
- Copy what I did into a document. (e.g., Markdown notes files, TiddlyWiki, etc.)
- Write a bit more about what happened.
- Rinse and repeat.
Then, depending on my needs, I may clean up the analysis and put it into an R Markdown or Jupyter notebook so it will be easier to reproduce later.
One problem with this general workflow is that it requires tracking a lot of things manually (e.g., copying and pasting). Whenever you do a lot of that, you will inevitably forget to paste a command into your notebook. You might make a mistake or typo when running a command, and rather than noting it down in your notebook, you just rerun it and pretty soon your lab notebook is out of sync with the commands that you have actually run. Another issue is that you may be running a bunch of commands quickly, just testing some ideas out. When doing this, you end up needing to track a ton of things in an ad-hoc manner leading to a messy lab notebook that you need to come back to later and reorganize.
In other words, you need to manually track a lot of information, and it can be quite a challenge to keep track of everything!
Provenance tracking
One approach to dealing with this problem is by tracking the provenance of files. An example of this is how QIIME 2 includes metadata in their artifact files (.qza files) to track things that were done in an analysis.
I like the idea of provenance tracking, but even if you do use QIIME, there are a lot of things you need to do outside of QIIME that will need tracking. While not quite the same, this sort of provenance tracking reminds me a bit of using git or other version control software. Git is software used to track changes in a set of files, and is often used by programmers during software development.
Note: If you have never used git before, the official docs have a lot of info that may be of use to you. I have also written a small git tutorial that you may find useful!
While I had used git while working on software, I had never tried using it to manage a computational lab notebook. One reason is that it doesn’t handle large files well. For computational work, whether bioinformatics or data science, you will be dealing with a lot of large files. Sequencing files easily get over 10 GB in size, so using git alone is going to be problematic. However, there are extensions to git like Git Large File Storage and git-annex that help to address this problem. (Essentially, git-annex tracks symbolic links in the git repository rather than the file itself. There is a lot more to it than that, so you can check out the git-annex walkthrough if you want to know more.)
A git-based lab notebook
Note: I’m not the first one to think of using git to help manage a computational lab notebook. In fact, you can find some interesting discussion on whether version control is even useful for lab notebooks here, here, and here.
Using git and git-annex, I figured that I could get a pretty decent workflow going for my computational lab notebook. After playing around with it for a while (and seeing that git-annex was a good solution to git’s large file problem), I settled into a pretty familiar workflow:
- Run a program, script, whatever.
- Track any new files or changes with git.
- Commit the changes.
- Repeat.
One key difference from my “typical” workflow is that instead of putting the commands that I ran and their explanations into some external document like a markdown file, I would put all the information into the commit message. That way, all the info about how and why I did something would be tracked in the git repository along with the actual files and changes.
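Here is a minimal sketch of that workflow using plain git only. The throwaway repository, the placeholder identity, and the wording of the commit message are all assumptions made for the demo, not a prescribed format.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repository for the demo; a real notebook would live in your project directory.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Lab Notebook"            # placeholder identity for the demo
git config user.email "notebook@example.com"

# 1. Run a program, script, whatever.
printf 'Hello, notebook!\n' > msg.txt

# 2. Track the new file with git.
git add msg.txt

# 3. Commit, recording what was run and why in the message itself.
#    Each -m becomes its own paragraph in the commit message.
git commit -q \
  -m "Write a greeting to msg.txt" \
  -m "Command: printf 'Hello, notebook!\n' > msg.txt" \
  -m "Why: a tiny test file to check that the notebook workflow works."

# The full message is now part of the repository history.
git log -1 --format=%B
```

The first -m is the short summary line; the later ones hold the kind of detail that would otherwise go into a separate notes file.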
That works pretty well, but you still run into the issue of having to remember what you ran, copy and paste it correctly into the commit message, blah blah blah. In other words, it’s still a bit of a pain. While you get the added benefits of git logs and history tracking, you have to do a lot of repetitive, annoying stuff to get things to work. So, of course, I wrote a little program to help automate some of the tedious stuff!
A CLI app to help manage git-based lab notebooks
While working with the above workflow, in addition to QIIME’s provenance tracking, I was also reminded of database migrations. Basically, the way they work is that you write some script that says how the database is supposed to change (e.g., add column first_name to table authors), and then some migration tool handles actually making any changes to the database. In theory, this gives you a simpler way to track how your database has changed over time: you can just follow the paper trail of your migration files.
The app I wrote works in a similar way, except that instead of making incremental changes to a database, you are formalizing making changes to the repository itself. The app is called cln (it stands for “computational lab notebooks”…clever, I know!). You can find it on GitHub. There is also some pretty extensive documentation available to help you get started using the software.
While I suggest you check out the docs for a more detailed explanation of its installation and usage, I want to show a quick little example to give you a flavor of how the cln program can help you manage your git-based lab notebook.
A super simple example
The cln command provides a couple of subcommands to help you manage your lab notebook with git and git-annex. (For more details on individual subcommands, see here).
Create a project
To start, you make a new project.
The cln init command initializes a new project, creates a git repository, and generates some scaffolding for actions and git commit templates.
Prepare an action
Next, you prepare an action to run. (Again, this is just a silly example…for a more in depth tutorial, see the documentation).
In this case the action is just running a printf command and saving the contents in a file. Of course, you can prepare an action containing anything that you would normally run at the command line. For example, you could prepare a crazy action like this:
Note: That’s actually an action I prepared and ran in a real project. Previously, I would have put that little ad-hoc Ruby script into a file and run it in a way that is easier to track, but with cln to help me manage things, everything will be nicely tracked automatically.
The cln prepare command creates an action file and a git commit template.
The action file is simply a bash script with the command you want to run, but having it there in your repository as a standalone script helps you see what is going on if you’re running a complicated command or when you come back to the project a couple of months later.
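To make the idea concrete, here is a hand-rolled sketch of keeping a command as a standalone script next to its output. This is not cln's actual file layout or naming: the file name write_msg.sh, the throwaway repository, and the commit message are all hypothetical, and only plain git is used.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repository with a placeholder identity, just for the demo.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Lab Notebook"
git config user.email "notebook@example.com"

# Save the command as a standalone script (hypothetical file name).
cat > write_msg.sh <<'EOF'
#!/usr/bin/env bash
printf 'Hello, notebook!\n' > msg.txt
EOF

# Run the script, then commit it together with the file it produced.
bash write_msg.sh
git add write_msg.sh msg.txt
git commit -q -m "Write a greeting to msg.txt" -m "Action script: write_msg.sh"

# Later on, the commit still shows exactly which script produced which file.
git --no-pager show --stat --format=%B HEAD
```

Because the script and its output land in the same commit, anyone reading the history can open the script to see the exact command rather than relying on memory.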
Run the pending action
Next, you can check that everything is okay by doing a dry run. It will spit out some stuff to the terminal to let you know what’s going on and suggest what steps to take next. Note: I’ve edited the terminal output a bit.
If it looks good, you can go ahead and run the action.
See how the cln run command gives you hints on what to do next? I tried to make all the cln commands spit out helpful info like that to the terminal.
Track and commit changes
Now, you will be able to see any files that were created or changed as the result of running the action using git status. Depending on the size(s) of the file(s) that were created or changed, you can add them to the git index with either git add or git-annex add. Finally, you commit the changes using the git commit template that was made when you prepared the action.
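One way to make the "git add or git-annex add" decision less manual is a small helper that picks based on file size. The sketch below is only an illustration: the stage_file function and the 10 MB cutoff are my own assumptions, and the git-annex branch assumes git annex init has already been run in the repository.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical cutoff: files larger than this go to git-annex.
SIZE_LIMIT_BYTES=$((10 * 1024 * 1024))

# Stage one file: git-annex for large files (when it is installed),
# plain git for everything else.
stage_file() {
  local file="$1"
  local size
  size=$(( $(wc -c < "$file") ))
  if [ "$size" -gt "$SIZE_LIMIT_BYTES" ] && command -v git-annex >/dev/null 2>&1; then
    git annex add "$file"   # assumes `git annex init` was run beforehand
  else
    git add "$file"
  fi
}

# Demo in a throwaway repository with one small file.
repo=$(mktemp -d)
cd "$repo"
git init -q
printf 'Hello, notebook!\n' > msg.txt
stage_file msg.txt

# The small file is staged with plain git.
git status --short
```

For the commit itself, plain git can open your editor pre-filled from a template file via git commit -t followed by the template path; whether cln relies on that flag internally is not something this sketch assumes.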
The template file will look something like this:
When you run the git commit command, a text editor will pop up with the contents of the git template file ready for you to fill out. This is nice because you can avoid manually copying in the commands you ran. For such a small example it’s not really a big deal, but if you’re running some complicated bioinformatics software with a lot of flags and options, it’s pretty convenient!
Browse the git history
After editing the message and saving the commit, you can browse through your nicely organized repository history and see something like this:
Notice how I put a short, descriptive commit message for the first line, and then added in any additional details that I think I will need later. The == Details == section would hold all the extra stuff I would put in my lab notebook anyway, but it is really convenient to have it right there in the git log.
Having the command that you ran, the details about that command, and the changes that command effected in your repository opens up some really powerful ways to track your analyses.
Get individual file provenance info
For example, you can use the git cli app (e.g., git whatchanged or git log) or a GUI like gitk to get detailed info about the provenance of any files in the repository.
You could run something like this to see all the history for the msg.txt file.
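As a rough, self-contained sketch of that kind of query, the script below builds a throwaway repository with two made-up commits touching msg.txt and then asks git for the file's history. It uses git log rather than git whatchanged, since git's own documentation encourages new users to prefer git log; the commit messages are placeholders.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repository with a placeholder identity, just for the demo.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Lab Notebook"
git config user.email "notebook@example.com"

# Two commits that touch msg.txt, standing in for two notebook entries.
printf 'Hello, notebook!\n' > msg.txt
git add msg.txt
git commit -q -m "Write a greeting to msg.txt"

printf 'Second line\n' >> msg.txt
git add msg.txt
git commit -q -m "Append a second line to msg.txt"

# Every commit that changed msg.txt, newest first, with its message and diff.
# --follow keeps tracking the file across renames.
git --no-pager log --follow --patch -- msg.txt
```

Dropping --patch gives just the commit messages, which is often enough to answer "where did this file come from?".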
As you can imagine, having output like that for all the files in your project folder as well as the chronological logs is a very powerful way to track your analyses and makes managing a computational lab notebook so much easier.
Wrap up
Managing a computational lab notebook is tricky. I have found that using git and git-annex can be a good way to keep all the info you need right in the same directory as all your data files, scripts, and analysis code. To help you more easily manage lab notebooks using git and git-annex, I created a command line app called cln. You can find the code on GitHub. Installation instructions and usage examples can be found in the documentation.
If you enjoyed this post, consider sharing it on Twitter and subscribing to the RSS feed! If you have questions or comments, you can find me on Twitter or send me an email directly.