Saturday 17 August 2013

Writing scientific papers with git and LaTeX


I wrote my last paper using the 'git' version control software.



You may have heard of git, and you may have even downloaded a program or R script from GitHub. It's been around a while (8 years), but it has only recently started to be used by non-technical types (= biologists!). It's mainly used by programmers and web designers to keep track of changes to their code, but the same approach applies equally to writing a manuscript. The principles are the same.

No matter how organised you are, you have probably at some time had folders containing files called 'final.doc', 'finalfinal.doc', 'finalfinalfinal.doc', 'finalfinalfinal-version2_july12.doc', 'finalfinalfinal-version2_aug12_TJedit3_submitted.doc'. You get the picture. With git, this is a thing of the past. You have one file for your manuscript, and one file only. Git's magic happens in the background. Should you need to, you can roll back to any previous version of your manuscript, and it instantly changes in your working directory.

One of the things with writing manuscripts is that different journals will require different formatting, or even an entirely different structure or focus for your work. Using git, we can accommodate this with branches. Setting up a new branch is simple, and it acts effectively as an additional, independent copy of your manuscript (although git doesn't actually duplicate anything; unchanged content is shared between branches, so it's very efficient like that). The beauty is in the way that git allows you to transfer changes from one version to another, or merge them completely should you want.

Let's say you want to submit to Nature, but realistically you have to admit that they're unlikely to publish your important research on the length of ants' legs (but it's worth a try just in case). You branch off from 'master' into a new version called 'nature' and alter all the formatting to their requirements, but it's not the final version and you notice some typos or something more you want to change. It's easy to switch between branches, so you make the corrections, and using the 'cherry-pick' tool in git, you send only these specific changes you made in the nature branch back to master, while ignoring the new formatting. Unfortunately, you get rejected by Nature, and you decide that perhaps the Bulgarian Journal of Myrmecology is more appropriate. No problem, your master is up to date, and you just create another new branch from master (which can also be cherry-picked from if needed). If you are organised and wrote informative commit messages, you can even do the cherry-picking at a much later date.
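For the curious, here is a rough sketch of that Nature scenario as a shell script. It is only an illustration under some assumptions of mine: the sandbox folder, the placeholder author identity, the sentences in the file and the 'format' note on line 1 are all made up, and the "formatting" is just a one-line change standing in for the real thing.

```shell
set -e  # stop at the first error

# Throwaway sandbox, so nothing here touches a real project
sandbox=$(mktemp -d)
cd "$sandbox"
git init --quiet
git symbolic-ref HEAD refs/heads/master     # call the first branch 'master', as in this post
git config user.name "A. Author"            # placeholder identity for the sandbox
git config user.email "author@example.org"

# Hypothetical manuscript: a format note on line 1, then one sentence per line
printf '%s\n' '% format: generic draft' \
    'Ants have six legs.' \
    'Leg length differs between species.' \
    'We compare several species here.' \
    'Teh differences are discussed below.' > manuscript.tex
git add manuscript.tex
git commit --quiet -m "First draft"

# Branch off for Nature and change the formatting (here just the note on line 1)
git checkout --quiet -b nature
sed 's/generic draft/Nature/' manuscript.tex > manuscript.tmp
mv manuscript.tmp manuscript.tex
git commit --quiet -a -m "Switch to Nature formatting"

# Spot a typo while on the nature branch and fix it in a commit of its own
sed 's/Teh differences/The differences/' manuscript.tex > manuscript.tmp
mv manuscript.tmp manuscript.tex
git commit --quiet -a -m "Fix typo in last sentence"
typo_fix=$(git rev-parse HEAD)

# Back on master: bring over the typo fix only, leaving the formatting behind
git checkout --quiet master
git cherry-pick "$typo_fix"
cat manuscript.tex
```

The trick that makes this work is keeping the typo fix in a commit of its own. Had it been committed together with the formatting change, cherry-pick would have dragged both across.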

I hope I've demonstrated that this is a more intelligent way than copy/pasting, but one of the key features is that all authors can work on the same document at the same time, without fear of screwups. This is where git really shines. No longer will you have to email drafts out to all authors and then clean up the mess afterwards using track changes. Each person can independently work on the project at the same time and changes can be incorporated as desired.

However, there's a big snag in adopting this git approach, and as usual, it's other people. Let's be honest, it's not easy to persuade busy/important people to drop what they're doing to learn how to do something new, even if they are genuinely curious. This could make collaborating on a paper hard, which is ironic, as this is one of git's big strengths. So in my case I sent a pdf to my co-authors and received back comments annotated on the pdf. I was relatively fortunate in that my co-authors only wanted minor changes to the text, so this was not a problem to do manually. If they needed to get really stuck in, then the pdf option would have been a no-go (same goes for the dead tree option if they are in the same building).

But what's this about pdfs and this 'LaTeX' thing? Why can't git manage my Word documents? Well, git can track Word documents, but it's a bad idea. Word stores its content as binary or compressed data, and while git can in theory be set to handle this, it gets complicated and unless you know exactly what you're doing, you can lose the main benefits of git, i.e. how it tracks differences between files and effortlessly merges them. Git works best with plain text files, and therefore the LaTeX system is the obvious choice, as it stores the content of your manuscript in this format. You simply run the text file through compiler software, and a fully typeset pdf is produced. The formatting relies on a 'markup language', so for example italic text would be written as follows: \textit{Homo sapiens}. If you've ever written anything in html it's a similar idea, and not as difficult as it sounds.
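To give a feel for it, here is a minimal sketch that writes a tiny LaTeX file from the shell and compiles it. The file name and its one sentence are just for illustration, and I'm assuming 'pdflatex' as the compiler (others exist); the script skips the compile step if it isn't installed.

```shell
# Throwaway folder for the example
work=$(mktemp -d)
cd "$work"

# A complete, if tiny, LaTeX manuscript written as plain text
cat > manuscript.tex <<'EOF'
\documentclass{article}
\begin{document}
Italic text is marked up like this: \textit{Homo sapiens}.
\end{document}
EOF

# Compile to pdf, but only if a LaTeX compiler is installed
if command -v pdflatex >/dev/null 2>&1; then
    if pdflatex -interaction=nonstopmode manuscript.tex >/dev/null; then
        echo "built manuscript.pdf"
    else
        echo "pdflatex reported a problem; see manuscript.log"
    fi
else
    echo "pdflatex not found: install a TeX distribution to build the pdf"
fi
```

The .tex file is the thing you track with git; the pdf can be rebuilt from it at any time.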

However, again, the big problem with LaTeX is other people. If the journal you want to submit to is a nice modern one, then a LaTeX template will be available on their site. Conforming to their punctilious formatting rules is a doddle: you just use the template, and all is good. I've submitted to PLoS, Springer, and Elsevier* journals, and each was very straightforward (almost a pleasure). If your chosen journal does not accept LaTeX, however, you're in a world of pain. Converting to .rtf and then .doc via latex2rtf is straightforward enough, but how do you conform to their ridiculous rules (nobody could possibly peer review a manuscript if the subfigures are numbered with lower case rather than upper case letters, right?). You could do this by hand in the Word doc, but our time is just too valuable to be wasted like that. Changing these things in LaTeX is possible, but it's a royal pain in the arse sometimes, especially when you need to change minutiae in the reference formatting. Besides, it goes against the LaTeX mantra of letting LaTeX take care of these things for you.

So, if I haven't put you off, how does it all work? First you install git. Next you create a local folder on your machine to hold your documents, just as you would for any other project. Next, you need to set up a repository, or 'repo' as it's known in the trade. This repo is usually online, but need not be; it could be another folder on the same computer. Obviously if you wish to work on multiple machines, or you wish to collaborate, it needs to be online. There are a few options out there for that. GitHub is the best known; public accounts are free, but if you want a private project you'll need to pay (I think some academics can apply for a free private account). BitBucket is another option that does offer free, private repos. So, what do I use? I use Dropbox. If you're reading this with any knowledge of git, you'll know that Dropbox is not recommended as a place to keep a git repo, as it is simply not designed for it. However, the main problem is that it can't deal with two people working on the same files at the same time; the repo becomes corrupted. In my experience, if you are the only user of the repo, then that isn't a problem and it works fine.
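As a sketch of that setup, the script below makes a working folder plus a repo in "another folder on the same computer" and connects the two. The folder names, the remote name 'origin' and the author identity are my placeholders, and a temporary directory stands in for wherever you would really keep things. As a commenter pointed out, it all starts with 'git init'.

```shell
set -e  # stop at the first error

# Stand-in for wherever you keep your work (a temporary folder here)
base=$(mktemp -d)

# The repo: a 'bare' repository (history only, no working files) in its own folder
git init --quiet --bare "$base/paper.git"

# The working folder where you actually write
mkdir "$base/paper"
cd "$base/paper"
git init --quiet
git symbolic-ref HEAD refs/heads/master     # call the first branch 'master', as in this post
git config user.name "A. Author"            # placeholder identity
git config user.email "author@example.org"

# A first commit, so there is something to send
printf '%s\n' 'Ants have six legs.' > manuscript.tex
git add manuscript.tex
git commit --quiet -m "Start manuscript"

# Tell git where the repo lives, then push to it.
# The -u remembers the pairing, so a plain 'git push' is enough from now on.
git remote add origin "$base/paper.git"
git push --quiet -u origin master
git remote -v
```

For an online repo the only difference should be the address given to 'git remote add', which would be the URL your hosting service shows you rather than a local path.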

Git essentially works by tracking your files and noticing when they have been changed. Once you 'add' the changed files, the changes sit in what's called the 'staging area'. When you are happy with them, you 'commit' them to git, and each commit is assigned a unique 'hash', which acts as a permanent record of those changes. At the end of the day, you can 'push' your commits to the remote repo, where they can be accessed by you on another machine later, or by a collaborator. The main advantage of this three-step system is that you control exactly who sees which changes and when. Git is a command line program, and although GUIs are available, it is good to start familiarising yourself with the basic commands when you learn. They are very simple (see below), and any problems/questions can be easily Googled. There's tons of information out there.
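Here is a small sketch of the edit, stage, commit cycle, with 'git status' showing where a change is sitting at each step. The file contents and the author identity are invented for the example, and there is no push because this sandbox has no remote repo.

```shell
set -e  # stop at the first error

# Throwaway sandbox repository
work=$(mktemp -d)
cd "$work"
git init --quiet
git config user.name "A. Author"            # placeholder identity
git config user.email "author@example.org"

printf '%s\n' 'Ants have six legs.' > manuscript.tex
git add manuscript.tex
git commit --quiet -m "Start manuscript"

# 1. Edit: the change exists only in your working directory
printf '%s\n' 'Leg length differs between species.' >> manuscript.tex
git status --short      # the file is listed as modified but not yet staged

# 2. Stage: the change now sits in the staging area
git add manuscript.tex
git status --short      # the file is listed as staged, ready to commit

# 3. Commit: the change becomes a permanent, hashed record
git commit --quiet -m "Add sentence on leg length"
git log --oneline       # one line per commit, newest first, with short hashes
```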

So, I will definitely be using the git/LaTeX combination in future, assuming I can convince people to join me. There's a lot more to it than I've mentioned here, but here are a few commonly used commands below, mainly to illustrate how simple it is. For further information, read these helpful git tutorials here, here, and here. If you're interested in LaTeX, the Wikibook is here.

Git is not limited to dealing with manuscripts either. I added my figures and data there too. In fact, any version of the whole project at any time can be accessed with a single command. Another cool feature is that a repo such as GitHub can double as a preprint server, should you wish to share your results with the research community prior to journal submission.

#adds file(s) to be tracked by git
#all files in directory can be added with 'git add .'
git add manuscript.tex

#you can make some changes to several files and commit these changes (e.g. a day's work) all at once with one message (the -a specifies all)
#you can alternatively tailor your commits to apply to just one file, or just one specific edit, and this makes rolling back a specific file a lot easier
#your future self will thank you for informative commit messages!  
git commit -a -m "a message describing what you did"

#view the history of commits
git log

#send your commits to the repo
#can be set up to push automatically with just 'git push'
git push remotename branchname

#to create a new branch called 'newbranch'
git branch newbranch

#switch to new branch
git checkout newbranch

#switch back to master
git checkout master

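#roll back to a previous commit
#the commits are stored as unique alphanumerical 'hashes' and can be looked up with 'git log'
#they can be truncated too.
#this leaves you in a 'detached HEAD' state for looking around; 'git checkout master' brings you back to the present
git checkout c96c8009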

#permanently reset to a previous commit: you lose all later commits
git reset --hard c96c8009

#cherry pick a specific commit and incorporate into your current branch
#you need to have checked out the branch you want to cherry-pick INTO before doing this.
#a tip for using with LaTeX is to write each sentence on a separate line. This will minimise conflicts (the same line getting edited in different places by different people).
git cherry-pick c96c8009

#for significant points in your timeline, add a version tag to a commit
git tag -a v2.0 -m 'version submitted to Nature' c96c8009

#compare two versions of the same file
#there are many additional options including colouring and word differences
git diff <commit1> <commit2> -- <file_name>
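As an illustration of the word differences option, the sketch below commits a two-sentence file (one sentence per line, as suggested above), fixes a single word, and compares the two commits with '--word-diff'. The sentences and the author identity are made up for the example.

```shell
set -e  # stop at the first error

# Throwaway sandbox repository
work=$(mktemp -d)
cd "$work"
git init --quiet
git config user.name "A. Author"            # placeholder identity
git config user.email "author@example.org"

# First version, one sentence per line
printf '%s\n' 'Ants have six legs.' 'Teh legs differ in length.' > manuscript.tex
git add manuscript.tex
git commit --quiet -m "First draft"

# Fix a single word and commit again
printf '%s\n' 'Ants have six legs.' 'The legs differ in length.' > manuscript.tex
git commit --quiet -a -m "Fix typo"

# Compare the last two commits word by word rather than line by line
word_changes=$(git diff --word-diff HEAD~1 HEAD -- manuscript.tex)
printf '%s\n' "$word_changes"
```

Removed words are wrapped in [- -] and added words in {+ +}, which is far easier to read for prose than the usual whole-line diff.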


*Say what you like about Elsevier, their LaTeX support is very good.

2 comments:

  1. Congratulations on the article!!! I'm from Argentina and I want to try to convince researchers to adopt a similar philosophy. And it's hard!!! hehe. But I'll keep trying. Oh, and as a contribution: you left out "git init" at the start of your commands. Regards, Gabriel.

  2. You can also use a GUI and bring the entire process online with Authorea (http://authorea.com)
