The protein crystallization grid project I am undertaking has convinced me of the virtues of version control. Knowing I can revert to an older version means I spend more time being adventurous than being paranoid about going down the wrong track and not being able to retrace my path.
On Coast to Coast Bio, Atom and I have often talked about the many ways people are using git: blog posts, publication manuscripts and database entries, to name a few. Since crystallographic refinement occupies a significant portion of a crystallographer's time, I decided to see how my personal git workflow would adapt to crystallographic refinement.
In crystallographic refinement, most routines are driven by script files, which typically manipulate binary data and ASCII coordinate files (the Protein Data Bank format). Each step spits out a new coordinate file and a text log file that serves as a record of that operation. For example, a particular refinement step that calls the phenix refinement routine would be run as
“phenix.refine myinput.pdb mydata.mtz > run1_myinput_mydata.log”
Here the input pdb and the output log are text files, and the data is a binary file (the mtz format). When finished, this run outputs a pdb file whose name is often automatically “versioned” by the program under a different name, say “myinput_001.pdb”. Versioning is assured by keeping a series of input, output and pdb files all resident in the project directory. Retracing your path is easy if you know which version you want to go back to, based on a file name and, say, some scribbled notes in a Readme file or log file.
Now you would think this is something that works. But imagine coming back to your refinement directory a few weeks or months later. All you see is a directory full of tens of pdb files and log files and, hopefully, a single Readme file detailing all the steps along the way. This can be quite difficult to follow. Stepping back is possible using an old model file. But once this is done I have to come up with a new naming system to understand the history of the refinement, or worse still, rely on timestamps. Also, and very importantly, stepping back to a previous step is only possible for files whose names changed every time their contents changed.
After using git for just one project I am quite convinced that git has a lot to offer for crystallographic refinement. Git allows me to return my directory at any time to its state at an earlier commit. Say I ran a series of refinement steps that generated tens of log files and then suddenly decided I was getting nowhere. With the non-git setup, I could revert to an earlier model file. But that still leaves tens of log files cluttering up my work directory. In the git case, checking out a previous snapshot returns my working directory to its clutter-free early stage without deleting the record of all my failed approaches.
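As a rough illustration of that last point, here is a minimal sketch run in a throwaway directory. The file names, the tag good-start, the branch names and the placeholder file contents are all made up for the example, and the refinement rounds are faked with printf so it runs without any crystallography software installed.

```shell
set -e
# Work in a scratch repository so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"
git init -q
git config user.name "Example User"          # hypothetical identity
git config user.email "example@example.org"  # hypothetical identity

# A starting point worth returning to.
printf 'placeholder model\n' > model.pdb
git add model.pdb
git commit -q -m "Good starting model"
git tag good-start

# Three refinement rounds that turn out to go nowhere.
# Each one overwrites the model and leaves a new log file behind.
for i in 1 2 3; do
    printf 'placeholder log %s\n' "$i" > "run$i.log"
    printf 'placeholder model after round %s\n' "$i" > model.pdb
    git add "run$i.log" model.pdb
    git commit -q -m "Refinement round $i"
done

# Give the dead end a name so it stays reachable,
# then start a fresh branch from the good state.
git branch failed-approach
git checkout -q -b second-try good-start

ls                                  # the run*.log files are gone from the directory
git log --oneline failed-approach   # but the failed rounds are still recorded
```

Naming the dead end with a branch is one choice among several. A tag would work just as well, and the point is only that the commits stay reachable after the working directory has been cleaned up.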
Also, during most refinements I tend to use similar-sounding names for my model files. This can quickly get messy. With git, even if I accidentally reuse the same name, like myfinalmodel.pdb, I can still version this file without descriptive suffixes. Importantly, git also preserves the history of commits as commit trees. A flat directory hierarchy does not capture this nearly as well as a commit history. Another big plus is that git allows me to work on multiple machines and merge my work between them. Without this, I am left moving files back and forth and making sure their content didn't change while their names stayed the same.
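To make the point about reusing one file name concrete, here is a small sketch. It assumes a scratch repository and placeholder file contents. Only the name myfinalmodel.pdb comes from the text above.

```shell
set -e
workdir=$(mktemp -d)
cd "$workdir"
git init -q
git config user.name "Example User"          # hypothetical identity
git config user.email "example@example.org"  # hypothetical identity

# Two builds saved under the very same file name.
printf 'placeholder model, first build\n' > myfinalmodel.pdb
git add myfinalmodel.pdb
git commit -q -m "First build"

printf 'placeholder model, second build\n' > myfinalmodel.pdb
git commit -q -a -m "Second build"

# List the commits that touched this one file.
git log --oneline -- myfinalmodel.pdb

# Print the earlier version without changing the working copy.
git show HEAD~1:myfinalmodel.pdb
```

The working copy still holds the second build after this. git show only prints the older content, so it is a safe way to peek before deciding whether to go back.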
At the present moment I use the ccp4 and phenix GUIs extensively as front ends to manage my refinement “workflow”. In this case I also use git, which sits on top of these files, versioning things as they go along. If I had a few months of spare time (yeah, right) I would love to create a backend to ccp4i that builds in SHA-1 based versioning of all files handled by all the refinement methods. It's quite nice to use git alongside and watch my commit trees to keep track of my refinement. I have just begun using git in this way and hope to have screencasts detailing my git workflow soon.
I just realized I need to revert to my pdb files of two build sessions back, so it's time to
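One way git could "sit on top" of a scripted step is a small wrapper that runs the step and then commits whatever changed. The sketch below is only an assumption about how that might look: run_and_commit and fake_refine are hypothetical names, and fake_refine is a stand-in that writes placeholder files so the example runs without Phenix or ccp4 installed.

```shell
set -e
workdir=$(mktemp -d)
cd "$workdir"
git init -q
git config user.name "Example User"          # hypothetical identity
git config user.email "example@example.org"  # hypothetical identity

# Hypothetical helper: the first argument is the log file, the rest is the command.
# It runs the command, captures its output in the log, and commits every change.
run_and_commit() {
    log=$1
    shift
    "$@" > "$log"
    git add -A
    git commit -q -m "Ran: $*"
}

# Stand-in for a real refinement program. It prints a log line
# and writes a new model file with an _001 suffix.
fake_refine() {
    echo "refining $1"
    printf 'placeholder refined model\n' > "${1%.pdb}_001.pdb"
}

printf 'placeholder model\n' > myinput.pdb
git add myinput.pdb
git commit -q -m "Starting model"

run_and_commit run1_myinput.log fake_refine myinput.pdb
git log --oneline
```

Using the command line itself as the commit message means the history doubles as a record of what was run. In a real session fake_refine would be replaced by the actual refinement command, and git add -A would pick up whatever files that command produced.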
“git checkout 3ac94e79552c11025d7bb01f9a98b7afc1637e60 myfinalpdb.pdb”
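For anyone who wants to try the same move on a scratch repository first, here is a minimal sketch. The three "build sessions" and the file contents are placeholders invented for the example. Only the file name myfinalpdb.pdb and the form of the checkout command come from the line above.

```shell
set -e
workdir=$(mktemp -d)
cd "$workdir"
git init -q
git config user.name "Example User"          # hypothetical identity
git config user.email "example@example.org"  # hypothetical identity

# Three build sessions, each overwriting the same file.
for i in 1 2 3; do
    printf 'placeholder model, build session %s\n' "$i" > myfinalpdb.pdb
    git add myfinalpdb.pdb
    git commit -q -m "Build session $i"
done

# Resolve "two sessions back" to a full commit hash.
target=$(git rev-parse HEAD~2)

# Restore only this file from that commit. The branch and its history stay put.
git checkout "$target" -- myfinalpdb.pdb

cat myfinalpdb.pdb    # content from build session 1
git status --short    # the restored file shows up as a staged change
```

Because only the one file is restored, the current branch does not move. The old content is simply staged on top of the latest commit, ready to be committed as a new step if it turns out to be the right choice.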