One of the biggest differences between writing academic code and writing production code is the emphasis on readability, maintainability and traceability. While academic code is often only needed once, and used by one or a few people with a low turnover rate, production code is used over and over again, must adapt to change, and must be intelligible to anyone who needs to use it.
In my academic work, I wrote MATLAB code for myself, and maybe a few close collaborators. I didn’t worry about version control or documentation: any questions were easily dealt with, and my familiarity with the codebase made it easy for me to navigate. When a paper was published, I posted the complete code and brief instructions on GitHub using Git, but that was the extent of the documentation. In this case, I used Git and GitHub exclusively as a code storage facility, and neither was integrated with my development process.
Version control, and Git in particular, is much more powerful than this. I recently read through Atlassian’s Git Tutorial. It is one of the better online tutorials I’ve read, largely because of the excellent context it provides throughout.
The tutorial isn’t just a rundown of Git commands. Instead, it provides just enough history of version control and enough sample workflows that I feel I understand how to use Git in the context of an actual project, and why Git is usually preferred to centralized systems like SVN. It also mixes in code snippets, but doesn’t overload on technical details I’m not going to remember for very long.
The tutorial makes technical points when they are necessary for comprehension and ties them to their most common uses in typical workflows. For example, it distinguishes between commits and the branches that point to them, and explains how this leads to the difference between “git checkout <branch>” and “git checkout <commit>”. It also covers the common practice of cleaning up the development history of a new branch into logical commits before merging into master. This practice is no doubt critical to long-term project maintenance, but it wouldn’t be obvious from a simple description of “git rebase -i”.
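To make the commit-versus-branch distinction concrete for myself, here is a small Python sketch. It is a toy model, not how Git is actually implemented: the Commit and Repo classes and their method names are my own inventions for illustration. It only captures the idea that a branch is a movable name pointing at a commit, so checking out a branch attaches HEAD to that name, while checking out a commit leaves HEAD detached.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Commit:
    """Hypothetical stand-in for a Git commit: an id plus a link to its parent."""
    sha: str
    parent: Optional["Commit"] = None


@dataclass
class Repo:
    """Toy repository: branches are just names that point at commits."""
    branches: Dict[str, Commit] = field(default_factory=dict)
    head_branch: Optional[str] = None      # set while HEAD is attached to a branch
    head_commit: Optional[Commit] = None   # the commit currently checked out

    def commit(self, sha: str) -> Commit:
        new = Commit(sha, self.head_commit)
        self.head_commit = new
        if self.head_branch is not None:
            # Attached HEAD: the current branch pointer moves to the new commit.
            self.branches[self.head_branch] = new
        return new

    def checkout_branch(self, name: str) -> None:
        # Like "git checkout <branch>": HEAD follows the branch name.
        self.head_branch = name
        self.head_commit = self.branches[name]

    def checkout_commit(self, commit: Commit) -> None:
        # Like "git checkout <commit>": HEAD points at the commit directly (detached).
        self.head_branch = None
        self.head_commit = commit


if __name__ == "__main__":
    repo = Repo(head_branch="master")
    first = repo.commit("a1")
    second = repo.commit("b2")

    repo.checkout_commit(first)        # detached HEAD at the first commit
    stray = repo.commit("c3")          # this commit is not on any branch

    print("master points at:", repo.branches["master"].sha)
    print("HEAD is detached:", repo.head_branch is None)
    print("stray commit's parent:", stray.parent.sha)
```

In this model, committing after checking out a branch moves that branch forward, while committing after checking out a bare commit leaves every branch where it was. That is the behavior the tutorial warns about when it explains the two forms of checkout.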
The tutorial doesn’t have enough information on its own to serve as complete Git documentation, but then it doesn’t have to. The official Git documentation and the excellent book Pro Git are always only a Google search away. The importance of the tutorial lies in the ideas, structures and workflows it imparts to readers, who should then know what to search for later when working on a project.
For a data scientist, Git and version control workflows are important skills. The development of models and data cleaning code should be traceable through Git, but knowledge of Git for application development is also key, as data scientists are inevitably involved to a greater or lesser degree in that process as well.