A Quick Guide to Organizing Computational (Biology) Projects

Presenter Notes

In Short ...

  • Poor organizational choices lead to significantly slower research progress.
  • It is critical to make your results reproducible.

Presenter Notes

The Stolen Briefcase

"Once, several years ago, at a conference, one of us had a briefcase stolen. The briefcase contained originals of figures which had been developed while an employee of a large commercial seismic exploration outfit. [..] A manuscript had already been written. The figures were so convincing and pivotal [..] that without them, the manuscript made no sense. The manuscript had to be abandoned."

Presenter Notes

Who's on First?

"A Graduate Student comes into a Professor's office and says, 'That idea you told me to try - it doesn't work!' [..] Unfortunately, the Student's descriptions of the problems he is facing don't give the Professor much insight on what's going on."

Presenter Notes

A year is a long time in this business

"When he went back to the old software library [..], he couldn't remember how the software worked - invocation sequences, data structures, etc. In the end, he abandoned the project, saying he just didn't have time to get into it anymore."

Presenter Notes

A la recherche des paramètres perdues

"Well, actually, the reason we didn't give many details in the paper is that we forgot which parameters gave the nice pictures you see in the published article; when we tried to reconstruct that figure using parameters that we thought had been used, we only got ugly looking results. So we knew there had been some parameter settings which worked well, and perhaps one day we would stumble on them again; but we thought it best to leave things vague."

(note: this story is actually a composite of two separate true incidents)

Presenter Notes

Principles

Presenter Notes

First Principle

"Someone unfamiliar with your project should be able to look at your computer files and understand in detail what you did and why."

Presenter Notes

Second Principle

"Everything you do, you will have to do over and over again"

-- Murphy's law

Presenter Notes

File and directory organization

Presenter Notes

So far, so good...

./images/01_files.png

Presenter Notes

Now what?

./images/02_files.png

Presenter Notes

I guess this is alright

./images/03_files.png

Presenter Notes

Which one is the most recent?

./images/04_files.png

Presenter Notes

Another (bad) common approach

./images/another_common_approach.png

Presenter Notes

A story told by filenames

./images/version_control.gif

Presenter Notes

A (possible) solution

./images/correct_.png

Presenter Notes

Still missing something...

  • We give the project to a collaborator
  • A new student joins the project
  • Three years later, we have forgotten the details of the project

We need context. We need metadata.

Presenter Notes

Metadata

  • who is the data from?
  • when was it generated?
  • what were the experimental conditions?

./images/data.gif
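
One way to keep this context next to the data is a small "sidecar" file written alongside each data file. The sketch below is only an illustration: the function name write_metadata, the .meta.json suffix, and the field names are assumptions, not part of the original guide.

```python
import json
from datetime import date
from pathlib import Path


def write_metadata(data_file, author, conditions, generated_on=None):
    """Write a JSON sidecar file describing who produced a data file, when, and how."""
    data_file = Path(data_file)
    metadata = {
        "file": data_file.name,
        "author": author,  # who is the data from?
        "generated_on": (generated_on or date.today()).isoformat(),  # when?
        "conditions": conditions,  # what were the experimental conditions?
    }
    # Hypothetical naming scheme: counts.tsv -> counts.tsv.meta.json
    sidecar = data_file.with_name(data_file.name + ".meta.json")
    sidecar.write_text(json.dumps(metadata, indent=2) + "\n")
    return sidecar
```

For example, write_metadata("data/counts.tsv", "A. Researcher", {"temperature_c": 37}) would create data/counts.tsv.meta.json, provided the data directory exists. A plain-text README in each data directory serves the same purpose.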

Presenter Notes

Project organization

./images/project_organization.png

Presenter Notes

The lab notebook

Presenter Notes

What is it?

"A laboratory notebook (colloq. lab notebook) is a primary record of research. Researchers use a lab notebook to document their hypotheses, experiments and initial analysis or interpretation of these experiments. The notebook serves as an organizational tool, a memory aid, and can also have a role in protecting any intellectual property that comes from the research."

-- Wikipedia

Presenter Notes

The notebook

  • entries should be dated
  • entries should be verbose, with links, embedded images, and tables
  • record the results of all the experiments performed

Presenter Notes

Carrying out a Single Experiment

Presenter Notes

Experiments

  • record every operation you perform, so that your work is transparent and reproducible
  • in practice, create a README in which you store every command line you use
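
Recording commands by hand is easy to forget, so it can help to let a small wrapper do it. The following is a minimal sketch, assuming commands are launched from Python; the helper name run_and_record and the timestamp format are hypothetical.

```python
import shlex
import subprocess
from datetime import datetime
from pathlib import Path


def run_and_record(command, readme="README"):
    """Append a command line to the README, then run it and return its exit code."""
    line = shlex.join(command)  # shell-quoted form of the argument list
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M")
    # The command is recorded before it runs, so failed attempts are logged too.
    with Path(readme).open("a") as handle:
        handle.write(f"# {stamp}\n{line}\n")
    # check=True raises CalledProcessError if the command fails.
    return subprocess.run(command, check=True).returncode
```

A call such as run_and_record(["sort", "data/counts.tsv"]) appends the exact command line to README in the current directory before running it.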

Presenter Notes

6 steps

  • Record every operation you perform
  • Comment generously
  • Avoid editing intermediate files by hand
  • Store all files and directory names in the script
  • Use relative pathnames to access files within the same project
  • Make the script restartable
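
Several of these steps can be combined in a single driver script. The sketch below is one possible shape, not a prescribed layout: the directory names, file names, and the compute_lengths step are all hypothetical. File names live at the top of the script as relative paths, and each step is skipped when its output already exists, which makes the script restartable.

```python
from pathlib import Path

# All file and directory names are stored in one place, as relative paths.
# These names are hypothetical examples.
DATA_DIR = Path("data")
RESULTS_DIR = Path("results")
INPUT_FILE = DATA_DIR / "sequences.txt"
LENGTHS_FILE = RESULTS_DIR / "lengths.txt"


def compute_lengths(source, target):
    """Write the length of each non-empty line of source to target."""
    lines = [line.strip() for line in source.read_text().splitlines()]
    target.write_text("".join(f"{len(line)}\n" for line in lines if line))


def run_step(target, action, *args):
    """Run action only if its output file does not exist yet (restartable)."""
    if target.exists():
        print(f"skipping {target}: already exists")
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    action(*args, target)
    return True


def main(base=Path(".")):
    """Run every step of the experiment relative to the project directory."""
    return run_step(base / LENGTHS_FILE, compute_lengths, base / INPUT_FILE)
```

Calling main() from the project directory runs the experiment; calling it a second time skips the step whose output is already present. Deleting an output file forces that step to run again.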

Presenter Notes

Handling and preventing errors

Presenter Notes

Bugs...

You will introduce errors into your code

./images/bug.png

Presenter Notes

3 suggestions for error handling

  • Write robust code to detect errors
  • When an error occurs, abort
  • Whenever possible, create an output file using a temporary name, and rename the file when the script is complete
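
A minimal sketch of these suggestions, with hypothetical helper names: write_atomically writes to a temporary name and renames the file only once writing has finished, and parse_count checks its input and aborts with a message rather than continuing with bad data.

```python
import os
import sys
from pathlib import Path


def write_atomically(target, lines):
    """Write lines to a temporary name, then rename to target once complete."""
    target = Path(target)
    temporary = target.with_name(target.name + ".tmp")
    with temporary.open("w") as handle:
        for line in lines:
            handle.write(line + "\n")
    # Reached only if every line was written without an error.
    os.replace(temporary, target)


def parse_count(text):
    """Detect bad input early and abort instead of continuing with bad data."""
    try:
        count = int(text)
    except ValueError:
        sys.exit(f"error: expected an integer count, got {text!r}")
    if count < 0:
        sys.exit(f"error: count must be non-negative, got {count}")
    return count
```

If the script fails partway through, only the .tmp file is left behind, so a partial output file is never mistaken for a complete one.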

Presenter Notes

Command line vs script vs program

Presenter Notes

Software engineering

./images/good_code.png

Presenter Notes

4 types of script

  • Driver script: the top-level script of an experiment
  • Single-use script: data format conversion
  • Project-specific script: contains generic functionality used by multiple experiments
  • Multi-project script: functionality used across many projects (ROC curve, n-fold cross-validation, etc.)
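
As an illustration of the last category, here is a minimal sketch of an n-fold cross-validation splitter that could be shared across projects. The function name n_fold_splits and its seed parameter are assumptions made for this example; established libraries offer more complete versions.

```python
import random


def n_fold_splits(items, n_folds, seed=0):
    """Yield (training, test) lists for each fold of an n-fold cross-validation.

    items must be a sequence (something with a length).
    """
    if not 2 <= n_folds <= len(items):
        raise ValueError("n_folds must be between 2 and the number of items")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    # Deal the shuffled items into n_folds groups of nearly equal size.
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, test
```

Each item appears in exactly one test set, and using the same seed reproduces the same split on every run.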

Presenter Notes

The Value of Version Control

Presenter Notes

images/stolen_briefcase.png

Presenter Notes

Thanks for your attention

Presenter Notes