
Reproducible Analysis Through Automated Jupyter Notebook Pipelines

February 1, 2017

Amanda Birmingham (abirmingham at ucsd.edu)

Replicability and reproducibility* have been important components of the scientific method since Boyle and Huygens argued over their vacuum experiments in the 17th century. Since repeating a process the same way every time is one of the things computers do best, one might expect that computational biology would outshine lab biology in this critical area. However, early findings incorporating bioinformatic analyses were often published with full details on the wet lab work but barely a mention of the computational efforts.


In a 1986 paper from the Proceedings of the National Academy of Sciences, the fitting code used wasn’t even mentioned in the methods–though it did rate an acknowledgment (after “helpful discussions”)!


Fortunately, the field has improved, but the road from computational ‘methods’ like “Alignments were run” to “Alignments were run with BLAST” to “Alignments were run with BLASTN version 2.2.6 against human” to “Alignments were run with NCBI BLASTN v.2.2.9 using the command blastn -W 7 -q -1 -F F against the NCBI RefSeq release 80 human transcriptome” has been a long one. Comprehensive methods sections in manuscripts, moreover, really shouldn’t be the end of this road for bioinformaticists. Here is another area where computational work should exceed the limits of wet labs, since a perfectly configured and analysis-ready computing environment could be disseminated with a new published work.

Jupyter Notebooks: Friend or Foe?

One option for such distribution is the Jupyter Notebook, a new evolution of the IPython project. It describes itself as “a web-based interactive computing platform” and has widely been hailed as a powerful new weapon in the war for reproducibility. For example, the non-profit Data Carpentry, which provides data analysis training to researchers, offers an entire workshop on “Reproducible Research using Jupyter Notebooks”. (Giving a primer on notebooks themselves is beyond the scope of this post, so if you haven’t experienced them, skim the gentle introduction at Basics of Jupyter Notebook and Python, then visit http://try.jupyter.org/ and click on the *.ipynb file for your favorite language–e.g., Welcome to Python.ipynb–to try one out yourself.)

Jupyter Notebooks can alternate static code (in a variety of programming languages) with human-readable text, in the literate programming paradigm. However, Jupyter inventor Fernando Perez created them to support the more expansive concept of literate computing, in which the iterative and interactive exploratory computing so critical to investigating a new data set is captured along with a narrative describing its motivations and results. As notebooks are easily savable, modifiable, and extensible, they also offer a convenient tool for rerunning or tweaking previous data analyses. In fact, here at CCBB, we decided to deliver our analyses to customers in Jupyter Notebooks: not merely a faithful record of the precise commands we executed, the notebooks are themselves software tools that can be easily rerun if necessary.




Well, almost. In our experience, notebooks’ greatest strength–their exceptional interactivity–can sometimes feel like their greatest flaw. Two particular issues we encountered were:

  1. Subsequent analysis runs need automation more than interactivity. When running a custom analysis pipeline developed in a Jupyter Notebook over new data, we want to leverage the code+narrative aspect of notebooks (to ensure methods are correctly documented in a human-readable way for every pipeline run) but don’t want to have to open, update, and step through the relevant notebook by hand every time new data comes in.

  2. Official records need immutability more than interactivity. Different cells in a Jupyter Notebook can depend on each other: type x = 3+3 into cell A, run it, and then type print(x) into cell B and run it, and you’ll see the output 6. However, changing the code in cell A to x = 3+25 doesn’t trigger automatic rerunning of cell B–so you still see an output of 6. While this behavior is by design, its side effect is that careless interactions with a notebook can destroy its value as a computational record (especially since by default notebooks auto-save all changes every few minutes!). See the illustration below, shown as a looping gif:
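
In plain text, the sequence captured in that illustration boils down to something like this (the cell labels are just for orientation):

# Cell A, edited after Cell B had already been run:
x = 3+25    # originally x = 3+3; Cell B is not automatically rerun after this edit

# Cell B, whose displayed output is now stale:
print(x)    # the notebook still shows the output 6 from the original run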

Of course, one could argue that this just means Jupyter Notebooks aren’t the right tool for codifying analysis pipelines and/or keeping official records (don’t drain pasta through a baseball glove and then complain that you burnt your fingers!). However, given notebooks’ other advantages–like the ability to mix Python and R in the same notebook, easy integration with useful tools both inside and outside bioinformatics (like Anaconda and the Integrative Genomics Viewer (IGV)), and the option to present live, interactive notebooks as slides, to name just a few–we were strongly motivated to look for a way around the flaws we had encountered without leaving the Jupyter Notebook ecosystem.

Building a Self-Documenting Analysis Pipeline

Fortunately, we weren’t the first to encounter these issues. In fact, #1 (the need to automate notebooks) has already been addressed with the introduction of helpful APIs and extensions like nbconvert, nbformat, and nbparameterise (Note the British spelling!). The first two allow you to read and execute Jupyter Notebooks from external Python scripts, while the last provides the critical ability to inject modified variable values before each execution. These tools make it possible to serve two masters, keeping an analysis pipeline’s core functionality in notebooks–where it is easy for non-coders and external reviewers to examine it–while still allowing a bioinformaticist to run the pipeline automatically from a script. Here’s sample code to inject variables and run a notebook from a Python script, writing out the results to a new notebook file:

# standard libraries
import os

# third-party libraries
import nbformat
import nbparameterise
from nbconvert.preprocessors import ExecutePreprocessor


# modified from https://nbconvert.readthedocs.io/en/latest/execute_api.html
def execute_notebook(notebook_filename, notebook_filename_out, params_dict, 
    run_path="", timeout=6000000):
    
    notebook_fp = os.path.join(run_path, notebook_filename)
    nb = read_in_notebook(notebook_fp)
    new_nb = set_parameters(nb, params_dict)
    ep = ExecutePreprocessor(timeout=timeout, kernel_name='python3')

    try:
        ep.preprocess(new_nb, {'metadata': {'path': run_path}})
    except:
        msg = 'Error while executing: "{0}".\n\n'.format(notebook_filename)
        msg = '{0}See notebook "{1}" for traceback.'.format(
                msg, notebook_filename_out)
        print(msg)
        raise
    finally:
        with open(notebook_filename_out, mode='wt') as f:
            nbformat.write(new_nb, f)   


def read_in_notebook(notebook_fp):
    with open(notebook_fp) as f:
        nb = nbformat.read(f, as_version=4)
    return nb


def set_parameters(nb, params_dict):
    # find the simple assignments in the notebook's first code cell ...
    orig_parameters = nbparameterise.extract_parameters(nb)
    # ... and replace their values with those supplied by the caller
    params = nbparameterise.parameter_values(orig_parameters, **params_dict)
    new_nb = nbparameterise.replace_definitions(nb, params, execute=False)
    return new_nb

… and here’s the super-simple, one-line call to execute this functionality, assuming the notebook to be run (named mynotebook.ipynb) has two variables named x and y that I want to set:

execute_notebook("mynotebook.ipynb","my_new_notebook.ipynb",{"x":6,"y":"blue"})

By the way, if this isn’t your first rodeo, you might be wondering “wait, where and how does one define which variables a notebook has?” Good catch! nbparameterise, which is doing all the heavy lifting on injecting those variable values, looks at the first code cell in the notebook; if that cell contains only simple assignment statements, it takes the variables therein as the ones it can modify. Here’s an example:
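
A representative first code cell for mynotebook.ipynb might look something like this (the default values shown are arbitrary placeholders that get overridden at run time):

# first code cell: only simple assignments (and comments), so nbparameterise
# will recognize x and y as parameters it can inject
x = 3
y = "red"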

Here the cell containing the “x” and “y” variables is the first code cell in the notebook, and it contains only assignments (and comments), so nbparameterise will recognize all its contents as variables that can be injected. (Don’t get carried away and try to put more complex assignments–like the call your_handle = "{0}-{1}".format(y, x)–into that first code cell. If nbparameterise sees anything other than simple assignments of strings or numbers, it will give up on the whole cell and decide there are no variables it can inject!)

Running the execute_notebook command given above produces a new my_new_notebook.ipynb file in the same directory as the first; the new notebook looks like this:

You can see that all the explanatory text remains, but the variable values in the first code cell have been swapped for the ones we injected–and the subsequent code has been rerun with those new variable values. Nifty!

But wait, we’ve only solved half the problems! What about issue #2, the potential for corruption of a notebook-as-record? Issue #1, at least, is fixed, since we can easily automate a notebook-based pipeline (or a multi-notebook pipeline, although stringing together the execution of multiple notebooks is left as an exercise for the reader–hint: passing values generated by an earlier notebook into a later one can’t be done with nbparameterise, but it can be done with a temporary file, as sketched below).
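
As a rough sketch of that temporary-file hint (the notebook names, file name, and variables below are all made up for illustration): the first notebook ends with a cell that writes its result to a file, and the driving script reads that file back before launching the second notebook.

# hypothetical driver script: chain two notebooks via a temporary file.
# Assume the last cell of count_reads.ipynb writes its result like so:
#     with open("shared_result.txt", "w") as f:
#         f.write(str(num_filtered_reads))
execute_notebook("count_reads.ipynb", "count_reads_run.ipynb",
                 {"input_fastq": "sample1.fastq"})

with open("shared_result.txt") as f:
    num_filtered_reads = int(f.read().strip())

# inject the value produced by the first notebook into the second one
execute_notebook("summarize.ipynb", "summarize_run.ipynb",
                 {"num_filtered_reads": num_filtered_reads})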

Well, as for issue #2, I have hopes that someday it will be addressed directly through the notebook, and one option is already available from that direction. If you’re using Python in a Jupyter Notebook, you can invoke the %logstart magic with the -o option to sequentially record all Python commands run in the notebook–and their output–to a log file. The log record is faithful even when the notebook is corrupted; this is demonstrated by the log file for our earlier corrupted notebook example, which correctly shows that I actually assigned x twice, once before and once after running the print statement:

# IPython log file

x = 3+3
print(x)
x = 3+25
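
A log like the one above is produced by a single magic command near the top of the notebook (the log file name here is made up):

# first code cell of the notebook: log every command run to a file;
# the -o flag also records expression output values
%logstart -o notebook_session_log.py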

(Why didn’t this record the output of print(x)? The -o option captures expression output values–the Out[] results–but not text that print writes to the screen.) Of course, it also doesn’t include any of the other goodies in the notebook–descriptive text, images, nice formatting–basically all the stuff that makes notebooks notebooks and not just Python scripts 🙂 . So until Fernando and company come up with some niftier option, what to do?

We punted. Rather than trying to lock down the notebooks we deliver–which would remove their utility for rerunning analyses, anyway–we now deliver both a notebook and an HTML export of that notebook with each analysis. The former file is the tool and the latter is the record. The exported HTML files have everything the notebooks do–nice formatting and scrolling, descriptive text, images, and of course the relevant code–EXCEPT that the code is no longer runnable, which is exactly what we want in a record. (Of course, an HTML file could be explicitly changed by someone motivated to misbehave, but our main concern is preventing accidental corruption.) Here again, nbconvert comes to the rescue: adding one new function, called by one new line at the end of our execute_notebook method, automatically generates the HTML output alongside the new notebook every time we run a notebook-based pipeline (the other methods presented earlier remain unchanged).

# additional third-party libraries
from nbconvert import HTMLExporter

# modified from https://nbconvert.readthedocs.io/en/latest/execute_api.html
def execute_notebook(notebook_filename, notebook_filename_out, params_dict, 
    run_path="", timeout=6000000):
    
    notebook_fp = os.path.join(run_path, notebook_filename)
    nb = read_in_notebook(notebook_fp)
    new_nb = set_parameters(nb, params_dict)
    ep = ExecutePreprocessor(timeout=timeout, kernel_name='python3')

    try:
        ep.preprocess(new_nb, {'metadata': {'path': run_path}})
    except:
        msg = 'Error while executing: "{0}".\n\n'.format(notebook_filename)
        msg = '{0}See notebook "{1}" for traceback.'.format(
                msg, notebook_filename_out)
        print(msg)
        raise
    finally:
        with open(notebook_filename_out, mode='wt') as f:
            nbformat.write(new_nb, f)
        export_notebook_to_html(new_nb, notebook_filename_out)  # the one new line: also write an HTML record


def export_notebook_to_html(nb, notebook_filename_out):
    html_exporter = HTMLExporter()
    body, resources = html_exporter.from_notebook_node(nb)
    out_fp = notebook_filename_out.replace(".ipynb", ".html")
    with open(out_fp, "w", encoding="utf8") as f:
        f.write(body)

And hey, presto, instant HTML records!

Conclusions

The public excitement around Jupyter Notebooks is largely deserved: they’re an extremely versatile new tool for data analysis. For example, here at CCBB we’ve used the techniques discussed in this post to build a self-documenting custom software pipeline for quantifying the results of dual-CRISPR genetic screens. To see this real-world example, check out mali_pipeliner.py and its subsidiary scripts in CCBB’s jupyter-genomics repository on GitHub (and of course don’t miss the notebooks themselves, in the parallel notebooks directory). Our experience has taught us that notebooks are no magic bullet against irreproducible research–but with care on the part of the notebook designer, they can be a strong new weapon nonetheless!

* These two terms mean different things, but there is some confusion about exactly how. While the distinctions aren’t relevant to this discussion, they are elucidated with examples in an excellent post from UPenn entitled Replicability vs. reproducibility — or is it the other way around?
