Reproducible Analysis Through Automated Jupyter Notebook Pipelines

February 1, 2017

Amanda Birmingham (abirmingham at ucsd.edu)

Replicability and reproducibility* have been important components of the scientific method since Boyle and Huygens argued over their vacuum experiments in the 17th century. Since repeating a process the same way every time is one of the things computers do best, one might expect computational biology would outshine lab biology in this critical area. However, early findings incorporating bioinformatic analyses were often published with fulsome details on the wet lab work but barely a mention of the computational efforts.


In a 1986 paper from the Proceedings of the National Academy of Sciences, the fitting code used wasn’t even mentioned in the methods–though it did rank an acknowledgment (after “helpful discussions”)!


Fortunately, the field has improved, but the road from computational ‘methods’ like “Alignments were run” to “Alignments were run with BLAST” to “Alignments were run with BLASTN version 2.2.6 against human” to “Alignments were run with NCBI BLASTN v.2.2.9 using the command blastn -W 7 -q -1 -F F against the NCBI RefSeq release 80 human transcriptome” has been a long one. Comprehensive methods sections in manuscripts, moreover, really shouldn’t be the end of this road for bioinformaticists. Here is another area where computational work should exceed the limits of wet labs, since a perfectly configured and analysis-ready computing environment could be disseminated with a new published work.

Jupyter Notebooks: Friend or Foe?

One option for such distribution is the Jupyter Notebook, a new evolution of the IPython project. It describes itself as “a web-based interactive computing platform” and has widely been hailed as a powerful new weapon in the war for reproducibility. For example, the non-profit Data Carpentry, which provides data analysis training to researchers, offers an entire workshop on “Reproducible Research using Jupyter Notebooks“. (Giving a primer on notebooks themselves is beyond the scope of this post, so if you haven’t experienced them, skim the gentle introduction at Basics of Jupyter Notebook and Python, then visit http://try.jupyter.org/ and click on the *.ipynb file for your favorite language–e.g.,Welcome to Python.ipynb–to try one out yourself.)

Jupyter Notebooks can alternate static code (in a variety of programming languages) with human-readable text, in the literate programming paradigm. However, Jupyter inventor Fernando Perez created them to support the more expansive concept of literate computing, in which the iterative and interactive exploratory computing so critical to investigating a new data set is captured along with a narrative describing its motivations and results. As notebooks are easily savable, modifiable, and extensible, they also offer a convenient tool for rerunning or tweaking previous data analyses. In fact, here at CCBB, we decided to deliver our analyses to customers in Jupyter Notebooks: not merely a faithful record of the precise commands we executed, the notebooks are themselves software tools that can be easily rerun if necessary.


Not merely a faithful record of the precise commands we executed, Jupyter Notebooks are themselves software tools that can be easily rerun if necessary. … Well, almost.


Well, almost. In our experience, notebooks’ greatest strength–their exceptional interactivity–can sometimes feel like their greatest flaw. Two particular issues we encountered were:

  1. Subsequent analysis runs need automation more than interactivity. When running a custom analysis pipeline developed in a Jupyter Notebook over new data, we want to leverage the code+narrative aspect of notebooks (to ensure methods are correctly documented in a human-readable way for every pipeline run) but don’t want to have to open, update, and step through the relevant notebook by hand every time new data comes in.

  2. Official records need immutability more than interactivity. Different cells in a Jupyter Notebook can depend on each other: type x = 3+3 into cell A, run it, and then type print(x) into cell B and run it, and you’ll see the output 6. However, changing the code in cell A to x = 3+25 doesn’t trigger automatic rerunning of cell B–so you still see an output of 6. While this behavior is by design, its side-effect is that careless interactions with a notebook can destroy its value as a computational record (especially since by default notebooks auto-save all changes every few minutes!) See an illustration below, shown as a looping gif:

Of course, one could argue that this just means Jupyter Notebooks aren’t the right tool for codifying analysis pipelines and/or keeping official records (don’t drain pasta through a baseball glove and then complain that you burnt your fingers!) However, given notebooks other advantages–like ability to mix Python and R in the same notebook, easy integration with useful tools both inside and outside bioinformatics (like Anaconda and the Integrated Genomics Viewer(IGV)), and option to present live, interactive notebooks as slides, to name just a few–we were strongly motivated to look for a way around the flaws we had encountered without leaving the Jupyter Notebook ecosystem.

Building a Self-Documenting Analysis Pipeline

Fortunately, we weren’t the first to encounter these issues. In fact, #1 (need to automate notebooks) has already been addressed with the introduction of helpful APIs and extensions like nbconvert, nbformat, and nbparameterise (Note the British spelling!). The first two allow you read and execute Jupyter Notebooks from external Python scripts, while the last provides the critical ability to inject modified variable values before each execution. These tools make it possible to serve two masters, keeping an analysis pipeline’s core functionality in notebooks–where it is easy for non-coders and external reveiwers to examine it–while still allowing a bioinformaticist to run the pipeline automatically from script. Here’s sample code to inject variables and run a notebook from a Python script, writing out the results to a new notebook file:

# standard libraries
import os

# third-party libraries
import nbformat
import nbparameterise
from nbconvert.preprocessors import ExecutePreprocessor


# modified from https://nbconvert.readthedocs.io/en/latest/execute_api.html
def execute_notebook(notebook_filename, notebook_filename_out, params_dict, 
    run_path="", timeout=6000000):
    
    notebook_fp = os.path.join(run_path, notebook_filename)
    nb = read_in_notebook(notebook_fp)
    new_nb = set_parameters(nb, params_dict)
    ep = ExecutePreprocessor(timeout=timeout, kernel_name='python3')

    try:
        ep.preprocess(new_nb, {'metadata': {'path': run_path}})
    except:
        msg = 'Error while executing: "{0}".\n\n'.format(notebook_filename)
        msg = '{0}See notebook "{1}" for traceback.'.format(
                msg, notebook_filename_out)
        print(msg)
        raise
    finally:
        with open(notebook_filename_out, mode='wt') as f:
            nbformat.write(new_nb, f)   


def read_in_notebook(notebook_fp):
    with open(notebook_fp) as f:
        nb = nbformat.read(f, as_version=4)
    return nb


def set_parameters(nb, params_dict):
    orig_parameters = nbparameterise.extract_parameters(nb)
    params = nbparameterise.parameter_values(orig_parameters, **params_dict)
    new_nb = nbparameterise.replace_definitions(nb, params, execute=False)
    return new_nb

… and here’s the super-simple, one-line call to execute this functionality, assuming the notebook to be run (named mynotebook.ipynb) has two variables named x and y that I want to set:

execute_notebook("mynotebook.ipynb","my_new_notebook.ipynb",{"x":6,"y":"blue"})

By the way, if this isn’t your first rodeo, you might be wondering “wait, where and how does one define which variables a notebook has?” Good catch! nbparameterise, which is doing all the heavy lifting on injecting those variable values, looks at the first code cell in the notebook; if that cell contains only simple assignment statements, it takes the variables therein as the ones it can modify. Here’s an example:

Here the cell containing the “x” and “y” variables is the first code cell in the notebook, and it contains only assignments (and comments), so nbparameterise will recognize all its contents as variables that can be injected. (Don’t get carried away and try to put more complex assignments–like the call your_handle = "{0}-{1}".format(y, x)–into that first code cell. If nbparameterise sees anything other than simple assignments of strings or numbers, it will give up on the whole cell and decide there are no variables it can inject!)

Running the execute_notebook command given above produces a new my_new_notebook.ipynb file in the same directory as the first; the new notebook like this:

You can see that all the explanatory text remains, but the variable values in the first code cell have been swapped for the ones we injected–and the subsequent code has been rerun with those new variable values. Nifty!

But wait, we’ve only solved half the problems! Issue #1 is fixed, since we can easily automate a notebook-based pipeline (or a notebooks-based pipeline, although stringing together the execution of multiple notebooks is left as an exercise for the reader–hint: passing values generated by an earlier notebook into a later one can’t be done with nbparameterise but can be done with a temporary file). What about issue #2, the potential for corruption of a notebook-as-record?

Well, I have hopes that someday this will be addressed directly through the notebook, and one option is already available from that direction. If you’re using Python in a Juypter Notebook, you can invoke the %logstart magic with the -o option to record sequentially all Python commands run in the notebook–and their output–to a log file. The log record is faithful even when the notebook is corrupted; this is demonstrated by the log file for our earlier corrupted notebook example, which correctly shows that I actually assigned x twice, once before and once after running the print statement:

# IPython log file

x = 3+3
print(x)
x = 3+25

(Why didn’t this record the output of print(x)? Logging doesn’t include print outputs.) Of course, it also doesn’t include any of the other goodies in the notebook–descriptive text, images, nice formatting–basically all the stuff that makes notebooks notebooks and not just Python scripts 🙂 . So until Fernando and company come up with some niftier option, what to do?

We punted. Rather than trying to lock down the notebooks we deliver–which would remove their utility for rerunning analyses, anyway–we now deliver both a notebook and an HTML export of that notebook with each analysis. The former file is the tool and the latter is the record. The exported HTML files have everything the notebooks do–nice formatting and scrolling, descriptive text, images, and of course the relevant code–EXCEPT that the code is no longer runnable, which is exactly what we want in a record. (Of course, an HTML file could be explicitly changed by someone motivated to misbehave, but our main concern is preventing accidental corruption.) Here again, nbconvert comes to the rescue: adding one new function called by one new line to the end of our execute_run method automatically generates the HTML output alongside the new notebook every time we run a notebook-based pipeline (the other methods presented earlier remain unchanged).

# additional third-party libraries
from nbconvert import HTMLExporter

# modified from https://nbconvert.readthedocs.io/en/latest/execute_api.html
def execute_notebook(notebook_filename, notebook_filename_out, params_dict, 
    run_path="", timeout=6000000):
    
    notebook_fp = os.path.join(run_path, notebook_filename)
    nb = read_in_notebook(notebook_fp)
    new_nb = set_parameters(nb, params_dict)
    ep = ExecutePreprocessor(timeout=timeout, kernel_name='python3')

    try:
        ep.preprocess(new_nb, {'metadata': {'path': run_path}})
    except:
        msg = 'Error while executing: "{0}".\n\n'.format(notebook_filename)
        msg = '{0}See notebook "{1}" for traceback.'.format(
                msg, notebook_filename_out)
        print(msg)
        raise
    finally:
        with open(notebook_filename_out, mode='wt') as f:
            nbformat.write(new_nb, f)
        export_notebook_to_html(new_nb, notebook_filename_out)


def export_notebook_to_html(nb, notebook_filename_out):
    html_exporter = HTMLExporter()
    body, resources = html_exporter.from_notebook_node(nb)
    out_fp = notebook_fileout_name.replace(".ipynb", ".html")
    with open(out_fp, "w", encoding="utf8") as f:
        f.write(body)

And hey, presto, instant HTML records!

Conclusions

The public excitement around Jupyter Notebooks is largely deserved: they’re an extremely versatile new tool for data analysis. For example, here at CCBB we’ve used the techniques discussed in this post to build a self-documenting custom software pipeline for quantifying the results of dual-CRISPR genetic screens. To see this real-world example, check out mali_pipeliner.py and its subsidiary scripts in CCBB’s jupyter-genomics repository on GitHub (and of course don’t miss the notebooks themselves, in the parallel notebooks directory). Our experience has taught us that notebooks are no magic bullet against irreproducible research–but with care on the part of the notebook designer, they can be a strong new weapon nonetheless!

* These two terms mean different things, but there is some confusion about exactly how. While the distinctions aren’t relevant to this discussion, they are elucidated with examples in an excellent post from UPenn entitled Replicability vs. reproducibility — or is it the other way around?

A Step-By-Step Guide to Generating Gene Interaction Networks with GeneMANIA

December 19, 2016

Amanda Birmingham (abirmingham at ucsd.edu)

The excellent network-building web tool GeneMANIA has recently been given a facelift. Its new user interface is minimal and uncluttered–so much so that it took me a little trial and error to figure out what all the buttons and options did! In case you’re in the same boat, here’s the missing manual for how to use the new-and-improved GeneMANIA website to generate a network from your gene list. (To be clear, I don’t speak for the GeneMANIA project and am not associated with it in any way–just a fan 🙂 )

  1. Before beginnning, prepare a list of the genes for which you wish to make a network. (Note that GeneMANIA accepts gene symbols or NCBI Gene IDs, but not Ensembl gene ids.) Ensure each gene identifier is on a separate line, as shown below:

  2. Visit genemania.org. The home page looks like this:

  3. If your organism of interest is not human (the default), click on the small picture of a person and select the desired organism from the resulting drop-down box:

  4. Click in the white text box with the letters “e.g.” in it . When it expands, paste the list of your genes of interest into it (to run GeneMANIA on a sample list of genes, click on the “e.g.”). Any genes that cannot be found in GeneMANIA’s database will be highlighted in red, as shown below. These genes will be ignored in the rest of the analysis, so check to ensure that they do not represent your most interesting genes!

  5. (Optional step) If you want to modify the interactions that are included in your network, click the three-dot icon next to the white text box. This displays all the interaction sources and allows you to select which should be included by checking their boxes:

  6. (Optional step) Clicking on “Customize advanced options” at the bottom of this menu provides the option to modify additional behaviors. For example, in some cases GeneMANIA defaults to adding 20 “related” genes into the network you build (determined based on the input genes’ interactions). You can change this number by sliding the slider under “Max resultant genes”.

  7. Click on the magnifying glass button to begin network creation. The magnifying glass will change to a spiral and the text box to a progress bar while the network creation is in process:

  8. The finished network will appear in the main screen. Genes with identified interactions will be connected, while those without identified interactions will be shown in a row across the bottom of the screen:

    • Genes that you input are shown with cross-hatched circles of a uniform size, while those that were added as “relevant” genes by GeneMANIA are shown with solid circles whose size is proportional to the number of interactions they have.

    • Different kinds of interactions are represented by different colored connector lines. Click on the three horizontal lines icon at the right of the screen to see the the color legend and to select or deselect which interactions are displayed:

    • Mousing over a single gene’s circle highights only its interactions:

    • Clicking on a single gene displays information about it and offers options to remove it from the network or rerun a new network analysis based on only that gene. Click the X at the top right of this info box to make it disappear when you are done with it.

    • Clicking the info button on the left of the screen shows basic information about the network, such as how many genes and interactions it includes, as well as links to the help documentation and other resources. Click again on the info button to clear it from the screen when you are done with it.

    • Clicking the pie chart button on the bottom left of the screen displays the functions associated with genes in the network and their FDR and coverage (as number of genes annotated with that function in the network versus number of genes annotated with that function in the genome):

    Checking any of the functions’ checkboxes will assign colors to those functions; once the functions list is hidden again, genes in the network image that are annotated with the chosen functions have their circles colored with the relevant colors:

    Clicking on the X next to any selected function clears its coloring from the network.

    • To move the network around the screen, click on any background space. A gray circle will appear under the mouse; drag the network to your desired position and release the mouse.

    • GeneMANIA stores your search history. To view that history, click on the circular arrow button on the bottom of the screen:

    From here, you can rerun a past search by clicking on its image, clear an individual past search by clicking on the red X at its top right corner, or clear all history by clicking on the large red button at the far left. To hide history information, click the circular arrow again.

  9. (Optional step) There are several options for modifying the on-screen layout of the network.

    • To reposition a single gene manually, simply grab that gene’s circle and drag it to where you would like it to be. To make the changed network fill up the screen again, click the diagonal arrow button on the left of the screen to reposition the modified network:
    • To redraw the network in a circular layout, click the target button on the left of the screen: . Depending on the size of your network, the re-layout may take a few moments. The resultant layout will look something like this:

    • To redraw the network in a linear layout, click the two downward arrows button on the left of the screen: . Depending on the size of your network, the re-layout may take a few moments. The resultant layout will look something like this:

    • To return to the default layout, click the intertwined arrows button on the left of the screen: . Depending on the size of your network, the re-layout may take a few moments.
  10. Once you have adjusted the network to your satisfaction, you can save it in one of several formats. These options are visible when clicking on the floppy disk button on the left of the screen:

    • Report generates and displays a PDF report of the network analysis, including the GeneMANIA software version, network image, search parameters, interaction sources searched, gene details, and interaction sources in which interactions for these genes were found. It can be downloaded from your browser window like any other PDF once generated.

    • Network image As shown downloads a jpeg of the image as currently shown on the screen.
    • Network image With plain, top labels downloads a jpeg of the network with gene labels shown above their circles rather than in them:

    • Network downloads a text file detailing the connected nodes in the network and the details of their connections:

    • Networks data downloads a text file listing the details (including citation information) for each interaction source used in the network generation:

    • Attributes data downloads a text file listing the attributes identified for each gene.

    Note that this list will be empty (except for the header line) if the Attributes checkbox is not checked in the Networks menu during set-up of the network generation:

    … or if the Max resultant attributes slider is set to zero in the Customize advanced options menu during set-up of the network generation:

    … or if there are no attributes found for the input genes in the selected Attributes sources.

    • Genes data downloads a text file with a list of all genes included in the network, as well as links to their NCBI Gene records and their GeneMANIA-assigned scores in the network:

    • Functions data downloads a text file containing functions found to be identified with genes in the network and the enrichment level of those functions (shown as FDR) based on the number of genes with that function in the network and the number of genes with that function in the genome. This list may be empty except for the header line if no such functions are identified.

    • Interactions data downloads a text file containing much the same information as the Network option but in a slightly different format:

    • Search parameters as text does not appear to function at the moment.
    • Search parameters as JSON downloads a representation of the search parameters for the network in JSON-formatted text:

  11. If you use GeneMANIA for your research, please give credit where credit is due: cite it! The appropriate citation is:

    • Warde-Farley D, Donaldson SL, Comes O, Zuberi K, Badrawi R, Chao P, Franz M, Grouios C, Kazi F, Lopes CT, Maitland A, Mostafavi S, Montojo J, Shao Q, Wright G, Bader GD, Morris Q. The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Res. 2010 Jul;38(Web Server issue):W214-20. doi: 10.1093/nar/gkq537. PubMed PMID: 20576703; PubMed Central PMCID: PMC2896186.

Visualizing the similarity of two networks

December 9, 2016

 

Julia Len (jlen at ucsd.edu)

Introduction

When working with networks, it is often useful to consider how similar two networks are.  There are a number of ways of quantifying network similarity however.  One could simply consider the number of nodes two networks have in common.  However, this would miss any structural similarity, or lack thereof, between the edges.  For example, it is possible for two networks to have completely identical node sets, but have completely disjoint edge sets.  Note however that in order for two networks to share edges, they must share nodes as well (since edges are defined by the nodes they connect).  

In this post, we will introduce a network overlap visualization function (draw_graph_union) in the visJS2jupyter package, and explore a few possible scenarios.

Installation

To install visJS2jupyter, run

    pip install visJS2jupyter

in your terminal. To import the visualizations module, use the statement

    import visJS2jupyter.visualizations as visualizations

in your jupyter notebook. The source code is also available on github here.

Simple example with default parameters

We will now go through a simple example using two 10-node networks whose intersection is exactly 5 nodes. We create two small, random networks using the networkx function ‘connected_watts_strogatz_graph’. Each network will have 10 nodes, with each node initially connected to its 5 nearest neighbors. These connections are then randomly rewired with probability 0.1.

    G1 = nx.connected_watts_strogatz_graph(10,5,.1)

    G2 = nx.connected_watts_strogatz_graph(10,5,.1)

This produces two networks who both have nodes labelled from 0 to 9. Their intersection is then all the nodes for each graph. This is an unexciting case, so let’s relabel some nodes, so that they share only 5 nodes in common. We can do this by relabelling the nodes 0 to 9 of the second graph, G2, to 5 to 14 using the networkx function ‘relabel_nodes’. The code for this is shown below:

    old_nodes = range(5)

    new_nodes = range(10,15)

    new_node_labels = dict(zip(old_nodes,new_nodes))

    G2 = nx.relabel_nodes(G2,new_node_labels)

Now nodes 0 to 4 belong to only G1, nodes 5 to 9 belong to G1 and G2, and nodes 10 to 14 belong to only G2. Let’s see what this looks like by using draw_graph_union:

    visualizations.draw_graph_union(G1,G2)

And that’s it! We get an interactive graph fairly quickly and easily. Notice that the nodes are color-coded and shaped based on which network they belong to. For instance, nodes in the intersection of G1 and G2 are orange and triangular shapes, while nodes which only belong to G1 are red circles, and nodes which only belong to G2 are yellow squares. Also notice that edges found in both G1 and G2 are colored red while all other edges are colored blue.

You can take a look at the sample notebook here.  Notice that hovering over a node pops up a tooltip with information about the node’s name and graph membership.

From the previous example, we saw that draw_graph_union not only depicts the intersection of nodes, but it also visualizes the intersection of edges as well. Let’s take a look at how this works with two networks having identical nodes but but only a few overlapping edges.

Identical nodes, some overlapping edges

We’ll again be using the connected_watts_strogatz_graph to create our two networks, but this time both networks will contain 50 nodes:

    G1 = nx.connected_watts_strogatz_graph(50,5,.1)

    G2 = nx.connected_watts_strogatz_graph(50,5,.1)

This produces two networks with identical nodes and randomly intersecting edges. We want the sets of edges to only intersect over 5 nodes. Python’s built-in set object can help with this. We can get the edges for G1 and G2, convert the lists of edges to sets of edges, and then find their intersection using &. We can then subtract out the intersecting edges from each set of edges.

    edges_1 = set(G1.edges())

    edges_2 = set(G2.edges())

    intersecting_edges = edges_1 & edges_2

    edges_1_disjoint = edges_1 - intersecting_edges

    edges_2_disjoint = edges_2 - interesecting_edges

This produces two disjoint sets of edges. We now want to add back in 5 edges from the intersection into each disjoint set.

    for i in range(0,5):

        new_edge = intersecting_edges.pop()

        edges_1_disjoint.add(new_edge)

        edges_2_disjoint.add(new_edge)

We can then remove the current edges from G1 and add back in the desired edges. We do the same for G2.

    G1.remove_edges_from(edges_1)

    G1.add_edges_from(list(edges_1_disjoint))

Let’s now draw the two graphs using draw_graph_union, but this time let’s customize the graph a bit. The function sets the default color of the nodes to matplotlib’s colormap autumn and the default color of the edges to matplotlib’s colormap coolwarm. However, matplotlib has many wonderful colormaps available that we can choose from (click here for more details). To set the colormap of the nodes and edges, use the arguments node_cmap and edge_cmap. If you decide to change the colormap, make sure to import the matplotlib package:

    import matplotlib as mpl

We can add other customizations as well, such as setting edge width and edge shadows. The function allows for any argument available in visJS_module in the visJS2jupyter package. This allows many potential customizing features for the function. Now, let’s see what this looks like overall:

    visualizations.draw_graph_union(G1,G2,

        node_cmap=mpl.cm.cool,

        edge_cmap=mpl.cm.winter_r,

        edge_width=5,

        edge_shadow_enabled=True,

        edge_shadow_size=2)

As you can see in the graph above, there is now only one set of nodes, all of which are triangle shaped because all the nodes overlap. The edges are mostly colored in green except for 5 edges in blue: the edges in the intersection. Notice that the edges and nodes are colored differently from before and the edges now have added shadows. You can take a look at the interactive notebook with this example here.

One network contains the other network

So far, we’ve seen graphs where there is a small intersection of the nodes and the node sets are equal. What happens if the set of nodes for one graph is a subset of the nodes for the other graph? We’ll take a look at this case now.

We again create two networks using connected_watts_strogatz. Graph 1 will have 50 nodes and graph 2 will have 20 so that all of the second graph’s nodes intersect with graph 1. We will then call draw_graph_union on these two graphs. This time, we use some more features of the function. We can set the name of the nodes for graph 1 and graph 2. Notice that previously when we hovered over a node, the tooltip showed something like “graph 1 + graph 2”. Using the arguments node_name_1 and node_name_2, we can customize what is shown in the tooltip. Pretty cool!

If you’ve played around with some of the example notebooks, you’ve probably noticed that the nodes move around when dragged as if they have a gravitational field. This is the physics_enabled feature. It is set by default for graphs of less than 100 nodes, while it is turned off for any larger graphs. One nice feature is that you can override this by setting the physics_enabled argument to true or false. Let’s turn off this setting for this example.

    visualizations.draw_graph_union(G1,G2,edge_width=5

        node_name_1=”superset”,

        node_name_2=”subset”,

        physics_enabled=False)

In just a couple of lines of code, we have produced an interactive network! Notice that when we hover over a node, it has the name that we set, just like we wanted. You can also see that dragging a node around makes it stay stuck in place, so the physics_enabled setting has been turned off. The example notebook can be found here.

Overall, draw_graph_union provides a quick and easy way to create customizable and interactive visualizations for network similarities, enabling visual assessment of what two networks share and what they don’t.   

Making Heat Maps In R

November 15, 2016

Amanda Birmingham (abirmingham at ucsd.edu)

Heat maps are a staple of data visualization for numerous tasks, including differential expression analyses on microarray and RNA-Seq data. Many people have already written heat-map-plotting packages for R, so it takes a little effort to decide which to use; here I investigate the performance of the six that I found referenced most frequently online.

My main goals (YMMV) beyond basic plotting were to be able to (a) annotate rows and columns with metadata information, (b) include scales and labels in the figure itself (since often figures are reused in presentations without caption information), and (c) do as much label customization as possible with the shallowest learning curve. I also want automatic dendrogram creation, so using ggplot2 or another graphics-only package was out. Note that throughout I have accepted the default colors for every heat map tool, as these are pretty easy to change after the fact if you care.

TL;DR: I recommend using heatmap3 (NB: not “heatmap.3”). It mimics the easy-to-use interface of heatmap.2 but can be extended into more complex settings if you later find you need fine-grained control.

Table of Contents

Table of Contents

Set-Up

This blog post is adapted from a Jupyter Notebook and supporting files that you can download from GitHub and run yourself if you have Jupyter Notebook server with an R kernel installed.

All test “data” used here are just random numbers 🙂

In [1]:
# This line prevents SVG output, which does not play well with export to HTML
options(jupyter.plot_mimetypes = c("text/plain", "image/png" ))
In [2]:
# Load the example "data"
gLogCpmData = as.matrix(read.table("heatmap_test_matrix.txt"))
gLogCpmData
Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9 Sample10 ⋯ Sample47 Sample48 Sample49 Sample50 Sample51 Sample52 Sample53 Sample54 Sample55 Sample56
GeneA 30 67 34 98 32 3 79 15 6 18 97 49 12 6 87 58 55 13 48 28
GeneB 80 70 28 51 74 76 85 98 7 64 45 69 1 9 8 49 97 5 83 66
GeneC 43 36 41 24 71 76 91 50 81 57 21 10 75 35 77 92 85 73 97 12
GeneD 88 66 57 8 11 91 71 84 89 63 12 16 61 42 48 96 26 84 75 78
GeneE 90 57 73 51 86 32 22 78 84 31 37 40 1 37 74 79 89 79 68 42
GeneF 31 87 65 36 64 15 28 89 94 58 58 32 11 18 81 47 27 60 79 91
GeneG 1 93 70 64 98 28 65 96 83 2 51 83 99 3 37 54 6 22 56 19
GeneH 27 18 37 85 59 61 85 3 16 7 38 14 22 66 100 14 99 37 39 51
GeneI 85 35 24 73 21 25 45 80 20 94 34 10 59 40 100 32 40 11 32 4
GeneK 16 55 76 60 40 48 85 90 24 44 66 53 91 23 90 48 61 56 47 74
In [3]:
# Load the example annotation/metadata
gAnnotationData = read.table("heatmap_test_annotation.txt")
gAnnotationData
sample_name subject_drug
1 Sample1 MiracleDrugA
2 Sample2 MiracleDrugA
3 Sample3 MiracleDrugA
4 Sample4 MiracleDrugA
5 Sample5 MiracleDrugB
6 Sample6 MiracleDrugB
7 Sample7 MiracleDrugB
8 Sample8 MiracleDrugB
9 Sample9 MiracleDrugB
10 Sample10 MiracleDrugB
11 Sample11 MiracleDrugB
12 Sample12 MiracleDrugC
13 Sample13 MiracleDrugC
14 Sample14 MiracleDrugC
15 Sample15 MiracleDrugC
16 Sample16 MiracleDrugC
17 Sample17 MiracleDrugC
18 Sample18 MiracleDrugC
19 Sample19 MiracleDrugC
20 Sample20 MiracleDrugA
21 Sample21 MiracleDrugA
22 Sample22 MiracleDrugA
23 Sample23 MiracleDrugA
24 Sample24 MiracleDrugA
25 Sample25 MiracleDrugA
26 Sample26 MiracleDrugA
27 Sample27 MiracleDrugA
28 Sample28 MiracleDrugA
29 Sample29 MiracleDrugA
30 Sample30 MiracleDrugA
31 Sample31 MiracleDrugC
32 Sample32 MiracleDrugC
33 Sample33 MiracleDrugC
34 Sample34 MiracleDrugC
35 Sample35 MiracleDrugC
36 Sample36 MiracleDrugC
37 Sample37 MiracleDrugC
38 Sample38 MiracleDrugC
39 Sample39 MiracleDrugA
40 Sample40 MiracleDrugA
41 Sample41 MiracleDrugA
42 Sample42 MiracleDrugA
43 Sample43 MiracleDrugA
44 Sample44 MiracleDrugA
45 Sample45 MiracleDrugA
46 Sample46 MiracleDrugA
47 Sample47 MiracleDrugA
48 Sample48 MiracleDrugA
49 Sample49 MiracleDrugA
50 Sample50 MiracleDrugA
51 Sample51 MiracleDrugA
52 Sample52 MiracleDrugA
53 Sample53 MiracleDrugA
54 Sample54 MiracleDrugA
55 Sample55 MiracleDrugA
56 Sample56 MiracleDrugA
In [4]:
# Make helper function to map metadata category to color
mapDrugToColor<-function(annotations){
    colorsVector = ifelse(annotations["subject_drug"]=="MiracleDrugA", 
        "blue", ifelse(annotations["subject_drug"]=="MiracleDrugB", 
        "green", "red"))
    return(colorsVector)
}

Table of Contents

heatmap

heatmap is the built-in option for heat maps in R:

In [5]:
# Test heatmap with column annotations
testHeatmap<-function(logCPM, annotations) {    
    sampleColors = mapDrugToColor(annotations)
    heatmap(logCPM, margins=c(5,8), ColSideColors=sampleColors)
}

testHeatmap(gLogCpmData, gAnnotationData)

Not bad, but there are no legends for either the main values or the annotation information.

Table of Contents

heatmap.2

heatmap.2 is an “enhanced” heat map function from the add-on package gplots (not to be confused with ggplot!):

In [6]:
install.packages("gplots")
The downloaded binary packages are in
	/var/folders/hn/rpn4rhms41v939mg20d7w0dh0000gn/T//RtmpjRP53o/downloaded_packages
In [7]:
library(gplots)

# Test heatmap.2 with column annotations and custom legend text
testHeatmap2<-function(logCPM, annotations) {    
    sampleColors = mapDrugToColor(annotations)
    heatmap.2(logCPM, margins=c(5,8), ColSideColors=sampleColors,
        key.xlab="log CPM",
        key=TRUE, symkey=FALSE, density.info="none", trace="none")
}

testHeatmap2(gLogCpmData, gAnnotationData)
Attaching package: 'gplots'

The following object is masked from 'package:stats':

    lowess

I turned off a few of the default options (density.info, trace) to make the graphic a bit less busy. The default main legend is nice, but I don’t see an option to include a legend for the annotation information.

Table of Contents

aheatmap

aheatmap, which stands for “annotated heatmap”, is a heat map plotting function from the NMF package:

In [8]:
install.packages("NMF")
The downloaded binary packages are in
	/var/folders/hn/rpn4rhms41v939mg20d7w0dh0000gn/T//RtmpjRP53o/downloaded_packages
In [9]:
library(NMF)

# Test aheatmap with column annotations
testAheatmap<-function(logCPM, annotations) {    
    aheatmap(logCPM, annCol=annotations[
        "subject_drug"])
}

testAheatmap(gLogCpmData, gAnnotationData)
Loading required package: pkgmaker
Loading required package: registry

Attaching package: 'pkgmaker'

The following object is masked from 'package:base':

    isNamespaceLoaded

Loading required package: rngtools
Loading required package: cluster
NMF - BioConductor layer [OK] | Shared memory capabilities [NO: bigmemory] | Cores 3/4
  To enable shared memory capabilities, try: install.extras('
NMF
')

Yay, legends for both the main data and the annotations! However, note that something weird is going on here: The dendrograms aren’t showing up right. It appears that somehow the body of the heatmap is overlapping with the finer levels of the dendrograms at both top and left. There may be a way to fix this by digging further into the settings of aheatmap, but since I’m looking for something easy to use out-of-the-box, I consider this a disqualifier for my usage.

Table of Contents

pheatmap

pheatmap, where the “p” stands for “pretty”, is the sole function of the package pheatmap:

In [10]:
install.packages("pheatmap")
The downloaded binary packages are in
	/var/folders/hn/rpn4rhms41v939mg20d7w0dh0000gn/T//RtmpjRP53o/downloaded_packages
In [11]:
library(pheatmap)

# Test pheatmap with two annotation options
testPheatmap<-function(logCPM, annotations) {    
    drug_info = data.frame(annotations[,"subject_drug"])
    rownames(drug_info) = annotations[["sample_name"]]
    
    # Assign the column annotation straight from 
    # the input annotation dataframe
    pheatmap(logCPM, annotation_col=drug_info, 
        annotation_names_row=FALSE,
        annotation_names_col=FALSE,
        fontsize_col=5)
    
    # Assign the column annotation to an intermediate
    # variable first in order to change the name 
    # pheatmap uses for its legend
    subject_drug = annotations[["subject_drug"]]
    drug_df = data.frame(subject_drug)
    rownames(drug_df) = annotations[["sample_name"]]
    
    pheatmap(logCPM, annotation_col=drug_df, 
        annotation_names_row=FALSE,
        annotation_names_col=FALSE,
        fontsize_col=5)
}
testPheatmap(gLogCpmData, gAnnotationData)

Again, nice to have legends for both main and annotation information. Note that:

  1. You control the annotation legend title through the variable name, which I consider suboptimal as variable names often do not read nicely as English text.
  2. The function uses row names to match annotations to data, so all data and annotations must be contained in dataframes (not matrices).

Table of Contents

heatmap3

heatmap3 is the central function of the heatmap3 package. Beware that this is different from “heatmap.3”, of which there are numerous versions (e.g., here, here, and here–apparently a lot of people felt heatmap.2 needed an upgrade! I don’t investigate these others here because I haven’t seen them discussed online by users very often.)

In [12]:
install.packages("heatmap3")
The downloaded binary packages are in
	/var/folders/hn/rpn4rhms41v939mg20d7w0dh0000gn/T//RtmpjRP53o/downloaded_packages
In [13]:
library(heatmap3)

# Test heatmap3 with several annotation options
testHeatmap3<-function(logCPM, annotations) {    
    sampleColors = mapDrugToColor(annotations)
    
    # Assign just column annotations
    heatmap3(logCPM, margins=c(5,8), ColSideColors=sampleColors) 
    
    # Assign column annotations and make a custom legend for them
    heatmap3(logCPM, margins=c(5,8), ColSideColors=sampleColors, 
        legendfun=function()showLegend(legend=c("MiracleDrugA", 
        "MiracleDrugB", "?"), col=c("blue", "green", "red"), cex=1.5))
    
    # Assign column annotations as a mini-graph instead of colors,
    # and use the built-in labeling for them
    ColSideAnn<-data.frame(Drug=annotations[["subject_drug"]])
    heatmap3(logCPM,ColSideAnn=ColSideAnn,
        ColSideFun=function(x)showAnn(x),
        ColSideWidth=0.8)
}
             
testHeatmap3(gLogCpmData, gAnnotationData)

This one follows the syntax of heatmap.2, which is good if you already know the latter. However, its added functionality is quite complicated … definitely complicated enough to get me into trouble (e.g., in the second option above, my annotation legend runs into my heat map and I’ve lost the main legend). It may also be complicated enough to get me out of trouble again (e.g., via explicit setting of the legend and/or heat map placement) but it would clearly take more digging.

Table of Contents

annHeatmap2

annHeatmap2 is the core function of the Heatplus package. Unlike other packages discussed in this evaluation, Heatplus is available through the bioconductor bioinformatics software project rather than through CRAN.

In [14]:
# Source bioconductor
source("http://bioconductor.org/biocLite.R")
biocLite("Heatplus")
Bioconductor version 3.3 (BiocInstaller 1.22.3), ?biocLite for help
BioC_mirror: https://bioconductor.org
Using Bioconductor 3.3 (BiocInstaller 1.22.3), R 3.3.0 (2016-05-03).
Installing package(s) 'Heatplus'
The downloaded binary packages are in
	/var/folders/hn/rpn4rhms41v939mg20d7w0dh0000gn/T//RtmpjRP53o/downloaded_packages
Old packages: 'AnnotationDbi', 'DBI', 'IRanges', 'IRdisplay', 'Matrix', 'R6',
  'Rcpp', 'S4Vectors', 'curl', 'devtools', 'digest', 'httr', 'irlba',
  'jsonlite', 'limma', 'manipulate', 'mgcv', 'mime', 'plyr', 'repr',
  'rstudioapi', 'statmod', 'stringi', 'stringr', 'survival', 'withr'
In [15]:
library(Heatplus)

# Test annHeatmap2 with column annotations
testAnnHeatmap2<-function(logCPM, annotations){
    ann.dat = data.frame(annotations[,"subject_drug"])

    plot(annHeatmap2(logCPM, legend=2,
        ann = list(Col = list(data = ann.dat))))
}

testAnnHeatmap2(gLogCpmData, gAnnotationData)

Like heatmap3, annHeatmap2 does metadata annotations as a mini-graph; apparently it doesn’t do such annotations as color bars (?!) It requires that all annotation (and dendrogram, etc) options be passed in as lists, which clearly offers a lot of powerful abilities but is sort of heavy-weight.

Table of Contents

Summary

Below I summarize the features I assessed for these tools:

feature heatmap heatmap.2 aheatmap pheatmap heatmap3 annHeatmap2
source built-in cran cran cran cran bioconductor
can add main legend x x x x x
can control main legend text x x
can add row/col annotations x x x x x x
can specify annotations as table column x x x
can add annotation legend x x x ~
can control annotation legend text ~ x ~
notes no clear advantages pretty nice results with little tinkering Appears UNUSABLE as dendrograms show up wrong, at least in notebook pretty nice results with little tinkering powerful but complicated powerful but complicated; doesn’t support ColSideColors?

Table of Contents

Recommendation

I recommend using heatmap3 (NB: not “heatmap.3”). It mimics the easy-to-use interface of heatmap.2 but can be extended into more complex settings if you later find you need fine-grained control.

Table of Contents

Communities and cliques

November 4, 2016

Brin Rosenthal (sbrosenthal at ucsd.edu)

You probably won’t get far learning about networks and graph theory before coming across communities and cliques in graphs. At first glance, these two concepts are quite similar- they both describe highly connected sets of nodes, after all. There are however situations which are best suited to one or the other. In this post we will explore some similarities and differences between communities and cliques, and a specific problem I came upon which I thought would be easily solved by a community-finding algorithm, but soon realized that cliques were the much better option!

A network community is a set of nodes which have dense connections within the group, and sparse connections outside the group. A large number of community finding algorithms have been developed, which vary in complexity, and have different strengths and weaknesses depending on context, of which we list a small sampling: Fortunato, S, Malliaros, F, Newman, M., Airoldi, E. et al, Karrer, B., Newman, M., Peixoto, T., Johnson, S.. The community-finding algorithms generally optimize some parameter (usually related to the number of within-group and between-group edges). They are also generally stochastic, meaning that you may get slightly different answers on different runs. It can be useful to explore the community structure of many real-world networks, to establish substructure in the graph. Unlike cliques, there are no exact communities in a graph, rather you will get different answers depending on what algorithm you use, and what you are optimizing for.

A clique is in some sense a stronger version of a community. A set of nodes forms a clique (equivalently, a complete subgraph) if all possible connections between nodes exist. A two-node clique is simply two connected nodes. A three node clique is also known as a triangle. Graphs also contain maximal cliques, which are complete subgraphs such that no other node can be added while maintaining completeness. In Figure 1 below, there are only two maximal cliques of size greater than two. There is a maximal clique of size 5 (orange nodes), and a maximal clique of size 3 (green nodes).

Figure 1: Nodes color-coded by clique-membership.

Figure 1: Nodes color-coded by clique-membership.

One of the most commonly used maximal clique-finding algorithms may be found here, and recursively searches through the graph finding all maximal cliques. Depending on structure, graphs may contain a large number of cliques, resulting in a high memory cost for the algorithm.

In my research I have mainly focused on applications of community-finding algorithms to biological networks. However, I recently came upon a problem that was solved much better using a clique-finding approach than a community-finding one. Briefly, we had a matrix of similarity between objects (Figure 2), and we wanted to find sets of objects which were all very dissimilar from each other.

Figure 2: Randomly generated similarity scores between 100 objects

Figure 2: Randomly generated similarity scores between 100 objects

Application of community-finding algorithms resulted in either groups of highly similar objects grouped together (modularity maximization- Figure 3), or groups of objects which were mostly dissimilar, but contained a couple similar objects in the group. In our application, we couldn’t tolerate any similar objects in our groups, so both of these solutions weren’t satisfactory.

Figure 3: Objects clustered by dissimilarity, using modularity maximization algorithm.  Groups are only weakly dissimilar, with many similar pairs existing in the same groups.

Figure 3: Objects clustered by dissimilarity, using modularity maximization algorithm. Groups are only weakly dissimilar, with many similar pairs existing in the same groups.

I then realized that I could reform the problem of finding sets of highly dissimilar objects into a clique-finding framework. I simply created a network from our similarity matrix by connecting nodes whenever they had a similarity level less than a certain tolerance. Once this dissimilarity network was created, it was simply a matter of applying the networkx function find_cliques to the graph. In our network of 100 nodes and 649 edges, there were 362 maximal cliques, with the largest of these maximal cliques containing 4 nodes.

While most community-finding algorithms discretely partition graphs (with some exceptions), nodes can belong to a large number of maximal cliques (see example below- Figure 4). For our problem, this meant we had a large number of maximal cliques to choose from, but many of them contained a lot of the same information.

Figure 4: Five largest maximal cliques outlined in black.

Figure 4: Five largest maximal cliques outlined in black.

We can tune the sizes of our dissimilar sets by changing the threshold of similarity we can tolerate. When we increase the similarity threshold from 0.0 to 1.0, we now have 2840 maximal cliques, and the largest maximal clique contains 6 nodes (Figure 5).

Figure 5: Cliques formed with sets of nearly perfectly dissimilar objects.

Figure 5: Cliques formed with sets of nearly perfectly dissimilar objects.

That’s about all for this post, although please note we have barely scratched the surface of communities and cliques in graphs. Please see the links for more in-depth reading!

Box Plot Blunders

October 17, 2016

Amanda Birmingham (abirmingham at ucsd.edu)

It is a truth universally acknowledged that a scientist in possession of a new dataset must be in want of visualization. Unfortunately, while its certainly better to look at our data than not to, choosing the wrong visual summary can impede good analyses by obscuring relevant features while encouraging unwarranted assumptions. In fact, even the friendly and familiar box plot can mislead the unwary!


Even the friendly and familiar box plot can mislead the unwary, but better options are a simple R command away!


While the limitations of box plots have long been known among statisticians and data scientists, this visualization remains widely used by practicing researchers. Fortunately, box plots’ failings are easy to understand, and better options are a simple R command away.

Box Plot Basics

Box plots, also sometimes known as box-and-whisker plots, summarize a dataset using a box delimited by the data’s first and third quartiles, broken by a band at the median, and flanked by lines (“whiskers”) representing more outlying data. While the exact meaning of the whiskers can vary, the standard R boxplot command and the ggplot2 geom_boxplot command both make the whisker length equal to 1.5 * the interquartile range; any data more outlying than that are displayed as individual points. Here’s a basic box plot of 50 random normally-distributed points using ggplot2:

library(ggplot2)

numPoints = 1000
testData = data.frame(group = factor(rep(c("A"), each=numPoints)), 
                  measurement = rnorm(numPoints, 50, 15))
ggplot(testData, aes(x=group, y=measurement)) + geom_boxplot()

A basic box plot

Now, using only box plots, compare the data generated above with a new mystery dataset:

Comparing data using box plots

What conclusions would you draw? Although the second one has less variation, they look pretty similar …. Might you, for instance, conclude that the second dataset is “normal enough” to compare to the first with a t-test? If so, you’ve fallen prey to a box plot blunder!

Quartile Statistics Aren’t Enough

Here’s how I generated the group B data in above plot:

mysteryData = data.frame(group = factor(rep(c("B"), each=numPoints)), 
                  measurement = c(rnorm(numPoints, 40, 5), 
                  rnorm(numPoints, 60, 5)))
combinedData = rbind(testData, mysteryData)
ggplot(combinedData, aes(x=group, y=measurement)) + geom_boxplot()

For clarity, try visualizing group B a different way:

ggplot(combinedData, aes(x=measurement)) + geom_histogram()

A histogram of group B

Yup, that’s a bimodal distribution! Compare it to group A:

ggplot(normalData, aes(x=measurement)) + geom_histogram()

A histogram of group A

The visual similarity between group B’s box plot and that of the normally distributed group A is due to analogous first, second, and third quartile values, but those quartile statistics don’t capture the important distribution information that shows how different these two data sets really are. Histograms show us that distributional info, but aren’t easy to compare when we have more than a couple of groups. What’s a researcher to do?

The World’s Tiniest Violin

Here’s a different–and much more informative–view of these same two datasets:

A basic violin plot

This nifty visualization, which fits a smoothed distribution to each dataset, mirrors it around the group’s y axis, and plots all groups along the x axis as is done in box plots, is called a “violin plot”. (The reason for this lyrical name should be readily apparent from, for example, the shape of group B’s plot.) Even the most cursory examination of this graphic makes it clear that these two groups have radically different distributions. Happily, violin plots are built right in to ggplot2 and can be created just as easily as box plots:

ggplot(combinedData, aes(x=group, y=measurement)) + geom_violin()

Actually, there’s (at least) one more way in which violin plots can visualize information that is unfortunately hidden by box plots. Scroll back up and look at the y axis scale on the histograms we plotted earlier–what do you see? That’s right: not only do these data sets have different distributions, but they also have very different sizes! Specifically, group B is twice as large as group A. By default, ggplot2 violin plots are scaled to have the same area in each group, but a very small tweak to the command above will instead scale the area to reflect the number of observations in each group:

ggplot(combinedData, aes(x=group, y=measurement)) + 
    geom_violin(scale="count")

A violin plot scaled by number of observations per group

Now we can easily see both distribution and size information about these datasets. One more little tweak adds the same first-through-third quartile information seen in box plots:

ggplot(combinedData, aes(x=group, y=measurement)) + 
    geom_violin(scale="count", draw_quantiles = c(0.25, 0.5, 0.75)) 

A scaled violin plot with 1st, 2nd, and 3rd quartiles marked

Voila (or perhaps “viola”? Sorry, couldn’t resist the pun!)–goodbye, box plot blunders!

Beans and Beyond

Of course, violin plots aren’t the only box plot alternatives. Violin plots still disguise the actual data points in favor of smoothed distributions and quartile statistics, unlike another visualization known as the “beanplot”:

A basic beanplot

Beanplots display the smoothed distribution of the data, a single bold line at the average (not median) of the dataset, and also a small line for every data point. (This is called a “beanplot” because its inventor sees the distribution lines as looking like an enclosing bean pod and the individual observation lines as looking like the beans inside.) In larger datasets like the 3000-point test set we’ve been using, these small lines can blend together as shown here, so let’s look instead at some data that is better suited to a beanplot:

smallNumPoints = 100
smNormalData = data.frame(group = factor(rep(c("A"), each=smallNumPoints)), 
                  measurement = c(rnorm(smallNumPoints-10, 50, 15), 
                  rep(51, 10)))
smBimodalData = data.frame(group = factor(rep(c("B"), each=smallNumPoints)), 
                  measurement = c(rnorm(smallNumPoints, 40, 5), 
                  rnorm(smallNumPoints, 60, 5)))
combinedSmallData = rbind(smNormalData, smBimodalData)

library(beanplot)
beanplot(measurement ~ group, data=combinedSmallData)

A beanplot of a small, repetitive dataset

This dataset is a tenth the size, allowing us to see individual data points more clearly. I’ve also doped 10 repeated measurements into group A, each with a value of 51; since the beanplot scales each observation line by the number of observations with that value, we can easily see the spike of identical, repeated observations that would be invisible in either a box plot or a violin plot.

Unfortunately, currently beanplots aren’t built into ggplot2 (although they can be pieced together using that library’s extensive options, as shown here) and the stand-alone R library that implements them requires learning different command conventions and so forth. It also doesn’t trim the distributions to end at the most outlying data points, as ggplot2’s violin plot command does by default, and it won’t scale the distributions to the size of the dataset as demonstrated above for violin plots. Still, it offers one extra goodie: asymmetrical plotting.

An asymmetrical beanplot

The distribution-mirroring in violin plots is just a visual convention and in my opinion doesn’t really make the data more interpretable. If your data only has two groups (like our test data), the beanplot library allows you to drop the mirroring and instead plotting those groups on two sides of the same axis, as shown above, for easier comparison:

beanplot(measurement ~ group, data=combinedSmallData, side = "both",
    col = list("black", c("grey", "white")))

Note that the asymmetry comes from the simple side = "both" setting; it actually takes more work to set up the colors than the mixed-group display.

Better Living Through Visualization

Hopefully this little tour has convinced you that box plots are not always trustworthy reporters of a dataset’s salient features–but, frankly, neither are violin plots or beanplots in all circumstances. Fortunately, existing R packages make creating multiple different visualizations from the same data (as done here) a cinch. Take an extra minute or two to look at your data using more than one of them in order to get the full story!

Bringing interactivity to network visualization in Jupyter notebooks: visJS2Jupyter

September 30, 2016

Brin Rosenthal (sbrosenthal at ucsd.edu)

Introduction

Data is everywhere these days, and being able to interact with visual representations of that data in real time can help bring it to life.  You have to look no further than the D3 (data-driven-documents) examples page to see this.  If you haven’t spent time browsing through the D3 examples library, I would highly recommend doing so, but be warned it is easy to spend a few captivating hours here! (A few of my favorites: collision avoidancecollapsible force layout,  NCAA march madness predictionspreferential attachment).

 

Unfortunately, D3 is pretty nontrivial to learn, which can be a significant barrier to those of us looking for a quick but awesome solution.  There are some good visualization libraries which are based on D3, and simpler to use.  One of our favorites is vis.js.

If you’re anything like me, you love the fast and flexible development and documentation environment that Jupyter notebooks provide.  But I had been frustrated with the limited interactivity that is available for plotting of data.  While matplotlib, seaborn, and networkx provide nice static ways of graphing data and networks, they left me wanting more.  Python widgets are ok, but a bit clunky (see earlier post…) .

A group of us at the CCBB had the idea to write a tool which would bring the interactivity of D3 (through vis.js) into Jupyter notebook cells.  This turned out to be quite simple.  We repurposed some existing html code from another project, to set the styles of nodes and edges in a network.  We modified this code to allow style arguments to be passed in through a function.  Every time this function is called, a new style_file.html is created, containing the properties set by the user.  This style_file.html is then loaded into the Jupyter cell using the python HTML module, and the network is rendered in the cell.  Once we figured these pieces out, we had a fully interactive graph!  Right there in the Jupyter notebook cell!  We can now freely pan, zoom, click and drag nodes, and even embed more information in the node and edge hover-bubbles.  One of the coolest things about this tool is that it is almost infinitely flexible, and we’ve designed it to work with networkx graph formats- are one of the most standard python graph libraries.

In this post, I’ll walk you through two simple examples of how to use visJS2Jupyter.

 

Installation

To install, run “pip install visJS2jupyter” in your terminal. To import, use the statement “import visJS2jupyter.visJS_module” in your notebook.  Source code for the package may be found here https://github.com/ucsd-ccbb/visJS_2_jupyter.

Use example with default parameters

Now that we have the package installed, we’re going to walk through a very simple use example, using only the default parameters.  First, we need a network to draw.  Let’s make a random one using the networkx function ‘connected_watts_strogatz_graph’.  This network has 30 nodes, each of which is initially connected to 5 nearest neighbors.  Each of these connections randomly rewired with probability 0.2.  We will also need the lists of nodes and edges that comprise this graph. 


    G=nx.connected_watts_strogatz_graph(30,5,.2)
    nodes = G.nodes()
    edges = G.edges()

Next, we will simply construct dictionaries which contain all of the node-specific and edge-specific traits which will be passed to the visualizer.  (Note that we also need to make a node_map here, which maps the names of the nodes in the graph to integers, because of the way visJS interprets node/edge data).

    nodes_dict = [{"id":n} for n in nodes]
    node_map = dict(zip(nodes,range(len(nodes)))) # map to indices for source/target in edges
    edges_dict = [{"source":node_map[edges[i][0]], "target":node_map[edges[i][1]],
                  "title":'test'} for i in range(len(edges))]


Now all that’s left is calling the visualizer function:


    visJS_module.visjs_network(nodes_dict, edges_dict, time_stamp=0)

Done! Now we are free to click, drag, and zoom at will. Note that if you click on a node, that node’s nearest neighbors are highlighted.

visjs2jupyter_basic_example

Now that we have the basic use example under our belt, let’s move on to something more complicated, because there is so much potential here!

More complicated use example

In this example, we will start by mapping some features to node and edge properties.  To map node/edge attributes to properties, simply add the property to the graph as a node/edge-attribute (using nx.set_node_attribute and nx.set_edge_attribute), then use the return_node_to_color function to select which property you would like to map to the node colors.  You can map anything you want to node color, as long as you represent it numerically.  You can also choose which matplotlib colormap  you’d like to use for the mapping.  For example, let’s calculate the node-level clustering coefficient and betweenness centrality and degree for our random network we made above, and add them as attributes.


    # add a node attributes to color-code by
    cc = nx.clustering(G)
    degree = G.degree()
    bc = nx.betweenness_centrality(G)
    nx.set_node_attributes(G,'clustering_coefficient',cc)
    nx.set_node_attributes(G,'degree',degree)
    nx.set_node_attributes(G,'betweenness_centrality',bc)

Now that we’ve added each of these properties as node attributes, let’s map the node colors to betweenness centrality, and use the matplotlib colormap spring_r for our color scheme. We can also set the node transparency, using alpha, (1 = fully opaque, 0 = fully transparent), and we can choose which section of the colormap we’d like to use. Here we’re setting the lowest value of betweenness centrality to 10% of spring_r, and the highest value to 90%. This is useful if you like most of a colormap, but only want to use the part you like (if it starts too light or too dark for example). You can also transform your color scale, using the ‘color_vals_transform’ argument. Valid options are ‘log’, ‘sqrt’, and ‘ceil’.


    node_to_color =   visJS_module.return_node_to_color(G,field_to_map='betweenness_centrality',cmap=mpl.cm.spring_r,
alpha = 1, color_max_frac = .9,color_min_frac = .1)

Now that we have our color mapping, we can fill out nodes_dict, node_map, and edges_dict, as we did in the simple example. This time, however, we will set more node and edge level properties, including:

  • the positions of each node (x and y) using the output from nx.spring_layout
  • The color of each node using our color mapping node_to_color
  • The degree of each node (if degree is passed in, it is used to map node size by default)
  • We’ll pass in dummy values for the node title field (this is what will show up in the hover).
  • The color of each edge (for now we set every edge to be the same color- gray, but you can easily individualize the edge colors too, using visJS_module.return_edge_to_color(…)).

This is the current list of properties you can modify at the node level

  • ‘node_shape’
  • ‘color’
  • ‘border_width’
  • ‘title’ (e.g. the hover information)
  • The default node size is mapped to the node degree, but you can override that default by setting ‘node_size_field’ in the visjs_network function.  For example, simply add a ‘node_size’ key:value entry to the nodes_dict, and call visjs_network with node_size_field = ‘node_size’.
  • ‘degree’: the degree of each node- used for default size mapping
  • All of the above are optional additions to nodes_dict.  Default values will be filled in if they are missing.

 


    pos = nx.spring_layout(G)    
    nodes_dict = [{"id":n,"color":node_to_color[n],
                   "degree":nx.degree(G,n),
                  "x":pos[n][0]*1000,
                  "y":pos[n][1]*1000} for n in nodes
                  ]
    node_map = dict(zip(nodes,range(len(nodes))))  # map to indices for source/target in edges
    edges_dict = [{"source":node_map[edges[i][0]], "target":node_map[edges[i][1]], 
                  "color":"gray","title":'test'} for i in range(len(edges))]

We’ll also pass in some more graph-level properties (properties that aren’t node and edge specific). These include:

  • node_size_multiplier: multiply each node’s size by this (useful if you have very few or very many nodes)
  • node_color_highlight_border
  • node_color_highlight_background
  • node_color_hover_border
  • node_color_hover_background
  • node_font_size
  • edge_arrow_to: Should we draw arrows at the target end?
  • edge_color_highlight
  • edge_color_hover
  • edge_width: how wide should the edges be?
  • physics_enabled, min_velocity, max_velocity: controls the physics of the nodes
  • Time_stamp: This appends the value to the end of the style-file, thus creating a new one instead of writing over the old one.  You need a unique style-file for every network you render within the same Jupyter notebook.

We have mapped most (still working on getting the complete list) of the modifiable fields from visJS network into our package.  You can find documentation on the full list here .


    visJS_module.visjs_network(nodes_dict,edges_dict,time_stamp=1,
                              node_size_multiplier=5,
                              node_size_transform = '',
                              node_color_highlight_border='red',
                              node_color_highlight_background='#D3918B',
                              node_color_hover_border='blue',
                              node_color_hover_background='#8BADD3',
                              node_font_size=25,
                              edge_arrow_to=True,
                              edge_color_highlight='#8A324E',
                              edge_color_hover='#8BADD3',
                              edge_width=3,
                              physics_enabled=True,
                              min_velocity=1,
                              max_velocity=15)

Ok there we go! Now we have drawn a much more interesting network.  Click on the image below to be redirected to the interactive version, hosted on bl.ocks.org.

visjs2jupyter_complex_example

For an even more complicated use case, see this notebook I wrote (http://bl.ocks.org/brinrosenthal/raw/fd7d7277ce74c2b762d3a4d66326215c/).  In this example, we display the bipartite network composed of diseases in The Cancer Genome Atlas (http://cancergenome.nih.gov/), and the top 25 most common mutations in each disease. We also overlay information about drugs which target those mutations. Genes which have a drug targeting them are displayed with a bold black outline. The user may hover over each gene to get a list of associated drugs.

Outputting Beautiful Jupyter Notebooks (R-Kernel Edition)

September 13, 2016

Amanda Birmingham (abirmingham at ucsd.edu)

Jupyter notebooks are wonderful, but eventually you will need to present your work to someone unable (or unwilling) to view it on a notebook server. Unfortunately, there are surprising difficulties in printing or otherwise outputting Jupyter notebooks attractively into a static, offline format. These difficulties are not limited to Python-kernel notebooks: R-kernel notebooks have their own issues. Here’s a description of those issues, and a work-around that doesn’t require learning to modify jinja2 templates.

Table of Contents

Table of Contents

HTML Output: Mangled Graphics Text

At first blush, it looks as though the HTML conversion built into Jupyter notebooks (shown below) works fine for R-kernel notebooks, as no errors are thrown and the output generally looks attractive.

However, as you scroll through your document, you will find that something sinister has happened to plots after the first plot that uses legends/axis labels. For example, in my sample notebook, the first plot with a legend looks great:

… but all subsequent ones have some of their text labels sadly mangled (e.g., look at the legend at the far right of the plot below):

Apparently the cause of this mess is that (a) the Jupyter R kernel, IRKernel, by default outputs all graphics as inline SVG, but (b) the nbconvert tool that Jupyter uses to create HTML doesn’t "honor the ‘isolated’: true flag" in the metadata that tells it to put the SVG in its own iframe (about half of this statement is Greek to me, but feel free to get more details from the horse’s mouth–the nbconvert issue itself is at https://github.com/jupyter/nbconvert/issues/129, and is still open as of 08/25/2016).

Table of Contents

PDF Output: Line Truncation

So, let’s just output as PDF, which also "works" (i.e., doesn’t error out) for R-kernel notebooks, right?

Wrong! In PDFs, it is true that the plots all look lovely:

However, something else has gone off the rails! The HTML output, for all its plot failures, does pretty well with text: it tries (and often succeeds in) coercing tables to fit the screen size, and when that fails adds a scrollbar to allow access to content too wide for the screen:

Unfortunately, no such grace is forthcoming from the PDF output:

As shown by the gray bar at the right of each of these screenshots, long content, whether tabular or textual, simply runs off the edge of the pdf page (a problem, sadly, that plagues all Jupyter notebooks regardless of kernel, as they all use nbconvert to make PDFs). Oh, the humanity!

Table of Contents

Workaround: HTML to PDF Without SVG

Fortunately, this mishigas can be side-stepped with minimal loss of quality and sanity. As described at https://github.com/IRkernel/IRkernel/issues/331, simply add this line to the top of your R-kernel notebook:

options(jupyter.plot_mimetypes = c("text/plain", "image/png" ))

Then be sure to restart your kernel so it takes effect:

What effect? Well, it tells IRKernel to stop trying to output all graphics as SVG (by the way, this is also helpful if you have very LARGE plots that are bloating the on-disk size of your notebook or causing it to hang when rendered). You are instead telling it to make all graphics inline PNGs. PNGs look just slightly different than their SVG counterparts–a little blockier, since the former are raster graphics while the latter are vector graphics. I notice it most in the text:

SVG

PNG

But what if you need a PDF? Well, with the HTML conversion licked, we can now get an acceptable PDF by simply opening the HTML version of the notebook in the browser and using the browser’s ability to "print" it to PDF (as shown here in Chrome):

This gives us unmangled plots (that, by the way, are appropriately placed so they aren’t broken across pages–unlike PDFs created from HTML by tools such as wkhtmltopdf):

It also wraps long text lines:

and fits tables to the page width, where possible:

You still (of course) lose the view of very wide tables, as the HTML scroll bars don’t work in PDF, but at some point you’ve got to accept that you can’t fit 10 pounds in a 5-pound sack!

The "print to PDF" option in the browser also reproduces the appearance of the HTML notebook much more faithfully than Jupyter nbconvert‘s PDF conversion, which imposes a very LaTeX-y format (this of course makes sense, as the nbconvert PDF conversion goes by way of a trip through LaTeX). Finally, "Print to PDF" is also noticeably faster than nbconvert, although the time it takes to generate a PDF either way is unlikely to be a bottleneck!

Visualize and analyze differential expression data in a network

December 16, 2015

In analysis of differential expression data, it is often useful to analyze properties of the local neighborhood of specific genes. I developed a simple interactive tool for this purpose, which takes as input diferential expression data, and gene interaction data (from http://www.genemania.org/). The network is then plotted in an interactive widget, where the node properties, edge properties, and layout can be mapped to different network properties. The interaction type (of the 6 options from genemania) can also be selected.

This notebook will also serve as an example for how to create, modify, visualize and analyze weighted and unweighted gene interaction networks using the highly useful and flexible python package NetworkX (https://networkx.github.io/)

This tool is most useful if you have a reasonably small list of genes (~100) with differential expression data, and want to explore properties of their interconnections and their local neighborhoods.

The interactive ipython notebook version of this post may be accessed here (https://github.com/brinrosenthal/DE_network_visualizer), where you can use the network visualizer with our example data, or insert your own data.

Import a real network (from this experiment http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS4419)

This experiment contains fold change information for genes in an experiment studying ‘alveolar macrophage response to bacterial endotoxin lipopolysaccharide exposure in vivo’. We selected a list of genes from the experiment which had high differential expression, and were enriched for ‘immune response’ and ‘response to external biotic stimulus’ in the gene ontology. This experiment and gene list were selected purely as examples for how to use this tool for an initial exploration of differential expression data.

 

Description of options:

  • focal_node_name: Select gene to focus on (a star will be drawn on this node)
  • edge_threshold: Change the number of edges included in the network by moving the edge_threshold slider. Higher values of edge_threshold means fewer edges will be included in the graph (and may improve interpretability). The threshold is applied to the ‘Weight’ column of DE_network, so the less strongly weighted edges are not included as the threshold increases
  • network_algo: Select the network algorithm to apply to the graph. Choices are:
    • ‘spl’ (shortest path length): Plot the network in a circular tree layout, with the focal gene at the center, with nodes color-coded by log fold-change.
    • ‘clustering coefficient’: Plot the network in a circular tree layout, with nodes color-coded by the local clustering coefficient (see https://en.wikipedia.org/wiki/Clustering_coefficient).
    • ‘pagerank’: Plot the network in a spring layout, with nodes color-coded by page rank score (see https://en.wikipedia.org/wiki/PageRank for algorithm description)
    • ‘community’: Group the nodes in the network into communities, using the Louvain modularity maximization algorithm, which finds groups of nodes optimizing for modularity (a metric which measures the number of edges within communities compared to number of edges between communities, see https://en.wikipedia.org/wiki/Modularity_(networks) for more information). The nodes are then color-coded by these communities, and the total modularity of the partition is printed above the graph (where the maximal value for modularity is 1 which indicates a perfectly modular network so that there are no edges connecting communities). Below the network the average fold-change in each community is shown with box-plots, where the focal node’s community is indicated by a white star, and the colors of the boxes correspond to the colors of the communities above.
  • map_degree: Choose whether to map the node degree to node size
  • plot_border_col: Choose whether to plot the log fold-change as the node border color
  • draw_shortest_paths: If checked, draw the shortest paths between the focal node and all other nodes in blue transparent line. More opaque lines indicate that section of path was traveled more often.
  • coexpression, colocalization, other, physical_interactions, predicted_interactions, shared_protein_domain: Select whether to include interactions of these types (types come from GeneMania- http://pages.genemania.org/data/)

 

Some examples

First let’s look at the graph when ‘spl’ (shortest path length) is selected as the network algo. ADA is the focal node in this case, and it has 4 nearest neighbors (MX1, CD44, FITM1, and CD80). Note that CD44 connects the focal node ADA to many other nodes in the network, as it is an opaque blue line. Also note that there is only one gene with anegative fold change in this gene set (CCL13). The white nodes are genes included by genemania- they are the 20 genes nearest to the input genelist.

 

 spl_ADA

Community detection

When the network_algo button is switched to ‘community’, the louvain modularity maximization algorithm runs on the network, and partitions the nodes into communities which maximize the modularity. In this case (with CXCL10 as the focal node), the nodes are partitioned into 5 groups, with the three largest groups indicated by red, green, and teal circles. While you can see some support for this grouping by eye, the overall graph modularity is 0.33, which is a relatively low value. This means that although groups were found in the graph, the graph itself is not very modular. As a rule of thumb, very modular graphs have modularities of about 0.5 or 0.6.

modularity_CXC10

Below the graph, there is a panel showing the average fold change for the nodes in this community. Since most of the nodes in the input gene list have positive fold changes here, all communities also have positive average fold changes. Were the input gene list to have fewer large fold changes, this would enable you to see if a particular grouping of nodes had significantly higher (or lower) levels of differential expression than alternative groupings.

bar_plot_CXC10

See How CCBB Can Help With Your Bioinformatics Data

Request Free Consult 858-822-6258