Archive for November, 2016

Making Heat Maps In R

November 15, 2016

Amanda Birmingham (abirmingham at

Heat maps are a staple of data visualization for numerous tasks, including differential expression analyses on microarray and RNA-Seq data. Many people have already written heat-map-plotting packages for R, so it takes a little effort to decide which to use; here I investigate the performance of the six that I found referenced most frequently online.

My main goals (YMMV) beyond basic plotting were to be able to (a) annotate rows and columns with metadata information, (b) include scales and labels in the figure itself (since often figures are reused in presentations without caption information), and (c) do as much label customization as possible with the shallowest learning curve. I also want automatic dendrogram creation, so using ggplot2 or another graphics-only package was out. Note that throughout I have accepted the default colors for every heat map tool, as these are pretty easy to change after the fact if you care.

TL;DR: I recommend using heatmap3 (NB: not “heatmap.3”). It mimics the easy-to-use interface of heatmap.2 but can be extended into more complex settings if you later find you need fine-grained control.

Table of Contents

Table of Contents


This blog post is adapted from a Jupyter Notebook and supporting files that you can download from GitHub and run yourself if you have Jupyter Notebook server with an R kernel installed.

All test “data” used here are just random numbers 🙂

In [1]:
# This line prevents SVG output, which does not play well with export to HTML
options(jupyter.plot_mimetypes = c("text/plain", "image/png" ))
In [2]:
# Load the example "data"
gLogCpmData = as.matrix(read.table("heatmap_test_matrix.txt"))
Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9 Sample10 ⋯ Sample47 Sample48 Sample49 Sample50 Sample51 Sample52 Sample53 Sample54 Sample55 Sample56
GeneA 30 67 34 98 32 3 79 15 6 18 97 49 12 6 87 58 55 13 48 28
GeneB 80 70 28 51 74 76 85 98 7 64 45 69 1 9 8 49 97 5 83 66
GeneC 43 36 41 24 71 76 91 50 81 57 21 10 75 35 77 92 85 73 97 12
GeneD 88 66 57 8 11 91 71 84 89 63 12 16 61 42 48 96 26 84 75 78
GeneE 90 57 73 51 86 32 22 78 84 31 37 40 1 37 74 79 89 79 68 42
GeneF 31 87 65 36 64 15 28 89 94 58 58 32 11 18 81 47 27 60 79 91
GeneG 1 93 70 64 98 28 65 96 83 2 51 83 99 3 37 54 6 22 56 19
GeneH 27 18 37 85 59 61 85 3 16 7 38 14 22 66 100 14 99 37 39 51
GeneI 85 35 24 73 21 25 45 80 20 94 34 10 59 40 100 32 40 11 32 4
GeneK 16 55 76 60 40 48 85 90 24 44 66 53 91 23 90 48 61 56 47 74
In [3]:
# Load the example annotation/metadata
gAnnotationData = read.table("heatmap_test_annotation.txt")
sample_name subject_drug
1 Sample1 MiracleDrugA
2 Sample2 MiracleDrugA
3 Sample3 MiracleDrugA
4 Sample4 MiracleDrugA
5 Sample5 MiracleDrugB
6 Sample6 MiracleDrugB
7 Sample7 MiracleDrugB
8 Sample8 MiracleDrugB
9 Sample9 MiracleDrugB
10 Sample10 MiracleDrugB
11 Sample11 MiracleDrugB
12 Sample12 MiracleDrugC
13 Sample13 MiracleDrugC
14 Sample14 MiracleDrugC
15 Sample15 MiracleDrugC
16 Sample16 MiracleDrugC
17 Sample17 MiracleDrugC
18 Sample18 MiracleDrugC
19 Sample19 MiracleDrugC
20 Sample20 MiracleDrugA
21 Sample21 MiracleDrugA
22 Sample22 MiracleDrugA
23 Sample23 MiracleDrugA
24 Sample24 MiracleDrugA
25 Sample25 MiracleDrugA
26 Sample26 MiracleDrugA
27 Sample27 MiracleDrugA
28 Sample28 MiracleDrugA
29 Sample29 MiracleDrugA
30 Sample30 MiracleDrugA
31 Sample31 MiracleDrugC
32 Sample32 MiracleDrugC
33 Sample33 MiracleDrugC
34 Sample34 MiracleDrugC
35 Sample35 MiracleDrugC
36 Sample36 MiracleDrugC
37 Sample37 MiracleDrugC
38 Sample38 MiracleDrugC
39 Sample39 MiracleDrugA
40 Sample40 MiracleDrugA
41 Sample41 MiracleDrugA
42 Sample42 MiracleDrugA
43 Sample43 MiracleDrugA
44 Sample44 MiracleDrugA
45 Sample45 MiracleDrugA
46 Sample46 MiracleDrugA
47 Sample47 MiracleDrugA
48 Sample48 MiracleDrugA
49 Sample49 MiracleDrugA
50 Sample50 MiracleDrugA
51 Sample51 MiracleDrugA
52 Sample52 MiracleDrugA
53 Sample53 MiracleDrugA
54 Sample54 MiracleDrugA
55 Sample55 MiracleDrugA
56 Sample56 MiracleDrugA
In [4]:
# Make helper function to map metadata category to color
    colorsVector = ifelse(annotations["subject_drug"]=="MiracleDrugA", 
        "blue", ifelse(annotations["subject_drug"]=="MiracleDrugB", 
        "green", "red"))

Table of Contents


heatmap is the built-in option for heat maps in R:

In [5]:
# Test heatmap with column annotations
testHeatmap<-function(logCPM, annotations) {    
    sampleColors = mapDrugToColor(annotations)
    heatmap(logCPM, margins=c(5,8), ColSideColors=sampleColors)

testHeatmap(gLogCpmData, gAnnotationData)

Not bad, but there are no legends for either the main values or the annotation information.

Table of Contents


heatmap.2 is an “enhanced” heat map function from the add-on package gplots (not to be confused with ggplot!):

In [6]:
The downloaded binary packages are in
In [7]:

# Test heatmap.2 with column annotations and custom legend text
testHeatmap2<-function(logCPM, annotations) {    
    sampleColors = mapDrugToColor(annotations)
    heatmap.2(logCPM, margins=c(5,8), ColSideColors=sampleColors,
        key.xlab="log CPM",
        key=TRUE, symkey=FALSE,"none", trace="none")

testHeatmap2(gLogCpmData, gAnnotationData)
Attaching package: 'gplots'

The following object is masked from 'package:stats':


I turned off a few of the default options (, trace) to make the graphic a bit less busy. The default main legend is nice, but I don’t see an option to include a legend for the annotation information.

Table of Contents


aheatmap, which stands for “annotated heatmap”, is a heat map plotting function from the NMF package:

In [8]:
The downloaded binary packages are in
In [9]:

# Test aheatmap with column annotations
testAheatmap<-function(logCPM, annotations) {    
    aheatmap(logCPM, annCol=annotations[

testAheatmap(gLogCpmData, gAnnotationData)
Loading required package: pkgmaker
Loading required package: registry

Attaching package: 'pkgmaker'

The following object is masked from 'package:base':


Loading required package: rngtools
Loading required package: cluster
NMF - BioConductor layer [OK] | Shared memory capabilities [NO: bigmemory] | Cores 3/4
  To enable shared memory capabilities, try: install.extras('

Yay, legends for both the main data and the annotations! However, note that something weird is going on here: The dendrograms aren’t showing up right. It appears that somehow the body of the heatmap is overlapping with the finer levels of the dendrograms at both top and left. There may be a way to fix this by digging further into the settings of aheatmap, but since I’m looking for something easy to use out-of-the-box, I consider this a disqualifier for my usage.

Table of Contents


pheatmap, where the “p” stands for “pretty”, is the sole function of the package pheatmap:

In [10]:
The downloaded binary packages are in
In [11]:

# Test pheatmap with two annotation options
testPheatmap<-function(logCPM, annotations) {    
    drug_info = data.frame(annotations[,"subject_drug"])
    rownames(drug_info) = annotations[["sample_name"]]
    # Assign the column annotation straight from 
    # the input annotation dataframe
    pheatmap(logCPM, annotation_col=drug_info, 
    # Assign the column annotation to an intermediate
    # variable first in order to change the name 
    # pheatmap uses for its legend
    subject_drug = annotations[["subject_drug"]]
    drug_df = data.frame(subject_drug)
    rownames(drug_df) = annotations[["sample_name"]]
    pheatmap(logCPM, annotation_col=drug_df, 
testPheatmap(gLogCpmData, gAnnotationData)

Again, nice to have legends for both main and annotation information. Note that:

  1. You control the annotation legend title through the variable name, which I consider suboptimal as variable names often do not read nicely as English text.
  2. The function uses row names to match annotations to data, so all data and annotations must be contained in dataframes (not matrices).

Table of Contents


heatmap3 is the central function of the heatmap3 package. Beware that this is different from “heatmap.3”, of which there are numerous versions (e.g., here, here, and here–apparently a lot of people felt heatmap.2 needed an upgrade! I don’t investigate these others here because I haven’t seen them discussed online by users very often.)

In [12]:
The downloaded binary packages are in
In [13]:

# Test heatmap3 with several annotation options
testHeatmap3<-function(logCPM, annotations) {    
    sampleColors = mapDrugToColor(annotations)
    # Assign just column annotations
    heatmap3(logCPM, margins=c(5,8), ColSideColors=sampleColors) 
    # Assign column annotations and make a custom legend for them
    heatmap3(logCPM, margins=c(5,8), ColSideColors=sampleColors, 
        "MiracleDrugB", "?"), col=c("blue", "green", "red"), cex=1.5))
    # Assign column annotations as a mini-graph instead of colors,
    # and use the built-in labeling for them
testHeatmap3(gLogCpmData, gAnnotationData)

This one follows the syntax of heatmap.2, which is good if you already know the latter. However, its added functionality is quite complicated … definitely complicated enough to get me into trouble (e.g., in the second option above, my annotation legend runs into my heat map and I’ve lost the main legend). It may also be complicated enough to get me out of trouble again (e.g., via explicit setting of the legend and/or heat map placement) but it would clearly take more digging.

Table of Contents


annHeatmap2 is the core function of the Heatplus package. Unlike other packages discussed in this evaluation, Heatplus is available through the bioconductor bioinformatics software project rather than through CRAN.

In [14]:
# Source bioconductor
Bioconductor version 3.3 (BiocInstaller 1.22.3), ?biocLite for help
Using Bioconductor 3.3 (BiocInstaller 1.22.3), R 3.3.0 (2016-05-03).
Installing package(s) 'Heatplus'
The downloaded binary packages are in
Old packages: 'AnnotationDbi', 'DBI', 'IRanges', 'IRdisplay', 'Matrix', 'R6',
  'Rcpp', 'S4Vectors', 'curl', 'devtools', 'digest', 'httr', 'irlba',
  'jsonlite', 'limma', 'manipulate', 'mgcv', 'mime', 'plyr', 'repr',
  'rstudioapi', 'statmod', 'stringi', 'stringr', 'survival', 'withr'
In [15]:

# Test annHeatmap2 with column annotations
testAnnHeatmap2<-function(logCPM, annotations){
    ann.dat = data.frame(annotations[,"subject_drug"])

    plot(annHeatmap2(logCPM, legend=2,
        ann = list(Col = list(data = ann.dat))))

testAnnHeatmap2(gLogCpmData, gAnnotationData)

Like heatmap3, annHeatmap2 does metadata annotations as a mini-graph; apparently it doesn’t do such annotations as color bars (?!) It requires that all annotation (and dendrogram, etc) options be passed in as lists, which clearly offers a lot of powerful abilities but is sort of heavy-weight.

Table of Contents


Below I summarize the features I assessed for these tools:

feature heatmap heatmap.2 aheatmap pheatmap heatmap3 annHeatmap2
source built-in cran cran cran cran bioconductor
can add main legend x x x x x
can control main legend text x x
can add row/col annotations x x x x x x
can specify annotations as table column x x x
can add annotation legend x x x ~
can control annotation legend text ~ x ~
notes no clear advantages pretty nice results with little tinkering Appears UNUSABLE as dendrograms show up wrong, at least in notebook pretty nice results with little tinkering powerful but complicated powerful but complicated; doesn’t support ColSideColors?

Table of Contents


I recommend using heatmap3 (NB: not “heatmap.3”). It mimics the easy-to-use interface of heatmap.2 but can be extended into more complex settings if you later find you need fine-grained control.

Table of Contents

Communities and cliques

November 4, 2016

Brin Rosenthal (sbrosenthal at

You probably won’t get far learning about networks and graph theory before coming across communities and cliques in graphs. At first glance, these two concepts are quite similar- they both describe highly connected sets of nodes, after all. There are however situations which are best suited to one or the other. In this post we will explore some similarities and differences between communities and cliques, and a specific problem I came upon which I thought would be easily solved by a community-finding algorithm, but soon realized that cliques were the much better option!

A network community is a set of nodes which have dense connections within the group, and sparse connections outside the group. A large number of community finding algorithms have been developed, which vary in complexity, and have different strengths and weaknesses depending on context, of which we list a small sampling: Fortunato, S, Malliaros, F, Newman, M., Airoldi, E. et al, Karrer, B., Newman, M., Peixoto, T., Johnson, S.. The community-finding algorithms generally optimize some parameter (usually related to the number of within-group and between-group edges). They are also generally stochastic, meaning that you may get slightly different answers on different runs. It can be useful to explore the community structure of many real-world networks, to establish substructure in the graph. Unlike cliques, there are no exact communities in a graph, rather you will get different answers depending on what algorithm you use, and what you are optimizing for.

A clique is in some sense a stronger version of a community. A set of nodes forms a clique (equivalently, a complete subgraph) if all possible connections between nodes exist. A two-node clique is simply two connected nodes. A three node clique is also known as a triangle. Graphs also contain maximal cliques, which are complete subgraphs such that no other node can be added while maintaining completeness. In Figure 1 below, there are only two maximal cliques of size greater than two. There is a maximal clique of size 5 (orange nodes), and a maximal clique of size 3 (green nodes).

Figure 1: Nodes color-coded by clique-membership.

Figure 1: Nodes color-coded by clique-membership.

One of the most commonly used maximal clique-finding algorithms may be found here, and recursively searches through the graph finding all maximal cliques. Depending on structure, graphs may contain a large number of cliques, resulting in a high memory cost for the algorithm.

In my research I have mainly focused on applications of community-finding algorithms to biological networks. However, I recently came upon a problem that was solved much better using a clique-finding approach than a community-finding one. Briefly, we had a matrix of similarity between objects (Figure 2), and we wanted to find sets of objects which were all very dissimilar from each other.

Figure 2: Randomly generated similarity scores between 100 objects

Figure 2: Randomly generated similarity scores between 100 objects

Application of community-finding algorithms resulted in either groups of highly similar objects grouped together (modularity maximization- Figure 3), or groups of objects which were mostly dissimilar, but contained a couple similar objects in the group. In our application, we couldn’t tolerate any similar objects in our groups, so both of these solutions weren’t satisfactory.

Figure 3: Objects clustered by dissimilarity, using modularity maximization algorithm.  Groups are only weakly dissimilar, with many similar pairs existing in the same groups.

Figure 3: Objects clustered by dissimilarity, using modularity maximization algorithm. Groups are only weakly dissimilar, with many similar pairs existing in the same groups.

I then realized that I could reform the problem of finding sets of highly dissimilar objects into a clique-finding framework. I simply created a network from our similarity matrix by connecting nodes whenever they had a similarity level less than a certain tolerance. Once this dissimilarity network was created, it was simply a matter of applying the networkx function find_cliques to the graph. In our network of 100 nodes and 649 edges, there were 362 maximal cliques, with the largest of these maximal cliques containing 4 nodes.

While most community-finding algorithms discretely partition graphs (with some exceptions), nodes can belong to a large number of maximal cliques (see example below- Figure 4). For our problem, this meant we had a large number of maximal cliques to choose from, but many of them contained a lot of the same information.

Figure 4: Five largest maximal cliques outlined in black.

Figure 4: Five largest maximal cliques outlined in black.

We can tune the sizes of our dissimilar sets by changing the threshold of similarity we can tolerate. When we increase the similarity threshold from 0.0 to 1.0, we now have 2840 maximal cliques, and the largest maximal clique contains 6 nodes (Figure 5).

Figure 5: Cliques formed with sets of nearly perfectly dissimilar objects.

Figure 5: Cliques formed with sets of nearly perfectly dissimilar objects.

That’s about all for this post, although please note we have barely scratched the surface of communities and cliques in graphs. Please see the links for more in-depth reading!

See How CCBB Can Help With Your Bioinformatics Data

Request Free Consult