Analysis of Biostatistical Article

Biostatistics(2014),15,3,pp.413–426 doi:10.1093/biostatistics/kxt053 Advance Access publication on January 6, 2014 Differential expression analysis of RNA-seq data at single-base resolution ALYSSA C. FRAZEE Department of Biostatistics, The Johns Hopkins University Bloomberg School of Public Health, 615 North Wolfe Street, Baltimore, MD 21205, USA SARVEN SABUNCIYAN Department of Pediatrics, The Johns Hopkins University School of Medicine, 600 North Wolfe Street, Baltimore, MD 21287, USA KASPER D. HANSEN, RAFAEL A. IRIZARRY, JEFFREY T. LEEK ∗ Department of Biostatistics, The Johns Hopkins University Bloomberg School of Public Health, 615 North Wolfe Street, Baltimore, MD 21205, USA [email protected] S UMMARY RNA-sequencing (RNA-seq) is a flexible technology for measuring genome-wide expression that is rapidly replacing microarrays as costs become comparable. Current differential expression analysis methods for RNA-seq data fall into two broad classes: (1) methods that quantify expression within the boundaries of genes previously published in databases and (2) methods that attempt to reconstruct full length RNA tran- scripts. The f irst class cannot discover differential expression outside of previously known genes. While the second approach does possess discovery capabilities, statistical analysis of differential expression is complicated by the ambiguity and variability incurred while assembling transcripts and estimating their abundances. Here, we propose a novel method that f irst identif ies differentially expressed regions (DERs) of interest by assessing differential expression at each base of the genome. The method then segments the genome into regions comprised of bases showing similar differential expression signal, and then assigns a measure of statistical signif icance to each region. Optionally, DERs can be annotated using a refer- ence database of genomic features. We compare our approach with leading competitors from both current classes of differential expression methods and highlight the strengths and weaknesses of each. A software implementation of our method is available on github (https://github.com/alyssafrazee/derf inder).

Keywords: Bioinformatics; Differential expression; False discovery rate; Genomics; RNA sequencing. 1.I NTRODUCTION Microarrays revolutionized the way we measure gene expression by providing, for the f irst time, genome- wide transcript-level measurements, wheretranscriptis used here to refer to the molecule associated ∗To whom correspondence should be addressed.c The Author 2014. Published by Oxford University Press.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. 414A. C. F RAZEE AND OTHERS with expression at the RNA level. However, assigning only one measurement to each known gene has greatly over-simplif ied the biological process in two ways. The f irst is that we have not yet discovered or annotated all regions of the genome capable of expressing transcripts. Secondly, most genes pro- duce not one but several transcripts through the process ofalternative splicing(Mortazaviand others, 2008;Trapnelland others,2010;Katzand others,2010). In principle, RNA-sequencing (RNA-seq) provides measurements of transcript expression from which we can obtain a more complete picture of reality. While microarrays rely on hybridization to predef ined probes, by explicitly sequencing transcripts, RNA-seq is potentially capable of measuring expression in regions not previously anno- tated (Guttmanand others,2010;Clarkand others,2011), and to measure multiple transcripts for indi- vidual genes (Trapnelland others,2010;Mortazaviand others,2008). This flexibility, coupled with rapidly declining sequencing costs, has led to explosive growth in the use of RNA-seq technology (Steinand others,2010).

The most common goal among investigators using either microarrays or RNA-seq is detecting differen- tial expression, for example: discovering transcripts showing different average expression levels across two populations. A major difference between the two technologies is that in microarrays, measurement units are f ixed in advance: only the abundances of the specif ic RNA sequences that correspond to probes on the microarrays are measured. With this approach, differential expression is relatively straightforward to quan- tify: measurements from the same probe are compared across samples. In contrast, RNA-seq reads out short sequences of molecules produced by shearing and reading RNA transcripts (the measurements produced are referred to asreads). Unlike with a microarray, across-sample comparisons are not straightforward as measurement units are not def ined in advance. Therefore, reads must be summarized into units of expres- sion before differential expression analysis can be performed. Different summarization approaches can lead to very different statistical inference.

Here, we group the most popular differential expression analysis approaches into two categories based on the ways that the reads are summarized. We refer to these two categories as (1)annotate-then-identify and (2)assemble-then-identify. The f irst category counts the number of reads that fall within previously identif ied boundaries of known genes. The second class seeks to assemble full transcripts directly from the reads. In either case, differential expression analysis is then performed on the resulting measurements at the gene or transcript level. In Section 2, we describe the limitations with the existing approaches and propose a new intermediate class of differential expression methods which we refer to asidentify-then-annotate.In Section 3, we propose a specif ic implementation of the identify-then-annotate class of methods, which we call Differentially Expressed Region Finder (DER Finder). And in Section 4, using an example dataset, we show that identify-then-annotate models provide a good compromise between current RNA-seq analysis methods.

2.D IFFERENTIAL EXPRESSION ANALYSIS REVIEW In this section, we review existing approaches to differential expression analysis with RNA-seq data and discuss how the philosophy behind DER Finder f its into this context. RNA-seq generates millions or bil- lions of short sequences from individual mRNA molecules. Analyzing these sequence reads requires sev- eral steps: First, each read must be matched to the position it originates from in the genome in a process called alignment. Then, the number of reads aligned to specif ic regions must be summarized into quanti- tative measurements. The measurements are then normalized for the total number of reads measured for a particular sample and statistical models are applied to the summarized units.Oshlackand others(2010) describe this RNA-seq data analysis process in much more detail. Based on the summarization step, current statistical methods for the analysis of RNA-seq data can be grouped into two major classes. The methods in the f irst class, which we call annotate-then-identify, summarize the reads by counting the number that fall Differential expression analysis of RNA-seq data415 within pre-specif ied exons or genes. The exon and gene specif ications, collectively called theannotation, are obtained from databases of previously identif ied genomic features.

Once the reads have been summarized at the exon or gene level, the statistical problem is very sim- ilar to statistical analysis of microarray data, with some deviations because the raw measurements take the form of counts. Note that the results from this step can be naturally summarized into matrices like those produced by microarray experiments, where rows are genes or exons and columns are samples.

Therefore, many of the earliest statistical methods for analysis of RNA-seq data fall into this category because they were natural extensions of methods developed for microarrays. Two of the most widely used annotate-then-identify methods are EdgeR (Robinsonand others,2010;McCarthyand others,2012) and DESeq (Anders and Huber,2010); Alexa-seq (Griff ithand others,2010), DEXSeq (Andersand others, 2012), and a method developed byWangand others(2012) are further examples of annotate-then-identify pipelines focusing on differential expression analysis of genomic structures that may indicate splicing or transcriptional differences between groups.

The annotate-then-identify approach provides a straightforward and interpretable analysis and that tested statistical methodology is available once raw read counts have been summarized into a gene-level matrix. However, one disadvantage is that it relies heavily on the accuracy of annotation databases of gene and exon boundaries, and current annotation may be unreliable or hard to interpret (Klimkeand others, 2011). As shown in Figure1(a), the annotated transcript structure at individual genomic loci can be com- plex. Biologically, the distinct but overlapping regions in vertical columns represent a single exon used slightly differently in multiple transcripts. This complexity requires the analyst to make important count- ing decisions in advance, since each distinct use of an exon (represented by a box in Figure1(b)) repre- sents a distinct potential counting region for annotate-then-identify methods. It is well known that different choices in how to count (all regions, only non-overlapping regions, or other choices) may lead to dramat- ically different results (Oshlackand others,2010;Bullardand others,2010), especially for genes whose transcripts have a low degree of similarity. In the case shown in Figure1, using a union model might allow for discovery of whole-gene differential expression, but it may mask a differential expression signal if, say, just one of the transcripts is overexpressed. Also, there is no “correct” gene model to use, so methods requiring this choice are at a disadvantage to those that do not. DER Finder does not require a gene model:

if just a few transcripts or exons are differentially expressed, even in a complex scenario like Figure1 shows, the gene will simply be flagged as displaying a complicated differential expression pattern. This type of result is not possible in a gene-model-based approach. A second disadvantage of annotate-then- identify methods is that they do not allow for discovery of novel or previously uncharacterized exons or genes, since they rely on previously constructed databases.

The methods in the second class, which we call assemble-then-identify, attempt to assemble the full sequences of the mRNA molecules from which the short reads originated. These methods rely less heav- ily on annotation databases of exon or gene boundaries. Another advantage is that assemble-then-identify methods aim to fully quantify all the potential isoforms of mRNA molecules emanating from each gene.

However, the short length of typical sequencing reads leads to inevitable ambiguity when attempting to assemble and quantify abundances of individual mRNA molecules: it is virtually impossible to deter- mine which of many possible sets of assembled transcripts truly generated the observed RNA-seq data.

This ambiguity also leads to varying and structured covariances between transcript measurements within genes, which complicates statistical analysis. There is also an increased computational cost associated with assembling full transcripts, quantifying their abundances, and performing transcript-level statisti- cal tests, when compared with the more direct annotate-then-identify approach. The most widely used algorithm in this category is Cufflinks/Cuffdiff (Trapnelland others,2010,2012); others include Scrip- ture (Guttmanand others,2010 ), and IsoLasso (Liand others,2011). In our experience, the computational cost of transcriptome assembly is non-trivial: running Cufflinks (the transcript assembly step) took approx- imately 5 h on 4 standard cores for each sample, running Cuffmerge (merging 15 assemblies in preparation 416A. C. F RAZEE AND OTHERS 20936000 20938000 20940000 20942000 Genomic Position (a) Annotated Transcripts: Ensembl 61, Chromosome 22 20941800 20941850 20941900 20941950 Genomic Position (b) Close up of Exon Annotation Differences Fig. 1. (a) Structures of annotated transcripts in a 6 kb region of the human genome (corresponding gene ID:

ENSG00000099917). A transcript structure this complex causes problems in annotate-then-identify pipelines, as there is no clear way to determine which transcript or exon generated each read, especially if there is a high degree of overlap between unique features, as shown in (b): here, we zoom in on the exon on the right-hand side of (a) and see four over- lapping yet distinct regions. Biologically, this could indicate a single exon with a varying transcription end site, but analytically, it introduces four potential counting regions and requires a critical counting decision to be made. Using a method like DER Finder eliminates the need for these decisions: if just one transcript or one form of an exon is differentially expressed, the genomic regions that uniquely identify that transcript or exon form will be called differ- entially expressed, and further analysis can be done on the small region to determine the exact phenomenon causing the observed pattern.

for DE analysis) took 1 h 39 min on 4 standard cores, and running Cuffdiff (assigning reads to transcripts and identifying DE) took about 42 h on 4 standard cores, comparing a 9-sample group with a 6-sample group. For comparison, alignment with Tophat took about 30 h per sample on 4 standard cores. Other researchers (Patroand others,2013) have conf irmed that assembly with tools other than Cufflinks also took several hours, and Cufflinks is one of the fastest assembly algorithms. Many of these tools allow the user to avoid the assembly problem by testing known transcripts for differential expression, but they then suffer from the previously mentioned shortcomings of annotate-then-identify methods.

Here, we propose an intermediate class of methods which we call identify-then-annotate. These meth- ods f irst summarize the reads by counting the number of reads with alignments overlapping each indi- vidual base in the genome. Then we form a base-by-base statistic to identify bases that are differentially expressed between groups. Consecutive bases showing a common differential expression signature are grouped into DERs. The unit of statistical analysis is then the DER, which can be evaluated for statisti- cal signif icance using permutation or bootstrap approaches. DERs can then be compared with previous databases of exons and genes to identify: (1) regions of differential expression corresponding to known exons or genes and (2) novel regions of differential expression. Currently, the closest analysis frame- work to an identify-then-annotate method is to combine pipelines: use an existing tool (e.g. rnaSeqMap Le´ sniewska and Okoniewski,2011or an assembler like Cufflinks) to identify expressed genomic regions, then test those regions for differential expression using existing statistical methods (e.g.Anders and Huber, 2010). Another identify-then-annotate pipeline has been proposed in the form of maximum mean discrep- Differential expression analysis of RNA-seq data417 ancy (Stegleand others,2010) but does not have a software implementation available and is designed to test known genes for differential transcript expression—not to be run on an entire genome. We propose a new identify-then-annotate model that builds on the ideas behind the combining-pipelines approach: we feature a full statistical framework for expression detection and differential expression analysis.

The proposed identify-then-annotate model (1) allows for detection of differential expression in regions outside of known exons or genes, (2) allows for direct evaluation of differential expression of known genes and exons, (3) does not incur the added ambiguity and computational cost of assembly from short reads, and (4) can nonetheless detect differential splicing patterns and other expression differences between populations. Also, an identify-then-annotate tool can be used to address several commonly posed research questions at once, including differential expression, splicing analysis, and detection of novel features. For example, we could analyze differential expression of known features with annotate-then-identify tools, then use an assembly tool to detect novel features, then re-run the annotate-then-identify tool to analyze differential expression of the novel features—but an identify-then-annotate tool would address all of these issues at once. The primary disadvantage is that the proposed class of methods does not allow for direct quantif ication of alternative transcription. However, regions of potential alternative transcription can be easily identif ied where a subset of exons for a gene overlaps DERs but another subset does not, and those regions could be explored further with other tools.

3.DER F INDER METHODOLOGY 3.1Base-level statistics The f irst step in DER Finder is quantifying the evidence for differential expression at the nucleotide level.

Since RNA-seq produces reads from mRNA transcripts, rather than directly from the genome, reads must be aligned using a strategy that accounts for reads that span intron-exon boundaries, calledjunction reads.

In identify-then-annotate approaches like DER Finder, these junction reads are treated identically to reads that map directly to the genome when computing coverage. Tophat (Trapnelland others,2009)isanexam- ple of an aligner that appropriately handles junction reads. The user must make choices about mapping parameters to use during the alignment step: for example, some reads will map to more than one genomic location (due to, e.g. repetitive regions or pseudogenes). Non-unique read alignments can either be dis- carded, in which case repetitive regions would not appear to be expressed at all, or kept, which would allow all repetitive regions to appear expressed but would not allow those regions to be distinguished from each other. Whatever alignment strategy and corresponding parameters are used, the result is ultimately a large matrix with rows corresponding to bases and columns corresponding to samples; entries of this matrix are the number of aligned reads from a particular sample that overlap a particular nucleotide. We refer to this matrix as thecoverage matrix.

To quantify differential expression while accounting for biological variability and possible confounders, we f it a linear regression model to each row of the coverage matrix. Specif ically, we let g(Y ij)=α(l j)+ P p=2 βp(lj)X pi+ K k=1 γk(lj)W ik+ε ij,(3.1) whereY ijis coverage for sampleiat locationl j,gis a Box–Cox style transformation (e.g. a log transforma- tion) that makes the linear assumption acceptable,α(l j)represents the baseline gene expression (coverage) level at locationl j,X piis an indicator as to whether sampleifalls into categorypwith category 1 being the reference group (e.g. whenP=2, we have the case/control scenario, whereX 2iis a 0/1 indicator vari- able for whether sampleiis a case or a control),β p(lj)is the parameter of interest quantifying differential expression between categorypand the reference category at locationl j(e.g. in the case/control scenario, if 418A. C. F RAZEE AND OTHERS gis a log transform, thenβ 2(lj)represents the log fold change in expression for cases compared with con- trols),W ik(k=1,...,K) are the values of potential confounders for samplei, which may include sample- specif ic guanine/cytosine content effect (Hansenand others,2012;Rissoand others,2011), sex or other demographic variables, or processing data,γ k(lj)represents the effect of confounderkon gene expression at locationl j, andε ijrepresents residual measurement error at locationl j. Including confounders in this model is optional. We recommend settingW i1to be some measurement of library size for samplei(e.g.

median or 75th percentile of coverage for the sample across all bases).

Our goal is to segment the genome into contiguous regionsAwhereβ p(lj)| =0 for at least onepfor all l j∈A. Instead of modelingβ p(lj)as a functions (for example, with wavelet models or splines), we adopt a modular approach in which we f irst estimateβ p(lj)for each locationl jand then divide the estimates into regions in a separate step. To estimateβ p(lj)along the genome and obtain test statistics from testing the null hypothesis that any of theβ p(lj)=0, we can use methods for estimating regularized linear contrasts (Smythand others,2004), which take a shrinkage approach that is appropriate for small sample sizes and borrows information across bases. Details of this approach are available in supplementary material available atBiostatisticsonline.

3.2Identifying candidate DERs with segmentation In this section, we refer to the aforementioned test statistic resulting from the test for whether anyβ p(lj)=0 ass(l j). (For ease of notation, we omit thejsubscript in the discussion that follows). For most experiments, we expect the functions(l)to be a step function that is mostly 0, since most of the genome is not differen- tially expressed. We do not expects(l)to be smooth because gene expression usually has a clear-cut start and end location. Hidden Markov models (HMMs) are a natural way of modelings, and we describe the specif ics of our implementation here.

We assume that there is an underlying Markov process along the genomeD(l)with three hidden states:

D(l)=0ifα(l)=β(l)=0,D(l)=1ifα(l) =0 andβ(l)=0, andD(l)=2ifβ(l) =0. StateD(l)=0 corresponds to regions producing practically no gene expression. This state will be the most common, as most bases will not be covered by any reads because abundant gene expression is conf ined to a relatively small fraction of the genome. StateD(l)=1 corresponds to regions for which gene expression is observed but does not differ between populations. We are interested in f inding regions in the differentially expressed state,D(l)=2.

We assume thatD(l)is a f irst-order Markov chain with hidden state probabilitiesπ d=Pr(D(l)=d).

We treat the transition matrix as f ixed. As defaults, we set the retain state probabilities as very high with low transition probabilities between states, due to the sparsity of genes in the genome. The hidden state probabilities can be roughly estimated based on the relative frequencies of bases covered or not covered by genes, along with a prior estimate of the number of differentially expressed genes. DER Finder results are largely robust to changes in the prior estimates forπ d(see Section 2.3 of supplementary material available atBiostatisticsonline).

Conditional on the hidden state of each basel, we then assume thats(l)follows a normal distribution.

Specif ically,s(l)|D(l)=d∼N(μ d,σ 2 d). WhenD(l)=0, there is little expression observed for basel,so we model the distribution asN(0,δ), whereδis an arbitrary, very small positive number, to restrict values to very close to zero. We estimateπ 0empirically by calculating the fraction of bases where the average coverage is less than a thresholdc.

The model parameters for statesD(l)=1 andD(l)=2(μ 1,μ 2,σ 2 1, andσ 2 2) can be estimated using a standard two-groups mixture model, f irst proposed for the analysis of differential expression in microarray experiments (Efron,2008). We assume that the statisticss(l)from these two states are drawn from a mixture f(s)=f 1(s)π ∗ 1+f 2(s)π ∗ 2, whereπ ∗ 1+π ∗ 2=1. (Estimates forπ ∗ 1andπ ∗ 2are scaled by the estimate ofπ 0 to obtain estimates for the overall state probabilities,π 1andπ 2, such thatπ 0+π 1+π 2=1.) Each mixture Differential expression analysis of RNA-seq data419 component is again assumed to be normal and can be estimated using the empirical null distribution. We can then directly estimate the most likely path of unobserved statesD(l)based on the observed statistics s(l)using standard estimation techniques for HMMs. Details on the specif ic form of the test statistics, the parameters of the HMM, and validity of HMM assumptions are available in Sections 1–2 of supplementary material available atBiostatisticsonline.

3.3Statistical significance The HMM essentially segments the genome into regions, where a region is def ined as a set of contiguous bases having the same predicted latent state. A region of bases with predicted latent stateD(l)=2is referred to as a candidate DER. Beyond the segmentation step in the DER Finder pipeline, all analysis is done on the region level rather than the base level. Region-level analysis ensures that the number of statistical tests is not unreasonably large (as it would be if we did a formal test at every base) and makes it such that variations in read coverage at individual bases that can arise due to technical artifacts in RNA-seq data will not affect the f inal results.

After segmenting the genome into regions, we assign ap-value to each candidate DER using a permu- tation procedure. In calculating thep-values for each candidate DER, we consider the size of the individual statistics within each region, since regions with very large test statistics are more likely to be truly differ- entially expressed. We apply an approach similar toJaffeand others(2012): f irst, we calculate the average base-level test statistic within each potential DERr:¯ s r= l∈DERr s(l). Note that¯ s ris the region-level test statistic for regionr. In the simple case–control scenario with no confounders, we can assignp-values to DERs with the following permutation procedure:

1. Permute the values of the covariate of interest (X i) for all samples.

2. Re-calculate the base-level statistics using (3.1). Denote these null statistics bys 0(l).

3. Re-run the HMM on thes 0(l)s to identify a set of null DERs, indexed byρand denoted by DER 0 ρ.

4. To form region-level null test statistics, calculate the average base-level statistic within each null DER¯ s 0 ρ= l∈DER 0ρs0(l).

Steps 1–4 are repeatedBtimes, and the empiricalp-value for regionrisp r= (1/ B b=1 Pb) B b=1 Pbρ=1 1(¯ s 0 ρ>¯ s r), whereP bis the number of null DERs for permutationb.This quantity is the percent of null DERs with average statistic as or more extreme than the observed statistic for candidate DERrcalculated on the observed data. Standard false discovery rate calculations can then be applied to adjust thesep-values for multiple testing.

In the case where confounders or additional covariates are included in model (3.1), a straightforward bootstrap extension of this permutation approach can be derived. After assigning statistical signif icance to each region, the DERs can be annotated using a reference database of known genomic features; an example of an annotation procedure can be found in supplementary material available atBiostatisticsonline.

4.R ESULTS :COMPARISON ON REAL DATA Our method is designed for differential expression with biological replicates, but many published experi- ments do not include such replicates (Hansenand others2011). We therefore designed an analysis com- paring brain tissue between nine human males and six human females to assess the competing methods:

the Y chromosome was tested for differential expression between sexes using DER Finder, Edge R, and DESeq (using previously annotated exons) and Cufflinks/Cuffdiff. Specif ic details of the experiment can be found in Section 4 of supplementary material available atBiostatisticsonline. 420A. C. F RAZEE AND OTHERS Two sets of results were obtained: one analysis compared males to females, and the other compared a randomly selected set of f ive of the males to the other four males. We expect virtually all genomic features of the Y chromosome (barring the pseudoautosomal region, pseudogenes, and other irregularities) to be differentially expressed between males and females, since females do not have a Y chromosome, and no genomic features to be differentially expressed between control males.

4.1DER Finder results DER Finder identif ied 534 Y-chromosome regions as differentially expressed (q<0.05) between males and females. Six of these regions were classif ied as underexpressed in males, which we know to be an artifact, but the other 528 were identif ied as overexpressed in males as expected. Additionally, we found 333 novel differentially transcribed regions (q<0.05). These novel transcribed regions ranged in length from 1 to 3814 bases. These regions may indicate noise from the method, but they also may point to regions that should be examined further, either because they have interesting mapability characteristics or because they might truly be expressed and not yet annotated. The 534 DERs pointed to 411 differentially expressed exons, using the criteria outlined in Table 1 of supplementary material available atBiostatis- ticsonline. These 411 exons came from 33 different genes, which means we found those 33 genes to be differentially expressed or indicate an event of interest. In comparing males with each other, we did not identify any differential expression on the Y chromosome: the minimumq-value for the regions found to be differentially expressed in the HMM step was 0.86.

4.2Cufflinks/Cuffdiff results Of 808 assembled transcripts tested for differential expression on the Y chromosome between males and females, the Tophat–Cufflinks–Cuffdiff pipeline found no differentially expressed transcripts. The mini- mumq-value for these assembled transcripts was 0.45. While 736 of these transcripts showed non-zero abundance in males and zero abundance in females, these differences were not found to be statistically sig- nif icant using the Cuffdiff methodology. Similar, too-conservative results were reported in supplementary material available atBiostatisticsonline of the manuscript accompanying the release of Cuffdiff version 2(Trapnelland others2012). In the comparison of normal males, none of the 818 assembled transcripts were called differentially expressed: the minimumq-value was 0.63.

4.3EdgeR and DESeq results Both of these methods tested 433 exons on the Y chromosome for differential expression between males and females. The other annotated exons on the Y chromosome did not have any reads mapping to them or the counting model did not allow any reads to be counted for them. Of these 433 exons, EdgeR classif ied 113 and DESeq classif ied 115 as differentially expressed between males and females (q<0.05). Ninety- seven exons were found by both EdgeR and DESeq. When comparing the males with each other, neither method found any exons to be differentially expressed: allq-values were 1 except for 2 exons withq=0.12 in EdgeR.

4.4Comparison of results across methods DER Finder exhibits performance comparable with that of EdgeR and DESeq, while all three methods out- perform Cufflinks/Cuffdiff. DER Finder also has major advantages over EdgeR and DESeq: DER Finder is agnostic to annotation, which means it can identify differential expression signal in two important cases:

(a) the case where a feature may be slightly mis-annotated or where the read mappings do not quite match up with the feature’s annotation and (b) the case where differential expression exists in regions that do Differential expression analysis of RNA-seq data421 5.06.0 7.0 8.0log2(count+32) (a) chrY: 22737611 22737773 female male 0 2 4 68 t statistic exons states 22737545 22737645 22737745 5.0 5.5 6.0 6.57.0log2(count+32) (b) chrY: 20662506 20662937 female male 01 2 3 45 6 t statistic exons states 20662506 20662606 20662706 20662806 20662909 genomic position genomic position Fig. 2. Cases where DER Finder correctly calls differential expression and annotate-then-identify methods do not.

(a) Example of an exon (from gene EIF1AY, Ensembl exon id ENSE00001435537) whose location appears to be mis- annotated, leading EdgeR and DESeq to underestimate the exon’s abundance and therefore incorrectly call this exon not differentially expressed. (b) Example of a DER (q=0.001) falling outside of an annotated exon, which can be found by DER Finder but not by EdgeR or DESeq. Although there are no annotated exons in this region, we believe this f inding is more than noise because it is supported by the following annotated ESTs: DR001278, BF693629, BF672674, BM683941, BM931807, and CD356860 (GenBank accession numbers. Top panels: single-base resolution coverage (on log2 scale). Middle panels:t-statistics from linear model f it by DER Finder. Bottom panels: exon locations and state calls from DER Finder: light gray=not expressed, black=equally expressed, red or dark gray=overexpressed in men).

not overlap annotated features. Both these scenarios occurred in the dataset studied. Figure2(a) illustrates a case where the location or length of an exon may be incorrectly annotated. In that example, the mis- annotation caused the exon’s expression to be underestimated when counting reads overlapping it. As a result, the statistical tests used in EdgeR and DESeq did not have enough power to call this Y-chromosome exon differentially expressed (both tools reportq=1) between males and females. DER f inder more accu- rately reported the shown DER as overlapping 61.3% of an annotated exon with aq-value=0.001. Also, DER Finder can f ind regions of interest that fall outside of annotated exons (Figure2(b)). Closer inspec- tion of the illustrated region using the UCSC Genome Browser (build: hg19, tracks: UCSC genes, Ref- Seq genes, Human ESTs, Spliced ESTs) reveals no annotated genes in the region, but shows that the expression is supported by six ESTs, providing evidence that the signal here is truly biological rather than simply background noise. Section 5 of supplementary material available atBiostatisticsonline discusses other advantages of DER Finder over identify-then-annotate methods. Additional analysis of the agree- ment between DER Finder’s results and those of Cufflinks/Cuffdiff, EdgeR, and DESeq can also be found in Section 6 of supplementary material available atBiostatisticsonline. DER Finder’s ability to f ind dif- ferential expression signal even in the presence of wrong or missing annotation is a key advantage over annotate-then-identify methods.

To assess whether the tools give reasonable results and to compare overall performance of the three methods,MAplots (Dudoitand others2002) were used to show the relationship between each unit’s aver- age expression (denoted withM) and the magnitude of differential expression it exhibits (denoted with A). The unit is a region for DER Finder, an exon for EdgeR and DESeq, or a transcript for Cufflinks. The 422A. C. F RAZEE AND OTHERS Fig. 3.MAplots for Y-chromosome regions, transcripts, or exons, for each method and for both male vs. female (red) and male vs. male (blue) comparisons. On each plot, thex-axis represents the average log (base 2) abun- dance for each unit (region for DER Finder, transcript for Cufflinks, exon for EdgeR and DESeq), and they-axis represents the log (base 2) fold change between males and females (red points) or the two groups of males (blue points). We expect to see the red, positively sloped diagonal on all plots: this represents genomic regions expressed in males but not in females. In DER Finder, EdgeR, and DESeq, this diagonal corresponds with differential expres- sion detected, however, no differential expression was detected in Cufflinks even though the red diagonal exists as expected. The displayedMandAvalues for EdgeR and DESeq are normalized. Specif ically, the EdgeR plot islogCPMvs.logFC,wherelogCPMis log 2counts-per-million andlogFCis the log 2fold change (male to female); both are normalized for library size and dispersion and are reported in the output of theexactTestfunc- tion. The DESeq plot is(log 2(baseMeanA+0.5)+log 2(baseMeanB+0.5))/2vslog 2(baseMeanA+0.5)− log 2(baseMeanB+0.5),wherebaseMeanAandbaseMeanBrepresent library-size-normalized counts for males and females, respectively, and are reported in the output table from the functionnbinomTest.SincebaseMeanA andbaseMeanBwere sometimes 0, we added 0.5 as an offset to avoid calculating log 2(0). Differential expression analysis of RNA-seq data423 0.6 0.7 0.8 0.9 1.0 p value percentile Percent from Male vs. Female Comparisons 1 0.9 0.8 0.7 0.6 DER Finder Cufflinks EdgeR DESeq Fig. 4. Percentage of signif icantly DERs/transcripts/exons originating from male-to-female comparisons, using various percentiles of thep-value distribution as a signif icance cutoff. We f ind that most highly signif icant results are true positives, i.e. results with lowp-values and high test statistics stem from comparing males with females, for DER Finder, EdgeR, and DESeq, while Cufflinks exhibits problems in this area.

MAplots resulting from the Y-chromosome experiment (Figure3) reveal that DER Finder, EdgeR, and DESeq all produce reasonable results, but the f indings from Cufflinks are somewhat problematic. While there does seem to be more overexpression of transcripts in males in the male/female differential expres- sion analysis done by Cufflinks, we observe several extreme fold changes in the opposite direction, and the male-to-male comparison also produced these extreme fold changes. These problems do not exist in the other methods, whoseMAplots illustrate high fold changes found between males and females and very little change found between males, as expected.

Finally, to get a sense of each method’s accuracy, we evaluated the tables of DERs between sexes and between males produced by each method. We gathered all resulting regions—both negative results, from the male vs. male comparison, and positive results, from the male vs. female comparison—and ordered them by the value of their test statistic. An algorithm ranking all positive results ahead of the negative ones is preferred. Figure4shows, at each percentile of the differential expression test statistic, the percent of regions that are positive. This is analogous to f inding the percentage of f indings that were truly positive at different signif icance cutoffs, assuming all tests in the sex comparison should be positives and tests in the male comparison should be negatives. We f ind that EdgeR, DESeq, and DER Finder perform comparably:

all or most of the top 20% of regions, ranked by test statistic, came from comparisons between sexes.

Cufflinks does much worse: only about 60% of the top 20% of their top transcripts came from the male- to-female comparison. DER Finder performs just slightly better than EdgeR and DESeq in addition to having other advantages over these methods, as discussed earlier. 424A. C. F RAZEE AND OTHERS 5.D ISCUSSION We propose DER Finder as a specif ic implementation of a new class of methods for differential expression analysis of RNA-seq data. The new class deals with identif ied challenges by (a) not relying on existing annotation when calling differential expression and (b) avoiding the immensely diff icult problem of full transcript assembly by putting differential expression into a more straightforward framework. We have built on the ideas behind the approach of combining pipelines (e.g. rnaSeqMap combined with DESeq) to create a full pipeline for statistical analysis of differential expression. DER Finder outperforms Cufflinks/Cuffdiff and performs comparably with EdgeR and DESeq, while having the added advantages of sensitivity even in the presence of incorrect annotation and transcript discovery capability. We have also considered DER Finder’s performance in other scenarios: a simulation study is presented in Section 7 of supplementary material available atBiostatisticsonline that addresses experimental design questions and examines DER Finder’s accuracy. An identify-then-annotate method like DER Finder is an important step in developing new ways to analyze RNA-seq data, so further properties of these types of methods are worth investigating.

6.S OFTWARE All software and code used in this analysis is available on github (https://github.com/ alyssafrazee/derf inder).

S UPPLEMENTARY MATERIAL Supplementary material is available athttp://biostatistics.oxfordjournals.org.

A CKNOWLEDGMENTS We acknowledge helpful discussions with Geo Pertea and Steven Salzberg. Postmortem brain tissue was donated by The Stanley Medical Research Institute’s brain collection. We also thank Ms Ou Chen for her technical assistance with high throughput sequencing.Conflict of Interest:None declared.

F UNDING R.A.I. was partially supported by NIH Ro1 HG005220. J.T.L. and K.D.H. were partially funded by NIH Ro1 GM105705-01. J.T.L. was partially funded by NIH P50 MH-094268 Silvo O. Conte Center. S.S. was provided by the Stanley Brain Research Institute. Funding to pay the Open Access publication charges for this article was provided by Jeffrey Leek’s discretionary fund.

R EFERENCES ANDERS ,S. AND HUBER ,W.(2010). Differential expression analysis for sequence count data.Genome Biology11, R106.

A NDERS ,S.,R EYES ,A. AND HUBER ,W.(2012). Detecting differential usage of exons from RNA-seq data.Genome Research22(10), 2008–2017.

B ULLARD ,J.,P URDOM ,E.,H ANSEN ,K.D. AND DUDOIT ,S.(2010). Evaluation of statistical methods for normalization and differential expression in mRNA-seq experiments.BMC Bioinformatics11. R package version 1.8.0. Differential expression analysis of RNA-seq data425 CLARK ,M.B.,A MARAL ,P.P.,S CHLESINGER ,F.J.,D INGER ,M.E.,T AFT ,R.J.,R INN ,J.L.,P ONTING ,C.P.,S TADLER , P. F. , M ORRIS ,K.V.,M ORILLON ,A.and others. (2011). The reality of pervasive transcription.PLoS Biology9(7), e1000625.

D UDOIT ,S.,Y ANG ,Y.H.,C ALLOW ,M.J. AND SPEED ,T.P.(2002). Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments.Statistica Sinica12(1), 111–140.

E FRON ,B.(2008). Microarrays, empirical Bayes and the two-groups model.Statistical Science23(1), 1–22.

G RIFFITH ,M.,G RIFFITH ,O.L.,M WENIFUMBO ,J.,G OYA ,R.,M ORRISSY ,A.S.,M ORIN ,R.D.,C ORBETT ,R.,T ANG , M. J., H OU,Y.,P UGH ,T.J.and others. (2010). Alternative expression analysis by RNA sequencing.Nature Methods 7(10), 843–847.

G UTTMAN ,M.,G ARBER ,M.,L EVIN ,J.Z.,D ONAGHEY ,J.,R OBINSON ,J.,A DICONIS ,X.,F AN,L.,K OZIOL , M. J., G NIRKE ,A.,N USBAUM ,C.and others. (2010). Ab initio reconstruction of transcriptomes of pluripo- tent and lineage committed cells reveals gene structures of thousands of lincRNAs.Nature Biotechnology28(5), 503–510.

H ANSEN ,K.D.,I RIZARRY ,R.A. AND ZHIJIN ,W U. (2012). Removing technical variability in RNA-seq data using conditional quantile normalization.Biostatistics13(2), 204–216.

H ANSEN ,K.D.,W U,Z.,I RIZARRY ,R.A. AND LEEK ,J.T.(2011). Sequencing technology does not eliminate biological variability.Nature Biotechnology29(7), 572–573.

J AFFE ,A.E.,M URAKAMI ,P.,L EE,H.,L EEK ,J.T.,F ALLIN ,M.D.,F EINBERG ,A.P. AND IRIZARRY ,R.A.(2012). Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies.International Journal of Epidemiology41(1), 200–209.

K AT Z ,Y.,W ANG ,E.T.,A IROLDI ,E.M. AND BURGE ,C.B.(2010). Analysis and design of RNA sequencing experi- ments for identifying isoform regulation.Nature Methods7(12), 1009–1015.

K LIMKE ,W.,O’D ONOVAN ,C.,W HITE ,O.,B RISTER ,J.R.,C LARK ,K.,F EDOROV ,B.,M IZRACHI ,I.,P RUITT ,K.D. AND TATUSOVA ,T.(2011). Solving the problem: genome annotation standards before the data deluge.Standards in Genomic Sciences5(1), 168.

L E´SNIEWSKA ,A. AND OKONIEWSKI ,M.J.(2011). rnaSeqMap: a Bioconductor package for RNA sequencing data exploration.BMC Bioinformatics12(1), 200.

L I,W.,F ENG ,J. AND JIANG ,T.(2011). Isolasso: a lasso regression approach to RNA-seq based transcriptome assembly.

Journal of Computational Biology18(11), 1693–1707.

M CCARTHY ,D.J.,C HEN ,Y. AND SMYTH ,G.K.(2012). Differential expression analysis of multifactor RNA-seq experiments with respect to biological variation.Nucleic Acids Research40(10), 4288–4297.

M ORTAZAVI ,A.,W ILLIAMS ,B.A.,M CCUE,K.,S CHAEFFER ,L. AND WOLD ,B.(2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq.Nature Methods5(7), 621–628.

O SHLACK ,A.,R OBINSON ,M.D. AND YOUNG ,M.D.(2010). From RNA-seq reads to differential expression results.

Genome Biology11(12), 220.

P AT R O ,R.,M OUNT ,S.M. AND KINGSFORD ,C.(2013). Sailf ish: alignment-free isoform quantif ication from RNA-seq reads using lightweight algorithms. Preprint, arXiv:1308.3700.

R ISSO ,D.,S CHWARTZ ,K.,S HERLOCK ,G. AND DUDOIT ,S.(2011). GC-content normalization for RNA-seq data.BMC Bioinformatics12(1), 480.

R OBINSON ,M.D.,M CCARTHY ,D.J. AND SMYTH ,G.K.(2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.Bioinformatics26(1), 139–140.

S MYTH ,G.K.(2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments.Statistical Applications in Genetics and Molecular Biology3(1), 3. 426A. C. F RAZEE AND OTHERS STEGLE ,O.,D REWE ,P.,B OHNERT ,R.,B ORGWARDT ,K. AND R¨ATSCH ,G.(2010). Statistical tests for detecting differ- ential RNA-transcript expression from read counts.Nature Precedings.

S TEIN ,L.D.(2010). The case for cloud computing in genome informatics.Genome Biology11(5), 207.

T RAPNELL ,C.,H ENDRICKSON ,D.G.,S AUVAGEAU ,M.,G OFF ,L.,R INN ,J.L. AND PACHTER ,L.(2012). Differential analysis of gene regulation at transcript resolution with RNA-seq.Nature Biotechnology31(1), 46–53.

T RAPNELL ,C.,P ACHTER ,L. AND SALZBERG ,S.L.(2009). TopHat: discovering splice junctions with RNA-Seq.Bioin- formatics25(9), 1105–1111.

T RAPNELL ,C.,W ILLIAMS ,B.A.,P ERTEA ,G.,M ORTAZAVI ,A.,K WA N ,G., VA N BAREN ,M.J.,S ALZBERG ,S.L.,W OLD , B. J. AND PACHTER ,L.(2010). Transcript assembly and quantif ication by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation.Nature Biotechnology28(5), 511–515.

W ANG ,W.,Q IN,Z.,F ENG ,Z.,W ANG ,X. AND ZHANG ,X.(2012). Identifying differentially spliced genes from two groups of RNA-seq samples.Gene518, 164–170.

[Received December 6, 2012; revised September 19, 2013; accepted for publication November 2, 2013]