New Results
Understanding sequencing data as compositions: an outlook and review
doi: https://doi.org/10.1101/206425
Abstract
Motivation Although seldom acknowledged explicitly, count data generated by sequencing platforms exist as compositions for which the abundance of each component (e.g., gene or transcript) is only coherently interpretable relative to other components within that sample. This property arises from the assay technology itself, whereby the number of counts recorded for each sample is constrained by an arbitrary total sum (i.e., library size). Consequently, sequencing data, as compositional data, exist in a non-Euclidean space that renders invalid many conventional analyses, including distance measures, correlation coefficients, and multivariate statistical models.
Results The purpose of this review is to summarize the principles of compositional data analysis (CoDA), provide evidence for why sequencing data are compositional, discuss compositionally valid methods available for analyzing sequencing data, and highlight future directions with regard to this field of study.
1 From raw sequences to counts
Automated Sanger sequencing served as the primary sequencing tool for decades, ushering in significant accomplishments including the sequencing of the entire human genome ([50]). Since the mid-2000s, however, attention has shifted away from this "first-generation technology" toward new technologies collectively known as next-generation sequencing (NGS) ([50]). A number of NGS products exist, each differing in the sample preparation required and chemistry used ([50]). Although each product tends toward a different application, they all work by determining the base order (i.e., sequence) from a population of nucleotides, such that it becomes possible to estimate the abundances of unique sequences ([50]). However, these sequence abundances are not absolute abundances because the total number of sequences measured by NGS technology (i.e., the library size) ultimately depends on the chemistry of the assay, not the input material.
Depending on the input material, NGS has many uses. These include (1) variant discovery, (2) genome assembly, (3) transcriptome assembly, (4) epigenetic and chromatin profiling (e.g., ChIP-seq, methyl-seq, and DNase-seq), (5) meta-genomic species classification or gene discovery, and (6) transcript abundance quantification ([50]). The application of NGS to catalog transcript abundance is better known as RNA-Seq ([50]) and can be used to estimate the portional presence of transcript isoforms, gene archetypes, or other features. RNA-Seq works by taking a population of (total or fractionated) RNA, converting it to a library of cDNA fragments, optionally amplifying the fragments, then sequencing those fragments in a "high-throughput manner" ([73]). When sequencing smaller RNA (e.g., microRNA), an additional size selection step is used to ensure a uniform size of the RNA product ([36]).
The result of RNA-Seq is a virtual "library" of many short sequence fragments that are converted to a numeric data set through alignment (most often to a previously established reference genome or transcriptome) and quantification ([33]). The alignment and quantification steps summarize the raw sequence data (i.e., reads) as a "count matrix", a table containing the estimated number of times a sequence successfully aligns to a given reference annotation. The "count matrix" therefore provides a numeric distillation of the raw sequence reads collected by the assay; as such, it constitutes the data routinely used in statistical modeling, including differential expression analysis ([33]). Two factors complicate alignment and quantification. First, assembled references (e.g., genomes or transcriptomes) are only just references: sequences measured from biological samples will have an expected amount of variation, either systematic or random, when compared with the reference. This variation necessitates that the alignment procedure accommodates (at least optionally) a certain amount of mismatch ([16]). Second, some reads (notably short reads) can ambiguously map to multiple reference sites, an undesired outcome that is amplified by mismatch tolerance ([16]). Many alignment and quantification methods exist (e.g., TopHat ([68]), STAR ([18]), Salmon ([51]), and others) and are reviewed elsewhere (e.g., [29]; [27]; [42]; [21]; [34]; [9]; [72]; [8]).
The "count matrix" (or equivalent) produced by alignment and quantification is routinely analyzed using statistical hypothesis testing (e.g., generalized linear models) or data science techniques (e.g., clustering or classification). Most commonly, data are studied using differential expression analysis, a constellation of methods that seek to identify which unique sequence fragments (if any) differ in abundance across the experimental condition(s). Like alignment and quantification, many differential expression methods exist (e.g., Cufflinks ([69]), limma ([55]), edgeR ([56]), DESeq ([7]), and others) and are reviewed elsewhere (e.g., [17]; [54]; [63]; [26]; [35]; [59]; [61]; [65]; [49]). However, it is important to note that conclusions drawn from RNA-Seq data appear to have a certain "robustness" to the choice of alignment and quantification method, such that the choice of differential expression method impacts the final result most ([75]).
The focus of this review is not to elaborate on the niceties of alignment, quantification, or differential expression, but rather to discuss the relative (i.e., compositional) nature of sequencing count data and the implications this has on many analyses (including differential expression analysis). In this review, we show how sequencing count data measure abundances as portions, rendering many conventional methods invalid. We then discuss methods available for dealing with portional data. Finally, we conclude by discussing challenges specific to these analyses and by considering advancements to this field of study. Although we emphasize RNA-Seq data throughout this paper, the principles discussed here apply to any NGS abundance data set.
2 Counts as parts of a whole
2.1 Image brightness as portions
As an analogy, let us imagine that we instructed two photographers to take a series of black and white photographs using a digital camera. We can represent the captured images as a set of N-dimensional vectors where each element (i.e., pixel) records the amount of light that hit a corresponding part of the film sensor. Considering this data set, let us ask a pointed experimental question: which photographer captured their photographs in brighter light? Better yet, for which pixels, on average, did Photographer A capture brighter light than Photographer B?
At first glance, this appears straightforward. However, we want to know about the amount of light present when the photograph was taken, not the amount of light recorded by the film sensor. Although related, many factors influence the light measured at a given pixel. These include, for example, exposure time, aperture diameter, and the sensitivity of the film sensor. Changing any one of these parameters will change the image. Of course, such a change in the image does not mean a change in the reality.
At each pixel, we could then define two variables: luminance, the amount of light present at the moment of the photograph, and brightness, the amount of light perceived by the film sensor. Intuitively, we can understand brightness (the observed value, o) as a function, f, of luminance (the actual value, a):
$$o = f(a)$$
Even if we do not know the function, f, that relates these two measures, we see here that the total brightness recorded (i.e., Σ o) is an artifact of the conditions under which the luminance is measured. However, if we can assume that the film sensor responds proportionally to light and does not clip (an unrealistic and idealized assumption), then the portional brightness would equal the portional luminance:
$$\frac{o_i}{\sum o} = \frac{a_i}{\sum a}$$
In this scenario, we can understand each element of o as a portion of the whole. As such, the brightness of a single pixel is only meaningful when interpreted relative to the total brightness (or to the brightness of the other pixels). Importantly, it follows that the ratio of any two parts of brightness will equal the ratio of the corresponding two parts of luminance.
2.2 Sequence abundance as portions
RNA-Seq data, through alignment and quantification, measure transcript abundance as counts. However, like the brightness of a digitized image, the amount of RNA estimated for each transcript depends on factors other than the amount of RNA molecules present in the assayed cell. Like a photograph, it is possible to change the observed magnitude while keeping the actual input the same. As such, RNA-Seq count data are not actually counts per se, but rather portions of a whole.
In fact, this is a property of all NGS abundance data: the abundances for each sample are constrained by an arbitrary total sum (i.e., the library size) ([63]). Since the library size is arbitrary, the individual values of the observed counts are irrelevant. However, the relative abundances of the observed counts still carry meaning. We can understand this by considering how, for a given sample, o, the library size (i.e., Σ o) cancels for a ratio of any two transcripts, i and j:
$$\frac{o_i / \sum o}{o_j / \sum o} = \frac{o_i}{o_j}$$
Analogous to how the relationship between luminance and brightness is unique to each photograph, the relationship between the actual abundances and the observed abundances is unique to each sample. Each independent sample, whether derived from a human subject or a cell line, may have undergone systematic or random differences in processing at any stage of RNA extraction, library preparation, or sequencing, causing between-sample biases ([63]). As such, library sizes typically differ between samples, making direct comparisons impossible ([63]). Yet, because the counts are portions of a whole, the interpretation is complicated even when library sizes are constant. For example, a large increase (or large decrease) in only a few transcripts will necessarily lead to a decrease (or increase) in all other measured counts ([63]). Figure 1 provides an abstracted visualization of how this might happen.
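The two points made above (the library size cancels in any ratio, and a change in one transcript shifts all other observed counts) can be illustrated with a minimal sketch. The gene names, abundances, and the idealized `sequence` function are hypothetical assumptions introduced only for illustration; real assays add sampling noise and bias that are ignored here.

```python
# Hypothetical absolute transcript abundances for one cell population
# (names and numbers are illustrative assumptions, not data from the review).
absolute_before = {"geneA": 100.0, "geneB": 300.0, "geneC": 600.0}
# Only geneA changes (a ten-fold increase); geneB and geneC stay the same.
absolute_after = {"geneA": 1000.0, "geneB": 300.0, "geneC": 600.0}


def sequence(absolute, library_size):
    """Idealized assay: report each transcript as its share of a fixed library size."""
    total = sum(absolute.values())
    return {gene: library_size * value / total for gene, value in absolute.items()}


# The same input sequenced at two arbitrary depths.
shallow = sequence(absolute_before, library_size=1_000)
deep = sequence(absolute_before, library_size=50_000)

# Individual counts differ between depths, but the library size cancels in any ratio.
ratio_shallow = shallow["geneB"] / shallow["geneC"]
ratio_deep = deep["geneB"] / deep["geneC"]

# With a constant library size, the increase in geneA lowers the observed
# counts of geneB and geneC even though their absolute abundances are unchanged.
observed_before = sequence(absolute_before, library_size=1_000)
observed_after = sequence(absolute_after, library_size=1_000)

print(ratio_shallow, ratio_deep)
print(observed_before["geneB"], observed_after["geneB"])
```

In this sketch the ratio of geneB to geneC is the same at both depths and before and after the change in geneA, whereas the observed count of geneB falls once geneA rises.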
3 Counts as compositional data
3.1 The definition of compositional data
Compositional data measure each sample as a composition, a vector of non-zero positive values (i.e., components) carrying relative information ([2]). Compositional data have two unique properties. First, the total sum of all component values (i.e., the library size) is an artifact of the sampling procedure ([71]). Second, the difference between component values is only meaningful proportionally (e.g., the difference between 100 and 200 counts carries the same information as the difference between 1000 and 2000 counts) ([71]).
Examples of compositional data include anything measured as a percent or proportion. They also include other data that are incidentally constrained to an arbitrary sum. NGS abundance data have compositional properties, but differ slightly from the formally defined compositional data in that they contain integer values only. However, except for possibly at near-zero values, we can treat so-called count compositional data as compositional data ([43]; [53]). Note that it is not a requirement for the arbitrary sum to represent complete unity: many data sets (including possibly NGS abundance data) lack information about potential components and hence exist as incomplete compositions ([1]).
3.2 The consequences of compositional data
Compositional data do not exist in real Euclidean space, but rather in a sub-space known as the simplex ([2]). Yet, many commonly used metrics implicitly assume otherwise; such metrics are invalid for relative data. This includes distance measures, correlation coefficients, and multivariate statistical models ([12]). For compositional data, the distance between any two variables is erratically sensitive to the presence or absence of other components ([4]). Meanwhile, correlation reveals spurious (i.e., falsely positive) associations between unrelated variables ([52]). In addition, multivariate statistics yield erroneous results because representing variables as portions of the whole makes them mutually dependent, multivariate objects (i.e., increasing the abundance of one decreases the portional abundance of the others) ([12]). All of this applies to NGS abundance data too ([43]).
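The spurious association described above can be reproduced with a small simulation. This is only a sketch under assumed conditions: the three features, their uniform distributions, and the sample size are arbitrary choices made for illustration, and the `pearson` helper is written out by hand to keep the example free of dependencies.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient, written out to avoid extra dependencies."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


random.seed(0)  # fixed seed so the illustration is repeatable
n_samples = 1000

# Three features drawn independently of one another (hypothetical abundances).
x = [random.uniform(10, 100) for _ in range(n_samples)]
y = [random.uniform(10, 100) for _ in range(n_samples)]
z = [random.uniform(1, 2) for _ in range(n_samples)]

# Closure: divide each sample by its total so that every sample sums to 1.
totals = [a + b + c for a, b, c in zip(x, y, z)]
x_closed = [a / t for a, t in zip(x, totals)]
y_closed = [b / t for b, t in zip(y, totals)]

r_absolute = pearson(x, y)
r_relative = pearson(x_closed, y_closed)
print(round(r_absolute, 3), round(r_relative, 3))
```

Under these assumptions the correlation between x and y is close to zero in absolute terms but strongly negative after closure, because the two dominant parts must trade off against each other within a fixed total.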
In the life sciences, count data are usually modeled using the Poisson distribution or negative binomial distribution ([11]). For NGS abundance data, the negative binomial model is preferred because it accommodates situations in which the variance is much larger than the mean, a common feature of biological replicates in RNA-Seq studies ([63]). These models are necessary because analyzing non-normalized and non-transformed count data as if they were normally distributed would imply that it is possible to sample negative and non-integer values, contradicting the assumptions behind many statistical hypotheses ([15]) (although it is possible to extend Gaussian analysis to counts by use of precision weights ([39])). Moreover, NGS abundance data are compositional counts, not counts, meaning that the measured variables (i.e., components) are not univariate objects ([13]).
3.3 Normalization to effective library size
Although the negative binomial distribution is still used to model NGS abundance data ([63]), doing so necessitates (at the very least) an additional normalization step ([63]). The simplest normalization would involve rescaling counts by the library size (i.e., the total number of mapped reads from a sample) ([63]), but this does not transform compositional counts into absolute counts. Instead, analysts most often use other, more elaborate normalization methods that (generally speaking) adjust the individual counts of each sample based on the counts of a reference (or pseudo-reference) sample ([17]). The sum of these rescaled counts is called the effective library size.
Effective library size normalization for RNA-Seq data was first proposed in an attempt to address the relative (i.e., closed) nature of the data through a method known as the trimmed mean of M-values (TMM) ([57]). This normalization works by inferring an ideal (i.e., unchanged) reference from a subset of transcripts based on the assumption that the majority of transcripts remain unchanged across conditions. Here, the reference was chosen to be a trimmed mean ([57]), although others have proposed using the median over the transcripts as the reference ([7]). The TMM normalizes data to an effective library size based on the principle that if counts are evaluated relative to (i.e., divided by) an unchanged reference, the original scale of the data is recovered. In the language of compositional data analysis, this approach is described as an attempt to "open" the closed data, and is often criticized on the ground that "there is no magic powder that can be sprinkled on closed data to make them open" ([3]). Yet, if the data were open originally (and only incidentally closed by the sequencing procedure), this point of view is perhaps extreme. On the other hand, if the cells themselves produce closed data by default (e.g., due to their limited capacity for mRNA production ([60])), any attempt to open the data might prove futile.
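The principle that dividing by an unchanged reference recovers the original scale can be sketched as follows. This is a simplified illustration in the spirit of a median-based reference, not an implementation of TMM or of any published package; the counts and the helper names `scale_factor`, `by_library_size`, and `by_reference` are hypothetical, and the sketch assumes that most transcripts are unchanged.

```python
import math
from statistics import median

# Hypothetical observed counts for five transcripts in two samples. The first
# four are unchanged in absolute terms; the last is ten-fold higher in sample B.
# Sample B was also sequenced at half the efficiency of sample A.
sample_a = [100.0, 200.0, 300.0, 400.0, 1000.0]
sample_b = [50.0, 100.0, 150.0, 200.0, 5000.0]
samples = [sample_a, sample_b]

# Pseudo-reference: the geometric mean of each transcript across samples.
n_features = len(sample_a)
reference = [
    math.exp(sum(math.log(s[i]) for s in samples) / len(samples))
    for i in range(n_features)
]


def scale_factor(sample):
    """Median ratio of a sample to the pseudo-reference (assumes most transcripts are unchanged)."""
    return median(count / ref for count, ref in zip(sample, reference))


def by_library_size(sample):
    """Rescale counts by the total number of counts in the sample."""
    total = sum(sample)
    return [count / total for count in sample]


def by_reference(sample):
    """Rescale counts by the median-ratio scale factor."""
    factor = scale_factor(sample)
    return [count / factor for count in sample]


lib_a, lib_b = by_library_size(sample_a), by_library_size(sample_b)
ref_a, ref_b = by_reference(sample_a), by_reference(sample_b)

# Fold change (B over A) for an unchanged transcript and for the changed one.
print(lib_b[0] / lib_a[0], lib_b[4] / lib_a[4])
print(ref_b[0] / ref_a[0], ref_b[4] / ref_a[4])
```

With these numbers, library size rescaling makes the unchanged transcript appear to decrease, while the reference-based rescaling returns a fold change of about 1 for the unchanged transcript and about 10 for the changed one. The result depends entirely on the assumption that the median transcript is truly unchanged.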
Given the difficulties in identifying a truly unchanged reference (and in interpreting it correctly in the case that closed data are being produced by the cells themselves), avoiding normalization altogether would seem desirable. After all, the choice of normalization method impacts the final results of an analysis. For example, the number and identity of genes reported as differentially expressed change with the normalization method ([41]), as do false discovery rates ([40]). This also holds true for compositional metabolomic data ([58]). Moreover, at least some normalization methods are sensitive to the removal of lowly abundant counts ([41]), as well as to data asymmetry ([63]).
4 Principles of compositional data analysis
4.1 Approaches to compositional data
In lieu of normalization, many compositional data analyses begin with a transformation. Although compositional data exist in the simplex, Aitchison first documented that these data could be mapped into real space by use of the log-ratio transformation ([2]). By transforming data into real space, measurements like Euclidean distance become meaningful ([4]). However, it is also possible to analyze compositional data without log-ratio transformations. One approach involves performing calculations on the components themselves (called the "staying-in-the-simplex" approach) ([47]). Another involves performing calculations on ratios of the components (called the "pragmatic" approach) ([32]). Still, many compositional data analyses begin with a log-ratio transformation.
Unlike normalizations, log-ratio transformations do not claim to open the data. Instead, the interpretation of the transformed data (and some of their results) depends on the reference used. In contrast, normalizations assume that an unchanged reference is available to recover the data (i.e., up to a proportionality constant) as they existed prior to closure by sequencing. However, while log-ratio transformations are conceptually distinct from normalizations, they are sometimes interpreted as if they were normalizations themselves ([25]). Although this contradicts compositional data analysis principles, conceiving of transformations as normalizations is helpful in understanding their use in some RNA-Seq analyses. Such log-ratio "normalizations", like conventional normalizations, aim to recast compositional data in absolute terms, allowing for a straightforward univariate interpretation of the data. Like effective library size normalization, this is done through use of an ideal reference.
4.2 The log-ratio transformation
First, let us consider a small relative data set with only 3 features measured across 100 samples. These samples belong to one of two groups. One of the features, "X", can differentiate these groups perfectly. The other features, "Y" and "Z", constitute noise. We can turn an absolute data set into a compositional data set by dividing each element of the sample vector by the total sum. Figure 2 shows how the relationship between the samples (represented as points) changes when made compositional. Although the two groups appear clearly linearly separable in absolute space, the boundaries between groups become unclear in relative space. Meanwhile, the distances between samples become arbitrary.
When analyzing compositional data, it is sometimes possible to reclaim the discriminatory potential of relative data through transformation. For example, by setting all or some of the features relative to (i.e., divided by) a reference feature, one might find that the resultant ratios can separate the groups ([66]). In fact, any separation revealed by such ratios can be analyzed by standard statistical techniques ([66]). This illustrates the concept behind the additive log-ratio (alr) transformation, achieved by taking the logarithm of each measurement within a composition (i.e., each sample vector containing relative measurements) as divided by a reference feature (i.e., x_D) ([2]):
$$\mathrm{alr}(x) = \left[ \ln\frac{x_1}{x_D}, \ldots, \ln\frac{x_{D-1}}{x_D} \right]$$
Instead of a specific reference feature, one could use an abstracted reference. In the case of the centered log-ratio (clr) transformation, the geometric mean of the composition (i.e., sample vector) is used in place of x_D ([2]). We use the notation g(x) to indicate the geometric mean of the sample vector, x. Note that because these transformations apply to each sample vector independently, the presence of an outlier sample does not change the transformation of the other samples:
$$\mathrm{clr}(x) = \left[ \ln\frac{x_1}{g(x)}, \ldots, \ln\frac{x_D}{g(x)} \right]$$
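The alr and clr transformations described above can be sketched in a few lines. The function names, the default choice of the last component as the alr reference, and the example vector are assumptions made for illustration; the sketch also assumes that the sample contains no zeros.

```python
import math


def geometric_mean(x):
    """g(x): the geometric mean of a vector of strictly positive values."""
    return math.exp(sum(math.log(v) for v in x) / len(x))


def alr(x, ref_index=-1):
    """Additive log-ratio: log of each part over a reference part (the last by default)."""
    ref_index = ref_index % len(x)
    ref = x[ref_index]
    return [math.log(v / ref) for i, v in enumerate(x) if i != ref_index]


def clr(x):
    """Centered log-ratio: log of each part over the geometric mean of the sample."""
    g = geometric_mean(x)
    return [math.log(v / g) for v in x]


counts = [10.0, 40.0, 50.0]            # hypothetical sample vector (no zeros)
rescaled = [v * 1000 for v in counts]  # same composition, different library size

print(alr(counts))  # D - 1 values, relative to the last part
print(clr(counts))  # D values that sum to zero
print(clr(rescaled))
```

Because each transformation uses only the values within one sample vector, rescaling a sample by a constant leaves its transformed values unchanged.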
Likewise, other transformations exist that use the geometric mean of a feature subset as the reference. For example, the ALDEx2 package introduces the inter-quartile log-ratio (iqlr) transformation, which includes only features that fall within the inter-quartile range of total variance in the geometric mean calculation ([24]; [25]). Another, more complex, transformation, called the isometric log-ratio (ilr) transformation ([20]), also exists and is used in geological studies ([15]) and at least one analysis of RNA-Seq data ([67]). The ilr transforms the data with respect to an orthonormal coordinate system that is constructed from sequential binary partitions of features ([13]). Its default application to standard problems has been criticized by Aitchison on the basis that it lacks interpretability ([5]). Applications where the basis construction follows a microbiome phylogeny seem an interesting possibility ([74]).
4.3 The log-ratio "normalization"
In some instances, the log-ratio transformation is technically equivalent to a normalization. For example, let us consider the case where we know the identity of a feature with a fixed abundance in absolute space across all samples. We could then use a log-ratio procedure to "sacrifice" this feature in order to "back-calculate" the absolute abundances. This is akin to using the alr transformation as a kind of normalization. However, because a single unchanged reference is rarely available or knowable (although synthetic RNA spike-ins may represent one way forward ([37])), we could try to approximate an unchanged reference from the data. For this, one might use the geometric mean of a feature subset, thereby using a clr (or iqlr) transformation as if it were a normalization.
Although log-ratio "normalizations" differ from log-ratio transformations only in the interpretation of their results, transformations alone are still useful even when they do not normalize the data. This is because they provide a way to move from the simplex into real space ([4]), rendering Euclidean distances meaningful. Importantly, clr- and ilr-transformed data impart four key properties to analyses: scale invariance (i.e., multiplying a composition by a constant k will not change the results), perturbation invariance (i.e., converting a composition between equivalent units will not change the results), permutation invariance (i.e., changing the order of the components within a composition will not change the results), and sub-compositional dominance (i.e., using a subset of a complete composition carries less information than using the whole) ([13]). Yet, the interpretation of transformation-based analyses remains complicated because the analyst must consider their results with respect to the chosen reference, or otherwise translate the results back into compositional terms.
4.4 Measures of distance
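Euclidean distances do not make sense for compositional data ([4]). In contrast, the Aitchison distance does, providing a measure of distance between two d-dimensional compositions, x and X ([4]):
$$d_A(x, X) = \sqrt{\frac{1}{2d} \sum_{i=1}^{d} \sum_{j=1}^{d} \left( \ln\frac{x_i}{x_j} - \ln\frac{X_i}{X_j} \right)^2}$$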
Although the Aitchison distance is simply the Euclidean distance between clr-transformed compositions, this distance (unlike Euclidean distance) has scale invariance, perturbation invariance, permutation invariance, and sub-compositional dominance. Few other distance measures satisfy all four of these properties, including none of the metrics routinely used in hierarchical clustering ([45]) (a routine part of RNA-Seq analysis). The property of sub-compositional dominance is especially important: even if the log-ratio transformation does not normalize the data, the addition of more sequence data will never make two samples appear less distant. This follows logically: as the amount of data available grows, the distance between samples should not shrink.
4.5 Measures of association
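As a minimal sketch, the Aitchison distance can be computed either from clr-transformed vectors or from all pairwise log-ratios; the two forms should agree. The function names and the two example samples are hypothetical, and zeros are assumed to be absent.

```python
import math


def clr(x):
    """Centered log-ratio transformation of a strictly positive vector."""
    mean_log = sum(math.log(v) for v in x) / len(x)
    return [math.log(v) - mean_log for v in x]


def aitchison_clr(x, y):
    """Aitchison distance as the Euclidean distance between clr vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))


def aitchison_pairwise(x, y):
    """Equivalent form based on all pairwise log-ratios, with no explicit reference."""
    d = len(x)
    total = sum(
        (math.log(x[i] / x[j]) - math.log(y[i] / y[j])) ** 2
        for i in range(d)
        for j in range(d)
    )
    return math.sqrt(total / (2 * d))


# Two hypothetical samples measured over four components.
sample_1 = [10.0, 20.0, 30.0, 40.0]
sample_2 = [40.0, 10.0, 30.0, 20.0]

full = aitchison_clr(sample_1, sample_2)
rescaled = aitchison_clr([v * 7 for v in sample_1], sample_2)  # scale invariance
subset = aitchison_clr(sample_1[:3], sample_2[:3])             # sub-composition

print(full, aitchison_pairwise(sample_1, sample_2), rescaled, subset)
```

In this example the distance is unchanged when one sample is multiplied by a constant, and the distance computed on a three-component sub-composition does not exceed the distance computed on all four components.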
Like the Aitchison distance, there also exists a compositionally valid measure of association: the log-ratio variance (VLR) measures the agreement between two components (a and b) across two or more compositions. Specifically, it computes the variance of the logarithm of one component as divided by a second component. As such, a D-component data set contains D^2 associations (albeit with symmetry). Unlike the Aitchison distance, however, the VLR does not require a log-ratio transformation whatsoever; in fact, if using log-ratio transformed data, the reference denominators would cancel out. Note that, while distances occur between compositions (i.e., between samples), associations occur between components (i.e., between transcripts).
We can gain an intuition of the VLR by considering its formula. Recall that the relationship between components is one of relative importance: for the feature pair [a, b], the coordinates [2, 4] and [4, 8] have equivalent meaning. Therefore, it follows that the features a and b are associated if the ratio a/b remains constant across all samples. Hence, we measure the variance of the (log-)ratios, such that VLR ranges from [0, inf] where 0 indicates a perfect association. Unfortunately, VLR lacks an intuitive scale, making non-zero values difficult to interpret ([43]).
Importantly, the VLR is sub-compositionally coherent: the removal of a third feature c would have no bearing on the variance of the (log-)ratio a/b. Yet, the VLR suffers from a key limitation: it is unscaled with respect to the variances of the log components ([43]). In other words, the magnitude of VLR depends partially on the variances of its constituent parts (i.e., var(log(a)) and var(log(b))). This makes it difficult to compare VLR across pairs (e.g., comparing VLR(a, b) with VLR(c, d)) ([43]). Still, unlike correlation, the VLR does not produce spurious results for compositional data, and in fact, provides the same result for both relative data and the absolute counterpart, all without requiring normalization or transformation.
4.6 Principal Component Analysis
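The claim that the VLR gives the same result for relative data and for the absolute counterpart can be checked with a minimal sketch. The three transcripts and their abundances are hypothetical values chosen so that one pair is perfectly proportional.

```python
import math
from statistics import variance


def vlr(a, b):
    """Log-ratio variance: the variance of log(a / b) across samples."""
    return variance([math.log(x / y) for x, y in zip(a, b)])


# Hypothetical absolute abundances of three transcripts across five samples.
# Transcript b is always twice transcript a; transcript c varies on its own.
a = [10.0, 20.0, 15.0, 40.0, 25.0]
b = [20.0, 40.0, 30.0, 80.0, 50.0]
c = [500.0, 30.0, 200.0, 10.0, 90.0]

# Close the data: each sample is divided by its own total.
totals = [x + y + z for x, y, z in zip(a, b, c)]
a_rel = [x / t for x, t in zip(a, totals)]
b_rel = [y / t for y, t in zip(b, totals)]
c_rel = [z / t for z, t in zip(c, totals)]

print(vlr(a, b), vlr(a_rel, b_rel))  # proportional pair: numerically zero in both
print(vlr(a, c), vlr(a_rel, c_rel))  # equal: the per-sample total cancels in the ratio
```

The per-sample total appears in both the numerator and the denominator of each ratio, so it cancels before the variance is taken. The same cancellation is why removing a third feature would not change the VLR of the remaining pair.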
Just as there are problems regarding between-sample distances and between-feature correlations, it follows that Principal Component Analysis (PCA) should not be applied directly to compositional data. Instead, analysts could apply PCA to clr-transformed data (resulting in an additional centering of the rows after log-transformation) ([6]). Nevertheless, analysts must take care when interpreting the resultant PCA: covariances and correlations between features now exist with respect to the geometric mean reference. As such, when plotting features as arrows in the new coordinate space, the angles between them (i.e., the correlations) will usually change when subsets of the data are analyzed. However, the distances between feature pairs (i.e., the links between the arrow heads) remain invariable with respect to sub-compositions: these correspond to their log-ratio variance ([6]). Meanwhile, the usual PCA plot (with samples as points in a new coordinate space) projects the distances between samples using the Aitchison distance (which has the desired property of sub-compositional dominance). In combining these into a joint visualization of features and samples, the resultant log-ratio biplot (i.e., the "relative variation biplot") reveals associations between samples and features, and can also be used to infer power law relationships between features in an exploratory analysis ([6]). Such biplots are reminiscent of the visualizations obtained by Correspondence Analysis (CA). In fact, CA can indeed be used to approximate relative variation biplots provided the data are raised to a (small) power ([30]), the optimal size of which can be obtained by analyzing sub-compositional incoherence ([31]). Using CA with power transformation has the advantage that zeros in the data are handled naturally by the technique.
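The "additional centering of the rows" mentioned above can be made concrete with a short sketch. It assumes a hypothetical count matrix with samples as rows and features as columns, and it stops at the double-centered matrix that a PCA would then decompose; the decomposition itself is not shown.

```python
import math

# Hypothetical count matrix: rows are samples (compositions), columns are features.
counts = [
    [10.0, 200.0, 30.0],
    [50.0, 100.0, 60.0],
    [5.0, 400.0, 20.0],
    [80.0, 90.0, 70.0],
]


def clr(row):
    """Centered log-ratio: subtract the mean log from each log value in a sample."""
    logs = [math.log(v) for v in row]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


# Step 1: clr-transform each sample, which centers every row of the log matrix.
clr_rows = [clr(row) for row in counts]

# Step 2: center each feature (column), as PCA does before decomposition.
n_samples, n_features = len(clr_rows), len(clr_rows[0])
col_means = [sum(row[j] for row in clr_rows) / n_samples for j in range(n_features)]
double_centered = [[row[j] - col_means[j] for j in range(n_features)] for row in clr_rows]

row_sums = [sum(row) for row in double_centered]
col_sums = [sum(row[j] for row in double_centered) for j in range(n_features)]
print(row_sums, col_sums)
```

Both the row sums and the column sums of the resulting matrix are numerically zero, which is the double centering implied by applying PCA to clr-transformed data.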
5 Compositional methods for sequence data
5.1 Methods for differential abundance
The ALDEx2 package, available for the R programming language, uses compositional data analysis principles to measure differential expression between two or more groups ([24]; [25]). Unlike conventional approaches to differential expression, ALDEx2 uses log-ratio transformation instead of effective library size normalization. The algorithm has five main parts. First, ALDEx2 uses the input data to create randomized instances based on the compositionally valid Dirichlet distribution ([24]; [25]). This renders the data free of zeros. Second, each of these so-called Monte Carlo (MC) instances undergoes log-ratio transformation, most commonly clr or iqlr transformation ([24]; [25]). Third, conventional statistical tests (i.e., Welch's t and Wilcoxon tests for two groups; glm and Kruskal-Wallis for two or more groups) are applied to each MC instance to generate p-values (p) and Benjamini-Hochberg adjusted p-values (BH) for each transcript ([24]; [25]). Fourth, these p-values are averaged across all MC instances to yield expected p-values ([24]; [25]). Fifth, one considers any transcript with an expected BH < α as statistically significant ([24]; [25]).
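The first steps of this procedure can be sketched in plain Python. This is not the ALDEx2 implementation and does not call the package: the counts, the prior of 0.5 added to each count, the number of Monte Carlo instances, and all function names are assumptions made for illustration. To stay within the standard library, the sketch averages Welch's t statistic across instances instead of computing and averaging p-values, and it omits the Benjamini-Hochberg adjustment entirely.

```python
import math
import random
from statistics import mean, variance

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical counts: rows are samples, columns are four transcripts.
# Transcript 0 is roughly ten-fold higher in group B; zeros are present.
group_a = [[10, 100, 200, 0], [12, 110, 190, 1], [9, 95, 210, 0],
           [11, 105, 205, 2], [10, 98, 195, 1]]
group_b = [[100, 100, 200, 0], [120, 110, 190, 1], [90, 95, 210, 0],
           [110, 105, 205, 2], [100, 98, 195, 1]]


def dirichlet_instance(counts, prior=0.5):
    """Draw one set of proportions from a Dirichlet with parameters counts + prior."""
    draws = [random.gammavariate(c + prior, 1.0) for c in counts]
    total = sum(draws)
    return [d / total for d in draws]


def clr(x):
    """Centered log-ratio transformation of a strictly positive vector."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def welch_t(xs, ys):
    """Welch's t statistic for two groups with unequal variances."""
    se = math.sqrt(variance(xs) / len(xs) + variance(ys) / len(ys))
    return (mean(ys) - mean(xs)) / se


n_instances, n_features = 128, 4
t_sums = [0.0] * n_features
all_positive = True
for _ in range(n_instances):
    # Steps 1 and 2: one Monte Carlo instance per sample, then clr.
    mc_a = [dirichlet_instance(s) for s in group_a]
    mc_b = [dirichlet_instance(s) for s in group_b]
    all_positive = all_positive and all(v > 0 for s in mc_a + mc_b for v in s)
    clr_a = [clr(s) for s in mc_a]
    clr_b = [clr(s) for s in mc_b]
    # Step 3 (simplified): a test statistic per transcript for this instance.
    for j in range(n_features):
        t_sums[j] += welch_t([s[j] for s in clr_a], [s[j] for s in clr_b])

# Step 4 (simplified): average across instances.
expected_t = [t / n_instances for t in t_sums]
print([round(t, 2) for t in expected_t])
```

In this sketch every Monte Carlo instance is free of zeros, and the transcript that differs between groups receives the largest averaged statistic. The unchanged transcripts tend to receive negative statistics because the clr reference (the geometric mean) itself shifts between groups.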
Although popular among meta-genomics researchers for analyzing the differential abundance of operational taxonomic units (OTUs) (e.g., [48]; [70]), the ALDEx2 package has not received widespread adoption in the analysis of RNA-Seq data. In part, this may have to do with our observation that ALDEx2 requires a large number of samples. This requirement may stem from its use of non-parametric testing, as suggested by the reduced power of other non-parametric differential expression methods ([16]; [75]), for example NOISeq ([64]). However, competing software packages like limma ([62]) and edgeR ([56]) also benefit from moderated t-tests that "share information between genes" to reduce per-transcript variance estimates and increase statistical power.
Yet, even in the setting of large sample sizes, ALDEx2 has one major limitation: its usefulness depends largely on interpreting the log-ratio transformation as a normalization. If the log-ratio transformation does not sufficiently approximate an unchanged reference, the statistical tests will yield results that are hard to interpret. Another tool developed for analyzing the differential abundance of OTUs suffers from a similar limitation: ANCOM ([44]) uses presumed invariant features to guide the log-ratio transformation. The tendency to interpret differential abundance results as if they were derived from log-ratio "normalizations" highlights the importance of pursuing numeric and experimental techniques that can establish an unchanged reference. It also highlights the benefit of seeking novel methods that do not require using log-ratio transformations as a kind of normalization.
5.2 Methods for association
The SparCC package, available for the R programming language, replaces Pearson's correlation coefficient with an estimation of correlation based on its relationship to the VLR (and other terms) ([28]). The algorithm works by iteratively calculating a "basis correlation" under the assumption that the majority of pairs do not correlate (i.e., a sparse network) ([28]). Another algorithm, SPIEC-EASI, makes the same assumption that the underlying network is sparse, but bases its method on the inverse covariance matrix of clr-transformed data ([38]).
The propr package ([53]), available for the R programming language, implements proportionality as introduced in ([43]) and expounded in ([22]). Proportionality provides an alternative measure of association that is valid for relative data. One could think of proportionality as a modification to the VLR that uses information about the variability of individual features (gained by a log-ratio transformation) to give the VLR scale. It can be defined for the i-th and j-th features (e.g., transcripts) of a log-ratio transformed data matrix, ã_i and ã_j, and thus also depends on the reference used for transformation. Unlike SparCC and SPIEC-EASI, proportionality does not assume an underlying sparse network.
At least three measures of proportionality exist. The first, ϕ, ranges over [0, ∞), with 0 indicating perfect proportionality ([43]):
$$\phi(\tilde{a}_i, \tilde{a}_j) = \frac{\operatorname{var}(\tilde{a}_i - \tilde{a}_j)}{\operatorname{var}(\tilde{a}_i)}$$
Its definition adjusts the VLR (in the numerator) by the variance of one of the log-ratio transformed features in that pair (in the denominator). The use of only one feature variance in the adjustment makes ϕ asymmetric (i.e., ϕ(ã_i, ã_j) ≠ ϕ(ã_j, ã_i)).
The second, ϕ_s, also ranges over [0, ∞), with 0 indicating perfect proportionality, but has a natural symmetry ([53]). Its definition adjusts the VLR by the variance of the log-product of the two features:
$$\phi_s(\tilde{a}_i, \tilde{a}_j) = \frac{\operatorname{var}(\tilde{a}_i - \tilde{a}_j)}{\operatorname{var}(\tilde{a}_i + \tilde{a}_j)}$$
The third, ρ_p, like correlation, takes on values in [-1, 1], where a value of 1 indicates perfect proportionality ([22]). Its definition adjusts the VLR by the sum of the variances of the log-ratio transformed features in that pair (as subtracted from the value one):
$$\rho_p(\tilde{a}_i, \tilde{a}_j) = 1 - \frac{\operatorname{var}(\tilde{a}_i - \tilde{a}_j)}{\operatorname{var}(\tilde{a}_i) + \operatorname{var}(\tilde{a}_j)}$$
Thus, ρ_p is symmetric.
Note that ρ_p and ϕ_s are monotonic functions of one another (i.e., you can compute ρ_p directly from ϕ_s and vice versa) (e.g., see ([22]), where ϕ_s appears under a different name). Unlike Pearson's correlation coefficient, proportionality coefficients tend not to produce spurious results ([53]). Instead, proportionality serves as a robust measure of association when analyzing relative data ([43]). Although proportionality gives the VLR scale, it is limited in that its interpretation still depends partly on using transformation as a kind of normalization (i.e., for the calculation of individual feature variances) ([22]). Nevertheless, its interpretability, along with its observed resilience to spurious results, makes it a good choice for inferring co-expression from RNA-Seq data ([43]) or co-abundance from meta-genomics data ([10]).
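As a rough illustration of the three proportionality measures described above, the following sketch computes them in plain Python for clr-transformed toy data. It is not the propr implementation: the counts, the helper names (`clr`, `proportionality`, `feature`), and the use of the sample variance are all assumptions made for this example. Expanding the variances in the verbal definitions gives ϕ_s = (1 - ρ_p) / (1 + ρ_p), which the sketch can be used to check numerically.

```python
import math
from statistics import variance


def clr(sample):
    """Centered log-ratio transform of one sample (every part must be > 0)."""
    logs = [math.log(x) for x in sample]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def proportionality(a_i, a_j):
    """Return (phi, phi_s, rho_p) for two log-ratio transformed features."""
    vlr = variance([x - y for x, y in zip(a_i, a_j)])          # var(a_i - a_j)
    phi = vlr / variance(a_i)                                  # asymmetric
    phi_s = vlr / variance([x + y for x, y in zip(a_i, a_j)])  # symmetric
    rho_p = 1 - vlr / (variance(a_i) + variance(a_j))          # symmetric
    return phi, phi_s, rho_p


# Hypothetical counts: rows are samples, columns are features.
# Feature 1 is roughly twice feature 0 in every sample.
counts = [
    [10, 20, 5, 40],
    [20, 41, 7, 30],
    [40, 79, 6, 55],
    [80, 162, 9, 20],
]
transformed = [clr(row) for row in counts]


def feature(k):
    """Extract the k-th transformed feature across all samples."""
    return [row[k] for row in transformed]


phi, phi_s, rho_p = proportionality(feature(0), feature(1))
phi_reversed, _, _ = proportionality(feature(1), feature(0))
print(f"phi={phi:.4f}  phi_reversed={phi_reversed:.4f}")
print(f"phi_s={phi_s:.4f}  rho_p={rho_p:.4f}")
```

Because the two features are nearly proportional, ϕ and ϕ_s come out close to 0 and ρ_p close to 1, while swapping the arguments changes ϕ but not the other two measures.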
6 Challenges to compositional analyses
6.1 Challenges unique to count compositions
Compositional data analysis, because it relies on log-transformations, does not work when the data contain zeros. Yet count compositional data are notably prone to zeros, which could signify either that a component is absent from a sample or that it is present at a quantity below the detection limit ([14]). For NGS abundance data, the difference between a zero and a one might be stochastic. How best to handle zeros remains a topic of ongoing research. Nevertheless, it is common to replace zeros with a number less than the detection limit ([14]). Other replacement strategies include adding a fixed value to all components, replacing zeros with the value one, or omitting zero-laden components altogether. A more principled (yet computationally expensive) way of replacing zeros is the Dirichlet sampling procedure implemented in ALDEx2 (as described above). Note that the simple addition of a pseudo-count to all components does not preserve the ratios between them, which can be amended by modifying the non-zero components in a multiplicative manner ([46]).
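The contrast between additive and multiplicative zero replacement can be sketched in a few lines. This is a minimal example, not the procedure from ([46]) verbatim: the counts, the pseudo-count of 1, and the replacement value `delta` are hypothetical choices, and the function names are made up for illustration.

```python
def close(parts):
    """Rescale parts so that they sum to one (the closure operation)."""
    total = sum(parts)
    return [p / total for p in parts]


def add_pseudocount(counts, pseudo=1.0):
    """Add the same value to every component, then close."""
    return close([c + pseudo for c in counts])


def multiplicative_replacement(counts, delta=0.001):
    """Set zero proportions to delta and shrink the non-zero proportions
    by a common factor so that the composition still sums to one."""
    props = close(counts)
    n_zero = sum(1 for p in props if p == 0)
    shrink = 1 - n_zero * delta
    return [delta if p == 0 else p * shrink for p in props]


counts = [0, 10, 30, 60]  # hypothetical sample with one zero
additive = add_pseudocount(counts)
multiplicative = multiplicative_replacement(counts)

original_ratio = counts[2] / counts[1]
additive_ratio = additive[2] / additive[1]
multiplicative_ratio = multiplicative[2] / multiplicative[1]
print(original_ratio, additive_ratio, multiplicative_ratio)
```

The pseudo-count changes the ratio between the two non-zero components (30/10 becomes 31/11), whereas the multiplicative scheme scales all non-zero components by the same factor and so leaves their ratios untouched.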
Moreover, while count compositional data carry relative information, they differ from true compositional data in that they contain integer values only. Restricting the data to integer space can introduce problems with an analysis because the sampling variation becomes more noticeable as the measurements approach zero ([53]). In other words, the difference between 1 and 2 counts is not exactly the same as the difference between 1,000 and 2,000 counts ([53]). While it is not mathematically necessary to remove low counts, analysts should proceed carefully in their presence.
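One way to make the point about low counts concrete is to assume, purely for illustration, that each count is Poisson distributed. This assumption is not made in the text above; under it, the standard deviation of a count equals the square root of its mean, so the relative noise grows as the mean shrinks. The function name below is hypothetical.

```python
import math


def poisson_cv(expected_count):
    """Coefficient of variation (sd / mean) of a Poisson count.

    For a Poisson variable the variance equals the mean, so the
    coefficient of variation is sqrt(mean) / mean.
    """
    return math.sqrt(expected_count) / expected_count


for mean in (1, 2, 1000, 2000):
    print(f"mean={mean:>5}  cv={poisson_cv(mean):.3f}")
```

Under this simplified model, a two-fold difference between 1 and 2 counts sits well within the sampling noise, while the same two-fold difference between 1,000 and 2,000 counts does not.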
6.2 Challenges unique to sequencing data
In the second section, we discussed how between-sample biases render NGS abundances incomparable between samples, thus necessitating normalization or transformation. However, we did not address two important sources of within-sample biases for sequencing data. The first is read length bias, in which more reads map to longer transcripts ([63]). The second is GC content bias, in which more reads map to high GC regions ([19]). Such biases distort the ratios between features and are thus relevant to compositional analysis as well. Nevertheless, because within-sample biases are commonly assumed to have the same proportional effect across all samples, they are usually ignored ([63]). For the same reason, one might also ignore these biases when interpreting NGS abundance data as compositions (as long as we are only interested in between-sample effects). However, if a sample were to contain, for instance, a polymorphic or epigenetic change which alters the size or GC content of a transcript, the compositional nature of sequencing data could cause a skew in the observed abundances for all other transcripts (for reasons suggested by Figure 1). More work is needed to understand the extent to which within-sample biases impact compositional data analysis in practice.
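The argument above can be sketched numerically. In this toy model (an assumption for illustration, not a description of any real assay), observed counts are proportional to true abundance times a per-feature bias factor, and samples are compared through differences of clr-transformed values. All numbers and function names are hypothetical.

```python
import math


def clr(parts):
    """Centered log-ratio transform of one sample (every part must be > 0)."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def observe(abundance, bias, depth):
    """Observed signal: proportional to abundance * bias, scaled to an arbitrary depth."""
    weighted = [a * b for a, b in zip(abundance, bias)]
    total = sum(weighted)
    return [depth * w / total for w in weighted]


def clr_difference(sample_a, sample_b):
    """Per-feature difference of clr values between two samples."""
    return [x - y for x, y in zip(clr(sample_a), clr(sample_b))]


abundance_1 = [100.0, 200.0, 50.0, 400.0]
abundance_2 = [100.0, 400.0, 50.0, 400.0]   # only feature 1 truly changes
length_bias = [1.0, 3.0, 0.5, 2.0]          # same bias in both samples

unbiased = clr_difference(abundance_2, abundance_1)
shared_bias = clr_difference(
    observe(abundance_2, length_bias, depth=1e6),
    observe(abundance_1, length_bias, depth=2e6),
)

# Now suppose feature 3 becomes twice as long in sample 2 only.
altered_bias = list(length_bias)
altered_bias[3] = 4.0
sample_specific = clr_difference(
    observe(abundance_2, altered_bias, depth=1e6),
    observe(abundance_1, length_bias, depth=2e6),
)
print(unbiased[0], shared_bias[0], sample_specific[0])
```

A bias shared by both samples cancels in the between-sample comparison, regardless of sequencing depth. A bias that differs in one sample shifts the apparent change of every other feature, here by log(2)/4 on the clr scale, even though their true abundances are unchanged.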
6.3 Limitations of transformation-based analysis
Formal transformation-based approaches often suffer from a lack of interpretability or otherwise get interpreted erroneously. For example, when using the centered log-ratio (clr) transformation, one may be tempted to interpret the transformed data as if they referred to single features (e.g., transcripts); however, the transformed data actually refer to the ratios of the transcripts to their geometric mean. As such, an analyst must interpret results with regard to their dependence on this mean. Moreover, because the geometric mean can change with the removal of features, the transformed data are incoherent with respect to sub-compositions.
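The sub-compositional incoherence of the clr transformation can be seen with a single hypothetical sample: dropping one feature changes the geometric mean, and therefore every clr value, while pairwise log-ratios stay the same. The values below are made up for illustration.

```python
import math


def clr(parts):
    """Centered log-ratio transform: log of each part over the geometric mean."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


full = [10.0, 20.0, 5.0, 400.0]   # hypothetical four-part sample
sub = full[:3]                    # sub-composition without the last feature

clr_full = clr(full)
clr_sub = clr(sub)

# The clr value of feature 0 depends on which features are present ...
shift = clr_full[0] - clr_sub[0]
# ... but the log-ratio between features 0 and 1 does not.
log_ratio_full = clr_full[0] - clr_full[1]
log_ratio_sub = clr_sub[0] - clr_sub[1]
print(shift, log_ratio_full, log_ratio_sub)
```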
When log-ratio transformations are used for scaled measures of association (i.e., proportionality), the resulting covariations depend on the implicitly chosen reference. Therefore, they will not give the same results for absolute and relative data (unless both data were transformed). The formal relationship of results when applying ρ_p with and without transformation is investigated elsewhere ([22]). Although lacking a natural scale, the log-ratio variance (VLR) has an advantage in that it provides identical results for both absolute and relative data, without requiring normalization or transformation.
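The invariance of the VLR can be checked directly: dividing each sample by its own total cancels out of every within-sample ratio. The sketch below uses hypothetical "absolute" abundances and made-up helper names; it also shows that the variance of a single log-transformed feature is not invariant in the same way.

```python
import math
from statistics import variance


def close(parts):
    """Rescale one sample so that its parts sum to one."""
    total = sum(parts)
    return [p / total for p in parts]


def column(data, k):
    """Extract feature k across all samples (rows)."""
    return [row[k] for row in data]


def vlr(x, y):
    """Log-ratio variance of two features measured across samples."""
    return variance([math.log(a / b) for a, b in zip(x, y)])


# Hypothetical absolute abundances: rows are samples, columns are features.
absolute = [
    [100.0, 200.0, 50.0],
    [300.0, 610.0, 20.0],
    [50.0, 95.0, 400.0],
    [1000.0, 2100.0, 10.0],
]
relative = [close(row) for row in absolute]

vlr_absolute = vlr(column(absolute, 0), column(absolute, 1))
vlr_relative = vlr(column(relative, 0), column(relative, 1))

# For comparison: the variance of a single log-transformed feature changes.
var_log_absolute = variance([math.log(v) for v in column(absolute, 0)])
var_log_relative = variance([math.log(v) for v in column(relative, 0)])
print(vlr_absolute, vlr_relative, var_log_absolute, var_log_relative)
```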
6.4 The merits of ratio-based analysis
Aitchison's preferred summary of the covariance structure of a compositional data set was a matrix containing the log-ratio variances for all feature pairs (i.e., the variation matrix) ([2]). Although this matrix formally contains a lot of redundant information, an analyst who is familiar with the features might nevertheless find this kind of representation useful. Recently, the focus on ratios has been called the "pragmatic" approach to compositional data analysis ([32]), and offers some benefits. For one, transformation (i.e., the restriction to ratios with the same denominator) is not needed. Instead, the ratios can be dealt with directly as if they were unconstrained (i.e., absolute) data ([66]). Moreover, ratios may carry a clear meaning to the analyst interpreting them. Greenacre has also proposed a formal procedure to select a non-redundant subset of feature pairs that contains the entire variability of the data ([32]).
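A variation matrix of the kind described above can be built with a double loop over feature pairs. This is a minimal sketch with hypothetical data and a made-up function name, using the sample variance; it makes the redundancy visible, since the matrix is symmetric with a zero diagonal.

```python
import math
from statistics import variance


def variation_matrix(data):
    """Matrix of log-ratio variances for all feature pairs.

    data: list of samples (rows), each a list of strictly positive parts.
    Entry [i][j] is var(log(x_i / x_j)) across samples.
    """
    n_features = len(data[0])
    logs = [[math.log(v) for v in row] for row in data]
    matrix = [[0.0] * n_features for _ in range(n_features)]
    for i in range(n_features):
        for j in range(i + 1, n_features):
            value = variance([row[i] - row[j] for row in logs])
            matrix[i][j] = matrix[j][i] = value
    return matrix


# Hypothetical counts: feature 1 is roughly twice feature 0 in every sample.
data = [
    [10, 20, 5, 40],
    [20, 41, 7, 30],
    [40, 79, 6, 55],
    [80, 162, 9, 20],
]
T = variation_matrix(data)
for row in T:
    print("  ".join(f"{value:7.4f}" for value in row))
```

Small entries flag pairs whose ratio is nearly constant across samples (here, features 0 and 1), which is the kind of pattern an analyst familiar with the features may want to inspect.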
Such ratio-based analyses are also applicable to NGS abundance data. For instance, Erb et al. proposed a method to identify the differential expression of gene ratios, a technique comprising part of what is termed differential proportionality analysis ([23]). When comparing gene ratios across two groups, this method selects ratios in which only a small proportion of the total log-ratio variance (i.e., VLR) is explained by the sum of the within-group log-ratio variances ([23]). These selected gene ratios tend to show differences in the group means of those ratios, analogous to how genes selected by differential expression analysis show differences between their means ([23]). Reinforcing the analogy further, Erb et al. have shown how it is possible to use the limma package to apply an empirical Bayes model with underlying count-based precision weights ([62]; [39]) to gene ratios, thus quantifying "second order" expression effects while still avoiding normalization ([23]).
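The selection criterion described above can be sketched as a ratio of variances. The statistic below, called `theta` here, divides the within-group log-ratio variances by the total log-ratio variance. Weighting each variance by its degrees of freedom is an assumption of this sketch (it keeps the value between 0 and 1) and may differ in detail from the published method ([23]); the counts and group labels are hypothetical.

```python
import math
from statistics import variance


def theta(x, y, groups):
    """Share of the total log-ratio variance explained by within-group variance.

    Values near 0 suggest the ratio x/y differs between groups;
    values near 1 suggest it does not.
    """
    log_ratios = [math.log(a / b) for a, b in zip(x, y)]
    within = 0.0
    for label in set(groups):
        members = [v for v, g in zip(log_ratios, groups) if g == label]
        within += (len(members) - 1) * variance(members)
    total = (len(log_ratios) - 1) * variance(log_ratios)
    return within / total


groups = ["a", "a", "a", "b", "b", "b"]
gene_x = [10, 12, 9, 40, 44, 38]    # ratio to gene_y shifts between groups
gene_y = [20, 25, 19, 20, 21, 20]
gene_z = [5, 7, 4, 6, 5, 8]         # ratio to gene_y shows no clear shift

theta_changed = theta(gene_x, gene_y, groups)
theta_stable = theta(gene_z, gene_y, groups)
print(f"theta_changed={theta_changed:.3f}  theta_stable={theta_stable:.3f}")
```

Because only within-sample ratios enter the calculation, no library size normalization is needed before computing the statistic.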
In addition to measuring differences in the means of gene ratios between groups, ratio-based methods (such as those used in differential proportionality analysis) can also help identify differences in the coordination of gene pairs. Such "differential coordination analysis" would otherwise depend on correlation ([76]), and therefore fall susceptible to spurious results. Instead, we can harness the advantages of the VLR to define a sub-compositionally coherent measure that tests for changes in the magnitude (i.e., slope of association) or strength (i.e., coefficient of association) of co-regulated gene pairs. Moreover, ratio-based analyses could work as normalization-free feature selection methods for data science applications (such as clustering and classification). Such techniques would especially suit large data sets aggregated from multiple sequencing centers, platforms, or modalities, where heterogeneity and batch effects are not easily normalized.
7 Summary
All NGS abundance data are compositional because sequencers sample only a portion of the total input material. However, RNA-Seq data might have compositional properties regardless, owing to constraints on the cellular capacity for mRNA production. Whatever the reason, compositional data cannot undergo conventional analysis directly, at least without prior normalization or transformation. Otherwise, measures of differential expression, correlation, distance, and principal components become unreliable.
In the analysis of RNA-Seq data, effective library size normalization is used to recast the data in absolute terms prior to analysis. However, successful normalization requires meeting certain (often untestable) assumptions. Alternatively, log-ratio transformations provide a way to interrogate the data using familiar methods, but analysts must interpret their results with respect to the chosen reference. Sometimes, log-ratio transformations can be used to normalize the data, but this requires an approximation of an unchanged reference. Instead, shifting focus to the analysis of ratios yields methods that avoid normalization and transformation entirely. These ratio-based methods may represent an important future direction in the compositional analysis of relative NGS abundance data.
Footnotes
- * contact to mquinn{at}gmail.com
References
- [1].
J. Aitchison. The Statistical Analysis of Compositional Data. Journal of the Royal Statistical Society. Series B (Methodological), 44(2):139–177, 1982.
- [2].
J. Aitchison. The Statistical Analysis of Compositional Data. Chapman & Hall, Ltd., London, UK, 1986.
- [3].
J. Aitchison. A concise guide to compositional data analysis. 2nd Compositional Data Analysis Workshop; Girona, Italy, 2003.
- [4].
J. Aitchison, C. Barceló-Vidal, J. A. Martín-Fernández, and V. Pawlowsky-Glahn. Logratio Analysis and Compositional Distance. Mathematical Geology, 32(3):271–275, April 2000.
- [5].
John Aitchison. The single principle of compositional data analysis, continuing fallacies, confusions and misunderstandings and some suggested remedies. Proceedings of CoDaWork'08, The 3rd Compositional Data Analysis Workshop; , Spain, 2008.
- [6].
John Aitchison and Michael Greenacre. Biplots of compositional data. Journal of the Royal Statistical Society: Series C (Applied Statistics), 51(4):375–392, October 2002.
- [7].
Simon Anders and Wolfgang Huber. Differential expression analysis for sequence count data. Genome Biology, 11:R106, 2010.
- [8].
Giacomo Baruzzo, Katharina E. Hayer, Eun Ji Kim, Barbara Di Camillo, Garret A. FitzGerald, and Gregory R. Grant. Simulation-based comprehensive benchmarking of RNA-seq aligners. Nature Methods, 14(2):135–139, February 2017.
- [9].
Ashlee M. Benjamin, Marshall Nichols, Thomas W. Burke, Geoffrey S. Ginsburg, and Joseph E. Lucas. Comparing reference-based RNA-Seq mapping methods for non-human primate data. BMC Genomics, 15:570, July 2014.
- [10].
Gaorui Bian, Gregory B. Gloor, Aihua Gong, Changsheng Jia, Wei Zhang, Jun Hu, Hong Zhang, Yumei Zhang, Zhenqing Zhou, Jiangao Zhang, Jeremy P. Burton, Gregor Reid, Yongliang Xiao, Qiang Zeng, Kaiping Yang, and Jiangang Li. The Gut Microbiota of Healthy Aged Chinese Is Similar to That of the Healthy Young. mSphere, 2(5):e00327–17, October 2017.
- [11].
C. I. Bliss and R. A. Fisher. Fitting the Negative Binomial Distribution to Biological Data. Biometrics, 9(2):176–200, 1953.
- [12].
K. Gerald van den Boogaart and Raimon Tolosana-Delgado. Descriptive Analysis of Compositional Data. In Analyzing Compositional Data with R, Use R!, pages 73–93. Springer, Berlin, Heidelberg, 2013. DOI: 10.1007/978-3-642-36809-7_4.
- [13].
K. Gerald van den Boogaart and Raimon Tolosana-Delgado. Fundamental Concepts of Compositional Data Analysis. In Analyzing Compositional Data with R, Use R!, pages 13–50. Springer, Berlin, Heidelberg, 2013. DOI: 10.1007/978-3-642-36809-7_2.
- [14].
K. Gerald van den Boogaart and Raimon Tolosana-Delgado. Zeroes, Missings, and Outliers. In Analyzing Compositional Data with R, Use R!, pages 209–253. Springer, Berlin, Heidelberg, 2013. DOI: 10.1007/978-3-642-36809-7_7.
- [15].
Antonella Buccianti. Is compositional data analysis a way to see beyond the illusion? Computers & Geosciences, 50:165–173, January 2013.
- [16].
Ana Conesa, Pedro Madrigal, Sonia Tarazona, David Gomez-Cabrero, Alejandra Cervera, Andrew McPherson, Michał Wojciech Szcześniak, Daniel J. Gaffney, Laura L. Elo, Xuegong Zhang, and Ali Mortazavi. A survey of best practices for RNA-seq data analysis. Genome Biology, 17:13, 2016.
- [17].
Marie-Agnès Dillies, Andrea Rau, Julie Aubert, Christelle Hennequet-Antier, Marine Jeanmougin, Nicolas Servant, Céline Keime, Guillemette Marot, David Castel, Jordi Estelle, Gregory Guernec, Bernd Jagla, Luc Jouneau, Denis Laloë, Caroline Le Gall, Brigitte Schaëffer, Stéphane Le Crom, Mickaël Guedj, and Florence Jaffrézic. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Briefings in Bioinformatics, 14(6):671–683, November 2013.
- [18].
Alexander Dobin, Carrie A. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R. Gingeras. STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1):15–21, January 2013.
- [19].
Juliane C. Dohm, Claudio Lottaz, Tatiana Borodina, and Heinz Himmelbauer. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Research, 36(16):e105, September 2008.
- [20].
J. J. Egozcue, V. Pawlowsky-Glahn, G. Mateu-Figueras, and C. Barceló-Vidal. Isometric Logratio Transformations for Compositional Data Analysis. Mathematical Geology, 35(3):279–300, April 2003.
- [21].
Pär G. Engström, Tamara Steijger, Botond Sipos, Gregory R. Grant, André Kahles, The RGASP Consortium, Gunnar Rätsch, Nick Goldman, Tim J. Hubbard, Jennifer Harrow, Roderic Guigó, and Paul Bertone. Systematic evaluation of spliced alignment programs for RNA-seq data. Nature Methods, 10(12):1185–1191, December 2013.
- [22].
Ionas Erb and Cedric Notredame. How should we measure proportionality on relative gene expression data? Theory in Biosciences, January 2016.
- [23].
Ionas Erb, Thomas Quinn, David Lovell, and Cedric Notredame. Differential Proportionality - A Normalization-Free Approach To Differential Gene Expression. Proceedings of CoDaWork 2017, The 7th Compositional Data Analysis Workshop; available under bioRxiv, page 134536, May 2017.
- [24].
Andrew D. Fernandes, Jean M. Macklaim, Thomas G. Linn, Gregor Reid, and Gregory B. Gloor. ANOVA-Like Differential Expression (ALDEx) Analysis for Mixed Population RNA-Seq. PLOS ONE, 8(7):e67019, July 2013.
- [25].
Andrew D. Fernandes, Jennifer Ns Reid, Jean M. Macklaim, Thomas A. McMurrough, David R. Edgell, and Gregory B. Gloor. Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome, 2:15, 2014.
- [26].
Nuno A. Fonseca, John Marioni, and Alvis Brazma. RNA-Seq gene profiling–a systematic empirical comparison. PloS One, 9(9):e107026, 2014.
- [27].
Nuno A. Fonseca, Johan Rung, Alvis Brazma, and John C. Marioni. Tools for mapping high-throughput sequencing data. Bioinformatics, 28(24):3169–3177, December 2012.
- [28].
Jonathan Friedman and Eric J. Alm. Inferring correlation networks from genomic survey data. PLoS Computational Biology, 8(9):e1002687, 2012.
- [29].
Gregory R. Grant, Michael H. Farkas, Angel D. Pizarro, Nicholas F. Lahens, Jonathan Schug, Brian P. Brunk, Christian J. Stoeckert, John B. Hogenesch, and Eric A. Pierce. Comparative analysis of RNA-Seq alignment algorithms and the RNA-Seq unified mapper (RUM). Bioinformatics, 27(18):2518–2528, September 2011.
- [30].
Michael Greenacre. Power transformations in correspondence analysis. Computational Statistics & Data Analysis, 53(8):3107–3116, June 2009.
- [31].
Michael Greenacre. Measuring Subcompositional Incoherence. Mathematical Geosciences, 43(6):681–693, August 2011.
- [32].
Michael Greenacre. Towards a pragmatic approach to compositional data analysis. Technical Report 1554, Department of Economics and Business, Universitat Pompeu Fabra, January 2017.
- [33].
Malachi Griffith, Jason R. Walker, Nicholas C. Spies, Benjamin J. Ainscough, and Obi L. Griffith. Informatics for RNA Sequencing: A Web Resource for Analysis on the Cloud. PLoS Computational Biology, 11(8):e1004393, August 2015.
- [34].
Ayat Hatem, Doruk Bozdağ, Amanda E. Toland, and Ümit V. Çatalyürek. Benchmarking short sequence mapping tools. BMC Bioinformatics, 14:184, June 2013.
- [35].
Katharina E. Hayer, Angel Pizarro, Nicholas F. Lahens, John B. Hogenesch, and Gregory R. Grant. Benchmark analysis of algorithms for determining and quantifying full-length mRNA splice forms from RNA-seq data. Bioinformatics (Oxford, England), 31(24):3938–3945, December 2015.
- [36].
Steven R. Head, H. Kiyomi Komori, Sarah A. LaMere, Thomas Whisenant, Filip Van Nieuwerburgh, Daniel R. Salomon, and Phillip Ordoukhanian. Library construction for next-generation sequencing: Overviews and challenges. BioTechniques, 56(2):61–passim, February 2014.
- [37].
Lichun Jiang, Felix Schlesinger, Carrie A. Davis, Yu Zhang, Renhua Li, Marc Salit, Thomas R. Gingeras, and Brian Oliver. Synthetic spike-in standards for RNA-seq experiments. Genome Research, 21(9):1543–1551, September 2011.
- [38].
Zachary D. Kurtz, Christian L. Müller, Emily R. Miraldi, Dan R. Littman, Martin J. Blaser, and Richard A. Bonneau. Sparse and Compositionally Robust Inference of Microbial Ecological Networks. PLOS Computational Biology, 11(5):e1004226, May 2015.
- [39].
Charity W. Law, Yunshun Chen, Wei Shi, and Gordon K. Smyth. voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology, 15:R29, January 2014.
- [40].
Jun-Hao Li, Shun Liu, Ling-Ling Zheng, Jie Wu, Wen-Ju Sun, Ze-Lin Wang, Hui Zhou, Liang-Hu Qu, and Jian-Hua Yang. Discovery of protein–lncRNA interactions by integrating large-scale CLIP-Seq and RNA-Seq datasets. Bioinformatics and Computational Biology, 2:88, 2015.
- [41].
Yanzhu Lin, Kseniya Golovnina, Zhen-Xia Chen, Hang Noh Lee, Yazmin L. Serrano Negron, Hina Sultana, Brian Oliver, and Susan T. Harbison. Comparison of normalization and differential expression analyses using RNA-Seq data from 726 individual Drosophila melanogaster. BMC Genomics, 17, January 2016.
- [42].
Robert Lindner and Caroline C. Friedel. A Comprehensive Evaluation of Alignment Algorithms in the Context of RNA-Seq. PLOS ONE, 7(12):e52403, December 2012.
- [43].
David Lovell, Vera Pawlowsky-Glahn, Juan José Egozcue, Samuel Marguerat, and Jürg Bähler. Proportionality: A Valid Alternative to Correlation for Relative Data. PLoS Computational Biology, 11(3), March 2015.
- [44].
Siddhartha Mandal, Will Van Treuren, Richard A. White, Merete Eggesbø, Rob Knight, and Shyamal D. Peddada. Analysis of composition of microbiomes: a novel method for studying microbial composition. Microbial Ecology in Health and Disease, 26:27663, 2015.
- [45].
J. A. Martín-Fernández, C. Barceló-Vidal, V. Pawlowsky-Glahn, A. Buccianti, M. Nardi, and R. Potenza. Measures of difference for compositional data and hierarchical clustering methods. In Proceedings of IAMG, volume 98, pages 526–531, 1998.
- [46].
J. A. Martín-Fernández and S. Thió-Henestrosa. Rounded zeros: some practical aspects for compositional data. Geological Society, London, Special Publications, 264(1):191–201, 2006.
- [47].
- [48].
Amy McMillan, Stephen Rulisa, Mark Sumarah, Jean M. Macklaim, Justin Renaud, Jordan E. Bisanz, Gregory B. Gloor, and Gregor Reid. A multi-platform metabolomics approach identifies highly specific biomarkers of bacterial diversity in the vagina of pregnant and non-pregnant women. Scientific Reports, 5:14174, September 2015.
- [49].
Gabriela A. Merino, Ana Conesa, and Elmer A. Fernandez. A benchmarking of workflows for detecting differential splicing and differential expression at isoform level in human RNA-seq studies. bioRxiv, 2017.
- [50].
Michael L. Metzker. Sequencing technologies – the next generation. Nature Reviews Genetics, 11(1):31–46, January 2010.
- [51].
Rob Patro, Geet Duggal, Michael I. Love, Rafael A. Irizarry, and Carl Kingsford. Salmon: fast and bias-aware quantification of transcript expression using dual-phase inference. Nature Methods, 14(4):417, 2017.
- [52].
Karl Pearson. Mathematical Contributions to the Theory of Evolution. III. Regression, Heredity, and Panmixia. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 187:253–318, 1896.
- [53].
Thomas Quinn, Mark F. Richardson, David Lovell, and Tamsyn Crowley. propr: An R-package for Identifying Proportionally Abundant Features Using Compositional Data Analysis. bioRxiv, page 104935, February 2017.
- [54].
Franck Rapaport, Raya Khanin, Yupu Liang, Mono Pirun, Azra Krek, Paul Zumbo, Christopher E. Mason, Nicholas D. Socci, and Doron Betel. Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data. Genome Biology, 14(9):R95, 2013.
- [55].
Matthew E. Ritchie, Belinda Phipson, Di Wu, Yifang Hu, Charity W. Law, Wei Shi, and Gordon K. Smyth. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, 43(7):e47, April 2015.
- [56].
Mark D. Robinson, Davis J. McCarthy, and Gordon K. Smyth. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1):139–140, January 2010.
- [57].
Mark D. Robinson and Alicia Oshlack. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology, 11:R25, 2010.
- [58].
Edoardo Saccenti. Correlation Patterns in Experimental Data Are Affected by Normalization Procedures: Consequences for Data Analysis and Network Inference. Journal of Proteome Research, November 2016.
- [59].
Nicholas J. Schurch, Pietá Schofield, Marek Gierliński, Christian Cole, Alexander Sherstnev, Vijender Singh, Nicola Wrobel, Karim Gharbi, Gordon G. Simpson, Tom Owen-Hughes, Mark Blaxter, and Geoffrey J. Barton. How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? RNA (New York, N.Y.), 22(6):839–851, June 2016.
- [60].
Matthew Scott, Carl W. Gunderson, Eduard M. Mateescu, Zhongge Zhang, and Terence Hwa. Interdependence of cell growth and gene expression: origins and consequences. Science, 330(6007):1099–1102, 2010.
- [61].
Fatemeh Seyednasrollah, Asta Laiho, and Laura L. Elo. Comparison of software packages for detecting differential expression in RNA-seq studies. Briefings in Bioinformatics, 16(1):59–70, January 2015.
- [62].
Gordon K. Smyth. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3:Article3, 2004.
- [63].
Charlotte Soneson and Mauro Delorenzi. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics, 14:91, 2013.
- [64].
Sonia Tarazona, Pedro Furió-Tarí, David Turrà, Antonio Di Pietro, María José Nueda, Alberto Ferrer, and Ana Conesa. Data quality aware analysis of differential expression in RNA-seq with NOISeq R/Bioc package. Nucleic Acids Research, 43(21):e140–e140, December 2015.
- [65].
Mingxiang Teng, Michael I. Love, Carrie A. Davis, Sarah Djebali, Alexander Dobin, Brenton R. Graveley, Sheng Li, Christopher E. Mason, Sara Olson, Dmitri Pervouchine, Cricket A. Sloan, Xintao Wei, Lijun Zhan, and Rafael A. Irizarry. A benchmark for RNA-seq quantification pipelines. Genome Biology, 17:74, March 2016.
- [66].
C. W. Thomas and J. Aitchison. Log-ratios and geochemical discrimination of Scottish Dalradian limestones: a case study. Geological Society, London, Special Publications, 264(1):25–41, January 2006.
- [67].
Hande Topa and Antti Honkela. Analysis of differential splicing suggests different modes of short-term splicing regulation. Bioinformatics, 32(12):i147–i155, June 2016.
- [68].
Cole Trapnell, Lior Pachter, and Steven L. Salzberg. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics, 25(9):1105–1111, May 2009.
- [69].
Cole Trapnell, Brian A. Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, Marijke J. van Baren, Steven L. Salzberg, Barbara J. Wold, and Lior Pachter. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology, 28(5):511–515, May 2010.
- [70].
Camilla Urbaniak, Michelle Angelini, Gregory B. Gloor, and Gregor Reid. Human milk microbiota profiles in relation to birthing method, gestation and infant gender. Microbiome, 4:1, 2016.
- [71].
K. Gerald van den Boogaart and R. Tolosana-Delgado. "compositions": A unified R package to analyze compositional data. Computers & Geosciences, 34(4):320–338, April 2008.
- [72].
W. A. Wang, C. T. Wu, T. P. Lu, M. H. Tsai, L. C. Lai, and E. Y. Chuang. Comparisons and performance evaluations of RNA-seq alignment tools. In 2014 International Conference on Electrical Engineering and Computer Science (ICEECS), pages 215–218, October 2014.
- [73].
Zhong Wang, Mark Gerstein, and Michael Snyder. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews. Genetics, 10(1):57–63, January 2009.
- [74].
Alex D. Washburne, Justin D. Silverman, Jonathan W. Leff, Dominic J. Bennett, John L. Darcy, Sayan Mukherjee, Noah Fierer, and Lawrence A. David. Phylogenetic factorization of compositional data yields lineage-level associations in microbiome datasets. PeerJ, 5, February 2017.
- [75].
Claire R. Williams, Alyssa Baccarella, Jay Z. Parrish, and Charles C. Kim. Empirical assessment of analysis workflows for differential expression analysis of human samples using RNA-Seq. BMC Bioinformatics, 18, January 2017.
- [76].
Tianwei Yu and Yun Bai. Capturing changes in gene expression dynamics by gene set differential coordination analysis. Genomics, 98(6):469–477, December 2011.
Source: https://www.biorxiv.org/content/10.1101/206425v1.full