

UniFrac Help 

Introduction The UniFrac interface provides a suite of tools for the comparison of microbial communities using phylogenetic information. It takes as input a single rooted phylogenetic tree that contains sequences derived from at least two different environmental samples and a file describing which sequences came from which sample. Both the UniFrac distance metric (Lozupone, C. and R. Knight. 2005. Appl. Environ. Microbiol. 71:8228-8235) and the P test (Martin, A. 2002. Appl. Environ. Microbiol. 68:3673-3682) can be used to make comparisons. Both of these techniques bypass the need to choose operational taxonomic units (OTUs) based on sequence divergence prior to analysis. The UniFrac distance is calculated as the fraction of branch length leading to descendants from only one of the environments represented in the phylogenetic tree, and reflects differences in the phylogenetic lineages that are adapted to live in one environment versus the others. Although the UniFrac metric and the P test have primarily been applied to trees made with 16S ribosomal RNA gene sequences, phylogenetic trees from any kind of gene can be analyzed. The UniFrac interface allows you to:


Why would I want to use UniFrac? You should use UniFrac if you have sequence data that you can put into a tree, you want to measure the differences in community composition between environments, and you want to take into account the different evolutionary relationships among taxa. You should not use UniFrac if you want a measure of the diversity in each sample, or if you want to perform an OTU-based analysis. The UniFrac interface has analyses that are optimal for comparing sequences from only a few different environments, and analyses that are optimal for comparing sequences from many different environments simultaneously. If your dataset contains sequences from only a few different environments, try UniFrac Significance, P Test Significance, or Lineage-Specific Analysis. These are designed to determine whether the samples are significantly different from each other or to find specific lineages that contribute to differences between environments. These analyses do not work well on datasets containing sequences from many different environments because 1) it can be difficult to interpret what the results mean (e.g. a significant difference between the environments in a tree with 10 different environments would not tell you which environments are contributing to this difference) or 2) it can be difficult to get significant P-values after correcting for multiple comparisons when pairwise comparisons between all of the environments are made (all P-values are multiplied by the number of comparisons made). If your dataset contains sequences from more than 3 different environments, try Cluster Environments, Jackknife Environment Clusters, or PCA. These techniques are designed to cluster environments based on the sequences they contain and can tell you which environments have related communities.

Upload tree file in NEWICK or NEXUS format: The uploaded tree must contain sequences from at least two different environments and can be generated by programs such as Arb, Phylip, PAUP*, MrBayes, RAxML, or Clustal W. UniFrac will automatically detect the format that the tree is in. The tree should be rooted; loading an unrooted tree will either cause an error or result in the assignment of an arbitrary root. The quality of the UniFrac analysis results depends on the quality of the input tree.

NEWICK format: NEWICK is a standard format that is recognized by most programs that generate or allow visualization of phylogenetic trees, including PHYLIP, TREE-PUZZLE, ARB, and TREEVIEW. Here is an example of a tree in NEWICK format:

    (((SEQ1:0.02120,(SEQ2:0.09111,SEQ3:0.04491)node1:0.00097)node2:0.00194,(SEQ4:0.03160,SEQ5:0.04378)node3:0.00365)node4:0.00188,SEQ6:0.00881)node5:0.00739;

NEXUS format: NEXUS format incorporates NEWICK formatting along with other commands relevant to programs such as PAUP*. Here is an example of a tree in NEXUS format:

    #NEXUS
    Begin trees; [Treefile saved Wed Jul 26 19:40:41 2000]
    [output from your data run]
    Translate
        1 TRXEcoli,
        2 TRXHomo,
        3 TRXSacch,
        4 erCaelA,
        5 erCaelB,
        6 erCaelC,
        7 erHomoA,
        8 erHomoB,
        9 erHomoC,
        10 erpCaelC
        ;
    tree PAUP_1 = [&U] (1,((2,3),((((4,10),(5,8)),(6,9)),7)));
    End;

Each line in the environment file has a sequence name and an environment name separated by a tab. If the same sequence was found in more than one environment, it should be represented by more than one line. Here is an example of an environment file:

    SEQ1	ENV1
    SEQ1	ENV2
    SEQ2	ENV1
    SEQ3	ENV1
    SEQ4	ENV2
    SEQ5	ENV1
    SEQ6	ENV3
    SEQ6	ENV2
    ...

Optionally, the environment file can contain a third column describing the number of times the sequence was observed in each environment. This data can come from RFLP data or OTU counts. This type of information is used in the Lineage-specific analysis and when applying weighted UniFrac, which accounts for sequence abundance when comparing diversity. Here is an example of an environment file that includes abundance information:

    SEQ1	ENV1	1
    SEQ1	ENV2	2
    SEQ2	ENV1	15
    SEQ3	ENV1	2
    SEQ4	ENV2	8
    SEQ5	ENV1	4
    SEQ6	ENV3	1
    SEQ6	ENV2	1
    ...
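As an illustration of the file layout described above, here is a minimal Python sketch that reads tab-separated environment lines into a nested dictionary. The function name `parse_environment_file` and the choice to default a missing third column to a count of 1 are assumptions made for this example; they are not part of the UniFrac interface.

```python
from collections import defaultdict


def parse_environment_file(text):
    """Return {sequence: {environment: count}} from tab-separated lines.

    Assumption: when the optional third column is absent, the sequence
    is counted once in that environment.
    """
    table = defaultdict(dict)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        seq, env = fields[0], fields[1]
        count = int(fields[2]) if len(fields) > 2 else 1
        table[seq][env] = count
    return dict(table)


example = "SEQ1\tENV1\t1\nSEQ1\tENV2\t2\nSEQ2\tENV1\t15\nSEQ3\tENV1\n"
counts = parse_environment_file(example)
```

A sequence found in two environments, such as SEQ1 here, ends up with one entry per environment under the same key.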

Select YES if you want to display a text version of the tree you upload. The sequences in the tree will be labeled with information on the environments to which they were assigned and the number of times they were observed in each environment. You must display the tree if you plan to do a Lineage-specific analysis.

During the upload, the UniFrac interface removes sequences that were not assigned to any environment from the tree. The number of sequences after this pruning step is reported. A text representation of the tree is drawn where each vertical dash represents 1 unit of branch length. The tree leaves are labeled with the taxon name, a list of environments to which the taxon was assigned (in blue), and the number of times that the sequence was observed in each environment (in red). A red dashed line is printed at regular intervals throughout the tree and serves as a scale bar that can be used to divide the tree into lineages for the Lineage-specific analysis. Mousing over the scale bar shows the distance from the root at that point in the tree.

UniFrac can perform several different phylogenetic analyses on the tree and environments you uploaded.


Select Analysis  UniFrac Significance Calculating the UniFrac metric. The majority of options in the UniFrac interface make comparisons based on the UniFrac metric. The UniFrac metric measures the difference between two environments in terms of the branch length that is unique to one environment or the other. In the tree on the left (below), the division between the two environments (labeled red and blue) occurs very early in the tree, so that all of the branch length is unique to one environment or the other. This provides the maximum UniFrac distance possible, 1.0. In the tree on the right, every sequence in the first environment has a very similar counterpart in the other environment, so most of the branch length in the tree comes from nodes that have descendants in both environments. In the example, there is about as much branch length unique to each environment (red or blue) as shared between environments (purple), so the UniFrac value would be about 0.5. If the two environments were identical and all the same sequences were found in both, all the branch length would be shared and the UniFrac value would be 0.
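The calculation described above can be sketched in a few lines of Python. This is a simplified illustration, not the interface's implementation: the tree is assumed to be given as a flat list of branches, each with its length and the set of leaves below it, and the names `unweighted_unifrac`, `BRANCHES`, and the toy labels are hypothetical.

```python
def unweighted_unifrac(branches, envs_of, env_a, env_b):
    """Fraction of branch length that leads to only one of two environments.

    branches: list of (length, set_of_leaf_names_below_the_branch)
    envs_of:  {leaf_name: set of environment names}
    Branches with no descendants in either environment are ignored.
    """
    unique = shared = 0.0
    pair = {env_a, env_b}
    for length, leaves in branches:
        seen = set()
        for leaf in leaves:
            seen |= envs_of.get(leaf, set()) & pair
        if len(seen) == 1:
            unique += length
        elif len(seen) == 2:
            shared += length
    total = unique + shared
    return unique / total if total else 0.0


# Toy tree ((s1:1,s2:1):1,(s3:1,s4:1):1); every branch has length 1.
BRANCHES = [
    (1.0, {"s1"}), (1.0, {"s2"}), (1.0, {"s1", "s2"}),
    (1.0, {"s3"}), (1.0, {"s4"}), (1.0, {"s3", "s4"}),
]

separated = {"s1": {"red"}, "s2": {"red"}, "s3": {"blue"}, "s4": {"blue"}}
mixed = {"s1": {"red"}, "s2": {"blue"}, "s3": {"red"}, "s4": {"blue"}}
identical = {leaf: {"red", "blue"} for leaf in ("s1", "s2", "s3", "s4")}

d_separated = unweighted_unifrac(BRANCHES, separated, "red", "blue")  # 1.0
d_mixed = unweighted_unifrac(BRANCHES, mixed, "red", "blue")          # 4/6
d_identical = unweighted_unifrac(BRANCHES, identical, "red", "blue")  # 0.0
```

The three toy labelings reproduce the cases in the text: fully separated environments give the maximum value, interleaved environments give an intermediate value, and identical environments give 0.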
Calculating the Weighted UniFrac Metric. The UniFrac metric described above does not account for the relative abundance of sequences in the different environments because duplicate sequences contribute no additional branch length to the tree (by definition, the branch length that separates a pair of duplicate sequences is zero, because no substitutions separate them). Because the relative abundance of different kinds of bacteria can be critical for describing community changes, we have developed a variant of the algorithm, weighted UniFrac, which weights the branches based on abundance information during the calculations. Weighted UniFrac can thus detect changes in how many organisms from each lineage are present, as well as detecting changes in which organisms are present. The first step in the algorithm is to calculate the raw weighted UniFrac value (u), according to the equation below:

u = \sum_{i=1}^{n} b_{i} \left| \frac{A_{i}}{A_{T}} - \frac{B_{i}}{B_{T}} \right|
Here, n is the total number of branches in the tree, b_{i} is the length of branch i, A_{i} and B_{i} are the number of descendants of branch i from communities A and B respectively, and A_{T} and B_{T} are the total number of sequences from communities A and B respectively. In order to control for unequal sampling effort, A_{i} and B_{i} are divided by A_{T} and B_{T}. The figure below illustrates how the Weighted algorithm works. Branch lengths are weighted by the relative abundance of sequences in the square and circle communities; square sequences are weighted twice as much as circle sequences because there are twice as many total circle sequences in the dataset. The width of branches is proportional to the degree to which each branch is weighted in the calculations and the grey branches have no weight. Branches 1 and 2 have heavy weights since the descendants are biased towards the square and circles respectively. Branch 3 contributes no value since it has an equal contribution from circle and square sequences after normalization for different sample sizes.
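The raw weighted value u can be sketched from the definitions above: each branch length b_{i} is multiplied by the absolute difference between the proportions A_{i}/A_{T} and B_{i}/B_{T}. The flat branch list and the function name `raw_weighted_unifrac` are assumptions for this example only.

```python
def raw_weighted_unifrac(branches, counts_a, counts_b):
    """Sum over branches of b_i * |A_i/A_T - B_i/B_T|.

    branches: list of (length, set_of_leaf_names_below_the_branch)
    counts_a, counts_b: {leaf_name: times observed} for communities A and B
    """
    a_total = sum(counts_a.values())
    b_total = sum(counts_b.values())
    u = 0.0
    for length, leaves in branches:
        a_i = sum(counts_a.get(leaf, 0) for leaf in leaves)
        b_i = sum(counts_b.get(leaf, 0) for leaf in leaves)
        u += length * abs(a_i / a_total - b_i / b_total)
    return u


# Toy tree ((s1:1,s2:1):1,(s3:1,s4:1):1); every branch has length 1.
branches = [
    (1.0, {"s1"}), (1.0, {"s2"}), (1.0, {"s1", "s2"}),
    (1.0, {"s3"}), (1.0, {"s4"}), (1.0, {"s3", "s4"}),
]
counts_a = {"s1": 2, "s3": 2}  # community A: 4 sequences in total
counts_b = {"s2": 1, "s4": 1}  # community B: 2 sequences in total

u = raw_weighted_unifrac(branches, counts_a, counts_b)
```

In this toy case the two internal branches contribute nothing, because each carries half of A and half of B after dividing by the totals, while each of the four leaf branches contributes 0.5.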
Normalizing Weighted UniFrac Values. If the phylogenetic tree is not ultrametric (i.e. if different sequences in the sample have evolved at different rates), comparing environments with the Cluster Environments or PCA analysis options using weighted UniFrac will place more emphasis on communities that contain taxa that have evolved more quickly. This is because these taxa contribute more branch length to the tree. In some situations, it may be desirable to normalize the branch lengths within each sample. This normalization has the effect of treating each sample equally instead of treating each unit of branch length equally: the issues involved are similar to those involved in performing multivariate analyses using the correlation matrix, to treat each variable equally independent of scale, or using the covariance matrix, to take the scale into account. Normalization has the additional effect of placing all pairwise comparisons on the same scale as unweighted UniFrac (0 for identical communities, 1 for non-overlapping communities), allowing comparisons among different analyses with different samples. The scale of the raw weighted UniFrac value (u) depends on the average distance of each sequence from the root. The normalization to correct for this effect is performed by dividing u by the distance scale factor D (see equation below), which is the average distance of each sequence from the root weighted by the number of times each sequence was observed in each community.

D = \sum_{j} d_{j} \left( \frac{A_{j}}{A_{T}} + \frac{B_{j}}{B_{T}} \right)
Here d_{j} is the distance of sequence j from the root, A_{j} and B_{j} are the number of times the sequences were observed in communities A and B respectively, and A_{T} and B_{T} are the total number of sequences from communities A and B respectively.

The UniFrac metric can be used to measure whether the microbial communities in the environments are significantly different. Environments are significantly different if the UniFrac value for the real tree is greater than would be expected if the sequences were randomly distributed between the environments. To test this, the environment labels are randomly permuted across the sequences in the tree and a new UniFrac value is calculated. This is repeated either 10, 100, or 1000 times (determined by what is set for "Number of Permutations"). The reported P-value is the fraction of permuted trees that have UniFrac values greater than or equal to that of the real tree.
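The permutation procedure can be sketched as follows. This is a simplified illustration under stated assumptions: each sequence belongs to exactly one environment, the tree is a flat list of branches, and the names `unifrac`, `permutation_p_value`, and `clade` are hypothetical helpers rather than functions of the interface.

```python
import random


def unifrac(branches, env_of):
    """Fraction of branch length whose descendants share a single environment."""
    unique = total = 0.0
    for length, leaves in branches:
        total += length
        if len({env_of[leaf] for leaf in leaves}) == 1:
            unique += length
    return unique / total if total else 0.0


def permutation_p_value(branches, env_of, n_permutations=100, seed=0):
    """Fraction of label permutations with a UniFrac value >= the observed one."""
    rng = random.Random(seed)
    observed = unifrac(branches, env_of)
    leaves = sorted(env_of)
    labels = [env_of[leaf] for leaf in leaves]
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(labels)  # reassign environments at random
        if unifrac(branches, dict(zip(leaves, labels))) >= observed:
            hits += 1
    return hits / n_permutations


def clade(names):
    """Branches of a balanced four-leaf clade, all of length 1."""
    a, b, c, d = names
    return ([(1.0, {n}) for n in names]
            + [(1.0, {a, b}), (1.0, {c, d}), (1.0, set(names))])


group_a = ["a1", "a2", "a3", "a4"]
group_b = ["b1", "b2", "b3", "b4"]
branches = clade(group_a) + clade(group_b)
env_of = {name: "ENV1" for name in group_a}
env_of.update({name: "ENV2" for name in group_b})

observed = unifrac(branches, env_of)
p = permutation_p_value(branches, env_of, n_permutations=100)
```

Because the two toy environments occupy separate clades, the observed value is the maximum possible and only a small fraction of shuffled labelings can match it.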


UniFrac Significance  "Type of test" There are three ways to calculate the UniFrac significance:

All environments together: tests whether the sequences from all of the different environments in the tree are significantly different from each other. For instance, if there are sequences from 3 different environments in the tree, the UniFrac values are calculated as the fraction of the branch length in the tree that leads to descendants from only one of the 3 environments. The output is a single P-value.

Each pair of environments: tests whether each pair of environments differs from one another. For each possible pair of environments, all sequences that are not in one of the two environments are removed and the significance is calculated. The calculated P-values are output in two separate data tables. In the first table, the values have been corrected for multiple comparisons by multiplying them by the number of comparisons that were made (the Bonferroni correction). The second table displays the raw, uncorrected values. The names of the environments are shown in the column and row headers; if you move your mouse over an individual cell, it will display the names of the environments that the P-value is associated with. Each cell is colored by the P-value: P-values < 0.001 are red, < 0.01 are yellow, < 0.05 are green, and > 0.05 are blue. Each data table can be downloaded to an Excel spreadsheet with the "download data" link. An example of the output is shown below:
Each environment individually: for each environment, treats the rest of the tree as a single "other" environment and determines if that particular environment has more unique branch length than expected by chance. If again we are looking at a tree with sequences from 3 different environments, and we determine that the environments are significantly different when evaluating "All environments together," the tree can theoretically have a significant UniFrac value only because of unique lineages in one particular environment. The "Each environment individually" option provides a P-value for each environment in the tree and can be used to determine which particular environment is contributing the most unique branch length. The output is a table of P-values for each environment indicating whether that environment is significantly different from the rest of the tree. An example of the output is shown below. This result would indicate that only EPA has a significant amount of unique branch length with respect to the other samples in the tree.
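The Bonferroni correction applied to the pairwise tables amounts to multiplying each raw P-value by the number of comparisons. A minimal sketch follows; the function name `bonferroni` and the example values are made up for illustration, and capping the corrected value at 1.0 is an assumption of this sketch rather than documented behavior of the interface.

```python
def bonferroni(raw_p_values):
    """Multiply each raw P-value by the number of comparisons made.

    raw_p_values: {(env_a, env_b): raw P-value}
    Assumption: corrected values are capped at 1.0.
    """
    n_comparisons = len(raw_p_values)
    return {pair: min(1.0, p * n_comparisons)
            for pair, p in raw_p_values.items()}


# Hypothetical raw P-values for the three pairs among three environments.
raw = {("A", "B"): 0.001, ("A", "C"): 0.02, ("B", "C"): 0.5}
corrected = bonferroni(raw)
```

With three comparisons, a raw value of 0.02 becomes 0.06 and is no longer below 0.05, which shows why significance is harder to reach as the number of environments grows.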


UniFrac Significance  "Number of Permutations" This is the number of tree permutations used to generate the random distribution of UniFrac values to which the value of the true tree is compared when calculating the P-value. Guest accounts have a restricted maximum sample size.

UniFrac Significance  "Use abundance weights" If "no", the standard unweighted UniFrac algorithm is used to perform the calculations. If "yes", the weighted UniFrac algorithm is used.

Select Analysis  P Test Significance The P Test (Martin, A. 2002. Appl. Environ. Microbiol. 68:3673-3682) also uses phylogenetic information to test whether two environments are significantly different. It estimates similarity between communities as the number of parsimony changes (change from one environment to another along a branch) that would be required to explain the distribution of sequences between the different environments in the tree. A phylogenetic tree representing sequences from very different communities would require fewer parsimony changes to explain the distribution of sequences between environments than if the environments were similar. One way to think about this is that if two environments are mutually exclusive, only one parsimony change is required to separate the sequences in one environment from the other (see example below), whereas if the two were identical and had the same sequences we would expect all sequences to be found in both, requiring a large number of parsimony changes.
Generating the P test P-value. Two environments that are extremely well-separated on a tree (A) require few changes, and perhaps only one change, between environments to explain the modern distribution of lineages in environments. The minimum number of changes from one environment to another is inferred by parsimony. This observed value is compared to a distribution of values obtained by randomizing the environment to which each sequence is assigned, and re-inferring the number of parsimony changes required for these permuted environment sets. If the observed value is much lower than the average for the distribution, we conclude that organisms are significantly clustered by environment. The P-value is the fraction of random permutations of the labels that require fewer parsimony changes to explain than does the actual labeling. Note that the P-value is a measure of statistical significance, not of distance between environments.
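A minimal sketch of this procedure is given below. It assumes Fitch's small-parsimony algorithm for counting changes (the text only says the minimum is "inferred by parsimony"), a strictly binary tree written as nested tuples, and one environment per sequence. The names `fitch` and `p_test` are hypothetical.

```python
import random


def fitch(node, env_of):
    """Return (candidate states, parsimony changes) for a binary subtree.

    node is either a leaf name (str) or a pair (left, right) of subtrees.
    """
    if isinstance(node, str):
        return {env_of[node]}, 0
    left_states, left_changes = fitch(node[0], env_of)
    right_states, right_changes = fitch(node[1], env_of)
    changes = left_changes + right_changes
    common = left_states & right_states
    if common:
        return common, changes
    # No shared state: one change is needed somewhere below this node.
    return left_states | right_states, changes + 1


def p_test(tree, env_of, n_permutations=100, seed=0):
    """Fraction of label permutations needing fewer changes than observed."""
    rng = random.Random(seed)
    observed = fitch(tree, env_of)[1]
    leaves = sorted(env_of)
    labels = [env_of[leaf] for leaf in leaves]
    fewer = 0
    for _ in range(n_permutations):
        rng.shuffle(labels)
        if fitch(tree, dict(zip(leaves, labels)))[1] < observed:
            fewer += 1
    return fewer / n_permutations


tree = (("s1", "s2"), ("s3", "s4"))
separated = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
alternating = {"s1": "A", "s2": "B", "s3": "A", "s4": "B"}

changes_separated = fitch(tree, separated)[1]      # 1
changes_alternating = fitch(tree, alternating)[1]  # 2
p_separated = p_test(tree, separated)              # 0.0
```

For the separated labeling a single change explains the tree, and since no labeling with two environments can need fewer than one change, no permutation beats it.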

P Test Significance  "Type of test" There are two ways to calculate the P test significance:

All environments together: tests whether the sequences from all of the different environments in the tree are significantly different from each other. For instance, if there are sequences from 3 different environments in the tree, the number of parsimony changes is the number required to describe the distribution of sequences from the 3 different environments on the tree. The output is a single P-value indicating whether the environments are significantly clustered on the tree.

Each pair of environments: tests whether each pair of environments differs from one another. For each possible pair of environments, all sequences that are not in one of the two environments are removed and the significance is calculated. The calculated P-values are output in two separate data tables. In the first table, the values have been corrected for multiple comparisons by multiplying them by the number of comparisons that were made (the Bonferroni correction). The second table displays the raw, uncorrected values. The names of the environments are shown in the column and row headers; if you move your mouse over an individual cell, it will display the names of the environments that the P-value is associated with. Each cell is colored by the P-value: P-values < 0.001 are red, < 0.01 are yellow, < 0.05 are green, and > 0.05 are blue. Each data table can be downloaded to an Excel spreadsheet with the "download data" link. An example of the output is shown below:
 
P Test Significance  "Number of Permutations" This is the number of tree permutations used to generate the random distribution of values to which the value of the true tree is compared when calculating a P-value. Guest accounts have a restricted maximum sample size.

Select Analysis  Lineage-Specific Analysis The lineage-specific analysis applies the G-test of significance to each lineage to determine whether the sequences have a different distribution among environments than does the tree overall. For instance, if a tree has 50% sequences from env1 and 50% from env2, a lineage that has 95% from env1 and 5% from env2 will likely have a significant G-test value. The G-test is similar to the chi-squared test for fit, but is somewhat more accurate for small sample sizes (Sokal and Rohlf 1995, Biometry 3rd ed.). The lineages are isolated by cutting the tree at a specific distance from the root. The G-test of significance is then performed on each lineage. The G-test considers the distribution of all environments in the tree; the results are thus most meaningful for trees with sequences from only 2 or 3 environments. Choosing a small distance close to the root would perform a high-level analysis at the division level, whereas choosing a large distance far from the root would perform an analysis at about the family or genus level. Since all P-values are multiplied by the number of lineages evaluated (corrected for multiple comparisons), it can be difficult to detect significant differences when a large distance from the root is specified and many lineages are evaluated. The lineage-specific analysis will account for abundance information if it is provided in the environment file. For instance, if you know based on RFLP data that a particular sequence was actually represented by 15 different clones, the program will count that sequence 15 times when performing the G-test if that information was loaded with the environment file. After the analysis is performed, the initial tree is returned with the lineages that were evaluated colored according to level of significance.
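To make the 95%/5% example concrete, here is a hedged sketch of a G-test of goodness of fit for one lineage. It assumes the standard statistic G = 2 * sum(O * ln(O / E)), with expected counts taken from the environment proportions in the whole tree, and it converts G to a P-value only for the two-environment case (one degree of freedom). The function names are hypothetical and no multiple-comparison correction is applied.

```python
import math


def g_statistic(lineage_counts, tree_counts):
    """G = 2 * sum(O * ln(O / E)) for one lineage against the whole tree.

    lineage_counts: {environment: sequences observed in the lineage}
    tree_counts:    {environment: sequences observed in the whole tree}
    """
    lineage_total = sum(lineage_counts.values())
    tree_total = sum(tree_counts.values())
    g = 0.0
    for env, tree_n in tree_counts.items():
        observed = lineage_counts.get(env, 0)
        expected = lineage_total * tree_n / tree_total
        if observed > 0:  # a zero count contributes nothing to the sum
            g += 2.0 * observed * math.log(observed / expected)
    return g


def p_value_two_environments(g):
    """Chi-squared tail probability for 1 degree of freedom (2 environments)."""
    return math.erfc(math.sqrt(max(g, 0.0) / 2.0))


# A lineage of 20 sequences, 95% from env1, in a tree split 50/50.
tree_counts = {"env1": 50, "env2": 50}
lineage_counts = {"env1": 19, "env2": 1}

g = g_statistic(lineage_counts, tree_counts)
p = p_value_two_environments(g)
```

Here the expected counts are 10 and 10, so the observed 19 and 1 give a large G and a very small uncorrected P-value.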

Lineage-Specific Analysis  "Branch Length Threshold" The lineage-specific analysis requires you to cut the tree at a specified distance from the root. Each lineage that exists at this distance will be treated separately in the analysis. The branch length threshold provides this distance. A small threshold (e.g. 0.01) will produce an analysis of a few major lineages, e.g. divisions. A high threshold (e.g. 0.6) will produce an analysis of many minor lineages, e.g. families. You can type in a known branch length, or use the scale bar (red dotted line shown in figure below) to select a threshold from the tree.
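One way to picture the cut is to walk down from the root and stop at every branch that crosses the threshold distance. The sketch below makes several assumptions that the text does not specify: nodes are `(children_or_name, branch_length)` tuples, a branch counts as crossing when it ends at or beyond the threshold, and sequences whose branch ends before the threshold are left out. The names `leaves_below` and `lineages_at` are hypothetical.

```python
def leaves_below(node):
    """List the leaf names under a node given as (children_or_name, length)."""
    label, _length = node
    if isinstance(label, str):
        return [label]
    found = []
    for child in label:
        found.extend(leaves_below(child))
    return found


def lineages_at(node, threshold, depth=0.0):
    """Return one list of leaves per branch that crosses the threshold."""
    children, length = node
    end = depth + length  # distance from the root at the end of this branch
    if end >= threshold:
        return [leaves_below(node)]
    if isinstance(children, str):
        return []  # assumption: a leaf ending before the cut forms no lineage
    lineages = []
    for child in children:
        lineages.extend(lineages_at(child, threshold, end))
    return lineages


# Toy rooted tree: ((s1:0.2,s2:0.3):0.1,(s3:0.1,s4:0.1):0.4)
tree = (
    [
        ([("s1", 0.2), ("s2", 0.3)], 0.1),
        ([("s3", 0.1), ("s4", 0.1)], 0.4),
    ],
    0.0,
)

near_root = lineages_at(tree, 0.05)  # [['s1', 's2'], ['s3', 's4']]
far_from_root = lineages_at(tree, 0.25)  # [['s1'], ['s2'], ['s3', 's4']]
```

Moving the threshold away from the root splits the first clade into its individual sequences, which matches the pattern of few major lineages near the root and many minor ones farther out.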


Lineage-Specific Analysis  "Minimum Descendants" Nodes with very few descendants often appear unevenly distributed among environments solely because of sampling error and are thus not usually significant. It is best to exclude these nodes to limit the number of comparisons that are made, so that uninformative tests do not artificially inflate the P-values (all P-values are multiplied by the number of comparisons made). The Minimum Descendants threshold allows you to exclude nodes with few descendants from the analysis.

Lineage-Specific Analysis  Output format The output consists of both a table and a tree. The table has a row for each environment in each evaluated lineage/node. The table has 5 columns:

Node: The name of the node. The nodes are named arbitrarily but can be viewed in the tree, which is shown below the table. Mousing over the colored bar corresponding to the node in the tree shows the name and P-value of the node.
P Value: The P-value for the node. A low P-value indicates that the distribution of sequences between environments in the node differs from random expectations. The P-values are colored based on their value and the corresponding node is colored the same in the tree. P-values < 0.001 are red, < 0.01 are yellow, < 0.05 are green, < 0.1 are blue, and > 0.1 are gray.
Environment Name: There is one row for each node for each environment in the tree.
Observed: The number of sequences observed in that node from that environment.
Expected: The number of sequences that would be expected in the node if the sequences were evenly distributed in the different lineages.

An example of the table section of the output is shown below.


Select Analysis  Cluster Environments The Cluster Environments option can be used to determine which environments in the tree have similar microbial communities. It performs hierarchical clustering (using the UPGMA algorithm) on the environments based on a distance matrix that is generated by calculating pairwise UniFrac values (see illustration below). It only makes sense to run this analysis on input trees that have sequences from at least three environments.
Clustering environments with UniFrac The tree on the left has sequences from 3 different environments colored red, yellow, and blue. UniFrac values are calculated for all possible pairs of environments to create a distance matrix. This matrix is used to cluster the environments using UPGMA (Unweighted Pair Group Method with Arithmetic Mean). The resulting cluster illustrates that the red and yellow environments are more similar to each other than either is to the blue environment. 
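The clustering step can be sketched in plain Python. This is a minimal UPGMA illustration, assuming the pairwise UniFrac values are already available; the function name `upgma`, the nested-tuple output, and the distance values are made up for this example.

```python
def upgma(pairwise):
    """Cluster environments by UPGMA and return the result as nested tuples.

    pairwise: {(env_a, env_b): distance} with one entry per unordered pair.
    """
    d = {frozenset(pair): value for pair, value in pairwise.items()}
    size = {name: 1 for pair in pairwise for name in pair}
    while len(size) > 1:
        closest = min(d, key=d.get)  # pair of clusters with smallest distance
        a, b = sorted(closest, key=str)
        merged = (a, b)
        for other in size:
            if other == a or other == b:
                continue
            # Average linkage: size-weighted mean of the two old distances.
            d[frozenset((merged, other))] = (
                size[a] * d[frozenset((a, other))]
                + size[b] * d[frozenset((b, other))]
            ) / (size[a] + size[b])
        # Drop every distance that still refers to the two merged clusters.
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size))


# Hypothetical pairwise UniFrac values for three environments.
distances = {
    ("red", "yellow"): 0.3,
    ("red", "blue"): 0.8,
    ("yellow", "blue"): 0.7,
}
cluster = upgma(distances)  # (('red', 'yellow'), 'blue')
```

With these values red and yellow are joined first, and blue joins the pair at the average of its two distances, 0.75.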

Cluster Environments  "Use abundance weights" If "no", the standard unweighted UniFrac algorithm is used to perform the calculations. If "yes", the weighted UniFrac algorithm is used; with "Normalized", a normalization step is performed in the weighted UniFrac value calculation, and with "Non-normalized", it is not.

Select Analysis  Cluster Environments The output of Cluster Environments is a text representation of the tree with the environments shown on it. Similar environments will appear clustered together in the tree. The tree contains the environments, not the sequences, and is drawn so that the root is on the left and the environments are on the right. The scale bar shows the distance between clusters in UniFrac units: a distance of 0 means that two environments are identical, and a distance of 0.5 means that two environments contain mutually exclusive lineages. A Newick string describing the cluster is also returned that can be used to view the cluster in other programs such as TreeView.
Note that the method for clustering the environments is independent of the method used to build the phylogenetic tree of sequences: typically, you will want to use an algorithm such as neighbor-joining, likelihood, parsimony, or Bayesian inference to build your phylogenetic tree. Hierarchical clustering is appropriate here because we are looking purely at distances between environments, and are not claiming that one environment evolved from another.
Select Analysis  Jackknife Environment Clusters Jackknife Environment Clusters performs jackknife analysis of environment clusters produced with the methods described for the Cluster Environments Analysis option. It can be used to determine how robust the analysis is to both sample size and evenness by assessing how often the cluster nodes are recovered when smaller, even sets of sequences are sampled from the environments. Jackknifing is a statistical resampling technique that asks whether you would see the same results using only a random sample of part of the data. To perform this analysis, we randomly sample a specified number of sequences from each environment and recluster the data. We do this 10, 100, or 1000 times depending on what is set for "Number of Permutations" and calculate the fraction of the times that each node in the cluster was recovered. 

Jackknife Environment Clusters  "Minimum sequences to keep" Minimum sequences to keep specifies the number of sequences that are kept from each environment in the tree during each jackknife replicate. The sequences are randomly selected for inclusion during each replicate. If an environment contains fewer than the specified number of sequences, all the sequences from that environment are removed, resulting in exclusion of that environment from the cluster output. Because of this, it is often best to select a number that is less than or equal to the minimum number of sequences in any one environment. This maximum recommended value is automatically set as the default. You can view the number of sequences in each environment with the Environment Counts Analysis option. The most reasonable value to use is about 75% of the smallest sample you want to include in the analysis, but in practice people usually just use the size of the smallest sample (despite the fact that there is then no resampling performed on that sample).
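The sampling rule for one jackknife replicate can be sketched as follows. Only the subsampling is shown; reclustering each replicate and counting how often each node is recovered would be built on top of it. The function name `jackknife_sample` and the toy data are hypothetical.

```python
import random


def jackknife_sample(seqs_by_env, keep, rng):
    """Draw one replicate: `keep` random sequences from each environment.

    Environments with fewer than `keep` sequences are dropped entirely.
    """
    sample = {}
    for env, seqs in seqs_by_env.items():
        if len(seqs) < keep:
            continue  # too few sequences: environment excluded
        sample[env] = rng.sample(sorted(seqs), keep)
    return sample


seqs_by_env = {
    "ENV1": ["a", "b", "c", "d"],
    "ENV2": ["e", "f", "g"],
    "ENV3": ["h", "i"],
}
replicate = jackknife_sample(seqs_by_env, keep=3, rng=random.Random(0))
```

With `keep=3`, ENV3 is excluded because it has only two sequences, and ENV2 is always sampled in full, which is the "no resampling on the smallest sample" situation mentioned above.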

Jackknife Environment Clusters  "Number of Permutations" "Number of Permutations" specifies the number of jackknife replicates to perform (i.e. the number of times to randomly resample the sequences and redo the environment clustering when calculating the support for each cluster node). This number is limited for guest users.

Jackknife Environment Clusters  "Use abundance weights" If "no", the standard unweighted UniFrac algorithm is used to perform the calculations. If "yes", the weighted UniFrac algorithm is used; with "Normalized", a normalization step is performed in the weighted UniFrac value calculation, and with "Non-normalized", it is not.

Jackknife Environment Clusters  Output The output is the same type of tree used in Cluster Environments, except that nodes are colored according to the fraction of the random samples that they were recovered in. Move your mouse over the internal nodes to get the jackknife support.


PCA The PCA option performs Principal Coordinates Analysis. PCA is a multivariate statistical technique for finding the most important axes along which your samples vary. It can also be used to find clusters. PCA calculates the distance matrix for each pair of environments using the UniFrac metric as described for Cluster Environments. It then turns these distances into points in a space with a number of dimensions one less than the number of samples. The principal components, in descending order, describe how much of the variation each of the axes in this new space explains. The first principal component separates out the data as much as possible, the second principal component provides the next most separation, and so forth. 
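The step that turns distances into coordinates can be sketched without external libraries. The example below assumes classical principal coordinates analysis via Gower double-centering of the squared distances, and it extracts only the first axis using power iteration to stay dependency-free; a full analysis would use a complete eigendecomposition. The function name `first_principal_coordinate` and the toy distances are hypothetical.

```python
import math


def first_principal_coordinate(dist, iterations=200):
    """Return (coordinates on the first axis, fraction of variation explained).

    dist: symmetric matrix (list of lists) of pairwise distances.
    """
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]
    grand = sum(row) / n
    # Gower double-centering of the squared distance matrix.
    b = [[-0.5 * (sq[i][j] - row[i] - row[j] + grand) for j in range(n)]
         for i in range(n)]
    v = [float(i + 1) for i in range(n)]  # arbitrary starting vector
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return [0.0] * n, 0.0  # all samples coincide
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(b[i][j] * v[j] for j in range(n))
                     for i in range(n))
    trace = sum(b[i][i] for i in range(n))
    coords = [x * math.sqrt(max(eigenvalue, 0.0)) for x in v]
    explained = eigenvalue / trace if trace else 0.0
    return coords, explained


# Three environments lying on a line: A-B = 1, B-C = 1, A-C = 2.
dist = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
]
coords, explained = first_principal_coordinate(dist)
```

Because the three toy environments lie on a straight line, the first axis recovers their spacing and accounts for essentially all of the variation.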

There are three options for the output format:

Scatter Plots: Will plot the following principal components against each other for each environment: P1 vs P2, P1 vs P3, and P2 vs P3. The names of the two principal components that contribute to the graph are written in the title. Environments with similar names will be assigned the same color, providing a convenient way to group environments visually by changing the names in the file you upload. If "Assign series by: underscore" is selected, environments with the same text before an underscore will have the same color, and if "Assign series by: first letter" is selected, environments starting with the same first letter will have the same color. Environments that have similar values of a particular principal coordinate will appear close together on the corresponding axis. Each axis indicates the fraction of the variance in the data that the axis accounts for. Larger values mean that the axis explains more of the data; values sum to 100% for all principal components. It is fairly common for the first principal component to explain only a few percent of the variance when many diverse factors affect the environment. Mouse over each point to see the name of the environment it corresponds to. A PDF file of the scatterplot can be saved to your local computer with the "Download Figure" link.
Scatter Plots  Text Labels: Same as above except that the environments are represented in the scatterplot by text labels rather than colored symbols. If "Assign series by: underscore" is selected, environments are represented by the text before the underscore, and if "Assign series by: first letter" is selected, environments will be represented by the first letter of the name. A PDF file of the scatterplot can be saved to your local computer with the "Download Figure" link.

Raw data: Raw data returns a table containing the values of each principal component for each environment. Each row corresponds to an environment. Each column corresponds to a principal component, in descending order of their contribution to the overall variation. The eigenvalue for each principal component and the percentage of the variation that the component describes are also reported. The "download data" link downloads an Excel spreadsheet file for further analysis.


If "no", the standard unweighted UniFrac algorithm is used to perform the calculations. If "yes", the weighted UniFrac algorithm is used; with "Normalized", a normalization step is performed in the weighted UniFrac value calculation, and with "Non-normalized", it is not.

Select Analysis  Environment Counts The Environment Counts option produces a table counting the number of taxa in each environment.
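As a rough sketch of what such a table contains, the snippet below counts distinct sequence names per environment from (sequence, environment) pairs like those in the environment file example. Whether the interface counts distinct taxa or abundance-weighted observations is not stated here, so counting distinct names is an assumption, as is the function name `environment_counts`.

```python
from collections import Counter


def environment_counts(assignments):
    """Count distinct sequences per environment.

    assignments: iterable of (sequence, environment) pairs.
    """
    return Counter(env for _seq, env in set(assignments))


assignments = [
    ("SEQ1", "ENV1"), ("SEQ1", "ENV2"), ("SEQ2", "ENV1"), ("SEQ3", "ENV1"),
    ("SEQ4", "ENV2"), ("SEQ5", "ENV1"), ("SEQ6", "ENV3"), ("SEQ6", "ENV2"),
]
table = environment_counts(assignments)
```

For these assignments ENV1 holds four sequences, ENV2 three, and ENV3 one.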


Select Analysis  Environment Distance Matrix The Environment Distance Matrix option calculates the distance between each pair of environments using UniFrac as described for Cluster Environments. This matrix is used as input in both the Cluster Environments and the PCA Analysis options and can be used to cluster the environments using different algorithms outside of the UniFrac environment if desired. The output is a matrix that compares each environment to each other environment. The quartiles of UniFrac values have different colors (red, yellow, green, and blue, from the lower to the upper quartile), allowing visual comparison of similarities and distances.

