Fast UniFrac Help

 

Introduction


Fast UniFrac is a new version of UniFrac that is specifically designed to handle very large datasets. Like UniFrac, Fast UniFrac provides a suite of tools for the comparison of microbial communities using phylogenetic information. It takes as input a rooted phylogenetic tree that contains sequences derived from at least three different environmental samples, a file describing which sequences came from which sample, and a file describing attributes of each sample.

Both the UniFrac distance metric (Lozupone C. and R. Knight. 2005. Appl. Environ. Microbiol. 71:8225-35) and the P test (Martin, A. 2002. Appl. Environ. Microbiol. 68:3673-82) can be used to make comparisons. Both of these techniques bypass the need to choose operational taxonomic units (OTUs) based on sequence divergence prior to analysis, although for large datasets, it is recommended to select OTUs prior to generating the phylogenetic tree for input, as this greatly reduces compute time and does not affect the results. The UniFrac distance is calculated as the percent of branch length leading to descendants from only one of the environments represented in the phylogenetic tree, and reflects differences in the phylogenetic lineages that are adapted to live in one environment versus the others. Although the UniFrac metric and the P test have primarily been applied to trees made with 16S ribosomal RNA gene sequences, phylogenetic trees from any kind of gene can be analyzed.

The Fast UniFrac interface allows you to:

  • Make pairwise comparisons between all samples represented in the input phylogenetic tree to determine if they are significantly different.
  • Compare sequences from many samples simultaneously using hierarchical clustering and/or principal coordinates analysis (PCoA) to determine whether there are environmental factors (such as temperature or salinity) that group communities together.
  • Determine whether the samples were sampled sufficiently to support cluster nodes using a sequence jackknifing technique.
  • Easily visualize the differences between samples graphically, with support for three dimensional exploration of datasets and with multiple subcategory coloring.

 
Why would I want to use Fast UniFrac?

You should use Fast UniFrac if you have sequence data that you can put into a tree, you want to measure the differences in community composition between environmental samples, and you want to take into account the different evolutionary relationships among taxa. You should not use Fast UniFrac if you want a measure of the diversity in each sample, or if you want to perform an OTU-based analysis.

The Fast UniFrac interface has analyses that are optimal for comparing sequences from only a couple of different samples, and analyses that are optimal for comparing sequences from many different samples simultaneously.

If your dataset contains sequences from only a few different samples, try out UniFrac Significance or the P Test Significance. These are designed to determine whether the samples are significantly different from each other. These analyses do not work well on datasets containing sequences from many different samples because 1) it can be difficult to interpret what the results mean (e.g. a significant difference between the samples in a tree from 10 different environments would not tell you which samples are contributing to this difference) and 2) it can be difficult to get significant p-values after correcting for multiple comparisons when pairwise comparisons between all of the samples are made (all p-values are multiplied by the number of comparisons made).

If your dataset contains sequences from more than 3 different samples, try out Cluster Samples, Jackknife Sample Clusters, or PCoA. These techniques are designed to cluster samples based on the sequences they contain and can tell you which samples have related communities.

 

How to compress uploaded files


Before uploading files to Fast UniFrac, they must first be compressed using zip, gzip, or bzip2. This is because the large trees and sample ID mapping files that Fast UniFrac has been designed to support can grow very large and can take a significant amount of time to upload to our servers when uncompressed, particularly if your uplink speed is slow.

If you are using a Mac or Unix operating system, you can simply type one of the following commands in a terminal (assuming your filename is "my_file.tre"):

To create a zip file: (This will create a file called "my_file.tre.zip")

zip my_file.tre.zip my_file.tre

To create a gzip file: (This will create a file called "my_file.tre.gz")

gzip my_file.tre

To create a bzip2 file: (This will create a file called "my_file.tre.bz2")

bzip2 my_file.tre

If you are using Windows, you may want to consider one of the many free compression utilities, or buy one like the popular WinZip. For example, to compress your file using WinZip, right-click the file, select the "WinZip" menu, and then choose "Add to my_file.tre.zip".
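
On any platform with Python installed, the compression can also be scripted. The following is a minimal sketch using only the Python standard library; the helper name gzip_file is hypothetical and is not part of Fast UniFrac. Unlike the gzip command, it leaves the original file in place.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(path):
    """Write a gzip-compressed copy of `path` next to it and return the new path."""
    src = Path(path)
    dst = src.with_name(src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

Calling gzip_file("my_file.tre") would create "my_file.tre.gz" in the same directory.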

 

Upload tree file or select reference tree


Fast UniFrac allows you to either upload your compressed tree file, or select a reference tree from the drop down menu.

The uploaded tree must contain sequences from at least three different samples and can be generated by programs such as Arb, Phylip, PAUP*, MrBayes, RAxML, or Clustal W. The tree should be rooted; loading an unrooted tree will usually result in the assignment of an arbitrary root. The quality of the Fast UniFrac analysis results depends on the quality of the input tree.

NEWICK format:

NEWICK is a standard format that is recognized by most programs that generate or allow visualization of phylogenetic trees including PHYLIP, TREE-PUZZLE, ARB, and TREEVIEW. Here is an example of a tree in NEWICK format:

(((Sequence.1:0.02120,(Sequence.2:0.09111,Sequence.3:0.04491)node1:0.00097)node2:0.00194, (Sequence.4:0.03160,Sequence.5:0.04378)node3:0.00365)node4:0.00188,Sequence.6:0.00881)node5:0.00739;
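
Before uploading, it can be useful to list the sequence IDs in a tree so that they can be checked against the sample ID mapping file. The following Python sketch is one rough way to do that; it assumes unquoted labels, and the function name leaf_names is illustrative only. It relies on the fact that in NEWICK a tip label directly follows "(" or ",", whereas an internal node label follows ")".

```python
import re

NEWICK = ("(((Sequence.1:0.02120,(Sequence.2:0.09111,Sequence.3:0.04491)node1:0.00097)"
          "node2:0.00194, (Sequence.4:0.03160,Sequence.5:0.04378)node3:0.00365)"
          "node4:0.00188,Sequence.6:0.00881)node5:0.00739;")


def leaf_names(newick):
    """Return tip labels: unquoted names that directly follow '(' or ','."""
    return re.findall(r"[(,]\s*([^\s(),:;]+)", newick)


print(leaf_names(NEWICK))
```

Quoted labels or labels containing spaces would need a real NEWICK parser.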

Reference trees:

You may select a reference tree from the drop down menu if you are using PhyloChip data (e.g. processed via PhyloTrac) or if you have mapped your sequence to the GreenGenes coreset tree we provide in the tutorial.

 
Upload sample ID mapping file

This file maps each sequence ID in the tree file to the sample ID that each sequence came from. Each line in the sample ID mapping file has a sequence ID and a sample ID separated by a tab. If the same sequence was found in more than one sample, it should be represented by more than one line. Here is an example of a sample ID mapping file:

Sequence.1        Sample.1
Sequence.1        Sample.2
Sequence.2        Sample.1
Sequence.3        Sample.1
Sequence.4        Sample.2
Sequence.5        Sample.1
Sequence.6        Sample.3
Sequence.6        Sample.2
...

Optionally, the sample ID mapping file can contain a third column containing the number of times that the sequence was observed in each sample. This data can come from RFLP data or OTU counts. This type of information is used when applying weighted UniFrac, which accounts for sequence abundance when comparing diversity. Here is an example of a sample ID mapping file that includes abundance information:

Sequence.1        Sample.1        1
Sequence.1        Sample.2        2
Sequence.2        Sample.1        15
Sequence.3        Sample.1        2
Sequence.4        Sample.2        8
Sequence.5        Sample.1        4
Sequence.6        Sample.3        1        
Sequence.6        Sample.2        1
...
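
Both layouts can be read with a few lines of Python. This is only a sketch of how such a file might be parsed, not the parser Fast UniFrac uses; the function name parse_sample_map is hypothetical, and treating a missing third column as a count of 1 is an assumption.

```python
from collections import defaultdict


def parse_sample_map(lines):
    """Map each sequence ID to {sample ID: count}.

    Assumption: when the optional third column is absent, the count is 1.
    """
    counts = defaultdict(dict)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = [f.strip() for f in line.rstrip("\n").split("\t")]
        if len(fields) < 2:
            raise ValueError(f"expected at least two tab-separated fields: {line!r}")
        seq_id, sample_id = fields[0], fields[1]
        count = int(fields[2]) if len(fields) > 2 and fields[2] else 1
        counts[seq_id][sample_id] = counts[seq_id].get(sample_id, 0) + count
    return dict(counts)
```

For example, the lines "Sequence.1<TAB>Sample.1<TAB>1" and "Sequence.1<TAB>Sample.2<TAB>2" would yield {"Sequence.1": {"Sample.1": 1, "Sample.2": 2}}.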
 

Upload category mapping file


The category mapping file contains additional metadata about the samples, and is a table relating each sample to parameters you have measured such as temperature, pH, etc., as well as sample descriptions. People usually prepare this mapping file using Excel, but it is important to save it in plain text format (tab-delimited) and not as an Excel document.

This file can be autogenerated but it is highly recommended that you generate one that is meaningful for the variation you plan to examine in your studies. For example, if you were studying the effects of diet on the gut communities of conventional and humanized mice, you might want one column indicating whether the sample was from a conventional or a humanized mouse, another column indicating whether the mouse was on a chow diet or a high-fat diet, another column containing the combination of these two columns (i.e. diet and humanized/conventional), etc.

The file format is tab-delimited text. The first line is a header line that must start with a "#" character.

Optionally, a general description of the input files can be included in the lines immediately following the header line that start with a "#". This description will be included in the upload and results screens so that relevant information can be easily accessed.

The first column must be named SampleID and must contain unique (short, meaningful) sample IDs containing only alphanumeric characters (with the exception of the ".", "+", and "#" characters, which are also allowed).

The second through the (n-1)th columns are subcategories. These can be anything (random assignment if you want), but each subcategory should have a small number of distinct values (<= the number of samples). There must be at least two unique values for each category.

The last column must be named "Description" and contains the short descriptions for the samples.

Example category mapping file format:

#SampleID EnvType       FreelivingGut  ... Description
# General description of analysis line 1 (optional)
# General description of analysis line 2 (optional)
# ...
Tg#1249   TermiteGut    Gut            ... Whole gut of the wood-feeding termite
Tg#1251   TermiteGut    Gut            ... Whole gut of the fungus-growing termite Macrotermes gilvus
Nso#65    Soil          Freeliving     ... Uncultivated agricultural soil in Wisconsin
Nso#1209  Soil          Freeliving     ... Soil from a fertilized Switzerland plot in the DOK.
Vg#h#111  VertebrateGut Gut            ... Feces from Angolan Colobus Monkey from the St Louis Zoo.
...
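
The rules above can be checked before upload. The Python sketch below is an illustration only: the function name check_category_map is hypothetical, and it assumes that the optional description lines are the consecutive "#" lines directly after the header. It does not check which characters appear in the sample IDs.

```python
def check_category_map(text):
    """Return a list of problems found in a tab-delimited category mapping."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith("#"):
        return ["header line must start with '#'"]
    problems = []
    header = lines[0][1:].split("\t")
    if header[0] != "SampleID":
        problems.append("first column must be named SampleID")
    if header[-1] != "Description":
        problems.append("last column must be named Description")
    # Skip the optional general description lines that follow the header.
    start = 1
    while start < len(lines) and lines[start].startswith("#"):
        start += 1
    rows = [ln.split("\t") for ln in lines[start:]]
    ids = [row[0] for row in rows]
    if len(set(ids)) != len(ids):
        problems.append("sample IDs must be unique")
    # Every subcategory column needs at least two distinct values.
    for col, name in enumerate(header[1:-1], start=1):
        values = {row[col] for row in rows if len(row) > col}
        if len(values) < 2:
            problems.append(f"category {name} needs at least two distinct values")
    return problems
```

An empty list means that none of these particular checks failed.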
 
Display Tree

Select Yes if you want to display a text version of the tree you upload (the default is No). The sequences in the tree will be labeled with information on the sample to which they were assigned and the number of times they were observed in that sample. However, please note that the tree is automatically suppressed when it contains more than 2500 sequences.

 
Text Tree Format

During the upload, the Fast UniFrac interface removes from the tree any sequences that were not assigned to any samples in the sample ID mapping file. The number of sequences that remain in the tree after this pruning step is reported, along with any warnings (in red) about your tree or mapping file that you should try and resolve. A text representation of the tree may be drawn (if the Display tree option was selected during upload) where each dash represents 1 unit of branch length. The leaves in the tree are labeled with the sequence ID, a list of samples to which the sequence was assigned (in blue), and the number of times that the sequence was observed in each sample (in red).

 

Select analysis - Summary


Fast UniFrac can perform several different phylogenetic analyses on the tree and mapping files you've uploaded.

  • Cluster Samples uses the UniFrac metric to cluster the samples based on the phylogenetic lineages they contain.
  • PCoA uses the UniFrac metric to perform principal coordinates analysis on the samples, allowing you to see whether different types of samples are separated in different dimensions.
  • P Test Significance tells you which pairs of samples are significantly different using the P test.
  • UniFrac Significance tells you which pairs of samples are significantly different using the UniFrac significance test.
  • Sample Counts tells you how many sequences are in each sample.
  • Sample Distance Matrix shows you the UniFrac distances between each pair of samples. This distance matrix is what is used as input for sample clustering and PCoA.
  • Jackknife Sample Clusters performs statistical resampling to tell you which clusters you can be confident of.

    Select analysis - "Cluster Samples"


    The Cluster Samples option can be used to determine which samples in the tree have similar microbial communities. It performs hierarchical clustering (using the UPGMA algorithm) on the samples based on a distance matrix that is generated by calculating pairwise UniFrac values (see illustration below). It only makes sense to run this analysis on input trees that have sequences from at least three samples.

    Clustering samples with Fast UniFrac. The tree on the left has sequences from 3 different samples colored red, yellow, and blue. UniFrac values are calculated for all possible pairs of samples to create a distance matrix. This matrix is used to cluster the samples using UPGMA (Unweighted Pair Group Method with Arithmetic Mean). The resulting cluster illustrates that the red and yellow samples are more similar to each other than either is to the blue sample.
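
The UPGMA step can be sketched in a few lines of Python. This is a generic textbook version written for illustration, not Fast UniFrac's own code; the function name upgma and the nested-tuple output are assumptions, and the distances in the usage note are made up.

```python
def upgma(names, dist):
    """Cluster by UPGMA; dist[i][j] is the distance between names[i] and names[j].

    Returns a nested tuple describing the order in which clusters were merged.
    """
    # Each active cluster is stored as (subtree, number of members).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(names)
    while len(clusters) > 1:
        # Merge the closest pair of active clusters.
        (a, b), _ = min(d.items(), key=lambda kv: kv[1])
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        # The distance from the merged cluster to any other cluster is the
        # size-weighted mean of the two old distances.
        for c in clusters:
            d_ac = d.pop((min(a, c), max(a, c)))
            d_bc = d.pop((min(b, c), max(b, c)))
            d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        del d[(a, b)]
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]
```

With three samples where red and yellow are closest, for example upgma(["red", "yellow", "blue"], [[0, 0.2, 0.8], [0.2, 0, 0.7], [0.8, 0.7, 0]]), red and yellow are joined first and blue is attached last.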

     

    Cluster Samples - "Use abundance weights"


    When the No option is selected, the standard unweighted UniFrac algorithm is used to perform the calculations. If the Yes option is selected, the weighted UniFrac algorithm is used: with the weighted sub-option Normalized, a normalization step is performed in the weighted UniFrac value calculation, whereas with the Non-normalized option this normalization is not applied.

    Cluster Samples - Output format


    The output of Cluster Samples is a text representation of the tree with the samples shown on it. Similar samples will appear clustered together in the tree. The tree contains the samples, not the sequences, and is drawn so that the root is on the left and the samples are on the right. The scale bar shows the distance between clusters in UniFrac units: a distance of 0 means that two samples are identical, and a distance of 0.5 means that two samples contain mutually exclusive lineages. The tree in Newick format can be downloaded using the link under the tree (this file can be used to view the clustering using other programs such as TreeView).

    Note that the method for clustering the samples is independent of the method used to build the phylogenetic tree of sequences: typically, you will want to use an algorithm such as neighbor-joining, likelihood, parsimony, or Bayesian inference to build your phylogenetic tree. Hierarchical clustering is appropriate here because we are looking purely at distances between samples, and are not claiming that one sample evolved from another.

     

    Select analysis - "PCoA"



    The PCoA option performs Principal Coordinates Analysis. PCoA is a multivariate statistical technique for finding the most important axes along which your samples vary. It can also be used to find clusters.

    PCoA calculates the distance matrix for each pair of samples using the UniFrac metric as described for Cluster Samples. It then turns these distances into points in a space with a number of dimensions one less than the number of samples. The principal components, in descending order, describe how much of the variation each of the axes in this new space explains. The first principal component separates out the data as much as possible, the second principal component provides the next most separation, and so forth.
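
To make the procedure concrete, here is a pure-Python sketch of classical PCoA restricted to the first axis. It is an illustration under stated assumptions, not the Fast UniFrac implementation: the function name pcoa_first_axis is hypothetical, power iteration stands in for a full eigendecomposition, and the input is assumed to have at least two distinct samples and to behave like Euclidean distances (no negative eigenvalues).

```python
import math


def pcoa_first_axis(dist, iterations=200):
    """Return (coordinates on the first axis, fraction of variance explained)."""
    n = len(dist)
    # Gower double-centering of the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]
    total = sum(row) / n
    b = [[-0.5 * (sq[i][j] - row[i] - row[j] + total) for j in range(n)]
         for i in range(n)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
    coords = [x * math.sqrt(eigenvalue) for x in v]
    # The trace of B equals the sum of all eigenvalues.
    fraction = eigenvalue / sum(b[i][i] for i in range(n))
    return coords, fraction
```

For three samples that lie on a single gradient (pairwise distances 1, 2 and 3), the first axis recovers their spacing and accounts for essentially all of the variance.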

     

    PCoA - "Output format"


    There are two options for the PCoA output format: Scatter Plots and Scatter Plots - Text Labels

    The Scatter Plots option will plot the following principal components against each other for each sample: P1 vs P2, P1 vs P3, and P2 vs P3. The names of the two principal components that contribute to each graph are written in the title. Samples are binned according to the user-selected option. When the Color sample by: first "n" letters option is selected, samples starting with the same first "n" letters will have the same color (by default, samples are binned according to the first 3 letters in the sample ID). When the Color samples by: first hash option is selected, samples are binned according to the text in the sample ID that comes before the first hash (#) character (note that other characters can be used for binning purposes).

    Samples that have similar values of a particular principal coordinate will appear close together on the corresponding axis. Each axis indicates the fraction of the variance in the data that the axis accounts for. Larger values mean that the axis explains more of the variance in the data; values sum to 100% for all principal components. It is fairly common for the first principal component to explain a small percentage of the variance when many diverse factors affect the samples.

    In the lower right hand corner you can see the scree plot (fraction of variance explained by each axis in red, cumulative variance in blue). This plot can tell you how many axes are likely to be important and help determine how many "real" underlying gradients there might be in your data as well as their relative "strength".

    Move your mouse over each point in the PCoA plots to see the sample ID and description (if you have provided this information in the category mapping file).

    Several files can be downloaded from this screen: (1) an EPS version of each scatterplot can be saved to your local computer using the Download Figure link, (2) a Kinemage file containing the multidimensional PCoA data used by the KiNG Applet using the download kin file link, and (3) a tab-delimited text file containing the raw PCoA data using the link provided below the figures.

    The Scatter Plots - Text Labels option works the same as above except that the individual samples are represented in the scatterplot by text labels rather than colored symbols. So depending on how you've chosen to bin your samples, you might see the first three letters of each sample plotted, instead of a colored point.

    PCoA - "Use abundance weights"
    When the No option is selected, the standard unweighted UniFrac algorithm is used to perform the calculations. If the Yes option is selected, the weighted UniFrac algorithm is used: with the weighted sub-option Normalized, a normalization step is performed in the weighted UniFrac value calculation, whereas with the Non-normalized option this normalization is not applied.
     

    Select analysis - "UniFrac Significance"


    Calculating the UniFrac metric:

    The majority of options in the Fast UniFrac interface make comparisons based on the UniFrac metric. The UniFrac metric measures the difference between two samples in terms of the branch length that is unique to one sample or the other. In the tree on the right (panel C below), the division between the two samples (labeled red and blue) occurs very early in the tree, so that all of the branch length is unique to one sample or the other. This results in the maximum UniFrac distance possible, 1.0. In the tree on the left, every sequence in the first sample has a very similar counterpart in the other sample, and all of the branch length in the tree comes from nodes that have descendants in both samples. This results in the minimum UniFrac distance of 0.0. In the middle example, there is about as much branch length unique to each sample (red or blue) as is shared between samples (purple), so the UniFrac distance would be about 0.5.
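
The calculation described above reduces to a ratio of branch lengths. In the Python sketch below, the input layout (one (length, samples below) pair per branch) and the function name unweighted_unifrac are assumptions made for illustration; building that list from a real tree is left out. The sketch also assumes that at least one branch leads to a sequence from one of the two samples.

```python
def unweighted_unifrac(branches, a, b):
    """UniFrac distance between samples a and b.

    `branches` is a list of (length, samples_below) pairs, one per branch,
    where samples_below is the set of samples with a descendant of the branch.
    """
    unique = shared = 0.0
    for length, samples in branches:
        in_a, in_b = a in samples, b in samples
        if in_a and in_b:
            shared += length   # leads to both samples
        elif in_a or in_b:
            unique += length   # leads to only one of the two samples
        # branches leading to neither sample are ignored
    return unique / (unique + shared)
```

With one unit of branch length unique to red, one unit unique to blue, and two units shared, the distance is 2 / 4 = 0.5, matching the middle example.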

    Calculating the weighted UniFrac metric:

    The UniFrac metric described above does not account for the relative abundance of sequences in the different samples because duplicate sequences contribute no additional branch length to the tree (by definition, the branch length that separates a pair of duplicate sequences is zero, because no substitutions separate them). Because the relative abundance of different kinds of bacteria can be critical for describing community changes, we have developed a variant of the algorithm, weighted UniFrac, which weights the branches based on abundance information during the calculations. Weighted UniFrac can thus detect changes in how many organisms from each lineage are present, as well as detecting changes in which organisms are present.

    The first step in the algorithm is to calculate the raw weighted UniFrac value (u), according to the equation below:

    Here, n is the total number of branches in the tree, bi is the length of branch i, Ai and Bi are the number of descendants of branch i from communities A and B respectively, and AT and BT are the total number of sequences from communities A and B respectively. In order to control for unequal sampling effort, Ai and Bi are divided by AT and BT.
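
The equation itself is not reproduced in this text. Based on the symbol definitions above, it presumably takes the form u = sum over i = 1..n of b_i * |A_i/A_T - B_i/B_T|, as in the published description of weighted UniFrac. A minimal Python sketch under that assumption follows; the function name and the (b_i, A_i, B_i) input layout are illustrative and are not Fast UniFrac's actual code.

```python
def raw_weighted_unifrac(branches, a_total, b_total):
    """Raw weighted UniFrac value u.

    `branches` is a list of (b_i, A_i, B_i) triples: branch length and the
    number of descendants of that branch from communities A and B.
    """
    return sum(length * abs(a_i / a_total - b_i / b_total)
               for length, a_i, b_i in branches)
```

A branch whose descendants are split in the same proportions as the two community totals contributes nothing, because A_i/A_T equals B_i/B_T for it.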

    The figure below illustrates how the weighted UniFrac algorithm works. Branch lengths are weighted by the relative abundance of sequences in the square and circle communities; square sequences are weighted twice as much as circle sequences because there are twice as many total circle sequences in the dataset. The width of branches is proportional to the degree to which each branch is weighted in the calculations and the grey branches have no weight. Branches 1 and 2 have heavy weights since the descendants are biased towards the square and circles respectively. Branch 3 contributes no value since it has an equal contribution from circle and square sequences after normalization for different sample sizes.

    Normalizing weighted UniFrac values:

    If the phylogenetic tree is not ultrametric (i.e. if different sequences in the sample have evolved at different rates), comparing samples with the Cluster Samples or PCoA analysis options using weighted UniFrac will place more emphasis on communities that contain taxa that have evolved more quickly. This is because these taxa contribute more branch length to the tree. In some situations, it may be desirable to normalize the branch lengths within each sample. This normalization has the effect of treating each sample equally instead of treating each unit of branch equally: the issues involved are similar to those involved in performing multivariate analyses using the correlation matrix, to treat each variable equally independent of scale, or using the covariance matrix, to take the scale into account. Normalization has the additional effect of placing all pairwise comparisons on the same scale as unweighted UniFrac (0 for identical communities, 1 for non-overlapping communities), allowing comparisons among different analyses with different samples. The scale of the raw weighted UniFrac value (u) depends on the average distance of each sequence from the root. The normalization to correct for this effect is performed by dividing u by the distance scale factor D (see equation below), which is the average distance of each sequence from the root weighted by the number of times each sequence was observed in each community.

    Here dj is the distance of sequence j from the root, Aj and Bj are the number of times the sequences were observed in communities A and B respectively, and AT and BT are the total number of sequences from communities A and B respectively.
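
The equation for D is likewise not reproduced in this text. From the definitions above it presumably reads D = sum over j of d_j * (A_j/A_T + B_j/B_T), which matches the published description. The Python sketch below assumes that form; the function name and the (d_j, A_j, B_j) input layout are illustrative only.

```python
def normalized_weighted_unifrac(u, tips, a_total, b_total):
    """Divide the raw value u by the distance scale factor D.

    `tips` is a list of (d_j, A_j, B_j) triples: distance of sequence j from
    the root and the number of times it was observed in communities A and B.
    """
    scale = sum(d_j * (a_j / a_total + b_j / b_total) for d_j, a_j, b_j in tips)
    return u / scale
```

As a sanity check, take two sequences that each sit one unit from the root, one seen only in A and one seen only in B. Then u = 2 and D = 2, so the normalized value is 1, the expected result for non-overlapping communities.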

    Calculating UniFrac Significance:

    The UniFrac metric can be used to measure whether the microbial communities in the samples are significantly different. Samples are significantly different if the UniFrac value for the real tree is greater than would be expected if the sequences were randomly distributed between the samples. To test this, the sample IDs are randomly permuted across the sequences in the tree and a new UniFrac value is calculated. This is repeated a number of times (determined by the selected value of the Number of Permutations option, e.g. 1000 times). The reported p-value is the fraction of permuted trees that have UniFrac values greater than or equal to that of the real tree.
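
The permutation procedure can be sketched generically in Python. The names here are hypothetical: `statistic` stands for any function that takes a list of sample labels (one per sequence, in a fixed tip order) and returns the UniFrac value for that labeling, and the `seed` argument is only there to make the sketch reproducible.

```python
import random


def permutation_p_value(observed, labels, statistic, permutations=1000, seed=0):
    """Fraction of label shufflings whose statistic is >= the observed value."""
    rng = random.Random(seed)
    shuffled = list(labels)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)          # reassign sample IDs to sequences at random
        if statistic(shuffled) >= observed:
            hits += 1
    return hits / permutations
```

Because the result is a count divided by the number of permutations, raw p-values come in steps of 1 / permutations.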

     

    Select analysis - "P Test Significance"


    The P Test (Martin, A. 2002. Appl. Environ. Microbiol. 68:3673-82) also uses phylogenetic information to test whether two samples are significantly different. It estimates similarity between communities as the number of parsimony changes (change from one sample to another along a branch) that would be required to explain the distribution of sequences between the different samples in the tree. A phylogenetic tree representing sequences from very different communities would require fewer parsimony changes to explain the distribution of sequences between samples than if the samples were similar. One way to think about this is that if two samples are mutually exclusive, only one parsimony change is required to separate the sequences in one sample from the other (see example below), whereas if the two were identical and had the same sequences we would expect all sequences to be found in both, requiring a large number of parsimony changes.

    Generating the P Test p-value. Two samples that are extremely well-separated on a tree (A) require few changes, and perhaps only one change, between samples to explain the modern distribution of lineages in samples. The minimum number of changes from one sample to another is inferred by parsimony. This observed value is compared to a distribution of values obtained by randomizing the sample to which each sequence is assigned, and re-inferring the number of parsimony changes required for these permuted sample sets. If the observed value is much lower than the average for the distribution, we conclude that organisms are significantly clustered by sample. The p-value is the fraction of random permutations of the labels that require fewer parsimony changes to explain than does the actual labeling. Note that the p-value is a measure of statistical significance, not of distance between samples.
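
Counting parsimony changes on a tree is commonly done with Fitch's algorithm. The Python sketch below uses that method as an illustration; whether Fast UniFrac uses exactly this procedure is not stated here. It assumes a strictly bifurcating tree written as nested two-element tuples, with each sequence assigned to exactly one sample. The function name parsimony_changes is hypothetical.

```python
def parsimony_changes(tree, sample_of):
    """Fitch parsimony on a bifurcating tree.

    `tree` is a tip name or a (left, right) tuple; `sample_of` maps each tip
    name to its sample. Returns (possible samples at this node, changes below).
    """
    if not isinstance(tree, tuple):
        return {sample_of[tree]}, 0
    left_states, left_changes = parsimony_changes(tree[0], sample_of)
    right_states, right_changes = parsimony_changes(tree[1], sample_of)
    common = left_states & right_states
    if common:
        return common, left_changes + right_changes
    # No shared state: one change is needed on a branch below this node.
    return left_states | right_states, left_changes + right_changes + 1
```

On the tree (("s1", "s2"), ("s3", "s4")), assigning s1 and s2 to one sample and s3 and s4 to another requires a single change, whereas alternating the two samples across the tips requires two.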

     

    P Test Significance - "Type of test"


    There are two ways to calculate the P test significance: all samples together and each pair of samples.

    All samples together:

    When this option is selected, it tests whether the sequences from all of the different samples in the tree are significantly different from each other. For instance, if there are sequences from 3 different samples in the tree, the number of parsimony changes is the number required to describe the distribution of sequences from the 3 different samples on the tree. The output is a single p-value indicating whether the samples are significantly clustered on the tree.

    Each pair of samples:

    When this option is selected, it tests whether each pair of samples differs from one another. For all possible pairs of samples, it first removes all sequences not in one of the two samples and then calculates significance. The calculated p-values are output in two separate tables. In the first table, the values have been corrected for multiple comparisons by multiplying them by the number of comparisons that were made (the Bonferroni correction). The second table displays the raw, uncorrected values. The sample IDs are displayed in the column and row headers (moving your mouse over the row headers will display the sample description, if you have provided that information in the category mapping file). If you move your mouse over the @ symbol in an individual cell of the table, it will display the sample IDs being compared and the associated p-value. Each cell is colored by the p-value and a color key is provided below the table. P-values < 0.001 are red, < 0.01 are yellow, < 0.05 are green, and > 0.05 are blue. The raw data from each table can be downloaded as a tab-delimited file using the download data link provided. An example of the output is shown below:

     

    P Test Significance - "Number of Permutations"


    This option sets the number of tree permutations used to generate the random distribution of values to which the value of the true tree is compared when calculating a p-value. You must choose a number of permutations that is appropriate for the number of samples you wish to analyze, taking into account the fact that the p-values are corrected for multiple comparisons.
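
A short Python sketch shows why the two numbers interact. Raw p-values come in steps of 1 / permutations, and each is multiplied by the number of pairwise comparisons. Both function names are hypothetical, and capping corrected values at 1 is an assumption rather than documented Fast UniFrac behavior.

```python
def bonferroni(raw_p_values):
    """Multiply each raw p-value by the number of comparisons (capped at 1)."""
    n = len(raw_p_values)
    return {pair: min(1.0, p * n) for pair, p in raw_p_values.items()}


def smallest_corrected_step(permutations, samples):
    """Smallest nonzero corrected p-value when all pairs of samples are compared."""
    comparisons = samples * (samples - 1) // 2
    return min(1.0, comparisons / permutations)
```

With 10 samples there are 45 pairwise comparisons, so 100 permutations give a smallest nonzero corrected value of 0.45, while 1000 permutations give 0.045.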

     

    UniFrac Significance - "Type of test"


    There are three ways to calculate the UniFrac significance: All samples together, each pair of samples, and each sample individually.

    All samples together:

    When this option is selected, it tests whether the sequences from all of the different samples in the tree are significantly different from each other. For instance, if there are sequences from 3 different samples in the tree, the UniFrac values are calculated as the percent of the branch length in the tree that leads to descendants from only one of the 3 samples. The output is a single p-value.

    Each pair of samples:

    When this option is selected, it tests whether each pair of samples differs from one another. For all possible pairs of samples, it first removes all sequences not in one of the two samples and then calculates the UniFrac significance. The calculated p-values are output in two separate tables. In the first table, the values have been corrected for multiple comparisons by multiplying them by the number of comparisons that were made (the Bonferroni correction). The second table displays the raw, uncorrected values. The sample IDs are displayed in the column and row headers (moving your mouse over the row headers will display the sample description, if you have provided that information in the category mapping file). If you move your mouse over the @ symbol in an individual cell of the table, it will display the sample IDs being compared and the associated p-value. Each cell is colored by the p-value and a color key is provided below the table. P-values < 0.001 are red, < 0.01 are yellow, < 0.05 are green, and > 0.05 are blue. The raw data from each table can be downloaded as a tab-delimited file using the download data link provided. An example of the output is shown below:

    Each sample individually:

    When this option is selected, for each sample, it treats the rest of the tree as a single other sample and determines if that particular sample has more unique branch length than expected by chance. If again we are looking at a tree with sequences from 3 different samples, and we determine that the samples are significantly different when evaluating All samples together, the tree can theoretically have a significant UniFrac significance value only because of unique lineages in one particular sample. The Each sample individually option provides a p-value for each sample in the tree and can be used to determine which particular sample is contributing the most unique branch length. The output is a table of p-values for each sample indicating whether that sample is significantly different from the rest of the tree. An example of the output is shown below. This result would indicate that only the sample ID E#PA has a significant amount of unique branch length with respect to the other samples in the tree.

     
    UniFrac Significance - "Number of Permutations"

    This option sets the number of tree permutations used to generate the random distribution of values to which the value of the true tree is compared when calculating a p-value. You must choose a number of permutations that is appropriate for the number of samples you wish to analyze, taking into account the fact that the p-values are corrected for multiple comparisons.

     

    UniFrac Significance - "Use abundance weights"


    When the No option is selected, the standard unweighted UniFrac algorithm is used to perform the calculations. If the Yes option is selected, the weighted UniFrac algorithm is used: with the weighted sub-option Normalized, a normalization step is performed in the weighted UniFrac value calculation, whereas with the Non-normalized option this normalization is not applied.

     
    Select analysis - "Sample Counts"

    The Sample Counts option produces a table that reports number of sequences in each sample, along with a description of each sample (if you provided that information in the category mapping file). An example of the sample counts table is shown below.

     
    Select analysis - "Sample Distance Matrix"

    The Sample Distance Matrix option calculates the distance between all pairs of samples using the UniFrac metric as described for Cluster Samples. This matrix is used as input in both the Cluster Samples and the PCoA analysis options and can be used to cluster the samples using different algorithms if desired.

    The output is a matrix that compares each sample to each other sample. The quartiles of UniFrac values have different colors (red, yellow, green, blue for lower to upper quartiles), allowing visual comparison of similarities and distances. Note that if there are more samples in your dataset than some maximum, the interactive matrix will be suppressed.
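
    The quartile coloring can be pictured with the following Python sketch. The way Fast UniFrac computes quartile boundaries and handles ties is not described here, so the use of statistics.quantiles with its default method, the restriction to values above the diagonal, and the function name are all assumptions made for illustration.

```python
import bisect
import statistics
from typing import Dict, List, Tuple

# Colors from the lowest to the highest quartile, as described in the text.
QUARTILE_COLORS = ["red", "yellow", "green", "blue"]


def quartile_colors(matrix: List[List[float]]) -> Dict[Tuple[int, int], str]:
    """Assign a quartile color to every pair (i, j) with i < j."""
    n = len(matrix)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    values = [matrix[i][j] for i, j in pairs]
    # Three cut points that split the pairwise values into four groups.
    cuts = statistics.quantiles(values, n=4)
    return {
        (i, j): QUARTILE_COLORS[bisect.bisect_left(cuts, matrix[i][j])]
        for i, j in pairs
    }


# Symmetric toy matrix for four samples (diagonal is zero).
distances = [
    [0.0, 0.1, 0.2, 0.3],
    [0.1, 0.0, 0.4, 0.5],
    [0.2, 0.4, 0.0, 0.6],
    [0.3, 0.5, 0.6, 0.0],
]
colors = quartile_colors(distances)
print(colors[(0, 1)])  # red: the smallest distance
print(colors[(2, 3)])  # blue: the largest distance
```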

    Note also that you can download a compressed, tab-delimited file containing the raw data from this matrix using the download link provided.

     
    Select analysis - "Jackknife Sample Clusters"

    Jackknife Sample Clusters performs jackknife analysis of sample clusters produced with the methods described for the Cluster Samples Analysis option. It can be used to determine how robust the analysis is to both sample size and evenness by assessing how often the cluster nodes are recovered when smaller, even sets of sequences are sampled from the samples. Jackknifing is a statistical resampling technique that asks whether you would see the same results using only a random sample of part of the data. To perform this analysis, we randomly sample a specified number of sequences from each sample and re-cluster the data. We repeat this the number of times set for Number of Permutations and calculate the fraction of replicates in which each node in the cluster was recovered.
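
    The support calculation described above can be sketched in Python as follows. The sketch assumes each cluster node is identified by the set of sample IDs beneath it and that a node counts as recovered when exactly the same set appears in a replicate tree; the clustering step itself is omitted, and the function name and data layout are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List, Set

# A cluster node is identified by the set of sample IDs beneath it (assumption).
Node = FrozenSet[str]


def jackknife_support(
    reference_nodes: Iterable[Node],
    replicate_nodes: List[Set[Node]],
) -> Dict[Node, float]:
    """Fraction of jackknife replicates in which each reference node was recovered."""
    n_replicates = len(replicate_nodes)
    return {
        node: sum(node in replicate for replicate in replicate_nodes) / n_replicates
        for node in reference_nodes
    }


ab, cd = frozenset({"A", "B"}), frozenset({"C", "D"})
abc, ac = frozenset({"A", "B", "C"}), frozenset({"A", "C"})

# Nodes of the tree built from all sequences: ((A,B),(C,D)).
reference = [ab, cd]
# Nodes found in four jackknife replicates.
replicates = [{ab, cd}, {ab, cd}, {ab, abc}, {ac, abc}]

support = jackknife_support(reference, replicates)
print(support[ab])  # 0.75
print(support[cd])  # 0.5
```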

     
    Jackknife Sample Clusters - "Minimum sequences to keep"

    Minimum sequences to keep specifies the number of sequences that are kept from each sample in the tree during each jackknife replicate. The sequences are randomly selected for inclusion during each replicate. If a sample contains fewer than the specified number of sequences, all the sequences from that sample are removed, resulting in exclusion of that sample from the cluster output. Because of this, it is often best to select a number that is less than or equal to the minimum number of sequences in any one sample. This maximum recommended value is automatically set as the default. You can view the number of sequences in each sample with the Sample Counts Analysis option.

    The most reasonable value to use is about 75% of the smallest sample you want to include in the analysis, but in practice people usually just use the smallest sample (despite the fact that there is then no resampling performed on that sample).
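
    A single jackknife replicate under these rules can be sketched in Python as shown here. The dictionary layout (sample ID mapped to a list of sequence IDs), the function names, and sampling without replacement via random.sample are assumptions for illustration, not a description of the Fast UniFrac code.

```python
import random
from typing import Dict, List


def default_keep(sample_to_seqs: Dict[str, List[str]]) -> int:
    """Size of the smallest sample: the largest value that excludes no sample."""
    return min(len(seqs) for seqs in sample_to_seqs.values())


def jackknife_replicate(
    sample_to_seqs: Dict[str, List[str]],
    keep: int,
    rng: random.Random,
) -> Dict[str, List[str]]:
    """Draw `keep` sequences at random from each sample.

    Samples with fewer than `keep` sequences are dropped entirely.
    """
    return {
        sample: rng.sample(seqs, keep)
        for sample, seqs in sample_to_seqs.items()
        if len(seqs) >= keep
    }


data = {
    "S1": [f"S1_seq{i}" for i in range(10)],
    "S2": [f"S2_seq{i}" for i in range(8)],
    "S3": [f"S3_seq{i}" for i in range(3)],
}

print(default_keep(data))  # 3

# Keeping about 75% of the smallest sample we still want to include (S2 has 8).
keep = int(0.75 * 8)       # 6
replicate = jackknife_replicate(data, keep, random.Random(0))
print(sorted(replicate))   # ['S1', 'S2']: S3 has only 3 sequences and is dropped
```

    Choosing a value below the size of the smallest retained sample, as in the usage lines, means every retained sample is actually resampled in each replicate.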

     
    Jackknife Sample Clusters - "Number of Permutations"

    The Number of Permutations option specifies the number of jackknife replicates to perform (i.e., the number of times to randomly resample the sequences and redo the sample clustering when calculating the support for each cluster node). This number is limited for guest users.

     
    Jackknife Sample Clusters - "Use abundance weights"

    When the No option is selected, the analysis uses the standard unweighted UniFrac algorithm to perform the calculations. If the Yes option is selected, it uses the weighted UniFrac algorithm. With the weighted sub-option Normalized, a normalization step is performed in the weighted UniFrac value calculation; with the Non-normalized sub-option, this normalization is not applied.

     
    Jackknife Sample Clusters - Output format

    The output is the same type of tree used in Cluster Samples, except that nodes are colored according to the fraction of the random samples that they were recovered in. Move your mouse over the internal nodes to get the jackknife support or move your mouse over the sample IDs to get the sample descriptions (if you've supplied this information in the category mapping file).