Glossary

Clusters

In the cluster tracker, a cluster is a set of closely related samples from the same region that descend from a common ancestor associated with a regional introduction event.

In a phylogenetic tree, a cluster appears as a set of leaves (samples) from a geographic region that descend from a shared common ancestor. See the glossary entries for phylogenetic tree and polytomy for more information about interpreting the branches on phylogenetic trees.

A cluster may be monophyletic or paraphyletic. It is paraphyletic when some descendants of the common ancestor are missing from the cluster because they left the geographic region.

A regional transmission event occurs where a child node is from a different region than its parent node. Such patterns reflect infected travelers moving between regions, followed by local transmission and the eventual sampling of descendant infections.

Calculating a regional index allows each sample to be assigned an introduction node. See the glossary entry for regional index for more information about how it is calculated. The introduction point is identified where the regional index falls below 0.5. This cutoff is used because at 0.5 the downstream samples could equally be inside or outside the region. Samples that share an introduction node are clustered together. Dynamic parallel computing allows very fast calculations across extremely large phylogenies, highlighting clustered samples for further prioritization and investigation by epidemiologists.
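The grouping step described above can be sketched in a few lines. This is a minimal illustration, not the cluster tracker's implementation: the sample names, node identifiers, and the function name cluster_by_introduction are all hypothetical, and the mapping from sample to introduction node is assumed to have been computed already.

```python
from collections import defaultdict


def cluster_by_introduction(introduction_of):
    """Group sample names by the introduction node assigned to each sample."""
    clusters = defaultdict(list)
    for sample, intro_node in introduction_of.items():
        clusters[intro_node].append(sample)
    return dict(clusters)


# Hypothetical assignments: sample name -> introduction node identifier.
introduction_of = {
    "sample_A": "node_12",
    "sample_B": "node_12",
    "sample_C": "node_40",
}

clusters = cluster_by_introduction(introduction_of)
for intro_node, samples in clusters.items():
    print(intro_node, samples)
```

Samples A and B share an introduction node and so fall into one cluster, while sample C forms its own.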


Growth score

A "growth score" is the square root of the number of samples in a cluster divided by the number of weeks since the original introduction. It is used to order the discovered clusters for priority review, so that the most recent clusters with larger sample sizes sort to the top of a region's table of clusters and can be investigated first by epidemiologists.
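As a rough illustration of the calculation, the following sketch computes the score and sorts a few made-up clusters by it. The cluster names and counts are invented for the example, and rejecting a non-positive number of weeks is an assumption of this sketch, not documented behavior of the cluster tracker.

```python
import math


def growth_score(sample_count, weeks_since_introduction):
    """Square root of the sample count divided by weeks since introduction."""
    if weeks_since_introduction <= 0:
        # Assumption: the sketch refuses zero or negative ages rather than guessing.
        raise ValueError("weeks_since_introduction must be positive")
    return math.sqrt(sample_count) / weeks_since_introduction


# Hypothetical clusters for illustration only.
example_clusters = [
    {"name": "cluster_1", "samples": 100, "weeks": 10},  # sqrt(100) / 10 = 1.0
    {"name": "cluster_2", "samples": 16, "weeks": 2},    # sqrt(16) / 2 = 2.0
    {"name": "cluster_3", "samples": 4, "weeks": 4},     # sqrt(4) / 4 = 0.5
]

# Highest growth score first, as in a priority-review table.
ranked = sorted(
    example_clusters,
    key=lambda c: growth_score(c["samples"], c["weeks"]),
    reverse=True,
)
for c in ranked:
    print(c["name"], growth_score(c["samples"], c["weeks"]))
```

Note that the small but recent cluster_2 outranks the much larger but older cluster_1, which is the intended effect of dividing by cluster age.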

Introductions

In the cluster tracker, an introduction labels whether a newly obtained sequence descended from a virus already in a region or was introduced from outside the region. The programmatic tools take in phylogeographic data and examine how new sequences branch on the tree to generate a "regional index" that helps ascertain transmission dynamics. See the glossary entry for regional index for more information about how it is calculated.

matUtils

The Pathogen Genomics group created a toolkit of Mutation-Annotated Tree utilities, or matUtils, to extract data from phylogenetic trees for deeper analysis. The paper explains in more depth the five subcommands of matUtils (annotate, summary, extract, uncertainty, and introduce), which allow rapid interpretation and analysis of sequencing information for genomic surveillance. A wiki provides detailed instructions for the usage of each module.

Phylogenetic tree

A phylogenetic tree, also known as an "evolutionary tree", is a branching diagram that shows relationships of genetic similarity and difference. The diagram provides a hypothesis of evolutionary history, where the tips, or "leaves", represent the end or present time.

The branching of a phylogenetic tree depicts the history of common ancestry, where the pattern of branching, or topology, informs relationships. Branches can be rotated and still display equivalent information, as seen in the above image.

These tips for tree reading can help avoid common misinterpretations. For instance, the resource includes the top image to demonstrate how branches can be rotated.

Since rotations can display equivalent information, you cannot assume that terminal taxa located closer to one another are more closely related evolutionarily. One must follow the branches to identify the most recent common ancestors. For instance, in the below image, the circle and the star are less closely related, although they are adjacent in the tree on the left. Instead, the star and the rectangle, although separated by the triangle, share a more recent common ancestor with each other than either shares with the circle, and so are more closely related.
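Following branches to a most recent common ancestor can also be expressed in code. The sketch below uses a hypothetical four-tip topology, (circle, (star, (triangle, rectangle))), chosen only to match the description above. It is not a transcription of the glossary's image, and the internal node names n1 and n2 are invented.

```python
# Hypothetical topology: child -> parent. Tips in display order:
# circle, star, triangle, rectangle.
parent = {
    "circle": "root",
    "n1": "root",
    "star": "n1",
    "n2": "n1",
    "triangle": "n2",
    "rectangle": "n2",
}


def ancestors(node):
    """Return the path from a node up to the root, including both ends."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def mrca(a, b):
    """Most recent common ancestor: first node on b's path that is also on a's."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError("nodes do not share an ancestor")


print(mrca("circle", "star"))     # adjacent tips, but they only meet at the root
print(mrca("star", "rectangle"))  # separated by the triangle, yet they meet at n1
```

Because n1 sits below the root, the star and the rectangle share a more recent ancestor than the star and the circle do, regardless of how the tips are ordered on the page.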

Polytomy

A polytomy, also described as a "multifurcation", is a node on a phylogenetic tree with three or more child subtrees. Building phylogenetic trees is an imperfect science and results in many polytomies, which represent branches that are not yet fully resolved, so it is harder to extract the relationships expected in a strictly binary tree. Because of polytomies, items from different regions may appear to be clustered together in a tree when, in reality, the nearby items simply have zero branch lengths.
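Under the definition above, polytomies can be found by counting child subtrees. This is a minimal sketch over a hypothetical tree stored as a parent-to-children mapping. The node names and the function find_polytomies are illustrative only.

```python
# Hypothetical tree: internal node -> list of child nodes.
children = {
    "root": ["n1", "s4"],        # two children: an ordinary bifurcation
    "n1": ["s1", "s2", "s3"],    # three children: a polytomy
}


def find_polytomies(tree, min_children=3):
    """Return the nodes that have at least min_children child subtrees."""
    return [node for node, kids in tree.items() if len(kids) >= min_children]


print(find_polytomies(children))
```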

Regional index

To identify introductions and clusters in the data, a regional index is calculated. The formula and figure shown here are from the paper introducing the cluster tracker tool. The regional index provides a weighted summary of the descendants of a node in a phylogenetic tree.

A binary model of region membership can approximate the intuitive idea that if all descendants of a node were found in region A, it is likely that the ancestor represented by that internal node was circulating in region A. Li and Lo refer to the number of leaves (samples) inside and outside the region, respectively, among the descendants of a focal node. Di and Do refer to the total branch lengths to leaf descendants inside and outside the focal region, respectively, where the total branch length is equal to the number of mutations between the query node and a descendant leaf.

In figure one below, starting from the focal node, only one mutation [1] has occurred on the branch to the top sequence in blue, which is a single leaf. On the other branch, three mutations [3] accumulated first, displayed as a longer branch, before a node with one further mutation [1] leading to a total of three red leaves. Intuitively, we can sense how the red samples are grouped together: more time has passed for all of these mutations to occur in this descendant group. The regional index calculation gives the focal node a value of 0.43, below 0.5, suggesting the blue leaf is out-of-region, with an introduction below the root, while noting that the ancestor of the downstream in-region sample cluster exists along that branch as well.
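The arithmetic behind the 0.43 value can be checked with a short sketch. The formula is not reproduced in the text of this entry, so the form used here, 1 / (1 + (Di / Li) / (Do / Lo)), is an assumption; it was chosen because it reproduces the value quoted above when Di is 4 mutations (3 + 1) to the red leaves, Li is 3, Do is 1 mutation to the blue leaf, and Lo is 1. The function name and its handling of edge cases are likewise illustrative; consult the paper for the authoritative definition.

```python
def regional_index(leaves_in, leaves_out, dist_in, dist_out):
    """Assumed form of the regional index: 1 / (1 + (Di / Li) / (Do / Lo)).

    leaves_in, leaves_out: counts of in-region and out-of-region leaves (Li, Lo).
    dist_in, dist_out: mutations from the focal node to in-region and
    out-of-region leaves (Di, Do).
    """
    if leaves_in == 0 and leaves_out == 0:
        raise ValueError("the focal node has no descendant leaves")
    # Assumption: a node with only in-region (or only out-of-region)
    # descendants gets the extreme value 1.0 (or 0.0).
    if leaves_out == 0:
        return 1.0
    if leaves_in == 0:
        return 0.0
    if dist_in <= 0 or dist_out <= 0:
        # Simplification: zero-length branches are not handled in this sketch.
        raise ValueError("branch lengths must be positive in this sketch")
    ratio = (dist_in / leaves_in) / (dist_out / leaves_out)
    return 1.0 / (1.0 + ratio)


# Values read from the worked example: three red in-region leaves reached
# after 3 + 1 mutations, one blue out-of-region leaf reached after 1 mutation.
index = regional_index(leaves_in=3, leaves_out=1, dist_in=4, dist_out=1)
print(round(index, 2))
```

With these inputs the in-region term is 4 / 3 and the out-of-region term is 1 / 1, giving 1 / (1 + 1.33), which rounds to 0.43 and falls below the 0.5 cutoff.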

Taxonium

Taxonium is a tool for exploring phylogenetic trees, including those with millions of nodes. Taxonium is especially powerful when applied to a tree that has been annotated with mutations. The "View Cluster" link in the CA Big Tree Cluster Tracker will open a view in Taxonium.

On the right-hand side in Taxonium, you can change the "Color by:" option from "Region" to "PANGO lineage" or "None." By default, the "Search" section will have a selected "Cluster" name from clicking the "View Cluster" link in Cluster Tracker. By clicking the small magnifying glass next to "# results" you can zoom in on the cluster. You can also click the "Add a new search" button and add on searches for other metadata such as specific accessions, other cluster names, or specific regions. The view of the tree can be zoomed in and out both vertically and horizontally with the many different magnifying glass icons on the bottom. You can also use your mouse scroll wheel to zoom in and out vertically without the icons.

With an item selected, details will appear on the right. After the name, you can click a small arrow pointing up and to the left, which selects the entire branch of the tree one level up. Once up a branch, a curved arrow pointing to the right should appear after a phrase such as "Number of descendants: #". Clicking it opens a pop-up that offers to "List all tips" for this branch. Changing the default selection of "name" to "meta_region" or another metadata field will list all the regions, or the selected metadata, in this branch. Options to download the JSON for the branch, or to view it in NextStrain or CovSpectrum, are also available by clicking this arrow.

UShER

UShER stands for Ultrafast Sample placement on Existing tRees and is the tool that puts the ever-growing set of new SARS-CoV-2 mutations arising in the natural environment into context on an existing phylogenetic tree. The UCSC Pathogen Genomics group uses mutation-annotated tree utilities, or matUtils, with parallel dynamic programming to quickly extract information hidden in the tree built by UShER, which grows daily. The phylogenetic tree updated daily by UShER is also used by the Phylogenetic Assignment of Named Global Outbreak Lineages (PANGOLIN) software to implement a dynamic nomenclature, known as the PANGO nomenclature, to classify genetic lineages for SARS-CoV-2.

UShER works with mutation-annotated trees, as shown in the figure below. The mutation-annotated tree object carries sufficient information to derive parsimony-resolved genotypes for any tip of the tree using the sequence of mutations from the root to that tip. For example, in the figure below, S5 can be inferred to contain the variants G1149U, C7869U, and G3179A from the first node, plus A2869G, which is specific to S5, all in relation to the reference sequence at the root. UShER’s mutation-annotated tree approach is compact, which helps make it fast at placing new samples based on the variants they contain.
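The idea of reading a tip's genotype off the root-to-tip path can be sketched as follows. This is a toy illustration, not UShER's data structure or file format: the node name node_1 and the two-level topology are assumptions, and only the mutations quoted for S5 above are taken from the text. The sketch also ignores repeated or reverting mutations at the same position, which a real implementation would need to resolve.

```python
# Hypothetical tree: child -> parent.
parent = {"node_1": "root", "S5": "node_1"}

# Mutations annotated on the branch leading to each node,
# relative to the reference sequence at the root.
mutations = {
    "root": [],
    "node_1": ["G1149U", "C7869U", "G3179A"],
    "S5": ["A2869G"],
}


def genotype(tip):
    """Collect the mutations on the path from the root down to the given tip."""
    path = [tip]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    variants = []
    for node in reversed(path):  # walk from the root toward the tip
        variants.extend(mutations.get(node, []))
    return variants


print(genotype("S5"))
```

Because each mutation is stored once on the branch where it arose, every tip below node_1 shares the same three annotations without repeating them, which is what keeps the structure compact.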