Cluster Tracker

Cluster Tracker is a data prioritization and exploration tool that helps flag, for further investigation, samples related by shared genetic descent from a common ancestor and a shared introduction into a focal region. Read more here.

What is Cluster Tracker, and why should I use it?

The Cluster Tracker tool uses specialized analysis and contextualization to help obtain insights into viral transmission dynamics and prioritize cases worthy of further investigation in your region.

Viruses mutate continually, and successful new variants overtake existing lineages and spread across regions. With UCSC's Cluster Tracker, you can identify new virus variants emerging in your region's sequencing data. You can also spot and examine earlier those variants that show concerning growth and are grouped as new introductions into your population.

How do I get access to Cluster Tracker?

Public Cluster Tracker Tool

There are two forms of the Cluster Tracker tool. The first uses only the global public SARS-CoV-2 phylogenetic tree and is publicly accessible here: https://clustertracker.gi.ucsc.edu/

This public Cluster Tracker tool is limited to state-level regions, and once a state is clicked on to select it as the focal region, the heat map identifies groups of viral sequences that may have recently migrated from outside the state. Here is a link to a tutorial for the public site.

California Big Tree Cluster Tracker

Compared to the public Cluster Tracker tool, the CA Big Tree Cluster Tracker tool is a different website with restricted password access. It incorporates sequences in California’s central repository in Terra and allows county-level selection of focal regions within California.

To request access to the CA Big Tree Cluster Tracker tool, email COVIDNet.TechSupport@cdph.ca.gov

The CA Big Tree Cluster Tracker tool exists to help Local Health Jurisdictions (LHJs) obtain insights into viral transmission dynamics between California counties and other states, and to help prioritize cases worthy of further investigation. The website is funded by the California Department of Public Health (CDPH) and is restricted to CDPH members.

How do I resolve a "Couldn't find your Google Account" login issue?

If you see an error such as "Couldn't find your Google Account" for your access email, you may need to confirm with Google that your access email is allowed to be used for sign-in.


How do I use the tool?

The CA Big Tree Cluster Tracker gives you a map of California, with a table below it displaying the discovered clusters for your selected region, ordered by growth score. The map is a geographic representation that summarizes information from the phylogenetic tree. Investigating the results in the table can help spot potential outbreak connections in your community: UCSC's algorithms process the latest viral sequencing data to group similar sequences and infer the potential introduction source of newly emerging variants.

By looking at CA Big Tree Cluster Tracker's table of clusters, you can ask yourself, "What clusters spark my interest?" and "What clusters make me want to dig deeper?" The "View Cluster" link allows you to view the samples in the context of all viral lineages. From the thousands of new sequences being placed onto the evolving SARS-CoV-2 phylogenetic tree, the tool circles the selected growing cluster of samples descending from a common out-of-region ancestor to assist you in prioritizing cases for additional investigation.

Alternatively, if you already have a specimen ID, you can search the table of clusters to discover linked cases related by sequence. In that way, a known case of concern can be viewed alongside genetically related samples, which may help you investigate transmission pathways or see new relationships.

Or, if you have a new viral variant of interest, such as XBB.1 or BQ.1 (Fall 2022), you can search within your county to see which clusters exist that might be growing in size. You can click the "View Clusters" link to see the samples on the SARS-CoV-2 phylogenetic tree, then change "Color by:" to "PANGO lineage", click the "Add a new search" button, and enter your "County" as a search term to zoom out and see all related lineages (for instance XBB.2, XBB.4) with samples from your region.

What does the heat map indicate?

The heat map represents the total number of introductions into a region. Once you click a region, it turns purple to indicate that it is now the focal region for the map. All the other county and state regions are then colored by the number of introductions from that region into the focal region. The coloration defaults to log-fold enrichment, which helps in spotting introductions from less densely populated areas, and can be changed to raw cluster count, where the most populous regions will likely have the highest counts.

To see how the heat map connects to the table of clusters, note that once you have selected the focal region, the table is filtered for that region. The heat map does not represent a confidence score for introductions from other regions, only the number of introductions coming from those counties or states. In the table rows, the "Best Potential Origins" column has a name or names corresponding to the regions on the map. The heat map is a global view of introductions rather than a representation of any one cluster in a row: it shows the sum, over all clusters, of the potential introductions from each region into the focal region.
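The raw cluster count view described above can be pictured as a simple tally. The following is a minimal sketch, not Cluster Tracker's own code: the row layout and field names are hypothetical, and counting a cluster once for each of its listed potential origins is an assumption, since the text does not say how clusters with several origins are handled.

```python
from collections import Counter

# Hypothetical rows standing in for the cluster table; field names are assumptions.
clusters = [
    {"region": "Alameda County", "origins": ["Santa Clara County"]},
    {"region": "Alameda County", "origins": ["Santa Clara County", "Nevada"]},
    {"region": "Alameda County", "origins": ["Texas"]},
    {"region": "Fresno County", "origins": ["Nevada"]},
]

def raw_counts(rows, focal_region):
    """Count, per origin region, the clusters introduced into focal_region."""
    counts = Counter()
    for row in rows:
        if row["region"] == focal_region:
            # Assumption: each listed origin gets one count per cluster.
            counts.update(row["origins"])
    return counts

counts = raw_counts(clusters, "Alameda County")
print(counts.most_common())
```

Each origin's tally would then drive how darkly that region is shaded for the selected focal region.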

How is the log-fold enrichment calculated?

The shading of the map with our log-fold enrichment scale is an attempt to mitigate sampling biases resulting from larger populations, higher case rates, increased sequencing, or other factors that are not specific to geography.

Intuitively, the less sequencing is performed in a region, the less likely it is that sequences from that region will be detected when the region is an introduction source for another region. To compensate for this statistical bias, the log-fold enrichment of introductions between regions is computed and displayed as the default view. It is possible to switch the view to raw cluster count; however, that view will always color high-population sources darker.

The log-fold enrichment calculation is described in the Methods section of the paper describing Cluster Tracker.
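The formula itself is not reproduced in this text. A form consistent with the term definitions given here, offered as a reconstruction to be checked against the paper rather than a quotation from it, is the ratio of observed to expected introductions:

```latex
\text{log-fold enrichment}(A \to B) = \log_{10}\left( \frac{I_{ab} \, I_{xx}}{I_{ax} \, I_{xb}} \right)
```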

In this formula, Iab is introductions from region A to region B, Ixx is introductions from any region to any region, Iax is introductions from region A to any other region, and Ixb is introductions from anywhere to region B. This computation can remove biases in rates of detected introduction that would apply to any pair of regions, but it requires many regions to be computed as points of comparison. Log10-fold enrichment is used to color the map on Cluster Tracker when a state is selected. Often the coloring correlates very strongly with geographic distance, which also makes intuitive sense. It is worth stressing that while log-fold enrichment may reveal spatial relationships, it does not reflect the absolute importance of a region as a source or sink of viral transmission.
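As a worked illustration of these four terms, here is a small sketch under stated assumptions. The counts are invented, and the combination log10((Iab * Ixx) / (Iax * Ixb)) is an assumed observed-over-expected form that matches the definitions above; consult the Methods section of the paper for the authoritative formula.

```python
import math

# Hypothetical introduction counts: introductions[(source, destination)] = count.
introductions = {
    ("Nevada", "California"): 30,
    ("Texas", "California"): 20,
    ("Nevada", "Texas"): 5,
    ("California", "Texas"): 25,
    ("California", "Nevada"): 15,
    ("Texas", "Nevada"): 5,
}

def log_fold_enrichment(counts, a, b):
    """log10 of observed A-to-B introductions over the number expected by chance.

    Assumed form: log10((Iab * Ixx) / (Iax * Ixb)).
    """
    i_ab = counts.get((a, b), 0)                                 # A to B
    i_xx = sum(counts.values())                                  # any to any
    i_ax = sum(n for (src, _), n in counts.items() if src == a)  # A to any
    i_xb = sum(n for (_, dst), n in counts.items() if dst == b)  # any to B
    if i_ab == 0:
        return None  # the logarithm is undefined with no observed introductions
    return math.log10((i_ab * i_xx) / (i_ax * i_xb))

print(log_fold_enrichment(introductions, "Nevada", "California"))
print(log_fold_enrichment(introductions, "Texas", "California"))
```

A positive value means more introductions from A into B were observed than the overall totals would predict, and a negative value means fewer.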

How do you define Cluster, Growth Score, Introduction and other terms?

You can visit our glossary page for a list of these definitions. 

Please do not hesitate to contact us at help-pathogengenomics@ucsc.edu to ask any questions.

How can I download data?

At the bottom of the Cluster Tracker Tool you will find these links to files you can download:

While the first files capture the identified clusters in the table below the map, with the current Cluster ID references, the second two JSONL links capture the phylogenetic tree snapshot, which can be viewed in Taxonium with those same IDs.

Do the Cluster IDs (nodes) in Cluster Tracker always stay the same?

In the table below the map, the first column starts with names such as California_node_###### when looking at the state level, or San_Bernardino_County_node_###### when looking at the county level. These items identify clusters in the current phylogenetic tree. In essence, they represent the region_ followed by the node_number of the current tree.

Each time the tree is rebuilt, new nodes will be identified so you cannot rely on these IDs to be static references. As new data is incorporated into the tree with each build, these reference numbers will likely change.
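If you keep notes or scripts that handle these IDs, the naming pattern described above can be split into its two parts. This is a minimal sketch that assumes every ID follows the region_node_number pattern shown in the examples; the node numbers used here are made up. Keep in mind that the node number is only meaningful within a single build of the tree.

```python
import re

# Assumed pattern: "<region>_node_<digits>", based on the examples in the text.
CLUSTER_ID = re.compile(r"^(?P<region>.+)_node_(?P<node>\d+)$")

def parse_cluster_id(cluster_id):
    """Split a cluster ID into a readable region name and a node number."""
    match = CLUSTER_ID.match(cluster_id)
    if match is None:
        raise ValueError(f"unrecognized cluster ID: {cluster_id!r}")
    region = match.group("region").replace("_", " ")
    return region, int(match.group("node"))

# The node numbers below are hypothetical.
print(parse_cluster_id("San_Bernardino_County_node_123456"))
print(parse_cluster_id("California_node_42"))
```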

If Cluster IDs can change, how can I find my cluster again?

When you find a cluster of interest, take note of the Samples and Specimen IDs, which appear in the final two columns of the table. Record at least one of these IDs so you can find your cluster again. For instance, if you have a cluster such as San_Bernardino_County_node_######, you could click the Specimen IDs and note that one of them is FS48076012. By searching for FS48076012 after later builds of the tree, you will find the same cluster, which may have changed, for example by growing in number of samples as new sequencing information becomes available.
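The same idea can be applied to saved copies of the cluster table. The sketch below uses invented snapshots of two builds: the field names, node numbers, and every specimen ID other than FS48076012 are hypothetical, and the layout of the real downloadable files may differ.

```python
# Hypothetical snapshots of the cluster table from two builds of the tree.
old_build = [
    {"cluster_id": "San_Bernardino_County_node_111",
     "specimen_ids": ["FS48076012", "FS00000001"]},
]
new_build = [
    {"cluster_id": "San_Bernardino_County_node_222",
     "specimen_ids": ["FS48076012", "FS00000001", "FS00000002"]},
    {"cluster_id": "Kern_County_node_333",
     "specimen_ids": ["FS00000003"]},
]

def find_clusters(rows, specimen_id):
    """Return every cluster row that contains the given specimen ID."""
    return [row for row in rows if specimen_id in row["specimen_ids"]]

before = find_clusters(old_build, "FS48076012")[0]
after = find_clusters(new_build, "FS48076012")[0]
print(before["cluster_id"], len(before["specimen_ids"]))
print(after["cluster_id"], len(after["specimen_ids"]))
```

In this made-up example the Cluster ID differs between builds and the cluster has gained a sample, yet the recorded specimen ID still locates it.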

How can I search multiple terms and filter the table?

You can enter multiple terms to search the table of clusters, provided the terms are separated by commas with no spaces. For instance, if you were looking for rows mentioning Hawaii and Texas, you could enter "Hawaii,Texas" and click "Search" to find all the rows in the table that contain both of those terms.

If you wanted to further narrow the results, you could add more terms, such as a variant name or dates. For instance, if you were curious about rows that mention Hawaii, Texas, and the variant XBB and that also include both November and December dates, you could add XBB followed by a comma, then the date 2022-12 for December followed by a comma, and then 2022-11 for November.

A search such as "Hawaii,Texas,XBB,2022-12,2022-11" will identify the rows in the table where each of these terms exists. Clicking the "Search" button does a Boolean "AND" match by default; however, it can be changed to the "OR" option to expand results. For instance, if you change to the "OR" search and put in many dates, such as "2022-12,2022-11,2022-10,2022-09", you would find any rows that mention any one of those dates once you click "Search".

A "Search Columns:" box allows the removal of columns to further restrict the matched results. For instance, if you searched the term "County" and restricted the "Search Columns:" box to only "Cluster ID", you would filter to all California county-level information (i.e., remove any clusters from other states). Using the "Region" column rather than "Cluster ID" would provide similar results. Note that you can achieve a similar result by clicking the top "Show CA State Introductions" and then clicking the state of California on the map, although this would not include the county-level metadata.
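The search behavior described above can be mimicked on your own copy of the table. This is only a sketch of the logic, not the website's implementation: the rows and the "Dates" column are invented, and case-sensitive substring matching is an assumption.

```python
# Hypothetical table rows; only some column names come from the text above.
rows = [
    {"Cluster ID": "Alameda_County_node_1", "Region": "Alameda County",
     "Best Potential Origins": "Hawaii,Texas", "Dates": "2022-11-03,2022-12-10"},
    {"Cluster ID": "Oregon_node_2", "Region": "Oregon",
     "Best Potential Origins": "Hawaii", "Dates": "2022-10-01"},
]

def search(rows, query, mode="AND", columns=None):
    """Match comma-separated terms against the selected columns of each row."""
    terms = [term for term in query.split(",") if term]
    combine = {"AND": all, "OR": any}[mode]
    matches = []
    for row in rows:
        # With no column restriction, every column of the row is searched.
        searched = columns if columns is not None else list(row)
        text = " | ".join(str(row[name]) for name in searched)
        if combine(term in text for term in terms):
            matches.append(row)
    return matches

print(len(search(rows, "Hawaii,Texas")))                    # AND across all columns
print(len(search(rows, "Hawaii,Texas", mode="OR")))         # OR widens the match
print(len(search(rows, "County", columns=["Cluster ID"])))  # one column only
```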

The Advanced Filter Options are an even better way to drill down on cluster dates, as well as narrow down on specific cluster sizes and growth scores.

Advanced Filter Options

Additional "Cluster Date:" and "Cluster Size:" and "Growth Score:" filter options allow drilling down on clusters that are more recent and potentially more concerning. For instance, perhaps you want to only filter the clusters of 3 or more samples that have been collected in the last month.

First, you could click your region of interest on the map, for instance San Francisco County. Alternatively, use the "County" search described above, restricting the "Search Columns:" box to only "Cluster ID", to see all counties. Then click the side arrow next to "Advanced Filter Options" to reveal date, size, and score range options. You can enter a "From: mm/dd/yyyy" date that covers the last 30 days. If you click the calendar icon, a pop-up will allow you to click dates rather than type them. Now, in the "Cluster Size:" filter option, put "3" in the "From:" box and leave the other box empty. Then click "Search" in the top right. If you have no results, move the date further back, since there is often a lag in sequencing data entering the pipeline.

Each time you change the inputs, be sure to click "Search" to have the filter applied. You can apply a "To: mm/dd/yyyy" to limit the date range, and "To:Max" for cluster size. The Growth Score is a calculation derived from both date and sample size. You can use it to further filter, or start filtering samples with Growth Score once you get a feeling of how it highlights more significant clusters.
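The date and size ranges above behave like optional lower and upper bounds. The sketch below illustrates that logic on invented rows. The field names are hypothetical, and using a cluster's latest sample date as its "Cluster Date" is an assumption. Growth Score is left out because its formula is not given here, though it could be bounded in the same way.

```python
from datetime import date, timedelta

# Hypothetical clusters; "latest_date" as the cluster date is an assumption.
rows = [
    {"cluster_id": "A", "latest_date": date(2023, 1, 20), "size": 5},
    {"cluster_id": "B", "latest_date": date(2023, 1, 25), "size": 2},
    {"cluster_id": "C", "latest_date": date(2022, 11, 1), "size": 10},
]

def in_range(value, low, high):
    """True when value lies within the bounds; a bound of None is left open."""
    return (low is None or value >= low) and (high is None or value <= high)

def advanced_filter(rows, date_from=None, date_to=None, size_from=None, size_to=None):
    """Keep rows whose date and size fall inside the requested ranges."""
    return [
        row for row in rows
        if in_range(row["latest_date"], date_from, date_to)
        and in_range(row["size"], size_from, size_to)
    ]

today = date(2023, 1, 31)  # fixed so the example is reproducible
recent = advanced_filter(rows, date_from=today - timedelta(days=30), size_from=3)
print([row["cluster_id"] for row in recent])
```

Moving `date_from` further back, as suggested above for empty results, would bring older clusters such as the hypothetical cluster C back into view.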

How can I interpret potential origins?

When looking at a phylogenetic tree, it is important to know that even if two branches are close to each other, the samples may or may not have a relationship to each other. It can be hard to say where a variant may have come from and where it may be going by just looking at a tree. When possible, additional epidemiological data should drive further investigation, and help clarify possible connections. Yet that does not mean it is impossible to make connections with just a phylogenetic tree.

To understand the relationships for a node in a phylogenetic tree, one normally would desire a binary tree, where each node has exactly two descendants, and there is the opportunity to trace back up the tree to an original state related to a leaf node of interest. In many cases, however, there are polytomies in a tree, where one node will have three or more child subtrees, which is also described as a "multifurcation" leading to situations where it is harder to extract relationships expected in a normal binary tree.
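To make the distinction concrete, here is a toy sketch that walks a small tree and reports its polytomies. The nested-dictionary layout and the node names are purely illustrative and are not the format used by Cluster Tracker or Taxonium.

```python
# A toy tree as nested dicts; the structure and names are illustrative only.
tree = {"name": "root", "children": [
    {"name": "n1", "children": [
        {"name": "CA_1", "children": []},
        {"name": "NY_1", "children": []},
    ]},
    {"name": "n2", "children": [
        {"name": "VA_1", "children": []},
        {"name": "VA_2", "children": []},
        {"name": "MA_1", "children": []},
    ]},
]}

def find_polytomies(node):
    """Return the names of nodes that have three or more child subtrees."""
    found = []
    stack = [node]
    while stack:
        current = stack.pop()
        if len(current["children"]) >= 3:
            found.append(current["name"])
        stack.extend(current["children"])
    return found

print(find_polytomies(tree))
```

Here only the node named n2 is a polytomy; root and n1 each have exactly two children, as a binary tree would require.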

Even with a binary phylogenetic tree, it may still be unclear whether a variant originated in one region and traveled to the other, or whether the reverse is equally likely. Similarly, the same variant could by chance have emerged in two different locations, in which case the tree could be showing convergent evolution, where two lineages evolved independently and end up looking related.

With these caveats in mind, however, the data in the phylogenetic tree can still be the first stop for making sense of sequences that are highly similar and appear to be related. Cluster Tracker's mutation analysis utilities, or matUtils, can extract information to label the best potential origins for introductions. matUtils does this by processing data about the number of samples on a branch and the number of mutations between samples, calculating a regional index score to help determine whether descendants were inside or outside a focal point.

By looking at the top phylogenetic tree without regional index scores, one can picture that some samples farther to the right are most likely less related to the top-circled CA sample from California. For instance, the bottom branch with the MA sample from Massachusetts, would intuitively not seem likely to be an origin for the virus in California. Likewise, one probably would guess the two adjacent NY samples from New York could be the best potential origin for the CA virus.

The calculation by matUtils, on the other hand, uses the variables Li and Lo for the number of leaves (samples) that are inside or outside when looking at a focal node, and Di and Do for the total branch lengths, in mutations, to those leaf descendants. It actually gives the three samples in Virginia more weight (0.23 vs 0.15) than the two samples from New York. While all hypothetical, these heuristic calculations help account for the fact that more mutations imply that more time has passed, and that the more samples exist on a branch, the more evidence there is about whether that branch is an introduction source for another region.

In these ways, the potential origin calculations can help direct hypothesis generation and prioritization. 

While still a hunch, since more samples are from Virginia and the mutation distance is not too far compared with, say, the Massachusetts sample, it may make more sense to start investigating additional epidemiological data from Virginia when looking from the California sample, alongside the potential New York connections. Keep in mind that it is equally possible the variant started in one of the other branches, so a reverse interpretation is possible, and that regions outside of the US are not used in regional index calculations. This real-world data example helps demonstrate how one can interpret potential origin calculations to make sense of the many possible places to start an investigation.