About
UC Santa Cruz is drawing on its expertise in genomics and data sharing to lead a new state-sponsored public health program, made official on August 18, 2021. The program, sponsored by the California Department of Public Health (CDPH), will allow public health experts to use virus genomes to track who is catching the virus and how. While ensuring its tools are both effective and secure, the team will deliver a final product designed to help shepherd California out of the current pandemic.
The UCSC Pathogen Genomics team is developing tools designed to benefit California residents by overlaying SARS-CoV-2 (SCV2) genomic data with public health information. No other pathogen has been sequenced as many times as SCV2, and the explosion of available sequences overwhelmed existing tools for phylogenetic investigation. To address this, Software Architect Angie Hinrichs and researchers in the Corbett-Detig Lab created a database of SCV2 phylogenetic trees built from unrestricted public sequences, which we update daily to incorporate new sequences. Our database uses the mutation-annotated tree (MAT) format, which efficiently encodes the tree by labeling its branches with parsimony-inferred mutations and labeling clade roots with Nextstrain clade and Pango lineage names. As of August 31, 2021, our SCV2 MAT (the UCSC Big Tree) consists of more than 3.3 million sequences and provides a comprehensive view of the virus's evolutionary history using public data. Our teams also developed MatUtils, a suite of utilities for rapidly querying, interpreting, and manipulating MATs. MatUtils produces metadata-compatible visualizations.
The rapidly growing scale of SCV2 sequencing data made it increasingly challenging to comprehensively analyze all available data using existing tools and file formats. In response, Yatish Turakhia, a researcher in the Corbett-Detig Lab, developed a tool that places SCV2 sequences onto existing phylogenetic trees far faster than previous methods. The tool, called Ultrafast Sample placement on Existing tRees (UShER), allows instantaneous tracing of strains and transmission events. It is available as an interactive web tool for comparing sequences and linking to existing public phylogenetic trees.
The central deliverable of this project, the “California Big Tree,” grew from this research and the tools we built.
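As a rough illustration of the ideas above, the toy Python sketch below stores mutations on the branches of a small tree and places a new sample on the node whose accumulated mutations differ least from it. It is not the actual MAT file format or the UShER algorithm. The `Node` class, the `genotypes` and `place` functions, and the mutation labels are all hypothetical, and the sketch assumes mutations simply accumulate from root to tip with no reversions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    # Mutations on the branch leading to this node (hypothetical labels).
    mutations: frozenset = frozenset()
    children: list = field(default_factory=list)


def genotypes(root):
    """Return {node name: mutations accumulated on the path from the root}."""
    result = {}
    stack = [(root, frozenset())]
    while stack:
        node, inherited = stack.pop()
        accumulated = inherited | node.mutations
        result[node.name] = accumulated
        for child in node.children:
            stack.append((child, accumulated))
    return result


def place(root, sample_mutations):
    """Pick the node whose accumulated mutations differ least from the sample.

    The cost is the size of the symmetric difference, a simple stand-in
    for a parsimony score. Ties are broken by node name.
    """
    sample = frozenset(sample_mutations)
    costs = {name: len(geno ^ sample) for name, geno in genotypes(root).items()}
    best = min(sorted(costs), key=costs.get)
    return best, costs[best]


# A four-node toy tree with made-up mutation labels.
root = Node("root", children=[
    Node("A", frozenset({"A10G", "C20T"}), [Node("A1", frozenset({"G30A"}))]),
    Node("B", frozenset({"T40C"})),
])

# The sample shares three mutations with A1 and carries one private mutation.
print(place(root, {"A10G", "C20T", "G30A", "C50T"}))
```

Because each node stores only the mutations on its own branch, a sample's full set of differences from the root never has to be repeated across the tree, which hints at why this kind of encoding stays compact as sequences are added.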
This project will extend existing tools into a scalable statewide system that enables state and county departments of public health to rapidly use genomic data to inform public health decisions.