Cancer is a disease of the genome. Even with the latest breakthroughs in DNA sequencing technologies, it is impossible to read the complete genomic sequence from the beginning to the end. Instead, these technologies sample millions of short fragments (called "reads") from random locations in the genome. Reconstructing the genome sequence from these short fragments is not unlike assembling a giant jigsaw puzzle. Our goal is to develop methods to analyze large-scale sequencing data, which will ultimately enhance our understanding of carcinogenesis and improve genomics-based cancer diagnostics.
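To make the jigsaw analogy concrete, here is a minimal Python sketch of the two ideas in the paragraph above: sampling short reads from random positions of a genome, and detecting where two reads overlap so that they could be joined. The function names (`sample_reads`, `overlap`), the toy sequences, and the `min_len` threshold are illustrative assumptions, not part of any tool described on this page. Real assemblers must also handle sequencing errors, repeats, and both DNA strands.

```python
import random


def sample_reads(genome, read_len, n_reads, seed=0):
    """Sample fixed-length reads from uniformly random positions (toy model, no errors)."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        reads.append(genome[start:start + read_len])
    return reads


def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no exact overlap of at least `min_len` characters exists.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


if __name__ == "__main__":
    genome = "ACGTACGGATTACAGGCTTAACG"  # hypothetical toy genome
    reads = sample_reads(genome, read_len=8, n_reads=5)
    print(reads)
    print(overlap("ACGTAC", "TACGGA"))  # suffix "TAC" matches prefix "TAC"
```

Exact suffix-prefix matching is chosen here only for brevity. With error-prone reads, an assembler would need approximate overlaps instead.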
One of our research projects focuses on reconstructing cancer genome architecture using long sequencing reads. Cancer is driven by somatic changes in the genome, which can range from small nucleotide substitutions to chromosome-scale rearrangements. Until recently, it was difficult to study chromosomal architecture using traditional short-read sequencing because of mapping ambiguity and the limitations of a single reference genome. In contrast, long reads provide a much more comprehensive view of structural genomic changes. We will develop reference-free, graph-based algorithms to shed light on ubiquitous but elusive processes of carcinogenesis, such as chromothripsis, chromoplexy, or extrachromosomal DNA amplification.
Another interest of our laboratory is the genomic analysis of highly heterogeneous cell communities. Solid tumors often consist of multiple clonal cell lines that are evolving under selective pressure. A seemingly unrelated example of a highly heterogeneous community is an environmental metagenome, such as bacteria in the human gut. We are developing methods for characterizing these complex communities using high-coverage bulk sequencing data.
Current and Past Research Highlights
Strain-level metagenome deconvolution. Microbial communities in many environments include distinct lineages of closely related organisms, which have proved challenging to separate in metagenomic assembly. It is difficult to distinguish read errors from real polymorphisms between bacterial strains, but high-fidelity (HiFi) long reads have the potential to solve this issue. We recovered 428 complete or nearly complete bacterial genomes from a single sheep gut metagenomic sample, the highest resolution achieved with metagenomic deconvolution to date. HiFi assembly resolved many closely related microbial lineages into distinct contigs, proving to be a powerful tool for characterizing complex heterogeneous environments.
Metagenome assembly with metaFlye. Shotgun metagenomic assembly is a powerful method for characterizing complex microbial communities (such as the human gut or tumor microenvironments). Until recently, metagenome assemblies based on short reads (such as Illumina) were highly fragmented and incomplete (e.g., missing 16S genes). To enable long-read-based analysis, we developed metaFlye, the first dedicated method for long-read metagenomic assembly. Using metaFlye, we reconstructed many complete bacterial genomes from various metagenomic communities. We also showed that long-read assembly of human microbiomes enables the discovery of full-length biosynthetic gene clusters that encode biomedically important natural products (such as colibactin).
Long-read assembly using Flye. New long-read sequencing technologies (such as Pacific Biosciences or Oxford Nanopore) have increased read lengths to tens of thousands of nucleotides and substantially improved the quality of many genome assemblies. These technologies, however, face the challenge of high error rates. To address this challenge, we created the Flye algorithm for assembling long, error-prone reads. Flye uses a novel repeat graph framework, which enables fast and accurate assemblies of various organisms. In particular, Flye is well suited to assembling human genomes from ultra-long Oxford Nanopore sequencing data (such as the NA12878 or CHM13 samples).
Comparative assembly using multiple references. Since many de novo assemblies of large genomes are still incomplete, one can use information from related reference genomes to order and orient the contigs. We have developed Ragout, which infers structural rearrangements between multiple input references and reconstructs the most probable architecture of a target genome. We used Ragout to produce chromosome-level assemblies of multiple mouse genomes, which gave insights into rodent genome evolution and revealed novel functional loci. The mouse assemblies were generated as part of the Mouse Genomes sequencing project, hosted by the Wellcome Sanger Institute.
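As a rough illustration of what "ordering and orienting" contigs against a reference means, the Python sketch below places contigs along a single reference using precomputed alignment coordinates. This is not Ragout's algorithm, which works with multiple references and infers rearrangements. It is a deliberately simplified single-reference case, and the `Alignment` record, the `scaffold` function, and the gap placeholder are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical record: where a contig maps on a reference chromosome."""
    contig: str
    ref_chrom: str
    ref_start: int
    strand: str  # "+" or "-"


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def scaffold(contigs, alignments, gap="N" * 10):
    """Order contigs by reference position and flip those mapped to the minus strand.

    Assumes one alignment per contig and no rearrangements relative to the reference.
    """
    by_chrom = {}
    for aln in sorted(alignments, key=lambda a: (a.ref_chrom, a.ref_start)):
        seq = contigs[aln.contig]
        if aln.strand == "-":
            seq = reverse_complement(seq)
        by_chrom.setdefault(aln.ref_chrom, []).append(seq)
    # Join neighbouring contigs with a placeholder gap of unknown bases.
    return {chrom: gap.join(seqs) for chrom, seqs in by_chrom.items()}


if __name__ == "__main__":
    contigs = {"c1": "AAAC", "c2": "GGT"}  # toy contigs
    alignments = [
        Alignment("c1", "chr1", 100, "-"),
        Alignment("c2", "chr1", 0, "+"),
    ]
    print(scaffold(contigs, alignments, gap="NN"))
```

The sketch trusts the reference order completely. That assumption breaks down when the target genome carries its own rearrangements, which is the case the multi-reference approach described above is meant to handle.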
Tools for assembly graph analysis. The analysis of genome graphs is helpful for studying the repeat structure of genomes (for example, mosaic segmental duplications in humans). To visualize large and complex assembly graphs, we developed AGB, an interactive graph visualization tool. We have also introduced a new Synteny Paths approach for comparing two related genomes in graph form, analogous to synteny blocks for linear genomes. The tools were developed in collaboration with the Center for Algorithmic Biotechnology and the Bioinformatics Institute in St. Petersburg, Russia.
Before joining the Cancer Data Science Laboratory in January 2022, Mikhail was a postdoctoral fellow at the University of California (UC) Santa Cruz, supervised by Dr. Benedict Paten. Prior to that, he was a postdoctoral fellow at UC San Diego, co-supervised by Dr. Rob Knight and Dr. Pavel Pevzner. Mikhail completed his Ph.D. in Computer Science at UC San Diego in September 2019, under the mentorship of Dr. Pavel Pevzner. He received his M.Sc. in bioinformatics from St. Petersburg University of the Russian Academy of Sciences.
Related Scientific Focus Areas
Genetics and Genomics
This page was last updated on Tuesday, September 20, 2022