ALGORITHMS AND STRUCTURES IN GRAPH PANGENOMES: WHAT IS NEXT?
The adoption of graph-based reference genomes, already under way within the Genome Reference Consortium Human Genome, has spurred the development of methods for handling such graphs. Yet genome data production and sequencing technologies are advancing far faster than methods for analyzing graphs that represent multiple genomes. In this talk I will discuss an agenda of actions, drawing on graph-based notions in computer science, that will bring the paradigm shift from sequence-based to graph-based representations of genomes into full effect.
Paola Bonizzoni is a full professor of Computer Science at the University of Milano-Bicocca in Milan, Italy. Her early research focused on combinatorial methods for graphs and sequence analysis, including formal languages. She is the head of a group in Bioinformatics and Experimental Algorithmics: research in the group is mainly focused on combinatorial methods for solving computational problems on next-generation sequencing data, including alternative splicing prediction, haplotype assembly, genotype calling, and phylogenetic reconstruction and comparison in cancer genomics. She is the President of the Association Computability in Europe. She is currently the coordinator of an H2020 MSCA-RISE project on Algorithms and Data Structures for Pangenome Graphs.
BACTERIAL PAN-GENOMICS WITH REFERENCE GRAPHS
Bacteria are amongst the most abundant and successful forms of life on earth. Their genomes bear the imprint of billions of years of evolution, and we study them to learn about function, infection and a whole range of other questions. Sequencing, and in particular long-read sequencing, has gone a long way towards making bacterial genome assembly a solved problem. However, bacterial genomes are remarkably diverse, even within a single species, and so the problem of genome comparison is still far from solved (even if we have perfect genomes). In this talk I will describe how bacterial genomes vary and show how dramatically this contrasts with the genomes many of us were trained on (e.g. human). It should be clear by this stage of the talk that there is potentially an important and appreciable fraction of genetic variation which will never be accessible with single-reference genome approaches.
This motivates the problem we want to solve: finding the best way to compare bacterial genomes at the level of SNPs and indels, as well as at a larger scale.
I will describe our attempt to solve this. We represent the pan-genome of a species as a network of “floating” graphs, representing the ensemble of known variation in orthology blocks (we use genes and intergenic regions, but this could also be done for mobile elements). In doing so it becomes possible to discover and describe genetic variation at both fine (SNP/indel) and coarse (gene order) levels. I will show how this allows us to improve both the analysis of individual genomes and the comparison of a cohort/dataset. Our graph genome algorithms are implemented in a tool called pandora (https://github.com/rmcolq/pandora).
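The representation described above can be illustrated with a small toy model. This is only a sketch under stated assumptions and is not how pandora is implemented: the names LocalGraph, Sample, fine_differences and coarse_differences are hypothetical, and each local graph is simplified to a chain of variant sites rather than a general graph. It shows how keeping one graph per orthology block, with locus order stored per sample, lets fine-level (SNP/indel) and coarse-level (gene presence and order) differences be reported separately.

```python
from dataclasses import dataclass, field


@dataclass
class LocalGraph:
    """Known variation in one orthology block (hypothetical, simplified)."""
    name: str
    # Each site lists its alternative alleles; a path picks one allele per site.
    sites: list = field(default_factory=list)

    def spell(self, path):
        """Return the sequence spelled by choosing allele path[i] at site i."""
        if len(path) != len(self.sites):
            raise ValueError("path must choose exactly one allele per site")
        return "".join(site[choice] for site, choice in zip(self.sites, path))


@dataclass
class Sample:
    """One genome: the order of its loci and the path taken through each."""
    order: list
    paths: dict


def fine_differences(a, b):
    """(locus, site index) pairs where shared loci carry different alleles."""
    shared = set(a.paths) & set(b.paths)
    return sorted(
        (locus, i)
        for locus in shared
        for i, (x, y) in enumerate(zip(a.paths[locus], b.paths[locus]))
        if x != y
    )


def adjacencies(sample):
    """Unordered pairs of loci that are neighbours in the sample."""
    return {frozenset(pair) for pair in zip(sample.order, sample.order[1:])}


def coarse_differences(a, b):
    """Loci found in only one sample, and adjacencies found in only one."""
    return set(a.order) ^ set(b.order), adjacencies(a) ^ adjacencies(b)


# Toy pan-genome: geneA has a SNP site, geneB an indel site (made-up data).
gene_a = LocalGraph("geneA", [["ACG"], ["T", "C"], ["GGA"]])
gene_b = LocalGraph("geneB", [["TT"], ["A", "AG"], ["C"]])

s1 = Sample(["geneA", "geneB", "geneC"],
            {"geneA": [0, 0, 0], "geneB": [0, 0, 0], "geneC": [0]})
s2 = Sample(["geneA", "geneC"],
            {"geneA": [0, 1, 0], "geneC": [0]})

snp_sites = fine_differences(s1, s2)
missing_loci, changed_adjacencies = coarse_differences(s1, s2)
```

Because the local graphs are not tied to a single linear coordinate system, a sample that lacks geneB is still compared allele by allele at geneA, while the missing gene shows up only in the coarse-level result.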
We evaluate pandora on Illumina and Nanopore data from a small but highly curated global set of E. coli genomes which have polished long-read assemblies as “truth”, and are able to show the benefits and limitations of pandora.
In doing so, we are able to compare this approach with single-reference approaches, in terms of both recall and reference bias. A recurring theme in the talk will be how to make these very technical approaches accessible and useful to bioinformaticians and biologists who want to integrate this information with their prior knowledge and downstream tools.
Zamin Iqbal leads a team at EBI working on fundamental method development for microbial sequence analysis.
He works in three major areas. First, data structures for indexing and searching DNA archives. Second, variant calling software tailored for bacterial pan-genomes. Third, translation of the above (and other) methods for analysis of M. tuberculosis in public health settings. Zamin obtained his PhD in Mathematics from the University of Oxford, worked in the software industry for several years and returned to academia to work on the 1000 Genomes Project. He did his postdoctoral work with Professor Gil McVean at the University of Oxford before being awarded a Wellcome Trust/Royal Society Sir Henry Dale award and starting his own group.