
“Our new algorithms will allow not only more accurate phylogeny and alignment inference but it will also facilitate the estimation of statistical supports of inferred tree partitions and the ancestral reconstruction of insertions-deletions and substitution history. Our tool has been developed in a user-friendly software package and is applicable to large genomic and metagenomic datasets.”
Thesis: Progressive Multiple Sequence Alignment with Indel Evolution
Date of Defense: 12. 2. 2019
Supervisors:
Prof. Dr. Christian von Mering (UZH)
Prof. Dr. Maria Anisimova (ZHAW)
Keywords: Sequence Alignment, Indel, Phylogeny, Dynamic Programming, Poisson Process

Abstract: Here we present, for the first time, a frequentist progressive Multiple Sequence Alignment (MSA) method based on a rigorous and explicit mathematical formulation of insertions and deletions, namely the Poisson Indel Process (PIP). Having designed our algorithm in the Maximum Likelihood (ML) framework has enabled us to avoid the time-consuming Markov Chain Monte Carlo sampling of alignments. Our proposed algorithm aligns two homology paths, represented by their corresponding MSAs in polynomial time by ML under the PIP model (Maiolo et al. 2017). The procedure has been integrated into a progressive procedure that, traversing a given phylogenetic tree, produces at each internal node the optimal pairwise MSA ending at the root with the alignment of all the input sequences. The integration of PIP equations into a Dynamic Programming (DP) approach is not straightforward since the marginal likelihood is non-monotonic in the alignment length. Therefore, to account for the dependence on alignment length we have extended the DP matrices with a third dimension.
In order to reduce the computational complexity, the algorithm predicts candidate homologous segments for the purpose of filtering out non-promising regions in the DP matrix prior to the effective alignment process. Although our method has been strongly inspired by MAFFT, we have introduced a number of improvements like for example the use of a multi-scale short-time Fourier transform (STFT) for the automatic detection of candidate homologous patterns. Moreover, the use of the multiple-resolution STFT rather than the Fourier transform improves the detection of homologous regions especially in the presence of noise and in case of relative short patterns. We have also defined a more sophisticated and general approach to generate logically sound paths to connect homologous blocks and resolve overlaps between them.
To mitigate the intrinsic greediness brought by the progressive DP approach we have also implemented a Stochastic backtracking (Mueckstein et al. 2002) version of the algorithm under the PIP model. In this way, our progressive algorithm generates at each visited node a distribution of candidate sub-optimal alignments. Aligning sub-optimal solutions increases the chances to escape from local maxima and in our opinion provides a valid strategy to reduce the progressive bias.
Finally, to account for the among-sites substitution rate variation (ASRV) we have applied a Gamma distribution to all the rates, insertion and deletion rates included. However, further
analysis are still needed to investigate the impact of ASRV on the inferred alignments. Our hope is that this feature could mimic to some extent a long insertion, that is, an insertion of more than a single character at a time.
The use of a sound mathematical model of indel, namely the Poisson Indel Process model, is providing more realistic and accurate estimates of MSAs, phylogenies and model parameters. As a consequence, our new algorithms will allow not only more accurate phylogeny and alignment inference but it will also facilitate the estimation of statistical supports of inferred tree partitions and the ancestral reconstruction of insertions-deletions and substitution history. Our tool has been developed in a user-friendly software package and is applicable to large genomic and metagenomic datasets.
Publications:
Maiolo, M., Zhang, X., Gil, M. et al. Progressive multiple sequence alignment with indel evolution. BMC Bioinformatics19, 331 (2018)
Links:
ZHAW Life Sciences und Facility Management
IAS Institut für Angewandte Simulation
Computational Genomics
Maria Anismova’s group
University of Zurich
Institute of Molecular Life Sciences
Christian von Mering’s group
University of Lausanne
Department of Computational Biology
Christophe Dessimoz’s group