Next-generation computational tools for SARS-CoV-2 using tskit and the PyData ecosystem
Millions of viral genomes have been sequenced during the SARS-CoV-2 pandemic, and these data have been central to pandemic responses. Computational methods have struggled to keep up with this deluge: the majority of approaches are unusable at this scale because they require inordinate amounts of memory and do not support parallelisation. The tskit library (https://tskit.dev) was developed to support millions of human genome sequences, but it also works extremely well on viral data, providing very high levels of data compression and fast processing. A key benefit of tskit is that it provides direct access to the PyData ecosystem, allowing cutting-edge data science tools such as Dask, xarray and numba to be deployed. In this project you will implement key phylogenetic algorithms, such as parsimony calculations, using numba to target both CPUs and GPUs, and parallelise these algorithms using Dask.
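To give a flavour of the parsimony calculations mentioned above, here is a minimal sketch of Fitch's algorithm for a single site on a strictly binary tree. It is plain Python with no dependencies and does not use tskit, numba or Dask. The function name fitch_score, the parent-array encoding and the node-numbering convention (every child has a smaller id than its parent) are assumptions made for illustration, not part of the project specification.

```python
def fitch_score(parent, leaf_states):
    """Minimum number of state changes for one site (Fitch's algorithm).

    Assumptions (illustrative only):
      - parent[u] is the parent of node u, or -1 for the root
      - the tree is strictly binary
      - every child has a smaller node id than its parent
      - leaf_states maps leaf id -> integer state (e.g. A=0, C=1, G=2, T=3)
    """
    n = len(parent)
    state_set = [0] * n          # bitmask of candidate states per node
    seen_child = [False] * n     # has this node received its first child yet?
    for leaf, state in leaf_states.items():
        state_set[leaf] = 1 << state

    cost = 0
    # Visiting nodes in increasing id order guarantees that a node's own
    # state set is final before it is combined into its parent.
    for u in range(n):
        p = parent[u]
        if p == -1:
            continue
        if not seen_child[p]:
            state_set[p] = state_set[u]
            seen_child[p] = True
        else:
            common = state_set[p] & state_set[u]
            if common:
                state_set[p] = common
            else:
                # No shared state: take the union and count one change.
                state_set[p] |= state_set[u]
                cost += 1
    return cost


# Hypothetical four-leaf tree: node 4 = (0, 1), node 5 = (2, 3), root 6 = (4, 5).
parent = [4, 4, 5, 5, 6, 6, -1]
score_clustered = fitch_score(parent, {0: 0, 1: 0, 2: 1, 3: 1})
score_alternating = fitch_score(parent, {0: 0, 1: 1, 2: 0, 3: 1})
```

A loop of this shape over flat integer arrays is the sort of code that might be a candidate for compilation with numba, and because sites are scored independently the work could in principle be split across sites with Dask. How best to do either at the scale of millions of genomes is the subject of the project.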
Length
6–12 weeks, depending on the candidate's availability, starting in mid-July 2022
Selection Criteria
The project would be suitable for an advanced undergraduate or master's student with strong Python programming skills.
Experience of bioinformatics or genomics is desirable, but not necessary.