Millions of viral genomes have been sequenced during the SARS-CoV-2 pandemic, and these data have been central to pandemic responses. Computational methods have struggled to keep up with this deluge, and most approaches are unusable at this scale because they require inordinate amounts of memory and do not support parallelisation. The tskit library (https://tskit.dev) was developed to support millions of human genome sequences, but it also works extremely well on viral data, providing very high levels of data compression and fast processing. A key benefit of tskit is that it provides direct access to the PyData ecosystem, allowing cutting-edge data science tools such as Dask, xarray and Numba to be deployed. In this project you will implement key phylogenetics algorithms, such as parsimony calculations, using Numba to target both CPUs and GPUs, and parallelise these algorithms using Dask.
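To give a flavour of the kind of algorithm involved, here is a minimal sketch of a Fitch-style parsimony count for a single site, written in plain Python. It is illustrative only and is not the tskit API: the function name `fitch_parsimony` and the flat `parent`/`states` array layout are assumptions made for this example, and it assumes a strictly binary tree with no missing data.

```python
def fitch_parsimony(parent, states):
    """Minimum number of state changes at one site on a rooted binary tree.

    Hypothetical layout, chosen for this sketch:
    parent[u] is the parent of node u, or -1 for the root, and every
    child has a smaller index than its parent.
    states[u] is the observed state index of leaf u, or -1 for
    internal nodes.
    """
    n = len(parent)
    # Candidate state sets are stored as bitmasks, one bit per state.
    masks = [0] * n
    for u in range(n):
        if states[u] >= 0:
            masks[u] = 1 << states[u]

    cost = 0
    # Visiting nodes in index order guarantees that a node's own set is
    # final before it is merged into its parent.
    for u in range(n):
        p = parent[u]
        if p < 0:
            continue
        if masks[p] == 0:
            # First child seen: the parent starts with the child's set.
            masks[p] = masks[u]
        else:
            common = masks[p] & masks[u]
            if common:
                masks[p] = common
            else:
                # Disjoint sets: take the union and count one change.
                masks[p] |= masks[u]
                cost += 1
    return cost


# Tree ((0,1),(2,3)): leaves 0-3, internal nodes 4 and 5, root 6.
parent = [4, 4, 5, 5, 6, 6, -1]
print(fitch_parsimony(parent, [0, 0, 1, 1, -1, -1, -1]))
```

Because the loop works only on flat integer arrays, the same structure is a plausible starting point for a compiled array-based version, with many sites processed independently in parallel.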

Length

6 to 12 weeks, depending on the availability of the candidate, starting mid-July 2022

Selection Criteria

The project would be suitable for an advanced undergraduate or master's student with strong Python programming skills.

Experience in bioinformatics or genomics is desirable, but not necessary.
