A team of scientists and engineers who analyse and curate the world’s largest genetic datasets have announced a new data format designed to unlock the potential of the millions of genomes now sequenced in healthcare systems around the world.

Genetic data is widely used in scientific research and is rapidly becoming a standard part of healthcare. Genome sequencing generates vast amounts of data, which presents many challenges. One major problem is that the standard format for representing the genomes of many individuals for statistical analysis, Variant Call Format (VCF), was designed as part of the 1000 Genomes Project in 2010 and is not suited to today’s population-scale datasets. VCF defines the genetic and quality-control data for all individuals in a dataset at one position on the genome as a single record, usually encoded as a line of text. With datasets now approaching millions of individuals and billions of records, this representation is increasingly unwieldy.
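To make the problem concrete, here is a minimal sketch of VCF’s row-per-variant layout (the record below is illustrative, not real data): every line carries one genomic position plus a genotype column for each individual, so extracting even a single sample’s data means scanning every full line.

```python
# Illustrative VCF-style record (made up for this sketch): 8 fixed columns,
# a FORMAT column, then one genotype column per individual.
record = "chr1\t10177\t.\tA\tAC\t100\tPASS\t.\tGT\t0|1\t0|0\t1|1"

fields = record.split("\t")
chrom, pos, _, ref, alt = fields[:5]
genotypes = fields[9:]  # one entry per individual; millions at biobank scale

print(chrom, pos, ref, alt, genotypes)
# chr1 10177 A AC ['0|1', '0|0', '1|1']
```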

In a paper published in the journal GigaScience, the team describe how today’s petabyte-scale genetic datasets can be stored using the popular Zarr data format. The paper shows that translating VCF data to Zarr speeds up statistical analysis and opens up many exciting new possibilities. Because Zarr is an open standard that is widely used to store huge scientific datasets, biologists can now take full advantage of modern infrastructure, such as cloud computing and AI frameworks like PyTorch and TensorFlow, to analyse genetic data.
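As a rough illustration of the idea (using the generic zarr-python API; the file name, shapes, and chunk sizes here are invented for the sketch and are not the paper’s actual VCF Zarr schema or tooling), a genotype matrix stored as a chunked, compressed Zarr array can be written once and then read in parallel, from local disk or a cloud object store:

```python
import numpy as np
import zarr

# Invented dimensions for the sketch: variants x samples x ploidy.
n_variants, n_samples, ploidy = 10_000, 1_000, 2
genotypes = np.random.randint(0, 2, size=(n_variants, n_samples, ploidy),
                              dtype=np.int8)

# Store the matrix as a compressed, chunked array; each chunk is an
# independently readable object, suitable for parallel and cloud access.
z = zarr.open("genotypes.zarr", mode="w",
              shape=genotypes.shape, chunks=(1_000, 100, ploidy),
              dtype=np.int8)
z[:] = genotypes

# A whole-matrix summary: per-variant alternate-allele counts.
alt_counts = z[:].sum(axis=(1, 2))
print(alt_counts[:5])
```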

The globally distributed team includes representatives of three of the world’s largest genetic datasets: Genomics England, Our Future Health and All of Us; researchers from universities in Canada, New Zealand, the UK, USA and Sweden; along with big data pioneers Tom White, author of Hadoop: The Definitive Guide, and Jeff Hammerbacher, co-founder of Cloudera.

‘VCF is a really successful standard’, said Jerome Kelleher, Associate Professor at the University of Oxford’s Big Data Institute, who co-led the work with Jeff Hammerbacher. ‘It’s universally supported and there’s many, many petabytes of it out there. The way the format is currently used, however, is a bit like storing a database on tape, and accessing subsets of the data is much harder than it should be. We’ve shown a way forward here that opens the door to a whole new generation of exciting applications, while maintaining full compatibility with the huge catalogue of existing software needed for genomic analysis.’
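The point about subset access can be sketched against the illustrative store above: with chunked arrays, pulling out one sample or one genomic window is an index operation that reads only the relevant chunks, rather than a scan of the whole file.

```python
import zarr

z = zarr.open("genotypes.zarr", mode="r")  # illustrative store from above

# One sample's genotypes across all variants: only the chunks covering
# that column are read and decompressed.
one_sample = z[:, 42, :]

# A window of variants for every sample, again touching only the
# overlapping chunks rather than scanning every record.
window = z[2_000:2_100]
print(one_sample.shape, window.shape)
```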

Jeff Hammerbacher said, ‘This could be a game-changer. Our experiments with cloud object stores and graphics processing units (GPUs), for example, show that there’s an immense untapped potential for using modern computing infrastructure in genomics. If Zarr were adopted as the new standard in medical genomics, it would enable a much wider pool of researchers to analyse the data, both in terms of the financial means and the bioinformatics experience needed. Ultimately, this would accelerate the pace of discovery, and that’s good for everyone.’
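As a hypothetical illustration of that kind of pipeline (reusing the invented store from the sketches above, not the team’s actual experiments), Zarr chunks can be streamed batch by batch into an AI framework such as PyTorch for GPU processing:

```python
import torch
import zarr

z = zarr.open("genotypes.zarr", mode="r")  # illustrative store from above
device = "cuda" if torch.cuda.is_available() else "cpu"

# Stream variant batches to the GPU; each slice reads only its own chunks.
for start in range(0, z.shape[0], 1_000):
    batch = torch.from_numpy(z[start:start + 1_000]).to(device)
    # ... run a model or statistical kernel on the batch here ...
    print(batch.shape, batch.device)
    break  # one batch is enough for the sketch
```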