Analysis-ready VCF at Biobank scale using Zarr.


Authors: Eric Czech, Benjamin Elsworth, Jérémy Guez, Jeff Hammerbacher, Jonny Hancox, Ben Jeffery, Konrad J Karczewski, Jerome Kelleher, Alistair Miles, Timothy R Millar, Sam Tallman, Will Tyler, Per Unneberg, Tom White, Rafal Wojdyla, Shadi Zabad

Language: eng

Classification number: 920.71 Men

Publication information: United States : bioRxiv : the preprint server for biology, 2025

Physical description:

Collection: NCBI

ID: 89381


BACKGROUND: Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasises efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. Biobank-scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF. Row-wise data storage is fundamentally unsuitable and a more scalable approach is needed.

RESULTS: Zarr is a format for storing multi-dimensional data that is widely used across the sciences, and is ideally suited to massively parallel processing. We present the VCF Zarr specification, an encoding of the VCF data model using Zarr, along with fundamental software infrastructure for efficient and reliable conversion at scale. We show how this format is far more efficient than standard VCF-based approaches, and competitive with specialised methods for storing genotype data in terms of compression ratios and single-threaded calculation performance. We present case studies on subsets of three large human datasets (Genomics England:

CONCLUSIONS: Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs a substantial cost. The VCF Zarr specification, building on widely used, open-source technologies, has the potential to greatly reduce these costs, and may enable a diverse ecosystem of next-generation tools for analysing genetic variation data directly from cloud-based object stores, while maintaining compatibility with existing file-oriented workflows.

KEY POINTS: VCF is widely supported, and the underlying data model is entrenched in bioinformatics pipelines. The standard row-wise encoding as text (or binary) is inherently inefficient for large-scale data processing. The Zarr format provides an efficient solution by encoding fields in the VCF separately in chunk-compressed binary format.
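
The idea of encoding each field separately in chunk-compressed binary form can be illustrated with a small sketch. The following is not the Zarr API or the VCF Zarr specification; it is a standard-library toy, with hypothetical field names (`pos`, `qual`, `ac`), a hypothetical chunk length, and synthetic records, meant only to show why a per-field calculation needs to decompress one field's chunks and nothing else.

```python
import struct
import zlib

# Synthetic, VCF-like records: (position, quality, alt allele count).
# All values here are made up for illustration.
records = [(100 + i, 30.0 + (i % 5), i % 3) for i in range(1000)]

CHUNK = 250  # hypothetical chunk length (variants per chunk)


def encode_field(values, fmt):
    """Pack one field into independently compressed fixed-size chunks."""
    chunks = []
    for start in range(0, len(values), CHUNK):
        block = values[start:start + CHUNK]
        raw = struct.pack(f"<{len(block)}{fmt}", *block)
        chunks.append(zlib.compress(raw))
    return chunks


def decode_chunk(chunk, fmt):
    """Decompress a single chunk back into a list of values."""
    raw = zlib.decompress(chunk)
    n = len(raw) // struct.calcsize(f"<{fmt}")
    return list(struct.unpack(f"<{n}{fmt}", raw))


# Column-wise store: every field lives in its own list of chunks.
store = {
    "pos": encode_field([r[0] for r in records], "i"),
    "qual": encode_field([r[1] for r in records], "f"),
    "ac": encode_field([r[2] for r in records], "b"),
}


def mean_quality(store):
    """Average the quality field, touching only the 'qual' chunks."""
    total, count = 0.0, 0
    for chunk in store["qual"]:
        vals = decode_chunk(chunk, "f")
        total += sum(vals)
        count += len(vals)
    return total / count


def get_value(store, field, fmt, index):
    """Random access: decompress only the chunk that holds `index`."""
    chunk_id, offset = divmod(index, CHUNK)
    return decode_chunk(store[field][chunk_id], fmt)[offset]


print(mean_quality(store))
print(get_value(store, "pos", "i", 600))
```

In a row-wise layout the same mean would require parsing every record in full. Here each chunk is also independent, so a real implementation could hand different chunks to different workers, which is the property the abstract refers to when it mentions parallel processing.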
