Tools for short variant calling and the way to deal with big datasets

  • Adrien Le Meur
  • Rima Zein-Eddine
  • Ombeline Lamer
  • Fiona Hak
  • Gaëtan Senelle
  • Jean Philippe Vernadet
  • Samuel O’Donnell
  • Ricardo Rodriguez de la Vega
  • Guislaine Refrégier

Research output: Chapter in Book/Report/Conference proceeding > Chapter > peer-review

Abstract

Short variant calling in whole genome sequencing data focuses on single nucleotide substitutions (most commonly referred to as single nucleotide variants) and short indels, classically up to 50 bp in length. The practice has become widespread in genomics, driven by the falling cost of short-read sequencing and the resulting large volume of publicly available short-read archives (SRAs). It has already enabled important advances in the characterization of biodiversity, for example by accelerating phylogenomic and association studies. In theory, several tools must be combined to call variants reliably. However, integrated (all-in-one) pipelines are increasingly offered to end users so that people not trained in bioinformatics can use them. It is becoming tempting for any biologist to launch large studies based on the ever-growing body of public sequencing data. All-in-one pipelines act either as a black box or as a palette of tools from which the user must choose. To limit major errors of inference, it is important that the user has a good understanding of the underlying tools. This chapter aims to guide the naive user and to compile useful information for any user, naive or expert. We clarify which tools are essential for calling variants in short-read sequencing data and which tools can measure and improve the reliability of the calls, with an emphasis on decontamination. We then present the properties of some all-in-one pipelines. Throughout, we focus on the performance of the tools and on best practices for the study of large datasets.

Original language: English
Title of host publication: Phylogenomics
Subtitle of host publication: Foundations, Methods, and Pathogen Analysis
Publisher: Elsevier
Pages: 219-250
Number of pages: 32
ISBN (Electronic): 9780323998864
ISBN (Print): 9780323913096
Publication status: Published - 1 Jan 2024

Keywords

  • Aligner
  • NGS
  • WGS
  • all-in-one pipelines
  • benchmarking
