Best Practices for Benchmarking Germline Small Variant Calls in Human Genomes

bioRxiv pre-print

Assessing the accuracy of NGS variant calling is greatly facilitated by a robust benchmarking strategy and by tools that carry it out in a standard way. Benchmarking variant calls requires careful attention to definitions of performance metrics, sophisticated comparison approaches, and stratification by variant type and genome context. The Global Alliance for Genomics and Health (GA4GH) Benchmarking Team has developed standardized performance metrics and tools for benchmarking germline small variant calls. This Team includes representatives from sequencing technology developers, government agencies, academic bioinformatics researchers, clinical laboratories, and commercial technology and bioinformatics developers, for all of whom benchmarking variant calls is essential to their work. Benchmarking variant calls is challenging for several reasons: (1) evaluating variant calls requires complex matching algorithms and standardized counting, because the same variant may be represented differently in truth and query callsets; (2) defining and interpreting resulting metrics such as precision (aka positive predictive value = TP/(TP+FP)) and recall (aka sensitivity = TP/(TP+FN)) requires standardization to draw robust conclusions about the comparative performance of different variant calling methods; (3) performance of NGS methods can vary with variant type and genome context, so understanding performance requires meaningful stratification; and (4) high-confidence variant calls and regions that can be used as "truth" to accurately identify false positives and false negatives are difficult to define, and reliable calls for the most challenging regions and variants remain out of reach. We have made significant progress on standardizing comparison methods, metric definitions, and reporting, as well as on developing and using truth sets.
Our methods are publicly available on GitHub, as well as in a web-based app on precisionFDA, which allows users to compare their variant calls against truth sets and to obtain a standardized report on their variant calling performance. Our methods have been piloted in the precisionFDA variant calling challenges to identify the best-in-class variant calling methods within high-confidence regions. Finally, we recommend a set of best practices for using our tools and critically evaluating the results.
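
As an illustration of the metric definitions given above, precision = TP/(TP+FP) and recall = TP/(TP+FN), the following minimal Python sketch computes both metrics separately for each stratum. It is not the GA4GH tooling: the `Counts` class, the function names, the stratum labels, and all counts are hypothetical and chosen only for illustration, and the sketch assumes a single TP count per stratum.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Counts:
    """Hypothetical container for per-stratum comparison counts."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives


def precision(c: Counts) -> Optional[float]:
    """Precision (positive predictive value) = TP / (TP + FP)."""
    denom = c.tp + c.fp
    # Undefined when there are no positive calls; return None rather than guess.
    return c.tp / denom if denom else None


def recall(c: Counts) -> Optional[float]:
    """Recall (sensitivity) = TP / (TP + FN)."""
    denom = c.tp + c.fn
    # Undefined when there are no truth variants in the stratum.
    return c.tp / denom if denom else None


# Illustrative counts only (not real benchmarking results),
# stratified by an assumed variant-type label.
strata = {
    "SNV": Counts(tp=990, fp=10, fn=10),
    "INDEL": Counts(tp=90, fp=10, fn=30),
}

for name, counts in strata.items():
    print(name, precision(counts), recall(counts))
```

Reporting the metrics per stratum, rather than as a single pooled number, keeps a weak result on one variant type from being masked by a much larger number of easier calls of another type.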