Haplotype phasing

We develop computational tools to solve statistical and algorithmic challenges in quantitative genetics.

We are based in the Division of Genetics and Center for Data Sciences at Brigham and Women's Hospital / Harvard Medical School. We are affiliated with the Program in Medical and Population Genetics at the Broad Institute.

Our work is generously supported by an NIH Director's New Innovator Award, a Burroughs Wellcome Fund Career Award at the Scientific Interface, and a Broad Institute Next Generation Fund award. We are also grateful for past support from a Glenn Foundation for Medical Research and AFAR Grant for Junior Faculty and from a Sloan Research Fellowship.

Latest News

Maxwell Sherman receives his PhD

November 18, 2022
Maxwell Sherman has completed MIT's Electrical Engineering and Computer Science (EECS) PhD program and is moving on to Serinus Biosciences, which he co-founded earlier this year. Congratulations, Max!

Paper on haplotype-informed CNV detection published in Cell

October 27, 2022
Margaux Hujoel's paper on haplotype-informed CNV detection (HI-CNV; Hujoel et al. 2022 Cell) is now published -- congratulations, Margaux! This work explores the phenotypic impact of rare copy number variation in the human genome, discovering many new ways in which genetic variation shapes human traits. The analyses were enabled by a new computational approach we developed that substantially increases CNV detection power in large cohorts by pooling information...

New preprint on repeat polymorphisms influencing glaucoma and colorectal cancer risk

October 14, 2022
We are excited to share a new preprint, "Repeat polymorphisms in non-coding DNA underlie top genetic risk loci for glaucoma and colorectal cancer" (Mukamel*, Handsaker* et al.), which identifies variable-number-of-tandem-repeat (VNTR) polymorphisms that appear to generate the strongest known contributions of any common variants to two common diseases. This collaboration with Bob Handsaker and Steve McCarroll leveraged whole-genome sequencing data in large cohorts together...

Two platform talks at ASHG 2022

August 31, 2022

We're very excited to share our latest work at ASHG this October! Ronen Mukamel and Margaux Hujoel will present platform talks describing strong associations of structural variants with heritable traits and diseases that were revealed by statistical haplotype-sharing models. Ronen and Margaux were also selected as finalists for the Charles J. Epstein Trainee Awards -- congratulations!

Ronen Mukamel: "Repeat polymorphisms underlie top genetic risk loci for glaucoma and colorectal cancer" (platform talk, Wed 10/26 at 1:45pm, #206)

Margaux Hujoel: "...


Paper on cancer mutation modeling published in Nature Biotechnology

June 20, 2022
Maxwell Sherman's paper on modeling somatic mutation rates to uncover cancer drivers (Sherman*, Yaari*, Priebe* et al. 2022 Nat Biotech) is now published -- congratulations, Max! This work, a collaboration with Bonnie Berger's group at MIT, developed a deep-learning model to predict cancer-specific neutral mutation rates at kilobase-scale resolution from epigenomic annotations. Applying this model to the Pan-Cancer Analysis of Whole Genomes (PCAWG) resource...

Recent Publications

Influences of rare copy-number variation on human complex traits

Hujoel MLA, Sherman MA, Barton AR, Mukamel RE, Sankaran VG, Terao C, Loh P-R. Influences of rare copy-number variation on human complex traits. Cell 2022;185(22):4233-4248.e27.
The human genome contains hundreds of thousands of regions harboring copy-number variants (CNV). However, the phenotypic effects of most such polymorphisms are unknown because only larger CNVs have been ascertainable from SNP-array data generated by large biobanks. We developed a computational approach leveraging haplotype sharing in biobank cohorts to more sensitively detect CNVs. Applied to UK Biobank, this approach accounted for approximately half of all rare gene inactivation events produced by genomic structural variation. This CNV call set enabled a detailed analysis of associations between CNVs and 56 quantitative traits, identifying 269 independent associations (p < 5 × 10^-8) likely to be causally driven by CNVs. Putative target genes were identifiable for nearly half of the loci, enabling insights into dosage sensitivity of these genes and uncovering several gene-trait relationships. These results demonstrate the ability of haplotype-informed analysis to provide insights into the genetic basis of human complex traits.

A saturated map of common genetic variants associated with human height

Yengo L, Vedantam S, Marouli E, Sidorenko J, Bartell E, Sakaue S, Graff M, Eliasen AU, Jiang Y, Raghavan S, Miao J, Arias JD, Graham SE, Mukamel RE, .., Loh P-R, Yang J, Esko T, Assimes TL, Auton A, Abecasis GR, Willer CJ, Locke AE, Berndt SI, Lettre G, Frayling TM, Okada Y, Wood AR, Visscher PM, Hirschhorn JN. A saturated map of common genetic variants associated with human height. Nature 2022.
Common single-nucleotide polymorphisms (SNPs) are predicted to collectively explain 40-50% of phenotypic variation in human height, but identifying the specific variants and associated regions requires huge sample sizes. Here, using data from a genome-wide association study of 5.4 million individuals of diverse ancestries, we show that 12,111 independent SNPs that are significantly associated with height account for nearly all of the common SNP-based heritability. These SNPs are clustered within 7,209 non-overlapping genomic segments with a mean size of around 90 kb, covering about 21% of the genome. The density of independent associations varies across the genome and the regions of increased density are enriched for biologically relevant genes. In out-of-sample estimation and prediction, the 12,111 SNPs (or all SNPs in the HapMap 3 panel) account for 40% (45%) of phenotypic variance in populations of European ancestry but only around 10-20% (14-24%) in populations of other ancestries. Effect sizes, associated regions and gene prioritization are similar across ancestries, indicating that reduced prediction accuracy is likely to be explained by linkage disequilibrium and differences in allele frequency within associated regions. Finally, we show that the relevant biological pathways are detectable with smaller sample sizes than are needed to implicate causal genes and variants. Overall, this study provides a comprehensive map of specific genomic regions that contain the vast majority of common height-associated variants. Although this map is saturated for populations of European ancestry, further research is needed to achieve equivalent saturation in other ancestries.

Haplotype-aware analysis of somatic copy number variations from single-cell transcriptomes

Gao T, Soldatov R, Sarkar H, Kurkiewicz A, Biederstedt E, Loh P-R, Kharchenko PV. Haplotype-aware analysis of somatic copy number variations from single-cell transcriptomes. Nat Biotechnol 2022.
Genome instability and aberrant alterations of transcriptional programs both play important roles in cancer. Single-cell RNA sequencing (scRNA-seq) has the potential to investigate both genetic and nongenetic sources of tumor heterogeneity in a single assay. Here we present a computational method, Numbat, that integrates haplotype information obtained from population-based phasing with allele and expression signals to enhance detection of copy number variations from scRNA-seq. Numbat exploits the evolutionary relationships between subclones to iteratively infer single-cell copy number profiles and tumor clonal phylogeny. Analysis of 22 tumor samples, including multiple myeloma, gastric, breast and thyroid cancers, shows that Numbat can reconstruct the tumor copy number profile and precisely identify malignant cells in the tumor microenvironment. We identify genetic subpopulations with transcriptional signatures relevant to tumor progression and therapy resistance. Numbat requires neither sample-matched DNA data nor a priori genotyping, and is applicable to a wide range of experimental settings and cancer types.

Chromosomal phase improves aneuploidy detection in non-invasive prenatal testing at low fetal DNA fractions

Genovese G, Mello CJ, Loh P-R, Handsaker RE, Kashin S, Whelan CW, Bayer-Zwirello LA, McCarroll SA. Chromosomal phase improves aneuploidy detection in non-invasive prenatal testing at low fetal DNA fractions. Sci Rep 2022;12(1):12025.
Non-invasive prenatal testing (NIPT) to detect fetal aneuploidy by sequencing the cell-free DNA (cfDNA) in maternal plasma is being broadly adopted. To detect fetal aneuploidies from maternal plasma, where fetal DNA is mixed with far-larger amounts of maternal DNA, NIPT requires a minimum fraction of the circulating cfDNA to be of placental origin, a level which is usually attained beginning at 10 weeks gestational age. We present an approach that leverages the arrangement of alleles along homologous chromosomes (also known as chromosomal phase) to make NIPT analyses more conclusive. We validate our approach with in silico simulations, then re-analyze data from a pregnant mother who, due to a fetal DNA fraction of 3.4%, received an inconclusive aneuploidy determination through NIPT. We find that the presence of a trisomy 18 fetus can be conclusively inferred from the patient's same molecular data when chromosomal phase is incorporated into the analysis. Key to the effectiveness of our approach is the ability of homologous chromosomes to act as natural controls for each other and the ability of chromosomal phase to integrate subtle quantitative signals across very many sequence variants. These results show that chromosomal phase increases the sensitivity of a common laboratory test, an idea that could also advance cfDNA analyses for cancer detection.
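To illustrate the idea of integrating subtle signals across many phased variants, here is a minimal Python sketch. It is not the method from the paper: the function name, the input layout (per-site read counts already assigned to haplotype 1 and haplotype 2), and the simple pooled z-score are all assumptions made for illustration.

```python
import math

def phased_imbalance_z(site_counts):
    """Pooled z-score for read imbalance between two phased haplotypes.

    site_counts: list of (hap1_reads, hap2_reads) tuples, one per
    heterozygous site (hypothetical input format).
    Under the null of balanced haplotypes, hap1 reads follow
    Binomial(n, 0.5), so (hap1 - hap2) has variance n.
    """
    h1 = sum(a for a, _ in site_counts)
    h2 = sum(b for _, b in site_counts)
    n = h1 + h2
    if n == 0:
        raise ValueError("no reads supplied")
    return (h1 - h2) / math.sqrt(n)

def two_sided_p(z):
    """Two-sided p-value under a standard normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# One site with a 52:48 split is uninformative on its own...
single = phased_imbalance_z([(52, 48)])
# ...but the same small skew, consistently on one haplotype across
# 100 sites, adds up to a strong signal.
pooled = phased_imbalance_z([(52, 48)] * 100)
```

Without phase, the direction of the skew at each site would be unknown, so the per-site differences could not simply be summed in this way.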

Incorporating family history of disease improves polygenic risk scores in diverse populations

Hujoel MLA, Loh P-R, Neale BM, Price AL. Incorporating family history of disease improves polygenic risk scores in diverse populations. Cell Genom 2022;2(7):100152.
Polygenic risk scores (PRSs) derived from genotype data and family history (FH) of disease provide valuable information for predicting disease risk, but PRSs perform poorly when applied to diverse populations. Here, we explore methods for combining both types of information (PRS-FH) in UK Biobank data. PRSs were trained using all British individuals (n = 409,000), and target samples consisted of unrelated non-British Europeans (n = 42,000), South Asians (n = 7,000), or Africans (n = 7,000). We evaluated PRS, FH, and PRS-FH using liability-scale R^2, primarily focusing on 3 well-powered diseases (type 2 diabetes, hypertension, and depression). PRS attained average prediction R^2 values of 5.8%, 4.0%, and 0.53% in non-British Europeans, South Asians, and Africans, confirming poor cross-population transferability. In contrast, PRS-FH attained average prediction R^2 values of 13%, 12%, and 10%, respectively, representing a large improvement in Europeans and an extremely large improvement in Africans. In conclusion, including family history improves the accuracy of polygenic risk scores, particularly in diverse populations.

Genome-wide mapping of somatic mutation rates uncovers drivers of cancer

Sherman MA, Yaari AU, Priebe O, Dietlein F, Loh P-R, Berger B. Genome-wide mapping of somatic mutation rates uncovers drivers of cancer. Nat Biotechnol 2022.
Identification of cancer driver mutations that confer a proliferative advantage is central to understanding cancer; however, searches have often been limited to protein-coding sequences and specific non-coding elements (for example, promoters) because of the challenge of modeling the highly variable somatic mutation rates observed across tumor genomes. Here we present Dig, a method to search for driver elements and mutations anywhere in the genome. We use deep neural networks to map cancer-specific mutation rates genome-wide at kilobase-scale resolution. These estimates are then refined to search for evidence of driver mutations under positive selection throughout the genome by comparing observed to expected mutation counts. We mapped mutation rates for 37 cancer types and applied these maps to identify putative drivers within intronic cryptic splice regions, 5' untranslated regions and infrequently mutated genes. Our high-resolution mutation rate maps, available for web-based exploration, are a resource to enable driver discovery genome-wide.
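The comparison of observed to expected mutation counts can be sketched with a simple one-sided Poisson test. This is only an illustration under stated assumptions, not Dig's actual statistical model: the function name is hypothetical, and the expected count is assumed to come from some neutral mutation rate map.

```python
import math

def poisson_upper_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected).

    A small value suggests more mutations than the neutral rate
    predicts. Computed as 1 - CDF, which is adequate for a sketch
    but loses precision for extremely small p-values.
    """
    if observed <= 0:
        return 1.0
    term = math.exp(-expected)   # P(X = 0)
    cdf = term
    for k in range(1, observed):
        term *= expected / k     # P(X = k) from P(X = k - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

# Hypothetical element: 1 mutation expected under neutrality, 3 observed.
p_value = poisson_upper_tail(3, 1.0)
```

In a genome-wide scan, such per-element p-values would still need correction for the number of elements tested.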