Clonally expanded blood cells that contain somatic mutations (clonal haematopoiesis) are commonly acquired with age and increase the risk of blood cancer. The blood clones identified so far contain diverse large-scale mosaic chromosomal alterations (deletions, duplications and copy-neutral loss of heterozygosity (CN-LOH)) on all chromosomes, but the sources of selective advantage that drive the expansion of most clones remain unknown. Here, to identify genes, mutations and biological processes that give selective advantage to mutant clones, we analysed genotyping data from the blood-derived DNA of 482,789 participants from the UK Biobank. We identified 19,632 autosomal mosaic chromosomal alterations and analysed these for relationships to inherited genetic variation. We found 52 inherited, rare, large-effect coding or splice variants in 7 genes that were associated with greatly increased vulnerability to clonal haematopoiesis with specific acquired CN-LOH mutations. Acquired mutations systematically replaced the inherited risk alleles (at MPL) or duplicated them to the homologous chromosome (at FH, NBN, MRE11, ATM, SH2B3 and TM2D3). Three of the genes (MRE11, NBN and ATM) encode components of the MRN-ATM pathway, which limits cell division after DNA damage and telomere attrition; another two (MPL and SH2B3) encode proteins that regulate the self-renewal of stem cells. In addition, we found that CN-LOH mutations across the genome tended to cause chromosomal segments with alleles that promote the expansion of haematopoietic cells to replace their homologous (allelic) counterparts, increasing polygenic drive for blood-cell proliferation traits. Readily acquired mutations that replace chromosomal segments with their homologous counterparts seem to interact with pervasive inherited variation to create a challenge for lifelong cytopoiesis.
The extent to which the biology of oncogenesis and ageing are shaped by factors that distinguish human populations is unknown. Haematopoietic clones with acquired mutations become common with advancing age and can lead to blood cancers. Here we describe shared and population-specific patterns of genomic mutations and clonal selection in haematopoietic cells on the basis of 33,250 autosomal mosaic chromosomal alterations that we detected in 179,417 Japanese participants in the BioBank Japan cohort and compared with analogous data from the UK Biobank. In this long-lived Japanese population, mosaic chromosomal alterations were detected in more than 35.0% (s.e.m., 1.4%) of individuals older than 90 years, which suggests that such clones trend towards inevitability with advancing age. Japanese and European individuals exhibited key differences in the genomic locations of mutations in their respective haematopoietic clones; these differences predicted the relative rates of chronic lymphocytic leukaemia (which is more common among European individuals) and T cell leukaemia (which is more common among Japanese individuals) in these populations. Three different mutational precursors of chronic lymphocytic leukaemia (including trisomy 12, loss of chromosomes 13q and 13q, and copy-neutral loss of heterozygosity) were between two and six times less common among Japanese individuals, which suggests that the Japanese and European populations differ in selective pressures on clones long before the development of clinically apparent chronic lymphocytic leukaemia. Japanese and British populations also exhibited very different rates of clones that arose from B and T cell lineages, which predicted the relative rates of B and T cell cancers in these populations. 
We identified six previously undescribed loci at which inherited variants predispose to mosaic chromosomal alterations that duplicate or remove the inherited risk alleles, including large-effect rare variants at NBN, MRE11 and CTU2 (odds ratio, 28-91). We suggest that selective pressures on clones are modulated by factors that are specific to human populations. Further genomic characterization of clonal selection and cancer in populations from around the world is therefore warranted.
Family history of disease can provide valuable information in case-control association studies, but it is currently unclear how to best combine case-control status and family history of disease. We developed an association method based on posterior mean genetic liabilities under a liability threshold model, conditional on case-control status and family history (LT-FH). Analyzing 12 diseases from the UK Biobank (average N = 350,000) we compared LT-FH to genome-wide association without using family history (GWAS) and a previous proxy-based method incorporating family history (GWAX). LT-FH was 63% (standard error (s.e.) 6%) more powerful than GWAS and 36% (s.e. 4%) more powerful than the trait-specific maximum of GWAS and GWAX, based on the number of independent genome-wide-significant loci across all diseases (for example, 690 loci for LT-FH versus 423 for GWAS); relative improvements were similar when applying BOLT-LMM to GWAS, GWAX and LT-FH phenotypes. Thus, LT-FH greatly increases association power when family history of disease is available.
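As a rough illustration of the idea behind LT-FH, the posterior mean genetic liability conditional on case-control status and family history can be approximated by simulation. The sketch below is not the authors' implementation. It assumes one offspring with two parents, a single heritability `h2` and prevalence shared by all three individuals, and no shared environment; the function name `posterior_mean_liabilities` and the parameter values are hypothetical.

```python
import random
from collections import defaultdict
from statistics import NormalDist


def posterior_mean_liabilities(h2, prevalence, n_samples=100_000, seed=0):
    """Monte Carlo estimate of E[offspring genetic liability | own status, family history].

    Assumed model (illustrative): liability = genetic + environmental component,
    total variance 1, disease if liability exceeds a threshold set by prevalence.
    Returns a dict keyed by (offspring_is_case, number_of_affected_parents).
    """
    rng = random.Random(seed)
    threshold = NormalDist().inv_cdf(1.0 - prevalence)
    sd_g = h2 ** 0.5              # s.d. of genetic liability
    sd_e = (1.0 - h2) ** 0.5      # s.d. of environmental liability
    sd_seg = (h2 / 2.0) ** 0.5    # s.d. of Mendelian segregation term

    sums = defaultdict(float)
    counts = defaultdict(int)
    for _ in range(n_samples):
        g_parents = [rng.gauss(0.0, sd_g), rng.gauss(0.0, sd_g)]
        # Offspring genetic liability: mid-parent value plus segregation noise,
        # so that Var(g_child) = h2 and Cov(g_child, g_parent) = h2 / 2.
        g_child = 0.5 * sum(g_parents) + rng.gauss(0.0, sd_seg)
        child_case = g_child + rng.gauss(0.0, sd_e) > threshold
        n_affected = sum(g + rng.gauss(0.0, sd_e) > threshold for g in g_parents)
        key = (child_case, n_affected)
        sums[key] += g_child
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in counts}


# Illustrative parameters only (not taken from the study).
estimates = posterior_mean_liabilities(h2=0.5, prevalence=0.1)
for (is_case, n_affected), mean_g in sorted(estimates.items()):
    print(f"case={is_case!s:5} affected_parents={n_affected} mean_g={mean_g:+.3f}")
```

Under these assumptions, the estimated liability rises with the number of affected parents within both cases and controls, which is the extra information a plain case-control phenotype discards.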
Thompson DJ, Genovese G, Halvardson J, Ulirsch JC, Wright DJ, Terao C, Davidsson OB, Day FR, Sulem P, Jiang Y, Danielsson M, Davies H, Dennis J, Dunlop MG, Easton DF, Fisher VA, Zink F, Houlston RS, Ingelsson M, Kar S, Kerrison ND, Kinnersley B, Kristjansson RP, Law PJ, Li R, Loveday C, Mattisson J, McCarroll SA, Murakami Y, Murray A, Olszewski P, Rychlicka-Buniowska E, Scott RA, Thorsteinsdottir U, Tomlinson I, Moghadam BT, Turnbull C, Wareham NJ, Gudbjartsson DF, Kamatani Y, Hoffmann ER, Jackson SP, Stefansson K, Auton A, Ong KK, Machiela MJ, Loh P-R, Dumanski JP, Chanock SJ, Forsberg LA, Perry JRB. Genetic predisposition to mosaic Y chromosome loss in blood. Nature 2019;575(7784):652-657.
Mosaic loss of chromosome Y (LOY) in circulating white blood cells is the most common form of clonal mosaicism, yet our knowledge of the causes and consequences of this is limited. Here, using a computational approach, we estimate that 20% of the male population represented in the UK Biobank study (n = 205,011) has detectable LOY. We identify 156 autosomal genetic determinants of LOY, which we replicate in 757,114 men of European and Japanese ancestry. These loci highlight genes that are involved in cell-cycle regulation and cancer susceptibility, as well as somatic drivers of tumour growth and targets of cancer therapy. We demonstrate that genetic susceptibility to LOY is associated with non-haematological effects on health in both men and women, which supports the hypothesis that clonal haematopoiesis is a biomarker of genomic instability in other tissues. Single-cell RNA sequencing identifies dysregulated expression of autosomal genes in leukocytes with LOY and provides insights into why clonal expansion of these cells may occur. Collectively, these data highlight the value of studying clonal mosaicism to uncover fundamental mechanisms that underlie cancer and other ageing-related diseases.
The emerging diversity of single-cell RNA-seq datasets allows for the full transcriptional characterization of cell types across a wide variety of biological and clinical conditions. However, it is challenging to analyze them together, particularly when datasets are assayed with different technologies, because biological and technical differences are interspersed. We present Harmony (https://github.com/immunogenomics/harmony), an algorithm that projects cells into a shared embedding in which cells group by cell type rather than dataset-specific conditions. Harmony simultaneously accounts for multiple experimental and biological factors. In six analyses, we demonstrate the superior performance of Harmony to previously published algorithms while requiring fewer computational resources. Harmony enables the integration of ~10^6 cells on a personal computer. We apply Harmony to peripheral blood mononuclear cells from datasets with large experimental differences, five studies of pancreatic islet cells, mouse embryogenesis datasets and the integration of scRNA-seq with spatial transcriptomics data.
Mosaic loss of chromosome Y (mLOY) is frequently observed in the leukocytes of ageing men. However, the genetic architecture and biological mechanisms underlying mLOY are not fully understood. In a cohort of 95,380 Japanese men, we identify 50 independent genetic markers in 46 loci associated with mLOY at a genome-wide significant level, 35 of which are unreported. Lead markers overlap enhancer marks in hematopoietic stem cells (HSCs, P ≤ 1.0 × 10). mLOY genome-wide association study signals exhibit polygenic architecture and demonstrate strong heritability enrichment in regions surrounding genes specifically expressed in multipotent progenitor (MPP) cells and HSCs (P ≤ 3.5 × 10). ChIP-seq data demonstrate that binding sites of FLI1, a fate-determining factor promoting HSC differentiation into platelets rather than red blood cells (RBCs), show a strong heritability enrichment (P = 1.5 × 10). Consistent with these findings, platelet and RBC counts are positively and negatively associated with mLOY, respectively. Collectively, our observations improve our understanding of the mechanisms underlying mLOY.
Regulatory variation plays a major role in complex disease, and cell-type-specific binding of transcription factors (TFs) is critical to gene regulation. However, assessing the contribution of genetic variation in TF binding sites to disease heritability is challenging, as binding is often cell-type-specific and annotations from directly measured TF binding are not currently available for most cell-type-TF pairs. We investigate approaches to annotate TF binding, including directly measured chromatin data and sequence-based predictions. We find that TF binding annotations constructed by intersecting sequence-based TF binding predictions with cell-type-specific chromatin data explain a large fraction of heritability across a broad set of diseases and corresponding cell types; this strategy of constructing annotations addresses both the limitation that identical sequences may be bound or unbound depending on surrounding chromatin context, and the limitation that sequence-based predictions are generally not cell-type-specific. We partitioned the heritability of 49 diseases and complex traits using stratified LD score regression with the baseline-LD model (which is not cell-type-specific) plus the new annotations. We determined that 100-bp windows around MotifMap sequence-based TF binding predictions intersected with a union of six cell-type-specific chromatin marks (imputed using ChromImpute) performed best, with a 58% increase in heritability enrichment compared to the chromatin marks alone (11.6× vs 7.3×; P = 9 × 10^-14 for difference) and a 20% increase in cell-type-specific signal conditional on annotations from the baseline-LD model (P = 8 × 10^-11 for difference). Our results show that TF binding annotations explain substantial disease heritability and can help refine genome-wide association signals.
Recent studies have highlighted the role of gene networks in disease biology. To formally assess this, we constructed a broad set of pathway, network, and pathway+network annotations and applied stratified LD score regression to 42 diseases and complex traits (average N = 323K) to identify enriched annotations. First, we analyzed 18,119 biological pathways. We identified 156 pathway-trait pairs whose disease enrichment was statistically significant (FDR < 5%) after conditioning on all genes and 75 known functional annotations (from the baseline-LD model), a stringent step that greatly reduced the number of pathways detected; most significant pathway-trait pairs were previously unreported. Next, for each of four published gene networks, we constructed probabilistic annotations based on network connectivity. For each gene network, the network connectivity annotation was strongly significantly enriched. Surprisingly, the enrichments were fully explained by excess overlap between network annotations and regulatory annotations from the baseline-LD model, validating the informativeness of the baseline-LD model and emphasizing the importance of accounting for regulatory annotations in gene network analyses. Finally, for each of the 156 enriched pathway-trait pairs, for each of the four gene networks, we constructed pathway+network annotations by annotating genes with high network connectivity to the input pathway. For each gene network, these pathway+network annotations were strongly significantly enriched for the corresponding traits. Once again, the enrichments were largely explained by the baseline-LD model. In conclusion, gene network connectivity is highly informative for disease architectures, but the information in gene networks may be subsumed by regulatory annotations, emphasizing the importance of accounting for known annotations.
Warrington NM, Beaumont RN, Horikoshi M, Day FR, Helgeland Ø, Laurin C, Bacelis J, Peng S, Hao K, Feenstra B, Wood AR, Mahajan A, Tyrrell J, Robertson NR, Rayner WN, Qiao Z, Moen G-H, Vaudel M, Marsit CJ, Chen J, Nodzenski M, Schnurr TM, Zafarmand MH, Bradfield JP, Grarup N, Kooijman MN, Li-Gao R, Geller F, Ahluwalia TS, Paternoster L, Rueedi R, Huikari V, Hottenga J-J, Lyytikäinen L-P, Cavadino A, Metrustry S, Cousminer DL, Wu Y, Thiering E, Wang CA, Have CT, Vilor-Tejedor N, Joshi PK, Painter JN, Ntalla I, Myhre R, Pitkänen N, van Leeuwen EM, Joro R, Lagou V, Richmond RC, Espinosa A, Barton SJ, Inskip HM, Holloway JW, Santa-Marina L, Estivill X, Ang W, Marsh JA, Reichetzeder C, Marullo L, Hocher B, Lunetta KL, Murabito JM, Relton CL, Kogevinas M, Chatzi L, Allard C, Bouchard L, Hivert M-F, Zhang G, Muglia LJ, Heikkinen J, Morgen CS, van Kampen AHC, van Schaik BDC, Mentch FD, Langenberg C, Luan J'an, Scott RA, Zhao JH, Hemani G, Ring SM, Bennett AJ, Gaulton KJ, Fernandez-Tajes J, van Zuydam NR, Medina-Gomez C, de Haan HG, Rosendaal FR, Kutalik Z, Marques-Vidal P, Das S, Willemsen G, Mbarek H, Müller-Nurasyid M, Standl M, Appel EVR, Fonvig CE, Trier C, van Beijsterveldt CE, Murcia M, Bustamante M, Bonas-Guarch S, Hougaard DM, Mercader JM, Linneberg A, Schraut KE, Lind PA, Medland SE, Shields BM, Knight BA, Chai J-F, Panoutsopoulou K, Bartels M, Sánchez F, Stokholm J, Torrents D, Vinding RK, Willems SM, Atalay M, Chawes BL, Kovacs P, Prokopenko I, Tuke MA, Yaghootkar H, Ruth KS, Jones SE, Loh P-R, .., Ong KK, McCarthy MI, Perry JRB, Evans DM, Freathy RM. Maternal and fetal genetic effects on birth weight and their relevance to cardio-metabolic risk factors. Nat Genet 2019;51(5):804-814.
Birth weight variation is influenced by fetal and maternal genetic and non-genetic factors, and has been reproducibly associated with future cardio-metabolic health outcomes. In expanded genome-wide association analyses of own birth weight (n = 321,223) and offspring birth weight (n = 230,069 mothers), we identified 190 independent association signals (129 of which are novel). We used structural equation modeling to decompose the contributions of direct fetal and indirect maternal genetic effects, then applied Mendelian randomization to illuminate causal pathways. For example, both indirect maternal and direct fetal genetic effects drive the observational relationship between lower birth weight and higher later blood pressure: maternal blood pressure-raising alleles reduce offspring birth weight, but only direct fetal effects of these alleles, once inherited, increase later offspring blood pressure. Using maternal birth weight-lowering genotypes to proxy for an adverse intrauterine environment provided no evidence that it causally raises offspring blood pressure, indicating that the inverse birth weight-blood pressure association is attributable to genetic effects, and not to intrauterine programming.
Recent studies have examined the genetic correlations of single-nucleotide polymorphism (SNP) effect sizes across pairs of populations to better understand the genetic architectures of complex traits. These studies have estimated ρ_g, the cross-population correlation of joint-fit effect sizes at genotyped SNPs. However, the value of ρ_g depends both on the cross-population correlation of true causal effect sizes (ρ_b) and on the similarity in linkage disequilibrium (LD) patterns in the two populations, which drive tagging effects. Here, we derive the value of the ratio ρ_g/ρ_b as a function of LD in each population. By applying existing methods to obtain estimates of ρ_g, we can use this ratio to estimate ρ_b. Our estimates of ρ_b were equal to 0.55 (SE = 0.14) between Europeans and East Asians averaged across nine traits in the Genetic Epidemiology Research on Adult Health and Aging data set, 0.54 (SE = 0.18) between Europeans and South Asians averaged across 13 traits in the UK Biobank data set, and 0.48 (SE = 0.06) and 0.65 (SE = 0.09) between Europeans and East Asians in summary statistic data sets for type 2 diabetes and rheumatoid arthritis, respectively. These results implicate substantially different causal genetic architectures across continental populations.
Understanding the role of rare variants is important in elucidating the genetic basis of human disease. Negative selection can cause rare variants to have larger per-allele effect sizes than common variants. Here, we develop a method to estimate the minor allele frequency (MAF) dependence of SNP effect sizes. We use a model in which per-allele effect sizes have variance proportional to [p(1 - p)]^α, where p is the MAF and negative values of α imply larger effect sizes for rare variants. We estimate α for 25 UK Biobank diseases and complex traits. All traits produce negative α estimates, with best-fit mean of -0.38 (s.e. 0.02) across traits. Despite larger rare variant effect sizes, rare variants (MAF < 1%) explain less than 10% of total SNP-heritability for most traits analyzed. Using evolutionary modeling and forward simulations, we validate the α model of MAF-dependent trait effects and assess plausible values of relevant evolutionary parameters.
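The α model can be made concrete with a short calculation. The sketch below is illustrative only and is not the estimation method described in the abstract. It assumes Hardy-Weinberg equilibrium, so that a SNP with per-allele effect β explains variance 2p(1 - p)β², and it drops all proportionality constants; the function names are hypothetical.

```python
def per_allele_effect_variance(maf, alpha):
    """Relative variance of per-allele effect sizes: [p(1 - p)]^alpha."""
    return (maf * (1.0 - maf)) ** alpha


def relative_variance_explained(maf, alpha):
    """Relative expected trait variance explained by one SNP.

    Under Hardy-Weinberg equilibrium the genotype variance is 2p(1 - p),
    so the expected contribution scales as 2 * [p(1 - p)]^(1 + alpha).
    """
    return 2.0 * maf * (1.0 - maf) * per_allele_effect_variance(maf, alpha)


ALPHA = -0.38  # best-fit mean across traits reported in the abstract

for maf in (0.001, 0.01, 0.1, 0.5):
    print(
        f"MAF={maf:<6} "
        f"effect variance={per_allele_effect_variance(maf, ALPHA):8.2f}  "
        f"variance explained={relative_variance_explained(maf, ALPHA):.4f}"
    )
```

For any α between -1 and 0, per-allele effect variance grows as MAF falls while the variance explained per SNP shrinks, which is in line with the observation that rare variants have larger effects yet account for a modest share of SNP-heritability.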
We introduce cross-trait penalized regression (CTPR), a powerful and practical approach for multi-trait polygenic risk prediction in large cohorts. Specifically, we propose a novel cross-trait penalty function with the Lasso and the minimax concave penalty (MCP) to incorporate the shared genetic effects across multiple traits for large-sample GWAS data. Our approach extracts information from the secondary traits that is beneficial for predicting the primary trait based on individual-level genotypes and/or summary statistics. Our novel implementation of a parallel computing algorithm makes it feasible to apply our method to biobank-scale GWAS data. We illustrate our method using large-scale GWAS data (~1M SNPs) from the UK Biobank (N = 456,837). We show that our multi-trait method outperforms the recently proposed multi-trait analysis of GWAS (MTAG) for predictive performance. The prediction accuracy for height with the aid of BMI improves from R^2 = 35.8% (MTAG) to 42.5% (MCP + CTPR) or 42.8% (Lasso + CTPR) with UK Biobank data.
Functional genomics data has the potential to increase GWAS power by identifying SNPs that have a higher prior probability of association. Here, we introduce a method that leverages polygenic functional enrichment to incorporate coding, conserved, regulatory, and LD-related genomic annotations into association analyses. We show via simulations with real genotypes that the method, functionally informed novel discovery of risk loci (FINDOR), correctly controls the false-positive rate at null loci and attains a 9%-38% increase in the number of independent associations detected at causal loci, depending on trait polygenicity and sample size. We applied FINDOR to 27 independent complex traits and diseases from the interim UK Biobank release (average N = 130K). Averaged across traits, we attained a 13% increase in genome-wide significant loci detected (including a 20% increase for disease traits) compared to unweighted raw p values that do not use functional data. We replicated the additional loci in independent UK Biobank and non-UK Biobank data, yielding a highly statistically significant replication slope (0.66-0.69) in each case. Finally, we applied FINDOR to the full UK Biobank release (average N = 416K), attaining smaller relative improvements (consistent with simulations) but larger absolute improvements, detecting an additional 583 GWAS loci. In conclusion, leveraging functional enrichment using our method robustly increases GWAS power.
Common variant heritability has been widely reported to be concentrated in variants within cell-type-specific non-coding functional annotations, but little is known about low-frequency variant functional architectures. We partitioned the heritability of both low-frequency (0.5% ≤ minor allele frequency < 5%) and common (minor allele frequency ≥ 5%) variants in 40 UK Biobank traits across a broad set of functional annotations. We determined that non-synonymous coding variants explain 17 ± 1% of low-frequency variant heritability (h²_lf) versus 2.1 ± 0.2% of common variant heritability (h²_c). Cell-type-specific non-coding annotations that were significantly enriched for h²_c of corresponding traits were similarly enriched for h²_lf for most traits, but more enriched for brain-related annotations and traits. For example, H3K4me3 marks in brain dorsolateral prefrontal cortex explain 57 ± 12% of h²_lf versus 12 ± 2% of h²_c for neuroticism. Forward simulations confirmed that low-frequency variant enrichment depends on the mean selection coefficient of causal variants in the annotation, and can be used to predict effect size variance of causal rare variants (minor allele frequency < 0.5%).
Biological interpretation of genome-wide association study data frequently involves assessing whether SNPs linked to a biological process, for example, binding of a transcription factor, show unsigned enrichment for disease signal. However, signed annotations quantifying whether each SNP allele promotes or hinders the biological process can enable stronger statements about disease mechanism. We introduce a method, signed linkage disequilibrium profile regression, for detecting genome-wide directional effects of signed functional annotations on disease risk. We validate the method via simulations and application to molecular quantitative trait loci in blood, recovering known transcriptional regulators. We apply the method to expression quantitative trait loci in 48 Genotype-Tissue Expression tissues, identifying 651 transcription factor-tissue associations including 30 with robust evidence of tissue specificity. We apply the method to 46 diseases and complex traits (average n = 290 K), identifying 77 annotation-trait associations representing 12 independent transcription factor-trait associations, and characterize the underlying transcriptional programs using gene-set enrichment analyses. Our results implicate new causal disease genes and new disease mechanisms.
The selective pressures that shape clonal evolution in healthy individuals are largely unknown. Here we investigate 8,342 mosaic chromosomal alterations, from 50 kb to 249 Mb long, that we uncovered in blood-derived DNA from 151,202 UK Biobank participants using phase-based computational techniques (estimated false discovery rate, 6-9%). We found six loci at which inherited variants associated strongly with the acquisition of deletions or loss of heterozygosity in cis. At three such loci (MPL, TM2D3-TARSL2, and FRA10B), we identified a likely causal variant that acted with high penetrance (5-50%). Inherited alleles at one locus appeared to affect the probability of somatic mutation, and at three other loci to be objects of positive or negative clonal selection. Several specific mosaic chromosomal alterations were strongly associated with future haematological malignancies. Our results reveal a multitude of paths towards clonal expansions with a wide range of effects on human health.
There is increasing evidence that many risk loci found using genome-wide association studies are molecular quantitative trait loci (QTLs). Here we introduce a new set of functional annotations based on causal posterior probabilities of fine-mapped molecular cis-QTLs, using data from the Genotype-Tissue Expression (GTEx) and BLUEPRINT consortia. We show that these annotations are more strongly enriched for heritability (5.84× for eQTLs; P = 1.19 × 10) across 41 diseases and complex traits than annotations containing all significant molecular QTLs (1.80× for expression (e)QTLs). eQTL annotations obtained by meta-analyzing all GTEx tissues generally performed best, whereas tissue-specific eQTL annotations produced stronger enrichments for blood- and brain-related diseases and traits. eQTL annotations restricted to loss-of-function intolerant genes were even more enriched for heritability (17.06×; P = 1.20 × 10). All molecular QTLs except splicing QTLs remained significantly enriched in joint analysis, indicating that each of these annotations is uniquely informative for disease and complex trait architectures.
Clinical and epidemiological data suggest that asthma and allergic diseases are associated and may share a common genetic etiology. We analyzed genome-wide SNP data for asthma and allergic diseases in 33,593 cases and 76,768 controls of European ancestry from UK Biobank. Two publicly available independent genome-wide association studies were used for replication. We have found a strong genome-wide genetic correlation between asthma and allergic diseases (r = 0.75, P = 6.84 × 10). Cross-trait analysis identified 38 genome-wide significant loci, including 7 novel shared loci. Computational analysis showed that shared genetic loci are enriched in immune/inflammatory systems and tissues with epithelium cells. Our work identifies common genetic architectures shared between asthma and allergy and will help to advance understanding of the molecular mechanisms underlying co-morbid asthma and allergic diseases.
We introduce an approach to identify disease-relevant tissues and cell types by analyzing gene expression data together with genome-wide association study (GWAS) summary statistics. Our approach uses stratified linkage disequilibrium (LD) score regression to test whether disease heritability is enriched in regions surrounding genes with the highest specific expression in a given tissue. We applied our approach to gene expression data from several sources together with GWAS summary statistics for 48 diseases and traits (average N = 169,331) and found significant tissue-specific enrichments (false discovery rate (FDR) < 5%) for 34 traits. In our analysis of multiple tissues, we detected a broad range of enrichments that recapitulated known biology. In our brain-specific analysis, significant enrichments included an enrichment of inhibitory over excitatory neurons for bipolar disorder, and excitatory over inhibitory neurons for schizophrenia and body mass index. Our results demonstrate that our polygenic approach is a powerful way to leverage gene expression data for interpreting GWAS signals.