Common variant heritability has been widely reported to be concentrated in variants within cell-type-specific non-coding functional annotations, but little is known about low-frequency variant functional architectures. We partitioned the heritability of both low-frequency (0.5% ≤ minor allele frequency < 5%) and common (minor allele frequency ≥ 5%) variants in 40 UK Biobank traits across a broad set of functional annotations. We determined that non-synonymous coding variants explain 17 ± 1% of low-frequency variant heritability versus 2.1 ± 0.2% of common variant heritability. Cell-type-specific non-coding annotations that were significantly enriched for common variant heritability of corresponding traits were similarly enriched for low-frequency variant heritability for most traits, but more enriched for brain-related annotations and traits. For example, H3K4me3 marks in brain dorsolateral prefrontal cortex explain 57 ± 12% of low-frequency variant heritability versus 12 ± 2% of common variant heritability for neuroticism. Forward simulations confirmed that low-frequency variant enrichment depends on the mean selection coefficient of causal variants in the annotation, and can be used to predict effect size variance of causal rare variants (minor allele frequency < 0.5%).
Biological interpretation of genome-wide association study data frequently involves assessing whether SNPs linked to a biological process, for example, binding of a transcription factor, show unsigned enrichment for disease signal. However, signed annotations quantifying whether each SNP allele promotes or hinders the biological process can enable stronger statements about disease mechanism. We introduce a method, signed linkage disequilibrium profile regression, for detecting genome-wide directional effects of signed functional annotations on disease risk. We validate the method via simulations and application to molecular quantitative trait loci in blood, recovering known transcriptional regulators. We apply the method to expression quantitative trait loci in 48 Genotype-Tissue Expression tissues, identifying 651 transcription factor-tissue associations including 30 with robust evidence of tissue specificity. We apply the method to 46 diseases and complex traits (average n = 290 K), identifying 77 annotation-trait associations representing 12 independent transcription factor-trait associations, and characterize the underlying transcriptional programs using gene-set enrichment analyses. Our results implicate new causal disease genes and new disease mechanisms.
The selective pressures that shape clonal evolution in healthy individuals are largely unknown. Here we investigate 8,342 mosaic chromosomal alterations, from 50 kb to 249 Mb long, that we uncovered in blood-derived DNA from 151,202 UK Biobank participants using phase-based computational techniques (estimated false discovery rate, 6-9%). We found six loci at which inherited variants associated strongly with the acquisition of deletions or loss of heterozygosity in cis. At three such loci (MPL, TM2D3-TARSL2, and FRA10B), we identified a likely causal variant that acted with high penetrance (5-50%). Inherited alleles at one locus appeared to affect the probability of somatic mutation, and at three other loci to be objects of positive or negative clonal selection. Several specific mosaic chromosomal alterations were strongly associated with future haematological malignancies. Our results reveal a multitude of paths towards clonal expansions with a wide range of effects on human health.
There is increasing evidence that many risk loci found using genome-wide association studies are molecular quantitative trait loci (QTLs). Here we introduce a new set of functional annotations based on causal posterior probabilities of fine-mapped molecular cis-QTLs, using data from the Genotype-Tissue Expression (GTEx) and BLUEPRINT consortia. We show that these annotations are more strongly enriched for heritability (5.84× for eQTLs; P = 1.19 × 10) across 41 diseases and complex traits than annotations containing all significant molecular QTLs (1.80× for expression (e)QTLs). eQTL annotations obtained by meta-analyzing all GTEx tissues generally performed best, whereas tissue-specific eQTL annotations produced stronger enrichments for blood- and brain-related diseases and traits. eQTL annotations restricted to loss-of-function intolerant genes were even more enriched for heritability (17.06×; P = 1.20 × 10). All molecular QTLs except splicing QTLs remained significantly enriched in joint analysis, indicating that each of these annotations is uniquely informative for disease and complex trait architectures.
Clinical and epidemiological data suggest that asthma and allergic diseases are associated and may share a common genetic etiology. We analyzed genome-wide SNP data for asthma and allergic diseases in 33,593 cases and 76,768 controls of European ancestry from UK Biobank. Two publicly available independent genome-wide association studies were used for replication. We have found a strong genome-wide genetic correlation between asthma and allergic diseases (r = 0.75, P = 6.84 × 10). Cross-trait analysis identified 38 genome-wide significant loci, including 7 novel shared loci. Computational analysis showed that shared genetic loci are enriched in immune/inflammatory systems and tissues with epithelium cells. Our work identifies common genetic architectures shared between asthma and allergy and will help to advance understanding of the molecular mechanisms underlying co-morbid asthma and allergic diseases.
We introduce an approach to identify disease-relevant tissues and cell types by analyzing gene expression data together with genome-wide association study (GWAS) summary statistics. Our approach uses stratified linkage disequilibrium (LD) score regression to test whether disease heritability is enriched in regions surrounding genes with the highest specific expression in a given tissue. We applied our approach to gene expression data from several sources together with GWAS summary statistics for 48 diseases and traits (average N = 169,331) and found significant tissue-specific enrichments (false discovery rate (FDR) < 5%) for 34 traits. In our analysis of multiple tissues, we detected a broad range of enrichments that recapitulated known biology. In our brain-specific analysis, significant enrichments included an enrichment of inhibitory over excitatory neurons for bipolar disorder, and excitatory over inhibitory neurons for schizophrenia and body mass index. Our results demonstrate that our polygenic approach is a powerful way to leverage gene expression data for interpreting GWAS signals.
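The regression idea behind LD score regression can be sketched in a few lines. The block below is a deliberately simplified, single-annotation version: it fits an unweighted least-squares line of GWAS chi-square statistics against per-SNP LD scores and converts the slope into a heritability estimate, using the relation E[chi2] = intercept + (N * h2 / M) * LD score. The stratified method in the abstract uses one LD score per annotation, regression weights, and jackknife standard errors, none of which are shown here. The function name `ldsc_heritability` and the toy numbers are assumptions made for illustration.

```python
def ldsc_heritability(ld_scores, chi2_stats, n_samples, n_snps):
    """Unweighted, single-annotation LD score regression (illustrative only).

    Fits chi2_j = intercept + slope * ld_score_j by ordinary least squares
    and returns (h2_estimate, intercept), where h2 = slope * M / N.
    """
    if len(ld_scores) != len(chi2_stats) or len(ld_scores) < 2:
        raise ValueError("need two equally long inputs with at least 2 SNPs")
    n = len(ld_scores)
    mean_l = sum(ld_scores) / n
    mean_c = sum(chi2_stats) / n
    sxx = sum((l - mean_l) ** 2 for l in ld_scores)
    sxy = sum((l - mean_l) * (c - mean_c)
              for l, c in zip(ld_scores, chi2_stats))
    if sxx == 0:
        raise ValueError("LD scores must not all be identical")
    slope = sxy / sxx
    intercept = mean_c - slope * mean_l
    return slope * n_snps / n_samples, intercept


# Hypothetical toy data generated with h2 = 0.4, N = 1000, M = 500,
# so the true slope is N * h2 / M = 0.8 and the intercept is 1.
toy_ld = [1.0, 2.0, 5.0, 10.0]
toy_chi2 = [1.0 + 0.8 * l for l in toy_ld]
h2_hat, intercept_hat = ldsc_heritability(toy_ld, toy_chi2, 1000, 500)
```

In the stratified setting the single slope becomes a vector of per-annotation coefficients, and enrichment for a tissue-specific gene set is read off the coefficient of its annotation.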
Weng L-C, Choi SH, Klarin D, Smith GJ, Loh P-R, Chaffin M, Roselli C, Hulme OL, Lunetta KL, Dupuis J, Benjamin EJ, Newton-Cheh C, Kathiresan S, Ellinor PT, Lubitz SA. Heritability of Atrial Fibrillation. Circ Cardiovasc Genet 2017;10(6).
BACKGROUND: Previous reports have implicated multiple genetic loci associated with AF, but the contributions of genome-wide variation to AF susceptibility have not been quantified. METHODS AND RESULTS: We assessed the contribution of genome-wide single-nucleotide polymorphism variation to AF risk (single-nucleotide polymorphism heritability, h2g) using data from 120 286 unrelated individuals of European ancestry (2987 with AF) in the population-based UK Biobank. We ascertained AF based on self-report, medical record billing codes, procedure codes, and death records. We estimated h2g using a variance components method with variants having a minor allele frequency ≥1%. We evaluated h2g in age, sex, and genomic strata of interest. The h2g for AF was 22.1% (95% confidence interval, 15.6%-28.5%) and was similar for early- versus older-onset AF (≤65 versus >65 years of age), as well as for men and women. The proportion of AF variance explained by genetic variation was mainly accounted for by common (minor allele frequency, ≥5%) variants (20.4%; 95% confidence interval, 15.1%-25.6%). Only 6.4% (95% confidence interval, 5.1%-7.7%) of AF variance was attributed to variation within known AF susceptibility, cardiac arrhythmia, and cardiomyopathy gene regions. CONCLUSIONS: Genetic variation contributes substantially to AF risk. The risk for AF conferred by genomic variation is similar to that observed for several other cardiovascular diseases. Established AF loci only explain a moderate proportion of disease risk, suggesting that further genetic discovery, with an emphasis on common variation, is warranted to understand the causal genetic basis of AF.
Methods for genetic risk prediction have been widely investigated in recent years. However, most available training data involves European samples, and it is currently unclear how to accurately predict disease risk in other populations. Previous studies have used either training data from European samples in large sample size or training data from the target population in small sample size, but not both. Here, we introduce a multiethnic polygenic risk score that combines training data from European samples and training data from the target population. We applied this approach to predict type 2 diabetes (T2D) in a Latino cohort using both publicly available European summary statistics in large sample size (Neff = 40k) and Latino training data in small sample size (Neff = 8k). We attained a >70% relative improvement in prediction accuracy (from R2 = 0.027 to 0.047) compared to methods that use only one source of training data, consistent with large relative improvements in simulations. We observed a systematically lower load of T2D risk alleles in Latino individuals with more European ancestry, which could be explained by polygenic selection in ancestral European and/or Native American populations. Application to predict T2D in a South Asian UK Biobank cohort using European (Neff = 40k) and South Asian (Neff = 16k) training data attained a >70% relative improvement in prediction accuracy, and application to predict height in an African UK Biobank cohort using European (N = 113k) and African (N = 2k) training data attained a 30% relative improvement. Our work reduces the gap in polygenic risk prediction accuracy between European and non-European target populations.
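One simple way to combine two polygenic risk scores is a linear mixture whose weights are fitted in validation data. The abstract does not say how the two training sources are combined, so the linear form below is an assumption, as are the names `fit_mixing_weights` and `relative_improvement` and the toy vectors. The second function only reproduces the arithmetic behind the reported figure: going from R2 = 0.027 to R2 = 0.047 is a relative improvement of 0.047 / 0.027 - 1, about 74%, which matches the ">70%" claim.

```python
def fit_mixing_weights(prs_eur, prs_target, phenotype):
    """Least-squares weights (a_eur, a_target) for a two-score mixture.

    All three inputs are mean-centred first, so no intercept is needed.
    Solves the 2x2 normal equations directly.
    """
    def centre(values):
        mean = sum(values) / len(values)
        return [v - mean for v in values]

    x1, x2, y = centre(prs_eur), centre(prs_target), centre(phenotype)
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("the two scores are perfectly collinear")
    a_eur = (s1y * s22 - s2y * s12) / det
    a_target = (s2y * s11 - s1y * s12) / det
    return a_eur, a_target


def relative_improvement(r2_before, r2_after):
    """Relative gain in prediction accuracy, e.g. 0.74 for a 74% gain."""
    return r2_after / r2_before - 1.0


# Hypothetical toy data: the phenotype is an exact 2:3 mixture of the scores.
toy_eur = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_target = [2.0, 1.0, 4.0, 3.0, 6.0]
toy_pheno = [2.0 * e + 3.0 * t for e, t in zip(toy_eur, toy_target)]
weights = fit_mixing_weights(toy_eur, toy_target, toy_pheno)
gain = relative_improvement(0.027, 0.047)
```

With real data the weights would be estimated in samples held out from both training sets, to avoid overfitting the mixture.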
Recent work has hinted at the linkage disequilibrium (LD)-dependent architecture of human complex traits, where SNPs with low levels of LD (LLD) have larger per-SNP heritability. Here we analyzed summary statistics from 56 complex traits (average N = 101,401) by extending stratified LD score regression to continuous annotations. We determined that SNPs with low LLD have significantly larger per-SNP heritability and that roughly half of this effect can be explained by functional annotations negatively correlated with LLD, such as DNase I hypersensitivity sites (DHSs). The remaining signal is largely driven by our finding that more recent common variants tend to have lower LLD and to explain more heritability (P = 2.38 × 10(-104)); the youngest 20% of common SNPs explain 3.9 times more heritability than the oldest 20%, consistent with the action of negative selection. We also inferred jointly significant effects of other LD-related annotations and confirmed via forward simulations that they jointly predict deleterious effects.
Hand grip strength is a widely used proxy of muscular fitness, a marker of frailty, and predictor of a range of morbidities and all-cause mortality. To investigate the genetic determinants of variation in grip strength, we perform a large-scale genetic discovery analysis in a combined sample of 195,180 individuals and identify 16 loci associated with grip strength (P < 5 × 10(-8)) in combined analyses. A number of these loci contain genes implicated in structure and function of skeletal muscle fibres (ACTG1), neuronal maintenance and signal transduction (PEX14, TGFA, SYT1), or monogenic syndromes with involvement of psychomotor impairment (PEX14, LRPPRC and KANSL1). Mendelian randomization analyses are consistent with a causal effect of higher genetically predicted grip strength on lower fracture risk. In conclusion, our findings provide new biological insight into the mechanistic underpinnings of grip strength and the causal role of muscular strength in age-related morbidities and mortality.
The timing of puberty is a highly polygenic childhood trait that is epidemiologically associated with various adult diseases. Using 1000 Genomes Project-imputed genotype data in up to ∼370,000 women, we identify 389 independent signals (P < 5 × 10(-8)) for age at menarche, a milestone in female pubertal development. In Icelandic data, these signals explain ∼7.4% of the population variance in age at menarche, corresponding to ∼25% of the estimated heritability. We implicate ∼250 genes via coding variation or associated expression, demonstrating significant enrichment in neural tissues. Rare variants near the imprinted genes MKRN3 and DLK1 were identified, exhibiting large effects when paternally inherited. Mendelian randomization analyses suggest causal inverse associations, independent of body mass index (BMI), between puberty timing and risks for breast and endometrial cancers in women and prostate cancer in men. In aggregate, our findings highlight the complexity of the genetic regulation of puberty timing and support causal links with cancer susceptibility.
Background: The association of differing genotypes with disease-related phenotypic traits offers great potential to both help identify new therapeutic targets and support stratification of patients who would gain the greatest benefit from specific drug classes. Development of low-cost genotyping and sequencing has made collecting large-scale genotyping data routine in population and therapeutic intervention studies. In addition, a range of new technologies is being used to capture numerous new and complex phenotypic descriptors. As a result, genotype and phenotype datasets have grown exponentially. Genome-wide association studies associate genotypes and phenotypes using methods such as logistic regression. As existing tools for association analysis limit the efficiency by which value can be extracted from increasing volumes of data, there is a pressing need for new software tools that can accelerate association analyses on large genotype-phenotype datasets. Results: Using open innovation (OI) and contest-based crowdsourcing, the logistic regression analysis in a leading, community-standard genetics software package (PLINK 1.07) was substantially accelerated. OI allowed us to do this in <6 months by providing rapid access to highly skilled programmers with specialized, difficult-to-find skill sets. Through a crowd-based contest a combination of computational, numeric, and algorithmic approaches was identified that accelerated the logistic regression in PLINK 1.07 by 18- to 45-fold. Combining contest-derived logistic regression code with coarse-grained parallelization, multithreading, and associated changes to data initialization code further developed through distributed innovation, we achieved an end-to-end speedup of 591-fold for a data set size of 6678 subjects by 645 863 variants, compared to PLINK 1.07's logistic regression. This represents a reduction in run time from 4.8 hours to 29 seconds. Accelerated logistic regression code developed in this project has been incorporated into the PLINK2 project. Conclusions: Using iterative competition-based OI, we have developed a new, faster implementation of logistic regression for genome-wide association studies analysis. We present lessons learned and recommendations on running a successful OI process for bioinformatics.
Mixed models have become the tool of choice for genetic association studies; however, standard mixed model methods may be poorly calibrated or underpowered under family sampling bias and/or case-control ascertainment. Previously, we introduced a liability threshold-based mixed model association statistic (LTMLM) to address case-control ascertainment in unrelated samples. Here, we consider family-biased case-control ascertainment, where case and control subjects are ascertained non-randomly with respect to family relatedness. Previous work has shown that this type of ascertainment can severely bias heritability estimates; we show here that it also impacts mixed model association statistics. We introduce a family-based association statistic (LT-Fam) that is robust to this problem. Similar to LTMLM, LT-Fam is computed from posterior mean liabilities (PML) under a liability threshold model; however, LT-Fam uses published narrow-sense heritability estimates to avoid the problem of biased heritability estimation, enabling correct calibration. In simulations with family-biased case-control ascertainment, LT-Fam was correctly calibrated (average χ(2) = 1.00-1.02 for null SNPs), whereas the Armitage trend test (ATT), standard mixed model association (MLM), and case-control retrospective association test (CARAT) were mis-calibrated (e.g., average χ(2) = 0.50-1.22 for MLM, 0.89-2.65 for CARAT). LT-Fam also attained higher power than other methods in some settings. In 1,259 type 2 diabetes-affected case subjects and 5,765 control subjects from the CARe cohort, downsampled to induce family-biased ascertainment, LT-Fam was correctly calibrated whereas ATT, MLM, and CARAT were again mis-calibrated. Our results highlight the importance of modeling family sampling bias in case-control datasets with related samples.
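The simplest case of a posterior mean liability can be computed in closed form. Under a liability threshold model with standard normal liability and disease prevalence K, an individual is a case when liability exceeds the threshold t with P(liability > t) = K, and the mean liability given case or control status is the mean of a truncated normal. The sketch below covers only this single-individual case: it does not condition on relatives' phenotypes or use a heritability estimate, both of which LT-Fam relies on according to the abstract. The function name `posterior_mean_liability` is an assumption.

```python
from statistics import NormalDist


def posterior_mean_liability(is_case, prevalence):
    """Mean of a standard normal liability given case/control status.

    Cases have liability above the threshold t, where P(Z > t) = prevalence;
    controls have liability below it.
    """
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    standard_normal = NormalDist()
    threshold = standard_normal.inv_cdf(1.0 - prevalence)
    density = standard_normal.pdf(threshold)
    if is_case:
        return density / prevalence            # E[Z | Z > t]
    return -density / (1.0 - prevalence)       # E[Z | Z <= t]


# Illustrative prevalence of 8% (a hypothetical value, not from the abstract).
pml_case = posterior_mean_liability(True, 0.08)
pml_control = posterior_mean_liability(False, 0.08)
```

For a rare disease a case carries a large positive liability while a control sits only slightly below zero, and the prevalence-weighted average of the two is zero.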
Analyzing genetic differences between closely related populations can be a powerful way to detect recent adaptation. The very large sample size of the UK Biobank is ideal for using population differentiation to detect selection and enables an analysis of the UK population structure at fine resolution. In this study, analyses of 113,851 UK Biobank samples showed that population structure in the UK is dominated by five principal components (PCs) spanning six clusters: Northern Ireland, Scotland, northern England, southern England, and two Welsh clusters. Analyses of ancient Eurasians revealed that populations in the northern UK have higher levels of Steppe ancestry and that UK population structure cannot be explained as a simple mixture of Celts and Saxons. A scan for unusual population differentiation along the top PCs identified a genome-wide-significant signal of selection at the coding variant rs601338 in FUT2 (p = 9.16 × 10(-9)). In addition, by combining evidence of unusual differentiation within the UK with evidence from ancient Eurasians, we identified genome-wide-significant (p = 5 × 10(-8)) signals of recent selection at two additional loci: CYP1A2-CSK and F12. We detected strong associations between diastolic blood pressure in the UK Biobank and both the variants with selection signals at CYP1A2-CSK (p = 1.10 × 10(-19)) and the variants with ancient Eurasian selection signals at the ATXN2-SH2B3 locus (p = 8.00 × 10(-33)), implicating recent adaptation related to blood pressure.
Haplotype phasing is a fundamental problem in medical and population genetics. Phasing is generally performed via statistical phasing in a genotyped cohort, an approach that can yield high accuracy in very large cohorts but attains lower accuracy in smaller cohorts. Here we instead explore the paradigm of reference-based phasing. We introduce a new phasing algorithm, Eagle2, that attains high accuracy across a broad range of cohort sizes by efficiently leveraging information from large external reference panels (such as the Haplotype Reference Consortium; HRC) using a new data structure based on the positional Burrows-Wheeler transform. We demonstrate that Eagle2 attains a ∼20× speedup and ∼10% increase in accuracy compared to reference-based phasing using SHAPEIT2. On European-ancestry samples, Eagle2 with the HRC panel achieves >2× the accuracy of 1000 Genomes-based phasing. Eagle2 is open source and freely available for HRC-based phasing via the Sanger Imputation Service and the Michigan Imputation Server.
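The core of the positional Burrows-Wheeler transform (PBWT) can be illustrated on a toy panel. At each site the haplotypes are kept sorted by their reversed prefixes, so haplotypes that match over a long stretch ending at the current site sit next to each other. The sketch below follows the generic PBWT construction, a stable partition by allele at each site, and is not Eagle2's actual data structure, which the abstract describes only as "based on" the PBWT. The function name `pbwt_prefix_arrays` and the toy panel are assumptions.

```python
def pbwt_prefix_arrays(haplotypes):
    """Positional prefix arrays for a binary haplotype panel.

    haplotypes: list of equal-length lists of 0/1 alleles.
    Returns a list `orders` where orders[k] is the ordering of haplotype
    indices sorted by their reversed prefixes over sites k-1, k-2, ..., 0.
    orders[0] is the input order; orders[-1] uses all sites.
    """
    n_haps = len(haplotypes)
    n_sites = len(haplotypes[0]) if n_haps else 0
    order = list(range(n_haps))
    orders = [order[:]]
    for k in range(n_sites):
        # Stable partition: haplotypes carrying 0 at site k keep their
        # relative order and come first, followed by those carrying 1.
        zeros = [h for h in order if haplotypes[h][k] == 0]
        ones = [h for h in order if haplotypes[h][k] == 1]
        order = zeros + ones
        orders.append(order[:])
    return orders


# Hypothetical toy panel: haplotypes 0 and 2 are identical.
toy_panel = [
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]
toy_orders = pbwt_prefix_arrays(toy_panel)
```

Each site costs time linear in the number of haplotypes, and the identical haplotypes 0 and 2 end up adjacent in the final ordering, which is the property that makes long shared segments cheap to find.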
Horikoshi M, Beaumont RN, Day FR, Warrington NM, Kooijman MN, Fernandez-Tajes J, Feenstra B, van Zuydam NR, Gaulton KJ, Grarup N, Bradfield JP, Strachan DP, Li-Gao R, Ahluwalia TS, Kreiner E, Rueedi R, Lyytikäinen L-P, Cousminer DL, Wu Y, Thiering E, Wang CA, Have CT, Hottenga J-J, Vilor-Tejedor N, Joshi PK, Boh ETH, Ntalla I, Pitkänen N, Mahajan A, van Leeuwen EM, Joro R, Lagou V, Nodzenski M, Diver LA, Zondervan KT, Bustamante M, Marques-Vidal P, Mercader JM, Bennett AJ, Rahmioglu N, Nyholt DR, Ma RCW, Tam CHT, Tam WH, Ganesh SK, van Rooij FJA, Jones SE, Loh P-R, .., Timpson NJ, Perry JRB, Evans DM, McCarthy MI, Freathy RM. Genome-wide associations for birth weight and correlations with adult disease. Nature 2016;538(7624):248-252.
Birth weight (BW) has been shown to be influenced by both fetal and maternal factors and in observational studies is reproducibly associated with future risk of adult metabolic diseases including type 2 diabetes (T2D) and cardiovascular disease. These life-course associations have often been attributed to the impact of an adverse early life environment. Here, we performed a multi-ancestry genome-wide association study (GWAS) meta-analysis of BW in 153,781 individuals, identifying 60 loci where fetal genotype was associated with BW (P < 5 × 10(-8)). Overall, approximately 15% of variance in BW was captured by assays of fetal genetic variation. Using genetic association alone, we found strong inverse genetic correlations between BW and systolic blood pressure (Rg = -0.22, P = 5.5 × 10(-13)), T2D (Rg = -0.27, P = 1.1 × 10(-6)) and coronary artery disease (Rg = -0.30, P = 6.5 × 10(-9)). In addition, using large-cohort datasets, we demonstrated that genetic factors were the major contributor to the negative covariance between BW and future cardiometabolic risk. Pathway analyses indicated that the protein products of genes within BW-associated regions were enriched for diverse processes including insulin signalling, glucose homeostasis, glycogen biosynthesis and chromatin remodelling. There was also enrichment of associations with BW in known imprinted regions (P = 1.9 × 10(-4)). We demonstrate that life-course associations between early growth phenotypes and adult cardiometabolic disease are in part the result of shared genetic effects and identify some of the pathways through which these causal genetic effects are mediated.
Das S, Forer L, Schönherr S, Sidore C, Locke AE, Kwong A, Vrieze SI, Chew EY, Levy S, McGue M, Schlessinger D, Stambolian D, Loh P-R, Iacono WG, Swaroop A, Scott LJ, Cucca F, Kronenberg F, Boehnke M, Abecasis GR, Fuchsberger C. Next-generation genotype imputation service and methods. Nat Genet 2016;48(10):1284-1287.
Genotype imputation is a key component of genetic association studies, where it increases power, facilitates meta-analysis, and aids interpretation of signals. Genotype imputation is computationally demanding and, with current tools, typically requires access to a high-performance computing cluster and to a reference panel of sequenced genomes. Here we describe improvements to imputation machinery that reduce computational requirements by more than an order of magnitude with no loss of accuracy in comparison to standard imputation tools. We also describe a new web-based service for imputation that facilitates access to new reference panels and greatly improves user experience and productivity.
Recent work has leveraged the extensive genotyping of the Icelandic population to perform long-range phasing (LRP), enabling accurate imputation and association analysis of rare variants in target samples typed on genotyping arrays. Here we develop a fast and accurate LRP method, Eagle, that extends this paradigm to populations with much smaller proportions of genotyped samples by harnessing long (>4-cM) identical-by-descent (IBD) tracts shared among distantly related individuals. We applied Eagle to N ≈ 150,000 samples (0.2% of the British population) from the UK Biobank, and we determined that it is 1-2 orders of magnitude faster than existing methods while achieving similar or better phasing accuracy (switch error rate ≈ 0.3%, corresponding to perfect phase in a majority of 10-Mb segments). We also observed that, when used within an imputation pipeline, Eagle prephasing improved downstream imputation accuracy in comparison to prephasing in batches using existing methods, as necessary to achieve comparable computational cost.
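The switch error rate quoted above can be computed as follows, under the usual definition: among consecutive pairs of heterozygous sites, the fraction at which the inferred phase flips relative to the truth. The abstract does not spell out its exact evaluation procedure, so this definition, the function name `switch_error_rate`, and the toy inputs are assumptions.

```python
def switch_error_rate(true_hap, inferred_hap):
    """Fraction of consecutive heterozygous-site pairs with a phase switch.

    Both arguments give the allele (0 or 1) carried by one chosen haplotype
    at each heterozygous site, in genomic order. A globally flipped
    haplotype counts as zero switches, since the labelling is arbitrary.
    """
    if len(true_hap) != len(inferred_hap):
        raise ValueError("inputs must cover the same heterozygous sites")
    if len(true_hap) < 2:
        return 0.0
    agrees = [t == i for t, i in zip(true_hap, inferred_hap)]
    switches = sum(1 for prev, cur in zip(agrees, agrees[1:]) if prev != cur)
    return switches / (len(agrees) - 1)


# Hypothetical example: one switch between the second and third sites.
toy_truth = [0, 1, 1, 0, 1]
toy_inferred = [0, 1, 0, 1, 0]
toy_rate = switch_error_rate(toy_truth, toy_inferred)
```

A rate of about 0.3% means roughly one switch per few hundred heterozygous sites, which is why many 10-Mb segments contain no switch at all.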
The ages of puberty, first sexual intercourse and first birth signify the onset of reproductive ability, behavior and success, respectively. In a genome-wide association study of 125,667 UK Biobank participants, we identify 38 loci associated (P < 5 × 10(-8)) with age at first sexual intercourse. These findings were taken forward in 241,910 men and women from Iceland and 20,187 women from the Women's Genome Health Study. Several of the identified loci also exhibit associations (P < 5 × 10(-8)) with other reproductive and behavioral traits, including age at first birth (variants in or near ESR1 and RBM6-SEMA3F), number of children (CADM2 and ESR1), irritable temperament (MSRA) and risk-taking propensity (CADM2). Mendelian randomization analyses infer causal influences of earlier puberty timing on earlier first sexual intercourse, earlier first birth and lower educational attainment. In turn, likely causal consequences of earlier first sexual intercourse include reproductive, educational, psychiatric and cardiometabolic outcomes.