Pickrell JK, Patterson N, Barbieri C, Berthold F, Gerlach L, Güldemann T, Kure B, Mpoloka SW, Nakagawa H, Naumann C, Lipson M, Loh P-R, Lachance J, Mountain J, Bustamante CD, Berger B, Tishkoff SA, Henn BM, Stoneking M, Reich D, Pakendorf B. The genetic prehistory of southern Africa. Nat Commun 2012;3:1143.
Southern and eastern African populations that speak non-Bantu languages with click consonants are known to harbour some of the most ancient genetic lineages in humans, but their relationships are poorly understood. Here, we report data from 23 populations analysed at over half a million single-nucleotide polymorphisms, using a genome-wide array designed for studying human history. The southern African Khoisan fall into two genetic groups, loosely corresponding to the northwestern and southeastern Kalahari, which we show separated within the last 30,000 years. We find that all individuals derive at least a few percent of their genomes from admixture with non-Khoisan populations that began ∼1,200 years ago. In addition, the East African Hadza and Sandawe derive a fraction of their ancestry from admixture with a population related to the Khoisan, supporting the hypothesis of an ancient link between southern and eastern Africa.
BACKGROUND: Causal graphs are an increasingly popular tool for the analysis of biological datasets. In particular, signed causal graphs--directed graphs whose edges additionally have a sign denoting upregulation or downregulation--can be used to model regulatory networks within a cell. Such models allow prediction of downstream effects of regulation of biological entities; conversely, they also enable inference of causative agents behind observed expression changes. However, due to their complex nature, signed causal graph models present special challenges with respect to assessing statistical significance. In this paper we frame and solve two fundamental computational problems that arise in practice when computing appropriate null distributions for hypothesis testing. RESULTS: First, we show how to compute a p-value for agreement between observed and model-predicted classifications of gene transcripts as upregulated, downregulated, or neither. Specifically, how likely are the classifications to agree to the same extent under the null distribution of the observed classification being randomized? This problem, which we call "Ternary Dot Product Distribution" owing to its mathematical form, can be viewed as a generalization of Fisher's exact test to ternary variables. We present two computationally efficient algorithms for computing the Ternary Dot Product Distribution and investigate its combinatorial structure analytically and numerically to establish computational complexity bounds. Second, we develop an algorithm for efficiently performing random sampling of causal graphs. This enables p-value computation under a different, equally important null distribution obtained by randomizing the graph topology but keeping fixed its basic structure: connectedness and the positive and negative in- and out-degrees of each vertex. We provide an algorithm for sampling a graph from this distribution uniformly at random.
We also highlight theoretical challenges unique to signed causal graphs; previous work on graph randomization has studied undirected graphs and directed but unsigned graphs. CONCLUSION: We present algorithmic solutions to two statistical significance questions necessary to apply the causal graph methodology, a powerful tool for biological network analysis. The algorithms we present are both fast and provably correct. Our work may be of independent interest in non-biological contexts as well, as it generalizes mathematical results that have been studied extensively in other fields.
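To illustrate the first problem, the null model behind the Ternary Dot Product Distribution can be sketched with a permutation (Monte Carlo) estimate: score agreement between predicted and observed ternary calls by their dot product, then ask how often a random permutation of the observed labels does at least as well. This is a sketch of the null model only, not the paper's exact and efficient algorithms:

```python
import random

def ternary_dot_pvalue(pred, obs, n_perm=10000, seed=0):
    """Monte Carlo p-value for agreement between model-predicted and observed
    ternary calls (+1 up, -1 down, 0 neither), scored by their dot product.
    Null: the observed labels are randomly permuted across transcripts."""
    rng = random.Random(seed)
    score = sum(p * o for p, o in zip(pred, obs))
    shuffled = list(obs)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if sum(p * o for p, o in zip(pred, shuffled)) >= score:
            hits += 1
    # Add-one smoothing so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

Perfect agreement on ten up- and ten down-calls yields a very small p-value, while an all-zero prediction is never better than chance (p = 1), matching the intuition of a ternary analogue of Fisher's exact test.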
Small organisms can be used as biomonitoring tools to assess chemicals in the environment. Chemical stressors are especially hard to assess and monitor when present as complex mixtures. Here, fifteen polymerase chain reaction assays targeting Daphnia magna genes were calibrated to responses elicited in D. magna exposed for 24 h to five different doses each of the munitions constituents 2,4,6-trinitrotoluene, 2,4-dinitrotoluene, 2,6-dinitrotoluene, trinitrobenzene, dinitrobenzene, or 1,3,5-trinitro-1,3,5-triazacyclohexane. A piecewise-linear model for log-fold expression changes in gene assays was used to predict response to munitions mixtures and contaminated groundwater under the assumption that chemical effects were additive. The correlations of model predictions with actual expression changes ranged from 0.12 to 0.78 with an average of 0.5. To better understand possible mixture effects, gene expression changes from all treatments were compared using high-density microarrays. Whereas mixtures and groundwater exposures had genes and gene functions in common with single chemical exposures, unique functions were also affected, which was consistent with the nonadditivity of chemical effects in these mixtures. These results suggest that, while gene behavior in response to chemical exposure can be partially predicted based on chemical exposure, estimation of the composition of mixtures from chemical responses is difficult without further understanding of gene behavior in mixtures. Future work will need to examine additive and nonadditive mixture effects using a much greater range of different chemical classes in order to clarify the behavior and predictability of complex mixtures.
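The additivity assumption described above can be sketched for a single gene: each chemical's calibrated piecewise-linear dose-response is interpolated at its concentration in the mixture, and the log-fold contributions are summed. All doses and response values below are hypothetical placeholders, not data from the study:

```python
def interp(x, xs, ys):
    """Piecewise-linear interpolation of (xs, ys) at x, clamped at the ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Hypothetical single-chemical calibrations for one gene: log2 fold-change
# measured at five doses (values are illustrative, not study data).
DOSES = [0.1, 0.5, 1.0, 5.0, 10.0]
LOG2FC = {
    "TNT": [0.0, 0.2, 0.5, 1.1, 1.4],
    "RDX": [0.0, -0.1, -0.3, -0.8, -1.0],
}

def additive_prediction(mixture):
    """Predict the gene's log2 fold-change for a mixture under the
    assumption that the single-chemical responses simply add."""
    return sum(interp(conc, DOSES, LOG2FC[chem])
               for chem, conc in mixture.items())
```

Under this sketch, `additive_prediction({"TNT": 1.0, "RDX": 5.0})` sums the two interpolated responses (0.5 and -0.8), which is exactly the kind of prediction the nonadditive mixture effects reported above would cause to miss.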
Soil contamination near munitions plants and testing grounds is a serious environmental concern that can result in the formation of tissue chemical residue in exposed animals. Quantitative prediction of tissue residue remains a challenging task despite long-standing interest, as tissue residue formation is the result of many dynamic processes including uptake, transformation, and assimilation. The availability of high-dimensional microarray gene expression data presents a new opportunity for computational predictive modeling of tissue residue from changes in expression profile. Here we analyzed a 240-sample data set with measurements of transcriptomic-wide gene expression and tissue residue of two chemicals, 2,4,6-trinitrotoluene (TNT) and 1,3,5-trinitro-1,3,5-triazacyclohexane (RDX), in the earthworm Eisenia fetida. We applied two different computational approaches, LASSO (Least Absolute Shrinkage and Selection Operator) and RF (Random Forest), to identify predictor genes and built predictive models. Each approach was tested alone and in combination with a prior variable selection procedure that involved the Wilcoxon rank-sum test and HOPACH (Hierarchical Ordered Partitioning And Collapsing Hybrid). Model evaluation results suggest that LASSO was the best performer of minimum complexity on the TNT data set, whereas the combined Wilcoxon-HOPACH-RF approach achieved the highest prediction accuracy on the RDX data set. Our models separately identified two small sets of ca. 30 predictor genes for RDX and TNT. We have demonstrated that both LASSO and RF are powerful tools for quantitative prediction of tissue residue. They also leave more unknown than explained, however, allowing room for improvement with other computational methods and extension to mixture contamination scenarios.
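A minimal sketch of the Wilcoxon rank-sum pre-filter, one piece of the Wilcoxon-HOPACH variable selection (the HOPACH clustering and the LASSO/RF models themselves are not shown). This uses the normal approximation with midranks for ties and omits the tie correction to the variance:

```python
import math

def ranksum_z(x, y):
    """Wilcoxon rank-sum z-statistic for two samples (normal approximation,
    midranks for ties; the tie correction to the variance is omitted)."""
    vals = list(x) + list(y)
    order = sorted(range(len(vals)), key=lambda i: vals[i])
    ranks = [0.0] * len(vals)
    i = 0
    while i < len(vals):                    # assign midranks to tied runs
        j = i
        while j + 1 < len(vals) and vals[order[j + 1]] == vals[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    n1, n2 = len(x), len(y)
    w = sum(ranks[:n1])                     # rank sum of the first sample
    mean = n1 * (n1 + n2 + 1) / 2
    var = n1 * n2 * (n1 + n2 + 1) / 12
    return (w - mean) / math.sqrt(var)

def ranksum_pvalue(x, y):
    """Two-sided p-value from the normal approximation."""
    return math.erfc(abs(ranksum_z(x, y)) / math.sqrt(2))
```

In a pre-filter, each gene's expression in exposed versus control samples would be passed through `ranksum_pvalue`, and only genes below a chosen threshold would be handed to the downstream clustering and model-fitting steps.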
A major goal of large-scale genomics projects is to enable the use of data from high-throughput experimental methods to predict complex phenotypes such as disease susceptibility. The DREAM5 Systems Genetics B Challenge solicited algorithms to predict soybean plant resistance to the pathogen Phytophthora sojae from training sets including phenotype, genotype, and gene expression data. The challenge test set was divided into three subcategories, one requiring prediction based on only genotype data, another on only gene expression data, and the third on both genotype and gene expression data. Here we present our approach, primarily using regularized regression, which received the best-performer award for subchallenge B2 (gene expression only). We found that despite the availability of 941 genotype markers and 28,395 gene expression features, optimal models determined by cross-validation experiments typically used fewer than ten predictors, underscoring the importance of strong regularization in noisy datasets with far more features than samples. We also present substantial analysis of the training and test setup of the challenge, identifying high variance in performance on the gold standard test sets.
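The strong regularization described above can be illustrated with a toy LASSO fit by cyclic coordinate descent. This is a sketch of the general technique, not the challenge entry's actual code, and in practice the penalty `lam` would be chosen by cross-validation:

```python
def soft_threshold(rho, lam):
    """Soft-thresholding operator: shrinks rho toward zero by lam."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_cd(X, y, lam, n_iter=100):
    """LASSO by cyclic coordinate descent on raw (unstandardized) features.
    X: list of sample rows; y: list of targets; lam: L1 penalty strength."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Partial residuals with feature j's contribution removed.
            r = [y[i] - sum(beta[k] * X[i][k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n))
            z = sum(X[i][j] ** 2 for i in range(n))
            beta[j] = soft_threshold(rho, lam) / z if z > 0 else 0.0
    return beta
```

On a toy problem where the target depends on only the first of two features, the L1 penalty drives the irrelevant coefficient exactly to zero, the same sparsity behind models that use fewer than ten of tens of thousands of available features.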
We demonstrate that the ratio of group to phase velocity has a simple relationship to the orientation of the electromagnetic field. In nondispersive materials, opposite group and phase velocity corresponds to fields that are mostly oriented in the propagation direction. More generally, this relationship (including the case of dispersive and negative-index materials) offers a perspective on the phenomena of backward waves and left-handed media. As an application of this relationship, we demonstrate and explain an irrecoverable failure of perfectly matched layer absorbing boundaries in computer simulations for constant cross-section waveguides with backward-wave modes and suggest an alternative in the form of adiabatic isotropic absorbers.
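A toy numerical check of the backward-wave notion, using a hypothetical dispersion relation (purely illustrative, not one from the paper): opposite-signed group and phase velocity, with the group velocity computed by a finite difference.

```python
def phase_group(omega, k, dk=1e-6):
    """Phase velocity omega/k and group velocity d(omega)/dk, the latter
    estimated by a central finite difference."""
    vp = omega(k) / k
    vg = (omega(k + dk) - omega(k - dk)) / (2 * dk)
    return vp, vg

# Toy backward-wave branch: omega(k) = a/k (illustrative only) gives
# vp = a/k**2 > 0 but vg = -a/k**2 < 0, so energy flows against the phase.
vp, vg = phase_group(lambda k: 1.0 / k, 2.0)
```

For this toy branch the ratio vg/vp is -1 everywhere; a negative ratio of group to phase velocity is precisely the backward-wave regime in which the perfectly matched layer failure discussed above arises.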