Publications
See Google Scholar for the full list.
2022
-
Towards practical and robust DNA-based data archiving using the yin–yang codec system Ping, Zhi, Chen, Shihong, Zhou, Guangyu, Huang, Xiaoluo, Zhu, Sha Joe, Zhang, Haoling, Lee, Henry H., Lan, Zhaojun, Cui, Jie, Chen, Tai, Zhang, Wenwei, Yang, Huanming, Xu, Xun, Church, George M., and Shen, Yue Nature Computational Science 2022
DNA is a promising data storage medium due to its remarkable durability and space-efficient storage. Early bit-to-base transcoding schemes have primarily pursued information density, at the expense of introducing biocompatibility challenges or decoding failure. Here we propose a robust transcoding algorithm named the yin–yang codec, using two rules to encode two binary bits into one nucleotide, to generate DNA sequences that are highly compatible with synthesis and sequencing technologies. We encoded two representative file formats and stored them in vitro as 200 nt oligo pools and in vivo as a ~54 kbps DNA fragment in yeast cells. Sequencing results show that the yin–yang codec exhibits high robustness and reliability for a wide variety of data types, with an average recovery rate of 99.9% above 10⁴ molecule copies and an achieved recovery rate of 87.53% at ≤10² copies. Additionally, the in vivo storage demonstration achieved an experimentally measured physical density close to the theoretical maximum.
-
Efficient ancestry and mutation simulation with msprime 1.0 Baumdicker, Franz, Bisschop, Gertjan, Goldstein, Daniel, Gower, Graham, Ragsdale, Aaron P., Tsambos, Georgia, Zhu, Sha, Eldon, Bjarki, Ellerman, E. Castedo, Galloway, Jared G., Gladstein, Ariella L., Gorjanc, Gregor, Guo, Bing, Jeffery, Ben, Kretzschmar, Warren W., Lohse, Konrad, Matschiner, Michael, Nelson, Dominic, Pope, Nathaniel S., Quinto-Cortés, Consuelo D., Rodrigues, Murillo F., Saunack, Kumar, Sellinger, Thibaut, Thornton, Kevin, Van Kemenade, Hugo, Wohns, Anthony W., Wong, Yan, Gravel, Simon, Kern, Andrew D., Koskela, Jere, Ralph, Peter L., and Kelleher, Jerome Genetics 2022
Stochastic simulation is a key tool in population genetics, since the models involved are often analytically intractable and simulation is usually the only way of obtaining ground-truth data to evaluate inferences. Because of this, a large number of specialized simulation programs have been developed, each filling a particular niche, but with largely overlapping functionality and a substantial duplication of effort. Here, we introduce msprime version 1.0, which efficiently implements ancestry and mutation simulations based on the succinct tree sequence data structure and the tskit library. We summarize msprime’s many features, and show that its performance is excellent, often many times faster and more memory efficient than specialized alternatives. These high-performance features have been thoroughly tested and validated, and built using a collaborative, open source development model, which reduces duplication of effort and promotes software quality via community engagement.
2021
-
Efficient ancestry and mutation simulation with msprime 1.0 Baumdicker, Franz, Bisschop, Gertjan, Goldstein, Daniel, Gower, Graham, Ragsdale, Aaron P., Tsambos, Georgia, Zhu, Sha, Eldon, Bjarki, Ellerman, Castedo E., Galloway, Jared G., Gladstein, Ariella L., Gorjanc, Gregor, Guo, Bing, Jeffery, Ben, Kretzschmar, Warren W., Lohse, Konrad, Matschiner, Michael, Nelson, Dominic, Pope, Nathaniel S., Quinto-Cortés, Consuelo D., Rodrigues, Murillo F., Saunack, Kumar, Sellinger, Thibaut, Thornton, Kevin, Kemenade, Hugo, Wohns, Anthony W., Wong, H. Yan, Gravel, Simon, Kern, Andrew D., Koskela, Jere, Ralph, Peter L., and Kelleher, Jerome bioRxiv 2021
-
Demographic inference from multiple whole genomes using a particle filter for continuous Markov jump processes Henderson, Donna, Zhu, Sha (Joe), Cole, Christopher B., and Lunter, Gerton PLOS ONE 2021
-
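Chamaeleo: an extensible integration and systematic evaluation platform for base encoding and decoding algorithms in DNA storage Synthetic Biology Journal (合成生物学) 2021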
2020
-
Ancient Admixture into Africa from the ancestors of non-Africans Cole, Christopher B., Zhu, Sha Joe, Mathieson, Iain, Prüfer, Kay, and Lunter, Gerton bioRxiv 2020
2019
-
Carbon-based archiving: current progress and future prospects of DNA-based data storage Ping, Zhi, Ma, Dongzhao, Huang, Xiaoluo, Chen, Shihong, Liu, Longying, Guo, Fei, Zhu, Sha Joe, and Shen, Yue GigaScience 2019
-
The origins and relatedness structure of mixed infections vary with local prevalence of P. falciparum malaria Zhu, Sha Joe, Hendry, Jason A, Almagro-Garcia, Jacob, Pearson, Richard D., Amato, Roberto, Miles, Alistair, Weiss, Daniel J, Lucas, Tim CD, Nguyen, Michele, Gething, Peter W, Kwiatkowski, Dominic, and McVean, Gil eLife 2019
2018
-
Deconvolution of multiple infections in Plasmodium falciparum from high throughput sequencing data Zhu, Sha Joe, Almagro-Garcia, Jacob, and McVean, Gil Bioinformatics 2018
-
Neonatal MicroRNA Profile Determines Endothelial Function in Offspring of Hypertensive Pregnancies Yu, Grace Z., Reilly, Svetlana, Lewandowski, Adam J., Aye, Christina Y.L., Simpson, Lisa J., Newton, Laura D., Davis, Esther F., Zhu, Sha J., Fox, Willow R., Goel, Anuj, Watkins, Hugh, Channon, Keith M., Watt, Suzanne M., Kyriakou, Theodosios, and Leeson, Paul Hypertension 2018
2017
-
Displayed Trees Do Not Determine Distinguishability Under the Network Multispecies Coalescent Zhu, Sha, and Degnan, James H. Systematic Biology 2017
2015
-
scrm: efficiently simulating long sequences using the approximated coalescent with recombination Staab, Paul R., Zhu, Sha, Metzler, Dirk, and Lunter, Gerton Bioinformatics 2015
-
Hybrid-Lambda: simulation of multiple merger and Kingman gene genealogies in species networks and species trees Zhu, Sha, Degnan, James H., Goldstien, Sharyn J., and Eldon, Bjarki BMC Bioinformatics 2015
There has been increasing interest in coalescent models that admit multiple mergers of ancestral lineages, and in modeling hybridization and coalescence simultaneously.
-
Clades and clans: a comparison study of two evolutionary models Zhu, Sha, Than, Cuong, and Wu, Taoyang Journal of Mathematical Biology 2015
The Yule–Harding–Kingman (YHK) model and the proportional to distinguishable arrangements (PDA) model are two binary tree generating models that are widely used in evolutionary biology. Understanding the distributions of clade sizes under these two models provides valuable insights into macro-evolutionary processes, and is important in hypothesis testing and Bayesian analyses in phylogenetics. Here we show that these distributions are log-convex, which implies that very large clades or very small clades are more likely to occur under these two models. Moreover, we prove that there exists a critical value κ(n) for each n ⩾ 4 such that for a given clade with size k, the probability that this clade is contained in a random tree with n leaves generated under the YHK model is higher than that under the PDA model if 1 < k < κ(n), and lower if κ(n) < k < n. Finally, we extend our results to binary unrooted trees, and obtain similar results for the distributions of clan sizes.
2013
-
Does random tree puzzle produce Yule–Harding trees in the many-taxon limit? Zhu, Sha, and Steel, Mike Mathematical Biosciences 2013
It has been suggested that a random tree puzzle (RTP) process leads to a Yule–Harding (YH) distribution, when the number of taxa becomes large. In this study, we formalize this conjecture, and we prove that the two tree distributions converge for two particular properties, which suggests that the conjecture may be true. However, we present statistical evidence that, while the two distributions are close, the RTP appears to converge on a different distribution than does the YH. By way of contrast, in the concluding section we show that the maximum parsimony method applied to random two-state data leads to a very different (PDA, or uniform) distribution on trees.
2011
-
Clades, clans, and reciprocal monophyly under neutral evolutionary models Zhu, Sha, Degnan, James H., and Steel, Mike Theoretical Population Biology 2011
The Yule model and the coalescent model are two neutral stochastic models for generating trees in phylogenetics and population genetics, respectively. Although these models are quite different, they lead to identical distributions concerning the probability that pre-specified groups of taxa form monophyletic groups (clades) in the tree. We extend earlier work to derive exact formulae for the probability of finding one or more groups of taxa as clades in a rooted tree, or as ‘clans’ in an unrooted tree. Our findings are relevant for calculating the statistical significance of observed monophyly and reciprocal monophyly in phylogenetics.