Publications
Publications by category, in reverse chronological order.
2024
- [Deepurify] A multi-modal deep language model for contaminant removal from metagenome-assembled genomes. Bohao Zou, Jingjing Wang, Yi Ding, and 6 more authors. 2024. Publisher: Nature Publishing Group.
Metagenome-assembled genomes (MAGs) offer valuable insights into the exploration of microbial dark matter using metagenomic sequencing data. However, there is growing concern that contamination in MAGs may substantially affect the results of downstream analysis. Current MAG decontamination tools primarily rely on marker genes and do not fully use the contextual information of genomic sequences. To overcome this limitation, we introduce Deepurify for MAG decontamination. Deepurify uses a multi-modal deep language model with contrastive learning to match microbial genomic sequences with their taxonomic lineages. It allocates contigs within a MAG to a MAG-separated tree and applies a tree traversal algorithm to partition MAGs into sub-MAGs, with the goal of maximizing the number of high- and medium-quality sub-MAGs. Here we show that Deepurify outperformed MDMcleaner and MAGpurify on simulated data, CAMI datasets and real-world datasets with varying complexities. Deepurify increased the number of high-quality MAGs by 20.0% in soil, 45.1% in ocean, 45.5% in plants, 33.8% in freshwater and 28.5% in human faecal metagenomic sequencing datasets.
- [Pangaea] Exploring high-quality microbial genomes by assembling short-reads with long-range connectivity. Zhenmiao Zhang, Jin Xiao, Hongbo Wang, and 9 more authors. Nature Communications, May 2024. Publisher: Nature Publishing Group.
Although long-read sequencing enables the generation of complete genomes for unculturable microbes, its high cost limits the widespread adoption of long-read sequencing in large-scale metagenomic studies. An alternative method is to assemble short-reads with long-range connectivity, which can be a cost-effective way to generate high-quality microbial genomes. Here, we develop Pangaea, a bioinformatic approach designed to enhance metagenome assembly using short-reads with long-range connectivity. Pangaea leverages connectivity derived from physical barcodes of linked-reads or virtual barcodes by aligning short-reads to long-reads. Pangaea utilizes a deep learning-based read binning algorithm to assemble co-barcoded reads exhibiting similar sequence contexts and abundances, thereby improving the assembly of high- and medium-abundance microbial genomes. Pangaea also leverages a multi-thresholding algorithm strategy to refine assembly for low-abundance microbes. We benchmark Pangaea on linked-reads and a combination of short- and long-reads from simulation data, mock communities and human gut metagenomes. Pangaea achieves significantly higher contig continuity as well as more near-complete metagenome-assembled genomes (NCMAGs) than the existing assemblers. Pangaea also generates three complete and circular NCMAGs on the human gut microbiomes.
- Structural and Functional Disparities within the Human Gut Virome in Terms of Genome Topology and Representative Genome Selection. Werner P. Veldsman, Chao Yang, Zhenmiao Zhang, and 3 more authors. May 2024. Number: 1. Publisher: Multidisciplinary Digital Publishing Institute.
Circularity confers protection to viral genomes where linearity falls short, thereby fulfilling the form follows function aphorism. However, a shift away from morphology-based classification toward the molecular and ecological classification of viruses is currently underway within the field of virology. Recent years have seen drastic changes in the International Committee on Taxonomy of Viruses’ operational definitions of viruses, particularly for the tailed phages that inhabit the human gut. After the abolition of the order Caudovirales, these tailed phages are best defined as members of the class Caudoviricetes. To determine the epistemological value of genome topology in the context of the human gut virome, we designed a set of seven experiments to assay the impact of genome topology and representative viral selection on biological interpretation. Using Oxford Nanopore long reads for viral genome assembly coupled with Illumina short-read polishing, we showed that circular and linear virus genomes differ remarkably in terms of genome quality, GC skew, transfer RNA gene frequency, structural variant frequency, cross-reference functional annotation (COG, KEGG, Pfam, and TIGRfam), state-of-the-art marker-based classification, and phage–host interaction. Furthermore, the disparity profile changes during dereplication. In particular, our phage–host interaction results demonstrated that proportional abundances cannot be meaningfully compared without due regard for genome topology and dereplication threshold, which necessitates standardized reporting. As a best practice guideline, we recommend that comparative studies of the human gut virome always report the ratio of circular to linear viral genomes along with the dereplication threshold so that structural and functional metrics can be placed into context when assessing biologically relevant metagenomic properties such as proportional abundance.
- LRTK: a platform agnostic toolkit for linked-read analysis of both human genome and metagenome. Chao Yang, Zhenmiao Zhang, Yufen Huang, and 7 more authors. May 2024.
Linked-read sequencing technologies generate high-base quality short reads that contain extrapolative information on long-range DNA connectedness. These advantages of linked-read technologies are well known and have been demonstrated in many human genomic and metagenomic studies. However, existing linked-read analysis pipelines (e.g., Long Ranger) were primarily developed to process sequencing data from the human genome and are not suited for analyzing metagenomic sequencing data. Moreover, linked-read analysis pipelines are typically limited to 1 specific sequencing platform. To address these limitations, we present the Linked-Read ToolKit (LRTK), a unified and versatile toolkit for platform agnostic processing of linked-read sequencing data from both human genome and metagenome. LRTK provides functions to perform linked-read simulation, barcode sequencing error correction, barcode-aware read alignment and metagenome assembly, reconstruction of long DNA fragments, taxonomic classification and quantification, and barcode-assisted genomic variant calling and phasing. LRTK has the ability to process multiple samples automatically and provides users with the option to generate reproducible reports during processing of raw sequencing data and at multiple checkpoints throughout downstream analysis. We applied LRTK on linked reads from simulation, mock community, and real datasets for both human genome and metagenome. We showcased LRTK’s ability to generate comparative performance results from preceding benchmark studies and to report these results in publication-ready HTML document plots. LRTK provides comprehensive and flexible modules along with an easy-to-use Python-based workflow for processing linked-read sequencing datasets, thereby filling the current gap in the field caused by platform-centric genome-specific linked-read data analysis tools.
2023
- Ruminococcus gnavus plays a pathogenic role in diarrhea-predominant irritable bowel syndrome by increasing serotonin biosynthesis. Lixiang Zhai, Chunhua Huang, Ziwan Ning, and 18 more authors. May 2023.
Diarrhea-predominant irritable bowel syndrome (IBS-D), a globally prevalent functional gastrointestinal (GI) disorder, is associated with elevated serotonin that increases gut motility. While anecdotal evidence suggests that the gut microbiota contributes to serotonin biosynthesis, mechanistic insights are limited. We determined that the bacterium Ruminococcus gnavus plays a pathogenic role in IBS-D. Monocolonization of germ-free mice with R. gnavus induced IBS-D-like symptoms, including increased GI transit and colonic secretion, by stimulating the production of peripheral serotonin. R. gnavus-mediated catabolism of dietary phenylalanine and tryptophan generated phenethylamine and tryptamine that directly stimulated serotonin biosynthesis in intestinal enterochromaffin cells via a mechanism involving activation of trace amine-associated receptor 1 (TAAR1). This R. gnavus-driven increase in serotonin levels elevated GI transit and colonic secretion but was abrogated upon TAAR1 inhibition. Collectively, our study provides molecular and pathogenetic insights into how gut microbial metabolites derived from dietary essential amino acids affect serotonin-dependent control of gut motility.
- Benchmarking multi-platform sequencing technologies for human genome assembly. Jingjing Wang, Werner Pieter Veldsman, Xiaodong Fang, and 4 more authors. May 2023.
Genome assembly is a computational technique that involves piecing together deoxyribonucleic acid (DNA) fragments generated by sequencing technologies to create a comprehensive and precise representation of the entire genome. Generating a high-quality human reference genome is a crucial prerequisite for comprehending human biology, and it is also vital for downstream genomic variation analysis. Many efforts have been made over the past few decades to create a complete and gapless reference genome for humans by using a diverse range of advanced sequencing technologies. Several available tools are aimed at enhancing the quality of haploid and diploid human genome assemblies, which include contig assembly, polishing of contig errors, scaffolding and variant phasing. Selecting the appropriate tools and technologies remains a daunting task, although several studies have investigated the pros and cons of different assembly strategies. The goal of this paper was to benchmark various strategies for human genome assembly by combining sequencing technologies and tools on two publicly available samples (NA12878 and NA24385) from Genome in a Bottle. We then compared their performances in terms of continuity, accuracy, completeness, variant calling and phasing. We observed that PacBio HiFi long-reads are the optimal choice for generating an assembly with low base errors. On the other hand, we were able to produce the most continuous contigs with Oxford Nanopore long-reads, but they may require further polishing to improve on quality. We recommend using short-reads rather than long-reads themselves to improve the base accuracy of contigs from Oxford Nanopore long-reads. Hi-C is the best choice for chromosome-level scaffolding because it can capture the longest-range DNA connectedness compared to 10× linked-reads and Bionano optical maps. However, a combination of multiple technologies can be used to further improve the quality and completeness of genome assembly.
For diploid assembly, hifiasm is the best tool for human diploid genome assembly using PacBio HiFi and Hi-C data. Looking to the future, we expect that further advancements in human diploid assemblers will leverage the power of PacBio HiFi reads and other technologies with long-range DNA connectedness to enable the generation of high-quality, chromosome-level and haplotype-resolved human genome assemblies.
- Accurate and interpretable gene expression imputation on scRNA-seq data using IGSimpute. Ke Xu, ChinWang Cheong, Werner P Veldsman, and 3 more authors. May 2023.
Single-cell ribonucleic acid sequencing (scRNA-seq) enables the quantification of gene expression at the transcriptomic level with single-cell resolution, enhancing our understanding of cellular heterogeneity. However, the excessive missing values present in scRNA-seq data hinder downstream analysis. While numerous imputation methods have been proposed to recover scRNA-seq data, high imputation performance often comes with low or no interpretability. Here, we present IGSimpute, an accurate and interpretable imputation method for recovering missing values in scRNA-seq data with an interpretable instance-wise gene selection layer (GSL). IGSimpute outperforms 12 other state-of-the-art imputation methods on 13 out of 17 datasets from different scRNA-seq technologies with the lowest mean squared error as the chosen benchmark metric. We demonstrate that IGSimpute can give unbiased estimates of the missing values compared to other methods, regardless of whether the average gene expression values are small or large. Clustering results of imputed profiles show that IGSimpute offers statistically significant improvement over other imputation methods. By taking the heart-and-aorta and the limb muscle tissues as examples, we show that IGSimpute can also denoise gene expression profiles by removing outlier entries with unexpectedly high expression values via the instance-wise GSL. We also show that genes selected by the instance-wise GSL could indicate the age of B cells from bladder fat tissue of the Tabula Muris Senis atlas. IGSimpute can impute one million cells in 64 min and is thus applicable to large datasets.
- Benchmarking genome assembly methods on metagenomic sequencing data. Zhenmiao Zhang, Chao Yang, Werner Pieter Veldsman, and 2 more authors. May 2023.
Metagenome assembly is an efficient approach to reconstruct microbial genomes from metagenomic sequencing data. Although short-read sequencing has been widely used for metagenome assembly, linked- and long-read sequencing have shown their advancements in assembly by providing long-range DNA connectedness. Many metagenome assembly tools were developed to simplify the assembly graphs and resolve the repeats in microbial genomes. However, there remains no comprehensive evaluation of metagenomic sequencing technologies, and there is a lack of practical guidance on selecting the appropriate metagenome assembly tools. This paper presents a comprehensive benchmark of 19 commonly used assembly tools applied to metagenomic sequencing datasets obtained from simulation, mock communities or human gut microbiomes. These datasets were generated using mainstream sequencing platforms, such as Illumina and BGISEQ short-read sequencing, 10x Genomics linked-read sequencing, and PacBio and Oxford Nanopore long-read sequencing. The assembly tools were extensively evaluated against many criteria, which revealed that long-read assemblers generated high contig contiguity but failed to reveal some medium- and high-quality metagenome-assembled genomes (MAGs). Linked-read assemblers obtained the highest number of overall near-complete MAGs from the human gut microbiomes. Hybrid assemblers using both short- and long-read sequencing were promising methods to improve both total assembly length and the number of near-complete MAGs. This paper also discussed the running time and peak memory consumption of these assembly tools and provided practical guidance on selecting them.
- A comprehensive investigation of statistical and machine learning approaches for predicting complex human diseases on genomic variants. Chonghao Wang, Jing Zhang, Werner Pieter Veldsman, and 2 more authors. May 2023.
Quantifying an individual’s risk for common diseases is an important goal of precision health. The polygenic risk score (PRS), which aggregates multiple risk alleles of candidate diseases, has emerged as a standard approach for identifying high-risk individuals. Although several studies have been performed to benchmark the PRS calculation tools and assess their potential to guide future clinical applications, some issues remain to be further investigated, such as lacking (i) various simulated data with different genetic effects; (ii) evaluation of machine learning models and (iii) evaluation on multiple ancestries studies. In this study, we systematically validated and compared 13 statistical methods, 5 machine learning models and 2 ensemble models using simulated data with additive and genetic interaction models, 22 common diseases with internal training sets, 4 common diseases with external summary statistics and 3 common diseases for trans-ancestry studies in UK Biobank. The statistical methods were better in simulated data from additive models and machine learning models have edges for data that include genetic interactions. Ensemble models are generally the best choice by integrating various statistical methods. LDpred2 outperformed the other standalone tools, whereas PRS-CS, lassosum and DBSLMM showed comparable performance. We also identified that disease heritability strongly affected the predictive performance of all methods. Both the number and effect sizes of risk SNPs are important; and sample size strongly influences the performance of all methods. For the trans-ancestry studies, we found that the performance of most methods became worse when training and testing sets were from different populations.
- PriVar: a toolkit for prioritizing SNVs and indels from next-generation sequencing data. Lu Zhang, Jing Zhang, Jing Yang, and 3 more authors. May 2023.
Next-generation sequencing has become a valuable tool for detecting mutations involved in Mendelian diseases. However, it is a challenge to identify the small subset of functionally important mutations from tens of thousands of rare variants in a whole exome/genome. Therefore, we developed a toolkit called PriVar, a systematic prioritization pipeline that takes into consideration calling quality of the variants, their predicted functional impact, known connection of the gene to the disease and the number of mutations in a gene, and inference from linkage analysis. Availability: Executable jar package is available at http://paed.hku.hk/uploadarea/yangwl/html/software.html. Contact: yangwl@hkucc.hku.hk. Supplementary information: Supplementary data are available at Bioinformatics online.
2022
- A machine learning model for disease risk prediction by integrating genetic and non-genetic factors. Yu Xu, Chonghao Wang, Zeming Li, and 4 more authors. May 2022.
- dynDeepDRIM: a dynamic deep learning model to infer direct regulatory interactions using time-course single-cell gene expression data. Yu Xu, Jiaxing Chen, Aiping Lyu, and 2 more authors. May 2022.
Time-course single-cell RNA sequencing (scRNA-seq) data have been widely used to explore dynamic changes in gene expression of transcription factors (TFs) and their target genes. This information is useful to reconstruct cell-type-specific gene regulatory networks (GRNs). However, the existing tools are commonly designed to analyze either time-course bulk gene expression data or static scRNA-seq data via pseudo-time cell ordering. Few methods successfully utilize the information from multiple time points while also considering the characteristics of scRNA-seq data. We proposed dynDeepDRIM, a novel deep learning model to reconstruct GRNs using time-course scRNA-seq data. It represents the joint expression of a gene pair as an image and utilizes the image of the target TF–gene pair and the ones of the potential neighbors to reconstruct GRNs from time-course scRNA-seq data. dynDeepDRIM can effectively remove the transitive TF–gene interactions by considering neighborhood context and model the gene expression dynamics using high-dimensional tensors. We compared dynDeepDRIM with six GRN reconstruction methods on both simulation and four real time-course scRNA-seq data. dynDeepDRIM achieved substantially better performance than the other methods in inferring TF–gene interactions and eliminated the false positives effectively. We also applied dynDeepDRIM to annotate gene functions and found it achieved evidently better performance than the other tools due to considering the neighbor genes.
2021
- DeepDRIM: a deep neural network to reconstruct cell-type-specific gene regulatory network using single-cell RNA-seq data. Jiaxing Chen, ChinWang Cheong, Liang Lan, and 5 more authors. May 2021.
Single-cell RNA sequencing has made it possible to capture gene activities at single-cell resolution, thus allowing reconstruction of cell-type-specific gene regulatory networks (GRNs). The available algorithms for reconstructing GRNs are commonly designed for bulk RNA-seq data, and few of them are applicable to analyze scRNA-seq data by dealing with the dropout events and cellular heterogeneity. In this paper, we represent the joint gene expression distribution of a gene pair as an image and propose a novel supervised deep neural network called DeepDRIM which utilizes the image of the target TF-gene pair and the ones of the potential neighbors to reconstruct GRN from scRNA-seq data. Due to the consideration of TF-gene pair’s neighborhood context, DeepDRIM can effectively eliminate the false positives caused by transitive gene–gene interactions. We compared DeepDRIM with nine GRN reconstruction algorithms designed for either bulk or single-cell RNA-seq data. It achieves evidently better performance for the scRNA-seq data collected from eight cell lines. The simulated data show that DeepDRIM is robust to the dropout rate, the cell number and the size of the training data. We further applied DeepDRIM to the scRNA-seq gene expression of B cells from the bronchoalveolar lavage fluid of the patients with mild and severe coronavirus disease 2019. We focused on the cell-type-specific GRN alteration and observed targets of TFs that were differentially expressed between the two statuses to be enriched in lysosome, apoptosis, response to decreased oxygen level and microtubule, which had been proved to be associated with coronavirus infection.
- METAMVGL: a multi-view graph-based metagenomic contig binning algorithm by integrating assembly and paired-end graphs. Zhenmiao Zhang and Lu Zhang. May 2021.
Due to the complexity of microbial communities, de novo assembly on next generation sequencing data is commonly unable to produce complete microbial genomes. Metagenome assembly binning becomes an essential step that could group the fragmented contigs into clusters to represent microbial genomes based on contigs’ nucleotide compositions and read depths. These features work well on the long contigs, but are not stable for the short ones. Contigs can be linked by sequence overlap (assembly graph) or by the paired-end reads aligned to them (PE graph), where the linked contigs have high chance to be derived from the same clusters.
- A review of computational tools for generating metagenome-assembled genomes from metagenomic sequencing data. Chao Yang, Debajyoti Chowdhury, Zhenmiao Zhang, and 4 more authors. May 2021.
Metagenomic sequencing provides a culture-independent avenue to investigate the complex microbial communities by constructing metagenome-assembled genomes (MAGs). A MAG represents a microbial genome by a group of sequences from genome assembly with similar characteristics. It enables us to identify novel species and understand their potential functions in a dynamic ecosystem. Many computational tools have been developed to construct and annotate MAGs from metagenomic sequencing, however, there is a prominent gap to comprehensively introduce their background and practical performance. In this paper, we have thoroughly investigated the computational tools designed for both upstream and downstream analyses, including metagenome assembly, metagenome binning, gene prediction, functional annotation, taxonomic classification, and profiling. We have categorized the commonly used tools into unique groups based on their functional background and introduced the underlying core algorithms and associated information to demonstrate a comparative outlook. Furthermore, we have emphasized the computational requisition and offered guidance to the users to select the most efficient tools. Finally, we have indicated current limitations, potential solutions, and future perspectives for further improving the tools of MAG construction and annotation. We believe that our work provides a consolidated resource for the current stage of MAG studies and shed light on the future development of more effective MAG analysis tools on metagenomic sequencing.
2020
- A comprehensive investigation of metagenome assembly by linked-read sequencing. Lu Zhang, Xiaodong Fang, Herui Liao, and 6 more authors. May 2020.
The human microbiota are complex systems with important roles in our physiological activities and diseases. Sequencing the microbial genomes in the microbiota can help in our interpretation of their activities. The vast majority of the microbes in the microbiota cannot be isolated for individual sequencing. Current metagenomics practices use short-read sequencing to simultaneously sequence a mixture of microbial genomes. However, this results in ambiguities during genome assembly, leading to unsatisfactory microbial genome completeness and contig continuity. Linked-read sequencing is able to remove some of these ambiguities by attaching the same barcode to the reads from a long DNA fragment (10–100 kb), thus improving metagenome assembly. However, it is not clear how the choices for several parameters in the use of linked-read sequencing affect the assembly quality.
2019
- Assessment of network module identification across complex diseases. Sarvenaz Choobdar, Mehmet E. Ahsen, Jake Crawford, and 20 more authors. Nature Methods, May 2019. Publisher: Nature Publishing Group.
Many bioinformatics methods have been proposed for reducing the complexity of large gene or protein networks into relevant subnetworks or modules. Yet, how such methods compare to each other in terms of their ability to identify disease-relevant modules in different types of network remains poorly understood. We launched the ‘Disease Module Identification DREAM Challenge’, an open competition to comprehensively assess module identification methods across diverse protein-protein interaction, signaling, gene co-expression, homology and cancer-gene networks. Predicted network modules were tested for association with complex traits and diseases using a unique collection of 180 genome-wide association studies. Our robust assessment of 75 module identification methods reveals top-performing algorithms, which recover complementary trait-associated modules. We find that most of these modules correspond to core disease-relevant pathways, which often comprise therapeutic targets. This community challenge establishes biologically interpretable benchmarks, tools and guidelines for molecular network analysis to study human disease biology.
- Assessment of human diploid genome assembly with 10x Linked-Reads data. Lu Zhang, Xin Zhou, Ziming Weng, and 1 more author. May 2019.
Producing cost-effective haplotype-resolved personal genomes remains challenging. 10x Linked-Read sequencing, with its high base quality and long-range information, has been demonstrated to facilitate de novo assembly of human genomes and variant detection. In this study, we investigate in depth how the parameter space of 10x library preparation and sequencing affects assembly quality, on the basis of both simulated and real libraries. We prepared and sequenced eight 10x libraries with a diverse set of parameters from standard cell lines NA12878 and NA24385 and performed whole-genome assembly on the data. We also developed the simulator LRTK-SIM to follow the workflow of 10x data generation and produce realistic simulated Linked-Read data sets. We found that assembly quality could be improved by increasing the total sequencing coverage (C) and keeping physical coverage of DNA fragments (CF) or read coverage per fragment (CR) within broad ranges. The optimal physical coverage was between 332× and 823× and assembly quality worsened if it increased to >1,000× for a given C. Long DNA fragments could significantly extend phase blocks but decreased contig contiguity. The optimal length-weighted fragment length (\(W\mu_{FL}\)) was ∼50–150 kb. When broadly optimal parameters were used for library preparation and sequencing, ∼80% of the genome was assembled in a diploid state. The Linked-Read libraries we generated and the parameter space we identified provide theoretical considerations and practical guidelines for personal genome assemblies based on 10x Linked-Read sequencing.
2018
- Coding mutations in NUS1 contribute to Parkinson’s disease. Ji-feng Guo, Lu Zhang, Kai Li, and 37 more authors. May 2018. Publisher: Proceedings of the National Academy of Sciences.
Whole-exome sequencing has been successful in identifying genetic factors contributing to familial or sporadic Parkinson’s disease (PD). However, this approach has not been applied to explore the impact of de novo mutations on PD pathogenesis. Here, we sequenced the exomes of 39 early onset patients, their parents, and 20 unaffected siblings to investigate the effects of de novo mutations on PD. We identified 12 genes with de novo mutations (MAD1L1, NUP98, PPP2CB, PKMYT1, TRIM24, CEP131, CTTNBP2, NUS1, SMPD3, MGRN1, IFI35, and RUSC2), which could be functionally relevant to PD pathogenesis. Further analyses of two independent case-control cohorts (1,852 patients and 1,565 controls in one cohort and 3,237 patients and 2,858 controls in the other) revealed that NUS1 harbors significantly more rare nonsynonymous variants (P = 1.01E-5, odds ratio = 11.3) in PD patients than in controls. Functional studies in Drosophila demonstrated that the loss of NUS1 could reduce the climbing ability, dopamine level, and number of dopaminergic neurons in 30-day-old flies and could induce apoptosis in fly brain. Together, our data suggest that de novo mutations could contribute to early onset PD pathogenesis and identify NUS1 as a candidate gene for PD.
2016
- Identification of RELN variation p.Thr3192Ser in a Chinese family with schizophrenia. Zhifan Zhou, Zhengmao Hu, Lu Zhang, and 9 more authors. May 2016. Publisher: Nature Publishing Group.
Schizophrenia (SCZ) is a serious psychiatric disease with strong heritability. Its complexity is reflected by extensive genetic heterogeneity and much of the genetic liability remains unaccounted for. We applied a combined strategy involving detection of copy number variants (CNVs), whole-genome mapping and exome sequencing to identify the genetic basis of autosomal-dominant SCZ in a Chinese family. To rule out pathogenic CNVs, we first performed Illumina single nucleotide polymorphism (SNP) array analysis on samples from two patients and one psychiatrically healthy family member, but no pathogenic CNVs were detected. In order to further narrow down the susceptible region, we conducted genome-wide linkage analysis and mapped the disease locus to chromosome 7q21.13-22.3, with a maximum multipoint logarithm of odds score of 2.144. Whole-exome sequencing was then carried out with samples from three affected individuals and one unaffected individual in the family. A missense variation c.9575 C > G (p.Thr3192Ser) was identified in RELN, which is known as a risk gene for SCZ, located on chromosome 7q22, in the pedigree. This rare variant, as a highly penetrant risk variant, co-segregated with the phenotype. Our results provide genetic evidence that RELN may be one of the pathogenic genes in SCZ.
- Genome-wide search followed by replication reveals genetic interaction of CD80 and ALOX5AP associated with systemic lupus erythematosus in Asian populations. Yan Zhang, Jing Yang, Jing Zhang, and 42 more authors. May 2016. Publisher: BMJ Publishing Group Ltd. Section: Basic and translational research.
2015
- Adaptation and possible ancient interspecies introgression in pigs identified by whole-genome sequencing. Huashui Ai, Xiaodong Fang, Bin Yang, and 6 more authors. Nature Genetics, May 2015. Publisher: Nature Publishing Group.
Lusheng Huang, Jun Ren and colleagues report the genome sequences of 69 pigs, representing 11 geographically distinct breeds and 3 wild boar populations, from within China. They identify loci related to high- and low-latitude adaptation and infer a likely ancient introgression event in northern Chinese pigs.
- Association of Common Variants in LOX with Keratoconus: A Meta-Analysis. Jing Zhang, Lu Zhang, Jiaxu Hong, and 2 more authors. May 2015. Publisher: Public Library of Science.
Background Several case-control studies have been performed to examine the association of genetic variants in lysyl oxidase (LOX) with keratoconus. However, the results remained inconclusive and great heterogeneity might exist across populations. Method A comprehensive literature search for studies that published up to June 25, 2015 was performed. Summary odds ratios (OR) and 95% confidence intervals (CI) of each single nucleotide polymorphism (SNP) were estimated with fixed effects model when I2\textless50% in the test for heterogeneity or random effects model when I2\textgreater50%. Publication bias was evaluated using funnel plots and Egger’s test. Results A total of four studies including 1,467 keratoconus cases and 4,490 controls were involved in this meta-analysis. SNPs rs2956540 and rs10519694 showed significant association with keratoconus, with ORs of 0.71 (95% CI: 0.63–0.80, P = 1.43E-08) and 0.77 (95% CI: 0.61–0.97, P = 0.026), respectively. In contrast, our study lacked sufficient evidences to support the association of rs1800449/rs2288393 with keratoconus across populations. Conclusion This meta-analysis suggested that two LOX variants, rs2956540 and rs10519694, may affect individual susceptibility to keratoconus, while distinct heterogeneity existed within this locus. Larger-scale and multi-ethnic genetic studies on keratoconus are required to further validate the results.
- Meta-analysis of GWAS on two Chinese populations followed by replication identifies novel genetic variants on the X chromosome associated with systemic lupus erythematosusYan Zhang, Jing Zhang, Jing Yang, and 48 more authorsMay 2015
Systemic lupus erythematosus (SLE) is a prototypic autoimmune disease that affects mainly females. What role the X chromosome plays in the disease has always been an intriguing question. In this study, we examined the genetic variants on the X chromosome through meta-analysis of two genome-wide association studies (GWAS) on SLE in Chinese Han populations. Prominent association signals from the meta-analysis were replicated in 4 additional Asian cohorts, with a total of 5373 cases and 9166 matched controls. We identified a novel variant in PRPS2 on Xp22.3 as associated with SLE with genome-wide significance (rs7062536, OR = 0.84, P = 1.00E−08). Association of the L1CAM-MECP2 region with SLE was reported previously. In this study, we identified independent contributors in this region in NAA10 (rs2071128, OR = 0.81, P = 2.19E−13) and TMEM187 (rs17422, OR = 0.75, P = 1.47E−15), in addition to replicating the association from the IRAK1-MECP2 region (rs1059702, OR = 0.71, P = 2.40E−18) in Asian cohorts. The X-linked susceptibility variants showed a higher effect size in males than in females, similar to results from a genome-wide survey of associated SNPs on the autosomes. These results suggest that susceptibility genes identified on the X chromosome, while contributing to disease predisposition, might not contribute significantly to the female predominance of this prototype autoimmune disease.
- Outbred genome sequencing and CRISPR/Cas9 gene editing in butterflies. Xueyan Li, Dingding Fan, Wei Zhang, and 21 more authors. May 2015. Publisher: Nature Publishing Group
Butterflies are exceptionally diverse but their potential as an experimental system has been limited by the difficulty of deciphering heterozygous genomes and a lack of genetic manipulation technology. Here we use a hybrid assembly approach to construct high-quality reference genomes for Papilio xuthus (contig and scaffold N50: 492 kb, 3.4 Mb) and Papilio machaon (contig and scaffold N50: 81 kb, 1.15 Mb), highly heterozygous species that differ in host plant affiliations, and adult and larval colour patterns. Integrating comparative genomics and analyses of gene expression yields multiple insights into butterfly evolution, including potential roles of specific genes in recent diversification. To test gene function, we develop an efficient (up to 92.5%) CRISPR/Cas9 gene editing method that yields obvious phenotypes with three genes, Abdominal-B, ebony and frizzled. Our results provide valuable genomic and technological resources for butterflies and unlock their potential as a genetic model system.
2014
- Whole-genome sequencing of cultivated and wild peppers provides insights into Capsicum domestication and specialization. Cheng Qin, Changshui Yu, Yaou Shen, and 70 more authors. May 2014. Publisher: Proceedings of the National Academy of Sciences
As an economic crop, pepper satisfies people’s spicy taste and has medicinal uses worldwide. To gain a better understanding of Capsicum evolution, domestication, and specialization, we present here the genome sequence of the cultivated pepper Zunla-1 (C. annuum L.) and its wild progenitor Chiltepin (C. annuum var. glabriusculum). We estimate that the pepper genome expanded 0.3 Mya (with respect to the genome of other Solanaceae) by a rapid amplification of retrotransposon elements, resulting in a genome comprised of 81% repetitive sequences. Approximately 79% of 3.48-Gb scaffolds containing 34,476 protein-coding genes were anchored to chromosomes by a high-density genetic map. Comparison of cultivated and wild pepper genomes with 20 resequencing accessions revealed molecular footprints of artificial selection, providing us with a list of candidate domestication genes. We also found that the dosage compensation effect of tandem duplication genes probably contributed to the pungent diversification in pepper. The Capsicum reference genome provides crucial information for the study of not only the evolution of the pepper genome but also the Solanaceae family, and it will facilitate the establishment of more effective pepper breeding programs.
- Genome-wide adaptive complexes to underground stresses in blind mole rats Spalax. Xiaodong Fang, Eviatar Nevo, Lijuan Han, and 50 more authors. May 2014. Publisher: Nature Publishing Group
The blind mole rat (BMR), Spalax galili, is an excellent model for studying mammalian adaptation to life underground and medical applications. The BMR spends its entire life underground, protecting itself from predators and climatic fluctuations while challenging it with multiple stressors such as darkness, hypoxia, hypercapnia, energetics and high pathogenicity. Here we sequence and analyse the BMR genome and transcriptome, highlighting the possible genomic adaptive responses to the underground stressors. Our results show high rates of RNA/DNA editing, reduced chromosome rearrangements, an over-representation of short interspersed elements (SINEs) probably linked to hypoxia tolerance, degeneration of vision and progression of photoperiodic perception, tolerance to hypercapnia and hypoxia and resistance to cancer. The remarkable traits of the BMR, together with its genomic and transcriptomic information, enhance our understanding of adaptation to extreme environments and will enable the utilization of BMR models for biomedical research in the fight against cancer, stroke and cardiovascular diseases.
2013
- Epistatic Interaction between Genetic Variants in Susceptibility Gene ETS1 Correlates with IL-17 Levels in SLE Patients. Jing Zhang, Yan Zhang, Lu Zhang, and 19 more authors. May 2013
T-helper cells that produce IL-17 (Th17 cells) are a subset of CD4+ T-cells with pathological roles in autoimmune diseases including systemic lupus erythematosus (SLE), and ETS1 is a negative regulator of Th17 cell differentiation. Our previous genome-wide association study (GWAS) identified two variants in the ETS1 gene (rs10893872 and rs1128334) as being associated with SLE. However, like many other risk alleles for complex diseases, little is known about how these genetic variants might affect disease pathogenesis. In this study, we examined serum IL-17 levels from 283 SLE cases and observed a significant correlation between risk variants in ETS1 and serum IL-17 concentration in patients, which suggests a potential mechanistic link between these variants and the disease. Furthermore, we found that the two variants act synergistically in influencing IL-17 production, with evidence of significant genetic interaction between them as well as higher correlation between the haplotype formed by the risk alleles and IL-17 level in patient serum. In addition, the correlation between ETS1 variants and IL-17 level seems to be more significant in SLE patients manifesting renal involvement, dsDNA autoantibody production or early onset.
- Meta-analysis Followed by Replication Identifies Loci in or near CDKN1B, TET3, CD80, DRAM1, and ARID5B as Associated with Systemic Lupus Erythematosus in Asians. May 2013
2010
- Genome-Wide Association Study in Asian Populations Identifies Variants in ETS1 and WDFY4 Associated with Systemic Lupus Erythematosus. Wanling Yang, Nan Shen, Dong-Qing Ye, and 29 more authors. May 2010. Publisher: Public Library of Science