
Research Topics
Historically, our group has focused on bioinformatics approaches to cancer genomics and sleep-related research. Building on this foundation, we have recently launched a new initiative centered on the development of deep learning models for toxicology.
Ongoing projects in the lab include:
Deep Learning on Toxicology Data - The U.S. Tox21 collaboration has generated a large reference library of high-throughput concentration–response assays. Here we present Tox21mer, a 43.5-million-parameter transformer that encodes each Tox21 concentration–response curve together with assay metadata into a 768-dimensional representation. Tox21mer was pretrained on ~2.5 million curves from 102 assay protocols and 6,727 compounds using masked-response reconstruction as the primary objective, with low-weight auxiliary supervision on assay outcome and AC50. To evaluate the learned representation, we trained lightweight probes on frozen embeddings from concentration–response curves of held-out compounds. The representation supported a macro-F1 of 0.985 for three-class outcome prediction (agonist, antagonist, inactive), a binary F1 of 0.994 for active/inactive prediction, and an R² of 0.87 for log10(AC50). The learned embeddings formed coherent groupings by curve-class category. A masked-only pretraining variant retained near-baseline probe performance, indicating that the representation is learned largely from the self-supervised objective rather than from auxiliary labels. Ablation analyses further showed that predictive performance depends mainly on curve-level response-value distributions conditioned on assay context, with limited reliance on detailed within-curve ordering. Tox21mer thus provides a reusable foundation representation for Tox21 concentration–response data that can support extrapolation to untested compounds through integration with chemical features or distillation into chemistry-only student models for large-scale external screening.
Our group is also building a next-generation foundation model for chronic and subchronic assay pathology data from the National Toxicology Program (NTP), one of the world’s most extraordinary toxicology resources. This effort brings together tens of thousands of ultra-high-resolution, gigabyte-scale pathology images and millions of pathology report entries generated through NTP studies. By integrating these large and richly informative datasets, we are creating AI tools that can transform how tissue-level toxicological effects are detected, interpreted, and predicted. This is an ambitious and exciting undertaking, and we are making remarkable progress. Our long-term goal is to develop foundation models capable of predicting tissue pathology for previously untested chemicals, accelerating discovery and helping shape the future of predictive toxicology.
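For readers less familiar with the probe metrics quoted in the Tox21mer summary above, the following is a minimal sketch of how macro-F1 (for the three-class outcome probe) and R² (for the log10(AC50) probe) are conventionally computed. The function names `macro_f1` and `r_squared` and the toy labels are illustrative assumptions; this is not the Tox21mer evaluation code, and the numbers it produces have no relation to the reported results.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores (each class counts equally)."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy example with hypothetical probe outputs (not real Tox21 data).
classes = ["agonist", "antagonist", "inactive"]
true_outcome = ["agonist", "antagonist", "inactive", "inactive"]
pred_outcome = ["agonist", "inactive", "inactive", "inactive"]
print(macro_f1(true_outcome, pred_outcome, classes))
print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
```

Because macro-F1 averages per-class scores without weighting by class frequency, a probe cannot score well simply by predicting the most common class, which is relevant when inactive curves dominate a dataset.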
Biography
Leping Li, Ph.D., is a senior investigator in the Biostatistics and Computational Biology Branch. His research program focuses on computational biology, and his group is a multidisciplinary team. The group's early focus was the development and implementation of computational and statistical methods for mining high-dimensional genomic data.
Selected Publications
- Shats I, Williams JG, Liu J, Makarov MV, Wu X, Lih FB, Deterding LJ, Lim C, Xu X, Randall TA, Lee E, Li W, Fan W, Li JL, Sokolsky M, Kabanov AV, Li L, Migaud ME, Locasale JW, Li X. Bacteria Boost Mammalian Host NAD Metabolism by Engaging the Deamidated Biosynthesis Pathway. Cell Metab. 2020;31(3):564-579.e7.
- Kang K, Meng Q, Shats I, Umbach DM, Li M, Li Y, Li X, Li L. CDSeq: A novel complete deconvolution method for dissecting heterogeneous samples using gene expression data. PLoS Comput Biol. 2019;15(12):e1007510.
- Huang W, Li L, Myers JR, Marth GT. ART: a next-generation sequencing read simulator. Bioinformatics. 2012;28(4):593-4.
- Gilchrist DA, Dos Santos G, Fargo DC, Xie B, Gao Y, Li L, Adelman K. Pausing of RNA polymerase II disrupts DNA-specified nucleosome organization to enable precise gene regulation. Cell. 2010;143(4):540-51.
- Li L, Weinberg CR, Darden TA, Pedersen LG. Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics. 2001;17(12):1131-42.
This page was last updated on Monday, May 18, 2026



