Scalable Bioinformatics for Population-Scale Adaptive Immune Receptor Repertoire Analysis

Join us for our upcoming Future Computing Seminar Series
Speaker: Prof. Serghei Mangul (Sage Bionetworks & University of Southern California)
Date: June 5th, 2025, 11:00 CET
Where: ETZ E6
Abstract:
Recent advances in high-throughput sequencing technologies enable cost-effective characterization of the immune system and provide novel opportunities to study adaptive immune receptor repertoires (AIRR) at the population scale. A commonly used assay-based approach (i.e., AIRR-Seq) provides a detailed view of the adaptive immune system by leveraging deep sequencing of amplified DNA or RNA from the variable regions of the T and B cell receptor (TCR and BCR) loci. However, the limited number of samples probed by AIRR-Seq restricts the ability to detect novel population-specific V(D)J gene alleles across ethnically diverse and admixed populations. Non-targeted next-generation sequencing (NGS), such as whole-genome sequencing (WGS), promises to fill the existing data gap by providing hundreds of thousands of NGS datasets across various ancestry groups. However, reliable and scalable bioinformatics algorithms have yet to be developed to utilize non-targeted NGS technologies to assemble novel population-specific alleles that support effect-size heterogeneity across ancestries. Comprehensive population-specific allelic immunogenomics reference databases are also lacking. This void exacerbates existing health disparities, as discoveries in medical immunogenomics continue to benefit primarily populations of European ancestry. Current state-of-the-art databases are built on the genetic architecture of individuals of European ancestry and thus fail to capture allelic variation across diverse populations. I will discuss a data science approach for studying the variation of the human adaptive immune system at a truly global scale, improving studies of immunological health and disease, and reducing health disparities.
In this study, we develop robust and scalable bioinformatics tools and databases able to leverage the largest datasets covering individuals of various ancestries, comprising over half a million NGS samples spanning the AIRR-Seq, RNA-Seq, and WGS technologies. Additionally, I will discuss strategies for benchmarking the developed bioinformatics methods on both simulated and real data to demonstrate the feasibility of using NGS-based approaches to assemble novel V(D)J alleles and receptor sequences.
Bio:
Dr. Mangul is the Director of Challenges and Benchmarking at Sage Bionetworks and an Assistant Professor of Clinical Pharmacy and Computational Biology at the University of Southern California. He specializes in the design, development, and application of novel data-driven computational approaches to accelerate the diffusion of genomics and biomedical data into translational research and education. Dr. Mangul is a passionate advocate for promoting transparency and reproducibility in data-driven biomedical research, as well as for making bioinformatics education accessible to all. His work is dedicated to advancing the principles of reproducibility, data sharing, and software usability, with the ultimate goal of shaping a more equitable and impactful future for the field of bioinformatics. Dr. Mangul received his Ph.D. in Bioinformatics from Georgia State University and holds a B.Sc. in Applied Mathematics from Moldova State University, Chisinau, Moldova. He completed his postdoctoral training in computational genomics with Prof. Eskin at the University of California, Los Angeles (UCLA). Dr. Mangul is the recipient of the prestigious National Science Foundation CAREER and Fulbright U.S. Scholar Program awards. He serves as a mentor for the NIH AIM-AHEAD Leadership Fellowship and the NCATS Training Program in Advanced Data Analysis.