An AI language model developed by Illinois researchers for sequence-based antibody specificity prediction and published in Immunity will accelerate epitope mapping and improve scientists’ understanding of the B-cell repertoire.
The model, which distinguishes between antibodies targeting the head and stem domains of human influenza hemagglutinin glycoprotein, functions as a critical resource for scientists applying deep learning methods to antibody research.
Historically, scientists have struggled to predict the specificity of an antibody based solely on its sequence. The Illinois researchers tackled this challenge by training a lightweight memory B-cell language model, a type of AI model that can learn the intrinsic grammar of biological inputs.
“Language models are very good at catching patterns in diverse antibody amino acid sequences,” said Yiquan Wang, a Biochemistry graduate student and the lead author of the paper. “It’s like English; there are lots of motifs and structures behind the sequences.”
The researchers resolved to train a model that could distinguish between antibodies targeting the two divisions of hemagglutinin, known as the head and stem. Much like a flower, the head is more mutable while the stem is highly conserved. Most antibodies bind to the head domain, which is a less desirable target for vaccines due to its penchant for mutating. If a virus mutates, an updated version of the vaccine for that virus will eventually be required. For this reason, scientists are interested in developing a universal vaccine for certain viruses like the flu, eliminating the need for yearly boosters.
The Illinois researchers’ language model, which predicts from sequence alone whether an antibody targets the head or the stem domain, is an important step in this pursuit.
“One significance of this study is taking the first step to decode human immune history based on an individual’s antibody sequences,” said Nicholas Wu, a professor of Biochemistry and the corresponding author of the paper. “This can aid precision vaccination: designing vaccines that are customized for different people and age groups. We’re not solving the whole problem, but this is an important step.”
A major bottleneck of antibody characterization is epitope mapping, an expensive and labor-intensive process that identifies the binding site of an antibody. But using a language model to identify antibodies to a specific antigen of interest can accelerate the process.
“It’s very important for us to understand the B-cell repertoire — how our immune system responds to infection and vaccination,” Wang said. “If you get a vaccine, how do you know whether you have acquired that specific antibody? Mapping the epitope is key to answering that question. It tells us how many antibodies you have for each virus.”
To create the model, researchers mined 60 publications to curate a list of over 5,000 human influenza hemagglutinin antibodies. They began with pre-training, a process that allows the model to learn the “grammar” of an input: in this scenario, defining the antibody sequence space and identifying the intrinsic patterns.
Once the model had learned the general patterns of antibody sequences, it progressed to transfer learning, which presents a specific goal: antibody specificity prediction. In this step, researchers input an antibody sequence, and the model predicts a probability for each specificity category. Much as English grammar relies on the mechanics of a sentence to decode meaning, the model relies on information obtained in pre-training to pinpoint the binding location of an antibody.
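As a rough illustration of that prediction step (not the researchers' code), the sketch below turns an amino acid sequence into integer tokens and converts raw scores into probabilities over specificity categories with a softmax. The two category labels, the example sequence fragment and the raw scores are all placeholders invented for illustration; in a real model the scores would come from the fine-tuned network.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
TOKEN_ID = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def tokenize(sequence):
    """Map each residue letter to an integer id a language model could consume."""
    return [TOKEN_ID[aa] for aa in sequence.upper()]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical labels: the article only names the head and stem domains.
CATEGORIES = ["HA head", "HA stem"]

# Illustrative sequence fragment, not taken from the study's dataset.
example_sequence = "EVQLVESGGGLVQPGGSLRLSCAAS"
token_ids = tokenize(example_sequence)

# Placeholder scores standing in for the network's output on token_ids.
raw_scores = [0.4, 2.1]
probabilities = dict(zip(CATEGORIES, softmax(raw_scores)))

for category, p in probabilities.items():
    print(f"{category}: {p:.3f}")
```

The category with the highest probability would be reported as the predicted specificity; here the made-up scores favor the stem label.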
“We will ask the model, ‘What did you learn?’ ‘How did you make this prediction?’” said Tomas Lyu, a postdoctoral fellow in Biochemistry and a lead author of the paper. “And the model will tell us which positions on the antibody sequence it thinks are important.”
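One generic way to ask a model which positions matter is occlusion: mask one residue at a time and record how much the prediction score drops. The sketch below shows the idea under loud assumptions. The article does not say which interpretation method the researchers used, and `toy_score` is a made-up rule standing in for a trained model, with no biological meaning.

```python
def toy_score(sequence):
    """Stand-in for a trained model (hypothetical): fraction of aromatic residues."""
    return sum(1.0 for aa in sequence if aa in "FWY") / len(sequence)


def occlusion_importance(sequence, score_fn, mask="X"):
    """Return, for each position, the score drop when that residue is masked."""
    baseline = score_fn(sequence)
    importances = []
    for i in range(len(sequence)):
        masked = sequence[:i] + mask + sequence[i + 1:]
        importances.append(baseline - score_fn(masked))
    return importances


# Illustrative fragment, not from the study's dataset.
sequence = "ARDYYGSGW"
importance = occlusion_importance(sequence, toy_score)

# Rank positions (0-indexed) by how much masking them lowers the score.
top_positions = sorted(
    range(len(sequence)), key=lambda i: importance[i], reverse=True
)[:3]
print(top_positions)
```

With a real model, `score_fn` would return the predicted probability of a category such as stem binding, and the highest-ranked positions would be the ones the model leans on most.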
Although the language model can capture a surprising level of detail, including somatic hypermutations, it is limited by its lack of biophysical knowledge, such as the 3D interactions of biological entities. The researchers hope to incorporate this information into future iterations of the model, which will eventually serve as a centralized database to describe the sequence-specificity relationship for antibodies.
“Before our method, epitope mapping was done through individual antibody experimentation,” Wang said. “Our bodies contain 10 billion B-cells; there’s no way you can analyze that one by one. So we used a language model to solve this computationally instead of conducting physical experiments. Going forward, I think this will be the only practical solution.”
Cover photo courtesy of Fred Zwicky.