New machine learning algorithm improves estimation and integration of single-cell data

Wang paper figure

By Charlene N. Rivera-Bonet, Waisman Science Writer

Like a game of Wheel of Fortune, where you have to fill in missing letters to guess the hidden phrase, analyzing data sometimes requires estimating missing data points by relying on available information in order to get the full picture of what’s being studied. Unlike a game of Wheel of Fortune, this becomes almost humanly impossible with the large amounts of data scientists acquire from the brain, particularly single-cell data. Scientists at the Waisman Center have developed a machine learning algorithm to both estimate missing data points and integrate the data to allow for a more complete picture of how cells work.

A recent study published in Nature Machine Intelligence introduces a machine learning algorithm called JAMIE (Joint Variational Autoencoders for Multimodal Imputation and Embedding) that uses data from one modality, such as gene expression, to predict a missing modality, such as electrophysiology. This technique is called cross-modal imputation. In addition to imputation, or estimating values for missing data, JAMIE can integrate the different modalities of a cell for a more comprehensive understanding of its function.

“Each modality just looks at one particular aspect of the cell. Now we can integrate the multimodal data together to fully understand cellular function,” says Daifeng Wang, PhD, associate professor of biostatistics & medical informatics, and computer sciences, and senior author of the study.

With the recent advances in technology, scientists are now able to collect large amounts of data on different features of single cells such as gene expression, morphology (cell shape), electrophysiology, and epigenetics, to name a few. While this allows for a better understanding of molecular mechanisms in the cells, it poses two challenges: making sense of large, complex quantities of data in an integrative way, and needing to estimate larger quantities of missing data.

Wang’s team came up with a machine learning program, JAMIE, that addresses these two challenges. Machine learning is the use of computer systems that can learn and draw inferences from patterns in the data using mathematical and statistical models. First, JAMIE analyzes the multimodal data of the single cells and learns the patterns and relationships in the data. Once the program has learned these, it can make predictions, such as estimating one modality based on another.

JAMIE, the study found, is able to successfully perform cross-modal imputation. Based on one modality, such as gene expression, it is able to estimate missing values for other modalities such as electrophysiology of single cells. JAMIE is both faster and more accurate than current methods of imputation.

For JAMIE to be able to accomplish these predictions, it first has to form what’s known as a lower-dimensional latent space consisting of latent representations for each of the given modalities. Because each cell starts out with a lot of data, a latent space condenses information and represents the important aspects of each modality. It then uses that information to separate cells based on their functions, typically resulting in separation by cell type.
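To make the idea of a latent space concrete, here is a minimal sketch that uses principal component analysis (PCA) as a simple linear stand-in for JAMIE's variational autoencoders. It is not JAMIE's actual method or API. The toy data, the function names `fit_latent_space` and `encode`, and the choice of two latent dimensions are all assumptions made for illustration.

```python
import numpy as np

def fit_latent_space(X, n_latent):
    """Linear stand-in for an encoder: find the top principal directions."""
    mean = X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_latent]

def encode(X, mean, components):
    """Condense each cell's many features into a few latent coordinates."""
    return (X - mean) @ components.T

rng = np.random.default_rng(0)

# Hypothetical toy data: 2 cell types, 100 cells each, 50 features per cell.
centers = 3.0 * rng.normal(size=(2, 50))
labels = np.repeat([0, 1], 100)
X = centers[labels] + rng.normal(size=(200, 50))

mean, components = fit_latent_space(X, n_latent=2)
Z = encode(X, mean, components)  # shape (200, 2): one 2-D point per cell

# Cells of the same type land near each other on the first latent axis.
gap = abs(Z[labels == 0, 0].mean() - Z[labels == 1, 0].mean())
spread = Z[labels == 0, 0].std()
```

In this toy setting, `gap` is much larger than `spread`, which is the linear analogue of cells separating by type in the latent space. A variational autoencoder learns a nonlinear version of this compression.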

After extracting features and learning patterns, a latent space becomes a cell map, where each location in the map represents a biological function. The study found that JAMIE is able to create a latent space that is biologically significant and interpretable. This means that the cell locations in the latent space correspond to specific biological functions which the researchers are able to identify.

Noah Cohen Kalafut

“In addition to the imputation, predicting one modality from another, we’re also doing integration, which means taking both of the modalities and then putting them together in such a way that they form common latent spaces, which can then be used for further downstream analysis,” says Noah Cohen Kalafut, doctoral student in computer sciences and first author of the study.

Once they have the latent spaces representing the most important aspects of each modality, the researchers can measure how the modalities relate to one another, providing a more complete picture of how the cell works.

“I think the one uniqueness of JAMIE is that it tries to form bridges between modality latent spaces,” says Wang. Unlike existing programs that create one latent space for all of the data, which is not normally reusable, JAMIE builds a similar latent space for each modality and builds bridges connecting them.
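As a loose linear analogy for per-modality latent spaces connected by a bridge, the sketch below builds one PCA latent space per modality, fits a least-squares map between them, and uses that map to impute one modality from the other. This is an assumption-laden illustration, not JAMIE's architecture: JAMIE uses joint variational autoencoders, and the simulated data, the `fit_pca` helper, the `bridge` matrix, and `impute_ephys` are hypothetical names introduced here.

```python
import numpy as np

def fit_pca(X, k):
    """Return the feature means and the top-k principal directions of X."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

rng = np.random.default_rng(1)
n_cells, k = 300, 3

# Simulated cells: a hidden 3-D cell state drives both modalities.
state = rng.normal(size=(n_cells, k))
expr = state @ rng.normal(size=(k, 40)) + 0.1 * rng.normal(size=(n_cells, 40))
ephys = state @ rng.normal(size=(k, 10)) + 0.1 * rng.normal(size=(n_cells, 10))

# One latent space per modality.
mean_a, comp_a = fit_pca(expr, k)
mean_b, comp_b = fit_pca(ephys, k)
Z_a = (expr - mean_a) @ comp_a.T
Z_b = (ephys - mean_b) @ comp_b.T

# The "bridge": a least-squares linear map from one latent space to the other.
bridge, *_ = np.linalg.lstsq(Z_a, Z_b, rcond=None)

def impute_ephys(expr_new):
    """Encode expression, cross the bridge, then decode to electrophysiology."""
    z_a = (expr_new - mean_a) @ comp_a.T
    z_b = z_a @ bridge
    return z_b @ comp_b + mean_b

imputed = impute_ephys(expr)
agreement = np.corrcoef(imputed.ravel(), ephys.ravel())[0, 1]
```

Here `agreement` is measured on the same cells used for fitting, so it only shows that the pieces connect. Because each modality keeps its own latent space, either encoder could in principle be reused on its own, which is the design point the comparison with single-latent-space programs is making.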

In order to ensure the biological significance of imputation using JAMIE, the researchers withhold a portion of the data, let the program make the imputations, and then use the withheld data as a comparison.
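The withhold-and-compare procedure can be sketched as follows. Ridge regression stands in for the imputation model here purely for brevity; it is not what JAMIE uses. The simulated data, the 20 percent hold-out fraction, and the `fit_ridge` helper are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_cells = 400

# Simulated cells: a hidden 3-D cell state drives both modalities.
state = rng.normal(size=(n_cells, 3))
expr = state @ rng.normal(size=(3, 30)) + 0.1 * rng.normal(size=(n_cells, 30))
ephys = state @ rng.normal(size=(3, 8)) + 0.1 * rng.normal(size=(n_cells, 8))

# Withhold the electrophysiology of 20% of the cells.
perm = rng.permutation(n_cells)
test_idx, train_idx = perm[:80], perm[80:]

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression on centered data."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    W = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ Yc)
    return W, x_mean, y_mean

# Train only on the cells that were not withheld.
W, x_mean, y_mean = fit_ridge(expr[train_idx], ephys[train_idx])

# Impute the withheld modality from gene expression alone.
imputed = (expr[test_idx] - x_mean) @ W + y_mean
withheld = ephys[test_idx]

# Compare imputed values with the withheld measurements.
per_feature_corr = np.array(
    [np.corrcoef(imputed[:, j], withheld[:, j])[0, 1] for j in range(withheld.shape[1])]
)
imputed_mse = np.mean((withheld - imputed) ** 2)
baseline_mse = np.mean((withheld - y_mean) ** 2)  # always guessing the training mean
```

Comparing `imputed_mse` against `baseline_mse` shows whether the model does better than guessing the average, and `per_feature_corr` shows which features are recovered well.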

Wang compares JAMIE to the artificial intelligence chatbot ChatGPT in the sense that you can give it some input and it can give you missing information. “JAMIE could function as a sort of neuronalGPT or brainGPT,” he says.

Currently, JAMIE has been trained to only use two modalities. In the future, their plan is to expand the number of modalities JAMIE can support. “It is pretty easily extensible to three modalities, or an indeterminate number of modalities,” says Cohen Kalafut.

The scientists are also looking to use JAMIE on single-cell data from disease samples. The goal is to impute single-cell features that are missing or difficult to observe in brain diseases, such as neurodevelopmental and neurodegenerative disorders, and to learn more about single-cell features specific to disease. This work is under way in collaboration with several Waisman labs.

JAMIE [https://github.com/daifengwanglab/JAMIE] is open source and available for anyone to use.

Note: Caption for image at top. (a) Maps of gene expression and electrophysiological features in the mouse visual cortex. (b) JAMIE’s latent space for gene expression and electrophysiological features colored by cell types.
