Utilizing AI to Better Understand the Genotype-Phenotype Connection

By Emily Leclerc, Waisman Science Writer

There are thousands, if not millions, of steps to get from genotype, your genetic code, to phenotype, your physical attributes. Understanding what those steps are and how they lead from one to the other can reveal important information about the disease mechanisms behind conditions like Alzheimer’s disease and schizophrenia, both of which have genetic components. A new neural network model developed at the Waisman Center is making predictions of how differences in the genome contribute to individual disease phenotypes both easier and more accurate.

Daifeng Wang, PhD, at his Waisman Center office

Daifeng Wang, PhD, Waisman investigator and UW-Madison associate professor of biostatistics and medical informatics and of computer sciences, Pramod Bharadwaj Chandrashekar, PhD, a postdoctoral fellow in Wang’s lab, and Sayali Alatkar, a PhD student in Wang’s lab, recently published a paper in the journal Genome Medicine that showcases their new model, DeepGAMI. DeepGAMI uses auxiliary learning to more accurately predict disease phenotypes from genomic data.

Your genome is the entire collection of your DNA. That DNA is the blueprint your body uses to build and maintain everything from your eye and hair color to your liver and brain function, and it is tied to disease as well: conditions with a genetic cause or genetic link all trace back to your genome. Your phenotype is the physical presentation of your genome. For example, your genome may carry the genes that code for blue eyes; that is your genotype. Actually having blue eyes, the physical attribute that results, is your phenotype. Understanding exactly how a person’s genotype results in their phenotype, especially a disease phenotype, is important for understanding the disease itself and for developing new treatments.

When developing DeepGAMI, Wang and Chandrashekar wanted the model to address several challenges, the biggest being a better understanding of the genotype-phenotype association. “It is one of the most important things because it is what drives everybody’s external and internal attributes,” Chandrashekar says. “For example, a change in a portion of the genome affects certain genes, which is what causes you to gain or lose height, in its simplest case. This association is very important because it helps us uncover a lot of disease mechanisms as well as molecular and cellular mechanisms that happen within the body.”

The other challenges Wang and Chandrashekar wanted their model to overcome involved integrating multiple sources of data and making accurate predictions when parts of a dataset are missing. The resulting model, DeepGAMI, accomplishes three important tasks in one.

Pramod Bharadwaj Chandrashekar, PhD

The model’s main task is to take genetic data, specifically genotype and gene expression in this case, and use it to predict what an individual’s phenotype will be. The model also has two subtasks, locating the genes that are influential in the prediction and predicting missing gene expression data, which it carries out simultaneously with the main task and uses to improve the main task’s prediction. “The first subtask is we try to tell people which genes or mutations contribute to the model’s prediction. It’s trying to prioritize the genes that are important,” Wang says. “It also can do that within cell types too, which is unique to our work. The second subtask the model does is impute, or predict, gene expression data from the genotype data.”
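As a rough illustration of that structure, here is a minimal PyTorch sketch of a network with one main output, the phenotype prediction, and one auxiliary output, the imputed gene expression. The class name, layer sizes, and the way the two branches are combined are illustrative assumptions, not DeepGAMI’s actual architecture.

    import torch
    import torch.nn as nn

    class GenotypePhenotypeNet(nn.Module):
        """Toy sketch: a main phenotype head plus an auxiliary gene-expression
        imputation head. Names and layer sizes are illustrative only."""

        def __init__(self, n_snps, n_genes, n_phenotypes, hidden=128):
            super().__init__()
            # Shared encoder for the genotype input (e.g., SNP dosages)
            self.genotype_encoder = nn.Sequential(nn.Linear(n_snps, hidden), nn.ReLU())
            # Auxiliary head: impute gene expression from the genotype encoding
            self.expression_head = nn.Linear(hidden, n_genes)
            # Main head: predict phenotype from the encoding plus expression
            self.phenotype_head = nn.Sequential(
                nn.Linear(hidden + n_genes, hidden),
                nn.ReLU(),
                nn.Linear(hidden, n_phenotypes),
            )

        def forward(self, genotype, expression=None):
            h = self.genotype_encoder(genotype)
            imputed = self.expression_head(h)
            # Use measured expression when available; otherwise fall back to
            # the imputed values so genotype-only samples can still be scored.
            expr = expression if expression is not None else imputed
            logits = self.phenotype_head(torch.cat([h, expr], dim=1))
            return logits, imputed

    # Genotype-only example: the model imputes expression internally.
    model = GenotypePhenotypeNet(n_snps=1000, n_genes=200, n_phenotypes=2)
    logits, imputed = model(torch.randn(8, 1000))

The other subtask, prioritizing influential genes and variants, is typically read out of a trained network afterward, for example with feature-attribution methods, and is omitted from this sketch.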

Gene expression data can be difficult to get, especially from the brain. Genotype data is simple to collect because all that is needed is a blood or skin sample. Gene expression requires more than that. “You can’t just open up a living person’s brain, right?” Wang says. So gene expression data is often partially or entirely missing from datasets, which makes predictions difficult and often inaccurate.

To overcome this, DeepGAMI employs auxiliary learning to predict missing gene expression data and then use that information in its final phenotype prediction. “Auxiliary learning is a way of learning your primary task by jointly defining and learning subtasks related to the primary task. You’re using the help of these subtasks to make the main task’s prediction or goal better while simultaneously solving the subtasks,” Chandrashekar says. Imputing gene expression is the subtask being solved with auxiliary learning while the main goal is predicting phenotypes.
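In practice, auxiliary learning usually boils down to optimizing a single combined objective: the loss for the main task and the loss for the subtask are added together with a weighting factor, so each training step improves both. The sketch below, which reuses the hypothetical network above, shows what such a joint training step could look like; the cross-entropy and mean-squared-error losses and the aux_weight value are assumptions for illustration, not the paper’s exact formulation.

    import torch
    import torch.nn.functional as F

    def joint_training_step(model, optimizer, genotype, expression, phenotype,
                            aux_weight=0.5):
        """One auxiliary-learning update: minimize the main phenotype loss and
        a weighted gene-expression imputation loss together."""
        optimizer.zero_grad()
        logits, imputed = model(genotype, expression)   # run both heads
        main_loss = F.cross_entropy(logits, phenotype)  # primary task: phenotype
        aux_loss = F.mse_loss(imputed, expression)      # subtask: impute expression
        total = main_loss + aux_weight * aux_loss       # combined objective
        total.backward()
        optimizer.step()
        return total.item()

    # Example update on a random batch, using the model sketched earlier.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    joint_training_step(model, optimizer,
                        genotype=torch.randn(8, 1000),
                        expression=torch.randn(8, 200),
                        phenotype=torch.randint(0, 2, (8,)))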

DeepGAMI was trained on large, complete datasets of genotype and gene expression data in Alzheimer’s disease and schizophrenia. This training taught DeepGAMI the relationships between genotype and gene expression. Drawing on those learned relationships, the model can then predict what the missing gene expression data would be. From there, DeepGAMI uses the genotype data and the predicted gene expression data to complete its main task of predicting the phenotype.

The auxiliary learning aspect of DeepGAMI yields more accurate predictions as well as additional information that can be useful to researchers. Other models developed to make similar predictions do not currently use auxiliary learning to enhance their outcomes; this extra step makes DeepGAMI’s predictions more accurate and more reliable.

The paper shows that the model was able not only to pull out potentially important genes and mutations in Alzheimer’s disease and schizophrenia but also to predict the level of cognitive impairment a person may have. And when the researchers tested the model on datasets with missing gene expression data, it still provided accurate information, outperforming other computational methods.

DeepGAMI’s code and instructions are open source and available for all to use. Because its core code is general purpose, Wang and Chandrashekar hope to use the model to study intellectual and developmental disabilities (IDDs) in the future. “For this work, we focused on Alzheimer’s disease and schizophrenia because those were the largest and most complete datasets we had,” Wang says. “We could definitely apply it to neurodevelopmental diseases like autism, but we don’t have the data yet because there is no good population data. I’m looking forward to applying the model to developmental diseases when we have data.”

Moving forward, Wang and Chandrashekar hope to improve the model by including other types of information and by training it on other ways the data could be connected. “We are currently only looking at two sources of data, genotype and gene expression. The next step is how can we use imaging data or protein expression or other information to make the predictions better,” Chandrashekar says.

Once large enough datasets are available on IDDs to train DeepGAMI, Wang and Chandrashekar are hopeful that it could reveal important insights into disease and molecular mechanisms that may make it easier to develop new and improved therapies for IDDs.
