Large language models pick up biological language

AI systems that have already made significant progress in understanding human language are now being trained to interpret the language of life encoded in DNA, and to use that knowledge to create new molecules.

Why it matters: Artificial intelligence (AI) that can interpret biological data may be able to help researchers create novel treatments and modify cells to produce biofuels, materials, medications, and other goods.

Background: Using increasingly sophisticated computational tools, scientists have spent decades trying to reverse-engineer cells in order to create new proteins and improve on chemicals found in nature.

Other scientists have scoured the planet for substances produced by fungi, plants, bacteria, and other organisms that haven't yet been discovered but could be valuable for specific uses. Both strategies have yielded new products and cancer therapies.

"But eventually, we run out of low-hanging fruit to pick," says Kyunghyun Cho, senior director of Frontier Research at Prescient Design, a Genentech company, and a professor of computer science and data science at New York University.

Now, generative AI models that resemble the large language models (LLMs) powering ChatGPT are being built to understand the rules and interactions of DNA, RNA, and proteins, and the many functions and properties they give rise to.

How it works: Humans arrange the 26 letters of the modern English alphabet into roughly 500,000 words, possibly more.

LLMs are fed text, which they divide into tokens: characters, words, or subwords.

After learning the relationships between those tokens, the AI model uses that knowledge to generate original text.

Biology uses far fewer letters in its language, but it produces many more "words" in the form of proteins.

DNA is built from four bases that carry genetic information: adenine (A), cytosine (C), thymine (T), and guanine (G).

Three-letter combinations of these four bases, called codons, specify 20 distinct amino acids, which are strung together in various orders to form proteins.
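
As a rough illustration of that mapping (not any particular model's tokenizer; the sequence is invented and the table is only an excerpt of the standard genetic code), a DNA coding sequence can be read three letters at a time and translated into amino acids:

```python
# Illustrative sketch only: read a (made-up) DNA coding sequence codon by
# codon and map each codon to an amino acid. The table below is a small
# excerpt of the standard genetic code, not the full 64-codon table.
CODON_TABLE = {
    "ATG": "M",                          # methionine, the usual start codon
    "GCT": "A", "GCC": "A",              # alanine
    "TGG": "W",                          # tryptophan
    "AAA": "K", "AAG": "K",              # lysine
    "TAA": "*", "TAG": "*", "TGA": "*",  # stop codons
}

def translate(dna: str) -> str:
    """Return the amino acid string encoded by the sequence, codon by codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        amino_acid = CODON_TABLE.get(codon, "?")  # "?" = codon not in this excerpt
        if amino_acid == "*":                     # a stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("ATGGCTTGGAAATAA"))  # -> "MAWK"
```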

More than 200 million proteins are known to exist. Predicting a protein's structure from its amino acid sequence has long been one of biology's most difficult and time-consuming tasks, one that DeepMind's AlphaFold AI system can now perform.

But many orders of magnitude more proteins could exist.

That gives scientists a huge space to explore if they want to design new proteins with the properties needed for a novel medicine, or to modify cells to perform different functions.
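
A back-of-the-envelope illustration of that scale (the numbers are not from the article): even a fairly short protein of 100 amino acids has 20^100, roughly 10^130, possible sequences, which dwarfs the roughly 200 million proteins catalogued so far.

```python
# Back-of-the-envelope arithmetic: how many sequences are possible for a
# protein of a given length, versus the number of proteins known today.
from math import log10

AMINO_ACIDS = 20
PROTEIN_LENGTH = 100          # a fairly short protein
KNOWN_PROTEINS = 200_000_000  # "more than 200 million" known proteins

possible = AMINO_ACIDS ** PROTEIN_LENGTH
print(f"possible sequences of length {PROTEIN_LENGTH}: ~10^{log10(possible):.0f}")
print(f"known proteins: ~10^{log10(KNOWN_PROTEINS):.0f}")
# -> ~10^130 possible sequences versus ~10^8 known proteins
```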

 

What's happening: AI models are mapping that space to find changes in DNA or RNA that cause disease or alter key cellular functions, and then using that information to design new proteins. But scientists pursuing this face a number of obstacles.

They must figure out the best way to break biology's language down into tokens that the LLM can work with. They must ensure the AI is able to see the relationships between genes and elements of genes that affect one another from different places in a long stretch of DNA, says Joshua Dunn, a molecular and computational biologist at Ginkgo Bioworks, which uses AI to drive some of its gene designs. It's like having to pull sentences from different parts of a book to understand its meaning.

Another consideration is that reading DNA from different starting positions can produce different proteins; reading a sentence from its middle tells you a different story than reading it from the beginning.
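
A minimal sketch of that idea, using an invented sequence: shifting the starting position by one or two letters changes every codon, so the same stretch of DNA can read as entirely different "sentences."

```python
# Illustrative only: split the same (made-up) DNA string into codons starting
# from three different positions, i.e. the three forward reading frames.
dna = "ATGGCTTGGAAATAA"

for frame in range(3):
    codons = [dna[i:i + 3] for i in range(frame, len(dna) - 2, 3)]
    print(f"frame {frame}: {codons}")

# frame 0: ['ATG', 'GCT', 'TGG', 'AAA', 'TAA']
# frame 1: ['TGG', 'CTT', 'GGA', 'AAT']
# frame 2: ['GGC', 'TTG', 'GAA', 'ATA']
```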

Furthermore, whereas the majority of proteins are encoded in the genetic code, others are transcribed by distinct "readers" within cells. "That means there are a whole lot of different languages being spoken at the same time," Dunn explains.

Dunn states that he is "extremely optimistic that large language models are going to figure out some of this because they're actually very good at understanding different scales of meanings spoken in different languages."

 

 

However, how to tokenize genetic data to extract the most information remains an open question. A model must, for instance, scan a large enough stretch of sequence to capture signals dispersed across a chromosome while preserving fine-grained information about single-letter mutations and the changes they produce. AI models may need to be adapted, or may not be able to rely on tokenization at all, Dunn says.
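
One way to picture that trade-off, as a sketch rather than any published model's actual tokenizer: single-letter tokens keep a point mutation visible as a change to exactly one token but make sequences very long, while fixed-length k-mer tokens shorten the sequence at the cost of coarser resolution.

```python
# Hypothetical comparison of two DNA tokenization schemes (not taken from any
# published model): single-character tokens versus non-overlapping 6-mers.
def char_tokens(seq: str) -> list[str]:
    """One token per nucleotide: fine-grained, but sequences get very long."""
    return list(seq)

def kmer_tokens(seq: str, k: int = 6) -> list[str]:
    """Non-overlapping k-letter tokens: shorter sequences, coarser resolution."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

reference = "ATGGCTTGGAAATAAATGGCTTGG"
mutant    = "ATGGCTTGAAAATAAATGGCTTGG"  # single-letter change at position 8

print(len(char_tokens(reference)))   # 24 single-letter tokens
print(kmer_tokens(reference))        # ['ATGGCT', 'TGGAAA', 'TAAATG', 'GCTTGG']
print(kmer_tokens(mutant))           # ['ATGGCT', 'TGAAAA', 'TAAATG', 'GCTTGG']
# The point mutation flips one character-level token, but rewrites a whole 6-mer.
```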

As it is now: Although AI foundation models in biology are still in their infancy, companies such as Profluent Bio and Inceptive, along with academic labs, are working to create models that can understand DNA and design novel proteins.

HyenaDNA, a "genomic foundation model" developed by researchers at Stanford University, learns how DNA sequences are distributed, how genes are encoded, and how the regions between protein-coding stretches regulate a gene's expression.

Yes, but: As with LLMs, there is concern about biased training data based on where samples are taken from, says Vaneet Aggarwal, a computer scientist and professor at Purdue University who has worked on AI models to understand the language of DNA.

What's next: Spewing out novel molecules from generative models is only a first step, and not necessarily the biggest hurdle, Cho says. Candidate molecules have to go through several more phases of development to filter out the most promising ones for experimental testing in the lab, he says.
