Large language models pick up the language of biology
AI systems are being trained to interpret the language of life encoded in DNA and to use that knowledge to try to create new molecules. These systems have already made significant progress in understanding human language.
Why it matters: Artificial intelligence (AI) that can interpret biological data could help researchers create novel treatments and modify cells to produce biofuels, materials, medications, and other goods.
Background: Using increasingly sophisticated computational tools, scientists have spent decades trying to reverse-engineer cells in order to create new proteins and improve on chemicals found in nature.
Other scientists have scoured the planet for substances produced by fungi, plants, bacteria, and other species that could be valuable for specific uses but haven't yet been found. Both strategies have yielded new products and cancer therapies.
"But eventually, we
run out of low-hanging fruit to pick," says Kyunghyun Cho, senior director
of Frontier Research at Prescient Design, a Genentech company, and a professor
of computer science and data science at New York University.
Now, generative AI models are being built to understand the many activities and properties of DNA, RNA, and proteins, along with the rules governing how they interact. These models resemble the large language models (LLMs) that power ChatGPT.
How it works: Humans arrange the 26 letters of the modern English alphabet into roughly 500,000 words, if not more.
LLMs are fed text, which they divide into tokens: characters, words, or subwords. After working out the relationships between these tokens, the AI model uses that knowledge to generate original text.
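As a rough sketch of that process (a toy Python example, not any production tokenizer), the snippet below splits a sentence into word-level and character-level tokens and counts which tokens follow which:

```python
# Toy illustration of tokenization: the same sentence split into
# word-level and character-level tokens. Real LLMs typically learn
# subword vocabularies (for example, via byte-pair encoding) instead.
from collections import Counter

sentence = "the cell reads the code"

word_tokens = sentence.split()                 # ['the', 'cell', 'reads', 'the', 'code']
char_tokens = list(sentence.replace(" ", ""))  # ['t', 'h', 'e', 'c', ...]

# A model then learns statistics about which tokens tend to follow which;
# a simple bigram count stands in for that training here.
bigrams = Counter(zip(word_tokens, word_tokens[1:]))
print(word_tokens)
print(bigrams.most_common(3))
```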
Biology uses
significantly fewer letters in its language, but it creates many more
"words" in the form of proteins.
DNA is made up of four
molecules that carry genetic information: A stands for adenine, C for cytosine,
T for thymine, and G for guanine.
Three-letter combinations of these four bases, called codons, specify 20 distinct amino acids, some or all of which are strung together in various orders to form proteins.
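The mapping from codons to amino acids is the standard genetic code. The sketch below translates a short DNA sequence using an abridged codon table that covers only the handful of codons the example needs, not all 64:

```python
# Translate a short DNA sequence into amino acids using an abridged
# version of the standard genetic code (only the codons used below).
CODON_TABLE = {
    "ATG": "M",  # methionine, the usual start codon
    "GCC": "A",  # alanine
    "TGG": "W",  # tryptophan
    "AAA": "K",  # lysine
    "TAA": "*",  # stop codon
}

def translate(dna: str) -> str:
    """Read the sequence three letters (one codon) at a time."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        amino_acid = CODON_TABLE.get(codon, "?")
        if amino_acid == "*":   # a stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("ATGGCCTGGAAATAA"))  # -> "MAWK"
```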
Over 200 million
proteins are known to exist. One of the most difficult and time-consuming tasks
in biology is predicting a protein's structure given its amino acid sequence.
DeepMind's AlphaFold AI system can do this.
But many orders of magnitude more proteins could exist.
That gives scientists a ton of room to explore if they want to create new proteins with the qualities they need for a novel medicine, or to modify cells to perform different functions.
What's happening: AI models are mapping that space to find changes in RNA or DNA that cause disease or alter important cellular functions, and then using that data to design new proteins. But scientists doing this face a number of obstacles.
They must figure out the
best way to break biology's language down into tokens that the LLM can work
with. They must ensure the AI is able to see the relationships between genes
and elements of genes that affect one another from different places in a long stretch
of DNA, says Joshua Dunn, a molecular and computational biologist at Ginkgo
Bioworks, which uses AI to drive some of its gene designs. It's like having to
pull sentences from different parts of a book to understand its meaning.
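One common way to tokenize DNA (a generic sketch, not necessarily the approach Ginkgo Bioworks or any particular model uses) is to split the sequence into fixed-length chunks called k-mers, each treated as a "word":

```python
# Split a DNA sequence into fixed-length k-mer tokens, one common
# (but not the only) way to turn DNA into something an LLM can ingest.
def kmer_tokens(dna: str, k: int = 6, stride: int = 6) -> list[str]:
    return [dna[i:i + k] for i in range(0, len(dna) - k + 1, stride)]

seq = "ATGGCCATTGTAATGGGCCGCTGAAAGGGT"
print(kmer_tokens(seq, k=6, stride=6))  # non-overlapping 6-mers
print(kmer_tokens(seq, k=6, stride=1))  # overlapping 6-mers (sliding window)
```

Whether a model can relate genes that sit far apart then depends on how many of these tokens it can attend to at once.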
Another consideration is that reading DNA from different starting places can produce different proteins, just as starting to read a sentence from the middle tells a different story than reading it from the beginning.
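To illustrate the reading-frame point, the sketch below groups the same DNA string into codons from three different starting positions; each frame implies a different set of codons, and therefore a different protein:

```python
# The same DNA string grouped into codons from three different
# starting positions (reading frames). Each frame yields a
# different set of codons, and therefore a different protein.
seq = "ATGGCCTGGAAATAA"

for frame in range(3):
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    print(f"frame {frame}: {codons}")
```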
Furthermore, while the majority of proteins are encoded using the standard genetic code, others are read by distinct "readers" within cells. "That means there are a whole lot of different languages being spoken at the same time," Dunn explains.
Dunn states that he is
"extremely optimistic that large language models are going to figure out
some of this because they're actually very good at understanding different
scales of meanings spoken in different languages."
However, how to tokenize genetic data to extract more information remains an open question. A model must, for instance, scan a large enough stretch of data to capture signals dispersed across a chromosome while preserving important information about mutations that affect single letters and the changes they produce. According to Dunn, AI models may not be able to rely on tokenization at all, or may need to be modified in order to do so.
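As a rough, hypothetical illustration of that tension (the numbers below are illustrative, not drawn from any specific model): tokenizing a sequence one base at a time preserves single-letter mutations but produces enormous token counts, while coarser k-mer tokens shrink the input at the cost of folding a point mutation into a larger chunk:

```python
# Back-of-the-envelope comparison of two tokenization choices for a
# chromosome-scale sequence (illustrative numbers only).
chromosome_length = 100_000_000           # ~100 million bases

tokens_per_base = chromosome_length       # one token per base: 100,000,000 tokens
tokens_per_6mer = chromosome_length // 6  # non-overlapping 6-mers: ~16,666,666 tokens

print(tokens_per_base, tokens_per_6mer)

# A single-letter mutation changes exactly one token in the base-level view,
# but in the 6-mer view it changes the identity of an entire 6-letter chunk,
# so fine-grained variation is folded into coarser tokens.
```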
State of play: Although AI foundation models in biology are still in their infancy, companies such as Profluent Bio and Inceptive, along with academic institutions, are working to create models that can understand DNA and design novel proteins.
HyenaDNA, a "genomic foundation model" developed by researchers at Stanford University, learns how DNA sequences are distributed, how genes are encoded, and how the regions between protein-coding sequences regulate a gene's expression.

Yes, but: As with LLMs, there is concern about biased training data based on where samples are taken from, says Vaneet Aggarwal, a computer scientist and professor at Purdue University who has worked on AI models to understand the language of DNA.

What's next: Spewing out novel molecules from generative models is only a first step, and not necessarily the biggest hurdle, Cho says. Candidate molecules have to go through several more phases of development to filter out the most promising ones for experimental testing in the lab, he says.