CharacteristicDetail
Estimated Reading Time20-45 minutes
Technical LevelNone
PrerequisitesNone

Introduction

Drug discovery is one of humanity’s most ambitious scientific endeavors. The journey from identifying a disease target to getting a medication on pharmacy shelves takes an average of 10-15 years and costs upwards of $2.6 billion1. Despite these massive investments, approximately 90% of drug candidates fail during clinical trials. This staggering failure rate isn’t due to lack of effort—it’s because drug discovery is an extraordinarily complex computational and biological problem.

In this blog series, we’ll explore how modern computational methods, particularly machine learning and artificial intelligence, are revolutionizing drug discovery. But before we dive into the algorithms—from protein structure prediction to molecular generation—we need to understand the fundamental biology: what molecules and proteins are, how they interact, and why finding the right drug molecule is like searching for a needle in a haystack the size of the universe.

The Building Blocks: Atoms, Bonds, and Molecules

Atoms and Chemical Bonds

Everything in chemistry starts with atoms—the fundamental units of matter. For drug discovery, we primarily care about a handful of elements: carbon (C), hydrogen (H), oxygen (O), nitrogen (N), sulfur (S), and phosphorus (P). These atoms combine through chemical bonds to form molecules.

There are three main types of chemical bonds we need to understand:

  1. Covalent bonds: Strong bonds where atoms share electrons. These form the backbone of molecules and are typically represented as lines in chemical structures. For example, in water (H₂O), each hydrogen atom forms a covalent bond with the oxygen atom.

  2. Ionic bonds: Bonds formed by the electrostatic attraction between positively and negatively charged atoms (ions). Common in salts like sodium chloride (table salt).

  3. Hydrogen bonds: Weaker interactions that occur when a hydrogen atom bonded to an electronegative atom (like oxygen or nitrogen) is attracted to another electronegative atom. These are crucial in biology—they hold DNA’s double helix together and determine protein shapes.

Functional Groups: The Personality of Molecules

While atoms are the building blocks, functional groups give molecules their “personality”—their chemical behavior and biological activity. A functional group is a specific arrangement of atoms that appears repeatedly across different molecules and confers particular properties.

Key functional groups in drug discovery include:

  • Hydroxyl group (-OH): Makes molecules more water-soluble and can form hydrogen bonds
  • Amine group (-NH₂): Can accept protons and often carries a positive charge at physiological pH (also called amino group)
  • Carboxyl group (-COOH): Acidic group that loses a proton to become negatively charged
  • Carbonyl group (C=O): Highly reactive and appears in many biological molecules
  • Aromatic rings: Flat, ring-shaped structures (like benzene) that provide rigidity and can stack with other aromatic groups

Understanding functional groups is crucial because they determine:

  • How a molecule dissolves in water or fat
  • Whether it can cross cell membranes
  • How it interacts with proteins
  • How the body metabolizes and eliminates it

Chemical Bonds: The Forces That Hold Molecules Together

Single, Double, and Triple Bonds

When atoms form covalent bonds, they can share different numbers of electron pairs, creating bonds of varying strength and flexibility.

Single bonds are the most common type of covalent bond, where two atoms share one pair of electrons (two electrons total). Represented as a single line (C-C) in chemical structures, single bonds allow free rotation around the bond axis. This rotational freedom is crucial for molecular flexibility—it’s why proteins can fold and why drug molecules can adjust their shape to fit into binding pockets. Examples include the C-C bonds in ethane or the C-O bond in alcohols. Single bonds are relatively long and weak compared to multiple bonds.

Double bonds involve two pairs of shared electrons (four electrons total), represented as two parallel lines (C=C). These bonds are shorter and stronger than single bonds, and critically, they do not allow free rotation. The atoms are locked in place relative to each other, creating rigidity in molecular structures. The C=O double bond in carbonyl groups is one of the most important in biology—it appears in proteins (peptide bonds), DNA bases, and countless drug molecules. Double bonds also create the possibility of geometric isomers: two molecules with the same atoms but different spatial arrangements (cis vs trans configurations), which can have completely different biological activities. For exampel, see the two Stereoisomers of 2-Butene below,

cis-2-butene (with cis-Configuration): the two $CH_3$ groups are on the same side, and the two H atoms are also on the same side.

      CH3     CH3
        \     /
         C = C
        /     \
       H       H

trans-2-butene (with trans-Configuration): the two $CH_3$ groups are on opposite sides (diagonally across the double bond), and the two H atoms are also on opposite sides.

      CH3     H
        \     /
         C = C
        /     \
       H       CH3

Triple bonds share three pairs of electrons (six electrons total), shown as three parallel lines (C≡C or C≡N). These are the shortest and strongest covalent bonds, creating very rigid, linear geometries. Triple bonds are less common in biology than single or double bonds but appear in some drugs and natural products. The nitrile group (C≡N) is occasionally used in drug design for its metabolic stability and ability to form certain interactions with proteins.

Bond Order and Molecular Properties

The number of shared electron pairs (bond order) directly affects molecular behavior:

  • Bond length: Triple bonds < double bonds < single bonds
  • Bond strength: Triple bonds > double bonds > single bonds
  • Reactivity: Double and triple bonds are more reactive than single bonds, making them sites where chemical reactions often occur

Understanding bond types is essential for drug discovery because they determine:

  • How rigid or flexible a molecule is (affecting binding)
  • Which parts of the molecule are chemically reactive (affecting metabolism)
  • How electrons are distributed (affecting interactions with proteins)

When designing drugs, medicinal chemists carefully choose where to place single bonds (for flexibility) versus double bonds (for rigidity and specific geometry), optimizing the molecule’s ability to bind its target while maintaining drug-like properties.

Proteins: The Molecular Machines of Life

What Are Proteins?

Proteins are the workhorses of biology, performing virtually every critical function in living organisms. As enzymes, they act as biological catalysts that accelerate chemical reactions by millions of times—reactions that would otherwise take years to complete happen in milliseconds. For example, the enzyme carbonic anhydrase converts carbon dioxide to bicarbonate at a rate of nearly one million reactions per second. As transporters and channels, proteins move essential molecules across cell membranes, from nutrients like glucose entering cells to ions flowing through nerve cells to generate electrical signals. As receptors, they sit on cell surfaces detecting hormones, neurotransmitters, and other signaling molecules, then transmitting these messages into the cell to trigger appropriate responses—this is how insulin tells your cells to absorb glucose, or how adrenaline prepares your body for “fight or flight.” As structural proteins, they provide mechanical support and shape to cells and tissues—collagen gives strength to skin and bones, while actin and myosin enable muscle contraction. Beyond these roles, proteins also serve as antibodies defending against pathogens, motor proteins transporting cargo within cells, and transcription factors controlling which genes are turned on or off. This remarkable functional diversity all stems from the same basic building blocks: chains of amino acids folded into precise three-dimensional shapes.

In humans, we have approximately 20,000-25,000 different proteins encoded in our genome2.

At their core, proteins are polymers—long chains made of smaller units called amino acids linked together like beads on a string. Each amino acid has the same basic structure: a central carbon atom (called the alpha carbon) bonded to an amino group (-NH₂), a carboxyl group (-COOH), a hydrogen atom, and a variable side chain (called the R group). It’s this side chain that makes each of the 20 standard amino acids unique and gives proteins their remarkable versatility.

These 20 amino acids can be grouped by their chemical properties. Hydrophobic (water-fearing) amino acids like leucine, isoleucine, and valine have oily, nonpolar side chains that tend to cluster together in the protein’s interior, away from water. Hydrophilic (water-loving) amino acids like serine and threonine have polar side chains that can form hydrogen bonds with water. Charged amino acids carry electrical charges: positively charged amino acids like lysine and arginine, and negatively charged ones like aspartate and glutamate—these can form strong electrostatic interactions (salt bridges) with each other. Aromatic amino acids like phenylalanine and tyrosine have ring structures that can stack together and interact through pi-pi interactions. Then there are special cases: cysteine can form covalent disulfide bonds with other cysteines, essentially creating chemical “staples” that lock protein structures in place; proline introduces kinks in the chain due to its rigid ring structure; and glycine, the smallest amino acid, provides flexibility.

The sequence of amino acids—which amino acid comes first, second, third, and so on—is called the primary structure. This sequence is not random; it’s precisely determined by the DNA sequence in our genes through the genetic code. Every three DNA bases (a codon) specify one amino acid. For instance, the DNA sequence ATG codes for methionine, while GGC codes for glycine. During protein synthesis, molecular machinery reads this genetic blueprint and assembles amino acids in the exact order specified, creating a polypeptide chain that can be hundreds or even thousands of amino acids long. A typical protein might be 300-500 amino acids, though some can be much larger—titin, a muscle protein, contains over 34,000 amino acids. Remarkably, just by changing one amino acid in this sequence, you can dramatically alter or completely destroy a protein’s function—a single amino acid mutation in hemoglobin causes sickle cell disease, demonstrating how critical each position in the sequence is to proper protein function. See image below for an illustration of sickle cell disease on a molecular level.

Description
Because of this change of one amino acid in the chain, hemoglobin molecules form long fibers that distort the biconcave, or disc-shaped, red blood cells and causes them to assume a crescent or “sickle” shape, which clogs blood vessels. The beta (β)- chain of hemoglobin is 147 amino acids in length, yet a single amino acid substitution in the primary sequence leads changes in secondary, tertiary and quaternary structures and sickle cell anemia. In normal hemoglobin, the amino acid at position six is glutamate. In sickle cell hemoglobin glutamate is replaced by valine. Credit: Rao, A., Tag, A. Ryan, K. and Fletcher, S. Department of Biology, Texas A&M University.

Structure Determines Function

One of the most fundamental principles in biology is that a protein’s structure determines its function. This principle is central to understanding drug discovery.

Proteins have four levels of structural organization:

  1. Primary structure: The linear sequence of amino acids connected by peptide bonds (covalent bonds between amino acids).
Description
Bovine serum insulin is a protein hormone comprised of two peptide chains, A (21 amino acids long) and B (30 amino acids long). In each chain, three-letter abbreviations that represent the amino acids' names in the order they are present indicate primary structure. The amino acid cysteine (cys) has a sulfhydryl (SH) group as a side chain. Two sulfhydryl groups can react in the presence of oxygen to form a disulfide (S-S) bond. Two disulfide bonds connect the A and B chains together, and a third helps the A chain fold into the correct shape. Note that all disulfide bonds are the same length, but we have drawn them different sizes for clarity.
  1. Secondary structure: Local folding patterns stabilized by hydrogen bonds, primarily:
    • Alpha helices: Spiral structures that look like a spring
    • Beta sheets: Extended, pleated structures that can run parallel or antiparallel
Description
The α-helix and β-pleated sheet are secondary protein structures formed when hydrogen bonds form between the carbonyl oxygen and the amino hydrogen in the peptide backbone. Certain amino acids have a propensity to form an α-helix while others favor β-pleated sheet formation. Black = carbon, White = hydrogen, Blue = nitrogen, and Red = oxygen. Credit: Rao, A., Ryan, K. Fletcher, S. and Tag, A. Department of Biology, Texas A&M University.
  1. Tertiary structure: The complete three-dimensional shape of a single protein chain, formed by interactions between amino acids that may be far apart in the sequence. This includes hydrophobic interactions (oily amino acids clustering away from water), disulfide bonds (covalent bonds between cysteine residues), electrostatic interactions, and hydrogen bonds.
Description
A variety of chemical interactions determine the proteins' tertiary structure. These include hydrophobic interactions, ionic bonding, hydrogen bonding, and disulfide linkages.
  1. Quaternary structure: When multiple protein chains (subunits) come together to form a functional complex.
Description
A variety of chemical interactions determine the proteins' tertiary structure. These include hydrophobic interactions, ionic bonding, hydrogen bonding, and disulfide linkages.
The process of a protein folding from a linear chain into its functional 3D structure is governed by physics and chemistry. The protein seeks the conformation (shape) with the lowest free energy. Remarkably, the amino acid sequence contains all the information needed for proper folding—though how exactly this happens is still an active area of research (with recent breakthroughs from AlphaFold and other AI models).

Binding Sites and Molecular Recognition

For drug discovery, we care most about binding sites—specific pockets or grooves on a protein’s surface where other molecules can attach. These sites typically have:

  • A complementary shape to their binding partners (like a lock and key)
  • Chemical groups positioned to form favorable interactions (hydrogen bonds, electrostatic attractions, hydrophobic contacts)
  • Some flexibility to adjust their shape upon binding (induced fit)

When a small molecule binds to a protein, it can either:

  • Activate the protein (agonist)
  • Inhibit the protein (antagonist)
  • Modulate the protein’s activity in more subtle ways (allosteric modulator)

This is the molecular basis of how drugs work.

The Drug Discovery Problem

What Makes a Good Drug?

Finding a molecule that binds to a target protein is only the first hurdle. A successful drug must satisfy multiple competing requirements:

  1. Binding affinity and specificity: The molecule must bind strongly enough to the target protein (high affinity) but not too strongly (which can cause toxicity). It must also be selective—binding to the intended target without affecting other proteins, which would cause side effects.

  2. ADMET properties: An acronym for crucial pharmacological characteristics:

    • Absorption: Can the drug get into the bloodstream?
    • Distribution: Does it reach the right tissues?
    • Metabolism: How does the body break it down?
    • Excretion: How is it eliminated from the body?
    • Toxicity: Is it safe?
  3. Drug-like properties: Lipinski’s “Rule of Five” provides rough guidelines for whether a molecule is likely to be orally bioavailable—meaning it can be taken as a pill and successfully absorbed into the bloodstream through the digestive system. Developed by Christopher Lipinski in 1997 based on analyzing existing oral drugs, this rule identifies physicochemical properties that help molecules pass through intestinal membranes and reach their targets. The rule gets its name because most of the cutoff values are multiples of five:

  • Molecular weight < 500 Daltons (Da): This measures how heavy the molecule is. One Dalton is approximately the mass of a single hydrogen atom. Smaller molecules (under 500 Da) generally cross cell membranes more easily than larger ones. Very large molecules struggle to squeeze through the lipid bilayers that form cell membranes. For reference, aspirin has a molecular weight of 180 Da, while many antibiotics fall in the 300-500 Da range.

  • LogP (lipophilicity) < 5: LogP measures how much a molecule “likes” fat versus water—technically, it’s the logarithm of the partition coefficient between octanol (representing fatty environments) and water. A LogP of 0 means the molecule distributes equally between oil and water. Positive values mean it prefers oil (lipophilic), while negative values mean it prefers water (hydrophilic). For oral drugs, you want a balance: too hydrophilic (LogP < 0) and the molecule won’t cross lipid membranes; too lipophilic (LogP > 5) and it won’t dissolve in blood, may accumulate in fatty tissues, and often binds indiscriminately to many proteins causing side effects. A LogP between 1-3 is often ideal for oral drugs.

  • Hydrogen bond donors ≤ 5: Hydrogen bond donors are groups that have a hydrogen attached to an electronegative atom (oxygen or nitrogen), like -OH or -NH₂, which can donate a hydrogen to form hydrogen bonds. These groups make molecules more water-soluble but also make it harder to cross cell membranes, because membranes are made of fatty molecules that repel water-loving groups. Having too many hydrogen bond donors (more than 5) means the molecule becomes too polar and struggles to pass through the lipid-rich intestinal barrier.

  • Hydrogen bond acceptors ≤ 10: Hydrogen bond acceptors are atoms with lone pairs of electrons that can accept hydrogen bonds—typically oxygen and nitrogen atoms. Like donors, having too many acceptors makes the molecule too polar. The rule allows for more acceptors (10) than donors (5) because acceptors generally have a smaller negative impact on membrane permeability. Common acceptors include the oxygen atoms in carbonyl groups (C=O), ether groups (C-O-C), and nitrogen atoms in amines.

These rules help ensure oral bioavailability, though many successful drugs violate them—particularly newer biologics, antibody-drug conjugates, and drugs designed for specific transporters that actively shuttle them across membranes. The Rule of Five is a guideline, not an absolute requirement, but it remains a useful filter when screening large chemical libraries for potential drug candidates.

  1. Synthetic accessibility: Can we actually make the molecule at scale?

  2. Patent considerations: Is the molecule novel and patentable?

The challenge is that these requirements often conflict. A molecule that binds perfectly might be insoluble in water. One with good ADMET properties might not be selective. This multi-objective optimization problem is at the heart of medicinal chemistry.

The Chemical Space Problem

The number of possible drug-like molecules is astronomically large. Conservative estimates suggest there are 10⁶⁰ possible drug-like molecules3. For context, there are only about 10⁸⁰ atoms in the observable universe. We’ve only explored a tiny fraction—commercial databases contain around 10⁸ compounds.

This means:

  • We can never synthesize and test everything
  • Traditional trial-and-error is impossibly slow
  • We need smarter ways to navigate this chemical space
  • Small improvements in prediction accuracy can save years and millions of dollars

The Computational Drug Discovery Pipeline

Modern drug discovery is a multi-stage pipeline where computational methods play increasingly important roles. Let’s walk through each stage:

Stage 1: Target Identification and Validation

Goal: Identify a protein whose modulation will treat a disease.

This stage answers: “What biological target should we go after?” Scientists study disease biology to identify proteins that, when modified, could alleviate symptoms or address root causes. This involves:

  • Genetic studies linking gene variants to disease susceptibility: Researchers compare the DNA of people with a disease to healthy individuals, looking for genetic variations that appear more frequently in patients. For example, genome-wide association studies (GWAS) might reveal that people with a certain gene variant are three times more likely to develop Alzheimer’s disease, suggesting that the protein encoded by that gene could be a good drug target.

  • Protein expression analysis in diseased vs. healthy tissues: Scientists measure which proteins are present at higher or lower levels in sick tissue compared to healthy tissue. If a specific protein is overproduced in cancer cells but not in normal cells, targeting that protein might selectively kill cancer while sparing healthy tissue. Techniques like mass spectrometry and immunohistochemistry help quantify these differences.

  • Pathway analysis to understand disease mechanisms: Diseases rarely involve just one protein—they involve networks of interacting proteins (pathways). By mapping these molecular pathways, scientists can identify which proteins are central to the disease process. For instance, in diabetes, researchers study the insulin signaling pathway to find proteins that could restore normal glucose metabolism when targeted by drugs.

AI/Computation role:

  • Mining genomic databases to find disease-associated genes: AI algorithms can search through massive databases containing millions of genetic sequences from patients worldwide, identifying patterns that human researchers would miss. Machine learning can integrate data from multiple studies to find genes consistently linked to diseases, even when individual studies show weak signals.

  • Analyzing gene expression data with machine learning: When cells are diseased, they produce different amounts of thousands of proteins. Machine learning models can analyze this complex data to identify which protein changes are causes of disease (good drug targets) versus which are just side effects. These models can also predict which combination of protein targets might work best together.

  • Network analysis to identify key nodes in disease pathways: Computational network analysis treats proteins as nodes and their interactions as connections, creating a map of cellular communication. AI can identify “hub” proteins that influence many others—shutting down one hub might have a bigger therapeutic effect than targeting proteins on the periphery. This is similar to how disrupting a major airport affects the entire airline network.

  • Predicting which proteins are “druggable” (have suitable binding sites): Not all proteins make good drug targets. AI models trained on existing drugs can analyze a protein’s 3D structure and surface features to predict whether small molecules can bind tightly to it. These models look for pockets or grooves on the protein surface that are the right size, shape, and chemical environment to accommodate drug molecules.

Stage 2: Hit Discovery

Goal: Find molecules that bind to the target protein.

Once we have a target, we need to find initial “hit” compounds—molecules that show measurable binding or activity. Traditional approaches include:

  • High-throughput screening (HTS): Robot systems can test 100,000+ compounds per day by mixing each one with the target protein and measuring whether binding occurs. Companies maintain libraries of millions of compounds, and HTS systematically tests them all. However, this is expensive (costing hundreds of thousands of dollars per screen), and you can only test compounds that physically exist in your library.

  • Fragment-based drug discovery: Instead of testing complete drug-sized molecules, scientists test tiny molecular “fragments” (typically 150-250 Da, much smaller than the usual 300-500 Da for drugs). These fragments bind weakly but are more likely to fit somewhere on the protein. Once you find several fragments that bind to different parts of the binding site, medicinal chemists can chemically link them together or use them as starting points to build up larger, more potent molecules.

  • Structure-based design: If you know the 3D structure of your target protein (from X-ray crystallography, cryo-EM, or AI prediction like AlphaFold), medicinal chemists can visually examine the binding pocket and design molecules that should fit like a hand in a glove. They consider the pocket’s shape, which amino acids line it, and what chemical interactions would be favorable. This is like custom-tailoring a key to fit a specific lock.

AI/Computation role:

  • Virtual screening: Instead of physically testing millions of compounds in the lab, AI can computationally simulate how each molecule would interact with the target protein. This allows screening of billions of virtual compounds (that don’t even exist yet) at a tiny fraction of the cost. Only the top-scoring candidates—typically a few hundred or thousand—need to be synthesized and tested experimentally.

  • Molecular docking: Docking algorithms predict the 3D pose of how a small molecule will sit inside a protein’s binding pocket. The software tries thousands of different orientations and conformations, scoring each based on geometric fit and predicted interaction strength. It’s like a computational jigsaw puzzle where the algorithm tries to find the best way to fit the molecule into the protein pocket.

  • Machine learning models: Predicting binding affinity from molecular structure: Rather than physics-based docking, ML models learn from historical data of which molecules bound strongly to similar proteins. Given a new molecule, the model predicts its binding affinity (how tightly it will stick) based on patterns learned from thousands of previous experiments. Modern graph neural networks can make these predictions in milliseconds, compared to minutes or hours for physics-based methods.

  • Generative models: Designing novel molecules optimized for the target: These AI models don’t just score existing molecules—they create entirely new molecular structures. They learn the “grammar” of chemistry (which atom arrangements are valid) and the patterns of what makes molecules bind well, then generate novel designs that have never been synthesized before but are predicted to have strong binding and good drug-like properties.

  • Active learning: Iteratively selecting the most informative compounds to test experimentally: This is a smart sampling strategy where AI doesn’t just predict which molecules are best, but which experiments would teach it the most. The algorithm might select molecules it’s uncertain about to reduce that uncertainty, creating a feedback loop where each round of experiments makes the AI model smarter and better at predicting the next round. This can dramatically reduce the total number of experiments needed.

Modern approaches can reduce the number of compounds that need physical synthesis and testing by 10-100x, saving enormous time and resources.

Stage 3: Lead Optimization

Goal: Improve hit compounds into “lead” candidates with drug-like properties.

Hit compounds usually have problems: weak binding, poor solubility, off-target effects, or toxicity. Medicinal chemists iteratively modify the molecular structure to improve properties while maintaining activity. This is a complex multi-objective optimization problem.

AI/Computation role:

  • QSAR (Quantitative Structure-Activity Relationship): Predicting how structural changes affect activity: QSAR models learn mathematical relationships between a molecule’s structure (like which functional groups it has, its shape, its electronic properties) and its biological activity. For example, a QSAR model might reveal that adding a chlorine atom at a specific position increases binding affinity by 10-fold, while adding it elsewhere has no effect. This helps chemists make informed decisions about which modifications to try next.

  • ADMET prediction: Machine learning models for absorption, distribution, metabolism, excretion, and toxicity: Before synthesizing modified molecules, AI can predict whether they’ll have acceptable pharmacological properties. Models trained on thousands of previous drugs can predict whether a molecule will be absorbed through the intestine, how long it will stay in the body, whether liver enzymes will break it down too quickly, and whether it’s likely to be toxic to the heart, liver, or kidneys. This prevents wasting time on molecules that would fail later.

  • Molecular dynamics: Simulating protein-ligand binding at atomic detail: These physics-based simulations model every atom’s movement over time (typically nanoseconds to microseconds), showing how the drug molecule and protein move and interact dynamically. This reveals whether the drug stays stably bound, which interactions are most important, and whether the protein changes shape upon binding. However, these simulations are computationally expensive, often requiring hours to days on powerful computers.

  • Multi-objective optimization: Algorithms that balance multiple desired properties: Since improving one property (like binding) often worsens another (like solubility), optimization algorithms help find the best compromise. These algorithms explore chemical modifications that improve the overall “score” across all important properties simultaneously, rather than optimizing one property at a time. Think of it like tuning a car for both speed and fuel efficiency—you need to find the sweet spot.

  • Retrosynthesis prediction: AI models that suggest synthetic routes for proposed molecules: Just because you design a molecule doesn’t mean chemists can actually make it. Retrosynthesis AI works backward from the target molecule, breaking it down into simpler starting materials and suggesting the chemical reactions needed to build it. Modern models trained on millions of published reactions can suggest routes that human chemists might not consider, and assess whether a synthesis would be practical or prohibitively difficult.

  • Generative models with constraints: Creating new molecules that optimize multiple properties simultaneously: Advanced generative models can be given multiple objectives (high binding, low toxicity, good solubility, easy to synthesize) and generate novel molecules that satisfy all constraints. These models navigate the chemical space more intelligently than random modifications, proposing molecules that represent meaningful improvements across the board.

Tools like graph neural networks can predict molecular properties orders of magnitude faster than physics-based simulations, enabling rapid iteration.

Stage 4: Preclinical Testing

Goal: Demonstrate safety and efficacy before human trials.

Lead compounds undergo rigorous testing in cell cultures and animal models. Scientists assess:

  • Efficacy: Does it work in biological systems?: First tested in cells grown in dishes (in vitro), then in living animals (in vivo). Researchers measure whether the drug actually produces the desired biological effect—does it kill cancer cells, reduce inflammation, or lower blood pressure? They also determine the dose-response relationship: how much drug is needed to see an effect?

  • Safety: What are the toxic doses? Are there concerning side effects?: Animal studies test escalating doses to find the highest dose animals can tolerate without serious harm. Researchers monitor for organ damage (liver, kidney, heart toxicity), examine tissues under microscopes, and track blood chemistry. They also watch for behavioral changes, weight loss, or other signs of distress that might indicate problems.

  • Pharmacokinetics: How does the body process the drug?: Scientists track the drug’s concentration in blood over time after dosing. This reveals how quickly it’s absorbed, how it distributes to different tissues, how long it remains in the body (half-life), and how it’s eliminated (through kidneys, liver metabolism, etc.). This data helps predict the appropriate dosing schedule for humans—whether a drug needs to be taken once daily, twice daily, or more frequently.

AI/Computation role:

  • Toxicity prediction: ML models trained on historical toxicity data can flag potential problems early: Before expensive animal studies, AI models can screen for known toxicity patterns. Models trained on databases of chemicals that caused liver damage, heart problems, or other issues can predict whether a new molecule shares structural features with known toxins. This early warning system can save months of work and reduce animal testing by filtering out problematic candidates.

  • Biomarker identification: Analyzing preclinical data to find indicators of drug response: Machine learning can analyze complex biological data (gene expression, protein levels, metabolite concentrations) to identify biomarkers—measurable indicators that predict whether a drug is working. For example, if a certain protein level drops when the drug is effective, measuring that protein provides an early signal of efficacy before you see full therapeutic effects. These biomarkers become valuable in later human trials.

  • Dose-response modeling: Predicting optimal dosing regimens: Computational pharmacokinetic/pharmacodynamic (PK/PD) models integrate data on how drug concentration changes over time with how those concentrations produce effects. These models can predict the optimal dose and frequency to maintain therapeutic drug levels without causing toxicity, even before testing all possible regimens experimentally. This reduces the amount of trial-and-error needed.

  • Cross-species extrapolation: Predicting human responses from animal data: Animals metabolize drugs differently than humans—a safe dose in mice might be toxic in humans, or vice versa. AI models can learn scaling relationships and species differences to better predict human responses from animal data. These models incorporate known differences in liver enzymes, protein targets, and physiology to make more accurate human predictions.

  • Clinical trial design: Optimizing trial protocols using historical data: Even before human trials begin, AI can analyze data from previous trials of similar drugs to suggest optimal trial designs—how many patients to enroll, which measurements to track, which patient subgroups to focus on, and which doses to test. This increases the probability of trial success and reduces the number of patients exposed to suboptimal doses.

Even at this late stage, computational methods can identify potential failures before expensive animal studies.

Beyond Preclinical: Clinical Trials and Approval

After successful preclinical testing, drugs enter clinical trials:

  • Phase I: Safety testing in small numbers (20-100) of healthy volunteers to establish that the drug is safe in humans, determine how it’s metabolized, and find the maximum tolerated dose. These are the first humans to ever take the drug.

  • Phase II: Efficacy and dosing in patients (100-300) who actually have the disease. Researchers assess whether the drug produces the intended therapeutic effect and continue to monitor safety. They also refine the optimal dose and start looking for which patient populations benefit most.

  • Phase III: Large-scale trials (1,000-3,000+ patients) comparing the new drug to current standard treatments (or placebo if no treatment exists). These trials must demonstrate that the drug is both effective and safe in a diverse patient population before regulatory agencies like the FDA will approve it for market.

This process takes 6-10 years and costs hundreds of millions of dollars. AI is beginning to play roles in patient selection (identifying which patients are most likely to respond), biomarker discovery (finding early indicators of response), and trial optimization (determining optimal trial designs and endpoints), though clinical testing remains the most expensive and time-consuming part of drug development.

Why This Is Computationally Hard

Drug discovery presents unique computational challenges:

1. Multiple Scales of Complexity

Drug discovery spans scales from quantum mechanics (chemical bond formation) to cellular biology (how drugs affect cells) to whole-organism physiology (drug effects in patients). No single computational model captures all this complexity.

2. Data Scarcity and Quality

Unlike computer vision or NLP where millions of labeled examples exist, drug discovery data is:

  • Expensive to generate (each data point might cost thousands of dollars)
  • Noisy (biological assays have high variability)
  • Biased (researchers test molecules they think will work)
  • Imbalanced (far more inactive than active compounds)

3. Physics and Chemistry Constraints

Molecules must obey physical and chemical laws. Generative models can’t just create any pattern—they must generate chemically valid, synthesizable molecules. This requires incorporating domain knowledge into AI architectures.

4. High-Dimensional, Sparse Search Space

Chemical space is vast, but regions where molecules have all desired properties are tiny and disconnected. Finding these regions is like searching for scattered islands in an ocean.

5. Multiple Conflicting Objectives

Optimizing binding affinity might hurt solubility. Improving metabolic stability might increase toxicity. These trade-offs require sophisticated multi-objective optimization approaches.

6. The Sim-to-Real Gap

Computational predictions must eventually be validated experimentally. Models that work well on benchmark datasets might fail on real-world synthesis and testing. Closing this “sim-to-real gap” is crucial.

The Promise of AI in Drug Discovery

Despite these challenges, AI is transforming drug discovery:

  • AlphaFold has solved the protein structure prediction problem, providing high-quality 3D structures for millions of proteins4
  • Generative models can design novel molecules optimized for multiple properties
  • Graph neural networks predict molecular properties with unprecedented accuracy
  • Reinforcement learning navigates chemical space more efficiently than random exploration
  • Foundation models trained on massive chemical and biological datasets are emerging

The field is moving from physics-based simulations (slow but interpretable) toward hybrid approaches that combine machine learning’s speed with physics-based modeling’s rigor.

What’s Next

In this series, we’ll build up the technical foundations to understand and implement modern AI methods for drug discovery:

  • Blog 2: We’ll explore how to represent molecules and proteins for machine learning—from simple text strings to sophisticated graph structures
  • Blog 3: We’ll examine AlphaFold and the protein structure prediction revolution—how transformers and attention mechanisms solved a 50-year-old problem and why accurate structures are essential for drug design
  • Blog 4: We’ll dive deep into graph neural networks and how they learn from molecular structures to predict properties
  • Blog 5: We’ll examine generative models that can create entirely new molecules, including transformers and reinforcement learning approaches
  • Blog 6: We’ll explore the cutting edge—diffusion models that generate 3D molecular structures optimized for binding to specific proteins

Each post will balance theory with practical implementation details, helping you build an intuition for both the biological problems and the computational solutions.

The intersection of AI and drug discovery is one of the most exciting frontiers in both fields. By understanding the biological foundations covered here, we’ll be well-equipped to appreciate how modern machine learning methods are accelerating the search for new medicines.

Image Credits

All images in this post are copied from the course General, Organic, and Biochemistry with Problems, Case Studies, and Activities by LibreTexts licensed under CC BY 4.0. 5

References


  1. DiMasi, J. A., Grabowski, H. G., & Hansen, R. W. (2016). Innovation in the pharmaceutical industry: New estimates of R&D costs. Journal of Health Economics, 47, 20-33. ↩︎

  2. International Human Genome Sequencing Consortium. (2004). Finishing the euchromatic sequence of the human genome. Nature, 431(7011), 931-945. ↩︎

  3. Polishchuk, P. G., Madzhidov, T. I., & Varnek, A. (2013). Estimation of the size of drug-like chemical space based on GDB-17 data. Journal of Computer-Aided Molecular Design, 27(8), 675-679. ↩︎

  4. Jumper, J., Evans, R., Pritzel, A., et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583-589. ↩︎

  5. General, Organic, and Biochemistry with Problems, Case Studies, and Activities by LibreTexts is licensed under CC BY 4.0. Available at: https://chem.libretexts.org/Courses/Roosevelt_University/General_Organic_and_Biochemistry_with_Problems_Case_Studies_and_Activities↩︎