Hit-to-Lead: Virtual Screening Compounds for Drug Discovery


Introduction

What follows are notes on getting Virtual Flow running on a single machine, in order to understand the mechanics and scaling behind in silico drug discovery. Virtual Flow is an open-source drug candidate screening platform designed to screen millions of compounds against a protein or receptor target.

Background

Bringing a new drug to market is an expensive endeavor. Cost estimates vary widely, but one study found the median cost of bringing a drug to market was $1.1 billion (in 2018 dollars). There are many components that contribute to this cost, so it’s important to find ways to reduce it along the development pipeline. One area that has received a lot of attention recently is the development of in silico drug discovery systems that reduce the cost of “hit discovery”: finding the small-molecule compounds (i.e. ligands) that show high binding affinity for a target (i.e. a receptor or protein).

At the core of in silico drug discovery are molecular docking simulations (e.g. AutoDock Vina) that predict the binding affinity between a small-molecule compound and the target. Each docking is usually scored by the predicted binding energy between the compound and the target, where a more negative energy indicates stronger predicted binding.
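
As a concrete illustration, this is roughly what a single docking run looks like with the AutoDock Vina command line; the receptor/ligand file names and search-box values below are placeholders rather than anything from the Virtual Flow tutorial:

     # Dock one prepared ligand against one prepared receptor (both in PDBQT format),
     # searching a 20 Angstrom box centered on the binding site.
     vina --receptor receptor.pdbqt --ligand ligand.pdbqt \
          --center_x 10.0 --center_y 12.0 --center_z -5.0 \
          --size_x 20 --size_y 20 --size_z 20 \
          --exhaustiveness 8 --out ligand_out.pdbqt

Virtual Flow's job is essentially to run this kind of docking call at scale, once per compound conformation, and collect the resulting scores.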

Virtual Flow

A very high-level overview of a drug screening system such as VirtualFlow is as follows:

graph TD
	A(Compound Database)-->B(Generate Docking Conformations)
    B --> C(Dock conformation with Receptor)
    C --> D(Measure Binding Energy)
    D --> E(Store results)

While VirtualFlow has been designed to scale across multiple machines, these notes will step through running VFVS on one Ubuntu 22.04 machine.

SLURM Batch System Setup

In order to use VirtualFlow, a batch system needs to be set up first. Here’s how I set up the Simple Linux Utility for Resource Management (SLURM):

  1. Install both slurmd and slurmctld since both the controller and node daemon will be running on the same machine.

     sudo apt update -y
     sudo apt install slurmd slurmctld -y
    
  2. Add the slurm config:

     sudo touch /etc/slurm/slurm.conf
     sudo chmod 755 /etc/slurm/slurm.conf
    
  3. Add the following to the slurm.conf file (customizing it as necessary for the machine hardware):

     # slurm.conf file for Ubuntu with debug logging in /var/log/slurm
    	
     # Control machine configuration
     SlurmctldHost=localhost
     SlurmctldPort=6817
     SlurmdPort=6818
    	
     # Node configuration
     # A single-socket machine with 16 cores, 32 GB RAM, and hyperthreading enabled
    	
     NodeName=localhost Sockets=1 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=32000 State=UNKNOWN
    	
     # Partition configuration
     PartitionName=test Nodes=localhost Default=YES MaxTime=INFINITE State=UP
    	
    	
     AccountingStorageType=accounting_storage/none
     JobAcctGatherFrequency=30
     JobAcctGatherType=jobacct_gather/none
     SlurmdSpoolDir=/var/lib/slurm/slurmd
     SlurmUser=slurm
     StateSaveLocation=/var/lib/slurm/slurmctld
    	
    	
     # Cluster configuration
     ClusterName=localcluster
    	
     # Timers
     InactiveLimit=0
     KillWait=30
     MinJobAge=300
     SlurmctldTimeout=60
     SlurmdTimeout=150
     Waittime=0
    	
     # SCHEDULING
     SchedulerType=sched/backfill
     SelectType=select/cons_tres
     SelectTypeParameters=CR_Core
    	
     # Logging configuration
     SlurmctldLogFile=/var/log/slurm/slurmctld.log
     SlurmdLogFile=/var/log/slurm/slurmd.log
     SlurmSchedLogFile=/var/log/slurm/slurmsched.log
     JobCompType=jobcomp/filetxt
     JobCompLoc=/var/log/slurm/jobacct.log
     SlurmdDebug=info
     SlurmctldDebug=info
    
    
  4. Start slurmd and slurmctld:

     sudo systemctl start slurmd
     sudo systemctl start slurmctld
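
  5. Verify that the node and partition came up:

     sinfo

    Example output (approximate; exact values depend on the hardware and the config above):

     PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
     test*        up   infinite      1   idle localhost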
    

Useful SLURM commands for Virtual Flow:

  • sinfo - check the state of the queue (partition)
  • squeue - check for jobs in the queue
  • scancel -u <user> - cancel all jobs belonging to a particular user. Jobs can also be stopped in other ways within Virtual Flow.
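
On a single-node setup, the node can also end up in a drained or down state after a crash or reboot. Two standard scontrol commands (not specific to Virtual Flow) that help here:

  • scontrol show node localhost - show the node's current state and the reason it was drained
  • sudo scontrol update NodeName=localhost State=RESUME - return a drained/down node to service once the underlying issue is fixed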

Virtual Flow Setup

For these notes, I’ll be using the setup described in the Virtual Flow tutorial:

  1. Get VFVS_GK tutorial files:

     cd ~/dev/
     wget https://virtual-flow.org/sites/virtual-flow.org/files/tutorials/VFVS_GK.tar
     tar -xvf VFVS_GK.tar
    
  2. Optionally (if testing beyond the tutorial files), select the compounds according to these instructions and run source tranches.sh from the VFVS_GK/input-files/ligand-library folder to download the compound files.

  3. Edit tools/templates/all.ctrl and set these values according to the number of cores on the machine (e.g. 16; a quick way to check the core count is sketched after this list):

     cpus_per_step=16
     queues_per_step=16
     cpus_per_queue=16
     ...
    
  4. Install Open Babel:

     sudo apt-get install openbabel
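
As mentioned in step 3, a quick way to check the core count on the machine with standard Linux tools (nothing Virtual Flow-specific):

     nproc    # number of logical CPUs (hardware threads)
     lscpu | grep -E '^CPU\(s\)|Core\(s\) per socket|Thread\(s\) per core'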
    

Virtual Flow Run

After the setup is complete, follow these steps for each run:

  1. Prepare the output folders:

     cd ~/dev/VFVS_GK/tools
     ./vf_prepare_folders.sh 
    
  2. Start Virtual Flow:

    Below is an example with one job and one queue. Note that the number of cores used by the queue is determined by the cpus_per_queue parameter set earlier.

     ./vf_start_jobline.sh 1 1 templates/template1.slurm.sh submit 1
    
  3. Check Virtual Flow status:

     ./vf_report.sh -c workflow
    

    Example output:

     Total number of ligands: 1123
     Number of ligands started: 21
     Number of ligands successfully completed: 21
     Number of ligands failed: 0
     ...
     Docking runs per ligand: 2
     Number of dockings started: 42
     Number of dockings successfully completed: 42
     Number of dockings failed: 0
    
  4. Check a particular docking method (with top 10 compounds by binding energy):

     ./vf_report.sh -c vs -d qvina02_rigid_receptor1 -n 10
    

    Example output:

     Binding affinity - statistics
     ........................................

     Number of ligands screened with binding affinity between    0 and  inf kcal/mole: 26
     Number of ligands screened with binding affinity between -0.1 and -5.0 kcal/mole: 119
     ...

     Binding affinity - highest scoring compounds
     ........................................

     Rank  Ligand        Collection    Highest-Score
     1     ABC-1234_1    XXXXXX_00000  -7.6
     2     ABC-1234_2    XXXXXX_00000  -7.6
     3     XYZ-4321_4    XXXXXX_00000  -7.4
     ...

Monitoring and Debugging

In addition to running vf_report.sh, it can also be useful to monitor the SLURM logs:

sudo tail -f /var/log/slurm/*.log

Typically failed jobs will appear in /var/log/slurm/jobacct.log with a FAILED JobState. For example:

JobId=3391 ... Name=t-1.1 JobState=FAILED
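
To pull out just the failed jobs from the job completion log:

grep 'JobState=FAILED' /var/log/slurm/jobacct.log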

To view error messages, try looking in the logs under:

workflow/output-files/queues/
workflow/output-files/jobs

Related Note: These logs were useful for debugging an issue where leading zeroes were not being removed in a date calculation; bash arithmetic treats a number with a leading zero as octal, so a nanosecond value that keeps its leading zero (and contains an 8 or 9) breaks the expression. In one-queue.sh, I had to change the start/end time calculations (e.g. docking_start_time_s) from $(($(date +'%s * 1000 + %-N / 1000000'))) to $(($(date +'%s'))), which uses second rather than millisecond resolution.
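
A minimal standalone illustration of the underlying bash behavior (not Virtual Flow code):

echo $(( 0923 + 1 ))      # fails: the leading zero makes bash parse 0923 as octal, and 9 is not a valid octal digit
echo $(( 10#0923 + 1 ))   # prints 924: forcing base 10 avoids the problem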

Virtual Flow Run Completion

Once the Virtual Flow run is complete, rank all the ligands and extract the docking poses:

  1. Rank the ligands

    Add VFTools (a separate Virtual Flow package) to your path:

     export PATH=$PATH:/home/<user>/dev/VFTools/bin
    

    Then run:

     cd ~/dev/VFVS_GK
     mkdir -p pp/ranking
     cd pp/ranking
     vfvs_pp_ranking_all.sh ../../output-files/complete/ 2 meta_tranche
    
  2. Get the top 100 docking poses

     cd ~/dev/VFVS_GK/pp/ranking/qvina02_rigid_receptor1
     head -100 firstposes.all.minindex.sorted.clean > compounds
    	
    
  3. Extract the docking poses

     cd ~/dev/VFVS_GK
     mkdir docking_poses
     cd docking_poses
     vfvs_pp_prepare_dockingposes.sh ../output-files/complete/qvina02_rigid_receptor1/results/ meta_tranche ../pp/ranking/qvina02_rigid_receptor1/compounds dockingsposes overwrite
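
    The extracted poses should be PDBQT files (the native output format of the Vina-family docking programs used here). To inspect one in a tool that doesn't read PDBQT, Open Babel (installed during setup) can convert it; the file name below is just a placeholder:

     obabel ABC-1234_1.pdbqt -O ABC-1234_1.pdb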
    

Follow-ups

  1. Why do ./vf_report.sh -c vs -d qvina02_rigid_receptor1 -n 10 and head -10 qvina02_rigid_receptor1/firstposes.all.minindex.sorted.clean > compounds not have the same top 10 compounds?
  2. The compound screening process is much slower than expected (~60 min for 1000 compounds on a Ryzen 5900X CPU). Possible things to try:

    • A) Program docking to run on a GPU
    • B) Experiment with other batching systems
    • C) Spread jobs across a cluster (e.g. AWS ParallelCluster)
    • D) Try and benchmark other docking programs (e.g. Quick Vina)
  3. In order to reduce computation cost, can serverless compute (e.g. AWS lambda) be used with the docking executables/binaries?
  4. Can we replace SLURM with other batching or queuing (e.g. Kafka, RabbitMQ) systems?
  5. I’m not a fan of bash as a scripting language due to the lack of modularity and the difficulty of debugging/logging. Can this be re-implemented in Ruby or Python instead?
