Multi-stage Virtual Screening Tutorial

This tutorial describes a multi-stage virtual screening workflow that unifies dynamic structure-based screening by integrating Su-GaMD-derived conformational states, molecular docking, machine learning algorithms, and pharmacophore models to improve the hit rate for selective ligands.

Prerequisites

  • Python 3.7+

  • RDKit

  • PyTorch

Install Core Dependencies

pip install -r requirements.txt

Note: If the PyTorch installation fails, visit the official PyTorch website to find the build that matches your system.

Model Training

Data Preprocessing: The dataset must be a CSV file containing at least a SMILES column (molecular structure) and a Target column (1 = active, 0 = inactive).
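Before training, it can help to confirm that a dataset meets these requirements. The following is a minimal sketch, not part of the repository: it assumes the header names are exactly SMILES and Target, and check_dataset is a hypothetical helper. It checks only the file layout, not whether each SMILES string is chemically valid.

```python
import csv
import io

REQUIRED_COLUMNS = ("SMILES", "Target")  # assumed header names


def check_dataset(handle):
    """Return (n_active, n_inactive); raise ValueError on a malformed file."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing column(s): {', '.join(missing)}")
    counts = {"1": 0, "0": 0}
    for row_no, row in enumerate(reader, start=1):
        smiles = (row["SMILES"] or "").strip()
        label = (row["Target"] or "").strip()
        if not smiles:
            raise ValueError(f"data row {row_no}: empty SMILES")
        if label not in counts:
            raise ValueError(f"data row {row_no}: Target must be 0 or 1, got {label!r}")
        counts[label] += 1
    return counts["1"], counts["0"]


# Example with an in-memory file; use open(path, newline="") for a real one.
sample = io.StringIO("SMILES,Target\nCCO,0\nc1ccccc1O,1\nCC(=O)O,0\n")
print(check_dataset(sample))
```

The returned counts also give a quick view of the class balance between actives and inactives.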

Run run_model.py for hyperparameter optimization:

python scripts/run_model.py \
    --file data/gap100/A1R-gap100.csv \
    --split random \
    --FP ECFP4 \
    --model RF

Specify the input data file, the data splitting strategy, the molecular feature type, and the model type; the full list of parameters is given under Parameter Details. To keep results reproducible, rerun a given configuration with the same parameter values.

After hyperparameter optimization, run run_result.py for model training and evaluation:

python scripts/run_result.py \
    --file data/gap100/A1R-gap100.csv \
    --split random \
    --FP ECFP4 \
    --model RF

The script automatically loads the optimal hyperparameters, splits the data according to the specified strategy, trains the model, and reports evaluation metrics (accuracy, AUC, etc.).

Note: Pass the same --file, --split, --FP, and --model values to run_model.py and run_result.py, so that run_result.py loads the hyperparameters optimized for that configuration.
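One way to keep the two invocations in sync is to build both commands from a single set of settings. This is a sketch under stated assumptions: build_command and CONFIG are hypothetical names, and only the flags and values shown in the commands above are used.

```python
import shlex

# Shared settings, taken from the example commands in this tutorial.
CONFIG = {
    "--file": "data/gap100/A1R-gap100.csv",
    "--split": "random",
    "--FP": "ECFP4",
    "--model": "RF",
}


def build_command(script, config):
    """Build an argument list for one script from the shared settings."""
    cmd = ["python", script]
    for flag, value in config.items():
        cmd += [flag, value]
    return cmd


tune_cmd = build_command("scripts/run_model.py", CONFIG)
eval_cmd = build_command("scripts/run_result.py", CONFIG)

# Print the commands for inspection. To execute them in order, pass each
# list to subprocess.run(cmd, check=True).
for cmd in (tune_cmd, eval_cmd):
    print(" ".join(shlex.quote(part) for part in cmd))
```

Because both commands come from the same dictionary, changing a value such as the split strategy cannot leave the two steps out of step.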

Parameter Details

run_model.py / run_result.py Parameters:

Parameter   Description                                           Options                        Default
---------   ---------------------------------------------------   ----------------------------   --------
--file      Input data file path (required)                       Any CSV file                   (none)
--split     Data splitting strategy                               random, scaffold, cluster      scaffold
--FP        Molecular feature type                                ECFP4, MACCS, 2d-3d, pubchem   ECFP4
--model     Model type                                            RF, attentivefp                RF
--threads   Number of CPU threads for multiprocessing (RF only)   Integer                        1
--mpl       Enable multiprocessing                                true, false                    false
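To make the options and defaults concrete, here is a hypothetical argparse parser that mirrors the parameter list above. It is an illustration only; the actual scripts may define their arguments differently, and str2bool and build_parser are names introduced for this sketch.

```python
import argparse


def str2bool(text):
    """Convert 'true'/'false' (any case) to a bool for the --mpl flag."""
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    raise argparse.ArgumentTypeError(f"expected true or false, got {text!r}")


def build_parser():
    """Hypothetical parser reproducing the documented options and defaults."""
    parser = argparse.ArgumentParser(description="Sketch of the documented parameters")
    parser.add_argument("--file", required=True, help="input CSV file")
    parser.add_argument("--split", choices=["random", "scaffold", "cluster"],
                        default="scaffold")
    parser.add_argument("--FP", choices=["ECFP4", "MACCS", "2d-3d", "pubchem"],
                        default="ECFP4")
    parser.add_argument("--model", choices=["RF", "attentivefp"], default="RF")
    parser.add_argument("--threads", type=int, default=1)
    parser.add_argument("--mpl", type=str2bool, default=False)
    return parser


# Only --file is given, so every other value falls back to its default.
args = build_parser().parse_args(["--file", "data/gap100/A1R-gap100.csv"])
print(args.split, args.FP, args.model, args.threads, args.mpl)
```

Note that the documented default for --split is scaffold, whereas the example commands in this tutorial pass --split random explicitly.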

Molecular Screening

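After model training, screen new compounds with the saved model. The examples below use ml_screener.py with the saved RF model and dl_screener.py with the saved AttentiveFP model: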

python scripts/ml_screener.py \
    --file new_molecules.csv \
    --model model_save/RF/random_RF_ECFP4_bestModel.pkl \
    --prop 0.5 \
    --out_dir ./results

python scripts/dl_screener.py \
    --file new_molecules.csv \
    --model model_save/attentivefp/random_cla_attentivefp.pth \
    --prop 0.5 \
    --out_dir ./results

Note: The scripts automatically preprocess data and handle different feature types.
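The meaning of --prop is not spelled out above. The sketch below assumes it is a cutoff on the predicted probability of activity and shows how such a cutoff would select hits. The select_hits function, the inclusive comparison, and the probabilities are all made up for illustration and do not come from the screening scripts.

```python
def select_hits(predictions, threshold=0.5):
    """Keep (smiles, probability) pairs at or above the threshold, best first."""
    hits = [(smiles, prob) for smiles, prob in predictions if prob >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Made-up probabilities, for illustration only.
example = [("CCO", 0.12), ("c1ccccc1O", 0.81), ("CC(=O)N", 0.50)]
for smiles, prob in select_hits(example, threshold=0.5):
    print(smiles, prob)
```

Under this assumption, raising the cutoff yields fewer but higher-confidence candidates to pass on to docking.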

After molecular screening, the next step is to dock the retained compounds against the metastable intermediate conformations obtained with Su-GaMD in the previous step, which yields the candidate compounds for activity testing.

Logging

All run logs are written to both the console and log files. They cover:

  • Data processing progress

  • Model training progress

  • Hyperparameter optimization progress

  • Final evaluation progress

  • Screening predictions and results