Multi-stage Virtual Screening Tutorial

Multi-stage Virtual Screening Tutorial is designed to unify dynamic structure-based virtual screening workflows by integrating Su-GaMD-derived conformational states, molecular docking, machine learning algorithms and pharmacophore models to improve the hit rate of selective ligands.

Prerequisites

Python 3.7+
RDKit
PyTorch

Install Core Dependencies

pip install -r requirements.txt

Note: If you encounter issues with PyTorch installation, please visit PyTorch official website to get the appropriate version for your system.

Model Training

Data Preprocessing: The dataset should be in CSV format and include at least: SMILES column (molecular structure) and Target column (1 = active, 0 = inactive)

Run run_model.py for hyperparameter optimization:

python scripts/run_model.py \
    --file data/gap100/A1R-gap100.csv \
    --split random \
    --FP ECFP4 \
    --model RF

You need to specify the input data file path, data splitting strategy, molecular feature type, model type and other parameters. It is recommended to use the same parameters for the same configuration to ensure result reproducibility.

After hyperparameter optimization, run run_result.py for model training and evaluation:

python scripts/run_result.py \
    --file data/gap100/A1R-gap100.csv \
    --split random \
    --FP ECFP4 \
    --model RF

The script will automatically load the optimal hyperparameters, split the data according to the specified strategy, train the model, and output evaluation indicators (accuracy, AUC, etc.)

Note: Use consistent parameters between run_model.py and run_result.py for the same configuration.

Parameter Details

run_model.py / run_result.py Parameters:

Parameter	Description	Options	Default
–file	Input data file path (required)	Any CSV file
–split	Data splitting strategy	random, scaffold, cluster	scaffold
–FP	Molecular feature type	ECFP4, MACCS, 2d-3d, pubchem	ECFP4
–model	Model type	RF, attentivefp	RF
–threads	Number of CPU threads for multiprocessing (RF only)	Integer	1
–mpl	Enable multiprocessing	true, false	false

Molecular Screening

After model training, conduct virtual screening for novel compounds.

python scripts/ml_screener.py \
    --file new_molecules.csv \
    --model model_save/RF/random_RF_ECFP4_bestModel.pkl \
    --prop 0.5 \
    --out_dir ./results

python scripts/dl_screener.py \
    --file new_molecules.csv \
    --model model_save/attentivefp/random_cla_attentivefp.pth \
    --prop 0.5 \
    --out_dir ./results

Note: The scripts automatically preprocess data and handle different feature types.

After molecular screening, the next step is to integrate the metastable intermediate conformations obtained by Su-GaMD technology in the previous step for molecular docking, and obtain candidate compounds for activity testing.

Logging

All running logs are output to both the console and log files, containing:

Data processing progress
Model training progress
Hyperparameter optimization progress
Final evaluation progress
Screening predictions and results