Multi-stage Virtual Screening Tutorial
This tutorial presents a multi-stage virtual screening workflow that unifies dynamic structure-based screening by integrating Su-GaMD-derived conformational states, molecular docking, machine learning algorithms, and pharmacophore models to improve the hit rate for selective ligands.
Prerequisites
Python 3.7+
RDKit
PyTorch
Install Core Dependencies
pip install -r requirements.txt
Note: If you encounter issues installing PyTorch, visit the official PyTorch website to get the appropriate version for your system.
Model Training
Data Preprocessing: The dataset should be a CSV file that includes at least a SMILES column (molecular structure) and a Target column (1 = active, 0 = inactive).
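Before starting a long optimization run, it can help to confirm that the CSV matches this layout. The following is a minimal sketch using only the standard library; the function name `check_dataset` is hypothetical and not part of this repository, and the check covers only the column names and label values described above.

```python
import csv
import io

REQUIRED_COLUMNS = ("SMILES", "Target")  # column names taken from the text above


def check_dataset(handle):
    """Return a list of problems found in a CSV dataset; empty means it looks usable."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]

    problems = []
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        if not (row["SMILES"] or "").strip():
            problems.append(f"line {line_no}: empty SMILES")
        if (row["Target"] or "").strip() not in {"0", "1"}:
            problems.append(f"line {line_no}: Target must be 0 or 1")
    return problems


# Small in-memory example with two deliberately bad rows.
sample = io.StringIO("SMILES,Target\nCCO,1\nc1ccccc1,0\n,1\nCCN,active\n")
for problem in check_dataset(sample):
    print(problem)
```

This sketch does not test whether each SMILES string is chemically valid. With RDKit installed, `Chem.MolFromSmiles` returns `None` for strings it cannot parse, which could be added as a further check.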
Run run_model.py for hyperparameter optimization:
python scripts/run_model.py \
--file data/gap100/A1R-gap100.csv \
--split random \
--FP ECFP4 \
--model RF
Specify the input data file path, data splitting strategy, molecular feature type, model type, and any other parameters. For a given configuration, keep these values identical across runs so that results are reproducible.
After hyperparameter optimization, run run_result.py for model training and evaluation:
python scripts/run_result.py \
--file data/gap100/A1R-gap100.csv \
--split random \
--FP ECFP4 \
--model RF
The script automatically loads the optimal hyperparameters, splits the data according to the specified strategy, trains the model, and outputs evaluation metrics (accuracy, AUC, etc.).
Note: Use the same --file, --split, --FP, and --model values for run_model.py and run_result.py.
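One way to keep the two steps consistent is to define the configuration once and build both commands from it. The sketch below is an illustrative wrapper, not a script shipped with this repository; `build_command` and `run_pipeline` are hypothetical names, and the flag values are copied from the commands above.

```python
import subprocess
import sys

# One shared configuration, using the values from the commands above.
CONFIG = {
    "--file": "data/gap100/A1R-gap100.csv",
    "--split": "random",
    "--FP": "ECFP4",
    "--model": "RF",
}


def build_command(script, config):
    """Build the argument list for one script from the shared configuration."""
    command = [sys.executable, script]
    for flag, value in config.items():
        command += [flag, value]
    return command


def run_pipeline(config, dry_run=True):
    """Run optimization, then training/evaluation, with identical arguments."""
    commands = [
        build_command("scripts/run_model.py", config),
        build_command("scripts/run_result.py", config),
    ]
    for command in commands:
        if dry_run:
            print(" ".join(command))  # only show what would be executed
        else:
            subprocess.run(command, check=True)  # raise if a step fails
    return commands


if __name__ == "__main__":
    run_pipeline(CONFIG, dry_run=True)
```

With `dry_run=True` the wrapper only prints the two commands. Set `dry_run=False` to execute them in order from the repository root.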
Parameter Details
run_model.py / run_result.py Parameters:
| Parameter | Description | Options | Default |
|---|---|---|---|
| --file | Input data file path (required) | Any CSV file | |
| --split | Data splitting strategy | random, scaffold, cluster | scaffold |
| --FP | Molecular feature type | ECFP4, MACCS, 2d-3d, pubchem | ECFP4 |
| --model | Model type | RF, attentivefp | RF |
| --threads | Number of CPU threads for multiprocessing (RF only) | Integer | 1 |
| --mpl | Enable multiprocessing | true, false | false |
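To make the defaults concrete, the sketch below rebuilds the documented options as an `argparse` parser. It is a hypothetical reconstruction from the parameter list above, not the scripts' actual argument handling, which may differ in types, validation, or help text.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented options; not the scripts' own code."""
    parser = argparse.ArgumentParser(description="Illustrative CLI sketch")
    parser.add_argument("--file", required=True, help="Input CSV file path")
    parser.add_argument("--split", choices=["random", "scaffold", "cluster"], default="scaffold")
    parser.add_argument("--FP", choices=["ECFP4", "MACCS", "2d-3d", "pubchem"], default="ECFP4")
    parser.add_argument("--model", choices=["RF", "attentivefp"], default="RF")
    parser.add_argument("--threads", type=int, default=1)
    parser.add_argument("--mpl", choices=["true", "false"], default="false")
    return parser


# Only --file is supplied, so every other option falls back to its default.
args = build_parser().parse_args(["--file", "data/gap100/A1R-gap100.csv"])
print(args.split, args.FP, args.model, args.threads, args.mpl)
# prints: scaffold ECFP4 RF 1 false
```

Note that the documented default for --split is scaffold, whereas the example commands in this tutorial pass --split random explicitly.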
Molecular Screening
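After model training, screen new compounds with the trained model. The first command below uses ml_screener.py with a saved RF model; the second uses dl_screener.py with a saved AttentiveFP model.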
python scripts/ml_screener.py \
--file new_molecules.csv \
--model model_save/RF/random_RF_ECFP4_bestModel.pkl \
--prop 0.5 \
--out_dir ./results
python scripts/dl_screener.py \
--file new_molecules.csv \
--model model_save/attentivefp/random_cla_attentivefp.pth \
--prop 0.5 \
--out_dir ./results
Note: The scripts automatically preprocess data and handle different feature types.
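This tutorial does not define --prop. If it acts as a cutoff on the predicted probability of the active class (an assumption, not something the source confirms), the selection step would resemble the sketch below. The function `select_hits`, its input format, and the `>=` comparison are all hypothetical.

```python
def select_hits(predictions, threshold=0.5):
    """Keep molecules whose predicted active-class probability reaches the threshold.

    `predictions` is a list of (smiles, probability) pairs; both the input shape
    and the >= comparison are assumptions made for this sketch.
    """
    hits = [(smiles, prob) for smiles, prob in predictions if prob >= threshold]
    # Sort so the highest predicted probability comes first.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Made-up scores for illustration only; these are not model outputs.
scores = [("CCO", 0.12), ("c1ccccc1O", 0.81), ("CCN", 0.50), ("CC(=O)O", 0.49)]
print(select_hits(scores, threshold=0.5))
# prints: [('c1ccccc1O', 0.81), ('CCN', 0.5)]
```

Under this reading, raising the threshold shortens the hit list passed on to docking and lowering it lengthens the list.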
After molecular screening, dock the retained compounds against the metastable intermediate conformations obtained with Su-GaMD in the previous step to select candidate compounds for activity testing.
Logging
All run logs are written to both the console and log files, and contain:
Data processing progress
Model training progress
Hyperparameter optimization progress
Final evaluation progress
Screening predictions and results
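For readers who want the same console-plus-file behavior in their own helper scripts, the standard `logging` module supports it with two handlers. This is a generic sketch, not the repository's logging setup; the function name `get_logger`, the log file name, and the message format are assumptions.

```python
import logging
import sys


def get_logger(name, log_path):
    """Return a logger that writes each record to the console and to log_path."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records from also reaching the root logger
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        fmt = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        for handler in (logging.StreamHandler(sys.stdout), logging.FileHandler(log_path)):
            handler.setFormatter(fmt)
            logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = get_logger("screening", "screening.log")  # hypothetical file name
    log.info("Data processing started")
```

Each call to `log.info` then appears once on the console and once in the log file.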