Machine Learning Pipeline for Aqueous Solubility Prediction
DrugSol is an end-to-end machine learning pipeline for predicting aqueous solubility (logS) of drug-like compounds. Built with Nextflow DSL2, it provides a reproducible, scalable, and production-ready workflow for pharmaceutical research and drug discovery.
The pipeline implements a state-of-the-art ensemble approach combining:
- Gradient Boosting Models: XGBoost, LightGBM, CatBoost
- Graph Neural Networks: Chemprop D-MPNN
- Physics-informed Baseline: Ridge regression with thermodynamic features
Aqueous solubility is a critical physicochemical property in drug development:
- Poor solubility is a frequently cited cause of candidate attrition (an often-quoted estimate is ~40% of drug candidates)
- Directly impacts bioavailability and absorption
- Essential for formulation development
- Solubility data are expected in regulatory submissions (FDA, EMA)
- Multi-source data integration: BigSolDB, ChEMBL, custom datasets
- Automated data curation: Water solvent filtering, temperature range selection, outlier detection
- SMILES standardization: Neutralization, tautomer canonicalization, salt removal
- Dual feature engineering: 1,600+ Mordred descriptors + RDKit physicochemical properties
- ChemBERTa embeddings: Transformer-based molecular representations
- pH-dependent corrections: Henderson-Hasselbalch thermodynamic adjustments
- Nextflow DSL2: Modular, reproducible workflows
- Conda environments: Automatic dependency management
- GPU acceleration: CUDA support for Chemprop and GBM training
- Cross-validation: Stratified K-fold with Optuna hyperparameter tuning
- Ensemble learning: Stacking and blending meta-learners
- Two operational modes: Research (training) and Execution (inference)
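
The pH-dependent correction listed among the features can be illustrated with the textbook Henderson-Hasselbalch relation for a monoprotic compound. This is a minimal sketch, not the pipeline's own code: the function name `log_s_at_ph`, its arguments, and the example pKa are assumptions made for illustration, and polyprotic or zwitterionic compounds are not handled.

```python
import math


def log_s_at_ph(log_s0: float, pka: float, ph: float, kind: str) -> float:
    """Total logS at a given pH for a monoprotic acid or base.

    log_s0 is the intrinsic solubility (logS of the neutral form).
    Hypothetical helper for illustration; not part of DrugSol's API.
    """
    if kind == "acid":
        # Acids ionize (and dissolve more) as pH rises above the pKa.
        exponent = ph - pka
    elif kind == "base":
        # Bases ionize (and dissolve more) as pH falls below the pKa.
        exponent = pka - ph
    else:
        raise ValueError("kind must be 'acid' or 'base'")
    return log_s0 + math.log10(1.0 + 10.0 ** exponent)


# Illustrative values only: an acid with pKa 4.0 and intrinsic logS of -3.0.
corrected = log_s_at_ph(log_s0=-3.0, pka=4.0, ph=7.4, kind="acid")
print(f"logS at pH 7.4: {corrected:.2f}")
```

When pH equals the pKa, half of the compound is ionized and the correction is log10(2), about 0.30 log units.
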
- Nextflow ≥ 22.10.1
- Micromamba or Conda
- Python 3.8+ (managed by Conda)
- CUDA 11.x (optional, for GPU acceleration)

```bash
# 1. Clone the repository
git clone https://github.com/yourusername/drugsol.git
cd drugsol

# 2. Install Nextflow (if not already installed)
curl -s https://get.nextflow.io | bash
sudo mv nextflow /usr/local/bin/

# 3. Install Micromamba (recommended over Conda)
"${SHELL}" <(curl -L micro.mamba.pm/install.sh)

# 4. Verify installation
nextflow -version
micromamba --version
```

The pipeline automatically creates Conda environments on first execution:

```bash
nextflow run main.nf --mode research -profile gpu_small --n_iterations 1
```

Train models with cross-validation on public datasets:

```bash
# Full training pipeline (10 iterations, 5-fold CV)
nextflow run main.nf --mode research -profile gpu_small

# Quick test (1 iteration)
nextflow run main.nf --mode research -profile gpu_small --n_iterations 1

# CPU-only execution
nextflow run main.nf --mode research -profile standard
```

Predict solubility for new molecules:

```bash
# Using trained models from research phase
nextflow run main.nf --mode execution --input molecules.csv -profile standard

# With specific model override
nextflow run main.nf --mode execution --input molecules.csv --model /path/to/model
```

For execution mode, provide a CSV/TSV/Parquet file with SMILES:

```csv
smiles,name
CC(=O)OC1=CC=CC=C1C(=O)O,Aspirin
CN1C=NC2=C1C(=O)N(C(=O)N2C)C,Caffeine
CC(C)CC1=CC=C(C=C1)C(C)C(=O)O,Ibuprofen
```

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                              DrugSol Pipeline                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐                   │
│  │    INGEST    │───▶│    CURATE    │───▶│   PREPARE    │                   │
│  │              │    │              │    │              │                   │
│  │ • BigSolDB   │    │ • Filter H2O │    │ • Mordred    │                   │
│  │ • ChEMBL     │    │ • Temp range │    │ • RDKit      │                   │
│  │ • Custom     │    │ • Outliers   │    │ • ChemBERTa  │                   │
│  │              │    │ • SMILES std │    │ • Folds      │                   │
│  └──────────────┘    └──────────────┘    └──────────────┘                   │
│                                                 │                           │
│                                                 ▼                           │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                             TRAIN (OOF)                              │   │
│  │  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐     │   │
│  │  │ XGBoost │  │ LightGBM│  │ CatBoost│  │ Chemprop│  │ Physics │     │   │
│  │  └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘     │   │
│  │       │            │            │            │            │          │   │
│  │       └────────────┴────────────┴────────────┴────────────┘          │   │
│  │                                 │                                    │   │
│  │                                 ▼                                    │   │
│  │                        ┌─────────────────┐                           │   │
│  │                        │  Meta-Learner   │                           │   │
│  │                        │  (Stack/Blend)  │                           │   │
│  │                        └─────────────────┘                           │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                    │                                        │
│                                    ▼                                        │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐                   │
│  │  PRODUCTION  │───▶│   EVALUATE   │───▶│   PUBLISH    │                   │
│  │              │    │              │    │              │                   │
│  │ • Full train │    │ • Metrics    │    │ • Model card │                   │
│  │ • Ensemble   │    │ • Plots      │    │ • Resources  │                   │
│  └──────────────┘    └──────────────┘    └──────────────┘                   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

| Mode | Purpose | Input | Output |
|---|---|---|---|
| Research | Train and validate models | Public databases | Trained ensemble + metrics |
| Execution | Predict new compounds | SMILES file | Solubility predictions |
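
Before launching execution mode, it can help to sanity-check the input file. The sketch below assumes a CSV with a required `smiles` column and an optional `name` column, as in the example input; the helper `read_smiles_rows` is hypothetical and not part of the pipeline. It covers only CSV (not TSV or Parquet) and does not check that the SMILES strings are chemically valid.

```python
import csv
import io


def read_smiles_rows(handle, smiles_column="smiles"):
    """Read rows from a CSV handle and require a non-empty SMILES column.

    Hypothetical helper for illustration; DrugSol's own reader may differ.
    """
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or smiles_column not in reader.fieldnames:
        raise ValueError(f"missing required column: {smiles_column!r}")
    rows = []
    # Data starts on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        smiles = (row[smiles_column] or "").strip()
        if not smiles:
            raise ValueError(f"empty SMILES on line {line_no}")
        rows.append({"smiles": smiles, "name": (row.get("name") or "").strip()})
    return rows


example = io.StringIO(
    "smiles,name\n"
    "CC(=O)OC1=CC=CC=C1C(=O)O,Aspirin\n"
    "CN1C=NC2=C1C(=O)N(C(=O)N2C)C,Caffeine\n"
)
rows = read_smiles_rows(example)
print(len(rows), "molecules read")
```

For a file on disk, the same function can be called with `open("molecules.csv", newline="")` as the handle.
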

```
drugsol/
├── main.nf                          # Pipeline entrypoint
├── nextflow.config                  # Global configuration
│
├── subworkflows/
│   └── modes/
│       ├── research/
│       │   └── research.nf          # Training workflow
│       └── execution/
│           └── execution.nf         # Inference workflow
│
├── modules/                         # Nextflow process modules
│   ├── fetch_bigsoldb/              # Data ingestion
│   ├── fetch_chembl/
│   ├── filter_water/                # Data curation
│   ├── filter_by_temperature_range/
│   ├── detect_outliers/
│   ├── standardize_smiles/
│   ├── make_features_mordred/       # Feature engineering
│   ├── make_features_rdkit/
│   ├── make_embeddings_chemberta/
│   ├── train_oof_gbm/               # Model training
│   ├── train_oof_chemprop/
│   ├── train_oof_physics/
│   ├── meta_stack_blend/            # Ensemble learning
│   ├── final_report/                # Evaluation
│   └── ...
│
├── bin/                             # Python scripts
│   ├── fetch_bigsoldb.py
│   ├── standardize_smiles.py
│   ├── make_features_mordred.py
│   ├── train_oof_gbm.py
│   ├── train_oof_chemprop.py
│   └── ...
│
├── envs/                            # Conda environments
│   ├── drugsol-data.yml             # Data processing
│   ├── drugsol-train.yml            # Model training
│   └── drugsol-bert.yml             # ChemBERTa
│
├── resources/                       # Reference files
│   ├── smarts_pattern_ionized.txt
│   └── ...
│
└── results/                         # Pipeline outputs
    ├── research/
    │   ├── ingest/
    │   ├── curate/
    │   ├── prepare_data/
    │   ├── training/
    │   ├── final_product/
    │   └── pipeline_info/
    └── execution/
        └── predictions/
```

| Parameter | Default | Description |
|---|---|---|
| `--mode` | `research` | Pipeline mode: `research` or `execution` |
| `--input` | `null` | Input file for execution mode |
| `--outdir` | `results/<mode>` | Output directory |
| `--n_iterations` | `10` | Number of CV iterations |
| `--n_cv_folds` | `5` | Number of CV folds |
| `--random_seed` | `42` | Random seed for reproducibility |
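
To make the interaction of `--n_iterations`, `--n_cv_folds`, and `--random_seed` concrete, here is a rough sketch of repeated, target-stratified fold assignment. It is not the pipeline's fold logic, which this README does not show. Two assumptions are made loudly: each iteration uses `seed + iteration` as its seed, and stratification is done by sorting on logS and dealing each consecutive group of samples across the folds. The function names are hypothetical.

```python
import random


def assign_folds(targets, n_folds=5, seed=42):
    """Return one fold index per sample, stratified on sorted target values.

    Assumption: samples are sorted by target, cut into blocks of n_folds,
    and each block is spread across folds in a seeded random order.
    """
    rng = random.Random(seed)
    order = sorted(range(len(targets)), key=lambda i: targets[i])
    folds = [0] * len(targets)
    for start in range(0, len(order), n_folds):
        block = order[start:start + n_folds]
        labels = list(range(n_folds))
        rng.shuffle(labels)
        # A short final block simply uses the first few shuffled labels.
        for idx, fold in zip(block, labels):
            folds[idx] = fold
    return folds


def repeated_folds(targets, n_iterations=10, n_folds=5, seed=42):
    """One fold assignment per iteration; per-iteration seed is an assumption."""
    return [assign_folds(targets, n_folds, seed + it) for it in range(n_iterations)]


# Toy logS values; real targets would come from the curated dataset.
toy_targets = [-0.1 * i for i in range(50)]
all_folds = repeated_folds(toy_targets)
print(len(all_folds), "iterations,", len(set(all_folds[0])), "folds each")
```

Because every block contributes at most one sample to each fold, every fold sees the full range of solubility values, and rerunning with the same seed reproduces the same splits.
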
| Profile | Use Case | GPU | Memory |
|---|---|---|---|
| `standard` | CPU-only, testing | ❌ | Low |
| `gpu_small` | Consumer GPU (RTX 3070) | ✅ | 6-8 GB |
| `gpu_high` | Workstation (A5000/A6000) | ✅ | 32+ GB |

```bash
# High-performance training
nextflow run main.nf \
    --mode research \
    --n_iterations 20 \
    --n_cv_folds 10 \
    -profile gpu_high

# Skip specific models
nextflow run main.nf \
    --mode research \
    --skip_chemprop true \
    --skip_catboost true \
    -profile standard

# Custom temperature range
nextflow run main.nf \
    --mode research \
    --temp_min_celsius 20 \
    --temp_max_celsius 40 \
    -profile gpu_small
```

| Model | Type | Features | Hyperparameter Tuning |
|---|---|---|---|
| XGBoost | Gradient Boosting | Mordred + ChemBERTa | Optuna (50 trials) |
| LightGBM | Gradient Boosting | Mordred + ChemBERTa | Optuna (50 trials) |
| CatBoost | Gradient Boosting | Mordred + ChemBERTa | Optuna (50 trials) |
| Chemprop | D-MPNN (GNN) | SMILES only | Optuna (20 trials) |
| Physics | Ridge Regression | RDKit + Engineered | GridSearchCV |
The meta-learner combines base model predictions using:
- Stacking: Ridge regression on OOF predictions
- Blending: Weighted average based on validation performance
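
As an illustration of the blending option, the sketch below weights each base model by the inverse of its squared out-of-fold RMSE and normalizes the weights to sum to one. The weighting rule is an assumption chosen for the example; the README only states that weights depend on validation performance. The function names and toy predictions are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def blend_weights(y_true, oof_preds):
    """Map model name -> weight, using inverse squared OOF RMSE (an assumption)."""
    inverse = {
        name: 1.0 / max(rmse(y_true, preds), 1e-12) ** 2
        for name, preds in oof_preds.items()
    }
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def blend(oof_preds, weights):
    """Weighted average of the base-model predictions, sample by sample."""
    n_samples = len(next(iter(oof_preds.values())))
    return [
        sum(weights[name] * preds[i] for name, preds in oof_preds.items())
        for i in range(n_samples)
    ]


# Toy data: model "a" is off by 0.1 everywhere, model "b" by 0.2.
y = [-1.0, -2.0, -3.0, -4.0]
oof = {"a": [v + 0.1 for v in y], "b": [v + 0.2 for v in y]}
weights = blend_weights(y, oof)
blended = blend(oof, weights)
print({name: round(w, 3) for name, w in weights.items()})
```

Stacking differs in that the combination is itself learned, for example by fitting a Ridge regression with the out-of-fold predictions as input features.
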

```
results/research/
├── ingest/
│   ├── bigsoldb.csv
│   └── chembl_solubility.csv
├── curate/
│   ├── filtered_water.parquet
│   ├── filtered_temperature.parquet
│   └── standardized_smiles.parquet
├── prepare_data/
│   ├── iter_1/
│   │   ├── train_features_mordred.parquet
│   │   ├── train_chemberta_embeddings.parquet
│   │   └── folds.parquet
│   └── ...
├── training/
│   ├── iter_1/
│   │   ├── oof_gbm/
│   │   ├── oof_gnn/
│   │   └── oof_physics/
│   └── ...
├── final_product/
│   ├── drugsol_model/
│   │   ├── model_card.json
│   │   ├── xgboost_final.pkl
│   │   ├── lightgbm_final.pkl
│   │   ├── catboost_final.cbm
│   │   ├── chemprop_final/
│   │   └── meta_weights.json
│   └── final_report.html
└── pipeline_info/
    ├── execution_timeline.html
    └── execution_report.html

results/execution/
└── predictions/
    ├── predictions_raw.csv
    └── predictions_physio_pH7.4.csv
```

| Model | RMSE (logS) | R² | MAE |
|---|---|---|---|
| XGBoost | ~0.85 | ~0.82 | ~0.62 |
| LightGBM | ~0.84 | ~0.83 | ~0.61 |
| CatBoost | ~0.86 | ~0.81 | ~0.63 |
| Chemprop | ~0.92 | ~0.78 | ~0.68 |
| Physics | ~1.10 | ~0.70 | ~0.82 |
| Ensemble | ~0.80 | ~0.85 | ~0.58 |
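
For reference, the three metrics in the table follow their standard definitions. The sketch below computes them in plain Python on toy values; the function name `regression_metrics` is hypothetical, and the numbers are unrelated to the benchmark figures above.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R² for two equal-length sequences of logS values."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)            # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


# Toy example only.
metrics = regression_metrics([-1.0, -2.0, -3.0, -4.0], [-1.5, -2.0, -3.0, -3.5])
print({name: round(value, 3) for name, value in metrics.items()})
```

RMSE and MAE are in logS units, so an RMSE of 0.8 corresponds to a typical error of a bit under one order of magnitude in molar solubility.
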
| Stage | Time |
|---|---|
| Ingest + Curate | ~5 min |
| Feature Engineering | ~15 min |
| GBM Training (3 models) | ~30 min |
| Chemprop Training | ~45 min |
| Full Training + Ensemble | ~20 min |
| Total | ~2 hours |

```bash
# Reset environments
rm -rf envs/conda_cache/drugsol-*
rm -rf .nextflow
nextflow run main.nf --mode research -profile gpu_small
```

```bash
# Use smaller batches
nextflow run main.nf \
    --mode research \
    --chemprop_batch_size 16 \
    --gbm_tune_trials 20 \
    -profile gpu_small
```

```bash
# Manually verify environment
micromamba run -p envs/conda_cache/drugsol-train \
    python -c "import torch, xgboost, lightgbm; print('OK')"
```

If you use DrugSol in your research, please cite:

```bibtex
@software{drugsol2024,
  author = {Olivares Rodriguez, Aitor},
  title  = {DrugSol: Machine Learning Pipeline for Aqueous Solubility Prediction},
  year   = {2024},
  url    = {https://github.com/yourusername/drugsol}
}
```

- BigSolDB: Zenodo Record 15094979
- Chemprop: Yang et al. (2019) "Analyzing Learned Molecular Representations for Property Prediction" J. Chem. Inf. Model.
- QED: Bickerton et al. (2012) "Quantifying the chemical beauty of drugs" Nature Chemistry
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please read CONTRIBUTING.md for guidelines.
- Fork the repository
- Create a feature branch (`git checkout -b feature/new-model`)
- Commit changes (`git commit -am 'Add new model'`)
- Push to branch (`git push origin feature/new-model`)
- Open a Pull Request
- Universitat Rovira i Virgili - Academic supervision
- BigSolDB - Primary solubility dataset
- ChEMBL - Secondary data source
- Chemprop - Graph neural network implementation
- Nextflow - Workflow management
