Project: Mathematical Foundations Individual Project
Topic: Theoretical and Empirical Analysis of Importance Sampling in Convex and Deep Learning Settings
This repository investigates the effectiveness of Importance Sampling (IS) as a variance reduction technique in stochastic optimization, with a focus on robustness to data heterogeneity and label noise. The project comprises both theoretical analysis and comprehensive empirical validation across convex optimization and deep neural network training.
- Does importance sampling accelerate convergence in the presence of data heterogeneity?
- How robust is IS-based SGD to label noise compared to uniform sampling?
- Can proxy models provide effective importance scores for deep learning training?
Repository structure:

```
MF-individual-project/
├── README.md                   # This file
├── MF_project_Csongor.pdf      # Full project report
├── Convex-IS-notebook.ipynb    # Interactive notebook for convex experiments
│
├── Convex-Noise/               # Convex optimization experiments
│   ├── README.md               # Convex experiments documentation
│   ├── IS-noise.py             # Main experimental script
│   └── Res-Convex-noisy-*/     # Results directories
│       ├── scaled_{ratio}/     # Data scaling experiments
│       └── flip-{type}/        # Noise injection strategies
│
├── DL-correlation/             # Deep learning correlation analysis
│   ├── README.md               # Correlation experiments documentation
│   └── IS_01_corelations.py    # Multi-model correlation study
│
└── DL-noise/                   # Deep learning noise robustness
    ├── README.md               # DL experiments documentation
    ├── is.py                   # Main training pipeline
    ├── plot.py                 # Visualization utilities
    └── r-b128/                 # Results storage
        └── res.txt             # Experimental logs
```

- Problem: Binary classification with squared hinge loss
- Dataset: Synthetic data with heterogeneous feature norms
- Methods: Uniform SGD vs. IS-SGD with norm-based sampling
- Metrics: Parameter distance, objective value, test error
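The convex setup above can be sketched as follows. This is an illustrative NumPy implementation, not the code in `IS-noise.py`: the data generator, step size, and function names are assumptions. It contrasts the key IS-SGD ingredients named in the bullets: norm-proportional sampling probabilities and the `1/(n p_i)` reweighting that keeps the stochastic gradient unbiased.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification with heterogeneous feature norms
# (illustrative generator: half the examples are scaled up 10x).
n, d = 200, 5
X = rng.normal(size=(n, d))
X[: n // 2] *= 10.0
w_true = rng.normal(size=d)
y = np.sign(X @ w_true)

def sample_grad(w, i):
    """Gradient of the squared hinge loss max(0, 1 - y_i <x_i, w>)^2 at example i."""
    margin = y[i] * (X[i] @ w)
    if margin >= 1.0:
        return np.zeros_like(w)
    return -2.0 * (1.0 - margin) * y[i] * X[i]

# Norm-based importance sampling: p_i proportional to ||x_i||.
norms = np.linalg.norm(X, axis=1)
p = norms / norms.sum()

def is_sgd(steps=2000, lr=1e-3):
    w = np.zeros(d)
    for _ in range(steps):
        i = rng.choice(n, p=p)
        # Reweight by 1/(n * p_i) so the update is an unbiased
        # estimate of the full gradient despite non-uniform sampling.
        w -= lr * sample_grad(w, i) / (n * p[i])
    return w

w_hat = is_sgd()
obj = np.mean(np.maximum(0.0, 1.0 - y * (X @ w_hat)) ** 2)
```

Setting `p` to the uniform distribution recovers plain SGD, which is how the two methods are compared on the metrics listed above.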
- Task: Image classification with label noise
- Architecture: VGG-19 with batch normalization
- Proxy Models: ResNet-20, MobileNetV2, ShuffleNetV2
- Strategies:
- Baseline (random sampling)
- Consensus-high (high loss across proxies)
- Ambiguous (high variance across proxies)
- Noise Levels: 0%, 2%, 5%, 10%, 25% label corruption
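The three selection strategies can be expressed compactly once per-example proxy losses are available. The sketch below uses randomly generated losses and hypothetical function names; in the project, the loss matrix would come from forward passes of the ResNet-20, MobileNetV2, and ShuffleNetV2 proxies.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-example losses from three proxy models
# (rows = proxies, columns = training examples).
proxy_losses = rng.gamma(shape=2.0, scale=1.0, size=(3, 1000))

def consensus_high(losses, k):
    """Indices of the k examples with the highest mean loss across proxies."""
    score = losses.mean(axis=0)
    return np.argsort(score)[-k:]

def ambiguous(losses, k):
    """Indices of the k examples the proxies disagree on most (highest variance)."""
    score = losses.var(axis=0)
    return np.argsort(score)[-k:]

def baseline(n, k):
    """Uniform random selection, matching the random-sampling baseline."""
    return rng.choice(n, size=k, replace=False)

hard = consensus_high(proxy_losses, k=100)
disputed = ambiguous(proxy_losses, k=100)
```

Note that the two non-baseline strategies can pick different examples: a sample may have uniformly high loss (consensus-high) without the proxies disagreeing about it (ambiguous), which is exactly the distinction probed under increasing label noise.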
```
cd Convex-Noise
python IS-noise.py
```

See `Convex-Noise/README.md` for configuration options.
```
cd DL-correlation
python IS_01_corelations.py
```

Analyzes cross-model score correlations on CIFAR-10/100.
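A minimal sketch of the kind of analysis this script performs, assuming per-example scores from two models on the same data; the synthetic score vectors and variable names here are illustrative, not the script's actual output. Spearman rank correlation is a natural choice because importance sampling depends only on the relative ordering of scores, not their scale.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(2)

# Hypothetical per-example loss scores from two models on the same examples:
# model B's scores are a noisy monotone perturbation of model A's.
scores_a = rng.gamma(2.0, 1.0, size=500)
scores_b = scores_a + 0.5 * rng.normal(size=500)

# Rank correlation: how well do the two models agree on which
# examples are "important"?
rho, pval = spearmanr(scores_a, scores_b)
```

A high `rho` would suggest that a cheap proxy model ranks examples similarly to the target model, which is the premise behind proxy-based importance scores.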
```
cd DL-noise
python is.py
python plot.py  # Generate visualizations
```

See `DL-noise/README.md` for hyperparameter settings.
Core Libraries:
- `numpy` - Numerical computing
- `matplotlib`, `seaborn` - Visualization
- `scipy` - Statistical analysis
- `pandas` - Data manipulation
Deep Learning:
- `torch`, `torchvision` - PyTorch framework
- `tqdm` - Progress bars
Pre-trained Models: Loaded via `torch.hub` from `chenyaofo/pytorch-cifar-models`
```
# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install numpy matplotlib seaborn scipy pandas
pip install torch torchvision tqdm
```

All experiments use fixed random seeds for reproducibility. Results in the deep learning experiments are averaged over 3 independent runs.
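The seed-and-average protocol can be sketched as below. The helper names and the stand-in `run_experiment` are hypothetical; the PyTorch experiments would additionally call `torch.manual_seed(seed)`, which is omitted here to keep the sketch dependency-light.

```python
import random
import numpy as np

def set_seed(seed: int) -> None:
    """Fix the stdlib and NumPy random generators for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)

def run_experiment(seed: int) -> float:
    """Stand-in for one training run; returns a hypothetical final metric."""
    set_seed(seed)
    return float(np.random.normal(loc=0.9, scale=0.01))

# Average the metric over 3 independent seeded runs, as in the DL experiments.
results = [run_experiment(seed) for seed in (0, 1, 2)]
mean_result = float(np.mean(results))
```

Re-running `run_experiment(0)` returns the same value every time, which is the property the fixed seeds are meant to guarantee.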
Csongor Horváth - Mathematical Foundations Individual Course Project (2025-2026)
AI assistance (model: Gemini 3) was used in the creation of the experimental code and documentation.
If using this code, please reference:
Horváth Cs. Importance Sampling for Robust Machine Learning. Mathematical Foundations Individual Project, 2026.
This project is for academic purposes.