This directory contains datasets used for the fraud detection and sanctions screening research case study.
Warning
IEEE-CIS Dataset Restrictions
The IEEE-CIS Fraud Detection dataset is licensed for non-commercial research use only.
- Cannot redistribute the dataset
- Cannot use trained models commercially
- Must comply with Kaggle competition rules
IEEE-CIS Fraud Detection
- Source: Kaggle Competition
- License: Non-commercial research use only
- Cannot redistribute dataset or trained models
- Cannot use for commercial model training
- Academic research and education only
- Must accept competition rules
- Location: ieee-fraud/
- Files:
  - train_transaction.csv - Training transaction data (~590K rows)
  - train_identity.csv - Identity information for training set
  - test_transaction.csv - Test transaction data
  - test_identity.csv - Identity information for test set
  - sample_submission.csv - Submission format reference
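After downloading, it can help to confirm that every file listed above is actually in place before opening the notebooks. The following is a minimal sketch, not part of the project: the function name `missing_files` is hypothetical, and the file names are copied from the list above.

```python
from pathlib import Path

# File names copied from the IEEE-CIS file list in this README.
EXPECTED_IEEE_FILES = [
    "train_transaction.csv",
    "train_identity.csv",
    "test_transaction.csv",
    "test_identity.csv",
    "sample_submission.csv",
]


def missing_files(directory, expected=EXPECTED_IEEE_FILES):
    """Return the expected file names that are not present in `directory`."""
    base = Path(directory)
    return [name for name in expected if not (base / name).is_file()]
```

Calling `missing_files("ieee-fraud")` from the directory that contains ieee-fraud/ returns an empty list once all five files have been extracted.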
PaySim
- Source: Kaggle - Synthetic Financial Datasets
- License: Open data
- Location: paysim/
- Description: Synthetic mobile money transactions based on real financial logs
OFAC Sanctions Lists
- Source: U.S. Treasury Department
- License: Public domain (U.S. Government data)
- Location: ofac/ (raw data, not committed)
- Files:
  - SDN List (Specially Designated Nationals)
  - Consolidated Sanctions List
Processed Data
- Location: processed/
- Description: Pre-processed datasets, model artifacts, and metadata included in the repository for convenience
- Files:
  - OFAC Sanctions Data:
    - sanctions_names.csv / sanctions_names.parquet - Processed OFAC sanctions names (~39K entities)
    - sanctions_names_summary.json - Statistics on sanctions data coverage
  - Processing Metadata:
    - exploration_metadata.json - Dataset overview and quality metrics
- Note: These files are derived from public domain OFAC data and are licensed under Apache 2.0 (see project LICENSE file)
Option A: Use Available Processed Data
The processed/ directory contains:
- Sanctions Screening: Processed OFAC sanctions names for fuzzy matching
- Exploration Metadata: Dataset overview and quality metrics
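To illustrate what fuzzy matching against the processed sanctions names might look like, here is a minimal sketch using only the Python standard library. It is not the project's screening implementation: `screen_name`, `normalize`, and the 0.85 threshold are hypothetical choices, and `difflib` stands in for whatever matching library the project actually uses.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and collapse whitespace so trivial differences do not affect the score."""
    return " ".join(name.lower().split())


def screen_name(query, sanctions_names, threshold=0.85):
    """Return (name, score) pairs whose similarity to `query` meets the threshold, best first.

    The threshold of 0.85 is an illustrative assumption, not a tuned value.
    """
    q = normalize(query)
    hits = []
    for candidate in sanctions_names:
        score = SequenceMatcher(None, q, normalize(candidate)).ratio()
        if score >= threshold:
            hits.append((candidate, round(score, 3)))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

The list of names could be read from processed/sanctions_names.csv with `csv.DictReader`. The column that holds the name is not documented here, so inspect the file header before choosing one.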
Option B: Download and Process Raw Datasets
For fraud detection model training, you'll need to:
- Download the IEEE-CIS fraud detection dataset (see instructions below)
- Run the notebooks to generate processed data locally:
  - 01_data_exploration.ipynb - Initial EDA and data quality analysis
  - 02_feature_engineering.ipynb - Feature engineering pipeline
  - 03_model_training.ipynb - Model training with temporal splits
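The idea behind a temporal split is that the validation set contains only transactions that occur after every training transaction, so the model is never evaluated on data older than what it was trained on. The sketch below shows the general technique on plain dictionaries. It is not the code from 03_model_training.ipynb: the function name and the 80/20 fraction are assumptions, and `TransactionDT` is the time-offset column in train_transaction.csv.

```python
def temporal_split(rows, time_key="TransactionDT", train_fraction=0.8):
    """Split rows into (train, validation) by time order, with no shuffling.

    `rows` is a list of dict-like records; the earliest `train_fraction`
    of records go to train and the remainder to validation.
    """
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be between 0 and 1")
    ordered = sorted(rows, key=lambda row: row[time_key])
    cutoff = int(len(ordered) * train_fraction)
    return ordered[:cutoff], ordered[cutoff:]
```

Because the rows are sorted before the cut, every training timestamp is less than or equal to every validation timestamp, which a random split would not guarantee.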
Note: IEEE-CIS processed data cannot be redistributed due to Kaggle competition rules. Users must download and process the raw data themselves.
Set up the Kaggle CLI and API credentials:

    # Install Kaggle CLI
    pip install kaggle

    # Set up Kaggle API credentials
    # 1. Go to https://www.kaggle.com/account
    # 2. Click "Create New API Token"
    # 3. Place the downloaded kaggle.json in ~/.kaggle/
    mkdir -p ~/.kaggle
    mv ~/Downloads/kaggle.json ~/.kaggle/
    chmod 600 ~/.kaggle/kaggle.json

Download the IEEE-CIS dataset:

    cd data_catalog

    # Accept competition rules first at:
    # https://www.kaggle.com/c/ieee-fraud-detection/rules

    # Download and extract
    kaggle competitions download -c ieee-fraud-detection
    unzip ieee-fraud-detection.zip -d ieee-fraud/
    rm ieee-fraud-detection.zip

Download the PaySim dataset:

    cd data_catalog

    # Download and extract
    kaggle datasets download -d ealaxi/paysim1
    unzip paysim1.zip -d paysim/
    rm paysim1.zip

Note: Pre-processed OFAC data is already available in processed/sanctions_names.csv. Only download raw files if you need to regenerate or customize the processing.
Note: OFAC has updated their download system. Automated curl downloads are no longer supported.
Manual Download Steps:
- Visit the OFAC Sanctions List site:
  - SDN List: https://sanctionslist.ofac.treas.gov/Home/SdnList
  - Consolidated List: https://sanctionslist.ofac.treas.gov/Home/ConsolidatedList

- Download CSV format for both lists (recommended for easier processing):
  - SDN List: Download all 4 CSV files:
    - SDN.CSV (primary names)
    - ADD.CSV (addresses)
    - ALT.CSV (alternate/AKA names - critical for fuzzy matching)
    - SDN_COMMENTS.CSV (extended remarks)
  - Consolidated List: Download all 4 CSV files:
    - CONS_PRIM.CSV (primary names)
    - CONS_ADD.CSV (addresses)
    - CONS_ALT.CSV (alternate/AKA names)
    - CONS_COMMENTS.CSV (extended remarks)

- Move downloaded files to the project:

      cd data_catalog
      mkdir -p ofac/sdn ofac/consolidated

      # File names are case-sensitive on Linux. If your downloads are
      # named SDN.CSV, ADD.CSV, etc., adjust the commands below to match.

      # Move SDN list files
      mv ~/Downloads/sdn.csv ofac/sdn/
      mv ~/Downloads/add.csv ofac/sdn/
      mv ~/Downloads/alt.csv ofac/sdn/
      mv ~/Downloads/sdn_comments.csv ofac/sdn/

      # Move Consolidated list files
      mv ~/Downloads/cons_prim.csv ofac/consolidated/
      mv ~/Downloads/cons_add.csv ofac/consolidated/
      mv ~/Downloads/cons_alt.csv ofac/consolidated/
      mv ~/Downloads/cons_comments.csv ofac/consolidated/

Alternative: If you prefer XML format, download the XML versions and adjust filenames accordingly.
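Because the file names in the download list are upper case while the move commands use lower case, a case-insensitive check of the target directories can confirm that all eight files arrived. This is a minimal sketch under that assumption: `locate_files` is a hypothetical helper, not part of the project, and the expected names are copied from the list above.

```python
from pathlib import Path

# File names as listed in the manual download steps.
SDN_FILES = ["SDN.CSV", "ADD.CSV", "ALT.CSV", "SDN_COMMENTS.CSV"]
CONSOLIDATED_FILES = ["CONS_PRIM.CSV", "CONS_ADD.CSV", "CONS_ALT.CSV", "CONS_COMMENTS.CSV"]


def locate_files(directory, expected):
    """Map each expected name to the matching path in `directory`, ignoring case.

    Names with no match (or a missing directory) map to None.
    """
    base = Path(directory)
    present = {}
    if base.is_dir():
        present = {p.name.lower(): p for p in base.iterdir() if p.is_file()}
    return {name: present.get(name.lower()) for name in expected}
```

For example, `locate_files("ofac/sdn", SDN_FILES)` run from data_catalog returns a dictionary in which any file that still needs to be downloaded or moved has the value `None`.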
Caution
IEEE-CIS Dataset Compliance
- Do NOT redistribute IEEE-CIS raw data (violates license)
- Do NOT commit datasets to version control
- Do NOT use trained models for commercial purposes
- Do NOT remove license attributions from shared models
- Only for academic research and education
- Trained models may be shared for research/educational purposes with proper attribution
Raw datasets excluded via .gitignore:
- Large files bloat repository size
- IEEE-CIS has strict redistribution restrictions (cannot be shared)
- PaySim and OFAC raw files are large but publicly available
- Processed derivatives (in processed/) are committed for convenience
Reproducibility
- All raw datasets are publicly available via the download instructions above
- Processed datasets and model artifacts are committed to the repository (processed/ directory)
- Complete feature engineering pipeline documented with metadata
- Temporal data splits with drift analysis included
- Feature registry provides complete model training contract
| Dataset | Status | Size | Committed | License |
|---|---|---|---|---|
| IEEE-CIS raw | Download required | ~500MB | No (license restriction) | Non-commercial only |
| PaySim raw | Download required | ~470MB | No (large file) | Open data |
| OFAC raw | Optional | ~10MB | No (large file) | Public domain |
| Processed Datasets | | | | |
| OFAC processed | Included | ~2.1MB | Yes | Apache 2.0 |
| Exploration metadata | Included | <1KB | Yes | Apache 2.0 |
| IEEE-CIS processed | Generate locally | ~380MB | No (license restriction) | Non-commercial only |