This project builds an ETL pipeline that processes loan application data using PySpark, Docker, and AWS, with Kubernetes for orchestration. The goal is to transform the raw application data into features and summaries that feed credit risk models, which can then be used to assess the likelihood of loan approval.
The dataset is the Financial Risk for Loan Approval Dataset, which contains applicants' financial and personal data: https://www.kaggle.com/datasets/lorenzozoppelletto/financial-risk-for-loan-approval/data
- Source: The data is extracted from a flat file (CSV) stored in AWS S3.
- Ingestion: PySpark is used to load the data into a distributed environment for efficient and large-scale processing.
- Data Cleaning:
  - Handle missing values in key columns such as `CreditScore`, `AnnualIncome`, and `LoanAmount` by imputing values or dropping rows.
  - For categorical variables (`EmploymentStatus`, `EducationLevel`), fill missing values or create an "Unknown" category.
  - Detect and treat outliers in columns like `AnnualIncome`, `CreditScore`, and `LoanAmount`.
  - Convert `ApplicationDate` into a `DateTime` format for time-based analysis.
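The cleaning rules can be sketched in plain Python on a list of row dictionaries. This is only an illustration of the logic, not the pipeline's PySpark code: the helper names (`iqr_bounds`, `clip_outliers`, `clean_rows`), the choice of median imputation, the 1.5 × IQR outlier rule, and the `YYYY-MM-DD` date format are all assumptions.

```python
from datetime import datetime
from statistics import median, quantiles

NUMERIC_COLS = ["CreditScore", "AnnualIncome", "LoanAmount"]
CATEGORICAL_COLS = ["EmploymentStatus", "EducationLevel"]


def iqr_bounds(values):
    """Return (lower, upper) outlier fences using the 1.5 * IQR rule."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return q1 - 1.5 * spread, q3 + 1.5 * spread


def clip_outliers(values):
    """Clip every value to the IQR fences computed from the same values."""
    low, high = iqr_bounds(values)
    return [min(max(v, low), high) for v in values]


def clean_rows(rows):
    """Impute numeric gaps, label missing categories, and parse dates."""
    # Median of the observed (non-missing) values per numeric column.
    medians = {
        col: median(r[col] for r in rows if r.get(col) is not None)
        for col in NUMERIC_COLS
    }
    cleaned = []
    for row in rows:
        out = dict(row)  # leave the input row untouched
        for col in NUMERIC_COLS:
            if out.get(col) is None:
                out[col] = medians[col]
        for col in CATEGORICAL_COLS:
            if not out.get(col):
                out[col] = "Unknown"
        # Assumed format; adjust to match the dates in the CSV.
        out["ApplicationDate"] = datetime.strptime(out["ApplicationDate"], "%Y-%m-%d")
        cleaned.append(out)
    return cleaned
```

In the pipeline itself, the same rules would normally be written as PySpark column expressions so they run on the distributed DataFrame rather than on Python lists.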
- Data Enrichment:
  - Create Age Groups from the `Age` column (e.g., 18-30, 31-50, 51+).
  - Calculate DebtToAssetRatio as `TotalLiabilities / TotalAssets`.
  - Create a CreditUtilizationRisk feature based on the `CreditCardUtilizationRate` (e.g., "Low", "Moderate", "High").
  - Calculate LoanToIncomeRatio as `LoanAmount / AnnualIncome`.
  - Convert `LoanDuration` into years for consistent analysis.
  - Adjust or calculate a new RiskScore using `CreditScore`, `DebtToIncomeRatio`, and `PreviousLoanDefaults`.
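As a rough illustration of the derived features, the sketch below computes them for a single row in plain Python. The function names, the utilization cut-offs (0.3 and 0.7), and the assumption that `LoanDuration` is stored in months are hypothetical choices for the example, not values taken from the project.

```python
def age_group(age):
    """Map an age to one of the example groups: 18-30, 31-50, 51+."""
    if age <= 30:
        return "18-30"
    if age <= 50:
        return "31-50"
    return "51+"


def utilization_risk(rate, moderate_from=0.3, high_from=0.7):
    """Label a utilization rate (a fraction); cut-offs are illustrative."""
    if rate < moderate_from:
        return "Low"
    if rate < high_from:
        return "Moderate"
    return "High"


def safe_ratio(numerator, denominator):
    """Divide, returning None when the denominator is missing or zero."""
    if not denominator:
        return None
    return numerator / denominator


def enrich(row):
    """Return a copy of the row with the derived feature columns added."""
    out = dict(row)
    out["AgeGroup"] = age_group(row["Age"])
    out["DebtToAssetRatio"] = safe_ratio(row["TotalLiabilities"], row["TotalAssets"])
    out["LoanToIncomeRatio"] = safe_ratio(row["LoanAmount"], row["AnnualIncome"])
    out["CreditUtilizationRisk"] = utilization_risk(row["CreditCardUtilizationRate"])
    # Assumes LoanDuration is expressed in months.
    out["LoanDurationYears"] = row["LoanDuration"] / 12
    return out
```

Returning `None` for a zero denominator keeps a single bad row from failing the whole job; whether such rows should instead be dropped or flagged is a design decision left to the validation step.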
- Aggregation and Summarization:
  - Group applicants into Risk Buckets (e.g., "Low Risk", "Moderate Risk", "High Risk") based on their `RiskScore`.
  - Summarize Loan Defaults based on factors like age group, loan amount, and employment status.
  - Calculate aggregate metrics like average `MonthlyDebtPayments` and `LoanAmount` for different demographics.
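The bucketing and per-group averages could look roughly like the following plain-Python sketch. The bucket thresholds (40 and 60) and the helper names `risk_bucket` and `average_by` are assumptions made for illustration; the real cut-offs depend on how `RiskScore` is distributed in the data.

```python
from collections import defaultdict


def risk_bucket(score, moderate_from=40, high_from=60):
    """Assign a risk bucket from a RiskScore; thresholds are illustrative."""
    if score < moderate_from:
        return "Low Risk"
    if score < high_from:
        return "Moderate Risk"
    return "High Risk"


def average_by(rows, key, metric):
    """Average `metric` for each distinct value of `key`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[metric]
        counts[row[key]] += 1
    return {group: totals[group] / counts[group] for group in totals}
```

Applying `average_by` to a 0/1 column such as `PreviousLoanDefaults` gives a default rate per group, which is one way to summarize defaults by age group or employment status.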
- Data Validation:
  - Ensure `DebtToIncomeRatio` and `LoanToIncomeRatio` do not exceed thresholds (e.g., flag when > 50%).
  - Validate that `LoanAmount` and `MonthlyLoanPayment` are logically consistent.
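One possible shape for these checks is a function that returns a list of issues per row, sketched below in plain Python. It assumes the ratios are stored as fractions (0.5 means 50%), that `LoanDuration` is in months, and that "logically consistent" means total repayments are at least the principal; all three are assumptions, and `validate_row` is a hypothetical name.

```python
def validate_row(row, ratio_limit=0.5):
    """Return a list of human-readable issues; an empty list means the row passed."""
    issues = []

    # Flag ratios above the threshold (50% by default).
    for col in ("DebtToIncomeRatio", "LoanToIncomeRatio"):
        if row[col] > ratio_limit:
            issues.append(f"{col} exceeds {ratio_limit:.0%}")

    # Total repayments over the loan term should at least cover the principal.
    total_repaid = row["MonthlyLoanPayment"] * row["LoanDuration"]
    if total_repaid < row["LoanAmount"]:
        issues.append("MonthlyLoanPayment is too low to repay LoanAmount")

    return issues
```

Collecting issues instead of raising lets the pipeline flag suspicious rows and keep processing, which matches the "flag when > 50%" wording above.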
- AWS S3: Store both raw and transformed data in Parquet format for optimized storage and querying.
- Amazon Redshift: Load aggregated data (e.g., summary of loan approvals, risk assessments) for efficient querying.
- Dashboard Integration (optional): Visualize risk assessment and loan approval trends using Amazon QuickSight.
- Docker: Containerize the PySpark ETL pipeline to ensure portability and consistent environments across development, testing, and production.
- Kubernetes: Use Kubernetes for orchestrating and scaling the ETL pipeline, managing the distributed infrastructure for large datasets.
- AWS Step Functions / Kubernetes Cron Jobs: Automate regular ETL jobs for data refresh (e.g., daily or weekly).
- AWS CloudWatch: Implement logging to track the ETL process performance and monitor data quality. Set up alerts for failures or data anomalies.
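On the application side, stage-level logging might be sketched with the standard `logging` module as below. The logger name `loan_etl`, the `log_stage` helper, and the `key=value` message layout are hypothetical; getting the log lines into CloudWatch (for example through the container's log driver or an agent) is assumed to be handled outside this code.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("loan_etl")


@contextmanager
def log_stage(name):
    """Log the start, outcome, and duration of one ETL stage."""
    start = time.perf_counter()
    logger.info("stage=%s status=started", name)
    try:
        yield
    except Exception:
        # Record the traceback, then let the failure propagate to the caller.
        logger.exception("stage=%s status=failed", name)
        raise
    else:
        elapsed = time.perf_counter() - start
        logger.info("stage=%s status=succeeded duration_s=%.3f", name, elapsed)
```

A stage would then be wrapped as `with log_stage("clean"): ...`, after calling `logging.basicConfig(level=logging.INFO)` once at startup. Consistent `key=value` messages make it easier to define metric filters and alarms on `status=failed`.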
- PySpark: For large-scale data processing, transformation, and feature engineering.
- Docker: To containerize the ETL pipeline.
- AWS: For data storage (S3, Redshift), orchestration (EMR, Lambda), and monitoring (CloudWatch).
- Kubernetes: For orchestrating the ETL pipeline and managing the infrastructure.
- Implement real-time risk score calculation using streaming data.
- Expand the ETL pipeline to support multiple loan datasets or integrate external financial data sources for more comprehensive analysis.
- Python 3.8+
- pip
- AWS CLI configured with your credentials
- Install Pipenv if you haven't already:

      pip install pipenv

- Clone the repository and navigate to the project directory:

      git clone https://github.com/your-username/financial-risk-loan-approval-etl.git
      cd financial-risk-loan-approval-etl

- Create a virtual environment and install dependencies:

      pipenv install

- Activate the virtual environment:

      pipenv shell

- Install the development dependencies as well:

      pipenv install --dev

- Configure AWS CLI (if not already done):

      aws configure

- Build the Docker image:

      docker build -t loan-approval-etl .
To run the ETL job using Pipenv:

    pipenv run python src/main.py

To run the ETL job using Docker:

    docker run loan-approval-etl

Setup commands:

    chmod +x setup.sh
    ./setup.sh
    aws configure
    docker build -t loan-approval-etl .