This project involves working with a clinical trial dataset containing information on 500 patients, of whom 350 participated in a trial comparing two insulin treatments: Novodra (injectable) and Auralin (oral).
The dataset includes patient details, treatment records, HbA1c measurements, and adverse reactions.
The main goal of data wrangling here is to:
- Clean and organize raw data
- Handle missing or inconsistent values
- Prepare data for analysis (statistical testing, visualization, reporting)
The patients table contains demographic and baseline details:
- Identifiers (patient_id, name, contact, address)
- Demographics (sex, birthdate, age)
- Measurements (weight, height, BMI)
The treatments table tracks treatment progress and effectiveness:
- Insulin doses (Auralin, Novodra)
- HbA1c levels (start, end, change)
The adverse reactions table logs reported side effects for both treatment groups.
The wrangling process will include:
- Loading Data – Import CSV/Excel files into Pandas
- Exploring Structure – Use .info(), .head(), .describe()
- Cleaning
  - Remove duplicates
  - Standardize column names
  - Handle missing values (imputation or removal)
- Transformations
  - Convert datatypes (e.g., birthdate → datetime, zip_code → string)
  - Calculate derived columns (e.g., age from birthdate, BMI categories)
- Merging Tables – Combine patients, treatments, and adverse reactions for complete analysis
- Validation – Ensure correct ranges (BMI, age ≥ 18, HbA1c values)
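The cleaning and transformation steps above can be sketched on a tiny in-memory table. This is only an illustration: the raw column names, the sample values, the as-of date used for age, and the BMI category cut-offs (the common 18.5 / 25 / 30 thresholds) are all assumptions, not values taken from the trial files.

```python
import pandas as pd

# Made-up rows standing in for the patients file (names and values are assumptions).
raw = pd.DataFrame({
    "Patient ID": [1, 2, 2, 3],
    "Birthdate": ["1976-07-10", "1967-04-03", "1967-04-03", "1980-02-19"],
    "Zip Code": [92390, 2134, 2134, 68467],
    "BMI": [19.1, 36.7, 36.7, None],
})

# Standardize column names: lowercase, underscores instead of spaces.
patients = raw.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))

# Remove exact duplicate rows.
patients = patients.drop_duplicates().reset_index(drop=True)

# Handle missing values: this sketch uses removal; imputation is the alternative.
patients = patients.dropna(subset=["bmi"]).reset_index(drop=True)

# Convert datatypes: birthdate -> datetime, zip_code -> zero-padded string.
patients["birthdate"] = pd.to_datetime(patients["birthdate"])
patients["zip_code"] = patients["zip_code"].astype(str).str.zfill(5)

# Derive age in whole years relative to an assumed as-of date.
as_of = pd.Timestamp("2024-01-01")
bd = patients["birthdate"]
birthday_pending = (bd.dt.month > as_of.month) | (
    (bd.dt.month == as_of.month) & (bd.dt.day > as_of.day)
)
patients["age"] = as_of.year - bd.dt.year - birthday_pending.astype(int)

# Derive a BMI category; bins are left-closed, e.g. [18.5, 25) is "normal".
patients["bmi_category"] = pd.cut(
    patients["bmi"],
    bins=[0, 18.5, 25, 30, float("inf")],
    right=False,
    labels=["underweight", "normal", "overweight", "obese"],
)

print(patients)
```

Storing zip_code as a string keeps leading zeros that an integer column would silently drop.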
This project works with a clinical trial dataset of 500 patients, of whom 350 participated in a study comparing two insulin treatments: Novodra (injectable) and Auralin (oral).
The dataset includes patient demographics, treatment details, HbA1c levels, and reported adverse reactions.
The goal is to clean, transform, and prepare the data for analysis.
Patients table:
- patient_id → Unique patient ID
- assigned_sex → Sex at birth (Male/Female)
- given_name, surname → Patient names
- address, city, state, zip_code, country → Contact details (all US)
- contact → Phone & email
- birthdate → Patient’s date of birth (only patients aged ≥ 18 were included)
- weight, height, bmi → Body stats (Inclusion BMI: 16–38)
Treatments table:
- given_name, surname → Patient identifiers
- auralin → Baseline and final insulin doses (units “u”)
- novodra → Same as above, for Novodra group
- hba1c_start, hba1c_end → HbA1c levels at start and end (%)
- hba1c_change → Change in HbA1c (start − end)
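Because the auralin and novodra columns each pack two doses and a unit into one value, they usually need splitting before analysis. The sketch below assumes a string layout such as "41u - 48u" (baseline, then final) and an empty value for the treatment a patient did not receive; both are assumptions about the file, and the sample rows are invented.

```python
import pandas as pd

# Hypothetical treatments rows; the "41u - 48u" dose format is an assumption.
treatments = pd.DataFrame({
    "given_name": ["ana", "ben"],
    "surname": ["lopez", "king"],
    "auralin": ["41u - 48u", None],
    "novodra": [None, "33u - 29u"],
    "hba1c_start": [7.63, 7.89],
    "hba1c_end": [7.20, 7.55],
})

# Reshape to one row per patient, with the treatment name as a value.
long = (
    treatments.melt(
        id_vars=["given_name", "surname", "hba1c_start", "hba1c_end"],
        value_vars=["auralin", "novodra"],
        var_name="treatment",
        value_name="dose",
    )
    .dropna(subset=["dose"])
    .reset_index(drop=True)
)

# Split the dose string into numeric baseline and final doses, dropping the "u".
doses = long["dose"].str.extract(
    r"(?P<dose_start>\d+)u\s*-\s*(?P<dose_end>\d+)u"
).astype(int)
long = pd.concat([long.drop(columns="dose"), doses], axis=1)

# Recompute the change as start minus end, matching the column definition.
long["hba1c_change"] = (long["hba1c_start"] - long["hba1c_end"]).round(2)

print(long)
```

Recomputing hba1c_change from the start and end columns also gives a way to fill or cross-check values that are missing in the raw file.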
Adverse reactions table:
- given_name, surname → Patient identifiers
- adverse_reaction → Reported side effect
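Since the treatments and adverse reactions tables identify patients only by given_name and surname, combining them with the patients table means joining on those two columns. The following is a minimal sketch with invented rows; the difference in name casing between tables is an assumption, included to show why normalizing the keys first can matter.

```python
import pandas as pd

# Invented sample rows; the casing mismatch between tables is an assumption.
patients = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "given_name": ["Ana", "Ben", "Cara"],
    "surname": ["Lopez", "King", "Diaz"],
})
treatments = pd.DataFrame({
    "given_name": ["ana", "ben"],
    "surname": ["lopez", "king"],
    "hba1c_change": [0.43, 0.34],
})
adverse_reactions = pd.DataFrame({
    "given_name": ["ben"],
    "surname": ["king"],
    "adverse_reaction": ["headache"],
})

keys = ["given_name", "surname"]


def with_name_key(df):
    """Return a copy with trimmed, lowercased name columns for joining."""
    out = df.copy()
    for col in keys:
        out[col] = out[col].str.strip().str.lower()
    return out


# Start from trial participants, attach patient details, then any reactions.
trial = (
    with_name_key(treatments)
    .merge(with_name_key(patients), on=keys, how="left", validate="many_to_one")
    .merge(with_name_key(adverse_reactions), on=keys, how="left")
)

print(trial)
```

Left joins keep every treated patient even when no reaction was reported, leaving adverse_reaction empty for them. Joining on names breaks down if two patients share a name, so carrying patient_id into the merged table is the safer key for later steps.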
- Load Data → Import CSV/Excel files into Pandas
- Explore → Use .info(), .head(), .describe()
- Clean → Remove duplicates, standardize names, handle missing values
- Transform → Convert datatypes, derive new columns (e.g., Age, BMI category)
- Merge → Combine patients, treatments, and adverse reactions
- Validate → Ensure correct ranges (Age ≥ 18, BMI 16–38, valid HbA1c values)
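The validation step can be expressed as one boolean column per rule. In this sketch the age and BMI limits come from the inclusion criteria stated above, while the 3–20% plausibility window for HbA1c is a placeholder assumption, since no valid range is given here; the sample rows are invented.

```python
import pandas as pd

# Invented rows, three of which deliberately break one rule each.
trial = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "age": [47, 17, 56, 33],
    "bmi": [19.1, 24.0, 41.2, 27.5],
    "hba1c_start": [7.63, 7.89, 8.10, 70.0],
})

checks = pd.DataFrame({
    "age_ok": trial["age"] >= 18,
    "bmi_ok": trial["bmi"].between(16, 38),  # inclusive on both ends
    # Assumed plausibility window for HbA1c (%); adjust to the study protocol.
    "hba1c_ok": trial["hba1c_start"].between(3, 20),
})

# A row is valid only if it passes every check.
trial["valid"] = checks.all(axis=1)
violations = trial.loc[~trial["valid"], "patient_id"].tolist()

print(checks.sum())   # number of rows passing each rule
print(violations)     # patient_ids needing review
```

Keeping the checks in a separate frame makes it easy to see which rule each flagged row failed before deciding whether to correct or exclude it.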
- df.info()
- df.describe()
- df.isna().sum()
- df.drop_duplicates()
- df.fillna()
- df.merge()
- df.groupby()
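As a brief illustration of two of these methods, the sketch below counts missing values with isna().sum() and summarizes HbA1c change per treatment with groupby(). The column layout (one treatment label per row) and the numbers are assumptions made for the example.

```python
import pandas as pd

# Invented rows; one HbA1c change is deliberately missing.
df = pd.DataFrame({
    "treatment": ["auralin", "auralin", "novodra", "novodra"],
    "hba1c_change": [0.43, 0.37, 0.34, None],
})

# Count missing values per column.
missing = df.isna().sum()

# Per-treatment summary; count and mean both skip missing values.
summary = df.groupby("treatment")["hba1c_change"].agg(["count", "mean"])

print(missing)
print(summary)
```

Because the group mean ignores missing values, filling them with fillna() beforehand would change the result, so that choice is worth making deliberately.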