Python examples of data wrangling and analysis, including data cleaning, transformation, handling missing values, and exploratory insights.

Awais11227/Data_-Analysis

Data Wrangling

📖 Project Overview

This project works with a clinical trial dataset containing information on 500 patients, of whom 350 participated in a trial comparing two insulin treatments: Novodra (injectable) and Auralin (oral).
The dataset includes patient details, treatment records, HbA1c measurements, and adverse reactions.

The main goal of data wrangling here is to:

  • Clean and organize raw data
  • Handle missing or inconsistent values
  • Prepare data for analysis (statistical testing, visualization, reporting)

📂 Dataset Description

Patients Table (patients)

Contains demographic and baseline details:

  • Identifiers (patient_id, name, contact, address)
  • Demographics (sex, birthdate, age)
  • Measurements (weight, height, BMI)

Treatments Table (treatments, treatment_cut)

Tracks treatment progress and effectiveness:

  • Insulin doses (Auralin, Novodra)
  • HbA1c levels (start, end, change)

Adverse Reactions Table (adverse_reactions)

Logs reported side effects for both treatment groups.


🛠️ Data Wrangling Steps

The wrangling process will include:

  1. Loading Data – Import CSV/Excel files into Pandas
  2. Exploring Structure – Use .info(), .head(), .describe()
  3. Cleaning
    • Remove duplicates
    • Standardize column names
    • Handle missing values (imputation or removal)
  4. Transformations
    • Convert datatypes (e.g., birthdate → datetime, zip_code → string)
    • Calculate derived columns (e.g., age from birthdate, BMI categories)
  5. Merging Tables – Combine patients, treatments, and adverse reactions for complete analysis
  6. Validation – Ensure correct ranges (BMI, age ≥ 18, HbA1c values)
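
The loading, exploring, and cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the in-memory CSV, its column names, the sample rows, and the `load_and_clean` helper are all hypothetical, and median imputation is only one of the options the list mentions.

```python
import io

import pandas as pd

# Hypothetical raw extract standing in for a patients CSV file.
RAW_CSV = """Patient ID,Given Name,Surname,Weight
1,Ana,Lopez,121.7
2,Ben,Ward,118.8
2,Ben,Ward,118.8
3,Cara,Nguyen,
"""


def load_and_clean(source) -> pd.DataFrame:
    """Load a CSV, standardize column names, drop duplicates, impute weight."""
    df = pd.read_csv(source)

    # Standardize column names: lowercase, underscores instead of spaces.
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

    # Remove exact duplicate rows.
    df = df.drop_duplicates().reset_index(drop=True)

    # Impute missing weights with the median (dropping the rows is the alternative).
    df["weight"] = df["weight"].fillna(df["weight"].median())
    return df


patients = load_and_clean(io.StringIO(RAW_CSV))

# Explore the structure of the cleaned table.
patients.info()
print(patients.head())
print(patients.describe())
```

Passing a file path instead of the `io.StringIO` object would load a real CSV file in the same way.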

Data Wrangling - Clinical Trial Dataset

📖 Project Overview

This project works with a clinical trial dataset of 500 patients, of whom 350 participated in a study comparing two insulin treatments: Novodra (injectable) and Auralin (oral).
The dataset includes patient demographics, treatment details, HbA1c levels, and reported adverse reactions.

The goal is to clean, transform, and prepare the data for analysis.


📂 Dataset Structure

🧑 Patients Table (patients)

  • patient_id → Unique patient ID
  • assigned_sex → Sex at birth (Male/Female)
  • given_name, surname → Patient names
  • address, city, state, zip_code, country → Mailing address (all patients are in the US)
  • contact → Phone & email
  • birthdate → Patient’s date of birth (inclusion criterion: age ≥ 18)
  • weight, height, bmi → Body measurements (inclusion criterion: BMI 16–38)
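
As a rough illustration of how some of these columns might be transformed, the sketch below converts `birthdate` to a datetime, restores `zip_code` as a five-character string, and derives an age and a BMI category. The sample rows, the month/day/year date format, the fixed `REFERENCE_DATE`, and the BMI cut-offs (the commonly used 18.5 / 25 / 30 thresholds) are assumptions made for the example, not details taken from the dataset.

```python
import pandas as pd

# Hypothetical sample rows; the real table has many more columns.
patients = pd.DataFrame(
    {
        "patient_id": [1, 2, 3],
        "birthdate": ["7/10/1976", "4/3/1967", "2/19/1980"],
        "zip_code": [92390.0, 2120.0, 68467.0],  # read as numbers, leading zero lost
        "bmi": [17.5, 24.9, 33.1],
    }
)

# birthdate -> datetime (month/day/year format is an assumption).
patients["birthdate"] = pd.to_datetime(patients["birthdate"], format="%m/%d/%Y")

# zip_code -> 5-character string, restoring leading zeros.
# Note: astype(int) would fail on missing values; handle those first.
patients["zip_code"] = patients["zip_code"].astype(int).astype(str).str.zfill(5)

# Age in whole years at a fixed, assumed reference date.
REFERENCE_DATE = pd.Timestamp("2020-06-30")
birth = patients["birthdate"]
birthday_still_ahead = (birth.dt.month > REFERENCE_DATE.month) | (
    (birth.dt.month == REFERENCE_DATE.month) & (birth.dt.day > REFERENCE_DATE.day)
)
patients["age"] = REFERENCE_DATE.year - birth.dt.year - birthday_still_ahead.astype(int)

# BMI category using left-closed bins: [0, 18.5), [18.5, 25), [25, 30), [30, inf).
patients["bmi_category"] = pd.cut(
    patients["bmi"],
    bins=[0, 18.5, 25, 30, float("inf")],
    labels=["underweight", "normal", "overweight", "obese"],
    right=False,
)

print(patients)
```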

💉 Treatments Table (treatments, treatment_cut)

  • given_name, surname → Patient identifiers
  • auralin → Baseline and final insulin doses for the Auralin group (in units, “u”)
  • novodra → Baseline and final insulin doses for the Novodra group (in units, “u”)
  • hba1c_start, hba1c_end → HbA1c levels at start and end (%)
  • hba1c_change → Change in HbA1c (start − end)
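
One way the treatments table could be tidied is sketched below: the two drug columns are melted into a single `treatment` column, the dose string is split into numeric start and end doses, and `hba1c_change` is recomputed as start − end. The dose string format (`"41u - 48u"`, with `"-"` for the drug a patient did not take), the sample rows, and the `dose_start` / `dose_end` column names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample rows; the dose string format is an assumption.
treatments = pd.DataFrame(
    {
        "given_name": ["ana", "ben", "cara"],
        "surname": ["lopez", "ward", "nguyen"],
        "auralin": ["41u - 48u", "-", "33u - 36u"],
        "novodra": ["-", "40u - 45u", "-"],
        "hba1c_start": [7.63, 7.97, 7.65],
        "hba1c_end": [7.20, 7.51, 7.27],
        "hba1c_change": [0.43, None, 0.38],
    }
)

# Reshape so each row holds one patient, one treatment name, one dose string.
tidy = treatments.melt(
    id_vars=["given_name", "surname", "hba1c_start", "hba1c_end", "hba1c_change"],
    value_vars=["auralin", "novodra"],
    var_name="treatment",
    value_name="dose",
)
# Drop the placeholder rows for the drug the patient did not receive.
tidy = tidy[tidy["dose"] != "-"].reset_index(drop=True)

# Split "41u - 48u" into integer start and end doses.
doses = tidy["dose"].str.extract(
    r"(?P<dose_start>\d+)u\s*-\s*(?P<dose_end>\d+)u"
).astype(int)
tidy = pd.concat([tidy.drop(columns="dose"), doses], axis=1)

# Recompute the change (start - end), which also fills missing values.
tidy["hba1c_change"] = (tidy["hba1c_start"] - tidy["hba1c_end"]).round(2)

print(tidy)
```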

⚠️ Adverse Reactions Table (adverse_reactions)

  • given_name, surname → Patient identifiers
  • adverse_reaction → Reported side effect
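
Because the treatments and adverse reactions tables identify patients by name rather than by `patient_id`, joining them to the patients table means matching on `given_name` and `surname`. The sketch below shows one possible approach, assuming the name casing may differ between tables. The sample rows, the `add_name_keys` helper, and the temporary key columns are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows with different name casing in each table.
patients = pd.DataFrame(
    {
        "patient_id": [1, 2, 3],
        "given_name": ["Ana", "Ben", "Cara"],
        "surname": ["Lopez", "Ward", "Nguyen"],
    }
)
adverse_reactions = pd.DataFrame(
    {
        "given_name": ["ben", "cara"],
        "surname": ["ward", "nguyen"],
        "adverse_reaction": ["headache", "nausea"],
    }
)


def add_name_keys(df: pd.DataFrame) -> pd.DataFrame:
    """Add lowercase, whitespace-stripped name columns to join on."""
    out = df.copy()
    out["given_key"] = out["given_name"].str.strip().str.lower()
    out["surname_key"] = out["surname"].str.strip().str.lower()
    return out


# Left join keeps every patient, including those with no reported reaction.
merged = (
    add_name_keys(patients)
    .merge(
        add_name_keys(adverse_reactions)[["given_key", "surname_key", "adverse_reaction"]],
        on=["given_key", "surname_key"],
        how="left",
        indicator=True,  # adds a _merge column showing where each row matched
    )
    .drop(columns=["given_key", "surname_key"])
)

print(merged)
```

Names are not guaranteed to be unique, so it is worth checking for duplicate name pairs before relying on a join like this.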

🛠️ Data Wrangling Steps

  1. Load Data → Import CSV/Excel files into Pandas
  2. Explore → Use .info(), .head(), .describe()
  3. Clean → Remove duplicates, standardize names, handle missing values
  4. Transform → Convert datatypes, derive new columns (e.g., Age, BMI category)
  5. Merge → Combine patients, treatments, and adverse reactions
  6. Validate → Ensure correct ranges (Age ≥ 18, BMI 16–38, valid HbA1c values)
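
The validation step could look something like the sketch below, which flags rows that break the inclusion criteria. The age and BMI limits come from the dataset description above; the HbA1c bounds, the sample rows, and the `find_violations` helper are assumptions chosen for illustration.

```python
import pandas as pd

AGE_MIN = 18               # inclusion criterion from the dataset description
BMI_RANGE = (16, 38)       # inclusion criterion from the dataset description
HBA1C_RANGE = (4.0, 15.0)  # assumed plausibility bounds, not from the source


def find_violations(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows that fail at least one range check."""
    checks = pd.DataFrame(
        {
            "age_ok": df["age"] >= AGE_MIN,
            "bmi_ok": df["bmi"].between(*BMI_RANGE),
            "hba1c_ok": df["hba1c_start"].between(*HBA1C_RANGE)
            & df["hba1c_end"].between(*HBA1C_RANGE),
        }
    )
    # Missing values compare as False, so they are flagged as well.
    return df[~checks.all(axis=1)]


# Hypothetical merged records.
records = pd.DataFrame(
    {
        "patient_id": [1, 2, 3, 4],
        "age": [45, 17, 50, 33],
        "bmi": [24.0, 22.0, 39.5, 30.0],
        "hba1c_start": [7.6, 7.9, 8.1, 76.3],  # 76.3 looks like a misplaced decimal
        "hba1c_end": [7.2, 7.5, 7.7, 7.2],
    }
)

violations = find_violations(records)
print(violations)
```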

🔍 Example Pandas Functions

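  • df.info() → Column dtypes and non-null counts
  • df.describe() → Summary statistics for numeric columns
  • df.isna().sum() → Number of missing values per column
  • df.drop_duplicates() → Remove duplicate rows
  • df.fillna(value) → Replace missing values
  • df.merge(other) → Join two tables on shared columns
  • df.groupby(column) → Group rows for aggregation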
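
A short sketch of a few of these calls used together is shown below: counting missing values, dropping duplicate rows, and comparing the mean HbA1c change per treatment group. The sample values are made up for the example and are not results from the trial.

```python
import pandas as pd

# Hypothetical values, including one duplicate row and one missing measurement.
df = pd.DataFrame(
    {
        "patient_id": [1, 2, 3, 3, 4],
        "treatment": ["auralin", "auralin", "novodra", "novodra", "novodra"],
        "hba1c_change": [0.40, 0.36, 0.44, 0.44, None],
    }
)

# Missing values per column.
missing = df.isna().sum()
print(missing)

# Drop the repeated row for patient 3.
deduped = df.drop_duplicates()

# Count and mean per treatment group; both skip missing values.
summary = deduped.groupby("treatment")["hba1c_change"].agg(["count", "mean"])
print(summary)
```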