DA 5230 – Statistical & Machine Learning
Lecture 3 – Exploratory Data Analytics
Maninda Edirisooriya
manindaw@uom.lk
What is Exploratory Data Analysis (EDA)?
• When an analyst or data scientist is given a dataset, they first perform some initial analysis to start the data analysis process
• This involves very basic data filtering, processing and visualization
• The results of this analysis give the analyst intuition about what further analysis to do
• Depending on the results, the process can be iterative, building on newly discovered patterns and knowledge
• This process is known as Exploratory Data Analysis (EDA)
Why Exploratory Data Analysis?
• To understand the data: to learn its metadata, such as size, types and structure
• To identify patterns in the data: to identify visible trends and relationships
• To detect anomalies and outliers
• To clean the data: to de-duplicate, and to remove or fill missing and inconsistent values
• For feature engineering: to discover combinations of features and create new features that improve model performance
• For visualization and communication
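The goals above can be sketched in a few lines of Pandas. This is only a minimal illustration: the toy DataFrame, its column names and its values are hypothetical, and the 1.5 × IQR rule is just one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 32, 29, 120],
    "income": [42000.0, 55000.0, np.nan, 98000.0, 61000.0, 55000.0, 48000.0, 52000.0],
    "city": ["Colombo", "Kandy", "Galle", "Colombo", "Kandy", "Kandy", "Galle", "Colombo"],
})

# Understand the data: size and column types.
n_rows, n_cols = df.shape
column_types = df.dtypes

# Clean the data: count fully duplicated rows and missing values per column.
n_duplicates = int(df.duplicated().sum())
missing_per_column = df.isna().sum()

# Detect outliers in "age" with the 1.5 * IQR rule (an assumed convention).
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]

print(n_rows, n_cols, n_duplicates)
print(missing_per_column)
print(outliers)
```

Whether a flagged value such as an age of 120 is an error or a genuine observation is a judgment the analyst still has to make.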
When to do Exploratory Data Analysis?
• If the dataset easily fits in memory, Pandas and NumPy (Python libraries) are used
• If the dataset does not fit in memory but fits in secondary storage, SQL can be used
• If the dataset is so large that it cannot be stored on a single machine, big data analytics has to be used
• The scope of this lesson covers only the first scenario
• In the other cases, samples taken from the larger dataset can also be used to some extent
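As a rough sketch of the first scenario and of sampling, the snippet below measures how much memory a DataFrame occupies and draws a small random sample from it. The synthetic data, the column names and the 1% sampling fraction are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a dataset; names and sizes are illustrative only.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "id": np.arange(100_000),
    "value": rng.normal(size=100_000),
})

# Approximate in-memory footprint of the DataFrame, in bytes and megabytes.
memory_bytes = int(df.memory_usage(deep=True).sum())
memory_mb = memory_bytes / 1024**2

# Draw a reproducible 1% random sample of the rows for quick exploration.
sample = df.sample(frac=0.01, random_state=0)

print(f"{memory_mb:.2f} MB in memory, {len(sample)} sampled rows")
```

Patterns found in a sample should be treated as hints; rare values and outliers in the full dataset may not appear in it.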
Structure of Data
• Structured data: tabular data, or data with feature labels
  • E.g.: an RDBMS data table
• Unstructured data: data without feature labels
  • E.g.: image pixel data, video data
• Semi-structured data: has a certain structure but is not tabular
  • E.g.: XML, JSON
• In this subject module we mainly focus on structured data
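Semi-structured data can often be flattened into structured (tabular) form before analysis. The sketch below does this for a small JSON document using `pd.json_normalize`; the records and field names are hypothetical.

```python
import json

import pandas as pd

# Hypothetical semi-structured records: nested, and not every field is present.
raw = """
[
  {"name": "A", "address": {"city": "Colombo", "zip": "00100"}},
  {"name": "B", "address": {"city": "Kandy"}}
]
"""
records = json.loads(raw)

# Flatten nested objects into columns such as "address.city".
# Fields missing from a record become NaN in the resulting table.
df = pd.json_normalize(records)

print(df)
```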
NumPy
• A Python library whose name stands for “Numerical Python”
• Used to store and process numerical data efficiently
• Numerical data is stored in memory-efficient arrays (tensors)
• Has efficient array processing capabilities accelerated with hardware support
• Has a rich API for processing data
• Broadcasting operations remove the need for many explicit loops
• Interoperable with other languages such as C, C++ and Fortran
• Used by most other data-related Python libraries, such as Pandas
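To illustrate how broadcasting replaces loops, the sketch below centers each column of a small array in two ways: with a single broadcast expression and with explicit nested loops. The array values are made up for the example.

```python
import numpy as np

# Small illustrative array: 3 rows (samples) and 3 columns (features).
data = np.array([[1.0, 10.0, 100.0],
                 [2.0, 20.0, 200.0],
                 [3.0, 30.0, 300.0]])

# Column means have shape (3,).
col_means = data.mean(axis=0)

# Broadcasting: the (3,) vector is stretched across every row of the (3, 3) array.
centered = data - col_means

# Equivalent explicit loops, shown only for comparison.
centered_loop = np.empty_like(data)
for i in range(data.shape[0]):
    for j in range(data.shape[1]):
        centered_loop[i, j] = data[i, j] - col_means[j]

print(np.allclose(centered, centered_loop))
```

The broadcast version is shorter and runs in compiled code, which is usually much faster than Python-level loops on large arrays.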
Pandas
• A high-performance Python library for data processing
• Works closely with NumPy arrays (tensors)
• Supports typed, rich, tabular, structured data on top of NumPy using DataFrames
• Rich APIs for:
  • Loading data from sources such as files, and storing it back
  • Transforming data in memory, e.g. sorting, merging, pivoting and aggregation
  • Data selection, slicing, filtering and indexing
  • Data cleaning, e.g. de-duplication, filling missing values and outlier removal
• Integrates well with visualization libraries such as Matplotlib and Seaborn
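A minimal sketch of these API groups follows. The CSV text, table names and column names are hypothetical, and an in-memory `io.StringIO` buffer stands in for a real file; filling the missing amount with the median is one possible choice, not a recommendation.

```python
import io

import pandas as pd

# Hypothetical CSV content; StringIO stands in for a file on disk.
csv_text = """order_id,customer,amount
1,alice,120.0
2,bob,
3,alice,80.0
3,alice,80.0
4,carol,200.0
"""

# Loading: read tabular data into a DataFrame.
orders = pd.read_csv(io.StringIO(csv_text))

# Cleaning: drop the duplicated row and fill the missing amount with the median.
orders = orders.drop_duplicates()
orders["amount"] = orders["amount"].fillna(orders["amount"].median())

# Transforming: merge with a second (hypothetical) table, then aggregate and sort.
customers = pd.DataFrame({
    "customer": ["alice", "bob", "carol"],
    "region": ["west", "east", "west"],
})
merged = orders.merge(customers, on="customer", how="left")
totals = merged.groupby("region")["amount"].sum().sort_values(ascending=False)

# Selection and filtering: keep only orders of at least 100.
large = merged[merged["amount"] >= 100]

# Storing: write the aggregated result back out as CSV text.
buffer = io.StringIO()
totals.to_csv(buffer)

print(totals)
```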
Data Types
[Figure: data types]
Source: https://medium.com/@simranjeetsingh1497/the-ultimate-guide-to-machine-learning-from-eda-to-model-deployment-part-2-e56ac58785f8
Data Types
• Pandas can represent most naturally occurring data types
  • Including the types mentioned earlier
• Therefore, data can be loaded into Pandas DataFrames very easily
• Data is stored in a column-major manner
  • Each column of data is stored as a contiguous block in memory, and values within a column are stored consecutively
• Sometimes categorical data has to be encoded as continuous data
  • E.g.: date-time as time
• Sometimes continuous data has to be encoded as categorical data
  • E.g.: income as income levels
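The two encoding directions can be sketched as follows. The column names, the timestamps and the income bin edges are assumptions chosen for illustration; real bin edges depend on the dataset.

```python
import pandas as pd

# Hypothetical data: a date-time column and a continuous income column.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-05 08:30:00",
        "2024-01-05 14:00:00",
        "2024-01-06 21:15:00",
    ]),
    "income": [25_000, 60_000, 150_000],
})

# Date-time encoded as a continuous value: hour of day as a fraction.
df["hour"] = df["timestamp"].dt.hour + df["timestamp"].dt.minute / 60

# Continuous income encoded as ordered categories (bin edges are assumed).
df["income_level"] = pd.cut(
    df["income"],
    bins=[0, 30_000, 100_000, float("inf")],
    labels=["low", "medium", "high"],
)

print(df.dtypes)
print(df)
```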
One Hour Homework
• Officially, we have one more hour of work after the end of the lectures
• When it comes to ML, self-learning is very important
• Therefore, for this week’s extra hour you have homework
• After today’s Pandas tutorial, identify all the sections you found difficult
• Try to complete them yourself, and refer to the Internet when needed
• Ask ChatGPT, even about difficult questions
• Play with Pandas and EDA using your own code and get familiar with them
• We need you to be comfortable with EDA for learning ML and SL ahead
• Good luck!
Questions?

Lecture 3 - Exploratory Data Analytics (EDA), a lecture in subject module Statistical & Machine Learning
