How to convert Categorical features to Numerical Features in Python?

Converting categorical features to numerical features is an essential step in data preprocessing for machine learning, because most algorithms expect numerical inputs. In Python, the pandas and scikit-learn libraries cover the common cases. Here are some common techniques:

1. Label Encoding

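Label Encoding maps each unique category to an integer. It's suitable for categorical variables with ordinal relationships. Note that both approaches below assign codes in sorted (alphabetical) order of the category values, not by any meaningful rank, and that scikit-learn's LabelEncoder is intended for target labels rather than input features.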

Using Pandas:

    import pandas as pd

    # Example categorical data
    data = {'category': ['red', 'green', 'blue', 'green']}
    df = pd.DataFrame(data)

    # Convert categorical variable to numeric
    df['category_encoded'] = df['category'].astype('category').cat.codes

Using Scikit-Learn:

    from sklearn.preprocessing import LabelEncoder

    # Initialize the encoder
    label_encoder = LabelEncoder()

    # Fit and transform the data
    df['category_encoded'] = label_encoder.fit_transform(df['category'])
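When the categories do have a real order, the codes should follow that order rather than the alphabet. The sketch below is one way to do this with pandas; the `size` column and its small < medium < large ordering are hypothetical values chosen for illustration, not part of the example data above.

```python
import pandas as pd

# Hypothetical ordinal feature (illustrative data, not from the guide)
sizes = pd.DataFrame({'size': ['medium', 'small', 'large', 'small']})

# Default behaviour: categories are sorted alphabetically
# (large=0, medium=1, small=2), which ignores the real order
sizes['size_alphabetical'] = sizes['size'].astype('category').cat.codes

# Declare the order explicitly so the codes follow it
# (small=0, medium=1, large=2)
size_order = ['small', 'medium', 'large']
sizes['size_encoded'] = pd.Categorical(
    sizes['size'], categories=size_order, ordered=True
).codes

print(sizes)
```

Values missing from the declared category list receive the code -1, so it is worth checking for that value after encoding.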

2. One-Hot Encoding

One-Hot Encoding creates binary columns for each category in the original data. It's suitable for nominal categorical data.

Using Pandas:

    # One-Hot Encoding
    one_hot_encoded_data = pd.get_dummies(df, columns=['category'])

Using Scikit-Learn:

    from sklearn.preprocessing import OneHotEncoder

    # Initialize the encoder
    one_hot_encoder = OneHotEncoder()

    # Fit and transform
    one_hot_encoded = one_hot_encoder.fit_transform(df[['category']]).toarray()

    # Convert to dataframe and concatenate with original data
    one_hot_encoded_df = pd.DataFrame(
        one_hot_encoded,
        columns=one_hot_encoder.get_feature_names_out(['category'])
    )
    df = pd.concat([df, one_hot_encoded_df], axis=1)

3. Binary Encoding

Binary Encoding converts categories into binary digits, which can be more efficient than One-Hot Encoding for high cardinality features.

Using Category Encoders:

    # Install the package first: pip install category_encoders
    import category_encoders as ce

    # Initialize the encoder
    binary_encoder = ce.BinaryEncoder(cols=['category'])

    # Fit and transform (pass a DataFrame so the column name is preserved)
    binary_encoded = binary_encoder.fit_transform(df[['category']])

    # Append the binary columns to the original data
    df = pd.concat([df, binary_encoded], axis=1)
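To make the idea concrete, here is a minimal pandas-only sketch of what binary encoding does: assign each category an integer, then spread the bits of that integer across columns. It is an illustration of the technique, not the category_encoders implementation; the library may number the categories differently, so its exact bit patterns can differ from these.

```python
import pandas as pd

colors = pd.DataFrame({'category': ['red', 'green', 'blue', 'green']})

# Step 1: integer code per category, starting at 1
# (alphabetical here: blue=1, green=2, red=3)
codes = colors['category'].astype('category').cat.codes + 1

# Step 2: number of bits needed to represent the largest code
n_bits = int(codes.max()).bit_length()

# Step 3: one column per bit, most significant bit first
for bit in range(n_bits):
    shift = n_bits - 1 - bit
    colors[f'category_{bit}'] = (codes // 2**shift) % 2

print(colors)
```

Three categories fit in two bit columns here, whereas One-Hot Encoding would need three.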

Notes:

  • Label Encoding implies an order among the categories, so use it when such an order exists; otherwise a model may read the integer codes as a ranking.
  • One-Hot Encoding can result in a high number of columns if the categorical variable has many unique values (high cardinality).
  • Binary Encoding is a good compromise between Label and One-Hot encoding, especially for high cardinality features.
  • Always consider the nature of your categorical data and the requirements of the machine learning model when choosing an encoding method.
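The column-count trade-off in these notes can be checked with a quick calculation. The sketch below builds a hypothetical `city` feature with 1,000 unique values (made-up data) and compares the number of columns each method would produce; the binary figure is the number of bits needed for codes 1 through 1,000, which is an estimate of what a binary encoder would emit.

```python
import pandas as pd

# Hypothetical high-cardinality feature with 1,000 unique values
n_categories = 1000
high_card = pd.DataFrame({'city': [f'city_{i}' for i in range(n_categories)]})

# Label encoding always yields a single column
label_cols = 1

# One-hot encoding yields one column per unique value
one_hot_cols = pd.get_dummies(high_card, columns=['city']).shape[1]

# Binary encoding needs only enough bits to represent the largest code
binary_cols = n_categories.bit_length()

print(label_cols, one_hot_cols, binary_cols)
```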

These methods will help you effectively convert categorical features into numerical forms suitable for machine learning models.

