Commit 5816ce7

Create README.md
1 parent 603ea97 commit 5816ce7

1 file changed

+89
-0
lines changed
  • Machine Learning Templates/Feature Engineering Template

@@ -0,0 +1,89 @@
# **FEATURE ENGINEERING**
Feature engineering is the process of transforming raw data into a format that is easier to understand and use.
It involves several processes. Some of them are:
1. **Feature Transformation**
a. *Missing Value Imputation*
b. *Handling Categorical Features*
c. *Outlier Detection*
d. *Scaling*
e. *Column Transformer*
f. *Function Transformer*
g. *Binning and Binarization*
2. **Feature Construction**
3. **Feature Splitting**

## Feature Transformation
Feature transformation is the process of modifying the original features of a dataset in order to improve the performance of ML models.
Some of the techniques used in feature transformation are:
* Missing Value Imputation
* Handling Categorical Features
* Outlier Detection
* Scaling

**Missing Value Imputation:**
Choosing the right imputation method is important.
Missing value imputation can be done by:
* Using the mean (or median) if the data is not too skewed
* Using the median if the data is skewed or contains outliers
* Using mode imputation for categorical data
* Using forward fill or backward fill for time series data
* Using KNN imputation if the relationship between the missing values and the other features is complex
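
The imputation strategies listed above can be sketched roughly as follows. This is a minimal illustration rather than a recommended pipeline: the toy arrays and variable names are made up for the example, and it assumes NumPy, pandas and scikit-learn are installed.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Toy numeric data (hypothetical): the second column has one missing value.
X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [100.0, 40.0]])

# Mean and median imputation, computed per column.
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)
median_filled = SimpleImputer(strategy="median").fit_transform(X)

# Mode imputation for a categorical column.
colors = pd.Series(["red", "blue", None, "red"])
mode_filled = colors.fillna(colors.mode()[0])

# Forward fill and backward fill for ordered (time series) data.
series = pd.Series([1.0, np.nan, np.nan, 4.0])
forward_filled = series.ffill()
backward_filled = series.bfill()

# KNN imputation: the missing value becomes the average of that column
# over the two nearest rows, with distance measured on the observed columns.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)
```

Here the median (30) is less affected by the spread of the observed values than the mean (about 26.7), which is the reason the median is preferred for skewed data.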

**Handling Categorical Features:**
Categorical features can be handled by:
* Label Encoding
* One Hot Encoding
* Ordinal Encoding

In more detail, the choice of encoding depends on the type of categorical feature:
* **Nominal Encoding:** One Hot Encoding, One Hot Encoding (Multiple Categories), Mean Encoding
* **Ordinal Encoding:** Label Encoding, Target Guided Ordinal Encoding
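
As a rough sketch of the difference between nominal and ordinal encoding, the example below one-hot encodes a colour column (no natural order) and ordinal-encodes a size column (with an explicit order). The column values are invented for illustration, and scikit-learn is assumed to be available.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Nominal feature: colours have no natural order, so use one-hot encoding.
colors = np.array([["red"], ["green"], ["blue"], ["green"]])
# The encoder returns a sparse matrix by default; convert it for display.
one_hot = OneHotEncoder().fit_transform(colors).toarray()
# Columns follow the sorted categories: blue, green, red.

# Ordinal feature: sizes have an order, which is passed in explicitly.
sizes = np.array([["small"], ["large"], ["medium"]])
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
ordinal = encoder.fit_transform(sizes)
```

Passing `categories` explicitly matters for ordinal data: without it, the encoder would order the values alphabetically rather than by size.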

**Outlier Detection:**
Outliers can be treated by:
* Trimming
* Capping (Winsorization)
* Treating outliers as missing values
* Discretization

Outliers can be detected by:
* **Visual Method:** Box Plot and Scatter Plot
* **Statistical Method:** Z-Score (Standard Score), IQR (Interquartile Range)
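
A minimal sketch of IQR-based detection, followed by trimming and capping, might look like this. The data values are made up, and the 1.5 multiplier is the conventional choice rather than something prescribed by this template.

```python
import numpy as np

# Toy data (hypothetical) with one obvious outlier.
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 100.0])

# IQR method: flag points beyond 1.5 * IQR from the quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)
outliers = values[is_outlier]

# Trimming: drop the outliers entirely.
trimmed = values[~is_outlier]

# Capping: pull outliers back to the nearest fence instead of dropping them.
capped = np.clip(values, lower, upper)
```

Trimming reduces the number of rows, while capping keeps every row and only limits the extreme values.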

**Scaling:**
Scaling can be done by:
* Standardization (subtract the mean and divide by the standard deviation)
* Normalization
* Min Max Scaling (rescale to a fixed range, usually 0 to 1)
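
The following sketch compares standardization and min-max scaling on a single toy column, assuming scikit-learn's `StandardScaler` and `MinMaxScaler`. The input values are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One feature with three samples (hypothetical values).
X = np.array([[1.0], [2.0], [3.0]])

# Standardization: (x - mean) / std, giving zero mean and unit variance.
standardized = StandardScaler().fit_transform(X)

# Min-max scaling: (x - min) / (max - min), giving values in [0, 1].
min_max = MinMaxScaler().fit_transform(X)
```

In practice the scaler is usually fitted on the training data only and then reused to transform the test data.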

**Column Transformer:**
scikit-learn provides a class called `ColumnTransformer` (in `sklearn.compose`) which can be used to apply different transformations to different columns.
It is used to create and apply separate transformers for numerical and categorical data.
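
```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

# Each entry is a (name, transformer instance, list of columns) tuple.
# 'column_name', 'CN1' and 'CN2' are placeholders for your own data.
transformer = ColumnTransformer(transformers=[
    ('tnf-1', SimpleImputer(), ['column_name']),
    ('tnf-2', OrdinalEncoder(categories=[['CN1', 'CN2']]), ['column_name']),
    # ... more (name, transformer, columns) tuples
])
```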
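
To show the template above applied end to end, here is a small sketch on a made-up DataFrame. The column names `age` and `size`, the transformer names and the category order are all assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: one numerical column with a gap, one ordered category.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "size": ["small", "large", "medium"],
})

transformer = ColumnTransformer(transformers=[
    # Fill the missing age with the column mean.
    ("impute_age", SimpleImputer(strategy="mean"), ["age"]),
    # Map sizes to 0, 1, 2 following the given order.
    ("encode_size",
     OrdinalEncoder(categories=[["small", "medium", "large"]]),
     ["size"]),
])

# Output columns follow the order of the transformers list.
result = transformer.fit_transform(df)
```

Columns that are not listed in any transformer are dropped by default; passing `remainder="passthrough"` keeps them unchanged.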

**Function Transformer:**
A function transformer applies a mathematical function to a feature. It is commonly used to reduce skew: if your data is skewed, you can use it to make the distribution closer to normal.

Some of the methods to do it:
* Log transformation
* Reciprocal transformation
* Square root transformation
* Box-Cox transformation
* Yeo-Johnson transformation

In sklearn, `FunctionTransformer` can be used for the log, reciprocal and square root transformations. `PowerTransformer` can be used for the Box-Cox and Yeo-Johnson transformations. Note that Box-Cox requires strictly positive data, while Yeo-Johnson also accepts zero and negative values.
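
The sketch below applies these transformations to synthetic right-skewed data. The exponential sample is generated only for illustration, and `np.log1p` is used in place of a plain log so that zeros would not cause problems.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer, PowerTransformer

# Synthetic right-skewed, strictly positive data (an assumption for the demo).
rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=(200, 1))

# Log transformation via FunctionTransformer: log(1 + x).
log_tf = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)
X_log = log_tf.fit_transform(X)

# Square root transformation.
X_sqrt = FunctionTransformer(func=np.sqrt).fit_transform(X)

# Yeo-Johnson and Box-Cox via PowerTransformer.
# By default the output is also standardized to zero mean, unit variance.
X_yeo = PowerTransformer(method="yeo-johnson").fit_transform(X)
X_box = PowerTransformer(method="box-cox").fit_transform(X)
```

Unlike the fixed functions, `PowerTransformer` estimates its transformation parameter from the data during `fit`, so it should be fitted on the training set only.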

**Binning and Binarization:**
Some datasets contain numerical features with irregular values. The following techniques convert such numerical data to categorical data:
* Binning or Discretization
* Binarization
* K-means Binning
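
A minimal sketch of binning and binarization with scikit-learn's `KBinsDiscretizer` and `Binarizer` is shown below. The values, the number of bins and the threshold are arbitrary choices for the example.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, KBinsDiscretizer

# One numerical feature (hypothetical values).
X = np.array([[1.0], [5.0], [10.0], [15.0], [20.0], [30.0]])

# Binning: three equal-width bins over the range [1, 30],
# encoded as the integer bin index 0, 1 or 2.
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
binned = binner.fit_transform(X)

# Binarization: values above the threshold become 1, the rest become 0.
binary = Binarizer(threshold=12.0).fit_transform(X)
```

`KBinsDiscretizer` also accepts `strategy="quantile"` for equal-frequency bins and `strategy="kmeans"` for K-means binning, where the bin edges are derived from cluster centres.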
