# **FEATURE ENGINEERING**
This is the process of transforming raw data into a format that is easier to understand and use.
It involves several processes. Some of them are:
1. **Feature Transformation**
    a. *Missing Value Imputation*
    b. *Handling Categorical Features*
    c. *Outlier Detection*
    d. *Scaling*
    e. *Column Transformer*
    f. *Function Transformer*
    g. *Binning and Binarization*
2. **Feature Construction**
3. **Feature Splitting**

## Feature Transformation
Feature Transformation is the process of modifying or changing the original features of a dataset in order to improve the performance of ML models.
Some of the techniques used in feature transformation are:
* Missing Value Imputation
* Handling Categorical Features
* Outlier Detection
* Scaling

**Missing Value Imputation :**
Choosing the right imputation method is always important.
Missing Value Imputation can be done by:
* Using Mean or Median if the data is not too skewed
* Using Median if the data is skewed or contains outliers
* Using Mode Imputation for categorical data
* Using Forward Fill or Backward Fill for time series data
* Using KNN Imputation if the relationship between missing values and other features is complex

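These options map directly onto scikit-learn's imputers. A minimal sketch (the data here is made up for illustration):

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Toy numeric column with one missing value (hypothetical data)
X = np.array([[1.0], [2.0], [np.nan], [4.0]])

# Mean imputation: fills NaN with the column mean
X_mean = SimpleImputer(strategy='mean').fit_transform(X)

# Median imputation: more robust when the data is skewed or has outliers
X_median = SimpleImputer(strategy='median').fit_transform(X)

# Mode (most frequent) imputation, typically for categorical columns
cat = np.array([['red'], ['blue'], [np.nan], ['blue']], dtype=object)
cat_filled = SimpleImputer(strategy='most_frequent').fit_transform(cat)

# KNN imputation: estimates the missing value from the most similar rows
X_knn = KNNImputer(n_neighbors=2).fit_transform(
    np.array([[1.0, 2.0], [3.0, 4.0], [np.nan, 6.0]]))
```

For time series, Forward/Backward Fill is usually done in pandas with `DataFrame.ffill()` and `DataFrame.bfill()` rather than in scikit-learn.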
**Handling Categorical Features :**
Categorical Features can be handled by:
* Label Encoding
* One Hot Encoding
* Ordinal Encoding

In more detail, Categorical Features can be handled by:
* **Nominal Encoding:** One Hot Encoding, One Hot Encoding (Multiple Categories), Mean Encoding
* **Ordinal Encoding:** Label Encoding, Target Guided Ordinal Encoding

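A short sketch of the three basic encoders in scikit-learn (the colour and size values are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

# Nominal feature: colours have no natural order -> One Hot Encoding
colors = np.array([['red'], ['green'], ['blue'], ['green']])
one_hot = OneHotEncoder().fit_transform(colors).toarray()  # one binary column per category

# Ordinal feature: sizes have a natural order -> Ordinal Encoding
sizes = np.array([['small'], ['large'], ['medium']])
oe = OrdinalEncoder(categories=[['small', 'medium', 'large']])
ordinal = oe.fit_transform(sizes)  # small -> 0, medium -> 1, large -> 2

# Label Encoding is meant for the *target* column, not input features
labels = LabelEncoder().fit_transform(['yes', 'no', 'yes'])  # classes sorted: no -> 0, yes -> 1
```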
**Outlier Detection :**
Outliers can be treated by:
* Trimming
* Capping
* Treating outliers as missing values
* Discretization

Outliers can be detected by:
* **Visual Methods:** Box Plot and Scatter Plot
* **Statistical Methods:** Z-Score (Standard Score), IQR (Inter-Quartile Range), Winsorization

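The two statistical methods can be sketched with plain NumPy. The sample values below are made up, and the z-score cutoff of 2 is one common choice (3 is another, stricter one):

```python
import numpy as np

# Hypothetical sample with one obvious outlier
data = np.array([10, 12, 11, 13, 12, 11, 95])

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = data[(data < lower) | (data > upper)]

# Z-score method: flag points more than 2 standard deviations from the mean
z = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z) > 2]
```

Note that the z-score method assumes roughly normal data and is itself distorted by extreme outliers (they inflate the mean and standard deviation), which is why the IQR method is often preferred for skewed data.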
**Scaling :**
Scaling can be done by:
* Standardization
* Normalization
* Min Max Scaling

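The two most common scalers in scikit-learn, on a made-up single-feature column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single-feature data
X = np.array([[1.0], [2.0], [3.0], [4.0]])

# Standardization: subtract the mean, divide by the standard deviation
X_std = StandardScaler().fit_transform(X)

# Min-Max scaling: rescale to the [0, 1] range
X_mm = MinMaxScaler().fit_transform(X)
```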
**Column Transformer :**
scikit-learn provides a class called ColumnTransformer which can be used to apply different transformations to different columns.
It is used to create and apply separate transformers for numerical and categorical data.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

# Each tuple is (name, transformer instance, list of columns);
# 'column_name', 'CN1' and 'CN2' are placeholders
transformer = ColumnTransformer(transformers=[
    ('tnf-1', SimpleImputer(), ['column_name']),
    ('tnf-2', OrdinalEncoder(categories=[['CN1', 'CN2']]), ['column_name']),
    # ... more (name, transformer, columns) tuples
], remainder='passthrough')  # pass the remaining columns through unchanged
```

**Function Transformer :**
It is used to make the data more normally distributed. If your data is skewed, you can use this transformer to reduce the skew.

Some of the methods to do it:
* Log transformation
* Reciprocal transformation
* Square root transformation
* Box-Cox transformation
* Yeo-Johnson transformation

In sklearn, FunctionTransformer can be used for log, reciprocal, and square root transformations; PowerTransformer can be used for Box-Cox and Yeo-Johnson transformations.

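A sketch of both classes on a made-up right-skewed column:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer, PowerTransformer

# Hypothetical right-skewed data (strictly positive, so Box-Cox also applies)
X = np.array([[1.0], [10.0], [100.0], [1000.0]])

# Log transformation via FunctionTransformer (log1p also handles zeros safely)
X_log = FunctionTransformer(np.log1p).fit_transform(X)

# Box-Cox: strictly positive data only; Yeo-Johnson: also works with zeros and negatives
X_bc = PowerTransformer(method='box-cox').fit_transform(X)
X_yj = PowerTransformer(method='yeo-johnson').fit_transform(X)  # the default method
```

PowerTransformer standardizes its output by default (`standardize=True`), so the transformed data also ends up with zero mean and unit variance.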
**Binning and Binarization :**
Some datasets contain irregular numerical values. There are two main techniques for converting such numerical data to categorical data:
* Binning or Discretization (including K-means Binning)
* Binarization
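
Both techniques are available in scikit-learn. A sketch on a made-up age column:

```python
import numpy as np
from sklearn.preprocessing import Binarizer, KBinsDiscretizer

# Hypothetical ages
ages = np.array([[5.0], [18.0], [25.0], [40.0], [70.0]])

# Binning / discretization into 3 equal-width bins;
# strategy='kmeans' would give K-means binning, 'quantile' equal-frequency bins
binner = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
age_bins = binner.fit_transform(ages)

# Binarization: 1 if the value exceeds the threshold, else 0
adult = Binarizer(threshold=30).fit_transform(ages)
```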