3
$\begingroup$

I am building a neural network and am at the point of applying OneHotEncoder to many categorical independent variables. I would like to know whether creating dummy variables this way is the right approach, or whether there is a better way given that all of my variables require dummy variables.

    df
         UserName  Token                          ThreadID  ChildEXE
    0    TAG       TokenElevationTypeDefault (1)  20788     splunk-MonitorNoHandle.exe
    1    TAG       TokenElevationTypeDefault (1)  19088     splunk-optimize.exe
    2    TAG       TokenElevationTypeDefault (1)  2840      net.exe
    807  User      TokenElevationTypeFull (2)     18740     E2CheckFileSync.exe
    808  User      TokenElevationTypeFull (2)     18740     E2check.exe
    809  User      TokenElevationTypeFull (2)     18740     E2check.exe
    811  Local     TokenElevationTypeFull (2)     18740     sc.exe

         ParentEXE            ChildFilePath                ParentFilePath               DependentVariable
    0    splunkd.exe          C:\Program Files\Splunk\bin  C:\Program Files\Splunk\bin  0
    1    splunkd.exe          C:\Program Files\Splunk\bin  C:\Program Files\Splunk\bin  0
    2    dagent.exe           C:\Windows\System32          C:\Program Files\Dagent      0
    807  wscript.exe          \Device\Mup\sysvol           C:\Windows                   1
    808  E2CheckFileSync.exe  C:\Util                      \Device\Mup\sysvol\          1
    809  cmd.exe              C:\Windows\SysWOW64          C:\Util\E2Check              1
    811  cmd.exe              C:\Windows                   C:\Windows\SysWOW64          1

I import the data and use LabelEncoder on the independent variables:

    from sklearn.preprocessing import LabelEncoder, OneHotEncoder

    # IMPORT DATA

    # Matrix X of features
    X = df.iloc[:, 0:7].values
    # Dependent variable
    y = df.iloc[:, 7].values

    # Encoding independent variables
    # Need a label encoder for every categorical variable
    # Converts categorical into number - set correct index of column

    # Encode "UserName"
    labelencoder_X_1 = LabelEncoder()
    X[:, 0] = labelencoder_X_1.fit_transform(X[:, 0])

    # Encode "Token"
    labelencoder_X_2 = LabelEncoder()
    X[:, 1] = labelencoder_X_2.fit_transform(X[:, 1])

    # Encode "ChildEXE"
    labelencoder_X_3 = LabelEncoder()
    X[:, 3] = labelencoder_X_3.fit_transform(X[:, 3])

    # Encode "ParentEXE"
    labelencoder_X_4 = LabelEncoder()
    X[:, 4] = labelencoder_X_4.fit_transform(X[:, 4])

    # Encode "ChildFilePath"
    labelencoder_X_5 = LabelEncoder()
    X[:, 5] = labelencoder_X_5.fit_transform(X[:, 5])

    # Encode "ParentFilePath"
    labelencoder_X_6 = LabelEncoder()
    X[:, 6] = labelencoder_X_6.fit_transform(X[:, 6])

This gives me the following array:

    X
    array([[2, 0, 20788, ..., 46, 31, 24],
           [2, 0, 19088, ..., 46, 31, 24],
           [2, 0, 2840, ..., 27, 42, 15],
           ...,
           [2, 0, 20148, ..., 17, 40, 32],
           [2, 0, 20148, ..., 47, 23, 0],
           [2, 0, 3176, ..., 48, 42, 32]], dtype=object)

Now for all of the independent variables I have to create dummy variables:

Should I use:

    onehotencoder = OneHotEncoder(categorical_features = [0, 1, 2, 3, 4, 5, 6])
    X = onehotencoder.fit_transform(X).toarray()

Which gives me:

    X
    array([[0., 0., 1., ..., 0., 0., 0.],
           [0., 0., 1., ..., 0., 0., 0.],
           [0., 0., 1., ..., 0., 0., 0.],
           ...,
           [0., 0., 1., ..., 1., 0., 0.],
           [0., 0., 1., ..., 0., 0., 0.],
           [0., 0., 1., ..., 1., 0., 0.]])

Or is there a better way to approach this?

$\endgroup$
2
  • 1
    $\begingroup$ I usually prefer pandas.get_dummies() over sklearn's OneHotEncoder. I find it easier to work with since you don't have to fit and then transform the data. $\endgroup$ Commented Aug 9, 2018 at 2:22
  • $\begingroup$ Thank you for the suggestion, I'm going to look into that one! @Djib2011 $\endgroup$ Commented Aug 9, 2018 at 12:33

2 Answers

2
$\begingroup$

Yes, you can use get_dummies(). It does in one step what LabelEncoder and OneHotEncoder do together. In addition, you can drop the first dummy column of each category (drop_first=True) to avoid the dummy variable trap if you intend to build a linear regression.

Example:

1. Create a dataframe:

    import pandas as pd

    df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': [1, 2, 3]})
    df.head()

       A  B  C
    0  a  b  1
    1  b  a  2
    2  a  c  3

2. Apply get_dummies():

    df2 = pd.get_dummies(df, prefix=['A', 'B'], drop_first=True)
    df2.head()

Output:

       C  A_b  B_b  B_c
    0  1    0    1    0
    1  2    1    0    0
    2  3    0    0    1
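Applied to data shaped like the question's, the same idea might look like the sketch below. The miniature dataframe, its shortened values, and the choice of which columns count as categorical are all assumptions made for illustration; only `pd.get_dummies` and its `columns` and `dtype` parameters come from pandas itself.

```python
import pandas as pd

# Hypothetical miniature of the question's data; values are illustrative only.
df = pd.DataFrame({
    "UserName": ["TAG", "TAG", "User", "Local"],
    "Token": ["Default", "Default", "Full", "Full"],
    "ThreadID": [20788, 19088, 18740, 18740],
    "ChildEXE": ["net.exe", "sc.exe", "E2check.exe", "sc.exe"],
    "DependentVariable": [0, 0, 1, 1],
})

# Assumed split: these columns are treated as categorical, ThreadID as numeric.
categorical_cols = ["UserName", "Token", "ChildEXE"]

# Only the listed columns are expanded into dummies; ThreadID passes through unchanged.
X = pd.get_dummies(
    df.drop(columns="DependentVariable"),
    columns=categorical_cols,
    dtype=int,
)
y = df["DependentVariable"]

print(X.columns.tolist())
```

Passing `columns=` replaces the six separate LabelEncoder steps, and no `drop_first` is used here because the dummy variable trap mainly matters for linear models.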
$\endgroup$
2
$\begingroup$

If some of your categorical variables carry a natural order, such as ranks, consider simply label encoding them instead of one-hot encoding. (For example, First, Second, Third, and so on can be encoded as 1, 2, 3, and so on.)
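One way to make that order explicit is sketched below with an ordered pandas categorical. The `ranks` series and the `rank_order` list are hypothetical; in practice the order has to come from your own knowledge of the feature.

```python
import pandas as pd

# Hypothetical ranked feature; the category names and their order are assumptions.
ranks = pd.Series(["First", "Third", "Second", "First"])

# Spell out the intended order instead of relying on alphabetical sorting.
rank_order = ["First", "Second", "Third"]
ordered = pd.Categorical(ranks, categories=rank_order, ordered=True)

# .codes holds 0-based positions in rank_order; add 1 to get 1, 2, 3.
encoded = pd.Series(ordered.codes + 1, index=ranks.index)

print(encoded.tolist())
```

LabelEncoder assigns integers in sorted label order, which only matches the real ranking by coincidence, so stating the order yourself is the safer choice for ordinal features.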

Also, find out whether all of these categories are important. Plot some graphs (such as histograms or distribution plots) to visualize the dataset, and remove categories that don't seem necessary. (For example, if First accounts for more than 80% of the values, consider whether that feature really contributes to your model.)
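Before plotting, a quick numeric check of category shares can flag such columns. This is a minimal sketch on a made-up column; the 0.8 threshold only mirrors the 80% rule of thumb above and is not a fixed rule.

```python
import pandas as pd

# Hypothetical column; values are illustrative only.
col = pd.Series(["First"] * 9 + ["Second"], name="Rank")

# Share of rows per category, sorted from most to least frequent.
shares = col.value_counts(normalize=True)
dominant_share = shares.iloc[0]

# Flag the column when a single category covers more than 80% of the rows.
is_dominated = dominant_share > 0.8

print(shares)
print(is_dominated)
```

The same `shares` series can be passed to `shares.plot(kind="bar")` if matplotlib is installed, which gives the kind of histogram mentioned above.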

$\endgroup$
