I am building a neural network and am at the point of using OneHotEncoder on many independent (categorical) variables. I would like to know whether I am approaching this properly with dummy variables, or whether there is a better way, given that all of my variables require dummy variables.
    df
         UserName  Token                          ThreadID  ChildEXE
    0    TAG       TokenElevationTypeDefault (1)  20788     splunk-MonitorNoHandle.exe
    1    TAG       TokenElevationTypeDefault (1)  19088     splunk-optimize.exe
    2    TAG       TokenElevationTypeDefault (1)  2840      net.exe
    807  User      TokenElevationTypeFull (2)     18740     E2CheckFileSync.exe
    808  User      TokenElevationTypeFull (2)     18740     E2check.exe
    809  User      TokenElevationTypeFull (2)     18740     E2check.exe
    811  Local     TokenElevationTypeFull (2)     18740     sc.exe

    ParentEXE            ChildFilePath                ParentFilePath
    splunkd.exe          C:\Program Files\Splunk\bin  C:\Program Files\Splunk\bin  0
    splunkd.exe          C:\Program Files\Splunk\bin  C:\Program Files\Splunk\bin  0
    dagent.exe           C:\Windows\System32          C:\Program Files\Dagent      0
    wscript.exe          \Device\Mup\sysvol           C:\Windows                   1
    E2CheckFileSync.exe  C:\Util                      \Device\Mup\sysvol\          1
    cmd.exe              C:\Windows\SysWOW64          C:\Util\E2Check              1
    cmd.exe              C:\Windows                   C:\Windows\SysWOW64          1

    DependentVariable
    0
    0
    0
    1
    1
    1
    1

I import the data and use LabelEncoder on the independent variables:
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder

    #IMPORT DATA
    #Matrix x of features
    X = df.iloc[:, 0:7].values
    #Dependent variable
    y = df.iloc[:, 7].values

    #Encoding Independent Variable
    #Need a label encoder for every categorical variable
    #Converts categorical into number - set correct index of column
    #Encode "UserName"
    labelencoder_X_1 = LabelEncoder()
    X[:, 0] = labelencoder_X_1.fit_transform(X[:, 0])
    #Encode "Token"
    labelencoder_X_2 = LabelEncoder()
    X[:, 1] = labelencoder_X_2.fit_transform(X[:, 1])
    #Encode "ChildEXE"
    labelencoder_X_3 = LabelEncoder()
    X[:, 3] = labelencoder_X_3.fit_transform(X[:, 3])
    #Encode "ParentEXE"
    labelencoder_X_4 = LabelEncoder()
    X[:, 4] = labelencoder_X_4.fit_transform(X[:, 4])
    #Encode "ChildFilePath"
    labelencoder_X_5 = LabelEncoder()
    X[:, 5] = labelencoder_X_5.fit_transform(X[:, 5])
    #Encode "ParentFilePath"
    labelencoder_X_6 = LabelEncoder()
    X[:, 6] = labelencoder_X_6.fit_transform(X[:, 6])

This gives me the following array:
    X
    array([[2, 0, 20788, ..., 46, 31, 24],
           [2, 0, 19088, ..., 46, 31, 24],
           [2, 0, 2840, ..., 27, 42, 15],
           ...,
           [2, 0, 20148, ..., 17, 40, 32],
           [2, 0, 20148, ..., 47, 23, 0],
           [2, 0, 3176, ..., 48, 42, 32]], dtype=object)

Now for all of the independent variables I have to create dummy variables:
Should I use:
    onehotencoder = OneHotEncoder(categorical_features = [0, 1, 2, 3, 4, 5, 6])
    X = onehotencoder.fit_transform(X).toarray()

Which gives me:
    X
    array([[0., 0., 1., ..., 0., 0., 0.],
           [0., 0., 1., ..., 0., 0., 0.],
           [0., 0., 1., ..., 0., 0., 0.],
           ...,
           [0., 0., 1., ..., 1., 0., 0.],
           [0., 0., 1., ..., 0., 0., 0.],
           [0., 0., 1., ..., 1., 0., 0.]])

Or is there a better way to approach this?
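For comparison, here is a minimal sketch of one possible alternative, not necessarily the best one. It assumes scikit-learn 0.20 or later, where OneHotEncoder accepts string columns directly (so the separate LabelEncoder pass is not needed) and where the `categorical_features` argument has since been removed in favour of ColumnTransformer. The small frame `df_small` is a hypothetical stand-in built from a few of the sample rows above, and passing ThreadID through unchanged is an assumption made only for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in: a few rows and a subset of columns from the sample above.
df_small = pd.DataFrame({
    "UserName": ["TAG", "TAG", "User", "Local"],
    "Token": [
        "TokenElevationTypeDefault (1)",
        "TokenElevationTypeDefault (1)",
        "TokenElevationTypeFull (2)",
        "TokenElevationTypeFull (2)",
    ],
    "ThreadID": [20788, 19088, 18740, 18740],
    "ChildEXE": [
        "splunk-MonitorNoHandle.exe",
        "splunk-optimize.exe",
        "E2check.exe",
        "sc.exe",
    ],
})

# Columns to one-hot encode; add "ThreadID" here to treat it as categorical too.
categorical_cols = ["UserName", "Token", "ChildEXE"]

encoder = ColumnTransformer(
    transformers=[
        # handle_unknown="ignore" encodes categories unseen during fit as all zeros.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    remainder="passthrough",  # assumption: keep ThreadID as a plain numeric column
    sparse_threshold=0,       # always return a dense array
)

# One-hot columns come first, followed by the passed-through columns.
X_encoded = encoder.fit_transform(df_small)
print(X_encoded.shape)  # (4, 10): 3 + 2 + 4 one-hot columns plus ThreadID
```

With many distinct executables and file paths the one-hot matrix can get very wide, so leaving `sparse_threshold` at its default and working with the sparse output may be preferable on the full data.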