- To understand which factors contributed most to employee turnover.
- To perform clustering to find meaningful patterns in employee traits.
- To build a model that predicts the likelihood that a given employee will leave the company.
- To create or improve retention strategies targeted at at-risk employees.
Implementing this model will allow management to make better-informed retention decisions.
Employee turnover is one of the most common and costly problems in the workplace.
According to the Center for American Progress, replacing a worker earning about $50,000 a year costs the company roughly $10,000, or 20% of that worker's annual salary.
Replacing a high-level employee can cost several multiples of that amount (a rough back-of-the-envelope sketch follows the cost list below).
Costs include:
- Cost of off-boarding
- Cost of hiring (advertising, interviewing, hiring)
- Cost of onboarding a new person (training, management time)
- Lost productivity (a new person may take 1-2 years to reach the productivity of an existing person)
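As a rough illustration of how these figures add up, the sketch below estimates an annual turnover cost from the 20% rule of thumb cited above. The headcount and average salary are hypothetical placeholder values, not figures from this dataset.

```python
# Back-of-the-envelope turnover cost estimate (all inputs are illustrative assumptions)
headcount = 1000              # hypothetical company size
avg_salary = 50_000           # hypothetical average annual salary, USD
annual_turnover_rate = 0.24   # roughly the rate observed later in this dataset
replacement_cost_pct = 0.20   # ~20% of annual salary, per the figure cited above

employees_lost = headcount * annual_turnover_rate
annual_cost = employees_lost * avg_salary * replacement_cost_pct
print(f"Estimated annual turnover cost: ${annual_cost:,.0f}")  # -> $2,400,000 under these assumptions
```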
```python
# Import the necessary modules for data manipulation and visual representation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as matplot
import seaborn as sns
%matplotlib inline

# Load the dataset
df = pd.read_csv('HR_comma_sep.csv.txt')

# Examine the dataset
df.head()
```

| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | sales | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |
```python
# Check to see if there are any missing values in the dataset
df.isnull().any()
```

```
satisfaction_level       False
last_evaluation          False
number_project           False
average_montly_hours     False
time_spend_company       False
Work_accident            False
left                     False
promotion_last_5years    False
sales                    False
salary                   False
dtype: bool
```

```python
# Rename certain columns for better readability
df = df.rename(columns={
    'satisfaction_level': 'satisfaction',
    'last_evaluation': 'evaluation',
    'number_project': 'projectCount',
    'average_montly_hours': 'averageMonthlyHours',
    'time_spend_company': 'yearsAtCompany',
    'Work_accident': 'workAccident',
    'promotion_last_5years': 'promotion',
    'sales': 'department',
    'left': 'turnover'
})
df.head(3)
```

| | satisfaction | evaluation | projectCount | averageMonthlyHours | yearsAtCompany | workAccident | turnover | promotion | department | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
```python
# Check the type of our features. Are there any data inconsistencies?
df.dtypes
```

```
satisfaction           float64
evaluation             float64
projectCount             int64
averageMonthlyHours      int64
yearsAtCompany           int64
workAccident             int64
turnover                 int64
promotion                int64
department              object
salary                  object
dtype: object
```

```python
# How many employees are in the dataset?
df.shape
```

```
(14999, 10)
```

```python
# Calculate the turnover rate of the company's dataset. What's the rate of turnover?
turnover_rate = df.turnover.value_counts() / 14999
turnover_rate
```

```
0    0.761917
1    0.238083
Name: turnover, dtype: float64
```

```python
# Display a statistical overview of the employees
df.describe()
```

| | satisfaction | evaluation | projectCount | averageMonthlyHours | yearsAtCompany | workAccident | turnover | promotion |
|---|---|---|---|---|---|---|---|---|
| count | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 |
| mean | 0.612834 | 0.716102 | 3.803054 | 201.050337 | 3.498233 | 0.144610 | 0.238083 | 0.021268 |
| std | 0.248631 | 0.171169 | 1.232592 | 49.943099 | 1.460136 | 0.351719 | 0.425924 | 0.144281 |
| min | 0.090000 | 0.360000 | 2.000000 | 96.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.440000 | 0.560000 | 3.000000 | 156.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.640000 | 0.720000 | 4.000000 | 200.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 0.820000 | 0.870000 | 5.000000 | 245.000000 | 4.000000 | 0.000000 | 0.000000 | 0.000000 |
| max | 1.000000 | 1.000000 | 7.000000 | 310.000000 | 10.000000 | 1.000000 | 1.000000 | 1.000000 |
```python
# Display the mean summary of employees (turnover vs. non-turnover). What do you notice between the groups?
turnover_Summary = df.groupby('turnover')
turnover_Summary.mean()
```

| turnover | satisfaction | evaluation | projectCount | averageMonthlyHours | yearsAtCompany | workAccident | promotion |
|---|---|---|---|---|---|---|---|
| 0 | 0.666810 | 0.715473 | 3.786664 | 199.060203 | 3.380032 | 0.175009 | 0.026251 |
| 1 | 0.440098 | 0.718113 | 3.855503 | 207.419210 | 3.876505 | 0.047326 | 0.005321 |
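The gap in mean satisfaction (0.67 for employees who stayed vs. 0.44 for those who left) stands out. As an optional follow-up that is not part of the original notebook, a simple two-sample test can confirm the difference is not due to chance; this sketch assumes scipy is available.

```python
# Optional follow-up (not in the original notebook): test whether mean satisfaction
# differs significantly between employees who left and those who stayed.
from scipy import stats

stayers = df[df.turnover == 0]['satisfaction']
leavers = df[df.turnover == 1]['satisfaction']

# Welch's t-test, which does not assume equal variances between the groups
t_stat, p_value = stats.ttest_ind(leavers, stayers, equal_var=False)
print("t-statistic: %.2f, p-value: %.4f" % (t_stat, p_value))
```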
```python
# Create a correlation matrix. What features correlate the most with turnover? What other correlations did you find?
corr = df.corr()
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)
plt.title('Heatmap of Correlation Matrix')
corr
```

| | satisfaction | evaluation | projectCount | averageMonthlyHours | yearsAtCompany | workAccident | turnover | promotion |
|---|---|---|---|---|---|---|---|---|
| satisfaction | 1.000000 | 0.105021 | -0.142970 | -0.020048 | -0.100866 | 0.058697 | -0.388375 | 0.025605 |
| evaluation | 0.105021 | 1.000000 | 0.349333 | 0.339742 | 0.131591 | -0.007104 | 0.006567 | -0.008684 |
| projectCount | -0.142970 | 0.349333 | 1.000000 | 0.417211 | 0.196786 | -0.004741 | 0.023787 | -0.006064 |
| averageMonthlyHours | -0.020048 | 0.339742 | 0.417211 | 1.000000 | 0.127755 | -0.010143 | 0.071287 | -0.003544 |
| yearsAtCompany | -0.100866 | 0.131591 | 0.196786 | 0.127755 | 1.000000 | 0.002120 | 0.144822 | 0.067433 |
| workAccident | 0.058697 | -0.007104 | -0.004741 | -0.010143 | 0.002120 | 1.000000 | -0.154622 | 0.039245 |
| turnover | -0.388375 | 0.006567 | 0.023787 | 0.071287 | 0.144822 | -0.154622 | 1.000000 | -0.061788 |
| promotion | 0.025605 | -0.008684 | -0.006064 | -0.003544 | 0.067433 | 0.039245 | -0.061788 | 1.000000 |
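To answer the question in the cell above directly, the single column of correlations against turnover can be pulled out and sorted. This is a small convenience step that is not in the original notebook.

```python
# Rank features by their correlation with turnover
corr['turnover'].drop('turnover').sort_values()
# satisfaction shows the strongest (negative) correlation with turnover,
# while yearsAtCompany and averageMonthlyHours are mildly positive.
```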
```python
# Plot the distributions of Employee Satisfaction, Evaluation, and Average Monthly Hours. What story can you tell?
# Set up the matplotlib figure
f, axes = plt.subplots(ncols=3, figsize=(15, 6))

# Graph Employee Satisfaction
sns.distplot(df.satisfaction, kde=False, color="g", ax=axes[0]).set_title('Employee Satisfaction Distribution')
axes[0].set_ylabel('Employee Count')

# Graph Employee Evaluation
sns.distplot(df.evaluation, kde=False, color="r", ax=axes[1]).set_title('Employee Evaluation Distribution')
axes[1].set_ylabel('Employee Count')

# Graph Employee Average Monthly Hours
sns.distplot(df.averageMonthlyHours, kde=False, color="b", ax=axes[2]).set_title('Employee Average Monthly Hours Distribution')
axes[2].set_ylabel('Employee Count')
```

Apply `get_dummies()` to the categorical variables. Separate the categorical and numeric variables, then combine them.
```python
cat_var = ['department', 'salary', 'turnover', 'promotion']
num_var = ['satisfaction', 'evaluation', 'projectCount', 'averageMonthlyHours', 'yearsAtCompany', 'workAccident']

categorical_df = pd.get_dummies(df[cat_var], drop_first=True)
numerical_df = df[num_var]

new_df = pd.concat([categorical_df, numerical_df], axis=1)
new_df.head()
```

| | turnover | promotion | department_RandD | department_accounting | department_hr | department_management | department_marketing | department_product_mng | department_sales | department_support | department_technical | salary_low | salary_medium | satisfaction | evaluation | projectCount | averageMonthlyHours | yearsAtCompany | workAccident |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0.11 | 0.88 | 7 | 272 | 4 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0.72 | 0.87 | 5 | 223 | 5 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0.37 | 0.52 | 2 | 159 | 3 | 0 |
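Note that `drop_first=True` drops one level per categorical variable, so the dummies are not perfectly collinear, which matters for linear models such as logistic regression. A quick way to see which levels were kept is shown below; this check is not in the original notebook.

```python
# Optional check (not in the original notebook): list the dummy columns that were created.
# With drop_first=True one level per variable is dropped and becomes the implicit baseline;
# judging by the columns above, 'IT' and 'high' are the baseline department and salary levels.
print(sorted(c for c in new_df.columns if c.startswith('department_')))
print(sorted(c for c in new_df.columns if c.startswith('salary_')))
```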
```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, precision_score, recall_score, confusion_matrix, precision_recall_curve

# Create the X and y sets
X = new_df.iloc[:, 1:]
y = new_df.iloc[:, 0]

# Define the train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=123, stratify=y)
```

```python
%%time
# Check the accuracy of the Logistic Regression model
from sklearn.linear_model import LogisticRegression

# Define the Logistic Regression model
lr = LogisticRegression(class_weight='balanced')

# Fit the Logistic Regression model to the train set
lr.fit(X_train, y_train)

print("Logistic accuracy is %2.2f" % accuracy_score(y_test, lr.predict(X_test)))
```

```
Logistic accuracy is 0.77
Wall time: 110 ms
```

```python
%%time
from sklearn import model_selection

# Define the 10-fold cross validation
kfold = model_selection.KFold(n_splits=10, random_state=7)

# Define the Logistic Regression model
lrCV = LogisticRegression()

# Define the evaluation metric
scoring = 'roc_auc'

# Train the Logistic Regression model with 10-fold cross validation
lr_results = model_selection.cross_val_score(lrCV, X_train, y_train, cv=kfold, scoring=scoring)
```

```
Wall time: 628 ms
```

```python
# Print out the 10 scores from the training. Notice how you get a range of scores compared to a single train/test split.
lr_results
```

```
array([0.79845385, 0.8371952 , 0.82284329, 0.8179427 , 0.80693377,
       0.83157279, 0.82354362, 0.82073686, 0.80722612, 0.83976854])
```

Let's use AUC as a general baseline to compare our models' performance. After comparing, we can select the best one and look at its precision and recall.
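One way to keep that comparison consistent is to wrap the cross-validation call in a small helper and run every candidate model through it. This is a generic sketch, not code from the original notebook; it reuses the `kfold`, `scoring`, and `model_selection` objects defined above.

```python
# Hypothetical helper (not in the original notebook): run each candidate model through
# the same 10-fold cross validation and report its mean ROC AUC for an apples-to-apples comparison.
def compare_models(models, X, y, cv, scoring='roc_auc'):
    for name, model in models.items():
        scores = model_selection.cross_val_score(model, X, y, cv=cv, scoring=scoring)
        print("%-25s AUC: %.3f (+/- %.3f)" % (name, scores.mean(), scores.std()))

# Example usage (RandomForestClassifier is imported further below):
# compare_models({'Logistic Regression': LogisticRegression(),
#                 'Random Forest': RandomForestClassifier()},
#                X_train, y_train, cv=kfold)
```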
```python
# Print out the mean and standard deviation of the training scores
lr_auc = lr_results.mean()
print("The Logistic Regression AUC: %.3f and the STD is (%.3f)" % (lr_auc, lr_results.std()))
```

```
The Logistic Regression AUC: 0.821 and the STD is (0.013)
```

```python
from sklearn.metrics import roc_auc_score

print("\n\n ---Logistic Regression Model---")
lr_auc = roc_auc_score(y_test, lr.predict(X_test))
print("Logistic Regression AUC = %2.2f" % lr_auc)
print(classification_report(y_test, lr.predict(X_test)))
```

```
 ---Logistic Regression Model---
Logistic Regression AUC = 0.78

             precision    recall  f1-score   support

          0       0.92      0.76      0.83      1714
          1       0.50      0.80      0.62       536

avg / total       0.82      0.77      0.78      2250
```

Notice how the random forest classifier takes longer to run on the dataset. That is one downside of the algorithm: it is more computationally expensive. However, it performs better than simpler models such as Logistic Regression.
```python
%%time
from sklearn.ensemble import RandomForestClassifier

# Define the Random Forest model
rf = RandomForestClassifier(class_weight="balanced")

# Fit the Random Forest model
rf = rf.fit(X_train, y_train)
```

```
Wall time: 321 ms
```

```python
%%time
rf_results = model_selection.cross_val_score(rf, X_train, y_train, cv=kfold, scoring=scoring)
rf_results
```

```
Wall time: 1.6 s
```

```python
# Print out the mean and standard deviation of the training scores
rf_auc = rf_results.mean()
print("The Random Forest AUC: %.3f and the STD is (%.3f)" % (rf_auc, rf_results.std()))
```

```
The Random Forest AUC: 0.988 and the STD is (0.004)
```

```python
from sklearn.metrics import roc_auc_score

print("\n\n ---Random Forest Model---")
rf_roc_auc = roc_auc_score(y_test, rf.predict(X_test))
print("Random Forest AUC = %2.2f" % rf_roc_auc)
print(classification_report(y_test, rf.predict(X_test)))
```

```
 ---Random Forest Model---
Random Forest AUC = 0.99

             precision    recall  f1-score   support

          0       0.99      1.00      0.99      1714
          1       0.99      0.98      0.98       536

avg / total       0.99      0.99      0.99      2250
```

```python
%%time
from sklearn.svm import SVC

# Define and fit the Support Vector Classifier (probability=True enables predict_proba)
svclassifier = SVC(kernel='rbf', probability=True)
svc = svclassifier.fit(X_train, y_train)
```

```
Wall time: 26.7 s
```

```python
%%time
svc_result = model_selection.cross_val_score(svc, X_train, y_train, cv=kfold, scoring=scoring)
svc_result
```

```
Wall time: 46.2 s
```

```python
# Print out the mean and standard deviation of the training scores
# (the original cell mistakenly printed the Random Forest numbers here; it should use the SVC results)
svc_auc = svc_result.mean()
print("The Support Vector Classifier AUC: %.3f and the STD is (%.3f)" % (svc_auc, svc_result.std()))
```

```python
from sklearn.metrics import roc_auc_score

# (the original cell reused the Random Forest predictions here; the SVC should be evaluated instead)
print("\n\n ---Support Vector Model---")
svc_roc_auc = roc_auc_score(y_test, svc.predict(X_test))
print("Support Vector Classifier AUC = %2.2f" % svc_roc_auc)
print(classification_report(y_test, svc.predict(X_test)))
```

```python
# Create the ROC graph
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_test, lr.predict_proba(X_test)[:, 1])
rf_fpr, rf_tpr, rf_thresholds = roc_curve(y_test, rf.predict_proba(X_test)[:, 1])
svc_fpr, svc_tpr, svc_thresholds = roc_curve(y_test, svc.predict_proba(X_test)[:, 1])

plt.figure()

# Plot the Logistic Regression ROC
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % lr_auc)

# Plot the Random Forest ROC
plt.plot(rf_fpr, rf_tpr, label='Random Forest (area = %0.2f)' % rf_auc)

# Plot the Support Vector Classifier ROC
plt.plot(svc_fpr, svc_tpr, label='Support Vector Classifier (area = %0.2f)' % svc_auc)

# Plot the Base Rate ROC (random-guess diagonal)
plt.plot([0, 1], [0, 1], 'k--', label='Base Rate')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Graph')
plt.legend(loc="lower right")
plt.show()
```

```python
# Get the feature importances from the Random Forest model
feature_importances = pd.DataFrame(rf.feature_importances_,
                                   index=X_train.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)
feature_importances = feature_importances.reset_index()
feature_importances
```

| | index | importance |
|---|---|---|
| 0 | satisfaction | 0.279718 |
| 1 | yearsAtCompany | 0.240698 |
| 2 | averageMonthlyHours | 0.178100 |
| 3 | evaluation | 0.129985 |
| 4 | projectCount | 0.119583 |
| 5 | workAccident | 0.013300 |
| 6 | salary_low | 0.011167 |
| 7 | department_technical | 0.005552 |
| 8 | department_sales | 0.004075 |
| 9 | salary_medium | 0.003387 |
| 10 | department_support | 0.003291 |
| 11 | promotion | 0.002225 |
| 12 | department_hr | 0.002103 |
| 13 | department_management | 0.001688 |
| 14 | department_accounting | 0.001502 |
| 15 | department_RandD | 0.001363 |
| 16 | department_marketing | 0.001333 |
| 17 | department_product_mng | 0.000928 |
```python
sns.set(style="whitegrid")

# Initialize the matplotlib figure
f, ax = plt.subplots(figsize=(13, 7))

# Plot the feature importances
sns.set_color_codes("pastel")
sns.barplot(x="importance", y="index", data=feature_importances,
            label="Total", color="b")
```

Since this model is being used to make decisions about people, we should refrain from relying solely on its predicted class. Instead, we can use its probability output and design our own system to treat each employee accordingly (a small bucketing sketch follows the probability output below):
- Safe Zone (Green) – Employees within this zone are considered safe.
- Low Risk Zone (Yellow) – Employees within this zone should be monitored for potential turnover. This is more of a long-term track.
- Medium Risk Zone (Orange) – Employees within this zone are at risk of turnover. Action should be taken and monitored accordingly.
- High Risk Zone (Red) – Employees within this zone are considered to have the highest chance of turnover. Action should be taken immediately.
```python
rf.predict_proba(X_test)[175:200, ]
```

```
array([[1. , 0. ],
       [0. , 1. ],
       [1. , 0. ],
       [0. , 1. ],
       [0.8, 0.2],
       [0. , 1. ],
       [1. , 0. ],
       [0. , 1. ],
       [1. , 0. ],
       [1. , 0. ],
       [1. , 0. ],
       [1. , 0. ],
       [1. , 0. ],
       [1. , 0. ],
       [1. , 0. ],
       [1. , 0. ],
       [1. , 0. ],
       [0.9, 0.1],
       [1. , 0. ],
       [0.4, 0.6],
       [1. , 0. ],
       [1. , 0. ],
       [0. , 1. ],
       [1. , 0. ],
       [0. , 1. ]])
```
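One way to operationalize the zones above is to bucket each employee's predicted turnover probability into a risk band. The thresholds below are illustrative assumptions, not values from the original analysis.

```python
# Illustrative sketch: bucket each employee's predicted turnover probability into a risk zone.
# The cut-off values below are assumptions for illustration and would need to be tuned with HR.
def risk_zone(prob):
    if prob < 0.25:
        return 'Safe Zone (Green)'
    elif prob < 0.50:
        return 'Low Risk Zone (Yellow)'
    elif prob < 0.75:
        return 'Medium Risk Zone (Orange)'
    return 'High Risk Zone (Red)'

turnover_prob = rf.predict_proba(X_test)[:, 1]
risk_df = pd.DataFrame({'turnover_probability': turnover_prob,
                        'risk_zone': [risk_zone(p) for p in turnover_prob]},
                       index=X_test.index)
risk_df['risk_zone'].value_counts()
```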



