
vishalinvincible/HRActivity


Understanding and Predicting Employee Turnover

HR Analytics


Objective:

  • To understand what factors contributed most to employee turnover.

  • To perform clustering to find any meaningful patterns of employee traits.

  • To create a model that predicts the likelihood that a given employee will leave the company.

  • To create or improve different retention strategies on targeted employees.

Implementing this model will help management make better-informed decisions.

The Problem:

One of the most common problems at work is turnover.

According to the Center for American Progress, replacing a worker who earns about 50,000 dollars costs the company about 10,000 dollars, or 20% of that worker’s yearly income.

Replacing a high-level employee can cost a multiple of that.

Costs include:

  • Cost of off-boarding
  • Cost of hiring (advertising, interviewing, hiring)
  • Cost of onboarding a new person (training, management time)
  • Lost productivity (a new person may take 1-2 years to reach the productivity of an existing person)
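
The 20% figure quoted above can be turned into a rough back-of-the-envelope estimate. The sketch below is illustrative only: the function name `estimated_replacement_cost` and its default rate are assumptions made for this example, and the rate for senior roles would be higher, as noted above.

```python
def estimated_replacement_cost(annual_salary, cost_rate=0.20):
    """Rough replacement cost as a fraction of annual salary.

    cost_rate=0.20 mirrors the 20% figure quoted above; it is an
    assumed default, not a universal constant.
    """
    return annual_salary * cost_rate


# A worker earning 50,000 dollars costs about 10,000 dollars to replace.
cost = estimated_replacement_cost(50_000)
print(cost)
```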

Import Packages


# Import the necessary modules for data manipulation and visual representation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as matplot
import seaborn as sns
%matplotlib inline

Read the Data


df = pd.read_csv('HR_comma_sep.csv.txt')

# Examine the dataset
df.head()
   satisfaction_level  last_evaluation  number_project  average_montly_hours  time_spend_company  Work_accident  left  promotion_last_5years  sales  salary
0                0.38             0.53               2                   157                   3              0     1                      0  sales     low
1                0.80             0.86               5                   262                   6              0     1                      0  sales  medium
2                0.11             0.88               7                   272                   4              0     1                      0  sales  medium
3                0.72             0.87               5                   223                   5              0     1                      0  sales     low
4                0.37             0.52               2                   159                   3              0     1                      0  sales     low

Data Quality Check


# Check whether there are any missing values in the data set
df.isnull().any()
satisfaction_level       False
last_evaluation          False
number_project           False
average_montly_hours     False
time_spend_company       False
Work_accident            False
left                     False
promotion_last_5years    False
sales                    False
salary                   False
dtype: bool
# Rename Columns
# Renaming certain columns for better readability
df = df.rename(columns={'satisfaction_level': 'satisfaction',
                        'last_evaluation': 'evaluation',
                        'number_project': 'projectCount',
                        'average_montly_hours': 'averageMonthlyHours',
                        'time_spend_company': 'yearsAtCompany',
                        'Work_accident': 'workAccident',
                        'promotion_last_5years': 'promotion',
                        'sales': 'department',
                        'left': 'turnover'
                        })
df.head(3)
   satisfaction  evaluation  projectCount  averageMonthlyHours  yearsAtCompany  workAccident  turnover  promotion  department  salary
0          0.38        0.53             2                  157               3             0         1          0       sales     low
1          0.80        0.86             5                  262               6             0         1          0       sales  medium
2          0.11        0.88             7                  272               4             0         1          0       sales  medium
# Check the type of our features. Are there any data inconsistencies?
df.dtypes
satisfaction           float64
evaluation             float64
projectCount             int64
averageMonthlyHours      int64
yearsAtCompany           int64
workAccident             int64
turnover                 int64
promotion                int64
department              object
salary                  object
dtype: object

Exploratory Data Analysis


# How many employees are in the dataset?
df.shape
(14999, 10) 
# Calculate the turnover rate of our company's dataset. What's the rate of turnover?
# Divide by the actual row count rather than a hard-coded 14999
turnover_rate = df.turnover.value_counts() / len(df)
turnover_rate
0    0.761917
1    0.238083
Name: turnover, dtype: float64
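
Because roughly 76% of employees stayed, the classes are imbalanced, and that matters when reading accuracy scores later on. As a minimal sketch (the variable names are illustrative, and the two rates are copied from the output above), a model that always predicts "stays" would already be right at the majority-class rate:

```python
# Class proportions copied from the turnover_rate output above.
turnover_rate = {0: 0.761917, 1: 0.238083}

# A trivial classifier that always predicts the majority class
# is correct exactly as often as that class occurs.
majority_class = max(turnover_rate, key=turnover_rate.get)
baseline_accuracy = turnover_rate[majority_class]

print(majority_class, round(baseline_accuracy, 3))
```

Any accuracy figure is better compared against this baseline than against 50%.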
# Display the statistical overview of the employees
df.describe()
       satisfaction    evaluation  projectCount  averageMonthlyHours  yearsAtCompany  workAccident      turnover     promotion
count  14999.000000  14999.000000  14999.000000         14999.000000    14999.000000  14999.000000  14999.000000  14999.000000
mean       0.612834      0.716102      3.803054           201.050337        3.498233      0.144610      0.238083      0.021268
std        0.248631      0.171169      1.232592            49.943099        1.460136      0.351719      0.425924      0.144281
min        0.090000      0.360000      2.000000            96.000000        2.000000      0.000000      0.000000      0.000000
25%        0.440000      0.560000      3.000000           156.000000        3.000000      0.000000      0.000000      0.000000
50%        0.640000      0.720000      4.000000           200.000000        3.000000      0.000000      0.000000      0.000000
75%        0.820000      0.870000      5.000000           245.000000        4.000000      0.000000      0.000000      0.000000
max        1.000000      1.000000      7.000000           310.000000       10.000000      1.000000      1.000000      1.000000
# Display the mean summary of Employees (Turnover vs. Non-turnover). What do you notice between the groups?
turnover_Summary = df.groupby('turnover')
# numeric_only=True skips the text columns (department, salary)
turnover_Summary.mean(numeric_only=True)
          satisfaction  evaluation  projectCount  averageMonthlyHours  yearsAtCompany  workAccident  promotion
turnover
0             0.666810    0.715473      3.786664           199.060203        3.380032      0.175009   0.026251
1             0.440098    0.718113      3.855503           207.419210        3.876505      0.047326   0.005321
# Create a correlation matrix. What features correlate the most with turnover? What other correlations did you find?
# Only the numeric columns are used; department and salary are text columns
corr = df.select_dtypes(include='number').corr()
sns.heatmap(corr, xticklabels=corr.columns.values, yticklabels=corr.columns.values)
plt.title('Heatmap of Correlation Matrix')
corr
                     satisfaction  evaluation  projectCount  averageMonthlyHours  yearsAtCompany  workAccident   turnover  promotion
satisfaction             1.000000    0.105021     -0.142970            -0.020048       -0.100866      0.058697  -0.388375   0.025605
evaluation               0.105021    1.000000      0.349333             0.339742        0.131591     -0.007104   0.006567  -0.008684
projectCount            -0.142970    0.349333      1.000000             0.417211        0.196786     -0.004741   0.023787  -0.006064
averageMonthlyHours     -0.020048    0.339742      0.417211             1.000000        0.127755     -0.010143   0.071287  -0.003544
yearsAtCompany          -0.100866    0.131591      0.196786             0.127755        1.000000      0.002120   0.144822   0.067433
workAccident             0.058697   -0.007104     -0.004741            -0.010143        0.002120      1.000000  -0.154622   0.039245
turnover                -0.388375    0.006567      0.023787             0.071287        0.144822     -0.154622   1.000000  -0.061788
promotion                0.025605   -0.008684     -0.006064            -0.003544        0.067433      0.039245  -0.061788   1.000000
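
By default, `DataFrame.corr()` computes the Pearson correlation coefficient for each pair of columns. The sketch below spells out that formula in plain Python on a tiny made-up sample; the helper name `pearson` and the sample values are illustrative and are not taken from the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up example: satisfaction scores and a 0/1 turnover flag.
satisfaction = [0.9, 0.8, 0.4, 0.2]
turnover = [0, 0, 1, 1]
r = pearson(satisfaction, turnover)
print(round(r, 3))  # negative: lower satisfaction goes with turnover
```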

[Figure: Heatmap of Correlation Matrix]

# Plot the distribution of Employee Satisfaction, Evaluation, and Average Monthly Hours. What story can you tell?
# Set up the matplotlib figure
f, axes = plt.subplots(ncols=3, figsize=(15, 6))

# Graph Employee Satisfaction
sns.distplot(df.satisfaction, kde=False, color="g", ax=axes[0]).set_title('Employee Satisfaction Distribution')
axes[0].set_ylabel('Employee Count')

# Graph Employee Evaluation
sns.distplot(df.evaluation, kde=False, color="r", ax=axes[1]).set_title('Employee Evaluation Distribution')
axes[1].set_ylabel('Employee Count')

# Graph Employee Average Monthly Hours
sns.distplot(df.averageMonthlyHours, kde=False, color="b", ax=axes[2]).set_title('Employee Average Monthly Hours Distribution')
axes[2].set_ylabel('Employee Count')

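[Figure: distributions of Employee Satisfaction, Employee Evaluation, and Employee Average Monthly Hours]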

Pre-processing


Apply get_dummies() to the categorical variables. Separate the categorical and numeric variables, then combine them.

cat_var = ['department', 'salary', 'turnover', 'promotion']
num_var = ['satisfaction', 'evaluation', 'projectCount', 'averageMonthlyHours', 'yearsAtCompany', 'workAccident']
categorical_df = pd.get_dummies(df[cat_var], drop_first=True)
numerical_df = df[num_var]
new_df = pd.concat([categorical_df, numerical_df], axis=1)
new_df.head()
turnover promotion department_RandD department_accounting department_hr department_management department_marketing department_product_mng department_sales department_support department_technical salary_low salary_medium satisfaction evaluation projectCount averageMonthlyHours yearsAtCompany workAccident
0 1 0 0 0 0 0 0 0 1 0 0 1 0 0.38 0.53 2 157 3 0
1 1 0 0 0 0 0 0 0 1 0 0 0 1 0.80 0.86 5 262 6 0
2 1 0 0 0 0 0 0 0 1 0 0 0 1 0.11 0.88 7 272 4 0
3 1 0 0 0 0 0 0 0 1 0 0 1 0 0.72 0.87 5 223 5 0
4 1 0 0 0 0 0 0 0 1 0 0 1 0 0.37 0.52 2 159 3 0
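
The table above has `salary_low` and `salary_medium` but no `salary_high` because `drop_first=True` drops the first category of each encoded column, with categories taken in sorted order. A minimal plain-Python sketch of that behaviour, using a hypothetical helper `one_hot_drop_first` (not part of pandas), could look like this:

```python
def one_hot_drop_first(values, prefix):
    """One-hot encode a list of labels, omitting the first sorted category."""
    categories = sorted(set(values))
    kept = categories[1:]  # the dropped category is implied by all zeros
    return [
        {f"{prefix}_{cat}": int(value == cat) for cat in kept}
        for value in values
    ]


rows = one_hot_drop_first(["low", "medium", "high", "low"], prefix="salary")
for row in rows:
    print(row)
```

A row of all zeros stands for the dropped category (here `high`), which avoids a redundant column that the other dummy columns already determine.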

Split Train/Test Set


from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, precision_score, recall_score, confusion_matrix, precision_recall_curve

# Create the X and y set
X = new_df.iloc[:, 1:]
y = new_df.iloc[:, 0]

# Define train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=123, stratify=y)
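
`stratify=y` asks `train_test_split` to keep the class proportions roughly equal in the train and test sets. The idea can be sketched in plain Python by splitting each class separately; the helper `stratified_split` below is a simplified illustration on toy labels, not scikit-learn's implementation.

```python
import random


def stratified_split(labels, test_size, seed=0):
    """Return (train_idx, test_idx), sampling test_size of each class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with 25% positives, split 80/20.
labels = [0] * 75 + [1] * 25
train_idx, test_idx = stratified_split(labels, test_size=0.2)
test_positive_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_positive_share)
```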

Train Logistic Regression Model


%%time
# Check accuracy of the Logistic Regression model
from sklearn.linear_model import LogisticRegression

# Define the Logistic Regression Model
lr = LogisticRegression(class_weight='balanced')

# Fit the Logistic Regression Model to the train set
lr.fit(X_train, y_train)
print("Logistic accuracy is %2.2f" % accuracy_score(y_test, lr.predict(X_test)))
Logistic accuracy is 0.77
Wall time: 110 ms
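
`class_weight='balanced'` reweights each class inversely to its frequency; scikit-learn documents the formula as `n_samples / (n_classes * np.bincount(y))`. A minimal sketch of that formula in plain Python, using a hypothetical helper `balanced_class_weights` and toy labels:

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights per class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels: three employees stayed (0), one left (1).
weights = balanced_class_weights([0, 0, 0, 1])
print(weights)
```

The minority class receives the larger weight, so errors on employees who left count for more during fitting.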

Apply 10-Fold Cross Validation for Logistic Regression

%%time
from sklearn import model_selection

# Define the 10-Fold Cross Validation
# (folds are not shuffled, so random_state is omitted; recent scikit-learn
# versions reject random_state unless shuffle=True)
kfold = model_selection.KFold(n_splits=10)

# Define the Logistic Regression Model
lrCV = LogisticRegression()

# Define the evaluation metric
scoring = 'roc_auc'

# Train the Logistic Regression Model on the 10-Fold Cross Validation
lr_results = model_selection.cross_val_score(lrCV, X_train, y_train, cv=kfold, scoring=scoring)
Wall time: 628 ms
# Print out the 10 scores from cross-validation. Notice how the scores vary from fold to fold, which a single train/test evaluation would hide
lr_results
array([0.79845385, 0.8371952 , 0.82284329, 0.8179427 , 0.80693377,
       0.83157279, 0.82354362, 0.82073686, 0.80722612, 0.83976854])
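
Each of the ten scores comes from training on nine folds and scoring on the held-out tenth. The sketch below shows how unshuffled k-fold index splitting can work in plain Python; `k_fold_indices` is an illustrative helper written for this example, not scikit-learn's code.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds take one additional sample.
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=23, n_splits=10))
print([len(val) for _, val in folds])
```

Every sample lands in exactly one validation fold, so each row is used for scoring once and for training nine times.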

Average Score

Let's use AUC as a common baseline for comparing our models' performance. After comparing, we can select the best model and look at its precision and recall.

# Print out the mean and standard deviation of the cross-validation scores
lr_auc = lr_results.mean()
print("The Logistic Regression AUC: %.3f and the STD is (%.3f)" % (lr_auc, lr_results.std()))
The Logistic Regression AUC: 0.821 and the STD is (0.013)
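
ROC AUC can be read as the probability that a randomly chosen employee who left receives a higher score than a randomly chosen employee who stayed, with ties counting as half. A minimal sketch of that pairwise definition, with an illustrative helper `pairwise_auc` and made-up scores:

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

The double loop is quadratic in the number of samples, so it is only meant to illustrate the definition on small inputs.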

Logistic Regression AUC (0.78)

from sklearn.metrics import roc_auc_score

print("\n\n ---Logistic Regression Model---")
# Note: this AUC is computed from hard 0/1 predictions, not predicted probabilities
lr_auc = roc_auc_score(y_test, lr.predict(X_test))
print("Logistic Regression AUC = %2.2f" % lr_auc)
print(classification_report(y_test, lr.predict(X_test)))
 ---Logistic Regression Model---
Logistic Regression AUC = 0.78
             precision    recall  f1-score   support

          0       0.92      0.76      0.83      1714
          1       0.50      0.80      0.62       536

avg / total       0.82      0.77      0.78      2250
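
The report above can be cross-checked by hand. When AUC is computed from hard 0/1 predictions, as in the cell above, the ROC curve has a single interior point and the area reduces to the average of the two class recalls. The sketch below uses the rounded figures from the report; the variable names are illustrative.

```python
# Rounded values copied from the classification report above.
recall_stayed = 0.76   # recall for class 0 (true negative rate)
recall_left = 0.80     # recall for class 1 (true positive rate)
precision_left = 0.50  # precision for class 1

# AUC from hard labels: one ROC point, so area = mean of the recalls.
hard_label_auc = (recall_stayed + recall_left) / 2

# F1 is the harmonic mean of precision and recall.
f1_left = 2 * precision_left * recall_left / (precision_left + recall_left)

print(round(hard_label_auc, 2), round(f1_left, 2))
```

Both values agree with the reported 0.78 AUC and 0.62 f1-score for class 1. Passing `lr.predict_proba(X_test)[:, 1]` to `roc_auc_score` instead would score the full ranking rather than a single threshold.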

Train Random Forest Classifier Model


Notice that the random forest classifier takes longer to run on the dataset than logistic regression. That is one downside of the algorithm: it requires more computation. In return, it performs better than simpler models such as logistic regression.

%%time
from sklearn.ensemble import RandomForestClassifier

# Random Forest Model
rf = RandomForestClassifier(class_weight="balanced")

# Fit the RF Model
rf = rf.fit(X_train, y_train)
Wall time: 321 ms

Apply 10-Fold Cross Validation for Random Forest

%%time
rf_results = model_selection.cross_val_score(rf, X_train, y_train, cv=kfold, scoring=scoring)
rf_results
Wall time: 1.6 s

Average Score

# Print out the mean and standard deviation of the cross-validation scores
rf_auc = rf_results.mean()
print("The Random Forest AUC: %.3f and the STD is (%.3f)" % (rf_auc, rf_results.std()))
The Random Forest AUC: 0.988 and the STD is (0.004)

Random Forest AUC (0.99)

from sklearn.metrics import roc_auc_score

print("\n\n ---Random Forest Model---")
rf_roc_auc = roc_auc_score(y_test, rf.predict(X_test))
print("Random Forest AUC = %2.2f" % rf_roc_auc)
print(classification_report(y_test, rf.predict(X_test)))
 ---Random Forest Model---
Random Forest AUC = 0.99
             precision    recall  f1-score   support

          0       0.99      1.00      0.99      1714
          1       0.99      0.98      0.98       536

avg / total       0.99      0.99      0.99      2250

Support Vector Classifier

%%time
from sklearn.svm import SVC

svclassifier = SVC(kernel='rbf', probability=True)
svc = svclassifier.fit(X_train, y_train)
Wall time: 26.7 s
%%time
svc_result = model_selection.cross_val_score(svc, X_train, y_train, cv=kfold, scoring=scoring)
svc_result
Wall time: 46.2 s
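# Print out the mean and standard deviation of the cross-validation scores
svc_auc = svc_result.mean()
print("The Support Vector Classifier AUC: %.3f and the STD is (%.3f)" % (svc_auc, svc_result.std()))
Output as recorded (the recorded run printed rf_auc and rf_results.std(), so these figures are the random forest's, not the SVC's):
The Supper Vector Classifier AUC: 0.988 and the STD is (0.004)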
from sklearn.metrics import roc_auc_score

print("\n\n ---Support Vector Model---")
svc_roc_auc = roc_auc_score(y_test, svc.predict(X_test))
print("Support Vector Classifier AUC = %2.2f" % svc_roc_auc)
print(classification_report(y_test, svc.predict(X_test)))
Output as recorded (the recorded run called rf.predict, so this report repeats the random forest's results, not the SVC's):
 ---Support Vector Model---
Support Vector Classifier AUC = 0.99
             precision    recall  f1-score   support

          0       0.99      1.00      0.99      1714
          1       0.99      0.98      0.98       536

avg / total       0.99      0.99      0.99      2250

ROC Graph

# Create ROC Graph
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_test, lr.predict_proba(X_test)[:, 1])
rf_fpr, rf_tpr, rf_thresholds = roc_curve(y_test, rf.predict_proba(X_test)[:, 1])
svc_fpr, svc_tpr, svc_thresholds = roc_curve(y_test, svc.predict_proba(X_test)[:, 1])

plt.figure()

# Plot Logistic Regression ROC
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % lr_auc)

# Plot Random Forest ROC
plt.plot(rf_fpr, rf_tpr, label='Random Forest (area = %0.2f)' % rf_auc)

# Plot Support Vector Classifier ROC
plt.plot(svc_fpr, svc_tpr, label='Support Vector Classifier (area = %0.2f)' % svc_auc)

# Plot Base Rate ROC (dashed black diagonal)
plt.plot([0, 1], [0, 1], 'k--', label='Base Rate')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Graph')
plt.legend(loc="lower right")
plt.show()

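[Figure: ROC Graph for Logistic Regression, Random Forest, and Support Vector Classifier against the Base Rate; x-axis False Positive Rate, y-axis True Positive Rate]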

Random Forest Feature Importances

# Get Feature Importances
feature_importances = pd.DataFrame(rf.feature_importances_,
                                   index=X_train.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)
feature_importances = feature_importances.reset_index()
feature_importances
    index                   importance
 0  satisfaction            0.279718
 1  yearsAtCompany          0.240698
 2  averageMonthlyHours     0.178100
 3  evaluation              0.129985
 4  projectCount            0.119583
 5  workAccident            0.013300
 6  salary_low              0.011167
 7  department_technical    0.005552
 8  department_sales        0.004075
 9  salary_medium           0.003387
10  department_support      0.003291
11  promotion               0.002225
12  department_hr           0.002103
13  department_management   0.001688
14  department_accounting   0.001502
15  department_RandD        0.001363
16  department_marketing    0.001333
17  department_product_mng  0.000928
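
To see how concentrated the importances are, the values in the table can be summed directly. A small sketch, with the numbers copied from the table above and illustrative variable names:

```python
# Importances copied from the table above, in descending order.
importances = [
    0.279718, 0.240698, 0.178100, 0.129985, 0.119583, 0.013300,
    0.011167, 0.005552, 0.004075, 0.003387, 0.003291, 0.002225,
    0.002103, 0.001688, 0.001502, 0.001363, 0.001333, 0.000928,
]

total = sum(importances)
top_five_share = sum(importances[:5]) / total

print(round(total, 3), round(top_five_share, 3))
```

The five employee-level features (satisfaction, years at company, average monthly hours, evaluation, and project count) account for roughly 95% of the total importance, while the department and salary dummy columns contribute very little.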
sns.set(style="whitegrid")

# Initialize the matplotlib figure
f, ax = plt.subplots(figsize=(13, 7))

# Plot the importance of each feature
sns.set_color_codes("pastel")
sns.barplot(x="importance", y='index', data=feature_importances, label="Total", color="b")
<matplotlib.axes._subplots.AxesSubplot at 0x1d0835083c8> 

[Figure: horizontal bar chart of random forest feature importances]

Retention Plan

Reference: http://rupeshkhare.com/wp-content/uploads/2013/12/Employee-Attrition-Risk-Assessment-using-Logistic-Regression-Analysis.pdf

Since this model is being used for people, we should refrain from relying solely on its output. Instead, we can use its probability output and design our own system to treat each employee accordingly.

  1. Safe Zone (Green) – Employees within this zone are considered safe.
  2. Low Risk Zone (Yellow) – Employees within this zone should be monitored for potential turnover. This is more of a long-term track.
  3. Medium Risk Zone (Orange) – Employees within this zone are at risk of turnover. Action should be taken and monitored accordingly.
  4. High Risk Zone (Red) – Employees within this zone are considered to have the highest chance of turnover. Action should be taken immediately.

rf.predict_proba(X_test)[175:200,]
array([[1. , 0. ],
       [0. , 1. ],
       [1. , 0. ],
       [0. , 1. ],
       [0.8, 0.2],
       [0. , 1. ],
       [1. , 0. ],
       [0. , 1. ],
       [1. , 0. ],
       [1. , 0. ],
       [1. , 0. ],
       [1. , 0. ],
       [1. , 0. ],
       [1. , 0. ],
       [1. , 0. ],
       [1. , 0. ],
       [1. , 0. ],
       [0.9, 0.1],
       [1. , 0. ],
       [0.4, 0.6],
       [1. , 0. ],
       [1. , 0. ],
       [0. , 1. ],
       [1. , 0. ],
       [0. , 1. ]])
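
One way to turn these probabilities into the four zones is a simple threshold lookup. The cut-offs below (0.2, 0.6, 0.9) and the helper name `risk_zone` are hypothetical placeholders chosen for illustration; the source does not specify thresholds, and in practice they would be agreed with HR stakeholders.

```python
def risk_zone(p_leave, low=0.2, medium=0.6, high=0.9):
    """Map a turnover probability to a zone; the thresholds are assumptions."""
    if p_leave < low:
        return "Safe Zone (Green)"
    if p_leave < medium:
        return "Low Risk Zone (Yellow)"
    if p_leave < high:
        return "Medium Risk Zone (Orange)"
    return "High Risk Zone (Red)"


# A few turnover probabilities (second column) of the kind shown above.
for p in [0.0, 0.1, 0.2, 0.6, 1.0]:
    print(p, risk_zone(p))
```

Keeping the thresholds as parameters makes it easy to tighten or relax the zones without touching the model itself.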

About

DSDJ Employee Turnover Activity
