- To understand which factors contributed most to employee turnover.
- To perform clustering to find meaningful patterns in employee traits.
- To build a model that predicts the likelihood that a given employee will leave the company.
- To create or improve retention strategies targeted at at-risk employees.
Implementing this model will allow management to make better-informed retention decisions.
Employee turnover is one of the most common and costly problems in the workplace.
According to the Center for American Progress, replacing a worker earning about $50,000 a year costs the company roughly $10,000, or 20% of that worker's annual salary.
Replacing a high-level employee can cost several multiples of that amount (a rough back-of-the-envelope sketch follows the cost list below).
Costs include:
- Cost of off-boarding
- Cost of hiring (advertising, interviewing, hiring)
- Cost of onboarding a new person (training, management time)
- Lost productivity (a new person may take 1-2 years to reach the productivity of an existing person)
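As a rough illustration of how these figures add up, the sketch below estimates an annual turnover cost from the 20% rule of thumb cited above. The headcount and average salary are hypothetical placeholder values, not figures from this dataset.

```python
# Back-of-the-envelope turnover cost estimate (all inputs are illustrative assumptions)
headcount = 1000              # hypothetical company size
avg_salary = 50_000           # hypothetical average annual salary, USD
annual_turnover_rate = 0.24   # roughly the rate observed later in this dataset
replacement_cost_pct = 0.20   # ~20% of annual salary, per the figure cited above

employees_lost = headcount * annual_turnover_rate
annual_cost = employees_lost * avg_salary * replacement_cost_pct
print(f"Estimated annual turnover cost: ${annual_cost:,.0f}")  # -> $2,400,000 under these assumptions
```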
```python
# Import the necessary modules for data manipulation and visual representation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as matplot
import seaborn as sns
%matplotlib inline

# Load the dataset
df = pd.read_csv('HR_comma_sep.csv.txt')

# Examine the dataset
df.head()
```

| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | sales | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |
```python
# Check to see if there are any missing values in the dataset
df.isnull().any()
```

```
satisfaction_level       False
last_evaluation          False
number_project           False
average_montly_hours     False
time_spend_company       False
Work_accident            False
left                     False
promotion_last_5years    False
sales                    False
salary                   False
dtype: bool
```

```python
# Rename certain columns for better readability
df = df.rename(columns={
    'satisfaction_level': 'satisfaction',
    'last_evaluation': 'evaluation',
    'number_project': 'projectCount',
    'average_montly_hours': 'averageMonthlyHours',
    'time_spend_company': 'yearsAtCompany',
    'Work_accident': 'workAccident',
    'promotion_last_5years': 'promotion',
    'sales': 'department',
    'left': 'turnover'
})
df.head(3)
```

| | satisfaction | evaluation | projectCount | averageMonthlyHours | yearsAtCompany | workAccident | turnover | promotion | department | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
```python
# Check the type of our features. Are there any data inconsistencies?
df.dtypes
```

```
satisfaction           float64
evaluation             float64
projectCount             int64
averageMonthlyHours      int64
yearsAtCompany           int64
workAccident             int64
turnover                 int64
promotion                int64
department              object
salary                  object
dtype: object
```

```python
# How many employees are in the dataset?
df.shape
```

```
(14999, 10)
```

```python
# Calculate the turnover rate of the company's dataset. What's the rate of turnover?
turnover_rate = df.turnover.value_counts() / 14999
turnover_rate
```

```
0    0.761917
1    0.238083
Name: turnover, dtype: float64
```

```python
# Display a statistical overview of the employees
df.describe()
```

| | satisfaction | evaluation | projectCount | averageMonthlyHours | yearsAtCompany | workAccident | turnover | promotion |
|---|---|---|---|---|---|---|---|---|
| count | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 |
| mean | 0.612834 | 0.716102 | 3.803054 | 201.050337 | 3.498233 | 0.144610 | 0.238083 | 0.021268 |
| std | 0.248631 | 0.171169 | 1.232592 | 49.943099 | 1.460136 | 0.351719 | 0.425924 | 0.144281 |
| min | 0.090000 | 0.360000 | 2.000000 | 96.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.440000 | 0.560000 | 3.000000 | 156.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.640000 | 0.720000 | 4.000000 | 200.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 0.820000 | 0.870000 | 5.000000 | 245.000000 | 4.000000 | 0.000000 | 0.000000 | 0.000000 |
| max | 1.000000 | 1.000000 | 7.000000 | 310.000000 | 10.000000 | 1.000000 | 1.000000 | 1.000000 |
```python
# Display the mean summary of employees (turnover vs. non-turnover). What do you notice between the groups?
turnover_Summary = df.groupby('turnover')
turnover_Summary.mean()
```

| turnover | satisfaction | evaluation | projectCount | averageMonthlyHours | yearsAtCompany | workAccident | promotion |
|---|---|---|---|---|---|---|---|
| 0 | 0.666810 | 0.715473 | 3.786664 | 199.060203 | 3.380032 | 0.175009 | 0.026251 |
| 1 | 0.440098 | 0.718113 | 3.855503 | 207.419210 | 3.876505 | 0.047326 | 0.005321 |
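The gap in mean satisfaction (0.67 for employees who stayed vs. 0.44 for those who left) stands out. As an optional follow-up that is not part of the original notebook, a simple two-sample test can confirm the difference is not due to chance; this sketch assumes scipy is available.

```python
# Optional follow-up (not in the original notebook): test whether mean satisfaction
# differs significantly between employees who left and those who stayed.
from scipy import stats

stayers = df[df.turnover == 0]['satisfaction']
leavers = df[df.turnover == 1]['satisfaction']

# Welch's t-test, which does not assume equal variances between the groups
t_stat, p_value = stats.ttest_ind(leavers, stayers, equal_var=False)
print("t-statistic: %.2f, p-value: %.4f" % (t_stat, p_value))
```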
```python
# Create a correlation matrix. What features correlate the most with turnover? What other correlations did you find?
corr = df.corr()
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)
plt.title('Heatmap of Correlation Matrix')
corr
```

| | satisfaction | evaluation | projectCount | averageMonthlyHours | yearsAtCompany | workAccident | turnover | promotion |
|---|---|---|---|---|---|---|---|---|
| satisfaction | 1.000000 | 0.105021 | -0.142970 | -0.020048 | -0.100866 | 0.058697 | -0.388375 | 0.025605 |
| evaluation | 0.105021 | 1.000000 | 0.349333 | 0.339742 | 0.131591 | -0.007104 | 0.006567 | -0.008684 |
| projectCount | -0.142970 | 0.349333 | 1.000000 | 0.417211 | 0.196786 | -0.004741 | 0.023787 | -0.006064 |
| averageMonthlyHours | -0.020048 | 0.339742 | 0.417211 | 1.000000 | 0.127755 | -0.010143 | 0.071287 | -0.003544 |
| yearsAtCompany | -0.100866 | 0.131591 | 0.196786 | 0.127755 | 1.000000 | 0.002120 | 0.144822 | 0.067433 |
| workAccident | 0.058697 | -0.007104 | -0.004741 | -0.010143 | 0.002120 | 1.000000 | -0.154622 | 0.039245 |
| turnover | -0.388375 | 0.006567 | 0.023787 | 0.071287 | 0.144822 | -0.154622 | 1.000000 | -0.061788 |
| promotion | 0.025605 | -0.008684 | -0.006064 | -0.003544 | 0.067433 | 0.039245 | -0.061788 | 1.000000 |
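To answer the question in the cell above directly, the single column of correlations against turnover can be pulled out and sorted. This is a small convenience step that is not in the original notebook.

```python
# Rank features by their correlation with turnover
corr['turnover'].drop('turnover').sort_values()
# satisfaction shows the strongest (negative) correlation with turnover,
# while yearsAtCompany and averageMonthlyHours are mildly positive.
```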
```python
# Plot the distributions of Employee Satisfaction, Evaluation, and Average Monthly Hours. What story can you tell?
# Set up the matplotlib figure
f, axes = plt.subplots(ncols=3, figsize=(15, 6))

# Graph Employee Satisfaction
sns.distplot(df.satisfaction, kde=False, color="g", ax=axes[0]).set_title('Employee Satisfaction Distribution')
axes[0].set_ylabel('Employee Count')

# Graph Employee Evaluation
sns.distplot(df.evaluation, kde=False, color="r", ax=axes[1]).set_title('Employee Evaluation Distribution')
axes[1].set_ylabel('Employee Count')

# Graph Employee Average Monthly Hours
sns.distplot(df.averageMonthlyHours, kde=False, color="b", ax=axes[2]).set_title('Employee Average Monthly Hours Distribution')
axes[2].set_ylabel('Employee Count')
```

Apply `get_dummies()` to the categorical variables. Separate the categorical and numeric variables, then combine them.
```python
cat_var = ['department', 'salary', 'turnover', 'promotion']
num_var = ['satisfaction', 'evaluation', 'projectCount', 'averageMonthlyHours', 'yearsAtCompany', 'workAccident']

categorical_df = pd.get_dummies(df[cat_var], drop_first=True)
numerical_df = df[num_var]

new_df = pd.concat([categorical_df, numerical_df], axis=1)
new_df.head()
```

| | turnover | promotion | department_RandD | department_accounting | department_hr | department_management | department_marketing | department_product_mng | department_sales | department_support | department_technical | salary_low | salary_medium | satisfaction | evaluation | projectCount | averageMonthlyHours | yearsAtCompany | workAccident |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0.11 | 0.88 | 7 | 272 | 4 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0.72 | 0.87 | 5 | 223 | 5 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0.37 | 0.52 | 2 | 159 | 3 | 0 |
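Note that `drop_first=True` drops one level per categorical variable, so the dummies are not perfectly collinear, which matters for linear models such as logistic regression. A quick way to see which levels were kept is shown below; this check is not in the original notebook.

```python
# Optional check (not in the original notebook): list the dummy columns that were created.
# With drop_first=True one level per variable is dropped and becomes the implicit baseline;
# judging by the columns above, 'IT' and 'high' are the baseline department and salary levels.
print(sorted(c for c in new_df.columns if c.startswith('department_')))
print(sorted(c for c in new_df.columns if c.startswith('salary_')))
```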
```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, precision_score, recall_score, confusion_matrix, precision_recall_curve

# Create the X and y sets
X = new_df.iloc[:, 1:]
y = new_df.iloc[:, 0]

# Define the train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=123, stratify=y)
```

```python
%%time
# Check the accuracy of the Logistic Regression model
from sklearn.linear_model import LogisticRegression

# Define the Logistic Regression model
lr = LogisticRegression(class_weight='balanced')

# Fit the Logistic Regression model to the train set
lr.fit(X_train, y_train)

print("Logistic accuracy is %2.2f" % accuracy_score(y_test, lr.predict(X_test)))
```

```
Logistic accuracy is 0.77
Wall time: 110 ms
```

```python
%%time
from sklearn import model_selection

# Define the 10-fold cross validation
kfold = model_selection.KFold(n_splits=10, random_state=7)

# Define the Logistic Regression model
lrCV = LogisticRegression()

# Define the evaluation metric
scoring = 'roc_auc'

# Train the Logistic Regression model with 10-fold cross validation
lr_results = model_selection.cross_val_score(lrCV, X_train, y_train, cv=kfold, scoring=scoring)
```

```
Wall time: 628 ms
```

```python
# Print out the 10 scores from the training. Notice how you get a range of scores compared to a single train/test split.
lr_results
```

```
array([0.79845385, 0.8371952 , 0.82284329, 0.8179427 , 0.80693377,
       0.83157279, 0.82354362, 0.82073686, 0.80722612, 0.83976854])
```

Let's use AUC as a general baseline to compare our models' performance. After comparing, we can select the best one and look at its precision and recall.
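One way to keep that comparison consistent is to wrap the cross-validation call in a small helper and run every candidate model through it. This is a generic sketch, not code from the original notebook; it reuses the `kfold`, `scoring`, and `model_selection` objects defined above.

```python
# Hypothetical helper (not in the original notebook): run each candidate model through
# the same 10-fold cross validation and report its mean ROC AUC for an apples-to-apples comparison.
def compare_models(models, X, y, cv, scoring='roc_auc'):
    for name, model in models.items():
        scores = model_selection.cross_val_score(model, X, y, cv=cv, scoring=scoring)
        print("%-25s AUC: %.3f (+/- %.3f)" % (name, scores.mean(), scores.std()))

# Example usage (RandomForestClassifier is imported further below):
# compare_models({'Logistic Regression': LogisticRegression(),
#                 'Random Forest': RandomForestClassifier()},
#                X_train, y_train, cv=kfold)
```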
```python
# Print out the mean and standard deviation of the training scores
lr_auc = lr_results.mean()
print("The Logistic Regression AUC: %.3f and the STD is (%.3f)" % (lr_auc, lr_results.std()))
```

```
The Logistic Regression AUC: 0.821 and the STD is (0.013)
```

```python
from sklearn.metrics import roc_auc_score

print("\n\n ---Logistic Regression Model---")
lr_auc = roc_auc_score(y_test, lr.predict(X_test))
print("Logistic Regression AUC = %2.2f" % lr_auc)
print(classification_report(y_test, lr.predict(X_test)))
```

```
 ---Logistic Regression Model---
Logistic Regression AUC = 0.78

             precision    recall  f1-score   support

          0       0.92      0.76      0.83      1714
          1       0.50      0.80      0.62       536

avg / total       0.82      0.77      0.78      2250
```

Notice how the random forest classifier takes longer to run on the dataset. That is one downside of the algorithm: it is more computationally expensive. However, it performs better than simpler models such as Logistic Regression.
```python
%%time
from sklearn.ensemble import RandomForestClassifier

# Define the Random Forest model
rf = RandomForestClassifier(class_weight="balanced")

# Fit the Random Forest model
rf = rf.fit(X_train, y_train)
```

```
Wall time: 321 ms
```

```python
%%time
rf_results = model_selection.cross_val_score(rf, X_train, y_train, cv=kfold, scoring=scoring)
rf_results
```

```
Wall time: 1.6 s
```

```python
# Print out the mean and standard deviation of the training scores
rf_auc = rf_results.mean()
print("The Random Forest AUC: %.3f and the STD is (%.3f)" % (rf_auc, rf_results.std()))
```

```
The Random Forest AUC: 0.988 and the STD is (0.004)
```

```python
from sklearn.metrics import roc_auc_score

print("\n\n ---Random Forest Model---")
rf_roc_auc = roc_auc_score(y_test, rf.predict(X_test))
print("Random Forest AUC = %2.2f" % rf_roc_auc)
print(classification_report(y_test, rf.predict(X_test)))
```

```
 ---Random Forest Model---
Random Forest AUC = 0.99

             precision    recall  f1-score   support

          0       0.99      1.00      0.99      1714
          1       0.99      0.98      0.98       536

avg / total       0.99      0.99      0.99      2250
```

```python
%%time
from sklearn.svm import SVC

# Define and fit the Support Vector Classifier (probability=True enables predict_proba)
svclassifier = SVC(kernel='rbf', probability=True)
svc = svclassifier.fit(X_train, y_train)
```

```
Wall time: 26.7 s
```

```python
%%time
svc_result = model_selection.cross_val_score(svc, X_train, y_train, cv=kfold, scoring=scoring)
svc_result
```

```
Wall time: 46.2 s
```

```python
# Print out the mean and standard deviation of the training scores
# (the original cell mistakenly printed the Random Forest numbers here; it should use the SVC results)
svc_auc = svc_result.mean()
print("The Support Vector Classifier AUC: %.3f and the STD is (%.3f)" % (svc_auc, svc_result.std()))
```

```python
from sklearn.metrics import roc_auc_score

# (the original cell reused the Random Forest predictions here; the SVC should be evaluated instead)
print("\n\n ---Support Vector Model---")
svc_roc_auc = roc_auc_score(y_test, svc.predict(X_test))
print("Support Vector Classifier AUC = %2.2f" % svc_roc_auc)
print(classification_report(y_test, svc.predict(X_test)))
```

```python
# Create the ROC graph
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_test, lr.predict_proba(X_test)[:, 1])
rf_fpr, rf_tpr, rf_thresholds = roc_curve(y_test, rf.predict_proba(X_test)[:, 1])
svc_fpr, svc_tpr, svc_thresholds = roc_curve(y_test, svc.predict_proba(X_test)[:, 1])

plt.figure()

# Plot the Logistic Regression ROC
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % lr_auc)

# Plot the Random Forest ROC
plt.plot(rf_fpr, rf_tpr, label='Random Forest (area = %0.2f)' % rf_auc)

# Plot the Support Vector Classifier ROC
plt.plot(svc_fpr, svc_tpr, label='Support Vector Classifier (area = %0.2f)' % svc_auc)

# Plot the Base Rate ROC (random-guess diagonal)
plt.plot([0, 1], [0, 1], 'k--', label='Base Rate')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Graph')
plt.legend(loc="lower right")
plt.show()
```

```python
# Get the feature importances from the Random Forest model
feature_importances = pd.DataFrame(rf.feature_importances_,
                                   index=X_train.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)
feature_importances = feature_importances.reset_index()
feature_importances
```

| | index | importance |
|---|---|---|
| 0 | satisfaction | 0.279718 |
| 1 | yearsAtCompany | 0.240698 |
| 2 | averageMonthlyHours | 0.178100 |
| 3 | evaluation | 0.129985 |
| 4 | projectCount | 0.119583 |
| 5 | workAccident | 0.013300 |
| 6 | salary_low | 0.011167 |
| 7 | department_technical | 0.005552 |
| 8 | department_sales | 0.004075 |
| 9 | salary_medium | 0.003387 |
| 10 | department_support | 0.003291 |
| 11 | promotion | 0.002225 |
| 12 | department_hr | 0.002103 |
| 13 | department_management | 0.001688 |
| 14 | department_accounting | 0.001502 |
| 15 | department_RandD | 0.001363 |
| 16 | department_marketing | 0.001333 |
| 17 | department_product_mng | 0.000928 |
```python
sns.set(style="whitegrid")

# Initialize the matplotlib figure
f, ax = plt.subplots(figsize=(13, 7))

# Plot the feature importances
sns.set_color_codes("pastel")
sns.barplot(x="importance", y="index", data=feature_importances,
            label="Total", color="b")
```

Since this model is being used to make decisions about people, we should refrain from relying solely on its predicted class. Instead, we can use its probability output and design our own system to treat each employee accordingly (a small bucketing sketch follows the probability output below):
- Safe Zone (Green) – Employees within this zone are considered safe.
- Low Risk Zone (Yellow) – Employees within this zone should be monitored for potential turnover. This is more of a long-term track.
- Medium Risk Zone (Orange) – Employees within this zone are at risk of turnover. Action should be taken and monitored accordingly.
- High Risk Zone (Red) – Employees within this zone are considered to have the highest chance of turnover. Action should be taken immediately.
```python
rf.predict_proba(X_test)[175:200, ]
```

```
array([[1. , 0. ],
       [0. , 1. ],
       [1. , 0. ],
       [0. , 1. ],
       [0.8, 0.2],
       [0. , 1. ],
       [1. , 0. ],
       [0. , 1. ],
       [1. , 0. ],
       [1. , 0. ],
       [1. , 0. ],
       [1. , 0. ],
       [1. , 0. ],
       [1. , 0. ],
       [1. , 0. ],
       [1. , 0. ],
       [1. , 0. ],
       [0.9, 0.1],
       [1. , 0. ],
       [0.4, 0.6],
       [1. , 0. ],
       [1. , 0. ],
       [0. , 1. ],
       [1. , 0. ],
       [0. , 1. ]])
```
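One way to operationalize the zones above is to bucket each employee's predicted turnover probability into a risk band. The thresholds below are illustrative assumptions, not values from the original analysis.

```python
# Illustrative sketch: bucket each employee's predicted turnover probability into a risk zone.
# The cut-off values below are assumptions for illustration and would need to be tuned with HR.
def risk_zone(prob):
    if prob < 0.25:
        return 'Safe Zone (Green)'
    elif prob < 0.50:
        return 'Low Risk Zone (Yellow)'
    elif prob < 0.75:
        return 'Medium Risk Zone (Orange)'
    return 'High Risk Zone (Red)'

turnover_prob = rf.predict_proba(X_test)[:, 1]
risk_df = pd.DataFrame({'turnover_probability': turnover_prob,
                        'risk_zone': [risk_zone(p) for p in turnover_prob]},
                       index=X_test.index)
risk_df['risk_zone'].value_counts()
```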



