IBM HR Analytics: Employee Attrition & Performance Using KNN

Author: LC--Vincent | Source: Internet | 2023-07-30 02:03

Original: https://www.geeksforgeeks.org/ibm-hr-analytics-employee-attrition-performance-using-knn/

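Attrition is a problem that affects every business, regardless of geography, industry, or company size. Employee turnover is costly, so predicting it is a top priority for many HR departments. With advances in machine learning and data science, it has become possible to predict employee attrition, and here we do so with the KNN (k-nearest neighbors) algorithm.
Dataset:
The dataset, published by IBM's human resources department, is available on Kaggle.
Code: implementing the KNN algorithm for classification.
Loading the libraries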

Python 3

# linear algebra
import numpy as np
# data processing
import pandas as pd
# visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# IPython/Jupyter magic: render plots inline (notebook only)
%matplotlib inline

Code: importing the dataset

Python 3

dataset = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
# head is a method and must be called to show the first rows
print(dataset.head())

Output:

Code: dataset information

Python 3

dataset.info()

Output:

RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
Age                         1470 non-null int64
Attrition                   1470 non-null object
BusinessTravel              1470 non-null object
DailyRate                   1470 non-null int64
Department                  1470 non-null object
DistanceFromHome            1470 non-null int64
Education                   1470 non-null int64
EducationField              1470 non-null object
EmployeeCount               1470 non-null int64
EmployeeNumber              1470 non-null int64
EnvironmentSatisfaction     1470 non-null int64
Gender                      1470 non-null object
HourlyRate                  1470 non-null int64
JobInvolvement              1470 non-null int64
JobLevel                    1470 non-null int64
JobRole                     1470 non-null object
JobSatisfaction             1470 non-null int64
MaritalStatus               1470 non-null object
MonthlyIncome               1470 non-null int64
MonthlyRate                 1470 non-null int64
NumCompaniesWorked          1470 non-null int64
Over18                      1470 non-null object
OverTime                    1470 non-null object
PercentSalaryHike           1470 non-null int64
PerformanceRating           1470 non-null int64
RelationshipSatisfaction    1470 non-null int64
StandardHours               1470 non-null int64
StockOptionLevel            1470 non-null int64
TotalWorkingYears           1470 non-null int64
TrainingTimesLastYear       1470 non-null int64
WorkLifeBalance             1470 non-null int64
YearsAtCompany              1470 non-null int64
YearsInCurrentRole          1470 non-null int64
YearsSinceLastPromotion     1470 non-null int64
YearsWithCurrManager        1470 non-null int64
dtypes: int64(26), object(9)
memory usage: 402.0+ KB

Code: visualizing the data

Python 3

# heatmap to check for missing values
plt.figure(figsize=(10, 4))
sns.heatmap(dataset.isnull(), yticklabels=False, cbar=False, cmap='viridis')

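Output:

The heatmap shows that the dataset has no missing values.
This is a binary classification problem, so we next plot how the instances are distributed across the two classes: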

Python 3

sns.set_style('darkgrid')
sns.countplot(x='Attrition', data=dataset)

Output:

Code:

Python 3

sns.lmplot(x = 'Age', y = 'DailyRate', hue = 'Attrition', data = dataset)

Output:

Code:

Python 3

plt.figure(figsize=(10, 6))
sns.boxplot(y='MonthlyIncome', x='Attrition', data=dataset)

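Output:

Data preprocessing
The dataset contains four columns that carry no useful information for prediction: EmployeeCount, EmployeeNumber, Over18, and StandardHours. We drop them so that they do not add noise to the model.
Code: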

Python 3

# drop the four uninformative columns in a single call
dataset.drop(['EmployeeCount', 'StandardHours', 'EmployeeNumber', 'Over18'],
             axis=1, inplace=True)
print(dataset.shape)

Output:

(1470, 31)

The irrelevant columns have now been removed, leaving 31 columns.
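One way to see why columns like these are uninformative is to count distinct values per column: a column with a single value cannot separate the classes, and a column that is unique per row behaves like an identifier. The sketch below uses a small toy frame with made-up values (not the real dataset) and hypothetical variable names, so treat it as an illustration of the check rather than the article's own method.

```python
import pandas as pd

# Toy frame with made-up values that mimic the kinds of columns dropped above
toy = pd.DataFrame({
    "Age": [41, 49, 37, 41],
    "EmployeeCount": [1, 1, 1, 1],       # same value in every row
    "StandardHours": [80, 80, 80, 80],   # same value in every row
    "Over18": ["Y", "Y", "Y", "Y"],      # same value in every row
    "EmployeeNumber": [1, 2, 4, 5],      # different in every row (an ID)
})

n_unique = toy.nunique()

# Columns with exactly one distinct value carry no signal
constant_cols = n_unique[n_unique == 1].index.tolist()
# Columns with one distinct value per row look like identifiers
id_like_cols = n_unique[n_unique == len(toy)].index.tolist()

print("constant:", constant_cols)
print("identifier-like:", id_like_cols)
```

On a real dataset the identifier check can also flag genuine continuous features, so its result needs a manual look before dropping anything.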
Code: input and output data

Python 3

# target: the Attrition column
y = dataset['Attrition']
# features: every other column (drop returns a copy, so dataset is not modified)
X = dataset.drop('Attrition', axis=1)

Code: label encoding

Python 3

from sklearn.preprocessing import LabelEncoder

# encode the Attrition labels as integers
lb = LabelEncoder()
y = lb.fit_transform(y)
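As a quick check on the label-encoding step, the sketch below runs LabelEncoder on a handful of toy labels (made up for illustration). LabelEncoder sorts the class names, so "No" becomes 0 and "Yes" becomes 1, which is why class 1 in the later reports corresponds to employees who left.

```python
from sklearn.preprocessing import LabelEncoder

# Toy Attrition labels, made up for illustration
labels = ["Yes", "No", "No", "Yes", "No"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the original labels in the order of their integer codes
print(encoder.classes_.tolist())
print(encoded.tolist())
```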

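Seven of the remaining features are categorical. KNN computes distances on numeric inputs, so we convert each of these seven features into dummy (one-hot) variables.
Code: creating the dummy variables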

Python 3

dum_BusinessTravel = pd.get_dummies(dataset['BusinessTravel'], prefix='BusinessTravel')
dum_Department = pd.get_dummies(dataset['Department'], prefix='Department')
dum_EducationField = pd.get_dummies(dataset['EducationField'], prefix='EducationField')
# two-level features need only one indicator column, hence drop_first
dum_Gender = pd.get_dummies(dataset['Gender'], prefix='Gender', drop_first=True)
dum_JobRole = pd.get_dummies(dataset['JobRole'], prefix='JobRole')
dum_MaritalStatus = pd.get_dummies(dataset['MaritalStatus'], prefix='MaritalStatus')
dum_OverTime = pd.get_dummies(dataset['OverTime'], prefix='OverTime', drop_first=True)

# add the dummy variables to the input X
X = pd.concat([X, dum_BusinessTravel, dum_Department, dum_EducationField,
               dum_Gender, dum_JobRole, dum_MaritalStatus, dum_OverTime], axis=1)

# remove the original categorical columns
X.drop(['BusinessTravel', 'Department', 'EducationField', 'Gender',
        'JobRole', 'MaritalStatus', 'OverTime'], axis=1, inplace=True)

print(X.shape)
print(y.shape)

Output:

(1470, 49)
(1470,)
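The effect of the dummy-variable step is easier to see on a tiny example. The sketch below applies pd.get_dummies to a toy frame (made-up rows, hypothetical variable names) and shows how drop_first=True keeps a single indicator column for a two-level feature while a three-level feature expands into three columns.

```python
import pandas as pd

# Toy rows, made up for illustration
toy = pd.DataFrame({
    "OverTime": ["Yes", "No", "No"],
    "Department": ["Sales", "Research & Development", "Human Resources"],
})

# Two levels with drop_first=True: only the "Yes" indicator remains
overtime = pd.get_dummies(toy["OverTime"], prefix="OverTime", drop_first=True)
# Three levels without drop_first: one indicator column per level
department = pd.get_dummies(toy["Department"], prefix="Department")

print(overtime.columns.tolist())
print(department.columns.tolist())
```

This also accounts for the shape printed above: 30 feature columns minus the 7 categorical ones leaves 23, so the dummy columns must contribute the remaining 26 of the 49.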

Code: splitting the data into training and test sets

Python 3

from sklearn.model_selection import train_test_split

# hold out 25% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=40)

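Preprocessing is complete, so we can now apply KNN to the dataset.
Model execution code: using KNeighborsClassifier, we find the best number of neighbors by comparing the cross-validated misclassification error for each candidate k.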

Python 3

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

neighbors = []
cv_scores = []

# perform 10-fold cross validation for every odd k from 1 to 39
for k in range(1, 40, 2):
    neighbors.append(k)
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())

# misclassification error = 1 - mean accuracy
error_rate = [1 - x for x in cv_scores]

# determining the best k
optimal_k = neighbors[error_rate.index(min(error_rate))]
print('The optimal number of neighbors is %d' % optimal_k)

# plot misclassification error versus k
plt.figure(figsize=(10, 6))
plt.plot(neighbors, error_rate, color='blue', linestyle='dashed',
         marker='o', markerfacecolor='red', markersize=10)
plt.xlabel('Number of neighbors')
plt.ylabel('Misclassification Error')
plt.show()

Output:

The optimal number of neighbors is 7
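To make the role of k concrete, here is a minimal from-scratch sketch of the KNN decision rule: measure the distance from a query point to every training point, keep the k closest, and take a majority vote. The function name knn_predict and the toy points are my own for illustration; scikit-learn's KNeighborsClassifier is what the article actually uses, and it is considerably more capable than this sketch.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k):
    """Return the majority label among the k training points nearest to x_query."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # majority vote over the neighbours' labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data: three points of class 0 near the origin, two of class 1 near (1, 1)
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_toy = np.array([0, 0, 0, 1, 1])
query = np.array([0.8, 0.9])

pred_k1 = knn_predict(X_toy, y_toy, query, k=1)
pred_k5 = knn_predict(X_toy, y_toy, query, k=5)
print(pred_k1, pred_k5)
```

With k=1 the query follows its closest neighbour; with k=5 every training point votes and the larger class wins, which is one reason a large k tends to favour the majority class on imbalanced data. Searching only odd values of k, as the loop above does, avoids tied votes in a two-class problem.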

Code: prediction scores

Python 3

from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import confusion_matrix


def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    """Print the classification report, confusion matrix and accuracy
    for the training set (train=True) or the test set (train=False)."""
    if train:
        print("Train Result:")
        print("------------")
        print("Classification Report: \n {}\n".format(
            classification_report(y_train, clf.predict(X_train))))
        print("Confusion Matrix: \n {}\n".format(
            confusion_matrix(y_train, clf.predict(X_train))))

        # 10-fold cross-validated accuracy on the training set
        res = cross_val_score(clf, X_train, y_train, cv=10, scoring='accuracy')
        print("Average Accuracy: \t {0:.4f}".format(np.mean(res)))
        print("Accuracy SD: \t\t {0:.4f}".format(np.std(res)))
        print("accuracy score: {0:.4f}\n".format(
            accuracy_score(y_train, clf.predict(X_train))))
        print("----------------------------------------------------------")
    else:
        print("Test Result:")
        print("-----------")
        print("Classification Report: \n {}\n".format(
            classification_report(y_test, clf.predict(X_test))))
        print("Confusion Matrix: \n {}\n".format(
            confusion_matrix(y_test, clf.predict(X_test))))
        print("accuracy score: {0:.4f}\n".format(
            accuracy_score(y_test, clf.predict(X_test))))
        print("-----------------------------------------------------------")


# fit KNN with the optimal k found above and report both sets of scores
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)
print_score(knn, X_train, y_train, X_test, y_test, train=True)
print_score(knn, X_train, y_train, X_test, y_test, train=False)

Output:

Train Result:
------------
Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.99      0.92       922
           1       0.83      0.19      0.32       180

    accuracy                           0.86      1102
   macro avg       0.85      0.59      0.62      1102
weighted avg       0.86      0.86      0.82      1102

Confusion Matrix:
 [[915   7]
 [145  35]]

Average Accuracy:    0.8421
Accuracy SD:         0.0148
accuracy score: 0.8621

-----------------------------------------------------------
Test Result:
-----------
Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.96      0.90       311
           1       0.14      0.04      0.06        57

    accuracy                           0.82       368
   macro avg       0.49      0.50      0.48       368
weighted avg       0.74      0.82      0.77       368

Confusion Matrix:
 [[299  12]
 [ 55   2]]

accuracy score: 0.8179
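The headline test accuracy hides how the model treats the minority class, and the numbers in the report can be recomputed directly from the test confusion matrix to see this. The sketch below does that arithmetic; the baseline variable (always predicting class 0) is my own addition for comparison and is not part of the original article.

```python
import numpy as np

# Test-set confusion matrix from the output above
# rows = true class, columns = predicted class
cm = np.array([[299, 12],
               [55, 2]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()    # share of all predictions that are correct
precision = tp / (tp + fp)         # of predicted leavers, how many really left
recall = tp / (tp + fn)            # of real leavers, how many were caught
baseline = cm[0].sum() / cm.sum()  # accuracy of always predicting class 0

print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
print(f"baseline  = {baseline:.4f}")
```

Only 2 of the 57 employees who actually left are identified (recall of about 0.04), and always predicting "no attrition" would score 311/368, about 0.85, which is higher than the model's 0.8179. For an imbalanced problem like this one, the class 1 precision and recall are therefore more informative than overall accuracy.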
