Data Preprocessing: Sample Selection and Cross-Validation

Author: violalal_134 | Source: Internet | 2023-09-23 23:38

1. Undersampling the samples

# Undersample the majority class to get a balanced dataset.
# Assumes `data` is a pandas DataFrame, loaded earlier, with a binary 'Class' column.
import numpy as np
import pandas as pd

# .ix was removed in pandas 1.0; .loc does the same label-based selection
X = data.loc[:, data.columns != 'Class']
y = data.loc[:, data.columns == 'Class']

# Number of data points in the minority class
number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)

# Picking the indices of the normal classes
normal_indices = data[data.Class == 0].index

# Out of the indices we picked, randomly select "x" number (number_records_fraud)
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace=False)
random_normal_indices = np.array(random_normal_indices)

# Appending the 2 indices
under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])

# Under sample dataset (these are index labels, so select with .loc rather than .iloc)
under_sample_data = data.loc[under_sample_indices, :]

X_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']

# Showing ratio
print("Percentage of normal transactions: ", len(under_sample_data[under_sample_data.Class == 0]) / len(under_sample_data))
print("Percentage of fraud transactions: ", len(under_sample_data[under_sample_data.Class == 1]) / len(under_sample_data))
print("Total number of transactions in resampled data: ", len(under_sample_data))

# Split both the full data and the undersampled data into training and test sets.
# sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection.
from sklearn.model_selection import train_test_split

# Whole dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

print("Number transactions train dataset: ", len(X_train))
print("Number transactions test dataset: ", len(X_test))
print("Total number of transactions: ", len(X_train) + len(X_test))

# Undersampled dataset
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(
    X_undersample, y_undersample, test_size=0.3, random_state=0)
print("")
print("Number transactions train dataset: ", len(X_train_undersample))
print("Number transactions test dataset: ", len(X_test_undersample))
print("Total number of transactions: ", len(X_train_undersample) + len(X_test_undersample))
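The undersampling step can also be wrapped in a small reusable function. The following is only a sketch under stated assumptions: the helper name `undersample`, its parameters, and the toy DataFrame are hypothetical and not part of the original post. It keeps every minority row and draws an equal number of majority rows without replacement.

```python
import numpy as np
import pandas as pd


def undersample(df, label_col="Class", minority=1, seed=0):
    """Return all minority rows plus an equal-sized random draw of the other rows."""
    rng = np.random.default_rng(seed)
    minority_idx = df.index[df[label_col] == minority].to_numpy()
    majority_idx = df.index[df[label_col] != minority].to_numpy()
    # Sample without replacement so no majority row appears twice.
    sampled_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    keep = np.concatenate([minority_idx, sampled_majority])
    # df.index holds labels, so select with .loc (not .iloc).
    return df.loc[keep]


# Hypothetical toy data: 1000 rows, 20 of them labelled 1.
toy = pd.DataFrame({
    "V1": np.arange(1000, dtype=float),
    "Class": [1] * 20 + [0] * 980,
})
balanced = undersample(toy)
print(len(balanced), balanced["Class"].mean())
```

Fixing the seed makes the draw reproducible, which the plain `np.random.choice` call does not guarantee unless the global seed is set beforehand.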

Selecting the best parameter with cross-validation:

# Recall = TP / (TP + FN)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import confusion_matrix, recall_score, classification_report

def printing_Kfold_scores(x_train_data, y_train_data):
    # The old KFold(n, 5) signature is gone; n_splits plus fold.split() replaces it
    fold = KFold(n_splits=5, shuffle=False)

    # Different C parameters
    c_param_range = [0.01, 0.1, 1, 10, 100]

    # One row per C value (the original range(len(c_param_range), 2) was empty)
    results_table = pd.DataFrame(index=range(len(c_param_range)),
                                 columns=['C_parameter', 'Mean recall score'])
    results_table['C_parameter'] = c_param_range

    # fold.split() yields pairs: train_indices = indices[0], test_indices = indices[1]
    j = 0
    for c_param in c_param_range:
        print('-------------------------------------------')
        print('C parameter: ', c_param)
        print('-------------------------------------------')
        print('')

        recall_accs = []
        for iteration, indices in enumerate(fold.split(x_train_data), start=1):
            # Call the logistic regression model with a certain C parameter.
            # The L1 penalty needs a solver that supports it, such as liblinear.
            lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear')

            # Use the training data to fit the model. In this case, we use the portion of the fold
            # to train the model with indices[0]. We then predict on the portion assigned as the
            # 'test cross validation' with indices[1]
            lr.fit(x_train_data.iloc[indices[0], :].values, y_train_data.iloc[indices[0], :].values.ravel())

            # Predict values using the test indices in the training data
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1], :].values)

            # Calculate the recall score and append it to the list of recall scores for the current C
            recall_acc = recall_score(y_train_data.iloc[indices[1], :].values.ravel(), y_pred_undersample)
            recall_accs.append(recall_acc)
            print('Iteration ', iteration, ': recall score = ', recall_acc)

        # The mean value of those recall scores is the metric we want to save and get hold of.
        results_table.loc[j, 'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')

    # The score column has object dtype, so cast it to float before calling idxmax()
    best_c = results_table.loc[results_table['Mean recall score'].astype('float64').idxmax()]['C_parameter']

    # Finally, we can check which C parameter is the best amongst the chosen.
    print('*********************************************************************************')
    print('Best model to choose from cross validation is with C parameter = ', best_c)
    print('*********************************************************************************')

    return best_c

best_c = printing_Kfold_scores(X_train_undersample, y_train_undersample)
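The same search over C can be written more compactly with `cross_val_score`, which the post imports but never calls. This is a sketch rather than the author's method: the synthetic data from `make_classification` and the names `X_demo`, `y_demo`, `mean_recall` and `best_c_demo` are assumptions for illustration, and it uses stratified, shuffled folds instead of the unshuffled `KFold` used in the post.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical synthetic data standing in for the undersampled training set.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

# Stratified folds keep the class ratio the same in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

mean_recall = {}
for c in [0.01, 0.1, 1, 10, 100]:
    # liblinear is one of the solvers that supports the L1 penalty.
    model = LogisticRegression(C=c, penalty="l1", solver="liblinear")
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="recall")
    mean_recall[c] = scores.mean()

best_c_demo = max(mean_recall, key=mean_recall.get)
print(mean_recall)
print("best C:", best_c_demo)
```

Selecting on recall alone can favor a model that labels nearly everything positive, so it is worth checking precision for the chosen C as well.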

Plotting the confusion matrix

import itertools
import matplotlib.pyplot as plt

def plot_confusion_matrix(cm, classes, title='Confusion matrix', cmap=plt.cm.Blues):
    """This function prints and plots the confusion matrix."""
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    # Use white text on dark cells and black text on light cells
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# Refit on the undersampled training set with the best C (liblinear supports the L1 penalty)
lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample.values, y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(X_test_undersample.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test_undersample, y_pred_undersample)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))

# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()

Recall at different decision thresholds

lr = LogisticRegression(C=0.01, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample.values, y_train_undersample.values.ravel())
y_pred_undersample_proba = lr.predict_proba(X_test_undersample.values)

thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

plt.figure(figsize=(10, 10))

j = 1
for i in thresholds:
    # Predict the positive class when its probability exceeds the threshold
    y_test_predictions_high_recall = y_pred_undersample_proba[:, 1] > i

    plt.subplot(3, 3, j)
    j += 1

    # Compute confusion matrix
    cnf_matrix = confusion_matrix(y_test_undersample, y_test_predictions_high_recall)
    np.set_printoptions(precision=2)

    print("Recall metric in the testing dataset: ", cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))

    # Plot non-normalized confusion matrix (the comparison above is strict, so label it with ">")
    class_names = [0, 1]
    plot_confusion_matrix(cnf_matrix, classes=class_names, title='Threshold > %s' % i)
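To see why the threshold matters, recall and precision can be computed directly from the predicted probabilities. The sketch below is illustrative only: the helper `recall_precision_at` and the labels and probabilities are made up for this example, not taken from the post's dataset.

```python
import numpy as np


def recall_precision_at(y_true, proba, threshold):
    """Recall and precision when predicting positive for proba > threshold."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(proba) > threshold
    tp = np.sum(y_pred & y_true)    # true positives
    fn = np.sum(~y_pred & y_true)   # false negatives
    fp = np.sum(y_pred & ~y_true)   # false positives
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return float(recall), float(precision)


# Hypothetical labels and predicted probabilities for the positive class.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
proba = [0.05, 0.30, 0.45, 0.70, 0.35, 0.55, 0.80, 0.95]

for t in [0.2, 0.5, 0.8]:
    r, p = recall_precision_at(y_true, proba, t)
    print(f"threshold={t}: recall={r:.2f}, precision={p:.2f}")
```

On this toy data, raising the threshold from 0.2 to 0.8 lowers recall from 1.00 to 0.25 while precision rises from 0.57 to 1.00, which is the trade-off the grid of confusion matrices makes visible.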

Reposted from: https://www.cnblogs.com/itbuyixiaogong/p/9850128.html
