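I'm new to SVMs, and this is my use case: I have a lot of unbalanced data to classify into two classes using a linear SVM. I need to fix the false positive rate at certain values and measure the corresponding error for each value. I'm using the scikit-learn SVM implementation with code like the following: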
```python
from sklearn import svm

# define training data
X = [[0, 0], [1, 1]]
y = [0, 1]

# define and train the SVM
# 'balanced' reweights the classes for unbalanced distributions
# (older scikit-learn releases called this option 'auto')
clf = svm.LinearSVC(C=0.01, class_weight='balanced')
clf.fit(X, y)

# compute false positives and false negatives
# predict expects a 2-D array, so predict on all samples at once
predictions = clf.predict(X)
false_positives = [(a, b) for (a, b) in zip(predictions, y) if a != b and b == 0]
false_negatives = [(a, b) for (a, b) in zip(predictions, y) if a != b and b == 1]
```
Is there a way to play with a parameter (or a few parameters) of the classifier so that one of these metrics is effectively fixed?
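The class_weight parameter allows you to push this false positive rate up or down. Let me use an everyday example to illustrate how this works. Suppose you own a night club, and you operate under two constraints: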
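1. You want as many people as possible to enter the club (paying customers).
2. You do not want any underage people in, as this will get you into trouble.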
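On an average day, (say) only 5% of the people attempting to enter the club will be underage. You are faced with a choice: being lenient or being strict. The former will boost your profits by as much as 5%, but you run the risk of an expensive lawsuit. The latter will inevitably mean that some people who are above the legal age will be denied entry, which will also cost you money. You want to adjust the relative cost of leniency versus strictness. Note: you cannot directly control how many underage people enter the club, but you can control how strict your bouncers are.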
Here is a bit of Python that shows what happens as you change the relative importance of the two classes.
```python
from collections import Counter

import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

data = load_iris()

# keep only the first feature to make the problem harder
# remove the third class for simplicity
X = data.data[:100, 0:1]
y = data.target[:100]

# shuffle data
indices = np.arange(y.shape[0])
np.random.shuffle(indices)
X = X[indices, :]
y = y[indices]

for i in range(1, 20):
    # class 1 gets i times the weight of class 0
    clf = LinearSVC(class_weight={0: 1, 1: i})
    clf = clf.fit(X[:50, :], y[:50])
    print(i, Counter(clf.predict(X[50:])))
    # print(clf.decision_function(X[50:]))
```
which outputs the following (the exact counts vary from run to run because the shuffle is unseeded):

```
1 Counter({1: 22, 0: 28})
2 Counter({1: 31, 0: 19})
3 Counter({1: 39, 0: 11})
4 Counter({1: 43, 0: 7})
5 Counter({1: 43, 0: 7})
6 Counter({1: 44, 0: 6})
7 Counter({1: 44, 0: 6})
8 Counter({1: 44, 0: 6})
9 Counter({1: 47, 0: 3})
10 Counter({1: 47, 0: 3})
11 Counter({1: 47, 0: 3})
12 Counter({1: 47, 0: 3})
13 Counter({1: 47, 0: 3})
14 Counter({1: 47, 0: 3})
15 Counter({1: 47, 0: 3})
16 Counter({1: 47, 0: 3})
17 Counter({1: 48, 0: 2})
18 Counter({1: 48, 0: 2})
19 Counter({1: 48, 0: 2})
```
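Note how the number of data points classified as 0 decreases as the relative weight of class 1 increases. Assuming you have the computational resources and time to train and evaluate 10 classifiers, you can plot the precision and recall of each one and get a figure like the one below (shamelessly stolen off the internet). You can then use that to decide what the correct value of class_weight is for your use case.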
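The sweep described above could be sketched roughly as follows. This is only an illustration under some assumptions: it reuses the iris setup from the earlier snippet, the seed, the 50/50 split and the weight range 1 to 10 are arbitrary choices, and it prints the numbers instead of plotting them.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import precision_score, recall_score
from sklearn.svm import LinearSVC

data = load_iris()
X = data.data[:100, 0:1]   # first feature only, first two classes
y = data.target[:100]

# seeded shuffle so the run is repeatable (the seed value is arbitrary)
rng = np.random.RandomState(0)
indices = rng.permutation(y.shape[0])
X, y = X[indices], y[indices]
X_train, y_train = X[:50], y[:50]
X_test, y_test = X[50:], y[50:]

results = []  # one (weight, precision, recall) tuple per classifier
for w in range(1, 11):  # 10 classifiers, class 1 weighted w times class 0
    clf = LinearSVC(class_weight={0: 1, 1: w}).fit(X_train, y_train)
    pred = clf.predict(X_test)
    # precision and recall are computed for class 1 (the default pos_label)
    precision = precision_score(y_test, pred, zero_division=0)
    recall = recall_score(y_test, pred, zero_division=0)
    results.append((w, precision, recall))
    print(w, round(precision, 3), round(recall, 3))
```

Plotting the recall column against the precision column of `results` gives the kind of curve mentioned above, from which a weight can be picked for the trade-off you need.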
The predict method of LinearSVC in sklearn looks like this:

```python
def predict(self, X):
    """Predict class labels for samples in X.

    Parameters
    ----------
    X : {array-like, sparse matrix}, shape = [n_samples, n_features]
        Samples.

    Returns
    -------
    C : array, shape = [n_samples]
        Predicted class label per sample.
    """
    scores = self.decision_function(X)
    if len(scores.shape) == 1:
        indices = (scores > 0).astype(np.int)
    else:
        indices = scores.argmax(axis=1)
    return self.classes_[indices]
```
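So, in addition to what mbatchkarov suggested, you can change the decisions made by the classifier (any classifier, really) by moving the boundary at which the classifier says something belongs to one class or the other. In the binary case above, that boundary is the value 0 that the decision function scores are compared against.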
```python
from collections import Counter

import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

data = load_iris()

# keep only the first feature to make the problem harder
# remove the third class for simplicity
X = data.data[:100, 0:1]
y = data.target[:100]

# shuffle data
indices = np.arange(y.shape[0])
np.random.shuffle(indices)
X = X[indices, :]
y = y[indices]

# train the classifier (this step was missing from the snippet;
# default parameters are assumed)
clf = LinearSVC().fit(X[:50, :], y[:50])

decision_boundary = 0
print(Counter((clf.decision_function(X[50:]) > decision_boundary).astype(np.int8)))
# output from the original run: Counter({1: 27, 0: 23})

decision_boundary = 0.5
print(Counter((clf.decision_function(X[50:]) > decision_boundary).astype(np.int8)))
# output from the original run: Counter({0: 39, 1: 11})
```
You can tune the decision boundary to whatever you need.
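To tie this back to the original question, one possible way to pick the boundary for a given false positive rate is to let `sklearn.metrics.roc_curve` enumerate the candidate thresholds. The sketch below is an illustration rather than a definitive recipe: `target_fpr` is a hypothetical value, the seed is arbitrary, and class 1 is treated as the positive class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import roc_curve
from sklearn.svm import LinearSVC

data = load_iris()
X = data.data[:100, 0:1]
y = data.target[:100]

# seeded shuffle so the run is repeatable (the seed value is arbitrary)
rng = np.random.RandomState(0)
indices = rng.permutation(y.shape[0])
X, y = X[indices], y[indices]

clf = LinearSVC().fit(X[:50], y[:50])
scores = clf.decision_function(X[50:])
y_test = y[50:]

# for each candidate threshold, roc_curve reports the false and true
# positive rates obtained by predicting class 1 when score >= threshold
fpr, tpr, thresholds = roc_curve(y_test, scores)

target_fpr = 0.1  # hypothetical budget for the false positive rate
within_budget = np.where(fpr <= target_fpr)[0]
best = within_budget[-1]  # fpr is non-decreasing, so this has the highest tpr
decision_boundary = thresholds[best]

# apply the chosen boundary and measure both error rates
predictions = (scores >= decision_boundary).astype(int)
achieved_fpr = np.mean(predictions[y_test == 0] == 1)
achieved_fnr = np.mean(predictions[y_test == 1] == 0)
print(decision_boundary, achieved_fpr, achieved_fnr)
```

Here the boundary is chosen and evaluated on the same 50 held-out points for brevity. In practice you would probably choose it on a validation set and report the error rates on separate test data.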