Question

Why can't I specify a minimum frequency for CountVectorizer when I supply a custom vocabulary in scikit-learn?

Karson2012, posted on 2023-01-14 15:53

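I understand that a minimum frequency can be specified when creating a CountVectorizer in Python's scikit-learn package. However, does that only apply when you do not supply a predefined vocabulary? When I pass my own custom vocabulary (a list), the parameter seems to have no effect.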

To figure this out, I re-read the documentation for the min_df parameter:

Parameters :

min_df : float in range [0.0, 1.0] or int, optional, 1 by default

    When building the vocabulary, ignore terms that have a term frequency strictly lower than the given threshold.  

    This value is also called cut-off in the literature.   

    If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

In my case, I am passing CountVectorizer a custom vocabulary made up of my own terms, which I obtained earlier.

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(vocabulary=my_own_terms, min_df=3)
X = cv.fit_transform(a_big_corpus)

Looking at the output, I still get all sorts of terms that appear only once, twice, and so on.

Has anyone run into this in their own work? If so, is there a possible workaround? Thanks in advance.
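Since the documentation quoted above says min_df is ignored whenever vocabulary is not None, one possible workaround is to apply the threshold by hand after vectorizing. The following is only a minimal sketch, not a built-in scikit-learn feature: the toy values of my_own_terms and a_big_corpus are hypothetical stand-ins for the real data, and the names doc_freq, keep_idx, X_filtered and kept_terms are introduced here purely for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for the real vocabulary and corpus.
my_own_terms = ["apple", "banana", "cherry"]
a_big_corpus = [
    "apple banana",
    "apple cherry",
    "apple banana banana",
    "apple",
]

min_df = 3  # minimum number of documents a term must occur in

# min_df is left out here because it is ignored when a vocabulary is given.
cv = CountVectorizer(vocabulary=my_own_terms)
X = cv.fit_transform(a_big_corpus)

# Document frequency: in how many documents each term occurs at least once.
doc_freq = np.asarray((X > 0).sum(axis=0)).ravel()

# Keep only the columns (terms) that meet the threshold.
keep_idx = np.flatnonzero(doc_freq >= min_df)
X_filtered = X[:, keep_idx]
kept_terms = [my_own_terms[i] for i in keep_idx]

print(dict(zip(my_own_terms, doc_freq.tolist())))
print(kept_terms)
```

The threshold here is an absolute document count; for a proportion (the float form of min_df), doc_freq could be compared against min_df * X.shape[0] instead.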
