Question

DocumentTermMatrix error on Corpus argument

宅囧2502881733, posted 2023-01-09 15:04

I have the following code:

# returns string w/o leading or trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)

news_corpus <- Corpus(VectorSource(news_raw$text)) # a column of strings.

corpus_clean <- tm_map(news_corpus, tolower)
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords('english'))
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
corpus_clean <- tm_map(corpus_clean, trim)

news_dtm <- DocumentTermMatrix(corpus_clean) # errors here

When I call DocumentTermMatrix(), it gives me this error:

Error: inherits(doc, "TextDocument") is not TRUE

Why do I get this error? Are my rows not text documents?

Here is the output when inspecting corpus_clean:

[[153]]
[1] obama holds technical school model us

[[154]]
[1] oil boom produces jobs bonanza archaeologists

[[155]]
[1] islamic terrorist group expands territory captures tikrit

[[156]]
[1] republicans democrats feel eric cantors loss

[[157]]
[1] tea party candidates try build cantor loss

[[158]]
[1] vehicles materials stored delaware bridges

[[159]]
[1] hill testimony hagel defends bergdahl trade

[[160]]
[1] tweet selfpropagates tweetdeck

[[161]]
[1] blackwater guards face trial iraq shootings

[[162]]
[1] calif man among soldiers killed afghanistan

[[163]]
[1] stocks fall back world bank cuts growth outlook

[[164]]
[1] jabhat alnusra longer useful turkey

[[165]]
[1] catholic bishops keep focus abortion marriage

[[166]]
[1] barbra streisand visits hill heart disease

[[167]]
[1] rand paul cantors loss reason stop talking immigration

[[168]]
[1] israeli airstrike kills northern gaza

Edit: Here is my data:

type,text
neutral,The week in 32 photos
neutral,Look at me! 22 selfies of the week
neutral,Inside rebel tunnels in Homs
neutral,Voices from Ukraine
neutral,Water dries up ahead of World Cup
positive,Who's your hero? Nominate them
neutral,Anderson Cooper: Here's how
positive,"At fire scene, she rescues the pet"
neutral,Hunger in the land of plenty
positive,Helping women escape 'the life'
neutral,A tour of the sex underworld
neutral,Miss Universe Thailand steps down
neutral,China's 'naked officials' crackdown
negative,More held over Pakistan stoning
neutral,Watch landmark Cold War series
neutral,In photos: History of the Cold War
neutral,Turtle predicts World Cup winner
neutral,What devoured great white?
positive,Nun wins Italy's 'The Voice'
neutral,Bride Price app sparks debate
neutral,China to deport 'pork' artist
negative,Lightning hits moving car
neutral,Singer won't be silenced
neutral,Poland's mini desert
neutral,When monarchs retire
negative,Murder on Street View?
positive,Meet armless table tennis champ
neutral,Incredible 400 year-old globes
positive,Man saves falling baby
neutral,World's most controversial foods

I load it as follows:

news_raw <- read.csv('news_csv.csv', stringsAsFactors = F)

Edit: Here is the traceback():

> news_dtm <- DocumentTermMatrix(corpus_clean)
Error: inherits(doc, "TextDocument") is not TRUE
> traceback()
9: stop(sprintf(ngettext(length(r), "%s is not TRUE", "%s are not all TRUE"), 
       ch), call. = FALSE, domain = NA)
8: stopifnot(inherits(doc, "TextDocument"), is.list(control))
7: FUN(X[[1L]], ...)
6: lapply(X, FUN, ...)
5: mclapply(unname(content(x)), termFreq, control)
4: TermDocumentMatrix.VCorpus(x, control)
3: TermDocumentMatrix(x, control)
2: t(TermDocumentMatrix(x, control))
1: DocumentTermMatrix(corpus_clean)

When I evaluate inherits(corpus_clean, "TextDocument"), it is FALSE.

3 answers

Change this:

corpus_clean <- tm_map(news_corpus, tolower)

to this:

corpus_clean <- tm_map(news_corpus, content_transformer(tolower))

Answered 2023-01-09 15:06

小菠萝

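It looks like this would have worked fine in tm 0.5.10, but changes in tm 0.6.0 appear to have broken it. The problem is that the functions tolower and trim do not necessarily return TextDocuments (the older version may have done the conversion automatically). They return character vectors instead, and DocumentTermMatrix does not know how to handle a corpus of characters.

So you can change it to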
```
corpus_clean <- tm_map(news_corpus, content_transformer(tolower))
```
or you can run
```
corpus_clean <- tm_map(corpus_clean, PlainTextDocument)
```
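once all of your non-standard transformations (those not listed in getTransformations()) are done and just before you create the DocumentTermMatrix. That should ensure all of your data is in PlainTextDocument form and keep DocumentTermMatrix happy.
Answered 2023-01-09 15:06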
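
As a rough sketch of the first option applied to the whole pipeline from the question: the three headlines below are sample rows taken from the question's data, VCorpus is used explicitly (an assumption, so that each element is a document object), and only tolower and trim are wrapped, since the other steps are tm's own transformations.

```r
library(tm)

# Helper from the question: strips leading and trailing whitespace.
trim <- function(x) gsub("^\\s+|\\s+$", "", x)

# Sample rows from the question's data, standing in for news_raw$text.
headlines <- c("The week in 32 photos",
               "Inside rebel tunnels in Homs",
               "Voices from Ukraine")

news_corpus <- VCorpus(VectorSource(headlines))

# Functions that are not tm transformations are wrapped in
# content_transformer() so each document stays a TextDocument.
corpus_clean <- tm_map(news_corpus, content_transformer(tolower))
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords("english"))
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
corpus_clean <- tm_map(corpus_clean, content_transformer(trim))

news_dtm <- DocumentTermMatrix(corpus_clean)
print(dim(news_dtm))
```

The resulting matrix has one row per headline; the number of columns depends on which terms survive the cleaning steps.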

薛薛Sying

I found a way to solve this problem in an article about tm.

An example that produces the error:

getwd()
require(tm)
files <- DirSource(directory="texts/", encoding="latin1") # import files
corpus <- VCorpus(x=files) # load files, create corpus

summary(corpus) # get a summary
corpus <- tm_map(corpus,removePunctuation)
corpus <- tm_map(corpus,stripWhitespace)
corpus <- tm_map(corpus,removePunctuation);
matrix_terms <- DocumentTermMatrix(corpus)

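Warning message:

In TermDocumentMatrix.VCorpus(x, control) : invalid document identifiers

This error occurs because DocumentTermMatrix expects a corpus of text documents (here built from a VectorSource), but the earlier transformations convert the documents into plain character vectors, a class the function does not accept.

However, if you wrap each function in content_transformer inside the tm_map call, you may not need any extra command before calling TermDocumentMatrix.

The code below changes the class (see the second-to-last line) and avoids the error: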

getwd()
require(tm)
files <- DirSource(directory="texts/", encoding="latin1")
corpus <- VCorpus(x=files) # load files, create corpus

summary(corpus) # get a summary
corpus <- tm_map(corpus,content_transformer(removePunctuation))
corpus <- tm_map(corpus,content_transformer(stripWhitespace))
corpus <- tm_map(corpus,content_transformer(removePunctuation))
corpus <- Corpus(VectorSource(corpus)) # change class 
matrix_term <- DocumentTermMatrix(corpus)
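
Whichever fix is used, one way to confirm the corpus is in the state DocumentTermMatrix expects is to check the class of each document rather than of the corpus itself. The traceback in the question shows the check being applied per document. This is a minimal sketch with two made-up input strings (hypothetical sample data), assuming a VCorpus:

```r
library(tm)

# Hypothetical sample data standing in for the real corpus.
docs <- c("Stocks fall back", "Tea party candidates")
corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))

# The corpus is a container, not a document, so this check on the
# corpus itself is not informative.
corpus_is_doc <- inherits(corpus, "TextDocument")

# What matters is the class of every element stored in the corpus.
docs_ok <- vapply(content(corpus), inherits, logical(1),
                  what = "TextDocument")

print(corpus_is_doc)
print(all(docs_ok))
```

If any element of docs_ok is FALSE, a transformation applied earlier has probably returned a bare character vector.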

Answered 2023-01-09 15:06

晓梦
