I have a list of URLs; I have fetched the web content of each and wrapped it in a tm corpus:
library(tm)
library(XML)

link <- c(
  "http://www.r-statistics.com/tag/hadley-wickham/",
  "http://had.co.nz/",
  "http://vita.had.co.nz/articles.html",
  "http://blog.revolutionanalytics.com/2010/09/the-r-files-hadley-wickham.html",
  "http://www.analyticstory.com/hadley-wickham/"
)

create.corpus <- function(url.name) {
  doc <- htmlParse(url.name)
  # extract the text of all <p> nodes
  parag <- xpathSApply(doc, '//p', xmlValue)
  if (length(parag) == 0) {
    parag <- "empty"
  }
  cc <- Corpus(VectorSource(parag))
  meta(cc, "link") <- url.name
  return(cc)
}

cc <- lapply(link, create.corpus)
This gives me a "big list" of corpora, one per URL. I can combine them two at a time:
x <- cc[[1]]
y <- cc[[2]]
z <- c(x, y, recursive = TRUE)  # preserves metadata
x; y; z
# A corpus with 8 text documents
# A corpus with 2 text documents
# A corpus with 10 text documents
But this becomes infeasible with a list of several thousand corpora. So how can I merge a list of corpora into a single corpus while preserving the metadata?
You can use do.call to call c across the whole list:
do.call(function(...) c(..., recursive = TRUE), cc)
# A corpus with 155 text documents
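The do.call idiom generalizes beyond tm: it unpacks a list into the arguments of a single function call, so the whole list is combined at once rather than element by element. A minimal base-R sketch (plain character vectors standing in for corpora, purely for illustration):

```r
# Stand-ins for corpora: a list of character vectors, one per URL
cc <- list(c("a", "b", "c"), c("d", "e"), "f")

# do.call(c, cc) is equivalent to c(cc[[1]], cc[[2]], cc[[3]]):
# the list elements become the arguments of a single call to c()
merged <- do.call(c, cc)
length(merged)  # 6

# Extra named arguments (like recursive = TRUE for corpora) can be
# supplied via an anonymous wrapper, as in the answer above
merged2 <- do.call(function(...) c(..., recursive = TRUE), cc)
identical(merged, merged2)  # TRUE
```

The anonymous-function wrapper matters for corpora because c.Corpus only merges (rather than nests) the corpora when recursive = TRUE is passed along.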