Let's say I have a vector of variables like this:
>variable
[1] "A1" "A1" "A1" "A1" "A2" "A2" "A2" "A2" "B1" "B1" "B1" "B1"
and I want to convert this into a data frame like this:
treatment time
1 A 1
2 A 1
3 A 1
4 A 1
5 A 2
6 A 2
7 A 2
8 A 2
9 B 1
10 B 1
11 B 1
12 B 1
To that end, I used reshape2's colsplit function. It requires a pattern to split the string, but I quickly realized there is no obvious pattern for splitting two characters that have no separator between them. I tried "" and got the following results:
> colsplit(trialm$variable,"",names=c("treatment","time"))
treatment time
1 NA A1
2 NA A1
3 NA A1
4 NA A1
5 NA A2
6 NA A2
7 NA A2
8 NA A2
9 NA B1
10 NA B1
11 NA B1
12 NA B1
I also tried a lookbehind or lookahead regular expression:
>colsplit(trialm$variable,"(?<=\\w)",names=c("treatment","time"))
Error in gregexpr("(?<=\\w)", c("A1", "A1", "A1", "A1", "A2", "A2", "A2", :
invalid regular expression '(?<=\w)', reason 'Invalid regexp'
but it gave me the above error. How can I solve this problem?
7
substr is another way to do it.
> variable <- c(rep("A1", 4), rep("A2", 4), rep("B1", 4))
> data.frame(treatment=substr(variable, 1,1), time=as.numeric(substr(variable,2,2)))
  treatment time
1 A 1
2 A 1
3 A 1
4 A 1
5 A 2
6 A 2
7 A 2
8 A 2
9 B 1
10 B 1
11 B 1
12 B 1
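Fixed positions only work while every code is exactly one letter followed by one digit. If that may not hold (say "AB3" or "A12"), a minimal base R sketch along these lines could be used instead. It assumes every value is a run of non-digits followed by a run of digits, and the input values below are made up for illustration:

```r
# Hypothetical input with labels and numbers of varying width
codes <- c("A1", "A12", "AB3")

# Drop the trailing digits to get the label,
# drop the leading non-digits to get the number
result <- data.frame(
  treatment = sub("[0-9]+$", "", codes),
  time = as.numeric(sub("^[^0-9]+", "", codes)),
  stringsAsFactors = FALSE
)
result
```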
9
Somewhere along the line, the "stringr" package (which is imported by "reshape2" and which is responsible for the splitting that takes place in colsplit) started to use "stringi" for several of its functions. Some behavior seems to have changed because of that.
Using the current "reshape2" (and current "stringr" package), colsplit works the way you would have expected it to with your code:
packageVersion("reshape2")
## [1] ‘1.4.3’
packageVersion("stringr")
## [1] ‘1.2.0’
colsplit(variable, "", names = c("treatment", "time"))
## treatment time
## 1 A 1
## 2 A 1
## 3 A 1
## 4 A 1
## 5 A 2
## 6 A 2
## 7 A 2
## 8 A 2
## 9 B 1
## 10 B 1
## 11 B 1
## 12 B 1
If a pattern can be detected in your "variable" but there is no clean split character that can be used, then add one :)
library(reshape2)
variable <- c("A1", "A1", "A1", "A1", "A2", "A2",
"A2", "A2", "B1", "B1", "B1", "B1")
## Here, we add a "." between upper case letters and numbers
colsplit(gsub("([A-Z])([0-9])", "\\1\\.\\2", variable),
"\\.", c("Treatment", "Time"))
# Treatment Time
# 1 A 1
# 2 A 1
# 3 A 1
# 4 A 1
# 5 A 2
# ::::: snip :::: #
# 11 B 1
# 12 B 1
My "splitstackshape" package has a single-purpose non-exported helper function called NoSep that can be used for this:
splitstackshape:::NoSep(variable)
## .var .time_1
## 1 A 1
## 2 A 1
## 3 A 1
## 4 A 1
## 5 A 2
## ::: snip :::: #
## 11 B 1
## 12 B 1
The "tidyverse" (specifically the "tidyr" package) has a couple of convenient functions for splitting values into different columns: separate and extract. separate has already been demonstrated by jazzuro, but that solution is very specific to this particular problem. Also, separate generally works better with a delimiter. extract expects you to specify a regular expression with the groups you want to capture:
library(tidyverse)
data.frame(variable) %>%
extract(variable, into = c("Treatment", "Time"), regex = "([A-Z]+)([0-9]+)")
# Treatment Time
# 1 A 1
# 2 A 1
# 3 A 1
# 4 A 1
# 5 A 2
# ::::: snip :::: #
# 11 B 1
# 12 B 1
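Note that extract returns both new columns as character strings by default. If Time should be numeric, the convert argument of tidyr's extract can handle that; the sketch below assumes the same letters-then-digits layout as above:

```r
library(tidyr)

variable <- c(rep("A1", 4), rep("A2", 4), rep("B1", 4))

# convert = TRUE runs type.convert() on the new columns,
# so Time is stored as an integer rather than a character string
out <- extract(data.frame(variable, stringsAsFactors = FALSE), variable,
               into = c("Treatment", "Time"),
               regex = "([A-Z]+)([0-9]+)",
               convert = TRUE)
str(out)
```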
5
You can use substr to split it, e.g.:
df <- data.frame(treatment = substr(variable, start = 1, stop = 1),
time = substr(variable, start = 2, stop = 2) )
5
If you create a data frame from the vector variable, you can now use separate() from the tidyr package.
mydf <- data.frame(variable = c(rep("A1", 4), rep("A2", 4), rep("B1", 4)),
stringsAsFactors = FALSE)
library(tidyr)
# sep = 1 splits after the first character
separate(mydf, variable, c("treatment", "time"), sep = 1)
#   treatment time
#1 A 1
#2 A 1
#3 A 1
#4 A 1
#5 A 2
#6 A 2
#7 A 2
#8 A 2
#9 B 1
#10 B 1
#11 B 1
#12 B 1
4
Another solution, using regular expressions:
require(stringr)
variable <- c(paste0("A", c(rep(1, 4), rep(2, 3))),
paste0("B", rep(1, 4))
)
data.frame(
treatment = str_extract(variable, "[[:alpha:]]"),
time = as.numeric(str_extract(variable, "[[:digit:]]"))
)
## treatment time
## 1 A 1
## 2 A 1
## 3 A 1
## 4 A 1
## 5 A 2
## 6 A 2
## 7 A 2
## 8 B 1
## 9 B 1
## 10 B 1
## 11 B 1
4
A new function, tstrsplit(), was introduced in data.table v1.9.5. The t stands for transpose: it returns the result of splitting a character vector with strsplit() and then transposing it.
# dummy data
library(data.table)
dt <- data.table(var = c(rep("A1", 4), rep("A2", 4), rep("B1", 4)))
Using tstrsplit():
dt[, tstrsplit(var, "")]
V1 V2
1: A 1
2: A 1
3: A 1
4: A 1
5: A 2
6: A 2
7: A 2
8: A 2
9: B 1
10: B 1
11: B 1
12: B 1
Yes, it's that easy. :-)
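To keep the pieces as named columns in the same table, a sketch along these lines should work. The column names treatment and time are my own choice, and type.convert is an argument of tstrsplit() in released versions of data.table:

```r
library(data.table)

dt <- data.table(var = c(rep("A1", 4), rep("A2", 4), rep("B1", 4)))

# Assign the two pieces to named columns by reference;
# type.convert = TRUE turns the digit column into integers
dt[, c("treatment", "time") := tstrsplit(var, "", type.convert = TRUE)]
dt
```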
3
You can use substring() to create vectors then join them using the data.frame function.
yyy<-c("A1", "A1", "A1", "A1", "A2", "A2", "A2", "A2", "B1", "B1", "B1", "B1")
treatment<-substring(yyy, 1,1)
time<-as.numeric(substring(yyy,2,2))
data.frame(treatment,time)
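If the time part can be longer than one digit, the stop position can simply be left out. A small sketch, assuming a single-letter treatment and made-up values:

```r
yyy <- c("A1", "A12", "B105")  # hypothetical values with longer time parts

treatment <- substring(yyy, 1, 1)
# With no 'last' argument, substring() runs to the end of each string
time <- as.numeric(substring(yyy, 2))
data.frame(treatment, time)
```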
2
You could just use strsplit:
df <- t(data.frame(strsplit(variable, "")))
rownames(df) <- NULL
colnames(df) <- c("treatment" , "time" )
df
treatment time
[1,] "A" "1"
[2,] "A" "1"
[3,] "A" "1"
[4,] "A" "1"
[5,] "A" "2"
[6,] "A" "2"
[7,] "A" "2"
[8,] "A" "2"
[9,] "B" "1"
[10,] "B" "1"
[11,] "B" "1"
[12,] "B" "1"
Instead of using t, you can use rbind and then coerce to data.frame, as follows:
setNames(as.data.frame(do.call(rbind, strsplit(variable, ""))),
c("Treatment", "Time"))
# Treatment Time
# 1 A 1
# 2 A 1
# 3 A 1
# 4 A 1
# 5 A 2
# 6 A 2
# 7 A 2
# 8 B 1
# 9 B 1
# 10 B 1
# 11 B 1
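Splitting on "" breaks the string into single characters, so it only suits two-character codes. A base R alternative that captures the letter part and the digit part as groups might look like the sketch below. It assumes every element matches the pattern; elements that do not match would be dropped silently by rbind:

```r
variable <- c(rep("A1", 4), rep("A2", 4), rep("B1", 4))

# regexec() returns the full match plus one entry per capture group
m <- regmatches(variable, regexec("^([A-Za-z]+)([0-9]+)$", variable))
parts <- do.call(rbind, m)  # columns: full match, letters, digits

res <- data.frame(Treatment = parts[, 2],
                  Time = as.numeric(parts[, 3]),
                  stringsAsFactors = FALSE)
res
```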
1
Based on the comment of @Justin, I suggest this (using v <- c("A1", "B2")):
> t(sapply(strsplit(v, ''), '[', c(1, 2)))
[,1] [,2]
[1,] "A" "1"
[2,] "B" "2"
The vector after '[' selects which items to keep from each split vector. So I split only once and keep both items. If you want to keep every item, this may be even easier:
t(sapply(strsplit(v, ''), identity))
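One caveat with sapply is that it quietly returns a list rather than a matrix when the pieces have different lengths. A stricter variant, sketched here on the assumption that every element splits into exactly two characters, uses vapply so that a mismatch raises an error instead:

```r
v <- c("A1", "B2")

# vapply() checks that every split yields exactly two character pieces
m <- t(vapply(strsplit(v, ""), identity, character(2)))
m
```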