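Author: 山野木每子 | Source: Internet | 2022-12-09 12:34
I am trying to track down a performance problem and have largely isolated it to a multi-column non-equi join. Below is a reasonable (but not exact) example of what I am trying to do, along with timings.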
library(quantmod)
library(data.table)
p <- last(OHLC(getSymbols("SPY", auto.assign = FALSE)), 700)
d <- as.data.table(p) #convert to a data.table for processing
d[, index := as.POSIXct(index)] #to match my use case. leaving as Date does not significantly alter timings
setnames(d, c("index", "Open", "High", "Low", "Close"))
# create partitions for analysis
partitions <- unique(d[d, .(Top = x.Close, Bot = i.Close, Start = pmin(x.index, i.index)),
on = .(Close >= Close), allow.cartesian = T][!is.na(Start)])
#desired analysis
system.time(r1 <- d[partitions, .(i.Top, i.Bot, i.Start, mean(x.Close), sd(x.Close)),
on = .(Close >= Bot, Close <= Top, index >= Start), allow.cartesian = T, by = .EACHI])
#7.67
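A join on a single column with the same dataset is much faster (but does not produce the desired result). It is shown here only for timing comparison: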
system.time(r2 <- d[partitions, .(i.Top, i.Bot, i.Start, mean(x.Close), sd(x.Close)),
on = .(Close >= Bot, Close <= Top), allow.cartesian = T, by = .EACHI])
#4.4
system.time(r4 <- d[partitions, .(i.Top, i.Bot, i.Start, mean(x.Close), sd(x.Close)),
on = .(index >= Start), allow.cartesian = T, by = .EACHI])
#4.67
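I know I can speed this up by reducing the number of rows in the partitions table, but I have already trimmed it as far as I can, and it is still slow. I also understand that this requires materializing a very large join under the hood, but the materialized join is even larger with only a single-column constraint, so the relative performance still puzzles me.
Am I doing something wrong? I really don't understand why adding the second column's condition causes such a sharp slowdown. Any suggestions on how to make this faster?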
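One way to check the claim about join sizes is to count the matched rows per partition with .N instead of computing statistics. The sketch below uses small made-up tables (toy_d and toy_p are hypothetical stand-ins for d and partitions, not SPY data), so it only illustrates the technique and says nothing about the real timings.

```r
library(data.table)

# Hypothetical toy data standing in for d and partitions (not the real SPY series)
toy_d <- data.table(index = 1:6, Close = c(10, 12, 11, 13, 12, 14))
toy_p <- data.table(Top = c(12, 14), Bot = c(10, 12), Start = c(2L, 4L))

# Rows of toy_d matched by each partition when all three conditions apply
n3 <- toy_d[toy_p, .N,
            on = .(Close >= Bot, Close <= Top, index >= Start),
            allow.cartesian = TRUE, by = .EACHI]

# Rows matched when only the two Close conditions apply
n2 <- toy_d[toy_p, .N,
            on = .(Close >= Bot, Close <= Top),
            allow.cartesian = TRUE, by = .EACHI]

# Total size of each join; adding a condition can only shrink or keep it
total3 <- sum(n3$N)
total2 <- sum(n2$N)
print(c(three_conditions = total3, two_conditions = total2))
```

Running the same two counts against the real d and partitions would show how many rows each variant actually has to visit, which helps separate the cost of matching from the cost of evaluating j.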
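Edit 7/30/18
So, after trying the very useful verbose = T feature, I found another aspect of the problem: median() seems to be very slow in this context.
First, the existing analysis using mean(), with verbose output: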
r1 <- d[partitions, .(i.Top, i.Bot, i.Start, mean(x.Close), sd(x.Close)),
on = .(Close >= Bot, Close <= Top, index >= Start), allow.cartesian = T, by = .EACHI, verbose = T]
Non-equi join operators detected ...
forder took ... 0.000sec
Generating non-equi group ids ... done in 0.000sec
Recomputing forder with non-equi ids ... done in 0.000sec
Found 26 non-equi group(s) ...
Starting bmerge ...done in 0.790sec
Detected that j uses these columns: i.Top,i.Bot,i.Start,x.Close
lapply optimization is on, j unchanged as 'list(i.Top, i.Bot, i.Start, mean(x.Close), sd(x.Close))'
Old mean optimization changed j from 'list(i.Top, i.Bot, i.Start, mean(x.Close), sd(x.Close))' to 'list(i.Top, i.Bot, i.Start, .External(Cfastmean, x.Close, FALSE), sd(x.Close))'
Making each group and running j (GForce FALSE) ...
collecting discontiguous groups took 0.077s for 235273 groups
eval(j) took 4.475s for 235273 calls
4.690sec
Next, a similar analysis using median(), again with verbose output:
r1 <- d[partitions, .(i.Top, i.Bot, i.Start, median(x.Close), sd(x.Close)),
on = .(Close >= Bot, Close <= Top, index >= Start), allow.cartesian = T, by = .EACHI, verbose = T]
Non-equi join operators detected ...
forder took ... 0.000sec
Generating non-equi group ids ... done in 0.000sec
Recomputing forder with non-equi ids ... done in 0.000sec
Found 26 non-equi group(s) ...
Starting bmerge ...done in 0.810sec
Detected that j uses these columns: i.Top,i.Bot,i.Start,x.Close
lapply optimization is on, j unchanged as 'list(i.Top, i.Bot, i.Start, median(x.Close), sd(x.Close))'
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ...
collecting discontiguous groups took 0.079s for 235273 groups
eval(j) took 12.826s for 235273 calls
13.1sec
For reference:
> getOption("datatable.optimize")
[1] Inf
So I guess another question is: is there any way to speed up the median() call in the context of a non-equi join with by = .EACHI?
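Since the verbose output reports "GForce FALSE", j is evaluated in R once per group, so any per-call overhead in median() is paid 235273 times. One experiment that might be worth trying is a stripped-down median that skips S3 dispatch and NA handling. The helper plain_median below is hypothetical (it is not part of data.table or base R), it assumes every group is a non-empty numeric vector without NA values, and whether it is actually faster on the real data is an assumption that would need to be measured.

```r
library(data.table)

# Hypothetical helper: median without S3 dispatch or NA handling.
# Assumes x is a non-empty numeric vector containing no NA values.
plain_median <- function(x) {
  n <- length(x)
  half <- (n + 1L) %/% 2L
  if (n %% 2L == 1L) {
    # Odd length: the middle element after a partial sort
    sort(x, partial = half)[half]
  } else {
    # Even length: the mean of the two middle elements
    mean(sort(x, partial = half + 0:1)[half + 0:1])
  }
}

# Hypothetical toy data standing in for d and partitions (not the real SPY series)
toy_d <- data.table(index = 1:6, Close = c(10, 12, 11, 13, 12, 14))
toy_p <- data.table(Top = c(12, 14), Bot = c(10, 12), Start = c(2L, 4L))

# Same join shape as in the question, computing both medians side by side
res <- toy_d[toy_p,
             .(med_base = median(x.Close), med_plain = plain_median(x.Close)),
             on = .(Close >= Bot, Close <= Top, index >= Start),
             allow.cartesian = TRUE, by = .EACHI]
print(res)
```

On the real tables, wrapping each variant in system.time() as in the timings above would show whether the per-call overhead is really the bottleneck. If some partitions can match no rows, the helper would need an explicit guard for that case.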