data.table vs dplyr: can one do something well that the other can't or does poorly?

Asked by xia on 2023-01-30 15:54

We need to cover at least these aspects to provide a comprehensive answer/comparison (in no particular order of importance): Speed, Memory usage, Syntax and Features.

My intent is to cover each one of these as clearly as possible from a data.table perspective.

Note: unless explicitly mentioned otherwise, by referring to dplyr, we refer to dplyr's data.frame interface whose internals are in C++ using Rcpp.


The data.table syntax is consistent in its form - DT[i, j, by]. Keeping i, j and by together is by design. By keeping related operations together, data.table can easily optimise operations for speed and, more importantly, memory usage, and also provide some powerful features, all while maintaining consistency in syntax.
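
As a minimal illustration of the general form (toy data of our own, not from the question):

library(data.table)
DT <- data.table(grp = c("a", "a", "b"), val = 1:3)

# i: which rows, j: what to compute, by: grouped how
DT[val > 1, sum(val), by = grp]
#    grp V1
# 1:   a  2
# 2:   b  3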

1. Speed

Quite a few benchmarks (mostly on grouping operations) have already been added to the question, showing that data.table gets faster than dplyr as the number of groups and/or rows to group increases, including benchmarks by Matt on grouping from 10 million to 2 billion rows (100GB in RAM) across 100 to 10 million groups with varying grouping columns, which also compare pandas. See also the updated benchmarks, which include Spark and pydatatable as well.
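
For readers who want to get a feel for this locally, here is a minimal sketch of such a grouping benchmark (the sizes and seed are our own illustrative choices, far smaller than the benchmarks cited above):

library(data.table)
library(dplyr)

N <- 1e7; K <- 1e6                        # 10 million rows, up to 1 million groups
set.seed(1)
DF <- data.frame(g = sample(K, N, TRUE), v = rnorm(N))
DT <- as.data.table(DF)

system.time(DT[, .(s = sum(v)), by = g])                  # data.table
system.time(DF %>% group_by(g) %>% summarise(s = sum(v))) # dplyr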

On benchmarks, it would be great to cover these remaining aspects as well:

Grouping operations involving a subset of rows - i.e., DT[x > val, sum(y), by = z] type operations.

Benchmark other operations such as updates and joins.

Also benchmark memory footprint for each operation in addition to runtime.

2. Memory usage

    Operations involving filter() or slice() in dplyr can be memory inefficient (on both data.frames and data.tables). See this post.

    Note that Hadley's comment talks about speed (that dplyr is plenty fast for him), whereas the major concern here is memory.

    The data.table interface currently allows one to modify/update columns by reference (note that we don't need to re-assign the result back to a variable).

    # sub-assign by reference, updates 'y' in-place
    DT[x >= 1L, y := NA]
    

    But dplyr will never update by reference. The dplyr equivalent would be (note that the result needs to be re-assigned):

    # copies the entire 'y' column
    ans <- DF %>% mutate(y = replace(y, which(x >= 1L), NA))
    

    A concern for this is referential transparency. Updating a data.table object by reference, especially within a function, may not always be desirable. But this is an incredibly useful feature: see this and this post for interesting cases. And we want to keep it.
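
    To make the referential transparency concern concrete, here is a small illustration (the helper f is our own, hypothetical):

    f <- function(d) d[, flag := TRUE]  ## hypothetical helper; updates by reference
    DT <- data.table(x = 1:3)
    f(DT)
    names(DT)                           ## "x" "flag" - the caller's DT was modified too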

    Therefore we are working towards exporting a shallow() function in data.table that will provide the user with both possibilities. For example, if it is desirable not to modify the input data.table within a function, one can then do:

    foo <- function(DT) {
        DT = shallow(DT)          ## shallow copy DT
        DT[, newcol := 1L]        ## does not affect the original DT 
        DT[x > 2L, newcol := 2L]  ## no need to copy (internally), as this column exists only in shallow copied DT
        DT[x > 2L, x := 3L]       ## have to copy (like base R / dplyr does always); otherwise original DT will 
                                  ## also get modified.
    }
    

    By not using shallow(), the old functionality is retained:

    bar <- function(DT) {
        DT[, newcol := 1L]        ## old behaviour, original DT gets updated by reference
        DT[x > 2L, x := 3L]       ## old behaviour, update column x in original DT.
    }
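
    As an aside, until shallow() is exported, the already-exported copy() function gives full value semantics inside a function, at the cost of deep-copying every column - a minimal sketch:

    baz <- function(DT) {
        DT <- copy(DT)            ## deep copy: duplicates all columns up front
        DT[, newcol := 1L]        ## safe; the caller's DT is untouched
        DT
    }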
    

    By creating a shallow copy using shallow(), we understand that you don't want to modify the original object. We handle everything internally to ensure that columns are copied only when it is absolutely necessary to modify them. When implemented, this should settle the referential transparency issue altogether, while providing the user with both possibilities.

    Also, once shallow() is exported, dplyr's data.table interface should avoid almost all copies. So those who prefer dplyr's syntax can use it with data.tables.

    But it will still lack many features that data.table provides, including (sub)-assignment by reference.

    Aggregate while joining:

    Suppose you have two data.tables as follows:

    DT1 = data.table(x=c(1,1,1,1,2,2,2,2), y=c("a", "a", "b", "b"), z=1:8, key=c("x", "y"))
    #    x y z
    # 1: 1 a 1
    # 2: 1 a 2
    # 3: 1 b 3
    # 4: 1 b 4
    # 5: 2 a 5
    # 6: 2 a 6
    # 7: 2 b 7
    # 8: 2 b 8
    DT2 = data.table(x=1:2, y=c("a", "b"), mul=4:3, key=c("x", "y"))
    #    x y mul
    # 1: 1 a   4
    # 2: 2 b   3
    

    and you would like to get sum(z) * mul for each row in DT2 while joining by columns x,y. We can either:

    1) aggregate DT1 to get sum(z), 2) perform a join and 3) multiply (or)

    # data.table way
    DT1[, .(z = sum(z)), keyby = .(x,y)][DT2][, z := z*mul][]
    
    # dplyr equivalent
    DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>% 
        right_join(DF2) %>% mutate(z = z * mul)
    

    2) do it all in one go (using the by = .EACHI feature):

    DT1[DT2, list(z=sum(z) * mul), by = .EACHI]
    

    What is the advantage?

    We don't have to allocate memory for the intermediate result.

    We don't have to group/hash twice (once for aggregation and once for joining).

    And more importantly, the operation we wanted to perform is clear by looking at j of (2).

    Check this post for a detailed explanation of by = .EACHI. No intermediate results are materialised, and the join+aggregate is performed all in one go.

    Have a look at this, this and this post for real usage scenarios.

    In dplyr you would have to join and then aggregate, or aggregate first and then join, neither of which is as efficient in terms of memory (which in turn translates to speed).

    Update and joins:

    Consider the data.table code shown below:

    DT1[DT2, col := i.mul]
    

    adds/updates DT1's column col with mul from DT2 on those rows where DT2's key columns match DT1. I don't think there is an exact equivalent of this operation in dplyr, i.e., without resorting to a *_join operation, which would have to copy the entire DT1 just to add a new column to it, which is unnecessary.
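
    For comparison, the closest dplyr formulation we can think of is a sketch along the following lines; it materialises a modified copy of the whole table instead of updating in place:

    # dplyr approximation (our sketch): copies DF1 to add the column
    DF1 <- DF1 %>%
        left_join(select(DF2, x, y, col = mul), by = c("x", "y"))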

    Check this post for a real usage scenario.

To summarise, it is important to realise that every bit of optimisation matters. As Grace Hopper would say, Mind your nanoseconds!

3. Syntax

Let's now look at syntax. Hadley commented here:

Data tables are extremely fast but I think their concision makes it harder to learn and code that uses it is harder to read after you have written it ...

I find this remark pointless because it is very subjective. What we can perhaps try is to contrast consistency in syntax. We will compare data.table and dplyr syntax side-by-side.

We will work with the dummy data shown below:

DT = data.table(x=1:10, y=11:20, z=rep(1:2, each=5))
DF = as.data.frame(DT)

    Basic aggregation/update operations.

    # case (a)
    DT[, sum(y), by = z]                       ## data.table syntax
    DF %>% group_by(z) %>% summarise(sum(y)) ## dplyr syntax
    DT[, y := cumsum(y), by = z]
    ans <- DF %>% group_by(z) %>% mutate(y = cumsum(y))
    
    # case (b)
    DT[x > 2, sum(y), by = z]
    DF %>% filter(x>2) %>% group_by(z) %>% summarise(sum(y))
    DT[x > 2, y := cumsum(y), by = z]
    ans <- DF %>% group_by(z) %>% mutate(y = replace(y, which(x > 2), cumsum(y)))
    
    # case (c)
    DT[, if(any(x > 5L)) y[1L]-y[2L] else y[2L], by = z]
    DF %>% group_by(z) %>% summarise(if (any(x > 5L)) y[1L] - y[2L] else y[2L])
    DT[, if(any(x > 5L)) y[1L] - y[2L], by = z]
    DF %>% group_by(z) %>% filter(any(x > 5L)) %>% summarise(y[1L] - y[2L])
    

    data.table syntax is compact and dplyr's quite verbose. Things are more or less equivalent in case (a).

    In case (b), we had to use filter() in dplyr while summarising. But while updating, we had to move the logic inside mutate(). In data.table, however, we express both operations with the same logic - operate on rows where x > 2, but in the first case get sum(y), whereas in the second case update those rows of y with its cumulative sum.

    This is what we mean when we say the DT[i, j, by] form is consistent.

    Similarly in case (c), when we have an if-else condition, we are able to express the logic "as-is" in both data.table and dplyr. However, if we would like to return just those rows where the if condition is satisfied and skip otherwise, we cannot use summarise() directly (AFAICT). We have to filter() first and then summarise, because summarise() always expects a single value.

    While it returns the same result, using filter() here makes the actual operation less obvious.

    It might very well be possible to use filter() in the first case as well (it does not seem obvious to me), but my point is that we should not have to.

    Aggregation/update on multiple columns

    # case (a)
    DT[, lapply(.SD, sum), by = z]                     ## data.table syntax
    DF %>% group_by(z) %>% summarise_each(funs(sum)) ## dplyr syntax
    DT[, (cols) := lapply(.SD, sum), by = z]
    ans <- DF %>% group_by(z) %>% mutate_each(funs(sum))
    
    # case (b)
    DT[, c(lapply(.SD, sum), lapply(.SD, mean)), by = z]
    DF %>% group_by(z) %>% summarise_each(funs(sum, mean))
    
    # case (c)
    DT[, c(.N, lapply(.SD, sum)), by = z]     
    DF %>% group_by(z) %>% summarise_each(funs(n(), mean))
    

    In case (a), the code is more or less equivalent. data.table uses the familiar base function lapply(), whereas dplyr introduces *_each() along with a bunch of functions to pass to funs().

    data.table's := requires column names to be provided, whereas dplyr generates them automatically.

    In case (b), dplyr's syntax is relatively straightforward. Improving aggregations/updates on multiple functions is on data.table's list.

    In case (c) though, dplyr would return n() as many times as there are columns, instead of just once. In data.table, all we need to do is to return a list in j. Each element of the list will become a column in the result. So we can, once again, use the familiar base function c() to concatenate .N with the list from lapply(), which returns a list.

    Note: Once again, in data.table, all we need to do is return a list in j. Each element of the list will become a column in the result. You can use c(), as.list(), lapply(), list() etc. base functions to accomplish this, without having to learn any new functions.

    You will need to learn just the special variables .N and .SD at least; the equivalents in dplyr are n() and . respectively.
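
    A quick illustration of both special variables on the dummy data above:

    DT[, .N, by = z]                ## .N: number of rows in each group
    DT[, lapply(.SD, max), by = z]  ## .SD: all non-grouping columns, per group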

    Joins

    dplyr provides separate functions for each type of join, whereas data.table allows joins using the same DT[i, j, by] syntax (and with reason). It also provides an equivalent merge.data.table() function as an alternative.

    setkey(DT1, x, y)
    
    # 1. normal join
    DT1[DT2]            ## data.table syntax
    left_join(DT2, DT1) ## dplyr syntax
    
    # 2. select columns while join    
    DT1[DT2, .(z, i.mul)]
    left_join(select(DT2, x, y, mul), select(DT1, x, y, z))
    
    # 3. aggregate while join
    DT1[DT2, .(sum(z) * i.mul), by = .EACHI]
    DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>% 
        inner_join(DF2) %>% mutate(z = z*mul) %>% select(-mul)
    
    # 4. update while join
    DT1[DT2, z := cumsum(z) * i.mul, by = .EACHI]
    ??
    
    # 5. rolling join
    DT1[DT2, roll = -Inf]
    ??
    
    # 6. other arguments to control output
    DT1[DT2, mult = "first"]
    ??
    

    Some might find a separate function for each join much nicer (left, right, inner, anti, semi etc.), whereas others might like data.table's DT[i, j, by], or merge() which is similar to base R.

    However dplyr joins do just that. Nothing more. Nothing less.

    data.tables can select columns while joining (2), and in dplyr you will need to select() first on both data.frames before joining, as shown above. Otherwise you would materialise the join with unnecessary columns only to remove them later, and that is inefficient.

    data.tables can aggregate while joining (3) and also update while joining (4), using the by = .EACHI feature. Why materialise the entire join result to add/update just a few columns?

    data.table is capable of rolling joins (5) - roll forward, LOCF, roll backward, NOCB, nearest.
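
    For example, a rolling join matches each observation to the most recent value at or before it - a minimal sketch with made-up data:

    prices <- data.table(time = c(1L, 5L, 9L), price = c(10, 11, 12), key = "time")
    trades <- data.table(time = c(2L, 6L), key = "time")
    prices[trades, roll = TRUE]     ## LOCF: each trade gets the last known price
    #    time price
    # 1:    2    10
    # 2:    6    11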

    data.table also has mult = argument which selects first, last or all matches (6).

    data.table has allow.cartesian = TRUE argument to protect from accidental invalid joins.

Once again, the syntax is consistent with DT[i, j, by] with additional arguments allowing for controlling the output further.

    do()...

    dplyr's summarise() is specially designed for functions that return a single value. If your function returns multiple/unequal values, you will have to resort to do(). You have to know beforehand what all your functions will return.

    DT[, list(x[1], y[1]), by = z]                 ## data.table syntax
    DF %>% group_by(z) %>% summarise(x[1], y[1]) ## dplyr syntax
    DT[, list(x[1:2], y[1]), by = z]
    DF %>% group_by(z) %>% do(data.frame(.$x[1:2], .$y[1]))
    
    DT[, quantile(x, 0.25), by = z]
    DF %>% group_by(z) %>% summarise(quantile(x, 0.25))
    DT[, quantile(x, c(0.25, 0.75)), by = z]
    DF %>% group_by(z) %>% do(data.frame(quantile(.$x, c(0.25, 0.75))))
    
    DT[, as.list(summary(x)), by = z]
    DF %>% group_by(z) %>% do(data.frame(as.list(summary(.$x))))
    

    .SD's equivalent in dplyr is . (the data placeholder used inside do()).

    In data.table, you can throw pretty much anything in j - the only thing to remember is for it to return a list so that each element of the list gets converted to a column.

    In dplyr, you cannot do that. You have to resort to do(), depending on how sure you are about whether your function will always return a single value. And it is quite slow.

Once again, data.table's syntax is consistent with DT[i, j, by]. We can just keep throwing expressions in j without having to worry about these things.

Have a look at this SO question and this one. I wonder if it would be possible to express the answer as straightforwardly using dplyr's syntax...

To summarise, I have particularly highlighted several instances where dplyr's syntax is either inefficient, limited or fails to make operations straightforward. This is particularly because data.table gets quite a bit of backlash about its "harder to read/learn" syntax (like the remark pasted/linked above). Most posts that cover dplyr talk about the most straightforward operations. And that is great. But it is important to realise its syntax and feature limitations as well, and I am yet to see a post on it.

data.table has its quirks as well (some of which I have pointed out that we are attempting to fix). We are also attempting to improve data.table's joins as I have highlighted here.

But one should also consider the number of features that dplyr lacks in comparison to data.table.

4. Features

I have pointed out most of the features here and also in this post. In addition:

fread - fast file reader has been available for a long time now.

fwrite - a parallelised fast file writer is now available. See this post for a detailed explanation on the implementation and #1664 for keeping track of further developments.

Automatic indexing - another handy feature to optimise base R syntax as is, internally.

Ad-hoc grouping: dplyr automatically sorts the results by grouping variables during summarise(), which may not always be desirable.

Numerous advantages in data.table joins (for speed/memory efficiency and syntax) mentioned above.

Non-equi joins: Allows joins using other operators <=, <, >, >= along with all other advantages of data.table joins.
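
For instance, matching each event to the interval that contains it - a minimal sketch with made-up ranges:

events <- data.table(id = 1:3, t = c(2, 5, 9))
bands  <- data.table(lo = c(0, 4, 8), hi = c(4, 8, 12), label = c("low", "mid", "high"))

# inequality operators in on = perform the non-equi join
bands[events, .(id, label), on = .(lo <= t, hi > t)]
#    id label
# 1:  1   low
# 2:  2   mid
# 3:  3  high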

Overlapping range joins were implemented in data.table recently. Check this post for an overview with benchmarks.

setorder() in data.table allows really fast reordering of data.tables by reference.
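
For illustration:

# reorders the rows of DT in place - no copy of the data is made
setorder(DT, z, -x)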

dplyr provides an interface to databases using the same syntax, which data.table does not at the moment.

data.table provides faster equivalents of set operations (written by Jan Gorecki) - fsetdiff, fintersect, funion and fsetequal with an additional all argument (as in SQL).
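
A small sketch of the all argument (toy values of our own):

x <- data.table(a = c(1, 2, 2, 3))
y <- data.table(a = c(2, 3))
fsetdiff(x, y)              ## unique rows of x not in y: a = 1
fsetdiff(x, y, all = TRUE)  ## SQL ALL semantics, keeps multiplicity: a = 1, 2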

data.table loads cleanly with no masking warnings and has a mechanism described here for [.data.frame compatibility when passed to any R package. dplyr changes base functions filter, lag and [ which can cause problems; e.g. here and here.


Finally:

On databases - there is no reason why data.table cannot provide a similar interface, but this is not a priority now. It might get bumped up if users would very much like that feature... not sure.

On parallelism - everything is difficult until someone goes ahead and does it. Of course, it will take effort (being thread safe and so on).

Progress is being made currently (in v1.9.7 devel) towards parallelising known time-consuming parts for incremental performance gains using OpenMP.

@bluefeet: I don't think you're doing the rest of us any great service by moving that discussion to chat. I was under the impression that Arun is one of the developers, and this could have yielded useful insights. (7 upvotes)

I think everywhere you have used by-reference assignment (`:=`), the `dplyr` equivalent should also use `<-`, as in `DF <- DF %>% mutate...` instead of just `DF %>% mutate...` (6 upvotes)

I have read many of these posts, and they helped me a lot in understanding data.table and using it better. In most cases I prefer data.table over dplyr or pandas or PL/pgSQL. However, I can't stop thinking about how to express things. The syntax is *not* easy, clear or verbose. In fact, even after using data.table a lot, I often struggle to understand code that I wrote literally a week ago. This is a living example of a write-only language. https://en.wikipedia.org/wiki/Write-only_language So, hopefully, one day we will be able to use dplyr on data.table. (6 upvotes)

Regarding syntax: I believe `dplyr` can be easier for users who have used `plyr` syntax, but `data.table` can be easier for users who have used query-language syntax like `SQL`, and the relational algebra behind it, which is all about tabular data transformation. @Arun, you should note that **set operators** are very easy to achieve by wrapping `data.table` functions, and of course that brings significant speedup. (4 upvotes)


hadley (370 upvotes)

This is my attempt at a comprehensive answer from the dplyr perspective, following the broad outline of Arun's answer (but somewhat rearranged based on differing priorities).

Syntax

There is some subjectivity to syntax, but I stand by my statement that the concision of data.table makes it harder to learn and harder to read. This is partly because dplyr is solving a much easier problem!

One really important thing that dplyr does for you is that it constrains your options. I claim that most single-table problems can be solved with just five key verbs: filter, select, mutate, arrange and summarise, along with a "by group" adverb. That constraint is a big help when you're learning data manipulation, because it helps order your thinking about the problem. In dplyr, each of these verbs maps to a single function. Each function does one job, and is easy to understand in isolation.

You create complexity by piping these simple operations together with %>%. Here's an example from one of the posts Arun linked to:

diamonds %>%
  filter(cut != "Fair") %>%
  group_by(cut) %>%
  summarize(
    AvgPrice = mean(price),
    MedianPrice = as.numeric(median(price)),
    Count = n()
  ) %>%
  arrange(desc(Count))

Even if you've never seen dplyr before (or even R!), you can still get the gist of what's happening, because the functions are all English verbs. The disadvantage of English verbs is that they require more typing than [, but I think that can be largely mitigated by better autocomplete.

Here's the equivalent data.table code:

diamondsDT <- data.table(diamonds)
diamondsDT[
  cut != "Fair", 
  .(AvgPrice = mean(price),
    MedianPrice = as.numeric(median(price)),
    Count = .N
  ), 
  by = cut
][ 
  order(-Count) 
]

It's harder to follow this code unless you're already familiar with data.table. (I also couldn't figure out how to indent the repeated [ in a way that looks good to my eye.) Personally, when I look at code I wrote 6 months ago, it's like looking at code written by a stranger, so I've come to prefer straightforward, if verbose, code.

Two other minor factors that I think slightly decrease readability:

Since almost every data.table operation uses [, you need additional context to figure out what's happening. For example, is x[y] joining two data.tables or extracting columns from a data frame? This is only a small issue, because in well-written code the variable names should suggest what's happening.

I like that group_by() is a separate operation in dplyr. It fundamentally changes the computation, so I think it should be obvious when skimming the code, and it's easier to spot group_by() than the by argument to [.data.table.

I also like that the pipe isn't just limited to one package. You can start by tidying your data with tidyr, and finish up with a plot in ggvis. And you're not limited to the packages that I write - anyone can write a function that forms a seamless part of a data manipulation pipe. In fact, I rather prefer the previous data.table code rewritten with %>%:

diamonds %>% 
  data.table() %>% 
  .[cut != "Fair", 
    .(AvgPrice = mean(price),
      MedianPrice = as.numeric(median(price)),
      Count = .N
    ), 
    by = cut
  ] %>% 
  .[order(-Count)]

And the idea of piping with %>% is not limited to just data frames; it is easily generalised to other contexts: interactive web graphics, web scraping, gists, run-time contracts, ...

Memory and performance

I've lumped these together because, to me, they're not that important. Most R users work with well under 1 million rows of data, and dplyr is fast enough for that size of data that you're not aware of processing time. We optimise dplyr for expressiveness on medium data; feel free to use data.table for raw speed on bigger data.

The flexibility of dplyr also means that you can easily tweak performance characteristics using the same syntax. If the performance of dplyr with the data frame backend is not good enough for you, you can use the data.table backend (albeit with a somewhat restricted set of functionality). If the data you're working with doesn't fit in memory, then you can use a database backend.

All that said, dplyr performance will get better in the long run. We'll definitely implement some of the great ideas of data.table, like radix ordering and using the same index for joins and filters. We're also working on parallelisation so we can take advantage of multiple cores.

Features

A few things that we're planning to work on in 2015:

the readr package, to make it easy to get files off disk and into memory, analogous to fread().

More flexible joins, including support for non-equi-joins.

More flexible grouping, like bootstrap samples, rollups and more.

I'm also investing time into improving R's database connectors, the ability to talk to web APIs, and making it easier to scrape HTML pages.

3 Answers
  • Directly answering the question title...

    dplyr definitely does things that data.table can not.

    Your point #3

    dplyr abstracts (or will) potential DB interactions

    is a direct answer to your own question, but it isn't elevated to a high enough level. dplyr is truly an extendable front-end to multiple data storage mechanisms, whereas data.table is an extension to a single one.

    Look at dplyr as a back-end agnostic interface, with all of the targets using the same grammar, where you can extend the targets and handlers at will. data.table is, from the dplyr perspective, one of those targets.

    You will never (I hope) see a day when data.table attempts to translate your queries to create SQL statements that operate with on-disk or networked data stores.

    dplyr can possibly do things data.table will not or might not do as well.

    Based on the design of working in-memory, data.table could have a much more difficult time extending itself into parallel processing of queries than dplyr.


    In response to the in-body questions...

    Usage

    Are there analytical tasks that are a lot easier to code with one package versus the other for people familiar with the packages (i.e. some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing)?

    This may seem like a punt, but the real answer is no. People familiar with these tools seem to use either the one most familiar to them or the one that is actually right for the job at hand. With that being said, sometimes you want to present a particular readability, sometimes a particular level of performance, and when you have need for a high enough level of both, you may just need another tool to go along with what you already have to make clearer abstractions.

    Performance

    Are there analytical tasks that are performed substantially (i.e. more than 2x) more efficiently in one package vs. another?

    Again, no. data.table excels at being efficient in everything it does, whereas dplyr carries the burden of being limited in some respects to the underlying data store and registered handlers.

    This means that when you run into a performance issue with data.table, you can be pretty sure it is in your query function, and if it is actually a bottleneck in data.table, then you've won yourself the joy of filing a report. This is also true when dplyr is using data.table as the back-end; you may see some overhead from dplyr, but odds are it is your query.

    When dplyr has performance issues with its back-ends, you can get around them by registering a hybrid evaluation function or (in the case of databases) manipulating the generated query prior to execution.

    Also see the accepted answer to "When is plyr better than data.table?"

    Answered 2023-01-30 15:56