data.table vs dplyr: can one do something well the other can't or does poorly?

Translated from: data.table vs dplyr: can one do something well the other can't or does poorly?

Overview

I'm relatively familiar with data.table, not so much with dplyr. I've read through some dplyr vignettes and examples that have popped up on SO, and so far my conclusions are that:

  1. data.table and dplyr are comparable in speed, except when there are many (ie >10-100K) groups, and in some other circumstances (see benchmarks below)
  2. dplyr has more accessible syntax
  3. dplyr abstracts (or will) potential DB interactions
  4. There are some minor functionality differences (see "Examples/Usage" below)

In my mind 2. doesn't bear much weight because I am fairly familiar with data.table, though I understand that for users new to both it will be a big factor. I would like to avoid an argument about which is more intuitive, as that is irrelevant for my specific question asked from the perspective of someone already familiar with data.table. I also would like to avoid a discussion about how "more intuitive" leads to faster analysis (certainly true, but again, not what I'm most interested in here).

Question

What I want to know is:

  1. Are there analytical tasks that are a lot easier to code with one or the other package for people familiar with the packages (ie some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing).
  2. Are there analytical tasks that are performed substantially (ie more than 2x) more efficiently in one package vs. another.

One recent SO question got me thinking about this a bit more, because up until that point I didn't think dplyr would offer much beyond what I can already do in data.table. Here is the dplyr solution (data at end of Q):

dat %.%
  group_by(name, job) %.%
  filter(job != "Boss" | year == min(year)) %.%
  mutate(cumu_job2 = cumsum(job2))
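
(`%.%` was the chaining operator in early dplyr releases; with current dplyr the same chain would be written with the magrittr pipe:)

dat %>%
  group_by(name, job) %>%
  filter(job != "Boss" | year == min(year)) %>%
  mutate(cumu_job2 = cumsum(job2))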

Which was much better than my hack attempt at a data.table solution. That said, good data.table solutions are also pretty good (thanks Jean-Robert, Arun, and note here I favored a single statement over the strictly most optimal solution):

setDT(dat)[,
  .SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)], 
  by=list(id, job)
]

The syntax for the latter may seem very esoteric, but it actually is pretty straightforward if you're used to data.table (ie doesn't use some of the more esoteric tricks).

Ideally what I'd like to see is some good examples where the dplyr or data.table way is substantially more concise or performs substantially better.

Examples

Usage

  • dplyr does not allow grouped operations that return an arbitrary number of rows (from eddi's question; note: this looks like it will be implemented in dplyr 0.5; also, @beginneR shows a potential work-around using do in the answer to @eddi's question).
  • data.table supports rolling joins (thanks @dholstius) as well as overlap joins (a minimal sketch follows this list).
  • data.table internally optimises expressions of the form DT[col == value] or DT[col %in% values] for speed through automatic indexing, which uses binary search while using the same base R syntax. See here for some more details and a tiny benchmark.
  • dplyr offers standard evaluation versions of functions (eg regroup, summarize_each_) that can simplify the programmatic use of dplyr (note programmatic use of data.table is definitely possible, it just requires some careful thought, substitution/quoting, etc, at least to my knowledge).
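
To make the rolling join and automatic indexing points concrete, here is a minimal sketch with toy data of my own (not from the original question):

library(data.table)

# Rolling join: match each trade to the most recent quote at or before it.
quotes <- data.table(time = c(1L, 5L, 10L), bid = c(100, 101, 102), key = "time")
trades <- data.table(time = c(3L, 7L), qty = c(10L, 20L), key = "time")
quotes[trades, roll = TRUE]  # rolls the last quote forward (LOCF) onto each trade

# Automatic indexing: the first subset of this form builds an index on 'x';
# later subsets reuse it via binary search instead of a full vector scan.
DT <- data.table(x = sample(1e6), y = runif(1e6))
DT[x == 42L]
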
Benchmarks

  • I ran my own benchmarks and found both packages to be comparable in "split apply combine" style analysis, except when there are very large numbers of groups (>100K), at which point data.table becomes substantially faster.
  • @Arun ran some benchmarks on joins, showing that data.table scales better than dplyr as the number of groups increases (updated with recent enhancements in both packages and a recent version of R). Also, a benchmark on getting unique values has data.table ~6x faster.
  • (Unverified) has data.table 75% faster on larger versions of a group/apply/sort while dplyr was 40% faster on the smaller ones (another SO question from comments, thanks danas).
  • Matt, the main author of data.table, has benchmarked grouping operations on data.table, dplyr and python pandas on up to 2 billion rows (~100GB in RAM).
  • An older benchmark on 80K groups has data.table ~8x faster.

Data

This is for the first example I showed in the question section.

dat <- structure(list(
  id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
  name = c("Jane", "Jane", "Jane", "Jane", "Jane", "Jane", "Jane", "Jane",
           "Bob", "Bob", "Bob", "Bob", "Bob", "Bob", "Bob", "Bob"),
  year = c(1980L, 1981L, 1982L, 1983L, 1984L, 1985L, 1986L, 1987L,
           1985L, 1986L, 1987L, 1988L, 1989L, 1990L, 1991L, 1992L),
  job = c("Manager", "Manager", "Manager", "Manager", "Manager", "Manager",
          "Boss", "Boss", "Manager", "Manager", "Manager", "Boss", "Boss",
          "Boss", "Boss", "Boss"),
  job2 = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L)),
  .Names = c("id", "name", "year", "job", "job2"),
  class = "data.frame", row.names = c(NA, -16L))



Answer #2

In direct response to the Question Title...

dplyr definitely does things that data.table can not.

Your point #3

  dplyr abstracts (or will) potential DB interactions

is a direct answer to your own question but isn't elevated to a high enough level. dplyr is truly an extendable front-end to multiple data storage mechanisms, whereas data.table is an extension to a single one.

Look at dplyr as a back-end agnostic interface, with all of the targets using the same grammar, where you can extend the targets and handlers at will. data.table is, from the dplyr perspective, one of those targets.

You will never (I hope) see a day that data.table attempts to translate your queries to create SQL statements that operate with on-disk or networked data stores.

dplyr can possibly do things data.table will not or might not do as well.

Based on the design of working in-memory, data.table could have a much more difficult time extending itself into parallel processing of queries than dplyr.


In response to the in-body questions...

Usage

  Are there analytical tasks that are a lot easier to code with one or the other package for people familiar with the packages (ie some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing).

This may seem like a punt but the real answer is no. People familiar with tools seem to use either the one most familiar to them or the one that is actually the right one for the job at hand. With that being said, sometimes you want to present a particular readability, sometimes a level of performance, and when you need a high enough level of both you may just need another tool to go along with what you already have to make clearer abstractions.

Performance

  Are there analytical tasks that are performed substantially (ie more than 2x) more efficiently in one package vs. another.

Again, no. data.table excels at being efficient in everything it does, whereas dplyr carries the burden of being limited in some respects to the underlying data store and registered handlers.

This means when you run into a performance issue with data.table you can be pretty sure it is in your query function, and if it actually is a bottleneck with data.table then you've won yourself the joy of filing a report. This is also true when dplyr is using data.table as the back-end; you may see some overhead from dplyr but odds are it is your query.

When dplyr has performance issues with back-ends you can get around them by registering a function for hybrid evaluation or (in the case of databases) manipulating the generated query prior to execution.

Also see the accepted answer to "when is plyr better than data.table?".


Answer #3

We need to cover at least these aspects to provide a comprehensive answer/comparison (in no particular order of importance): Speed, Memory usage, Syntax and Features.

My intent is to cover each one of these as clearly as possible from the data.table perspective.

Note: unless explicitly mentioned otherwise, by referring to dplyr we refer to dplyr's data.frame interface, whose internals are in C++ using Rcpp.

The data.table syntax is consistent in its form - DT[i, j, by]. Keeping i, j and by together is by design. By keeping related operations together, it allows operations to be easily optimised for speed and, more importantly, memory usage, and also provides some powerful features, all while maintaining consistency in syntax.
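
As a minimal illustration of the form (my example, not from the original answer): i subsets rows, j computes, by groups - roughly WHERE, SELECT and GROUP BY in SQL terms:

DT[x > 2, .(total = sum(y)), by = z]  # i: rows with x > 2; j: compute sum(y); by: per group of z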

1. Speed

Quite a few benchmarks (though mostly on grouping operations) have been added to the question already, showing data.table gets faster than dplyr as the number of groups and/or rows to group by increases, including benchmarks by Matt on grouping from 10 million to 2 billion rows (100GB in RAM) on 100 to 10 million groups and varying grouping columns, which also compares pandas. See also updated benchmarks, which include Spark and pydatatable as well.

On benchmarks, it would be great to cover these remaining aspects as well:

  • Grouping operations involving a subset of rows - ie, DT[x > val, sum(y), by = z] type operations.

  • Benchmark other operations such as update and joins.

  • Also benchmark memory footprint for each operation in addition to runtime (a sketch of one way to do this follows the list).
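
For instance, such a grouped-subset comparison could be measured with the bench package, which reports allocated memory alongside timings (my sketch, not part of the original answer):

library(bench)
library(data.table)
library(dplyr)

DT <- data.table(x = runif(1e6), y = runif(1e6), z = sample(1e4, 1e6, TRUE))
DF <- as.data.frame(DT)

# Compare runtime and memory allocation for a grouped aggregation on a row subset.
bench::mark(
  data.table = DT[x > 0.5, sum(y), by = z],
  dplyr      = DF %>% filter(x > 0.5) %>% group_by(z) %>% summarise(total = sum(y)),
  check = FALSE  # outputs differ in class and row order, so skip the equality check
)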

2. Memory usage

  1. Operations involving filter() or slice() in dplyr can be memory inefficient (on both data.frames and data.tables). See this post.

    Note that Hadley's comment talks about speed (that dplyr is plenty fast for him), whereas the major concern here is memory.

  2. The data.table interface at the moment allows one to modify/update columns by reference (note that we don't need to re-assign the result back to a variable).

     # sub-assign by reference, updates 'y' in-place
     DT[x >= 1L, y := NA]

    But dplyr will never update by reference. The dplyr equivalent would be (note that the result needs to be re-assigned; a small demonstration of the difference follows at the end of this item):

     # copies the entire 'y' column
     ans <- DF %>% mutate(y = replace(y, which(x >= 1L), NA))

    A concern for this is referential transparency. Updating a data.table object by reference, especially within a function, may not always be desirable. But this is an incredibly useful feature: see this and this post for interesting cases. And we want to keep it.

    Therefore we are working towards exporting a shallow() function in data.table that will provide the user with both possibilities. For example, if it is desirable to not modify the input data.table within a function, one can then do:

     foo <- function(DT) {
       DT = shallow(DT)          ## shallow copy DT
       DT[, newcol := 1L]        ## does not affect the original DT
       DT[x > 2L, newcol := 2L]  ## no need to copy (internally), as this column exists only in the shallow-copied DT
       DT[x > 2L, x := 3L]       ## have to copy (like base R / dplyr always does); otherwise the original DT would also get modified
     }

    By not using shallow(), the old functionality is retained:

     bar <- function(DT) {
       DT[, newcol := 1L]   ## old behaviour, original DT gets updated by reference
       DT[x > 2L, x := 3L]  ## old behaviour, update column x in the original DT
     }

    By creating a shallow copy using shallow(), we understand that you don't want to modify the original object. We take care of everything internally to ensure that, while also ensuring that the columns you modify are copied only when it is absolutely necessary. When implemented, this should settle the referential transparency issue altogether while providing the user with both possibilities.

    Also, once shallow() is exported, dplyr's data.table interface should avoid almost all copies. So those who prefer dplyr's syntax can use it with data.tables.

    But it will still lack many features that data.table provides, including (sub-)assignment by reference.
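
    To make the copy-versus-in-place distinction concrete, here is a small demonstration of my own using data.table's address() (which returns the memory address an object points to):

     library(data.table)
     library(dplyr)

     DT <- data.table(x = 1:5, y = 1:5)
     address(DT)           # note the address of DT
     DT[x >= 2L, y := NA]  # sub-assign by reference
     address(DT)           # same address: no new object was created

     DF <- data.frame(x = 1:5, y = 1:5)
     address(DF)
     DF <- DF %>% mutate(y = replace(y, which(x >= 2L), NA))
     address(DF)           # different address: the pipeline built a new object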

  3. Aggregate while joining:

    Suppose you have two data.tables as follows:

     DT1 = data.table(x = c(1,1,1,1,2,2,2,2), y = c("a", "a", "b", "b"),
                      z = 1:8, key = c("x", "y"))
     #    x y z
     # 1: 1 a 1
     # 2: 1 a 2
     # 3: 1 b 3
     # 4: 1 b 4
     # 5: 2 a 5
     # 6: 2 a 6
     # 7: 2 b 7
     # 8: 2 b 8
     DT2 = data.table(x = 1:2, y = c("a", "b"), mul = 4:3, key = c("x", "y"))
     #    x y mul
     # 1: 1 a   4
     # 2: 2 b   3

    And you would like to get sum(z) * mul for each row in DT2 while joining by columns x,y. We can either:

    • 1) aggregate DT1 to get sum(z), 2) perform a join and 3) multiply (or)

       # data.table way
       DT1[, .(z = sum(z)), keyby = .(x, y)][DT2][, z := z * mul][]

       # dplyr equivalent
       DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>%
         right_join(DF2) %>% mutate(z = z * mul)

    • 2) do it all in one go (using the by = .EACHI feature):

       DT1[DT2, list(z=sum(z) * mul), by = .EACHI] 

    What is the advantage?

    • We don't have to allocate memory for the intermediate result.

    • We don't have to group/hash twice (once for aggregation and again for joining).

    • And more importantly, the operation we want to perform is clear by looking at j in (2).

    Check this post for a detailed explanation of by = .EACHI. No intermediate results are materialised, and the join+aggregate is performed all in one go.

    Have a look at this, this and this post for real usage scenarios.

    In dplyr you would have to join and aggregate or aggregate first and then join, neither of which is as efficient in terms of memory (which in turn translates to speed).

  4. Update while joining:

    Consider the data.table code shown below:

     DT1[DT2, col := i.mul] 

    This adds/updates DT1's column col with mul from DT2 on those rows where DT2's key columns match DT1. I don't think there is an exact equivalent of this operation in dplyr, ie, without resorting to a *_join operation, which would have to copy the entire DT1 just to add a new column to it, which is unnecessary (a sketch of that work-around follows below).

    Check this post for a real usage scenario.
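
    For comparison, the closest dplyr approximation I can think of (my sketch, not from the answer) materialises a full join result and rebinds it, which is where the extra copy comes from:

     # brings 'mul' across as a new column (NA where no match), then the whole
     # result replaces DF1 - the entire table is copied in the process
     DF1 <- DF1 %>%
       left_join(select(DF2, x, y, mul), by = c("x", "y")) %>%
       rename(col = mul)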

To summarise, it is important to realise that every bit of optimisation matters. As Grace Hopper would say, Mind your nanoseconds!

3. Syntax

Let's now look at syntax. Hadley commented here:

  Data tables are extremely fast but I think their concision makes it harder to learn and code that uses it is harder to read after you have written it ...

I find this remark pointless because it is very subjective. What we can perhaps try instead is to contrast consistency in syntax. We will compare data.table and dplyr syntax side-by-side.

We will work with the dummy data shown below:

DT = data.table(x=1:10, y=11:20, z=rep(1:2, each=5))
DF = as.data.frame(DT)
  1. Basic aggregation/update operations.

     # case (a)
     DT[, sum(y), by = z]                                  ## data.table syntax
     DF %>% group_by(z) %>% summarise(sum(y))              ## dplyr syntax
     DT[, y := cumsum(y), by = z]
     ans <- DF %>% group_by(z) %>% mutate(y = cumsum(y))

     # case (b)
     DT[x > 2, sum(y), by = z]
     DF %>% filter(x > 2) %>% group_by(z) %>% summarise(sum(y))
     DT[x > 2, y := cumsum(y), by = z]
     ans <- DF %>% group_by(z) %>% mutate(y = replace(y, which(x > 2), cumsum(y)))

     # case (c)
     DT[, if (any(x > 5L)) y[1L] - y[2L] else y[2L], by = z]
     DF %>% group_by(z) %>% summarise(if (any(x > 5L)) y[1L] - y[2L] else y[2L])
     DT[, if (any(x > 5L)) y[1L] - y[2L], by = z]
     DF %>% group_by(z) %>% filter(any(x > 5L)) %>% summarise(y[1L] - y[2L])
    • data.table syntax is compact and dplyr's is quite verbose. Things are more or less equivalent in case (a).

    • In case (b), we had to use filter() in dplyr while summarising. But while updating, we had to move the logic inside mutate(). In data.table, however, we express both operations with the same logic - operate on rows where x > 2, but in the first case get sum(y), whereas in the second case update those rows of y with its cumulative sum.

      This is what we mean when we say the DT[i, j, by] form is consistent.

    • Similarly in case (c), when we have an if-else condition, we are able to express the logic "as-is" in both data.table and dplyr. However, if we would like to return just those rows where the if condition is satisfied and skip otherwise, we cannot use summarise() directly (AFAICT). We have to filter() first and then summarise, because summarise() always expects a single value.

      While it returns the same result, using filter() here makes the actual operation less obvious.

      It might very well be possible to use filter() in the first case as well (it does not seem obvious to me), but my point is that we should not have to.

  2. Aggregation / update on multiple columns

     # case (a)
     DT[, lapply(.SD, sum), by = z]                     ## data.table syntax
     DF %>% group_by(z) %>% summarise_each(funs(sum))   ## dplyr syntax
     DT[, (cols) := lapply(.SD, sum), by = z]
     ans <- DF %>% group_by(z) %>% mutate_each(funs(sum))

     # case (b)
     DT[, c(lapply(.SD, sum), lapply(.SD, mean)), by = z]
     DF %>% group_by(z) %>% summarise_each(funs(sum, mean))

     # case (c)
     DT[, c(.N, lapply(.SD, sum)), by = z]
     DF %>% group_by(z) %>% summarise_each(funs(n(), mean))
    • In case (a), the code is more or less equivalent. data.table uses the familiar base function lapply(), whereas dplyr introduces *_each() along with a bunch of functions to funs().

    • data.table's := requires column names to be provided, whereas dplyr generates them automatically.

    • In case (b), dplyr's syntax is relatively straightforward. Improving aggregations/updates on multiple functions is on data.table's list.

    • In case (c) though, dplyr would return n() as many times as there are columns, instead of just once. In data.table, all we need to do is to return a list in j. Each element of the list will become a column in the result. So, we can use, once again, the familiar base function c() to concatenate .N to a list, which returns a list.

    Note: once again, in data.table, all we need to do is return a list in j. Each element of the list will become a column in the result. You can use c(), as.list(), lapply(), list() etc... base functions to accomplish this, without having to learn any new functions.

    You will need to learn just the special variables - .N and .SD at least. The equivalents in dplyr are n() and . (a minimal side-by-side is sketched below).
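
    A minimal side-by-side of those special variables, using the DT/DF dummy data above (my illustration):

     DT[, .(count = .N), by = z]                    # .N is the number of rows in the current group
     DF %>% group_by(z) %>% summarise(count = n())  # n() is dplyr's equivalent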

  3. Joins

    dplyr provides separate functions for each type of join, whereas data.table allows joins using the same syntax DT[i, j, by] (and with reason). It also provides an equivalent merge.data.table() function as an alternative.

     setkey(DT1, x, y)

     # 1. normal join
     DT1[DT2]             ## data.table syntax
     left_join(DT2, DT1)  ## dplyr syntax

     # 2. select columns while joining
     DT1[DT2, .(z, i.mul)]
     left_join(select(DT2, x, y, mul), select(DT1, x, y, z))

     # 3. aggregate while joining
     DT1[DT2, .(sum(z) * i.mul), by = .EACHI]
     DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>%
       inner_join(DF2) %>% mutate(z = z * mul) %>% select(-mul)

     # 4. update while joining
     DT1[DT2, z := cumsum(z) * i.mul, by = .EACHI]
     # ??

     # 5. rolling join
     DT1[DT2, roll = -Inf]
     # ??

     # 6. other arguments to control output
     DT1[DT2, mult = "first"]
     # ??
    • Some might find a separate function for each join much nicer (left, right, inner, anti, semi etc), whereas others might like data.table's DT[i, j, by], or merge() which is similar to base R.

    • However dplyr joins do just that. Nothing more. Nothing less.

    • data.tables can select columns while joining (2); in dplyr you will need to select() first on both data.frames before joining, as shown above. Otherwise you would materialise the join with unnecessary columns only to remove them later, and that is inefficient.

    • data.tables can aggregate while joining (3) and also update while joining (4), using the by = .EACHI feature. Why materialise the entire join result to add/update just a few columns?

    • data.table is capable of rolling joins (5) - roll forward (LOCF), roll backward (NOCB), nearest.

    • data.table also has the mult = argument, which selects the first, last or all matches (6).

    • data.table has the allow.cartesian = TRUE argument to protect from accidental invalid joins.

Once again, the syntax is consistent with DT[i, j, by], with additional arguments allowing further control of the output.

  4. do()...

    dplyr's summarise is specially designed for functions that return a single value. If your function returns multiple/unequal values, you will have to resort to do(). You have to know beforehand what all your functions return.

     DT[, list(x[1], y[1]), by = z]                ## data.table syntax
     DF %>% group_by(z) %>% summarise(x[1], y[1])  ## dplyr syntax

     DT[, list(x[1:2], y[1]), by = z]
     DF %>% group_by(z) %>% do(data.frame(.$x[1:2], .$y[1]))

     DT[, quantile(x, 0.25), by = z]
     DF %>% group_by(z) %>% summarise(quantile(x, 0.25))

     DT[, quantile(x, c(0.25, 0.75)), by = z]
     DF %>% group_by(z) %>% do(data.frame(quantile(.$x, c(0.25, 0.75))))

     DT[, as.list(summary(x)), by = z]
     DF %>% group_by(z) %>% do(data.frame(as.list(summary(.$x))))
    • .SD's equivalent is .

    • In data.table, you can throw pretty much anything in j - the only thing to remember is for it to return a list, so that each element of the list gets converted to a column.

    • In dplyr, you cannot do that. You have to resort to do() depending on how sure you are as to whether your function would always return a single value. And it is quite slow.

Once again, data.table's syntax is consistent with DT[i, j, by]. We can just keep throwing expressions in j without having to worry about these things.

Have a look at this SO question and this one. I wonder if it would be possible to express the answer as straightforwardly using dplyr's syntax...

To summarise, I have particularly highlighted several instances where dplyr's syntax is either inefficient, limited or fails to make operations straightforward. This is particularly because data.table gets quite a bit of backlash about its "harder to read/learn" syntax (like the one pasted/linked above). Most posts that cover dplyr talk about the most straightforward operations. And that is great. But it is important to realise its syntax and feature limitations as well, and I am yet to see a post on it.

data.table has its quirks as well (some of which I have pointed out that we are attempting to fix). We are also attempting to improve data.table's joins, as I have highlighted here.

But one should also consider the number of features that dplyr lacks in comparison to data.table.

4. Features

I have pointed out most of the features here and also in this post. In addition:

  • fread - a fast file reader - has been available for a long time now.

  • fwrite - a parallelised fast file writer - is now available. See this post for a detailed explanation of the implementation and #1664 for keeping track of further developments.

  • Automatic indexing - another handy feature to optimise base R syntax as-is, internally.

  • Ad-hoc grouping: dplyr automatically sorts the results by grouping variables during summarise(), which may not always be desirable.

  • Numerous advantages in data.table joins (for speed / memory efficiency and syntax), mentioned above.

  • Non-equi joins: allows joins using other operators <=, <, >, >=, along with all the other advantages of data.table joins (a minimal sketch follows this list).

  • Overlapping range joins were implemented in data.table recently. Check this post for an overview with benchmarks.

  • The setorder() function in data.table allows really fast reordering of data.tables by reference.

  • dplyr provides an interface to databases using the same syntax, which data.table does not at the moment.

  • data.table provides faster equivalents of set operations (written by Jan Gorecki) - fsetdiff, fintersect, funion and fsetequal, with an additional all argument (as in SQL).

  • data.table loads cleanly with no masking warnings and has a mechanism described here for [.data.frame compatibility when passed to any R package. dplyr changes the base functions filter, lag and [, which can cause problems; eg here and here.
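
As a sketch of the non-equi join syntax mentioned above (toy data of my own, not from the answer):

library(data.table)

# For each event, find the window(s) whose [lo, hi] range contains its value.
windows <- data.table(lo = c(1, 5), hi = c(4, 9), label = c("low", "high"))
events  <- data.table(val = c(2, 6, 8))

# Non-equi join: match rows where lo <= val and hi >= val.
windows[events, .(val = i.val, label = x.label), on = .(lo <= val, hi >= val)]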


Finally:

  • On databases - there is no reason why data.table cannot provide a similar interface, but this is not a priority now. It might get bumped up if users would very much like that feature... not sure.

  • On parallelism - everything is difficult, until someone goes ahead and does it. Of course it will take effort (being thread safe).

    • Progress is currently being made (in v1.9.7 devel) towards parallelising known time-consuming parts for incremental performance gains using OpenMP.

Answer #4

Here's my attempt at a comprehensive answer from the dplyr perspective, following the broad outline of Arun's answer (but somewhat rearranged based on differing priorities).

Syntax

There is some subjectivity to syntax, but I stand by my statement that the concision of data.table makes it harder to learn and harder to read. This is partly because dplyr is solving a much easier problem!

One really important thing that dplyr does for you is that it constrains your options. I claim that most single table problems can be solved with just five key verbs - filter, select, mutate, arrange and summarise - along with a "by group" adverb. That constraint is a big help when you're learning data manipulation, because it helps order your thinking about the problem. In dplyr, each of these verbs is mapped to a single function. Each function does one job, and is easy to understand in isolation.

You create complexity by piping these simple operations together with %>%. Here's an example from one of the posts Arun linked to:

diamonds %>%
  filter(cut != "Fair") %>%
  group_by(cut) %>%
  summarize(
    AvgPrice = mean(price),
    MedianPrice = as.numeric(median(price)),
    Count = n()
  ) %>%
  arrange(desc(Count))

Even if you've never seen dplyr before (or even R!), you can still get the gist of what's happening because the functions are all English verbs. The disadvantage of English verbs is that they require more typing than [, but I think that can be largely mitigated by better autocomplete.

Here's the equivalent data.table code:

diamondsDT <- data.table(diamonds)
diamondsDT[
  cut != "Fair", 
  .(AvgPrice = mean(price),
    MedianPrice = as.numeric(median(price)),
    Count = .N
  ), 
  by = cut
][ 
  order(-Count) 
]

It's harder to follow this code unless you're already familiar with data.table. (I also couldn't figure out how to indent the repeated [ in a way that looks good to my eye). Personally, when I look at code I wrote 6 months ago, it's like looking at code written by a stranger, so I've come to prefer straightforward, if verbose, code.

Two other minor factors that I think slightly decrease readability:

  • Since almost every data table operation uses [, you need additional context to figure out what's happening. For example, is x[y] joining two data tables or extracting columns from a data frame? This is only a small issue, because in well-written code the variable names should suggest what's happening.

  • I like that group_by() is a separate operation in dplyr. It fundamentally changes the computation, so I think it should be obvious when skimming the code, and it's easier to spot group_by() than the by argument to [.data.table.

I also like that the pipe isn't just limited to one package. You can start by tidying your data with tidyr, and finish up with a plot in ggvis. And you're not limited to the packages that I write - anyone can write a function that forms a seamless part of a data manipulation pipe. In fact, I rather prefer the previous data.table code rewritten with %>%:

diamonds %>% 
  data.table() %>% 
  .[cut != "Fair", 
    .(AvgPrice = mean(price),
      MedianPrice = as.numeric(median(price)),
      Count = .N
    ), 
    by = cut
  ] %>% 
  .[order(-Count)]

And the idea of piping with %>% is not limited to just data frames and is easily generalised to other contexts: interactive web graphics, web scraping, gists, run-time contracts, ...

Memory and performance

I've lumped these together because, to me, they're not that important. Most R users work with well under 1 million rows of data, and dplyr is fast enough for that size of data that you're not aware of processing time. We optimise dplyr for expressiveness on medium data; feel free to use data.table for raw speed on bigger data.

The flexibility of dplyr also means that you can easily tweak performance characteristics using the same syntax. If the performance of dplyr with the data frame backend is not good enough for you, you can use the data.table backend (albeit with a somewhat restricted set of functionality). If the data you're working with doesn't fit in memory, then you can use a database backend.
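
(In current releases that data.table backend lives in the dtplyr package; a minimal sketch, assuming dtplyr >= 1.0, which lazily translates dplyr verbs into data.table calls:)

library(dtplyr)
library(dplyr)
library(data.table)

DT <- lazy_dt(data.table(x = 1:10, z = rep(1:2, 5)))

DT %>%
  group_by(z) %>%
  summarise(total = sum(x)) %>%
  as_tibble()  # evaluation happens here, via generated data.table code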

All that said, dplyr performance will get better in the long-term. We'll definitely implement some of the great ideas of data.table, like radix ordering and using the same index for joins & filters. We're also working on parallelisation so we can take advantage of multiple cores.

Features

A few things that we're planning to work on in 2015:

  • the readr package, to make it easy to get files off disk and into memory, analogous to fread().

  • More flexible joins, including support for non-equi-joins.

  • More flexible grouping, like bootstrap samples, rollups and more.

I'm also investing time into improving R's database connectors, the ability to talk to web APIs, and making it easier to scrape html pages.
