So I'm a huge data.table fan in R. I use it almost all the time but have come across a situation in which it won't work for me at all. I have a package (internal to my company) that uses R's double to store the value of an unsigned 64-bit integer whose bit sequence corresponds to some fancy encoding. This package works very nicely everywhere except data.table. I found that if I aggregate on a column of this data, I lose a large number of my unique values. My only guess here is that data.table is truncating bits in some kind of weird double optimization.
Can anyone confirm that this is the case? Is this simply a bug?
Below is a reproduction of the issue and a comparison to the package I currently must use but want to avoid with a passion (dplyr).
temp <- structure(list(obscure_math = c(6.95476896592629e-309, 6.95476863436446e-309,
6.95476743245288e-309, 6.95476942182375e-309, 6.95477149408563e-309,
6.95477132830476e-309, 6.95477132830476e-309, 6.95477149408562e-309,
6.95477174275702e-309, 6.95476880014538e-309, 6.95476896592647e-309,
6.95476896592647e-309, 6.95476900737172e-309, 6.95476900737172e-309,
6.95476946326899e-309, 6.95476958760468e-309, 6.95476958760468e-309,
6.95477020928318e-309, 6.95477124541406e-309, 6.95476859291965e-309,
6.95476875870014e-309, 6.95476904881676e-309, 6.95476904881676e-309,
6.95476904881676e-309, 6.95476909026199e-309, 6.95476909026199e-309,
6.95476909026199e-309, 6.95476909026199e-309, 6.9547691317072e-309,
6.9547691317072e-309, 6.9547691317072e-309, 6.9547691317072e-309,
6.9547691317072e-309, 6.9547691317072e-309, 6.9547691317072e-309,
6.9547691317072e-309, 6.9547691317072e-309, 6.9547691317072e-309,
6.9547691317072e-309, 6.9547691317072e-309, 6.95477211576406e-309,
6.95476880014538e-309, 6.95476880014538e-309, 6.95476880014538e-309,
6.95476892448104e-309, 6.95476880014538e-309, 6.95476892448105e-309,
6.9547689659263e-309, 6.95476913170719e-309, 6.95476933893334e-309
)), .Names = "obscure_math", class = c("data.table", "data.frame"), row.names = c(NA,
-50L))
library(data.table)
library(dplyr)
dt_collapsed <- temp[, .(count = .N), by = obscure_math]
nrow(dt_collapsed) == length(unique(temp$obscure_math))    # does every distinct value keep its own group?
setDF(temp)
dplyr_collapsed <- temp %>% group_by(obscure_math) %>% summarise(count = n())
nrow(dplyr_collapsed) == length(unique(temp$obscure_math))
1 Answer
Update: the default rounding feature has been removed in the current development version of data.table (v1.9.7). See installation instructions for devel version here.
This also means that you're responsible for understanding the limitations of representing floating point numbers and for dealing with them.
data.table has been around for a long time. We used to deal with limitations in floating point representations by using a threshold (like base R does, e.g., all.equal). However that simply does not work, since the threshold needs to be adaptive depending on how big the compared numbers are. This series of articles is an excellent read on this topic and other potential issues.
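To see why a fixed threshold cannot work in general, here is a small illustration of my own (not part of the original answer): all.equal() uses a relative tolerance, so the verdict depends on the magnitude of the numbers being compared.

# all.equal() compares with a relative tolerance (default ~1.5e-8),
# so the same absolute difference of 1 is treated differently:
all.equal(2, 3)            # "Mean relative difference: 0.5" (not equal)
all.equal(1e10, 1e10 + 1)  # TRUE, the relative difference (1e-10) is below tolerance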
This kept coming up, either because a) people don't realise the limitations, or b) thresholding did not really help their issue, so people kept asking here or posting on the project page.
While reimplementing data.table's ordering to use fast radix ordering, we took the opportunity to provide an alternative way of fixing the issue, and a way out if it proves undesirable (exporting setNumericRounding). With issue #1642, ordering probably doesn't need rounding of doubles (but it's not that simple, since ordering directly affects binary-search-based subsets).
The actual problem here is grouping on floating point numbers, and it is even worse with numbers like the ones in your case. That is just a bad choice, IMHO.
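A tiny illustration of why (again my own, not from the answer): two doubles can print identically and still have different bit patterns, so they end up in different groups.

x <- c(0.3, 0.1 + 0.2)
print(x)            # 0.3 0.3 with default printing
length(unique(x))   # 2, because 0.1 + 0.2 is not bit-identical to 0.3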
I can think of two ways forward:
1. When grouping on columns that are really doubles (in R, 1 is double as opposed to 1L, and those cases don't have issues), provide a warning that the last 2 bytes are rounded off and that people should read ?setNumericRounding, and also suggest using bit64::integer64 (a rough sketch follows after this list).
2. Remove the functionality of allowing grouping operations on really double values, or force them to fix the precision to certain digits before continuing. I can't think of a valid reason why one would really want to group by floating point numbers (would love to hear from people who do).
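As a rough sketch of the bit64::integer64 suggestion in point 1 (my own example, assuming the 64-bit values can be supplied as strings): integer64 keeps every 64-bit pattern distinct, whereas a double cannot represent all of them exactly.

library(bit64)
library(data.table)
as.numeric("9007199254740993") == as.numeric("9007199254740992")   # TRUE: both collapse to 2^53 as doubles
dt <- data.table(id = as.integer64(c("9007199254740993", "9007199254740992")))
dt[, .N, by = id]   # two groups of one row each: integer64 preserves the distinction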
What is very unlikely to happen is going back to threshold-based checks for identifying which doubles should belong to the same group.
Just so that the Q remains answered, use setNumericRounding(0L).
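For completeness, a minimal sketch of how that applies to the reproduction in the question (assuming temp is still a data.table, i.e. before the setDF() call):

library(data.table)
setNumericRounding(0L)   # compare all bits of the doubles when grouping/joining
dt_collapsed <- temp[, .(count = .N), by = obscure_math]
nrow(dt_collapsed) == length(unique(temp$obscure_math))   # now TRUE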