I have two large data frames, a and b, for which identical(a, b) is TRUE, as is all.equal(a, b), but identical(digest(a), digest(b)) is FALSE. What could cause this?
What's more, I tried to dig in deeper, by applying digest to bunches of rows. Incredibly, at least to me, there is agreement in the digest values on sub-frames all the way to the last row of the data frames.
Here is a sequence of comparisons:
> identical(a, b)
[1] TRUE
> all.equal(a, b)
[1] TRUE
> digest(a)
[1] "cac56b06078733b6fb520442e5482684"
> digest(b)
[1] "fdd5ab78ca961982d195f800e3cf60af"
> digest(a[1:nrow(a),])
[1] "e44f906723405756509a6b17b5949d1a"
> digest(b[1:nrow(b),])
[1] "e44f906723405756509a6b17b5949d1a"
Every method I can think of indicates these two objects are identical, but their digest values are different. Is there something else about data frames that can produce such discrepancies?
For further details: the objects are about 10M rows x 12 columns. Here's the output of str():
'data.frame': 10056987 obs. of 12 variables:
$ V1 : num 1 11 21 31 41 61 71 81 91 101 ...
$ V2 : num 1 1 1 1 1 1 1 1 1 1 ...
$ V3 : num 2 3 2 3 4 5 2 4 2 4 ...
$ V4 : num 1 1 1 1 1 1 1 1 1 1 ...
$ V5 : num 1.8 2.29 1.94 2.81 3.06 ...
$ V6 : num 0.0653 0.0476 0.0324 0.034 0.0257 ...
$ V7 : num 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
$ V8 : num 0.00653 0.00476 0.00324 0.0034 0.00257 ...
$ V9 : num 1.8 2.3 1.94 2.81 3.06 ...
$ V10: num 0.1957 0.7021 0.0604 0.1866 0.9371 ...
$ V11: num 1704 1554 1409 1059 1003 ...
$ V12: num 23309 23309 23309 23309 23309 ...
> print(object.size(a), units = "Mb")
920.7 Mb
Update 1: On a whim, I converted these to matrices. The digests are the same.
> aM = as.matrix(a)
> bM= as.matrix(b)
> identical(aM,bM)
[1] TRUE
> digest(aM)
[1] "c5147d459ba385ca8f30dcd43760fc90"
> digest(bM)
[1] "c5147d459ba385ca8f30dcd43760fc90"
I then tried converting back to a data frame, and the digest values are equal (and equal to the previous value for a).
> aMF = as.data.frame(aM)
> bMF = as.data.frame(bM)
> digest(aMF)
[1] "cac56b06078733b6fb520442e5482684"
> digest(bMF)
[1] "cac56b06078733b6fb520442e5482684"
So, b looks like the bad boy, and it has a colorful past. b came from a much bigger data frame, say B. I took only the columns of B that appeared in a and checked to see if they were equal. Well, they were equal, but had different digests. I converted the column names (from "InformativeColumnName1" to "V1", etc.), just to avoid any issues that might arise, though all.equal and identical tend to point out when column names differ.
Since I am working on two different programs and don't have simultaneous access to a and b, it is easiest for me to use the digest values to check the calculations. However, something seems to be odd in how I extract columns from a data frame and then apply digest() to it.
ANSWER: It turns out, to my astonishment (dismay, horror, embarrassment, you name it), identical is very forgiving about attributes. I had assumed that only all.equal was forgiving about attributes.
This was discovered via Tommy's suggestion identical(d1, d2, attrib.as.set=FALSE). Running attributes(a) is a bad, bad idea: the deluge of row names took a while before Ctrl-C could interrupt it. Here is the output of names(attributes()):
> names(attributes(a))
[1] "names" "row.names" "class"
> names(attributes(b))
[1] "names" "class" "row.names"
They're in different orders! Kudos to digest() for being straight with me.
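The same effect can be reproduced on a toy object without the 10M-row data. This is only a minimal sketch using base R: x and y are hypothetical names, not the original data frames, and serialize() stands in for digest() on the assumption that digest() hashes the serialized object (its default behaviour).

```r
# Toy objects (hypothetical, not the original data): same values and the
# same attributes, but the attributes are attached in a different order.
x <- structure(c(1, 2, 3), foo = "a", bar = "b")
y <- structure(c(1, 2, 3), bar = "b", foo = "a")

# Inspect only the attribute names, never the full attribute list.
names(attributes(x))
names(attributes(y))

# Default comparison treats attributes as an unordered set.
same_as_set <- identical(x, y)                           # TRUE

# Strict comparison also requires the attribute order to match.
same_in_order <- identical(x, y, attrib.as.set = FALSE)  # FALSE

# Compare the serialized bytes, which is what a hash of the object sees.
same_bytes <- identical(serialize(x, NULL), serialize(y, NULL))
```

If same_bytes comes out FALSE while same_as_set is TRUE, that mirrors what happened with a and b: the objects are equal as far as identical() is concerned by default, but their serialized forms, and therefore their hashes, differ.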
UPDATE
To aid others with this problem, it seems that simply rearranging the attributes is adequate to get identical hash values. Since tinkering with attribute order is new to me, this may break something, but it works in my case. Note that it is a little time-consuming if the objects are big; I'm not aware of a faster method for doing this. (I'm also looking to move to using matrices or data tables instead of data frames, and this may be another incentive to avoid data frames.)
tmpA0 = attributes(a)
tmpA1 = tmpA0[sort(names(tmpA0))]
a2 = a
attributes(a2) = tmpA1
tmpB0 = attributes(b)
tmpB1 = tmpB0[sort(names(tmpB0))]
b2 = b
attributes(b2) = tmpB1
digest(a2) # e04e624692d82353479efbd713ec03f6
digest(b2) # e04e624692d82353479efbd713ec03f6
identical(b,b2, attrib.as.set = FALSE) # FALSE
identical(b,b2, attrib.as.set = TRUE) # TRUE
identical(a2,b2, attrib.as.set = FALSE) # TRUE
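The reordering above could be wrapped in a small helper so it is applied the same way in both programs before hashing. This is a sketch only: sort_attributes is a hypothetical name, not part of base R or the digest package, and the tiny data frames df1 and df2 are made up for illustration. Like the code above, it has only been reasoned about for plain data frames and may not be safe for every kind of object.

```r
# Hypothetical helper: return x with its attributes re-attached in
# alphabetical order of their names, so that attribute order no longer
# affects a strict comparison or a hash of the object.
sort_attributes <- function(x) {
  att <- attributes(x)
  if (is.null(att)) {
    return(x)  # nothing to reorder
  }
  attributes(x) <- att[sort(names(att))]
  x
}

# Made-up example data: df2 has the same content as df1, but its
# attributes are re-attached in reversed order.
df1 <- data.frame(V1 = c(1, 11, 21), V2 = c(1, 1, 1))
df2 <- df1
attributes(df2) <- rev(attributes(df1))

# After normalization the two objects should match even under the
# strict, order-sensitive comparison.
strict_match <- identical(sort_attributes(df1), sort_attributes(df2),
                          attrib.as.set = FALSE)
```

With the digest package loaded, digest(sort_attributes(a)) could then be compared across the two programs, at the cost of the same attribute copying as before.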
2 answers