在一个不适合内存的巨大文件中找到'n'个最重复的单词/字符串

Question

问

在一个不适合内存的巨大文件中找到'n'个最重复的单词/字符串

sijiamian_767 发布于 2023-01-29 12:58

我想验证我的伪代码,建议优化和更好的方法.

mostRepeatingWords(int rank):(
此处的排名定义了您想要选择的排名,即排名前X的最重复的单词)

外部按字母顺序排序所有文件.遵循这里提到的算法.
复杂性:ø(n log n/m)

创建一个新文件"wordCount",其中包含类似的条目

"word1" : 3- 即word1重复三次
"word2" : 1- 即word2是唯一的.

for ( read each of the sorted file one by one ) { 
    for (read each "word" in the current file) { 
        int count =  checkIfDuplicateAndReturnCount(); 
        enter "word" and "count" in the "wordCount" file. 

        sanitizeFile(); 

        if  (wordCount file is > page size) { 
                write wordCount file to disk. 
                create a new wordCount file.
         } 

        move to next word ie currentword + count;
    } 
}

复杂度O(n + p)其中n是已排序页面的数量,p <= n,是"wordcount"页面的数量

checkIfDuplicateAndReturnCount() 是一个简单的函数,它将该元素与first和previous等进行比较,并确定单词的频率.

sanitizeFile()当页面全部被相同的单词淹没时使用.让我们说一个页面的大小是4KB,并且说,但是在排序时,包含单词常用词"the"的页数大于(4KB /单词的大小).然后我们可能需要创建一个新文件.但是,清理文件将需要额外的步骤来将文件中的两个条目组合成相同的单词.例:

:500

:600

将合并为:

:1100

for  (reading the all the wordcount files one by one ) {
  for (each word in the wordcount) {
    Use a minimum heap / priority queue of size "rank" to keep track of the frequency.
    if (wordcount > heap.peek()) {
        heap.addAtPositionZero(word);
    }   
  }
}

复杂性是O(p)

复杂度:ø(n log n/m)+ O(n + p)+ O(p)有效:ø(n log n/m)

Jim Mischel.. 15

没有必要提前对文件进行排序.这样做只是一堆不必要的I/O.

更直接的方法是:

Create an empty dictionary (hash map), keyed by word. The value is the count.
for each file
    while not end of file
        read word
        if word in dictionary
            update count
        else
            if dictionary full
                sort dictionary by word
                output dictionary to temporary file
                Clear dictionary
            Add word to dictionary, with count 1
    end
end
if dictionary not empty
    sort dictionary by word
    output dictionary to temporary file

您现在拥有一些临时文件,每个文件按字排序,每行包含一个字/计数对.喜欢:

aardvark,12
bozo,3
zebra,5

创建一个用于保存n个最大项目的最小堆.叫它largest_items.

执行这些临时文件的标准n路合并.当您找到每个唯一项目(即您在多个文件中合并所有"aardvark"条目)时,您执行以下操作:

if (largest_items.count < n)
    largest_items.add(word)
else if (word.count > largest_items.peek().count)
{
    // the count for this word is more than the smallest count
    // already on the heap. So remove the item with the
    // smallest count, and add this one.
    largest_items.remove_root()
    largest_items.add(word)
}

复杂:

构建词典是O(N),其中N是文件中单个词的总数.

对每个临时字典进行排序是O(k log k),其中"k"是字典中的单词数.

写每个临时字典是O(k)

合并为O(M log x),其中M是所有临时文件的条目总数,x是临时文件的数量.

选择项目是O(m log n),其中m是唯一单词n的数量,是您要选择的单词数.

如果你看一下最坏的情况(即所有单词都是唯一的),复杂性就会达到(n是单词的总数):

构建词典是O(n)

排序和写入临时文件是(n/m) * (m log m),这里m是字典大小.

合并是n log (n/m).

选择是O(m +(k log k)),其中k是您要选择的单词m数,是唯一单词的数量.因为所有单词都是唯一的,所以它们具有相同的计数,因此您只需要k插入堆中.m当k远小于m(在这些情况下通常是这种情况)时,该术语占主导地位.所以选择结果是O(m).

当您处理大于内存的数据集时,您的瓶颈往往是文件I/O. 我上面概述的算法试图最小化I/O. 在最坏的情况下(所有单词都是唯一的),每个单词将被读取两次并写入一次.但是在一般情况下,每个单词被读取一次,然后每个散列页面被写入一次并读取一次.另外,您的排序是使用哈希页而不是原始单词,并且临时文件的总大小将远小于原始文本.

1 个回答

没有必要提前对文件进行排序.这样做只是一堆不必要的I/O.

更直接的方法是:
```
Create an empty dictionary (hash map), keyed by word. The value is the count.
for each file
    while not end of file
        read word
        if word in dictionary
            update count
        else
            if dictionary full
                sort dictionary by word
                output dictionary to temporary file
                Clear dictionary
            Add word to dictionary, with count 1
    end
end
if dictionary not empty
    sort dictionary by word
    output dictionary to temporary file
```
您现在拥有一些临时文件,每个文件按字排序,每行包含一个字/计数对.喜欢:
```
aardvark,12
bozo,3
zebra,5
```
创建一个用于保存n个最大项目的最小堆.叫它largest_items.

执行这些临时文件的标准n路合并.当您找到每个唯一项目(即您在多个文件中合并所有"aardvark"条目)时,您执行以下操作:
```
if (largest_items.count < n)
    largest_items.add(word)
else if (word.count > largest_items.peek().count)
{
    // the count for this word is more than the smallest count
    // already on the heap. So remove the item with the
    // smallest count, and add this one.
    largest_items.remove_root()
    largest_items.add(word)
}
```
复杂:

构建词典是O(N),其中N是文件中单个词的总数.

对每个临时字典进行排序是O(k log k),其中"k"是字典中的单词数.

写每个临时字典是O(k)

合并为O(M log x),其中M是所有临时文件的条目总数,x是临时文件的数量.

选择项目是O(m log n),其中m是唯一单词n的数量,是您要选择的单词数.

如果你看一下最坏的情况(即所有单词都是唯一的),复杂性就会达到(n是单词的总数):

构建词典是O(n)

排序和写入临时文件是(n/m) * (m log m),这里m是字典大小.

合并是n log (n/m).

选择是O(m +(k log k)),其中k是您要选择的单词m数,是唯一单词的数量.因为所有单词都是唯一的,所以它们具有相同的计数,因此您只需要k插入堆中.m当k远小于m(在这些情况下通常是这种情况)时,该术语占主导地位.所以选择结果是O(m).

当您处理大于内存的数据集时,您的瓶颈往往是文件I/O. 我上面概述的算法试图最小化I/O. 在最坏的情况下(所有单词都是唯一的),每个单词将被读取两次并写入一次.但是在一般情况下,每个单词被读取一次,然后每个散列页面被写入一次并读取一次.另外,您的排序是使用哈希页而不是原始单词,并且临时文件的总大小将远小于原始文本.
2023-01-29 13:01 回答

建霖怡旭家其

撰写答案