The code below will print me the highest frequency it can find in my hash table (of which is a bunch of linked lists) 10 times. I need my code to print the top 10 frequencies in my hash table. I do not know how to do this (code examples would be great, plain english logic/pseudocode is just as great).
下面的代码将打印出我在哈希表(其中包含一堆链表)中可以找到的最高频率10次。我需要我的代码来打印哈希表中的前10个频率。我不知道怎么做(代码示例很棒,简单的英文逻辑/伪代码也一样好)。
我创建了一个名为'tmp'的临时散列列表,它指向我的散列表'hashtable'
然后while循环遍历列表并查找最高频率,这是一个int'tmp-> freq'
循环将继续此过程,使用变量“topfreq”复制它找到的最高频率,直到它到达散列表上链接列表的末尾。
My 'node' is a struct comprising of the variables 'freq' (int) and 'word' (128 char). When the loop has nothing else to search for it prints these two values on screen.
我的'node'是一个包含变量'freq'(int)和'word'(128 char)的结构。当循环没有其他任何东西可以搜索时,它会在屏幕上打印这两个值。
The problem is, I can't wrap my head around figuring out how to find the next lowest number from the number I've just found (and this can include another node with the same freq value, so I have to check that the word is not the same too).
问题是,我无法解决如何从我刚找到的数字中找到下一个最低数字(这可能包括具有相同频率值的另一个节点,所以我必须检查这个词也不一样)。
void toptenwords()
{
int topfreq = 0;
int minfreq = 0;
char topword[SIZEOFWORD];
for(int p = 0; p <10; p++) // We need the top 10 frequencies... so we do this 10 times
{
for(int m = 0; m freq > topfreq) // If the freqency on hand is larger that the one found, store...
{
topfreq = tmp->freq;
strcpy(topword, tmp->word);
}
tmp = tmp->next;
}
}
cout <
Any and all help would be GREATLY appreciated :)
任何和所有的帮助将非常感谢:)
Keep an array of 10 node pointers, and insert each node into the array, maintaining the array in sorted order. The eleventh node in the array is overwritten on each iteration and contains junk.
保留一个包含10个节点指针的数组,并将每个节点插入到数组中,按排序顺序维护数组。数组中的第十一个节点在每次迭代时被覆盖并包含垃圾。
void toptenwords()
{
int topfreq = 0;
int minfreq = 0;
node *topwords[11];
int current_topwords = 0;
for(int m = 0; m 0; i--)
{
if(topwords[i]->freq > topwords[i - 1]->freq)
{
node *temp = topwords[i - 1];
topwords[i - 1] = topwords[i];
topwords[i] = temp;
}
else break;
}
if(current_topwords > 10) current_topwords = 10;
tmp = tmp->next;
}
}
}
I would maintain a set of words already used and change the inner-most if condition to test for frequency greater than previous top frequency AND tmp->word not in list of words already used.
我将维护一组已经使用过的单词并更改最内部的if条件以测试频率大于先前的顶部频率AND tmp-> word不在已使用的单词列表中。
When iterating over the hash table (and then over each linked list contained therein) keep a self balancing binary tree (std::set) as a "result" list. As you come across each frequency, insert it into the list, then truncate the list if it has more than 10 entries. When you finish, you'll have a set (sorted list) of the top ten frequencies, which you can manipulate as you desire.
当迭代哈希表(然后遍历其中包含的每个链表)时,将自平衡二叉树(std :: set)保持为“结果”列表。当您遇到每个频率时,将其插入列表中,如果列表超过10个,则截断列表。完成后,您将拥有前十个频率的集合(排序列表),您可以根据需要进行操作。
There may be perform gains to be had by using sets instead of linked lists in the hash table itself, but you can work that out for yourself.
通过在哈希表本身中使用集合而不是链接列表可能会有增益,但您可以自己解决这个问题。
Step 1 (Inefficient):
第1步(效率低下):
Move the vector into a sorted container via insertion sort, but insert into a container (e.g. linkedlist or vector) of size 10, and drop any elements that fall off the bottom of the list.
通过插入排序将向量移动到已排序的容器中,但插入到大小为10的容器(例如,链表或向量)中,并删除从列表底部掉落的所有元素。
Step 2 (Efficient):
第2步(高效):
Same as step 1, but keep track of the size of the item at the bottom of the list, and skip the insertion step entirely if the current item is too small.
与步骤1相同,但跟踪列表底部项目的大小,如果当前项目太小,则完全跳过插入步骤。
Suppose there are n words in total, and we need the most-frequent k words (here, k = 10).
假设总共有n个单词,我们需要最频繁的k个单词(这里,k = 10)。
If n is much larger than k, the most efficient way I know of is to maintain a min-heap (i.e. the top element has the minimum frequency of all elements in the heap). On each iteration, you insert the next frequency into the heap, and if the heap now contains k+1 elements, you remove the smallest. This way, the heap is maintained at a size of k elements throughout, containing at any time the k highest-frequency elements seen so far. At the end of processing, read out the k highest-frequency elements in increasing order.
如果n比k大得多,我所知道的最有效的方法是维持一个最小堆(即顶层元素具有堆中所有元素的最小频率)。在每次迭代中,将下一个频率插入堆中,如果堆现在包含k + 1个元素,则删除最小的元素。这样,堆始终保持在k个元素的大小,包含到目前为止看到的k个最高频率元素。在处理结束时,按递增顺序读出k个最高频率元素。
Time complexity: For each of n words, we do two things: insert into a heap of size at most k, and remove the minimum element. Each operation costs O(log k) time, so the entire loop takes O(nlog k) time. Finally, we read out the k elements from a heap of size at most k, taking O(klog k) time, for a total time of O((n+k)log k). Since we know that k <n, O(klog k) is at worst O(nlog k), so this can be simplified to just O(nlog k).
时间复杂度:对于n个单词中的每一个,我们做两件事:插入最多为k的大小的堆,并删除最小元素。每个操作花费O(log k)时间,因此整个循环需要O(nlog k)时间。最后,我们从大小为k的堆中读出k个元素,取O(klog k)时间,总时间为O((n + k)log k)。因为我们知道k
A hash table containing linked lists of words seems like a peculiar data structure to use if the goal is to accumulate are word frequencies.
包含链接的单词列表的哈希表似乎是一种特殊的数据结构,如果目标是累积,则使用的是字频率。
Nonetheless, the efficient way to get the ten highest frequency nodes is to insert each into a priority queue/heap, such as the Fibonacci heap, which has O(1) insertion time and O(n) deletion time. Assuming that iteration over the hash table table is fast, this method has a runtime which is O(n×O(1) + 10×O(n)) ≡ O(n).
尽管如此,获得十个最高频率节点的有效方法是将每个节点插入优先级队列/堆,例如Fibonacci堆,其具有O(1)插入时间和O(n)删除时间。假设对哈希表表的迭代很快,该方法的运行时间为O(n×O(1)+ 10×O(n))≡O(n)。
The absolute fastest way to do this would be to use a SoftHeap. Using a SoftHeap, you can find the top 10 items in O(n) time whereas every other solution posted here would take O(n lg n) time.
绝对最快的方法是使用SoftHeap。使用SoftHeap,您可以在O(n)时间内找到前10个项目,而此处发布的每个其他解决方案都需要O(n lg n)时间。
http://en.wikipedia.org/wiki/Soft_heap
This wikipedia article shows how to find the median in O(n) time using a softheap, and the top 10 is simply a subset of the median problem. You could then sort the items that were in the top 10 if you needed them in order, and since you're always at most sorting 10 items, it's still O(n) time.
这篇维基百科文章展示了如何使用softheap找到O(n)时间的中位数,前10位只是中位数问题的一个子集。然后,如果您按顺序需要,您可以对前10名中的项目进行排序,并且因为您总是最多排序10个项目,所以它仍然是O(n)时间。
I eventually figured it out...
我最终想通了......
void toptenwords()
{
int topfreq = 0;
char topword[SIZEOFWORD];
int counter = 0;
cout <<"\nTop Ten Words" <freq > topfreq) // If the freqency on hand is larger that the one found, store...
{
topfreq = tmp->freq;
strcpy(topword, tmp->word);
}
tmp = tmp->next;
}
}
while(topfreq > 0 && counter <10) // This will now find the top 10 frequencies
{
for(int m = 0; m freq == topfreq) // If the freqency we're on is equal to the frequency we're keeping count of...
{
counter++;
if(counter > 10) // We only want the top 10 words
break;
topfreq = tmp->freq; // Store the node details...
strcpy(topword, tmp->word);
cout <next;
}
}
topfreq--; // If counter is never incremented again, this will surely kill the loop... eventually.
}
}
It takes about 30-60 seconds to execute, but it does the job. I don't know much about efficiency (obviously noted from the amount of time it takes).
执行大约需要30-60秒,但它可以完成工作。我对效率知之甚少(显然需要花费很多时间)。