What is a fast alternative to a HashMap that maps to primitive types?

Problem description



First of all, let me tell you that I have read the question Java HashMap performance optimization / alternative, which has been asked before, and I have a similar question.

What I want to do is take a LOT of dependencies from New York Times text, which will be processed by the Stanford parser to give dependencies, and store the dependencies in a HashMap along with their scores, i.e. if I see a dependency twice I will increment its score in the HashMap by 1.

The task starts off really quickly, at about 10 sentences a second, but slows down quickly. At 30 000 sentences (assuming 10 words in each sentence and about 3-4 dependencies for each word, which I'm storing) there are about 300 000 entries in my HashMap.

How will I be able to increase the performance of my HashMap? What kind of hash key can I use?

Thanks a lot,
Martinos

EDIT 1:

OK, maybe I phrased my question wrongly. The byte arrays are not used in MY project, but in the similar question of another person linked above. I don't know what they are using them for, hence why I asked.

Secondly: I will not post the code, as I consider it would make things very hard to understand, but here is a sample:

With the sentence "i am going to bed" I have the dependencies:

    (i, am, -1)
    (i, going, -2)
    (i, to, -3)
    (am, going, -1)
    . . .
    (to, bed, -1)

These dependencies, for all sentences (1 000 000 sentences), will be stored in a HashMap. If I see a dependency twice, I will get the score of the existing dependency and add 1.

And that is pretty much it. All is well, but the rate of adding sentences to the HashMap (or retrieving from it) slows down on this line:

    dependancyBank.put(newDependancy, dependancyBank.get(newDependancy) + 1);

Can anyone tell me why?

Regards,
Martinos
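As an illustration, here is a minimal sketch of that counting pattern with a plain java.util.HashMap. The names Dependancy, dependancyBank and newDependancy come from the question; the field names, constructor and surrounding class are assumptions, since the real code is not posted. Note that the line quoted above throws a NullPointerException the first time a dependency is seen, because get() returns null for an absent key, so the sketch guards against that.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Objects;

    // Sketch of the dependency key; the question only shows tuples like (i, am, -1),
    // so the field names and constructor are assumptions.
    final class Dependancy {
        final String head;
        final String dependent;
        final int offset;

        Dependancy(String head, String dependent, int offset) {
            this.head = head;
            this.dependent = dependent;
            this.offset = offset;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Dependancy)) return false;
            Dependancy d = (Dependancy) o;
            return offset == d.offset
                    && head.equals(d.head)
                    && dependent.equals(d.dependent);
        }

        @Override
        public int hashCode() {
            return Objects.hash(head, dependent, offset);
        }
    }

    class DependancyBank {
        private final Map<Dependancy, Integer> dependancyBank = new HashMap<>();

        // Null-safe version of the update quoted above: get() returns null for an
        // absent key, so the original line would fail the first time a dependency
        // is seen.
        void add(Dependancy newDependancy) {
            Integer score = dependancyBank.get(newDependancy);
            dependancyBank.put(newDependancy, score == null ? 1 : score + 1);
        }
    }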

Solution

Trove has optimized hash maps for the case where the key or the value is of a primitive type.
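For the value side, a minimal sketch of what that could look like, assuming Trove 3.x is on the classpath and reusing the Dependancy key sketched in the question above (the class and method names here are illustrative, not from the original answer):

    import gnu.trove.map.hash.TObjectIntHashMap;

    class TroveDependancyBank {
        // Values are primitive ints, so there is no Integer boxing on every update.
        private final TObjectIntHashMap<Dependancy> dependancyBank =
                new TObjectIntHashMap<>();

        void add(Dependancy newDependancy) {
            // Adds 1 to the existing count, or inserts the key with a count of 1.
            dependancyBank.adjustOrPutValue(newDependancy, 1, 1);
        }

        int score(Dependancy d) {
            return dependancyBank.get(d);  // returns 0 (the default no-entry value) for absent keys
        }
    }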

However, much will still depend on a smart choice of structure and hash code for your keys.
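For example, one common refinement (a sketch, not something prescribed by the original answer) is to compute the hash code once in the constructor and cache it, so that hashing the same key object during puts and gets costs no more than a field read:

    // Variant of the Dependancy sketch above with a cached hash code.
    final class CachedDependancy {
        final String head;
        final String dependent;
        final int offset;
        private final int hash;  // computed once per key object

        CachedDependancy(String head, String dependent, int offset) {
            this.head = head;
            this.dependent = dependent;
            this.offset = offset;
            this.hash = 31 * (31 * head.hashCode() + dependent.hashCode()) + offset;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof CachedDependancy)) return false;
            CachedDependancy d = (CachedDependancy) o;
            return offset == d.offset
                    && head.equals(d.head)
                    && dependent.equals(d.dependent);
        }

        @Override
        public int hashCode() {
            return hash;
        }
    }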

This part of your question is unclear: "The task starts off really quickly, about 10 sentences a second, but slows down quickly. At 30 000 sentences (assuming 10 words in each sentence and about 3-4 dependencies for each word, which I'm storing) there are about 300 000 entries in my HashMap." But you don't say what the performance is for the larger data. Your map grows, which is kind of obvious. Hash maps are O(1) only in theory; in practice you will see some performance change with size, due to reduced cache locality and due to occasional jumps caused by rehashing. So put() and get() times will not be constant, but they should be close to constant. Perhaps you are using the hashmap in a way that doesn't guarantee fast access, e.g. by iterating over it? In that case your time will grow linearly with size, and you can't change that unless you change your algorithm.
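To make that last point concrete, here is a hypothetical contrast (the lookup-by-iteration variant is an assumption about what such code might look like, not something shown in the question; it uses the Dependancy key from the earlier sketch):

    import java.util.Map;

    class LookupComparison {
        // O(n) per lookup: scans every entry, so total time grows with the map size.
        static Integer lookupByIteration(Map<Dependancy, Integer> bank, Dependancy key) {
            for (Map.Entry<Dependancy, Integer> e : bank.entrySet()) {
                if (e.getKey().equals(key)) {
                    return e.getValue();
                }
            }
            return null;
        }

        // Expected O(1) per lookup: goes straight through the hash table.
        static Integer lookupByGet(Map<Dependancy, Integer> bank, Dependancy key) {
            return bank.get(key);
        }
    }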

