This article looks at how to remove duplicates in Java when working with large-scale data. The question and the recommended answer below may be a useful reference for anyone facing the same problem.

Problem description

I have the following issue. I'm connecting to some place using an API and getting the data as an InputStream. The goal is to save the data after removing duplicate lines, where duplication is defined by columns 10, 15, and 22.

I'm fetching the data using several threads. Currently I first save the data into a CSV file and then remove the duplicates, but I want to do it while I'm reading the data. The volume is about 10 million records. The memory I can use is limited: the machine has 32 GB of RAM, but other applications are using part of it.

I read here about using hash maps, but I'm not sure I have enough memory to use one.

Does anyone have a suggestion how to solve this issue?

Recommended answer

The solution depends on how big your data in columns 10, 15, and 22 is.

Assuming it's not too big (say, around 1 KB per key), you can implement an in-memory solution:


  • Implement a Key class that stores the values from columns 10, 15, and 22. Carefully implement the equals and hashCode methods. (You could also use a plain ArrayList as the key instead.)
  • Create a Set that will contain the keys of all records you have read.
  • For each record you read, check whether its key is already in that set. If it is, skip the record. If not, write the record to the output and add the key to the set. Make sure you work with the set in a thread-safe manner. A sketch of this approach follows the list.
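
A minimal sketch of this approach, under stated assumptions (comma-separated lines, a shared BufferedWriter for the output; the class names and the column splitting are illustrative, not part of the original answer):

import java.io.BufferedWriter;
import java.io.IOException;
import java.util.Objects;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Key holds only the three columns that define a duplicate.
final class Key {
    private final String c10, c15, c22;

    Key(String c10, String c15, String c22) {
        this.c10 = c10;
        this.c15 = c15;
        this.c22 = c22;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Key)) return false;
        Key k = (Key) o;
        return c10.equals(k.c10) && c15.equals(k.c15) && c22.equals(k.c22);
    }

    @Override
    public int hashCode() {
        return Objects.hash(c10, c15, c22);
    }
}

class Deduplicator {
    // Thread-safe set backed by ConcurrentHashMap; shared by all reader threads.
    private final Set<Key> seen = ConcurrentHashMap.newKeySet();
    private final BufferedWriter out;

    Deduplicator(BufferedWriter out) {
        this.out = out;
    }

    // Called by each reader thread for every line it receives.
    void process(String line) throws IOException {
        String[] cols = line.split(",", -1);            // assumes comma-separated columns
        Key key = new Key(cols[9], cols[14], cols[21]); // columns 10, 15, 22 (0-based indices)
        if (seen.add(key)) {                            // add() returns false if the key was already present
            synchronized (out) {                        // keep writes from different threads intact
                out.write(line);
                out.newLine();
            }
        }
    }
}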

In the worst case you'll need (number of records) × (size of key) of memory. For 10,000,000 records and the assumed <1 KB per key, this should work with around 10 GB.

If the key size is still too large, you'll probably need a database to store the set of keys.

Another option would be storing hashes of the keys instead of the full keys. This requires much less memory, but you may get hash collisions. That can lead to "false positives", i.e. records flagged as duplicates that aren't actually duplicates. To completely avoid this you'd need a database.
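
One possible sketch of the hashing variant, again an assumption rather than part of the original answer: keep only a 64-bit hash of each key, here taken from the first 8 bytes of a SHA-256 digest. With 10 million records this should stay in the low hundreds of megabytes including set overhead, at the cost of a tiny chance of false duplicates.

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class HashedKeySet {
    // Each entry costs one boxed Long instead of the full column values.
    private final Set<Long> seen = ConcurrentHashMap.newKeySet();

    // Returns true if this combination of column values has not been seen before.
    boolean addIfNew(String c10, String c15, String c22) {
        // Separator character avoids accidentally merging different column values.
        return seen.add(hash64(c10 + '\u0001' + c15 + '\u0001' + c22));
    }

    // 64-bit hash taken from the first 8 bytes of a SHA-256 digest.
    private static long hash64(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is available on all standard JVMs
        }
    }
}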

That concludes this article on removing duplicates in Java over large-scale data. Hopefully the recommended answer is helpful.
