Which data clustering algorithm is appropriate for detecting an unknown number of clusters in a time series of events?

Question

Here's my scenario. Consider a set of events that happen at various places and times - as an example, consider someone high above recording the lightning strikes in a city during a storm. For my purpose, lightnings are instantaneous and can only hit certain locations (such as high buildings). Also imagine each lightning strike has a unique id so one can reference the strike later. There are about 100,000 such locations in this city (as you guess, this is an analogy as my current employer is sensitive about the actual problem).

For phase 1, my input is the set of (strike id, strike time, strike location) tuples. The desired output is the set of clusters of more than one event that hit the same location within a short time. The number of clusters is not known in advance (so k-means is not that useful here). What is considered 'short' could be predefined for a given clustering attempt. That is, I can set it to, say, 3 minutes, then run the algorithm; later try with 4 minutes or 10 minutes. Perhaps a nice touch would be for the algorithm to determine a 'strength' of clustering and recommend that, for a given input, the most compact clustering is achieved by using a particular value for 'short', but this is not required initially.
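One possible reading of phase 1, sketched in Python under some assumptions: times are plain numbers (for example seconds), and two consecutive strikes at the same location belong to the same cluster when the gap between them is at most a threshold. The names `cluster_strikes` and `max_gap` are illustrative only and do not come from any library.

```python
from collections import defaultdict

def cluster_strikes(strikes, max_gap):
    """Group (strike_id, time, location) tuples into per-location clusters.

    Assumption: consecutive strikes at one location are in the same cluster
    when they are at most `max_gap` time units apart.
    Returns (location, [strike ids]) pairs for clusters with more than one event.
    """
    # Each location gets its own time series.
    by_location = defaultdict(list)
    for strike_id, time, location in strikes:
        by_location[location].append((time, strike_id))

    clusters = []
    for location, events in by_location.items():
        events.sort(key=lambda e: e[0])  # order this location's strikes by time
        current = [events[0]]
        for prev, event in zip(events, events[1:]):
            if event[0] - prev[0] <= max_gap:
                current.append(event)  # still within the 'short' gap
            else:
                if len(current) > 1:
                    clusters.append((location, [sid for _, sid in current]))
                current = [event]  # a large gap starts a new candidate cluster
        if len(current) > 1:
            clusters.append((location, [sid for _, sid in current]))
    return clusters
```

The number of clusters falls out of the threshold rather than being supplied up front, so the same data can be re-run with `max_gap` set to 180, 240 or 600 seconds. Note that this rule chains events: a cluster as a whole can span more than `max_gap` as long as each consecutive gap is small.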

For phase 2, I'd like to take into consideration the amplitude of the strike (i.e., a real number) and look for clusters that are both within a short time and with similar amplitudes.
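A rough sketch of how phase 2 could look in Python, for the events of a single location. It assumes that two strikes are "linked" when they are close in both time and amplitude, and that clusters are the connected groups of linked strikes. The function name and both thresholds (`max_gap`, `max_amp_diff`) are hypothetical choices for illustration, not a prescribed method.

```python
def cluster_by_time_and_amplitude(events, max_gap, max_amp_diff):
    """Cluster one location's events, given as (strike_id, time, amplitude).

    Assumption: two events are linked when they differ by at most `max_gap`
    in time and at most `max_amp_diff` in amplitude. Clusters are the
    connected components of that link graph with more than one event.
    """
    events = sorted(events, key=lambda e: e[1])  # sort by time
    parent = list(range(len(events)))            # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, (_, t_i, a_i) in enumerate(events):
        for j in range(i + 1, len(events)):
            _, t_j, a_j = events[j]
            if t_j - t_i > max_gap:
                break  # events are time-sorted, so later ones are even farther
            if abs(a_j - a_i) <= max_amp_diff:
                parent[find(j)] = find(i)  # merge the two groups

    groups = {}
    for i, (strike_id, _, _) in enumerate(events):
        groups.setdefault(find(i), []).append(strike_id)
    return [ids for ids in groups.values() if len(ids) > 1]
```

For example, with `max_gap=180` and `max_amp_diff=2.0`, strikes at times 0 and 60 with amplitudes 10.0 and 10.5 end up together, while a strike at time 90 with amplitude 30.0 does not join them even though it is close in time.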

I googled and checked the answers here about data clustering. The information is a bit bewildering (below is the list of links I found useful). AFAIK, k-means and related algorithms would not be useful because they require the number of clusters to be specified a priori. I'm not asking for someone to solve my problem (I like solving it), but some orientation in the large world of data clustering algorithms would be useful in order to save some time. Specifically, what clustering algorithms are appropriate when the number of clusters is unknown?

Edit: I realized the location is irrelevant, in the sense that although events happen all the time, I only need to cluster them per location. So each location has its own time series of events, which can thus be analyzed independently.

Some technical details:
- as the dataset is not that large, it can fit all in memory.
- parallel processing is a nice to have, but not essential. I only have a 4-core machine and MapReduce and Hadoop would be too much.
- the language I'm mostly familiar with is Java. I haven't yet used R and the learning curve for it would probably be too much for what time I was given. I'll have a look at it anyway in my spare time.
- for the time being, using tools to run the analysis is ok, I don't have to produce just code. I'm mentioning this because probably Weka will be suggested.
- visualization would be useful. As the dataset is large enough so it doesn't fit in memory, the visualization should at least support zooming and panning. And to clarify: I don't need to build a visualization GUI, it's just a nice capability to use for checking the results produced with a tool.

Thank you. Questions that I found useful are: http://stackoverflow.com/questions/2027252, http://stackoverflow.com/questions/562904, http://stackoverflow.com/questions/2129269, http://stackoverflow.com/questions/691922, http://stackoverflow.com/questions/356035

Recommended answer

Couldn't you just use hierarchical clustering with the difference in times of strikes as part of the distance metric?
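To make that suggestion concrete, here is a naive sketch of agglomerative (bottom-up hierarchical) clustering in Python for the strike times of one location. It assumes complete linkage, meaning the distance between two clusters is the largest time difference between any two of their members, and it stops merging once that distance exceeds a threshold. The name `complete_linkage_clusters` and the parameter `max_span` are made up for this example.

```python
def complete_linkage_clusters(times, max_span):
    """Naive agglomerative clustering of one location's strike times.

    Every strike starts in its own cluster. The two clusters with the
    smallest complete-linkage distance are merged repeatedly, and merging
    stops once that distance exceeds `max_span`. The threshold, not a
    preset k, decides how many clusters remain.
    Returns only clusters with more than one strike.
    """
    clusters = [[t] for t in sorted(times)]

    def distance(a, b):
        # Complete linkage: the farthest pair of members decides the distance.
        return max(abs(x - y) for x in a for y in b)

    while len(clusters) > 1:
        # Find the closest pair of clusters (ties broken by index).
        d, i, j = min(
            (distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > max_span:
            break  # no remaining pair is close enough to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [c for c in clusters if len(c) > 1]
```

With complete linkage, no two strikes in a cluster are more than `max_span` apart, which may match "within a short time" more closely than chaining on consecutive gaps. This version is roughly cubic in the number of strikes per location, so it is only meant to show the idea. For phase 2, the distance function could be extended to include the amplitude difference as well.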
