Problem description

I have a pandas DataFrame and I would like to cluster each of its columns separately. I am using sklearn, and this is what I have:

data= pd.read_csv("data.csv")
data=pd.DataFrame(data)
data=data.set_index("Time")
#print(data)
cluster_numbers=2
list_of_cluster=[]
for k,v in data.iteritems():
   temp=KMeans(n_clusters=cluster_numbers)
   temp.fit(data[k])
   print(k)
   print("predicted",temp.predict(data[k]))
   list_of_cluster.append(temp.predict(data[k]))

When I try to run it, I get this error: ValueError: n_samples=1 should be >= n_clusters=2

I am wondering what the problem is, since I have more samples than clusters. Any help would be appreciated.

Recommended answer

The K-Means clusterer expects a 2D array of shape (n_samples, n_features), with one row per data point; each data point may have just a single feature. A single pandas column is 1-D, and here it was evidently interpreted as one sample, which is why the error reports n_samples=1. In your case you have to reshape each pandas column into a matrix with len(data) rows and 1 column. Here is a working example:

from sklearn.cluster import KMeans
import pandas as pd

data = {'one': [1., 2., 3., 4., 3., 2., 1.], 'two': [4., 3., 2., 1., 2., 3., 4.]}
data = pd.DataFrame(data)

n_clusters = 2

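for col in data.columns:
    kmeans = KMeans(n_clusters=n_clusters)
    # A pandas Series has no reshape method in current pandas,
    # so convert to a NumPy array first, then make it (n_samples, 1).
    X = data[col].to_numpy().reshape(-1, 1)
    kmeans.fit(X)
    print("{}: {}".format(col, kmeans.predict(X)))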
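
The question also collects the predicted labels for every column. A minimal sketch of that step follows, assuming the same toy data as above. The name labels_per_column is hypothetical, and the n_init and random_state values are illustrative choices rather than anything from the original answer. Selecting with double brackets, data[[col]], returns a one-column DataFrame that is already 2D, so no reshape is needed.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy data taken from the answer above.
data = pd.DataFrame({
    "one": [1., 2., 3., 4., 3., 2., 1.],
    "two": [4., 3., 2., 1., 2., 3., 4.],
})

n_clusters = 2
labels_per_column = {}  # hypothetical name: column -> array of cluster labels

for col in data.columns:
    # Double brackets keep the result 2D: shape (n_samples, 1).
    X = data[[col]]
    # n_init and random_state are illustrative; fixing the seed makes runs repeatable.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    # fit_predict fits the model and returns one label per row.
    labels_per_column[col] = kmeans.fit_predict(X)

# One column of labels per original column, aligned on the original index.
labels = pd.DataFrame(labels_per_column, index=data.index)
print(labels)
```

The label numbers themselves (0 or 1) are arbitrary for each column, so label 0 in one column does not correspond to label 0 in another.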
