I am working on a clustering problem and it has to be scalable for a lot of data. I would like to try hierarchical clustering in Spark and compare my results with other methods. I have done some research on the web about using hierarchical clustering with Spark but haven't found any promising information. If anyone has some insight about it, I would be very grateful. Thank you.

Solution

The Bisecting K-means Approach

It seems to do a decent job, and it runs quite fast in terms of performance. Here is sample code I wrote for using the bisecting k-means algorithm in Spark (Scala) to get cluster centers from the Iris data set (which many people are familiar with).

Note: I use Spark-Notebook for most of my Spark work; it is very similar to Jupyter Notebooks. I bring this up because you will need to create a Spark SQLContext for this example to work, and how you do that may differ based on where or how you are accessing Spark.

You can download the Iris.csv to test here.

You can download Spark-Notebook here. It is a great tool which easily allows you to run a standalone Spark cluster. If you want help with it on Linux or Mac, I can provide instructions.

Once you download it, you need to use SBT to compile it. Use the following commands from the base directory: sbt, then run. It will then be accessible at localhost:9000.
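If you are not using Spark-Notebook, the exact way you obtain the SQLContext will differ. As a minimal sketch (not part of the original answer, assuming Spark 2.x), you can create a SparkSession yourself and take the SQLContext from it; the app name and master setting below are just placeholders:

import org.apache.spark.sql.SparkSession

// Placeholder standalone setup; in spark-shell a `spark` session already exists
val spark = SparkSession.builder()
  .appName("BisectingKMeansIris")   // hypothetical app name
  .master("local[*]")               // run locally using all cores
  .getOrCreate()

// The same sqlContext used in the snippets below
val sqlContext = spark.sqlContext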
Required Imports

import org.apache.spark.sql.types._
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.clustering.BisectingKMeans

Method to create the sqlContext in Spark-Notebook

import org.apache.spark.sql.SQLContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

Defining the Import Schema

val customSchema = StructType(Array(
  StructField("c0", IntegerType, true),
  StructField("Sepal_Length", DoubleType, true),
  StructField("Sepal_Width", DoubleType, true),
  StructField("Petal_Length", DoubleType, true),
  StructField("Petal_Width", DoubleType, true),
  StructField("Species", StringType, true)))

Making the DataFrame

val iris_df = sqlContext.read.format("csv")
  .option("header", "true") // reading the headers
  .option("mode", "DROPMALFORMED")
  .schema(customSchema)
  .load("/your/path/to/iris.csv")

Specifying the features

val assembler = new VectorAssembler()
  .setInputCols(Array("c0", "Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"))
  .setOutputCol("features")
val iris_df_trans = assembler.transform(iris_df)

Model with 3 clusters (change with .setK)

val bkm = new BisectingKMeans().setK(3).setSeed(1L).setFeaturesCol("features")
val model = bkm.fit(iris_df_trans)

Computing the cost

val cost = model.computeCost(iris_df_trans)

Calculating the centers

println(s"Within Set Sum of Squared Errors = $cost")
println("Cluster Centers: ")
val centers = model.clusterCenters
centers.foreach(println)

An Agglomerative Approach

The following provides an agglomerative hierarchical clustering implementation for Spark which is worth a look. It is not included in the base MLlib like the bisecting k-means method, and I do not have an example for it, but it is worth checking out for those who are curious.

Github Project

YouTube of the presentation at Spark Summit

Slides from Spark Summit
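Going back to the bisecting k-means example above: as a small follow-up sketch (not part of the original answer), the fitted model can also assign each row to a cluster via transform, and you can cross-tabulate those assignments against the known Species label as a rough sanity check. This assumes the model and iris_df_trans values defined earlier:

// transform adds a "prediction" column holding the assigned cluster index
val assigned = model.transform(iris_df_trans)

// Count how the three clusters line up with the actual species
assigned.groupBy("Species", "prediction")
  .count()
  .orderBy("Species", "prediction")
  .show()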