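Contents: 1. Download and configure CentOS 7, Hadoop, Spark, JDK, Scala, and Scala IDE for Eclipse
2. Install CentOS 7. Reference blog: http://www.osyunwei.com/archives/7829.html
1.2 Download and install Hadoop 2.8.0
The main steps are: set up passwordless SSH login between the virtual machines --> install Hadoop --> edit the Hadoop configuration files --> start Hadoop --> test Hadoop
1.3 Download and install JDK 1.8.0_162
1.4 Download and install Spark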
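1. Download from: http://scala-ide.org/. Scala IDE needs a graphical desktop on CentOS 7; one can be installed with # yum groupinstall "GNOME Desktop". Once the desktop is installed, extract the downloaded archive and Scala IDE for Eclipse is ready to use.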
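3. Configure Scala IDE for Eclipse: to use packages such as Spark MLlib, the Spark dependency jars must be added to the project. To do so, right-click your project --> Properties --> Java Build Path --> Add External JARs --> open the jars folder inside your Spark directory --> import all the .jar files (the jars folder only exists in Spark 2.0 and later).
4. The Scala IDE environment is now set up. Run the demo: wordcount.
Reference blog: https://blog.csdn.net/hqwang4/article/details/72615125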
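The wordcount demo itself is not listed in this post. The following is a minimal sketch of what such a demo might look like, assuming the Spark jars are already on the project's build path. The object name WordCountDemo, the helper countWords, the local[*] master, and the in-memory sample lines are all illustrative assumptions rather than the original demo:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object WordCountDemo {
  // Split each line on whitespace, pair every word with 1, then sum the 1s per word.
  def countWords(lines: RDD[String]): RDD[(String, Int)] =
    lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    // Assumption: run inside the IDE against a local master, not the cluster.
    val conf = new SparkConf().setAppName("WordCount Demo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Assumption: a tiny in-memory sample stands in for a real input file.
      val lines = sc.parallelize(Seq("hello spark", "hello scala"))
      countWords(lines).collect().sortBy(_._1).foreach {
        case (word, count) => println(s"$word: $count")
      }
    } finally {
      sc.stop()
    }
  }
}
```

If this runs from Run As --> Scala Application without classpath errors, the jars imported in step 3 are probably wired up correctly.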
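2.1 Start the cluster
2.2 Submit the job to the cluster
3.1 Implement the K-Means algorithm
There are many ways to implement K-Means. I chose the K-Means implementation in the Spark MLlib package and wrote the program in Scala. The raw data is Iris.csv (the iris flower data set). The main code is as follows: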
package test

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

object KmeansTest {
  def main(args: Array[String]): Unit = {
    // The master URL is elided in the original; replace "spark://" with your own.
    val conf = new SparkConf().setAppName("K-Means Clustering").setMaster("spark://")
    val sc = new SparkContext(conf)

    // Training data on HDFS (the path is elided in the original).
    val rawTrainingData = sc.textFile("hdfs://")
    // The parsing body is missing from the original listing. This version assumes every
    // line holds only comma-separated numeric features (no header row, no label column).
    val parsedTrainingData = rawTrainingData.map(line => {
      Vectors.dense(line.split(",").map(_.trim).filter(_.nonEmpty).map(_.toDouble))
    }).cache()

    // Cluster the data into three classes using KMeans
    val numClusters = 3
    val numIterations = 40
    val runTimes = 3 // the runs argument has had no effect since Spark 2.0
    var clusterIndex: Int = 0
    val clusters: KMeansModel = KMeans.train(parsedTrainingData, numClusters, numIterations, runTimes)

    println("Cluster Number:" + clusters.clusterCenters.length)
    println("Cluster Centers Information Overview:")
    clusters.clusterCenters.foreach(
      x => {
        println("Center Point of Cluster " + clusterIndex + ":")
        println(x)
        clusterIndex += 1
      })

    // Begin to check which cluster each test data belongs to based on the clustering result
    val rawTestData = sc.textFile("hdfs://")
    val parsedTestData = rawTestData.map(line => {
      Vectors.dense(line.split(",").map(_.trim).filter(_.nonEmpty).map(_.toDouble))
    })
    parsedTestData.collect().foreach(testDataLine => {
      val predictedClusterIndex: Int = clusters.predict(testDataLine)
      println("The data " + testDataLine.toString + " belongs to cluster " + predictedClusterIndex)
    })

    println("Spark MLlib K-means clustering test finished.")
    sc.stop()
  }
}
Clicking Run As --> Scala Application produces the following output:
The data [5.1,3.5,1.4,0.2] belongs to cluster 0
The data [4.9,3.0,1.4,0.2] belongs to cluster 0
The data [4.7,3.2,1.3,0.2] belongs to cluster 0
The data [4.6,3.1,1.5,0.2] belongs to cluster 0
The data [5.0,3.6,1.4,0.2] belongs to cluster 0
The data [5.4,3.9,1.7,0.4] belongs to cluster 0
The data [4.6,3.4,1.4,0.3] belongs to cluster 0
The data [5.0,3.4,1.5,0.2] belongs to cluster 0
The data [4.4,2.9,1.4,0.2] belongs to cluster 0
The data [4.9,3.1,1.5,0.1] belongs to cluster 0
The data [5.4,3.7,1.5,0.2] belongs to cluster 0
The data [4.8,3.4,1.6,0.2] belongs to cluster 0
The data [4.8,3.0,1.4,0.1] belongs to cluster 0
The data [4.3,3.0,1.1,0.1] belongs to cluster 0
The data [5.8,4.0,1.2,0.2] belongs to cluster 0
The data [5.7,4.4,1.5,0.4] belongs to cluster 0
The data [5.4,3.9,1.3,0.4] belongs to cluster 0
The data [5.1,3.5,1.4,0.3] belongs to cluster 0
The data [5.7,3.8,1.7,0.3] belongs to cluster 0
The data [5.1,3.8,1.5,0.3] belongs to cluster 0
The data [5.4,3.4,1.7,0.2] belongs to cluster 0
The data [5.1,3.7,1.5,0.4] belongs to cluster 0
The data [4.6,3.6,1.0,0.2] belongs to cluster 0
The data [5.1,3.3,1.7,0.5] belongs to cluster 0
The data [4.8,3.4,1.9,0.2] belongs to cluster 0
The data [5.0,3.0,1.6,0.2] belongs to cluster 0
The data [5.0,3.4,1.6,0.4] belongs to cluster 0
The data [5.2,3.5,1.5,0.2] belongs to cluster 0
The data [5.2,3.4,1.4,0.2] belongs to cluster 0
The data [4.7,3.2,1.6,0.2] belongs to cluster 0
The data [4.8,3.1,1.6,0.2] belongs to cluster 0
The data [5.4,3.4,1.5,0.4] belongs to cluster 0
The data [5.2,4.1,1.5,0.1] belongs to cluster 0
The data [5.5,4.2,1.4,0.2] belongs to cluster 0
The data [4.9,3.1,1.5,0.1] belongs to cluster 0
The data [5.0,3.2,1.2,0.2] belongs to cluster 0
The data [5.5,3.5,1.3,0.2] belongs to cluster 0
The data [4.9,3.1,1.5,0.1] belongs to cluster 0
The data [4.4,3.0,1.3,0.2] belongs to cluster 0
The data [5.1,3.4,1.5,0.2] belongs to cluster 0
The data [5.0,3.5,1.3,0.3] belongs to cluster 0
The data [4.5,2.3,1.3,0.3] belongs to cluster 0
The data [4.4,3.2,1.3,0.2] belongs to cluster 0
The data [5.0,3.5,1.6,0.6] belongs to cluster 0
The data [5.1,3.8,1.9,0.4] belongs to cluster 0
The data [4.8,3.0,1.4,0.3] belongs to cluster 0
The data [5.1,3.8,1.6,0.2] belongs to cluster 0
The data [4.6,3.2,1.4,0.2] belongs to cluster 0
The data [5.3,3.7,1.5,0.2] belongs to cluster 0
The data [5.0,3.3,1.4,0.2] belongs to cluster 0
The data [7.0,3.2,4.7,1.4] belongs to cluster 2
The data [6.4,3.2,4.5,1.5] belongs to cluster 1
The data [6.9,3.1,4.9,1.5] belongs to cluster 2
The data [5.5,2.3,4.0,1.3] belongs to cluster 1
The data [6.5,2.8,4.6,1.5] belongs to cluster 1
The data [5.7,2.8,4.5,1.3] belongs to cluster 1
The data [6.3,3.3,4.7,1.6] belongs to cluster 1
The data [4.9,2.4,3.3,1.0] belongs to cluster 1
The data [6.6,2.9,4.6,1.3] belongs to cluster 1
The data [5.2,2.7,3.9,1.4] belongs to cluster 1
The data [5.0,2.0,3.5,1.0] belongs to cluster 1
The data [5.9,3.0,4.2,1.5] belongs to cluster 1
The data [6.0,2.2,4.0,1.0] belongs to cluster 1
The data [6.1,2.9,4.7,1.4] belongs to cluster 1
The data [5.6,2.9,3.6,1.3] belongs to cluster 1
The data [6.7,3.1,4.4,1.4] belongs to cluster 1
The data [5.6,3.0,4.5,1.5] belongs to cluster 1
The data [5.8,2.7,4.1,1.0] belongs to cluster 1
The data [6.2,2.2,4.5,1.5] belongs to cluster 1
The data [5.6,2.5,3.9,1.1] belongs to cluster 1
The data [5.9,3.2,4.8,1.8] belongs to cluster 1
The data [6.1,2.8,4.0,1.3] belongs to cluster 1
The data [6.3,2.5,4.9,1.5] belongs to cluster 1
The data [6.1,2.8,4.7,1.2] belongs to cluster 1
The data [6.4,2.9,4.3,1.3] belongs to cluster 1
The data [6.6,3.0,4.4,1.4] belongs to cluster 1
The data [6.8,2.8,4.8,1.4] belongs to cluster 1
The data [6.7,3.0,5.0,1.7] belongs to cluster 2
The data [6.0,2.9,4.5,1.5] belongs to cluster 1
The data [5.7,2.6,3.5,1.0] belongs to cluster 1
The data [5.5,2.4,3.8,1.1] belongs to cluster 1
The data [5.5,2.4,3.7,1.0] belongs to cluster 1
The data [5.8,2.7,3.9,1.2] belongs to cluster 1
The data [6.0,2.7,5.1,1.6] belongs to cluster 1
The data [5.4,3.0,4.5,1.5] belongs to cluster 1
The data [6.0,3.4,4.5,1.6] belongs to cluster 1
The data [6.7,3.1,4.7,1.5] belongs to cluster 1
The data [6.3,2.3,4.4,1.3] belongs to cluster 1
The data [5.6,3.0,4.1,1.3] belongs to cluster 1
The data [5.5,2.5,4.0,1.3] belongs to cluster 1
The data [5.5,2.6,4.4,1.2] belongs to cluster 1
The data [6.1,3.0,4.6,1.4] belongs to cluster 1
The data [5.8,2.6,4.0,1.2] belongs to cluster 1
The data [5.0,2.3,3.3,1.0] belongs to cluster 1
The data [5.6,2.7,4.2,1.3] belongs to cluster 1
The data [5.7,3.0,4.2,1.2] belongs to cluster 1
The data [5.7,2.9,4.2,1.3] belongs to cluster 1
The data [6.2,2.9,4.3,1.3] belongs to cluster 1
The data [5.1,2.5,3.0,1.1] belongs to cluster 1
The data [5.7,2.8,4.1,1.3] belongs to cluster 1
The data [6.3,3.3,6.0,2.5] belongs to cluster 2
The data [5.8,2.7,5.1,1.9] belongs to cluster 1
The data [7.1,3.0,5.9,2.1] belongs to cluster 2
The data [6.3,2.9,5.6,1.8] belongs to cluster 2
The data [6.5,3.0,5.8,2.2] belongs to cluster 2
The data [7.6,3.0,6.6,2.1] belongs to cluster 2
The data [4.9,2.5,4.5,1.7] belongs to cluster 1
The data [7.3,2.9,6.3,1.8] belongs to cluster 2
The data [6.7,2.5,5.8,1.8] belongs to cluster 2
The data [7.2,3.6,6.1,2.5] belongs to cluster 2
The data [6.5,3.2,5.1,2.0] belongs to cluster 2
The data [6.4,2.7,5.3,1.9] belongs to cluster 2
The data [6.8,3.0,5.5,2.1] belongs to cluster 2
The data [5.7,2.5,5.0,2.0] belongs to cluster 1
The data [5.8,2.8,5.1,2.4] belongs to cluster 1
The data [6.4,3.2,5.3,2.3] belongs to cluster 2
The data [6.5,3.0,5.5,1.8] belongs to cluster 2
The data [7.7,3.8,6.7,2.2] belongs to cluster 2
The data [7.7,2.6,6.9,2.3] belongs to cluster 2
The data [6.0,2.2,5.0,1.5] belongs to cluster 1
The data [6.9,3.2,5.7,2.3] belongs to cluster 2
The data [5.6,2.8,4.9,2.0] belongs to cluster 1
The data [7.7,2.8,6.7,2.0] belongs to cluster 2
The data [6.3,2.7,4.9,1.8] belongs to cluster 1
The data [6.7,3.3,5.7,2.1] belongs to cluster 2
The data [7.2,3.2,6.0,1.8] belongs to cluster 2
The data [6.2,2.8,4.8,1.8] belongs to cluster 1
The data [6.1,3.0,4.9,1.8] belongs to cluster 1
The data [6.4,2.8,5.6,2.1] belongs to cluster 2
The data [7.2,3.0,5.8,1.6] belongs to cluster 2
The data [7.4,2.8,6.1,1.9] belongs to cluster 2
The data [7.9,3.8,6.4,2.0] belongs to cluster 2
The data [6.4,2.8,5.6,2.2] belongs to cluster 2
The data [6.3,2.8,5.1,1.5] belongs to cluster 1
The data [6.1,2.6,5.6,1.4] belongs to cluster 2
The data [7.7,3.0,6.1,2.3] belongs to cluster 2
The data [6.3,3.4,5.6,2.4] belongs to cluster 2
The data [6.4,3.1,5.5,1.8] belongs to cluster 2
The data [6.0,3.0,4.8,1.8] belongs to cluster 1
The data [6.9,3.1,5.4,2.1] belongs to cluster 2
The data [6.7,3.1,5.6,2.4] belongs to cluster 2
The data [6.9,3.1,5.1,2.3] belongs to cluster 2
The data [5.8,2.7,5.1,1.9] belongs to cluster 1
The data [6.8,3.2,5.9,2.3] belongs to cluster 2
The data [6.7,3.3,5.7,2.5] belongs to cluster 2
The data [6.7,3.0,5.2,2.3] belongs to cluster 2
The data [6.3,2.5,5.0,1.9] belongs to cluster 1
The data [6.5,3.0,5.2,2.0] belongs to cluster 2
The data [6.2,3.4,5.4,2.3] belongs to cluster 2
The data [5.9,3.0,5.1,1.8] belongs to cluster 1
Spark MLlib K-means clustering test finished.
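
A listing this long is easier to judge once it is summarized. The sketch below is one possible way to tally the predicted cluster indices per block of rows, using plain Scala and no Spark. The object name ClusterTally, the helpers clusterOf and tally, and the block size are illustrative assumptions. Grouping by blocks is only meaningful if the rows are in the usual Iris order of 50 rows per species, which the post does not state explicitly.

```scala
object ClusterTally {
  // Matches lines of the form printed by the program and captures the cluster index.
  private val Line = """The data \[.*\] belongs to cluster (\d+)""".r

  // Returns the cluster index of an output line, or None for any other line.
  def clusterOf(line: String): Option[Int] = line match {
    case Line(idx) => Some(idx.toInt)
    case _         => None
  }

  // Splits the predictions into consecutive blocks and counts each cluster index per block.
  def tally(lines: Seq[String], blockSize: Int): List[Map[Int, Int]] =
    lines
      .flatMap(line => clusterOf(line))
      .grouped(blockSize)
      .map(block => block.groupBy(identity).map { case (c, hits) => c -> hits.size })
      .toList

  def main(args: Array[String]): Unit = {
    // A few lines copied from the output above; use blockSize = 50 on the full listing.
    val sample = Seq(
      "The data [5.1,3.5,1.4,0.2] belongs to cluster 0",
      "The data [4.9,3.0,1.4,0.2] belongs to cluster 0",
      "The data [7.0,3.2,4.7,1.4] belongs to cluster 2",
      "The data [6.4,3.2,4.5,1.5] belongs to cluster 1",
      "Spark MLlib K-means clustering test finished."
    )
    tally(sample, blockSize = 2).zipWithIndex.foreach { case (counts, i) =>
      println(s"block $i: " + counts.toSeq.sorted.mkString(", "))
    }
  }
}
```

Counting the listing above in blocks of 50 gives: rows 1 to 50 all in cluster 0; rows 51 to 100 split 47 in cluster 1 and 3 in cluster 2; rows 101 to 150 split 14 in cluster 1 and 36 in cluster 2. Under the ordering assumption, the first species is separated cleanly while the other two overlap.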