More code: https://github.com/xubo245/SparkLearning
1. Installation:
(1) spark-shell:
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.10:1.4.0
Running this directly drops you into the shell with the package already loaded.
(2) Eclipse project:
The jar packages installed in (1) can be imported into the project; the jars are under /home/hadoop/.ivy2.
2. Usage:
Since spark-shell is inconvenient for debugging, it was not explored in depth here; for details see 【1】. The following is done under Eclipse.
(0) Download the data:
wget https://github.com/databricks/spark-csv/raw/master/src/test/resources/cars.csv
(1) readCsvBySparkSQL
Examples:
/**
 * @author xubo
 * sparkCSV learning
 * @time 20160419
 * reference https://github.com/databricks/spark-csv
 * blog http://blog.csdn.net/xubo245/article/details/51184946
 */
package com.apache.spark.sparkCSV.learning

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object readCsvBySparkSQLLoad {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SparkLearning:SparkCSV").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext._
    val df = sqlContext.load("com.databricks.spark.csv",
      Map("path" -> "file/data/sparkCSV/input/cars.csv", "header" -> "true"))
    df.select("year", "model").save("file/data/sparkCSV/output/newcars.csv", "com.databricks.spark.csv")
    df.show
    sc.stop
  }
}
+----+-----+-----+--------------------+-----+
|year| make|model|             comment|blank|
+----+-----+-----+--------------------+-----+
|2012|Tesla|    S|          No comment|     |
|1997| Ford| E350|Go get one now th...|     |
|2015|Chevy| Volt|                null| null|
+----+-----+-----+--------------------+-----+
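With "header" -> "true", spark-csv treats the first line of the file as column names, which is why df.select("year", "model") works above. The following is a plain-Scala sketch of that behavior (no Spark needed; the data is inlined and the toy parser deliberately ignores quoted fields, so it is an illustration, not spark-csv's actual parser):

```scala
// A simplified stand-in for spark-csv's header handling plus df.select():
// the first line supplies the column names, later lines are data rows.
// NOTE: does not handle quoted fields -- a deliberate simplification.
def selectColumns(lines: Seq[String], cols: Seq[String]): Seq[Seq[String]] = {
  val header = lines.head.split(",", -1).toSeq      // column names from the header row
  val idx = cols.map(c => header.indexOf(c))        // positions of the requested columns
  lines.tail.map { line =>
    val fields = line.split(",", -1)
    idx.map(fields(_))
  }
}

val cars = Seq(
  "year,make,model",
  "2012,Tesla,S",
  "1997,Ford,E350",
  "2015,Chevy,Volt"
)
// analogous to df.select("year", "model")
println(selectColumns(cars, Seq("year", "model")))
```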
/**
 * @author xubo
 * sparkCSV learning
 * @time 20160419
 * reference https://github.com/databricks/spark-csv
 * blog http://blog.csdn.net/xubo245/article/details/51184946
 */
package com.apache.spark.sparkCSV.learning

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object readCsvBySparkSQLLoad2 {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SparkLearning:SparkCSV").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext._
    val file1 = "file/data/sparkCSV/input/sample_submission.csv"
    println(file1)
    val df = sqlContext.load("com.databricks.spark.csv",
      Map("path" -> file1, "header" -> "false"))
    // df.select("year", "model").save("newcars.csv", "com.databricks.spark.csv")
    df.show
    sc.stop
  }
}
file/data/input/sample_submission.csv
2016-04-19 00:19:32 WARN :139 - Your hostname, xubo-PC resolves to a loopback/non-reachable address: 100.78.140.148, but we couldn't find any external IP address!
+-----+---+---+
|   C0| C1| C2|
+-----+---+---+
|  535|all|  1|
|  727|all|  1|
| 1765|all|  1|
| 8230|all|  1|
| 9574|all|  1|
| 9595|all|  1|
| 9754|all|  1|
| 9964|all|  1|
|11068|all|  1|
|12223|all|  1|
|12940|all|  1|
|13282|all|  1|
|14920|all|  1|
|17392|all|  1|
|17731|all|  1|
|18958|all|  1|
|19966|all|  1|
|22108|all|  1|
|22282|all|  1|
|23671|all|  1|
+-----+---+---+
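With "header" -> "false", no line supplies column names, so the columns are auto-named C0, C1, C2, ... as the printed schema above shows. A tiny Scala sketch of that default naming convention (inferred from the output; not spark-csv's actual code):

```scala
// Default column names when the file has no header row: C0, C1, ..., C(n-1),
// matching the C0/C1/C2 schema spark-csv printed above.
def defaultColumnNames(n: Int): Seq[String] = (0 until n).map(i => s"C$i")

println(defaultColumnNames(3))  // names for a 3-column file such as sample_submission.csv
```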
(2) For other Spark versions and the Python, R, and Java APIs, see reference 【1】.
More Spark learning code: https://github.com/xubo245/SparkLearning
References:
【1】 https://github.com/databricks/spark-csv
【2】 http://www.iteblog.com/archives/1380