Spark Ecosystem: Learning spark-csv, Part 1: Installation and Simple Examples

More code is available at: https://github.com/xubo245/SparkLearning


1. Installation:

(1) spark-shell:

$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.10:1.4.0
Running this command downloads the package and drops you straight into the shell with spark-csv available.


(2) Eclipse project:

You can import the three jars downloaded in step (1) into your project; they are cached under /home/hadoop/.ivy2.
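Instead of importing the jars by hand, a build tool can resolve the dependency for you. A minimal sbt sketch, assuming Scala 2.10 to match the artifact used with spark-shell above:

```scala
// build.sbt fragment: pull in spark-csv 1.4.0 (Scala 2.10 build)
libraryDependencies += "com.databricks" % "spark-csv_2.10" % "1.4.0"
```

sbt will then place spark-csv and its transitive dependencies on the classpath automatically.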



2. Usage:

Since the spark-shell is inconvenient for debugging, I did not explore it in depth; see reference [1] for details. The examples below run under Eclipse.

(0) Download the test data:

wget https://github.com/databricks/spark-csv/raw/master/src/test/resources/cars.csv


(1) readCsvBySparkSQLLoad

Example:

/**
 * @author xubo
 * spark-csv learning
 * @time 20160419
 * reference https://github.com/databricks/spark-csv
 * blog http://blog.csdn.net/xubo245/article/details/51184946
 */
package com.apache.spark.sparkCSV.learning

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object readCsvBySparkSQLLoad {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SparkLearning:SparkCSV").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Load the CSV through the spark-csv data source;
    // "header" -> "true" treats the first line of the file as column names.
    val df = sqlContext.load("com.databricks.spark.csv",
      Map("path" -> "file/data/sparkCSV/input/cars.csv", "header" -> "true"))

    // Write two columns back out as CSV (fails if the output path already exists).
    df.select("year", "model").save("file/data/sparkCSV/output/newcars.csv", "com.databricks.spark.csv")
    df.show

    sc.stop
  }
}

Output:

+----+-----+-----+--------------------+-----+
|year| make|model|             comment|blank|
+----+-----+-----+--------------------+-----+
|2012|Tesla|    S|          No comment|     |
|1997| Ford| E350|Go get one now th...|     |
|2015|Chevy| Volt|                null| null|
+----+-----+-----+--------------------+-----+
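The `sqlContext.load` / `df.save` calls used above are the older generic data source API. On Spark 1.4+, the same round trip can be written with the `DataFrameReader`/`DataFrameWriter` style shown in the spark-csv README; a minimal sketch, assuming the same `sc` and file paths as the example above:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Read: "header" uses the first line as column names,
// "inferSchema" asks spark-csv to guess column types instead of using strings.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("file/data/sparkCSV/input/cars.csv")

// Write selected columns back out, keeping a header line.
df.select("year", "model")
  .write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("file/data/sparkCSV/output/newcars.csv")
```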


With my own file (from the Tianchi competition: Cainiao Network):

/**
 * @author xubo
 * spark-csv learning
 * @time 20160419
 * reference https://github.com/databricks/spark-csv
 * blog http://blog.csdn.net/xubo245/article/details/51184946
 */
package com.apache.spark.sparkCSV.learning

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object readCsvBySparkSQLLoad2 {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SparkLearning:SparkCSV").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val file1 = "file/data/sparkCSV/input/sample_submission.csv"
    println(file1)

    // This file has no header line, so spark-csv generates column names C0, C1, ...
    val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> file1, "header" -> "false"))
    df.show

    sc.stop
  }
}


Output:

file/data/input/sample_submission.csv
2016-04-19 00:19:32 WARN  :139 - Your hostname, xubo-PC resolves to a loopback/non-reachable address: 100.78.140.148, but we couldn't find any external IP address!
+-----+---+---+
|   C0| C1| C2|
+-----+---+---+
|  535|all|  1|
|  727|all|  1|
| 1765|all|  1|
| 8230|all|  1|
| 9574|all|  1|
| 9595|all|  1|
| 9754|all|  1|
| 9964|all|  1|
|11068|all|  1|
|12223|all|  1|
|12940|all|  1|
|13282|all|  1|
|14920|all|  1|
|17392|all|  1|
|17731|all|  1|
|18958|all|  1|
|19966|all|  1|
|22108|all|  1|
|22282|all|  1|
|23671|all|  1|
+-----+---+---+
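Since this file has no header, spark-csv falls back to the generated names C0, C1, C2 with string types. The README also allows supplying an explicit schema to get meaningful names and types. A sketch, assuming the same `sc` and file path as above; the column names here are my own guesses, since the file itself carries no header:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}

val sqlContext = new SQLContext(sc)

// Hypothetical column names for the submission file; adjust to the real meaning.
val schema = StructType(Array(
  StructField("item_id", IntegerType, nullable = true),
  StructField("store_code", StringType, nullable = true),
  StructField("target", IntegerType, nullable = true)))

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .schema(schema)
  .load("file/data/sparkCSV/input/sample_submission.csv")
```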


(2) For other Spark versions, and for Python, R, Java, etc., see reference [1].


More Spark learning code: https://github.com/xubo245/SparkLearning


References:

[1] https://github.com/databricks/spark-csv

[2] http://www.iteblog.com/archives/1380
