Reading Excel (xlsx) Data with Spark (via SparkSession)

Reading xlsx
Versions:
IntelliJ IDEA Community Edition 2019.2.4
apache-maven-3.6.2
Spark 2.0.2
hadoop2.6_Win_x64-master

Without further ado, straight to the point:
I first tried reading the file with SparkContext and found it doesn't work: SparkContext's text-oriented APIs can't parse the binary xlsx format. So I used SparkSession instead, whose DataFrameReader can load a custom data source format such as com.crealytics.spark.excel.

1. First, add the jar dependencies (make sure the versions are consistent, or you'll get errors):

pom.xml

<!-- Read Excel / xlsx files -->
        <dependency>
            <groupId>com.crealytics</groupId>
            <artifactId>spark-excel_2.11</artifactId>
            <version>0.12.2</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.xmlbeans/xmlbeans -->
        <dependency>
            <groupId>org.apache.xmlbeans</groupId>
            <artifactId>xmlbeans</artifactId>
            <version>3.1.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>3.17</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml-schemas</artifactId>
            <version>3.17</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi</artifactId>
            <version>3.17</version>
        </dependency>
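
Note that the _2.11 suffix on spark-excel must match your project's Scala version, and SparkSession lives in the spark-sql module, so that dependency must also be on the classpath for the code in step 2 to compile. If your pom doesn't already declare it, it would look something like this, with the version matching your Spark installation (2.0.2 in my case):

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.0.2</version>
        </dependency>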

2. Then the Scala code:

package com.h3.pro

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object Task1 {
  def main(args: Array[String]): Unit = {
    // I'm running on Windows 10 without a HADOOP_HOME environment variable,
    // so point hadoop.home.dir at a directory containing bin\winutils.exe.
    System.setProperty("hadoop.home.dir", "D:\\software\\hadoop2.6_Win_x64-master")
    val conf = new SparkConf().setAppName("Task1").setMaster("local")
    val spark = SparkSession.builder().config(conf).getOrCreate()
    val frame = spark.read.format("com.crealytics.spark.excel")
      .option("useHeader", "true") // treat the first row as column headers
      // The next three options are optional
      //.option("timestampFormat", "MM-dd-yyyy HH:mm:ss")
      //.option("inferSchema", "false")
      //.option("workbookPassword", "None")
      .load("***.xlsx")
    frame.take(10).foreach(println)
  }
}
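
By default the first sheet of the workbook is read. In spark-excel 0.12.x, a dataAddress option selects a specific sheet or cell range; here is a minimal sketch (the sheet name Sheet2 and the A1:C100 range are placeholders for illustration), reusing the spark session created inside main above:

    val sheet2 = spark.read.format("com.crealytics.spark.excel")
      .option("useHeader", "true")
      .option("dataAddress", "'Sheet2'!A1:C100") // hypothetical sheet and range
      .load("***.xlsx")
    sheet2.printSchema()   // check how the columns were typed
    sheet2.show(10, false) // first 10 rows, untruncated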
