Reading multiple files under a path in Spark (spark textFile with multiple files)

1. Reading files with spark textFile

1.1 Simple file reads

val spark = SparkSession.builder()
    .appName("demo")
    .master("local[3]")
    .getOrCreate()

// read an HDFS directory (path relative to the default filesystem, or a full hdfs:// URI)
spark.sparkContext.textFile("/user/data")
spark.sparkContext.textFile("hdfs://10.252.51.58:8088/user/data")
// read a local directory (note the three slashes: file:// scheme + absolute path)
spark.sparkContext.textFile("file:///user/data")

1.2 Reading files with wildcard patterns

Note that textFile resolves Hadoop-style glob patterns (*, ?, [..], {..}) in the path, not full regular expressions.

val spark = SparkSession.builder()
    .appName("demo")
    .master("local[3]")
    .getOrCreate()

// read all files for days 01-09 under /user/data/201908 on HDFS
spark.sparkContext.textFile("/user/data/201908/0[1-9]/*")

2. Reading multiple files with spark textFile

2.1 Joining multiple file paths into one comma-separated argument

Correct usage: sc.textFile(filename1 + "," + filename2 + "," + filename3)

val spark = SparkSession.builder()
    .appName("demo")
    .master("local[3]")
    .getOrCreate()

val fileList = Array("/user/data/source1","/user/data/source2","/user/data/source3")
// read several HDFS paths as a single RDD
spark.sparkContext.textFile(fileList.mkString(","))
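
A quick usage check on the merged result (the paths above are placeholders): the comma-joined string is read as one RDD spanning all listed inputs.

val merged = spark.sparkContext.textFile(fileList.mkString(","))
println(s"total lines across ${fileList.length} inputs: ${merged.count()}")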

2.2 Combining RDDs with union

val spark = SparkSession.builder()
    .appName("demo")
    .master("local[3]")
    .getOrCreate()

val fileList = Array("/user/data/source1","/user/data/source2","/user/data/source3")
import org.apache.spark.rdd.RDD

// one RDD[String] per input path
val fileRDD: Array[RDD[String]] = fileList.map(spark.sparkContext.textFile(_))

spark.sparkContext.union(fileRDD)
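
The result of union behaves like any single RDD; a minimal usage sketch:

val allLines: RDD[String] = spark.sparkContext.union(fileRDD)
println(s"total lines: ${allLines.count()}")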

SparkSession (Java)

Creating a SparkSession

SparkSession sparkSession = SparkSession.builder()
                .appName("ads_huoyun_BeijingGongAn")
                .enableHiveSupport()
                .getOrCreate();

Getting a JavaSparkContext from the SparkContext

JavaSparkContext javaSparkContext = JavaSparkContext.fromSparkContext(sparkSession.sparkContext());

Converting a List into a Spark RDD

JavaSparkContext javaSparkContext = JavaSparkContext.fromSparkContext(sparkSession.sparkContext());
// fenceList is a List<String> of "lon,lat" strings; each point is mapped to its surrounding grid IDs
JavaPairRDD<String, String> fencePairRdd = javaSparkContext.parallelize(fenceList)
        .flatMapToPair(new PairFlatMapFunction<String, String, String>() {
            @Override
            public Iterator<Tuple2<String, String>> call(String s) throws Exception {
                List<Tuple2<String, String>> fenceList = new ArrayList<>();
                String[] lonlat = s.split(",");
                ArrayList<String> gridsId = CoordinateIntoGrids.genNineSplitId(Double.parseDouble(lonlat[0]), Double.parseDouble(lonlat[1]));
                for (String s1 : gridsId) {
                    // key: grid id, value: the original "lon,lat" string
                    fenceList.add(new Tuple2<>(s1, s));
                }
                return fenceList.iterator();
            }
        });
