黑猴子的家: Spark RDD SequenceFile Input and Output (one of the main ways to read and save data)

A SequenceFile is a flat file format designed by Hadoop for storing key-value pairs in binary form.

Spark provides a dedicated interface for reading SequenceFiles: on a SparkContext, call sequenceFile[keyClass, valueClass](path).

scala> val data = sc.parallelize(List((2,"aa"),(3,"bb"),(4,"cc"),(5,"dd"),(6,"ee")))
data: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[1] at parallelize at <console>:24

scala> data.saveAsSequenceFile("hdfs://hadoop102:9000/sequdata")

scala> val sdata = sc.sequenceFile[Int,String]("hdfs://hadoop102:9000/sequdata/p*")
sdata: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[6] at sequenceFile at <console>:24

scala> sdata.collect
res1: Array[(Int, String)] = Array((2,aa), (3,bb), (4,cc), (5,dd), (6,ee))

You can save a pair RDD simply by calling saveAsSequenceFile(path), which writes the data out for you. This requires that the keys and values can be automatically converted to Writable types.
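When the automatic conversion does not apply, you can also pass the Writable key and value classes explicitly and convert by hand. A minimal, self-contained sketch of this, assuming a local-mode SparkContext and an illustrative temp-directory output path (in spark-shell, `sc` already exists and you would write to HDFS instead):

```scala
import java.nio.file.Files
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}

// Local SparkContext for the sketch; in spark-shell, sc is provided.
val sc = new SparkContext(new SparkConf().setAppName("seq-demo").setMaster("local[*]"))

// saveAsSequenceFile requires that the target directory not exist yet,
// so write into a fresh subdirectory of a temp dir (path is illustrative).
val out = Files.createTempDirectory("sequdata").toString + "/out"

val data = sc.parallelize(List((2, "aa"), (3, "bb"), (4, "cc")))
data.saveAsSequenceFile(out) // Int/String are converted to IntWritable/Text

// Read back with explicit Writable classes. Hadoop reuses Writable objects
// across records, so convert to immutable Scala types before collecting.
val raw = sc.sequenceFile(out, classOf[IntWritable], classOf[Text])
val result = raw.map { case (k, v) => (k.get, v.toString) }.collect.sorted
```

Mapping each Writable pair to `(k.get, v.toString)` before `collect` avoids the classic pitfall of collecting mutable Writable objects that Hadoop recycles between records.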

| Scala type  | Java type | Hadoop Writable type         |
|-------------|-----------|------------------------------|
| Int         | Integer   | IntWritable or VIntWritable  |
| Long        | Long      | LongWritable or VLongWritable|
| Float       | Float     | FloatWritable                |
| Double      | Double    | DoubleWritable               |
| Boolean     | Boolean   | BooleanWritable              |
| Array[Byte] | byte[]    | BytesWritable                |
| String      | String    | Text                         |
| Array[T]    | T[]       | ArrayWritable                |
| List[T]     | List[T]   | ArrayWritable                |
| Map[A,B]    | Map<A,B>  | MapWritable                  |
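Because these conversions happen automatically, the shell session above never needs to mention a Writable. saveAsSequenceFile also accepts an optional compression codec as a second argument. A short sketch, assuming Hadoop's GzipCodec is on the classpath and using an illustrative temp-directory path rather than HDFS:

```scala
import java.nio.file.Files
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("seq-gzip").setMaster("local[*]"))
val data = sc.parallelize(List((2, "aa"), (3, "bb")))

// The second parameter selects a compression codec for the SequenceFile.
// The path is illustrative; the target directory must not already exist.
val out = Files.createTempDirectory("sequ-gz").toString + "/out"
data.saveAsSequenceFile(out, Some(classOf[GzipCodec]))

// Reading back needs no codec argument: Hadoop detects it from the file header.
val back = sc.sequenceFile[Int, String](out).collect.sorted
```

Compression is transparent on the read side, so consumers of the file do not need to know which codec was used to write it.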
