Parameters:
# hadoopFile: returns key-value pairs; the key is the byte offset of each line, the value is the line's content
# log.txt:
# http://www.baidu.com
# http://www.google.com
# http://www.google.com
# ... ... ...
rdd = sc.hadoopFile("hdfs://centos03:9000/datas/log.txt",
                    inputFormatClass="org.apache.hadoop.mapred.TextInputFormat",
                    keyClass="org.apache.hadoop.io.LongWritable",
                    valueClass="org.apache.hadoop.io.Text")
print(rdd.collect()) #1
rdd1 = rdd.map(lambda x: x[1].split(":"))
print(rdd1.collect()) #2
#1
[(0, 'http://www.baidu.com'), (22, 'http://www.google.com'), (45, 'http://www.google.com'), (68, 'http://cn.bing.com'), (88, 'http://cn.bing.com'), (108, 'http://www.baidu.com'), (130, 'http://www.sohu.com'), (151, 'http://www.sina.com'), (172, 'http://www.sin2a.com'), (194, 'http://www.sin2desa.com'), (219, 'http://www.sindsafa.com')]
#2
[['http', '//www.baidu.com'], ['http', '//www.google.com'], ['http', '//www.google.com'], ['http', '//cn.bing.com'], ['http', '//cn.bing.com'], ['http', '//www.baidu.com'], ['http', '//www.sohu.com'], ['http', '//www.sina.com'], ['http', '//www.sin2a.com'], ['http', '//www.sin2desa.com'], ['http', '//www.sindsafa.com']]
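The keys in output #1 are byte offsets, not line numbers. The following is a minimal plain-Python sketch (no Spark required) of how such offsets arise. It assumes the file uses Windows-style `\r\n` line endings, which is what the 22-byte step after the 20-character first URL suggests; the helper name `line_offsets` is made up for illustration.

```python
def line_offsets(data: bytes):
    """Return (byte_offset, line_text) pairs, similar to TextInputFormat keys."""
    pairs = []
    offset = 0
    for raw in data.splitlines(keepends=True):
        # The terminator is stripped from the value but still counted in the offset.
        text = raw.rstrip(b"\r\n").decode("utf-8")
        pairs.append((offset, text))
        offset += len(raw)
    return pairs


# First three lines of log.txt, assuming \r\n line endings.
sample = (b"http://www.baidu.com\r\n"
          b"http://www.google.com\r\n"
          b"http://www.google.com\r\n")
print(line_offsets(sample))
# [(0, 'http://www.baidu.com'), (22, 'http://www.google.com'), (45, 'http://www.google.com')]
```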
Parameters:
# newAPIHadoopFile: returns key-value pairs; the key is the byte offset of each line, the value is the line's content
rdd = sc.newAPIHadoopFile("hdfs://centos03:9000/datas/log.txt",
                          # inputFormatClass differs from the old API
                          inputFormatClass="org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
                          keyClass="org.apache.hadoop.io.LongWritable",
                          valueClass="org.apache.hadoop.io.Text"
                          )
print(rdd.collect()) #1
rdd1 = rdd.map(lambda x: x[1].split(":"))
print(rdd1.collect()) #2
#1
[(0, 'http://www.baidu.com'), (22, 'http://www.google.com'), (45, 'http://www.google.com'), (68, 'http://cn.bing.com'), (88, 'http://cn.bing.com'), (108, 'http://www.baidu.com'), (130, 'http://www.sohu.com'), (151, 'http://www.sina.com'), (172, 'http://www.sin2a.com'), (194, 'http://www.sin2desa.com'), (219, 'http://www.sindsafa.com')]
#2
[['http', '//www.baidu.com'], ['http', '//www.google.com'], ['http', '//www.google.com'], ['http', '//cn.bing.com'], ['http', '//cn.bing.com'], ['http', '//www.baidu.com'], ['http', '//www.sohu.com'], ['http', '//www.sina.com'], ['http', '//www.sin2a.com'], ['http', '//www.sin2desa.com'], ['http', '//www.sindsafa.com']]
Parameters:
confs = {"mapred.input.dir": "hdfs://centos03:9000/datas/log.txt"}
rdd = sc.hadoopRDD(inputFormatClass="org.apache.hadoop.mapred.TextInputFormat",
keyClass="org.apache.hadoop.io.LongWritable",
valueClass="org.apache.hadoop.io.Text",
conf=confs)
print(rdd.collect()) #1
#1
[(0, 'http://www.baidu.com'), (22, 'http://www.google.com'), (45, 'http://www.google.com'), (68, 'http://cn.bing.com'), (88, 'http://cn.bing.com'), (108, 'http://www.baidu.com'), (130, 'http://www.sohu.com'), (151, 'http://www.sina.com'), (172, 'http://www.sin2a.com'), (194, 'http://www.sin2desa.com'), (219, 'http://www.sindsafa.com')]
Parameters:
confs = {"mapreduce.input.fileinputformat.inputdir":"hdfs://centos03:9000/datas/log.txt"}
rdd = sc.newAPIHadoopRDD(
inputFormatClass="org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
keyClass="org.apache.hadoop.io.LongWritable",
valueClass="org.apache.hadoop.io.Text",
conf=confs)
print(rdd.collect()) #1
#1
[(0, 'http://www.baidu.com'), (22, 'http://www.google.com'), (45, 'http://www.google.com'), (68, 'http://cn.bing.com'), (88, 'http://cn.bing.com'), (108, 'http://www.baidu.com'), (130, 'http://www.sohu.com'), (151, 'http://www.sina.com'), (172, 'http://www.sin2a.com'), (194, 'http://www.sin2desa.com'), (219, 'http://www.sindsafa.com')]
Parameter:
Reads an RDD that was previously saved with saveAsPickleFile.
# pickleFile reads data saved by saveAsPickleFile; the data comes back in the same form in which it was saved
rdd = sc.newAPIHadoopFile("hdfs://centos03:9000/datas/log.txt",
inputFormatClass="org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
keyClass="org.apache.hadoop.io.LongWritable",
valueClass="org.apache.hadoop.io.Text"
)
print(rdd.collect()) #1
rdd1 = rdd.map(lambda x: x[1].split(":")).map(lambda x: (x[0], x[1]))
print(rdd1.collect()) #2
rdd1.saveAsPickleFile("hdfs://centos03:9000/datas/logp.txt")
print(sc.pickleFile("hdfs://centos03:9000/datas/logp.txt").collect()) #3
#1
[(0, 'http://www.baidu.com'), (22, 'http://www.google.com'), (45, 'http://www.google.com'), (68, 'http://cn.bing.com'), (88, 'http://cn.bing.com'), (108, 'http://www.baidu.com'), (130, 'http://www.sohu.com'), (151, 'http://www.sina.com'), (172, 'http://www.sin2a.com'), (194, 'http://www.sin2desa.com'), (219, 'http://www.sindsafa.com')]
#2
[('http', '//www.baidu.com'), ('http', '//www.google.com'), ('http', '//www.google.com'), ('http', '//cn.bing.com'), ('http', '//cn.bing.com'), ('http', '//www.baidu.com'), ('http', '//www.sohu.com'), ('http', '//www.sina.com'), ('http', '//www.sin2a.com'), ('http', '//www.sin2desa.com'), ('http', '//www.sindsafa.com')]
#3
[('http', '//www.baidu.com'), ('http', '//www.google.com'), ('http', '//www.google.com'), ('http', '//cn.bing.com'), ('http', '//cn.bing.com'), ('http', '//www.baidu.com'), ('http', '//www.sohu.com'), ('http', '//www.sina.com'), ('http', '//www.sin2a.com'), ('http', '//www.sin2desa.com'), ('http', '//www.sindsafa.com')]
Parameters:
# Reads a Hadoop SequenceFile; keyClass and valueClass are optional and may be omitted
rdd = sc.sequenceFile(path="hdfs://centos03:9000/datas/seqFile",
keyClass="org.apache.hadoop.io.LongWritable",
valueClass="org.apache.hadoop.io.Text")
print(rdd.collect()) #1
#1
[('Pandas', 3), ('Key', 6), ('Sanil', 2)]
Parameter:
# textFile: with use_unicode=False the strings are kept as str (UTF-8 encoded), which is faster and smaller than unicode
rdd = sc.textFile(name="hdfs://centos03:9000/datas/log.txt")
print(rdd.collect()) #1
#1
['http://www.baidu.com', 'http://www.google.com', 'http://www.google.com', 'http://cn.bing.com', 'http://cn.bing.com', 'http://www.baidu.com', 'http://www.sohu.com', 'http://www.sina.com', 'http://www.sin2a.com', 'http://www.sin2desa.com', 'http://www.sindsafa.com']
Reads a directory of text files from HDFS, the local file system, or any other Hadoop-supported file system. Each file is read as a single record and returned as a key-value pair, where the key is the path of the file and the value is its content.
Parameters:
# wholeTextFiles: well suited to directories that contain many small files
rdd = sc.wholeTextFiles(path="hdfs://centos03:9000/table")
print(rdd.collect()) #1
rdd1 = rdd.map(lambda x: x[1].split("\t"))
print(rdd1.collect()) #2
#1
[('hdfs://centos03:9000/table/order.txt', '1001\t01\t1\r\n1002\t02\t2\r\n1003\t03\t3\r\n1004\t01\t4\r\n1005\t02\t5\r\n1006\t03\t6'), ('hdfs://centos03:9000/table/pd.txt', '01\t小米\r\n02\t华为\r\n03\t格力\r\n')]
#2
[['1001', '01', '1\r\n1002', '02', '2\r\n1003', '03', '3\r\n1004', '01', '4\r\n1005', '02', '5\r\n1006', '03', '6'], ['01', '小米\r\n02', '华为\r\n03', '格力\r\n']]
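Output #2 shows that splitting a whole file on "\t" alone leaves the "\r\n" line terminators embedded in the fields, because each value is the entire file content. A small plain-Python sketch of splitting into rows first and then into fields; `parse_table` is a hypothetical helper name, not part of PySpark.

```python
def parse_table(content: str):
    """Split one file's content into rows, then each row into tab-separated fields."""
    # str.splitlines() handles \n, \r\n and \r, so no stray terminators remain.
    return [line.split("\t") for line in content.splitlines() if line]


order_txt = "1001\t01\t1\r\n1002\t02\t2\r\n1003\t03\t3"
print(parse_table(order_txt))
# [['1001', '01', '1'], ['1002', '02', '2'], ['1003', '03', '3']]
```

With the RDD from the snippet above, something like `rdd.flatMap(lambda kv: parse_table(kv[1]))` should yield one record per row instead of one per file.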
Output a Python RDD of key-value pairs (of form RDD[(K, V)])
Parameters:
# saveAsHadoopFile
rdd = sc.parallelize([('good', 1), ("spark", 4), ("beats", 3)])
print(rdd.collect())
rdd.saveAsHadoopFile(
path="hdfs://centos03:9000/datas/rdd_seq",
outputFormatClass="org.apache.hadoop.mapred.SequenceFileOutputFormat"
)
print(sc.sequenceFile("hdfs://centos03:9000/datas/rdd_seq").collect()) #1
#1
[('good', 1), ('spark', 4), ('beats', 3)]
Or:
# saveAsHadoopFile
rdd = sc.parallelize([('good', 1), ("spark", 4), ("beats", 3)])
print(rdd.collect())
rdd.saveAsHadoopFile(
path="hdfs://centos03:9000/datas/rdd_seq",
outputFormatClass="org.apache.hadoop.mapred.TextOutputFormat")
rdd1 = sc.hadoopFile(
    "hdfs://centos03:9000/datas/rdd_seq",
    inputFormatClass="org.apache.hadoop.mapred.TextInputFormat",
    # TextInputFormat produces LongWritable keys (byte offsets), not IntWritable
    keyClass="org.apache.hadoop.io.LongWritable",
    valueClass="org.apache.hadoop.io.Text")
print(rdd1.collect()) #1
#1
[(0, 'good\t1'), (0, 'spark\t4'), (0, 'beats\t3')]
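Comparing the two snippets above, saving the data in serialized (SequenceFile) form works better: the keys and values come back with their original types, whereas the text format returns each record as a single tab-joined string.
However, when the data is sc.parallelize([{'good': 1}, {'spark': 4}, {'beats': 3}]), the save fails with org.apache.spark.SparkException: RDD element of type java.util.HashMap cannot be used. Converting each element with json.dumps first still fails (org.apache.spark.SparkException: RDD element of type java.lang.String cannot be used). A remark found online says: "To use String and Map objects you will need to use the more extensive native support available in Scala and Java."
The official API documentation in fact states the requirement: what gets written out is a Python RDD of key-value pairs, so every element must be a (key, value) pair.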
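Since the save methods expect (key, value) elements, one way around the HashMap error is to flatten each dict into pairs before saving. A minimal plain-Python sketch of that reshaping; `dict_to_pairs` is a hypothetical helper, and the list comprehension stands in for what `flatMap` would do on an RDD.

```python
def dict_to_pairs(record: dict):
    """Flatten one dict into a list of (key, value) tuples."""
    return list(record.items())


records = [{'good': 1}, {'spark': 4}, {'beats': 3}]
# Equivalent in spirit to rdd.flatMap(dict_to_pairs) on an RDD of dicts.
pairs = [pair for record in records for pair in dict_to_pairs(record)]
print(pairs)
# [('good', 1), ('spark', 4), ('beats', 3)]
```

The result has the same shape as the tuple data that was saved successfully in the snippets above.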
Output a Python RDD of key-value pairs (of form RDD[(K, V)])
Parameters:
# saveAsNewAPIHadoopFile
rdd = sc.parallelize([('good', 1), ("spark", 4), ("beats", 3)])
print(rdd.collect())
rdd.saveAsNewAPIHadoopFile(path="hdfs://centos03:9000/datas/rdd_seq",
outputFormatClass="org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat")
print(sc.sequenceFile("hdfs://centos03:9000/datas/rdd_seq").collect()) #1
#1
[('good', 1), ('spark', 4), ('beats', 3)]
# saveAsNewAPIHadoopFile
rdd = sc.parallelize([('good', 1), ("spark", 4), ("beats", 3)])
print(rdd.collect())
rdd.saveAsNewAPIHadoopFile(
path="hdfs://centos03:9000/datas/rdd_seq",
outputFormatClass="org.apache.hadoop.mapreduce.lib.output.TextOutputFormat"
)
rdd2 = sc.hadoopFile(
    "hdfs://centos03:9000/datas/rdd_seq",
    inputFormatClass="org.apache.hadoop.mapred.TextInputFormat",
    # TextInputFormat produces LongWritable keys (byte offsets), not IntWritable
    keyClass="org.apache.hadoop.io.LongWritable",
    valueClass="org.apache.hadoop.io.Text")
print(rdd2.collect()) #1
#1
[(0, 'good\t1'), (0, 'spark\t4'), (0, 'beats\t3')]
What if the data is stored in a different shape, with a dict as the value?
rdd = sc.parallelize([(1, {'good': 1}), (2, {'spark': 4}), (3, {'beats': 3})])
print(rdd.collect())
rdd.saveAsNewAPIHadoopFile(
path="hdfs://centos03:9000/datas/rdd_seq",
outputFormatClass="org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat")
print(sc.sequenceFile("hdfs://centos03:9000/datas/rdd_seq").collect()) #1
#1
[(1, {'good': 1}), (2, {'spark': 4}), (3, {'beats': 3})]
rdd = sc.parallelize([(1, {'good': 1}), (2, {'spark': 4}), (3, {'beats': 3})])
print(rdd.collect())
rdd.saveAsNewAPIHadoopFile(
path="hdfs://centos03:9000/datas/rdd_seq",
outputFormatClass="org.apache.hadoop.mapreduce.lib.output.TextOutputFormat")
rdd2 = sc.hadoopFile(
    "hdfs://centos03:9000/datas/rdd_seq",
    inputFormatClass="org.apache.hadoop.mapred.TextInputFormat",
    # TextInputFormat produces LongWritable keys (byte offsets), not IntWritable
    keyClass="org.apache.hadoop.io.LongWritable",
    valueClass="org.apache.hadoop.io.Text")
print(rdd2.collect()) #1
#1
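[(0, '1\torg.apache.hadoop.io.MapWritable@3e9840'), (0, '2\torg.apache.hadoop.io.MapWritable@83dcb79'), (0, '3\torg.apache.hadoop.io.MapWritable@7493c20')]
As the code above shows, saving in serialized (SequenceFile) form is still the better choice, because it preserves the structure of the original data. With TextOutputFormat the dict values are written as the MapWritable object's default string, so their content is lost.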
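If a text format is still required, one option is to turn each dict value into a JSON string before saving, so that the written line stays readable. The sketch below is plain Python and only imitates the "key, tab, value" line layout seen in the outputs above; `encode_value` and `decode_line` are hypothetical helpers, and how PySpark converts the string value on write is an assumption to verify.

```python
import json


def encode_value(pair):
    """Replace the dict value of a (key, dict) pair with its JSON text."""
    key, value = pair
    return key, json.dumps(value, sort_keys=True)


def decode_line(line):
    """Parse one 'key<TAB>json' line back into a (key, dict) pair."""
    key, payload = line.split("\t", 1)
    return int(key), json.loads(payload)


pairs = [(1, {'good': 1}), (2, {'spark': 4}), (3, {'beats': 3})]
# Imitate the "key\tvalue" layout that the text output shows.
lines = ["{}\t{}".format(*encode_value(p)) for p in pairs]
print(lines)
print([decode_line(line) for line in lines])
```

On an RDD, the same idea would be `rdd.map(encode_value)` before the save and `map(decode_line)` after reading the lines back.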
Output a Python RDD of key-value pairs (of form RDD[(K, V)])
Parameters:
# saveAsHadoopDataset
confs = {"outputFormatClass": "org.apache.hadoop.mapred.TextOutputFormat",
"keyClass": "org.apache.hadoop.io.LongWritable",
"valueClass": "org.apache.hadoop.io.Text",
"mapred.output.dir": "hdfs://centos03:9000/datas/rdd"}
rdd = sc.parallelize([('good', 1), ("spark", 4), ("beats", 3)])
rdd.saveAsHadoopDataset(conf=confs)  # job parameters are passed in through conf
rdd2 = sc.hadoopFile("hdfs://centos03:9000/datas/rdd", inputFormatClass="org.apache.hadoop.mapred.TextInputFormat", keyClass="org.apache.hadoop.io.LongWritable", valueClass="org.apache.hadoop.io.Text")
print(rdd2.collect()) #1
#1
[(0, 'good\t1'), (0, 'spark\t4'), (0, 'beats\t3')]
# saveAsHadoopDataset
confs = {"outputFormatClass": "org.apache.hadoop.mapred.SequenceFileOutputFormat",
         "keyClass": "org.apache.hadoop.io.LongWritable",
         "valueClass": "org.apache.hadoop.io.Text",
         "mapred.output.dir": "hdfs://centos03:9000/datas/rdd"
         }
rdd = sc.parallelize([('good', 1), ("spark", 4), ("beats", 3)])
rdd.saveAsHadoopDataset(conf=confs)
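# textFile returns readable lines here, which suggests the data was written as plain text rather than as a SequenceFile:
# "outputFormatClass" is not a Hadoop property name, so the default TextOutputFormat was most likely used
rdd2 = sc.textFile("hdfs://centos03:9000/datas/rdd")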
print(rdd2.collect()) #1
#1
['good\t1', 'spark\t4', 'beats\t3']
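Both saveAsHadoopDataset snippets above pass the keys "outputFormatClass", "keyClass" and "valueClass" in conf. As far as I know these are argument names of saveAsHadoopFile, not Hadoop job properties, so they are probably ignored here. The sketch below builds a conf dict with the old-API property names used in the PySpark examples (`mapred.output.format.class` and friends); the helper `old_api_output_conf` is made up for illustration, and the Text/IntWritable classes are my guess at what matches ('good', 1) style data.

```python
def old_api_output_conf(path, output_format, key_class, value_class):
    """Build a conf dict for saveAsHadoopDataset using old-API (mapred) property names."""
    return {
        "mapred.output.format.class": output_format,
        "mapred.output.key.class": key_class,
        "mapred.output.value.class": value_class,
        "mapred.output.dir": path,
    }


confs = old_api_output_conf(
    path="hdfs://centos03:9000/datas/rdd",
    output_format="org.apache.hadoop.mapred.SequenceFileOutputFormat",
    key_class="org.apache.hadoop.io.Text",           # assumed: string keys
    value_class="org.apache.hadoop.io.IntWritable",  # assumed: int values
)
for name, value in sorted(confs.items()):
    print(name, "=", value)
```

Passing such a dict as `rdd.saveAsHadoopDataset(conf=confs)` should then produce a real SequenceFile, which `sc.sequenceFile` can read but `sc.textFile` cannot decode cleanly.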
Output a Python RDD of key-value pairs (of form RDD[(K, V)])
Parameters:
# saveAsNewAPIHadoopDataset
confs = {"outputFormatClass":"org.apache.hadoop.mapreduce.lib.output.TextOutputFormat",
"keyClass": "org.apache.hadoop.io.LongWritable",
"valueClass": "org.apache.hadoop.io.Text",
"mapreduce.output.fileoutputformat.outputdir": "hdfs://centos03:9000/datas/rdd"
}
rdd = sc.parallelize([('good', 1), ("spark", 4), ("beats", 3)])
rdd.saveAsNewAPIHadoopDataset(conf=confs)
rdd1 = sc.newAPIHadoopFile(path="hdfs://centos03:9000/datas/rdd", inputFormatClass="org.apache.hadoop.mapreduce.lib.input.TextInputFormat", keyClass="org.apache.hadoop.io.LongWritable", valueClass="org.apache.hadoop.io.Text")
print(rdd1.collect()) #1
rdd2 = sc.textFile("hdfs://centos03:9000/datas/rdd")
print(rdd2.collect()) #2
#1
[(0, 'good\t1'), (0, 'spark\t4'), (0, 'beats\t3')]
#2
['good\t1', 'spark\t4', 'beats\t3']
Save this RDD as a SequenceFile of serialized objects. The serializer used is pyspark.serializers.PickleSerializer, default batch size is 10.
# saveAsPickleFile
rdd = sc.parallelize([('good', 1), ("spark", 4), ("beats", 3)])
rdd.saveAsPickleFile("hdfs://centos03:9000/datas/rdd")
rdd1 = sc.pickleFile("hdfs://centos03:9000/datas/rdd")
print(rdd1.collect()) #1
#1
[('good', 1), ('spark', 4), ('beats', 3)]
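To give a feel for what "pickled in batches" means, here is a simplified plain-Python illustration using the standard pickle module. It is not PySpark's actual on-disk layout; `pickle_in_batches` and `unpickle_batches` are hypothetical helpers, and only the default batch size of 10 is taken from the description above.

```python
import pickle


def pickle_in_batches(items, batch_size=10):
    """Pickle items in groups of batch_size, producing one bytes blob per group."""
    return [pickle.dumps(items[i:i + batch_size])
            for i in range(0, len(items), batch_size)]


def unpickle_batches(blobs):
    """Load every blob and flatten the batches back into one list."""
    return [item for blob in blobs for item in pickle.loads(blob)]


data = [('good', 1), ('spark', 4), ('beats', 3)]
blobs = pickle_in_batches(data)
print(len(blobs))               # 1: three items fit into a single batch
print(unpickle_batches(blobs))  # [('good', 1), ('spark', 4), ('beats', 3)]
```

Because whole Python objects are pickled, tuples, dicts, and nested structures survive the round trip unchanged.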
Output a Python RDD of key-value pairs (of form RDD[(K, V)])
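Two conversions take place before the data is written out: 1. the pickled Python RDD is converted into a Java RDD; 2. the keys and values of the Java RDD are converted to Writables, which are then written out.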
Parameters:
# saveAsSequenceFile
rdd = sc.parallelize([('good', 1), ("spark", 4), ("beats", 3)])
rdd.saveAsSequenceFile("hdfs://centos03:9000/datas/rdd")
rdd1 = sc.sequenceFile("hdfs://centos03:9000/datas/rdd")
print(rdd1.collect()) #1
rdd2 = sc.textFile("hdfs://centos03:9000/datas/rdd")
print(rdd2.collect()) #2
#1
[('good', 1), ('spark', 4), ('beats', 3)]
#2
['SEQ\x06\x19org.apache.hadoop.io.Text org.apache.hadoop.io.IntWritable\x00\x00\x00\x00\x00\x00�ekpR2\x08�
U��Yn$’, 'SEQ\x06\x19org.apache.hadoop.io.Text org.apache.hadoop.io.IntWritable\x00\x00\x00\x00\x00\x00�4��E�}βZ;�v\x1f\t\x00\x00\x00\t\x00\x00\x00\x05\x04good\x00\x00\x00\x01', 'SEQ\x06\x19org.apache.hadoop.io.Text org.apache.hadoop.io.IntWritable\x00\x00\x00\x00\x00\x00\x14��˹\x02oM�g��f�\x02v\x00\x00\x00', '\x00\x00\x00\x06\x05spark\x00\x00\x00\x04', 'SEQ\x06\x19org.apache.hadoop.io.Text org.apache.hadoop.io.IntWritable\x00\x00\x00\x00\x00\x00F\x0b��\x04lD\x116+\x16n��d�\x00\x00\x00', '\x00\x00\x00\x06\x05beats\x00\x00\x00\x03']
# saveAsTextFile
rdd = sc.parallelize([('good', 1), ("spark", 4), ("beats", 3)])
rdd.saveAsTextFile("hdfs://centos03:9000/datas/rdd")
rdd2 = sc.textFile("hdfs://centos03:9000/datas/rdd")
print(rdd2.collect()) #1
#1
["('good', 1)", "('spark', 4)", "('beats', 3)"]
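As the output shows, saveAsTextFile writes the string form of each element, so reading the file back gives strings rather than tuples. A short plain-Python sketch of turning such lines back into tuples with `ast.literal_eval`, which only evaluates Python literals; `parse_saved_line` is a hypothetical helper name.

```python
import ast


def parse_saved_line(line: str):
    """Turn a line such as "('good', 1)" back into the tuple it was printed from."""
    return ast.literal_eval(line)


lines = ["('good', 1)", "('spark', 4)", "('beats', 3)"]
records = [parse_saved_line(line) for line in lines]
print(records)
# [('good', 1), ('spark', 4), ('beats', 3)]
```

On the RDD read back with textFile, `rdd2.map(parse_saved_line)` would apply the same conversion. This only works when every element's string form is a valid Python literal.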