Reading and Saving PySpark RDD Data

Reading Data

hadoopFile

Parameters:

  • path – path to Hadoop file
  • inputFormatClass – fully qualified classname of Hadoop InputFormat (e.g. “org.apache.hadoop.mapred.TextInputFormat”)
  • keyClass – fully qualified classname of key Writable class (e.g. “org.apache.hadoop.io.Text”)
  • valueClass – fully qualified classname of value Writable class (e.g. “org.apache.hadoop.io.LongWritable”)
  • keyConverter – (None by default)
  • valueConverter – (None by default)
  • conf – Hadoop configuration, passed in as a dict (None by default)
  • batchSize – The number of Python objects represented as a single Java object. (default 0, choose batchSize automatically)
# hadoopFile: returns key-value pairs; the key is the byte offset of each line, the value is the line content
# log.txt:
# http://www.baidu.com
# http://www.google.com
# http://www.google.com
# ...	...		...

rdd = sc.hadoopFile("hdfs://centos03:9000/datas/log.txt",
                    inputFormatClass="org.apache.hadoop.mapred.TextInputFormat",
                    keyClass="org.apache.hadoop.io.LongWritable",
                    valueClass="org.apache.hadoop.io.Text")
print(rdd.collect())  #1
rdd1 = rdd.map(lambda x: x[1].split(":"))
print(rdd1.collect())  #2

#1 [(0, 'http://www.baidu.com'), (22, 'http://www.google.com'), (45, 'http://www.google.com'), (68, 'http://cn.bing.com'), (88, 'http://cn.bing.com'), (108, 'http://www.baidu.com'), (130, 'http://www.sohu.com'), (151, 'http://www.sina.com'), (172, 'http://www.sin2a.com'), (194, 'http://www.sin2desa.com'), (219, 'http://www.sindsafa.com')]

#2 [['http', '//www.baidu.com'], ['http', '//www.google.com'], ['http', '//www.google.com'], ['http', '//cn.bing.com'], ['http', '//cn.bing.com'], ['http', '//www.baidu.com'], ['http', '//www.sohu.com'], ['http', '//www.sina.com'], ['http', '//www.sin2a.com'], ['http', '//www.sin2desa.com'], ['http', '//www.sindsafa.com']]

newAPIHadoopFile

Parameters:

  • path – path to Hadoop file
  • inputFormatClass – fully qualified classname of Hadoop InputFormat (e.g. “org.apache.hadoop.mapreduce.lib.input.TextInputFormat”)
  • keyClass – fully qualified classname of key Writable class (e.g. “org.apache.hadoop.io.Text”)
  • valueClass – fully qualified classname of value Writable class (e.g. “org.apache.hadoop.io.LongWritable”)
  • keyConverter – (None by default)
  • valueConverter – (None by default)
  • conf – Hadoop configuration, passed in as a dict (None by default)
  • batchSize – The number of Python objects represented as a single Java object. (default 0, choose batchSize automatically)
# newAPIHadoopFile: returns key-value pairs; the key is the byte offset of each line, the value is the line content
rdd = sc.newAPIHadoopFile("hdfs://centos03:9000/datas/log.txt",
                          # note that inputFormatClass differs from the old-API one
                          inputFormatClass="org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
                          keyClass="org.apache.hadoop.io.LongWritable",
                          valueClass="org.apache.hadoop.io.Text")
print(rdd.collect())  #1
rdd1 = rdd.map(lambda x: x[1].split(":"))
print(rdd1.collect())  #2

#1 [(0, 'http://www.baidu.com'), (22, 'http://www.google.com'), (45, 'http://www.google.com'), (68, 'http://cn.bing.com'), (88, 'http://cn.bing.com'), (108, 'http://www.baidu.com'), (130, 'http://www.sohu.com'), (151, 'http://www.sina.com'), (172, 'http://www.sin2a.com'), (194, 'http://www.sin2desa.com'), (219, 'http://www.sindsafa.com')]

#2 [['http', '//www.baidu.com'], ['http', '//www.google.com'], ['http', '//www.google.com'], ['http', '//cn.bing.com'], ['http', '//cn.bing.com'], ['http', '//www.baidu.com'], ['http', '//www.sohu.com'], ['http', '//www.sina.com'], ['http', '//www.sin2a.com'], ['http', '//www.sin2desa.com'], ['http', '//www.sindsafa.com']]

hadoopRDD

Parameters:

  • inputFormatClass – fully qualified classname of Hadoop InputFormat (e.g. “org.apache.hadoop.mapred.TextInputFormat”)
  • keyClass – fully qualified classname of key Writable class (e.g. “org.apache.hadoop.io.Text”)
  • valueClass – fully qualified classname of value Writable class (e.g. “org.apache.hadoop.io.LongWritable”)
  • keyConverter – (None by default)
  • valueConverter – (None by default)
  • conf – Hadoop configuration, passed in as a dict (None by default)
  • batchSize – The number of Python objects represented as a single Java object. (default 0, choose batchSize automatically)
confs = {"mapred.input.dir": "hdfs://centos03:9000/datas/log.txt"}
rdd = sc.hadoopRDD(inputFormatClass="org.apache.hadoop.mapred.TextInputFormat",
                   keyClass="org.apache.hadoop.io.LongWritable",
                   valueClass="org.apache.hadoop.io.Text",
                   conf=confs)
print(rdd.collect())  #1

#1 [(0, 'http://www.baidu.com'), (22, 'http://www.google.com'), (45, 'http://www.google.com'), (68, 'http://cn.bing.com'), (88, 'http://cn.bing.com'), (108, 'http://www.baidu.com'), (130, 'http://www.sohu.com'), (151, 'http://www.sina.com'), (172, 'http://www.sin2a.com'), (194, 'http://www.sin2desa.com'), (219, 'http://www.sindsafa.com')]

newAPIHadoopRDD

Parameters:

  • inputFormatClass – fully qualified classname of Hadoop InputFormat (e.g. “org.apache.hadoop.mapreduce.lib.input.TextInputFormat”)
  • keyClass – fully qualified classname of key Writable class (e.g. “org.apache.hadoop.io.Text”)
  • valueClass – fully qualified classname of value Writable class (e.g. “org.apache.hadoop.io.LongWritable”)
  • keyConverter – (None by default)
  • valueConverter – (None by default)
  • conf – Hadoop configuration, passed in as a dict (None by default)
  • batchSize – The number of Python objects represented as a single Java object. (default 0, choose batchSize automatically)
confs = {"mapreduce.input.fileinputformat.inputdir":"hdfs://centos03:9000/datas/log.txt"}
rdd = sc.newAPIHadoopRDD(
inputFormatClass="org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
keyClass="org.apache.hadoop.io.LongWritable",
valueClass="org.apache.hadoop.io.Text", 
conf=confs)
print(rdd.collect())  #1

#1 [(0, 'http://www.baidu.com'), (22, 'http://www.google.com'), (45, 'http://www.google.com'), (68, 'http://cn.bing.com'), (88, 'http://cn.bing.com'), (108, 'http://www.baidu.com'), (130, 'http://www.sohu.com'), (151, 'http://www.sina.com'), (172, 'http://www.sin2a.com'), (194, 'http://www.sin2desa.com'), (219, 'http://www.sindsafa.com')]

pickleFile

Parameters:

  • name – path of the data to load
  • minPartitions=None

Reads an RDD previously saved with saveAsPickleFile.

# pickleFile reads data saved with saveAsPickleFile; the loaded data has the same form as when it was saved
rdd = sc.newAPIHadoopFile("hdfs://centos03:9000/datas/log.txt",
                          inputFormatClass="org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
                          keyClass="org.apache.hadoop.io.LongWritable",
                          valueClass="org.apache.hadoop.io.Text")
print(rdd.collect())  #1
rdd1 = rdd.map(lambda x: x[1].split(":")).map(lambda x: (x[0], x[1]))
print(rdd1.collect())  #2

rdd1.saveAsPickleFile("hdfs://centos03:9000/datas/logp.txt")
print(sc.pickleFile("hdfs://centos03:9000/datas/logp.txt").collect())  #3

#1 [(0, 'http://www.baidu.com'), (22, 'http://www.google.com'), (45, 'http://www.google.com'), (68, 'http://cn.bing.com'), (88, 'http://cn.bing.com'), (108, 'http://www.baidu.com'), (130, 'http://www.sohu.com'), (151, 'http://www.sina.com'), (172, 'http://www.sin2a.com'), (194, 'http://www.sin2desa.com'), (219, 'http://www.sindsafa.com')]

#2 [('http', '//www.baidu.com'), ('http', '//www.google.com'), ('http', '//www.google.com'), ('http', '//cn.bing.com'), ('http', '//cn.bing.com'), ('http', '//www.baidu.com'), ('http', '//www.sohu.com'), ('http', '//www.sina.com'), ('http', '//www.sin2a.com'), ('http', '//www.sin2desa.com'), ('http', '//www.sindsafa.com')]

#3 [('http', '//www.baidu.com'), ('http', '//www.google.com'), ('http', '//www.google.com'), ('http', '//cn.bing.com'), ('http', '//cn.bing.com'), ('http', '//www.baidu.com'), ('http', '//www.sohu.com'), ('http', '//www.sina.com'), ('http', '//www.sin2a.com'), ('http', '//www.sin2desa.com'), ('http', '//www.sindsafa.com')]

sequenceFile

Parameters:

  • path – path to sequence file
  • keyClass – fully qualified classname of key Writable class (e.g. “org.apache.hadoop.io.Text”)
  • valueClass – fully qualified classname of value Writable class (e.g. “org.apache.hadoop.io.LongWritable”)
  • keyConverter
  • valueConverter
  • minSplits – minimum splits in dataset (default min(2, sc.defaultParallelism))
  • batchSize – The number of Python objects represented as a single Java object. (default 0, choose batchSize automatically)
# Reads a Hadoop SequenceFile; keyClass and valueClass may be left unspecified
rdd = sc.sequenceFile(path="hdfs://centos03:9000/datas/seqFile",
                      keyClass="org.apache.hadoop.io.LongWritable",
                      valueClass="org.apache.hadoop.io.Text")
print(rdd.collect())  #1

#1 [('Pandas', 3), ('Key', 6), ('Sanil', 2)]

textFile

Parameters:

  • name – file path
  • minPartitions=None
  • use_unicode=True
# textFile: with use_unicode=False the strings come back as plain str (bytes), which is faster and smaller than unicode (see the sketch after the output below)
rdd = sc.textFile(name="hdfs://centos03:9000/datas/log.txt")
print(rdd.collect())  #1

#1 ['http://www.baidu.com', 'http://www.google.com', 'http://www.google.com', 'http://cn.bing.com', 'http://cn.bing.com', 'http://www.baidu.com', 'http://www.sohu.com', 'http://www.sina.com', 'http://www.sin2a.com', 'http://www.sin2desa.com', 'http://www.sindsafa.com']
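
A small sketch of use_unicode=False (my own illustration, not a run from the original post): the lines come back as byte strings (str on Python 2, bytes on Python 3), which can be cheaper for large ASCII-only files.

rdd = sc.textFile(name="hdfs://centos03:9000/datas/log.txt", use_unicode=False)
print(rdd.collect())  # e.g. [b'http://www.baidu.com', b'http://www.google.com', ...] on Python 3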

wholeTextFiles

Reads files from HDFS, the local file system, or any other Hadoop-supported file system. Each file is read as a single record and returned as a key-value pair, where the key is the file's path and the value is the file's content.

Parameters:

  • path
  • minPartitions=None
  • use_unicode=True
# wholeTextFiles is well suited to directories with many small files
rdd = sc.wholeTextFiles(path="hdfs://centos03:9000/table")
print(rdd.collect())  #1
rdd1 = rdd.map(lambda x: x[1].split("\t"))
print(rdd1.collect())  #2

#1 [('hdfs://centos03:9000/table/order.txt', '1001\t01\t1\r\n1002\t02\t2\r\n1003\t03\t3\r\n1004\t01\t4\r\n1005\t02\t5\r\n1006\t03\t6'), ('hdfs://centos03:9000/table/pd.txt', '01\t小米\r\n02\t华为\r\n03\t格力\r\n')]

#2 [['1001', '01', '1\r\n1002', '02', '2\r\n1003', '03', '3\r\n1004', '01', '4\r\n1005', '02', '5\r\n1006', '03', '6'], ['01', '小米\r\n02', '华为\r\n03', '格力\r\n']]
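
Because each value is an entire file, splitting on "\t" alone leaves the line breaks embedded in the fields, as #2 shows. A follow-up sketch of my own: split each file's content into lines first, then into columns.

rdd2 = rdd.flatMap(lambda kv: kv[1].splitlines()).map(lambda line: line.split("\t"))
print(rdd2.collect())
# e.g. [['1001', '01', '1'], ['1002', '02', '2'], ..., ['01', '小米'], ['02', '华为'], ['03', '格力']]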

Saving Data

saveAsHadoopFile

Output a Python RDD of key-value pairs (of form RDD[(K, V)])

Parameters:

  • path – path to Hadoop file
  • outputFormatClass – fully qualified classname of Hadoop OutputFormat (e.g. “org.apache.hadoop.mapred.SequenceFileOutputFormat”)
  • keyClass – fully qualified classname of key Writable class (e.g. “org.apache.hadoop.io.IntWritable”, None by default)
  • valueClass – fully qualified classname of value Writable class (e.g. “org.apache.hadoop.io.Text”, None by default)
  • keyConverter – (None by default)
  • valueConverter – (None by default)
  • conf – (None by default)
  • compressionCodecClass – (None by default)
# saveAsHadoopFile
rdd = sc.parallelize([('good', 1), ("spark", 4), ("beats", 3)])
print(rdd.collect())
rdd.saveAsHadoopFile(
    path="hdfs://centos03:9000/datas/rdd_seq",
    outputFormatClass="org.apache.hadoop.mapred.SequenceFileOutputFormat"
)
print(sc.sequenceFile("hdfs://centos03:9000/datas/rdd_seq").collect())  #1

#1 [('good', 1), ('spark', 4), ('beats', 3)]

Or:

# saveAsHadoopFile
rdd = sc.parallelize([('good', 1), ("spark", 4), ("beats", 3)])
print(rdd.collect())
rdd.saveAsHadoopFile(
    path="hdfs://centos03:9000/datas/rdd_seq",
    outputFormatClass="org.apache.hadoop.mapred.TextOutputFormat")

rdd1 = sc.hadoopFile(
    "hdfs://centos03:9000/datas/rdd_seq",
    inputFormatClass="org.apache.hadoop.mapred.TextInputFormat",
    keyClass="org.apache.hadoop.io.IntWritable",
    valueClass="org.apache.hadoop.io.Text")
print(rdd1.collect())  #1

#1 [(0, 'good\t1'), (0, 'spark\t4'), (0, 'beats\t3')]

Comparing the two snippets above, saving the data in serialized (SequenceFile) form is the better choice.

However, when the data is sc.parallelize([{'good': 1}, {'spark': 4}, {'beats': 3}]), the save fails with org.apache.spark.SparkException: RDD element of type java.util.HashMap cannot be used. Even after running the elements through json.dumps the save still fails (org.apache.spark.SparkException: RDD element of type java.lang.String cannot be used). A relevant remark found online: "To use String and Map objects you will need to use the more extensive native support available in Scala and Java."

In fact, the official API documentation also makes this clear: the RDD being written out must be a Python RDD of key-value pairs.
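
A minimal workaround sketch (my own illustration, using a hypothetical output path): flatten each dict into (key, value) tuples so the RDD has the RDD[(K, V)] shape the save APIs require, after which the save works as in the earlier examples.

rdd = sc.parallelize([{'good': 1}, {'spark': 4}, {'beats': 3}])
pair_rdd = rdd.flatMap(lambda d: d.items())  # [('good', 1), ('spark', 4), ('beats', 3)]
pair_rdd.saveAsHadoopFile(
    path="hdfs://centos03:9000/datas/rdd_pairs",  # hypothetical path
    outputFormatClass="org.apache.hadoop.mapred.SequenceFileOutputFormat")
print(sc.sequenceFile("hdfs://centos03:9000/datas/rdd_pairs").collect())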

saveAsNewAPIHadoopFile

Output a Python RDD of key-value pairs (of form RDD[(K, V)])

Parameters:

  • path – path to Hadoop file
  • outputFormatClass – fully qualified classname of Hadoop OutputFormat (e.g. “org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat”)
  • keyClass – fully qualified classname of key Writable class (e.g. “org.apache.hadoop.io.IntWritable”, None by default)
  • valueClass – fully qualified classname of value Writable class (e.g. “org.apache.hadoop.io.Text”, None by default)
  • keyConverter – (None by default)
  • valueConverter – (None by default)
  • conf – Hadoop job configuration, passed in as a dict (None by default)
# saveAsNewAPIHadoopFile
rdd = sc.parallelize([('good', 1), ("spark", 4), ("beats", 3)])
print(rdd.collect())
rdd.saveAsNewAPIHadoopFile(path="hdfs://centos03:9000/datas/rdd_seq",
                           outputFormatClass="org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat")
print(sc.sequenceFile("hdfs://centos03:9000/datas/rdd_seq").collect())  #1

#1 [('good', 1), ('spark', 4), ('beats', 3)]

# saveAsNewAPIHadoopFile
rdd = sc.parallelize([('good', 1), ("spark", 4), ("beats", 3)])
print(rdd.collect())
rdd.saveAsNewAPIHadoopFile(
    path="hdfs://centos03:9000/datas/rdd_seq",
    outputFormatClass="org.apache.hadoop.mapreduce.lib.output.TextOutputFormat"
)

rdd2 = sc.hadoopFile(
    "hdfs://centos03:9000/datas/rdd_seq",
    inputFormatClass="org.apache.hadoop.mapred.TextInputFormat",
    keyClass="org.apache.hadoop.io.IntWritable",
    valueClass="org.apache.hadoop.io.Text")
print(rdd2.collect())  #1

#1 [(0, 'good\t1'), (0, 'spark\t4'), (0, 'beats\t3')]

What happens if we change the shape of the data being stored?

rdd = sc.parallelize([(1, {'good': 1}), (2, {'spark': 4}), (3, {'beats': 3})])
print(rdd.collect())
rdd.saveAsNewAPIHadoopFile(
    path="hdfs://centos03:9000/datas/rdd_seq",
    outputFormatClass="org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat")
print(sc.sequenceFile("hdfs://centos03:9000/datas/rdd_seq").collect())  #1

#1 [(1, {'good': 1}), (2, {'spark': 4}), (3, {'beats': 3})]

rdd = sc.parallelize([(1, {'good': 1}), (2, {'spark': 4}), (3, {'beats': 3})])
print(rdd.collect())
rdd.saveAsNewAPIHadoopFile(
    path="hdfs://centos03:9000/datas/rdd_seq",
    outputFormatClass="org.apache.hadoop.mapreduce.lib.output.TextOutputFormat")
rdd2 = sc.hadoopFile(
    "hdfs://centos03:9000/datas/rdd_seq",
    inputFormatClass="org.apache.hadoop.mapred.TextInputFormat",
    keyClass="org.apache.hadoop.io.IntWritable",
    valueClass="org.apache.hadoop.io.Text")
print(rdd2.collect())  #1

#1 [(0, '1\torg.apache.hadoop.io.MapWritable@3e9840'), (0, '2\torg.apache.hadoop.io.MapWritable@83dcb79'), (0, '3\torg.apache.hadoop.io.MapWritable@7493c20')]

As the code above shows, saving in serialized (SequenceFile) form is again the better option, since it preserves the structure of the original data.
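
For purely Python-side round trips there is also the pickle route (covered below): saveAsPickleFile keeps arbitrary Python objects, including nested dicts, intact. A short sketch of my own, using a hypothetical output path:

rdd = sc.parallelize([(1, {'good': 1}), (2, {'spark': 4}), (3, {'beats': 3})])
rdd.saveAsPickleFile("hdfs://centos03:9000/datas/rdd_pickle")  # hypothetical path
print(sc.pickleFile("hdfs://centos03:9000/datas/rdd_pickle").collect())
# expected: [(1, {'good': 1}), (2, {'spark': 4}), (3, {'beats': 3})] (partition order may vary)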

saveAsHadoopDataset

Output a Python RDD of key-value pairs (of form RDD[(K, V)])

Parameters:

  • conf – Hadoop job configuration, passed in as a dict
  • keyConverter – (None by default)
  • valueConverter – (None by default)
# saveAsHadoopDataset
confs = {"outputFormatClass": "org.apache.hadoop.mapred.TextOutputFormat",
         "keyClass": "org.apache.hadoop.io.LongWritable",
         "valueClass": "org.apache.hadoop.io.Text",
         "mapred.output.dir": "hdfs://centos03:9000/datas/rdd"}
rdd = sc.parallelize([('good', 1), ("spark", 4), ("beats", 3)])
rdd.saveAsHadoopDataset(conf=confs)  # job parameters are passed in via the conf dict

rdd2 = sc.hadoopFile(
    "hdfs://centos03:9000/datas/rdd",
    inputFormatClass="org.apache.hadoop.mapred.TextInputFormat",
    keyClass="org.apache.hadoop.io.LongWritable",
    valueClass="org.apache.hadoop.io.Text")
print(rdd2.collect())  #1

#1 [(0, 'good\t1'), (0, 'spark\t4'), (0, 'beats\t3')]

# saveAsHadoopDataset
confs = {"outputFormatClass":"org.apache.hadoop.mapred.SequenceFileOutputFormat", "keyClass": "org.apache.hadoop.io.LongWritable", 
             "valueClass": "org.apache.hadoop.io.Text",
             "mapred.output.dir": "hdfs://centos03:9000/datas/rdd"
            }
rdd = sc.parallelize([('good', 1), ("spark", 4), ("beats", 3)])
rdd.saveAsHadoopDataset(conf=confs)

rdd2 = sc.textFile("hdfs://centos03:9000/datas/rdd")  # reads back as plain text, so the SequenceFileOutputFormat entry above does not appear to have taken effect (see the note after the output)
print(rdd2.collect())  #1 

#1 ['good\t1', 'spark\t4', 'beats\t3']
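
Judging from the plain-text result in #1, the "outputFormatClass"/"keyClass"/"valueClass" entries above are probably not recognized as Hadoop properties, so the default TextOutputFormat was used. My assumption (not verified on this cluster) is that the old-API property names JobConf itself uses would be needed instead, roughly:

confs = {"mapred.output.format.class": "org.apache.hadoop.mapred.SequenceFileOutputFormat",  # assumed old-API property names
         "mapred.output.key.class": "org.apache.hadoop.io.Text",
         "mapred.output.value.class": "org.apache.hadoop.io.IntWritable",
         "mapred.output.dir": "hdfs://centos03:9000/datas/rdd"}
rdd = sc.parallelize([('good', 1), ("spark", 4), ("beats", 3)])
rdd.saveAsHadoopDataset(conf=confs)
# If SequenceFileOutputFormat really takes effect, reading back with sc.sequenceFile
# (rather than sc.textFile) should return the original ('good', 1) style pairs.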

saveAsNewAPIHadoopDataset

Output a Python RDD of key-value pairs (of form RDD[(K, V)])

Parameters:

  • conf – Hadoop job configuration, passed in as a dict
  • keyConverter – (None by default)
  • valueConverter – (None by default)
# saveAsNewAPIHadoopDataset
confs = {"outputFormatClass":"org.apache.hadoop.mapreduce.lib.output.TextOutputFormat",
         "keyClass": "org.apache.hadoop.io.LongWritable",
         "valueClass": "org.apache.hadoop.io.Text",
         "mapreduce.output.fileoutputformat.outputdir": "hdfs://centos03:9000/datas/rdd"
        }
rdd = sc.parallelize([('good', 1), ("spark", 4), ("beats", 3)])
rdd.saveAsNewAPIHadoopDataset(conf=confs)

rdd1 = sc.newAPIHadoopFile(
    path="hdfs://centos03:9000/datas/rdd",
    inputFormatClass="org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    keyClass="org.apache.hadoop.io.LongWritable",
    valueClass="org.apache.hadoop.io.Text")
print(rdd1.collect())  #1

rdd2 = sc.textFile("hdfs://centos03:9000/datas/rdd")
print(rdd2.collect())  #2

#1 [(0, 'good\t1'), (0, 'spark\t4'), (0, 'beats\t3')]

#2 ['good\t1', 'spark\t4', 'beats\t3']
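
As with the old API, the "outputFormatClass"/"keyClass"/"valueClass" entries are probably ignored here and TextOutputFormat is simply the default. My assumption (not verified here) is that the properties the new-API job actually reads are the mapreduce.job.* ones, e.g.:

confs = {"mapreduce.job.outputformat.class": "org.apache.hadoop.mapreduce.lib.output.TextOutputFormat",  # assumed new-API property names
         "mapreduce.job.output.key.class": "org.apache.hadoop.io.Text",
         "mapreduce.job.output.value.class": "org.apache.hadoop.io.IntWritable",
         "mapreduce.output.fileoutputformat.outputdir": "hdfs://centos03:9000/datas/rdd"}
rdd = sc.parallelize([('good', 1), ("spark", 4), ("beats", 3)])
rdd.saveAsNewAPIHadoopDataset(conf=confs)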

saveAsPickleFile

Save this RDD as a SequenceFile of serialized objects. The serializer used is pyspark.serializers.PickleSerializer; the default batch size is 10.

Parameters:

  • path
  • batchSize=10
# saveAsPickleFile
rdd = sc.parallelize([('good', 1), ("spark", 4), ("beats", 3)])
rdd.saveAsPickleFile("hdfs://centos03:9000/datas/rdd")

rdd1 = sc.pickleFile("hdfs://centos03:9000/datas/rdd")
print(rdd1.collect())  #1

#1 [('good', 1), ('spark', 4), ('beats', 3)]

saveAsSequenceFile

Output a Python RDD of key-value pairs (of form RDD[(K, V)])

Internally the save goes through three steps: 1. the pickled Python RDD is converted to a Java RDD of objects; 2. the Java objects are converted to Writables; 3. the Writables are written out.

Parameters:

  • path – path to sequence file
  • compressionCodecClass – (None by default)
# saveAsSequenceFile
rdd = sc.parallelize([('good', 1), ("spark", 4), ("beats", 3)])
rdd.saveAsSequenceFile("hdfs://centos03:9000/datas/rdd")

rdd1 = sc.sequenceFile("hdfs://centos03:9000/datas/rdd")
print(rdd1.collect())  #1

rdd2 = sc.textFile("hdfs://centos03:9000/datas/rdd")
print(rdd2.collect())  #2

#1 [('good', 1), ('spark', 4), ('beats', 3)]
#2 ['SEQ\x06\x19org.apache.hadoop.io.Text org.apache.hadoop.io.IntWritable\x00\x00\x00\x00\x00\x00�ekpR2\x08� U��Yn$’, 'SEQ\x06\x19org.apache.hadoop.io.Text org.apache.hadoop.io.IntWritable\x00\x00\x00\x00\x00\x00�4��E�}βZ;�v\x1f\t\x00\x00\x00\t\x00\x00\x00\x05\x04good\x00\x00\x00\x01', 'SEQ\x06\x19org.apache.hadoop.io.Text org.apache.hadoop.io.IntWritable\x00\x00\x00\x00\x00\x00\x14��˹\x02oM�g��f�\x02v\x00\x00\x00', '\x00\x00\x00\x06\x05spark\x00\x00\x00\x04', 'SEQ\x06\x19org.apache.hadoop.io.Text org.apache.hadoop.io.IntWritable\x00\x00\x00\x00\x00\x00F\x0b��\x04lD\x116+\x16n��d�\x00\x00\x00', '\x00\x00\x00\x06\x05beats\x00\x00\x00\x03']

saveAsTextFile

# saveAsTextFile
rdd = sc.parallelize([('good', 1), ("spark", 4), ("beats", 3)])
rdd.saveAsTextFile("hdfs://centos03:9000/datas/rdd")
rdd2 = sc.textFile("hdfs://centos03:9000/datas/rdd")
print(rdd2.collect())  #1

#1 ["(‘good’, 1)", “(‘spark’, 4)”, “(‘beats’, 3)”]
