Learning PySpark: WordCount with Sorting

Environment:

1. A working Spark cluster.
2. A working Python environment; in the python folder under the Spark installation directory, run python setup.py install to install pyspark.
3. Code:
import os
from pyspark.context import SparkContext
# os.environ['JAVA_HOME'] = "/usr/local/java/jdk1.8.0_231"

if __name__ == "__main__":
    sc = SparkContext(appName="WordCount")
    lines = sc.textFile("All along the River.txt")
    output = lines.flatMap(lambda line: line.split(' ')) \
                .map(lambda word: (word, 1)) \
                .reduceByKey(lambda a, b: a+b) \
                .sortBy(lambda x: x[1], False) \
                .take(50)
    for (word,count) in output:
        print("%s : %i" % (word,count))
    sc.stop()

Notes:

lines.flatMap(lambda line: line.split(' '))    # split each line on spaces
                .map(lambda word: (word, 1))   # map each word to a count of 1
                .reduceByKey(lambda a, b: a+b) # sum the values of identical keys
                .sortBy(lambda x: x[1], False) # sort by value, descending
                .take(50)                      # take the first 50 results
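The per-stage semantics above can be illustrated without a cluster. The following is a minimal pure-Python sketch of the same pipeline (an illustration of what each transformation computes, not PySpark itself; the sample text and the take count of 3 are made up for the example):

```python
from itertools import chain

text = ["the quick brown fox", "the lazy dog the end"]

# flatMap: split each line on spaces and flatten into one stream of words
words = list(chain.from_iterable(line.split(' ') for line in text))

# map: pair each word with an initial count of 1
pairs = [(word, 1) for word in words]

# reduceByKey: sum the counts for identical keys
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

# sortBy(..., False) + take: order by count, descending; keep the top 3
top = sorted(counts.items(), key=lambda x: x[1], reverse=True)[:3]
print(top)  # 'the' appears 3 times, so it comes first
```

In the real job, each of these steps runs distributed across the cluster; reduceByKey in particular combines counts per-partition before shuffling, which is why it is preferred over grouping all pairs first.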

Errors:

  1. Exception: Java gateway process exited before sending its port number

Fix: add the line os.environ['JAVA_HOME'] = "/usr/local/java/jdk1.8.0_231" (adjusting the path to your local JDK).
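Note that the environment variable must be set before the SparkContext is created, since that is when pyspark launches the Java gateway process. A minimal sketch (the JDK path is the one from this article's environment and must match your local installation):

```python
import os

# Point pyspark's Java gateway at the local JDK; adjust the path to your install.
os.environ['JAVA_HOME'] = "/usr/local/java/jdk1.8.0_231"

# Only after JAVA_HOME is set should the SparkContext be created, e.g.:
# sc = SparkContext(appName="WordCount")
```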
