PySpark RDD: filter

1. PySpark version

Version 2.3.0

 

2. Official documentation

filter(f)

Return a new RDD containing only the elements that satisfy a predicate.


>>> rdd = sc.parallelize([1, 2, 3, 4, 5])
>>> rdd.filter(lambda x: x % 2 == 0).collect()
[2, 4]
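
One thing worth keeping in mind (my own note, not part of the official docs quoted above): filter is a transformation, so it is lazy; nothing is computed until an action such as collect() or count() runs, and several filters can be chained. A minimal sketch, assuming the same sc as above:

rdd = sc.parallelize(range(1, 11))
# two chained filters: keep even numbers, then keep those greater than 4
evens_gt_4 = rdd.filter(lambda x: x % 2 == 0).filter(lambda x: x > 4)
print(evens_gt_4.collect())   # expected: [6, 8, 10]
print(evens_gt_4.count())     # expected: 3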

 

3. My code

Example 1

from pyspark import SparkContext, SparkConf
conf = SparkConf().setMaster("local").setAppName("filter")
sc = SparkContext(conf=conf)
rdd1 = sc.parallelize([1, 2, 3, 4, 5])
new_rdd1 = rdd1.filter(lambda x: x > 2)   # keep only the elements greater than 2
print('new_rdd1 = ', new_rdd1.collect())

>>> new_rdd1 =  [3, 4, 5]
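
A small variation on Example 1 (my own sketch, not from the original code): the predicate does not have to be a lambda; any callable returning True/False works, including one with a compound condition.

def between_two_and_five(x):
    # keep values strictly between 2 and 5
    return 2 < x < 5

new_rdd1b = rdd1.filter(between_two_and_five)
print('new_rdd1b = ', new_rdd1b.collect())

>>> new_rdd1b =  [3, 4]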

Example 2

rdd2 = sc.parallelize([[1, 'a'], [2, 'b'], [3, 'c']])
new_rdd2 = rdd2.filter(lambda x: x[0] > 1)   # keep sub-lists whose first element is greater than 1
print('new_rdd2 = ', new_rdd2.collect())

>>> new_rdd2 =  [[2, 'b'], [3, 'c']]
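
The same idea also works on (key, value) tuples, which is the more common pair-RDD shape in Spark. A hypothetical variation of Example 2:

pairs = sc.parallelize([(1, 'a'), (2, 'b'), (3, 'c')])
big_pairs = pairs.filter(lambda kv: kv[0] > 1)   # keep pairs whose key is greater than 1
print('big_pairs = ', big_pairs.collect())

>>> big_pairs =  [(2, 'b'), (3, 'c')]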

Example 3: select the elements whose keys contain 'a'

rdd3 = sc.parallelize([{'a': 1}, {'b': 2}, {'c': 3}])
def myfilter(x):
    # this print runs on the worker side, so it appears in the console
    # that launched the notebook, not in the notebook cell output
    print('x= ', x, type(x))
    # filter only checks the truthiness of the return value:
    # a non-empty dict counts as True, the implicit None counts as False
    if 'a' in x.keys():
        return x
new_rdd3 = rdd3.filter(lambda x: myfilter(x))
print('new_rdd3 = ', new_rdd3.collect())

>>> new_rdd3 =  [{'a': 1}]

The print output shows up in the notebook's DOS (console) window:

[Screenshot 1: worker-side print output in the console window]
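
For comparison, here is a sketch of the same filtering with a predicate that returns a proper boolean instead of the dict itself; the result is the same because filter only looks at the truthiness of the return value:

new_rdd3b = rdd3.filter(lambda d: 'a' in d)   # 'a' in d checks the dict's keys
print('new_rdd3b = ', new_rdd3b.collect())

>>> new_rdd3b =  [{'a': 1}]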

 

4. notebook

[Screenshot 2: the notebook with the code above]
