Calling a class method in Spark


Calling a class method from a PySpark transformation raises the following error:

Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

Cause:

Spark does not allow SparkContext to be accessed inside an action or transformation. If your action or transformation references self, Spark serializes the entire object and ships it to the worker nodes, and the serialized object still holds the SparkContext. Even if you never access the SparkContext explicitly, it gets captured in the closure, so the job fails.
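
For illustration, a minimal sketch of the failing pattern (the file name and SparkContext setup here are assumptions for the example, not from the original post):

from pyspark import SparkContext

sc = SparkContext(appName='example')  # assumed driver-side setup

class model(object):
    def __init__(self):
        self.data = sc.textFile('some.csv')

    def transformation_function(self, row):
        return row.split(',')[0]

    def run_model(self):
        # Passing a bound method captures self; pickling self pulls in
        # self.data (an RDD holding the SparkContext), raising SPARK-5063.
        self.data = self.data.map(self.transformation_function)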

Fix:
Define the method being called as a static method with @staticmethod, so that no reference to self is captured:

from pyspark import SparkContext

sc = SparkContext(appName='example')

class model(object):
    @staticmethod
    def transformation_function(row):
        # A static method holds no reference to self, so only the
        # function itself is serialized and shipped to the workers.
        fields = row.split(',')
        return fields[0] + fields[1]

    def __init__(self):
        self.data = sc.textFile('some.csv')

    def run_model(self):
        # Reference the method through the class, not through self.
        self.data = self.data.map(model.transformation_function)
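
An equivalent fix is to move the function to module level: any plain function that closes over neither self nor sc pickles cleanly. A usage sketch, assuming the same driver-side sc as above:

def transformation_function(row):
    fields = row.split(',')
    return fields[0] + fields[1]

class model(object):
    def __init__(self):
        self.data = sc.textFile('some.csv')

    def run_model(self):
        self.data = self.data.map(transformation_function)

m = model()
m.run_model()
print(m.data.take(5))  # materializes a few rows on the driver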


        
Reference:
https://stackoverflow.com/questions/32505426/how-to-process-rdds-using-a-python-class
