PySpark Official Documentation

This article is a translation of the PySpark 2.0.2 documentation and covers the main PySpark usage. Original link: [ Apache pyspark ]

pyspark.sql module

The key classes of Spark SQL and DataFrames:

- pyspark.sql.SparkSession
- pyspark.sql.DataFrame
- pyspark.sql.Column
- pyspark.sql.Row
- pyspark.sql.DataFrameNaFunctions
- pyspark.sql.DataFrameStatFunctions
- pyspark.sql.functions
- pyspark.sql.types
- pyspark.sql.Window
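
For orientation, a minimal sketch of how these modules are commonly imported (the aliases F and T are a convention assumed here, not taken from the original):

from pyspark.sql import SparkSession, DataFrame, Column, Row, Window
from pyspark.sql import functions as F   # built-in column functions
from pyspark.sql import types as T       # schema and data type definitions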

A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read parquet files. To create a SparkSession, use the following builder pattern:

spark = SparkSession.builder \
    .master("local") \
    .appName("Word Count") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
>>> l = [('Alice', 1)]
>>> spark.createDataFrame(l).collect()
[Row(_1=u'Alice', _2=1)]
>>> spark.createDataFrame(l, ['name', 'age']).collect()
[Row(name=u'Alice', age=1)]

>>> d = [{'name': 'Alice', 'age': 1}]
>>> spark.createDataFrame(d).collect()
[Row(age=1, name=u'Alice')]
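
Note that inferring the schema from a dict is deprecated in Spark 2.x; the documented alternative is to build rows with pyspark.sql.Row, roughly like this:

>>> from pyspark.sql import Row
>>> spark.createDataFrame([Row(name='Alice', age=1)]).collect()
[Row(age=1, name=u'Alice')]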

>>> rdd = sc.parallelize(l)
>>> spark.createDataFrame(rdd).collect()
[Row(_1=u'Alice', _2=1)]
>>> df = spark.createDataFrame(rdd, ['name', 'age'])
>>> df.collect()
[Row(name=u'Alice', age=1)]
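
createDataFrame also accepts an explicit schema built with pyspark.sql.types; a minimal sketch reusing the rdd defined above (df2 is just an illustrative name):

>>> from pyspark.sql.types import StructType, StructField, StringType, IntegerType
>>> schema = StructType([
...     StructField("name", StringType(), True),
...     StructField("age", IntegerType(), True)])
>>> df2 = spark.createDataFrame(rdd, schema)
>>> df2.collect()
[Row(name=u'Alice', age=1)]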
