pyspark学习系列(三)利用SQL查询

对于spark 中存在dataframe,我们可以用 .creatOrReplaceTempView方法创建临时表。

临时表创建之后我们就可以用SQL语句对这个临时表进行查询统计:

from pyspark.sql.types import *

# Generate our own CSV data 
#   This way we don't have to access the file system yet.
stringCSVRDD = sc.parallelize([(123, 'Katie', 19, 'brown'), (234, 'Michael', 22, 'green'), (345, 'Simone', 23, 'blue')])

# The schema is encoded in a string, using StructType we define the schema using various pyspark.sql.types
schemaString = "id name age eyeColor"
schema = StructType([
    StructField("id", LongType(), True),    
    StructField("name", StringType(), True),
    StructField("age", LongType(), True),
    StructField("eyeColor", StringType(), True)
])

# Apply the schema to the RDD and Create DataFrame
swimmers = spark.createDataFrame(stringCSVRDD, schema)

# Creates a temporary view using the DataFrame
swimmers.createOrReplaceTempView("swimmers")
spark.sql("select * from swimmers").show()
swimmers.select("id", "age").filter("age = 22").show()

spark.sql("select name, eyeColor from swimmers where eyeColor like 'b%'").show()


你可能感兴趣的:(spark,python)