Spark in Practice (2): DataFrame Basics — Creating a DataFrame

Previously, the RDD API was dominant, but it was hard to learn and use.
Now, with DataFrames, Spark is much easier to operate and work with.

Contents

  • Creating a DataFrame
  • Creating a DataFrame (with a specified schema)

Creating a DataFrame

from pyspark.sql import SparkSession
# Create a new session
spark = SparkSession.builder.appName('Basics').getOrCreate()
# Load the data
df = spark.read.json('people.json')

df.show()             # display the data
df.printSchema()      # print the schema of df
df.columns            # get the column names
df.describe().show()  # get a statistical summary of df
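These examples assume a `people.json` file in JSON Lines format (one JSON object per line, which is what `spark.read.json` expects by default). A hypothetical file might look like:

```json
{"name": "Michael"}
{"name": "Andy", "age": 30}
{"name": "Justin", "age": 19}
```

With input like this, `df.printSchema()` would infer `age` as a nullable long and `name` as a nullable string.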

Creating a DataFrame (with a specified schema)

# Specifying the schema up front and then reading is more useful in practice!
from pyspark.sql.types import StructField, StringType, IntegerType, StructType

# Create the data schema
data_schema = [StructField('age', IntegerType(), True),
               StructField('name', StringType(), True)]
# Pass the data schema into the StructType
final_struc = StructType(fields=data_schema)
# Create the dataframe with the specified data schema
df = spark.read.json('people.json', schema=final_struc)
