Data Mining Tools: Using Spark SQL

Ten Things You Need to Know About Spark SQL

Spark SQL Use Cases

  • Ad-hoc querying of data in files
  • ETL capabilities alongside familiar SQL
  • Interaction with external Databases
  • Scalable query performance with larger clusters
  • Live SQL analytics over streaming data

Loading Data: Cloud and Local, RDDs and DataFrames

You can load data directly into a DataFrame, and begin querying it relatively quickly. Otherwise you’ll need to load data into an RDD and transform it first.

# loading data into an RDD in Spark 2.0
sc = spark.sparkContext
oneSysLog = sc.textFile("file:/var/log/system.log")
allSysLogs = sc.textFile("file:/var/log/system.log*")
allLogs = sc.textFile("file:/var/log/*.log")
# let's count the lines in each RDD
>>> oneSysLog.count()
8339
>>> allSysLogs.count()
47916
>>> allLogs.count()
546254

That’s great, but you can’t query this. You’ll need to convert the data to Rows, add a schema, and convert it to a dataframe.

# import Row, map the rdd, and create dataframe
from pyspark.sql import Row
sc = spark.sparkContext
allSysLogs = sc.textFile("file:/var/log/system.log*")
logsRDD = allSysLogs.map(lambda logRow: Row(log=logRow))
logsDF = spark.createDataFrame(logsRDD)

Once the data is in a DataFrame with a schema, you can talk SQL to it.

# write some SQL
logsDF = spark.createDataFrame(logsRDD)
logsDF.createOrReplaceTempView("logs")
>>> spark.sql("SELECT * FROM logs LIMIT 1").show()
+--------------------+
|                 log|
+--------------------+
|Jan  6 16:37:01 (...|
+--------------------+

But you can also load certain types of data and store them directly as a DataFrame, which gets you to SQL quickly. Both JSON and Parquet can be loaded as a DataFrame straightaway because they carry enough schema information to do so.

# load parquet straight into DF, and write some SQL
logsDF = spark.read.parquet("file:/logs.parquet")
logsDF.createOrReplaceTempView("logs")
>>> spark.sql("SELECT * FROM logs LIMIT 1").show()
+--------------------+
|                 log|
+--------------------+
|Jan  6 16:37:01 (...|
+--------------------+

In fact, now they even have support for querying parquet files directly! Easy peasy!

# query the parquet file directly with SQL, without creating a DataFrame first
>>> spark.sql("""  SELECT * FROM parquet.`path/to/logs.parquet` LIMIT 1""").show()
+--------------------+
|                 log|
+--------------------+
|Jan  6 16:37:01 (...|
+--------------------+

That’s one aspect of loading data. The other aspect is using the protocols for cloud storage (e.g. s3://). In some cloud ecosystems, support for their storage protocol comes installed already.

# e.g. on AWS EMR, s3:// is installed already.
sc = spark.sparkContext
decemberLogs = sc.textFile("s3://acme-co/logs/2016/12/")
# count the lines in all of the december 2016 logs in S3
>>> decemberLogs.count()
910125081250
# wow, such logs. Ur popular.

Sometimes you actually need to provide support for those protocols if your VM’s OS doesn’t have it already.

my-linux-shell$ pyspark --packages \
  com.amazonaws:aws-java-sdk-pom:1.10.34,com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1 demo2.py
>>> rdd = sc.textFile("s3a://acme-co/path/to/files")
>>> rdd.count()
# note: "s3a" and not "s3" -- "s3" is specific to AWS EMR.

Now you should have several ways to load data to quickly start writing SQL with Apache Spark.

SQL vs. the DataFrame API: How They Differ

  • What is a DataFrame?
    You can think of DataFrames as RDDs with a schema.
    Note: "DataFrame is just a type alias for Dataset of Row" (Databricks)

  • Why DataFrame over RDD?
    Catalyst optimization and schemas

  • What kind of data can DataFrames handle?
    Text, JSON, XML, Parquet and more

  • What can I do with a DataFrame?
    Use SQL-like functions and actual SQL. You can also apply schemas to your data and benefit from the performance enhancements of the Catalyst optimizer.

  • Still Catalyst optimized: both SQL and the API functions on DataFrames sit atop Catalyst

  • DataFrame functions: provide a bridge between the features of the Spark APIs

  • SQL with DataFrames: gives you a familiar way to interact with the data

  • SQL-like functions in the DataFrame API: for many of the expected features of SQL, there are similar functions in the DataFrame API that do practically the same thing, allowing for .functional().chaining()

Schemas: Inferred and Explicit Schemas, Data Types

Schemas can be inferred, i.e. guessed, by Spark. With inferred schemas, you usually end up with a bunch of strings and ints. If you have more specific needs, supply your own schema.

# sample data - "people.txt"
1|Kristian|Algebraix Data|San Diego|CA
2|Pat|Algebraix Data|San Diego|CA
3|Lebron|Cleveland Cavaliers|Cleveland|OH
4|Brad|Self Employed|Hollywood|CA
# load as RDD and map it to a row with multiple fields
rdd = sc.textFile("file:/people.txt")
def mapper(line):
  s = line.split("|")
  return Row(id=s[0],name=s[1],company=s[2],state=s[4])
peopleRDD = rdd.map(mapper)
peopleDF = spark.createDataFrame(peopleRDD)
# full syntax: .createDataFrame(peopleRDD, schema)
# we didn't actually pass anything into that 2nd param.
# yet, behind the scenes, there's still a schema.
>>> peopleDF.printSchema()
root
 |-- company: string (nullable = true)
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- state: string (nullable = true)

Spark SQL can certainly handle queries where id is a string, but what if we don’t want it to be? It should be an int.

# load as RDD and map it to a row with multiple fields
rdd = sc.textFile("file:/people.txt")
def mapper(line):
  s = line.split("|")
  # cast id to int so the inferred schema uses a numeric type
  return Row(id=int(s[0]),name=s[1],company=s[2],state=s[4])
peopleRDD = rdd.map(mapper)
peopleDF = spark.createDataFrame(peopleRDD)
>>> peopleDF.printSchema()
root
 |-- company: string (nullable = true)
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- state: string (nullable = true)

You can actually provide a schema, too, which will be more authoritative.

# load as RDD, map it to typed tuples, and apply an explicit schema
import pyspark.sql.types as types
rdd = sc.textFile("file:/people.txt")
def mapper(line):
  s = line.split("|")
  # return a plain tuple in schema order so the fields line up by position;
  # id must be cast to int because IntegerType will not accept a string
  return (int(s[0]), s[1], s[2], s[4])
schema = types.StructType([
   types.StructField('id',types.IntegerType(), False)
  ,types.StructField('name',types.StringType())
  ,types.StructField('company',types.StringType())
  ,types.StructField('state',types.StringType())
])
peopleRDD = rdd.map(mapper)
peopleDF = spark.createDataFrame(peopleRDD, schema)
>>> peopleDF.printSchema()
root
 |-- id: integer (nullable = false)
 |-- name: string (nullable = true)
 |-- company: string (nullable = true)
 |-- state: string (nullable = true)
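
The idea behind an explicit schema (a declared type and nullability for each field) can be sketched without Spark. The following plain-Python example is only an illustration: SCHEMA and parse_person are hypothetical names, and real Spark schemas are built with StructType and StructField.

```python
# A tiny stand-in for an explicit schema: (field name, converter, nullable).
SCHEMA = [
    ("id", int, False),
    ("name", str, True),
    ("company", str, True),
    ("state", str, True),
]

def parse_person(line):
    """Split a pipe-delimited line and convert the fields named in SCHEMA."""
    s = line.split("|")
    raw = {"id": s[0], "name": s[1], "company": s[2], "state": s[4]}
    record = {}
    for field, convert, nullable in SCHEMA:
        value = raw[field]
        if value == "":
            # An empty field is only acceptable when the schema allows nulls.
            if not nullable:
                raise ValueError("field %r may not be empty" % field)
            record[field] = None
        else:
            record[field] = convert(value)
    return record

person = parse_person("3|Lebron|Cleveland Cavaliers|Cleveland|OH")
print(person)
```

The point of the sketch is that the schema, not the data, decides that id is an integer and that it may not be missing.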

And what are the available types?

# http://spark.apache.org/docs/2.0.0/api/python/_modules/pyspark/sql/types.html
__all__ = ["DataType", "NullType", "StringType", 
"BinaryType", "BooleanType", "DateType", "TimestampType", 
"DecimalType", "DoubleType", "FloatType", "ByteType", 
"IntegerType", "LongType", "ShortType", "ArrayType", 
"MapType", "StructField", "StructType"]

Gotcha alert
Spark doesn’t seem to care when you leave dates as strings.

# Spark SQL handles this just fine as if they were
# legit date objects.
spark.sql("""  SELECT * FROM NewHires n WHERE n.start_date > "2016-01-01" """).show()
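
One plausible reason the string comparison behaves well here is that ISO-8601 dates (YYYY-MM-DD) sort lexicographically in the same order as they sort chronologically. The plain-Python sketch below (no Spark required; the sample dates are made up) shows that property, and also shows a date format that does not have it.

```python
from datetime import date

# Hypothetical start dates, stored as ISO-8601 strings (YYYY-MM-DD).
iso_strings = ["2015-12-31", "2016-01-02", "2016-11-05", "2016-02-10"]

def parse_iso(s):
    """Turn 'YYYY-MM-DD' into a real date object."""
    year, month, day = (int(part) for part in s.split("-"))
    return date(year, month, day)

# Filtering by plain string comparison ...
by_string = [s for s in iso_strings if s > "2016-01-01"]
# ... selects the same rows as filtering by real dates.
by_date = [s for s in iso_strings if parse_iso(s) > date(2016, 1, 1)]

# A US-style format (MM/DD/YYYY) does not have this property:
# "11/05/2015" compares greater than "01/02/2016" as a string,
# although it is the earlier date.
us_misordered = "11/05/2015" > "01/02/2016"

print(by_string == by_date, us_misordered)
```

If your dates are stored in any other textual format, converting them to a real date type before comparing is the safer choice.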

Now you know about inferred and explicit schemas, and the available types you can use.

Loading Data and Saving Results

Loading and saving are fairly straightforward.
Save your DataFrames in your desired format.

# picking up where we left off
peopleDF = spark.createDataFrame(peopleRDD, schema)
peopleDF.write.save("s3://acme-co/people.parquet",  format="parquet") 
# format= defaults to parquet if omitted
# formats: json, parquet, jdbc, orc, libsvm, csv, text

When you read, some types preserve schema. Parquet keeps the full schema, JSON has inferrable schema, and JDBC pulls in schema.

# read from stored parquet
peopleDF = spark.read.parquet("s3://acme-co/people.parquet")
# read from stored JSON
peopleDF = spark.read.json("s3://acme-co/people.json")

Spark SQL Data Sources

  • SparkSession
  • parquet
  • csv
  • json
  • jdbc
  • table
    • Preparing the table
    • Reading
    • Writing
    • Connecting to an existing Hive
  • text
    • Format known in advance
    • Format determined at runtime

Reading and Writing DataFrames with PySpark

Creating a DataFrame

  • From a variable
  • Reading JSON
    df = spark.read.json(file)
  • Reading CSV
    monthlySales = spark.read.csv(file, header=True, inferSchema=True)
  • Reading from MySQL
# read a whole table
df = spark.read.format('jdbc').options(
    url='jdbc:mysql://127.0.0.1',
    dbtable='mysql.db',
    user='root',
    password='123456'
    ).load()
# read the result of a subquery; the subquery needs an alias (t here)
sql="(select * from mysql.db where db='wp230') t"
df = spark.read.format('jdbc').options(
    url='jdbc:mysql://127.0.0.1',
    dbtable=sql,
    user='root',
    password='123456'
    ).load()
  • From a pandas.DataFrame
  • Reading from columnar Parquet storage
    df=spark.read.parquet(file)
  • Reading from Hive
from pyspark.sql import SparkSession
# a standalone master URL needs the spark:// scheme
spark = SparkSession \
        .builder \
        .enableHiveSupport() \
        .master("spark://172.31.100.170:7077") \
        .appName("my_first_app_name") \
        .getOrCreate()
 
df=spark.sql("select * from hive_tb_name")

Saving Data

  • Writing to CSV
    spark_df.write.csv(path=file, header=True, sep=",", mode='overwrite')
  • Saving to Parquet
    spark_df.write.parquet(path=file,mode='overwrite')
  • Writing to Hive
# enable dynamic partitioning
spark.sql("set hive.exec.dynamic.partition.mode = nonstrict")
spark.sql("set hive.exec.dynamic.partition=true")
 
# write to the partitioned table with ordinary Hive SQL
spark.sql("""
    insert overwrite table ai.da_aipurchase_dailysale_hive 
    partition (saledate) 
    select productid, propertyid, processcenterid, saleplatform, sku, poa, salecount, saledate 
    from szy_aipurchase_tmp_szy_dailysale distribute by saledate
    """)
 
# or rebuild the partitioned table on each run
jdbcDF.write.mode("overwrite").partitionBy("saledate").insertInto("ai.da_aipurchase_dailysale_hive")
jdbcDF.write.saveAsTable("ai.da_aipurchase_dailysale_hive", None, "append", partitionBy='saledate')
 
# no partitioning: simply import the data into a Hive table
jdbcDF.write.saveAsTable("ai.da_aipurchase_dailysale_for_ema_predict", None, "overwrite", None)
  • Writing to HDFS
    jdbcDF.write.mode("overwrite").options(header="true").csv("/home/ai/da/da_aipurchase_dailysale_for_ema_predict.csv")
  • Writing to MySQL
# Columns are matched automatically; that is, spark_df does not have to contain every column of the MySQL table
 
# overwrite: empty the table, then import
spark_df.write.mode("overwrite").format("jdbc").options(
    url='jdbc:mysql://127.0.0.1',
    user='root',
    password='123456',
    dbtable="test.test",
    batchsize="1000",
).save()
 
# append: add rows to the existing table
spark_df.write.mode("append").format("jdbc").options(
    url='jdbc:mysql://127.0.0.1',
    user='root',
    password='123456',
    dbtable="test.test",
    batchsize="1000",
).save()

When to Use SQL, and When SQL Is Not a Good Fit

Spark 1.6

  • Limited support for subqueries and various other noticeable SQL functionalities
  • Runs roughly half of the 99 TPC-DS benchmark queries
  • More SQL support in HiveContext

Spark 2.0

In Databricks’ Words

  • SQL2003 support
  • Runs all 99 of TPC-DS benchmark queries
  • A native SQL parser that supports both ANSI-SQL as well as Hive QL
  • Native DDL command implementations
  • Subquery support, including
    • Uncorrelated Scalar Subqueries
    • Correlated Scalar Subqueries
    • NOT IN predicate Subqueries (in WHERE/HAVING clauses)
    • IN predicate subqueries (in WHERE/HAVING clauses)
    • (NOT) EXISTS predicate subqueries (in WHERE/HAVING clauses)
  • View canonicalization support
  • In addition, when building without Hive support, Spark SQL should have almost all the functionality as when building with Hive support, with the exception of Hive connectivity, Hive UDFs, and script transforms.

Using SQL for ETL

These are some of the practices we adopted after working with Spark for a while.

  • Tip 1: In production, break your applications into smaller apps as steps, i.e. the “pipeline pattern”.
  • Tip 2: When tinkering locally, save a small version of the dataset via Spark and test against that.
  • Tip 3: If using EMR, create a cluster with your desired steps, prove it works, then export a CLI command to reproduce it, and run it in Data Pipeline to start recurring pipelines / jobs.

Working with JSON Data

JSON data is most easily read in as line-delimited JSON objects, one object per line:

{"n":"sarah","age":29}
{"n":"steve","age":45}

Schema is inferred upon load. Unlike other lazy operations, this will cause some work to be done.
Access arrays with inline array syntax:

SELECT  col[1], col[3] FROM json

If you want to flatten your JSON data, use the explode method (works in both the DataFrame API and SQL).

# json explode example
>>> spark.read.json("file:/json.explode.json").createOrReplaceTempView("json")
>>> spark.sql("SELECT * FROM json").show()
+----+----------------+
|   x|               y|
+----+----------------+
|row1| [1, 2, 3, 4, 5]|
|row2|[6, 7, 8, 9, 10]|
+----+----------------+
>>> spark.sql("SELECT x, explode(y) FROM json").show()
+----+---+
|   x|col|
+----+---+
|row1|  1|
|row1|  2|
|row1|  3|
|row1|  4|
|row1|  5|
|row2|  6|
|row2|  7|
|row2|  8|
|row2|  9|
|row2| 10|
+----+---+
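
To make the effect of explode concrete, here is a plain-Python sketch of the same flattening on the two sample rows. It does not use Spark; explode_rows is a hypothetical helper written only for this illustration.

```python
# Rows shaped like the sample table: (x, y) where y is a list.
rows = [
    ("row1", [1, 2, 3, 4, 5]),
    ("row2", [6, 7, 8, 9, 10]),
]

def explode_rows(rows):
    """Yield one (x, element) pair per element of each row's list."""
    for x, ys in rows:
        for y in ys:
            yield (x, y)

exploded = list(explode_rows(rows))
for x, col in exploded:
    print(x, col)
```

Each input row produces as many output rows as its array has elements, and the non-array column is repeated on every one of them.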

Access nested objects with dot syntax:

SELECT    field.subfield FROM json

For multi-line JSON files, you’ve got to do much more:

import re
# a list of data from files.
files = sc.wholeTextFiles("data.json")
# each tuple is (path, jsonData)
rawJSON = files.map(lambda x: x[1])
# sanitize the data by stripping all whitespace
cleanJSON = rawJSON.map(\
  lambda x: re.sub(r"\s+", "",x,flags=re.UNICODE)\
)
# finally, you can then read that in as "JSON"
spark.read.json( cleanJSON )
# PS -- the same goes for XML.
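
One caveat with the regex approach is that it strips every whitespace character, including spaces inside string values. As an alternative sketch, assuming each file holds exactly one JSON document, Python's json module can re-serialize the document onto a single line while leaving string contents intact. The helper name to_single_line and the sample document are hypothetical.

```python
import json
import re

def to_single_line(text):
    """Parse a (possibly multi-line) JSON document and re-serialize it on one line."""
    return json.dumps(json.loads(text), separators=(",", ":"))

# A hypothetical multi-line document with a space inside a string value.
multi_line = """{
    "n": "sarah connor",
    "age": 29
}"""

compact = to_single_line(multi_line)
stripped = re.sub(r"\s+", "", multi_line)  # the regex approach, for comparison

print(compact)   # the string value keeps its space
print(stripped)  # the string value loses its space
```

In a Spark job, a function like to_single_line could be mapped over the RDD of file contents in place of the regex.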

Reading from and Writing to External Databases

To read from an external database, you’ve got to have your JDBC connectors (jar) handy. In order to pass a jar package into Spark, you’d use the --jars flag when starting pyspark or spark-submit.

# start pyspark with the JDBC connector jar
my-linux-shell$ pyspark \
  --jars /path/to/mysql-jdbc.jar \
  --packages
# note: you can also add the path to your jar in the spark-defaults.conf file via these settings:
  spark.driver.extraClassPath
  spark.executor.extraClassPath

Once you’ve got your connector jars successfully imported, you can read an existing database into your Spark application or Spark shell as a DataFrame.

# line broken for readability; adjacent string literals are joined into one URL
sqlURL = ("jdbc:mysql://<db-host>:<port>"
  "?user=<user>"
  "&password=<pass>"
  "&rewriteBatchedStatements=true"
  "&continueBatchOnError=true")
df = spark.read.jdbc(url=sqlURL, table=".")
df.createOrReplaceTempView("myDB")
spark.sql("SELECT * FROM myDB").show()
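
Because the JDBC URL must end up as one unbroken string, it can help to assemble it from its parts. The sketch below uses only the Python standard library; build_mysql_url is a hypothetical helper, and the host, port, user, and password are placeholder values, not real settings.

```python
from urllib.parse import urlencode

def build_mysql_url(host, port, params):
    """Assemble a single-line MySQL JDBC URL from its parts."""
    return "jdbc:mysql://{}:{}?{}".format(host, port, urlencode(params))

# Placeholder values for illustration only; substitute your own.
sql_url = build_mysql_url(
    host="db.example.com",
    port=3306,
    params={
        "user": "report_user",
        "password": "s3cret",
        "rewriteBatchedStatements": "true",
        "continueBatchOnError": "true",
    },
)
print(sql_url)
```

A side benefit of urlencode is that characters such as & or = inside a password are percent-encoded instead of silently breaking the query string.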

If you’ve done some work and created or manipulated a DataFrame, you can write it to a database by using the DataFrame’s write.jdbc method. Be prepared: it can take a while.

Also, be warned, save modes in Spark can be a bit destructive. “Overwrite” doesn’t just overwrite your data, it overwrites your schemas too. Say goodbye to those precious indices.

Testing Your SQL in a Real Environment

If testing locally, do not load data from S3 or other similar types of cloud storage.
Construct your applications as much as you can in advance. Cloudy clusters are expensive.
In the cloud, you can test a lot of your code reliably with a 1-node cluster.
Get really comfortable using .parallelize() to create dummy data.
If you’re using big data and many nodes, don’t use .collect() unless you intend to.

How to Use the pyspark.sql Interfaces

class pyspark.sql.SparkSession(sparkContext, jsparkSession=None)

SparkSession is the entry point for working with Datasets and DataFrames in Spark, just as SparkContext is the entry point for RDDs. You can use it directly, without going through SparkContext. The pattern looks like this; adjust it to fit your own application:

>>> spark = SparkSession.builder \
...     .master("local") \
...     .appName("Word Count") \
...     .config("spark.some.config.option", "some-value") \
...     .getOrCreate()

class pyspark.sql.DataFrameReader(spark)

http://spark.apache.org/docs/2.2.1/api/python/pyspark.sql.html

  • csv(path, schema=None, sep=None, encoding=None, quote=None, escape=None, comment=None, header=None, inferSchema=None, ignoreLeadingWhiteSpace=None, ignoreTrailingWhiteSpace=None, nullValue=None, nanValue=None, positiveInf=None, negativeInf=None, dateFormat=None, timestampFormat=None, maxColumns=None, maxCharsPerColumn=None, maxMalformedLogPerPartition=None, mode=None, columnNameOfCorruptRecord=None, multiLine=None)

Loads a CSV file and returns the result as a DataFrame.

Raw data, in .csv format:
instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0,3,13,16
2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0,8,32,40
3,2011-01-01,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0,5,27,32
4,2011-01-01,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0,3,10,13
5,2011-01-01,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0,0,1,1
#-*-coding:utf-8-*-

from pyspark import SparkContext
from pyspark.sql import SparkSession

if __name__=="__main__":
    sc=SparkContext(appName='myApp')
    spark=SparkSession.builder.enableHiveSupport().getOrCreate()
    """
    df = spark.read.csv('/tmp/test/hour2.csv')
    df.show()
    
    +-------+----------+------+---+----+---+-------+-------+----------+----------+----+------+----+---------+------+----------+----+
    |    _c0|       _c1|   _c2|_c3| _c4|_c5|    _c6|    _c7|       _c8|       _c9|_c10|  _c11|_c12|     _c13|  _c14|      _c15|_c16|
    +-------+----------+------+---+----+---+-------+-------+----------+----------+----+------+----+---------+------+----------+----+
    |instant|    dteday|season| yr|mnth| hr|holiday|weekday|workingday|weathersit|temp| atemp| hum|windspeed|casual|registered| cnt|
    |      1|2011-01-01|     1|  0|   1|  0|      0|      6|         0|         1|0.24|0.2879|0.81|        0|     3|        13|  16|
    |      2|2011-01-01|     1|  0|   1|  1|      0|      6|         0|         1|0.22|0.2727| 0.8|        0|     8|        32|  40|
    |      3|2011-01-01|     1|  0|   1|  2|      0|      6|         0|         1|0.22|0.2727| 0.8|        0|     5|        27|  32|
    |      4|2011-01-01|     1|  0|   1|  3|      0|      6|         0|         1|0.24|0.2879|0.75|        0|     3|        10|  13|
    |      5|2011-01-01|     1|  0|   1|  4|      0|      6|         0|         1|0.24|0.2879|0.75|        0|     0|         1|   1|
    |      6|2011-01-01|     1|  0|   1|  5|      0|      6|         0|         2|0.24|0.2576|0.75|   0.0896|     0|         1|   1|
    |      7|2011-01-01|     1|  0|   1|  6|      0|      6|         0|         1|0.22|0.2727| 0.8|        0|     2|         0|   2|
    |      8|2011-01-01|     1|  0|   1|  7|      0|      6|         0|         1| 0.2|0.2576|0.86|        0|     1|         2|   3|
    |      9|2011-01-01|     1|  0|   1|  8|      0|      6|         0|         1|0.24|0.2879|0.75|        0|     1|         7|   8|
    |     10|2011-01-01|     1|  0|   1|  9|      0|      6|         0|         1|0.32|0.3485|0.76|        0|     8|         6|  14|
    |     11|2011-01-01|     1|  0|   1| 10|      0|      6|         0|         1|0.38|0.3939|0.76|   0.2537|    12|        24|  36|
    |     12|2011-01-01|     1|  0|   1| 11|      0|      6|         0|         1|0.36|0.3333|0.81|   0.2836|    26|        30|  56|
    |     13|2011-01-01|     1|  0|   1| 12|      0|      6|         0|         1|0.42|0.4242|0.77|   0.2836|    29|        55|  84|
    |     14|2011-01-01|     1|  0|   1| 13|      0|      6|         0|         2|0.46|0.4545|0.72|   0.2985|    47|        47|  94|
    |     15|2011-01-01|     1|  0|   1| 14|      0|      6|         0|         2|0.46|0.4545|0.72|   0.2836|    35|        71| 106|
    |     16|2011-01-01|     1|  0|   1| 15|      0|      6|         0|         2|0.44|0.4394|0.77|   0.2985|    40|        70| 110|
    |     17|2011-01-01|     1|  0|   1| 16|      0|      6|         0|         2|0.42|0.4242|0.82|   0.2985|    41|        52|  93|
    |     18|2011-01-01|     1|  0|   1| 17|      0|      6|         0|         2|0.44|0.4394|0.82|   0.2836|    15|        52|  67|
    |     19|2011-01-01|     1|  0|   1| 18|      0|      6|         0|         3|0.42|0.4242|0.88|   0.2537|     9|        26|  35|
    +-------+----------+------+---+----+---+-------+-------+----------+----------+----+------+----+---------+------+----------+----+
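    As you can see, if read.csv is called with no other arguments, the header row of the .csv file is read in as a data row, and new column names are generated automatically.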
    """
    """
    df = spark.read.csv(path='/tmp/test/hour2.csv',header=True,sep=',')
    df.show()
    +-------+----------+------+---+----+---+-------+-------+----------+----------+----+------+----+---------+------+----------+---+
    |instant|    dteday|season| yr|mnth| hr|holiday|weekday|workingday|weathersit|temp| atemp| hum|windspeed|casual|registered|cnt|
    +-------+----------+------+---+----+---+-------+-------+----------+----------+----+------+----+---------+------+----------+---+
    |      1|2011-01-01|     1|  0|   1|  0|      0|      6|         0|         1|0.24|0.2879|0.81|        0|     3|        13| 16|
    |      2|2011-01-01|     1|  0|   1|  1|      0|      6|         0|         1|0.22|0.2727| 0.8|        0|     8|        32| 40|
    |      3|2011-01-01|     1|  0|   1|  2|      0|      6|         0|         1|0.22|0.2727| 0.8|        0|     5|        27| 32|
    |      4|2011-01-01|     1|  0|   1|  3|      0|      6|         0|         1|0.24|0.2879|0.75|        0|     3|        10| 13|
    |      5|2011-01-01|     1|  0|   1|  4|      0|      6|         0|         1|0.24|0.2879|0.75|        0|     0|         1|  1|
    |      6|2011-01-01|     1|  0|   1|  5|      0|      6|         0|         2|0.24|0.2576|0.75|   0.0896|     0|         1|  1|
    |      7|2011-01-01|     1|  0|   1|  6|      0|      6|         0|         1|0.22|0.2727| 0.8|        0|     2|         0|  2|
    |      8|2011-01-01|     1|  0|   1|  7|      0|      6|         0|         1| 0.2|0.2576|0.86|        0|     1|         2|  3|
    |      9|2011-01-01|     1|  0|   1|  8|      0|      6|         0|         1|0.24|0.2879|0.75|        0|     1|         7|  8|
    |     10|2011-01-01|     1|  0|   1|  9|      0|      6|         0|         1|0.32|0.3485|0.76|        0|     8|         6| 14|
    |     11|2011-01-01|     1|  0|   1| 10|      0|      6|         0|         1|0.38|0.3939|0.76|   0.2537|    12|        24| 36|
    |     12|2011-01-01|     1|  0|   1| 11|      0|      6|         0|         1|0.36|0.3333|0.81|   0.2836|    26|        30| 56|
    |     13|2011-01-01|     1|  0|   1| 12|      0|      6|         0|         1|0.42|0.4242|0.77|   0.2836|    29|        55| 84|
    |     14|2011-01-01|     1|  0|   1| 13|      0|      6|         0|         2|0.46|0.4545|0.72|   0.2985|    47|        47| 94|
    |     15|2011-01-01|     1|  0|   1| 14|      0|      6|         0|         2|0.46|0.4545|0.72|   0.2836|    35|        71|106|
    |     16|2011-01-01|     1|  0|   1| 15|      0|      6|         0|         2|0.44|0.4394|0.77|   0.2985|    40|        70|110|
    |     17|2011-01-01|     1|  0|   1| 16|      0|      6|         0|         2|0.42|0.4242|0.82|   0.2985|    41|        52| 93|
    |     18|2011-01-01|     1|  0|   1| 17|      0|      6|         0|         2|0.44|0.4394|0.82|   0.2836|    15|        52| 67|
    |     19|2011-01-01|     1|  0|   1| 18|      0|      6|         0|         3|0.42|0.4242|0.88|   0.2537|     9|        26| 35|
    |     20|2011-01-01|     1|  0|   1| 19|      0|      6|         0|         3|0.42|0.4242|0.88|   0.2537|     6|        31| 37|
    +-------+----------+------+---+----+---+-------+-------+----------+----------+----+------+----+---------+------+----------+---+

    """

spark.createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True)

Creates a DataFrame from an RDD, a list, or a pandas.DataFrame.
The data parameter is the input data; if it is a list, each element of the list corresponds to one row.
schema can be a pyspark.sql.types.DataType, a datatype string, or a list of column names; the default is None.

from pyspark.sql import SparkSession
import pandas
spark=SparkSession.builder.getOrCreate()
sc=spark.sparkContext
# create from a list whose elements are tuples
a=[('Alice',1)]
a_dataframe=spark.createDataFrame(a)
print(a_dataframe.collect())
# specify the DataFrame's column names
a_dataframe=spark.createDataFrame(a,['name','age'])
print(a_dataframe.collect())
# create from a list whose elements are dicts, so the column names come for free
d = [{'name': 'Alice', 'age': 1}]
d_dataframe=spark.createDataFrame(d)
print(d_dataframe.collect())
# create from an RDD
rdd=sc.parallelize(a)
df=spark.createDataFrame(rdd)
print(df.collect())
# specify the data type of each column, using the types in pyspark.sql.types
from pyspark.sql.types import *
schema=StructType([StructField('name',StringType(),True),StructField('age',IntegerType(),True)])
df3=spark.createDataFrame(rdd,schema)
# specify column names and types directly with a datatype string
print(spark.createDataFrame(rdd,'name:string,age:int').collect())
print(df3.collect())
# create from a pandas.DataFrame; .toPandas() converts a Spark DataFrame into a pandas.DataFrame
print(spark.createDataFrame(df.toPandas()).collect())
print(spark.createDataFrame(pandas.DataFrame([['age',1]])).collect())

The output is as follows:

[Row(_1='Alice', _2=1)]
[Row(name='Alice', age=1)]

Creating a DataFrame from a JSON File

# spark is an existing SparkSession
df = spark.read.json("/home/qjzh/miniconda/envs/water_meter2/projects/people.json")
# Displays the content of the DataFrame to stdout
df.show()
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Michael|
# |  30|   Andy|
# |  19| Justin|
# +----+-------+

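A DataFrame column can be accessed either as an attribute (df.age) or by indexing (df['age']). The former is more convenient, but the latter is the recommended form.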

Exercises Using pyspark.sql.DataFrame

df.toDF(*cols)

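Returns a new DataFrame with the specified new column names. Sometimes a DataFrame produced by reading or transforming data has no meaningful column names, and adding them with toDF makes later operations on the DataFrame more convenient. (It appears that this DataFrame method cannot be used to convert an RDD into a DataFrame.)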

df.toDF('f1', 'f2').collect()
[Row(f1=2, f2='Alice'), Row(f1=5, f2='Bob')]

df.toJSON(use_unicode=True)

toJSON() converts the DataFrame into an RDD of strings. Each row is converted into a JSON-formatted string, which becomes one element of the returned RDD.

df.toJSON().first()
'{"age":2,"name":"Alice"}'

df.toLocalIterator()

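Returns an iterator that contains all of the rows in this DataFrame. The iterator will consume as much memory as the largest partition in this DataFrame.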

list(df.toLocalIterator())
[Row(age=2, name='Alice'), Row(age=5, name='Bob')]

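Methods Available After df.groupBy

(The methods below follow groupBy to compute per-group results. Some of them, such as agg and count, are also available directly on a DataFrame without groupBy.)
groupBy returns a dedicated type, pyspark.sql.GroupedData, so to find out which operations can follow it, just look at the methods and attributes of the pyspark.sql.GroupedData class.

  • agg(*exprs)

Computes aggregates and returns the result as a DataFrame.
The available aggregate functions are avg, max, min, sum, and count.
If exprs is a single dict mapping from string to string, then the key is the column to perform the aggregation on, and the value is the aggregate function.
Alternatively, exprs can also be a list of aggregate Column expressions.
Parameters: exprs - a dict mapping from column name (string) to aggregate function (string), or a list of Column.
If no groupBy columns are given, all rows form a single group; if no columns are given to the aggregate function, all numeric columns are aggregated, as shown below: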

>>> df.groupBy().avg().collect()
[Row(avg(age)=3.5)]
>>> sorted(df.groupBy(df.name).avg().collect())
[Row(name='Alice', avg(age)=2.0), Row(name='Bob', avg(age)=5.0)]

You can also group by column name and sort the collected results:

>>> sorted(df.groupBy('name').agg({'age': 'mean'}).collect())
[Row(name='Alice', avg(age)=2.0), Row(name='Bob', avg(age)=5.0)]

You can group by multiple columns:

>>> sorted(df.groupBy(['name', df.age]).count().collect())
[Row(name='Alice', age=2, count=1), Row(name='Bob', age=5, count=1)]
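
To show what these grouped aggregations compute, here is a plain-Python sketch of grouping by name and of aggregating with no grouping key. It does not use Spark; the two rows mirror the Alice/Bob data in the examples, and group_avg and global_avg are hypothetical helper names.

```python
from collections import defaultdict

# Same toy data as the df used in the examples: (name, age).
rows = [("Alice", 2), ("Bob", 5)]

def group_avg(rows):
    """Group rows by name and average the ages in each group."""
    groups = defaultdict(list)
    for name, age in rows:
        groups[name].append(age)
    return {name: sum(ages) / len(ages) for name, ages in groups.items()}

def global_avg(rows):
    """No grouping key: all rows form one group."""
    ages = [age for _, age in rows]
    return sum(ages) / len(ages)

per_name = group_avg(rows)
overall = global_avg(rows)
print(sorted(per_name.items()), overall)
```

The first function corresponds to grouping on name before averaging, and the second to calling groupBy() with no columns.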

Detailed function list:

'lit': 'Creates a :class:`Column` of literal value.',
'col': 'Returns a :class:`Column` based on the given column name.',
'column': 'Returns a :class:`Column` based on the given column name.',
'asc': 'Returns a sort expression based on the ascending order of the given column name.',
'desc': 'Returns a sort expression based on the descending order of the given column name.',

'upper': 'Converts a string expression to upper case.',
'lower': 'Converts a string expression to lower case.',
'sqrt': 'Computes the square root of the specified float value.',
'abs': 'Computes the absolute value.',

'max': 'Aggregate function: returns the maximum value of the expression in a group.',
'min': 'Aggregate function: returns the minimum value of the expression in a group.',
'first': 'Aggregate function: returns the first value in a group.',
'last': 'Aggregate function: returns the last value in a group.',
'count': 'Aggregate function: returns the number of items in a group.',
'sum': 'Aggregate function: returns the sum of all values in the expression.',
'avg': 'Aggregate function: returns the average of the values in a group.',
'mean': 'Aggregate function: returns the average of the values in a group.',
'sumDistinct': 'Aggregate function: returns the sum of distinct values in the expression.'

'acos': 'Computes the cosine inverse of the given value; the returned angle is in the range' +
'0.0 through pi.',
'asin': 'Computes the sine inverse of the given value; the returned angle is in the range' +
'-pi/2 through pi/2.',
'atan': 'Computes the tangent inverse of the given value.',
'cbrt': 'Computes the cube-root of the given value.',
'ceil': 'Computes the ceiling of the given value.',
'cos': 'Computes the cosine of the given value.',
'cosh': 'Computes the hyperbolic cosine of the given value.',
'exp': 'Computes the exponential of the given value.',
'expm1': 'Computes the exponential of the given value minus one.',
'floor': 'Computes the floor of the given value.',
'log': 'Computes the natural logarithm of the given value.',
'log10': 'Computes the logarithm of the given value in Base 10.',
'log1p': 'Computes the natural logarithm of the given value plus one.',
'rint': 'Returns the double value that is closest in value to the argument and' +
' is equal to a mathematical integer.',
'signum': 'Computes the signum of the given value.',
'sin': 'Computes the sine of the given value.',
'sinh': 'Computes the hyperbolic sine of the given value.',
'tan': 'Computes the tangent of the given value.',
'tanh': 'Computes the hyperbolic tangent of the given value.',
'toDegrees': 'Converts an angle measured in radians to an approximately equivalent angle ' +
'measured in degrees.',
'toRadians': 'Converts an angle measured in degrees to an approximately equivalent angle ' +
'measured in radians.',

'bitwiseNOT': 'Computes bitwise not.'


'stddev': 'Aggregate function: returns the unbiased sample standard deviation of' +
' the expression in a group.',
'stddev_samp': 'Aggregate function: returns the unbiased sample standard deviation of' +
' the expression in a group.',
'stddev_pop': 'Aggregate function: returns population standard deviation of' +
' the expression in a group.',
'variance': 'Aggregate function: returns the unbiased sample variance of the values in a group (alias for var_samp).',
'var_samp': 'Aggregate function: returns the unbiased variance of the values in a group.',
'var_pop':  'Aggregate function: returns the population variance of the values in a group.',
'skewness': 'Aggregate function: returns the skewness of the values in a group.',
'kurtosis': 'Aggregate function: returns the kurtosis of the values in a group.',
'collect_list': 'Aggregate function: returns a list of objects with duplicates.',
'collect_set': 'Aggregate function: returns a set of objects with duplicate elements' +
' eliminated.'

# math functions that take two arguments as input
_binary_mathfunctions = {
     
    'atan2': 'Returns the angle theta from the conversion of rectangular coordinates (x, y) to' +
             'polar coordinates (r, theta).',
    'hypot': 'Computes `sqrt(a^2^ + b^2^)` without intermediate overflow or underflow.',
    'pow': 'Returns the value of the first argument raised to the power of the second argument.',
}

_window_functions = {
     
    'rowNumber':
        """.. note:: Deprecated in 1.6, use row_number instead.""",
    'row_number':
        """returns a sequential number starting at 1 within a window partition.""",
    'denseRank':
        """.. note:: Deprecated in 1.6, use dense_rank instead.""",
    'dense_rank':
        """returns the rank of rows within a window partition, without any gaps.

        The difference between rank and denseRank is that denseRank leaves no gaps in ranking
        sequence when there are ties. That is, if you were ranking a competition using denseRank
        and had three people tie for second place, you would say that all three were in second
        place and that the next person came in third.""",
    'rank':
        """returns the rank of rows within a window partition.

        The difference between rank and denseRank is that denseRank leaves no gaps in ranking
        sequence when there are ties. That is, if you were ranking a competition using denseRank
        and had three people tie for second place, you would say that all three were in second
        place and that the next person came in third.

        This is equivalent to the RANK function in SQL.""",
    'cumeDist':
        """.. note:: Deprecated in 1.6, use cume_dist instead.""",
    'cume_dist':
        """returns the cumulative distribution of values within a window partition,
        i.e. the fraction of rows that are below the current row.""",
    'percentRank':
        """.. note:: Deprecated in 1.6, use percent_rank instead.""",
    'percent_rank':
        """returns the relative rank (i.e. percentile) of rows within a window partition.""",
}
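The difference between row_number, rank, and dense_rank described in the docstrings above can be checked by hand. The following is a minimal plain-Python sketch of the semantics only (it does not call Spark; the function name `ranking` and the sample scores are made up for illustration), assuming a single partition ordered by score in descending order.

```python
def ranking(values):
    """Return (row_number, rank, dense_rank) lists for values sorted descending."""
    ordered = sorted(values, reverse=True)
    row_number, rank, dense_rank = [], [], []
    for i, value in enumerate(ordered):
        row_number.append(i + 1)  # always sequential, ties or not
        if i > 0 and value == ordered[i - 1]:
            # A tie shares the rank of the previous row
            rank.append(rank[-1])
            dense_rank.append(dense_rank[-1])
        else:
            rank.append(i + 1)  # rank skips ahead after ties
            dense_rank.append(dense_rank[-1] + 1 if dense_rank else 1)  # no gaps
    return row_number, rank, dense_rank


scores = [100, 90, 90, 90, 80]  # three people tie for second place
row_number, rank, dense_rank = ranking(scores)
print(row_number)  # [1, 2, 3, 4, 5]
print(rank)        # [1, 2, 2, 2, 5]
print(dense_rank)  # [1, 2, 2, 2, 3]
```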

------

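  • avg(*cols)
    Computes the average of each numeric column for each group.
    mean is an alias for avg.
    Parameters: cols - list of column names (strings). Non-numeric columns are ignored.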
>>> df.groupBy().avg('age').collect()
[Row(avg(age)=3.5)]
>>> df3.groupBy().avg('age', 'height').collect()
[Row(avg(age)=3.5, avg(height)=82.5)]
  • count()
    Counts the number of records in each group.
>>> sorted(df.groupBy(df.age).count().collect())
[Row(age=2, count=1), Row(age=5, count=1)]
  • max(*cols)
    Computes the maximum of each numeric column for each group.
>>> df.groupBy().max('age').collect()
[Row(max(age)=5)]
>>> df3.groupBy().max('age', 'height').collect()
[Row(max(age)=5, max(height)=85)]
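  • mean(*cols)
    Computes the average of each numeric column for each group; same as avg.

  • min(*cols)
    Computes the minimum of each numeric column for each group; see max.

  • pivot(pivot_col, values=None)
    Pivots a column of the current DataFrame and performs the specified aggregation. There are two versions of pivot: one requires the caller to specify the list of distinct values to pivot on, and the other does not. The latter is more concise but less efficient, because Spark first needs to compute the list of distinct values internally.
    Parameters: pivot_col - name of the column to pivot; values - list of values that will be translated into columns in the output DataFrame.

  • sum(*cols)
    Computes the sum of each numeric column for each group.
    Parameters: cols - list of column names (strings). Non-numeric columns are ignored.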

>>> df.groupBy().sum('age').collect()
[Row(sum(age)=7)]
>>> df3.groupBy().sum('age', 'height').collect()
[Row(sum(age)=7, sum(height)=165)]
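Of the methods listed above, pivot is the least obvious, so here is a minimal plain-Python sketch of what "group by one column, pivot another, then sum" produces. It does not call Spark: the sample rows, the `pivot_sum` name, and the (year, course, earnings) layout are assumptions made for illustration. It also shows why passing values is cheaper: without it, an extra pass is needed to discover the distinct pivot values.

```python
from collections import defaultdict

# Hypothetical sample rows: (year, course, earnings)
ROWS = [
    (2012, "dotNET", 10000), (2012, "Java", 20000), (2012, "dotNET", 5000),
    (2013, "dotNET", 48000), (2013, "Java", 30000),
]


def pivot_sum(rows, values=None):
    """Group by year, pivot on course, and sum earnings."""
    if values is None:
        # Extra pass over the data to discover the distinct pivot values
        values = sorted({course for _, course, _ in rows})
    totals = defaultdict(dict)
    for year, course, earnings in rows:
        per_year = totals[year]  # every year gets an output row
        if course in values:
            per_year[course] = per_year.get(course, 0) + earnings
    # One output row per year, one output column per pivot value (None if absent)
    table = {year: [totals[year].get(v) for v in values] for year in sorted(totals)}
    return values, table


columns, table = pivot_sum(ROWS)
print(columns)  # ['Java', 'dotNET']
print(table)    # {2012: [20000, 15000], 2013: [30000, 48000]}
```

In PySpark the corresponding call would have the shape df.groupBy("year").pivot("course", ["dotNET", "Java"]).sum("earnings"), with the column names again being hypothetical.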

See also the section on pyspark.sql user-defined functions later in this article.
For the data types returned after groupBy, see: Spark DataFrame groupBy vs groupByKey

df.describe(*cols)

Computes basic statistics (count, mean, stddev, min, and max) for the given columns. If no columns are given, this function computes statistics for all numerical columns.
>>> df.describe().show()
+-------+------------------+
|summary|               age|
+-------+------------------+
|  count|                 2|
|   mean|               3.5|
| stddev|2.1213203435596424|
|    min|                 2|
|    max|                 5|
+-------+------------------+
>>> df.describe(['age', 'name']).show()
+-------+------------------+-----+
|summary|               age| name|
+-------+------------------+-----+
|  count|                 2|    2|
|   mean|               3.5| null|
| stddev|2.1213203435596424| null|
|    min|                 2|Alice|
|    max|                 5|  Bob|
+-------+------------------+-----+

df.fillna(value, subset=None)

Replace null values, alias for na.fill(). DataFrame.fillna() and DataFrameNaFunctions.fill() are aliases of each other.
Parameters:

  • value – int, long, float, string, or dict. Value to replace null values with. If the value is a dict, then subset is ignored and value must be a mapping from column name (string) to replacement value. The replacement value must be an int, long, float, boolean, or string.
  • subset – optional list of column names to consider. Columns specified in subset that do not have matching data type are ignored. For example, if value is a string, and subset contains a non-string column, then the non-string column is simply ignored.
>>> df4.na.fill(50).show()
+---+------+-----+
|age|height| name|
+---+------+-----+
| 10|    80|Alice|
|  5|    50|  Bob|
| 50|    50|  Tom|
| 50|    50| null|
+---+------+-----+
>>> df4.na.fill({'age': 50, 'name': 'unknown'}).show()
+---+------+-------+
|age|height|   name|
+---+------+-------+
| 10|    80|  Alice|
|  5|  null|    Bob|
| 50|  null|    Tom|
| 50|  null|unknown|
+---+------+-------+
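The two fill examples above differ in how the replacement is matched to columns. The sketch below reproduces that behavior in plain Python under simplifying assumptions: the schema is a hand-written list of (name, Python type) pairs, rows are tuples, and a replacement applies only when its Python type equals the column type. Real Spark handles numeric type conversions more flexibly, so treat this as an illustration of the rule, not of the implementation.

```python
# Assumed schema and data mirroring df4 from the examples above
SCHEMA = [("age", int), ("height", int), ("name", str)]
ROWS = [(10, 80, "Alice"), (5, None, "Bob"), (None, None, "Tom"), (None, None, None)]


def fillna(rows, value, subset=None):
    """Replace None cells; a dict maps column name to value and ignores subset."""
    if isinstance(value, dict):
        replacements = value
    else:
        names = subset if subset is not None else [name for name, _ in SCHEMA]
        replacements = {name: value for name in names}
    filled = []
    for row in rows:
        new_row = []
        for (name, col_type), cell in zip(SCHEMA, row):
            repl = replacements.get(name)
            # Columns whose type does not match the replacement are left alone
            if cell is None and repl is not None and isinstance(repl, col_type):
                cell = repl
            new_row.append(cell)
        filled.append(tuple(new_row))
    return filled


print(fillna(ROWS, 50))
print(fillna(ROWS, {"age": 50, "name": "unknown"}))
```

The first call fills age and height but leaves the null name untouched, and the second fills only the two columns named in the dict, matching the two tables shown above.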

Join operations on pyspark.sql.DataFrame

crossJoin(other)

Returns the Cartesian product of the two DataFrames.

    from pyspark.sql import Row
    df=spark.createDataFrame([Row(age=2, name='Alice'), Row(age=5, name='Bob')])
    df.show()
    """
    +---+-----+
    |age| name|
    +---+-----+
    |  2|Alice|
    |  5|  Bob|
    +---+-----+
    """
    df2=spark.createDataFrame([Row(name='Tom', height=80), Row(name='Bob', height=85)])
    df2.show()
    """
    +------+----+
    |height|name|
    +------+----+
    |    80| Tom|
    |    85| Bob|
    +------+----+
    """
    df3 = spark.createDataFrame([Row(age=2, name='Alice',grade='A'), Row(age=4, name='Bob',grade='B'), Row(age=5, name='Tom',grade='C')])
    df3.show()
    """
    +---+-----+-----+
    |age|grade| name|
    +---+-----+-----+
    |  2|    A|Alice|
    |  4|    B|  Bob|
    |  5|    C|  Tom|
    +---+-----+-----+
    """
    df.crossJoin(df2).show()
    """
    +---+-----+------+----+
    |age| name|height|name|
    +---+-----+------+----+
    |  2|Alice|    80| Tom|
    |  2|Alice|    85| Bob|
    |  5|  Bob|    80| Tom|
    |  5|  Bob|    85| Bob|
    +---+-----+------+----+
    """

join(other, on=None, how=None)

Parameters:

  • other – Right side of the join
  • on – a string for the join column name, a list of column names, a join expression (Column), or a list of Columns. If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an equi-join.
  • how – str, default inner. Must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, and left_anti.
    #df.join(df2,on=df.name==df2.name,how='inner').show()
    """
    +---+----+------+----+
    |age|name|height|name|
    +---+----+------+----+
    |  5| Bob|    85| Bob|
    +---+----+------+----+
    """
    #df.join(df2, on=df.name == df2.name, how='cross').show()  # with a join condition, this gives the same result as an inner join
    """
    +---+----+------+----+
    |age|name|height|name|
    +---+----+------+----+
    |  5| Bob|    85| Bob|
    +---+----+------+----+
    """
    #df.join(df2,on=df.name==df2.name,how='left').show()
    """
    +---+-----+------+----+
    |age| name|height|name|
    +---+-----+------+----+
    |  5|  Bob|    85| Bob|
    |  2|Alice|  null|null|
    +---+-----+------+----+
    """
    # df.join(df2, on=df.name == df2.name, how='left_outer').show()  # left_outer is the same join type as left
    """
    +---+-----+------+----+
    |age| name|height|name|
    +---+-----+------+----+
    |  5|  Bob|    85| Bob|
    |  2|Alice|  null|null|
    +---+-----+------+----+
    """
    # df.join(df2, on=df.name == df2.name, how='left_semi').show()  # compared with a left join: keeps only the left table's columns, and only the rows that have a match
    """
    +---+----+
    |age|name|
    +---+----+
    |  5| Bob|
    +---+----+
    """
    # df.join(df2, on=df.name == df2.name, how='left_anti').show()  # compared with a left join: keeps only the left table's columns, and only the rows that have no match
    """
    +---+-----+
    |age| name|
    +---+-----+
    |  2|Alice|
    +---+-----+
    """
    #cond = [df.name == df3.name, df.age == df3.age]  # with several join conditions, all of them must hold for a row to match; otherwise the right side is null
    #df.join(df3, on=cond, how='left').show()
    """
    +---+-----+----+-----+-----+
    |age| name| age|grade| name|
    +---+-----+----+-----+-----+
    |  2|Alice|   2|    A|Alice|
    |  5|  Bob|null| null| null|
    +---+-----+----+-----+-----+
    """
    #df.join(df2,on=df.name==df2.name,how='right').show()
    """
    +----+----+------+----+
    | age|name|height|name|
    +----+----+------+----+
    |null|null|    80| Tom|
    |   5| Bob|    85| Bob|
    +----+----+------+----+
    """
    # df.join(df2, on=df.name == df2.name, how='right_outer').show()  # right_outer is the same join type as right
    """
    +----+----+------+----+
    | age|name|height|name|
    +----+----+------+----+
    |null|null|    80| Tom|
    |   5| Bob|    85| Bob|
    +----+----+------+----+
    """
    #df.join(df2,on=df.name==df2.name,how='outer').show()
    """
    +----+-----+------+----+
    | age| name|height|name|
    +----+-----+------+----+
    |null| null|    80| Tom|
    |   5|  Bob|    85| Bob|
    |   2|Alice|  null|null|
    +----+-----+------+----+
    """

    #df.join(df2, on=df.name == df2.name, how='full').show()  # full is the same join type as outer
    """
    +----+-----+------+----+
    | age| name|height|name|
    +----+-----+------+----+
    |null| null|    80| Tom|
    |   5|  Bob|    85| Bob|
    |   2|Alice|  null|null|
    +----+-----+------+----+
    """
    #df.join(df2, on=df.name == df2.name, how='full_outer').show()  # full_outer is the same join type as outer
    """
    +----+-----+------+----+
    | age| name|height|name|
    +----+-----+------+----+
    |null| null|    80| Tom|
    |   5|  Bob|    85| Bob|
    |   2|Alice|  null|null|
    +----+-----+------+----+
    """
    #df.join(df2, 'name', how='left').show()  # passing the join column by name instead of an expression drops the duplicate column, much like USING instead of ON in a MySQL left join
    """
    +-----+---+------+
    | name|age|height|
    +-----+---+------+
    |  Bob|  5|    85|
    |Alice|  2|  null|
    +-----+---+------+
    """
    #df.join(df3, ['name','age'], how='left').show()  # compared with a join expression, the duplicate columns are dropped
    """
    +-----+---+-----+
    | name|age|grade|
    +-----+---+-----+
    |Alice|  2|    A|
    |  Bob|  5| null|
    +-----+---+-----+
    """

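The join types listed above can be compared side by side with a small plain-Python model. This is a sketch of the semantics only, not of how Spark executes joins: rows are dicts, the join is an equi-join on a single key, and the key column is merged into one column (the behavior shown above when the join column is passed by name). The `join` helper is hypothetical.

```python
left = [{"age": 2, "name": "Alice"}, {"age": 5, "name": "Bob"}]
right = [{"height": 80, "name": "Tom"}, {"height": 85, "name": "Bob"}]


def join(left, right, key, how="inner"):
    """Equi-join two lists of dicts on key; how is one of
    inner, left, right, full, left_semi, left_anti."""
    left_cols = list(left[0])
    right_cols = [c for c in right[0] if c != key]
    out, matched_right = [], set()
    for l in left:
        matches = [(i, r) for i, r in enumerate(right) if r[key] == l[key]]
        if how == "left_semi":  # left columns only, rows with a match
            if matches:
                out.append(dict(l))
            continue
        if how == "left_anti":  # left columns only, rows without a match
            if not matches:
                out.append(dict(l))
            continue
        for i, r in matches:
            matched_right.add(i)
            out.append({**l, **r})
        if not matches and how in ("left", "full"):
            out.append({**l, **{c: None for c in right_cols}})
    if how in ("right", "full"):
        for i, r in enumerate(right):
            if i not in matched_right:
                out.append({**{c: None for c in left_cols}, **r})
    return out


for how in ("inner", "left", "right", "full", "left_semi", "left_anti"):
    print(how, join(left, right, "name", how))
```

With the two small tables above, only Bob appears on both sides, so inner and left_semi return one row, left_anti returns Alice, and full returns three rows.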
union(other) or unionAll(other)

Return a new DataFrame containing union of rows in this and another frame.

This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.

Also as standard in SQL, this function resolves columns by position (not by name).
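Because union resolves columns by position, two frames with the same columns in a different order are combined without any error and produce misaligned rows. The sketch below models a DataFrame as a dict with "columns" and "rows" (an assumption made for illustration) to show the positional behavior, the UNION ALL semantics, and the effect of a following distinct.

```python
def union(a, b):
    """UNION ALL by position: keep a's column names, append b's rows as they are."""
    if len(a["columns"]) != len(b["columns"]):
        raise ValueError("union requires the same number of columns")
    return {"columns": a["columns"], "rows": a["rows"] + b["rows"]}


def distinct(df):
    """Drop duplicate rows, keeping first occurrences in order."""
    seen, rows = set(), []
    for row in df["rows"]:
        if row not in seen:
            seen.add(row)
            rows.append(row)
    return {"columns": df["columns"], "rows": rows}


def select(df, columns):
    """Reorder columns by name, which makes a positional union safe."""
    idx = [df["columns"].index(c) for c in columns]
    return {"columns": list(columns), "rows": [tuple(r[i] for i in idx) for r in df["rows"]]}


a = {"columns": ["name", "age"], "rows": [("Alice", 2), ("Bob", 5)]}
b = {"columns": ["age", "name"], "rows": [(5, "Bob")]}

print(union(a, b)["rows"])  # third row is (5, 'Bob'): the age landed under "name"
aligned = union(a, select(b, a["columns"]))
print(distinct(aligned)["rows"])  # duplicates of ('Bob', 5) collapse to one row
```

In PySpark, reordering with select before the union plays the role of the `select` helper here; later Spark releases also provide unionByName for the same purpose.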

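Adding, deleting, updating, and querying

PySpark DataFrame operations: add, delete, update, query

df.dropna(how='any', thresh=None, subset=None)

Returns a new DataFrame omitting rows with null values. DataFrame.dropna() and DataFrameNaFunctions.drop() are aliases of each other.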
Parameters:

  • how – ‘any’ or ‘all’. If ‘any’, drop a row if it contains any nulls. If ‘all’, drop a row only if all its values are null.
  • thresh – int, default None If specified, drop rows that have less than thresh non-null values. This overwrites the how parameter.
  • subset – optional list of column names to consider.
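How the how, thresh, and subset parameters interact is easier to see on a concrete table. The following plain-Python sketch applies the rules stated above to rows stored as dicts; the `dropna` helper and the sample rows (the same values as df4 in the fill examples) are for illustration and do not call Spark.

```python
ROWS = [
    {"age": 10, "height": 80, "name": "Alice"},
    {"age": 5, "height": None, "name": "Bob"},
    {"age": None, "height": None, "name": "Tom"},
    {"age": None, "height": None, "name": None},
]


def dropna(rows, how="any", thresh=None, subset=None):
    """Keep rows according to the null-handling rules described above."""
    kept = []
    for row in rows:
        cells = [row[c] for c in (subset or row.keys())]
        non_null = sum(cell is not None for cell in cells)
        if thresh is not None:
            keep = non_null >= thresh  # thresh takes precedence over how
        elif how == "any":
            keep = non_null == len(cells)  # drop if any value is null
        elif how == "all":
            keep = non_null > 0  # drop only if every value is null
        else:
            raise ValueError("how must be 'any' or 'all'")
        if keep:
            kept.append(row)
    return kept


print([r["name"] for r in dropna(ROWS)])             # ['Alice']
print([r["name"] for r in dropna(ROWS, how="all")])  # ['Alice', 'Bob', 'Tom']
print([r["name"] for r in dropna(ROWS, thresh=2)])   # ['Alice', 'Bob']
```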

df.na.fill(value, subset=None)

df.replace(to_replace, value, subset=None)

df.withColumn(colName, col)

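Returns a new DataFrame by adding a new column or replacing an existing column of the same name. colName is a string, the name of the new column.
col is a Column expression for the new column. The expression must be built from columns of this DataFrame; attempting to add information from another DataFrame may fail (for example with AssertionError: col should be Column when the argument is not a Column). The expression may, however, combine several columns of the same DataFrame.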

>>> df.withColumn('age2', df.age + 2).collect()
[Row(age=2, name='Alice', age2=4), Row(age=5, name='Bob', age2=7)]

Adding a column to a DataFrame in Python: how to add a constant column to a Spark DataFrame
For example:

from pyspark.sql.functions import lit

df.withColumn('new_column', lit(10))
df.withColumn('new_column', lit(''))

df.withColumnRenamed(existing, new)

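Returns a new DataFrame by renaming an existing column. existing is the name of the column to rename, and new is the new column name. The same result can also be achieved with toDF.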

>>> df.withColumnRenamed('age', 'age2').collect()
[Row(age2=2, name='Alice'), Row(age2=5, name='Bob')]

class pyspark.sql.Column(jc)

A column in a DataFrame. Column instances can be created as follows:

#1. Select a column out of a DataFrame
df.colName
df["colName"]

#2. Create from an expression
df.colName + 1
1 / df.colName 

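User-defined functions in pyspark.sql

pyspark.sql.functions.udf(f=None, returnType=StringType)

pyspark.sql.functions.pandas_udf(f=None, returnType=None, functionType=None)

User-defined functions in PySpark
PySpark Pandas UDF
When a user-defined function is applied after groupby, the DataFrame passed to the function holds all the data of one group, including the key column itself. It is a pandas DataFrame, and the return value must also be a pandas DataFrame, although its structure can be defined freely.
The DataFrame returned by the groupby handler must match the declared schema exactly in values, types, and column order. Because the handler usually does all sorts of processing, the simplest way to guarantee this is to reselect the columns in order just before returning, e.g. df = df[["A", "B", "C"]]; return df
Notes:
If a Hive column is of type int but actually contains nulls, processing it with PySpark may cause problems. After groupby, do not declare that field as IntegerType() in the schema; temporarily use StringType() instead.
If the returned DataFrame contains fields that are not in the original Hive table, there should be two ways to do it. One is to add the columns to the pyspark.sql DataFrame with withColumn() before the groupby, and then do the processing in the handler. The other is to groupby first and add the columns to the pandas.DataFrame inside the handler. For reasons I have not worked out, the second approach failed for me; I will try again when I have time.
Here is an example: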

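import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType

# Output schema: the returned pandas DataFrame must match it in names, types, and order
schema = StructType([
    StructField("AAA", StringType()),
    StructField("result", StringType())
])

# Algorithm
@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def virtual_height(gdf):
    # gdf is a pandas DataFrame holding all rows of one group, key column included
    AAA = gdf.iloc[0, 0]
    gdf = gdf.sort_values(by="DATATIME")
    cur_input = list(gdf["BBB"])
    vol_max = list(gdf["CCC"])
    vol_min = list(gdf["DDD"])
    soc_v_time = list(gdf["DATATIME"])

    # ---- (the computation of EEE, FFF, HHH, and GGG is omitted in the original) ----

    return pd.DataFrame({
        "AAA": [AAA, AAA, AAA, AAA, AAA],
        "result": ['warning', str(EEE), str(FFF), str(HHH), GGG.strftime("%Y-%m-%d %H:%M:%S")]})


DF.groupby("AAA").apply(virtual_height).show()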
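The grouped-map pattern above is a split, apply, combine: rows are split by key, a handler receives each whole group, and its output is projected onto the declared schema. The following plain-Python sketch mimics that flow without Spark or pandas. The names `summarize`, `grouped_apply`, and `OUTPUT_COLUMNS`, as well as the key/ts/value fields, are hypothetical stand-ins; the point is only the column-order rule mentioned in the notes.

```python
from itertools import groupby
from operator import itemgetter

OUTPUT_COLUMNS = ("key", "result")  # plays the role of the declared schema


def summarize(group_rows):
    """Handler: receives every row of one group (key included) and returns output rows."""
    ordered = sorted(group_rows, key=itemgetter("ts"))
    key = ordered[0]["key"]
    peak = max(r["value"] for r in ordered)
    # Keys are deliberately built in a different order than the schema
    return [{"result": str(peak), "key": key}]


def grouped_apply(rows, key, func):
    """Split rows by key, run func per group, and emit tuples in schema order."""
    out = []
    by_key = sorted(rows, key=itemgetter(key))
    for _, group in groupby(by_key, key=itemgetter(key)):
        for produced in func(list(group)):
            # Reselect columns in schema order just before emitting
            out.append(tuple(produced[c] for c in OUTPUT_COLUMNS))
    return out


rows = [
    {"key": "a", "ts": 2, "value": 3},
    {"key": "b", "ts": 1, "value": 7},
    {"key": "a", "ts": 1, "value": 5},
]
print(grouped_apply(rows, "key", summarize))  # [('a', '5'), ('b', '7')]
```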

pyspark.sql.functions.window(timeColumn, windowDuration, slideDuration=None, startTime=None)

Bucketize rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported.

The time column must be of pyspark.sql.types.TimestampType.

Durations are provided as strings, e.g. ‘1 second’, ‘1 day 12 hours’, ‘2 minutes’. Valid interval strings are ‘week’, ‘day’, ‘hour’, ‘minute’, ‘second’, ‘millisecond’, ‘microsecond’. If the slideDuration is not provided, the windows will be tumbling windows.

The startTime is the offset with respect to 1970-01-01 00:00:00 UTC with which to start window intervals. For example, in order to have hourly tumbling windows that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15… provide startTime as 15 minutes.

The output column will be a struct called ‘window’ by default with the nested columns ‘start’ and ‘end’, where ‘start’ and ‘end’ will be of pyspark.sql.types.TimestampType.

>>> df = spark.createDataFrame([("2016-03-11 09:00:07", 1)]).toDF("date", "val")
>>> w = df.groupBy(window("date", "5 seconds")).agg(sum("val").alias("sum"))
>>> w.select(w.window.start.cast("string").alias("start"),
...          w.window.end.cast("string").alias("end"), "sum").collect()
[Row(start='2016-03-11 09:00:05', end='2016-03-11 09:00:10', sum=1)]
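The window boundaries in the example above follow from simple modular arithmetic on the timestamp. The sketch below is a plain-Python model of that arithmetic, assuming naive datetimes interpreted as UTC and windows aligned to 1970-01-01 00:00:00 plus the start offset; the `windows_for` helper is hypothetical and ignores the session time zone handling that Spark applies when displaying results.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def windows_for(ts, window, slide=None, start=timedelta(0)):
    """Return every [start, end) window that contains ts, earliest first."""
    slide = slide or window  # no slide means tumbling windows
    # Latest window start at or before ts, aligned to EPOCH + start
    last_start = ts - ((ts - EPOCH - start) % slide)
    found = []
    s = last_start
    while s + window > ts:  # window end is exclusive
        found.append((s, s + window))
        s -= slide
    return sorted(found)


ts = datetime(2016, 3, 11, 9, 0, 7)
print(windows_for(ts, timedelta(seconds=5)))
# tumbling: one window, 09:00:05 to 09:00:10
print(windows_for(ts, timedelta(seconds=10), slide=timedelta(seconds=5)))
# sliding: two overlapping windows contain 09:00:07
print(windows_for(datetime(2016, 3, 11, 12, 20), timedelta(hours=1),
                  start=timedelta(minutes=15)))
# hourly windows starting 15 minutes past the hour: 12:15 to 13:15
```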

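Spark SQL optimization

Langjian: dynamic partition pruning in Spark 3, explained through examples

Using Spark SQL for statistical analysis of structured data

Source: Wang Long, published September 9, 2015

Introduction

In many fields, such as telecommunications and finance, large volumes of structured data are produced every day. As data volumes keep growing, traditional storage (DBMS) and computation (single-machine programs) can no longer meet enterprise needs for data storage, statistical analysis, and knowledge mining. Over the past years, software developers and maintainers have built up a great deal of knowledge and experience in working with data through a DBMS, and they are used to analyzing records by writing SQL statements. Big data engineers therefore began to explore SQL-like ways of operating on and analyzing big data, and after much effort the industry now has many SQL-on-Hadoop solutions, such as Hive and Impala. Spark SQL is one of them. Spark SQL was not part of the Spark ecosystem from the start; its predecessor was Shark. As Spark evolved, the Spark team decided to move away from Shark, which depended too heavily on Hive (for query optimization and syntax parsing), and created Spark SQL as a brand-new module. After several releases, Spark SQL has become stable and increasingly feature-rich. Based on Spark 1.4.1, this article introduces the basic concepts and principles of Spark SQL/DataFrame step by step, and uses examples to show how to develop applications with the Spark SQL/DataFrame API. Let's begin.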

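About Spark SQL/DataFrame

Spark SQL is the module in the Spark ecosystem for processing structured big data. Its most important concept is the DataFrame, which will be familiar to engineers who know R. Spark's DataFrame is built on the SchemaRDD of earlier versions, so it is naturally suited to distributed big data processing. A Spark DataFrame is based on an RDD but carries schema information, which makes it similar to a two-dimensional table in a traditional database.
The Spark SQL module currently supports converting data from many external sources into DataFrames, which can then be processed and analyzed either with RDD-like operations or by registering them as temporary tables. The currently supported data sources are:

  • JSON
  • Text files
  • RDD
  • Relational databases
  • Hive
  • Parquet

Once a DataFrame is registered as a temporary table, the data can be queried with SQL-like statements. The cases below show in detail how to use the Spark SQL/DataFrame API to read data, register temporary tables, and run statistical analysis.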

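Cases and implementation

Case 1

a. Case description and analysis
In this case, we use Spark SQL to analyze structured data containing 500 million population records, stored in a text file with a total size of 7.81 GB. The file has three columns: the first is the ID, the second is gender (F for female, M for male), and the third is height in cm. The file format is in fact the same as the one used in Case 3 of the first article in this series; readers can refer to the relevant section and use the test data generator provided there to generate 500 million records for this case. For ease of understanding, an excerpt of the text file is shown below.
Figure 1. Preview of the test data format for Case 1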

1 F 151
2 E 203
3 M 132
4 E 126
5 M 215
6 M 173
7 F 156
8 M 120
9 M 201
10 F 102
1	Male	163
2	Male	164
3	Male	165
4	Male	168
5	Male	169
6	Male	170
7	Male	170
8	Male	170
9	Male	171
10	Male	172
11	Male	172
12	Male	172
13	Male	173
14	Male	173
15	Male	174
16	Male	175
17	Male	175
18	Male	175
19	Male	175
20	Male	175
21	Male	176
22	Male	178
23	Male	178
24	Male	180
25	Male	180
26	Male	183
27	Female	153
28	Female	156
29	Female	156
30	Female	157
31	Female	158
32	Female	158
33	Female	159
34	Female	160
35	Female	160
36	Female	160
37	Female	160
38	Female	161
39	Female	161
40	Female	162
41	Female	162
42	Female	163
43	Female	163
44	Female	163
45	Female	164
46	Female	164
47	Female	165
48	Female	165
49	Female	165
50	Female	168
51	Female	168
52	Female	170

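After generating the test file, upload it to HDFS using either HDFS shell commands or the HDFS Eclipse plugin. Once it is uploaded, the file details can be viewed in the HDFS web console (http://namenode:50070).
Figure 2. Basic information about the Case 1 test data file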


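In this case, the statistics tasks are as follows:

  • Use SQL to count the men taller than 180 cm.
  • Use SQL to count the women taller than 170 cm.
  • Group the population by gender and count men and women.
  • Use RDD-like transformations on the DataFrame to find and print the first 50 men taller than 210 cm.
  • Sort everyone by height and print the top 50.
  • Compute the average height of men.
  • Compute the maximum height of women.

As you can see, some of these tasks are similar, but we implement them in different ways to demonstrate different syntax. The listing below runs on the 52-row sample shown above, so it uses lower thresholds (178 cm, 168 cm, 170 cm) and smaller result sizes.
b. Implementation
Listing 1. Source code of the Case 1 example program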

#-*- coding:utf-8 -*-

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import Row

if __name__=="__main__":
    sc=SparkContext(appName="myApp")
    spark=SparkSession.builder.enableHiveSupport().getOrCreate()

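    rddData=sc.textFile("/tmp/test/heightData.txt")
    # Assumes comma-separated fields; use x.split() instead if the file is whitespace-separated like the preview above
    rddDataFmt=rddData.map(lambda x:x.split(",")).map(lambda line:Row(id=int(line[0]),gender=line[1],heights=float(line[2])))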
    df=spark.createDataFrame(rddDataFmt)
    """
    df.show()
    +------+-------+---+
    |gender|heights| id|
    +------+-------+---+
    |  Male|  163.0|  1|
    |  Male|  164.0|  2|
    |  Male|  165.0|  3|
    |  Male|  168.0|  4|
    |  Male|  169.0|  5|
    |  Male|  170.0|  6|
    |  Male|  170.0|  7|
    |  Male|  170.0|  8|
    |  Male|  171.0|  9|
    |  Male|  172.0| 10|
    |  Male|  172.0| 11|
    |  Male|  172.0| 12|
    |  Male|  173.0| 13|
    |  Male|  173.0| 14|
    |  Male|  174.0| 15|
    |  Male|  175.0| 16|
    |  Male|  175.0| 17|
    |  Male|  175.0| 18|
    |  Male|  175.0| 19|
    |  Male|  175.0| 20|
    +------+-------+---+
    only showing top 20 rows
    """
    df.createOrReplaceTempView("people")
    # Count the men who are at least 178 cm tall
    heighterMale178=spark.sql("select * from people where heights>=178 and gender='Male'")
    # print("Number of men at least 178 cm tall: %d"%(heighterMale178.count()))
    # Number of men at least 178 cm tall: 5
    # Count the women who are at least 168 cm tall
    heighterFemal168=spark.sql("select * from people where heights>=168 and gender='Female'")
    # print("Number of women at least 168 cm tall: %d"%(heighterFemal168.count()))
    # Number of women at least 168 cm tall: 3
    # Count people by gender
    """
    df.groupBy("gender").count().show()
    +------+-----+
    |gender|count|
    +------+-----+
    |Female|   26|
    |  Male|   26|
    +------+-----+
    """
    # Show 10 men who are at least 170 cm tall (unsorted)
    """
    df.filter(df.gender=="Male").filter(df.heights>=170).show(10)
    +------+-------+---+
    |gender|heights| id|
    +------+-------+---+
    |  Male|  170.0|  6|
    |  Male|  170.0|  7|
    |  Male|  170.0|  8|
    |  Male|  171.0|  9|
    |  Male|  172.0| 10|
    |  Male|  172.0| 11|
    |  Male|  172.0| 12|
    |  Male|  173.0| 13|
    |  Male|  173.0| 14|
    |  Male|  174.0| 15|
    +------+-------+---+
    only showing top 10 rows
    """
    # Sort everyone by height in descending order and show the top 15
    """
    df.sort(df.heights.desc()).show(15)
    +------+-------+---+
    |gender|heights| id|
    +------+-------+---+
    |  Male|  183.0| 26|
    |  Male|  180.0| 24|
    |  Male|  180.0| 25|
    |  Male|  178.0| 23|
    |  Male|  178.0| 22|
    |  Male|  176.0| 21|
    |  Male|  175.0| 17|
    |  Male|  175.0| 19|
    |  Male|  175.0| 16|
    |  Male|  175.0| 18|
    |  Male|  175.0| 20|
    |  Male|  174.0| 15|
    |  Male|  173.0| 13|
    |  Male|  173.0| 14|
    |  Male|  172.0| 10|
    +------+-------+---+
    only showing top 15 rows
    print(df.sort(df.heights.desc()).take(20))
    [Row(gender='Male', heights=183.0, id=26), Row(gender='Male', heights=180.0, id=25), 
    Row(gender='Male', heights=180.0, id=24), Row(gender='Male', heights=178.0, id=22), 
    Row(gender='Male', heights=178.0, id=23), Row(gender='Male', heights=176.0, id=21), 
    Row(gender='Male', heights=175.0, id=16), Row(gender='Male', heights=175.0, id=17), 
    Row(gender='Male', heights=175.0, id=19), Row(gender='Male', heights=175.0, id=20), 
    Row(gender='Male', heights=175.0, id=18), Row(gender='Male', heights=174.0, id=15), 
    Row(gender='Male', heights=173.0, id=13), Row(gender='Male', heights=173.0, id=14), 
    Row(gender='Male', heights=172.0, id=10), Row(gender='Male', heights=172.0, id=11), 
    Row(gender='Male', heights=172.0, id=12), Row(gender='Male', heights=171.0, id=9), 
    Row(gender='Male', heights=170.0, id=6), Row(gender='Male', heights=170.0, id=7)]
    """
    # Average height of men
    """
    df.filter(df.gender=="Male").agg({"heights":"avg"}).show()
    +------------------+
    |      avg(heights)|
    +------------------+
    |172.92307692307693|
    +------------------+
    """
    # Maximum height of women
    """
    df.filter(df.gender=="Female").agg({"heights":"max"}).show()
    +------------+
    |max(heights)|
    +------------+
    |       170.0|
    +------------+
    """

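The SQL used in Listing 1 is ordinary SQL, so the numbers above can be cross-checked without a cluster. The sketch below loads the same 52 sample heights into an in-memory SQLite database from the Python standard library and runs equivalent queries. SQLite is only a stand-in for Spark SQL here, and the table layout is copied from the listing; the `scalar` helper is hypothetical.

```python
import sqlite3

# Heights copied from the 52-row sample above (ids 1-26 male, 27-52 female)
male = [163, 164, 165, 168, 169, 170, 170, 170, 171, 172, 172, 172, 173,
        173, 174, 175, 175, 175, 175, 175, 176, 178, 178, 180, 180, 183]
female = [153, 156, 156, 157, 158, 158, 159, 160, 160, 160, 160, 161, 161,
          162, 162, 163, 163, 163, 164, 164, 165, 165, 165, 168, 168, 170]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, gender TEXT, heights REAL)")
rows = [(i + 1, "Male", h) for i, h in enumerate(male)]
rows += [(len(male) + i + 1, "Female", h) for i, h in enumerate(female)]
conn.executemany("INSERT INTO people VALUES (?, ?, ?)", rows)


def scalar(sql):
    """Run a query that returns a single value."""
    return conn.execute(sql).fetchone()[0]


tall_men = scalar("SELECT count(*) FROM people WHERE heights >= 178 AND gender = 'Male'")
tall_women = scalar("SELECT count(*) FROM people WHERE heights >= 168 AND gender = 'Female'")
by_gender = dict(conn.execute("SELECT gender, count(*) FROM people GROUP BY gender"))
avg_male = scalar("SELECT avg(heights) FROM people WHERE gender = 'Male'")
max_female = scalar("SELECT max(heights) FROM people WHERE gender = 'Female'")

print(tall_men, tall_women)  # 5 3
print(by_gender)             # 26 men and 26 women
print(avg_male, max_female)  # about 172.92 and 170.0
```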
c. Submit and run
spark-submit --master yarn --executor-memory 1g --driver-memory 1g --conf spark.yarn.am.memory=1g ./sparkSQLTest.py
d. Monitor execution
After submission, the program's execution can be monitored in the Spark web console (http://:8080), including the jobs and stages it produces and a visualization of the DAG.
e. Results
As shown above.

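Case 2

a. Case description and analysis
In Case 1, we converted a file stored in HDFS into a DataFrame and ran a series of statistics on it. Attentive readers will have noticed that these were all simple queries and statistics over a single DataFrame. In Case 2, we show how to join multiple DataFrames for more complex analysis.

In this case, we analyze 10 million users and 100 million transactions. The user data is a text file with 6 columns (ID, gender, age, registration date, role (occupation), region) in the following format.
Figure 8. Preview of the Case 2 user test data format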

1,F,56,2008-10-4,ROLE003,REG005
2,F,13,2008-7-16,ROLE001,REG004
3,M,28,2009-2-2,ROLE003,REG002
4,M,51,2008-8-24,ROLE001,REG003
5,M,18,2010-1-15,ROLE004,REG004
6,M,52,2014-7-27,ROLE001,REG005
7,M,59,2000-6-14,ROLE001,REG004
8,M,13,2006-3-17,ROLE002,REG001
9,M,46,2000-10-5,ROLE002,REG005
10,M,19,2004-8-24,ROLE002,REG001
11,M,15,2004-4-26,ROLE001,REG003
12,F,30,2004-12-20,ROLE002,REG003
13,M,11,2003-3-22,ROLE004,REG001
14,M,56,2013-11-14,ROLE002,REG001
15,F,31,2005-2-8,ROLE003,REG001
16,F,13,2001-11-7,ROLE001,REG003
17,M,38,2000-1-11,ROLE005,REG004
18,M,14,2004-9-12,ROLE003,REG003
19,F,53,2013-3-13,ROLE002,REG004
20,M,34,2009-2-6,ROLE002,REG005
21,F,50,2011-8-16,ROLE003,REG004
22,F,10,2007-5-27,ROLE004,REG002
23,F,47,2014-4-18,ROLE003,REG005
24,M,43,2012-2-21,ROLE001,REG002
25,F,19,2001-10-4,ROLE005,REG001
26,F,14,2006-11-8,ROLE005,REG003

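The transaction data is a text file with 5 columns (order ID, order date, product category, price, user ID) in the following format.
Figure 9. Preview of the Case 2 transaction test data format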

1,2007-11-12,7,462,2
2,2008-9-24,10,279,6
3,2013-12-13,9,438,2
4,2001-3-26,4,1871,8
5,2015-7-6,10,1634,1
6,2003-2-8,2,1906,6
7,2007-10-27,3,1321,1
8,2001-10-4,2,1623,7
9,2015-3-25,7,754,3
10,2009-4-19,2,1308,1
11,2011-10-15,7,1266,3
12,2015-4-3,6,1480,9
13,2007-4-26,1,1221,5
14,2013-1-19,10,921,5
15,2005-2-25,10,1347,4
16,2005-2-5,2,1150,5
17,2004-12-9,6,1850,8
18,2005-2-26,6,122,8
19,2009-1-10,7,427,7
20,2004-5-1,4,414,9
21,2014-10-5,6,651,4
22,2003-11-7,5,1082,2
23,2010-11-7,9,624,6
24,2013-10-11,3,1848,4
25,2015-6-11,1,141,1
26,2000-10-2,3,653,3

b. Implementation

#-*-coding:utf-8-*-

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.storagelevel import StorageLevel

if __name__=="__main__":
    sc=SparkContext(appName="myApp")
    spark=SparkSession.builder.enableHiveSupport().getOrCreate()
    # Read the users and orders data
    usersRddSrc = sc.textFile("/tmp/test/users.txt")
    usersRdd=usersRddSrc.map(lambda x:x.split(",")).map(lambda line:Row(userID=line[0],gender=line[1],
                age=line[2],registerDate=line[3],occupation=line[4],region=line[5]))
    usersDF=spark.createDataFrame(usersRdd)
    """
    usersDF.show()
    +---+------+----------+------+------------+------+
    |age|gender|occupation|region|registerDate|userID|
    +---+------+----------+------+------------+------+
    | 56|     F|   ROLE003|REG005|   2008-10-4|     1|
    | 13|     F|   ROLE001|REG004|   2008-7-16|     2|
    | 28|     M|   ROLE003|REG002|    2009-2-2|     3|
    | 51|     M|   ROLE001|REG003|   2008-8-24|     4|
    | 18|     M|   ROLE004|REG004|   2010-1-15|     5|
    | 52|     M|   ROLE001|REG005|   2014-7-27|     6|
    | 59|     M|   ROLE001|REG004|   2000-6-14|     7|
    | 13|     M|   ROLE002|REG001|   2006-3-17|     8|
    | 46|     M|   ROLE002|REG005|   2000-10-5|     9|
    | 19|     M|   ROLE002|REG001|   2004-8-24|    10|
    | 15|     M|   ROLE001|REG003|   2004-4-26|    11|
    | 30|     F|   ROLE002|REG003|  2004-12-20|    12|
    | 11|     M|   ROLE004|REG001|   2003-3-22|    13|
    | 56|     M|   ROLE002|REG001|  2013-11-14|    14|
    | 31|     F|   ROLE003|REG001|    2005-2-8|    15|
    | 13|     F|   ROLE001|REG003|   2001-11-7|    16|
    | 38|     M|   ROLE005|REG004|   2000-1-11|    17|
    | 14|     M|   ROLE003|REG003|   2004-9-12|    18|
    | 53|     F|   ROLE002|REG004|   2013-3-13|    19|
    | 34|     M|   ROLE002|REG005|    2009-2-6|    20|
    +---+------+----------+------+------------+------+
    only showing top 20 rows
    """
    usersDF.createOrReplaceTempView("users")
    ordersRddSrc=sc.textFile("/tmp/test/orders.txt")
    ordersRdd=ordersRddSrc.map(lambda x:x.split(",")).map(lambda line:Row(orderID=line[0],orderDate=line[1],
                productID=line[2],price=line[3],userID=line[4]))
    ordersDF=spark.createDataFrame(ordersRdd)
    """
    ordersDF.show()
    +----------+-------+-----+---------+------+
    | orderDate|orderID|price|productID|userID|
    +----------+-------+-----+---------+------+
    |2007-11-12|      1|  462|        7|     2|
    | 2008-9-24|      2|  279|       10|     6|
    |2013-12-13|      3|  438|        9|     2|
    | 2001-3-26|      4| 1871|        4|     8|
    |  2015-7-6|      5| 1634|       10|     1|
    |  2003-2-8|      6| 1906|        2|     6|
    |2007-10-27|      7| 1321|        3|     1|
    | 2001-10-4|      8| 1623|        2|     7|
    | 2015-3-25|      9|  754|        7|     3|
    | 2009-4-19|     10| 1308|        2|     1|
    |2011-10-15|     11| 1266|        7|     3|
    |  2015-4-3|     12| 1480|        6|     9|
    | 2007-4-26|     13| 1221|        1|     5|
    | 2013-1-19|     14|  921|       10|     5|
    | 2005-2-25|     15| 1347|       10|     4|
    |  2005-2-5|     16| 1150|        2|     5|
    | 2004-12-9|     17| 1850|        6|     8|
    | 2005-2-26|     18|  122|        6|     8|
    | 2009-1-10|     19|  427|        7|     7|
    |  2004-5-1|     20|  414|        4|     9|
    +----------+-------+-----+---------+------+
    only showing top 20 rows
    """
    ordersDF.createOrReplaceTempView("orders")
    usersDF.persist(storageLevel=StorageLevel.MEMORY_ONLY_SER)
    ordersDF.persist(storageLevel=StorageLevel.MEMORY_ONLY_SER)
    # Show the 2015 orders together with the users' personal information
    df2015=ordersDF.filter(ordersDF.orderDate.contains("2015")).join(usersDF,ordersDF.userID==usersDF.userID,"inner")
    """
    df2015.show()
    +---------+-------+-----+---------+------+---+------+----------+------+------------+------+
    |orderDate|orderID|price|productID|userID|age|gender|occupation|region|registerDate|userID|
    +---------+-------+-----+---------+------+---+------+----------+------+------------+------+
    |2015-3-25|      9|  754|        7|     3| 28|     M|   ROLE003|REG002|    2009-2-2|     3|
    | 2015-4-3|     12| 1480|        6|     9| 46|     M|   ROLE002|REG005|   2000-10-5|     9|
    | 2015-7-6|      5| 1634|       10|     1| 56|     F|   ROLE003|REG005|   2008-10-4|     1|
    |2015-6-11|     25|  141|        1|     1| 56|     F|   ROLE003|REG005|   2008-10-4|     1|
    +---------+-------+-----+---------+------+---+------+----------+------+------------+------+
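    The result has 4 rows, one per 2015 order, placed by 3 distinct users (userID 1, 3, and 9). The inner join keeps only the orders whose userID also appears in users.
    """
    count2015=df2015.count()
    print("Number of orders in 2015: %d"%(count2015))
    # Number of orders in 2015: 4
    # Count all orders in 2013
    df2013=spark.sql("select * from orders where orderDate like '2013%'")
    count2013=df2013.count()
    # print("Total number of orders in 2013: %d"%(count2013))
    # Total number of orders in 2013: 3
    # The above shows two ways to query orders: DataFrame operations, and spark.sql queries against a temp view, just as with Hive
    # Show the orders of the user with userID=1
    df1=spark.sql("select o.orderID,o.productID,o.price,u.userID,u.age from orders o,users u where u.userID=1 and u.userID=o.userID")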
    """
    df1.show()
    +-------+---------+-----+------+---+
    |orderID|productID|price|userID|age|
    +-------+---------+-----+------+---+
    |      5|       10| 1634|     1| 56|
    |      7|        3| 1321|     1| 56|
    |     10|        2| 1308|     1| 56|
    |     25|        1|  141|     1| 56|
    +-------+---------+-----+------+---+
    """
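    # Maximum, minimum, and average order price for the user with userID=4.
    # price was loaded as a string, so it is cast to int here; without the cast, max and min compare the values as text.
    df4=spark.sql("select max(cast(o.price as int)) as maxPrice,min(cast(o.price as int)) as minPrice,avg(cast(o.price as int)) as avgPrice,u.userID from orders o,users u where u.userID=4 and o.userID=u.userID group by u.userID")
    """
    df4.show()
    +--------+--------+--------+------+
    |maxPrice|minPrice|avgPrice|userID|
    +--------+--------+--------+------+
    |    1848|     651|  1282.0|     4|
    +--------+--------+--------+------+
    """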
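    # The df1 and df4 examples show in more detail how spark.sql lets you work with temp views as you would with database tables; the same results can also be obtained with the pyspark.sql.DataFrame API.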
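One detail in the Case 2 listing deserves a closer look: the rows are built from split text, so every field, including price, is a string unless it is cast. Aggregates such as max and min then compare text, where "651" sorts above "1848". The sketch below reproduces the effect with SQLite from the Python standard library as a stand-in for Spark SQL, using the three orders of user 4 from the sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# All columns are TEXT, mirroring fields produced by split(",")
conn.execute("CREATE TABLE orders (orderID TEXT, price TEXT, userID TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("15", "1347", "4"), ("21", "651", "4"), ("24", "1848", "4")],
)

# Text comparison: '651' > '1848' because '6' > '1'
as_text = conn.execute(
    "SELECT max(price), min(price) FROM orders WHERE userID = '4'"
).fetchone()

# Numeric comparison after an explicit cast
as_int = conn.execute(
    "SELECT max(CAST(price AS INTEGER)), min(CAST(price AS INTEGER)), "
    "avg(CAST(price AS INTEGER)) FROM orders WHERE userID = '4'"
).fetchone()

print(as_text)  # ('651', '1347')
print(as_int)   # (1848, 651, 1282.0)
```

An alternative to casting inside each query is to convert the value once while building the rows, for example with int(line[3]) in the map step.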

c. Submit and run

spark-submit --master yarn --executor-memory 2g --driver-memory 2g --conf spark.yarn.am.memory=2g ./sparkSQLTest.py

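d. Monitor execution
After the program is submitted, its execution can be monitored in the Spark web console as described in Case 1, which also helps in understanding how a Spark SQL program runs.
e. Results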

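Summary

When developing Spark SQL programs, the following points are worth noting.

  • There are two ways to determine the schema in a Spark SQL program. The first is to infer it by reflection, as in Case 2 of the original article; in the Scala API this means defining a case class that corresponds to the columns of the data. The second is to specify it programmatically with the StructType and StructField APIs provided by Spark SQL, as in Case 1 of the original article, in which case no case class is needed. The PySpark listings above infer the schema from Row objects.
    In a Scala implementation, import sqlCtx.implicits._ is needed so that an RDD can be implicitly converted into a DataFrame.
  • The DataFrame API methods shown in this article are only a small part of what is available, but once the basic development workflow is understood, more complex programs can be written by consulting the DataFrame API documentation.
  • There are generally two ways to follow the execution of a Spark program. The first is to watch the log output in the console. The other, more intuitive way is to use the Spark Web Console to inspect the jobs produced by each part of the driver program and the stages each job contains.
  • A solid grasp of Spark SQL/DataFrame is essential for learning Spark's newer machine learning library, ML Pipeline, because ML Pipeline uses DataFrames as its data sets in order to support many data types.
  • In the author's tests, Spark SQL/DataFrame performed better than traditional RDD transformations on the same data set with similar operations. This is because Catalyst, the optimizer in the SQL module, optimizes user queries well. A later article will introduce this component in detail.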

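Closing remarks

This article used two cases to walk through processing structured data with Spark SQL/DataFrame. For reasons of space, it did not cover the execution flow of Spark SQL/DataFrame programs or the Catalyst SQL parsing and query optimization engine; these will be covered in later articles in this series. The test data provided here can also be used for more complex Spark SQL tests, and readers are encouraged to build on this article. Note that because the data was randomly generated by a program, some of it is inevitably unrealistic; the focus should be on the process of handling it with Spark SQL/DataFrame. Finally, thank you for reading. If you have any questions or ideas, please leave a comment at the end of the article so that we can discuss them in depth and learn from each other.
Related topics

  • See the official Spark SQL/DataFrame documentation to learn the basic principles and programming model of Spark SQL/DataFrame.
  • See the Spark Scala API documentation to learn how to use the Spark SQL/DataFrame APIs.
  • developerWorks open source technology topics: find extensive how-to information, tools, and project updates to help you master open source technologies and use them with IBM products.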

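Introduction to Spark 2.0: Using Time Windows in Spark SQL

Source: 2016-07-12 21:07:10

The Window API in Spark SQL

The window API in Spark SQL was introduced in version 1.4 to support smarter grouping. It is very useful for people with a SQL background, but in Spark 1.x it had one major drawback: windows could not be defined in terms of time. Time plays a very important role in fields such as finance and telecommunications, where understanding data along the time axis is essential.
The good news is that in Spark 2.0 the window API has built-in support for time windows. Time windows in Spark SQL are very similar to time windows in Spark Streaming. This article shows how to use time windows in Spark SQL.

Time series data

Before looking at how to use time windows, we first prepare some time series data. This article uses Apple's stock trading data from 1980 to 2016, shown below (the original post links to the complete data set):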

Date,Open,High,Low,Close,Volume,Adj Close
2016-7-11,96.75,97.650002,96.730003,96.980003,23298900,96.980003
2016-7-8,96.489998,96.889999,96.050003,96.68,28855800,96.68
2016-7-7,95.699997,96.5,95.620003,95.940002,24280900,95.940002
2016-7-6,94.599998,95.660004,94.370003,95.529999,30770700,95.529999
2016-7-5,95.389999,95.400002,94.459999,95.040001,27257000,95.040001
2016-7-1,95.489998,96.470001,95.330002,95.889999,25872300,95.889999
2016-6-30,94.440002,95.769997,94.300003,95.599998,35836400,95.599998
2016-6-29,93.970001,94.550003,93.629997,94.400002,36531000,94.400002
2016-6-28,92.900002,93.660004,92.139999,93.589996,40444900,93.589996
2016-6-27,93,93.050003,91.5,92.040001,45489600,92.040001
2016-6-24,92.910004,94.660004,92.650002,93.400002,75311400,93.400002
2016-6-23,95.940002,96.290001,95.25,96.099998,32240200,96.099998
2016-6-22,96.25,96.889999,95.349998,95.550003,28971100,95.550003
2016-6-21,94.940002,96.349998,94.68,95.910004,35229500,95.910004
2016-6-20,96,96.57,95.029999,95.099998,33942300,95.099998
2016-6-17,96.620003,96.650002,95.300003,95.330002,60595000,95.330002
2016-6-16,96.449997,97.75,96.07,97.550003,31236300,97.550003
2016-6-15,97.82,98.410004,97.029999,97.139999,29445200,97.139999
2016-6-14,97.32,98.480003,96.75,97.459999,31931900,97.459999
2016-6-13,98.690002,99.120003,97.099998,97.339996,38020500,97.339996
2016-6-10,98.529999,99.349998,98.480003,98.830002,31712900,98.830002
2016-6-9,98.5,99.989998,98.459999,99.650002,26601400,99.650002
2016-6-8,99.019997,99.559998,98.68,98.940002,20848100,98.940002
2016-6-7,99.25,99.870003,98.959999,99.029999,22409500,99.029999
2016-6-6,97.989998,101.889999,97.550003,98.629997,23292500,98.629997
2016-6-3,97.790001,98.269997,97.449997,97.919998,28062900,97.919998
2016-6-2,97.599998,97.839996,96.629997,97.720001,40004100,97.720001
2016-6-1,99.019997,99.540001,98.330002,98.459999,29113400,98.459999
2016-5-31,99.599998,100.400002,98.82,99.860001,42084800,99.860001

The stock data has seven columns, but here we only care about Date and Close, which hold the trading date and that day's closing price.

Importing the time series data into a DataFrame

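With the sample data in hand, we need to load it into a DataFrame for the calculations below. Every time window API needs a column of type timestamp. We can parse the Apple stock data above (CSV format) with the spark-csv package, which can infer time-typed columns and build the schema automatically. In the complete script at the end of this section the file is read with spark.read.csv without schema inference, so df.dtypes reports every column as string; the year and window functions still accept the Date column in that form.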
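As a minimal sketch of what "a column of type timestamp" means for this data, the following plain-Python snippet (no Spark involved) parses a few of the sample rows into datetime values. The `parse_rows` helper and the inline `SAMPLE_CSV` string are illustrative assumptions, not part of the original script.

```python
import csv
import io
from datetime import datetime

# Three rows copied from the sample data above
SAMPLE_CSV = """Date,Open,High,Low,Close,Volume,Adj Close
2016-7-11,96.75,97.650002,96.730003,96.980003,23298900,96.980003
2016-7-8,96.489998,96.889999,96.050003,96.68,28855800,96.68
2016-7-7,95.699997,96.5,95.620003,95.940002,24280900,95.940002
"""

def parse_rows(text):
    """Return (date, close) pairs with Date parsed to datetime and Close to float."""
    reader = csv.DictReader(io.StringIO(text))
    # strptime accepts month and day values without zero padding, e.g. "2016-7-8"
    return [(datetime.strptime(r["Date"], "%Y-%m-%d"), float(r["Close"]))
            for r in reader]

rows = parse_rows(SAMPLE_CSV)
for day, close in rows:
    print(day.isoformat(), close)
```

Each date is parsed to midnight of that day, which matters later when windows are aligned to boundaries that do not fall on midnight.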

Computing the weekly average closing price of Apple stock in 2016

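Now that the data is loaded, we can run some time-based window analysis. In this example we compute the weekly average closing price of Apple stock for 2016, step by step.
Step 1: Find the 2016 trading data
Since we only need the 2016 trading data, we filter the original data; the code uses the built-in year function to extract the year from the date.
Step 2: Compute the average
Now we need to create a window for each week. This type of window is usually called a tumbling window.
Step 3: Print the window values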
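The three steps above can be sketched in plain Python to show how rows are assigned to weekly tumbling windows. This is not the Spark API: `window_start` and `weekly_average` are hypothetical helpers, and the UTC+8 time zone is an assumption inferred from the 08:00 window boundaries in the Spark output. The sketch relies on windows being aligned to 1970-01-01 00:00:00 UTC, which was a Thursday.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)  # a Thursday
WEEK = timedelta(weeks=1)
LOCAL = timezone(timedelta(hours=8))  # assumption: session time zone is UTC+8
FMT = "%Y-%m-%d %H:%M:%S"

# (Date, Close) pairs copied from the first six rows of the sample data
SAMPLE = [
    ("2016-7-11", 96.980003), ("2016-7-8", 96.68), ("2016-7-7", 95.940002),
    ("2016-7-6", 95.529999), ("2016-7-5", 95.040001), ("2016-7-1", 95.889999),
]

def window_start(ts, duration=WEEK):
    """Start of the epoch-aligned tumbling window that contains ts."""
    return ts - (ts - EPOCH) % duration

def weekly_average(rows, year=2016):
    buckets = defaultdict(list)
    for day, close in rows:
        # A bare date becomes midnight in the local time zone
        ts = datetime.strptime(day, "%Y-%m-%d").replace(tzinfo=LOCAL)
        if ts.year == year:                          # step 1: keep one year
            buckets[window_start(ts)].append(close)  # step 2: bucket by week
    return {start: sum(v) / len(v) for start, v in buckets.items()}

averages = weekly_average(SAMPLE)
for start in sorted(averages):                       # step 3: print the windows
    end = start + WEEK
    print(start.strftime(FMT), end.strftime(FMT), averages[start])
```

For these six rows the sketch produces two windows, starting at 2016-06-30 08:00:00 and 2016-07-07 08:00:00. Because each date is midnight local time, 2016-7-7 falls just before the 08:00 boundary and is counted in the earlier window.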

Time windows with a start time

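The previous example used a tumbling window. To specify a start time we need to use a sliding window. So far there is no dedicated API for creating a tumbling window with a start time, but we can get one by setting the window duration and the slide duration to the same value. The complete script for this section is as follows: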

#-*-coding:utf-8-*-

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import year
from pyspark.sql.functions import window

if __name__=="__main__":
    sc=SparkContext(appName="myApp")
    spark=SparkSession.builder.enableHiveSupport().getOrCreate()
    df=spark.read.csv(path="/tmp/test/iteblog_apple.csv",header=True,dateFormat="yyyy/MM/dd")
    """
    df.show()
    +---------+---------+---------+---------+---------+--------+---------+
    |     Date|     Open|     High|      Low|    Close|  Volume|Adj Close|
    +---------+---------+---------+---------+---------+--------+---------+
    |2016-7-11|    96.75|97.650002|96.730003|96.980003|23298900|96.980003|
    | 2016-7-8|96.489998|96.889999|96.050003|    96.68|28855800|    96.68|
    | 2016-7-7|95.699997|     96.5|95.620003|95.940002|24280900|95.940002|
    | 2016-7-6|94.599998|95.660004|94.370003|95.529999|30770700|95.529999|
    | 2016-7-5|95.389999|95.400002|94.459999|95.040001|27257000|95.040001|
    | 2016-7-1|95.489998|96.470001|95.330002|95.889999|25872300|95.889999|
    |2016-6-30|94.440002|95.769997|94.300003|95.599998|35836400|95.599998|
    |2016-6-29|93.970001|94.550003|93.629997|94.400002|36531000|94.400002|
    |2016-6-28|92.900002|93.660004|92.139999|93.589996|40444900|93.589996|
    |2016-6-27|       93|93.050003|     91.5|92.040001|45489600|92.040001|
    |2016-6-24|92.910004|94.660004|92.650002|93.400002|75311400|93.400002|
    |2016-6-23|95.940002|96.290001|    95.25|96.099998|32240200|96.099998|
    |2016-6-22|    96.25|96.889999|95.349998|95.550003|28971100|95.550003|
    |2016-6-21|94.940002|96.349998|    94.68|95.910004|35229500|95.910004|
    |2016-6-20|       96|    96.57|95.029999|95.099998|33942300|95.099998|
    |2016-6-17|96.620003|96.650002|95.300003|95.330002|60595000|95.330002|
    |2016-6-16|96.449997|    97.75|    96.07|97.550003|31236300|97.550003|
    |2016-6-15|    97.82|98.410004|97.029999|97.139999|29445200|97.139999|
    |2016-6-14|    97.32|98.480003|    96.75|97.459999|31931900|97.459999|
    |2016-6-13|98.690002|99.120003|97.099998|97.339996|38020500|97.339996|
    +---------+---------+---------+---------+---------+--------+---------+
    only showing top 20 rows
    print(df.dtypes)
    [('Date', 'string'), ('Open', 'string'), ('High', 'string'), ('Low', 'string'), ('Close', 'string'),
     ('Volume', 'string'), ('Adj Close', 'string')]
    
    print(df.count())
    8971
    """
    # Step 1: keep only the 2016 trading data
    stock2016=df.filter(year(df.Date) == 2016)
    """
    stock2016.show()
    +---------+---------+---------+---------+---------+--------+---------+
    |     Date|     Open|     High|      Low|    Close|  Volume|Adj Close|
    +---------+---------+---------+---------+---------+--------+---------+
    |2016-7-11|    96.75|97.650002|96.730003|96.980003|23298900|96.980003|
    | 2016-7-8|96.489998|96.889999|96.050003|    96.68|28855800|    96.68|
    | 2016-7-7|95.699997|     96.5|95.620003|95.940002|24280900|95.940002|
    | 2016-7-6|94.599998|95.660004|94.370003|95.529999|30770700|95.529999|
    | 2016-7-5|95.389999|95.400002|94.459999|95.040001|27257000|95.040001|
    | 2016-7-1|95.489998|96.470001|95.330002|95.889999|25872300|95.889999|
    |2016-6-30|94.440002|95.769997|94.300003|95.599998|35836400|95.599998|
    |2016-6-29|93.970001|94.550003|93.629997|94.400002|36531000|94.400002|
    |2016-6-28|92.900002|93.660004|92.139999|93.589996|40444900|93.589996|
    |2016-6-27|       93|93.050003|     91.5|92.040001|45489600|92.040001|
    |2016-6-24|92.910004|94.660004|92.650002|93.400002|75311400|93.400002|
    |2016-6-23|95.940002|96.290001|    95.25|96.099998|32240200|96.099998|
    |2016-6-22|    96.25|96.889999|95.349998|95.550003|28971100|95.550003|
    |2016-6-21|94.940002|96.349998|    94.68|95.910004|35229500|95.910004|
    |2016-6-20|       96|    96.57|95.029999|95.099998|33942300|95.099998|
    |2016-6-17|96.620003|96.650002|95.300003|95.330002|60595000|95.330002|
    |2016-6-16|96.449997|    97.75|    96.07|97.550003|31236300|97.550003|
    |2016-6-15|    97.82|98.410004|97.029999|97.139999|29445200|97.139999|
    |2016-6-14|    97.32|98.480003|    96.75|97.459999|31931900|97.459999|
    |2016-6-13|98.690002|99.120003|97.099998|97.339996|38020500|97.339996|
    +---------+---------+---------+---------+---------+--------+---------+
    only showing top 20 rows
    print(stock2016.count())
    131
    """
    # Step 2: compute the average closing price per one-week tumbling window
    tumblingWindowDS=stock2016.groupby(window('Date','1 week')).agg({
     "Close":"avg"})
    """
    tumblingWindowDS.sort(tumblingWindowDS.window.desc()).show()
    +--------------------+------------------+                                       
    |              window|        avg(Close)|
    +--------------------+------------------+
    |[2016-07-07 08:00...| 96.83000150000001|
    |[2016-06-30 08:00...| 95.60000025000001|
    |[2016-06-23 08:00...|        93.8059998|
    |[2016-06-16 08:00...| 95.59800099999998|
    |[2016-06-09 08:00...| 97.66399979999998|
    |[2016-06-02 08:00...|        98.8339996|
    |[2016-05-26 08:00...|       99.09749975|
    |[2016-05-19 08:00...|         97.916002|
    |[2016-05-12 08:00...|        93.3299974|
    |[2016-05-05 08:00...| 92.35599959999999|
    |[2016-04-28 08:00...|        93.9979994|
    |[2016-04-21 08:00...|       101.5520004|
    |[2016-04-14 08:00...|107.46800060000001|
    |[2016-04-07 08:00...|       110.4520004|
    |[2016-03-31 08:00...|110.08399979999999|
    |[2016-03-24 08:00...|       107.8549995|
    |[2016-03-17 08:00...|       106.0699996|
    |[2016-03-10 08:00...|        104.226001|
    |[2016-03-03 08:00...|101.64000100000001|
    |[2016-02-25 08:00...|         99.276001|
    +--------------------+------------------+
    only showing top 20 rows
    
    tumblingWindowDS.select(tumblingWindowDS.window.start.cast("string").alias("start"),
        tumblingWindowDS.window.end.cast("string").alias("end"), "avg(Close)").show()
    
    +-------------------+-------------------+------------------+
    |              start|                end|        avg(Close)|
    +-------------------+-------------------+------------------+
    |2016-01-07 08:00:00|2016-01-14 08:00:00| 98.47199859999999|
    |2016-01-21 08:00:00|2016-01-28 08:00:00|        97.6719984|
    |2016-06-23 08:00:00|2016-06-30 08:00:00|        93.8059998|
    |2016-03-31 08:00:00|2016-04-07 08:00:00|110.08399979999999|
    |2016-01-14 08:00:00|2016-01-21 08:00:00| 96.72000125000001|
    |2016-04-21 08:00:00|2016-04-28 08:00:00|       101.5520004|
    |2016-04-07 08:00:00|2016-04-14 08:00:00|       110.4520004|
    |2016-02-04 08:00:00|2016-02-11 08:00:00| 94.39799819999999|
    |2016-06-02 08:00:00|2016-06-09 08:00:00|        98.8339996|
    |2016-04-14 08:00:00|2016-04-21 08:00:00|107.46800060000001|
    |2016-03-03 08:00:00|2016-03-10 08:00:00|101.64000100000001|
    |2016-07-07 08:00:00|2016-07-14 08:00:00| 96.83000150000001|
    |2016-05-05 08:00:00|2016-05-12 08:00:00| 92.35599959999999|
    |2016-04-28 08:00:00|2016-05-05 08:00:00|        93.9979994|
    |2016-06-30 08:00:00|2016-07-07 08:00:00| 95.60000025000001|
    |2016-05-26 08:00:00|2016-06-02 08:00:00|       99.09749975|
    |2016-05-12 08:00:00|2016-05-19 08:00:00|        93.3299974|
    |2016-05-19 08:00:00|2016-05-26 08:00:00|         97.916002|
    |2015-12-31 08:00:00|2016-01-07 08:00:00|101.30249774999999|
    |2016-02-11 08:00:00|2016-02-18 08:00:00|        96.2525005|
    +-------------------+-------------------+------------------+
    only showing top 20 rows
    """
    # Step 3: print the window values
    tumblingWindowDSTR=tumblingWindowDS.select(tumblingWindowDS.window.start.cast("string").alias("start"),
        tumblingWindowDS.window.end.cast("string").alias("end"), "avg(Close)")
    """
    tumblingWindowDSTR.sort(tumblingWindowDSTR.start).show()
    +-------------------+-------------------+------------------+                    
    |              start|                end|        avg(Close)|
    +-------------------+-------------------+------------------+
    |2015-12-31 08:00:00|2016-01-07 08:00:00|101.30249774999999|
    |2016-01-07 08:00:00|2016-01-14 08:00:00| 98.47199859999999|
    |2016-01-14 08:00:00|2016-01-21 08:00:00| 96.72000125000001|
    |2016-01-21 08:00:00|2016-01-28 08:00:00|        97.6719984|
    |2016-01-28 08:00:00|2016-02-04 08:00:00|         96.239999|
    |2016-02-04 08:00:00|2016-02-11 08:00:00| 94.39799819999999|
    |2016-02-11 08:00:00|2016-02-18 08:00:00|        96.2525005|
    |2016-02-18 08:00:00|2016-02-25 08:00:00| 96.09400000000001|
    |2016-02-25 08:00:00|2016-03-03 08:00:00|         99.276001|
    |2016-03-03 08:00:00|2016-03-10 08:00:00|101.64000100000001|
    |2016-03-10 08:00:00|2016-03-17 08:00:00|        104.226001|
    |2016-03-17 08:00:00|2016-03-24 08:00:00|       106.0699996|
    |2016-03-24 08:00:00|2016-03-31 08:00:00|       107.8549995|
    |2016-03-31 08:00:00|2016-04-07 08:00:00|110.08399979999999|
    |2016-04-07 08:00:00|2016-04-14 08:00:00|       110.4520004|
    |2016-04-14 08:00:00|2016-04-21 08:00:00|107.46800060000001|
    |2016-04-21 08:00:00|2016-04-28 08:00:00|       101.5520004|
    |2016-04-28 08:00:00|2016-05-05 08:00:00|        93.9979994|
    |2016-05-05 08:00:00|2016-05-12 08:00:00| 92.35599959999999|
    |2016-05-12 08:00:00|2016-05-19 08:00:00|        93.3299974|
    +-------------------+-------------------+------------------+
    only showing top 20 rows
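    # The output above is sorted by window.start, the field that marks the start of each window.
    # You may have noticed that the first row starts at 2015-12-31 and ends at 2016-01-07,
    # while the raw data shows that Apple's 2016 trading data begins on 2016-01-04:
    # 2016-01-01 was New Year's Day, and 2016-01-02 and 2016-01-03 fell on a weekend, so there was no trading.
    # By default windows are aligned to 1970-01-01 00:00:00 UTC, a Thursday, so each window starts on a Thursday.
    #
    # We can specify the window start time manually to solve this.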
    """

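    # Time window with a start time
    # startTime here is an offset relative to 1970-01-01 00:00:00 UTC, but the actual start time of the
    # data is not known in advance, so this gives poor control over where windows begin; it is unclear why it was designed this way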
    tumblingWindowDS2=stock2016.groupby(window(timeColumn='Date',windowDuration='1 week',
        slideDuration='1 week',startTime='4 days')).agg({
     'Close':'avg'})
    tumblingWindowDSTR2 = tumblingWindowDS2.select(tumblingWindowDS2.window.start.cast("string").alias("start"),
        tumblingWindowDS2.window.end.cast("string").alias("end"), "avg(Close)").sort('start')
    """
    tumblingWindowDSTR2.show()
    +-------------------+-------------------+------------------+                    
    |              start|                end|        avg(Close)|
    +-------------------+-------------------+------------------+
    |2015-12-28 08:00:00|2016-01-04 08:00:00|        105.349998|
    |2016-01-04 08:00:00|2016-01-11 08:00:00|        99.0699982|
    |2016-01-11 08:00:00|2016-01-18 08:00:00| 98.49999799999999|
    |2016-01-18 08:00:00|2016-01-25 08:00:00|        98.1220016|
    |2016-01-25 08:00:00|2016-02-01 08:00:00|        96.2539976|
    |2016-02-01 08:00:00|2016-02-08 08:00:00| 95.29199960000001|
    |2016-02-08 08:00:00|2016-02-15 08:00:00|        94.2374975|
    |2016-02-15 08:00:00|2016-02-22 08:00:00|        96.7880004|
    |2016-02-22 08:00:00|2016-02-29 08:00:00| 96.23000160000001|
    |2016-02-29 08:00:00|2016-03-07 08:00:00|101.53200079999999|
    |2016-03-07 08:00:00|2016-03-14 08:00:00|       101.6199998|
    |2016-03-14 08:00:00|2016-03-21 08:00:00|105.63600160000001|
    |2016-03-21 08:00:00|2016-03-28 08:00:00|105.92749950000001|
    |2016-03-28 08:00:00|2016-04-04 08:00:00|109.46799940000001|
    |2016-04-04 08:00:00|2016-04-11 08:00:00|109.39799980000001|
    |2016-04-11 08:00:00|2016-04-18 08:00:00|       110.3820004|
    |2016-04-18 08:00:00|2016-04-25 08:00:00|106.15400079999999|
    |2016-04-25 08:00:00|2016-05-02 08:00:00|        96.8759994|
    |2016-05-02 08:00:00|2016-05-09 08:00:00|        93.6240004|
    |2016-05-09 08:00:00|2016-05-16 08:00:00| 92.13399799999999|
    +-------------------+-------------------+------------------+
    only showing top 20 rows
    """
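    # The result above now has a window starting on 2016-01-04, but it still contains a row
    # that starts in 2015. With a start time of 4 days, window boundaries fall at 08:00 on Mondays,
    # so the week before 2016-01-04 08:00 is shown as well. We can use filter to drop that row.
    # Note that this row is not empty: since stock2016 holds only 2016 dates, it can only contain the
    # 2016-01-04 trading day, whose timestamp (midnight) precedes the 08:00 boundary, so the filter also drops that day's price.
    # Further restrict the time range
    tumblingWindowDSTRfilter=tumblingWindowDSTR2.filter(year(tumblingWindowDSTR2.start) == 2016)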
    """
    tumblingWindowDSTRfilter.show()
    +-------------------+-------------------+------------------+                    
    |              start|                end|        avg(Close)|
    +-------------------+-------------------+------------------+
    |2016-01-04 08:00:00|2016-01-11 08:00:00|        99.0699982|
    |2016-01-11 08:00:00|2016-01-18 08:00:00| 98.49999799999999|
    |2016-01-18 08:00:00|2016-01-25 08:00:00|        98.1220016|
    |2016-01-25 08:00:00|2016-02-01 08:00:00|        96.2539976|
    |2016-02-01 08:00:00|2016-02-08 08:00:00| 95.29199960000001|
    |2016-02-08 08:00:00|2016-02-15 08:00:00|        94.2374975|
    |2016-02-15 08:00:00|2016-02-22 08:00:00|        96.7880004|
    |2016-02-22 08:00:00|2016-02-29 08:00:00| 96.23000160000001|
    |2016-02-29 08:00:00|2016-03-07 08:00:00|101.53200079999999|
    |2016-03-07 08:00:00|2016-03-14 08:00:00|       101.6199998|
    |2016-03-14 08:00:00|2016-03-21 08:00:00|105.63600160000001|
    |2016-03-21 08:00:00|2016-03-28 08:00:00|105.92749950000001|
    |2016-03-28 08:00:00|2016-04-04 08:00:00|109.46799940000001|
    |2016-04-04 08:00:00|2016-04-11 08:00:00|109.39799980000001|
    |2016-04-11 08:00:00|2016-04-18 08:00:00|       110.3820004|
    |2016-04-18 08:00:00|2016-04-25 08:00:00|106.15400079999999|
    |2016-04-25 08:00:00|2016-05-02 08:00:00|        96.8759994|
    |2016-05-02 08:00:00|2016-05-09 08:00:00|        93.6240004|
    |2016-05-09 08:00:00|2016-05-16 08:00:00| 92.13399799999999|
    |2016-05-16 08:00:00|2016-05-23 08:00:00| 94.77999880000002|
    +-------------------+-------------------+------------------+
    only showing top 20 rows
    """
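To make the effect of startTime in the script above easier to see, here is a small plain-Python sketch of how a start offset shifts tumbling window boundaries. It does not call Spark: `window_start` is a hypothetical helper, and the UTC+8 time zone is an assumption inferred from the 08:00 boundaries in the output. Boundaries sit at the epoch plus the offset plus a whole number of window durations, so an offset of 4 days moves them from Thursday to Monday.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)  # a Thursday
LOCAL = timezone(timedelta(hours=8))  # assumption: session time zone is UTC+8
FMT = "%Y-%m-%d %H:%M:%S"

def window_start(ts, duration, start_time=timedelta(0)):
    """Start of the tumbling window containing ts.

    Window boundaries sit at EPOCH + start_time + k * duration for integer k.
    """
    return ts - (ts - EPOCH - start_time) % duration

week = timedelta(weeks=1)
jan4 = datetime(2016, 1, 4, tzinfo=LOCAL)  # first trading day of 2016, local midnight
jan5 = datetime(2016, 1, 5, tzinfo=LOCAL)

default_start = window_start(jan5, week)                      # Thursday-aligned
shifted_start = window_start(jan5, week, timedelta(days=4))   # Monday-aligned
jan4_start = window_start(jan4, week, timedelta(days=4))

print("default :", default_start.strftime(FMT))
print("4 days  :", shifted_start.strftime(FMT))
print("jan 4   :", jan4_start.strftime(FMT))
```

Under these assumptions, 2016-01-04 at local midnight lands in the window starting 2015-12-28 08:00:00, which is consistent with the first row of the startTime output above containing a single day's closing price.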
