You can load data directly into a DataFrame, and begin querying it relatively quickly. Otherwise you’ll need to load data into an RDD and transform it first.
# loading data into an RDD in Spark 2.0
sc = spark.sparkContext
oneSysLog = sc.textFile("file:/var/log/system.log")
allSysLogs = sc.textFile("file:/var/log/system.log*")
allLogs = sc.textFile("file:/var/log/*.log")
# let's count the lines in each RDD
>>> oneSysLog.count()
8339
>>> allSysLogs.count()
47916
>>> allLogs.count()
546254
That’s great, but you can’t query this yet. You’ll need to convert the data to Rows, add a schema, and convert it to a DataFrame.
Once the data is at least a DataFrame with a schema, you can talk SQL to it.
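For reference, here is a minimal sketch of that Row conversion, assuming each raw log line simply becomes a one-field Row named log (this is one way the logsRDD used below could be built):
# hedged sketch: wrap each raw log line in a single-field Row
from pyspark.sql import Row
logsRDD = oneSysLog.map(lambda line: Row(log=line))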
# write some SQL
logsDF = spark.createDataFrame(logsRDD)
logsDF.createOrReplaceTempView("logs")
>>> spark.sql("SELECT * FROM logs LIMIT 1").show()
+--------------------+
|                 log|
+--------------------+
|Jan 6 16:37:01 (...|
+--------------------+
But, you can also load certain types of data and store it directly as a DataFrame. This allows you to get to SQL quickly. Both JSON and Parquet formats can be loaded as a DataFrame straightaway because they contain enough schema information to do so.
# load parquet straight into DF, and write some SQL
logsDF = spark.read.parquet("file:/logs.parquet")
logsDF.createOrReplaceTempView("logs")
>>> spark.sql("SELECT * FROM logs LIMIT 1").show()
+--------------------+
|                 log|
+--------------------+
|Jan 6 16:37:01 (...|
+--------------------+
In fact, Spark SQL now even has support for querying Parquet files directly. Easy peasy!
# query the parquet file directly with SQL
>>> spark.sql("""
...   SELECT * FROM parquet.`path/to/logs.parquet` LIMIT 1
... """).show()
+--------------------+
|                 log|
+--------------------+
|Jan 6 16:37:01 (...|
+--------------------+
That’s one aspect of loading data. The other aspect is using the protocols for cloud storage (i.e. s3://). In some cloud ecosystems, support for their storage protocol comes installed already.
# i.e. on AWS EMR, s3:// is installed already.
sc = spark.sparkContext
decemberLogs = sc.textFile("s3://acme-co/logs/2016/12/")
# count the lines in all of the december 2016 logs in S3
>>> decemberLogs.count()
910125081250
# wow, such logs. Ur poplar.
Sometimes you actually need to provide support for those protocols if your VM’s OS doesn’t have it already.
my-linux-shell$ pyspark \
  --packages com.amazonaws:aws-java-sdk-pom:1.10.34,com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1 \
  demo2.py
>>> rdd = sc.textFile("s3a://acme-co/path/to/files")
>>> rdd.count()
# note: "s3a" and not "s3" -- "s3" is specific to AWS EMR.
Now you should have several ways to load data to quickly start writing SQL with Apache Spark.
Comparing SQL and the DataFrame API: the differences between them
What is a DataFrame? You can think of DataFrames as RDDs with a schema. Note: “A DataFrame is just a type alias for Dataset of Row” — Databricks.
Why a DataFrame over an RDD? Catalyst optimization and schemas.
What kind of data can DataFrames handle? Text, JSON, XML, Parquet, and more.
What can I do with a DataFrame? Use SQL-like functions and actual SQL. You can also apply schemas to your data and benefit from the performance enhancements of the Catalyst optimizer.
Still Catalyst-optimized: both SQL and the API functions on DataFrames sit atop Catalyst.
DataFrame functions provide a bridge between the two flavors of the Spark APIs.
SQL with DataFrames gives you a familiar way to interact with the data.
SQL-like functions in the DataFrame API: for many of the expected features of SQL, there are similar functions in the DataFrame API that do practically the same thing, allowing for .functional().chaining() — see the sketch below.
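For example, the same filter-and-aggregate can be written either way (a sketch only; the logs view and log column come from the earlier examples, and the error filter is made up):
# DataFrame API, chained
from pyspark.sql import functions as F
logsDF.where(F.col("log").like("%error%")) \
      .groupBy(F.substring("log", 1, 6).alias("day")) \
      .count() \
      .orderBy(F.desc("count")) \
      .show()
# the equivalent SQL
spark.sql("""
  SELECT substring(log, 1, 6) AS day, COUNT(*) AS count
  FROM logs
  WHERE log LIKE '%error%'
  GROUP BY substring(log, 1, 6)
  ORDER BY count DESC
""").show()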
Schemas: implicit and explicit schemas explained, and data types
Schemas can be inferred, i.e. guessed, by spark. With inferred schemas, you usually end up with a bunch of strings and ints. If you have more specific needs, supply your own schema.
# load as RDD and map it to a row with multiple fields
from pyspark.sql import Row

rdd = sc.textFile("file:/people.txt")

def mapper(line):
    s = line.split("|")
    return Row(id=s[0], name=s[1], company=s[2], state=s[4])

peopleRDD = rdd.map(mapper)
peopleDF = spark.createDataFrame(peopleRDD)
# full syntax: .createDataFrame(peopleRDD, schema)

# we didn't actually pass anything into that 2nd param.
# yet, behind the scenes, there's still a schema.
>>> peopleDF.printSchema()
root
 |-- company: string (nullable = true)
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- state: string (nullable = true)
Spark SQL can certainly handle queries where id is a string, but what if we don’t want it to be? Say it should be an int.
# load as RDD and map it to a row with multiple fields
rdd = sc.textFile("file:/people.txt")

def mapper(line):
    s = line.split("|")
    return Row(id=int(s[0]), name=s[1], company=s[2], state=s[4])

peopleRDD = rdd.map(mapper)
peopleDF = spark.createDataFrame(peopleRDD)
>>> peopleDF.printSchema()
root
 |-- company: string (nullable = true)
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- state: string (nullable = true)
You can actually provide a schema, too, which will be more authoritative.
# load as RDD and map it to a row with multiple fields
import pyspark.sql.types as types

rdd = sc.textFile("file:/people.txt")

def mapper(line):
    s = line.split("|")
    # use a positional Row (and cast id) so the values line up with the explicit schema below
    return Row(int(s[0]), s[1], s[2], s[4])

schema = types.StructType([
    types.StructField('id', types.IntegerType(), False),
    types.StructField('name', types.StringType()),
    types.StructField('company', types.StringType()),
    types.StructField('state', types.StringType())
])

peopleRDD = rdd.map(mapper)
peopleDF = spark.createDataFrame(peopleRDD, schema)
Gotcha alert: Spark doesn’t seem to care when you leave dates as strings.

# Spark SQL handles this just fine as if they were
# legit date objects.
spark.sql("""
  SELECT * FROM NewHires n WHERE n.start_date > "2016-01-01"
""").show()
Now you know about inferred and explicit schemas, and the available types you can use.
Loading data and saving results
Loading and saving is fairly straightforward. Save your DataFrames in your desired format.
# picking up where we left off
peopleDF = spark.createDataFrame(peopleRDD, schema)
peopleDF.write.save("s3://acme-co/people.parquet", format="parquet")
# format= defaults to parquet if omitted
# formats: json, parquet, jdbc, orc, libsvm, csv, text
When you read, some types preserve schema. Parquet keeps the full schema, JSON has inferrable schema, and JDBC pulls in schema.
# read from stored parquet
peopleDF = spark.read.parquet("s3://acme-co/people.parquet")

# read from stored JSON
peopleDF = spark.read.json("s3://acme-co/people.json")
Spark 1.6 and earlier: limited support for subqueries and various other noticeable SQL functionality; runs roughly half of the 99 TPC-DS benchmark queries; more SQL support was available through HiveContext.

Spark 2.0, in Databricks’ words:
SQL2003 support
Runs all 99 of TPC-DS benchmark queries
A native SQL parser that supports both ANSI-SQL as well as Hive QL
Native DDL command implementations
Subquery support, including
Uncorrelated Scalar Subqueries
Correlated Scalar Subqueries
NOT IN predicate Subqueries (in WHERE/HAVING clauses)
IN predicate subqueries (in WHERE/HAVING clauses)
(NOT) EXISTS predicate subqueries (in WHERE/HAVING clauses)
View canonicalization support
In addition, when building without Hive support, Spark SQL should have almost all the functionality as when building with Hive support, with the exception of Hive connectivity, Hive UDFs, and script transforms.
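To make the subquery support concrete, a query like the following now runs in Spark 2.0 (a sketch over hypothetical employees and departments temp views):
# hypothetical tables, for illustration only
spark.sql("""
  SELECT e.name
  FROM employees e
  WHERE EXISTS (SELECT 1
                FROM departments d
                WHERE d.id = e.dept_id AND d.region = 'EMEA')
""").show()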
ETL with SQL
These are some of the practices we learned to adopt after working with Spark for a while.
Tip 1: In production, break your applications into smaller apps as steps. I.e. “Pipeline pattern”
Tip 2: When tinkering locally, save a small version of the dataset via Spark and test against that (see the sketch after these tips).
Tip 3: If using EMR, create a cluster with your desired steps, prove it works, then export a CLI command to reproduce it, and run it in Data Pipeline to start recurring pipelines / jobs.
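As a sketch of Tip 2 (the fullDF name and output path are made up): cut a small sample once, write it locally, and develop against that.
smallDF = fullDF.sample(withReplacement=False, fraction=0.01).limit(10000)
smallDF.write.mode("overwrite").parquet("file:/tmp/sample.parquet")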
Working with JSON data
JSON data is most easily read in as line-delimited JSON objects*:

{"n": "sarah", "age": 29}
{"n": "steve", "age": 45}
Schema is inferred upon load. Unlike other lazy operations, this will cause some work to be done. Access arrays with inline array syntax
SELECT col[1], col[3] FROM json
If you want to flatten your JSON data, use the explode method (works in both the DF API and SQL).

# json explode example
>>> spark.read.json("file:/json.explode.json").createOrReplaceTempView("json")
>>> spark.sql("SELECT * FROM json").show()
+----+------------+
|   x|           y|
+----+------------+
|row1| [1,2,3,4,5]|
|row2|[6,7,8,9,10]|
+----+------------+
>>> spark.sql("SELECT x, explode(y) FROM json").show()
+----+---+
|   x|col|
+----+---+
|row1|  1|
|row1|  2|
|row1|  3|
|row1|  4|
|row1|  5|
|row2|  6|
|row2|  7|
|row2|  8|
|row2|  9|
|row2| 10|
+----+---+
Access nested objects with dot syntax:

SELECT field.subfield FROM json

For multi-line JSON files, you’ve got to do much more:
import re

# a list of data from files; each tuple is (path, jsonData)
files = sc.wholeTextFiles("data.json")
rawJSON = files.map(lambda x: x[1])

# sanitize the data
cleanJSON = rawJSON.map(
    lambda x: re.sub(r"\s+", "", x, flags=re.UNICODE)
)

# finally, you can then read that in as "JSON"
spark.read.json(cleanJSON)

# PS -- the same goes for XML.
Reading from and writing to external databases
To read from an external database, you’ve got to have your JDBC connector (jar) handy. In order to pass a jar package into Spark, you’d use the --jars flag when starting pyspark or spark-submit.
# launching pyspark with a JDBC connector jar
my-linux-shell$ pyspark \
  --jars /path/to/mysql-jdbc.jar
# (alternatively, use --packages with the connector's Maven coordinates)

# note: you can also add the path to your jar in the spark defaults config
# file under these settings:
spark.driver.extraClassPath
spark.executor.extraClassPath
Once you’ve got your connector jars successfully imported, now you can read an existing database into your spark application or spark shell as a dataframe.
# line broken for readability
sqlURL = ("jdbc:mysql://<db-host>:<port>"
          "?user=<user>&password=<pass>&rewriteBatchedStatements=true"
          "&continueBatchOnError=true")
df = spark.read.jdbc(url=sqlURL, table="<table-name>")
df.createOrReplaceTempView("myDB")
spark.sql("SELECT * FROM myDB").show()
If you’ve done some work and created or manipulated a DataFrame, you can write it to a database using the df.write.jdbc method. Be prepared: it can take a while.
Also, be warned, save modes in Spark can be a bit destructive. “Overwrite” doesn’t just overwrite your data, it overwrites your schema too. Say goodbye to those precious indices.
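Here is a minimal sketch of the write side, reusing the sqlURL from the read example (the table name, mode, and driver property are assumptions):
# "append" sidesteps the schema-dropping behavior of "overwrite"
df.write.jdbc(url=sqlURL, table="processed_logs", mode="append",
              properties={"driver": "com.mysql.jdbc.Driver"})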
Testing your SQL in a real environment
If testing locally, do not load data from S3 or other similar types of cloud storage. Construct your applications as much as you can in advance; cloud clusters are expensive. In the cloud, you can test a lot of your code reliably with a 1-node cluster. Get really comfortable using .parallelize() to create dummy data. If you’re using big data and many nodes, don’t use .collect() unless you intend to pull everything back onto the driver.
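A quick sketch of the .parallelize() idea (the schema here is made up):
from pyspark.sql import Row
dummyRDD = sc.parallelize([Row(id=i, name="user_%d" % i, state="CA") for i in range(1000)])
dummyDF = spark.createDataFrame(dummyRDD)
dummyDF.createOrReplaceTempView("dummy")
spark.sql("SELECT state, COUNT(*) FROM dummy GROUP BY state").show()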
How to use the pyspark.sql interfaces
(See the official pyspark.sql API documentation.)
class pyspark.sql.SparkSession(sparkContext, jsparkSession=None)
SparkSession.createDataFrame(data, schema=None) creates a DataFrame from an RDD, a list, or a pandas.DataFrame. The data parameter is the input; if it is a list, each element becomes one row. schema can be a pyspark.sql.types.DataType, a datatype string, or a list of column names; the default is None.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
import pandas

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

a = [('Alice', 1)]                 # from a list; each tuple becomes one row
a_dataframe = spark.createDataFrame(a)
print(a_dataframe.collect())
a_dataframe = spark.createDataFrame(a, ['name', 'age'])   # specify the column names
print(a_dataframe.collect())
d = [{'name': 'Alice', 'age': 1}]  # from a list of dicts, so the keys become column names
d_dataframe = spark.createDataFrame(d)
print(d_dataframe.collect())
rdd = sc.parallelize(a)            # from an RDD
df = spark.createDataFrame(rdd)
print(df.collect())
# specify each column's data type with the pyspark.sql types
schema = StructType([StructField('name', StringType(), True),
                     StructField('age', IntegerType(), True)])
df3 = spark.createDataFrame(rdd, schema)
print(df3.collect())
# or give names and types directly as a DDL-style string
print(spark.createDataFrame(rdd, 'name: string, age: int').collect())
# from a pandas.DataFrame; .toPandas() converts a Spark DataFrame to pandas
print(spark.createDataFrame(df.toPandas()).collect())
print(spark.createDataFrame(pandas.DataFrame([['age', 1]])).collect())
The results look like:
[Row(_1='Alice', _2=1)]
[Row(name='Alice', age=1)]
Creating a DataFrame from a JSON file
# spark is an existing SparkSession
df = spark.read.json("/home/qjzh/miniconda/envs/water_meter2/projects/people.json")
# Displays the content of the DataFrame to stdout
df.show()
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Michael|
# |  30|   Andy|
# |  19| Justin|
# +----+-------+
The pyspark.sql.functions module documents its helpers roughly as follows.

Column and sort helpers:
- lit: creates a Column of literal value.
- col / column: returns a Column based on the given column name.
- asc / desc: returns a sort expression based on the ascending / descending order of the given column name.
- upper / lower: converts a string expression to upper / lower case.
- sqrt: computes the square root of the specified float value.
- abs: computes the absolute value.

Aggregate functions:
- max / min: returns the maximum / minimum value of the expression in a group.
- first / last: returns the first / last value in a group.
- count: returns the number of items in a group.
- sum: returns the sum of all values in the expression.
- avg / mean: returns the average of the values in a group.
- sumDistinct: returns the sum of distinct values in the expression.
- stddev / stddev_samp: returns the unbiased sample standard deviation of the expression in a group.
- stddev_pop: returns the population standard deviation of the expression in a group.
- variance / var_pop: returns the population variance of the values in a group.
- var_samp: returns the unbiased variance of the values in a group.
- skewness: returns the skewness of the values in a group.
- kurtosis: returns the kurtosis of the values in a group.
- collect_list: returns a list of objects with duplicates.
- collect_set: returns a set of objects with duplicate elements eliminated.

Math functions:
- acos: computes the cosine inverse of the given value; the returned angle is in the range 0.0 through pi.
- asin: computes the sine inverse of the given value; the returned angle is in the range -pi/2 through pi/2.
- atan: computes the tangent inverse of the given value.
- cbrt: computes the cube-root of the given value.
- ceil / floor: computes the ceiling / floor of the given value.
- cos / cosh: computes the cosine / hyperbolic cosine of the given value.
- sin / sinh: computes the sine / hyperbolic sine of the given value.
- tan / tanh: computes the tangent / hyperbolic tangent of the given value.
- exp / expm1: computes the exponential of the given value / the exponential minus one.
- log / log10 / log1p: computes the natural logarithm, the base-10 logarithm, and the natural logarithm of the value plus one.
- rint: returns the double value that is closest in value to the argument and is equal to a mathematical integer.
- signum: computes the signum of the given value.
- toDegrees / toRadians: converts an angle measured in radians / degrees to an approximately equivalent angle measured in degrees / radians.
- bitwiseNOT: computes bitwise not.

Math functions that take two arguments as input:
- atan2: returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta).
- hypot: computes sqrt(a^2 + b^2) without intermediate overflow or underflow.
- pow: returns the value of the first argument raised to the power of the second argument.

Window functions:
- row_number: returns a sequential number starting at 1 within a window partition (rowNumber is deprecated since 1.6).
- dense_rank: returns the rank of rows within a window partition, without any gaps (denseRank is deprecated since 1.6).
- rank: returns the rank of rows within a window partition. The difference between rank and dense_rank is that dense_rank leaves no gaps in the ranking sequence when there are ties: if three people tie for second place, all three are in second place and the next person comes in third. rank is equivalent to the RANK function in SQL.
- cume_dist: returns the cumulative distribution of values within a window partition, i.e. the fraction of rows that are below the current row (cumeDist is deprecated since 1.6).
- percent_rank: returns the relative rank (i.e. percentile) of rows within a window partition (percentRank is deprecated since 1.6).
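To make the window functions concrete, here is a hedged sketch (the department and salary columns are hypothetical):
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# rank rows within each department by descending salary
w = Window.partitionBy("department").orderBy(F.desc("salary"))
ranked = df.withColumn("salary_rank", F.dense_rank().over(w))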
------
avg(*cols): computes the average value for each numeric column of each group. mean is an alias for avg. Parameters: cols – a list of column names (strings); non-numeric columns are ignored.
fillna(value, subset=None): replace null values, alias for na.fill(). DataFrame.fillna() and DataFrameNaFunctions.fill() are aliases of each other. Parameters:
value – int, long, float, string, or dict. Value to replace null values with. If the value is a dict, then subset is ignored and value must be a mapping from column name (string) to replacement value. The replacement value must be an int, long, float, boolean, or string.
subset – optional list of column names to consider. Columns specified in subset that do not have matching data type are ignored. For example, if value is a string, and subset contains a non-string column, then the non-string column is simply ignored.
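A small sketch of fillna in both forms (the column names are hypothetical):
df.fillna(0)                              # one value for every compatible column
df.fillna({"age": 0, "name": "unknown"})  # per-column replacement values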
join(other, on=None, how=None) parameters:
on – a string for the join column name, a list of column names, a join expression (Column), or a list of Columns. If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an equi-join.
how – str, default inner. Must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, and left_anti.
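A sketch of the two common join forms (df1, df2, and the columns are hypothetical):
df1.join(df2, on="id", how="inner")                      # equi-join on a shared column name
df1.join(df2, df1.id == df2.user_id, how="left_outer")   # explicit join expression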
union(other): returns a new DataFrame containing the union of rows in this and another DataFrame.
This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.
Also as standard in SQL, this function resolves columns by position (not by name).
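For example (df1 and df2 are hypothetical):
allRows = df1.union(df2)                 # UNION ALL semantics
dedupedRows = df1.union(df2).distinct()  # SQL-style UNION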
Adding, deleting, modifying, and querying
PySpark DataFrame methods for adding, deleting, modifying, and querying data:
df.dropna(how='any', thresh=None, subset=None) (also available as df.na.drop)
Functionality for working with missing data in a DataFrame. Parameters:
how – 'any' or 'all'. If 'any', drop a row if it contains any nulls. If 'all', drop a row only if all its values are null.
thresh – int, default None. If specified, drop rows that have fewer than thresh non-null values. This overwrites the how parameter.
subset – optional list of column names to consider.
df.fillna(value, subset=None) (also available as df.na.fill)
df.replace(to_replace, value, subset=None)
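A short sketch of these missing-data helpers (column names are hypothetical):
df.dropna(how="any", subset=["name", "age"])   # drop rows with nulls in these columns
df.na.fill({"age": 0})                         # same effect as df.fillna({"age": 0})
df.replace("N/A", "unknown", subset=["name"])  # swap a sentinel value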
df.withColumn(colName, col)
Returns a new DataFrame by adding a new column, or replacing an existing column that has the same name. colName is a string, the name of the new column. col is a Column expression for the new column. The expression must operate on this same DataFrame; trying to pull in a column from a different DataFrame may fail with AssertionError: col should be Column. The col expression can, however, combine several existing columns, as in the sketch below.
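For example (price and quantity are hypothetical columns of the same DataFrame):
from pyspark.sql import functions as F
df2 = df.withColumn("total", F.col("price") * F.col("quantity"))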
pyspark.sql.functions.window(timeColumn, windowDuration, slideDuration=None, startTime=None): bucketize rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05, 12:10) but not in [12:00, 12:05). Windows can support microsecond precision. Windows in the order of months are not supported.
The time column must be of pyspark.sql.types.TimestampType.
Durations are provided as strings, e.g. ‘1 second’, ‘1 day 12 hours’, ‘2 minutes’. Valid interval strings are ‘week’, ‘day’, ‘hour’, ‘minute’, ‘second’, ‘millisecond’, ‘microsecond’. If the slideDuration is not provided, the windows will be tumbling windows.
The startTime is the offset with respect to 1970-01-01 00:00:00 UTC with which to start window intervals. For example, in order to have hourly tumbling windows that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15… provide startTime as 15 minutes.
The output column will be a struct called ‘window’ by default with the nested columns ‘start’ and ‘end’, where ‘start’ and ‘end’ will be of pyspark.sql.types.TimestampType.
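A hedged sketch of time-window bucketing (event_time and value are hypothetical columns, with event_time of TimestampType):
from pyspark.sql import functions as F
windowed = (df.groupBy(F.window("event_time", "10 minutes", "5 minutes"))
              .agg(F.sum("value").alias("total")))
windowed.select("window.start", "window.end", "total").show()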