#coding=utf-8
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext, SQLContext
# HiveContext extends SQLContext: everything SQLContext supports, HiveContext supports as well.
# A HiveContext and a SQLContext opened side by side belong to two different sessions,
# so temporary tables registered in one session are not visible in the other.
conf = SparkConf().setAppName("spark-sql-demo")
sc = SparkContext(conf=conf)
hiveSql = HiveContext(sc)
hiveSql.registerFunction("get_family_name", lambda x: x.split(" ")[0])  # register a UDF
student = hiveSql.table("tmp_dp.student")  # read a Hive table
student.where(student.sex < 5).registerTempTable("stud")  # register as a temporary table
score_path = "/data/tmp/score/score.json"
score = hiveSql.jsonFile(score_path)  # load JSON data
score.registerTempTable("score")  # register as a temporary table
sqls = """select get_family_name(st.name), avg(sc.performance.math)
from stud st, score sc
where st.stu_id = sc.stu_id
group by get_family_name(st.name)
"""
df = hiveSql.sql(sqls)
for row in df.collect():  # print the results
    print(row)
sc.stop()
Rules
Using Catalyst in Spark SQL
In the physical planning phase, Catalyst generates multiple plans and compares them based on cost; all the other phases are purely rule-based.
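From PySpark, the easiest way to see the output of these phases is DataFrame.explain(True), which prints the parsed, analyzed and optimized logical plans plus the selected physical plan. A minimal sketch, reusing the hiveSql context and the tmp_dp.student table from the example above (the exact plan text depends on the Spark version):
# Show how the query moves through analysis, logical optimization and physical planning.
plan_df = hiveSql.sql("select sex, count(*) from tmp_dp.student group by sex")
plan_df.explain(True)  # extended=True also prints the intermediate logical plans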
Analysis
Steps applied to the logical plan during analysis:
- Looking up relations by name from the catalog.
- Mapping named attributes, such as col, to the input provided by the operator's children.
- Determining which attributes refer to the same value to give them a unique ID (which later allows optimization of expressions such as col = col).
- Propagating and coercing types through expressions: for example, we cannot know the return type of 1 + col until we have resolved col and possibly cast its subexpressions to a compatible type (see the sketch after this list).
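To illustrate the last two steps, the analyzer resolves each column reference to a concrete attribute and inserts any cast needed before an expression such as 1 + col can be evaluated. A minimal sketch against the student table used earlier (the column name sex is taken from that example; a Cast only appears if the operand types differ):
students = hiveSql.table("tmp_dp.student")
# The analyzed plan resolves `sex` to an attribute with a unique ID and may wrap it in a Cast
# so that the literal 1 and the column share a compatible type.
students.select(students.sex + 1).explain(True)
# Referencing a column that does not exist is rejected during analysis, not at runtime.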
Logical Optimization
Physical Planning
Code Generation
Extension Points
Data Sources
- All data sources must implement a createRelation function that takes a set of key-value parameters and returns a BaseRelation object for that relation.
- To let Spark SQL read the data, a BaseRelation can implement one of several interfaces that expose varying degrees of sophistication, e.g. TableScan, PrunedScan, PrunedFilteredScan (see the sketch below).
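These interfaces are implemented on the Scala/JVM side; from PySpark a packaged data source is only consumed. A minimal sketch, assuming the generic load API of the Spark 1.3-era SQLContext and a hypothetical JSON path (equivalent to the CREATE TEMPORARY TABLE ... USING syntax shown further below):
# `source` names the data source implementation; its BaseRelation determines whether
# column pruning and filter pushdown (PrunedScan / PrunedFilteredScan) are available.
events = hiveSql.load(source="json", path="/data/tmp/events.json")  # hypothetical path
events.registerTempTable("events")
hiveSql.sql("select * from events limit 10").show()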
User-Defined Types (UDTs)
{
"text": "This is a tweet about #Spark",
"tags": ["#Spark"],
"loc": {"lat": 45.1, "long": 90}
}
{
"text": "This is another tweet",
"tags": [],
"loc": {"lat": 39, "long": 88.5}
}
{
"text": "A #tweet without #location",
"tags": ["#tweet", "#location"]
}
The data above can be queried with SQL such as:
SELECT loc.lat, loc.long FROM tweets WHERE text LIKE '%Spark%' AND tags IS NOT NULL
- The algorithm attempts to infer a tree of STRUCT types, each of which may contain atoms, arrays, or other STRUCTs; conflicting field types are coerced to a compatible type (see the sketch after this list).
- The same algorithm is used to infer schemas for RDDs of Python objects.
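A minimal sketch of this inference from PySpark, assuming the three tweet records above have been saved to a hypothetical file /data/tmp/tweets.json with one JSON object per line:
tweets = hiveSql.jsonFile("/data/tmp/tweets.json")
tweets.printSchema()  # inferred nested schema: text (string), tags (array<string>), loc (struct<lat,long>)
tweets.registerTempTable("tweets")
hiveSql.sql("SELECT loc.lat, loc.long FROM tweets "
            "WHERE text LIKE '%Spark%' AND tags IS NOT NULL").show()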
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

data = ...  # a DataFrame of (text, label) records, as described in the caption below
tokenizer = Tokenizer().setInputCol("text").setOutputCol("words")
tf = HashingTF().setInputCol("words").setOutputCol("features")
lr = LogisticRegression().setFeaturesCol("features")  # pyspark.ml uses setFeaturesCol here
pipeline = Pipeline().setStages([tokenizer, tf, lr])
model = pipeline.fit(data)
Figure 7: A short MLlib pipeline and the Python code to run it. We start with a DataFrame of (text, label) records, tokenize the text into words, run a term frequency featurizer (HashingTF) to get a feature vector, then train logistic regression.
CREATE TEMPORARY TABLE users USING jdbc OPTIONS(driver "mysql" url "jdbc:mysql://userDB/users");
CREATE TEMPORARY TABLE logs USING json OPTIONS (path "logs.json");
SELECT users.id, users.name, logs.message FROM users, logs
WHERE users.id = logs.userId AND users.registrationDate > "2015-01-01";
The part of the query pushed down to MySQL:
SELECT users.id, users.name FROM users WHERE users.registrationDate > "2015-01-01"
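The same federated query can be issued from PySpark; a minimal sketch, assuming the Spark 1.3-era generic load API and connection options mirroring the SQL above (the registrationDate filter is still pushed down to MySQL, as shown by the query above):
users = hiveSql.load(source="jdbc", driver="mysql",
                     url="jdbc:mysql://userDB/users", dbtable="users")
logs = hiveSql.load(source="json", path="logs.json")
users.registerTempTable("users")
logs.registerTempTable("logs")
hiveSql.sql("""SELECT users.id, users.name, logs.message FROM users, logs
               WHERE users.id = logs.userId AND users.registrationDate > '2015-01-01'""").show()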
- Used a cluster of six EC2 i2.xlarge machines (one master, five
workers) each with 4 cores, 30 GB memory and an 800 GB SSD, running
HDFS 2.4, Spark 1.3, Shark 0.9.1 and Impala 2.1.1. The dataset was
110 GB of data after compression using the columnar Parquet format
- The main reason for the difference with Shark is code generation in
Catalyst (Section 4.3.4), which reduces CPU overhead. This feature
makes Spark SQL competitive with the C++ and LLVM based Impala engine
in many of these queries. The largest gap from Impala is in query 3a
where Impala chooses a better join plan because the selectivity of
the queries makes one of the tables very small.
The benchmark dataset consists of 1 billion (a, b) integer pairs with 100,000 distinct values of a, run on the same five-worker i2.xlarge cluster as in the previous section.
- Using map and reduce functions in the Python API for Spark:
# Build per-key (sum, count) pairs, then divide to get the average of b for each a.
sum_and_count = \
    data.map(lambda x: (x.a, (x.b, 1))) \
        .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1])) \
        .collect()
[(x[0], x[1][0] / x[1][1]) for x in sum_and_count]
- Using the DataFrame API:
df.groupBy("a").avg("b")
This is because in the DataFrame API, only the logical plan is constructed in Python, and all physical execution is compiled down into native Spark code as JVM bytecode, resulting in more efficient execution. In fact, the DataFrame version also outperforms a Scala version of the Spark code above by 2×. This is mainly due to code generation: the code in the DataFrame version avoids expensive allocation of key-value pairs that occurs in hand-written Scala code.
References
- Original paper: Spark SQL: Relational Data Processing in Spark. https://amplab.cs.berkeley.edu/publication/spark-sql-relational-data-processing-in-spark/
- Implementing the machine-learning SGD algorithm on large-scale data with in-database analytics. http://www.infoq.com/cn/articles/in-database-analytics-sdg-arithmetic/
- Spark: a fast and general data processing architecture on large clusters (Chinese translation of Matei Zaharia's PhD thesis). https://code.csdn.net/CODE_Translation/spark_matei_phd
- Database System Implementation, China Machine Press.
- Database System Concepts, China Machine Press.