Reference: the difference between join on and where
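import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()  # the pyspark shell already provides `spark`
d1 = {'name1':["A","B","C"], 'height':[165,170,160]}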
d2 = {'name2':["B","C","D"], 'age':[45,43,50]}
df1 = spark.createDataFrame(pd.DataFrame(d1))
df2 = spark.createDataFrame(pd.DataFrame(d2))
df1.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")
df1                df2
+-----+------+     +-----+---+
|name1|height|     |name2|age|
+-----+------+     +-----+---+
|    A|   165|     |    B| 45|
|    B|   170|     |    C| 43|
|    C|   160|     |    D| 50|
+-----+------+     +-----+---+
>>>spark.sql(
"""
select * from table1
left join table2
"""
).show()
# Output: with no on clause, every row of table1 is paired with every row of table2
+-----+------+-----+---+
|name1|height|name2|age|
+-----+------+-----+---+
|    A|   165|    B| 45|
|    A|   165|    C| 43|
|    A|   165|    D| 50|
|    B|   170|    B| 45|
|    B|   170|    C| 43|
|    B|   170|    D| 50|
|    C|   160|    B| 45|
|    C|   160|    C| 43|
|    C|   160|    D| 50|
+-----+------+-----+---+
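The on condition is applied while the intermediate (joined) table is being built. In a left join, every row of the left table is returned whether or not the on condition is true for it; a left row with no match gets null in the right table's columns.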
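The rule above can be modelled in a few lines of plain Python. This is only an illustrative sketch of left join semantics, not how Spark executes a join; the helper name `left_join` and the list-of-dicts representation of the tables are assumptions made for this example.

```python
# Plain-Python model of LEFT JOIN semantics (illustrative, not Spark's implementation).
table1 = [{"name1": "A", "height": 165},
          {"name1": "B", "height": 170},
          {"name1": "C", "height": 160}]
table2 = [{"name2": "B", "age": 45},
          {"name2": "C", "age": 43},
          {"name2": "D", "age": 50}]


def left_join(left, right, on):
    """Pair each left row with every right row for which on(l, r) is true.

    A left row with no matching right row is still emitted once,
    with the right-hand columns set to None (SQL null).
    """
    right_columns = list(right[0].keys()) if right else []
    joined = []
    for l in left:
        matches = [r for r in right if on(l, r)]
        if matches:
            joined.extend({**l, **r} for r in matches)
        else:
            # The on condition was false for every right row: keep the left row anyway.
            joined.append({**l, **dict.fromkeys(right_columns)})
    return joined


# Same condition as: on table1.name1 in ('A','B')
rows = left_join(table1, table2, lambda l, r: l["name1"] in ("A", "B"))
for row in rows:
    print(row)
```

Rows A and B each pair with all three rows of table2, while row C fails the condition and is kept once with None in `name2` and `age`.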
>>>spark.sql(
"""
select * from table1
left join table2
on table1.name1 in ('A','B')
"""
).show()
# Output
+-----+------+-----+----+
|name1|height|name2| age|
+-----+------+-----+----+
|    A|   165|    B|  45|
|    A|   165|    C|  43|
|    A|   165|    D|  50|
|    B|   170|    B|  45|
|    B|   170|    C|  43|
|    B|   170|    D|  50|
|    C|   160| null|null|
+-----+------+-----+----+
>>>spark.sql(
"""
select * from table1
left join table2
on table2.name2 in ('B','C')
"""
).show()
# Output
+-----+------+-----+---+
|name1|height|name2|age|
+-----+------+-----+---+
|    A|   165|    B| 45|
|    A|   165|    C| 43|
|    B|   170|    B| 45|
|    B|   170|    C| 43|
|    C|   160|    B| 45|
|    C|   160|    C| 43|
+-----+------+-----+---+
>>>spark.sql(
"""
select * from table1
left join table2
on table2.name2 in ('A')
"""
).show()
# Output
+-----+------+-----+----+
|name1|height|name2| age|
+-----+------+-----+----+
|    A|   165| null|null|
|    B|   170| null|null|
|    C|   160| null|null|
+-----+------+-----+----+
>>>spark.sql(
"""
select * from table1
left join table2
on table1.name1=table2.name2
"""
).show()
# Output
+-----+------+-----+----+
|name1|height|name2| age|
+-----+------+-----+----+
|    A|   165| null|null|
|    B|   170|    B|  45|
|    C|   160|    C|  43|
+-----+------+-----+----+
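The where condition is applied after the intermediate table has been built, as a filter on that table. At this point the left join guarantee (return every row of the left table) no longer applies: any row for which the where condition is not true is filtered out.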
>>>spark.sql(
"""
select * from table1
left join table2
where table1.name1='A'
"""
).show()
# Output
+-----+------+-----+---+
|name1|height|name2|age|
+-----+------+-----+---+
|    A|   165|    B| 45|
|    A|   165|    C| 43|
|    A|   165|    D| 50|
+-----+------+-----+---+
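To see the two rules side by side, the sketch below puts the same predicate once in on and once in where. It is an assumption-laden illustration: it uses Python's built-in sqlite3 module instead of Spark so that it runs without a Spark session, and the predicate `table2.age > 44` is chosen only for this example. The on/where behaviour it shows is standard SQL, so the same two queries passed to spark.sql should behave the same way, though that is not verified here.

```python
import sqlite3

# In-memory SQLite database standing in for the Spark temp views (assumption:
# SQLite is used only so the example is self-contained).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table table1 (name1 text, height integer);
    create table table2 (name2 text, age integer);
    insert into table1 values ('A', 165), ('B', 170), ('C', 160);
    insert into table2 values ('B', 45), ('C', 43), ('D', 50);
""")

# Predicate inside on: it only decides which right rows match.
# Left rows without a match are kept and padded with null (None).
on_rows = conn.execute("""
    select * from table1
    left join table2
    on table1.name1 = table2.name2 and table2.age > 44
    order by table1.name1
""").fetchall()

# Predicate inside where: it filters the already joined table,
# so the null-padded rows (null > 44 is not true) are dropped too.
where_rows = conn.execute("""
    select * from table1
    left join table2
    on table1.name1 = table2.name2
    where table2.age > 44
    order by table1.name1
""").fetchall()

print(on_rows)     # all three left rows survive
print(where_rows)  # only the row that passes the filter survives
conn.close()
```

With the predicate in on, rows A and C come back with None in the right-hand columns; with the predicate in where, only row B remains, which is why a where filter on a right-table column effectively turns a left join into an inner join.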