This post only covers using Spark to read from and write to a MySQL database via JDBC.
1. JDBC
For Spark to fetch data from MySQL, we first need the MySQL JDBC connector jar, mysql-connector-java-5.1.40-bin.jar.
Put the jar somewhere suitable on the machine — I placed it under /home/sxw/Documents — and add the following line to Spark's spark-env.sh:
export SPARK_CLASSPATH=/home/sxw/Documents/mysql-connector-java-5.1.40-bin.jar
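Note that SPARK_CLASSPATH is deprecated on newer Spark versions; passing the jar at submit time is the usual alternative. A sketch, assuming the same jar path (your_script.py is a placeholder for your application):

```shell
# ship the MySQL connector to both driver and executors at submit time
spark-submit \
  --jars /home/sxw/Documents/mysql-connector-java-5.1.40-bin.jar \
  your_script.py
```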
2. Reading: example code
df = spark.read.format('jdbc').options(
    url='jdbc:mysql://127.0.0.1:3306',
    driver='com.mysql.jdbc.Driver',
    dbtable='mysql.db',
    user='root',
    password='123456'
).load()
df.show()

# a SQL query can also be passed in, wrapped as an aliased subquery
sql = "(select * from mysql.db where db='wp230') t"
df = spark.read.format('jdbc').options(
    url='jdbc:mysql://127.0.0.1:3306',
    driver='com.mysql.jdbc.Driver',
    dbtable=sql,
    user='root',
    password='123456'
).load()
df.show()
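The subquery trick above — wrapping a SQL statement in parentheses with an alias so it can be passed as the dbtable option — can be captured in a small helper. A sketch; the helper name is my own, not part of the Spark API:

```python
def as_dbtable(query, alias="t"):
    """Wrap a SQL query so it can be passed as the JDBC 'dbtable' option.

    JDBC sources accept either a table name or a parenthesised subquery
    with an alias; this produces the latter form.
    """
    return "({}) {}".format(query, alias)

# equivalent to the hand-written string in the example above:
sql = as_dbtable("select * from mysql.db where db='wp230'")
```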
---------------------
Author: 振裕, Source: CSDN
3. Writing: example code
# enable dynamic partitioning
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("set hive.exec.dynamic.partition=true")

# write into the partitioned table with plain Hive SQL
spark.sql("""
    insert overwrite table ai.da_aipurchase_dailysale_hive
    partition (saledate)
    select productid, propertyid, processcenterid, saleplatform, sku, poa, salecount, saledate
    from szy_aipurchase_tmp_szy_dailysale
    distribute by saledate
""")

# or rebuild/append the partitions through the DataFrame writer
# (note: newer Spark versions reject combining partitionBy() with insertInto())
jdbcDF.write.mode("overwrite").partitionBy("saledate").insertInto("ai.da_aipurchase_dailysale_hive")
jdbcDF.write.saveAsTable("ai.da_aipurchase_dailysale_hive", None, "append", partitionBy='saledate')

# no partitions: simply load into a Hive table
jdbcDF.write.saveAsTable("ai.da_aipurchase_dailysale_for_ema_predict", None, "overwrite", None)
Original post: https://blog.csdn.net/suzyu12345/article/details/79673473
4. Other notes
import os
os.environ["JAVA_HOME"] = r"D:\Program Files\Java\jdk1.8.0_131"
from pyspark.sql import SparkSession, SQLContext, DataFrame
from pyspark.sql.readwriter import DataFrameReader, DataFrameWriter
appname = "demo"
sparkmaster = "local"
spark = SparkSession.builder.appName(appname).master(sparkmaster).getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)
Internally, Spark uses DataFrameReader and DataFrameWriter to implement DataFrame read and write operations.
df = sqlContext.read.format("jdbc").options(url=url, driver=driver, dbtable=dbtable, user=user, password=password).load()
df_reader = DataFrameReader(sqlContext)
df = df_reader.format("jdbc").options(url=url, driver=driver, dbtable=dbtable, user=user, password=password).load()
df = df_reader.jdbc(url, table, properties=properties)
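With the df_reader.jdbc(...) form, the connection details travel in a properties dict rather than in options(). A small helper to build it — the helper name is mine, not part of the Spark API:

```python
def mysql_properties(user, password, driver="com.mysql.jdbc.Driver"):
    # properties dict expected by DataFrameReader.jdbc(url, table, properties=...)
    return {"user": user, "password": password, "driver": driver}
```

For example: df = df_reader.jdbc(url, table, properties=mysql_properties("root", "123456")).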