Connecting PySpark to a Database via JDBC

PySpark 1.6.2 API reference: http://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html?highlight=jdbc#pyspark.sql.DataFrameWriter.jdbc


1. Using MySQL as an example

url = "jdbc:mysql://localhost:3306/test"
table = "test"
properties = {"user": "root", "password": "111111"}

df = sqlContext.read.jdbc(url, table, properties=properties)   # read
df.write.jdbc(url, table, properties=properties)               # write
# Note: properties must be passed as a keyword argument; passed positionally it
# would bind to the reader's column parameter (or the writer's mode parameter).
# When writing, an RDD must first be converted to a DataFrame, e.g. with rdd.toDF()


# If the data being imported contains Chinese characters:
# set the MySQL table's encoding to utf8:
#   ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
# url = "jdbc:mysql://127.0.0.1:3306/goddness?useUnicode=true&characterEncoding=utf-8"
# prefix string literals containing Chinese characters with u, e.g. u'汉字'
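
Putting the above together, a minimal end-to-end sketch (assumptions: a local MySQL database named test reachable with the credentials above, a table test(id INT, name VARCHAR), and the MySQL Connector/J jar supplied at submit time; the jar file name below is illustrative):

# spark-submit --jars mysql-connector-java-5.1.38.jar \
#              --driver-class-path mysql-connector-java-5.1.38.jar demo.py
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="jdbc-demo")
sqlContext = SQLContext(sc)

url = "jdbc:mysql://localhost:3306/test?useUnicode=true&characterEncoding=utf-8"
properties = {"user": "root", "password": "111111"}

# build a DataFrame from an RDD of Rows, then append it to the table
rdd = sc.parallelize([Row(id=1, name=u'汉字'), Row(id=2, name=u'test')])
df = rdd.toDF()
df.write.jdbc(url, "test", mode="append", properties=properties)

# read the table back into a DataFrame
df2 = sqlContext.read.jdbc(url, "test", properties=properties)
df2.show()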
 


2. The jdbc function (on DataFrameWriter) is shown below. It mainly takes three arguments: url, table, and properties, where properties is a dict (map):

def jdbc(self, url, table, mode=None, properties=None):
    """Saves the content of the :class:`DataFrame` to a external database table via JDBC.

    .. note:: Don't create too many partitions in parallel on a large cluster;\
    otherwise Spark might crash your external database systems.

    :param url: a JDBC URL of the form ``jdbc:subprotocol:subname``
    :param table: Name of the table in the external database.
    :param mode: specifies the behavior of the save operation when data already exists.

        * ``append``: Append contents of this :class:`DataFrame` to existing data.
        * ``overwrite``: Overwrite existing data.
        * ``ignore``: Silently ignore this operation if data already exists.
        * ``error`` (default case): Throw an exception if data already exists.
    :param properties: JDBC database connection arguments, a list of
                       arbitrary string tag/value. Normally at least a
                       "user" and "password" property should be included."""


3. Question:

What kind of connection does Spark use when talking to the database: a connection pool, or plain one-off connections? (For reference: in Spark 1.6 the JDBC data source opens an ordinary connection per partition via java.sql.DriverManager; it does not maintain a connection pool.)
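
If you want explicit control on the Python side (for example, writing through a raw driver instead of df.write.jdbc), the usual pattern is likewise one plain connection per partition via foreachPartition. A minimal sketch, assuming the PyMySQL package is installed on the executors and the table has columns id and name (both assumptions):

import pymysql

def save_partition(rows):
    # one ordinary connection per partition, opened on the executor
    conn = pymysql.connect(host="localhost", user="root",
                           password="111111", db="test", charset="utf8")
    try:
        with conn.cursor() as cur:
            for row in rows:
                cur.execute("INSERT INTO test (id, name) VALUES (%s, %s)",
                            (row.id, row.name))
        conn.commit()
    finally:
        conn.close()

df.rdd.foreachPartition(save_partition)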


