Interpreting How Spark Reads Data (Part)

This series is based on the Spark 2.2.1 source code, which I downloaded for analysis and study.

The goal is to understand the following call chain (here sc is a SparkSession, and the format string and connection options are placeholders for a third-party connector):

dbData=sc.read.schema(mySchema).format("com.xxxx.spark.sql").options(uri=dbUrl, database=myDatabase, collection=myCollection).load()

1. A brief overview of the SparkSession class (today's focus is read, so SparkSession itself only gets a quick mention):
Its official docstring describes it as "The entry point to programming Spark with the Dataset and DataFrame API".
Usage example:
>>> spark = SparkSession.builder \\
    ...     .master("local") \\
    ...     .appName("Word Count") \\
    ...     .config("spark.some.config.option", "some-value") \\
    ...     .getOrCreate()

2. Today we take a look at SparkSession's read.
The official documentation gives:
read
Returns a DataFrameReader that can be used to read data in as a DataFrame.
Returns: DataFrameReader
In other words, read returns a DataFrameReader, which reads data in and hands it back as a DataFrame.
The source code is as follows:
 
@property
@since(2.0)
def read(self):
    """
    Returns a :class:`DataFrameReader` that can be used to read data
    in as a :class:`DataFrame`.

    :return: :class:`DataFrameReader`
    """
    return DataFrameReader(self._wrapped)
From this source we can see that SparkSession.read is a property that does nothing more than construct and return a DataFrameReader wrapping the session's SQLContext (self._wrapped).
So let's track that class down and take a closer look.
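Before digging into the class itself, a quick check in a pyspark shell (a minimal sketch; the master and appName values are arbitrary) confirms what read hands back:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("read-demo").getOrCreate()
reader = spark.read
print(type(reader))   # <class 'pyspark.sql.readwriter.DataFrameReader'>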
The import statement from pyspark.sql.readwriter import DataFrameReader tells us where DataFrameReader lives.
The class contains quite a few methods; here we only analyze the ones used in our call chain:
schema: specifies the input schema
format: specifies the input data source format
options: adds input options for the underlying data source
The source code is as follows:
class DataFrameReader(OptionUtils):
    """
    Interface used to load a :class:`DataFrame` from external storage systems
    (e.g. file systems, key-value stores, etc). Use :func:`spark.read`
    to access this.
    .. versionadded:: 1.4
    """
    def __init__(self, spark):
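        # 'spark' here is the SQLContext wrapper handed in by SparkSession.read (self._wrapped);
        # _ssql_ctx.read() returns the JVM-side DataFrameReader that every call below delegates to via Py4J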
        self._jreader = spark._ssql_ctx.read()
        self._spark = spark

    def _df(self, jdf):
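        # wrap a Java DataFrame handle returned by the JVM reader into a Python DataFrame
        # bound to the same SQLContext wrapper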
        from pyspark.sql.dataframe import DataFrame
        return DataFrame(jdf, self._spark)

    @since(1.4)
    def format(self, source):
        """Specifies the input data source format.
	指定输入数据源格式。
        :param source: string, name of the data source, e.g. 'json', 'parquet'.
						字符串类型,数据源名称
        >>> df = spark.read.format('json').load('python/test_support/sql/people.json')
        >>> df.dtypes
        [('age', 'bigint'), ('name', 'string')]

        """
        self._jreader = self._jreader.format(source)
        return self
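    # Note that format() returns self -- as do schema() and options() below -- which is what
    # makes the builder-style chain in the opening example possible. A minimal sketch:
    #
    #     reader = spark.read
    #     same = reader.format('json')
    #     reader is same          # True: the very same DataFrameReader is handed back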

    @since(1.4)
    def schema(self, schema):
        """Specifies the input schema.
	指定输入模式
        Some data sources (e.g. JSON) can infer the input schema automatically from data.
        By specifying the schema here, the underlying data source can skip the schema
        inference step, and thus speed up data loading.
	一些数据源(例如JSON)可以从数据中自动推断出输入模式。
	在这里通过指定模式,底层数据源可以跳过模式推理步骤,从而加快数据加载速度。
        :param schema: a :class:`pyspark.sql.types.StructType` object
        """
        from pyspark.sql import SparkSession
        if not isinstance(schema, StructType):
            raise TypeError("schema should be StructType")
        spark = SparkSession.builder.getOrCreate()
        jschema = spark._jsparkSession.parseDataType(schema.json())
        self._jreader = self._jreader.schema(jschema)
        return self
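    # For reference, the mySchema object in the opening call chain would be built along these
    # lines (the field names simply mirror the people.json doctest above; only a StructType
    # instance passes the isinstance check):
    #
    #     from pyspark.sql.types import StructType, StructField, StringType, LongType
    #     mySchema = StructType([
    #         StructField("age", LongType(), True),     # nullable 64-bit integer column
    #         StructField("name", StringType(), True),  # nullable string column
    #     ])
    #     df = spark.read.schema(mySchema).format('json') \
    #              .load('python/test_support/sql/people.json')
    #     df.printSchema()   # inference is skipped; exactly these two fields are used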
	
    @since(1.4)
    def options(self, **options):
        """Adds input options for the underlying data source.
        You can set the following option(s) for reading files:
            * ``timeZone``: sets the string that indicates a timezone to be used to parse timestamps
                in the JSON/CSV datasources or partition values.
                If it isn't set, it uses the default value, session local timezone.
        """
        for k in options:
            self._jreader = self._jreader.option(k, to_str(options[k]))
        return self
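    # Each keyword argument is forwarded to the JVM reader as option(key, to_str(value)); which
    # keys are meaningful depends entirely on the data source. Assuming dbUrl, myDatabase and
    # myCollection are plain strings defined earlier, the opening example boils down to:
    #
    #     reader = (spark.read
    #               .format("com.xxxx.spark.sql")     # placeholder connector name
    #               .options(uri=dbUrl, database=myDatabase, collection=myCollection))
    #     dbData = reader.load()                      # no path: the connector locates the data itself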

    @since(1.4)
    def load(self, path=None, format=None, schema=None, **options):
        """Loads data from a data source and returns it as a :class`DataFrame`.
		从数据源加载数据并将其作为一个DataFrame类返回。
        :param path: optional string or a list of string for file-system backed data sources.
					可选的,字符串或用于文件系统支持的数据源的字符串列表。
        :param format: optional string for format of the data source. Default to 'parquet'.
						可选的,用于数据源格式化的字符串。默认为“parquet”。
        :param schema: optional :class:`pyspark.sql.types.StructType` for the input schema.
						可选的, 输入模式
        :param options: all other string options
						其他的所有字符串选项
        >>> df = spark.read.load('python/test_support/sql/parquet_partitioned', opt1=True,
        ...     opt2=1, opt3='str')
        >>> df.dtypes
        [('name', 'string'), ('year', 'int'), ('month', 'int'), ('day', 'int')]

        >>> df = spark.read.format('json').load(['python/test_support/sql/people.json',
        ...     'python/test_support/sql/people1.json'])
        >>> df.dtypes
        [('age', 'bigint'), ('aka', 'string'), ('name', 'string')]
        """
        if format is not None:
            self.format(format)
        if schema is not None:
            self.schema(schema)
        self.options(**options)
        if isinstance(path, basestring):
            return self._df(self._jreader.load(path))
        elif path is not None:
            if type(path) != list:
                path = [path]
            return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
        else:
            return self._df(self._jreader.load())
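    # The three return branches above correspond to three ways of calling load(); a rough
    # sketch reusing the doctest files:
    #
    #     # 1. a single path string goes straight to self._jreader.load(path)
    #     df1 = spark.read.format('json').load('python/test_support/sql/people.json')
    #
    #     # 2. a list of paths is converted to a Scala Seq and loaded together
    #     df2 = spark.read.format('json').load(['python/test_support/sql/people.json',
    #                                           'python/test_support/sql/people1.json'])
    #
    #     # 3. no path at all: self._jreader.load() relies on format/options alone, which is
    #     #    exactly the case for the connector example at the top of this post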

    @since(1.4)
    def jdbc(self, url, table, column=None, lowerBound=None, upperBound=None, numPartitions=None,
             predicates=None, properties=None):
        """
        Construct a :class:`DataFrame` representing the database table named ``table``
        accessible via JDBC URL ``url`` and connection ``properties``.

        Partitions of the table will be retrieved in parallel if either ``column`` or
        ``predicates`` is specified. ``lowerBound``, ``upperBound`` and ``numPartitions``
        is needed when ``column`` is specified.

        If both ``column`` and ``predicates`` are specified, ``column`` will be used.

        .. note:: Don't create too many partitions in parallel on a large cluster; \
        otherwise Spark might crash your external database systems.

        :param url: a JDBC URL of the form ``jdbc:subprotocol:subname``
        :param table: the name of the table
        :param column: the name of an integer column that will be used for partitioning;
                       if this parameter is specified, then ``numPartitions``, ``lowerBound``
                       (inclusive), and ``upperBound`` (exclusive) will form partition strides
                       for generated WHERE clause expressions used to split the column
                       ``column`` evenly
        :param lowerBound: the minimum value of ``column`` used to decide partition stride
        :param upperBound: the maximum value of ``column`` used to decide partition stride
        :param numPartitions: the number of partitions
        :param predicates: a list of expressions suitable for inclusion in WHERE clauses;
                           each one defines one partition of the :class:`DataFrame`
        :param properties: a dictionary of JDBC database connection arguments. Normally at
                           least properties "user" and "password" with their corresponding values.
                           For example { 'user' : 'SYSTEM', 'password' : 'mypassword' }
        :return: a DataFrame
        """
        if properties is None:  # if no connection properties were given, fall back to an empty dict
            properties = dict()
        jprop = JavaClass("java.util.Properties", self._spark._sc._gateway._gateway_client)()
        for k in properties:  # copy each entry into a java.util.Properties object
            jprop.setProperty(k, properties[k])
        if column is not None:  # when a partition column is given, lowerBound, upperBound and numPartitions must all be set, otherwise an assertion fails
            assert lowerBound is not None, "lowerBound can not be None when ``column`` is specified"
            assert upperBound is not None, "upperBound can not be None when ``column`` is specified"
            assert numPartitions is not None, \
                "numPartitions can not be None when ``column`` is specified"
            return self._df(self._jreader.jdbc(url, table, column, int(lowerBound), int(upperBound),
                                               int(numPartitions), jprop))
        if predicates is not None:
            gateway = self._spark._sc._gateway
            jpredicates = utils.toJArray(gateway, gateway.jvm.java.lang.String, predicates)
            return self._df(self._jreader.jdbc(url, table, jpredicates, jprop))
        return self._df(self._jreader.jdbc(url, table, jprop))
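To round things off, here is a hedged sketch of the two parallel-read modes jdbc supports; the JDBC URL, table and column names below are purely illustrative, and the matching JDBC driver must be on the classpath:

props = {'user': 'SYSTEM', 'password': 'mypassword'}      # as in the docstring example

# partition by an integer column: Spark generates WHERE clauses from the bounds
df_by_column = spark.read.jdbc(
    url='jdbc:postgresql://dbhost:5432/mydb', table='people',
    column='id', lowerBound=1, upperBound=100000, numPartitions=8,
    properties=props)

# or supply hand-written predicates: one partition per WHERE expression
df_by_predicates = spark.read.jdbc(
    url='jdbc:postgresql://dbhost:5432/mydb', table='people',
    predicates=['age < 30', 'age >= 30'],
    properties=props)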
This post covers only part of my understanding of SparkSession; given my limited experience there may be inaccuracies, and corrections and discussion are welcome.




