PySpark Notes: pyspark sql module

SparkSession: the entry point for programming Spark with the Dataset and DataFrame API.

A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over those tables, cache tables, read files from the local filesystem, and so on.

Use the following pattern to create a SparkSession:

>>> spark = SparkSession.builder \
...     .master("local") \
...     .appName("Word Count") \
...     .config("spark.some.config.option", "some-value") \
...     .getOrCreate()
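
As a minimal sketch of the capabilities listed above (the data, view name, and file path are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("demo").getOrCreate()

# Create a DataFrame and register it as a temporary view.
df = spark.createDataFrame([("Alice", 1), ("Bob", 2)], ["name", "age"])
df.createOrReplaceTempView("people")

# Run SQL against the view and cache the table.
spark.sql("SELECT name FROM people WHERE age > 1").show()
spark.catalog.cacheTable("people")

# Read a local text file (hypothetical path).
lines = spark.read.text("file:///tmp/example.txt")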

Inner class of SparkSession: Builder

The config method: sets a config option, either as a user-defined key-value pair or from an existing SparkConf.

For an existing SparkConf, pass it via the conf parameter.


For a (key, value) pair, the parameter names can be omitted.
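
Both forms end up in the same place; a sketch using the docstring's placeholder option name:

from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

# From an existing SparkConf, via the conf parameter:
conf = SparkConf().set("spark.some.config.option", "some-value")
builder = SparkSession.builder.config(conf=conf)

# As a (key, value) pair, with the parameter names omitted:
builder = SparkSession.builder.config("spark.some.config.option", "some-value")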

So which does config actually store: the entries of the SparkConf passed via conf, or the user-defined key-value pair?

Inside a lock, it checks whether the conf parameter is None. If it is, the user-defined key and value are stored; otherwise every entry from the given SparkConf is copied in.

Finally, it returns the instance itself (return self).
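
In outline, the body of config is (simplified from the full source quoted at the end of these notes):

with self._lock:
    if conf is None:
        # user-defined key-value pair
        self._options[key] = str(value)
    else:
        # copy every entry from the existing SparkConf
        for (k, v) in conf.getAll():
            self._options[k] = v
    return self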


About return self: it returns a reference to the instance object on which the method was called.

Returning self from a method simply means that your method returns a reference to the instance object on which it was called. This can sometimes be seen in use with object oriented APIs that are designed as a fluent interface that encourages method cascading.
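
A toy illustration of the pattern, unrelated to Spark (all names here are invented for the example):

class Greeting(object):
    def __init__(self):
        self._words = []

    def hello(self):
        self._words.append("hello")
        return self  # returning self is what allows chaining

    def to(self, name):
        self._words.append(name)
        return self

    def build(self):
        return " ".join(self._words)

# Method cascading, in the same spirit as
# SparkSession.builder.master(...).appName(...).getOrCreate():
print(Greeting().hello().to("world").build())  # 'hello world'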
 


The appName method:

Sets a name for the application, which is shown in the Spark web UI. If no name is set, a randomly generated one is used.
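
A quick sketch (assuming no SparkSession exists yet, so the name is not inherited from an earlier session):

spark = SparkSession.builder.appName("Word Count").getOrCreate()
print(spark.sparkContext.appName)  # 'Word Count'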


The getOrCreate method: gets an existing SparkSession or, if there is none, creates a new one based on the options set in this builder.

See the source:

def getOrCreate(self):
    """Gets an existing :class:`SparkSession` or, if there is no existing one, creates a
    new one based on the options set in this builder.

    This method first checks whether there is a valid global default SparkSession, and if
    yes, return that one. If no valid global default SparkSession exists, the method
    creates a new SparkSession and assigns the newly created SparkSession as the global
    default.

    >>> s1 = SparkSession.builder.config("k1", "v1").getOrCreate()
    >>> s1.conf.get("k1") == s1.sparkContext.getConf().get("k1") == "v1"
    True

    In case an existing SparkSession is returned, the config options specified
    in this builder will be applied to the existing SparkSession.

    >>> s2 = SparkSession.builder.config("k2", "v2").getOrCreate()
    >>> s1.conf.get("k1") == s2.conf.get("k1")
    True
    >>> s1.conf.get("k2") == s2.conf.get("k2")
    True
    """
    with self._lock:
        from pyspark.context import SparkContext
        from pyspark.conf import SparkConf
        session = SparkSession._instantiatedSession
        if session is None or session._sc._jsc is None:
            sparkConf = SparkConf()
            for key, value in self._options.items():
                sparkConf.set(key, value)
            sc = SparkContext.getOrCreate(sparkConf)
            # This SparkContext may be an existing one.
            for key, value in self._options.items():
                # we need to propagate the confs
                # before we create the SparkSession. Otherwise, confs like
                # warehouse path and metastore url will not be set correctly (
                # these confs cannot be changed once the SparkSession is created).
                sc._conf.set(key, value)
            session = SparkSession(sc)
        for key, value in self._options.items():
            session._jsparkSession.sessionState().conf().setConfString(key, value)
        for key, value in self._options.items():
            session.sparkContext._conf.set(key, value)
        return session
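
In effect, getOrCreate behaves like a singleton accessor: repeated calls return the same session, and options set on a later builder are applied to the existing one. A quick check, mirroring the doctest above:

s1 = SparkSession.builder.config("k1", "v1").getOrCreate()
s2 = SparkSession.builder.config("k2", "v2").getOrCreate()

print(s1 is s2)           # True: the existing session is reused
print(s1.conf.get("k2"))  # 'v2': new options are applied to the existing session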
For reference, the Builder class itself holds the shared lock and the options dict used by both methods:

class Builder(object):
    """Builder for :class:`SparkSession`."""

    _lock = RLock()  # lock guarding _options for thread safety
    _options = {}

    @since(2.0)
    def config(self, key=None, value=None, conf=None):
        """Sets a config option. Options set using this method are automatically propagated to
        both :class:`SparkConf` and :class:`SparkSession`'s own configuration.

        For an existing SparkConf, use `conf` parameter.

        >>> from pyspark.conf import SparkConf
        >>> SparkSession.builder.config(conf=SparkConf())
        <pyspark.sql.session...

        For a (key, value) pair, you can omit parameter names.

        >>> SparkSession.builder.config("spark.some.config.option", "some-value")
        <pyspark.sql.session...

        :param key: a key name string for configuration property
        :param value: a value for configuration property
        :param conf: an instance of :class:`SparkConf`
        """
        with self._lock:
            if conf is None:
                # user-defined key-value pair
                self._options[key] = str(value)
            else:
                # copy every entry from the existing SparkConf
                for (k, v) in conf.getAll():
                    self._options[k] = v
            return self

 
