"""
Selects column based on the column name specified as a regex and returns it
as :class:`Column`.
"""
Returns the columns whose names match a regular expression.
df.show()
# Note the backticks around the pattern
df.select(df.colRegex("`(grade)+.+`")).show()
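In the pattern above, (grade) is treated as a single group, "." matches any character, and "+" means the preceding element occurs one or more times. So "(grade)+" requires "grade" at least once, and ".+" requires at least one more character of any kind after it.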
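To make the pattern concrete, here is a minimal sketch that applies the same regular expression to a list of column names using Python's `re` module instead of Spark. The column names are hypothetical, and the sketch assumes the pattern has to match the whole column name; Spark evaluates the pattern with Java's regex engine, so treat this only as an approximation of what `colRegex` selects.

```python
import re

# Same pattern as in the colRegex call, without the surrounding backticks.
pattern = re.compile(r"(grade)+.+")

# Hypothetical column names, chosen only for illustration.
columns = ["id", "name", "age", "grade", "grade1", "grade2"]

# fullmatch requires the pattern to cover the entire name (an assumption
# about how the column name is compared against the pattern).
matched = [c for c in columns if pattern.fullmatch(c)]
print(matched)  # ['grade1', 'grade2']
```

Note that a column named exactly "grade" is not selected, because ".+" needs at least one character after the group.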
"""Returns all the records as a list of :class:`Row`."""
Returns all the data in the DataFrame. Note that with a large dataset this can easily cause an out-of-memory error on the driver node!
data = df.collect()
for i in data:
    print(i)
"""Returns all column names as a list."""
Returns all column names of the DataFrame as a list. This is a @property, so it is accessed without parentheses.
print(df.columns)
"""Returns the number of rows in this :class:`DataFrame`."""
Returns the number of Rows in the DataFrame.
print(df.count())
Computes the sample covariance of two columns. Covariance is one of the measures of how two variables are related to each other.
df1 = df.withColumn("grade1", df["grade1"].cast("double"))\
    .withColumn("grade2", df["grade2"].cast("double"))
c = df1.cov("grade1", "grade2")
print(c)
print(type(c))
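As a rough check on what the returned number means, the sample covariance can be computed by hand. The sketch below is plain Python, not Spark; `sample_cov` and the grade values are made up for illustration, and it assumes the usual n - 1 denominator for sample covariance.

```python
def sample_cov(xs, ys):
    """Sample covariance of two equal-length sequences (n - 1 denominator)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means, divided by n - 1.
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical values standing in for the grade1 and grade2 columns.
grade1 = [1.0, 2.0, 3.0, 4.0]
grade2 = [2.0, 4.0, 6.0, 8.0]
c = sample_cov(grade1, grade2)
print(c)
print(type(c))  # <class 'float'>
```

A positive value means the two columns tend to increase together; the magnitude depends on the units of both columns, which is why correlation is often reported alongside it.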
createGlobalTempView(name)
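Creates a global temporary view from the DataFrame. Its lifetime is tied to the Spark application: the view stays accessible for as long as the application is running, until the application exits through a call to SparkContext's stop method. The view is stored in the global_temp database, so queries refer to it as global_temp.<name>.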
createOrReplaceGlobalTempView(name)
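The method above raises an error when a view with the same name already exists. This method instead replaces an existing view, or creates the view if it does not exist yet.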
createOrReplaceTempView(name)
Creates a local temporary view from the DataFrame. Its lifetime is limited to the current SparkSession: once the SparkSession's stop method is called, the view is gone.
createTempView(name)
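This method raises an error if a view with the given name has already been created, so a different name has to be chosen.
# The syntax is the same for all of these methods
df.createOrReplaceTempView("table_name")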
"""Computes basic statistics for numeric and string columns.
This include count, mean, stddev, min, and max. If no columns are
given, this function computes statistics for all numerical or string columns.
.. note:: This function is meant for exploratory data analysis, as we make no
guarantee about the backward compatibility of the schema of the resulting DataFrame.
Use summary for expanded statistics and control over which statistics to compute.
"""
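Computes basic statistics for the given columns: count, max, min, mean and standard deviation. If no columns are given, the statistics are computed for all numeric and string columns.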
df.describe("grade1", "grade2").show()
Returns the distinct (non-duplicate) rows of the DataFrame.
"""Returns a new :class:`DataFrame` containing the distinct rows in this :class:`DataFrame`.
>>> df.distinct().count()
2
"""
"""Returns a new :class:`DataFrame` that drops the specified column.
This is a no-op if schema doesn't contain the given column name(s).
:param cols: a string name of the column to drop, or a
:class:`Column` to drop, or a list of string name of the columns to drop.
"""
Drops columns from the DataFrame by name and returns a new DataFrame.
df.drop("id").drop("name").show()
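Drops duplicate rows. subset specifies which columns are taken into account when deciding whether two rows are duplicates.
subset: a list of column names
# Before dropping duplicates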
df.show()
# After dropping duplicates
df.dropDuplicates(["id", "name"]).show()
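The effect of subset can be sketched in plain Python on a list of dicts. This is only a model of the idea, not Spark's implementation: `drop_duplicates` and the sample rows are hypothetical, and the sketch keeps the first row it sees for each key, whereas Spark does not promise which of the duplicate rows survives.

```python
def drop_duplicates(rows, subset=None):
    """Keep one row per distinct combination of the subset columns."""
    seen = set()
    kept = []
    for row in rows:
        # Without a subset, every column takes part in the comparison.
        cols = subset if subset is not None else sorted(row)
        key = tuple(row[c] for c in cols)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Hypothetical rows: the first two agree on id and name but differ in age.
rows = [
    {"id": 1, "name": "a", "age": 20},
    {"id": 1, "name": "a", "age": 21},
    {"id": 2, "name": "b", "age": 20},
]
print(len(drop_duplicates(rows)))                  # 3, no fully identical rows
print(len(drop_duplicates(rows, ["id", "name"])))  # 2, duplicates on (id, name)
```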
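Drops rows containing null (na) values from the DataFrame. The keyword argument how controls the rule and accepts "any" or "all". thresh sets the minimum number of non-null values a row must have to be kept: rows with fewer than thresh non-null values are dropped, and this setting overrides how. subset restricts the null check to the listed columns.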
"""Returns a new :class:`DataFrame` omitting rows with null values.
:func:`DataFrame.dropna` and :func:`DataFrameNaFunctions.drop` are aliases of each other.
:param how: 'any' or 'all'.
If 'any', drop a row if it contains any nulls.
If 'all', drop a row only if all its values are null.
:param thresh: int, default None
If specified, drop rows that have less than `thresh` non-null values.
This overwrites the `how` parameter.
:param subset: optional list of column names to consider.
"""
df.show()
df.dropna(subset=["age","grade1"]).show()
# The list ["age", "grade1"] above can also be passed as a tuple
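How the three parameters interact is easy to get wrong, so here is a small plain-Python model of the rules stated in the docstring. It is a sketch under assumptions, not Spark code: `dropna_rows` and the sample rows are hypothetical, and only None is treated as null.

```python
def dropna_rows(rows, how="any", thresh=None, subset=None):
    """Model of the how / thresh / subset rules on a list of dicts."""
    kept = []
    for row in rows:
        cols = subset if subset is not None else list(row)
        non_null = sum(row[c] is not None for c in cols)
        if thresh is not None:
            # thresh overrides how: keep rows with at least thresh non-nulls.
            keep = non_null >= thresh
        elif how == "any":
            keep = non_null == len(cols)  # drop if any value is null
        elif how == "all":
            keep = non_null > 0           # drop only if every value is null
        else:
            raise ValueError("how must be 'any' or 'all'")
        if keep:
            kept.append(row)
    return kept


# Hypothetical rows with zero, one and two nulls.
rows = [
    {"age": 20, "grade1": 90},
    {"age": None, "grade1": 85},
    {"age": None, "grade1": None},
]
print(len(dropna_rows(rows, how="any")))             # 1
print(len(dropna_rows(rows, how="all")))             # 2
print(len(dropna_rows(rows, how="any", thresh=1)))   # 2, thresh wins over how
print(len(dropna_rows(rows, subset=["grade1"])))     # 2
```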
@property
Returns all column names and their data types as a list.
This is the dtypes property: it returns a list of (column name, data type) tuples for the DataFrame's columns.
"""Replace null values, alias for ``na.fill()``.
:func:`DataFrame.fillna` and :func:`DataFrameNaFunctions.fill` are aliases of each other.
:param value: int, long, float, string, bool or dict.
Value to replace null values with.
If the value is a dict, then `subset` is ignored and `value` must be a mapping
from column name (string) to replacement value. The replacement value must be
an int, long, float, boolean, or string.
:param subset: optional list of column names to consider.
Columns specified in subset that do not have matching data type are ignored.
For example, if `value` is a string, and subset contains a non-string column,
then the non-string column is simply ignored."""
Fills null values in the DataFrame.
df.show()
df.fillna(str(1000)).show()
df1 = df.withColumn("grade1", df["grade1"].cast("int"))
df1.fillna(999, ["grade1"]).show()
df.fillna({"grade1": "哈哈哈", "grade2": "嘿嘿嘿"}).show()
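The examples above show that the type of the fill value has to match the type of the column that contains the nulls. (int and double are compatible with each other, but neither is compatible with String.)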
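The type-matching rule can be modelled in a few lines of plain Python. This is a simplified sketch, not Spark's implementation: `fill_value_applies`, `fillna_rows`, the schema and the rows are all hypothetical, the type names are just labels, and the sketch does not cast the fill value to the column type.

```python
NUMERIC_TYPES = {"int", "bigint", "float", "double"}


def fill_value_applies(value, column_type):
    """Decide whether a fill value is compatible with a column type label."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return column_type == "boolean"
    if isinstance(value, (int, float)):
        return column_type in NUMERIC_TYPES
    if isinstance(value, str):
        return column_type == "string"
    return False


def fillna_rows(rows, schema, value, subset=None):
    """Fill None values only in columns whose type matches the fill value."""
    cols = subset if subset is not None else list(schema)
    filled = []
    for row in rows:
        new_row = dict(row)
        for c in cols:
            if new_row[c] is None and fill_value_applies(value, schema[c]):
                new_row[c] = value
        filled.append(new_row)
    return filled


# Hypothetical schema and data: one string column, one int column.
schema = {"name": "string", "grade1": "int"}
rows = [{"name": None, "grade1": None}]
print(fillna_rows(rows, schema, "1000"))  # only the string column is filled
print(fillna_rows(rows, schema, 999))     # only the numeric column is filled
```

In this model, columns whose type does not match are left untouched rather than raising an error, which mirrors the docstring's statement that non-matching columns in subset are ignored.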