PySpark UDF Internals

In PySpark, pandas user-defined functions are converted into UDFs (user-defined functions).

Scalar pandas UDFs

Scalar pandas UDFs are used to vectorize scalar operations. The Python function should take pandas.Series as inputs and return a pandas.Series of the same length.
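As a minimal sketch, the vectorized function itself is plain pandas and can be tried on its own; the Spark registration shown in the comments assumes a running SparkSession and a DataFrame `df` with a column `x`:

```python
import pandas as pd

# The scalar function: pandas.Series in, pandas.Series of the same length out.
def cubed(s: pd.Series) -> pd.Series:
    return s ** 3

# Registering it as a scalar pandas UDF in Spark would look like (sketch, not run here):
#   from pyspark.sql.functions import pandas_udf
#   cubed_udf = pandas_udf(cubed, returnType="long")
#   df.select(cubed_udf(df["x"])).show()

print(list(cubed(pd.Series([1, 2, 3]))))  # → [1, 8, 27]
```

Because the function is applied to whole Series at a time (one Arrow batch per call), it avoids the per-row Python call overhead of a classic row-at-a-time UDF.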

Grouped map UDFs

You use grouped map pandas UDFs with groupBy().apply() to implement the "split-apply-combine" pattern, which consists of three steps:

  • Split the data into groups by using DataFrame.groupBy.
  • Apply a function on each group. The input and output of the function are both pandas.DataFrame. The input data contains all the rows and columns for each group.
  • Combine the results into a new DataFrame.

To use groupBy().apply(), you must define the following:

  • A Python function that defines the computation for each group
  • A StructType object or a string that defines the schema of the output DataFrame
    The column labels of the returned pandas.DataFrame must either match the field names in the defined output schema if specified as strings, or match the field data types by position if not strings, for example, integer indices. See pandas.DataFrame for how to label columns when constructing a pandas.DataFrame.
    All data for a group is loaded into memory before the function is applied. This can lead to out of memory exceptions, especially if the group sizes are skewed. The configuration for maxRecordsPerBatch is not applied on groups and it is up to you to ensure that the grouped data will fit into the available memory.

The following example shows how to use groupby().apply() to subtract the mean from each value in the group.
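A minimal sketch of that example: the per-group logic is a plain pandas function and runs on its own, while the Spark wiring (shown in comments, assuming a SparkSession bound to `spark`) registers it as a grouped map UDF:

```python
import pandas as pd

# Per-group computation: receives all rows of one group as a pandas.DataFrame
# and must return a pandas.DataFrame matching the declared output schema.
def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.assign(v=pdf.v - pdf.v.mean())

# Spark side (sketch, not executed here):
#   from pyspark.sql.functions import pandas_udf, PandasUDFType
#   df = spark.createDataFrame(
#       [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))
#   udf = pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)(subtract_mean)
#   df.groupby("id").apply(udf).show()

# Trying the group function directly on one "group" (mean of v is 6.0):
group = pd.DataFrame({"id": [2, 2, 2], "v": [3.0, 5.0, 10.0]})
print(subtract_mean(group).v.tolist())  # → [-3.0, -1.0, 4.0]
```

Note that, as stated above, the whole group (`pdf`) is materialized in memory before `subtract_mean` runs, so heavily skewed groups can exhaust worker memory.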

pandas UDF usage examples:
https://docs.databricks.com/spark/latest/spark-sql/udf-python-pandas.html

Arrow

Arrow has two key properties:

  • Cross-language
    pandas is a Python data-analysis library whose columnar data layout (Figure 5) is similar to Arrow's (Figure 6): Arrow organizes columns into RecordBatches, while pandas uses a BlockManager to manage Blocks, so converting between an Arrow table and a pandas DataFrame is straightforward.
    [Image: columnar layouts of pandas and Arrow]
  • Columnar in-memory layout
    Given the benefits of columnar storage described above, Arrow is an efficient, CPU-friendly in-memory data format.

The role of Arrow in PySpark

Before Arrow, moving data between languages required serialization and deserialization, which incurred heavy overhead. Because Arrow's memory layout is simple, efficient, and shared across language implementations, data can be exchanged between languages without that conversion step, enabling fast cross-language data transfer.
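In PySpark this surfaces as the Arrow-based transfer path for `toPandas()` and pandas UDFs. A configuration sketch (assumes a Spark 3.x session bound to `spark` and a DataFrame `df`; not runnable on its own):

```python
# Enable Arrow-based columnar transfer between the JVM and Python workers.
# (Spark 3.x config key; Spark 2.x used "spark.sql.execution.arrow.enabled".)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# With Arrow enabled, toPandas() ships columnar record batches instead of
# pickling rows one by one; pandas UDFs exchange data the same way.
pdf = df.toPandas()
```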

The role of Arrow in PySpark:
http://www.zdingke.com/2019/12/31/arrow%E5%9C%A8pyspark%E4%B8%AD%E7%9A%84%E4%BD%9C%E7%94%A8/
