(Repost) A Detailed Comparison of DataFrames in Spark and Pandas

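The first DataFrame I worked with was the one in pandas, which is very convenient for time-series data, but I was never clear on how it differs from the Spark DataFrame.

Markdown tables were awkward to work with here, so the comparison is given as screenshots.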

[Images 1 to 4: screenshots of the Spark vs. Pandas DataFrame comparison (not reproduced here)]

Reposted from: http://www.lining0806.com/spark与pandas中dataframe对比/

An example of the diff() operation follows:

1. Invoke ipython console -profile=pyspark:

In [1]: from pyspark import SparkConf, SparkContext, SQLContext

In [2]: import pandas as pd

In [3]: sqlCtx = SQLContext(sc)

2. Computing diff on a column in Pandas:

In [4]: df = sqlCtx.createDataFrame([(1, 4), (1, 5), (2, 6), (2, 6), (3, 0)], ["A", "B"])

In [5]: pdf = df.toPandas()

In [6]: pdf

Out[6]:
   A  B
0  1  4
1  1  5
2  2  6
3  2  6
4  3  0

In [7]: pdf['diff'] = pdf.B.diff()

In [8]: pdf

Out[8]:
   A  B  diff
0  1  4   NaN
1  1  5     1
2  2  6     1
3  2  6     0
4  3  0    -6
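To make it clearer what diff() computes, the following is a minimal pandas-only sketch that rebuilds the same data directly (rather than through toPandas()) and checks that diff() matches subtracting a copy of the column shifted down by one row. The names via_diff and via_shift are illustrative only and do not appear in the original session.

```python
import pandas as pd

# Same data as the example above, built directly in pandas.
pdf = pd.DataFrame({"A": [1, 1, 2, 2, 3], "B": [4, 5, 6, 6, 0]})

# diff() subtracts the previous row's value from the current row's value.
via_diff = pdf["B"].diff()

# The same result, written as the column minus a copy shifted down by one row.
via_shift = pdf["B"] - pdf["B"].shift(1)

print(via_diff.tolist())
print(via_diff.equals(via_shift))
```

The first row has no predecessor, so its difference is NaN, and the result column becomes floating point.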

3. Computing diff on a column given a specific key using the Window operation:

In [9]: from pyspark.sql import functions as F
   ...: from pyspark.sql.window import Window

In [10]: window_over_A = Window.partitionBy("A").orderBy("B")

In [11]: df.withColumn("diff", F.lead("B").over(window_over_A) - df.B).show()

+---+---+----+
|  A|  B|diff|
+---+---+----+
|  1|  4|   1|
|  1|  5|null|
|  2|  6|   0|
|  2|  6|null|
|  3|  0|null|
+---+---+----+
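Note that the window example uses lead(), so each row holds the next value minus the current one within its key, which is not the same direction as pandas diff(). As a rough pandas-side comparison, the sketch below reproduces the per-key computation with groupby() and shift(-1), and also shows the per-key diff() for contrast. The column names lead_diff and group_diff are made up for this illustration.

```python
import pandas as pd

pdf = pd.DataFrame({"A": [1, 1, 2, 2, 3], "B": [4, 5, 6, 6, 0]})

# Mirror the window definition: partition by A, order by B.
pdf = pdf.sort_values(["A", "B"]).reset_index(drop=True)

# Next value within each A group minus the current value (the lead()-style result).
pdf["lead_diff"] = pdf.groupby("A")["B"].shift(-1) - pdf["B"]

# Current value minus the previous value within each A group (per-key diff()).
pdf["group_diff"] = pdf.groupby("A")["B"].diff()

print(pdf)
```

In lead_diff the last row of each group is NaN, matching the null entries in the Spark output above; in group_diff it is the first row of each group that is NaN.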

Reposted from: https://blog.csdn.net/ljp812184246/article/details/77678591
