10+ pandas functions commonly used by a JD.com data mining engineer

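Having used SQL and Spark heavily, I wanted to collect the common pandas DataFrame equivalents. This list is meant for filling gaps and quick recall; it is not suitable for beginners who know no SQL or Scala and have never worked with any data.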

df is the example DataFrame throughout.

import pandas as pd

1. Convert the column names to list[str]

pandas: list(df)

spark-scala: df.columns.toSeq
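A minimal sketch of the pandas side, assuming a small made-up DataFrame (the column names `userId` and `tag` are only for illustration). It shows that `list(df)` works because iterating a DataFrame yields its column labels, and that `df.columns.tolist()` is a more explicit spelling of the same thing.

```python
import pandas as pd

# Illustrative data; the column names are assumptions for this example.
df = pd.DataFrame({"userId": [1, 2], "tag": ["a", "b"]})

cols_a = list(df)             # iterating a DataFrame yields its column labels
cols_b = df.columns.tolist()  # explicit equivalent

print(cols_a)
```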

2. Copy a df

pd: df_new = df.copy()

scala: val df_new = df (Spark DataFrames are immutable, so plain assignment is enough)
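Unlike in Spark, plain assignment in pandas only creates a second name for the same mutable object, which is why `copy()` matters. A small sketch with made-up data (the name `alias` is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"v": [1, 2, 3]})

alias = df          # same object under a second name, not a copy
df_new = df.copy()  # independent copy (deep=True is the default)

df_new.loc[0, "v"] = 100  # changes only the copy
alias.loc[1, "v"] = 200   # changes df as well, because alias is df
```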

3. Add, subtract, multiply, or divide a column by a constant

pd: df['v'] = df['v']+1

scala: df.withColumn("v", $"v" + 1)
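If you want something closer to `withColumn`, which returns a new DataFrame instead of modifying the existing one, `DataFrame.assign` is one option. A sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({"v": [1, 2, 3]})

# assign returns a new DataFrame and leaves df untouched
df_plus = df.assign(v=df["v"] + 1)
df_half = df.assign(v=df["v"] / 2)  # true division yields floats
```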

4.union

pd: df_new = pd.concat([df1, df2])

scala: val df_new = df1.union(df2)
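Two details are easy to miss when coming from Spark: `pd.concat` keeps each input's original index unless told otherwise, and it aligns columns by name, whereas Spark's `union` matches columns by position. A sketch of the index behavior with made-up frames:

```python
import pandas as pd

df1 = pd.DataFrame({"v": [1, 2]})
df2 = pd.DataFrame({"v": [3]})

stacked = pd.concat([df1, df2])                        # index: 0, 1, 0
renumbered = pd.concat([df1, df2], ignore_index=True)  # index: 0, 1, 2
```

Passing `ignore_index=True` is usually what you want when the index carries no meaning.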

5. Create an empty DataFrame with specified columns

python - Pandas create empty DataFrame with only column names - Stack Overflow

pd:

df = pd.DataFrame(columns=['A','B','C','D','E','F','G'])
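An empty DataFrame created from column names alone has no meaningful column types. If you need something closer to a Spark schema, one possible approach is to build it from empty typed Series. The column names and dtypes below are assumptions chosen for illustration:

```python
import pandas as pd

# Names only: zero rows, and the columns get a default dtype
empty = pd.DataFrame(columns=["A", "B"])

# Names plus dtypes, built from empty typed Series
typed = pd.DataFrame({
    "A": pd.Series(dtype="int64"),
    "B": pd.Series(dtype="float64"),
})
```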

6. Get the maximum value of a column

python - Find maximum value of a column and return the corresponding row values using Pandas - Stack Overflow

pd: df['v'].max()

scala: df.select(max("v"))
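The linked Stack Overflow question also asks for the row that holds the maximum. A sketch using `idxmax`, with made-up data (the `name` column is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c"], "v": [3, 9, 5]})

max_v = df["v"].max()               # the maximum value itself
max_row = df.loc[df["v"].idxmax()]  # the row (as a Series) holding that maximum
```

`idxmax` returns the index label of the first maximum, so ties resolve to the earliest row.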

7. Deduplicate

python - How to "select distinct" across multiple data frame columns in pandas? - Stack Overflow

pd: df.drop_duplicates()

scala: df.distinct()

ps: pandas unique() works only on a single column (a Series); look it up if you are interested.
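To make the difference concrete, here is a sketch with illustrative data: `drop_duplicates` on whole rows, on a subset of columns, and `unique` on a single column.

```python
import pandas as pd

df = pd.DataFrame({
    "userId": [1, 1, 2, 2],
    "tag": ["a", "a", "a", "b"],
})

distinct_rows = df.drop_duplicates()                    # distinct over all columns
distinct_users = df.drop_duplicates(subset=["userId"])  # keeps the first row per userId
unique_tags = df["tag"].unique()                        # distinct values of one column
```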

8. Filter by condition

python - How do I select rows from a DataFrame based on column values? - Stack Overflow

pd: df_new = df.loc[df['value'] == 1]

scala: val df_new = df.filter($"value" === 1)
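For more than one condition, each comparison needs its own parentheses and the conditions are combined with `&` or `|` rather than `and` / `or`. `DataFrame.query` offers a more SQL-like spelling. A sketch with made-up data (the `tag` column is an assumption):

```python
import pandas as pd

df = pd.DataFrame({"value": [1, 2, 1], "tag": ["a", "b", "c"]})

ones = df.loc[df["value"] == 1]

# Combine boolean masks with & / | and parenthesize each comparison
both = df.loc[(df["value"] == 1) & (df["tag"] != "a")]

# The same filter written as a query string
same = df.query("value == 1 and tag != 'a'")
```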

9. Convert a column to a list

python - Get list from pandas dataframe column or row? - Stack Overflow

pd:

df['one'].tolist()

scala:

df.select("one").collect.map(t=>t(0)) 

10. Group the data and compute the per-group max, min, and so on (groupby agg)

python - Pandas DataFrame find the max after Groupby two columns and get counts - Stack Overflow

pd: 

dff = df.groupby(['userId', 'tag'], as_index=False)['pageId'].count()

scala: val dff = df.groupBy("userId", "tag").agg(count("pageId"))

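ps: as_index=False is needed here so that userId and tag stay as regular columns instead of becoming the index. Note that count() counts non-null values, so it matches Spark's count("pageId"), not countDistinct; use nunique() for a distinct count.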
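A sketch contrasting `count` with `nunique`, plus named aggregation for several statistics at once. The data is made up, and the output column names `page_max` and `page_min` are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "userId": [1, 1, 1, 2],
    "tag": ["a", "a", "a", "b"],
    "pageId": [10, 10, 11, 20],
})

keys = ["userId", "tag"]

# Non-null count per group, like Spark's count("pageId")
counts = df.groupby(keys, as_index=False)["pageId"].count()

# Distinct count per group, like Spark's countDistinct("pageId")
distinct = df.groupby(keys, as_index=False)["pageId"].nunique()

# Several aggregations at once with named aggregation
stats = df.groupby(keys, as_index=False).agg(
    page_max=("pageId", "max"),
    page_min=("pageId", "min"),
)
```

For the group (1, "a") the count is 3 while the distinct count is 2, because pageId 10 appears twice.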

11. Rename a column

https://datascienceparichay.com/article/pandas-rename-column-names/

pd: df.rename(columns={"OldName": "NewName"})

scala: df.withColumnRenamed("oldName","newName")
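As with `withColumnRenamed`, `rename` returns a new DataFrame by default, so the result has to be assigned. A sketch with illustrative data (the name `Missing` is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"OldName": [1, 2]})

renamed = df.rename(columns={"OldName": "NewName"})  # df itself is unchanged

# Keys that match no column are ignored by default (errors="ignore")
unchanged = df.rename(columns={"Missing": "X"})
```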
