I'm already fluent in SQL and Spark, so I'm collecting the common pandas DataFrame equivalents here. This is a quick-recall reference for filling gaps, not a tutorial for readers who know no SQL or Scala and have never worked with data.
In the examples below, df is the sample DataFrame.
import pandas as pd
1. Convert column names to list[str]
pandas: list(df)
spark-scala: df.columns.toSeq
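A minimal runnable sketch of the pandas side (the toy data here is mine, purely for illustration):

```python
import pandas as pd

# toy frame, just for illustration
df = pd.DataFrame({"a": [1], "b": [2]})

cols = list(df)              # ['a', 'b']
# equivalent, and arguably more explicit:
cols2 = df.columns.tolist()  # ['a', 'b']
```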
2. Copy a df
pd: df_new = df.copy()
scala: val df_new = df (Spark DataFrames are immutable, so plain assignment is enough)
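In pandas, plain assignment only creates another reference, so copy() matters. A quick sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"v": [1, 2]})
df_new = df.copy()              # deep copy by default
df_new["v"] = df_new["v"] + 10  # mutate the copy only

# df still holds [1, 2]; df_new holds [11, 12]
```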
3. Add/subtract/multiply/divide a column by a constant
pd: df['v'] = df['v']+1
scala: df.withColumn("v", $"v" + 1)
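The pandas operation is element-wise, just like Spark's column expression. A runnable sketch (toy data is my own):

```python
import pandas as pd

df = pd.DataFrame({"v": [1, 2, 3]})
df["v"] = df["v"] + 1   # element-wise; same idea for -, *, /
# df["v"] is now [2, 3, 4]
```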
4.union
pd: df_new = pd.concat([df1, df2])
scala: df_new = df1.union(df2)
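One pandas-specific wrinkle worth a sketch (toy frames are mine): concat keeps each input's row index by default, which can produce duplicate labels; ignore_index=True renumbers the result.

```python
import pandas as pd

df1 = pd.DataFrame({"v": [1]})
df2 = pd.DataFrame({"v": [2]})

# ignore_index=True re-numbers rows; without it both rows
# would keep index 0 and the labels would collide
df_new = pd.concat([df1, df2], ignore_index=True)
```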
5. Create an empty dataframe with specified columns
python - Pandas create empty DataFrame with only column names - Stack Overflow
pd:
df = pd.DataFrame(columns=['A','B','C','D','E','F','G'])
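For completeness, the resulting frame has the schema in place but zero rows:

```python
import pandas as pd

df = pd.DataFrame(columns=["A", "B", "C"])
# zero rows, but the column names are already set
n_rows = len(df)        # 0
cols = list(df)         # ['A', 'B', 'C']
```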
6. Get the max value of a column
python - Find maximum value of a column and return the corresponding row values using Pandas - Stack Overflow
pd: df['v'].max()
scala: df.select(max("v"))
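Since the linked Stack Overflow question also asks for the row holding the max, here is a sketch of both (toy data is mine):

```python
import pandas as pd

df = pd.DataFrame({"v": [3, 7, 5]})

m = df["v"].max()                 # 7
row = df.loc[df["v"].idxmax()]    # the whole row containing the max
```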
7. Deduplicate
python - How to "select distinct" across multiple data frame columns in pandas? - Stack Overflow
pd: df.drop_duplicates()
scala:df.distinct()
ps: pandas' unique() only works on a single column (Series); search for details if interested.
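A sketch showing both whole-row dedup and dedup on a subset of columns (toy data is mine):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

dedup = df.drop_duplicates()                 # compares all columns
dedup_a = df.drop_duplicates(subset=["a"])   # compares column a only
```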
8. Filter by condition
python - How do I select rows from a DataFrame based on column values? - Stack Overflow
pd: df_new = df.loc[df['value'] == 1]
scala: df_new = df.filter($"value" === 1)
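A runnable sketch (toy data is mine); note that in pandas, compound conditions need & / | with parentheses, not Python's and/or:

```python
import pandas as pd

df = pd.DataFrame({"value": [1, 2, 1]})

df_new = df.loc[df["value"] == 1]
# combine conditions with & / | and parentheses
df_both = df.loc[(df["value"] == 1) | (df["value"] == 2)]
```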
9. Convert a column to a list
python - Get list from pandas dataframe column or row? - Stack Overflow
pd:
df['one'].tolist()
scala
df.select("one").collect.map(t=>t(0))
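The pandas side in runnable form (toy data is mine):

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3]})
vals = df["one"].tolist()   # [1, 2, 3]
```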
10. Group the data, then aggregate within each group (max, min, etc.), i.e. groupby agg
python - Pandas DataFrame find the max after Groupby two columns and get counts - Stack Overflow
pd:
dff = df.groupby(['userId', 'tag'], as_index=False)['pageId'].count()
scala: val dff = df.groupBy("userId", "tag").agg(count("pageId"))
ps: be sure to pass as_index=False here, otherwise the group keys become the index instead of columns. Also note that count() counts non-null values per group in both pandas and Spark; it is not countDistinct (in pandas, use nunique() for a distinct count).
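A small end-to-end sketch (the toy userId/tag/pageId data is mine):

```python
import pandas as pd

df = pd.DataFrame({
    "userId": [1, 1, 2],
    "tag":    ["a", "a", "b"],
    "pageId": [10, 11, 12],
})

# count() counts non-null pageId values per (userId, tag) group;
# as_index=False keeps userId/tag as regular columns
dff = df.groupby(["userId", "tag"], as_index=False)["pageId"].count()

# for a distinct count (Spark's countDistinct), use .nunique() instead
```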
11. Rename columns
https://datascienceparichay.com/article/pandas-rename-column-names/
pd:df.rename(columns={"OldName":"NewName"})
scala: df.withColumnRenamed("oldName","newName")
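One detail worth showing: pandas' rename returns a new frame by default and leaves the original untouched (toy data is mine):

```python
import pandas as pd

df = pd.DataFrame({"OldName": [1]})
df2 = df.rename(columns={"OldName": "NewName"})  # returns a new frame
# df still has "OldName"; df2 has "NewName"
```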