Applying a function to each group in pandas with apply

GroupBy.apply(function)
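  • The first argument of function is a DataFrame holding the rows of one group
  • function may return a DataFrame, a Series, or a single value; the result does not have to resemble the input DataFrame at all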
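
The three kinds of return value can be illustrated with a minimal sketch. The `team`/`score` data below is hypothetical and not taken from the datasets used later; the `score` column is selected explicitly so that each group arrives as a small DataFrame without the grouping key.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({
    "team": ["a", "a", "b", "b", "b"],
    "score": [1, 3, 10, 20, 30],
})

# Select the columns to operate on; each group is then a one-column DataFrame
grouped = df.groupby("team")[["score"]]

# Function returns a single value -> result is a Series indexed by team
score_range = grouped.apply(lambda g: g["score"].max() - g["score"].min())

# Function returns a Series -> result is a DataFrame with one row per team
summary = grouped.apply(
    lambda g: pd.Series({"low": g["score"].min(), "high": g["score"].max()})
)

# Function returns a DataFrame -> the per-group pieces are concatenated
top1 = grouped.apply(lambda g: g.nlargest(1, "score"))

print(score_range)
print(summary)
print(top1)
```

In all three cases pandas combines the per-group results for you; only the shape of the combined object changes.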
How do you normalize a numeric column within each group?

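Min-max normalization rescales numeric columns with different ranges onto the interval [0, 1]:

  • Columns become easier to compare side by side, for example a price column ranging from a few hundred to a few thousand and a growth-rate column ranging from 0 to 100
  • Machine learning models often train faster and perform better on normalized features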
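
As a small sketch of the idea, each value x is mapped to (x - min) / (max - min). The helper name `min_max` and the price and growth figures below are hypothetical, chosen only to mirror the example ranges above.

```python
import pandas as pd

def min_max(s: pd.Series) -> pd.Series:
    """Rescale a numeric Series onto [0, 1] (hypothetical helper)."""
    span = s.max() - s.min()
    # If every value is identical, span is 0 and the result is NaN
    return (s - s.min()) / span

# Hypothetical values on very different scales
prices = pd.Series([200, 1100, 2000])
growth = pd.Series([0, 25, 100])

price_norm = min_max(prices)    # 0.0, 0.5, 1.0
growth_norm = min_max(growth)   # 0.0, 0.25, 1.0

print(price_norm.tolist())
print(growth_norm.tolist())
```

After rescaling, both columns live on the same [0, 1] scale and can be compared directly.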
1. Data preparation
import pandas as pd
ratings = pd.read_csv(
    r"D:\node\nd\Pandas_study\pandas_test\ratings.dat",
    sep="::",
    engine='python',
    names="UserID::MovieID::Rating::Timestamp".split("::")
)
print(ratings.head())
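def ratings_norm(df):
    """
    Min-max normalize the Rating column within one group.

    :param df: the DataFrame holding one user's ratings
    :return: a copy of df with an added Rating_norm column
    """
    df = df.copy()  # avoid mutating the group object that pandas passes in
    min_value = df["Rating"].min()
    max_value = df["Rating"].max()
    # Vectorized min-max scaling; if a user gave every movie the same
    # rating, max == min and Rating_norm is NaN for that user
    df["Rating_norm"] = (df["Rating"] - min_value) / (max_value - min_value)
    return df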

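# group_keys=False keeps the original row index instead of adding UserID as an index level.
# Note: pandas 2.2 warns that apply operating on the grouping column (UserID) is deprecated,
# so on later versions the UserID column may no longer be passed to the function.
ratings = ratings.groupby("UserID", group_keys=False).apply(ratings_norm)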
a = ratings[ratings["UserID"] == 1].head()
print(a)
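
For comparison, the same per-user normalization can be sketched with `GroupBy.transform`, which returns values aligned to the original rows. This is an alternative to the apply-based approach above, not the method the article uses, and the `toy` data is hypothetical.

```python
import pandas as pd

# Hypothetical ratings for two users
toy = pd.DataFrame({
    "UserID": [1, 1, 1, 2, 2],
    "Rating": [1, 3, 5, 2, 4],
})

g = toy.groupby("UserID")["Rating"]
g_min = g.transform("min")  # each row gets its own user's minimum
g_max = g.transform("max")  # each row gets its own user's maximum

# Same min-max formula, computed for all rows at once
toy["Rating_norm"] = (toy["Rating"] - g_min) / (g_max - g_min)

print(toy)
```

Because `transform` preserves the original index and leaves the `UserID` column untouched, the result can be assigned straight back as a new column.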
Getting the top N rows of each group
fpath = r"D:\node\nd\Pandas_study\pandas_test\beijing_tianqi_2018.csv"
df = pd.read_csv(fpath)
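# Strip the ℃ suffix from the temperature columns and convert them to integers.
# Assigning with df[col] = ... replaces the column, so the int32 dtype is kept.
df["bWendu"] = df["bWendu"].str.replace("℃", "").astype('int32')
df["yWendu"] = df["yWendu"].str.replace("℃", "").astype('int32')

# Add a month column (the first 7 characters of ymd)
df["month"] = df["ymd"].str[:7]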
print(df.head())
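def getWenduTopN(df,topn):
    """
    Return the topn rows with the largest bWendu values in one group.

    :param df: the DataFrame for one month's group
    :param topn: number of rows to keep
    :return: the ymd and bWendu columns of those rows, in ascending bWendu order
    """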
    return df.sort_values(by = "bWendu")[["ymd","bWendu"]][-topn:]
print(df.groupby("month").apply(getWenduTopN,topn = 2).head())
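
A similar top-N result can be sketched without a custom function by sorting first and then calling `GroupBy.head`. This is an alternative to the apply-based approach above; the `toy` temperatures are hypothetical and only reuse the article's column names.

```python
import pandas as pd

# Hypothetical daily temperatures, reusing the column names ymd and bWendu
toy = pd.DataFrame({
    "ymd": ["2018-01-01", "2018-01-02", "2018-01-03", "2018-02-01", "2018-02-02"],
    "bWendu": [3, 6, 2, 8, 5],
})
toy["month"] = toy["ymd"].str[:7]

top2 = (
    toy.sort_values("bWendu", ascending=False)  # largest values first
       .groupby("month")
       .head(2)                                 # first 2 rows of each month
       .sort_values(["month", "bWendu"], ascending=[True, False])
)

print(top2[["month", "ymd", "bWendu"]])
```

Unlike the apply version, this keeps the original flat row index rather than producing a month-level index, which may or may not be what you want.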