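Notes on Basic Pandas Usage

I do data analysis and quantitative data validation from time to time, and every time I use pandas I end up searching Baidu for the same things,
so I am writing down some commonly used tips here as a reminder.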

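Operating on rows and columns by condition

For example, to set every value below 0 to 0: data['influence']=data['influence'].map(lambda x: x if x >0 else 0)
To select the rows that meet a condition: df[df['ticker']==tk] returns the rows where ticker==tk.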
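
A minimal sketch of both operations on a toy frame. The sample values and the value of tk are invented for illustration, and Series.clip(lower=0) is shown only as a vectorized alternative that gives the same result on this data.

```python
import pandas as pd

# Toy data; the values are invented for illustration.
data = pd.DataFrame({
    "ticker": ["AAA", "BBB", "AAA", "CCC"],
    "influence": [0.5, -1.2, 0.0, 3.0],
})

# Element-wise version from the note: values that are not positive become 0.
mapped = data["influence"].map(lambda x: x if x > 0 else 0)

# Vectorized alternative with the same effect on this data.
clipped = data["influence"].clip(lower=0)

# Boolean mask: keep only the rows whose ticker equals tk.
tk = "AAA"
rows = data[data["ticker"] == tk]
print(rows)
```

Note that the two versions differ on missing values: the lambda turns NaN into 0, while clip leaves NaN in place.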

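Merge or join

With join on two frames that both contain a ticker column, the overlapping column has to be disambiguated by hand through lsuffix/rsuffix, so the new DataFrame ends up with two columns such as ticker_left and ticker_right.
With merge there is only one ticker column, which is more practical: merged = pd.merge(graph_fs,market_pd,on='ticker',how='inner')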
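
To make the difference concrete, here is a small sketch with two invented frames. The column names score and close and all values are assumptions; only ticker, graph_fs and market_pd come from the note.

```python
import pandas as pd

# Invented sample frames that share a ticker column.
graph_fs = pd.DataFrame({"ticker": ["AAA", "BBB", "CCC"], "score": [1, 2, 3]})
market_pd = pd.DataFrame({"ticker": ["BBB", "CCC", "DDD"], "close": [10.0, 20.0, 30.0]})

# merge: a single ticker column; the inner join keeps only tickers present in both.
merged = pd.merge(graph_fs, market_pd, on="ticker", how="inner")

# join aligns on the index, so the overlapping ticker column needs suffixes.
joined = graph_fs.join(market_pd, lsuffix="_left", rsuffix="_right")

print(merged.columns.tolist())
print(joined.columns.tolist())
```

Because join matches rows by index position here rather than by ticker value, the two ticker columns in joined generally do not agree row by row, which is another reason to prefer merge for this case.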

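Dropping rows and columns, transposing

Drop columns: p1.drop(columns=['product','Unnamed: 0'],inplace = True)
Add a new row or column from a statistic: fs['min']=fs.min(axis=1) adds a column holding the minimum of each row (axis=0 would give one minimum per column).
Rename the columns: fs.columns=['p','v','p_v']
Transpose rows and columns: fs=p1.T, where T is the transpose.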
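
The four steps can be chained on a toy frame as follows. The order (drop, transpose, rename, then add the statistic) and the sample columns d1 and d2 are my assumptions; the note only lists the individual calls.

```python
import pandas as pd

# Invented input: two helper columns plus two data columns, three rows.
p1 = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "product": ["x", "x", "x"],
    "d1": [3.0, 1.0, 4.0],
    "d2": [2.0, 5.0, 0.5],
})

# Drop the columns that are not needed.
p1.drop(columns=["product", "Unnamed: 0"], inplace=True)

# Transpose: the rows d1/d2 now index the frame, with three unnamed columns.
fs = p1.T

# Give the three columns readable names.
fs.columns = ["p", "v", "p_v"]

# Row-wise minimum stored as a new column.
fs["min"] = fs.min(axis=1)
print(fs)
```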

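Building a DataFrame

From a CSV file: read_csv, where the dtype parameter can set the column data types.
From a 2-D array: pd.DataFrame(np.array(all_res),columns=clos)
To take a subset of a DataFrame's columns: df = df[columns], where columns is a list of column names.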
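
A short sketch of the three constructions. An in-memory string stands in for a CSV file on disk, and the column names and values are invented.

```python
import io

import numpy as np
import pandas as pd

# An in-memory string stands in for a real CSV file.
csv_text = "ticker,volume\n000001,100\n600000,250\n"

# dtype keeps ticker as a string so the leading zeros survive.
from_csv = pd.read_csv(io.StringIO(csv_text), dtype={"ticker": str, "volume": "int64"})

# From a 2-D array, with explicit column names.
all_res = [[1.0, 2.0, 2.0], [3.0, 4.0, 12.0]]
clos = ["p", "v", "p_v"]
from_array = pd.DataFrame(np.array(all_res), columns=clos)

# Select a subset of columns by passing a list of names.
columns = ["p", "v"]
subset = from_array[columns]
print(subset)
```

Without the dtype argument, read_csv would parse 000001 as the integer 1, which is a common surprise with ticker codes.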

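Computing period-over-period changes

For example, to compare the value three days ahead with the current value, shift is the usual tool: idx_df['closeIndex'].shift(-window) / idx_df['closeIndex'] - 1.0 gives the relative change from the current row to the row window steps later.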
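
A small worked example of the shift direction, with invented index values. The column names fwd_ret and bwd_ret are hypothetical.

```python
import pandas as pd

idx_df = pd.DataFrame({"closeIndex": [100.0, 102.0, 101.0, 105.0, 110.0]})
window = 3

# shift(-window) moves values up by `window` rows, so row i sees row i + window.
idx_df["fwd_ret"] = idx_df["closeIndex"].shift(-window) / idx_df["closeIndex"] - 1.0

# shift(window) looks backward instead: change versus `window` rows earlier.
idx_df["bwd_ret"] = idx_df["closeIndex"] / idx_df["closeIndex"].shift(window) - 1.0

print(idx_df)
```

The last window rows of fwd_ret and the first window rows of bwd_ret are NaN, because there is no row to compare against.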

Python list comprehensions

return [ x for x in cols if x not in filterOut] puts the condition directly inside the for expression.
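
As a runnable illustration, the expression can be wrapped in a small function. The name keep_columns and the sample values are hypothetical.

```python
def keep_columns(cols, filterOut):
    # Keep every column name that is not listed in filterOut, preserving order.
    return [x for x in cols if x not in filterOut]


cols = ["ticker", "p", "v", "Unnamed: 0"]
filterOut = {"Unnamed: 0", "ticker"}  # a set makes the membership test fast

kept = keep_columns(cols, filterOut)
print(kept)  # ['p', 'v']
```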

Sorting a dict by value

s_res = sorted(res.items(), key=lambda x: x[1], reverse=True)
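
A quick example with invented counts. Note that sorted returns a list of (key, value) tuples, not a dict, so it has to be converted back if a dict is needed.

```python
res = {"AAA": 3, "BBB": 10, "CCC": 7}

# Sort the (key, value) pairs by value, largest first.
s_res = sorted(res.items(), key=lambda x: x[1], reverse=True)
print(s_res)  # [('BBB', 10), ('CCC', 7), ('AAA', 3)]

# Dicts keep insertion order, so this gives the top two entries as a dict.
top2 = dict(s_res[:2])
```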

pandas to_dict

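to_dict converts a DataFrame into a dict-based structure, and the orient parameter selects among several layouts.
reports = reports.to_dict(orient="records") produces a list in which each element is a dict for one row.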
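
For comparison, here are three of the orient layouts on a two-row frame. The data is invented; orient="list" and orient="index" are shown alongside "records".

```python
import pandas as pd

reports = pd.DataFrame({"ticker": ["AAA", "BBB"], "score": [1, 2]})

# One dict per row.
records = reports.to_dict(orient="records")

# One list per column.
as_lists = reports.to_dict(orient="list")

# Outer keys are the index labels, inner dicts map column to value.
by_index = reports.to_dict(orient="index")

print(records)
```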

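Using loc and iloc

To take 1000 random rows: news.iloc[random.sample(range(news.shape[0]), k=1000), :]. In df.loc[list1, list2] (and likewise iloc, with integer positions), list1 selects the rows by index and list2 selects the columns; the slice : here means all columns.
To select all rows whose value in some column is not in a list: blackBoxSamples.loc[~blackBoxSamples.NEWS_ID.isin(remove_news_id), :]
This indexes with a boolean mask.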
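
A scaled-down sketch of both selections, using 3 rows out of 5 instead of 1000. The frame contents are invented, and DataFrame.sample is shown as a built-in alternative to random.sample.

```python
import random

import pandas as pd

news = pd.DataFrame({
    "NEWS_ID": [101, 102, 103, 104, 105],
    "NEWS_TITLE": ["a", "b", "c", "d", "e"],
})

# Random row positions without replacement, then positional indexing.
random.seed(0)
picked = news.iloc[random.sample(range(news.shape[0]), k=3), :]

# Built-in alternative that does the same job.
sampled = news.sample(n=3, random_state=0)

# Boolean mask: ~ negates isin, keeping rows whose NEWS_ID is NOT in the list.
remove_news_id = [102, 104]
kept = news.loc[~news.NEWS_ID.isin(remove_news_id), :]
print(kept)
```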

Sorting a DataFrame

Sort by one column: not_recall.sort_values(by="theme_name" , inplace=True, ascending=True)
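
A brief sketch on invented data, including a two-key sort. The score column is an assumption added for illustration.

```python
import pandas as pd

not_recall = pd.DataFrame({
    "theme_name": ["solar", "chips", "solar", "banks"],
    "score": [2, 5, 9, 1],
})

# Without inplace=True a new sorted frame is returned and the original is untouched.
by_theme = not_recall.sort_values(by="theme_name", ascending=True)

# Several keys, each with its own direction: theme A to Z, then score high to low.
by_theme_score = not_recall.sort_values(
    by=["theme_name", "score"], ascending=[True, False]
)
print(by_theme_score)
```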

Iterating over groups and counting the top-K word frequencies

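    from collections import Counter

    import jieba  # third-party Chinese word segmenter


    def collect_titles(samples):
        # The def line is missing in the original; the name collect_titles is assumed.
        # Map each theme_name to the array of its news titles.
        grps = samples.groupby("theme_name")
        res = {}
        for name, group in grps:
            titles = group["NEWS_TITLE"].values
            res[name] = titles
        return res


    def cut_word_and_count(res, top_k):
        # For each theme, segment the titles and keep the top_k most frequent words.
        counter = {}
        for name, titles in res.items():
            lines = " ".join(titles)
            # Drop whitespace and single-character tokens.
            words = [w for w in jieba.cut(lines) if len(w.strip()) > 1]
            counter[name] = Counter(words).most_common(top_k)
        for name, words in counter.items():
            # most_common already returns (word, count) pairs sorted by count.
            print(name + "," + str(words))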
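
The same group-then-count flow can be tried without installing jieba on space-separated text. In this sketch str.split stands in for jieba.cut, which is an assumption made only so the example runs without extra dependencies; the function name top_words_by_theme and the sample titles are invented.

```python
from collections import Counter

import pandas as pd

samples = pd.DataFrame({
    "theme_name": ["solar", "solar", "chips"],
    "NEWS_TITLE": [
        "solar panel demand",
        "solar panel exports rise",
        "chip demand rise",
    ],
})


def top_words_by_theme(samples, top_k):
    # Group the titles by theme, split them into words, and count per group.
    result = {}
    for name, group in samples.groupby("theme_name"):
        words = " ".join(group["NEWS_TITLE"]).split()
        result[name] = Counter(words).most_common(top_k)
    return result


top = top_words_by_theme(samples, top_k=2)
for name, words in top.items():
    print(name, words)
```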

Merging groups into a new DataFrame, then resetting the index

    # Join all THEME_NAME values that share a NEWS_TITLE, keeping one row per title.
    # DataFrame.append was removed in pandas 2.0, so collect the groups and use pd.concat.
    parts = []
    for title, group in merged.groupby("NEWS_TITLE"):
        group = group.copy()
        group["THEME_NAME"] = ",".join(group["THEME_NAME"].values)
        parts.append(group)
    res = pd.concat(parts)
    res = res.drop_duplicates(subset="NEWS_TITLE")
    res.reset_index(drop=True, inplace=True)
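
An alternative worth noting: the same one-row-per-title result can be sketched with groupby plus agg instead of an explicit loop. The sample data and the NEWS_ID column are invented, and taking the first value of the other columns is my assumption about what should survive the merge.

```python
import pandas as pd

merged = pd.DataFrame({
    "NEWS_TITLE": ["t1", "t2", "t1", "t3"],
    "THEME_NAME": ["solar", "chips", "wind", "banks"],
    "NEWS_ID": [101, 102, 101, 103],
})

# One row per title: join the theme names, keep the first value of other columns.
# sort=False keeps the titles in order of first appearance.
res = (
    merged.groupby("NEWS_TITLE", sort=False)
    .agg({"THEME_NAME": ",".join, "NEWS_ID": "first"})
    .reset_index()
)
print(res)
```

reset_index turns the NEWS_TITLE group keys back into a regular column and gives the result a fresh 0..n-1 index.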
