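Problem description:
In the raw table part-00000.csv, the column name takes two values, rmb and cis. The goal is to draw a stacked bar chart in which the bars are split by the value in name.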
1. Raw data format
Download link: http://download.csdn.net/download/zhousishuo/9902909
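2. Data processing
I processed the data in two ways: once with pandas and once with pyspark.sql. The two turn out to be very similar in both approach and code, so if you know one of them you can translate it into the other along the same lines (which is exactly what I did, haha).
Method 1: processing with pandas
Processing with pandas of course requires the package to be installed. I worked in Anaconda3, which bundles many Python packages and is convenient to use.
The processing code is as follows: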
import pandas as pd
data = pd.read_csv("./part-00000.csv")
# Select the rmb rows
df_rmb = data[data["name"] == "rmb"]
# Select the cis rows
df_cis = data[data["name"] == "cis"]
# Rename the df_rmb columns and keep only the ones needed
df_rmb = df_rmb.rename(columns={"total_cores": "rmb_cores", "total_allocatedMEM": "rmb_mem"})[["logTime", "rmb_cores", "rmb_mem"]]
# Rename the df_cis columns and keep only the ones needed
df_cis = df_cis.rename(columns={"total_cores": "cis_cores", "total_allocatedMEM": "cis_mem"})[["logTime", "cis_cores", "cis_mem"]]
# Merge df_rmb and df_cis on equal logTime, then sort the result by logTime in ascending order
result = pd.merge(df_rmb, df_cis, on="logTime", how="inner").sort_values(by="logTime", ascending=True)
# reset_index can be used to renumber the index after sorting
# result.reset_index(inplace=True, drop=True)
# The automatically created index is out of order after sorting and is not needed for plotting,
# so export the data without it
result.to_csv("./py_trans.csv", index=False)
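To see what the reshaping above does without downloading the file, here is a minimal sketch on a small made-up table. The sample values and the helper name split_and_merge are assumptions for illustration only; only the column names (logTime, name, total_cores, total_allocatedMEM) come from the code above.

```python
import pandas as pd


def split_and_merge(data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: turn long rows (one per logTime and name)
    into one wide row per logTime."""
    parts = []
    for group in ("rmb", "cis"):
        # Keep one group's rows and prefix its metric columns with the group name.
        part = data[data["name"] == group].rename(
            columns={
                "total_cores": f"{group}_cores",
                "total_allocatedMEM": f"{group}_mem",
            }
        )[["logTime", f"{group}_cores", f"{group}_mem"]]
        parts.append(part)
    rmb, cis = parts
    # The inner join keeps only the timestamps present in both groups.
    merged = pd.merge(rmb, cis, on="logTime", how="inner")
    return merged.sort_values(by="logTime").reset_index(drop=True)


# Made-up sample rows, not taken from part-00000.csv.
sample = pd.DataFrame(
    {
        "logTime": ["10:05", "10:00", "10:00", "10:05", "10:10"],
        "name": ["rmb", "rmb", "cis", "cis", "rmb"],
        "total_cores": [8, 4, 2, 6, 10],
        "total_allocatedMEM": [80, 40, 20, 60, 100],
    }
)

wide = split_and_merge(sample)
print(wide)
```

In this sample, 10:10 exists only for rmb, so the inner join drops it. Passing how="outer" to pd.merge would keep that row, with NaN in the cis columns.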
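Method 2: processing with pyspark.sql
This method requires a working pyspark environment, which means Spark has to be installed.
The processing code is as follows: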
# Read the data, which is stored on HDFS
# (spark is assumed to be an existing SparkSession, for example the one the pyspark shell provides)
df = spark.read.csv("hdfs://master_ip:8020/user/mart_cis/zhousishuo/part-00000.csv", encoding="UTF-8", header=True)
# Select the rmb rows
df_rmb = df.filter("name == 'rmb'").selectExpr("logTime", "total_cores as rmb_cores", "total_allocatedMEM as rmb_mem")
# Select the cis rows
df_cis = df.filter("name == 'cis'").selectExpr("logTime", "total_cores as cis_cores", "total_allocatedMEM as cis_mem")
# Join the rmb data and the cis data on logTime
df = df_rmb.join(df_cis, "logTime").select("logTime", "rmb_cores", "cis_cores", "rmb_mem", "cis_mem").orderBy("logTime")
# Write the final data out as CSV (Spark creates a directory of part files at this path)
df.write.csv(path="hdfs://master_ip:8020/user/mart_cis/zhousishuo/part-data.csv", mode="overwrite", sep=",", header=True)
3. Data format after processing
Download link: http://download.csdn.net/detail/zhousishuo/9902790
4. Drawing the stacked bar chart with seaborn
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
# Read in data & create total column
stacked_bar_data = pd.read_csv("D:/jupyter/matplot/new-part.csv")
stacked_bar_data["total_cores"] = stacked_bar_data.rmb_cores + stacked_bar_data.cis_cores
# Set general plot properties
sns.set_style("white")
plt.rcParams["figure.figsize"] = (24, 10)
# Plot 1 - background - "total" (top) series: full-height bars in red
sns.barplot(x=stacked_bar_data.logTime, y=stacked_bar_data.total_cores, color="red")
# Plot 2 - overlay - "bottom" series: cis bars drawn over the red bars,
# so the red that stays visible is the rmb share
bottom_plot = sns.barplot(x=stacked_bar_data.logTime, y=stacked_bar_data.cis_cores, color="#0000A3")
# Build the legend by hand from two colored rectangles
topbar = plt.Rectangle((0, 0), 1, 1, fc="red", edgecolor="none")
bottombar = plt.Rectangle((0, 0), 1, 1, fc="#0000A3", edgecolor="none")
legend = plt.legend([bottombar, topbar], ["cis total_cores", "rmb total_cores"], loc=1, ncol=2, prop={"size": 16})
legend.draw_frame(False)
# Optional code - Make plot look nicer
sns.despine(left=True)
bottom_plot.set_ylabel("total_cores")
bottom_plot.set_xlabel("logTime")
bottom_plot.set_xticklabels(stacked_bar_data.logTime, rotation=30, fontsize="small")
plt.show()
# Optional: set axis labels and tick labels to a consistent 8pt size
# (this would have to run before plt.show() to take effect)
# for item in ([bottom_plot.xaxis.label, bottom_plot.yaxis.label] +
#              bottom_plot.get_xticklabels() + bottom_plot.get_yticklabels()):
#     item.set_fontsize(8)
Reference: http://randyzwitch.com/creating-stacked-bar-chart-seaborn/
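The seaborn code gets the stacked look by drawing the full-height total bar first and then painting the cis bar over it. As a rough alternative sketch, plain matplotlib can stack the bars directly through the bottom argument of Axes.bar, so no total column is needed. The sample numbers, the output file name and the function name plot_stacked_cores are assumptions for illustration; the column names match the processed file.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd


def plot_stacked_cores(df: pd.DataFrame):
    """Hypothetical helper: draw cis_cores with rmb_cores stacked on top."""
    fig, ax = plt.subplots(figsize=(12, 5))
    positions = list(range(len(df)))
    # Bottom series starts at zero.
    ax.bar(positions, df["cis_cores"], color="#0000A3", label="cis total_cores")
    # Top series starts where the bottom series ends.
    ax.bar(positions, df["rmb_cores"], bottom=df["cis_cores"],
           color="red", label="rmb total_cores")
    ax.set_xticks(positions)
    ax.set_xticklabels(df["logTime"].tolist(), rotation=30, fontsize="small")
    ax.set_xlabel("logTime")
    ax.set_ylabel("total_cores")
    ax.legend(loc="upper right", ncol=2, frameon=False)
    return fig, ax


# Made-up sample rows in the processed (wide) layout.
sample = pd.DataFrame(
    {
        "logTime": ["10:00", "10:05"],
        "rmb_cores": [4, 8],
        "cis_cores": [2, 6],
    }
)

fig, ax = plot_stacked_cores(sample)
fig.savefig("stacked_cores.png")  # hypothetical output file name
```

With this approach each series is drawn once at its own height, which extends more naturally to three or more groups than the overlay trick does.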
5. The stacked bar chart