大家好✨,这里是bio。这次为大家带来的是多级分类数据的可视化,阅读本文你将学习到
1.如何筛选出你想要的数据
2.数据可视化图像的优化
3.NA值的处理
所用的语言是Python,全部过程及代码在MyGithub,感兴趣的读者可以取看看!
数据来源于Virion database,想要跟做的读者可以下载试试,如果访问不了,可能需要挂载VPN。打开数据我们能看到一些缺失值在宿主、病毒以及TaxID,所以首先对数据的筛选可以是选择这四项值存在的数据。
这里有一个值得注意的点,解码方式使用ISO-8859-1
能够解决乱码问题。选择这个解码主要还是要看你的数据采用什么方式编码。
import pandas as pd
with open('/home/bio_kang/virus_host_virion/Virion.csv','rb') as f:
lines = f.readlines()
data_procession = []
for line in lines:
if len(line) < 34:
continue
else:
data = line.decode('ISO-8859-1').strip('\n').split('\t')
if data[0] == '' or data[1]=='' or data[2]=='' or data[3] == '' :
continue
else:
data_procession.append(data)
data = pd.DataFrame(data_procession[1:-1],columns=data_procession[0])
Virion数据库的数据来自于别的数据库,对其进行统计有利于了解数据的来源比例(那些数据库的数据更科学可信)。
# library matplotlib
import matplotlib.pyplot as plt
# extract data
data_from = data['Database'].value_counts()
# draw graph
plt.figure(dpi=300,figsize=(24,8)) # set the size and dpi
plt.pie(data_from, explode=(0,0,0,0,0.1,0.2,0.3), autopct='%3.2f%%', pctdistance=1.1) # draw
plt.legend(labels=data_from.index, loc="right", bbox_to_anchor=(1,1)) # legend location
plt.savefig("database_statistic.png", format='png', bbox_inches='tight', dpi=300, transparent=True) # save graph
这些代码都比较容易理解。结果图如下
分级数据的展示方式有很多种,今天介绍的是sunburst
旭日图。在绘图之前,要对数据进行处理成绘图需要的格式。
# extract data and remove duplication data
host_data = data['HostTaxID']
host_data = host_data.drop_duplicates()
# write a file
host_taxid = list(host_data)
with open('/home/ouyangkang/Learning/graduation/virus_host/statistic_virion/host.txt','w') as f:
f.writelines(list(map(lambda x:x+'\n', host_taxid)))
# utilize linux system in jupyter and use "taxonkit" to obtain taxonomic information
import os
os.system(' cat host.txt | /home/ouyangkang/software/anaconda3/envs/bio/bin/taxonkit reformat -I 1 -f "{p};{c};{o};{f};{g};{s}" > taxon_host.txt ')
# convert txt format to csv format
with open("/home/ouyangkang/Learning/graduation/virus_host/statistic_virion/taxon_host.txt", 'r') as f:
taxon = f.readlines()
taxon_data = []
for line in taxon:
line = line.strip('\n').split('\t')[1]
taxon_data.append(line.split(';'))
write_data = pd.DataFrame(taxon_data,columns=["Phylum","Class","Order","Family","Genus","Species"])
write_data.to_csv('taxon_host.csv',index=None)
然后进行绘图,绘制的还能动,大家可以按照代码绘制试试。
import plotly.express as px
import pandas as pd
import plotly
data_host = pd.read_csv('taxon_host.csv')
data_host = data_host.fillna(value='Unclassfied')
fig = px.sunburst(data,path=['Phylum','Class','Order','Family','Genus'])
fig.show()
plotly.offline.plot(fig,filename='Host_taxon.html')