<For bioinformatics discussion and collaboration, follow the WeChat official account @生信探索>
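I have collected my frequently used functions into a few packages so they are easier to reuse:
bioquest contains three subpackages, tl, pl and st: general tools (including DataFrame handling), plotting, and string processing, respectively.
genekit contains functions for extracting TCGA data, converting gene names, converting formats, differential analysis, and so on.
sckit contains some functions for single-cell analysis.
These functions can be found at https://jihulab.com/BioQuest.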
import numpy as np
import pandas as pd
import bioquest as bq
import genekit as gk
exprs = pd.read_csv("star_count.csv.gz",index_col=0)
# Chr Start Stop Strand Symbol GeneType Length SRR1039516 SRR1039522 SRR1039508 ... SRR1039514 SRR1039521 SRR1039512 SRR1039513 SRR1039519 SRR1039515 SRR1039520 SRR1039509 SRR1039518 SRR1039517
# Ensembl
# ENSG00000130762.15 1 3454664 3481113 + ARHGEF16 protein_coding 5324 2 0 1 ... 1 0 2 0 0 0 1 0 1 1
# ENSG00000117472.10 1 46175072 46185962 + TSPAN1 protein_coding 2370 16 2 1 ... 11 1 11 1 17 11 1 0 8 9
# ENSG00000227857.2 1 46134530 46139081 + ENSG00000227857 lncRNA 339 0 1 2 ... 0 0 2 0 0 1 1 0 0 0
# Keep only the sample columns whose names start with SRR
exprs = bq.tl.select(exprs,pattern=r"^SRR")
# Strip the Ensembl version suffix (the dot and the digits after it)
exprs.index = bq.st.removes(string=exprs.index,pattern=r"\.\d+")
# SRR1039516 SRR1039522 SRR1039508 SRR1039523 SRR1039511 SRR1039510 SRR1039514 SRR1039521 SRR1039512 SRR1039513 SRR1039519 SRR1039515 SRR1039520 SRR1039509 SRR1039518 SRR1039517
# ENSG00000130762 2 0 1 0 1 1 1 0 2 0 0 0 1 0 1 1
# ENSG00000117472 16 2 1 2 2 4 11 1 11 1 17 11 1 0 8 9
# ENSG00000227857 0 1 2 0 0 2 0 0 2 0 0 1 1 0 0 0
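If bioquest is not installed, the two steps above can be approximated with plain pandas. This is only a sketch of the same idea, not the internals of `bq.tl.select` or `bq.st.removes`; the small table reuses a couple of values from the output shown earlier, and the variable names `raw` and `counts` are made up for the illustration.

```python
import pandas as pd

# Toy table shaped like the count matrix above (two genes, two samples).
raw = pd.DataFrame(
    {
        "Symbol": ["ARHGEF16", "TSPAN1"],
        "Length": [5324, 2370],
        "SRR1039516": [2, 16],
        "SRR1039522": [0, 2],
    },
    index=pd.Index(["ENSG00000130762.15", "ENSG00000117472.10"], name="Ensembl"),
)

# Keep only the columns whose names start with "SRR".
counts = raw.filter(regex=r"^SRR")

# Drop the trailing ".<version>" from each Ensembl ID.
counts.index = counts.index.str.replace(r"\.\d+$", "", regex=True)

print(counts)
```

Anchoring the pattern with `$` is a small precaution so that only a trailing version suffix is removed.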
The gene lengths used to compute TPM and FPKM are the sums of non-redundant exon lengths calculated in the previous post.
tpm=gk.countto(frame=exprs, towhat="tpm",geneid='Ensembl', species='Human')
fpkm=gk.countto(frame=exprs, towhat="fpkm",geneid='Ensembl', species='Human')
cpm=gk.countto(frame=exprs, towhat="cpm",geneid='Ensembl', species='Human')
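For reference, the three normalizations can be written out in a few lines of pandas using their standard textbook definitions. This is a sketch under that assumption and is not taken from the `gk.countto` source; the function names and the `lengths` Series (gene length in bp, indexed by gene ID) are hypothetical.

```python
import pandas as pd

def counts_to_cpm(counts: pd.DataFrame) -> pd.DataFrame:
    # Counts per million: scale each sample (column) by its library size.
    return counts / counts.sum(axis=0) * 1e6

def counts_to_fpkm(counts: pd.DataFrame, lengths: pd.Series) -> pd.DataFrame:
    # count * 1e9 / (gene length in bp * library size)
    per_kb = counts.div(lengths, axis=0) * 1e3
    return per_kb / counts.sum(axis=0) * 1e6

def counts_to_tpm(counts: pd.DataFrame, lengths: pd.Series) -> pd.DataFrame:
    # Length-normalize first, then rescale each sample to sum to one million.
    rate = counts.div(lengths, axis=0)
    return rate / rate.sum(axis=0) * 1e6

# Toy input: two genes, two samples, lengths in bp.
genes = ["ENSG00000130762", "ENSG00000117472"]
counts = pd.DataFrame(
    {"SRR1039516": [2, 16], "SRR1039522": [0, 2]}, index=genes
)
lengths = pd.Series([5324, 2370], index=genes)

cpm_df = counts_to_cpm(counts)
fpkm_df = counts_to_fpkm(counts, lengths)
tpm_df = counts_to_tpm(counts, lengths)
print(tpm_df)
```

Note that with only two genes the library size is tiny, so the numbers are not comparable to real data; the point is just the order of operations. TPM columns always sum to one million, while FPKM columns generally do not.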
tpm=gk.geneIDconverter(frame=tpm, from_id='Ensembl', to_id='Symbol',species="Human", keep_from=False, gene_type=None)
tpm=gk.unique_exprs(frame=tpm, reductions=np.median)
# SRR1039516 SRR1039522 SRR1039508 SRR1039523 SRR1039511 SRR1039510 SRR1039514 SRR1039521 SRR1039512 SRR1039513 SRR1039519 SRR1039515 SRR1039520 SRR1039509 SRR1039518 SRR1039517
# Symbol
# FTL 62919.175973 39937.508172 53136.477158 30489.590631 36000.373210 54960.270363 30914.212683 25590.800300 43085.227124 29046.009169 51910.446252 39525.865074 40127.113325 34191.141873 35092.623376 34229.676260
# COL1A2 8624.405014 11315.967039 9225.065529 6930.294955 6125.772306 8894.128386 7417.930243 7370.383579 8733.446453 7222.892397 9489.185083 10492.409160 10928.643870 5864.545027 6013.585511 6268.442083
# FN1 4850.727657 7963.160439 5668.192718 7721.276730 5003.937144 5151.457137 9355.601231 8597.989742 9622.225648 8786.347207 4382.418760 9408.084101 8648.503050 5130.991878 5039.494412 6249.772857
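After converting Ensembl IDs to symbols, several rows can end up with the same symbol, which is why they are collapsed with the median. A plain pandas approximation of that step might look like the sketch below; it is not the implementation of `gk.unique_exprs`, and the gene names and values are invented for the example.

```python
import pandas as pd

# Toy expression table in which GENE_A appears twice after ID conversion.
expr = pd.DataFrame(
    {"S1": [1.0, 3.0, 10.0], "S2": [2.0, 6.0, 20.0]},
    index=pd.Index(["GENE_A", "GENE_A", "GENE_B"], name="Symbol"),
)

# Group rows by symbol and take the per-sample median within each group.
collapsed = expr.groupby(level=0).median()

print(collapsed)
```

Other reductions such as the mean or the maximum could be swapped in the same way, depending on how duplicated symbols should be handled.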
Save the data for the next steps: differential analysis, heatmap visualization, and WGCNA analysis.
tpm.to_csv("star_tpm.csv.gz")