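Counting the tree topologies around a focal taxon in OrthoFinder-derived phylogenies with pure Python

When the position of a species, or a group of species, in a phylogeny is disputed, one way to help settle it is to build single-gene trees from orthologous genes and count which topologies occur around that species or group. Here is the full workflow, shared for discussion.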

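The workflow has the following steps:

  1. Use OrthoFinder to identify orthologs and build the single-gene trees. (For installing the software, please find a tutorial yourself.)
  2. Topology statistics
    2.1 Merge the single-gene trees produced by OrthoFinder and simplify them
    2.2 Count the topologies
    2.3 Restore the labels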

1. Use OrthoFinder to identify orthologs and build the single-gene trees

Put the protein sequences of the species under study into one folder: ~/liuwei/01.tree.230513/01.Orthofinder/DataSet
Run OrthoFinder from the directory ~/liuwei/01.tree.230513/01.Orthofinder.
The command is:

orthofinder -f DataSet -t 80 -a 1

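When the run finishes, the results are in
~/liuwei/01.tree.230513/01.Orthofinder/DataSet/OrthoFinder/Results_May15_4
For a description of the result files, see
https://www.jianshu.com/p/3bcd965605f5
https://www.jianshu.com/p/a93ce87ff2d6
"OrthoFinder 2.0: principles and related concepts"
OrthoFinder builds a tree for the genes in every orthogroup (cluster) and writes the trees to the Resolved_Gene_Trees folder. I have not yet looked into which model it uses; if you know, please leave a comment.
We need to pull out the single-copy gene trees ourselves and put them into one file. OrthoFinder also provides the list of single-copy orthogroups, in Orthogroups/Orthogroups_SingleCopyOrthologues.txt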

The script 01.cat_files.py extracts the single-copy gene trees from Resolved_Gene_Trees and writes them into one file:

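import os

# OrthoFinder result directory; open() does not expand "~", so expand it explicitly
results = os.path.expanduser("~/liuwei/01.tree.230513/01.Orthofinder/DataSet/OrthoFinder/Results_May15_4")
# SingleCopyOrthologues cluster ID list, one ID per line
id_list = os.path.join(results, "Orthogroups", "Orthogroups_SingleCopyOrthologues.txt")
tree_dir = os.path.join(results, "Resolved_Gene_Trees")

with open(id_list, "r") as f1, open("all_Single_Copy_Orthologue_Trees.txt", "w") as f3:
    for i in f1:
        i = i.strip()
        if not i:  # skip blank lines in the ID list
            continue
        with open(os.path.join(tree_dir, f"{i}_tree.txt"), "r") as f2:
            # the tree is on the first line; strip it so exactly one newline is written
            f3.write(f2.readline().strip() + "\n")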

This gives all_Single_Copy_Orthologue_Trees.txt, which contains all the single-copy gene trees.
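Before going further it can help to confirm that the merged file really holds one complete tree per line, since the counting script later numbers trees by line. The following is a minimal sketch of such a check; the function names `looks_like_newick` and `summarize` are my own, and the check only tests for a trailing ";" and balanced parentheses, not full Newick validity.

```python
def looks_like_newick(line):
    """Return True if the line ends with ';' and its parentheses balance."""
    line = line.strip()
    if not line.endswith(";"):
        return False
    depth = 0
    for ch in line:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its '('
                return False
    return depth == 0


def summarize(lines):
    """Count (valid, blank, invalid) lines of a merged tree file."""
    valid = blank = invalid = 0
    for line in lines:
        if not line.strip():
            blank += 1
        elif looks_like_newick(line):
            valid += 1
        else:
            invalid += 1
    return valid, blank, invalid


# Toy lines standing in for the content of all_Single_Copy_Orthologue_Trees.txt
demo = ["((A:1,B:1)n1:1,C:2)n0;\n", "\n", "((A:1,B:1)n1:1,C:2\n"]
print(summarize(demo))
```

On the real file, passing the open file object to `summarize` should report zero blank and zero invalid lines; anything else is worth fixing before counting topologies.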

——————————————————————————————
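At this point we have the file of single-gene trees. Next, we simplify the trees so that their topologies can be counted.

The gene IDs differ from one single-gene tree to the next, so only the species IDs need to be kept. I wrote scripts to simplify the gene IDs and the species IDs.

To keep things tidy I created a folder named test_py and did most of the following work there; please mind the file paths in the scripts.

My plan was not very clear at the start, so I first stripped the gene IDs and kept only the species ID (the protein file name).
The script is below.
PS: I may have edited …/SpeciesIDs.txt beforehand; at the least, I replaced every dot "." with an underscore "_".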

import re

# species IDs (protein file names), one per line
with open("../SpeciesIDs.txt", "r") as f3:
    SpeciesIDs = [line.strip() for line in f3 if line.strip()]

with open("all_Single_Copy_Orthologue_Trees.txt", "r") as f1, open("all_concise_Trees.txt", "w") as f2:
    for i in f1:
        for j in SpeciesIDs:
            # "<species>_<gene id>:" becomes "<species>:"; re.escape keeps
            # characters such as "." in a species ID from acting as regex syntax
            i = re.sub(f'{re.escape(j)}_[^:]*:', lambda m, j=j: j + ':', i)
        f2.write(i)
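To make the substitution concrete, here is a small sketch that applies the same pattern to one toy tree. The gene IDs (`g123`, `g9`, `KAF001`) are invented for illustration, and `strip_gene_ids` is just a name I chose for the wrapper.

```python
import re


def strip_gene_ids(tree, species_ids):
    """Reduce every '<species>_<gene id>:' label to '<species>:'."""
    for sp in species_ids:
        tree = re.sub(re.escape(sp) + r"_[^:]*:", lambda m, sp=sp: sp + ":", tree)
    return tree


# Hypothetical gene IDs appended to species IDs taken from the mapping file
tree = "((D10A_protein_g123:0.1,D11A_protein_g9:0.2)n1:0.05,1817_protein_KAF001:0.3)n0;"
concise = strip_gene_ids(tree, ["D10A_protein", "D11A_protein", "1817_protein"])
print(concise)
# ((D10A_protein:0.1,D11A_protein:0.2)n1:0.05,1817_protein:0.3)n0;
```

Branch lengths and the internal node labels (n0, n1) are left alone, because the pattern only consumes text between the species ID and the next colon.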

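The names were still hard to read, so I then standardized the species names. This step matters, because the later script has requirements on the standardized names: they must look like M1 or M23. PS: the number after M must stay below three digits. If you can read the script you can change this yourself; the number of digits is easy to change, I just did not bother.
First create a file, SpeciesIDs.txt, that maps each original species name to its new name, as shown below. The new names start with M and are numbered in increasing order.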

1817_protein    M1
1824_protein    M2
21178_protein   M3
21183_protein   M4
21184_protein   M5
21185_protein   M6
21186_protein   M7
21187_protein   M8
21189_protein   M9
21192_protein   M10
2612_protein    M11
2613_protein    M12
Ascim1_GeneCatalog_proteins_20121221_aa M13
Ascni1_GeneCatalog_proteins_20141120_aa M14
Chove1_GeneCatalog_proteins_20131210_aa M15
D10A_protein    M16
D11A_protein    M17
D12A_protein    M18
D13A_protein    M19
D14A_protein    M20
D15A_protein    M21
D16A_protein    M22
D17A_protein    M23
D18A_protein    M24

Then use the script changeSpeciesIDs.py to standardize the species names in all_concise_Trees.txt (from the previous step) according to the mapping file above.

python changeSpeciesIDs.py all_concise_Trees.txt SpeciesIDs.txt shortNameTrees.txt

The script is as follows:

# this script follows concise_Trees.py
import sys
# argv[1]: all_concise_Trees.txt, argv[2]: SpeciesIDs.txt, argv[3]: shortNameTrees.txt
name = {}
with open(sys.argv[2], "r") as f2:
    for i in f2:
        a = i.split()
        if len(a) >= 2:  # skip blank or malformed lines
            name[a[0]] = a[1]
with open(sys.argv[1], "r") as f1, open(sys.argv[3], "w") as f3:
    for i in f1:
        for j in name.keys():
            if j in i:
                # replace only when the name is directly followed by ",", ")" or ":"
                i = i.replace(j+",",name[j]+",")
                i = i.replace(j+")",name[j]+")")
                i = i.replace(j+":",name[j]+":")
        f3.write(i)
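One caveat with plain `str.replace`: if one original name is the tail of another (say `2612_protein` and a hypothetical `12612_protein`), the shorter name is also replaced inside the longer one. A minimal sketch of a boundary-aware alternative is shown below, assuming labels are always followed by ",", ")" or ":"; `rename_species` and the extra mapping entry are my own illustrations, not part of the original workflow.

```python
import re


def rename_species(tree, mapping):
    """Replace whole species labels in one pass, longest names first."""
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![A-Za-z0-9_.])("              # label must not continue to the left
        + "|".join(re.escape(n) for n in names)
        + r")(?=[,):])"                      # label must end at ',', ')' or ':'
    )
    return pattern.sub(lambda m: mapping[m.group(1)], tree)


mapping = {
    "2612_protein": "M11",
    "D10A_protein": "M16",
    "12612_protein": "M25",  # hypothetical entry, added only to show the overlap
}
tree = "((2612_protein:0.1,12612_protein:0.2)n1:0.05,D10A_protein:0.3)n0;"
renamed = rename_species(tree, mapping)
print(renamed)
# ((M11:0.1,M25:0.2)n1:0.05,M16:0.3)n0;
```

Doing all replacements in a single pass also prevents a freshly written name from being rewritten by a later mapping entry.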

The tree file is now ready; what remains is counting the topologies.

2. Tree topology statistics

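For a species or group, what matters most is the topology it forms with its nearest branch (brother) and its next-nearest branch (uncle); everything else acts as the outgroup (ancestor), as in panel a. Panel b can be read as the simplified topology of panel a. The script counts, across all single-gene trees, every topology centered on the target species/group and how often each occurs. Within one topology, the number and IDs of the species contained in brother, uncle and ancestor are identical. The finer branching inside brother, uncle and ancestor is not considered, and does not need to be; if it is needed, take a group inside one of them as the target and run the statistics again.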
[Figure 2: (a) the target with its brother, uncle and ancestor branches; (b) the simplified topology of (a)]
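The brother/uncle/ancestor idea can also be sketched with explicit parent pointers instead of string slicing. This is not the author's script, only a small independent illustration: `TreeNode`, `parse_newick` and `relatives` are names I made up, branch lengths are ignored, and the function returns None where the target has no grandparent node.

```python
import re


class TreeNode:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.name = ""

    def leaves(self):
        """Set of leaf names below (or at) this node."""
        if not self.children:
            return {self.name}
        out = set()
        for child in self.children:
            out |= child.leaves()
        return out


def parse_newick(text):
    """Parse a Newick string into TreeNode objects (branch lengths dropped)."""
    root = node = TreeNode()
    for token in re.findall(r"[(),;]|[^(),;]+", text.strip()):
        if token == "(":
            child = TreeNode(node)
            node.children.append(child)
            node = child
        elif token == ",":
            sibling = TreeNode(node.parent)
            node.parent.children.append(sibling)
            node = sibling
        elif token == ")":
            node = node.parent
        elif token == ";":
            break
        else:
            node.name = token.split(":")[0]
    return root


def find_clade(node, target):
    """Return the node whose leaf set equals target, or None."""
    if node.leaves() == target:
        return node
    for child in node.children:
        hit = find_clade(child, target)
        if hit is not None:
            return hit
    return None


def relatives(root, target):
    """Return (me, brother, uncle, ancestor) as sets of leaf names."""
    target = set(target)
    me = find_clade(root, target)
    if me is None or me.parent is None or me.parent.parent is None:
        return None  # target not a clade, or no grandparent to define an uncle
    father, grand = me.parent, me.parent.parent
    brother = father.leaves() - target
    uncle = grand.leaves() - father.leaves()
    ancestor = root.leaves() - grand.leaves()
    return target, brother, uncle, ancestor


toy = "(((M23:1,M5:1)n3:1,M1:1)n2:1,(M13:1,M14:1)n1:1)n0;"
result = relatives(parse_newick(toy), {"M23"})
print([sorted(group) for group in result])
# [['M23'], ['M5'], ['M1'], ['M13', 'M14']]
```

Two trees then share a topology in the sense used here when these four sets are equal, whatever the branching inside each set.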

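Use the script Count_Structures_Of_PhylogenyTrees.py to count the topologies.
Note: the target species or target group is set around line 121 of the script. In the listing here, line 121 is the commented-out input() form, line 122 hard-codes target = "M23", and line 123 is a commented-out subtree form; uncomment whichever form you prefer and comment out the others.
The only input file is the tree file prepared above; the result is redirected to result.txt.
Run the script as follows: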

python Count_Structures_Of_PhylogenyTrees.py shortNameTrees.txt >result.txt

The script is as follows:

#!/public/home/wangwen_lab/zhangjiexiong/anaconda3/bin/python
import re,sys
class Node:
	def __init__(self,nodeNum,nodeLength,upbranch,downbranch):
		self.nodeNum = nodeNum
		self.upbranch = upbranch
		self.downbranch = downbranch
		self.nodeLength = nodeLength
class family:
	def __init__(self,me,brother,father,uncle="NULL",ancestor="NULL"):
		self.me = me
		self.brother = brother
		self.father = father
		self.uncle = uncle
		self.ancestor = ancestor
#f1 = open("shortNameTrees.txt","r")
def findComma(s): # find the core comma of a tree string
	place,sta = 0,0
	for i in s:
		if i == "(":
			sta += 1
		elif i  == ")":
			sta -= 1
		if i == ",":
			if sta == 1:
				commaplace = place
				break
		place += 1
	return commaplace
def findPairedBracket(s): # find the coordinates of the first "(" and the last ")"
	place,sta = 0,0
	for i in s:
		if i == "(":
			sta += 1
			if sta == 1:
				firstbracket = place
		elif i  == ")":
			sta -= 1
			if place != 0 and sta == 0:
				pairedbracket = place
		place += 1
#	bracket[firstbracket] = pairedbracket
	return [firstbracket,pairedbracket]

def NodeInfo(s): # store the node's information in the Node class defined above
	treelen = len(s)
	bracket = findPairedBracket(s)
	nodeInfo = s[bracket[1]+1:treelen]
	nodeNum = nodeInfo.split(":")[0]
	nodeLength = nodeInfo.split(":")[1]
	nodeContent = s[0:bracket[1]+1]	
	commaplace = findComma(nodeContent)
	nodelength = len(nodeContent)
	upbranch = nodeContent[1:commaplace]
	downbranch = nodeContent[commaplace+1:nodelength-1]
	node = Node(nodeNum,nodeLength,upbranch,downbranch)
	return node

def judgeNode(s): #judge whether the string is a tree
	if type(s) == type('str'):
		if "(" in s:
			bracket = findPairedBracket(s)
			tail  = s[bracket[1]::]
		#	print(tail)
			if re.match(r"\)n\d+:",tail):
				return 1
			else:
				return 0
	else:
		return 1
def circle(node): # the most important function: it recurses through the tree and appends every subtree to the global list "dicNode"
	if judgeNode(node):
		node = NodeInfo(node)
		dicNode.append(node)
		circle(node.upbranch)
		circle(node.downbranch)
def judgeCertainSpecies(node,species): # judge whether the input branch "species" is the only content of one of the node's two subbranches, and report whether it sits in upbranch or downbranch
	if species in node.upbranch:
		branch = node.upbranch
		elipbranch = branch.replace(species,"")
		if "(" in elipbranch:
			return 0
		else:
			return "up"
	elif species in node.downbranch:
		branch = node.downbranch
		elipbranch = branch.replace(species,"")
		if "(" in elipbranch:
			return 0
		else:
			return "down"
	else:
		return 0
def findDirectFamiliesOfAimedNode(node): # node is a string. This function finds the branch adjoining the target node and returns the adjoining nodes
	for i in dicNode:
		if judgeCertainSpecies(i,node) != 0:
			father = "("+i.upbranch+","+i.downbranch+")"+i.nodeNum+":"+i.nodeLength
			if judgeCertainSpecies(i,node) == "up":
				me = i.upbranch
				brother = i.downbranch
			elif judgeCertainSpecies(i,node) =="down":
				brother = i.upbranch
				me = i.downbranch
	families = family(me,brother,father)
	return families
def findAllFamiliesOfAimedNode(tree,node): #This function is used to find out all the four related branches of the target node, which are named as "brother", "father", "uncle" and "ancestor".
	directfolks = findDirectFamiliesOfAimedNode(node)
	upgeneration = findDirectFamiliesOfAimedNode(directfolks.father)
	directfolks.ancestor = tree.replace(upgeneration.father,"Me:0")
	me = directfolks.me
	directfolks.uncle = upgeneration.brother
	return directfolks

def drawOutAllStrainInTree(node): # input a node, extract all strain names, sort them and join them with ","; the result can then be compared with other processed strings
	strains = re.findall(r"M\d{1,2}",node)
	strains.sort()
	merged_string = ','.join(strains)
	return merged_string

ftree = open(sys.argv[1],"r")
#target = input("Please input your most concerned species or subtree:")
target = "M23"
#target = "(M40:0.139101,M42:0.159342)n5:0.079548"
treenum = 0
treetype = {}
for i in ftree:
	i = i.replace(";",":0")
	treenum += 1
	dicNode = []
	circle(i)
	allfolks = findAllFamiliesOfAimedNode(i,target)
	treetype[treenum] = drawOutAllStrainInTree(allfolks.me)+"\t"+drawOutAllStrainInTree(allfolks.brother)+"\t"+drawOutAllStrainInTree(allfolks.uncle)+"\t"+drawOutAllStrainInTree(allfolks.ancestor)
#	print(allfolks.me)
structure = {}
treeid = {}
for i in range(len(treetype)):
	structure[treetype[i+1]]=structure.get(treetype[i+1],0)+1
	treeid[treetype[i+1]]= treeid.get(treetype[i+1],"")+str(i+1)+","
d_order = sorted(structure.items(), key=lambda x: x[1],reverse = True)
print("aim\taimNum\tbrother\tbrotherNum\tuncle\tuncleNum\tancestor\tancestorNum\ttopoStructuresNum\tcorrespondingTreesID")
for i in d_order:
	a = i[0].split("\t")
	for j in a:
		try:
			b = j.split(",")
			print(j+"\t"+str(len(b)),end = "\t")
		except:
			print(j,end="\t")
	print(str(i[1]),end = "\t")
	print(treeid[i[0]])
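One thing to watch when choosing the target: the script tests it with `species in branch`, which is a plain substring test, so a target such as M1 also matches M10 through M19. A hedged sketch of a boundary-aware check for leaf names is given below; `contains_taxon` is a hypothetical helper, not a function of the script above.

```python
import re


def contains_taxon(branch, taxon):
    """True if taxon occurs in branch as a whole label, not as part of a longer one."""
    pattern = r"(?<![A-Za-z0-9_])" + re.escape(taxon) + r"(?![A-Za-z0-9_.])"
    return re.search(pattern, branch) is not None


print("M1" in "M10:0.1")                                # True: the substring test is fooled
print(contains_taxon("M10:0.1", "M1"))                  # False
print(contains_taxon("(M1:0.2,M3:0.1)n4:0.3", "M1"))    # True
```

With target = "M23" and at most two digits after M, as in the run shown here, the substring test and this check agree.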

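The output looks like this.
Column 1 is the species IDs of the target group, followed by the number of species in the target group;
column 3 is the species IDs of brother, followed by the number of species in brother;
column 5 is uncle, followed by its species count;
column 7 is ancestor, followed by its species count;
column 9 is the number of single-gene trees that have this topology;
column 10 lists the IDs of those gene trees (numbered in the order of the input file: the first tree is 1, the second is 2, and so on).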

aim	aimNum	brother	brotherNum	uncle	uncleNum	ancestor	ancestorNum	topoStructuresNum	correspondingTreesID
M23	1	M10,M16,M17,M22,M24,M28,M29,M30,M32,M33,M35,M4,M5,M6,M7,M8,M9	17	M1,M11,M12,M18,M19,M2,M20,M21,M25,M26,M3,M31,M34,M36,M37,M38	16	M13,M14,M15,M27,M39,M40,M41,M42,M43	9	843	2,6,13,14,18,21,22,23,24,26,28,31,33,34,35,43,45,48,49,50,58,59,61,64,65,71,76,81,82,83,84,86,87,89,91,94,100,106,107,109,112,113,114,118,119,124,126,129,131,132,135,137,142,147,148,151,155,157,159,160,162,167,175,181,184,185,188,190,191,201,202,203,207,209,213,214,217,218,220,221,222,223,226,227,234,239,244,245,246,247,249,255,260,261,262,265,267,268,273,276,281,290,292,296,298,301,302,303,313,315,319,325,339,340,342,343,346,347,351,352,353,354,355,356,371,373,380,381,400,401,404,408,414,415,416,419,421,422,432,433,435,439,440,441,442,448,452,453,456,457,464,465,489,490,493,495,497,500,506,507,509,514,515,516,519,520,525,537,539,540,541,542,543,545,546,548,551,554,559,561,567,574,576,579,581,585,588,590,591,592,595,596,599,601,607,609,610,612,613,614,626,631,632,639,640,642,645,646,647,648,655,659,663,668,670,671,682,683,688,689,691,697,698,699,700,701,703,704,706,708,711,712,713,715,717,720,729,732,739,741,745,746,747,749,750,751,752,755,756,757,763,764,766,767,769,771,774,783,785,791,792,797,798,800,801,804,816,818,821,837,838,846,847,848,850,853,858,861,863,864,865,870,876,877,878,879,880,882,888,890,891,893,913,920,921,923,924,926,927,929,931,932,934,937,940,942,943,948,951,952,954,957,961,962,963,966,968,979,982,983,984,991,999,1001,1004,1007,1010,1011,1012,1013,1014,1015,1017,1022,1023,1024,1026,1028,1030,1034,1040,1041,1043,1046,1047,1048,1050,1053,1061,1067,1068,1073,1075,1076,1079,1081,1085,1086,1094,1095,1096,1097,1099,1102,1103,1104,1105,1107,1108,1109,1112,1113,1114,1115,1116,1117,1119,1122,1124,1128,1130,1139,1140,1141,1143,1159,1160,1162,1166,1170,1175,1179,1185,1189,1196,1197,1198,1199,1200,1201,1202,1203,1207,1211,1212,1213,1215,1217,1220,1223,1224,1226,1228,1231,1236,1241,1248,1250,1258,1259,1261,1263,1278,1280,1281,1284,1285,1287,1288,1291,1293,1296,1300,1304,1305,1309,1310,1319,1323,1324,1326,1328,1333,1334,1336,1340,1341,1345,1346,1350,1355,1357,1359,1360,1364,1366,1369,1372,1373,1374,1376,1377,1379,1380,1381,1390,1391,1395,1399,1401,1402,1404,1406,1407,1408,1412,1416,1417,1418,1421,1423,1424,1428,1432,1438,1439,1445,1448,1449,1450,1452,1453,1454,1455,1457,1458,1459,1460,1461,1464,1465,1466,1467,1469,1471,1472,1473,1475,1476,1477,1478,1479,1488,1493,1494,1497,1500,1507,1510,1513,1515,1518,1522,1523,1529,1530,1537,1538,1540,1542,1545,1548,1550,1552,1560,1561,1567,1568,1571,1572,1577,1578,1579,1581,1588,1590,1592,1598,1603,1605,1611,1619,1620,1621,1624,1626,1636,1639,1640,1642,1643,1650,1651,1655,1657,1658,1661,1662,1666,1668,1671,1674,1677,1681,1682,1689,1690,1696,1701,1702,1703,1710,1711,1723,1724,1725,1727,1729,1730,1733,1738,1739,1741,1742,1748,1750,1754,1759,1761,1764,1765,1769,1777,1782,1784,1785,1791,1793,1795,1800,1801,1802,1807,1818,1819,1820,1821,1822,1824,1826,1830,1834,1835,1836,1838,1844,1848,1852,1855,1857,1859,1862,1863,1865,1870,1874,1879,1880,1881,1883,1886,1888,1889,1890,1893,1896,1897,1898,1901,1902,1905,1906,1910,1913,1918,1920,1922,1926,1930,1931,1932,1934,1935,1936,1937,1939,1941,1944,1945,1946,1950,1954,1959,1963,1976,1980,1991,1992,2001,2002,2003,2004,2008,2010,2013,2018,2024,2025,2029,2030,2032,2034,2037,2040,2042,2044,2046,2048,2049,2051,2052,2054,2065,2082,2084,2087,2090,2094,2098,2099,2103,2106,2110,2111,2116,2119,2122,2125,2127,2129,2131,2133,2134,2136,2137,2139,2140,2142,2143,2148,2154,2159,2160,2162,2164,2167,2169,2174,2178,2184,2189,2191,2197,2200,2201,2205,2206,2207,2208,2212,2213,2216,2220,2222,2223,2234,2237,2238,2246,2252,2253,2256,2260,2261,2262,2264,2268,2270,2271,2275,2277,2278,2287,2288,2295,2296,2299,2305,2308,2309,2311,2314,2318,2322,2324,2325,2327,2331,2334,2344,2349,2350,2351,2353,2355,2357,2358,2359,2363,2370,2372,2374,2375,2379,2380,2381,2382,2387,2389,2390,2394,2395,2396,2405,2411,2416,2417,2418,2419,2423,2424,2425,2426,2429,2430,2434,2436,2440,2442,2445,2448,2461,2462,2466,2469,2471,2473,2474,2476,2478,2482,2483,2487,
M23	1	M1,M10,M11,M12,M16,M17,M18,M19,M2,M20,M21,M22,M24,M25,M26,M28,M29,M3,M30,M31,M32,M33,M34,M35,M36,M37,M38,M4,M5,M6,M7,M8,M9	33	M27	1	M13,M14,M15,M39,M40,M41,M42,M43	8	612	1,3,12,15,16,20,25,27,36,38,41,56,62,63,66,69,75,88,90,95,97,98,101,102,110,116,121,122,123,136,138,140,141,144,146,150,156,168,169,170,177,178,183,192,194,210,215,216,233,236,241,250,252,263,264,269,278,280,282,283,286,297,304,305,307,309,312,317,330,334,336,337,348,357,359,363,364,367,377,379,382,383,388,391,393,396,397,398,399,403,405,418,424,425,426,427,428,429,436,445,446,449,454,458,466,467,468,470,473,474,478,479,480,482,488,491,492,494,501,511,517,522,523,526,533,534,538,550,553,556,557,560,564,565,566,569,570,571,572,575,577,586,587,589,598,602,604,605,611,618,621,622,624,628,629,635,637,641,644,649,650,652,658,661,664,667,675,676,679,680,687,694,696,705,707,710,714,724,727,730,733,737,738,740,743,744,762,770,772,773,775,788,789,790,796,802,803,807,808,812,815,823,824,828,831,834,841,843,844,845,851,855,860,862,873,874,881,883,884,885,889,892,894,897,898,899,909,914,918,925,936,947,955,971,974,978,987,989,1006,1016,1018,1020,1021,1025,1033,1036,1042,1044,1045,1049,1051,1052,1054,1056,1060,1062,1063,1083,1087,1088,1089,1091,1092,1093,1111,1118,1126,1129,1131,1133,1134,1136,1137,1142,1144,1145,1148,1151,1152,1157,1158,1165,1168,1176,1184,1188,1190,1191,1192,1195,1206,1209,1210,1216,1230,1232,1233,1237,1238,1244,1247,1251,1255,1264,1266,1269,1276,1298,1301,1307,1311,1313,1317,1335,1337,1339,1347,1349,1352,1353,1358,1361,1362,1368,1371,1387,1388,1389,1393,1394,1398,1403,1409,1413,1414,1415,1419,1420,1434,1435,1442,1462,1470,1481,1483,1486,1487,1490,1492,1495,1496,1498,1499,1502,1505,1509,1511,1514,1517,1521,1527,1532,1536,1539,1556,1558,1563,1587,1589,1594,1595,1596,1600,1601,1606,1608,1609,1610,1614,1616,1622,1630,1633,1635,1638,1653,1659,1660,1669,1670,1685,1687,1688,1691,1692,1693,1694,1695,1698,1704,1708,1715,1716,1718,1719,1721,1726,1728,1740,1744,1747,1749,1752,1753,1760,1763,1766,1771,1772,1774,1779,1781,1787,1798,1799,1803,1804,1810,1811,1814,1815,1828,1829,1841,1846,1849,1854,1858,1860,1864,1866,1878,1882,1900,1903,1908,1909,1912,1914,1916,1917,1925,1927,1928,1940,1947,1949,1953,1955,1957,1958,1965,1966,1967,1968,1969,1972,1974,1975,1981,1982,1983,1993,1994,1998,2006,2007,2009,2012,2015,2023,2026,2033,2039,2041,2045,2055,2056,2057,2064,2066,2068,2075,2078,2080,2085,2086,2091,2101,2102,2105,2108,2109,2112,2113,2114,2118,2121,2123,2124,2126,2128,2132,2138,2141,2144,2146,2147,2149,2150,2157,2158,2161,2165,2166,2172,2173,2177,2183,2186,2188,2190,2192,2195,2199,2209,2210,2214,2215,2218,2224,2227,2228,2229,2230,2231,2245,2249,2250,2251,2254,2258,2265,2267,2276,2282,2289,2290,2291,2304,2307,2313,2315,2316,2321,2328,2329,2330,2332,2333,2338,2339,2340,2341,2342,2345,2354,2361,2362,2366,2373,2378,2383,2384,2393,2397,2398,2399,2402,2403,2409,2410,2413,2414,2422,2427,2431,2433,2435,2438,2441,2443,2449,2455,2459,2460,2463,2464,2467,2479,2485,2486,
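For downstream use, each row of result.txt can be turned back into lists and integers. The sketch below assumes the ten tab-separated columns described above and the trailing comma the script leaves on the tree ID list; `parse_result_row` is my own name, and the row used in the demo is a shortened, made-up example rather than real output.

```python
def parse_result_row(line):
    """Split one data row of result.txt into named fields."""
    cols = line.rstrip("\n").split("\t")
    aim, _, brother, _, uncle, _, ancestor, _, count, tree_ids = cols[:10]

    def names(field):
        # drop empty strings left by an empty group or a trailing comma
        return [x for x in field.split(",") if x]

    return {
        "aim": names(aim),
        "brother": names(brother),
        "uncle": names(uncle),
        "ancestor": names(ancestor),
        "count": int(count),
        "tree_ids": [int(x) for x in names(tree_ids)],
    }


# Shortened, invented row in the same layout as the real output
row = "M23\t1\tM5,M9\t2\tM1\t1\tM13,M14\t2\t3\t2,6,13,"
record = parse_result_row(row)
print(record["brother"], record["count"], record["tree_ids"])
# ['M5', 'M9'] 3 [2, 6, 13]
```

Skipping the header line and summing `count` over all rows gives the total number of trees, so each topology's share is its `count` divided by that sum.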

Hope this is useful!
