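As the figure above shows, the Reber Grammar is essentially a directed graph with cycles. Start at B and follow any sequence of directed edges (each edge stands for a letter) until you reach E; the letters collected along the way form a string that satisfies the Reber Grammar. Examples are BTSXSE and BPTVVE.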
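The graph can also be read as a small state machine, which gives a direct membership test. The sketch below is one possible encoding and is not part of the original post: the state numbers and the names `TRANSITIONS`, `ACCEPT` and `is_reber` are made up for illustration, and the transitions follow the standard Reber Grammar diagram, which is assumed to match the figure above.

```python
# Hypothetical state numbering for the standard Reber Grammar diagram.
# TRANSITIONS[state][letter] -> next state; state 7 is the accepting state.
TRANSITIONS = {
    0: {"B": 1},
    1: {"T": 2, "P": 3},
    2: {"S": 2, "X": 4},
    3: {"T": 3, "V": 5},
    4: {"X": 3, "S": 6},
    5: {"P": 4, "V": 6},
    6: {"E": 7},
}
ACCEPT = 7


def is_reber(s):
    """Return True if s spells a complete path from B to E."""
    state = 0
    for ch in s:
        nxt = TRANSITIONS.get(state, {}).get(ch)
        if nxt is None:          # no edge with this letter leaves the state
            return False
        state = nxt
    return state == ACCEPT


print(is_reber("BTSXSE"))   # True
print(is_reber("BPTVVE"))   # True
print(is_reber("BTSXSEE"))  # False
```

Both example strings from the text are accepted, while a string with a trailing extra letter is rejected because no edge leaves the accepting state.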
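Embedding a Reber Grammar string in a fixed wrapper yields a string that satisfies the Embedded Reber Grammar. There are two wrapper formats.
Format of the upper path in the figure above: BT + (a Reber Grammar string) + TE
Format of the lower path in the figure above: BP + (a Reber Grammar string) + PE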
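For example, BTBTSXSETE and BPBTSXSEPE both satisfy the Embedded Reber Grammar.
To decide whether a string satisfies the Embedded Reber Grammar, you have to confirm that its second letter equals its second-to-last letter. A learning model therefore needs some form of memory (it has to carry the second letter across the whole embedded string and compare it with the second-to-last one) to classify such strings correctly. A string in the format shown in the figure below does not satisfy the grammar.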
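The check described above can be sketched with a regular expression in which a backreference plays the role of the "memory". This is an illustration under assumptions rather than the post's own method: the patterns `TAIL` and `REBER` were derived by hand from the standard Reber Grammar graph, and the names `EMBEDDED` and `is_embedded_reber` are hypothetical.

```python
import re

# Hand-derived from the standard Reber Grammar graph (an assumption).
# TAIL: everything that may follow "BP" up to, but not including, the final E.
TAIL = r"(?:T*VPX)*T*V(?:V|PS)"
# REBER: a complete Reber string, written with non-capturing groups only.
REBER = rf"B(?:TS*X(?:S|X{TAIL})|P{TAIL})E"

# Group 1 remembers the 2nd letter; \1 forces the 2nd-to-last letter to equal it.
EMBEDDED = re.compile(rf"B([TP]){REBER}\1E")


def is_embedded_reber(s):
    """Return True if s is a Reber string wrapped as BT...TE or BP...PE."""
    return EMBEDDED.fullmatch(s) is not None


print(is_embedded_reber("BTBTSXSETE"))  # True
print(is_embedded_reber("BPBTSXSEPE"))  # True
print(is_embedded_reber("BTBTSXSEPE"))  # False: 2nd letter T, 2nd-to-last P
```

The last call fails only because of the mismatched wrapper letters; the embedded part BTSXSE is itself a valid Reber string.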
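The code below generates n strings that satisfy the Embedded Reber Grammar. The basic idea is to use a dict to record which edges may follow each edge in the Reber Grammar graph; whenever more than one successor edge is available, one is picked uniformly at random.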
import random
import pandas as pd
import numpy as np

def ReberGrammar(n):
    '''
    :param n:
        how many Reber strings to generate
    :return:
        a list of unembedded Reber strings
    '''
    # Each key is an edge; its value lists the edges that may follow it.
    # The digit only tells apart edges that carry the same letter (T1 vs T2).
    graph={"B":["T1","P1"],
           "T1":["S1","X1"],
           "P1":["T2","V1"],   # after B-P the T self-loop is T2; T1 belongs to the S/X branch
           "S1":["S1","X1"],
           "T2":["T2","V1"],
           "X1":["X2","S2"],
           "X2":["T2","V1"],
           "V1":["P2","V2"],
           "P2":["X2","S2"],
           "S2":["E"],
           "V2":["E"],
           "E":["end"]}
    rebers=[]
    for i in range(n):
        str_i=""
        edge_="B"
        while edge_!="end":
            str_i+=edge_[0]                     # keep the letter, drop the digit
            edge_=random.choice(graph[edge_])   # uniform choice among successors
        rebers.append(str_i)
    return rebers

def EmbeddedReberGrammar(rebers):
    '''
    :param rebers:
        unembedded Reber strings
    :return:
        embedded Reber strings
    '''
    newRebers=[]
    for str_ in rebers:
        # pick the T wrapper or the P wrapper with equal probability
        kind=random.randint(0,1)
        if kind==0:
            newRebers.append("BT"+str_+"TE")
        else:
            newRebers.append("BP"+str_+"PE")
    return newRebers

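To train a machine learning model that decides whether a string satisfies the Embedded Reber Grammar, we also need strings that violate the grammar as negative samples.
There are many ways to produce them. The approach used here is:
(1) with probability 0.5, tamper with the 2nd letter: change P to T, or T to P;
(2) with probability 0.5, pick one letter at random between the 3rd and the 3rd-to-last (inclusive) and change it to an illegal one.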
def illegalReberGrammarString(rebers):
    '''
    :param rebers:
        a list of legal embedded Reber strings
    :return:
        a list of illegal Reber strings
    '''
    graph={"B":["T1","P1"],
           "T1":["S1","X1"],
           "P1":["T2","V1"],
           "S1":["S1","X1"],
           "T2":["T2","V1"],
           "X1":["X2","S2"],
           "X2":["T2","V1"],
           "V1":["P2","V2"],
           "P2":["X2","S2"],
           "S2":["E"],
           "V2":["E"],
           "E":["end"]}
    all_edges=["B","T","P","S","X","V","E"]
    illegalRebers=[]
    for str_i in rebers:
        kind=random.randint(0,1)
        if kind==0:
            # rule (1): flip the 2nd letter so it no longer matches the 2nd-to-last
            if str_i[1]=="P":
                str_i=str_i[0]+"T"+str_i[2:]
            else:
                str_i=str_i[0]+"P"+str_i[2:]
        else:
            # rule (2): corrupt one letter between the 3rd and the 3rd-to-last
            L=len(str_i)
            edge_index=random.choice(range(2,L-2))
            edge_=str_i[edge_index]
            if edge_ in ["B","E"]:
                sub_edge_=["P","T"]
            else:
                sub_edge_=graph[edge_+"1"]+graph[edge_+"2"]
            # letters that are not successors of edge_; edge_ itself is excluded
            # as well so the string always changes
            wrong_edge_=set(all_edges)-set(e[0] for e in sub_edge_)-{edge_}
            # sorted() gives a list; random.choice needs a sequence, not a set
            replace=random.choice(sorted(wrong_edge_))
            str_i=str_i[0:edge_index]+replace+str_i[edge_index+1:]
        illegalRebers.append(str_i)
    return illegalRebers

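The code below writes a csv file:
column 1 is the label y: 1 means the string satisfies the grammar, 0 means it does not;
column 2 is the string x to be classified as satisfying the Embedded Reber Grammar or not.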
num=5000
# generate num Reber Grammar strings
rebers=ReberGrammar(num)
# embed each Reber Grammar string in one of the two wrappers
embeddedRebers=EmbeddedReberGrammar(rebers)
# apply the tampering rules to get num illegal Embedded Reber Grammar strings
illegalRebers=illegalReberGrammarString(embeddedRebers)
# build the dataframe and write it to csv
labels=[1]*num+[0]*num
strings=embeddedRebers+illegalRebers
# shuffle the rows so positive and negative samples are mixed
index=np.random.permutation(2*num)
df=pd.DataFrame()
df['y']=labels
df['x']=strings
df=df.iloc[index,:]
# the chapter14/ directory must already exist
df.to_csv('chapter14/reberStings.csv',index=False)
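Before strings like these are fed to a learning model they usually have to be turned into numbers. The sketch below shows one common option, integer ids with zero padding. It is an assumption about the downstream step, not something the post specifies, and the names `ALPHABET`, `CHAR_TO_ID` and `encode` are hypothetical.

```python
import numpy as np

ALPHABET = "BTPSXVE"   # the seven edge letters
# id 0 is reserved for padding, so letters start at 1
CHAR_TO_ID = {c: i + 1 for i, c in enumerate(ALPHABET)}


def encode(strings):
    """Turn strings into a zero-padded integer array of shape (n, max_len)."""
    max_len = max(len(s) for s in strings)
    out = np.zeros((len(strings), max_len), dtype=np.int64)
    for row, s in enumerate(strings):
        out[row, :len(s)] = [CHAR_TO_ID[c] for c in s]
    return out


X = encode(["BTBTXSETE", "BPBPVVEPE"])
print(X.shape)  # (2, 9)
```

Padding on the right keeps the second letter at a fixed column, while the second-to-last letter moves with the string length, so a model still has to remember the second letter across the sequence.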