Using Python to replace the text fields in KDD-99 with numeric values

I'll skip the background, since it would only waste your bandwidth.
Let's get straight to it.

Please credit the source when reposting.
http://blog.csdn.net/isinstance/article/details/51328794

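KDD-99 is a network anomaly traffic dataset based on data from Lincoln Laboratory. If you want to download it, the link is here: KDD-99 dataset
Each line of the source file looks like this: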


0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.

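As you can see, fields 2, 3 and 4, and the last field, are non-numeric.
Now we use Python to replace each of these values with a number determined by its position in a list.
The lists are as follows: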


["tcp", "udp", "icmp"] Here tcp is at position 0 and icmp is at position 2; the values in the lists below are replaced in the same way.

["aol", "auth", "bgp", "courier", "csnet_ns", "ctf", "daytime", "discard", "domain", "domain_u", "echo", "eco_i", "ecr_i", "efs", "exec", "finger", "ftp", "ftp_data", "gopher", "harvest", "hostnames", "http", "http_2784", "http_443", "http_8001", "imap4", "IRC", "iso_tsap", "klogin", "kshell", "ldap", "link", "login", "mtp", "name", "netbios_dgm", "netbios_ns", "netbios_ssn", "netstat", "nnsp", "nntp", "ntp_u", "other", "pm_dump", "pop_2", "pop_3", "printer", "private", "red_i", "remote_job", "rje", "shell", "smtp", "sql_net", "ssh", "sunrpc", "supdup", "systat", "telnet", "tftp_u", "tim_i", "time", "urh_i", "urp_i", "uucp", "uucp_path", "vmnet", "whois", "X11", "Z39_50"]
["OTH", "REJ", "RSTO", "RSTOS0", "RSTR", "S0", "S1", "S2", "S3", "SF", "SH"]
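The position lookup can be sketched with Python's built-in `list.index`. This is only a minimal illustration: the helper name `position_of` is an assumption and is not taken from the author's code.

```python
# Two of the lists from the text above.
PROTOCOLS = ["tcp", "udp", "icmp"]
FLAGS = ["OTH", "REJ", "RSTO", "RSTOS0", "RSTR",
         "S0", "S1", "S2", "S3", "SF", "SH"]


def position_of(value, choices):
    """Return the 0-based position of value in choices (hypothetical helper)."""
    # list.index raises ValueError if the value is not in the list,
    # which makes unexpected field values easy to spot.
    return choices.index(value)


print(position_of("tcp", PROTOCOLS))   # 0
print(position_of("icmp", PROTOCOLS))  # 2
print(position_of("SF", FLAGS))        # 9
```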

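First, use the split() function to split each line of the source file into a list.
Then pass that list to a function called replace_kdd(list).
Line 72 of the code has a list that specifies which positions in each line are to be replaced.
The 99999 in it is there to prevent overflow.
Inside replace_kdd(list), a function countingFunction(type_into, name) is called.
This function finds the position of the element in the lists above and returns that position, which is then written to the output file as the value.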
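A line of the file after replacement looks like this (note that in this sample the written values are one more than the 0-based list positions: tcp becomes 1, http becomes 22 and SF becomes 10, while the final label normal. is left unchanged):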


0,1,22,10,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.
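The steps above (split the line, look up the three text fields, write the numbers back) can be sketched as follows. This is a minimal sketch, not the author's replace_kdd or countingFunction: the names `replace_fields`, `COLUMN_LISTS` and the `offset` parameter are assumptions. `offset=1` is used by default only because the sample output above shows list position plus one.

```python
PROTOCOLS = ["tcp", "udp", "icmp"]
SERVICES = [
    "aol", "auth", "bgp", "courier", "csnet_ns", "ctf", "daytime", "discard",
    "domain", "domain_u", "echo", "eco_i", "ecr_i", "efs", "exec", "finger",
    "ftp", "ftp_data", "gopher", "harvest", "hostnames", "http", "http_2784",
    "http_443", "http_8001", "imap4", "IRC", "iso_tsap", "klogin", "kshell",
    "ldap", "link", "login", "mtp", "name", "netbios_dgm", "netbios_ns",
    "netbios_ssn", "netstat", "nnsp", "nntp", "ntp_u", "other", "pm_dump",
    "pop_2", "pop_3", "printer", "private", "red_i", "remote_job", "rje",
    "shell", "smtp", "sql_net", "ssh", "sunrpc", "supdup", "systat", "telnet",
    "tftp_u", "tim_i", "time", "urh_i", "urp_i", "uucp", "uucp_path", "vmnet",
    "whois", "X11", "Z39_50",
]
FLAGS = ["OTH", "REJ", "RSTO", "RSTOS0", "RSTR",
         "S0", "S1", "S2", "S3", "SF", "SH"]

# 0-based column index after split(",") -> list used for the lookup.
# Fields 2, 3 and 4 in the text are columns 1, 2 and 3 here.
COLUMN_LISTS = {1: PROTOCOLS, 2: SERVICES, 3: FLAGS}


def replace_fields(line, offset=1):
    """Replace the protocol, service and flag fields with numbers.

    offset is added to the 0-based list position (assumed parameter).
    The label in the last column is left untouched.
    """
    fields = line.strip().split(",")
    for col, choices in COLUMN_LISTS.items():
        fields[col] = str(choices.index(fields[col]) + offset)
    return ",".join(fields)


sample = ("0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,"
          "0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,"
          "0.00,0.00,0.00,0.00,normal.")
print(replace_fields(sample))
```

Calling `replace_fields(sample, offset=0)` instead yields the plain 0-based positions (tcp as 0, http as 21, SF as 9), which matches the "tcp is at position 0" convention described with the lists.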

That is the general idea. The GitHub location of the code is given at the end. Suggestions are welcome; my contact details are on both GitHub and CSDN.

The link to the code is here

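  • 2018-1-12: I have uploaded my graduation thesis to GitHub. It contains the implementation details, which you can use as a reference. Two years have passed, so I decided to release the thesis.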
