Paper studied: G. Creech and J. Hu, "A Semantic Approach to Host-Based Intrusion Detection Systems Using Contiguous and Discontiguous System Call Patterns," in IEEE Transactions on Computers, vol. 63, no. 4, pp. 807-819, April 2014, doi: 10.1109/TC.2013.13.
The paper proposes a novel semantic feature, extracted from system call sequences, for anomaly detection.
The extraction method has three steps.
First, the training data must be processed to extract a dictionary containing every contiguous system call trace present in the training samples. This step is equivalent to using multiple window lengths under Forrest’s methodology [20], [25], [26], [46] and [47], where the maximum window length allowed is in fact the length of each trace. Each dictionary entry extracted at this stage forms a conceptual ‘word’, or a ‘phrase’ of length 1.
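1. Extract words
A word is a contiguous subsequence of n API calls (n >= 2).
All words obtained from the training sequences together form the word dictionary.
Example: for the sequence 1,2,3,4,5
the words are: [1,2],[2,3],[3,4],[4,5]
[1,2,3],[2,3,4],[3,4,5]
[1,2,3,4],[2,3,4,5]
[1,2,3,4,5]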
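The word extraction in step 1 can be sketched in Python as follows. This is only a minimal illustration of the note above, not the paper's implementation: the function name extract_words and the min_len parameter are hypothetical, and a trace is assumed to be a plain list of integer call IDs.

```python
def extract_words(trace, min_len=2):
    """Return every contiguous subsequence of `trace` with length >= min_len.

    Hypothetical helper: each word is stored as a tuple so it can live in a set.
    """
    words = set()
    n = len(trace)
    for length in range(min_len, n + 1):        # word lengths 2 .. len(trace)
        for start in range(n - length + 1):     # every start position that fits
            words.add(tuple(trace[start:start + length]))
    return words


words = extract_words([1, 2, 3, 4, 5])
# Print shortest words first, mirroring the listing in the example above.
for word in sorted(words, key=lambda w: (len(w), w)):
    print(list(word))
```

For the sequence 1,2,3,4,5 this yields the 10 words listed above (4 + 3 + 2 + 1). The number of words grows quadratically with trace length, so a real dictionary built from long traces would probably need a cap on word length.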
Second, these words are then used to construct further dictionaries consisting of every possible combination of the words up to a specified phrase length.
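2. Combine words into phrases
All possible combinations of any n words form the phrase dictionary of length n (the phrase dictionary of length 1 is the word dictionary itself).
Example: from the word dictionary {[1,2],[2,3],[3,4]}, the length-2 phrase dictionary contains [1,2,2,3],[2,3,1,2],… and the length-3 phrase dictionary contains [1,2,2,3,3,4],[2,3,1,2,3,4],…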
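A rough Python sketch of step 2 is given below. It rests on assumptions not stated in the note: word order matters (both [1,2,2,3] and [2,3,1,2] appear in the example), the same word may be reused within a phrase, and a phrase is stored as the flat concatenation of its words. The name build_phrase_dict is hypothetical.

```python
from itertools import product


def build_phrase_dict(words, phrase_len):
    """Build the phrase dictionary of length `phrase_len` from `words`.

    Assumption: ordered combinations with repetition (itertools.product).
    Each phrase is flattened into one tuple of call IDs.
    """
    phrases = set()
    for combo in product(words, repeat=phrase_len):
        phrases.add(tuple(call for word in combo for call in word))
    return phrases


words = [(1, 2), (2, 3), (3, 4)]
phrases2 = build_phrase_dict(words, 2)   # includes (1, 2, 2, 3) and (2, 3, 1, 2)
phrases3 = build_phrase_dict(words, 3)   # includes (1, 2, 2, 3, 3, 4)
print(len(phrases2), len(phrases3))
```

If a word should not be repeated inside one phrase, itertools.permutations(words, phrase_len) could replace product. Either way the dictionary grows exponentially with the phrase length, which is presumably why the paper limits phrases to "a specified phrase length".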
extract occurrence counts of these different length phrases.
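3. Obtain the semantic feature vector
The output is the semantic feature vector [x1, x2, x3, x4, ..., xn], where xn is the number of distinct phrases from the length-n phrase dictionary that occur in the sequence being classified.
Example: if 10 distinct phrases from the length-1 phrase dictionary occur in the sequence, then x1 = 10 in that sequence's feature vector.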
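Step 3 can be sketched in Python as below. The note does not say how a phrase is matched against a sequence, so this sketch assumes a phrase "occurs" when its flattened calls appear as one contiguous run in the sequence. The names occurs and semantic_feature_vector and the toy dictionaries are hypothetical, chosen only for illustration.

```python
def occurs(phrase, trace):
    """True if `phrase` appears as a contiguous run inside `trace` (assumed matching rule)."""
    k = len(phrase)
    target = tuple(phrase)
    return any(tuple(trace[i:i + k]) == target for i in range(len(trace) - k + 1))


def semantic_feature_vector(phrase_dicts, trace):
    """phrase_dicts[i] is the phrase dictionary of length i + 1.

    Returns [x1, ..., xn], where each entry counts the distinct phrases
    of that dictionary that occur in `trace`.
    """
    return [sum(1 for phrase in d if occurs(phrase, trace)) for d in phrase_dicts]


# Toy dictionaries, made up for this example only.
phrase_dicts = [
    {(1, 2), (2, 3), (3, 4)},                      # length-1 phrases (words)
    {(1, 2, 2, 3), (2, 3, 3, 4), (3, 4, 1, 2)},    # length-2 phrases
]
trace = [1, 2, 2, 3, 3, 4]
features = semantic_feature_vector(phrase_dicts, trace)
print(features)  # [3, 2]
```

Here all three words occur in the trace, so x1 = 3, while only (1,2,2,3) and (2,3,3,4) occur among the length-2 phrases, so x2 = 2. The resulting fixed-length vector is what would be fed to a classifier.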