[Data Preprocessing-Oriented] [Essay Reading & Understanding Record] [2023.4] [Author : LWC]
PEAN (Packet-level End-to-end Attentive Network): a novel “multimodal deep learning framework” for EFC.
※ Mechanism of PEAN:
Nowadays, people are more concerned about “data security”, which has led to the emergence of encrypted protocols (such as SSL/TLS). As a result, traditional payload-based classification methods and DPI no longer work once packets are encrypted.
Machine learning algorithms are a promising alternative, but they rely on “hand-designed flow features”, which may ignore important packet details and lead to poor classification performance. (PEAN may offer a better way to obtain flow features.)
Challenges of EFC are as follows :
It should be emphasised that not all cases contain complete “handshake packets”. Thus, there are three types of “handshake packets” that we need to capture in our terminal system:
This part is what we actually need to do in this project, because PEAN is ready-made and open source on GitHub; all we need to do is “preprocess” the data that goes into PEAN.
First, let us analyse what the input traffic data looks like.
$$R = [P^1, P^2, ..., P^n]$$
R represents the “network traffic” set. In this set, every P represents one “packet”, which can be expressed as a triple:
$$P^i = (X^i, B^i, T^i)$$
B represents the “byte content”, T represents the “start time”, and X is a 5-tuple containing the source (SRC), destination (DST) and protocol features. They can be expanded as:
$$B^i = [byte^1, byte^2, ..., byte^q], \quad 0x00 \le byte \le 0xff$$
$$T^i > 0$$
$$X^i = \langle SrcIP^i, DstIP^i, SrcPort^i, DstPort^i, Protocol^i \rangle$$
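To make the packet definition concrete, here is a minimal Python sketch of P = (X, B, T). The class and field names (`FiveTuple`, `Packet`, `src_ip` and so on) are our own illustrative choices and are not taken from the PEAN code base.

```python
from dataclasses import dataclass
from typing import NamedTuple


class FiveTuple(NamedTuple):
    """X: the 5-tuple that identifies which connection a packet belongs to."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str


@dataclass(frozen=True)
class Packet:
    """P = (X, B, T). Hypothetical container, not PEAN's own data type."""
    x: FiveTuple  # X: 5-tuple
    b: bytes      # B: byte content, every element is in 0x00..0xff
    t: float      # T: start time, must be > 0

    def __post_init__(self):
        # Enforce the constraint T > 0 from the definition.
        if self.t <= 0:
            raise ValueError("start time T must be positive")


# Example values are made up for illustration only.
pkt = Packet(
    FiveTuple("10.0.0.1", "10.0.0.2", 50000, 443, "TCP"),
    bytes([0x16, 0x03, 0x01]),
    0.5,
)
```

Using `bytes` for B gives the range constraint 0x00 ≤ byte ≤ 0xff for free, since a Python `bytes` object cannot hold values outside that range.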
R can also be expressed in terms of “flows”. A “flow” f is the set of “packets” P that share the same X. One flow contains many packets, so the l-th flow can be expressed as:
$$f_{l} = [P_{l}^1, P_{l}^2, ..., P_{l}^m]$$
Once every “packet” has been assigned to a “flow”, R can be expressed as:
$$R = [f_{1}, f_{2}, ..., f_{k}]$$
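The step from a packet list to a flow list can be sketched as below. This is only an illustration under our own assumptions: each packet is a plain `(X, B, T)` tuple, a flow is keyed by the exact 5-tuple X (so the two directions of a connection become two flows), and packets inside a flow are ordered by T. The function name `group_into_flows` is hypothetical.

```python
from collections import defaultdict


def group_into_flows(packets):
    """Group (X, B, T) packets into flows; packets with the same X share a flow."""
    flows = defaultdict(list)
    for x, b, t in packets:
        flows[x].append((x, b, t))
    # Assumption: order the packets of each flow by start time T.
    return [sorted(flow, key=lambda p: p[2]) for flow in flows.values()]


# Made-up example traffic R with two different 5-tuples.
x_a = ("10.0.0.1", "10.0.0.2", 50000, 443, "TCP")
x_c = ("10.0.0.3", "10.0.0.2", 50001, 443, "TCP")
R = [(x_a, b"\x16\x03", 0.2), (x_c, b"\x17", 0.3), (x_a, b"\x16", 0.1)]

flows = group_into_flows(R)
```

Every packet ends up in exactly one flow, so the flows together contain the same packets as R.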
The preprocessing procedure is as follows:
PEAN is open source, but we still need to learn a little about its architecture.
First we need to learn some terminology for evaluating a neural network. For each class i, TP, TN, FP and FN stand for the numbers of true positives, true negatives, false positives and false negatives.
Here are the evaluation metrics we need to calculate in order to assess PEAN.
$$Precision_{i} = \frac{TP_i}{TP_{i} + FP_{i}}$$
$$TPR_{i} = \frac{TP_i}{TP_{i} + FN_{i}} = Recall_{i}$$
$$FPR_{i} = \frac{FP_i}{FP_{i} + TN_{i}}$$
F1_macro: the average F1 value over all N categories.
$$F1_{macro} = \frac{1}{N} \sum_{i=1}^N F1_i = \frac{1}{N} \sum_{i=1}^N 2 \times \frac{Precision_{i} \times Recall_{i}}{Precision_{i} + Recall_{i}}$$
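As a rough check of these formulas, the sketch below derives the per-class counts from a confusion matrix and averages the per-class F1 values. It assumes `conf[i][j]` is the number of samples of true class i predicted as class j. The ratio TP_i / (TP_i + FP_i) is the quantity usually called precision, and TP_i / (TP_i + FN_i) is the recall. The helper names are our own, and the numbers are made up.

```python
def per_class_counts(conf):
    """Return a list of (TP, TN, FP, FN), one tuple per class."""
    n = len(conf)
    total = sum(sum(row) for row in conf)
    counts = []
    for i in range(n):
        tp = conf[i][i]
        fn = sum(conf[i]) - tp                       # true class i, predicted otherwise
        fp = sum(conf[r][i] for r in range(n)) - tp  # predicted i, true class otherwise
        tn = total - tp - fn - fp
        counts.append((tp, tn, fp, fn))
    return counts


def safe_div(a, b):
    """Divide, returning 0.0 when the denominator is zero (our convention)."""
    return a / b if b else 0.0


def f1_macro(conf):
    """Unweighted mean of the per-class F1 values."""
    f1s = []
    for tp, tn, fp, fn in per_class_counts(conf):
        precision = safe_div(tp, tp + fp)
        recall = safe_div(tp, tp + fn)
        f1s.append(safe_div(2 * precision * recall, precision + recall))
    return sum(f1s) / len(f1s)


conf = [[8, 2],
        [1, 9]]
score = f1_macro(conf)  # mean of 16/19 and 18/21, roughly 0.85
```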
FTF: a weighted score computed from the per-class TPR and FPR.
$$FTF = \sum_{i=1}^N w_{i} \frac{TPR_{i}}{1 + FPR_{i}}$$
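A small sketch of the FTF computation follows, again starting from a confusion matrix where `conf[i][j]` counts samples of true class i predicted as class j. These notes do not define the weights w_i, so the default used here (the share of samples whose true class is i) is purely our assumption; pass `weights` explicitly to use a different definition. The function name `ftf` is hypothetical.

```python
def ftf(conf, weights=None):
    """FTF = sum_i w_i * TPR_i / (1 + FPR_i)."""
    n = len(conf)
    total = sum(sum(row) for row in conf)
    if weights is None:
        # Assumption (not from the notes): w_i = fraction of samples in class i.
        weights = [sum(row) / total for row in conf]
    score = 0.0
    for i in range(n):
        tp = conf[i][i]
        fn = sum(conf[i]) - tp
        fp = sum(conf[r][i] for r in range(n)) - tp
        tn = total - tp - fn - fp
        tpr = tp / (tp + fn) if (tp + fn) else 0.0
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        score += weights[i] * tpr / (1 + fpr)
    return score


conf = [[8, 2],
        [1, 9]]
value = ftf(conf)  # 0.5 * (0.8 / 1.1) + 0.5 * (0.9 / 1.2), roughly 0.74
```

With weights that sum to 1, a perfect classifier (TPR = 1 and FPR = 0 for every class) gives FTF = 1.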