【Paper Reading Note】Brief Introduction of Encrypted Traffic Classification Using PEAN

Brief Introduction of Encrypted Traffic Classification

[Data Preprocessing-Oriented] [Essay Reading & Understanding Record] [2023.4] [Author : LWC]

Catalogue

  • Brief Introduction of Encrypted Traffic Classification
    • Catalogue
    • Core
    • Motivation and Challenges
    • Basic Knowledge
      • SSL/TLS handshake process
      • Preprocessing of Input Traffic Data [Important]
        • Traffic Data Structure
        • Preprocessing procedure
      • Overall Architecture of PEAN [Not Emphasis]
    • Evaluation Index

Core

PEAN (Packet-level End-to-end Attentive Network) : A novel “multimodal deep learning framework” for encrypted traffic classification (EFC).

※ Mechanism of PEAN

  • INPUT : raw bytes & length sequence.
  • OUTPUT : traffic classification result.
  • Self-attention mechanism : learns the inter-relationships between network packets better.
  • Unsupervised pre-training : enhances the model's ability to characterize traffic.
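
As a rough illustration of the two inputs listed above, the sketch below turns one flow into a fixed-size raw-byte matrix and a packet-length sequence. The function name `build_inputs` and the limits `max_packets` and `max_bytes` are assumptions made for this note; they are not taken from the PEAN code base.

```python
from typing import List, Tuple


def build_inputs(packets: List[bytes], max_packets: int = 10,
                 max_bytes: int = 400) -> Tuple[List[List[int]], List[int]]:
    """Return (byte_matrix, length_sequence) for one flow.

    max_packets and max_bytes are illustrative values, not PEAN's settings.
    """
    byte_matrix: List[List[int]] = []
    lengths: List[int] = []
    for pkt in packets[:max_packets]:
        lengths.append(len(pkt))               # length before truncation
        row = list(pkt[:max_bytes])            # truncate long packets
        row += [0] * (max_bytes - len(row))    # zero-pad short packets
        byte_matrix.append(row)
    # Pad the flow itself so that every flow has max_packets rows.
    while len(byte_matrix) < max_packets:
        byte_matrix.append([0] * max_bytes)
        lengths.append(0)
    return byte_matrix, lengths


flow = [bytes([0x16, 0x03, 0x01]), bytes(range(6))]
matrix, lens = build_inputs(flow, max_packets=3, max_bytes=4)
```

Padding with zeros is only one possible choice; a real pipeline would follow whatever padding value and sizes the model expects.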

Motivation and Challenges

Nowadays, people are more concerned about “data security”, which has led to the emergence of encrypted protocols (such as SSL/TLS). As a result, traditional payload-based classification methods and DPI (deep packet inspection) no longer work once packets are encrypted.

Machine learning algorithms are a promising alternative, but they rely on “hand-designed flow features”, which may ignore important packet details and lead to poor classification performance. (PEAN may offer a better way to obtain flow features.)

Challenges of EFC are as follows :

  • Encrypted packets cannot be classified from their content.
  • There is no effective way to integrate all kinds of information (such as IP and TCP headers).

Basic Knowledge

SSL/TLS handshake process

[Figure 1: SSL/TLS handshake process]

It should be stressed that not every captured flow contains a complete set of “handshake packets”. Thus, there are three cases of “handshake packets” which our terminal system needs to capture :

  • Fully Complete Handshake Packet.
  • Partially Complete Handshake Packet.
  • No Handshake Packet.
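
The three cases above could be told apart roughly as follows. This is only a sketch under a simplifying assumption made for this note: a flow counts as “full” when both a ClientHello and a ServerHello are observed, “partial” when only one of them is, and “none” otherwise. The note itself does not define these cases precisely, and the function name `handshake_category` is hypothetical. The numeric values are the standard TLS handshake message types.

```python
from typing import Iterable

CLIENT_HELLO = 1  # TLS handshake message type: client_hello
SERVER_HELLO = 2  # TLS handshake message type: server_hello


def handshake_category(handshake_types: Iterable[int]) -> str:
    """Classify a flow by the handshake message types seen in it.

    ASSUMPTION: "full" means both hello messages were captured.
    """
    seen = set(handshake_types)
    has_client = CLIENT_HELLO in seen
    has_server = SERVER_HELLO in seen
    if has_client and has_server:
        return "full"
    if has_client or has_server:
        return "partial"
    return "none"


full_case = handshake_category([1, 2, 11, 14])
partial_case = handshake_category([2, 11])
none_case = handshake_category([])
```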

Preprocessing of Input Traffic Data [Important]

This part is what we actually need to do in this project: PEAN is ready-made and open source on GitHub, so all we need to do is “preprocess” the data that goes into PEAN.

Traffic Data Structure

First, let us look at what the input traffic data looks like.
$$R = [P^1, P^2, \ldots, P^n]$$
R represents the “network traffic” set. Every P in this set represents one “packet”, which can be expressed as a triple:
$$P^i = (X^i, B^i, T^i)$$
B represents the “byte content”, T represents the “start time”, and X is a 5-tuple containing the source (SRC), destination (DST), and protocol features. They can be expanded as :
$$B^i = [byte^1, byte^2, \ldots, byte^q], \quad \mathrm{0x00} \le byte \le \mathrm{0xff}$$

$$T^i > 0$$

$$X^i = \langle SrcIP^i, DstIP^i, SrcPort^i, DstPort^i, Protocol^i \rangle$$

R can also be expressed in terms of “flows”. A “flow” f is the set of “packets” P that share the same X. One flow contains many packets, so the l-th flow can be expressed as:
$$f_l = [P_l^1, P_l^2, \ldots, P_l^m]$$
Once every “packet” has been assigned to a “flow”, R can be expressed as :
$$R = [f_1, f_2, \ldots, f_k]$$
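
The definitions above map naturally onto a small data structure. The following is a minimal sketch, assuming packets have already been parsed; the names `Packet` and `group_into_flows` are chosen for this note and do not come from PEAN.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

# X: (SrcIP, DstIP, SrcPort, DstPort, Protocol)
FiveTuple = Tuple[str, str, int, int, str]


@dataclass
class Packet:
    x: FiveTuple  # X^i, the 5-tuple
    b: bytes      # B^i, byte content (each byte in 0x00..0xff)
    t: float      # T^i, start time (> 0)


def group_into_flows(traffic: List[Packet]) -> Dict[FiveTuple, List[Packet]]:
    """Turn R = [P^1, ..., P^n] into flows keyed by X, ordered by time."""
    flows: Dict[FiveTuple, List[Packet]] = defaultdict(list)
    for pkt in sorted(traffic, key=lambda p: p.t):
        flows[pkt.x].append(pkt)
    return dict(flows)


a = ("10.0.0.1", "10.0.0.2", 50000, 443, "TCP")
c = ("10.0.0.3", "10.0.0.2", 50001, 443, "TCP")
R = [Packet(a, b"\x16\x03", 2.0), Packet(c, b"\x17", 1.5), Packet(a, b"\x16", 1.0)]
flows = group_into_flows(R)
```

Every packet ends up in exactly one flow, which matches the statement that R can be rewritten as a list of flows.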

Preprocessing procedure

The preprocessing procedure is as follows :

  • Bi-directional Flow Extraction : Group packets with the same 5-tuple (in either direction) into the same flow. [Tool : SplitCap]
  • TLS Traffic Filtering : Keep only TLS-encrypted traffic and filter out all other types. [Tool : tshark]
  • Traffic Type Selection : Use only 19 kinds of mainstream traffic.
  • Labeling : Use DNS records & the TLS Server Name Indication (SNI) to label each network flow.
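
To make the first step more concrete, here is a small sketch of how both directions of a connection can be mapped to one group key. It only illustrates the idea and is not how SplitCap is implemented; the function name `bidirectional_key` and the sample addresses are assumptions made for this note.

```python
from typing import Tuple

# (SrcIP, DstIP, SrcPort, DstPort, Protocol)
FiveTuple = Tuple[str, str, int, int, str]


def bidirectional_key(x: FiveTuple) -> FiveTuple:
    """Return a key that is identical for both directions of a flow."""
    src_ip, dst_ip, src_port, dst_port, proto = x
    # Sort the two endpoints so that the direction no longer matters.
    first, second = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return (first[0], second[0], first[1], second[1], proto)


request = ("10.0.0.1", "93.184.216.34", 50000, 443, "TCP")
reply = ("93.184.216.34", "10.0.0.1", 443, 50000, "TCP")
other = ("10.0.0.1", "93.184.216.34", 50001, 443, "TCP")

same_flow = bidirectional_key(request) == bidirectional_key(reply)
different_flow = bidirectional_key(request) != bidirectional_key(other)
```

Packets can then be grouped by this key instead of the raw 5-tuple, so requests and replies land in the same flow.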

Overall Architecture of PEAN [Not Emphasis]

PEAN is open source, but we still need to learn a little about its architecture.
[Figure 2: Overall architecture of PEAN]

Evaluation Index

First, we need some terminology for evaluating a neural network. For each category i, TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives:
[Figure 3: Definitions of TP, TN, FP, and FN]
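
For a multi-class problem, these four counts are usually derived per category from a confusion matrix. The sketch below assumes `cm[i][j]` holds the number of samples whose true category is i and whose predicted category is j; the function name `per_class_counts` and the sample matrix are made up for illustration.

```python
from typing import Dict, List


def per_class_counts(cm: List[List[int]]) -> List[Dict[str, int]]:
    """Compute TP, FP, FN, TN for every category of a confusion matrix."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    counts = []
    for i in range(n):
        tp = cm[i][i]                                # predicted i, truly i
        fn = sum(cm[i]) - tp                         # truly i, predicted otherwise
        fp = sum(cm[r][i] for r in range(n)) - tp    # predicted i, truly otherwise
        tn = total - tp - fn - fp                    # everything else
        counts.append({"TP": tp, "FP": fp, "FN": fn, "TN": tn})
    return counts


cm = [[5, 1, 0],
      [2, 6, 2],
      [0, 1, 3]]
counts = per_class_counts(cm)
```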

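Here are the evaluation indices we need to calculate in order to assess PEAN. N denotes the number of categories.

  • Accuracy : The proportion of “correctly classified samples” to “all samples”. Note that the per-class ratio $\frac{TP_i}{TP_i + FP_i}$ is the precision of category i, not the accuracy; it is needed for the F1 value.

$$Accuracy = \frac{\sum_{i=1}^{N} TP_i}{\sum_{i=1}^{N} (TP_i + FN_i)}$$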

  • TPR-avg : TPR is short for “True Positive Rate”, which is the same as recall. TPR-avg is the average of TPR over all categories.

$$TPR_i = \frac{TP_i}{TP_i + FN_i} = Recall_i$$

  • FPR-avg : FPR is short for “False Positive Rate”. FPR-avg is the average of FPR over all categories.

$$FPR_i = \frac{FP_i}{FP_i + TN_i}$$

  • F1_macro : Average F1 value of all categories, where $Precision_i = \frac{TP_i}{TP_i + FP_i}$ and $Recall_i = \frac{TP_i}{TP_i + FN_i}$.
    $$F1_{macro} = \frac{1}{N} \sum_{i=1}^N F1_i = \frac{1}{N} \sum_{i=1}^N 2 \times \frac{Precision_i \times Recall_i}{Precision_i + Recall_i}$$

  • FTF : Can be calculated from TPR & FPR, where w_i is the weight of category i.

$$FTF = \sum_{i=1}^N w_i \frac{TPR_i}{1 + FPR_i}$$
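
The indices in this section can be computed together from the per-category counts. The sketch below is only an illustration under stated assumptions: accuracy follows its verbal definition (correctly classified samples over all samples), F1 uses the per-category precision TP/(TP+FP) and recall TP/(TP+FN), the “avg” indices are taken as plain means over the categories, and w_i is assumed to be the share of samples that belong to category i, since the weight is not defined in this note. The names `evaluate` and `safe_div` and the sample counts are hypothetical.

```python
from typing import Dict, List, Tuple

Counts = Tuple[int, int, int, int]  # (TP, FP, FN, TN) for one category


def safe_div(num: float, den: float) -> float:
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def evaluate(counts: List[Counts]) -> Dict[str, float]:
    """Compute the evaluation indices for a non-empty list of categories."""
    n = len(counts)
    # Every sample has exactly one true category, so TP + FN sums to all samples.
    total = sum(tp + fn for tp, _, fn, _ in counts)
    tprs, fprs, f1s, weights = [], [], [], []
    for tp, fp, fn, tn in counts:
        precision = safe_div(tp, tp + fp)
        recall = safe_div(tp, tp + fn)  # recall is the same as TPR
        tprs.append(recall)
        fprs.append(safe_div(fp, fp + tn))
        f1s.append(safe_div(2 * precision * recall, precision + recall))
        # ASSUMPTION: w_i is the share of samples in category i.
        weights.append(safe_div(tp + fn, total))
    return {
        "accuracy": safe_div(sum(c[0] for c in counts), total),
        "tpr_avg": sum(tprs) / n,
        "fpr_avg": sum(fprs) / n,
        "f1_macro": sum(f1s) / n,
        "ftf": sum(w * t / (1 + f) for w, t, f in zip(weights, tprs, fprs)),
    }


counts = [(5, 2, 1, 12), (6, 2, 4, 8), (3, 2, 1, 14)]
metrics = evaluate(counts)
```

With these weights, a perfect classifier (TPR of 1 and FPR of 0 for every category) gets an FTF of 1, and any false positive or missed sample pushes the value below 1.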
