Coursera-Unsupervised Learning, Recommenders, Reinforcement Learning--Anomaly Detection

        An anomaly detection algorithm takes an unlabeled dataset of normal events, fits a model to it, and uses that model to decide whether a new example is anomalous.

1. Density Estimation

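        Given a training set (note that all of its examples are normal events), build a model p(x) of how likely x is to appear in the dataset. For a new example, compute p(x_{test}) and compare it with a threshold \varepsilon: if p(x_{test})<\varepsilon the example is flagged as an anomaly; otherwise it is treated as normal.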
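        The thresholding rule above can be sketched in a few lines. The density values and the threshold below are made up purely for illustration; in practice p(x) comes from the fitted model and \varepsilon is tuned on a cross validation set.

```python
import numpy as np

# Hypothetical density values p(x_test) for five test examples (illustrative only)
p_test = np.array([0.12, 0.08, 0.0004, 0.25, 0.00002])

# Hypothetical threshold; normally chosen with a cross validation set
epsilon = 0.001

# An example is flagged as anomalous when p(x_test) < epsilon
is_anomaly = p_test < epsilon
anomaly_indices = np.flatnonzero(is_anomaly)
```

        Here the third and fifth examples fall below the threshold, so they are the ones flagged.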

2. Normal / Gaussian Distribution

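        The density p(x) is defined below, together with its plot; \mu is the mean, i.e. the x-value of the curve's axis of symmetry, and \sigma is the standard deviation.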

        p(x)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}                 \mu=\frac{1}{m}\sum_{i=1}^{m} x^{(i)}              \sigma^{2}=\frac{1}{m}\sum_{i=1}^{m} (x^{(i)}-\mu)^{2}

[Figure: plot of the Gaussian density p(x)]
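        As a quick check of the formula, here is a minimal sketch of the univariate Gaussian density. The function name gaussian_pdf and the parameter values are assumptions chosen for illustration, not part of the course code.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """Univariate Gaussian density with mean mu and variance var (sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

mu, var = 3.0, 4.0                        # illustrative values, so sigma = 2
peak = gaussian_pdf(mu, mu, var)          # maximum, equal to 1 / (sqrt(2*pi) * sigma)
left = gaussian_pdf(mu - 1.5, mu, var)
right = gaussian_pdf(mu + 1.5, mu, var)   # equal to `left` by symmetry about mu
```

        The density peaks at x=\mu and takes the same value at equal distances on either side of it, which matches the symmetric bell shape.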

3. Anomaly Detection Algorithm

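        Given a training set \{\vec{x}^{(1)},\vec{x}^{(2)},\ldots,\vec{x}^{(m)}\} in which each example has n features (note that all of these examples are normal), estimate \mu_j and \sigma_j^2 for each feature j, then combine them into the overall density p(\vec{x})=\prod_{j=1}^{n} p(x_j;\mu_j,\sigma_j^2)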

        Next, use a cross validation set (containing both normal and anomalous examples) to choose a suitable value of \varepsilon; finally, evaluate the model on a test set.
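        The per-feature product in the algorithm above could be computed roughly as follows. This is a sketch under assumptions: product_density is a hypothetical helper name, the toy data is invented, and the product form treats the features as independent.

```python
import numpy as np

def product_density(X, mu, var):
    """p(x) = prod_j p(x_j; mu_j, var_j) for each row of X; (m, n) -> (m,)."""
    per_feature = np.exp(-(X - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return np.prod(per_feature, axis=1)

# Toy data: 4 examples, 2 features (made up for illustration)
X = np.array([[1.0, 10.0], [2.0, 12.0], [3.0, 11.0], [2.0, 11.0]])
mu = X.mean(axis=0)
var = X.var(axis=0)    # population variance, i.e. divides by m

p = product_density(X, mu, var)                             # densities of the training points
p_far = product_density(np.array([[9.0, 30.0]]), mu, var)   # a point far from the data
```

        The point far from the training data receives a much smaller density than any training example, which is what makes a threshold on p(x) usable as an anomaly test.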

4. Example: Computer Anomaly Detection

        The data has two features, throughput and latency, and the dataset contains m=307 examples.

X_train, X_val, y_val = load_data()
# X_train: training set
# X_val,y_val: cross validation set

         First, following the Gaussian formulas, write a function estimate_gaussian(X) that returns \mu and \sigma^2 for every feature.

import numpy as np

def estimate_gaussian(X):
    """
    Calculates mean and variance of all features 
    in the dataset
    
    Args:
        X (ndarray): (m, n) Data matrix
    
    Returns:
        mu (ndarray): (n,) Mean of all features
        var (ndarray): (n,) Variance of all features
    """

    m, n = X.shape
    
    ### START CODE HERE ### 
    # Vectorize with NumPy: axis=0 reduces over the rows, giving one value per column (feature)
    mu = 1 / m * np.sum(X, axis = 0)
    var = 1 / m * np.sum((X-mu)**2, axis = 0)
    ### END CODE HERE ### 
        
    return mu, var
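         As a sanity check, the same two formulas can be compared against NumPy's built-in reductions. The small matrix below is made up for illustration; the point is that np.var uses ddof=0 by default, so it divides by m just like the formula above.

```python
import numpy as np

X = np.array([[1.0, 4.0], [2.0, 6.0], [3.0, 8.0]])   # toy (m=3, n=2) data matrix
m = X.shape[0]

# Same expressions as in estimate_gaussian
mu_manual = 1 / m * np.sum(X, axis=0)
var_manual = 1 / m * np.sum((X - mu_manual) ** 2, axis=0)

# Built-in equivalents; np.var divides by m because ddof defaults to 0
mu_builtin = X.mean(axis=0)
var_builtin = X.var(axis=0)
```

         Both routes give the same per-feature mean and variance, so either form can be used.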

         Next, use the estimated parameters to build p(x), which gives the distribution plotted below:

[Figure: plot of the fitted distribution p(x)]

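         Next, choose a suitable \varepsilon using the cross validation set. Given the ground-truth labels y_val and the model's densities p_val, loop over candidate thresholds: for each one, count tp (anomalous and predicted anomalous), fp (normal but predicted anomalous), and fn (anomalous but predicted normal), compute the precision prec and the recall rec, and keep the \varepsilon with the highest F1.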

        prec=\frac{tp}{tp+fp}                        rec=\frac{tp}{tp+fn}                        F_{1}=\frac{2\cdot prec\cdot rec}{prec+rec}
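        A small worked example of these three formulas, using made-up counts. The helper name f1_from_counts is hypothetical, and returning zeros when there are no true positives is a convention chosen here to avoid dividing by zero.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, F1); all are 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0, 0.0, 0.0
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return prec, rec, 2 * prec * rec / (prec + rec)

# Made-up counts: 6 anomalies caught, 2 false alarms, 3 anomalies missed
prec, rec, f1 = f1_from_counts(tp=6, fp=2, fn=3)
```

        With these counts, prec = 6/8 = 0.75, rec = 6/9 ≈ 0.667 and F1 = 12/17 ≈ 0.706.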

def select_threshold(y_val, p_val): 
    """
    Finds the best threshold to use for selecting outliers 
    based on the results from a validation set (p_val) 
    and the ground truth (y_val)
    
    Args:
        y_val (ndarray): Ground truth on validation set
        p_val (ndarray): Results on validation set
        
    Returns:
        epsilon (float): Threshold chosen 
        F1 (float):      F1 score by choosing epsilon as threshold
    """ 

    best_epsilon = 0
    best_F1 = 0
    F1 = 0
    
    step_size = (max(p_val) - min(p_val)) / 1000
    
    for epsilon in np.arange(min(p_val), max(p_val), step_size):
    
        ### START CODE HERE ### 
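        # Flag an example as anomalous when its density is below the threshold
        predictions = (p_val < epsilon)

        tp = np.sum((predictions == 1) & (y_val == 1))  # anomalous, predicted anomalous
        fp = np.sum((predictions == 1) & (y_val == 0))  # normal, predicted anomalous
        fn = np.sum((predictions == 0) & (y_val == 1))  # anomalous, predicted normal

        # Without true positives F1 cannot beat the current best; skip to avoid 0/0
        if tp == 0:
            continue

        prec = tp / (tp + fp)
        rec = tp / (tp + fn)
        F1 = 2 * prec * rec / (prec + rec)
        ### END CODE HERE ###

        if F1 > best_F1: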
            best_F1 = F1
            best_epsilon = epsilon
        
    return best_epsilon, best_F1
