Local Outlier Factor (LOF)
Today’s article is the fifth in a series of “bite-size” articles I am writing on different techniques used for anomaly detection. If you are interested, these are the previous four articles:
Z-score for anomaly detection
Boxplot for anomaly detection
Statistical techniques for anomaly detection
Time series anomaly detection with “anomalize” library
Today I am going beyond statistical techniques and stepping into machine learning algorithms for anomaly detection.
What is the Local Outlier Factor (LOF)?
LOF is an unsupervised machine learning algorithm (it can also be run in a semi-supervised way, as noted at the end of this article) that uses the local density of data points as the key factor for detecting outliers.
LOF compares the density of any given data point to the density of its neighbors. Outliers sit in low-density areas, so the ratio of their neighbors' density to their own density is high. As a rule of thumb, a normal data point has a LOF between 1 and 1.5, whereas anomalous observations have a much higher LOF. The higher the LOF, the more likely the point is an outlier. If the LOF of point X is 5, the average density of X’s neighbors is 5 times its own local density.
In mathematical terms,
$$\mathrm{LOF}(X) = \frac{1}{k} \sum_{i=1}^{k} \frac{\mathrm{LRD}(n_i)}{\mathrm{LRD}(X)}$$
where $n_1, \dots, n_k$ are the k nearest neighbors of X and LRD is the local reachability density, computed as follows.
$$\mathrm{LRD}(X) = \frac{1}{\frac{1}{k} \sum_{i=1}^{k} \text{Reachability Distance}(X, n_i)}$$
The algorithm has four different components:
Hyperparameter k: determines the number of neighbors
Reachability distance: the distance from X to a neighbor n, but never smaller than n's own distance to its kth nearest neighbor; the underlying distance can be measured with metrics such as Euclidean, Minkowski, or Manhattan
Local reachability density (LRD): LRD(X) = 1 / ((sum of Reachability Distance(X, n)) / k), where n runs over the k nearest neighbors of X
Local Outlier Factor (LOF): the average ratio of the neighbors' LRD to the LRD of X
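To make the four components above concrete, here is a minimal sketch that computes LOF by hand using only the Python standard library. The helper names (manhattan, k_nearest, lof_scores) are made up for illustration, the five points are a toy dataset, and the sketch assumes Manhattan distance, k = 2, and exactly k neighbors per point (ties at the kth distance are ignored). It is not how any particular library implements LOF.

```python
# Illustrative sketch only: helper names and the toy data are assumptions.

def manhattan(p, q):
    """Manhattan (city-block) distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))

def k_nearest(points, i, k):
    """Indices of the k nearest neighbors of points[i], excluding itself."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: manhattan(points[i], points[j]))[:k]

def lof_scores(points, k):
    n = len(points)
    neighbors = [k_nearest(points, i, k) for i in range(n)]
    # k-distance: distance from each point to its kth nearest neighbor
    k_dist = [manhattan(points[i], points[neighbors[i][-1]]) for i in range(n)]

    def reach_dist(i, j):
        # reachability distance of i from j, never below j's k-distance
        return max(k_dist[j], manhattan(points[i], points[j]))

    # local reachability density: inverse of the mean reachability distance
    lrd = [k / sum(reach_dist(i, j) for j in neighbors[i]) for i in range(n)]
    # LOF: mean ratio of the neighbors' density to the point's own density
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]

points = [(0, 1), (1, 1), (1, 2), (2, 2), (5, 6)]
scores = lof_scores(points, k=2)
for p, s in zip(points, scores):
    print(p, round(s, 2))
```

On this toy data the four clustered points come out with a LOF of about 1, while the isolated point (5, 6) comes out at about 5, which is the "5 times less dense than its neighbors" situation described earlier.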
Enough of theory and mathematics. If you didn’t understand much of it, no hard feelings. As I like to say, to drive a car we don’t need to know its mechanics, but we do need to know how to drive! So let's jump right into the next section on the implementation of LOF in Python.
Python implementation
We are going to implement LOF for anomaly detection in a Python environment using the Scikit-Learn library. Let’s first import the required libraries:
# data preparation
import pandas as pd
import numpy as np
# data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# outlier/anomaly detection
from sklearn.neighbors import LocalOutlierFactor
Now let’s create a hypothetical dataset containing 5 data points.
# data
df = pd.DataFrame(np.array([[0,1], [1,1], [1,2], [2,2], [5,6]]), columns = ["x", "y"], index = [0,1,2,3,4])
If you plot the data points, the outlier is easy to spot by visual inspection.
# plot data points
plt.scatter(df["x"], df["y"], color = "b", s = 65)
plt.grid()
Figure: Made-up data points
So indeed, we don’t need a machine learning algorithm to find out that the 5th data point is an outlier. But let’s see if the algorithm can detect it.
# model specification
model1 = LocalOutlierFactor(n_neighbors = 2, metric = "manhattan", contamination = 0.02)
# model fitting
y_pred = model1.fit_predict(df)
# filter outlier index (fit_predict returns -1 for outliers and 1 for inliers)
outlier_index = np.where(y_pred == -1)[0]
# filter outlier values
outlier_values = df.iloc[outlier_index]
# plot data
plt.scatter(df["x"], df["y"], color = "b", s = 65)
# plot outlier values
plt.scatter(outlier_values["x"], outlier_values["y"], color = "r")
Figure: Detection of anomalous data points using LOF
There you go! The algorithm correctly detected the outlier.
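If you want to see the scores behind the labels, the fitted model keeps them in its negative_outlier_factor_ attribute, which holds the LOF with the sign flipped. The snippet below is a small sketch that rebuilds the same five points as a NumPy array (the variable names X, model, labels, and lof are mine) and prints each point with its LOF and its predicted label.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# the same five made-up points used above
X = np.array([[0, 1], [1, 1], [1, 2], [2, 2], [5, 6]])

model = LocalOutlierFactor(n_neighbors=2, metric="manhattan", contamination=0.02)
labels = model.fit_predict(X)  # -1 = outlier, 1 = inlier

# scikit-learn stores the negated LOF, so flip the sign to get LOF values
lof = -model.negative_outlier_factor_

for point, score, label in zip(X, lof, labels):
    print(point, round(float(score), 2), label)
```

With this data the clustered points should score close to 1 and the point (5, 6) close to 5, in line with the rule of thumb from the theory section.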
Summary and conclusion
The purpose of this article was to introduce a density-based anomaly detection technique: Local Outlier Factor. LOF compares the density of a given data point to that of its neighbors and determines whether that data point is normal or anomalous. The implementation of this algorithm is not too difficult thanks to the sklearn library. The interpretation of the results is also pretty straightforward.
To focus on just one thing, I left out another important use case of the LocalOutlierFactor() algorithm: novelty detection. This is the subject of another article, but briefly, in that mode LOF works as a semi-supervised ML algorithm that is trained only on normal data. After training, new data is shown to the model, which identifies whether each new observation is novel or not.
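As a quick taste of that use case, here is a minimal sketch using the novelty=True option of LocalOutlierFactor. The training points are the four clustered points from the toy dataset, treated here as "normal" data, and the two new observations in X_new are hypothetical values I made up for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# training data assumed to contain only normal observations
X_train = np.array([[0, 1], [1, 1], [1, 2], [2, 2]])

# novelty=True enables predict() on data the model has not seen
model = LocalOutlierFactor(n_neighbors=2, metric="manhattan", novelty=True)
model.fit(X_train)

# hypothetical new observations: one near the cluster, one far from it
X_new = np.array([[1.0, 1.5], [5.0, 6.0]])
pred = model.predict(X_new)  # 1 = looks normal, -1 = novel
print(pred)
```

Here the point near the cluster should be labeled 1 and the distant point -1. Note that with novelty=True the model is meant to be used with predict() on new data only, not with fit_predict() on the training data.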
I hope you liked this article. Feel free to follow me on Medium or Twitter.
Translated from: https://towardsdatascience.com/anomaly-detection-with-local-outlier-factor-lof-d91e41df10f2