Signature Fraud Detection: An Advanced Analytics Approach

In my previous article, I discussed applications of advanced analytics to fraud in generic terms. In this article I delve into one specific area of fraud: signature forgery. Institutions and businesses recognize signatures as the primary way of authenticating transactions. People sign checks, authorize documents and contracts, validate credit card transactions and verify activities through signatures. As the number of signed documents (and their availability) has increased tremendously, so has signature fraud.

According to recent studies, check fraud alone costs banks about $900M per year, with 22% of all fraudulent checks attributed to signature fraud. With more than 27.5 billion checks written each year in the United States (according to the 2010 Federal Reserve Payments Study), visually comparing signatures by hand on the tens of millions of checks processed daily is clearly impractical.

The advent of big data on distributed, Hadoop-based platforms such as MapR has made it possible to store and process large volumes of signature images economically and efficiently. This lets enterprises use comprehensive historical transaction data to discover patterns in fraudulent signatures by developing algorithms that automate the traditional visual comparison.

The art and science of signatures:

Before turning to the types of automated signature verification and a detailed method, let's review some concepts related to the signing process: a popular myth, the types of signature forgery, and the resulting loopholes in conventional visual comparison of static signature images.

Myth: The authentic signatures of the same person are exactly alike across all transactions.

Reality: The physical act of signing requires coordinating the brain, eyes, arms, fingers, muscles and nerves. With so many factors in play, it's no wonder that people don't sign their name exactly the same way every time: some elements may be omitted or altered. Personality, emotional state, health, age, the conditions under which the individual signs, the space available for the signature and many other factors all influence signature-to-signature deviations.

Types of signature forgeries:

In real life, a forger mainly focuses on accuracy rather than fluency.

Signature forgeries fall into the following three categories:

1. Random/blind forgery: Typically has little or no similarity to the genuine signature. This type of forgery is created when the forger has no access to the authentic signature.

2. Unskilled (trace-over) forgery: The genuine signature is traced over, leaving a faint indentation on the sheet of paper underneath. This indentation is then used as a guide for the forged signature.

3. Skilled forgery: Produced by a perpetrator who has access to one or more samples of the authentic signature and can imitate it after much practice. Skilled forgeries are the most difficult of all to detect.

An effective signature verification system must be able to detect all of these types of forgery by means of reliable, customized algorithms.

Manual verification conundrum:

Manual verification is subjective and varies heavily with human factors such as expertise, fatigue, mood and working conditions, so it is error-prone and inconsistent. With skilled forgeries in the offline setting, this leads to two kinds of error:

False rejection: A genuine transaction is mistakenly flagged as fraudulent and declined, which hurts customer satisfaction. This is often called a type I error.

False acceptance: A skilled forgery is accepted by the operator as an authentic signature, leading to financial and reputational loss. This is often called a type II error.

The goal of an accurate verification system is to minimize both types of error.
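As a small illustration of the two error types, the sketch below computes a false rejection rate and a false acceptance rate from labelled decisions. The function name `error_rates` and the toy data are assumptions made for this example; they are not part of the original experiment.

```python
def error_rates(is_genuine, accepted):
    """Return (false_rejection_rate, false_acceptance_rate).

    is_genuine[i] is True when signature i is authentic;
    accepted[i] is True when the verifier accepted signature i.
    """
    genuine = [a for g, a in zip(is_genuine, accepted) if g]
    forged = [a for g, a in zip(is_genuine, accepted) if not g]
    # Type I error: a genuine signature that was rejected.
    frr = sum(1 for a in genuine if not a) / len(genuine)
    # Type II error: a forged signature that was accepted.
    far = sum(1 for a in forged if a) / len(forged)
    return frr, far


# Toy data (hypothetical): four genuine and four forged signatures.
truth = [True, True, True, True, False, False, False, False]
decision = [True, True, True, False, True, False, False, False]
frr, far = error_rates(truth, decision)
print(f"false rejection rate={frr}, false acceptance rate={far}")
```

Keeping the two rates separate matters because they carry different costs: one hurts customer satisfaction, the other causes direct financial loss.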

Signature traits:

Let's look at the signature features a human document examiner uses to distinguish forged signatures from genuine ones. The following is a non-exhaustive list of static and dynamic characteristics used for signature verification:

· Shaky handwriting (static)

· Pen lifts (dynamic)

· Signs of retouching (static and dynamic)

· Letter proportions (static)

· Signature shape/dimension (static)

· Slant/angulation (static)

· Very close similarity between two or more signatures (static)

· Speed (dynamic)

· Pen pressure (dynamic)

· Pressure change patterns (dynamic)

· Acceleration pattern (dynamic)

· Smoothness of curves (static)

Depending on the verification environment and the conditions under which samples are collected, not all of these features are available for analysis.

Types of automatic signature verification system:

Depending on which signature characteristics can feasibly be extracted and on the business or functional requirements, two broad categories of signature verification system exist in the market.

A) Offline signature verification: Deployed where there is no scope for monitoring a person's signing activity in real time. In applications that scrutinize signed paper documents, only a static, two-dimensional image is available for verification. For obvious reasons, dynamic characteristics are unavailable to this type of verification engine. To compensate for the loss of this important information and still produce highly accurate comparison results, offline signature verification systems have to imitate the methodologies and approaches used by human forensic document examiners. This method depends heavily on tedious image preprocessing (scaling, resizing, cropping, rotation, filtering, histogram of oriented gradients, thresholding, hash tagging, etc.) and on solid machine learning skills. The features used here are mainly static in nature: image texture (wavelet descriptors), geometry and topology (shape, size, aspect ratio, etc.), stroke positions, handwriting similarity, etc.

Despite these limitations, offline verification is often the only option: in most real-life check transactions and digital document verification, signatures are executed beforehand, so there is no scope for real-time monitoring to capture dynamic features.

For offline signature verification, the machine learning task can be further categorized as:

1) General learning (person-independent): The verification is performed by comparing the questioned signature against each known signature on a 1:1 basis.

2) Special learning (person-dependent): The task is to verify whether the questioned signature falls within the range of variation among multiple genuine signatures of the same individual.

B) Online signature verification: Signing is a reflex action built on repetition rather than on deliberate muscle control, and even accurate forgeries take longer to produce than genuine signatures. As the name suggests, this type of system can capture crucial dynamic features such as speed, acceleration and pressure. It is more accurate because it is virtually impossible, even for a copy machine or an expert, to mimic the unique behavior patterns and characteristics of the original signer.

Experiment brief:

Let's discuss a simplistic offline verification solution in a simulated environment. For this research, data was collected from 40 individuals, each contributing 25 signatures, for a total of 1,000 genuine signatures. Subjects were then randomly chosen to forge another person's signature, 15 forgeries per individual, giving 600 forgeries (a decent oversampling of fraud). With 25 genuine and 12 forged signatures per person, the data was randomly split into training (75%) and validation (25%) sets, ensuring at least 15 genuine signatures per person in the training data. The goal is to build an offline, algorithmic signature verification system using the person-independent learning method: an engine that determines whether or not a questioned signature from the validation set belongs to a particular individual.

Fig: Genuine signature sample

Fig: Sample for an individual (genuine and forged)

Solution framework:

Person-independent supervised learning: The learning problem is converted into a two-class classification problem in which the input is the difference (dissimilarity) between a pair of signatures. The odds that a signature is genuine are expressed as a likelihood ratio (LR), obtained from suitable parametric distributions of the distance (the dissimilarity score of paired signatures) fitted separately to the good (authentic) and bad (forged) populations. The distance between a person's questioned signature and their known genuine signature is then evaluated under both fitted distributions to calculate the LR score. Based on the LR and a pre-specified threshold (chosen for maximum accuracy), the system decides whether or not the questioned signature (from the test data) is genuine with respect to that person.

Model Equation
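
The equation that belongs under this heading appears to have been an image that did not survive. Judging from the symbol definitions that follow and from step D further down (likelihood ratios multiplied over N known samples and compared with a threshold), it was presumably of the following form. This is a reconstruction under those assumptions, not the author's original rendering:

```latex
\prod_{i=1}^{N} \frac{P\bigl(D_g(i) \mid d\bigr)}{P\bigl(D_b(i) \mid d\bigr)} > \Psi
```

Read this way, a questioned signature is accepted as genuine when the product of the N pairwise likelihood ratios exceeds Ψ.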

Where

• P(Dg(i)|d) is the probability density function value of the Dg (genuine) distribution at distance d

• P(Db(i)|d) is the probability density function value of the Db (forged) distribution at distance d

• N is the number of known samples from a person available for 1:1 comparison

• Ψ is a pre-specified threshold value > 1

Although the modeling task is straightforward, a lot of image preprocessing is required to calculate the distance, or vector of distances (d), between signature pairs from the extracted static features. A suitable parametric model must also be selected and tuned to an optimal cutoff value.

Steps involved:

A) Feature extraction: This is a highly technical area that involves complex image processing to extract the discriminating elements, and combinations of elements, for a particular person.

1) Image preprocessing and grid formation: After gray-scale transformation, each signature went through salt-and-pepper noise removal and slant normalization. Then, after suitable resizing, cropping and other augmentation, each image was reconstructed on a 4x7 grid.

2) Binary feature vector extraction: A GSC (gradient, structural and concavity) feature map is extracted from the pixels of the image grid, and the corresponding local histogram for each cell is quantized into a binary feature vector of 1024 bits (the sum of the bits of the G, S and C features).

Fig: Image grid and 1024-bit binary feature vector
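Two parts of the preprocessing step can be sketched in a few lines of plain Python: salt-and-pepper noise removal and the 4x7 grid partition. The article does not say which filter was used, so the 3x3 median filter here is an assumption (it is a common choice for this kind of noise). The function names are hypothetical, and slant normalization and GSC extraction are not covered.

```python
from statistics import median_low


def median_filter3(img):
    """Apply a 3x3 median filter to a gray-scale image (list of rows).

    Border pixels use only the neighbours that exist.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            window = [img[rr][cc]
                      for rr in range(max(0, r - 1), min(h, r + 2))
                      for cc in range(max(0, c - 1), min(w, c + 2))]
            out[r][c] = median_low(window)
    return out


def grid_cells(img, rows=4, cols=7):
    """Split an image into rows x cols cells; cells[i][j] is a sub-image."""
    h, w = len(img), len(img[0])
    r_edges = [i * h // rows for i in range(rows + 1)]
    c_edges = [j * w // cols for j in range(cols + 1)]
    return [[[row[c_edges[j]:c_edges[j + 1]]
              for row in img[r_edges[i]:r_edges[i + 1]]]
             for j in range(cols)]
            for i in range(rows)]


# Toy 8x14 image (hypothetical): blank background with one "salt" pixel.
noisy = [[0] * 14 for _ in range(8)]
noisy[3][5] = 255
clean = median_filter3(noisy)
cells = grid_cells(clean)
print(len(cells), len(cells[0]))
```

In practice an image library would do this faster; the point is only that each grid cell becomes the unit from which local features are later extracted.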

B) Similarity (distance) measure: Gaussian landmark sets, exp(−r_ij² / 2σ²), are developed for point-to-point matching of paired images, and an overall similarity or distance measure is used to compute a score that signifies the strength of the match between two signatures. The similarity measure converts the pairwise data from feature space to distance space. Several such measures exist; here the Hamming distance is used.

(Apologies for not elaborating on these topics here because of space constraints; I will discuss them in a separate post.)
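For binary feature vectors, the Hamming distance is simply the number of bit positions in which two vectors differ. A minimal sketch, assuming each 1024-bit feature vector is stored as a Python integer (that storage choice and the function names are assumptions for illustration):

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two binary feature vectors differ."""
    return bin(a ^ b).count("1")


def normalized_hamming(a: int, b: int, n_bits: int = 1024) -> float:
    """Scale the Hamming distance to the range [0, 1]."""
    return hamming_distance(a, b) / n_bits


# Toy 4-bit vectors (hypothetical); real vectors would have 1024 bits.
print(hamming_distance(0b1011, 0b0010))
print(normalized_hamming(0b1011, 0b0010, n_bits=4))
```

The normalized form is convenient later on because it gives every signature pair a dissimilarity score on the same scale.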

C) Model training (distribution fit): The pairwise distances (d) of the training data are collected into two vectors: Dg, the vector of distances between all pairs of genuine signatures (samples that truly came from the same person), and Db, the vector of distances between all pairs of forged signatures (samples that came from different persons). These two distance vectors can be modeled using known distributions such as Gaussian or gamma. In this example, the gamma distribution fits the data well.
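The distribution fit can be sketched as follows. The article does not say how the gamma parameters were estimated, so this example uses a method-of-moments fit as a simple stand-in (an assumption; maximum likelihood is the more common choice). The distance values and function names are invented for illustration.

```python
import math
from statistics import fmean, pvariance


def fit_gamma(distances):
    """Estimate gamma (shape, scale) by the method of moments.

    For a gamma distribution, mean = shape * scale and
    variance = shape * scale**2.
    """
    m = fmean(distances)
    v = pvariance(distances, mu=m)
    return m * m / v, v / m


def gamma_pdf(x, shape, scale):
    """Gamma probability density at x, computed in log space."""
    if x <= 0:
        return 0.0
    log_pdf = ((shape - 1) * math.log(x) - x / scale
               - math.lgamma(shape) - shape * math.log(scale))
    return math.exp(log_pdf)


# Hypothetical pairwise distances for genuine (Dg) and forged (Db) pairs.
Dg = [0.15, 0.18, 0.20, 0.22, 0.25, 0.19]
Db = [0.40, 0.45, 0.50, 0.42, 0.55, 0.48]
genuine_params = fit_gamma(Dg)
forged_params = fit_gamma(Db)
print(gamma_pdf(0.2, *genuine_params), gamma_pdf(0.2, *forged_params))
```

Each population gets its own fitted curve, and the two density values at the same distance are what the likelihood ratio in the next step compares.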

D) Likelihood ratio (LR) and classification decision: A questioned signature attributed to a particular person, taken from untagged data (here, the validation set), goes through the preprocessing described above and is matched 1:1 against that person's genuine signature. The resulting distance score (pairwise dissimilarity) d is evaluated against the fitted density curves to get the LR value, P(Dg|d) / P(Db|d). If the likelihood ratio is greater than 1, the classification decision is that the two samples belong to the same person; if the ratio is less than 1, they belong to different persons. If there are N known samples from a person, then N 1:1 verifications can be performed for one questioned sample and the likelihood ratios multiplied. For convenience, log likelihood ratios (LLR) are used rather than likelihood ratios, so the product becomes a sum.

Fig: Distribution fit and classification decision
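The decision rule in step D can be sketched as a sum of log likelihood ratios over the N distances between the questioned signature and a person's known genuine samples. The gamma parameters below are hypothetical placeholders rather than values fitted in the article, and the function names are assumptions.

```python
import math


def gamma_logpdf(x, shape, scale):
    """Log of the gamma probability density at x (x must be > 0)."""
    return ((shape - 1) * math.log(x) - x / scale
            - math.lgamma(shape) - shape * math.log(scale))


def log_likelihood_ratio(distances, genuine_params, forged_params):
    """Sum log P(Dg|d) - log P(Db|d) over all 1:1 comparisons."""
    return sum(gamma_logpdf(d, *genuine_params)
               - gamma_logpdf(d, *forged_params)
               for d in distances)


def is_genuine(distances, genuine_params, forged_params, alpha=0.0):
    """Accept the questioned signature when the summed LLR exceeds alpha."""
    return log_likelihood_ratio(distances, genuine_params, forged_params) > alpha


# Hypothetical fitted (shape, scale) pairs: genuine distances centre near
# 0.2, forged distances near 0.45.
GENUINE = (4.0, 0.05)
FORGED = (9.0, 0.05)

print(is_genuine([0.15, 0.20, 0.18], GENUINE, FORGED))
print(is_genuine([0.50, 0.55, 0.45], GENUINE, FORGED))
```

Summing logs rather than multiplying ratios avoids numerical underflow when N grows, and alpha = 0 corresponds to the likelihood ratio threshold of 1 described above.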

Performance evaluation: Although there is a noticeable overlapping zone, the fitted distributions above do a reasonably good job of discriminating the two regions (genuine and fraud). The default decision boundary is given by the sign of the LLR, and a modified decision boundary can be constructed using a threshold α, such that log P(Dg|d) − log P(Db|d) > α. The model accuracy, defined as [1 − ((false acceptance rate + false rejection rate) / 2)], reaches its maximum at a particular value of α. Finding it is a matter of model tuning, and the best setting of α for a given number of known samples is called the operating point. In the ROC curves, generated with varying numbers of known samples (from 12 to 15), the operating point is shown as '*'. The overall accuracy is around 77%.

Fig: Model tuning and performance
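Finding the operating point amounts to scanning candidate values of α and keeping the one that maximizes 1 − (FAR + FRR) / 2. The sketch below uses the observed LLR scores themselves as candidate thresholds, which is one simple choice among many; the scores and the function name are made up for illustration.

```python
def best_threshold(genuine_llrs, forged_llrs):
    """Return (alpha, accuracy) maximizing 1 - (FAR + FRR) / 2.

    A signature is accepted when its LLR score is strictly above alpha.
    On ties, the smallest such alpha is kept.
    """
    best_alpha, best_acc = None, -1.0
    for alpha in sorted(set(genuine_llrs) | set(forged_llrs)):
        # False rejection: genuine score at or below the threshold.
        frr = sum(1 for s in genuine_llrs if s <= alpha) / len(genuine_llrs)
        # False acceptance: forged score above the threshold.
        far = sum(1 for s in forged_llrs if s > alpha) / len(forged_llrs)
        acc = 1 - (far + frr) / 2
        if acc > best_acc:
            best_alpha, best_acc = alpha, acc
    return best_alpha, best_acc


# Hypothetical validation LLR scores.
genuine_scores = [2.0, 1.5, 1.0, 0.2]
forged_scores = [-1.0, -0.5, 0.0, 0.6]
alpha, accuracy = best_threshold(genuine_scores, forged_scores)
print(alpha, accuracy)
```

Because this accuracy weights the two error rates equally, a deployment that fears false acceptance more than false rejection would likely pick a different operating point on the same curve.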

Improvement and road ahead:

This experiment and its simplistic solution achieve moderate accuracy. Accuracy can be improved with more training data and by fitting and ensembling other models, including non-parametric methods (deep learning, CNNs, etc.). Incorporating other distance measures between image pairs (e.g. Levenshtein distance, Chamfer distance) as additional features, or taking a simple or weighted average of these dissimilarity features, would also make the dissimilarity measure more robust and reliable and add discriminatory power to the model.
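The simple or weighted average of several dissimilarity scores could look like the sketch below. It assumes each score has already been normalized to a comparable range such as [0, 1]; the measure names, weights and function name are illustrative only.

```python
def combined_dissimilarity(scores, weights=None):
    """Weighted average of named dissimilarity scores.

    scores maps a measure name to its value for one signature pair.
    With weights=None every measure counts equally (simple average).
    """
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total = sum(weights[name] for name in scores)
    return sum(weights[name] * scores[name] for name in scores) / total


# Hypothetical normalized scores for one pair of signatures.
pair_scores = {"hamming": 0.2, "chamfer": 0.4}
print(combined_dissimilarity(pair_scores))
print(combined_dissimilarity(pair_scores, {"hamming": 3.0, "chamfer": 1.0}))
```

The combined score can then replace the single Hamming distance as the input d to the distribution fit, or the individual scores can be kept as separate features for a classifier.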

Finally, cutting-edge signature verification systems need to be adaptive, agile and accurate. This requires deep analysis of ever-growing datasets and continuous updates to production models, so that performance remains stable over time, unlike the results achieved by human operators in high-volume situations.

Original article: https://towardsdatascience.com/signature-fraud-detection-an-advanced-analytics-approach-a795b0e588b2
