Cybersecurity spans a variety of missions: prediction, prevention, detection, and response (Gartner PPDR); a variety of 'places': network (intrusion detection), endpoint (anti-malware), application (firewalls), user (behavior analytics), and process (anti-fraud); and a variety of times: in real time (in transit), at rest (monitoring), and during investigation (retrospective).
On the face of it, machine learning has huge potential to add value in cyber: from its innate ability to detect patterns in big data that elude the human eye, through its good fit for fast-scaling data that changes in real time, to its aptitude for hyper-personalization. Yet the use of ML in cyber does not match its popularity in vision or language. Why?
Challenges of Cyber ML
The abundance of training data in vision and language, coupled with the relative ease of labeling it, stands in sharp contrast to the scarcity of well-labeled data in cyber. It is one thing to ask Mechanical Turk workers to label objects in images and quite another to ask them to detect an attack in log files. Not to mention organizations' reluctance to share data that exposes their vulnerabilities and shortcomings.
Signature-matching algorithms that detect known attacks (misuse) are ineffective at detecting novel (zero-day) attacks. Anomaly detection approaches that flag deviations from the norm are susceptible to false positives, which stem from the difficulty of differentiating between normal and abnormal behaviors and from continuously changing malicious behaviors.
While language, vision, and even recommendation models are often considered quasi-stationary, cyber models need to be retrained daily or upon detection of new attacks or malicious data. This strain on the ML workflow can put tight constraints on model training time and on deployment processes, especially across multiple places (edge, endpoints, applications, networks…).
These algorithmic challenges, when exacerbated by imbalanced data sets and frequent retraining requirements, lead to a flood of false positives. It is no wonder that security personnel who suffer from alert fatigue become ML skeptics. So new ML product initiatives often have to battle a bad reputation.
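To see why imbalanced data turns even a good detector into a source of alert fatigue, here is a small worked example using Bayes' rule. The numbers are assumptions chosen for illustration (1 in 1,000 events malicious, 99% of attacks caught, 1% of benign events wrongly flagged, one million events per day); they do not come from any real deployment.

```python
def alert_precision(prevalence, true_positive_rate, false_positive_rate):
    """Fraction of raised alerts that are real attacks (Bayes' rule)."""
    true_alerts = prevalence * true_positive_rate
    false_alerts = (1.0 - prevalence) * false_positive_rate
    return true_alerts / (true_alerts + false_alerts)


# Hypothetical operating point, for illustration only.
PREVALENCE = 0.001           # 1 in 1,000 events is malicious
TRUE_POSITIVE_RATE = 0.99    # detector catches 99% of attacks
FALSE_POSITIVE_RATE = 0.01   # detector wrongly flags 1% of benign events
EVENTS_PER_DAY = 1_000_000

precision = alert_precision(PREVALENCE, TRUE_POSITIVE_RATE, FALSE_POSITIVE_RATE)
false_alerts_per_day = EVENTS_PER_DAY * (1.0 - PREVALENCE) * FALSE_POSITIVE_RATE

print(f"share of alerts that are real attacks: {precision:.1%}")
print(f"false alerts per day: {false_alerts_per_day:,.0f}")
```

Under these assumed rates only about 9% of alerts point at a real attack, and analysts would face close to ten thousand false alerts a day, which is the kind of ratio that breeds skepticism.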
Change is coming…
What is changing now
1. Data
Data is being accumulated like at no other time in history. The realization is starting to sink in that even mundane data, which does not seem to have any value today, may be extremely valuable for training models and determining normative behaviors in the future.
The shift of many organizations to the cloud, coupled with the ever-growing ease of sharing data, is starting to show results. Data sets for Cyber ML are accumulating, making it easier for data scientists to test their hypotheses.
Even companies that are reluctant to share cyber-related data in public (or with academia) are starting to reach critical mass for applying unsupervised ML algorithms to their own data.
2. Tooling
The ML rush attracts many builders, students, and adjacent professionals to data science. In recent years this has manifested not only in the growth of educational programs (online and offline) but also in a flood of novice data scientists. Many ML and cloud vendors saw this coming and built automation and semi-automation tools that make it easy for a novice data scientist to get started with ML.
These tools range from pre-trained models available out of the box via API to automation of the modeling process (in chunks or as a whole). Open source algorithms, notebooks, and the ocean of arXiv papers enable the novice data scientist to focus on finding similar examples and reproducing workflows rather than on inventing ML from scratch.
Similarly, experienced data scientists take advantage of these fast-evolving tools to shed time-consuming ML Ops work and focus on modeling. By automating certain processes (e.g. professional labeling jobs) and giving data scientists more visibility into the model and training process (monitoring and retraining), these tools empower the experienced data scientist to be a lot more effective. And given the scarcity of experienced data scientists, that is worth a lot!
3. Maturity
The ML-can-fix-it-all hype is over, and not just in cyber. Builders, and even more so users and product managers, have come to the realization that ML is a tool with great potential but equally significant limitations. Methodologies for evaluating whether a use case fits ML attest to the understanding that ML can work if applied to the right problem.
The plethora of meetups on ML PoCs and on collaboration between product managers and data scientists indicates that maturity has grown beyond use case fit. The desire to fail fast and apply lean methodology to ML has led to AutoML, pre-trained models, and open source algorithms that can be applied nearly out of the box to get to a PoC in a matter of hours or a few days.
Even the maturity of the more popular siblings, vision and language, comes into play. Facial recognition tools eliminate most of the heavy lifting of biometric validation. NLU engines and word2vec algorithms make the art of feature engineering a lot easier. Neural nets that excel at learning features from raw images can do wonders in learning from raw malware binaries. This leaves cyber professionals to deal with the domain, and not with the ML iceberg underneath.
What's ahead for Cyber ML
Anomalies can be detected using unsupervised approaches that apply neural networks to time series data. Individual behaviors can be learned, and LSTM RNNs trained, to better detect changes in time series that could indicate abnormalities, whether at a specific point, in context, or collectively with other points. This is analogous to tokenizing conversations between computers as if they were a language, with words and sentences that have typical and atypical demeanor.
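The idea of treating machine conversations as a language can be sketched without a neural network. The example below deliberately swaps the LSTM for a much simpler stand-in, a bigram frequency model with add-one smoothing, just to show the mechanics: learn what "typical" token sequences look like from normal traffic, then score new sequences by how surprising they are. The token names and sessions are made up for illustration.

```python
import math
from collections import Counter


def train_bigram_model(sequences):
    """Count token-to-token transitions observed in normal sessions."""
    pair_counts, token_counts, vocabulary = Counter(), Counter(), set()
    for seq in sequences:
        vocabulary.update(seq)
        for prev, nxt in zip(seq, seq[1:]):
            pair_counts[(prev, nxt)] += 1
            token_counts[prev] += 1
    return pair_counts, token_counts, vocabulary


def surprise(seq, model):
    """Average negative log-probability per transition (higher = more atypical)."""
    pair_counts, token_counts, vocabulary = model
    vocab_size = len(vocabulary) + 1  # +1 leaves room for unseen tokens
    total = 0.0
    for prev, nxt in zip(seq, seq[1:]):
        # Add-one smoothing: unseen transitions get a small, nonzero probability.
        p = (pair_counts[(prev, nxt)] + 1) / (token_counts[prev] + vocab_size)
        total -= math.log(p)
    return total / max(len(seq) - 1, 1)


# Hypothetical tokenized sessions standing in for normal traffic.
normal_sessions = [
    ["SYN", "SYN-ACK", "ACK", "GET", "200", "FIN"],
    ["SYN", "SYN-ACK", "ACK", "POST", "200", "FIN"],
    ["SYN", "SYN-ACK", "ACK", "GET", "404", "FIN"],
] * 20
model = train_bigram_model(normal_sessions)

typical = ["SYN", "SYN-ACK", "ACK", "GET", "200", "FIN"]
atypical = ["SYN", "SYN", "SYN", "SYN", "SYN", "SYN"]
typical_score = surprise(typical, model)
atypical_score = surprise(atypical, model)
print(f"typical: {typical_score:.2f}  atypical: {atypical_score:.2f}")
```

An LSTM plays the same role as the bigram table here but can condition on much longer context, which is what lets it catch contextual and collective anomalies rather than only odd adjacent pairs.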
Feature extraction and feature engineering are advancing through analysis of deep interactions between variables. In Intrusion Detection Systems (IDS), advanced features can be extracted from: network headers (e.g. IP source, destination, IP length, source port), TCP connections (duration, protocol type, number of data bytes, etc.), 2-second time windows or 100-connection windows (number of connections to the same host, SYN/REJ error rates, percentage of connections with the same service, etc.), and domain knowledge (number of failed login attempts, compromised conditions, root accesses, shell prompts, etc.).
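As a rough illustration of the time-window features listed above, the sketch below computes three of them for a single connection over the preceding 2 seconds. The `Connection` record, its field names, and the flag values ("SF" for a normal connection, "S0" for a SYN with no reply) are assumptions made for this example, and the records are assumed to be sorted by timestamp.

```python
from dataclasses import dataclass


@dataclass
class Connection:
    timestamp: float  # seconds
    dst_host: str
    service: str
    flag: str         # assumed values: "SF" normal, "S0" SYN without reply


def window_features(connections, index, window_seconds=2.0):
    """Features for connections[index] over the trailing time window."""
    current = connections[index]
    recent = [
        c for c in connections[: index + 1]
        if current.timestamp - c.timestamp <= window_seconds
    ]
    # The current connection is always included, so same_host is never empty.
    same_host = [c for c in recent if c.dst_host == current.dst_host]
    syn_errors = sum(c.flag == "S0" for c in same_host)
    same_service = sum(c.service == current.service for c in same_host)
    return {
        "same_host_count": len(same_host),
        "syn_error_rate": syn_errors / len(same_host),
        "same_service_rate": same_service / len(same_host),
    }


# Made-up traffic: a burst of unanswered SYNs towards 10.0.0.9.
conns = [
    Connection(0.0, "10.0.0.5", "http", "SF"),
    Connection(9.0, "10.0.0.9", "http", "S0"),
    Connection(9.5, "10.0.0.9", "http", "S0"),
    Connection(10.0, "10.0.0.5", "ssh", "SF"),
    Connection(10.5, "10.0.0.9", "http", "S0"),
    Connection(10.8, "10.0.0.9", "smtp", "SF"),
]
features = window_features(conns, index=5)
print(features)
```

A 100-connection window works the same way, with the filter on elapsed time replaced by a slice over the last 100 records.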
Contextual, conditional, and semantic patterns can be recognized using fuzzy association rules. Multidimensional rules can find new signatures for inclusion in misuse detection systems. Latent relationships can be unveiled using graph algorithms and Bayesian networks. By constantly growing and improving relationship identification in graph databases, companies can accelerate response time and contain related attacks more accurately.
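One of the simplest graph algorithms that helps contain related attacks is connected components: alerts that share an entity (an IP, a user, a host, a file hash) are linked, and each connected group can be handled as one incident. The sketch below is a minimal in-memory version using breadth-first search; the alert IDs and entity strings are invented, and a production system would more likely run this inside a graph database.

```python
from collections import defaultdict, deque


def related_alert_groups(alerts):
    """Group alerts that are linked, directly or transitively, by shared entities."""
    alerts_by_entity = defaultdict(set)
    for alert_id, entities in alerts.items():
        for entity in entities:
            alerts_by_entity[entity].add(alert_id)

    seen, groups = set(), []
    for start in alerts:
        if start in seen:
            continue
        group, queue = set(), deque([start])
        seen.add(start)
        while queue:  # breadth-first search over alert -> entity -> alert links
            alert_id = queue.popleft()
            group.add(alert_id)
            for entity in alerts[alert_id]:
                for neighbor in alerts_by_entity[entity]:
                    if neighbor not in seen:
                        seen.add(neighbor)
                        queue.append(neighbor)
        groups.append(group)
    return groups


# Hypothetical alerts and the entities each one mentions.
alerts = {
    "a1": {"ip:203.0.113.7", "user:alice"},
    "a2": {"user:alice", "host:db01"},
    "a3": {"host:db01", "hash:abc123"},
    "a4": {"ip:198.51.100.2"},
}
groups = related_alert_groups(alerts)
print(groups)
```

Here a1 and a3 share nothing directly, yet they land in the same group through a2, which is the kind of latent relationship that is easy to miss when alerts are triaged one at a time.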
Intervention at scale can be achieved using pre-identified density-based clusters. By knowing which cluster of similar behaviors an attacker belongs to, orchestrations can be implemented and latent attacks can be prevented. Clusters can also be combined with decision trees to allow parallel evaluation of features.
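To make "pre-identified density-based clusters" concrete, here is a minimal DBSCAN written from scratch, followed by a helper that assigns a new observation to an existing cluster. The two-dimensional feature points, `eps`, and `min_pts` are arbitrary values picked for the example, and the nearest-neighbor assignment rule is one simple choice among several.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one cluster label per point, -1 meaning noise."""
    labels = [None] * len(points)  # None = not visited yet

    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbors)
    return labels


def assign_cluster(point, points, labels, eps):
    """Give a new observation the label of its nearest clustered point, if within eps."""
    candidates = [
        (math.dist(point, q), label)
        for q, label in zip(points, labels) if label != NOISE
    ]
    if not candidates:
        return NOISE
    distance, label = min(candidates)
    return label if distance <= eps else NOISE


# Made-up behavior features, e.g. (scaled request rate, scaled failed-login ratio).
points = [
    (1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2),
    (8.0, 8.0), (8.1, 7.9), (7.9, 8.2), (8.2, 8.1),
    (4.5, 0.2),
]
labels = dbscan(points, eps=0.5, min_pts=3)
new_label = assign_cluster((8.0, 8.3), points, labels, eps=0.5)
print(labels, new_label)
```

Once a new actor is matched to a known cluster, the response playbook attached to that cluster can be triggered automatically, which is where the scale comes from.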
ML can be integrated into digital forensic investigation work in order to automate it and to empower non-experts to help with some of the forensic backlog. Malware detection and classification can be carried out using static code features as well as dynamically executed code.
The number of open data sets, open source algorithms, and expert community collaborations is growing (even if not yet optimal). This enables faster time to model and new hybrid versions that integrate open source methods with proprietary ones to achieve better accuracy.
Data overflows and static code obfuscations that inhibit real-time response can be overcome with conjunctive rule extraction, dimensionality reduction using data categorization techniques (based on content, time, source, and destination), and an ensemble of recurrent neural networks that predicts whether an executable is malicious within the first 5 seconds of its execution.
Evolutionary computation (genetic algorithms) applies survival-of-the-fittest principles by evolving a set of initial (known) rules to generate new rules using four genetic operators: reproduction, crossover, mutation, and dropping. The fitness function can be the support and confidence of a new genetically created rule.
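The four operators and the support-and-confidence fitness can be sketched on toy data. Everything below is an assumption made for illustration: rules are conjunctions of attribute=value conditions implying "attack", fitness is taken as support times confidence (one of several ways to combine the two), and the records are synthetic, with attacks defined as TCP connections carrying the hypothetical "S0" flag.

```python
import random

ATTRIBUTES = {
    "protocol": ["tcp", "udp", "icmp"],
    "service": ["http", "ssh", "dns"],
    "flag": ["SF", "S0", "REJ"],
}


def matches(rule, record):
    return all(record[attr] == value for attr, value in rule.items())


def fitness(rule, records):
    """Support x confidence of the rule 'conditions => attack'."""
    covered = [r for r in records if matches(rule, r)]
    if not rule or not covered:
        return 0.0
    hits = sum(r["attack"] for r in covered)
    return (hits / len(records)) * (hits / len(covered))


def crossover(a, b, rng):
    """Each attribute's condition (or its absence) is inherited from one parent."""
    child = {}
    for attr in ATTRIBUTES:
        parent = a if rng.random() < 0.5 else b
        if attr in parent:
            child[attr] = parent[attr]
    return child


def mutate(rule, rng):
    """Set one randomly chosen attribute to a random value."""
    child = dict(rule)
    attr = rng.choice(list(ATTRIBUTES))
    child[attr] = rng.choice(ATTRIBUTES[attr])
    return child


def drop(rule, rng):
    """Remove one condition, generalizing the rule (keeps at least one)."""
    child = dict(rule)
    if len(child) > 1:
        del child[rng.choice(sorted(child))]
    return child


def evolve(initial_rules, records, generations=30, population_size=20, seed=0):
    """Expects at least two initial rules."""
    rng = random.Random(seed)
    population = [dict(r) for r in initial_rules]
    for _ in range(generations):
        ranked = sorted(population, key=lambda r: fitness(r, records), reverse=True)
        survivors = ranked[: max(2, population_size // 4)]  # reproduction
        children = []
        while len(survivors) + len(children) < population_size:
            child = crossover(*rng.sample(survivors, 2), rng)
            roll = rng.random()
            if roll < 0.3:
                child = mutate(child, rng)
            elif roll < 0.5:
                child = drop(child, rng)
            children.append(child)
        population = survivors + children
    return max(population, key=lambda r: fitness(r, records))


# Synthetic records: every attribute combination once; TCP with flag S0 is an attack.
records = [
    {"protocol": p, "service": s, "flag": f, "attack": p == "tcp" and f == "S0"}
    for p in ATTRIBUTES["protocol"]
    for s in ATTRIBUTES["service"]
    for f in ATTRIBUTES["flag"]
]
initial_rules = [{"flag": "S0"}, {"protocol": "tcp", "service": "http"}]
best_rule = evolve(initial_rules, records)
print(best_rule, fitness(best_rule, records))
```

Because the fittest rules always survive into the next generation, the best rule can only stay the same or improve from one generation to the next; whether it reaches the ideal rule depends on the random draws and the number of generations.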
Summary
ML adoption in cyber lags behind vision and language. It is challenged by the difficulty of obtaining labeled data, the fast-changing, creative nature of zero-day attacks and malicious actions, and the need for frequent retraining.
However, accumulated data sets (both normative and abnormal); new tools that accelerate labeling (professional), training (distributed), deployment (hybrid), and monitoring (real-time), all the way to out-of-the-box models for novice data scientists; coupled with the maturity of research and organizational understanding of ML, are leading to a new era of cyber ML.
Going forward, we see a variety of innovative approaches assisting in the various aspects of cyberwarfare: from RNNs, time series and sequence anomalies, deep feature relationships, fuzzy rules, clustering, graphs, and ensemble learning, all the way to evolutionary computing.
These are exciting times to be in Cyber ML!
— — — —
This post is my opinion and does not represent my current or past employers. It was first published on OrenSteinberg.com
Translated from: https://towardsdatascience.com/how-machine-learning-transforms-cyber-fb7aca17a1cc