Cloud Computing, Machine Learning
A common question in data science is:
Which machine learning algorithm should I use?
While no magic algorithm solves every business problem with zero errors, the algorithm you select should depend on two distinct parts of your data science scenario:
What do you want to do with your data? Specifically, what business question do you want to answer by learning from your past data?
What are the requirements of your data science scenario? Specifically, what accuracy, training time, linearity, number of parameters, and number of features does your solution need to support?
The ML Algorithm Cheat-sheet helps you choose the best machine learning algorithm for your predictive analytics solution. Your decision is driven by both the nature of your data and the goal you want to achieve with it.
The Machine Learning Algorithm Cheat-sheet was designed by Microsoft Azure Machine Learning (AML) specifically to answer this question:
What do you want to do with your data?
The Data Science Methodology:
Before going further, I must state that we need a solid understanding of the Data Science Methodology, the iterative system of methods that guides data scientists toward the ideal approach to solving problems. Otherwise, we may never fully understand the essence of the ML Algorithm Cheat-sheet.
The Azure Machine Learning Algorithm Cheat-sheet:
The AML Cheat-sheet is designed to serve as a starting point as we try to choose the right model for predictive or descriptive analysis. It rests on the fact that there is simply no substitute for understanding the principles of each algorithm and the system that generated your data.
The AML Algorithm Cheat-sheet can be downloaded here.
Overview of the AML Algorithm Cheat-sheet:
The Cheat-sheet covers a broad library of algorithms from the classification, recommender systems, clustering, anomaly detection, regression, and text analytics families.
Every machine learning algorithm has its own style or inductive bias. So, for a specific problem, several algorithms may be appropriate, and one algorithm may be a better fit than the others.
But it is not always possible to know beforehand which is the best fit. In cases like these, several algorithms are listed together in the Cheat-sheet. An appropriate strategy is to compare the performance of the related algorithms and choose the one that best fits the requirements of the business problem and data science scenario.
Bear in mind that machine learning is a highly iterative process.
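The compare-and-choose strategy described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two candidate models (a majority-class baseline and a 1-nearest-neighbour rule), the function names, and the tiny dataset are all hypothetical and are not part of the Cheat-sheet or the Azure Machine Learning designer.

```python
from collections import Counter


def fit_majority_baseline(train_x, train_y):
    """Candidate 1: always predict the most frequent training label."""
    most_common_label = Counter(train_y).most_common(1)[0][0]
    return lambda x: most_common_label


def fit_nearest_neighbour(train_x, train_y):
    """Candidate 2: predict the label of the closest training value."""
    def predict(x):
        nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
        return train_y[nearest]
    return predict


def accuracy(model, xs, ys):
    """Fraction of held-out examples the model labels correctly."""
    hits = sum(1 for x, y in zip(xs, ys) if model(x) == y)
    return hits / len(ys)


# Hypothetical one-feature dataset, split into training and held-out parts.
train_x = [1, 2, 3, 7, 8, 9]
train_y = ["low", "low", "low", "high", "high", "high"]
test_x = [2.5, 8.5, 7.5]
test_y = ["low", "high", "high"]

candidates = {
    "majority baseline": fit_majority_baseline,
    "nearest neighbour": fit_nearest_neighbour,
}

# Train every candidate on the same data and score it on the same held-out set.
scores = {
    name: accuracy(fit(train_x, train_y), test_x, test_y)
    for name, fit in candidates.items()
}
best = max(scores, key=scores.get)
```

The point of the sketch is the loop, not the models: every candidate sees the same training data and is judged on the same held-out data, so the scores are comparable.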
AML Algorithm Cheat-sheet Applications:
1. Text Analytics:
If the solution requires extracting information from text, then text analytics can help derive high-quality information to answer questions like:
What information is in this text?
Text-based algorithms listed in the AML Cheat-sheet include the following:
Extract N-Gram Features from Text: Featurizes unstructured text data by creating a dictionary of n-grams from a column of free text.
Feature Hashing: Transforms a stream of English text into a set of integer features that can be passed to a learning algorithm to train a text analytics model.
Preprocess Text: Cleans and simplifies text. It supports common text processing operations such as stop-word removal, lemmatization, case normalization, and identification and removal of email addresses and URLs.
Word2Vector: Converts words to numeric vectors for use in NLP tasks such as recommendation, named entity recognition, and machine translation.
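To make the first two ideas concrete, here is a minimal sketch of n-gram extraction and feature hashing in plain Python. It is not the implementation used by the Azure modules: the function names, the whitespace tokenizer, the choice of MD5 as the hash, and the bucket count are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


def extract_ngrams(text, n=2):
    """Split text on whitespace and return its word n-grams in order."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_ngram_dictionary(documents, n=2):
    """Count how often each n-gram occurs across a column of free text."""
    counts = Counter()
    for doc in documents:
        counts.update(extract_ngrams(doc, n))
    return counts


def hash_features(text, num_buckets=16):
    """Map each token to one of num_buckets integer slots and count hits.

    A stable hash (MD5 here) is used so the same token always lands in the
    same bucket across runs; Python's built-in hash() is salted per process.
    """
    vector = [0] * num_buckets
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vector[int(digest, 16) % num_buckets] += 1
    return vector


# Hypothetical two-row text column.
documents = ["the cat sat", "the cat ran"]
dictionary = build_ngram_dictionary(documents, n=2)
features = hash_features("the cat sat on the mat")
```

The n-gram dictionary grows with the vocabulary, while the hashed vector has a fixed length regardless of how many distinct words appear. The price is that unrelated tokens can collide in the same bucket.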
2. Regression:
We may need to predict future continuous values, such as the rate of infections. Regression algorithms can help us answer questions like:
How much or how many?
Regression algorithms listed in the AML Cheat-sheet include the following:
Fast Forest Quantile Regression: -> Predicts a distribution.
Poisson Regression: -> Predicts event counts.
Linear Regression: -> Fast training, linear model.
Bayesian Linear Regression: -> Linear model, small data sets.
Decision Forest Regression: -> Accurate, fast training times.
Neural Network Regression: -> Accurate, long training times.
Boosted Decision Tree Regression: -> Accurate, fast training times, large memory footprint.
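As a small illustration of the simplest entry in this list, the sketch below fits a one-feature linear regression with the ordinary least-squares closed form. The function name and the data are hypothetical, and this is not how the Azure Linear Regression module is implemented; it only shows what "fitting a linear model to predict a continuous value" means.

```python
def fit_linear_regression(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance of x and y divided by the variance of x gives the slope.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: a continuous value observed on four consecutive days.
days = [1, 2, 3, 4]
values = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_linear_regression(days, values)


def predict(day):
    """Answer "how much?" for a day that has not been observed yet."""
    return slope * day + intercept
```

Calling `predict(5)` extrapolates the fitted line one day ahead, which is the "how much or how many?" question in its most basic form.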
3. Recommenders:
Just like Netflix and Medium, we can generate recommendations for our users or clients by using algorithms that perform remarkably well on content and collaborative filtering tasks. These algorithms can help answer questions like:
What will they be interested in?
The recommender algorithm listed in the AML Cheat-sheet is:
SVD Recommender: -> The SVD Recommender is based on the Singular Value Decomposition (SVD) algorithm. It can generate two different kinds of predictions:
- Predict ratings for a given user and item.
- Recommend items to a user.
The SVD Recommender also has the following features: collaborative filtering, and better performance at lower cost through dimensionality reduction.
4. Clustering:
If we want to seek out the hidden structures in our data and separate similar data points into intuitive groups, then we can use clustering algorithms to answer questions like:
How is this organized?
The clustering algorithm listed in the AML Cheat-sheet is:
K-Means: -> K-means is one of the simplest and best-known unsupervised learning algorithms. You can use it for a variety of machine learning tasks, such as:
- Detecting abnormal data
- Clustering text documents
- Analyzing datasets before we use other classification or regression methods
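For readers who have not seen K-means before, the following is a minimal sketch of the classic assign-then-update loop in plain Python. The function name, the explicit starting centroids, and the sample points are assumptions for illustration; the Azure K-Means module offers its own initialization options and is not reproduced here.

```python
import math


def k_means(points, centroids, max_iters=100):
    """Alternate between assigning points and moving centroids until stable.

    points and centroids are lists of equal-length tuples of numbers.
    Returns (labels, centroids), where labels[i] is the cluster of points[i].
    """
    labels = []
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical 2-D points forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]
labels, centroids = k_means(points, centroids=[points[0], points[3]])
```

No labels are supplied anywhere, which is what makes this an unsupervised method: the groups emerge purely from the distances between points.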
5. Anomaly Detection:
This technique is useful when we try to identify and predict rare or unusual data points. For example, with IoT data, we could use anomaly detection to raise an alarm as we analyze a machine's log data. This could identify strange IP addresses, an unusually high number of attempts to access the system, or any other anomaly that could pose a serious threat.
Anomaly detection can be used to answer questions like:
Is this weird? Is this abnormal?
The anomaly detection algorithms listed in the AML Cheat-sheet include:
PCA-Based Anomaly Detection: -> To detect fraudulent transactions, for example, you often don’t have enough examples of fraud to train on, but you might have many examples of good transactions.
The PCA-Based Anomaly Detection module solves the problem by analyzing the available features to determine what constitutes a “normal” class. The module then applies distance metrics to identify cases that represent anomalies.
This approach lets you train a model by using existing imbalanced data. PCA-based anomaly detection trains quickly.
Train Anomaly Detection Model: -> This takes as input a set of parameters for an anomaly detection model and an unlabeled dataset. It returns a trained anomaly detection model, together with a set of labels for the training data.
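The idea of learning what "normal" looks like and then measuring distance from it can be sketched for two features in plain Python. This is an assumption-laden illustration, not the Azure module: it fits a single principal axis using the closed-form orientation of a 2x2 covariance matrix and scores each point by its distance from that axis (its reconstruction error). The function names and data are hypothetical.

```python
import math


def fit_principal_axis(points):
    """Return (mean, unit axis) of the first principal component of 2-D data."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    cxx = sum((p[0] - mean_x) ** 2 for p in points) / n
    cyy = sum((p[1] - mean_y) ** 2 for p in points) / n
    cxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / n
    # Orientation of the major axis of a 2x2 symmetric covariance matrix.
    theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)
    return (mean_x, mean_y), (math.cos(theta), math.sin(theta))


def anomaly_score(point, mean, axis):
    """Distance between a point and its projection onto the principal axis."""
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    projection = dx * axis[0] + dy * axis[1]
    residual_x = dx - projection * axis[0]
    residual_y = dy - projection * axis[1]
    return math.hypot(residual_x, residual_y)


# Hypothetical "normal" examples whose two features rise together.
normal_points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9), (5, 5.1)]
mean, axis = fit_principal_axis(normal_points)

# A new case that breaks the usual relationship between the two features.
suspicious = (3, 0)
suspicious_score = anomaly_score(suspicious, mean, axis)
```

Only normal examples are needed to fit the axis, which mirrors the point above about training on imbalanced data: the suspicious case scores far higher than any of the normal points without a single labelled anomaly.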
6. Multi-Class Classification:
Often, we may need to pick the right answer to a complex question that has multiple possible answers. For tasks like these, we need a multi-class classification algorithm. This can help us answer questions like:
Is this A or B or C or D?
Multi-class algorithms listed in the AML Cheat-sheet include the following:
Multi-Class Logistic Regression: -> Logistic regression is a well-known method in statistics that is used to predict the probability of an outcome, and it is popular for classification tasks. The algorithm predicts the probability of occurrence of an event by fitting data to a logistic function.
Logistic regression is usually a binary classifier, but in multi-class logistic regression the algorithm is used to predict multiple outcomes.
Features: Fast training times, linear model.
Multi-class Neural Network: -> A neural network is a set of interconnected layers. The inputs are the first layer and are connected to an output layer by an acyclic graph composed of weighted edges and nodes.
Between the input and output layers, you can insert multiple hidden layers. Most predictive tasks can be accomplished easily with only one or a few hidden layers. However, research has shown that deep neural networks (DNNs) with many layers can be effective in complex tasks such as image or speech recognition, with successive layers used to model increasing levels of semantic depth.
Features: Accurate, long training times.
Multiclass Decision Forest: -> The decision forest algorithm is an ensemble learning method for classification.
Decision trees, in general, are non-parametric models, meaning they support data with varied distributions. In each tree, a sequence of simple tests is run for each class, increasing the levels of the tree structure until a leaf node (decision) is reached.
Features: Accurate, fast training times.
One-vs-All Multiclass: -> This algorithm implements the one-versus-all method, in which a binary model is created for each of the multiple output classes. In essence, it creates an ensemble of individual models and then merges the results to create a single model that predicts all classes.
Any binary classifier can be used as the basis for a one-versus-all model.
Features: Depends on the two-class classifier.
Multiclass Boosted Decision Tree: -> This algorithm creates a machine learning model that is based on the boosted decision trees algorithm.
A boosted decision tree is an ensemble learning method in which the second tree corrects for the errors of the first tree, the third tree corrects for the errors of the first and second trees, and so forth. Predictions are based on the entire ensemble of trees together.
Features: Non-parametric, fast training times, scalable.
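The one-versus-all method described in this section is easy to sketch: train one binary model per class ("this class" versus "everything else") and predict the class whose model is most confident. The sketch below uses a plain perceptron as the binary learner purely as an assumption; as the text says, any binary classifier could take its place. All names and data are hypothetical, and this is not the Azure module's implementation.

```python
def train_perceptron(points, targets, epochs=100):
    """Binary perceptron; targets are +1 or -1. Returns (weights, bias)."""
    weights = [0.0] * len(points[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(points, targets):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * score <= 0:
                # Misclassified: nudge the boundary toward this example.
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:
            break
    return weights, bias


def train_one_vs_all(points, labels):
    """Train one binary model per class: that class versus all the others."""
    models = {}
    for cls in sorted(set(labels)):
        targets = [1 if label == cls else -1 for label in labels]
        models[cls] = train_perceptron(points, targets)
    return models


def predict_one_vs_all(models, x):
    """Merge the binary models by picking the class with the highest score."""
    def score(cls):
        weights, bias = models[cls]
        return sum(w * xi for w, xi in zip(weights, x)) + bias
    return max(models, key=score)


# Hypothetical, well-separated three-class data.
points = [(0, 5), (1, 6), (5, 0), (6, 1), (-5, -5), (-6, -4)]
labels = ["A", "A", "B", "B", "C", "C"]

models = train_one_vs_all(points, labels)
predictions = [predict_one_vs_all(models, p) for p in points]
```

Three classes produce three binary models here, and the merged prediction is simply the arg-max over their scores, which is one common way to combine them.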
7. Binary Classification:
Binary classification tasks are the most common classification tasks. They often involve a yes-or-no or true-or-false type of response. Binary classification algorithms help us answer questions like:
Is this A or B?
Binary-classifier algorithms listed in the AML Cheat-sheet include the following:
Two-Class Support Vector Machine: -> Support vector machines (SVMs) are a well-researched class of supervised learning methods. This particular implementation is suited to the prediction of two possible outcomes, based on either continuous or categorical variables.
Features: Under 100 features, linear model.
Two-Class Averaged Perceptron: -> The averaged perceptron method is an early and simple version of a neural network. In this approach, inputs are classified into several possible outputs based on a linear function, and then combined with a set of weights that are derived from the feature vector, hence the name “perceptron.”
Features: Fast training, linear model.
Two-Class Decision Forest: -> Decision forests are fast, supervised ensemble models. This algorithm is a good choice if you want to predict a target with a maximum of two outcomes. Generally, ensemble models provide better coverage and accuracy than single decision trees.
Features: Accurate, fast training.
Two-Class Logistic Regression: -> Logistic regression is a well-known statistical technique that is used for modeling many kinds of problems. This algorithm is a supervised learning method; therefore, you must provide a dataset that already contains the outcomes to train the model. In this method, the classification algorithm is optimized for dichotomous or binary variables only.
Features: Fast training, linear model.
Two-Class Boosted Decision Tree: -> This method creates a machine learning model that is based on the boosted decision trees algorithm.
Features: Accurate, fast training, large memory footprint.
Two-Class Neural Network: -> This algorithm is used to create a neural network model that can be used to predict a target that has only two values.
Classification using neural networks is a supervised learning method, and therefore requires a tagged dataset, which includes a label column. For example, you could use this neural network model to predict binary outcomes such as whether or not a patient has a certain disease, or whether a machine is likely to fail within a specified window of time.
Features: Accurate, long training times.
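Of the binary classifiers above, the averaged perceptron is simple enough to sketch in full. The version below follows the textbook formulation as an assumption: labels are +1 and -1, the weights change only on mistakes, and the final model is the average of the weights over every training step. The names and data are hypothetical, and this is not the Azure module's code.

```python
def train_averaged_perceptron(points, targets, epochs=10):
    """Return the weights and bias averaged over all training steps."""
    n_features = len(points[0])
    weights = [0.0] * n_features
    bias = 0.0
    weight_sum = [0.0] * n_features
    bias_sum = 0.0
    steps = 0
    for _ in range(epochs):
        for x, y in zip(points, targets):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * score <= 0:
                # Mistake-driven update, exactly as in the plain perceptron.
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
            # Accumulate the current model after every example, so models
            # that survive many examples unchanged count for more.
            weight_sum = [s + w for s, w in zip(weight_sum, weights)]
            bias_sum += bias
            steps += 1
    return [s / steps for s in weight_sum], bias_sum / steps


def predict(weights, bias, x):
    """Classify as +1 or -1 from the sign of the linear score."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else -1


# Hypothetical linearly separable data with labels +1 and -1.
points = [(2, 1), (3, 2), (-1, -2), (-2, -1)]
targets = [1, 1, -1, -1]

avg_weights, avg_bias = train_averaged_perceptron(points, targets)
predictions = [predict(avg_weights, avg_bias, p) for p in points]
```

Averaging is what separates this from the plain perceptron: it damps the effect of the last few updates, which tends to matter on noisy data where the weights never fully settle.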
8. Image Classification:
If the analysis requires extracting information from images, then computer vision algorithms can help us derive high-quality information to answer questions like:
What does this image represent?
The computer vision algorithms listed in the AML Cheat-sheet include:
DenseNet and ResNet: -> These classification algorithms are supervised learning methods that require a labeled dataset. You can train the model by providing a labeled image directory as input. The trained model can then be used to predict values for new, unseen input examples.
Features: High accuracy, better efficiency.
Summary:
The Azure Machine Learning Algorithm Cheat Sheet helps you with the first consideration: What do you want to do with your data? On the Cheat Sheet, look for the task you want to do, and then find an Azure Machine Learning designer algorithm for the predictive analytics solution.
The Azure Machine Learning experience is quite intuitive and easy to grasp. The Azure Machine Learning designer is a drag-and-drop visual interface that makes it engaging and fun to build ML pipelines, assemble algorithms, run iterative ML jobs, and build, train, and deploy models, all within the Azure portal. Once deployed, your models can be consumed in real time by authorized external, third-party applications.
Next Steps:
After using the Azure Machine Learning Cheat-sheet to decide on the right model for the business problem, the next step is to answer the second question:
What are the requirements of your data science scenario? Specifically, what accuracy, training time, linearity, number of parameters, and number of features does your solution need to support?
To get the best possible outcome for these metrics, kindly go through the details at the Azure Machine Learning site.
Cheers!!
About Me:
Lawrence is a Data Specialist at Tech Layer, passionate about fair and explainable AI and Data Science. I hold both the Data Science Professional and Advanced Data Science Professional certifications from IBM. Having earned the IBM Data Science Explainability badge, I have made it my mission to promote fairness and explainability in AI. I love to code up my functions from scratch as much as possible, and I love to learn and experiment. I also hold a number of Data and AI certifications and have written several highly recommended articles.
Feel free to connect with me on:
Github
LinkedIn
Twitter
Translated from: https://medium.com/towards-artificial-intelligence/the-azure-ml-algorithm-cheat-sheet-451547832cad