Logistic regression models a dependent variable that takes one of two values (0 or 1, true or false, yes or no), so the outcome can only be one of two forms. For example, it can be used to estimate the probability that an event succeeds or fails.
As ‘Z’ goes to infinity, Y(predicted) approaches 1, and as ‘Z’ goes to negative infinity, Y(predicted) approaches 0.
The output of the hypothesis is the estimated probability. It indicates how confident we can be that the predicted value equals the actual value for a given input X.
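The limiting behaviour described above can be checked numerically. This is a minimal sketch, assuming the hypothesis is the standard logistic (sigmoid) function applied to a score Z; the function name sigmoid is illustrative and not taken from the text.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score z to a probability strictly between 0 and 1."""
    # Two algebraically equal forms are used so math.exp never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Large positive scores approach 1, large negative scores approach 0.
for z in (-20, -2, 0, 2, 20):
    print(z, round(sigmoid(z), 4))
```

A score of 0 sits exactly on the decision boundary, where the estimated probability is 0.5.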
Linear and logistic regression are the most basic and most commonly used forms of regression. The essential difference between the two is that logistic regression is used when the dependent variable is binary, whereas linear regression is used when the dependent variable is continuous and the regression line is linear.
Linear regression models data with continuous numeric values. In contrast, logistic regression models data with binary values.
Linear regression requires a linear relationship between the dependent and independent variables, whereas logistic regression does not (it only assumes a linear relationship between the independent variables and the log-odds of the outcome).
In both models, strongly correlated independent variables (multicollinearity) make the coefficients unstable; logistic regression in particular assumes that the independent variables are not highly correlated with each other.
The output of linear regression is unbounded, whereas the output of logistic regression is bounded between 0 and 1.
A decision tree is a supervised learning algorithm that can be used for classification as well as regression problems.
A decision tree is a tree-based method that goes from observations about an object (represented in the branches) to conclusions about its target value (represented in the leaves). At their core, decision trees are nested if-else conditions.
In classification trees, the target value is discrete and each leaf represents a class. In regression trees, the target value is continuous and each leaf represents the mean of the target values of all objects that end up in that leaf.
Decision trees are easy to interpret and can be used to visualize decisions. However, they tend to overfit the data they are trained on: small changes to the training set can result in significantly different tree structures, which lead to significantly different outputs.
The input to a decision tree can be continuous as well as categorical. A decision tree works on if-then statements and tries to solve a problem by using a tree representation (nodes and leaves).
Assumptions while creating a decision tree:
1) Initially, the whole training set is considered as the root.
2) Feature values are preferred to be categorical; if they are continuous, they are discretized.
3) Records are distributed recursively on the basis of attribute values.
4) The attributes placed at the root node or at internal nodes are chosen using a statistical approach.
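Information Gain = Entropy(S) - ∑ (Sb/S) * Entropy(Sb), where Sb is a subset and S is the entire dataset. Entropy(S), the entropy before the split, equals 1 when two classes are evenly balanced. This quantity helps determine which attribute reduces the impurity of the dataset the most when used for a split, which in turn improves the efficiency and accuracy of the decision tree model. By repeating this process, we build the decision tree step by step until the entropy of every leaf node is zero, that is, every leaf node contains data from a single class.
Entropy(S) = - ∑ pi * log2(pi), where pi is the probability of class i.
For a two-class problem, entropy varies from 0 to 1: it is 0 if all the data belong to a single class and 1 if the class distribution is equal. In this way, entropy gives a measure of impurity in the dataset.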
Steps to decide which attribute to split:
1. Compute the entropy for the dataset.
2. For every attribute:
2.1 Calculate entropy for all categorical values.
2.2 Take average information entropy for the attribute.
2.3 Calculate gain for the current attribute.
3. Pick the attribute with the highest information gain.
4. Repeat until we get the desired tree.
A leaf node is decided when entropy is zero.
Information Gain = Entropy(S) - ∑ (Sb/S) * Entropy(Sb), where Sb is a subset, S is the entire dataset, and Entropy(S) equals 1 when two classes are evenly balanced.
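The attribute-selection steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the helper names entropy and information_gain and the toy rows are made up, and the gain is computed as the entropy of the parent minus the size-weighted entropy of the subsets.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Parent entropy minus the size-weighted entropy of the subsets
    produced by splitting on `attribute` (a key of each row dict)."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    weighted = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

# Toy data, made up for illustration only.
rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rainy", "windy": False},
    {"outlook": "rainy", "windy": True},
]
labels = ["no", "no", "yes", "yes"]

# Steps 2 and 3: compute the gain of every attribute and pick the largest.
gains = {a: information_gain(rows, labels, a) for a in ("outlook", "windy")}
best = max(gains, key=gains.get)
print(gains, best)
```

In this toy set, splitting on outlook produces two pure subsets, so it carries all of the information, while windy carries none.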
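The CART algorithm (Classification and Regression Trees) is a decision tree learning algorithm that can be used for both classification and regression problems. CART uses the Gini index as the cost function for evaluating splits of the dataset.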
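As a rough illustration of the Gini index as a split cost, the sketch below assumes the usual definition (one minus the sum of squared class proportions) and weights each child node by its size. The function names gini and gini_of_split are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_of_split(left, right):
    """Size-weighted Gini impurity of a binary split (lower is better)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print(gini(["a", "a", "b", "b"]))             # evenly mixed node
print(gini_of_split(["a", "a"], ["b", "b"]))  # perfect split
print(gini_of_split(["a", "b"], ["a", "b"]))  # uninformative split
```

A split is worth making when its weighted impurity is lower than the impurity of the node being split.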
To control the leaf size, we can set the parameters:
1. Maximum depth
Maximum tree depth is a limit to stop the further splitting of nodes when the specified tree depth has been reached during the building of the initial decision tree.
NEVER use maximum depth to limit the further splitting of nodes. In other words: use the largest possible value.
2. Minimum split size
Minimum split size is a limit to stop the further splitting of nodes when the number of observations in the node is lower than the minimum split size.
This is a good way to limit the growth of the tree. When a leaf contains too few observations, further splitting will result in overfitting (modeling of noise in the data).
3. Minimum leaf size
Minimum leaf size is a limit that prevents a node from being split when the number of observations in one of the resulting child nodes would be lower than the minimum leaf size.
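The three limits above correspond to constructor arguments of scikit-learn's DecisionTreeClassifier. The sketch below is only an illustration: the data is synthetic, the thresholds 20 and 10 are arbitrary, and the depth is left unlimited in line with the advice above.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, generated only for illustration.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

tree = DecisionTreeClassifier(
    max_depth=None,        # maximum depth: no limit
    min_samples_split=20,  # minimum split size
    min_samples_leaf=10,   # minimum leaf size
    random_state=0,
)
tree.fit(X, y)

print("depth:", tree.get_depth())
print("leaves:", tree.get_n_leaves())
```

Raising either minimum yields a smaller tree, since fewer nodes remain eligible for splitting.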
Pruning is mostly done to reduce the chances of overfitting the tree to the training data and reduce the overall complexity of the tree.
1. Pre-pruning is also known as early stopping. As the name suggests, the criteria are set as parameter values while building the model. The tree stops growing when it meets any of these pre-pruning criteria, or when it discovers pure classes.
2. In post-pruning, the idea is to allow the decision tree to grow fully and observe the CP value. Next, we prune/cut the tree with the optimal CP (complexity parameter) value as the parameter. The CP is used to control tree growth: if the cost of adding a variable is higher than the value of CP, tree growth stops.
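Post-pruning can be sketched with scikit-learn, with one caveat: the CP value described above comes from a different implementation, and the closest scikit-learn counterpart is the ccp_alpha parameter of minimal cost-complexity pruning. The example below swaps in that technique. The data is synthetic, and choosing the best alpha on a single held-out split is a simplifying assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Grow the tree fully, then list the candidate complexity values.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Refit one pruned tree per candidate and keep the best on held-out data.
best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative round-off
    candidate = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    score = candidate.fit(X_train, y_train).score(X_val, y_val)
    if score > best_score:
        best_alpha, best_score = alpha, score

pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha)
pruned_tree.fit(X_train, y_train)
print(best_alpha, full_tree.get_n_leaves(), pruned_tree.get_n_leaves())
```

Larger values of ccp_alpha remove more of the tree, so the pruned tree never has more leaves than the fully grown one.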
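Decision trees can handle both categorical and numerical variables as features at the same time without any problem.
Finally, it is good practice to convert categorical variables to numeric ones. This can be done with label encoding or one-hot encoding. Label encoding assigns a unique integer to each category, whereas one-hot encoding creates a new binary column for each category, set to 1 in the column of the matching category and 0 elsewhere. Both encodings help the decision tree interpret and split the data, especially for categorical features with many categories.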
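Both encodings are simple enough to write by hand, which makes the difference between them easy to see. This is a minimal sketch with a made-up colors column; in practice, library routines such as pandas.get_dummies or scikit-learn's OneHotEncoder do the same job.

```python
colors = ["red", "green", "blue", "green"]

# Label encoding: one integer per category (sorted for reproducibility).
categories = sorted(set(colors))
label_of = {c: i for i, c in enumerate(categories)}
label_encoded = [label_of[c] for c in colors]

# One-hot encoding: one binary column per category.
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]

print(categories)     # column order of the one-hot rows
print(label_encoded)
print(one_hot)
```

Label encoding imposes an arbitrary order on the categories, while one-hot encoding avoids that at the cost of one extra column per category.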
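Random forest is an ensemble learning algorithm that follows the bagging technique. It is based on applying bagging to decision trees, with one important extension: in addition to sampling the records, the algorithm also samples the variables. In a traditional decision tree, to decide how to create a subpartition of a partition A, the algorithm chooses the variable and split point by minimizing a criterion such as Gini impurity. In a random forest, the choice of variable at each stage is limited to a randomly selected subset of the variables.
Compared with the basic tree algorithm, the random forest algorithm adds two more steps: the bagging discussed earlier, and the bootstrap sampling of variables at each split.
How many variables should be sampled at each step? A rule of thumb is to choose √P, where P is the number of predictor variables. The example below uses RandomForestClassifier: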
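from sklearn.ensemble import RandomForestClassifier

# loan3000 is assumed to be a pandas DataFrame that has already been loaded
predictors = ['borrower_score', 'payment_inc_ratio']
outcome = 'outcome'
X = loan3000[predictors]
y = loan3000[outcome]
# oob_score=True stores the out-of-bag accuracy estimate in rf.oob_score_
rf = RandomForestClassifier(n_estimators=500, random_state=1, oob_score=True)
rf.fit(X, y)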
Some important parameters:-
n_estimators:- The number of decision trees to be created in the random forest.
criterion:- The measure of split quality, "gini" or "entropy".
min_samples_split:- The minimum number of samples required in a node before a split is attempted.
max_features:- The maximum number of features considered for the split at each node of each decision tree.
n_jobs:- The number of jobs to run in parallel for both fit and predict. Set it to -1 to use all the cores for parallel processing.
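The parameters listed above can be seen together in one call. This sketch is an illustration on synthetic data, and the specific values (100 trees, a minimum split size of 10) are arbitrary choices rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, generated only for illustration.
X, y = make_classification(n_samples=400, n_features=9, random_state=1)

rf = RandomForestClassifier(
    n_estimators=100,      # number of trees
    criterion="gini",      # or "entropy"
    min_samples_split=10,  # smallest node that may still be split
    max_features="sqrt",   # about sqrt(9) = 3 features tried per split
    n_jobs=-1,             # use all available cores
    oob_score=True,        # estimate accuracy on out-of-bag records
    random_state=1,
)
rf.fit(X, y)
print(len(rf.estimators_), rf.oob_score_)
```

The out-of-bag score gives an accuracy estimate without a separate test set, because each tree is evaluated on the records left out of its bootstrap sample.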
For example:
Voting Republican - 13
Voting Democratic - 16
Non-Respondent - 21
Total - 50
The probability of voting Republican is 13/(13+16), or 44.8%. We put out our press release that the Democrats are going to win by over 10 points; but when the election comes around, it turns out they lose by 10 points. That certainly reflects poorly on us.
Where did we go wrong in our model?
Bagging, short for Bootstrap Aggregation, aims to improve the stability and accuracy of a model by reducing the variance of decision trees.
Bagging is like the basic algorithm for ensembles, except that, instead of fitting the various models to the same data, each new model is fitted to a bootstrap resample. Here is the algorithm presented more formally:
1. Initialize M, the number of models to be fit, and n, the number of records to choose (n < N). Set the iteration m=1.
2. Take a bootstrap resample (i.e., with replacement) of n records from the training data to form a subsample Ym and Xm (the bag).
3. Train a model using Ym and Xm to create a set of decision rules f^m(X).
4. Increment the model counter m=m+1. If m <= M, go to step 2.
In the case where f^m predicts the probability Y=1, the bagged estimate is given by the average of the M models:
f^ = (1/M) * (f^1(X) + f^2(X) + ... + f^M(X))
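The bagging loop can be written out directly. The sketch below is one possible reading of the algorithm, using scikit-learn decision trees as the base model and synthetic data. M, n and N follow the notation above; the values 25 and 200 are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
N = len(X)
M, n = 25, 200  # number of models, records per bag (n < N)
rng = np.random.default_rng(0)

models = []
for m in range(M):
    # Bootstrap resample: draw n row indices with replacement.
    idx = rng.integers(0, N, size=n)
    models.append(DecisionTreeClassifier(random_state=m).fit(X[idx], y[idx]))

# Bagged estimate: average of the M predicted probabilities of Y = 1.
p_hat = np.mean([f.predict_proba(X)[:, 1] for f in models], axis=0)
y_hat = (p_hat >= 0.5).astype(int)
print("training accuracy:", (y_hat == y).mean())
```

Averaging the probabilities rather than the hard class votes keeps the bagged estimate on the same 0 to 1 scale as each individual model.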
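In linear regression models, the residuals are often examined to see whether the fit can be improved. This approach aims to identify nonlinear relationships that may exist in the data and so improve the model. Boosting takes this idea much further: it fits a series of models, in which each successive model seeks to minimize the error of the previous one. Examples of boosting include AdaBoost (adaptive boosting) and gradient boosting, which both improve accuracy step by step by making later models focus on the errors of earlier ones.
Gradient boosting: uses the gradient of the loss function to guide the improvement of the model. At each step, gradient boosting adds a new model that is fitted to the error in the direction of the gradient of the loss function, so the overall error decreases step by step.
Stochastic gradient boosting: a variant of gradient boosting that adds randomness by sampling records and features at each step, which makes the model more robust and reduces the risk of overfitting. It is the most general and widely used boosting method.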
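Stochastic gradient boosting can be tried with scikit-learn's GradientBoostingClassifier, where a subsample value below 1.0 turns on row sampling and max_features limits the features tried at each split. This is a hedged sketch on synthetic data; the hyperparameter values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbm = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    subsample=0.8,        # each tree is fitted to a random 80% of the rows
    max_features="sqrt",  # random subset of features at each split
    random_state=0,
)
gbm.fit(X_train, y_train)

# train_score_[i] is the loss on the in-bag sample at boosting stage i.
print(gbm.train_score_[0], gbm.train_score_[-1])
print("test accuracy:", gbm.score(X_test, y_test))
```

The falling training loss shows each new tree correcting part of the error left by the trees before it.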
If these two methods were cars, bagging could be considered a Honda Accord (reliable and steady), whereas boosting could be considered a Porsche (powerful but requires more care).
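A Support Vector Machine (SVM), also known as a large margin classifier, is a powerful supervised learning algorithm used for classification and regression tasks. In a classification problem, the goal of an SVM is to find the hyperplane (a line in two dimensions, a plane or hyperplane in higher dimensions) that best separates the data points of different classes.
Real-world data often contains noise and outliers, and these points may violate the maximum-margin criterion. To handle this, SVM uses a soft margin, which allows some data points to violate the margin. This is done by introducing slack variables and a regularization parameter, which controls the trade-off between tolerating margin violations and keeping the margin large.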
We have two types of SVM classifiers:
1) Linear SVM: In linear SVM, the data points are expected to be separated by some apparent gap. Therefore, the SVM algorithm finds a straight hyperplane dividing the two classes. This hyperplane is also called the maximum margin hyperplane.
2) Non-Linear SVM: Our data points may not be linearly separable in a p-dimensional space but may be linearly separable in a higher dimension. The kernel trick makes it possible to draw nonlinear decision boundaries. Some standard kernels are a) the polynomial kernel and b) the RBF kernel (the most commonly used).
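The difference between the two classifier types shows up clearly on data that no straight line can separate. The sketch below assumes scikit-learn's SVC and a synthetic two-ring dataset. C is the soft-margin regularization parameter, left at an arbitrary value of 1.0.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in two dimensions.
X, y = make_circles(n_samples=300, noise=0.05, factor=0.3, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

linear_acc = linear_svm.score(X, y)
rbf_acc = rbf_svm.score(X, y)
print("linear:", linear_acc, "rbf:", rbf_acc)
```

The RBF kernel separates the rings because it implicitly works in a higher-dimensional space, where the two classes become linearly separable.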
Advantages of SVM classifier:
1) SVMs are effective when the number of features is quite large.
2) It works effectively even if the number of features is greater than the number of samples.
3) Non-linear data can also be classified using customized hyperplanes built by using the kernel trick.
4) It is a robust model to solve prediction problems since it maximizes the margin.
Disadvantages of SVM classifier:
1) The biggest limitation of the Support Vector Machine is the choice of the kernel. The wrong choice of kernel can lead to an increase in error percentage.
2) With a large number of samples, training becomes slow and performance can suffer.
3) SVMs have good generalization performance, but they can be extremely slow in the test phase.
4) SVMs have high algorithmic complexity and extensive memory requirements due to the use of quadratic programming.
Bayes’ Theorem finds the probability of an event occurring given the probability of another event that has already occurred. Bayes’ theorem is stated mathematically as the following equation:
P(A|B) = P(B|A) P(A) / P(B)
where A and B are events and P(B) is not 0.
Now, with regard to our dataset, we can apply Bayes’ theorem in the following way:
P(y|X) = P(X|y) P(y) / P(X)
where y is the class variable and X is a dependent feature vector (of size n):
X = (x_1,x_2,x_3,.....,x_n)
To be clear, an example of a feature vector and the corresponding class variable is (refer to the 1st row of the dataset):
X = (Rainy, Hot, High, False)
y = No
So basically, P(y|X) here means the probability of “Not playing golf” given that the weather conditions are “Rainy outlook”, “Temperature is hot”, “high humidity” and “no wind”.
We assume that no pair of features is dependent. For example, the temperature being ‘Hot’ has nothing to do with the humidity, and the outlook being ‘Rainy’ does not affect the winds. Hence, the features are assumed to be independent.
Secondly, each feature is given the same weight (or importance). For example, knowing only the temperature and humidity can’t predict the outcome accurately. None of the attributes is irrelevant, and each is assumed to contribute equally to the outcome.
Continuous values associated with each feature are assumed to be distributed according to a Gaussian distribution. A Gaussian distribution is also called a Normal distribution. When plotted, it gives a bell-shaped curve which is symmetric about the mean of the feature values.
This is as simple as calculating the mean and standard deviation values of each input variable (x) for each class value.
Mean (x) = 1/n * sum(x)
Where n is the number of instances, and x is the values for an input variable in your training data.
We can calculate the standard deviation using the following equation:
Standard deviation(x) = sqrt(1/n * sum((xi - mean(x))^2))
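Putting the pieces together, a Gaussian naive Bayes classifier needs only the per-class mean and standard deviation of each feature plus the class priors. The sketch below is a minimal from-scratch illustration: the function names and toy data are made up, the standard deviation uses the 1/n form given above, and a feature with zero variance within a class is not handled.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Return {class: (prior, [(mean, std) for each feature])}."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        stats = []
        for column in zip(*rows):
            n = len(column)
            mean = sum(column) / n
            std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
            stats.append((mean, std))
        model[label] = (len(rows) / len(X), stats)
    return model

def gaussian_pdf(x, mean, std):
    """Normal density; assumes std > 0."""
    exponent = -((x - mean) ** 2) / (2 * std ** 2)
    return math.exp(exponent) / (std * math.sqrt(2 * math.pi))

def predict(model, row):
    """Pick the class maximising log P(y) + sum of log P(x_i | y)."""
    def log_posterior(label):
        prior, stats = model[label]
        return math.log(prior) + sum(
            math.log(gaussian_pdf(x, m, s)) for x, (m, s) in zip(row, stats)
        )
    return max(model, key=log_posterior)

# Toy data (made up): two features, two classes.
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 4.1], [3.2, 3.9], [2.8, 4.0]]
y = ["a", "a", "a", "b", "b", "b"]
model = fit_gaussian_nb(X, y)
print(predict(model, [1.1, 2.0]), predict(model, [3.1, 4.0]))
```

Working with log-probabilities avoids multiplying many small densities together. P(X) is dropped because it is the same for every class.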
A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. It allows the visualization of the performance of an algorithm.
This is the key to the confusion matrix.
It gives us insight not only into the errors being made by a classifier but, more importantly, the types of errors that are being made.
Accuracy is defined as the ratio of the sum of True Positives and True Negatives to the total number of predictions (TP+TN+FP+FN).
However, there are problems with accuracy. It assumes equal costs for both kinds of errors. A 99% accuracy can be excellent, good, mediocre, poor, or terrible depending upon the problem.
Misclassification Rate is defined as the ratio of the sum of False Positives and False Negatives to the total number of predictions (TP+TN+FP+FN).
Misclassification Rate is also called Error Rate.
Sensitivity (SN) is calculated as the number of correct positive predictions divided by the total number of positives.
It is also called Recall (REC) or true positive rate (TPR). The best sensitivity is 1.0, whereas the worst is 0.0.
Specificity (SP) is calculated as the number of correct negative predictions divided by the total number of negatives. It is also called a true negative rate (TNR). The best specificity is 1.0, whereas the worst is 0.0.
False positive rate (FPR) is calculated as the number of incorrect positive predictions divided by the total number of negatives. The best false positive rate is 0.0, whereas the worst is 1.0. It can also be calculated as 1 – specificity.
False negative rate (FNR) is calculated as the number of incorrect negative predictions divided by the total number of positives. The best false negative rate is 0.0, whereas the worst is 1.0. It can also be calculated as 1 – sensitivity.
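The rates defined so far all derive from the four cells of the confusion matrix. The sketch below counts them for a small made-up set of labels, treating 1 as the positive class. The function name confusion_counts is illustrative.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary problem."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn

# Made-up labels: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
total = tp + tn + fp + fn

accuracy = (tp + tn) / total
error_rate = (fp + fn) / total
sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate
fpr = fp / (fp + tn)          # equals 1 - specificity
fnr = fn / (fn + tp)          # equals 1 - sensitivity

print(tp, tn, fp, fn)
print(accuracy, error_rate, sensitivity, specificity, fpr, fnr)
```

Accuracy and error rate always sum to 1, as do sensitivity and false negative rate, and specificity and false positive rate.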
Recall can be defined as the ratio of the number of correctly classified positive examples to the total number of positive examples.
High Recall indicates the class is correctly recognized (small number of FN).
Low Recall indicates the class is incorrectly recognized (large number of FN).
Recall is given by the relation:
Recall = TP / (TP + FN)
To get the value of precision, we divide the number of correctly classified positive examples by the total number of predicted positive examples.
High Precision indicates an example labeled as positive is indeed positive (a small number of FP).
Low Precision indicates an example labeled as positive is often not actually positive (a large number of FP).
Precision is given by the relation:
Precision = TP / (TP + FP)
High recall, low precision: This means that most of the positive examples are correctly recognized (low FN), but there are a lot of false positives.
Low recall, high precision: This shows that we miss a lot of positive examples (high FN), but those we predict as positive are indeed positive (low FP).
Since we have two measures (Precision and Recall), it helps to have a measurement that represents both of them. We calculate an F-measure, which uses Harmonic Mean in place of Arithmetic Mean as it punishes the extreme values more.
The F-Measure will always be nearer to the smaller value of Precision or Recall.
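A short numerical check makes the behaviour of the F-measure concrete. The sketch below assumes the balanced F-measure, that is, the harmonic mean 2 * Precision * Recall / (Precision + Recall). The counts are made up to represent a high-recall, low-precision classifier, and the function does not handle zero denominators.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and their harmonic mean (the F-measure)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# High recall, low precision (counts made up for illustration).
p, r, f = precision_recall_f(tp=90, fp=210, fn=10)
arithmetic_mean = (p + r) / 2

print(p, r, f, arithmetic_mean)
```

Here the F-measure lands much closer to the low precision than the arithmetic mean does, which is the penalty on extreme values described above.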
RandomizedSearchCV performs a random search over hyperparameters. It implements the fit and score methods, and also predict, predict_proba, decision_function, transform, etc., when the underlying estimator implements them.
The parameters of the estimator used to apply these methods are optimized by cross-validated search over parameter settings.
In contrast to GridSearchCV, not all parameter values are tried out, but rather a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by n_iter.
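A randomized search can be sketched as follows. This is an illustration rather than a tuned setup: the data is synthetic, the estimator and the parameter ranges are arbitrary, and scipy.stats.randint supplies the integer distributions to sample from.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Distributions are sampled from; lists are sampled uniformly.
param_distributions = {
    "n_estimators": randint(20, 100),
    "max_depth": [3, 5, None],
    "min_samples_split": randint(2, 20),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,  # number of parameter settings that are sampled
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

With n_iter=5 and cv=3, only 15 model fits are run plus one final refit, however large the search space is.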