import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
# Load the data
df = pd.read_csv('https://bit.ly/3cManTi', delimiter=",")
# Extract input variables (all rows, all columns but last column)
X = df.values[:, :-1]
# Extract output column (all rows, last column)
Y = df.values[:, -1]
model = LogisticRegression(solver='liblinear')
# Hold out a third of the data for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.33)
model.fit(X_train, Y_train)
prediction = model.predict(X_test)
# The confusion matrix evaluates accuracy within each category.
# With labels sorted as (0, 1), scikit-learn lays it out as:
# [[truenegatives falsepositives]
#  [falsenegatives truepositives]]
# The diagonal represents correct predictions,
# so we want those to be higher
matrix = confusion_matrix(y_true=Y_test, y_pred=prediction)
Accuracy is a metric for classification model performance: the proportion of all samples that the model predicts correctly. The formula is: (correctly predicted positives + correctly predicted negatives) / total number of samples.
It is the most intuitive performance measure: simply the ratio of correctly predicted observations to total observations. It is tempting to conclude that a model with high accuracy is the best model, but accuracy is a reliable measure only on symmetric datasets, where the numbers of false positives and false negatives are almost the same.
Accuracy = (True Positives + True Negatives) / (True Positives + False Positives + False Negatives + True Negatives)
Precision: the formula is correctly predicted positives / (correctly predicted positives + negatives wrongly predicted as positive).
It is also called the positive predictive value: the number of positives the model predicts correctly, compared to the total number of positives it predicts.
Precision = True Positives / (True Positives + False Positives)
Precision = True Positives / Total Predicted Positives
Precision is a measure of exactness or quality. High precision means that most or all of the positive results you predicted are correct.
Recall is also called sensitivity or the true positive rate.
It is the number of positives our model predicts correctly, compared to the actual number of positives in our data.
Recall = True Positives / (True Positives + False Negatives)
Recall = True Positives / Total Actual Positives
Recall is a measure of completeness. High recall means that our model classified most or all of the actual positive elements as positive.
F1 score: the formula is 2 * (precision * recall) / (precision + recall).
We use precision and recall together because they complement each other in describing the effectiveness of a model. The F1 score combines the two as their harmonic mean, giving equal weight to each.
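The four metrics can be computed directly from the confusion-matrix counts. The following is a minimal sketch; the function name `classification_metrics` and the example counts are hypothetical, chosen only for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, not taken from any real model.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, precision (0.8) is higher than recall (about 0.667), and F1 lands between the two, closer to the lower value, which is the usual behavior of a harmonic mean.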
Bias measures how far the predicted values are from the actual values. If the average predicted values are far off from the actual values, the model has high bias.
A model with high bias is too simple to capture the complexity of the data, and so it underfits the data.
Variance shows up when a model performs well on the training dataset but poorly on data it was not trained on, such as a test or validation dataset. It tells us how much the predicted values scatter around the actual values.
High variance causes overfitting, which means the algorithm models the random noise present in the training data.
A model with high variance is very flexible and tunes itself to the individual data points of the training set.
This figure shows the relationship between model complexity and prediction error, and illustrates the concept of the bias-variance tradeoff.
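The relationship between model complexity and error can be explored with a small experiment. This is a sketch under stated assumptions: the data is synthetic (a noisy sine curve invented for illustration), and polynomial degree stands in for "model complexity". It fits polynomials of increasing degree and reports the error on the training data and on held-out data.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)


def make_data(n):
    """Synthetic data (an assumption for illustration): noisy sine curve."""
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, n)
    return x, y


x_train, y_train = make_data(30)
x_test, y_test = make_data(200)


def train_and_test_mse(degree):
    """Fit a polynomial of the given degree and return (train MSE, test MSE)."""
    model = Polynomial.fit(x_train, y_train, degree)
    train_mse = np.mean((model(x_train) - y_train) ** 2)
    test_mse = np.mean((model(x_test) - y_test) ** 2)
    return train_mse, test_mse


errors = {degree: train_and_test_mse(degree) for degree in (1, 3, 9)}
for degree, (train_mse, test_mse) in errors.items():
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

The straight line (degree 1) is too simple for the curve, so both errors are high, which is the high-bias case. Raising the degree keeps lowering the training error; whether the held-out error rises again at the highest degree depends on the sample, and a widening gap between the two errors is the sign of high variance.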
Data wrangling is the process of converting and mapping data from its raw form into a format that is far more valuable.
Data wrangling is the first step in machine learning and deep learning. The end goal is to provide actionable data, and to provide it as quickly as possible.
There are three major things to focus on when talking about data wrangling:
1. Acquiring data
The first and probably the most important step in data science is acquiring, sorting, and cleaning data. This is an extremely tedious process and takes the most time.
One needs to:
Check if the data is valid and up-to-date.
Check if the data acquired is relevant for the problem at hand.
Sources for data collection: data is publicly available on various websites such as kaggle.com, data.gov, World Bank, FiveThirtyEight Datasets, AWS Datasets, and Google Datasets.
2. Data cleaning
Data cleaning is an essential component of data wrangling and requires a lot of patience. To make the job easier, first format the data so that it is readable for humans.
The essentials involved are:
Format the data to make it more readable
Find outliers (data points that do not match the rest of the dataset) in the data
Find missing values and remove them from the dataset (otherwise, any model trained on it is incomplete and unreliable)
3. Data Computation
At times, your machine may not have enough resources to run your algorithm; for example, you might not have a GPU. In these cases, you can use publicly available APIs to run your algorithm. These are standard endpoints on the web that let you use remote computing power to process data without relying on your own system. An example is the Google Colab platform.
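The data cleaning essentials listed under item 2 can be sketched with pandas. The table below is hypothetical toy data, and the 1.5 * IQR rule used to flag outliers is one common convention chosen here as an assumption; the notes do not prescribe a specific outlier test.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing income and one implausibly large income.
raw = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "income": [52_000, 55_000, np.nan, 61_000, 58_000, 900_000],
})

# Find missing values per column, then drop the rows that contain any.
missing_per_column = raw.isna().sum()
cleaned = raw.dropna()

# Flag outliers with the 1.5 * IQR rule (an assumed convention).
q1 = cleaned["income"].quantile(0.25)
q3 = cleaned["income"].quantile(0.75)
iqr = q3 - q1
is_outlier = (cleaned["income"] < q1 - 1.5 * iqr) | (cleaned["income"] > q3 + 1.5 * iqr)
outliers = cleaned[is_outlier]

print(missing_per_column)
print(outliers)
```

Dropping rows is the simplest treatment of missing values and matches the step described above; whether an outlier should be removed or kept is a judgment call that depends on the problem.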
Normalization is required when an algorithm relies on something like a distance measure. Examples include clustering data, computing cosine similarities, and building recommender systems. Normalization is not always required; it is done to prevent variables on a larger scale from dominating variables on a smaller scale. For example, consider a dataset of employees' incomes. Income is on a very different scale from the other variables, so we would have to normalize the data before clustering it to prevent incorrect clusters. A key point to note is that normalization does not distort the differences in the ranges of values. Another problem we might face if we don't normalize the data is that gradient descent would take a very long time to reach the global minimum. For numerical data, normalization is generally done to the range 0 to 1. The general formula is: x_normalized = (x - x_min) / (x_max - x_min)
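Rescaling numerical data to the range 0 to 1 is commonly done with min-max scaling. The following is a minimal sketch, assuming that is the formula intended here; the function name `min_max_normalize` and the income values are hypothetical.

```python
def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # All values are identical; there is no range to scale over.
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


# Hypothetical incomes, for illustration only.
incomes = [30_000, 45_000, 60_000, 120_000]
normalized = min_max_normalize(incomes)
print(normalized)
```

Because the mapping is linear, the relative spacing between values is preserved; only the scale changes.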
Feature selection and feature extraction are two major ways of dealing with the curse of dimensionality.
1. Feature selection: Feature selection filters the input variables down to a subset on which attention should focus; every other variable is ignored. This is something which we, as humans, tend to do subconsciously. Many domains have tens of thousands of variables, most of which are irrelevant or redundant. Feature selection limits the training data and reduces the computational resources used, and it can significantly improve a learning algorithm's performance. In summary, the goal of feature selection is to find an optimal feature subset. The chosen subset might not be truly optimal, but methods for estimating the importance of features exist; some Python modules, such as XGBoost, help with this.
2. Feature extraction: Feature extraction transforms the raw features into new ones, which can then improve the process of feature selection. Examples in an unsupervised learning problem include extracting bigrams from a text or contours from an image. The general workflow is to apply feature extraction to the given data, and then apply feature selection with respect to the target variable to select a subset of the extracted features. In effect, this helps improve the accuracy of a model.
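One simple form of feature selection is a filter that ranks each feature by how strongly it correlates with the target and keeps the top few. The sketch below is an illustration only: the synthetic data, the helper name `select_top_k`, and the choice of absolute Pearson correlation as the score are all assumptions, not a method named in these notes.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features = 500, 5
X = rng.normal(size=(n_samples, n_features))
# Hypothetical target: depends only on columns 0 and 3, plus a little noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=n_samples)


def select_top_k(X, y, k):
    """Return the sorted indices of the k features most correlated with y."""
    scores = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    )
    top = np.argsort(scores)[::-1][:k]
    return sorted(top.tolist())


selected = select_top_k(X, y, k=2)
X_reduced = X[:, selected]
print("selected feature indices:", selected)
```

A correlation filter only detects linear, one-feature-at-a-time relationships, so it can miss features that matter only in combination; importance scores from tree-based models are a common alternative.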
Polarity and subjectivity are terms generally used in sentiment analysis.
Polarity is the variation of emotions in a sentence. Since sentiment analysis depends heavily on emotions and their intensity, polarity is an extremely important factor. In most cases, opinions and sentiments are evaluations, which fall into two categories: emotional and rational.
Rational evaluations, as the name suggests, are based on facts and rationality, while emotional evaluations are based on non-tangible responses, which are not always easy to detect. Subjectivity in sentiment analysis is a matter of personal feelings and beliefs, which may or may not be based on fact. When a text contains a lot of subjectivity, it must be explained and analysed in context. By contrast, when a text contains a lot of polarity, it can be expressed as a positive, negative, or neutral emotion.
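As a toy illustration of the two terms, the sketch below scores a sentence against a tiny hand-made lexicon. Everything in it is an assumption made for illustration: the lexicon entries, the convention that polarity lies in [-1, 1] and subjectivity in [0, 1], and the simple averaging. Real sentiment tools use far larger lexicons or trained models.

```python
# Hypothetical toy lexicon: word -> (polarity in [-1, 1], subjectivity in [0, 1]).
LEXICON = {
    "great": (0.8, 0.75),
    "love": (0.5, 0.6),
    "terrible": (-1.0, 1.0),
    "slow": (-0.3, 0.4),
}


def score(text):
    """Average the polarity and subjectivity of the lexicon words in the text."""
    words = [word.strip(".,!?").lower() for word in text.split()]
    hits = [LEXICON[word] for word in words if word in LEXICON]
    if not hits:
        # No opinion words found: treat the text as neutral and objective.
        return 0.0, 0.0
    polarity = sum(p for p, _ in hits) / len(hits)
    subjectivity = sum(s for _, s in hits) / len(hits)
    return polarity, subjectivity


for sentence in ("I love it, great phone!",
                 "The service was terrible.",
                 "The package arrived on Tuesday."):
    print(sentence, "->", score(sentence))
```

The third sentence is a plain statement of fact, so it scores zero on both axes, while the first two carry opinion words and receive a positive and a negative polarity respectively.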
ARIMA is a widely used statistical method whose name stands for Auto Regressive Integrated Moving Average. It is generally used for analyzing and forecasting time series data. Let's take a quick look at the terms involved.
Auto Regression is a model that uses the relationship between an observation and some number of lagged observations.
Integrated means the use of differencing of raw observations, which helps make the time series stationary.
Moving Average is a model that uses the dependency between an observation and the residual errors from a moving average model applied to lagged observations.
Note that each of these components is specified as a parameter of the model. Once the parameters are chosen, a linear regression model is constructed that includes the specified terms.
Data is prepared by:
Finding out the differences
Removing trends and structures that will negatively affect the model
Finally, making the series stationary.
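The differencing step used in data preparation can be sketched with NumPy. The series here is synthetic (a linear trend plus noise, invented for illustration), and the helper `fitted_slope` is a hypothetical rough check for a remaining trend, not a formal stationarity test.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(200)
# Hypothetical series: a linear trend (slope 0.5) plus random noise.
series = 0.5 * t + rng.normal(scale=1.0, size=t.size)

# First-order differencing: each value minus the previous one.
differenced = np.diff(series, n=1)


def fitted_slope(values):
    """Slope of a least-squares line through the values, as a rough trend check."""
    return np.polyfit(np.arange(values.size), values, deg=1)[0]


print("slope before differencing:", round(fitted_slope(series), 3))
print("slope after differencing: ", round(fitted_slope(differenced), 3))
```

After one round of differencing the linear trend is gone and the values fluctuate around a constant level. A series with a more complex trend may need further differencing.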
How would you define machine learning?
Machine Learning is about building systems that can learn from data. Learning means getting better at some task, given some performance measure.
Can you name four types of applications where it shines?
Machine Learning is great for complex problems for which we have no algorithmic solution, to replace long lists of hand-tuned rules, to build systems that adapt to fluctuating environments, and finally to help humans learn (e.g., data mining).
What is a labeled training set?
A labeled training set is a training set that contains the desired solution (a.k.a. a label) for each instance.
What are the two most common supervised tasks?
The two most common supervised tasks are regression and classification.
Can you name four common unsupervised tasks?
Common unsupervised tasks include clustering, visualization, dimensionality reduction, and association rule learning.
What type of algorithm would you use to allow a robot to walk in various unknown terrains?
Reinforcement Learning is likely to perform best if we want a robot to learn to walk in various unknown terrains.
What type of algorithm would you use to segment your customers into multiple groups?
If you don't know how to define the groups, then you can use a clustering algorithm (unsupervised learning) to segment your customers into clusters of similar customers.
Would you frame the problem of spam detection as a supervised learning problem or an unsupervised learning problem?
Spam detection is a typical supervised learning problem: the algorithm is fed many emails along with their labels (spam or not spam).
What is an online learning system?
An online learning system can learn incrementally, as opposed to a batch learning system.
What is out-of-core learning?
Out-of-core algorithms can handle vast quantities of data that cannot fit in a computer's main memory.
What type of algorithm relies on a similarity measure to make predictions?
An instance-based learning system learns the training data by heart; then, when given a new instance, it uses a similarity measure to find the most similar learned instances and uses them to make predictions.
What is the difference between a model parameter and a model hyperparameter?
A model parameter determines what the model will predict given a new instance, while a hyperparameter is a parameter of the learning algorithm itself, not of the model.
What do model-based algorithms search for? What is the most common strategy they use to succeed? How do they make predictions?
Model-based learning algorithms search for an optimal value for the model parameters such that the model will generalize well to new instances. They usually minimize a cost function and make predictions by feeding the new instance's features into the model's prediction function.
Can you name four of the main challenges in machine learning?
Some of the main challenges are the lack of data, poor data quality, nonrepresentative data, and excessively complex models that overfit the data.
If your model performs great on the training data but generalizes poorly to new instances, what is happening? Can you name three possible solutions?
The model is likely overfitting the training data. Possible solutions include getting more data, simplifying the model, or reducing the noise in the training data.
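To make the earlier answer on instance-based learning concrete, here is a minimal k-nearest-neighbors sketch. It is an illustration under assumptions: the function name `knn_predict`, the toy points, and the use of Euclidean distance as the similarity measure are choices made here, not part of the answers above.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the majority label among the k training points nearest to query."""
    # "Learning by heart": the training data is stored as-is and only
    # compared against the query at prediction time.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (1.1, 1.0)))
print(knn_predict(points, labels, (5.0, 4.8)))
```

Here `k` is a hyperparameter of the learning algorithm: it is chosen before prediction and is not learned from the data.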