Machine Learning Study Notes (continuously updated)

Notes from studying machine learning.

Contents Overview

  • Data Preprocessing
  • Supervised Learning
    • Regression
    • Terminologies
      • Linear Regression
        • Simple Linear Regression
        • Multiple Linear Regression
        • Polynomial Regression
        • Support Vector Regression
        • Decision Trees Regression
    • Classification
      • Binary Classifier
      • Multi-class Classifier
    • Learners in Classification Problems
      • Lazy Learners
      • Eager Learners
    • Evaluating a Classification model
      • 1. Log Loss or Cross-Entropy Loss
      • 2. Confusion Matrix
      • 3. AUC-ROC curve
      • Linear Model
        • Logistic Regression
          • Binomial
          • Multinomial
          • Ordinal
      • Non-linear Model
        • Lazy Learners: K-Nearest Neighbours
        • Eager Learners: Decision Trees
          • Terminologies
          • Random Forest [Overfitting]
        • Eager Learners: Naïve Bayes
          • Bernoulli, Multinomial and Gaussian Naive Bayes
          • Assumption
        • Support Vector Machines (SVM)
          • Linear SVM
            • Tune Parameter
          • Non-linear SVM (Kernel SVM)
    • Dimensionality Reduction
        • Filter Methods
        • Wrapper Methods
        • Embedded Methods
      • Lasso Regression/ L1 regularization [Reduce complexity]
      • Ridge Regression/ L2 regularization [Reduce complexity]
    • Feature Extractions
      • Principal Component Analysis (Unsupervised Learning)
      • Linear Discriminant Analysis
      • Kernel PCA
      • Quadratic Discriminant Analysis
  • Unsupervised Learning
    • Clustering
      • Partitioning Clustering
      • Density-Based Clustering
      • Distribution Model-Based Clustering
      • Hierarchical Clustering/ Agglomerative Hierarchical Clustering
      • Fuzzy Clustering
    • Association
      • Apriori Algorithm
      • Eclat Algorithm
      • F-P Growth Algorithm
  • Hyper Parameter Tuning
    • Approach 1: Use train_test_split and manually tune parameters by trial and error
    • Approach 2: Use K Fold Cross validation
    • Approach 3: Use GridSearchCV
    • Approach 4: Use RandomizedSearchCV to reduce the number of iterations by trying a random combination of parameters
    • Different Models with Different Parameters
  • K-Fold Cross Validation
  • Bagging/ Ensemble Learning
  • References

Data Preprocessing


Supervised Learning

Supervised learning is the type of machine learning in which machines are trained using well-labelled training data and, on the basis of that data, predict the output. Labelled data means input data that is already tagged with the correct output.

In the real world, supervised learning can be used for risk assessment, image classification, fraud detection, spam filtering, etc.


Regression

  • models the relationship between the input (independent) variables and a continuous output variable
  • is used for the prediction of continuous values, such as weather forecasting, market trends, etc.

Terminologies

  • Dependent Variable: The main factor in Regression analysis which we want to predict or understand is called the dependent variable. It is also called target variable.

  • Independent Variable: The factors which affect the dependent variables or which are used to predict the values of the dependent variables are called independent variable, also called as a predictor.

  • Outliers: Outlier is an observation which contains either very low value or very high value in comparison to other observed values. An outlier may hamper the result, so it should be avoided.

  • Multicollinearity: If the independent variables are highly correlated with each other, the condition is called multicollinearity. It should not be present in the dataset, because it creates problems when ranking the most influential variables.

  • Underfitting and Overfitting: If our algorithm works well with the training dataset but not well with test dataset, then such problem is called Overfitting. And if our algorithm does not perform well even with training dataset, then such problem is called underfitting.


Linear Regression

Simple Linear Regression

The key point in Simple Linear Regression is that the dependent variable must be a continuous/real value. However, the independent variable can be measured on continuous or categorical values.

A single independent/predictor variable (x) is used to model the response variable (Y):

Y = b_0 + b_1x

Y = dependent variable (target variable),
x = independent variable (predictor variable),
b_0 and b_1 are the linear coefficients (intercept and slope).

Steps:
  1. Data pre-processing
  2. Fitting the Simple Linear Regression model to the training set
  3. Predicting the result of the test set
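A minimal sketch of these steps with scikit-learn; the area/price numbers below are made up for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# hypothetical data: house area (sq ft) as the single predictor of price
X = np.array([[2600], [3000], [3200], [3600], [4000]])
y = np.array([550000, 565000, 610000, 595000, 760000])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

reg = LinearRegression()
reg.fit(X_train, y_train)           # fit Y = b_0 + b_1*x

print(reg.intercept_, reg.coef_)    # b_0 and b_1
print(reg.predict(X_test))          # predictions for the test set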

Multiple Linear Regression

More than one predictor variable to predict the response variable

Y = b_0 + b_1x_1 + b_2x_2 + b_3x_3 + ... + b_nx_n

Import the dataset:

import pandas as pd
import numpy as np

df = pd.read_csv('/Users/haleyk/Documents/Python_Libraries_for_ML/Python_Libraries_for_ML/Supervised Learning/Linear Regression/homeprices.csv.xls')
df

area	bedrooms	age	price
0	2600	3.0	20	550000
1	3000	4.0	15	565000
2	3200	NaN	18	610000
3	3600	3.0	30	595000
4	4000	5.0	8	760000
5	4100	6.0	8	810000

Fill missing values (NA) with the column median:


import math
median_bedrooms = math.floor(df.bedrooms.median())
median_bedrooms

df.bedrooms = df.bedrooms.fillna(median_bedrooms) # clean your data, Data Preprocessing: Fill NA values with median value of a column
df

	area	bedrooms	age	price
0	2600	3.0	20	550000
1	3000	4.0	15	565000
2	3200	4.0	18	610000
3	3600	3.0	30	595000
4	4000	5.0	8	760000
5	4100	6.0	8	810000

Fit a linear model to find the relationship between area, bedrooms, age, and price:

from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(df[['area','bedrooms','age']],df.price)

"\nfrom sklearn.linear_model import LinearRegression\nreg = LinearRegression()\nreg.fit(df[['area','bedrooms','age']], df.price)\n"

reg.coef_ has shape (n_features,) for a single target (or (n_targets, n_features) for multiple targets); it shows how each factor changes the price, and reg.intercept_ is the constant term.

reg.coef_
reg.intercept_
reg.predict([[3000,3,40]]) # 3000	4.0	15	565000 Find price of home with 3000 sqr ft area, 3 bedrooms, 40 year old

# return
array([498408.25158031])

reg.predict([[2500,4,5]]) # 2600	3.0	20	550000 Find price of home with 2500 sqr ft area, 4 bedrooms, 5 year old

# return
array([578876.03748933])

Polynomial Regression

When the data points are arranged in a non-linear fashion, we need the Polynomial Regression model:

Y = b_0 + b_1x + b_2x^2 + b_3x^3 + ... + b_nx^n

Y is the predicted/target output, b_0, b_1, ..., b_n are the regression coefficients, and x is the independent/input variable.


Steps:

  1. Data Pre-processing
#importing libraries  
import numpy as np
import matplotlib.pyplot as plt 
import pandas as pd  
  
#importing datasets  
df = pd.read_csv('/Users/haleyk/Documents/Python_Libraries_for_ML/Python_Libraries_for_ML/Supervised Learning/Linear Regression/Position_Salaries.csv')
df

# return

Position	Level	Salary
0	Business Analyst	1	45000
1	Junior Consultant	2	50000
2	Senior Consultant	3	60000
3	Manager	4	80000
4	Country Manager	5	110000
5	Region Manager	6	150000
6	Partner	7	200000
7	Senior Partner	8	300000
8	C-level	9	500000
9	CEO	10	1000000

#Extracting Independent and dependent Variable  
X = df.iloc[:, 1:2].values  # level
y = df.iloc[:, 2].values  # salary
  2. Build a Linear Regression model and fit it to the dataset
#Fitting the Linear Regression to the dataset  
from sklearn.linear_model import LinearRegression  
lin_regs= LinearRegression()  
lin_regs.fit(X,y)  
  3. Build a Polynomial Regression model and fit it to the dataset
#Fitting the Polynomial regression to the dataset  
from sklearn.preprocessing import PolynomialFeatures  
poly_regs= PolynomialFeatures(degree= 2)  # the polynomial degree depends on our choice
x_poly= poly_regs.fit_transform(X)  # converting our feature matrix into polynomial feature matrix
lin_reg_2 =LinearRegression()  
lin_reg_2.fit(x_poly, y)  
  4. Visualize the result for the Linear Regression and Polynomial Regression models
#Visulaizing the result for Linear Regression model  
plt.scatter(X,y,color="blue")  
plt.plot(X,lin_regs.predict(X), color="red")  
plt.title("Bluff detection model(Linear Regression)")  
plt.xlabel("Position Levels")  
plt.ylabel("Salary")  
plt.show()  


#Visulaizing the result for Polynomial Regression  
plt.scatter(X,y,color="blue")  
plt.plot(X, lin_reg_2.predict(poly_regs.fit_transform(X)), color="red")
plt.title("Bluff detection model(Polynomial Regression)")  
plt.xlabel("Position Levels")  
plt.ylabel("Salary")  
plt.show()  


#Fitting the Polynomial regression to the dataset by degree=3
from sklearn.preprocessing import PolynomialFeatures  
poly_regs= PolynomialFeatures(degree= 3)  # the polynomial degree depends on our choice
x_poly= poly_regs.fit_transform(X)  # converting our feature matrix into polynomial feature matrix
lin_reg_2 =LinearRegression()  
lin_reg_2.fit(x_poly, y)  

#Visulaizing the result for Polynomial Regression  
plt.scatter(X,y,color="blue")  
plt.plot(X, lin_reg_2.predict(poly_regs.fit_transform(X)), color="red")
plt.title("Bluff detection model(Polynomial Regression)")  
plt.xlabel("Position Levels")  
plt.ylabel("Salary")  
plt.show()  


When degree=4, the curve is smoother and fits the data more closely.


  5. Predicting the output
#Predicting the final result with the Linear Regression model:
lin_pred = lin_regs.predict([[6.5]])  
print(lin_pred)  

# return
[330378.78787879]

#Predicting the final result with the Polynomial Regression model:
poly_pred = lin_reg_2.predict(poly_regs.fit_transform([[6.5]]))  
print(poly_pred)  

# return
[158862.45265158]

Support Vector Regression

Decision Trees Regression


Classification

  • the output variable is categorical

Binary Classifier

If the classification problem has only two possible outcomes, it is called a binary classifier.
Examples: Yes-No, Male-Female, True-false, etc.

Multi-class Classifier

If a classification problem has more than two outcomes, it is called a multi-class classifier.
Example: Classifications of types of crops, Classification of types of music.

Learners in Classification Problems

Lazy Learners

A lazy learner first stores the training dataset and waits until it receives the test dataset. Classification is then done on the basis of the most related data stored in the training dataset. Lazy learners take less time in training but more time for predictions.
Example: K-NN algorithm, Case-based reasoning

Eager Learners

Eager learners develop a classification model based on a training dataset before receiving a test dataset. Opposite to lazy learners, eager learners take more time in learning and less time in prediction.
Example: Decision Trees, Naïve Bayes, ANN.

Evaluating a Classification model

1. Log Loss or Cross-Entropy Loss

Log loss is used for evaluating the performance of a classifier whose output is a probability value between 0 and 1.

For a good binary Classification model, the value of log loss should be near to 0.
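A small sketch of computing log loss with scikit-learn; the labels and predicted probabilities below are made up for illustration:

from sklearn.metrics import log_loss

y_true = [0, 1, 1, 0, 1]              # actual labels (hypothetical)
y_prob = [0.1, 0.9, 0.8, 0.3, 0.6]    # predicted probability of class 1
print(log_loss(y_true, y_prob))       # the closer to 0, the better the classifier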

2. Confusion Matrix


  • True Negative: the model predicted No and the actual value was also No.
  • True Positive: the model predicted Yes and the actual value was also Yes.
  • False Negative: the model predicted No but the actual value was Yes; this is also called a Type-II error.
  • False Positive: the model predicted Yes but the actual value was No; this is also called a Type-I error.
y_predicted = model.predict(X_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_predicted)

import matplotlib.pyplot as plt
import seaborn as sn
plt.figure(figsize = (10,7))
sn.heatmap(cm, annot = True)

3. AUC-ROC curve

ROC curve stands for Receiver Operating Characteristics Curve and AUC stands for Area Under the Curve.

The ROC curve is plotted with TPR (True Positive Rate) on the Y-axis and FPR (False Positive Rate) on the X-axis.
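A short sketch of plotting the ROC curve and computing AUC; it assumes an already fitted binary classifier named classifier (for example the logistic regression model trained in the section below) together with its x_test/y_test split:

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

y_prob = classifier.predict_proba(x_test)[:, 1]   # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, y_prob)
print(roc_auc_score(y_test, y_prob))              # area under the ROC curve

plt.plot(fpr, tpr)                                # TPR on the Y-axis, FPR on the X-axis
plt.plot([0, 1], [0, 1], linestyle='--')          # chance line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve')
plt.show()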


Linear Model

Logistic Regression

f(x) = \frac{1}{1+e^{-x}}

Logistic regression uses the sigmoid (logistic) function to map predictions to probabilities, where
f(x) = output value between 0 and 1,
x = input to the function,
e = base of the natural logarithm.

Logistic Function (Sigmoid Function)
In logistic regression we use the concept of a threshold value: predicted probabilities above the threshold are mapped to class 1, and probabilities below the threshold are mapped to class 0.
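A tiny numeric sketch of the sigmoid and the threshold idea with NumPy; the inputs and the 0.5 threshold are only illustrative:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))       # maps any real value into (0, 1)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
probs = sigmoid(z)
print(probs)                          # approx. [0.047, 0.378, 0.5, 0.622, 0.953]
print((probs >= 0.5).astype(int))     # apply a 0.5 threshold -> class 0 or 1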

Assumptions
The dependent variable must be categorical in nature.
The independent variables should not have multicollinearity.

Binomial

In binomial Logistic regression, there can be only two possible types of the dependent variables, such as 0 or 1, Pass or Fail, etc.

Steps:

  1. Data Pre-processing step
#Data Pre-procesing Step  
# importing libraries  
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd  
  
#importing datasets  
data_set= pd.read_csv('/Users/haleyk/Documents/Python_Libraries_for_ML/Python_Libraries_for_ML/Supervised Learning/Classification/Logistics Regression/User_Data.csv')  

#Extracting Independent and dependent Variable  
x= data_set.iloc[:, [2,3]].values  
y= data_set.iloc[:, 4].values  

# Splitting the dataset into training and test set.  
from sklearn.model_selection import train_test_split  
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)  

#feature Scaling  
from sklearn.preprocessing import StandardScaler    
st_x= StandardScaler()    
x_train= st_x.fit_transform(x_train)    
x_test= st_x.transform(x_test)

# return
User ID	Gender	Age	EstimatedSalary	Purchased
0	15624510	Male	19	19000	0
1	15810944	Male	35	20000	0
2	15668575	Female	26	43000	0
3	15603246	Female	27	57000	0
4	15804002	Male	19	76000	0
...	...	...	...	...	...
395	15691863	Female	46	41000	1
396	15706071	Male	51	23000	1
397	15654296	Female	50	20000	1
398	15755018	Male	36	33000	0
399	15594041	Female	49	36000	1
400 rows × 5 columns
  2. Fitting Logistic Regression to the Training set
#Fitting Logistic Regression to the training set  
from sklearn.linear_model import LogisticRegression  
classifier= LogisticRegression(random_state=0)  
classifier.fit(x_train, y_train)  

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,  
                   intercept_scaling=1, l1_ratio=None, max_iter=100,  
                   multi_class='warn', n_jobs=None, penalty='l2',  
                   random_state=0, solver='warn', tol=0.0001, verbose=0,  
                   warm_start=False)  
  3. Predicting the test result
#Predicting the test set result  
y_pred= classifier.predict(x_test)  
  4. Test accuracy of the result (Creation of Confusion matrix)
#Creating the Confusion matrix  
from sklearn.metrics import confusion_matrix  
cm= confusion_matrix(y_test, y_pred)  
  5. Visualizing the training set result
#Visualizing the training set result  
from matplotlib.colors import ListedColormap  
x_set, y_set = x_train, y_train  
x1, x2 = np.meshgrid(np.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step  =0.01),  
np.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))  
plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),  
alpha = 0.75, cmap = ListedColormap(('purple','green' )))  
plt.xlim(x1.min(), x1.max())  
plt.ylim(x2.min(), x2.max())  
for i, j in enumerate(np.unique(y_set)):  
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],  
        c = ListedColormap(('purple', 'green'))(i), label = j)  
plt.title('Logistic Regression (Training set)')  
plt.xlabel('Age')  
plt.ylabel('Estimated Salary')  
plt.legend()  
plt.show()  


Multinomial

In multinomial Logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as “cat”, “dogs”, or “sheep”

from sklearn.datasets import load_digits
%matplotlib inline
import matplotlib.pyplot as plt
digits = load_digits()

dir(digits)

digits.data[0] #number 0
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(digits.data,digits.target, test_size=0.2)

model.fit(X_train, y_train)
model.score(X_test, y_test)

# return
0.9694444444444444
y_predicted = model.predict(X_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_predicted)
cm

import seaborn as sn
plt.figure(figsize = (10,7))
sn.heatmap(cm, annot=True)
plt.xlabel('Predicted')
plt.ylabel('Truth')


Ordinal

In ordinal Logistic regression, there can be 3 or more possible ordered types of dependent variables, such as “low”, “Medium”, or “High”.


Non-linear Model

Lazy Learners: K-Nearest Neighbours

K-NN is a non-parametric algorithm, which means it does not make any assumptions about the underlying data.
At the training phase, the KNN algorithm just stores the dataset; when it gets new data, it classifies it into the category most similar to that new data.


Step-1: Select the number K of the neighbors
Step-2: Calculate the Euclidean distance of K number of neighbors
Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
Step-4: Among these k neighbors, count the number of the data points in each category.
Step-5: Assign the new data points to that category for which the number of the neighbor is maximum.
Step-6: Our model is ready.


  1. Data Pre-processing step
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()

iris.feature_names
# return
['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']
 
iris.target_names
# return
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

df = pd.DataFrame(iris.data,columns=iris.feature_names)
df.head()
# return
	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

df['target']=iris.target
df.head()
# return

sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	target
0	5.1	3.5	1.4	0.2	0
1	4.9	3.0	1.4	0.2	0
2	4.7	3.2	1.3	0.2	0
3	4.6	3.1	1.5	0.2	0
4	5.0	3.6	1.4	0.2	0

df['flower_name'] =df.target.apply(lambda x: iris.target_names[x])
df.head()
# return
sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	target	flower_name
0	5.1	3.5	1.4	0.2	0	setosa
1	4.9	3.0	1.4	0.2	0	setosa
2	4.7	3.2	1.3	0.2	0	setosa
3	4.6	3.1	1.5	0.2	0	setosa
4	5.0	3.6	1.4	0.2	0	setosa
  2. Fitting the K-NN algorithm to the Training set
from sklearn.model_selection import train_test_split

X = df.drop(['target','flower_name'], axis='columns')
y = df.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # 30 of the 150 samples held out for testing

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=10) # number of neighbors to use

knn.fit(X_train, y_train)
# return
KNeighborsClassifier(n_neighbors=10)
  3. Predicting the test result
knn.predict([[4.8,3.0,1.5,0.3]])
# return
/Users/haleyk/opt/anaconda3/lib/python3.8/site-packages/sklearn/base.py:450: UserWarning: X does not have valid feature names, but KNeighborsClassifier was fitted with feature names
  warnings.warn(
array([0])
  4. Test accuracy of the result (Creation of Confusion matrix)
from sklearn.metrics import confusion_matrix
y_pred = knn.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
cm

# return
array([[11,  0,  0],
       [ 0, 13,  0],
       [ 0,  2,  4]])

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sn
plt.figure(figsize=(7,5))
sn.heatmap(cm, annot=True)
plt.xlabel('Predicted')
plt.ylabel('Truth')

# 11+13+2+4 = 30 test set
# 2 is wrong
# there is only 2 wrong, therefore, the score is 0.933333

knn.score(X_test, y_test)
# return
0.9333333333333333


  5. Print the classification report (precision, recall and f1-score for each class)

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

# return
    precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       0.87      1.00      0.93        13
           2       1.00      0.67      0.80         6

    accuracy                           0.93        30
   macro avg       0.96      0.89      0.91        30
weighted avg       0.94      0.93      0.93        30
Another example, this time on the User_Data dataset:

  1. Data Pre-processing step
# importing libraries  
import numpy as np  
import matplotlib.pyplot as plt  
import pandas as pd  
  
#importing datasets  
data_set= pd.read_csv('/Users/haleyk/Documents/Python_Libraries_for_ML/Python_Libraries_for_ML/Supervised Learning/Classification/Logistics Regression/User_Data.csv')  
data_set

User ID	Gender	Age	EstimatedSalary	Purchased
0	15624510	Male	19	19000	0
1	15810944	Male	35	20000	0
2	15668575	Female	26	43000	0
3	15603246	Female	27	57000	0
4	15804002	Male	19	76000	0
...	...	...	...	...	...
395	15691863	Female	46	41000	1
396	15706071	Male	51	23000	1
397	15654296	Female	50	20000	1
398	15755018	Male	36	33000	0
399	15594041	Female	49	36000	1
400 rows × 5 columns
  2. Fitting the K-NN algorithm to the Training set
#Extracting Independent and dependent Variable  
x= data_set.iloc[:, [2,3]].values  
y= data_set.iloc[:, 4].values  
  
# Splitting the dataset into training and test set.  
from sklearn.model_selection import train_test_split  
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)  
  
#feature Scaling  
from sklearn.preprocessing import StandardScaler    
st_x= StandardScaler()    
x_train= st_x.fit_transform(x_train)    
x_test= st_x.transform(x_test)  

#Fitting K-NN classifier to the training set  
from sklearn.neighbors import KNeighborsClassifier  
classifier= KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2 )  
classifier.fit(x_train, y_train)  
  3. Predicting the test result
#Predicting the test set result  
y_pred= classifier.predict(x_test)  

  4. Test accuracy of the result (Creation of Confusion matrix)
#Creating the Confusion matrix  
from sklearn.metrics import confusion_matrix  
cm= confusion_matrix(y_test, y_pred)  
  5. Visualizing the training set and test set results
#Visulaizing the training set result  
from matplotlib.colors import ListedColormap  
x_set, y_set = x_train, y_train  
x1, x2 = np.meshgrid(np.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step  =0.01),  
np.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))  
plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),  
alpha = 0.75, cmap = ListedColormap(('red','green' )))  
plt.xlim(x1.min(), x1.max())  
plt.ylim(x2.min(), x2.max())  
for i, j in enumerate(np.unique(y_set)):  
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],  
        c = ListedColormap(('red', 'green'))(i), label = j)  
plt.title('K-NN Algorithm (Training set)')  
plt.xlabel('Age')  
plt.ylabel('Estimated Salary')  
plt.legend()  
plt.show()  


#Visualizing the test set result  
from matplotlib.colors import ListedColormap  
x_set, y_set = x_test, y_test  
x1, x2 = np.meshgrid(np.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step  =0.01),  
np.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))  
plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),  
alpha = 0.75, cmap = ListedColormap(('red','green' )))  
plt.xlim(x1.min(), x1.max())  
plt.ylim(x2.min(), x2.max())  
for i, j in enumerate(np.unique(y_set)):  
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],  
        c = ListedColormap(('red', 'green'))(i), label = j)  
plt.title('K-NN algorithm(Test set)')  
plt.xlabel('Age')  
plt.ylabel('Estimated Salary')  
plt.legend()  
plt.show()  


Eager Learners: Decision Trees

  • solve problems for both categorical and numerical data
  • builds a tree-like structure in which each internal node represents the “test” for an attribute, each branch represent the result of the test, and each leaf node represents the final decision or result.
  • is constructed starting from the root node/parent node (dataset), which splits into left and right child nodes (subsets of dataset). These child nodes are further divided into their children node, and themselves become the parent node of those nodes.
Terminologies

Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which further gets divided into two or more homogeneous sets.

Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after getting a leaf node.

Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the given conditions.

Branch/Sub Tree: A tree formed by splitting the tree.

Pruning: Pruning is the process of removing the unwanted branches from the tree.

Parent/Child node: The root node of the tree is called the parent node, and other nodes are called the child nodes.

Step-1: Begin the tree with the root node, says S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).

  1. Information Gain
    Information gain is the measurement of changes in entropy after the segmentation of a dataset based on an attribute.

Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]

Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies randomness in data. Entropy can be calculated as:

Entropy(S) = −P(yes) log_2 P(yes) − P(no) log_2 P(no)

  2. Gini Index
    Gini index is a measure of impurity or purity used while creating a decision tree in the CART (Classification and Regression Tree) algorithm.

An attribute with a low Gini index should be preferred over one with a high Gini index. (A short sketch computing entropy and the Gini index for a node follows Step-5 below.)

Gini Index = 1 − \sum_j P_j^2

Step-3: Divide the S into subsets that contains possible values for the best attributes.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in step -3. Continue this process until a stage is reached where you cannot further classify the nodes and called the final node as a leaf node.
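A small sketch of how entropy and the Gini index could be computed for a single node, using made-up class counts:

import numpy as np

def entropy(p_yes, p_no):
    # treat 0 * log2(0) as 0
    return -sum(p * np.log2(p) for p in (p_yes, p_no) if p > 0)

def gini(p_yes, p_no):
    return 1 - (p_yes**2 + p_no**2)

# hypothetical node with 9 "yes" and 5 "no" samples
p_yes, p_no = 9/14, 5/14
print(entropy(p_yes, p_no))   # ~0.940
print(gini(p_yes, p_no))      # ~0.459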

  1. Data Pre-Processing Step

  2. Fitting a Decision-Tree algorithm to the Training set

#Fitting Decision Tree classifier to the training set
from sklearn.tree import DecisionTreeClassifier
classifier= DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)

criterion='entropy': the criterion measures the quality of a split, here computed as the information gain given by entropy.

  3. Predicting the test result
#Predicting the test set result  
y_pred= classifier.predict(x_test)  
  4. Test accuracy of the result (Creation of Confusion matrix)
#Creating the Confusion matrix  
from sklearn.metrics import confusion_matrix  
cm= confusion_matrix(y_test, y_pred)  

# return
array([[62,  6],
       [ 3, 29]])

In the output above we can see the confusion matrix, which has 6+3 = 9 incorrect predictions and 62+29 = 91 correct predictions. Therefore, compared to other classification models, the Decision Tree classifier made a good prediction.

Another example:

import pandas as pd
df = pd.read_csv('/Users/haleyk/Desktop/Codebasics/ML/9_decision_tree/Exercise/titanic.csv')
df.head()

df.drop(['PassengerId','Name','SibSp','Parch','Ticket','Cabin','Embarked'],axis=1,inplace=True)
df.head()

df.Sex = df.Sex.map({'female':1, 'male':0})  
df.Sex

df = df.fillna(df.Age.mean())
df

df.isnull().sum()

X = df.drop('Survived',axis='columns')
y = df.Survived
X
# return
	Pclass	Sex	Age	Fare
0	3	0	22.000000	7.2500
1	1	1	38.000000	71.2833
2	3	1	26.000000	7.9250
3	1	1	35.000000	53.1000
4	3	0	35.000000	8.0500
...	...	...	...	...
886	2	0	27.000000	13.0000
887	1	1	19.000000	30.0000
888	3	1	29.699118	23.4500
889	1	0	26.000000	30.0000
890	3	0	32.000000	7.7500
891 rows × 4 columns

y 
# return
0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: Survived, Length: 891, dtype: int64
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)

from sklearn import tree
model = tree.DecisionTreeClassifier()
model.fit(X_train,y_train)
# return
DecisionTreeClassifier()

model.score(X_test,y_test)
# return
0.8044692737430168
Random Forest [Overfitting]
  • is one of the most powerful supervised learning algorithms, capable of performing regression as well as classification tasks.
  • is an ensemble learning method that combines multiple decision trees and predicts the final output by averaging (for regression) or majority vote (for classification) over the individual trees. The combined decision trees are called base models.
  • uses the Bagging (Bootstrap Aggregation) technique of ensemble learning, in which the aggregated decision trees run in parallel and do not interact with each other.

g(x) = f_0(x) + f_1(x) + f_2(x) + ...


Step-1: Select random K data points from the training set.

Step-2: Build the decision trees associated with the selected data points (Subsets).

Step-3: Choose the number N for decision trees that you want to build.

Step-4: Repeat Step 1 & 2.

Step-5: For new data points, find the predictions of each decision tree, and assign the new data points to the category that wins the majority votes.


import pandas as pd
from sklearn.datasets import load_digits
digits = load_digits()

dir(digits)   # list of string

# return
['DESCR', 'data', 'feature_names', 'frame', 'images', 'target', 'target_names']

df = pd.DataFrame(digits.data)
df['target'] = digits.target
df[0:12]

X = df.drop('target', axis=1)
y = df.target

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=20) #The number of trees in the forest.
model.fit(X_train, y_train) 

# return
RandomForestClassifier(n_estimators=20)

model.score(X_test, y_test)
# return
0.9638888888888889
y_predicted = model.predict(X_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_predicted)
cm

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sn
plt.figure(figsize=(10,7))
sn.heatmap(cm, annot=True)
plt.xlabel('Predicted')
plt.ylabel('Truth')

# return
Text(69.0, 0.5, 'Truth')



Eager Learners: Naïve Bayes

Bernoulli, Multinomial and Gaussian Naive Bayes
  • Multinomial Naïve Bayes works on feature vectors where each term represents how many times it appears, i.e. its frequency (count).
  • Bernoulli Naïve Bayes is a binary variant used when a feature is simply present or absent.
  • Gaussian Naïve Bayes is based on a continuous (Gaussian) distribution of the features.

Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes theorem and used for solving classification problems.

It is mainly used in text classification that includes a high-dimensional training dataset.

It is a probabilistic classifier, which means it predicts on the basis of the probability of an object belonging to a class.

Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.

  • Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. Such as if the fruit is identified on the bases of color, shape, and taste, then red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identify that it is an apple without depending on each other.
  • Bayes: It is called Bayes because it depends on the principle of Bayes’ Theorem.

Bayes’ Theorem
P(A|B) = P(B|A)P(A) / P(B)
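As a sketch of the text-classification use case, here is a tiny spam-filter-style example using CountVectorizer and MultinomialNB; the toy messages and labels are made up:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = ["win a free prize now", "meeting at 10am tomorrow",
            "free offer click now", "lunch with the team"]
labels = [1, 0, 1, 0]                            # 1 = spam, 0 = not spam (hypothetical)

vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(messages)    # word-count features

clf = MultinomialNB()
clf.fit(X_counts, labels)

print(clf.predict(vectorizer.transform(["free prize now"])))   # predicts spam (1)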

Assumption

Naive Bayes uses Bayes' theorem for conditional probability, with the naive assumption that the features are not correlated with each other, and finds the conditional probability of the target variable given the probabilities of the features.

import pandas as pd
df = pd.read_csv('https://storage.googleapis.com/kagglesdsdata/competitions/3136/26502/train.csv?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1646132661&Signature=PteVGgxccUCBs%2FsS7tCMNObj%2B6Svdf23ERl7SVnvVns2y4B7yPqW3mYYNQazGsbpEF6P0KUiKysl5ZTjIZEuNI%2FdduFeY77DWHdNVWZGDd5TlizWxOPd1E2rHcOyYpavsCr1OgtjLU8HpTCDsHhTb9kpzGoSK0Dzf2iFgh7HNWCsUYbHjCK9OAn6IBITg%2BZpn2KagdtPMFR3QWDeQV5kHlpsG9j8DaPdQGyJcZBC%2B8zuqz2UM6tknZp20fVDjtWR5ZvklfjzDpWGVtKxZfpFQlAv4f3udqF%2F%2F1j2zXFUcUXiD6tFdDR40rcNTL%2BVDJnUeNFHX4%2F7XbprMuToqMgpqQ%3D%3D&response-content-disposition=attachment%3B+filename%3Dtrain.csv')
df.head()

# return
PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
df.drop(['PassengerId','Name','SibSp','Parch','Ticket','Cabin','Embarked'],axis='columns',inplace=True)
df.head()

# return
	Survived	Pclass	Sex	Age	Fare
0	0	3	male	22.0	7.2500
1	1	1	female	38.0	71.2833
2	1	3	female	26.0	7.9250
3	1	1	female	35.0	53.1000
4	0	3	male	35.0	8.0500
inputs = df.drop('Survived',axis='columns')
target = df.Survived

#inputs.Sex = inputs.Sex.map({'male': 1, 'female': 2})
dummies = pd.get_dummies(inputs.Sex)
dummies.head()

# return
	female	male
0	0	1
1	1	0
2	1	0
3	1	0
4	0	1
# merged

inputs = pd.concat([inputs,dummies],axis='columns')
inputs.head()

# return

Pclass	Sex	Age	Fare	female	male
0	3	male	22.0	7.2500	0	1
1	1	female	38.0	71.2833	1	0
2	3	female	26.0	7.9250	1	0
3	1	female	35.0	53.1000	1	0
4	3	male	35.0	8.0500	0	1
inputs.drop(['Sex','male'],axis='columns',inplace=True)
inputs.head()

# return

Pclass	Age	Fare	female
0	3	22.0	7.2500	0
1	1	38.0	71.2833	1
2	3	26.0	7.9250	1
3	1	35.0	53.1000	1
4	3	35.0	8.0500	0

inputs.columns[inputs.isna().any()] #detect missing data
# return
Index(['Age'], dtype='object')
inputs.Age[:10]
inputs.Age = inputs.Age.fillna(inputs.Age.mean()) # fill the NaN with mean value
inputs.head()

# return
	Pclass	Age	Fare	female
0	3	22.0	7.2500	0
1	1	38.0	71.2833	1
2	3	26.0	7.9250	1
3	1	35.0	53.1000	1
4	3	35.0	8.0500	0
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(inputs,target,test_size=0.3)

from sklearn.naive_bayes import GaussianNB
model = GaussianNB()

len(X_train)
# 623
len(X_test)
# 268
len(inputs)
# 891

model.fit(X_train,y_train)
model.score(X_test,y_test)
# return
0.7350746268656716
# Calculate the score using cross validation
from sklearn.model_selection import cross_val_score
cross_val_score(GaussianNB(),X_train, y_train, cv=5)
# return
array([0.784     , 0.808     , 0.784     , 0.76612903, 0.82258065])

Support Vector Machines (SVM)

Support Vector Machine is a supervised learning algorithm which can be used for regression as well as classification problems. So if we use it for regression problems, then it is termed as Support Vector Regression.


Linear SVM

Linear SVM is used for linearly separable data: if a dataset can be classified into two classes by a single straight line, the data is termed linearly separable, and the classifier used is called a Linear SVM classifier.


#Data Pre-processing Step  
# importing libraries  
import numpy as nm  
import matplotlib.pyplot as mtp  
import pandas as pd  
  
#importing datasets  
data_set= pd.read_csv('/Users/haleyk/Documents/Python_Libraries_for_ML/Python_Libraries_for_ML/Supervised Learning/Classification/Logistics Regression/User_Data.csv')  
data_set

# return

User ID	Gender	Age	EstimatedSalary	Purchased
0	15624510	Male	19	19000	0
1	15810944	Male	35	20000	0
2	15668575	Female	26	43000	0
3	15603246	Female	27	57000	0
4	15804002	Male	19	76000	0
...	...	...	...	...	...
395	15691863	Female	46	41000	1
396	15706071	Male	51	23000	1
397	15654296	Female	50	20000	1
398	15755018	Male	36	33000	0
399	15594041	Female	49	36000	1
400 rows × 5 columns
#Extracting Independent and dependent Variable  
x= data_set.iloc[:, [2,3]].values  
y= data_set.iloc[:, 4].values  
  
# Splitting the dataset into training and test set.  
from sklearn.model_selection import train_test_split  
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)  
#feature Scaling  
from sklearn.preprocessing import StandardScaler    
st_x= StandardScaler()    
x_train= st_x.fit_transform(x_train)    
x_test= st_x.transform(x_test)     
from sklearn.svm import SVC # "Support vector classifier"  
classifier = SVC(kernel='linear', random_state=0)  
classifier.fit(x_train, y_train)  

#Predicting the test set result  
y_pred= classifier.predict(x_test)  

#Creating the Confusion matrix  
from sklearn.metrics import confusion_matrix  
cm= confusion_matrix(y_test, y_pred)  
cm

# return
array([[66,  2],
       [ 8, 24]])
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()

df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target  #0-49 setosa
df['flower_name'] = df.target.apply(lambda x: iris.target_names[x])
df.head()

df[df.target==1].head() #50-99 versicolor

df[df.target==2].head() #100-149 virginica
from sklearn.model_selection import train_test_split
X = df.drop(['target','flower_name'], axis=1)
y = df.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

from sklearn.svm import SVC
model = SVC()

model.fit(X_train, y_train)

model.score(X_test, y_test)

model.predict([[4.8,3.0,1.5,0.3]])
Tune Parameter
# Regularization (C)
model_C = SVC(C=1)
model_C.fit(X_train, y_train)
model_C.score(X_test, y_test)


model_C = SVC(C=10)
model_C.fit(X_train, y_train)
model_C.score(X_test, y_test)


# Gamma
model_g = SVC(gamma=10)
model_g.fit(X_train, y_train)
model_g.score(X_test, y_test)

# Kernel 
model_linear_kernal = SVC(kernel='linear')
model_linear_kernal.fit(X_train, y_train)
model_linear_kernal.score(X_test, y_test)


Non-linear SVM (Kernel SVM)

Non-linear SVM is used for non-linearly separable data: if a dataset cannot be classified by using a straight line, the data is termed non-linear, and the classifier used is called a Non-linear (kernel) SVM classifier.

To separate these data points, we need to add one more dimension. For linear data we have used the two dimensions x and y, so for non-linear data we add a third dimension z:

z = x^2 + y^2

#Data Pre-processing Step  
# importing libraries  
import numpy as nm  
import matplotlib.pyplot as mtp  
import pandas as pd  
  
#importing datasets  
data_set= pd.read_csv('user_data.csv')  
  
#Extracting Independent and dependent Variable  
x= data_set.iloc[:, [2,3]].values  
y= data_set.iloc[:, 4].values  
  
# Splitting the dataset into training and test set.  
from sklearn.model_selection import train_test_split  
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)  
#feature Scaling  
from sklearn.preprocessing import StandardScaler    
st_x= StandardScaler()    
x_train= st_x.fit_transform(x_train)    
x_test= st_x.transform(x_test)  
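The snippet above stops at preprocessing; a minimal continuation, assuming the same x_train/x_test split, could fit an RBF-kernel SVM like this:

#Fitting the kernel SVM classifier to the training set
from sklearn.svm import SVC
classifier = SVC(kernel='rbf', random_state=0)   # the RBF kernel handles non-linear boundaries
classifier.fit(x_train, y_train)

#Predicting the test set result and checking the confusion matrix
y_pred = classifier.predict(x_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)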

Dimensionality Reduction


Filter Methods

In a filter method, the dataset is filtered and a subset that contains only the relevant features is taken. Some common filter techniques are:

Correlation
Chi-Square Test
ANOVA
Information Gain, etc.
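A short sketch of a filter method using the chi-square test on the iris data; the choice of k=2 is arbitrary:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
X, y = iris.data, iris.target

selector = SelectKBest(score_func=chi2, k=2)   # keep the 2 highest-scoring features
X_selected = selector.fit_transform(X, y)

print(selector.scores_)     # chi-square score of each feature
print(X_selected.shape)     # (150, 2)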

Wrapper Methods

In a wrapper method, a model is trained on candidate subsets of features, and its performance decides whether to add or remove features to increase the accuracy of the model. This method is more accurate than filter methods but computationally more expensive. Some common wrapper techniques are:

Forward Selection
Backward Selection
Bi-directional Elimination
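A sketch of a wrapper-style approach using recursive feature elimination (RFE) with a logistic regression estimator; the number of features to keep is arbitrary:

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)

print(rfe.support_)    # boolean mask of the selected features
print(rfe.ranking_)    # 1 = selected, larger values = eliminated earlier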

Embedded Methods

Embedded methods perform feature selection during model training itself, evaluating the importance of each feature while the model is fitted. Some common techniques of embedded methods are:

Elastic Net
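A sketch of an embedded approach: let a regularized model (here Lasso) zero out coefficients and keep only the surviving features via SelectFromModel; the alpha value is arbitrary:

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

selector = SelectFromModel(Lasso(alpha=0.1))   # features whose coefficients shrink to 0 are dropped
selector.fit(X, y)

print(selector.get_support())         # mask of the kept features
print(selector.transform(X).shape)    # reduced feature matrix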

Lasso Regression/ L1 regularization [Reduce complexity]

  • penalty term contains only the absolute weights
  • can only shrink the slope to 0 because of taking absolute values

Loss = Error(Y - \widehat{Y}) + \lambda \sum_1^n |w_i|

# Lasso regression (L1 regularization); train_X/train_y and test_X/test_y
# are assumed to come from an earlier train_test_split
from sklearn.linear_model import Lasso
lasso_reg = Lasso(alpha=50, max_iter=100, tol=0.1)
lasso_reg.fit(train_X, train_y)

lasso_reg.score(test_X, test_y)
# return
0.6636
lasso_reg.score(train_X, train_y)
# return
0.6766

Ridge Regression/ L2 regularization [Reduce complexity]

  • a general linear or polynomial regression will fail if there is high collinearity between the independent variables
  • penalty term contains a square of weights
  • shrink the slope near to 0

Loss = Error(Y - \widehat{Y}) + \lambda \sum_1^n w_i^2

from sklearn.linear_model import Ridge
ridge_reg= Ridge(alpha=50, max_iter=100, tol=0.1)
ridge_reg.fit(train_X, train_y)

ridge_reg.score(test_X, test_y)
# return
0.6670848945194956

ridge_reg.score(train_X, train_y)
# return
0.6622376739684328

Feature Extractions

Principal Component Analysis (Unsupervised Learning)

Principal Component Analysis is a statistical process that converts the observations of correlated features into a set of linearly uncorrelated features with the help of orthogonal transformation. These new transformed features are called the Principal Components. It is one of the popular tools that is used for exploratory data analysis and predictive modeling.

from sklearn.decomposition import PCA

pca = PCA(0.95)
X_pca = pca.fit_transform(X)
X_pca.shape

# return
(1797, 29)

pca.explained_variance_ratio_
# return
array([0.14890594, 0.13618771, 0.11794594, 0.08409979, 0.05782415,
       0.0491691 , 0.04315987, 0.03661373, 0.03353248, 0.03078806,
       0.02372341, 0.02272697, 0.01821863, 0.01773855, 0.01467101,
       0.01409716, 0.01318589, 0.01248138, 0.01017718, 0.00905617,
       0.00889538, 0.00797123, 0.00767493, 0.00722904, 0.00695889,
       0.00596081, 0.00575615, 0.00515158, 0.0048954 ])

pca.n_components_
# return
29

X_pca
# return
array([[ -1.25946645,  21.27488348,  -9.46305462, ...,   3.67072108,
         -0.9436689 ,  -1.13250195],
       [  7.9576113 , -20.76869896,   4.43950604, ...,   2.18261819,
         -0.51022719,   2.31354911],
       [  6.99192297,  -9.95598641,   2.95855808, ...,   4.22882114,
          2.1576573 ,   0.8379578 ],
       ...,
       [ 10.8012837 ,  -6.96025223,   5.59955453, ...,  -3.56866194,
          1.82444444,   3.53885886],
       [ -4.87210009,  12.42395362, -10.17086635, ...,   3.25330054,
          0.95484174,  -0.93895602],
       [ -0.34438963,   6.36554919,  10.77370849, ...,  -3.01636722,
          1.29752723,   2.58810313]])

X_train_pca, X_test_pca, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=30)

from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)
model.fit(X_train_pca, y_train)
model.score(X_test_pca, y_test)

# return
0.9694444444444444
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
X_pca.shape

# return
(1797, 2)

X_pca
# return
array([[ -1.25946639,  21.27487891],
       [  7.95760922, -20.76869518],
       [  6.99192341,  -9.95598163],
       ...,
       [ 10.80128435,  -6.96025523],
       [ -4.87210315,  12.42395926],
       [ -0.34438701,   6.36554335]])

pca.explained_variance_ratio_
# return: You can see that both combined retains 0.14+0.13=0.27 or 27% of important feature information
array([0.14890594, 0.13618771])


X_train_pca, X_test_pca, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=30)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_pca, y_train)
model.score(X_test_pca, y_test)

# return
0.6083333333333333

Linear Discriminant Analysis

Kernel PCA

Quadratic Discriminant Analysis


Unsupervised Learning

Unsupervised learning is a machine learning technique in which models are not supervised using a labelled training dataset. Instead, the model itself finds the hidden patterns and insights in the given data. It can be compared to the learning that takes place in the human brain while learning new things.

The goal of unsupervised learning is to find the underlying structure of dataset, group that data according to similarities, and represent that dataset in a compressed format.


Popular unsupervised learning algorithms include:

K-means clustering
Anomaly detection
Neural networks
Principal Component Analysis
Independent Component Analysis
Apriori algorithm
Singular value decomposition

Clustering

The clustering methods are broadly divided into hard clustering (each data point belongs to only one group) and soft clustering (a data point can also belong to other groups), but various other clustering approaches exist as well. The most common algorithm, K-Means, works as follows:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select random K points or centroids. (It can be other from the input dataset).

Step-3: Assign each data point to their closest centroid, which will form the predefined K clusters.

Step-4: Calculate the variance and place a new centroid of each cluster.

Step-5: Repeat the third steps, which means reassign each datapoint to the new closest centroid of each cluster.

Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

Step-7: The model is ready.

Elbow Method
WCSS stands for Within-Cluster Sum of Squares, which measures the total variation within the clusters.

WCSS = \sum_{P_i \in Cluster_1} distance(P_i, C_1)^2 + \sum_{P_i \in Cluster_2} distance(P_i, C_2)^2 + \sum_{P_i \in Cluster_3} distance(P_i, C_3)^2

To measure the distance between data points and centroid, we can use any method such as Euclidean distance or Manhattan distance.

# importing libraries    
import numpy as np    
import matplotlib.pyplot as plt    
import pandas as pd    

# Importing the dataset  
dataset = pd.read_csv('/Users/haleyk/Documents/Python_Libraries_for_ML/Python_Libraries_for_ML/Unsupervised Learning/K-mean clustering/Mall_Customers.csv')  
dataset

# return
	CustomerID	Gender	Age	Annual Income (k$)	Spending Score (1-100)
0	1	Male	19	15	39
1	2	Male	21	15	81
2	3	Female	20	16	6
3	4	Female	23	16	77
4	5	Female	31	17	40
...	...	...	...	...	...
245	246	Male	30	297	69
246	247	Female	56	311	14
247	248	Male	29	313	90
248	249	Female	19	316	32
249	250	Female	31	325	86
250 rows × 5 columns

x = dataset.iloc[:, [3, 4]].values

#finding optimal number of clusters using the elbow method  
from sklearn.cluster import KMeans  
wcss_list= []  #Initializing the list for the values of WCSS  
  
#Using for loop for iterations from 1 to 10.  
for i in range(1, 11):  
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state= 42)  
    kmeans.fit(x)  
    wcss_list.append(kmeans.inertia_)  
plt.plot(range(1, 11), wcss_list)  
plt.title('The Elbow Method Graph')  
plt.xlabel('Number of clusters(k)')  
plt.ylabel('wcss_list')  
plt.show()  

#training the K-means model on a dataset  
kmeans = KMeans(n_clusters=5, init='k-means++', random_state= 42)  
y_predict= kmeans.fit_predict(x)  


#visulaizing the clusters  
plt.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s = 100, c = 'blue', label = 'Cluster 1') #for first cluster  
plt.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s = 100, c = 'green', label = 'Cluster 2') #for second cluster  
plt.scatter(x[y_predict== 2, 0], x[y_predict == 2, 1], s = 100, c = 'red', label = 'Cluster 3') #for third cluster  
plt.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4') #for fourth cluster  
plt.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5') #for fifth cluster  
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroid')   
plt.title('Clusters of customers')  
plt.xlabel('Annual Income (k$)')  
plt.ylabel('Spending Score (1-100)')  
plt.legend()  
plt.show()  

Below are the main clustering methods used in Machine learning:

Partitioning Clustering

It is a type of clustering that divides the data into non-hierarchical groups. It is also known as the centroid-based method. The most common example of partitioning clustering is the K-Means Clustering algorithm.


Density-Based Clustering

The density-based clustering method connects highly dense areas into clusters, and arbitrarily shaped clusters are formed as long as the dense regions can be connected. The algorithm identifies clusters by connecting areas of high density; the dense areas in the data space are separated from each other by sparser areas.
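A minimal sketch of density-based clustering with DBSCAN on synthetic data; eps and min_samples are illustrative and usually need tuning:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)   # two non-spherical clusters

db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

print(set(labels))   # cluster labels; -1 marks noise points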

Distribution Model-Based Clustering

In the distribution model-based clustering method, the data is divided based on the probability of how a dataset belongs to a particular distribution. The grouping is done by assuming some distributions commonly Gaussian Distribution.

The example of this type is the Expectation-Maximization Clustering algorithm that uses Gaussian Mixture Models (GMM).
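A short sketch of distribution-based clustering with a Gaussian Mixture Model; the number of components is a modelling choice:

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, random_state=0)
gmm.fit(X)

labels = gmm.predict(X)          # hard cluster assignments
probs = gmm.predict_proba(X)     # soft probability of belonging to each Gaussian
print(labels[:10])
print(probs[0])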

Hierarchical Clustering/ Agglomerative Hierarchical Clustering

In this technique, the dataset is divided into clusters to create a tree-like structure called a dendrogram. Hierarchical clustering, also known as hierarchical cluster analysis (HCA), is another unsupervised machine learning algorithm used to group unlabelled datasets into clusters.
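A sketch of agglomerative hierarchical clustering; scipy's linkage/dendrogram is used for the tree plot, and the choice of 3 clusters is illustrative:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

# dendrogram built from the linkage matrix
Z = linkage(X, method='ward')
dendrogram(Z)
plt.title('Dendrogram')
plt.show()

# flat clustering with a chosen number of clusters
hc = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = hc.fit_predict(X)
print(labels[:10])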

Fuzzy Clustering

Fuzzy clustering is a type of soft clustering in which a data object may belong to more than one group or cluster. Each data point has a set of membership coefficients that depend on its degree of membership in each cluster. The Fuzzy C-means algorithm is an example of this type of clustering; it is sometimes also known as the Fuzzy k-means algorithm.


Association

Apriori Algorithm

This algorithm uses frequent itemsets to generate association rules. It is designed to work on databases that contain transactions, and it uses a breadth-first search and a hash tree to compute the itemsets efficiently.

It is mainly used for market basket analysis and helps to understand the products that can be bought together. It can also be used in the healthcare field to find drug reactions for patients.
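scikit-learn does not ship an Apriori implementation; a common choice is the mlxtend library. A small sketch, assuming mlxtend is installed, on a made-up basket dataset (mlxtend also provides fpgrowth as a drop-in replacement for apriori here):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [['milk', 'bread', 'butter'],
                ['bread', 'butter'],
                ['milk', 'bread'],
                ['milk', 'butter']]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

frequent = apriori(onehot, min_support=0.5, use_colnames=True)               # frequent itemsets
rules = association_rules(frequent, metric='confidence', min_threshold=0.6)  # association rules
print(rules[['antecedents', 'consequents', 'support', 'confidence']])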

Eclat Algorithm

The Eclat algorithm stands for Equivalence Class Transformation. It uses a depth-first search to find frequent itemsets in a transaction database, and it generally executes faster than the Apriori algorithm.

F-P Growth Algorithm

The F-P Growth algorithm stands for Frequent Pattern growth, and it is an improved version of the Apriori algorithm. It represents the database in the form of a tree structure known as a frequent pattern tree (FP-tree), whose purpose is to extract the most frequent patterns.


Hyper Parameter Tuning

Approach 1: Use train_test_split and manually tune parameters by trial and error

from sklearn import svm, datasets
iris = datasets.load_iris()

import pandas as pd
df = pd.DataFrame(iris.data,columns=iris.feature_names)
df['flower'] = iris.target
df['flower'] = df['flower'].apply(lambda x: iris.target_names[x])
df[47:150]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3)

model = svm.SVC(kernel='rbf',C=30,gamma='auto')
model.fit(X_train,y_train)
model.score(X_test, y_test)

# return
1

Approach 2: Use K Fold Cross validation

Manually try supplying models with different parameters to the cross_val_score function with 5-fold cross validation:

from sklearn.model_selection import cross_val_score

cross_val_score(svm.SVC(kernel='linear',C=10,gamma='auto'),iris.data, iris.target, cv=5)

cross_val_score(svm.SVC(kernel='rbf',C=10,gamma='auto'),iris.data, iris.target, cv=5)

cross_val_score(svm.SVC(kernel='linear',C=20,gamma='auto'),iris.data, iris.target, cv=5)
.....

The above approach is tiresome and very manual. We can use a for loop as an alternative:

import numpy as np

kernels = ['rbf', 'linear']
C = [1,10,20]
avg_scores = {}
for kval in kernels:
    for cval in C:
        cv_scores = cross_val_score(svm.SVC(kernel=kval,C=cval,gamma='auto'),iris.data, iris.target, cv=5)
        avg_scores[kval + '_' + str(cval)] = np.average(cv_scores)

avg_scores

# return
{'rbf_1': 0.9800000000000001,
 'rbf_10': 0.9800000000000001,
 'rbf_20': 0.9666666666666668,
 'linear_1': 0.9800000000000001,
 'linear_10': 0.9733333333333334,
 'linear_20': 0.9666666666666666}

From the above results we can say that rbf with C=1 or 10, or linear with C=1, will give the best performance.

Approach 3: Use GridSearchCV

GridSearchCV does exactly the same thing as the for loop above, but with a single API call:

from sklearn.model_selection import GridSearchCV
clf = GridSearchCV(svm.SVC(gamma='auto'), {
    'C': [1,10,20],                # parameter 
    'kernel': ['rbf','linear']
}, cv=5, return_train_score=False) 

clf.fit(iris.data, iris.target)
clf.cv_results_

df = pd.DataFrame(clf.cv_results_)
df

df[['param_C','param_kernel','mean_test_score']]

clf.best_params_

clf.best_score_

Approach 4: Use RandomizedSearchCV to reduce the number of iterations by trying a random combination of parameters

This is useful when you have too many parameters to try and training time is long, since it reduces the cost of computation.

from sklearn.model_selection import RandomizedSearchCV
rs = RandomizedSearchCV(svm.SVC(gamma='auto'), {
        'C': [1,10,20],
        'kernel': ['rbf','linear']
    }, 
    cv=5, 
    return_train_score=False, 
    n_iter=2
)
rs.fit(iris.data, iris.target)
pd.DataFrame(rs.cv_results_)[['param_C','param_kernel','mean_test_score']]

# return
	param_C	param_kernel	mean_test_score
0	1	linear	0.98
1	1	rbf	0.98

Different Models with Different Parameters

from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

model_params = {
    'svm': {
        'model': svm.SVC(gamma='auto'),
        'params' : {
            'C': [1,10,20],
            'kernel': ['rbf','linear']
        }  
    },
    'random_forest': {
        'model': RandomForestClassifier(),
        'params' : {
            'n_estimators': [1,5,10]
        }
    },
    'logistic_regression' : {
        'model': LogisticRegression(solver='liblinear',multi_class='auto'),
        'params': {
            'C': [1,5,10]
        }
    }
}

scores = []

for model_name, mp in model_params.items():
    clf =  GridSearchCV(mp['model'], mp['params'], cv=5, return_train_score=False)
    clf.fit(iris.data, iris.target)
    scores.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })
    
df = pd.DataFrame(scores,columns=['model','best_score','best_params'])
df

# return
	model	best_score	best_params
0	svm	0.980000	{'C': 1, 'kernel': 'rbf'}
1	random_forest	0.960000	{'n_estimators': 5}
2	logistic_regression	0.966667	{'C': 5}

Another Example:

from sklearn import datasets
digits = datasets.load_digits()


from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

model_params = {
    'svm': {
        'model': svm.SVC(gamma='auto'),
        'params' : {
            'C': [1,10,20],
            'kernel': ['rbf','linear']
        }  
    },
    'random_forest': {
        'model': RandomForestClassifier(),
        'params' : {
            'n_estimators': [1,5,10]
        }
    },
    'logistic_regression' : {
        'model': LogisticRegression(solver='liblinear',multi_class='auto'),
        'params': {
            'C': [1,5,10]
        }
    },
    'naive_bayes_gaussian': {
        'model': GaussianNB(),
        'params': {}
    },
    'naive_bayes_multinomial': {
        'model': MultinomialNB(),
        'params': {}
    },
    'decision_tree': {
        'model': DecisionTreeClassifier(),
        'params': {
            'criterion': ['gini','entropy'],
            
        }
    }     
}

from sklearn.model_selection import GridSearchCV
import pandas as pd
scores = []

for model_name, mp in model_params.items():
    clf =  GridSearchCV(mp['model'], mp['params'], cv=5, return_train_score=False)
    clf.fit(digits.data, digits.target)
    scores.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })
    
df = pd.DataFrame(scores,columns=['model','best_score','best_params'])
df


# return
	model	best_score	best_params
0	svm	0.949360	{'C': 1, 'kernel': 'linear'}
1	random_forest	0.899833	{'n_estimators': 10}
2	logistic_regression	0.920979	{'C': 1}
3	naive_bayes_gaussian	0.806344	{}
4	naive_bayes_multinomial	0.871452	{}
5	decision_tree	0.817474	{'criterion': 'entropy'}

K-Fold Cross Validation

# option 1
# use all available data for training and test on the same dataset
# (e.g. prepare 100 math questions and test with a few of those same questions)

# option 2
# use 70/100 samples for training and 30/100 for testing

# option 3
# K-fold cross validation
# e.g. with 1000 samples, use 20% for testing and 80% for training, repeated over 5 folds,
# then take the average of the scores


from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
digits = load_digits()

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(digits.data,digits.target,test_size=0.3)

# Logistic Regression
lr = LogisticRegression(solver='liblinear',multi_class='ovr')
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

# SVM
svm = SVC(gamma='auto')
svm.fit(X_train, y_train)
svm.score(X_test, y_test)
from sklearn.model_selection import cross_val_score

# cross_val_score uses stratified k-fold by default

# Logistic regression model performance using cross_val_score
cross_val_score(LogisticRegression(solver='liblinear',multi_class='ovr'), digits.data, digits.target,cv=3)

# svm model performance using cross_val_score
cross_val_score(SVC(gamma='auto'), digits.data, digits.target,cv=3)

# random forest performance using cross_val_score
cross_val_score(RandomForestClassifier(n_estimators=40),digits.data, digits.target,cv=3)
# Here we used cross_val_score to fine tune our random forest classifier and 
# figured that having around 40 trees in random forest gives best result.

n_estimators=[5,20,30,40]
average_scores ={}
for ne in n_estimators:
    cv_scores = cross_val_score(RandomForestClassifier(n_estimators=ne),digits.data, digits.target, cv=10)
    average_scores['the average score of' +'_'+str(ne)+'_'+'is'] = np.average(cv_scores)
average_scores

Bagging/ Ensemble Learning

Ensemble learning is all about using multiple models and combining their prediction power to get better predictions with lower variance. Bagging and boosting are two popular techniques that allow us to tackle the high-variance issue. Below, bagging is demonstrated with sklearn's BaggingClassifier.

# X and y are assumed to be a feature matrix and target vector loaded earlier
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled[:3]

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, stratify=y, random_state=10)


from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=5)
scores

# return
0.7136575842458195
from sklearn.ensemble import BaggingClassifier

bag_model = BaggingClassifier(
    base_estimator=DecisionTreeClassifier(), 
    n_estimators=100, 
    max_samples=0.8, 
    oob_score=True,
    random_state=0
)
bag_model.fit(X_train, y_train)
bag_model.oob_score_

bag_model.score(X_test, y_test)


bag_model = BaggingClassifier(
    base_estimator=DecisionTreeClassifier(), 
    n_estimators=100, # use 100 datasets, trial and error
    max_samples=0.8, # 80% of my sample dataset
    oob_score=True, # out of bag, since the data is random, then you use sth data that does not appear to test
    random_state=0 # Controls the random resampling of the original dataset (sample wise and feature wise).
)
scores = cross_val_score(bag_model, X, y, cv=5)
scores

scores.mean()
# return
0.7578728461081402

# We can see some improvement in test score with bagging classifier as compared to a standalone classifier
from sklearn.ensemble import RandomForestClassifier

scores = cross_val_score(RandomForestClassifier(n_estimators=50), X, y, cv=5)
scores.mean()


# return
0.7669637551990494
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

scores = cross_val_score(SVC(), X, y, cv=5)
scores.mean()

from sklearn.ensemble import BaggingClassifier

bag_model = BaggingClassifier(base_estimator=SVC(), n_estimators=100, max_samples=0.8, random_state=0)
scores = cross_val_score(bag_model, X, y, cv=5)
scores.mean()

References

Javatpoint: Machine Learning

Scikit-learn
