Capable of handling multiple classes directly: Random Forest classifiers, naive Bayes classifiers.
Strictly binary classifiers: Support Vector Machine classifiers, Linear classifiers.
Appendix B. Machine Learning Project Checklist
This checklist can guide you through your Machine Learning projects. There are eight main steps:
1. Frame the problem and look at the big picture.
2. Get the data.
3. Explore the data to gain insights.
4. Prepare the data to better expose the underlying data patterns to Machine Learning algorithms.
5. Explore many different models and short-list the best ones.
6. Fine-tune your models and combine them into a great solution.
7. Present your solution.
8. Launch, monitor, and maintain your system.
Obviously, you should feel free to adapt this checklist to your needs.
#########################################beginning###############################
We will be using the MNIST dataset, which is a set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau. Each image is labeled with the digit it represents.
Scikit-Learn provides many helper functions to download popular datasets. MNIST is one of them. The following code fetches the MNIST dataset:
from sklearn.datasets import fetch_mldata

ImportError: cannot import name 'fetch_mldata'

This import fails on recent Scikit-Learn releases: fetch_mldata relied on the now-defunct mldata.org service and has been removed. Use fetch_openml() instead (or download the original mnist-original.mat file directly, as shown a bit further below). First, a couple of version checks:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)
# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784')
mnist.keys()
from collections import Counter
Counter(mnist['target'])
mnist
X, y = mnist["data"], mnist["target"]
X.shape
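Note: depending on your Scikit-Learn version, fetch_openml() may return the data as a pandas DataFrame and the labels as strings. If that happens, converting everything to plain NumPy arrays keeps the indexing and reshape calls below working unchanged (a defensive step, not something required on 0.20):

import numpy as np

# convert to plain NumPy arrays of numbers, in case pandas objects / string labels came back
X = np.asarray(mnist["data"], dtype=np.float64)
y = np.asarray(mnist["target"]).astype(np.uint8)
X.shape, y.shape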
Alternatively, you can download the original mnist-original.mat file directly and build the same dictionary structure yourself:

from six.moves import urllib
from scipy.io import loadmat

mnist_url = "https://github.com/amplab/datascience-sp14/raw/master/lab7/mldata/mnist-original.mat"
mnist_datafile = "./mnist-original.mat"   # filename to save in the current directory

response = urllib.request.urlopen(mnist_url)
with open(mnist_datafile, 'wb') as f:
    content = response.read()
    f.write(content)

mnist_raw = loadmat(mnist_datafile)
mnist_raw
mnist_raw["data"].shape       # one column per image
mnist_raw["label"].shape      # the labels are stored in a single row
mnist_raw["data"].T.shape     # transpose so that each row is one image
mnist_raw["label"][0].shape   # extract that row: one label per image
mnist = {
    "COL_NAMES": ["label", "data"],   # header
    "DESCR": "mldata.org dataset: mnist-original",
    "data": mnist_raw["data"].T,
    "target": mnist_raw["label"][0],
}
mnist.keys()
Datasets loaded by Scikit-Learn generally have a similar dictionary structure including:
• A DESCR key describing the dataset
• A data key containing an array with one row per instance and one column per feature
• A target key containing an array with the labels
X, y = mnist["data"], mnist["target"]
X.shape
y.shape
mnist['feature_names'][-1]  # note: 'feature_names' only exists in the fetch_openml() result, not in the dict built from the .mat file
There are 70,000 images, and each image has 784 features. This is because each image is 28×28 pixels, and each feature simply represents one pixel's intensity, from 0 (white) to 255 (black). Let's take a peek at one digit from the dataset. All you need to do is grab an instance's feature vector, reshape it to a 28×28 array, and display it using Matplotlib's imshow() function:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
some_digit = X[0] #784=28*28
some_digit_image = some_digit.reshape(28,28)
plt.imshow(some_digit_image, cmap = matplotlib.cm.binary, interpolation="nearest")
plt.axis("off")
plt.show()
This looks like a 5, and indeed that’s what the label tells us:
y[0]
#you also can use the following code
import numpy as np
import matplotlib as mpl

y = y.astype(np.uint8)   # cast the labels to integers (fetch_openml returns them as strings)

def plot_digit(data):
    image = data.reshape(28, 28)
    plt.imshow(image, cmap=mpl.cm.binary, interpolation="nearest")
    plt.axis("off")
    plt.show()

plot_digit(X[0])
The following figure shows a few more images from the MNIST dataset to give you a feel for the complexity of the classification task.
#Extra
import matplotlib as mpl

def plot_digits(instances, images_per_row=10, **options):
    size = 28
    images_per_row = min(len(instances), images_per_row)
    # ceiling division, e.g. (49-1)//10 + 1 = 5 rows for 49 images (49//10 = 4 would be one row short)
    n_rows = (len(instances) - 1) // images_per_row + 1
    images = [instance.reshape(size, size) for instance in instances]
    # if the last row is not full, pad it with a blank (all-zero) strip of the missing width
    n_empty = n_rows * images_per_row - len(instances)
    if n_empty > 0:
        images.append(np.zeros((size, size * n_empty)))
    row_images = []
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]
        row_images.append(np.concatenate(rimages, axis=1))   # stitch one row horizontally
    image = np.concatenate(row_images, axis=0)               # stack the rows vertically
    plt.imshow(image, cmap=mpl.cm.binary, **options)
    plt.axis("off")

plt.figure(figsize=(9,9))
example_images = X[:100]
plot_digits(example_images, images_per_row=10)
plt.show()
But wait! You should always create a test set and set it aside before inspecting the data closely. The MNIST dataset is actually already split into a training set (the first 60,000 images) and a test set (the last 10,000 images):
Note: we don't need to do stratified sampling here
(see 02_End-to-End Machine Learning Project: https://blog.csdn.net/Linli522362242/article/details/103387527)
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
Let’s also shuffle the training set; this will guarantee that all cross-validation folds will be similar (you don’t want one fold to be missing some digits). Moreover, some learning algorithms are sensitive to the order of the training instances, and they perform poorly if they get many similar instances in a row.
Shuffling the dataset ensures that this won’t happen:
shuffle_index = np.random.permutation(len(X_train))   # len(X_train) == 60000
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]

# re-create the original (unshuffled) split so that the following steps stay easy to track and reproduce
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
Let’s simplify the problem for now and only try to identify one digit — for example, the number 5. This “5-detector” will be an example of a binary classifier, capable of distinguishing between just two classes, 5 and not-5. Let’s create the target vectors for this classification task:
y_train_5 = (y_train==5)
y_test_5 = (y_test==5)
y_train_5
y_test_5
Okay, now let’s pick a classifier and train it. A good place to start is with a Stochastic Gradient Descent
(SGD) classifier, using Scikit-Learn’s SGDClassifier class. This classifier has the advantage of being capable of handling very large datasets efficiently. This is in part because SGD deals with training instances independently, one at a time (which also makes SGD well suited for online learning), as we will see later. Let’s create an SGDClassifier and train it on the whole training set:
###############
Tip: The SGDClassifier relies on randomness during training (hence the name “stochastic”). If you want reproducible results, you should set the random_state parameter.
###############
1.##############Training Classifier ##############
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=42, max_iter=1000, tol=1e-3)   # tol=1e-3 (i.e. 0.001) is the stopping tolerance
sgd_clf.fit(X_train, y_train_5)
sgd_clf.predict([some_digit]) #detect images of the number 5
The classifier guesses that this image represents a 5 (True). Looks like it guessed right in this particular case! Now, let’s evaluate this model’s performance.
Evaluating a classifier is often significantly trickier than evaluating a regressor, so we will spend a large part of this chapter on this topic. There are many performance measures available, so grab another coffee and get ready to learn many new concepts and acronyms!
Let's use the cross_val_score() function to evaluate your SGDClassifier model using K-fold cross-validation, with three folds. Remember that K-fold cross-validation means splitting the training set into K folds (in this case, three), then making predictions and evaluating them on each fold using a model trained on the remaining folds.
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring='accuracy')
################################################################
T: the prediction is correct (True); F: the prediction is incorrect (False).
P: the instance is predicted to be in the target (positive) class; N: the instance is predicted not to be in the target class.
Accuracy = (TP+TN)/(TP+FP+FN+TN)
Precision = TP/(TP+FP)   # true positives / all predicted positives
Recall = TP/(TP+FN)      # true positives / all actual positives
https://towardsdatascience.com/accuracy-recall-precision-f-score-specificity-which-to-optimize-on-867d3f11124
A school is running a machine learning primary diabetes scan on all of its students.
The output is either diabetic (+ve) (Target)or healthy (-ve)(Not Target).
There are only 4 cases any student X could end up with.
We'll be using the following as a reference later, so don't hesitate to re-read it if you get confused.
Accuracy is the most intuitive measure: it's the ratio of correctly labeled subjects to the whole pool of subjects.
Accuracy answers the following question: how many students did we correctly label out of all the students?
Accuracy = (TP+TN)/(TP+FP+FN+TN)
numerator: all correctly labeled subjects (all trues)
denominator: all subjects
Precision is the ratio of subjects correctly labeled +ve by our program to all subjects labeled +ve.
Precision answers the following question: how many of those we labeled as diabetic (target) are actually diabetic?
Precision = TP/(TP+FP)
numerator: correctly labeled +ve (actually diabetic) people.
denominator: all subjects labeled +ve by our program (whether they're diabetic or not in reality).
Recall is the ratio of subjects correctly labeled +ve by our program to all who are diabetic in reality.
Recall answers the following question: of all the people who are diabetic (target), how many did we correctly predict?
Recall = TP/(TP+FN)
numerator: correctly labeled +ve (actually diabetic) people.
denominator: all people who are diabetic in reality (whether detected by our program or not).
F1 Score considers both precision and recall.
It is the harmonic mean of precision and recall.
F1 Score is best if there is some sort of balance between precision (p) and recall (r) in the system; conversely, the F1 Score isn't so high if one measure is improved at the expense of the other.
For example, if P is 1 and R is 0, the F1 score is 0.
F1 Score = 2*(Recall * Precision) / (Recall + Precision)
Specificity is the ratio of subjects correctly labeled -ve by the program to all who are healthy in reality.
Specificity answers the following question: of all the people who are healthy (not the target), how many did we correctly predict?
Specificity = TN/(TN+FP)
numerator: correctly labeled -ve (actually healthy) people.
denominator: all people who are healthy in reality (whether labeled +ve or -ve).
Accuracy is a great measure, but only when you have symmetric datasets (the counts of false negatives and false positives are close) and when false negatives and false positives have similar costs.
If the costs of false positives and false negatives are different, then F1 is your savior. F1 is best if you have an uneven class distribution.
Precision is how sure you are of your true positives, whilst recall is how sure you are that you are not missing any positives.
— An accuracy value of 90% means that 1 of every 10 labels is incorrect, and 9 are correct.
— A precision value of 80% means that, on average, 2 of every 10 students our program labels as diabetic (target) are actually healthy, and 8 are diabetic (target).
— A recall value of 70% means that 3 of every 10 people who are diabetic (target) in reality are missed by our program, and 7 are labeled as diabetic.
— A specificity value of 60% means that 4 of every 10 healthy people (not the target) in reality are mislabeled as diabetic, and 6 are correctly labeled as healthy.
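To make these definitions concrete, here is a quick hand computation using the 5-detector's confusion-matrix counts that appear later in this post (TN=53892, FP=687, FN=1891, TP=3530); it is plain arithmetic, not a new Scikit-Learn API:

# worked example with the 5-detector's confusion-matrix counts from further below
TN, FP, FN, TP = 53892, 687, 1891, 3530

accuracy    = (TP + TN) / (TP + TN + FP + FN)                 # ~0.957
precision   = TP / (TP + FP)                                  # ~0.837
recall      = TP / (TP + FN)                                  # ~0.651
specificity = TN / (TN + FP)                                  # ~0.987
f1          = 2 * precision * recall / (precision + recall)   # ~0.733

print(accuracy, precision, recall, specificity, f1)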
################################################################
IMPLEMENTING CROSS-VALIDATION
Occasionally you will need more control over the cross-validation process than what Scikit-Learn provides off-the-shelf. In these cases, you can implement cross-validation yourself; it is actually fairly straightforward. The following code does roughly the same thing as Scikit-Learn’s cross_val_score() function, and prints the same result:
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

# note: random_state only has an effect when shuffle=True; newer Scikit-Learn
# versions raise an error if you pass random_state without shuffle=True
skfolds = StratifiedKFold(n_splits=3)

for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)   # sgd_clf = SGDClassifier(random_state=42, max_iter=1000, tol=1e-3)
    X_train_folds = X_train[train_index]
    y_train_folds = y_train_5[train_index]
    X_test_fold = X_train[test_index]
    y_test_fold = y_train_5[test_index]

    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct / len(y_test_fold))
https://blog.csdn.net/Linli522362242/article/details/103387527
The StratifiedKFold class performs stratified sampling to produce folds that contain a representative ratio of each class. At each iteration the code creates a clone of the classifier, trains that clone on the training folds, and makes predictions on the test fold. Then it counts the number of correct predictions and outputs the ratio of correct predictions.
################################################################
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring='accuracy')
Wow! Above 95% accuracy (ratio of correct predictions) on all cross-validation folds? This looks amazing, doesn't it? Well, before you get too excited, let's look at a very dumb classifier that just classifies every single image in the "not-5" class:
from sklearn.base import BaseEstimator

class Never5Classifier(BaseEstimator):
    def fit(self, X, y=None):
        pass
    def predict(self, X):
        return np.zeros((len(X), 1), dtype=bool)

never_5_clf = Never5Classifier()
cross_val_score(never_5_clf, X_train, y_train_5, cv=3, scoring="accuracy")
That's right, it has over 90% accuracy! This is simply because only about 10% of the images are 5s, so if you always guess that an image is not a 5, you will be right about 90% of the time. Beats Nostradamus.
This demonstrates why accuracy is generally not the preferred performance measure for classifiers, especially when you are dealing with skewed datasets (i.e., when some classes are much more frequent than others).
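You can check the class imbalance behind this number directly (a quick one-liner using the y_train_5 labels defined above):

# fraction of 5s in the training set; always predicting "not 5" scores about 1 minus this
y_train_5.mean(), 1 - y_train_5.mean()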
A much better way to evaluate the performance of a classifier is to look at the confusion matrix. The general idea is to count the number of times instances of class A are classified as class B. For example, to know the number of times the classifier confused images of 5s with 3s, you would look in the 5th row and 3rd column of the confusion matrix.
To compute the confusion matrix, you first need to have a set of predictions, so they can be compared to the actual targets. You could make predictions on the test set, but let’s keep it untouched for now (remember that you want to use the test set only at the very end of your project, once you have a classifier that you are ready to launch). Instead, you can use the cross_val_predict() function:
from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
Just like the cross_val_score() function, cross_val_predict() performs K-fold cross-validation, but instead of returning the evaluation scores, it returns the predictions made on each test fold. This means that you get a clean prediction for each
instance in the training set (“clean” meaning that the prediction is made by a model that never saw the data during training).
Now you are ready to get the confusion matrix using the confusion_matrix() function. Just pass it the target classes (y_train_5) and the predicted classes (y_train_pred):
from sklearn.metrics import confusion_matrix
confusion_matrix(y_train_5, y_train_pred)
                           Predicted: non-5 (Negative)   Predicted: 5 (Positive, the target)
Actual: non-5 (Negative)              TN                               FP
Actual: 5 (Positive)                  FN                               TP
Each row in a confusion matrix represents an actual class, while each column represents a predicted class. The first row of this matrix considers non-5 images (the negative class): 53,892 of them were correctly classified as non-5s (they are called true negatives), while the remaining 687 were wrongly classified as 5s (false positives). The second row considers the images of 5s (the positive class): 1,891 were wrongly classified as non-5s (false negatives), while the remaining 3,530 were correctly classified as 5s (true positives). A perfect classifier would have only true positives and true negatives, so its confusion matrix would have nonzero values only on its main diagonal (top left to bottom right):
y_train_perfect_predictions = y_train_5 #pretend we reached perfection
confusion_matrix(y_train_5, y_train_perfect_predictions)
The confusion matrix gives you a lot of information, but sometimes you may prefer a more concise metric. An interesting one to look at is the accuracy of the positive predictions; this is called the precision of the classifier:
Precision = TP/(TP+FP)
TP is the number of true positives, and FP is the number of false positives.
A trivial way to have perfect precision is to make one single positive prediction and ensure it is correct (precision = 1/1 = 100%). This would not be very useful, since the classifier would ignore all but one positive instance. So precision is typically used along with another metric named recall, also called sensitivity or true positive rate (TPR): this is the ratio of positive instances that are correctly detected by the classifier:
Recall = TP/(TP+FN)
FN is of course the number of false negatives.
If you are confused about the confusion matrix, Figure 3-2 may help.
Precision and Recall
Scikit-Learn provides several functions to compute classifier metrics, including precision and recall:
from sklearn.metrics import precision_score, recall_score
precision_score(y_train_5, y_train_pred) #==3530/(3530+687)
3530/(3530+687)#TP/(TP+FP)
recall_score(y_train_5, y_train_pred) #==3530/(3530+1891)
3530/(3530+1891)#TP/(TP+FN)
Now your 5-detector does not look as shiny as it did when you looked at its accuracy. When it claims an image represents a 5, it is correct only 84% (0.8371) of the time. Moreover, it only detects 65% of the 5s.
It is often convenient to combine precision and recall into a single metric called the F1 score, in particular if you need a simple way to compare two classifiers. The F1 score is the harmonic mean of precision and recall: F1 = 2 * (precision * recall) / (precision + recall). Whereas the regular mean treats all values equally, the harmonic mean gives much more weight to low values. As a result, the classifier will only get a high F1 score if both recall and precision are high.
To compute the F1 score, simply call the f1_score() function:
from sklearn.metrics import f1_score
f1_score(y_train_5, y_train_pred)
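As a quick sanity check (pure arithmetic on the scores computed above), the same value should come out of the harmonic-mean formula:

# f1_score() should match the harmonic mean of the precision and recall computed earlier
p = precision_score(y_train_5, y_train_pred)
r = recall_score(y_train_5, y_train_pred)
2 * p * r / (p + r)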
4. ##############select the precision/recall tradeoff that fits your needs##
The F1 score favors classifiers that have similar precision and recall. This is not always what you want: in some contexts you mostly care about precision, and in other contexts you really care about recall. For example, if you trained a classifier to detect videos that are safe for kids, you would probably prefer a classifier that rejects many good videos (low recall) but keeps only safe ones (high precision), rather than a classifier that has a much higher recall but lets a few really bad videos show up in your product (in such cases, you may even want to add a human pipeline to check the classifier’s
video selection).
Here True/False indicates whether the classification is correct, and Positive means the instance is assigned to the target class (videos that are safe for kids). In this scenario you expect high precision: FP means wrongly classifying some other kind of video as safe for kids, while FN means wrongly classifying a safe-for-kids video as something else, so you care more about FP.
On the other hand, suppose you train a classifier to detect shoplifters in surveillance images: it is probably fine if your classifier has only 30% precision as long as it has 99% recall (sure, the security guards will get a few false alerts, but almost all shoplifters will get caught).
Here Positive means the instance is assigned to the target class (shoplifter) rather than the non-target class (honest customer). In this scenario you expect high recall, since you want almost all shoplifters to be caught and very few of them to be wrongly (F) classified as honest customers (FN). FP means wrongly classifying honest customers as shoplifters; between FP and FN, you care more about FN.
Unfortunately, you can’t have it both ways: increasing precision reduces recall, and vice versa. This is called the precision/recall tradeoff.
To understand this tradeoff, let’s look at how the SGDClassifier makes its classification decisions. For each instance, it computes a score based on a decision function, and if that score is greater than a threshold, it assigns the instance to the positive class(Target), or else it assigns it to the negative class(Non-target). Figure 3-3 shows a few digits positioned
from the lowest score on the left to the highest score on the right. Suppose the decision threshold is positioned at the central arrow (between the two 5s): you will find 4 true positives (actual 5s) to the right of that threshold, and 1 false positive (actually a 6). With that threshold, the precision is 80% (4 out of 5 = TP/(TP+FP) = 4/(4+1)). But out of the 6 actual 5s, the classifier only detects 4, so the recall is 67% (4 out of 6 = TP/(TP+FN) = 4/(4+2)).
Now if you raise the threshold (move it to the arrow on the right), the false positive (the 6) becomes a true negative, thereby increasing precision (up to 100%=3/(3+0) in this case), but one true positive becomes a false negative, decreasing recall down to 50%=3/(3+3).
Conversely, lowering the threshold increases recall and reduces precision.
Scikit-Learn does not let you set the threshold directly, but it does give you access to the decision scores that it uses to make predictions. Instead of calling the classifier’s predict() method, you can call its decision_function() method, which returns a score for each instance, and then make predictions based on those scores using any threshold you want:
y_decision_scores = sgd_clf.decision_function([some_digit])
y_decision_scores
threshold=0
y_some_digit_pred = (y_decision_scores > threshold)
y_some_digit_pred
The SGDClassifier uses a threshold equal to 0, so the previous code returns the same result(True) as the predict() method (i.e., True). Let’s raise the threshold to 200000:
threshold=200000
y_some_digit_pred = (y_decision_scores>threshold)
y_some_digit_pred
This confirms that raising the threshold decreases recall. The image actually represents a 5, and the classifier detects it when the threshold is 0, but it misses it when the threshold is increased to 200,000.
So how can you decide which threshold to use? For this you will first need to get the scores of all instances in the training set using the cross_val_predict() function again, but this time specifying that you want it to return decision scores instead of predictions by setting method="decision_function":
4. ##############select the precision/recall tradeoff that fits your needs##############
#previously, #y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
y_decision_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function")
Now with these scores you can compute precision and recall for all possible thresholds using the precision_recall_curve() function:
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_decision_scores)
Finally, you can plot precision and recall as functions of the threshold value using Matplotlib:
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], 'b--', label="Precision")
    plt.plot(thresholds, recalls[:-1], 'g-', label="Recall")
    plt.xlabel("Threshold", fontsize=16)
    plt.legend(loc="upper left", fontsize=16)
    plt.grid(True)
    plt.axis([-50000, 50000, 0, 1])
    plt.title('Figure 3-4. Precision and recall versus the decision threshold')

plt.figure(figsize=(8,4))
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
# mark the threshold that gives about 93% precision and the corresponding ~30.7% recall
plt.plot([7813, 7813], [0., 0.93], "r:")
plt.plot([-50000, 7813], [0.93, 0.93], "r:")
plt.plot([7813], [0.93], "ro")
plt.plot([-50000, 7813], [0.3068, 0.3068], "r:")
plt.plot([7813], [0.3068], "ro")
plt.show()
#######################################################
NOTE
You may wonder why the precision curve is bumpier than the recall curve in Figure 3-4. The reason is that precision may sometimes go down when you raise the threshold (although in general it will go up). To understand why, look back at Figure 3-3 and notice what happens when you start from the central threshold and move it just one digit to the right: precision goes from 4/5 (80%) down to 3/4 (75%). On the other hand, recall can only go down when the threshold is increased, which explains why its curve looks smooth.
# sanity check: the cross-validated predictions equal the decision scores thresholded at 0
(y_train_pred == (y_decision_scores > 0)).all()
#######################################################
Now you can simply select the threshold value that gives you the best precision/recall tradeoff for your task. Another way to select a good precision/recall tradeoff is to plot precision directly against recall, as shown in the next figure:
def plot_precision_vs_recall(precisions, recalls):
    plt.plot(recalls, precisions, "b-", lw=2)
    plt.xlabel("Recall", fontsize=16)
    plt.ylabel("Precision", fontsize=16)
    plt.axis([0, 1, 0, 1])
    plt.grid(True)

plt.figure(figsize=(8,6))
plot_precision_vs_recall(precisions, recalls)
# mark the ~93% precision / ~30.7% recall point
plt.plot([0.3068, 0.3068], [0., 0.93], "r:")
plt.plot([0.0, 0.3068], [0.93, 0.93], "r:")
plt.plot([0.3068], [0.93], "ro")
plt.title("Precision versus recall")
plt.show()
4. ##############select the precision/recall tradeoff that fits your needs##
You can see that precision really starts to fall sharply around 80% recall. You will probably want to select a precision/recall tradeoff just before that drop — for example, at around 60% recall. But of course the choice depends on your project.
So let's suppose you decide to aim for 93% precision. You look up the first plot (zooming in a bit) and find that you need to use a threshold of about 7,800 (the red dotted lines in the plots above mark 7,813). To make predictions (on the training set for now), instead of calling the classifier's predict() method, you can just run this code:
The precisions array is aligned index-by-index with the thresholds array (precisions just has one extra element at the end), so np.argmax(precisions >= 0.93) gives the index of the first precision that reaches 0.93. Indexing thresholds with it gives the corresponding threshold, and from that threshold we can compute the resulting precision and recall scores.
threshold_93_precision=thresholds[np.argmax(precisions>=0.93)]
print(threshold_93_precision)
y_train_pred_93 = (y_decision_scores>=threshold_93_precision)
y_train_pred_93
precision_score(y_train_5, y_train_pred_93)
recall_score(y_train_5, y_train_pred_93)
Great, you have a 93% precision classifier (or close enough)! As you can see, it is fairly easy to create a classifier with virtually any precision you want: just set a high enough threshold, and you’re done. Hmm, not so fast. A high-precision classifier is not very useful if its recall is too low!
####################################################
TIP
If someone says “let’s reach 99% precision,” you should ask, “at what recall?”
####################################################
The receiver operating characteristic (ROC) curve is another common tool used with binary classifiers. It is very similar to the precision/recall curve, but instead of plotting precision versus recall,
the ROC curve plots the true positive rate (another name for recall) against the false positive rate (TPR vs. FPR).
The FPR is the ratio of negative instances that are incorrectly classified as positive: FPR = FP/(FP+TN). It is equal to one minus the true negative rate (FPR = 1 - TNR),
where the true negative rate, TNR = TN/(TN+FP), is the ratio of negative instances that are correctly classified as negative. The TNR is also called specificity. Hence the ROC curve plots sensitivity (recall) versus 1 - specificity.
To plot the ROC curve, you first need to compute the TPR and FPR for various threshold values, using the roc_curve() function:
5. ##############compare various models using ROC curves ##############
from sklearn.metrics import roc_curve

#y_decision_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function")
fpr, tpr, thresholds = roc_curve(y_train_5, y_decision_scores)

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, lw=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')   # dashed diagonal = purely random classifier
    plt.axis([0, 1, 0, 1])
    plt.xlabel("False Positive Rate (Fall-Out)", fontsize=16)
    plt.ylabel("True Positive Rate (Recall)", fontsize=16)
    plt.grid(True)

plt.figure(figsize=(8,6))
plot_roc_curve(fpr, tpr)
# mark the point corresponding to the ~30.7% recall chosen earlier
plt.plot([4.837e-3, 4.837e-3], [0.0, 0.3068], 'y:')
plt.plot([4.837e-3], [0.3068], 'yo')
plt.show()
Once again there is a tradeoff: the higher the recall (TPR), the more false positives (FPR) the classifier produces. The dotted line represents the ROC curve of a purely random classifier; a good classifier stays as far away from that line as possible (toward the top-left corner).
One way to compare classifiers is to measure the area under the curve (AUC). A perfect classifier will have a ROC AUC equal to 1, whereas a purely random classifier will have a ROC AUC equal to 0.5. Scikit-Learn provides a function to compute the ROC AUC:
from sklearn.metrics import roc_auc_score
roc_auc_score(y_train_5, y_decision_scores)
#######################################
TIP
Since the ROC curve is so similar to the precision/recall (or PR) curve, you may wonder how to decide which one to use. As a
rule of thumb, you should prefer the PR curve whenever the positive class(target) is rare or when you care more about the false positives than the false negatives, and the ROC curve otherwise. For example, looking at the previous ROC curve (and the ROC AUC score), you may think that the classifier is really good. But this is mostly because there are few positives (5s) compared to the negatives (non-5s). In contrast, the PR curve makes it clear that the classifier has room for improvement (the curve could be closer to the top-right corner).
#######################################
Let’s train a RandomForestClassifier and compare its ROC curve and ROC AUC score to the SGDClassifier. First, you need to get scores for each instance in the training set. But due to the way it works, the RandomForestClassifier class does not have a decision_function() method. Instead it has a predict_proba() method. Scikit-Learn classifiers generally have one or the
other. The predict_proba() method returns an array containing a row per instance and a column per class, each containing the probability that the given instance belongs to the given class (e.g., 70% chance that the image represents a 5):
from sklearn.ensemble import RandomForestClassifier

#we set n_estimators=100 to be future-proof since this will be the default value in Scikit-Learn 0.22
forest_clf = RandomForestClassifier(random_state=42, n_estimators=100)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method="predict_proba")
y_probas_forest
But to plot a ROC curve, you need scores, not probabilities. A simple solution is to use the positive class's probability as the score. Now you are ready to plot the ROC curve; it is useful to plot the first ROC curve as well, to see how the two compare:
5. ##############compare various models using ROC curves ##############
y_scores_forest = y_probas_forest[:,1] #score = proba of positive class
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5, y_scores_forest)
plt.figure(figsize=(8,6))
plt.plot(fpr,tpr, "b:", lw=2, label="SGD") #Stochastic Gradient Descent(SGD) classifier
plot_roc_curve(fpr_forest, tpr_forest, "Random_Forest")
plt.grid(True)
plt.legend(loc="lower right", fontsize=16)
plt.title('Comparing ROC curves')
plt.show()
As you can see in Figure, the RandomForestClassifier’s ROC curve looks much better than the SGDClassifier’s: it comes much closer to the top-left corner. As a result, its ROC AUC score is also significantly better:
6. ##############ROC AUC scores ##############
roc_auc_score(y_train_5, y_scores_forest) #y_probas_forest[:,1] #score = proba of positive class
Try measuring the precision and recall scores: you should find 98.5% precision and 82.8% recall. Not
too bad!
y_train_pred_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3)
precision_score(y_train_5, y_train_pred_forest)
recall_score(y_train_5, y_train_pred_forest)
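By the way, since the cross-validated probabilities are already in y_probas_forest, you can get the same predictions without re-running cross_val_predict(): for a binary classifier, predict() just picks the column with the highest probability. This is only a shortcut sketch; it assumes the same folds and the same random_state as above, so the scores should match:

# shortcut: the predicted class is simply the column with the highest probability
y_train_pred_forest_alt = y_probas_forest.argmax(axis=1).astype(bool)
precision_score(y_train_5, y_train_pred_forest_alt)
recall_score(y_train_5, y_train_pred_forest_alt)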
Whereas binary classifiers distinguish between two classes, multiclass classifiers (also called multinomial classifiers) can distinguish between more than two classes.
Some algorithms (such as Random Forest classifiers or naive Bayes classifiers) are capable of handling multiple classes directly. Others (such as Support Vector Machine classifiers or Linear classifiers) are strictly binary classifiers. However, there are various strategies that you can use to perform multiclass classification using multiple binary classifiers.
For example, one way to create a system that can classify the digit images into 10 classes (from 0 to 9) is to train 10 binary classifiers, one for each digit (a 0-detector, a 1-detector, a 2-detector, and so on). Then when you want to classify an image, you get the decision score from each classifier for that image and you select the class whose classifier outputs the highest score. This is called the one-versus-all (OvA) strategy (also called one-versus-the-rest).
Another strategy is to train a binary classifier for every pair of digits: one to distinguish 0s and 1s, another to distinguish 0s and 2s, another for 1s and 2s, and so on. This is called the one-versus-one (OvO) strategy. If there are N classes, you need to train N × (N – 1) / 2 classifiers. For the MNIST problem, this means training 45 binary classifiers! When you want to classify an image, you have to run the image through all 45 classifiers and see which class wins the most duels. The main advantage of OvO is that each classifier only needs to be trained on the part of the training set for the two classes that it must distinguish.
Some algorithms (such as Support Vector Machine classifiers) scale poorly with the size of the training set, so for these algorithms OvO is preferred, since it is faster to train many classifiers on small training sets than to train a few classifiers on large training sets. For most binary classification algorithms, however, OvA is preferred.
Scikit-Learn detects when you try to use a binary classification algorithm for a multiclass classification task, and it automatically runs OvA (except for SVM classifiers for which it uses OvO).
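To see the OvA idea concretely before relying on the built-in behavior, here is a minimal hand-rolled sketch: train one digit-vs-rest detector per class (only on the first 1,000 instances to keep it quick, so its prediction may differ from a model trained on the full set) and pick the class whose detector returns the highest decision score. The names ova_detectors and digit are just illustrative:

import numpy as np
from sklearn.linear_model import SGDClassifier

ova_detectors = []
for digit in range(10):
    clf = SGDClassifier(random_state=42, max_iter=1000, tol=1e-3)
    clf.fit(X_train[:1000], (y_train[:1000] == digit))   # digit-vs-the-rest targets
    ova_detectors.append(clf)

# score the image with every detector and keep the class with the highest score
scores = [clf.decision_function([some_digit])[0] for clf in ova_detectors]
np.argmax(scores)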
Let’s try this with the SGDClassifier:
#sgd_clf = SGDClassifier(random_state=42, max_iter=1000, tol=1e-3)
sgd_clf.fit(X_train, y_train)
sgd_clf.predict([some_digit])
plot_digit(some_digit)   # plot_digit() was defined earlier; some_digit = X[0]
y_train[0]
sgd_clf.fit(X_train[:1000], y_train[:1000]) #y_train, not y_train_5
sgd_clf.predict([some_digit]) #some_digit=X[0]
y_train[0]
############################
Note: the prediction from the SGDClassifier trained on the full training set can differ from the prediction of the one trained on only the first 1,000 instances, because the two models produce different decision scores.
############################
That was easy! This code trains the SGDClassifier on the training set using the original target classes from 0 to 9 (y_train), instead of the 5-versus-all target classes (y_train_5). Then it makes a prediction (a correct one in this case). Under the hood,
Scikit-Learn actually trained 10 binary classifiers, got their decision scores for the image, and selected the class with the highest score.
To see that this is indeed the case, you can call the decision_function() method. Instead of returning just one score per instance, it now returns 10 scores, one per class:
some_digit_decision_scores = sgd_clf.decision_function([some_digit])
some_digit_decision_scores
The highest score is indeed the one corresponding to class 5:
np.argmax(some_digit_decision_scores) #return the index of the maximum value
sgd_clf.classes_
sgd_clf.classes_[5] #np.argmax(some_digit_decision_scores) == 5
#output: the result is class 5
##################################
WARNING
When a classifier is trained, it stores the list of target classes in its classes_ attribute, ordered by value. In this case, the index of each class in the classes_ array conveniently matches the class itself (e.g., the class at index 5 happens to be class 5), but in general you won’t be so lucky.
##################################
If you want to force ScikitLearn to use one-versus-one or one-versus-all, you can use the OneVsOneClassifier or OneVsRestClassifier classes. Simply create an instance of OneVsOneClassifier or OneVsRestClassifier and pass a binary classifier to its constructor.
For example, this code creates a multiclass classifier using the OvO strategy, based on a SGDClassifier:
from sklearn.multiclass import OneVsOneClassifier
ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
ovo_clf.fit(X_train, y_train)
ovo_clf.predict([some_digit]) #some_digit=X[0]
len(ovo_clf.estimators_) # N*(N–1)/2 = 10*(10-1)/2
The one-versus-all (OvA) strategy (also called one-versus-the-rest), based on SVC:
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
ovr_clf = OneVsRestClassifier( SVC(gamma="auto", random_state=42) )
ovr_clf.fit(X_train[:1000], y_train[:1000])
ovr_clf.predict([some_digit])
len(ovr_clf.estimators_)
Training a RandomForestClassifier is just as easy:
#we set n_estimators=100 to be future-proof since this will be the default value in Scikit-Learn 0.22
#forest_clf = RandomForestClassifier(random_state=42, n_estimators=100)
forest_clf.fit(X_train, y_train)
forest_clf.predict([some_digit])
len(forest_clf.estimators_)
This time Scikit-Learn did not have to run OvA or OvO because Random Forest classifiers can directly classify instances into multiple classes. You can call predict_proba() to get the list of probabilities that the classifier assigned to each instance for each class:
forest_clf.predict_proba([some_digit])
You can see that the classifier is fairly confident about its prediction: the 0.9 at index 5 in the array means that the model estimates a 90% probability that the image represents a 5. It also thinks that the image could instead be a 2 or a 3 (1% and 8% chance, respectively).
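To map that probability vector back to a class label yourself, you can combine np.argmax with the classes_ attribute mentioned earlier (a small sketch; proba is just an illustrative name):

proba = forest_clf.predict_proba([some_digit])
forest_clf.classes_[np.argmax(proba)]   # index of the largest probability -> class label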
Now of course you want to evaluate these classifiers. As usual, you want to use cross-validation. Let’s evaluate the SGDClassifier’s accuracy using the cross_val_score() function:
cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")
It gets over 85% accuracy on all test folds. If you used a random classifier, you would get about 10% accuracy (one correct guess out of 10 digits), so this is not such a bad score, but you can still do much better. For example, simply scaling the inputs increases accuracy above 89%:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")
https://blog.csdn.net/Linli522362242/article/details/103387527 {Standardization is quite different: first it subtracts the mean value (so standardized values always have a zero mean), and then it divides by the standard deviation so that the resulting distribution has unit variance.}
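To see what that scaling actually does, here is a minimal manual sketch of standardization (mu, sigma, X_train_f, and X_train_manual are just illustrative names); it should reproduce X_train_scaled up to floating-point tolerance:

X_train_f = X_train.astype(np.float64)
mu = X_train_f.mean(axis=0)
sigma = X_train_f.std(axis=0)
sigma[sigma == 0] = 1.0                       # constant (all-zero) pixels: avoid dividing by zero
X_train_manual = (X_train_f - mu) / sigma
np.allclose(X_train_manual, X_train_scaled)   # expected to be True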