LIQING LIN

cp11_Working with Unlabeled Data_Clustering Analysis_Kmeans_hierarchical_dendrogram_heat map_DBSCAN

In the previous chapters, we used supervised learning techniques to build machine learning models using data where the answer was already known: the class labels were already available in our training data. In this chapter, we will switch gears and explore cluster analysis, a category of unsupervised learning techniques that allows us to discover hidden structures in data where we do not know the right answer upfront. The goal of clustering is to find a natural grouping in data such that items in the same cluster are more similar to each other than to those from different clusters.

Given its exploratory nature, clustering is an exciting topic and, in this chapter, you will learn about the following concepts that can help you to organize data into meaningful structures:

Finding centers of similarity using the popular k-means algorithm
Using a bottom-up approach to build hierarchical cluster trees
Identifying arbitrary shapes of objects using a density-based clustering approach

Grouping objects by similarity using k-means

In this section, we will discuss one of the most popular clustering algorithms, k-means, which is widely used in academia as well as in industry. Clustering (or cluster analysis) is a technique that allows us to find groups of similar objects, objects that are more related to each other than to objects in other groups. Examples of business-oriented applications of clustering include the grouping of documents, music, and movies by different topics, or finding customers that share similar interests based on common purchase behaviors as a basis for recommendation engines.

As we will see in a moment, the k-means algorithm is extremely easy to implement and is also computationally very efficient compared to other clustering algorithms, which might explain its popularity. The k-means algorithm belongs to the category of prototype-based clustering. We will discuss two other categories of clustering, hierarchical and density-based clustering, later in this chapter. Prototype-based clustering means that each cluster is represented by a prototype, which can either be the centroid (average) of similar points with continuous features, or the medoid (the most representative or most frequently occurring point) in the case of categorical features. While k-means is very good at identifying clusters of spherical shape, one of the drawbacks of this clustering algorithm is that we have to specify the number of clusters k a priori. An inappropriate choice for k can result in poor clustering performance. Later in this chapter, we will discuss the elbow method and silhouette plots, which are useful techniques to evaluate the quality of a clustering and help us determine the optimal number of clusters k.
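To make the difference between the two kinds of prototype concrete, here is a minimal sketch on a hypothetical five-point toy array (the data and variable names are assumptions for illustration only). It reads "most representative" as "smallest summed distance to all other points", which is one common way to define a medoid:

```python
import numpy as np

# Hypothetical toy data: four points on a unit square plus one distant point.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]])

# Centroid: the feature-wise mean; it does not have to be one of the samples.
centroid = points.mean(axis=0)

# Medoid: the actual sample with the smallest summed distance to all others.
diffs = points[:, np.newaxis, :] - points[np.newaxis, :, :]
pairwise = np.sqrt((diffs ** 2).sum(axis=2))
medoid = points[pairwise.sum(axis=1).argmin()]

print("centroid:", centroid)
print("medoid:  ", medoid)
```

The centroid is pulled toward the distant point and lands between samples, while the medoid is always one of the original samples.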

Although k-means clustering can be applied to data in higher dimensions, we will walk through the following examples using a simple two-dimensional dataset for the purpose of visualization:

from sklearn.datasets import make_blobs

X,y = make_blobs( n_samples=150,
                  n_features=2,
                  centers=3,
                  cluster_std=0.5, # standard deviation of each blob
                  shuffle=True,
                  random_state=0
                ) # generate blob-shaped sample points


import matplotlib.pyplot as plt

plt.scatter(X[:,0], X[:,1], c='black', marker='o', s=50)
plt.grid()

plt.show()

The dataset that we just created consists of 150 randomly generated points that are roughly grouped into three regions with higher density (cluster_std=0.5), which we visualize via a two-dimensional scatterplot:

In real-world applications of clustering, we do not have any ground truth category information about those samples; otherwise, it would fall into the category of supervised learning. Thus, our goal is to group the samples based on their feature similarities, which can be achieved using the k-means algorithm, summarized by the following four steps:

1. Randomly pick k centroids from the sample points as initial cluster centers.
2. Assign each sample to the nearest centroid $\boldsymbol{\mu}^{(j)}, j \in \{1, \dots, k\}$.
3. Move each centroid to the center of the samples that were assigned to it.
4. Repeat steps 2 and 3 until the cluster assignments do not change or a user-defined tolerance or a maximum number of iterations is reached.

https://towardsdatascience.com/machine-learning-algorithms-part-9-k-means-example-in-python-f2ad05ed5203

Now the next question is: how do we measure similarity between objects? We can define similarity as the opposite of distance, and a commonly used distance for clustering samples with continuous features is the squared Euclidean distance between two points x and y in m-dimensional space:

$$d(\mathbf{x}, \mathbf{y})^2 = \sum_{j=1}^{m} (x_j - y_j)^2 = \lVert \mathbf{x} - \mathbf{y} \rVert_2^2$$

Note that, in the preceding equation, the index j refers to the jth dimension (feature column) of the sample points x and y. In the rest of this section, we will use the superscripts i and j to refer to the sample index and cluster index, respectively. Based on this Euclidean distance metric, we can describe the k-means algorithm as a simple optimization problem, an iterative approach for minimizing the within-cluster sum of squared errors (SSE), which is sometimes also called cluster inertia:

$$SSE = \sum_{i=1}^{n} \sum_{j=1}^{k} w^{(i,j)} \lVert \mathbf{x}^{(i)} - \boldsymbol{\mu}^{(j)} \rVert_2^2$$

Here, $\boldsymbol{\mu}^{(j)}$ is the representative point (centroid) for cluster j, and $w^{(i,j)} = 1$ if the sample $\mathbf{x}^{(i)}$ is in cluster j; $w^{(i,j)} = 0$ otherwise.
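The SSE definition can be checked numerically with a few lines of NumPy. The sketch below is only an illustration under a stated assumption: instead of running k-means, it uses the blob labels returned by make_blobs as a stand-in cluster assignment and builds the indicator matrix w explicitly. The variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=150, n_features=2, centers=3,
                  cluster_std=0.5, shuffle=True, random_state=0)

k = 3
n = X.shape[0]

# Assumption: treat the generating blob labels as the cluster assignment.
centroids = np.array([X[y == j].mean(axis=0) for j in range(k)])

# Squared Euclidean distance from every sample to every centroid, shape (n, k).
sq_dist = ((X[:, np.newaxis, :] - centroids[np.newaxis, :, :]) ** 2).sum(axis=2)

# w[i, j] = 1 if sample i belongs to cluster j, and 0 otherwise.
w = np.zeros((n, k))
w[np.arange(n), y] = 1.0

sse = (w * sq_dist).sum()
print("within-cluster SSE: %.2f" % sse)
```

Because w is zero everywhere except at the assigned cluster, the double sum reduces to adding up each sample's squared distance to its own centroid.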

###############################################

import sklearn.metrics as m

d = m.pairwise_distances([(1,1),(0,0), (4,4)], [(2,2),(3,3),(0,0)],metric='euclidean')
print(d)

d_indices = m.pairwise_distances_argmin([(1,1),(0,0), (4,4)], [(2,2),(3,3),(0,0)],metric='euclidean')
d_indices

###############################################

import numpy as np
from sklearn.metrics import pairwise_distances_argmin

def find_clusters(X, n_clusters, rseed=2):
    # https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html
    # 1. Randomly pick k centroids from the sample points as initial cluster centers.
    rng = np.random.RandomState(rseed)
    k_centroids_indices = rng.permutation( X.shape[0] )[:n_clusters]
    centroids = X[k_centroids_indices]
    
    while True:
        # 2. Assign each sample to the nearest centroid
        # X --> nearest_centroids_indices
        X_labeled_centroids = pairwise_distances_argmin(X, centroids)
        
        #3. Move the centroids to the center of the samples that were assigned to it
        #3.1 Find new centers from means of points        #axis=0: mean( n_rows x 1_column )
        new_centers = np.array([ X[X_labeled_centroids==i].mean(axis=0) for i in range(n_clusters) ])
        
        #4. Repeat steps 2 and 3 until the cluster assignments do not change 
        #   or a user-defined tolerance or a maximum number of iterations is reached.
        #4.1 check for convergence
        if np.all( centroids == new_centers ):
            break
        centroids = new_centers #3.2 assign the new center to the centroids
        
    return centroids, X_labeled_centroids

centroids, y_km = find_clusters(X,3)

#plt.scatter(X[:,0], X[:,1], c=X_labeled_centroids, s=50, cmap="viridis")
plt.figure( figsize=(6,6) )

plt.scatter( X[y_km==0, 0], X[y_km==0, 1], s=50, c='lightgreen', marker='s', edgecolor='black', 
            label='cluster 1' )
plt.scatter( X[y_km==1, 0], X[y_km==1, 1], s=50, c='orange', marker='o', edgecolor='black',
            label='cluster 2' )
plt.scatter( X[y_km==2, 0], X[y_km==2, 1], s=50, c='lightblue', marker='v', edgecolor='black',
            label='cluster 3' )
plt.scatter( centroids[:,0], centroids[:,1], s=250, marker='*', c='blue', 
            edgecolor='black', label='centroids')
plt.legend( scatterpoints=1 )
plt.grid()

plt.show()

Now that you have learned how the simple k-means algorithm works, let's apply it to our sample dataset using the KMeans class from scikit-learn's cluster module:

from sklearn.cluster import KMeans
             # n_clusters=3: set the number of desired clusters to 3
             # set n_init=10 to run the k-means clustering algorithms 10 times
             # independently with different "random" centroids to choose the final model 
             # as the one with the lowest SSE
km = KMeans( n_clusters=3, init="random", n_init=10, max_iter=300, tol=1e-04, random_state=0)
y_km = km.fit_predict(X)

Using the preceding code, we set the number of desired clusters to 3; specifying the number of clusters a priori is one of the limitations of k-means. We set n_init=10 to run the k-means clustering algorithm 10 times independently with different random centroids and choose the final model as the one with the lowest SSE. Via the max_iter parameter, we specify the maximum number of iterations for each single run (here, 300). Note that the k-means implementation in scikit-learn stops early if it converges before the maximum number of iterations is reached.

However, it is possible that k-means does not reach convergence for a particular run, which can be problematic (computationally expensive) if we choose relatively large values for max_iter. One way to deal with convergence problems is to choose larger values for tol, a parameter that controls the tolerance for declaring convergence (older scikit-learn releases measure it on the change in the within-cluster sum of squared errors, while newer releases measure it on the change in the cluster centers between two consecutive iterations). In the preceding code, we chose a tolerance of 1e-04 (=0.0001).
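One way to see whether a run stopped because it converged or because it hit max_iter is to inspect the n_iter_ attribute of the fitted KMeans object. The following sketch refits the model with two tolerance values; the second value (0.1) is an arbitrary choice for illustration, not a recommendation:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, n_features=2, centers=3,
                  cluster_std=0.5, shuffle=True, random_state=0)

max_iter = 300
iterations = {}
for tol in (1e-4, 1e-1):
    km = KMeans(n_clusters=3, init="random", n_init=10,
                max_iter=max_iter, tol=tol, random_state=0).fit(X)
    # n_iter_ reports the number of iterations of the best of the n_init runs.
    iterations[tol] = km.n_iter_
    print("tol=%g: %d iterations (max_iter=%d), SSE=%.2f"
          % (tol, km.n_iter_, max_iter, km.inertia_))
```

If n_iter_ equals max_iter, the run was most likely cut off before converging, and a larger tol or max_iter may be worth trying.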

Another problem with k-means is that one or more clusters can be empty. Note that this problem does not exist for k-medoids or fuzzy C-means, an algorithm that we will discuss later in this section. However, this problem is accounted for in the current k-means implementation in scikit-learn. If a cluster is empty, the algorithm will search for the sample that is farthest away from the centroid of the empty cluster. Then it will reassign the centroid to be this farthest point.

#################################################
Note
When we are applying k-means to real-world data using a Euclidean distance metric, we want to make sure that the features are measured on the same scale and apply z-score standardization or min-max scaling if necessary.
#################################################
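A minimal sketch of the scaling advice in the note, assuming a hypothetical dataset in which the second feature is recorded in much larger units (simulated here by multiplying it by 1000). StandardScaler applies z-score standardization before the data reaches KMeans:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=150, n_features=2, centers=3,
                  cluster_std=0.5, shuffle=True, random_state=0)

# Hypothetical unit mismatch: the second feature is 1000 times larger.
X_mixed = X * np.array([1.0, 1000.0])

# Standardize each feature to zero mean and unit variance, then cluster.
pipe = make_pipeline(StandardScaler(),
                     KMeans(n_clusters=3, n_init=10, random_state=0))
labels = pipe.fit_predict(X_mixed)

X_scaled = pipe.named_steps["standardscaler"].transform(X_mixed)
print("feature std after scaling:", X_scaled.std(axis=0))
```

Wrapping both steps in a pipeline keeps the scaling parameters tied to the data that the clustering model was fitted on.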

After we predicted the cluster labels y_km and discussed the challenges of the k-means algorithm, let's now visualize the clusters that k-means identified in the dataset together with the cluster centroids. These are stored under the cluster_centers_ attribute of the fitted KMeans object:

km.cluster_centers_

plt.figure( figsize=(6,6) )

plt.scatter( X[y_km==0, 0], X[y_km==0, 1], s=50, c='lightgreen', marker='s', edgecolor='black', 
            label='cluster 1' )
plt.scatter( X[y_km==1, 0], X[y_km==1, 1], s=50, c='orange', marker='o', edgecolor='black',
            label='cluster 2' )
plt.scatter( X[y_km==2, 0], X[y_km==2, 1], s=50, c='lightblue', marker='v', edgecolor='black',
            label='cluster 3' )
plt.scatter( km.cluster_centers_[:,0], km.cluster_centers_[:,1], s=250, marker='*', c='blue', 
            edgecolor='black', label='centroids')
plt.legend( scatterpoints=1 )
plt.grid()

plt.show()

In the following scatterplot, we can see that k-means placed the three centroids at the center of each sphere, which looks like a reasonable grouping given this dataset:

print('WC_SSE: %.2f' % km.inertia_)

Although k-means worked well on this toy dataset, we need to note some of the main challenges of k-means. One of the drawbacks of k-means is that we have to specify the number of clusters k a priori, which may not always be so obvious in real-world applications, especially if we are working with a higher dimensional dataset that cannot be visualized. The other properties of k-means are that clusters do not overlap and are not hierarchical, and we also assume that there is at least one item in each cluster. Later, we will encounter different types of clustering algorithms, hierarchical and density-based clustering. Neither type of algorithm requires us to specify the number of clusters upfront or assume spherical structures in our dataset.

In the next subsection, we will introduce a popular variant of the classic k-means algorithm called k-means++. While it doesn't address the assumptions and drawbacks of k-means discussed in the previous paragraph, it can greatly improve the clustering results through more clever seeding of the initial cluster centers.

K-means++

So far, we have discussed the classic k-means algorithm that uses a random seed to place the initial centroids, which can sometimes result in bad clusterings or slow convergence if the initial centroids are chosen poorly (https://blog.csdn.net/Linli522362242/article/details/105722461). One way to address this issue is to run the k-means algorithm multiple times on a dataset and choose the best performing model in terms of the SSE. Another strategy is to place the initial centroids far away from each other via the k-means++ algorithm, which leads to better and more consistent results than the classic k-means (D. Arthur and S. Vassilvitskii. k-means++: The Advantages of Careful Seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007).

The initialization in k-means++ can be summarized as follows:

1. Initialize an empty set M to store the k centroids being selected.
2. Randomly choose the first centroid $\boldsymbol{\mu}^{(j)}$ from the input samples and assign it to M.
3. For each sample $\mathbf{x}^{(i)}$ that is not in M, find the minimum squared distance $d(\mathbf{x}^{(i)}, M)^2$ to any of the centroids in M.
4. To randomly select the next centroid $\boldsymbol{\mu}^{(p)}$, use a weighted probability distribution equal to $\frac{d(\boldsymbol{\mu}^{(p)}, M)^2}{\sum_i d(\mathbf{x}^{(i)}, M)^2}$.
5. Repeat steps 3 and 4 until k centroids are chosen.
6. Proceed with the classic k-means algorithm.

########################################
To use k-means++ with scikit-learn's KMeans object, we just need to set the init parameter to 'k-means++' instead of 'random'. In fact, 'k-means++' is the default argument to the init parameter, which is strongly recommended in practice. The only reason we haven't used it in the previous example was to avoid introducing too many concepts all at once. The rest of this section on k-means will use k-means++, but readers are encouraged to experiment more with the two different approaches (classic k-means via init='random' versus k-means++ via init='k-means++') for placing the initial cluster centroids.
########################################
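As a starting point for that experiment, the sketch below fits both initialization strategies over 20 seeds and reports the spread of the resulting SSE. Two choices here are assumptions made for illustration: n_init=1, so that each fit reflects a single seeding rather than the best of ten, and the number of seeds.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, n_features=2, centers=3,
                  cluster_std=0.5, shuffle=True, random_state=0)

results = {}
for init in ("random", "k-means++"):
    # One seeding per fit (n_init=1), repeated for 20 different seeds.
    sse = [KMeans(n_clusters=3, init=init, n_init=1, max_iter=300,
                  random_state=seed).fit(X).inertia_
           for seed in range(20)]
    results[init] = sse
    print("%10s: min SSE=%.2f, max SSE=%.2f" % (init, min(sse), max(sse)))
```

On a dataset this well separated the two strategies may end up with nearly identical results; the difference tends to matter more on harder data.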

wc_sse = []
for i in range(1,11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    wc_sse.append(kmeans.inertia_)#inertia_float:Sum of squared distances of samples to their closest cluster center

plt.plot(range(1,11), wc_sse)
plt.title('Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('WC_SSE', rotation=0, labelpad=22) #within cluster sum of squared errors
plt.show()

Next, we'll cluster the data using the optimal number of clusters (3) that we determined in the last step. Seeding with k-means++ makes it much less likely that we fall into the random initialization trap.

# set n_init=10 to run the k-means clustering algorithms 10 times independently with 
# different random centroids to choose the final model as the one with the lowest SSE
kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
y_km = kmeans.fit_predict(X)

plt.figure( figsize=(6,6) )

plt.scatter( X[y_km==0, 0], X[y_km==0, 1], s=50, c='lightgreen', marker='s', edgecolor='black', 
            label='cluster 1' )
plt.scatter( X[y_km==1, 0], X[y_km==1, 1], s=50, c='orange', marker='o', edgecolor='black',
            label='cluster 2' )
plt.scatter( X[y_km==2, 0], X[y_km==2, 1], s=50, c='lightblue', marker='v', edgecolor='black',
            label='cluster 3' )
plt.scatter( kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], s=250, marker='*', c='blue', 
            edgecolor='black', label='centroids')
plt.legend( scatterpoints=1 )
plt.grid()

plt.show()

The k-means++ algorithm (first 5 steps)

from sklearn.metrics.pairwise import pairwise_distances

a=[ [1,3],
    [2,2],
    [3,4]
  ]
b=[[3,3],
    [2,2]
  ]

pairwise_distances(a,b)

# https://www.geeksforgeeks.org/ml-k-means-algorithm/

# function to plot the selected centroids 
def plot(data, centroids): 
    #plt.figure( figsize=(6,6) )
    plt.scatter(data[:, 0], data[:, 1], marker = 'o', s=50, color = 'orange', edgecolor='black',
                label = 'data points') 
    plt.scatter(centroids[:-1, 0], centroids[:-1, 1], marker='*', s=150,
                color = 'black', label = 'previously selected centroids') 
    plt.scatter(centroids[-1, 0], centroids[-1, 1], marker='*', s=150, color = 'blue', 
                label = 'next centroid') 
    plt.title( 'Select % d th centroid'%(centroids.shape[0]) ) 
    plt.grid(True)  
    plt.legend()  
    plt.show()

import sklearn.metrics as metrics
# initialisation algorithm
def initialize(data, k):
    centroids = [] # 1. Initialize an empty set M to store the k centroids being selected.
    
    # 2. Randomly choose the first centroid μ^(j) from the input samples and assign it to M.
    k_centroids_indices =[]
    index = np.random.randint(data.shape[0])
    k_centroids_indices.append(index)        
    centroids.append( data[index, :] ) #first centroid
    plot( data, np.array(centroids) )
    
    dist=np.array([])
    # 5. Repeat steps 3 and 4 until k centroids are chosen.
    # compute remaining k - 1 centroids 
    for c_id in range(k - 1):            
        # 3. For each sample x^(i) that is not in M, find the minimum squared 
        # distance d(x^i,M) to any of the centroids in M.
        unused= [i for i in range(data.shape[0]) if i not in k_centroids_indices]
        d = metrics.pairwise_distances( data[unused,:], centroids, metric='euclidean' )           
        dist = np.min(d,axis=1) # for each row
        
        # 4. k-means++ proper draws the next centroid μ^(p) at random, with probability
        # proportional to d(x^(i), M)^2. As a simplification, this demo instead picks
        # the data point with the maximum distance as the next centroid.
        index = np.argmax(dist)
        k_centroids_indices.append(unused[index]) ######
        next_centroid = data[unused[index], :]    ######
        centroids.append( next_centroid )
        dist=np.array([])
        plot( data, np.array(centroids) )
    return centroids, k_centroids_indices

# call the initialize function to get the centroids 
centroids, k_centroids_indices = initialize(X, k = 3)
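The initialize function above always picks the farthest point, which is a deterministic simplification. A minimal sketch of the weighted random draw described in step 4 could look as follows; the helper name kmeanspp_indices and its seed parameter are hypothetical, and this is not scikit-learn's internal implementation:

```python
import numpy as np
from sklearn.datasets import make_blobs

def kmeanspp_indices(data, k, seed=0):
    """Return the row indices of k seeds drawn with squared-distance weighting."""
    rng = np.random.default_rng(seed)
    # Step 2: choose the first centroid uniformly at random.
    chosen = [int(rng.integers(data.shape[0]))]
    # Step 5: repeat steps 3 and 4 until k centroids are chosen.
    for _ in range(k - 1):
        # Step 3: squared distance of every sample to its nearest chosen centroid.
        diffs = data[:, np.newaxis, :] - data[chosen][np.newaxis, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Step 4: draw the next centroid with probability proportional to d2.
        # Already chosen samples have d2 == 0 and therefore probability 0.
        probs = d2 / d2.sum()
        chosen.append(int(rng.choice(data.shape[0], p=probs)))
    return chosen

X, _ = make_blobs(n_samples=150, n_features=2, centers=3,
                  cluster_std=0.5, shuffle=True, random_state=0)
seed_indices = kmeanspp_indices(X, k=3)
print(seed_indices)
```

Because distant points are only more likely, not guaranteed, to be picked, this version is less sensitive to a single outlier than the farthest-point rule.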

6. Proceed with the classic k-means algorithm.

from sklearn.metrics import pairwise_distances_argmin

def find_clusters(X, k_centroids_indices, n_clusters):
    # https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html
    centroids = X[k_centroids_indices]
    
    while True:
        # 2. Assign each sample to the nearest centroid
        # X --> nearest_centroids_indices
        X_labeled_centroids = pairwise_distances_argmin(X, centroids)
        
        #3. Move the centroids to the center of the samples that were assigned to it
        #3.1 Find new centers from means of points        #axis=0: mean( n_rows x 1_column )
        new_centers = np.array([ X[X_labeled_centroids==i].mean(axis=0) for i in range(n_clusters) ])
        
        #4. Repeat steps 2 and 3 until the cluster assignments do not change 
        #   or a user-defined tolerance or a maximum number of iterations is reached.
        #4.1 check for convergence
        if np.all( centroids == new_centers ):
            break
        centroids = new_centers #3.2 assign the new center to the centroids
        
    return centroids, X_labeled_centroids

centroids, y_km = find_clusters(X,k_centroids_indices,3) ##########################

#plt.scatter(X[:,0], X[:,1], c=X_labeled_centroids, s=50, cmap="viridis")
plt.figure( figsize=(6,6) )

plt.scatter( X[y_km==0, 0], X[y_km==0, 1], s=50, c='lightgreen', marker='s', edgecolor='black', 
            label='cluster 1' )
plt.scatter( X[y_km==1, 0], X[y_km==1, 1], s=50, c='orange', marker='o', edgecolor='black',
            label='cluster 2' )
plt.scatter( X[y_km==2, 0], X[y_km==2, 1], s=50, c='lightblue', marker='v', edgecolor='black',
            label='cluster 3' )
plt.scatter( centroids[:,0], centroids[:,1], s=250, marker='*', c='blue', 
            edgecolor='black', label='centroids')
plt.legend( scatterpoints=1 )
plt.grid()

plt.show()

Hard versus soft clustering

Hard clustering describes a family of algorithms where each sample in a dataset is assigned to exactly one cluster, as in the k-means algorithm that we discussed in the previous subsection. In contrast, algorithms for soft clustering (sometimes also called fuzzy clustering) assign a sample to one or more clusters. A popular example of soft clustering is the fuzzy C-means (FCM) algorithm (also called soft k-means or fuzzy k-means). The original idea goes back to the 1970s, when Joseph C. Dunn first proposed an early version of fuzzy clustering to improve k-means (A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters, J. C. Dunn, 1973). Almost a decade later, James C. Bezdek published his work on the improvement of the fuzzy clustering algorithm, which is now known as the FCM algorithm (Pattern Recognition with Fuzzy Objective Function Algorithms, J. C. Bezdek, Springer Science+Business Media, 2013).

The FCM procedure is very similar to k-means. However, we replace the hard cluster assignment with probabilities for each point belonging to each cluster. In k-means, we could express the cluster membership of a sample x with a sparse vector of binary values:

$$\begin{bmatrix} \boldsymbol{\mu}^{(1)} \rightarrow 0 \\ \boldsymbol{\mu}^{(2)} \rightarrow 1 \\ \boldsymbol{\mu}^{(3)} \rightarrow 0 \end{bmatrix}$$

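Here, the index position with value 1 indicates the cluster centroid $\boldsymbol{\mu}^{(j)}$ the sample is assigned to (assuming $k = 3$, $j \in \{1, 2, 3\}$). In contrast, a membership vector in FCM could be represented as follows:

$$\begin{bmatrix} \boldsymbol{\mu}^{(1)} \rightarrow 0.10 \\ \boldsymbol{\mu}^{(2)} \rightarrow 0.85 \\ \boldsymbol{\mu}^{(3)} \rightarrow 0.05 \end{bmatrix}$$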

Here, each value falls in the range [0, 1] and represents a probability of membership to the respective cluster centroid. The sum of the memberships for a given sample is equal to 1. Similarly to the k-means algorithm, we can summarize the FCM algorithm in four key steps:

1. Specify the number of k centroids and randomly assign the cluster memberships for each point.
2. Compute the cluster centroids $\boldsymbol{\mu}^{(j)}, j \in \{1, \dots, k\}$.
3. Update the cluster memberships for each point.
4. Repeat steps 2 and 3 until the membership coefficients do not change or a user-defined tolerance or a maximum number of iterations is reached.

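The objective function of FCM (we abbreviate it by $J_m$) looks very similar to the within-cluster sum of squared errors that we minimize in k-means:

$$J_m = \sum_{i=1}^{n} \sum_{j=1}^{k} \left( w^{(i,j)} \right)^m \lVert \mathbf{x}^{(i)} - \boldsymbol{\mu}^{(j)} \rVert_2^2$$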

However, note that the membership indicator $w^{(i,j)}$ is not a binary value as in k-means ($w^{(i,j)} \in \{0, 1\}$) but a real value that denotes the cluster membership probability ($w^{(i,j)} \in [0, 1]$). You also may have noticed that we added an additional exponent to $w^{(i,j)}$; the exponent m, any number greater than or equal to 1 (typically m = 2), is the so-called fuzziness coefficient (or simply fuzzifier) that controls the degree of fuzziness. The larger the value of m, the smaller the cluster membership $w^{(i,j)}$ becomes, which leads to fuzzier clusters. The cluster membership probability itself is calculated as follows:

$$w^{(i,j)} = \left[ \sum_{p=1}^{k} \left( \frac{\lVert \mathbf{x}^{(i)} - \boldsymbol{\mu}^{(j)} \rVert_2}{\lVert \mathbf{x}^{(i)} - \boldsymbol{\mu}^{(p)} \rVert_2} \right)^{\frac{2}{m-1}} \right]^{-1}$$

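For example, if we chose three cluster centers as in the previous k-means example, we could calculate the membership of the sample $\mathbf{x}^{(i)}$ belonging to the cluster $\boldsymbol{\mu}^{(j)}$, $j \in \{1, \dots, k\}$, as:

$$w^{(i,j)} = \left[ \left( \frac{\lVert \mathbf{x}^{(i)} - \boldsymbol{\mu}^{(j)} \rVert_2}{\lVert \mathbf{x}^{(i)} - \boldsymbol{\mu}^{(1)} \rVert_2} \right)^{\frac{2}{m-1}} + \left( \frac{\lVert \mathbf{x}^{(i)} - \boldsymbol{\mu}^{(j)} \rVert_2}{\lVert \mathbf{x}^{(i)} - \boldsymbol{\mu}^{(2)} \rVert_2} \right)^{\frac{2}{m-1}} + \left( \frac{\lVert \mathbf{x}^{(i)} - \boldsymbol{\mu}^{(j)} \rVert_2}{\lVert \mathbf{x}^{(i)} - \boldsymbol{\mu}^{(3)} \rVert_2} \right)^{\frac{2}{m-1}} \right]^{-1}$$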

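The center $\boldsymbol{\mu}^{(j)}$ of a cluster itself is calculated as the mean of all samples weighted by the degree to which each sample belongs to that cluster:

$$\boldsymbol{\mu}^{(j)} = \frac{\sum_{i=1}^{n} \left( w^{(i,j)} \right)^m \mathbf{x}^{(i)}}{\sum_{i=1}^{n} \left( w^{(i,j)} \right)^m}$$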

Just by looking at the equation to calculate the cluster memberships, it is intuitive to say that each iteration in FCM is more expensive than an iteration in k-means. However, FCM typically requires fewer iterations overall to reach convergence.
Unfortunately, the FCM algorithm is currently not implemented in scikit-learn. However, it has been found in practice that both k-means and FCM produce very similar clustering outputs, as described in a study by Soumi Ghosh and Sanjay K. Dubey (S. Ghosh and S. K. Dubey. Comparative Analysis of k-means and Fuzzy c-means Algorithms. IJACSA, 4:35–38, 2013).
https://www.geeksforgeeks.org/ml-fuzzy-clustering/
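Since FCM is not part of scikit-learn, the four steps can be sketched directly in NumPy. The function below is a minimal illustration, not a reference implementation: the name fuzzy_c_means, its parameters, and the small distance floor that avoids division by zero are all assumptions, and it requires m > 1 because the membership update uses the exponent 2/(m-1).

```python
import numpy as np
from sklearn.datasets import make_blobs

def fuzzy_c_means(X, k, m=2.0, max_iter=100, tol=1e-4, seed=0):
    """Minimal FCM sketch; returns (centroids, membership matrix)."""
    if m <= 1:
        raise ValueError("this sketch requires m > 1")
    rng = np.random.default_rng(seed)
    # Step 1: random memberships, normalized so that each row sums to 1.
    w = rng.random((X.shape[0], k))
    w /= w.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        # Step 2: centroids as means weighted by the memberships raised to m.
        wm = w ** m
        centroids = (wm.T @ X) / wm.sum(axis=0)[:, np.newaxis]
        # Step 3: update memberships from the distance ratios.
        diffs = X[:, np.newaxis, :] - centroids[np.newaxis, :, :]
        dist = np.fmax(np.sqrt((diffs ** 2).sum(axis=2)), 1e-12)
        ratio = (dist[:, :, np.newaxis] / dist[:, np.newaxis, :]) ** (2.0 / (m - 1.0))
        w_new = 1.0 / ratio.sum(axis=2)
        # Step 4: stop once the memberships barely change.
        converged = np.abs(w_new - w).max() < tol
        w = w_new
        if converged:
            break
    return centroids, w

X, _ = make_blobs(n_samples=150, n_features=2, centers=3,
                  cluster_std=0.5, shuffle=True, random_state=0)
centroids, memberships = fuzzy_c_means(X, k=3)
print(memberships[:3].round(3))
```

Taking memberships.argmax(axis=1) turns the soft assignment into hard labels that can be compared with the k-means result.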

Using the elbow method to find the optimal number of clusters

One of the main challenges in unsupervised learning is that we do not know the definitive answer. We don't have the ground truth class labels in our dataset that allow us to apply the techniques that we used in Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter Tuning, in order to evaluate the performance of a supervised model. Thus, in order to quantify the quality of clustering, we need to use intrinsic metrics, such as the within-cluster SSE (distortion) that we discussed earlier in this chapter, to compare the performance of different k-means clusterings. Conveniently, we don't need to compute the within-cluster SSE explicitly as it is already accessible via the inertia_ attribute after fitting a KMeans model:

print('Distortion-WC_SSE: %.2f' % km.inertia_)

Based on the within-cluster SSE, we can use a graphical tool, the so-called elbow method, to estimate the optimal number of clusters k for a given task. Intuitively, we can say that, if k increases, the distortion (the within-cluster SSE) will decrease. This is because the samples will be closer to the centroids they are assigned to. The idea behind the elbow method is to identify the value of k beyond which the distortion stops decreasing rapidly (equivalently, the value below which it begins to increase most rapidly), which will become clearer if we plot the distortion for different values of k:

distortions_wc_sses = []
for i in range(1,11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    distortions_wc_sses.append(kmeans.inertia_)#inertia_float:Sum of squared distances of samples to 
                                               #their closest cluster center

plt.plot(range(1,11), distortions_wc_sses, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('Distortion( WC_SSE )', labelpad=22) #within cluster sum of squared errors
plt.show()

As we can see in the following plot, the elbow is located at k = 3, which provides evidence that k = 3 is indeed a good choice for this dataset:

Quantifying the quality of clustering via silhouette plots

Another intrinsic metric (like the within-cluster SSE, or distortion) to evaluate the quality of a clustering is silhouette analysis, which can also be applied to clustering algorithms other than k-means that we will discuss later in this chapter. Silhouette analysis can be used as a graphical tool to plot a measure of how tightly grouped the samples in the clusters are. To calculate the silhouette coefficient of a single sample in our dataset, we can apply the following three steps:

1. Calculate the cluster cohesion $a^{(i)}$ (the mean intra-cluster distance) as the average distance between a sample $\mathbf{x}^{(i)}$ and all other points in the same cluster.
2. Calculate the cluster separation $b^{(i)}$ from the next closest cluster (the mean nearest-cluster distance) as the average distance between the sample $\mathbf{x}^{(i)}$ and all samples in the nearest cluster.
3. Calculate the silhouette $s^{(i)}$ as the difference between cluster cohesion and separation divided by the greater of the two, as shown here:

$$s^{(i)} = \frac{b^{(i)} - a^{(i)}}{\max\{b^{(i)}, a^{(i)}\}}$$

The silhouette coefficient is bounded in the range -1 to 1. A coefficient close to +1 means that the instance is well inside its own cluster and far from other clusters, a coefficient close to 0 means that it is close to a cluster boundary, and a coefficient close to -1 means that the instance may have been assigned to the wrong cluster. Based on the preceding formula, we can see that the silhouette coefficient is 0 if the cluster separation and cohesion are equal ($b^{(i)} = a^{(i)}$). Furthermore, we get close to an ideal silhouette coefficient of 1 if $b^{(i)} \gg a^{(i)}$, since $b^{(i)}$ quantifies how dissimilar a sample is to other clusters, and $a^{(i)}$ tells us how similar it is to the other samples in its own cluster.
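The three steps can be traced by hand for a single sample and compared against scikit-learn. The sketch below does this for the sample at index 0 (an arbitrary choice); the variable names a_i, b_i, and s_i are just illustrative stand-ins for the cohesion, separation, and silhouette coefficient:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_samples

X, _ = make_blobs(n_samples=150, n_features=2, centers=3,
                  cluster_std=0.5, shuffle=True, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

i = 0  # index of the sample to inspect
dist_i = pairwise_distances(X[i:i + 1], X).ravel()

# Step 1: cohesion, the mean distance to the other samples in the same cluster.
same_cluster = labels == labels[i]
same_cluster[i] = False  # exclude the sample itself
a_i = dist_i[same_cluster].mean()

# Step 2: separation, the smallest mean distance to any other cluster.
b_i = min(dist_i[labels == c].mean()
          for c in np.unique(labels) if c != labels[i])

# Step 3: silhouette coefficient.
s_i = (b_i - a_i) / max(a_i, b_i)
print("manual: %.4f, scikit-learn: %.4f" % (s_i, silhouette_samples(X, labels)[i]))
```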

The silhouette coefficient is available as silhouette_samples from scikit-learn's metrics module, and optionally the silhouette_score function can be imported for convenience. The latter calculates the average silhouette coefficient across all samples, which is equivalent to numpy.mean(silhouette_samples(…)).
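The equivalence between the two functions is easy to confirm on our toy dataset. This short check assumes a k-means clustering with k = 3; the variable names are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=150, n_features=2, centers=3,
                  cluster_std=0.5, shuffle=True, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Per-sample coefficients, averaged by hand ...
avg_from_samples = np.mean(silhouette_samples(X, labels, metric="euclidean"))
# ... versus the convenience function that returns the average directly.
avg_from_score = silhouette_score(X, labels, metric="euclidean")

print("mean of silhouette_samples: %.4f" % avg_from_samples)
print("silhouette_score:           %.4f" % avg_from_score)
```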

By executing the following code, we will now create a plot of the silhouette coefficients for a k-means clustering with k = 3:

km = KMeans(n_clusters=3, init='k-means++', n_init=10, max_iter=300, tol=1e-04, random_state=0)
y_km = km.fit_predict(X)

import numpy as np
from matplotlib import cm
from sklearn.metrics import silhouette_samples

cluster_labels = np.unique(y_km)     # array([0, 1, 2])
n_clusters = cluster_labels.shape[0] # 3
silhouette_vals = silhouette_samples(X, y_km, metric='euclidean') #The silhouette coefficients

y_ax_lower, y_ax_upper = 0, 0
yticks = []

for i,c in enumerate(cluster_labels):
    #Get all silhouette coefficients in the same cluster
    c_silhouette_vals = silhouette_vals[y_km==c] #The silhouette coefficients when cluster ==c
    c_silhouette_vals.sort() #ascending
    y_ax_upper += len(c_silhouette_vals) #==the number of data points in the same cluster
    color = cm.jet( float(i)/n_clusters) 
    #cm.jet(X): For floats, X should be in the interval ``[0.0, 1.0]`` to
    #return the RGBA values ``X*100`` percent along the Colormap line.
    plt.barh( range(y_ax_lower, y_ax_upper), c_silhouette_vals, height=1.0, edgecolor='none', 
              color=color)
    yticks.append( (y_ax_lower+y_ax_upper)/2. ) #get the center of plt.barh as a ytick
    y_ax_lower += len( c_silhouette_vals ) # The starting position of the next y-axis
    
silhouette_avg = np.mean( silhouette_vals )
plt.axvline(silhouette_avg, color="red", linestyle='--')
                                     # replace ticks with [labels]
plt.yticks(yticks, cluster_labels+1) #( [25.0, 75.0, 125.0], array([1, 2, 3]) ) =(ticks, [labels]) 
plt.ylabel('Cluster')
plt.xlabel('Silhouette coefficient')
plt.show()

Through a visual inspection of the silhouette plot, we can quickly scrutinize the sizes of the different clusters (len(c_silhouette_vals) in the code) and identify clusters that contain outliers:

As we can see in the preceding silhouette plot, our silhouette coefficients are not even close to 0, which can be an indicator of a good clustering. Furthermore, to summarize the goodness of our clustering, we added the average silhouette coefficient to the plot (dashed line).

To see how a silhouette plot looks for a relatively bad clustering, let's seed the k-means algorithm with two centroids only:

#set n_init=10 to run the k-means clustering algorithms 10 times independently 
#with different random centroids to choose the final model as the one with the lowest SSE
km = KMeans(n_clusters=2, init='k-means++', n_init=10, max_iter=300, tol=1e-04, random_state=0)
y_km = km.fit_predict(X)
print(np.unique(y_km))

plt.scatter( X[y_km==0,0], X[y_km==0,1], s=50, c='lightgreen', marker='o', edgecolor='black', 
            label='cluster 1' )
plt.scatter( X[y_km==1,0], X[y_km==1,1], s=50, c='orange', marker='s', edgecolor='black',
            label='cluster 2')

plt.scatter(km.cluster_centers_[:,0], km.cluster_centers_[:,1], s=250, marker='*', c='red', 
            label='centroids')
plt.legend()
plt.grid()
plt.show()

As we can see in the resulting scatterplot, one of the centroids falls between two of the three spherical groupings of the sample points. Although the clustering does not look completely terrible, it is suboptimal.

Next we create the silhouette plot to evaluate the results. Please keep in mind that we typically do not have the luxury of visualizing datasets in two-dimensional scatterplots in real-world problems, since we typically work with data in higher dimensions:

cluster_labels = np.unique(y_km)
n_clusters = cluster_labels.shape[0]
#from sklearn.metrics import silhouette_samples
silhouette_vals = silhouette_samples(X, y_km, metric='euclidean')
y_ax_lower, y_ax_upper = 0,0
yticks = []

for i,c in enumerate(cluster_labels):
    #extract all silhouette coefficients in the same cluster
    c_silhouette_vals = silhouette_vals[y_km==c]
    c_silhouette_vals.sort()
    y_ax_upper += len(c_silhouette_vals) #==the number of data points in the same cluster
    #cm.jet(X): For floats, X should be in the interval ``[0.0, 1.0]`` to
    #return the RGBA values ``X*100`` percent along the Colormap line.
    color = cm.jet(i/n_clusters)
    plt.barh( range(y_ax_lower, y_ax_upper), c_silhouette_vals, height=1.0, edgecolor='none', 
             color=color)
    yticks.append( (y_ax_lower + y_ax_upper)/2) #get the center of plt.barh as a ytick
    y_ax_lower += len(c_silhouette_vals) # The starting position of the next y-axis
    
silhouette_avg = np.mean( silhouette_vals )
plt.axvline(silhouette_avg, color='red', linestyle='--')
plt.yticks(yticks, cluster_labels+1)
plt.ylabel('Cluster')
plt.xlabel('Silhouette coefficient')
plt.show()

As we can see in the resulting plot, the silhouettes now have visibly different lengths and widths, which is further evidence of a suboptimal clustering:
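The same comparison can be made numerically. The following is a minimal sketch, assuming a toy dataset of three compact blobs similar to the one plotted in this chapter (the make_blobs parameters and the names X_demo and avg_silhouette are illustrative assumptions). It uses scikit-learn's silhouette_score, which is the mean of the per-sample silhouette coefficients, to compare k=2 with k=3:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Assumption: a toy dataset with three compact blobs, similar to the one
# used for the scatterplots in this chapter
X_demo, _ = make_blobs(n_samples=150, n_features=2, centers=3,
                       cluster_std=0.5, shuffle=True, random_state=0)

avg_silhouette = {}
for k in (2, 3):
    km_demo = KMeans(n_clusters=k, init='k-means++', n_init=10,
                     max_iter=300, tol=1e-04, random_state=0)
    labels_demo = km_demo.fit_predict(X_demo)
    # silhouette_score is the mean of silhouette_samples over all points
    avg_silhouette[k] = silhouette_score(X_demo, labels_demo,
                                         metric='euclidean')

for k, score in avg_silhouette.items():
    print('k=%d: average silhouette = %.3f' % (k, score))
```

On data like this, the larger average for k=3 points to the better clustering, although the full silhouette plot is still more informative because it also shows the cluster sizes.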

Organizing clusters as a hierarchical tree

In this section, we will take a look at an alternative approach to prototype-based clustering: hierarchical clustering. One advantage of hierarchical clustering algorithms is that they allow us to plot dendrograms (visualizations of a binary hierarchical clustering), which can help with the interpretation of the results by creating meaningful taxonomies. Another useful advantage of this hierarchical approach is that we do not need to specify the number of clusters upfront.

The two main approaches to hierarchical clustering are agglomerative and divisive hierarchical clustering. In divisive hierarchical clustering, we start with one cluster that encompasses all our samples, and we iteratively split the cluster into smaller clusters until each cluster only contains one sample. In this section, we will focus on agglomerative clustering, which takes the opposite approach. We start with each sample as an individual cluster and merge the closest pairs of clusters until only one cluster remains.

Grouping clusters in bottom-up fashion

The two standard algorithms for agglomerative hierarchical clustering are single linkage and complete linkage.

Using single linkage, we compute the distances between the most similar members for each pair of clusters and merge the two clusters for which the distance between the most similar members is the smallest.

The complete linkage approach is similar to single linkage but, instead of comparing the most similar members in each pair of clusters, we compare the most dissimilar members to perform the merge. This is shown in the following diagram:

#####################################
Note
Other commonly used algorithms for agglomerative hierarchical clustering include average linkage and Ward's linkage. In average linkage, we merge the cluster pairs based on the minimum average distances between all group members in the two clusters. In Ward's linkage, the two clusters that lead to the minimum increase of the total within-cluster SSE are merged.
#####################################
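To make the linkage criteria concrete, here is a small sketch that computes the single, complete, and average linkage distances between two hand-made clusters. The points and variable names are hypothetical and chosen only so that the numbers are easy to check by hand:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small hypothetical clusters in 2D (values chosen only for illustration)
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[3.0, 0.0], [5.0, 0.0]])

# All pairwise Euclidean distances between members of the two clusters
d = cdist(cluster_a, cluster_b, metric='euclidean')

single_dist = d.min()     # closest pair of members
complete_dist = d.max()   # most distant pair of members
average_dist = d.mean()   # mean over all cross-cluster pairs

print(single_dist, complete_dist, average_dist)
```

An agglomerative algorithm evaluates one of these quantities for every pair of current clusters and merges the pair with the smallest value.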
In this section, we will focus on agglomerative clustering using the complete linkage approach. This is an iterative procedure that can be summarized by the following steps:

1. Compute the distance matrix of all samples, for example:
   pd.DataFrame( squareform( pdist(df, metric="euclidean") ), columns=labels, index=labels )
2. Represent each data point as a singleton cluster.
3. Merge the two closest clusters based on the distance between their most dissimilar (most distant) members.
4. Update the distance matrix.
5. Repeat steps 3 and 4 until one single cluster remains.
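The procedure above can be sketched in a few lines of NumPy. This is a deliberately naive illustration, not how SciPy implements it: the function name complete_linkage_merges is hypothetical, and instead of updating a cluster-level distance matrix it recomputes cluster distances from the sample-level matrix on every pass. The merge distances are then compared with those returned by SciPy's linkage function:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform


def complete_linkage_merges(points):
    """Return the merge distances of a naive complete-linkage clustering."""
    dist = squareform(pdist(points, metric='euclidean'))    # step 1
    clusters = [[i] for i in range(len(points))]             # step 2
    merge_distances = []
    while len(clusters) > 1:
        best = (np.inf, None, None)
        # step 3: find the pair of clusters whose most distant members
        # are closest to each other
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d_ab = dist[np.ix_(clusters[a], clusters[b])].max()
                if d_ab < best[0]:
                    best = (d_ab, a, b)
        d_ab, a, b = best
        merge_distances.append(d_ab)
        # step 4 (simplified): merge the member lists; cluster distances
        # are recomputed from the sample-level matrix in the next pass
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return np.array(merge_distances)


rng = np.random.RandomState(123)
points = rng.random_sample([5, 3]) * 10
naive = complete_linkage_merges(points)
reference = linkage(pdist(points), method='complete')[:, 2]
print(np.allclose(naive, reference))
```

The nested loops make this sketch far too slow for real datasets; it is only meant to show what each step does.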

Now we will discuss how to compute the distance matrix (step 1). But first, let's generate some random sample data to work with. The rows represent different observations (IDs 0 to 4), and the columns are the different features (X, Y, Z) of those samples:

import pandas as pd
import numpy as np

np.random.seed(123)
variables = ['X','Y','Z'] #features
labels = ['ID_0', 'ID_1', 'ID_2', 'ID_3', 'ID_4'] #indices
X = np.random.random_sample([5,3]) * 10
df = pd.DataFrame(X, columns=variables, index=labels)
df

After executing the preceding code, we should now see the following data frame containing the randomly generated samples:

Performing hierarchical clustering on a distance matrix

To calculate the distance matrix as input for the hierarchical clustering algorithm, we will use the pdist function from SciPy's spatial.distance submodule:

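from scipy.spatial.distance import pdist

# manual check: Euclidean distance between ID_0 and ID_1,
# which should match the first entry returned by pdist
np.sqrt((6.964692-5.513148)**2+(2.861393-7.194690)**2+(2.268515-4.231065)**2)

pdist(df, metric="euclidean")

pdist returns a condensed distance matrix: a flat array containing the upper triangle of the distance matrix.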
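To see how the condensed and the square form relate, consider the following sketch with four hypothetical points on a line (the array pts is made up so that the distances can be read off directly):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Four hypothetical points on a line, so the distances are easy to read
pts = np.array([[0.0], [1.0], [3.0], [6.0]])

condensed = pdist(pts, metric='euclidean')
square = squareform(condensed)

# The condensed form lists the upper triangle row by row:
# d(0,1), d(0,2), d(0,3), d(1,2), d(1,3), d(2,3)
print(condensed)
print(square.shape)

# The same entries can be read back from the square matrix
rows, cols = np.triu_indices(len(pts), k=1)
print(np.array_equal(square[rows, cols], condensed))
```

For n samples the condensed array has n(n-1)/2 entries, while the square matrix stores every distance twice plus a zero diagonal.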

from scipy.spatial.distance import pdist, squareform
# pdist: calculated the Euclidean distance between each pair of sample points
#        in our dataset based on the features X, Y, and Z as input to the squareform
#squareform: create a symmetrical matrix of the pair-wise distances
row_dist = pd.DataFrame( squareform(pdist(df, metric="euclidean")), columns=labels, index=labels)
row_dist

Using the preceding code, we calculated the Euclidean distance between each pair of sample points in our dataset based on the features X, Y, and Z. We provided the condensed distance matrix (returned by pdist) as input to the squareform function to create a symmetrical matrix of the pair-wise distances, as shown here:

Next, we will apply the complete linkage agglomeration to our clusters using the linkage function from SciPy's cluster.hierarchy submodule, which returns a so-called linkage matrix.

However, before we call the linkage function, let's take a careful look at the function documentation:

from scipy.cluster.hierarchy import linkage

help(linkage)

linkage(y, method='single', metric='euclidean')

[...]

Parameters
    ----------
    y : ndarray
        A condensed distance matrix. A condensed distance matrix
        is a flat array containing the upper triangular of the distance matrix.
        This is the form that ``pdist`` returns. Alternatively, a collection of
        :math:`m` observation vectors in :math:`n` dimensions may be passed as an
        :math:`m` by :math:`n` array. All elements of the condensed distance matrix
        must be finite, i.e. no NaNs or infs.
    method : str, optional
        The linkage algorithm to use. See the ``Linkage Methods`` section below
        for full descriptions.
    metric : str or function, optional
        The distance metric to use in the case that y is a collection of
        observation vectors; ignored otherwise. See the ``pdist``
        function for a list of valid distance metrics. A custom distance
        function can also be used.
    
    Returns
    -------
    Z : ndarray
        The hierarchical clustering encoded as a linkage matrix.
[...]

Based on the function description, we conclude that we can use a condensed distance matrix (upper triangular) from the pdist function as an input argument. Alternatively, we could also provide the initial data array and use the euclidean metric as a function argument in linkage. However, we should not use the squareform distance matrix that we defined earlier, since linkage would treat its rows as observation vectors and yield different distance values from those expected. To sum it up, the three possible scenarios are listed here:

Incorrect approach: In this approach, we use the squareform distance matrix. The code is as follows:
```
row_clusters = linkage(row_dist,method='complete',metric='euclidean')
```

Correct approach: In this approach, we use the condensed distance matrix. The code is as follows:
```
row_clusters = linkage( pdist(df, metric='euclidean'), method='complete', metric='euclidean')
```
Correct approach: In this approach, we use the input sample matrix. The code is as follows:
```
row_clusters = linkage(df.values, method='complete', metric='euclidean')
```
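The three scenarios can be checked side by side. The sketch below regenerates the same random samples with a local RandomState (the names X_demo, from_condensed, from_samples, and from_square are illustrative) and compares the merge distances. Recent SciPy versions emit a warning when they receive something that looks like an uncondensed distance matrix, so the warning is silenced here on purpose:

```python
import warnings

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

rng = np.random.RandomState(123)
X_demo = rng.random_sample([5, 3]) * 10

condensed = pdist(X_demo, metric='euclidean')
square = squareform(condensed)

# The two correct approaches
from_condensed = linkage(condensed, method='complete')
from_samples = linkage(X_demo, method='complete', metric='euclidean')

# Incorrect approach: each row of the square matrix is treated as a
# 5-dimensional observation, so distances between rows are clustered
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    from_square = linkage(square, method='complete', metric='euclidean')

print(np.allclose(from_condensed, from_samples))
print(np.allclose(from_condensed[:, 2], from_square[:, 2]))
```

The first comparison confirms that the two correct approaches agree; the second shows that the merge distances obtained from the square matrix differ from them.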

row_clusters = linkage( pdist(df, metric='euclidean'), method='complete', metric='euclidean' )
row_clusters

To take a closer look at the clustering results, we can turn them into a pandas DataFrame (best viewed in IPython Notebook) as follows:

pd.DataFrame( row_clusters, columns=['row label 1', 'row label 2', 
                                     'distance', 'no. of items in clust.'],
              index=[ 'cluster %d' % (i+1) for i in range( row_clusters.shape[0] ) ]
            )

As shown in the following table, the linkage matrix consists of several rows where each row represents one merge (that is, one new cluster). The first and second columns give the indices of the two clusters that were merged in that step, and the third column reports the distance between them, which under complete linkage is the distance between their most dissimilar (most distant) members. The last column returns the count of the members in the newly formed cluster.

Indices 0 to 4 refer to the original samples ID_0 to ID_4. Each merge creates a new cluster whose index continues the count: index 5 is the cluster formed in the first row, index 6 the one formed in the second row, and so on.

Why does cluster 3 merge ID_3 with cluster 5 {ID_0, ID_4} rather than with cluster 6 {ID_1, ID_2}? Complete linkage compares the most dissimilar members of each pair of clusters. The member of cluster 5 farthest from ID_3 is ID_0 (distance 5.899885), while the member of cluster 6 farthest from ID_3 is ID_2 (distance 7.244262). Since 5.899885 < 7.244262, ID_3 joins cluster 5, giving {ID_0, ID_3, ID_4}.

	row label 1	row label 2	distance	no. of items in clust.
cluster 1	0	4	3.835396	2.0 (ID_0, ID_4)
cluster 2	1	2	4.347073	2.0 (ID_1, ID_2)
cluster 3	3	5 = {ID_0, ID_4}	5.899885	3.0 (ID_0, ID_3, ID_4)
cluster 4	6 = {ID_1, ID_2}	7 = {ID_0, ID_3, ID_4}	8.316594	5.0 (all samples)
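The index convention of the linkage matrix can be decoded programmatically. The following sketch rebuilds the same five samples with a local RandomState and tracks which original samples belong to every newly formed cluster; the dictionary name members is an illustrative choice:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.RandomState(123)
X_demo = rng.random_sample([5, 3]) * 10
Z = linkage(pdist(X_demo, metric='euclidean'), method='complete')

n_samples = X_demo.shape[0]
# Indices 0..n-1 are the original samples; n + i is the cluster from row i
members = {i: [i] for i in range(n_samples)}
for i, (left, right, dist, count) in enumerate(Z):
    merged = members[int(left)] + members[int(right)]
    members[n_samples + i] = merged
    print('cluster index %d: samples %s, merge distance %.6f, size %d'
          % (n_samples + i, sorted(merged), dist, int(count)))
```

The size printed for each row should agree with the last column of the linkage matrix, and the final cluster should contain all five samples.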

Now that we have computed the linkage matrix, we can visualize the results in the form of a dendrogram:

from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import set_link_color_palette
set_link_color_palette(['black'])
#labels = ['ID_0', 'ID_1', 'ID_2', 'ID_3', 'ID_4']
row_dendrogram = dendrogram( row_clusters, labels=labels, color_threshold=5)
plt.tight_layout()
plt.ylabel('Euclidean distance', fontsize=14)

plt.show()

Such a dendrogram summarizes the different clusters that were formed during the agglomerative hierarchical clustering; for example, we can see that the samples ID_0 and ID_4, followed by ID_1 and ID_2, are the most similar ones based on the Euclidean distance metric. Both of these merges join two individual samples, so the merge distance is simply the Euclidean distance between the two points, regardless of the linkage criterion.

Attaching dendrograms to a heat map

In practical applications, hierarchical clustering dendrograms are often used in combination with a heat map, which allows us to represent the individual values in the sample matrix with a color code. In this section, we will discuss how to attach a dendrogram to a heat map plot and order the rows in the heat map correspondingly.

However, attaching a dendrogram to a heat map can be a little bit tricky, so let's go through this procedure step by step:

We create a new figure object and define the x axis position, y axis position, width, and height of the dendrogram via the add_axes method. Furthermore, we rotate the dendrogram 90 degrees counter-clockwise. The code is as follows:
```
fig = plt.figure(figsize=(8,8), facecolor='white')
axd = fig.add_axes([0.09, 0.1, 0.2, 0.6])#([left, bottom, width, height])
row_dendr = dendrogram(row_clusters, orientation='left')
```
Next we reorder the data in our initial DataFrame according to the clustering labels that can be accessed from the dendrogram object, which is essentially a Python dictionary, via the leaves key. The code is as follows:
```
row_dendr
```
```
df_rowclust = df.iloc[row_dendr['leaves'][::-1]]
df_rowclust
```
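The dictionary returned by dendrogram can be inspected without drawing anything. Here is a small sketch, assuming the same five random samples as above (df_demo and dendr are illustrative names) and using the no_plot=True option so that no figure is required:

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

rng = np.random.RandomState(123)
df_demo = pd.DataFrame(rng.random_sample([5, 3]) * 10,
                       columns=['X', 'Y', 'Z'],
                       index=['ID_0', 'ID_1', 'ID_2', 'ID_3', 'ID_4'])
Z = linkage(pdist(df_demo.values, metric='euclidean'), method='complete')

# no_plot=True computes the layout without needing a matplotlib figure
dendr = dendrogram(Z, orientation='left', no_plot=True)
leaves = dendr['leaves']          # row positions in dendrogram order
print(sorted(dendr.keys()))
print(leaves)

# Reverse the order, as in the chapter's code, so that the first
# heat map row lines up with the topmost leaf of the rotated dendrogram
df_reordered = df_demo.iloc[leaves[::-1]]
print(list(df_reordered.index))
```

The leaves entry is a permutation of the row positions, so indexing with it reorders the rows without dropping or duplicating any sample.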

Now we construct the heat map from the reordered DataFrame and position it right next to the dendrogram:

axm = fig.add_axes([0.23,0.1, 0.6,0.6]) #([left, bottom, width, height])
cax = axm.matshow(df_rowclust, interpolation='nearest', cmap='hot_r')

Finally, we will modify the aesthetics of the heat map by removing the axis ticks and hiding the axis spines. Also, we will add a color bar and assign the feature and sample names to the x and y axis tick labels, respectively. The code is as follows:

axd.spines.values() #the border objects of dendrogram

fig = plt.figure(figsize=(8,8), facecolor='white')

axd = fig.add_axes([0.09, 0.1, 0.2, 0.6])#([left, bottom, width, height])
row_dendr = dendrogram(row_clusters, orientation='left')

df_rowclust = df.iloc[row_dendr['leaves'][::-1]]
axm = fig.add_axes([0.23,0.1, 0.6,0.6]) #([left, bottom, width, height])
cax = axm.matshow( df_rowclust, interpolation='nearest', cmap='hot_r' )

axd.set_xticks([]) #removing the axis ticks
axd.set_yticks([])

# remove the border of dendrogram
for i in axd.spines.values():
    i.set_visible(False)

fig.colorbar( cax ) #add a color bar
# place one tick per column/row so that every label lines up with its cell
axm.set_xticks( range(df_rowclust.shape[1]) )
axm.set_xticklabels( list(df_rowclust.columns) ) #feature names on the x axis
axm.set_yticks( range(df_rowclust.shape[0]) )
axm.set_yticklabels( list(df_rowclust.index) )   #sample names on the y axis
plt.show()

After following the previous steps, the heat map should be displayed with the dendrogram attached:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.dendrogram.html

As we can see, the row order in the heat map reflects the clustering of the samples in the dendrogram. In addition to a simple dendrogram, the color-coded values of each sample and feature in the heat map provide us with a nice summary of the dataset. (This example is for demonstration only; I would not recommend a heat map with a single color scale for features measured on different scales.)

Applying agglomerative clustering via scikit-learn

In the previous subsections, we saw how to perform agglomerative hierarchical clustering using SciPy. However, there is also an AgglomerativeClustering implementation in scikit-learn, which allows us to choose the number of clusters that we want to return. This is useful if we want to prune the hierarchical cluster tree. By setting the n_clusters parameter to 2, we will now cluster the samples into two groups using the same complete linkage approach based on the Euclidean distance metric as before:

from sklearn.cluster import AgglomerativeClustering

# Euclidean distance is the default, so it is not passed explicitly; the
# affinity='euclidean' argument used in older code was later renamed to
# metric and removed from recent scikit-learn releases
ac = AgglomerativeClustering(n_clusters=2, linkage='complete')
labels = ac.fit_predict(X)
print('Cluster labels: %s' % labels)

# for comparison, the same tree cut into three clusters
ac3 = AgglomerativeClustering(n_clusters=3, linkage='complete')
labels3 = ac3.fit_predict(X)
print('Cluster labels: %s' % labels3)

df # <== X

Looking at the predicted cluster labels for n_clusters=2, we can see that the first, fourth, and fifth sample (ID_0, ID_3, and ID_4) were assigned to one cluster (0), and the samples ID_1 and ID_2 were assigned to a second cluster (1), which is consistent with the results that we can observe in the dendrogram.
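The agreement between the two libraries can also be verified in code. The sketch below is an illustration under the same assumptions as before (five random samples, complete linkage); it cuts the SciPy tree into two flat clusters with fcluster and compares the partition with the one from scikit-learn:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

rng = np.random.RandomState(123)
X_demo = rng.random_sample([5, 3]) * 10

# SciPy: build the full tree, then cut it into at most two flat clusters
Z = linkage(pdist(X_demo, metric='euclidean'), method='complete')
scipy_labels = fcluster(Z, t=2, criterion='maxclust')

# scikit-learn: Euclidean distance is the default, so it is not passed here
ac_demo = AgglomerativeClustering(n_clusters=2, linkage='complete')
sklearn_labels = ac_demo.fit_predict(X_demo)

print(scipy_labels, sklearn_labels)
# An adjusted Rand index of 1.0 means the partitions are identical
# up to a renaming of the cluster labels
print(adjusted_rand_score(scipy_labels, sklearn_labels))
```

The label values themselves may differ between the two libraries (SciPy numbers flat clusters from 1, scikit-learn from 0), which is why the comparison uses a score that ignores label names.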

Locating regions of high density via DBSCAN

Although we can't cover the vast number of different clustering algorithms in this chapter, let's at least introduce one more approach to clustering: Density-based Spatial Clustering of Applications with Noise (DBSCAN). The notion of density in DBSCAN is defined as the number of data points within a specified radius ε (epsilon).

In DBSCAN, a special label is assigned to each sample (point) using the following criteria:

A point is considered a core point if at least a specified number (MinPts) of neighboring points fall within the specified radius ε.
A border point is a point that has fewer neighbors than MinPts within ε, but lies within the ε radius of a core point.
All other points that are neither core nor border points are considered noise points.

After labeling the points as core, border, or noise points, the DBSCAN algorithm can be summarized in two simple steps:

Form a separate cluster for each core point or connected group of core points (core points are connected if they are no farther away than ε).
Assign each border point to the cluster of its corresponding core point.
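The three point types can be recovered from a fitted scikit-learn DBSCAN model. The following sketch is one way to do it, assuming half-moon toy data with one hypothetical outlier appended so that a noise point is likely to appear. Note that scikit-learn's min_samples counts the point itself among its neighbors, which may differ slightly from other definitions of MinPts:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X_demo, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
# Hypothetical outlier appended so that a noise point is likely to appear
X_demo = np.vstack([X_demo, [[3.0, 3.0]]])

eps, min_pts = 0.2, 5
db_demo = DBSCAN(eps=eps, min_samples=min_pts, metric='euclidean').fit(X_demo)

core_mask = np.zeros(len(X_demo), dtype=bool)
core_mask[db_demo.core_sample_indices_] = True
noise_mask = db_demo.labels_ == -1          # scikit-learn labels noise as -1
border_mask = ~core_mask & ~noise_mask      # in a cluster, but not core

print('core: %d, border: %d, noise: %d'
      % (core_mask.sum(), border_mask.sum(), noise_mask.sum()))

# Cross-check the core-point rule; the point itself is counted
# among its neighbors when comparing against min_samples
n_neighbors = (cdist(X_demo, X_demo) <= eps).sum(axis=1)
print(np.array_equal(core_mask, n_neighbors >= min_pts))
```

Every point falls into exactly one of the three masks, so the three counts add up to the number of samples.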

To get a better understanding of what the result of DBSCAN can look like before jumping to the implementation, let's summarize what you have learned about core points, border points, and noise points in the following figure:

One of the main advantages of using DBSCAN is that it does not assume that the clusters have a spherical shape as in k-means. Furthermore, DBSCAN is different from k-means and hierarchical clustering in that it doesn't necessarily assign each point to a cluster but is capable of removing noise points.

For a more illustrative example, let's create a new dataset of half-moon-shaped structures to compare k-means clustering, hierarchical clustering, and DBSCAN:

from sklearn.datasets import make_moons

X,y = make_moons( n_samples=200, noise=0.05, random_state=0 )

plt.scatter( X[:,0], X[:,1] )
plt.show()

As we can see in the resulting plot, there are two visible, half-moon-shaped groups consisting of 100 sample points each:

We will start by using the k-means algorithm and complete linkage clustering to see if one of those previously discussed clustering algorithms can successfully identify the half-moon shapes as separate clusters. The code is as follows:

f, (ax1, ax2) = plt.subplots(1,2, figsize=(10,3))

km = KMeans( n_clusters=2, random_state=0 )
y_km = km.fit_predict(X)

ax1.scatter( X[y_km==0, 0], X[y_km==0, 1], c='lightblue', edgecolor='black', marker='o', s=40, 
            label='cluster 1' )
ax1.scatter( X[y_km==1, 0], X[y_km==1, 1], c='red', edgecolor='black', marker='s', s=40,
            label='cluster 2' )
ax1.set_title( 'K-means clustering' )


ac = AgglomerativeClustering( n_clusters=2, linkage='complete' ) # Euclidean distance is the default
y_ac = ac.fit_predict(X)

ax2.scatter( X[y_ac==0, 0], X[y_ac==0, 1], c='lightblue', edgecolor='black', marker='o', s=40,
            label='cluster 1' )
ax2.scatter( X[y_ac==1, 0], X[y_ac==1, 1], c='red', edgecolor='black', marker='s', s=40,
            label='cluster 2' )
ax2.set_title('Agglomerative clustering')

plt.legend()
plt.show()

Based on the visualized clustering results, we can see that the k-means algorithm is unable to separate the two clusters, and the hierarchical clustering algorithm was also challenged by those complex shapes:

Finally, let us try the DBSCAN algorithm on this dataset to see if it can find the two half-moon-shaped clusters using a density-based approach:

from sklearn.cluster import DBSCAN

db = DBSCAN( eps=0.2, min_samples=5, metric='euclidean' )
y_db = db.fit_predict(X)

plt.scatter( X[y_db==0,0], X[y_db==0,1], c='lightblue', marker='o', s=40, label='cluster 1')
plt.scatter( X[y_db==1,0], X[y_db==1,1], c='red', marker='s', s=40, label='cluster 2')

plt.legend()
plt.show()

The DBSCAN algorithm can successfully detect the half-moon shapes, which highlights one of the strengths of DBSCAN: clustering data of arbitrary shapes:
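Because make_moons also returns the true group membership, the visual impression can be backed by a number. The sketch below is an illustration only: it uses the adjusted Rand index (ARI), an extrinsic measure that requires ground truth labels and is therefore usually unavailable in real clustering problems. The names models and ari are illustrative:

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X_demo, y_true = make_moons(n_samples=200, noise=0.05, random_state=0)

models = {
    'k-means': KMeans(n_clusters=2, n_init=10, random_state=0),
    'complete linkage': AgglomerativeClustering(n_clusters=2,
                                                linkage='complete'),
    'DBSCAN': DBSCAN(eps=0.2, min_samples=5, metric='euclidean'),
}

ari = {}
for name, model in models.items():
    # ARI is 1.0 for a perfect match and close to 0 for random labels
    ari[name] = adjusted_rand_score(y_true, model.fit_predict(X_demo))
    print('%-17s ARI = %.3f' % (name, ari[name]))
```

With these settings, DBSCAN should score clearly higher than the other two algorithms on this dataset.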

However, we shall also note some of the disadvantages of DBSCAN. With an increasing number of features in our dataset (assuming a fixed number of training examples), the negative effect of the curse of dimensionality increases. This is especially a problem if we are using the Euclidean distance metric. However, the problem of the curse of dimensionality is not unique to DBSCAN; it also affects other clustering algorithms that use the Euclidean distance metric, for example, k-means and hierarchical clustering algorithms. In addition, we have two hyperparameters in DBSCAN (MinPts and ε) that need to be optimized to yield good clustering results. Finding a good combination of MinPts and ε can be problematic if the density differences in the dataset are relatively large.
#######################################
So far, we saw three of the most fundamental categories of clustering algorithms: prototype-based clustering with k-means, agglomerative hierarchical clustering, and density-based clustering via DBSCAN. However, I also want to mention a fourth class of more advanced clustering algorithms that we have not covered in this chapter: graph-based clustering. Probably the most prominent members of the graph-based clustering family are spectral clustering algorithms. Although there are many different implementations of spectral clustering, they all have in common that they use the eigenvectors of a similarity matrix to derive the cluster relationships. Since spectral clustering is beyond the scope of this book, you can read the excellent tutorial by Ulrike von Luxburg to learn more about this topic (U. Von Luxburg. A Tutorial on Spectral Clustering. Statistics and computing, 17(4):395–416, 2007). It is freely available from arXiv at http://arxiv.org/pdf/0711.0189v1.pdf.
#######################################

Note that, in practice, it is not always obvious which clustering algorithm will perform best on a given dataset, especially if the data comes in multiple dimensions that make it hard or impossible to visualize. Furthermore, it is important to emphasize that a successful clustering depends not only on the algorithm and its hyperparameters. Rather, the choice of an appropriate distance metric and the use of domain knowledge that can help guide the experimental setup can be even more important.

In the context of the curse of dimensionality, it is thus common practice to apply dimensionality reduction techniques prior to performing clustering. Such dimensionality reduction techniques for unsupervised datasets include principal component analysis and RBF kernel principal component analysis, which we covered in Chapter 5, Compressing Data via Dimensionality Reduction (https://blog.csdn.net/Linli522362242/article/details/105196037, https://blog.csdn.net/Linli522362242/article/details/105139547, https://blog.csdn.net/Linli522362242/article/details/105722461). Also, it is particularly common to compress datasets down to two-dimensional subspaces, which allows us to visualize the clusters and assigned labels using two-dimensional scatterplots, which are particularly helpful for evaluating the results.
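As a rough illustration of this workflow, the following sketch projects a hypothetical 20-dimensional dataset onto its first two principal components and clusters the projected points with k-means. The dataset, the number of clusters, and all variable names are assumptions made for the example:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Hypothetical 20-dimensional dataset with three groups
X_high, _ = make_blobs(n_samples=300, n_features=20, centers=3,
                       random_state=0)

# Compress the data down to a two-dimensional subspace
X_2d = PCA(n_components=2, random_state=0).fit_transform(X_high)

# Cluster in the reduced space
km_2d = KMeans(n_clusters=3, n_init=10, random_state=0)
y_2d = km_2d.fit_predict(X_2d)

print(X_high.shape, '->', X_2d.shape)
# X_2d[:, 0] and X_2d[:, 1] can now be passed to plt.scatter,
# colored by the cluster labels in y_2d
```

Whether the clustering should be done before or after the projection depends on the data; projecting first is shown here only because it makes the result easy to plot.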

Summary

In this chapter, you learned about three different clustering algorithms that can help us with the discovery of hidden structures or information in data. We started this chapter with a prototype-based approach, k-means, which clusters samples into spherical shapes based on a specified number of cluster centroids. Since clustering is an unsupervised method, we do not enjoy the luxury of ground truth labels to evaluate the performance of a model. Thus, we used intrinsic performance metrics such as the elbow method or silhouette analysis as an attempt to quantify the quality of clustering.

We then looked at a different approach to clustering: agglomerative hierarchical clustering. Hierarchical clustering does not require specifying the number of clusters up front, and the result can be visualized in a dendrogram representation, which can help with the interpretation of the results. The last clustering algorithm that we saw in this chapter was DBSCAN, an algorithm that groups points based on local densities and is capable of handling outliers and identifying non-globular shapes.

After this excursion into the field of unsupervised learning, it is now about time to introduce some of the most exciting machine learning algorithms for supervised learning: multilayer artificial neural networks. After their recent resurgence, neural networks are once again the hottest topic in machine learning research. Thanks to recently developed deep learning algorithms, neural networks are considered state-of-the-art for many complex tasks such as image classification and speech recognition. In Chapter 12, Implementing a Multilayer Artificial Neural Network from Scratch, we will construct our own multilayer neural network from scratch. In Chapter 13, Parallelizing Neural Network Training with TensorFlow, we will introduce powerful libraries that can help us to train complex network architectures most efficiently.

你可能感兴趣的:(cp11_Working with Unlabeled Data_Clustering Analysis_Kmeans_hierarchical_dendrogram_heat map_DBSCAN)

[数据集][目标检测]垃圾检测数据集VOC+YOLO格式6004张18类别垃圾 FL1623863129 数据集目标检测 YOLO 人工智能
数据集格式：PascalVOC格式+YOLO格式(不包含分割路径的txt文件，仅仅包含jpg图片以及对应的VOC格式xml文件和yolo格式txt文件)图片数量(jpg文件个数)：6004标注数量(xml文件个数)：6004标注数量(txt文件个数)：6004标注类别数：18标注类别名称:["bottle_cap","bottle","cup","unlabeled_litter","straw"
【单目深度估计】Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data 十小大单目深度估计计算机视觉图像处理深度学习图像分割论文阅读论文笔记
前言论文题目：DepthAnything:UnleashingthePowerofLarge-ScaleUnlabeledData——任何深度：释放大规模无标记数据的力量论文地址：DepthAnything:UnleashingthePowerofLarge-ScaleUnlabeledData源码地址：https://github.com/LiheYoung/Depth-Anything2024
机器学习---半监督学习（生成式方法）三月七꧁ ꧂ 机器学习机器学习学习人工智能
1.主动学习形式化地看，我们有训练样本集，这l个样本的类别标记（即是否好瓜）已知，称为“有标记”(labeled)样本；此外，还有，这u个样本的类别标记未知（即不知是否好瓜)，称为“未标记”(unlabeled)样本。若直接使用传统监督学习技术，则仅有Dl能用于构建模型，Du所包含的信息被浪费了；另一方面，若Dl较小，则由于训练样本不足，学得模型的泛化能力往往不佳。那么，能否在构建模型的过程中将D
Analysis of Learning from Positive and Unlabeled Data zealscott
PUlearning论文阅读。本文从基本的分类损失出发，推导了PU的分类问题其实就是Cost-sensitiveclassiﬁcation的形式，同时，通过实验证明了如果使用凸函数作为lossfunction，例如hingeloss会导致错误的分类边界（有bias），因此需要使用例如ramploss之类的凹函数。同时，论文还对先验存在偏差的情况进行了讨论，说明了如果样本中大部分都是正样本，那么就算
论文笔记：SelfHAR: Improving Human Activity Recognition through Self-training with Unlabeled Data UQI-LIUWJ 论文笔记论文阅读
Proc.ACMInteract.Mob.WearableUbiquitousTechnol.20211intro1.1背景——人类活动识别（HAR）旨在准确分类人类的物理活动传统方法——依赖于滑动窗口分割和手工特征提取，然后通过各种监督学习技术来识别简单和复杂的活动，如行走、跑步、骑自行车深度学习方法自动提取目标任务的有用特征——>更有效两种方法的局限性受到常规实验室HAR数据集引入的偏见和限制
Leveraging Unlabeled Data for Crowd Counting by Learning to Rank Nightmare004 深度学习人群技术方法
无标签人群技术，作者引入了一种排名。利用的是一个图的人群数量一定小于等于包含这个图的图生成排名数据集作者提出了一种自监督任务，利用的是一个图的人群数量一定小于等于包含这个图的图流程：1.以图像中心为中心，划分一个1/r1/r1/r图像大小的矩形（但是这里没写是面积的还是长宽的）在这个矩形中，随机选择一个点当作锚点2.以锚点为中心，找到一个不超过图像边界的正方形3.重复k−1k-1k−1次，每次生成
Convex Formulation for Learning from Positive and Unlabeled Data zealscott
UnbiasedPUlearning.该论文在之前PUlearning中使用非凸函数作为loss的基础上，对正类样本和未标记样本使用不同的凸函数loss，从而将其转为凸优化问题。结果表明，该loss（doublehingeloss）与非凸loss（ramp）精度几乎一致，但大大减少了计算量。IntrodutionBackground论文首先强调了PU问题的重要性，举了几个例子：Automaticf
Learning Classiﬁers from Only Positive and Unlabeled Data zealscott
PUlearning经典论文。本文主要考虑在SCAR假设下，证明了普通的分类器和PU分类器只相差一个常数，因此可以使用普通分类器的方法来估计，进而得到。同时提供了三种方法来估计这个常数，最后，还对先验的估计提供了思路。Learningatraditionalclassifier概念定义表示一个样本，表示其label（0或者1），表示是否被select那么，在PU问题中，当时，一定有一定成立两种采样
半监督学习思路学习记录 yzZ_here 学习深度学习机器学习计算机视觉人工智能
半监督学习思路整理一、半监督学习思路semi-supervisedlearning（SSL）需要明确的知识点：1、首先确定训练集中包含两种数据：labeled和unlabeled。2、我们最终目的是得到一个分类器，即网络模型。3、训练结束的条件可以是将无标签数据作为网络的输入，得到输出的预测标签，在一定置信度之内的数据可以划分到有标签数据中，直到训练集中的数据都有了标签，此时可以认为分类器就是最终
人群密度估计--Leveraging Unlabeled Data for Crowd Counting by Learning to Rank O天涯海阁O 人群分析人群分析
LeveragingUnlabeledDataforCrowdCountingbyLearningtoRankCVPR2018https://github.com/xialeiliu/CrowdCountingCVPR18本文针对人群密度估计训练数据库规模很小的问题提出了使用未标定数据来self-supervised，具体通过LearningtoRank人群密度估计数据库规模很小的主要原因是图像标
Paddle训练COCO-stuff数据集学习记录彭祥. 环境配置学习记录深度学习 paddle 学习
COCO-stuff数据集COCO-Stuff数据集对COCO数据集中全部164K图片做了像素级的标注。80thingclasses,91stuffclassesand1class‘unlabeled’数据集下载wget--directory-prefix=downloadshttp://images.cocodataset.org/zips/train2017.zipwget--director
MURAUER: Mapping Unlabeled Real Data for Label AUstERity总结中了胖毒
文章链接摘要用于学习三维手部姿势估计模型的数据标记是一项巨大的工作。由于合成数据和真实数据存在'domaingap'，直接使用现成的、准确的模拟合成数据效果不好。然而，要成功地利用合成数据，目前最先进的方法仍然需要大量标记的真实数据结合训练。本文通过学习从真实数据的特征映射到合成数据的特征来消除'domaingap'，并使用大量的同手势双视角未标记的真实数据训练网络，改善性能。关键手势预测使用大量
nginx php 开启 SELinux 提示 Forbbiden 403 紅塵忘
查看nginxerror.log查看到index.phppremissiondead查看SELinuxlog/var/log/aduit.log提示denied{open}scontext=system_u:system_r:httpd_t:s0tcontext=system_u:object_r:unlabeled_t:s0tclass=file查看nginx默认文件Securitycontex
语义分割算法RangeNet++语义标签颜色对应关系 RobotsRuning c++
NumLabelIDLabelNameColorMap(RGB)10unlabeled（未标记的）[0,0,0]21outlier（离群值）[255,0,0]310car（汽车）[100,150,245]411bicycle（自行车）[100,230,245]513bus（公交车）[100,80,250]615motorcycle（摩托车）[30,60,150]716on-rails（轨道交通）[
PU learning 算法笔记1-- 论文《Learning Classifiers from Only Positive and Unlabeled Data》中的方法 beingstrong 机器学习算法笔记机器学习 PU learning
PUlearning（Positive-unlabeledlearning）是当样本集中只有部分标注好的正样本和其余未标注的样本时，如何学习一个二分类器。这篇笔记记录一下论文《LearningClassifiersfromOnlyPositiveandUnlabeledData》中提出的一种PUlearning方法。设xxx是一个样本，而y∈{0,1}y\in\{0,1\}y∈{0,1}是二元标签
论文阅读：Unsupervised Domain Adaptation for Face Recognition in Unlabeled Videos yangfuweivip paper face recognition domain adaptation
概述：UnsupervisedDomainAdaptationforFaceRecognitioninUnlabeledVideos，ICCV2017的文章，实现的是用domainadaptation技术将没有label的视频数据迁移到图片识别网络中。作者：URL：http://openaccess.thecvf.com/content_ICCV_2017/papers/Sohn_Unsuperv
[半监督学习] Combining Labeled and Unlabeled Data with Co-Training 码侯烧酒论文机器学习
论文地址:CombiningLabeledandUnlabeledDatawithCo-Training会议:COLT1998任务:分类AFORMALFRAMEWORK定义一个实例空间X=X1×X2X=X_1\timesX_2X=X1×X2,其中X1X_1X1,X2X_2X2对应于同一实例的两个不同"视图".实例里的每个xxx都以成对的形式(x1,x2)(x_1,x_2)(x1,x2)给出.假定每
随机森林 Word2Vec 文本分类 Track48 自然语言处理 Python
数据集是来自kagglesemanticclassification任务的1、加载文件importpandasaspdtrain=pd.read_csv(r"labeledTrainData\labeledTrainData.tsv",header=0,delimiter="\t",quoting=3)unlabeled=pd.read_csv(r"unlabeledTrainData\unlab
From Synthetic to Real: Image Dehazing Collaborating with Unlabeled Real Data从合成到真实：图像去杂化使用未标记的真实数迁凉计算机视觉深度学习机器学习
问题：合成训练数据和真实世界测试图像之间的域转移通常会导致现有方法的退化方法：我们提出了一种与未标记的真实数据协作的新型图像去雾框架贡献：•我们提出了一个图像去雾框架，以利用解开的特征表示和未标记的真实世界模糊图像来增强单图像去雾。•我们设计了一个解纠缠图像去雾网络（DID-Net），通过从粗到细的策略来预测传输图、潜在无雾图像和大气光照图。•使用解纠缠一致性均值教师网络(DMTNet)来协作标记
开放集领域自适应OSDA（十三）：Positive-unlabeled learning for open set domain adaptation CtrlZ1 开放集领域自适应人工智能深度学习机器学习领域自适应
文章目录前言摘要1Introduction2.Relatedwork3.Preliminarybackground4.Autoencoder-basedclassificationlossandnnPUrisk5.Open-setdomainadaptationasaPUproblem6.Experiments6.1.Datasets6.2.Opensetmetrics6.3.Results结论前