[Python Machine Learning] K-Means Clustering and Principal Component Analysis (with Source Code)


Starting this week, we are publishing a series on machine learning in Python. To keep the material in its original flavor, the posts are in English. We hope you get something out of them and improve your English reading and research skills along the way.

K-means clustering

To start out we're going to implement and apply K-means to a simple 2-dimensional data set to gain some intuition about how it works. K-means is an iterative, unsupervised clustering algorithm that groups similar instances together into clusters. The algorithm starts by guessing the initial centroids for each cluster, and then repeatedly assigns instances to the nearest cluster and re-computes the centroid of that cluster. The first piece that we're going to implement is a function that finds the closest centroid for each instance in the data.
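The original post shows the code only as screenshots, so here is a minimal sketch of what such a function might look like, assuming `X` is an m x n NumPy array of examples and `centroids` is a k x n array:

```python
import numpy as np

def find_closest_centroids(X, centroids):
    """Return, for each row of X, the index of the nearest centroid."""
    m = X.shape[0]
    idx = np.zeros(m, dtype=int)

    for i in range(m):
        # squared Euclidean distance from example i to every centroid
        distances = np.sum((centroids - X[i, :]) ** 2, axis=1)
        idx[i] = np.argmin(distances)

    return idx
```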


The output matches the expected values in the text (remember our arrays are zero-indexed instead of one-indexed so the values are one lower than in the exercise). Next we need a function to compute the centroid of a cluster. The centroid is simply the mean of all of the examples currently assigned to the cluster.
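Again, the original appears only as an image; here is a sketch under the same assumptions:

```python
def compute_centroids(X, idx, k):
    """Recompute each centroid as the mean of the examples assigned to it."""
    n = X.shape[1]
    centroids = np.zeros((k, n))

    for i in range(k):
        members = X[idx == i]
        if len(members) > 0:  # guard against empty clusters
            centroids[i, :] = members.mean(axis=0)

    return centroids
```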


This output also matches the expected values from the exercise. So far so good. The next part involves actually running the algorithm for some number of iterations and visualizing the result. This step was implemented for us in the exercise, but since it's not that complicated I'll build it here from scratch. In order to run the algorithm we just need to alternate between assigning examples to the nearest cluster and re-computing the cluster centroids.
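A sketch of that loop, reusing the two functions above and assuming `X` holds the 2-dimensional data set mentioned earlier. The fixed iteration count mirrors the exercise's approach (a production version would also check for convergence), and the hand-picked starting points are purely illustrative:

```python
import matplotlib.pyplot as plt

def run_k_means(X, initial_centroids, max_iters):
    """Alternate between the assignment and update steps."""
    centroids = initial_centroids
    for _ in range(max_iters):
        idx = find_closest_centroids(X, centroids)
        centroids = compute_centroids(X, idx, centroids.shape[0])
    return idx, centroids

# run 3-cluster k-means from hand-picked starting points (an arbitrary choice)
initial_centroids = np.array([[3.0, 3.0], [6.0, 2.0], [8.0, 5.0]])
idx, centroids = run_k_means(X, initial_centroids, 10)

# plot each cluster in its own color
for i in range(3):
    cluster = X[idx == i]
    plt.scatter(cluster[:, 0], cluster[:, 1])
plt.show()
```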


One step we skipped over is a process for initializing the centroids. This can affect the convergence of the algorithm. We're tasked with creating a function that selects random examples and uses them as the initial centroids.
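A sketch of such an initializer, sampling k distinct rows of the data:

```python
def init_centroids(X, k):
    """Pick k distinct examples at random as the initial centroids."""
    indices = np.random.permutation(X.shape[0])[:k]
    return X[indices, :]
```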


Our next task is to apply K-means to image compression. The intuition here is that we can use clustering to find a small number of colors that are most representative of the image, and map the original 24-bit colors to a lower-dimensional color space using the cluster assignments. Here's the image we're going to compress.

[Figure: the original image to be compressed]

Now we need to apply some pre-processing to the data and feed it into the K-means algorithm.
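Here is a sketch of the whole pipeline. The file name `bird_small.mat` and the key `'A'` are assumptions based on the course's exercise data; adjust them to wherever your copy of the image lives. Compressing to 16 colors corresponds to going from 24 bits per pixel down to 4:

```python
from scipy.io import loadmat
import matplotlib.pyplot as plt

# load the image (file name and key are assumptions)
A = loadmat('data/bird_small.mat')['A']

# scale the 8-bit channel values to [0, 1] and flatten to one row per pixel
A = A / 255.0
X = A.reshape(-1, 3)

# cluster the pixel colors into 16 groups
idx, centroids = run_k_means(X, init_centroids(X, 16), 10)

# replace every pixel with its centroid's color and restore the image shape
X_recovered = centroids[find_closest_centroids(X, centroids), :]
plt.imshow(X_recovered.reshape(A.shape))
plt.show()
```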


Cool! You can see that we created some artifacts in the compression but the main features of the image are still there. That's it for K-means. We'll now move on to principal component analysis.

Principal component analysis

PCA is a linear transformation that finds the "principal components", or directions of greatest variance, in a data set. It can be used for dimension reduction among other things. In this exercise we're first tasked with implementing PCA and applying it to a simple 2-dimensional data set to see how it works. Let's start off by loading and visualizing the data set.
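A sketch of the loading step; the file name `ex7data1.mat` and the key `'X'` are assumptions based on the exercise's data layout:

```python
from scipy.io import loadmat
import matplotlib.pyplot as plt

X = loadmat('data/ex7data1.mat')['X']  # file name is an assumption

plt.scatter(X[:, 0], X[:, 1])
plt.show()
```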


The algorithm for PCA is fairly simple. After ensuring that the data is normalized, the output is simply the singular value decomposition of the covariance matrix of the original data.
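A minimal sketch: normalize the data to zero mean and unit variance, form the covariance matrix, and take its SVD. The columns of `U` are the principal components:

```python
import numpy as np

def pca(X):
    """SVD of the covariance matrix of X (X is assumed already normalized)."""
    m = X.shape[0]
    cov = (X.T @ X) / m
    U, S, V = np.linalg.svd(cov)
    return U, S, V

# normalize, then run PCA
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, V = pca(X_norm)
```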


Now that we have the principal components (matrix U), we can use these to project the original data into a lower-dimensional space. For this task we'll implement a function that computes the projection and selects only the top K components, effectively reducing the number of dimensions.
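Because the columns of `U` are the directions of greatest variance, projecting just means keeping the first k columns and multiplying:

```python
def project_data(X, U, k):
    """Project X onto the top k principal components."""
    return X @ U[:, :k]

Z = project_data(X_norm, U, 1)  # reduce the 2-D data to one dimension
```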


We can also attempt to recover the original data by reversing the steps we took to project it.
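Since `U` is orthonormal, multiplying by the transpose of the truncated matrix approximately undoes the projection; plotting the result next to the original data makes the information loss visible:

```python
def recover_data(Z, U, k):
    """Map projected data back into the original space (approximately)."""
    return Z @ U[:, :k].T

X_recovered = recover_data(Z, U, 1)

plt.scatter(X_norm[:, 0], X_norm[:, 1], label='original (normalized)')
plt.scatter(X_recovered[:, 0], X_recovered[:, 1], label='recovered')
plt.legend()
plt.show()
```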


Notice that the projection axis for the first principal component was basically a diagonal line through the data set. When we reduced the data to one dimension, we lost the variations around that diagonal line, so in our reproduction everything falls along that diagonal.

Our last task in this exercise is to apply PCA to images of faces. By using the same dimension reduction techniques we can capture the "essence" of the images using much less data than the original images.


The exercise code includes a function that will render the first 100 faces in the data set in a grid. Rather than try to reproduce that here, you can look in the exercise text for an example of what they look like. We can at least render one image fairly easily, though.
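A sketch of rendering a single face. The file name `ex7faces.mat` and the key `'X'` are assumptions; each row holds a 32 x 32 grayscale image flattened to 1,024 values:

```python
X = loadmat('data/ex7faces.mat')['X']  # file name is an assumption

# pick an arbitrary face; the data appears to be stored column-major (MATLAB
# convention), which is why a default row-major reshape renders it sideways
face = X[3, :].reshape(32, 32)
plt.imshow(face, cmap='gray')
plt.show()
```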


Yikes, that looks awful. These are only 32 x 32 grayscale images, though (it's also rendering sideways, but we can ignore that for now). Anyway, let's proceed. Our next step is to run PCA on the faces data set and take the top 100 principal components.
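Reusing the functions from the 2-D example: normalize, run PCA, and keep the first 100 components, taking each face from 1,024 values down to 100:

```python
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, V = pca(X_norm)
Z = project_data(X_norm, U, 100)
```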


Now we can attempt to recover the original structure and render it again.
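A sketch of the recovery and render, following the same pattern as before:

```python
X_recovered = recover_data(Z, U, 100)
plt.imshow(X_recovered[3, :].reshape(32, 32), cmap='gray')
plt.show()
```

The recovered face keeps the broad structure but loses the fine detail, which is what you'd expect after keeping only 100 of the original 1,024 dimensions.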

