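Among the many clustering techniques available in scikit-learn, we use Affinity Propagation, because it does not enforce equal-sized clusters and can determine the number of clusters automatically from the data.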


  • Embedding in 2D space

For visualization, we need to lay out the different stocks on a 2D canvas. To do so, we use manifold learning techniques to compute a 2D embedding.
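A minimal sketch of such an embedding, assuming random data in place of real price variations: each of 30 hypothetical "stocks" is a 100-dimensional vector, and Locally Linear Embedding maps each one to a point in the plane. The array sizes and seed are arbitrary choices for the example.

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

rng = np.random.RandomState(0)
# 30 hypothetical stocks, each described by 100 daily variations.
series = rng.normal(size=(30, 100))

# The dense solver is deterministic; n_neighbors controls how local the fit is.
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=6, eigen_solver="dense")
coords = lle.fit_transform(series)
print(coords.shape)  # (30, 2): one (x, y) position per stock
```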

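  • Visualization

The outputs of the three models are combined in a 2D graph in which nodes represent the stocks:

  1. Cluster labels define the node colors
  2. The sparse covariance model determines the edge strengths
  3. The 2D embedding positions the nodes in the plane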
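To make item 2 concrete: edge strengths are derived from the precision (inverse covariance) matrix by rescaling it so that its diagonal becomes 1. The off-diagonal entries of the result are, up to sign, the partial correlations between pairs of variables. The sketch below uses a small hand-written precision matrix as an assumption; in the full listing the matrix comes from the fitted sparse covariance model.

```python
import numpy as np

# Hypothetical precision matrix for a chain 0 - 1 - 2:
# variables 0 and 2 are conditionally independent given variable 1.
precision = np.array([[2.0, -1.0, 0.0],
                      [-1.0, 2.0, -1.0],
                      [0.0, -1.0, 2.0]])

# Scale columns by d, then rows by d, so entry (i, j) becomes
# P_ij / sqrt(P_ii * P_jj).
d = 1 / np.sqrt(np.diag(precision))
normalized = precision * d * d[:, np.newaxis]

# Keep the strict upper triangle and drop near-zero links.
edges = np.abs(np.triu(normalized, k=1)) > 0.02
start_idx, end_idx = np.where(edges)
print([(int(i), int(j)) for i, j in zip(start_idx, end_idx)])
```

Only the pairs (0, 1) and (1, 2) survive the threshold, so the drawn graph has no edge between variables 0 and 2.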

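This example contains a fair amount of visualization-related code, because visualization is crucial for displaying the graph. One of the challenges is to position the labels so as to minimize overlap. For this, we use a heuristic based on the direction of the nearest neighbor along each axis.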
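The heuristic can be sketched in isolation as follows. The helper name `label_alignment` and the three sample points are hypothetical. One deliberate difference from the full listing: the point itself is excluded with `np.inf` rather than the constant 1, so the sketch does not depend on the coordinates being small.

```python
import numpy as np

def label_alignment(index, xs, ys):
    """Choose text alignment so the label extends away from nearby points."""
    dx = xs[index] - xs
    dy = ys[index] - ys
    dx[index] = np.inf  # exclude the point itself from the searches below
    dy[index] = np.inf
    # Horizontal offset to the neighbor closest in height, and
    # vertical offset to the neighbor closest in horizontal position.
    this_dx = dx[np.argmin(np.abs(dy))]
    this_dy = dy[np.argmin(np.abs(dx))]
    horizontal = "left" if this_dx > 0 else "right"
    vertical = "bottom" if this_dy > 0 else "top"
    return horizontal, vertical

xs = np.array([0.0, 0.3, 0.1])
ys = np.array([0.0, 0.05, 0.4])
alignments = [label_alignment(i, xs, ys) for i in range(len(xs))]
print(alignments)
```

An alignment of 'left' anchors the left edge of the text at the point, so the label extends to the right, away from a neighbor that lies to the left at a similar height.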


Cluster 1: Apple, Amazon, Yahoo
Cluster 2: Comcast, Cablevision, Time Warner
Cluster 3: ConocoPhillips, Chevron, Total, Valero Energy, Exxon
Cluster 4: Cisco, Dell, HP, IBM, Microsoft, SAP, Texas Instruments
Cluster 5: Boeing, General Dynamics, Northrop Grumman, Raytheon
Cluster 6: AIG, American express, Bank of America, Caterpillar, CVS, DuPont de Nemours, Ford, General Electrics, Goldman Sachs, Home Depot, JPMorgan Chase, Marriott, 3M, Ryder, Wells Fargo, Wal-Mart
Cluster 7: McDonald's
Cluster 8: GlaxoSmithKline, Novartis, Pfizer, Sanofi-Aventis, Unilever
Cluster 9: Kellogg, Coca Cola, Pepsi
Cluster 10: Colgate-Palmolive, Kimberly-Clark, Procter Gamble
Cluster 11: Canon, Honda, Navistar, Sony, Toyota, Xerox


from __future__ import print_function   # only needed when running on Python 2.6; Python 3.x does not need this line
import sys
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection   # see the reference on the structure and usage of matplotlib.collections.LineCollection
import pandas as pd
from sklearn import cluster, covariance, manifold  # see References 3, 4, and 5
print(__doc__)   # print the module docstring

# #############################################################################
# Retrieve the data from Internet

# The data is from 2003 - 2008. This is reasonably calm: (not too long ago so
# that we get high-tech firms, and before the 2008 crash). This kind of
# historical data can be obtained from APIs like the quandl.com and
# alphavantage.co ones.

symbol_dict = {
    'TOT': 'Total',
    'XOM': 'Exxon',
    'CVX': 'Chevron',
    'COP': 'ConocoPhillips',
    'VLO': 'Valero Energy',
    'MSFT': 'Microsoft',
    'IBM': 'IBM',
    'TWX': 'Time Warner',
    'CMCSA': 'Comcast',
    'CVC': 'Cablevision',
    'YHOO': 'Yahoo',
    'DELL': 'Dell',
    'HPQ': 'HP',
    'AMZN': 'Amazon',
    'TM': 'Toyota',
    'CAJ': 'Canon',
    'SNE': 'Sony',
    'F': 'Ford',
    'HMC': 'Honda',
    'NAV': 'Navistar',
    'NOC': 'Northrop Grumman',
    'BA': 'Boeing',
    'KO': 'Coca Cola',
    'MMM': '3M',
    'MCD': 'McDonald\'s',
    'PEP': 'Pepsi',
    'K': 'Kellogg',
    'UN': 'Unilever',
    'MAR': 'Marriott',
    'PG': 'Procter Gamble',
    'CL': 'Colgate-Palmolive',
    'GE': 'General Electrics',
    'WFC': 'Wells Fargo',
    'JPM': 'JPMorgan Chase',
    'AIG': 'AIG',
    'AXP': 'American express',
    'BAC': 'Bank of America',
    'GS': 'Goldman Sachs',
    'AAPL': 'Apple',
    'SAP': 'SAP',
    'CSCO': 'Cisco',
    'TXN': 'Texas Instruments',
    'XRX': 'Xerox',
    'WMT': 'Wal-Mart',
    'HD': 'Home Depot',
    'GSK': 'GlaxoSmithKline',
    'PFE': 'Pfizer',
    'SNY': 'Sanofi-Aventis',
    'NVS': 'Novartis',
    'KMB': 'Kimberly-Clark',
    'R': 'Ryder',
    'GD': 'General Dynamics',
    'RTN': 'Raytheon',
    'CVS': 'CVS',
    'CAT': 'Caterpillar',
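    'DD': 'DuPont de Nemours'}   # 56 entries mapping ticker symbol to company name, stored as a dict

symbols, names = np.array(sorted(symbol_dict.items())).T   # turn symbol_dict into sorted (key, value) pairs, build a 56×2 array and transpose it to 2×56; unpacking yields two numpy arrays

quotes = []   # list that will hold the quote history of each symbol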

for symbol in symbols:
    print('Fetching quote history for %r' % symbol, file=sys.stderr) # see Reference 6
    url = ('https://raw.githubusercontent.com/scikit-learn/examples-data/'
           'master/financial-data/{}.csv')   # see Reference 7; replace raw.githubusercontent.com with github.com to browse the data as a normal page
    quotes.append(pd.read_csv(url.format(symbol)))   # see Reference 8 on str.format(); quotes is a list of DataFrames, each holding the full history (date, open, close) of one symbol

close_prices = np.vstack([q['close'] for q in quotes])   # see Reference 9; stacks the series into an array of shape (n_stocks, n_days), one row of closing prices per stock
open_prices = np.vstack([q['open'] for q in quotes])

# The daily variations of the quotes are what carry most information
variation = close_prices - open_prices   # close minus open: the daily variation carries the information

# #############################################################################
# Learn a graphical structure from the correlations
edge_model = covariance.GraphicalLassoCV(cv=5)   # instantiate a GraphicalLassoCV estimator; see Reference 10 for the Lasso and Reference 14 for GraphicalLassoCV

# standardize the time series: using correlations rather than covariance
# is more efficient for structure recovery
X = variation.copy().T
X /= X.std(axis=0)
edge_model.fit(X)   # fit the sparse covariance model; covariance_ and precision_ are used below

# #############################################################################
# Cluster using affinity propagation

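_, labels = cluster.affinity_propagation(edge_model.covariance_)   # returns the indices of the cluster centers and the cluster label of each sample; see Reference 11
n_labels = labels.max() # largest label value; labels are consecutive integers starting at 0

for i in range(n_labels + 1): # iterate over every label, 0 through n_labels inclusive
    print('Cluster %i: %s' % ((i + 1), ', '.join(names[labels == i]))) # list the members of each cluster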

# #############################################################################
# Find a low-dimension embedding for visualization: find the best position of
# the nodes (the stocks) on a 2D plane

# We use a dense eigen_solver to achieve reproducibility (arpack is
# initiated with random vectors that we don't control). In addition, we
# use a large number of neighbors to capture the large-scale structure.
node_position_model = manifold.LocallyLinearEmbedding(
    n_components=2, eigen_solver='dense', n_neighbors=6)   # see References 5 and 13; use 6 neighbors and reduce to 2 dimensions

embedding = node_position_model.fit_transform(X.T).T   # fit the model and apply the reduction; the result has shape (2, n_stocks)

# #############################################################################
# Visualization
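plt.figure(1, facecolor='w', figsize=(10, 8)) # create the figure with a white background, 10 x 8 inches
plt.clf()   # clear the current figure
ax = plt.axes([0., 0., 1., 1.])   # create an Axes that fills the whole figure
plt.axis('off')   # hide the axis lines and ticks; see Reference 12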

# Display a graph of the partial correlations
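partial_correlations = edge_model.precision_.copy() # start from the precision matrix to derive partial correlations
d = 1 / np.sqrt(np.diag(partial_correlations))  # see Reference 15
partial_correlations *= d
partial_correlations *= d[:, np.newaxis]  # see Reference 16; reshapes d into an n×1 column so the rows are scaled too
non_zero = (np.abs(np.triu(partial_correlations, k=1)) > 0.02) # see Reference 17; keep the strict upper triangle and flag entries whose magnitude exceeds 0.02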

# Plot the nodes using the coordinates of our embedding
plt.scatter(embedding[0], embedding[1], s=100 * d ** 2, c=labels,
            cmap=plt.cm.nipy_spectral)   # see References 18 and 19

# Plot the edges
start_idx, end_idx = np.where(non_zero) # see Reference 20; row and column indices of the True entries in non_zero
# a sequence of (*line0*, *line1*, *line2*), where::
#            linen = (x0, y0), (x1, y1), ... (xm, ym)
segments = [[embedding[:, start], embedding[:, stop]]
            for start, stop in zip(start_idx, end_idx)]
values = np.abs(partial_correlations[non_zero])
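lc = LineCollection(segments,
                    zorder=0, cmap=plt.cm.hot_r,
                    norm=plt.Normalize(0, .7 * values.max()))   # see Reference 2
lc.set_array(values)   # color each edge by the magnitude of its partial correlation
lc.set_linewidths(15 * values)
ax.add_collection(lc)   # attach the edges to the axes; without this call they are never drawn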

# Add a label to each node; the challenge is to keep the labels from overlapping
for index, (name, label, (x, y)) in enumerate(
        zip(names, labels, embedding.T)):

    dx = x - embedding[0]
    dx[index] = 1
    dy = y - embedding[1]
    dy[index] = 1
    this_dx = dx[np.argmin(np.abs(dy))]   # see Reference 21
    this_dy = dy[np.argmin(np.abs(dx))]
    if this_dx > 0:
        horizontalalignment = 'left'
        x = x + .002
    else:
        horizontalalignment = 'right'
        x = x - .002
    if this_dy > 0:
        verticalalignment = 'bottom'
        y = y + .002
    else:
        verticalalignment = 'top'
        y = y - .002
    plt.text(x, y, name, size=10,
             horizontalalignment=horizontalalignment,
             verticalalignment=verticalalignment,
             bbox=dict(facecolor='w',
                       edgecolor=plt.cm.nipy_spectral(label / float(n_labels)),
                       alpha=.6))   # draw the label text in a boxed frame; see Reference 22

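# np.ptp() is used instead of the ndarray.ptp() method, which was removed in NumPy 2.0
plt.xlim(embedding[0].min() - .15 * np.ptp(embedding[0]),
         embedding[0].max() + .10 * np.ptp(embedding[0]),)   # set the x-axis range; see Reference 23
plt.ylim(embedding[1].min() - .03 * np.ptp(embedding[1]),
         embedding[1].max() + .03 * np.ptp(embedding[1]))

plt.show()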


Download the Python source code: plot_stock_market.py.


  • Visualizing the structure of CSI 300 stocks



