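This article walks you through a simple analysis of S&P 100 data with Python. You will learn:
- NumPy arrays and array arithmetic
- Filtering data with boolean indexing
- Drawing scatter plots and histograms
S&P 100 data
The Standard & Poor's 100 index measures the stock performance of large companies. It consists of 100 major companies across multiple sectors. The figure below shows the sector breakdown of the S&P 100 in 2017.
The data analyzed in this article is shown in the table below. It has four columns: company name (Name), sector (Sector), share price (Price), and earnings per share (EPS).
We store these four columns in four separate Python lists.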
names = ['Apple Inc', 'Abbvie Inc', 'Abbott Laboratories', 'Accenture Plc', 'Allergan Plc', 'American International Group', 'Allstate Corp', 'Amgen', 'Amazon.Com Inc.', 'American Express Company', 'Boeing Company', 'Bank of America Corp', 'Biogen Inc', 'Bank of New York Mellon Corp', 'Blackrock', 'Bristol-Myers Squibb Company', 'Berkshire Hath Hld B', 'Citigroup Inc', 'Caterpillar Inc', 'Celgene Corp', 'Charter Communicatio', 'Colgate-Palmolive Company', 'Comcast Corp A', 'Capital One Financial Corp', 'Conocophillips', 'Costco Wholesale', 'Cisco Systems Inc', 'CVS Corp', 'Chevron Corp', 'Danaher Corp', 'Walt Disney Company', 'Duke Energy Corp', 'Dowdupont Inc.', 'Emerson Electric Company', 'Exelon Corp', 'Ford Motor Company', 'Facebook Inc', 'Fedex Corp', '21st Centry Fox Class B', '21st Centry Fox Class A', 'General Dynamics Corp', 'General Electric Company', 'Gilead Sciences Inc', 'General Motors Company', 'Alphabet Class C', 'Alphabet Class A', 'Goldman Sachs Group', 'Halliburton Company', 'Home Depot', 'Honeywell International Inc', 'International Business Machines', 'Intel Corp', 'Johnson & Johnson', 'JP Morgan Chase & Co', 'Kraft Heinz Co', 'Kinder Morgan', 'Coca-Cola Company', 'Eli Lilly and Company', 'Lockheed Martin Corp', "Lowe's Companies", 'Mastercard Inc', "McDonald's Corp", 'Mondelez Intl Cmn A', 'Medtronic Inc', 'Metlife Inc', '3M Company', 'Altria Group', 'Monsanto Company', 'Merck & Company', 'Morgan Stanley', 'Microsoft Corp', 'Nextera Energy', 'Nike Inc', 'Oracle Corp', 'Occidental Petroleum Corp', 'Priceline Group', 'Pepsico Inc', 'Pfizer Inc', 'Procter & Gamble Company', 'Philip Morris International Inc', 'Paypal Holdings', 'Qualcomm Inc', 'Raytheon Company', 'Starbucks Corp', 'Schlumberger N.V.', 'Southern Company', 'Simon Property Group', 'AT&T Inc', 'Target Corp', 'Time Warner Inc', 'Texas Instruments', 'Unitedhealth Group Inc', 'Union Pacific Corp', 'United Parcel Service', 'U.S. Bancorp', 'United Technologies Corp', 'Visa Inc', 'Verizon Communications Inc', 'Walgreens Boots Alliance', 'Wells Fargo & Company', 'Wal-Mart Stores', 'Exxon Mobil Corp']
prices = [170.12, 93.29, 55.28, 145.3, 171.81, 59.5, 100.5, 168.93, 1126.82, 93.92, 265.04, 26.7, 311.92, 52.73, 474.05, 60.48, 181.27, 71.87, 137.37, 102.88, 346.2, 72.16, 36.13, 88.26, 49.89, 171.22, 36.38, 70.18, 114.84, 93.45, 103.02, 88.61, 71.12, 60.14, 41.32, 12.11, 179.14, 217.75, 30.42, 31.14, 198.7, 17.91, 71.63, 44.74, 1018.48, 1034.09, 238.05, 41.57, 170.13, 148.04, 151.4, 44.88, 138.54, 98.58, 80.59, 17.04, 45.6, 82.97, 312.93, 81.43, 149.93, 167.01, 42.49, 79.52, 51.85, 232.49, 66.51, 118.19, 53.74, 49.06, 82.49, 155.7, 59.46, 48.97, 68.17, 1762.23, 115.5, 35.38, 88.33, 103.35, 76.55, 66.83, 184.22, 56.83, 61.53, 51.12, 159.25, 34.59, 57.77, 88.62, 98.59, 209.75, 115.58, 113.2, 51.88, 117.05, 110.27, 45.85, 70.25, 54.02, 96.08, 80.31]
earnings = [9.2, 5.31, 2.41, 5.91, 15.42, 2.51, 6.79, 12.58, 3.94, 5.22, 9.75, 1.75, 21.59, 3.47, 21.55, 2.96, 6.29, 5.19, 5.55, 6.4, 1.61, 2.87, 2.02, 7.58, 0.02, 5.82, 2.17, 5.71, 3.57, 3.89, 5.7, 4.45, 3.66, 2.58, 2.48, 1.68, 5.19, 11.91, 1.92, 1.92, 10.07, 1.24, 9.58, 6.19, 29.87, 29.87, 19.2, 0.73, 6.96, 6.95, 13.66, 3.18, 7.14, 6.94, 3.56, 0.65, 1.89, 4.09, 12.72, 4.34, 4.31, 6.4, 2.05, 4.69, 5.2, 8.95, 3.16, 5.53, 3.89, 3.61, 3.38, 6.67, 2.35, 2.55, 0.35, 74.45, 5.12, 2.5, 3.98, 4.49, 1.4, 3.78, 7.56, 2.07, 1.29, 2.75, 6.05, 2.93, 4.93, 6.06, 4.06, 9.6, 5.66, 5.98, 3.37, 6.62, 3.48, 3.75, 5.1, 4.14, 4.36, 3.56]
sectors = ['Information Technology', 'Health Care', 'Health Care', 'Information Technology', 'Health Care', 'Financials', 'Financials', 'Health Care', 'Consumer Discretionary', 'Financials', 'Industrials', 'Financials', 'Health Care', 'Financials', 'Financials', 'Health Care', 'Financials', 'Financials', 'Industrials', 'Health Care', 'Consumer Discretionary', 'Consumer Staples', 'Consumer Discretionary', 'Financials', 'Energy', 'Consumer Staples', 'Information Technology', 'Consumer Staples', 'Energy', 'Health Care', 'Consumer Discretionary', 'Utilities', 'Materials', 'Industrials', 'Utilities', 'Consumer Discretionary', 'Information Technology', 'Industrials', 'Consumer Discretionary', 'Consumer Discretionary', 'Industrials', 'Industrials', 'Health Care', 'Consumer Discretionary', 'Information Technology', 'Information Technology', 'Financials', 'Energy', 'Consumer Discretionary', 'Industrials', 'Information Technology', 'Information Technology', 'Health Care', 'Financials', 'Consumer Staples', 'Energy', 'Consumer Staples', 'Health Care', 'Industrials', 'Consumer Discretionary', 'Information Technology', 'Consumer Discretionary', 'Consumer Staples', 'Health Care', 'Financials', 'Industrials', 'Consumer Staples', 'Materials', 'Health Care', 'Financials', 'Information Technology', 'Utilities', 'Consumer Discretionary', 'Information Technology', 'Energy', 'Consumer Discretionary', 'Consumer Staples', 'Health Care', 'Consumer Staples', 'Consumer Staples', 'Information Technology', 'Information Technology', 'Industrials', 'Consumer Discretionary', 'Energy', 'Utilities', 'Real Estate', 'Telecommunications', 'Consumer Discretionary', 'Consumer Discretionary', 'Information Technology', 'Health Care', 'Industrials', 'Industrials', 'Financials', 'Industrials', 'Information Technology', 'Telecommunications', 'Consumer Staples', 'Financials', 'Consumer Staples', 'Energy']
Let's start by inspecting the data with slicing, for example by looking at the names of the first four companies.
print(names[:4])
['Apple Inc', 'Abbvie Inc', 'Abbott Laboratories', 'Accenture Plc']
Or print all the information for the last company.
print("Company:", names[-1])
print("Price:", prices[-1])
print("Earnings per share:", earnings[-1])
print("Sector:", sectors[-1])
Company: Exxon Mobil Corp
Price: 80.31
Earnings per share: 3.56
Sector: Energy
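Because the four lists are aligned by position, the same index or slice picks out the same company in each of them. A minimal sketch of that idea, using only the first three entries of the data above; the `sample_` names and `records` are introduced here for illustration and are not part of the original walkthrough:

```python
# First three entries of each list, copied from the data above
sample_names = ['Apple Inc', 'Abbvie Inc', 'Abbott Laboratories']
sample_prices = [170.12, 93.29, 55.28]
sample_earnings = [9.2, 5.31, 2.41]
sample_sectors = ['Information Technology', 'Health Care', 'Health Care']

# zip() pairs up the elements that share the same position
records = list(zip(sample_names, sample_sectors, sample_prices, sample_earnings))

# Negative indices count from the end; [-2:] keeps the last two records
for name, sector, price, eps in records[-2:]:
    print(name, sector, price, eps)
```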
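Computing the P/E ratio
The price-to-earnings ratio (P/E ratio) is the share price divided by annual earnings per share (EPS). It is one of the indicators used to judge whether a stock is reasonably priced.
To make the P/E calculation easier, we first convert the data from Python lists to NumPy arrays.
The numpy.array() function creates a NumPy array.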
# Import the scientific computing package NumPy
import numpy as np
# Convert the lists to NumPy arrays
names = np.array(names)
prices = np.array(prices)
earnings = np.array(earnings)
sectors = np.array(sectors)
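After the conversion, each array holds elements of a single data type, which is what makes whole-array arithmetic possible. A small sketch that inspects this on three sample values from the data above; the `sample_` names are hypothetical and used only here:

```python
import numpy as np

# Small samples copied from the lists above
sample_prices = np.array([170.12, 93.29, 55.28])
sample_names = np.array(['Apple Inc', 'Abbvie Inc', 'Abbott Laboratories'])

# Every element of a NumPy array shares one data type
print(sample_prices.dtype)   # a floating-point type
print(sample_names.dtype)    # a fixed-width Unicode string type
print(sample_prices.shape)   # one dimension with three elements
```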
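The advantage of NumPy arrays is that arithmetic operators apply element by element to whole arrays, which Python lists do not support. To compute the P/E ratio pe, for example, we can simply divide the array prices by the array earnings.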
# Compute the price-to-earnings ratio (P/E)
pe = prices / earnings
# Print the first 5 P/E values
print(pe[:5])
[ 18.49130435 17.56873823 22.93775934 24.58544839 11.14202335]
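To make the contrast with plain lists concrete, the sketch below tries the same division on lists and on arrays, using the first three prices and earnings from the data above. The variable names are introduced here for illustration only:

```python
import numpy as np

sample_prices = [170.12, 93.29, 55.28]
sample_earnings = [9.2, 5.31, 2.41]

# Dividing two plain lists is not defined and raises TypeError
try:
    sample_prices / sample_earnings
    list_division_works = True
except TypeError:
    list_division_works = False

# With plain lists, the division has to be written element by element
pe_from_lists = [p / e for p, e in zip(sample_prices, sample_earnings)]

# With NumPy arrays, the same division is a single expression
pe_from_arrays = np.array(sample_prices) / np.array(sample_earnings)

print(list_division_works)
print(pe_from_lists)
print(pe_from_arrays)
```

Both approaches give the same numbers; the array version is shorter and avoids an explicit loop.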
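Next we analyze specific sectors. For the IT sector, we first need to select the companies that belong to it.
This calls for boolean indexing. For example, to find the numbers greater than 3 in the array numbers, first use numbers > 3 to get a boolean array containing only True and False.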
numbers = np.array([1,2,3,4,5])
boolean_array = (numbers > 3)
print(boolean_array)
Output: [False False False  True  True]
Then use this boolean array to select the elements where it is True, which gives the numbers greater than 3.
large_number = numbers[boolean_array]
print(large_number)
Output: [4 5]
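Boolean arrays can also be combined before they are used as an index. The sketch below goes slightly beyond the example above and shows the element-wise operators `&` (and), `|` (or), and `~` (not) on the same `numbers` array; the result names are chosen here for illustration:

```python
import numpy as np

numbers = np.array([1, 2, 3, 4, 5])

# & means "and", | means "or"; each condition needs its own parentheses
middle = numbers[(numbers > 1) & (numbers < 5)]
ends = numbers[(numbers < 2) | (numbers > 4)]

# ~ inverts a boolean array
not_large = numbers[~(numbers > 3)]

print(middle)
print(ends)
print(not_large)
```

The parentheses matter because `&` and `|` bind more tightly than the comparison operators.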
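Applying the same technique to the sector data selects the IT companies.
# Create a boolean array for the IT sector
boolean_array = (sectors == 'Information Technology')
# Select the subset of data for the IT sector
it_names = names[boolean_array]
it_pe = pe[boolean_array]
# Print the company names and P/E ratios of the IT sector
print(it_names)
print(it_pe)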
['Apple Inc' 'Accenture Plc' 'Cisco Systems Inc' 'Facebook Inc'
'Alphabet Class C' 'Alphabet Class A' 'International Business Machines'
'Intel Corp' 'Mastercard Inc' 'Microsoft Corp' 'Oracle Corp'
'Paypal Holdings' 'Qualcomm Inc' 'Texas Instruments' 'Visa Inc']
[ 18.49130435 24.58544839 16.76497696 34.51637765 34.09708738
34.6196853 11.08345534 14.11320755 34.78654292 24.40532544
19.20392157 54.67857143 17.67989418 24.28325123 31.68678161]
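In the same way, select the companies and P/E ratios of the Consumer Staples sector.
# Create a boolean array for the Consumer Staples (CS) sector
boolean_array = (sectors == 'Consumer Staples')
# Select the subset of data for the CS sector
cs_names = names[boolean_array]
cs_pe = pe[boolean_array]
# Print the company names and P/E ratios of the CS sector
print(cs_names)
print(cs_pe)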
['Colgate-Palmolive Company' 'Costco Wholesale' 'CVS Corp' 'Kraft Heinz Co'
'Coca-Cola Company' 'Mondelez Intl Cmn A' 'Altria Group' 'Pepsico Inc'
'Procter & Gamble Company' 'Philip Morris International Inc'
'Walgreens Boots Alliance' 'Wal-Mart Stores']
[ 25.14285714 29.41924399 12.29071804 22.63764045 24.12698413
20.72682927 21.04746835 22.55859375 22.19346734 23.01781737
13.7745098 22.03669725]
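The same filter can be repeated for every sector without writing each sector name by hand. A sketch of one way to do this with `np.unique`, on six companies taken from the data above; the `sample_` arrays and the `pe_by_sector` dictionary are assumptions made for this illustration, not part of the original analysis:

```python
import numpy as np

# Six companies copied from the data above
sample_names = np.array(['Apple Inc', 'Abbvie Inc', 'Abbott Laboratories',
                         'Accenture Plc', 'Colgate-Palmolive Company', 'Costco Wholesale'])
sample_sectors = np.array(['Information Technology', 'Health Care', 'Health Care',
                           'Information Technology', 'Consumer Staples', 'Consumer Staples'])
sample_prices = np.array([170.12, 93.29, 55.28, 145.3, 72.16, 171.22])
sample_earnings = np.array([9.2, 5.31, 2.41, 5.91, 2.87, 5.82])
sample_pe = sample_prices / sample_earnings

# np.unique returns each sector name once; one boolean mask per sector
pe_by_sector = {}
for sector in np.unique(sample_sectors):
    mask = (sample_sectors == sector)
    pe_by_sector[str(sector)] = sample_pe[mask]
    print(sector, sample_names[mask], sample_pe[mask])
```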
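With the IT and Consumer Staples data selected, we compute the mean and standard deviation of the P/E ratio for these two sectors.
The numpy.mean(array) function computes the mean of array.
The numpy.std(array) function computes the standard deviation of array.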
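# Compute the mean and standard deviation of the IT sector P/E ratios
it_pe_mean = np.mean(it_pe)
it_pe_std = np.std(it_pe)
print("Mean P/E ratio of the IT sector:", it_pe_mean)
print("Standard deviation of the IT sector P/E ratio:", it_pe_std)
Mean P/E ratio of the IT sector: 26.3330554204
Standard deviation of the IT sector P/E ratio: 10.8661467927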
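# Compute the mean and standard deviation of the Consumer Staples sector P/E ratios
cs_pe_mean = np.mean(cs_pe)
cs_pe_std = np.std(cs_pe)
print("Mean P/E ratio of the Consumer Staples sector:", cs_pe_mean)
print("Standard deviation of the Consumer Staples sector P/E ratio:", cs_pe_std)
Mean P/E ratio of the Consumer Staples sector: 21.5810689064
Standard deviation of the Consumer Staples sector P/E ratio: 4.41202165427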
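One detail worth knowing: by default, `np.std` divides by n, which gives the population standard deviation. Passing `ddof=1` divides by n - 1 instead and gives the sample standard deviation. The sketch below compares the two on the IT sector values printed earlier; which one is appropriate depends on how you view the 15 companies, and that choice is left open here:

```python
import numpy as np

# IT sector P/E ratios, copied from the output above
it_pe = np.array([18.49130435, 24.58544839, 16.76497696, 34.51637765, 34.09708738,
                  34.6196853, 11.08345534, 14.11320755, 34.78654292, 24.40532544,
                  19.20392157, 54.67857143, 17.67989418, 24.28325123, 31.68678161])

# Default ddof=0: divide by n (population standard deviation)
population_std = np.std(it_pe)
# ddof=1: divide by n - 1 (sample standard deviation)
sample_std = np.std(it_pe, ddof=1)

print(population_std)
print(sample_std)
```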
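Plotting
First, use a scatter plot to look at the P/E ratio of every company in these two sectors. We use matplotlib, a widely used Python plotting package.
The matplotlib.pyplot.scatter() function draws a scatter plot.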
# Import the matplotlib.pyplot module
import matplotlib.pyplot as plt
# Assign an id to each company
it_id = np.arange(len(it_pe))
cs_id = np.arange(len(cs_pe))
# Draw the scatter plot of P/E ratios
plt.scatter(it_id, it_pe, color='red', label='IT')
plt.scatter(cs_id, cs_pe, color='green', label='CS')
# Add a legend
plt.legend()
# Add axis labels
plt.xlabel('Company ID')
plt.ylabel('P/E Ratio')
# Show the plot
plt.show()
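Notice that one IT company in the upper right corner of the plot above has an especially high P/E ratio. A stock whose P/E ratio is higher than that of comparable stocks often signals higher growth expectations. So let's look more closely at the distribution of P/E ratios in the IT sector; a histogram is a good way to see how the data is distributed.
The matplotlib.pyplot.hist() function draws a histogram.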
# Draw a histogram of the IT sector P/E ratios, split into 8 bins
plt.hist(it_pe, bins=8)
# Add axis labels
plt.xlabel('P/E ratio')
plt.ylabel('Frequency')
# Show the plot
plt.show()
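If you want the numbers behind the bars rather than the picture, `np.histogram` computes the bin counts and bin edges without drawing anything. A brief sketch on the IT sector values printed earlier, using the same 8 bins; the names `counts` and `edges` are chosen here for illustration:

```python
import numpy as np

# IT sector P/E ratios, copied from the output above
it_pe = np.array([18.49130435, 24.58544839, 16.76497696, 34.51637765, 34.09708738,
                  34.6196853, 11.08345534, 14.11320755, 34.78654292, 24.40532544,
                  19.20392157, 54.67857143, 17.67989418, 24.28325123, 31.68678161])

# counts[i] is the number of values between edges[i] and edges[i + 1]
counts, edges = np.histogram(it_pe, bins=8)

print(counts)
print(edges)
```

With 8 bins there are 9 edges, running from the smallest to the largest P/E ratio.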
The histogram now shows more clearly an outlier on the right with a very high P/E ratio. We can use boolean indexing to find the company behind it.
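# Find the P/E ratios greater than 50
outlier_price = it_pe[it_pe > 50]
# Find the companies with a P/E ratio greater than 50
outlier_name = it_names[it_pe > 50]
# Print the result; round() rounds to two decimal places
print(str(outlier_name[0]) + " has a P/E ratio of " + str(round(outlier_price[0], 2)))
Paypal Holdings has a P/E ratio of 54.68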
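The cutoff of 50 was read off the histogram by eye. As an alternative sketch, the outlier can be flagged relative to the sector's own mean and standard deviation. The threshold of 2 standard deviations below is an assumption made for this illustration, not a rule from the course:

```python
import numpy as np

# IT sector names and P/E ratios, copied from the output above
it_names = np.array(['Apple Inc', 'Accenture Plc', 'Cisco Systems Inc', 'Facebook Inc',
                     'Alphabet Class C', 'Alphabet Class A', 'International Business Machines',
                     'Intel Corp', 'Mastercard Inc', 'Microsoft Corp', 'Oracle Corp',
                     'Paypal Holdings', 'Qualcomm Inc', 'Texas Instruments', 'Visa Inc'])
it_pe = np.array([18.49130435, 24.58544839, 16.76497696, 34.51637765, 34.09708738,
                  34.6196853, 11.08345534, 14.11320755, 34.78654292, 24.40532544,
                  19.20392157, 54.67857143, 17.67989418, 24.28325123, 31.68678161])

# Distance of each P/E ratio from the mean, in units of standard deviation
z_scores = (it_pe - np.mean(it_pe)) / np.std(it_pe)

# Hypothetical threshold: flag values more than 2 standard deviations away
outlier_names = it_names[np.abs(z_scores) > 2]

# np.argmax gives the position of the largest P/E ratio directly
highest_pe_name = it_names[np.argmax(it_pe)]

print(outlier_names)
print(highest_pe_name)
```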
Note: this article is a set of study notes for the DataCamp course Intro to Python for Finance.