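Miniconda download page: see "Miniconda" in the Conda documentation. With Miniconda you have to install each Python package yourself, which is not beginner-friendly, so installing Anaconda directly is recommended.
%paste and %cpaste are not available in Jupyter Notebook (they are also missing from the magic function list printed by %lsmagic). The error is: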
UsageError: Line magic function `%paste` not found.
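Both magics work when tested in IPython itself.
The corresponding temporary directory cannot be deleted here (the commands in this section should presumably be run in ipython started from the Anaconda Powershell Prompt):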
In [20]: rm -r tmp
Installing line-profiler under Python 3.7 requires Visual Studio 2017.
In[13]:!head -4 data/president_heights.csv
On Windows, use the type command to view the file contents instead:
In[13]:!type data\president_heights.csv
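Note that type prints the whole file, whereas head -4 prints only the first four lines. A portable alternative is to read the lines in Python itself. The following is a minimal sketch; the helper name head_lines and the temporary demo file are assumptions made for illustration and do not come from the book.

```python
import itertools
import os
import tempfile


def head_lines(path, n=4, encoding="utf-8"):
    """Return the first n lines of a text file, without trailing newlines."""
    with open(path, "r", encoding=encoding) as f:
        return [line.rstrip("\n") for line in itertools.islice(f, n)]


# Demo on a throwaway file (hypothetical contents, for illustration only).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("order,name,value\n1,a,10\n2,b,20\n3,c,30\n4,d,40\n")
    first_four = head_lines(demo_path, n=4)

print("\n".join(first_four))
```

For the book's file, the call would be head_lines('data/president_heights.csv'), assuming the file is stored as UTF-8.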
In[17]:pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
Out[17]:MultiIndex(levels=[['a', 'b'], [1, 2]],
codes=[[0, 0, 1, 1], [0, 1, 0, 1]])
d:\Users\Administrator\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: FutureWarning: the 'labels' keyword is deprecated, use 'codes' instead
In current versions, 'labels' has been replaced by 'codes'.
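A version of the call that avoids the warning might look like the sketch below. It assumes pandas 0.24 or later, where the codes keyword is accepted.

```python
import pandas as pd

# Same index as in the book's example, built with the newer `codes` keyword
# (assumes pandas 0.24 or later, where `codes` replaced `labels`).
index = pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
                      codes=[[0, 0, 1, 1], [0, 1, 0, 1]])
print(index.tolist())
```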
In current versions, axis='col' must be changed to axis='columns':
In[8]: df3 = make_df('AB', [0, 1])
df4 = make_df('CD', [0, 1])
print(df3); print(df4); print(pd.concat([df3, df4], axis='columns'))
Downloading the planets dataset through Seaborn fails:
In[2]: import seaborn as sns
planets = sns.load_dataset('planets')
URLError:
Changing the computer's DNS setting to 114.114.114.114 may fix this.
Build a string that joins the JSON objects from all lines, then read all the data with pd.read_json:
In[20]: # read the entire file into a Python array
        with open('data/recipeitems-latest.json', 'r') as f:
            # Extract each line
            data = (line.strip() for line in f)
            # Reformat so each line is the element of a list
            data_json = "[{0}]".format(','.join(data))
This raises:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa6 in position 4058: illegal multibyte sequence
Change it to:
In[20]: # read the entire file into a Python array
        with open('data/recipeitems-latest.json', 'r', encoding='UTF-8') as f:
            # extract each line
            data = (line.strip() for line in f)
            # merge all the lines into a single list
            data_json = "[{0}]".format(','.join(data))
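The error appears because open() falls back to the platform's default encoding (gbk on a Chinese-locale Windows system) when no encoding is given. The sketch below reproduces the fixed pattern end to end on a small temporary file; the two sample records and the file name are hypothetical stand-ins for the recipe data.

```python
import json
import os
import tempfile

# Two hypothetical JSON records, one per line, containing non-ASCII text.
records = ['{"name": "Crème brûlée"}', '{"name": "Jalapeño dip"}']

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.json")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(records) + "\n")

    # Passing encoding explicitly avoids falling back to the platform
    # default (for example gbk on a Chinese-locale Windows system).
    with open(path, "r", encoding="utf-8") as f:
        lines = (line.strip() for line in f)
        data_json = "[{0}]".format(",".join(lines))

parsed = json.loads(data_json)
print([item["name"] for item in parsed])
```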
Importing financial data from Google or Yahoo Finance with the pandas-datareader package fails:
In[25]: from pandas_datareader import data
goog = data.DataReader('GOOG', start='2004', end='2016',
data_source='google')
NotImplementedError: data_source='google' is not implemented
If this is changed to data_source='yahoo':
ReadTimeout: HTTPSConnectionPool(host='finance.yahoo.com', port=443): Read timed out. (read timeout=30)
In[36]: data.columns = ['West', 'East']
data['Total'] = data.eval('West + East')
Because the data currently available already includes a total column, change this to:
In[36]: data.columns = ['Total', 'East', 'West']
Generating random numbers with NumPy fails:
In[1]: import numpy as np
rng = np.random.RandomState(42)
x = rng.rand(1E6)
y = rng.rand(1E6)
TypeError: 'float' object cannot be interpreted as an integer
Pass integers here instead:
x = rng.rand(1000000)
y = rng.rand(1000000)
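If the scientific notation is preferred for readability, converting the literal to an integer first also works. This is a small sketch; the variable name n is introduced here for illustration.

```python
import numpy as np

rng = np.random.RandomState(42)
n = int(1E6)        # 1E6 is the float 1000000.0; array sizes must be integers
x = rng.rand(n)
y = rng.rand(n)
print(x.shape, y.shape)
```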
2. Plotting from an IPython shell
Running the %matplotlib magic command after starting ipython raises an error:
In[1]: %matplotlib
AttributeError: 'NoneType' object has no attribute 'lower'
For now, the workaround is to plot only in the IPython Notebook, using %matplotlib inline or %matplotlib notebook.
Importing the Gaussian process regression class fails:
In[1]: from sklearn.gaussian_process import GaussianProcess
ImportError: cannot import name 'GaussianProcess' from 'sklearn.gaussian_process' (d:\Users\Administrator\Anaconda3\lib\site-packages\sklearn\gaussian_process\__init__.py)
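Newer Scikit-Learn releases provide GaussianProcessRegressor in place of the removed GaussianProcess class. The sketch below shows one possible substitution. The sample data and the RBF kernel are assumptions chosen for illustration, and the kernel is not equivalent to the corr='cubic' setting of the old class, so the fitted curve may differ from the book's figure.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF


def true_function(x):
    """Hypothetical underlying function used to generate sample data."""
    return x * np.sin(x)


xdata = np.array([1, 3, 5, 6, 8])
ydata = true_function(xdata)

# The RBF kernel is an assumption; it is not equivalent to the old
# corr='cubic' setting of GaussianProcess.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), random_state=0)
gp.fit(xdata[:, np.newaxis], ydata)

xfit = np.linspace(0, 10, 1000)
# return_std=True replaces the old eval_MSE=True and yields the standard
# deviation directly rather than the mean squared error.
yfit, std = gp.predict(xfit[:, np.newaxis], return_std=True)
dyfit = 2 * std  # two standard deviations, roughly a 95% band
```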
Switching to a gray background raises an exception:
In[3]: # use a gray background
ax = plt.axes(axisbg='#E6E6E6')
ax.set_axisbelow(True)
AttributeError: 'AxesSubplot' object has no property 'axisbg'
Change this to:
In[3]: ax = plt.axes(facecolor='#E6E6E6')
Loading Basemap fails:
In[1]: from mpl_toolkits.basemap import Basemap
This raises KeyError: 'PROJ_LIB'. Add an environment variable on the local system:
Variable name: PROJ_LIB
Variable value: D:\Users\Administrator\Anaconda3\Library\share
1. Histograms, KDE, and densities
When plotting the histograms:
In[6]: for col in 'xy':
           plt.hist(data[col], normed=True, alpha=0.5)
In newer versions of matplotlib, normed has been replaced by density. The error is:
AttributeError: 'Rectangle' object has no property 'normed'
The call can be changed to:
plt.hist(data[col], density=True, alpha=0.5)
When drawing a two-dimensional visualization of the data:
In[9]: sns.kdeplot(data);
d:\Users\Administrator\Anaconda3\lib\site-packages\seaborn\distributions.py:679: UserWarning: Passing a 2D dataset for a bivariate plot is deprecated in favor of kdeplot(x, y), and it will cause an error in future versions. Please update your code.
warnings.warn(warn_msg, UserWarning)
In newer environments this raises an error, and no fix has been found so far:
ValueError: If using all scalar values, you must pass an index
Converting strings to a time type:
In[25]: def convert_time(s):
            h, m, s = map(int, s.split(':'))
            return pd.datetools.timedelta(hours=h, minutes=m, seconds=s)
This raises:
AttributeError: module 'pandas' has no attribute 'datetools'
The custom function is not needed; pd.to_timedelta() can be called directly. In the next cell, change the converters argument to:
converters={'split':pd.to_timedelta, 'final':pd.to_timedelta}
Later, when converting the times to seconds:
In[27]: data['split_sec'] = data['split'].astype(int) / 1E9
data['final_sec'] = data['final'].astype(int) / 1E9
This raises:
TypeError: cannot astype a timedelta from [timedelta64[ns]] to [int32]
This can be changed to:
In[27]: data['split_sec'] = data['split'].astype(np.int64) / 1E9
data['final_sec'] = data['final'].astype(np.int64) / 1E9
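An alternative that does not depend on the integer width or on the nanosecond representation is the .dt.total_seconds() accessor. The sketch below uses two hypothetical rows in the same 'HH:MM:SS' format as the marathon data.

```python
import pandas as pd

# Hypothetical split and final times in the 'HH:MM:SS' format used by the data.
data = pd.DataFrame({'split': ['01:05:38', '01:06:26'],
                     'final': ['02:08:51', '02:09:28']})
data['split'] = pd.to_timedelta(data['split'])
data['final'] = pd.to_timedelta(data['final'])

# total_seconds() avoids relying on the nanosecond integer representation.
data['split_sec'] = data['split'].dt.total_seconds()
data['final_sec'] = data['final'].dt.total_seconds()
print(data[['split_sec', 'final_sec']])
```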
3. Supervised learning example: Iris classification
Splitting the dataset with a utility function:
In[15]: from sklearn.cross_validation import train_test_split
This module no longer exists. The error is:
ModuleNotFoundError: No module named 'sklearn.cross_validation'
Import the function from the current module instead:
In[15]: from sklearn.model_selection import train_test_split
5. Unsupervised learning example: Iris clustering
Importing the Gaussian mixture model:
In[20]: from sklearn.mixture import GMM
This raises:
ImportError: cannot import name 'GMM' from 'sklearn.mixture'
Change it to:
In[20]: from sklearn.mixture import GaussianMixture  # 1. choose the model class
        model = GaussianMixture(n_components=3,
                                covariance_type='full')  # 2. instantiate the model with hyperparameters
2. Unsupervised learning: Dimensionality reduction
In[20]: plt.scatter(data_projected[:, 0], data_projected[:, 1], c=digits.target,
edgecolor='none', alpha=0.5,
cmap=plt.cm.get_cmap('spectral', 10))
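This raises:
ValueError: Colormap spectral is not recognized.
The lowercase name no longer exists. Capitalizing it selects the 'Spectral' colormap (a different palette; the old 'spectral' is now available as 'nipy_spectral'):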
cmap=plt.cm.get_cmap('Spectral', 10)
3. Classification on digits
In[32]: test_images = xtest.reshape(-1, 8, 8)
The error is:
NameError: name 'xtest' is not defined
The variable defined earlier is 'Xtest', so this should be:
In[32]: test_images = Xtest.reshape(-1, 8, 8)
3. Cross-validation
Calling leave-one-out (LOO) cross-validation:
In[8]: from sklearn.model_selection import LeaveOneOut
scores = cross_val_score(model, X, y, cv=LeaveOneOut(len(X)))
This raises:
TypeError: LeaveOneOut() takes no arguments
Remove the argument:
In[8]: scores = cross_val_score(model, X, y, cv=LeaveOneOut())
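For reference, a self-contained version of this cell might look like the sketch below. It assumes the Iris data and a one-nearest-neighbor classifier as the model, and it imports everything from sklearn.model_selection.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=1)

# LeaveOneOut() takes no arguments; the number of splits comes from the data.
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print(len(scores), scores.mean())
```

Each of the 150 scores is either 0 or 1, because every fold tests a single sample.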
2. Validation curves in Scikit-Learn
Importing the function used to visualize the validation curve:
In[13]: from sklearn.learning_curve import validation_curve
This raises:
ModuleNotFoundError: No module named 'sklearn.learning_curve'
Change it to:
In[13]: from sklearn.model_selection import validation_curve
Learning curves in Scikit-Learn
The learning curve import has a similar problem:
In[17]: from sklearn.learning_curve import learning_curve
Change it to:
In[17]: from sklearn.model_selection import learning_curve
Importing the grid search meta-estimator:
In[18]: from sklearn.grid_search import GridSearchCV
The error is:
ModuleNotFoundError: No module named 'sklearn.grid_search'
Likewise, change it to:
In[18]: from sklearn.model_selection import GridSearchCV
When plotting the result:
In[21]: plt.plot(X_test.ravel(), y_test, hold=True);
The error is:
AttributeError: 'Line2D' object has no property 'hold'
The hold parameter can simply be removed:
In[21]: plt.plot(X_test.ravel(), y_test);
Computing the daily bicycle traffic:
In[15]: daily = counts.resample('d').sum()
daily['Total'] = daily.sum(axis=1)
daily = daily[['Total']] # remove other columns
Because the data currently available already has a total column, modify this accordingly:
In[15]: daily = counts.resample('d').sum()
daily = daily[['Fremont Bridge Total']] # remove other columns
daily.columns = ['Total']
Building the linear regression model:
In[22]: column_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'holiday',
'daylight_hrs', 'PRCP', 'dry day', 'Temp(C)', 'annual']
X = daily[column_names]
y = daily['Total']
model = LinearRegression(fit_intercept=False)
model.fit(X, y)
This raises:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
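This happens because the two raw datasets cover different time spans, which produces missing values. Adding a statement that drops the rows with missing values fixes it: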
daily.dropna(inplace=True)
The following statement checks whether the data contains missing values:
print(np.isnan(daily).any())
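The drop has to happen before X and y are extracted from daily. The sketch below illustrates the check and the drop on a tiny hypothetical table; the values and the two column names are placeholders, and isna() is used as the check because it also works when a column is not of float type.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the merged `daily` table: the weather column
# is missing for the last date, as happens when the time spans differ.
daily = pd.DataFrame(
    {'Total': [3521.0, 3475.0, 3148.0], 'PRCP': [0.0, 0.2, np.nan]},
    index=pd.date_range('2012-10-03', periods=3, freq='D'))

print(daily.isna().any())   # per-column check for missing values
daily = daily.dropna()      # drop every row that contains a missing value
print(len(daily))
```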
The book's example actually uses the mean of the East and West counts rather than the total traffic.
Importing RandomizedPCA:
In[20]: from sklearn.decomposition import RandomizedPCA
ImportError: cannot import name 'RandomizedPCA' from 'sklearn.decomposition' (d:\Users\Administrator\Anaconda3\lib\site-packages\sklearn\decomposition\__init__.py)
There is no longer a separate RandomizedPCA class; use PCA directly:
In[20]: from sklearn.decomposition import PCA as RandomizedPCA
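Importing PCA under the old name keeps the rest of the book's code unchanged, but the default svd_solver='auto' does not necessarily select the randomized algorithm. To request it explicitly, the solver can be passed as in this sketch, where the random input matrix is a hypothetical stand-in for the real data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(200, 50)   # hypothetical data standing in for the real dataset

# svd_solver='randomized' requests the randomized SVD explicitly.
pca = PCA(n_components=10, svd_solver='randomized', random_state=0)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)
```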
Downloading the MNIST handwritten digits dataset from mldata is refused:
In[20]: from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original')
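ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host
You may need to download the data to a local directory yourself.
Using the Gaussian mixture model (GMM):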
In[10]: for pos, covar, w in zip(gmm.means_, gmm.covars_, gmm.weights_):
draw_ellipse(pos, covar, alpha=w * w_factor)
This raises:
AttributeError: 'GaussianMixture' object has no attribute 'covars_'
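Change it to:
In[10]: for pos, covar, w in zip(gmm.means_, gmm.covariances_, gmm.weights_):
            draw_ellipse(pos, covar, alpha=w * w_factor)
When fitting a 16-component GMM to the original data and generating 400 new data points: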
In[16]: Xnew = gmm16.sample(400, random_state=42)
TypeError: sample() got an unexpected keyword argument 'random_state'
The same problem occurs at one place later, in Section 5.12.4.
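With GaussianMixture, one way around this is to set random_state when constructing the model and to unpack the tuple that sample() returns. The sketch below assumes a small synthetic two-dimensional dataset and two components instead of the book's data and 16 components.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Hypothetical two-dimensional data standing in for the book's dataset.
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])

# random_state is given to the constructor rather than to sample().
gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=42)
gmm.fit(X)

# sample() returns a tuple: the generated points and their component labels.
Xnew, labels = gmm.sample(400)
print(Xnew.shape, labels.shape)
```

Code that previously used the return value directly as an array, for example when plotting, needs the first element of the tuple.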
2. Using the custom estimator
When using the custom estimator:
In[17]: scores = [val.mean_validation_score for val in grid.grid_scores_]
This raises:
AttributeError: 'GridSearchCV' object has no attribute 'grid_scores_'
Change it to:
In[17]: scores = grid.cv_results_['mean_test_score']
Main software versions:
Python 3.7.3
Anaconda Navigator 1.9.7
Jupyter Notebook 6.0.0
IPython 7.6.1
NumPy 1.16.4
Pandas 0.24.2
Matplotlib 3.1.0
Seaborn 0.9.0
Scikit-Learn 0.21.2