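df.columns
Returns the column names of the table; the index (the leftmost column when the table is printed) is not included.
pd.merge
Works like a join between database tables: data1 and data2 are the two tables to merge, how is the join type, and on is the join key.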
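A minimal sketch of the how and on arguments described above. The tables data1 and data2 and the column name 'key' are made up for illustration.

```python
import pandas as pd

# Hypothetical tables; 'key' is an assumed join column.
data1 = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
data2 = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# how="inner" keeps only keys present in both tables.
inner = pd.merge(data1, data2, how="inner", on="key")
# how="left" keeps every row of data1 and fills missing matches with NaN.
left = pd.merge(data1, data2, how="left", on="key")

print(inner)
print(left)
```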
np.round
Rounds data to a given number of decimal places.
str(yr)
Converts a number directly to a string.
df.boxplot('Income', by='Region', rot=90)  # rot: label rotation angle
Draws a box plot.
X = scipy.stats.norm(loc=diff, scale=1)
Normal distribution: loc is the mean, scale is the standard deviation.
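A small sketch of what the frozen distribution object offers, assuming a made-up value for diff. It also shows that sf (the survival function) is 1 - cdf.

```python
from scipy import stats

diff = 2.0  # hypothetical mean, for illustration only
X = stats.norm(loc=diff, scale=1)  # "frozen" normal distribution

print(X.mean(), X.std())       # the loc and scale passed above
print(X.cdf(diff))             # half the mass lies below the mean

a = 3.0                        # hypothetical threshold
print(X.sf(a), 1 - X.cdf(a))   # sf(a) is P(X > a), the same as 1 - cdf(a)
```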
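plt.legend(["a={}".format(a) for a in a_values], loc=0)
Useful whenever the legend labels of a plot contain variable values.
plt.yscale('log')
merged = df.groupby('Region', as_index=False).mean()
groupby on its own produces no visible result; combine it with an aggregation such as mean.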
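A short sketch of the point about groupby, using a made-up table with 'Region' and 'Income' columns: the GroupBy object is lazy, and the aggregation produces the actual result.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "Region": ["ASIA", "ASIA", "EUROPE"],
    "Income": [10.0, 20.0, 30.0],
})

grouped = df.groupby("Region", as_index=False)  # nothing is computed yet
merged = grouped.mean()                         # one row per region

print(type(grouped).__name__)
print(merged)
```

With as_index=False the group key stays an ordinary column instead of becoming the index, which is convenient for a later pd.merge.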
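population.columns = ['Country'] + list(population.columns[1:])
Reorganizes the header column names. In practice the call to list raised an error here; according to online sources, Jupyter sometimes needs to be refreshed to fix this.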
http://www.cnblogs.com/txw1958/archive/2011/12/21/2295698.html
N ways to fetch web resources in Python 3
source.count(bytes('Soup', 'UTF-8'))
X.sf(a)
Basic usage of subplots
x2=np.arange(35,71,1)
fig, ax = plt.subplots(2,1)
ax[0].vlines(x2/100, 0, binom.pmf(x2, N, thep), colors='b', lw=5, alpha=0.5)
ax[1].vlines(x[1:], 0, y, lw=5, colors=dark2_colors[0])
ax[0].set_xlim(0.35,0.75)
ax[1].set_xlim(0.35,0.75)
plt.show()
plt.xticks(rotation=90)
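Rotates the x-axis tick labels by 90 degrees; useful when the labels are long.
if l is not None and l[:4] == 'http':
This code filters web links. In practice many data fields are empty (None), so this check is very handy.
[l for l in link_list if l is not None and l.startswith('http')]
Python's for loops (here as a list comprehension) are elegant and concise, but mastering them takes a lot of practice.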
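The two filters above are equivalent. A minimal sketch with a made-up link_list:

```python
# Hypothetical list of href values, including a missing one (None).
link_list = ["http://example.com", None, "mailto:a@b.c",
             "https://example.org", "/relative/path"]

# List comprehension: keep only non-empty links that start with 'http'.
http_links = [l for l in link_list if l is not None and l.startswith("http")]

# The same filter written as an explicit loop with slicing.
http_links_loop = []
for l in link_list:
    if l is not None and l[:4] == "http":
        http_links_loop.append(l)

print(http_links)
```

The None check has to come first: the and operator short-circuits, so startswith is never called on None.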
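Some websites block crawlers when you fetch resources; in that case the crawler needs to disguise itself, for example by sending a browser User-Agent header.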
req = urllib.request.Request(url,headers={'User-Agent': 'Mozilla/5.0'})
source = urllib.request.urlopen(req).read()
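A sketch of how to check which headers a Request will send, without touching the network. The URL is a placeholder.

```python
import urllib.request

url = "http://example.com/"  # placeholder URL, for illustration only
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})

# urllib stores header names capitalized, so the lookup key is 'User-agent'.
print(req.get_header("User-agent"))
print(req.header_items())

# The actual download (needs network access):
# source = urllib.request.urlopen(req).read()
```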
Switching between multiple Python versions in Jupyter takes two commands:
pip2 install ipython
ipython2 kernelspec install-self
1.
data_to_plot = ranking.overall
plt.bar(data_to_plot.index, data_to_plot)
plt.show()
2.
ranking_categories_weighted.head().plot(kind='bar')
ax = ranking_categories_weighted.head().plot(kind='bar', legend=False)
# Put a legend to the right of the current axis
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()
Writing math formulas in Jupyter
http://blog.csdn.net/winnerineast/article/details/52274556
http://jupyter-notebook.readthedocs.io/en/latest/examples/Notebook/Typesetting%20Equations.html
A web data processing workflow
URL = "http://www.pollster.com/08USPresGEMvO-2.html"
html=requests.get(URL).text
dom=web.Element(html)
rows=dom.by_tag('tr')
table=[]
for row in rows:
    table_row=[]
    data=row.by_tag('td')
    for value in data:
        table_row.append(web.plaintext(value.content))
    table.append(table_row)
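The same row-by-row extraction can be sketched with only the standard library. This swaps the web module used above for html.parser; the TableParser class and the HTML snippet are hypothetical stand-ins, not the real page.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.table = []
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.table.append(self._row)
            self._row = None


# Made-up HTML standing in for the downloaded page.
html = ("<table><tr><td>Pollster</td><td>N</td></tr>"
        "<tr><td>A <b>Poll</b></td><td>800</td></tr></table>")

parser = TableParser()
parser.feed(html)
table = parser.table
print(table)
```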
http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html
df2 = new_df.iloc[keep]
A density plot is also called a KDE plot; passing kind='kde' when calling plot produces one.
diamonds['price'].plot(kind='kde', color = 'black')
diamonds.boxplot('price', by = 'color')
by is the x axis, price is the y axis.
Ways to generate random numbers
1. np.random.randint(a, b, N)
2. np.random.rand(n, m)
3. np.random.randn(n, m)
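A short sketch of what each of the three calls returns. The arguments and the seed are arbitrary choices for illustration.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed so the run is repeatable

ints = np.random.randint(1, 7, 10)  # 10 integers in [1, 7): 7 is excluded
unif = np.random.rand(2, 3)         # 2x3 array, uniform on [0, 1)
norm = np.random.randn(2, 3)        # 2x3 array, standard normal (mean 0, std 1)

print(ints)
print(unif.shape, norm.shape)
```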
1. z.reshape((8,2))
2. z.flatten(): To flatten an array (convert a higher dimensional array into a vector), use flatten()
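A minimal sketch of both calls, assuming z is a 16-element vector. It also shows that flatten returns a copy, so changing the result leaves the original untouched.

```python
import numpy as np

z = np.arange(16)        # assumed input: 16 elements, 0..15
m = z.reshape((8, 2))    # 8 rows, 2 columns; the element count must match
flat = m.flatten()       # back to a 1-D vector (a copy of the data)

flat[0] = 99             # modifying the copy does not change m
print(m.shape, flat.shape)
print(m[0, 0], flat[0])
```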
zip_folder = requests.get('http://seanlahman.com/files/database/lahman-csv_2014-02-14.zip').content
zip_files = BytesIO()  # from io import BytesIO; requests' .content is bytes, so StringIO fails on Python 3
zip_files.write(zip_folder)
csv_files = ZipFile(zip_files)
teams=csv_files.open('Teams.csv')
teams=pd.read_csv(teams)
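The in-memory zip handling can be tried without a download. In this sketch a tiny archive is built in memory as a stand-in for the bytes returned by requests; the CSV contents are placeholders.

```python
from io import BytesIO
from zipfile import ZipFile

import pandas as pd

# Build a small zip archive in memory (stand-in for the downloaded bytes).
buffer = BytesIO()
with ZipFile(buffer, "w") as zf:
    zf.writestr("Teams.csv", "yearID,teamID,W\n2000,AAA,1\n2000,BBB,2\n")
zip_folder = buffer.getvalue()  # raw bytes, like requests.get(...).content

# Wrap the bytes in a file-like object and read one member with pandas.
csv_files = ZipFile(BytesIO(zip_folder))
with csv_files.open("Teams.csv") as f:
    teams = pd.read_csv(f)

print(csv_files.namelist())
print(teams)
```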
data=pd.DataFrame({'level':['a','b','c','b','a'],
'num':[3,5,6,8,9]})
grouped = df.groupby("playerID", as_index=False)
#print grouped.head()
rookie_idx = grouped["yearID"].aggregate({'min_index':f})['min_index'].values
# get the first record of each group
rookie = df.loc[rookie_idx][["playerID", "AB", "H"]]
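The helper f is not shown in these notes. One way to get the same kind of row index in current pandas is idxmin, sketched here on made-up data with the same column names; this is an alternative, not necessarily what f did.

```python
import pandas as pd

# Hypothetical batting records; the values are made up.
df = pd.DataFrame({
    "playerID": ["p1", "p1", "p2", "p2"],
    "yearID": [2001, 2000, 1999, 2002],
    "AB": [100, 50, 30, 200],
    "H": [30, 10, 5, 60],
})

# For each player, the index label of the row with the earliest year.
rookie_idx = df.groupby("playerID")["yearID"].idxmin().values

# Select those rows and keep only the columns of interest.
rookie = df.loc[rookie_idx][["playerID", "AB", "H"]]
print(rookie)
```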
Jupyter Markdown rendering
https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet
tab = tab.dropna()
Drops every row of the table that contains a missing value; quite handy in practice.
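A small sketch of dropna on a made-up table. The subset argument restricts the check to chosen columns.

```python
import numpy as np
import pandas as pd

# Hypothetical table with missing values in both columns.
tab = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

clean = tab.dropna()                # drop rows with a missing value anywhere
clean_a = tab.dropna(subset=["a"])  # only look at column 'a'

print(clean)
print(clean_a)
```

dropna returns a new table, which is why the result is assigned back, as in tab = tab.dropna().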
url_exprs = "https://raw.githubusercontent.com/cs109/2014_data/master/exprs_GSE5859.csv"
exprs = pd.read_csv(url_exprs, index_col=0)
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.cross_validation import cross_val_score, KFold
from sklearn.linear_model import LinearRegression
selector = SelectKBest(f_regression, k=2).fit(x, y)
best_features = np.where(selector.get_support())[0]
print(best_features)
xt = x[:, best_features]
clf = LinearRegression().fit(xt, y)
Incorrect cross-validation (the features were selected on the full data set before splitting):
for train, test in KFold(len(y), 10):
    xtrain, xtest, ytrain, ytest = xt[train], xt[test], y[train], y[test]
    clf.fit(xtrain, ytrain)
    yp = clf.predict(xtest)
    plt.plot(yp, ytest, 'o')
    plt.plot(ytest, ytest, 'r-')
plt.xlabel("Predicted")
plt.ylabel("Observed")
Correct cross-validation (the features are selected inside each fold, on the training data only):
scores = []
for train, test in KFold(len(y), n_folds=5):
    xtrain, xtest, ytrain, ytest = x[train], x[test], y[train], y[test]
    b = SelectKBest(f_regression, k=2)
    b.fit(xtrain, ytrain)
    xtrain = xtrain[:, b.get_support()]
    xtest = xtest[:, b.get_support()]
    clf.fit(xtrain, ytrain)
    scores.append(clf.score(xtest, ytest))
    yp = clf.predict(xtest)
    plt.plot(yp, ytest, 'o')
    plt.plot(ytest, ytest, 'r-')
plt.xlabel("Predicted")
plt.ylabel("Observed")
print("CV Score is ", np.mean(scores))
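The snippets above use the older sklearn.cross_validation module. As a hedged sketch for current scikit-learn versions, the same idea (feature selection refit inside every fold) can be written with sklearn.model_selection and a Pipeline. The random x and y are stand-in data, not the original data set.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Stand-in data: pure noise, so no feature truly predicts y.
rng = np.random.RandomState(0)
x = rng.randn(50, 200)
y = rng.randn(50)

# The pipeline refits SelectKBest on the training part of every fold,
# so the held-out part never influences which features are chosen.
pipe = make_pipeline(SelectKBest(f_regression, k=2), LinearRegression())
scores = cross_val_score(pipe, x, y, cv=KFold(n_splits=5))

print("CV Score is", np.mean(scores))
```

Because y is noise here, an honest cross-validation score should not look good; selecting the features on the full data first is what makes the incorrect version look better than it is.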
scipy.stats
scipy.integrate
scipy.signal
scipy.optimize
scipy.special
scipy.linalg
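mtcars.loc['Maserati Bora']  # .ix has been removed from pandas; use .loc for label lookup
Gets one row of the data.
Difference between any and all, and the results they return
(mtcars.mpg >= 20).any()  # True
(mtcars > 0).all()  # one result per column: True False True True ...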
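A sketch of the difference on a tiny made-up table (not the real mtcars data): any on a single column gives one boolean, while all on a whole table gives one boolean per column.

```python
import pandas as pd

# Hypothetical stand-in for mtcars; the values are made up.
mtcars = pd.DataFrame({"mpg": [21.0, 15.0], "hp": [110, 0]})

any_mpg = (mtcars.mpg >= 20).any()  # True if at least one value qualifies
all_cols = (mtcars > 0).all()       # per column: are all values positive?

print(any_mpg)
print(all_cols)
```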
from pandas.plotting import scatter_matrix
scatter_matrix(mtcars[['mpg', 'hp', 'cyl']], figsize = (10, 6), alpha = 1, diagonal='kde')
soup.head.contents
soup.head.children
soup.head.title
soup.head.title.string
for child in soup.head.descendants:
.stripped_strings
soup.find_all('a')
soup.find_all('a')[1].get('href')
a = {'a': 1, 'b': 2}  # a dictionary
s = json.dumps(a)  # s is a string containing a in JSON encoding
a2 = json.loads(s)  # reading back; the keys are now unicode strings
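The round trip can be checked directly. A minimal, self-contained sketch:

```python
import json

a = {"a": 1, "b": 2}   # a dictionary
s = json.dumps(a)      # encode to a JSON string
a2 = json.loads(s)     # decode back to a dictionary

print(type(s).__name__, s)
print(a2 == a)
```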
data = pd.DataFrame(wc, columns = ['match_number', 'location', 'datetime', 'home_team', 'away_team', 'winner', 'home_team_events', 'away_team_events'])
data['gameDate']=pd.DatetimeIndex(data.datetime).date
data['gameTime']=pd.DatetimeIndex(data.datetime).time
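A sketch of the date and time split on made-up timestamps, assuming the datetime column holds ISO-formatted strings.

```python
import pandas as pd

# Hypothetical timestamps for illustration.
data = pd.DataFrame({"datetime": ["2014-06-12T17:00:00",
                                  "2014-06-13T13:00:00"]})

idx = pd.DatetimeIndex(data.datetime)  # parse the strings once
data["gameDate"] = idx.date            # datetime.date objects
data["gameTime"] = idx.time            # datetime.time objects

print(data)
```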
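data = stats.binom.rvs(n = 10, p = 0.3, size = 10000)  # binomial random samples (10 Bernoulli trials each)
y = stats.poisson.pmf(n, lam)  # Poisson distribution
y = stats.norm.pdf(x, 0, 1)  # normal distribution
y = stats.beta.pdf(x, a, b)  # beta distribution
lam = 0.5
x = np.arange(0, 15, 0.1)
y = lam * np.exp(-lam * x)  # exponential distribution density
stats.expon.rvs(scale = 2, size = 1000)  # exponential random samples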
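A short check of how the exponential density written by hand, lam * exp(-lam * x), relates to scipy's parameterization: scipy.stats.expon takes scale = 1 / lam rather than lam itself.

```python
import numpy as np
from scipy import stats

lam = 0.5
x = np.arange(0, 15, 0.1)

y_manual = lam * np.exp(-lam * x)              # density written out by hand
y_scipy = stats.expon.pdf(x, scale=1.0 / lam)  # scipy uses scale = 1 / lam

print(np.allclose(y_manual, y_scipy))
```

So scale = 2 in the rvs call corresponds to lam = 0.5.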
from sklearn.datasets import load_boston
boston = load_boston()
statsmodels is a Python module specifically for estimating statistical models (with less machine learning than sklearn). It can estimate many types of statistical models, but today we will focus on linear regression.
e.g.:
import statsmodels.api as sm
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
results.params.values
X = sm.add_constant(X)
residData.plot(title = 'Residuals from least squares estimates across years', figsize = (15, 8), color=['blue' if x=='OAK' else 'gray' for x in df.teamID])
np.linalg.inv(np.dot(X.T, X)).dot(X.T).dot(y)
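The expression above is the least-squares solution from the normal equations, (X^T X)^{-1} X^T y. A sketch on made-up data, compared with np.linalg.lstsq, which solves the same problem without forming the inverse explicitly:

```python
import numpy as np

# Made-up regression problem: intercept column plus two random features.
rng = np.random.RandomState(0)
X = np.column_stack([np.ones(20), rng.randn(20, 2)])
beta_true = np.array([1.0, 2.0, -3.0])
y = X.dot(beta_true) + 0.01 * rng.randn(20)

# Normal equations, as in the line above.
beta_normal = np.linalg.inv(np.dot(X.T, X)).dot(X.T).dot(y)

# Least-squares solver for comparison.
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

print(beta_normal)
print(beta_lstsq)
```

The column of ones plays the role of sm.add_constant(X): without it the fit has no intercept.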