Python: a brief summary of the Python features most commonly used in data analysis and data mining:
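import numpy as np  # conventionally imported under the alias np
a = np.array([2, 0, 1, 5])
print(a)
print(a[:3])
a.sort()  # sorts in place, so a is overwritten
print(a)
b = np.array([[1, 2, 3], [4, 5, 6]])
print(b * b)  # element-wise square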
[2 0 1 5]
[2 0 1]
[0 1 2 5]
[[ 1  4  9]
 [16 25 36]]
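The note that a.sort() overwrites a is worth a closer look. The following is a minimal sketch (variable names such as sorted_copy are illustrative, not from the original) contrasting the in-place method with np.sort, which returns a sorted copy, and showing that * on arrays is element-wise:

```python
import numpy as np

a = np.array([2, 0, 1, 5])

# np.sort returns a sorted copy; the original array is left untouched.
sorted_copy = np.sort(a)

# ndarray.sort sorts in place and returns None.
b = a.copy()
ret = b.sort()

# The * operator multiplies element by element; it is not a matrix product.
m = np.array([[1, 2, 3], [4, 5, 6]])
squared = m * m

print(a, sorted_copy, b, ret)
print(squared)
```

If the original order is still needed later, np.sort is usually the safer choice.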
# Solve a system of nonlinear equations
from scipy.optimize import fsolve
def f(x):
    x1 = x[0]
    x2 = x[1]
    return [2 * x1 - x2 ** 2 - 1, x1 ** 2 - x2 - 2]
result = fsolve(f, [1, 1])  # [1, 1] is the initial guess
print(result)
# Numerical integration
from scipy import integrate
def g(x):  # the integrand
    return (1 - x ** 2) ** 0.5
pi_2, err = integrate.quad(g, -1, 1)  # returns the integral and an estimate of its error
print(pi_2 * 2, err)
[ 1.91963957 1.68501606]
3.141592653589797 1.0002356720661965e-09
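A quick way to sanity-check results like these is to plug the root back into the equations and to compare the integral with a known value. This is a small sketch under the same definitions of f and g; the names root, residual and half_pi are illustrative:

```python
import numpy as np
from scipy import integrate
from scipy.optimize import fsolve

def f(x):
    x1, x2 = x
    return [2 * x1 - x2 ** 2 - 1, x1 ** 2 - x2 - 2]

root = fsolve(f, [1, 1])
# At a true root, both equations should evaluate to (nearly) zero.
residual = np.asarray(f(root))

def g(x):
    return (1 - x ** 2) ** 0.5

# The area under the upper unit semicircle is pi / 2.
half_pi, err = integrate.quad(g, -1, 1)

print(root, np.abs(residual).max())
print(abs(2 * half_pi - np.pi), err)
```

fsolve only finds the root nearest to the initial guess, so a different starting point may converge to a different solution of the same system.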
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 10000)  # independent variable x; 10000 is the number of points
y = np.sin(x) + 1  # dependent variable y
z = np.cos(x ** 2) + 1  # dependent variable z
plt.figure(figsize=(8, 4))  # set the figure size
# plt.rcParams['font.sans-serif'] = 'SimHei'  # set a font that supports Chinese if the labels contain Chinese text
# plt.rcParams['axes.unicode_minus'] = False  # add this line if minus signs render incorrectly in the saved image
# Two curves
plt.plot(x, y, label=r'$\sin x+1$', color='red', linewidth=2)  # set the label, line color, and line width
plt.plot(x, z, 'b--', label=r'$\cos x^2+1$')
plt.xlim(0, 10)  # x-axis range
plt.ylim(0, 2.5)  # y-axis range
plt.xlabel("Time(s)")  # x-axis label
plt.ylabel("Volt")  # y-axis label
plt.title("Matplotlib Sample")  # figure title
plt.legend()  # show the legend
plt.show()  # display the figure
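plt.show() needs a display, which is not always available (for example on a server). The following sketch draws the same two curves with the non-interactive Agg backend and saves them to a file instead. The output location (a temporary directory and the name sample.png) is an assumption made for illustration:

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 1000)
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(x, np.sin(x) + 1, color="red", linewidth=2, label=r"$\sin x+1$")
ax.plot(x, np.cos(x ** 2) + 1, "b--", label=r"$\cos x^2+1$")
ax.set_xlim(0, 10)
ax.set_ylim(0, 2.5)
ax.set_xlabel("Time(s)")
ax.set_ylabel("Volt")
ax.set_title("Matplotlib Sample")
ax.legend()

# Hypothetical output path inside a fresh temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "sample.png")
fig.savefig(out_path, dpi=100)
plt.close(fig)  # release the figure's memory
print(out_path)
```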
import pandas as pd
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
d = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18]], columns=['a', 'b', 'c'])
d2 = pd.DataFrame(s)
print(s)
print(d.head())  # preview the first 5 rows
print(d.describe())  # summary statistics
# Read a file (preferably from a path without Chinese characters)
df = pd.read_csv("G:\\data.csv", encoding="utf-8")
print(df)
a    1
b    2
c    3
dtype: int64
    a   b   c
0   1   2   3
1   4   5   6
2   7   8   9
3  10  11  12
4  13  14  15
               a          b          c
count   6.000000   6.000000   6.000000
mean    8.500000   9.500000  10.500000
std     5.612486   5.612486   5.612486
min     1.000000   2.000000   3.000000
25%     4.750000   5.750000   6.750000
50%     8.500000   9.500000  10.500000
75%    12.250000  13.250000  14.250000
max    16.000000  17.000000  18.000000
Empty DataFrame
Columns: [1068, 12, 蔬果, 1201, 蔬菜, 120104, 花果, 20150430, 201504, DW-1201040010, 散称, 生鲜, 千克, 0.973, 5.43, 2.58, 否]
Index: []
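The "Empty DataFrame" output above, whose column names look like data values, is what read_csv typically produces when a file has no header row: by default the first line is used as the column names. The following sketch reproduces the effect with an in-memory string. The sample row and the column names passed to names= are made up for illustration:

```python
import io

import pandas as pd

# Hypothetical stand-in for a file with one data row and no header line.
raw = "1068,12,20150430,0.973,5.43\n"

# Default behavior: the first line becomes the header, so no rows remain.
with_default = pd.read_csv(io.StringIO(raw))

# header=None keeps the first line as data; names= supplies the labels.
with_names = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=["id", "category", "date", "weight", "price"],  # hypothetical labels
)

print(with_default.shape)
print(with_names.shape)
```

If the real file does have a header row, the default settings are fine and no extra arguments are needed.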
from sklearn.linear_model import LinearRegression
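model = LinearRegression()
model.fit(): trains the model; supervised models use fit(X, y), unsupervised models use fit(X)
model.predict(X_new): predicts the output for new samples
model.predict_proba(X_new): predicts class probabilities; only available for some models (for example, logistic regression)
model.transform(): maps data into the new "basis space" learned from the data during fitting
model.fit_transform(): learns a new basis from the data and transforms that same data according to this basis
Scikit-Learn ships with a few built-in datasets, such as the iris flower dataset and a handwritten digit image dataset. Let's take the iris dataset as an example: it has 4 feature dimensions (sepal length and width, petal length and width) plus a class label covering three species. Example: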
from sklearn import datasets  # built-in datasets
from sklearn import svm
iris = datasets.load_iris()  # load the dataset
clf = svm.LinearSVC()  # build a linear SVM classifier
clf.fit(iris.data, iris.target)  # train the model on the data
print(clf.predict([[5, 3, 1, 0.2], [5.0, 3.6, 1.3, 0.25]]))  # predict the class of two new samples
[0 0]
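The fit / predict / predict_proba / transform / fit_transform interface described earlier can be sketched on the same iris data. LogisticRegression and PCA are chosen here only as convenient examples of a supervised and an unsupervised estimator; they are not used in the original text:

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Supervised estimator: fit(X, y), then predict / predict_proba.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
X_new = [[5, 3, 1, 0.2], [5.0, 3.6, 1.3, 0.25]]
labels = clf.predict(X_new)
proba = clf.predict_proba(X_new)  # one row per sample, one column per class

# Unsupervised transformer: fit(X) learns the basis, transform(X) applies it.
pca = PCA(n_components=2)
X_2d = pca.fit(X).transform(X)

# fit_transform does both steps in a single call.
X_2d_one_step = PCA(n_components=2).fit_transform(X)

print(labels, proba.shape, X_2d.shape)
```

Each row of proba sums to 1, and its three columns correspond to the three iris species.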
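from keras.models import Sequential
from keras.layers import Input, Dense, Dropout, Activation
from keras.optimizers import SGD
model = Sequential()  # initialize the model
model.add(Input(shape=(20,)))  # input layer (20 nodes)
model.add(Dense(64))  # connect the input layer to the first hidden layer (64 nodes)
model.add(Activation('tanh'))  # the first hidden layer uses tanh as its activation function
model.add(Dropout(0.5))  # use Dropout to reduce overfitting
model.add(Dense(64))  # connect the first hidden layer to the second hidden layer (64 nodes)
model.add(Activation('tanh'))  # the second hidden layer uses tanh as its activation function
model.add(Dense(1))  # connect the second hidden layer to the output layer (1 node)
model.add(Activation('sigmoid'))  # the output layer uses sigmoid as its activation function
sgd = SGD(learning_rate=0.1, momentum=0.9, nesterov=True)  # define the optimizer
model.compile(loss='mean_squared_error', optimizer=sgd)  # compile the model; the loss is the mean squared error
model.fit(x_train, y_train, epochs=20, batch_size=16)  # train the model (x_train and y_train must be prepared beforehand)
score = model.evaluate(x_test, y_test, batch_size=16)  # evaluate the model on the test data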
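To make the layer sizes concrete, here is a minimal NumPy sketch of the forward pass through a 20-64-64-1 network with tanh hidden layers and a sigmoid output. It is not Keras code: the weights are random placeholders, the helper names (dense, forward) are invented for illustration, and Dropout is left out because it is only active during training:

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(n_in, n_out):
    """Random placeholder weights and zero biases for one fully connected layer."""
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1, b1 = dense(20, 64)  # input (20) -> first hidden layer (64)
W2, b2 = dense(64, 64)  # first hidden layer (64) -> second hidden layer (64)
W3, b3 = dense(64, 1)   # second hidden layer (64) -> output (1)

def forward(x):
    h1 = np.tanh(x @ W1 + b1)
    h2 = np.tanh(h1 @ W2 + b2)
    return sigmoid(h2 @ W3 + b3)

batch = rng.normal(size=(16, 20))  # hypothetical batch of 16 samples
out = forward(batch)
print(out.shape)
```

Each sample yields a single value strictly between 0 and 1, which is why a sigmoid output is a common fit for binary targets.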
import logging
from gensim import models
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentences = [['first', 'sentence'], ['second', 'sentence']]  # tokenized sentences, passed in as lists of words
model = models.Word2Vec(sentences, min_count=1)  # train a word-vector model on the sentences above
print(model.wv['sentence'])  # print the word vector for the word "sentence"
2017-10-24 19:02:40,785 : INFO : collecting all words and their counts
2017-10-24 19:02:40,785 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-10-24 19:02:40,785 : INFO : collected 3 word types from a corpus of 4 raw words and 2 sentences
2017-10-24 19:02:40,785 : INFO : Loading a fresh vocabulary
2017-10-24 19:02:40,785 : INFO : min_count=1 retains 3 unique words (100% of original 3, drops 0)
2017-10-24 19:02:40,785 : INFO : min_count=1 leaves 4 word corpus (100% of original 4, drops 0)
2017-10-24 19:02:40,786 : INFO : deleting the raw counts dictionary of 3 items
2017-10-24 19:02:40,786 : INFO : sample=0.001 downsamples 3 most-common words
2017-10-24 19:02:40,786 : INFO : downsampling leaves estimated 0 word corpus (5.7% of prior 4)
2017-10-24 19:02:40,786 : INFO : estimated required memory for 3 words and 100 dimensions: 3900 bytes
2017-10-24 19:02:40,786 : INFO : resetting layer weights
2017-10-24 19:02:40,786 : INFO : training model with 3 workers on 3 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2017-10-24 19:02:40,788 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-10-24 19:02:40,788 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-10-24 19:02:40,788 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-10-24 19:02:40,789 : INFO : training on 20 raw words (0 effective words) took 0.0s, 0 effective words/s
2017-10-24 19:02:40,789 : WARNING : under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
