Course link: https://aistudio.baidu.com/aistudio/course/introduce/1337
Summary
This article mainly contains the following notes:
1. Reading CSV files
2. Reading XLS files
3. Reading and saving images with cv2
4. Basic MySQL operations from Python, plus some SQL statements
5. One-hot encoding with sklearn
6. Data normalization with sklearn
7. Feature selection with sklearn and mlxtend
8. Dimensionality reduction with PCA and LDA
Some of the code below has not been verified and may not run; it is only meant as usage examples.
1.1 Data from files
1.1.1 Reading CSV files
# Method 1: the csv module from the standard library
import csv
with open("cities.csv") as f:
    data = csv.reader(f)
    for line in data:
        print(line)
# Method 2: pandas
import pandas as pd
df = pd.read_csv("cities.csv")
print(df)
pd.read_csv("cities.csv", index_col=0)  # use the first column as the index
# First rows
df.head()
# Shape (rows, columns)
df.shape
1.1.2 Excel files
import pandas as pd
cities = pd.read_excel("cities.xls")   # read
cities.to_excel("cities2.xls")         # save
1.1.3 Reading and saving images
import cv2
img = cv2.imread("img.jpg")     # read
cv2.imwrite("img2.jpg", img)    # save
1.2 Data from databases
1.2.1 Using pymysql
import pandas as pd
import pymysql
# Connect to the MySQL database
mydb = pymysql.connect(host="localhost",
                       user='root',
                       password='1q2w3e4r5t',
                       db="books",
                       )
# Save the data from the CSV file into the MySQL database
cursor = mydb.cursor()
path = "/Users/qiwsir/Documents/Codes/DataSet"
df = pd.read_csv(path + "/jiangsu/cities.csv")
sql = 'insert into city (name, area, population, longd, latd) \
values ("%s","%s", "%s", "%s", "%s")'
for idx in df.index:
    row = df.iloc[idx]
    cursor.execute(sql % (row['name'], row['area'], row['population'], row['longd'], row['latd']))
mydb.commit()
# Read all records
sql = "SELECT * FROM city"
cursor.execute(sql)
datas = cursor.fetchall()
for data in datas:
    print(data)
# Read only the specified columns
sql_columns = 'SELECT name, area FROM city'
cursor.execute(sql_columns)
cursor.fetchall()
# Query all records sorted by area, largest first
sql_sort = "SELECT * FROM city ORDER BY area DESC"
cursor.execute(sql_sort)
cursor.fetchall()
# Take the first three rows, i.e. the three largest by area given the DESC ordering
cursor.execute("SELECT * FROM city ORDER BY area DESC LIMIT 3")
cursor.fetchall()
# Get all records where name="Soochow"
cursor.execute("SELECT * FROM city WHERE name='Soochow'")
cursor.fetchall()
# Get records whose value falls in a given range
cursor.execute("SELECT * FROM city WHERE population BETWEEN 8000000 AND 15000000")
cursor.fetchall()
# Get records whose name contains the character "S"
cursor.execute("SELECT * FROM city WHERE name LIKE '%S%'")
cursor.fetchall()
# Get records whose name starts with "N" (LIKE is case-insensitive under the default collation)
cursor.execute("SELECT * FROM city WHERE name LIKE 'n%'")
cursor.fetchall()
# AND
cursor.execute("SELECT * FROM city WHERE population>5000000 AND area<9000")
cursor.fetchall()
# OR
cursor.execute("SELECT * FROM city WHERE population>8000000 OR area>9000")
cursor.fetchall()
# Read directly with pandas
import pandas as pd
import pymysql
mydb = pymysql.connect(host="localhost",
                       user='root',
                       password='1q2w3e4r5t',
                       db="books",)
cities = pd.read_sql_query("SELECT * FROM city", con=mydb, index_col='id')
print(cities)
1.3 Data from web pages (https://aistudio.baidu.com/aistudio/education/group/info/1337)
This part covers requests and BeautifulSoup;
no notes taken here.
1.4 Data from APIs
Use requests directly;
no notes taken here.
2 Data cleaning
2.1 Converting data types
import pandas as pd
df = pd.DataFrame([{'col1':'a', 'col2':'1'},
                   {'col1':'b', 'col2':'2'}])
# Check the dtype of each column
print(df.dtypes)
# Convert the dtype of a specified column
df['col2'].astype(int)
# A few other operations, on DataFrames (movies, bras) from the course that are not defined here
movies.dtypes
movies['ID'].astype("str")
movies['想看'].apply(lambda x: int(x.replace('人', "")))
bras['creationTime'].str.split().apply(pd.Series, 0)
bras['productColor'].str.findall("[\u4E00-\u9FFF]+").str[0]
bras2 = bras['productSize'].str.upper()
bras2.str.findall("[a-zA-Z]+").str[0]
2.2 处理重复数据
pd.DataFrame.duolicated 可以直接提出同一列中有重复数据的记录
使用直接
help(pd.DataFrame.duolicated )
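A minimal sketch of how duplicated/drop_duplicates behave (the example data here is made up, not from the course):
import pandas as pd

df = pd.DataFrame({'name': ['Soochow', 'Nanjing', 'Soochow'],
                   'area':  [8488, 6587, 8488]})
print(df.duplicated())                 # True for the second "Soochow" row
print(df.duplicated(subset=['name']))  # check duplicates on one column only
df_clean = df.drop_duplicates()        # keep the first occurrence of each row
print(df_clean)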
2.3 Handling missing data
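No notes were taken for this section; as a placeholder, here is a minimal sketch of the common pandas calls for missing values (isnull, dropna, fillna), added for reference rather than taken from the course:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, 'x', 'y']})
print(df.isnull().sum())        # count missing values per column
print(df.dropna())              # drop rows that contain any missing value
print(df.fillna({'a': df['a'].mean(), 'b': 'unknown'}))  # fill per column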
2.4 Handling outliers
Box plots and outlier values; I could not follow this part and did not watch the video.
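For reference only (my own sketch, not from the course): the box-plot rule usually flags values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as outliers, which can be written directly with pandas quantiles.
import pandas as pd

s = pd.Series([8, 9, 10, 11, 12, 60])     # 60 is an obvious outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(s[(s < lower) | (s > upper)])        # the flagged outliers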
3 Feature transformation
3.1 Feature numericalization
# Encode categorical words as integers
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(['white', 'green', 'red', 'green', 'white'])
print(le.classes_)
print(le.transform(["green", 'red', 'red', 'white']))
# Tokenize the sentences
import re
d1 = "I am Laoqi. I am a programmer."
d2 = "Laoqi is in Soochow. It is a beautiful city."
words = re.findall(r"\w+", d1+d2)  # extract words with a regex instead of split(), which avoids the trailing-period problem
print(words)
# Count how often each word appears in each sentence
words = list(set(words))             # keep the unique words as a list
words = [w.lower() for w in words]   # lowercase the unique words
words
def count_word(document, unique_words):
    count_doc = []
    for word in unique_words:
        n = document.lower().count(word)
        count_doc.append(n)
    return count_doc
count1 = count_word(d1, words)
count2 = count_word(d2, words)
print(count1)
print(count2)
# Using sklearn
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
tf1 = count_vect.fit_transform([d1, d2])
count_vect.get_feature_names()  # two fewer words than the manual method: the default token_pattern drops single-character tokens such as "I" and "a"
tf1.toarray()                   # show the counts as an array
3.2 Feature binarization
Polarize the data so each value is only 0 or 1, or, for an image, set each pixel value to only 0 or 255.
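A minimal sketch with sklearn's Binarizer (the threshold value here is an arbitrary assumption):
from sklearn.preprocessing import Binarizer

X = [[1.5, -0.3, 4.0],
     [0.2,  2.1, -1.0]]
binarizer = Binarizer(threshold=0.0)   # values greater than the threshold become 1, the rest 0
print(binarizer.fit_transform(X))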
3.3 One-hot encoding
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False)   # in newer sklearn versions this parameter is called sparse_output
ohe.fit([[1],[2],[3],[4],[5],[6],[7],[8]])
a = ohe.transform([[1],[3],[5]])
print(a)
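Here transform returns a 3x8 dense array: the encoder learned 8 categories during fit, so each of the three inputs becomes a row with a single 1 in the column of its category.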
3.4 Data transformation
I did not fully understand this part; it seems to cover sklearn linear regression and polynomial regression.
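As a reference for what that typically looks like (my own sketch, not code from the course), polynomial regression in sklearn is usually a PolynomialFeatures step followed by LinearRegression:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(10).reshape(-1, 1)
y = 2 * X.ravel() ** 2 + 3            # quadratic data
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[11]]))          # should be close to 2*121 + 3 = 245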
3.5 Feature discretization
Discretize continuous feature data into bins.
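A minimal sketch with sklearn's KBinsDiscretizer (the bin count and strategy here are arbitrary choices):
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

ages = np.array([[18], [22], [35], [41], [60], [75]])
kbd = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
print(kbd.fit_transform(ages))   # each age mapped to a bin index 0, 1 or 2
print(kbd.bin_edges_)            # the bin boundaries that were learned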
3.6 Data normalization
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler
iris = datasets.load_iris()
iris_std = StandardScaler().fit_transform(iris.data)      # zero mean, unit variance
iris_mm = MinMaxScaler().fit_transform(iris.data)         # scale to [0, 1]
iris_robust = RobustScaler().fit_transform(iris.data)     # scale with median and IQR, robust to outliers
from sklearn.preprocessing import Normalizer
norma = Normalizer()                 # default: L2 norm per sample
norma.fit_transform([[3, 4]])        # -> [[0.6, 0.8]]
norma1 = Normalizer(norm='l1')
norma1.fit_transform([[3, 4]])       # -> [[3/7, 4/7]]
norma_max = Normalizer(norm='max')
norma_max.fit_transform([[3, 4]])    # -> [[0.75, 1.0]]
References:
1. sklearn data preprocessing: StandardScaler
2. Understanding sklearn.preprocessing.MinMaxScaler
3. Introduction and usage of RobustScaler in sklearn
4. The L1 and L2 norms of sklearn's Normalizer
4 Feature selection (see also https://www.cnblogs.com/cgmcoding/p/13523501.html)
Feature selection picks out the useful features, reduces the feature dimensionality, and can improve model accuracy.
4.1 Wrapper methods
Main idea: a wrapper method repeatedly selects feature subsets from the initial feature set, trains a learner on each subset, and uses the learner's performance to evaluate the subsets until the best one is found. Wrapper feature selection is therefore optimized directly for the given learner.
# Load data
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.data import wine_data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X, y = wine_data()
print(X.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=1)
std = StandardScaler()
X_train_std = std.fit_transform(X_train)
# Sequential feature selection
knn = KNeighborsClassifier(n_neighbors=3)
sfs = SFS(estimator=knn,
          k_features=4,
          forward=True,
          floating=False,
          verbose=2,
          scoring='accuracy',
          cv=0)
sfs.fit(X_train_std, y_train)
sfs_X = sfs.transform(X_train_std)
print("before:", X_train_std.shape)
print("selected:", sfs_X.shape)
# Exhaustive feature selection
from mlxtend.feature_selection import ExhaustiveFeatureSelector
from sklearn.ensemble import RandomForestClassifier
exha = ExhaustiveFeatureSelector(RandomForestClassifier(n_jobs=-1),
                                 min_features=2,
                                 max_features=4,
                                 scoring='roc_auc',
                                 print_progress=True,
                                 cv=2)
exha.fit(X_train_std, y_train)
exha_X = exha.transform(X_train_std)
print("before:", X_train_std.shape)
print("selected:", exha_X.shape)
# Recursive feature elimination (1): RFE
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor
rfe = RFE(RandomForestRegressor(), n_features_to_select=5)
rfe.fit(X_train_std, y_train)
rfe_X = rfe.transform(X_train_std)
print("before:", X_train_std.shape)
print("selected:", rfe_X.shape)
# Recursive feature elimination (2): RFECV, with cross-validation
# Note: this snippet is from a different example; df_train and its 'loss' column are not defined in these notes
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestRegressor
from datetime import datetime
y_train = df_train['loss']
rfr = RandomForestRegressor(n_estimators=100, max_features='sqrt', max_depth=12, n_jobs=-1)
rfecv = RFECV(estimator=rfr,
              step=10,
              cv=3,
              min_features_to_select=10,
              scoring='neg_mean_absolute_error',
              verbose=2)
start_time = datetime.now()
rfecv.fit(X_train, y_train)
end_time = datetime.now()
m, s = divmod((end_time - start_time).total_seconds(), 60)
print('Time taken: {0} minutes and {1} seconds.'.format(m, round(s, 2)))
4.2 Filter methods
# Pick out the k most suitable features
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
iris = load_iris()
X, y = iris.data, iris.target
skb = SelectKBest(chi2, k=2)   # keep the 2 features with the best chi-square scores
result = skb.fit(X, y)
print("X^2 is: ", result.scores_)
print("P-values is: ", result.pvalues_)
"""
方差过滤
VarianceThreshold
比如一个特征本身的方差很小,就表示样本在这个特征上基本没有差异,可能特征中的大多数值都一样,甚至整个特征的取值都相同,那这个特征对于样本区分没有什么作用。
所以无论接下来的特征工程要做什么,都要优先消除方差为0的特征。VarianceThreshold有重要参数threshold,表示方差的阈值,表示舍弃所有方差小于threshold的特征,不填默认为0,即删除所有的记录都相同的特征。
"""
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold
# Note: `data` is an external DataFrame with a 'TARGET' column; it is not defined in these notes
train_features, test_features, train_labels, test_labels = train_test_split(
    data.drop(labels=['TARGET'], axis=1),
    data['TARGET'],
    test_size=0.2,
    random_state=41)
qconstant_filter = VarianceThreshold(threshold=0.01)   # drop quasi-constant features (variance below 0.01)
qconstant_filter.fit(train_features)
Feature engineering summary
When the dataset is large, run variance filtering and mutual information first, then apply other feature selection methods.
With logistic regression, prefer embedded methods.
With support vector machines, prefer wrapper methods.
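The notes above do not show the mutual information filter; a minimal sketch of what it could look like with SelectKBest (my own example on the iris data, not from the course):
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)
mi_skb = SelectKBest(mutual_info_classif, k=2)   # keep the 2 features with the highest mutual information
X_mi = mi_skb.fit_transform(X, y)
print(X_mi.shape)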
Reference:
Feature engineering in sklearn (filter, embedded, and wrapper methods):
https://blog.csdn.net/xlperpetual/article/details/103402737
5 Feature extraction
Uses sklearn's PCA and LDA.
Introduction to PCA and LDA: https://blog.csdn.net/qq_20386411/article/details/83009694
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# X, y are a feature matrix and labels, e.g. the iris data loaded earlier
# PCA: unsupervised, keeps the directions of largest variance
pca = PCA(n_components=2)
X_r = pca.fit(X).transform(X)
# LDA: supervised, uses the class labels y
lda = LinearDiscriminantAnalysis(n_components=2)
X_r2 = lda.fit(X, y).transform(X)
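A quick way to check how much information PCA keeps after fitting:
print(pca.explained_variance_ratio_)   # for iris, the first two components explain roughly 97% of the variance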