Machine Learning: Data Preprocessing and Data Analysis in Python

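Using the pandas library in Python, these notes import a dataset and take a quick look at its structure and missing values. The dataset is then split into an input set and an output set. The last part covers some simple data analysis: feature correlation and outliers.
These are personal study notes shared as-is, so please excuse any gaps in coverage.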

1. Get Data

# import the dataset
import pandas as pd

df = pd.read_csv('a.csv')

# shape of dataset
df.shape

# first 5 rows of dataset
df.head(5)

# check for missing values
df.isna().sum()
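
To make the missing-value check concrete, here is a minimal sketch on a small made-up frame. The frame `toy`, its columns `a` and `b`, and all values are assumptions for illustration only and are not taken from `a.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a.csv
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": ["x", "y", None, None],
})

# Number of missing values per column
missing_per_column = toy.isna().sum()

# Share of missing values per column, often easier to read on large data
missing_share = toy.isna().mean()

print(missing_per_column)
print(missing_share)
```

The per-column share can help decide whether a column is worth keeping before any imputation is attempted.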

2. Split Dataset

# value counts of the target column ('bool')
df['bool'].value_counts()

# convert boolean values to binary (1 for True, 0 otherwise)
df['bool'] = (df['bool'] == True).astype(int)

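# split the dataset into input (all other columns) and output ('bool')
x = df.drop(columns='bool')   # drop the target column (same as axis=1; newer pandas rejects the positional form)
y = df['bool']                # bracket access; df.bool can clash with a DataFrame method of the same name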
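
The conversion and split above can be tried end to end on a tiny frame. This is a sketch under assumptions: the frame `toy` and its feature columns `a` and `b` are invented for illustration, and the target is assumed to contain only True/False with no missing values.

```python
import pandas as pd

# Hypothetical toy data; 'bool' is the target column, as in the notes
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [0.5, 0.1, 0.9, 0.3],
    "bool": [True, False, True, True],
})

# Convert the boolean target to 0/1
toy["bool"] = toy["bool"].astype(int)

# Inputs: every column except the target; output: the target column
X = toy.drop(columns="bool")
y = toy["bool"]

print(X.columns.tolist())
print(y.tolist())
```

Note that `drop` returns a new frame, so `toy` itself still contains the target column after the split.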

3. Data Exploration

This section covers boxplots, outliers, scatter plots, and bar plots.

# feature type and size of dataset
df.info()

# Summary statistics for selected columns (count, mean, std, min, quartiles, max)
df[['a','b','c']].describe()

# Boxplot
df[['a','b','c']].boxplot()

# Outliers
# https://github.com/aprilypchen/depy2016/blob/master/DePy_Talk.ipynb
# This function finds the values that lie more than 1.5 x IQR below Q1 or above Q3 (Tukey's rule).
# It returns the index labels of the outliers and the outlier values themselves.
import numpy as np

def find_outliers_tukey(x):
    q1 = np.percentile(x, 25)
    q3 = np.percentile(x, 75)
    iqr = q3-q1 
    floor = q1 - 1.5*iqr
    ceiling = q3 + 1.5*iqr
    outlier_indices = list(x.index[(x < floor)|(x > ceiling)])
    outlier_values = list(x[outlier_indices])

    return outlier_indices, outlier_values

# outlier_indices holds the index labels of the outliers, outlier_values holds their values
outlier_indices, outlier_values = find_outliers_tukey(df['a'])
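
As a worked example of the 1.5 x IQR rule, the sketch below repeats the function so that it runs on its own and applies it to a short made-up Series (the values are assumptions chosen so that exactly one point is flagged).

```python
import numpy as np
import pandas as pd

def find_outliers_tukey(x):
    # Quartiles and interquartile range
    q1 = np.percentile(x, 25)
    q3 = np.percentile(x, 75)
    iqr = q3 - q1
    # Tukey fences
    floor = q1 - 1.5 * iqr
    ceiling = q3 + 1.5 * iqr
    outlier_indices = list(x.index[(x < floor) | (x > ceiling)])
    outlier_values = list(x[outlier_indices])
    return outlier_indices, outlier_values

# Hypothetical data: five similar values and one extreme value
values = pd.Series([10, 12, 11, 13, 12, 100])
idx, vals = find_outliers_tukey(values)

# Here Q1 = 11.25 and Q3 = 12.75, so the fences are 9.0 and 15.0
print(idx, vals)
```

Only the value 100 falls outside the fences; 10 stays inside because it is above the lower fence of 9.0.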


# Correlation heatmap
# https://medium.com/@szabo.bibor/how-to-create-a-seaborn-correlation-heatmap-in-python-834c0686b88e
# The heatmap shows the correlation of each numeric feature with every other feature.
# Categorical variables are left out, since a correlation coefficient is not meaningful for them.
import seaborn as sns
sns.set(font_scale=2)
# numeric_only=True needs pandas >= 1.5; it skips non-numeric columns
heatmap = sns.heatmap(df.corr(numeric_only=True), vmin=-1, vmax=1, cbar=False, annot=True)
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':12}, pad=12);
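
The heatmap is only a rendering of the correlation matrix, so it can help to look at the matrix itself first. A minimal sketch, assuming a made-up frame `toy` with a non-numeric `label` column that is filtered out before computing correlations:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],              # rises with a
    "c": [4, 3, 2, 1],              # falls as a rises
    "label": ["p", "q", "p", "q"],  # non-numeric, excluded below
})

# Keep numeric columns only, then compute pairwise Pearson correlations
corr = toy.select_dtypes(include="number").corr()

print(corr.round(2))
```

Selecting numeric columns explicitly is one way to avoid errors or meaningless entries when the frame mixes numeric and categorical features.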


# Scatter plot of column a against column b
df.plot(kind='scatter', x='a', y='b', color='blue')

# Bar plot of how often each value occurs in a column
df['a'].value_counts().plot(kind='bar', title='Barplot')
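
The bar plot draws the numbers returned by `value_counts()`, so inspecting them directly is a quick sanity check. A small sketch on an invented Series (the values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical categorical column
toy = pd.Series(["x", "y", "x", "z", "x", "y"], name="a")

# Absolute counts, sorted from most to least frequent
counts = toy.value_counts()

# Relative frequencies, useful when comparing columns of different length
shares = toy.value_counts(normalize=True)

print(counts)
print(shares)
```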

That's all.
