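Starting a Journey into Data Mining and Data Analysis

The data mining journey

  • Introduction to data mining and where it is applied
  • Setting up a Python data mining environment
  • Affinity analysis example: recommending products based on purchase habits
  • Classic classification example: inferring plant species from measurements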

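Introduction to data mining

  • Data mining aims to let computers make decisions based on existing data. A decision might be predicting tomorrow's weather, blocking spam email, detecting the language of a website, or finding a new romantic partner on a dating site.
  • Data mining draws on algorithms, statistics, engineering, optimization theory, and related areas of computer science.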

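Affinity analysis

  • Offering website users diversified and customized services, or serving targeted ads
  • Recommending movies or products to users and selling them related items
  • Finding people who are genetically related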

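Product recommendation

  • By analyzing users' historical transaction data, find the transactions in which users bought the same products, then recommend the same or similar products to them through marketing tactics such as discounts, purchase limits, and pre-sales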

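Simple ranking rules

  • There are many ways to measure how good a rule is; the most common are support and confidence
    • Support: the number of times the rule holds in the dataset. Dividing this count by the total number of samples gives the proportion of samples in which the rule holds
    • Confidence: how accurate the rule is, that is, the fraction of samples matching the premise for which the conclusion also holds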
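
As a worked illustration of these two measures, the sketch below counts support and confidence for a single rule on a tiny hand-made purchase table. The data and the helper name `rule_stats` are invented for this illustration and are not part of the original dataset.

```python
import numpy as np

# Hypothetical toy data: columns are bread, milk, apples; 1 = purchased.
toy = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
    [1, 1, 0],
])

def rule_stats(data, premise, conclusion):
    """Return (support, confidence) for the rule premise -> conclusion."""
    has_premise = data[:, premise] == 1
    has_both = has_premise & (data[:, conclusion] == 1)
    support = int(has_both.sum())        # times the rule holds
    n_premise = int(has_premise.sum())   # times the premise occurs
    confidence = support / n_premise if n_premise else 0.0
    return support, confidence

# Rule: "if a person buys bread, they also buy milk".
support, confidence = rule_stats(toy, premise=0, conclusion=1)
print(support, confidence)
```

Four of the five customers bought bread and three of those also bought milk, so the rule has a support of 3 and a confidence of 0.75.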

A simple example

import numpy as np
# Load the data.
# Columns: bread, milk, cheese, apples, bananas.
# 0 means the item was not purchased, 1 means it was purchased.
dataset_filename = "affinity_dataset.txt"
X = np.loadtxt(dataset_filename)
print(X[:5])

# First step toward checking the rule "if a person buys apples, they also buy bananas":
# count how many customers bought apples (the premise).
# The names of the features, for your reference.
features = ["bread", "milk", "cheese", "apples", "bananas"]
num_apple_purchases = 0
for sample in X:
    if sample[3] == 1:  # column 3 is apples
        num_apple_purchases += 1
print("{0} people bought apples".format(num_apple_purchases))

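# defaultdict returns a default value when the requested key is missing.
from collections import defaultdict
# Number of times each rule holds.
valid_rules = defaultdict(int)
# Number of times each rule is violated.
invalid_rules = defaultdict(int)
# Number of times each premise occurs, i.e. how many samples contain the premise item.
num_occurances = defaultdict(int)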

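# For every sample, treat each purchased item as a premise
# and check every other item as a possible conclusion.
n_features = X.shape[1]  # all five items, so bananas (index 4) are included
for sample in X:
    for premise in range(n_features):
        if sample[premise] == 0:
            continue
        num_occurances[premise] += 1
        for conclusion in range(n_features):
            if premise == conclusion:
                continue  # skip trivial rules such as "apples -> apples"
            if sample[conclusion] == 1:
                valid_rules[(premise, conclusion)] += 1
            else:
                invalid_rules[(premise, conclusion)] += 1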
                
# Compute support and confidence.
support = valid_rules
confidence = defaultdict(float)
for premise, conclusion in valid_rules.keys():
    rule = (premise, conclusion)
    confidence[rule] = valid_rules[rule] / num_occurances[premise]
# Helper that prints a rule in a readable form.
def print_rule(premise, conclusion, support, confidence, features):
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule: If a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
# Example: the rule "milk -> apples" (premise index 1, conclusion index 3).
premise = 1
conclusion = 3
print_rule(premise, conclusion, support, confidence, features)

# Find the best rules by support.
from operator import itemgetter
sorted_support = sorted(support.items(), key=itemgetter(1), reverse=True)
for index in range(5):
    print("Rule # {0}".format(index + 1))
    premise, conclusion = sorted_support[index][0]
    print_rule(premise, conclusion, support, confidence, features)
# Find the best rules by confidence.
sorted_confidence = sorted(confidence.items(), key=itemgetter(1), reverse=True)
for index in range(5):
    print("Rule # {0}".format(index + 1))
    premise, conclusion = sorted_confidence[index][0]
    print_rule(premise, conclusion, support, confidence, features)
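
The nested loops above can also be expressed with matrix operations. The following is a minimal sketch of that alternative, not the approach used in this post; the function name `affinity_tables` and the small `demo` array are invented for illustration. It assumes the input is a 0/1 matrix with one row per transaction.

```python
import numpy as np

def affinity_tables(X):
    """Return (support, confidence) matrices for all premise -> conclusion pairs."""
    X = np.asarray(X, dtype=int)
    co_counts = X.T @ X                    # co_counts[i, j]: samples containing both i and j
    premise_counts = np.diag(co_counts)    # samples containing item i
    support = co_counts.copy()
    np.fill_diagonal(support, 0)           # ignore trivial rules i -> i
    # Divide each row by its premise count, leaving 0.0 where the premise never occurs.
    with np.errstate(divide="ignore", invalid="ignore"):
        confidence = np.where(premise_counts[:, None] > 0,
                              support / premise_counts[:, None], 0.0)
    return support, confidence

# Hypothetical toy data with three items.
demo = np.array([[1, 1, 0],
                 [1, 0, 1],
                 [0, 1, 1],
                 [1, 1, 1]])
demo_support, demo_confidence = affinity_tables(demo)
print(demo_support[0, 1], demo_confidence[0, 1])
```

Here `demo_support[i, j]` plays the role of `valid_rules[(i, j)]` and `demo_confidence[i, j]` the role of `confidence[(i, j)]` in the loop-based version.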

Dataset location and description. Extraction code: hjyc
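
If the dataset file is not at hand, a synthetic stand-in in the same 0/1 layout is enough to run the example. The sketch below is only that: the function name `make_affinity_dataset`, the purchase probability of 0.4, and the seed are arbitrary assumptions, and the rules found on random data will not match those from the real file.

```python
import io
import numpy as np

def make_affinity_dataset(n_samples=100, n_features=5, seed=14):
    """Generate a random 0/1 purchase matrix (synthetic, for testing only)."""
    rng = np.random.default_rng(seed)
    # Each item is bought independently with probability 0.4 (arbitrary choice).
    return (rng.random((n_samples, n_features)) < 0.4).astype(int)

X_synthetic = make_affinity_dataset()

# Round-trip through the text format that np.loadtxt reads.
# An in-memory buffer is used here; a file name works the same way.
buffer = io.StringIO()
np.savetxt(buffer, X_synthetic, fmt="%d")
buffer.seek(0)
X_loaded = np.loadtxt(buffer)
print(X_loaded[:5])
```

Passing a file name such as "affinity_dataset.txt" to `np.savetxt` instead of the buffer produces a file that the loading code in the example above can read.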

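A brief introduction to classification

  • The goal of classification is to train a model on a dataset whose classes are known, then use that model to classify data whose classes are unknown.
  • What is a class, and how should a class value be read? Consider these examples:
    • Determining a plant's species from measurements. The class answers "Which species does this plant belong to?"
    • Deciding whether an image contains a dog. The class answers "Is there a dog in this image?"
    • Deciding from test results whether a patient is infected. The class answers "Is this patient infected?"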

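Implementing the OneR algorithm: description

  • The idea behind OneR is simple: classify an individual by the class that individuals with the same feature value most often belong to in the existing data
  • OneR first iterates over every value of every feature. For each feature value it counts how often that value appears in each class, finds the class in which it appears most often, and counts how often it appears in the other classes (these are the errors)
  • OneR then picks the feature with the lowest total error as the single classification rule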
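
The three steps above can be sketched compactly on a toy dataset. This is an illustrative sketch rather than the implementation used in the example that follows; the function name `one_r` and the toy rows and labels are invented for the illustration.

```python
from collections import Counter, defaultdict

def one_r(rows, labels):
    """Return (best_feature, rule, errors), where rule maps value -> class."""
    best = None
    for feature in range(len(rows[0])):
        # Count classes for every value of this feature.
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[feature]][label] += 1
        # Each value predicts its most frequent class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Errors: samples outside the majority class of their feature value.
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:
            best = (feature, rule, errors)
    return best

# Hypothetical toy data: two binary features, two classes.
rows = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 1), (0, 0)]
labels = ["a", "a", "b", "b", "b", "a"]
best_feature, rule, errors = one_r(rows, labels)
print(best_feature, rule, errors)
```

In this toy data, feature 0 separates the classes perfectly (0 errors), while feature 1 misclassifies two samples, so OneR keeps feature 0.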

A simple example

# Load the data.
import numpy as np
from sklearn.datasets import load_iris
dataset = load_iris()
X = dataset.data
y = dataset.target
n_samples, n_features = X.shape
# Compute the mean of each attribute.
attribute_means = X.mean(axis=0)
assert attribute_means.shape == (n_features,)
# Discretize the data, using the mean as the threshold.
X_d = np.array(X >= attribute_means, dtype='int')
# Split the dataset into a training set and a test set.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_d, y, random_state=14)
print("There are {} training samples".format(y_train.shape[0]))
print("There are {} testing samples".format(y_test.shape[0]))
from collections import defaultdict
from operator import itemgetter

def train(X, y_true, feature):
    """Compute the predictors and error for a given feature using the OneR algorithm.

    Parameters
    ----------
    X: array [n_samples, n_features]
        Two-dimensional array holding the dataset. Each row is a sample, each column a feature.
    y_true: array [n_samples,]
        One-dimensional array of class values, aligned with X so that
        y_true[i] is the class value for sample X[i].
    feature: int
        Index of the feature to test, with 0 <= feature < n_features.

    Returns
    -------
    predictors: dictionary mapping value -> prediction
        For each value the feature takes, the class predicted for that value.
    error: int
        Number of training samples that this rule misclassifies.
    """
    # Check that feature is a valid index
    n_samples, n_features = X.shape
    assert 0 <= feature < n_features
    # Get all the unique values that this feature takes
    values = set(X[:, feature])
    # Stores the predictors dictionary that is returned
    predictors = dict()
    errors = []
    for current_value in values:
        most_frequent_class, error = train_feature_value(X, y_true, feature, current_value)
        predictors[current_value] = most_frequent_class
        errors.append(error)
    # Total error of classifying with this feature
    total_error = sum(errors)
    return predictors, total_error
    

def train_feature_value(X, y_true, feature, value):
    # Dictionary counting how often each class occurs for this feature value
    class_counts = defaultdict(int)
    # Iterate over the samples and count each class/value pair
    for sample, y in zip(X, y_true):
        if sample[feature] == value:
            class_counts[y] += 1
    # Sort by count (highest first) and take the first entry as the best class
    sorted_class_counts = sorted(class_counts.items(), key=itemgetter(1), reverse=True)
    most_frequent_class = sorted_class_counts[0][0]
    # The error is the number of samples not in the most frequent class
    error = sum([class_count for class_value, class_count in class_counts.items()
                 if class_value != most_frequent_class])
    return most_frequent_class, error


# Compute the predictors for every feature.
all_predictors = {variable: train(X_train, y_train, variable) for variable in range(X_train.shape[1])}
errors = {variable: error for variable, (mapping, error) in all_predictors.items()}
# Choose the best feature and save it as the "model".
# Sort by error, lowest first.
best_variable, best_error = sorted(errors.items(), key=itemgetter(1))[0]
print("The best model is based on variable {0} and has error {1:.2f}".format(best_variable, best_error))
model = {'variable': best_variable,
         'predictor': all_predictors[best_variable][0]}
# Define the prediction function.
def predict(X_test, model):
    variable = model['variable']
    predictor = model['predictor']
    y_predicted = np.array([predictor[int(sample[variable])] for sample in X_test])
    return y_predicted
y_predicted = predict(X_test, model)
# Compute the accuracy as the percentage of predictions in y_predicted that match y_test
accuracy = np.mean(y_predicted == y_test) * 100
print("The test accuracy is {:.1f}%".format(accuracy))
