Official website:
https://rasbt.github.io/mlxtend/
Mlxtend (machine learning extensions) is a Python library of useful tools for the day-to-day data science tasks.
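Association rule analysis is one of the most active research methods in data mining. Its goal is to find relationships among the items in a dataset that are not directly visible in the data. The various association rule algorithms attack the problem from different angles to shrink the search space and to reduce the number of passes over the data. Apriori is the classic frequent-itemset mining algorithm and the first to make association rule extraction feasible on large datasets. Its core idea is to generate candidate itemsets and their supports by joining, then to prune them down to the frequent itemsets.
Association rules are a classic task in Mining Massive Datasets (MMDs). The main goal is to mine frequent items, and the association rules that go with them, from a collection of transactions. The idea is usually introduced through the well-known "beer and diapers" story.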
pip install mlxtend
# or
pip install mlxtend --upgrade --no-deps
# or
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple mlxtend
from mlxtend.data import autompg_data
X, y = autompg_data()
print('Dimensions: %s x %s' % (X.shape[0], X.shape[1]))
print('\nHeader: %s' % ['cylinders', 'displacement',
'horsepower', 'weight', 'acceleration',
'model year', 'origin', 'car name'])
print('1st row', X[0])
from mlxtend.data import boston_housing_data
X, y = boston_housing_data()
print('Dimensions: %s x %s' % (X.shape[0], X.shape[1]))
print('1st row', X[0])
from mlxtend.data import iris_data
X, y = iris_data()
print('Dimensions: %s x %s' % (X.shape[0], X.shape[1]))
print('\nHeader: %s' % ['sepal length', 'sepal width',
'petal length', 'petal width'])
print('1st row', X[0])
import numpy as np
print('Classes: Setosa, Versicolor, Virginica')
print(np.unique(y))
print('Class distribution: %s' % np.bincount(y))
from mlxtend.data import loadlocal_mnist
import platform
# The MNIST file names differ between Windows and other platforms
if not platform.system() == 'Windows':
    X, y = loadlocal_mnist(
        images_path='train-images-idx3-ubyte',
        labels_path='train-labels-idx1-ubyte')
else:
    X, y = loadlocal_mnist(
        images_path='train-images.idx3-ubyte',
        labels_path='train-labels.idx1-ubyte')
print('Dimensions: %s x %s' % (X.shape[0], X.shape[1]))
print('\n1st row', X[0])
import numpy as np
print('Digits: 0 1 2 3 4 5 6 7 8 9')
print('labels: %s' % np.unique(y))
print('Class distribution: %s' % np.bincount(y))
import numpy as np
from mlxtend.data import make_multiplexer_dataset
X, y = make_multiplexer_dataset(address_bits=2,
sample_size=10,
positive_class_ratio=0.5,
shuffle=False,
random_seed=123)
print('Features:\n', X)
print('\nClass labels:\n', y)
from mlxtend.data import mnist_data
X, y = mnist_data()
print('Dimensions: %s x %s' % (X.shape[0], X.shape[1]))
print('1st row', X[0])
from mlxtend.data import three_blobs_data
X, y = three_blobs_data()
print('Dimensions: %s x %s' % (X.shape[0], X.shape[1]))
print('1st row', X[0])
import numpy as np
print('Suggested cluster labels')
print(np.unique(y))
print('Label distribution: %s' % np.bincount(y))
import matplotlib.pyplot as plt
plt.scatter(X[:,0], X[:,1],
c='white',
marker='o',
s=50)
plt.grid()
plt.show()
from mlxtend.data import wine_data
X, y = wine_data()
print('Dimensions: %s x %s' % (X.shape[0], X.shape[1]))
print('\nHeader: %s' % ['alcohol', 'malic acid', 'ash', 'ash alcalinity',
'magnesium', 'total phenols', 'flavanoids',
'nonflavanoid phenols', 'proanthocyanins',
'color intensity', 'hue', 'OD280/OD315 of diluted wines',
'proline'])
print('1st row', X[0])
import numpy as np
print('Classes: %s' % np.unique(y))
print('Class distribution: %s' % np.bincount(y))
from mlxtend.frequent_patterns import apriori
apriori(df, min_support=0.5, use_colnames=False, max_len=None, verbose=0, low_memory=False)
from mlxtend.frequent_patterns import association_rules
# Generate a DataFrame of association rules including the metrics 'score', 'confidence', and 'lift'
association_rules(df, metric='confidence', min_threshold=0.8, support_only=False)
from mlxtend.frequent_patterns import fpgrowth
# Get frequent itemsets from a one-hot DataFrame
fpgrowth(df, min_support=0.5, use_colnames=False, max_len=None, verbose=0)
from mlxtend.frequent_patterns import fpmax
# Get maximal frequent itemsets from a one-hot DataFrame
fpmax(df, min_support=0.5, use_colnames=False, max_len=None, verbose=0)
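To make "maximal frequent itemsets" concrete, here is a minimal pure-Python sketch. It does not use mlxtend and is not how fpmax works internally; it only illustrates the definition: a frequent itemset is maximal when no proper superset of it is also frequent. The support values and the helper name `maximal_itemsets` are assumptions made for this illustration.

```python
# Illustrative frequent itemsets with their supports (assumed values for a toy grocery example).
frequent = {
    frozenset({'Eggs'}): 0.8,
    frozenset({'Kidney Beans'}): 1.0,
    frozenset({'Milk'}): 0.6,
    frozenset({'Onion'}): 0.6,
    frozenset({'Yogurt'}): 0.6,
    frozenset({'Eggs', 'Kidney Beans'}): 0.8,
    frozenset({'Eggs', 'Onion'}): 0.6,
    frozenset({'Kidney Beans', 'Milk'}): 0.6,
    frozenset({'Kidney Beans', 'Onion'}): 0.6,
    frozenset({'Kidney Beans', 'Yogurt'}): 0.6,
    frozenset({'Eggs', 'Kidney Beans', 'Onion'}): 0.6,
}


def maximal_itemsets(frequent):
    """Keep only the itemsets that have no frequent proper superset (hypothetical helper)."""
    return {
        itemset: support
        for itemset, support in frequent.items()
        # `itemset < other` is the proper-subset test on frozensets
        if not any(itemset < other for other in frequent)
    }


maximal = maximal_itemsets(frequent)
for itemset, support in maximal.items():
    print(sorted(itemset), support)
```

Every other frequent itemset in the input is a subset of one of the maximal ones, which is why the maximal set is a compact summary of the frequent itemsets.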
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import itertools
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import EnsembleVoteClassifier
from mlxtend.data import iris_data
from mlxtend.plotting import plot_decision_regions
# Initializing Classifiers
clf1 = LogisticRegression(random_state=0)
clf2 = RandomForestClassifier(random_state=0)
clf3 = SVC(random_state=0, probability=True)
eclf = EnsembleVoteClassifier(clfs=[clf1, clf2, clf3],
weights=[2, 1, 1], voting='soft')
# Loading some example data
X, y = iris_data()
X = X[:,[0, 2]]
# Plotting Decision Regions
gs = gridspec.GridSpec(2, 2)
fig = plt.figure(figsize=(10, 8))
labels = ['Logistic Regression',
'Random Forest',
'RBF kernel SVM',
'Ensemble']
# Fit each classifier and draw its decision regions in one cell of the 2x2 grid
for clf, lab, grd in zip([clf1, clf2, clf3, eclf],
                         labels,
                         itertools.product([0, 1],
                                           repeat=2)):
    clf.fit(X, y)
    ax = plt.subplot(gs[grd[0], grd[1]])
    fig = plot_decision_regions(X=X, y=y,
                                clf=clf, legend=2)
    plt.title(lab)
plt.show()
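General form of an association rule
Minimum support and minimum confidence
Itemsets
Support count
Association Analysis: finding interesting relationships in large datasets.
Frequent Item Sets: sets of items that often appear together. Any set containing zero or more items is called an itemset.
Support: the proportion of records in the dataset that contain the itemset. Support is defined for itemsets.
Confidence: the probability that certain other items also appear, given that some items appear. Confidence is defined for rules.
Association Rules: suggest that a strong relationship may exist between two items. A rule is written as A->B, and it is measured by its support and its confidence.
Itemset support: the number of times an itemset occurs, as a percentage of the total number of transactions in the dataset.
Support: a percentage, namely the ratio between the number of times a combination of products occurs and the total number of transactions. The higher the support, the more frequently this combination (which may be a single product) occurs.
Confidence: a conditional probability, namely how likely you are to buy product B given that you bought product A.
Lift: the degree to which the presence of product A raises the probability of product B. lift(A→B) = confidence(A→B) / support(B)
What counts as a frequent itemset? Going by the name, it is a set that occurs many times. That is right, but how many is "many"? Here is the definition. Frequent itemset: an itemset whose support is greater than or equal to the minimum support (Min Support) threshold.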
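The three measures defined above can be worked through by hand on a small example. The sketch below is plain Python, not mlxtend; the five transactions and the helper names `support`, `confidence`, and `lift` are assumptions chosen for illustration.

```python
# Five toy grocery transactions (assumed for illustration); sets drop duplicate items.
transactions = [
    {'Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'},
    {'Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'},
    {'Milk', 'Apple', 'Kidney Beans', 'Eggs'},
    {'Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'},
    {'Corn', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs'},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(A and B together) / support(A): estimate of P(B | A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """confidence(A -> B) / support(B), as in the formula above."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


sup = support({'Onion', 'Eggs'}, transactions)       # 3 of 5 transactions
conf = confidence({'Onion'}, {'Eggs'}, transactions)  # every Onion basket also has Eggs
lft = lift({'Onion'}, {'Eggs'}, transactions)
print('support=%.2f confidence=%.2f lift=%.2f' % (sup, conf, lft))
```

With a minimum support of 0.6, {Onion, Eggs} just qualifies as a frequent itemset, and a lift above 1 indicates that Onion and Eggs appear together more often than independence would predict.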
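1. Import the data and preprocess it
2. Compute the frequent itemsets
3. For each frequent itemset, compute the support and the confidence
4. Using the given minimum support and minimum confidence, output the association rules that satisfy both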
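(1) Find the frequent itemsets (support must be greater than or equal to the given minimum support threshold)
Generate the frequent itemsets.
Every subset of a frequent itemset must also be frequent.
Specify a minimum support (min_support) to filter out infrequent itemsets; this both lightens the computational load and improves prediction quality.
(2) Find the rules in the frequent itemsets from the previous step
Generate the association rules.
Specify a minimum confidence (metric = "confidence", min_threshold = 0.01) to filter out weak rules.
Strong association rules are derived from the frequent itemsets. After the first step, itemsets below the minimum support threshold have already been discarded; if the remaining itemsets also meet the minimum confidence threshold, they yield strong association rules.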
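Step (2) can be sketched in plain Python as follows. This is a minimal illustration, not mlxtend's implementation: the support values are assumed for a toy example, and `generate_rules` is a hypothetical helper. Each frequent itemset is split into every possible antecedent and consequent pair, and a rule is kept only if its confidence reaches the threshold.

```python
from itertools import combinations

# Frequent itemsets and their supports (assumed values for illustration).
supports = {
    frozenset({'Eggs'}): 0.8,
    frozenset({'Kidney Beans'}): 1.0,
    frozenset({'Onion'}): 0.6,
    frozenset({'Eggs', 'Kidney Beans'}): 0.8,
    frozenset({'Eggs', 'Onion'}): 0.6,
    frozenset({'Kidney Beans', 'Onion'}): 0.6,
    frozenset({'Eggs', 'Kidney Beans', 'Onion'}): 0.6,
}


def generate_rules(supports, min_confidence):
    """Return (antecedent, consequent, support, confidence) for every strong rule."""
    rules = []
    for itemset, itemset_support in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for size in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), size):
                antecedent = frozenset(combo)
                consequent = itemset - antecedent
                # Subsets of a frequent itemset are frequent, so the lookup succeeds.
                conf = itemset_support / supports[antecedent]
                if conf >= min_confidence:
                    rules.append((antecedent, consequent, itemset_support, conf))
    return rules


rules = generate_rules(supports, min_confidence=0.9)
for antecedent, consequent, sup, conf in rules:
    print(sorted(antecedent), '->', sorted(consequent), 'support=%.2f confidence=%.2f' % (sup, conf))
```

The lookup `supports[antecedent]` relies on the property stated above that every subset of a frequent itemset is itself frequent, so its support was recorded in the first step.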
(3)Metrics
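Apriori algorithm flow:
1. Scan the database once and count the occurrences of every item to form the candidate 1-itemsets;
2. Filter by the minsupport threshold to obtain the frequent 1-itemsets;
3. Combine the frequent 1-itemsets to form the candidate 2-itemsets;
4. Scan the database a second time, count each candidate 2-itemset, and keep the frequent 2-itemsets;
5. Repeat this process until the set of candidate itemsets is empty;
6. From the frequent itemsets found, generate the association rules by computing the corresponding confidences.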
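Steps 1 to 5 of the flow above can be sketched in a few lines of plain Python. This is a simplified teaching version under stated assumptions, not mlxtend's `apriori`: the function name `apriori_sketch` and the toy transactions are made up for illustration, and no attention is paid to performance.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets (illustrative only)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def keep_frequent(candidates):
        # One pass over the data: count each candidate, keep those meeting min_support.
        result = {}
        for candidate in candidates:
            sup = sum(1 for t in transactions if candidate <= t) / n
            if sup >= min_support:
                result[candidate] = sup
        return result

    # Steps 1-2: candidate 1-itemsets, filtered to the frequent 1-itemsets.
    items = {item for t in transactions for item in t}
    current = keep_frequent({frozenset([item]) for item in items})
    all_frequent = dict(current)

    k = 2
    while current:  # Step 5: stop when no frequent itemsets of the previous size remain
        previous = set(current)
        # Join: merge frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in previous for b in previous if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in previous for s in combinations(c, k - 1))
        }
        # Steps 3-4: scan again and keep the frequent k-itemsets.
        current = keep_frequent(candidates)
        all_frequent.update(current)
        k += 1
    return all_frequent


toy_transactions = [
    ['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
    ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
    ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
    ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
    ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs'],
]
found = apriori_sketch(toy_transactions, min_support=0.6)
for itemset, sup in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

The prune step is what keeps the search space small: a candidate is counted against the data only if all of its smaller subsets already passed the support threshold.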
Frequent itemsets via the Apriori algorithm.
Apriori function to extract frequent itemsets for association rule mining.
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
print(df)
from mlxtend.frequent_patterns import apriori
frequent_itemsets = apriori(df, min_support=0.6)
print(frequent_itemsets)
from mlxtend.frequent_patterns import apriori
frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
print(frequent_itemsets)
from mlxtend.frequent_patterns import apriori
frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
print(frequent_itemsets)
from mlxtend.frequent_patterns import apriori
frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets = frequent_itemsets[ (frequent_itemsets['length'] == 2) & (frequent_itemsets['support'] >= 0.8) ]
print(frequent_itemsets)
from mlxtend.frequent_patterns import apriori
frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets = frequent_itemsets[ frequent_itemsets['itemsets'] == {'Onion', 'Eggs'} ]
print(frequent_itemsets)
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
oht_ary = te.fit(dataset).transform(dataset, sparse=True)
sparse_df = pd.DataFrame.sparse.from_spmatrix(oht_ary, columns=te.columns_)
print(sparse_df)
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
te = TransactionEncoder()
oht_ary = te.fit(dataset).transform(dataset, sparse=True)
sparse_df = pd.DataFrame.sparse.from_spmatrix(oht_ary, columns=te.columns_)
# print(sparse_df)
frequent_itemsets = apriori(sparse_df, min_support=0.6, use_colnames=True, verbose=1)
print(frequent_itemsets)
Association rules generation from frequent itemsets.
Function to generate association rules from frequent itemsets.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpmax, fpgrowth
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
frequent_itemsets = fpgrowth(df, min_support=0.6, use_colnames=True)
### alternatively:
#frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
#frequent_itemsets = fpmax(df, min_support=0.6, use_colnames=True)
print(frequent_itemsets)
from mlxtend.frequent_patterns import association_rules
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print(rules)
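If you are interested in rules selected by a different metric of interest, you can simply adjust the metric and min_threshold arguments. For example, if you only want rules with a lift score >= 1.2, you can do the following: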
from mlxtend.frequent_patterns import association_rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
print(rules)
from mlxtend.frequent_patterns import association_rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
# print(rules)
rules["antecedent_len"] = rules["antecedents"].apply(lambda x: len(x))
print(rules)
from mlxtend.frequent_patterns import association_rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
# print(rules)
rules["antecedent_len"] = rules["antecedents"].apply(lambda x: len(x))
# print(rules)
rules = rules[ (rules['antecedent_len'] >= 2) &
(rules['confidence'] > 0.75) &
(rules['lift'] > 1.2) ]
print(rules)
Similarly, using the Pandas API, we can select entries based on the "antecedents" or "consequents" columns:
from mlxtend.frequent_patterns import association_rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
# print(rules)
rules["antecedent_len"] = rules["antecedents"].apply(lambda x: len(x))
# print(rules)
# Keep the selection in a variable; otherwise the filtered result is discarded
selected = rules[rules['antecedents'] == {'Eggs', 'Kidney Beans'}]
print(selected)
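Most of the computed metrics depend on the consequent and antecedent support scores of a given rule, which must be present in the frequent itemsets input DataFrame.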
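A small plain-Python sketch shows why this matters. It uses an incomplete table of itemset supports in which no single-item support is listed; the helper `confidence_or_none` is hypothetical and only illustrates the dependency, it is not part of mlxtend.

```python
# An incomplete table of itemset supports: single-item supports are missing.
supports = {
    frozenset(['177', '176']): 0.253623,
    frozenset(['177', '179']): 0.253623,
    frozenset(['176', '178']): 0.217391,
    frozenset(['176', '179']): 0.217391,
    frozenset(['93', '100']): 0.181159,
    frozenset(['177', '178']): 0.108696,
    frozenset(['177', '176', '178']): 0.108696,
}


def confidence_or_none(antecedent, consequent, supports):
    """Return confidence(A -> B), or None if a required support is not in the table."""
    antecedent = frozenset(antecedent)
    union = antecedent | frozenset(consequent)
    if antecedent not in supports or union not in supports:
        return None
    return supports[union] / supports[antecedent]


# support({'177'}) is not in the table, so the confidence cannot be computed.
conf_missing = confidence_or_none({'177'}, {'176'}, supports)
# Both {'177', '176'} and {'177', '176', '178'} are present, so this one can.
conf_available = confidence_or_none({'177', '176'}, {'178'}, supports)
print(conf_missing, conf_available)
```

When the antecedent or consequent supports are unavailable, only the support of the rule itself can be reported, which is the situation the support_only=True option is meant for.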
import pandas as pd
dict = {'itemsets': [['177', '176'], ['177', '179'],
['176', '178'], ['176', '179'],
['93', '100'], ['177', '178'],
['177', '176', '178']],
'support':[0.253623, 0.253623, 0.217391,
0.217391, 0.181159, 0.108696, 0.108696]}
freq_itemsets = pd.DataFrame(dict)
print(freq_itemsets)
import pandas as pd
dict = {'itemsets': [['177', '176'], ['177', '179'],
['176', '178'], ['176', '179'],
['93', '100'], ['177', '178'],
['177', '176', '178']],
'support':[0.253623, 0.253623, 0.217391,
0.217391, 0.181159, 0.108696, 0.108696]}
freq_itemsets = pd.DataFrame(dict)
print(freq_itemsets)
from mlxtend.frequent_patterns import association_rules
res = association_rules(freq_itemsets, support_only=True, min_threshold=0.1)
print(res)
import pandas as pd
dict = {'itemsets': [['177', '176'], ['177', '179'],
['176', '178'], ['176', '179'],
['93', '100'], ['177', '178'],
['177', '176', '178']],
'support':[0.253623, 0.253623, 0.217391,
0.217391, 0.181159, 0.108696, 0.108696]}
freq_itemsets = pd.DataFrame(dict)
print(freq_itemsets)
from mlxtend.frequent_patterns import association_rules
res = association_rules(freq_itemsets, support_only=True, min_threshold=0.1)
print(res)
res = res[['antecedents', 'consequents', 'support']]
print(res)
There is no specific API for pruning. Instead, the pandas API can be used on the resulting DataFrame to remove individual rows.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpmax, fpgrowth
from mlxtend.frequent_patterns import association_rules
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
frequent_itemsets = fpgrowth(df, min_support=0.6, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
print(rules)
Suppose we want to remove the rule "(Onion, Kidney Beans) -> (Eggs)". To do so, we can define selection masks and drop this row:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpmax, fpgrowth
from mlxtend.frequent_patterns import association_rules
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
frequent_itemsets = fpgrowth(df, min_support=0.6, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
print(rules)
antecedent_sele = rules['antecedents'] == frozenset({'Onion', 'Kidney Beans'}) # or frozenset({'Kidney Beans', 'Onion'})
consequent_sele = rules['consequents'] == frozenset({'Eggs'})
final_sele = (antecedent_sele & consequent_sele)
rules = rules.loc[ ~final_sele ]
print(rules)