Reference: http://blog.csdn.net/zone_programming/article/details/42032309
More data mining code: https://github.com/linyiqun/DataMiningAlgorithm
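Introduction
GSP is a sequential pattern mining algorithm in the Apriori family. Its overall flow closely resembles Apriori's, with some differences in the details that are described below. On top of the original definition of a frequent pattern, GSP adds three new concepts:
1. Time constraints min_gap and max_gap: adjacent elements of a pattern no longer have to be consecutive in the data sequence; it is enough that the time gap between them lies between min_gap and max_gap.
2. A sliding time window, time_windows_size: items that fall within the window can be treated as belonging to the same ItemSet.
3. Taxonomies (item classification hierarchies).
Of these three, the first features most prominently in the algorithm below.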
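How the algorithm works
1. Scan the input sequences and find all frequent single items, i.e. the frequent 1-patterns, by filtering on the minimum support threshold.
2. Join the frequent 1-patterns to generate candidate 2-patterns, again keeping only those that meet the minimum support threshold.
3. Join the frequent 2-patterns to produce the frequent 3-patterns. This step applies both the minimum support check and pruning; pruning checks whether all of a candidate's subsequences are themselves frequent patterns.
4. Keep mining from the frequent 3-patterns upward in the same way until no more candidates can be generated.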
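The pruning check in step 3 can be sketched roughly as follows. This is only an illustration, not the code from this post: sequences are modeled as plain nested lists, the class and method names (PruneSketch, dropOneItem, survivesPruning) are made up for the example, and time constraints are ignored.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PruneSketch {
    // All subsequences obtained by deleting exactly one item.
    // An itemset that becomes empty is dropped from the subsequence.
    static List<List<List<Integer>>> dropOneItem(List<List<Integer>> seq) {
        List<List<List<Integer>>> result = new ArrayList<>();
        for (int i = 0; i < seq.size(); i++) {
            for (int j = 0; j < seq.get(i).size(); j++) {
                List<List<Integer>> sub = new ArrayList<>();
                for (int k = 0; k < seq.size(); k++) {
                    List<Integer> copy = new ArrayList<>(seq.get(k));
                    if (k == i) {
                        copy.remove(j); // j is an int, so this removes by index
                    }
                    if (!copy.isEmpty()) {
                        sub.add(copy);
                    }
                }
                result.add(sub);
            }
        }
        return result;
    }

    // A candidate survives pruning only if every such subsequence is frequent.
    static boolean survivesPruning(List<List<Integer>> candidate,
                                   Set<List<List<Integer>>> frequent) {
        for (List<List<Integer>> sub : dropOneItem(candidate)) {
            if (!frequent.contains(sub)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<List<List<Integer>>> frequent = new HashSet<>();
        frequent.add(List.of(List.of(1), List.of(3)));
        frequent.add(List.of(List.of(1), List.of(4)));
        frequent.add(List.of(List.of(3), List.of(4)));
        frequent.add(List.of(List.of(2), List.of(3)));
        frequent.add(List.of(List.of(3), List.of(5)));

        // <1, 3, 4>: its subsequences <3, 4>, <1, 4>, <1, 3> are all frequent
        System.out.println(survivesPruning(
                List.of(List.of(1), List.of(3), List.of(4)), frequent)); // true
        // <2, 3, 5>: the subsequence <2, 5> is not in the frequent set
        System.out.println(survivesPruning(
                List.of(List.of(2), List.of(3), List.of(5)), frequent)); // false
    }
}
```

A candidate that survives pruning still has to pass the support count; pruning only saves the cost of counting candidates that cannot possibly be frequent.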
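How the join works
Flatten both sequences into lists of items. If sequence a with its first item removed is identical, item by item, to sequence b with its last item removed, the two can be joined: the last item of b is appended to a. Whether it is added as a separate itemset or merged into a's last itemset depends on whether the itemset holding b's last item is a single-item itemset: if it is, the item is added as a new itemset; otherwise it is merged into a's last itemset.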
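As a small standalone illustration of the join rule just described, here is a sketch that models a sequence as a list of itemsets. The names JoinSketch, flatten, canJoin and join are invented for this example and are not part of the classes shown later.

```java
import java.util.ArrayList;
import java.util.List;

class JoinSketch {
    // Flatten a sequence (a list of itemsets) into one list of items.
    static List<Integer> flatten(List<List<Integer>> seq) {
        List<Integer> items = new ArrayList<>();
        for (List<Integer> itemSet : seq) {
            items.addAll(itemSet);
        }
        return items;
    }

    // a and b can be joined if a without its first item
    // equals b without its last item.
    static boolean canJoin(List<List<Integer>> a, List<List<Integer>> b) {
        List<Integer> fa = flatten(a);
        List<Integer> fb = flatten(b);
        if (fa.isEmpty() || fa.size() != fb.size()) {
            return false;
        }
        return fa.subList(1, fa.size()).equals(fb.subList(0, fb.size() - 1));
    }

    // Append b's last item to a deep copy of a.
    static List<List<Integer>> join(List<List<Integer>> a, List<List<Integer>> b) {
        List<List<Integer>> result = new ArrayList<>();
        for (List<Integer> itemSet : a) {
            result.add(new ArrayList<>(itemSet));
        }
        List<Integer> lastOfB = b.get(b.size() - 1);
        int lastItem = lastOfB.get(lastOfB.size() - 1);
        if (lastOfB.size() == 1) {
            // b's last itemset is a single item: add it as a new itemset
            List<Integer> single = new ArrayList<>();
            single.add(lastItem);
            result.add(single);
        } else {
            // otherwise merge the item into a's last itemset
            result.get(result.size() - 1).add(lastItem);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<Integer>> a = List.of(List.of(1), List.of(3));
        List<List<Integer>> b = List.of(List.of(3), List.of(4));
        System.out.println(canJoin(a, b)); // true
        System.out.println(join(a, b));    // [[1], [3], [4]]
    }
}
```

Joining <1, 3> with <3, 4> gives the candidate <1, 3, 4>, which matches one of the frequent 3-patterns in the output further down.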
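Time constraint calculation
This is used during support counting, which in GSP is not as simple as in Apriori. For example, to decide whether the sequence <2, <3, 4>> is contained in the sequence <(1,5), 2, <3, 4>, 2>, it is not enough to check that the data sequence contains 2 and <3, 4>; the time gap constraint must hold as well. So we first collect every occurrence time of 2 and of <3, 4>, then look for one path through those times that satisfies the constraint; if such a path exists, the sequence is contained. Time is counted from left to right as 1, 2, 3, ..., one unit per itemset. Here 2 occurs twice, at t=2 and t=4, and <3, 4> occurs only once, at t=3, so the problem reduces to the following array: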
2 4
3
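Going from the top row of the time array to the bottom, pick one time from each row and look for a combination that satisfies the time constraint. Here the only combinations are 2-3 and 4-3. If some combination has every gap within the allowed range, the data sequence supports the given pattern and the support count is incremented by 1. This part is fairly complicated to implement in code.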
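The search over the time array can be sketched on its own as a small recursive function. This is a simplified illustration rather than the implementation used in GSPTool below; the names GapPathSketch, hasValidPath, minGap and maxGap are assumptions made for the example.

```java
import java.util.List;

class GapPathSketch {
    // Returns true if one time can be picked from each row, top to bottom,
    // so that every step satisfies minGap <= (next - previous) <= maxGap.
    static boolean hasValidPath(List<List<Integer>> times, int level,
                                int prevTime, int minGap, int maxGap) {
        if (level == times.size()) {
            return true; // a time was chosen for every row
        }
        for (int t : times.get(level)) {
            int gap = t - prevTime;
            // the first row has no predecessor, so any of its times may start a path
            if (level == 0 || (gap >= minGap && gap <= maxGap)) {
                if (hasValidPath(times, level + 1, t, minGap, maxGap)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Row 1: times of 2; row 2: times of <3, 4>
        List<List<Integer>> times = List.of(List.of(2, 4), List.of(3));
        // 2 -> 3 has gap 1, which lies in [1, 5]
        System.out.println(hasValidPath(times, 0, 0, 1, 5)); // true
        // with minGap = 2, neither 2 -> 3 (gap 1) nor 4 -> 3 (gap -1) qualifies
        System.out.println(hasValidPath(times, 0, 0, 2, 5)); // false
    }
}
```

With min_gap = 1 and max_gap = 5, the path 2-3 is accepted while 4-3 is rejected because its gap is negative.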
Code implementation
Test input (format: transaction ID, item count, item1 item2 ...):
- 1 2 1 5
- 1 1 2
- 1 1 3
- 1 1 4
- 2 1 1
- 2 1 3
- 2 1 4
- 2 2 3 5
- 3 1 1
- 3 1 2
- 3 1 3
- 3 1 4
- 3 1 5
- 4 1 1
- 4 1 3
- 4 1 5
- 5 1 4
- 5 1 5
The resulting sequences are:
<(1,5) 2 3 4>
<1 3 4 (3,5)>
<1 2 3 4 5>
<1 3 5>
<4 5>
In other words, all itemsets with the same transaction ID belong to the same sequence. The key classes follow.
Sequence.java:
ItemSet.java:
- package DataMining_GSP;
-
- import java.util.ArrayList;
-
- /**
-  * An itemset: the group of items that forms one element of a sequence.
-  */
- public class ItemSet {
- // Items contained in this itemset
- private ArrayList<Integer> items;
-
- public ItemSet(String[] itemStr) {
- items = new ArrayList<>();
- for (String s : itemStr) {
- items.add(Integer.parseInt(s));
- }
- }
-
- public ItemSet(int[] itemNum) {
- items = new ArrayList<>();
- for (int num : itemNum) {
- items.add(num);
- }
- }
-
- public ItemSet(ArrayList<Integer> itemNum) {
- this.items = itemNum;
- }
-
- public ArrayList<Integer> getItems() {
- return items;
- }
-
- public void setItems(ArrayList<Integer> items) {
- this.items = items;
- }
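-
- /**
-  * Checks whether this itemset holds the same items, in the same order, as another.
-  *
-  * @param itemSet the itemset to compare against
-  */
- public boolean compareIsSame(ItemSet itemSet) {
- if (this.items.size() != itemSet.items.size()) {
- return false;
- }
-
- for (int i = 0; i < itemSet.items.size(); i++) {
- // equals() compares values; != on Integer objects compares references
- if (!this.items.get(i).equals(itemSet.items.get(i))) {
- return false;
- }
- }
-
- return true;
- }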
-
- /**
-  * Returns a copy of the item list, so callers can modify it safely.
-  */
- public ArrayList<Integer> copyItems() {
- ArrayList<Integer> copyItems = new ArrayList<>();
-
- for (int num : this.items) {
- copyItems.add(num);
- }
-
- return copyItems;
- }
- }
GSPTool.java (algorithm tool class):
- package DataMining_GSP;
-
- import java.io.BufferedReader;
- import java.io.File;
- import java.io.FileReader;
- import java.io.IOException;
- import java.util.ArrayList;
- import java.util.Collections;
- import java.util.HashMap;
- import java.util.Map;
-
- /**
-  * Tool class for the GSP sequential pattern mining algorithm.
-  */
- public class GSPTool {
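- // Path of the input data file
- private String filePath;
- // Minimum support count threshold
- private int minSupportCount;
- // Minimum time gap
- private int min_gap;
- // Maximum time gap
- private int max_gap;
- // All input sequences
- private ArrayList<Sequence> totalSequences;
- // All frequent sequences found during mining
- private ArrayList<Sequence> totalFrequencySeqs;
- // For each sequence and each of its itemsets: item -> time (1-based itemset position)
- private ArrayList<ArrayList<HashMap<Integer, Integer>>> itemNum2Time;
-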
- public GSPTool(String filePath, int minSupportCount, int min_gap,
- int max_gap) {
- this.filePath = filePath;
- this.minSupportCount = minSupportCount;
- this.min_gap = min_gap;
- this.max_gap = max_gap;
- totalFrequencySeqs = new ArrayList<>();
- readDataFile();
- }
-
- /**
-  * Reads the input file and groups its itemsets into sequences by transaction ID.
-  */
- private void readDataFile() {
- File file = new File(filePath);
- ArrayList<String[]> dataArray = new ArrayList<String[]>();
-
- // try-with-resources closes the reader even if reading fails
- try (BufferedReader in = new BufferedReader(new FileReader(file))) {
- String str;
- while ((str = in.readLine()) != null) {
- dataArray.add(str.split(" "));
- }
- } catch (IOException e) {
- e.printStackTrace();
- }
-
- HashMap<Integer, Sequence> mapSeq = new HashMap<>();
- Sequence seq;
- ItemSet itemSet;
- int tID;
- String[] itemStr;
- for (String[] str : dataArray) {
- tID = Integer.parseInt(str[0]);
- itemStr = new String[Integer.parseInt(str[1])];
- System.arraycopy(str, 2, itemStr, 0, itemStr.length);
- itemSet = new ItemSet(itemStr);
-
- if (mapSeq.containsKey(tID)) {
- seq = mapSeq.get(tID);
- } else {
- seq = new Sequence(tID);
- }
- seq.getItemSetList().add(itemSet);
- mapSeq.put(tID, seq);
- }
-
- // Collect the grouped sequences into a list
- totalSequences = new ArrayList<>();
- for (Map.Entry<Integer, Sequence> entry : mapSeq.entrySet()) {
- totalSequences.add(entry.getValue());
- }
- }
-
- /**
-  * Generates the frequent 1-sequences: single items that meet the minimum support.
-  */
- private ArrayList<Sequence> generateOneFrequencyItem() {
- int count = 0;
- int currentTransanctionID = 0;
- Sequence tempSeq;
- ItemSet tempItemSet;
- HashMap<Integer, Integer> itemNumMap = new HashMap<>();
- ArrayList<Sequence> seqList = new ArrayList<>();
-
- for (Sequence seq : totalSequences) {
- for (ItemSet itemSet : seq.getItemSetList()) {
- for (int num : itemSet.getItems()) {
-
- if (!itemNumMap.containsKey(num)) {
- itemNumMap.put(num, 1);
- }
- }
- }
- }
-
- boolean isContain = false;
- int number = 0;
- for (Map.Entry entry : itemNumMap.entrySet()) {
- count = 0;
- number = (int) entry.getKey();
- for (Sequence seq : totalSequences) {
- isContain = false;
-
- for (ItemSet itemSet : seq.getItemSetList()) {
- for (int num : itemSet.getItems()) {
- if (num == number) {
- isContain = true;
- break;
- }
- }
-
- if(isContain){
- break;
- }
- }
-
- if(isContain){
- count++;
- }
- }
-
- itemNumMap.put(number, count);
- }
-
-
- for (Map.Entry entry : itemNumMap.entrySet()) {
- count = (int) entry.getValue();
- if (count >= minSupportCount) {
- tempSeq = new Sequence();
- tempItemSet = new ItemSet(new int[] { (int) entry.getKey() });
-
- tempSeq.getItemSetList().add(tempItemSet);
- seqList.add(tempSeq);
- }
-
- }
-
- Collections.sort(seqList);
-
- totalFrequencySeqs.addAll(seqList);
-
- return seqList;
- }
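-
- /**
-  * Generates the frequent 2-sequences from the frequent 1-sequences.
-  *
-  * @param oneSeq the frequent 1-sequences
-  */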
- private ArrayList<Sequence> generateTwoFrequencyItem(
- ArrayList<Sequence> oneSeq) {
- Sequence tempSeq;
- ArrayList<Sequence> resultSeq = new ArrayList<>();
- ItemSet tempItemSet;
- int num1;
- int num2;
-
- // Candidates of the form <a, b>: two single-item itemsets, one after the other
- for (int i = 0; i < oneSeq.size(); i++) {
- num1 = oneSeq.get(i).getFirstItemSetNum();
- for (int j = 0; j < oneSeq.size(); j++) {
- num2 = oneSeq.get(j).getFirstItemSetNum();
-
- tempSeq = new Sequence();
- tempItemSet = new ItemSet(new int[] { num1 });
- tempSeq.getItemSetList().add(tempItemSet);
- tempItemSet = new ItemSet(new int[] { num2 });
- tempSeq.getItemSetList().add(tempItemSet);
-
- if (countSupport(tempSeq) >= minSupportCount) {
- resultSeq.add(tempSeq);
- }
- }
- }
-
- // Candidates of the form <(a, b)>: two different items in one itemset
- for (int i = 0; i < oneSeq.size(); i++) {
- num1 = oneSeq.get(i).getFirstItemSetNum();
- for (int j = i + 1; j < oneSeq.size(); j++) {
- num2 = oneSeq.get(j).getFirstItemSetNum();
-
- tempSeq = new Sequence();
- tempItemSet = new ItemSet(new int[] { num1, num2 });
- tempSeq.getItemSetList().add(tempItemSet);
-
- if (countSupport(tempSeq) >= minSupportCount) {
- resultSeq.add(tempSeq);
- }
- }
- }
-
- totalFrequencySeqs.addAll(resultSeq);
-
- return resultSeq;
- }
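-
- /**
-  * Joins the frequent k-sequences to generate the frequent (k+1)-sequences.
-  *
-  * @param seqList the frequent k-sequences
-  */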
- private ArrayList<Sequence> generateCandidateItem(
- ArrayList<Sequence> seqList) {
- Sequence tempSeq;
- ArrayList<Integer> tempNumArray;
- ArrayList<Sequence> resultSeq = new ArrayList<>();
-
- ArrayList<ArrayList<Integer>> seqNums = new ArrayList<>();
-
- for (int i = 0; i < seqList.size(); i++) {
- tempNumArray = new ArrayList<>();
- tempSeq = seqList.get(i);
- for (ItemSet itemSet : tempSeq.getItemSetList()) {
- tempNumArray.addAll(itemSet.copyItems());
- }
- seqNums.add(tempNumArray);
- }
-
- ArrayList<Integer> array1;
- ArrayList<Integer> array2;
-
- Sequence seqi = null;
- Sequence seqj = null;
-
- boolean canConnect = true;
-
- for (int i = 0; i < seqNums.size(); i++) {
- for (int j = 0; j < seqNums.size(); j++) {
- array1 = (ArrayList<Integer>) seqNums.get(i).clone();
- array2 = (ArrayList<Integer>) seqNums.get(j).clone();
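-
- // Compare a without its first item against b without its last item
- array1.remove(0);
- array2.remove(array2.size() - 1);
-
- canConnect = true;
- for (int k = 0; k < array1.size(); k++) {
- // equals() compares values; != on Integer objects compares references
- if (!array1.get(k).equals(array2.get(k))) {
- canConnect = false;
- break;
- }
- }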
-
- if (canConnect) {
- seqi = seqList.get(i).copySeqence();
- seqj = seqList.get(j).copySeqence();
-
- int lastItemNum = seqj.getLastItemSetNum();
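- if (seqj.isLastItemSetSingleNum()) {
- // b's last itemset holds a single item: append it as a new itemset
- ItemSet itemSet = new ItemSet(new int[] { lastItemNum });
- seqi.getItemSetList().add(itemSet);
- } else {
- // Otherwise merge the item into a's last itemset
- ItemSet itemSet = seqi.getLastItemSet();
- itemSet.getItems().add(lastItemNum);
- }
-
- // Keep the candidate only if it survives pruning and meets the minimum support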
- if (isChildSeqContained(seqi)
- && countSupport(seqi) >= minSupportCount) {
- resultSeq.add(seqi);
- }
- }
- }
- }
-
- totalFrequencySeqs.addAll(resultSeq);
- return resultSeq;
- }
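-
- /**
-  * Pruning check: returns true only if every child subsequence of the
-  * candidate is already a known frequent sequence.
-  *
-  * @param seq the candidate sequence
-  */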
- private boolean isChildSeqContained(Sequence seq) {
- boolean isContained = false;
- ArrayList<Sequence> childSeqs;
-
- childSeqs = seq.createChildSeqs();
- for (Sequence tempSeq : childSeqs) {
- isContained = false;
-
- for (Sequence frequencySeq : totalFrequencySeqs) {
- if (tempSeq.compareIsSame(frequencySeq)) {
- isContained = true;
- break;
- }
- }
-
- if (!isContained) {
- break;
- }
- }
-
- return isContained;
- }
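-
- /**
-  * Counts how many input sequences support the given sequence under the
-  * time constraints.
-  *
-  * @param seq the sequence to count
-  */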
- private int countSupport(Sequence seq) {
- int count = 0;
- int matchNum = 0;
- Sequence tempSeq;
- ItemSet tempItemSet;
- HashMap<Integer, Integer> timeMap;
- ArrayList<ItemSet> itemSetList;
- ArrayList<ArrayList<Integer>> numArray = new ArrayList<>();
-
- ArrayList<ArrayList<Integer>> timeArray = new ArrayList<>();
-
- for (ItemSet itemSet : seq.getItemSetList()) {
- numArray.add(itemSet.getItems());
- }
-
- for (int i = 0; i < totalSequences.size(); i++) {
- timeArray = new ArrayList<>();
-
- for (int s = 0; s < numArray.size(); s++) {
- ArrayList<Integer> childNum = numArray.get(s);
- ArrayList<Integer> localTime = new ArrayList<>();
- tempSeq = totalSequences.get(i);
- itemSetList = tempSeq.getItemSetList();
-
- for (int j = 0; j < itemSetList.size(); j++) {
- tempItemSet = itemSetList.get(j);
- matchNum = 0;
- int t = 0;
-
- if (tempItemSet.getItems().size() == childNum.size()) {
- timeMap = itemNum2Time.get(i).get(j);
-
- for (int k = 0; k < childNum.size(); k++) {
- if (timeMap.containsKey(childNum.get(k))) {
- matchNum++;
- t = timeMap.get(childNum.get(k));
- }
- }
- // Every item of the pattern itemset was found here: record the time
- if (matchNum == childNum.size()) {
- localTime.add(t);
- }
- }
-
- }
-
- if (localTime.size() > 0) {
- timeArray.add(localTime);
- }
- }
-
- // Supported only if every pattern itemset occurs and some path satisfies the gap constraints
- if (timeArray.size() == numArray.size()
- && judgeTimeInGap(timeArray)) {
- count++;
- }
- }
-
- return count;
- }
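-
- /**
-  * Checks whether the time lists contain a path that satisfies the gap
-  * constraints. Returns false when the pattern has only one itemset.
-  *
-  * @param timeArray one list of occurrence times per pattern itemset
-  */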
- private boolean judgeTimeInGap(ArrayList<ArrayList<Integer>> timeArray) {
- boolean result = false;
- int preTime = 0;
- ArrayList<Integer> firstTimes = timeArray.get(0);
- timeArray.remove(0);
-
- if (timeArray.size() == 0) {
- return false;
- }
-
- for (int i = 0; i < firstTimes.size(); i++) {
- preTime = firstTimes.get(i);
-
- if (dfsJudgeTime(preTime, timeArray)) {
- result = true;
- break;
- }
- }
-
- return result;
- }
-
- /**
-  * Depth-first search for a path through the time lists in which every
-  * step satisfies min_gap <= gap <= max_gap.
-  *
-  * @param preTime   time chosen for the previous itemset
-  * @param timeArray candidate times for the remaining itemsets
-  */
- private boolean dfsJudgeTime(int preTime,
- ArrayList<ArrayList<Integer>> timeArray) {
- ArrayList<Integer> currentTimes = timeArray.get(0);
- // Levels after the current one, copied so the caller's list is untouched
- ArrayList<ArrayList<Integer>> remaining = new ArrayList<>(
- timeArray.subList(1, timeArray.size()));
-
- for (int time : currentTimes) {
- int gap = time - preTime;
- if (gap >= min_gap && gap <= max_gap) {
- // Locals only, so a failed branch does not affect the next candidate
- if (remaining.isEmpty() || dfsJudgeTime(time, remaining)) {
- return true;
- }
- }
- }
-
- return false;
- }
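-
- /**
-  * Builds, for every itemset of every input sequence, a map from item to
-  * time (the itemset's 1-based position in its sequence).
-  */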
- private void initItemNumToTimeMap() {
- Sequence seq;
- itemNum2Time = new ArrayList<>();
- HashMap<Integer, Integer> tempMap;
- ArrayList<HashMap<Integer, Integer>> tempMapList;
-
- for (int i = 0; i < totalSequences.size(); i++) {
- seq = totalSequences.get(i);
- tempMapList = new ArrayList<>();
-
- for (int j = 0; j < seq.getItemSetList().size(); j++) {
- ItemSet itemSet = seq.getItemSetList().get(j);
- tempMap = new HashMap<>();
- for (int itemNum : itemSet.getItems()) {
- tempMap.put(itemNum, j + 1);
- }
-
- tempMapList.add(tempMap);
- }
-
- itemNum2Time.add(tempMapList);
- }
- }
-
- /**
-  * Runs the GSP mining process and prints all frequent sequences.
-  */
- public void gspCalculate() {
- ArrayList<Sequence> oneSeq;
- ArrayList<Sequence> twoSeq;
- ArrayList<Sequence> candidateSeq;
-
- initItemNumToTimeMap();
- oneSeq = generateOneFrequencyItem();
- twoSeq = generateTwoFrequencyItem(oneSeq);
- candidateSeq = twoSeq;
-
- // Keep joining until no more candidates are produced
- for (;;) {
- candidateSeq = generateCandidateItem(candidateSeq);
-
- if (candidateSeq.size() == 0) {
- break;
- }
- }
-
- outputSeqence(totalFrequencySeqs);
-
- }
-
- /**
-  * Prints the given sequences.
-  *
-  * @param outputSeqList the sequences to print
-  */
- private void outputSeqence(ArrayList<Sequence> outputSeqList) {
- for (Sequence seq : outputSeqList) {
- System.out.print("<");
- for (ItemSet itemSet : seq.getItemSetList()) {
- System.out.print("(");
- for (int num : itemSet.getItems()) {
- System.out.print(num + ",");
- }
- System.out.print("), ");
- }
- System.out.println(">");
- }
- }
-
- }
Client.java (driver class):
- package DataMining_GSP;
-
- /**
-  * Driver class for the GSP sequential pattern mining algorithm.
-  */
- public class Client {
- public static void main(String[] args){
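- // Path of the test data file
- String filePath = "C:\\Users\\lyq\\Desktop\\icon\\testInput.txt";
- // Minimum support count threshold
- int minSupportCount = 2;
- // Minimum time gap
- int min_gap = 1;
- // Maximum time gap
- int max_gap = 5;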
-
- GSPTool tool = new GSPTool(filePath, minSupportCount, min_gap, max_gap);
- tool.gspCalculate();
- }
- }
Output of the algorithm (all frequent patterns mined):
- <(1,), >
- <(2,), >
- <(3,), >
- <(4,), >
- <(5,), >
- <(1,), (3,), >
- <(1,), (4,), >
- <(1,), (5,), >
- <(2,), (3,), >
- <(2,), (4,), >
- <(3,), (4,), >
- <(3,), (5,), >
- <(4,), (5,), >
- <(1,), (3,), (4,), >
- <(1,), (3,), (5,), >
- <(2,), (3,), (4,), >
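Difficulties in the implementation
1. The algorithm took me several days. The first difficulty was understanding the algorithm itself: there is very little material about it online, different authors describe it slightly differently, and none of them in much detail. I ended up learning how GSP works by reading other people's code, and my implementation is based on the C implementation in the reference above.
2. The support counting under time constraints took a while to debug. Collecting the times is error-prone because there are so many nesting levels that it is easy to lose track.
3. The last one is the reference problem when copying Sequence and ItemSet objects: when generating a new sequence you must make a deep copy, otherwise the shared reference will cause the original data to be modified.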
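The copying pitfall in point 3 is easy to reproduce with plain nested lists. The sketch below is only an illustration with made-up names (DeepCopySketch, deepCopy); it does not use the Sequence and ItemSet classes from this post.

```java
import java.util.ArrayList;
import java.util.List;

class DeepCopySketch {
    // Copies the outer list and every inner itemset list.
    static List<List<Integer>> deepCopy(List<List<Integer>> seq) {
        List<List<Integer>> copy = new ArrayList<>();
        for (List<Integer> itemSet : seq) {
            copy.add(new ArrayList<>(itemSet));
        }
        return copy;
    }

    public static void main(String[] args) {
        List<List<Integer>> original = new ArrayList<>();
        original.add(new ArrayList<>(List.of(1)));
        original.add(new ArrayList<>(List.of(3)));

        // Shallow copy: the outer list is new, the inner lists are shared
        List<List<Integer>> shallow = new ArrayList<>(original);
        shallow.get(1).add(4);
        System.out.println(original); // [[1], [3, 4]] (the original changed too)

        // Deep copy: the inner lists are new as well
        List<List<Integer>> deep = deepCopy(original);
        deep.get(1).add(5);
        System.out.println(original); // [[1], [3, 4]] (unchanged this time)
        System.out.println(deep);     // [[1], [3, 4, 5]]
    }
}
```

This is the reason the join step works on copies of the two sequences before appending an item.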
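Comparison of GSP and Apriori
I have implemented both GSP and Apriori. Apriori is known as an association rule mining algorithm and is geared toward mining association rules. The two differ in the join step and in how the data is structured: Apriori's data is simpler, consisting only of single items, and its support counting only needs to check whether the items are present, with no time constraints to consider. As I implemented it, Apriori stops joining once it reaches the given itemset size K, whereas GSP keeps going until no more candidates can be generated. (Textbook Apriori also iterates until no candidates remain, so this last difference is one of implementation rather than of principle.)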