Reference: http://baike.baidu.com/link?url=vlCBGoGR0_97l9SQ-WNeRv7oWb-3j7c6oUnyMzQAU3PTo0fx0O5MVXxckgqUlP871xR2Le-puGfFcrA4-zIntq
More data mining algorithms: https://github.com/linyiqun/DataMiningAlgorithm
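Introduction
RoughSets is a relatively novel algorithm; rough set theory offers a new set of concepts and a new way of approaching data mining. In this post I won't go through the tedious academic definitions. I'll just talk about what RoughSets is for and try to build some intuition. The use case is this: faced with a huge database system, how do you extract the useful information from it? If a database has dozens of fields, we are in for a hard time. In many situations, though, some of that information is useless or irrelevant, and as long as the final decision (classification) result is not affected, those attributes can be reduced, that is, dropped. This is what RoughSets does.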
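How the algorithm works
The principle is actually simple. All attributes fall into two kinds: condition attributes and decision attributes. Let's put the decision attribute in the last column of the data. The algorithm checks each condition attribute in turn to see whether it can be reduced; if it can, it outputs the rules that remain after the reduction, which take roughly an IF-THEN form. Here is an example, taken from the Baidu Baike entry on rough set theory.
Given 8 records: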
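Element  Color   Shape     Size    Stability
x1       red     triangle  large   stable
x2       red     triangle  large   stable
x3       yellow  circle    small   unstable
x4       yellow  circle    small   unstable
x5       blue    square    large   stable
x6       red     circle    medium  unstable
x7       blue    circle    small   unstable
x8       blue    square    medium  unstable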
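We still need a few basic concepts here. The set of all records is called the universe. What knowledge or information can the universe express? For example, the blocks that are blue or medium-sized = {x5,x7,x8} ∪ {x6,x8} = {x5,x6,x7,x8}. In the same way, intersections and unions over the records of the universe can express other pieces of information. There are 3 condition attributes here, each with 3 values, which gives 3x3=9 elementary categories, as follows:
A/R1={X1,X2,X3}={{x1,x2,x6},{x3,x4},{x5,x7,x8}} (partition by color)
A/R2={Y1,Y2,Y3}={{x1,x2},{x5,x8},{x3,x4,x6,x7}} (partition by shape)
A/R3={Z1,Z2,Z3}={{x1,x2,x5},{x6,x8},{x3,x4,x7}} (partition by size)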
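To make the three partitions concrete, here is a minimal Java sketch that groups the eight records by a single column. The class name PartitionDemo and the method partitionBy are made up for this illustration and are not part of the original project; the values use the English spellings of the input file shown later in the post.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class, not part of the original project.
class PartitionDemo {
    // The eight records: name, color, shape, size, stability.
    static final String[][] DATA = {
        {"x1", "Red", "Triangle", "Large", "Stable"},
        {"x2", "Red", "Triangle", "Large", "Stable"},
        {"x3", "Yellow", "Circle", "Small", "UnStable"},
        {"x4", "Yellow", "Circle", "Small", "UnStable"},
        {"x5", "Blue", "Rectangle", "Large", "Stable"},
        {"x6", "Red", "Circle", "Middle", "UnStable"},
        {"x7", "Blue", "Circle", "Small", "UnStable"},
        {"x8", "Blue", "Rectangle", "Middle", "UnStable"}
    };

    // Group the record names by the value found in the given column.
    static Map<String, List<String>> partitionBy(int column) {
        Map<String, List<String>> blocks = new LinkedHashMap<>();
        for (String[] row : DATA) {
            blocks.computeIfAbsent(row[column], k -> new ArrayList<>()).add(row[0]);
        }
        return blocks;
    }

    public static void main(String[] args) {
        System.out.println("Color: " + partitionBy(1));
        System.out.println("Shape: " + partitionBy(2));
        System.out.println("Size:  " + partitionBy(3));
    }
}
```

Each printed map holds the same three groups as the corresponding line of A/R1, A/R2 and A/R3.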
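We define a knowledge system A/R=R1∩R2∩R3: take one set from each of the three partitions and intersect them, 3x3x3 = 27 combinations in total. Leaving out the empty intersections, the result is
A/R={{x1,x2},{x3,x4},{x5},{x6},{x7},{x8}}, so the knowledge determined by this knowledge system consists of all sets in A/R together with the unions of those sets. Given an arbitrary set, how do we express it with the sets of the knowledge system? This takes another pair of concepts: the upper approximation and the lower approximation. For example, for X={x2,x5,x7} the lower approximation in this knowledge base is {x5,x7} and the upper approximation is {x1,x2,x5,x7}. Stated in full, the lower approximation is the union of all sets in the knowledge base that are contained in X, and the upper approximation is the union of all sets in the knowledge base that share at least one element with X. In the examples that follow, I judge whether a knowledge system is acceptable by checking whether the upper and lower approximations of a set equal the set itself. (This is only my own criterion, not a standard one.)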
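The knowledge system and the two approximations can be sketched in a few lines of Java. This is only an illustration under the standard definitions (lower = union of the blocks contained in X, upper = union of the blocks that intersect X); the class ApproximationDemo and its methods are hypothetical names, not the author's KnowledgeSystem class.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical demo class, not part of the original project.
class ApproximationDemo {
    // The eight records: name, color, shape, size, stability.
    static final String[][] DATA = {
        {"x1", "Red", "Triangle", "Large", "Stable"},
        {"x2", "Red", "Triangle", "Large", "Stable"},
        {"x3", "Yellow", "Circle", "Small", "UnStable"},
        {"x4", "Yellow", "Circle", "Small", "UnStable"},
        {"x5", "Blue", "Rectangle", "Large", "Stable"},
        {"x6", "Red", "Circle", "Middle", "UnStable"},
        {"x7", "Blue", "Circle", "Small", "UnStable"},
        {"x8", "Blue", "Rectangle", "Middle", "UnStable"}
    };

    // Blocks of the knowledge system: two records share a block
    // when they agree on every one of the listed columns.
    static Collection<Set<String>> partition(int... columns) {
        Map<String, Set<String>> blocks = new LinkedHashMap<>();
        for (String[] row : DATA) {
            StringBuilder key = new StringBuilder();
            for (int c : columns) {
                key.append(row[c]).append('|');
            }
            blocks.computeIfAbsent(key.toString(), k -> new TreeSet<>()).add(row[0]);
        }
        return blocks.values();
    }

    // Lower approximation: union of the blocks fully contained in the target.
    static Set<String> lower(Collection<Set<String>> blocks, Set<String> target) {
        Set<String> result = new TreeSet<>();
        for (Set<String> block : blocks) {
            if (target.containsAll(block)) {
                result.addAll(block);
            }
        }
        return result;
    }

    // Upper approximation: union of the blocks sharing an element with the target.
    static Set<String> upper(Collection<Set<String>> blocks, Set<String> target) {
        Set<String> result = new TreeSet<>();
        for (Set<String> block : blocks) {
            if (!Collections.disjoint(block, target)) {
                result.addAll(block);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Collection<Set<String>> ks = partition(1, 2, 3); // color, shape, size
        Set<String> x = new TreeSet<>(Arrays.asList("x2", "x5", "x7"));
        System.out.println("A/R   = " + ks);
        System.out.println("lower = " + lower(ks, x));
        System.out.println("upper = " + upper(ks, x));
    }
}
```

For X = {x2, x5, x7} this gives the lower approximation {x5, x7} and the upper approximation {x1, x2, x5, x7}: x2 cannot be separated from x1, so it only shows up in the upper approximation.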
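Next comes the attribute reduction itself. Start with color: once it is removed, the knowledge system becomes A/(R-R1)={{x1,x2},{x3,x4,x7},...} plus the unions of these subsets. The upper and lower approximations of the stable set {x1,x2,x5} are still the set itself, so nothing has changed, which means this attribute can be reduced. We then keep reducing on top of that until the approximations change, and go through all 3 attributes in this way. At the end we obtain the rules. Taking the reduction of the color attribute as an example, the rules include: large triangles are stable, small circles are unstable, and so on. That is the general principle; some aspects of it may well be less than rigorous.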
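The reduction test can be sketched as follows: drop a set of condition attributes, rebuild the blocks from the remaining ones, and accept the reduction only if both decision classes are still described exactly. This is a simplified illustration rather than the author's implementation; ReductDemo, isExact and removableSets are hypothetical names, and the sketch enumerates subsets with a bit mask instead of recursing.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical demo class, not part of the original project.
class ReductDemo {
    static final String[] NAMES = {"Element", "Color", "Shape", "Size", "Stability"};
    static final String[][] DATA = {
        {"x1", "Red", "Triangle", "Large", "Stable"},
        {"x2", "Red", "Triangle", "Large", "Stable"},
        {"x3", "Yellow", "Circle", "Small", "UnStable"},
        {"x4", "Yellow", "Circle", "Small", "UnStable"},
        {"x5", "Blue", "Rectangle", "Large", "Stable"},
        {"x6", "Red", "Circle", "Middle", "UnStable"},
        {"x7", "Blue", "Circle", "Small", "UnStable"},
        {"x8", "Blue", "Rectangle", "Middle", "UnStable"}
    };

    // Blocks of records that agree on all of the kept columns.
    static Collection<Set<String>> partition(List<Integer> columns) {
        Map<String, Set<String>> blocks = new LinkedHashMap<>();
        for (String[] row : DATA) {
            StringBuilder key = new StringBuilder();
            for (int c : columns) {
                key.append(row[c]).append('|');
            }
            blocks.computeIfAbsent(key.toString(), k -> new TreeSet<>()).add(row[0]);
        }
        return blocks.values();
    }

    // True when every block lies entirely inside or entirely outside the
    // target; the lower and upper approximations then both equal the target.
    static boolean isExact(Collection<Set<String>> blocks, Set<String> target) {
        for (Set<String> block : blocks) {
            if (!target.containsAll(block) && !Collections.disjoint(block, target)) {
                return false;
            }
        }
        return true;
    }

    // Names of the records whose decision attribute has the given value.
    static Set<String> decisionClass(String value) {
        Set<String> names = new TreeSet<>();
        for (String[] row : DATA) {
            if (row[4].equals(value)) {
                names.add(row[0]);
            }
        }
        return names;
    }

    // Try every non-empty proper subset of the three condition attributes.
    static List<List<String>> removableSets() {
        Set<String> stable = decisionClass("Stable");
        Set<String> unstable = decisionClass("UnStable");
        List<List<String>> result = new ArrayList<>();
        for (int mask = 1; mask < 7; mask++) {
            List<Integer> kept = new ArrayList<>();
            List<String> removed = new ArrayList<>();
            for (int i = 0; i < 3; i++) {
                if ((mask & (1 << i)) != 0) {
                    removed.add(NAMES[i + 1]);
                } else {
                    kept.add(i + 1);
                }
            }
            Collection<Set<String>> blocks = partition(kept);
            if (isExact(blocks, stable) && isExact(blocks, unstable)) {
                result.add(removed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(removableSets());
    }
}
```

On the example data the removable sets are Color, Shape, and Color together with Shape. Size never appears, because without it x5 and x8 (both blue and of the same shape) can no longer be told apart although one is stable and the other is not.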
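Code implementation
The code uses the same data as above. The input file spells the values in English, which avoids Chinese encoding problems (Rectangle stands for the square block and Middle for the medium size):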
- Element Color Shape Size Stability
- x1 Red Triangle Large Stable
- x2 Red Triangle Large Stable
- x3 Yellow Circle Small UnStable
- x4 Yellow Circle Small UnStable
- x5 Blue Rectangle Large Stable
- x6 Red Circle Middle UnStable
- x7 Blue Circle Small UnStable
- x8 Blue Rectangle Middle UnStable
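The program is somewhat complex and contains a lot of set intersection and union operations. I chose not to work directly on arrays because I wanted to emphasize the concept of sets.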
Record.java:
RecordCollection.java:
KnowledgeSystem.java:
RoughSetsTool.java:
- package DataMining_RoughSets;
-
- import java.io.BufferedReader;
- import java.io.File;
- import java.io.FileReader;
- import java.io.IOException;
- import java.util.ArrayList;
- import java.util.HashMap;
- import java.util.Map;
-
- /**
- * Attribute reduction tool based on rough sets.
- */
- public class RoughSetsTool {
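- // Name of the decision attribute (the last column of the data file)
- public static String DECISION_ATTR_NAME;
-
- // Path of the input data file
- private String filePath;
- // Column (attribute) names, taken from the first row of the file
- private String[] attrNames;
- // All rows of the file, including the header row
- private ArrayList<String[]> totalDatas;
- // All data rows wrapped as Record objects
- private ArrayList<Record> totalRecords;
- // Condition attribute name -> distinct values seen for it
- private HashMap<String, ArrayList<String>> conditionAttr;
- // One RecordCollection per (condition attribute, value) pair
- private ArrayList<RecordCollection> collectionList;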
-
- public RoughSetsTool(String filePath) {
- this.filePath = filePath;
- readDataFile();
- }
-
- /**
- * Reads the data file and builds the records and the condition attribute map.
- */
- private void readDataFile() {
- File file = new File(filePath);
- ArrayList<String[]> dataArray = new ArrayList<String[]>();
-
- try {
- BufferedReader in = new BufferedReader(new FileReader(file));
- String str;
- String[] tempArray;
- while ((str = in.readLine()) != null) {
- tempArray = str.split(" ");
- dataArray.add(tempArray);
- }
- in.close();
- } catch (IOException e) {
- e.printStackTrace();
- }
-
- String[] array;
- Record tempRecord;
- HashMap<String, String> attrMap;
- ArrayList<String> attrList;
- totalDatas = new ArrayList<>();
- totalRecords = new ArrayList<>();
- conditionAttr = new HashMap<>();
-
- attrNames = dataArray.get(0);
- DECISION_ATTR_NAME = attrNames[attrNames.length - 1];
- for (int j = 0; j < dataArray.size(); j++) {
- array = dataArray.get(j);
- totalDatas.add(array);
- if (j == 0) {
- // Skip the header row, which holds the attribute names
- continue;
- }
-
- attrMap = new HashMap<>();
- for (int i = 0; i < attrNames.length; i++) {
- attrMap.put(attrNames[i], array[i]);
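- // Collect values of condition attributes only: skip the name
- // column (i == 0) and the decision column (the last one)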
- if (i > 0 && i < attrNames.length - 1) {
- if (conditionAttr.containsKey(attrNames[i])) {
- attrList = conditionAttr.get(attrNames[i]);
- if (!attrList.contains(array[i])) {
- attrList.add(array[i]);
- }
- } else {
- attrList = new ArrayList<>();
- attrList.add(array[i]);
- }
- conditionAttr.put(attrNames[i], attrList);
- }
- }
- tempRecord = new Record(array[0], attrMap);
- totalRecords.add(tempRecord);
- }
- }
-
- /**
- * Builds one record collection per value of each condition attribute.
- */
- private void recordSpiltToCollection() {
- String attrName;
- ArrayList<String> attrList;
- ArrayList<Record> recordList;
- HashMap<String, String> collectionAttrValues;
- RecordCollection collection;
- collectionList = new ArrayList<>();
-
- for (Map.Entry entry : conditionAttr.entrySet()) {
- attrName = (String) entry.getKey();
- attrList = (ArrayList<String>) entry.getValue();
-
- for (String s : attrList) {
- recordList = new ArrayList<>();
-
- for (Record record : totalRecords) {
- if (record.isContainedAttr(s)) {
- recordList.add(record);
- }
- }
- collectionAttrValues = new HashMap<>();
- collectionAttrValues.put(attrName, s);
- collection = new RecordCollection(collectionAttrValues,
- recordList);
-
- collectionList.add(collection);
- }
- }
- }
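-
- /**
- * Groups the record collections by attribute name, leaving out the
- * attributes that are being reduced.
- *
- * @param reductAttr
- *            attributes to leave out; may be null
- */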
- private HashMap<String, ArrayList<RecordCollection>> constructCollectionMap(
- ArrayList<String> reductAttr) {
- String currentAtttrName;
- ArrayList<RecordCollection> cList;
-
- HashMap<String, ArrayList<RecordCollection>> collectionMap = new HashMap<>();
-
-
- for (int i = 1; i < attrNames.length - 1; i++) {
- currentAtttrName = attrNames[i];
-
-
- if (reductAttr != null && reductAttr.contains(currentAtttrName)) {
- continue;
- }
-
- cList = new ArrayList<>();
-
- for (RecordCollection c : collectionList) {
- if (c.isContainedAttrName(currentAtttrName)) {
- cList.add(c);
- }
- }
-
- collectionMap.put(currentAtttrName, cList);
- }
-
- return collectionMap;
- }
-
- /**
- * Intersects one collection per remaining attribute, in every combination.
- */
- private ArrayList<RecordCollection> computeKnowledgeSystem(
- HashMap<String, ArrayList<RecordCollection>> collectionMap) {
- String attrName = null;
- ArrayList<RecordCollection> cList = null;
-
- ArrayList<RecordCollection> ksCollections;
-
- ksCollections = new ArrayList<>();
-
-
- for (Map.Entry entry : collectionMap.entrySet()) {
- attrName = (String) entry.getKey();
- cList = (ArrayList<RecordCollection>) entry.getValue();
- break;
- }
- collectionMap.remove(attrName);
-
- for (RecordCollection rc : cList) {
- recurrenceComputeKS(ksCollections, collectionMap, rc);
- }
-
- return ksCollections;
- }
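- /**
- * Recursively intersects preCollection with the collections of the
- * attributes still left in map.
- *
- * @param ksCollections
- *            result list that receives the finished intersections
- * @param map
- *            attributes that have not been intersected yet
- * @param preCollection
- *            intersection accumulated so far
- */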
- private void recurrenceComputeKS(ArrayList<RecordCollection> ksCollections,
- HashMap<String, ArrayList<RecordCollection>> map,
- RecordCollection preCollection) {
- String attrName = null;
- RecordCollection tempCollection;
- ArrayList<RecordCollection> cList = null;
- HashMap<String, ArrayList<RecordCollection>> mapCopy = new HashMap<>();
-
-
- if(map.size() == 0){
- ksCollections.add(preCollection);
- return;
- }
-
- for (Map.Entry entry : map.entrySet()) {
- cList = (ArrayList<RecordCollection>) entry.getValue();
- mapCopy.put((String) entry.getKey(), cList);
- }
-
-
- for (Map.Entry entry : map.entrySet()) {
- attrName = (String) entry.getKey();
- cList = (ArrayList<RecordCollection>) entry.getValue();
- break;
- }
-
- mapCopy.remove(attrName);
- for (RecordCollection rc : cList) {
-
- tempCollection = preCollection.overlapCalculate(rc);
-
- if (tempCollection == null) {
- continue;
- }
-
-
- if (mapCopy.size() == 0) {
- ksCollections.add(tempCollection);
- } else {
- recurrenceComputeKS(ksCollections, mapCopy, tempCollection);
- }
- }
- }
-
- /**
- * Finds the attribute sets that can be reduced and prints the rules.
- */
- public void findingReduct() {
- RecordCollection[] sameClassRcs;
- KnowledgeSystem ks;
- ArrayList<RecordCollection> ksCollections;
-
- ArrayList<String> reductAttr = null;
- ArrayList<String> attrNameList;
-
- ArrayList<ArrayList<String>> canReductAttrs;
- HashMap<String, ArrayList<RecordCollection>> collectionMap;
-
- sameClassRcs = selectTheSameClassRC();
-
- recordSpiltToCollection();
-
- collectionMap = constructCollectionMap(reductAttr);
- ksCollections = computeKnowledgeSystem(collectionMap);
- ks = new KnowledgeSystem(ksCollections);
- System.out.println("原始集合分类的上下近似集合");
- ks.getDownSimilarRC(sameClassRcs[0]).printRc();
- ks.getUpSimilarRC(sameClassRcs[0]).printRc();
- ks.getDownSimilarRC(sameClassRcs[1]).printRc();
- ks.getUpSimilarRC(sameClassRcs[1]).printRc();
-
- attrNameList = new ArrayList<>();
- for (int i = 1; i < attrNames.length - 1; i++) {
- attrNameList.add(attrNames[i]);
- }
-
- ArrayList<String> remainAttr;
- canReductAttrs = new ArrayList<>();
- reductAttr = new ArrayList<>();
-
- for (String s : attrNameList) {
- remainAttr = (ArrayList<String>) attrNameList.clone();
- remainAttr.remove(s);
- reductAttr = new ArrayList<>();
- reductAttr.add(s);
- recurrenceFindingReduct(canReductAttrs, reductAttr, remainAttr,
- sameClassRcs);
- }
-
- printRules(canReductAttrs);
- }
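-
- /**
- * Recursively tries to reduce further attributes.
- *
- * @param resultAttr
- *            attribute sets found to be reducible so far
- * @param reductAttr
- *            attributes being reduced in this step
- * @param remainAttr
- *            attributes not yet reduced
- * @param sameClassRc
- *            the two record collections, one per decision class
- */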
- private void recurrenceFindingReduct(
- ArrayList<ArrayList<String>> resultAttr,
- ArrayList<String> reductAttr, ArrayList<String> remainAttr,
- RecordCollection[] sameClassRc) {
- KnowledgeSystem ks;
- ArrayList<RecordCollection> ksCollections;
- ArrayList<String> copyRemainAttr;
- ArrayList<String> copyReductAttr;
- HashMap<String, ArrayList<RecordCollection>> collectionMap;
- RecordCollection upRc1;
- RecordCollection downRc1;
- RecordCollection upRc2;
- RecordCollection downRc2;
-
- collectionMap = constructCollectionMap(reductAttr);
- ksCollections = computeKnowledgeSystem(collectionMap);
- ks = new KnowledgeSystem(ksCollections);
-
- downRc1 = ks.getDownSimilarRC(sameClassRc[0]);
- upRc1 = ks.getUpSimilarRC(sameClassRc[0]);
- downRc2 = ks.getDownSimilarRC(sameClassRc[1]);
- upRc2 = ks.getUpSimilarRC(sameClassRc[1]);
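- // Stop if an approximation of either decision class no longer
- // equals the class itself: this attribute set cannot be reduced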
- if (!upRc1.isCollectionSame(sameClassRc[0])
- || !downRc1.isCollectionSame(sameClassRc[0])) {
- return;
- }
-
- if (!upRc2.isCollectionSame(sameClassRc[1])
- || !downRc2.isCollectionSame(sameClassRc[1])) {
- return;
- }
-
- // Both classes are unchanged, so reductAttr can be reduced
- resultAttr.add(reductAttr);
-
- if (remainAttr.size() == 1) {
- return;
- }
-
- for (String s : remainAttr) {
- copyRemainAttr = (ArrayList<String>) remainAttr.clone();
- copyReductAttr = (ArrayList<String>) reductAttr.clone();
- copyRemainAttr.remove(s);
- copyReductAttr.add(s);
- recurrenceFindingReduct(resultAttr, copyReductAttr, copyRemainAttr,
- sameClassRc);
- }
- }
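-
- /**
- * Splits the records into two collections by decision class. Only two
- * decision classes are supported.
- */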
- private RecordCollection[] selectTheSameClassRC() {
- RecordCollection[] resultRc = new RecordCollection[2];
- resultRc[0] = new RecordCollection();
- resultRc[1] = new RecordCollection();
- String attrValue;
-
-
- attrValue = totalRecords.get(0).getRecordDecisionClass();
- for (Record r : totalRecords) {
- if (attrValue.equals(r.getRecordDecisionClass())) {
- resultRc[0].getRecord().add(r);
- }else{
- resultRc[1].getRecord().add(r);
- }
- }
-
- return resultRc;
- }
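-
- /**
- * Prints the decision rules left after each attribute reduction.
- *
- * @param reductAttrArray attribute sets that can be reduced
- */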
- public void printRules(ArrayList<ArrayList<String>> reductAttrArray){
-
- ArrayList<String> rulesArray;
- String rule;
-
- for(ArrayList<String> ra: reductAttrArray){
- rulesArray = new ArrayList<>();
- System.out.print("约简的属性:");
- for(String s: ra){
- System.out.print(s + ",");
- }
- System.out.println();
-
- for(Record r: totalRecords){
- rule = r.getDecisionRule(ra);
- if(!rulesArray.contains(rule)){
- rulesArray.add(rule);
- System.out.println(rule);
- }
- }
- System.out.println();
- }
- }
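-
- /**
- * Prints the record names of every collection in the list.
- *
- * @param rcList collections to print
- */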
- public void printRecordCollectionList(ArrayList<RecordCollection> rcList) {
- for (RecordCollection rc : rcList) {
- System.out.print("{");
- for (Record r : rc.getRecord()) {
- System.out.print(r.getName() + ", ");
- }
- System.out.println("}");
- }
- }
- }
The calling class, Client.java:
- package DataMining_RoughSets;
-
- /**
- * Entry point for the rough set attribute reduction example.
- */
- public class Client {
- public static void main(String[] args){
- String filePath = "C:\\Users\\lyq\\Desktop\\icon\\input.txt";
-
- RoughSetsTool tool = new RoughSetsTool(filePath);
- tool.findingReduct();
- }
- }
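Output (the program prints its labels in Chinese: 原始集合分类的上下近似集合 means "lower and upper approximations of the original decision classes", 约简的属性 means "reduced attributes", and 属性...,他的分类为... means "attributes ..., classified as ..."):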
- 原始集合分类的上下近似集合
- {x1, x2, x5, }
- {x1, x2, x5, }
- {x3, x4, x7, x6, x8, }
- {x3, x4, x7, x6, x8, }
- 约简的属性:Color,
- 属性Shape=Triangle,Size=Large,他的分类为Stable
- 属性Shape=Circle,Size=Small,他的分类为UnStable
- 属性Shape=Rectangle,Size=Large,他的分类为Stable
- 属性Shape=Circle,Size=Middle,他的分类为UnStable
- 属性Shape=Rectangle,Size=Middle,他的分类为UnStable
-
- 约简的属性:Color,Shape,
- 属性Size=Large,他的分类为Stable
- 属性Size=Small,他的分类为UnStable
- 属性Size=Middle,他的分类为UnStable
-
- 约简的属性:Shape,
- 属性Size=Large,Color=Red,他的分类为Stable
- 属性Size=Small,Color=Yellow,他的分类为UnStable
- 属性Size=Large,Color=Blue,他的分类为Stable
- 属性Size=Middle,Color=Red,他的分类为UnStable
- 属性Size=Small,Color=Blue,他的分类为UnStable
- 属性Size=Middle,Color=Blue,他的分类为UnStable
-
- 约简的属性:Shape,Color,
- 属性Size=Large,他的分类为Stable
- 属性Size=Small,他的分类为UnStable
- 属性Size=Middle,他的分类为UnStable
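A small issue with the algorithm
I did not run into any major problems while implementing the algorithm. I did change how the upper and lower approximations are computed: the lower approximation is made up of the knowledge-system sets that are completely contained in the target set, while the upper approximation starts from the lower approximation and adds the knowledge-system sets that contain those elements of the target set not yet covered. This differs slightly from the original formulation, but the overall idea of the algorithm still comes through.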
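As I read this description, the modified upper approximation can be sketched as below and compared with the textbook version (union of all blocks that intersect the target). The class UpperVariantDemo, its method names and the hard-coded blocks are assumptions made for this illustration; the author's KnowledgeSystem class may differ in its details.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical demo class, not part of the original project.
class UpperVariantDemo {
    // The knowledge system A/R of the example, written out by hand.
    static final List<Set<String>> BLOCKS = Arrays.asList(
        set("x1", "x2"), set("x3", "x4"), set("x5"),
        set("x6"), set("x7"), set("x8"));

    static Set<String> set(String... names) {
        return new TreeSet<>(Arrays.asList(names));
    }

    // Lower approximation: blocks completely contained in the target.
    static Set<String> lower(Set<String> target) {
        Set<String> result = new TreeSet<>();
        for (Set<String> block : BLOCKS) {
            if (target.containsAll(block)) {
                result.addAll(block);
            }
        }
        return result;
    }

    // Variant described in the post: start from the lower approximation,
    // then add the block of every target element that is still uncovered.
    static Set<String> upperFromLower(Set<String> target) {
        Set<String> result = lower(target);
        for (String name : target) {
            if (result.contains(name)) {
                continue;
            }
            for (Set<String> block : BLOCKS) {
                if (block.contains(name)) {
                    result.addAll(block);
                }
            }
        }
        return result;
    }

    // Textbook version: union of all blocks that intersect the target.
    static Set<String> upperStandard(Set<String> target) {
        Set<String> result = new TreeSet<>();
        for (Set<String> block : BLOCKS) {
            if (!Collections.disjoint(block, target)) {
                result.addAll(block);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> x = set("x2", "x5", "x7");
        System.out.println("variant:  " + upperFromLower(x));
        System.out.println("standard: " + upperStandard(x));
    }
}
```

Under these assumptions both methods return the same set, since a block that intersects the target either lies fully inside it (and is already in the lower approximation) or contains an uncovered target element (and is added in the second step).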
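My thoughts on the algorithm
Rough set attribute reduction is all about the reduction. There is more than one criterion that can serve as its basis. You do not have to use upper and lower approximations at all, and they do make the verification rather tedious. You can instead go through the records one by one with an attribute removed, see whether the final classification result is affected, and decide accordingly. Judging by the effect on the decision is itself just one way of doing attribute reduction.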
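The record-by-record check mentioned here could look like the following sketch: a set of kept attributes is acceptable if no two records agree on all of them and still end up in different decision classes. ConsistencyDemo and isConsistent are hypothetical names introduced for this example only.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class, not part of the original project.
class ConsistencyDemo {
    // The eight records: name, color, shape, size, stability.
    static final String[][] DATA = {
        {"x1", "Red", "Triangle", "Large", "Stable"},
        {"x2", "Red", "Triangle", "Large", "Stable"},
        {"x3", "Yellow", "Circle", "Small", "UnStable"},
        {"x4", "Yellow", "Circle", "Small", "UnStable"},
        {"x5", "Blue", "Rectangle", "Large", "Stable"},
        {"x6", "Red", "Circle", "Middle", "UnStable"},
        {"x7", "Blue", "Circle", "Small", "UnStable"},
        {"x8", "Blue", "Rectangle", "Middle", "UnStable"}
    };

    // True when records that agree on all kept columns always share
    // the same decision value (the last column).
    static boolean isConsistent(int... keptColumns) {
        Map<String, String> decisionByKey = new HashMap<>();
        for (String[] row : DATA) {
            StringBuilder key = new StringBuilder();
            for (int c : keptColumns) {
                key.append(row[c]).append('|');
            }
            String decision = row[row.length - 1];
            String previous = decisionByKey.putIfAbsent(key.toString(), decision);
            if (previous != null && !previous.equals(decision)) {
                return false; // same condition values, different decision
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("drop Color: " + isConsistent(2, 3));
        System.out.println("drop Size:  " + isConsistent(1, 2));
        System.out.println("keep Size only: " + isConsistent(3));
    }
}
```

On the example data, dropping Color keeps the table consistent, dropping Size does not (x5 and x8 become indistinguishable), and Size alone is already enough to decide stability.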
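When the algorithm is applicable
RoughSets gives decent classification results when the attribute set is fairly small, and it can also reduce storage cost. When there are many attributes, however, accuracy may not be guaranteed.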