受这些事实的启发,提出了很多不同的聚类技术,通过利用特征选择方法消除不相关和冗余的特征,同时保留相关特征,以提高聚类效率和质量。后面我们将描述基于域的不同的特征选择聚类(FSC)方法。介绍:传统FSC,文本数据中的FSC,流数据中的FSC和FSC链接数据。
与监督学习的特征选择类似,用于聚类的特征选择也被分类为Filter[15]、Wrapper[55]、Hybrid[19]。
不是使用聚类算法测试特征的质量,通过一个确定的标准来给特征的打分,然后选择最高评分的特征。
Dash等人在总结Ben-Bassat等人、Doak等人的工作后将评价准则分为五类:
距离度量(Distance Measure)、
信息增益度量(Information Gain Measure)、
依赖性度量(Dependence Measure)、
一致性度量(Consistency Measure)、
分类器错误率度量(Classifier Error Rate Measure)。
(1)距离度量:距离度量一般认为是差异性或者分离性的度量,常用的距离度量方法有欧式距离等。对于一个二元分类问题,对于两个特征f1f1和f2f2,如果特征f1f1引起的两类条件概率差异大于特征f2f2,则认为特征f1f1优于特征f2f2。
(2)信息增益度量:特征f的信息增益定义为使用特征f的先验不确定性与期望的后验不确性之间的差异。若特征f1f1的信息增益大于特征f2f2的信息增益,则认为特征f1f1优于特征f2f2。
(3)依赖性度量:依赖性度量又称为相关性度量(Correlation Measure)、通常可采用皮尔逊相关系数(Pearson correlation coefficient)来计算特征f与类别C之间的相关度,若特征f1f1与类别C之间的相关性大于特征f2f2与类别C之间的相关性,则认为特征f1f1优于特征f2f2。同样也可以计算得到属性与属性之间的相关度,属性与属性之间的相关性越低越好。
(4)一致性度量:假定两个样本,若它们的特征值相同,且所属类别也相同,则认为它们是一致的:否则,则称它们不一致。一致性常用不一致率来衡量,其尝试找出与原始特征集具有一样辨别能力的最小的属性子集。
(5)分类器错误率度量:该度量使用学习器的性能作为最终的评价阈值。它倾向于选择那些在分类器上表现较好的子集。
-以上5种度量方法中,距离度量(Distance Measure)、信息增益度量(Information Gain Measure)、依赖性度量(Dependence Measure)、一致性度量(Consistency Measure)常用于过滤式(filter);
-分类器错误率度量(Classifier Error Rate Measure)则用于包裹式(wrapper)。
https://blog.csdn.net/u012328159/article/details/53954522
特征评估可以是单变量(univariate)或多变量(multivariate)。
算法:SPEC(0.2.1.1),是单变量Filter模型的一个例子,在[78]中扩展到多变量方法。feature dependency [62](特征依赖), entropy-based distance [15](基于熵的距离), and laplacian score [26, 80](拉普拉斯分数)。
一些算法处理文本数据,一些算法处理流数据。还有一些算法能够处理不同类型的数据。在本节中,我们将讨论一下算法以及它们可以处理的数据类型。
能够处理通用数据集的聚类特征选择
SPEC[80]既可以监督也可以无监督学习,这里作为Filter模型 无监督 特征选择方法。
如果将SPEC 中替换为:
则LS拉普拉斯分数是SPEC的一个特殊的案例。
LS在数据大小方面非常有效。与SPEC相似,LS中最耗时的是构造相似矩阵s。该算法的优点是既能处理带标记的数据,又能处理无标记的数据。
[71]用Lasso和 L 1 L_1 L1范数作为特征选择方法嵌入在聚类过程中。特征选择的数量L使用gap statistics选择,类似于[67]中的选择聚类数量。
[1] Feature selection for dna methylation based cancer classi_cation. Bioinformatics, 17Suppl 1:S157-S164, 2001.
[2] A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19):2507-2517, Oct 2007.
[3] C.C. Aggarwal, J. Han, J. Wang, and P.S. Yu. A framework for clustering evolving data streams. In Proceedings of the 29th international conference on Very large data bases-Volume 29, pages 81-92. VLDB Endowment, 2003.
[4] C.C. Aggarwal, J. Han, J. Wang, and P.S. Yu. A framework for projected clustering of high dimensional data streams. In Proceedings of the Thirtieth international conference on Very large data bases-Volume 30, pages 852-863. VLDB Endowment, 2004.
[5] C.C. Aggarwal, J.L. Wolf, P.S. Yu, C. Procopiuc, and J.S. Park. Fast algorithms for projected clustering. ACM SIGMOD Record, 28(2):61-72, 1999.
[6] T.M. Akhriza, Y. Ma, and J. Li. Text clustering using frequent contextual termset. In Information Management, Innovation Management and Industrial Engineering(ICIII), 2011 International Conference on, volume 1, pages 339-342. IEEE, 2011.
[7] Salem Alelyani, LeiWang, and Huan Liu. The e_ect of the characteristics of the dataset on the selection stability. In Proceedings of the 23rd IEEE International Conference on Tools with Arti_cial Intelligence, 2011.
[8] F. Beil, M. Ester, and X. Xu. Frequent term-based text clustering. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 436-442. ACM, 2002.
[9] C. Boutsidis, M.W. Mahoney, and P. Drineas. Unsupervised feature selection for the k-means clustering problem. Advances in Neural Information Processing Systems, 22:153-161, 2009.
[10] P.S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and support vector machines. pages 82-90. Morgan Kaufmann, 1998.
[11] D. Cai, C. Zhang, and X. He. Unsupervised feature selection for multi-cluster data. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 333-342. ACM, 2010.
[12] A.C. Carvalho, R.F. Mello, S. Alelyani, H. Liu, et al. Quantifying features using false nearest neighbors: An unsupervised approach. In Tools with Arti_cial Intelligence(ICTAI), 2011 23rd IEEE International Conference on, pages 994-997. IEEE, 2011.
[13] Sanmay Das. Filters, wrappers and a boosting-based hybrid for feature selection. In ICML ‘01: Proceedings of the Eighteenth International Conference on Machine Learning, pages 74-81, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.29 30
[14] M. Dash and Y.S. Ong. Relief-c: E_cient feature selection for clustering over noisy data. In Tools with Arti_cial Intelligence (ICTAI), 2011 23rd IEEE International Conference on, pages 869-872. IEEE, 2011.
[15] Manoranjan Dash, Kiseok Choi, Peter Scheuermann, and Huan Liu. Feature selection for clustering - a filter solution. In In Proceedings of the Second International Conference on Data Mining, pages 115-122, 2002.
[16] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classi_cation. John Wiley & Sons, New York, 2 edition, 2001.
[17] Jennifer G. Dy and Carla E. Brodley. Feature subset selection and order identi_cation for unsupervised learning. In In Proc. 17th International Conf. on Machine Learning, pages 247-254. Morgan Kaufmann, 2000.
[18] Jennifer G. Dy and Carla E. Brodley. Feature selection for unsupervised learning. J. Mach. Learn. Res., 5:845-889, 2004.
[19] J.G. Dy. Unsupervised feature selection. Computational Methods of Feature Selection, pages 19-39, 2008.
[20] B.C.M. Fung, K. Wang, and M. Ester. Hierarchical document clustering using frequent itemsets. In Proceedings of the SIAM International Conference on Data Mining, volume 30, pages 59-70, 2003.
[21] Nicola L. C. Talbot Gavin C. Cawley and Mark Girolami. Sparse multinomial logistic regression via bayesian l1 regularisation. In NIPS, 2006.
[22] L. Gong, J. Zeng, and S. Zhang. Text stream clustering algorithm based on adaptive feature selection. Expert Systems with Applications, 38(3):1393-1399, 2011.
[23] I. Guyon and A. Elissee. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157-1182, 2003.
[24] Mark A. Hall. Correlation-based feature selection for machine learning. Technical report, 1999.
[25] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2001.
[26] X. He, D. Cai, and P. Niyogi. Laplacian score for feature selection. Advances in Neural Information Processing Systems, 18:507, 2006.
[27] J.Z. Huang, M.K. Ng, H. Rong, and Z. Li. Automated variable weighting in k-means type clustering. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(5):657-668, 2005.
[28] Anil Jain and Douglas Zongker. Feature selection: Evaluation, application, and small sample performance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19:153-158, 1997.
[29] D. Jensen and J. Neville. Linkage and autocorrelation cause feature selection bias in relational learning. In ICML, pages 259-266, 2002.
[30] L. Jing, M.K. Ng, and J.Z. Huang. An entropy weighting k-means algorithm for subspace clustering of high-dimensional sparse data. Knowledge and Data Engineering, IEEE Transactions on, 19(8):1026-1041, 2007.31
[31] Thorsten Joachims, Fachbereich Informatik, Fachbereich Informatik, Fachbereich Informatik, Fachbereich Informatik, and Lehrstuhl Viii. Text categorization with support vector machines: Learning with many relevant features, 1997.
[32] Y.S. Kim, W.N. Street, and F. Menczer. Evolutionary model selection in unsupervised learning. Intelligent Data Analysis, 6(6):531-556, 2002.
[33] Ron Kohavi and George H. John. Wrappers for feature subset selection, 1996.
[34] Y. Li, S.M. Chung, and J.D. Holt. Text document clustering based on frequent word meaning sequences. Data & Knowledge Engineering, 64(1):381-404, 2008.
[35] Y. Li, M. Dong, and J. Hua. Localized feature selection for clustering. Pattern Recognition Letters, 29(1):10-18, 2008.
[36] Y. Li, C. Luo, and S.M. Chung. Text clustering with feature selection by using statistical data. Knowledge and Data Engineering, IEEE Transactions on, 20(5):641-652,2008.
[37] H. Liu and H. Motoda. Feature Selection for Knowledge Discovery and Data Mining. Boston: Kluwer Academic Publishers, 1998.
[38] H. Liu and H. Motoda, editors. Computational Methods of Feature Selection. Chapman and Hall/CRC Press, 2007.
[39] Huan Liu and Rudy Setiono. A probabilistic approach to feature selection - a filter solution. pages 319-327. Morgan Kaufmann.
[40] Huan Liu and Lei Yu. Toward integrating feature selection algorithms for classi_cation and clustering. Knowledge and Data Engineering, IEEE Transactions on, 17(4):491 -502, April 2005.
[41] B. Long, Z.M. Zhang, X. Wu, and P.S. Yu. Spectral clustering for multi-type relational data. In Proceedings of the 23rd international conference on Machine learning, pages 585-592. ACM, 2006.
[42] B. Long, Z.M. Zhang, and P.S. Yu. A probabilistic framework for relational clustering. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 470-479. ACM, 2007.
[43] H.P. Luhn. A statistical approach to mechanized encoding and searching of literary information. IBM Journal of research and development, 1(4):309-317, 1957.
[44] Ulrike Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395-416, 2007.
[45] S.A. Macskassy and F. Provost. Classi_cation in networked data: A toolkit and a univariate case study. The Journal of Machine Learning Research, 8:935-983, 2007.
[46] Pabitra Mitra, Student Member, C. A. Murthy, and Sankar K. Pal. Unsupervised feature selection using feature similarity. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 24:301-312, 2002.
[47] D.S. Modha and W.S. Spangler. Feature weighting in k-means clustering. Machine learning, 52(3):217-237, 2003.32
[48] K. Morik, A. Kaspari, M. Wurst, and M. Skirzynski. Multi-objective frequent termset clustering. Knowledge and Information Systems, pages 1-24, 2012.
[49] MSK Mugunthadevi, M. Punitha, and M. Punithavalli. Survey on feature selection in document clustering. International Journal, 3, 2011.
[50] Mark E. J. Newman and Michelle Girvan. Finding and evaluating community structure in networks. Physical review E, 69(2):26113, 2004.
[51] Andrew Y. Ng. On feature selection: Learning with exponentially many irrelevant features as training examples. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 404-412. Morgan Kaufmann, 1998.
[52] Kamal Nigam, Andrew Kachites Mccallum, Sebastian Thrun, and Tom Mitchell. Text classi_cation from labeled and unlabeled documents using em. In Machine Learning,pages 103-134, 1999.
[53] I.S. Oh, J.S. Lee, and B.R. Moon. Hybrid genetic algorithms for feature selection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 26(11):1424-1437,2004.
[54] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993. [55] V. Roth and T. Lange. Feature selection in clustering problems. Advances in neural information processing systems, 16, 2003.
[56] Yong Rui and Thomas S. Huang. Image retrieval: Current techniques, promising directions and open issues. Journal of Visual Communication and Image Representation, 10:39-62, 1999.
[57] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad. Collective classi_cation in network data. AI magazine, 29(3):93, 2008.
[58] Wojciech Siedlecki and Jack Sklansky. On automatic feature selection. pages 63-87, 1993.
[59] M. R. Sikonja and I. Kononenko. Theoretical and empirical analysis of Relief and ReliefF. Machine Learning, 53:23-69, 2003.
[60] L. Song, A. Smola, A. Gretton, K. Borgwardt, and J. Bedo. Supervised feature selection via dependence estimation. In International Conference on Machine Learning, 2007.
[61] C. Su, Q. Chen, X. Wang, and X. Meng. Text clustering approach based on maximal frequent term sets. In Systems, Man and Cybernetics, 2009. SMC 2009. IEEE International Conference on, pages 1551-1556. IEEE, 2009.
[62] L. Talavera. Feature selection as a preprocessing step for hierarchical clustering. In MACHINE LEARNING-INTERNATIONAL WORKSHOP THEN CONFERENCE-, pages 389-397. MORGAN KAUFMANN PUBLISHERS, INC., 1999.
[63] Jiliang Tang and Huan Liu. Feature selection with linked data in social media. In SDM, 2012.
[64] Jiliang Tang and Huan Liu. Unsupervised feature selection for linked social media data. In KDD, 2012.33
[65] L. Tang and H. Liu. Relational learning via latent social dimensions. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 817-826. ACM, 2009.
[66] B. Taskar, P. Abbeel, M.F. Wong, and D. Koller. Label and link prediction in relational data. In Proceedings of the IJCAI Workshop on Learning Statistical Models from Relational Data. Citeseer, 2003.
[67] R. Tibshirani, G. Walther, and T. Hastie. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411-423, 2001.
[68] C.Y. Tsai and C.C. Chiu. Developing a feature weight self-adjustment mechanism for a k-means clustering algorithm. Computational statistics & data analysis, 52(10):4658-4672, 2008.
[69] J. Weston, A. Elisse, B. Schoelkopf, and M. Tipping. Use of the zero norm with linear odels and kernel methods. Journal of Machine Learning Research, 3:1439-1461, 2003.
[70] Dietrich Wettschereck, David W. Aha, and Takao Mohri. A review and empirical valuation of feature weighting methods for a class of lazy learning algorithms. Arti_cial ntelligence Review, 11:273-314, 1997.
[71] D.M. Witten and R. Tibshirani. A framework for feature selection in clustering. Journal f the American Statistical Association, 105(490):713-726, 2010.
[72] I.H. Witten and E. Frank. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann Pub, 2005.
[73] Zenglin Xu, Rong Jin, Jieping Ye, Michael R. Lyu, and Irwin King. Discriminative semi-upervised feature selection via manifold regularization. In IJCAI’ 09: Proceedings of he 21th International Joint Conference on Arti_cial Intelligence, 2009.
[74] Yiming Yang and Jan O. Pedersen. A comparative study on feature selection in text ategorization. pages 412-420. Morgan Kaufmann Publishers, 1997.
[75] L. Yu and H. Liu. Feature selection for high-dimensional data: A fast correlation-based filter solution. In T. Fawcett and N. Mishra, editors, Proceedings of the 20th Inter-
national Conference on Machine Learning (ICML-03), pages 856-863, Washington,D.C., August 21-24, 2003 2003. Morgan Kaufmann.
[76] L. Yu and H. Liu. E_cient feature selection via analysis of relevance and redundancy.Journal of Machine Learning Research (JMLR), 5(Oct):1205-1224, 2004.
[77] W. Zhang, T. Yoshida, X. Tang, and Q. Wang. Text clustering using frequent itemsets. Knowledge-Based Systems, 23(5):379-388, 2010.
[78] Z. Zhao and H. Liu. Spectral Feature Selection for Data Mining. Chapman & Hall/Crc Data Mining and Knowledge Discovery. Taylor & Francis, 2011.
[79] Zheng Zhao and Huan Liu. Semi-supervised feature selection via spectral analysis. In Proceedings of SIAM International Conference on Data Mining (SDM), 2007.
[80] Zheng Zhao and Huan Liu. Spectral feature selection for supervised and unsupervised learning. In ICML '07: Proceedings of the 24th international conference on Machine