Homework 6, INF 552, Instructor: Mohammad Reza Rajati

1. Supervised, Semi-Supervised, and Unsupervised Learning

(a) Download the Breast Cancer Wisconsin (Diagnostic) Data Set from: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29. Use the file at https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data, which has IDs, classes (Benign = B, Malignant = M), and 30 attributes. This data set has two output classes. Use the first 20% of the positive class and the first 20% of the negative class in the file as the test set, and the rest as the training set.

(b) Monte-Carlo Simulation: Repeat the following procedures for supervised, unsupervised, and semi-supervised learning M = 30 times, using randomly selected train and test data (make sure you use 20% of both the positive and negative classes as the test set). Then compare the average scores (accuracy, precision, recall, F-score, and AUC) that you obtain from each algorithm.

    i. Supervised Learning: Train an L1-penalized SVM to classify the data. Use 5-fold cross-validation to choose the penalty parameter. Use normalized data. Report the average accuracy, precision, recall, F-score, and AUC, for both training and test sets, over your M runs. Plot the ROC and report the confusion matrix for training and testing in one of the runs. (A minimal sketch of this step appears after this problem.)

    ii. Semi-Supervised Learning / Self-Training: Select 50% of the positive class along with 50% of the negative class in the training set as labeled data, and the rest as unlabeled data. You can select them randomly.

        A. Train an L1-penalized SVM to classify the labeled data. Use normalized data. Choose the penalty parameter using 5-fold cross-validation.

        B. Find the unlabeled data point that is farthest from the decision boundary of the SVM. Let the SVM label it (ignore its true label), add it to the labeled data, and retrain the SVM. Continue this process until all unlabeled data are used. Test the final SVM on the test data and report the average accuracy, precision, recall, F-score, and AUC, for both training and test sets, over your M runs. Plot the ROC and report the confusion matrix for training and testing in one of the runs. (A sketch of this loop appears after this problem.)

    iii. Unsupervised Learning: Run the k-means algorithm on the whole training set. Ignore the labels of the data, and assume k = 2. (A clustering sketch appears after this problem.)

        A. Run the k-means algorithm multiple times. Make sure that you initialize the algorithm randomly. How do you make sure that the algorithm was not trapped in a local minimum?

        B. Compute the centers of the two clusters and find the 30 data points closest to each center. Read the true labels of those 30 data points and take a majority poll within them. The majority poll becomes the label predicted by k-means for the members of each cluster. Then compare the labels provided by k-means with the true labels of the training data, and report the average accuracy, precision, recall, F-score, and AUC over the M runs, as well as the ROC and the confusion matrix for one of the runs.

        Note: Here we are using k-means as a classifier. The 30 data points closest to each center are labeled by experts, so as to use k-means for classification. Obviously, this is a naive approach.

        C. Classify the test data based on their proximity to the centers of the clusters. Report the average accuracy, precision, recall, F-score, and AUC over the M runs, as well as the ROC and the confusion matrix of the test data for one of the runs.

    iv. Spectral Clustering: Repeat 1(b)iii using spectral clustering, which is clustering based on kernels. Research what spectral clustering is. Use the RBF kernel with gamma = 1, or find a gamma for which the two clusters have the same balance as the original data set (if the positive class has p samples and the negative class has n samples, the two clusters must have p and n members).
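The following is a minimal sketch of the supervised step in 1(b)i, assuming the wdbc.data split has already been loaded into numpy arrays X_train, y_train, X_test, y_test with labels encoded as 0/1. The variable names, the helper fit_l1_svm, and the penalty grid are illustrative assumptions, not part of the assignment; the helper is reused in the later sketches.

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import LinearSVC
    from sklearn.metrics import f1_score, roc_auc_score, confusion_matrix

    def fit_l1_svm(X, y, n_folds=5):
        """Choose the L1 penalty C by n-fold cross-validation on normalized data."""
        pipe = Pipeline([
            ("scale", StandardScaler()),  # normalize the 30 attributes
            ("svm", LinearSVC(penalty="l1", dual=False, max_iter=10000)),
        ])
        grid = {"svm__C": np.logspace(-3, 3, 20)}  # illustrative penalty grid
        return GridSearchCV(pipe, grid, cv=n_folds).fit(X, y).best_estimator_

    svm = fit_l1_svm(X_train, y_train)
    y_pred = svm.predict(X_test)
    print(f1_score(y_test, y_pred))  # likewise accuracy, precision, recall
    print(roc_auc_score(y_test, svm.decision_function(X_test)))
    print(confusion_matrix(y_test, y_pred))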
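One way to implement the self-training loop of 1(b)iiB, reusing the hypothetical fit_l1_svm helper above; X_lab, y_lab, X_unlab are assumed to hold the 50%-labeled / 50%-unlabeled split as numpy arrays.

    import numpy as np

    while len(X_unlab) > 0:
        svm = fit_l1_svm(X_lab, y_lab)                 # retrain on the current labeled pool
        dist = np.abs(svm.decision_function(X_unlab))  # proportional to distance from boundary
        far = int(np.argmax(dist))                     # farthest unlabeled point
        X_lab = np.vstack([X_lab, X_unlab[far]])
        y_lab = np.append(y_lab, svm.predict(X_unlab[far:far + 1]))  # SVM's own label
        X_unlab = np.delete(X_unlab, far, axis=0)      # remove it from the unlabeled set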
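A sketch of 1(b)iiiA-B and 1(b)iv, assuming X_train is already normalized and y_train is a 0/1 integer array. scikit-learn's KMeans runs n_init random restarts and keeps the solution with the lowest inertia, which is the usual guard against a poor local minimum; SpectralClustering exposes no cluster centers, hence fit_predict.

    import numpy as np
    from sklearn.cluster import KMeans, SpectralClustering

    km = KMeans(n_clusters=2, init="random", n_init=50)  # 50 random restarts
    clusters = km.fit_predict(X_train)

    cluster_label = {}
    for c, center in enumerate(km.cluster_centers_):
        nearest = np.argsort(np.linalg.norm(X_train - center, axis=1))[:30]
        cluster_label[c] = np.bincount(y_train[nearest]).argmax()  # majority poll
    y_km = np.array([cluster_label[c] for c in clusters])  # labels predicted by k-means

    sc = SpectralClustering(n_clusters=2, affinity="rbf", gamma=1.0)
    y_sc = sc.fit_predict(X_train)  # clusters may be non-convex: no centers to use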
        Do not label data points based on their proximity to the cluster centers, because spectral clustering may give you non-convex clusters. Instead, use the fit_predict method.

    v. One can expect that supervised learning on the full data set works better than semi-supervised learning with half of the data set labeled, and one can expect that unsupervised learning underperforms in such situations. Compare the results you obtained by those methods.

2. Active Learning Using Support Vector Machines

(a) Download the banknote authentication Data Set from: https://archive.ics.uci.edu/ml/datasets/banknote+authentication. Choose 472 data points randomly as the test set, and the remaining 900 points as the training set. This is a binary classification problem.

(b) Repeat each of the following two procedures 50 times. You will have 50 errors for each of the 90 SVMs per procedure.

    i. Train an SVM with a pool of 10 data points selected randomly from the training set, using a linear kernel and an L1 penalty. Select the penalty parameter using 10-fold cross-validation. Repeat this process by adding 10 other randomly selected data points to the pool, until you have used all 900 points. Do NOT replace the samples back into the training set at each step. Calculate the test error of each SVM. You will have 90 SVMs that were trained using 10, 20, 30, ..., 900 data points, and their 90 test errors. You have implemented passive learning. (A sketch follows this part.)

    Note (how to choose parameter ranges for SVMs): One can use wide ranges for the parameters and a fine grid (e.g., 1000 points) for cross-validation; however, this method may be computationally expensive. An alternative is to train the SVM with very large and very small parameters on the whole training data and find the extreme parameter values for which the training accuracy does not fall below a threshold (e.g., 70%). Then one can select a fixed number of parameter values (e.g., 20) between those extremes for cross-validation. For the penalty parameter, one usually considers increments in log(λ). For example, if one found that the accuracy of a support vector machine will not fall below 70% for λ = 10^-3 and λ = 10^6, one has to choose log(λ) ∈ {-3, -2, ..., 4, 5, 6}. For the Gaussian kernel parameter, one usually chooses linear increments, e.g., σ ∈ {0.1, 0.2, ..., 2}. When both σ and λ are to be chosen using cross-validation, combinations of very small and very large λ's and σ's that keep the accuracy above the threshold (e.g., 70%) can be used to determine the ranges for σ and λ. Please note that these are very rough rules of thumb, not general procedures.

    ii. Train an SVM with a pool of 10 data points selected randomly from the training set (if all selected data points are from one class, select another set of 10 data points randomly), using a linear kernel and an L1 penalty. Select the parameters of the SVM with 10-fold cross-validation. Choose the 10 data points in the training set closest to the hyperplane of the SVM (you may use the result from linear algebra about the distance of a point from a hyperplane) and add them to the pool. Do not replace the samples back into the training set. Train a new SVM using the pool. Repeat this process until all training data are used. You will have 90 SVMs that were trained using 10, 20, 30, ..., 900 data points, and their 90 test errors. You have implemented active learning. (A sketch follows this part.)
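A sketch of the passive learner in 2(b)i for a single one of the 50 repetitions, again reusing the hypothetical fit_l1_svm helper from Problem 1 (with 10 folds here). X_train, y_train, X_test, y_test are assumed to hold the 900/472 banknote split.

    import numpy as np

    rng = np.random.default_rng()
    order = rng.permutation(len(X_train))  # grow the pool without replacement
    passive_err = []
    for n in range(10, 901, 10):
        pool = order[:n]                   # 10 more random points per step
        svm = fit_l1_svm(X_train[pool], y_train[pool], n_folds=10)  # tiny pools may need fewer folds
        passive_err.append(1 - svm.score(X_test, y_test))  # 90 test errors in total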
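The active learner of 2(b)ii differs only in how the 10 new points are chosen: |decision_function(x)| is proportional to the distance |w·x + b| / ||w|| of x from the hyperplane, so the smallest values identify the closest points. Same assumed variables and helper as above.

    import numpy as np

    rng = np.random.default_rng()
    remaining = list(rng.permutation(len(X_train)))
    pool = remaining[:10]                  # initial random pool (redraw if single-class)
    remaining = remaining[10:]
    active_err = []
    while True:
        svm = fit_l1_svm(X_train[pool], y_train[pool], n_folds=10)
        active_err.append(1 - svm.score(X_test, y_test))
        if not remaining:
            break                          # all 900 points used: 90 errors recorded
        dist = np.abs(svm.decision_function(X_train[remaining]))
        closest = set(np.argsort(dist)[:10])  # 10 points nearest the hyperplane
        pool += [remaining[j] for j in closest]
        remaining = [r for j, r in enumerate(remaining) if j not in closest]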
(c) Average the 50 test errors for each of the 90 incrementally trained SVMs in 2(b)i and 2(b)ii. By doing so, you are performing a Monte-Carlo simulation. Plot the average test error versus the number of training instances for both the active and the passive learner on the same figure, and report your conclusions. Here, you are actually obtaining a learning curve by Monte-Carlo simulation. (A plotting sketch follows.)
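For 2(c), assuming the 50 repetitions of each procedure have been stacked into 50 x 90 arrays passive_runs and active_runs (illustrative names), the Monte-Carlo learning curves can be plotted as follows.

    import numpy as np
    import matplotlib.pyplot as plt

    sizes = np.arange(10, 901, 10)         # pool sizes of the 90 SVMs
    plt.plot(sizes, passive_runs.mean(axis=0), label="passive (random pool)")
    plt.plot(sizes, active_runs.mean(axis=0), label="active (closest to hyperplane)")
    plt.xlabel("number of training instances")
    plt.ylabel("average test error over 50 runs")
    plt.legend()
    plt.show()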