CS5487 – Assignment 3 – Course Project
Department of Computer Science
City University of Hong Kong

Proposal due date: Fri, Week 10
Presentation date: TBA, Week 14
Report due date: Fri, Week 14

1 Course Project

The final assignment is a student-defined course project. The goal of the project is to get some hands-on experience using the course material on your own research problems. If you can’t think of a project, then you can do the “default” project, which is digit classification (see Section 2).

1.1 Project topic

Keep in mind that there will only be about 4 weeks to do the project, so the scope should not be too large. Following the major themes of the course, here are some general topics for the project:

• regression (supervised learning) – use regression methods (e.g., ridge regression, Gaussian processes) to model data or predict from data.
• classification (supervised learning) – use classification methods (e.g., SVM, BDR, Logistic Regression) to learn to distinguish between multiple classes given a feature vector.
• clustering (unsupervised learning) – use clustering methods (e.g., K-means, EM, Mean-Shift) to discover the natural groups in data.
• visualization (unsupervised learning) – use dimensionality reduction methods (e.g., PCA, kernel-PCA, non-linear embedding) to visualize the structure of high-dimensional data.

You can pick any one of these topics and apply it to your own problem/data. Before actually doing the project, you need to write a project proposal so that we can make sure the project is doable within the 3-4 weeks. I can also give you some pointers to relevant methods, if necessary.

• Can my project be my recently submitted or soon-to-be submitted paper? If you plan to just turn in the results from your paper, then the answer is no. The project cannot be work that you have already done. However, your course project can be based on extending your work.
  For example, you can try some models introduced in the course on your data/problem.

1.2 Project details

• Group project – all projects should have a group of 2 students. To sign up for a group, go to Canvas ⇒ “People” and then join one of the existing “Project Groups”. If you cannot find a group, please use the Discussion board.
• Project Proposal – For the first part of the project, you need to write a project proposal. The project proposal should be at most one page with the following contents: 1) an introduction that briefly states the problem; 2) a precise description of what you plan to do, e.g., What types of features do you plan to use? What algorithms do you plan to use? What dataset will you use? How will you evaluate your results? How do you define a good outcome for the project? The goal of the proposal is to work out, in your head, what your project will be. Once the proposal is done, it is just a matter of implementation!
• Project Poster Presentation – Project poster presentations will be at the end of the semester, before the project due date (Week 14). More details will be sent out later. The poster presentation is optional. However, if you want to get an “A” on your project, then you must give a presentation. Put another way, if you don’t give a presentation, then you will get at most a “B+” on your project.
• Project Report – The project report is essentially the project proposal with all the details filled in. The report should have the following contents: 1) introduction – what is the problem? why is it important?; 2) methodology – what algorithms did you use and what are the technical details? what are the advantages and disadvantages?; 3) experimental setup – what data did you use? how did you pre-process the data? which algorithms did you run on the data? what is the metric for evaluation?; 4) experimental results – what were the results? what insight do you get from these results? what are some typical success and failure cases?
  The project report should be at least 4 pages. There is no upper page limit, but it should probably not be more than 8 pages long. For group projects, the project report must state the level of contribution from each project member.
• What to hand in – You need to turn in the following things:
  1. Project proposal (due Friday, Week 10).
  2. Project report (due Friday, Week 14).
  3. Presentation poster (due Friday, Week 14).
  4. Source code files (due Friday, Week 14).
  Only one group member needs to submit the files on Canvas. You must submit your course project materials using the Canvas website. Go to “Assignments” ⇒ “Course Project” ⇒ select the appropriate entry.
• Third Party Code – In the course project, you may use 3rd party source code, e.g., libsvm. If you use 3rd party code, you must acknowledge it with an appropriate reference.
• Grading – The marks for this project will be distributed as follows:
  – 16.7% – Project proposal.
  – 16.7% – Technical correctness (whether you used the algorithms correctly).
  – 16.7% – Experiments. More points for thoroughness and testing interesting cases (e.g., different parameter settings).
  – 16.7% – Analysis of the experiments. More points for insightful observations and analysis.
  – 16.7% – Quality of the written report (organized, complete descriptions, etc).
  – 16.7% – Project poster presentation.
  Note: Here 16.7% means 5/30.

2 Default Course Project – Digit Classification

The default project is handwritten digit classification on a subset of the MNIST digits dataset.

• Dataset – The provided dataset is a subset of the MNIST digits. The dataset has 10 classes (digits 0 through 9) with 4000 images (400 images per class). Each feature vector is a vectorized image (784 dimensions), containing grayscale values [0, 255]. The original image dimensions are 28 × 28.
  Here is an example montage of the digits:

  [Figure: montage of digit images]

  The MATLAB file digits4000.mat (or digits4000_*.txt for non-MATLAB users) contains the following data:
  – digits_vec – a 784 × 4000 matrix, where each column is a vectorized image, i.e., the feature vector $x_i \in \mathbb{R}^{784}$.
  – digits_labels – a 1 × 4000 matrix with the corresponding labels $y_i \in \{0, \dots, 9\}$.
  – trainset – a 2 × 2000 matrix, where each row (one per trial) is a set of indices to be used for training the classifier.
  – testset – a 2 × 2000 matrix, where each row (one per trial) is the corresponding set of indices to be used for testing the classifier.

  The image above was generated with the following MATLAB code:

    testX = digits_vec(:,testset(1,:));         % get test data (trial 1)
    testXimg = reshape(testX, [28 28 1 2000]);  % turn into an image sequence
    montage(uint8(testXimg), 'size', [40 50]);  % view as a montage

• Methodology – You can use any technique from the course material, e.g., Bayes classifiers, Fisher’s Discriminant, SVMs, logistic regression, perceptron, kernel functions, etc. You may also use other classification techniques not learned in class, but you will need to describe them in detail in your report. Two useful libraries for classification are “libsvm” and “liblinear”.
  You can also pre-process the feature vectors, e.g., using PCA or kPCA to reduce the dimension, or apply other processing techniques (e.g., normalization or some image processing).
  Finally, a common trick for doing multi-class classification using only binary classifiers (e.g., SVMs) is to use a set of 1-vs-all binary classifiers. Each binary classifier is trained to distinguish one digit (+1) vs. the rest of the digits (-1). In this case, there are 10 binary classifiers in total. Given a test example, each binary classifier makes a prediction. Hopefully, only one classifier has a positive prediction, which can then be selected as the class. If not, then the classifier that has the most confidence in its prediction is selected.
  For example, for SVMs the classifier that places the test example furthest from the margin on the positive side (the largest decision value) would be selected. For logistic regression, the selection would be based on the calculated class probability.
• Evaluation – The classifiers are evaluated over 2 experiment trials. In each trial, 50% of the data has been set aside for training (and cross-validation of parameters), and the remaining 50% is held out for testing only. The indices of the training and test sets are given in the trainset and testset matrices. For a given trial, the same writer does not appear in both the training and test sets.
  For each trial, train a classifier using only the training set data (images and labels). You may also use the training set to select the optimal model parameters using cross-validation. After training the classifier, apply the classifier to the test data (images only) to predict the class. Record the accuracy (number of correct predictions / total number of predictions) for that trial. Do not tune the parameters to optimize the test accuracy directly! You can only tune the parameters using the training set.
  As a baseline, a simple nearest-neighbors classifier with Euclidean distance was used on the test data. The resulting classification accuracy for each experiment trial is:

      trial     1        2        mean (std)
      1-NN      0.9135   0.9185   0.9160 (0.0035)

  In your experiments, which classifier does better? What feature pre-processing helps or hurts the performance? How does the performance vary with parameter values?
• Bonus Challenge – In the bonus challenge, I will give you a new test set containing my own handwritten digits, and you will try to classify them using your trained classifiers. Whoever gets the best performance wins a prize!
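
For non-MATLAB users, the evaluation protocol and the 1-vs-all selection rule from Section 2 can be sketched in plain Python. This is only a minimal sketch under stated assumptions, not the code used to produce the baseline numbers. The function names (predict_1nn, one_vs_all_predict, run_trials) and the toy data are made up for illustration. The sketch stores one sample per row, whereas digits_vec stores one sample per column. It also uses 0-based indices, so indices read from the data files would presumably need 1 subtracted if they follow the MATLAB convention.

```python
import math
import statistics


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_1nn(train_X, train_y, x):
    """Return the label of the training vector closest to x."""
    nearest = min(range(len(train_X)), key=lambda i: euclidean(train_X[i], x))
    return train_y[nearest]


def one_vs_all_predict(scores):
    """Pick the class whose 'class k vs. rest' classifier is most confident.

    scores[k] is a confidence value such as an SVM decision value or a
    logistic-regression probability. If exactly one score is positive,
    that class is also the argmax, so both cases in the text are covered.
    """
    return max(range(len(scores)), key=lambda k: scores[k])


def accuracy(pred, truth):
    """Number of correct predictions divided by the total number."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def run_trials(X, y, trainset, testset):
    """Run 1-NN once per trial, using only that trial's training rows."""
    accs = []
    for train_idx, test_idx in zip(trainset, testset):
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        pred = [predict_1nn(train_X, train_y, X[i]) for i in test_idx]
        accs.append(accuracy(pred, [y[i] for i in test_idx]))
    # Sample standard deviation (n - 1 in the denominator).
    return accs, statistics.mean(accs), statistics.stdev(accs)


# Toy stand-in data (hypothetical, NOT the digits): two 2-D classes.
X = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9],
     [0.2, 0.1], [0.0, 0.3], [4.8, 5.1], [5.1, 5.3]]
y = [0, 0, 1, 1, 0, 0, 1, 1]
trainset = [[0, 1, 2, 3], [4, 5, 6, 7]]  # 0-based indices, one row per trial
testset = [[4, 5, 6, 7], [0, 1, 2, 3]]

accs, mean_acc, std_acc = run_trials(X, y, trainset, testset)
print(accs, mean_acc, std_acc)
print(one_vs_all_predict([-1.2, 0.4, -0.3]))  # class 1 has the largest score
```

The sample standard deviation is used because, applied to the two baseline accuracies 0.9135 and 0.9185, it gives about 0.0035, which matches the value in the baseline table. Any real classifier can replace predict_1nn, as long as it is fitted and tuned on the training indices only.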