Curriculum Design for Data Mining

Objectives
Curriculum Design is a systematic and practical means of reinforcing the theories and methods taught in the Data Mining course. In the Curriculum Design for Data Mining, several simulated real-application data sets are provided and several curriculum design projects are planned. By completing the Curriculum Design, students will master:
1. Handling real application data with database techniques;
2. The steps of big data mining with elementary supervised learning methods;
3. Strategies for evaluating classifiers;
4. The main factors that affect a classifier's performance;
5. The primary tools for solving real application problems with data mining.

Project 1: Comparison between supervised learning algorithms
1. Data set
Refer to the affiliated files: adult.train, adult.test and adult.desctiption. The adult.train file is used for training, adult.test for testing, and adult.desctiption describes the attributes in the data. Missing values in the data are labeled '?'.
2. Tasks
(1) Data preprocessing. Migrate the data from the files to a database such as Oracle, then process the data with database techniques. Remove the tuples with missing values.
(2) Building prediction models using the training data. Train one classifier with each of the elementary supervised learning methods: Naïve Bayesian classification, ID3, C4.5, CART, and BPANN.
(3) Accuracy comparison between the different classifiers.

Project 2: Investigation of noisy data impact
1. Data set
Refer to the data for Project 1.
2. Tasks
(1) Data preprocessing. Do not remove the tuples with missing values. Instead, replace each missing value with a proper value derived from the same column, e.g., the mean value, a regressed value, or another value obtained by data imputation techniques.
(2) Building a prediction model using C4.5.
(3) Accuracy comparison between the C4.5 classifiers built on the two versions of the data: one with the missing-value tuples removed (Project 1) and one with the missing values imputed.

Project 3: Simulated application
1. Introduction to letter recognition application
The character images were based on 20 different fonts, and each letter within these 20 fonts was randomly distorted to produce a file of 20,000 unique stimuli. Each stimulus was converted into 16 numerical attributes. Examples of the character images generated by these procedures are presented in the Figure. Each character image was scanned, pixel by pixel, to extract the 16 numerical attributes. These attributes represent primitive statistical features of the pixel distribution. To achieve compactness, each attribute was then scaled linearly to a range of integer values from 0 to 15. This final set of values was adequate to provide a perfect separation of the 26 classes. That is, no feature vector mapped to more than one class.
The attributes (before scaling to the 0-15 range) are:
(1) The horizontal position, counting pixels from the left edge of the image, of the center of the smallest rectangular box that can be drawn with all on pixels inside the box.
(2) The vertical position, counting pixels from the bottom, of the above box.
(3) The width, in pixels, of the box.
(4) The height, in pixels, of the box.
(5) The total number of on pixels in the character image.
(6) The mean horizontal position of all on pixels relative to the center of the box and divided by the width of the box. This feature has a negative value if the image is left-heavy, as would be the case for the letter L.
(7) The mean vertical position of all on pixels relative to the center of the box and divided by the height of the box.
(8) The mean squared value of the horizontal pixel distances as measured in 6 above. This attribute will have a higher value for images whose pixels are more widely separated in the horizontal direction, as would be the case for the letters W or M.
(9) The mean squared value of the vertical pixel distances as measured in 7 above.
(10) The mean product of the horizontal and vertical distances for each on pixel as measured in 6 and 7 above. This attribute has a positive value for diagonal lines that run from bottom left to top right and a negative value for diagonal lines from top left to bottom right.
(11) The mean value of the squared horizontal distance times the vertical distance for each on pixel. This measures the correlation of the horizontal variance with the vertical position.
(12) The mean value of the squared vertical distance times the horizontal distance for each on pixel. This measures the correlation of the vertical variance with the horizontal position.
(13) The mean number of edges (an on pixel immediately to the right of either an off pixel or the image boundary) encountered when making systematic scans from left [...]
(15) The mean number of edges (an on pixel immediately above either an off pixel or the image boundary) encountered when making systematic scans of the image from bottom to top over all horizontal positions within the box.
(16) The sum of horizontal positions of edges encountered as measured in 15 above.
2. Data set
Refer to the affiliated files: letter-recognition.data and letter-recognition.desctiption. The letter-recognition.data file is used for both training and testing, and letter-recognition.desctiption describes the attributes in the data.
3. Tasks
(1) Data preprocessing. Migrate the data from the files to a database such as Oracle.
(2) Data partition by the hold-out method, i.e., randomly divide the data into two parts, 2/3 as the training set and 1/3 as the test set.
(3) Building a prediction model using C4.5 on the training set.
(4) Assessing its accuracy on the test set.

Project 4: Comparison between evaluation methods
1. Data set
Refer to the data for Project 3.
2. Tasks
(1) Building a prediction model/classifier using C4.5.
(2) Evaluating its accuracy by the hold-out method (i.e., Project 3), random sampling, 10-fold cross-validation (10-CV), stratified 10-CV, and bootstrap, respectively.
(3) Accuracy comparison between the C4.5 classifiers under the different evaluation methods.

Project 5: Investigation of pruning against overfitting
1. Data set
Refer to the data for Project 3.
2. Tasks
(1) Building a prediction model using CART.
(2) Building a prediction model using CART with CCP (cost-complexity pruning).
(3) Accuracy comparison between the CART classifiers without and with pruning.

Requirements
1. The experiment is carried out in groups of no more than 5 students. Every group has to finish the 5 compulsory projects before the due date.
2. Either Python or R can be used to program your projects, but Python is preferred since it will help you find a good job in the near future.
3. To finish the projects, you may download packages from online resources and modify them, but you should understand all the code involved in your projects.
4. To ensure that the curriculum design runs smoothly, each group should select one member as the leader in charge of the team's work. The leader is responsible for organizing the team members to complete the five projects in collaboration, and has the right to assign tasks to each member and to decide each member's contribution rate.
5. When writing the final curriculum design report, each group member may choose at most one project in which to describe the work he or she completed, subject to the group leader's coordination and approval.

Evaluation of your work
1. Your performance in this course is evaluated based on the curriculum design report. Everyone should finish the report according to the tasks. Your performance will be judged by:
Completeness. Each group should finish all 5 compulsory projects; otherwise the team score will be reduced by a certain amount for each missing project or task.
Correctness. Please run each project several times until you are sure that the final result is right.
Format. Please edit your report in a unified format. In your report, the font style and size, as well as the pictures and tables, should each follow one unified format. Readability by format accounts for 50% of the total score.
Plagiarism. Everyone should finish his or her curriculum design report independently. If any two reports are found to be identical, both will receive the same penalty.
2. Curriculum Design Report. Everyone must finish a report on his or her selected project. The report can be prepared as a Word file and written in either English or Chinese. You are required to submit two versions of your curriculum design report: a printed one and an electronic one.
3. How to submit the electronic curriculum design report. Please refer to the electronic submission template for Database Systems Curriculum Design 2 (数据库系统课程设 2提交电子档案模板).

A blank page is left here
From the following page, your work is presented
Project ?: (please give your project no. and name)
Contents
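
The two preprocessing strategies in Projects 1 and 2 (removing tuples that contain '?' versus replacing '?' with a value derived from the same column) can be sketched in plain Python as follows. This is only an illustrative sketch, not a required implementation: the function names and the toy records in SAMPLE are hypothetical, the records are made up rather than taken from adult.train, and the fill rule (column mean for numeric columns, most frequent value for categorical ones) is just one of the imputation choices the task allows.

```python
import csv
import io
from collections import Counter
from statistics import mean

MISSING = "?"  # missing-value marker stated in the assignment

# Made-up toy records (NOT real rows of adult.train), for illustration only.
SAMPLE = """39, State-gov, 13, <=50K
50, ?, 13, <=50K
38, Private, ?, >50K
28, Private, 10, <=50K
"""

def read_rows(text):
    """Parse comma-separated records and trim whitespace around each field."""
    reader = csv.reader(io.StringIO(text))
    return [[field.strip() for field in row] for row in reader if row]

def drop_missing(rows):
    """Project 1 style: discard every tuple that contains a missing value."""
    return [row for row in rows if MISSING not in row]

def is_number(value):
    """Return True if the string can be read as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False

def impute_missing(rows):
    """Project 2 style: fill '?' with the column mean (numeric) or mode (categorical)."""
    fill = []
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] != MISSING]
        if not observed:
            fill.append(MISSING)  # nothing observed, so nothing to derive a value from
        elif all(is_number(v) for v in observed):
            fill.append(str(mean(float(v) for v in observed)))
        else:
            fill.append(Counter(observed).most_common(1)[0][0])
    return [[fill[j] if v == MISSING else v for j, v in enumerate(row)]
            for row in rows]

rows = read_rows(SAMPLE)
clean_rows = drop_missing(rows)      # data version for Project 1
imputed_rows = impute_missing(rows)  # data version for Project 2
```

Training the same C4.5-style classifier on clean_rows and on imputed_rows then gives the two accuracies that Project 2 asks you to compare. The same logic can also be expressed in SQL once the data has been migrated to a database.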

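
The evaluation methods named in Projects 3 and 4 differ mainly in how they split the data into training and test parts. The sketch below shows one possible way to generate those splits with the Python standard library only. All function names are hypothetical, the round-robin stratification is a simplification (fold sizes can differ by a few records when class counts are not multiples of k), and the classifier itself is left out, so the sketch covers the partitioning step and the accuracy measure, not C4.5.

```python
import random
from collections import defaultdict

def holdout_split(n, train_fraction=2 / 3, seed=0):
    """Randomly divide indices 0..n-1 into a training part and a test part."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = round(n * train_fraction)
    return indices[:cut], indices[cut:]

def stratified_kfold(labels, k=10, seed=0):
    """Yield (train, test) index lists; each fold keeps class proportions roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        for pos, i in enumerate(members):
            folds[pos % k].append(i)  # deal each class round-robin across the folds
    for f in range(k):
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, folds[f]

def bootstrap_split(n, seed=0):
    """Draw n training indices with replacement; indices never drawn form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]
    return train, test

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Random sampling in the sense of Project 4 can be obtained by calling holdout_split several times with different seed values and averaging the resulting accuracies. Plain 10-CV is the same as stratified_kfold without the grouping by class.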