
COMP20008 - 2018 - SM2 - Project Phase 2

Release Date: 11:59am Monday, September 2018
Due Date: 11:59am Friday, September 2018
Submission is via the LMS.

Please make sure you get a submission confirmation email once you submit your assignment. Otherwise, it will be considered a late submission.

Phase 2: Python Data Wrangling (15 marks, worth 15% of subject grade)

For banks, risk management and default detection have always been a crucial part of issuing credit cards. Defaults on credit cards can result in great financial loss. In this phase, you will practice your Python wrangling skills, specifically the correlation, classification and clustering parts, with a modified version of a default credit cards dataset available at the UCI Machine Learning Repository. The dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005.

In this phase, you will be working with the "UCI_Credit_Card_Modified.csv" dataset. It has 200 records for credit card users, each described by 25 variables.

Libraries to use are Pandas, Matplotlib, NumPy, SciPy, seaborn and sklearn. You will need to write Python 3 code (Jupyter notebook) and work with the topics discussed in workshops weeks 6-8. If you are using other packages, you must provide an explanation in your code about why it is necessary.

Dataset Content

There are 25 variables (1 ID, 1 label, 23 features/attributes):

ID: ID of each client
limit_bal: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
is_male: Binary: Gender (1=male, 0=female)
education: Categorical: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
is_married: Binary: Marital status (1=married, 0=not married, i.e. single or others)
age: Age in years
pay_1: Numerical: Repayment status in September, 2005 (0=pay duly, 1=payment delay for one month, 2=payment delay for two months, ..., 8=payment delay for eight months, 9=payment delay for nine months and above)
pay_2: Numerical: Repayment status in August, 2005 (scale same as above)
pay_3: Numerical: Repayment status in July, 2005 (scale same as above)
pay_4: Numerical: Repayment status in June, 2005 (scale same as above)
pay_5: Numerical: Repayment status in May, 2005 (scale same as above)
pay_6: Numerical: Repayment status in April, 2005 (scale same as above)
bill_amt1: Numerical: Amount of bill statement in September, 2005 (NT dollar)
bill_amt2: Numerical: Amount of bill statement in August, 2005 (NT dollar)
bill_amt3: Numerical: Amount of bill statement in July, 2005 (NT dollar)
bill_amt4: Numerical: Amount of bill statement in June, 2005 (NT dollar)
bill_amt5: Numerical: Amount of bill statement in May, 2005 (NT dollar)
bill_amt6: Numerical: Amount of bill statement in April, 2005 (NT dollar)
pay_amt1: Numerical: Amount of previous payment in September, 2005 (NT dollar)
pay_amt2: Numerical: Amount of previous payment in August, 2005 (NT dollar)
pay_amt3: Numerical: Amount of previous payment in July, 2005 (NT dollar)
pay_amt4: Numerical: Amount of previous payment in June, 2005 (NT dollar)
pay_amt5: Numerical: Amount of previous payment in May, 2005 (NT dollar)
pay_amt6: Numerical: Amount of previous payment in April, 2005 (NT dollar)
label: Binary: Default payment (1=yes, 0=no)

Import Required Python Libraries and Load the Data

Please write here all the Python libraries you will be using. Also load the dataset (.csv) into a dataframe object.
In [ ]:
```python
# import ....
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist, squareform
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
```

Helper Functions

This section includes a few functions discussed in workshops weeks 6 and 7, which will be used in this assignment.

In [ ]:
```python
def VAT(R):
    """VAT algorithm: reorder a dissimilarity matrix so potential clusters
    appear as dark blocks along the diagonal.
    Returns the reordered matrix RV, connection indices C and the ordering I."""
    R = np.array(R)
    N, M = R.shape
    if N != M:
        R = squareform(pdist(R))   # raw features given: build a dissimilarity matrix

    J = list(range(0, N))
    y = np.max(R, axis=0)
    i = np.argmax(R, axis=0)
    j = np.argmax(y)
    y = np.max(y)

    I = i[j]
    del J[I]

    y = np.min(R[I, J], axis=0)
    j = np.argmin(R[I, J], axis=0)

    I = [I, J[j]]
    J = [e for e in J if e != J[j]]

    C = [1, 1]
    for r in range(2, N - 1):
        y = np.min(R[I, :][:, J], axis=0)
        i = np.argmin(R[I, :][:, J], axis=0)
        j = np.argmin(y)
        y = np.min(y)
        I.extend([J[j]])
        J = [e for e in J if e != J[j]]
        C.extend([i[j]])

    y = np.min(R[I, :][:, J], axis=0)
    i = np.argmin(R[I, :][:, J], axis=0)

    I.extend(J)
    C.extend(i)

    RI = list(range(N))
    for idx, val in enumerate(I):
        RI[val] = idx

    RV = R[I, :][:, I]
    return RV.tolist(), C, I


def my_entropy(probs):
    # Shannon entropy of a probability vector
    return -probs.dot(np.log2(probs))


def mutual_info(X, Y):
    df = pd.DataFrame.from_dict({"X": X, "Y": Y})
    Hx = my_entropy(df.iloc[:, 0].value_counts(normalize=True, sort=False))
    Hy = my_entropy(df.iloc[:, 1].value_counts(normalize=True, sort=False))
    counts = df.groupby(["X", "Y"]).size()
    probs = counts / counts.values.sum()
    H_xy = my_entropy(probs)
    # Mutual information, normalised by the smaller of the two entropies
    I_xy = Hx + Hy - H_xy
    NMI = I_xy / min(Hx, Hy)
    return NMI
```

In [ ]:
```python
default_credit_card_df = pd.read_csv("UCI_Credit_Card_Modified.csv", index_col="ID")
default_credit_card_df.head(3)
```

1 Data Preparation and Dimension Reduction (5 Marks)

1.1 Categorical Features:

In this assignment, you will use the dataset to perform clustering and classification methods. In this regard, using categorical features might lead to inaccurate evaluation results, so categorical features should be converted to numerical ones first. The provided default credit card dataset contains one categorical attribute, "education" (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown). Write code to replace the "education" column with three numerical columns of type integer, with the names "graduate_school", "university" and "high_school". The three numerical columns will have the value 0 or 1, where each categorical value is converted to a binary vector as follows: (1 Mark)

[Table mapping each education value to its binary vector over (graduate_school, university, high_school) omitted in this copy.]

The resulting dataframe should include 26 columns in the following order:

[Table listing the required 26-column order omitted in this copy.]

Note that the "ID" column should be the dataframe index. The output of this step should print the first two rows of the preprocessed dataframe.

In [ ]:
```python
### answer Q1.1
```

1.2 Feature Scaling:

The second step in the data preparation is feature scaling/transformation. You will use the sklearn.preprocessing.StandardScaler() function to normalise each attribute separately, to have 0 mean and unit variance. To do so, implement the following steps: (1 Mark)

- Create a features matrix "X" which contains all columns in your dataframe except the "label" column.
- Create a vector "y" which contains the values in the column "label" in your dataframe.
- Use the StandardScaler() function to normalise the features matrix "X". Store the transformed features in the "X_scaled" matrix.

The output of this question should print the shape, min, max, average and standard deviation of the "X_scaled" matrix in the following format:

*** Q1.2: X_scaled matrix details
Shape: #
Min: #
Max: #
Average: #
Standard Deviation: #
***

where # is the vector of calculated values rounded to 4 decimal places.

In [ ]:
```python
### answer Q1.2
```
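A minimal sketch of one way to approach Q1.1, assuming the default_credit_card_df loaded above. pandas' get_dummies would work equally well; the final column reordering is left out because the required-order table is not reproduced in this copy.

```python
# Sketch only: replace "education" with three 0/1 integer columns.
df = default_credit_card_df.copy()
df["graduate_school"] = (df["education"] == 1).astype(int)   # 1 = graduate school
df["university"]      = (df["education"] == 2).astype(int)   # 2 = university
df["high_school"]     = (df["education"] == 3).astype(int)   # 3 = high school
df = df.drop(columns=["education"])                          # values 4-6 map to (0, 0, 0)
print(df.head(2))                                            # first two rows, as required
```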
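A minimal sketch for Q1.2 along the same lines, assuming the preprocessed frame from Q1.1 has been stored back into default_credit_card_df and the imports cell above has been run.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = default_credit_card_df.drop(columns=["label"])   # 25 feature columns
y = default_credit_card_df["label"]

X_scaled = StandardScaler().fit_transform(X)         # zero mean, unit variance per column

print("*** Q1.2: X_scaled matrix details")
print("Shape:", X_scaled.shape)
print("Min:", np.round(X_scaled.min(axis=0), 4))
print("Max:", np.round(X_scaled.max(axis=0), 4))
print("Average:", np.round(X_scaled.mean(axis=0), 4))
print("Standard Deviation:", np.round(X_scaled.std(axis=0), 4))  # ddof=0, matches the scaler
print("***")
```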
1.3 Dimension Reduction: PCA

It is hard to visualise default clients versus non-default clients since each client has 25 attributes. Write code to reduce the number of dimensions of the "X" matrix to two by using PCA. Store the result in the "X_reduced" matrix. Then use the "X_reduced" matrix to display a scatter plot. The x-axis should contain the first principal component and the y-axis should contain the second principal component. Default clients should be coloured red, while non-default clients should be coloured blue. (3 Marks)

Then implement the same steps with the "X_scaled" data matrix instead of the "X" matrix. Now you should have two plots. Assign the titles "Dimension reduction without feature scaling" and "Dimension reduction with feature scaling" to the plots respectively. The final output of this question should be one figure with the two required plots.

Based on the visualised plots, which features data matrix is better for PCA, "X_scaled" or "X"? Why?

Do you think using PCA is a good idea for visualising the default credit card clients? Yes/No, why?

In [ ]:
```python
### answer Q1.3
### answer Q1.3 justification
```

2 Clustering and Clustering Visualisation (5 Marks)

In this section you will perform some of the clustering and clustering visualisation techniques.

2.1 Hierarchical Clustering:

Write code to plot the dendrogram using the "X_scaled" matrix for each of the linkage methods COMPLETE and SINGLE. You should also use Euclidean distance to calculate the dissimilarity matrix. (2 Marks)

The output of this question should be one figure with two dendrogram sub-plots. The first sub-plot should have the title "Agglomerative clustering with complete linkage method", while the second sub-plot should have the title "Agglomerative clustering with single linkage method".

Following this, you need to look at the y-axis range of the generated dendrograms and justify the different y-axis range/scale of each dendrogram.

In [ ]:
```python
### answer Q2.1
### answer Q2.1 justification
```

2.2 Clustering Visualisation:

In this question, you will plot the heatmap for both the dissimilarity matrix and the ordered dissimilarity matrix. First, calculate the "Euclidean distance" dissimilarity matrix for the "X_scaled" data matrix. Next, use the VAT() function discussed in workshop week 6 to get the ordered dissimilarity matrix for the "X_scaled" data matrix. Finally, plot the heatmap for the "Euclidean" dissimilarity matrix and the ordered one (i.e. the VAT heatmap). (3 Marks)

The final output of this question should be one figure with two sub-plots. Use appropriate titles and x/y-labels for each sub-plot.

Is there any relation between the ordered dissimilarity matrix (VAT heatmap) and the complete linkage dendrogram plots? If yes, describe one relation between both plots.

Knowing that the dataset contains two classes (default and non-default clients), does the VAT heatmap give an accurate number of clusters, Yes/No? Why?

In [ ]:
```python
### answer Q2.2
### answer Q2.2 justification
```
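For Q1.3, a sketch of the two-panel figure, assuming X, X_scaled and y from Q1.2; the figure size is an assumption, and the colouring follows the spec (red for default, blue for non-default).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

X_reduced = PCA(n_components=2).fit_transform(X)                 # PCA on raw features
X_scaled_reduced = PCA(n_components=2).fit_transform(X_scaled)   # PCA on scaled features

colours = np.where(y == 1, "red", "blue")   # default clients red, non-default blue
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, data, title in [
        (axes[0], X_reduced, "Dimension reduction without feature scaling"),
        (axes[1], X_scaled_reduced, "Dimension reduction with feature scaling")]:
    ax.scatter(data[:, 0], data[:, 1], c=colours)
    ax.set_title(title)
    ax.set_xlabel("First principal component")
    ax.set_ylabel("Second principal component")
plt.show()
```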
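For Q2.1, a sketch using SciPy's linkage/dendrogram on X_scaled; the stacked layout and figure size are assumptions, not requirements.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

fig, axes = plt.subplots(2, 1, figsize=(12, 8))
for ax, method, title in [
        (axes[0], "complete", "Agglomerative clustering with complete linkage method"),
        (axes[1], "single", "Agglomerative clustering with single linkage method")]:
    Z = linkage(X_scaled, method=method, metric="euclidean")  # Euclidean dissimilarity
    dendrogram(Z, ax=ax, no_labels=True)  # 200 leaves: suppress unreadable leaf labels
    ax.set_title(title)
plt.tight_layout()
plt.show()
```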
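For Q2.2, a sketch that builds the Euclidean dissimilarity matrix, orders it with the provided VAT() helper, and draws both heatmaps with seaborn; titles and labels here are only placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.spatial.distance import pdist, squareform

D = squareform(pdist(X_scaled, metric="euclidean"))  # raw dissimilarity matrix
RV, C, I = VAT(D)                                    # VAT-ordered dissimilarity matrix

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.heatmap(D, ax=axes[0], xticklabels=False, yticklabels=False)
axes[0].set_title("Euclidean dissimilarity matrix")
sns.heatmap(np.array(RV), ax=axes[1], xticklabels=False, yticklabels=False)
axes[1].set_title("VAT ordered dissimilarity matrix")
for ax in axes:
    ax.set_xlabel("Clients")
    ax.set_ylabel("Clients")
plt.show()
```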
3 Correlation and Mutual Information (7 Marks)

In this section you will investigate the correlation and mutual information between the credit card dataset attributes and the class label.

3.1 Pearson Correlation:

Write code to calculate a correlation matrix of size 25 x 25 (25 is the number of attributes) using the "X_scaled" data matrix. The calculated square symmetric matrix will contain the correlation between every pair of attributes: the value r_ij in row i and column j should contain the Pearson correlation between attributes i and j. Then plot the heatmap for the calculated correlation matrix. (3 Marks)

The final output for this question should look like the example heatmap plot below. Note that the values in that example are randomly generated, and you should get different colours/values for your visualised heatmap.

[Example heatmap of randomly generated r_ij values omitted in this copy.]

From your visualised heatmap, report three different/interesting findings/explanations regarding the calculated correlation between pairs of the 25 attributes. For example, you might focus on the highest and/or lowest r values, and justify/explain whether or not you expected such high or low correlation between these specific attributes.

In [ ]:
```python
### answer Q3.1
### answer Q3.1 justification
```

3.2 Mutual Information:

In this question, you will use the provided mutual_info() and my_entropy() functions discussed in workshop week 7 to calculate the mutual information between different attributes and the class label. Since discretisation is an important preprocessing step for MI calculations, you will be using the "X" data matrix (i.e. without scaling) in this question. By looking at the values of the attributes in the "X" data matrix, they can be grouped into 11 discrete and 14 numerical attributes. (4 Marks)

- Discrete columns are [is_male, graduate_school, university, high_school, is_married, pay_1, pay_2, pay_3, pay_4, pay_5, pay_6], and
- Numerical columns are [limit_bal, age, bill_amt1, bill_amt2, bill_amt3, bill_amt4, bill_amt5, bill_amt6, pay_amt1, pay_amt2, pay_amt3, pay_amt4, pay_amt5, pay_amt6]

First, you need to discretise each of the numerical attributes using the 4-bin equal-width technique. The output for this step should have the following format:

################# limit_bal #################
bin# 1: range [a,b)
bin# 2: range [a,b)
bin# 3: range [a,b)
bin# 4: range [a,b)
...
################# pay_amt6 #################
bin# 1: range [a,b)
bin# 2: range [a,b)
bin# 3: range [a,b)
bin# 4: range [a,b)

where a and b are the calculated range for each bin.

Next, you will calculate the normalised mutual information (NMI) between each of the columns (i.e. 25 in total) and the class label "y". Then display a bar plot of the calculated NMI values. The format of the bar plot should be similar to the example below. Note that the NMI values in that example are randomly generated, and you should get different bars/values for your bar plot.

[Example NMI bar plot with randomly generated values omitted in this copy.]

In [ ]:
```python
### answer Q3.2
```

4 Classification (7 Marks)

In this section you will create a model to predict credit card default. This will enable you to answer the question: who is going to default on their credit card payments next month? Functions used in this section are covered in the workshop week 8 materials.

4.1 Train-Test Split:

To evaluate the performance of any model/classifier, you should test it on an unseen set of instances. Therefore, as preparation for the model training, you will use train_test_split() from sklearn.model_selection to split your features matrix "X" into training and testing sets. Specifically, you will use the train_test_split() function to randomly select 80% of the instances for training and the rest for testing. (1 Mark)

The output of the function should contain two matrices, X_train and X_test, and two vectors, y_train and y_test.

- X_train contains the features of the training instances (i.e. 80%) and y_train contains the labels for the training instances.
- X_test contains the features of the test instances (i.e. 20%) and y_test contains the labels for the test instances.

The output of this question should print the shape of the X_train and X_test matrices as well as the shape of the y_train and y_test vectors, in the following format:

*** Q4.1: Train Test Split Results
X_train matrix: #
y_train labels: #
X_test matrix: #
y_test labels: #
***

where # is the shape values.

In [ ]:
```python
### answer Q4.1
```
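For Q3.1, a sketch computing the 25 x 25 Pearson matrix with NumPy and plotting it with seaborn; the colour map is an assumption.

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

feature_names = list(X.columns)                # 25 attribute names from Q1.2
corr = np.corrcoef(X_scaled, rowvar=False)     # 25 x 25 matrix of r_ij values

plt.figure(figsize=(10, 8))
sns.heatmap(corr, xticklabels=feature_names, yticklabels=feature_names,
            cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Pearson correlation between attributes")
plt.show()
```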
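For Q3.2, a sketch of the discretisation and NMI bar plot, using pandas.cut for the 4 equal-width bins and the provided mutual_info() helper. Note that pd.cut's default bin edges are right-inclusive, i.e. (a, b], so matching the required [a,b) printout exactly may need adjustment.

```python
import pandas as pd
import matplotlib.pyplot as plt

numerical = ["limit_bal", "age"] + \
            [f"bill_amt{i}" for i in range(1, 7)] + \
            [f"pay_amt{i}" for i in range(1, 7)]

X_discrete = X.copy()
for col in numerical:
    # 4 equal-width bins; labels=False keeps integer bin codes, retbins returns edges
    X_discrete[col], bins = pd.cut(X[col], bins=4, retbins=True, labels=False)
    print(f"################# {col} #################")
    for b in range(4):
        print(f"bin# {b + 1}: range [{bins[b]:.2f},{bins[b + 1]:.2f})")

# NMI between each of the 25 columns and the class label
nmi = {col: mutual_info(X_discrete[col].values, y.values) for col in X_discrete.columns}
plt.figure(figsize=(10, 4))
plt.bar(range(len(nmi)), list(nmi.values()))
plt.xticks(range(len(nmi)), list(nmi.keys()), rotation=90)
plt.ylabel("NMI with class label")
plt.title("Normalised mutual information between each attribute and the label")
plt.tight_layout()
plt.show()
```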
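For Q4.1, a sketch of the 80/20 split; a random_state argument would make the split reproducible but is not required by the spec.

```python
from sklearn.model_selection import train_test_split

# 80% training, 20% testing, selected at random
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print("*** Q4.1: Train Test Split Results")
print("X_train matrix:", X_train.shape)
print("y_train labels:", y_train.shape)
print("X_test matrix:", X_test.shape)
print("y_test labels:", y_test.shape)
print("***")
```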
4.2 Feature Scaling:

Another preprocessing step before training any classifier is normalising the data features, because features with a large scale might bias the trained classifier. Write code to transform each feature (i.e. column) in both the training and testing sets (i.e. the X_train and X_test matrices) to have 0 mean and unit variance. Each transformed matrix should be stored under the same name, i.e. X_train and X_test. (1 Mark)

Is it a good idea to use the transformed matrix "X_scaled" as input to the split function, so that you don't have to scale/transform X_train and X_test again, Yes/No? Why?

In [ ]:
```python
### answer Q4.2
### answer justification 4.2
```

4.3 K-nearest Neighbor Classifier:

In this question, you will build a K-NN classifier using the built-in functions from the sklearn package. To do so, implement the following steps: (3 Marks)

- Create an instance of the model/estimator (i.e. KNeighborsClassifier) and set K (the number of neighbors) to 3.
- Train/fit the model using X_train and y_train.
- Evaluate the model by calculating the accuracy of the model on the test set (i.e. X_test).

The output of this question should be in the following format:

***Q4.3: Default credit card user prediction using K-NN
Test accuracy: # %
***

where # is the calculated classifier accuracy rounded to 2 decimal places.

As you might know, choosing the value of k (the number of neighbors) used for training the K-NN model is very important. Using the provided credit card clients dataset, what is the best value(s) for k? Justify your choice by providing extra code, a visualisation plot and/or an explanation.

In [ ]:
```python
### answer Q4.3
### answer justification 4.3
```

4.4 Decision Tree Classifier:

In this question, you will build a decision tree classifier by following the same steps as in the previous question, except that you will use "DecisionTreeClassifier" instead of "KNeighborsClassifier". You should use the default values for all DT parameters. (2 Marks)

The output of this question should be in the following format:

***Q4.4: Default credit card user prediction using DT
Test accuracy: # %
***

where # is the calculated classifier accuracy rounded to 2 decimal places.

Do you expect the decision tree classifier to perform better than K-NN, Yes/No? Why? Do the resulting accuracies follow your expectations? If not, justify why this has happened. You might add extra code, visualisations and/or explanation to support your answer.

In [ ]:
```python
### answer Q4.4
### answer Q4.4 justification
```
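For Q4.2, one common pattern, an assumption here rather than something the spec mandates, is to fit the scaler on the training set only and reuse its statistics on the test set; this avoids leaking test-set information and also hints at the answer to the justification question above.

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)   # statistics come from the training set alone
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)        # reuse training mean/variance on the test set
```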
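For Q4.3, a sketch of the 3-NN classifier, plus a simple loop over candidate k values of the kind that could support the justification; the range of k tried is an arbitrary choice.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
acc = accuracy_score(y_test, knn.predict(X_test))
print("***Q4.3: Default credit card user prediction using K-NN")
print(f"Test accuracy: {acc * 100:.2f} %")
print("***")

# One way to justify a choice of k: compare test accuracy over a range of values
for k in range(1, 21):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, round(accuracy_score(y_test, model.predict(X_test)), 4))
```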
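For Q4.4, the same pattern with a default-parameter decision tree.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

dt = DecisionTreeClassifier()   # all parameters left at their defaults, as required
dt.fit(X_train, y_train)
acc = accuracy_score(y_test, dt.predict(X_test))
print("***Q4.4: Default credit card user prediction using DT")
print(f"Test accuracy: {acc * 100:.2f} %")
print("***")
```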
Marking scheme

Correctness (24 marks): For each of the 4 questions, a mark will be allocated for the level of correctness (does it provide the right answer, is the logic right), according to the number in parentheses next to each question. Correctness will also take into account the readability and labelling provided for any plots and figures (plots should include the title of the plot, labels/scale on axes, names of axes, and legends for colours where appropriate).

Coding style (1 Mark): A mark will be allocated for coding style. In particular, the following aspects will be considered:

- Formatting of code (e.g. use of indentation and overall readability for a human)
- Code modularity and flexibility. Use of functions or loops where appropriate, to avoid redundant or excessively verbose definitions of code.
- Use of Python library functions (you should avoid reinventing logic if a library function can be used instead)
- Code commenting and clarity of logic. You should provide comments about the logic of your code for each question, so that it can be easily understood by the marker.

The final mark of the assignment will be scaled from 25 to 15 using the following formula:

your_final_mark = (your_mark_out_of_25 / 25) × 15

Submission Instructions

Via the LMS, submit a Jupyter notebook containing the code. Make sure you get a submission receipt via email. If you didn't get a receipt via email, this means we didn't receive your submission and it will be considered a late submission.

Other

Extensions and Late Submission Penalties: If requesting an extension due to illness, please submit a medical certificate to the lecturer. If there are any other exceptional circumstances, please contact the lecturer with plenty of notice. Late submissions without an approved extension will attract a penalty of 10% of the marks available per 24hr period (or part thereof) that it is late. E.g. a late submission will be penalised 1.5 marks if 4 hours late, 3 marks if 28 hours late, 4.5 marks if 50 hours late, 6 marks if 73 hours late, 7.5 marks if 106 hours late, etc.

Phase 2 is expected to require 15-18 hours of work.

Academic Honesty

You are expected to follow the academic honesty guidelines on the University website: https://academichonesty.unimelb.edu.au

Further Information

A project discussion forum has also been created on the subject LMS. Please use this in the first instance if you have questions, since it will allow discussion and responses to be seen by everyone. The Phase 1 project page will also contain a list of frequently asked questions.

Acknowledgements

The dataset used in this assignment was sampled from the original dataset available at the UCI Machine Learning Repository; further, a few changes have been applied to some attributes.

Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

The original dataset can also be found on Kaggle: https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset/home
