AM 218机器学习

Homework 3
AM 218 Machine Learning
Due on April 16 at 11:59 PM on Canvas
Part I: Soft-margin SVM (10 pts)
In order to deal with non-separable cases, we introduce the slack variables ξi
, i = 1, 2 . . . , m allowing
a few data points to be misclassified or inside the margin.
min
w,b,ξ
1
2
||w||2 + C
Xm
i=1
ξi
s.t. yi(w>x + b) ≥ 1 − ξi
, i = 1, 2, . . . , m,
ξi ≥ 0, i = 1, 2, . . . , m.
Similar to the linear-separable case, we solve the above optimization problem by applying
Lagrangian method to derive the dual problem.
(1) Could you derive the dual problem for the SVM with slack variables using the Lagrangian
method? (5 pts)
(2) Let α∗ denote the solution to the dual problem in (1), could you derive the solution to the
primal problem? In other words, give the expression of the optimal solution w∗ and b
∗ using
α∗
. (5 pts)
Part II: Implementation of Support Vector Machine: Ham or Spam
Email has been one of the most important forms of communication. Spam is one of the major
threats posed to email users. Links in spam emails may lead to users websites with malware or
phishing schemes. Therefore, an effective spam filtering technology is a significant contribution to
the sustainability of the cyberspace and to our society. Our client company provides a collection of
spam emails as well as regular work and personal emails that came from its employee. The client
company asked us to design an effective spam-filtering system. In order to increase the effectiveness
of spam detection, we plan to analyze the content of the email to determine if the email is spam.
1
After pre-processing the content of the emails, we have extracted 57 features from those emails.
Your task now is to build a spam filter to predict whether an email is a spam or not.
Data
In the dataset (see spambase.csv) provided by the client company, the first 48 columns indicate
the feature word freq WORD, the percentage of words in the email that match WORD (number
of times the WORD appears in the email/total number of words in email * 100). The following
6 columns indicates the feature char freq CHAR, the percentage of characters in the email that
match CHAR. The next column indicates the feature capital run length longest, the length of
longest uninterrupted sequence of capital letters. The last feature is capital run length total, the
total number of capital letters in the email. The last three features have been standardized for
improving the efficiency of training process of the models. The last column indicates whether the
email was considered as spam (1) or not (0). There are in total 4600 instances in this dataset.
Programming assignment
In this assignment, you will implement Support Vector Machine on the spambase dataset to predict
whether the email is spam or not using Python. You can either use the functions from libraries
such as scikit-learn or write the code from scratch to implement the algorithm. After you run the
sample code (see, “svm demo.py”) successfully (this step is not required, but recommended), you
should apply the Support Vector Machine following the instructions as below. In this assignment,
you should use linear kernels and set the penalty coefficient C to 1 if those parameters are not
specified.

  1. Learn and apply the algorithm (20pts)
    a. For each W in [7, 17, 27, 37, 47, 57]
    i. Use the first W features and randomly sample 70% of the data for training and use
    the remaining 30% of the data for testing
    iii. Learn a model from the training data and apply it to test data
    iv. Measure the performance on both the training and test data using accuracy
    2
    v. Repeat 100 times with different random samples (using the same W)
    vi. Record and report the average and the standard deviation of accuracy on both
    training and test sets across the 100 trials
    b. Plot curves for the results (the number of features vs avg., the number of features vs
    std. dev.) for both training and test data. Discuss the results.
  2. Explore the effect of the penalty coefficient C (25pts)
    a. Randomly sample 80% of the data for training and use the remaining 20% of data for
    testing.
    b. For each C in [0.001, 0.01, 0.05, 0.1, 0.5, 1, 10]
    i. Use 10-fold cross validation on the training set to evaluate the accuracy of the SVM
    with the penalty coefficient C.
    ii. Record and report the average and the standard deviation of accuracy on 10 validation
    sets.
    iii. Record and report the average margin of the SVM that you trained.
    c. Plot curves for the results (the penalty coefficient vs the avg. accuracy, the penalty
    coefficient vs the std. dev. of accuracy, the penalty coefficient vs the avg. margin).
    Discuss the results.
    d. Choose a penalty coefficient C and give the reason why you choose this penalty coeffi-
    cient. Learn the SVM model on the whole training set with the chosen penalty coefficient
    C and report the accuracy on the testing set.
  3. Comparison with Logistic Regression (15pts)
    a. Evaluate the accuracy of SVM and Logisitc Regression using 10-fold cross validation on
    the whole dataset.
    b. Record and report the average and the standard deviation of the accuracy of SVM and
    Logistic Regression on the validation sets.
    c. Record and report the average precision and average recall of SVM and Logistic Regression
    on validation sets.
    3
    d. Is one classifier better than the other based on the performance metric accuracy? Use
    the paired t-test to justify your conclusion.
    Hint
  4. For part 1, you may use sklearn.svm.SVC() to train the model.
  5. For part 2, you may use sklearn.model selection.KFold to implementation k-fold cross
    validation.
  6. For part 3, you may use sklearn.metrics.recall score and sklearn.metrics.precision score
    to evaluate your models.
    Submission
  7. The source code in python.
    Name your file as “svm.py”. You should include the code that you write for the above assignments.
    In addition, you should prepare the code such that TA can run without additional
    modifications and the output should report the results for the question (b) in the part 2 like
    this:
    ✩python svm.py
    The average accuracy with different penalty coefficients:
    [0.111 0.333 0.777 0.212 0.676 0.216 0.313 ]
    The std. dev. of accuracy with different penalty coefficients:
    [0.013 0.013 0.012 0.011 0.012 0.222 0.213]
    The average margin with different penalty coefficients:
    [1.432 2.212 0.266 3.154 1.089 1.212 3.122]
  8. Your evaluation & analysis in .pdf format.
    Note that your analysis should include the graphs as well as a discussion of results.

你可能感兴趣的:(机器学习)