Homework 2 - Security Analytics - Decision Tree

1. Use a decision tree classifier (default) to train and test the malicious URL dataset. (2pt)

  • link: https://www.kaggle.com/akashkr/phishing-url-eda-and-modelling
  • Use 30% for training and 70% for testing
  • What is the accuracy score?

2. Explore how the tree depth number can affect the accuracy score (3 pt)

  • Make a loop to set max_depth to 10, 20, 30…until 100, and observe the accuracy score change
  • What’s the tree depth number that can get the best accuracy score?

3. Use random forest to repeat step 2. What’s your observation? (3 pt)

4. Try both the decision tree and random forests with the credit card fraud dataset. What’s your observation on accuracy score change? (2 pt)

creditcard.csv
dataset.csv
hw2_Kyle Wang.ipynb

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, roc_auc_score

urls = pd.read_csv("dataset.csv")

X = urls.iloc[:, 1:30]
y = urls['Result']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=1) # 30% training and 70% test 
clf = DecisionTreeClassifier()   # default parameters
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 0.9484429512856958
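The imports at the top also bring in confusion_matrix and classification_report, which are never used below. A minimal sketch of how they could be applied to the same predictions (reusing y_test and y_pred from the cell above):

# Sketch: look beyond plain accuracy for the default tree.
print(confusion_matrix(y_test, y_pred))       # rows = true class, columns = predicted class
print(classification_report(y_test, y_pred))  # per-class precision, recall, F1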

# Loop max_depth over 10, 20, 30, ... 100 and observe the accuracy change
for depth in range(10, 101, 10):
    print("max_depth:", depth)
    clf = DecisionTreeClassifier(max_depth=depth)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))

max_depth: 10
Accuracy: 0.9381056984106474
max_depth: 20
Accuracy: 0.9488305982685101
max_depth: 30
Accuracy: 0.9477968729810053
max_depth: 40
Accuracy: 0.9505104018607056
max_depth: 50
Accuracy: 0.9477968729810053
max_depth: 60
Accuracy: 0.9481845199638196
max_depth: 70
Accuracy: 0.9489598139294483
max_depth: 80
Accuracy: 0.9461170693888099
max_depth: 90
Accuracy: 0.9497351078950769
max_depth: 100
Accuracy: 0.9490890295903863

With max_depth = 10 the accuracy is consistently the lowest; as the depth grows, the accuracy rises at first and then levels off.
Across repeated runs, max_depth = 60 most often gives the best accuracy score (in the run shown above, max_depth = 40 happens to score highest).
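A small sketch (not part of the graded run) that collects the scores and reports the best depth programmatically instead of reading it off the printout; it reuses X_train, X_test, y_train, y_test from above and fixes random_state so the comparison is repeatable:

# Sketch: record each depth's accuracy and pick the maximum.
depths = list(range(10, 101, 10))
scores = []
for depth in depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1)
    tree.fit(X_train, y_train)
    scores.append(accuracy_score(y_test, tree.predict(X_test)))
best_score, best_depth = max(zip(scores, depths))
print("Best accuracy %.4f at max_depth=%d" % (best_score, best_depth))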


from sklearn.ensemble import RandomForestClassifier
# Same loop as above, now with a random forest: max_depth = 10, 20, ... 100
for depth in range(10, 101, 10):
    print(depth)
    clf = RandomForestClassifier(max_depth=depth)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))

10
Accuracy: 0.9483137356247577
20
Accuracy: 0.9594262824654348
30
Accuracy: 0.9591678511435586
40
Accuracy: 0.9598139294482492
50
Accuracy: 0.9589094198216824
60
Accuracy: 0.960847654735754
70
Accuracy: 0.9605892234138778
80
Accuracy: 0.960847654735754
90
Accuracy: 0.9603307920920016
100
Accuracy: 0.9592970668044967

With the random forest, the overall accuracy improves slightly, while the overall trend across depths is unchanged: it rises at first and then levels off.
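Both classifiers are randomized (feature tie-breaking for the tree, bootstrap sampling for the forest), so individual numbers shift a little between runs. A sketch, assuming the same X_train/X_test split as above, that averages a few seeds per depth to make the random-forest comparison steadier:

import numpy as np

# Sketch: average several seeds per depth to smooth out run-to-run noise.
for depth in range(10, 101, 10):
    accs = [
        accuracy_score(
            y_test,
            RandomForestClassifier(max_depth=depth, random_state=seed)
            .fit(X_train, y_train)
            .predict(X_test),
        )
        for seed in (0, 1, 2)
    ]
    print("max_depth=%3d  mean accuracy=%.4f" % (depth, np.mean(accs)))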


cards = pd.read_csv("creditcard.csv")

X = cards.iloc[:, 0:30]
y = cards['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=1) # 30% training and 70% test 
# Decision tree on the credit-card data: max_depth = 10, 20, ... 100
for depth in range(10, 101, 10):
    print("max_depth:", depth)
    clf = DecisionTreeClassifier(max_depth=depth)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))

max_depth: 10
Accuracy: 0.9993027863466506
max_depth: 20
Accuracy: 0.999117197100795
max_depth: 30
Accuracy: 0.999132244877486
max_depth: 40
Accuracy: 0.9991121811752314
max_depth: 50
Accuracy: 0.9990770696962857
max_depth: 60
Accuracy: 0.999102149324104
max_depth: 70
Accuracy: 0.999102149324104
max_depth: 80
Accuracy: 0.9991372608030497
max_depth: 90
Accuracy: 0.9990770696962857
max_depth: 100
Accuracy: 0.9990319263662127

With the DecisionTreeClassifier, the overall accuracy shows a slight downward trend as max_depth increases.
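creditcard.csv is heavily imbalanced (fraudulent transactions are a tiny fraction of the rows), so an accuracy near 0.999 mostly reflects the majority class. A sketch using the already-imported roc_auc_score and classification_report on the last fitted tree (clf and y_pred from the loop above), which says more about fraud detection than accuracy does:

# Sketch: check the class imbalance and use fraud-sensitive metrics.
print(cards['Class'].value_counts(normalize=True))    # fraction of fraud vs. normal rows
fraud_proba = clf.predict_proba(X_test)[:, 1]         # predicted probability of fraud
print("ROC AUC:", roc_auc_score(y_test, fraud_proba))
print(classification_report(y_test, y_pred, digits=4))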


# Random forest on the credit-card data: max_depth = 10, 20, ... 100
for depth in range(10, 101, 10):
    print("max_depth:", depth)
    clf = RandomForestClassifier(max_depth=depth)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))

max_depth: 10
Accuracy: 0.9994181526346149
max_depth: 20
Accuracy: 0.9994382163368696
max_depth: 30
Accuracy: 0.9994332004113059
max_depth: 40
Accuracy: 0.9994382163368696
max_depth: 50
Accuracy: 0.9994231685601785
max_depth: 60
Accuracy: 0.9994332004113059
max_depth: 70
Accuracy: 0.9994332004113059
max_depth: 80
Accuracy: 0.9994181526346149
max_depth: 90
Accuracy: 0.9994131367090512
max_depth: 100
Accuracy: 0.9994281844857422

With the RandomForestClassifier, the accuracy stays relatively stable across all max_depth values.
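A quick sketch of the baseline that explains both tables: predicting the majority (non-fraud) class for every row already reaches almost the same accuracy, which is why neither the tree depth nor the choice of classifier moves the score much on this dataset.

# Sketch: accuracy of a trivial "always predict the majority class" baseline
# on the same creditcard test split.
majority = y_train.value_counts().idxmax()
print("Majority-class baseline accuracy:",
      accuracy_score(y_test, [majority] * len(y_test)))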

