COMP534 Lab session
21/02/2022
This is an example you can use to work through loading, fixing, plotting, and predicting
on some data. The instructions may not be exact, and the snippets of code may have
small errors – be prepared to search for what seems to be missing.
I have suggested using PyCharm as a python development environment, but feel free to use
anything else you are more familiar with. I also suggest conda as a package and environment
manager, but it is not the only option.
Getting started with data
Install PyCharm and Miniconda (or Anaconda)
Create a new python project - call it COMP534_1
For Project interpreter select: new conda environment
and tick 'Make available to all projects'
Conda is a package management system for python (Anaconda and Miniconda are interchangeable here).
It allows you to easily install and manage many different packages.
It also provides a way of managing environments. An 'environment' is a distinct set of python
versions and libraries, as sometimes you need to switch between different sets of libraries, or even
different python versions.
You first need to install some packages – so open a terminal window in Pycharm
For lab PCs – the conda setup is complicated due to file permissions, so make the following changes:
PyCharm can be installed from 'Install University Applications' on the desktop
For Project interpreter select: new virtual environment
Instead of using conda to install packages, use pip
o i.e., pip install scikit-learn
o pip install seaborn
Note that the terminal prompt says (COMP534_1) in brackets –
this tells you that the virtual environment COMP534_1 is currently selected.
Some of the common conda virtual environment commands are:-
conda create -n name
conda activate ...
conda activate base
And for managing packages…
conda list
conda install ...
For now, you should just need…
conda install scikit-learn
conda install seaborn
Note that things like matplotlib and pandas are installed automatically as dependencies.
Iris dataset
Add a new python file (e.g. first.py) to your project (right-click on the project name in the Project
window), and add some code
from sklearn import datasets
iris = datasets.load_iris()
print(type(iris))
right-click on first.py, and click Run 'first'
With python, you can happily run from the python console, going one step at a time - but if you
might need to rerun your analysis, maybe with different parameters, then it becomes easier to store a
program, and re-run it whenever you have made changes.
(with pycharm, you can click the 'run' button, or press CTRL-F5 to re-run the last python file)
We will convert this to a pandas dataframe - we don't need to, but it keeps things more consistent
from sklearn import datasets
import pandas as pd
data = datasets.load_iris()
df = pd.DataFrame(data=data.data, columns=data.feature_names)
print(df.describe())
As we are running from inside a program - we will want to 'print' things to see them. If you are
running from a python console then you will see the results of each command anyway - so you only
need df.describe()
now we will add a plot - we will include matplotlib as well as seaborn, as we will need some of the
lower level commands
import matplotlib.pyplot as plt
from sklearn import datasets
import pandas as pd
import seaborn as sns
data = datasets.load_iris()
df = pd.DataFrame(data=data.data, columns=data.feature_names)
sns.histplot(df)
plt.show()
We often call an example dataframe df - it's just a convention, which is convenient when looking at
other people's code
You can refer to columns by their name - which you can get from df.columns
hence df[df.columns[0]] is the first column. But you should be careful of doing this, in case
the order changes later on.
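For example, here is a small sketch (using the Iris dataframe loaded above) of a few equivalent ways of selecting a column:
# select a column by its name - safest, as it still works if the column order changes
petal = df['petal length (cm)']
# select the first column by position
first_col = df.iloc[:, 0]
# the same thing, via the list of column names
also_first = df[df.columns[0]]
print(petal.head())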
See here for more information about accessing and selecting rows and columns:-
https://www.shanelynn.ie/pand...
Seaborn has lots of different plots you can use, and there is loads of information at
https://seaborn.pydata.org/tu...
https://seaborn.pydata.org/tu...
You can create a new dataframe with only certain columns
df2 = df[['sepal length (cm)' ,'petal length (cm)' ]]
sns.histplot(df2)
In the histogram plot, we can see that sepals are usually longer than petals, but what else can
we find out from just this data?
sns.scatterplot(x=df['petal length (cm)'], y=df['sepal length (cm)'])
or
sns.scatterplot(data=df,x='petal length (cm)', y='sepal length (cm)')
Notice anything strange?
Petal length, not surprisingly is roughly related to sepal length - but there are at least two distinct
clusters of values, seemingly with different relationships.
print(df.columns) to see what columns we have included so far
but this dataset contains something else - a 'target'. In this case, it is a classification for each iris as
one of 3 species.
You can see it here:-
print(data['target'])
So we can copy that to the dataframe, and we now have an extra piece of information we can see....
df['target'] = data['target']
and plot with…
sns.scatterplot(data=df, x='petal length (cm)', y='sepal length (cm)', hue='target')
And now we can see why we have this separate cluster on the left - they are a different species of
Iris to the others.
You can see what they are called with
print(data.target_names)
What other plots can you generate for this data - can you think of anything that may actually be
useful?
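As one possible starting point, here is a sketch using seaborn's pairplot, which draws a scatterplot for every pair of columns (with histograms on the diagonal) and colours the points by species:
# shows at a glance which measurements separate the three species
sns.pairplot(df, hue='target')
plt.show()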
Predicting
And this is why we have a 'target' - we are going to see how well we can predict the species, just
based on the size of the petals and sepals. The 'target' column will come up frequently, sometimes
with different names, but this is typically the thing that we are interested in predicting. As a
reminder, for supervised learning, we have some 'training' data where we know the value of the
target (or at least have a reasonable guess), and we want to learn how to predict this value for new
data, where we don't know the real value.
In this case, it is the species of Iris, but it may be a huge variety of things in the real world - likelihood
of disease, the value of a hand-written number, the cost of a footballer, the ratio of peptide
ionisation, etc. etc.
One of the names for the 'target' is simply y. We call the rest of the data X, and the target y.
You will often see this in example code.
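As a small sketch of that convention, using the Iris dataframe with the 'target' column added above:
# X is everything we use to make the prediction, y is the value we want to predict
X = df.drop(['target'], axis=1)
y = df['target']
print(X.shape, y.shape)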
To run supervised learning, we also want to see how well it performs - so we split the data we have
into two: 'train' and 'test'. The classifier uses the 'train' dataset to learn how to predict the result.
Then we can give it the 'test' set to see if it really works!
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
We are just using the default parameters; you can set your own split sizes etc.
X_train, X_test, y_train, y_test = train_test_split(df, data['target'])
model = KNeighborsClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions==y_test)
And you can see which predictions it made correctly...
(Make sure that you don't accidentally include the 'target' in the training data - or it will be very
good at predicting only when it has been given the answer!)
Remove the df['target'] = data['target'], and try again
You may now get one or two wrong; this is normal, 100% accuracy only happens with quite simple
systems...
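If you would rather see a single number than scan the list of True/False values, scikit-learn classifiers have a score method that reports the fraction of correct predictions:
# fraction of the test samples predicted correctly
print(model.score(X_test, y_test))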
You can plot the predictions against the actual values
sns.scatterplot(x=y_test,y=predictions)
plt.show()
It's not very interesting for this data, but it will show you where predictions are going wrong...
Try using the different classifiers - see how easy it is to change, as all of the inputs are usually the
same.
(Try at least SVM, naive_bayes, DecisionTree)
Try the DecisionTreeClassifier with (max_depth = 1) and (max_depth = 10) what is the difference?
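Here is a minimal sketch of how little needs to change when swapping classifiers (the imports are the standard scikit-learn locations for these models):
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
# every classifier exposes the same fit / predict interface, so only the model line changes
for model in [SVC(), GaussianNB(), DecisionTreeClassifier(max_depth=1), DecisionTreeClassifier(max_depth=10)]:
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(model, '-', sum(predictions == y_test), 'of', len(y_test), 'correct')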
Each classifier may also have different parameters, which you can look into. But, for this data, as it is
so simple they are generally unlikely to make much difference.
Titanic Data
So, let's get a more complex dataset....
For the sake of simplicity, data is often converted into .csv files. These are very simple text files,
each line consists of one data record, and the values are just text, separated by commas. The first
line is usually a 'header', which gives you the column names.
You can open these in Excel, in Notepad, or even just view them from a command / terminal
prompt.
e.g. with the commands type file.csv in Windows or cat file.csv in Linux
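For example, a made-up file people.csv might look like this (names and values purely for illustration):
name,age,height
Alice,34,1.65
Bob,29,1.80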
Datasets can also come in more complex structured formats, such as JSON or XML, or even be stored in a
database.
This one is a dataset many people use as a form of competition - it gives passenger details from
those on board the Titanic when it sank. We want to see if it is possible to predict who survived,
based on their details.
https://raw.githubusercontent...
titanic/master/train.csv
(There is a separate test and train dataset, but we can just split the training dataset as we have done
before. Once you are finished with it, you could also get the test dataset and work out how to
incorporate that.)
df = pd.read_csv('train.csv')
print(df.columns)
print(df.describe())
Note that we can't describe columns which contain non-numerical data; for now we can just remove
them, but we will look at dealing with them better later on
df = df.drop(['name','sex','ticket','cabin','embarked'], axis=1)
We will also drop rows which are incomplete - they have a NA value somewhere. This isn't always
(or often) the best way to handle missing data
df = df.dropna()
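As a sketch of one alternative - filling the missing values instead of dropping the rows - you could replace a missing age with the median age (column name as used in this handout; check print(df.columns) for the exact spelling in your csv):
# keep the row, but replace a missing age with the median of the column
df['age'] = df['age'].fillna(df['age'].median())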
and just have a look at the data, to check we can plot it – it won’t really make much sense as the
columns have very different values.
sns.histplot(df)
What useful plots could you make instead?
So let's take the code from our last attempt at a decision tree classifier
And put the 'target' into a separate series variable
target = df['survived']
Don't forget to remove it from the training data!
df = df.drop(['survived'], axis=1)
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth = 1)
X_train, X_test, y_train, y_test = train_test_split(df, target)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions==y_test)
If you want to more easily see how good your prediction is ...
print(len(predictions), sum(predictions==y_test))
tells you how many predictions you made, and how many are correct - it should be possible to get
over 90% accuracy (but not without some more work).
This is a very basic statistic, there are many others that you can use…
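For example, a sketch using a couple of the standard functions from sklearn.metrics:
from sklearn.metrics import accuracy_score, confusion_matrix
# accuracy: the fraction of correct predictions
print(accuracy_score(y_test, predictions))
# confusion matrix: how often each actual class is predicted as each class
print(confusion_matrix(y_test, predictions))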
You may see that some of the predictors perform better than others. As the test/train split is
random, you will also get slightly different answers every time.
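If you want repeatable results while experimenting, one common sketch is to fix the random seed of the split with the random_state parameter, so that differences in accuracy come from your changes rather than from a different random split:
X_train, X_test, y_train, y_test = train_test_split(df, target, random_state=0)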
In order to improve performance, we are going to use some of the text data that we removed
earlier.
Where data is strings, we are just going to treat them as categories - i.e., the order doesn't mean
anything.
So, we will just use the LabelEncoder in scikit-learn
Insert the following code, and no longer drop the column marked 'sex'
from sklearn.preprocessing import LabelEncoder
df['sex'] = LabelEncoder().fit_transform(df['sex'])
This will just encode the sex as 0 or 1 (LabelEncoder assigns integer codes to the labels in alphabetical order, so female becomes 0 and male becomes 1).
Will this change the prediction?
Do you think you could encode the other values as numbers in a sensible way?
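One common approach for columns where a 0/1/2... ordering would be misleading is one-hot encoding, for example with pandas get_dummies - a sketch, assuming the lowercase 'embarked' column name used earlier (check print(df.columns) for the exact spelling):
# turns one categorical column into several 0/1 columns, one per category,
# so no artificial ordering is introduced
df = pd.get_dummies(df, columns=['embarked'])
print(df.columns)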
Regression
Everything we have looked at so far is Classification - i.e., there are a set number of possible
outcomes
(3 species of flowers, or survived / not survived the Titanic).
But we often want to work out more detail - e.g., what is the probability of..., what is the value of....,
and for that we use 'regression'
We can look at a California house price dataset, from
https://github.com/ageron/han...
And will re-use some of the things that we already learned how to do…
df=pd.read_csv("housing.csv")
df['ocean_proximity'] = LabelEncoder().fit_transform(df['ocean_proximity'])
print(df.describe())
again, we remove the incomplete rows with NA
df = df.dropna()
set the target value
target = df['median_house_value']
df = df.drop(['median_house_value'], axis=1)
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
X_train, X_test, y_train, y_test = train_test_split(df,target)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
sns.scatterplot(x=y_test,y=predictions)
plt.show()
What are some of the ways that you can view and evaluate the performance?
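As a starting point, here is a sketch using two standard regression metrics from sklearn.metrics:
from sklearn.metrics import mean_absolute_error, r2_score
# average size of the prediction error, in the same units as the house prices
print(mean_absolute_error(y_test, predictions))
# r squared: 1.0 is a perfect fit, 0.0 is no better than always predicting the mean
print(r2_score(y_test, predictions))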