COMP534 Lab session 21/02/2022

This is an example you can use to work through loading, fixing, plotting, and predicting on some data. The instructions may not be exact, and the snippets of code may have small errors - be prepared to search for what seems to be missing.

I have suggested using PyCharm as a Python development environment, but feel free to use anything else you are more familiar with. I also suggest conda as a package and environment manager, but it is not the only option.

Getting started with data

• Install PyCharm and Miniconda (or Anaconda)
• Create a new Python project - call it COMP534_1
• For Project interpreter select: new conda environment
• And select Make available to all other projects

Conda is a package management system for Python (Anaconda and Miniconda are interchangeable for our purposes). It allows you to easily install and manage many different packages. It also provides a method of managing environments. An 'environment' is a self-contained Python version and set of libraries; sometimes you need to switch between different sets of libraries, or even different Python versions.

You first need to install some packages, so open a terminal window in PyCharm.

For Lab PCs the conda setup is complicated due to file permissions, so make the following changes:
• PyCharm can be installed from 'Install University Applications' on the desktop
• For Project interpreter select: new virtual environment
• Instead of using conda to install packages, use pip, i.e.
  o pip install scikit-learn
  o pip install seaborn

Note that the terminal prompt says (COMP534_1) in brackets: this tells you that the virtual environment COMP534_1 is currently selected.

Some of the common conda virtual environment commands are:
    conda create -n name
    conda activate ...
    conda activate base

And for managing packages:
    conda list
    conda install ...

For now, you should just need:
    conda install scikit-learn
    conda install seaborn

Note that things like matplotlib and pandas are installed automatically as dependencies.
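Once the packages are installed, it can be worth checking that the selected environment really does see them. The following is a minimal sketch using only the standard library. The helper name missing_packages and the list REQUIRED are made up for this illustration and are not part of the lab materials.

```python
from importlib.util import find_spec

# Import names of the packages this lab relies on. Note that scikit-learn
# is installed as "scikit-learn" but imported as "sklearn".
REQUIRED = ["sklearn", "seaborn", "pandas", "matplotlib"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Not installed in this environment:", ", ".join(missing))
    else:
        print("All lab packages are available.")
```

If anything is reported as missing, check that the terminal prompt shows the environment you expect before installing again.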
Iris dataset

Add a new Python file (e.g. first.py) to your project (right-click on the project name in the Project window), and add some code:

    from sklearn import datasets

    iris = datasets.load_iris()
    print(type(iris))

Right-click on first.py, and click Run 'first'.

With Python, you can happily run from the Python console, going one step at a time - but if you might need to rerun your analysis, maybe with different parameters, then it becomes easier to store a program, and re-run it whenever you have made changes. (With PyCharm, you can click the 'run' button, or press CTRL-F5 to re-run the last Python file.)

We will convert this to a pandas dataframe - we don't need to, but that keeps it more consistent:

    from sklearn import datasets
    import pandas as pd

    data = datasets.load_iris()
    df = pd.DataFrame(data=data.data, columns=data.feature_names)
    print(df.describe())

As we are running from inside a program, we will want to 'print' things to see them. If you are running from a Python console then you will see the results of each command anyway, so you only need

    df.describe()

Now we will add a plot - we will include matplotlib as well as seaborn, as we will need some of the lower level commands:

    import matplotlib.pyplot as plt
    from sklearn import datasets
    import pandas as pd
    import seaborn as sns

    data = datasets.load_iris()
    df = pd.DataFrame(data=data.data, columns=data.feature_names)
    sns.histplot(df)
    plt.show()

We often call an example dataframe df; it's just a convention, which is convenient when looking at other people's code.

You can refer to columns by their name, which you can get from df.columns; hence df[df.columns[0]] is the first column. But you should be careful of doing this, in case the order changes later on.
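To illustrate why positional access is fragile, here is a small sketch that selects the same column by position and by name, then reverses the column order. The reversal is only a stand-in for any later change to the dataframe; the variable names are chosen for this illustration.

```python
import pandas as pd
from sklearn import datasets

data = datasets.load_iris()
df = pd.DataFrame(data=data.data, columns=data.feature_names)

# Positional access: whichever column happens to be first.
first_by_position = df[df.columns[0]]

# Access by name: still correct if the columns are later reordered.
sepal_length = df["sepal length (cm)"]

# Reverse the column order to mimic a later change to the dataframe.
reordered = df[df.columns[::-1]]

# The first column is now a different measurement...
print(reordered.columns[0])
# ...but looking the column up by name still returns the same values.
print(reordered["sepal length (cm)"].equals(sepal_length))
```

After the reordering, reordered[reordered.columns[0]] silently returns petal widths, while the lookup by name is unaffected.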
See here for more information about accessing and selecting rows and columns:
https://www.shanelynn.ie/pandas-iloc-loc-select-rows-and-columns-dataframe/

Seaborn has lots of different plots you can use, and there is loads of information at
https://seaborn.pydata.org/tutorial
https://seaborn.pydata.org/tutorial/categorical.html

You can create a new dataframe with only certain columns:

    df2 = df[['sepal length (cm)', 'petal length (cm)']]
    sns.histplot(df2)

In the histogram plot, we can see that the sepals are usually longer than the petals, but what else can we find out from just this data?

    sns.scatterplot(x=df['petal length (cm)'], y=df['sepal length (cm)'])

or

    sns.scatterplot(data=df, x='petal length (cm)', y='sepal length (cm)')

Notice anything strange? Petal length, not surprisingly, is roughly related to sepal length - but there are at least two distinct clusters of values, seemingly with different relationships.

Use

    print(df.columns)

to see what columns we have included so far. But this dataset contains something else - a 'target'. In this case, it is a classification for each iris as one of 3 species. You can see it here:

    print(data['target'])

So we can copy that to the dataframe, and we now have an extra piece of information we can see:

    df['target'] = data['target']

and plot with

    sns.scatterplot(data=df, x='petal length (cm)', y='sepal length (cm)', hue='target')

And now we can see why we have this separate cluster on the left - they are a different species of Iris to the others. You can see what they are called with

    print(data.target_names)

What other plots can you generate for this data - can you think of anything that may actually be useful?

Predicting

And this is why we have a 'target' - we are going to see how well we can predict the species, just based on the size of the petals and sepals. The 'target' column will come up frequently, sometimes with different names, but this is typically the thing that we are interested in predicting.
As a reminder, for supervised learning, we have some 'training' data where we know the value of the target (or at least have a reasonable guess), and we want to learn how to predict this value for new data, where we don't know the real value. In this case, it is the species of Iris, but it may be a huge variety of things in the real world - likelihood of disease, the value of a hand-written number, the cost of a footballer, the ratio of peptide ionisation, etc.

One of the names for the 'target' is simply y. We call the rest of the data X, and the target y. You will often see this in example code.

To run supervised learning, we also want to see how well it performs, so we split the data we have into two parts: 'train' and 'test'. The classifier uses the 'train' dataset to learn how to predict the result. Then we can give it the 'test' set to see if it really works!

    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # We are just using the default parameters, you can set your own splits etc.
    X_train, X_test, y_train, y_test = train_test_split(df, data['target'])
    model = KNeighborsClassifier()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(predictions == y_test)

And you can see which predictions it made correctly...

(Make sure that you don't accidentally include the 'target' in the training data - or it will be very good at predicting, but only because it has been given the answer!) Remove the line df['target'] = data['target'], and try again.

You may now get one or two wrong; this is normal, 100% accuracy only happens with quite simple systems...

You can plot the predictions against the actual values:

    sns.scatterplot(x=y_test, y=predictions)
    plt.show()

It's not very interesting for this data, but it will show you where predictions are going wrong...

Try using the different classifiers - see how easy it is to change, as all of the inputs are usually the same.
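As a rough sketch of how little needs to change, the loop below fits two classifiers on the same split and prints the accuracy of each. RandomForestClassifier is our own choice for the second model, picked so that the classifiers suggested in this lab are left for you to try. The random_state=0 arguments are an addition that makes repeated runs comparable, and score() reports the fraction of correct predictions.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

data = datasets.load_iris()

# random_state fixes the random split so repeated runs give the same answer.
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0
)

# Every scikit-learn classifier offers the same fit / predict / score calls,
# so only the constructor line differs between models.
models = {
    "k-nearest neighbours": KNeighborsClassifier(),
    "random forest": RandomForestClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on the test set
    print(f"{name}: {scores[name]:.2f}")
```

Adding another classifier is then a one-line change to the models dictionary.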
(Try at least SVM, naive_bayes, DecisionTree)

Try the DecisionTreeClassifier with (max_depth = 1) and (max_depth = 10) - what is the difference?

Each classifier may also have different parameters, which you can look into. But for this data, as it is so simple, they are generally unlikely to make much difference.

Titanic Data

So, let's get a more complex dataset...

For the sake of simplicity, data is often converted into .csv files. These are very simple text files: each line consists of one data record, and the values are just text, separated by commas. The first line is usually a 'header', which gives you the column names. You can open these in Excel, in Notepad, or even just view them from a command / terminal prompt, e.g. with the commands

    type file.csv

in Windows or

    cat file.csv

in Linux.

Datasets can also be in more complex formats, such as JSON or XML, or even stored in a database.

This one is a dataset many people use as a form of competition - it gives passenger details from those on board the Titanic when it sank. We want to see if it is possible to predict who survived, based on their details.

https://raw.githubusercontent.com/pcsanwald/kaggle-titanic/master/train.csv

(There is a separate test and train dataset, but we can just split the training dataset as we have done before. Once you are finished with it, you could also get the test dataset and work out how to incorporate that.)

    df = pd.read_csv('train.csv')
    print(df.columns)
    print(df.describe())

Note that describe() skips columns which contain non-numerical data. For now we can just remove them, but we will look at dealing with them better later on:

    df = df.drop(['name', 'sex', 'ticket', 'cabin', 'embarked'], axis=1)

We will also drop rows which are incomplete - they have a NA value somewhere.
This isn't always (or often) the best way to handle missing data.

    df = df.dropna()

And just have a look at the data, to check we can plot it - it won't really make much sense, as the columns have very different values.

    sns.histplot(df)

What useful plots could you make instead?

So let's take the code from our last attempt at a decision tree classifier, and put the 'target' into a separate series variable:

    from sklearn.tree import DecisionTreeClassifier

    target = df['survived']
    # Don't forget to remove it from the training data!
    df = df.drop(['survived'], axis=1)
    model = DecisionTreeClassifier(max_depth=1)
    X_train, X_test, y_train, y_test = train_test_split(df, target)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(predictions == y_test)

If you want to more easily see how good your prediction is...

    print(len(predictions), sum(predictions == y_test))

tells you how many predictions you made, and how many are correct - it should be possible to get over 90% accuracy (but not without some more work). This is a very basic statistic; there are many others that you can use.

You may see that some of the predictors perform better than others. As the test/train split is random, you will also get slightly different answers every time.

In order to improve performance, we are going to use some of the text data that we removed earlier. Where data is strings, we are just going to treat them as categories - i.e., the order doesn't mean anything. So, we will just use the LabelEncoder in scikit-learn. Insert the following code, and no longer drop the column marked 'sex':

    from sklearn.preprocessing import LabelEncoder

    df['sex'] = LabelEncoder().fit_transform(df['sex'])

This will just encode the sex as 0 or 1. (LabelEncoder numbers the labels in sorted order, so female comes before male.) Will this change the prediction?

Do you think you could encode the other values as numbers in a sensible way?

Regression

Everything we have looked at so far is Classification - i.e., there are a set number of possible outcomes (3 species of flowers, or survived / not survived the Titanic).
But we often want to work out more detail - e.g., what is the probability of..., what is the value of... - and for that we use 'regression'.

We can look at a California house price dataset, from
https://github.com/ageron/handson-ml/blob/master/datasets/housing/housing.csv

And we will re-use some of the things that we already learned how to do:

    from sklearn.ensemble import RandomForestRegressor

    df = pd.read_csv("housing.csv")
    df['ocean_proximity'] = LabelEncoder().fit_transform(df['ocean_proximity'])
    print(df.describe())
    # again, we remove the incomplete rows with NA
    df = df.dropna()
    # set the target value
    target = df['median_house_value']
    df = df.drop(['median_house_value'], axis=1)
    model = RandomForestRegressor()
    X_train, X_test, y_train, y_test = train_test_split(df, target)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    sns.scatterplot(x=y_test, y=predictions)
    plt.show()

What are some of the ways that you can view and evaluate the performance?

Look for the other regression models here:
https://scikit-learn.org/stable/supervised_learning.html

Here is a complete example:

    import seaborn as sns
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import LabelEncoder
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    df = pd.read_csv('https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/housing/housing.csv')
    df['ocean_proximity'] = LabelEncoder().fit_transform(df['ocean_proximity'])
    # again, we remove the incomplete rows with NA
    df = df.dropna()
    # set the target value
    target = df['median_house_value']
    df = df.drop(['median_house_value'], axis=1)
    model = RandomForestRegressor()
    X_train, X_test, y_train, y_test = train_test_split(df, target)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    sns.scatterplot(x=y_test, y=predictions)
    plt.show()
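One possible way to put numbers on the regression performance, alongside the scatter plot, is to compute a few standard error metrics. The sketch below wraps three of them from sklearn.metrics in a helper. The name regression_report is hypothetical, and the toy values are made up purely to show the call; they are not results from the housing data.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return a few common regression metrics as a dictionary."""
    return {
        # Average absolute error, in the same units as the target.
        "mae": mean_absolute_error(y_true, y_pred),
        # Root mean squared error: penalises large errors more heavily.
        "rmse": mean_squared_error(y_true, y_pred) ** 0.5,
        # R^2: 1.0 is a perfect fit, 0.0 is no better than predicting the mean.
        "r2": r2_score(y_true, y_pred),
    }


# Toy values, made up for illustration only (not real house prices).
toy_true = [100.0, 200.0, 300.0]
toy_pred = [110.0, 190.0, 300.0]
report = regression_report(toy_true, toy_pred)
print(report)
```

With the complete example above, the equivalent call would be regression_report(y_test, predictions).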