Please go to the link, download the lenses data set, and perform classification on it.
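Below is a minimal sketch of one way to approach this exercise, assuming the UCI lenses.data file (whitespace-separated, no header) has been downloaded to the working directory; the column names are assumptions based on the data set description, and read.table() and tree() are introduced in Chapter 1 and Chapter 3.

lenses=read.table("lenses.data", header=FALSE)
names(lenses)=c("ID","Age","Prescription","Astigmatic","TearRate","Lenses") #assumed column names
lenses=lenses[,-1] #drop the ID column
lenses[]=lapply(lenses, factor) #all attributes are categorical, so store them as factors
library(tree)
lenses.tree=tree(Lenses~., lenses)
summary(lenses.tree)
plot(lenses.tree)
text(lenses.tree, pretty=0)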

CHAPTER 1

1. Basic operations

2 + 3*5

log(7) #natural logarithm

log(7, base=10) #logarithm of 7 with base 10

4^2 #4 raised to the second power

3/2 #division

sqrt(16) #square root

abs(3-7) #absolute value of 3-7

pi #the number pi

exp(2) #e raised to the power 2

a^b #base a with exponent b

15%/%4 #integer division (quotient)

15%%4 #remainder of integer division

2. R object and names

There are two main concepts behind the R language: objects and functions.

An object can be seen as a storage space with an associated name. Everything in R is stored in an object. Since there are many built-in functions in R, make sure that the object names you assign have not already been used by R. A simple way of checking is to type in the name you want to use. If R returns an error message telling you that such an object is not found, it is safe to use this name. For example, the concatenating function c() is a built-in function used to combine elements, so NEVER use c as an object name!

Try typing

exp

R prints the body of the built-in function exp, so exp is not a safe name to use.
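A short sketch of the check described above; myrate is just a made-up name used for illustration.

myrate #Error: object 'myrate' not found, so the name is free to use
myrate=0.05 #safe to assign
c #c is a built-in function and prints its definition, so never reuse the name c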

3. R data types

Basic data type:

Numeric: decimal value, 10.34

Integer: integer value, 3L (the suffix L marks an integer)

Complex: complex value, 4-5i

Logical: TRUE or FALSE, T or F

Character: string value
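A quick way to confirm these types is the class() function; the values below are just illustrative examples.

class(10.34) #"numeric"
class(3L) #"integer"
class(4-5i) #"complex"
class(TRUE) #"logical"
class("red") #"character"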

Advanced data type:

Vector:

x=c(1,2,5.3,6,-2) #numeric vector

x=c("one","two","three") #character vector

x=c(TRUE,TRUE,TRUE,FALSE) #logical vector

Matrix:

x=matrix(c(1,3,2,10,-6,5),nrow=2,ncol=3)

x=matrix(c(1, 3, 2,

10,-6,5,

7, 4, 8),nrow=3,ncol=3, byrow=T)

All entries in a matrix must have the same type: numeric, integer, complex, logical, or character.
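A small illustration of what happens when types are mixed; the values are arbitrary examples. R coerces all entries to a common type.

matrix(c(1, 2, "a", "b"), nrow=2) #the numbers 1 and 2 are coerced to the character strings "1" and "2"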

Data Frame:

A data frame is the most important data type. It is a table, a spreadsheet,

or a data set. It is more general than a matrix because different columns

can have different data types.

#create a data frame x with three variables

d=c(1,2,3,4)

e=c("red", "white", "red", NA)

f=c(TRUE,TRUE,TRUE,FALSE)

x=data.frame(d,e,f)

#The variables inherit the names d, e, and f. Now we can give a descriptive name to each variable

names(x)=c("ID","Color","Passed")

Factor:

Factors are variables in R that take on a limited number of different values; such variables are often referred to as categorical variables. R stores a categorical variable as a factor, and a factor stores the values internally as a vector of integers. Many machine learning algorithms cannot work with character data directly, so character variables need to be converted to integers, which is what a factor does.

#create a variable gender with 20 "male" entries and 30 "female" entries

gender=c(rep("male",20), rep("female", 30)) #gender is a vector of 50 entries

class(gender) #display the class of the variable

gender_factor=factor(gender)

class(gender_factor)

#Now R treats gender_factor as a nominal variable. Internally it stores the 20 male entries as 2s
#and the 30 female entries as 1s, with the levels assigned alphabetically (1=female, 2=male).

summary(gender_factor) #The number of values in each level

A categorical variable can be divided into nominal and ordinal. The next example is for ordinal variables.

#Create Ordinal categorical vector

day_vector=c('evening', 'morning', 'afternoon', 'midday', 'midnight', 'evening', 'morning')

#Convert day_vector to a factor with ordered level

factor_day=factor(day_vector, order = TRUE, levels =c('morning', 'midday', 'afternoon', 'evening', 'midnight'))

#Print the new variable

factor_day

summary(factor_day)

#seq() and rep() provide convenient ways to construct vectors with certain patterns.

seq(10)

seq(0,1,length=10) #There are 10 values

seq(0,1,by=0.1) #There are 11 values

rep(1,3)

c(rep(1,3),rep(2,2),rep(-1,4))

rep("Small",3)

c(rep("Small",3),rep("Medium",4))

rep(c("Low","High"),3)

4. Vector operations

x=c(1,3,2,10,5)

y=1:5 #create a vector of consecutive integers

y+2

2*y

y^2 #raise each component to the second power

2^y #raise 2 to the first through fifth power

x+y

x*y

x/y

x^y

sum(x)

cumsum(x) #cumulative sum vector

sort(x) #increasing order

sort(x, decreasing=T) #decreasing order

length(x) #number of elements in x

x[3] #the third element of x

x[3:5] #the third to fifth elements of x

x[-2] #all except the second element

x[x>3] #list of elements in x greater than 3

5. Matrix operations

x=c(1, 3, 2, 10,5)

y=c(1, 2, 3, 4, 5)

m=cbind(x,y)

m=rbind(x,y)

t(m) #transpose of m

dim(m) #dimension of matrix m

m=matrix(c(1,3,2,

5,-1,2,

2,3,9),ncol=3,byrow=T)

m[2,3] #element of m at the second row, third column

m[2,] #second row

m[,3] #third column

m[-1,] #submatrix of m without the first row

m[,-1] #submatrix of m without the first column

m[-1,-1] #submatrix of m without the first row and column

m1=matrix(1:4, ncol=2)

m2=matrix(c(10,20,30,40),ncol=2)

2*m1 #scalar multiplication

m1+m2 #matrix addition

m1*m2 #component-wise multiplication,NOT the usual matrix multiplication

m1%*%m2 #usual matrix multiplication

6. Functions for basic statistics

x = c(0,4,15, 1, 6, 3, 20, 5, 8, 1, 3)

mean(x) #mean

median(x) #median

var(x) #variance

sd(x) #standard deviation

quantile(x,0.25) #25th Percentile

quantile(x,c(0,0.25,0.5,0.75,1))

summary(x) #minimum, first quartile, median, mean, third quartile, maximum

IQR(x) #interquartile range

range(x) #minimum and maximum

7. R Graphing

Bar Chart

x = c(1, 3, 6, 4, 9)

barplot(x, xlab="Title on x-axis", ylab="Title on y-axis", main="Title of the chart")

Scatter Plot (both x and y are numerical variables)

x = c(2, 3, 6, 8, 9, 12)

y = c(6, 9, 14, 2, 7, 23)

plot(x,y)

Box Plot

x = c(1, 2, 1, 2, 1, 2)

boxplot(x)

Histogram

x = c(4,4,6,6,16,2,7,9,12)

hist(x)

hist(x, breaks=c(2, 5, 8, 9, 14, 17))

hist(x, prob=T) #if probability instead of frequency is desired

8. Distributions

The prefix "d" is for probability density value (pdf), "p" for cumulative

probability (cdf), "q" for quantile, and "r" for random numbers from the

distribution.

Normal distributions

rnorm(10) #generates 10 random numbers from the standard Normal distribution

set.seed(10)

rnorm(10)

rnorm(20, mean=10, sd=3)

#generates 20 random numbers according to Normal

#distribution with mean 10 and standard deviation 3.

x=seq(-8,8,length=500)

y=dnorm(x,mean=0,sd=1)

plot(x,y)

pnorm(90,100,10) #returns the area below x=90 for a Normal distribution with mean 100 and sd 10

pnorm(90,100,10,lower.tail=F) #returns the area above x=90

qnorm(0.95,100,10) #returns the 95th percentile

qnorm(c(.025,.975),100,10) #returns the 2.5th and 97.5th percentiles

9. Read external data sets

#Load external text file. Pay attention to the options header and na.strings

md=read.table("G:/My Drive/Teaching/Data/CS5630/Auto.data")

#open a viewable table

fix(md)

md=read.table("G:/My Drive/Teaching/Data/CS5630/Auto.data", header=T)

md=read.table("G:/My Drive/Teaching/Data/CS5630/Auto.data", header=T, na.strings ="?")

pairs(md) #scatterplot matrix for all variables

#scatterplot matrix for selected variables

pairs(~mpg+displacement+horsepower+weight+acceleration, md)

pairs(md[, c(1,3, 4, 5, 6)])

#If the values of the external file are separated by tabs or commas, you need to provide
#the sep option: sep="\t" or sep=","

#load external csv file

md=read.csv("G:/My Drive/Teaching/Data/CS5630/Auto.csv", header=T, na.strings ="?")

10. User defined R function

Syntax:

function_name=function(arg1, arg2, ...){
statements
return(object)
}

#Objects created inside a function are local to the function. The object returned can be any data type.
#It is not necessary to include the return statement; R automatically returns the value of the last
#expression in the body of the function.

fahrenheit_to_celsius=function(temp_F) {
temp_C=(temp_F-32)*5/9
return(temp_C)
}
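A quick usage check of the function defined above:

fahrenheit_to_celsius(212) #returns 100, since 212 degrees Fahrenheit equals 100 degrees Celsius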

CHAPTER 2

Distance functions

a = c(22, 1, 42, 10);

b = c(20, 0, 36, 8);

Euclidean Distance:

sqrt(sum((a-b)^2));


City Block, Manhattan, taxicab distance:

sum(abs(a-b));

Minkowski distance (r is the order of the distance and must be set first; r=3 is used here as an example):

r = 3;

(sum(abs(a-b)^r))^(1/r);


Supremum distance:

max(abs(a-b));


library(pracma)

d1 = c(3, 2, 0, 5, 0, 0, 0, 2, 0, 0)

d2 = c(1, 0, 0, 0, 0, 0, 0, 1, 0, 2)

Cosine similarity:

sum(d1*d2)/(Norm(d1)*Norm(d2))

Correlation measure:

cor(d1, d2)

library(philentropy)

a = c(22, 1, 42, 10);

b = c(20, 0, 36, 8);

u = c(12, 18, 34, 5);

Create a dissimilarity matrix:

x = rbind(a,b,u)

distance(x, method = "euclidean")

distance(x, method = "manhattan")

To find out which methods are implemented in distance(), you can consult the getDistMethods() function.

getDistMethods()

a = c(1, 0, 0, 1, 0, 0, 0, 0, 0, 0)

b = c(0, 0, 0, 1, 0, 1, 1, 0, 0, 1)

x=rbind(a, b)

distance(x, method = "jaccard")

distance(x, method = "cosine")

Note that the Jaccard distance (JD) and the Jaccard coefficient (JC) are different: JC = 1 - JD.
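A quick check of this relationship using the matrix x defined above:

1 - distance(x, method = "jaccard") #Jaccard coefficient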

CHAPTER 3

#R programming for classifications

library(ISLR)#Load the data set Carseats

library(tree)#the tree library is for building decision trees for classification

#To check the values of the data set Carseats

fix(Carseats)

#To check the description of all attributes

?Carseats


attach(Carseats)

#To discretize the numerical attribute Sales

High=ifelse(Sales<=8,"No","Yes")

#Use the data.frame() function to merge High with the rest of the Carseats data

Carseats=data.frame(Carseats,High)

Carseats=Carseats[,-1]

dim(Carseats)

#Convert the response variable to a factor.

Carseats$High=factor(Carseats$High, levels=c("Yes","No"))

#To make a decision tree that classifies the attribute High based on all other attributes

tree.carseats=tree(High~.,Carseats)

#The summary() function lists the attributes that are used as internal nodes

#in the tree, the number of terminal nodes, and the (training) error rate.

#Note that a small residual mean deviance indicates a tree that provides

#a good fit to the data.

summary(tree.carseats)

#Use the plot() function to display the tree structure, and the text()

#function to display the node labels. The argument pretty=0 instructs R

#to include the category names for any categorical attributes, rather than

#simply displaying a letter for each category

plot(tree.carseats)

text(tree.carseats,pretty=0)

#If you type the name of the tree object, R prints output corresponding

#to each branch of the tree. R displays the split criterion,

#the number of observations in that branch, the deviance, the overall

#prediction for the branch (Yes or No)

tree.carseats

#The predict() function is used to make predictions with the model. The argument

#type="class" instructs R to return the actual class prediction

tree.pred=predict(tree.carseats,type="class")

#Create the confusion matrix

table(tree.pred,High)

#Correct prediction rate

mean(tree.pred==High)

#Error prediction rate

mean(tree.pred!=High)

#In order to properly evaluate the performance of a classification tree on

#these data, we must estimate the test error rather than simply computing

#the training error. We split the observations into a training set and a test

#set, build the tree using the training set, and evaluate its performance on

#the test data.

#Randomly select a sample of 200 observations of the data set as a training set and use the rest

#of the data set as a test set.

set.seed(3)

train=sample(1:nrow(Carseats), 200)

Carseats.train=Carseats[train,]

Carseats.test=Carseats[-train,]

High.test=High[-train]

tree.carseats=tree(High~.,Carseats.train)

tree.pred=predict(tree.carseats,Carseats.test,type="class")

table(tree.pred,High.test)

mean(tree.pred!=High.test)

#Now we consider whether pruning the tree might lead to improved

#results. The function cv.tree() performs cross-validation in order to

#determine the optimal level of tree complexity. The argument FUN=prune.misclass

#indicates that we want the error rate to guide the cross-validation

#and pruning process.

set.seed(2)

cv.carseats=cv.tree(tree.carseats,FUN=prune.misclass)

cv.carseats

#Despite the name, dev corresponds to the cross-validation error

#rate in classification. We plot the error rate as a function of size


plot(cv.carseats$size,cv.carseats$dev,type="b")

#Check the tree with size 6

prune.carseats=prune.misclass(tree.carseats,best=6)

plot(prune.carseats)

text(prune.carseats,pretty=0)

tree.pred=predict(prune.carseats,Carseats.test,type="class")

table(tree.pred,High.test)

mean(tree.pred!=High.test)

#Check the tree with size 10

prune.carseats=prune.misclass(tree.carseats,best=10)

plot(prune.carseats)

text(prune.carseats,pretty=0)

tree.pred=predict(prune.carseats,Carseats.test,type="class")

table(tree.pred,High.test)

mean(tree.pred!=High.test)

#Check the tree with size 12

prune.carseats=prune.misclass(tree.carseats,best=12)

plot(prune.carseats)

text(prune.carseats,pretty=0)

tree.pred=predict(prune.carseats,Carseats.test,type="class")

table(tree.pred,High.test)

mean(tree.pred!=High.test)

#Check the tree with size 13

prune.carseats=prune.misclass(tree.carseats,best=13)

plot(prune.carseats)

text(prune.carseats,pretty=0)

tree.pred=predict(prune.carseats,Carseats.test,type="class")

table(tree.pred,High.test)

mean(tree.pred!=High.test)

#Bagging and Random Forests

The exact results obtained in this section may depend on the version of R and the version

of the randomForest package installed on your computer. Recall that bagging is simply a

special case of a random forest with m = p. Therefore, the randomForest() function can

be used to perform both random forests and bagging.

library(randomForest)

#Some arguments of the randomForest() function:

#mtry: the number of variables randomly sampled as candidates at each split.

#ntree: the number of trees to grow.

#For bagging, set mtry equal to the total number of predictors (here 10), so every split
#considers all of the variables.

#Check with 100 trees

set.seed(3)

tree.carseats=randomForest(High~.,Carseats.train, ntree=100,mtry=10)

tree.pred=predict(tree.carseats,Carseats.test,type="class")

table(tree.pred,High.test)

mean(tree.pred!=High.test)

#Check with 500 trees

set.seed(3)

tree.carseats=randomForest(High~.,Carseats.train, ntree=500, mtry=10)

tree.pred=predict(tree.carseats,Carseats.test,type="class")

table(tree.pred,High.test)

mean(tree.pred!=High.test)

#Check with 1000 trees

set.seed(3)

tree.carseats=randomForest(High~.,Carseats.train, ntree=1000, mtry=10)

tree.pred=predict(tree.carseats,Carseats.test,type="class")

table(tree.pred,High.test)

mean(tree.pred!=High.test)

#By default, randomForest() uses about sqrt(p) variables when building a random forest of
#classification trees. Here sqrt(10)=3.16.

set.seed(3)

tree.carseats=randomForest(High~.,Carseats.train, ntree=500, mtry=4)

tree.pred=predict(tree.carseats,Carseats.test,type="class")

table(tree.pred,High.test)

mean(tree.pred!=High.test)

set.seed(3)

tree.carseats=randomForest(High~.,Carseats.train, ntree=500, mtry=3)

tree.pred=predict(tree.carseats,Carseats.test,type="class")

table(tree.pred,High.test)

mean(tree.pred!=High.test)

set.seed(3)

tree.carseats=randomForest(High~.,Carseats.train, ntree=500, mtry=5)

tree.pred=predict(tree.carseats,Carseats.test,type="class")

table(tree.pred,High.test)

mean(tree.pred!=High.test)

#To check the variable importance in bagging and random forests, use the functions importance()
#and varImpPlot(). The mean decrease in Gini index for each variable is displayed.

importance(tree.carseats)

varImpPlot(tree.carseats)

#Boosting requires the Generalized Boosted Regression Models (gbm) package

library(gbm)

#With the boosting technique, the trees are grown sequentially: each tree is grown using

#information from previously grown trees. Boosting does not involve bootstrap sampling;

#instead each tree is fit on a modified version of the training data set.

#For binary classification, the response variable should be 0 or 1 when using the Bernoulli distribution.

Carseats$High=ifelse(Carseats$High=="Yes",1,0)

set.seed(3)

train=sample(1:nrow(Carseats),200)

Carseats.train=Carseats[train,]

Carseats.test=Carseats[-train,]

High.test=High[-train]

#The option n.trees specifies the number of trees

set.seed(3)

tree.carseats=gbm(High~., Carseats.train, distribution="bernoulli",n.trees=100)

#Note that the predict function returns the probability that the response is 1 (Yes); therefore, we need to
#set a threshold in order to get the correct confusion table.

tree.pred.prob=predict(tree.carseats, Carseats.test, n.trees=100, type="response")

tree.pred=ifelse(tree.pred.prob>0.5, "Yes", "No")

table(High.test, tree.pred)

mean(tree.pred!=High.test)

set.seed(3)

tree.carseats=gbm(High~., Carseats.train, distribution="bernoulli",n.trees=200)

tree.pred.prob=predict(tree.carseats, Carseats.test, n.trees=200, type="response")

tree.pred=ifelse(tree.pred.prob>0.5, "Yes", "No")

table(High.test, tree.pred)

mean(tree.pred!=High.test)

set.seed(3)

tree.carseats=gbm(High~., Carseats.train, distribution="bernoulli",n.trees=300)

tree.pred.prob=predict(tree.carseats, Carseats.test, n.trees=300, type="response")

tree.pred=ifelse(tree.pred.prob>0.5, "Yes", "No")

table(High.test, tree.pred)

mean(tree.pred!=High.test)

set.seed(3)

tree.carseats=gbm(High~., Carseats.train, distribution="bernoulli",n.trees=400)

tree.pred.prob=predict(tree.carseats, Carseats.test, n.trees=400, type="response")

tree.pred=ifelse(tree.pred.prob>0.5, "Yes", "No")

table(High.test, tree.pred)

mean(tree.pred!=High.test)

set.seed(3)

tree.carseats=gbm(High~., Carseats.train, distribution="bernoulli",n.trees=500)

tree.pred.prob=predict(tree.carseats, Carseats.test, n.trees=500, type="response")

tree.pred=ifelse(tree.pred.prob>0.5, "Yes", "No")

table(High.test, tree.pred)

mean(tree.pred!=High.test)

CHAPTER 4

#K-Nearest Neighbors (KNN) Classifier

#knn() function requires 4 inputs:

#A matrix containing the predictors associated with the training data

#A matrix containing the predictors associated with the test data

#A vector containing the class labels for the training data

#A value for k, the number of nearest neighbors to be used by the classifier

library(class)

library(ISLR)

#The Smarket data set consists of percentage returns for the S&P 500 stock index over 1,250 days,

#from the beginning of 2001 until the end of 2005. For each date, we have recorded

#the percentage returns for each of the five previous trading days, Lag1 through Lag5.

?Smarket

attach(Smarket)

fix(Smarket)

train=(Year<2005)

trainSet=cbind(Lag1,Lag2)[train,]

testSet=cbind(Lag1,Lag2)[!train,]

train.Label=Direction[train]

test.Label=Direction[!train]

knn.pred=knn(trainSet,testSet,train.Label,k=1)

table(knn.pred,test.Label)

mean(knn.pred!=test.Label)

knn.pred=knn(trainSet,testSet,train.Label,k=3)

table(knn.pred,test.Label)

mean(knn.pred!=test.Label)

knn.pred=knn(trainSet,testSet,train.Label,k=5)

table(knn.pred,test.Label)

mean(knn.pred!=test.Label)

#Naive Bayes classification

#The e1071 library contains implementations of a number of classification methods, including
#Naive Bayes classification and the Support Vector Machine.

library(e1071)

#Chapter 3 removed the Sales column from Carseats, so reload a fresh copy of the data set first

data(Carseats, package="ISLR")

attach(Carseats)

High=ifelse(Sales<=8,"No","Yes")

Carseats=data.frame(Carseats,High)

Carseats=Carseats[,-1]

Carseats$High=factor(Carseats$High,levels=c("Yes","No"))

#Fitting the Naive Bayes model

NB.model=naiveBayes(High~., Carseats)

#What does the model say? Print the model summary

NB.model


#Prediction on the dataset

NB.prediction=predict(NB.model,Carseats)

#Performance on the entire data set

table(NB.prediction,High)

mean(NB.prediction!=High)

#Performance on test data set

set.seed(3)

train=sample(1:nrow(Carseats), 200)

Carseats.train=Carseats[train,]

Carseats.test=Carseats[-train,]

High.test=High[-train]

NB.model=naiveBayes(High~.,Carseats.train)

NB.prediction=predict(NB.model,Carseats.test)

table(NB.prediction,High.test)

mean(NB.prediction!=High.test)

#Support Vector Classifier

#The cost argument controls the tradeoff between classifying the training points accurately
#and having a smooth decision boundary. It allows us to specify the cost of a violation of the margin.

#When the cost argument is small, the margins will be wide and many support

#vectors will be on the margin or will violate the margin. We get lower variance and high bias.

#When the cost argument is large, the margins will be narrow and there will be few support

#vectors on the margin or violating the margin. We get higher variance and lower bias.

#The kernel transforms the data into another feature space in which there is a clearer dividing margin
#between the classes of data.

svm.model=svm(High~., Carseats.train, kernel="linear", cost=1)

svm.prediction=predict(svm.model,Carseats.test)

table(svm.prediction, High.test)

mean(svm.prediction!=High.test)

#The function tune.svm() performs cross validation. By default, tune.svm()

#performs 10-fold cross-validation.

set.seed(3)

tune.out=tune.svm(High~.,data=Carseats.train, kernel="linear",

cost=c(0.0001,0.01,0.1,1,5,10,100,1000))

#We can access the cross-validation errors for each of these models using the summary() command

summary(tune.out)

#We see that cost=5 results in the lowest cross-validation error rate.

#The tune.svm() function stores the best model obtained.

bestmodel=tune.out$best.model

summary(bestmodel)

svm.prediction=predict(bestmodel,Carseats.test)

table(svm.prediction, High.test)

mean(svm.prediction!=High.test)

#In order to use a nonlinear kernel, we use a different value of the parameter kernel. To fit an SVM

#with a polynomial kernel we use kernel="polynomial", and to fit an SVM with a radial kernel we use

#kernel="radial". In the former case we also use the degree argument to specify a degree for the

#polynomial kernel, and in the latter case we also specify a value of gamma for the radial basis

#kernel.

#The parameter gamma defines how far the influence of a single training example reaches. If the value
#of gamma is high, the decision boundary depends mainly on points close to it: nearer points carry more
#weight than far-away points, and the decision boundary becomes more wiggly. If the value of gamma is
#low, far-away points also carry substantial weight, and the decision boundary becomes more like a
#straight line.

svm.model=svm(High~.,data=Carseats.train, kernel="radial", gamma=1, cost=1)

svm.prediction=predict(svm.model,Carseats.test)

table(svm.prediction, High.test)

mean(svm.prediction!=High.test)

#If we increase the value of cost, we can reduce the number of training errors. However, this comes

#at the price of a more irregular decision boundary that seems to be at risk of overfitting the data.

#We can perform cross-validation using tune.svm() to select the best choice of

#gamma and cost for an SVM with a radial kernel.

set.seed(3)

tune.out=tune.svm(High~.,data=Carseats.train, kernel="radial",

cost=c(0.0001,0.01,0.1,1,5,10,100,1000), gamma=c(0.05,0.5,1,2,3,4,5))

summary(tune.out)

bestmodel=tune.out$best.model

summary(bestmodel)

svm.prediction=predict(bestmodel,Carseats.test)

table(svm.prediction, High.test)

mean(svm.prediction!=High.test)

#Use polynomial kernel

svm.model=svm(High~.,data=Carseats.train, kernel="poly", degree=2,

gamma=1, cost=1)

svm.prediction=predict(svm.model,Carseats.test)

table(svm.prediction, High.test)

mean(svm.prediction!=High.test)