Please go to link and download lenses dataset and perform classification
CHAPTER 1
1. Basic operations
2 + 3*5
log(7) #natural logarithm
log(7, base=10) #logarithm of 5 with base 10
4^2 #4 raised to the second power
3/2 #division
sqrt(16) #square root
abs(3-7) #absolute value of 3-7
pi #the number pi
exp(2) #exponent of 2 with base e
a^b #base a with exponent b
15%/%4 #remainder of integer division
2. R object and names
There are two main concepts behind the R language: objects and functions.
An object can be seen as a storage space with an associated name. Everything
in R is stored in an object. Since there are many built-in functions in R,
make sure that the object names you assign are not already been used by R. A
simple way of checking is to type in the name you want to use. If R returns
an error message telling you that such object is not found, it is safe to
use this name. For example, concatenating function c is a built-in function
used to combine elements, so NEVER assign an object to c!
Try
exp
3. R data types
Basic data type:
Numeric: decimal value, 10.34
Integer: integer value, 3
Complex: complex value, 4-5i
Logical: TRUE or FALSE, T or F
Character: string value
Advanced data type:
Vector:
x=c(1,2,5.3,6,-2) #numeric vector
x=c("one","two","three") #character vector
x=c(TRUE,TRUE,TRUE,FALSE) #logical vector
Matrix:
x=matrix(c(1,3,2,10,-6,5),nrow=2,ncol=3)
x=matrix(c(1, 3, 2,
10,-6,5,
7, 4, 8),nrow=3,ncol=3, byrow=T)
All entries in a matrix must have the same type, either numeric,
integer, complex, logical, or character, etc.
Data Frame:
A data frame is the most important data type. It is a table, a spreadsheet,
or a data set. It is more general than a matrix because different columns
can have different data types.
#creat a data table x with three variables
d=c(1,2,3,4)
e=c("red", "white", "red", NA)
f=c(TRUE,TRUE,TRUE,FALSE)
x=data.frame(d,e,f)
#There is no names for variables. Now, we can give name to each variable
names(x)=c("ID","Color","Passed")
Factor:
Factors are variables in R which take on a limited number of different values, such variables
are often referred to as categorical variables. R stores categorical variables
into a factor. The factor stores a categorical variable as a vector of integers internal.
Characters are not supported in machine learning algorithm, and the only way is to convert a string
to an integer.
#creat a variable gender with 20 "male" entries and 30 "female" entries
gender=c(rep("male",20), rep("female", 30)) #gender is a vector of 50 entries
class(gender) #display the class of the variable
gender_factor=factor(gender)
class(gender_factor)
#Now R treats gender_factor as a nominal variable. It stores gender as 20 1s and 30 2s
and associates 1=female, 2=male internally (alphabetically).
summary(gender_factor) #The number of values in each level
A categorical variable can be divided into nominal and ordinal. The next example is for ordinal variables.
#Create Ordinal categorical vector
day_vector=c('evening', 'morning', 'afternoon', 'midday', 'midnight', 'evening', 'morning')
#Convert day_vector to a factor with ordered level
factor_day=factor(day_vector, order = TRUE, levels =c('morning', 'midday', 'afternoon', 'evening', 'midnight'))
#Print the new variable
factor_day
summary(factor_day)
#seq() and rep() provide convenient ways to construct vectors with certain patterns.
seq(10)
seq(0,1,length=10) #There are 10 values
seq(0,1,by=0.1) #There are 11 values
rep(1,3)
c(rep(1,3),rep(2,2),rep(-1,4))
rep("Small",3)
c(rep("Small",3),rep("Medium",4))
rep(c("Low","High"),3)
4. Vector operations
x=c(1,3,2,10,5)
y=1:5 #create a vector of consecutive integers
y+2
2*y
y^2 #raise each component to the second power
2^y #raise 2 to the first through fifth power
x+y
x*y
x/y
x^y
sum(x)
cumsum(x) #cumulative sum vector
sort(x) #increasing order
sort(x, decreasing=T) #decreasing order
length(x) #number of elements in x
x[3] #the third element of x
x[3:5] #the third to fifth element of x,
x[-2] #all except the second element
x[x>3] #list of elements in x greater than 3
5. Matrix operations
x=c(1, 3, 2, 10,5)
y=c(1, 2, 3, 4, 5)
m=cbind(x,y)
m=rbind(x,y)
t(m) #transpose of m
dim(m) #dimension of matrix m
m=matrix(c(1,3,2,
5,-1,2,
2,3,9),ncol=3,byrow=T)
m[2,3] #element of m at the second row, third column
m[2,] #second row
m[,3] #third column
m[-1,] #submatrix of m without the first row
m[,-1] #submatrix of m without the first column
m[-1,-1] #submatrix of m without the first row and column
m1=matrix(1:4, ncol=2)
m2=matrix(c(10,20,30,40),ncol=2)
2*m1 #scalar multiplication
m1+m2 #matrix addition
m1*m2 #component-wise multiplication,NOT the usual matrix multiplication
m1%*%m2 #usual matrix multiplication
6. Functions for basic statistics
x = c(0,4,15, 1, 6, 3, 20, 5, 8, 1, 3)
mean(x) #mean
median(x) #median
var(x) #variance
sd(x) #standard deviation
quantile(x,0.25) #25th Percentile
quantile(x,c(0,0.25,0.5,0.75,1))
summary(x) #five number summary
IQR(x) #interquatile
range(x) #minimum to maximum
7. R Graphing
Bar Chart
x = c(1, 3, 6, 4, 9)
barplot(x, xlab="Title on x-axis", ylab="Title on y-axis", main="Title of
the chart")
Scatter Plot (both x and y are numerical variables)
x = c(2, 3, 6, 8, 9, 12)
y = c(6, 9, 14, 2, 7, 23)
plot(x,y)
x = c(1, 2, 1, 2, 1, 2)
boxplot(x)
Histogram
x = c(4,4,6,6,16,2,7,9,12)
hist(x)
hist(x, breaks=c(2, 5, 8, 9, 14, 17))
hist(x, prob=T) #if probability instead of frequency is desired
8. Distributions
The prefix "d" is for probability density value (pdf), "p" for cumulative
probability (cdf), "q" for quantile, and "r" for random numbers from the
distribution.
Normal distributions
rnorm(10) #generates 10 random numbers according to standard
#Normal distribution.
set.seed(10)
rnorm(10)
rnorm(20, mean=10, sd=3)
#generates 20 random numbers according to Normal
#distribution with mean 10 and standard deviation 3.
x=seq(-8,8,length=500)
y=dnorm(x,mean=0,sd=1)
plot(x,y)
pnorm(90,100,10) #return the area below x=90
pnorm(90,100,10,lower.tail=F) #return the area above x=90
qnorm(0.95,100,10) #return the critical value for 95th percentile
qnorm(c(.025,.975),100,10) #critical value for 25th and 97.5th percentiles
9. Read external data sets
#Load external text file. Pay attention to the options header and na.strings
md=read.table("G:/My Drive/Teaching/Data/CS5630/Auto.data")
#open a viewable table
fix(md)
md=read.table("G:/My Drive/Teaching/Data/CS5630/Auto.data", header=T)
md=read.table("G:/My Drive/Teaching/Data/CS5630/Auto.data", header=T, na.strings ="?")
pairs(md) #scatterplot matrix for all variables
#scatterplot matrix for selected variables
pairs(~mpg+displacement+horsepower+weight+acceleration, md)
pairs(md[, c(1,3, 4, 5, 6)])
#If the values of the external file are separated by tab or comma, you need to provide
sep options: sep="\t" or sep=","
#load external csv file
md=read.csv("G:/My Drive/Teaching/Data/CS5630/Auto.csv", header=T, na.strings ="?")
10. User defined R function
Syntex:
function_name=function(arg1, arg2, ... ){
statements
return(object)
#An object in the function is local to the function. The object returned can be any data type.
#it is not necessary to include the return statement. R automatically returns whichever variable
#is on the last line of the body of the function.
fahrenheit_to_celsius=function(temp_F) {
temp_C=(temp_F-32)*5/9
return(temp_C)
CHAPTER 2
Distance functions
a = c(22, 1, 42, 10);
b = c(20, 0, 36, 8);
Euclidean Distance:
sqrt(sum((a-b)^2));
City Block, Manhattan, taxicab distance:
sum(abs(a-b));
Minkowski distance:
(sum((a-b)^r))^(1/r);
Supremum distance:
max(abs(a-b));
library(pracma)
d1 = c(3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
d2 = c(1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
Cosine similarity:
sum(d1*d2)/(Norm(d1)*Norm(d2))
Correlation measure:
cor(d1, d2)
library(philentropy)
a = c(22, 1, 42, 10);
b = c(20, 0, 36, 8);
u = c(12, 18, 34, 5);
Create disimilarity matrix
x = rbind(a,b,u)
distance(x, method = "euclidean")
distance(x, method = "manhattan")
To find out which methods are implemented in distance(), you can consult the getDistMethods() function.
getDistMethods()
a = c(1, 0, 0, 1, 0, 0, 0, 0, 0, 0)
b = c(0, 0, 0, 1, 0, 1, 1, 0, 0, 1)
x=rbind(a, b)
distance(x, method = "jaccard")
distance(x, method = "cosine")
Note that Jaccard distance and Jaccard coefficients are different. JC = 1-JD
CHAPTER 3
#R programming for classifications
library(ISLR)#Load the data set Carseats
library(tree)#tree library is for making decision trees of classifications
#To check the values of the data set Carseats
fix(Carseats)
#To check the description of all attributes
?Carseats
#The tree library is for making decision trees of classifications
attach(Carseats)
#To discretize the numerical attrribute Sales
High=ifelse(Sales<=8,"No","Yes")
#Use the data.frame() function to merge High with the rest of the Carseats data
Carseats=data.frame(Carseats,High)
Carseats=Carseats[,-1]
dim(Carseats)
#Convert the response variable as a factor.
Carseats$High=factor(Carseats$High, levels=c("Yes","No"))
#To make a decision tree that classify the attribute High based on all other attributes
tree.carseats=tree(High~.,Carseats)
#The summary() function lists the attributes that are used as internal nodes
#in the tree, the number of terminal nodes, and the (training) error rate.
#Note that a small residual mean deviance indicates a tree that provides
#a good fit to the data.
summary(tree.carseats)
#Use the plot() function to display the tree structure, and the text()
#function to display the node labels. The argument pretty=0 instructs R
#to include the category names for any catergorical attributes, rather than
#simply displaying a letter for each category
plot(tree.carseats)
text(tree.carseats,pretty=0)
#Type the name of the tree object, R prints output corresponding
#to each branch of the tree. R displays the split criterion,
#the number of observations in that branch, the deviance, the overall
#prediction for the branch (Yes or No)
tree.carseats
#The predict() function is used to test the model. The argument
#type="class" instructs R to return the actual class prediction
tree.pred=predict(tree.carseats,type="class")
#Create the confusion matrix
table(tree.pred,High)
#Correct prediction rate
mean(tree.pred==High)
#Error prediction rate
mean(tree.pred!=High)
#In order to properly evaluate the performance of a classification tree on
#these data, we must estimate the test error rather than simply computing
#the training error. We split the observations into a training set and a test
#set, build the tree using the training set, and evaluate its performance on
#the test data.
#Random select a sample of 200 observations of the data set as a training set and the rest
#of the data set as a test set.
set.seed(3)
train=sample(1:nrow(Carseats), 200)
Carseats.train=Carseats[train,]
Carseats.test=Carseats[-train,]
High.test=High[-train]
tree.carseats=tree(High~.,Carseats.train)
tree.pred=predict(tree.carseats,Carseats.test,type="class")
table(tree.pred,High.test)
mean(tree.pred!=High.test)
#Now we consider whether pruning the tree might lead to improved
#results. The function cv.tree() performs cross-validation in order to
#determine the optimal level of tree complexity. The argument FUN=prune.misclass
#indicates that we want the error rate to guide the cross-validation
#and pruning process.
set.seed(2)
cv.carseats=cv.tree(tree.carseats,FUN=prune.misclass)
cv.carseats
#Despite the name, dev corresponds to the cross-validation error
#rate in classification. We plot the error rate as a function of size
plot(cv.carseats$size,cv.carseats$dev,type="b")
#Check the tree with size 6
prune.carseats=prune.misclass(tree.carseats,best=6)
plot(prune.carseats)
text(prune.carseats,pretty=0)
tree.pred=predict(prune.carseats,Carseats.test,type="class")
table(tree.pred,High.test)
mean(tree.pred!=High.test)
#Check the tree with size 10
prune.carseats=prune.misclass(tree.carseats,best=10)
plot(prune.carseats)
text(prune.carseats,pretty=0)
tree.pred=predict(prune.carseats,Carseats.test,type="class")
table(tree.pred,High.test)
mean(tree.pred!=High.test)
#Check the tree with size 12
prune.carseats=prune.misclass(tree.carseats,best=12)
plot(prune.carseats)
text(prune.carseats,pretty=0)
tree.pred=predict(prune.carseats,Carseats.test,type="class")
table(tree.pred,High.test)
mean(tree.pred!=High.test)
#Check the tree with size 13
prune.carseats=prune.misclass(tree.carseats,best=13)
plot(prune.carseats)
text(prune.carseats,pretty=0)
tree.pred=predict(prune.carseats,Carseats.test,type="class")
table(tree.pred,High.test)
mean(tree.pred!=High.test)
#Bagging and Random Forests
The exact results obtained in this section may depend on the version of R and the version
of the randomForest package installed on your computer. Recall that bagging is simply a
special case of a random forest with m = p. Therefore, the randomForest() function can
be used to perform both random forests and bagging.
library(randomForest)
#Some arguments of function randomForest():
#mtry: the number of variables randomly sampled as candidates at each split.
#ntree: the number of trees to grow.
#ntree indicates the number of trees are generated by bagging
#mtry indicates the number of variables are used at each split.
#Check with 100 trees
set.seed(3)
tree.carseats=randomForest(High~.,Carseats.train, ntree=100,mtry=10)
tree.pred=predict(tree.carseats,Carseats.test,type="class")
table(tree.pred,High.test)
mean(tree.pred!=High.test)
#Check with 500 trees
set.seed(3)
tree.carseats=randomForest(High~.,Carseats.train, ntree=500, mtry=10)
tree.pred=predict(tree.carseats,Carseats.test,type="class")
table(tree.pred,High.test)
mean(tree.pred!=High.test)
#Check with 1000 trees
set.seed(3)
tree.carseats=randomForest(High~.,Carseats.train, ntree=1000, mtry=10)
tree.pred=predict(tree.carseats,Carseats.test,type="class")
table(tree.pred,High.test)
mean(tree.pred!=High.test)
#By default, randomForest() uses about sqrt(p) variables when building a random forest of
classification trees. sqrt(10)=3.16.
set.seed(3)
tree.carseats=randomForest(High~.,Carseats.train, ntree=500, mtry=4)
tree.pred=predict(tree.carseats,Carseats.test,type="class")
table(tree.pred,High.test)
mean(tree.pred!=High.test)
set.seed(3)
tree.carseats=randomForest(High~.,Carseats.train, ntree=500, mtry=3)
tree.pred=predict(tree.carseats,Carseats.test,type="class")
table(tree.pred,High.test)
mean(tree.pred!=High.test)
set.seed(3)
tree.carseats=randomForest(High~.,Carseats.train, ntree=500, mtry=5)
tree.pred=predict(tree.carseats,Carseats.test,type="class")
table(tree.pred,High.test)
mean(tree.pred!=High.test)
#To check the variable importance in bagging and random forest, use functions importance
#and varImpPlot. The mean decrease in Gini index for each variable is diaplayed.
importance(tree.carseats)
varImpPlot(tree.carseats)
#Boosting requires the Generalized Boosted Regression Models (gbm) package
library(gbm)
#For boosting technique the trees are grown sequentially: each tree is grown using
#information from previously grown trees. Boosting does not involve bootstrap sampling;
#instead each tree is fit on a modified version of the training data set.
#For binary classification, the response variable should be 0 or 1 if using Bernoulli distribution.
Carseats$High=ifelse(Carseats$High=="Yes",1,0)
set.seed(3)
train=sample(1:nrow(Carseats),200)
Carseats.train=Carseats[train,]
Carseats.test=Carseats[-train,]
High.test=High[-train]
#The option n.trees specifies the number of trees
set.seed(3)
tree.carseats=gbm(High~., Carseats.train, distribution="bernoulli",n.trees=100)
#Note that predict function provides the probability of getting Yes or No, therefore, we need to
#set up a threshhold in order to get the correct confusion table.
tree.pred.prob=predict(tree.carseats, Carseats.test, n.trees=100, type="response")
tree.pred=ifelse(tree.pred.prob>0.5, "Yes", "No")
table(High.test, tree.pred)
mean(tree.pred!=High.test)
set.seed(3)
tree.carseats=gbm(High~., Carseats.train, distribution="bernoulli",n.trees=200)
tree.pred.prob=predict(tree.carseats, Carseats.test, n.trees=200, type="response")
tree.pred=ifelse(tree.pred.prob>0.5, "Yes", "No")
table(High.test, tree.pred)
mean(tree.pred!=High.test)
set.seed(3)
tree.carseats=gbm(High~., Carseats.train, distribution="bernoulli",n.trees=300)
tree.pred.prob=predict(tree.carseats, Carseats.test, n.trees=300, type="response")
tree.pred=ifelse(tree.pred.prob>0.5, "Yes", "No")
table(High.test, tree.pred)
mean(tree.pred!=High.test)
set.seed(3)
tree.carseats=gbm(High~., Carseats.train, distribution="bernoulli",n.trees=400)
tree.pred.prob=predict(tree.carseats, Carseats.test, n.trees=400, type="response")
tree.pred=ifelse(tree.pred.prob>0.5, "Yes", "No")
table(High.test, tree.pred)
mean(tree.pred!=High.test)
tree.carseats=gbm(High~., Carseats.train, distribution="bernoulli",n.trees=500)
tree.pred.prob=predict(tree.carseats, Carseats.test, n.trees=500, type="response")
tree.pred=ifelse(tree.pred.prob>0.5, "Yes", "No")
table(High.test, tree.pred)
mean(tree.pred!=High.test)
CHAPTER 4
#K-Nearest Neighbors (KNN) Classifier
#knn() function requires 4 inputs:
#A matrix containing the predictors associated with the training data
#A matrix containing the predictors associated with the test data
#A vector containing the class labels for the training data
#A value for k, the number of observations in nearest neighbors to be used by the classifier
library(class)
library(ISLR)
#Smarlet data set consists of percentage returns for the S&P 500 stock index over 1250 days,
#from the beginning of 2001 until the end of 2005. For each date, we have recorded
#the percentage returns for each of the five previous trading days, Lag1 through Lag5.
?Smarket
attach(Smarket)
fix(Smarket)
train=(Year<2005)
trainSet=cbind(Lag1,Lag2)[train,]
testSet=cbind(Lag1,Lag2)[!train,]
train.Label=Direction[train]
test.Label=Direction[!train]
knn.pred=knn(trainSet,testSet,train.Label,k=1)
table(knn.pred,test.Label)
mean(knn.pred!=test.Label)
knn.pred=knn(trainSet,testSet,train.Label,k=3)
table(knn.pred,test.Label)
mean(knn.pred!=test.Label)
knn.pred=knn(trainSet,testSet,train.Label,k=5)
table(knn.pred,test.Label)
mean(knn.pred!=test.Label)
#Naive Bayes classification
#The e1071 library contains implementations for a number of classiification methods including
#Naive Bayes classification and Support Vector Machine.
library(e1071)
attach(Carseats)
High=ifelse(Sales<=8,"No","Yes")
Carseats=data.frame(Carseats,High)
Carseats=Carseats[,-1]
Carseats$High=factor(Carseats$High,levels=c("Yes","No"))
#Fitting the Naive Bayes model
NB.model=naiveBayes(High~., Carseats)
#What does the model say? Print the model summary
NB.model
#Prediction on the dataset
NB.prediction=predict(NB.model,Carseats)
#Performance on the entire data set
table(NB.prediction,High)
mean(NB.prediction!=High)
#Performance on test data set
set.seed(3)
train=sample(1:nrow(Carseats), 200)
Carseats.train=Carseats[train,]
Carseats.test=Carseats[-train,]
High.test=High[-train]
NB.model=naiveBayes(High~.,Carseats.train)
NB.prediction=predict(NB.model,Carseats.test)
table(NB.prediction,High.test)
mean(NB.prediction!=High.test)
#Support Vector Classifier
#The cost argument controls the tradeoff between classification of training points accurately
#and a smooth decision boundary. It allows us to specify the cost of a violation to the margin.
#When the cost argument is small, the margins will be wide and many support
#vectors will be on the margin or will violate the margin. We get lower variance and high bias.
#When the cost argument is large, then the margins will be narrow and there will be few support
#vectors on the margin or violating the margin. We get higher variance and lower bias.
#the kernel means transforming data into another dimension that has a clear dividing margin between
#classes of data.
svm.model=svm(High~., Carseats.train, kernel="linear", cost=1)
svm.prediction=predict(svm.model,Carseats.test)
table(svm.prediction, High.test)
mean(svm.prediction!=High.test)
#The function tune.svm() performs cross validation. By default, tune.svm()
#performs 10-fold cross-validation.
set.seed(3)
tune.out=tune.svm(High~.,data=Carseats.train, kernel="linear",
cost=c(0.0001,0.01,0.1,1,5,10,100,1000))
#We can access the cross-validation errors for each of these models using the summary() command
summary(tune.out)
#We see that cost=5 results in the lowest cross-validation error rate.
#The tune.svm() function stores the best model obtained.
bestmodel=tune.out$best.model
summary(bestmodel)
svm.prediction=predict(bestmodel,Carseats.test)
table(svm.prediction, High.test)
mean(svm.prediction!=High.test)
#In order to use a nonlinear kernel, we use a different value of the parameter kernel. To fit an SVM
#with a polynomial kernel we use kernel="polynomial", and to fit an SVM with a radial kernel we use
#kernel="radial". In the former case we also use the degree argument to specify a degree for the
#polynomial kernel, and in the latter case we also specify a value of gamma for the radial basis
#kernel.
#The parameter gamma defines how far the influence of single training example reaches. If the value
#of gamma is high, the decision boundary will depend on points close to the decision boundary and
#nearer points carry more weights than far away points. The decision boundary becomes more wiggly
#in this case. If the value of gamma is low, then far away points carry more weights than nearer
#points and thus the decision boundary becomes more like a straight line.
svm.model=svm(High~.,data=Carseats.train, kernel="radial", gamma=1, cost=1)
svm.prediction=predict(svm.model,Carseats.test)
table(svm.prediction, High.test)
mean(svm.prediction!=High.test)
#If we increase the value of cost, we can reduce the number of training errors. However, this comes
#at the price of a more irregular decision boundary that seems to be at risk of overfitting the data.
#We can perform cross-validation using tune.svm() to select the best choice of
#gamma and cost for an SVM with a radial kernel.
set.seed(3)
tune.out=tune.svm(High~.,data=Carseats.train, kernel="radial",
cost=c(0.0001,0.01,0.1,1,5,10,100,1000), gamma=c(0.05,0.5,1,2,3,4,5))
summary(tune.out)
bestmodel=tune.out$best.model
summary(bestmodel)
svm.prediction=predict(bestmodel,Carseats.test)
table(svm.prediction, High.test)
mean(svm.prediction!=High.test)
#Use polynomial kernel
svm.model=svm(High~.,data=Carseats.train, kernel="poly", degree=2,
gamma=1, cost=1)
svm.prediction=predict(svm.model,Carseats.test)
table(svm.prediction, High.test)
mean(svm.prediction!=High.test)