Classification & Regression Tree (CART) is a classification method, technically known as Binary Recursive Partitioning. It uses historical data to construct decision trees, which are then used to classify new data.
Here the point comes: Where should we use CART? Sometimes we have problems where we want an answer in "Yes/No" form, e.g. "Is salary greater than 30000?", "Is it going to rain today?" etc.
CART asks "Yes/No" questions. The CART algorithm searches all possible variables and all possible split values in order to find the best split, i.e. the question that splits the data into two parts with maximum homogeneity.
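To make the split search concrete, here is a minimal sketch (illustrative only, not rpart internals) that scores a candidate "Yes/No" question by weighted Gini impurity, the default homogeneity measure rpart uses for classification; the salary column is a hypothetical example:
# Gini impurity of a set of class labels: 0 means perfectly homogeneous
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}
# weighted impurity of a binary split; CART keeps the question minimizing this
split_impurity <- function(labels, go_left) {
  w <- mean(go_left) # fraction of rows answering "Yes"
  w * gini(labels[go_left]) + (1 - w) * gini(labels[!go_left])
}
# e.g. score the question "Is salary greater than 30000?" (hypothetical column):
# split_impurity(dataset$y, dataset$salary > 30000)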
Key elements of a CART analysis are:
·         Split each node in the tree.
·         Decide whether the tree is complete or not (see the rpart.control sketch below).
·         Assign each leaf node to a class outcome.
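Whether the tree is complete is governed by stopping parameters. In rpart these live in rpart.control; a minimal sketch with the package defaults (shown here for illustration):
library(rpart)
ctrl <- rpart.control(minsplit = 20, # do not try to split nodes with fewer than 20 rows
                      cp = 0.01,     # a split must improve the relative error by at least cp
                      maxdepth = 30) # cap on the depth of the tree
# pass it in via: rpart(formula, data, method, control = ctrl)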
Following are the steps to get the CART model:
·         Grow a tree. Use the following to grow a tree:
rpart(formula, data, weights, subset, na.action = na.rpart, method,
      model = FALSE, x = FALSE, y = TRUE, parms, control, cost, ...)
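For example, a minimal classification call (assuming a data frame dataset with a factor response column y, as in the bank example further down) could look like:
fit <- rpart(y ~ ., data = dataset, method = "class")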
·         Examine the results based on the model. The following functions help to inspect the result:
printcp(fit) : display the cp table
plotcp(fit) : plot cross-validation results
rsq.rpart(fit) : plot approximate R-squared and relative error for different splits (2 plots); labels are only appropriate for the "anova" method
print(fit) : print results
summary(fit) : detailed results including surrogate splits
plot(fit) : plot the decision tree
text(fit) : label the decision tree plot
post(fit, file=) : create a postscript plot of the decision tree
Here fit is the model output of the rpart command.
·         Prune the tree. Pruning helps avoid overfitting the data. Typically, you will want to select a tree size that minimizes the cross-validated error, the xerror column printed by printcp(). Prune the tree to the desired size using:
prune(fit, cp=)
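Spelled out, that means reading the cp value with the lowest xerror off the cp table and passing it to prune(); this is the same idiom used in the full example below:
# pick the complexity parameter that minimizes cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pfit <- prune(fit, cp = best_cp)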
Here is an example of a classification tree:
library(rpart)
dataset <- read.table("C:\\Users\\Nishu\\Downloads\\bank\\bank.csv", header=TRUE, sep=";")
# grow tree
fit <- rpart(y ~ ., method="class", data=dataset)
printcp(fit) # display the results
plotcp(fit) # visualize cross-validation results
summary(fit) # detailed summary of splits
# plot tree
plot(fit, uniform=TRUE, main="Classification Tree for Bank")
text(fit, use.n=TRUE, all=TRUE, cex=.8)
# create attractive postscript plot of tree
post(fit, file = "c:/tree.ps", title = "Classification Tree")
# prune the tree
pfit <- prune(fit, cp = fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"])
# plot the pruned tree
plot(pfit, uniform=TRUE, main="Pruned Classification Tree")
text(pfit, use.n=TRUE, all=TRUE, cex=.8)
post(pfit, file = "c:/ptree.ps", title = "Pruned Classification Tree")
Now we have the model. The next step is to predict new data based on the trained model. First we'll split the dataset into training and test data with a fixed percentage, then build the confusion matrix:
library(rpart)
dataset <- read.table("C:\\Users\\Nishu\\Downloads\\bank\\bank.csv", header=TRUE, sep=";")
sub <- sample(nrow(dataset), floor(nrow(dataset) * 0.9)) # take 90% as training data
training <- dataset[sub, ]
testing <- dataset[-sub, ]
fit <- rpart(y ~ ., method="class", data=training)
predict(fit, testing, type="class")
# to get the confusion matrix
out <- table(predict(fit, testing, type="class"), testing$y)
      no yes
no   391  25
yes   13  24
library(caret)
confusionMatrix(out)
Output would be:
      no yes
no   391  25
yes   13  24
Accuracy : 0.9101
95% CI : (0.8936, 0.9248)
No Information Rate : 0.8968
P-Value [Acc > NIR] : 0.05704
Kappa : 0.4005
Mcnemar's Test P-Value : 9.213e-08
Sensitivity : 0.9745
Pos Pred Value : 0.9287
Neg Pred Value : 0.6125
Prevalence : 0.8968
Detection Rate : 0.8740
Detection Prevalence : 0.9410
So here we have the desired output: the predictions and the accuracy of the model, based on which we can predict future data.
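As a quick sanity check (an illustrative snippet, not part of the output above), the overall accuracy can also be computed directly from the confusion matrix out:
sum(diag(out)) / sum(out) # correct predictions divided by total predictions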
Download this dataset, or another dataset, from here and test the algorithm. Here you go!