Need help with the following assignment; I have attached the required data and materials from the professor's lecture. It needs to be performed in RStudio. All of the detailed instructions and necessary do

Advanced Analytics – Theory and Methods
Module 4: Analytics Theory/Methods

During this lesson the following topics are covered:

• Overview of the Decision Tree classifier
• General algorithm for Decision Trees
• Decision Tree use cases
• Entropy, information gain
• Reasons to Choose (+) and Cautions (-) of the Decision Tree classifier
• Classifier methods and the conditions in which they are best suited

Decision Trees

The topics covered in this lesson are listed.

Decision Tree Classifier – What is it?
• Used for classification:

– Returns probability scores of class membership
– Well-calibrated, like logistic regression
– Assigns a label based on the highest-scoring class
– Some Decision Tree algorithms return simply the most likely class
– Regression Trees: a variation for regression
  – Returns the average value at every node
  – Predictions can be discontinuous at the decision boundaries
• Input variables can be continuous or discrete
• Output:
– A tree that describes the decision flow
– Leaf nodes return either a probability score or simply a classification
– Trees can be converted to a set of "decision rules"
– "IF income < $50,000 AND mortgage_amt > $100K THEN default=T with 75% probability"

Decision Trees are a flexible method very commonly deployed in data mining applications. In this lesson we will focus on Decision Trees used for classification problems. There are two types of trees: Classification Trees and Regression (or Prediction) Trees.

• Classification Trees are used to segment observations into more homogeneous groups (assign class labels). They usually apply to outcomes that are binary or categorical in nature.

• Regression Trees are a variation of regression; what is returned at each node is the average value of the observations in that node (a type of step function). Regression Trees can be applied to outcomes that are continuous (like account spend or personal income).

The input values can be continuous or discrete. Decision Tree models output a tree that describes the decision flow. The leaf nodes return class labels, and in some implementations they also return probability scores. In principle, the tree can be converted into decision rules such as the example shown on the slide.

Decision Trees are a popular method because they can be applied to a variety of situations. The rules of classification are very straightforward and the results can easily be presented visually. Additionally, because the end result is a series of logical "if-then" statements, there is no underlying assumption of a linear (or non-linear) relationship between the predictor variables and the dependent variable.
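To make the last bullet concrete in R, the quoted rule can be written directly as a small function. This is only a hand-written sketch of a single leaf's rule, not the output of any tree package; the variable names come from the example on the slide.

```r
# One leaf of a tree expressed as an "if-then" decision rule (from the slide example).
# Observations satisfying both conditions fall in this leaf and get its probability;
# the tree's other leaves would supply their own probabilities.
rule_default_prob <- function(income, mortgage_amt) {
  if (income < 50000 && mortgage_amt > 100000) {
    0.75       # p(default = T) at this leaf
  } else {
    NA_real_   # covered by the other decision rules of the tree
  }
}

rule_default_prob(income = 42000, mortgage_amt = 150000)   # returns 0.75
```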

Decision Tree – Example of Visual Structure

[Figure: an example tree with root node Gender (branches Male / Female), internal nodes Income (<=45,000 / >45,000) and Age (<=40 / >40), and Yes/No leaf nodes. Legend: Internal Node – decision on a variable; Leaf Node – class label; Branch – outcome of a test.]

Decision Trees are typically depicted in a flow-chart-like manner. Branches refer to the outcome of a decision and are represented by the connecting lines. When the decision is numerical, the "greater than" branch is usually shown on the right and the "less than" branch on the left. Depending on the nature of the variable, you may need to include an "equal to" component on one branch. Internal Nodes are the decision or test points. Each refers to a single variable or attribute. In the example here the outcomes are binary, although there could be more than two branches stemming from an internal node. For example, if the variable were categorical and had three choices, you might need a branch for each choice.

The Leaf Nodes are at the end of the last branch on the tree. These represent the outcome of all the prior decisions. The leaf nodes are the class labels, or the segment in which all observations that follow the path to the leaf would be placed.

Decision Tree Classifier – Use Cases
• When a series of questions (yes/no) are answered to arrive at a classification
– Biological species classification
– Checklist of symptoms during a doctor's evaluation of a patient
• When "if-then" conditions are preferred to linear models
– Customer segmentation to predict response rates
– Financial decisions such as loan approval
– Fraud detection
• Short Decision Trees are the most popular "weak learner" in ensemble learning techniques

An example of Decision Trees in practice is the method for classifying biological species: a series of yes/no questions is answered to arrive at a classification. Another example is a checklist of symptoms during a doctor's evaluation of a patient. People mentally perform these types of analyses frequently when assessing a situation. Other use cases include customer segmentation to better predict response rates to marketing and promotions. Computers can be "taught" to evaluate a series of criteria and automatically approve or deny an application for a loan. In the case of loan approval, computers can use the logical "if-then" statements to predict whether the customer will default on the loan. For customers with a clear (strong) outcome, no human interaction is required; for observations that may not generate a clear response, a human is needed for the decision.

Short Decision Trees (where we have limited the number of splits) are often used as components (called "weak learners" or "base learners") in ensemble techniques, where a set of predictive models all vote and the final decision is based on the combination of their votes; examples include random forests, bagging, and boosting (beyond the scope of this class). The very simplest of the short trees are decision stumps: Decision Trees with one internal node (the root) which is immediately connected to the terminal nodes. A decision stump makes a prediction based on the value of just a single input feature.
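As a hedged illustration of a decision stump in RStudio, the sketch below uses the rpart package and assumes a data frame named credit with a factor outcome column named class (both names are assumptions standing in for the attached course data).

```r
# A decision stump: a Decision Tree with a single split (one internal node).
library(rpart)

stump <- rpart(class ~ ., data = credit, method = "class",
               control = rpart.control(maxdepth = 1,        # allow exactly one split
                                       cp = 0, minsplit = 2))
print(stump)   # shows the one rule (single input feature) the stump learned
```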

Example: The Credit Prediction Problem

[Figure: the decision tree produced for the credit data, with the count of good-credit records and p(good) at each node.
Root: 700/1000, p(good) = 0.70
– savings = (500:1000), >=1000, no known savings: 245/294, p(good) = 0.83
– savings = <100, (100:500):
  – housing = own: 349/501, p(good) = 0.70
  – housing = free, rent:
    – personal = male mar/wid, male single: 70/117, p(good) = 0.60
    – personal = female, male div/sep: 36/88, p(good) = 0.41]

We will use the same example we used in the previous lesson with the Naïve Bayesian classifier.

Starting at the top of the tree, the probability of good credit is 70% (700 out of 1000 people have good credit). The algorithm has decided to split on the amount in the savings account, forming two groups.

One group has savings of less than $100 or between $100 and $500.

The second group is the rest of the population, with savings of $500 to $1,000, greater than $1,000, or no known savings.

Computing the probability of good credit at the second node, we find that 245 of the 294 people in this savings category have good credit, so the probability at this node is 83%.

Looking at the other node (savings < 100 or savings 100:500), we next consider housing. We split this node into housing (free, rent) as one group and housing (own) as the other. Computing the probability of good credit at the housing (own) node, we see that 349 out of 501 people have good credit, a 70% probability.

Traversing down the housing (free, rent) node, we now split on the variable known as personal. The two groups are personal (female, male divorced/separated) and personal (male married/widowed, male single). In the node on the right, the probability of good credit is 0.60; in the node on the left, the probability of good credit is 41% (which is less than 50%, so that node was shaded red on the original slide).

We can see that for this case, we might want to work with the probabilities, rather than the class labels; this tree would only label 88 rows (out of 1000) of the training set as "bad", which is far less than the 30% "bad" rate of the training set, and of those cases labeled "bad", only 59% of them would truly be bad. Tuning the splitting parameters, or using a random forest or other ensemble technique (more on that later) might improve the performance.

Decision Trees are greedy algorithms: they make decisions based on what is available at that moment, and once a bad decision is taken its effect is propagated all the way down the tree. An ensemble technique may randomize the splitting (or even randomize the data) and come up with multiple tree structures; it then assigns class labels or probability values by averaging across the corresponding nodes of all the trees.
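For the RStudio assignment, a tree like the one in this example might be grown with the rpart package. The sketch below is illustrative only: the data frame name credit and the column names class, savings_status, housing, personal_status, and job are assumptions based on the attributes discussed in this lesson, so adjust them to the attached data set.

```r
# Grow a classification tree on the credit data using entropy-based splits.
library(rpart)

fit <- rpart(class ~ savings_status + housing + personal_status + job,
             data   = credit,
             method = "class",                       # classification tree
             parms  = list(split = "information"))   # use information (entropy) splits

print(fit)                                  # node counts and class probabilities, as in the figure
head(predict(fit, credit, type = "prob"))   # probability scores of class membership
head(predict(fit, credit, type = "class"))  # most likely class label per observation
```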

General Algorithm
• To construct tree T from training set S:
– If all examples in S belong to some class C, or S is sufficiently "pure", then make a leaf labeled C.
– Otherwise:

– select the "most informative" attribute A
– partition S according to A's values
– recursively construct sub-trees T1, T2, ..., for the subsets of S
• The details vary according to the specific algorithm (CART, ID3, C4.5) but the general idea is the same

We now describe the general algorithm. Our objective is to construct a tree T from a training set S. If all examples in S belong to some class C (good_credit, for example), or S is sufficiently "pure" (in our case the root node is 70% pure in p(credit_good)), we make a leaf labeled C.

Otherwise we select the attribute considered the "most informative" (savings, housing, etc.) and partition S according to A's values, similar to what we explained on the previous slide. We construct sub-trees T1, T2, ... for the subsets of S recursively until:
• all of the nodes are as pure as required, or
• you cannot split further as per your specifications, or
• any other stopping criterion you specified is met.
There are several algorithms that implement Decision Trees, and the methods of tree construction vary with each of them. CART, ID3, and C4.5 are some of the popular algorithms.
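The general procedure can be sketched as a short recursive R function. This is a pseudocode-style illustration written for this lesson (the helper functions are defined inline and are not from any package), handling categorical attributes only; real implementations such as rpart are far more sophisticated.

```r
# Illustrative sketch of greedy tree construction (categorical attributes only).
entropy <- function(y) {                     # entropy of a vector of class labels
  p <- table(y) / length(y)
  -sum(p[p > 0] * log2(p[p > 0]))
}

info_gain <- function(data, attr, target) {  # base entropy minus conditional entropy
  groups <- split(data[[target]], data[[attr]])
  h_cond <- sum(sapply(groups, function(g) length(g) / nrow(data) * entropy(g)))
  entropy(data[[target]]) - h_cond
}

build_tree <- function(data, target, attributes, min_purity = 0.9, min_size = 20) {
  counts <- table(data[[target]])
  # Stopping criteria: node pure enough, node too small, or nothing left to split on
  if (max(counts) / nrow(data) >= min_purity ||
      nrow(data) <= min_size || length(attributes) == 0) {
    return(list(leaf = TRUE, label = names(which.max(counts))))
  }
  # Select the "most informative" attribute A
  gains <- sapply(attributes, info_gain, data = data, target = target)
  a <- names(which.max(gains))
  # Partition S according to A's values and recursively construct sub-trees
  children <- lapply(split(data, data[[a]]), build_tree, target = target,
                     attributes = setdiff(attributes, a),
                     min_purity = min_purity, min_size = min_size)
  list(leaf = FALSE, split_on = a, children = children)
}
```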

Step 1: Pick the Most "Informative" Attribute
• Entropy-based methods are one common way
• Entropy: H = -Σ_c p(c) log2(p(c)), summed over the classes c
• H = 0 if p(c) = 0 or 1 for any class
– So for binary classification, H = 0 is a "pure" node
• H is maximum when all classes are equally probable
– For binary classification, H = 1 when classes are 50/50

The first step is to pick the most informative attribute. There are many ways to do it; we detail entropy-based methods.

Let p(c) be the probability of a given class. H, as defined by the formula shown above, has the value 0 if p(c) is 0 or 1; so for binary classification H = 0 means it is a "pure" node. H is at its maximum when all classes are equally probable: if the class probabilities are 50/50, then H = 1 (maximum entropy).
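A small base-R helper matching this definition (the function name is ours for illustration; it is not from a package):

```r
# Entropy of a vector of class probabilities: H = -sum( p(c) * log2(p(c)) )
entropy <- function(p) {
  p <- p[p > 0]              # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

entropy(c(0, 1))       # 0 : a "pure" node
entropy(c(0.5, 0.5))   # 1 : maximum entropy for binary classification
```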

Step 1: Pick the Most "Informative" Attribute (Continued)
• First, we need to get the base entropy of the data

In our credit problem, p(credit_good) is 0.7 and p(credit_bad) is 0.3.

The base entropy is H_credit = -(0.7 log2(0.7) + 0.3 log2(0.3)) = 0.88, very close to 1. Our unconditioned credit problem has fairly high entropy.
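The same arithmetic checked directly in R:

```r
# Base entropy of the credit data, with p(good) = 0.7 and p(bad) = 0.3
-(0.7 * log2(0.7) + 0.3 * log2(0.3))   # approximately 0.88, close to the maximum of 1
```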

Step 1: Pick the Most "Informative" Attribute (Continued) – Conditional Entropy
• The weighted sum of the class entropies for each value of the attribute: H(class | attribute) = Σ_a p(a) H(class | attribute = a)
• In English: attribute values (home owner vs. renter) give more information about class membership
– "Home owners are more likely to have good credit than renters"
• Conditional entropy should be lower than unconditioned entropy

Continuing with step 1, we now find the conditional entropy, which is the weighted sum of the class entropies for each value of the attribute.

Say we choose the attribute "housing"; we have three levels for this attribute (free, rent, and own). Intuitively, home owners are more likely to have better credit than renters, so the value of housing gives information about class membership for credit_good. The conditional entropy of the class given housing should therefore be lower than the base entropy.

At worst (in the case where the attribute is uncorrelated with the class label), the conditional entropy is the same as the unconditioned entropy.

Conditional Entropy Example

                     for free   own     rent
P(housing)           0.108      0.713   0.179
P(bad | housing)     0.407      0.261   0.391
P(good | housing)    0.592      0.739   0.609

Let's compute the conditional entropy of the credit class conditioned on housing status.

In the top row of the table are the probabilities of each housing value. The next two rows are the probabilities of the class labels conditioned on the housing value.

H(class | housing) = 0.108 × ( -(0.407 log2 0.407 + 0.592 log2 0.592) )
                   + 0.713 × ( -(0.261 log2 0.261 + 0.739 log2 0.739) )
                   + 0.179 × ( -(0.391 log2 0.391 + 0.609 log2 0.609) )
                   ≈ 0.87

Note that each term inside parentheses is the entropy of the class labels within a single housing value.

The conditional entropy is still fairly high, but it is a little less than the unconditioned entropy (0.87 versus 0.88).
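The weighted sum can be checked in base R using the rounded probabilities from the table (so the result will differ slightly from one computed on the raw data):

```r
# Entropy of the class labels within one housing value
h <- function(p_bad, p_good) -(p_bad * log2(p_bad) + p_good * log2(p_good))

p_housing <- c(for_free = 0.108, own = 0.713, rent = 0.179)       # P(housing)
h_within  <- c(h(0.407, 0.592), h(0.261, 0.739), h(0.391, 0.609)) # per-value class entropies

sum(p_housing * h_within)   # conditional entropy, roughly 0.87 (the base entropy was 0.88)
```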

Step 1: Pick the Most "Informative" Attribute (Continued) – Information Gain
• The information that you gain by knowing the value of an attribute
• InfoGain(A) = H_base - H(class | A)
• So the "most informative" attribute is the attribute with the highest InfoGain

Information Gain is defined as the difference between the base entropy and the conditional entropy of the attribute.

So the most informative attribute is the attribute with the most information gain. Remember, this is just an example; there are other information/purity measures, but InfoGain is a fairly popular one for inducing Decision Trees.
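Continuing the housing example with the values computed above (rounded, so the result is approximate):

```r
# InfoGain(housing) = base entropy - conditional entropy given housing
h_base <- -(0.7 * log2(0.7) + 0.3 * log2(0.3))   # about 0.881
h_cond <- 0.869                                  # conditional entropy from the previous sketch
h_base - h_cond                                  # about 0.012, consistent with the 0.013 listed for housing below
```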

Back to the Credit Prediction Example

Attribute         InfoGain
job               0.001
housing           0.013
personal_status   0.006
savings_status    0.028

If we compute the InfoGain for all of our input variables, we see that savings_status is the most informative variable: it gives the most InfoGain, and that is why it was the first variable on which the tree was split.

Step 2 & 3: Partition on the Selected Variable

[Figure: the first split of the credit tree — root 700/1000, p(good) = 0.70, partitioned into savings = <100, (100:500) on one branch and savings = (500:1000), >=1000, no known savings (245/294, p(good) = 0.83) on the other.]

• Step 2: Find the partition with the highest InfoGain
– In our example the selected partition has InfoGain = 0.028
• Step 3: At each resulting node, repeat Steps 1 and 2
– until the node is "pure enough"
• Pure nodes => no information gain by splitting on other attributes

The selected partitioning has InfoGain almost as high as using each savings value as a separate node. And since InfoGain happens to be biased toward many partitions, this partition is basically as informative.

InfoGain can be used with continuous variables as well; in that case, finding the partition and computing the information gain are the same step.

"Pure enough" usually means that no more information can be gained by splitting on other attributes 14 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved.

Diagnostics
• Hold-out data
• ROC/AUC
• Confusion matrix
• FPR/FNR, precision/recall
• Do the splits (or the "rules") make sense?

– What does the domain expert say?
• How deep is the tree?

– Too many layers are prone to over-fit
• Do you get nodes with very few members?

– Over-fit

The diagnostics are exactly the same as the ones we detailed for the Naïve Bayesian classifier: we use hold-out data, ROC/AUC, and the confusion matrix. There are sanity checks that can be performed, such as validating the "decision rules" with domain experts and determining whether they make sense. Having too many layers and obtaining nodes with very few members are signs of over-fitting.

Decision Tree Classifier – Reasons to Choose (+) & Cautions (-)

Reasons to Choose (+):
• Takes any input type (numeric, categorical); in principle, can handle categorical variables with many distinct values (ZIP code)
• Robust with redundant variables, correlated variables
• Naturally handles variable interaction
• Handles variables that have a non-linear effect on the outcome
• Computationally efficient to build
• Easy to score data
• Many algorithms can return a measure of variable importance
• In principle, decision rules are easy to understand

Cautions (-):
• Decision surfaces can only be axis-aligned
• Tree structure is sensitive to small changes in the training data
• A "deep" tree is probably over-fit, because each split reduces the training data for subsequent splits
• Not good for outcomes that are dependent on many variables (related to the over-fit problem above)
• Doesn't naturally handle missing values; however, most implementations include a method for dealing with this
• In practice, decision rules can be fairly complex

Decision Trees take both numerical and categorical variables. They can handle many distinct values, such as the ZIP code in the data.

Unlike Naïve Bayes, the Decision Tree method is robust with redundant or correlated variables. Decision Trees also handle variables that have a non-linear effect on the outcome; linear/logistic regression computes the value as b1*x1 + b2*x2 + ..., and so on.

If two variables interact, say the value y depends on x1*x2, linear regression does not model this type of data correctly.

Naïve Bayes also does not model variable interactions (by design). Decision Trees handle variable interactions naturally: every node in the tree is, in some sense, an interaction.
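A small simulated illustration of this point (synthetic data, for demonstration only): the class below depends on the joint condition x1 > 0.5 AND x2 > 0.5, much like the "income AND mortgage" rule earlier in the lesson.

```r
# Simulated interaction: the outcome depends on x1 and x2 jointly.
set.seed(42)
n  <- 1000
x1 <- runif(n)
x2 <- runif(n)
y  <- factor(ifelse(x1 > 0.5 & x2 > 0.5, "default", "ok"))
d  <- data.frame(x1, x2, y)

library(rpart)
fit <- rpart(y ~ x1 + x2, data = d, method = "class")
print(fit)   # nested splits on x1 and x2 capture the interaction directly

# An additive model such as glm(y ~ x1 + x2, family = binomial) contains no x1:x2
# term; the interaction has to be added by hand (e.g. y ~ x1 * x2) to be modeled.
```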

Decision Tree algorithms are computationally efficient, and it is easy to score the data. The outputs are easy to understand. Many algorithms return a measure of variable importance.

Basically the information gain from each variable is provided by many packages.

In terms of Cautions (-), the decision surface is axis-aligned and the decision regions are rectangular. If the true decision surface is not axis-aligned (say, a triangular surface), Decision Tree algorithms do not handle this type of data well.

The tree structure is sensitive to small variations in the training data. If you have a large data set and you build a Decision Tree on one subset and another Decision Tree on a different subset, the resulting trees can be very different even though they come from the same data set. If you get a deep tree, you are probably over-fitting, as each split reduces the training data available for subsequent splits.
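In rpart, tree depth and over-fitting are typically controlled through rpart.control and cost-complexity pruning. A minimal sketch, again assuming a data frame credit with an outcome column class:

```r
library(rpart)

# Deliberately over-grow a tree, then prune it back using cross-validated error.
deep_fit <- rpart(class ~ ., data = credit, method = "class",
                  control = rpart.control(cp = 0, minsplit = 2))

printcp(deep_fit)   # complexity table: cross-validated error (xerror) per subtree size

best_cp <- deep_fit$cptable[which.min(deep_fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(deep_fit, cp = best_cp)   # prune back to the subtree with lowest cv error
```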

Which Classifier Should I Try?

• Do I want class probabilities, rather than just class labels? – Logistic regression, Decision Tree
• Do I want insight into how the variables affect the model? – Logistic regression, Decision Tree
• Is the problem high-dimensional? – Naïve Bayes
• Do I suspect some of the inputs are correlated? – Decision Tree, Logistic regression
• Do I suspect some of the inputs are irrelevant? – Decision Tree, Naïve Bayes
• Are there categorical variables with a large number of levels? – Naïve Bayes, Decision Tree
• Are there mixed variable types? – Decision Tree, Logistic regression
• Is there non-linear data or discontinuities in the inputs that will affect the outputs? – Decision Tree

This is only advisory. It is a list of things to think about when picking a classifier, based on the Reasons to Choose (+) and Cautions (-) we have discussed.

Check Your Knowledge
1. How do you define information gain?
2. For what conditions is the value of entropy at a maximum, and when is it at a minimum?
3. List three use cases of Decision Trees.
4. What are weak learners and how are they used in ensemble methods?
5. Why do we end up with an over-fitted model with deep trees, and in data sets where the outcomes depend on many variables?
6. What classification method would you recommend for the following cases:

– High-dimensional data
– Data in which outputs are affected by non-linearity and discontinuity in the inputs

Your Thoughts?

Record your answers here.

Advanced Analytics – Theory and Methods

During this lesson the following topics were covered:

• Overview of the Decision Tree classifier
• General algorithm for Decision Trees
• Decision Tree use cases
• Entropy, information gain
• Reasons to Choose (+) and Cautions (-) of the Decision Tree classifier
• Classifier methods and the conditions in which they are best suited

Decision Trees – Summary

This lesson covered these topics. Please take a moment to review them.