I need help with the following assignment. I have attached the required data and the materials from the professor's lecture. It needs to be performed in RStudio. All of the detailed instructions and necessary do

Copyright © 2014 EMC Corporation. All rights reserved.

Advanced Analytics – Theory and Methods
Module 4: Analytics Theory/Methods

Naïve Bayesian Classifiers
During this lesson the following topics are covered:
• Naïve Bayesian Classifier
• Theoretical foundations of the classifier
• Use cases
• Evaluating the effectiveness of the classifier
• The Reasons to Choose (+) and Cautions (-) with the use of the classifier

The topics covered in this lesson are listed.

Classifiers
• Classification: assign labels to objects.
• Usually supervised: training set of pre-classified examples.
• Our examples:
  - Naïve Bayesian
  - Decision Trees
  - (and Logistic Regression)
• Example questions:
  - Where in the catalog should I place this product listing?
  - Is this email spam?
  - Is this politician Democrat/Republican/Green?

The primary task performed by classifiers is to assign labels to objects. Labels in classifiers are pre-determined, unlike in clustering, where we discover the structure and assign labels. Classification is a supervised learning method: we start with a training set of pre-classified examples and, with the knowledge of probabilities, we assign class labels.

Some use case examples are shown in the slide. Based on the voting pattern on issues we could classify whether a politician has an affiliation to a party or a principle. Retailers use classifiers to assign proper catalog entry locations for their products. Most importantly the classification of emails as spam is another useful application of classifier methods.

Logistic regression, discussed in the previous lesson, can be viewed and used as a classifier. We will discuss Naïve Bayesian Classifiers in this lesson and the use of Decision Trees in the next lesson.

Naïve Bayesian Classifier
• Determine the most probable class label for each object
  - Based on the observed object attributes
  - Naïvely assumed to be conditionally independent of each other
  - Example:
    - Based on the object's attributes {shape, color, weight}
    - A given object that is {spherical, yellow, < 60 grams} may be classified (labeled) as a tennis ball
  - Class label probabilities are determined using Bayes' Law
• Input variables are discrete
• Output:
  - Probability score – proportional to the true probability
  - Class label – based on the highest probability score

The Naïve Bayesian Classifier is a probabilistic classifier based on Bayes' Law and naïve conditional independence assumptions. In simple terms, a Naïve Bayes Classifier assumes that, given the class, the presence (or absence) of a particular feature is unrelated to the presence (or absence) of any other feature.

For example, an object can be classified into a particular category based on its attributes such as shape, color, and weight. A reasonable classification for an object, that is spherical, yellow and less than 60 grams in weight, may be a tennis ball. Even if these features depend on each other or upon the existence of the other features, a Naïve Bayesian Classifier considers all of these properties to independently contribute to the probability that the object is a tennis ball.

The input variables are generally discrete (categorical) but there are variations to the algorithms that work with continuous variables as well. For this lesson, we will consider only discrete input variables. Although weight may be considered a continuous variable, in the tennis ball example, weight was grouped into intervals in order to make weight a categorical variable.
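Grouping a continuous variable into intervals can be sketched as follows. This is a minimal Python illustration, not the lab's R code; the interval edges and labels are hypothetical, since the lesson only mentions the "< 60 grams" interval.

```python
from bisect import bisect_right

# Hypothetical interval edges in grams; only "< 60 grams" appears in the lesson.
WEIGHT_EDGES = [60, 200]
WEIGHT_LABELS = ["< 60 g", "60-200 g", ">= 200 g"]

def discretize_weight(grams):
    """Map a continuous weight to one of the categorical interval labels."""
    # bisect_right counts how many edges are <= grams, which is the interval index.
    return WEIGHT_LABELS[bisect_right(WEIGHT_EDGES, grams)]

print(discretize_weight(57.5))  # a weight below the first edge
print(discretize_weight(60))    # an edge value belongs to the interval on its right
```

Once every attribute is categorical, the classifier only needs counts per category.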

The output typically returns a probability score and class membership. The output from most implementations is a log probability score for each class (we will address this later in the lesson), and we assign the class label that corresponds to the highest log probability score.

Naïve Bayesian Classifier - Use Cases
• Preferred method for many text classification problems.
  - Try this first; if it doesn't work, try something more complicated
• Use cases
  - Spam filtering, other text classification tasks
  - Fraud detection

Naïve Bayesian Classifiers are among the most successful known algorithms for learning to classify text documents. Spam filtering is the best known use of Naïve Bayesian Text Classification. Bayesian Spam Filtering has become a popular mechanism to distinguish illegitimate spam email from legitimate email. Many modern mail clients implement Bayesian Spam Filtering.

Naïve Bayesian Classifiers are also used to detect fraud. For example, in auto insurance, based on a training data set with attributes (such as driver’s rating, vehicle age, vehicle price, whether the claim was made by the policy holder, police report status, and whether the claim was genuine), we can classify a new claim as genuine or not.


Building a Training Dataset to Predict Good or Bad Credit
• Predict the credit behavior of a credit card applicant from the applicant's attributes:
  - Personal status
  - Job type
  - Housing type
  - Savings amount
• These are all categorical variables and are better suited to Naïve Bayesian Classifier than to logistic regression.

Let us look into a specific use case example. We present here the same example we worked with in Lesson 2 of this module with the Apriori algorithm. The training dataset consists of these attributes: personal status, job type, housing type, and amount of money in the savings account. They are represented as categorical variables, which are well suited for the Naïve Bayesian Classifier.

With this training set we want to predict the credit behavior of a new customer. This problem could be solved with logistic regression as well. If there are multiple levels for the outcome you want to predict, then Naïve Bayesian Classifier is a better solution.

Next, we will go through the technical basis for Naïve Bayesian Classifiers and will revisit this credit dataset later.

Technical Description - Bayes' Law
• C is the class label:
  - C ∈ {C_1, C_2, …, C_n}
• A is the observed object attributes
  - A = (a_1, a_2, …, a_m)
• P(C | A) is the probability of C given A is observed
  - Called the conditional probability

$$P(C \mid A) = \frac{P(A \wedge C)}{P(A)} = \frac{P(A \mid C)\,P(C)}{P(A)}$$

Bayes' Law states: P(C | A)*P(A) = P(A | C)*P(C) = P(A ^ C).

That is, the conditional probability that C is true given that A is true, denoted P(C|A), times the probability of A is the same as the conditional probability that A is true given that C is true, denoted P(A|C), times the probability of C. Both of these terms are equal to P(A^C), the probability that A and C are simultaneously true. If we divide all three terms by P(A), then we get the form shown on the slide.

The reason that Bayes’ Law is important is that we may not know P(C|A) (and we want to), but we do know P(A|C) and P(C) for each possible value of C from the training data. As we will see later, it is not necessary to know P(A) for the purposes of Naïve Bayes Classifiers.

An example using Bayes Law:

John flies frequently and likes to upgrade his seat to first class. He has determined that, if he checks in for his flight at least two hours early, the probability that he will get the upgrade is .75; otherwise, the probability that he will get the upgrade is .35. With his busy schedule, he checks in at least two hours before his flight only 40% of the time. Suppose John didn’t receive an upgrade on his most recent attempt. What is the probability that he arrived late?

C = John arrives late
A = John did not receive an upgrade
P(C) = Probability John arrives late = .6
P(A) = Probability John did not receive an upgrade = 1 – (.4 x .75 + .6 x .35) = 1 – .51 = .49
P(A|C) = Probability that John did not receive an upgrade given that he arrived late = 1 – .35 = .65
P(C|A) = Probability that John arrived late given that he did not receive his upgrade = P(A|C)P(C)/P(A) = (.65 x .6)/.49 = .80 (approx)

In this simple example, C can take one of two possible values {arriving early, arriving late} and there is only one attribute, which can take one of two possible values {received upgrade, did not receive upgrade}. Next, we will generalize Bayes’ Law to multiple attributes and apply the naïve independence assumptions.
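The arithmetic of the upgrade example can be checked with a few lines of Python. This is only a sketch of the calculation stated above; the variable names are illustrative.

```python
# Quantities stated in the example.
p_early = 0.40                  # P(checks in at least two hours early)
p_late = 1 - p_early            # P(C): John arrives late
p_upgrade_given_early = 0.75
p_upgrade_given_late = 0.35

# P(A): no upgrade, using the law of total probability for the upgrade first.
p_upgrade = p_early * p_upgrade_given_early + p_late * p_upgrade_given_late
p_no_upgrade = 1 - p_upgrade

# P(A | C): no upgrade given a late arrival.
p_no_upgrade_given_late = 1 - p_upgrade_given_late

# Bayes' Law: P(C | A) = P(A | C) * P(C) / P(A)
p_late_given_no_upgrade = p_no_upgrade_given_late * p_late / p_no_upgrade
print(round(p_no_upgrade, 2), round(p_late_given_no_upgrade, 2))  # 0.49 0.8
```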

Apply the Naïve Assumption and Remove a Constant
• For observed attributes A = (a_1, a_2, …, a_m), we want to compute

$$P(C_i \mid A) = \frac{P(a_1, a_2, \ldots, a_m \mid C_i)\,P(C_i)}{P(a_1, a_2, \ldots, a_m)}, \quad i = 1, 2, \ldots, n$$

  and assign the classifier, C_i, with the largest P(C_i | A)
• Two simplifications to the calculations
  - Apply the naïve assumption: each a_j is conditionally independent of each other, then

$$P(a_1, a_2, \ldots, a_m \mid C_i) = P(a_1 \mid C_i)\,P(a_2 \mid C_i) \cdots P(a_m \mid C_i) = \prod_{j=1}^{m} P(a_j \mid C_i)$$

  - The denominator P(a_1, a_2, …, a_m) is a constant and can be ignored

The general approach is to assign the classifier label, C_i, to the object with attributes A = (a_1, a_2, …, a_m) that corresponds to the largest value of P(C_i|A). The probability that a set of attribute values A (comprised of m variables a_1 through a_m) should be labeled with a classification C_i equals the probability of the set of variables a_1 through a_m given C_i is true, times the probability of C_i, all divided by the probability of the set of attribute values a_1 through a_m.

The conditional independence assumption is that the probability of observing the value of a particular attribute given C_i is independent of the other attributes. This naïve assumption simplifies the calculation of P(a_1, a_2, …, a_m | C_i) as shown on the slide. Since P(a_1, a_2, …, a_m) appears in the denominator of P(C_i|A) for all values of i, removing the denominator will have no impact on the relative probability scores and will simplify calculations. Next, these two simplifications to the calculations will be applied to build the Naïve Bayesian Classifier.

Building a Naïve Bayesian Classifier
• Applying the two simplifications

$$P(C_i \mid a_1, a_2, \ldots, a_m) \propto \left( \prod_{j=1}^{m} P(a_j \mid C_i) \right) P(C_i), \quad i = 1, 2, \ldots, n$$

• To build a Naïve Bayesian Classifier, collect the following statistics from the training data:
  - P(C_i) for all the class labels.
  - P(a_j | C_i) for all possible a_j and C_i
  - Assign the classifier label, C_i, that maximizes the value of

$$\left( \prod_{j=1}^{m} P(a_j \mid C_i) \right) P(C_i), \quad i = 1, 2, \ldots, n$$

Applying the two simplifications, P(C_i | a_1, a_2, …, a_m) is proportional to the product of the various P(a_j|C_i), for j = 1, 2, …, m, times P(C_i). From a training dataset, these probabilities can be computed and stored for future classifier assignments. We now return to the credit applicant example.

Naïve Bayesian Classifiers for the Credit Example
• Class labels: {good, bad}
  - P(good) = 0.7
  - P(bad) = 0.3
• Conditional Probabilities
  - P(own|bad) = 0.62
  - P(own|good) = 0.75
  - P(rent|bad) = 0.23
  - P(rent|good) = 0.14
  - … and so on

To build a Naïve Bayesian Classifier we need to collect the following statistics:
1. Probability of all class labels – probability of good credit and probability of bad credit. From all the data available in the training set we determine P(good) = 0.7 and P(bad) = 0.3.
2. In the training set, there are several attributes: personal_status, job, housing, and saving_status. For each attribute and its possible values, we need to compute the conditional probabilities given bad or good credit. For example, relative to the housing attribute, we need to compute P(own|bad), P(own|good), P(rent|bad), P(rent|good), etc.
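Collecting these two kinds of statistics amounts to counting. The Python sketch below is one possible way to do it, assuming the training data is a list of attribute tuples with a parallel list of labels. The function names and the six-row toy data set are made up for illustration and do not come from the lecture's credit data; no smoothing is applied.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Collect P(C_i) and P(a_j | C_i) as relative frequencies."""
    n = len(labels)
    class_counts = Counter(labels)
    priors = {c: k / n for c, k in class_counts.items()}
    # cond_counts[(j, value, c)] = number of class-c rows whose j-th attribute is value
    cond_counts = defaultdict(int)
    for row, c in zip(rows, labels):
        for j, value in enumerate(row):
            cond_counts[(j, value, c)] += 1
    cond = {key: k / class_counts[key[2]] for key, k in cond_counts.items()}
    return priors, cond

def score(priors, cond, row):
    """Return P(C_i) * prod_j P(a_j | C_i) for every class label."""
    scores = {}
    for c, prior in priors.items():
        s = prior
        for j, value in enumerate(row):
            s *= cond.get((j, value, c), 0.0)  # unseen pair gives 0 (no smoothing)
        scores[c] = s
    return scores

def classify(priors, cond, row):
    """Assign the label with the largest score."""
    scores = score(priors, cond, row)
    return max(scores, key=scores.get)

# Made-up toy data with attributes (housing, job); not the lecture's data set.
rows = [("own", "skilled"), ("own", "skilled"), ("rent", "unskilled"),
        ("own", "unskilled"), ("rent", "skilled"), ("rent", "unskilled")]
labels = ["good", "good", "good", "good", "bad", "bad"]
priors, cond = train_naive_bayes(rows, labels)
print(classify(priors, cond, ("own", "skilled")))
```

Storing the tables once and reusing them for every new object is what makes scoring cheap.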

Naïve Bayesian Classifier for a Particular Applicant
• Given applicant attributes of A = {female single, owns home, self-employed, savings > $1000}

a_j             C_i    P(a_j | C_i)
female single   good   0.28
female single   bad    0.36
own             good   0.75
own             bad    0.62
self emp        good   0.14
self emp        bad    0.17
savings>1K      good   0.06
savings>1K      bad    0.02

P(good|A) ~ (0.28*0.75*0.14*0.06)*0.7 = 0.0012
P(bad|A) ~ (0.36*0.62*0.17*0.02)*0.3 = 0.0002

• Since P(good|A) > P(bad|A), assign the applicant the label "good" credit

Here we have an example of an applicant who is female, single, owns a home, is self-employed and has savings over $1000 in her savings account. How will we classify this person? Will she be scored as a person with good or bad credit?

Having built the classifier with the training set we find P( good|A ) which is equal to 0.0012 (see the computation on the slide) and P( bad|A ) is 0.0002. Since P( good|A ) is the maximum of the two probability scores, we assign the label “good” credit. The score is only proportional to the probability. It doesn't equal the probability, because we haven't included the denominator. However, both formulas have the same denominator, so we don't need to calculate it in order to know which quantity is bigger.
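The two scores can be reproduced with a short Python sketch that uses only the probabilities from the slide table. The dictionary layout is an illustrative choice, and the normalization at the end assumes the naïve model, so it should be read as a relative score rather than a reliable probability.

```python
from math import prod

priors = {"good": 0.7, "bad": 0.3}
# P(a_j | C_i) values copied from the slide table.
cond = {
    "good": {"female single": 0.28, "own": 0.75, "self emp": 0.14, "savings>1K": 0.06},
    "bad":  {"female single": 0.36, "own": 0.62, "self emp": 0.17, "savings>1K": 0.02},
}
applicant = ["female single", "own", "self emp", "savings>1K"]

# Score for each class: P(C_i) times the product of the P(a_j | C_i).
scores = {c: priors[c] * prod(cond[c][a] for a in applicant) for c in priors}
label = max(scores, key=scores.get)

# Both scores share the same omitted denominator, so dividing by their sum
# rescales them to add up to 1 without changing which one is larger.
posterior = {c: s / sum(scores.values()) for c, s in scores.items()}
print({c: round(s, 4) for c, s in scores.items()}, label)
```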

Notice, though, how small in magnitude these scores are. When we are looking at problems with a large number of attributes, or attributes with a very high number of levels, these values can become very small in magnitude.

Naïve Bayesian Implementation Considerations
• Numerical underflow
  - Resulting from multiplying several probabilities near zero
  - Preventable by computing the logarithm of the products
• Zero probabilities due to unobserved attribute/classifier pairs
  - Resulting from rare events
  - Handled by smoothing (adjusting each probability by a small amount)
• Assign the classifier label, C_i, that maximizes the value of

$$\sum_{j=1}^{m} \log P'(a_j \mid C_i) + \log P'(C_i)$$

  where i = 1, 2, …, n and P' denotes the adjusted probabilities

Multiplying several probability values, each possibly close to zero, can lead to numerical underflow. So an important implementation guideline is to compute the logarithm of the product of the probabilities, which is equivalent to the summation of the logarithms of the probabilities. Although the risk of underflow increases as the number of attributes increases, logarithms should be used regardless of the number of attribute dimensions.

Additionally, to address the possibility of probabilities equal to zero, smoothing techniques can be employed to adjust the probabilities to ensure non-zero values. Applying a smoothing technique assigns a small non-zero probability to rare events not included in the training dataset. Smoothing also avoids taking the logarithm of zero.

The R implementation of Naïve Bayes incorporates the smoothing directly into the probability tables. Essentially, the Laplace smoothing that R uses adds one (or a small value) to every count. For example, if we have 100 "good" customers, and 20 of them rent their housing, the "raw" P(rent | good) = 20/100 = 0.2; with Laplace smoothing adding one to the counts, the calculation would be P(rent | good) ~ (20 + 1)/(100+3) = 0.20388, where there are 3 possible values for housing (own, rent, for free). Fortunately, the use of logarithms and smoothing techniques is already implemented in standard software packages for Naïve Bayes Classifiers. However, if for performance reasons the Naïve Bayes Classifier algorithm needs to be coded directly into an application, these considerations should be implemented.
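Both considerations are small when coded by hand. The Python sketch below is an illustration under stated assumptions, not the R package's code: `laplace` and `log_score` are hypothetical helper names, and the list of 100 probabilities of 1e-5 is an artificial input chosen only to trigger underflow.

```python
from math import log

def laplace(count, class_total, n_values, alpha=1):
    """Smoothed P'(a_j | C_i): add alpha to the count and alpha per possible value to the total."""
    return (count + alpha) / (class_total + alpha * n_values)

def log_score(prior, cond_probs):
    """log P'(C_i) + sum over j of log P'(a_j | C_i)."""
    return log(prior) + sum(log(p) for p in cond_probs)

# The rent example from the text: 20 of 100 "good" customers rent, 3 housing values.
p_rent_good = laplace(20, 100, 3)
print(round(p_rent_good, 5))  # 0.20388

# An attribute value never seen with a class still gets a non-zero probability.
p_unseen = laplace(0, 100, 3)

# Artificial input: the plain product underflows to 0.0, the log sum stays finite.
probs = [1e-5] * 100
product = 1.0
for p in probs:
    product *= p
print(product, log_score(0.7, probs))
```

Because the logarithm is monotonic, the class with the largest log score is also the class with the largest product.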

Diagnostics
• Hold-out data
  - How well does the model classify new instances?
• Cross-validation
• ROC curve/AUC

The diagnostics we used in regression can be used to validate the effectiveness of the model we built. The technique of using hold-out data, performing N-fold cross-validation, and using the ROC/Area Under the Curve methods can be deployed with the Naïve Bayesian Classifier as well.

Diagnostics: Confusion Matrix

                 Predicted good   Predicted bad   Total
Actual good      671 (TP)         29 (FN)         700
Actual bad       38 (FP)          262 (TN)        300
Total            709              291             1000

TP = true positives, FN = false negatives, FP = false positives, TN = true negatives

Overall success rate (or accuracy): (TP + TN) / (TP + TN + FP + FN) = (671 + 262)/1000 ≈ 0.93
TPR: TP / (TP + FN) = 671 / (671 + 29) = 671/700 ≈ 0.96
FPR: FP / (FP + TN) = 38 / (38 + 262) = 38/300 ≈ 0.13
FNR: FN / (TP + FN) = 29 / (671 + 29) = 29/700 ≈ 0.04
Precision: TP / (TP + FP) = 671/709 ≈ 0.95
Recall (or TPR): TP / (TP + FN) ≈ 0.96

A confusion matrix is a specific table layout that allows visualization of the performance of a model. In the hypothetical example of a confusion matrix shown:

Of 1000 credit score samples, the model predicted 709 as good credit and 291 as bad credit. Of the 700 actual good credits, the model predicted 29 as bad, and similarly 38 of the 300 actual bad credits were predicted as good. All correct guesses are located on the diagonal of the table, so it is easy to visually inspect the table for errors, as they are represented by any non-zero values outside the diagonal. We define the overall success rate (or accuracy) as a metric of what we got right: the ratio between the sum of the diagonal values (i.e., TP and TN) and the sum of the table. In other words, the confusion table of a good model has large numbers on the diagonal and small (ideally zero) numbers off the diagonal.

We saw a true positive rate (TPR) and a false positive rate (FPR) when we discussed ROC curves:
• TPR – what percent of positive instances did we correctly identify.
• FPR – what percent of negatives we marked positive.
Additionally we can measure the false negative rate (FNR):
• FNR – what percent of positives we marked negative
The computations of TPR, FPR and FNR are shown in the slide.
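The rates on the slide follow directly from the four cell counts. A minimal Python sketch, treating "good" as the positive class as the slide does:

```python
# Counts from the confusion matrix on the slide ("good" is the positive class).
TP, FN = 671, 29    # actual good: predicted good / predicted bad
FP, TN = 38, 262    # actual bad:  predicted good / predicted bad

metrics = {
    "accuracy": (TP + TN) / (TP + TN + FP + FN),
    "TPR (recall)": TP / (TP + FN),
    "FPR": FP / (FP + TN),
    "FNR": FN / (TP + FN),
    "precision": TP / (TP + FP),
}
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Note that TPR and FNR share the denominator TP + FN, so they always sum to 1.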

Precision and Recall are accuracy metrics used by the information retrieval community; they are often used to characterize classifiers as well. We will detail these metrics in lesson 8 of this module.

Note:
• precision – what percent of things we marked positive really are positive
• recall – what percent of positive instances did we correctly identify. Recall is equivalent to TPR.

Naïve Bayesian Classifier - Reasons to Choose (+) and Cautions (-)

Reasons to Choose (+)
• Handles missing values quite well
• Robust to irrelevant variables
• Easy to implement
• Easy to score data
• Resistant to over-fitting
• Computationally efficient
• Handles very high dimensional problems
• Handles categorical variables with a lot of levels

Cautions (-)
• Numeric variables have to be discrete (categorized)
  - Intervals
• Sensitive to correlated variables
  - "Double-counting"
• Not good for estimating probabilities
  - Stick to class label or yes/no

The Reasons to Choose (+) and Cautions (-) of the Naïve Bayesian Classifier are listed. Unlike logistic regression, missing values are handled well by the Naïve Bayesian Classifier. It is also very robust to irrelevant variables (irrelevant variables are distributed among all the classes and their effects are not pronounced).

The model is easy to implement and we will see how easily a basic version can be implemented in the lab without using any packages. Scoring data (predicting) is very simple and the model is resistant to over-fitting. (Over-fitting refers to fitting the training data so well that we also fit its idiosyncrasies, that is, aspects of the training data that are not relevant in characterizing the data in general.) It is computationally efficient and handles high dimensional problems efficiently. Unlike logistic regression, the Naïve Bayesian Classifier handles categorical variables with a lot of levels.

The Cautions (-) are that it is sensitive to correlated variables, as the algorithm double counts the effect of the correlated variables. For example, people with low income tend to default and people with low credit tend to default. It is also true that people with low income tend to have low credit. If we try to score “default” with both low income and low credit as variables, we will see the double counting effect in our model output and in the scoring.
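The double counting effect can be made visible with an extreme case: an attribute that is an exact copy of another one. The Python sketch below reuses the conditional probabilities from the credit applicant example; the duplicated fifth attribute is hypothetical and only serves to show how the score shifts when the same evidence is counted twice.

```python
from math import prod

priors = {"good": 0.7, "bad": 0.3}
# Conditional probabilities for the applicant, from the credit example
# (female single, own, self emp, savings>1K).
cond = {"good": [0.28, 0.75, 0.14, 0.06], "bad": [0.36, 0.62, 0.17, 0.02]}

def normalized(cond_by_class):
    """Naive Bayes scores rescaled so that they sum to 1."""
    scores = {c: priors[c] * prod(ps) for c, ps in cond_by_class.items()}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

base = normalized(cond)
# Hypothetical fifth attribute that is a perfect copy of "savings>1K".
duplicated = normalized({c: ps + [ps[-1]] for c, ps in cond.items()})
print(round(base["good"], 3), round(duplicated["good"], 3))  # 0.844 0.942
```

The copy adds no new information, yet the normalized score for "good" rises because the savings evidence is multiplied in twice.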

Though the probabilities are provided as an output of the scored data, the Naïve Bayesian Classifier is not very reliable for probability estimation and should be used for class label assignments only. The Naïve Bayesian Classifier in its simple form is used only with categorical variables, and any continuous variables should be rendered discrete into intervals. You will learn more about this in the lab. However, it is not necessary to have the continuous variables as “discrete”; several standard implementations can handle continuous variables as well.

Check Your Knowledge
1. Consider the following Training Data Set:

X1   X2   X3   Y
1    1    1    0
1    1    0    0
0    0    0    0
0    1    0    1
1    0    1    1
0    1    1    1

   • Apply the Naïve Bayesian Classifier to this data set and compute the probability score for P(y = 1|X) for X = (1,0,0). Show your work.
2. List some prominent use cases of the Naïve Bayesian Classifier.
3. What gives the Naïve Bayesian Classifier the advantage of being computationally inexpensive?
4. Why should we use log-likelihoods rather than pure probability values in the Naïve Bayesian Classifier?

Your Thoughts? Record your answers here. More Check Your Knowledge questions are on the next page.
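For question 1, one way to check hand calculations is the Python sketch below. It assumes plain relative frequencies with no smoothing, and it reports both the unnormalized scores and the score for y = 1 after dividing by the sum of the two scores; exact fractions are used so the work is easy to compare against.

```python
from fractions import Fraction

# Rows are (X1, X2, X3, Y), copied from the training data set.
data = [(1, 1, 1, 0), (1, 1, 0, 0), (0, 0, 0, 0),
        (0, 1, 0, 1), (1, 0, 1, 1), (0, 1, 1, 1)]
x = (1, 0, 0)

def raw_score(y):
    """P(Y = y) times the product of P(X_j = x_j | Y = y), without smoothing."""
    rows = [r for r in data if r[3] == y]
    s = Fraction(len(rows), len(data))               # P(Y = y)
    for j, value in enumerate(x):
        matches = sum(1 for r in rows if r[j] == value)
        s *= Fraction(matches, len(rows))            # P(X_j = value | Y = y)
    return s

scores = {y: raw_score(y) for y in (0, 1)}
p_y1 = scores[1] / (scores[0] + scores[1])           # normalized score for y = 1
print(scores[1], scores[0], p_y1)  # 1/54 2/27 1/5
```

Since the score for y = 0 is larger, the classifier would label X = (1,0,0) as y = 0.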

Check Your Knowledge (Continued)
5. What is a confusion matrix and how is it used to evaluate the effectiveness of the model?
6. Consider the following data set with two input features, temperature and season:

Temperature    Season   Electricity Usage
-10 to 50 F    Winter   High
50 to 70 F     Winter   Low
70 to 85 F     Summer   Low
85 to 110 F    Summer   High

   • What is the Naïve Bayesian assumption?
   • Is the Naïve Bayesian assumption satisfied for this problem?

Your Thoughts? Record your answers here.

Naïve Bayesian Classifiers - Summary
During this lesson the following topics were covered:
• Naïve Bayesian Classifier
• Theoretical foundations of the classifier
• Use cases
• Evaluating the effectiveness of the classifier
• The Reasons to Choose (+) and Cautions (-) with the use of the classifier

This lesson covered these topics. Please take a moment to review them.