
Copyright © 2014 EMC Corporation. All rights reserved.

Advanced Analytics – Theory and Methods
Module 4: Analytics Theory/Methods

During this lesson the following topics are covered:

Lesson 4a: Linear Regression

• General description of regression models
• Technical description of a linear regression model
• Common use cases for the linear regression model
• Interpretation and scoring with the linear regression model
• Diagnostics for validating the linear regression model
• The Reasons to Choose (+) and Cautions (−) of the linear regression model

The topics covered in this lesson are listed.

Regression

• Regression focuses on the relationship between an outcome and its input variables.
  - Provides an estimate of the outcome based on the input values.
  - Models how changes in the input variables affect the outcome.
• The outcome can be continuous or discrete.
• Possible use cases:
  - Estimate the lifetime value (LTV) of a customer and understand what influences LTV.
  - Estimate the probability that a loan will default and understand what leads to default.
• Our approaches: linear regression and logistic regression.

The term "regression" was coined by Francis Galton in the nineteenth century to describe a biological phenomenon: the heights of descendants of tall ancestors tend to regress down towards an average (a phenomenon also known as regression toward the mean).

Specifically, regression analysis helps one understand how the value of the dependent variable (also referred to as outcome) changes when any one of the independent (or input) variables changes, while the other independent variables are held fixed. Regression analysis estimates the conditional expectation of the outcome variable given the input variables — that is, the mean value of the outcome variable when the input variables are held fixed.

Regression focuses on the relationship between the outcome and the inputs. It also provides a model that has some explanatory value, in addition to estimating outcomes. Although social scientists use regression primarily for its explanatory value, data scientists apply regression techniques as predictors or classifiers.

The outcome can be continuous or discrete. For continuous outcomes, such as income, this lesson examines the use of linear regression. For discrete outcomes of a categorical attribute, such as success/fail, gender, or political party affiliation, the next lesson presents the use of logistic regression.

Linear Regression

• Used to estimate a continuous value as a linear (additive) function of other variables
  - Income as a function of years of education, age, and gender
  - House sales price as a function of square footage, number of bedrooms/bathrooms, and lot size
• Outcome variable is continuous.
• Input variables can be continuous or discrete.
• Model output:
  - A set of estimated coefficients that indicate the relative impact of each input variable on the outcome
  - A linear expression for estimating the outcome as a function of input variables

Linear regression is a commonly used technique for modeling a continuous outcome. It is simple and works well in many instances. It is recommended to try linear regression first; if the results are determined to be unreliable, other more complicated models should be considered. Alternative modeling approaches include ridge regression, local linear regression, regression trees, and neural nets (these models are out of scope for this course). Linear regression models a continuous outcome, such as income or housing sales prices, as a linear or additive function of other input variables. The input variables can be continuous or discrete.

Linear Regression Model

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_{p-1} x_{p-1} + \varepsilon$$

where
  $y$ is the outcome variable
  $x_j$ are the input variables, for $j = 1, 2, \ldots, p-1$
  $\beta_0$ is the value of $y$ when each $x_j$ equals zero
  $\beta_j$ is the change in $y$ based on a unit change in $x_j$
  $\varepsilon \sim N(0, \sigma^2)$, and the $\varepsilon$'s are independent of each other

In linear regression, the outcome variable is expressed as a linear combination of the input variables. For a given set of input variables, the linear regression model provides the expected outcome value. Unless the situation being modeled is purely deterministic, there will be some random variability in the outcome. This random error, denoted by ε, is assumed to be normally distributed with a mean of zero and a constant variance (σ²).

Example: Linear Regression with One Input Variable

$$y = \beta_0 + \beta_1 x_1 + \varepsilon$$

• $x_1$ – the number of employees reporting to a manager
• $y$ – the hours per week spent in meetings by the manager

In this example, the human resources department decides to examine the effect that the number of employees directly reporting to a manager has on how many hours per week the manager spends in meetings. The expected time spent in meetings is represented by the equation of a line with unknown intercept and slope. Suppose the true value of the intercept is 3.27 hours and the true value of the slope is 2.2 hours per employee. Then, a manager can expect to spend an additional 2.2 hours per week in meetings for every additional employee.

The distribution of the error term is represented by the rotated normal distribution plots provided at specific values of $x_1$. For example, a typical manager with 8 employees may be expected to spend 20.87 hours per week in meetings, but any amount of time from 15 to 27 hours per week is very probable.
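Although coefficient estimation is covered formally later in this lesson, the model can be previewed with a short R sketch that simulates data from this hypothetical scenario, using the stated true values (intercept 3.27, slope 2.2) and an assumed error standard deviation of 3 hours (our choice; the course does not specify σ):

```r
set.seed(42)                            # for reproducibility
n   <- 100
x1  <- sample(1:15, n, replace = TRUE)  # employees reporting to each manager
eps <- rnorm(n, mean = 0, sd = 3)       # normally distributed error; sd = 3 is assumed
y   <- 3.27 + 2.2 * x1 + eps            # hours per week spent in meetings

fit <- lm(y ~ x1)                       # ordinary least squares fit
coef(fit)                               # estimates should be close to 3.27 and 2.2
```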

This example illustrates a theoretical regression model. In practice, it is necessary to collect and prepare the necessary data and use a software package such as R to estimate the values of the coefficients. Coefficient estimation is covered later in this lesson. Additional variables could be included in this model. For example, a categorical attribute can be added to this linear regression model to account for the manager's functional organization, such as engineering, finance, manufacturing, or sales. It may be somewhat tempting to include one variable, $x_2$, to represent the organization and denote engineering by 1, finance by 2, manufacturing by 3, and sales by 4. However, such an approach incorrectly suggests that the interval between the assigned numeric values has meaning (for example, sales is three more than engineering). The proper implementation of categorical attributes in a regression model will be addressed next.

Representing Categorical Attributes

• For a categorical attribute with m possible values
  - Add m−1 binary (0/1) variables to the regression model
  - The remaining category is represented by setting the m−1 binary variables equal to zero

Possible Situation                          Input Variables (employees, finance, mfg, sales)
Finance manager with 8 employees            (8, 1, 0, 0)
Manufacturing manager with 8 employees      (8, 0, 1, 0)
Sales manager with 8 employees              (8, 0, 0, 1)
Engineering manager with 8 employees        (8, 0, 0, 0)

In expanding the previous example to include the manager's functional organization, the input variables, denoted earlier by the x's, have been replaced by more meaningful variable names. In addition to the employees variable for the number of employees reporting to a manager, three binary variables have been added to the model to identify finance, manufacturing (mfg), and sales managers. If a manager belongs to one of these three functional organizations, the corresponding variable is set to 1. Otherwise, the variable is set to 0. Thus, for four functional organizations, engineering is represented by the case where the three binary variables are each set to 0. For this categorical variable, engineering is considered the reference level. For example, the coefficient of finance denotes the relative difference from the reference level. Choosing a different organization as the reference level changes the coefficient values, but not their relative differences. Interpreting the coefficients for categorical variables relative to the reference level is covered later in this lesson.

In general, a categorical attribute with m possible distinct values can be represented in the linear regression model by adding m−1 binary variables. For a categorical attribute with only two possible values, such as gender (female or male), only one binary variable needs to be added, with one value assigned 0 and the other assigned 1.

Suppose it was decided to include the manager's U.S. state of employment in the regression model. Then 49 binary variables would have to be added to the regression model to account for 50 states. However, that many categorical values can be quite cumbersome to interpret or analyze. Alternatively, it may make more sense to group the states into geographic regions or into other groupings such as type of location: headquarters, plant, field office, or remote. In the latter case, only three binary variables would need to be added.

The expanded model for the manager example is:

$$y = \beta_0 + \beta_1\,\text{employees} + \beta_2\,\text{finance} + \beta_3\,\text{mfg} + \beta_4\,\text{sales} + \varepsilon$$
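As a brief illustration of this encoding, R performs the m−1 binary-variable expansion automatically when a categorical attribute is stored as a factor; the first factor level (here, engineering) becomes the reference level. The data values in this sketch are invented for illustration:

```r
# Invented example data for the manager scenario
mgr <- data.frame(
  employees = c(8, 8, 8, 8, 12, 5),
  org = factor(c("engineering", "finance", "mfg", "sales", "finance", "mfg"),
               levels = c("engineering", "finance", "mfg", "sales"))
)

# model.matrix() shows the m - 1 = 3 binary columns that lm() would create;
# engineering rows have all three set to 0 (the reference level)
model.matrix(~ employees + org, data = mgr)

# In a regression fit, the expansion happens implicitly, e.g.:
# fit <- lm(meeting_hours ~ employees + org, data = mgr)
```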

Fitting a Line with Ordinary Least Squares (OLS)

• Choose the line that minimizes:

$$\sum_{i=1}^{n} \left[ y_i - (b_0 + b_1 x_{i,1} + \cdots + b_{p-1} x_{i,p-1}) \right]^2$$

• Provides the coefficient estimates, denoted $b_j$

Once a dataset has been collected, the objective is to fit the "best" line to the data points. A very common approach to determine the best fitting line is to choose the line that minimizes the sum of the squares of the differences between the observed outcomes in the dataset and the estimated outcomes based on the equation of the fitted line. This method is known as Ordinary Least Squares (OLS). In the case of one input variable, the differences or distances between the observed outcome values and the estimated values along the fitted regression line are presented in the provided graph as the vertical line segments.

Although this minimization problem can be solved by hand calculations, it becomes very difficult for more than one input variable. Mathematically, the problem involves calculating the inverse of a matrix. However, other methods such as QR decomposition are used to minimize numerical round-off errors. Depending on the implementation, the required storage to perform the OLS calculations may grow quadratically as the number of input variables grows. For a large number of observations and many variables, the storage and RAM requirements should be carefully considered.
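The following sketch contrasts the two computations on simulated data: solving the normal equations directly versus using a QR decomposition (the approach R's lm() takes internally). This is an illustration under invented data, not how one would normally fit a model in practice:

```r
set.seed(1)
n <- 100
X <- cbind(1, runif(n), runif(n))            # design matrix with intercept column
y <- X %*% c(3, 2, -1) + rnorm(n, sd = 0.5)  # simulated outcome

# Normal equations: b = (X'X)^(-1) X'y; involves a matrix inverse and is
# sensitive to round-off when X is ill-conditioned
b_normal <- solve(t(X) %*% X, t(X) %*% y)

# QR decomposition: numerically more stable least-squares solution
b_qr <- qr.solve(X, y)

cbind(b_normal, b_qr)                        # the two estimates agree here
```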

Note the provided equation of the fitted line. The caret over y, read "y-hat", is used to denote the estimated outcome for a given set of inputs. This notation helps to distinguish the observed y values from the fitted ŷ values. In this example, the estimated coefficients are $b_0 = 3.21$ and $b_1 = 2.19$, giving the fitted line:

$$\hat{y} = 3.21 + 2.19\,x_1$$

Interpreting the Estimated Coefficients, $b_j$

• Coefficients for numeric input variables
  - Change in outcome due to a unit change in the input variable*
  - Example: $b_1 = 2.2$
  - Extra 2.2 hrs/wk in meetings for each additional employee managed*
• Coefficients for binary input variables
  - Represent the additive difference from the reference level*
  - Example: $b_2 = 0.5$
  - Finance managers meet 0.5 hr/wk more than engineering managers do*
• Statistical significance of each coefficient
  - Are the coefficients significantly different from zero?
  - For small p-values (say < 0.05), the coefficient is statistically significant

* when all other input values remain the same

For numeric variables, the estimated coefficients are interpreted in the same way as the concept of slope was introduced in algebra. For a unit change in a numeric variable, the outcome will change by the amount and in the direction of the corresponding coefficient. A fitted linear regression model is provided for the example where the hours per week spent in meetings by managers are modeled as a function of the number of employees and the manager's functional organization:

$$\hat{y} = 4.0 + 2.2\,\text{employees} + 0.5\,\text{finance} - 1.9\,\text{mfg} + 0.6\,\text{sales}$$

In this case, the coefficient of 2.2, corresponding to the employees variable, means that the expected amount of time spent in meetings increases by 2.2 hours per week for each additional employee reporting to a manager.

The interpretation of a binary variable coefficient is slightly different. Because a binary variable assumes only a value of 0 or 1, the coefficient is the additive difference or shift in the outcome from the reference level. In this example, engineering is the reference level for the functional organizations. So, a manufacturing manager would be expected to spend 1.9 hours per week less in meetings than an engineering manager when the number of employees is the same.

When used to fit linear regression models, many statistical software packages will provide a p-value with each coefficient estimate. This p-value can be used to determine if the coefficient is significantly different from zero. In other words, the software performs a hypothesis test where the null hypothesis is that the coefficient equals zero and the alternative hypothesis is that the coefficient does not equal zero. For small p-values (say < 0.05), the null hypothesis would be rejected and the corresponding variable should remain in the linear regression model. If a larger p-value is observed, the null hypothesis would not be rejected and the corresponding variable should be considered for removal from the model.
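In R, the coefficient estimates and their p-values are reported together by summary(). A minimal sketch, reusing the fitted object from the earlier simulated meetings example:

```r
fit <- lm(y ~ x1)             # from the simulated meetings data above

# Each row gives the estimate, its standard error, the t statistic, and the
# p-value Pr(>|t|) for the test of H0: coefficient = 0
summary(fit)$coefficients

# Coefficients with p-values below roughly 0.05 are statistically
# significant, and their variables are usually kept in the model
```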

Diagnostics – Examining Residuals

• Residuals
  - Differences between the observed and estimated outcomes
  - The observed values of the error term, ε, in the regression model
  - Expressed as: $e_i = y_i - \hat{y}_i$ for $i = 1, 2, \ldots, n$
• Errors are assumed to be normally distributed with
  - A mean of zero
  - Constant variance

Residuals are the differences between the observed and the estimated outcomes. The residuals are the observed values of the error term in the linear regression model. In linear regression modeling, these error terms are assumed to be normally distributed with a mean of zero and a constant variance regardless of the input variable values. Although this normality assumption is not required to fit a line using OLS, it is the basis for many of the hypothesis tests and confidence interval calculations performed by statistical software packages such as R. The next few slides address the use of residual plots to evaluate adherence to this assumption, as well as to assess the appropriateness of a linear model for a given dataset.

Diagnostics – Plotting Residuals

[Figure: four example residual plots – Ideal Residual Plot, Quadratic Trend, Non-centered, and Non-constant Variance]

When plotting the residuals against the estimated or fitted outcome values, the ideal residual plot will show residuals symmetrically centered around zero with a constant variance and with no apparent trends. If the ideal residual plot is not observed, it is often necessary to add additional variables to the model or transform some of the existing input and outcome variables. Common transformations include the square root and logarithmic functions.

Residual plots are also useful for identifying outliers that may require further investigation or special handling.
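These residual plots, along with the histogram and Q-Q plot discussed next, are straightforward to produce in R. A minimal sketch, assuming a fitted lm object named fit as in the earlier examples:

```r
res <- residuals(fit)       # fit: a fitted lm object, as in the earlier sketches

# Residuals vs. fitted values: look for symmetry around zero, constant
# variance, and no trends or curvature
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Histogram: should be roughly bell-shaped and centered at zero
hist(res, breaks = 20, main = "Residuals")

# Q-Q plot: points should track the reference line if the residuals
# are approximately normal
qqnorm(res)
qqline(res)
```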

Diagnostics – Residual Normality Assumption

[Figure: Ideal Histogram and Ideal Q-Q Plot of residuals]

The provided histogram shows that the residuals are centered around zero and appear to be symmetric about zero in a bell-shaped curve, as one would expect for a normally distributed random variable. Another option is to examine a Q-Q plot that compares the observed data against the quantiles (Q) of the assumed distribution. In this example, the observed residuals follow a theoretical normal distribution represented by the line. If any significant departures of the plotted points from the line are observed, transformations, such as logarithms, may be required to satisfy the normality assumption.

Diagnostics – Using Hold-out Data

• Hold-out data
  - Training and testing datasets
  - Does the model predict well on data it hasn't seen?
• N-fold cross validation
  - Partition the data into N groups.
  - Holding out each group:
    - Fit the model
    - Calculate the residuals on the group
  - Estimated prediction error is the average over all the residuals.

[Figure: a dataset partitioned into three groups D1, D2, D3; each of Training Sets #1–#3 trains on two groups and tests on the third]

Creating a hold-out dataset before you fit the model (as discussed for Apriori diagnostics earlier, in Lesson 2 of this module), and using that dataset to estimate prediction error, is by far the easiest approach.
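A minimal hold-out sketch in R, assuming a hypothetical data frame df with outcome column y (an 80/20 split is a common, arbitrary choice):

```r
# df: hypothetical data frame with outcome column y (assumed, not from the course)
set.seed(7)
idx   <- sample(nrow(df), size = floor(0.8 * nrow(df)))  # 80% of rows for training
train <- df[idx, ]
test  <- df[-idx, ]                                      # 20% held out

fit  <- lm(y ~ ., data = train)         # fit only on the training data
pred <- predict(fit, newdata = test)
mean((test$y - pred)^2)                 # prediction error (MSE) on unseen data
```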

N-fold cross validation tells you whether your set of variables is reasonable. This method is used when you don't have enough data to create a hold-out dataset. N-fold cross validation is performed by randomly splitting the dataset into N non-overlapping subsets or groups, fitting a model using N−1 groups, and predicting its performance using the group that was held out. This process is repeated a total of N times, holding out each group in turn. After completing the N model fits, you estimate the mean performance of the model (and perhaps also the variance or standard deviation of the performance).
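The procedure can be written as an explicit loop, as in this sketch (reusing the hypothetical df and outcome column y from the hold-out sketch above):

```r
# df: hypothetical data frame with outcome column y, as in the hold-out sketch
n_folds <- 5
set.seed(7)
folds <- sample(rep(1:n_folds, length.out = nrow(df)))  # random fold labels

fold_mse <- numeric(n_folds)
for (k in 1:n_folds) {
  train <- df[folds != k, ]                  # fit on the other N-1 groups
  test  <- df[folds == k, ]                  # hold out group k
  fit_k <- lm(y ~ ., data = train)
  pred  <- predict(fit_k, newdata = test)
  fold_mse[k] <- mean((test$y - pred)^2)     # error on the held-out group
}

mean(fold_mse)   # estimated prediction error; sd(fold_mse) gives its spread
```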

"Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions ", by Seni and Elder, provides a succinct description of N -fold cross -validation. 13 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved.

Diagnostics – Other Considerations

• R²
  - The fraction of the variability in the outcome variable explained by the fitted regression model
  - Attains values from 0 (poorest fit) to 1 (perfect fit)
• Identify correlated input variables
  - Pair-wise scatterplots
  - Sanity check the coefficients
    - Are the magnitudes excessively large?
    - Do the signs make sense?

R² (a goodness-of-fit metric) is reported by all standard packages. It is the fraction of the variability in the outcome variable that the fitted model explains. The definition of R² is

$$R^2 = 1 - \frac{SS_{err}}{SS_{tot}}, \quad \text{where } SS_{err} = \sum_i (y_i - \hat{y}_i)^2 \text{ and } SS_{tot} = \sum_i (y_i - \bar{y})^2$$

For a good fit, we want an R² value near 1.

Regression modeling works best if the input variables are independent of each other. A simple way to look for correlated variables is to examine pair-wise scatterplots such as the one generated in Module 3 for the Iris dataset. If two input variables, $x_1$ and $x_2$, are linearly related to the outcome variable y, but are also correlated to each other, it may only be necessary to include one of these variables in the model. After fitting a regression model, it is useful to examine the magnitude and signs of the coefficients. Coefficients with large magnitudes or intuitively incorrect signs are also indications of correlated input variables. If the correlated variables remain in the fitted model, the predictive power of the regression model may not suffer, but its explanatory power will be diminished when the magnitude and signs of the coefficients do not make sense.

If correlated input variables need to remain in the model, restrictions on the magnitudes of the estimated coefficients can be accomplished with alternative regression techniques. Ridge regression, which applies a penalty based on the size of the coefficients, is one technique that can be applied. In fitting a linear regression model, the objective is to find the values of the coefficients that minimize the sum of the residuals squared. In ridge regression, a penalty term proportional to the sum of the squares of the coefficients is added to the sum of the residuals squared. A related technique is lasso regression, in which the penalty is proportional to the sum of the absolute values of the coefficients. Both of these techniques are outside the scope of this course.
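The R² definition above can be verified directly against what summary() reports. A minimal sketch, assuming the fitted object fit and outcome vector y from the earlier examples:

```r
y_pred <- fitted(fit)             # fit and y: from the earlier sketches
ss_err <- sum((y - y_pred)^2)     # sum of squared residuals
ss_tot <- sum((y - mean(y))^2)    # total sum of squares about the mean
1 - ss_err / ss_tot               # should match summary(fit)$r.squared
```

Although ridge and lasso are out of scope for this course, both are available in R through the widely used glmnet package. A sketch under the assumption that x is a numeric matrix of inputs and y is the outcome vector:

```r
library(glmnet)   # install.packages("glmnet") if needed

# x: numeric input matrix; y: outcome vector (both assumed, not from the course)
ridge_fit <- glmnet(x, y, alpha = 0)   # alpha = 0: penalty on squared coefficients
lasso_fit <- glmnet(x, y, alpha = 1)   # alpha = 1: penalty on absolute coefficients

# Cross-validation chooses the penalty strength lambda
cv_ridge <- cv.glmnet(x, y, alpha = 0)
coef(cv_ridge, s = "lambda.min")       # coefficients at the best lambda
```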

Linear Regression – Reasons to Choose (+) and Cautions (−)

Reasons to Choose (+):
• Concise representation (the coefficients)
• Robust to redundant or correlated variables (though some explanatory value is lost)
• Explanatory value: relative impact of each variable on the outcome
• Easy to score data

Cautions (−):
• Does not handle missing values well
• Assumes that each variable affects the outcome linearly and additively (variable transformations and modeling variable interactions can alleviate this; it is a good idea to take the log of monetary amounts or any variable with a wide dynamic range)
• Does not easily handle variables that affect the outcome in a discontinuous way (step functions)
• Does not work well with categorical attributes with a lot of distinct values (for example, ZIP code)

The estimated coefficients provide a concise representation of the outcome variable as a function of the input variables. The estimated coefficients provide the explanatory value of the model and are used to easily determine how the individual input variables affect the outcome. Linear regression is robust to redundant or correlated variables. Although the predictive power may not be impacted, the model does lose some explanatory value in the case of correlated variables. With the fitted model, it is also easy to score a given set of input values.

A caution is that linear regression does not handle missing values well. Another caution is that linear regression assumes that each variable affects the outcome linearly and additively. If some variables affect the outcome non-linearly and the relationships are not actually additive, the model will often not explain the data well. Variable transformations and modeling variable interactions can address this issue to some extent. Hypothesis testing and confidence intervals depend on the normality assumption of the error term. To satisfy the normality assumption, a common practice is to take the log of an outcome variable with a skewed distribution for a given set of input values. Also, linear regression models are not ideal for handling variables that affect the outcome in a discontinuous way. In the case of a categorical attribute with a large number of distinct values, the model becomes complex and computationally inefficient.

Check Your Knowledge

1. How is the measure of significance used in determining the explanatory value of a driver (input variable) with linear regression models?
2. Detail the challenges with categorical values in a linear regression model.
3. Describe the N-fold cross validation method used for diagnosing a fitted model.
4. List two use cases of linear regression models.
5. List and discuss two standard checks that you will perform on the coefficients derived from a linear regression model.

Your Thoughts?

Record your answers here.

Advanced Analytics – Theory and Methods

During this lesson the following topics were covered:

• General description of regression models
• Technical description of a linear regression model
• Common use cases for the linear regression model
• Interpretation and scoring with the linear regression model
• Diagnostics for validating the linear regression model
• The Reasons to Choose (+) and Cautions (−) of the linear regression model

Lesson 4a: Linear Regression – Summary

This lesson covered these topics. Please take a moment to review them.