Note: I need all of the 10 answers exact and accurate, not any crap with bull shit answers or excel data. Take your time and go through the following first. I will not finish the job before validating.

Copyright © 2014 EMC Corporation. All rights reserved.

Advanced Analytics - Theory and Methods
Module 4: Analytics Theory/Methods

Time Series Analysis

During this lesson the following topics are covered:
• Time Series Analysis and its applications in forecasting
• ARMA and ARIMA Models
• Implementing the Box-Jenkins Methodology using R
• Reasons to Choose (+) and Cautions (-) with Time Series Analysis

The topics covered in this lesson are listed. ARIMA and Box-Jenkins methodology are explained in the following slides.

Time Series Analysis
• Time Series: Ordered sequence of equally spaced values over time
• Time Series Analysis: Accounts for the internal structure of observations taken over time
  - Trend
  - Seasonality
  - Cycles
  - Random
• Goals
  - To identify the internal structure of the time series
  - To forecast future events
  - Example: Based on sales history, what will next December sales be?
• Method: Box-Jenkins (ARMA)

Businesses perform sales forecasting to look ahead in order to plan their investments, launch new products, decide when to close or withdraw products, etc. The sales forecasting process is a critical one for most businesses. Part of the sales forecasting process is to examine the past.

How well did we do in the last few months or what were our sales in the same time period for the last few years? Time Series Analysis provides a scientific methodology for sales forecasting.

Time Series Analysis is the analysis of sequential data across equally spaced units of time. Time Series is a basic research methodology in which data for one or more variables are collected for many observations at different time periods. The main objectives in Time Series Analysis are:

• To understand the underlying structure of the time series by breaking it down into its components.
• To fit a mathematical model and then forecast the future.

The time periods are usually regularly spaced, and the observations may be either univariate or multivariate. Univariate time series are those where only one variable is measured over time, whereas multivariate time series are those where multiple variables are measured simultaneously. The internal structure of the data may exhibit a trend, seasonality, or cycles.


Box-Jenkins Method: What is it?
• Models historical behavior to forecast the future
• Applies ARMA (Autoregressive Moving Average) models
  - Input: Time series
  - Accounting for trend and seasonality components
  - Output: Expected future value of the time series

The Box-Jenkins methodology, developed by Professors G.E.P. Box and G.M. Jenkins, enables forecasting from time series data with both high accuracy and low computational requirements. The technique may be applied to quickly produce forecasts that are as uncomplicated in form as the simple smoothing methods, or that involve a number of economic variables. In either case, it makes efficient use of the other predictive information contained in the data. It aims for the highest forecasting accuracy attainable from the variables on which the forecast is based.

The input for the model is the trend- and seasonality-adjusted time series, and the output is the expected future value of the time series.

The Box-Jenkins methodology applies autoregressive moving average (ARMA) models to find the best fit of a time series to its own past values, in order to make forecasts.

Use Cases

Forecast:
• Next month's sales
• Tomorrow's stock price
• Hourly power demand

The key application of Time Series Analysis is in forecasting. Economic and business planning, inventory and production control of industrial processes are some of the key applications in which time series analysis is deployed.

Time Series data provide useful information about the physical, biological, social or economic systems generating the time series, such as:

• Economics/Finance: share prices, profits, imports, exports, stock exchange indices
• Sociology: school enrollments, unemployment, crime rate
• Environment: amount of pollutants, such as suspended particulate matter (SPM), in the environment
• Meteorology: rainfall, temperature, wind speed
• Epidemiology: number of SARS cases over time
• Medicine: blood pressure measurements over time for evaluating drugs to control hypertension

Modeling a Time Series
• Let's model the time series as Y_t = T_t + S_t + R_t, t = 1, ..., n.
• T_t: Trend term
  - Air travel steadily increased over the last few years
• S_t: The seasonal term
  - Air travel fluctuates in a regular pattern over the course of a year
• R_t: Random component
  - To be modeled with ARMA

We present a simple model for the time series with the trend, seasonality and a random fluctuation. There is sometimes a low-frequency cyclic term as well, but we are ignoring that for simplicity.

Examples of trend and seasonality are also given on the slide.
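The additive model above can be illustrated with a small simulation. This is only a sketch: the slope, seasonal amplitude, noise level and start date are made-up values, and decompose() is just one way to estimate the three components.

```r
# Simulate Y_t = T_t + S_t + R_t for 10 years of monthly data (all values are made up).
set.seed(42)
n <- 120
t <- seq_len(n)
trend    <- 100 + 0.5 * t               # T_t: steady linear increase
seasonal <- 10 * sin(2 * pi * t / 12)   # S_t: pattern that repeats every 12 months
random   <- rnorm(n, sd = 2)            # R_t: random component
y <- ts(trend + seasonal + random, start = c(2000, 1), frequency = 12)

# decompose() estimates trend, seasonal and random parts using moving averages.
parts <- decompose(y, type = "additive")
names(parts)
```

Calling plot(parts) draws the observed series together with the three estimated components. The trend estimate is NA for the first and last six months because the centered moving average needs a full year of neighbors.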

Stationary Sequences
• Box-Jenkins methodology assumes the random component is a stationary sequence
  - Constant mean
  - Constant variance
  - Autocorrelation does not change over time (constant correlation of a variable with itself at different times)
• In practice, to obtain a stationary sequence, the data must be:
  - De-trended
  - Seasonally adjusted

A stationary sequence is a random sequence in which the joint probability distribution does not vary over time. In other words, the mean, variance and autocorrelations do not change in the sequence over time.

In order to render a sequence stationary, we need to remove the effects of trend and seasonality. The ARIMA model (implemented with Box-Jenkins) uses the method of differencing to render the data stationary.
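As a rough illustration of how differencing removes a changing mean, the sketch below builds a made-up series with a linear trend and compares the average level of its first and second halves before and after diff(). The slope of 0.3 and the helper half_means() are assumptions for this example only.

```r
set.seed(1)
n <- 200
y <- 0.3 * seq_len(n) + rnorm(n)   # linear trend plus noise: the mean changes over time

dy <- diff(y)                      # first difference: y[t] - y[t-1]

# Hypothetical helper: mean of the first half and of the second half of a vector.
half_means <- function(x) {
  h <- length(x) %/% 2
  c(first = mean(x[seq_len(h)]), second = mean(x[-seq_len(h)]))
}

half_means(y)    # the two halves sit at very different levels
half_means(dy)   # both halves are close to the slope, 0.3
```

Differencing turns the trend into a constant offset (the slope), so the mean no longer drifts with time.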

De-trending
• In this example, we see a linear trend, so we fit a linear model
  - T_t = m·t + b
• The de-trended series is then
  - Y1_t = Y_t - T_t
• In some cases, we may have to fit a non-linear model
  - Quadratic
  - Exponential

Trend in a time series is a slow, gradual change in some property of the series over the whole interval under investigation.

De-trending is a pre-processing step to prepare time series for analysis by methods that assume stationarity. A simple linear trend can be removed by subtracting a least-squares-fit straight line. In the example shown, we fit a linear model and obtain the difference. The graph shown next is a de-trended time series.

More complicated trends might require different procedures, such as fitting a non-linear model, for example a quadratic or an exponential model.

• Use a Linear Trend Model if the first differences are more or less constant:
  (y_2 - y_1) = (y_3 - y_2) = ... = (y_n - y_{n-1})
• Use a Quadratic Trend Model if the second differences are more or less constant:
  (y_3 - y_2) - (y_2 - y_1) = ... = (y_n - y_{n-1}) - (y_{n-1} - y_{n-2})
• Use an Exponential Trend Model if the percentage differences are more or less constant:
  ((y_2 - y_1) / y_1) * 100% = ... = ((y_n - y_{n-1}) / y_{n-1}) * 100%
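A minimal sketch of linear de-trending and of the three difference checks follows. The data are made up, and lm() is used here as one common way to obtain the least-squares line.

```r
# Hypothetical series with a linear trend (made-up numbers).
set.seed(7)
t <- 1:60
y <- 5 + 2 * t + rnorm(60, sd = 3)

fit <- lm(y ~ t)                  # least-squares line T_t = m*t + b
m <- coef(fit)[["t"]]
b <- coef(fit)[["(Intercept)"]]
y1 <- y - (m * t + b)             # de-trended series Y1_t = Y_t - T_t

# Difference checks for choosing the trend model, on noise-free examples:
lin  <- 3 + 2 * (1:6)             # linear
quad <- (1:6)^2                   # quadratic
expo <- 100 * 1.1^(0:5)           # exponential, 10% growth per step

diff(lin)                           # first differences are constant
diff(quad, differences = 2)         # second differences are constant
100 * diff(expo) / head(expo, -1)   # percentage differences are constant
```

The de-trended series y1 is the same as residuals(fit). With real, noisy data the differences will only be roughly constant, so the checks are a guide and not a strict test.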

Seasonal Adjustment
• Plotting the de-trended series identifies seasons
  - For CO2 concentration, we can model the period as being a year, with variation at the month level
• Simple ad-hoc adjustment: take several years of data, calculate the average value for each month, and subtract that from Y1_t:
  Y2_t = Y1_t - S_t

Unlike the trend and cyclical components, seasonal components, theoretically, happen with similar magnitude during the same time period each year.

The holiday sales spike is an example of seasonality. The seasonal component of a series typically makes the interpretation of the series more difficult. By removing the seasonal component, it is easier to focus on the other components.

A simple adjustment for seasonality is to take several years of data, calculate the average value for each month, and subtract these averages from the actual values.
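The ad-hoc monthly adjustment can be sketched on the co2 dataset that ships with R (monthly CO2 concentrations). The linear de-trending step is a simplifying assumption for this example, since the real trend is slightly curved.

```r
# co2 is a built-in monthly time series (frequency = 12).
t  <- as.numeric(time(co2))
y1 <- as.numeric(residuals(lm(as.numeric(co2) ~ t)))   # de-trended series Y1_t

month   <- as.integer(cycle(co2))       # 1..12, the calendar month of each observation
s_month <- tapply(y1, month, mean)      # average de-trended value for each month
s_t     <- as.numeric(s_month[month])   # S_t: monthly average aligned to every observation

# Seasonally adjusted series Y2_t = Y1_t - S_t
y2 <- ts(y1 - s_t, start = start(co2), frequency = frequency(co2))
round(s_month, 2)
```

After the adjustment, the average of y2 within each calendar month is zero by construction, so what remains is the variation not explained by the linear trend or the monthly pattern.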

ARMA(p, q) Model
• The simplest Box-Jenkins model
  - Y_t is de-trended and seasonally adjusted
• Combination of two process models
  - Autoregressive: Y_t is a linear combination of its last p values
  - Moving average: Y_t is a constant value plus the effects of a dampened white noise process over the last q time values (lags)

Autoregressive (AR) models can be coupled with moving average (MA) models to form a general and useful class of time series models called Autoregressive Moving Average (ARMA) models. This is the simplest Box-Jenkins model. The AR model predicts Y_t as a linear combination of its last p values. An autoregressive model is simply a linear regression of the current value of the series on one or more prior values of the same series. Several options are available for analyzing autoregressive models, including standard linear least squares techniques. They also have a straightforward interpretation.

The time series Y_t is called an autoregressive process of order p and is denoted as an AR(p) process.

A moving average (MA) model adds to Y_t the effects of a dampened white noise process over the last q steps. It should not be confused with the simple moving average, which is one of the most basic forecasting methods. In a simple moving average, we move backwards in time (minus 1, minus 2, minus 3, and so forth) until we have n data points, then divide the sum of those points by n, and that gives the forecast for the next period. This is called a single moving average or simple moving average. The forecast is simply a constant value that is projected onto the next time period, and n is the order of the simple moving average.

Note: an MA(q) process is a weighted sum of the last q white noise terms. It is related to, but not the same as, a random walk (the discrete analogue of Brownian motion), which accumulates all past noise terms and is therefore not stationary.
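To make the ARMA(p, q) idea concrete, the sketch below simulates an ARMA(1, 1) series and fits the same model back. The coefficients 0.6 and 0.4 and the series length are made-up values chosen for illustration.

```r
set.seed(123)
# Simulate 500 points from an ARMA(1,1) process with made-up coefficients:
#   Y_t = 0.6 * Y_{t-1} + e_t + 0.4 * e_{t-1}
y <- arima.sim(model = list(ar = 0.6, ma = 0.4), n = 500)

# Fit ARMA(p = 1, q = 1); d = 0 because the simulated series is already stationary.
fit <- arima(y, order = c(1, 0, 1))
coef(fit)   # estimates named ar1, ma1 and intercept
```

The estimates should land near the simulated values but will not match them exactly, because they are computed from a finite random sample.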

ARIMA(p, d, q) Model
• ARIMA adds a differencing term, d, to the ARMA model
  - Autoregressive Integrated Moving Average
  - Includes the de-trending as part of the model
  - A linear trend can be removed by d = 1
  - A quadratic trend by d = 2
  - And so on for higher-order trends
• The general non-seasonal model is known as ARIMA(p, d, q):
  - p is the number of autoregressive terms
  - d is the number of differences
  - q is the number of moving average terms

ARMA models can be used when the series is weakly stationary; in other words, the series has a constant variance around a constant mean, and its autocorrelation depends only on the lag. This class of models can be extended to non-stationary series by allowing the differencing of the data series. These are called Autoregressive Integrated Moving Average (ARIMA) models. There is a large variety of ARIMA models.

ARIMA differences Y_t d times to "induce stationarity"; d is usually 1 or 2. The "I" stands for integrated: the outputs of the model are summed up (or "integrated") to recover Y_t.

The general ARIMA(p, d, q) model gives a tremendous variety of patterns in the ACF and PACF, so it is not practical to state rules for identifying general ARIMA models. In practice, it is seldom necessary to deal with values of p, d, or q other than 0, 1, or 2. It is remarkable that such a small range of values for p, d, or q can cover such a large range of practical forecasting situations.
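The roles of d and of "integration" can be sketched with a noise-free quadratic series. The coefficients below are arbitrary, and the final ARIMA(1, 1, 0) fit uses simulated data purely to show where d goes in the arima() call.

```r
t <- 1:10
quad <- 2 * t^2 + 3 * t + 1          # series with a purely quadratic trend

d1 <- diff(quad)                     # d = 1: still trending (linear in t)
d2 <- diff(quad, differences = 2)    # d = 2: constant, so the trend is removed

# "Integrated": summing undoes the differencing, given the first two original values.
recovered <- diffinv(d2, differences = 2, xi = quad[1:2])
all.equal(recovered, quad)

# In arima(), d is the middle element of `order`, e.g. ARIMA(1, 1, 0):
set.seed(2)
y <- cumsum(arima.sim(model = list(ar = 0.5), n = 300))   # integrate a made-up AR(1) series
fit <- arima(y, order = c(1, 1, 0))
coef(fit)
```

Because the model differences the data internally when d > 0, no intercept is estimated in the last fit and only the ar1 coefficient is reported.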

ACF & PACF
• Auto Correlation Function (ACF)
  - Correlation of the values of the time series with itself
  - Autocorrelation "carries over"
  - Helps to determine the order, q, of an MA model
  - Where does ACF go to zero?
• Partial Auto Correlation Function (PACF)
  - An autocorrelation calculated after removing the linear dependence of the previous terms
  - Helps to determine the order, p, of an AR model
  - Where does PACF go to zero?

A common assumption in many time series techniques is that the time series is stationary. A stationary process has the property that the mean, variance and autocorrelation structure do not change over time.

An ACF plot provides an indication of the stationarity of the data. If the time series is not stationary, we can often transform it to stationarity with the simple technique of differencing. It should be noted that the autocorrelation carries over; if Y_t is correlated with Y_{t-1}, it is also correlated with Y_{t-2} (though to a lesser degree).

PACF: The partial autocorrelation at lag k is the autocorrelation between Y_t and Y_{t-k} that is not accounted for by lags 1 through k-1. One looks for the point on the plot where the partial autocorrelations for all higher lags are essentially zero.

We will look into ACF and PACF graphs in the next Lab.
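As a sketch of how the two functions point to p and q, the example below simulates one AR(2) and one MA(1) series with made-up coefficients and inspects the estimates numerically. The 1.96 / sqrt(n) bound is the usual approximate 95% limit and is used here as a working definition of "essentially zero".

```r
set.seed(10)
n <- 1000
ar2 <- arima.sim(model = list(ar = c(0.5, 0.3)), n = n)   # AR(2), made-up coefficients
ma1 <- arima.sim(model = list(ma = 0.8), n = n)           # MA(1), made-up coefficient

# plot = FALSE returns the estimates instead of drawing the correlogram.
pacf_ar2 <- pacf(ar2, lag.max = 10, plot = FALSE)$acf[, 1, 1]   # lags 1..10
acf_ma1  <- acf(ma1, lag.max = 10, plot = FALSE)$acf[, 1, 1]    # lags 0..10

bound <- 1.96 / sqrt(n)   # approximate 95% limit for a zero correlation
round(pacf_ar2, 2)        # large at lags 1 and 2, suggesting p = 2
round(acf_ma1, 2)         # lag 0 is always 1; large at lag 1, suggesting q = 1
```

Individual higher lags can still cross the bound by chance, so the cut-off is read as an overall pattern and not lag by lag.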

Model Selection
• Based on the data, the Data Scientist selects p, d and q
  - An "art form" that requires domain knowledge, modeling experience, and a few iterations
  - Use a simple model when possible
  - AR model (q = 0)
  - MA model (p = 0)
• Multiple models need to be built and compared
  - Using ACF and PACF

Identification of the most appropriate model is the most important part of the process, where it becomes as much "art" as "science".

The first step is to determine if the time series is stationary. This can be done with a correlogram (plots of the ACF and PACF). If the time series is not stationary, it needs to be first-differenced (it may need to be differenced again to induce stationarity). The next stage is to determine the p and q in the ARIMA(p, d, q) model (the d refers to how many times the data needs to be differenced to produce a stationary series).

In the diagnostic stage we assess the model's adequacy by checking whether the model assumptions are satisfied. If the model is inadequate, this stage will provide some information for us to re-identify the model. We also check the residuals for normality, constant variance, and independence.

Time Series Analysis - Reasons to Choose (+) & Cautions (-)

Reasons to Choose (+)
• Minimal data collection
  - Only have to collect the series itself
  - Do not need to input drivers
• Designed to handle the inherent autocorrelation of lagged time series
• Accounts for trends and seasonality

Cautions (-)
• No meaningful drivers: prediction based only on past performance
  - No explanatory value
  - Can't do "what-if" scenarios
  - Can't stress test
• It's an "art form" to select appropriate parameters
• Only suitable for short term predictions

The Reasons to Choose (+) and Cautions (-) of Time Series Analysis are listed. Time Series Analysis is not a common "tool" in a Data Scientist's tool kit. Though the models require minimal data collection and handle the inherent autocorrelations of lagged time series, they do not produce meaningful drivers for the prediction.

Selecting appropriate values of (p, d, q) is not straightforward. A thorough understanding of the domain and a very detailed analysis of trend and seasonality may be required.
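One way to make the "build and compare" step more systematic is sketched below. It assumes AIC as the comparison criterion and the Ljung-Box test as the residual check. Neither is prescribed by the slides, and the data and candidate list are made up.

```r
set.seed(99)
y <- arima.sim(model = list(ar = 0.7), n = 400)   # made-up AR(1) data, so d = 0

# Candidate (p, d, q) orders, kept deliberately small and simple.
candidates <- list(ar1 = c(1, 0, 0), ma1 = c(0, 0, 1), arma11 = c(1, 0, 1))
fits <- lapply(candidates, function(ord) arima(y, order = ord))

# Lower AIC is better; AIC penalizes extra parameters.
aics <- sapply(fits, function(f) f$aic)
best <- fits[[which.min(aics)]]

# Diagnostic: Ljung-Box test for leftover autocorrelation in the residuals.
# fitdf is the number of fitted AR and MA coefficients (p + q).
lb <- Box.test(residuals(best), lag = 10, type = "Ljung-Box",
               fitdf = sum(best$arma[1:2]))
round(aics, 1)
lb$p.value   # a large p-value gives no evidence against independent residuals
```

A criterion such as AIC narrows the candidates, but the ACF and PACF plots and domain knowledge still decide whether the chosen model is sensible.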

Further, this method is suitable for short-term predictions only.
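The short-horizon caution can be seen in the forecast standard errors. The sketch below uses a made-up AR(1) series and a 24-step horizon chosen only for illustration.

```r
set.seed(5)
y <- arima.sim(model = list(ar = 0.8), n = 300)   # made-up AR(1) data
fit <- arima(y, order = c(1, 0, 0))

# predict() on an arima fit returns the forecasts ($pred) and their standard errors ($se).
fc <- predict(fit, n.ahead = 24)
round(head(fc$se, 3), 2)   # uncertainty for the first few steps ahead
round(tail(fc$se, 3), 2)   # uncertainty two years out
```

The standard errors grow with the horizon while the forecasts themselves decay toward the series mean, so distant forecasts carry little information beyond the long-run average.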

Time Series Analysis with R
• The function "ts" is used to create time series objects
  - mydata <- ts(mydata, start = c(1999, 1), frequency = 12)
• Visualize data
  - plot(mydata)
• De-trend using differencing
  - diff(mydata)
• Examine ACF and PACF
  - acf(mydata): computes and plots estimates of the autocorrelations
  - pacf(mydata): computes and plots estimates of the partial autocorrelations

Important R functions and commands we will be using are listed here.
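A short sketch of the commands above follows, using made-up monthly values in place of real data. The variable name raw is an assumption for this example.

```r
# Made-up monthly values for two years (a trend plus a yearly pattern).
raw <- round(100 + 2 * (1:24) + 10 * sin(2 * pi * (1:24) / 12))

# start = c(1999, 1) means January 1999; frequency = 12 means monthly data.
mydata <- ts(raw, start = c(1999, 1), frequency = 12)

start(mydata); end(mydata); frequency(mydata)
window(mydata, start = c(2000, 6), end = c(2000, 8))   # subset by date

d <- diff(mydata)   # first differences; one observation shorter
start(d)            # the differenced series starts in February 1999
```

plot(mydata), acf(mydata) and pacf(mydata) can then be called on the same object. They open a graphics device, so they are left out of this sketch.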

Other Useful R Functions in Time Series Analysis
• ar(): Fit an autoregressive time series model to the data
• arima(): Fit an ARIMA model
• predict(): Makes predictions
  - "predict" is a generic function for predictions from the results of various model fitting functions. The function invokes particular methods which depend on the class of the first argument
• arima.sim(): Simulate a time series from an ARIMA model
• decompose(): Decompose a time series into seasonal, trend and irregular components using moving averages
  - Deals with additive or multiplicative seasonal component
• stl(): Decompose a time series into seasonal, trend and irregular components using loess

Some additional commands from R's stats package are listed. We will use these commands in the lab.
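Two of the listed functions, stl() and ar(), are sketched below. The co2 dataset ships with R, while the AR coefficients for the simulated series are made up.

```r
# stl(): loess-based decomposition of the built-in monthly co2 series.
# s.window = "periodic" assumes the seasonal pattern is the same every year.
fit_stl <- stl(co2, s.window = "periodic")
head(fit_stl$time.series, 3)   # columns: seasonal, trend, remainder

# ar(): fits an autoregressive model and, by default, picks the order using AIC.
set.seed(3)
y <- arima.sim(model = list(ar = c(0.6, -0.3)), n = 500)
fit_ar <- ar(y)
fit_ar$order                        # selected order p

# predict() dispatches on the class of fit_ar ("ar").
predict(fit_ar, n.ahead = 3)$pred
```

The three stl components add back up to the original series, which makes it easy to check how much of the variation each one explains.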

Check Your Knowledge
1. What is a time series and what are the key components of a time series?
2. How do we "de-trend" time series data?
3. What makes data stationary?
4. How is seasonality removed from the data?
5. What are the modeling parameters in ARIMA?
6. How do you use ACF and PACF to determine the "stationarity" of time series data?

Your Thoughts? Record your answers here.

Time Series Analysis - Summary

During this lesson the following topics were covered:
• Time Series Analysis and its applications in forecasting
• ARMA and ARIMA Models
• Implementing the Box-Jenkins Methodology using R
• Reasons to Choose (+) and Cautions (-) with Time Series Analysis

This lesson covered these topics. Please take a moment to review them.