**Research Paper Review**

Due Date: 10/17/2019

The final exam for this course is a research paper review. At the end of each chapter of the book there is a bibliography section that lists the research papers related to the chapter content and cited in the chapter. Here is what you need to do:

1- Select one of the chapters of the book that is of interest to you

2- Select one of the papers listed in the bibliography section of that chapter

3- Search online for the paper manuscript

4- Read the paper

5- Summarize it

6- Create a PowerPoint presentation of that paper summary

7- Upload your presentation to BB.

Example

__http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.134.83&rep=rep1&type=pdf__

Data Science and Big Data Analytics

Chap 8: Advanced Analytical Theory and Methods:

Time Series Analysis


Chapter Sections

8.1 Overview of Time Series Analysis

8.1.1 Box-Jenkins Methodology

8.2 ARIMA Model

8.2.1 Autocorrelation Function (ACF)

8.2.2 Autoregressive Models

8.2.3 Moving Average Models

8.2.4 ARMA and ARIMA Models

8.2.5 Building and Evaluating an ARIMA Model

8.2.6 Reasons to Choose and Cautions

8.3 Additional Methods

Summary


8 Time Series Analysis

This chapter’s emphasis is on

Identifying the underlying structure of the time series

Fitting an appropriate Autoregressive Integrated Moving Average (ARIMA) model


Time series analysis attempts to model the underlying structure of observations over time

A time series, Y = a + bX, is an ordered sequence of equally spaced values over time

The analyses presented in this chapter are limited to equally spaced time series of one variable

8.1 Overview of Time Series Analysis


The time series below plots the number of passengers by month (144 months, or 12 years)

8.1 Overview of Time Series Analysis


The goals of time series analysis are

Identify and model the structure of the time series

Forecast future values in the time series

Time series analysis has many applications in finance, economics, biology, engineering, retail, and manufacturing

8.1 Overview of Time Series Analysis


8.1 Overview of Time Series Analysis 8.1.1 Box-Jenkins Methodology

A time series can consist of the components:

Trend – long-term movement in a time series, increasing or decreasing over time – for example,

Steady increase in sales month over month

Annual decline of fatalities due to car accidents

Seasonality – describes the fixed, periodic fluctuation in the observations over time

Usually related to the calendar – e.g., airline passenger example

Cyclic – also periodic but not as fixed

E.g., retail sales versus the boom-bust cycle of the economy

Random – is what remains

Often an underlying structure remains but usually with significant noise

This structure is what is modeled to obtain forecasts
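A minimal sketch in R of separating these components. It assumes the built-in `AirPassengers` dataset (144 monthly observations) is an acceptable stand-in for the passenger series plotted earlier; the slides do not name the dataset. `decompose()` performs a classical moving-average decomposition, which is a simpler technique than the Box-Jenkins procedure and does not isolate a separate cyclic component.

```r
# Assumption: the built-in AirPassengers series stands in for the slide's data.
data(AirPassengers)

# Classical decomposition into trend, seasonal, and random components.
# A multiplicative model is chosen because the seasonal swings grow with the level.
parts <- decompose(AirPassengers, type = "multiplicative")

head(parts$trend, 12)   # long-term movement (NA where the averaging window is incomplete)
parts$figure            # one seasonal factor per calendar month
summary(parts$random)   # what remains after trend and seasonality are removed
```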


8.1 Overview of Time Series Analysis 8.1.1 Box-Jenkins Methodology

The Box-Jenkins methodology has three main steps:

1- Condition data and select a model

Identify/account for trends/seasonality in time series

Examine remaining time series to determine a model

2- Estimate the model parameters

3- Assess the model; return to Step 1 if necessary

This chapter uses the Box-Jenkins methodology to apply an ARIMA model to a given time series


8.1 Overview of Time Series Analysis 8.1.1 Box-Jenkins Methodology

The remainder of the chapter is rather advanced and will not be covered in this course

The remaining slides have not been finalized but can be reviewed by those interested in time series analysis


8.2 ARIMA Model ARIMA = Autoregressive Integrated Moving Average

Step 1: remove any trends/seasonality in time series

Achieve a time series with certain properties to which autoregressive and moving average models can be applied

Such a time series is known as a stationary time series


8.2 ARIMA Model ARIMA = Autoregressive Integrated Moving Average

A time series $y_t$, for t = 1, 2, 3, …, is a stationary time series if the following three conditions are met:

(1) The expected value (mean) of $y_t$ is constant for all values of t

(2) The variance of $y_t$ is finite

(3) The covariance of $y_t$ and $y_{t+h}$ depends only on the value of h = 0, 1, 2, … for all t

The covariance of $y_t$ and $y_{t+h}$ is a measure of how the two variables, $y_t$ and $y_{t+h}$, vary together


8.2 ARIMA Model ARIMA = Autoregressive Integrated Moving Average

The covariance of $y_t$ and $y_{t+h}$ is a measure of how the two variables, $y_t$ and $y_{t+h}$, vary together

If two variables are independent, their covariance is zero.

If the variables change together in the same direction, the covariance is positive; conversely, if the variables change in opposite directions, the covariance is negative


8.2 ARIMA Model ARIMA = Autoregressive Integrated Moving Average

A stationary time series, by condition (1), has a constant mean, say m, so the covariance simplifies to

$$\mathrm{cov}(y_t, y_{t+h}) = E\left[(y_t - m)(y_{t+h} - m)\right]$$

By condition (3), the covariance between two points can be nonzero, but it is a function only of h (for example, h = 3)

If h = 0, cov(0) = cov($y_t$, $y_t$) = var($y_t$) for all t


8.2 ARIMA Model ARIMA = Autoregressive Integrated Moving Average

A plot of a stationary time series


8.2 ARIMA Model 8.2.1 Autocorrelation Function (ACF)

From the figure, it appears that each point is somewhat dependent on the past points, but the plot does not provide insight into the covariance and its structure

The plot of autocorrelation function (ACF) provides this insight

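For a stationary time series, the ACF is defined as

$$\mathrm{ACF}(h) = \frac{\mathrm{cov}(y_t, y_{t+h})}{\sqrt{\mathrm{cov}(y_t, y_t)\,\mathrm{cov}(y_{t+h}, y_{t+h})}} = \frac{\mathrm{cov}(h)}{\mathrm{cov}(0)}$$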


8.2 ARIMA Model 8.2.1 Autocorrelation Function (ACF)

Because cov(0) is the variance,

the ACF is analogous to the correlation function of two variables, corr($y_t$, $y_{t+h}$), and

the value of the ACF falls between -1 and 1

Thus, the closer the absolute value of ACF(h) is to 1, the more useful $y_t$ can be as a predictor of $y_{t+h}$
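The relationship ACF(h) = cov(h) / cov(0) can be checked numerically. The sketch below computes the sample autocovariance by hand and compares it with R's built-in `acf()`. The simulated series, its coefficient, and the seed are illustrative assumptions, not data from the chapter.

```r
# Sample autocovariance at lag h (divisor n, the convention stats::acf uses).
sample_cov <- function(y, h) {
  n <- length(y)
  m <- mean(y)
  sum((y[1:(n - h)] - m) * (y[(1 + h):n] - m)) / n
}

# ACF(h) = cov(h) / cov(0)
sample_acf <- function(y, h) sample_cov(y, h) / sample_cov(y, 0)

set.seed(1)                                             # arbitrary seed
y <- as.numeric(arima.sim(model = list(ar = 0.9), n = 200))  # hypothetical series

manual  <- sapply(0:5, function(h) sample_acf(y, h))
builtin <- drop(acf(y, lag.max = 5, plot = FALSE)$acf)

round(cbind(lag = 0:5, manual, builtin), 4)   # the two columns agree
```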


8.2 ARIMA Model 8.2.1 Autocorrelation Function (ACF)

Using the dataset plotted above, the ACF plot is


8.2 ARIMA Model 8.2.1 Autocorrelation Function (ACF)

By convention, the quantity h in the ACF is referred to as the lag, the difference between the time points t and t +h.

At lag 0, the ACF provides the correlation of every point with itself

According to the ACF plot, at lag 1 the correlation between $y_t$ and $y_{t-1}$ is approximately 0.9, which is very close to 1, so $y_{t-1}$ appears to be a good predictor of the value of $y_t$

Extending this reasoning to the first several lags, a model can be considered that would express $y_t$ as a linear sum of its previous 8 terms. Such a model is known as an autoregressive model of order 8


8.2 ARIMA Model 8.2.2 Autoregressive Models

For a stationary time series $y_t$, t = 1, 2, 3, …, an autoregressive model of order p, denoted AR(p), is

$$y_t = \delta + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t$$


8.2 ARIMA Model 8.2.2 Autoregressive Models

Thus, a particular point in the time series can be expressed as a linear combination of the prior p values, $y_{t-j}$ for j = 1, 2, …, p, of the time series plus a random error term, $\varepsilon_t$.

The $\varepsilon_t$ time series is often called a white noise process that represents random, independent fluctuations that are part of the time series
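A short sketch of an autoregressive model in practice: simulate an AR(1) series and recover its coefficient with `arima()`. The coefficient 0.7, the series length, and the seed are hypothetical values chosen only for illustration.

```r
set.seed(42)      # arbitrary seed
phi <- 0.7        # hypothetical AR(1) coefficient

# Simulate y_t = phi * y_{t-1} + e_t with standard normal white noise e_t.
y <- arima.sim(model = list(ar = phi), n = 2000)

# order = c(p, d, q) = c(1, 0, 0) is an AR(1) model.
# include.mean = FALSE because the simulated series is centered at zero.
fit <- arima(y, order = c(1, 0, 0), include.mean = FALSE)

coef(fit)         # the estimate should be close to 0.7
```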


8.2 ARIMA Model 8.2.2 Autoregressive Models

In the earlier example, the autocorrelations are quite high for the first several lags.

Although an AR(8) model might be good, examining an AR(1) model provides further insight into the ACF and the p value to choose

An AR(1) model, centered around $\delta = 0$, yields

$$y_t = \phi_1 y_{t-1} + \varepsilon_t$$


8.2 ARIMA Model 8.2.3 Moving Average Models

For a time series $y_t$, centered at zero, a moving average model of order q, denoted MA(q), is expressed as

$$y_t = \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q}$$

That is, the value of a time series is a linear combination of the current white noise term and the prior q white noise terms, so earlier random shocks directly affect the current value of the time series


8.2 ARIMA Model 8.2.3 Moving Average Models


In an MA(q) model, the behavior of the ACF and PACF plots is somewhat swapped from the behavior of these plots for AR(p) models.


8.2 ARIMA Model 8.2.3 Moving Average Models

For a simulated MA(3) time series of the form $y_t = \varepsilon_t - 0.4\,\varepsilon_{t-1} + 1.1\,\varepsilon_{t-2} - 2.5\,\varepsilon_{t-3}$, where $\varepsilon_t \sim N(0, 1)$, the scatterplot of the simulated data over time is


8.2 ARIMA Model 8.2.3 Moving Average Models

The ACF plot of the simulated MA(3) series is shown below

ACF(0) = 1, because any variable correlates perfectly with itself. At higher lags, the absolute values of the terms decay

In an autoregressive model, the ACF slowly decays, but for an MA(3) model, the ACF cuts off abruptly after lag 3, and this pattern extends to any MA(q) model.


8.2 ARIMA Model 8.2.3 Moving Average Models

To understand this, examine the MA(3) model equations

Because $y_t$ shares specific white noise variables with $y_{t-1}$ through $y_{t-3}$, those three variables are correlated with $y_t$. However, the expression of $y_t$ does not share white noise variables with $y_{t-4}$ in Equation 8-14. So the theoretical correlation between $y_t$ and $y_{t-4}$ is zero. Of course, when dealing with a particular dataset, the theoretical autocorrelations are unknown, but the observed autocorrelations should be close to zero for lags greater than q when working with an MA(q) model
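The cutoff after lag q can be seen numerically. The sketch below uses `ARMAacf()` for the theoretical autocorrelations of an MA(3) model with the coefficients -0.4, 1.1, and -2.5, then compares them with the sample ACF of one simulated series. The series length and seed are illustrative assumptions.

```r
theta <- c(-0.4, 1.1, -2.5)   # MA(3) coefficients of the simulated example

# Theoretical autocorrelations for lags 0 through 6.
rho <- ARMAacf(ma = theta, lag.max = 6)
round(rho, 3)                 # lags 4, 5, and 6 are exactly zero

# Sample autocorrelations from one simulated realization.
set.seed(7)                   # arbitrary seed
y <- arima.sim(model = list(ma = theta), n = 5000)
r <- drop(acf(y, lag.max = 6, plot = FALSE)$acf)
round(r, 3)                   # small but not exactly zero beyond lag 3
```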


8.2 ARIMA Model 8.2.4 ARMA and ARIMA Models

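In general, it is not necessary to choose between an AR(p) and an MA(q) model; these two representations can be combined into an Autoregressive Moving Average model, ARMA(p,q),

$$y_t = \delta + \phi_1 y_{t-1} + \cdots + \phi_p y_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q}$$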


8.2 ARIMA Model 8.2.4 ARMA and ARIMA Models

If p ≠ 0 and q = 0, then the ARMA(p,q) model is simply an AR(p) model. Similarly, if p = 0 and q ≠ 0, then the ARMA(p,q) model is an MA(q) model

Although an ARMA model requires a stationary time series, many series exhibit a trend over time – e.g., an increasing linear trend
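Differencing is the usual way to remove such a trend, and it is the "Integrated" part of ARIMA: the d in ARIMA(p, d, q) is the number of times the series is differenced. A small sketch, assuming a hypothetical series made of a linear trend plus white noise:

```r
set.seed(3)                          # arbitrary seed
t <- 1:200
trend <- 5 + 0.5 * t                 # hypothetical linear trend
y <- trend + rnorm(200)              # nonstationary: the mean grows with t

# First difference: d_t = y_t - y_{t-1}
d1 <- diff(y)

length(d1)           # one observation is lost per difference
mean(d1)             # close to the slope 0.5; the trend itself is gone
range(diff(trend))   # differencing a pure linear trend leaves a constant
```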


8.2 ARIMA Model 8.2.5 Building and Evaluating an ARIMA Model

For a large country, monthly gasoline production (millions of barrels) was obtained for 240 months (20 years).

A market research firm requires some short-term gasoline production forecasts


8.2 ARIMA Model 8.2.5 Building and Evaluating an ARIMA Model

library(forecast)

# read the monthly production data
gas_prod_input <- as.data.frame(read.csv("c:/data/gas_prod.csv"))

# create a time series object from the second column
gas_prod <- ts(gas_prod_input[, 2])

# examine the time series
plot(gas_prod, xlab = "Time (months)", ylab = "Gasoline production (millions of barrels)")


8.2 ARIMA Model 8.2.5 Building and Evaluating an ARIMA Model

Comparing Fitted Time Series Models

The arima() function in R uses Maximum Likelihood Estimation (MLE) to estimate the model coefficients. In the R output for an ARIMA model, the log-likelihood (logL) value is provided. The values of the model coefficients are determined such that the value of the log-likelihood function is maximized. Based on the logL value, the R output provides several measures that are useful for comparing the appropriateness of one fitted model against another fitted model.

AIC (Akaike Information Criterion)

AICc (Akaike Information Criterion, corrected)

BIC (Bayesian Information Criterion)
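A minimal sketch of such a comparison using base R, where `AIC()` and `BIC()` can be applied to a fitted `arima()` object. The simulated AR(1) data and the two candidate models are assumptions for illustration; they are not the gasoline production models from the chapter. For each criterion, a lower value indicates the preferred model.

```r
set.seed(11)                                       # arbitrary seed
y <- arima.sim(model = list(ar = 0.6), n = 500)    # hypothetical AR(1) data

fit_ar1 <- arima(y, order = c(1, 0, 0))            # candidate 1: AR(1)
fit_ma1 <- arima(y, order = c(0, 0, 1))            # candidate 2: MA(1)

# Each fit estimates 3 parameters: one coefficient, the mean, and the noise variance.
# AIC = -2 * logL + 2 * (number of estimated parameters)
data.frame(
  model  = c("AR(1)", "MA(1)"),
  logLik = c(fit_ar1$loglik, fit_ma1$loglik),
  AIC    = c(AIC(fit_ar1), AIC(fit_ma1)),
  BIC    = c(BIC(fit_ar1), BIC(fit_ma1))
)
```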


8.2 ARIMA Model 8.2.5 Building and Evaluating an ARIMA Model

Normality and Constant Variance


8.2 ARIMA Model 8.2.5 Building and Evaluating an ARIMA Model

Forecasting
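As a rough illustration of the forecasting step, the sketch below fits a model with base R's `arima()` and uses `predict()` to obtain the next 12 values with approximate 95% intervals. The simulated series stands in for real data, and the interval assumes normally distributed forecast errors; neither comes from the chapter's example.

```r
set.seed(5)                                        # arbitrary seed
y <- arima.sim(model = list(ar = 0.6), n = 240)    # hypothetical 240-month series
fit <- arima(y, order = c(1, 0, 0))

# Forecast the next 12 periods; predict() returns point forecasts and standard errors.
fc <- predict(fit, n.ahead = 12)

# Approximate 95% prediction interval, assuming normal errors.
lower <- fc$pred - 1.96 * fc$se
upper <- fc$pred + 1.96 * fc$se
cbind(forecast = fc$pred, lower, upper)
```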


8.2 ARIMA Model 8.2.6 Reasons to Choose and Cautions

One advantage of ARIMA modeling is that the analysis can be based simply on historical time series data for the variable of interest. By contrast, as observed in the chapter about regression (Chapter 6), various input variables need to be considered and evaluated for inclusion in the regression model for the outcome variable


8.3 Additional Methods

Autoregressive Moving Average with Exogenous inputs (ARMAX)

Used to analyze a time series that is dependent on another time series.

For example

Retail demand for products can be modeled based on the previous demand combined with a weather-related time series such as temperature or rainfall.

Spectral analysis is commonly used for signal processing and other engineering applications.

Speech recognition software uses such techniques to separate the signal for the spoken words from the overall signal that may include some noise.

Generalized Autoregressive Conditionally Heteroscedastic (GARCH)

A useful model for addressing time series with nonconstant variance or volatility.

Used for modeling stock market activity and price fluctuations.

8.3 Additional Methods

Kalman filtering

Useful for analyzing real-time inputs about a system that can exist in certain states.

Typically, there is an underlying model of how the various components of the system interact and affect each other.

Processes the various inputs,

Attempts to identify the errors in the input, and

Predicts the current state.

For example

A Kalman filter in a vehicle navigation system can

Process various inputs, such as speed and direction, and

Update the estimate of the current location.

8.3 Additional Methods

Multivariate time series analysis

Examines multiple time series and their effect on each other.

Vector ARIMA (VARIMA)

Extends ARIMA by considering a vector of several time series at a particular time, t.

Can be used in marketing analyses

Examine the time series related to a company’s price and sales volume as well as related time series for the competitors.

Summary

Time series analysis is different from other statistical techniques in the sense that most statistical analyses assume the observations are independent of each other. Time series analysis implicitly addresses the case in which any particular observation is somewhat dependent on prior observations.

Using differencing, ARIMA models allow nonstationary series to be transformed into stationary series to which seasonal and nonseasonal ARMA models can be applied. The importance of using the ACF and PACF plots to evaluate the autocorrelations was illustrated in determining ARIMA models to consider fitting. Akaike and Bayesian Information Criteria can be used to compare one fitted ARIMA model against another. Once an appropriate model has been determined, future values in the time series can be forecast.


Data Science and Big Data Analytics

Chapter 5: Advanced Analytical Theory and Methods: Association Rules


Chapter Sections

5.1 Overview

5.2 Apriori Algorithm

5.3 Evaluation of Candidate Rules

5.4 Applications of Association Rules

5.5 Example: Transactions in a Grocery Store

5.6 Validation and Testing

5.7 Diagnostics


5.1 Overview

Association rules method

Unsupervised learning method

Descriptive (not predictive) method

Used to find hidden relationships in data

The relationships are represented as rules

Questions association rules might answer

Which products tend to be purchased together

What products do similar customers tend to buy


5.1 Overview

Example – general logic of association rules


5.1 Overview

Rules have the form X -> Y

When X is observed, Y is also observed

Itemset

Collection of items or entities

k-itemset = {item 1, item 2,…,item k}

Examples

Items purchased in one transaction

Set of hyperlinks clicked by a user in one session


5.1 Overview – Apriori Algorithm

Apriori is the most fundamental algorithm

Given itemset L, support of L is the percent of transactions that contain L

Frequent itemset – items appear together “often enough”

Minimum support defines “often enough” (% transactions)

If an itemset is frequent, then any subset is frequent


5.1 Overview – Apriori Algorithm

If {B,C,D} frequent, then all subsets frequent
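A small sketch of support and of this subset property in base R. The five transactions are a hypothetical toy database, not data from the slides.

```r
# Hypothetical toy transaction database.
transactions <- list(
  c("A", "B", "C", "D"),
  c("B", "C", "D"),
  c("A", "B", "C"),
  c("B", "C", "D"),
  c("A", "D")
)

# Support: fraction of transactions that contain every item in the itemset.
support <- function(itemset, db) {
  mean(sapply(db, function(tr) all(itemset %in% tr)))
}

support(c("B", "C", "D"), transactions)   # 3 of 5 transactions = 0.6
support(c("B", "C"), transactions)        # 4 of 5 transactions = 0.8
support("B", transactions)                # 4 of 5 transactions = 0.8
```

A subset can never have lower support than the itemset that contains it, which is why every subset of a frequent itemset is also frequent.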


5.2 Apriori Algorithm Frequent = minimum support

Bottom-up iterative algorithm

Identify the frequent (min support) 1-itemsets

Frequent 1-itemsets are paired into 2-itemsets, and the frequent 2-itemsets are identified, etc.

Definitions for next slide

D = transaction database

d = minimum support threshold

N = maximum length of itemset (optional parameter)

Ck = set of candidate k-itemsets

Lk = set of k-itemsets with minimum support


5.2 Apriori Algorithm
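The bottom-up iteration can be sketched in base R as follows. This is an illustrative simplification, not the pseudocode from the book: candidates are formed by joining pairs of frequent k-itemsets and are then filtered directly by support, without a separate subset-pruning step. The function name `apriori_sketch`, its arguments (`db` for the transaction database, `min_support` for the threshold, `max_len` for the optional maximum itemset length), and the toy transactions are all hypothetical.

```r
apriori_sketch <- function(db, min_support, max_len = Inf) {
  support <- function(itemset) {
    mean(sapply(db, function(tr) all(itemset %in% tr)))
  }
  # Frequent 1-itemsets.
  items <- sort(unique(unlist(db)))
  current <- Filter(function(s) support(s) >= min_support, as.list(items))
  frequent <- list()
  k <- 1
  while (length(current) > 0 && k <= max_len) {
    frequent <- c(frequent, current)
    # Candidate (k+1)-itemsets: unions of two frequent k-itemsets that differ by one item.
    candidates <- list()
    n <- length(current)
    if (n >= 2) {
      for (i in 1:(n - 1)) {
        for (j in (i + 1):n) {
          u <- sort(union(current[[i]], current[[j]]))
          if (length(u) == k + 1) candidates[[paste(u, collapse = ",")]] <- u
        }
      }
    }
    # Keep only the candidates that meet the minimum support.
    current <- unname(Filter(function(s) support(s) >= min_support, candidates))
    k <- k + 1
  }
  frequent
}

# Hypothetical toy transaction database.
transactions <- list(
  c("A", "B", "C", "D"), c("B", "C", "D"), c("A", "B", "C"),
  c("B", "C", "D"), c("A", "D")
)
freq <- apriori_sketch(transactions, min_support = 0.6)
sapply(freq, paste, collapse = ",")   # frequent itemsets of every length
```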


5.3 Evaluation of Candidate Rules Confidence

Frequent itemsets can form candidate rules

Confidence measures the certainty of a rule

Minimum confidence – predefined threshold

Problem with confidence

Given a rule X->Y, confidence considers only the antecedent (X) and the co-occurrence of X and Y

Cannot tell if a rule contains true implication
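Confidence is conventionally computed as Support(X and Y) / Support(X). A minimal sketch, using counts of 300 transactions containing {milk, eggs} and 500 containing {milk}; the function name `confidence` and its arguments are illustrative.

```r
# Confidence(X -> Y) = Support(X and Y) / Support(X).
# The total number of transactions cancels, so raw counts are enough.
confidence <- function(n_xy, n_x) n_xy / n_x

confidence(n_xy = 300, n_x = 500)   # milk -> eggs: 0.6
```

The count of transactions containing Y alone never enters the calculation, which is why confidence by itself cannot show whether X and Y are genuinely related.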


5.3 Evaluation of Candidate Rules Lift

Lift measures how much more often X and Y occur together than expected if statistically independent

Lift = 1 if X and Y are statistically independent

Lift > 1 indicates the degree of usefulness of the rule

Example – in 1000 transactions,

If {milk, eggs} appears in 300, {milk} in 500, and {eggs} in 400, then Lift(milk->eggs) = 0.3/(0.5*0.4) = 1.5

If {milk, bread} appears in 400, {milk} in 500, and {bread} in 400, then Lift(milk->bread) = 0.4/(0.5*0.4) = 2.0


5.3 Evaluation of Candidate Rules Leverage

Leverage measures the difference in the probability of X and Y appearing together compared to statistical independence

Leverage = 0 if X and Y are statistically independent

Leverage > 0 indicates degree of usefulness of rule

Example – in 1000 transactions,

If {milk, eggs} appears in 300, {milk} in 500, and {eggs} in 400, then Leverage(milk->eggs) = 0.3 – 0.5*0.4 = 0.1

If {milk, bread} appears in 400, {milk} in 500, and {bread} in 400, then Leverage (milk->bread) = 0.4 – 0.5*0.4 = 0.2
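The lift and leverage examples can be reproduced with two small helper functions. The function and argument names are illustrative; the counts are the ones given in the examples.

```r
# Lift(X -> Y) = Support(X and Y) / (Support(X) * Support(Y))
lift <- function(n_xy, n_x, n_y, n) {
  (n_xy / n) / ((n_x / n) * (n_y / n))
}

# Leverage(X -> Y) = Support(X and Y) - Support(X) * Support(Y)
leverage <- function(n_xy, n_x, n_y, n) {
  (n_xy / n) - (n_x / n) * (n_y / n)
}

lift(300, 500, 400, 1000)       # milk -> eggs:  1.5
lift(400, 500, 400, 1000)       # milk -> bread: 2.0
leverage(300, 500, 400, 1000)   # milk -> eggs:  0.1
leverage(400, 500, 400, 1000)   # milk -> bread: 0.2
```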


5.4 Applications of Association Rules

The term market basket analysis refers to a specific implementation of association rules

For better merchandising – products to include/exclude from inventory each month

Placement of products within related products

Association rules also used for

Recommender systems – Amazon, Netflix

Clickstream analysis from web usage log files

Website visitors to page X click on links A,B,C more than on links D,E,F


5.5 Example: Grocery Store Transactions 5.5.1 The Groceries Dataset

Packages -> Install -> arules, arulesViz # don’t enter next line

> install.packages(c("arules", "arulesViz")) # appears on console

> library('arules')

> library('arulesViz')

> data(Groceries)

> summary(Groceries) # indicates 9835 rows

Class of dataset Groceries is transactions, containing 3 slots

transactionInfo # data frame with vectors having length of transactions

itemInfo # data frame storing item labels

data # binary evidence matrix of labels in transactions

> Groceries@itemInfo[1:10,]

> apply(Groceries@data[,10:20],2,function(r) paste(Groceries@itemInfo[r,"labels"],collapse=", "))


5.5 Example: Grocery Store Transactions 5.5.2 Frequent Itemset Generation

To illustrate the Apriori algorithm, the code below does each iteration separately.

Assume minimum support threshold = 0.02 (0.02 * 9835 ≈ 197 transactions), get 122 itemsets total

First, get itemsets of length 1

> itemsets<-apriori(Groceries,parameter=list(minlen=1,maxlen=1,support=0.02,target="frequent itemsets"))

> summary(itemsets) # found 59 itemsets

> inspect(head(sort(itemsets,by="support"),10)) # lists top 10

Second, get itemsets of length 2

> itemsets<-apriori(Groceries,parameter=list(minlen=2,maxlen=2,support=0.02,target="frequent itemsets"))

> summary(itemsets) # found 61 itemsets

> inspect(head(sort(itemsets,by="support"),10)) # lists top 10

Third, get itemsets of length 3

> itemsets<-apriori(Groceries,parameter=list(minlen=3,maxlen=3,support=0.02,target="frequent itemsets"))

> summary(itemsets) # found 2 itemsets

> inspect(head(sort(itemsets,by="support"),10)) # lists top 10



5.5 Example: Grocery Store Transactions 5.5.3 Rule Generation and Visualization

The Apriori algorithm will now generate rules.

Set minimum support threshold to 0.001 (allows more rules, presumably for the scatterplot) and minimum confidence threshold to 0.6 to generate 2,918 rules.

> rules <- apriori(Groceries,parameter=list(support=0.001,confidence=0.6,target="rules"))

> summary(rules) # finds 2918 rules

> plot(rules) # displays scatterplot

The scatterplot shows that the highest lift occurs at a low support and a low confidence.


5.5 Example: Grocery Store Transactions 5.5.3 Rule Generation and Visualization

> plot(rules)


5.5 Example: Grocery Store Transactions 5.5.3 Rule Generation and Visualization

Get scatterplot matrix to compare the support, confidence, and lift of the 2918 rules

> plot(rules@quality) # displays scatterplot matrix

Lift is proportional to confidence with several linear groupings.

Note that Lift = Confidence/Support(Y), so when support of Y remains the same, lift is proportional to confidence and the slope of the linear trend is the reciprocal of Support(Y).


5.5 Example: Grocery Store Transactions 5.5.3 Rule Generation and Visualization

> plot(rules)


5.5 Example: Grocery Store Transactions 5.5.3 Rule Generation and Visualization

Compute the 1/Support(Y) which is the slope

> slope<-sort(round(rules@quality$lift/rules@quality$confidence,2))

Display the number of times each slope appears in dataset

> unlist(lapply(split(slope,f=slope),length))

Display the top 10 rules sorted by lift

> inspect(head(sort(rules,by="lift"),10))

Rule {Instant food products, soda} -> {hamburger meat}

has the highest lift of 19 (page 154)


5.5 Example: Grocery Store Transactions 5.5.3 Rule Generation and Visualization

Find the rules with confidence above 0.9

> confidentRules<-rules[quality(rules)$confidence>0.9]

> confidentRules # set of 127 rules

Plot a matrix-based visualization of the LHS v RHS of rules

> plot(confidentRules,method="matrix",measure=c("lift","confidence"),control=list(reorder=TRUE))

The legend on the right is a color matrix indicating the lift and the confidence to which each square in the main matrix corresponds


5.5 Example: Grocery Store Transactions 5.5.3 Rule Generation and Visualization

> plot(rules)


5.5 Example: Grocery Store Transactions 5.5.3 Rule Generation and Visualization

Visualize the top 5 rules with the highest lift.

> highLiftRules<-head(sort(rules,by="lift"),5)

> plot(highLiftRules,method="graph",control=list(type="items"))

In the graph, the arrow always points from an item on the LHS to an item on the RHS.

For example, the arrows that connect ham, processed cheese, and white bread suggest the rule

{ham, processed cheese} -> {white bread}

Size of circle indicates support and shade represents lift


5.6 Validation and Testing

The frequent and high confidence itemsets are found by pre-specified minimum support and minimum confidence levels

Measures like lift and/or leverage then ensure that interesting rules are identified rather than coincidental ones

However, some of the remaining rules may be considered subjectively uninteresting because they don’t yield unexpected profitable actions

E.g., rules like {paper} -> {pencil} are not interesting/meaningful

Incorporating subjective knowledge requires domain experts

Good rules provide valuable insights for institutions to improve their business operations


5.7 Diagnostics

Although minimum support is pre-specified in phases 3&4, this level can be adjusted to target the range of the number of rules – variants/improvements of Apriori are available

For large datasets the Apriori algorithm can be computationally expensive – efficiency improvements

Partitioning

Sampling

Transaction reduction

Hash-based itemset counting

Dynamic itemset counting


Data Science and Big Data Analytics

Chap 11: Adv. Analytics – Tech & Tools: In-Database Analytics


Chapter Contents

11.1 SQL Essentials

11.1.1 Joins

11.1.2 Set Operations

11.1.3 Grouping Extensions

11.2 In-Database Text Analysis

11.3 Advanced SQL

11.3.1 Window Functions

11.3.2 User-Defined Functions and Aggregates

11.3.3 Ordered Aggregates

11.3.4 MADlib

Summary


Chap 11: Adv. Analytics – Tech & Tools: In-Database Analytics

In-database analytics is a broad term that describes the processing of data within its repository

This is in contrast to extracting data from a source and loading it into a sandbox or workspace like R

Advantages and disadvantages

Advantage: Eliminates need to move the data

Advantage: Fast – often almost real-time results

Disadvantage: Data must be mostly structured

Disadvantage: Data must be limited, not too huge

Applications – Credit card transaction fraud detection, product recommendations, web advertisement selection


11.1 SQL Essentials Relational Database – Entity Relationship Diagram

11.1 SQL Essentials

Tables

Records (rows)

Fields

Primary Keys

Foreign Keys

Normalization (reduces duplication)

SQL queries


11.2 In-Database Text Analysis

SQL offers basic text string functions

Example – extract zip code from text string


11.2 In-Database Text Analysis

Example – identify invalid zip codes
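The SQL for these two zip code examples is not preserved in the slide text. As an analogous sketch in R (not the SQL from the slides), regular expressions can extract a trailing zip code from an address string and flag values that are not 5 digits, optionally followed by a hyphen and 4 more digits. The addresses and the validity rule are hypothetical.

```r
# Hypothetical address strings.
addresses <- c(
  "123 Main St, Springfield, IL 62704",
  "9 Elm Road, Boston, MA 0211",          # only four digits
  "77 Oak Ave, Austin, TX 78701-1234"
)

# Extract the final run of digits and hyphens after the last space.
zip <- sub(".*[[:space:]]([0-9-]+)$", "\\1", addresses)

# Assumed rule: valid if 5 digits, optionally followed by a hyphen and 4 digits.
valid <- grepl("^[0-9]{5}(-[0-9]{4})?$", zip)

data.frame(zip, valid)
```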


11.3 Advanced SQL

Window functions – moving averages
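In SQL, a moving average is typically written with a window frame such as `AVG(x) OVER (ORDER BY t ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)`. The slide's own query is not preserved, so the following is only an R analog of that calculation, using hypothetical values.

```r
sales <- c(10, 12, 11, 15, 14, 18, 17)   # hypothetical weekly values

# Trailing 3-point moving average: each value is the mean of the current
# observation and the two before it. sides = 1 uses past values only.
ma3 <- as.numeric(stats::filter(sales, rep(1 / 3, 3), sides = 1))

ma3   # the first two entries are NA because the window is incomplete
```

Note that the SQL frame above would average the available rows at the start of the series, whereas `stats::filter()` returns NA there.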


11.3 Advanced SQL

EWMA = Exponentially Weighted Moving Average
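A common recursive form of the EWMA is EWMA_t = α·x_t + (1 − α)·EWMA_{t−1}, where the smoothing factor α lies between 0 and 1. The sketch below implements that form in R; the function name `ewma`, the choice to initialize with the first observation, and the sample values are assumptions for illustration.

```r
ewma <- function(x, alpha) {
  out <- numeric(length(x))
  out[1] <- x[1]                      # assumption: start from the first observation
  for (t in seq_along(x)[-1]) {
    # Weight alpha on the newest value, (1 - alpha) on the running average.
    out[t] <- alpha * x[t] + (1 - alpha) * out[t - 1]
  }
  out
}

ewma(c(10, 12, 11, 15), alpha = 0.5)   # 10, 11, 11, 13
```

A larger α reacts faster to recent values; a smaller α gives a smoother series.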
