This chapter will go more in detail into statistical parameters and techniques that will enable you to analyze the data and draw conclusions based on standard statistical techniques. Then, it will provide the practical coverage of the basic econometric models, linear regression, and logistic regression, and will show how to build a forecast out of regression results.

This chapter was designed to keep the theoretical part to a minimum, with the purpose of teaching how to do a reliable analysis in a few and relatively simple steps, and enabling the reader to gain a basic understanding of some key statistical measures and use them well in order to draw, valid, reliable conclusions. This is an essential issue that makes up the value of an analyst.

As a foreword to this chapter, and a forewarning for the future business analytics practitioners, I would urge the readers to keep in mind that each and every statistical method is a tool, and that it depends largely on them and their knowledge to apply it correctly. Nowadays, this is more true than ever, as we have software available that enables doing advanced statistical calculations with little effort. However, the interpretation of results and the conclusions drawn require a good understanding and practice of the results at hand, something that is not that easy to achieve.

For the readers who are already familiar with this material, it is fine to go through it faster. I would also encourage them to not skip it, but use it as a practical refresher for the later parts of the course.

**3.1 Basic Statistical Measures**

The most basic statistical measures, which are widely used in practice, are those that describe the data, and give a picture of its characteristics, essential for picking the right models and techniques for conducting more advanced quantitative work.

Description of the data often starts with measures of its central tendency. The most commonly used is the mean, or average, which is computed as the sum of all elements (x) for a series, divided by their number (n).

(3.1)

Despite its widespread use, the mean has some undesirable properties that may alter one's conclusions and render the use of other statistical measures more appropriate.

Let us consider the following data sample: 1, 2, 3, 4, 4, 5, 6, 7, 7, 7, 8, 10, 24, 36.

Figure 3.1 Graphical

representation of sample

data set

This could represent daily unit sales data for the same product, ordered from the smallest to the largest. The sum of all values is 124 and the average is 124/14=8.86.

By looking at figure, 3.1, do you see any problem with this distribution? The data seems fairly homogeneous from the first up to and including the 12th element. But the last two values clearly stand out as being much bigger than the previous values.

These observations are called outliers, and have important implications for data analysis. Given the fact that their values are abnormally high (as in this example) or abnormally low, they can easily alter the values obtained for some statistical calculations and point out to misleading conclusions.

To illustrate this, let's omit these extreme values, which are out of line compared to the rest, and compute the average on the first 12. The average for these values is 5.33, significantly different than the result obtained above. Now the question is what is the most relevant average value for the daily sales, 8.86 or 5.33?

There could be legitimate reasons for the abnormally high sales that would give us the 8.86 average (for example, an event that drew large crowds of people), or they might be just normal for certain days like weekends. Whatever the reason, the average may not give the appropriate answer on which to draw conclusions and make important business decisions (e.g. how many units of the product to buy so as to have enough of them for sale).

This example shows that the mean, that we all use in everyday life, may be sensitive to abnormal values (outliers), not very informative, and raise the question whether there are other methods to find out an average value that can be relied on. An interesting remark here: we are used to think in averages as the most basic measure to describe some data, but we seldom thought whether we are computing a reliable measure by just using the mean.

The median is another measure of the average or 'central tendency' of the data. It is computed by taking the middle value, for an odd number of values, or the average of the two middle values for an even number of values, values that are sorted in ascending or descending order. Table 3.1 shows how to order and choose the middle values for the dataset provided above.

Table 3.1. Isolating the values for computing a median in a data set

Position

1

2

3

4

5

6

**7**

**8**

9

10

11

12

13

14

Value

1

2

3

4

4

5

**6**

**7**

7

7

8

10

24

36

We obtain a median value of 6.5. The difference between the mean (8.86) and the median (6.5) is significant, of 2.36. This is much smaller than the difference between the mean calculated on all data and the mean for the first 12 values (5.33), of 3.53.

The median for the first 12 values (without outliers) is 5.5, which is extremely close to the 5.33 value for the mean.

What does this tell us? It is clear that the median is better at explaining data which has abnormal values, and that for relatively homogeneous data, it will make not much difference if either the mean or the median is used.

Another widely used value for describing the 'average' or central tendency is the mode, which takes the value of the most frequently occurring observations. It is sometimes useful in pointing out which value will occur the most. But it may also not give you a good measure of it, if there are several most frequently occurring values.

Along with the central tendency, measures of variation in the data are also used in describing a set, or series, of data.

The standard deviation is the most commonly used measure. The formula for it is:

(3.2)

This formula is based on taking the difference between each value x and the average of all values . The reason for raising it to a power of two is for taking into consideration the absolute value of the difference, and thus seeing how spread out the values are from their 'average' or 'central tendency'. If this was not done, a positive difference between a data value and the mean could be offset by a negative value. Then, these differences are divided by the number of elements in the data set to get the average spread of the data. In the end, we take the square root of the entire expression to offset raising it to the power of two.

Very often the standard deviation is rather an abstract value, which does not describe much how spread out the data is. The common practice is to compute the coefficient of variation, by dividing the standard deviation with the mean. A value over 35% for the coefficient of variation means that data is fairly spread out; below it is considered homogeneous.

Let us calculate these measures by using our data. For simplicity, we will use the STDEV function in Excel. For all 14 values, we obtain a standard deviation of 9.57. Dividing it by its mean gives a coefficient of variation of 1.08, or 108%, which indicates that data is indeed, extremely spread out.

Taking the first 12 values only, the standard deviation is 2.64, with a coefficient of variation of 49%. This also indicated that data is fairly spread out, albeit to a much lesser extent.

In addition to these measures, there are also quartiles and percentiles to describe the data. They go into further detail in explaining your data by dividing it into parts, using boundaries that split if after being sorted in ascending order. That is, for unemployment data, the first quartile contains the bottom 25% of the observations, which are less or equal to the first quartile boundary.

The second quartile boundary is the median, and splits the data into two halves, one lower than the median and one higher than the median. The third quartile contains data from the median up to the boundary of the third quartile; above that boundary, we can find the data with contains the highest 25% of the values. This example is illustrated in figure 3.2.

Figure 3.2. The layout of quartile distribution

1st quartile 2nd quartile 3rd quartile

boundary boundary(median) boundary

25% of the smallest values

25% of next smallest values

25% of bottom highest values

25% of the highest values

Percentiles use the same principle. They divide the data into one hundred parts, starting with the lowest 1% of the values to the highest 1%.

But why are quartiles and percentiles useful after all? They help describe the distribution of the data, distribution that is often non-symmetrical, and help us divide it into meaningful ways to do an analysis. For example, one would like to look at the customers who spend the least money on our company's products, or isolate the top 5% of them that may need special attention and increased customer service.

As an example, I will use a dataset of 163,639 stock transactions, and summarize it. In the R Commander, in the Statistics menu, there is the Numerical Summaries submenu where there are submenus for computing all of the values described above except for median and mode.

Figure 3.3. The Numerical Summaries window

The results look like this:

mean sd cv 0% 25% 50% 75% 100% n

286.8258 810.0987 2.824358 1 48.51 100 258 40000 163639

This corresponds to the following command that appears in the script window

numSummary(au2012[,"Amount"], statistics=c("median", "sd", "quantiles", "cv"), quantiles=c(0,.25,.5,.75,1))

The quantile boundaries can be easily changed in order to get the required percentiles, e.g. by changing the quantiles part in the above command to quantiles=c(0,.05,.5,.75,.95,1)). This will give the output:

mean sd cv 0% 5% 50% 75% 95% 100% n

286.8258 810.0987 2.824358 1 15 100 258 1000 40000 163639

In the same Statistics menu, there is another option similar to Numerical summaries, that would enable you to compute basic statistics, called table of statistics, as shown in figure 3.4.

Figure 3.4. The Table of Statistics window

While there are books that would treat these concepts in a more rigorous way, here the purpose is to give the reader the basics in a way that is easy to understand and follow, and translate these formal concepts in an easy-to use language, easy to explain to co-workers and managers.

**3.2 Basic data exploration**

Once we got a grasp about a data series, we often want to explore relationships between variables. For example, to what extent ice cream sales are dependent on temperature, or how education and tenure influence salary levels.

In order to infer relationships between the data, we have two basic techniques, correlation and regression.

**3.2.1 Correlation**

Correlation gives the extent to which one variables changes as a second variable changes.

In the same statistics menu, correlation matrix will allow you to do this, by picking two or more variables.

Figure 3.6. The correlation window

R will compute correlations between sets of two variables at a time. Either Pearson or Spearman correlation methods can be chosen. The difference is in the way each is computed.

Essentially, when we talk about correlation we talk about Pearson correlation, computed with the actual numeric values of the variables. Intuitively, the value obtained says how much a change of, say $100 in sales is reflected in an increase or a decrease of $X in revenue.

The Spearman method uses the ranks of the variables in computing the relationship between the ranks of the values of the analysis variable. This methods is often preferred when the data is less homogeneous, or has a strange distribution with, say, a lot of values close to the minimum, or clustered together around two-three values. In this case the Pearson correlation may give unreliable results given the distribution of the variables.

Now which method to use when doing an analysis? As a rule of thumb, it is advisable to compute both when there are doubts about the 'regular' distribution of the data, and use the Spearman results if they differ a lot from the Pearson results. For 'normally distributed data' where most values cluster around the mean within one standard deviation, results are fairly close for both methods.

Other useful options are the treatment of missing data and the pairwise p-values. I often prefer using the pairwise-completed cases so as to retain as much data as possible for analysis. The p-values are used to provide a measure of reliability of the correlations results obtained.

As an example, we will compute the correlation between three variables: unemployment rates, GDP growth, and shares of workers employed in manufacturing, shown as S2, using the Pearson method.

The result shown in the output window is a correlation matrix. The diagonal is 1, showing a perfect correlation of a variable with itself. This is not of interest, except for the fact that we can see when R is malfunctioning due to some error! What is of interested is the relationship between variables. Thus, we can see that we have 0.11 correlation between GDP growth and share of manufacturing in total employment, a 0.03 correlation between GDP growth and unemployment, and -0.03 between unemployment and share of manufacturing in total employment.

GDP S2 UE

GDP 1.00 0.11 0.03

S2 0.11 1.00 -0.03

UE 0.03 -0.03 1.00

n= 1077

What does this tell us? First there is almost no relationship between unemployment (UE) and GDP growth, and unemployment and share of manufacturing in total employment. There is an extremely weak positive correlation between GDP growth and share of manufacturing in total employment, that is the higher the share of manufacturing the higher the GDP growth.

As a guideline, positive numbers point to positive correlation, that is an increase or decrease in one variable is associated with a corresponding increase or decrease in another variable. A negative correlation shows that an increase in one variable is associated with a decrease in another variable and vice-versa.

The correlation coefficient shows the strength of the relationship. Values above 0.7 or below -0.7 show a strong relationship between variables. Values above 0.5 or below -0.5 show moderate correlation. Weak correlation is between 0.5 and -0.5. Perfect correlation is 1 and no correlation is close to, or 0.

The p-values shown in the output window tell us about the reliability of the results. Is the correlation coefficient significant? In order to be so the p-value has to be very small, below 0.05.

P

GDP S2 UE

GDP 0.0004 0.3349

S2 0.0004 0.2559

UE 0.3349 0.2559

From the output above, corresponding to the correlation coefficients obtained we can see that only the relationship between GDP growth and share of manufacturing in total employment is significant. No p-values are computed for correlations between the same variables (e.g. between UE and UE). For the other relationships, the absence of correlation (or zero correlation) is confirmed by the high p-values, that show that correlation coefficients are not reliable.

Before any complicated analysis, it is often a good practice to run a correlation analysis for all the variables of interest in the data set. Thus, the relationships that may exist between them will be indentified, which will be useful when more advanced methods to analyze data will be employed.

**3.2.2 Regression Analysis**

However good the correlation may be in showing the relationships between variables, it has its limitations in many respects. One of them is that it can only model relationships between two variables at a time. Another one is the lack of adequate measures to ensure that the relationship is correctly specified.

Regression analysis is a superior technique in this respect. It can model relations between several variables at a time, and even allow us to infer the combined effect of two variables taken together. It can allow to isolate the effect of a certain event that may have a major impact, and also the effect of time on the evolution of certain variables. And, last but not least, it offers the opportunity to check whether the results are valid, and detect potential problems associated with modeling the evolution of one variable with respect to others.

More formally, regression is defined as the statistical technique that attempts to explain the relationship between one variable, called dependent variable, as a function of the changes in other variables, called the independent and explanatory variables.

The equation for the regression model is as follows:

(3.3)

where y is the explanatory variable, Î± is the intercept which indicates the value of y when all X variables take the value of 0, Î²i are coefficients for the independent (also called explanatory) xi, and is the error term, which represents the variation in y that is unexplained by the intercept and the explanatory variables.

As an example, let us load the Salaries data set from the car package.

Figure 3.8. How to import data from an R package

**Choosing a regression model**

Modeling relationships between variables in econometrics always starts with an assumption. This assumption is the basis of the new model to be created, and it consists of the fact that a dependent variable will be influenced by other explanatory variables. For example, for a national retail chain, it could be assumed that sales are influenced by the size of the population of the cities where its stores are located.

However, the relationship between variables is not the only thing to have in mind when we build an assumption. We also need to make an assumption about the way in which the variables are related. For our example, we will assume that the bigger the population, the higher the sales of a store. In statistical terms, we say that there is a positive relationship between sales and population, and that the regression coefficient for the population as explanatory variable is positive.

Using the Salaries data set, we will try to find out to what extent salaries of university professors are influenced by the number of years since graduation and the number of years of experience since employed as professors. For this, we assume that, as professors have more work experience and more time has passed since PhD graduation, the higher their salaries.

This is done by going in the Statistics menu, selecting the Fit models submenu, and then Linear model. In the window that appears, shown in figure 3.9, the model (here called a) is specified.

In the left-hand window below Model formula the variable Salary will be entered by double-clicking it in the variables window above. This will be the dependent variable. By clicking the other variables, yrs.service and yrs.since.phd, they will be added in the longer window showing the explanatory variables.

Figure 3.9 Linear model window

By clicking Ok, we obtain the following output:

Call:

lm(formula = salary ~ yrs.service + yrs.since.phd, data = Salaries)

Residuals:

Min 1Q Median 3Q Max

-79735 -19823 -2617 15149 106149

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 89912.2 2843.6 31.620 < 2e-16 ***

yrs.service -629.1 254.5 -2.472 0.0138 *

yrs.since.phd 1562.9 256.8 6.086 2.75e-09 ***

**---**

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 27360 on 394 degrees of freedom

Multiple R-squared: 0.1883, Adjusted R-squared: 0.1842

F-statistic: 45.71 on 2 and 394 DF, p-value: < 2.2e-16

The most important elements of this output will be explained below.

On top, after the formula, R shows the layout of the regression model. With respect to the theoretical model above, we can rewrite it as:

Salary= intercept+ years of service+ years since graduation+ error

Please note that the error term and the intercept are not explicitly stated in the R output, but they are computed and their values are used in producing the estimates and statistics of the model.

After the formula and the distribution of residuals there is the most important part of the results. Here we have the coefficients of the regression corresponding to Î± and Î²i (or intercept and regression coefficients) from the regression formula (1) above. Here the coefficients are 89912.2 for a, and -629.1 for years of service, and 1562.9 for years since graduation.

With them, we can write the estimated regression equation for the salaries of university professors as:

Salary= 89912.2 -629.1*years of service+ 1562.9* years since graduation (3.4)

The hats above the coefficients indicate that they are estimated values, which means that the salary variable is an estimate different from the actual value of the salaries in our file. This is true if we look at the right-hand side of the equation which shows that estimated salaries are explained by using the estimated coefficients, multiplied by the values of the explanatory variables. The error term, which gives the unexplained variation between estimates salaries and actual salaries is missing from the equation.

There is one thing we will need in order to make sure we can write this equation. When estimated coefficients are used, it is always necessary to verify their reliability. For this, there are several statistical values available, shown in the output right after the regression estimates: the standard error, t values and p-values. These indicate the strength of the relationship between the intercept or one explanatory variable and the dependent variable. In short, this gives an answer to the question: can we use this estimate in writing the estimated regression equation above or not?

The best statistic to use in making this judgment is the p-value, shown as Pr(>|t|) in the output. The smaller it is, the stronger its reliability. It should be smaller than 0.05, although in some cases higher values can be accepted based on specific considerations of the analysis. The way R expresses small values is using the scientific notation e.g. 2.75e-09. This apparently complicated number shows a small value 2.75*10-9 that is 2.75 times 1/10 to the power of 9. Fortunately, R output also gives significance codes shown as dots or asterisks, that follow any p-value that appears in the output. The 2.75e-09 is followed by three asterisks, which shows significance at 0.001 level.

As a rule of thumb, any p-value followed by at least one asterisk is significant, and marginally significant when followed by a dot (which corresponds to a 10% significance level). In our case, we have all p-values followed by one or there asterisks, therefore we can conclude that all coefficients are significant and safely write equation 3.4.

Thus, p-values helped to answer the question whether the explanatory variables included in the model are significant. But they do not answer another major question: whether the model is reliable? Do all the variables in the model do a good job in explaining the evolution of the dependent variable?

The coefficient of determination, or more commonly known, the R2, is the metric that shows the overall reliability of the model.

The R2 is essentially the ratio of the explained sum of squares over the total sum of squares. A more intuitive definition is that R2 is the variation of the estimated dependent variable versus the variation of the actual dependent variable. Note that R2 is also called the goodness of fit measure of the regression, which tells how well our model fits the data. The higher it is, the better the model explains the evolution of the dependent variable. While there is considerable debate over the value of the R2 that indicates a good fit, a value above 0.7 is considered acceptable, 0.8 is good and 0.9 is excellent. However, one should have reservations over the high values of R2, as they may point out to overfitting the data. I will come back to this in the next chapter.

By looking at our model we see that R2 is 0.1883. This shows that the model does a poor job in describing the data, and that results should be taken with great caution. That is, we can say that the explanatory variables, years from graduation and seniority, help explain to a certain extent professors' salaries, but we do not have a good model that will allow us to estimate their salaries based on these variables alone.

The remedy to this usually consist in searching for another model, which contains explanatory variables that do a better job at explaining professors' salaries. It may be that those teaching at Ivy League universities get better salaries. Perhaps academic prestige, measured in the number of publications, can command a higher salary. Maybe the gender bias and the position held (assistant professor, lecturer, full professor) translate in salary differences.

Another good measure to assess a model is to look at the plot of residuals. Residuals are the standard errors and, for classic linear models as the one described above, are supposed to be independent from each other, and randomly distributed. This means that if we plot them against the actual values of the explanatory variables, we should obtain a cloud of points randomly spread.

Using the following commands we will get the residuals for our model and store them into a new variable, resdt, and we will plot it against the salaries.

resdt=resid(a)

plot(Salaries$salaries, resdt,ylab="Residuals", xlab="salaries")

**Figure 3.10**

Residuals plot

It is clear from figure 3.10 that residuals are not random, and that their value fluctuates with the size of the salary. This is clearly not acceptable, and this shows that our explanatory variables taken together do not explain well the evolution of the dependent variable.

**3.2.2.1 Regression Analysis. Features, Properties and Other Key Facts**

Before going on expanding our analysis and addressing known issues, it is better to do a brief theoretical overview. While this book is not supposed to be an econometrics text, its aim is to introduce the most frequently used methods in business analytics. And in order to understand some data issues and solve them, it is necessary to see a review of them and to understand their basic nature and implications. This review will also be useful in the future should you want to learn more about regression and do more advanced work.

The model used is called the linear regression model. It is the most widely used and taught. Although there are better models that do a better job at estimating relationships between variables, this model, often referred as the OLS model, is very easy to work with, has straightforward properties, and, in most cases gives reliable results.

The key properties that an OLS estimation should have are called BLUE. That is, OLS estimation is the Best Linear Unbiased Estimator. A short version of these properties is that:

1) The estimates are unbiased. That is, the estimates of the value of the estimated parameters are close to their true value.

2) The estimates are consistent. As the number of observations increases, estimates converge to their true parameters for all data available.

3) The estimates are efficient. That is, the distribution of the coefficient estimates are as tightly distributed as possible around their true but unknown values. No other estimator has a lower variance for the estimated coefficients than the OLS.

Another useful feature of linear regression is the fact that one can easily estimate the relative impact of one explanatory variable on the dependent variable. The average of the explanatory variable, multiplied by its regression coefficient, and then divided by the average of the dependent variable will give you its relative influence.

A frequently encountered issue in linear regression analysis is multicollinearity. That is, some explanatory variables tend to move together very closely. This may pose issues with estimating a regression. The regression may just not work, or estimates can become inefficient in the sense that standard errors are higher and p-values are lower, with an impact on the true significance of the regression coefficients.

Usually multicollinearity can be seen by doing a correlation analysis on all variables. The correlation results obtained, often in the form of a matrix, is useful as a guide in running regressions. Of several variables are highly correlated with coefficients over 0.7, or less than -0.7.

**3.2.2.2 Heteroskedasticity**

One of the major problems that can affect the estimation and validity of models is heteroskedasticity. A classic definition from Greene (2011) is 'regression disturbances whose variances are not constant across observations are heteroscedastic'.

Let's look at an example in order to explain it. Using a data set on unemployment, we will show an example of heteroskedasticity, and discuss its implications.

**Dummy variables**

Modeling relationships between variables often employs what is called a dummy variable. Unlike other variables that can take a wide range of values, dummy variables, often called indicator variables or flags, only take two values, 1 for the existence of a certain thing or phenomena, or 0 when it is not present.

Dummy variables are useful in modeling data when there are certain variables whose existence may impact the result of an estimation. For example, we can consider home ownership as significant in determining the average expenditure of a household since it may be a sign of wealth and financial stability. In other cases, sales for certain stores may be influenced by the fact that they is placed in a popular resort instead of a regular city.

In other cases, dummy variables are used to account for seasonality, and this is a recommended approach when working with monthly or quarterly data.

One final word or caution; if you choose to use several dummies in your model, you need to make sure that adding them up does not yield a vector with a constant value of 1. Thus, an important property that is intrinsic to regression model is violated. And is a common mistake done by inexperienced analysts! This will prevent you from performing regression analysis for reasons relating to matrix algebra. This discussion is complex, but what you should remember is that if you will use several dummy variables to refer to one type of occurrence (for example seasonality, structure of sales), you should always omit one. For example, for quarterly data, you will not use four dummies, but three.

Let's take the following regression:

average expenditure= a+ b1*age+b2*income+b3*squared income+b4*dummy variable for home ownership.

The residual plot against income does not show a random scatter of the residuals, which are more grouped together around 2 (20,000), more spread out as we go towards 6 (60,000) and more grouped together as we go over 6. This is shown in figure 3.11.

Call:

lm(formula = Avgexp ~ Income + Incsq + Age + Ownrent, data = Dataset)

Residuals:

Min 1Q Median 3Q Max

-429.03 -130.36 -51.10 53.98 1460.62

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -237.147 199.352 -1.190 0.23841

Income 234.347 80.366 2.916 0.00482 **

Incsq -14.997 7.469 -2.008 0.04870 *

Age -3.082 5.515 -0.559 0.57814

Ownrent 27.941 82.922 0.337 0.73721

**---**

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 284.8 on 67 degrees of freedom

Multiple R-squared: 0.2436, Adjusted R-squared: 0.1984

F-statistic: 5.394 on 4 and 67 DF, p-value: 0.0007952

**Figure 3.11**

Residuals plot

for

heteroskedastic

data

While in this example heteroskedasticity is quite obvious, in real life this may not be the case. In order to be on the safe side and prove heteroskedasticity, a heteroskedasticity test should be used. The Breusch-Pagan test, available in the package lmtest, will be run in order to prove heteroskedasticity.

After loading the lmtest package from RCommander, the command to run the test is:

bptest(a, varformula = NULL, studentize = FALSE, data = list())

And the output obtained is:

Breusch-Pagan test

data: a

BP = 49.0616, df = 4, p-value = 5.669e-10

The results above show that heteroskedasticity is present in the data set, since the null hypothesis of heteroskedasticity is rejected. Rejection is shown by the p-value very close to 0.

What are the consequences of heteroskedasticity, in terms of the BLUE properties? The estimates will be unbiased; in other words the regression coefficients we have obtained estimate the true influence of the explanatory variable on the dependent variable. The same applies for consistency.

The efficiency of the estimates will be impacted by heteroskedasticity. This means that the squared errors, t tests and p-values are not reliable any more. They may be larger or smaller as they are affected by the changing variance in the data. Thus p-values may show a variable is significant while in fact it isn't, or that a variable is not significant when in fact it is. This may lead to a poor specification of our model.

To offset the impact of heteroskedasticity and obtain corrected errors, t tests and p-values, a widely used method is to rescale the estimates with the size of the error terms.

For this we will run the macro below in the script window. This will create a function that will compute the adjusted standard errors, t tests and p-values for all the coefficients.

summaryh <- function(model) {

sm <- summary(model)

X <- model.matrix(model)

usq <- residuals(model)^2

XDX <- 0

for( i in 1:nrow( X)) {

XDX <- XDX + usq[i]*X[i,]%*%t(X[i,])}

XX1 <- solve(t(X)%*%X)

varcovar <- XX1 %*% XDX %*% XX1

stdh <- sqrt(diag(varcovar))

t <- model$coefficients/stdh

p <- 2*pnorm( -abs(t))

results <- cbind(model$coefficients, stdh, t, p)

dimnames(results) <- dimnames( sm$coefficients)

results

**}**

Afterwards, by running the command summary(model_name) (e.g. summary(a) for my example) function to get the corrected statistics for the coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -237.146514 212.990530 -1.1134134 0.265530915

Income 234.347027 88.866352 2.6370727 0.008362492

Incsq -14.996844 6.944563 -2.1595085 0.030810735

Age -3.081814 3.301661 -0.9334132 0.350606689

Ownrent 27.940908 92.187777 0.3030869 0.761823630

The change in the p-values is obvious; for age the p-value decreases dramatically from 0.58 to 0.35, while for income it increases from 0.005 to 0.008. However, for the sake of the present analysis and development of the regression model on page 65, this does has much impact on the initial estimates.

**3.2.2.3 Autocorrelation**

Another major problem in the data is autocorrelation. Simply put, autocorrelation means that the current values of a variable are influenced by its past values. Another name for it, frequently met in the econometric books, is serial correlation.

Autocorrelation occurs frequently in real life and it is quite common to many data. In fact, most of the economic activity is related to its past. Current sales can be modeled as a function of past sales; likewise, price indexes, population numbers, city areas, unemployment and many other variables are often influenced by their past values. This can often alter the results of our estimations, and lead to estimations where standard errors are underestimated, and error terms are overestimated.

Having said that, let us take an example of how sales can exhibit autocorrelation, how to spot and correct for it. This example from Studenmund (2006) is used to estimate the demand of chicken Y as a function of the price for chicken (PC), price for beef (PB) and disposable income per capita (YD). The specification of this equation can be seen in the output, right after the lm(formula= output text.

As the result is good for about every parameter (R2, p-values) indicates that we have a good model.

Call:

lm(formula = Y ~ PB + PC + YD, data = Dataset)

Residuals:

Min 1Q Median 3Q Max

-3.4983 -1.4777 -0.4272 1.4260 3.5298

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 27.59394 1.58446 17.415 < 2e-16 ***

PB 0.09219 0.03988 2.311 0.026644 *

PC -0.60716 0.15712 -3.864 0.000447 ***

YD 0.24486 0.01110 22.069 < 2e-16 ***

**---**

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.994 on 36 degrees of freedom

Multiple R-squared: 0.9904, Adjusted R-squared: 0.9896

F-statistic: 1237 on 3 and 36 DF, p-value: < 2.2e-16

However, the residual plot in figure 3.12 shows a distinct pattern in the error terms which does not resemble randomness. This fact is confirmed by plot of residuals against time, which shows that error terms have a tendency to fluctuate over time from negative values to positive values and vice-versa, and by the Durbin-Watson statistics which shows that autocorrelation exists. Durbin-Watson is run by loading first the package lmtest, and then using the command dwtest(a).

Durbin-Watson test

data: a

DW = 0.8978, p-value = 9.26e-06

alternative hypothesis: true autocorrelation is greater than 0

The Durbin-Watson test confirms the existence of autocorrelation as the test value is different than 2.

Figure 3.12 Residuals plots for autocorrelation

What are the consequences of autocorrelation? The model will be inconsistent, and the error terms are to a certain extent related to each other.

The most common way to remove autocorrelation is to calculate the first difference between the variables (that is the difference between a variable and its lagged value, or in other words, its value at time t-1). The form of this model is shown in formula 3.5.

(3.5)

However, this results in a model that does not use the levels of variables. In order to overcome this shortcoming, the following model is being estimated:

(3.6)

where represents the autocorrelation coefficient, which shows to what extent the past values of the variables explain their current levels.

The best way to assess the validity of these models is to check the residual plot and run the Breusch-Godfrey test [1] with the command bgtest(a).

**Computing lagged values in R**

This relatively simple operation in Excel or other statistical software is a bit cumbersome in R.

A quick solution is shown below:

- load the dataset in the window

- put all the original dataset data into a new dataset called a with the command

a<-Dataset

- move the desired variable one period ahead with the command

y <- c(NA, head(a$Y, -1))

this will insert an NA on top of the series so that the data is moved one period ahead

- merge this new variable with the rest of the data using z<-cbind(Dataset,y)

Now the data looks like this, with y being the lag of Y.

An example of the first solution is to include the lagged values of all variables in the regression

a <- lm(Y ~ YD + DY + PB + PC +DYD +DPB +DPC, data= Dataset)

with the prefix D noting the lagged values.

The resulting output is:

Call:

lm(formula = Y ~ YD + DY + PB + PC + DYD + DPB + DPC, data = Dataset)

Residuals:

Min 1Q Median 3Q Max

-2.00876 -0.52786 0.03516 0.52525 1.79014

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 10.27246 2.85356 3.600 0.001095 **

YD -0.07089 0.09927 -0.714 0.480471

DY 0.64648 0.09003 7.180 4.51e-08 ***

PB 0.14236 0.03773 3.774 0.000683 ***

PC -0.09393 0.10405 -0.903 0.373641

DYD 0.16820 0.10064 1.671 0.104725

DPB -0.10668 0.03766 -2.833 0.008043 **

DPC -0.10322 0.09701 -1.064 0.295527

**---**

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.007 on 31 degrees of freedom

(1 observation deleted due to missingness)

Multiple R-squared: 0.9978, Adjusted R-squared: 0.9973

F-statistic: 1986 on 7 and 31 DF, p-value: < 2.2e-16

Running the regression again by omitting one outlier and the non-significant variables, with p-values above 0.1, the following results are obtained:

Call:

lm(formula = Y ~ DY + PB + DPB, data = b)

Residuals:

Min 1Q Median 3Q Max

-1.99848 -0.75684 -0.05301 0.72590 2.05733

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.26227 0.49120 0.534 0.596855

DY 1.01724 0.01891 53.782 < 2e-16 ***

PB 0.14834 0.03541 4.189 0.000188 ***

DPB -0.13744 0.03751 -3.664 0.000838 ***

**---**

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.045 on 34 degrees of freedom

(1 observation deleted due to missingness)

Multiple R-squared: 0.9973, Adjusted R-squared: 0.9971

F-statistic: 4241 on 3 and 34 DF, p-value: < 2.2e-16

Figure 3.13 Residuals plots for autocorrelation-corrected regression

In the regression model Y ~ DY + PB + DPB [2] , residuals appear to be r distributed in a more random fashion than in the initial model, and the Breusch- Godfrey test shows that there is no serial correlation detected in our model at a 5% level of significance. A further look at the regression results show a very high R2 and p-values much lower than those in the initial regression.

Breusch-Godfrey test for serial correlation of order up to 1

data: a

LM test = 3.6021, df = 1, p-value = 0.05771

Results obtained confirm the fact that autocorrelation has been corrected to a significant extent by using the model described in equation 3.6. However, autocorrelation has not been removed entirely; if we relax the level of significance to 10%, we do have significant autocorrelation, a fact suggested by the residual plot against time.

There are other more advanced methods to deal with autocorrelation and its consequences, but they are far beyond the scope of this book, and require specialized and more rigorous treatment. Another thing worth mentioning, which is tremendously important in practice, is that these methods are not foolproof, and do not always yield nicely behaved results as in standard statistics and econometrics textbooks. Therefore, it is important to have a good grasp of the statistical/econometric methods, and to always use diagnostics in order to assess the quality of the model and detect the estimation issues. With this final remark, you may have spotted another issue with the estimation from equation 3.6, which will be addressed in the exercises section.

**3.2.2.4 Doing basic forecasts based on regression analysis results**

The regression models obtained can be used to predict the values of a dependent variable for values, or combinations of values, of the explanatory variables that are not in the estimation data set. Forecasts are often called predictions, where estimation of the dependent variable is not deemed to be that rigorous.

Most of the time forecasts comprise estimating the value of a dependent variable in the future; however, estimations can be done for the same time period, using values of explanatory variables which are not in the initial data set.

A simple forecasting model can be derived from results of a regression model, provided this has a good R2, and coefficients with significant p-values.

From the last regression performed in section 3.2.2.4, the estimated demand for chicken is derived, as in equation 3.4.

Å¶= 0.26227 + 1.01724*DY+ 0.14834* PB -0.13744*DPB (3.7)

With this equation, we can attempt to estimate the price of chicken Å¶ [3] at different possible combinations of its values in the previous period (DY), and of beef prices (PB and DPB) at current and, respectively, previous time periods.

One remark before going forward: the intercept was included in the estimation equation, although its p-value indicates it is not reliable. While this contradicts the common practice of eliminating coefficients for variables that are not significant, it is commonly done because omitting it may cause significant breaks in forecasted values with respect to the actual values, which may cause confusion and undermine the credibility of the forecasts. As a rule, coefficients for the intercept should always be included in the estimated and forecasted equations for a model.

Using the 1998 values for the above-mentioned variables, we get the following estimation for demand for chicken for year 1998:

Å¶= 0.26227+1.01724*83.67+ 0.14834*59.6 -0.13744*63.1= 85.543 (3.8)

This value is by 2 % off the actual value of 83.89, but significantly below the coefficient of variation for Y, of 38.6%. While this forecast is fairly close to the actual value, it may cause compatibility issues, especially when data from the forecasts done for future periods is used in continuation of existing data series.

Another issue with forecast done above is that only a point estimate for a period of time is produced. This is fine from the computational point of view, but not satisfactory from the statistical point of view. First, point estimates are related to the time period used to produce the coefficient estimates used in computing them. Using a different time period may yield different regression coefficients, and removing outliers can also alter the results. Second, both the regression model and the forecast are statistical models, subject to a certain degree of uncertainty, which is also true for the forecasted values.

In order to account for and incorporate this uncertainty into our forecasts, forecast confidence intervals are used. Their use is essential to compute a range of values where the actual future values of the forecasted variable will lie with a high degree of confidence. This is the regular approach of forecasters, which provide three values for each point in time: a low value, a middle value and a high value.

Forecasted intervals are computed from the point forecasted values using the following formula:

(3.9)

with being the critical value for the forecast confidence interval (1.96 for a 95% interval, 1.65 for a 90% interval, and 2.54 for a 99% interval), h the notation for the number periods for which the forecast is done, and the standard deviation of the forecast error, which is computed with the following formula:

(3.10)

For the model 3.7, computing a forecast involves the following steps:

- getting point estimates, as in equation 3.8 for both future values and for

existing values

- computing for existing values, as in equation 3.10

- computing the upper and lower bounds of the forecast interval at a given confidence level, as in equation 3.9.

An example of the second and third steps is given below. In table 3.2, columns represent the variables and steps needed to perform calculations for the second step. is obtained by summing up the values and then carrying out the calculations in formula 3.10, to obtain 0.9885.

Table 3.2. Layout for computing the elements of forecast error standard deviation

Year

Å¶

1961

25.95

24.38

1.57

2.46

1962

25.94

27.04

-1.10

1.22

1963

27.22

26.67

0.55

0.30

1964

27.82

27.89

-0.07

0.00

1965

29.77

29.04

0.73

0.53

**...**

**...**

**...**

**..**

**...**

Total

37.13

The upper and lower bound of the confidence intervals (Å¶u and Å¶l) are computed by adding and, respectively, substracting the product of equation 3.9, which is 1.96*0.9885=1.937 for a 95% confidence interval. For our example, in equation 3.8, we have three forecast values for 1999: Å¶u = 87.48, a middle one of 85.543, and a low one of 83.61. The interval [87.48, 83.61] contains the actual value of 83.89, which tells us that our estimations are valid.

There is one additional issue that needs to be solved in order to do a reliable forecast for future time periods: the value of the explanatory variables for future periods. In the case of simple autoregression, past values can be used to generate future values. For other explanatory variables, a way must be found to infer their future values, either as average values or as a function. We will attempt to do a forecast based on making some assumptions about the PB variable in the assignment.

**3.3 Logistic regression**

So far we have examined statistical models that attempt to model what is called continuous variables. This are the most commonly used variables, and can take almost any value within a certain interval, with a distribution that is somehow regular. Sales, unemployment, temperature, salaries are continuous variables.

However, there are a different types of variables that are equally important in analysis work. These are the binary variables, that can take only two values, 0 or 1, and can express a series of events or phenomena such as the existence or absence of a characteristic or event.

Modeling the evolution of these variables as a function of other explanatory variables requires a completely different approach than the one used in modeling continuous variables for a variety of reasons.

First, a linear model does not offer the guarantee that resulting values are confined to a set of values. Results are quite often below 0 or above 1, and making an assumption that values below 0 are 0 and values above 1 are 1 will yield in an abnormally high proportion of 0 and 1, thus making the estimation extremely inaccurate.

Another reason is that binary relationships are expressed as probabilities of occurrence, that is at some values of the explanatory variables there will be a proportion of observations of the binary variable which are 0 and another proportion that will be 1. These probabilities will have a certain layout of occurrence, increasing from 0 to 1 over the range of the dependent variables.

The standard formula for logit regression is the following:

with Di representing the probability of occurrence of the binary variable, with the other terms , and defined in much the same way as in the linear regression.

Using the binary.xls dataset, available from UCLA Statistical Consulting Group (2013), we will demonstrate how to run a logistic regression , by modelling the admission of students to university as a function of their gpa (graduate average from previous studies), gre (aptitude test results) and rank of the institution where the gpa was obtained.

Logistic regression is made available through the RCommander interface, menu Statistics, submenu Fit Models, Generalized Linear Model. In the Generalized Linear Model window, the binary variable is the first to be selected, and will appear in the box below the Model formula text. Explanatory variables are added by subsequent clicks in the Variables window. The options binomial for Family, and logit for link function are essential for geeting the probit model.

Figure 3.14 Window for running the logistic regression in RCommander

Call:

glm(formula = admit ~ gpa + gre + rank, family = binomial(logit),data = prob)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.5802 -0.8848 -0.6382 1.1575 2.1732

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -3.449548 1.132846 -3.045 0.00233 **

gpa 0.777014 0.327484 2.373 0.01766 *

gre 0.002294 0.001092 2.101 0.03564 *

rank -0.560031 0.127137 -4.405 1.06e-05 ***

**---**

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 499.98 on 399 degrees of freedom

Residual deviance: 459.44 on 396 degrees of freedom

AIC: 467.44

Number of Fisher Scoring iterations: 4

Most parts of this output are quite similar to the ones encountered in linear regression. We got the formula part, which shows the variables used, admit, gpa, gre and rank, the dataset prob, and the models used family = binomial(logit). We also have estimated coefficients of the intercept and explanatory variables, along with their standard errors, z-values and p-values.

What is now different and a bit embarrassing is that the statistics at the end of the output are fairly non-intuitive. However, a quick way to ascertain whether the model obtained has a good explanatory power is to run a classical goodness of fit test for the logistic regression, the Hosmer and Lemenshow test.

A reliable implementation of it is in package R330, which has to be installed and launched following the instruction in section 2.2.2.5. The command HLstat (modelname)will give the following output:

Value of HL statistic = 3.519

P-value = 0.898

With the Hosmer and Lemenshow test , a model is assessed as reliable if the resulting p-value is high (over 0.05 or 0.1, depending on the level of significance chosen). In this case, the model is assessed as reliable as the p-value obtained is much higher than 0.05.

One final word before the end of the chapter on the regressions. Both linear and logistic regressions are important to know, study and apply. This is motivated by the fact that not only they are relatively easy to understand and use, but they are also the building blocks of more advanced data mining and machine learning models, some of which will be presented in the next chapter.

**3.4 Exercises and questions for review**

1) Your manager asks you to produce a report with top 10% of the customers by sales and their average sales.

a) How would you do this in R? List the commands to be performed. You do not need to do all work in R.

b) What are the potential issues with their average sales? Please comment on their relevance. Would you propose as an alternative measure? If so, which one?

2) Use the 'wages' dataset from the package lmtest (you will need to load the lmtest package in the GUI in order to load it using the Read data from package) , attach a time trend to it using the following commands:

tt <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)

wages<-cbind(wages,tt)

and do the following:

a) formulate a hypothesis for the model,

b) do a correlation matrix and comment on it

c) run the model and report the results from the output window in R (paste it in here)

d) make an assessment of the results. Have you obtained a valid model? Based on the values for the coefficients, please write the estimating equation

e) check the residual plot and comment on it. Are there any issues that can be spotted?

f) based on d) and e) , is it possible to improve the model? If so, please submit the output, residual plot and any relevant tests

Note: you should check for relationships between variables and for different specifications

3. In the last regression for the autocorrelation example, there is another issue revealed by examining the residual plots. Please identify the issue, and run the appropriate test in order to confirm it.

4. Please do run a forecast on the values of the Studenmund dataset, using the model in equation 3.7. Compute the 95% forecast confidence interval and compare it with the actual values. What is your conclusion? Please compute the 90% forecast confidence interval and do the same. Does your conclusion change?

5. Please look at the residuals obtained by running equation 3.7. Can you isolate an outlier? If you redo the calculations at exercise 4 and compare them with actual values, what conclusion do you get?

6. The PB variable and its lagged value is essential in obtaining the model shown in equation 3.7. Suppose you are asked to compute a forecast of variable Y for the period 2000-2005. What elements do you need to do this? Please explain in detail.

7. By looking at evolution of the PB variable over time, what do you think are its plausible values for the period 2000-2005? Please explain your answer in detail.