SPSS for Beginners

 

Regression explained

 

 

Copyright © 2000 Vijay Gupta

Published by VJBooks Inc.

 

 

 

All rights reserved.  No part of this book may be used or reproduced in any form or by any means, or stored in a database or retrieval system, without prior written permission of the publisher except in the case of brief quotations embodied in reviews, articles, and research papers.  Making copies of any part of this book for any purpose other than personal use is a violation of United States and international copyright laws.  For information contact Vijay Gupta at vgupta1000@aol.com.

 

You can reach the author at vgupta1000@aol.com.

 

Library of Congress Catalog No.: Pending

ISBN: Pending

First year of printing: 2000

Date of this copy: April 23, 2000

 

 

This book is sold as is, without warranty of any kind, either express or implied, respecting the contents of this book, including but not limited to implied warranties for the book's quality, performance, merchantability, or fitness for any particular purpose.  Neither the author, the publisher and its dealers, nor distributors shall be liable to the purchaser or any other person or entity with respect to any liability, loss, or damage caused or alleged to be caused directly or indirectly by the book. 

 

Publisher: VJBooks Inc.

Editor: Vijay Gupta

Author: Vijay Gupta

 


About the Author

 

Vijay Gupta has taught statistics and econometrics to graduate students at Georgetown University.  A Georgetown University graduate with a Master's degree in economics, he has a vision of making the tools of econometrics and statistics easily accessible to professionals and graduate students.

 

In addition, he has assisted the World Bank and other organizations with statistical analysis, design of international investments, cost-benefit and sensitivity analysis, and training and troubleshooting in several areas.

 

He is currently working on:

·         a package of SPSS Scripts "Making the Formatting of Output Easy"

·         a manual on Word

·         a manual for Excel

·         a tutorial for E-Views

·         an Excel add-in "Tools for Enriching Excel's Data Analysis Capacity"

 

Expect them to be available during fall 2000.  Early versions can be downloaded from www.vgupta.com.
1. LINEAR REGRESSION

Interpretation of regression output is discussed in section 1[1].  Our approach might conflict with practices you have employed in the past, such as always looking at the R-square first.  As a result of our experience in using and teaching econometrics, we are firm believers in our approach.  You will find the presentation quite simple - everything is in one place and displayed in an orderly manner.

 

The acceptance (as being reliable/true) of regression results hinges on diagnostic checking for the breakdown of classical assumptions[2].  If there is a breakdown, then the estimation is unreliable, and thus the interpretation from section 1 is unreliable.  The table in section 2 succinctly lists the various possible breakdowns and their implications for the reliability of the regression results[3].

 

Why is the result not acceptable unless the assumptions are met?  The reason is that the strong statements inferred from a regression (e.g., "an increase of one unit in the value of variable X causes an increase in the value of variable Y by 0.21 units") depend on the presumption that the variables used in a regression, and the residuals from the regression, satisfy certain statistical properties.  These are expressed in the properties of the distribution of the residuals (which explains why so many of the diagnostic tests shown in sections 3-4, and the corrective methods, are based on the residuals).  If these properties are satisfied, then we can be confident in our interpretation of the results.

 

The above statements are based on complex formal mathematical proofs. Please check your textbook if you are curious about the formal foundations of the statements.

 

Section 3 provides a brief schema for checking for the breakdown of classical assumptions.  The testing usually involves informal (graphical) tests and formal (distribution-based) hypothesis tests like the F and t tests, with the latter involving the running of other regressions and the computation of new variables.

1.1 Interpretation of regression results

 

Assume you want to run a regression of wage on age, work experience, education, gender, and a dummy for sector of employment (whether employed in the public sector).

 

wage = function(age, work experience, education, gender, sector)

 

or, as your textbook will have it,

 

wage = b1 + b2*age + b3*work experience + b4*education + b5*gender + b6*sector
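The book estimates this equation through SPSS's menus, but the same least-squares fit can be sketched programmatically.  A minimal illustration with NumPy, using randomly generated stand-in data (all variable values and "true" coefficients below are invented for the demonstration, not taken from the book's data set):

```python
import numpy as np

# Invented stand-in data for the wage equation; in practice these columns
# would come from your SPSS data file.
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(18, 65, n)
work_ex = rng.uniform(0, 30, n)
education = rng.uniform(0, 20, n)
gender = rng.integers(0, 2, n).astype(float)   # dummy variable
pub_sec = rng.integers(0, 2, n).astype(float)  # dummy: 1 = public sector

# A known "true" relationship plus noise, so the fit has something to recover
wage = (1.0 + 0.05 * age + 0.1 * work_ex + 0.8 * education
        - 1.5 * gender + 2.0 * pub_sec + rng.normal(0, 5, n))

# Design matrix: a column of ones (for the intercept b1) plus the regressors
X = np.column_stack([np.ones(n), age, work_ex, education, gender, pub_sec])
b, _, _, _ = np.linalg.lstsq(X, wage, rcond=None)
print(b)  # estimates of b1..b6; b[3] should land near the true 0.8
```

With 200 noisy observations the estimates will not equal the true coefficients exactly, which is precisely why the significance checks discussed below matter.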

 



Always look at the model fit (“ANOVA”) first.  Do not make the mistake of looking at the R-square before checking the goodness of fit.

 

Significance of the model (“Did the model explain the deviations in the dependent variable”)

The last column (“Sig.”) shows the goodness of fit of the model.  The lower this number, the better the fit.  Typically, if “Sig.” is greater than 0.05, we conclude that our model could not fit the data[4].
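The “Sig.” value for the model is the p-value of an F test.  A sketch of the arithmetic behind that F statistic, using the sums of squares from the example's ANOVA table but a hypothetical sample size n (this excerpt does not state n, so the exact figure is assumed):

```python
# F = (ESS / (k-1)) / (RSS / (n-k)), judged against the F distribution
# with (k-1, n-k) degrees of freedom.
ESS, RSS = 54514.39, 52295.48  # "Regression" and "Residual" sums of squares
n, k = 2000, 6                 # n is hypothetical; k = coefficients incl. intercept

F = (ESS / (k - 1)) / (RSS / (n - k))
print(round(F, 1))  # a very large F, so "Sig." would be far below 0.05
```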

 

Sum of squares

In your textbook you will encounter the terms TSS, ESS, and RSS (Total, Explained, and Residual Sum of Squares, respectively). 

·         The TSS is the total deviations in the dependent variable.

·         The ESS is the amount of this total that could be explained by the model. 

·         The R-square, shown in the next table, is the ratio ESS/TSS.  It captures the percent of deviation from the mean in the dependent variable that could be explained by the model.  The RSS is the amount that could not be explained (TSS minus ESS). 

In the previous table, the column "Sum of Squares" holds the values for TSS, ESS, and RSS.  The row "Total" is TSS (106809.9 in the example), the row "Regression" is ESS (54514.39 in the example), and the row "Residual" contains the RSS (52295.48 in the example).
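These three quantities obey the identity TSS = ESS + RSS, and the R-square of the next table is just their ratio.  A quick check with the numbers from the example:

```python
# Values from the example ANOVA table ("Total", "Regression", "Residual")
TSS = 106809.9
ESS = 54514.39
RSS = 52295.48

print(round(ESS + RSS, 1))  # ~106809.9: ESS + RSS reproduces TSS
print(round(ESS / TSS, 3))  # ~0.51: the R-square reported in the next table
```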

 



Adjusted R-square

Measures the proportion of the variance in the dependent variable (wage) that was explained by variations in the independent variables. In this example, the “Adjusted R-Square” shows that 50.9% of the variance was explained.
 

R-square

Measures the proportion of the variation in the dependent variable (wage) that was explained by variations in the independent variables.  In this example, the "R-Square" tells us that 51% of the variation was explained.

Std Error of Estimate

The standard error of the estimate measures the dispersion of the dependent variable's observed values around the regression's predicted values (in this example, the “Std. Error of the Estimate” is 5.13).  Compare this to the mean of the “Predicted” values of the dependent variable.  If the Std. Error is more than 10% of the mean, it is high.
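The Model Summary statistics can be reproduced from the ANOVA table's sums of squares.  A sketch of the formulas, assuming a hypothetical sample size n (not stated in this excerpt) and k = 6 estimated coefficients:

```python
from math import sqrt

TSS, RSS = 106809.9, 52295.48  # from the example ANOVA table
n, k = 2000, 6                 # n is a hypothetical sample size

r_square = 1 - RSS / TSS                               # "R Square"
adj_r_square = 1 - (1 - r_square) * (n - 1) / (n - k)  # "Adjusted R Square"
std_error = sqrt(RSS / (n - k))                        # "Std. Error of the Estimate"

print(round(r_square, 3))      # ~0.51
print(round(adj_r_square, 3))  # slightly below R-square, as it always is
print(round(std_error, 2))
```

The adjusted R-square penalizes the fit for each extra estimated coefficient, which is why it never exceeds the plain R-square.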

 

The reliability of individual coefficients

The table “Coefficients” provides information on the confidence with which we can support each coefficient estimate (see the columns “T” and “Sig.”).  If the value in “Sig.” is less than 0.05, then we can assume that the estimate in column “B” can be asserted as true with a 95% level of confidence[5].  Always interpret the “Sig.” value first.  If this value is more than 0.1, then the coefficient estimate is not reliable because it has too much dispersion/variance.
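The “T” and “Sig.” columns are linked by a simple rule: t is the estimate divided by its standard error, and “Sig.” is the two-sided p-value of that t statistic.  A sketch with hypothetical numbers (for large samples the t distribution is close to the standard normal, so a normal-tail approximation is used here):

```python
import math

b, se_b = 0.21, 0.08  # hypothetical coefficient estimate and its std. error
t_stat = b / se_b     # the "T" column

# Two-sided p-value ("Sig."), normal approximation to the t distribution
sig = math.erfc(abs(t_stat) / math.sqrt(2))

print(round(t_stat, 2))  # 2.62
print(sig < 0.05)        # True: the estimate is reliable by the 0.05 rule
```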

The individual coefficients

The table “Coefficients” provides information on the effect of individual variables (the "Estimated Coefficients" or “beta” -- see column “B”) on the dependent variable.

 

 

 

Confidence Interval

 

Plot of residual versus predicted dependent variable

This is the plot for the standardized predicted variable and the standardized residuals.  The pattern in this plot indicates the presence of mis-specification[6] and/or heteroskedasticity[7].

 

 

 

Plot of residuals versus independent variables

The definite positive pattern indicates[8] the presence of heteroskedasticity caused, at least in part, by the variable education.

 

The plot of age and the residual has no pattern[9], which implies that no heteroskedasticity is caused by this variable.

 

 

 

 

 

 

Plots of the residuals

The thick curve should lie close to the diagonal.



Idealized Normal Curve.  In order to meet the classical assumptions, the residuals should roughly follow this curve's shape.



The histogram and the P-P plot of the residual suggest that the residual is probably normally distributed[10].  You can also use other tests to check for normality.
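One such test that is easy to compute by hand is the Jarque-Bera statistic, which checks whether the residuals' skewness and kurtosis are consistent with a normal distribution.  A sketch on made-up residuals (in practice you would first save the residuals from the regression):

```python
resid = [1.2, -0.5, 0.3, 0.8, -1.1, -0.4, 0.9, -0.2, 0.6, -0.7]  # made-up
n = len(resid)
mean = sum(resid) / n

# Central moments of the residuals
m2 = sum((r - mean) ** 2 for r in resid) / n
m3 = sum((r - mean) ** 3 for r in resid) / n
m4 = sum((r - mean) ** 4 for r in resid) / n

skew = m3 / m2 ** 1.5
kurt = m4 / m2 ** 2
jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)  # ~ chi-square, 2 df, under normality

print(round(jb, 2))  # compare with 5.99, the 5% chi-square(2) critical value
```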

 

 

 

You may want to use the Runs test to determine whether the residuals can be assumed to be randomly distributed.
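The runs test checks whether the sequence of residual signs looks random; too few or too many runs of same-sign residuals suggests a pattern.  A minimal sketch on made-up residuals:

```python
import math

resid = [1.2, -0.5, 0.3, 0.8, -1.1, -0.4, 0.9, -0.2, 0.6, -0.7]  # made-up
signs = [r > 0 for r in resid]

# A "run" is a maximal streak of same-sign residuals
runs = 1 + sum(signs[i] != signs[i - 1] for i in range(1, len(signs)))
n1, n2 = signs.count(True), signs.count(False)

# Mean and variance of the run count under the null of random ordering
exp_runs = 1 + 2 * n1 * n2 / (n1 + n2)
var_runs = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)) / ((n1 + n2) ** 2 * (n1 + n2 - 1))
z = (runs - exp_runs) / math.sqrt(var_runs)

print(runs, round(z, 2))  # |z| > 1.96 would reject randomness at the 5% level
```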


Regression output interpretation guidelines

Sig.-F

·         What does it measure or indicate?  Whether the model as a whole is significant.  It tests whether R-square is significantly different from zero.

·         Critical values:  below .01 for 99% confidence in the ability of the model to explain the dependent variable; below .05 for 95% confidence; below 0.1 for 90% confidence.

·         Comment:  The first statistic to look for in SPSS output.  If Sig.-F is insignificant, then the regression as a whole has failed.  No more interpretation is necessary (although some statisticians disagree on this point).  You must conclude that the "Dependent variable cannot be explained by the independent/explanatory variables."  The next steps could be rebuilding the model, using more data points, etc.

RSS, ESS & TSS

·         What do they measure or indicate?  The main function of these values lies in calculating test statistics such as the F-test.

·         Critical values and comment:  The ESS should be high compared to the TSS (the ratio equals the R-square).  Note for interpreting the SPSS table, column "Sum of Squares":

...
