Regression Critiques

Banker, R. J., Field, M. M., Schroeder, R. G., & Sinha, K. K. (1996). Impact of work teams on manufacturing performance: A longitudinal field study. Academy of Management Journal, 39, 867-890.

Setting and research question. They followed the productivity of four production lines in an electromechanical assembly plant over time. They wanted to know if the implementation of teams increased productivity.

DVs. The dependent variables were quality and labor productivity. Quality was measured by a manufacturing defect rate (inspection). There were 84 weeks' data points. Labor productivity was measured by the ratio of units produced to total production hours. Data were available for 21 months.

IVs. The main independent variable was time since the implementation of teams. It was measured in weeks for quality and months for productivity. There were also control variables that were measured and used to help rule out alternative explanations of the results. Variables included time before the implementation, workforce policies, and policies affecting confusion in the factors. Workforce policy variables were overtime, headcount additions, and headcount deletions. Confusion variables were product diversity, capacity utilization (both overutilization and underutilization), and engineering change orders. For one of the production lines, an experiment was conducted for engineering purposes during the study (not related to the study purpose). A variable called "adhesive experiment period on the gear train line" was used to identify the duration of this experiment. Because there were four different lines, each line was identified using dummy coding.

Analyses. The authors ran two different models: a fixed-effects model and a seemingly unrelated regressions (SUR) model. The fixed-effects model is a regression model that shows the effects of changes in time for all four lines. All four lines are expected to show identical slopes for time (and the control variables), but are allowed to have different intercepts. In the SUR model, each line is analyzed independently, so that both intercepts and the slope for time (as well as the effects of the control variables) are allowed to vary among lines. The SUR model is less powerful because there are fewer observations (1/4 as many) for the SUR model than the fixed-effects model. However, the SUR model allows a check on the similarity of the lines to one another in terms of the models (the "robustness" of the fixed-effects model).
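The contrast between the two models can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are made up, there is a single predictor (time) with no control variables, and plain per-line least squares stands in for SUR (a full SUR estimator would also model the correlation of errors across lines). The names simple_ols, fixed_effects_slope, and lines are hypothetical.

```python
from statistics import mean

def simple_ols(x, y):
    """Return (intercept, slope) for a one-predictor least-squares fit."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def fixed_effects_slope(groups):
    """One common slope, with a separate intercept for each group.

    Centering x and y within each group gives the same slope as
    a regression that includes a dummy variable for each group.
    """
    sxx = sxy = 0.0
    for x, y in groups.values():
        mx, my = mean(x), mean(y)
        sxx += sum((xi - mx) ** 2 for xi in x)
        sxy += sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercepts = {g: mean(y) - slope * mean(x) for g, (x, y) in groups.items()}
    return slope, intercepts

# Hypothetical data: months since teams began (x) and productivity (y).
months = [1, 2, 3, 4, 5, 6]
lines = {
    "line_a": (months, [10.0, 10.5, 11.0, 11.5, 12.0, 12.5]),
    "line_b": (months, [20.0, 20.5, 21.0, 21.5, 22.0, 22.5]),
    "line_c": (months, [15.0, 16.0, 17.0, 18.0, 19.0, 20.0]),
}

# Fixed-effects style: identical slope for all lines, different intercepts.
common_slope, intercepts = fixed_effects_slope(lines)

# Per-line style: every line gets its own intercept and slope.
per_line_slopes = {g: simple_ols(x, y)[1] for g, (x, y) in lines.items()}

print("common slope:", round(common_slope, 3))
print("per-line slopes:", per_line_slopes)
```

If the per-line slopes sit close to the common slope, the pooled model is a reasonable summary; if they differ a lot (as line_c does in this made-up example), the common slope hides real differences among lines.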

Diagnostics. The authors report checking for collinearity and completing a residuals analysis. They used data transformations to reduce the time-series autocorrelations. They reported that collinearity was not a problem according to their diagnostics. They also worried about heteroscedasticity. They reported problems of heteroscedasticity and nonlinearity of the residuals, and transformations to correct the problems. They deleted outliers (based on residuals) from the analysis and reran the analyses. Because there were no differences in the interpretations of the results, they only report results based on the full dataset. The data are summarized in a table (Table 1) that shows all the correlations, means and standard deviations. Regression results for both the fixed-effects and SUR models for both quality and quantity are presented in tables.

Critique.

  1. For the production data (N=21 months) the sample size is too small for the SUR regression, and probably too small even for the fixed-effects model (N=84 observations, i.e., 21 x 4 = 84). However, there are a number of significant correlations even with the small sample size. For the weekly (quality) data fixed-effects model, there were 355 observations, which is a goodly number.
  2. The correlations and descriptive information (mean, SD) are presented in a table, which is good.
  3. Regression results are complete. We get b weights, standard errors, R-square, an adjusted R-square and F for the model as well as significance tests for each of the IVs. Beta weights (standardized regression coefficients) are not presented, though.
  4. The use of diagnostics was appropriate and thorough. Where they found problems, they took appropriate action.
  5. The comparison of fixed-effects and SUR models allowed them to assess the effects of assuming similar team effects on each line. Good. They could have (should have) computed interaction terms instead and tested these to see whether the effects varied by line.
  6. The statistical approach seems to match the research question. That is, they were interested in the effects of introducing teams to the assembly line process, and they found pre-team time trends that were flat and post-team trends for improvement that were significant. Thus the analysis was appropriate to the data and to the research question. Big plus.
  7. Inference & generalizability. Because this was a field study, there are common benefits and costs. There was no real control group, so there are problems inferring causality. Because it was done in the field, we can have more confidence that teams are useful in actual jobs on real assembly lines than if the study had been done in the lab.
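The interaction test suggested in point 5 amounts to comparing two nested models: one with a common time slope (separate intercepts only) and one with a separate time slope for each line. A rough sketch follows, assuming a single predictor, made-up data, and independent errors (which time-series data may well violate). The function names and the data are hypothetical, not taken from the article.

```python
from statistics import mean

def centered_sums(x, y):
    """Within-group sums of squares and cross-products."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    return sxx, sxy, syy

def slope_homogeneity_f(groups):
    """F test of the line-by-time interaction (do slopes differ by line?)."""
    sums = [centered_sums(x, y) for x, y in groups.values()]
    n_obs = sum(len(x) for x, _ in groups.values())
    n_groups = len(groups)

    # Restricted model: separate intercepts, one common slope.
    sxx_t = sum(s[0] for s in sums)
    sxy_t = sum(s[1] for s in sums)
    syy_t = sum(s[2] for s in sums)
    rss_restricted = syy_t - sxy_t ** 2 / sxx_t

    # Full model: separate intercept and slope per line, which is what
    # adding the line-by-time interaction terms accomplishes.
    rss_full = sum(syy - sxy ** 2 / sxx for sxx, sxy, syy in sums)

    df_num = n_groups - 1            # extra slopes estimated
    df_den = n_obs - 2 * n_groups    # residual df of the full model
    f_stat = ((rss_restricted - rss_full) / df_num) / (rss_full / df_den)
    return f_stat, df_num, df_den

# Hypothetical data: months since teams began and productivity per line.
months = [1, 2, 3, 4, 5, 6]
lines = {
    "line_a": (months, [10.1, 10.4, 11.1, 11.4, 12.1, 12.4]),
    "line_b": (months, [20.0, 20.6, 20.9, 21.6, 21.9, 22.6]),
    "line_c": (months, [15.1, 15.9, 17.1, 17.9, 19.1, 19.9]),
}

f_stat, df_num, df_den = slope_homogeneity_f(lines)
print(f"F({df_num}, {df_den}) = {f_stat:.2f}")
```

The F value would be compared with the tabled critical value for the given degrees of freedom. A significant F says the time slopes differ across lines, so a single pooled slope is not adequate.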

Overall. This is an excellent article in terms of the use of regression. I'd give them an "A." They were very sensitive about modeling issues (the use of various diagnostics, transformations, control variables, and the comparison of fixed-effects and SUR models). The models that they employed were appropriate for the data and research question. The main drawback is that they really had a very small sample for the production data. We do not study time series regression in this class because it is used so infrequently in psychology. I included this article so that you could see what a thorough job looks like and how they wrote it up.

Keller, R. T. (1994). Technology-information processing fit and the performance of R&D project groups: A test of contingency theory. Academy of Management Journal, 37, 167-179.

Setting and research question. Research and development (R&D) project teams from four industrial organizations were surveyed (total N=683 people). The main hypothesis was that fit between the teams' information processing (communication process) and task characteristics would predict team performance. Teams with a good match between the type of task and the amount of information processing were expected to perform better than teams with a lesser match.

IVs. A ten-item scale was used to measure perceptions of the task. Five items measured Nonroutineness of the task; the other five measured Unanalyzability of the task. A survey was also used to measure the Information Processing of each team. Fit was defined by taking the absolute value of the difference between the score on information processing and each of the other two scales. First, individual scores were averaged to create a team score for each scale. Then the team scores were standardized to have a mean of zero and a standard deviation of 1.0 across teams. Finally, the absolute value of the difference between the information processing variable and each of the task variables was computed, yielding a total of five IVs.
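The three steps (average to the team level, standardize across teams, take the absolute difference) can be sketched as follows. The data and the function names are hypothetical, and the use of the sample standard deviation is an assumption; the article summary does not say which formula the author used.

```python
from statistics import mean, stdev

def team_means(responses):
    """Step 1: average individual scores within each team."""
    return {team: mean(scores) for team, scores in responses.items()}

def standardize(team_scores):
    """Step 2: z-score the team means across teams (mean 0, SD 1).

    Assumption: sample SD (n - 1 in the denominator).
    """
    values = list(team_scores.values())
    m, sd = mean(values), stdev(values)
    return {team: (v - m) / sd for team, v in team_scores.items()}

def fit_scores(info_processing, task_scale):
    """Step 3: absolute difference of the standardized team scores."""
    z_info = standardize(team_means(info_processing))
    z_task = standardize(team_means(task_scale))
    return {team: abs(z_info[team] - z_task[team]) for team in z_info}

# Hypothetical individual responses, keyed by team.
info_processing = {"t1": [1, 3], "t2": [3, 3], "t3": [4, 4]}
nonroutineness = {"t1": [1, 1], "t2": [2, 4], "t3": [5, 5]}
unanalyzability = {"t1": [5, 5], "t2": [3, 3], "t3": [1, 1]}

fit_nonroutine = fit_scores(info_processing, nonroutineness)
fit_unanalyzable = fit_scores(info_processing, unanalyzability)
print(fit_nonroutine)
print(fit_unanalyzable)
```

Note the direction of scoring: a value of zero means the two standardized scores are identical (the best possible "fit" under this definition), and larger values mean a poorer match.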

DVs. Project group performance was measured by managerial ratings of five criteria. Each of the criteria (technical quality, budget and cost performance, schedule adherence, value to the company, overall performance) was rated using a five-point scale. Ratings were taken at two different times, approximately one year apart. A panel of managers was used to make each of the ratings. A factor analysis of the ratings caused the researchers to lump the DVs into two new variables, project quality and budget-schedule performance. Thus, there were four dependent variables, two at each time.

Analyses and Results. The correlations and descriptive statistics for all variables were reported (Table 1). The authors note that the maximum VIF was 3, and that therefore collinearity did not appear to be a problem. The results include standardized regression weights (betas) for the full model using simultaneous regression. The R-square values from some hierarchical regressions are also reported to determine the unique variance of some variables. The authors report that the fit between nonroutineness and information processing was a good predictor of project quality, but not budget-schedule performance. The fit of unanalyzability and information processing did not predict either performance variable. Neither did the task variables by themselves. The authors concluded that the hypothesis that the fit of nonroutineness and information processing predicted performance was supported, but that the hypothesis regarding the fit between unanalyzability and information processing was not supported.
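For readers who want to see what the VIF diagnostic computes, here is a small sketch. The VIF for a predictor is 1 / (1 - R-square), where R-square comes from regressing that predictor on all the other predictors; equivalently, it is the corresponding diagonal element of the inverse of the predictor correlation matrix, which is what the code uses. The data and function names are hypothetical.

```python
from statistics import mean

def correlation(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = mean(a), mean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5

def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (small matrices only)."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * pv for v, pv in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vifs(predictors):
    """VIF per predictor: diagonal of the inverse correlation matrix."""
    names = list(predictors)
    corr = [[correlation(predictors[a], predictors[b]) for b in names]
            for a in names]
    inverse = invert(corr)
    return {name: inverse[i][i] for i, name in enumerate(names)}

# Hypothetical predictor scores for five teams.
predictors = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 1, 4, 3, 5],
    "x3": [5, 3, 4, 1, 2],
}
print(vifs(predictors))
```

A VIF of 1 means a predictor is uncorrelated with the others; larger values mean more of its variance is shared with the other predictors, which inflates the standard error of its regression weight.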

Critique.

  1. The sample size (98 and 91) for the analyses is marginal. For those analyses where significant results were obtained, we can be fairly confident. For the nonsignificant results, however, we are still fairly ignorant.
  2. The measurement of fit is problematic. Their description of what they actually did is kind of sketchy. The authors did a good job of critiquing various means of assessing fit, such as using the interaction term. However, they didn't do a good job of mentioning the drawbacks to the method that they used. The main problem is that there is no reason to equate a specific score on the information processing scale with a specific "best fitting" score on either of the other scales. Why should a score of .2 SD on one scale best fit a score of .2 SD on another scale?
  3. Diagnostics. They worried about collinearity and reported a diagnostic to support their claim that collinearity was not a problem in these data.
  4. They failed to report an inspection of the residuals. The residuals could be very helpful in detecting the presence of a nonlinear relation between the IVs (that is, the notion of "fit"). So this is mixed.
  5. The reported results were complete for the correlations and descriptive statistics.
  6. Results were complete for the regression analyses. We got last-in R-square increments, beta weights, and adjusted R-square estimates for each model.
  7. The analysis was appropriate to the research question (at least if we grant that the handling of the modeling of fit was appropriate).

Overall, this is okay but not outstanding. I'd give it a "B." The sample size is marginal, I'm not convinced that the measurement of fit is the best way to go, and they didn't look at the fit of the model to the data (residuals: linearity, homoscedasticity, outliers).

Kinsella, G., Ong, B., Murtah, D., Prior, M., & Sawyer, M. (1999). The role of the family for behavioral outcome in children and adolescents following traumatic brain injury. Journal of Consulting and Clinical Psychology, 67, 116-123.

Setting and research question. Children and adolescents with traumatic brain injury (TBI) were assessed at three months, one year and two years post injury for behavioral problems. The research question was whether family characteristics predicted the severity of behavioral problems following TBI.

IVs. Parents completed a standardized behavioral problem checklist that indicated behavioral problems in the child or adolescent prior to the TBI. This was done prior to the assessment of the severity of the current injury to the child. The severity of the injury was coded into either mild, moderate, or severe using medical convention. They coded family status variables including the age of the child, the socio-economic status (SES) of the family, and whether the primary caretaker lived alone or with a partner. They coded family environment variables for well being of the primary care parent and overall pathology of the family (obtained by self-report surveys).

DVs. Parents completed the same behavior problem checklist that they used for pre-injury status at 3 months, one year, and two years after the injury.

Analyses and Results. The correlations and descriptive statistics for all variables were not reported. The authors note that the sample size is small and report problems with missing data. Regarding missing data, they reported that "pairwise exclusion of missing data was used in the regression analyses." They concluded that "Multicollinearity among predictors was ruled out, given that there were no significant correlations between any two predictor variables within the same regression model." The hierarchical regression analyses proceeded in blocks. The authors' explanation for their procedure was that they were worried about adjusting the significance level for multiple predictors but also about the small sample size. The analyses are presented in three parts of the same table, one analysis for each time (3 months, one year, two years). The authors describe the sequence of steps in each block. For example, they note that "In step 2, prediction of residual behavior problems scores by severity was not significant..." The authors note that the significance of the predictors changes over time and discuss the findings as showing that the relations between problem behavior and the independent variables change over time following the injury.

Critique.

  1. The correlations and descriptive statistics are not reported.
  2. The sample size is too small (but the authors seem to be aware of this problem).
  3. Regarding the missing data, it is not clear what they mean when they say that "pairwise exclusion" of missing data was used. Multivariate procedures, including multiple regression, delete all observations in the analysis that are missing on any variable. They would have had to trick a program into thinking it had a listwise deletion when in fact it was a pairwise deletion. Bad idea.
  4. Regarding multicollinearity, they failed to report a meaningful evaluation of the problem. Simply reporting the lack of significant correlations is not sufficient, especially given the small sample size.
  5. They failed to examine the fit of the model (residuals, linearity, homoscedasticity, outliers, etc.).
  6. There was no real reason for the use of the blocks in the regression. Instead, they should have used simultaneous regression. (Well sample size is really too small, but the use of blocks doesn't really help at all.)
  7. Their discussion of hierarchical regression shows that they have misunderstood what happens in regression. They say that "In step 2, prediction of residual behavior problems scores by severity was not significant..." But in regression the dependent variable is not residualized. That is done with partial correlations. With regression, the IVs are residualized, which corresponds to semipartial correlation. This is widely misunderstood, as the authors' writing demonstrates.
  8. Because the b weights are significant in some analyses but not others, the authors conclude that the weights have changed. This may or may not be true. The authors need to test whether the weights are the same or significantly different across analyses. That is, rather than testing whether the b weights are different from zero, they need to test whether the b weights are different from one another.
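The point in item 7 about what gets residualized can be checked numerically. The sketch below uses made-up scores and hypothetical variable names (y for later behavior problems, x1 for severity, x2 for pre-injury problems). It shows that the b weight for x1 in a two-predictor regression equals the slope obtained by regressing the original, unresidualized y on x1 after x1 has been residualized for x2.

```python
from statistics import mean

def center(values):
    """Subtract the mean so intercepts can be ignored."""
    m = mean(values)
    return [v - m for v in values]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def residualize(target, other):
    """Residuals of target after regressing it on other (both centered)."""
    slope = dot(target, other) / dot(other, other)
    return [t - slope * o for t, o in zip(target, other)]

def corr(a, b):
    """Correlation of two centered variables."""
    return dot(a, b) / (dot(a, a) * dot(b, b)) ** 0.5

# Hypothetical scores (not from the article).
y = center([3.0, 5.0, 4.0, 8.0, 9.0, 7.0])    # later behavior problems
x1 = center([1.0, 2.0, 1.0, 3.0, 3.0, 2.0])   # injury severity
x2 = center([2.0, 3.0, 4.0, 5.0, 6.0, 7.0])   # pre-injury problems

# b weight for x1 from the two-predictor regression (normal equations).
det = dot(x1, x1) * dot(x2, x2) - dot(x1, x2) ** 2
b1_multiple = (dot(x1, y) * dot(x2, x2) - dot(x2, y) * dot(x1, x2)) / det

# Same b weight: original y regressed on x1 residualized for x2.
e1 = residualize(x1, x2)
b1_residualized_iv = dot(e1, y) / dot(e1, e1)

# Semipartial: only the IV is residualized. Partial: y is residualized too.
semipartial_r = corr(y, e1)
partial_r = corr(residualize(y, x2), e1)

print(b1_multiple, b1_residualized_iv)
print(semipartial_r, partial_r)
```

The two b weights agree, which is the sense in which regression residualizes the IVs rather than the DV. The squared semipartial correlation is the R-square increment for x1 entered last; the partial correlation is a different quantity because it also removes x2 from y.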

Overall, this is bad. I'd give it a "C-" or "D." The sample size is too small, although just as in the teams/groups literature, it's hard to get large numbers in this type of research. However, the reporting of the results is just inadequate. Further, they do not really know what they are doing from a statistical standpoint, so the analyses and results do not really support the conclusions that the authors want to make.