Basics of Experimentation

Variables and Control

In psychology we study behavior. One important aspect of scientific study is replication. To replicate a study is to repeat it independently. If we get the same results with replication, our confidence in the finding grows. If the replication fails, our confidence is shaken. Now suppose one researcher does a study of the relation between alcohol consumption (glasses of beer) and test performance (my friends swear there’s an optimal number of beers). Suppose another researcher replicates the study using glasses of wine (instead of beer) and test performance. There could be a problem in replicating the results of the first study with the second. Why?

(There are different amounts of alcohol per glass, different preferences for taste, what tests did we use, maybe different volunteers, etc.)

The difference may be due to how alcohol consumption was measured and to differences in the tests, rather than to a real difference in the relation across studies. Solution:

An operational definition defines the independent and dependent variables by the procedures or operations used to produce or measure them.

The operational definition is crucial for replication and for communication about the results of scientific study.

Examples

Alcohol consumption:

Drunkenness:

Types of Variables

Independent (IV) should be

Relevant – e.g., we want to study the effects of simulator time on flight proficiency. What experiences should be included in the simulator? How many?

Sampled Appropriately: Need to have either broad range to find effect (sim time diffs are large, not 2 or 3 minutes) or representative range to mirror what actually happens (sim time varies by several hours)

Manipulable (in experiments) – can manipulate the variable – cannot manipulate pilot aptitude, but can manipulate pilot training.

Dependent (DV)

Relevant – flight proficiency – pilot ratings? Incidents? Self evaluations?

Reliable – retest sense at least (if no change in IV) e.g., instructor ratings, self evaluations

Sensitive – to change in IV e.g., flight knowledge vs. stick proficiency after different types of pilot training
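Reliability in the retest sense can be expressed as the correlation between two administrations of the same measure when the IV has not changed. The sketch below is only an illustration: the ratings are made up, and the function name `pearson_r` is an assumption rather than anything from the course.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical instructor ratings of the same six pilots, one week apart
ratings_week1 = [3, 5, 4, 2, 5, 1]
ratings_week2 = [3, 4, 4, 2, 5, 2]

retest_r = pearson_r(ratings_week1, ratings_week2)
print(f"test-retest r = {retest_r:.2f}")
```

A value near 1 suggests the measure ranks people the same way on both occasions; a value near 0 suggests the ratings are mostly noise.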

Extraneous Variables

Extraneous variables have an unintended influence on the DV(s). Two major types: Nuisance variables and Confounders. Nuisance variables increase the variability within groups. This makes it harder to see real treatment effects.

Graph

 

Confounders (confounding variables) change the difference between groups, either increasing or decreasing treatment differences.

Graph

Thus confounders act like IVs and may even be mistaken for the action of the IVs. Confounders are alternative explanations for the action of the IV, that is, they are alternative explanations for the outcome of the study.
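The contrast between the two kinds of extraneous variable can be seen in a small simulation. This is a rough sketch under assumed numbers (a baseline score of 50, individual variation with SD 5, a true treatment effect of 5); the function `simulate` and its parameters are hypothetical. A nuisance variable is modeled as extra random noise added to everyone, and a confounder as a constant shift that affects only the treated group.

```python
import random
from statistics import mean, stdev

def simulate(n, effect, nuisance_sd=0.0, confound_shift=0.0, seed=0):
    """Return (difference between group means, average within-group SD)."""
    rng = random.Random(seed)
    control, treated = [], []
    for _ in range(n):
        # Baseline score plus ordinary individual variation plus nuisance noise
        control.append(50 + rng.gauss(0, 5) + rng.gauss(0, nuisance_sd))
        # Treated group also gets the treatment effect, plus any confounder
        # that happens to affect only this group
        treated.append(50 + effect + confound_shift
                       + rng.gauss(0, 5) + rng.gauss(0, nuisance_sd))
    diff = mean(treated) - mean(control)
    within_sd = (stdev(control) + stdev(treated)) / 2
    return diff, within_sd

base_diff, base_sd = simulate(200, effect=5)
nuis_diff, nuis_sd = simulate(200, effect=5, nuisance_sd=15)
conf_diff, conf_sd = simulate(200, effect=5, confound_shift=4)

print(f"baseline:   diff = {base_diff:.2f}, within-group SD = {base_sd:.2f}")
print(f"nuisance:   diff = {nuis_diff:.2f}, within-group SD = {nuis_sd:.2f}")
print(f"confounder: diff = {conf_diff:.2f}, within-group SD = {conf_sd:.2f}")
```

In this sketch the nuisance variable inflates the spread within each group, while the confounder leaves the spread alone and moves the difference between groups, which is why it can be mistaken for the IV.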

 

Pizza study – want to know the effect of a promotional on sales. A nuisance variable is store location and the resulting gross sales.

Study design? Should use random assignment; but they never did (they selected only the best store for the experiment). They were forever surprised by the results of the promotionals.

Suppose they did use random assignment and chose 15 stores across the country for the experimental group (trying the "buy a large, get a second small free" offer) and 15 other stores across the country for control. A confounder might be location. Let’s say the experimentals were mostly located near college campuses and the controls were not, and that the study was done in the middle of September. The superior sales could be due to college being in session rather than the promotional. Q: What other confounders might there be in this study (alternate reasons for the better sales in the experimental group)?

 

Clothing study. Wanted to know whether clothing style could influence judgments of hiring managers. Took pictures of people from neck to ankle, dressed them in masculine or feminine clothing (judged by panel).

Female feminine dress: pink lowcut dress

Female masculine dress: dark jacket & pants, white blouse

Male fem dress: white turtleneck sweater, dark pants

Male masc dress: dark suit, white shirt & tie

Took 20 males & 20 females and took a picture of each head. Asked judges to rate each for attractiveness. Chose medium attractive males and females so that they were equivalent in attractiveness. Made composite photos by morphing the heads onto identical bodies wearing the four costumes. Had managers judge suitability of applicants for a managerial position based on the photos. Found serious influence of dress for both males & females.

Controlling Extraneous Variables

Want to control them because they make scientific life miserable.

Randomization – assign people to treatments with equal probability – recall pizza study; clothing study
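Random assignment is easy to sketch in code. The example below assumes 30 hypothetical stores, as in the pizza study, and splits them into two groups of 15 by shuffling; the store names and the function `randomly_assign` are made up for illustration.

```python
import random

def randomly_assign(units, seed=None):
    """Shuffle the units and split them into two equal-sized groups."""
    rng = random.Random(seed)
    shuffled = list(units)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

stores = [f"store_{i:02d}" for i in range(1, 31)]  # 30 hypothetical stores
experimental, control = randomly_assign(stores, seed=42)

print("experimental:", sorted(experimental))
print("control:     ", sorted(control))
```

Because every ordering of the shuffled list is equally likely, each store has the same chance of landing in either group. Note that chance can still produce lopsided groups in any single study.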

Elimination – eliminate the variable entirely. Rarely possible. Noise can be eliminated. Sensory input can be closely controlled in the laboratory. Possible to eliminate temperatures above 80 degrees or light above so many footcandles. Training of management in the pizza chain is an attempt to eliminate incompetent managers, but it’s not entirely successful.

Constancy. Turn a variable into a constant. Pizza study, choose only certain location, management, profit, etc. Could have been done by using only 1 face per sex in clothing study.

Balancing. Extraneous variables are distributed equally across groups, for example by placing equal numbers of each type of person or unit into each treatment. Pizza study: equal numbers of high, medium & low volume stores in each group, or balance location. Psychological study: equal numbers of males & females. Clothing study: used 4 male & 4 female faces rather than only 1 face. (Why not use only 1 face in this study?)
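One way to combine balancing with randomization is to assign at random within each level of the extraneous variable. The sketch below assumes 30 hypothetical stores, 10 at each sales volume; the function `balanced_assign` and the data are illustrative only.

```python
import random
from collections import Counter, defaultdict

def balanced_assign(units, seed=None):
    """Split (name, stratum) pairs into two groups, half of each stratum per group."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for name, stratum in units:
        by_stratum[stratum].append(name)
    experimental, control = [], []
    for stratum, names in by_stratum.items():
        rng.shuffle(names)  # random order within the stratum
        half = len(names) // 2
        experimental += [(n, stratum) for n in names[:half]]
        control += [(n, stratum) for n in names[half:]]
    return experimental, control

# 30 hypothetical stores: 10 high, 10 medium, and 10 low volume
stores = [(f"store_{i:02d}", volume)
          for i, volume in enumerate(["high", "medium", "low"] * 10, start=1)]
experimental, control = balanced_assign(stores, seed=1)

print(Counter(volume for _, volume in experimental))
print(Counter(volume for _, volume in control))
```

Each group ends up with the same mix of store volumes, so volume cannot account for a difference between the groups.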

Counterbalancing. Sometimes in studies we present a sequence of stimuli to individuals or groups. In the clothing study, each judge saw 2 different pictures. Judge 1 might have seen a masculine male followed by a feminine female. Counterbalancing is achieved by reversing some orders, so that, for example, Judge 2 might have seen a feminine female followed by a masculine male. Counterbalancing helps control for carryover or contrast effects. Pepsi challenge. Physical ability testing.
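A counterbalanced schedule can be generated by cycling through every possible presentation order. This is a minimal sketch using the two-picture example above; the function `counterbalanced_orders` is hypothetical, and with many conditions the number of orders grows quickly, so full counterbalancing may not be practical.

```python
from itertools import permutations

def counterbalanced_orders(conditions, n_judges):
    """Give each judge an order, cycling through all possible orders in turn."""
    orders = list(permutations(conditions))
    # Each order is used equally often when n_judges is a multiple of len(orders)
    return [orders[i % len(orders)] for i in range(n_judges)]

pictures = ["masculine male", "feminine female"]
schedule = counterbalanced_orders(pictures, n_judges=4)

for judge, order in enumerate(schedule, start=1):
    print(f"Judge {judge}: {' then '.join(order)}")
```

With two pictures there are only two orders, so odd-numbered judges see one order and even-numbered judges see the reverse.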

Experimenter as Extraneous Variable

Experimenter as stimulus

Physio & demographic

Sex of experimenter

e.g., clothing study, conformity

Race of experimenter

Polls & opinion surveys

Psychological

Friendliness

Openness, sharing confidential info

Competence

‘Bumbling experimenter’ & compliance

Experimenter Expectancy

Classroom performance: (1) all students tested, (2) some students randomly selected as "intellectual bloomers" (3) teachers but not students were told who the bloomers were, (4) at later date, all tested again. Bloomers bloomed more than chance. This effect is often documented, sometimes called the Rosenthal Effect (a.k.a. Pygmalion effect) because Rosenthal was the first to find it.

ESP. Student experimenters are recruited to help study ESP. They are told they will administer cards and record participant performance. Half of the student experimenters are told that ESP is likely; the other half are told ESP is nonsense. Each then gives cards to a pair of students: one who holds the cards and one who guesses what’s on the back. The student experimenter records how often the guess is correct. Results show that the experimenters who are told that ESP is likely have students with better ESP (more correct guesses) than experimenters told ESP is nonsense.

Controlling Experimenter Effects

Demographic and psychological

  1. Make constant (same experimenter always)
  2. Balance or counterbalance across treatments (M & F balanced or counterbalanced).
  3. Remove or reduce the experimenter, e.g., use a videotaped stimulus to reduce personal contact and to minimize differences across subjects (reduce the Bill Murray effect), or use machines to record responses (e.g., have a computer determine and record whether the response was correct; the computer could also present the cards as stimuli, thus doing both).

Experimenter Expectancy

  1. Script experimenter contact.
  2. Use machines, videotape, internet, etc.
  3. Use the single blind technique where possible. E.g., teachers not told who the ‘intellectual bloomers’ are. Can work for alcohol effects, but not for things that can be easily perceived such as sex of participant.

Participants

Usually we use people; sometimes we use other animals. Not too long ago, getting a Ph.D. in psychology meant studying the rat.

Choosing participants

  1. Precedent
  2. Availability
  3. Suitability (easily the most important)
     a. For generalizability to a context, e.g., performance appraisal by college students as opposed to business managers; using student pilots vs. Navy pilots to fly MS Flight Simulator.
     b. Important factors include experience (e.g., for judgment), individual differences (aptitudes, abilities, & attitudes), and contextual factors (e.g., consequences of behaviors).
     c. For testing specific theoretical notions, e.g., clinical populations of anorexics might be needed to test a treatment.

Number of Participants

  1. Time & Money
  2. Availability
  3. Size of treatment effect – the larger the effect, the smaller number of people needed.
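The trade-off between effect size and number of participants can be made concrete with a common normal-approximation formula for comparing two group means: n per group is roughly 2(z_alpha + z_power)^2 / d^2, where d is the standardized effect size. The sketch below assumes the conventional values alpha = .05 (two-tailed) and power = .80, neither of which comes from these notes, and the function name `n_per_group` is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate participants needed per group for a two-group comparison."""
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value, two-tailed test
    z_power = z.inv_cdf(power)          # value matching the desired power
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size_d ** 2)

for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {n_per_group(d)} per group")
```

Under these assumptions a small effect (d = 0.2) needs several hundred people per group, while a large effect (d = 0.8) needs only a couple of dozen.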

 

Significance Testing

  1. Significance testing comes from statistics. It’s borrowed, and it doesn’t always make a lot of sense.
  2. The kind of statistics that we use for significance testing was developed for making decisions under uncertainty. For example, we want to decide which of two strains of wheat to plant. We do an experiment where we grow both of them, and see which produces more (and maybe look at soil, disease, & other things). We choose the strain with best predicted payoff. Other things being equal, we choose the wheat that produces significantly more bushels per acre. We have to decide which wheat to plant, but we are uncertain about which will produce best, even after we do the experiment. This is decision making under uncertainty. Statistics was developed to help us make good decisions.
  3. Statistical decision making is based on probabilistic reasoning. We evaluate propositions based on how likely or probable they are rather than whether they are true or not (this is decision making under uncertainty).
  4. The logic of significance testing is to set up a scenario and determine the probability of our data if the scenario is true. Suppose we want to know which of two strains of wheat gives more bushels per acre. We grow both kinds, measure bushels per acre for both, and we find that strain 1 yields 100 bushels per acre and strain 2 yields 110 bushels per acre. The question we have to answer is whether strain 2 is really superior or if the 110 bushels is due to chance and next time we do the experiment strain 1 is likely to result in about 110 bushels and strain 2 is likely to result in about 100 bushels.
  5. To answer this question, we are going to create a "what if" scenario that will allow us to attach some probabilities to our outcome. So, let’s assume that strain 1 and strain 2 really have the same yield on average. If we were to do the same study over and over and look at the yield from each strain, the averages would be the same for both. But most of the time, there would be some difference between the yield for s1 and s2. Sometimes s1 would be bigger than s2, and sometimes s2 would be bigger than s1. Most of the time, s1 would be a little larger or smaller than s2. On rare occasions, s1 would be a lot larger or smaller than s2.
  6. With a little stats, we can quantify what "rare occasions" means quite specifically. For example, it might turn out that differences of 10 bushels per acre or greater are seen 50 percent of the time when the averages for s1 and s2 are equal. In this case, we would be likely to conclude that there was no real difference between the two strains. On the other hand, it might turn out that when s1 and s2 are truly equal, a difference of 10 bushels per acre or more is seen only 1 time in 100. Therefore, if we see a difference of 10 bushels per acre, we are likely to conclude that the strains are not equal and that one is superior to the other.
  7. This is the logic of significance testing. The calculations follow from the scenario I described.
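The "what if" scenario in the steps above can be acted out by simulation: assume both strains truly have the same average yield, repeat the study many times, and count how often chance alone produces a difference of 10 bushels per acre or more. Everything numeric here is an assumption chosen for illustration (a true mean of 105, a plot-to-plot SD of 21, and 4 or 60 plots per strain), and the function names are hypothetical.

```python
import random

def sample_mean(rng, mu, sd, n):
    """Average yield over n plots drawn from a normal distribution."""
    return sum(rng.gauss(mu, sd) for _ in range(n)) / n

def prob_difference_at_least(observed_diff, plot_sd, n_plots,
                             n_studies=10_000, seed=0):
    """Estimate how often equal strains differ by observed_diff or more."""
    rng = random.Random(seed)
    true_mean = 105  # same for both strains in the "what if" scenario
    extreme = 0
    for _ in range(n_studies):
        s1 = sample_mean(rng, true_mean, plot_sd, n_plots)
        s2 = sample_mean(rng, true_mean, plot_sd, n_plots)
        if abs(s2 - s1) >= observed_diff:
            extreme += 1
    return extreme / n_studies

# Same field variability, but few plots versus many plots per strain
p_noisy = prob_difference_at_least(10, plot_sd=21, n_plots=4)
p_precise = prob_difference_at_least(10, plot_sd=21, n_plots=60)

print(f"4 plots per strain:  P(difference >= 10) is about {p_noisy:.2f}")
print(f"60 plots per strain: P(difference >= 10) is about {p_precise:.3f}")
```

With these assumed numbers, a 10-bushel difference turns up by chance about half the time in the small study and roughly 1 time in 100 in the large one, which mirrors the two cases described in step 6.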