Instrumental Conditioning: Basic Paradigms & Concepts

In Pavlovian conditioning we studied the organism's sensitivity to the relationships between multiple stimulus events (e.g., CS & US). In instrumental conditioning (also known as operant conditioning), we study the organism's sensitivity to various relationships between its behavior and the consequences of that behavior. After reviewing several paradigms that have been commonly employed in the study of instrumental learning, we will discuss the instrumental relationships that have been commonly studied, along with the basic concepts of reinforcement and schedules of reinforcement.

A. Basic Paradigms

1. Discrete Trial Tasks.

a. Runway. Rats are placed in the start box of a straight-alley runway. If they run down the runway they will find a goal box that may contain food reinforcement. Instrumental learning is assessed in this setup by measuring running speed as a function of the number of training trials. One interesting variant of this procedure places two straight alleys together to form one long straight alley with 2 goal boxes. Amsel demonstrated that if rats learn to run for food in both Goal Box 1 and Goal Box 2, then on trials when food is omitted from Goal Box 1 they run very fast from there to Goal Box 2. He called this the "frustration effect." It has been determined experimentally that this effect is due to the omission of an expected food in Goal Box 1. Other rats, trained from the outset to find food only in Goal Box 2, run more slowly from Goal Box 1 to Goal Box 2 than do rats trained with food in both goal boxes on trials in which food is omitted from Goal Box 1. Because neither group has eaten in Goal Box 1 on these trials, this result suggests that the frustration effect is not due to differences in hunger levels on the different kinds of trials. Instead, it appears as though some energizing motivational state (frustration) is activated by the omission of an otherwise expected food.

b. T-Maze. Another variant of the runway setup is to place two runways perpendicular to one another (to form a T). Rats start at the bottom of the T and run to the choice point. In one application of this procedure, the rats may always find food reinforcement on the right-hand side and no food on the left. Learning is assessed by monitoring the % of trials in a block of trials on which the rats chose correctly. Initially, performance will be close to 50%, but with additional training trials, rats will learn to select the side that contains food reinforcement most of the time. One interesting question that has been addressed using this apparatus is whether rats learn about the "Place" of food or about which "Response" leads to food. In class, we discussed one kind of experiment that was used historically to examine the issue. More recently, the Morris Water Maze has been used quite successfully to assess spatial learning (i.e., learning the place of a submerged platform in a circular water maze filled with milky water) in various species.

c. Hebb Maze. The Hebb maze is named after the famous early neuroscientist D.O. Hebb. It consists of a corn-maze-type puzzle in which the animal has to learn how to exit a maze that has many dead-end alleys and only one or a few correct paths from entrance to exit. Early applications of the Hebb maze included an assessment of the effects of early environment on cognitive abilities. In the famous "enrichment experiment," one group of rats was reared in a stimulating environment while a second group of rats was reared in a highly impoverished environment. The abilities of the two groups to solve a variety of Hebb mazes that differed in difficulty were then assessed, and it was observed that the group raised in the enriched environment learned to solve the maze problems faster and with fewer errors than the group raised in the impoverished environment. More recent studies have asked whether this effect might be due to differences between the two groups in the way in which neurons in the brain become connected to one another during development.

2. The Free Operant Task. B.F. Skinner advocated the use of a free operant procedure in the study of instrumental learning. In this procedure, the organism is placed in an "operant" chamber (also known as a Skinner box) in which there is a response manipulandum (e.g., a lever) that the animal must contact in some way in order for reinforcement to be automatically delivered. What made this approach different from Thorndike's puzzle box is that everything was automated. A computer can record the number of lever press (or key peck) responses the animal (usually rat or pigeon) makes and reinforcement can be automatically delivered if reinforcement is scheduled to occur.

a. Ongoing behavior can be investigated with this approach. Unlike the discrete-trials procedures, responding is assessed over an entire session during which the animal is left to its own devices. In the free operant version of this task, the animal is undisturbed and allowed to continue to work for reinforcement. In the discrete trials procedure (like Thorndike's or any of the runway & maze procedures described above), a trial usually begins when the animal is placed in the apparatus and ends when reinforcement occurs. Then some time (an inter-trial interval) elapses before the next trial begins. A discrete trial procedure can be used in the Skinner box, however. In "stimulus control" situations, for example, a response contingency will only be in effect when a stimulus is present, but not when the stimulus is absent. In this way, trials would consist of separate presentations of the stimulus, and the inter-trial period would consist of the time in which the stimulus is off. What makes this different from the older tasks (mazes and runways) is that the animal remains in the Skinner box during the inter-trial interval and can make instrumental responses then, whereas in maze or runway experiments the animal is removed from the apparatus between trials. Thus, one can study the development of control by specific stimuli more easily using a Skinner box setup. More on this later.

b. An important theoretical reason why Skinner advocated use of the automated setup (why he suggested we study the animal in an operant chamber) is that it makes it possible to require the organism to learn an "arbitrary" response. This means that the animal is to learn to make an operant response which itself is NOT part of the animal's natural behavioral repertoire. The thinking goes something like this: if the animal is required to interact with its environment in some arbitrary way, then we will be studying instrumental learning in its "purest" form. This would, arguably, allow us to discover general principles of learning more effectively. Suppose that the alternative approach were followed, that is, that we attempted to teach the organism a response that it already possessed in its natural behavioral repertoire. In this case, we cannot study the development of new learned behaviors, because the behavior is already present. If one wanted to identify the "general principles" of instrumental learning that would apply across different species, then one would do well to require each species to acquire some new response and look for the general principles that govern the acquisition process across these different species.

B. Some Commonly Studied Instrumental Relations

1. Positive Reinforcement. This refers to a contingency in which a response is required before a positive event (the reinforcer, e.g., a food pellet) will be delivered. The reinforcer is said to be contingent upon the response.

2. Negative Reinforcement. This refers to a contingency in which a response is required before a negative event (e.g., a brief electric foot shock) will either be terminated or cancelled.

a. Unsignalled Avoidance contingency (otherwise known as Sidman avoidance). In this procedure the rat, for example, is placed in the operant chamber and, as long as it does not respond, shock will occur every 10 sec (for instance). This is the shock-shock interval. If the rat presses the lever, however, this will delay the next shock until 30 sec (for instance) after the press. This is the response-shock interval. The rat can thus cancel the shock scheduled to occur 10 sec following the last one by pressing the lever. However, it must press the lever again before 30 sec have elapsed in order to cancel the next one. Thus, by varying the shock-shock and response-shock intervals, one can ask specific questions about the nature of avoidance learning.
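The timing rule of the unsignalled avoidance procedure can be made concrete with a small sketch. The function name `sidman_shock_times` and its parameters are hypothetical, and the default intervals simply reuse the 10 sec and 30 sec values given as examples above. The sketch assumes that shocks are instantaneous and that the shock-shock timer restarts after each shock.

```python
def sidman_shock_times(response_times, session_length, ss_interval=10.0, rs_interval=30.0):
    """Return the times (sec) at which shocks are delivered in one session.

    Without responding, shocks recur every `ss_interval` sec (shock-shock interval).
    Each response postpones the next shock to `rs_interval` sec after that
    response (response-shock interval).
    """
    shocks = []
    next_shock = ss_interval              # first shock is due one S-S interval into the session
    pending = sorted(response_times)
    i = 0
    while True:
        # Apply every response that occurs before the currently scheduled shock.
        while i < len(pending) and pending[i] < next_shock:
            next_shock = pending[i] + rs_interval
            i += 1
        if next_shock > session_length:
            break
        shocks.append(next_shock)
        next_shock += ss_interval         # after a shock, the S-S timer restarts
    return shocks


# A rat that never responds is shocked every 10 sec.
print(sidman_shock_times([], session_length=35))
# A rat that presses at least once every 30 sec avoids every shock.
print(sidman_shock_times([5, 30, 55, 80], session_length=100))
```

Changing `ss_interval` and `rs_interval` in this sketch corresponds to the manipulation of the two intervals described in the text.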

b. Signalled Avoidance contingency. In the signalled avoidance procedure a warning stimulus comes on and signals that shock will occur in x seconds if the animal does nothing. The warning stimulus remains on until the animal makes the appropriate response. If the shock has already begun, the appropriate response turns off both the warning stimulus and the shock. After an inter-trial interval, during which nothing of consequence happens, the warning signal comes on again. If the animal makes the appropriate response before x seconds have elapsed, then the warning stimulus is turned off and no shock occurs on that trial. The first type of response (which turns the shock off) is called an escape response and the second type (which cancels the future shock) is called an avoidance response. Notice that this is a discrete-trials avoidance procedure whereas the unsignalled avoidance procedure described above is a free-operant avoidance procedure, and that both rely on negative reinforcement to maintain the operant response.
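The distinction between escape and avoidance responses on a signalled trial depends only on when the response occurs relative to the x-second warning period. The following is a minimal sketch of that rule. The function name and arguments are hypothetical, and it assumes the shock stays on until the response is made.

```python
def classify_signalled_trial(response_latency, warning_duration):
    """Classify one signalled-avoidance trial.

    response_latency: seconds from warning-signal onset to the response.
    warning_duration: the 'x seconds' between signal onset and shock onset.
    Returns (response_type, seconds_of_shock_received).
    """
    if response_latency < warning_duration:
        # Response came before the shock was due: the shock is cancelled.
        return "avoidance", 0.0
    # Response came after shock onset: the shock is terminated.
    return "escape", response_latency - warning_duration


print(classify_signalled_trial(3.0, warning_duration=10.0))
print(classify_signalled_trial(12.5, warning_duration=10.0))
```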

3. Punishment Contingency. This refers to a contingency in which the response is followed by some negative event (like shock). If that negative event is contingent upon the response then we call this a punishment contingency. The effect is to reduce the punished response.

4. Omission Contingency. Here, the response causes the positive event (e.g., the food pellet) NOT to occur (i.e., to be omitted). In the omission contingency, if the animal fails to respond, then food, for example, will be given, but food will be withheld if the animal makes the target response. This procedure has been most commonly studied with key-pecking behavior by pigeons, but can be studied with any species in almost any conditioning preparation.
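One way to keep the four contingencies straight is to note that each is defined by two things: whether the response produces or prevents the event, and whether the event is positive or negative. The lookup below is only a summary sketch of the definitions above. The labels "produces" and "prevents" are shorthand introduced here, with "prevents" covering both terminating and cancelling an event.

```python
def classify_contingency(response_effect, event_type):
    """Name the instrumental contingency.

    response_effect: "produces" or "prevents" (prevents = terminates or cancels).
    event_type: "positive" (e.g., food pellet) or "negative" (e.g., foot shock).
    """
    table = {
        ("produces", "positive"): "positive reinforcement",
        ("prevents", "negative"): "negative reinforcement",
        ("produces", "negative"): "punishment",
        ("prevents", "positive"): "omission",
    }
    return table[(response_effect, event_type)]


print(classify_contingency("prevents", "negative"))  # avoidance and escape fall here
```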

C. Some Basic Concepts of Instrumental Learning

1. Reinforcement. An event is referred to as a reinforcing event if it increases the likelihood of the behavior that leads to it. This circular definition raises the question of what makes an event a reinforcing event in the first place, a topic which has received a lot of attention. One approach to this question has first asked what types of events have been demonstrated to be reinforcing events.

a. Biologically significant events. Events like food for a hungry animal, water for a thirsty animal, a burst of hot air for an animal that is cold, etc., are all events that have some specific biological importance for the organism. In other words, all of these events have the effect of maintaining the animal's biological equilibrium. One idea on the nature of reinforcement is that events that have biological significance are reinforcing because they reduce some biological drive.

b. Secondary Reinforcement. Another class of events shown to have reinforcing properties is stimuli that have been associated with biologically significant events. Stimuli trained in a Pavlovian procedure, for instance, acquire the ability to reinforce instrumental responding. One interesting point about this is that since the Pavlovian CS itself is not biologically significant (in the same way that the US is), the fact that CSs can reinforce instrumental responding suggests that reinforcement is something more than just drive reduction. A common example of something that is considered to be a powerful secondary reinforcer of instrumental behavior is money.

c. Other Behaviors. Research has shown that other behaviors can also have reinforcing properties. In a famous set of experiments, Premack studied what would happen in situations where the opportunity to engage in one behavior was contingent upon the animal first engaging in another behavior for some period of time. His observations led to the "Relativity Principle" of reinforcement. His studies included a baseline phase during which the animal was free to engage in various activities however it pleased, i.e., unconstrained. During this phase, Premack measured the baseline probabilities of occurrence for each behavior. The relativity principle asserts that behaviors with higher baseline probabilities will always reinforce behaviors with lower baseline probabilities. In other words, what makes an activity reinforcing is not any intrinsic attribute of the activity itself, but rather the relative "preference" for the activity in question. A specific behavior can be effective at reinforcing another behavior that has a lower baseline probability, but will be ineffective at reinforcing another behavior that has a higher baseline probability of occurrence. This is the relativity principle, and it can be seen that this notion provides an additional challenge to the view that reinforcement is tied to the biological concept of drive reduction.

Timberlake and Allison have challenged Premack's relativity notion by arguing that sometimes a low probability behavior can reinforce a higher probability behavior. It should be emphasized, however, that these authors agree with the general notion that reinforcement is not an intrinsic attribute of a particular activity, and that behaviors, not events, are reinforcing, but they went beyond Premack in arguing for a behavioral regulation view of reinforcement. Importantly, they suggested that a low probability behavior can reinforce a higher probability behavior when the reinforcement contingency "deprives" the animal of engaging in the low probability behavior at its preferred level. In other words, suppose the contingency is arranged such that the animal must engage in the higher probability behavior above the level that it would ordinarily prefer in order to engage in even a minimal amount of the low probability behavior, and there are no other means by which the animal could engage in the low probability behavior. Then the low probability response will reinforce the high probability response (because the animal would otherwise be deprived of making that low probability response). This hypothesis is known as the "response deprivation" hypothesis.
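The two accounts can be contrasted with a small sketch. The function names, argument names, and numbers are hypothetical. The response deprivation test uses the condition commonly attributed to Timberlake and Allison: reinforcement is predicted when the ratio the schedule demands (instrumental responding required per unit of contingent responding allowed) exceeds the ratio of the two behaviors at baseline.

```python
def premack_predicts(base_instrumental, base_contingent):
    """Relativity principle: the contingent behavior reinforces the instrumental
    behavior only if its baseline level is higher."""
    return base_contingent > base_instrumental


def response_deprivation_predicts(base_instrumental, base_contingent,
                                  required_instrumental, allowed_contingent):
    """Response deprivation: reinforcement is predicted when I/C > Oi/Oc, where
    I and C are the schedule amounts and Oi and Oc the baseline amounts.
    Written with cross-multiplication to avoid dividing by zero."""
    return (required_instrumental * base_contingent
            > base_instrumental * allowed_contingent)


# Hypothetical baseline: 40 min of behavior H and 10 min of behavior L per session.
# Schedule: 8 min of H buys 1 min of access to L (baseline ratio is only 4 to 1).
print(premack_predicts(40, 10))                      # False: L is the less probable behavior
print(response_deprivation_predicts(40, 10, 8, 1))   # True: the schedule restricts L below baseline
print(response_deprivation_predicts(40, 10, 2, 1))   # False: L is not restricted
```

With these made-up numbers the two views disagree on the 8-to-1 schedule, which is the kind of case the response deprivation hypothesis was meant to address.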

2. Schedules of Reinforcement. Reinforcement can be arranged to occur after every target instrumental response, or after only some of these. Different rules whereby reinforcement is administered are referred to as different "schedules" of reinforcement. Skinner argued that schedules of reinforcement are ever present in real world situations, and that in order to understand behavior fully we must understand how different schedules of reinforcement influence behavior. There have been several different ways in which people have studied the influence of different reinforcement schedules upon behavior. Some are listed below.

a. Partial vs Continuous Reinforcement. In classic runway studies, it was discovered that running speed was faster in continuous reinforcement (CRF) conditions compared to partial reinforcement (PRF) conditions. A PRF condition is one in which reinforcement occurs for fewer than 100% of the responses made. However, when running speeds are observed during an extinction phase, in which none of the responses are reinforced, the PRF group of rats persists longer than the CRF group. In other words, the PRF rats extinguish their instrumental running response more slowly. This effect is called the Partial Reinforcement Extinction Effect (PREE).

Two popular theories of the PREE were offered by Amsel and Capaldi. Amsel's frustration theory states that the PRF animals become frustrated on nonreinforced trials during the acquisition phase, and that this frustration comes to serve as a stimulus for further reinforced responding. According to this view, responding persists more during extinction because frustration occurs frequently during extinction trials, and this previously served as a cue to respond. The CRF subjects, meanwhile, only encounter frustration stimuli during the extinction phase, and they therefore have not learned that the thing to do while frustrated is to respond. They, therefore, give up sooner.

Capaldi's sequential theory states that the animal remembers what the outcome was on the previous trial, and can use this as a cue for what to do on the subsequent trial. In particular, if the previous trial was nonreinforced and the next one is reinforced, then subjects are assumed to learn that a memory of nonreinforcement (from the previous trial) signals that running on the next trial will be reinforced. An interesting experiment of Capaldi's compared responding during an extinction phase in two groups of rats. One of these groups was trained on an RNR sequence of trials each day (each trial was either reinforced, R, or not, N). The other group was trained on an RRN sequence. It was hypothesized that it would be easier to remember the previously nonreinforced trial at the beginning of the next, rewarded, trial in Group RNR because it merely had to remember over a short inter-trial interval. Animals in Group RRN would have to remember on the next day that the final trial on the day before was nonreinforced in order for this nonreinforced memory to serve as a cue for reinforced responding on the next trial. Thus, Group RNR should be better at learning that the nonreinforcement memory is a stimulus for reinforced responding on the next trial. This group should therefore persist more in responding during the extinction phase. This is the result that Capaldi observed.
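The difference between the two groups comes down to whether a nonreinforced trial is followed by a reinforced trial within the same day. The sketch below simply counts such N-to-R transitions in a daily sequence. The function name is hypothetical, and the count is only an illustration of the argument, not a model of Capaldi's data.

```python
def within_day_n_to_r_transitions(daily_sequence):
    """Count nonreinforced (N) trials immediately followed by a reinforced (R)
    trial on the same day, i.e., across a short inter-trial interval."""
    return sum(1 for prev, nxt in zip(daily_sequence, daily_sequence[1:])
               if prev == "N" and nxt == "R")


print(within_day_n_to_r_transitions("RNR"))  # 1: the N memory is fresh when R occurs
print(within_day_n_to_r_transitions("RRN"))  # 0: the next R trial is a day away
```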

b. Interval and Ratio Schedules of reinforcement. Skinner studied the effects of varying the interval or ratio requirements of different schedules of reinforcement upon behavior. Ratio schedules deliver reinforcement once the animal has made a set number of responses. This set number either does not change from one reinforcement to the next (this is a Fixed Ratio schedule, FR), or it does change from one reinforcement to the next (this is a Variable Ratio schedule, VR). Interval schedules, on the other hand, reinforce the first response that occurs after a set interval of time has elapsed since the previous reinforcement. This set amount of time can either be the same from one reinforcement to the next (Fixed Interval, FI), or it can vary (Variable Interval, VI).
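The four rules can be sketched as two small classes. All names here are hypothetical. How the variable requirements are drawn (uniformly around the mean) is an assumption made only for illustration, since the text says no more than that they vary, and the session is assumed to start at time 0.

```python
import random


class RatioSchedule:
    """FR (variable=False) or VR (variable=True) with the given mean ratio."""

    def __init__(self, mean_ratio, variable=False, seed=0):
        self.mean_ratio = mean_ratio
        self.variable = variable
        self.rng = random.Random(seed)
        self.responses_since_reinforcer = 0
        self.requirement = self._next_requirement()

    def _next_requirement(self):
        if not self.variable:
            return self.mean_ratio
        # Assumption: VR requirements are uniform on 1 .. 2*mean - 1.
        return self.rng.randint(1, 2 * self.mean_ratio - 1)

    def respond(self, time=None):
        """Register one response; return True if it is reinforced (time is ignored)."""
        self.responses_since_reinforcer += 1
        if self.responses_since_reinforcer >= self.requirement:
            self.responses_since_reinforcer = 0
            self.requirement = self._next_requirement()
            return True
        return False


class IntervalSchedule:
    """FI (variable=False) or VI (variable=True) with the given mean interval (sec)."""

    def __init__(self, mean_interval, variable=False, seed=0):
        self.mean_interval = mean_interval
        self.variable = variable
        self.rng = random.Random(seed)
        self.available_at = self._next_interval()

    def _next_interval(self):
        if not self.variable:
            return self.mean_interval
        # Assumption: VI intervals are uniform on 0 .. 2*mean.
        return self.rng.uniform(0, 2 * self.mean_interval)

    def respond(self, time):
        """Reinforce the first response after the interval has elapsed."""
        if time >= self.available_at:
            self.available_at = time + self._next_interval()
            return True
        return False


def count_reinforcers(schedule, response_times):
    return sum(schedule.respond(t) for t in response_times)


fast = range(1, 61)      # one response per second for 60 sec
slow = range(2, 61, 2)   # one response every 2 sec for 60 sec
print(count_reinforcers(RatioSchedule(5), fast), count_reinforcers(RatioSchedule(5), slow))            # 12 6
print(count_reinforcers(IntervalSchedule(30), fast), count_reinforcers(IntervalSchedule(30), slow))    # 2 2
```

In this sketch, halving the response rate halves the number of reinforcers earned on FR 5 but leaves the number earned on FI 30 unchanged.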

Because there is a direct relation between the rate of responding and the rate of reinforcement on ratio schedules, but not on interval schedules, response rates are generally higher on ratio schedules (compared to an interval schedule that has a comparable experienced reinforcement rate). Other relations between these schedules and the behaviors they produce include the specific patterns of responding that one observes. The variable schedules produce steadier rates of responding across the entire session (this can be easily seen on a cumulative record). The FR and FI schedules, however, produce a pattern in which responding is essentially absent immediately following reinforcement (the so-called post-reinforcement pause), and then emerges at some point thereafter. On the FI schedule, response rate increases steadily across the interval and reaches its peak near the end of the interval, around the time at which reinforcement becomes available. Once responding begins on the FR schedule, it continues at a relatively constant high rate.

c. Choice. The psychological mechanisms of choice behavior have been studied extensively in situations where animals choose among different behaviors, each constrained by a different reinforcement contingency. In these schedules, called concurrent schedules, Herrnstein discovered that a lawful relationship exists between response choice and reinforcement. The matching law states that the animal's relative response rate on the two alternatives will match the relative reinforcement rates that occur on the two alternatives. Two theories of why matching occurs differ in the specific psychological mechanism thought to control choice. The momentary maximizing theory assumes that the animal keeps track of the "local" probabilities of reinforcement for each alternative response. These probabilities are assumed to change over time, such that, for example, the probability of reinforcement for behavior 1 immediately following a reinforcement for behavior 1 will be relatively low. At each moment in time, the subject is assumed to choose whichever alternative has the highest probability of reinforcement at that instant.

Another theory of choice, the melioration theory, states that the local rates (not probabilities) of reinforcement govern choice. The animal is assumed to redistribute its behavior across the two alternatives until it experiences an equal rate of reinforcement on each alternative. In class, we considered a specific example of how this might work in a concurrent VI 30 sec VI 60 sec schedule. Try to work out another example where the subject is working on a concurrent VI 45 sec VI 120 sec schedule.
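The arithmetic behind matching and melioration can be sketched as follows. This is not necessarily the example worked in class. The function names, the step size, and the update rule are hypothetical, and the sketch assumes the animal collects every reinforcer each VI schedule makes available, so that the local rate on an alternative is its overall reinforcement rate divided by the proportion of time spent there.

```python
def matching_proportion(vi_1, vi_2):
    """Proportion of behavior predicted for alternative 1 by the matching law,
    given the two VI values in seconds."""
    r1, r2 = 1.0 / vi_1, 1.0 / vi_2        # scheduled reinforcers per second
    return r1 / (r1 + r2)


def meliorate(vi_1, vi_2, p=0.5, step=0.01, iterations=5000):
    """Shift the proportion of time p spent on alternative 1 toward whichever
    alternative currently has the higher local reinforcement rate."""
    r1, r2 = 1.0 / vi_1, 1.0 / vi_2
    for _ in range(iterations):
        local_1 = r1 / p                   # reinforcers per unit time spent on alternative 1
        local_2 = r2 / (1.0 - p)           # reinforcers per unit time spent on alternative 2
        p += step * (local_1 - local_2) / (local_1 + local_2)
        p = min(max(p, 1e-6), 1.0 - 1e-6)  # keep p strictly between 0 and 1
    return p


# Concurrent VI 30 sec VI 60 sec: both calculations give about 2/3 for the VI 30 alternative.
print(round(matching_proportion(30, 60), 3))
print(round(meliorate(30, 60), 3))
```

Under the stated assumption, the local rates become equal exactly when time allocation matches the relative reinforcement rates, which is why melioration arrives at the matching result here. The same two functions can be used to check an answer worked out by hand for other VI values.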