
Constructing Balanced Classroom Tests

This file should be read in conjunction with the discussion of alternative assessment. Pencil-and-paper classroom tests should be only one component of assessment strategies.

The principle of balance

Any classroom test is effectively a random sample of what students know and don't know, can and cannot do. Its reliability and validity are limited because it only gives information about what students did under the test conditions at the time the test was taken, and it may not generalize to tell us what those students might do under other conditions or at another time. It also tells us about only a very small subset of all the things students are expected to learn from the curriculum. We need to make sure that this subset is properly representative of the goals of the curriculum and is not a biased sample.

Many classroom tests are biased, or unbalanced, in one or more of the following ways:

testing only memory, recall, simple association, or simple one-step reasoning
including only questions of one level of difficulty
testing only verbal or mathematical forms of reasoning
using only multiple-choice items

I recommend that classroom tests aim to be balanced in each of these ways:

balanced by function, or the type of reasoning required to respond to a test item, to include not only memory and recall, but logical deduction, generalization, explanation, hypothesizing, question-posing, model-building, model-testing, generation of alternatives, judgment among alternatives, and imagination
balanced by difficulty, to include easier, average, and more difficult items
balanced by medium of representation, to include visual-graphical as well as verbal and mathematical items
balanced by format, to include not only multiple-choice items but also short-answer items ranging from lists of terms to one- and multiple-sentence written responses, graphical items (graphs, charts, tables, maps, diagrams, etc.), problem-solving items (where students show their work) and other reliable and informative formats.

Teacher judgment is required to decide what proportions of these various possibilities provide the most reasonably balanced test for a particular curriculum and population of students.

Some general guidelines:

Types of reasoning: keep pure memory or recall items to no more than 50% of the test (by time, by number of items, and/or by number of points); less than 50% for more able groups of students

Difficulty: for average classes, about 15-20% easier, more basic questions (ones you believe all students should be able to answer, and that you estimate at least 60% will get correct); about 10-15% difficult questions to challenge the most able students (you estimate that only about 10-15% of the class will get these correct); the rest of average difficulty (you estimate that 30-60% of the class will get them correct).

Item formats: no more than 50% multiple-choice items; at least one visual-graphical item per test, more if possible; at least two items in which students must write out a response of one full sentence or more and/or show their work; the rest as short-answer, problem-solving, or other types.
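
The guidelines above are simple proportions, so a draft test can be checked against them mechanically. The following is only a minimal sketch in Python, not a prescribed tool: the Item class, its field names, and the labels ("recall", "easy", "multiple_choice", and so on) are hypothetical, and shares are counted by number of items only, although the guidelines also allow counting by time or by points. Teacher judgment still decides whether a warning matters for a particular class.

```python
from dataclasses import dataclass


@dataclass
class Item:
    # All field names and labels here are illustrative assumptions.
    reasoning: str   # "recall", or any other label such as "deduction"
    difficulty: str  # "easy", "average", or "hard"
    fmt: str         # "multiple_choice", "short_answer", "graphical",
                     # "written", or "problem_solving"


def pct(items, predicate):
    """Percentage of items (by count) for which predicate is true."""
    return 100 * sum(1 for it in items if predicate(it)) / len(items)


def check_balance(items):
    """Return guideline warnings; an empty list means none were triggered."""
    if not items:
        raise ValueError("the test has no items")
    warnings = []
    if pct(items, lambda it: it.reasoning == "recall") > 50:
        warnings.append("more than 50% pure memory or recall items")
    if not 15 <= pct(items, lambda it: it.difficulty == "easy") <= 20:
        warnings.append("easier items are outside the 15-20% range")
    if not 10 <= pct(items, lambda it: it.difficulty == "hard") <= 15:
        warnings.append("difficult items are outside the 10-15% range")
    if pct(items, lambda it: it.fmt == "multiple_choice") > 50:
        warnings.append("more than 50% multiple-choice items")
    if sum(1 for it in items if it.fmt == "graphical") < 1:
        warnings.append("no visual-graphical item")
    written = sum(1 for it in items if it.fmt in ("written", "problem_solving"))
    if written < 2:
        warnings.append("fewer than two written or show-your-work items")
    return warnings
```

For example, a ten-item draft made up entirely of average-difficulty recall items in multiple-choice format would trigger all six warnings.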

Secrets of writing effective multiple-choice items

TERMS: the prompt is what precedes the answers or response choices; incorrect choices are called distractors.

all items should have exactly four possible answers
one answer should be 100% correct, the other three 100% incorrect
avoid responses such as All of the above, None of the above, A and B but not C, etc.
use no more than one negative qualifier per item (no, not, none, without, unless, impossible, etc.)
use only necessary technical terms and avoid unnecessary, unfamiliar general vocabulary
keep the general reading-level demands of items as simple as possible
frequently link the prompt and the responses by logical connectors such as because, if ... then, implies that, results in, etc.
frequently make the responses each a complete clause (subject and verb)
sometimes provide information on the basis of which to respond to the prompt
make each distractor diagnostic for some common error or misconception
make each response choice as parallel in grammar and wording, and as equal in length, as possible
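
Several of the rules above can be checked mechanically before a test is printed. The sketch below is an illustration only: the function name lint_item, its parameters, and the "twice as long" threshold for unequal response lengths are assumptions made for this example, not part of the rules. It covers only the countable rules (four choices, banned catch-all responses, negative qualifiers, rough length parity); whether distractors are diagnostic or the wording is parallel still needs a human reader.

```python
import re

# Negative qualifiers listed in the rules above.
NEGATIVES = {"no", "not", "none", "without", "unless", "impossible"}
# Catch-all responses the rules say to avoid.
BANNED = ("all of the above", "none of the above")


def lint_item(prompt, choices, correct_index):
    """Return a list of problems found in one multiple-choice item."""
    problems = []
    if len(choices) != 4:
        problems.append("item should have exactly four choices")
    if not 0 <= correct_index < len(choices):
        problems.append("correct_index does not point at a choice")
    for choice in choices:
        if choice.strip().lower().rstrip(".") in BANNED:
            problems.append(f"avoid the response {choice!r}")
    # Count negative qualifiers across the prompt and all choices.
    words = re.findall(r"[a-z']+", " ".join([prompt, *choices]).lower())
    if sum(w in NEGATIVES for w in words) > 1:
        problems.append("more than one negative qualifier")
    # Assumed threshold: flag if the longest choice is over twice the shortest.
    lengths = [len(c) for c in choices]
    if lengths and max(lengths) > 2 * min(lengths):
        problems.append("choices differ widely in length")
    return problems
```

An item such as "Which is not a mammal?" with the three choices "Shark", "None of the above", and "Whale" would be flagged on every one of these checks except the answer index.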

DO NOT ASSUME THAT BECAUSE AN ITEM APPEARS IN A STANDARDIZED TEST, IT IS EFFECTIVE FOR USE ON A CLASSROOM TEST!

Standardized tests contain many more items and are based on statistical criteria of effectiveness that do not apply to classroom tests. They also contain many items which are deliberately chosen to give the test desirable statistical properties (e.g. to raise or lower the mean, variance, degree of inter-item correlation, etc.) but which make the items unreliable for classroom use.

Remember the GOLDEN RULE of classroom test item construction:

There should be no reason for a student to get an item incorrect EXCEPT not understanding the curriculum content that the item is designed to test.

NEVER include 'trick' items, except on practice tests designed to get students used to seeing such items on standardized tests. Items with multiple negative qualifiers, or with responses that depend on combining other responses, tend to confuse students unnecessarily. So do items with complex wording or unfamiliar, nontechnical vocabulary. If you want to build students' reading skills, do so gradually, and provide separate quizzes where students can practice responding to more complexly worded items. If included in regular classroom tests, such items will reduce the effectiveness of your testing of curriculum content.