Issues in Testing and Assessment: Science Education

Assessment is the general term in use today for the process of finding out how well students have mastered the curriculum, or alternatively for how well they can meet certain criteria that represent competence in a particular subject.

Testing is one way to do assessment. There are many others, including student portfolios and performance assessment (evaluating students as they actually carry out a task). Everything that you might take into account in assigning a student's grade is part of assessment. Written classroom tests may form only one part of this general process.

Traditional testing in science education has always included more than classroom and standardized tests. It has usually also included student laboratory reports, and often a grade on laboratory skills measured by a performance assessment. Today it may include many other items: reports on individual scientists and their achievements, group project reports, student constructions and demonstrations, science fair projects, oral interviews, oral reports, video reports, computer creations, and more.

Traditional testing became very narrow in many schools in the 1980s, when grades were based almost entirely on the results of multiple-choice items on classroom tests modeled on standardized tests. Research has consistently shown very little relationship between such test grades and any measure of students' ability to use science knowledge and skills in practical contexts, from labwork to jobs. Traditional testing has been criticized as artificial and as emphasizing memory and routine calculation or reasoning over the more complex skills that are actually used in doing science. The traditional approach was called "inauthentic" for this reason.

Traditional methods of testing were based on the psychometric theories of the 1950s. These theories held that individuals have or develop specific and general abilities that can be measured by written tests and will predict or correlate with actual performance in practice. By and large this does not seem to be true when applied to individuals, though it offers a usable rough approximation when data are averaged statistically over large numbers of students. An individual's test scores say relatively little that is reliable about that person, except how well he or she does on similar kinds of tests on similar subjects. The average test scores of groups may, however, say something reliable about differences between the groups as a whole.

Traditional testing methods produce so-called standardized tests, which are normed on large test populations. This means that the scores of any group can be compared to the average score of the test population, as well as to each other. Statistical reliability in such tests tends to require that they be standardized in many ways: the formats of the items tend to be very similar (e.g. multiple-choice items), and the conditions under which the tests are given are also standardized (the amount of time allowed, strict control of what aids or materials can be used, a ban on communication between students during the test, etc.). Although this approach tends to produce results that are more reliably comparable between groups, it also makes testing highly artificial compared to what scientists, or anyone else, do in real life when using their knowledge.
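
To make the idea of norming concrete, here is a minimal sketch of how a raw score is converted into a norm-referenced score. The population mean and standard deviation below are invented for illustration, and real tests use empirically tabulated norms rather than the assumed normal curve used here:

    from statistics import NormalDist

    # Hypothetical norming data from a large test population (invented values).
    POP_MEAN = 50.0   # mean raw score of the norming population
    POP_SD = 10.0     # standard deviation of raw scores in that population

    def norm_referenced_score(raw_score: float) -> dict:
        """Express a raw score relative to the norming population:
        a z-score (distance from the mean in SD units) and a percentile
        (estimated share of the population scoring below this score)."""
        z = (raw_score - POP_MEAN) / POP_SD
        percentile = NormalDist().cdf(z) * 100  # assumes roughly normal scores
        return {"z_score": round(z, 2), "percentile": round(percentile, 1)}

    print(norm_referenced_score(63))  # {'z_score': 1.3, 'percentile': 90.3}

Note what such a score does and does not say: it locates a student relative to the norming population, but by itself it says nothing about what the student can actually do with science, which is the heart of the criticism above.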

Older psychological theories held that psychometric testing measures relatively stable, real traits of individuals as such, rather than of the interaction between the individual and the testing situation. The application of this theory to the assessment of complex skills is seriously in question, mainly because people apply their knowledge and use their skills in ways that depend very heavily on the situation and context in which they are working. We use whatever artifacts and materials come to hand, we use sensory perceptions, we use accumulated experience in that specific type of situation, we interact with other people; as a result, what we can do in a testing situation and what we can do in a real, concrete problem-solving situation (in daily life, in the laboratory, at home, on the job, in the supermarket) can be quite different. These facts have led to a new theory of "situated cognition", which says that the ability to do something is not a function of the person alone, but of the total situation in which people find themselves at the time they need to do it. This is especially true of situations in which we find ourselves repeatedly, including the test-taking situation itself (which is why you can learn to do better on a particular type of test without learning any more about the topic it is supposed to be testing).

The major debate in assessment theory today is over a matter of degree: to what degree is any particular skill or form of knowledge generalizable from one type of situation to another, and to what extent is it specific or specialized to a particular type of situation? The general belief today is that the answer to this question depends on (a) how different the situations are, (b) what kind of knowledge or skill is involved, and (c) whether the person has learned how to transfer the knowledge or skill from one specific kind of situation to another specific kind of situation, or from one class of situations to another more general class of situations. This issue of transfer, or generalizable vs. situation-type specific knowledge and know-how, applies to everything from memory to logical reasoning, to imagination, judgment, and creativity.

Standardized tests and traditional psychometric-style testing methods only work for individuals when transfer is expected to be high, and there is growing evidence that relatively few kinds of important learning in science transfer well across the stark differences between the artificial testing situation and most real situations in which we would be interested in a student's performance of a scientific or science-like activity.

One very important additional reason for this is that some students use generalizable strategies to achieve something, while others use situation-specific strategies to achieve exactly the same thing by different means. This means that traditional testing rewards a particular method rather than actual results, and so it is systematically biased in favor of generalizable methods. Most people, on the other hand, including scientific experts, tend to prefer situation-specific methods over generalizable ones in any situation with which they have a lot of prior experience. It is also not clear that generalizable methods really exist in the sense that it is obvious how to apply them in new cases. Very often you have to be taught how to apply a method in each new type of situation, and this is really no different from the situation-specific approach. The advantage of generalizable methods is often an illusion: the method only looks the same in each situation after you learn how to apply it, not before.

As a result, in science and many other subjects the goal of assessment has been redefined to require "authentic assessment", which means that what we test should be basically the same as what scientists or technologists actually do -- and obviously they do not spend their time taking multiple-choice tests! In science, if we followed this principle strictly, all assessment would be performance assessment. We would expect students to do research-like or engineering-like projects (discovery or design) and we would evaluate their science learning based on their successful use of concepts and principles from the curriculum in their projects. Maybe someday this will actually happen.

At present few people seriously propose a strictly authentic assessment, just a more authentic one than traditional tests can provide.

There are other arguments in favor of "alternative assessment", i.e. using methods other than traditional written tests. In science, since those tests had come close to being 100% multiple-choice, the idea of alternative assessment even includes adding more variety to the items on written tests. For example, most new standards for science competence emphasize the ability to work with visual representations: graphs, diagrams, charts, tables, etc. Open any scientific journal that reports experimental results and you will see several such representations in every article, and often on every page.

Multiple-choice tests also eliminate any measurement of students' ability to express and use scientific concepts in words, i.e. to write science. Science education research now emphasizes the importance of students' use of both spoken and written scientific language, and the need to use it as an assessment tool. In addition, students' ability to use mathematical and other specialized symbols (chemical formulas, genetic combinations, etc.) needs to be developed and tested by means other than asking them to identify correct already-written forms (which is much easier than actually writing them correctly yourself, but less relevant to your ability to use such skills anywhere other than on a test).

There is a close association between changing curriculum and achievement standards in science and alternative methods of assessment. New standards at the state and national level emphasize the ability to use science to solve real-world problems, or at least to think intelligently about them. Various studies and theories about how scientists, engineers, and other technically expert people do this lead to lists of more specific skills thought useful for students to master (such as being able to write a correct mathematical or chemical equation corresponding to a verbally stated or diagrammatically shown problem or situation). Each such component skill that contributes to using science to reason about real problems then corresponds to an objective of the science curriculum, and ideas about how students can demonstrate that they can use the skill define the assessment criteria.
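
For instance (an invented illustration of one such component skill): given the verbal statement "hydrogen gas burns in oxygen to form water", a student should be able to produce the balanced chemical equation

    2 H₂ + O₂ → 2 H₂O

rather than merely recognize it among four printed alternatives.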

At this point there is another split. In some cases only the objectives are defined and not the specific ways in which the student is expected to demonstrate mastery (student products or performances and criteria for evaluating them). In these approaches to assessment students and teachers can create many different sorts of evidence of achievement of an objective, for each of which there would be appropriate criteria for evaluation of this evidence. In other approaches, the objectives, the types of evidence for mastery, and the criteria for evaluating the evidence are all specified as part of the curriculum.

A compromise between these two approaches, which allows teachers, students, and curriculum planners to stake out positions at or anywhere between these extremes, is "portfolio assessment". A student portfolio in science, like a student portfolio in art (where the idea originated), is a collection of a variety of different kinds of evidence of mastery of curriculum objectives. It could contain a lab report, a model airplane, a set of photographs recording the growth of a plant, a videotape of a group presentation, and even a test or two. Stricter curricula specify some or all of the kinds of materials that students must include in their portfolios, right down to the specific tasks that produce these materials as evidence of achievement. Freer curricula allow teachers and students more latitude to come up with original tasks and activities, so long as they result in a product that represents evidence of student achievement.

Standardized testing and related classroom testing methods make it as easy as possible to compare the results of two groups (and they are also misused to compare the learning of two individuals). Portfolio assessment can also permit comparison of different groups if some or all of the kinds of evidence are required to be the same and to be based on completing the same tasks, though such comparisons are somewhat less reliable. On the other hand, they are presumably more valid, which is to say more authentic: they assess what is really important in science education, even if less reliably. When there is maximum freedom in what goes into a portfolio, it is still possible to compare individuals in a loose way, but it is no longer really meaningful to compare groups. There is no general comparison between 12 apples and 11 oranges, but there is still a kind of comparison possible between an apple and an orange.

When there is maximum freedom in portfolio content, there is also a problem of fairness in grading. Such grading has to be based on criterion-referenced rather than norm-referenced evaluation. This means that instead of comparing students to one another or to large test populations (test norms), each student is evaluated on each curriculum objective, using general criteria (sometimes called "rubrics") to assign categories like: Insufficient evidence of mastery (fail), Minimal evidence of mastery (pass), and Substantial evidence of mastery (honors). This could be quantitatively based, but more likely it would be a qualitative judgment that the student seems to know something just barely or knows it well, knows how to use it in a few similar situations or in many diverse ones. If the tasks, products, and forms of evidence are not the same from student to student, class to class, school to school, or district to district, there is really no way to know how consistent the application of any set of grading criteria is from one case to another.
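
As a rough sketch of what criterion-referenced grading might look like in practice, here is a toy example. The three category labels come from the text above; the evidence-counting rule is an invented stand-in for what would really be a teacher's qualitative judgment:

    # Criterion-referenced evaluation: each student is judged against
    # fixed criteria per objective, never against other students.
    RUBRIC = [
        (2, "Substantial evidence of mastery (honors)"),
        (1, "Minimal evidence of mastery (pass)"),
        (0, "Insufficient evidence of mastery (fail)"),
    ]

    def evaluate(objective: str, evidence_items: list) -> str:
        """Map the amount of acceptable evidence for one objective onto a
        rubric category. A real rubric would weigh the quality of each item
        and the diversity of situations, not just count items."""
        count = len(evidence_items)
        for threshold, category in RUBRIC:
            if count >= threshold:
                return category

    print(evaluate("use the concept of density",
                   ["Lab 7 report", "buoyancy video"]))
    # Substantial evidence of mastery (honors)

The point of the structure is that the judgment depends only on the student's own evidence and the fixed criteria, not on how anyone else performed.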

In practice some realistic compromise is usually found between completely specific requirements for a portfolio and completely free portfolio content. The simplest way to do this is to have some prescribed tasks and products, and then additional more open-ended ones. For example: Include a report on work on Lab 7, and a report on any lab applying the concept of density, and your original plan for an experiment for your individual project, etc.
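
A minimal sketch of how such a compromise specification might be checked is given below. The requirement names echo the example just given; the function and data layout are hypothetical:

    # Hypothetical check that a portfolio meets a compromise specification:
    # some fully prescribed items, plus open-ended items in named categories.
    REQUIRED_EXACT = {"Lab 7 report"}
    REQUIRED_OPEN = {
        "lab applying density": 1,        # any lab using the density concept
        "original experiment plan": 1,    # the student's own design
    }

    def meets_requirements(items: dict) -> bool:
        """items maps each item's title to the open-ended category it
        satisfies ("" if none). Prescribed titles must appear verbatim."""
        if not REQUIRED_EXACT <= set(items):
            return False
        for category, needed in REQUIRED_OPEN.items():
            if sum(1 for c in items.values() if c == category) < needed:
                return False
        return True

    portfolio = {
        "Lab 7 report": "",
        "Density of irregular solids": "lab applying density",
        "Does music affect plant growth? (plan)": "original experiment plan",
    }
    print(meets_requirements(portfolio))  # True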

In the future, if students are ever evaluated solely on the basis of what they can do when performing like techno-scientists, so that all their projects are authentic rather than artificial ones imposed by the curriculum, their portfolios will have uses beyond the school. These portfolios will be evaluated against the objectives of a curriculum (perhaps not just the science curriculum!) for school credit, but they will also be available to be re-evaluated, using different criteria, by admissions committees for the post-secondary programs a student might apply to, or by prospective employers using criteria relevant to various sorts of jobs. Your transcript of credits, or your diploma, might be a very small part of your total achievement portfolio. If assessment is genuinely authentic, there is no reason to limit portfolios to use by the school. Students would edit and rearrange their portfolios for different applications; the school would simply certify that what the portfolio contained was actually the work of that student.

It is even possible that such a system, in the form of a certified electronic portfolio, would completely replace the present system of credits, examinations, and diplomas. The same methods could be extended to higher education and to your career. Everyone would have certified samples of their work, whether from school or the job, available for a variety of purposes. Your resume, like your transcript, would be just a small part of the total package. People could be judged by their work rather than by their credentials. And the same body of work might be evaluated very differently by different committees or organizations, each using its own criteria as to what was most important.
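
To picture what "the school would simply certify" might mean in an electronic portfolio, here is a speculative sketch. It uses a plain hash digest where a real system would need cryptographic signatures and identity infrastructure; all names are invented:

    import hashlib

    # Speculative sketch: the school publishes a digest of the student's
    # work, so any later evaluator (admissions committee, employer) can
    # check that the work is unaltered while applying its own criteria.
    def certify(school: str, student: str, work: bytes) -> dict:
        return {"school": school, "student": student,
                "sha256": hashlib.sha256(work).hexdigest()}

    def verify(cert: dict, work: bytes) -> bool:
        return hashlib.sha256(work).hexdigest() == cert["sha256"]

    report = b"Lab 7: measuring the density of irregular solids ..."
    cert = certify("Anytown High", "A. Student", report)
    print(verify(cert, report))            # True: matches certified work
    print(verify(cert, report + b"edit"))  # False: tampering detected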

For the time being, however, portfolios will tend to be tied more or less closely to a specific curriculum. Teachers will usually have some responsibility to create tasks and activities that can result in products appropriate for inclusion in the portfolio (i.e. ones that will show evidence of mastery of curriculum objectives). They will need to teach students how to produce useful portfolio items (from writing skills to drawing skills to all the other usual skills of doing good work in science). And teachers will most often be expected to apply criteria or rubrics to evaluate the contents of portfolios and assign some overall grade or rating, and/or complete an evaluation inventory or checklist of mastery of each curriculum objective for each student. All this means that more of a teacher's time, and more of the students' time, will be spent on assessment-related activities. For this reason, most portfolio assessment programs try to integrate the work of creating a portfolio with the general work of learning and teaching science. Nevertheless, as we abandon the faulty shortcuts of old-fashioned testing, teachers are going to have to spend more time on, and pay more attention to, issues of assessment. Hopefully this will be a good thing for learning! At the very least it ought to give teachers, as well as students, a more valid and authentic, as well as a more in-depth and continuous, assessment of strengths and weaknesses during the learning process. The better teachers know their students, the better they can help them learn.