
6.6: Machines Can Evaluate Writing Well


    Authors: Chris Anson, North Carolina State University; Les Perelman, Comparative Media Studies/Writing, Massachusetts Institute of Technology

    Across the United States, writing is being evaluated by machines. Consider the situation of Maria, a typical high school student. In a high-stakes test situation that could decide whether she’s admitted to the university of her choice, she’s given a prompt to write about a topic such as whether originality is overrated, or whether our society values certainty over skepticism. For the next 25 minutes, she tries to create a purposeful piece of writing, developing its ideas, shaping its structure, considering its style and voice, choosing appropriate examples, and honing it to suggest to its readers—her evaluators—that she can think and write effectively enough to be admitted to a good college. She even drops in a bit of humor to lighten up the essay.

    She writes her essay for people who she imagines are like her teachers—people who can read and form conclusions about her essay from multiple angles, know irony or humor when they see it, can spot an unsupported generalization, or can forgive a minor grammatical error while taking note of a more serious one. But instead of reaching those human readers, her essay is fed into a computer system for evaluation. The machine scans the paper for a handful of simple features, such as length and the percentage of infrequently used words. In a few milliseconds, it spits out a score that seals Maria’s fate.
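
    To make that concrete, here is a deliberately oversimplified sketch, in Python, of the kind of surface-feature scoring described above. Everything in it is an assumption made for illustration: the word list, the weights, and the 1–6 scale are invented here and are not the algorithm of any real scoring engine. The point is simply that nothing in it ever looks at meaning.

        # A toy surface-feature "scorer." It rewards length and unusual vocabulary,
        # two proxies of the kind described above. The word list, weights, and
        # 1-6 scale are all invented for this sketch.

        COMMON_WORDS = {
            "the", "a", "an", "and", "or", "but", "is", "are", "was", "were",
            "to", "of", "in", "on", "that", "it", "i", "you", "we", "they",
        }

        def toy_score(essay: str) -> float:
            words = essay.lower().split()
            if not words:
                return 0.0
            length_feature = min(len(words) / 500, 1.0)        # longer looks "better"
            rare = [w for w in words if w.strip(".,;:!?") not in COMMON_WORDS]
            rare_feature = len(rare) / len(words)              # obscure vocabulary looks "better"
            # Weighted sum mapped onto a 1-6 scale; the weights are arbitrary.
            return round(1 + 5 * (0.6 * length_feature + 0.4 * rare_feature), 1)

        print(toy_score("The cat sat on the mat."))        # short, plain wording: about 2.0
        print(toy_score("Ineffable perspicacity " * 250))  # long, obscure, meaningless: 6.0

    A toy like this will happily give a top score to 500 repetitions of two fancy words, which is the heart of the problem the rest of this essay describes.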

    To testing agencies, machine scoring is irresistibly alluring. Instead of hiring, training, and paying warm-blooded human beings to read and judge tens of thousands of essays, they think that investing in a computer scoring system will save them large amounts of money and time and will generate big profits. They have faith that parents, students, school officials, and the general public will think the machines are better than human readers. After all, computers are so accurate and consistent and reliable, right? Why run the risk that the evaluator reading Maria’s essay is cranky or tired on the day of the evaluation, or is coming down with a cold? Machines offer razor-sharp precision and metallic solidity, never giving in to frustration or exhaustion.

    But as we’ll show, although computers are brilliant at many things, they’re really bad at understanding and interpreting writing—even writing produced by fifth-graders—and that fact will not change in the foreseeable future. Understanding why this is true can prepare teachers, parents, students, and taxpayers to push back against testing agencies and politicians who think that people will be placated by the complexity of technology and seduced by the promise that writing can be evaluated cheaply and efficiently, justifying further cuts in educational funding.

    Why Machines Make Lousy Humans

    First, it’s important to understand that computers are not the enemy. In fact, computers play an important role in research on the language and writing that humans produce. There are some things a computer can do in a couple of seconds that would take a human researcher a lifetime (or two). Scholars of writing are the last people to resist the further development of computers to work with natural language—a term referring to the spoken or written language produced by humans as part of their daily lives.

    But when it comes to evaluating writing, computers perform badly. That’s because natural language is extraordinarily complex—far more complex than even the most sophisticated computers can understand.

    Let’s consider a few reasons why.

    • Computers don’t understand meaning. They can compute the likelihood of two words appearing close to each other, but their judgment is always based on statistical probabilities, not an understanding of word connotations. Think of the verb to serve. We can serve our country, serve in a game of tennis, or serve the president. We also can serve a casserole to you. (A cannibalistic restaurant could even serve presidents for lunch, though the supply would be pretty limited.) Humans can easily differentiate between the realistic and absurd meanings of a simple word like serve; computers can’t.
    • A computer can’t differentiate between reasonable and absurd inferences either. In fact, computers are really bad at making any inferences at all. When we speak or write, large amounts of information are left out and inferred by the listener or reader. When we read, “Fred realized he couldn’t pay for his daughter’s tuition. He looked up his uncle’s email address,” the space between the two sentences is filled with information that we infer. Almost all human language works this way. Making inferences requires vast amounts of information and astronomically large networks, connections, and permutations in infinite contexts. Although computers can obviously store and search for massive amounts of data, they don’t know how to put it together to infer. The computer would read the two statements above exactly the same as it would read, “Fred realized he couldn’t pay for his daughter’s tuition. He looked up his pet elephant’s email address.”
    • Most computer scoring programs judge logical development and effective organization by the number of sentences or words in a paragraph. If a system is programmed to see one-sentence paragraphs as undeveloped, it will apply this principle to all essays even though one-sentence paragraphs can be used to good effect (as in the sentence at the top of this bullet list). When one of us (Perelman) tried to write the best essay he could, one of the most popular machine graders admonished him that a paragraph was underdeveloped because it had only three sentences. He then expanded the paragraph by inserting completely irrelevant material—the opening line of Allen Ginsberg’s poem “Howl”: “I saw the best minds of my generation destroyed by madness, starving hysterical naked.” The computer then considered the new paragraph to be both adequately developed and coherent. (A toy version of this check is sketched just after this list.)
    • Computers get progressively worse at evaluating writing as it gets longer (for obvious reasons—there’s more to mess them up). The programmers know this. Although all commercial computer-scoring systems give higher scores to longer essays, paradoxically most limit the length of papers to around 1,000 words, about four typed pages. The Educational Testing Service’s program Criterion, for example, almost always gives high scores to essays of 999 words but will refuse to evaluate an essay containing 1,001 words. However, many college papers are more than 1,000 words.
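
    The paragraph-development anecdote in the third bullet takes only a few lines to reproduce. The sketch below assumes, purely for illustration, a rule of at least four sentences per paragraph; real products use more elaborate heuristics, but the failure mode is the same, because the check counts sentences without ever asking whether they belong there.

        # A toy version of a "development" check: a paragraph passes if it has at
        # least four sentences. The threshold is an assumption for this sketch.
        import re

        def paragraph_is_developed(paragraph: str, min_sentences: int = 4) -> bool:
            sentences = [s for s in re.split(r"[.!?]+", paragraph) if s.strip()]
            return len(sentences) >= min_sentences

        original = ("Machine scoring rewards length. It ignores meaning. "
                    "That is why it is easy to fool.")
        padded = original + (" I saw the best minds of my generation destroyed "
                             "by madness, starving hysterical naked.")

        print(paragraph_is_developed(original))  # False: only three sentences
        print(paragraph_is_developed(padded))    # True: one irrelevant Ginsberg line "fixes" it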

    The first myth to debunk about computer scoring systems is that they can read extended discourse, especially to evaluate students’ writing abilities. They can’t. They don’t understand or interpret anything that’s fed into them. They miss humor and irony, or clever turns of phrase, or any of a dozen aspects of prose that we try to teach students. They can’t discern purposeful stylistic decisions. They think gibberish is acceptable, and they mark as unacceptable perfectly reasonable prose that happens to violate some simplistic criterion, such as the number of words in a paragraph. They always interpret some aspect of writing the same way, without considering the writer’s intentions and context. They can’t make inferences between the lines of text. The complexity of human language simply baffles them—or, more accurately, goes right over their semiconductors. Writing experts have exposed these and other limitations of machine scoring using both coherent and incoherent essays. The computers can’t tell the difference.

    In one experiment, researchers at MIT created the Basic Automated Bullshit Essay Language Generator (BABEL), which produces gibberish essays. When they submitted essays produced by BABEL to a system that scores tens of thousands of student test essays, including those written for the Graduate Record Examination, the computer awarded the gibberish essays the highest possible score. Here is an excerpt from a BABEL-generated essay that received the highest score (6) from ETS’s e-rater, along with the canned comments from ETS’s GRE preparation website.

    Careers with corroboration has not, and in all likelihood never will be compassionate, gratuitous, and disciplinary. Mankind will always proclaim noesis; many for a trope but a few on executioner. A quantity of vocation lies in the study of reality as well as the area of semantics. Why is imaginativeness so pulverous to happenstance? The reply to this query is that knowledge is vehemently and boisterously contemporary.

    The score: 6

    In addressing the specific task directions, a 6 response presents a cogent, well-articulated analysis of the issue and conveys meaning skillfully. A typical response in this category:

    • articulates a clear and insightful position on the issue in accordance with the assigned task
    • develops the position fully with compelling reasons and/or persuasive examples
    • sustains a well-focused, well-organized analysis, connecting ideas logically
    • conveys ideas fluently and precisely, using effective vocabulary and sentence variety

    • demonstrates superior facility with the conventions of standard written English (i.e., grammar, usage, and mechanics) but may have minor errors.

    Obviously, the BABEL gibberish essay does none of these things. So why, with all these limitations, has computer essay scoring even seen the light of day? We’ve pointed to the economic reasons and the desire for profit. But there’s another reason, and it’s about humans, not computers.
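
    Before turning to that reason, it is worth seeing how little machinery gibberish of this kind requires. The sketch below is not BABEL itself; its sentence frames and word lists are invented here. But it follows a recipe in the same spirit as the excerpt above: grammatical frames filled at random with obscure vocabulary, which is exactly what length- and vocabulary-based surface features reward.

        # A toy gibberish generator: grammatical sentence frames filled at random
        # with obscure words. The frames and word lists are invented for this
        # sketch and are not BABEL's. Fed to a scorer that rewards length and
        # rare vocabulary, output like this does suspiciously well.
        import random
        import re

        NOUNS = ["noesis", "verisimilitude", "adumbration", "apotheosis", "probity"]
        ADJS = ["vehement", "pulverous", "gratuitous", "boisterous", "disciplinary"]
        VERBS = ["proclaims", "corroborates", "obfuscates", "promulgates", "exculpates"]
        POOLS = {"noun": NOUNS, "adj": ADJS, "verb": VERBS}

        FRAMES = [
            "The {adj} {noun} {verb} a {adj} {noun}.",
            "A quantity of {noun} lies in the study of {noun} as well as the {noun}.",
            "Why is {noun} so {adj} to {noun}?",
        ]

        def fill(frame: str) -> str:
            # Fill each slot independently so repeated slots get different words.
            return re.sub(r"\{(\w+)\}", lambda m: random.choice(POOLS[m.group(1)]), frame)

        def gibberish(sentences: int = 8) -> str:
            return " ".join(fill(random.choice(FRAMES)) for _ in range(sentences))

        print(gibberish())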

    Why Humans Make Lousy Machines

    When we look at how humans read and evaluate students’ test essays, we find an interesting paradox. For years, groups of readers have been trained—normed and calibrated—to read thousands of essays in the most consistent and accurate way possible. This is because when we allow people to read writing normally, they see it subjectively, through the lens of their experiences (think of a book club discussion). If a testing agency allowed this—if it couldn’t guarantee consistency of evaluation—it would be instantly sued. Through a long process, readers can often develop consensus on how to evaluate many aspects of papers, but such a process takes more time and money than the testing organizations are willing to spend. Instead, their training process turns humans into machines so that they will look for exactly the same features in exactly the same way, as quickly as possible. They’re told to ignore facts because they can’t verify everything they read. They’re constrained to see the essays only through the lens of what the evaluators think is important. They want to read beyond the lines of the assessment criteria, but they can’t. Because humans are required to read 20–30 essays per hour, they end up evaluating essays using the same simple features used by the machines.

    In reading high-stakes, one-shot essay tests, then, both machines and humans make lousy evaluators when we reduce their reading process to a few limited features. Machines do this because they can’t do anything else. Humans do this because they’re trained to ignore everything else they might see and interpret in an essay, including even how factual its assertions are, in order to score only those things that the test makers deem significant and, more importantly, can be scored very quickly (slow, thoughtful reading costs money).

    To take a (not so extreme) case, imagine that we assume good writing can be measured entirely by the number of grammatical and punctuation mistakes in a text. A human can be trained to act like a machine, hunting for grammatical mistakes and ignoring everything else. A computer can be similarly trained to recognize a lot of mistakes, even while missing some and flagging false positives. But both evaluators, human and computer, miss the point. Writing is far more complex than a missing comma. The fault lies with the testing agencies that fail to understand fully what writing is and how the ability to produce it is best measured.

    Taking the Machine Out of Writing Assessment

    When it comes to testing and evaluating our kids’ writing, machines alone aren’t really the problem. It’s what we’re telling the machines to do. And that’s very similar to what we ask human evaluators to do. What, then, is the solution?

    First, we need to stop testing our kids’ writing to death. Computer scientists (who are not writing specialists) were attracted to the possibility of machine scoring precisely because the regressive kind of human scoring they were presented with looked so simple and replicable. We must start by critiquing the testing machine writ large—the multibillion-dollar industry that preys on school districts, misinformed politicians, naïve parents, and exploitable kids under the guise of providing assessment data designed to improve education. Nothing is improved by relentless testing, especially of the kind that reduces writing to the equivalent of running on a hamster wheel. No standardized writing test is purposeful, motivating, or engaging, and it almost never gives the writer any response other than a number.

    If the methods of this testing and evaluation are misguided, what happens with the results can be deplorable. Because of relentless and unfounded accountability coming from politicians and government officials who often know next to nothing about how education really works, schools must demonstrate their success through standardized tests. Teachers’ pay raises or even their jobs are linked to their students’ scores on these tests, and entire schools can be defunded or closed if they fall too far below a norm, even though they may be located in an area of urban blight and populated by kids who, through no fault of their own, do not have advantages that support their early literacy development. So what happens? The teachers, fearful of the consequences of poor test scores, begin narrowing everything they do in anticipation of the standardized tests. This process can bankrupt kids’ education by denying them richer learning experiences unrelated to the narrow parameters of the tests. Worse, it bankrupts teachers’ creativity and freedom to apply skills and strategies they’ve learned as educators to create a meaningful, engaging curriculum—in other words, to teach, in the best sense of the word.

    What’s the alternative? It’s not in evaluation, but in support. It’s to get testing off the backs of students and teachers. It’s to help young people to develop their writing abilities in authentic situations that give them time to think and formulate ideas, gather necessary information, structure and draft pieces of writing, and hone them to accomplish meaningful goals, such as to inform or persuade or entertain people who can make sense of what they write. It’s to put far more time into teaching than evaluating. It’s to re-empower teachers to use their best, most creative abilities to nurture students’ writing and give them multiple purposes, contexts, and audiences. It’s to recognize the meaning that writers are conveying and not just simple formal elements of their prose. It’s to recognize that students are at different stages of development and language proficiency and to teach accordingly.

    Why are we reducing writing situations to sterile, purposeless tasks designed to yield a few metrics that are poorly related to the meaning of the word “writing”? Test makers and evaluation agencies will say that they aren’t denying learners all the rich, meaningful writing situations they should encounter, but that their tests are a convenient, simple, cheap way to measure what they can do. But they’re not. More authentic kinds of evaluation, such as student portfolios carefully read by teachers, are much better and more humane methods because they focus as much on the development of ability as they do on its measurement. And if computers can’t read a 1,000-word test essay, they won’t even begin to know what to do with a portfolio.

    Further Reading

    For more about the problems with machine evaluation of writing, see Ellis B. Page’s prescient essay from 1966, “The Imminence of Grading Essays by Computer,” Patricia Freitag Ericsson and Richard Haswell’s edited collection Machine Scoring of Student Essays: Truth and Consequences, Les Perelman’s “When ‘the State of the Art’ is Counting Words,” Doug Hesse’s “Who Owns Writing?,” as well as two pieces by Anne Herrington and Charles Moran: “What Happens when Machines Read our Students’ Writing?” and “When Writing to a Machine is Not Writing at All.” For a major professional organization’s stance on machine evaluation of writing, see the National Council of Teachers of English’s position statement, “Machine Scoring Fails the Test.” For more about standardized testing and its problems, see Chris Anson’s “Closed Systems and Standardized Writing Tests,” Todd Farley’s Making the Grades: My Misadventures in the Standardized Testing Industry, and a piece in Slate by Matthew J.X. Malady titled “We are Teaching High School Students to Write Terribly: The Many Problems of the SAT’s Essay Section.”

    Keywords

    essay grading, high-stakes writing tests, machine scoring, standardized tests, writing assessment

    Author Bios

    Chris Anson is Distinguished University Professor and director of the Campus Writing and Speaking Program at North Carolina State University, where he works with faculty across the curriculum to improve the way that writing is integrated into all disciplines. For almost four decades, he has studied, taught, and written about writing and learning to write, especially at the high school and college levels. He is past chair of the Conference on College Composition and Communication and past president of the Council of Writing Program Administrators. He has studied and written about writing and computer technology and is a strong advocate of increased attention to digital modes of communication in instruction, but his research does not support the use of computers to score high-stakes writing tests.

    Les Perelman recently retired as director of Writing Across the Curriculum in the department of Comparative Media Studies/Writing at the Massachusetts Institute of Technology, where he also served as an associate dean in the Office of the Dean of Undergraduate Education. He is currently a research affiliate at MIT. He is a member of the executive committee of the Conference on College Composition and Communication and co-chairs that organization’s Committee on Assessment. Under a grant from Microsoft, Dr. Perelman developed an online evaluation system for writing that allows student access to readings and time to plan, draft, and revise essays for a variety of assessment contexts. Perelman has become a well-known critic of certain standardized writing tests and especially the use of computers to evaluate writing.