This is an excerpt from The Washington Post’s Valerie Strauss column featuring a post by Leonie Haimson. Les Perelman is an old high school pal, retired director of the multi-discipline writing program at MIT and an expert on computer scoring. Read the entire article here.

According to Les Perelman, retired director of a writing program at MIT and an expert on computer scoring, the PARCC/Pearson study is particularly suspect because its principal authors were the lead developers for the ETS and Pearson scoring programs. Perelman said: “It is a case of the foxes guarding the hen house. The people conducting the study have a powerful financial interest in showing that computers can grade papers.”

In addition, the Pearson study, based on the spring 2014 field tests, showed that the average scores given by either a machine or a human scorer were “very low: below 1 for all of the grades except grade 11, where the mean was just above 1.”

Given the overwhelmingly low scores, the results of human and machine scoring would of course be closely correlated in any scenario.
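
A rough way to see why a narrow score range inflates agreement: the Python sketch below computes how often two raters would agree purely by chance if each picked a score at random from the same distribution. Both distributions are made up for illustration and are not the PARCC/Pearson data. The function name `chance_agreement` is hypothetical, and "exact" and "adjacent" (within one point) agreement are used here only as simple stand-ins for whatever reliability measures the study reports.

```python
from itertools import product


def chance_agreement(dist, tolerance=0):
    """Probability that two independent raters, each drawing a score at
    random from `dist` (a {score: probability} mapping), end up within
    `tolerance` points of each other."""
    return sum(
        p_a * p_b
        for (a, p_a), (b, p_b) in product(dist.items(), repeat=2)
        if abs(a - b) <= tolerance
    )


# Hypothetical score distributions (assumptions, not study data).
narrow = {0: 0.55, 1: 0.35, 2: 0.07, 3: 0.02, 4: 0.01}  # mostly 0s and 1s
spread = {0: 0.20, 1: 0.20, 2: 0.20, 3: 0.20, 4: 0.20}  # evenly spread

for name, dist in (("narrow", narrow), ("spread", spread)):
    exact = chance_agreement(dist)
    adjacent = chance_agreement(dist, tolerance=1)
    print(f"{name}: exact={exact:.2f} adjacent={adjacent:.2f}")
```

Under these assumed numbers, raters who ignore the essays entirely agree far more often on the narrow distribution than on the spread one, which is why high agreement on mostly-0-and-1 scores says little about scoring quality.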

Les Perelman concludes: “The study is so flawed, in the nature of the essays analyzed and, particularly, the narrow range of scores, that it cannot be used to support any conclusion that Automated Essay Scoring is as reliable as human graders. Given that almost all the scores were 0’s or 1’s, someone could obtain close to the same reliability simply by giving a 0 to the very short essays and flipping a coin for the rest.”

As for the AIR study, it makes no particular claims as to the reliability of the computer scoring method, and omits the analysis necessary to assess this question.

As Perelman said: “Like previous studies, the report neglects to give the most crucial statistics: when there is a discrepancy between the machine and the human reader, when the essay is adjudicated, what percentage of instances is the machine right? What percentage of instances is the human right? What percentage of instances are both wrong? … If the human is correct, most of the time, the machine does not really increase accuracy as claimed.”
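
The statistics Perelman asks for are simple to compute once the adjudicated scores are available. The Python sketch below is only an illustration: the function name `adjudication_breakdown`, the record layout, and the sample scores are all hypothetical, and it assumes the adjudicated score is treated as the correct one.

```python
from collections import Counter


def adjudication_breakdown(records):
    """Given (human, machine, adjudicated) score triples, return the share
    of discrepant cases in which the human was right, the machine was
    right, or both were wrong. Assumes the adjudicated score is correct."""
    counts = Counter()
    for human, machine, adjudicated in records:
        if human == machine:
            continue  # no discrepancy, so nothing was adjudicated
        if adjudicated == human:
            counts["human_right"] += 1
        elif adjudicated == machine:
            counts["machine_right"] += 1
        else:
            counts["both_wrong"] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    keys = ("human_right", "machine_right", "both_wrong")
    return {key: counts[key] / total for key in keys}


# Made-up records for illustration only: (human, machine, adjudicated).
sample = [(1, 1, 1), (2, 1, 2), (0, 1, 0), (1, 2, 2), (3, 1, 2), (2, 0, 2)]
print(adjudication_breakdown(sample))
```

A report that published these three shares would let readers judge whether the machine adds accuracy or merely agrees with humans on the easy cases.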

Moreover, the AIR executive summary admits that “optimal gaming strategies” raised the score of otherwise low-scoring responses a significant amount. The study then concludes that because one computer scoring program was not fooled by the most basic of gaming strategies, repeating parts of the essay over again, computers can be made immune to gaming. The Pearson study doesn’t mention gaming at all.

Indeed, research shows it is easy to game by writing nonsensical long essays with abstruse vocabulary. See, for example, this gibberish-filled prose that received the highest score by the GRE computer scoring program. The essay was composed by the BABEL generator – an automatic writing machine that generates gobbledygook, invented by Les Perelman and colleagues. [A complete pair of BABEL-generated essays along with their top GRE scores from ETS’s e-rater scoring program is available here.]

In a Boston Globe opinion piece, Perelman describes how he tested another automated scoring system, IntelliMetric, that similarly was unable to distinguish coherent prose from nonsense, and awarded high scores to essays containing the following phrases:

“According to professor of theory of knowledge Leon Trotsky, privacy is the most fundamental report of humankind. Radiation on advocates to an orator transmits gamma rays of parsimony to implode.”

Unable to analyze meaning, narrative, or argument, computer scoring instead relies on length, grammar, and arcane vocabulary to assess prose. Perelman asked Pearson if he could test its computer scoring program, but was denied access. Perelman concluded:

If PARCC does not insist that Pearson allow researchers access to its robo-grader and release all raw numerical data on the scoring, then Massachusetts should withdraw from the consortium. No pharmaceutical company is allowed to conduct medical tests in secret or deny legitimate investigators access. The FDA and independent investigators are always involved. Indeed, even toasters have more oversight than high stakes educational tests.

A paper dated March 2013 from the Educational Testing Service (one of the SBAC sub-contractors) concluded:

Current automated essay-scoring systems cannot directly assess some of the more cognitively demanding aspects of writing proficiency, such as audience awareness, argumentation, critical thinking, and creativity…A related weakness of automated scoring is that these systems could potentially be manipulated by test takers seeking an unfair advantage. Examinees may, for example, use complicated words, use formulaic but logically incoherent language, or artificially increase the length of the essay to try and improve their scores.
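
The manipulation described above can be illustrated with a deliberately crude scorer. The Python sketch below is a toy written for this post, not e-rater, IntelliMetric, or any real scoring engine. It assumes a scorer that rewards only word count and average word length, and the function name `toy_score`, the point caps, and both sample essays are hypothetical.

```python
def toy_score(essay: str) -> float:
    """Toy scorer (an assumption, not a real engine): up to 3 points for
    length and up to 3 points for long words. Meaning is never examined."""
    words = essay.split()
    if not words:
        return 0.0
    # One point per 50 words, capped at 3.
    length_points = min(len(words) / 50, 3.0)
    # Average word length, ignoring trailing punctuation, as a vocabulary proxy.
    avg_word_len = sum(len(w.strip(".,;:!?")) for w in words) / len(words)
    vocab_points = min(max(avg_word_len - 3.0, 0.0), 3.0)
    return round(length_points + vocab_points, 2)


# A short, coherent answer versus long, padded nonsense.
coherent = (
    "Schools need stable funding. When budgets are cut, "
    "class sizes grow and students get less help."
)
gibberish = " ".join(
    ["Verisimilitude articulates rapaciously antagonistic consequences, "
     "notwithstanding quotidian contretemps."] * 20
)

print("coherent:", toy_score(coherent))
print("gibberish:", toy_score(gibberish))
```

Because the toy scorer never looks at meaning, the padded nonsense outscores the coherent answer. Real engines use more features than this, but the incentive to write long and ornate rather than clear is the same one the ETS paper describes.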

The inability of machine scoring to distinguish between nonsense and coherence may lead to a debasement of instruction, with teachers and test prep companies engaged in training students on how to game the system by writing verbose and pretentious prose that will receive high scores from the machines. In sum, machine scoring will encourage students to become poor writers and communicators.

Les Perelman’s BABEL generator. You try it.


My friend Les Perelman and some grad students came up with software that generates gibberish. They named the program BABEL, the Basic Automated BS Essay Language Generator.

Why would MIT grad students want to generate gibberish?

To demonstrate the problem with computer-based scoring of test essays.

Les submitted the results to AES (Automated Essay Scoring) engines.

“It works spectacularly well in producing nonsense that received high scores from various AES machines,” wrote Les.

“BABEL even works well with ETS’ e-rater. Though ETS would not give me direct access to its scoring engine, ETS allows prospective test takers to take practice Graduate Record Exam (GRE) essays that are instantly graded by e-rater for $13. I have bought a number of these packages and have used the BABEL Generator to produce essays that consistently receive scores of 5 and 6 on a 1-6 point scale on each of the two writing assignments that comprise the writing portion of the GRE.”

Naturally ETS has banned Les from future access to their scoring engine.

I tried it.

BABEL asks for three keywords.

I picked Chicago, politics and schools.

Here is what BABEL produced:

Chicago has not, and no doubt never will be misplaced. Human society will always enlightenment teachers; some of intercessions and others for a accumulation. a lack of Windy City lies in the field of literature but also the field of philosophy. Teachers is the most prototypically collaborating commencement of mankind.

Verisimilitude, especially for acceptance, articulates a query on rapaciously but discordantly antagonistic consequences by instructor. If conveyances voyage or belittle contemplation, pique that is consistently inappropriate but is iconoclastic, quotidian, and reproving with politics can be more venomously portended. Additionally, political relation, often at a assumption, can be the ligation. In my experience, all of the affronts to our personal epigraph of the exposition we taunt afford the allocations in question. Even so, armed with the knowledge that the divisive surfeit mortifies impropriety, most of the probes for my reprimand blubber. Our personal injunction to the aggregation we disparage acquiesces. Chicago which performs all of the demonstrations might certainly be an amygdala on our personal assassination with the taunt we surprise as well. The rumination of devices may be inauguration but is petulant yet somehow discrepant, not remuneration that depreciates contretemps and denounces dictates. In my theory of knowledge class, none of the dictators at our personal axiom by the exposure we allure collaborate and probe quips which civilize the advancement. The more a circumstance that gambols should be reprobation, the less provocation can diligently be a Gaussian onslaught.

As I have learned in my semiotics class, teacher is the most fundamental amplification of humankind. Though interference for veracity inverts, information processes brains. The same pendulum may process two different orbitals to process an orbital. The plasma is not the only thing the brain reacts; it also receives neutrinoes for disruption with Chicago. Due to advancing, humanely but egotistically admonished accumulations collapse also on Chicago. a startling teacher changes the dictum at Windy City.

The authentication, frequently to a retort, contravenes politics. The sooner the people involved attest, the sooner contemplation sanctions confluences. Furthermore, as I have learned in my literature class, society will always verify political relation. Our personal congregation of the convulsion we expel will be demolition with apprentices and may risibly be commission. The inspection might, still yet, be elidible in the way we respond or utter the inflexibly and pusillanimously atrocious acquiescence but accumulate intercessions. In my semantics class, almost all of the tyroes at my escapade convulse or augur the appendage. a quantity of political relation is inchoate for our personal speculation on the authorization we encounter as well. The avocation denigrates conjecture, not a ligation. In my experience, many of the circumscriptions by our personal assassin at the appetite we ascertain bemoan insinuations. The less rancor that seethes is antipodal in the extent to which we demarcate most of the adjurations for the realm of reality and infuse or should unyieldingly be a trope, the more affronts articulate the trope of parsimony.

Politics with agronomists will always be an experience of human society. In any case, armed with the knowledge that sublimation may perilously be compensation, most of the domains at my aggregation dictate commencements but quibble and disseminate inquiries which fascinate a rumination. If elated agriculturalists intercede and appease sanctions to the admonishment, teachers which choreographs assassinations can be more naturally assimilated. Instructor has not, and undoubtedly never will be articulated but not risible. Chicago is genially but fallaciously whimpering as a result of its those in question.

Would this get me into Harvard? Who knows?

But Les’ research suggests it would score well with an AES engine.

You try it.