Thoughts on Pearson’s Machines (and their Humans too)

© Robert A. Buckmaster 2022

Pearson, the very large publisher, have moved into English language testing in a big way, throwing a lot of money at the problem, and providing stiff competition to the likes of Cambridge and IELTS.

About their PTE Academic exam, they say:

Pearson Test of English Academic (PTE Academic) is an international computer-based English language test. It provides a measure of a test taker’s language ability to assist education institutions and professional and government organizations that require a standard of academic English language proficiency for admission purposes.

Source: Pearson PTE Academic Score Guide for test takers, Version 15 – April 2021

[Accessed on the 16th March 2022.]

The key point here is that it is a computer-based test and a computer-marked test, which removes the need for human examiners and so saves money: ‘All items in PTE Academic are machine scored.’

Let’s look at the writing part of the test. It seems to consist of two tasks – to read a text and summarize it, which is a good task, and to write an essay, which is another good task. So far, so good.

The Summary Writing Task

The summary task is rated according to form, content, grammar and vocabulary.

The form rating scale seems to be based on two possible marks, as can be seen below:

Summary Writing Task Form Rating Scale.

This rating scale is rather limited: it does not seem to allow for a good summary written in two sentences, which is a strange choice.

The content rating seems to be this:

Summary Writing Task Content Rating Scale

While this is a slightly wider range than the form scale, the choice between 0 (a bad summary), 1 (fair) and 2 (good) is a limited one; these bands must be very wide and each must cover a broad range of responses. And there is still the problem that a good response must be one sentence and should contain ‘all relevant aspects’. The grammar and vocabulary scales are similar to this one.

The Essay Writing Task

In this task the candidates read a short background statement and then address a question in an essay of 200–300 words. If they do this they will get 2 for Form. They get 1 for writing somewhat less or more: ‘Length is between 120 and 199 or between 301 and 380 words’.
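To make this rule concrete, here is a rough sketch in Python of how such a word-count check might work. It is only my reconstruction: the 2 and 1 bands come from the wording quoted above, and the assumption that anything outside 120–380 words scores 0 is mine.

```python
def essay_form_score(word_count: int) -> int:
    """Illustrative sketch of the essay Form rule described above (not Pearson's code).

    2 marks: 200-300 words (the required length)
    1 mark : 120-199 or 301-380 words (quoted from the score guide)
    0 marks: anything shorter or longer -- my assumption
    """
    if 200 <= word_count <= 300:
        return 2
    if 120 <= word_count <= 199 or 301 <= word_count <= 380:
        return 1
    return 0

print(essay_form_score(300))  # 2
print(essay_form_score(301))  # 1 -- one word over the limit and a mark is lost
print(essay_form_score(381))  # 0 -- presumably, one word further and all Form marks are lost
```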

Content is rated on a four-band scale, as shown below:

Essay Writing Content Rating Scale

With only four bands to choose from, there is not much leeway for subtle distinctions between responses. Omit one minor point and you do not get the top mark for content!

The other criteria (Development, Structure and Coherence; Grammar; General Linguistic Range; Vocabulary Range; and Spelling) are all dealt with using a three-band scale of 0–2.

The scale for grammar, for example, is this:

Essay Writing Grammar Scale

If a candidate shows ‘a relatively high degree of grammatical control’ [band score 1] but has an impeding error, according to this scale they should score 0. Again, there is no room in such a scale for more finely graduated marking.


But what about the computers, and the humans?

Computers rate all aspects of the writing, which is fair enough for word count. For spelling, though, the computer relies on this scale:

Spelling Rating Scale

This seems a bit harsh; two slips in spelling in up to 300 words and you get zero.
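To see quite how harsh, here is a rough sketch of how such a band might work. Only the zero band is grounded in the guide (two slips score zero); the assumption that one error still earns 1 and an error-free text earns 2 is mine.

```python
def spelling_score(num_errors: int) -> int:
    """Sketch of a 0-2 spelling band (my reading, not Pearson's published rule).

    0 errors  -> 2 (assumption)
    1 error   -> 1 (assumption)
    2+ errors -> 0 (implied by the guide: two slips score zero)
    """
    if num_errors == 0:
        return 2
    if num_errors == 1:
        return 1
    return 0

# In an essay of up to 300 words, the second typo wipes out the whole criterion.
for errors in range(4):
    print(errors, "spelling error(s) ->", spelling_score(errors))
```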

If we look at the marking samples given, there are some more concerns. In the table below, from page 40, we see the ratings from the machine and from the humans.

Test taker B: mid B2 Level p. 40

The version below is annotated to highlight some problems. These include the agreement between the human raters – or rather their divergence – and the machine marking, which is allowed to be more precise than the humans (why?) and which also diverges from them. To explain the rating system: the first two humans rate the criteria, and if they disagree, the adjudicator, presumably a more experienced examiner, decides what the proper mark is. The other problems include the wrong maximum possible marks being shown in various places. In short, it's a mess.

Test taker B: mid B2 Level p.40; annotated

For content, each human rater disagreed completely with the others. This tells us that there are problems with the rating scales – they are not flexible or precise enough to be used properly – and/or problems with rater training.

Why the machine is allowed to rate to a greater degree of precision is a mystery. Perhaps the human raters would perform better if they had a rating scale of 0 to 5, with half-band scores.

The problems with the maximum scores being wrong, and with the adjudicator giving an inadmissible rating for grammar, must surely be proofing errors, or the result of earlier rating scales being cut and pasted into this document – except that the error is repeated, as can be seen below.

Pearson Errors

Content is scored out of a maximum of 3 – this is stated in multiple places in the text – while the other six essay criteria are scored out of 2 each. This makes a maximum possible score of 15. This does not give me confidence in these tests.
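For the record, the arithmetic is simply the sum of the per-criterion maxima; the sketch below lists the seven essay criteria as I read them from the document.

```python
# Maximum marks per essay criterion, as read from the discussion of the essay task above.
essay_maxima = {
    "Content": 3,
    "Form": 2,
    "Development, Structure and Coherence": 2,
    "Grammar": 2,
    "General Linguistic Range": 2,
    "Vocabulary Range": 2,
    "Spelling": 2,
}

print(sum(essay_maxima.values()))  # 3 + (6 * 2) = 15
```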


Addendum: Notes on Pearson Speaking Scales

The same document has the speaking scales which are used in the exam. Below is an extract from the Oral Fluency Scale, which is used in addition to a Pronunciation Scale.

Pearson Oral Fluency Speaking Scale p. 20

Some comments:

  • ‘Good’ is not a recognized level between Intermediate and Advanced.
  • The ‘Native-like’ descriptor is completely unrealistic. Have you heard Barack Obama speaking without a teleprompter? There are always hesitations, repetitions and false starts in unscripted, spontaneous ‘native-speaker’ or ‘native-like’ speech.
  • The Advanced descriptor is absurd with its ‘no more than one hesitation, one repetition or a false start’ formulation. What is the obsession with quantifying things in this way? [See Spelling, above]
  • The Intermediate descriptor is an extreme example of this quantification mania.
  • Then again, perhaps it has to be quantified in this way so that the machine can rate it.

In more than 27 years of using various speaking scales, these are the worst I have seen, even if we make allowance for the fact that they might be generalized versions for public consumption.


In Conclusion

There are major problems with this document, and perhaps major problems with these tests.

The tables shown contain significant errors in the maximum possible scores.

The analysis shows the limitations of the rating scales. They are too coarse-grained to give an effective rating of a text: the criteria amount to little more than ‘good’, ‘pass’ and ‘fail’. This is insufficient. These categories are too broad to be really meaningful, or even fair to the candidates.

Because of the lack of subtlety in the scales, there is no allowable divergence between raters. Yet in the content rating shown above, three different raters chose three different ratings. Something is amiss.

The Speaking Scales are completely unrealistic in their expectations of candidate performance.

In short I would not recommend these tests.


Update (18.03.22)

Today, on another part of the Pearson website, I found an updated version of the document used above [PTE_Academic_Score_Guide_for_Test_Takers_-_Jan_2022_V2.pdf] in which the obvious errors with the maximum scores noted above have been corrected, as shown below. However, the error with the adjudicator's grammar band score remains, and the problems with the band descriptors themselves still stand.

From: PTE Academic – Test Takers | Score Guide, page 29