Lessons on algorithm ethics from the UK exam algorithm story

This document chronicles the recent controversy surrounding the UK exams fiasco, in which an algorithmic approach was rejected, and highlights areas where those building and deploying models must be vigilant and have training, process and governance in place.
Giles Pavey – September 2021

Background

In March 2020, the UK Government made the significant policy decision, in response to the COVID-19 pandemic, to close all schools and cancel all GCSE and A level examinations in summer 2020. This decision reflected a trade-off between the public health benefits of potentially slowing the spread of the virus and the costs to the economy and to society, including disruption to the education system.

The Government instructed OFQUAL (the organisation responsible for UK exams) to create a system that:

Ensured grades were comparable to previous years, specifically to minimise grade inflation. This was deemed important so that colleges, universities and employers could have confidence in exam results.
Was unbiased (did not discriminate) with respect to students’ protected characteristics (race, religion, gender…).
Could therefore be used by students and colleges so that pupils could progress fairly to the next stage of their education.

Method

The solution OFQUAL created was based upon an algorithm with the following inputs:

Each school’s previous results, for each subject, over the previous 3 years.
A rank order of the pupils within each school, by subject, based on the result each was expected to achieve had they taken that exam.
A predicted grade for each pupil, for each subject, also produced by the schools, although these were not used by the algorithm unless required.

The algorithm worked as follows. It looked at each school’s results in every subject, for example Blackfriars High School’s results in Mathematics over the past 3 years. This gave a prediction of how many students would be expected to receive an A, B, C, D or E grade this year. Assuming the algorithm predicted 10 A’s and 16 B’s, the top 10 students taking maths at Blackfriars would be allocated an A grade and the next 16 would receive a B. Importantly, this was regardless of the grades their teachers had predicted for them.
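The allocation step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not OFQUAL's actual implementation: the function name `allocate_grades`, the pupil labels, and the choice to raise an error when the quota does not match the cohort size are all hypothetical. Only the quota of 10 A's and 16 B's comes from the Blackfriars example.

```python
def allocate_grades(ranked_pupils, grade_quota):
    """Assign grades down a school's rank order according to a per-grade quota.

    ranked_pupils: list of pupil identifiers, strongest first.
    grade_quota:   list of (grade, count) pairs, best grade first, standing in
                   for the counts implied by the school's previous results.
    Teacher-predicted grades are deliberately not an input here.
    """
    total = sum(count for _, count in grade_quota)
    if total != len(ranked_pupils):
        # Hypothetical safeguard: the sketch expects the quota to cover
        # every ranked pupil exactly once.
        raise ValueError("quota must cover every ranked pupil exactly once")

    allocation = {}
    position = 0
    for grade, count in grade_quota:
        # The next `count` pupils in the rank order all receive this grade.
        for pupil in ranked_pupils[position:position + count]:
            allocation[pupil] = grade
        position += count
    return allocation


# Hypothetical cohort of 26 maths pupils, already ranked by the school.
pupils = [f"pupil_{i:02d}" for i in range(1, 27)]
quota = [("A", 10), ("B", 16)]  # counts taken from the example in the text

grades = allocate_grades(pupils, quota)
print(grades["pupil_10"], grades["pupil_11"])  # A B
```

Note that the pupil ranked 11th receives a B whatever grade their teacher predicted, because only the rank order and the quota enter the calculation.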

The exception was where the number of pupils taking a given subject in a given school was deemed too small for this approach. If the number of students at Blackfriars High School taking Latin was 5 or fewer, all pupils would receive their predicted grade (irrespective of the school’s previous results). If it was 6-15, the grade was a weighted combination of the algorithm’s grade and the predicted grade.
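The small-cohort exception can be sketched as a weighting function. The text only says that mid-sized groups received "a weighted combination", so the linear taper below is an assumption made for illustration, as are the function names and the idea of expressing grades as numeric points. The thresholds are parameters; treating a group of exactly 5 as "small" is also an assumption.

```python
def teacher_weight(cohort_size, small=5, large=15):
    """Weight given to the teacher-predicted grade (1.0 = prediction only).

    ASSUMPTION: the weight tapers linearly between the two thresholds.
    The source does not state the actual weighting formula.
    """
    if cohort_size <= small:
        return 1.0   # tiny group: predicted grade used as-is
    if cohort_size > large:
        return 0.0   # large group: algorithm grade used as-is
    # Mid-sized group: slide from mostly-prediction to mostly-algorithm.
    return (large + 1 - cohort_size) / (large + 1 - small)


def blended_points(predicted_points, algorithm_points, cohort_size):
    """Blend two grades expressed on a hypothetical numeric points scale."""
    w = teacher_weight(cohort_size)
    return w * predicted_points + (1 - w) * algorithm_points


# Hypothetical scale where A = 5 points and C = 3 points.
for size in (3, 10, 40):
    print(size, round(blended_points(5.0, 3.0, size), 2))
```

Under these assumptions a pupil in a group of 3 keeps the predicted grade, a pupil in a group of 40 gets the algorithm's grade, and a pupil in a group of 10 lands in between.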

Algorithmic results and ensuing calamity

Separate algorithms were used by the constituent countries of the UK, and for the exams taken at 16 (GCSEs) and 18 (A levels). In all cases the algorithms were shown to broadly hit the target of producing a spread of grades similar to previous years. OFQUAL also said that its analysis showed the algorithms had no bias with respect to protected characteristics (although more on this later!).

However, the grades received by many students were dramatically different from, and usually lower than, those predicted by their teachers. This was predominantly driven by the fact that teachers tend to be optimistic about their students’ attainment: they are prone to predict their pupils’ results “on a good day”. (This also happens in non-COVID years, when more than half of students’ actual grades are lower than predicted, e.g. predicted A-B-B and receiving B-B-C.)

OFQUAL claimed that, had the algorithm’s dampening effect not been applied, the proportion of top grades would have risen from 7.7% to 13.9%. But it was at the level of individual pupils that problems arose: the algorithm downgraded teachers’ predictions in 39% of cases.

This resulted in tens of thousands of students not achieving the grades they needed to move on to their first-choice universities.

There was also a dramatic difference in the number of downgrades received by government-funded schools compared to “Private” (fee-paying) schools. This was driven by the fact that Private schools are more likely both to have smaller numbers of students and to offer more unusual exam subjects. Both factors mean that a greater proportion of Private schools had students taking subjects in groups of 5 or fewer, leading to a higher proportion of students receiving predicted rather than algorithm grades.

A more detailed write-up of the fallout from the exam algorithm can be found here, along with a complete explanation of the formula here.

Lessons learned for those building and deploying models

There is a dramatic difference between the outcome of an algorithm when viewed across the total population (the macro level) and the individual experiences of the students affected (the micro level), where thousands received what they viewed as unfair treatment.
- Organisations deploying models must make sure to consider the impact on individuals of actions taken on the basis of their AI models, in addition to the overall effectiveness.
Even though precautions were taken, and validated, to check that the algorithms were not biased towards or against certain protected groups, the differing profiles of pupils across schools inevitably led to different outcomes, because those attending smaller private schools and taking less mainstream subjects are more privileged than the UK population as a whole.
- Specialist statistical analysis is required to design and assess algorithms. Even when an algorithm can be shown not to use protected characteristics directly, organisations must check that it does not result in discriminatory recommendations.
The exam results were significantly influenced by the previous outcomes achieved by each pupil’s school, in which the pupils themselves had played no part. This resulted in the so-called “sins of my forefathers” problem, where students were penalised for events which did not reflect their personal behaviour.
- Organisations must take immense care when using models that consider historical data that is out of the model subjects’ control.
It remains to be seen whether the predicted results used were in fact subject to teachers’ (human) bias. It has, however, been widely accepted that because these are “human decisions” they are more valid.
- Organisations that use algorithms in decision making should expect the results to be scrutinised and prepare accordingly. When replacing human decision making with algorithmic solutions, they should be prepared to uncover previously unknown bias, and to find that there may well be no perfect solution.
- Organisations must prepare clear communications and install a fair and proportionate appeals process where applicable.
Many of the problems with the algorithm arose because it was asked to produce fair results from an existing unfair process, i.e. that of basing the assessment of a student’s ability on exams rather than on their performance over the whole 2 years of their study of the curriculum.
- Organisations must be aware that while AI is a powerful tool to drive effectiveness and efficiency, it is not magical in its ability to correct all wrongs or mitigate an unfair system. They must educate themselves accordingly.
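As an illustration of the checks called for in the first two lessons, the sketch below compares how often awarded grades fall below predicted grades in different groups. It is a minimal example, not a full fairness audit: the function name, the points scale and the eight records are all hypothetical toy values, not real exam data.

```python
from collections import defaultdict

# Hypothetical ordering of grades, used only to detect a downgrade.
GRADE_POINTS = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1}


def downgrade_rate_by_group(records):
    """Share of records in each group where the award is below the prediction.

    records: iterable of (group, predicted_grade, awarded_grade) tuples.
    """
    totals = defaultdict(int)
    downgraded = defaultdict(int)
    for group, predicted, awarded in records:
        totals[group] += 1
        if GRADE_POINTS[awarded] < GRADE_POINTS[predicted]:
            downgraded[group] += 1
    return {group: downgraded[group] / totals[group] for group in totals}


# Toy records, invented purely to exercise the function.
toy_records = [
    ("state", "A", "B"),
    ("state", "B", "B"),
    ("state", "B", "C"),
    ("state", "A", "A"),
    ("private", "A", "A"),
    ("private", "B", "B"),
    ("private", "A", "B"),
    ("private", "B", "B"),
]

rates = downgrade_rate_by_group(toy_records)
print(rates)  # {'state': 0.5, 'private': 0.25}
```

A gap between groups in a measure like this does not by itself prove discrimination, but it is the kind of individual-level signal that an aggregate check on the overall grade distribution would not reveal.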
