In January 2023, ChatGPT became the fastest-growing consumer software application in history. Huge public interest in Artificial Intelligence accompanied it, with AI proposed as a potential solution to a wide range of problems, from missed hospital appointments to climate change. One problem commonly discussed was teacher workload. Later that year, the Secretary of State for Education in England, Gillian Keegan, stated that AI had “the power to transform a teacher’s day-to-day work” and “could take much of the heavy lifting out of compiling lesson plans and marking”.
Indeed, arguably the future is now. At Cambridge, we’ve released guidance on how we are putting people at the heart of the generative AI revolution in an educational context.
Across the globe, companies are already offering products such as automated essay marking services and lesson planners, largely built on generative AI models. The potential benefits of these should not be underestimated: research by the Education Support charity showed that 78% of all education staff are stressed, and 36% of teachers have already experienced burnout. Anything that can meaningfully lower workload for teachers, whilst maintaining (or improving!) the high level of education they already provide, should be welcomed.
For AI assessment platforms to be adopted successfully, it is key that schools and teachers can act as critical consumers: able to ask challenging questions of any platform they are considering using, to ensure that it serves the best interests of both students and teachers.
This two-part blog series, by former teacher and Senior Professional Development Manager at The Assessment Network, James Beadle, suggests key thematic questions that teachers should ask of any AI platform they are considering using for assessment.
Is AI asking the questions you would want to ask?
When designing any assessment, there are two important concerns:
- Does it test everything we want to assess? (Failing to do so is often referred to as construct under-representation.)
- Is it testing anything we don’t want to assess? (This is often referred to as construct-irrelevant variance.)
If you were to write a mathematics test on fractions, what would you want to include?
You’d certainly want to feature all four operations (addition, multiplication, subtraction and division). Should there be negative numbers, and mixed numbers made up of both integers and proper fractions?
How about questions set in a real-world context? One of the key skills teachers possess is an understanding of the set of questions required to assess the extent to which a student understands a particular topic.
If you are now using an AI platform, do you know where the questions come from?
Are they written by a selection of subject experts, or by a generative AI? If they are written by an AI, what steps are in place to ensure an appropriate spread of topics and demand?
Does each student see the same set of questions? If not, can outcomes between different students still be meaningfully compared?
As a teacher, can you review the questions before they are given to students, to check that they are appropriate? Are there areas within your subject that are hard to assess on-screen, such as drawing diagrams? If so, are those skills simply missing from the assessment, or are they being assessed inappropriately?
Conversely, AI assessment platforms may also end up measuring things we do not necessarily want to measure. If the language used within questions is too complex or badly worded, students may struggle with them for reasons unrelated to the area we are trying to assess. A generative AI may set questions in inappropriate or unlikely contexts, which could be confusing for learners. Bias is frequently present in large language models and, if they are used for assessment, may lead to questions that disadvantage particular groups. If you are thinking of using assessment platforms that incorporate AI, it is worth establishing what datasets they have been built around, and considering what biases may be present within them. For example, a dataset built solely upon previously used assessments is likely to produce different, and possibly more valid, material than a dataset that draws on material from across the entirety of the internet.
Is the feedback effective and adaptive, particularly over time?
To quote Dylan Wiliam, “feedback is only successful if students use it to improve their performance”. This means that it is not enough for a platform to simply identify an answer as incorrect and then provide the correct answer as an example. Instead, it must be able to diagnose, and help students address, the gap in knowledge or skill, or the misconception, that led to the incorrect answer. Moreover, the feedback must be able to adapt over time. If a student continues to make the same mistake, repeatedly responding with the same set of feedback is not only likely to be ineffective, but also risks demotivating and demoralising the student. Instead, the platform must be able to adapt its feedback in a similar manner to a teacher. This could be achieved by meaningfully rephrasing, trying a different example, or demonstrating an alternative method. If this is not possible, teachers need to be made aware that a learner is struggling so that they can use the opportunity to intervene.
Feedback also needs to be focused and targeted. Whilst AI models may well be capable of highlighting and responding to every single error in a student’s essay, it is unlikely that this is the best way to help them improve. Indeed, too much feedback is likely to overwhelm and demoralise learners. Rather, in a similar manner to a teacher, platforms must be able to support learners in identifying and prioritising key initial areas for improvement.
Of course, feedback isn’t simply about identifying students’ mistakes and helping them avoid them in future. To support students’ metacognition, it is also important that we, or the assessment systems we use, highlight where, over time, learning has successfully occurred and a mistake is no longer being made. Expert teachers will often also use feedback to highlight positive traits that their learners have demonstrated whilst carrying out a particular task, such as perseverance or resilience, and it is important that opportunities for this are not lost in any movement towards AI-based assessment and feedback.
Finally, the effectiveness of any feedback ultimately depends on the student’s receptiveness to it. Students are likely to respond and engage better with feedback given by an AI if they have confidence that the judgements and processes carried out by any AI platform reflect those they expect their teachers to carry out. As the next question explores, this likely requires any system used to be explainable by teachers, to students.
Can you explain (and do you agree with) why the platform has given a particular set of feedback or a particular grade?
A key concept within AI is that of explainability. Essentially, can a human (without an in-depth knowledge of AI) understand how and why an AI model has made a particular decision, or given a particular output? In the context of education, the question needs to be asked: do you (and your students) understand why an AI model has given a particular set of feedback, or how a grade has been awarded?
If, for example, the feedback given by an AI model on a history essay is that ‘more analysis needs to be given’, can you explain to a student how that feedback was generated?
Has the AI ‘read’ the material and found it lacking in depth, or has it instead identified possible proxy features, such as a low word count, or an absence of certain key words? Think about how you, as a teacher, would explain and justify similar feedback - you might highlight particular paragraphs of concern, and then give an example to help explain. Is the AI model capable of giving (or supporting you to give) a similar explanation?
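To make the idea of a ‘proxy feature’ concrete, the fragment below is a purely illustrative sketch, not taken from any real platform: the keyword list, word-count threshold and function name are invented for the example. It shows how a system could generate plausible-sounding feedback from surface features alone.

```python
# Hypothetical illustration only: a crude "proxy feature" check of the kind an
# opaque marking system might rely on, rather than genuinely reading an essay.
# The keyword list and word-count threshold below are invented for this sketch.

ANALYSIS_KEYWORDS = {"because", "therefore", "consequently", "this suggests"}
MIN_WORDS = 400


def proxy_analysis_feedback(essay: str) -> str:
    """Return feedback based on surface features, not on understanding the essay."""
    text = essay.lower()
    word_count = len(text.split())
    keyword_hits = sum(1 for phrase in ANALYSIS_KEYWORDS if phrase in text)
    if word_count < MIN_WORDS or keyword_hits < 2:
        return "More analysis needs to be given"
    return "Analysis appears sufficient"


# A short essay with few 'analysis' phrases is flagged regardless of the quality
# of its actual argument, while a long essay peppered with the right phrases is
# not. Neither judgement reflects what a teacher means by 'analysis'.
print(proxy_analysis_feedback("The causes of the war were complex."))
```

A platform leaning on features like these can produce feedback that looks reasonable while measuring something quite different from what a teacher means, which is exactly why being able to explain how the feedback was generated matters.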
When AI platforms are used for grading, they can often produce outcomes, such as a rank order of students’ work, that appear to mirror those a teacher would give.
However, as Nick Raikes, the Director of Data Science at Cambridge University Press & Assessment highlights, the computer may have looked at quite different features, such as the grammatical quality of writing, which would not necessarily match the grading criteria used by teachers. This might lead to some students, whose work does not fit the ‘pattern’, receiving inaccurate grades.
Initial research does indicate that some AI models may be biased when assessing work written by non-native English speakers. If AI platforms mark to different criteria, then over time students may, consciously or unconsciously, seek to produce work that increasingly matches the AI’s marking criteria, for example by adopting a certain writing style or using certain key phrases that don’t necessarily reflect the knowledge and skills you would be looking for as a teacher.
This is part one of a two-part series on AI and E-Assessment within the classroom by James Beadle.
If you would like to further increase your knowledge of assessment and assessment-related concepts, such as validity and construct representation, then consider enrolling in our A101 course. If you would like to develop your tools in planning and designing assessments, then consider our A102 course.
The Assessment Network is part of Cambridge University Press & Assessment. We provide professional development for impactful assessment.