Tom Benton

I’m a statistician and psychometrician, having completed a BSc in Mathematics at the University of Durham followed by a PhD in Statistics at Imperial College London.

Since completing my studies I’ve worked exclusively on the analysis of educational datasets. For just over ten years I worked at the National Foundation for Educational Research (NFER). In 2012 I joined Cambridge University Press & Assessment and have continued my statistical work as part of their research team.

I’m interested in all applications of quantitative methods to educational data and, as such, my research covers a wide range of topics. Some recent examples include exploring statistical approaches to maintaining examination standards (including efficient ways of using comparative judgement to perform this task), understanding the reliability of marking and of different forms of assessment, and reviewing the relationships between performance in international and national assessments.

Publications

2024

How long should a high stakes test be?

Benton, T. (2024). How long should a high stakes test be? Research Matters: A Cambridge University Press & Assessment publication, 38, 28–47. https://doi.org/10.17863/CAM.111627

This article discusses one of the most obvious questions in assessment design: if a test has a high stakes purpose, how long should it be?

Firstly, we explore this question from a psychometric point of view starting from the (range of) minimum test reliability levels suggested in the academic literature. Then, by using published data on the typical relationship between the length, duration and reliability of exams, we develop a range of recommendations about the likely required duration of assessment.

Secondly, to force deeper reflection on the results from the psychometric approach, we also compare the actual lengths of exams in England to those in other education systems around the world. Such comparisons reveal very wide variations in the amount of time young people are required to spend taking exams in different countries and at various ages. This article concludes with some reflections on how the length of exams relates to the purpose of the assessment or to how its results will be used.
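
The psychometric strand of this argument rests on the classical relationship between test length and reliability. As a rough illustration of the kind of calculation involved (the reliability values and exam length below are assumptions for the example, not figures from the article), the Spearman-Brown formula can be rearranged to ask how much longer a test must be to reach a target reliability:

    def spearman_brown(r0: float, k: float) -> float:
        """Reliability of a test k times as long as one with reliability r0."""
        return k * r0 / (1 + (k - 1) * r0)

    def length_factor_for_target(r0: float, target: float) -> float:
        """Factor by which a test must be lengthened to reach `target` reliability."""
        return target * (1 - r0) / (r0 * (1 - target))

    # Illustrative: a 60-minute exam with reliability 0.80, aiming for 0.90.
    r0, target, minutes = 0.80, 0.90, 60
    k = length_factor_for_target(r0, target)
    print(f"Required length: {k:.2f}x, i.e. about {k * minutes:.0f} minutes")
    # -> Required length: 2.25x, i.e. about 135 minutes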

Research Matters 38: Autumn 2024
  • Foreword Tim Oates
  • Editorial Victoria Crisp
  • Troubleshooting in emergency education settings: What types of strategies did schools employ during the COVID-19 pandemic and what can they tell us about schools’ adaptability, values and crisis-readiness? Filio Constantinou
  • How long should a high stakes test be? Tom Benton
  • Core Maths: Who takes it, what do they take it with, and does it improve performance in other subjects? Tim Gill
  • Does typing or handwriting exam responses make any difference? Evidence from the literature Santi Lestari
  • Comparing music recordings using Pairwise Comparative Judgement: Exploring the judge experience Lucy Chambers, Emma Walland and Jo Ireland
  • Research News Lisa Bowett

2022

Which assessment is harder? Some limits of statistical linking.

Benton, T., & Williamson, J. (2022). Which assessment is harder? Some limits of statistical linking. Research Matters: A Cambridge University Press & Assessment publication, 34, 26–41.

Equating methods are designed to adjust between alternate versions of assessments targeting the same content at the same level, with the aim that scores from the different versions can be used interchangeably. The statistical processes used in equating have, however, been extended to statistically “link” assessments that differ, such as assessments of the same qualification type that assess different subjects. Despite careful debate on statistical linking in the literature, it can be tempting to apply equating methods and conclude that they have provided a definitive answer on whether a qualification is harder or easier than others.

This article offers a novel demonstration of some limits of statistical equating by exploring how accurately various equating methods were able to equate between identical assessments. To do this, we made use of pairs of live assessments that are “cover sheet” versions of each other, that is, identical assessments with different assessment codes. The results showed that equating errors with real-world impact (e.g., an increase of 5–10 per cent in the proportion of students achieving a grade A) occurred even where equating conditions were apparently favourable. No single method consistently produced more accurate results than the others.

The results emphasise the importance of considering multiple sources of information to make final grade boundary decisions. More broadly, the results are a reminder that if applied uncritically, equating methods can lead to incorrect conclusions about the relative difficulty of assessments.
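
For readers unfamiliar with the methods being stress-tested here, the sketch below illustrates the core idea of one common family, equipercentile equating, on invented score data. It is a bare-bones illustration rather than the operational implementations evaluated in the article, which would smooth the score distributions first.

    import numpy as np

    def equipercentile_equate(scores_x, scores_y, x_points):
        """Map form X scores onto the form Y scale by matching percentile ranks
        (unsmoothed; operational versions smooth the distributions first)."""
        sx, sy = np.sort(scores_x), np.sort(scores_y)
        # Percentile rank of each x point within the X distribution...
        ranks = np.searchsorted(sx, x_points, side="right") / len(sx)
        # ...mapped to the Y score with the same percentile rank.
        return np.quantile(sy, np.clip(ranks, 0, 1))

    rng = np.random.default_rng(0)
    form_x = np.round(rng.normal(50, 10, 2000))  # invented score distributions
    form_y = np.round(rng.normal(53, 10, 2000))
    print(equipercentile_equate(form_x, form_y, [40, 50, 60]))  # roughly [43, 53, 63]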

Research Matters 34: Autumn 2022
  • Foreword Tim Oates
  • Editorial Tom Bramley
  • Learning loss in the Covid-19 pandemic: teachers’ views on the nature and extent of loss Matthew Carroll, Filio Constantinou
  • Which assessment is harder? Some limits of statistical linking Tom Benton, Joanna Williamson
  • Progress in the first year at school Chris Jellis
  • What are "recovery curricula" and what do they include? A literature review Martin Johnson
  • What's in a name? Are surnames derived from trades and occupations associated with lower GCSE scores? Joanna Williamson, Tom Bramley
  • Research News Lisa Bowett
Research Matters 33: Spring 2022
  • Foreword Tim Oates
  • Editorial Tom Bramley
  • A summary of OCR’s pilots of the use of Comparative Judgement in setting grade boundaries Tom Benton, Tim Gill, Sarah Hughes, Tony Leech
  • How do judges in Comparative Judgement exercises make their judgements? Tony Leech, Lucy Chambers
  • Judges' views on pairwise Comparative Judgement and Rank Ordering as alternatives to analytical essay marking Emma Walland
  • The concurrent validity of Comparative Judgement outcomes compared with marks Tim Gill
  • How are standard-maintaining activities based on Comparative Judgement affected by mismarking in the script evidence? Joanna Williamson
  • Moderation of non-exam assessments: is Comparative Judgement a practical alternative? Carmen Vidal Rodeiro, Lucy Chambers
  • Research News Lisa Bowett
A summary of OCR’s pilots of the use of Comparative Judgement in setting grade boundaries

Benton, T., Gill, T., Hughes, S., & Leech, T. (2022). A summary of OCR’s pilots of the use of Comparative Judgement in setting grade boundaries. Research Matters: A Cambridge University Press & Assessment publication, 33, 10–30.

The rationale for the use of comparative judgement (CJ) to help set grade boundaries is to provide a way of using expert judgement to identify and uphold certain minimum standards of performance rather than relying purely on statistical approaches such as comparable outcomes. This article summarises the results of recent trials of using CJ for this purpose in terms of how much difference it might have made to the positions of grade boundaries, the reported precision of estimates and the amount of time that was required from expert judges.

The results show that estimated grade boundaries from a CJ approach tend to be fairly close to those that were set (using other forms of evidence) in practice. However, occasionally, CJ results displayed small but significant differences from existing boundary locations. This implies that adopting a CJ approach to awarding would have a noticeable impact on awarding decisions, but not such a large one as to be implausible. This article also demonstrates that implementing CJ using simplified methods (described by Benton, Cunningham et al., 2020) achieves the same precision as alternative CJ approaches, but in less time. On average, each CJ exercise required roughly 30 judge-hours across all judges.
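
Pilots like these are typically analysed by fitting a Bradley-Terry model to the pairwise judgements, placing all scripts on a common quality scale. The sketch below shows one standard way of fitting such a model, recast as logistic regression; the judgement data are invented and the mirrored-rows encoding is a convenience of this sketch, not a detail of the pilots.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_bradley_terry(judgements, n_scripts):
        """Bradley-Terry quality estimates from pairwise judgements, cast as
        logistic regression on +1/-1 indicator rows with no intercept."""
        rows, wins = [], []
        for winner, loser in judgements:
            x = np.zeros(n_scripts)
            x[winner], x[loser] = 1.0, -1.0
            rows.append(x);  wins.append(1)
            rows.append(-x); wins.append(0)  # mirrored row so both classes appear
        # sklearn's default ridge penalty keeps estimates finite for unbeaten scripts.
        model = LogisticRegression(fit_intercept=False)
        model.fit(np.array(rows), np.array(wins))
        return model.coef_.ravel()  # one quality estimate per script

    # Invented judgements among four scripts, each recorded as (winner, loser).
    judgements = [(0, 1), (0, 2), (1, 2), (3, 0), (3, 1), (3, 2), (2, 1)]
    print(fit_bradley_terry(judgements, n_scripts=4))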

2021

Evaluating the simplified pairs method of standard maintaining using comparative judgement
Benton, T. & Gill, T. (2021, November 3). Evaluating the simplified pairs method of standard maintaining using comparative judgement. Presentation at AEA Europe Conference 2021, online.
Item response theory, computer adaptive testing and the risk of self-deception

Benton, T. (2021). Item response theory, computer adaptive testing and the risk of self-deception. Research Matters: A Cambridge University Press & Assessment publication, 32, 82-100.

Computer adaptive testing is intended to make assessment more reliable by tailoring the difficulty of the questions a student has to answer to their level of ability. Most commonly, this benefit is used to justify shortening tests whilst retaining the reliability of a longer, non-adaptive test.

Improvements due to adaptive testing are often estimated using reliability coefficients based on item response theory (IRT). However, these coefficients assume that the underlying IRT model completely fits the data. This article takes a different approach, based on comparing the predictive value of shortened versions of real assessments constructed using adaptive and non-adaptive approaches. The results show that, when explored in this way, the benefits from adaptive testing may not always be quite as large as hoped.
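
The contrast drawn here can be mimicked in a toy simulation: generate Rasch response data, score both a fixed short form and a crudely adaptive one, and compare how well each predicts the full-test score. Everything below (the Rasch setup, the stochastic-approximation update, the item counts) is an assumption made for illustration, not the article's analysis of real assessments.

    import numpy as np

    rng = np.random.default_rng(3)
    n_items, n_people, short_len = 60, 1000, 10
    theta = rng.normal(0, 1, n_people)          # simulated abilities
    b = rng.normal(0, 1, n_items)               # Rasch item difficulties
    p_correct = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
    resp = (rng.random((n_people, n_items)) < p_correct).astype(int)

    def adaptive_estimates(responses, difficulties, length):
        """Crude Rasch CAT: repeatedly administer the unused item whose
        difficulty is closest to the running ability estimate (where Rasch
        item information peaks), updating the estimate after each answer."""
        estimates = np.zeros(len(responses))
        for person, r in enumerate(responses):
            est, unused = 0.0, set(range(len(difficulties)))
            for _ in range(length):
                item = min(unused, key=lambda i: abs(difficulties[i] - est))
                unused.discard(item)
                expected = 1 / (1 + np.exp(-(est - difficulties[item])))
                est += 0.5 * (r[item] - expected)  # stochastic-approximation step
            estimates[person] = est
        return estimates

    full = resp.sum(axis=1)                     # criterion: score on all 60 items
    fixed = resp[:, :short_len].sum(axis=1)     # non-adaptive 10-item short form
    cat = adaptive_estimates(resp, b, short_len)
    print(np.corrcoef(fixed, full)[0, 1], np.corrcoef(cat, full)[0, 1])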

Research Matters 32: Autumn 2021
  • Foreword Tim Oates
  • Editorial Tom Bramley
  • Learning during lockdown: How socially interactive were secondary school students in England? Joanna Williamson, Irenka Suto, John Little, Chris Jellis, Matthew Carroll
  • How well do we understand wellbeing? Teachers’ experiences in an extraordinary educational era Chris Jellis, Joanna Williamson, Irenka Suto
  • What do we mean by question paper error? An analysis of criteria and working definitions Nicky Rushton, Sylvia Vitello, Irenka Suto
  • Item response theory, computer adaptive testing and the risk of self-deception Tom Benton
  • Research News Anouk Peigne
On using generosity to combat unreliability

Benton, T. (2021). On using generosity to combat unreliability. Research Matters: A Cambridge Assessment publication, 31, 22-41.

Assessment reliability can be affected by various types of unforeseen events. Whenever a concern is raised that the reliability of assessment is lower than usual, our natural inclination is to allow extra leniency in grading to reduce the chances of students missing out on a grade they deserve. This article shows how, by focusing on the risk for individual students, we might logically decide exactly how much additional generosity is required. In particular, it shows how making progress with this problem requires an acceptance that no assessment system is perfect, and transparency about the level of reliability that is achievable. Having developed an approach, this article also shows how it may lead to different outcomes from those implied by the competing desire to maintain assessment standards, so that the group of students in question is not unfairly advantaged relative to previous and future cohorts.
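
The style of reasoning described can be illustrated with a classical true-score calculation: given an assumed reliability, how far must a grade boundary be lowered to bring the risk of a deserving candidate missing the grade down to an acceptable level? All numbers below are invented for the example.

    from math import sqrt
    from statistics import NormalDist

    def miss_probability(true_score, boundary, sd, reliability, leniency=0.0):
        """P(observed score falls below a boundary lowered by `leniency`) for a
        candidate with the given true score, under a classical true-score model
        with normally distributed measurement error."""
        sem = sd * sqrt(1 - reliability)  # standard error of measurement
        return NormalDist(true_score, sem).cdf(boundary - leniency)

    # Illustrative: true score exactly on the boundary, SD 15, reliability 0.90.
    for leniency in (0, 2, 4, 6):
        p = miss_probability(70, 70, sd=15, reliability=0.90, leniency=leniency)
        print(f"leniency {leniency}: P(miss) = {p:.2f}")
    # leniency 0 gives P(miss) = 0.50; each extra lenient mark lowers the risk.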

Research Matters 31: Spring 2021
  • Foreword Tim Oates, CBE
  • Editorial Tom Bramley
  • Attitudes to fair assessment in the light of COVID-19 Stuart Shaw, Isabel Nisbet
  • On using generosity to combat unreliability Tom Benton
  • A guide to what happened with Vocational and Technical Qualifications in summer 2020 Sarah Mattey
  • Early policy response to COVID-19 in education—A comparative case study of the UK countries Melissa Mouthaan, Martin Johnson, Jackie Greatorex, Tori Coleman, Sinead Fitzsimons
  • Generation Covid and the impact of lockdown Gill Elliott
  • Disruption to school examinations in our past Gillian Cooke, Gill Elliott
  • Research News Anouk Peigne

2020

Does comparative judgement of scripts provide an effective means of maintaining standards in mathematics?
Benton, T., Hughes, S., and Leech, T. (2020). Does comparative judgement of scripts provide an effective means of maintaining standards in mathematics? Cambridge Assessment Research Report. Cambridge, UK: Cambridge Assessment.
Comparing the simplified pairs method of standard maintaining to statistical equating
Benton, T., Cunningham, E., Hughes, S., and Leech, T. (2020). Comparing the simplified pairs method of standard maintaining to statistical equating. Cambridge Assessment Research Report. Cambridge, UK: Cambridge Assessment.
The usefulness of detailed marks within the levels of levels-based mark schemes
Macinska, S. and Benton, T. (2020) The usefulness of detailed marks within the levels of levels-based mark schemes. Cambridge Assessment Research Report. Cambridge, UK: Cambridge Assessment.
Research Matters 29: Spring 2020
  • Foreword Tim Oates, CBE
  • Editorial Tom Bramley
  • Accessibility in GCSE Science exams – Students' perspectives Victoria Crisp and Sylwia Macinska
  • Using corpus linguistic tools to identify instances of low linguistic accessibility in tests David Beauchamp, Filio Constantinou
  • A framework for describing comparability between alternative assessments Stuart Shaw, Victoria Crisp, Sarah Hughes
  • Comparing small-sample equating with Angoff judgement for linking cut-scores on two tests Tom Bramley
  • How useful is comparative judgement of item difficulty for standard maintaining? Tom Benton
  • Research News Anouk Peigne
How useful is comparative judgement of item difficulty for standard maintaining?

Benton, T. (2020). How useful is comparative judgement of item difficulty for standard maintaining? Research Matters: A Cambridge Assessment publication, 29, 27-35.

This article reviews the evidence on the extent to which experts’ perceptions of item difficulties, captured using comparative judgement, can predict empirical item difficulties. This evidence is drawn from existing published studies on this topic and also from statistical analysis of data held by Cambridge Assessment. Having reviewed the evidence, the article then proposes a simple mechanism by which such judgements can be used to equate different tests, and evaluates the likely accuracy of the method. 
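
One plausible way to operationalise such a mechanism (a sketch under assumptions of my own, not necessarily the article's exact method) is to treat the judged difficulties as if they were Rasch item difficulties, compute each test's expected score curve, and read equivalent scores off the two curves:

    import numpy as np

    def expected_score_curve(judged_difficulty, abilities):
        """Expected total score at each ability level, treating judged
        difficulties as if they were Rasch item difficulties."""
        p = 1 / (1 + np.exp(-(abilities[:, None] - judged_difficulty[None, :])))
        return p.sum(axis=1)

    # Invented judged difficulties for two test versions on a common CJ scale.
    test_a = np.array([-1.0, -0.3, 0.2, 0.8, 1.5])
    test_b = np.array([-0.8, -0.1, 0.5, 1.1, 1.9])
    abilities = np.linspace(-3, 3, 61)
    curve_a = expected_score_curve(test_a, abilities)
    curve_b = expected_score_curve(test_b, abilities)
    # A mark of 3 on test A maps to the test B score expected at the same ability.
    print(np.interp(3.0, curve_a, curve_b))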

2019

Research Matters 28: Autumn 2019
  • Foreword Tim Oates, CBE
  • Editorial Tom Bramley
  • Which is better: one experienced marker or many inexperienced markers? Tom Benton
  • "Learning progressions": A historical and theoretical discussion Tom Gallacher, Martin Johnson
  • The impact of A Level subject choice and students' background characteristics on Higher Education participation Carmen Vidal Rodeiro
  • Studying English and Mathematics at Level 2 post-16: issues and challenges Jo Ireland
  • Methods used by teachers to predict final A Level grades for their students Tim Gill
  • Research News David Beauchamp
Which is better: one experienced marker or many inexperienced markers?

Benton, T. (2019). Which is better: one experienced marker or many inexperienced markers? Research Matters: A Cambridge Assessment publication, 28, 2-10.

For many practical purposes, it is often assumed that the quality of a marker is directly related to their seniority. At its extreme, the assumption is that the most senior marker (the Principal Examiner) is always right, even in cases where large numbers of junior markers have a collectively different opinion about the mark that should be awarded to a given script. To investigate this assumption, this article compares the predictive value of marks provided by the most senior marker with those obtained by simply taking the mean mark awarded by many junior markers. Predictive value was estimated via the correlation of the scores assigned to scripts with the overall achievement of the same candidates in other exam papers taken within the same month. By looking at the relative predictive value of the two sources of marks, we can begin to make some inferences about the extent to which senior markers are genuinely more accurate than their junior colleagues.
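
A standard psychometric result makes this trade-off concrete: if each junior marker's scores correlate r with the external criterion, and markers intercorrelate rho with one another, the mean of k markers has a predictably higher criterion correlation. The sketch below applies that formula; the values 0.60 and 0.70 are illustrative assumptions, not estimates from the article.

    from math import sqrt

    def validity_of_mean(r_single, rho, k):
        """Criterion correlation for the mean of k markers, each correlating
        r_single with the criterion and rho with one another (standard result
        for equally reliable raters)."""
        return r_single * sqrt(k) / sqrt(1 + (k - 1) * rho)

    # Illustrative: each junior marker correlates 0.60 with other-paper
    # achievement, and 0.70 with other markers (assumed values).
    for k in (1, 2, 4, 8):
        print(k, round(validity_of_mean(0.60, 0.70, k), 3))
    # 1 0.6, 2 0.651, 4 0.682, 8 0.699 - the gains flatten as k grows.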

The Effect of Using Principal Components to Create Plausible Values
Benton, T. (2019). The Effect of Using Principal Components to Create Plausible Values. In: Wiberg, M., Culpepper, S., Janssen, R., González, J., & Molenaar, D. (Eds.). Quantitative Psychology: The 83rd Annual Meeting of the Psychometric Society, New York, NY, 2018. Springer.

2018

The link between subject choices and achievement at GCSE and performance in PISA 2015: Executive summary
Carroll, M. and Benton, T. (2018). Cambridge Assessment Research Report. Cambridge, UK: Cambridge Assessment.
The link between subject choices and achievement at GCSE and performance in PISA 2015
Carroll, M. and Benton, T. (2018). Cambridge Assessment Research Report. Cambridge, UK: Cambridge Assessment.
Exploring the relationship between optimal methods of item scoring and selection and predictive validity
Benton, T. (2018). Exploring the relationship between optimal methods of item scoring and selection and predictive validity. Presented at the 19th annual AEA-Europe conference, Arnhem/Nijmegen, The Netherlands, 7-10 November 2018.
Is comparative judgement just a quick form of multiple marking?

Benton, T. and Gallacher, T. (2018). Is comparative judgement just a quick form of multiple marking? Research Matters: A Cambridge Assessment publication, 26, 22-28.

This article describes analysis of GCSE English essays that have been both scored using comparative judgement and marked multiple times. The different methods of scoring are compared in terms of the accuracy with which the resulting scores can predict achievement on a separate set of assessments. These results show that the predictive value of marking increases if multiple marking is used and (perhaps more interestingly) if statistical scaling is applied to the marks. More importantly, the evidence in this article suggests that any advantage of comparative judgement over traditional marking can be explained by the number of judgements that are made for each essay and by the use of a complex statistical model to combine these. In other words, it is the quantity of data that is collected about each essay, and how this data is analysed, that is important. The physical act of placing two essays next to each other and deciding which is better does not appear to produce judgements that are in themselves any more valid than those obtained by getting the same individual to simply mark a set of essays.

How many students will get straight grade 9s in reformed GCSEs?

Benton, T. (2018). How many students will get straight grade 9s in reformed GCSEs? Research Matters: A Cambridge Assessment publication, 25, 28-36.

This article describes an attempt to predict the number of students that will achieve straight grade 9s in reformed GCSEs. The prediction is based upon an analysis of unreformed GCSE and international GCSEs from the cohort completing Key Stage 4 in 2016. Several methods are applied and evaluated against the number of candidates known to have achieved straight grade 9s in the three subjects that were reformed before examination in summer 2017. The results suggest that, of those candidates taking at least 8 GCSEs, between 200 and 900 will achieve straight grade 9s. Furthermore, we predict that more than 2,000 students will achieve a perfect score in their Attainment 8 accountability measure.

2017

Pooling the totality of our data resources to maintain standards in the face of changing cohorts
Benton, T. (2017). Presented at the 18th annual AEA Europe conference, Prague, 9-11 November 2017.
Comparing small-sample equating with Angoff judgment for linking cut-scores on two tests
Bramley, T. and Benton, T. (2017). Presented at the 18th annual AEA Europe conference, Prague, 9-11 November 2017.
How much do I need to write to get top marks?

Benton, T. (2017). How much do I need to write to get top marks? Research Matters: A Cambridge Assessment publication, 24, 37-40.

This article looks at the relationship between how much candidates write and the grade they are awarded in an English Literature GCSE examination. Although such analyses are common within computer-based testing, far less has been written about this relationship for traditional exams taken with pen and paper. This article briefly describes how we estimated word counts based on images of exam scripts, validates the method against a short answer question from a Biology examination, and then uses the method to examine how the length of candidates’ English Literature essays relates to the grade they were awarded. It shows that candidates awarded a grade A* wrote around 700 words on average in a 45-minute exam, an average rate of 15 words per minute. In contrast, grade E candidates produced around 450 words, an average rate of 10 words per minute. Whilst it cannot be emphasised strongly enough that performance in GCSEs is judged by what students write and not how much, the results of this research may help students facing examinations have a reasonable idea of the kind of length that is generally expected.

Some thoughts on the ‘Comparative Progression Analysis’ method for investigating inter-subject comparability
Benton, T. and Bramley, T. (2017). Cambridge Assessment Research Report. Cambridge, UK: Cambridge Assessment.
Can AI learn to equate?
Benton, T. (2017). Presented at the International Meeting of the Psychometric Society, Zurich, Switzerland, 17-21 July 2017.
The clue in the dot of the ‘i’: Experiments in quick methods for verifying identity via handwriting

Benton, T. (2017).  The clue in the dot of the ‘i’: Experiments in quick methods for verifying identity via handwriting. Research Matters: A Cambridge Assessment publication, 23, 10-16.

This article demonstrates some simple and quick techniques for comparing the style of handwriting between two exams. This could potentially be a useful way of checking that the same person has taken all of the different components leading to a qualification, and form one part of the effort to ensure qualifications are only awarded to candidates who have personally completed the necessary assessments. The advantage of this form of identity checking is that it is based upon data (in the form of images) that is already routinely stored as part of the process of on-screen marking. This article shows that some simple metrics can quickly identify candidates whose handwriting shows a suspicious degree of change between occasions. However, close scrutiny of some of these scripts provides reasons for caution in assuming that all cases of changing handwriting represent the presence of imposters. Some cases of apparently different handwriting also include aspects that indicate they may come from the same author. In other cases, the style of handwriting may change even within the same examination response.

Volatility happens: Understanding variation in schools’ GCSE results
Crawford, C. and Benton, T. (2017).  Cambridge Assessment Research Report.  Cambridge, UK: Cambridge Assessment.

2016

Evidence for the reliability of coursework

Benton, T. (2016). Paper presented at the AEA-Europe annual conference, Limassol, Cyprus, 3-5 November 2016

Revisiting the topics taught as part of an OCR History qualification

Dunn, K., Darlington, E. and Benton, T. (2016). Revisiting the topics taught as part of an OCR History qualification. Research Matters: A Cambridge Assessment publication, 22, 2-8.

Given the introduction of a broader range of options in OCR’s new A level History specification, this article follows on from a previous analysis of A level History options based on the previous specification for OCR History (Specification A). That research relied on OCR History centres responding to requests for participation in an online survey. However, OCR’s introduction of an online ‘specification creator’ tool for centres has provided quantitative information about the topics which schools intend to teach their students as part of their A level. As with the previous study, we sought to establish the common topic choices and combinations.

On the impact of aligning the difficulty of GCSE subjects on aggregated measures of pupil and school performance

Benton, T. (2016). On the impact of aligning the difficulty of GCSE subjects on aggregated measures of pupil and school performance. Research Matters: A Cambridge Assessment publication, 22, 27-30.

It is empirically demonstrated that adjusting aggregated measures of either student or school performance to account for the relative difficulty of General Certificate of Secondary Education (GCSE) subjects makes essentially no difference. For either students or schools, the correlation between unadjusted and adjusted measures of performance exceeds 0.998. This indicates that suggested variations in the difficulty of different GCSE subjects do not cause any serious problems either for school accountability, or for summarising the achievement of students at GCSE.

A possible formula to determine the percentage of candidates who should receive the new GCSE grade 9 in each subject
Benton, T. (2016). Cambridge Assessment Research Report. Cambridge, UK: Cambridge Assessment.

2015

The Importance of Teaching Styles and Curriculum in Mathematics: Analysis of TIMSS 2011
Zanini, N. and Benton, T. (2015) Paper presented at the European Conference on Educational Research (ECER), Budapest, Hungary, 8-11 September 2015
How statistics determine examination results in England
Benton, T. (2015) Paper presented at the Royal Statistical Society Annual Conference, Exeter, 7-10 September 2015
A level reform: implications for subject uptake
Sutch, T., Zanini, N. and Benton, T. (2015).  Cambridge Assessment Research Report.  Cambridge, UK: Cambridge Assessment.
The accuracy of forecast grades for OCR GCSEs in June 2014
Gill, T. & Benton, T. (2015) Statistics Report Series No. 91
The accuracy of forecast grades for OCR A levels in June 2014
Gill, T. & Benton, T. (2015) Statistics Report Series No. 90
The roles of teaching styles and curriculum in Mathematics achievement: Analysis of TIMSS 2011

Zanini, N. and Benton, T. (2015). The roles of teaching styles and curriculum in Mathematics achievement: Analysis of TIMSS 2011. Research Matters: A Cambridge Assessment publication, 20, 35-44.

This article provides empirical evidence about the link between Mathematics achievement, curriculum, teaching methods and resources used in the classroom. More specifically, this research explores common teaching styles and topics taught across countries with respect to their Mathematics achievement. In order to do so, we make use of the fifth TIMSS survey, which provides a rich set of information regarding aspects of the curriculum (e.g., the emphasis on problem solving and interpreting data sets), resources used by teachers in the classroom (e.g., calculators and textbooks) and teaching styles (e.g., how often students are asked to take written tests, or to work out problems individually rather than with teachers' guidance), along with measures of achievement in Mathematics gathered in 2011. Although TIMSS is administered to students and their teachers in both Grades 4 and 8 (Years 5 and 9 respectively, within England), analysis in this research is restricted to the Grade 8 students (aged 14). Analysing data aggregated at jurisdiction level also allows us to relate these factors to the Mathematics achievement of the same cohort at age 15, as measured by PISA 2012.

Can we monitor standards over time in one subject using data from another? Why a high correlation isn’t enough
Benton, T. (2015). Cambridge Assessment Research Report. Cambridge, UK: Cambridge Assessment.
Volatility in exam results
Bramley, T. and Benton, T. (2015) Cambridge Assessment Research Report. Cambridge, UK: Cambridge Assessment.
The reliability of setting grade boundaries using comparative judgement
Benton, T. and Elliott, G. (2015). The reliability of setting grade boundaries using comparative judgement. Research Papers in Education, 31(3), 352-376.
Examining the impact of moving to on-screen marking on concurrent validity
Benton, T. (2015). Cambridge Assessment Research Report. Cambridge, UK: Cambridge Assessment.
Examining the impact of moving to on-screen marking on the stability of centres’ results
Benton, T. (2015). Cambridge Assessment Research Report. Cambridge, UK: Cambridge Assessment.
An experimental investigation of the effects of mark scheme features on marking reliability
Child, S., Munro, J. and Benton, T. (2015). Cambridge Assessment Research Report. Cambridge, UK: Cambridge Assessment.
The use of evidence in setting and maintaining standards in GCSEs and A levels

Benton, T. and Bramley, T. (2015). Cambridge Assessment Research Report. Cambridge, UK: Cambridge Assessment.

2014

Examining the impact of entry level qualifications on educational aspirations
Benton, T. (2014). Examining the impact of entry level qualifications on educational aspirations. Educational Research, 56(3), 259-276.
The relationship between time in education and achievement in PISA in England
Benton, T. (2014) Paper presented at the British Educational Research Association (BERA) conference, London, 23-25 September 2014
Should we age-standardise GCSEs?
Benton, T. (2014). Cambridge Assessment Research Report. Cambridge, UK: Cambridge Assessment.
Calculating the reliability of complex qualifications

Benton, T. (2014). Calculating the reliability of complex qualifications.  Research Matters: A Cambridge Assessment publication, 18, 48-52.

Most traditional methods of calculating reliability cannot be applied to complex qualifications that can be completed through multiple routes. For example, for some Maths A levels, candidates can choose when they take the various exam papers that are required and, in addition, can choose the optional units in which they wish to complete exams. This article demonstrates a method by which reliability can be calculated in these instances by applying an optimal method of split halves to each individual assessment that may contribute to the qualification. All of these split halves can then be combined to create "half qualifications" for each candidate, regardless of the route they have taken. Once this has been achieved, traditional methods of calculating reliability can be applied. A full example from a Maths A level is provided.
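
The final step, stepping the correlation between two half-qualification scores up to full length, is the familiar Spearman-Brown correction. A minimal sketch on simulated half-scores (the simulation itself is an assumption for illustration):

    import numpy as np

    def split_half_reliability(half_a, half_b):
        """Spearman-Brown corrected split-half reliability: correlate the two
        half-qualification scores, then step up to full length."""
        r = np.corrcoef(half_a, half_b)[0, 1]
        return 2 * r / (1 + r)

    # Simulated half-qualification totals for 500 candidates: a common "true
    # ability" plus independent noise for each half.
    rng = np.random.default_rng(1)
    ability = rng.normal(0, 1, 500)
    half_a = ability + rng.normal(0, 0.5, 500)
    half_b = ability + rng.normal(0, 0.5, 500)
    print(round(split_half_reliability(half_a, half_b), 3))  # roughly 0.89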

Using meta-regression to explore moderating effects in surveys of international achievement
Benton, T. (2014). Using meta-regression to explore moderating effects in surveys of international achievement. Practical Assessment, Research & Evaluation, 19(3).
Analysis of the use of Key Stage 2 data in GCSE predictions
Benton, T. and Sutch T. (2014). Analysis of the use of Key Stage 2 data in GCSE predictions. Ofqual, Ofqual/14/5471, Coventry.
Examining the impact of tiered examinations on the aspirations of young people

Benton, T. (2014). Examining the impact of tiered examinations on the aspirations of young people. Research Matters: A Cambridge Assessment publication, 17, 42-46.

Tiered examinations are commonly employed within GCSE examinations in the UK. They are intended to ensure that the difficulties of exam papers are correctly tailored to the ability of the candidates taking them; this should ensure more accurate measurements and also a better experience for candidates as they do not spend time addressing questions that are either too easy or too difficult given their level of skill. However, tiered examinations have also been criticised for potentially damaging the aspirations of young people entered for lower tier examinations by placing a limit on the grades they can achieve. This article explores the extent of the link between GCSE entry tier and aspirations and also investigates the extent to which this link can be explained by differences in achievement and background characteristics of pupils.

The research makes use of data available from the Longitudinal Study of Young People in England (LSYPE) linked to information available from the National Pupil Database (NPD) regarding the qualifications achieved by pupils and also their entry tier at GCSE. Analysis was completed using a combination of multilevel modelling and propensity score matching and showed that differences in aspirations between pupils entering different tiers can almost entirely be explained by differences in background characteristics.
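
As a rough illustration of the matching step (invented data and a simple nearest-neighbour design of my own choosing, not the study's exact specification):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import NearestNeighbors

    def matched_difference(X, treated, outcome):
        """Nearest-neighbour matching on the estimated propensity score:
        pair each treated pupil with the most similar untreated pupil
        and compare mean outcomes across the matched pairs."""
        ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]
        t_idx, c_idx = np.flatnonzero(treated == 1), np.flatnonzero(treated == 0)
        nn = NearestNeighbors(n_neighbors=1).fit(ps[c_idx].reshape(-1, 1))
        _, match = nn.kneighbors(ps[t_idx].reshape(-1, 1))
        return outcome[t_idx].mean() - outcome[c_idx][match.ravel()].mean()

    rng = np.random.default_rng(2)
    X = rng.normal(0, 1, (400, 3))                    # background characteristics
    treated = (X[:, 0] + rng.normal(0, 1, 400) > 0).astype(int)
    outcome = 0.5 * X[:, 0] + rng.normal(0, 1, 400)   # no true "tier" effect
    print(round(matched_difference(X, treated, outcome), 2))  # should sit near 0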

2013

Examining the impact of tiered examinations on the aspirations of young people
Benton, T. (2013) Paper presented at British Educational Research Association (BERA) conference, Brighton, 3-5 September 2013
An empirical assessment of Guttman's Lambda 4 reliability coefficient
Benton, T. (2013) Paper presented at the 78th Annual Meeting of the Psychometric Society, Arnhem, The Netherlands, 22-26 July 2013

Exploring the value of GCSE prediction matrices based upon attainment at Key Stage 2
Benton, T. and Sutch, T. (2013). Cambridge Assessment Research Report. Cambridge, UK: Cambridge Assessment.
Investigating the relationship between aspects of countries’ assessment systems and achievement on the Programme for International Student Assessment (PISA) tests
Gill, T., and Benton, T. (2013). Cambridge Assessment Research Report. Cambridge, UK: Cambridge Assessment.
Formalising and evaluating the benchmark centres methodology for setting GCSE standards
Benton, T. (2013). Cambridge Assessment Research Report. Cambridge, UK: Cambridge Assessment.

2012

Calculating the number of marks needed in a subtest to make reporting subscores worthwhile
Benton, T. (2012). Cambridge Assessment Research Report. Cambridge, UK: Cambridge Assessment.

Research Matters

Research Matters is our free biannual publication which allows us to share our assessment research, in a range of fields, with the wider assessment community.