Revisiting Automated Essay Scoring via the GPT Artificial Intelligence Chatbot: A Mixed Methods Study

Eli Fianu, Stephen Boateng, Zelda Arku
Copyright: © 2024 | Pages: 20
DOI: 10.4018/979-8-3693-1310-7.ch008

Abstract

The study sought to statistically compare separate sets of scores of graded essays generated from an automated essay scoring (AES) system (ChatGPT) and a human grader, and to further engage stakeholders (students, lecturers, and university management) in a discussion of the results of the analysis from the perspective of fairness, bias, consistency with human grading, ethical issues, and adoption. The study adopted a sequential explanatory mixed methods design. The quantitative approach involved the collection and analysis of essay scores, while the qualitative approach involved the use of interviews to ascertain stakeholder opinions of the quantitative results. The results of the quantitative study showed that the distribution of ChatGPT scores is the same across categories of age, gender, and ethnicity. Also, there was no statistically significant difference between ChatGPT scores and the scores of the human grader. The analysis of the responses from the interviews is discussed in detail.
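The preview does not reproduce the chapter's statistical procedure. As a rough illustration only, a comparison of this kind could be run with a Kruskal-Wallis test (ChatGPT score distributions across demographic groups) and a Wilcoxon signed-rank test (ChatGPT versus human scores on the same essays); the tests, file name, and column names below are assumptions for illustration, not the authors' reported method.

```python
# Illustrative sketch only -- the tests and column names are assumptions,
# not the procedure reported in the chapter.
import pandas as pd
from scipy import stats

scores = pd.read_csv("essay_scores.csv")  # hypothetical file with one row per essay

# Are ChatGPT score distributions the same across demographic groups?
for group_col in ["age_group", "gender", "ethnicity"]:
    groups = [g["chatgpt_score"].values for _, g in scores.groupby(group_col)]
    h_stat, p_value = stats.kruskal(*groups)
    print(f"{group_col}: H={h_stat:.3f}, p={p_value:.3f}")

# Do ChatGPT scores differ from the human grader's scores on the same essays?
w_stat, p_value = stats.wilcoxon(scores["chatgpt_score"], scores["human_score"])
print(f"ChatGPT vs human grader: W={w_stat:.3f}, p={p_value:.3f}")
```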
Chapter Preview

Introduction

An automated grading system is a computer program or software designed to automatically evaluate and grade student work, such as assignments, essays, or exams, without human intervention (Ramesh & Sanampudi, 2022). These systems use algorithms and machine learning techniques to analyze and score student responses based on pre-determined criteria and models. Automated grading systems are used in various educational settings, including K-12 schools, colleges, and universities, to provide quick and consistent feedback to students and reduce the workload of instructors (Ramesh & Sanampudi, 2022).
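As a loose illustration of how an LLM-based grader such as ChatGPT might be asked to apply pre-determined criteria, the sketch below sends an essay and a rubric to the OpenAI chat API and asks for a numeric score. The prompt wording, rubric, and model name are assumptions for illustration; the chapter does not document its exact prompting setup.

```python
# Illustrative sketch, not the chapter's actual grading setup.
# Assumes the OpenAI Python client and an API key in the environment.
from openai import OpenAI

client = OpenAI()

RUBRIC = "Score the essay from 0 to 20 for thesis clarity, organisation, evidence, and grammar."

def grade_essay(essay_text: str) -> str:
    """Ask the model to apply the rubric and return only a numeric score."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model choice; any chat model would do
        messages=[
            {"role": "system", "content": "You are an essay grader. Reply with a single number."},
            {"role": "user", "content": f"{RUBRIC}\n\nEssay:\n{essay_text}"},
        ],
    )
    return response.choices[0].message.content.strip()

print(grade_essay("Sample essay text ..."))
```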

Automated grading systems are increasingly being used in educational settings to grade assignments and exams, from elementary school to university level. While the use of these systems offers several benefits, such as time savings and consistency in grading, there is growing concern about the fairness, bias, consistency with human grading, and ethical implications of these systems. Automated grading systems are often trained on data that may contain hidden biases, which can result in inaccurate and unfair evaluations of student work (Litman et al., 2021). These issues became even more pressing with the shift to remote learning during the COVID-19 pandemic, as more schools and universities turned to automated grading systems to handle the increased workload. It is therefore crucial to examine the fairness and bias of these systems to ensure that they are not perpetuating existing inequalities and hindering student learning.

Litman et al. (2021) state that there are three primary approaches for building AES models: feature-based, neural, and hybrid. While neural network models outperform feature-based models, the latter are more explainable. Hybrid models that combine the strengths of both are becoming increasingly popular. Litman et al. (2021) compared these three types of AES models with respect to algorithmic fairness and found that different models exhibit various biases related to students' gender, race, and socioeconomic status. The study also suggests steps towards mitigating AES bias once it has been detected (Litman et al., 2021).
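To make the feature-based approach concrete, a minimal sketch is shown below: a few coarse surface features (word count, average sentence length, vocabulary size) fed to a linear regression model. The features and model choice are illustrative assumptions; feature-based AES systems in the literature use far richer feature sets.

```python
# Minimal feature-based AES sketch (illustrative features, not a production system).
import numpy as np
from sklearn.linear_model import Ridge

def surface_features(essay: str) -> list[float]:
    """Coarse surface features: word count, average sentence length, vocabulary size."""
    words = essay.split()
    sentences = [s for s in essay.split(".") if s.strip()]
    return [
        len(words),
        len(words) / max(len(sentences), 1),
        len(set(w.lower() for w in words)),
    ]

# train_essays / train_scores would come from a human-scored corpus.
train_essays = ["First training essay ...", "Second training essay ..."]
train_scores = [14.0, 9.0]

X = np.array([surface_features(e) for e in train_essays])
model = Ridge().fit(X, train_scores)

new_score = model.predict([surface_features("An unseen essay to be graded ...")])
print(round(float(new_score[0]), 1))
```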

Automated essay scoring systems have been the subject of research for the last few decades. Many researchers have focused on developing artificial intelligence and machine learning techniques for automated essay scoring (Ramesh & Sanampudi, 2022). However, existing studies have some limitations, and there are research trends that remain unexplored, particularly relating to fairness and bias. A recent systematic literature review on automated essay scoring systems conducted by Ramesh & Sanampudi (2022) indicates the need for better evaluation mechanisms that consider all relevant parameters, such as fairness and bias, development of ideas, cohesion, and coherence.

While group fairness has been explored in previous works on AES, individual fairness has not yet been adequately addressed. Doewes et al. (2022) propose a methodology to measure individual fairness in AES by evaluating the similarity of essays using the distance between their text representations. Their study compares several text representations of essays, from classical text features to deep learning-based features, and evaluates their performance on paraphrased essays to determine whether they maintain the ranking of similarities between the original and the paraphrased essays. The study also demonstrates how to evaluate automated scoring models with regard to individual fairness by counting the number of pairs of essays that satisfy the individual fairness equation, and by observing the correlation of score differences with the distance between essays. The study recommends using Sentence-BERT as the text representation of the essays and Gradient Boosting as the score prediction model to provide better results under the proposed individual fairness evaluation methodology (Doewes et al., 2022).
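A rough sketch of the kind of individual-fairness check described above is given below: embed essays with Sentence-BERT, then (i) count essay pairs whose score difference stays within a Lipschitz-style bound proportional to the embedding distance and (ii) correlate score differences with distances. The distance metric, the constant L, and the encoder name are assumptions for illustration, not the exact formulation used by Doewes et al. (2022).

```python
# Illustrative individual-fairness check (the bound L and distance metric are assumptions).
from itertools import combinations
import numpy as np
from scipy.stats import pearsonr
from sentence_transformers import SentenceTransformer

essays = ["Essay one ...", "Essay two ...", "Essay three ..."]
scores = np.array([12.0, 13.5, 8.0])   # scores from the AES model under test
L = 5.0                                 # assumed fairness constant

encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(essays)

fair_pairs, distances, score_gaps = 0, [], []
for i, j in combinations(range(len(essays)), 2):
    d = float(np.linalg.norm(embeddings[i] - embeddings[j]))  # embedding distance
    gap = abs(scores[i] - scores[j])
    distances.append(d)
    score_gaps.append(gap)
    if gap <= L * d:                                           # individual-fairness condition
        fair_pairs += 1

r, _ = pearsonr(distances, score_gaps)
print(f"Pairs satisfying the fairness condition: {fair_pairs}/{len(distances)}")
print(f"Correlation between distance and score difference: {r:.2f}")
```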

Key Terms in this Chapter

Rubric: A template used by teachers to assess students' essay writing by applying specific criteria to grade assignments.

Large Language Model (LLM): A deep learning algorithm that can perform a variety of natural language processing (NLP) tasks.

Automated Essay Scoring: The use of various types of technology (rather than humans), for instance artificial intelligence, in the grading or scoring of human-written essays.

Fairness: Impartial and just treatment or behavior without favoritism or discrimination.

Generative Pre-Trained Transformer (GPT): A type of large language model (LLM) and a prominent framework for generative artificial intelligence.

Ethics: Principles regarding behavior based on laid-down rules and guidelines.

Artificial Intelligence (AI): The theory and development of computer systems able to perform tasks normally requiring human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages.
