An evaluation of ChatGPT in action
When set a task, how does ChatGPT really perform and what does this tell educators about how to craft their questions and assignments to avoid students relying entirely on this AI tool to generate answers?
All of us working in higher education need to carefully consider what artificial intelligence, specifically ChatGPT, can do and what it cannot.
For those like me who are responsible for developing assessments, this will enable us to rephrase our questions to ensure that AI cannot pass.
Here I present a graphic illustration of how we might assess ChatGPT outputs. I have specifically chosen an example that is outside the higher education environment to illustrate the point.
I asked ChatGPT to write a crochet pattern for a chihuahua. Figure 1 below shows what a chihuahua looks like. A little brown dog, with pointy ears, a curly tail and four legs. He’s called Loki.
In response, ChatGPT produced coherent text instructions that made it possible, for someone who knows how to crochet, to produce a finished object. A creature did emerge. Figure 2 below is the result. It is a brown ball with two ears and a small tail. The pattern worked: it advised, in appropriate language, when to add more stitches and when to reduce the number of stitches to create the shape, when to insert the stuffing and where to stitch the ears and tail.
So let us assess the output.
When ChatGPT is asked to produce a crochet pattern for a chihuahua (remember, it is text based; it cannot draw a chihuahua), it will produce one. It is available here for crafty colleagues to have a go!
The pattern advised me to choose a suitable coloured wool, so that is roughly correct. There are two ears and a tail. That is where the likeness ends, though.
Here, I will use sample university-level descriptors and apply them to the pattern that was “submitted”. To view the full sample marking criteria, you can download the PDF above.
We will mark according to “Knowledge and understanding of the subject”, “Cognitive/Intellectual skills”, “Use of research-informed literature” and “Skills for life and employment”.
Knowledge and understanding of subject
0-25%, fail: Largely inaccurate or irrelevant material. Little or no evidence of factual and conceptual understanding of the subject, or of reading/research.
Cognitive/Intellectual skills
40-49%, 3rd, pass: There is some evidence of analysis and evaluation, but work is mainly descriptive with an uncritical acceptance of information, and unsubstantiated opinions may be evident. Lack of logical development of an argument.
Use of research-informed literature
26-39%, fail: Little or no evidence of ability to relate theory to practice. Little or no reference to research-informed literature.
Skills for life and employment
60-69%, 2:1, merit: Structure is coherent and logical, showing progression to the argument. There are few mistakes in presentation or citation. Demonstrates qualities and transferable skills required for employment.
So we have coherent text that conforms to the discipline standard (ie, I could follow it and produce a finished item). It demonstrates an understanding of some elements of the task (two ears and a tail). On this basis, the work could pass.
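To see why the work "could" pass rather than "does" pass, here is a minimal sketch of the arithmetic. It assumes the four criteria carry equal weight and that the pass mark is 40 per cent, the bottom of the lowest passing band quoted above. Neither assumption comes from the marking criteria themselves, and the function name is mine.

```python
# Bands awarded above, as (low, high) percentages for each criterion.
bands = {
    "Knowledge and understanding of subject": (0, 25),
    "Cognitive/Intellectual skills": (40, 49),
    "Use of research-informed literature": (26, 39),
    "Skills for life and employment": (60, 69),
}

# Assumption: 40% is the pass mark, since the lowest passing band starts there.
PASS_MARK = 40


def overall_range(bands):
    """Return (lowest, highest) overall mark, assuming equal weighting."""
    n = len(bands)
    lowest = sum(low for low, _ in bands.values()) / n
    highest = sum(high for _, high in bands.values()) / n
    return lowest, highest


lowest, highest = overall_range(bands)
print(f"Overall mark lies between {lowest}% and {highest}%")
print("Could pass:", highest >= PASS_MARK)
print("Certain to pass:", lowest >= PASS_MARK)
```

Under these assumptions the overall mark falls somewhere between 31.5 and 45.5 per cent, so the submission passes only if it is marked towards the top of each band. A different weighting of the criteria would move that range.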
But it’s not a chihuahua, is it? Here is what a real chihuahua thinks:
So, what does this prove?
The task and associated criteria were not well designed, allowing an extremely poor submission to gain enough marks to pass. It was a long way off a good mark, though.
So we need to carefully consider what we want the student to do and what the intended learning outcomes of the module are.
If an intended learning outcome was to produce a coherent crochet pattern, then the addition of a chihuahua was an unnecessary element. On that basis, ChatGPT did rather well.
If, however, we wanted students to produce a replica of a chihuahua demonstrating a thorough understanding of the physical characteristics of the Canis familiaris species, evaluating both similarities and differences between the chihuahua and other breeds, then ChatGPT failed rather dismally.
It really is all in the question.
Tips to confound AI cheating solutions:
1. The AI tool cannot get behind institutional logins or paywalls, so refer to specific materials: your class notes (“cite the sources on slide 8 from week 6”), your core text (“using the primary source from page 350 of the core text…”) or a list of sources that reside in a database such as JSTOR.
2. Use the most up-to-date conversation in your field. This can help to weed out fictional references. If you set a text published in 2020 and ask students to write an essay that critically analyses it, then you will know that any article a student cites as referring to that text, but which is dated before 2020, must be false.
3. AI is relentlessly positive and politically correct. Ask students to write an argument in support of a morally bad action.
4. Ask for the output in a format that AI cannot produce, such as a slide deck, a poster or an infographic.
5. My favourite – if you can’t beat them, join them. Have students ask the AI to write an essay and then critique it. With this solution, you are helping students to embrace this 21st-century tool, to recognise its benefits and limitations, and to demonstrate their own knowledge as well.
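The date check in tip 2 is simple enough to sketch in a few lines. This is an illustration only: the `Citation` class, the function name and the placeholder sources are hypothetical, and the publication years are assumed to have been typed in from the student's reference list.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    label: str  # hypothetical placeholder, not a real reference
    year: int   # publication year as given by the student


def impossible_citations(citations, anchor_year):
    """Return citations dated before the set text they are said to discuss."""
    return [c for c in citations if c.year < anchor_year]


# The set text in the example from tip 2 was published in 2020.
ANCHOR_YEAR = 2020

submitted = [
    Citation("Placeholder source A", 2018),
    Citation("Placeholder source B", 2021),
    Citation("Placeholder source C", 2020),
]

flagged = impossible_citations(submitted, ANCHOR_YEAR)
for c in flagged:
    print(f"Check by hand: {c.label} ({c.year}) predates the {ANCHOR_YEAR} text")
```

A flag here only says that a source cannot be discussing the set text. It is a prompt to look at the reference by hand, not proof that the student used AI.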
There is no doubt that this dilemma will become more common. As we move away from exam halls, with their rows of desks and students answering a series of questions in silence within a set time, we need to become a little more creative in how we ask students to demonstrate their learning. These tips may help in the short term and pave the way for new approaches to assessment.
Karen Kenny is a senior academic developer focused on supporting academic personal tutoring at the University of Exeter.