Abstract
Student course evaluations contain rich qualitative feedback in the form of comments written in response to open-ended questions. However, this qualitative data, which may be more nuanced and detailed than quantitative ratings, often goes unexamined in both administrative and research settings because manual analysis is labor-intensive. We investigate whether large language models (LLMs), including BERT, RoBERTa, and OpenAI model variants, can accurately replicate human judgments of sentiment in these comments. We compare masked and generative language models, using both naïve and fine-tuned approaches, on a curated dataset of 1,000 de-identified course evaluation responses. Results show that some artificial intelligence (AI) models can approach human inter-rater reliability remarkably well, and quickly, with limited tuning or training data. However, performance varied, and not all models produced a reliable sentiment analysis, even after training. These findings have implications for future qualitative analysis of course evaluations, including the large repositories of course evaluations held at institutions of higher education. Importantly, care should be taken when selecting an AI model, as this decision affects the reliability and validity of the generated output.
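As an illustration of what comparing model output with human raters can look like in practice, the sketch below computes Cohen's kappa between two sets of sentiment labels. This is only an assumption for illustration: the abstract does not say which agreement statistic the authors used, and the function name `cohen_kappa`, the label set, and the toy data are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical toy labels, not data from the study.
human = ["pos", "neg", "neu", "pos", "neg", "pos"]
model = ["pos", "neg", "pos", "pos", "neg", "neu"]
print(f"kappa = {cohen_kappa(human, model):.2f}")  # kappa = 0.45
```

A value near 1 indicates agreement well beyond chance, while a value near 0 indicates agreement no better than chance. The same calculation can be run between two human raters to obtain a baseline against which a model's agreement is judged.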
| Original language | English |
|---|---|
| Article number | 100545 |
| Journal | Computers and Education: Artificial Intelligence |
| Volume | 10 |
| DOIs | |
| State | Published - Jun 2026 |
Keywords
- Course evaluation
- Human-AI alignment
- LLMs
- Qualitative data
- Sentiment
Title
LLM sentiment quantification reveals selective alignment with human course-evaluation raters