large language models

Using Benchmarking Infrastructure to Evaluate LLM Performance on CS Concept Inventories: Challenges, Opportunities, and Critiques

Used automated benchmarking infrastructure and expert review to understand how LLM and student performance differ on CS assessments that have validity evidence.