Introduction

Computer programming is an essential skill in modern society. According to the Asian Development Bank (2022), programming skills are becoming increasingly essential in all job sectors worldwide. This trend is unsurprising, considering the ongoing digitalization and the growing complexity of technological solutions in various industries. Many non-coding job positions also now demand proficiency in computer programming. The International Labour Organization (2021) posited that coding skills today are required not only of programmers but also of scientists, engineers, designers, and artists. This demand underscores the necessity of integrating these skills into educational curricula. Governments across the globe have responded by adapting their educational systems. This adaptation involves the systematic incorporation of programming courses across different levels of education (Ou et al., 2023), in addition to offering them within information technology (IT), computer science (CS), and other computing programs. For instance, Macrides et al. (2022) noted a significant trend in early childhood education towards teaching coding. Their systematic review highlighted the adoption of screen-based visual programming and robotics for this purpose. This trend continues into middle school, where Lira et al. (2022) identified programming camps as a significant supplemental educational tool. It also extends to higher education, where Agbo et al. (2019) observed the increasing integration of computational thinking in teaching problem-solving skills and programming education. By incorporating these skills into the curricula, countries are preparing future generations for the challenges and opportunities of an increasingly digital world.

Despite this global shift in education, the world continues to face a shortage of skilled programmers. In a report covering four major countries (i.e., Canada, China, Germany, and Singapore), the International Labour Organization (2020) identified substantial deficits in the number of software developers and programmers. Similar widespread shortages have been observed by the European Labour Authority (2023) in the EU27, Norway, and Switzerland. This global issue is multifaceted, but a critical component centers on the challenges within education systems. Programming education often presents significant learning difficulties for students. Garcia (2021) asserted that these challenges are influenced by individual differences (e.g., inherent aptitudes and learning styles) as well as cognitive (e.g., problem-solving skills and logical reasoning abilities) and non-cognitive (e.g., motivation and attitude) factors. The effectiveness of programming education is also highly dependent on the quality of teaching and curriculum design. However, Ou et al. (2023) observed that the current quality of programming education is lacking and that there is a need for enhanced curriculum development in schools. This situation underscores the importance of rigorous assessment in programming education. Such assessments are vital in identifying areas where students face challenges. They also guide the refinement of teaching methods and curricula to ensure they are aligned with the evolving needs of learners. However, most assessments focus on measuring overall performance rather than diagnosing specific cognitive strengths and weaknesses (Garcia & Revano, 2021). This general approach tends to overlook critical insights into individual learning attributes that are crucial for addressing the unique challenges each student faces.

This gap in traditional assessment methods highlights the need for an approach that can identify the cognitive abilities of programming students. While traditional assessments measure overall performance, they often fail to diagnose specific cognitive strengths and weaknesses. This limitation makes it difficult to provide targeted support. To address this issue, Cognitive Diagnosis Modeling (CDM) emerges as a fitting solution. CDM is a sophisticated analytical approach that focuses on understanding and diagnosing the specific cognitive skills and knowledge structures individuals possess. By providing detailed and multidimensional diagnostic feedback, CDM can identify examinees' strengths and weaknesses across a spectrum of attributes. This technique can be particularly beneficial in programming education, where understanding the intricacies of a student's cognitive abilities can lead to more effective instructional strategies. Applying a CDM approach can inform curriculum development and optimize learning outcomes by aligning instructional methods with the diverse cognitive needs of students. Therefore, this study seeks to address the following research questions (RQ):

  1. RQ1: Which cognitive diagnosis model most adequately fits the empirical data?
  2. RQ2: What are the attributes mastered by students at the grade and individual levels?
  3. RQ3: How do these mastery profiles vary between CS and IT students?

Background of the Study

Cognitive Diagnosis Modeling

CDM is an advanced analytical approach designed to understand and diagnose the specific cognitive skills and knowledge structures that individuals possess. Its foundational roots date back to the development of the rule space method (Tatsuoka, 1983). Over the years, CDM has evolved to incorporate various models and techniques aimed at providing detailed and multidimensional diagnostic feedback (Liu et al., 2023). Unlike traditional psychometric frameworks that are more descriptive, such as item response theory (IRT) and classical test theory (CTT), CDM offers a diagnostic framework that classifies examinees' strengths and weaknesses across a spectrum of attributes (de la Torre & Minchen, 2014). In this context, an attribute refers to the essential knowledge and cognitive abilities required to solve specific problems or tasks. For example, a CDM could indicate whether students learning programming have mastered specific attributes essential to coding, such as understanding basic syntax, applying control structures such as loops and conditionals, or efficiently debugging code (Garcia et al., 2022). This level of granularity provides more detailed evidence than other psychometric models, making CDM particularly useful for guiding teaching and learning decisions in the classroom (Effatpanah et al., 2019; Paulsen & Valdivia, 2022). Given its capabilities as a psychometric model, CDM is frequently used as the analytical framework in cognitive diagnostic assessment because it offers a more comprehensive evaluation of students' learning processes (Li et al., 2021).

The application of CDM has shown significant success in various educational contexts, including reading (Jang et al., 2015), listening (Meng et al., 2023), writing (Effatpanah et al., 2019), mathematics (Chandía et al., 2023), and accounting (Helm et al., 2022). These studies have demonstrated the effectiveness of CDM in providing detailed insights into specific skill sets and cognitive abilities of students. However, despite the empirical evidence demonstrating the benefits of CDM in other educational domains, it has not yet been adopted in programming education. A review of the literature reveals a significant gap, as no studies have specifically employed CDM as an assessment approach in programming education. The closest prior work involves the assessment of computational thinking, which has a broader focus on problem-solving and algorithmic reasoning rather than programming-specific skills (Li & Traynor, 2022). Unfortunately, traditional assessments used in programming education (e.g., Qayyum et al., 2018; Schnieder & Williams, 2022), while useful in evaluating the general understanding and competence of learners, often fail to diagnose specific cognitive strengths and weaknesses. In contrast, CDM offers a unique opportunity to identify the nuances of students' cognitive abilities. By diagnosing specific areas where students may struggle or excel, CDM can provide educators with the insights needed to tailor instruction more effectively. Given the limitations of traditional assessment methods in programming, there is a compelling need to explore and integrate CDM into this field to enhance both learning outcomes and instructional practices.

Foundational Models in Cognitive Diagnosis

An important consideration in applying a CDM is selecting an appropriate model (Wu et al., 2024). CDM encompasses various types of models, each with unique features and applications. Saturated models, such as the G-DINA (generalized deterministic inputs, noisy "and" gate) model (de la Torre, 2011), are the most comprehensive, as they allow for the estimation of all possible interactions among attributes. These models are highly flexible and can capture complex relationships, but they require a large amount of data and can be computationally intensive. Conversely, constrained models simplify the structure by assuming that certain interactions are negligible, thus reducing the number of parameters to be estimated. Examples of constrained models include the DINA (deterministic inputs, noisy "and" gate) model (Junker & Sijtsma, 2001), the DINO (deterministic input, noisy "or" gate) model (Templin & Henson, 2006), the additive CDM (ACDM; de la Torre, 2011), the linear logistic model (LLM; Maris, 1999), and the reduced reparameterized unified model (RRUM; Hartz, 2002), among others. These models are easier to manage and interpret but may not capture all the nuances of the data. When the relationships among attributes are uncertain, Ma and de la Torre (2020) noted that higher-order G-DINA models with Rasch, 1-parameter logistic (1PL), and 2-parameter logistic (2PL) joint attribute distributions can be considered when selecting appropriate models for empirical studies.
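To make the contrast between constrained and saturated formulations concrete, the item response functions below sketch the DINA model and the G-DINA model in standard notation (following Junker & Sijtsma, 2001, and de la Torre, 2011); they are reproduced here for reference and are not derived from the present dataset.

```latex
% DINA (constrained): item j has only a guessing parameter g_j and a slip parameter s_j.
% \eta_{ij} = 1 only when examinee i has mastered every attribute required by item j.
\eta_{ij} = \prod_{k=1}^{K} \alpha_{ik}^{q_{jk}}, \qquad
P(X_{ij} = 1 \mid \boldsymbol{\alpha}_i) = (1 - s_j)^{\eta_{ij}} \, g_j^{\,1 - \eta_{ij}}

% G-DINA (saturated, identity link): intercept, main effects, and all interactions
% among the K_j^* attributes required by item j.
P(X_{ij} = 1 \mid \boldsymbol{\alpha}_{lj}^{*}) = \delta_{j0}
  + \sum_{k=1}^{K_j^{*}} \delta_{jk}\,\alpha_{lk}
  + \sum_{k'=k+1}^{K_j^{*}} \sum_{k=1}^{K_j^{*}-1} \delta_{jkk'}\,\alpha_{lk}\alpha_{lk'}
  + \cdots
  + \delta_{j12\ldots K_j^{*}} \prod_{k=1}^{K_j^{*}} \alpha_{lk}
```

The reduced models listed above can all be obtained by constraining the G-DINA parameters (e.g., the DINA model retains only the intercept and the highest-order interaction term).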

In some cases, a mixed model approach is used as a supplement to standard CDM analysis, where different models are applied to individual items within the same assessment. This technique allows for a tailored analysis that can better fit the varying complexities of different questions in an instrument. For instance, simpler items might be analyzed with reduced models, while more complex items might require saturated models to fully capture the cognitive processes involved. In a practical application, Ravand and Robitzsch (2018) applied this method in a reading comprehension context and found that a mixed model provided a better fit than the G-DINA model. Given the abundance of viable models, de la Torre and Lee (2013) argued that objectively choosing the most appropriate model is crucial rather than relying on personal preference or a predetermined model. As a guiding approach, the parsimony principle suggests selecting the simplest model when faced with multiple statistically equivalent models. However, model selection should also be based on how well the model assumptions correspond to the theoretical basis used to construct a given test (Li et al., 2015). de la Torre and Lee (2013) noted that the Wald test, a statistical test for parameter significance, can be used to compare models under the G-DINA framework. Using this test allows the selection of the model that best fits the specific context of the assessment. The choice of model impacts the accuracy and utility of the diagnostic information obtained, making it essential to consider the characteristics of the items and the attributes being measured (Effatpanah et al., 2019; Helm et al., 2022). This careful selection ensures that the CDM approach is effectively tailored to provide fine-grained diagnostic information and the most meaningful insights into students' cognitive abilities and learning needs.

Methods

Study Setting and Participants

The research was conducted at one of the leading institutes of technology in the Philippines. This university hosts a College of Computer Studies and Multimedia Arts (CCSMA), which offers IT and CS undergraduate programs. A fundamental component shared between these programs is a series of introductory and advanced computer programming courses. One of the programming courses that plays a significant role in the curriculum of both programs is Computer Programming 1, which comprises lecture (CCS0003) and laboratory (CCS0003L) components. The primary objective of this introductory programming course is to teach first-year computing students the foundational skills in computational logic and design. The course covers traditional problem-solving techniques (e.g., flowcharting and pseudo-coding) and basic programming concepts, including input/output operations, conditional and repetitive control structures, and arrays. Garcia (2021) utilized the same course in conducting experimental research on evaluating cooperative learning pedagogy in computer programming. The selection of this course for the study is strategic, as it represents a shared educational experience for IT and CS students. The course maintains uniformity in its syllabus, teaching materials, and online modules, ensuring consistency of instruction across faculty members and between the two programs. It also guarantees that all computing students are assessed under similar conditions, making the evaluation of their skills and knowledge fair and unbiased. By maintaining uniformity in course content and delivery, the research design effectively controls for extraneous variables that might otherwise influence the outcome of the study (Garcia, 2023).

Research Instrument and Data Collection

This study utilized a comprehensive 100-item multiple-choice final examination from the CCS0003 course. The departmental examination was administered during the first trimester of the academic year 2023-2024, and all IT and CS students enrolled in the course completed it within one hour. The instrument development was spearheaded by the faculty-in-charge, with subsequent validation by a team of co-faculty members who also teach the course. This collaborative approach to the instrument's development and validation ensured its academic rigor and alignment with the course's educational objectives. It is important to note that, although arguably better approaches to assessing students exist (e.g., practical coding assessments), the number of items and the multiple-choice format are departmental requirements. Despite the multiple-choice format, several questions presented students with scenarios involving machine problems, requiring them to interpret and analyze provided code snippets. Successfully responding to these questions necessitates a comprehensive understanding of the underlying algorithms. Additionally, this instrument was created simply as a final course assessment and not specifically for CDM analysis. Lee et al. (2012) argued that very few assessments are designed based on a cognitive diagnosis framework. More commonly, CDM is applied retrospectively to assessments initially developed within a unidimensional item response theory framework (i.e., retrofitting).

Nonetheless, the primary reason for selecting the final examination is the availability of detailed data (i.e., student responses on an item-by-item basis and the correctness of these responses). The dataset was readily accessible through ZipGrade – a mobile optical scanner application for grading multiple-choice assessments. The administrative office of the CCSMA was formally requested to provide a copy of the examination and the results. In response to the request, and with consideration for ethical research practices, they provided randomly selected data from various IT (n = 308) and CS (n = 269) classes. Upon receipt of the data, the first step was anonymizing it to ensure student confidentiality. This anonymization process involved removing all personally identifiable information, such as names, identification numbers, and any other markers traceable to individual students. This step was crucial for protecting student privacy and upholding the integrity of our research. More importantly, this approach strictly complied with data protection regulations and institutional ethical guidelines.
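As a rough sketch of this preprocessing step, the R code below assumes the ZipGrade export is a CSV file with one row per student, a program label, identifier columns, and one scored (0/1) column per item; the file name and column names are hypothetical placeholders rather than the actual export format.

```r
# Hypothetical preprocessing sketch: load the exported responses and strip identifiers
responses_raw <- read.csv("ccs0003_final_exam.csv", stringsAsFactors = FALSE)

# Keep only the program label and the 100 scored items (assumed names: Item1 ... Item100),
# dropping names, student numbers, and any other personally identifiable columns
item_cols <- paste0("Item", 1:100)
responses <- responses_raw[, c("Program", item_cols)]

# Separate the IT and CS groups for later group-level analyses
responses_it <- responses[responses$Program == "IT", item_cols]
responses_cs <- responses[responses$Program == "CS", item_cols]
```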

Table 1. Attributes measured by the examination, their definitions, and supporting references

Attribute | Definition | References
Theoretical Understanding | Deep understanding of the principles and theories that form the foundation of programming. | Garcia & Revano, 2021; Hota et al., 2023; Thuné & Eckerdal, 2019
Language Proficiency | Mastery in using programming languages effectively to solve problems and create applications. | Garcia et al., 2022; Guo, 2018; Xie et al., 2019; Zhang et al., 2023
Logical Reasoning | Ability to apply coherent and rational thinking to solve problems and make decisions in programming. | Barlow-Jones & van der Westhuizen, 2017; Djurdjevic-Pahl et al., 2017
Algorithmic Thinking | Skill in designing, understanding, and implementing instructions to solve specific problems efficiently. | Angeli, 2022; Kiss & Arki, 2017; Lamagna, 2015; Tsukamoto et al., 2017
Code Tracing | Competence in following a program's execution flow and comprehending the behavior of the code. | Kumar, 2015; Russell, 2022; Stankov et al., 2023; Zhang et al., 2023

Q-Matrix

Upon obtaining a copy of the examination, domain expertise, enriched by insights from relevant literature (e.g., Xie et al., 2019), was used to identify the essential attributes that computer programming students must possess. At this point, the primary goal was to compile a comprehensive list of these attributes. Then, the examination was reviewed item-by-item to check for missing attributes or to confirm that all identified attributes were covered. As a result, specific attributes (e.g., debugging and code documentation) were not included in the study due to the absence of corresponding questions in the examination. The initial list of attributes was then presented to the faculty-in-charge who developed the examination. Following a consultation, a consensus was reached that the attributes measured by this examination include theoretical understanding, language proficiency, logical reasoning, algorithmic thinking, and code tracing (see Table 1). Subsequently, a Q-matrix was developed to illustrate the relationship between examination items and the identified attributes. Mapping the test items onto an item-by-skill table is a critical first step in CDM (Tatsuoka, 1983). The Q-matrix underwent validation by the faculty team responsible for the creation and validation of the CCS0003 examination. In cases of disagreement, conflicting viewpoints were discussed and resolved through collaborative decision-making to ensure a unified and accurate representation in the matrix. Several revisions were made based on their feedback, and the revised version served as our initial Q-matrix.
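To illustrate the structure of this item-by-skill mapping, the sketch below encodes a few q-vectors in R; the 0/1 entries shown are illustrative placeholders rather than the study's validated values.

```r
# Attributes: TU = theoretical understanding, LP = language proficiency,
# LR = logical reasoning, AT = algorithmic thinking, CT = code tracing
attributes <- c("TU", "LP", "LR", "AT", "CT")

# Illustrative q-vectors for the first three items (1 = attribute required by the item)
Q <- matrix(c(1, 1, 0, 0, 0,   # Item 1: theory and language proficiency
              0, 1, 1, 0, 1,   # Item 2: language, reasoning, and code tracing
              1, 0, 1, 1, 0),  # Item 3: theory, reasoning, and algorithmic thinking
            ncol = 5, byrow = TRUE,
            dimnames = list(paste0("Item", 1:3), attributes))
```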

Data Analysis

Following the development of the initial Q-matrix, data analysis was conducted using the R programming language, employing the GDINA package (Ma & de la Torre, 2020) as well as the tidyr, ggplot2, and fmsb packages. The data analysis was initiated with an empirical validation of the item-by-skill table. Ma and de la Torre (2020) observed that Q-matrices developed by domain experts tend to be subjective, which is why it is critical to validate them empirically to avoid erroneous attribute estimation. The results provided by the G-DINA model were consulted using the Proportion of Variance Accounted For (PVAF) with a cutoff of 0.95 (de la Torre & Chiu, 2016). Additionally, the mesa plots (refer to Figure 1) of items flagged for revision were manually checked for further analysis. Revisions were made only when they were logically consistent with the item and the skills required for its correct response. This validation process led to the finalization of the Q-matrix, with the results indicating that 86 out of 100 q-vectors were retained (e.g., Item 48; Figure 1a). Regarding the 14 items with suggested q-vector modifications: six items had one suggested change each (e.g., Item 25: from 10110 to 11110; Figure 1b), six items had two suggested changes each (e.g., Item 57: from 01001 to 01111; Figure 1c), and one item had three suggested changes (e.g., Item 54: from 10000 to 10111; Figure 1d). The final and validated Q-matrix can be found in Appendix A.
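A minimal sketch of this validation step using the GDINA package might look as follows, assuming a 0/1 item-response matrix `dat` and the initial 100 x 5 Q-matrix `Q` (both object names and the flagged item index are placeholders):

```r
library(GDINA)

# Fit the saturated G-DINA model as the basis for empirical Q-matrix validation
fit_gdina <- GDINA(dat = dat, Q = Q, model = "GDINA")

# Validate the expert-specified q-vectors with the PVAF criterion (cutoff = 0.95)
qval <- Qval(fit_gdina, method = "PVAF", eps = 0.95)
qval                   # lists items with suggested q-vector modifications

# Mesa plot of a flagged item to judge whether a suggested change is defensible
plot(qval, item = 25)
```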

Afterward, the analysis progressed by fitting the G-DINA model with monotonicity constraints imposed on the dataset. This saturated model provided a baseline for our analysis. Subsequently, various models were explored, including the DINA model, the DINO model, the ACDM, the LLM, and the RRUM. Given the diversity of cognitive processes involved, forcing a single model onto the entire test may be inappropriate; therefore, an item-level model fit analysis was conducted. This approach allowed us to consider how each model applied to individual test items rather than to the test as a whole. To select the most appropriate model at the item level, this study followed the process outlined by Ma et al. (2016). First, the Wald statistic was calculated for every reduced model for each item. de la Torre and Lee (2013) recommended the use of the Wald test as an objective means of determining the most appropriate models. In this approach, the null hypothesis posits that the reduced model fits the item as well as the saturated model. If the null hypothesis is rejected (p < .05), the reduced model is dismissed. If more than one reduced model is retained and the DINA or DINO model is among them, the one with the largest p-value is selected. The outcome of this analysis was a mixed model (subsequently referred to as MIXED), which combined different models at the item level. In addition to these models, we incorporated higher-order G-DINA models with Rasch, 1PL, and 2PL joint attribute distributions into our analysis. Several studies have demonstrated the potential of using higher-order models in examining the skill profiles of students (e.g., Zhang et al., 2022).
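A condensed sketch of this step with the GDINA package is shown below, again assuming the `dat` and `Q` objects from the previous sketch; the item indices and model assignments in the mixed-model vector are illustrative, not the study's actual Wald-test results.

```r
library(GDINA)

# Saturated baseline with monotonicity constraints
fit_gdina <- GDINA(dat, Q, model = "GDINA", mono.constraint = TRUE)

# Item-level Wald tests comparing each reduced model (DINA, DINO, ACDM, LLM, RRUM)
# against the saturated G-DINA model
wald <- modelcomp(fit_gdina)
wald   # inspect Wald statistics and p-values per item

# Build an item-level model vector: keep G-DINA by default and substitute the
# reduced model retained by the Wald test for selected items (indices illustrative)
item_models <- rep("GDINA", ncol(dat))
item_models[c(3, 7)] <- c("DINA", "ACDM")
fit_mixed <- GDINA(dat, Q, model = item_models, mono.constraint = TRUE)

# Single reduced and higher-order variants for the relative fit comparison
fit_dina  <- GDINA(dat, Q, model = "DINA", mono.constraint = TRUE)
fit_ho2pl <- GDINA(dat, Q, model = "GDINA", att.dist = "higher.order",
                   higher.order = list(model = "2PL"), mono.constraint = TRUE)
```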

All these models were included in the relative fit analysis, where the performance of the saturated, reduced, mixed, and higher-order models was compared using the anova() function of the GDINA package. This comparative analysis was pivotal in selecting the most appropriate model. Models that were not rejected during this analysis were further examined, and the one with the lowest Akaike Information Criterion (AIC; Akaike, 1974) and Bayesian Information Criterion (BIC; Schwarz, 1978) was selected as the most suitable model. Both AIC and BIC are relative fit indices used for selecting between non-nested models, with lower values indicating a better fit. These indices balance model complexity and goodness of fit, helping avoid overfitting while ensuring accuracy.
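Under the same assumptions as the previous sketch, the relative fit comparison might be expressed as follows, with the fitted-model object names carried over from that sketch:

```r
# Compare the saturated, reduced, mixed, and higher-order fits; the output reports
# log-likelihood, number of parameters, AIC, and BIC for each fitted model
anova(fit_gdina, fit_dina, fit_mixed, fit_ho2pl)

# Among the models that are not rejected, prefer the one with the lowest AIC and BIC
AIC(fit_mixed)
BIC(fit_mixed)
```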