INTRODUCTION
When you develop a test, it is important to identify the strengths and weaknesses
of each item. To determine how well items in a test perform, some statistical
procedures need to be used.
In this topic, we will discuss item analysis which involves the use of three
procedures: item difficulty, item discrimination and distractor analysis to help the
test developer decide whether the items in a test should be accepted, modified or
rejected. These procedures are quite straightforward and easy to use,
and the educator needs to understand the logic underlying the analyses in order
to use them properly and effectively.
However, there is much more information available about a test that is often
ignored by teachers. This information will only be available if the item analysis is
done. What is item analysis?
Specifically, in classical test theory (CTT) the statistics produced from analysing
the test results based on test scores include measures of difficulty index and
discrimination index. Analysing the effectiveness of distractors also becomes part
of the process (which we will discuss in detail later in the topic).
The quality of a test is determined by the quality of each item or question in the
test. The teacher who constructs a test can only roughly estimate the quality of a
test. This estimate is based on the fact that the teacher has followed all the rules
and conditions of test construction.
However, it is possible that this estimation may not be accurate and certain
important aspects have been ignored. Hence, it is suggested that to obtain a more
comprehensive understanding of the test, item analysis should be conducted on
the responses of students. Item analysis is done to obtain information about
individual items or questions in a test and how the test can be further improved.
It also facilitates the development of an item or question bank which can be used
in the construction of a test.
Step 1
Upon receiving the answer sheets, the first step would be to mark each of the
answer sheets.
Step 2
Arrange the 45 answer sheets from the highest score obtained to the lowest score
obtained. The paper with the highest score is on top and the paper with the lowest
score is at the bottom.
Step 3
Multiply 45 (the number of answer sheets) by 0.27 (or 27 per cent), which gives
12.15, and round it to 12. The value of 0.27 or 27 per cent is not a rigid rule; any
percentage from 27 to 35 per cent may be used. However, the 27 per cent rule can
be ignored if the class size is too small. Instead of taking the 27 per cent sample,
divide the number of answer sheets by two.
Step 4
Arrange the pile of 45 answer sheets according to the scores obtained (from the
highest score to the lowest score). Take out 12 answer sheets from the top of the
pile and 12 answer sheets from the bottom of the pile. Call these two piles "high
marks" students and "low marks" students respectively. Set aside the middle
group of papers (21 papers). Although these could be included in the analysis,
using only the high and low groups will simplify the procedure.
Step 5
Refer to Question 1 (refer to Figure 9.1), then:
(a) Count the number of students from the "high marks" group who selected
each of the options (A, B, C or D); and
(b) Count the number of students from the "low marks" group who selected
each of the options (A, B, C or D).
From the analysis, 11 students from the "high marks" group and two students
from the "low marks" group selected "B", which is the correct answer. This means
that 13 out of the 24 students selected the correct answer. Also, note that all the
distractors (A, C and D) were selected by at least one student. However, the
information provided in Figure 9.1 is insufficient and further analysis has to be
conducted.
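To make the procedure concrete, here is a minimal sketch in Python of Steps 1 to 5, assuming each answer sheet is represented as a pair of the student's total score and the option he or she chose for the question being analysed; the data and function names are illustrative, not part of the original example.

    from collections import Counter

    def high_low_option_counts(records, fraction=0.27):
        # records: list of (total_score, chosen_option) pairs, one per answer sheet
        ranked = sorted(records, key=lambda r: r[0], reverse=True)  # Step 2: highest score first
        n = max(1, round(len(ranked) * fraction))                   # Step 3: 27 per cent, e.g. 45 x 0.27 -> 12
        high, low = ranked[:n], ranked[-n:]                         # Step 4: top and bottom piles
        # Step 5: count how many students in each pile chose A, B, C or D
        return Counter(opt for _, opt in high), Counter(opt for _, opt in low)

    # Example call with made-up answer sheets:
    # sheets = [(38, "B"), (35, "B"), (12, "C"), ...]
    # high_counts, low_counts = high_low_option_counts(sheets)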
SELF-CHECK 9.1
What does a difficulty index (p) of 0.54 mean? The difficulty index is a coefficient
that shows the percentage of students who got the correct answer compared with
the total number of students who attempted the question. In other words,
54 per cent of students selected the right answer. Although our computation is
based on the high and low scoring groups only, it provides a close approximation
of the estimate that would be obtained with the total group. Thus, it is proper to
say that the index of difficulty for this item is 54 per cent (for this particular group).
Note that since "difficulty" refers to the percentage getting the item right, the
smaller the percentage figure the more difficult the item. The meaning of the
difficulty index is shown in Figure 9.2.
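As a rough illustration, the difficulty index for Question 1 can be reproduced from the high and low group counts given above (11 and 2 correct out of 12 students in each group); the short sketch below simply divides the number of correct responses by the number of students in the two groups.

    def difficulty_index(correct_high, correct_low, n_high, n_low):
        # proportion of students in the two groups who answered correctly
        return (correct_high + correct_low) / (n_high + n_low)

    print(round(difficulty_index(11, 2, 12, 12), 2))  # (11 + 2) / 24 = 0.54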
If a teacher believes that the achievement 0.54 on the item is too low, he or she can
change the way he or she teaches the item to better meet the objective represented
by it. Another interpretation might be that the item was too difficult or confusing
or invalid, in which case the teacher can replace or modify the item, perhaps using
information from the item's discrimination index or distractor analysis.
Under CTT, the item difficulty measure is simply the proportion that is correct for
an item. For an item with a maximum score of two, there is a slight modification
to the computation of the proportion or percentage correct.
This item has a possible partial credit scoring 0, 1, 2. If the total number of students
attempting this item is 100, and 23 students scored 0, 60 students scored 1 and
17 students scored 2, then a simple calculation will show that 23 per cent of the
students scored 0, 60 per cent of the students scored 1, and 17 per cent of the
students scored 2 for this particular item. The average score for this item should
be (0 × 0.23) + (1 × 0.60) + (2 × 0.17) = 0.94.
Thus, the observed average score of this item is 0.94 out of a maximum of 2. So the
average proportion correct is 0.94/2 = 0.47 or 47 per cent.
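The same arithmetic can be sketched in a few lines of Python; the score distribution below is the one described above (23, 60 and 17 students scoring 0, 1 and 2 respectively).

    score_counts = {0: 23, 1: 60, 2: 17}   # number of students at each score point
    max_score = 2

    n = sum(score_counts.values())                                              # 100 students
    average = sum(score * count for score, count in score_counts.items()) / n  # 0.94
    p = average / max_score                                                     # 0.94 / 2 = 0.47
    print(average, p)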
ACTIVITY 9.1
Options                       A    B    C    D    Blank
High marks group (n = 12)     0    2    8    2    0
Low marks group (n = 12)      2    4    3    2    1
In our example in subtopic 9.2, 11 students in the high group and two students in
the low group selected the correct answer. This indicates positive discrimination,
since the item differentiates between students in the same way that the total test
score does. That is, students with high scores on the test (high group) got the item
right more frequently than students with low scores on the test (low group).
Although analysis by inspection may be all that is necessary for most purposes, an
index of discrimination can be easily computed using the following formula:
Discrimination index = (RH − RL) / (T/2)

where RH = number of students in the "high marks" group with the correct answer
      RL = number of students in the "low marks" group with the correct answer
      T  = total number of students in the two groups combined
Example 9.1:
A test was given to a group of 43 students and 10 out of the 13 "high marks" group
students got the correct answer, compared to five out of the 13 "low marks" group
students who got the correct answer. The discrimination index is computed as follows:

Discrimination index = (RH − RL) / (T/2) = (10 − 5) / (26/2) = 5/13 = 0.38
The formula for the discrimination index is such that if more students in the "high
marks" group chose the correct answer than students did in the low scoring group,
the number will be positive. At a minimum, one would hope for a positive value,
as that would indicate that it is knowledge of the question that resulted in the
correct answer. The greater the positive value (the closer it is to 1.0), the stronger
the relationship is between overall test performance and performance on that item.
If the discrimination index is negative, that means that for some reason, students
who scored low on the test were more likely to get the answer correct. This is a
strange situation which suggests poor validity for an item.
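A small Python sketch of this formula, checked against Example 9.1, is shown below; the function name is illustrative.

    def discrimination_index(correct_high, correct_low, total_in_groups):
        # (RH - RL) divided by half the number of students in the two groups
        return (correct_high - correct_low) / (0.5 * total_in_groups)

    print(round(discrimination_index(10, 5, 26), 2))  # (10 - 5) / 13 = 0.38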
The difficulty index (p) of the item can be computed using the following formula:
p = Average score / Possible range of scores
Using the information from Table 9.1, the difficulty index of the short-answer essay
question can be easily computed. The average score obtained by the group of
students is 2.55, while the possible range of scores for the item is (4 − 0) = 4. Thus,

p = 2.55 / 4 = 0.64
The difficulty index (p) of 0.64 means that on average, students have received
64 per cent of the maximum possible score of the item. The difficulty index can
be interpreted the same as that of the multiple-choice question discussed in
subtopic 9.3. The item is of a moderate level of difficulty (refer to Figure 9.2).
Note that in computing the difficulty index in the previous example, the scores of
the whole group are used to obtain the average score. However, for a large group
of students, it is possible to estimate the difficulty index for an item based on only
a sample of students comprising the high marks and low marks groups as in the
case of computing the difficulty index of a multiple-choice question.
Using the information from Table 9.1 but presenting it in the following format as
in Table 9.2, we can compute the discrimination index of the short-answer essay
question.
Score                         0    1    2    3    4    Total Score    Average Score
High marks group (n = 10)     0    0    1    4    5    34             3.4
Low marks group (n = 10)      1    3    4    2    0    17             1.7
The average score obtained by the upper group of students is 3.4 while that of the
lower group is 1.7. Using the formula as suggested by Nitko (2004), we can
compute the discrimination index of the short-answer essay question as follows:
D = (3.4 − 1.7) / 4 = 0.43
The discrimination index (D) of 0.43 indicates that the short-answer question does
discriminate between the upper and lower groups of students and at a high level
(refer to Figure 9.3.) As in the computation of the discrimination index of the
multiple-choice question for a large group of students, a sample of students
comprising the top 27 per cent and the bottom 27 per cent may be used to provide
a good estimate.
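The calculation can be sketched as follows, using the group averages from Table 9.2; the function name is an assumption for illustration.

    def discrimination_index_cr(mean_high, mean_low, score_range):
        # difference between group averages divided by the possible range of scores
        return (mean_high - mean_low) / score_range

    print(round(discrimination_index_cr(3.4, 1.7, 4), 2))  # about 0.43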
The following are two possible reasons for poorly discriminating items:
(a) The wording and format of the item are problematic; or
(b) The item tests something different from what the majority of the items in the
test (and the test as a whole) are intended to measure.
ACTIVITY 9.2
Score                         0    1    2    3    4
High marks group (n = 10)     2    2    3    1    2
Low marks group (n = 10)      3    2    2    3    0
Figure 9.4: Theoretical relationship between difficulty index and discrimination index
Source: Stanley and Hopkins (1972)
For example, a difficulty index of 0.9 results in a discrimination index of about 0.2,
which describes an item of low to moderate discrimination. What does this mean?
The easier a question, the harder it is for that question or item to discriminate
between those students who know and those who do not know the answer to the
question.
Similarly, when the difficulty index is about 0.1, the discrimination index drops to
about 0.2. What does this mean? The more difficult a question, the harder it is for
that question or item to discriminate between those students who know and those
who do not know the answer to the question.
ACTIVITY 9.3
1. What can you conclude about the relationship between the difficulty
index of an item and its discrimination index?
Generally, a good distractor is able to attract more "low marks" students than
"high marks" students to select that particular response. What determines the
effectiveness of distractors? Figure 9.5 shows how 24 students selected the options
A, B, C and D for a particular question. Option B is a less effective distractor
because many "high marks" students (n = 5) selected option B. Option D is a
relatively good distractor because two students from the "high marks" group and
five students from the "low marks" group selected this option. The analysis of
response options shows that those who missed the item were about equally likely
to choose answer B and answer D. No students chose answer C, meaning it does
not act as a distractor. Students were not choosing among four answer options on
this item; they were really choosing among only three options, as they were not
even considering answer C. This makes guessing correctly more likely, which
hurts the validity of the item. The discrimination index can be improved by
modifying and improving options B and C.
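A rough way to automate this kind of distractor review is sketched below; the option counts are illustrative numbers loosely modelled on Figure 9.5 (with A assumed to be the correct answer), not the exact figures.

    def review_distractors(high_counts, low_counts, correct_option):
        # flag distractors that nobody chose or that attract the high group
        for option in sorted(set(high_counts) | set(low_counts)):
            if option == correct_option:
                continue
            h = high_counts.get(option, 0)
            l = low_counts.get(option, 0)
            if h + l == 0:
                print(f"Option {option}: chosen by nobody - not working as a distractor")
            elif h >= l:
                print(f"Option {option}: attracts the high group - review its wording")
            else:
                print(f"Option {option}: plausible distractor (high {h}, low {l})")

    review_distractors({"A": 10, "B": 5, "C": 0, "D": 2},
                       {"A": 3, "B": 4, "C": 0, "D": 5},
                       correct_option="A")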
ACTIVITY 9.4
The answer is B.
Step 1
Arrange the 30 answer sheets from the highest score obtained to the lowest score
obtained.
Step 2
Select the answer sheet that obtained a middle score. Group all answer sheets
above this score as the "high marks" group (mark an "H" on these answer sheets).
Group all answer sheets below this score as the "low marks" group (mark an "L"
on these answer sheets).
Step 3
Divide the class into two groups (high and low) and distribute the "high" answer
sheets to the high group and the "low" answer sheets to the low group. Assign one
student in each group to be the counter.
Step 4
The teacher then asks the class, "The answer for Question #1 is C. Those who
got it correct, raise your hand."
Counter from "H" group: "Fourteen for group H"
Counter from "L" group: "Eight from group L"
Step 5
The teacher records the responses on the whiteboard as follows:
Step 6
Calculate the difficulty index for Question #1 as follows:
Difficulty index = (RH + RL) / T = (14 + 8) / 30 = 22/30 = 0.73
Step 7
Compute the discrimination index for Question #1 as follows:
Discrimination index = (RH − RL) / (T/2) = (14 − 8) / (30/2) = 6/15 = 0.40
Note that earlier, we took 27 per cent of answer sheets in the "high marks" group
and 27 per cent of answer sheets in the "low marks" group from the total answer
sheets. However, in this approach we divided the total answer sheets into two
groups. There is no middle group. The important thing is to use a large enough
fraction of the group to provide useful information. Selecting the top and bottom
27 per cent of the group is recommended for a more refined analysis. This method
may be less accurate but it is a "quick and dirty" method.
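The two calculations in Steps 6 and 7 can be reproduced with a few lines of Python, as sketched below.

    correct_high, correct_low, class_size = 14, 8, 30   # tallies for Question #1

    difficulty = (correct_high + correct_low) / class_size            # (14 + 8) / 30 = 0.73
    discrimination = (correct_high - correct_low) / (class_size / 2)  # (14 - 8) / 15 = 0.40
    print(round(difficulty, 2), round(discrimination, 2))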
ACTIVITY 9.5
(a) From the discussion in the earlier subtopics, it is obvious that the results of
item analysis could provide answers to the following questions:
(iii) Were the items free from irrelevant clues and other defects?
Answers to the previous questions can be used to select or revise test items
for future use. This would improve the quality of test items and the test paper
to be used in future. It also saves teachers' time in preparing the test items
for future use because good items can be stored in the item bank.
(b) Item analysis data can provide a basis for efficient class discussion of the test
results. Knowing how effectively each test item functions in measuring the
achievement of the intended learning outcome and how students perform in
each item, teachers can have a more fruitful discussion with the students, as
feedback based on the item analysis is more objective and informative.
(c) Item analysis data can be used for remedial work. The analysis will reveal
the specific areas that the students are weak in. Teachers can use the
information to focus remedial work directly on the particular areas of
weakness. For example, based on the distractor analysis, it is found that a
specific distractor has a low discrimination with a great number of students
from both the high marks and low marks groups choosing the option. This
could suggest that there is some misunderstanding of a particular concept.
Remedial lessons can thus be planned to arrest the problem.
(d) Item analysis data can reveal weaknesses in teaching and provide useful
information to improve teaching. For example, despite the fact that an item
is properly constructed, it has a low difficulty index, suggesting that most
students fail to answer the item satisfactorily. This might suggest that the
students have not mastered a particular syllabus content that is being
assessed. This could be due to the weakness in instruction and thus
necessitates the implementation of more effective teaching strategies by the
teachers. Furthermore, if the item is repeatedly difficult for the students, there
might be a need to revise the curriculum.
(e) Item analysis procedures provide a basis for teachers to improve their skills
in test construction. As teachers analyse students' responses to items, they
become aware of the defects of the items and what causes them. When
revising the items, they gain experience in rewording the statements so that
they are clear, rewriting the distractors so that they are more plausible and
modifying the items so that they are at a more appropriate level of difficulty.
As a consequence, teachers improve their test construction skills.
(a) Item discriminating power does not indicate item validity. A high
discrimination index merely indicates that students from the high marks
group perform relatively better than the students from the low marks group.
The division of the high and low marks groups is based on the total test score
obtained by each student, which is an internal criterion. By using the internal
criterion of total test score, item analysis offers evidence concerning the
internal consistency of the test rather than its validity. The validity of a test
needs to be judged by an external criterion, that is, to what extent the test
assesses the learning outcomes intended.
(b) The discrimination index is not always an indicator of item quality. For
example, a low index of discriminating power does not necessarily indicate
a defective item. If an item does not discriminate but it has been found to be
free from ambiguity and other technical defects, the item should be retained,
especially in a criterion-referenced test. In such a test, a non-discriminating
item may suggest that all students have achieved the criterion set by the
teacher. As such, the item does not discriminate between the good and poor
students. Another possible reason why low discrimination occurs for an item
is that the item may be very easy or very difficult. Sometimes, however, it is
necessary or desirable to retain such an item in order to measure a
representative sample of learning outcomes and course content. Moreover,
an achievement test is usually designed to measure several different types of
learning outcomes (knowledge, comprehension, application and so on). In
such a case, there will be learning outcomes that are assessed by fewer test
items and these items will have low discrimination because they have less
representation in the total test score. Removing these items from the test is
not advisable as it will affect the validity of the test.
(c) This traditional item analysis data is tentative. It is not fixed but influenced
by the type and number of students being tested and the instructional
procedures employed. The data would thus change with every
administration of the same test items. So, if repeated use of items is possible,
item analysis should be carried out for each administration of each item. The
tentative nature of item analysis should therefore be taken seriously and the
results interpreted cautiously.
An item bank consists of questions that have been analysed and stored because
they are good items. Each stored item will have information on its difficulty index
and discrimination index. Each item is stored according to what it measures,
especially in relation to the topics of the curriculum. These items will be stored in
the form of a table of specifications indicating the content being measured as well
as the cognitive levels measured. For example, you will be able to draw from the
item bank items measuring the application of concepts for the topic on
"electricity". You will also be able to draw items from the bank with different
difficulty levels. Perhaps you want to arrange easier questions at the beginning of
the test so as to build confidence in students and then gradually introduce
questions of increasing difficulty.
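One possible way to record such banked items in a computerised database is sketched below; the field names and sample values are assumptions for illustration only, not a prescribed format.

    from dataclasses import dataclass

    @dataclass
    class BankedItem:
        item_id: str
        topic: str                 # e.g. "electricity"
        cognitive_level: str       # e.g. "application"
        difficulty_index: float    # p
        discrimination_index: float

    bank = [
        BankedItem("PHY-041", "electricity", "application", 0.54, 0.38),
        BankedItem("PHY-067", "electricity", "knowledge", 0.81, 0.25),
    ]

    # Easier items (higher p) first, harder items later, when assembling a test:
    ordered = sorted(bank, key=lambda item: item.difficulty_index, reverse=True)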
With computerised databases, item banks are easy to access. Teachers will have at
their disposal hundreds of items from which they can draw upon when
developing classroom tests. This would certainly help them with the tedious and
time-consuming task of having to construct items or questions from scratch.
Unfortunately, not many educational institutions are equipped with such an item
bank. The more common practice is for teachers to select items or questions from
commercially prepared workbooks, past examination papers and sample items
from textbooks. These sources do not have information about the difficulty index
and discrimination index of items, nor information about the cognitive levels of
questions or what they aim to measure. Teachers will have to figure out for
themselves the characteristics of the items based on their experience in teaching
the content.
However, there are certain issues to consider in setting up a question bank. One of
the major concerns is how to place different test items, collected over time, on a
common scale. The scale should indicate the difficulty of the items, with one scale
per subject matter. Retrieval of items from the bank is made easy when all items
are placed on the same scale.
The person in charge must also make every effort to add only quality items to the
item pool. To develop and maintain a good item bank requires a great deal of
preparation, planning, expertise and organisation. Though the item response theory
(IRT) approach is not a panacea for item banking problems, it can solve many
of these issues (IRT will be explained further in the next subtopic).
Item response theory (IRT) is a psychometric approach which assumes that the
probability of a certain response is a direct function of an underlying trait or traits.
Under IRT, the concern is whether the student answered each item correctly or
not, rather than the raw test score. The basic concept of IRT centres on the
individual test item rather than the total test score. Student trait or ability and item
characteristics are referenced to the same scale. For example, ConQuest is a
computer program for item response and latent regression models, and TAM is an
R package for item response models.
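As a minimal illustration of the idea that ability and item difficulty share one scale, the sketch below implements the simplest IRT model, the one-parameter logistic (Rasch) model; it is not the procedure used by ConQuest or TAM, only an illustration of the underlying function.

    import math

    def rasch_probability(ability, item_difficulty):
        # probability of a correct response under the Rasch (1PL) model
        return 1.0 / (1.0 + math.exp(-(ability - item_difficulty)))

    print(round(rasch_probability(0.0, 0.0), 2))   # 0.5 when ability equals item difficulty
    print(round(rasch_probability(1.0, -0.5), 2))  # higher probability on an easier item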
ACTIVITY 9.6
The discrimination index is a basic measure which shows the extent to which
a question discriminates or differentiates between students in the "high marks"
group and "low marks" group.
Theoretically, the more difficult or the easier a question (or item) is, the lower its
discrimination index will be.
There are many psychometric software programs to help expedite the tedious
calculation process.