Item Analysis of English Final Semester Test

Numerous studies have been conducted on test item analysis of English tests. However, investigation of the characteristics of a good English final semester test is still rare in several districts in East Java. This research sought to examine the quality of the English final semester test in the academic year of 2018/2019 in Ponorogo. A total of 151 samples in the form of students’ answers to the test were analysed based on item difficulty, item discrimination, and distractors’ effectiveness using the Quest program. This descriptive quantitative research revealed that the test does not have a good proportion of easy, medium, and difficult items. In terms of item discrimination, the test had 39 excellent items (97.5%), which meant that the test could discriminate between high and low achievers. In addition, the distractors could distract students, since 32 items (80%) had effective distractors. The findings of this research indicate that item analysis is an important process in constructing a test. It serves to determine the quality of the test, which directly affects the accuracy of students’ scores.


INTRODUCTION
A large number of studies have highlighted the crucial role of appropriate assessment in the success of the English teaching-learning process. The success of English teaching-learning can affect students' language proficiency. Sulistyo and Suharyadi (2018) argue that students who have good language proficiency can use that language to communicate. To achieve that success, assessment can provide information on student progress in mastering the material that has been given (Browder et al., 2006). It assists the teacher in determining the proper approach and method of teaching (Scouller, 1998). As stated in Peraturan Menteri Pendidikan Dan Kebudayaan RI Tentang Standar Penilaian Pendidikan (2016), assessment is a process of collecting and processing information to measure the achievement of students' learning outcomes in learning activities. An instrument is needed to obtain assessment information. Assessment is a process in which information is produced to monitor the improvement of students' abilities. Arikunto (2016) identified two types of assessment that can be used as instruments: tests and non-tests. Tests involve diagnostic, formative, and summative tests, whereas non-tests involve rating scales, questionnaires, checklists, interviews, observations, and biographies.
Generally, Indonesian teachers apply tests, specifically summative tests, to assess students at the end of the learning process. Brown (2004) states that a test is an instrument for measuring an individual's proficiency against particular criteria. This definition is close to that of Miller et al. (2009), who define a test as an instrument for assessing students' abilities through a set of questions within a specified time. This means that a test helps teachers evaluate students' competence, which in turn reflects students' progress. Furthermore, Brown (2004) explains that summative tests can help teachers assess students' comprehension when the learning process ends. They are one of the ways to discover students' competencies at the end of the learning process in school. Consequently, teachers should construct a good test.
A good test needs to consist of well-constructed items, which in turn will help teachers assess students' competencies accurately. It should meet at least three criteria: practicality, reliability, and validity (Brown, 2001). Practicality can broadly be defined in terms of the operating budget, time limit, implementation, and scoring system of the test. A test should be prepared on a low budget (Brown, 2001). In addition, a test should have a clear time limit and be easy to administer. Most important is spelling out a specific and efficient scoring system. With regard to reliability, the test should provide stable results in different circumstances (Fulcher & Davidson, 2007). The test result is therefore trustworthy. Whereas reliability refers to dependability, validity refers to the test's ability to measure what should be measured in accordance with the learning goals or competencies to be achieved. To ensure that a test has good quality, it must be examined through item analysis.
Previous studies have been conducted on the quality of English final semester tests, specifically in junior high schools in Indonesia (for example Amelia, 2010; Maghfiroh, 2010; Toha, 2010; Ani, 2011; Lestari, 2011; Risydah, 2014; Haryudin, 2015; Manfenrius et al., 2015; Fajriah, 2016; Haryudin & Santosa, 2016; Pradanti et al., 2018; Maghfiroh, 2019). However, no studies have conducted item analysis of English final semester tests for junior high schools in Ponorogo district. Interviews with English teachers of junior high schools in Ponorogo also showed that they often skip analyzing test items before distributing the tests to students. Therefore, the purpose of this study is to describe the quality of the English final semester test for ninth-grade students in the academic year of 2018/2019 in Ponorogo district in terms of item difficulty, item discrimination, and effectiveness of distractors. These characteristics were chosen partly because the English teachers' forum of junior high schools in Ponorogo district had already analyzed the test theoretically. This study is expected to provide feedback and an example for English teachers, educators, test developers, and others who create English tests. In addition, this study is intended to provide a reference for similar studies in the future.

Test Item Analysis
Assessment is a process in which information is produced to monitor the improvement of students' abilities. For Miller et al. (2009), assessment is a mechanism for finding out students' learning results and progress through observation, projects, and tests. As pointed out in the previous section, English teachers conduct summative tests to assess students' competencies at the end of the learning process. Teachers or test makers should construct a good test so that the results are valid and reliable. With regard to a good test, Mardapi (2015) lists nine steps for creating a high-quality test: (1) composing test specifications, (2) creating a test, (3) analyzing a test, (4) doing a trial, (5) analyzing test items, (6) correcting the test, (7) assembling the test, (8) administering the test, and (9) interpreting test results. Following those steps will help teachers or test makers produce a well-constructed test.
According to the interview results, the English teachers' forum of junior high schools in Ponorogo district does not analyze test items before distributing the test. Test item analysis is the process of identifying the quality of a test. Rosana and Setyawarno (2017) say that item analysis is a method for examining test quality in order to refine the items into well-constructed ones. In short, it is carried out to identify and analyze the quality of test items. The major purpose of this process is to build better tests by revising or dropping poor items (Boopathiraj & Chellamani, 2013; Mukherjee & Lahiri, 2015). This process is important for confirming that items are well constructed and fit the test principles. Moreover, it improves the ability of teachers or test makers to construct test items. The role of teachers or test makers is to revise or drop test items that are not appropriate.
In analyzing test items, a good test should conform to at least three characteristics, namely item difficulty, item discrimination, and effectiveness of distractors (Brown, 2004). This is done by analyzing the students' responses to each item. Test makers can carry out the analysis using two statistical theories, namely classical test theory (CTT) and item response theory (IRT) (Haladyna, 2004). Item response theory was developed as an extension of classical test theory. In classical test theory, the item difficulty index depends on the sample of test takers. In contrast, item response theory has the advantage of estimating item difficulty in relation to the estimated ability of the students (Fulcher & Davidson, 2007). Since the researchers identify item difficulty, item discrimination, and effectiveness of distractors, this study used classical test theory. Classical test theory assumes that each participant has a true score, which is the score that would be observed if the assessment instrument were free of error.
In this study, the researchers apply classical test theory using the Quest program. The Quest program is a computer-based statistics program from The Australian Council for Educational Research Limited (ACER) (Izard, 2005). This program can increase the precision of calculation compared to the manual technique. Ofianto (2018) adds that the Quest program can perform calculations based on both Classical Test Theory (CTT) and Item Response Theory (IRT). This gives the program an advantage over other computer-based statistics programs. Suyata (2016) mentions that another benefit of the Quest program is that it is more accurate than other statistical programs. In addition, this program can analyze polytomous data, dichotomous data, and combinations of the two.
The TPAtn file output in the Quest program displays the item difficulty, item discrimination, and effectiveness of distractors indices. The item difficulty index is presented as a percentage value marked with an asterisk. The discrimination index is presented as the point-biserial value marked with an asterisk. The distractors, meanwhile, are presented as the percentage of participants who choose each option. The options that serve as distractors must have a lower point-biserial value than the correct option.

Item Difficulty
Item difficulty identifies the percentage of students who answer an item correctly (Haladyna, 2004). This definition is similar to that found in Brown (2004), who writes that item difficulty relates to the percentage of students who find an item easy or difficult. This characteristic identifies whether an item is difficult or easy, so it can help teachers sort items into easy, medium, and difficult categories. Kunandar (2013) claims that a test package must contain 25% easy items, 50% moderate items, and 25% difficult items. This proportion keeps students from becoming discouraged and unenthusiastic in answering test items. Arikunto (2016) argues that difficult items make students reluctant to answer the questions.
For an item to have an ideal item difficulty, it must be neither too easy nor too difficult. The range of the item difficulty index is between 0.0 and 1.0. According to Fulcher and Davidson (2007), the ideal item difficulty index is between 0.30 and 0.70. Items with an index of less than 0.30 are difficult, while items with an index of more than 0.70 are easy. Factors which affect item difficulty are the item analysis theory used, the clarity of the questions, and the match between the test items and the materials in the syllabus (Haladyna, 2004).
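As an illustration of the index described above, the following minimal Python sketch computes the proportion of correct answers for one item and classifies it with the 0.30 and 0.70 cut-offs cited from the literature. The function names and the response data are hypothetical and are not taken from the Quest program or from the test analysed in this study.

```python
from typing import List


def item_difficulty(responses: List[str], key: str) -> float:
    """Proportion of test takers who chose the keyed (correct) option."""
    if not responses:
        raise ValueError("responses must not be empty")
    correct = sum(1 for r in responses if r == key)
    return correct / len(responses)


def classify_difficulty(p: float, low: float = 0.30, high: float = 0.70) -> str:
    """Label an item using the cut-offs cited in the text (assumed inclusive bounds)."""
    if p < low:
        return "difficult"
    if p > high:
        return "easy"
    return "moderate"


# Hypothetical answers of ten students to one item whose key is "A".
responses = ["A", "B", "A", "C", "A", "A", "D", "B", "A", "A"]
p = item_difficulty(responses, key="A")
print(p, classify_difficulty(p))
```

Treating the boundary values 0.30 and 0.70 as moderate is an assumption of this sketch; the text does not say how boundary cases are handled.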
Numerous studies have examined item difficulty in analyzing test quality (for example Amelia, 2010; Maghfiroh, 2010; Ani, 2011; Risydah, 2014; Haryudin, 2015; Manfenrius et al., 2015; Maghfiroh, 2019; Pradanti et al., 2018). Some analysts (e.g. Amelia, 2010; Ani, 2011; Maghfiroh, 2010; Risydah, 2014; Pradanti et al., 2018; Maghfiroh, 2019) have analyzed the item difficulty of English final semester tests in junior high school. These previous studies have revealed that moderate items outnumber items in the other categories. In summary, those test packages have more items that qualify as well constructed than as poorly constructed. Nevertheless, the proportion of easy, moderate, and difficult items is not balanced.
In contrast to those six previous studies, Haryudin (2015) found that difficult items outnumbered items in the other categories. That analysis indicates that the test package had more poorly constructed items than well-constructed items. A different finding also exists in research on item difficulty analysis. Manfenrius et al. (2015) analyzed three test packages from three junior high schools. In their research, six of 150 items were classified as difficult, and most items were classified as easy. Moreover, the proportion of easy, moderate, and difficult items in that research is far from ideal.
An important theme emerges from the studies discussed so far: the ideal proportion of easy, moderate, and difficult items. It is a challenge for teachers or test makers to create items in a balanced proportion. When the proportion is balanced, the items truly help teachers test their students according to the students' ability.

Item Discrimination
The second characteristic is item discrimination, which has an ideal index of more than 0.39 (Ebel & Frisbie, 1991) within a range between 0.0 and 1.0 (Hingorjo & Jaleel, 2012). This characteristic concerns identifying students' knowledge and ability (Haladyna, 2004). It helps teachers distinguish high achievers from low achievers in a class. A test item can reach the ideal index when high achievers answer it correctly more often than low achievers (Hingorjo & Jaleel, 2012). However, this characteristic depends on the number of student responses that test makers analyze (Fulcher & Davidson, 2007). The sample of responses represents the test takers' abilities. A smaller number of responses makes the calculation of item discrimination less accurate. Another significant aspect of item discrimination is that a poor discrimination index undermines reliable interpretation of students' real knowledge (Setiyana, 2016).
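One common way to express item discrimination is the point-biserial correlation between the score on one item (1 for correct, 0 for incorrect) and the total test score. The sketch below computes it as a Pearson correlation in plain Python. The function name and the data are hypothetical, and whether the Quest program removes the item from the total score before correlating is not stated in the source, so this sketch uses the uncorrected total as an assumption.

```python
import math
from typing import Sequence


def point_biserial(item_scores: Sequence[int], total_scores: Sequence[float]) -> float:
    """Pearson correlation between a 0/1 item score and the total test score."""
    n = len(item_scores)
    if n != len(total_scores) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 entries")
    mean_x = sum(item_scores) / n
    mean_y = sum(total_scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(item_scores, total_scores))
    var_x = sum((x - mean_x) ** 2 for x in item_scores)
    var_y = sum((y - mean_y) ** 2 for y in total_scores)
    if var_x == 0 or var_y == 0:
        # Everyone answered the same way, or all totals are equal.
        raise ValueError("correlation is undefined without variance")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: six students, the three highest scorers answered correctly.
item = [1, 1, 1, 0, 0, 0]
totals = [38, 35, 30, 22, 18, 15]
r = point_biserial(item, totals)
print(round(r, 2), "ideal" if r > 0.39 else "needs review")
```

A positive value means that students with higher total scores tend to answer the item correctly, which is the pattern described above for high and low achievers.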
In view of all that has been mentioned so far, one may suppose that the existing test items in those previous studies are not able to discriminate between high and low achievers. Pradanti et al. (2018) argue that teachers or test makers must create items using clear instructions and language structures. This can prevent confusion and difficulty for students while they complete the test.

Effectiveness of Distractors
Another characteristic is the effectiveness of distractors. This characteristic can only be analyzed in multiple-choice tests. A good distractor must be chosen by at least 5% of the respondents, especially those who are low achievers (Rosana & Setyawarno, 2017). In doing item analysis, test makers must analyze the effectiveness of distractors to measure how well the incorrect options function in attracting students (Brown, 2004). Distractor analysis is an important part of item analysis since it has several functions. These functions involve reducing items that use ineffective sentences or too many options, providing information to improve the items, helping to choose a suitable distractor, helping to understand students' cognitive behavior, and increasing items' response scores (Haladyna, 2004).
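The 5% criterion can be checked by counting how often each incorrect option is chosen. The following minimal Python sketch does this for one four-option item; the function names and the response data are hypothetical, and responses outside the listed options (for example omitted answers) are assumed to count toward the total number of respondents.

```python
from collections import Counter
from typing import Dict, List


def distractor_shares(responses: List[str], options: List[str], key: str) -> Dict[str, float]:
    """Share of all respondents who chose each incorrect option."""
    if not responses:
        raise ValueError("responses must not be empty")
    counts = Counter(responses)
    n = len(responses)
    return {opt: counts.get(opt, 0) / n for opt in options if opt != key}


def functioning_distractors(
    responses: List[str], options: List[str], key: str, min_share: float = 0.05
) -> Dict[str, bool]:
    """Flag each distractor as functioning when at least min_share chose it."""
    shares = distractor_shares(responses, options, key)
    return {opt: share >= min_share for opt, share in shares.items()}


# Hypothetical item with key "A": 12 chose A, 5 chose B, 3 chose C, nobody chose D.
responses = ["A"] * 12 + ["B"] * 5 + ["C"] * 3
print(functioning_distractors(responses, ["A", "B", "C", "D"], key="A"))
```

In this example option D attracts no respondents, so it would be a candidate for revision under the 5% rule.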
Previous researchers have examined the effectiveness of distractors in tests. Several researchers have reported that the effectiveness of distractors in their studies is low (e.g. Risydah, 2014; Haryudin, 2015; Manfenrius et al., 2015; Pradanti et al., 2018; A. Maghfiroh, 2019). Data from these studies showed that more than 40% of items had ineffective distractors. Considering all of this evidence, it seems that teachers or test makers should improve their ability to construct test items. Unclear language structures and unfamiliar vocabulary affect the item difficulty and item discrimination indices (Pradanti et al., 2018).

Research Design
This study used descriptive quantitative research since it aims to find out the quality of the test items of the English final semester test for grade nine students in the academic year of 2018/2019 in Ponorogo. Anderson and Arsenault (2005) state that descriptive quantitative research aims to portray the data as a whole by grouping and representing the data in tables or figures.

Population and Sample
The population of this study was the grade nine students of 74 junior high schools. The researchers employed proportionate stratified random sampling to acquire a representative sample. The sample consisted of 151 students' answer sheets from the English final semester test for grade nine students in the academic year 2018/2019 in Ponorogo. The answer sheets came from different junior high schools, which had already been divided into three ranks: top, middle, and bottom.
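In proportionate stratified sampling, each stratum contributes to the sample in proportion to its share of the population. The Python sketch below shows one way such an allocation could be computed for a sample of 151 across three school ranks. The stratum sizes are invented for illustration because the source does not report them, and the largest-remainder rounding rule is an assumption of this sketch rather than the procedure the researchers followed.

```python
from typing import Dict


def proportional_allocation(stratum_sizes: Dict[str, int], sample_size: int) -> Dict[str, int]:
    """Split sample_size across strata in proportion to stratum size.

    Uses integer arithmetic and the largest-remainder rule so that the
    allocated counts always add up to sample_size.
    """
    total = sum(stratum_sizes.values())
    if total <= 0 or sample_size < 0:
        raise ValueError("need a positive population and a non-negative sample size")
    allocation = {}
    remainders = {}
    for name, size in stratum_sizes.items():
        allocation[name], remainders[name] = divmod(size * sample_size, total)
    leftover = sample_size - sum(allocation.values())
    # Hand out the remaining units to the strata with the largest remainders.
    for name in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        allocation[name] += 1
    return allocation


# Hypothetical population sizes per school rank (not reported in the study).
strata = {"top": 2000, "middle": 5000, "bottom": 3000}
print(proportional_allocation(strata, 151))
```

Within each stratum, the allocated number of answer sheets would then be drawn at random.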

Instruments
The researchers used a blank table as the instrument in this study. The blank table follows the Quest program report of multiple-choice test item analysis. The researchers used this table to record the calculation results of the Quest program. This instrument covers three characteristics that reflect the quality of the test: item discrimination, item difficulty, and distractors. An item was accepted when it conformed to the ideal index for all three characteristics: item difficulty, item discrimination, and effectiveness of distractors. Conversely, an item was eliminated when it failed to conform to any one of these characteristics.
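The acceptance rule can be written as a simple conjunction of the three criteria. The sketch below is a minimal Python illustration; the class and function names are hypothetical, and the default cut-offs (difficulty between 0.30 and 0.70, discrimination above 0.39) are the values cited in the literature review, which are assumed here to be the ideal indices applied in the study.

```python
from dataclasses import dataclass


@dataclass
class ItemStats:
    difficulty: float            # proportion of correct answers, 0.0 to 1.0
    discrimination: float        # point-biserial value of the correct option
    distractors_effective: bool  # outcome of the distractor analysis


def is_accepted(
    item: ItemStats,
    min_difficulty: float = 0.30,
    max_difficulty: float = 0.70,
    min_discrimination: float = 0.39,
) -> bool:
    """Accept an item only when it meets all three criteria at once."""
    return (
        min_difficulty <= item.difficulty <= max_difficulty
        and item.discrimination > min_discrimination
        and item.distractors_effective
    )


# Hypothetical items: the second fails only on difficulty and is eliminated.
print(is_accepted(ItemStats(0.55, 0.48, True)))
print(is_accepted(ItemStats(0.82, 0.48, True)))
```

Because the rule is a conjunction, an item that fails any single criterion is eliminated even if it performs well on the other two.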

Data Analysis Procedures
To analyze the data, the researchers used the Quest program to calculate the item difficulty, item discrimination, and distractor indices. The answer key and the students' responses to the test package were typed into a Notepad file. Afterward, the researchers created a control file, also in Notepad, containing the commands to analyze the data. The control file must be placed in the same location as the Quest program. Then, the researchers ran the program and typed the word 'submit' followed by the name of the control file. The program then automatically created an output file which provided the calculation of the item difficulty, item discrimination, and distractor indices. The item difficulty index was presented as a percentage value marked with an asterisk. The index ranged from 0.00 to 1.00. The discrimination index was presented as the point-biserial value marked with an asterisk. A good discrimination index was positive. The distractors, in turn, could be examined from the percentage of students who selected each option. A distractor was effective when its point-biserial value was lower than the point-biserial value of the correct option.
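The distractor rule stated at the end of this procedure compares the point-biserial value of every incorrect option with that of the correct option. A minimal Python sketch of the comparison follows; the function name and the values are hypothetical and do not come from the Quest output of this study.

```python
from typing import Dict


def distractors_effective(option_pt_biserial: Dict[str, float], key: str) -> bool:
    """True when every distractor has a lower point-biserial value than the key."""
    key_value = option_pt_biserial[key]
    return all(
        value < key_value
        for option, value in option_pt_biserial.items()
        if option != key
    )


# Hypothetical point-biserial values for the four options of one item (key "B").
item = {"A": -0.21, "B": 0.45, "C": -0.18, "D": -0.12}
print(distractors_effective(item, key="B"))
```

An item would fail this check if any distractor correlated with the total score as strongly as, or more strongly than, the correct option.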

FINDINGS
The English final semester test for grade nine students in the academic year of 2018/2019 in Ponorogo consisted of 40 multiple-choice items with 4 options each. Quantitative analysis was conducted to identify the quality of the test items based on item difficulty, item discrimination, and distractors using the Quest program. In general, the findings revealed that the indices of item difficulty, item discrimination, and distractors are very high. The findings were judged as follows: items were accepted when they conformed to all three characteristics, and items were eliminated when they failed to conform to any one of the three characteristics.

Item Difficulty
The researchers calculated the item difficulty based on the students' responses. Table 1 displays the item difficulty index from the Quest program in relation to the difficulty level of the items.

Item Discrimination
The next characteristic is shown in Table 2, which presents the Quest program calculation of item discrimination. The calculation reveals that 97.5% of the items (39 items) have a very good discrimination index and 2.5% (1 item) need minor revision. Hence, most of the items are accepted and can be kept in an item bank. Only one item should be improved through minor revision.

Effectiveness of Distractors
For the final characteristic, the researchers analyzed the effectiveness of distractors, focusing on the point-biserial values of the options. Table 3 shows the distribution of distractors. As Table 3 clearly presents, most of the distractors of the English final semester test in the academic year of 2018/2019 in Ponorogo are effective in distracting the students.
What is interesting about the data from the Quest program is that 32 items (80%) have effective distractors and the other 8 items (20%) have ineffective distractors. The results indicate that most of the items can distract students effectively.

DISCUSSION
This study set out to identify the quality of the English final semester test in the academic year of 2018/2019 in Ponorogo based on item difficulty, item discrimination, and the effectiveness of distractors. In the current study, most of the 40 test items are acceptable in terms of item difficulty: 37 items are moderate and 3 items are easy. The ideal test involves 25% easy items, 50% moderate items, and 25% difficult items (Kunandar, 2013). The results of this study show that the test package does not have a proportional distribution of item difficulty. Alderson et al. (1995) said that this condition cannot reveal the students' exact ability. The results involve far more items in the moderate category than in the easy and difficult categories. This is consistent with Brown (2004), who argued that a well-constructed item cannot be too easy or too difficult. A test package should nevertheless cover each difficulty level so that teachers can recognize the abilities of each student. By contrast, Haider et al. (2012) argued that a dominant medium category could indicate that the students have good comprehension in answering the test, since more than half of the students answer the items correctly. This may be related to the absence of difficult items in the test package.
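The gap between the reported distribution and the 25%, 50%, 25% proportion attributed to Kunandar (2013) can be made concrete with a short calculation. The Python sketch below uses the counts reported in this study (3 easy, 37 moderate, and no difficult items out of 40); the variable and function names are illustrative only.

```python
# Ideal proportion cited from Kunandar (2013) and the counts reported in this study.
ideal_share = {"easy": 0.25, "moderate": 0.50, "difficult": 0.25}
observed_count = {"easy": 3, "moderate": 37, "difficult": 0}


def compare_distribution(ideal, observed):
    """Return, per category, the ideal item count and the observed share."""
    total = sum(observed.values())
    return {
        category: {
            "ideal_count": ideal[category] * total,
            "observed_count": observed[category],
            "observed_share": observed[category] / total,
        }
        for category in ideal
    }


for category, row in compare_distribution(ideal_share, observed_count).items():
    print(category, row)
```

For a 40-item test, the ideal proportion corresponds to 10 easy, 20 moderate, and 10 difficult items, whereas the observed shares are 7.5%, 92.5%, and 0%.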
These results are comparable to those of other studies (e.g. Amelia, 2010; Maghfiroh, 2010; Ani, 2011; Risydah, 2014; Haryudin, 2015; Manfenrius et al., 2015; Pradanti et al., 2018; Maghfiroh, 2019), although the test conditions are not the same. These previous studies reported a disproportionate distribution of easy, moderate, and difficult items. There is a possible explanation for these results. Item difficulty can be influenced by cognitive factors (Sung et al., 2015). Cognitive factors involve comprehension, coding, transition, scrutinizing, and working memory (Danili & Reid, 2006). Danili and Reid added that cognitive factors affect students' performance and achievement, so these factors affect the calculation of item difficulty.
These results are also likely to be related to the factors which can affect item difficulty, namely the theory of item analysis used (classical test theory or item response theory), the clarity of the item instructions, and the match between the material and the items (Haladyna, 2004). The statistical theory used in analyzing the quality of items can affect the accuracy of the resulting indices. Furthermore, the instructions of the items affect the students' comprehension, which in turn affects their answers. Students might give incorrect answers when the questions contain unclear instructions. Last, the match between the materials and the questions also affects item difficulty. Students would find it difficult to answer questions that are not in accordance with the material studied in class. In short, teachers or test makers should pay attention to these factors to achieve a balanced item difficulty.
On the question of item discrimination, the results are very good. The Quest program calculation revealed that 97.5% of the items (39 items) had an excellent discrimination index and 2.5% (1 item) had a poor discrimination index. These results indicate that the 1 poor item needs revision. It is interesting to note that most of the test items in this study can be kept in an item bank and used for further tests. This accords with Ebel and Frisbie (1991), who state that a high item discrimination index is influenced by a moderate item difficulty index. As stated in the Quest program result, 92.5% of the test items are moderate items. The high item discrimination index allows the test items to discriminate between high and low achievers.
Having discussed item discrimination, the researchers now turn to the effectiveness of distractors, which is determined by analyzing the distractors. The results of this study showed that 32 items (80%) have effective distractors and 8 items (20%) have ineffective distractors. The levels found in this study are well above those of previous studies (see Risydah, 2014; Haryudin, 2015; Manfenrius et al., 2015; Pradanti et al., 2018; A. Maghfiroh, 2019). These previous studies found that more than 40% of items had ineffective distractors.
These results may be explained by the fact that the item discrimination index may have been an important factor in the effectiveness of distractors. As stated previously, most of the test items are able to discriminate between high and low achievers. It can therefore be assumed that high item discrimination can lead to effective distractors (Kheyami et al., 2018). In addition, the number of distractors affects the functionality of the options. An item should have at least three distractors to work well (Haladyna, 2004; Rodriguez, 2005; Kheyami et al., 2018). Since this test package has three distractors in each test item, most of the items have effective distractors. Creating reasonable distractors and reducing ineffective distractors are important for increasing the quality of test items (Rodriguez, 2005).
According to these results, we could infer that the test package was a good test. A total of 31 items (77.5%) were well constructed since they conformed to the characteristics of item analysis, while the other 9 items (22.5%) were poorly constructed since they did not conform to those characteristics. A good test is able to reveal students' performance accurately (Quaigrain & Arhin, 2017). It indicates that the items match the material being studied (Gareis & Grant, 2015). A good test can build an effective and comfortable classroom learning atmosphere because the teachers recognize the students' needs and abilities. It also reveals the specific topics or materials which need more emphasis or clarity. Moreover, students' higher-level cognitive skills can be assessed (Quaigrain & Arhin, 2017). Mukherjee and Lahiri (2015) proposed that a well-constructed item is capable of assessing higher-level cognitive skills such as knowledge, application, analysis, and synthesis. Furthermore, a good test makes it easier for teachers to assess students' performance level and provides consistent scores (Hotiu, 2006).
To achieve consistent scores, teachers and test makers need to improve their assessment literacy because assessment is a complex, dynamic, and continuous process (Xu & Liu, 2009). Teachers and test makers who have good assessment literacy are able to construct and apply tests with a high level of validity and reliability on a continuing basis (Gareis & Grant, 2015).

CONCLUSION
The Quest program results provide evidence that, in general, the test is a good test. Although several items must be revised or replaced with new items, most of the items qualify as well-constructed items. The poor items may be influenced by other causes such as students' level of understanding, ambiguous instructions, difficult materials or topics, and ambiguity in the options or even in the answer key. Although this study has several advantages, it also has several limitations, such as the small number of variables and the limited data. Opportunities to offer seminars on constructing test items need to be found. These results may serve teachers or test makers as effective feedback for changing the way they construct test items. Moreover, the way teachers teach and the atmosphere of teaching-learning activities can be improved. In future studies, other researchers should add other techniques of test item analysis to compare the results. Other researchers should also complement the study with qualitative analysis to obtain deeper findings. Students' own reasoning may be included to determine the difficulty level of items more accurately and to improve the solutions to the problems.