Let’s clarify the terms: evaluation, assessment and testing.

 

Assessment is measuring our students’ performance and the progress they make, diagnosing the problems they have and providing learners with useful feedback.

 

Evaluation is considering all the factors that influence the learning process such as syllabus objectives, course design, materials used, methodology, teacher performance and assessment. Assessment is one of the most valuable sources of information about what is happening in a learning environment. So evaluation is an umbrella term.

 

Is testing synonymous with evaluation? No, it is a way of formal assessment, alongside oral exams of the traditional format. Assessment may also be informal, carried out by the teacher outside special test/exam conditions.

 

Is it just teachers’ responsibility to test their students? Primarily yes, but school or university administrations also have a say, and local and national authorities are responsible in the case of some regional tests (“srezy”), national exams, or the so-called external independent testing (ZNO). Besides, assessment of students’ progress may be carried out by the students themselves, as peer-assessment or self-assessment.

 

They often say that testing should take place only after learning. That is true of tests at the end of a topic, term, or year: summative assessment measures students’ performance at the end of a certain period of study. But we should not forget about formative assessment, which feeds back into learning, gives learners information about their progress throughout a course and helps them become more efficient learners. There are diagnostic tests as well, aimed at finding out students’ problem areas with the language and given at the beginning of a course, or when a new teacher takes over a course and wants to know where the students and s/he stand.

Testing as a check on learning is essential. Teaching implies giving input and guidance; testing implies the absence of the teacher’s support and some kind of evaluation. One absolutely necessary feature of testing is accountability. As professionals, teachers should be able to provide learners, parents, institutions and society in general with clear indications of what progress has been made and, if it has not, why that is so. We should be able to explain the rationale behind the way assessment takes place and how conclusions are drawn, rather than hiding behind a smoke screen of professional secrecy.

 

Assessment should be aimed at giving students a chance to show what they have learnt rather than at revealing what they have not learnt. Unfortunately, it is not always like that. Tests try to catch students out, and students feel alienated by the assessment because they have no role in it other than that of passive participants. For many learners in this situation, especially when task formats and grading criteria are not announced beforehand, assessment may seem arbitrary and at times even unfair. Sometimes they get on with their teacher, sometimes they do not. Sometimes they are lucky and revise the right material for a test, sometimes they are unlucky.

 

To change this gloomy picture, we are to analyse the tests used at schools and universities, discuss different test formats, find out how to test effectively and integrate testing into the teaching process, and, last but not least, how to enlist the help and participation of both teachers and learners and use a co-operative approach. Rather than being motivated by the threat of examinations, students should start to feel more responsible for their own progress.

 

 

  1. Tests used at secondary and tertiary levels

 

  1. Progress tests – continuous classroom assessment (formative assessment, formal assessment). They are used to find out how well the students have grasped the material covered and met the learning objectives. Quite naturally, they must be short (so as not to take much time away from teaching and learning), they must be based on samples from the material covered (to be fair), and they must check both knowledge and skills. If they concentrated on knowledge only, some learners might conclude that knowing certain words and mastering grammar rules is the main aim of learning a foreign language and that the practice and production stages are not very important.

 

  2. Diagnostic tests – used at the beginning of a new course or from time to time to find out problem areas and work out remedial activities. The content includes knowledge and skills, but some sub-skills can be in focus as well, e.g. writing letters of complaint or skimming leaflets. These tests are rather difficult to design since they are based on eliciting errors rather than correct answers or correct language.

 

 

  3. Achievement tests – formal assessment at the end of the term or year. Their aim is to see whether the students have achieved the objectives set out in the syllabus and whether they can be moved to a higher level (next form, next term or year, etc.). They are backward-oriented. The tests are centered on skills and knowledge – the material covered within the period of study. Usually they are difficult to design, and they put a lot of stress on both teachers and students. Besides, achievement tests must necessarily contain both easy and difficult tasks selected from the material learned.

 

  4. Proficiency/qualification tests – summative and final assessment. They are designed to find out what students are capable of doing in a foreign language. Some international tests (Cambridge tests, TOEFL, GRE, etc.) are of this kind. They are mainly forward-oriented. They are independent of any syllabus (or at least of any course-book, like ZNO) and include the material needed for professional work (university) or for survival in all spheres of life (personal, educational and social). They are centered on knowledge and skills, sub-skills, functions, notions and behaviour. The tests must be communicative and reflect real-life situations. It is usually very difficult to select the material – what to put in and what to leave out. Proficiency or qualification tests put a lot of stress on students.

 

 

  5. Entry/placement tests – conducted at some universities or language schools where streaming is practised. They used to be conducted at all institutions of tertiary level. The purpose of the test is to select, to filter out applicants and to place students into groups according to their language ability. It should be based on the secondary school syllabus and test both knowledge and skills. Since it is meant to discriminate, an entry or placement test must be difficult.

 

  6. Aptitude tests – administered at the end of school or used as a kind of entry requirement. Their purpose is to find out whether students have an aptitude for something, e.g. the SAT (Scholastic Aptitude Test) at American schools. They are forward-oriented towards future studies and professional activity.

 

 

  2. Approaches to testing and main kinds of testing

A teacher’s espoused theory of language learning and teaching, and the ELT approaches and methods s/he adheres to, will influence the way s/he tests. Approaches to testing reflect changing views on language, language learning and language teaching.

 

In the period before the 1930s the Intuitive approach to testing prevailed. Language was viewed as a system of rules, and the Grammar-Translation approach to teaching reigned, so, naturally, tests were centered on knowledge of grammar and translation skills. The preferred test formats were translation, essay writing and grammar analysis, with a heavy cultural and literary bias. There was no special skill or expertise in testing – only the subjective judgement of the teacher in both the setting and the marking of tests.

 

The 1950s and 60s saw a different view of language and learning. Language was seen as a system which could be broken down into a set of linguistic items (structuralism). This was the era of the Audio-Lingual method: learning, according to behaviourist psychological theory, was seen as the systematic acquisition of a set of habits. Tests of the Scientific approach reflected these views and were designed to measure learners’ mastery of separate elements of the language (grammar, vocabulary, phonology) at sound, word and sentence level, i.e. no context was provided. Skills were also tested separately, so the tests were discrete-item tests. Tests had to be objective and reliable so that the results could be analysed statistically. Multiple choice was the most common test format. For the first time in history, specially trained testing experts appeared.

 

Since the 1980s–1990s language has been seen as a complex system of skills and of linguistic and non-linguistic behaviour, as well as a means of communication. Cognitive (language as a means of acquiring knowledge) and Communicative approaches to learning have been in focus, so tests became concerned with meaning in context and communication. The integrative and communicative approach to testing came into being. Tests are to be integrative (several skills, sub-skills and language use are tested at a time); typical integrative tests are cloze, essay writing, oral interviews, etc. Tests are also to be communicative (primarily concerned with how language is used for communication). Therefore, tests tend to consist of real-life tasks, and success is judged on the basis of the effectiveness of communication, i.e. they assess language use more than language usage.

 

Before finishing with approaches and types of tests, we should clarify some more terms and concepts.

 

First of all, the terms objective and subjective tests. Both are neutral terms, though they have acquired evaluative connotations: objective – good, subjective – bad. In fact, the distinction refers only to the scoring of tests: all tests require subjective answers from candidates, test items are selected subjectively by the tester, etc. Objective tests presuppose only one correct answer and are checked against the provided key. In subjective tests it is not possible to limit all options to a single right one, so the mark is mainly impressionistic, though some criteria are taken into account.

 

Discrete-item tests are opposed to integrative tests. In discrete-item tests one language point (e.g. a test on prepositions) or one skill is tested at a time.

 

Indirect tests are opposed to direct ones.

Direct tests are those in which a candidate is asked to perform precisely the skill which we want to measure: e.g. if we want to know how well a learner can write letters, we get him or her to write a letter. The test tasks and texts used are as authentic as possible. Direct testing is easier when we want to measure the productive skills. For the receptive skills it is necessary to get learners not only to listen and read but also to demonstrate that they have done so successfully.

Indirect tests measure the abilities which underlie the skills we are interested in, e.g. discrete-point grammar tests, tests of minimal pairs (to test pronunciation), etc. One must be cautious about the claims one makes: if a learner does well on a grammar test, this does not mean s/he can communicate well.

 

 

  3. The main characteristics of tests

How can we distinguish a good test from a less good one? What makes a test good? It is necessary to work out some guidelines by reviewing test characteristics – the criteria for evaluating tests.

 

A very important feature is practicality. A test must not be too time-consuming in terms of class hours or of our own time outside the class. It should also be practical in terms of physical resources such as tape recorders and photocopiers. Finally, its preparation and administration should not cost too much money.

 

A test must have some degree of reliability, i.e. it must be consistent: under the same conditions and with the same students it should give similar results. There are several ways of making tests more reliable:

  • Administration: the same test is administered for the same length of time under the same conditions; provide uniform and non-distracting conditions of administration.

 

  • Size: the larger the sample (the more tasks learners have to perform), the more reliable the test as a whole is likely to be. Accurate information does not come cheaply: the more important the decisions based on a test are, the longer the test should be.
  • Layout and legibility: ensure that tests are well laid out and perfectly legible, tests shouldn’t be badly typed or photocopied or have too much text in too small a space.
  • Instructions: should be clear, concise and explicit. If it is possible on some occasion to misinterpret instructions, some candidates will certainly do that. It’s not always the weakest candidates who are misled by ambiguous instructions, it’s often the better candidate who is able to provide the alternative interpretation. Test writers should not rely on the students’ powers of telepathy to elicit the desired behaviour.
  • Familiarity of tasks: candidates should be familiar with both formats and testing techniques.
  • Scoring: appropriate criteria should be chosen beforehand and students should be informed of them; scoring should be objective (provide a detailed scoring key), and in the case of subjective tests reliability is achieved through standardization (agreeing acceptable responses and appropriate scores) and the training of raters (examiners).

There are various methods for measuring the reliability of a test – most of them statistical – but the simplest is test-retest (provided the learners receive equal treatment in the interval).
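As an illustration of the test-retest idea, here is a minimal Python sketch that correlates the results of two administrations of the same test; the scores are invented, and a real analysis would normally use a statistics package:

```python
# Test-retest reliability: correlate the scores from two administrations
# of the same test to the same students (all data invented for illustration).
from math import sqrt

first_run  = [34, 28, 41, 22, 37, 30]   # hypothetical scores, week 1
second_run = [36, 27, 40, 25, 35, 31]   # hypothetical scores, week 3

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# A coefficient close to 1.0 suggests the test gives consistent results.
print(f"test-retest reliability: {pearson(first_run, second_run):.2f}")
```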

 

It is also very important that our assessment has validity: we must be clear about what we want to assess and make sure that we are assessing that and not something else. For example, if we want to assess listening we must consider only understanding, and not assess our students’ ability to read or write or their ability to produce accurate language. Likewise, the test item “Is photography an art or a science?” has low validity if we wish to measure only writing ability in a General English class, since it demands some knowledge of photography.

 

There are several types of validity:

· Face validity: how acceptable a test is to the public (teachers, students, authorities, etc). The test should look right, be convincing.

· Content validity: how representative the items of the test are of the learners’ needs and the syllabus content, i.e. how adequately the expected content has been sampled.

· Concurrent validity: whether the candidates’ performance on this test is comparable with their performance on other tests, with students’ self-assessment, with teachers’ ratings etc.

· Predictive validity: whether the test predicts how well the test-taker will perform in future, e.g. at final exams.

 

Reliability and validity are constantly in conflict: the greater the reliability of a test, the less validity it has. E.g. writing down the translation equivalents of 500 words is a reliable but not a valid test of writing, while real-life tasks like letter writing have higher validity at the expense of reliability. The best way is often to devise a valid test and then establish ways of increasing its reliability. The tester has to balance gains in one against losses in the other.

 

The last characteristic is the backwash (washback) effect – the influence of testing on teaching. What and how we test often predetermines what and how we teach, or what and how learners study. The influence can be either positive or negative. There is a tendency to test what is easiest to test rather than what is most important to test, and the weighting of the different abilities tested often does not correspond to the course objectives. If we claim to teach communicatively, we cannot use tests containing mainly multiple-choice grammar items. If we include items based on one unit only instead of the three covered, the students will feel cheated. Cramming for half a year before an exam makes our teaching exam-oriented. Sometimes the effect is positive: e.g. teachers can no longer ignore letter writing in class, as it is included in the compulsory external independent evaluation.

Ways of achieving a beneficial washback effect:

  • Test the abilities whose development you want to encourage;
  • Sample widely and unpredictably;
  • Use direct testing (authentic tasks, the skills we are interested in fostering);
  • Make testing criterion-referenced. (The scoring of test results may be norm-referenced or criterion-referenced. Norm-referencing consists of putting the students in a list or on a scale depending on the mark they achieved in the test; a pass might be defined as the top 60% of students, with 40% failing. This is often used in public examinations but is not suitable for classroom testing. Criterion-referencing consists of deciding what is a pass and what is a fail before the results are obtained, so that each candidate’s performance is judged irrespective of the rest of the candidates; see the sketch after this list.)
  • Base achievement tests on the objectives of the course.
  • Always go over tests and their results with students (students should realize where they went wrong and what their strong and weak points are, and should think about what they need to do to get better results next time).
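To make the norm-/criterion-referenced contrast in the fourth point concrete, here is a minimal Python sketch; the names, marks, the 60% share and the 60-point pass mark are all invented:

```python
# Norm- vs criterion-referenced pass decisions (all data invented).
scores = {"Olha": 71, "Ivan": 58, "Petro": 90, "Dana": 44, "Yurii": 59}

# Norm-referenced: rank the candidates and pass a fixed share, e.g. the top 60%.
ranked = sorted(scores, key=scores.get, reverse=True)
norm_pass = set(ranked[:round(len(ranked) * 0.6)])

# Criterion-referenced: fix a pass mark (say 60 points) before seeing any
# results; each candidate is judged irrespective of the rest.
criterion_pass = {name for name, mark in scores.items() if mark >= 60}

print("norm-referenced pass:     ", sorted(norm_pass))       # includes Yurii (59)
print("criterion-referenced pass:", sorted(criterion_pass))  # excludes Yurii
```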

In these ways results from formal tests can feed into learning and give students, as well as the teacher, vital information about both performance and progress.

 

LECTURE 2

 

WRITING AND CHOOSING TESTS

1. Stages of test construction or selection.

2. What a test consists of. Checklists.

3. The best-known test-techniques and their analysis.

 

1. Stages of test construction or selection.

 

Whenever we start teaching a class, we are to plan an assessment task programme, that is, to plan when (in which weeks), on what (which skills, use of English) and of what length our assessment tasks will be. If possible, inform students of it beforehand.

An important step is to decide on the weighting between the different elements in the course. Your assessment should reflect your teaching and the syllabus you follow. This may seem obvious, but it is surprising how often “communicative” classes have tests which are grammar-based. This has a very negative washback effect on students: they quite naturally come to feel that while speaking and listening are good fun, what really matters is grammar.
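A minimal Python sketch of how weighting might be applied when combining component scores into a course grade; the components, weights and scores below are invented for illustration and should mirror the emphasis of the actual syllabus:

```python
# Combining course components into one grade via weighting
# (components, weights and scores are invented for illustration).
weights = {"speaking": 0.30, "listening": 0.20, "reading": 0.20, "writing": 0.30}
scores  = {"speaking": 82,   "listening": 74,   "reading": 90,   "writing": 68}  # out of 100

assert abs(sum(weights.values()) - 1.0) < 1e-9   # weights must sum to 1

final = sum(weights[c] * scores[c] for c in weights)
print(f"weighted course score: {final:.1f}/100")  # 82*0.3 + 74*0.2 + 90*0.2 + 68*0.3 = 77.8
```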

Having decided on weighting, we need to establish priorities. We cannot test everything that students have done throughout the course. We must therefore look at our syllabus and choose a sample of areas to assess formally. For example, for a class of post-elementary students (4th form): lexical areas: classroom/animals/homes/food/travel; grammar: revision and introduction of Present Simple/Continuous, Past Simple, Present Perfect, future with going to, countables; listening: listening for gist and specific information, stories, dialogues, radio programmes; writing: writing letters and postcards about their own lives, etc.

Then we are to write specifications (if we are designing a test) or to study the specifications written for the test we want to choose. The specifications should include:

· the test purpose (what kind of test it is: progress, achievement, proficiency);
· description of the test taker (young adults, small children, students of the mathematics department, applicants, etc.);
· test level (difficulty);
· construct (the theoretical framework for the test);
· description of a suitable language course or textbook;
· number of sections (papers);
· time for each section;
· weighting for each section;
· target language situation (what students need to perform the tasks for);
· text types;
· text length;
· language skills to be tested;
· language elements to be tested;
· test tasks;
· test methods;
· rubrics (instructions);
· criteria for marking;
· descriptions of typical performance at each level.
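One way to see how such specifications can be operationalized is to render a few of the fields as a data structure; the following Python sketch is purely illustrative, and every value in it is invented:

```python
# An invented slice of a test specification rendered as a data structure.
spec = {
    "purpose": "progress test",
    "test_taker": "4th-form secondary students",
    "level": "post-elementary",
    "sections": [
        {"name": "listening", "time_min": 15, "weighting": 0.25,
         "text_types": ["dialogue", "radio programme"], "items": 10},
        {"name": "writing", "time_min": 20, "weighting": 0.25,
         "tasks": ["a postcard about one's own life"],
         "criteria": ["task completion", "accuracy"]},
    ],
    "rubrics": "instructions simpler in language than the test texts",
}

# Such a structure makes it easy to check constraints mechanically,
# e.g. that the section weightings of a full specification sum to 1.
print(sum(s["weighting"] for s in spec["sections"]))  # 0.5: only two sections shown
```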

 

The next stage, in the case of test writing, is item writing and moderation (checking the items with colleagues, senior colleagues, etc.); in the case of test selection it is item analysis. Experts recommend pre-testing if possible, analysing the pretest results, rejecting bad items and creating an item bank.
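The lecture does not define the statistics used in item analysis, but two common ones are the facility value and a simple discrimination index; the sketch below, with an invented response matrix, shows how they might be computed when deciding which items to reject:

```python
# A minimal item-analysis sketch: facility value (proportion answering
# correctly) and a simple discrimination index (top third vs bottom third).
# The response matrix is invented: rows are students, columns are items,
# 1 = correct, 0 = wrong.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]

totals = [sum(row) for row in responses]
order = sorted(range(len(responses)), key=lambda i: totals[i], reverse=True)
third = len(responses) // 3
top, bottom = order[:third], order[-third:]

for item in range(len(responses[0])):
    facility = sum(row[item] for row in responses) / len(responses)
    disc = (sum(responses[i][item] for i in top)
            - sum(responses[i][item] for i in bottom)) / third
    # Very easy or very hard items (facility near 1 or 0) and items with low
    # or negative discrimination are candidates for rejection.
    print(f"item {item + 1}: facility {facility:.2f}, discrimination {disc:+.2f}")
```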

Then come scoring schemes and their analysis.

Interpreting test results, standardisation (agreement between raters on the meaning and interpretation of criteria used for assessment), setting pass marks.

And finally – improving tests, monitoring and revising.

 

    2. What a test consists of. Checklists.

Each test may include the following parts.

· Test handbook (especially for the so-called internationally recognized tests) – a publication for the stakeholders of a test (candidates, teachers) that contains information about the format and the content of the test. Format means the test structure, including the time allocated to components, the weighting assigned to each component, the number of passages presented, and the item types (elicitation procedures for oral tests) with examples.

· Test task – a separate task performed by candidates. It may be of different formats (true/false, multiple choice, short answer response, etc.).

· Item – an individual question in a test that requires the candidate to produce an answer.

· Rubric – instructions given to candidates to guide their responses to a particular test task.

· Stem – the stimulus in a multiple-choice task.

· Options – the answers from which one is to be chosen as the right one.

· Answer sheet – a special form (blank) on which students record the numbers of their selected answers, the selected options, etc.

When analysing and selecting a test, its separate test tasks and items, a teacher can use a checklist such as the following.

· Is there more than one possible answer, if it is a closed-ended test?

· Is there no correct answer?

· Is there enough context provided to choose the correct answer?

· Could a test-wise student guess the answer without reading or listening to the text?

· Does it test what it says it is going to test? (or does it test something else?)

· Does it test the ability to do puzzles, or IQ in general, rather than language?

· Does it test students’ imagination rather than their linguistic ability?

· Does it test students’ skills or content knowledge of other academic areas?

· Does it test general knowledge of the world?

· Does it test cultural knowledge rather than language?

· Are the rubrics clear and concise? Is the language in the instructions more difficult than that in the test task, task text or in each item?

· Will it be time-consuming to mark and difficult to work out scores?

· Are there any typing errors that make it difficult to do?

During your practical classes you will have a chance to analyse real tests taken from published leaflets or books and to find out what is wrong with them using the checklist.

    3. The best-known test techniques and their analysis.

No technique is good or bad in itself: there are concrete cases in which a certain technique is more appropriate and effective than others, and there are items or tasks in which a technique is simply not used correctly. Each test format is useful in its own context and less useful in others. Much will depend on what you want the test to do for you in your teaching situation.

Let us have a brief look at some test techniques, find out their advantages and disadvantages, and formulate some recommendations on writing, selecting and using them.

True/false tests

Advantages and disadvantages of true/false tests

 

Advantages:

· May be successfully used for testing reading and listening comprehension
· Scoring is reliable, economical and rapid
· Rather easy to prepare
· Cost-effective in lessons
· Easy to administer
· Does not need a professional tester
· Even a short text can provide a basis for numerous items
· Can be used as a valuable teaching device to attract Ss’ attention to the most important details in the text
· Can be used for graded learning
· All the items can be pre-tested

Disadvantages:

· May be rather subjective on the part of the test writer
· The wording of the item may be ambiguous
· It may not be clear whether a candidate failed due to lack of comprehension of the text or lack of comprehension of the item (question)
· Can encourage guessing, as the chances are 50:50
· Not reliable unless there are enough items
· Tests only receptive skills
· Not a communicative type of test
· Does not test creative writing and speaking
· Cheating may be easy

 

Recommendations for writing True/False statements (questions)

Read the recommendations for writing True/False statements and do the T/F activity which follows.

 

  1. In a scanning or skimming test, present items in the order in which the answers can be found in the text.
  2. Use simple language and write your sentences at a lower level of difficulty than the reading or listening text.
  3. It is advisable to state in the instructions the number of true and false statements.
  4. To reduce the element of guessing at the intermediate levels and above, add one more option, “Not Stated” or “Doesn’t Say” (see the sketch after this list).
  5. Do not write items for which the correct response can be found without understanding the text, simply by matching the exact words.
  6. Correct T/F responses should be adequately randomized so as not to set a response pattern.
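Recommendation 4 can be motivated with a little arithmetic; the Python sketch below (with invented numbers) shows the score a candidate can expect from blind guessing and how adding a third option reduces it:

```python
# The arithmetic behind recommendation 4 (numbers invented): the expected
# score from blind guessing drops as the number of options grows.
items = 20

for n_options in (2, 3):            # T/F vs T/F/"Not Stated"
    expected = items / n_options    # correct answers expected by pure chance
    print(f"{n_options} options: ~{expected:.0f}/{items} items right by guessing alone")
```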

 

Multiple choice tests

 

 

Advantages and disadvantages of multiple-choice tests

Advantages:
  • good for testing grammar and vocabulary
  • can successfully be used for testing reading and listening
  • helps Ss and T identify areas of difficulty
  • cost-effective at the lesson
  • scoring is objective, economical and reliable
  • does not need a professional tester
  • can be used for graded learning
  • the chance factor may be reduced by offering 5-6 options
  • may include enough items for making it reliable
  • all the items can be pretested

Disadvantages:

  • they test only receptive skills
  • there may be a gap between knowledge and use
  • item construction is difficult and time-consuming
  • may be difficult to find enough distractors
  • the wording of the item may be ambiguous
  • it may not be clear if a candidate failed due to lack of comprehension of the text or lack of comprehension of the item
  • they do not test creative writing and speaking
  • may be rather subjective on the part of the test writer
  • they may involve more guesswork than knowledge, and the guesswork may have a harmful effect on learning
  • may be psychologically far from real-life situation
  • cheating may be easy