Student Performance Assessments

The Point of Assessment. When you evaluate something, you assess its value: you make a judgment about its goodness relative to some standard.

For an assessment to be useful, the standard for judgment must be both valid and reliable. In other words, the standard of goodness must be worthy. We discuss validity and reliability as they relate to our assessments below.

As you put together the assessment plan for your course, please include OME’s Student Performance and Evaluation office in your deliberations. David Golay has deep expertise in this area and can help you plan evaluations that are valid, reliable, and relevant for both short- and long-term evaluation purposes. In addition, the Student Performance and Evaluation office can bring sophisticated analysis tools to bear to help you think about your assessments meaningfully.

Types of Student Performance Assessments

Student performance assessment measures actual student performance against a specified set of expertise criteria. For these assessments we select performance standards from both STEP exam criteria and physician performance norms. The twin standards of student success in our program are:

  • Excellent performance on the “gateway” exams (STEP 1, STEP 2 CK, and STEP 2 CS), and
  • Excellent performance as an “undifferentiated intern” physician.   

In other words, learning success for our students is both academic as measured by the gateway exams, and practical as demonstrated during their clerkships and residencies. 


Formative versus Summative Assessment    

Formative Assessment. Formative assessments gather information about student learning as the learning is taking place (Anderson & Krathwohl, 2001). We conduct formative assessments to help us plan our immediate next steps in instruction and to improve the quality of instruction as it progresses. A five-question clicker quiz during a lecture is a good example of a formative student performance assessment.

Summative Assessment. Summative assessments focus on measuring learning after learning should have occurred. We generally use summative assessments to assign grades for course work.


Specific Assessments Used at EVMS    

Classroom Quizzes.    These are formative in nature and serve the purpose of helping instructors monitor student learning and make "course corrections" during a course.   These can take the form of clicker quizzes or more formal computer-based quizzes.  Lab quizzes using clickers may also be used in a formative fashion to gauge student progress or instructional effectiveness.       

Formal Course Exams. Exams are, by definition, summative in nature. Periodically during a course, exams are administered to monitor and report student learning. These exams are often vignette-based multiple-choice tests that approximate the types of questions students will face on STEP exams, testing both knowledge and practical application of our course materials. We often use Shelf Exams or other sources that use STEP-style questions.

Practical Exams. Practical exams are summative assessments. In addition to formal course exams, many courses employ practical assessments. For example, Introduction to the Patient incorporates evaluations with simulated patients or other simulations to measure student learning. Other courses, such as Anatomy, incorporate other types of laboratory exams. In all of these cases, the goal is to evaluate student learning in a realistic setting that relates to the end performance of a physician in clinical practice.

Alternative Assessment Techniques.     A good assessment addresses the required performance directly and is not contingent upon the mode of administration.   For example, if the students must demonstrate a knowledge of the interrelationships of brain structures, an appropriate assessment may be to have them draw a cross-section of the brain and label the parts.   Or, if the objective is that students diagram the relationships of components of a local health care network, actually drawing a diagram is an appropriate assessment.   In fact, these types of assessments in which learners must construct or create something that demonstrates the required knowledge are often fruitful learning experiences since this act of production serves to strengthen the fragile schema of new knowledge.  


Why Using STEP Exam Style Questions is a Good Idea...

You may have noticed that we often encourage instructors to employ quiz and test questions that mimic the style of questions on the STEP exams.   These questions often use a clinical vignette that tests the application of foundational science in a clinical setting.   We encourage this idea of imitating STEP style questions for two reasons: 1) placing an application of foundational science in a clinical setting approximates the real world of a physician so it enhances transfer of what we teach to the practice of medicine, and 2) frankly, it prepares our students for the STEP exams.   

There is no getting around the reality that the STEP exams are, in fact, gateway exams that our students must pass to move on in their careers in medicine. They must learn the content and the pattern of STEP exam questions. But further, by most standards of judgment, the STEP exams are worthy exams. That is, they are designed to place the knowledge gained in medical school into the clinical setting. The use of clinical vignettes not only makes the exams seem more relevant, it makes them more valid.


Validity and Reliability of Evaluations

We have said that an excellent evaluation is both valid and reliable.   What does that mean?

Validity. Validity means that a test measures what it says it measures. In our world, this translates to the test measuring performance against the actual course objectives.

For example, suppose a course objective read: “Given a regulation college football, the student will throw the ball through a 24” diameter target from 15 yards away with 95% accuracy.” You could sit the students down, run them through a 40-hour lecture on throwing the football, and then give them a 20-question multiple-choice test to measure whether or not they got it.

While the answers to such a test might hold some mild interest, it would not be a valid test of the objective performance. A valid test would be to suspend a 24” round hoop 4 feet above the ground and have students, after they have completed the training, stand at a point 15 yards away and throw regulation college footballs through the target. You could have them throw the ball 20 times. A student who puts at least 19 of the 20 throws (95%) through the hoop passes the test.
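The pass criterion in this example is simple arithmetic, and a short sketch makes the threshold explicit. The function name and its parameters are hypothetical, chosen only for illustration. Integer arithmetic is used so that the 95% comparison is exact.

```python
def meets_accuracy(hits, attempts, required_percent=95):
    """Return True if hits/attempts is at least required_percent.

    Hypothetical helper for illustration. The comparison is done in
    integers so there is no floating-point rounding at the threshold.
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return hits * 100 >= required_percent * attempts


print(meets_accuracy(19, 20))  # True: 19 of 20 is exactly 95%
print(meets_accuracy(18, 20))  # False: 18 of 20 is 90%
```

Note that with only 20 throws, a single additional miss moves a student from passing to failing, which is one reason the number of attempts matters when you set a performance standard.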

A valid test measures what you intended to measure.  

Validity is not always easy to measure.  Morrison, Ross, Kalman and Kemp (2011) propose that the two most important validity types for our type of educational context are face validity and content validity. 

Face validity is established by the judgment of experts that the test items are a good measure of the performance required by the learning objective.  

Content validity is a more systematic examination of the items on the test, counting the objectives tested against all the objectives taught.   In this way, a judgment can be made regarding whether or not the test measured a representative sample of the objectives.   
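The counting step behind content validity can be sketched in a few lines. This is a minimal illustration, not a tool used by the Student Performance and Evaluation office: it assumes each test item has been tagged with the one objective it measures, and the objective labels ("OBJ-1" and so on) are hypothetical.

```python
def objective_coverage(objectives_taught, item_objectives):
    """Compare the objectives tested against the objectives taught.

    objectives_taught: labels of every objective taught in the course.
    item_objectives:   one label per test item (the objective it measures).
    Returns (fraction of taught objectives tested, list of untested ones).
    """
    taught = set(objectives_taught)
    if not taught:
        raise ValueError("at least one taught objective is required")
    tested = set(item_objectives) & taught
    untested = sorted(taught - tested)
    return len(tested) / len(taught), untested


# Hypothetical course with four objectives and a five-item test.
taught = ["OBJ-1", "OBJ-2", "OBJ-3", "OBJ-4"]
items = ["OBJ-1", "OBJ-1", "OBJ-3", "OBJ-4", "OBJ-3"]

coverage, untested = objective_coverage(taught, items)
print(coverage)  # 0.75
print(untested)  # ['OBJ-2']
```

A count like this only shows which objectives were sampled. Whether the sample is representative is still a judgment call.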

The office of Student Performance and Evaluation can help you make these judgments in a principled fashion.

Reliability. Reliability simply refers to whether or not a test yields consistent results under similar conditions. And while reliability may be easier to define than validity, establishing it involves more complex analysis methods.

Measures of reliability are derived from statistical analysis of item-by-item student performance data.   Here are some of the general rules of thumb regarding reliability:

  • Generally, the more items you have addressing each objective, the more reliable the test will be.
  • If tests are administered in a standardized way, they tend to be more reliable.   This is a main reason LCME (and OME) strongly advise using proctors from the office of Student Performance and Evaluation to administer exams.
  • Everyone should be tested under the same conditions.
  • The scoring method should be clear and repeatable.
  • Finally, methods such as the test-retest method, the split-half method, and others use statistical analysis to establish item reliability.
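As an illustration of the split-half method named in the list above, here is a minimal sketch. It assumes item-by-item scores recorded as 1 (correct) or 0 (incorrect), splits the items into odd and even positions, and applies the standard Spearman-Brown formula to estimate full-test reliability. The function names and the sample scores are hypothetical, and a real analysis would use a larger data set and dedicated statistical software.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a list has no variance")
    return cov / sqrt(var_x * var_y)


def spearman_brown(reliability, length_factor):
    """Predicted reliability if the test were length_factor times as long,
    assuming the added items are comparable to the existing ones."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


def split_half_reliability(scores):
    """scores: one list of 0/1 item scores per student.
    Correlates odd-position and even-position half scores, then corrects
    the half-test correlation up to full test length."""
    first_half = [sum(row[0::2]) for row in scores]
    second_half = [sum(row[1::2]) for row in scores]
    return spearman_brown(pearson(first_half, second_half), 2)


# Hypothetical scores: five students, six items each.
scores = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
]
print(round(split_half_reliability(scores), 2))

# Doubling a test whose reliability is 0.60 predicts about 0.75,
# which is the arithmetic behind "more items, more reliable".
print(round(spearman_brown(0.60, 2), 2))
```

The same pearson helper, applied to total scores from two administrations of the same test to the same students, gives a test-retest estimate.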



Anderson, L. W., & Krathwohl, D. R. (Eds.). (2001). A taxonomy for learning, teaching, and assessing: A revision of Bloom's Taxonomy of educational objectives. New York, NY: Longman.

Morrison, G. R., Ross, S. M., Kalman, H. K., & Kemp, J. E. (2011). Designing effective instruction (6th ed.). Hoboken, NJ: John Wiley & Sons, Inc.