In this article analysis, you read the provided article and answer the questions asked. The rubric for grading is also attached.

Article Analysis #2

Carefully read “The Relationship Between Teacher Performance Evaluation Scores and Student Achievement: Evidence from Cincinnati” from the Peabody Journal of Education by Milanowski. Then, provide succinct answers to the following questions. To help you: Empirical Bayes (EB) estimation is one method to estimate teacher “value-added” in terms of student achievement. Residuals are the difference between the observed value of a variable and the predicted value of that variable. In the Milanowski study, EB intercept residuals are used by the author as a measure of student learning.
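The residual definition above can be illustrated in a few lines. This is a generic sketch with invented numbers, not data from the study:

```python
import numpy as np

# Hypothetical predicted and observed test scores for five students.
predicted = np.array([410.0, 395.0, 430.0, 402.0, 418.0])
observed = np.array([415.0, 390.0, 428.0, 410.0, 418.0])

# Residual = observed value minus predicted value.
# A positive residual means the student outperformed the prediction.
residuals = observed - predicted
```

Averaging such residuals over a teacher's students is the basic idea behind using them as a measure of student learning attributable to the teacher.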

1. What was/were the research question(s)?

2. Identify the variables involved in this study.

3. What variables were used to calculate correlation coefficients?

4. What was/were the major finding(s)?

5. Were any limitations described in the study?


PEABODY JOURNAL OF EDUCATION, 79(4), 33-53 Copyright © 2004, Lawrence Erlbaum Associates, Inc.

The Relationship Between Teacher Performance Evaluation Scores and Student Achievement: Evidence From Cincinnati

Anthony Milanowski Consortium for Policy Research in Education University of Wisconsin-Madison

In this article, I present the results of an analysis of the relationship between teacher evaluation scores and student achievement on district and state tests in reading, mathematics, and science in a large Midwestern U.S. school district. Within a value-added framework, I correlated the difference between predicted and actual student achievement in science, mathematics, and reading for students in Grades 3 through 8 with teacher evaluation ratings. Small to moderate positive correlations were found for most grades in each subject tested. When these correlations were combined across grades within subjects, the average correlations were .27 for science, .32 for reading, and .43 for mathematics. These results show that scores from a rigorous teacher evaluation system can be substantially related to student achievement and provide criterion-related validity evidence for the use of the performance evaluation scores as the basis for a performance-based pay system or other decisions with consequences for teachers.

Some portions of this article were presented at the 2003 annual meeting of the American Educational Research Association, Chicago, on April 21. Since that time, some additional data have been collected and some results have changed. The research reported herein was supported in part by a grant from the U.S. Department of Education, Office of Educational Research and Improvement, National Institute on Educational Governance, Finance, Policymaking, and Management to the Consortium for Policy Research in Education (CPRE) and the Wisconsin Center for Education Research, School of Education, University of Wisconsin-Madison (Grant OERI-R3086A60003). The opinions expressed are those of the author and do not necessarily reflect the view of the National Institute on Educational Governance, Finance, Policymaking, and Management, Office of Educational Research and Improvement, U.S. Department of Education; the institutional partners of CPRE; or the Wisconsin Center for Education Research.

I thank Dr. Elizabeth Holtzapple, Director of Research, Evaluation, and Test Administration for the Cincinnati Public Schools, for her help in obtaining and interpreting the data used in this article.

Requests for reprints should be sent to Anthony Milanowski, University of Wisconsin-Madison, Consortium for Policy Research in Education, 1025 West Johnson Street, Madison, WI 53706. E-mail:

The literature reviewed by Odden, Borman, and Fermanich (2004/this issue) has suggested that teachers have substantial impacts on student learning and that teacher classroom practices are likely to be an important pathway for these effects. Odden et al. (2004/this issue) proposed that the scores from well-designed, performance-based teacher evaluation systems may provide a measure of important teacher behaviors that can be used in a comprehensive model of teacher, classroom, and school effects on student achievement. In this article, I provide evidence for the potential usefulness of such scores by examining the relationship between teacher evaluation scores in a school district with a rigorous, standards-based teacher evaluation system and a value-added measure of student achievement. The existence of a positive, substantial, and statistically significant relationship would be evidence that certain teacher behaviors have important impacts on student learning. This relationship would also provide evidence of the validity of these scores as the basis for administrative decisions with consequences for teachers.

Standards-Based Teacher Evaluation

It may seem unusual to think of teacher evaluation systems as a source of information on teacher instructional behavior that affects student learning. As a measurement process, the reputation of teacher evaluation is not particularly good. For example, Peterson (2000) concluded from his review of the literature that typical teacher evaluation practices neither improve teachers nor accurately represent what happens in the classroom. Darling-Hammond, Wise, and Pease (1983) characterized teacher evaluation methods as generally of low reliability and validity. Others have criticized teacher evaluation as superficial (Stiggins & Duke, 1988) or as based on simplistic criteria with minimal relevance to what teachers need to do to enhance student learning (Danielson & McGreal, 2000). Medley and Coker (1987) reviewed studies from the 1950s to 1970s and concluded that the relationship between principal ratings of teacher performance and student achievement was generally weak. Their own study found correlations between principal performance ratings and learning gains of .10 to .23.

In the 1990s, an interest in making teacher assessment more performance based and reflective of a more complex conception of teaching guided the development of more sophisticated teaching assessment systems, including those used by the National Board for Professional Teaching Standards and the PRAXIS III licensure assessment (Porter, Youngs, & Odden, 2001). At the school district level, a related strategy based on explicit and detailed standards for teacher performance that try to capture the content of quality instruction has attracted interest. Consistent with the movement for standards for students, this approach has been called standards-based teacher evaluation. Danielson and McGreal (2000) described a comprehensive approach to standards-based evaluation. It starts with a comprehensive model or description of teacher performance reflecting the current consensus on good teaching and translates this into explicit standards and multiple levels of performance defined by detailed behavioral rating scales. It also requires more intensive collection of evidence, including frequent observations of classroom practice and use of artifacts such as lesson plans and samples of student work, to provide a richer picture of teacher performance. One set of standards on which several district-level evaluation systems have been based is the Framework for Teaching (Danielson, 1996). According to Danielson, the Framework was intended to reflect current views of “best practice” in instruction, a pedagogical model including insights from both effective teaching research and from constructivist/authentic approaches. The higher levels of performance in this model describe teaching practice that is active, consistent with curriculum standards, differentiated, inclusive, engages students, aims at developing a community of learners, and incorporates teacher reflection. Teaching in this way is assumed to lead to higher levels of student achievement.

Standards-based teacher evaluation systems based on the Framework for Teaching would appear to have the potential to provide measurements of teacher practice that would be more strongly related to student learning. One jurisdiction that has implemented standards-based evaluation is the Cincinnati Public School District. As described next, the district developed a standards-based teacher assessment system as the foundation for both teacher evaluation and knowledge and skill-based pay. In this article, I provide a brief description of the Cincinnati evaluation system, set out a conceptual framework for the assessment of the validity of inferences based on evaluation scores for use in making decisions about teachers or as measures of teacher behaviors, and then present the results of an analysis of the relationship between teacher evaluation scores and measures of student achievement.

Performance Evaluation System in Cincinnati

Cincinnati Public Schools (CPS) is a large urban district with 48,000 students and 3,000 teachers in more than 70 schools and programs. Its average level of student achievement is low compared to the surrounding suburban districts, and a high proportion of the students are eligible for free or reduced-price lunch. CPS has also had a history of school reform activity, including the introduction of new whole-school designs (e.g., Success for All), school-based budgeting, and teams to run schools and deliver instruction. The union-management relationship has generally been positive. Teachers have generally been paid more than in surrounding districts, giving the potential to attract better teachers. As in many other urban districts, state accountability programs and public expectations have put pressure on the district to raise average student test scores.

In response to state-level changes in teacher licensing requirements, the obsolescence of the existing teacher performance evaluation system, and ambitious goals for improving student achievement, the District designed a knowledge and skill-based pay system and new teacher evaluation system during the 1998-99 school year. (See Kellor & Odden, 2000, for a description of the design process.) Both were based on a teacher performance assessment process that I describe next. The assessment system was piloted in the 1999-2000 school year and was used for teacher evaluation districtwide in the 2000-01, 2001-02, and 2002-03 school years. An assessment of teacher reactions to the pilot was done by Milanowski and Heneman (2001).

The assessment system was based on a set of teaching standards derived from the Framework for Teaching (Danielson, 1996). Sixteen (later 17) performance standards were grouped into four domains: planning and preparation (Domain 1), creating an environment for learning (Domain 2), teaching for learning (Domain 3), and professionalism (Domain 4). For each standard, a set of behaviorally anchored rating scales called rubrics described four levels of performance (unsatisfactory, basic, proficient, and distinguished). Teachers were evaluated using the rubrics based on two major sources of evidence: six classroom observations and a portfolio prepared by the teacher. Four classroom observations were made by a teacher evaluator hired from the ranks of the teaching force and released from classroom teaching for 3 years. Building administrators (principals and assistant principals) did the other two observations. The portfolio included



artifacts such as lesson and unit plans, attendance records, student work, family contact logs, and documentation of professional development activities. Based on summaries of the six observations, teacher evaluators made a final summative rating on each of the standards in Domains 2 and 3, whereas building administrators rated on the standards in Domains 1 and 4, primarily based on the teacher portfolio. Standard-level ratings were then aggregated to a domain-level score for each of the four domains. The full assessment system was used for a comprehensive evaluation of teachers in their 1st and 3rd years and every 5 years thereafter. A less intensive annual assessment was done in all other years, conducted only by building administrators and based on more limited evidence. The annual assessment was intended to be both an opportunity for teacher professional development and an evaluation for accountability purposes.

Both teachers and evaluators received considerable training on the new system. Evaluators were trained using a calibration process that involved rating taped lessons using the rubrics and then comparing ratings with expert judges and discussing differences. To ensure consistency among evaluators, the district eventually required that all evaluators, including principals, meet a standard of agreement with a set of reference or expert evaluators in rating videotaped lessons. Only those who met the standard were to be allowed to evaluate after the 2001-02 school year.

The performance evaluation system was designed in part to provide the foundation for the knowledge and skill-based pay system (Odden & Kelley, 2001). This system defined career levels for teachers with pay differentiated by level. The new pay system was originally scheduled to come into effect in the 2002-03 school year, resulting in relatively high stakes for many of the district’s teachers. However, the link between the evaluation system and the pay system was voted down by teachers in a special election held in May 2002. The evaluation system has continued to be used for new teachers and some veterans. For beginning teachers (those evaluated in their 1st or 3rd years), the consequences of a poor comprehensive evaluation could be termination. For tenured teachers, the consequences of a positive evaluation could include eligibility for the step increases at some levels and eligibility to become a lead teacher. A poor evaluation could lead to placement in the peer assistance program and eventual termination.

Conceptual Framework for Inferences About Teacher Evaluation Scores

Figure 1 helps explicate the use of empirical evidence of a relationship between teacher evaluation scores and measurements of student achievement to support the use of the scores for administrative purposes and for research on teacher effects on student learning. The figure (see Schwab, 1980; see also Binning & Barrett, 1989) distinguishes between the construct level, at which the relevant attributes or characteristics of teachers and students are represented, and the operational level, at which the measurements of these constructs are represented.

[Figure 1 diagram: at the construct level, Teacher Behavior and Student Achievement; at the operational level, Teacher Evaluation Rating and Value-Added Measurement.]

Figure 1. Inferential relationships involving use of evaluation scores in research and practice. Note. Adapted from Research in Organizational Behavior (Vol. 2), by B. M. Staw and L. L. Cummings (Eds.), 1980, Greenwich, CT: JAI. Copyright 1980 by Elsevier.

The figure shows five linkages:

1. The relationship between the evaluation scores and the teacher behaviors or performance they represent.

2. The theorized causal relationship between teacher behaviors and student achievement.

3. The relationship between student achievement and value-added measurements based on test scores.

4. The empirical relationship between evaluation scores and the value-added measurements.

5. The inference that differences in teachers’ evaluation scores are related to differences in student learning.

Linkages 1 and 3 can be thought of as the construct validity of the measurements that represent the constructs.

District administrators intending to use teacher evaluation scores for making decisions with consequences for teachers are primarily interested in “validating” the evaluation scores.¹ They want to be justified in inferring that teachers with high scores are better performers, defined as producing more student learning (inference in Linkage 5). Evidence for the validity of this inference would be provided by a substantial empirical relationship between performance scores and the value-added measurements of student achievement (Linkage 4), assuming that the value-added measurements adequately represent student learning (Linkage 3). The latter link, the construct validity of the value-added measurements as representations of student learning, is more or less taken as given or as trivial because test scores are typically defined by accountability systems or external constituencies as the important indicators of student learning and because many districts face considerable pressure to raise test scores from state or federal accountability systems. In this context, Linkage 4 provides what is often called criterion-related validity evidence, so called because one performance measure (student achievement) is thought of as closer to the ultimate goal of the organization and so considered the criterion by which the measurements to be validated are judged (see Messick, 1989). To the extent that teacher evaluation scores are empirically related to measures of student achievement, and thus the scores distinguish between teachers who help produce more or less student achievement, district administrators have evidence to justify their inference that some teachers are better performers than others and for the use of the scores to make decisions affecting teachers. Note that in this use of teacher evaluation scores, Linkage 1 is not of primary concern once the decision has been made to adopt a particular teacher evaluation system.

Researchers interested in measuring school, classroom, and teacher effects on student learning are also interested in Linkage 4, the empirical relationship between evaluation scores and value-added measurements of student achievement. However, for them the importance of this relationship is that it provides evidence for the existence and magnitude of a causal effect of teacher behavior on student learning. In this case, the adequacy of the scores as representations of teacher behavior and of value-added measurements of student achievement on tests as representations of student learning, their construct validity, is also of importance. Like district administrators, researchers may judge that the value-added measurements are adequate indicators of student learning, especially in the absence of other practical candidates for this role and given the policy importance of these indicators. The relationship between teacher behavior and evaluation scores deserves additional consideration, however. The first issue is how well the dimensions of performance and definitions of performance levels correspond to the researcher’s concept of the teacher behavior that is expected to contribute to student learning. As mentioned previously, Danielson (1996) argued that the Framework was based on both analyses of the teacher job and research on teaching effectiveness. However, a researcher planning to use teacher evaluation scores as measures of behavior needs to review the content of the model of teaching on which the evaluation system is based to judge the degree to which the system is likely to measure her or his concept of the teacher behaviors that are expected to affect student learning.

¹In Cincinnati, given the substantial investment in the performance assessment system and the potential further investment in higher pay for teachers rated more competent, district leaders wanted to know whether teachers who were evaluated at higher levels contributed more to the district’s strategic goal of improved student achievement. An initial analysis done by the district (Holtzapple, 2002) provided evidence that the evaluation scores were related to student achievement. In that analysis, residuals from bivariate ordinary least squares (OLS) regressions of current on prior year test scores were calculated, averaged by teacher, and correlated with the sum of the four domain scores from the teacher evaluation system.

The second issue is the degree to which the judgments of the evaluators adequately represent the teacher behaviors described in the model of teaching itself. Evaluators need to be accurate in observing behaviors and translating those behaviors into scores. Although the training and calibration of evaluators, the use of multiple observations, and the use of specialized evaluators provide reasons to believe the scores of performance in Cincinnati have a degree of construct validity as representations of teacher behavior, it is important to recognize that more formal construct validity evidence would be desirable because it is possible that the evaluators may be basing their scores on some other behaviors or teacher characteristics, which may in turn also be related to student achievement. Studies aimed at providing this evidence are in progress.

In this study, I concentrated on establishing the empirical relationship between evaluation scores and value-added measures of student achievement. The analyses described following were originally intended to provide criterion-related validity evidence that the teacher evaluation scores could be used as the basis for a system of differential pay for teachers. As Figure 1 shows, this evidence is also relevant to the question of the effect of teachers’ instructional practice on student learning because it represents the expected relationship at the empirical or operational level once the construct validity Linkages 1 and 3 are established. If no empirical relationship were shown, there would be little theoretical reason to pursue further the particular operationalization of teacher practice represented by this teacher evaluation system.





Teacher performance scores. The comprehensive teacher evaluation scores from the system just described were obtained from the district for the 2000-01 and 2001-02 school years. Because only a subset of teachers experienced the comprehensive evaluation each year, complete evaluation scores were available for the 270 teachers who were comprehensively evaluated during the 2000-01 school year and the 335 evaluated in 2001-02. It was decided to include teachers evaluated in 2000-01 in the analyses, even though the performance criterion was the test scores of their 2001-02 students, because teacher performance was expected to be a relatively stable characteristic. Unfortunately, not all of the scores could be used in the analysis because most evaluated teachers taught subjects or grades for which no state or district standardized tests were given. Also, some teachers were excluded because they had too few students (fewer than three) tested in both years. Due to these exclusions, evaluation scores for only 212 teachers could be included in the analysis. (Because analyses were done separately by subject and grade, some teachers appear in two or more subject/grade analyses.)

As described previously, teachers undergoing the comprehensive evaluation received scores on four domains: planning and preparation, creating an environment for learning (classroom management), teaching for learning, and professionalism. For this analysis, the scores on the four domains were added to yield a composite evaluation score, which was taken as an overall indicator of teacher performance. (Note that the performance pay system, as designed, would have used the scores on all of the domains to determine the teacher’s pay range.) The average intercorrelation between domain scores for all the teachers evaluated in 2000-01 was .60, and coefficient alpha was .86. The average intercorrelation for 2001-02 was .61, and coefficient alpha was also .86.
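The coefficient alpha reported for the four-domain composite can be computed with the standard formula. The sketch below uses invented domain scores, not the district's data:

```python
import numpy as np

def cronbach_alpha(scores):
    """Coefficient alpha for an (n_teachers, k_domains) score matrix.
    alpha = k/(k-1) * (1 - sum of item variances / variance of the composite)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0)        # variance of each domain score
    total_var = scores.sum(axis=1).var()  # variance of the summed composite
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Toy matrix: 6 teachers x 4 domain scores (invented numbers).
domains = np.array([
    [3.2, 3.0, 3.4, 3.1],
    [2.1, 2.4, 2.2, 2.5],
    [3.8, 3.6, 3.9, 3.5],
    [2.9, 3.1, 3.0, 2.8],
    [1.8, 2.0, 1.9, 2.2],
    [3.5, 3.3, 3.6, 3.4],
])
alpha = cronbach_alpha(domains)
```

Because the four domain scores in this toy matrix track each other closely, alpha comes out high, as it did (.86) for the actual domain scores.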

Teacher experience. Ideally, one would also like information on teachers’ total years of experience in the field to control for experience when examining the relationship between evaluation scores and student achievement. This information was not available from the district’s human resource information system, so information on teachers’ years of experience with the district was provided instead. These data were used as a proxy for total experience, recognizing that this underestimates the experience of teachers who taught in other districts prior to employment with CPS.



Table 1

Student Achievement Measures by Grade

Grade   2001-02 Test             2000-01 Test

3       District Test            District Test
4       State Proficiency Test   District Test
5       Terra Nova               State Proficiency Test
6       State Proficiency Test   District Test
7       District Test            State Proficiency Test
8       State Proficiency Test   District Test

Student academic achievement. Student test scores for the 2001-02 school year were obtained from the district for students in Grades 3 through 8 in reading, mathematics, and science. Test scores in the same subjects from the 2000-01 school year were also obtained from the district. Table 1 shows the tests used for each grade.

The tests are given in March of each year. They are largely multiple choice in format, but most also contain some extended response items. State proficiency tests were based on state student content standards. (A set of four score ranges has been established to group students into the proficiency categories of below basic, basic, proficient, and advanced, but these were not used in this analysis.) District tests were developed by the same testing company that helped develop the state tests and are intended to cover similar content so that schools and teachers can use the results to identify students likely to have difficulties on the state test. The scores used in the analysis were the scale scores provided by the state or the test publisher rather than raw scores, percentiles, or normal curve equivalents.

It should be noted that a substantial proportion of students enrolled in each grade in 2001-02 could not be included in the analyses because either or both the 2001-02 and prior year test scores were not available or because of student mobility between schools. Only students enrolled in the school in which they were tested in March 2002 for at least 71 days prior to the test were included. In addition, students were lost from the analyses because some could not be matched with teachers, even when test scores for both years were available. This appears to be due mostly to clerical errors in the student data system. In addition, a few students were excluded from the analyses because their current or prior year scores were extreme outliers.² Across grades and subjects, an average of 66% of the students enrolled in March 2002 were included in the analyses. Data were available for a higher percentage of students in the lower grades in comparison with Grades 7 and 8. The Appendix shows the total numbers of students by grade, the numbers for which test scores are available, and the number of students used in the first step of the analyses described next. Comparison of the average 2001-02 test scores between the population of students tested and the group included in the analysis showed that the latter had somewhat higher test scores (an average of .12 SDs higher across grades and subjects) and lower variance (an average of 15% less across grades and subjects). These differences were partially due to the exclusion of students not enrolled at a school for at least 71 days and the elimination of extreme outliers, most of which represented students who obtained very low scores.

²These were defined as beyond the “outer fences” of the univariate distributions, or further than about two interquartile ranges from the median, or about 2.7 SDs from the mean of a normal distribution. See Hoaglin (1983).

Student demographic variables. The district also provided data on several student characteristics including ethnicity, gender, receipt of free/reduced-price lunch, special education status, and days enrolled in the school. For the analyses described next, this information was used to construct a set of dummy variables for gender (female = 1), non-White (not White/Caucasian = 1), free/reduced-price lunch status (free or reduced = 1), and special education status (participation in any special education program or exemption from testing = 1). These were included as controls at Level 1 of the hierarchical linear model used to derive the average level of value added for teachers’ students.

Comparing the group of students included in Step 1 of the analysis with the population of students on the district roster in March 2002 showed that the included group was somewhat more female (an average of 2.3 percentage points across grades), was more White (an average of 1.4 percentage points), was less poor (an average of 2.5 percentage points), and contained a lower proportion of special education students (an average of 2.1 percentage points). These differences did not seem large enough to indicate that the included students were not representative of the student population.


The analyses proceeded along the lines of the value-added paradigm, defining student achievement as the residual from a regression of the 2001-02 test score in a subject on the prior year’s score for that subject plus other variables thought to stand for student characteristics that potentially influence student test performance. In this case, dummy variables for gender, non-White ethnicity, special education status, and receipt of free or reduced-price lunch were included, as well as the number of days enrolled at the school in which testing took place.
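The residual logic of this paradigm can be sketched with a plain OLS regression and teacher-level averaging, in the spirit of the district's initial analysis described in the footnote; the study itself used a two-level hierarchical model. All data and variable names below are simulated, not the study's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated students: 30 classrooms of 20 students each (all invented).
n, n_teachers = 600, 30
pretest = rng.normal(400, 25, n)
female = rng.integers(0, 2, n).astype(float)
lunch = rng.integers(0, 2, n).astype(float)
teacher = rng.permutation(np.repeat(np.arange(n_teachers), 20))
teacher_effect = rng.normal(0, 5, n_teachers)   # true "value added"
posttest = (50 + 0.9 * pretest - 3 * lunch
            + teacher_effect[teacher] + rng.normal(0, 10, n))

# Regress the current-year score on prior score and controls (OLS).
X = np.column_stack([np.ones(n), pretest, female, lunch])
beta, *_ = np.linalg.lstsq(X, posttest, rcond=None)
residual = posttest - X @ beta                  # observed minus predicted

# Average residual per teacher: a crude value-added measure.
va = np.array([residual[teacher == t].mean() for t in range(n_teachers)])
```

With enough students per classroom, the per-teacher mean residuals track the simulated teacher effects closely, which is the basic justification for using them as the criterion.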



Because of the small number of teachers in each grade for which both test score and evaluation data were available, a full multilevel analysis (i.e., with teacher evaluation scores used as a Level 2 predictor) was not done. Rather, a two-step analysis procedure was followed as outlined next. The purpose of the two-step procedure was to produce correlation coefficients that represented the relationship between teacher evaluation scores and student achievement residuals in each grade for each of the three subjects and that could be combined across grades using standard meta-analysis formulas.
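One standard meta-analytic formula for combining correlations across grades is the sample-size-weighted mean correlation. The paper does not reproduce its exact formulas, so this is a hedged sketch of that common approach with invented numbers:

```python
def weighted_mean_r(rs, ns):
    """Sample-size-weighted mean correlation: sum(n_i * r_i) / sum(n_i)."""
    total = sum(ns)
    return sum(r * n for r, n in zip(rs, ns)) / total

# Invented per-grade correlations and teacher counts, not the paper's data.
grade_rs = [0.20, 0.40, 0.35]
grade_ns = [25, 30, 45]
avg_r = weighted_mean_r(grade_rs, grade_ns)
```

Weighting by sample size keeps grades with few teachers from dominating the combined estimate.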

The first step was intended to produce a measure or representation of the criterion—the average achievement level on the 2001-02 test for each teacher’s students—controlling for prior achievement in that subject and several of the student characteristics thought to influence test scores. To do this, a two-level hierarchical linear model was estimated. The Level 1 model was

Posttest = B0 + B1(pretest) + B2(female) + B3(free/reduced-price lunch) + B4(non-White) + B5(special ed) + B6(days enrolled in school) + R.

B0 ... B6 were within-classroom regression coefficients, and R was the Level 1 error, or individual student residual. All Level 1 predictors were grand-mean centered. The Level 2 model was

B0j = γ00 + U0j.

At Level 2, B0j represented the intercept in classroom j.
