Publication Date


Type of Culminating Activity


Degree Title

Doctor of Education in Curriculum and Instruction


Curriculum, Instruction, and Foundational Studies

Major Advisor

Evelyn Johnson, Ed.D.


This study used generalizability theory to identify sources of variance in scores on a pilot observation tool designed to evaluate special education teacher effectiveness. It was guided by the question: How many occasions and raters are needed for acceptable levels of reliability when using the pilot Recognizing Effective Special Education Teachers (RESET) observation tool to evaluate special education teachers? At the time of this study, the pilot RESET observation tool included three evidence-based instructional practices (direct, explicit instruction; whole-group instruction; and discrete trial teaching) as the basis for special education teacher evaluation. Eight teachers (raters) were invited to attend two sessions (October 2012 and April 2013) to evaluate videos of special education classroom instruction recorded during the 2011-2012 and 2012-2013 school years with the Teachscape 360-degree video system. The raters were trained on the pilot RESET observation tool and participated in whole-group coding sessions to establish interrater agreement (minimum of 80%) before evaluating their assigned videos.

Data collected from raters were analyzed in a two-facet, partially nested design in which occasions (o) (observations/lessons) were nested within teachers (t), o:t, and crossed with raters (r), (o:t) × r. Results from the generalizability study analyses were then used in decision studies to determine the facet conditions that yield the highest levels of reliability; the relative G coefficient and the standard error of measurement informed the decision study analyses. The results align with similar studies that found multiple observations and multiple raters to be critical for ensuring acceptable levels of reliability. Recommendations for future studies include investigating the use of different raters (e.g., principals or university faculty) and using larger facet sample sizes to increase the overall measurement precision of the RESET tool. Future reliability and validity studies on the RESET tool must also consider the feasibility of implementation in practice.
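
For readers unfamiliar with decision studies, the following is a minimal sketch of how a relative G coefficient and a relative standard error of measurement could be projected for an (o:t) × r design, using the standard generalizability theory formula in which relative error variance is the sum of the occasion-within-teacher, teacher-by-rater, and residual components, each divided by the number of conditions sampled. The function name `d_study` and all variance component values are hypothetical and chosen for illustration only; they are not results from this study.

```python
from math import sqrt


def d_study(var_t, var_ot, var_tr, var_res, n_o, n_r):
    """Project reliability for an (o:t) x r design.

    var_t   : teacher (object of measurement) variance
    var_ot  : occasions-nested-within-teachers variance
    var_tr  : teacher-by-rater interaction variance
    var_res : residual variance (rater by occasion within teacher, plus error)
    n_o     : number of occasions per teacher in the decision study
    n_r     : number of raters in the decision study
    """
    # Relative error variance: every component that can change the
    # rank ordering of teachers, divided by its sample size.
    rel_error = var_ot / n_o + var_tr / n_r + var_res / (n_o * n_r)
    g_relative = var_t / (var_t + rel_error)
    sem_relative = sqrt(rel_error)
    return g_relative, sem_relative


# Hypothetical variance components (illustrative only, NOT study results).
components = {"var_t": 0.30, "var_ot": 0.10, "var_tr": 0.05, "var_res": 0.20}

# Tabulate projected reliability for 1 to 4 occasions and 1 to 4 raters.
for n_o in range(1, 5):
    for n_r in range(1, 5):
        g, sem = d_study(**components, n_o=n_o, n_r=n_r)
        print(f"occasions={n_o} raters={n_r} G={g:.3f} SEM={sem:.3f}")
```

Under these assumed values, adding occasions or raters shrinks the relative error variance, so the projected G coefficient rises and the SEM falls, which is the pattern a decision study examines when weighing reliability against feasibility.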