Technology development and design decisions in wind energy are often based on results from simulations performed for individual wind turbines or entire wind plants. It is therefore critical that the models used for research and industry applications in wind energy are thoroughly validated against measurements. A full-system validation of wind plant simulations must consider the atmospheric inflow, the response of the wind turbines, and their wakes. This task is complicated by the lack of freely available, quality-controlled measurements. Here, such measurements are used to define a validation exercise that can assess the accuracy of models of any fidelity level. As real-world measurements go, the dataset considered here is simple in terms of terrain but exhibits pronounced diurnal cycles. Instead of a full-scale wind plant, we consider an individual research-scale, utility wind turbine instrumented for power and loads measurements. Three benchmarks are defined, with increasing levels of complexity: near-neutral, slightly unstable, and very stable atmospheric stratification. Through comparisons between observations and simulations, the benchmarks provide complementary information about a model's performance and its ability to reproduce mean and dynamic wake characteristics. This article describes the measurements and methodology used to define these benchmarks and provides the information required to perform simulations and conduct the model-measurement comparison. The objective is to provide a robust wake model validation exercise, open to anyone, that reduces the uncertainty in model validation arising from differences in methodology across simulation tools and users.