Previous research has revealed the need for a validation study that considers several wake quantities and code types, so that decisions on the trade-off between accuracy and computational cost can be well informed and appropriate to the intended application. In addition to guiding code choice and setup, rigorous model validation exercises are needed to identify the strengths and weaknesses of specific models and to guide future improvements. Here, we consider 13 approaches to simulating wakes observed with a nacelle-mounted lidar at the Scaled Wind Technology Facility (SWiFT) under varying atmospheric conditions. We find that some of the main challenges in wind turbine wake modeling are related to simulating the inflow. In the neutral benchmark, model performance tracked model fidelity as expected, with large-eddy simulations performing best. In the more challenging stable case, steady-state Reynolds-averaged Navier–Stokes simulations outperformed the other models because they make it easier to prescribe noncanonical inflows and because their low cost allows simulations to be repeated as needed. Dynamic measurements were available only for the unstable benchmark, at a single downstream distance. These dynamic analyses revealed that differences in the performance of time-stepping models come largely from differences in wake meandering. This highlights the need for more validation exercises that take wake dynamics into account and can identify where these differences come from: mesh setup, inflow, turbulence models, or wake-meandering parameterizations. In addition to the model validation findings, we summarize lessons learned and provide recommendations for future benchmark exercises.