BACKGROUND: AI models such as ChatGPT have the potential to support musculoskeletal rehabilitation by providing clinical insights. However, their alignment with evidence-based guidelines needs evaluation before integration into physiotherapy practice. OBJECTIVE: To evaluate the performance of ChatGPT (GPT-4 model) in generating responses to musculoskeletal rehabilitation queries by comparing its recommendations with evidence-based clinical practice guidelines (CPGs). DESIGN: Cross-sectional observational study. METHODS: Twenty questions covering disease information, assessment, and rehabilitation were developed by two experienced physiotherapists specializing in musculoskeletal disorders. The questions were distributed across three anatomical regions: upper extremity (7 questions), lower extremity (9 questions), and spine (4 questions). ChatGPT's responses were obtained and evaluated independently by two raters using a 5-point Likert scale assessing relevance, accuracy, clarity, completeness, and consistency. Weighted kappa values were calculated to assess inter-rater agreement and consistency within each category. RESULTS: ChatGPT's responses received the highest average score for clarity (4.85), followed by accuracy (4.62), relevance (4.50), and completeness (4.20). Consistency received the lowest score (3.85). The highest agreement (weighted kappa = 0.90) was observed in the disease information category, whereas the rehabilitation category showed lower agreement (weighted kappa = 0.56). Variability in consistency and moderate weighted kappa values in relevance and clarity highlighted areas requiring improvement. CONCLUSIONS: This study demonstrates ChatGPT's potential in providing guideline-aligned information in musculoskeletal rehabilitation.
However, because of observed limitations in consistency, completeness, and the ability to replicate nuanced clinical reasoning, ChatGPT should serve as a supplementary resource rather than a primary decision-making tool. It performed better in the disease information category, as evidenced by higher scores and inter-rater agreement, whereas its performance in the rehabilitation category was comparatively lower, highlighting the challenge of addressing complex therapeutic interventions. This variability in consistency and domain-specific reasoning underscores the need for further refinement to ensure reliability in complex clinical scenarios. CLINICAL TRIAL NUMBER: Not applicable.
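
For readers unfamiliar with the agreement statistic used in the methods, the following is a minimal sketch of how a weighted kappa between two raters on a 5-point Likert scale could be computed. The abstract does not state which weighting scheme was used, so linear weights are an assumption here (quadratic weights are available as an option). The function name `weighted_kappa` and the example ratings are hypothetical and are not the study's data.

```python
def weighted_kappa(rater_a, rater_b, categories=(1, 2, 3, 4, 5), weighting="linear"):
    """Cohen's weighted kappa for two raters scoring the same items.

    Assumption: linear disagreement weights by default; the abstract does
    not specify the scheme. Pass weighting="quadratic" for squared weights.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must score the same non-empty set of items.")
    if weighting not in ("linear", "quadratic"):
        raise ValueError("weighting must be 'linear' or 'quadratic'.")

    k = len(categories)
    index = {category: i for i, category in enumerate(categories)}
    n = len(rater_a)

    # Observed joint proportions: observed[i][j] is the share of items that
    # rater A placed in category i and rater B placed in category j.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n

    # Marginal proportions for each rater.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    power = 1 if weighting == "linear" else 2

    def disagreement(i, j):
        # 0 on the diagonal, 1 for the two most distant categories.
        return (abs(i - j) / (k - 1)) ** power

    observed_disagreement = sum(
        disagreement(i, j) * observed[i][j] for i in range(k) for j in range(k)
    )
    expected_disagreement = sum(
        disagreement(i, j) * row[i] * col[j] for i in range(k) for j in range(k)
    )

    if expected_disagreement == 0:
        # Both raters used one identical category throughout; kappa is undefined.
        return float("nan")
    return 1.0 - observed_disagreement / expected_disagreement


# Hypothetical ratings for five responses (illustration only).
example_rater_1 = [5, 4, 5, 3, 4]
example_rater_2 = [5, 4, 4, 3, 5]
example_kappa = weighted_kappa(example_rater_1, example_rater_2)
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance. Because the weights penalize a one-point difference less than a four-point difference, near-misses on the Likert scale lower the statistic less than large disagreements do.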