OBJECTIVE: A large proportion of electronic health record (EHR) data consists of unstructured medical text. The formatting of this text is often flexible and inconsistent, making it challenging to use for predictive modeling, clinical decision support, and data mining. Large language models' (LLMs) ability to understand context and semantic variation makes them promising tools for standardizing medical text. In this study, we develop and assess clinical text normalization pipelines built using LLMs.
METHODS: We implemented four LLM-based normalization strategies (Zero-Shot Recall, Prompt Recall, Semantic Search, and Retrieval-Augmented Generation-based normalization [RAGnorm]) and one baseline approach using TF-IDF-based string matching. We evaluated performance across three datasets of SNOMED-mapped condition terms: [1] an oncology-specific dataset, [2] a representative sample of institutional medical conditions, and [3] a dataset of commonly occurring condition codes (>1000 uses) from our institution. We measured performance by recording the mean shortest path length between predicted and true SNOMED CT terms. Additionally, we benchmarked our models against the TAC 2017 drug label annotations, which normalize terms to Medical Dictionary for Regulatory Activities (MedDRA) Preferred Terms.
RESULTS: We found that RAGnorm was the most effective across all three datasets, achieving a mean shortest path length of 0.21 for the domain-specific dataset, 0.58 for the sampled dataset, and 0.90 for the top-terms dataset. It achieved a micro F1 score of 88.01 on task 4 of TAC 2017, surpassing all other models without viewing the provided training data.
CONCLUSION: We find that retrieval-focused approaches overcome traditional LLM limitations for this task. RAGnorm and related retrieval techniques should be explored further for the normalization of biomedical free text.
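The evaluation metric above — mean shortest path length between a predicted and a true ontology concept — can be sketched in pure Python. The hierarchy below is an illustrative toy fragment, not real SNOMED CT content, and the function names are ours, not from the study; this is a minimal sketch of the metric's idea (0 = exact match, larger = farther apart in the is-a graph), assuming distances are computed over the undirected hierarchy.

```python
from collections import deque

# Toy is-a hierarchy standing in for a fragment of SNOMED CT
# (edges are illustrative only, not real SNOMED relationships).
EDGES = {
    "Neoplasm": ["Malignant neoplasm", "Benign neoplasm"],
    "Malignant neoplasm": ["Carcinoma"],
    "Carcinoma": ["Adenocarcinoma"],
}

def build_undirected(edges):
    """Turn the parent -> children map into an undirected adjacency map."""
    graph = {}
    for parent, children in edges.items():
        for child in children:
            graph.setdefault(parent, set()).add(child)
            graph.setdefault(child, set()).add(parent)
    return graph

def shortest_path_length(graph, src, dst):
    """BFS distance between two concepts; 0 means an exact match."""
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for nbr in graph.get(node, ()):
            if nbr == dst:
                return dist + 1
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None  # concepts not connected in this fragment

def mean_shortest_path(graph, pred_true_pairs):
    """Average BFS distance over (predicted, true) concept pairs."""
    dists = [shortest_path_length(graph, p, t) for p, t in pred_true_pairs]
    return sum(dists) / len(dists)
```

For example, a system that predicts "Carcinoma" for a true label of "Carcinoma" (distance 0) and "Malignant neoplasm" for "Adenocarcinoma" (distance 2) would score a mean shortest path length of 1.0 on this toy graph.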