Genome-wide association studies (GWAS) help identify disease-linked genetic variants, but pinpointing the most likely causal genes in GWAS loci remains challenging. Existing GWAS gene prioritization tools are powerful but often rely on complex black-box models trained on datasets containing unaddressed biases. Here, we use a data-driven approach to construct a truth set of causal genes in 406 GWAS loci. We train a gene prioritization tool, CALDERA, that uses a simple logistic regression model with L1 regularization and corrects for potential confounders. Using three independent benchmarking datasets of resolved GWAS loci, we compare the performance of CALDERA with that of three other methods (FLAMES, L2G, and cS2G). CALDERA outperforms all three methods in two of the three datasets and ranks second in the remaining dataset. We demonstrate that CALDERA prioritizes genes with expected properties, such as mutation intolerance (OR = 1.751 for pLI > 90%, P = 8.45 × 10