A Laboratory Decision-Support System for Reflective Urine Culture Testing: Development of an Interpretable Artificial Intelligence Model
RESEARCH ARTICLE
Volume: 15, Issue: 1
Pages: 17-33
January 2026


Mediterr J Infect Microb Antimicrob 2026;15(1):17-33
1. Dokuz Eylül University, Institute of Health Sciences, Department of Neurosciences, İzmir, Türkiye
2. University of Health Sciences Türkiye, İzmir Tepecik Training and Research Hospital, Department of Medical Biochemistry, İzmir, Türkiye
3. University of Health Sciences Türkiye, İzmir Tepecik Training and Research Hospital, Department of Infectious Diseases and Clinical Microbiology, İzmir, Türkiye
4. University of Health Sciences Türkiye, İzmir Tepecik Training and Research Hospital, Department of Family Medicine, İzmir, Türkiye
Received Date: 18.07.2025
Accepted Date: 08.12.2025
Online Date: 03.02.2026
Publish Date: 03.02.2026
E-Pub Date: 07.01.2026

Abstract

Introduction

Urinary tract infections are a common diagnostic challenge. Although urine culture remains the gold standard, it is time-consuming and often ordered reflexively. This study aimed to develop and validate an interpretable machine-learning–based Laboratory Decision-Support System (LDSS) to guide reflective urine culture prioritization using only structured laboratory data.

Materials and Methods

We analyzed a retrospective cohort of 51,923 adult patients. Seven machine learning algorithms were trained, with the Random Forest (RF) model demonstrating the highest accuracy. SHapley Additive exPlanations was employed to ensure model interpretability. A reduced RF model, using the top 10 predictive features, was used to construct three scoring systems: one emphasizing model fidelity, one optimizing diagnostic balance, and one maximizing sensitivity.

Results

The RF model demonstrated excellent performance (external receiver operating characteristic – area under the curve [ROC-AUC]: 0.956). The simplified 10-variable model maintained high accuracy (ROC-AUC: 0.947). Key predictors included bacterial count, leukocyte count, nitrite presence, and patient age. The scoring systems offered flexible options tailored to different diagnostic priorities, with the SAFE-Score achieving 95.3% sensitivity.

Conclusion

The developed LDSS supports rational antibiotic use by reducing unnecessary culture testing. Its explainable structure facilitates collaboration between laboratory professionals and clinicians, contributing to standardized reflective testing workflows, supporting interdisciplinary decision-making, and strengthening antimicrobial stewardship, while preserving the central role of urine culture in infection management.

Keywords:
Urinary tract infections, machine learning, urine culture

Introduction

Urinary tract infections (UTIs) are among the most common infections in clinical practice, with an estimated global incidence exceeding 150 million cases annually[1]. They are associated with substantial healthcare costs, frequent antibiotic prescriptions, and increased diagnostic burden, particularly in outpatient and emergency settings[2, 3]. Accurate diagnosis remains challenging due to nonspecific symptoms and reliance on time-consuming laboratory tests[4].

Urine culture is considered the gold standard for UTI diagnosis. However, its 24–48-hour turnaround often necessitates empiric antibiotic treatment before microbiological confirmation[5]. This practice contributes to antimicrobial resistance, now recognized by the World Health Organization as a global health threat[6]. Moreover, up to 60%–70% of urine cultures yield negative or clinically insignificant results, highlighting potential overuse of testing and therapy[7].

Rapid dipstick tests, detecting leukocyte esterase and nitrite, provide immediate screening but show variable performance across populations, with sensitivity and specificity ranging from 68% to 88% and 17% to 98%, respectively[8].

This diagnostic uncertainty has prompted efforts to improve laboratory decision-making, including the use of reflective testing. Reflective testing, increasingly recognized in modern laboratory medicine, involves laboratory physicians adding further analyses or interpretative comments after reviewing initial test results to enhance diagnostic reasoning[9]. In UTIs, this expert-led approach aids accurate interpretation and encourages more judicious use of microbiological testing. Laboratory physicians thus face the dual challenge of minimizing unnecessary culture requests while ensuring patients with a high likelihood of positive cultures are correctly identified.

In most laboratory information systems (LIS), detailed symptom information is not captured; only test orders and preliminary diagnoses, such as International Classification of Diseases (ICD) codes, are typically available. Consequently, the predictive modeling approach in this study relied solely on structured laboratory data. Within this constraint, we developed a standardized, interpretable, and data-driven Laboratory Decision-Support System (LDSS) to optimize urine culture utilization using routine laboratory parameters. The LDSS is not intended to replace clinical diagnoses but to assist laboratory physicians in prioritizing reflex urine culture testing within laboratory workflows. Diagnostic responsibility remains entirely with the treating clinician, while the LDSS provides reproducible, standardized insights derived from LIS data.

Artificial intelligence (AI) and machine learning (ML) have gained increasing attention for developing predictive models in UTI diagnosis. Various algorithms—including Logistic Regression (LR), Random Forests (RFs), Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), and TabNet—have demonstrated robust performance using structured data such as urinalysis results, demographics, and clinical history[10–12]. Reported area under the receiver operating characteristic curve (AUROC) values commonly exceed 0.85, with some studies achieving 0.95 or higher in external validation cohorts[11, 13].

Recent studies have highlighted the importance of model interpretability. By employing SHapley Additive exPlanations (SHAP), our LDSS not only ensures transparency but also facilitates clinical integration by illustrating the real-time contribution of each variable. Real-world implementations of ML-based LDSSs have shown reductions in unnecessary culture orders, accelerated treatment decisions, and improved antibiotic stewardship outcomes[12, 14].

Despite these advances, challenges remain. Many predictive models are trained on single-center datasets and lack external validation, raising concerns about generalizability across institutions and diverse patient populations[13, 15]. Additionally, variability in urinalysis platforms and clinical practice patterns may limit reproducibility and scalability.

Unlike existing tools, the proposed LDSS provides three distinct scoring systems tailored to different clinical priorities, ranging from high-sensitivity triage to specificity-focused decision-making. This flexibility promotes collaboration among biochemists, microbiologists, and clinicians while reducing diagnostic waste by minimizing unnecessary urine culture requests.

The aim of this study was to develop and externally validate a robust, interpretable ML-based LDSS to predict urine culture outcomes in patients with suspected UTIs. By standardizing reflective testing practices, the LDSS supports interdisciplinary decision-making, optimizes resource utilization, and ultimately contributes to rational antibiotic prescribing across healthcare settings.

Materials and Methods

Study Population/Subjects

This study was conducted at İzmir Tepecik Training and Research Hospital. Ethical approval was obtained from the University of Health Sciences Türkiye, İzmir Tepecik Training and Research Hospital Non-Interventional Research Ethics Committee prior to study initiation (approval number: 2025/02-05, dated: 10.03.2025).

Eligible participants were adults aged ≥18 years who presented as inpatients or outpatients to the main hospital between January 1, 2014, and December 31, 2024, or to its affiliated hospital between January 1 and February 28, 2025. Inclusion criteria required patients to undergo their first urinalysis, complete blood count (CBC), and urine culture, ordered by a specialist physician based on clinical indication.

The study cohort included both culture-positive and culture-negative cases, capturing the full spectrum of patients for whom urine cultures were clinically indicated. Consequently, the dataset reflects real-world test-ordering practices rather than a biased subset of confirmed infections.

Patients were excluded if they had incomplete test results, missing sub-parameters, non-bacterial pathogens in their urine culture, delays exceeding one hour between urine sample collection and laboratory registration, delays exceeding 30 minutes for hemogram samples between phlebotomy and laboratory receipt, or a history of antibiotic treatment prior to testing.

CBC analyses were performed using UniCel DxH 800 analyzers (Beckman Coulter, Miami, FL, USA) from 2014 to 2020 and XN-2000 systems (Sysmex Corporation, Kobe, Japan) from 2020 onward. Urinalysis tests were conducted using fully automated analyzers across three periods: H-800 and FUS-200 systems (Dirui Industrial Co., Changchun, China) from 2014 to 2018; BT Uricell 1280–1600 (Bilimsel Products, İzmir, Türkiye) from 2018 to 2021; and U2610–U1600 (Zybio Corporation, Chongqing, China) from 2021 onward.

Midstream urine samples were collected in sterile containers simultaneously with urinalysis and processed according to standard microbiological procedures. Samples without detectable bacterial growth after 24 hours were incubated for an additional 24 hours; if no growth was observed, the result was reported as “no growth”.

Reagents and calibrators for urinalysis were obtained from authorized manufacturers and were certified and registered products. Quality control materials were sourced from Bio-Rad (California, USA). All results were reviewed and validated for accuracy and reliability by both a clinical biochemistry specialist and a clinical microbiology specialist.

Study Design

Patient identifiers were anonymized, and a dataset comprising age, sex, hemogram, urinalysis, and urine culture results from 55,385 patients (main hospital: 52,854; affiliated hospital: 2,531) was imported into Microsoft Excel 2021 (Microsoft Corporation, USA).

Symptom data were not included, as such information is not routinely recorded in LIS. In standard laboratory workflows, test orders are typically accompanied by preliminary diagnoses or ICD codes from the requesting physician, but detailed patient symptoms are not captured. Accordingly, the predictive model in this study was developed exclusively on structured laboratory data, aiming to forecast urine culture outcomes rather than to establish a clinical diagnosis of UTI.

After applying exclusion criteria, the final dataset included 49,720 patients, with an external validation cohort of 2,203 patients. The dataset was subsequently transferred to Python (version 3.13.1) for ML analysis.

Following data cleaning, the main dataset was divided into training, internal test, and external test subsets using a 60:20:20 stratified sampling strategy based on the binary target variable, ensuring preservation of class distribution.

Patient flow throughout the study is depicted in Figure 1, in accordance with the Standards for Reporting of Diagnostic Accuracy Studies (STARD) guidelines.

Data Preprocessing and Training of ML Algorithms

Patient data were initially exported from the LIS into Microsoft Excel. Hemogram values and flow cytometry parameters from urinalysis were used directly due to device standardization. Semi-quantitative dipstick results—reported by urinalysis analyzers as categorical values (e.g., “+,” “++,” “+/-,” “trace”)—were converted into numerical equivalents (e.g., “++” mapped to 2; “trace” standardized to 0.5) to ensure quantitative consistency. Variables describing urine color and appearance were also recategorized by grouping similar classifications (e.g., light yellow to dark red; clear to very cloudy) to standardize the dataset.

Urine culture results were binarized as follows: samples with ≥10,000 colony-forming unit (CFU)/mL bacterial growth were defined as positive (label = 1), while samples with <10,000 CFU/mL, mixed flora, colonization, yeast, or no growth were classified as negative (label = 0).
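
A minimal Python sketch of these two preprocessing steps is given below. The column names and category labels are illustrative placeholders, not the actual LIS export fields.

```python
import pandas as pd

# Illustrative sketch only: column names ("Leukocyte_Esterase", "CFU_per_mL",
# "Growth_Type", etc.) are hypothetical stand-ins for the LIS export fields.
df = pd.read_excel("urine_dataset.xlsx")  # anonymized LIS export

# Semi-quantitative dipstick categories -> numerical equivalents
dipstick_map = {"negative": 0, "trace": 0.5, "+/-": 0.5,
                "+": 1, "++": 2, "+++": 3, "++++": 4}
for col in ["Leukocyte_Esterase", "Nitrite", "Protein", "Glucose", "Ketone"]:
    df[col] = df[col].map(dipstick_map)

# Binary target: uropathogen growth >= 10,000 CFU/mL -> positive (1);
# no growth, <10,000 CFU/mL, mixed flora, colonization, or yeast -> negative (0)
df["Culture_Positive"] = (
    (df["CFU_per_mL"] >= 10_000) & (df["Growth_Type"] == "uropathogen")
).astype(int)
```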

The 10,000 CFU/mL threshold was selected based on recent evidence and the 2024 European Association of Urology guidelines, which acknowledge that lower colony counts (≥10³–10⁴ CFU/mL) may be clinically significant in symptomatic or catheterized patients[16]. Nelson et al.[17] demonstrated that these lower thresholds preserve diagnostic accuracy for symptomatic UTIs, supporting their use in reflective testing workflows. Additionally, Werneburg et al.[18] showed that urinalysis parameters reliably predict the absence of infection at this threshold, reinforcing its clinical validity. This definition also aligns with our institutional microbiology reporting standard for significant bacteriuria.

Yeast and colonization findings were labeled as negative (label = 0) based on established microbiological evidence and laboratory reporting standards. In urinary cultures, the presence of Candida species typically reflects colonization or contamination rather than true infection, even at colony counts exceeding 10⁴–10⁵ CFU/mL, unless accompanied by compatible clinical symptoms[19]. Classifying yeast as negative prevented false-positive propagation in the LDSS and improved the model’s clinical specificity.

Similarly, cases labeled as “colonization”—including cultures with mixed flora or non-uropathogenic organisms—were considered negative. This approach aligns with standard microbiology practice, where such findings are reported as clinically non-significant. Although CLSI M100 (2025) does not define colony-count thresholds for colonization or candiduria, its terminology guided our categorization strategy. This interpretation reflects real-world laboratory workflows, ensuring that the LDSS mirrors standardized reporting logic and remains generalizable across institutions[20].

The cleaned dataset was transferred to Python for ML analysis. To enhance model robustness and address class imbalance, a stratified data partitioning scheme was applied, allocating 60% of samples to training and 20% each to internal and external testing. The dataset exhibited natural imbalance, with 22.4% culture-positive and 77.6% culture-negative samples. To mitigate majority-class bias, feature standardization and rebalancing strategies (class_weight='balanced') were applied uniformly across all classifiers.

As a preliminary check, a baseline LR model was trained and evaluated across all data splits. Receiver operating characteristic – area under the curve (ROC-AUC) scores (≈0.74, 0.73, 0.73 for training, internal, and external sets, respectively) and F1 scores (0.55, 0.54, 0.54) demonstrated consistent generalization without evidence of overfitting or imbalance-driven inflation. The close alignment of these baseline metrics confirmed that stratified sampling preserved class proportions across all subsets (≈22.4% positive vs. 77.6% negative), ensuring reliable model development.
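
The partitioning and baseline check described above can be expressed as the following sketch, continuing the hypothetical `df` from the preprocessing example; the exact pipeline may differ.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, f1_score

X = df.drop(columns=["Culture_Positive"])
y = df["Culture_Positive"]

# 60:20:20 stratified split: 60% training, then the remaining 40% halved
# into internal and external test sets, preserving the class distribution
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.40, stratify=y, random_state=42)
X_int, X_ext, y_int, y_ext = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=42)

# Baseline logistic regression as a sanity check on imbalance handling
scaler = StandardScaler().fit(X_train)
lr = LogisticRegression(class_weight="balanced", max_iter=1000)
lr.fit(scaler.transform(X_train), y_train)

for name, Xs, ys in [("train", X_train, y_train),
                     ("internal", X_int, y_int),
                     ("external", X_ext, y_ext)]:
    p = lr.predict_proba(scaler.transform(Xs))[:, 1]
    print(name, round(roc_auc_score(ys, p), 2), round(f1_score(ys, p >= 0.5), 2))
```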

ML Model Selection and Development

The results confirmed that the methodological setup—including stratified sampling and proportional weighting—effectively mitigated class imbalance and provided a reliable foundation for model development. LR was used not as a primary model, but as a diagnostic tool to verify dataset integrity and the fairness of the training process[21].

Model development was performed in Python 3.13.1 using widely adopted libraries and workflows. Seven ML algorithms were evaluated for their suitability with the dataset and their potential effectiveness in predicting urine culture outcomes: RF, XGBoost, LightGBM, CatBoost, LR, Artificial Neural Network (ANN), and K-Nearest Neighbors (KNN).

Variables included in the analysis:

• Demographic: Age, sex

• Hemogram: White blood cell, neutrophil, lymphocyte, monocyte, eosinophil, basophil, hemoglobin (HGB)

• Urine Dipstick: Leukocyte esterase, nitrite, glucose, protein, pH, erythrocyte, bilirubin, urobilinogen, ketone

• Other Urinalysis: Urine color, urine density, appearance

• Flow Cytometry: Bacteria count, cylinder, yeast, urine leukocyte count

Data preprocessing, model training, evaluation, and visualization were conducted using open-source Python libraries:

• Data Processing and Analysis: pandas (v2.2.2), numpy (v2.0.2), optuna (v4.3.0)

• ML Model Development: scikit-learn (v1.6.1), XGBoost (v2.1.4), lightgbm (v4.5.0), catboost (v1.2.8), tensorflow (v2.10), keras (v2.10), torch (v2.6.0 + cu124)

• Model Evaluation and Visualization: matplotlib (v3.10), seaborn (v0.13.2), scipy.stats (v1.9), sklearn.metrics (v1.2), SHAP (v0.47)

Detailed hyperparameter optimization procedures, including search strategies and parameter configurations for each model, are provided in Supplementary Table 1. Each model was retrained using the optimal hyperparameters identified during tuning. Final model evaluation was based on F1 and ROC-AUC scores derived from the internal test set.
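
As an illustration of this tuning workflow, the sketch below shows an Optuna study for the RF model. The search ranges are placeholders; the actual configurations are those given in Supplementary Table 1.

```python
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Placeholder search space; the real ranges are in Supplementary Table 1
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "max_depth": trial.suggest_int("max_depth", 3, 20),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 10),
    }
    model = RandomForestClassifier(class_weight="balanced", random_state=42,
                                   n_jobs=-1, **params)
    # Mean cross-validated ROC-AUC on the training split as the objective
    return cross_val_score(model, X_train, y_train,
                           scoring="roc_auc", cv=5).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)

# Retrain the final model with the optimal hyperparameters
rf = RandomForestClassifier(class_weight="balanced", random_state=42,
                            n_jobs=-1, **study.best_params)
rf.fit(X_train, y_train)
```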

Performance Evaluation

Performance evaluation was conducted using standard Python-based data science libraries. The modeling process was assessed comprehensively through internal cross-validation, hyperparameter tuning, and multiple performance metrics.

Classification Performance Metrics: Model discrimination and predictive capability were evaluated using:

• ROC-AUC

• Area under the precision-recall curve (PR-AUC)

• Sensitivity and Specificity

• Positive predictive value (PPV) and negative predictive value (NPV)

• Positive likelihood ratio (PLR) and negative likelihood ratio (NLR)

• F1 score

Model Interpretability Metrics: To enhance clinical transparency and foster trust in algorithmic decisions, interpretability was assessed using:

• Feature-Importance metrics

• SHAP graphs

This multidimensional evaluation approach balances predictive performance with explainability, providing a robust framework for forecasting urine culture outcomes based solely on laboratory and demographic data.
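
For completeness, the sketch below shows how these metrics can be derived from predicted probabilities and a confusion matrix, assuming the tuned `rf` model and the external split from the earlier sketches; the 0.5 threshold is shown for illustration only.

```python
from sklearn.metrics import (confusion_matrix, roc_auc_score,
                             average_precision_score, f1_score)

proba = rf.predict_proba(X_ext)[:, 1]
tn, fp, fn, tp = confusion_matrix(y_ext, proba >= 0.5).ravel()

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)                        # positive predictive value
npv = tn / (tn + fn)                        # negative predictive value
plr = sensitivity / (1 - specificity)       # positive likelihood ratio
nlr = (1 - sensitivity) / specificity       # negative likelihood ratio
f1 = f1_score(y_ext, proba >= 0.5)
roc_auc = roc_auc_score(y_ext, proba)
pr_auc = average_precision_score(y_ext, proba)  # PR-AUC
```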

Development of the LDSS

The LDSS was built using the best-performing ML model identified during model selection. SHAP analysis was employed to select the ten most informative features, and a simplified model was retrained using only these variables. The reduced model maintained performance comparable to the full model, supporting its suitability for practical implementation.
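
A sketch of this feature-reduction step, assuming the tuned RF model (`rf`) from the tuning example, could look as follows; because the SHAP return format varies by version, the positive-class handling below is deliberately defensive.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

explainer = shap.TreeExplainer(rf)
sv = explainer.shap_values(X_train)
# Depending on the SHAP version, RF attributions arrive as a per-class list
# or a 3-D array; extract the positive-class slice in either case
sv_pos = sv[1] if isinstance(sv, list) else sv[..., 1]

# Rank features by mean absolute SHAP value and keep the top ten
importance = np.abs(sv_pos).mean(axis=0)
top10 = X_train.columns[np.argsort(importance)[::-1][:10]]

# Retrain the reduced model on the selected variables only
rf_reduced = RandomForestClassifier(class_weight="balanced", random_state=42)
rf_reduced.fit(X_train[top10], y_train)
```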

Instead of the default probability threshold of 0.5, an optimized threshold based on Youden’s J statistic was applied to improve sensitivity and minimize missed infections. Each selected feature was then converted into a binary indicator using individual cut-points derived from ROC analysis, enabling construction of a straightforward cumulative score.

Feature-importance values were normalized to derive clinically interpretable weights. Highly influential predictors received slightly higher weights, while moderately informative features were scaled conservatively to balance performance with interpretability. The final scoring system was recalibrated using internal data and externally evaluated, demonstrating preserved sensitivity and specificity. This streamlined, transparent design ensures that the LDSS is suitable for routine use within laboratory workflows.
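
The threshold optimization and score construction can be sketched as follows, continuing the `rf_reduced` model above. The weights shown are simple normalized feature importances standing in for the calibrated weights of the published scoring systems.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Youden's J statistic for the model-level probability threshold
proba = rf_reduced.predict_proba(X_int[top10])[:, 1]
fpr, tpr, thr = roc_curve(y_int, proba)
model_threshold = thr[np.argmax(tpr - fpr)]

# Per-feature binary cut-points from individual ROC analyses
def roc_cutpoint(feature_values, labels):
    fpr, tpr, thr = roc_curve(labels, feature_values)
    return thr[np.argmax(tpr - fpr)]

cutpoints = {f: roc_cutpoint(X_train[f], y_train) for f in top10}

# Normalized importances as illustrative weights for a cumulative score
imp = rf_reduced.feature_importances_
weights = dict(zip(top10, imp / imp.max()))

def cumulative_score(row):
    return sum(weights[f] for f in top10 if row[f] >= cutpoints[f])

scores = X_ext[top10].apply(cumulative_score, axis=1)
```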

Validation of the LDSS

An independent validation dataset, obtained from an affiliated hospital within the same healthcare network, was used to assess the generalizability and robustness of the LDSS through temporal validation. This temporally separated retrospective dataset was entirely independent of all model development phases, including training, feature selection, and score construction.

Performance of the reduced 10-variable RF model and the three derived scoring systems was evaluated within this separate clinical environment. Standard classification metrics were computed and compared with those from the original external test set, providing insight into the system’s real-world applicability.

The validation strategy adheres to recommendations from the International Federation of Clinical Chemistry and Laboratory Medicine for evaluating diagnostic tools using independent datasets. This approach strengthens the clinical credibility of the LDSS by demonstrating reproducibility across diverse healthcare settings.

Statistical Analysis

Descriptive statistics are presented as means ± standard deviations (SDs) for continuous variables and as frequencies with percentages for categorical variables. Comparative analyses between the development and validation datasets were conducted using:

• Student’s t-test for normally distributed continuous variables

• Welch’s t-test for continuous variables with unequal variances or sample sizes

• Pearson’s chi-square test for categorical variables

• Z-tests for proportions and McNemar’s test for paired categorical outcomes, particularly for comparing model performance metrics across datasets

These statistical comparisons were used to evaluate diagnostic consistency and identify significant differences in classification outcomes, providing insight into the reproducibility and robustness of the LDSS across diverse clinical settings.

All p-values were two-sided, with statistical significance defined as p < 0.05. Analyses were conducted using Python 3.13 and its associated statistical packages.
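
A self-contained sketch of these comparisons, with synthetic values standing in for the actual cohort data, is shown below.

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
dev_age = rng.normal(38.3, 18.0, 1000)  # synthetic development-cohort ages
val_age = rng.normal(43.9, 19.0, 200)   # synthetic validation-cohort ages

# Welch's t-test for continuous variables with unequal variances/sizes
t_stat, p_welch = stats.ttest_ind(dev_age, val_age, equal_var=False)

# Pearson's chi-square for a categorical variable (illustrative 2x2 table)
table = pd.DataFrame({"female": [620, 130], "male": [380, 70]},
                     index=["development", "validation"])
chi2, p_chi, dof, _ = stats.chi2_contingency(table)

# Z-test for the culture-positivity proportions (22.4% vs. 18.3%)
z_stat, p_z = proportions_ztest(count=[11137, 403], nobs=[49720, 2203])

# McNemar's test on a paired 2x2 agreement table (illustrative counts)
result = mcnemar([[50, 8], [3, 39]], exact=True)
```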

Results

Dataset Description and Data Preprocessing

The analytical cohort comprised 51,923 patient encounters, including 49,720 records from the main institutional database and 2,203 from an affiliated tertiary center. The validation cohort was enriched with inpatients from high-acuity units, such as Palliative Care and Gynecologic Oncology, and was specifically used to assess the external validity of the LDSS.

The validation cohort demonstrated significantly higher age across all demographic strata (total: 43.92 vs. 38.28 years; males: 48.04 vs. 39.69; females: 41.23 vs. 37.41; all p < 0.05). Hematologic comparisons revealed statistically significant reductions in lymphocyte count (LYM) and eosinophil count, accompanied by a modest but significant increase in HGB levels (p < 0.05).

Among urinalysis variables, the validation group exhibited higher bacterial counts, increased mucus presence, and elevated pH levels, whereas urine specific gravity and cylinder counts were lower (p < 0.05 for all). No significant differences were observed in white blood cell (WBC), neutrophil, monocyte, or basophil counts, nor in urine leukocyte counts, yeast presence, or gender distribution (all p > 0.05). Although the proportion of urine culture-positive cases was numerically similar (22.4% vs. 18.3%), this difference reached statistical significance (p < 0.05), potentially reflecting distinct microbiologic or clinical characteristics in the validation population.

Overall, these findings indicate that while the two datasets are broadly comparable, the validation cohort exhibits distinct demographic and laboratory profiles, likely due to its inpatient composition. These differences should be considered when interpreting LDSS performance in more complex clinical settings. Detailed summary statistics and p-values for each variable are provided in Table 1.

Hyperparameter Tuning

Each ML model was trained and tuned to achieve optimal performance on our dataset. Final hyperparameter configurations, tailored to the structure of each algorithm, are summarized in Supplementary Table 2.

Performance Metrics of ML Models

The performance of seven ML models was evaluated using both internal and external test datasets. Ensemble-based methods—RF, CatBoost, and XGBoost—consistently demonstrated high accuracy (≥0.929) and F1 scores (>0.83) across both datasets, highlighting their robustness for clinical prediction tasks.

On the external test set, RF outperformed all other models, achieving the highest ROC-AUC (0.956) and PR-AUC (0.907), indicating superior discrimination and precision-recall trade-off. CatBoost achieved the highest sensitivity (0.771) while maintaining balanced performance across other metrics.

KNN demonstrated exceptional specificity (0.988) and PPV (0.945) in the external set, making it particularly effective for ruling in cases. Conversely, LR, while computationally efficient, showed the lowest sensitivity and F1 scores, limiting its diagnostic utility.

Performance metrics from the external dataset closely mirrored those of the internal test set for all models, reinforcing their generalizability and stability. Comprehensive statistics for both datasets are provided in Table 2 and Figure 2.

Among all evaluated algorithms, RF exhibited the most consistent and highest overall performance, with an internal ROC-AUC of 0.952 (95% confidence interval [CI]: 0.948–0.956) and an external ROC-AUC of 0.956 (95% CI: 0.952–0.960), along with strong PR characteristics.

Given its superior accuracy, consistent generalizability, and interpretability, RF was selected as the core algorithm for integration into the LDSS. SHAP analysis was then performed on the final model to provide insight into the individual contribution of each feature to the predicted outcomes.

SHAP Analysis of the Optimal RF Model

Model interpretability was improved using SHAP, which quantifies the contribution of each feature to the predictions generated by the final RF model. As shown in Figure 3, the most influential features were:

• Bacteria_Count (SHAP value: 0.061)

• Urine_Leu_Count (0.055)

• Nitrite (0.052)

• Age and Leukocyte Esterase (both 0.041)

These features correspond with well-established clinical markers of UTI, supporting the biological plausibility of the model.

Features with moderate importance included HGB, Gender, and LYM, with SHAP values ranging from 0.017 to 0.030. Features such as Bilirubin, Urobilinogen, and Ketone contributed minimally, each with SHAP values below 0.003.

Overall, the feature ranking confirms that the model primarily relies on clinically relevant variables, enhancing transparency and supporting its integration into laboratory decision-making.

Performance Metrics of the LDSS

A simplified RF model, built using the top 10 SHAP-derived features, maintained performance comparable to the full-feature model (ROC-AUC: 0.952 for the full model vs. 0.947 for the reduced model; PR-AUC: 0.897 vs. 0.890), supporting its suitability for clinical implementation (Table 2). Based on these variables, three complementary scoring systems were developed to address distinct operational needs within laboratory workflows (Table 3):

• Model-Prioritized Score: Retains the behavior of the original ML model by assigning weights directly from normalized SHAP values. This version is ideal for institutions seeking high overall discrimination while remaining faithful to the underlying algorithm.

• Dual-Optimization Score: Adjusts feature weights to balance sensitivity and specificity, as reflected in stable metrics across both test datasets (Table 4, Figure 4). This score is intended for laboratories aiming to minimize both missed infections and unnecessary cultures.

• SAFE-Score: Optimized for high sensitivity and NPV, this score is suitable for safety-critical settings where missing true infections is unacceptable—such as high-acuity units, elderly populations, or immunocompromised patients. Its higher sensitivity comes at the expense of specificity, highlighting the trade-off between diagnostic conservatism and resource utilization.

Across all scoring systems, sensitivity remained consistent in external and independent validation cohorts, while specificity varied according to prioritization strategy (Table 4). Together, these tools provide laboratories with flexible options that can be tailored to local clinical priorities, test-ordering practices, and antimicrobial stewardship goals (Figure 4).

Discussion

ML-based approaches offer substantial potential for the early diagnosis of UTIs. With the rising prevalence of antibiotic resistance, reducing unnecessary antibiotic use has become increasingly critical. Recent studies demonstrate that ML models improve diagnostic accuracy by integrating clinical symptoms, medical history, and urinary biomarkers, rather than relying solely on culture results[22].

Moreover, AI–driven decision-support systems can reduce diagnostic workload in hospitals, although their clinical validation remains limited[15]. Urinary biomarkers, such as nitrite and leukocyte esterase, exhibit high sensitivity for UTI diagnosis, yet their integration into ML models is essential to mitigate false-positive results[23]. AI-assisted methodologies are expected to be particularly beneficial for early detection of recurrent UTIs and multidrug-resistant pathogens, potentially improving patient outcomes and guiding more precise therapeutic interventions[23, 24].

In this study, we evaluated the performance of multiple ML models in predicting urine culture outcomes and assessed their clinical applicability using explainable AI (XAI) techniques. Validation on a demographically and clinically distinct inpatient cohort further demonstrated the robustness and real-world adaptability of the LDSS. The incorporation of XAI enhanced interpretability, providing insight into the decision-making process and supporting potential integration in complex healthcare settings.

The LDSS was developed using all physician-ordered urine culture requests, including both culture-positive and culture-negative cases. Consequently, the dataset reflects the complete real-world distribution of suspected UTIs encountered in laboratory practice, enabling the model to learn discriminative patterns for both infection and non-infection samples. Importantly, the LDSS functions solely as a laboratory-level decision-support tool rather than a diagnostic system. Its predictions are limited to variables available in the LIS and are intended to complement, not replace, physicians’ diagnostic judgment.

Gender and Age-Related UTI Incidence

In our study, UTIs were significantly more common in female patients than in males. This finding aligns with existing literature and reinforces the well-established notion that women are more susceptible to UTIs due to urogenital anatomy, hormonal fluctuations, and lifestyle factors. Schmiemann et al.[1] reported that UTI incidence in women is four to five times higher than in men. Similarly, Hooton et al.[25] identified a higher risk in women attributable to a shorter urethra and variability in periurethral microbial flora. Additional risk factors include age, postmenopausal hormonal changes, and a history of recurrent infections.

Age also emerged as a critical determinant, with UTI incidence progressively increasing—particularly among women aged 65 years and older. While Foxman et al.[26] reported peak incidence in women aged 15–29, with a secondary rise in postmenopausal groups, and Møller et al.[11] linked estrogen depletion after age 50 to heightened susceptibility, our study identified older age (≥65 years) as an independent risk factor for positive urine culture in the LDSS model. This finding underscores the importance of incorporating age as a predictive variable and reflects the growing burden of UTIs in elderly populations.

Performance of ML Models

The predictive performance of the models developed in this study is consistent with, and in several cases surpasses, previously reported ML approaches for UTI prediction. Among the algorithms tested, ensemble-based models—particularly RF and CatBoost—demonstrated consistently high accuracy, balanced sensitivity and specificity, and favorable F1 scores. Compared to prior models reported by de Vries et al.[27] and Flores et al.[2], our RF model showed superior performance across multiple evaluation metrics. Likewise, our CatBoost implementation outperformed the model described by Mancini et al.[13], which exhibited lower AUC and F1 values in a comparable clinical context.

Tree-based gradient boosting methods, such as XGBoost and LightGBM, also performed robustly and yielded results similar to high-performing models developed by Choi et al.[5] and Lin et al.[28], indicating strong generalizability across diverse patient populations. In studies by Dhanda et al.[29] and Taylor et al.[30], RF and XGBoost models similarly demonstrated superior discriminatory capacity, achieving ROC-AUC values of 0.85 and 0.90, respectively.

The KNN model achieved precision metrics comparable to prior studies; however, its limited interpretability may constrain clinical adoption[7]. Conversely, LR, while highly interpretable, exhibited lower sensitivity and F1 scores—consistent with Ramgopal et al.[10], where the model tended to overpredict positive cases, reducing precision. ANN (multilayer perceptron) models, though commonly employed in UTI prediction studies, demonstrated moderate performance in our dataset, slightly below previously reported benchmarks[2].

Overall, these results reinforce the value of ensemble ML methods in the context of an LDSS for UTI prediction. They offer high predictive accuracy and consistent performance across internal and external validation cohorts, supporting their applicability in real-world clinical settings.

Several studies have investigated ML–based urine culture prediction, varying in complexity and generalizability. Seheult et al.[31] developed a decision-tree algorithm across multiple institutions to identify urinalysis predictors of culture positivity, reporting ROC-AUC values of approximately 0.78–0.79; however, their study lacked external validation and interpretability assessment. By comparison, our model achieved higher discrimination during development (ROC-AUC = 0.94–0.96) under cross-validation. Following conversion into a simplified score-based LDSS, real-world performance settled at a lower but stable level (ROC-AUC ≈ 0.70–0.72; F1 ≈ 0.50–0.55). This decline reflects the expected trade-off between model complexity and clinical interpretability, as the LDSS was designed for practical integration into LIS rather than maximizing algorithmic precision[31].

Sergounioti et al.[32] applied ensemble classifiers, including RF and XGBoost, to real-world laboratory data, achieving AUROC values of 0.79–0.82. However, their models combined clinical and laboratory parameters and lacked transparent feature-importance analysis. In contrast, our LDSS relied solely on structured laboratory data, achieved comparable discrimination (0.70–0.72), and preserved interpretability and reproducibility through rule-based score calibration via the Model-Prioritized and Dual-Optimization systems.

Sheele et al.[33] investigated bacteriuria prediction in an emergency-department cohort using mixed clinical–laboratory features, yielding AUC-ROC values of 0.86–0.93 depending on the CFU/mL threshold. While their results were strong in a high-acuity population, our laboratory-only LDSS achieved comparable sensitivity (up to 95%) in routine diagnostic settings, highlighting its potential as a front-end decision-support tool for reflex culture testing.

Collectively, previous studies demonstrated the feasibility of ML-assisted urine culture prediction but often emphasized algorithmic performance over interpretability and clinical applicability. The present study addresses this gap by establishing a transparent, externally validated, and operational LDSS framework that maintains clinically acceptable performance while remaining fully interpretable and implementable within routine laboratory workflows.

Explainability and Feature Importance

SHAP-based feature-importance analysis in our study revealed a variable ranking that aligns with and extends existing literature. The most influential predictors were bacterial count, urine leukocyte count, nitrite, age, and leukocyte esterase. These findings are consistent with the meta-analysis by Devillé et al.[8], which reported that combining nitrite and leukocyte esterase yielded a sensitivity of 88% and specificity of 98% for UTI diagnosis. Similarly, Lachs et al.[34] demonstrated that integrating these parameters with clinical symptoms significantly improves diagnostic accuracy.

Notably, our model also identified HGB level, sex, and LYM as important features with relatively high SHAP values, suggesting sensitivity to broader systemic or demographic factors that may influence infection risk. This aligns with Zhao et al.[35], who reported age and sex among the top predictors in a SHAP-based post-urostomy UTI risk model, and Wang et al.[36], who found that systemic inflammatory markers and age were highly important in predicting post-surgical UTIs.

The predominance of microscopic urinalysis variables—particularly bacterial and leukocyte counts—over clinical or demographic features underscores the model’s responsiveness to diagnostic biomarkers. This differentiates our approach from models such as Lee et al.[37], which focused on predicting antimicrobial resistance patterns but also leveraged SHAP analysis for interpretability.

Recent literature highlights the limitations of reflexive urine culture testing in the absence of clinical context. Munigala et al.[38] and others have shown that reflex algorithms triggered by markers like leukocyte esterase or nitrite may reduce test volume but compromise diagnostic precision when symptom data are unavailable. Fakih et al.[39] similarly argue that urinalysis alone is insufficient for accurate UTI diagnosis in asymptomatic patients, risking overdiagnosis and overtreatment.

Our study addresses this diagnostic gap through a reflective LDSS developed solely from structured laboratory data. Because symptom data are typically absent from LIS, the LDSS optimizes culture utilization within real-world laboratory constraints. Rather than functioning as an autonomous decision-maker or reflex trigger, the system serves as a reflective tool, providing SHAP-based analytical insights to support laboratory physicians’ expert interpretation.

This reflective framework promotes standardized testing and interdisciplinary consultation. In equivocal cases, LDSS outputs can facilitate dialogue between laboratory and clinical teams, helping reconcile test reduction with diagnostic safety. Such an approach advances rational microbiological testing and provides a scalable model for clinician-laboratory collaboration[40].

The LDSS demonstrated robust predictive performance across internal and external datasets, supporting its seamless integration into routine laboratory workflows and reflective testing processes. The system is designed not to replace culture testing but to prioritize it based on evidence-driven probability, maintaining diagnostic stewardship.

To enhance accessibility for readers from diverse clinical and laboratory backgrounds, this study emphasizes the translational relevance of the LDSS over computational complexity. Its explainable design—supported by SHAP analysis and simplified scoring systems—enables non-technical users to interpret outputs transparently. While technical details were included to ensure methodological transparency and reproducibility, the interpretability of the system fosters trust, usability, and interdisciplinary communication between laboratory specialists and treating physicians. By promoting shared understanding of data-driven reasoning, the LDSS supports faster decision-making, improved test stewardship, and enhanced integration of laboratory insights into clinical workflows.

LDSS

Although symptom data were unavailable in the laboratory dataset, the LDSS was intentionally designed to function within the routine workflow of laboratory medicine, where test requests are frequently submitted without accompanying clinical narratives. By aligning the model with real-world laboratory constraints, the LDSS remains applicable and scalable across diverse clinical settings.

To improve interpretability and minimize unnecessary complexity, feature selection was applied to reduce the number of input variables. Prior studies have consistently demonstrated that parsimonious models are better suited for clinical implementation, as they are easier to interpret and maintain, while preserving acceptable predictive performance[41, 42]. Accordingly, subsequent model development was restricted to ten key parameters, a reduction that did not produce a statistically or clinically meaningful decline in performance. This strategy ensured an optimal balance between model simplicity and predictive accuracy.

Several published studies have similarly developed LDSS frameworks based on urine culture data, including those reported by de Vries et al.[27], Dhanda et al.[29], Del Ben et al.[43], and Flores et al.[2]. Among these, Del Ben et al.[43] employed a decision-tree-based approach, whereas the remaining studies selected RF as the primary algorithm. The LDSS developed by de Vries and colleagues demonstrated performance metrics comparable to those observed in the present study, with ROC-AUC values ranging from 0.70 to 0.80. Although their model achieved a higher PPV, its NPV was lower than that of our model, highlighting differences in clinical trade-offs between false-positive and false-negative predictions.

Notably, Dhanda et al.[29] and Flores et al.[2] implemented scoring systems that stratified patients into high- and low-risk groups, an approach that is conceptually aligned with the strategy adopted in the present study. Across key performance metrics, the predictive accuracy of their models was broadly comparable to that of our system.

What distinguishes our LDSS is the integration of three distinct predictive models within a unified decision-making framework. To our knowledge, this is the first study to report the implementation of such a multi-model structure for UTI prediction. This design enables clinicians and laboratory physicians to select among alternative strategies according to specific clinical priorities, such as maximizing case detection or minimizing unnecessary diagnostic testing.

Although the SAFE-Score achieved excellent sensitivity, its specificity was limited (approximately 20%), a trade-off that may raise concerns regarding potential overtesting. Importantly, the LDSS was intentionally designed to accommodate this limitation by offering three complementary scoring strategies, each reflecting a distinct clinical philosophy. These include prioritization of patient safety (SAFE-Score), balanced diagnostic performance (Dual Optimization), and strict adherence to model-derived predictions (Model-Prioritized). Rather than enforcing a one-size-fits-all solution, the LDSS functions as a flexible framework that facilitates consensus-based decision-making, allowing institutions to align model selection with local clinical expectations and operational priorities.

Crucially, the proposed system is not static. By continuously incorporating real-world data—particularly cases in which algorithmic recommendations are compared with expert laboratory physician judgments—the LDSS can be iteratively retrained and refined. As additional large-scale datasets are accumulated over time, improvements in specificity and overall diagnostic balance are anticipated, reflecting the inherent capacity of ML models to evolve with expanding data inputs. In this respect, the LDSS serves not only as an immediate decision-support tool but also as a scalable platform for continuous learning and performance optimization.

Within the Turkish healthcare context, reflective testing has not yet been systematically implemented. Nevertheless, the LDSS offers a structured and standardized framework that may facilitate its adoption, reduce inappropriate urine culture requests, and support antimicrobial stewardship initiatives. Moreover, the Ministry of Health of Türkiye has recently introduced a “Rational Laboratory Utilization” directive that explicitly promotes reflex and reflective testing practices[44]. This regulatory emphasis is expected to accelerate the integration of reflective testing into routine laboratory workflows, highlighting the timeliness and practical relevance of the proposed system.

Finally, the LDSS was designed for seamless integration into routine clinical practice through Microsoft Excel, a widely available and familiar platform in most healthcare settings. All three predictive models are embedded within a single interface and generate concurrent outputs, enabling direct comparison and transparent interpretation at the point of use.

Due to time constraints, the validation cohort was relatively small. Nevertheless, implementation of the LDSS within our hospital’s central laboratory is planned, where it will be deployed to support real-time microbiological decision-making. This implementation will allow prospective validation of the system within routine laboratory workflows, evaluation of its diagnostic impact, and quantification of downstream outcomes, including reductions in unnecessary urine cultures, shorter turnaround times, and improved antibiotic stewardship. In addition, future multicenter studies across diverse healthcare systems are planned, incorporating structured clinical variables such as symptomatology, comorbidities, and medication history to further enhance the model’s generalizability and clinical relevance.

Study Limitations

Although this study leveraged a large dataset and included external validation, several limitations should be acknowledged. First, all data were derived from a single healthcare network, which may limit generalizability to institutions with different patient populations, laboratory infrastructures, or clinical workflows. Second, the retrospective study design precluded assessment of the LDSS in real-time clinical decision-making; prospective implementation studies are therefore required to determine its effects on clinical practice and patient outcomes.

Third, the model relied exclusively on structured laboratory data and did not incorporate patient symptoms, comorbidities, medication history, or clinical notes—factors known to influence UTI risk assessment and antibiotic prescribing. In routine clinical care, integration of such information is primarily the responsibility of the treating physician, who orders diagnostic tests based on patient history, clinical presentation, and prevailing guidelines. In contrast, laboratory physicians are tasked with processing submitted specimens according to standardized pre-analytical and analytical protocols. Although pre-preanalytical factors, such as appropriate test selection, are important, these data are rarely available in the LIS in a structured, analyzable format. Consequently, most LIS environments contain only coded test orders and limited demographic information, without access to patient symptomatology or detailed clinical context.

Within these real-world constraints, the LDSS was designed not as a replacement for clinical judgment but as a complementary, interpretable decision-support tool that standardizes reflective testing and promotes communication between laboratory and clinical teams. Accordingly, the system functions as a laboratory-based reflex testing prioritization tool rather than as a diagnostic or therapeutic decision-making platform.

Fourth, despite robust performance in both internal and external test sets, the relatively small independent validation cohort—enriched for high-acuity inpatients—may introduce spectrum bias and lead to overestimation of sensitivity in complex clinical populations. Fifth, although the conventional definition of significant bacteriuria is ≥10⁵ CFU/mL, this study adopted a ≥10⁴ CFU/mL threshold based on emerging clinical evidence and institutional practice. Future investigations should evaluate the effects of alternative thresholds on model calibration and performance across different clinical settings.

Sixth, scoring weights and feature thresholds were calibrated using a fixed probability cutoff and Youden’s index derived from the present dataset. Optimal thresholds may vary across institutions and will require local adjustment to maintain the desired balance between sensitivity and specificity. Finally, while SHAP values were employed to enhance model interpretability, clinician acceptance, usability, and integration into routine workflows were not formally assessed. Future implementation studies are therefore essential to evaluate user engagement, potential alert fatigue, and cost-effectiveness prior to widespread clinical deployment.

Conclusion

We developed and preliminarily validated an interpretable, multi-model LDSS designed to improve the efficiency of urine culture utilization. By integrating ensemble ML approaches with SHAP-based interpretability, the system demonstrated strong discriminatory performance while offering flexible scoring strategies that prioritize sensitivity, specificity, or an optimized balance between the two. The LDSS has the potential to reduce unnecessary urine cultures, support antimicrobial stewardship efforts, and promote standardized, evidence-based laboratory decision-making.

Future work will focus on prospective, real-world implementation across diverse clinical settings. Planned enhancements include integration with electronic health record–derived clinical data, local calibration of decision thresholds, and systematic evaluation of clinical impact, user adoption, and cost-effectiveness. These steps are critical for translating this early-stage model into a scalable and clinically actionable decision-support tool.

Ethics Committee Approval: Ethical approval was obtained from the University of Health Sciences Türkiye, İzmir Tepecik Training and Research Hospital Non-Interventional Research Ethics Committee prior to study initiation (approval number: 2025/02-05, dated: 10.03.2025).
Informed Consent: Retrospective study.

Authorship Contributions

Surgical and Medical Practices: F.D., İ.A., Concept: F.D., İ.A., Design: F.D., M.A., İ.A., Data Collection or Processing: F.D., M.A., İ.A., Analysis or Interpretation: F.D., A.D., Literature Search: F.D., M.A., A.D., Writing: F.D., A.D.
Conflict of Interest: No conflict of interest was declared by the authors.
Financial Disclosure: The authors declared that this study received no financial support.

References

1
Schmiemann G, Kniehl E, Gebhardt K, Matejczyk MM, Hummers-Pradier E. The diagnosis of urinary tract infection: a systematic review. Dtsch Arztebl Int. 2010;107(21):361–7.
2
Flores E, Martínez-Racaj L, Blasco Á, Diaz E, Esteban P, López-Garrigós M, Salinas M. A step forward in the diagnosis of urinary tract infections: from machine learning to clinical practice. Comput Struct Biotechnol J. 2024;24:533–41.
3
O’Brien M, Marijam A, Mitrani-Gold FS, Terry L, Taylor-Stokes G, Joshi AV. Unmet needs in uncomplicated urinary tract infection in the United States and Germany: a physician survey. BMC Infect Dis. 2023;23:281.
4
Ozkan IA, Koklu M, Sert IU. Diagnosis of urinary tract infection based on artificial intelligence methods. Comput Methods Programs Biomed. 2018;166:51–9.
5
Choi MH, Kim D, Park Y, Jeong SH. Development and validation of artificial intelligence models to predict urinary tract infections and secondary bloodstream infections in adult patients. J Infect Public Health. 2024;17:10–7.
6
World Health Organization. Global Antimicrobial Resistance and Use Surveillance System (GLASS) report 2022. Geneva: World Health Organization; 2022.
7
Dedeene L, Van Elslande J, Dewitte J, Martens G, De Laere E, De Jaeger P, De Smet D. An artificial intelligence-driven support tool for prediction of urine culture test results. Clin Chim Acta. 2024;562:119854.
8
Devillé WL, Yzermans JC, van Duijn NP, Bezemer PD, van der Windt DA, Bouter LM. The urine dipstick test useful to rule out infections: a meta-analysis of the accuracy. BMC Urol. 2004;4:4.
9
Verboeket-van de Venne WPHG, Aakre KM, Watine J, Oosterhuis WP. Reflective testing: adding value to laboratory testing. Clin Chem Lab Med. 2012;50:1249–52.
10
Ramgopal S, Horvat CM, Yanamala N, Alpern ER. Machine learning to predict serious bacterial infections in young febrile infants. Pediatrics. 2020;146(3):e20194096.
11
Møller JK, Sørensen M, Hardahl C. Prediction of risk of acquiring urinary tract infection during hospital stay based on machine-learning: a retrospective cohort study. PLoS One. 2021;16:e0248636.
12
Herter WE, Khuc J, Cinà G, Knottnerus BJ, Numans ME, Wiewel MA, Bonten TN, de Bruin DP, van Esch T, Chavannes NH, Verheij RA. Impact of a machine learning–based decision support system for urinary tract infections: prospective observational study in 36 primary care practices. JMIR Med Inform. 2022;10:e27795.
13
Mancini A, Vito L, Marcelli E, Piangerelli M, De Leone R, Pucciarelli S, Merelli E. Machine learning models predicting multidrug resistant urinary tract infections using “DsaaS.” BMC Bioinformatics. 2020;21:347.
14
Shapiro Ben David S, Romano R, Rahamim-Cohen D, Azuri J, Greenfeld S, Gedassi B, Lerner U. AI driven decision support reduces antibiotic mismatches and inappropriate use in outpatient urinary tract infections. NPJ Digit Med. 2025;8:61.
15
Burton RJ, Albur M, Eberl M, Cuff SM. Using artificial intelligence to reduce diagnostic workload without compromising detection of urinary tract infections. BMC Med Inform Decis Mak. 2019;19:171.
16
Kranz J, Bartoletti R, Bruyère F, Cai T, Geerlings S, Köves B, Schubert S, Pilatz A, Veeratterapillay R, Wagenlehner FME, Bausch K, Devlies W, Horváth J, Leitner L, Mantica G, Mezei T, Smith EJ, Bonkat G. European Association of Urology Guidelines on urological infections: summary of the 2024 guidelines. Eur Urol. 2024;86(1):27–41.
17
Nelson Z, Aslan AT, Beahm NP, Blyth M, Cappiello M, Casaus D, Dominguez F, Egbert S, Hanretty A, Khadem T, Olney K, Abdul-Azim A, Aggrey G, Anderson DT, Barosa M, Bosco M, Chahine EB, Chowdhury S, Christensen A, de Lima Corvino D, Fitzpatrick M, Fleece M, Footer B, Fox E, Ghanem B, Hamilton F, Hayes J, Jegorovic B, Jent P, Jimenez-Juarez RN, Joseph A, Kang M, Kludjian G, Kurz S, Lee RA, Lee TC, Li T, Maraolo AE, Maximos M, McDonald EG, Mehta D, Moore WJ, Nguyen CT, Papan C, Ravindra A, Spellberg B, Taylor R, Thumann A, Tong SYC, Veve M, Wilson J, Yassin A, Zafonte V, Mena Lora AJ. Guidelines for the prevention, diagnosis, and management of urinary tract infections in pediatrics and adults: a WikiGuidelines Group consensus statement. JAMA Netw Open. 2024;7(11):e2444495. Erratum in: JAMA Netw Open. 2024;7(12):e2453497.
18
Werneburg GT, Lewis KC, Vasavada SP, Wood HM, Goldman HB, Shoskes DA, Li I, Rhoads DD. Urinalysis exhibits excellent predictive capacity for the absence of urinary tract infection. Urology. 2023;175:101–6.
19
Gharaghani M, Taghipour S, Halvaeezadeh M, Mahmoudabadi AZ. Candiduria: a review article with specific data from Iran. Turk J Urol. 2018;44(6):445–52.
20
Clinical and Laboratory Standards Institute (CLSI). Performance standards for antimicrobial susceptibility testing. 35th ed. CLSI document M100. Wayne (PA): CLSI; 2025.
21
Hosmer DW, Lemeshow S, Sturdivant RX. Applied logistic regression. 3rd ed. Hoboken (NJ): Wiley; 2013.
22
Yelin I, Snitser O, Novich G, Katz R, Tal O, Parizade M, Chodick G, Koren G, Shalev V, Kishony R. Personal clinical history predicts antibiotic resistance of urinary tract infections. Nat Med. 2019;25:1143–52.
23
Gadalla AAH, Friberg IM, Kift-Morgan A, Zhang J, Eberl M, Topley N, Weeks I, Cuff S, Wootton M, Gal M, Parekh G, Davis P, Gregory C, Hood K, Hughes K, Butler C, Francis NA. Identification of clinical and urine biomarkers for uncomplicated urinary tract infection using machine learning algorithms. Sci Rep. 2019;9:19694.
24
Werneburg GT, Rhoads DD, Milinovich A, McSweeney S, Knorr J, Mourany L, Zajichek A, Goldman HB, Haber GP, Vasavada SP. External validation of predictive models for antibiotic susceptibility of urine culture. BJU Int. 2025.
25
Hooton TM. Uncomplicated urinary tract infection. N Engl J Med. 2012;366:1028–37.
26
Foxman B. Epidemiology of urinary tract infections: incidence, morbidity, and economic costs. Am J Med. 2002;113:5–13.
27
de Vries S, ten Doesschate T, Totté JEE, Heutz JW, Loeffen YGT, Oosterheert JJ, Thierens D, Boel E. A semi-supervised decision support system to facilitate antibiotic stewardship for urinary tract infections. Comput Biol Med. 2022;146:105621.
28
Lin T-H, Chung H-Y, Jian M-J, Chang C-K, Lin H-H, Yu C-M, Perng C-L, Chang F-Y, Chen C-W, Chiu C-H, Shang H-S. Artificial intelligence-clinical decision support system for enhanced infectious disease management: accelerating ceftazidime-avibactam resistance detection in Klebsiella pneumoniae. J Infect Public Health. 2024;17:102541.
29
Dhanda G, Asham M, Shanks D, O’Malley N, Hake J, Satyan MT, Yedlinsky NT, Parente DJ. Adaptation and external validation of pathogenic urine culture prediction in primary care using machine learning. Ann Fam Med. 2023;21:11–8.
30
Taylor RA, Moore CL, Cheung K-H, Brandt C. Predicting urinary tract infections in the emergency department with machine learning. PLoS One. 2018;13:e0194085.
31
Seheult JN, Stram MN, Contis L, Pontzer RE, Hardy S, Wertz W, Baxter CM, Ondras M, Kip PL, Snyder GM, Pasculle AW. Development, evaluation, and multisite deployment of a machine learning decision tree algorithm to optimize urinalysis parameters for predicting urine culture positivity. J Clin Microbiol. 2023;61(6):e0029123.
32
Sergounioti A, Rigas D, Zoitopoulos V, Kalles D. From preliminary urinalysis to decision support: machine learning for UTI prediction in real-world laboratory data. J Pers Med. 2025;15(5):200.
33
Sheele JM, Campbell RL, Jones DD. Machine learning to predict bacteriuria in the emergency department. Sci Rep. 2025;15(1):31087.
34
Lachs MS, Nachamkin I, Edelstein PH, Goldman J, Feinstein AR, Schwartz JS. Spectrum bias in the evaluation of diagnostic tests: lessons from the rapid dipstick test for urinary tract infection. Ann Intern Med. 1992;117:135–40.
35
Zhao Q, Liu M, Gao K, Zhang B, Qi F, Xing T, Liu C, Gao J. Predicting 90-day risk of urinary tract infections following urostomy in bladder cancer patients using machine learning and explainability. Sci Rep. 2025;15:6807.
36
Wang H, Ding J, Wang S, Li L, Song J, Bai D. Enhancing predictive accuracy for urinary tract infections post-pediatric pyeloplasty with explainable AI: an ensemble TabNet approach. Sci Rep. 2025;15:2455.
37
Lee H-G, Seo Y, Kim JH, Han SB, Im JH, Jung CY, Durey A. Machine learning model for predicting ciprofloxacin resistance and presence of ESBL in patients with UTI in the ED. Sci Rep. 2023;13:3282.
38
Munigala S, Rojek R, Wood H, Yarbrough ML, Jackups RR, Burnham C-AD, Warren DK. Effect of changing urine testing orderables and clinician order sets on inpatient urine culture testing: analysis from a large academic medical center. Infect Control Hosp Epidemiol. 2019;40:281–86.
39
Fakih MG, Advani SD, Vaughn VM. Diagnosis of urinary tract infections: need for a reflective rather than reflexive approach. Infect Control Hosp Epidemiol. 2019;40:834–35.
40
Chambliss AB, Van TT. Revisiting approaches to and considerations for urinalysis and urine culture reflexive testing. Crit Rev Clin Lab Sci. 2022;59:112–24.
41
Chandrashekar G, Sahin F. A survey on feature selection methods. Comput Electr Eng. 2014;40:16-28.
42
Tonekaboni S, Joshi S, McCradden MD, Goldenberg A. What clinicians want: contextualizing explainable machine learning for clinical end use. Mach Learn Healthc. 2019:1–21.
43
Del Ben F, Da Col G, Cobârzan D, Turetta M, Rubin D, Buttazzi P, Antico A. A fully interpretable machine learning model for increasing the effectiveness of urine screening. Am J Clin Pathol. 2023;160:620–32.
44
General Directorate of Health Services (Türkiye Ministry of Health). Rational Laboratory Utilization Project (Akılcı Laboratuvar Kullanımı Projesi). Official Letter No. E.319. March 5, 2018. Ankara, Türkiye.

Supplementary Materials