Abstract
Background and Objectives: Artificial intelligence (AI) has shown promising performance in skin-lesion classification; however, its fairness, external validity, and real-world reliability remain uncertain. This systematic review and meta-analysis evaluated the diagnostic accuracy, equity, and generalizability of AI-based dermatology systems across diverse imaging modalities and clinical settings.
Materials and Methods: A comprehensive search of PubMed, Embase, Web of Science, and ClinicalTrials.gov (inception–31 October 2025) identified diagnostic accuracy studies using clinical, dermoscopic, or smartphone images. Eighteen studies (11 melanoma-focused; 7 mixed benign–malignant) met inclusion criteria. Six studies provided complete 2 × 2 contingency data for bivariate Reitsma HSROC modeling, while seven reported AUROC values with extractable variance. Risk of bias was assessed using QUADAS-2, and evidence certainty was graded using GRADE.
Results: Across more than 70,000 test images, pooled sensitivity and specificity were 0.91 (95% CI 0.74–0.97) and 0.64 (95% CI 0.47–0.78), respectively, corresponding to an HSROC AUROC of 0.88 (95% CI 0.84–0.92). The AUROC-only meta-analysis yielded a similar pooled AUROC of 0.88 (95% CI 0.87–0.90). Diagnostic performance was highest in specialist settings (AUROC 0.90), followed by community care (0.85) and smartphone environments (0.81). Notably, performance was lower in darker skin tones (Fitzpatrick IV–VI: AUROC 0.82) compared with lighter skin tones (I–III: 0.89), indicating persistent fairness gaps.
Conclusions: AI-based dermatology systems achieve high diagnostic accuracy but demonstrate reduced performance in darker skin tones and non-specialist environments. These findings emphasize the need for diverse training datasets, skin-tone–stratified reporting, and rigorous external validation before broad clinical deployment.
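To make the 2 × 2 contingency data mentioned in the methods concrete, the following minimal Python sketch shows how sensitivity and specificity are computed from one study's counts, with an approximate 95% CI formed on the logit scale. It is not the bivariate Reitsma model used in the review, which pools several studies jointly. The function names `sens_spec` and `logit_ci` are hypothetical, and the counts are invented, chosen only so that the point estimates match the pooled values reported above; they are not data from any included study.

```python
import math


def sens_spec(tp, fp, fn, tn):
    """Return (sensitivity, specificity) from the four cells of a 2x2 table."""
    sensitivity = tp / (tp + fn)  # diseased cases correctly flagged
    specificity = tn / (tn + fp)  # non-diseased cases correctly cleared
    return sensitivity, specificity


def logit_ci(successes, total, z=1.96):
    """Approximate CI for a proportion, built on the logit scale.

    Requires 0 < successes < total so that the logit is finite.
    """
    p = successes / total
    logit = math.log(p / (1 - p))
    se = math.sqrt(1 / successes + 1 / (total - successes))
    lower, upper = logit - z * se, logit + z * se
    # Back-transform the interval endpoints to the probability scale.
    return 1 / (1 + math.exp(-lower)), 1 / (1 + math.exp(-upper))


# Invented counts for illustration only (not taken from the review).
tp, fp, fn, tn = 91, 36, 9, 64

sens, spec = sens_spec(tp, fp, fn, tn)
sens_lo, sens_hi = logit_ci(tp, tp + fn)
spec_lo, spec_hi = logit_ci(tn, tn + fp)

print(f"sensitivity = {sens:.2f} (95% CI {sens_lo:.2f} to {sens_hi:.2f})")
print(f"specificity = {spec:.2f} (95% CI {spec_lo:.2f} to {spec_hi:.2f})")
```

A single-study interval like this will generally be narrower than a pooled interval from a random-effects model, because it ignores between-study heterogeneity.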
Publication Info
- Year: 2025
- Type: article
- Volume: 61
- Issue: 12
- Pages: 2186-2186
- Citations: 0
- Access: Closed
Identifiers
- DOI: 10.3390/medicina61122186