ChatGPT Exhibits Reduced Sensitivity in Identifying Concern for Malignancy in Darker Skin Tones

July 2026 | Volume 25 | Issue 7 | 10083 | Copyright © July 2026


Published online June 29, 2026

Nicholas Schell MCIT^a, Simona A. Alomary BA^b,c, Nicole Baker BS^c,d, Jina Chung MD^c, Temitayo Ogunleye MD^c, Susan C. Taylor MD^c

^a Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA
^b Rutgers New Jersey Medical School, Newark, NJ
^c Department of Dermatology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA
^d Sidney Kimmel Medical College, Philadelphia, PA

Abstract
To the Editor,

As artificial intelligence (AI) increases in popularity, understanding the nature of health information it provides to patients at home is paramount. ChatGPT's image upload feature lends itself in particular to dermatologic questions, given the specialty's reliance on visual assessment of skin lesions. Understanding the performance of AI tools like ChatGPT in answering common dermatologic questions is important for evaluating how patients conceptualize and make decisions about their skin health at home. This study investigates the ability of ChatGPT, a common AI tool, to identify concern for dermatologic malignancy in lesions across various skin tones.

A total of 656 images of biopsy-confirmed lesions were collected from the Stanford Diverse Dermatology Images (DDI) dataset, a pathologically confirmed and expertly curated clinical image dataset described by Daneshjou et al.^1 The dataset was used with permission from the DDI study team. Each image was characterized by skin tone ("light" [Fitzpatrick I/II], "medium" [Fitzpatrick III/IV], or "dark" [Fitzpatrick V/VI]) and by malignancy status ("benign" or "malignant"). For each image, ChatGPT was asked, "Is this image concerning for malignancy?" Response accuracy was evaluated, and groups were compared using chi-squared tests.

ChatGPT's overall response accuracy was similar across all skin tones (58.7% correct, 37.0% incorrect, 4.3% unknown). Sensitivity, calculated among lesions with confirmed malignancy, was stratified by skin tone: ChatGPT correctly identified malignancy in 47.9% of light, 65.3% of medium, and 28.3% of dark skin lesions, a statistically significant difference across skin tones (P=0.0004). In contrast, specificity was similar across skin tones (63.3% in light, 79.8% in medium, and 69.7% in dark skin; P=0.38).
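
For readers who want to see how a comparison of this kind can be computed, the following is a minimal sketch in Python. It is not the authors' analysis code. The per-group counts are hypothetical placeholders, chosen only to make the arithmetic easy to follow, because the letter reports percentages rather than raw counts. The function names `sensitivity` and `chi_squared_test` are likewise illustrative. The sketch calculates sensitivity for each skin tone group and then a Pearson chi-squared test of independence on the resulting 3 x 2 table. With three groups and two outcomes there are 2 degrees of freedom, for which the P value has the closed form exp(-statistic/2).

```python
import math


def sensitivity(detected, missed):
    """Proportion of confirmed malignant lesions that were flagged as concerning."""
    return detected / (detected + missed)


def chi_squared_test(table):
    """Pearson chi-squared test of independence for a table of observed counts.

    The closed-form p-value used here is valid only for 2 degrees of freedom,
    which is the case for a 3 x 2 table (three skin tone groups, two outcomes).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    if dof != 2:
        raise ValueError("closed-form p-value here assumes 2 degrees of freedom")
    p_value = math.exp(-statistic / 2)  # chi-squared survival function for dof = 2
    return statistic, dof, p_value


# HYPOTHETICAL counts of malignant lesions per group: (detected, missed).
# These are NOT the study's data; they only illustrate the calculation.
hypothetical_malignant = {
    "light": (20, 20),
    "medium": (30, 10),
    "dark": (10, 30),
}

per_group_sensitivity = {
    tone: sensitivity(detected, missed)
    for tone, (detected, missed) in hypothetical_malignant.items()
}

statistic, dof, p_value = chi_squared_test(
    [list(counts) for counts in hypothetical_malignant.values()]
)

for tone, value in per_group_sensitivity.items():
    print(f"{tone}: sensitivity = {value:.1%}")
print(f"chi-squared = {statistic:.2f}, dof = {dof}, P = {p_value:.2g}")
```

The same function applies to specificity by substituting counts of benign lesions correctly and incorrectly classified. How "unknown" responses are assigned to a column is an analytic choice that this sketch does not settle.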

These findings demonstrate that ChatGPT performs poorly at identifying concern for malignancy in images across all skin tones, but exhibits significantly reduced sensitivity on darker skin. Several factors may play a role in these results, including potential difficulty recognizing common dermatologic findings associated with malignancy, such as erythema, vascular changes, or variations in pigmentation. Prior work has