To the Editor,
As artificial intelligence (AI) increases in popularity, understanding the nature of health information it provides to patients at home is paramount. ChatGPT's image upload feature lends itself in particular to dermatologic questions, given the specialty's reliance on visual assessment of skin lesions. Understanding the performance of AI tools like ChatGPT in answering common dermatologic questions is important for evaluating how patients conceptualize and make decisions about their skin health at home. This study investigates the ability of ChatGPT, a common AI tool, to identify concern for dermatologic malignancy in lesions across various skin tones.
A total of 656 images of biopsy-confirmed lesions were collected from the Stanford Diverse Dermatology Images (DDI) dataset, a pathologically confirmed and expertly curated clinical image dataset described by Daneshjou et al.1 The dataset was used with permission from the DDI study team. The images were categorized by skin tone ("light" [Fitzpatrick I/II], "medium" [Fitzpatrick III/IV], or "dark" [Fitzpatrick V/VI]) and by malignancy status ("benign" or "malignant"). For each image, ChatGPT was asked, "Is this image concerning for malignancy?" Response accuracy was evaluated, and statistical comparisons were performed using the chi-squared test.

ChatGPT's overall response accuracy was similar across skin tones (58.7% correct, 37.0% incorrect, 4.3% unknown). For lesions with confirmed malignancy, sensitivity was stratified by skin tone: ChatGPT correctly identified malignancy in 47.9% of light, 65.3% of medium, and 28.3% of dark skin lesions, a significant difference across skin tones (P=0.0004). In contrast, specificity was similar across skin tones (63.3% in light, 79.8% in medium, and 69.7% in dark skin; P=0.38).
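As an illustration of how a sensitivity comparison of this kind can be computed, the following is a minimal sketch of a Pearson chi-squared test on a 3x2 table (three skin-tone groups by correct/missed malignancy calls). The counts are hypothetical placeholders, not the study's data, and the function names are assumptions introduced here; the letter does not state which software or test variant was used. For a 3x2 table the test has 2 degrees of freedom, so the P value has the closed form exp(-x/2).

```python
import math


def sensitivity(true_pos, false_neg):
    """Proportion of malignant lesions correctly flagged as concerning."""
    return true_pos / (true_pos + false_neg)


def chi_squared_3x2(table):
    """Pearson chi-squared test of independence for a 3x2 table.

    table: three rows (skin-tone groups), each [correct, missed].
    Returns (statistic, p_value). With 2 degrees of freedom the
    chi-squared survival function is exactly exp(-statistic / 2).
    """
    if len(table) != 3 or any(len(row) != 2 for row in table):
        raise ValueError("expected a 3x2 table")
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic, math.exp(-statistic / 2)


# Hypothetical counts for malignant lesions only: [correct, missed].
# These are illustrative placeholders, NOT values from the study.
counts = {
    "light": [24, 26],
    "medium": [32, 18],
    "dark": [14, 36],
}

for tone, (hit, miss) in counts.items():
    print(f"{tone}: sensitivity = {sensitivity(hit, miss):.1%}")

stat, p = chi_squared_3x2(list(counts.values()))
print(f"chi-squared = {stat:.2f}, P = {p:.4f}")
```

The same function applies to specificity by supplying, for benign lesions in each group, the counts of correct and incorrect responses.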
These findings demonstrate that ChatGPT performs poorly at identifying concern for malignancy in images across all skin tones and exhibits significantly reduced sensitivity on darker skin. Several factors may contribute to these results, including potential difficulty recognizing common dermatologic findings associated with malignancy, such as erythema, vascular changes, or variations in pigmentation. Prior work has







