INTRODUCTION
In a clinical research environment, acne severity is assessed either by manually
counting individual types of lesions or by comparing the subject to a grading scale and providing an overall global assessment. Both methods have reliably
demonstrated the clinical efficacy of anti-acne products for decades and are recommended by the US Food and Drug Administration.1 However, maintaining
assessment consistency within and across acne trials is an important consideration since counting and grading approaches can be subject to the judgment of
individual evaluators and have not been rigorously standardized. For example, the first reported acne grading system dates back to 1957,2 and a published
review in 2002 found that more than 25 grading systems and more than 19 lesion counting techniques have been in use since then.3 Although
inter-evaluator and intra-evaluator variability is the main reason for this lack of standardization, little research has been done to measure that variability.
Lucky et al found that when performing lesion counts, agreement among evaluators decreased with an increasing number of acne lesions, and overall the
agreement between evaluators was low.4 For the sake of simplicity, many clinicians use a grading system to loosely categorize acne as mild, moderate, or
severe; however, the reproducibility of this approach has also been shown to be low.5 In 2009, Barrett et al surveyed published articles that proposed novel
means of assessing severity in clinical trials in which the outcome measures were investigator-assessed. Of the nine clinical trials compared, only two offered
a statistical measure of inter-evaluator reliability, and none provided evidence of intra-evaluator reliability, responsiveness, or validity.6
Tan et al assessed the reliability of acne lesion counts and global assessments
(grading system) with a group of 11 dermatologists, five of whom had no formal training in acne grading or lesion counting. The dermatologists showed
generally good reliability in lesion counting, with a correlation coefficient >0.75, but much lower reliability for global assessment.
They also claimed that intra-evaluator reliability was much higher than inter-evaluator reliability; however, no measures quantifying
intra-evaluator reliability were reported. They further noted that formal training of the evaluator improved reliability in both lesion counting and
global assessments.7 According to Allen et al, two 12-week, double-blind, placebo-controlled studies of acne treatments were performed using three judges
and a total of 331 male college students as subjects. Global severity grades were assigned, and papules, pustules, and comedones were counted every two weeks,
and the data were evaluated using Pearson's