Blog 2 of 4
In the first blog of our Understanding Seizure Detection series, we summarized different approaches to defining “seizure detection”. In this blog we introduce ways to understand and visualize the performance of a seizure detection algorithm. The Receiver Operating Characteristic (ROC) curve and the F1 score are two popular tools used for this purpose.
The ROC Curve
The Receiver Operating Characteristic (ROC) curve visualizes the trade-off between detecting true seizures and generating false alarms across different detection thresholds. As the separability between signal and noise increases, the area under the ROC curve (AUC) approaches 1. When signal and noise completely overlap, the AUC approaches 0.5. The AUC value, also called choice probability, provides a sensitive measure of how well an EEG feature can differentiate seizure from non-seizure activity (Figure 1).
In other words, a high area under the curve signifies that an algorithm detects seizures accurately. An AUC of 1 corresponds to perfect, 100% accurate classification. An AUC of 0.5, in contrast, means the algorithm detects seizures only at chance level. AUC values below 0.5 are worse still, signifying that the algorithm identifies the target signal less often than chance would.
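The "choice probability" reading of the AUC can be made concrete: it is the probability that a randomly chosen seizure epoch receives a higher detector score than a randomly chosen non-seizure epoch. A minimal sketch, using invented scores and labels (not data from any study):

```python
# AUC as a pairwise-ranking probability (the Mann-Whitney interpretation):
# the fraction of (seizure, non-seizure) pairs in which the seizure epoch
# scores higher; ties count as half a win.

def auc(scores, labels):
    """labels: 1 = seizure epoch, 0 = non-seizure epoch."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfectly separated scores give AUC = 1.0; an inverted detector gives 0.0.
print(auc([0.9, 0.8, 0.7, 0.2, 0.1], [1, 1, 1, 0, 0]))  # 1.0
print(auc([0.1, 0.9], [1, 0]))                          # 0.0
```

Overlapping score distributions push the result toward 0.5, matching the chance-level interpretation above.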
To find out how we derive useful performance metrics from contingency tables and confusion matrices, see our next chapter on Understanding Performance Metrics.
Understanding Performance Metrics
Sensitivity and Specificity
Diagnostic accuracy is typically expressed through sensitivity, specificity, False Positive Rate, and Precision:

- Sensitivity (Recall or Positive Percent Agreement): The proportion of actual seizures correctly identified (TP / [TP + FN])
- Specificity (Negative Percent Agreement): The proportion of non-seizure activity correctly identified (TN / [TN + FP])
- False Positive Rate: The proportion of non-seizure activity incorrectly flagged as seizure (1 − Specificity)
- Precision: The accuracy of positive predictions (TP / [TP + FP])
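The four definitions above, plus the F1 score mentioned earlier (the harmonic mean of precision and sensitivity), can be computed directly from confusion-matrix counts. The counts below are hypothetical, chosen only to illustrate the arithmetic:

```python
# Diagnostic metrics from raw confusion-matrix counts.
def metrics(tp, fn, tn, fp):
    sensitivity = tp / (tp + fn)   # recall / positive percent agreement
    specificity = tn / (tn + fp)   # negative percent agreement
    fpr = 1 - specificity          # false positive rate
    precision = tp / (tp + fp)     # accuracy of positive predictions
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, fpr, precision, f1

sens, spec, fpr, prec, f1 = metrics(tp=45, fn=5, tn=900, fp=50)
print(f"sensitivity={sens:.2f}  specificity={spec:.3f}  "
      f"FPR={fpr:.3f}  precision={prec:.2f}  F1={f1:.2f}")
```

Note how the imbalance between seizure and non-seizure epochs lets a detector score high sensitivity and specificity while precision stays modest, which is why no single metric suffices.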

The problem of Ground Truth: Inter-Rater Agreement
Performance metrics assume we have an objective “ground truth” to compare against. In seizure detection, this ground truth comes from expert EEG readers with years of experience. However, this foundation becomes shaky when we examine inter-rater agreement.
Research shows that even expert physicians can disagree substantially. In our own study, three independent experts reviewing the same EEG data agreed on only 54% of seizures (Figure 3). This high variability in inter-rater assessments raises serious questions about the reliability of sensitivity, specificity, and F1 scores.
It is crucial to maintain the statistical representativeness of the physicians reviewing the test data. First, the rating physicians must be blinded to each other’s seizure annotations; otherwise one physician’s annotations may influence the others’. It is also recommended to select independent readers from geographically and professionally distant institutions: physicians trained at the same school, who worked as residents under shared supervision, or who have been co-workers may inflate inter-rater agreement and reduce the representativeness of expert opinion. A study aiming for objectivity should mitigate these biases.

Take Away
Investigators should therefore report a Positive Percent Agreement (PPA) and a Negative Percent Agreement (NPA) between raters to quantify the concordance of the independent experts’ annotations.
The formulas are:
PPA = TP / (TP + FN),
NPA = TN / (TN + FP).
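A minimal sketch of these formulas applied to two blinded raters scoring the same epochs, arbitrarily treating rater A as the reference (the annotations are invented for illustration):

```python
# Epoch-by-epoch annotations from two blinded raters; 1 = marked as seizure.
rater_a = [1, 1, 0, 0, 1, 0, 0, 1]  # reference rater
rater_b = [1, 0, 0, 0, 1, 1, 0, 1]

tp = sum(a == 1 and b == 1 for a, b in zip(rater_a, rater_b))
fn = sum(a == 1 and b == 0 for a, b in zip(rater_a, rater_b))
tn = sum(a == 0 and b == 0 for a, b in zip(rater_a, rater_b))
fp = sum(a == 0 and b == 1 for a, b in zip(rater_a, rater_b))

ppa = tp / (tp + fn)   # 3 / 4 = 0.75
npa = tn / (tn + fp)   # 3 / 4 = 0.75
print(f"PPA={ppa:.2f}  NPA={npa:.2f}")
```

Which rater serves as the reference is a convention; swapping the roles swaps FN and FP and can change both values, another reason to report the pairing explicitly.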
Addressing the Uncertainty
Several approaches can help manage this uncertainty:
- Conservative approach: Use only seizures identified by all experts (intersection). This yields high-confidence ratings but excludes valid seizure episodes that were missed by even one expert. It provides the smallest number of validated seizures.
- Moderate approach: Require agreement from more than one but not all experts. This balances confidence with data availability.
- Liberal approach: Include all seizures identified by any expert. This maximizes data but reduces consensus. It provides the largest number of validated seizures.
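The three strategies reduce to set operations once each expert’s seizures are keyed by an identifier. A sketch with hypothetical event IDs:

```python
# Each expert's annotated seizures, keyed by invented event IDs.
expert_1 = {"sz01", "sz02", "sz03"}
expert_2 = {"sz01", "sz02", "sz04"}
expert_3 = {"sz01", "sz03", "sz05"}
experts = [expert_1, expert_2, expert_3]

conservative = set.intersection(*experts)       # all raters agree
liberal = set.union(*experts)                   # any single rater suffices
moderate = {e for e in liberal
            if sum(e in ex for ex in experts) >= 2}  # majority vote

print(len(conservative), len(moderate), len(liberal))  # 1 3 5
```

The ordering conservative ⊆ moderate ⊆ liberal always holds, which is exactly the confidence-versus-data-availability trade-off described above.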
Ideally, consensus scoring across many raters provides a more robust ground truth. This variability also means that no published F1 score has absolute validity: the metrics change when different expert readers are involved, making claims of 100% recall or perfect F1 scores questionable and sample-dependent.
Finally, inter-rater variability is highly sensitive to the method used to determine overlap between expert readers’ seizure definitions. If we apply the “percent overlap” (discussed in Blog 1: “How to Define Seizure Detection”) between seizure events defined by their onset and offset times, the disagreement can be significant even when the experts agree on the onset but disagree on the offset. Reported inter-rater agreements are therefore usually high (>90%), because most investigators score an agreement as perfect when the seizure-onset asynchrony does not exceed a predefined ΔT interval (usually a few seconds) and ignore offset asynchrony altogether. This topic is elaborated further below.
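The onset-tolerance convention can be sketched as event matching within a ΔT window. The onset times below are invented; only the matching rule reflects the practice described above:

```python
# Two raters "agree" on an event when their onset times differ by at most
# delta_t seconds; offsets are ignored, as in most published figures.
delta_t = 5.0  # seconds of allowed onset asynchrony

onsets_a = [120.0, 640.0, 1510.0]   # rater A's seizure onsets (s)
onsets_b = [118.5, 655.0, 1512.0]   # rater B's seizure onsets (s)

matched = sum(
    any(abs(a - b) <= delta_t for b in onsets_b) for a in onsets_a
)
agreement = matched / len(onsets_a)
print(f"{matched}/{len(onsets_a)} onsets matched -> {agreement:.0%}")
```

Tightening delta_t or additionally requiring offset agreement would drop the reported figure sharply, which is why the chosen tolerance should always accompany any published agreement number.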
Temporal Considerations: Seizure Onset and Offset
Seizures have relatively well-defined onset times but much less clear offset times. Seizures often taper gradually, making the endpoint a matter of clinical judgment rather than objective measurement. Some clinicians consider the seizure to be ended when the EEG returns to near-normal, while others use the transition from periodic to aperiodic patterns as the endpoint.
This ambiguity particularly affects the diagnosis of electrographic status epilepticus (ESE), which is defined by total seizure-duration thresholds over an hour-long window (10% continuous or 12% fragmented seizure activity). Since offset-time uncertainty directly affects duration calculations, it can influence critical clinical decisions.
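A small sketch of how offset uncertainty propagates into the hourly seizure-burden fraction; the event times are invented, and only the arithmetic is meant to be illustrative:

```python
# Total seizure burden as a fraction of a one-hour window.
def burden(events, window_s=3600.0):
    """events: list of (onset_s, offset_s) pairs within the window."""
    return sum(off - on for on, off in events) / window_s

events = [(0.0, 210.0), (1200.0, 1330.0), (2500.0, 2560.0)]
print(f"{burden(events):.1%}")  # 11.1% of the hour

# Extending each offset by just 30 s of ambiguous "tapering" EEG:
stretched = [(on, off + 30.0) for on, off in events]
print(f"{burden(stretched):.1%}")  # 13.6%
```

With three fragmented events, 30 seconds of offset ambiguity per event moves the burden by 2.5 percentage points, enough to cross a duration threshold and change the ESE classification.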
Take Away
Expert raters generally show higher agreement on seizure onset than offset times, adding another layer of complexity to performance validation.
The Electrode Coverage Dilemma
An often-underestimated factor in seizure detection is electrode montage and coverage. Clinical EEG systems range from limited partial montages (headbands, single electrodes) to full-montage coverage (headcaps or individually attached electrodes at standardized positions). Most comply with the 10-20 system, which ensures reproducible electrode placement.
Reduced-montage electrode coverage introduces an inherent bias by failing to capture all possible seizures, particularly focal seizures confined to brain regions outside the electrode array. A reduced montage therefore cannot establish a complete ground truth for seizure occurrence. Nevertheless, some studies define ground truth using reduced montages rather than full electrode coverage, which makes it easier to report perfect sensitivity and high specificity. This is particularly likely when algorithm performance is evaluated against expert consensus derived from the same reduced electrode set, which may miss seizures occurring outside the monitored regions. In other words, if the ground truth includes only seizures visible within the partial montage and excludes those occurring beyond it, then failing to detect the excluded seizures does not hurt the performance scores. Readers should therefore be cautious about overvaluing high sensitivity or specificity reported for algorithms based on limited montages.
In contrast, several studies recommend defining the ground truth using expert annotations based on full-montage EEG recordings [1, 3, 5, 7, 9]. When evaluated against this more comprehensive reference, partial-montage (reduced-electrode-set) approaches yielded more moderate performance, with reported sensitivities around 75% and specificities near 97%. Notably, none of these studies claimed perfect sensitivity. That is also the more realistic and statistically plausible outcome, because this more prudent assessment strategy is not subject to the inherent bias of partial-montage approaches.
Given the inherent bias of using reduced-montage EEG as ground truth, it is not surprising that some algorithms based on partial montages report 100% sensitivity. Such results beg the question of how an algorithm can outperform the concordance between blinded human readers evaluating either reduced-montage or full-coverage EEG records. It cannot; this simply illustrates the inherent bias of such partial-montage approaches. A recent multi-center study (AccuRASE) brings some clarity to the source of these contradictory results: evaluating the seizure detection performance of partial-montage methods against a full-montage ground truth, it found only 29% (low-to-moderate) sensitivity but relatively high (>90%) specificity [8].
The example above illustrates how important it is to maximize the objectivity of the ground-truth dataset. Misrepresenting the ground truth may also create the false impression that reduced montages perform as well as, or even better than, full montages. Note that specificity (and the false positive rate) is relatively immune to the reduced-montage bias, because it depends only on true negatives and false positives, neither of which is related to missed seizures.
In other words, by reducing the electrode set used to establish the ground truth, sensitivity assessments based on partial montages risk overestimating the true sensitivity. Approaches based on full-montage ground-truth data are likely to score lower sensitivity values simply because the additional ground-truth channels over the central head regions contribute extra reference seizures. A full-montage assessment may in fact be far more sensitive, but that will not be reflected when its sensitivity value is compared with those achieved by partial-montage approaches. The simple take-away is that reducing ground-truth coverage biases sensitivity scores upward.
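The montage bias can be made explicit with a toy example: the same detector output scored against a full-montage ground truth versus a reduced-montage ground truth that simply omits the seizures its electrodes cannot see. All event IDs and counts are invented:

```python
# Detector output from a reduced-montage recording.
detected = {"sz01", "sz02", "sz03"}

# Full-montage ground truth; sz04 and sz05 occur outside the reduced array.
full_truth = {"sz01", "sz02", "sz03", "sz04", "sz05"}
reduced_truth = full_truth - {"sz04", "sz05"}  # invisible events dropped

def sensitivity(det, truth):
    return len(det & truth) / len(truth)

print(sensitivity(detected, reduced_truth))  # 1.0 ("perfect")
print(sensitivity(detected, full_truth))     # 0.6 (the honest figure)
```

The detector is identical in both rows; only the reference changed. This is the mechanism behind implausible 100% sensitivity claims from partial-montage evaluations.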
Take Away
Best practice: When evaluating non-standard or partial-coverage montages, ground truth should always be determined using full-coverage, expert-reviewed EEG data.
Achieving unbiased performance metrics is paramount in seizure detection. However, performance evaluation is only half the story—the training dataset plays an equally critical role. To understand how training data fundamentally shapes seizure detection performance, see Blog 3: “The Critical Role of Training Data”.
References
- Asif, R., Saleem, S., Hassan, S. A., Alharbi, S. A., & Kamboh, A. M. (2020). Epileptic seizure detection with a reduced montage: A way forward for ambulatory EEG devices. IEEE Access, 8, 65880-65890. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9050500
- Backman, S., Cronberg, T., Rosén, I., & Westhall, E. (2020). Reduced EEG montage has a high accuracy in the post cardiac arrest setting. Clinical Neurophysiology, 131(9), 2216–2223. https://doi.org/10.1016/j.clinph.2020.06.021
- Frankel, M. A., Lehmkuhle, M. J., Spitz, M. C., Newman, B. J., Richards, S. V., & Arain, A. M. (2021). Wearable reduced-channel EEG system for remote seizure monitoring. Frontiers in Neurology, 12, 728484. https://doi.org/10.3389/fneur.2021.728484
- Grant, A. C., Abdel-Baki, S. G., Weedon, J., Arnedo, V., Chari, G., Koziorynska, E., Lushbough, C., Maus, D., McSween, T., Mortati, K. A., Reznikov, A., & Omurtag, A. (2014). EEG interpretation reliability and interpreter confidence: a large single-center study. Epilepsy & Behavior, 32, 102–107. https://doi.org/10.1016/j.yebeh.2014.01.011
- Lin, Y. C., Lin, H. A., Chang, M. L., & Lin, S. F. (2025). Diagnostic accuracy of reduced electroencephalography montages for seizure detection: A frequentist and Bayesian meta-analysis. Neurophysiologie Clinique = Clinical Neurophysiology, 55(2), 103044. https://doi.org/10.1016/j.neucli.2025.103044
- Little, S. C., & Raffel, S. C. (1962). Intra-rater reliability of EEG interpretations. The Journal of Nervous and Mental Disease, 135, 77–81. https://doi.org/10.1097/00005053-196207000-00010
- Ma, B. B., Johnson, E. L., & Ritzl, E. K. (2018). Sensitivity of a reduced EEG montage for seizure detection in the neurocritical care setting. Journal of Clinical Neurophysiology, 35(3), 256–262. https://doi.org/10.1097/WNP.0000000000000463
- Sheikh, Z. B., Dhakar, M. B., Fong, M. W. K., Fang, W., Ayub, N., Molino, J., Haider, H. A., Foreman, B., Gilmore, E., Mizrahi, M., Karakis, I., Schmitt, S. E., Osman, G., Yoo, J. Y., & Hirsch, L. J. (2025). Accuracy of a Rapid-Response EEG’s Automated Seizure-Burden Estimator: AccuRASE Study. Neurology, 104(2), e210234. https://doi.org/10.1212/WNL.0000000000210234
- Westover, M. B., Gururangan, K., Markert, M. S., Blond, B. N., Lai, S., Benard, S., Bickel, S., Hirsch, L. J., & Parvizi, J. (2020). Diagnostic value of electroencephalography with ten electrodes in critically ill patients. Neurocritical Care, 33(2), 479–490. https://doi.org/10.1007/s12028-019-00911-4

