
Basic Concepts of Seizure Detection


Blog 2 of 4

In the first blog of our Understanding Seizure Detection series, we summarized different approaches to defining “seizure detection”. In this blog we introduce ways to understand and visualize the performance of a seizure detection algorithm. The Receiver Operating Characteristic (ROC) curve and the F1 score are two popular tools used for that purpose.

The ROC Curve

The Receiver Operating Characteristic (ROC) curve visualizes the trade-off between detecting true seizures and generating false alarms across different detection thresholds. As the separability between signal and noise increases, the area under the ROC curve (AUC) approaches 1. When signal and noise completely overlap, the AUC approaches 0.5. The AUC value, also called choice probability, provides a sensitive measure of how well an EEG feature can differentiate seizure from non-seizure activity (Figure 1).

In other words, a high area under the curve signifies that an algorithm detects seizures accurately. An AUC of 1 corresponds to perfect, 100% accurate classification. An AUC of 0.5, however, signifies that the algorithm detects seizures only at chance level. AUC values below 0.5 are even worse, signifying that the algorithm identifies the target signal at below-chance rates.
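To make the AUC concrete, here is a minimal sketch (not any particular product's implementation) of the rank-based interpretation above: the AUC equals the probability that a randomly chosen seizure epoch receives a higher detector score than a randomly chosen non-seizure epoch. The scores below are made-up illustrative values.

```python
def auc(seizure_scores, non_seizure_scores):
    """AUC as the probability that a random seizure epoch outscores
    a random non-seizure epoch (ties count half)."""
    wins = ties = 0
    for s in seizure_scores:
        for n in non_seizure_scores:
            if s > n:
                wins += 1
            elif s == n:
                ties += 1
    return (wins + 0.5 * ties) / (len(seizure_scores) * len(non_seizure_scores))

# Perfect separation between signal and noise -> AUC = 1.0
perfect = auc([0.9, 0.8], [0.2, 0.1])
# Partial overlap -> AUC between 0.5 and 1.0 (8 of 9 pairs ranked correctly)
partial = auc([0.9, 0.8, 0.4], [0.3, 0.5, 0.2])
# Identical distributions -> chance level, AUC = 0.5
chance = auc([0.5], [0.5])
```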

To find out how we derive useful performance metrics from contingency tables and confusion matrices, see the next section, Understanding Performance Metrics.

Understanding Performance Metrics

Sensitivity and Specificity

Diagnostic accuracy is typically expressed through sensitivity, specificity, False Positive Rate, and Precision:

[Figure 1]
  • Sensitivity (Recall or Positive Percent Agreement): The proportion of actual seizures correctly identified (TP / [TP + FN])
  • Specificity (Negative Percent Agreement): The proportion of non-seizure activity correctly identified (TN / [TN + FP])
  • False Positive Rate: (1-Specificity)
  • Precision: The accuracy of positive predictions (TP / [TP + FP])
[Figure: the contingency table]
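The four quantities above follow directly from the contingency-table counts. A minimal sketch with hypothetical counts (the numbers are illustrative, not from any study):

```python
def contingency_metrics(tp, fp, tn, fn):
    """Derive the standard metrics from contingency-table counts."""
    sensitivity = tp / (tp + fn)   # recall / Positive Percent Agreement
    specificity = tn / (tn + fp)   # Negative Percent Agreement
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false_positive_rate": 1.0 - specificity,
        "precision": tp / (tp + fp),
    }

# Hypothetical evaluation: 50 true seizure epochs, 100 non-seizure epochs
m = contingency_metrics(tp=40, fp=10, tn=90, fn=10)
```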

Deep-Dive: F1 Score


The F1 score balances precision and sensitivity into a single metric using their harmonic mean: 2 × (Precision × Sensitivity) / (Precision + Sensitivity). This score ranges from 0 to 1, where 1 represents perfect detection. The F1 score is particularly valuable for imbalanced datasets—a common situation in EEG analysis, where non-seizure data vastly outnumbers seizure data. A detection algorithm with a very low threshold might catch every seizure (high recall) but also generate numerous false alarms (low precision). The F1 score provides a balanced assessment that reflects real-world utility, where both sensitivity and false-alarm rates matter equally (Figure 2).
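As a sketch of the arithmetic (values chosen to match the point highlighted in Figure 2): a precision of 0.75 at a recall of about 0.6 corresponds to an F1 of roughly 0.67, while a low-threshold detector with perfect recall but 0.1 precision scores only about 0.18.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

balanced = f1(0.75, 0.6)    # ~0.667, as at the highlighted point in Figure 2
alarm_happy = f1(0.1, 1.0)  # perfect recall, many false alarms -> ~0.182
```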

Figure 2: F1 contours and testing new models. On the left are the F1 contours, which represent the expected Precision (P) as a function of Recall (R) for any given F1 score. On the right, the F1 contours were plotted for specific seizure detection models, with a test model's F1 curve overlaid. This representation enables us to compare the test model's performance with that of other models. The blue curve represents the test model's performance, with a specific point highlighting the part of the curve where the expected Precision at a given Recall level (Recall ≈ 0.6) is 0.75, higher than in the other models (courtesy of Bálint Csanády and the Zeto AI team).

The problem of Ground Truth: Inter-Rater Agreement

Performance metrics assume we have an objective “ground truth” to compare against. In seizure detection, this ground truth comes from expert EEG readers with years of experience. However, this foundation becomes shaky when we examine inter-rater agreement.

Research shows that even expert physicians can disagree substantially. In our own study, three independent experts reviewing the same EEG data agreed on only 54% of seizures (Figure 3). This high variability in inter-rater assessments raises serious questions about the reliability of sensitivity, specificity, and F1 scores.

It is crucial to maintain the statistical representativeness of the physicians reviewing the test data. First of all, the rating physicians must be blinded to each other's seizure annotations; otherwise one physician's annotation may influence the others'. It is also advisable to select independent readers from geographically and professionally distant institutions: physicians trained at the same school, who completed residency under shared supervision, or who have worked together may bias the inter-rater agreement and decrease the representativeness of expert opinion. A study aiming for objectivity should mitigate these types of biases.

Figure 3: Inter-rater agreement. This Venn diagram illustrates the concordance of seizure ratings among three independent EEG reading experts (Experts 1-3), color-coded. In this sample of EEGs, 18.38% of samples were classified as seizure suspicions and 81.62% as non-seizures. Of the 18.38% seizure suspicions, only 10.1% overlapped across all three experts, corresponding to a 55% inter-rater agreement (based on internal Zeto data).

Take Away

Investigators must therefore report a Positive Percent Agreement (PPA) and a Negative Percent Agreement (NPA) across raters to quantify the concordance of the independent experts' annotations.

The formulas are:
PPA = TP / (TP + FN),
NPA = TN / (TN + FP).
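As a sketch, PPA and NPA can be computed between two raters from epoch-level (per-sample) seizure labels. Treating one rater as the reference is an assumption for illustration, and the label vectors below are made up:

```python
def percent_agreement(reference, candidate):
    """PPA and NPA of `candidate` against `reference`, computed over
    per-epoch boolean seizure labels (True = seizure)."""
    pairs = list(zip(reference, candidate))
    tp = sum(1 for r, c in pairs if r and c)
    fn = sum(1 for r, c in pairs if r and not c)
    tn = sum(1 for r, c in pairs if not r and not c)
    fp = sum(1 for r, c in pairs if not r and c)
    return tp / (tp + fn), tn / (tn + fp)

# Two raters labeling the same five epochs (illustrative)
ppa, npa = percent_agreement([True, True, False, False, True],
                             [True, False, False, True, True])
```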

Addressing the Uncertainty

Several approaches can help manage this uncertainty:

  • Conservative approach: Use only seizures identified by all experts (intersection). This yields high confidence ratings but excludes many valid seizure episodes if they were missed by just one expert. It provides the smallest number of validated seizures.
  • Moderate approach: Require agreement from more than one but not all experts. This balances confidence with data availability. 
  • Liberal approach: Include all seizures identified by any expert. This maximizes data but reduces consensus. It provides the largest number of validated seizures.
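The three strategies above amount to thresholding the number of expert “votes” per candidate event. A minimal sketch with hypothetical event IDs:

```python
from collections import Counter

def consensus(expert_annotations, min_votes):
    """Keep candidate seizure events marked by at least `min_votes` experts."""
    votes = Counter(event for annotations in expert_annotations
                    for event in annotations)
    return {event for event, n in votes.items() if n >= min_votes}

experts = [{"sz1", "sz2", "sz3"},   # expert 1's annotated events (illustrative)
           {"sz2", "sz3", "sz4"},   # expert 2
           {"sz3", "sz5"}]          # expert 3

conservative = consensus(experts, 3)   # intersection of all experts
moderate = consensus(experts, 2)       # more than one, not necessarily all
liberal = consensus(experts, 1)        # union of all experts
```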

Ideally, consensus scoring across many raters provides a more robust ground truth. However, this variability means that no published F1 score has absolute validity: the metrics change when different expert readers are involved, making claims of 100% recall or a perfect F1 score questionable and sample-dependent.

Finally, inter-rater variability is highly sensitive to the method by which we determine the overlap between expert readers' definitions of seizures. If we apply the “percent overlap” (discussed in Blog 1: “How to Define Seizure Detection”) between seizure events defined by their onset and offset times, the disagreement can be significant even when the experts agree on the onset but disagree on the offset time. In practice, reported inter-rater agreements are usually high (>90%) because most investigators consider an agreement perfect when the seizure-onset asynchrony does not exceed a predefined ΔT interval (usually a few seconds), ignoring the offset asynchrony. This topic is elaborated further in the next section.
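To illustrate onset-based matching, here is a sketch that scores agreement purely on onset times within a tolerance ΔT, ignoring offsets as described above (the onset values and the 3-second tolerance are assumptions):

```python
def onset_agreement(onsets_a, onsets_b, delta_t=3.0):
    """Fraction of rater A's seizure onsets matched by a rater-B onset
    within delta_t seconds (each B onset used at most once)."""
    matched, used = 0, set()
    for a in onsets_a:
        for i, b in enumerate(onsets_b):
            if i not in used and abs(a - b) <= delta_t:
                used.add(i)
                matched += 1
                break
    return matched / len(onsets_a)

# Onsets in seconds; the third pair differs by far more than delta_t
agreement = onset_agreement([10.0, 50.0, 100.0], [11.0, 49.0, 200.0])
```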

Temporal Considerations: Seizure Onset and Offset

Seizures have relatively well-defined onset times but much less clear offset times. Seizures often taper gradually, making the endpoint a matter of clinical judgment rather than objective measurement. Some clinicians consider a seizure to have ended when the EEG returns to near-normal, while others use the transition from periodic to aperiodic patterns as the endpoint.

This ambiguity particularly affects the diagnosis of electrographic status epilepticus (ESE), which is defined by total seizure duration thresholds (10% continuous or 12% fragmented seizures over an hour). Since offset time uncertainty directly impacts duration calculations, it can influence critical clinical decisions.
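To show why offset uncertainty matters for ESE, here is a sketch of a seizure-burden calculation over a one-hour window (the event times are made up, and the 10% threshold is taken from the definition quoted above):

```python
def seizure_burden(events, window_s=3600.0):
    """Total seizure time as a fraction of the window.
    `events` is a list of (onset, offset) pairs in seconds."""
    return sum(offset - onset for onset, offset in events) / window_s

# Two seizures totalling 360 s -> exactly a 10% burden over the hour
burden = seizure_burden([(0.0, 200.0), (600.0, 760.0)])
# Marking each offset just 30 s earlier drops the burden below 10%
earlier = seizure_burden([(0.0, 170.0), (600.0, 730.0)])
```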


Take Away

Expert raters generally show higher agreement on seizure onset than offset times, adding another layer of complexity to performance validation.

The Electrode Coverage Dilemma

An often-underestimated factor in seizure detection is electrode montage and coverage. Clinical EEG systems range from partial limited montage (headbands, single electrodes) to full montage coverage (headcaps or individually attached electrodes with standardized electrode positions). Most comply with the 10-20 system, which ensures reproducible electrode placement.

Reduced-montage electrode coverage introduces an inherent bias by failing to capture all possible seizures, particularly focal seizures confined to brain regions outside the electrode array. A reduced montage therefore cannot establish a complete ground truth for seizure occurrence. Nevertheless, some studies define ground truth using reduced montages rather than full electrode coverage, which makes it easier to report perfect sensitivity and high specificity. This is particularly likely when algorithm performance is evaluated against expert consensus derived from the same reduced electrode set, which may miss seizures occurring outside the monitored regions. In other words, if the ground truth includes only seizures visible within the partial montage, and excludes seizures occurring beyond it, then failing to detect those excluded seizures will not hurt the performance scores. Readers should therefore be cautious about overvaluing high sensitivity or specificity reported for algorithms based on limited montages.

In contrast, several studies recommend defining the ground truth using expert annotations based on full-montage EEG recordings [1, 3, 5, 7, 9]. When evaluated against this more comprehensive reference, partial-montage (reduced-electrode-set) approaches yielded more moderate performance, with reported sensitivities around 75% and specificities near 97%. Notably, none of these studies claimed perfect sensitivity. That is, however, the more realistic and statistically plausible outcome, because this more prudent assessment strategy is not subject to the inherent bias of partial-montage approaches.

Given the inherent bias of using reduced-montage EEG as ground truth, it is not surprising that some algorithms based on partial montages report 100% sensitivity. Such results beg the question of how an algorithm can outperform the concordance between blinded human readers evaluating either reduced-montage or full-coverage EEG records. It cannot; this simply illustrates the inherent bias of partial-montage approaches. A recent multi-center study (AccuRASE) brings some clarity to the contradictory results such approaches can produce: evaluating partial-montage seizure detection methods against a full-montage ground truth, it found only 29% (low-to-moderate) sensitivity but relatively high, >90% specificity [8].

The above example illustrates how important it is to maximize the objectivity of the ground-truth dataset. A misrepresented ground truth may also create the false impression that reduced montages perform as well as, or even better than, full montages. Note that specificity (and the false positive rate) is relatively immune to the reduced-montage bias, because it is affected only by true negatives and false positives, neither of which is related to missed seizures.

In other words, by reducing the electrodes available to assess the ground truth, sensitivity assessments based on partial montages risk overestimating true sensitivity. Approaches based on full-montage ground-truth data are likely to score lower sensitivity simply because additional ground-truth channels, such as those over the central head regions, are included. A full-montage assessment may in fact be much more sensitive, but this will not be reflected when its sensitivity value is compared with values achieved by partial-montage approaches. The simple take-away is that reducing ground-truth coverage biases sensitivity scores upward.
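A toy calculation (all numbers hypothetical) makes the bias explicit: the same detector output yields a much higher sensitivity when seizures missed by the montage are simply excluded from the ground truth.

```python
def sensitivity(detected, ground_truth):
    """Fraction of ground-truth seizures that were detected."""
    return len(detected & ground_truth) / len(ground_truth)

full_gt = set(range(100))     # 100 seizures found on full-montage review
partial_gt = set(range(70))   # only 70 of them visible on the reduced montage
detected = set(range(65))     # the reduced-montage algorithm detects 65

vs_partial = sensitivity(detected, partial_gt)   # 65/70, roughly 0.93
vs_full = sensitivity(detected, full_gt)         # 65/100 = 0.65
```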


Take Away

Best practice: When evaluating non-standard or partial-coverage montages, ground truth should always be determined using full-coverage, expert-reviewed EEG data.

Achieving unbiased performance metrics is paramount in seizure detection. However, performance evaluation is only half the story—the training dataset plays an equally critical role. To understand how training data fundamentally shapes seizure detection performance, see Blog 3: “The Critical Role of Training Data”.

Deep-Dive: Factors That Complicate Study Comparisons


The literature on the effect of reduced electrode montages on seizure detection is mixed and often shaped by study design and sponsorship [1-7]. Industry-sponsored studies tend to report high concordance between reduced and full montages in seizure detection sensitivity, with little effect on specificity — a finding that is unsurprising given that vendors design these studies to support the equivalence of their systems and emphasize similarity in outcomes [1, 2]. Independent studies, by contrast, more often highlight meaningful differences and report losses in sensitivity with reduced coverage.

Beyond sponsorship bias, several clinical and technical factors complicate direct comparisons across studies.

  1. The first is time to EEG. In many acute settings, rapid electrode application is critical. When setup speed is clinically decisive, the practical benefit of a reduced montage system may outweigh the information lost through limited spatial coverage, justifying its use even if sensitivity is modestly lower.
  2. The second is patient age. EEG monitoring is strongly recommended in neonates when hypoxic-ischemic injury is suspected or following complicated deliveries. Neonatal-specific reduced montages focus on the frontal and central regions, which carry the highest diagnostic relevance in this population. Given the ease of application and the reduced discomfort for newborns, a targeted reduced montage is often considered sufficient in this context [6].
  3. The third is electrode configuration. Not all reduced montages are equivalent. For seizure detection, temporal lobe coverage is particularly important, as temporal-onset seizures are among the most common. Montages or headband systems that include temporal electrodes are therefore meaningfully better positioned to capture focal seizures than those that do not.
  4. The fourth is seizure type. Reduced montage systems can reliably detect focal seizures with an onset within or near the covered electrodes, but will likely miss focal seizures arising outside the zone of coverage. Generalized seizures, given their broad spatial distribution, are well captured even by reduced montage sets [1,3].

When comparing published studies quantitatively, a clear performance gap emerges: studies reporting sensitivity around 95% for reduced montages stand in stark contrast to those reporting sensitivity below 85%, underscoring how strongly methodology, montage design, and patient population drive reported outcomes.

It is plausible that reduced montages affect sensitivity more than specificity: a reduced montage mainly produces more missed seizures (FN) rather than more false detections (FP), and FN lowers sensitivity while FP lowers specificity.

References:

  1. Tacke, M., Janson, K., Vill, K. et al. Effects of a reduction of the number of electrodes in the EEG montage on the number of identified seizure patterns. Sci Rep 12, 4621 (2022). https://doi.org/10.1038/s41598-022-08628-9
  2. Westover, M. B., Gururangan, K., Markert, M. S., Blond, B. N., Lai, S., Benard, S., Bickel, S., Hirsch, L. J., & Parvizi, J. (2020). Diagnostic Value of Electroencephalography with Ten Electrodes in Critically Ill Patients. Neurocritical care, 33(2), 479–490. https://doi.org/10.1007/s12028-019-00911-4
  3. Frankel, M. A., Lehmkuhle, M. J., Spitz, M. C., Newman, B. J., Richards, S. V., & Arain, A. M. (2021). Wearable Reduced-Channel EEG System for Remote Seizure Monitoring. Frontiers in neurology, 12, 728484. https://doi.org/10.3389/fneur.2021.728484
  4. Ma, B. B., Johnson, E. L., & Ritzl, E. K. (2018). Sensitivity of a Reduced EEG Montage for Seizure Detection in the Neurocritical Care Setting. Journal of clinical neurophysiology : official publication of the American Electroencephalographic Society, 35(3), 256–262. https://doi.org/10.1097/WNP.0000000000000463
  5. Asif, R., Saleem, S., Hassan, S. A., Alharbi, S. A., & Kamboh, A. M. (2020). Epileptic seizure detection with a reduced montage: A way forward for ambulatory EEG devices. IEEE Access, 8, 65880–65890. https://doi.org/10.1109/ACCESS.2020.2983917
  6. Lin, Y. C., Lin, H. A., Chang, M. L., & Lin, S. F. (2025). Diagnostic accuracy of reduced electroencephalography montages for seizure detection: A frequentist and Bayesian meta-analysis. Neurophysiologie clinique = Clinical neurophysiology, 55(2), 103044. https://doi.org/10.1016/j.neucli.2025.103044
  7. Stevenson, N. J., Lauronen, L., & Vanhatalo, S. (2018). The effect of reducing EEG electrode number on the visual interpretation of the human expert for neonatal seizure detection. Clinical neurophysiology : official journal of the International Federation of Clinical Neurophysiology, 129(1), 265–270. https://doi.org/10.1016/j.clinph.2017.10.031
| Study | Types of seizure / setting | Sensitivity, full montage [%] | Sensitivity, partial montage [%] | Specificity, full montage [%] | Specificity, partial montage [%] |
|---|---|---|---|---|---|
| 1. Westover et al. (2020) | ICU | 97.5 | 81.8 | 100 | 94.4 |
| 2. Frankel et al. (2021) | – | – | 90 | – | 90 |
| 3. Ma, Johnson, & Ritzl (2018) | Seizure / ESE | – | 81 / 69 | – | 92 / 97 |
| 4. Asif et al. (2020) | – | 95 | 92 | 99 | 99 |
| 5. Lin et al. (2025) | <8 electrodes / >8 electrodes | 75 | 66 / 77 | – | 97 / 97 |
| 6. Stevenson, Lauronen, & Vanhatalo (2018) | 18% of seizures detected in the 19-electrode montage were not detected in the 8- or 4-electrode montage | 100 | 70 | – | – |
| 7. Tacke et al. (2022) | – | 76 | 65* | 96 | 97 |

Table 1. Representative studies comparing seizure detection performance between full and partial montage electrodes.

References

  1. Asif, R., Saleem, S., Hassan, S. A., Alharbi, S. A., & Kamboh, A. M. (2020). Epileptic seizure detection with a reduced montage: A way forward for ambulatory EEG devices. IEEE Access, 8, 65880-65890. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9050500
  2. Backman, S., Cronberg, T., Rosén, I., & Westhall, E. (2020). Reduced EEG montage has a high accuracy in the post cardiac arrest setting. Clinical Neurophysiology, 131(9), 2216–2223. https://doi.org/10.1016/j.clinph.2020.06.021
  3. Frankel, M. A., Lehmkuhle, M. J., Spitz, M. C., Newman, B. J., Richards, S. V., & Arain, A. M. (2021). Wearable Reduced-Channel EEG System for Remote Seizure Monitoring. Frontiers in neurology, 12, 728484. https://doi.org/10.3389/fneur.2021.728484
  4. Grant, A. C., Abdel-Baki, S. G., Weedon, J., Arnedo, V., Chari, G., Koziorynska, E., Lushbough, C., Maus, D., McSween, T., Mortati, K. A., Reznikov, A., & Omurtag, A. (2014). EEG interpretation reliability and interpreter confidence: a large single-center study. Epilepsy & behavior : E&B, 32, 102–107. https://doi.org/10.1016/j.yebeh.2014.01.011
  5. Lin, Y. C., Lin, H. A., Chang, M. L., & Lin, S. F. (2025). Diagnostic accuracy of reduced electroencephalography montages for seizure detection: A frequentist and Bayesian meta-analysis. Neurophysiologie clinique = Clinical neurophysiology, 55(2), 103044. https://doi.org/10.1016/j.neucli.2025.103044
  6. Little, S. C., & Raffel, S. C. (1962). Intra-rater reliability of EEG interpretations. The Journal of nervous and mental disease, 135, 77–81. https://doi.org/10.1097/00005053-196207000-00010
  7. Ma, B. B., Johnson, E. L., & Ritzl, E. K. (2018). Sensitivity of a Reduced EEG Montage for Seizure Detection in the Neurocritical Care Setting. Journal of clinical neurophysiology : official publication of the American Electroencephalographic Society, 35(3), 256–262. https://doi.org/10.1097/WNP.0000000000000463
  8. Sheikh, Z. B., Dhakar, M. B., Fong, M. W. K., Fang, W., Ayub, N., Molino, J., Haider, H. A., Foreman, B., Gilmore, E., Mizrahi, M., Karakis, I., Schmitt, S. E., Osman, G., Yoo, J. Y., & Hirsch, L. J. (2025). Accuracy of a Rapid-Response EEG’s Automated Seizure-Burden Estimator: AccuRASE Study. Neurology, 104(2), e210234. https://doi.org/10.1212/WNL.0000000000210234
  9. Westover, M. B., Gururangan, K., Markert, M. S., Blond, B. N., Lai, S., Benard, S., Bickel, S., Hirsch, L. J., & Parvizi, J. (2020). Diagnostic value of electroencephalography with ten electrodes in critically ill patients. Neurocritical Care, 33(2), 479–490. https://doi.org/10.1007/s12028-019-00911-4