Blog 4 of 4
Seizure detection algorithms
The first three blogs in this series introduced the quantitative terminology used to evaluate seizure detection and emphasized the critical role of training and testing datasets. Here, we address the central challenge of quantifying seizure burden (SB) and explain why reliable seizure detection is the limiting factor for both real-time alerting and longitudinal clinical assessment.
Read more in our related blog posts:
- How to Define Seizure Detection
- Basic Concepts of Seizure Detection
- The Critical Role of Training Data in the Development of Seizure Detection Algorithms
Current State of Commercial Automated Seizure Detection
Not all clinical EEG platforms incorporate automated seizure detection capabilities, but the most widely deployed commercial systems feature modular software architectures that support optional seizure recognition modules. These solutions are implemented either as third-party plugins or proprietary extensions developed by the platform manufacturer. Notable examples of third-party integrations include Persyst’s seizure detection module for Natus NeuroWorks™ and Nihon Kohden EEG systems [1], and encevis’ cloud-based detection for Zeto’s NeuroPulse™ AI-powered cloud platform [2]. Ceribell developed Clarity™, a proprietary detection algorithm integrated with their rapid-response EEG system [3,4]. A recent review features the hardware capabilities of these point-of-care EEG systems [5].
Given that both SB tracking accuracy and electrographic status epilepticus (ESE) alerting reliability are fundamentally determined by seizure detection performance, we focus our analysis on comparing detection algorithms across available solutions. Our evaluation is limited to published performance metrics and, where source code has been made publicly available, direct comparative testing on standardized datasets.
Table 1 (below) presents published performance characteristics of widely used commercial seizure detection systems. It is critical to note that these performance ranges derive from studies conducted on heterogeneous datasets with varying patient populations, recording conditions, and seizure types. (We provided recommendations in Blog 3 under “Best Practices for Transparent Reporting” to improve the consistency of reported characteristics.) Consequently, direct performance comparisons based solely on published metrics are subject to significant dataset bias, as we discussed in Blog 3, “The Critical Role of Training Data”.
The Ground Truth Paradox
When evaluating seizure detection algorithms, the most critical methodological decision is how “ground truth” is defined. In medical device diagnostics, ground truth is assumed to represent the objective reality against which algorithm performance is measured. In seizure detection, however, this assumption is often fragile (see Blog 2: “The Problem of Ground Truth: Inter-Rater Agreement”). An incorrectly defined or biased ground truth can artificially inflate or suppress sensitivity and specificity, so studies that use different definitions of ground truth become difficult to compare even if their reported metrics appear similar. Because the contingency table used to calculate the key performance metrics is built directly from the ground truth (see Blog 1), the numbers entered in that table determine the sensitivity and specificity. Hence, by manipulating the ground truth one can easily enhance the apparent performance.
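To make this dependence concrete, here is a minimal Python sketch using invented, toy per-segment labels (not data from any of the cited studies). The same detector output is scored against two different ground-truth definitions; the resulting contingency tables, and therefore the sensitivity and specificity, change even though the detector never did.

```python
# Toy per-segment labels (hypothetical): the same detections scored against two
# ground-truth definitions fill the 2x2 contingency table differently.

def sens_spec(detections, reference):
    """Sensitivity and specificity from per-segment binary labels."""
    tp = sum(d and r for d, r in zip(detections, reference))
    fp = sum(d and not r for d, r in zip(detections, reference))
    fn = sum(not d and r for d, r in zip(detections, reference))
    tn = sum(not d and not r for d, r in zip(detections, reference))
    return tp / (tp + fn), tn / (tn + fp)

# Ten EEG segments; 1 = seizure, 0 = no seizure.
detector     = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]
truth_strict = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0]  # only unanimous expert calls count
truth_broad  = [1, 1, 1, 1, 0, 1, 0, 0, 1, 1]  # any expert-marked event counts

print(sens_spec(detector, truth_strict))  # (1.0, ~0.71): looks highly sensitive
print(sens_spec(detector, truth_broad))   # (~0.71, 1.0): same detector, different story
```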
“Ground truth” is not a “gold standard”
In many other diagnostic fields, an external confirmatory standard exists. For example, a radiological suspicion of a tumor on MRI or CT can be confirmed or refuted by biopsy, where histopathology serves as an independent validator. In these situations, time and the patient’s clinical outcome also help reveal whether the preceding assessment was correct.
Seizure detection lacks such an independent confirmatory modality. The EEG pattern itself is the phenomenon of interest. Consequently, ground truth in seizure detection is anchored only to expert interpretation. This creates an epistemological limitation: the reference standard for what is and is not a seizure is inherently subjective, even when systematically constructed.
Human Expert Consensus as Ground Truth
In electroencephalography (EEG), ground truth is typically established through expert physician annotation. As discussed previously (Blog 2: “The Problem of Ground Truth: Inter-Rater Agreement”), careful expert selection, representativeness, and blinded review are essential to mitigate bias. Multi-rater consensus is commonly used to strengthen reliability. Yet consensus does not guarantee correctness.
Inter-rater agreement varies across types of EEG (duration, clinical setting, partial or full montage, and patient condition) and across signal conditions. Agreement is typically higher when the signal-to-noise ratio is strong and the seizure is stereotypical. It decreases when the EEG is laden with artifacts or contaminated by noise, making it ambiguous and hard for human experts to read: precisely the scenarios in which automated systems may be superior.
The importance of obtaining independent, blinded scores from expert EEG readers cannot be overstated. Readers who belong to the same school of thought through identical training will necessarily provide more similar ratings than readers with more diverse backgrounds. Scores that are not blinded may also create an inherent interpersonal conflict between readers who are financially or professionally dependent on one another. Neurology fellows reading under the supervision of their attending physician in an un-blinded setting may be prone to agree with their mentor’s scores, artificially increasing the inter-rater reliability; human psychology dictates at minimum an unconscious bias toward wanting to please one’s superior.
Finally, the moment a method requires subjective human ratings, the principles of social science apply in full, suggesting that independent inter-rater scores follow a normal distribution with a mean and a standard deviation [6]. Even if ratings tend to converge to high agreement on clear textbook examples of a seizure signal, perfect inter-rater agreement of 100% would be a statistical anomaly once less clear, less ideal, and more edge-case seizure morphologies are included in the assessment. Perfect inter-rater reliabilities therefore become a warning sign in their own right. If ratings are perfectly convergent, the question arises what detection criteria were deployed (see Blog 1: proximity, overlap, percentage overlap), or what dataset characteristics might have contributed to this level of alignment. Perfect inter-rater reliabilities in EEG, on diverse, multi-faceted datasets, in a truly independent, larger, and diverse group of readers, are improbable.
The Paradox of Algorithmic Superiority
We noted above that “consensus does not guarantee correctness.” This is another source of confusion. A paradox emerges when a machine-learning algorithm consistently disagrees with expert annotations yet appears clinically plausible, internally consistent, and reproducible across datasets. Scored against a human-defined reference, such an algorithm is counted as making errors even though it may be flagging genuine events the experts missed, so the apparent “failure” may reflect a limitation of the reference rather than of the algorithm.
Only by recognizing and managing the “ground truth paradox” can we generate trustworthy, comparable, and clinically meaningful performance metrics.
The Relationship Between Sensitivity and Ground Truth
One of the clearest warning signs of bias in seizure detection studies is a mismatch between reported sensitivity and inter-rater agreement. The problem arises when vendor-reported sensitivity exceeds the level of agreement among expert reviewers. For example, if two independent EEG experts agree only ~80% of the time on whether a segment contains a seizure (see Blog 2, Figure 3), how can an algorithm claim 99% sensitivity against that same reference?
If Dr. X and Dr. Y agree on seizure presence 80% of the time, then an algorithm achieving 99% sensitivity must be nearly perfectly aligned with one expert while necessarily disagreeing with the other in a substantial fraction of cases. In other words, the 99% sensitivity cannot simultaneously apply to both experts unless one expert’s ratings are excluded from the reference standard.
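A back-of-the-envelope sketch makes the arithmetic explicit. It uses the hypothetical 80%/99% figures from the example above and treats “agreement” as a per-segment label mismatch rate, which is an illustrative simplification of both inter-rater agreement and sensitivity.

```python
# Hypothetical proportions: if the algorithm and Dr. X disagree on only 1% of
# segments, its disagreement with Dr. Y is pinned close to the X-vs-Y level.

disagree_x_y   = 1 - 0.80   # Dr. X and Dr. Y disagree on 20% of segments
disagree_alg_x = 1 - 0.99   # the algorithm disagrees with Dr. X on only 1%

# Mismatch rate between label vectors obeys the triangle inequality, so:
lower = abs(disagree_x_y - disagree_alg_x)   # 0.19
upper = disagree_x_y + disagree_alg_x        # 0.21
print(f"algorithm vs. Dr. Y disagreement lies between {lower:.2f} and {upper:.2f}")
```

In other words, matching one expert almost perfectly forces roughly 20% disagreement with the other; the only way to report 99% against both is to remove one of them from the reference.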
Constructed Ground Truth and Inflated Metrics
A common methodological solution is to define “ground truth” as only those seizures agreed upon by two or three experts. These consensus events are then treated as 100% true positives (see Blog 2: “conservative,” “moderate,” and “liberal approaches” to consensus).
While this practice increases internal consistency, it also reshapes the denominator of sensitivity. By excluding disputed events, the evaluation may preferentially retain clear, high signal-to-noise seizures and remove those ambiguous cases that lower inter-rater agreement. The resulting contingency table no longer reflects the full clinical reality—it reflects a filtered subset.
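A minimal numerical sketch, using invented toy counts rather than any published figures, shows how shrinking the denominator to consensus-only events inflates sensitivity when the algorithm performs worse on ambiguous events.

```python
# Toy counts (hypothetical) showing how a consensus-only reference inflates sensitivity.

consensus_events = 80    # seizures marked by both experts (clear, high-SNR)
disputed_events  = 40    # seizures marked by only one expert (ambiguous)

detected_consensus = 76  # the algorithm does well on unambiguous events...
detected_disputed  = 12  # ...and poorly on ambiguous ones

sens_consensus_only = detected_consensus / consensus_events                    # 0.95
sens_union_of_experts = (detected_consensus + detected_disputed) / (
    consensus_events + disputed_events)                                        # ~0.73

print(f"consensus-only reference:   {sens_consensus_only:.0%}")
print(f"union-of-experts reference: {sens_union_of_experts:.0%}")
```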
Take Away
Consensus-based ground truth in sensitivity and specificity assessments introduces a bias toward inflated scores. Sensitivity and specificity are therefore not independent of the ground-truth construction; they are downstream consequences of it.
Developers who report very high sensitivity should therefore demonstrate that these values are supported by correspondingly high, independent, blinded inter-rater agreement. If not, responsible reporting requires acknowledging that the robustness of the metric is limited by the robustness of the reference.
Why This Matters
The absence of practical reporting standards in this area has encouraged competitive sensitivity claims, often producing discrepancies between vendor-sponsored and independent studies. These discrepancies are frequently rooted not in statistical error, but in differences in dataset composition and ground truth construction.
Ultimately, real-world clinical performance will arbitrate these claims. Sensitivity that depends heavily on how disagreement cases were handled may not translate into consistent bedside performance.
Sensitivity cannot meaningfully exceed the reliability of its ground truth. When it appears to do so, the issue lies not in the algorithm—but in how the reference standard was constructed and interpreted.
Table 1. Published performance characteristics of widely used commercial seizure detection systems.

| Model | Primary use case | Montage | Inter-rater agreement | Temporal seizure matching | Method | Sensitivity (%) | Specificity (%) | Negative predictive value | Positive predictive value |
|---|---|---|---|---|---|---|---|---|---|
| Ceribell Clarity™ [3,4,7] | ICU rapid bedside triage / continuous seizure burden / rule-out status epilepticus | Partial montage: 8 channels | undisclosed | undisclosed | Time–frequency features and machine-learning classification on partial-montage EEG | ~29–100 | ~79–93 | ≈95–99 | Moderate; decreases with low seizure burden |
| Persyst, Version 11 [8,9] | Expert-assisted review, seizure burden | Full 10–20 cEEG | undisclosed | undisclosed | Time–frequency analysis, morphology, rhythmicity, and spatial coherence, with deterministic + ML-assisted components | ~80–95 | ~70–90 | High | Moderate to high after expert review |
| Natus NeuroWorks™ [8,10] | Continuous screening / flagging | Full 10–20 cEEG | undisclosed | undisclosed | Uses Persyst or encevis plugins to detect seizures | As in Persyst | As in Persyst | As in Persyst | As in Persyst |
| NeuroPulse™ (encevis Version 2.1) [11,12] | ICU rapid bedside triage / long-term monitoring for status epilepticus / continuous seizure burden | Full 10–20 cEEG | undisclosed | undisclosed | Time–frequency feature extraction; temporal continuity and seizure-duration modeling; evaluation based on seizure–prediction interval overlap rather than pointwise detection | ~80–95 | ~75–90 | High for sustained seizures; lower for very brief events | Moderate |
Encouraging Examples of Testing Seizure Detection Methods
A limited number of studies provide examples of rigorously conducted, real-world validation of seizure detection systems using large cohorts. One such example is the 2025 study by Sheikh et al. [7]. This retrospective observational study evaluated the performance of an automated seizure burden estimator (ASBE) integrated into a rapid-response EEG (rr-EEG) system for detecting clinically significant electrographic seizure activity, particularly electrographic status epilepticus (ESE), without requiring immediate expert interpretation. Key methodological features included a multi-center design and independent (blinded) expert review.
Methodological highlights:
- The study reviewed consecutive clinical rr-EEG recordings performed at multiple hospitals between 2019 and 2021.
- Each EEG was independently and blindly reviewed by three human experts.
- A reference standard was defined as at least 2 of 3 reviewers agreeing on whether seizure activity was present.
- The main performance metrics were the negative predictive value (NPV) and positive predictive value (PPV) of the automatic seizure burden estimator for detecting or excluding ESE at various seizure-burden thresholds (e.g., >1%, >10%, >20%, >50%, >90%); a minimal calculation sketch follows this list.
- The only caveat was that the ground truth was established from a partial-montage rather than full-montage EEG, which introduced a positive sensitivity bias.
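The following Python sketch illustrates the threshold-based NPV/PPV evaluation described above. The burden estimates, reference labels, and the assumption that the 2-of-3 expert consensus directly yields an ESE label are hypothetical, introduced only for illustration; they are not data or code from the study.

```python
# Minimal sketch: convert each recording's estimated seizure burden into a binary
# ESE call at a given threshold, then compute NPV and PPV against an expert reference.

def npv_ppv(burden_pct, reference_ese, threshold_pct):
    """burden_pct: estimated seizure burden (%) per recording.
    reference_ese: True if the expert reference indicates ESE (assumed label)."""
    tp = fp = fn = tn = 0
    for burden, ese in zip(burden_pct, reference_ese):
        predicted = burden > threshold_pct
        if predicted and ese:
            tp += 1
        elif predicted and not ese:
            fp += 1
        elif not predicted and ese:
            fn += 1
        else:
            tn += 1
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return npv, ppv

# Toy example: burden estimates (%) and reference labels for six recordings.
burdens = [0.0, 2.0, 15.0, 55.0, 0.5, 95.0]
ese_ref = [False, False, True, True, False, True]
for thr in (1, 10, 20, 50, 90):
    print(thr, npv_ppv(burdens, ese_ref, thr))
```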
Key Findings: The study reported a very high negative predictive value (≈99%), at the cost of a drop in sensitivity. In other words, when the estimator reported no seizure activity, seizures were rarely actually present. At the same time, many seizures were missed, with sensitivity occasionally as low as 29%.
Related Considerations:
These results point to two considerations. First, high negative predictive values are to be expected for rare clinical events such as seizures. Negative predictive values also rise with the length of the EEG study if seizure events are limited in duration and frequency, as might be the case after successful early administration of antiepileptic drugs (AEDs). Focusing primarily on negative predictive values to judge the clinical viability of an algorithm may therefore introduce a bias driven by dataset length, and negative predictive values become increasingly inconclusive if AEDs were administered at any point during the recording. Second, the balance of sensitivity and specificity in this device depends heavily on the chosen seizure-burden threshold, underlining the interpretational nuance required when deploying such tools in practice.
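The dependence of NPV on event rarity follows from the standard relation between NPV, sensitivity, specificity, and prevalence. The short sketch below, using illustrative numbers only, shows how NPV can exceed 99% at low prevalence even when sensitivity is low.

```python
# Standard Bayes-style relation: NPV from sensitivity, specificity, and prevalence.
def npv(sensitivity, specificity, prevalence):
    tn = specificity * (1 - prevalence)   # true-negative fraction of the population
    fn = (1 - sensitivity) * prevalence   # false-negative fraction
    return tn / (tn + fn)

# Illustrative numbers: even at 30% sensitivity and 90% specificity,
# NPV climbs above 99% once prevalence drops to ~1%.
for prev in (0.50, 0.20, 0.05, 0.01):
    print(f"prevalence {prev:.0%}: NPV = {npv(0.30, 0.90, prev):.3f}")
```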
Clinical Implications: Rapid-response systems with automated estimators may enhance early seizure detection in acute or resource-limited settings, but they should not be viewed as replacing expert EEG review. Where expert consensus (ground truth) is weak or the findings are subtle, algorithm performance must be contextualized against the variability in human interpretation, as explained above. In other words, the argument seems valid that automated estimators are a helpful triaging aid but not a full replacement for human interpretation at this time.
Final Thoughts
AI-driven seizure burden assessment is an exciting and necessary step in the evolution of automating and improving the diagnostic process in epilepsy care. Much good has already come from it and will continue to do so, in particular for improving access to care in rural and underserved communities. The ability to quickly triage patients on the spot, and to provide more appropriate downstream diagnostics and treatments, will continue to have a profound impact on patients’ lives. The continued task for the epilepsy community as a whole will be to set appropriate and commonly accepted standards for comparing the growing number of algorithms in a clinically meaningful way that benefits patients.
Assessing the current state of publicly reported AI-driven seizure assessment approaches, it appears safe to say that a gap remains between the much-anticipated clarity of results and real-life applicability. We advocate for more independent, multi-center, double-blinded studies using full-montage ground-truth approaches to understand how far AI can truly assist in the complex process of diagnosing and treating patients to achieve the best outcomes. Until then, it appears safest to engage with AI-driven approaches as an advanced triaging tool rather than as a replacement for the time-tested, physician-expert-driven diagnostic process.
References:
1. Scheuer, M. L., Wilson, S. B., Antony, A., Ghearing, G., Urban, A., & Bagić, A. I. (2021). Seizure detection: Interreader agreement and detection algorithm assessments using a large dataset. Journal of Clinical Neurophysiology, 38(5), 439–447. https://doi.org/10.1097/WNP.0000000000000709
2. Zeto, Inc. (n.d.). Internal testing of encevis AI-powered seizure detection [Unpublished internal report].
3. Vespa, P. M., Olson, D. M., John, S., Hobbs, K. S., Gururangan, K., Nie, K., Desai, M. J., Markert, M., Parvizi, J., Bleck, T. P., Hirsch, L. J., & Westover, M. B. (2020). Evaluating the clinical impact of rapid response electroencephalography: The DECIDE multicenter prospective observational clinical study. Critical Care Medicine, 48(9), 1249–1257. https://doi.org/10.1097/CCM.0000000000004428
4. Kamousi, B., Karunakaran, S., Gururangan, K., Markert, M., Decker, B., Khankhanian, P., Mainardi, L., Quinn, J., Woo, R., & Parvizi, J. (2021). Monitoring the burden of seizures and highly epileptiform patterns in critical care with a novel machine learning method. Neurocritical Care, 34(3), 908–917. https://doi.org/10.1007/s12028-020-01120-0
5. Herman, S. T. (2027). Hardware technology for point-of-care EEG: A comprehensive review. Journal of Clinical Neurophysiology, 43(3), 191–203. https://doi.org/10.1097/WNP.0000000000001240
6. Grant, A. C., Abdel-Baki, S. G., Weedon, J., Arnedo, V., Chari, G., Koziorynska, E., Lushbough, C., Maus, D., McSween, T., Mortati, K. A., Reznikov, A., & Omurtag, A. (2014). EEG interpretation reliability and interpreter confidence: A large single-center study. Epilepsy & Behavior, 32, 102–107. https://doi.org/10.1016/j.yebeh.2014.01.011
7. Sheikh, Z. B., Dhakar, M. B., Fong, M. W. K., Fang, W., Ayub, N., Molino, J., Haider, H. A., Foreman, B., Gilmore, E., Mizrahi, M., Karakis, I., Schmitt, S. E., Osman, G., Yoo, J. Y., & Hirsch, L. J. (2025). Accuracy of a rapid-response EEG’s automated seizure-burden estimator: AccuRASE study. Neurology, 104(2), e210234. https://doi.org/10.1212/WNL.0000000000210234
8. Haider, H. A., Esteller, R., Hahn, C. D., Westover, M. B., Halford, J. J., Lee, J. W., Shafi, M. M., Gaspard, N., Herman, S. T., Gerard, E. E., Hirsch, L. J., Ehrenberg, J. A., LaRoche, S. M., & Critical Care EEG Monitoring Research Consortium. (2016). Sensitivity of quantitative EEG for seizure identification in the intensive care unit. Neurology, 87(9), 935–944. https://doi.org/10.1212/WNL.0000000000003034
9. Wilson, S. B., Scheuer, M. L., Emerson, R. G., & Gabor, A. J. (2004). Seizure detection: Evaluation of the Reveal algorithm. Clinical Neurophysiology, 115(10), 2280–2291. https://doi.org/10.1016/j.clinph.2004.05.018
10. Ganguly, T. M., Ellis, C. A., Tu, D., Shinohara, R. T., Davis, K. A., Litt, B., & Pathmanathan, J. (2022). Seizure detection in continuous inpatient EEG: A comparison of human vs automated review. Neurology, 98(22), e2224–e2232. https://doi.org/10.1212/WNL.0000000000200267
11. Trinka, E., Cock, H., Hesdorffer, D., Rossetti, A. O., Scheffer, I. E., Shinnar, S., Shorvon, S., & Lowenstein, D. H. (2015). A definition and classification of status epilepticus—Report of the ILAE Task Force on Classification of Status Epilepticus. Epilepsia, 56(10), 1515–1523. https://doi.org/10.1111/epi.13121
12. Fürbass, F., Ossenblok, P., Hartmann, M., Perko, H., Skupch, A. M., Lindinger, G., Elezi, L., Pataraia, E., Colon, A. J., Baumgartner, C., & Kluge, T. (2015). Prospective multi-center study of an automatic online seizure detection system for epilepsy monitoring units. Clinical Neurophysiology, 126(6), 1124–1131. https://doi.org/10.1016/j.clinph.2014.09.023

