Blog 4 of 4
Seizure detection algorithms
The first three blogs in this series introduced the quantitative terminology used to evaluate seizure detection and emphasized the critical role of training and testing datasets. Here, we address the central challenge of quantifying seizure burden (SB) and explain why reliable seizure detection is the limiting factor for both real-time alerting and longitudinal clinical assessment.
Read more in our related blog posts:
- How to Define Seizure Detection
- Basic Concepts of Seizure Detection
- The Critical Role of Training Data in the Development of Seizure Detection Algorithms
Current State of Commercial Automated Seizure Detection
Not all clinical EEG platforms incorporate automated seizure detection capabilities, but the most widely deployed commercial systems feature modular software architectures that support optional seizure recognition modules. These solutions are implemented either as third-party plugins or proprietary extensions developed by the platform manufacturer. Notable examples of third-party integrations include Persyst’s seizure detection module for Natus NeuroWorks™ and Nihon Kohden EEG systems [1], and encevis’ cloud-based detection for Zeto’s NeuroPulse™ AI-powered cloud platform [2]. Ceribell developed Clarity™, a proprietary detection algorithm integrated with their rapid-response EEG system [3,4]. A recent review features the hardware capabilities of these point-of-care EEG systems [5].
Given that both SB tracking accuracy and electrographic status epilepticus (ESE) alerting reliability are fundamentally determined by seizure detection performance, we focus our analysis on comparing detection algorithms across available solutions. Our evaluation is limited to published performance metrics and, where source code has been made publicly available, direct comparative testing on standardized datasets.
Table 1 (below) presents published performance characteristics of widely used commercial seizure detection systems. It is critical to note that these performance ranges derive from studies conducted on heterogeneous datasets with varying patient populations, recording conditions, and seizure types. (We provided recommendations in Blog 3 under “Best Practices for Transparent Reporting” to improve the consistency of reported characteristics.) Consequently, direct performance comparisons based solely on published metrics are subject to significant dataset bias, as we discussed in Blog 3, “The Critical Role of Training Data”.
The Ground Truth Paradox
When evaluating seizure detection algorithms, the most critical methodological decision is how “ground truth” is defined. In medical device diagnostics, ground truth is assumed to represent the objective reality against which algorithm performance is measured. In seizure detection, however, this assumption is often fragile (see Blog 2: “The Problem of Ground Truth: Inter-Rater Agreement”). An incorrectly defined or biased ground truth can artificially inflate or suppress sensitivity and specificity, so studies that use different definitions of ground truth become difficult to compare even if their reported metrics appear similar. Because the contingency table used to calculate the key performance metrics is built directly from the ground truth (see Blog 1), the numbers entered in that table determine the sensitivity and specificity. Hence, by manipulating the ground truth one can easily enhance the apparent performance.
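To make this dependence concrete, here is a minimal Python sketch using invented, toy per-segment labels (not data from any of the cited studies). The same detector output is scored against two different ground-truth definitions; the resulting contingency tables, and therefore the sensitivity and specificity, change even though the detector never did.

```python
# Toy per-segment labels (hypothetical): the same detections scored against two
# ground-truth definitions fill the 2x2 contingency table differently.

def sens_spec(detections, reference):
    """Sensitivity and specificity from per-segment binary labels."""
    tp = sum(d and r for d, r in zip(detections, reference))
    fp = sum(d and not r for d, r in zip(detections, reference))
    fn = sum(not d and r for d, r in zip(detections, reference))
    tn = sum(not d and not r for d, r in zip(detections, reference))
    return tp / (tp + fn), tn / (tn + fp)

# Ten EEG segments; 1 = seizure, 0 = no seizure.
detector     = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]
truth_strict = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0]  # only unanimous expert calls count
truth_broad  = [1, 1, 1, 1, 0, 1, 0, 0, 1, 1]  # any expert-marked event counts

print(sens_spec(detector, truth_strict))  # (1.0, ~0.71): looks highly sensitive
print(sens_spec(detector, truth_broad))   # (~0.71, 1.0): same detector, different story
```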
“Ground truth” is not a “gold standard”
In many other diagnostic fields, an external confirmatory standard exists. For example, a radiological suspicion of a tumor on MRI or CT can be confirmed or refuted by biopsy, where histopathology serves as an independent validator. In these situations, time and the patient’s clinical outcome also help reveal whether the preceding assessment was correct.
Seizure detection lacks such an independent confirmatory modality. The EEG pattern itself is the phenomenon of interest. Consequently, ground truth in seizure detection is anchored only to expert interpretation. This creates an epistemological limitation: the reference standard for what is and is not a seizure is inherently subjective, even when systematically constructed.
Human Expert Consensus as Ground Truth
In electroencephalography (EEG), ground truth is typically established through expert physician annotation. As discussed previously (Blog 2: “The Problem of Ground Truth: Inter-Rater Agreement”), careful expert selection, representativeness, and blinded review are essential to mitigate bias. Multi-rater consensus is commonly used to strengthen reliability. Yet consensus does not guarantee correctness.
Inter-rater agreement varies across types of EEG (duration, clinical setting, partial or full montage, and patient condition) and across signal conditions. Agreement is typically higher when the signal-to-noise ratio is strong and the seizure is stereotypical. It decreases when the EEG is laden with artifacts or contaminated by noise, making it ambiguous and hard for human experts to read: precisely the scenarios in which automated systems may be superior.
The importance of obtaining independent, blinded scores from expert EEG readers cannot be overstated. Readers who belong to the same school of thought through identical training will necessarily provide more similar ratings than readers with more diverse backgrounds. Scores that are not blinded may also create an inherent interpersonal conflict between readers who are financially or professionally dependent on one another. Neurology fellows reading under the supervision of their attending physician in an un-blinded setting may be prone to agree with their mentor’s scores, artificially increasing the inter-rater reliability; human psychology dictates at minimum an unconscious bias toward wanting to please one’s superior.
Finally, the moment a method requires subjective human ratings, the principles of social science apply in full, suggesting that independent inter-rater scores follow a normal distribution with a mean and a standard deviation [6]. Even if ratings tend to converge to high agreement on clear textbook examples of a seizure signal, perfect inter-rater agreement of 100% would be a statistical anomaly once less clear, less ideal, and more edge-case seizure morphologies are included in the assessment. Perfect inter-rater reliabilities therefore become a warning sign in their own right. If ratings are perfectly convergent, the question arises what detection criteria were deployed (see Blog 1: proximity, overlap, percentage overlap), or what dataset characteristics might have contributed to this level of alignment. Perfect inter-rater reliabilities in EEG, on diverse, multi-faceted datasets, in a truly independent, larger, and diverse group of readers, are improbable.
The Paradox of Algorithmic Superiority
We noted above that “consensus does not guarantee correctness.” This is another source of confusion. A paradox emerges when a machine-learning algorithm consistently disagrees with expert annotations yet appears clinically plausible, internally consistent, and reproducible across datasets. Scored against a human-defined reference, such an algorithm is counted as making errors even though it may be flagging genuine events the experts missed, so the apparent “failure” may reflect a limitation of the reference rather than of the algorithm.
Only by recognizing and managing the “ground truth paradox” can we generate trustworthy, comparable, and clinically meaningful performance metrics.
The Relationship Between Sensitivity and Ground Truth
One of the clearest warning signs of bias in seizure detection studies is a mismatch between reported sensitivity and inter-rater agreement. The problem arises when vendor-reported sensitivity exceeds the level of agreement among expert reviewers. For example, if two independent EEG experts agree only ~80% of the time on whether a segment contains a seizure (see Blog 2, Figure 3), how can an algorithm claim 99% sensitivity against that same reference?
If Dr. X and Dr. Y agree on seizure presence 80% of the time, then an algorithm achieving 99% sensitivity must be nearly perfectly aligned with one expert while necessarily disagreeing with the other in a substantial fraction of cases. In other words, the 99% sensitivity cannot simultaneously apply to both experts unless one expert’s ratings are excluded from the reference standard.
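A back-of-the-envelope sketch makes the arithmetic explicit. It uses the hypothetical 80%/99% figures from the example above and treats “agreement” as a per-segment label mismatch rate, which is an illustrative simplification of both inter-rater agreement and sensitivity.

```python
# Hypothetical proportions: if the algorithm and Dr. X disagree on only 1% of
# segments, its disagreement with Dr. Y is pinned close to the X-vs-Y level.

disagree_x_y   = 1 - 0.80   # Dr. X and Dr. Y disagree on 20% of segments
disagree_alg_x = 1 - 0.99   # the algorithm disagrees with Dr. X on only 1%

# Mismatch rate between label vectors obeys the triangle inequality, so:
lower = abs(disagree_x_y - disagree_alg_x)   # 0.19
upper = disagree_x_y + disagree_alg_x        # 0.21
print(f"algorithm vs. Dr. Y disagreement lies between {lower:.2f} and {upper:.2f}")
```

In other words, matching one expert almost perfectly forces roughly 20% disagreement with the other; the only way to report 99% against both is to remove one of them from the reference.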
Constructed Ground Truth and Inflated Metrics
A common methodological solution is to define “ground truth” as only those seizures agreed upon by two or three experts. These consensus events are then treated as 100% true positives (see Blog 2: “conservative,” “moderate,” and “liberal approaches” to consensus).
While this practice increases internal consistency, it also reshapes the denominator of sensitivity. By excluding disputed events, the evaluation may preferentially retain clear, high signal-to-noise seizures and remove those ambiguous cases that lower inter-rater agreement. The resulting contingency table no longer reflects the full clinical reality—it reflects a filtered subset.
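A minimal numerical sketch, using invented toy counts rather than any published figures, shows how shrinking the denominator to consensus-only events inflates sensitivity when the algorithm performs worse on ambiguous events.

```python
# Toy counts (hypothetical) showing how a consensus-only reference inflates sensitivity.

consensus_events = 80    # seizures marked by both experts (clear, high-SNR)
disputed_events  = 40    # seizures marked by only one expert (ambiguous)

detected_consensus = 76  # the algorithm does well on unambiguous events...
detected_disputed  = 12  # ...and poorly on ambiguous ones

sens_consensus_only = detected_consensus / consensus_events                    # 0.95
sens_union_of_experts = (detected_consensus + detected_disputed) / (
    consensus_events + disputed_events)                                        # ~0.73

print(f"consensus-only reference:   {sens_consensus_only:.0%}")
print(f"union-of-experts reference: {sens_union_of_experts:.0%}")
```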
Take Away
Consensus-based ground truth in sensitivity and specificity assessments introduces a bias toward inflated scores. Sensitivity and specificity are therefore not independent of the ground-truth construction; they are downstream consequences of it.
Developers who report very high sensitivity should therefore demonstrate that these values are supported by correspondingly high, independent, blinded inter-rater agreement. If not, responsible reporting requires acknowledging that the robustness of the metric is limited by the robustness of the reference.
Why This Matters
The absence of practical reporting standards in this area has encouraged competitive sensitivity claims, often producing discrepancies between vendor-sponsored and independent studies. These discrepancies are frequently rooted not in statistical error, but in differences in dataset composition and ground truth construction.
Ultimately, real-world clinical performance will arbitrate these claims. Sensitivity that depends heavily on how disagreement cases were handled may not translate into consistent bedside performance.
Sensitivity cannot meaningfully exceed the reliability of its ground truth. When it appears to do so, the issue lies not in the algorithm—but in how the reference standard was constructed and interpreted.
Table 1. Published performance characteristics of widely used commercial seizure detection systems.

| Model | Primary use case | Montage | Inter-rater agreement | Temporal seizure matching | Method | Sensitivity (%) | Specificity (%) | Negative predictive value | Positive predictive value |
|---|---|---|---|---|---|---|---|---|---|
| Ceribell Clarity™ [3,4,7] | ICU rapid bedside triage / continuous seizure burden / rule-out status epilepticus | Partial montage: 8 channels | undisclosed | undisclosed | Time–frequency features and machine-learning classification on partial-montage EEG | ~29–100 | ~79–93 | ≈95–99 | Moderate; decreases with low seizure burden |
| Persyst, Version 11 [8,9] | Expert-assisted review, seizure burden | Full 10–20 cEEG | undisclosed | undisclosed | Time–frequency analysis, morphology, rhythmicity, and spatial coherence, with deterministic + ML-assisted components | ~80–95 | ~70–90 | High | Moderate to high after expert review |
| Natus NeuroWorks™ [8,10] | Continuous screening / flagging | Full 10–20 cEEG | undisclosed | undisclosed | Uses Persyst or encevis plugins to detect seizures | As in Persyst | As in Persyst | As in Persyst | As in Persyst |
| NeuroPulse™ (encevis Version 2.1) [11,12] | ICU rapid bedside triage / long-term monitoring for status epilepticus / continuous seizure burden | Full 10–20 cEEG | undisclosed | undisclosed | Time–frequency feature extraction; temporal continuity and seizure-duration modeling; evaluation based on seizure–prediction interval overlap rather than pointwise detection | ~80–95 | ~75–90 | High for sustained seizures; lower for very brief events | Moderate |
Encouraging Examples of Testing Seizure Detection Methods
A limited number of studies provide examples of rigorously conducted, real-world validation of seizure detection systems using large cohorts. One such example is the 2025 study by Sheikh et al. [7]. This retrospective observational study evaluated the performance of an automated seizure burden estimator (ASBE) integrated into a rapid-response EEG (rr-EEG) system for detecting clinically significant electrographic seizure activity, particularly electrographic status epilepticus (ESE), without requiring immediate expert interpretation. Key methodological features included a multi-center design and independent (blinded) expert review.
Methodological highlights:
- The study reviewed consecutive clinical rr-EEG recordings performed at multiple hospitals between 2019 and 2021.
- Each EEG was independently and blindly reviewed by three human experts.
- A reference standard was defined as at least 2 of 3 reviewers agreeing on whether seizure activity was present.
- The main performance metrics were the negative predictive value (NPV) and positive predictive value (PPV) of the automatic seizure burden estimator for detecting or excluding ESE at various seizure-burden thresholds (e.g., >1%, >10%, >20%, >50%, >90%); a minimal calculation sketch follows this list.
- The only caveat was that the ground truth was established from a partial-montage rather than full-montage EEG, which introduced a positive sensitivity bias.
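The following Python sketch illustrates the threshold-based NPV/PPV evaluation described above. The burden estimates, reference labels, and the assumption that the 2-of-3 expert consensus directly yields an ESE label are hypothetical, introduced only for illustration; they are not data or code from the study.

```python
# Minimal sketch: convert each recording's estimated seizure burden into a binary
# ESE call at a given threshold, then compute NPV and PPV against an expert reference.

def npv_ppv(burden_pct, reference_ese, threshold_pct):
    """burden_pct: estimated seizure burden (%) per recording.
    reference_ese: True if the expert reference indicates ESE (assumed label)."""
    tp = fp = fn = tn = 0
    for burden, ese in zip(burden_pct, reference_ese):
        predicted = burden > threshold_pct
        if predicted and ese:
            tp += 1
        elif predicted and not ese:
            fp += 1
        elif not predicted and ese:
            fn += 1
        else:
            tn += 1
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return npv, ppv

# Toy example: burden estimates (%) and reference labels for six recordings.
burdens = [0.0, 2.0, 15.0, 55.0, 0.5, 95.0]
ese_ref = [False, False, True, True, False, True]
for thr in (1, 10, 20, 50, 90):
    print(thr, npv_ppv(burdens, ese_ref, thr))
```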
Key Findings: The study reported a very high negative predictive value (≈99%), at the cost of a drop in sensitivity. In other words, when the estimator reported no seizure activity, seizures were rarely actually present. At the same time, many seizures were missed, with sensitivity occasionally as low as 29%.
Related Considerations:
These results point to two considerations. First, high negative predictive values are to be expected for rare clinical events such as seizures. Negative predictive values also rise with the length of the EEG study if seizure events are limited in duration and frequency, as might be the case after successful early administration of antiepileptic drugs (AEDs). Focusing primarily on negative predictive values to judge the clinical viability of an algorithm may therefore introduce a bias driven by dataset length, and negative predictive values become increasingly inconclusive if AEDs were administered at any point during the recording. Second, the balance of sensitivity and specificity in this device depends heavily on the chosen seizure-burden threshold, underlining the interpretational nuance required when deploying such tools in practice.
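The dependence of NPV on event rarity follows from the standard relation between NPV, sensitivity, specificity, and prevalence. The short sketch below, using illustrative numbers only, shows how NPV can exceed 99% at low prevalence even when sensitivity is low.

```python
# Standard Bayes-style relation: NPV from sensitivity, specificity, and prevalence.
def npv(sensitivity, specificity, prevalence):
    tn = specificity * (1 - prevalence)   # true-negative fraction of the population
    fn = (1 - sensitivity) * prevalence   # false-negative fraction
    return tn / (tn + fn)

# Illustrative numbers: even at 30% sensitivity and 90% specificity,
# NPV climbs above 99% once prevalence drops to ~1%.
for prev in (0.50, 0.20, 0.05, 0.01):
    print(f"prevalence {prev:.0%}: NPV = {npv(0.30, 0.90, prev):.3f}")
```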
Clinical Implications: Rapid-response systems with automated estimators may enhance early seizure detection in acute or resource-limited settings, but they should not be viewed as replacing expert EEG review. Where expert consensus (ground truth) is weak or the findings are subtle, algorithm performance must be contextualized against the variability in human interpretation, as explained above. In other words, the argument seems valid that automated estimators are a helpful triaging aid but not a full replacement for human interpretation at this time.
Final Thoughts
AI-driven seizure burden assessment is an exciting and necessary step in the evolution of automating and improving the diagnostic process in epilepsy care. Much good has already come from it and will continue to do so, in particular for improving access to care in rural and underserved communities. The ability to quickly triage patients on the spot, and to provide more appropriate downstream diagnostics and treatments, will continue to have a profound impact on patients’ lives. The continued task for the epilepsy community as a whole will be to set appropriate and commonly accepted standards for comparing the growing number of algorithms in a clinically meaningful way that benefits patients.
Assessing the current state of publicly reported AI-driven seizure assessment approaches, it appears safe to say that a gap remains between the much-anticipated clarity of results and real-life applicability. We advocate for more independent, multi-center, double-blinded studies using full-montage ground-truth approaches to understand how far AI can truly assist in the complex process of diagnosing and treating patients to achieve the best outcomes. Until then, it appears safest to engage with AI-driven approaches as an advanced triaging tool rather than as a replacement for the time-tested, physician-expert-driven diagnostic process.
References:
1. Scheuer, M. L., Wilson, S. B., Antony, A., Ghearing, G., Urban, A., & Bagić, A. I. (2021). Seizure detection: Interreader agreement and detection algorithm assessments using a large dataset. Journal of Clinical Neurophysiology, 38(5), 439–447. https://doi.org/10.1097/WNP.0000000000000709
2. Zeto, Inc. (n.d.). Internal testing of encevis AI-powered seizure detection [Unpublished internal report].
3. Vespa, P. M., Olson, D. M., John, S., Hobbs, K. S., Gururangan, K., Nie, K., Desai, M. J., Markert, M., Parvizi, J., Bleck, T. P., Hirsch, L. J., & Westover, M. B. (2020). Evaluating the clinical impact of rapid response electroencephalography: The DECIDE multicenter prospective observational clinical study. Critical Care Medicine, 48(9), 1249–1257. https://doi.org/10.1097/CCM.0000000000004428
4. Kamousi, B., Karunakaran, S., Gururangan, K., Markert, M., Decker, B., Khankhanian, P., Mainardi, L., Quinn, J., Woo, R., & Parvizi, J. (2021). Monitoring the burden of seizures and highly epileptiform patterns in critical care with a novel machine learning method. Neurocritical Care, 34(3), 908–917. https://doi.org/10.1007/s12028-020-01120-0
5. Herman, S. T. (2027). Hardware technology for point-of-care EEG: A comprehensive review. Journal of Clinical Neurophysiology, 43(3), 191–203. https://doi.org/10.1097/WNP.0000000000001240
6. Grant, A. C., Abdel-Baki, S. G., Weedon, J., Arnedo, V., Chari, G., Koziorynska, E., Lushbough, C., Maus, D., McSween, T., Mortati, K. A., Reznikov, A., & Omurtag, A. (2014). EEG interpretation reliability and interpreter confidence: A large single-center study. Epilepsy & Behavior, 32, 102–107. https://doi.org/10.1016/j.yebeh.2014.01.011
7. Sheikh, Z. B., Dhakar, M. B., Fong, M. W. K., Fang, W., Ayub, N., Molino, J., Haider, H. A., Foreman, B., Gilmore, E., Mizrahi, M., Karakis, I., Schmitt, S. E., Osman, G., Yoo, J. Y., & Hirsch, L. J. (2025). Accuracy of a rapid-response EEG’s automated seizure-burden estimator: AccuRASE study. Neurology, 104(2), e210234. https://doi.org/10.1212/WNL.0000000000210234
8. Haider, H. A., Esteller, R., Hahn, C. D., Westover, M. B., Halford, J. J., Lee, J. W., Shafi, M. M., Gaspard, N., Herman, S. T., Gerard, E. E., Hirsch, L. J., Ehrenberg, J. A., LaRoche, S. M., & Critical Care EEG Monitoring Research Consortium. (2016). Sensitivity of quantitative EEG for seizure identification in the intensive care unit. Neurology, 87(9), 935–944. https://doi.org/10.1212/WNL.0000000000003034
9. Wilson, S. B., Scheuer, M. L., Emerson, R. G., & Gabor, A. J. (2004). Seizure detection: Evaluation of the Reveal algorithm. Clinical Neurophysiology, 115(10), 2280–2291. https://doi.org/10.1016/j.clinph.2004.05.018
10. Ganguly, T. M., Ellis, C. A., Tu, D., Shinohara, R. T., Davis, K. A., Litt, B., & Pathmanathan, J. (2022). Seizure detection in continuous inpatient EEG: A comparison of human vs automated review. Neurology, 98(22), e2224–e2232. https://doi.org/10.1212/WNL.0000000000200267
11. Trinka, E., Cock, H., Hesdorffer, D., Rossetti, A. O., Scheffer, I. E., Shinnar, S., Shorvon, S., & Lowenstein, D. H. (2015). A definition and classification of status epilepticus—Report of the ILAE Task Force on Classification of Status Epilepticus. Epilepsia, 56(10), 1515–1523. https://doi.org/10.1111/epi.13121
12. Fürbass, F., Ossenblok, P., Hartmann, M., Perko, H., Skupch, A. M., Lindinger, G., Elezi, L., Pataraia, E., Colon, A. J., Baumgartner, C., & Kluge, T. (2015). Prospective multi-center study of an automatic online seizure detection system for epilepsy monitoring units. Clinical Neurophysiology, 126(6), 1124–1131. https://doi.org/10.1016/j.clinph.2014.09.023

