Blog 3 of 4
Data Source and Representativeness
In the first two blogs, we reviewed the basic terminology and the quantitative evaluation of seizure detection (1 – How to Define Seizure Detection, 2 – Basic Concepts of Seizure Detection). Here we turn our attention to the importance of datasets.
The training dataset fundamentally determines where an algorithm will work effectively. An algorithm trained exclusively on ICU data – where patients are often sedated and show minimal motion artifacts – will generate excessive false alarms in routine EEG settings where patients move freely. Conversely, algorithms trained only on awake patients will falsely flag slow oscillations as pathological in critical care settings, where such oscillations are the natural delta waves of the sedated brain. Similarly, neither type of algorithm will work for patients in transport (air or ground ambulances) unless the EEG is first denoised to remove mechanical and electromagnetic artifacts.
Public Datasets
Several public databases support algorithm development, but each of them has advantages and disadvantages depending on the seizure detection objectives.
Bonn EEG Database
Description: Contains 600 expert-annotated single-channel seizure segments [1].
Limitations: It is a very limited dataset (one channel per segment) with minimal artifacts, which restricts generalizability. In addition, it lacks seizure offset annotations.
Temple University Hospital (TUH) Database
Description: Includes the TUH EEG Epilepsy Corpus (TUEP) and the TUH EEG Seizure Corpus (TUSZ), with thousands of annotated recordings from various settings (routine, EMU, ICU), including both seizure and non-seizure data. Note that the different TUH corpora overlap.
Limitations: It is a good resource overall, but it spans a broad range of EEG quality across diverse settings, from routine to ICU recordings. This makes it well suited for developing robust, quality-insensitive seizure detection algorithms, but of limited use for parameter fine-tuning.
Harvard Electroencephalography Database
Description: Contains 284,343 EEG studies from 109,178 patients across four sites, including annotated ICU, EMU, and routine EEG datasets [2]. It is a huge database (210 TB), almost ideal for foundational model development.
Limitations: However, the seizure annotations made by human experts are very limited and lack seizure offset times. Moreover, it contains long ICU recordings (> 8 hours) with very few seizures, which brings us to the next problem.
Seizure Prevalence Effects
Seizure prevalence dramatically affects performance metrics. Consider a test set of 100 twenty-second EEG segments drawn from patients undergoing 33-minute routine EEG recordings, where seizures are rare: on average, 1 true seizure per 100 segments. If a new seizure detection algorithm flags 10 segments as seizures (including the one true seizure), it achieves:
- Recall = 100% (found the only true seizure)
- Precision = 10% (9 false positives)
- F1 = 0.18 (severely impacted by false positives)
This demonstrates why 100% sensitivity alone is very misleading—it overlooks the critical importance of false positive rates.
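For readers who want to verify the arithmetic, here is a minimal Python sketch of the hypothetical scenario above (1 true seizure among 100 segments, 10 detections including the true one):

```python
# Hypothetical scenario from the example above: 100 EEG segments,
# 1 true seizure, and a detector that flags 10 segments (including the true one).
true_positives = 1      # the single real seizure was detected
false_positives = 9     # the other nine detections are spurious
false_negatives = 0     # no real seizure was missed

recall = true_positives / (true_positives + false_negatives)      # 1.00
precision = true_positives / (true_positives + false_positives)   # 0.10
f1 = 2 * precision * recall / (precision + recall)                 # ~0.18

print(f"Recall = {recall:.0%}, Precision = {precision:.0%}, F1 = {f1:.2f}")
```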
Considering that seizure prevalence varies significantly across clinical settings, we have a good reason to balance the composition of our test and training datasets:
- Routine EEG: Low prevalence (30-minute recordings in outpatients)
- Epilepsy Monitoring Unit (EMU): High prevalence (medications withdrawn to capture seizures)
- ICU: High prevalence (continuous monitoring for status epilepticus)
When evaluating published results, always consider the patient cohort and seizure prevalence, as they fundamentally affect precision and F1 scores.
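To make this dependence explicit, the sketch below assumes a hypothetical detector with fixed per-segment sensitivity and specificity (the 90% and 95% figures, like the prevalence values, are illustrative assumptions rather than measured numbers) and shows how precision changes with prevalence alone:

```python
def precision_at(prevalence, sensitivity=0.90, specificity=0.95):
    """Expected segment-level precision for an assumed sensitivity/specificity."""
    tp_rate = sensitivity * prevalence               # true positives per segment
    fp_rate = (1 - specificity) * (1 - prevalence)   # false positives per segment
    return tp_rate / (tp_rate + fp_rate)

# The same hypothetical detector at three illustrative seizure prevalences.
for prevalence in (0.01, 0.10, 0.30):
    print(f"prevalence = {prevalence:4.0%} -> precision = {precision_at(prevalence):.0%}")
```

With these assumed numbers, precision climbs from roughly 15% at 1% prevalence to almost 90% at 30% prevalence, even though the detector's sensitivity and specificity never change.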
Data Heterogeneity and Specialization
Age and State Considerations
EEG characteristics vary dramatically across clinical settings, age groups, and physiological states:
- Pediatric EEG differs from adult EEG
- Neonatal EEG is fundamentally different from both
- Sleep EEG requires different approaches than awake EEG
An F1 score reported for adults may be meaningless when applied to neonates. Similarly, algorithms trained on awake patients cannot be directly applied to sleep EEG without adjustment.
The Strategic Choice
Generalist approach: For broad applicability across diverse settings, train on similarly diverse datasets with large numbers of validated examples.
Specialist approach: For specific populations (e.g., neonates), focus training on population-specific datasets for optimal performance.
Sample Size Matters
EEG data availability is limited by privacy protections and data security requirements. Despite the carefully de-identified public databases described above, most clinical EEG data remains protected and unshared. Therefore, the sample sizes available for developing seizure detection methods are far from optimal.
The required dataset size depends on the approach:
Small datasets (20-100 patients, up to 1,000 recordings): Sufficient for algorithms that focus on specific features such as wavelet coefficients; statistically adequate for training and testing feature-based classifiers.
Medium datasets (100-1,000 patients, up to 10,000 recordings): Necessary for convolutional neural networks that extract multidimensional parameters from larger parameter spaces.
Large datasets (>1,000 patients, more than 10,000 recordings): Required for foundational models that represent millions of features and can extract disease-specific patterns from any EEG—essentially the “large language model” equivalent for EEG analysis.
Best Practices for Transparent Reporting
Performance evaluation of seizure detection methods is not straightforward. Many parameters can significantly impact published results, and incomplete reporting leads to misinterpretation and irreproducibility. Key issues include:
- Inter-rater variability (a minimal agreement-metric sketch follows this list)
- Seizure onset-offset uncertainty
- Electrode coverage variations
- Training data representativeness
- Sample size and prevalence effects
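To make the first of these issues concrete, here is a minimal sketch that computes Cohen's kappa, a common inter-rater agreement metric, for two hypothetical raters labeling the same ten EEG segments as seizure or non-seizure; the label vectors are invented for illustration:

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two binary label sequences (1 = seizure, 0 = no seizure)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n   # observed agreement
    p_a = sum(rater_a) / n                                         # rater A's seizure rate
    p_b = sum(rater_b) / n                                         # rater B's seizure rate
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)                   # chance agreement
    return (observed - expected) / (1 - expected)

# Hypothetical, independently produced labels for ten EEG segments.
rater_a = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
rater_b = [1, 0, 0, 0, 1, 0, 1, 0, 1, 0]
print(f"Cohen's kappa = {cohens_kappa(rater_a, rater_b):.2f}")   # ~0.58
```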
Take Away
Recommendations
- Report inter-rater reliability metrics when comparing to ground truth. Transparency about expert agreement levels provides essential context. In addition, the raters must be blinded to each other’s ratings (as described in Blog 2, “Basic Concepts of Seizure Detection”).
- Quantify temporal matching between predicted and validated seizures based on overlaps (interval matching, OVLP, or percent overlap); a simplified sketch of overlap matching follows this list. This requires precise seizure duration definitions, which we discussed in Blog 1, “What is detection”.
- Use full-montage EEG for ground truth estimation. This prevents the inherent bias of partial or incomplete electrode coverage. We discuss this further in Blog 2, “The Electrode Coverage Dilemma”.
- Specify training data characteristics: patient population, seizure prevalence, recording settings, and dataset (see above).
- Report multiple metrics: Sensitivity, specificity, precision, and F1 score together provide a complete picture – see Blog 2, “Sensitivity and Specificity” for more details.
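As a rough illustration of overlap-based matching (a simplified sketch, not the full OVLP scoring specification), the snippet below counts a reference seizure as detected when a predicted interval covers at least a chosen fraction of its duration. All onset/offset times and the 50% threshold are invented for illustration:

```python
def overlap_fraction(pred, ref):
    """Fraction of the reference seizure covered by a predicted interval (seconds)."""
    overlap = max(0.0, min(pred[1], ref[1]) - max(pred[0], ref[0]))
    return overlap / (ref[1] - ref[0])

def match_events(predicted, reference, min_overlap=0.5):
    """Mark each reference seizure as detected if any prediction overlaps it enough."""
    return [any(overlap_fraction(p, r) >= min_overlap for p in predicted)
            for r in reference]

# Hypothetical onset/offset times (in seconds) for one recording.
reference = [(120, 180), (600, 660)]        # expert-validated seizures
predicted = [(130, 175), (900, 930)]        # detector output
print(match_events(predicted, reference))   # [True, False] -> one hit, one miss
```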
Until these standards become widely adopted, direct comparison between different seizure detection algorithms remains challenging. Readers should approach performance claims with healthy skepticism, asking: What was the inter-rater agreement? What electrode coverage was used? What was the seizure prevalence? Was the algorithm tested on data representative of its intended use case?
Conclusion
No single number can capture the complexity of seizure detection performance. The published F1 scores, sensitivity values, and other metrics that appear definitive are actually contingent on numerous methodological choices—from how ground truth is established to which patient populations are studied.
As the field moves toward larger annotated databases and more diverse training sets, we can develop increasingly robust foundational models. However, transparency in reporting remains essential. By understanding the factors that influence performance metrics, clinicians and researchers can better evaluate new technologies and make informed decisions about their implementation.
The goal isn’t skepticism about all published results, but rather informed interpretation that recognizes both the promise and limitations of current methods. Only through rigorous, transparent reporting can we enable fair comparison between models and drive genuine progress in automated seizure detection.
In the next blog, we address the central challenge of quantifying seizure burden (SB) and explain why reliable seizure detection is the limiting factor for both real-time alerting and longitudinal clinical assessment. Read the next blog: “Quantifying Seizure Detection”.
References
- Andrzejak, R. G., Lehnertz, K., Mormann, F., Rieke, C., David, P., & Elger, C. E. (2001). Indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: Dependence on recording region and brain state. Physical Review E, 64(6), 061907.
- Zafar, S., Loddenkemper, T., Lee, J. W., Cole, A., Goldenholz, D., Peters, J., Lam, A., Amorim, E., Chu, C., Cash, S., Moura Junior, V., Gupta, A., Ghanta, M., Fernandes, M., Sun, H., Jing, J., & Westover, M. B. (2025). Harvard Electroencephalography Database (version 4.1). Brain Data Science Platform. https://doi.org/10.60508/k85b-fc87
