Benchmark
All results are from a 100-seed benchmark (seeds 0–99) on standard datasets. Each seed produces a different random train/validation/test split, and results are averaged across all 100 seeds to reduce variance from any single split.
Methodology
Split: Each dataset is split into train, validation, and test sets. Models are trained on the train set, the DES router is fitted on the validation set, and the routed ensemble is evaluated on the test set.
Best Single is the best individual model from the pool, selected by validation set performance. It represents the baseline without any ensembling.
Simple Average is a uniform blend of all five models with no fitting or tuning. It represents the simplest possible ensemble baseline.
All deskit algorithms use preset="balanced" (FAISS IVF) and k=20.
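The protocol above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual code: the 60/20/20 split fractions, the helper names (`split_indices`, `best_single`, `simple_average`), and the toy arrays are all assumptions. The source only states that each seed gives a different random split, that Best Single is picked on the validation set, and that Simple Average is an unfitted uniform blend.

```python
import numpy as np

def split_indices(n_samples, seed, val_frac=0.2, test_frac=0.2):
    """Disjoint train/val/test indices for one seed (fractions are assumed)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_frac))
    n_val = int(round(n_samples * val_frac))
    test = perm[:n_test]
    val = perm[n_test:n_test + n_val]
    train = perm[n_test + n_val:]
    return train, val, test

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def best_single(val_preds, y_val):
    """Index of the pool member with the lowest validation MAE."""
    return int(np.argmin([mae(y_val, p) for p in val_preds]))

def simple_average(preds):
    """Uniform blend over models; nothing is fitted or tuned."""
    return np.mean(preds, axis=0)

# One seed of the loop; the benchmark repeats this for seeds 0..99.
train_idx, val_idx, test_idx = split_indices(n_samples=442, seed=0)

# Toy predictions with shape (n_models, n_samples).
y_val = np.array([1.0, 2.0, 3.0])
val_preds = np.array([
    [1.5, 2.5, 3.5],  # model 0: MAE 0.5
    [1.1, 2.0, 2.9],  # model 1: smallest MAE
    [0.0, 0.0, 0.0],  # model 2: MAE 2.0
])
chosen = best_single(val_preds, y_val)
blend = simple_average(val_preds)
```

For classification the same selection would use accuracy (higher is better) instead of MAE, and the uniform blend would typically average class probabilities rather than point predictions.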
Pool
The same five-model pool is used across all datasets:
| Model (regression / classification) |
|---|
| K-Nearest Neighbors |
| Decision Tree |
| SVR / SVM-RBF |
| Ridge / Gaussian NB |
| Bayesian Ridge / Logistic Regression |
These five were chosen because they differ in inductive bias and architecture, which is the kind of heterogeneous pool DES is meant for.
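One way such a pool could be assembled with scikit-learn is sketched below. It rests on several assumptions: that each "A / B" row in the table means "regression model / classification model", that the models are scikit-learn estimators, and that default hyperparameters are used. The `make_pool` helper and the `max_iter` and `probability` settings are illustrative choices, not taken from the benchmark.

```python
from sklearn.linear_model import BayesianRidge, LogisticRegression, Ridge
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.svm import SVC, SVR
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

def make_pool(task, seed=0):
    """Return an unfitted five-model pool (hypothetical helper, default hyperparameters)."""
    if task == "regression":
        return {
            "K-Nearest Neighbors": KNeighborsRegressor(),
            "Decision Tree": DecisionTreeRegressor(random_state=seed),
            "SVR": SVR(),
            "Ridge": Ridge(),
            "Bayesian Ridge": BayesianRidge(),
        }
    if task == "classification":
        return {
            "K-Nearest Neighbors": KNeighborsClassifier(),
            "Decision Tree": DecisionTreeClassifier(random_state=seed),
            # probability=True enables predict_proba, useful for blending.
            "SVM-RBF": SVC(kernel="rbf", probability=True, random_state=seed),
            "Gaussian NB": GaussianNB(),
            "Logistic Regression": LogisticRegression(max_iter=1000),
        }
    raise ValueError(f"unknown task: {task!r}")

regression_pool = make_pool("regression")
classification_pool = make_pool("classification")
```

The pool mixes an instance-based learner, a tree, a kernel method, and two linear or probabilistic models, so the members tend to make errors in different regions of the feature space.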
Datasets
California Housing
Source: sklearn built-in. Size: 20,640 samples, 8 features.
Predict median house value from census block features.
Bike Sharing
Source: OpenML 42712. Size: 17,379 samples, 12 features.
Predict hourly bike rental counts from weather and time features.
Abalone
Source: OpenML 183. Size: 4,177 samples, 8 features.
Predict abalone age from physical measurements.
Diabetes
Source: sklearn built-in. Size: 442 samples, 10 features.
Predict disease progression one year after baseline from physiological measurements.
Concrete Strength
Source: OpenML 4353. Size: 1,030 samples, 8 features.
Predict concrete compressive strength from ingredient and curing age features.
HAR
Source: OpenML 1478. Size: 10,299 samples, 561 features.
Six-class classification of human activities from smartphone accelerometer and gyroscope data.
Yeast
Source: OpenML 181. Size: 1,484 samples, 8 features.
Ten-class protein localisation classification with class imbalance.
Image Segment
Source: OpenML 36. Size: 2,310 samples, 19 features.
Seven-class classification of outdoor image segments from colour and texture statistics.
Vowel
Source: OpenML 307. Size: 990 samples, 10 features.
Eleven-class vowel recognition from LPC-derived formant frequencies.
Waveform
Source: OpenML 60. Size: 5,000 samples, 40 features.
Three-class classification of artificially constructed waveforms with deliberate class overlap.
Regression results
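MAE, lower is better. Best Single is the 100-seed mean ± std; every other column is the percentage change in mean MAE relative to Best Single, so negative values are improvements.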
| Dataset | Best Single | Simple Avg | KNN-DWS | KNN-DWS-I | OLA | KNORA-U | KNORA-E | KNORA-IU |
|---|---|---|---|---|---|---|---|---|
| California Housing | 0.3955 ± 0.008 | +7.93% | −2.41% | −2.68% | −0.31% | −0.81% | +7.22% | −1.03% |
| Bike Sharing | 51.604 ± 1.291 | +48.39% | −4.90% | −6.25% | −2.55% | +6.67% | +15.16% | +5.50% |
| Abalone | 1.4923 ± 0.054 | +1.29% | +3.00% | +3.12% | +4.02% | +1.63% | +7.60% | +1.61% |
| Diabetes | 44.986 ± 3.370 | +2.98% | +0.96% | +0.88% | +3.16% | +5.13% | +14.36% | +5.02% |
| Concrete Strength | 5.3934 ± 0.400 | +21.30% | +1.40% | −1.55% | +3.94% | +0.46% | +11.29% | −2.85% |
KNORA variants are designed for classification, which explains their generally poor performance on the regression datasets. Exceptions can occur, either where the feature space has hard clusters (as in Concrete Strength) or where the target is discrete and classification-like (as in Abalone).
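To read the percentage columns, it helps to convert between a delta and an absolute score. The sketch below assumes the deltas are relative changes against Best Single; this matches the classification tables (for HAR, Simple Average at 97.93% against 98.24% gives about −0.32%). The function names are illustrative, and values recomputed from rounded table entries can differ from the published ones in the last digit.

```python
def relative_delta_pct(score, best_single):
    """Percentage change of `score` relative to the Best Single baseline."""
    return (score - best_single) / best_single * 100.0

def apply_delta(best_single, delta_pct):
    """Recover an approximate absolute score from a baseline and a % delta."""
    return best_single * (1.0 + delta_pct / 100.0)

# HAR accuracy: Simple Average vs Best Single.
har_simple_avg = round(relative_delta_pct(97.93, 98.24), 2)

# California Housing MAE: KNN-DWS-I is listed at -2.68% vs 0.3955.
california_knn_dws_i = round(apply_delta(0.3955, -2.68), 4)

print(har_simple_avg)        # -0.32
print(california_knn_dws_i)  # 0.3849
```

For MAE a negative delta is an improvement; for accuracy a positive delta is.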
Classification results
Accuracy, higher is better. Mean and Std are taken over the 100 seeds; the last column is the percentage change in mean accuracy relative to Best Single. Classification datasets include a comparison against DESlib, a mature sklearn-compatible DES library.
HAR
| Method | Mean | Std | vs Best Single |
|---|---|---|---|
| Best Single | 98.24% | 0.25% | — |
| Simple Average | 97.93% | 0.33% | −0.32% |
| deskit KNN-DWS | 98.37% | 0.27% | +0.13% |
| deskit KNN-DWS-I | 98.37% | 0.27% | +0.14% |
| deskit OLA | 97.99% | 0.31% | −0.25% |
| deskit KNORA-U | 98.18% | 0.29% | −0.05% |
| deskit KNORA-E | 97.99% | 0.31% | −0.25% |
| deskit KNORA-IU | 98.19% | 0.29% | −0.04% |
| DESlib KNORA-U | 98.04% | 0.32% | −0.20% |
| DESlib KNORA-E | 97.83% | 0.34% | −0.41% |
| DESlib OLA | 97.05% | 0.44% | −1.20% |
| DESlib META-DES | 98.37% | 0.31% | +0.14% |
| DESlib KNOP | 98.32% | 0.30% | +0.08% |
| DESlib DESP | 97.98% | 0.33% | −0.26% |
| DESlib DESKNN | 97.81% | 0.33% | −0.43% |
deskit achieves a best mean score of 98.37%; DESlib achieves a best mean score of 98.37%.
Yeast
| Method | Mean | Std | vs Best Single |
|---|---|---|---|
| Best Single | 59.19% | 2.70% | — |
| Simple Average | 59.46% | 2.44% | +0.46% |
| deskit KNN-DWS | 59.89% | 2.42% | +1.18% |
| deskit KNN-DWS-I | 59.91% | 2.51% | +1.23% |
| deskit OLA | 58.93% | 2.37% | −0.44% |
| deskit KNORA-U | 59.89% | 2.53% | +1.18% |
| deskit KNORA-E | 57.05% | 2.63% | −3.61% |
| deskit KNORA-IU | 60.06% | 2.53% | +1.48% |
| DESlib KNORA-U | 59.91% | 2.32% | +1.22% |
| DESlib KNORA-E | 57.64% | 2.49% | −2.61% |
| DESlib OLA | 57.46% | 2.44% | −2.91% |
| DESlib META-DES | 58.28% | 2.61% | −1.52% |
| DESlib KNOP | 59.88% | 2.40% | +1.17% |
| DESlib DESP | 59.48% | 2.23% | +0.50% |
| DESlib DESKNN | 58.19% | 2.11% | −1.69% |
deskit achieves a best mean score of 60.06%; DESlib achieves a best mean score of 59.91%.
Image Segment
| Method | Mean | Std | vs Best Single |
|---|---|---|---|
| Best Single | 93.65% | 1.11% | — |
| Simple Average | 95.24% | 1.04% | +1.70% |
| deskit KNN-DWS | 95.56% | 0.94% | +2.04% |
| deskit KNN-DWS-I | 95.71% | 0.96% | +2.20% |
| deskit OLA | 94.96% | 0.89% | +1.39% |
| deskit KNORA-U | 95.60% | 1.02% | +2.08% |
| deskit KNORA-E | 95.66% | 0.95% | +2.14% |
| deskit KNORA-IU | 95.84% | 0.98% | +2.33% |
| DESlib KNORA-U | 95.10% | 1.05% | +1.54% |
| DESlib KNORA-E | 95.45% | 0.89% | +1.91% |
| DESlib OLA | 94.96% | 0.95% | +1.40% |
| DESlib META-DES | 95.61% | 0.91% | +2.09% |
| DESlib KNOP | 95.34% | 0.96% | +1.80% |
| DESlib DESP | 94.89% | 1.05% | +1.32% |
| DESlib DESKNN | 94.82% | 0.98% | +1.25% |
deskit achieves a best mean score of 95.84%; DESlib achieves a best mean score of 95.61%.
Vowel
| Method | Mean | Std | vs Best Single |
|---|---|---|---|
| Best Single | 90.54% | 2.17% | — |
| Simple Average | 88.90% | 2.40% | −1.81% |
| deskit KNN-DWS | 90.13% | 2.27% | −0.46% |
| deskit KNN-DWS-I | 90.48% | 2.26% | −0.07% |
| deskit OLA | 90.36% | 2.32% | −0.20% |
| deskit KNORA-U | 90.76% | 2.16% | +0.25% |
| deskit KNORA-E | 90.92% | 2.12% | +0.42% |
| deskit KNORA-IU | 91.38% | 2.05% | +0.93% |
| DESlib KNORA-U | 88.76% | 2.31% | −1.96% |
| DESlib KNORA-E | 89.69% | 2.11% | −0.94% |
| DESlib OLA | 88.30% | 2.71% | −2.48% |
| DESlib META-DES | 90.09% | 2.16% | −0.50% |
| DESlib KNOP | 89.27% | 2.30% | −1.40% |
| DESlib DESP | 86.13% | 2.52% | −4.88% |
| DESlib DESKNN | 85.37% | 2.94% | −5.71% |
deskit achieves a best mean score of 91.38%; DESlib achieves a best mean score of 90.09%.
Waveform
| Method | Mean | Std | vs Best Single |
|---|---|---|---|
| Best Single | 86.28% | 1.10% | — |
| Simple Average | 85.38% | 1.02% | −1.04% |
| deskit KNN-DWS | 85.80% | 1.04% | −0.56% |
| deskit KNN-DWS-I | 85.80% | 1.02% | −0.55% |
| deskit OLA | 84.03% | 1.10% | −2.61% |
| deskit KNORA-U | 85.60% | 1.02% | −0.78% |
| deskit KNORA-E | 82.84% | 1.13% | −3.99% |
| deskit KNORA-IU | 85.62% | 1.02% | −0.77% |
| DESlib KNORA-U | 85.83% | 1.03% | −0.53% |
| DESlib KNORA-E | 83.19% | 1.14% | −3.57% |
| DESlib OLA | 81.19% | 1.29% | −5.90% |
| DESlib META-DES | 85.29% | 1.11% | −1.15% |
| DESlib KNOP | 86.10% | 1.08% | −0.21% |
| DESlib DESP | 85.78% | 1.03% | −0.57% |
| DESlib DESKNN | 84.61% | 1.18% | −1.93% |
deskit achieves a best mean score of 85.80%; DESlib achieves a best mean score of 86.10%.
Timing
Mean fit + predict time in milliseconds, averaged across 100 seeds. Fit is measured once per dataset per seed; predict is measured over the full test set.
deskit caches all model predictions on the validation set at fit time and reads from that matrix at inference, so no model is called at predict time. This is the primary reason for the speed advantage over DESlib, which calls each model live per neighbour at inference.
deskit was run with preset='balanced', which uses FAISS IVF (approximate search) instead of exact KNN, but the difference in performance is not very pronounced on datasets of the sizes used here.
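The caching idea can be illustrated with a small NumPy sketch. It is not deskit's implementation or API: `CachedRouter`, its methods, the brute-force neighbour search (standing in for a FAISS index), and the hard pick of the locally best model are all assumptions made for brevity. The point it shows is that per-sample validation errors are stored once at fit time, so routing a query only needs a neighbour lookup and a read from that matrix.

```python
import numpy as np

class CachedRouter:
    """Hypothetical router that reuses cached validation errors at query time."""

    def __init__(self, k=20):
        self.k = k

    def fit(self, X_val, y_val, val_preds):
        # val_preds has shape (n_models, n_val); each model was called once
        # on the validation set before this point and is never called again.
        self.X_val = np.asarray(X_val, dtype=float)
        errors = np.abs(np.asarray(val_preds, dtype=float) - np.asarray(y_val, dtype=float))
        self.errors = errors.T  # cached matrix, shape (n_val, n_models)
        return self

    def weights(self, X_query):
        X_query = np.asarray(X_query, dtype=float)
        # Exact brute-force squared distances; an approximate index could replace this.
        d2 = ((X_query[:, None, :] - self.X_val[None, :, :]) ** 2).sum(axis=2)
        k = min(self.k, len(self.X_val))
        nn = np.argsort(d2, axis=1)[:, :k]         # (n_query, k)
        local_err = self.errors[nn].mean(axis=1)   # (n_query, n_models)
        w = np.zeros_like(local_err)
        w[np.arange(len(w)), local_err.argmin(axis=1)] = 1.0
        return w

# Toy data: model 0 is accurate on the left cluster, model 1 on the right.
X_val = [[0.0], [1.0], [10.0], [11.0]]
y_val = [0.0, 0.0, 1.0, 1.0]
val_preds = [[0.0, 0.0, 0.0, 0.0],
             [1.0, 1.0, 1.0, 1.0]]

router = CachedRouter(k=2).fit(X_val, y_val, val_preds)
w = router.weights([[0.5], [10.5]])
```

In this sketch the returned weights are combined with pool predictions for the query points that the caller supplies; the router itself never invokes a model, which is the property the timing comparison relies on.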
deskit
| Dataset | KNN-DWS | KNN-DWS-I | OLA | KNORA-U | KNORA-E | KNORA-IU |
|---|---|---|---|---|---|---|
| California Housing | 24.4 ms | 23.9 ms | 22.9 ms | 26.2 ms | 34.5 ms | 27.9 ms |
| Bike Sharing | 19.8 ms | 19.3 ms | 18.5 ms | 21.4 ms | 28.4 ms | 22.9 ms |
| Abalone | 5.1 ms | 5.0 ms | 4.7 ms | 5.3 ms | 7.1 ms | 5.7 ms |
| Diabetes | 1.4 ms | 1.4 ms | 1.2 ms | 1.3 ms | 1.6 ms | 1.3 ms |
| Concrete Strength | 1.7 ms | 1.7 ms | 1.6 ms | 1.8 ms | 2.2 ms | 1.8 ms |
| HAR | 66.9 ms | 55.5 ms | 55.0 ms | 55.9 ms | 60.8 ms | 57.9 ms |
| Yeast | 3.5 ms | 3.2 ms | 3.0 ms | 2.8 ms | 3.1 ms | 3.0 ms |
| Image Segment | 5.8 ms | 5.5 ms | 5.3 ms | 5.1 ms | 5.4 ms | 5.3 ms |
| Vowel | 3.5 ms | 3.4 ms | 3.2 ms | 3.1 ms | 3.3 ms | 3.1 ms |
| Waveform | 10.5 ms | 9.8 ms | 9.6 ms | 9.1 ms | 9.9 ms | 9.8 ms |
DESlib (classification datasets only)
| Dataset | KNORA-U | KNORA-E | OLA | META-DES | KNOP | DESP | DESKNN |
|---|---|---|---|---|---|---|---|
| HAR | 1853.2 ms | 1862.5 ms | 1871.0 ms | 2857.3 ms | 2823.9 ms | 1886.4 ms | 1912.5 ms |
| Yeast | 60.7 ms | 62.0 ms | 63.3 ms | 108.2 ms | 84.3 ms | 61.0 ms | 70.1 ms |
| Image Segment | 20.9 ms | 21.2 ms | 20.8 ms | 36.9 ms | 32.2 ms | 21.3 ms | 25.0 ms |
| Vowel | 53.5 ms | 53.7 ms | 54.7 ms | 96.3 ms | 72.8 ms | 53.6 ms | 58.0 ms |
| Waveform | 186.2 ms | 191.4 ms | 193.1 ms | 333.2 ms | 312.2 ms | 197.7 ms | 211.4 ms |