Benchmark

All results are from a 100-seed benchmark (seeds 0–99) on standard datasets. Each seed produces a different random train/validation/test split, and results are averaged across all 100 seeds to reduce the variance from any single split.
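The aggregation over seeds can be sketched as follows. This is a minimal illustration, not the benchmark's actual script: `run_one_seed` is a hypothetical stand-in that returns a made-up score, and the use of the sample standard deviation is an assumption, since the estimator behind the reported std is not stated.

```python
import random
import statistics


def run_one_seed(seed: int) -> float:
    """Hypothetical stand-in for one full benchmark run.

    A real run would split the data with this seed, train the pool,
    fit the router and return a test-set score. Here it only returns
    a reproducible made-up number.
    """
    rng = random.Random(seed)
    return 0.40 + rng.gauss(0.0, 0.01)


seeds = range(100)  # seeds 0-99
scores = [run_one_seed(seed) for seed in seeds]

mean_score = statistics.mean(scores)
std_score = statistics.stdev(scores)  # sample std (assumption)

print(f"{mean_score:.4f} ± {std_score:.4f}")
```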

Methodology

Split: Each dataset is split into train, validation, and test sets. The pool models are trained on the train set, the DES router is fitted on the validation set, and the routed ensemble is evaluated on the test set.
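A seeded three-way split of this kind might look like the sketch below. The split fractions are not given in this document, so the 60/20/20 proportions and the `split_indices` helper are assumptions made purely for illustration.

```python
import random


def split_indices(n_samples, seed, val_frac=0.2, test_frac=0.2):
    """Return disjoint (train, val, test) index lists for one seed.

    val_frac and test_frac are illustrative values, not the
    benchmark's actual proportions.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)  # the seed alone determines the permutation

    n_test = int(n_samples * test_frac)
    n_val = int(n_samples * val_frac)

    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


# Example with the size of the Diabetes dataset (442 samples)
train_idx, val_idx, test_idx = split_indices(442, seed=0)
```

Because the permutation depends only on the seed, re-running a seed reproduces its split, while different seeds give different splits.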

Best Single is the best individual model from the pool, selected by validation set performance. It represents the baseline without any ensembling.

Simple Average is a uniform blend of all five models with no fitting or tuning. It represents the simplest possible ensemble baseline.
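The two baselines can be illustrated for the regression case with a small sketch. The function names and toy numbers are hypothetical. For classification, a uniform blend would typically average predicted class probabilities, but how the benchmark blends classifiers is not stated here.

```python
def mae(y_true, y_pred):
    """Mean absolute error of one prediction list."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def best_single_index(val_preds, y_val):
    """Index of the model with the lowest validation MAE.

    val_preds holds one list of validation predictions per model.
    """
    return min(range(len(val_preds)), key=lambda j: mae(y_val, val_preds[j]))


def simple_average(preds):
    """Uniform blend: average the models' predictions sample by sample."""
    return [sum(row) / len(row) for row in zip(*preds)]


# Toy data: model 0 is exact, model 1 predicts double the target
y_val = [1.0, 2.0, 3.0]
val_preds = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
y_test = [4.0, 5.0]
test_preds = [[4.0, 5.0], [8.0, 10.0]]

best = best_single_index(val_preds, y_val)            # chosen on validation
best_single_mae = mae(y_test, test_preds[best])       # scored on test
simple_avg_mae = mae(y_test, simple_average(test_preds))
```

Note that the best single model is chosen on the validation set and only then scored on the test set, so the test set plays no part in the selection.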

All deskit algorithms use preset="balanced" (FAISS IVF) and k=20.

Pool

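The same five-model pool is used across all datasets. Where two names are listed, the first is used on regression datasets and the second on classification datasets: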

Model
K-Nearest Neighbors
Decision Tree
SVR / SVM-RBF
Ridge / Gaussian NB
Bayesian Ridge / Logistic Regression

These five were chosen for their different inductive biases and architectures, which is the kind of heterogeneous pool that DES is intended for.

Datasets

California Housing

Source: sklearn built-in. Size: 20,640 samples, 8 features.

Predict median house value from census block features.

Bike Sharing

Source: OpenML 42712. Size: 17,379 samples, 12 features.

Predict hourly bike rental counts from weather and time features.

Abalone

Source: OpenML 183. Size: 4,177 samples, 8 features.

Predict abalone age from physical measurements.

Diabetes

Source: sklearn built-in. Size: 442 samples, 10 features.

Predict disease progression one year after baseline from physiological measurements.

Concrete Strength

Source: OpenML 4353. Size: 1,030 samples, 8 features.

Predict concrete compressive strength from ingredient and curing age features.

HAR

Source: OpenML 1478. Size: 10,299 samples, 561 features.

Six-class classification of human activities from smartphone accelerometer and gyroscope data.

Yeast

Source: OpenML 181. Size: 1,484 samples, 8 features.

Ten-class protein localisation classification with class imbalance.

Image Segment

Source: OpenML 36. Size: 2,310 samples, 19 features.

Seven-class classification of outdoor image segments from colour and texture statistics.

Vowel

Source: OpenML 307. Size: 990 samples, 10 features.

Eleven-class vowel recognition from LPC-derived formant frequencies.

Waveform

Source: OpenML 60. Size: 5,000 samples, 40 features.

Three-class classification of artificially constructed waveforms with deliberate class overlap.

Regression results

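MAE, lower is better. The Best Single column shows the 100-seed mean ± std; the remaining columns show the percentage delta vs Best Single, so negative values are improvements.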

Dataset	Best Single	Simple Avg	KNN-DWS	KNN-DWS-I	OLA	KNORA-U	KNORA-E	KNORA-IU
California Housing	0.3955 ± 0.008	+7.93%	−2.41%	−2.68%	−0.31%	−0.81%	+7.22%	−1.03%
Bike Sharing	51.604 ± 1.291	+48.39%	−4.90%	−6.25%	−2.55%	+6.67%	+15.16%	+5.50%
Abalone	1.4923 ± 0.054	+1.29%	+3.00%	+3.12%	+4.02%	+1.63%	+7.60%	+1.61%
Diabetes	44.986 ± 3.370	+2.98%	+0.96%	+0.88%	+3.16%	+5.13%	+14.36%	+5.02%
Concrete Strength	5.3934 ± 0.400	+21.30%	+1.40%	−1.55%	+3.94%	+0.46%	+11.29%	−2.85%

KNORA variants are designed for classification, which explains their poor performance on the regression datasets. However, exceptions can occur on certain datasets, either where the feature space has hard clusters (as in Concrete Strength) or where the target is discrete and classification-like (as in Abalone).

Classification results

Accuracy, higher is better. % shown as delta vs Best Single. 100-seed mean ± std. Classification datasets include a comparison against DESlib, a mature sklearn-compatible DES library.
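The delta column appears to be a relative change, 100 × (score − baseline) / baseline, rather than a difference in percentage points. This reading is inferred from the tables and is not stated explicitly. A minimal sketch that recomputes two of the published deltas from the rounded means (`pct_delta` is an illustrative helper, not part of either library):

```python
def pct_delta(score, baseline):
    """Relative change of score vs baseline, in percent."""
    return 100.0 * (score - baseline) / baseline


# HAR: Simple Average (97.93%) vs Best Single (98.24%)
har_simple_avg = pct_delta(97.93, 98.24)

# Vowel: DESlib DESKNN (85.37%) vs Best Single (90.54%)
vowel_desknn = pct_delta(85.37, 90.54)

print(f"{har_simple_avg:+.2f}%  {vowel_desknn:+.2f}%")
```

Deltas recomputed this way can differ from the published ones in the last digit, presumably because the published values were computed from unrounded means.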

HAR

Method	Mean	Std	vs Best Single
Best Single	98.24%	0.25%	—
Simple Average	97.93%	0.33%	−0.32%
deskit KNN-DWS	98.37%	0.27%	+0.13%
deskit KNN-DWS-I	98.37%	0.27%	+0.14%
deskit OLA	97.99%	0.31%	−0.25%
deskit KNORA-U	98.18%	0.29%	−0.05%
deskit KNORA-E	97.99%	0.31%	−0.25%
deskit KNORA-IU	98.19%	0.29%	−0.04%
DESlib KNORA-U	98.04%	0.32%	−0.20%
DESlib KNORA-E	97.83%	0.34%	−0.41%
DESlib OLA	97.05%	0.44%	−1.20%
DESlib META-DES	98.37%	0.31%	+0.14%
DESlib KNOP	98.32%	0.30%	+0.08%
DESlib DESP	97.98%	0.33%	−0.26%
DESlib DESKNN	97.81%	0.33%	−0.43%

deskit achieves a best mean score of 98.37%; DESlib achieves a best mean score of 98.37%.

Yeast

Method	Mean	Std	vs Best Single
Best Single	59.19%	2.70%	—
Simple Average	59.46%	2.44%	+0.46%
deskit KNN-DWS	59.89%	2.42%	+1.18%
deskit KNN-DWS-I	59.91%	2.51%	+1.23%
deskit OLA	58.93%	2.37%	−0.44%
deskit KNORA-U	59.89%	2.53%	+1.18%
deskit KNORA-E	57.05%	2.63%	−3.61%
deskit KNORA-IU	60.06%	2.53%	+1.48%
DESlib KNORA-U	59.91%	2.32%	+1.22%
DESlib KNORA-E	57.64%	2.49%	−2.61%
DESlib OLA	57.46%	2.44%	−2.91%
DESlib META-DES	58.28%	2.61%	−1.52%
DESlib KNOP	59.88%	2.40%	+1.17%
DESlib DESP	59.48%	2.23%	+0.50%
DESlib DESKNN	58.19%	2.11%	−1.69%

deskit achieves a best mean score of 60.06%; DESlib achieves a best mean score of 59.91%.

Image Segment

Method	Mean	Std	vs Best Single
Best Single	93.65%	1.11%	—
Simple Average	95.24%	1.04%	+1.70%
deskit KNN-DWS	95.56%	0.94%	+2.04%
deskit KNN-DWS-I	95.71%	0.96%	+2.20%
deskit OLA	94.96%	0.89%	+1.39%
deskit KNORA-U	95.60%	1.02%	+2.08%
deskit KNORA-E	95.66%	0.95%	+2.14%
deskit KNORA-IU	95.84%	0.98%	+2.33%
DESlib KNORA-U	95.10%	1.05%	+1.54%
DESlib KNORA-E	95.45%	0.89%	+1.91%
DESlib OLA	94.96%	0.95%	+1.40%
DESlib META-DES	95.61%	0.91%	+2.09%
DESlib KNOP	95.34%	0.96%	+1.80%
DESlib DESP	94.89%	1.05%	+1.32%
DESlib DESKNN	94.82%	0.98%	+1.25%

deskit achieves a best mean score of 95.84%; DESlib achieves a best mean score of 95.61%.

Vowel

Method	Mean	Std	vs Best Single
Best Single	90.54%	2.17%	—
Simple Average	88.90%	2.40%	−1.81%
deskit KNN-DWS	90.13%	2.27%	−0.46%
deskit KNN-DWS-I	90.48%	2.26%	−0.07%
deskit OLA	90.36%	2.32%	−0.20%
deskit KNORA-U	90.76%	2.16%	+0.25%
deskit KNORA-E	90.92%	2.12%	+0.42%
deskit KNORA-IU	91.38%	2.05%	+0.93%
DESlib KNORA-U	88.76%	2.31%	−1.96%
DESlib KNORA-E	89.69%	2.11%	−0.94%
DESlib OLA	88.30%	2.71%	−2.48%
DESlib META-DES	90.09%	2.16%	−0.50%
DESlib KNOP	89.27%	2.30%	−1.40%
DESlib DESP	86.13%	2.52%	−4.88%
DESlib DESKNN	85.37%	2.94%	−5.71%

deskit achieves a best mean score of 91.38%; DESlib achieves a best mean score of 90.09%.

Waveform

Method	Mean	Std	vs Best Single
Best Single	86.28%	1.10%	—
Simple Average	85.38%	1.02%	−1.04%
deskit KNN-DWS	85.80%	1.04%	−0.56%
deskit KNN-DWS-I	85.80%	1.02%	−0.55%
deskit OLA	84.03%	1.10%	−2.61%
deskit KNORA-U	85.60%	1.02%	−0.78%
deskit KNORA-E	82.84%	1.13%	−3.99%
deskit KNORA-IU	85.62%	1.02%	−0.77%
DESlib KNORA-U	85.83%	1.03%	−0.53%
DESlib KNORA-E	83.19%	1.14%	−3.57%
DESlib OLA	81.19%	1.29%	−5.90%
DESlib META-DES	85.29%	1.11%	−1.15%
DESlib KNOP	86.10%	1.08%	−0.21%
DESlib DESP	85.78%	1.03%	−0.57%
DESlib DESKNN	84.61%	1.18%	−1.93%

deskit achieves a best mean score of 85.80%; DESlib achieves a best mean score of 86.10%.

Timing

Mean fit + predict time in milliseconds, averaged across 100 seeds. Fit is measured once per dataset per seed; predict is measured over the full test set.
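A timing harness along these lines might look like the sketch below. It assumes that fit and predict are timed together as one wall-clock interval per seed, which this document does not specify. `make_run` is a hypothetical factory returning the two callables for a given seed.

```python
import time
from statistics import mean


def fit_predict_ms(fit, predict):
    """Wall-clock time of one fit call plus one predict call, in ms."""
    start = time.perf_counter()
    fit()
    predict()
    return (time.perf_counter() - start) * 1000.0


def mean_time_ms(make_run, seeds):
    """Average fit + predict time over seeds.

    make_run(seed) is a hypothetical factory returning (fit, predict).
    """
    return mean(fit_predict_ms(*make_run(seed)) for seed in seeds)


# Toy usage: do-nothing callables stand in for a real router
calls = []


def make_run(seed):
    return (lambda: calls.append(("fit", seed)),
            lambda: calls.append(("predict", seed)))


avg_ms = mean_time_ms(make_run, range(3))
```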

deskit caches all model predictions on the validation set at fit time and reads from that matrix at inference, so no model is called at predict time. This is the primary reason for the speed advantage over DESlib, which calls each model live per neighbour at inference.
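The caching idea can be sketched in a few lines. This is not deskit's implementation or API: `CachedRouter` is a hypothetical class, the selection rule is a simple "lowest mean error among the k nearest validation points" chosen for illustration, and neighbours are found by exact brute-force search rather than FAISS.

```python
from math import dist


class CachedRouter:
    """Illustrative only; not deskit's actual class."""

    def fit(self, models, X_val, y_val):
        self.X_val = list(X_val)
        # Call every model once per validation point and cache its
        # absolute error. Rows are validation points, columns are models.
        self.val_errors = [
            [abs(model(x) - y) for model in models]
            for x, y in zip(X_val, y_val)
        ]
        return self

    def select(self, x, k):
        """Pick a model index for x using only the cached error matrix."""
        nearest = sorted(
            range(len(self.X_val)), key=lambda i: dist(x, self.X_val[i])
        )[:k]
        n_models = len(self.val_errors[0])
        local_error = [
            sum(self.val_errors[i][j] for i in nearest) / k
            for j in range(n_models)
        ]
        return min(range(n_models), key=local_error.__getitem__)


# Toy models that count how often they are called
calls = {"n": 0}


def model_a(x):
    calls["n"] += 1
    return 0.0


def model_b(x):
    calls["n"] += 1
    return 1.0


X_val = [(0.0,), (1.0,), (10.0,), (11.0,)]
y_val = [0.0, 0.0, 1.0, 1.0]

router = CachedRouter().fit([model_a, model_b], X_val, y_val)
calls_after_fit = calls["n"]

choice_low = router.select((0.5,), k=2)
choice_high = router.select((10.5,), k=2)
calls_after_select = calls["n"]
```

In this sketch the call counter stops increasing after `fit`, which is the property the paragraph above describes: routing decisions at inference are lookups into the cached matrix.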

deskit was run with preset="balanced", which uses FAISS IVF (approximate nearest-neighbour search) instead of exact KNN, but the resulting difference in timing is small on datasets of this size.

deskit

Dataset	KNN-DWS	KNN-DWS-I	OLA	KNORA-U	KNORA-E	KNORA-IU
California Housing	24.4 ms	23.9 ms	22.9 ms	26.2 ms	34.5 ms	27.9 ms
Bike Sharing	19.8 ms	19.3 ms	18.5 ms	21.4 ms	28.4 ms	22.9 ms
Abalone	5.1 ms	5.0 ms	4.7 ms	5.3 ms	7.1 ms	5.7 ms
Diabetes	1.4 ms	1.4 ms	1.2 ms	1.3 ms	1.6 ms	1.3 ms
Concrete Strength	1.7 ms	1.7 ms	1.6 ms	1.8 ms	2.2 ms	1.8 ms
HAR	66.9 ms	55.5 ms	55.0 ms	55.9 ms	60.8 ms	57.9 ms
Yeast	3.5 ms	3.2 ms	3.0 ms	2.8 ms	3.1 ms	3.0 ms
Image Segment	5.8 ms	5.5 ms	5.3 ms	5.1 ms	5.4 ms	5.3 ms
Vowel	3.5 ms	3.4 ms	3.2 ms	3.1 ms	3.3 ms	3.1 ms
Waveform	10.5 ms	9.8 ms	9.6 ms	9.1 ms	9.9 ms	9.8 ms

DESlib (classification datasets only)

Dataset	KNORA-U	KNORA-E	OLA	META-DES	KNOP	DESP	DESKNN
HAR	1853.2 ms	1862.5 ms	1871.0 ms	2857.3 ms	2823.9 ms	1886.4 ms	1912.5 ms
Yeast	60.7 ms	62.0 ms	63.3 ms	108.2 ms	84.3 ms	61.0 ms	70.1 ms
Image Segment	20.9 ms	21.2 ms	20.8 ms	36.9 ms	32.2 ms	21.3 ms	25.0 ms
Vowel	53.5 ms	53.7 ms	54.7 ms	96.3 ms	72.8 ms	53.6 ms	58.0 ms
Waveform	186.2 ms	191.4 ms	193.1 ms	333.2 ms	312.2 ms	197.7 ms	211.4 ms
