Benchmark

All results are from a 20-seed benchmark (seeds 0–19) on standard datasets. Each seed produces a different random train/validation/test split, and results are averaged across all 20 seeds to reduce variance from any single split.

Methodology

Split: Each dataset is split into train, validation, and test sets. The pool models are trained on the train set, the DES router is fitted on the validation set, and the routed ensemble is evaluated on the test set.
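The seeded protocol described above can be sketched as follows. This is a minimal illustration, not the benchmark script: the 60/20/20 split ratio, the `three_way_split` helper, and the use of a single `Ridge` model on the Diabetes dataset are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split


def three_way_split(X, y, seed, val_size=0.2, test_size=0.2):
    """Split into train/validation/test sets (the 60/20/20 ratio is an assumption)."""
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    # Rescale so the validation share is expressed relative to the full dataset.
    rel_val = val_size / (1.0 - test_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=rel_val, random_state=seed
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)


X, y = load_diabetes(return_X_y=True)
scores = []
for seed in range(20):  # seeds 0-19, one random split per seed
    (X_tr, y_tr), (X_val, y_val), (X_te, y_te) = three_way_split(X, y, seed)
    model = Ridge().fit(X_tr, y_tr)  # stand-in for one pool member
    scores.append(mean_absolute_error(y_te, model.predict(X_te)))

# Report the 20-seed mean and standard deviation, as in the result tables.
print(f"MAE: {np.mean(scores):.3f} ± {np.std(scores):.3f} over {len(scores)} seeds")
```

In the actual benchmark the validation split would be passed to the DES router rather than left unused as it is here.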

Best Single is the best individual model from the pool, selected by validation set performance. It represents the baseline without any ensembling.

Simple Average is a uniform blend of all five models with no fitting or tuning. It represents the simplest possible ensemble baseline.
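The two baselines can be illustrated with a small NumPy sketch for the regression case. The function names and the toy arrays are hypothetical; each column of a prediction matrix is assumed to hold one pool model's predictions.

```python
import numpy as np


def best_single_index(val_preds, y_val):
    """Return the column index of the model with the lowest validation MAE.

    val_preds has shape (n_val, n_models).
    """
    val_mae = np.abs(val_preds - y_val[:, None]).mean(axis=0)
    return int(np.argmin(val_mae))


def simple_average(test_preds):
    """Uniform blend: the unweighted mean of all model predictions per sample."""
    return test_preds.mean(axis=1)


# Toy data: three validation samples, three models.
y_val = np.array([1.0, 2.0, 3.0])
val_preds = np.array([
    [1.1, 0.0, 1.5],
    [2.1, 4.0, 2.5],
    [2.9, 1.0, 3.5],
])
best = best_single_index(val_preds, y_val)  # model 0 has the lowest MAE

test_preds = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 4.0, 4.0],
])
avg = simple_average(test_preds)
print(best, avg)
```

For classification, the analogous baselines would select by validation accuracy and average class probabilities; that variant is not shown here.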

All deskit algorithms use preset="balanced" (FAISS IVF) and k=20.

Pool

The same five-model pool is used across all datasets. Where a row lists two models, the first is used for regression and the second for classification:

Model
K-Nearest Neighbors
Decision Tree
SVR / SVM-RBF
Ridge / Gaussian NB
Bayesian Ridge / Logistic Regression

These five were chosen for their different inductive biases and architectures, which is the kind of scenario in which DES would typically be used.
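A pool like the one listed above could be assembled with scikit-learn as in the sketch below. The `make_pool` helper is hypothetical, and all hyperparameters are scikit-learn defaults (apart from the RBF kernel named in the table and a raised `max_iter`), which is an assumption: the settings used in the benchmark are not stated here.

```python
from sklearn.linear_model import BayesianRidge, LogisticRegression, Ridge
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.svm import SVC, SVR
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor


def make_pool(task, seed=0):
    """Build an unfitted five-model pool for the given task (hypothetical helper)."""
    if task == "regression":
        return {
            "knn": KNeighborsRegressor(),
            "tree": DecisionTreeRegressor(random_state=seed),
            "svr": SVR(),
            "ridge": Ridge(),
            "bayesian_ridge": BayesianRidge(),
        }
    if task == "classification":
        return {
            "knn": KNeighborsClassifier(),
            "tree": DecisionTreeClassifier(random_state=seed),
            # probability=True so the SVM can supply class probabilities.
            "svm_rbf": SVC(kernel="rbf", probability=True, random_state=seed),
            "gaussian_nb": GaussianNB(),
            "logreg": LogisticRegression(max_iter=1000),
        }
    raise ValueError(f"unknown task: {task!r}")


regression_pool = make_pool("regression")
classification_pool = make_pool("classification")
print(sorted(regression_pool), sorted(classification_pool))
```

Distance- and margin-based members such as KNN and the SVM are sensitive to feature scale, so in practice they would usually be wrapped in a pipeline with a scaler.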

Datasets

California Housing

Source: sklearn built-in. Size: 20,640 samples, 8 features.

Predict median house value from census block features.

Bike Sharing

Source: OpenML 42712. Size: 17,379 samples, 12 features.

Predict hourly bike rental counts from weather and time features.

Abalone

Source: OpenML 183. Size: 4,177 samples, 8 features.

Predict abalone age from physical measurements.

Diabetes

Source: sklearn built-in. Size: 442 samples, 10 features.

Predict disease progression one year after baseline from physiological measurements.

Concrete Strength

Source: OpenML 4353. Size: 1,030 samples, 8 features.

Predict concrete compressive strength from ingredient and curing age features.

HAR

Source: OpenML 1478. Size: 10,299 samples, 561 features.

Six-class classification of human activities from smartphone accelerometer and gyroscope data.

Yeast

Source: OpenML 181. Size: 1,484 samples, 8 features.

Ten-class protein localisation classification with class imbalance.

Image Segment

Source: OpenML 36. Size: 2,310 samples, 19 features.

Seven-class classification of outdoor image segments from colour and texture statistics.

Vowel

Source: OpenML 307. Size: 990 samples, 10 features.

Eleven-class vowel recognition from LPC-derived formant frequencies.

Waveform

Source: OpenML 60. Size: 5,000 samples, 40 features.

Three-class classification of artificially constructed waveforms with deliberate class overlap.
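The datasets listed above can be fetched with scikit-learn's loaders, as in the sketch below. The `load_dataset` helper and the `OPENML_IDS` mapping are hypothetical names; the IDs are the ones given in this section. Any preprocessing (for example encoding categorical columns) is not shown, and the OpenML and California Housing loaders download data on first use.

```python
from sklearn.datasets import fetch_california_housing, fetch_openml, load_diabetes

# OpenML dataset IDs as listed in this section.
OPENML_IDS = {
    "Bike Sharing": 42712,
    "Abalone": 183,
    "Concrete Strength": 4353,
    "HAR": 1478,
    "Yeast": 181,
    "Image Segment": 36,
    "Vowel": 307,
    "Waveform": 60,
}


def load_dataset(name):
    """Return (X, y) for a benchmark dataset by name (hypothetical helper)."""
    if name == "California Housing":
        return fetch_california_housing(return_X_y=True)  # downloaded on first use
    if name == "Diabetes":
        return load_diabetes(return_X_y=True)  # bundled with scikit-learn
    data = fetch_openml(data_id=OPENML_IDS[name], as_frame=True)
    return data.data, data.target


X, y = load_dataset("Diabetes")
print(X.shape)
```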

Regression results

MAE, lower is better. % shown as delta vs Best Single. 20-seed mean ± std.

Dataset	Best Single	Simple Avg	DEWS-U	DEWS-I	DEWS-T	DEWS-V	DEWS-IV	LWSE-U	LWSE-I	OLA	KNORA-U	KNORA-E	KNORA-IU
California Housing	0.3956 ± 0.008	+7.99%	−2.24%	−2.54%	−2.52%	−1.17%	−1.44%	−1.67%	−1.85%	−0.03%	−0.79%	+7.46%	−0.99%
Bike Sharing	51.678 ± 0.860	+47.77%	−5.34%	−6.86%	−6.85%	−3.25%	−4.63%	−3.85%	−4.59%	−2.97%	+6.57%	+14.79%	+5.39%
Abalone	1.4981 ± 0.044	+1.14%	+2.68%	+2.82%	+2.80%	+3.22%	+3.20%	+3.33%	+3.38%	+3.67%	+1.47%	+7.18%	+1.47%
Diabetes	44.504 ± 2.645	+3.18%	+1.17%	+1.09%	+1.09%	+1.09%	+0.86%	+3.36%	+3.25%	+3.56%	+5.86%	+15.34%	+5.74%
Concrete Strength	5.2686 ± 0.336	+23.66%	+1.68%	−1.20%	−1.01%	+3.36%	+0.46%	−3.46%	−5.41%	+3.54%	+2.49%	+11.84%	−1.05%

KNORA variants are designed for classification, which explains their poor performance on most regression datasets. Exceptions can occur where the feature space has hard clusters (as in Concrete Strength) or where the target is discrete and classification-like (as in Abalone).

LWSE-I is the clear winner on Concrete Strength (−5.41%), where strong local competence heterogeneity allows the per-sample NNLS solver to find genuine local blends. DEWS-IV edges out all other algorithms on Diabetes (+0.86%), although on Diabetes, as on Abalone, every ensembling method loses to the best single model.
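To make the idea of a per-sample NNLS blend concrete, here is a minimal sketch using `scipy.optimize.nnls`. It is not deskit's implementation: the `local_nnls_weights` function, the normalisation of weights to sum to one, and the uniform fallback are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.neighbors import NearestNeighbors


def local_nnls_weights(x_query, nn_index, val_preds, y_val):
    """Fit non-negative blend weights on the validation neighbours of one query."""
    _, idx = nn_index.kneighbors(x_query.reshape(1, -1))
    idx = idx[0]
    # Solve min ||A w - b|| subject to w >= 0 on the local neighbourhood only.
    w, _ = nnls(val_preds[idx], y_val[idx])
    total = w.sum()
    if total == 0.0:
        # Assumed fallback: uniform weights when the solver returns all zeros.
        return np.full(val_preds.shape[1], 1.0 / val_preds.shape[1])
    return w / total  # assumed normalisation so the weights sum to one


rng = np.random.default_rng(0)
X_val = rng.uniform(0.0, 1.0, size=(200, 2))
y_val = X_val[:, 0] + 1.0
# Model 0 is exact on the validation set; model 1 is uninformative noise.
val_preds = np.column_stack([y_val, rng.uniform(0.0, 1.0, size=200)])

nn_index = NearestNeighbors(n_neighbors=20).fit(X_val)  # k=20 as in the benchmark
weights = local_nnls_weights(np.array([0.5, 0.5]), nn_index, val_preds, y_val)
print(weights)
```

Because the weights are refitted for every query point, the blend can differ from region to region, which is the behaviour the Concrete Strength result is attributed to.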

Classification results

Accuracy, higher is better. % shown as delta vs Best Single. 20-seed mean ± std. Classification datasets include a comparison against DESlib, a mature sklearn-compatible DES library.

HAR

Method	Mean	Std	vs Best Single
Best Single	98.24%	0.23%	—
Simple Average	97.92%	0.26%	−0.33%
deskit DEWS-U	98.38%	0.27%	+0.14%
deskit DEWS-I	98.39%	0.27%	+0.15%
deskit DEWS-T	98.40%	0.28%	+0.16%
deskit DEWS-V	98.38%	0.27%	+0.14%
deskit DEWS-IV	98.38%	0.27%	+0.14%
deskit LWSE-U	98.17%	0.34%	−0.07%
deskit LWSE-I	98.20%	0.34%	−0.04%
deskit OLA	98.00%	0.41%	−0.25%
deskit KNORA-U	98.18%	0.29%	−0.06%
deskit KNORA-E	98.02%	0.27%	−0.22%
deskit KNORA-IU	98.19%	0.29%	−0.05%
DESlib KNORA-U	98.00%	0.26%	−0.25%
DESlib KNORA-E	97.82%	0.34%	−0.43%
DESlib OLA	97.09%	0.48%	−1.17%
DESlib LCA	91.29%	1.27%	−7.08%
DESlib MCB	97.03%	0.36%	−1.24%
DESlib META-DES	98.35%	0.30%	+0.11%
DESlib KNOP	98.33%	0.29%	+0.09%
DESlib DESP	97.97%	0.27%	−0.28%
DESlib DESKNN	97.81%	0.33%	−0.44%
DESlib DES-MI	95.56%	1.31%	−2.73%

deskit achieves a best mean score of 98.40%; DESlib achieves a best mean score of 98.35%.
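The "vs Best Single" column appears to be a relative change rather than a difference in percentage points. The sketch below, with a hypothetical `relative_delta` helper, reproduces two table entries from the rounded means; small last-digit differences on other rows can arise because the published means are themselves rounded.

```python
def relative_delta(method_score, best_single_score):
    """Percentage change of a method's score relative to the Best Single score."""
    return 100.0 * (method_score - best_single_score) / best_single_score


# HAR, Simple Average: 97.92% vs Best Single 98.24%.
har_simple_avg = relative_delta(97.92, 98.24)
# Vowel, deskit LWSE-I: 92.60% vs Best Single 89.95%.
vowel_lwse_i = relative_delta(92.60, 89.95)

print(f"{har_simple_avg:+.2f}%")  # -0.33%
print(f"{vowel_lwse_i:+.2f}%")    # +2.95%
```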

Yeast

Method	Mean	Std	vs Best Single
Best Single	58.87%	2.20%	—
Simple Average	59.33%	1.96%	+0.77%
deskit DEWS-U	59.46%	1.94%	+1.00%
deskit DEWS-I	59.58%	1.99%	+1.20%
deskit DEWS-T	59.71%	2.07%	+1.43%
deskit DEWS-V	59.51%	2.05%	+1.09%
deskit DEWS-IV	59.56%	2.19%	+1.17%
deskit LWSE-U	58.40%	1.97%	−0.80%
deskit LWSE-I	58.42%	2.10%	−0.77%
deskit OLA	58.16%	1.94%	−1.20%
deskit KNORA-U	59.68%	1.91%	+1.37%
deskit KNORA-E	56.82%	2.37%	−3.49%
deskit KNORA-IU	59.85%	1.90%	+1.66%
DESlib KNORA-U	59.48%	1.81%	+1.03%
DESlib KNORA-E	56.97%	1.91%	−3.23%
DESlib OLA	56.84%	2.13%	−3.46%
DESlib LCA	55.71%	1.62%	−5.38%
DESlib MCB	57.22%	2.34%	−2.80%
DESlib META-DES	57.46%	2.35%	−2.40%
DESlib KNOP	59.12%	1.81%	+0.43%
DESlib DESP	58.77%	1.72%	−0.17%
DESlib DESKNN	57.93%	1.74%	−1.60%
DESlib DES-MI	56.90%	2.25%	−3.35%

deskit achieves a best mean score of 59.85%; DESlib achieves a best mean score of 59.48%.

Image Segment

Method	Mean	Std	vs Best Single
Best Single	93.70%	0.98%	—
Simple Average	95.01%	1.03%	+1.40%
deskit DEWS-U	95.58%	0.86%	+2.01%
deskit DEWS-I	95.79%	0.95%	+2.23%
deskit DEWS-T	95.81%	0.98%	+2.25%
deskit DEWS-V	95.55%	0.93%	+1.98%
deskit DEWS-IV	95.81%	0.91%	+2.25%
deskit LWSE-U	95.58%	0.87%	+2.01%
deskit LWSE-I	95.79%	0.82%	+2.23%
deskit OLA	94.98%	0.92%	+1.36%
deskit KNORA-U	95.37%	0.92%	+1.78%
deskit KNORA-E	95.41%	1.01%	+1.82%
deskit KNORA-IU	95.66%	0.89%	+2.09%
DESlib KNORA-U	94.95%	0.97%	+1.33%
DESlib KNORA-E	95.25%	0.89%	+1.65%
DESlib OLA	94.65%	0.89%	+1.02%
DESlib LCA	92.35%	1.00%	−1.44%
DESlib MCB	94.62%	1.08%	+0.98%
DESlib META-DES	95.48%	0.81%	+1.89%
DESlib KNOP	95.19%	1.00%	+1.59%
DESlib DESP	94.68%	0.91%	+1.04%
DESlib DESKNN	94.76%	1.00%	+1.13%
DESlib DES-MI	94.76%	1.00%	+1.13%

deskit achieves a best mean score of 95.81%; DESlib achieves a best mean score of 95.48%.

Vowel

Method	Mean	Std	vs Best Single
Best Single	89.95%	2.94%	—
Simple Average	88.11%	2.98%	−2.05%
deskit DEWS-U	89.87%	2.51%	−0.08%
deskit DEWS-I	90.25%	2.52%	+0.34%
deskit DEWS-T	90.38%	2.56%	+0.48%
deskit DEWS-V	89.80%	2.57%	−0.17%
deskit DEWS-IV	90.10%	2.52%	+0.17%
deskit LWSE-U	91.84%	1.97%	+2.11%
deskit LWSE-I	92.60%	1.81%	+2.95%
deskit OLA	90.23%	2.45%	+0.31%
deskit KNORA-U	90.15%	2.49%	+0.22%
deskit KNORA-E	90.61%	2.29%	+0.73%
deskit KNORA-IU	90.78%	2.14%	+0.93%
DESlib KNORA-U	88.18%	2.58%	−1.97%
DESlib KNORA-E	89.47%	2.45%	−0.53%
DESlib OLA	88.38%	2.96%	−1.74%
DESlib LCA	78.61%	3.96%	−12.61%
DESlib MCB	86.36%	2.95%	−3.99%
DESlib META-DES	89.70%	2.27%	−0.28%
DESlib KNOP	88.61%	2.58%	−1.49%
DESlib DESP	85.56%	2.99%	−4.88%
DESlib DESKNN	85.23%	3.60%	−5.25%
DESlib DES-MI	85.23%	3.60%	−5.25%

deskit achieves a best mean score of 92.60%; DESlib achieves a best mean score of 89.70%.

Waveform

Method	Mean	Std	vs Best Single
Best Single	85.91%	0.76%	—
Simple Average	85.07%	0.76%	−0.98%
deskit DEWS-U	85.57%	0.79%	−0.40%
deskit DEWS-I	85.54%	0.79%	−0.43%
deskit DEWS-T	85.57%	0.75%	−0.39%
deskit DEWS-V	85.43%	0.80%	−0.55%
deskit DEWS-IV	85.43%	0.81%	−0.56%
deskit LWSE-U	83.87%	0.96%	−2.37%
deskit LWSE-I	83.92%	0.94%	−2.32%
deskit OLA	84.15%	0.89%	−2.04%
deskit KNORA-U	85.41%	0.80%	−0.59%
deskit KNORA-E	82.91%	1.12%	−3.50%
deskit KNORA-IU	85.42%	0.78%	−0.58%
DESlib KNORA-U	85.61%	0.82%	−0.35%
DESlib KNORA-E	83.19%	1.02%	−3.17%
DESlib OLA	81.14%	1.15%	−5.55%
DESlib LCA	77.10%	1.64%	−10.25%
DESlib MCB	82.16%	1.18%	−4.36%
DESlib META-DES	85.19%	0.91%	−0.84%
DESlib KNOP	85.97%	0.97%	+0.07%
DESlib DESP	85.50%	0.82%	−0.47%
DESlib DESKNN	84.39%	0.95%	−1.78%
DESlib DES-MI	84.07%	1.02%	−2.14%

deskit achieves a best mean score of 85.57%; DESlib achieves a best mean score of 85.97%.

Timing

Mean fit + predict time in milliseconds, averaged across 20 seeds. Fit is measured once per dataset per seed; predict is measured over the full test set.
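A combined fit + predict measurement of this kind can be taken with `time.perf_counter`, as in the sketch below. The `time_fit_predict_ms` helper is hypothetical, and a scikit-learn `KNeighborsRegressor` on synthetic data stands in for a DES router; this is not the benchmark's timing harness.

```python
import time

import numpy as np
from sklearn.neighbors import KNeighborsRegressor


def time_fit_predict_ms(estimator, X_fit, y_fit, X_test):
    """Return wall-clock time for one fit plus one full-test-set predict, in ms."""
    start = time.perf_counter()
    estimator.fit(X_fit, y_fit)
    estimator.predict(X_test)
    return (time.perf_counter() - start) * 1000.0


rng = np.random.default_rng(0)
X_fit = rng.normal(size=(500, 8))
y_fit = rng.normal(size=500)
X_test = rng.normal(size=(200, 8))

# One measurement per seed, then average, mirroring the 20-seed protocol.
timings = [
    time_fit_predict_ms(KNeighborsRegressor(n_neighbors=20), X_fit, y_fit, X_test)
    for _ in range(20)
]
print(f"{np.mean(timings):.1f} ms")
```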

deskit caches all model predictions on the validation set at fit time and reads from that matrix at inference, so no model is called at predict time. This is the primary reason for the speed advantage over DESlib, which calls each model live per neighbour at inference.
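The caching idea can be sketched as follows for a regression, OLA-style selector. This is an illustration under stated assumptions, not deskit's code: the `CachedLocalSelector` class is hypothetical, it uses exact scikit-learn nearest neighbours rather than FAISS, and it assumes the caller computes each model's test predictions once in a batch and passes them in.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


class CachedLocalSelector:
    """Pick, per test sample, the model with the lowest cached local error."""

    def __init__(self, k=20):
        self.k = k

    def fit(self, X_val, val_preds, y_val):
        # Cache per-sample absolute errors once; no model is needed afterwards.
        self.errors_ = np.abs(val_preds - y_val[:, None])  # (n_val, n_models)
        self.nn_ = NearestNeighbors(n_neighbors=min(self.k, len(X_val))).fit(X_val)
        return self

    def predict(self, X_test, test_preds):
        _, idx = self.nn_.kneighbors(X_test)         # (n_test, k)
        local_err = self.errors_[idx].mean(axis=1)   # (n_test, n_models)
        chosen = local_err.argmin(axis=1)
        return test_preds[np.arange(len(X_test)), chosen]


# Toy data: model 0 is accurate for x < 0, model 1 for x > 0.
X_val = np.linspace(-1.0, 1.0, 40).reshape(-1, 1)
y_val = np.zeros(40)
val_preds = np.column_stack([
    np.where(X_val[:, 0] < 0, 0.0, 5.0),
    np.where(X_val[:, 0] < 0, 5.0, 0.0),
])

selector = CachedLocalSelector(k=5).fit(X_val, val_preds, y_val)
X_test = np.array([[-0.8], [0.8]])
test_preds = np.array([[1.0, 2.0], [3.0, 4.0]])
routed = selector.predict(X_test, test_preds)
print(routed)
```

At predict time the only work is a neighbour lookup and an indexed read from the cached error matrix, which is why this pattern avoids per-neighbour model calls.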

deskit was run with preset="balanced", which uses a FAISS IVF index instead of exact KNN search, but the performance difference between the two is not very pronounced on datasets of the sizes used here.

deskit

Dataset	DEWS-U	DEWS-I	DEWS-T	DEWS-V	DEWS-IV	LWSE-U	LWSE-I	OLA	KNORA-U	KNORA-E	KNORA-IU
California Housing	25.2 ms	23.5 ms	31.0 ms	27.8 ms	29.1 ms	44.1 ms	59.3 ms	22.4 ms	26.3 ms	34.7 ms	27.6 ms
Bike Sharing	19.5 ms	19.1 ms	25.2 ms	22.7 ms	23.5 ms	35.2 ms	47.6 ms	18.8 ms	21.0 ms	28.3 ms	22.6 ms
Abalone	5.2 ms	5.2 ms	6.8 ms	6.0 ms	6.5 ms	8.8 ms	11.9 ms	4.5 ms	5.4 ms	7.3 ms	5.3 ms
Diabetes	1.4 ms	1.3 ms	1.4 ms	1.4 ms	1.3 ms	1.5 ms	1.8 ms	1.0 ms	1.3 ms	1.4 ms	1.2 ms
Concrete Strength	1.8 ms	1.8 ms	2.4 ms	2.0 ms	2.0 ms	2.4 ms	3.2 ms	1.4 ms	1.6 ms	2.2 ms	1.7 ms
HAR	65.8 ms	56.1 ms	60.1 ms	57.6 ms	58.3 ms	74.1 ms	83.4 ms	56.6 ms	57.8 ms	63.2 ms	60.1 ms
Yeast	4.5 ms	3.9 ms	4.5 ms	3.7 ms	3.6 ms	5.8 ms	7.0 ms	2.8 ms	2.9 ms	3.1 ms	2.9 ms
Image Segment	6.0 ms	5.4 ms	6.3 ms	5.4 ms	5.6 ms	8.5 ms	11.2 ms	5.3 ms	5.3 ms	5.4 ms	5.5 ms
Vowel	3.4 ms	3.3 ms	3.5 ms	3.2 ms	3.1 ms	4.5 ms	5.7 ms	3.1 ms	3.0 ms	3.1 ms	3.1 ms
Waveform	10.0 ms	10.0 ms	10.5 ms	10.2 ms	10.4 ms	15.2 ms	19.8 ms	9.4 ms	9.3 ms	9.9 ms	9.8 ms

DESlib (classification datasets only)

Dataset	KNORA-U	KNORA-E	OLA	LCA	MCB	META-DES	KNOP	DESP	DESKNN	DES-MI
HAR	1886.2 ms	1884.9 ms	1884.5 ms	1901.0 ms	1900.5 ms	2905.5 ms	2841.3 ms	1917.8 ms	1946.7 ms	1919.4 ms
Yeast	56.7 ms	57.3 ms	58.4 ms	59.4 ms	63.3 ms	99.8 ms	79.5 ms	56.8 ms	66.1 ms	57.1 ms
Image Segment	22.6 ms	22.1 ms	21.8 ms	22.6 ms	22.8 ms	37.7 ms	33.2 ms	21.3 ms	25.6 ms	21.5 ms
Vowel	50.3 ms	49.1 ms	50.2 ms	52.1 ms	55.2 ms	89.6 ms	68.2 ms	49.4 ms	53.4 ms	49.9 ms
Waveform	185.3 ms	192.1 ms	194.2 ms	195.0 ms	201.6 ms	332.9 ms	312.7 ms	204.3 ms	220.9 ms	202.1 ms
