Benchmarks#
Benchmarking USearch#
Hyper-parameters#
All major HNSW implementation share an identical list of hyper-parameters:
connectivity (often called
M),expansion on additions (often called
efConstruction),expansion on search (often called
ef).
The default values vary drastically.
Library |
Connectivity |
EF @ A |
EF @ S |
|---|---|---|---|
|
16 |
200 |
10 |
|
32 |
40 |
16 |
|
16 |
128 |
64 |
Below are the performance numbers for a benchmark running on the 64 cores of AWS c7g.metal “Graviton 3”-based instances.
The main columns are:
Add: Number of insertion Queries Per Second.
Search: Number search Queries Per Second.
Recall @1: How often does approximate search yield the exact best match?
Different “connectivity”#
Vectors |
Connectivity |
EF @ A |
EF @ S |
Add, QPS |
Search, QPS |
Recall @1 |
|---|---|---|---|---|---|---|
|
16 |
128 |
64 |
75’640 |
131’654 |
99.3% |
|
12 |
128 |
64 |
81’747 |
149’728 |
99.0% |
|
32 |
128 |
64 |
64’368 |
104’050 |
99.4% |
Different “expansion factors”#
Vectors |
Connectivity |
EF @ A |
EF @ S |
Add, QPS |
Search, QPS |
Recall @1 |
|---|---|---|---|---|---|---|
|
16 |
128 |
64 |
75’640 |
131’654 |
99.3% |
|
16 |
64 |
32 |
128’644 |
228’422 |
97.2% |
|
16 |
256 |
128 |
39’981 |
69’065 |
99.2% |
Different vectors “quantization”#
Vectors |
Connectivity |
EF @ A |
EF @ S |
Add, QPS |
Search, QPS |
Recall @1 |
|---|---|---|---|---|---|---|
|
16 |
128 |
64 |
87’995 |
171’856 |
99.1% |
|
16 |
128 |
64 |
87’270 |
153’788 |
98.4% |
|
16 |
128 |
64 |
71’454 |
132’673 |
98.4% |
|
16 |
128 |
64 |
115’923 |
274’653 |
98.9% |
As seen on the chart, for f16 quantization, performance may differ depending on native hardware support for that numeric type.
Also worth noting, 8-bit quantization results in almost no quantization loss and may perform better than f16.
Utilities#
Within this repository you will find two commonly used utilities:
cpp/bench.cppthe produces thebench_cppbinary for broad USearch benchmarks.python/bench.pyandpython/bench.ipynbfor interactive charts against FAISS.
C++ Benchmarking Utilities#
To achieve best highest results we suggest compiling locally for the target architecture.
git submodule update --init --recursive
cmake -DUSEARCH_BUILD_BENCH_CPP=1 -DUSEARCH_BUILD_TEST_C=1 -DUSEARCH_USE_NUMKONG=1 -DUSEARCH_USE_OPENMP=1 -DCMAKE_BUILD_TYPE=RelWithDebInfo -B build_profile
cmake --build build_profile --config RelWithDebInfo --parallel
build_profile/bench_cpp --help
Which would print the following instructions.
SYNOPSIS
build_profile/bench_cpp [--vectors <path>] [--queries <path>] [--neighbors <path>] [-o
<path>] [-b] [-j <integer>] [-c <integer>] [--expansion-add
<integer>] [--expansion-search <integer>] [--rows-skip <integer>]
[--rows-take <integer>] [--dtype <type>] [--metric <name>] [-h]
OPTIONS
--vectors <path>
.[fhbd]bin, .i8bin, .u8bin, .f32bin file path to construct the index
--queries <path>
.[fhbd]bin, .i8bin, .u8bin, .f32bin file path to query the index
--neighbors <path>
.ibin file path with ground truth
-o, --output <path>
.usearch output file path
-b, --big Will switch to uint40_t for neighbors lists with over 4B entries
-j, --threads <integer>
Uses all available cores by default
-c, --connectivity <integer>
Index granularity
--expansion-add <integer>
Affects indexing depth
--expansion-search <integer>
Affects search depth
--rows-skip <integer>
Number of vectors to skip
--rows-take <integer>
Number of vectors to take
--dtype <type>
Quantization type: f64, f32, bf16, f16, e5m2, e4m3, e3m2, e2m3, i8, u8, b1
--metric <name>
Distance metric: ip, l2sq, cos, hamming, tanimoto, sorensen, haversine
-h, --help Print this help information on this tool and exit
Here is an example of running the C++ benchmark:
build_profile/bench_cpp \
--vectors datasets/wiki_1M/base.1M.fbin \
--queries datasets/wiki_1M/query.public.100K.fbin \
--neighbors datasets/wiki_1M/groundtruth.public.100K.ibin \
--dtype bf16 --metric ip
build_profile/bench_cpp \
--vectors datasets/t2i_1B/base.1B.fbin \
--queries datasets/t2i_1B/query.public.100K.fbin \
--neighbors datasets/t2i_1B/groundtruth.public.100K.ibin \
--output datasets/t2i_1B/index.usearch \
--dtype bf16 --metric cos
Optional parameters include
connectivity,expansion_add,expansion_search.
For Python, jut open the Jupyter Notebook and start playing around.
Python Benchmarking Utilities#
Several benchmarking suites are available for Python: approximate search, exact search, and clustering.
python/scripts/bench.py --help
python/scripts/bench_exact.py --help
python/scripts/bench_cluster.py --help
Datasets#
BigANN benchmark is a good starting point, if you are searching for large collections of high-dimensional vectors. Those often come with precomputed ground-truth neighbors, which is handy for recall evaluation. Datasets below are grouped by scale; only configurations with matching ground truth support recall evaluation.
~1M Scale — Development & Testing#
Dataset |
Scalar Type |
Dimensions |
Metric |
Base Size |
Ground Truth |
|---|---|---|---|---|---|
|
256 |
IP |
1 GB |
100K queries, yes |
|
|
256 |
IP |
3 GB |
cross-modal pairing |
|
|
768 |
IP |
6 GB |
cross-modal pairing |
~10M Scale#
Dataset |
Scalar Type |
Dimensions |
Metric |
Base Size |
Ground Truth |
|---|---|---|---|---|---|
|
128 |
L2 |
1.2 GB |
10K queries, yes |
|
|
100 |
L2 |
3.7 GB |
100K queries, yes |
|
|
96 |
L2 |
3.6 GB |
¹ no subset GT |
¹ Yandex only publishes ground truth computed against the full 1B dataset. A
base.10M.fbinexists for download but using 1B ground truth with a subset will produce misleadingly low recall. Use it only for throughput/latency testing, not recall evaluation.
~100M Scale#
Dataset |
Scalar Type |
Dimensions |
Metric |
Base Size |
Ground Truth |
|---|---|---|---|---|---|
|
128 |
L2 |
12 GB |
10K queries, yes |
|
|
100 |
L2 |
37 GB |
100K queries, yes |
|
|
100 |
L2 |
9.3 GB |
30K queries, yes |
~1B Scale#
Dataset |
Scalar Type |
Dimensions |
Metric |
Base Size |
Ground Truth |
|---|---|---|---|---|---|
|
128 |
L2 |
119 GB |
10K queries, yes |
|
|
100 |
L2 |
373 GB |
100K queries, yes |
|
|
100 |
L2 |
93 GB |
30K queries, yes |
|
|
200 |
Cos |
750 GB |
100K queries, yes |
|
|
96 |
L2 |
358 GB |
10K queries, yes |
|
|
2048 |
Cos |
2 - 10 TB |
— |
Unum UForm Creative Captions#
A cross-modal dataset of ~2.75M image-text pairs embedded with UForm VL English (256 dimensions).
No separate query/ground-truth files — the natural ground truth is the image-text pairing: image[i] matches text[i].
mkdir -p datasets/cc_3M/ && \
wget -nc https://huggingface.co/datasets/unum-cloud/ann-cc-3m/resolve/main/images.uform-vl-english.fbin \
-O datasets/cc_3M/images.fbin && \
wget -nc https://huggingface.co/datasets/unum-cloud/ann-cc-3m/resolve/main/texts.uform-vl-english.fbin \
-O datasets/cc_3M/texts.fbin
To benchmark cross-modal join:
python python/scripts/join.py \
--vectors-a datasets/cc_3M/texts.fbin \
--vectors-b datasets/cc_3M/images.fbin \
--metric cos --diagnostics
Unum UForm Wiki#
mkdir -p datasets/wiki_1M/ && \
wget -nc https://huggingface.co/datasets/unum-cloud/ann-wiki-1m/resolve/main/base.1M.fbin -P datasets/wiki_1M/ &&
wget -nc https://huggingface.co/datasets/unum-cloud/ann-wiki-1m/resolve/main/query.public.100K.fbin -P datasets/wiki_1M/ &&
wget -nc https://huggingface.co/datasets/unum-cloud/ann-wiki-1m/resolve/main/groundtruth.public.100K.ibin -P datasets/wiki_1M/
Arxiv with E5#
mkdir -p datasets/arxiv_2M/ && \
wget -nc https://huggingface.co/datasets/unum-cloud/ann-arxiv-2m/resolve/main/abstract.e5-base-v2.fbin -P datasets/arxiv_2M/ &&
wget -nc https://huggingface.co/datasets/unum-cloud/ann-arxiv-2m/resolve/main/title.e5-base-v2.fbin -P datasets/arxiv_2M/
Yandex Text-to-Image#
Warning: Yandex only publishes ground truth computed against the full 1B dataset. A
base.1M.fbinsubset exists for download but has no matching ground truth — using 1B ground truth with the 1M subset will produce misleadingly low recall. Usebase.1M.fbinonly for throughput/latency testing, not recall evaluation.
mkdir -p datasets/t2i_1B/ && \
wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/T2I/base.1B.fbin -P datasets/t2i_1B/ &&
wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/T2I/query.public.100K.fbin -P datasets/t2i_1B/ &&
wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/T2I/groundtruth.public.100K.ibin -P datasets/t2i_1B/
To run the benchmark (requires the full 1B base for valid recall):
build_profile/bench_cpp \
--vectors datasets/t2i_1B/base.1B.fbin \
--queries datasets/t2i_1B/query.public.100K.fbin \
--neighbors datasets/t2i_1B/groundtruth.public.100K.ibin \
--output datasets/t2i_1B/index.usearch \
--dtype bf16 --metric cos
Yandex Deep#
Warning: Yandex only publishes ground truth computed against the full 1B dataset. Smaller base files (
base.10M.fbin,base.1M.fbin) are available for download but have no matching ground truth — using 1B ground truth with a subset will produce misleadingly low recall. Use those files only for throughput/latency testing, not recall evaluation.
mkdir -p datasets/deep_1B/ && \
wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/DEEP/base.1B.fbin -P datasets/deep_1B/ &&
wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/DEEP/query.public.10K.fbin -P datasets/deep_1B/ &&
wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/DEEP/groundtruth.public.10K.ibin -P datasets/deep_1B/
To run the benchmark (requires the full 1B base for valid recall):
build_profile/bench_cpp \
--vectors datasets/deep_1B/base.1B.fbin \
--queries datasets/deep_1B/query.public.10K.fbin \
--neighbors datasets/deep_1B/groundtruth.public.10K.ibin \
--output datasets/deep_1B/index.usearch \
--dtype bf16 --metric l2sq
Meta BIGANN — SIFT#
The full 1B dataset is available from Meta. No pre-sliced subset base files exist, so range requests are used to download only the first N vectors, followed by a header patch to update the vector count. Pre-computed ground truth is available for 10M and 100M subsets.
10M subset, ~1.2 GB#
mkdir -p datasets/sift_10M/ && \
wget -nc https://dl.fbaipublicfiles.com/billion-scale-ann-benchmarks/bigann/query.public.10K.u8bin -P datasets/sift_10M/ && \
wget -nc https://dl.fbaipublicfiles.com/billion-scale-ann-benchmarks/GT_10M/bigann-10M -O datasets/sift_10M/groundtruth.public.10K.ibin && \
wget --header="Range: bytes=0-1280000007" \
https://dl.fbaipublicfiles.com/billion-scale-ann-benchmarks/bigann/base.1B.u8bin \
-O datasets/sift_10M/base.10M.u8bin && \
python3 -c "
import struct
with open('datasets/sift_10M/base.10M.u8bin', 'r+b') as f:
f.write(struct.pack('I', 10_000_000))
"
build_profile/bench_cpp \
--vectors datasets/sift_10M/base.10M.u8bin \
--queries datasets/sift_10M/query.public.10K.u8bin \
--neighbors datasets/sift_10M/groundtruth.public.10K.ibin \
--output datasets/sift_10M/index.usearch \
--dtype u8 --metric l2sq
100M subset, ~12 GB#
mkdir -p datasets/sift_100M/ && \
wget -nc https://dl.fbaipublicfiles.com/billion-scale-ann-benchmarks/bigann/query.public.10K.u8bin -P datasets/sift_100M/ && \
wget -nc https://dl.fbaipublicfiles.com/billion-scale-ann-benchmarks/GT_100M/bigann-100M -O datasets/sift_100M/groundtruth.public.10K.ibin && \
wget --header="Range: bytes=0-12800000007" \
https://dl.fbaipublicfiles.com/billion-scale-ann-benchmarks/bigann/base.1B.u8bin \
-O datasets/sift_100M/base.100M.u8bin && \
python3 -c "
import struct
with open('datasets/sift_100M/base.100M.u8bin', 'r+b') as f:
f.write(struct.pack('I', 100_000_000))
"
build_profile/bench_cpp \
--vectors datasets/sift_100M/base.100M.u8bin \
--queries datasets/sift_100M/query.public.10K.u8bin \
--neighbors datasets/sift_100M/groundtruth.public.10K.ibin \
--output datasets/sift_100M/index.usearch \
--dtype u8 --metric l2sq
Microsoft Turing-ANNS#
The full 1B dataset is ~373 GB of f32 vectors with 100 dimensions.
Subsets can be obtained via range requests, followed by a header patch to update the vector count.
Pre-computed ground truth is available for 1M, 10M, and 100M subsets.
1M subset, ~400 MB#
mkdir -p datasets/turing_1M/ && \
wget -nc https://comp21storage.z5.web.core.windows.net/comp21/MSFT-TURING-ANNS/query100K.fbin \
-O datasets/turing_1M/query.public.100K.fbin && \
wget -nc https://comp21storage.z5.web.core.windows.net/comp21/MSFT-TURING-ANNS/msturing-gt-1M \
-O datasets/turing_1M/groundtruth.public.100K.ibin && \
wget --header="Range: bytes=0-400000007" \
https://comp21storage.z5.web.core.windows.net/comp21/MSFT-TURING-ANNS/base1b.fbin \
-O datasets/turing_1M/base.1M.fbin && \
python3 -c "
import struct
with open('datasets/turing_1M/base.1M.fbin', 'r+b') as f:
f.write(struct.pack('I', 1_000_000))
"
build_profile/bench_cpp \
--vectors datasets/turing_1M/base.1M.fbin \
--queries datasets/turing_1M/query.public.100K.fbin \
--neighbors datasets/turing_1M/groundtruth.public.100K.ibin \
--output datasets/turing_1M/index.usearch \
--dtype bf16 --metric l2sq
10M subset, ~3.7 GB#
mkdir -p datasets/turing_10M/ && \
wget -nc https://comp21storage.z5.web.core.windows.net/comp21/MSFT-TURING-ANNS/query100K.fbin \
-O datasets/turing_10M/query.public.100K.fbin && \
wget -nc https://comp21storage.z5.web.core.windows.net/comp21/MSFT-TURING-ANNS/msturing-gt-10M \
-O datasets/turing_10M/groundtruth.public.100K.ibin && \
wget --header="Range: bytes=0-4000000007" \
https://comp21storage.z5.web.core.windows.net/comp21/MSFT-TURING-ANNS/base1b.fbin \
-O datasets/turing_10M/base.10M.fbin && \
python3 -c "
import struct
with open('datasets/turing_10M/base.10M.fbin', 'r+b') as f:
f.write(struct.pack('I', 10_000_000))
"
build_profile/bench_cpp \
--vectors datasets/turing_10M/base.10M.fbin \
--queries datasets/turing_10M/query.public.100K.fbin \
--neighbors datasets/turing_10M/groundtruth.public.100K.ibin \
--output datasets/turing_10M/index.usearch \
--dtype bf16 --metric l2sq
100M subset, ~37 GB#
mkdir -p datasets/turing_100M/ && \
wget -nc https://comp21storage.z5.web.core.windows.net/comp21/MSFT-TURING-ANNS/query100K.fbin \
-O datasets/turing_100M/query.public.100K.fbin && \
wget -nc https://comp21storage.z5.web.core.windows.net/comp21/MSFT-TURING-ANNS/msturing-gt-100M \
-O datasets/turing_100M/groundtruth.public.100K.ibin && \
wget --header="Range: bytes=0-40000000007" \
https://comp21storage.z5.web.core.windows.net/comp21/MSFT-TURING-ANNS/base1b.fbin \
-O datasets/turing_100M/base.100M.fbin && \
python3 -c "
import struct
with open('datasets/turing_100M/base.100M.fbin', 'r+b') as f:
f.write(struct.pack('I', 100_000_000))
"
build_profile/bench_cpp \
--vectors datasets/turing_100M/base.100M.fbin \
--queries datasets/turing_100M/query.public.100K.fbin \
--neighbors datasets/turing_100M/groundtruth.public.100K.ibin \
--output datasets/turing_100M/index.usearch \
--dtype bf16 --metric l2sq
Microsoft SpaceV#
The original dataset can be pulled in a USearch-compatible form from AWS S3:
mkdir -p datasets/spacev_1B/ && \
aws s3 cp s3://your-bucket/path/to/spacev/ datasets/spacev_1B/ --recursive
A smaller 100M dataset can be pulled from Hugging Face.
mkdir -p datasets/spacev_100M/ && \
wget -nc https://huggingface.co/datasets/unum-cloud/ann-spacev-100m/resolve/main/ids.100M.i32bin -P datasets/spacev_100M/ &&
wget -nc https://huggingface.co/datasets/unum-cloud/ann-spacev-100m/resolve/main/base.100M.i8bin -P datasets/spacev_100M/ &&
wget -nc https://huggingface.co/datasets/unum-cloud/ann-spacev-100m/resolve/main/query.30K.i8bin -P datasets/spacev_100M/ &&
wget -nc https://huggingface.co/datasets/unum-cloud/ann-spacev-100m/resolve/main/groundtruth.30K.i32bin -P datasets/spacev_100M/ &&
wget -nc https://huggingface.co/datasets/unum-cloud/ann-spacev-100m/resolve/main/groundtruth.30K.f32bin -P datasets/spacev_100M/
build_profile/bench_cpp \
--vectors datasets/spacev_100M/base.100M.i8bin \
--queries datasets/spacev_100M/query.30K.i8bin \
--neighbors datasets/spacev_100M/groundtruth.30K.i32bin \
--output datasets/spacev_100M/index.usearch \
--dtype i8 \
--metric l2sq
Profiling#
With perf:
# Pass environment variables with `-E`, and `-d` for details
sudo -E perf stat -d build_profile/bench_cpp ...
sudo -E perf mem -d build_profile/bench_cpp ...
# Sample on-CPU functions for the specified command, at 1 Kilo Hertz:
sudo -E perf record -F 1000 build_profile/bench_cpp ...
perf record -d -e arm_spe// -- build_profile/bench_cpp ..
Caches#
sudo perf stat -e 'faults,dTLB-loads,dTLB-load-misses,cache-misses,cache-references' build_profile/bench_cpp ...
Typical output on a 1M vectors dataset is:
255426 faults
305988813388 dTLB-loads
8845723783 dTLB-load-misses # 2.89% of all dTLB cache accesses
20094264206 cache-misses # 6.567 % of all cache refs
305988812745 cache-references
8.285148010 seconds time elapsed
500.705967000 seconds user
1.371118000 seconds sys
If you notice problems and the stalls are closer to 90%, it might be a good reason to consider enabling Huge Pages and tuning allocations alignment. To enable Huge Pages:
sudo cat /proc/sys/vm/nr_hugepages
sudo sysctl -w vm.nr_hugepages=2048
sudo reboot
sudo cat /proc/sys/vm/nr_hugepages