Benchmarks#

Benchmarking USearch#

Hyper-parameters#

All major HNSW implementations share an identical list of hyper-parameters:

  • connectivity (often called M),

  • expansion on additions (often called efConstruction),

  • expansion on search (often called ef).

The default values vary drastically.

| Library | Connectivity | EF @ A | EF @ S |
| :------ | -----------: | -----: | -----: |
| hnswlib |           16 |    200 |     10 |
| FAISS   |           32 |     40 |     16 |
| USearch |           16 |    128 |     64 |

Below are the performance numbers for a benchmark running on the 64 cores of AWS c7g.metal “Graviton 3”-based instances. The main columns are:

  • Add: Number of insertion Queries Per Second.

  • Search: Number of search Queries Per Second.

  • Recall @1: How often does approximate search yield the exact best match?
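To make the column definitions concrete, here is a minimal sketch of how Recall @1 and a QPS figure can be computed. The helper names `recall_at_1` and `queries_per_second` and the sample IDs are hypothetical; the benchmark utilities may compute these numbers differently.

```python
import time

def recall_at_1(found_ids, true_ids):
    """Fraction of queries whose top approximate result equals the exact nearest neighbor."""
    if len(found_ids) != len(true_ids):
        raise ValueError("both lists must contain one entry per query")
    hits = sum(1 for found, true in zip(found_ids, true_ids) if found == true)
    return hits / len(true_ids)

def queries_per_second(operation, count):
    """Call `operation` `count` times and return the measured throughput."""
    start = time.perf_counter()
    for _ in range(count):
        operation()
    elapsed = time.perf_counter() - start
    return count / elapsed

# Hypothetical results for five queries: the approximate search misses one of them.
approximate = [10, 42, 7, 3, 99]
exact = [10, 42, 7, 5, 99]
print(f"Recall @1: {recall_at_1(approximate, exact):.1%}")  # Recall @1: 80.0%
```

Note that the sketch measures single-threaded throughput, while the numbers below were collected across 64 cores.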

Different “connectivity”#

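| Vectors  | Connectivity | EF @ A | EF @ S | Add, QPS | Search, QPS | Recall @1 |
| :------- | -----------: | -----: | -----: | -------: | ----------: | --------: |
| f32 x256 |           16 |    128 |     64 |   75’640 |     131’654 |     99.3% |
| f32 x256 |           12 |    128 |     64 |   81’747 |     149’728 |     99.0% |
| f32 x256 |           32 |    128 |     64 |   64’368 |     104’050 |     99.4% |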

Different “expansion factors”#

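| Vectors  | Connectivity | EF @ A | EF @ S | Add, QPS | Search, QPS | Recall @1 |
| :------- | -----------: | -----: | -----: | -------: | ----------: | --------: |
| f32 x256 |           16 |    128 |     64 |   75’640 |     131’654 |     99.3% |
| f32 x256 |           16 |     64 |     32 |  128’644 |     228’422 |     97.2% |
| f32 x256 |           16 |    256 |    128 |   39’981 |      69’065 |     99.2% |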
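The trade-off in the table above is easier to judge as ratios against the default row. The sketch below only re-derives figures from the numbers already listed; the `relative_to_baseline` helper is a hypothetical name.

```python
# (expansion on add, expansion on search) -> (add QPS, search QPS, recall @1 in %)
runs = {
    (128, 64): (75_640, 131_654, 99.3),
    (64, 32): (128_644, 228_422, 97.2),
    (256, 128): (39_981, 69_065, 99.2),
}

def relative_to_baseline(run, baseline):
    """Return (add speedup, search speedup, recall change in percentage points)."""
    return run[0] / baseline[0], run[1] / baseline[1], run[2] - baseline[2]

baseline = runs[(128, 64)]
for (ef_add, ef_search), run in runs.items():
    add_ratio, search_ratio, recall_delta = relative_to_baseline(run, baseline)
    print(f"EF {ef_add}/{ef_search}: add x{add_ratio:.2f}, "
          f"search x{search_ratio:.2f}, recall {recall_delta:+.1f} pt")
```

Halving both expansion factors buys roughly 1.7x throughput for about 2 points of recall, while doubling them roughly halves throughput without improving recall in this run.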

Different vectors “quantization”#

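| Vectors     | Connectivity | EF @ A | EF @ S | Add, QPS | Search, QPS | Recall @1 |
| :---------- | -----------: | -----: | -----: | -------: | ----------: | --------: |
| f32 x256    |           16 |    128 |     64 |   87’995 |     171’856 |     99.1% |
| f16 x256    |           16 |    128 |     64 |   87’270 |     153’788 |     98.4% |
| f16 x256 ✳️ |           16 |    128 |     64 |   71’454 |     132’673 |     98.4% |
| i8 x256     |           16 |    128 |     64 |  115’923 |     274’653 |     98.9% |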

As the table shows, f16 performance may differ depending on native hardware support for that numeric type. Also worth noting: 8-bit quantization causes almost no loss in recall and may perform better than f16.
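To see why 8-bit quantization can cost so little accuracy, here is a minimal sketch that compares cosine similarity before and after scaling vectors into the int8 range. It uses a generic symmetric scaling scheme chosen for illustration, which is an assumption and not necessarily how USearch quantizes to i8; the function names and the random test vectors are hypothetical.

```python
import math
import random

def quantize_i8(vector):
    """Scale so the largest magnitude maps to 127, then round to integers in [-127, 127]."""
    largest = max(abs(x) for x in vector) or 1.0  # avoid dividing by zero for all-zero input
    return [round(x / largest * 127) for x in vector]

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

random.seed(0)
a = [random.gauss(0, 1) for _ in range(256)]
b = [x + random.gauss(0, 0.5) for x in a]  # a noisy neighbor of `a`

exact = cosine_similarity(a, b)
approx = cosine_similarity(quantize_i8(a), quantize_i8(b))
print(f"f32 cosine: {exact:.4f}, i8 cosine: {approx:.4f}, error: {abs(exact - approx):.4f}")
```

Each vector uses its own scale here, which is harmless for cosine similarity because the metric ignores vector length; other metrics would need a shared scale.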

Utilities#

Within this repository you will find two commonly used utilities:

  • cpp/bench.cpp, which produces the bench_cpp binary for broad USearch benchmarks.

  • python/bench.py and python/bench.ipynb for interactive charts against FAISS.

C++ Benchmarking Utilities#

To achieve the best results, we suggest compiling locally for the target architecture.

git submodule update --init --recursive
cmake -DUSEARCH_BUILD_BENCH_CPP=1 -DUSEARCH_BUILD_TEST_C=1 -DUSEARCH_USE_OPENMP=1 -DUSEARCH_USE_SIMSIMD=1 -DCMAKE_BUILD_TYPE=RelWithDebInfo -B build_profile
cmake --build build_profile --config RelWithDebInfo -j
build_profile/bench_cpp --help

The last command prints the following instructions:

SYNOPSIS
        build_profile/bench_cpp [--vectors <path>] [--queries <path>] [--neighbors <path>] [-o
                                <path>] [-b] [-j <integer>] [-c <integer>] [--expansion-add
                                <integer>] [--expansion-search <integer>] [--rows-skip <integer>]
                                [--rows-take <integer>] [-bf16|-f16|-i8|-b1]
                                [--ip|--l2sq|--cos|--hamming|--tanimoto|--sorensen|--haversine] [-h]

OPTIONS
        --vectors <path>
                    .[fhbd]bin file path to construct the index

        --queries <path>
                    .[fhbd]bin file path to query the index

        --neighbors <path>
                    .ibin file path with ground truth

        -o, --output <path>
                    .usearch output file path

        -b, --big   Will switch to uint40_t for neighbors lists with over 4B entries
        -j, --threads <integer>
                    Uses all available cores by default

        -c, --connectivity <integer>
                    Index granularity

        --expansion-add <integer>
                    Affects indexing depth

        --expansion-search <integer>
                    Affects search depth

        --rows-skip <integer>
                    Number of vectors to skip

        --rows-take <integer>
                    Number of vectors to take

        -bf16, --bf16quant
                    Enable `bf16_t` quantization

        -f16, --f16quant
                    Enable `f16_t` quantization

        -i8, --i8quant
                    Enable `i8_t` quantization

        -b1, --b1quant
                    Enable `b1x8_t` quantization

        --ip        Choose Inner Product metric
        --l2sq      Choose L2 Euclidean metric
        --cos       Choose Angular metric
        --hamming   Choose Hamming metric
        --tanimoto  Choose Tanimoto metric
        --sorensen  Choose Sorensen metric
        --haversine Choose Haversine metric
        -h, --help  Print this help information on this tool and exit

Here are two examples of running the C++ benchmark:

build_profile/bench_cpp \
    --vectors datasets/wiki_1M/base.1M.fbin \
    --queries datasets/wiki_1M/query.public.100K.fbin \
    --neighbors datasets/wiki_1M/groundtruth.public.100K.ibin

build_profile/bench_cpp \
    --vectors datasets/t2i_1B/base.1B.fbin \
    --queries datasets/t2i_1B/query.public.100K.fbin \
    --neighbors datasets/t2i_1B/groundtruth.public.100K.ibin \
    --output datasets/t2i_1B/index.usearch \
    --cos

Optional parameters include --connectivity, --expansion-add, and --expansion-search.
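When sweeping over several hyper-parameter combinations, it can help to assemble the command line programmatically. The sketch below builds an argument list using only flags that appear in the help output above; the `bench_cpp_command` helper itself is hypothetical and not part of the repository.

```python
import shlex

METRICS = {"ip", "l2sq", "cos", "hamming", "tanimoto", "sorensen", "haversine"}

def bench_cpp_command(vectors, queries, neighbors, output=None, metric=None,
                      connectivity=None, expansion_add=None, expansion_search=None,
                      binary="build_profile/bench_cpp"):
    """Assemble an argument list for bench_cpp from the flags listed in its --help output."""
    args = [binary, "--vectors", vectors, "--queries", queries, "--neighbors", neighbors]
    if output is not None:
        args += ["--output", output]
    if connectivity is not None:
        args += ["--connectivity", str(connectivity)]
    if expansion_add is not None:
        args += ["--expansion-add", str(expansion_add)]
    if expansion_search is not None:
        args += ["--expansion-search", str(expansion_search)]
    if metric is not None:
        if metric not in METRICS:
            raise ValueError(f"unknown metric: {metric}")
        args.append(f"--{metric}")
    return args

command = bench_cpp_command(
    "datasets/wiki_1M/base.1M.fbin",
    "datasets/wiki_1M/query.public.100K.fbin",
    "datasets/wiki_1M/groundtruth.public.100K.ibin",
    connectivity=16, expansion_add=128, expansion_search=64,
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` to launch the benchmark.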

For Python, just open the Jupyter Notebook and start playing around.

Python Benchmarking Utilities#

Several benchmarking suites are available for Python: approximate search, exact search, and clustering.

python/scripts/bench.py --help
python/scripts/bench_exact.py --help
python/scripts/bench_cluster.py --help

Datasets#

The BigANN benchmark is a good starting point if you are looking for large collections of high-dimensional vectors. Those often come with precomputed ground-truth neighbors, which is handy for recall evaluation.

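| Dataset                           | Scalar Type | Dimensions | Metric |      Size |
| :-------------------------------- | :---------- | ---------: | :----- | --------: |
| Unum UForm Creative Captions      | float32     |        256 | IP     |      3 GB |
| Unum UForm Wiki                   | float32     |        256 | IP     |      1 GB |
| Yandex Text-to-Image Sample       | float32     |        200 | Cos    |      1 GB |
| Microsoft SPACEV                  | int8        |        100 | L2     |     93 GB |
| Microsoft Turing-ANNS             | float32     |        100 | L2     |    373 GB |
| Yandex Deep1B                     | float32     |         96 | L2     |    358 GB |
| Yandex Text-to-Image              | float32     |        200 | Cos    |    750 GB |
| ViT-L/12 LAION                    | float32     |       2048 | Cos    | 2 - 10 TB |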
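The sizes in the table can be sanity-checked from the vector count, dimensionality, and scalar width. The sketch below assumes one billion vectors for the Deep1B, SPACEV, and Turing-ANNS collections (the table does not list counts) and assumes the sizes are expressed in units of 2^30 bytes; the `dataset_size_gib` helper is hypothetical.

```python
SCALAR_BYTES = {"float32": 4, "int8": 1}

def dataset_size_gib(count, dimensions, scalar_type):
    """Raw size of `count` dense vectors in GiB (2**30 bytes), ignoring file headers."""
    return count * dimensions * SCALAR_BYTES[scalar_type] / 2**30

billion = 1_000_000_000  # assumed vector count, not stated in the table
print(round(dataset_size_gib(billion, 96, "float32")))   # Yandex Deep1B: 358
print(round(dataset_size_gib(billion, 100, "int8")))     # Microsoft SPACEV: 93
print(round(dataset_size_gib(billion, 100, "float32")))  # Microsoft Turing-ANNS: 373
```

The same arithmetic gives a quick estimate of how much RAM the raw vectors will need before the index adds its own graph overhead.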

Luckily, smaller samples of those datasets are available.

Unum UForm Wiki#

mkdir -p datasets/wiki_1M/ && \
    wget -nc https://huggingface.co/datasets/unum-cloud/ann-wiki-1m/resolve/main/base.1M.fbin -P datasets/wiki_1M/ &&
    wget -nc https://huggingface.co/datasets/unum-cloud/ann-wiki-1m/resolve/main/query.public.100K.fbin -P datasets/wiki_1M/ &&
    wget -nc https://huggingface.co/datasets/unum-cloud/ann-wiki-1m/resolve/main/groundtruth.public.100K.ibin -P datasets/wiki_1M/

Yandex Text-to-Image#

mkdir -p datasets/t2i_1B/ && \
    wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/T2I/base.1B.fbin -P datasets/t2i_1B/ &&
    wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/T2I/base.1M.fbin -P datasets/t2i_1B/ &&
    wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/T2I/query.public.100K.fbin -P datasets/t2i_1B/ &&
    wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/T2I/groundtruth.public.100K.ibin -P datasets/t2i_1B/

Yandex Deep1B#

mkdir -p datasets/deep_1B/ && \
    wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/DEEP/base.1B.fbin -P datasets/deep_1B/ &&
    wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/DEEP/base.10M.fbin -P datasets/deep_1B/ &&
    wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/DEEP/query.public.10K.fbin -P datasets/deep_1B/ &&
    wget -nc https://storage.yandexcloud.net/yandex-research/ann-datasets/DEEP/groundtruth.public.10K.ibin -P datasets/deep_1B/

Arxiv with E5#

mkdir -p datasets/arxiv_2M/ && \
    wget -nc https://huggingface.co/datasets/unum-cloud/ann-arxiv-2m/resolve/main/abstract.e5-base-v2.fbin -P datasets/arxiv_2M/ &&
    wget -nc https://huggingface.co/datasets/unum-cloud/ann-arxiv-2m/resolve/main/title.e5-base-v2.fbin -P datasets/arxiv_2M/
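Before launching a long benchmark, it can be useful to peek into a downloaded file. The sketch below assumes the .fbin layout used by the BigANN benchmark: two little-endian uint32 values (number of vectors, number of dimensions) followed by row-major float32 data. The `read_fbin` helper is hypothetical, and the example writes a tiny synthetic file so it runs without downloading anything.

```python
import os
import struct
import tempfile

def read_fbin(path, max_rows=None):
    """Read an .fbin file: a (rows, dimensions) uint32 header, then float32 rows."""
    with open(path, "rb") as file:
        rows, dimensions = struct.unpack("<II", file.read(8))
        take = rows if max_rows is None else min(rows, max_rows)
        vectors = []
        for _ in range(take):
            raw = file.read(4 * dimensions)  # one vector, 4 bytes per float32
            vectors.append(struct.unpack(f"<{dimensions}f", raw))
    return rows, dimensions, vectors

# Write a synthetic file with 2 vectors of 3 dimensions, then read the first one back.
path = os.path.join(tempfile.mkdtemp(), "tiny.fbin")
with open(path, "wb") as file:
    file.write(struct.pack("<II", 2, 3))
    file.write(struct.pack("<6f", 1.0, 2.0, 3.0, 4.0, 5.0, 6.0))

rows, dimensions, vectors = read_fbin(path, max_rows=1)
print(rows, dimensions, vectors)  # 2 3 [(1.0, 2.0, 3.0)]
```

Reading only the header and a few rows keeps the check cheap even for the billion-scale files.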

Profiling#

With perf:

# Pass environment variables with `-E`, and `-d` for details
sudo -E perf stat -d build_profile/bench_cpp ...
sudo -E perf mem -d build_profile/bench_cpp ...
# Sample on-CPU functions for the specified command, at 1 Kilo Hertz:
sudo -E perf record -F 1000 build_profile/bench_cpp ...
perf record -d -e arm_spe// -- build_profile/bench_cpp ...

Caches#

sudo perf stat -e 'faults,dTLB-loads,dTLB-load-misses,cache-misses,cache-references' build_profile/bench_cpp ...

Typical output on a 1M vectors dataset is:

       255426      faults
 305988813388      dTLB-loads
   8845723783      dTLB-load-misses          #    2.89% of all dTLB cache accesses
  20094264206      cache-misses              #    6.567 % of all cache refs
 305988812745      cache-references

  8.285148010 seconds time elapsed

500.705967000 seconds user
  1.371118000 seconds sys
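The percentages perf prints follow directly from the raw counters, and the timing lines show how many cores were kept busy on average. The sketch below re-derives them from the sample output above; the variable names are illustrative.

```python
# Counters copied from the sample `perf stat` output above.
counters = {
    "dTLB-loads": 305_988_813_388,
    "dTLB-load-misses": 8_845_723_783,
    "cache-misses": 20_094_264_206,
    "cache-references": 305_988_812_745,
}
elapsed_seconds, user_seconds = 8.285148010, 500.705967000

dtlb_miss_rate = counters["dTLB-load-misses"] / counters["dTLB-loads"]
cache_miss_rate = counters["cache-misses"] / counters["cache-references"]
busy_cores = user_seconds / elapsed_seconds  # CPU time per second of wall-clock time

print(f"dTLB miss rate:     {dtlb_miss_rate:.2%}")   # 2.89%
print(f"cache miss rate:    {cache_miss_rate:.3%}")  # 6.567%
print(f"average busy cores: {busy_cores:.1f}")       # 60.4
```

A busy-core figure well below the number of available cores would suggest the benchmark is not scaling across threads.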

If you notice problems and the stalls are closer to 90%, consider enabling Huge Pages and tuning allocation alignment. To enable Huge Pages:

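# Check how many Huge Pages are currently reserved
sudo cat /proc/sys/vm/nr_hugepages
# Reserve 2048 Huge Pages; `sysctl -w` applies immediately but is not persistent,
# so also add `vm.nr_hugepages=2048` to /etc/sysctl.conf if it must survive the reboot
sudo sysctl -w vm.nr_hugepages=2048
sudo reboot
# Verify the reservation
sudo cat /proc/sys/vm/nr_hugepages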
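To check how much memory a given reservation takes, multiply the page count by the Huge Page size reported in /proc/meminfo. The sketch below parses a sample of that file; the sample values assume a 2048 kB page size, which is common but differs between systems, and the `parse_meminfo` helper is hypothetical.

```python
def parse_meminfo(text):
    """Parse `/proc/meminfo`-style text into a {field name: integer value} dictionary."""
    fields = {}
    for line in text.splitlines():
        name, _, value = line.partition(":")
        parts = value.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])  # sizes are reported in kB
    return fields

# Sample excerpt; the 2048 kB page size is an assumption about the target machine.
sample = """HugePages_Total:    2048
HugePages_Free:     2048
Hugepagesize:       2048 kB"""

fields = parse_meminfo(sample)
reserved_gib = fields["HugePages_Total"] * fields["Hugepagesize"] / 2**20
print(f"{reserved_gib:.1f} GiB reserved")  # 4.0 GiB reserved
```

On a live Linux machine, pass the contents of /proc/meminfo instead of the sample string.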