# Benchmarks

## UForm Model Benchmarks

### Accuracy

#### Embedding Models

Few retrieval benchmarks exist for multimodal embeddings. The most famous ones for English are “MS-COCO” and “Flickr30k”. Evaluating the uform-vl-english model, one can expect the following numbers for search quality; a minimal sketch of the underlying Recall@K computation follows the table.

| Dataset   | Recall @ 1 | Recall @ 5 | Recall @ 10 |
| :-------- | ---------: | ---------: | ----------: |
| Flickr    |      0.727 |      0.915 |       0.949 |
| MS-COCO ¹ |      0.510 |      0.761 |       0.838 |
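
Recall@K here is the fraction of text queries whose matching image appears among the top-K retrieved results. Below is a minimal sketch of that computation with NumPy, assuming you have already produced L2-normalized image and text embeddings aligned by index (one caption per image); it is not tied to any particular UForm API, and real benchmarks like MS-COCO pair each image with five captions, which a full evaluation script would need to handle.

```python
import numpy as np


def recall_at_k(text_embs: np.ndarray, image_embs: np.ndarray, ks=(1, 5, 10)) -> dict:
    """Text-to-image Recall@K for aligned, L2-normalized embedding matrices.

    Assumes `text_embs[i]` describes `image_embs[i]`.
    """
    # Cosine similarity reduces to a dot product for normalized vectors.
    similarities = text_embs @ image_embs.T  # (num_texts, num_images)
    # Rank images for every text query, best match first.
    ranking = np.argsort(-similarities, axis=1)
    targets = np.arange(len(text_embs))[:, None]
    return {k: float((ranking[:, :k] == targets).any(axis=1).mean()) for k in ks}


# Toy usage with random vectors; replace with real encoder outputs.
rng = np.random.default_rng(0)
texts = rng.normal(size=(100, 256)).astype(np.float32)
images = rng.normal(size=(100, 256)).astype(np.float32)
texts /= np.linalg.norm(texts, axis=1, keepdims=True)
images /= np.linalg.norm(images, axis=1, keepdims=True)
print(recall_at_k(texts, images))
```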

For multilingual benchmarks, we’ve created the [unum-cloud/coco-sm](https://github.com/unum-cloud/coco-sm) repository². Evaluating the unum-cloud/uform-vl-multilingual-v2 model, one can expect the following metrics for text-to-image search, compared against the xlm-roberta-base-ViT-B-32 OpenCLIP model.

| Language    | OpenCLIP @ 1 | UForm @ 1 | OpenCLIP @ 5 | UForm @ 5 | OpenCLIP @ 10 | UForm @ 10 | Speakers |
| :---------- | -----------: | --------: | -----------: | --------: | ------------: | ---------: | -------: |
| English 🇺🇸  |         37.8 |      37.7 |         63.5 |      65.0 |          73.5 |       75.9 |  1’452 M |
| Chinese 🇨🇳  |         27.3 |      32.2 |         51.3 |      59.0 |          62.1 |       70.5 |  1’118 M |
| Hindi 🇮🇳    |         20.7 |      31.3 |         42.5 |      57.9 |          53.7 |       69.6 |    602 M |
| Spanish 🇪🇸  |         32.6 |      35.6 |         58.0 |      62.8 |          68.8 |       73.7 |    548 M |
| Arabic 🇸🇦   |         22.7 |      31.7 |         44.9 |      57.8 |          55.8 |       69.2 |    274 M |
| French 🇫🇷   |         31.3 |      35.4 |         56.5 |      62.6 |          67.4 |       73.3 |    274 M |

All languages:

| Language             | OpenCLIP @ 1 | UForm @ 1 | OpenCLIP @ 5 | UForm @ 5 | OpenCLIP @ 10 | UForm @ 10 | Speakers |
| :------------------- | -----------: | --------: | -----------: | --------: | ------------: | ---------: | -------: |
| Arabic 🇸🇦            |         22.7 |      31.7 |         44.9 |      57.8 |          55.8 |       69.2 |    274 M |
| Armenian 🇦🇲          |          5.6 |      22.0 |         14.3 |      44.7 |          20.2 |       56.0 |      4 M |
| Chinese 🇨🇳           |         27.3 |      32.2 |         51.3 |      59.0 |          62.1 |       70.5 |  1’118 M |
| English 🇺🇸           |         37.8 |      37.7 |         63.5 |      65.0 |          73.5 |       75.9 |  1’452 M |
| French 🇫🇷            |         31.3 |      35.4 |         56.5 |      62.6 |          67.4 |       73.3 |    274 M |
| German 🇩🇪            |         31.7 |      35.1 |         56.9 |      62.2 |          67.4 |       73.3 |    134 M |
| Hebrew 🇮🇱            |         23.7 |      26.7 |         46.3 |      51.8 |          57.0 |       63.5 |      9 M |
| Hindi 🇮🇳             |         20.7 |      31.3 |         42.5 |      57.9 |          53.7 |       69.6 |    602 M |
| Indonesian 🇮🇩        |         26.9 |      30.7 |         51.4 |      57.0 |          62.7 |       68.6 |    199 M |
| Italian 🇮🇹           |         31.3 |      34.9 |         56.7 |      62.1 |          67.1 |       73.1 |     67 M |
| Japanese 🇯🇵          |         27.4 |      32.6 |         51.5 |      59.2 |          62.6 |       70.6 |    125 M |
| Korean 🇰🇷            |         24.4 |      31.5 |         48.1 |      57.8 |          59.2 |       69.2 |     81 M |
| Persian 🇮🇷           |         24.0 |      28.8 |         47.0 |      54.6 |          57.8 |       66.2 |     77 M |
| Polish 🇵🇱            |         29.2 |      33.6 |         53.9 |      60.1 |          64.7 |       71.3 |     41 M |
| Portuguese 🇵🇹        |         31.6 |      32.7 |         57.1 |      59.6 |          67.9 |       71.0 |    257 M |
| Russian 🇷🇺           |         29.9 |      33.9 |         54.8 |      60.9 |          65.8 |       72.0 |    258 M |
| Spanish 🇪🇸           |         32.6 |      35.6 |         58.0 |      62.8 |          68.8 |       73.7 |    548 M |
| Thai 🇹🇭              |         21.5 |      28.7 |         43.0 |      54.6 |          53.7 |       66.0 |     61 M |
| Turkish 🇹🇷           |         25.5 |      33.0 |         49.1 |      59.6 |          60.3 |       70.8 |     88 M |
| Ukrainian 🇺🇦         |         26.0 |      30.6 |         49.9 |      56.7 |          60.9 |       68.1 |     41 M |
| Vietnamese 🇻🇳        |         25.4 |      28.3 |         49.2 |      53.9 |          60.3 |       65.5 |     85 M |
| Mean                 |     26.5±6.4 |  31.8±3.5 |     49.8±9.8 |  58.1±4.5 |     60.4±10.6 |   69.4±4.3 |        - |
| Google Translate     |     27.4±6.3 |  31.5±3.5 |     51.1±9.5 |  57.8±4.4 |     61.7±10.3 |   69.1±4.3 |        - |
| Microsoft Translator |     27.2±6.4 |  31.4±3.6 |     50.8±9.8 |  57.7±4.7 |     61.4±10.6 |   68.9±4.6 |        - |
| Meta NLLB            |     24.9±6.7 |  32.4±3.5 |    47.5±10.3 |  58.9±4.5 |     58.2±11.2 |   70.2±4.3 |        - |

#### Generative Models

| Model                | LLM Size |  SQA |    MME | MMBench | Average¹ |
| :------------------- | -------: | ---: | -----: | ------: | -------: |
| UForm-Gen2-Qwen-500m |     0.5B | 45.5 |  880.1 |    42.0 |    29.31 |
| MobileVLM v2         |     1.4B | 52.1 | 1302.8 |    57.7 |    36.81 |
| LLaVA-Phi            |     2.7B | 68.4 | 1335.1 |    59.8 |    42.95 |

For captioning evaluation we measure CLIPScore and RefCLIPScore³.
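
In the standard formulation (Hessel et al., 2021), CLIPScore is the clipped cosine similarity between the CLIP embeddings of an image and a candidate caption, rescaled by a constant w = 2.5, and RefCLIPScore combines it with the best similarity to human reference captions via a harmonic mean. Below is a minimal sketch of one common form of that computation with NumPy, assuming precomputed, L2-normalized embeddings (footnote ³ names the CLIP model used for the table); the exact evaluation script behind these numbers may differ, and the helper names are illustrative.

```python
import numpy as np

W = 2.5  # rescaling constant from the CLIPScore paper (Hessel et al., 2021)


def clip_score(image_emb: np.ndarray, caption_emb: np.ndarray) -> float:
    """CLIPScore: clipped cosine similarity of image and candidate caption, times W."""
    return W * max(float(image_emb @ caption_emb), 0.0)


def ref_clip_score(image_emb: np.ndarray, caption_emb: np.ndarray,
                   reference_embs: np.ndarray) -> float:
    """RefCLIPScore: harmonic mean of CLIPScore and the best reference-caption similarity."""
    candidate = clip_score(image_emb, caption_emb)
    reference = max(float(np.max(reference_embs @ caption_emb)), 0.0)
    if candidate + reference == 0.0:
        return 0.0
    return 2.0 * candidate * reference / (candidate + reference)
```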

| Model                             | Size | Caption Length | CLIPScore | RefCLIPScore |
| :-------------------------------- | ---: | :------------- | --------: | -----------: |
| llava-hf/llava-1.5-7b-hf          |   7B | Long           |     0.878 |        0.529 |
| llava-hf/llava-1.5-7b-hf          |   7B | Short          |     0.886 |        0.531 |
| Salesforce/instructblip-vicuna-7b |   7B | Long           |     0.902 |        0.534 |
| Salesforce/instructblip-vicuna-7b |   7B | Short          |     0.848 |        0.523 |
| unum-cloud/uform-gen              | 1.5B | Long           |     0.847 |        0.523 |
| unum-cloud/uform-gen              | 1.5B | Short          |     0.842 |        0.522 |
| unum-cloud/uform-gen-chat         | 1.5B | Long           |     0.860 |        0.525 |
| unum-cloud/uform-gen-chat         | 1.5B | Short          |     0.858 |        0.525 |

Results for VQAv2 evaluation.

| Model                    | Size | Accuracy |
| :----------------------- | ---: | -------: |
| llava-hf/llava-1.5-7b-hf |   7B |     78.5 |
| unum-cloud/uform-gen     | 1.5B |     66.5 |


¹ The train split was present in the model’s training data.
² Lacking a broad enough evaluation dataset, we translated the COCO Karpathy test split with multiple public and proprietary translation services, averaged the scores across all translation sets, and broke them down in the bottom section of the table above.
³ We used the apple/DFN5B-CLIP-ViT-H-14-378 CLIP model.

### Speed

#### Embedding Models

UForm comes pre-packaged with speed benchmarks for the models.

```sh
$ python python/scripts/bench_encoders.py --help
usage: bench_encoders.py [-h] [--filter-out FILTER_OUT] [--batch-size BATCH_SIZE]

options:
  -h, --help            show this help message and exit
  --filter-out FILTER_OUT
                        Filter out models, backends, or devices with a Regular Expression.
  --batch-size BATCH_SIZE
                        Batch size for the benchmark. Batch size 1 measures latency. Large batch sizes may not fit on every GPU.
```

Running that script for a fairly small batch size of 50 on an Nvidia H100 GPU, one can expect the following throughput numbers:

| Model Name                                     | Device | Backend | Images Preprocessed/s | Images Encoded/s | Texts Preprocessed/s | Texts Encoded/s |
| :--------------------------------------------- | :----- | :------ | --------------------: | ---------------: | -------------------: | --------------: |
| unum-cloud/uform3-image-text-english-base      | cpu    | torch   |                 23.03 |            76.57 |            15,978.03 |          562.28 |
| unum-cloud/uform3-image-text-english-base      | cpu    | onnx    |                 23.11 |            77.75 |            13,880.27 |        1,067.40 |
| unum-cloud/uform3-image-text-english-base      | cuda   | torch   |                 22.87 |         1,060.40 |            12,348.94 |       13,242.83 |
| unum-cloud/uform3-image-text-english-large     | cpu    | torch   |                 22.41 |            10.84 |            13,350.45 |          145.12 |
| unum-cloud/uform3-image-text-english-large     | cpu    | onnx    |                 23.13 |            19.60 |            18,031.85 |          960.09 |
| unum-cloud/uform3-image-text-english-large     | cuda   | torch   |                 22.78 |           244.86 |            13,226.40 |       10,204.04 |
| unum-cloud/uform3-image-text-english-small     | cpu    | torch   |                 20.08 |            71.68 |            12,147.05 |          249.63 |
| unum-cloud/uform3-image-text-english-small     | cpu    | onnx    |                 22.84 |           195.27 |            13,636.99 |        1,385.25 |
| unum-cloud/uform3-image-text-english-small     | cuda   | torch   |                 22.63 |         2,662.16 |            14,731.18 |       14,694.87 |
| unum-cloud/uform3-image-text-multilingual-base | cpu    | torch   |                 22.98 |            64.28 |            10,129.27 |          209.76 |
| unum-cloud/uform3-image-text-multilingual-base | cpu    | onnx    |                 23.06 |            66.81 |             8,963.13 |        1,104.32 |
| unum-cloud/uform3-image-text-multilingual-base | cuda   | torch   |                 22.88 |         1,051.95 |            15,639.72 |       12,416.12 |

If you are interested in performance numbers on consumer-grade hardware, compared to third-party models, here are some rough estimates for text encoding on an Nvidia RTX 3090:

| Model                                          | Multilingual | Speed                   | Speedup |
| :--------------------------------------------- | :----------- | :---------------------- | ------: |
| bert-base-uncased                              | No           | 1’612 sequences/second  |         |
| distilbert-base-uncased                        | No           | 3’174 sequences/second  |  x 1.96 |
| sentence-transformers/all-MiniLM-L12-v2        | Yes          | 3’604 sequences/second  |  x 2.24 |
| unum-cloud/uform3-image-text-multilingual-base | Yes          | 6’809 sequences/second  |  x 4.22 |

Given the model’s small size, it also works well on mobile devices. On Apple M2 Arm chips, the energy efficiency of inference can exceed that of the RTX 3090 GPU and other Ampere-generation cards.

| Device                 | Speed               | Device TDP | Efficiency        |
| :--------------------- | :------------------ | ---------: | ----------------: |
| Nvidia RTX 3090        | ~ 140 tokens/second |     < 350W | 0.40 tokens/joule |
| Apple M2 Pro unplugged | ~ 19 tokens/second  |      < 20W | 0.95 tokens/joule |
| Apple M2 Max unplugged | ~ 38 tokens/second  |      < 36W | 1.06 tokens/joule |
| Apple M2 Max plugged   | ~ 56 tokens/second  |      < 89W | 0.63 tokens/joule |
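
The efficiency column is consistent with dividing the measured throughput by the device’s power cap, e.g. 140 tokens/s ÷ 350 W ≈ 0.40 tokens/joule; since the TDP figures are upper bounds, real efficiency is likely somewhat higher. A quick check, reproducing the column from the table’s own numbers:

```python
# Tokens per joule = (tokens / second) / (joules / second) = throughput / power cap.
measurements = {
    "Nvidia RTX 3090": (140, 350),
    "Apple M2 Pro unplugged": (19, 20),
    "Apple M2 Max unplugged": (38, 36),
    "Apple M2 Max plugged": (56, 89),
}
for device, (tokens_per_second, watts) in measurements.items():
    print(f"{device}: {tokens_per_second / watts:.2f} tokens/joule")
```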

#### Generative Models

```sh
$ python python/scripts/bench_decoders.py --help
usage: bench_decoders.py [-h] [--filter-out FILTER_OUT] [--batch-size BATCH_SIZE]

options:
  -h, --help            show this help message and exit
  --batch-size BATCH_SIZE
                        Batch size for the benchmark. Batch size 1 measures latency. Large batch sizes may not fit on every GPU.
  --max-length MAX_LENGTH
                        Maximum length of the generated text in tokens.
```

On an Nvidia H100 GPU, the following performance is expected for text token generation using float16, equivalent PyTorch settings, and greedy decoding.

| Model                             | Size  | Decoding Speed  | Decoding Parallel Streams   |
| :-------------------------------- | ----: | --------------: | --------------------------: |
| llava-hf/llava-1.5-7b-hf          |   7 B |  ~ 141 tokens/s |  ~ 4 K tokens/s (32 streams)  |
| Salesforce/instructblip-vicuna-7b |   7 B |  ~ 211 tokens/s |  ~ 2 K tokens/s (32 streams)  |
| unum-cloud/uform-gen              | 1.5 B |  ~ 252 tokens/s |  ~ 3 K tokens/s (128 streams) |
| unum-cloud/uform-gen2-dpo         | 1.2 B |  ~ 372 tokens/s |  ~ 10 K tokens/s (64 streams) |

On an Nvidia RTX 3090, the following performance is expected for text token generation using float16, equivalent PyTorch settings, and greedy decoding.

| Model                             | Size  | Decoding Speed | Speedup |
| :-------------------------------- | ----: | -------------: | ------: |
| llava-hf/llava-1.5-7b-hf          |   7 B |  ~ 40 tokens/s |         |
| Salesforce/instructblip-vicuna-7b |   7 B |  ~ 40 tokens/s |         |
| unum-cloud/uform-gen              | 1.5 B | ~ 140 tokens/s |   x 3.5 |