---
{}
---

# Model Card for `ebind-full`

![ebind](https://cdn-uploads.huggingface.co/production/uploads/62a0da842e30aaf94ebaaa12/EohI585GsKe5cFlvWyWAX.png)

<div style="display: flex; justify-content: space-between;">
  <div style="flex: 1; padding: 10px;">
    <!-- <a href="todohttps://arxiv.org/abs/YYMM.NNNNN" target="_blank" rel="noreferrer" style="text-decoration:none; ">
      <img src="https://img.shields.io/badge/arXiv-YYMM.NNNNN-b31b1b.svg?logo=arxiv" alt="arXiv Paper" style="vertical-align:middle;">
    </a> -->
    <a href="https://colab.research.google.com/github/encord-team/ebind/blob/main/misc/demo.ipynb" target="_blank" rel="noreferrer" style="text-decoration:none; ">
      <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" style="vertical-align:middle;">
    </a>
    <a href="https://huggingface.co/encord-team/ebind-full" target="_blank" rel="noreferrer" style="text-decoration:none; ">
      <img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue" alt="Hugging Face Models" style="vertical-align:middle;">
    </a>
    <a href="https://huggingface.co/datasets/encord-team/E-MM1-100M" target="_blank" rel="noreferrer" style="text-decoration:none; ">
      <img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Datasets-blue" alt="Hugging Face Datasets" style="vertical-align:middle;">
    </a>
    <a href="https://e-mm1.github.io" target="_blank" rel="noreferrer" style="text-decoration:none; ">
      <img src="https://img.shields.io/badge/Project%20Page-blue?logo=github" alt="Project Page" style="vertical-align:middle;">
    </a>
    <div style="flex:1"></div>
    <a href="https://encord.com/blog/how-we-built-multimodal-dataset-emm1/" target="_blank" rel="noreferrer" style="text-decoration:none; ">
      <img src="https://img.shields.io/badge/%F0%9F%93%96-Blog-blue" alt="Blog" style="vertical-align:middle;">
    </a>
    <a href="https://twitter.com/encord_team" target="_blank" rel="noreferrer" style="text-decoration:none; ">
      <img alt="Twitter Follow" src="https://img.shields.io/twitter/follow/encord_team?label=%40encord_team&amp;style=social" style="vertical-align: middle">
    </a>
    <img alt="PRs Welcome" src="https://img.shields.io/badge/PRs-Welcome-blue" style="vertical-align: middle;">
    <img alt="Licence" src="https://img.shields.io/github/license/encord-team/ebind" style="vertical-align: middle;">
  </div>
</div>

# EBind: Multi-Modal Embeddings

## Model Details

### Model Description

EBind is a multi-modal embedding model that supports image, video, audio, text, and 3D point cloud inputs. All modalities are projected into a shared embedding space, enabling cross-modal similarity computations.
The model builds on top of three other models: [Perception Encoder](https://huggingface.co/facebook/PE-Core-L14-336), [ImageBind](https://huggingface.co/nielsr/imagebind-huge), and [Uni3D](https://github.com/baaivision/Uni3D).
As indicated by the figure at the top, data is first embedded individually by these three models.
Audio and 3D point cloud embeddings are subsequently projected with an MLP into the embedding space of the Perception Encoder.
The model produces unit-norm embeddings that are directly usable for similarity comparisons via dot products (cosine similarity).

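Because the embeddings are unit-norm, a dot product between two embeddings equals their cosine similarity. A model-independent sketch with random stand-in tensors (not actual EBind outputs) illustrating why the two are interchangeable:

```python
import torch
import torch.nn.functional as F

# stand-ins for unit-norm embeddings from two modalities
a = F.normalize(torch.randn(4, 1024), dim=-1)
b = F.normalize(torch.randn(4, 1024), dim=-1)

dot = a @ b.T  # plain pairwise dot products, shape (4, 4)
cos = F.cosine_similarity(a[:, None], b[None, :], dim=-1)  # explicit cosine similarity
assert torch.allclose(dot, cos, atol=1e-6)  # identical because every row has norm 1
```
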
This version loads all encoders.
If you do not need all modalities, please refer to the [audio-vision](https://huggingface.co/encord-team/ebind-audio-vision) and [3D-points-vision](https://huggingface.co/encord-team/ebind-points-vision) models, which load only those encoders.

- **Developed by:** The Encord ML Team.
- **Model type:** Multimodal embedding model.
- **License:** The model is published under the [CC-BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode.txt) license.

### Model Sources

- **Repository:** [GitHub](https://github.com/encord-team/ebind)
- **Project Page:** [e-mm1.github.io](https://e-mm1.github.io)
- **Paper:** Coming soon.
- **Demo:** [Explore the embedding space](https://data.encord.com)

## Uses

### Direct Use

The model is intended to be used with direct file inputs of the supported modalities: image, video, audio, 3D point clouds, and text. It produces a 1024-dimensional embedding per input, suited for similarity computations.

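For illustration, a minimal single-image sketch using the loading and processing API documented under "How to Get Started with the Model" below. It assumes the processor accepts a subset of modalities, and the file name is a placeholder:

```python
import torch
from ebind import EBindModel, EBindProcessor

model = EBindModel.from_pretrained("encord-team/ebind-full").eval()
processor = EBindProcessor.from_pretrained("encord-team/ebind-full")

with torch.inference_mode():
    batch = processor({"image": ["examples/dog.png"]}, return_tensors="pt")
    outputs = model.forward(**batch)

# expected: one unit-norm, 1024-dimensional embedding per input, i.e. shape (1, 1024)
print(outputs["image"].shape)
```
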
### Downstream Use

The model could be used to build multimodal LLMs, generative models, and systems that perceive their surroundings via visual, audio, and point cloud embeddings.

## Bias, Risks, and Limitations

The model was built on data specified in the paper.
As such, it will be biased towards data that "lives on the internet."
For specific use-cases, a subsequent fine-tuning stage may be necessary.

## How to Get Started with the Model

**Option 1**
If you want to work within the repository, use [`uv`](https://docs.astral.sh/uv/) to install the necessary dependencies.

```bash
git clone https://github.com/encord-team/ebind
cd ebind
uv sync
```

**Option 2**
You can also install it as an external dependency for another project:

```bash
# Option 2.a
python -m pip install git+https://github.com/encord-team/ebind
# Option 2.b; or install a local, editable version
git clone https://github.com/encord-team/ebind
cd /path/to/your/project
python -m pip install -e /path/to/ebind
```

> [!WARNING]
> If you are running a project with `pytorch~=2.8.0`, you should install `torchcodec~=0.7.0` (as opposed to `~=0.8.0`,
> which is automatically installed with uv). `torchcodec~=0.8.*` matches `pytorch~=2.9.0`.

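For example, pinning the matching torchcodec release in an existing PyTorch 2.8 environment could look like this (a sketch using pip; adapt to your package manager):

```bash
# keep torchcodec on the 0.7 series to match pytorch~=2.8.0
python -m pip install "torchcodec~=0.7.0"
```
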
> [!NOTE]
> The 3D point cloud backbone has a few custom CUDA kernels that you might want to [compile](#compile-pointnet2-cuda-ops-optional).
> To do that, you will have to use Option 1 or Option 2.b above to get a local copy of the repository and compile the kernels.

### Loading the Model

```python
import torch
from ebind import EBindModel, EBindProcessor

model = EBindModel.from_pretrained("encord-team/ebind-full")
processor = EBindProcessor.from_pretrained("encord-team/ebind-full")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device).eval()
processor = processor.to(device)
```

### Processing Multi-Modal Inputs

```python
inputs = {
    "image": ["examples/dog.png", "examples/cat.png"],
    "video": ["examples/dog.mp4", "examples/cat.mp4"],
    "audio": ["examples/dog.mp4", "examples/cat.mp4"],
    "text": ["A dog is howling in the street", "A cat is sleeping on the couch"],
    "points": ["examples/dog_point_cloud.npy", "examples/cat_point_cloud.npy"],
}

with torch.inference_mode():
    batch = processor(inputs, return_tensors="pt")  # set text_file_paths=True if passing text file paths instead of strings
    outputs = model.forward(**batch)
```

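If your text prompts are stored in files rather than passed as strings, the `text_file_paths` flag mentioned in the comment above can be used. A minimal sketch (the `.txt` file names are placeholders, and it assumes the processor accepts a subset of modalities):

```python
# hypothetical files containing the captions used above
text_inputs = {"text": ["examples/dog_caption.txt", "examples/cat_caption.txt"]}

with torch.inference_mode():
    text_batch = processor(text_inputs, return_tensors="pt", text_file_paths=True)
    text_outputs = model.forward(**text_batch)
```
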
### Computing Cross-Modal Similarities

```python
import numpy as np

np.set_printoptions(precision=2)  # round printed similarities to two decimals, as in the output below

keys = list(outputs.keys())
for i, modality in enumerate(keys):
    for modality2 in keys[i + 1:]:
        result = outputs[modality] @ outputs[modality2].T
        print(f"{modality} x {modality2} similarity:")
        print(result.cpu().detach().numpy())
        print("=" * 26)
```

Expected Output:

```
image x video similarity:
[[0.48 0.42]
 [0.41 0.6 ]]
==========================
image x audio similarity:
[[0.07 0.05]
 [0.02 0.12]]
==========================
image x text similarity:
[[0.16 0.07]
 [0.08 0.14]]
==========================
image x points similarity:
[[0.2  0.19]
 [0.18 0.19]]
==========================
video x audio similarity:
[[0.19 0.08]
 [0.03 0.16]]
==========================
video x text similarity:
[[0.26 0.05]
 [0.11 0.14]]
==========================
video x points similarity:
[[0.24 0.15]
 [0.17 0.26]]
==========================
audio x text similarity:
[[ 0.12 -0.  ]
 [ 0.07  0.09]]
==========================
audio x points similarity:
[[0.13 0.06]
 [0.1  0.12]]
==========================
text x points similarity:
[[0.19 0.14]
 [0.05 0.18]]
==========================
```

**Note:** The image/video similarity is significantly higher because the two modalities share the same vision encoder.

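These similarity matrices can be used directly for cross-modal retrieval by taking, for each query, the candidate with the highest score. An illustrative sketch built on the `outputs` above:

```python
# text -> image retrieval: for each text query, pick the best-matching image
sim = (outputs["text"] @ outputs["image"].T).detach().cpu()
best_image = sim.argmax(dim=1)
print(best_image)  # with the example similarities above: tensor([0, 1]), i.e. dog text -> dog image, cat text -> cat image
```
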
### Compile PointNet2 CUDA ops (optional)

If you have CUDA available, consider building the [PointNet2](https://github.com/erikwijmans/Pointnet2_PyTorch/tree/master/pointnet2_ops_lib/pointnet2_ops/_ext-src) custom ops used for embedding point clouds to get faster inference:

```bash
cd src/ebind/models/uni3d/pointnet2_ops && \
uv run python -c "import torch,sys; sys.exit(0 if torch.cuda.is_available() else 1)" && \
MAX_JOBS=$(nproc) uv run python setup.py build_ext --inplace
```

> We have modified the code slightly in `src/ebind/models/uni3d/pointnet2_ops/pointnet2_utils.py` to
> provide a fallback torch implementation so that the model can also run on hardware without a GPU.

## Evaluation

We have evaluated the model on multiple benchmarks.
We highlight that EBind performs nearly on par with models 4 and 17 times larger.

![Summary plot](./summary.jpg)
**Figure 1:** An average of the 13 benchmarks presented in the two tables below, plotted against model size.

![Table 1: Retrieval benchmarks](./table-1.png)
![Table 2: Zero-shot benchmarks](./table-2.png)

## Citation

**BibTeX:** Coming soon.