---
{}
---

# Model Card for `ebind-full`

![ebind](https://cdn-uploads.huggingface.co/production/uploads/62a0da842e30aaf94ebaaa12/EohI585GsKe5cFlvWyWAX.png)

<div style="display: flex; justify-content: space-between;">
  <div style="flex: 1; padding: 10px;">
    <!-- <a href="todohttps://arxiv.org/abs/YYMM.NNNNN" target="_blank" rel="noreferrer" style="text-decoration:none; ">
      <img src="https://img.shields.io/badge/arXiv-YYMM.NNNNN-b31b1b.svg?logo=arxiv" alt="arXiv Paper" style="vertical-align:middle;">
    </a> -->
    <a href="https://colab.research.google.com/github/encord-team/ebind/blob/main/misc/demo.ipynb" target="_blank" rel="noreferrer" style="text-decoration:none; ">
      <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" style="vertical-align:middle;">
    </a>
    <a href="https://huggingface.co/encord-team/ebind-full" target="_blank" rel="noreferrer" style="text-decoration:none; ">
      <img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue" alt="Hugging Face Models" style="vertical-align:middle;">
    </a>
    <a href="https://huggingface.co/datasets/encord-team/E-MM1-100M" target="_blank" rel="noreferrer" style="text-decoration:none; ">
      <img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Datasets-blue" alt="Hugging Face Datasets" style="vertical-align:middle;">
    </a>
    <a href="https://e-mm1.github.io" target="_blank" rel="noreferrer" style="text-decoration:none; ">
      <img src="https://img.shields.io/badge/Project%20Page-blue?logo=github" alt="Project Page" style="vertical-align:middle;">
    </a>
    <div style="flex:1"></div>
    <a href="https://encord.com/blog/how-we-built-multimodal-dataset-emm1/" target="_blank" rel="noreferrer" style="text-decoration:none; ">
      <img src="https://img.shields.io/badge/%F0%9F%93%96-Blog-blue" alt="Blog" style="vertical-align:middle;">
    </a>
    <a href="https://twitter.com/encord_team" target="_blank" rel="noreferrer" style="text-decoration:none; ">
      <img alt="Twitter Follow" src="https://img.shields.io/twitter/follow/encord_team?label=%40encord_team&amp;style=social" style="vertical-align: middle">
    </a>
    <img alt="PRs Welcome" src="https://img.shields.io/badge/PRs-Welcome-blue" style="vertical-align: middle;">
    <img alt="Licence" src="https://img.shields.io/github/license/encord-team/ebind" style="vertical-align: middle;">
  </div>
</div>

# EBind: Multi-Modal Embeddings

## Model Details

### Model Description

EBind is a multi-modal embedding model that supports image, video, audio, text, and 3D point cloud inputs. All modalities are projected into a shared embedding space, enabling cross-modal similarity computations.
The model builds on top of three other models: [Perception Encoder](https://huggingface.co/facebook/PE-Core-L14-336), [ImageBind](https://huggingface.co/nielsr/imagebind-huge), and [Uni3D](https://github.com/baaivision/Uni3D).
As indicated by the figure at the top, data is first embedded individually by these three models.
Audio and 3D point cloud embeddings are subsequently projected with an MLP into the embedding space of the Perception Encoder.
The model produces unit-norm embeddings that are directly usable for similarity comparisons via dot products (cosine similarity).

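Because the embeddings are unit-norm, a dot product between two embeddings equals their cosine similarity. A model-independent sketch with random stand-in tensors (not actual EBind outputs) illustrating why the two are interchangeable:

```python
import torch
import torch.nn.functional as F

# stand-ins for unit-norm embeddings from two modalities
a = F.normalize(torch.randn(4, 1024), dim=-1)
b = F.normalize(torch.randn(4, 1024), dim=-1)

dot = a @ b.T  # plain pairwise dot products, shape (4, 4)
cos = F.cosine_similarity(a[:, None], b[None, :], dim=-1)  # explicit cosine similarity
assert torch.allclose(dot, cos, atol=1e-6)  # identical because every row has norm 1
```
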
This version loads all encoders.
If you do not need all modalities, please refer to the [audio-vision](https://huggingface.co/encord-team/ebind-audio-vision) and [3D-points-vision](https://huggingface.co/encord-team/ebind-points-vision) models, which load only those encoders.

- **Developed by:** The Encord ML Team.
- **Model type:** Multimodal embedding model.
- **License:** The model is published under the [CC-BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode.txt) license.

### Model Sources

- **Repository:** [GitHub](https://github.com/encord-team/ebind)
- **Project Page:** [e-mm1.github.io](https://e-mm1.github.io)
- **Paper:** Coming soon.
- **Demo:** [Explore the embedding space](https://data.encord.com)

## Uses

### Direct Use

The model is intended to be used with direct file inputs of the supported modalities: image, video, audio, 3D point clouds, and text. It produces a 1024-dimensional embedding per input, suited for similarity computations.

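For illustration, a minimal single-image sketch using the loading and processing API documented under "How to Get Started with the Model" below. It assumes the processor accepts a subset of modalities, and the file name is a placeholder:

```python
import torch
from ebind import EBindModel, EBindProcessor

model = EBindModel.from_pretrained("encord-team/ebind-full").eval()
processor = EBindProcessor.from_pretrained("encord-team/ebind-full")

with torch.inference_mode():
    batch = processor({"image": ["examples/dog.png"]}, return_tensors="pt")
    outputs = model.forward(**batch)

# expected: one unit-norm, 1024-dimensional embedding per input, i.e. shape (1, 1024)
print(outputs["image"].shape)
```
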
### Downstream Use

The model could be used to build multimodal LLMs, generative models, and systems that perceive their surroundings via visual, audio, and point cloud embeddings.

## Bias, Risks, and Limitations

The model was built on data specified in the paper.
As such, it will be biased towards data that "lives on the internet."
For specific use-cases, a subsequent fine-tuning stage may be necessary.

## How to Get Started with the Model

**Option 1**
If you want to work within the repository, use [`uv`](https://docs.astral.sh/uv/) to install the necessary dependencies.

```bash
git clone https://github.com/encord-team/ebind
cd ebind
uv sync
```

**Option 2**
You can also install it as an external dependency for another project:

```bash
# Option 2.a
python -m pip install git+https://github.com/encord-team/ebind
# Option 2.b; or install a local, editable version
git clone https://github.com/encord-team/ebind
cd /path/to/your/project
python -m pip install -e /path/to/ebind
```

> [!WARNING]
> If you are running a project with `pytorch~=2.8.0`, you should install `torchcodec~=0.7.0` (as opposed to `~=0.8.0`,
> which is automatically installed with uv). `torchcodec~=0.8.*` matches `pytorch~=2.9.0`.

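For example, pinning the matching torchcodec release in an existing PyTorch 2.8 environment could look like this (a sketch using pip; adapt to your package manager):

```bash
# keep torchcodec on the 0.7 series to match pytorch~=2.8.0
python -m pip install "torchcodec~=0.7.0"
```
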
> [!NOTE]
> The 3D point cloud backbone has a few custom CUDA kernels that you might want to [compile](#compile-pointnet2-cuda-ops-optional).
> To do that, you will have to use Option 1 or Option 2.b above to get a local copy of the repository and compile the kernels.

### Loading the Model

```python
import torch
from ebind import EBindModel, EBindProcessor

model = EBindModel.from_pretrained("encord-team/ebind-full")
processor = EBindProcessor.from_pretrained("encord-team/ebind-full")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device).eval()
processor = processor.to(device)
```

### Processing Multi-Modal Inputs

```python
inputs = {
    "image": ["examples/dog.png", "examples/cat.png"],
    "video": ["examples/dog.mp4", "examples/cat.mp4"],
    "audio": ["examples/dog.mp4", "examples/cat.mp4"],
    "text": ["A dog is howling in the street", "A cat is sleeping on the couch"],
    "points": ["examples/dog_point_cloud.npy", "examples/cat_point_cloud.npy"],
}

with torch.inference_mode():
    batch = processor(inputs, return_tensors="pt")  # set text_file_paths=True if passing text file paths instead of strings
    outputs = model.forward(**batch)
```

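If your text prompts are stored in files rather than passed as strings, the `text_file_paths` flag mentioned in the comment above can be used. A minimal sketch (the `.txt` file names are placeholders, and it assumes the processor accepts a subset of modalities):

```python
# hypothetical files containing the captions used above
text_inputs = {"text": ["examples/dog_caption.txt", "examples/cat_caption.txt"]}

with torch.inference_mode():
    text_batch = processor(text_inputs, return_tensors="pt", text_file_paths=True)
    text_outputs = model.forward(**text_batch)
```
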
### Computing Cross-Modal Similarities

```python
import numpy as np

np.set_printoptions(precision=2)  # round printed similarities to two decimals, as in the output below

keys = list(outputs.keys())
for i, modality in enumerate(keys):
    for modality2 in keys[i + 1:]:
        result = outputs[modality] @ outputs[modality2].T
        print(f"{modality} x {modality2} similarity:")
        print(result.cpu().detach().numpy())
        print("=" * 26)
```

Expected Output:

```
image x video similarity:
[[0.48 0.42]
 [0.41 0.6 ]]
==========================
image x audio similarity:
[[0.07 0.05]
 [0.02 0.12]]
==========================
image x text similarity:
[[0.16 0.07]
 [0.08 0.14]]
==========================
image x points similarity:
[[0.2  0.19]
 [0.18 0.19]]
==========================
video x audio similarity:
[[0.19 0.08]
 [0.03 0.16]]
==========================
video x text similarity:
[[0.26 0.05]
 [0.11 0.14]]
==========================
video x points similarity:
[[0.24 0.15]
 [0.17 0.26]]
==========================
audio x text similarity:
[[ 0.12 -0.  ]
 [ 0.07  0.09]]
==========================
audio x points similarity:
[[0.13 0.06]
 [0.1  0.12]]
==========================
text x points similarity:
[[0.19 0.14]
 [0.05 0.18]]
==========================
```

**Note:** The image/video similarity is significantly higher because the two modalities share the same vision encoder.

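These similarity matrices can be used directly for cross-modal retrieval by taking, for each query, the candidate with the highest score. An illustrative sketch built on the `outputs` above:

```python
# text -> image retrieval: for each text query, pick the best-matching image
sim = (outputs["text"] @ outputs["image"].T).detach().cpu()
best_image = sim.argmax(dim=1)
print(best_image)  # with the example similarities above: tensor([0, 1]), i.e. dog text -> dog image, cat text -> cat image
```
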
### Compile PointNet2 CUDA ops (optional)

If you have CUDA available, consider building the [PointNet2](https://github.com/erikwijmans/Pointnet2_PyTorch/tree/master/pointnet2_ops_lib/pointnet2_ops/_ext-src) custom ops used for embedding point clouds to get faster inference:

```bash
cd src/ebind/models/uni3d/pointnet2_ops && \
uv run python -c "import torch,sys; sys.exit(0 if torch.cuda.is_available() else 1)" && \
MAX_JOBS=$(nproc) uv run python setup.py build_ext --inplace
```

> We have modified the code slightly in `src/ebind/models/uni3d/pointnet2_ops/pointnet2_utils.py` to
> provide a fallback torch implementation so that the model can also run on hardware without a GPU.

## Evaluation

We have evaluated the model on multiple benchmarks.
We highlight that EBind performs nearly on par with models 4 and 17 times larger.

![Summary plot](./summary.jpg)
**Figure 1:** An average of the 13 benchmarks presented in the two tables below, plotted against model size.

![Table 1: Retrieval benchmarks](./table-1.png)
![Table 2: Zero-shot benchmarks](./table-2.png)

## Citation

**BibTeX:** Coming soon.