---
license: apache-2.0
---
[📚 Paper](https://arxiv.org/abs/2503.19740) - [🤖 GitHub](https://github.com/visurg-ai/LEMON)

We provide the models used in the data curation pipeline of [📚 LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings](https://arxiv.org/abs/2503.19740), which we used to construct the LEMON dataset. For more details about the LEMON dataset and our LemonFM foundation model, please visit our GitHub repository at [🤖 GitHub](https://github.com/visurg-ai/LEMON).
If you use our dataset, model, or code in your research, please cite our paper:
```
@misc{che2025lemonlargeendoscopicmonocular,
  title={LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings},
  author={Chengan Che and Chao Wang and Tom Vercauteren and Sophia Tsoka and Luis C. Garcia-Peraza-Herrera},
  year={2025},
  eprint={2503.19740},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.19740},
}
```
This Hugging Face repository includes video storyboard classification models, frame classification models, and non-surgical object detection models. The model loader can be found at [model_loader.py](https://huggingface.co/visurg/Surg3M_curation_models/blob/main/model_loader.py).
| Model | Architecture | Download |
|---|---|---|
| Video storyboard classification models | ResNet-18 | Full ckpt |
| Frame classification models | ResNet-18 | Full ckpt |
| Non-surgical object detection models | YOLOv8-Nano | Full ckpt |
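The checkpoints can be downloaded directly from this repository, for example with `huggingface_hub`. The filename below is illustrative, not the actual checkpoint name; check the repository file listing for the exact names:

```python
from huggingface_hub import hf_hub_download

# Download a checkpoint from this repository.
# The filename is a hypothetical example; see the repository file
# listing for the real checkpoint filenames.
ckpt_path = hf_hub_download(
    repo_id='visurg/Surg3M_curation_models',
    filename='video_storyboard_classifier.pth',
)
```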
The data curation pipeline that produces the clean videos in the LEMON dataset uses these models as described below.
Usage
--------
**Video storyboard classification models** are employed in step **2** of the data curation pipeline to classify a video storyboard as either surgical or non-surgical. Their usage is as follows:
```python
import torch
import torchvision
from PIL import Image

from model_loader import build_model

# Build the classification model architecture
net = build_model(mode='classify')
model_path = 'path/to/video_storyboard_classification_checkpoint.pth'

# Enable multi-GPU support
net = torch.nn.DataParallel(net)
torch.backends.cudnn.benchmark = True

# Load the checkpoint weights and move the model to the GPU
state = torch.load(model_path, map_location=torch.device('cpu'))
net.load_state_dict(state['net'])
net = net.to('cuda')
net.eval()

# Load the video storyboard and convert it to a normalized PyTorch tensor
img_path = 'path/to/your/image.jpg'
img = Image.open(img_path).convert('RGB')
img = img.resize((224, 224))
transform = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize(
        (0.4299694, 0.29676908, 0.27707579),
        (0.24373249, 0.20208984, 0.19319402)
    )
])
img_tensor = transform(img).unsqueeze(0).to('cuda')

# Classify the storyboard (surgical vs. non-surgical)
with torch.no_grad():
    outputs = net(img_tensor)
```
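The model returns class logits. Assuming the standard two-class (surgical / non-surgical) head, you can convert them to probabilities with a softmax; the class index order is an assumption, so verify it against the released training code:

```python
import torch.nn.functional as F

probs = F.softmax(outputs, dim=1)  # shape: (1, num_classes)
pred = probs.argmax(dim=1).item()  # predicted class index (order is an assumption)
```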
**Frame classification models** are used in step **3** of the data curation pipeline to classify a frame as either surgical or non-surgical. Their usage is as follows:
```python
import torch
import torchvision
from PIL import Image

from model_loader import build_model

# Build the classification model architecture
net = build_model(mode='classify')
model_path = 'path/to/frame_classification_checkpoint.pth'

# Enable multi-GPU support
net = torch.nn.DataParallel(net)
torch.backends.cudnn.benchmark = True

# Load the checkpoint weights and move the model to the GPU
state = torch.load(model_path, map_location=torch.device('cpu'))
net.load_state_dict(state['net'])
net = net.to('cuda')
net.eval()

# Load the frame and convert it to a normalized PyTorch tensor
img_path = 'path/to/your/image.jpg'
img = Image.open(img_path).convert('RGB')
img = img.resize((224, 224))
transform = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize(
        (0.4299694, 0.29676908, 0.27707579),
        (0.24373249, 0.20208984, 0.19319402)
    )
])
img_tensor = transform(img).unsqueeze(0).to('cuda')

# Classify the frame (surgical vs. non-surgical)
with torch.no_grad():
    outputs = net(img_tensor)
```
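Since frame classification runs over many frames per video, it is usually worth batching. A minimal sketch, assuming `frames` is a list of PIL images and `transform` is the Compose defined above:

```python
# Stack several preprocessed frames into one batch and classify them together.
batch = torch.stack([transform(f.resize((224, 224))) for f in frames]).to('cuda')
with torch.no_grad():
    outputs = net(batch)  # one row of logits per frame
```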
**Non-surgical object detection models** are used to obliterate non-surgical regions in surgical frames (e.g., user interface overlays). Their usage is as follows:
```python
import torch
import torchvision
from PIL import Image

from model_loader import build_model

# Build the detection model architecture
net = build_model(mode='mask')
model_path = 'path/to/non_surgical_object_detection_checkpoint.pth'

# Enable multi-GPU support
net = torch.nn.DataParallel(net)
torch.backends.cudnn.benchmark = True

# Load the checkpoint weights and move the model to the GPU
state = torch.load(model_path, map_location=torch.device('cpu'))
net.load_state_dict(state['net'])
net = net.to('cuda')
net.eval()

# Load the frame and convert it to a normalized PyTorch tensor
img_path = 'path/to/your/image.jpg'
img = Image.open(img_path).convert('RGB')
img = img.resize((224, 224))
transform = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize(
        (0.4299694, 0.29676908, 0.27707579),
        (0.24373249, 0.20208984, 0.19319402)
    )
])
img_tensor = transform(img).unsqueeze(0).to('cuda')

# Detect non-surgical objects in the frame
with torch.no_grad():
    outputs = net(img_tensor)
```
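The detector returns candidate regions of non-surgical content. A minimal sketch of obliterating those regions by zeroing the pixels inside each box, assuming `boxes` is an (N, 4) array of `(x1, y1, x2, y2)` pixel coordinates extracted from `outputs` (the exact output format depends on the YOLOv8 head, so adapt the unpacking accordingly):

```python
import numpy as np

# Hypothetical post-processing: `boxes` holds (x1, y1, x2, y2) pixel
# coordinates derived from `outputs`; adapt to the real output format.
masked = np.array(img)
for x1, y1, x2, y2 in boxes.astype(int):
    masked[y1:y2, x1:x2] = 0  # black out the non-surgical region
masked_img = Image.fromarray(masked)
```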