Paper GitHub License: MIT

🦠 MicrobELP: Microbiome Entity Recognition and Normalisation

MicrobELP is a deep learning model for Microbiome Entity Recognition and Normalisation, identifying microbial entities (bacteria, archaea, fungi) in biomedical and scientific text. It is part of the microbELP toolkit and has been optimised for CPU and GPU inference.

This model enables automated extraction of microbiome names from unstructured text, facilitating microbiome-related text mining and literature curation.

We also provide a Named Entity Normalisation model on Hugging Face:

Hugging Face Models


🚀 Quick Start (Hugging Face)

You can directly load and run the model with the Hugging Face transformers pipeline:

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("omicsNLP/microbELP_NER")
model = AutoModelForTokenClassification.from_pretrained("omicsNLP/microbELP_NER")

# Build a token-level NER pipeline
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

example = "The first microbiome I learned about is called Helicobacter pylori."
ner_results = nlp(example)

# One dict per token: label, confidence score, and character offsets
print(ner_results)

Output:

[
 {'entity': 'LABEL_0', 'score': 0.9954, 'index': 1, 'word': 'the', 'start': 0, 'end': 3},
 ...
 {'entity': 'LABEL_1', 'score': 0.9889, 'index': 11, 'word': 'he', 'start': 47, 'end': 49},
 {'entity': 'LABEL_2', 'score': 0.9710, 'index': 16, 'word': 'p', 'start': 60, 'end': 61},
 ...
]

where:

  • LABEL_0 → Outside (O)
  • LABEL_1 → Begin-microbiome (B-microbiome)
  • LABEL_2 → Inside-microbiome (I-microbiome)
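The token-level labels above can be merged into whole-entity spans by hand. The following is a minimal sketch, assuming only the dictionary keys shown in the pipeline output (`entity`, `start`, `end`). `merge_token_labels` is a hypothetical helper, not part of transformers or microbELP, and the tokens in the usage example are hand-written for illustration rather than real model output.

```python
# Mapping from the model's generic label ids to BIO tags, as listed above.
LABEL_NAMES = {
    "LABEL_0": "O",
    "LABEL_1": "B-microbiome",
    "LABEL_2": "I-microbiome",
}


def merge_token_labels(text, tokens):
    """Merge token-level BIO predictions into entity spans via character offsets."""
    spans = []
    current = None  # [start, end] of the entity currently being built
    for tok in tokens:
        tag = LABEL_NAMES.get(tok["entity"], "O")
        if tag == "B-microbiome":
            # A new entity begins; close any entity that is still open.
            if current is not None:
                spans.append(current)
            current = [tok["start"], tok["end"]]
        elif tag == "I-microbiome" and current is not None:
            # Extend the open entity to the end of this token.
            current[1] = tok["end"]
        else:
            # Outside token (or a stray I- tag): close the open entity, if any.
            if current is not None:
                spans.append(current)
            current = None
    if current is not None:
        spans.append(current)
    return [
        {"Entity": text[s:e], "locations": {"offset": s, "length": e - s}}
        for s, e in spans
    ]


text = "The first microbiome I learned about is called Helicobacter pylori."
# Hypothetical tokens for illustration; real output has one entry per wordpiece.
tokens = [
    {"entity": "LABEL_0", "start": 40, "end": 46},
    {"entity": "LABEL_1", "start": 47, "end": 49},
    {"entity": "LABEL_2", "start": 49, "end": 59},
    {"entity": "LABEL_2", "start": 60, "end": 61},
    {"entity": "LABEL_2", "start": 61, "end": 66},
    {"entity": "LABEL_0", "start": 66, "end": 67},
]
print(merge_token_labels(text, tokens))
# → [{'Entity': 'Helicobacter pylori', 'locations': {'offset': 47, 'length': 19}}]
```

Working from character offsets rather than the `word` field avoids having to rejoin lower-cased wordpieces, and the entity text is sliced from the original input with its casing intact.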

🧩 Integration with the microbELP Python Package

If you prefer a high-level interface with automatic aggregation, postprocessing, and text-location mapping, you can use the microbELP package directly.

Installation:

git clone https://github.com/omicsNLP/microbELP.git
pip install ./microbELP

We recommend installing the package in an isolated environment (for example, a virtual environment) to avoid conflicts with the dependencies of other projects.

Example Usage

from microbELP import microbiome_DL_ner

input_text = "The first microbiome I learned about is called Helicobacter pylori."
print(microbiome_DL_ner(input_text))

Output:

[{'Entity': 'Helicobacter pylori', 'locations': {'offset': 47, 'length': 19}}]

You can also process a list of texts for batch inference:

input_list = [
    "The first microbiome I learned about is called Helicobacter pylori.",
    "Then I learned about Eubacterium rectale."
]
print(microbiome_DL_ner(input_list))

Output:

[
  [{'Entity': 'Helicobacter pylori', 'locations': {'offset': 47, 'length': 19}}],
  [{'Entity': 'Eubacterium rectale', 'locations': {'offset': 21, 'length': 19}}]
]

Each element in the output corresponds to one input text, containing recognised microbiome entities and their text locations.

microbiome_DL_ner takes one optional parameter, cpu (bool, default False). With the default, inference runs on a GPU if one is available. To force CPU inference, call microbiome_DL_ner(input_list, cpu=True).
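The offset and length fields in the batch output shown earlier can be checked against the input texts, since each result list lines up with the input at the same position. The sketch below is a minimal illustration, assuming only the output layout shown above. `check_locations` is a hypothetical helper and not part of microbELP, and `results` is copied from the example output rather than produced by running the model.

```python
def check_locations(texts, results):
    """Return (text_index, entity) pairs whose offset/length do not match the text."""
    mismatches = []
    # Results are positional: results[i] holds the entities found in texts[i].
    for i, (text, entities) in enumerate(zip(texts, results)):
        for ent in entities:
            start = ent["locations"]["offset"]
            end = start + ent["locations"]["length"]
            if text[start:end] != ent["Entity"]:
                mismatches.append((i, ent["Entity"]))
    return mismatches


input_list = [
    "The first microbiome I learned about is called Helicobacter pylori.",
    "Then I learned about Eubacterium rectale.",
]
# Copied from the example output above (not generated here).
results = [
    [{"Entity": "Helicobacter pylori", "locations": {"offset": 47, "length": 19}}],
    [{"Entity": "Eubacterium rectale", "locations": {"offset": 21, "length": 19}}],
]
print(check_locations(input_list, results))  # → []
```

An empty list means every reported span slices back to exactly the entity string, which is a cheap sanity check before passing the locations to downstream annotation tools.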


📘 Model Details

Find below some more information about this model.

| Property | Description |
|---|---|
| Task | Named Entity Recognition (NER) |
| Domain | Microbiome / Biomedical Text Mining |
| Entity Type | microbiome |
| Model Type | Transformer-based token classification |
| Framework | Hugging Face 🤗 Transformers |
| Optimised for | GPU inference |

📚 Citation

If you find this repository useful, please consider giving a like ❤️ and a citation 📝:

@article {Patel2025.08.29.671515,
    author = {Patel, Dhylan and Lain, Antoine D. and Vijayaraghavan, Avish and Mirzaei, Nazanin Faghih and Mweetwa, Monica N. and Wang, Meiqi and Beck, Tim and Posma, Joram M.},
    title = {Microbial Named Entity Recognition and Normalisation for AI-assisted Literature Review and Meta-Analysis},
    elocation-id = {2025.08.29.671515},
    year = {2025},
    doi = {10.1101/2025.08.29.671515},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2025/08/30/2025.08.29.671515},
    eprint = {https://www.biorxiv.org/content/early/2025/08/30/2025.08.29.671515.full.pdf},
    journal = {bioRxiv}
}

🔗 Resources

Find below some more resources associated with this model.

| Property | Description |
|---|---|
| GitHub | Project |
| Paper | DOI:10.1101/2021.01.08.425887 |
| Data | DOI |
| Codiet | CoDiet |

⚙️ License

This model and code are released under the MIT License.
