# Keyphrase Extraction with BERT (Fine-Tuned on `midas/inspec`)
This repository contains a complete pipeline to **fine-tune BERT** for **keyphrase extraction** using the [`midas/inspec`](https://huggingface.co/datasets/midas/inspec) dataset. The model performs sequence labeling with BIO tags to extract meaningful phrases from scientific text.
---
## Features
- ✅ Preprocessed dataset with BIO-tagged tokens
- ✅ Fine-tuning BERT (`bert-base-cased`) using Hugging Face Transformers
- ✅ Token-label alignment
- ✅ Evaluation using `seqeval` metrics (Precision, Recall, F1)
- ✅ Inference pipeline to extract keyphrases
- ✅ CUDA-enabled for GPU acceleration
---
---
## Dataset
**Source:** [`midas/inspec`](https://huggingface.co/datasets/midas/inspec)
- Fields:
  - `document`: list of tokenized words (already split)
  - `doc_bio_tags`: BIO-format labels for keyphrases
- Splits:
  - `train`: 1000 samples
  - `validation`: 500 samples
  - `test`: 500 samples
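A minimal way to load and inspect the data (assuming the `extraction` configuration, the MIDAS config that carries `doc_bio_tags`):
```python
from datasets import load_dataset

# The "extraction" config pairs each document with its BIO keyphrase tags
dataset = load_dataset("midas/inspec", "extraction")

example = dataset["train"][0]
print(example["document"][:8])      # first eight pre-tokenized words
print(example["doc_bio_tags"][:8])  # their aligned B/I/O labels
```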
---
## Setup & Installation
```bash
git clone https://github.com/your-username/keyphrase-bert-inspec
cd keyphrase-bert-inspec
pip install -r requirements.txt
```
### `requirements.txt`
```text
torch
datasets
transformers
evaluate
seqeval
```
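Training uses the GPU automatically when one is visible (see Features). A quick sanity check, using the PyTorch install from the requirements above:
```python
import torch

# The Hugging Face Trainer moves the model to CUDA when this prints True
print(torch.cuda.is_available())
```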
---
## Training
```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification,
                          TrainingArguments, Trainer)
```
1. Load and preprocess the data with aligned BIO labels (see the alignment sketch below)
2. Fine-tune `bert-base-cased` on the dataset
3. Evaluate and save the model artifacts
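Step 1 hinges on aligning the word-level BIO tags with BERT's subword tokens. A minimal sketch following the standard Hugging Face token-classification recipe (the `label2id` mapping and `-100` masking are assumptions, not the repository's exact code):
```python
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
label2id = {"O": 0, "B": 1, "I": 2}  # assumed ids for the doc_bio_tags values

def tokenize_and_align_labels(batch):
    # Tokenize pre-split words; BERT may break one word into several subwords
    tokenized = tokenizer(batch["document"], truncation=True,
                          is_split_into_words=True)
    all_labels = []
    for i, tags in enumerate(batch["doc_bio_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        labels, prev = [], None
        for wid in word_ids:
            if wid is None:          # special tokens ([CLS], [SEP])
                labels.append(-100)  # -100 is ignored by the loss
            elif wid != prev:        # first subword keeps the word's tag
                labels.append(label2id[tags[wid]])
            else:                    # later subwords are masked out
                labels.append(-100)
            prev = wid
        all_labels.append(labels)
    tokenized["labels"] = all_labels
    return tokenized

tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True)
```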
### Training Script Overview
```python
# model, training_args, data_collator, and compute_metrics are defined
# below; tokenized_datasets comes from the alignment step above.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()
trainer.save_model("keyphrase-bert-inspec")
```
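The pieces the `Trainer` references, sketched under the same assumptions (hyperparameters are illustrative, not the repository's exact values; `seqeval` accepts the bare B/I/O scheme and treats all keyphrases as a single entity type):
```python
import numpy as np
import evaluate

id2label = {i: tag for tag, i in label2id.items()}

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=3, id2label=id2label, label2id=label2id
)
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
seqeval = evaluate.load("seqeval")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # Drop the -100 positions and map ids back to tag strings for seqeval
    true_preds = [[id2label[p] for p, l in zip(ps, ls) if l != -100]
                  for ps, ls in zip(preds, labels)]
    true_labels = [[id2label[l] for p, l in zip(ps, ls) if l != -100]
                   for ps, ls in zip(preds, labels)]
    scores = seqeval.compute(predictions=true_preds, references=true_labels)
    return {"precision": scores["overall_precision"],
            "recall": scores["overall_recall"],
            "f1": scores["overall_f1"],
            "accuracy": scores["overall_accuracy"]}

training_args = TrainingArguments(
    output_dir="keyphrase-bert-inspec",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    eval_strategy="epoch",  # named evaluation_strategy on older transformers
)
```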
---
## Evaluation Metrics
Representative scores reported by `compute_metrics`:
```python
{
    "precision": 0.84,
    "recall": 0.81,
    "f1": 0.825,
    "accuracy": 0.88
}
```
---
## Inference Example
```python
from transformers import pipeline

ner_pipeline = pipeline(
    "ner",
    model="keyphrase-bert-inspec",
    tokenizer="keyphrase-bert-inspec",
    aggregation_strategy="simple",
)

text = "Information-based semantics is a theory in the philosophy of mind."
results = ner_pipeline(text)

print("🟢 Extracted Keyphrases:")
for r in results:
    print(f"- {r['word']} (score: {r['score']:.2f})")
```
### Sample Output
```
🟢 Extracted Keyphrases:
- Information-based semantics (score: 0.94)
- philosophy of mind (score: 0.91)
```
---
## Model Artifacts
After training, the model and tokenizer are saved as:
```
keyphrase-bert-inspec/
├── config.json
├── pytorch_model.bin
├── tokenizer_config.json
└── vocab.txt
```
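To reload the saved artifacts directly (without the `pipeline` helper):
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Restore the fine-tuned model and tokenizer from the directory above
model = AutoModelForTokenClassification.from_pretrained("keyphrase-bert-inspec")
tokenizer = AutoTokenizer.from_pretrained("keyphrase-bert-inspec")
```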
---
## Future Improvements
- Add postprocessing to group fragmented tokens (see the sketch below)
- Use a larger keyphrase dataset such as [`midas/kp20k`](https://huggingface.co/datasets/midas/kp20k)
- Convert to a web app using Gradio or Streamlit
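A minimal sketch of the first idea: merging pipeline results whose character spans (almost) touch. The `merge_adjacent` helper and `max_gap` threshold are illustrative assumptions, not existing repository code:
```python
def merge_adjacent(text, entities, max_gap=1):
    """Merge entities whose character spans nearly touch."""
    merged = []
    for ent in entities:
        if merged and ent["start"] - merged[-1]["end"] <= max_gap:
            prev = merged[-1]
            prev["word"] = text[prev["start"]:ent["end"]]  # re-slice source text
            prev["end"] = ent["end"]
            prev["score"] = min(prev["score"], ent["score"])  # keep weakest score
        else:
            merged.append(dict(ent))
    return merged

keyphrases = merge_adjacent(text, ner_pipeline(text))
```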
---
## Author
**Your Name**
GitHub: [@your-username](https://github.com/your-username)
Contact: your.email@example.com
---
## License
MIT License. See `LICENSE` file.