---
license: apache-2.0
datasets:
  - honicky/hdfs-logs-encoded-blocks
  - Kingslayer5437/BGL
language:
  - en
metrics:
  - f1
  - precision
  - recall
  - roc_auc
base_model:
  - distilbert/distilbert-base-uncased
pipeline_tag: text-classification
library_name: transformers
tags:
  - log-analysis
  - anomaly-detection
  - bert
  - huggingface
model-index:
  - name: CloudOpsBERT (distributed-storage)
    results:
      - task:
          type: text-classification
          name: Anomaly Detection
        dataset:
          name: HDFS
          type: honicky/hdfs-logs-encoded-blocks
          split: test
        metrics:
          - type: f1
            value: 0.571
          - type: precision
            value: 0.992
          - type: recall
            value: 0.401
          - type: auroc
            value: 0.73
          - type: threshold
            value: 0.5
  - name: CloudOpsBERT (HPC)
    results:
      - task:
          type: text-classification
          name: Anomaly Detection
        dataset:
          name: BGL
          type: Kingslayer5437/BGL
          split: test
        metrics:
          - type: f1
            value: 1.00
          - type: precision
            value: 1.00
          - type: recall
            value: 1.00
          - type: auroc
            value: 1.00
          - type: threshold
            value: 0.05
---
|
|
|
|
# CloudOpsBERT: Domain-Specific Language Models for Cloud Operations

CloudOpsBERT is an open-source project exploring **domain-adapted transformer models** for **cloud operations log analysis**: anomaly detection, reliability monitoring, and cost optimization.

This project fine-tunes lightweight BERT variants (e.g., DistilBERT) on large-scale system log datasets (HDFS, BGL) and provides ready-to-use models for researchers and practitioners.

---
|
|
|
|
|
## Motivation

Modern cloud platforms generate massive volumes of logs. Detecting anomalies in these logs is crucial for:

- Ensuring **reliability** (catching failures early),
- Improving **cost efficiency** (identifying waste or misconfigurations),
- Supporting **autonomous operations** (AIOps).

Generic LLMs and BERT models are not optimized for this domain. CloudOpsBERT bridges that gap by:

- Training on **real log datasets** (HDFS, BGL),
- Addressing **imbalanced anomaly detection** with class weighting,
- Publishing **open-source checkpoints** for reproducibility.
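The class-weighting idea can be sketched with inverse-frequency weights. This is a minimal illustration of one common formula, `w_c = N / (K * n_c)`, not the project's exact training code:

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: w_c = N / (K * n_c), where N is the
    total number of examples, K the number of classes, and n_c the
    count of class c."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

# HDFS-style imbalance: anomalies are rare
labels = ["normal"] * 97 + ["anomaly"] * 3
weights = class_weights(labels)
# the rare "anomaly" class receives a much larger weight, so
# misclassifying anomalies costs more during training
```

In a PyTorch setup, weights like these would typically be passed to `torch.nn.CrossEntropyLoss(weight=...)`.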
|
|
|
|
|
--- |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
## Inference (Pretrained)

Predict the anomaly probability for a single log line:

```bash
python src/predict.py \
  --model_dir vaibhav2507/cloudops-bert \
  --subfolder distributed-storage \
  --text "ERROR dfs.DataNode: Lost connection to namenode"
```

Batch inference (file with one log line per row):

```bash
python src/predict.py \
  --model_dir vaibhav2507/cloudops-bert \
  --subfolder distributed-storage \
  --file samples/sample_logs.txt \
  --threshold 0.5 \
  --jsonl_out predictions.jsonl
```
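The JSONL output can then be post-processed with the standard library. A minimal sketch, assuming each record carries `text`, `score`, and `label` fields (the exact field names depend on predict.py and are assumptions here):

```python
import json

# hypothetical records, shaped like predict.py's JSONL output
jsonl_text = "\n".join([
    json.dumps({"text": "INFO dfs.DataNode: block served",
                "score": 0.02, "label": "normal"}),
    json.dumps({"text": "ERROR dfs.DataNode: Lost connection to namenode",
                "score": 0.91, "label": "anomaly"}),
])

# keep only records whose anomaly score clears the threshold
threshold = 0.5
anomalies = [
    record for record in map(json.loads, jsonl_text.splitlines())
    if record["score"] >= threshold
]
```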
|
|
|
|
|
## Results

- **HDFS (in-domain, test set)**
  - F1: 0.571
  - Precision: 0.992
  - Recall: 0.401
  - AUROC: 0.730
  - Threshold: 0.50 (tunable)
- **Cross-domain (HDFS → BGL)**
  - Performance degrades significantly due to dataset/domain shift (see paper).
- **BGL (training in progress)**
  - Will be released as cloudops-bert (subfolder bgl) once full training is complete.
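Because the decision threshold is tunable, precision and recall can be traded off by sweeping it over held-out scores. A self-contained sketch with made-up numbers (not the reported HDFS results):

```python
def precision_recall(scores, labels, threshold):
    """Compute precision/recall, treating label 1 as 'anomaly'."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# hypothetical anomaly scores and ground-truth labels
scores = [0.1, 0.4, 0.6, 0.9, 0.3]
labels = [0,   0,   1,   1,   1]

# a high threshold favors precision, a low one favors recall
high = precision_recall(scores, labels, 0.5)   # (1.0, 2/3)
low  = precision_recall(scores, labels, 0.25)  # (0.75, 1.0)
```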
|
|
|
|
|
## Models

- vaibhav2507/cloudops-bert (Hugging Face Hub)
  - subfolder="distributed-storage" → HDFS-trained CloudOpsBERT
  - subfolder="hpc" → BGL-trained CloudOpsBERT
- Each export includes:
  - Model weights (pytorch_model.bin)
  - Config with label mappings (normal, anomaly)
  - Tokenizer files
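The subfolder layout means the exports can also be loaded directly with transformers. A sketch, not the project's official API: it requires network access to the Hub, and assumes the config's label mapping puts "anomaly" at index 1 if the `label2id` lookup misses:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

repo = "vaibhav2507/cloudops-bert"
tok = AutoTokenizer.from_pretrained(repo, subfolder="distributed-storage")
model = AutoModelForSequenceClassification.from_pretrained(
    repo, subfolder="distributed-storage"
)

# score a single log line
inputs = tok("ERROR dfs.DataNode: Lost connection to namenode",
             return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)[0]
anomaly_prob = probs[model.config.label2id.get("anomaly", 1)].item()
```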
|
|
|
|
|
## Quickstart (Scripts)

1) Set up folders

```bash
bash scripts/setup_dirs.sh
```

2) (Optional) Download a local copy of a submodel from the Hugging Face Hub

```bash
bash scripts/fetch_pretrained.sh                # downloads 'hdfs' by default
SUBFOLDER=bgl bash scripts/fetch_pretrained.sh  # downloads 'bgl'
```

3) Single-line prediction (directly from the Hugging Face Hub)

```bash
bash scripts/predict_line.sh "ERROR dfs.DataNode: Lost connection to namenode" hdfs
```

4) Batch prediction (using a local model folder)

```bash
bash scripts/make_sample_logs.sh
bash scripts/predict_file.sh samples/sample_logs.txt hdfs models/cloudops-bert-hdfs preds/preds_hdfs.jsonl
```
|
|
|
|
|
## Related Work

Several prior works have explored using BERT for log anomaly detection:

- **Leveraging BERT and Hugging Face Transformers for Log Anomaly Detection**
  - Tutorial-style blog post demonstrating how to fine-tune BERT on log data with Hugging Face. Useful as an introduction, but not intended as a reproducible research artifact.
- **LogBERT** (HelenGuohx/logbert)
  - Academic prototype from ~2019–2020 focusing on modeling log sequences with BERT. Demonstrates feasibility, but is limited to in-domain experiments and lacks integration with modern Hugging Face tooling.
- **AnomalyBERT** (Jhryu30/AnomalyBERT)
  - Another exploratory repository showing BERT-based anomaly detection on logs, with dataset-specific preprocessing. Similar limitations in generalization and reproducibility.
|
|
|
|
|
## How CloudOpsBERT is different

- **Domain-specific adaptation:** explicitly trained on cloud operations logs (HDFS, BGL) with a class-weighted loss.
- **Cross-domain evaluation:** includes in-domain and cross-domain benchmarks, highlighting generalization challenges.
- **Reproducibility & usability:** clean repo, scripts, and ready-to-use Hugging Face exports.
- **Future directions:** introduces MicroLM, compressed micro-language models for efficient edge/cloud hybrid inference.

In short: previous work showed that "BERT can work for logs." CloudOpsBERT operationalizes this idea into reproducible benchmarks, public models, and deployable tools for both researchers and practitioners.
|
|
|
|
|
## Citation

If you use CloudOpsBERT in your research or tools, please cite:

```bibtex
@misc{pandey2025cloudopsbert,
  title={CloudOpsBERT: Domain-Specific Transformer Models for Cloud Operations Anomaly Detection},
  author={Pandey, Vaibhav},
  year={2025},
  howpublished={GitHub, Hugging Face},
  url={https://github.com/vaibhav-research/cloudops-bert}
}
```
|
|
|