feat: add mean_seq_mha pooling option
Add the MeanSeq+MHA pooling option from the updated MaxPoolBERT paper
- README.md +3 -1
- configuration_modchembert.py +25 -4
- modeling_modchembert.py +11 -6
README.md CHANGED

@@ -1,5 +1,6 @@
 ---
 license: apache-2.0
+base_model: Derify/ModChemBERT-IR-BASE
 library_name: transformers
 tags:
 - modernbert
@@ -64,13 +65,14 @@ This base model includes configurable pooling strategies for downstream fine-tuning
 - `max_cls`: Max over last k layers of [CLS]
 - `cls_mha`: MHA with [CLS] as query
 - `max_seq_mha`: MHA with max pooled sequence as KV and max pooled [CLS] as query
+- `mean_seq_mha`: MHA with mean pooled sequence as KV and mean pooled [CLS] as query
 - `sum_mean`: Sum over all layers then mean tokens
 - `sum_sum`: Sum over all layers then sum tokens
 - `mean_mean`: Mean over all layers then mean tokens
 - `mean_sum`: Mean over all layers then sum tokens
 - `max_seq_mean`: Max over last k layers then mean tokens
 
-Note: ModChemBERT's `max_seq_mha` differs from MaxPoolBERT [3]. MaxPoolBERT uses PyTorch `nn.MultiheadAttention`, whereas ModChemBERT's `ModChemBertPoolingAttention` adapts ModernBERT's `ModernBertAttention`.
+Note: ModChemBERT's `cls_mha`, `max_seq_mha`, and `mean_seq_mha` differ from MaxPoolBERT [3]. MaxPoolBERT uses PyTorch `nn.MultiheadAttention`, whereas ModChemBERT's `ModChemBertPoolingAttention` adapts ModernBERT's `ModernBertAttention`.
 On ChemBERTa-3 benchmarks this variant produced stronger validation metrics and avoided the training instabilities (sporadic zero / NaN losses and gradient norms) seen with `nn.MultiheadAttention`. Training instability with ModernBERT has been reported in the past ([discussion 1](https://huggingface.co/answerdotai/ModernBERT-base/discussions/59) and [discussion 2](https://huggingface.co/answerdotai/ModernBERT-base/discussions/63)).
 
 ## Intended Use
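For context, the new option is selected the same way as the existing pooling strategies when fine-tuning. A minimal sketch, assuming the repo ships its custom classes for `trust_remote_code` loading and using the `Derify/ModChemBERT-IR-BASE` id that the README metadata above adds as `base_model`; the SMILES input and label count are only illustrative:

```python
# Hypothetical fine-tuning setup selecting the new mean_seq_mha pooling option.
# Assumes the checkpoint exposes ModChemBERT's custom config/model via trust_remote_code.
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

repo_id = "Derify/ModChemBERT-IR-BASE"  # taken from the base_model field added above

config = AutoConfig.from_pretrained(
    repo_id,
    trust_remote_code=True,
    classifier_pooling="mean_seq_mha",         # option added in this commit
    classifier_pooling_last_k=8,               # how many final layers to mean-pool
    classifier_pooling_num_attention_heads=4,  # heads for the pooling attention
    num_labels=2,
)
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    repo_id, config=config, trust_remote_code=True
)

inputs = tokenizer("CCO", return_tensors="pt")  # an example SMILES string (ethanol)
logits = model(**inputs).logits
```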
configuration_modchembert.py CHANGED

@@ -37,14 +37,15 @@ class ModChemBertConfig(ModernBertConfig):
             - "max_cls": Element-wise max pooling over last k hidden states, then take CLS token
             - "cls_mha": Multi-head attention with CLS token as query and full sequence as keys/values
             - "max_seq_mha": Max pooling over last k states + multi-head attention with CLS as query
+            - "mean_seq_mha": Mean pooling over last k states + multi-head attention with CLS as query
             - "max_seq_mean": Max pooling over last k hidden states, then mean pooling over sequence
             Defaults to "sum_mean".
         classifier_pooling_num_attention_heads (int, optional): Number of attention heads for multi-head attention
-            pooling strategies (cls_mha, max_seq_mha). Defaults to 4.
+            pooling strategies (cls_mha, max_seq_mha, mean_seq_mha). Defaults to 4.
         classifier_pooling_attention_dropout (float, optional): Dropout probability for multi-head attention
-            pooling strategies (cls_mha, max_seq_mha). Defaults to 0.0.
-        classifier_pooling_last_k (int, optional): Number of last hidden layers to use for max pooling
-            strategies (max_cls, max_seq_mha, max_seq_mean). Defaults to 8.
+            pooling strategies (cls_mha, max_seq_mha, mean_seq_mha). Defaults to 0.0.
+        classifier_pooling_last_k (int, optional): Number of last hidden layers to use for max/mean pooling
+            strategies (max_cls, max_seq_mha, mean_seq_mha, max_seq_mean). Defaults to 8.
         *args: Variable length argument list passed to ModernBertConfig.
         **kwargs: Arbitrary keyword arguments passed to ModernBertConfig.
 
@@ -68,6 +69,7 @@ class ModChemBertConfig(ModernBertConfig):
             "max_cls",
             "cls_mha",
             "max_seq_mha",
+            "mean_seq_mha",
             "max_seq_mean",
         ] = "max_seq_mha",
         classifier_pooling_num_attention_heads: int = 4,
@@ -75,6 +77,25 @@ class ModChemBertConfig(ModernBertConfig):
         classifier_pooling_last_k: int = 8,
         **kwargs,
     ):
+        valid_classifier_pooling_options = [
+            "cls",
+            "mean",
+            "sum_mean",
+            "sum_sum",
+            "mean_mean",
+            "mean_sum",
+            "max_cls",
+            "cls_mha",
+            "max_seq_mha",
+            "mean_seq_mha",
+            "max_seq_mean",
+        ]
+        if classifier_pooling not in valid_classifier_pooling_options:
+            raise ValueError(
+                f"Invalid value for `classifier_pooling`, should be one of {valid_classifier_pooling_options}, "
+                f"but is {classifier_pooling}."
+            )
+
         # Pass classifier_pooling="cls" to circumvent ValueError in ModernBertConfig init
         super().__init__(*args, classifier_pooling="cls", **kwargs)
         # Override with custom value
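A short sketch of how the new validation behaves, assuming `ModChemBertConfig` is importable from the repo's `configuration_modchembert.py` (e.g. with the file downloaded locally); the `"max_mha"` value is a deliberately invalid example:

```python
# Sketch of the classifier_pooling validation added in __init__ above.
from configuration_modchembert import ModChemBertConfig

cfg = ModChemBertConfig(classifier_pooling="mean_seq_mha")  # accepted: in the valid list
print(cfg.classifier_pooling)  # -> "mean_seq_mha"

try:
    ModChemBertConfig(classifier_pooling="max_mha")  # hypothetical typo, not in the list
except ValueError as err:
    # "Invalid value for `classifier_pooling`, should be one of [...], but is max_mha."
    print(err)
```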
modeling_modchembert.py CHANGED

@@ -19,9 +19,9 @@
 # Modifications include:
 # - Additional classifier_pooling options for ModChemBertForSequenceClassification
 # - sum_mean, sum_sum, mean_sum, mean_mean: from ChemLM (utilizes all hidden states)
-# - max_cls, cls_mha, max_seq_mha: from MaxPoolBERT (utilizes last k hidden states)
+# - max_cls, cls_mha, max_seq_mha, mean_seq_mha: from MaxPoolBERT (utilizes last k hidden states)
 # - max_seq_mean: a merge between sum_mean and max_cls (utilizes last k hidden states)
-# - Addition of ModChemBertPoolingAttention for cls_mha and max_seq_mha pooling options
+# - Addition of ModChemBertPoolingAttention for cls_mha, max_seq_mha, and mean_seq_mha pooling options
 
 import copy
 import math
@@ -499,7 +499,7 @@ class ModChemBertForSequenceClassification(InitWeightsMixin, ModernBertPreTrainedModel):
         self.config = config
 
         self.model = ModernBertModel(config)
-        if self.config.classifier_pooling in {"cls_mha", "max_seq_mha"}:
+        if self.config.classifier_pooling in {"cls_mha", "max_seq_mha", "mean_seq_mha"}:
             self.pooling_attn = ModChemBertPoolingAttention(config=self.config)
         else:
             self.pooling_attn = None
@@ -649,6 +649,7 @@ def _pool_modchembert_output(
         - max_cls: Element-wise max pooling over the last k hidden states, then take CLS token
         - cls_mha: Multi-head attention with CLS token as query and full sequence as keys/values
         - max_seq_mha: Max pooling over last k states + multi-head attention with CLS as query
+        - mean_seq_mha: Mean pooling over last k states + multi-head attention with CLS as query
         - max_seq_mean: Max pooling over last k hidden states, then mean pooling over sequence
         - sum_mean: Sum all hidden states across layers, then mean pool over sequence
         - sum_sum: Sum all hidden states across layers, then sum pool over sequence
@@ -665,7 +666,7 @@ def _pool_modchembert_output(
         torch.Tensor: Pooled representation of shape (batch_size, hidden_size)
 
     Note:
-        Some pooling strategies (cls_mha, max_seq_mha) require the module to have a pooling_attn
+        Some pooling strategies (cls_mha, max_seq_mha, mean_seq_mha) require the module to have a pooling_attn
         attribute containing a ModChemBertPoolingAttention instance.
     """
     config = typing.cast(ModChemBertConfig, module.config)
@@ -689,10 +690,13 @@ def _pool_modchembert_output(
             q=q, kv=last_hidden_state, attention_mask=attention_mask
         )  # (batch, seq_len, hidden)
         last_hidden_state = torch.mean(attn_out, dim=1)
-    elif config.classifier_pooling == "max_seq_mha":
+    elif config.classifier_pooling in {"max_seq_mha", "mean_seq_mha"}:
         k_hidden_states = hidden_states[-config.classifier_pooling_last_k :]
         theta = torch.stack(k_hidden_states, dim=1)  # (batch, k, seq_len, hidden)
-        pooled_seq = torch.max(theta, dim=1).values  # Element-wise max over k -> (batch, seq_len, hidden)
+        if config.classifier_pooling == "max_seq_mha":
+            pooled_seq = torch.max(theta, dim=1).values  # Element-wise max over k -> (batch, seq_len, hidden)
+        else:
+            pooled_seq = torch.mean(theta, dim=1)  # Element-wise mean over k -> (batch, seq_len, hidden)
         # Query is pooled CLS token (position 0); Keys/Values are pooled sequence
         q = pooled_seq[:, 0, :].unsqueeze(1)  # (batch, 1, hidden)
         q = q.expand(-1, pooled_seq.shape[1], -1)  # (batch, seq_len, hidden)
@@ -729,6 +733,7 @@ def _pool_modchembert_output(
 
 
 __all__ = [
+    "ModChemBertModel",
     "ModChemBertForMaskedLM",
     "ModChemBertForSequenceClassification",
 ]
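To illustrate the shapes in the new `mean_seq_mha` path, here is a standalone sketch with plain tensors. It substitutes `torch.nn.functional.scaled_dot_product_attention` for `ModChemBertPoolingAttention` (which in the actual code adapts `ModernBertAttention`), so it mirrors only the pooling arithmetic, not the real multi-head attention module; the tensor sizes are arbitrary examples.

```python
# Standalone sketch of the mean_seq_mha pooling arithmetic added above.
import torch
import torch.nn.functional as F

batch, seq_len, hidden, k = 2, 16, 768, 8
# hidden_states as returned with output_hidden_states=True: one tensor per layer
hidden_states = [torch.randn(batch, seq_len, hidden) for _ in range(23)]

theta = torch.stack(hidden_states[-k:], dim=1)  # (batch, k, seq_len, hidden)
pooled_seq = torch.mean(theta, dim=1)           # mean over last k layers -> (batch, seq_len, hidden)

q = pooled_seq[:, 0, :].unsqueeze(1)            # pooled [CLS] as query -> (batch, 1, hidden)
q = q.expand(-1, seq_len, -1)                   # broadcast query over sequence positions

# Stand-in for the pooling attention: query attends over the mean-pooled sequence as keys/values
attn_out = F.scaled_dot_product_attention(q, pooled_seq, pooled_seq)  # (batch, seq_len, hidden)
pooled = attn_out.mean(dim=1)                   # final mean over tokens -> (batch, hidden)
print(pooled.shape)  # torch.Size([2, 768])
```

The `max_seq_mha` branch differs only in the first pooling step, using `torch.max(theta, dim=1).values` instead of the mean over the last k layers.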