---
license: apache-2.0
tags:
- vision
- ocr
- compression
- autoencoding
---
# Bad Autoencoding - Model Checkpoints
Checkpoints for the paper: **"Optical Context Compression Is Just (Bad) Autoencoding"**
Ivan Lee, Cheng Yang, Taylor Berg-Kirkpatrick
## Links
- **Paper**: [arXiv:2512.03643](https://arxiv.org/abs/2512.03643)
- **Code**: [https://github.com/ivnle/bad-autoencoding](https://github.com/ivnle/bad-autoencoding)
## Available Checkpoints
Naming convention: `{regime}_{config}_h{N}_{objective}[_recon-init]`
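To make the convention concrete, here is a minimal sketch that splits a checkpoint name into its fields. The `CheckpointName` type and `parse_checkpoint_name` helper are hypothetical (they are not part of the released code), and the pattern assumes that none of the fields contain an underscore, which holds for every name listed in this card.

```python
import re
from typing import NamedTuple


class CheckpointName(NamedTuple):
    """Fields of `{regime}_{config}_h{N}_{objective}[_recon-init]` (hypothetical helper type)."""
    regime: str
    config: str
    h: int
    objective: str
    recon_init: bool


# Assumption: regime, config, and objective never contain "_".
_NAME_PATTERN = re.compile(
    r"^(?P<regime>[^_]+)_(?P<config>[^_]+)_h(?P<h>\d+)"
    r"_(?P<objective>[a-z]+)(?P<recon_init>_recon-init)?$"
)


def parse_checkpoint_name(name: str) -> CheckpointName:
    """Split a checkpoint directory name into its naming-convention fields."""
    match = _NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"name does not follow the naming convention: {name!r}")
    return CheckpointName(
        regime=match["regime"],
        config=match["config"],
        h=int(match["h"]),
        objective=match["objective"],
        recon_init=match["recon_init"] is not None,
    )
```

For example, `parse_checkpoint_name("conv1d_t250_h0_lm_recon-init")` yields regime `conv1d`, config `t250`, `h` equal to 0, objective `lm`, and `recon_init` set to `True`.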
### Reconstruction
| Checkpoint | Regime | Compression ratio (CR) | Perplexity (PPL) |
|------------|--------|-----|-----|
| `vision_base_h0_recon` | Vision base | 3.60 | 1.03 |
| `meanpool_w4s4_h0_recon` | Meanpool w4s4 | 3.97 | 1.04 |
| `conv1d_t250_h0_recon` | Conv1D t250 | 3.97 | 1.00 |
| `vision_tiny_h0_recon` | Vision tiny | 12.82 | 1.14 |
| `conv1d_t63_h0_recon` | Conv1D t63 | 15.38 | 1.01 |
### Language Modeling
| Checkpoint | Regime | Compression ratio (CR) | Init | Perplexity (PPL) |
|------------|--------|-----|------|-----|
| `vision_base_h0_lm` | Vision base | 3.60 | Direct | 5.08 |
| `vision_base_h0_lm_recon-init` | Vision base | 3.60 | From recon | 5.06 |
| `text_ctx277_h0_lm` | Text ctx277 (Truncation) | 3.60 | Direct | 5.02 |
| `meanpool_w4s4_h0_lm_recon-init` | Meanpool w4s4 | 3.97 | From recon | 5.02 |
| `conv1d_t250_h0_lm_recon-init` | Conv1D t250 | 3.97 | From recon | 4.96 |
## Model Details
- **Architecture**: DeepSeek-OCR with vision encoder
- **Vision checkpoints**: Trained encoder (base=768x768, tiny=384x384)
- **Text checkpoints**: Truncation baseline (no vision encoder), context=277 tokens
- **Meanpool checkpoints**: Frozen encoder, window=4, stride=4
- **Conv1D checkpoints**: Trained hierarchical encoder (t250=CR 3.97, t63=CR 15.38)
- **Dataset**: 510k samples from FineWiki
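As an illustration of what "window=4, stride=4" means for the meanpool regime, the sketch below averages non-overlapping groups of four consecutive vectors. It is a pure-Python toy, not the repository's implementation: the `mean_pool` name, the list-of-lists representation, and the choice to drop a trailing partial window are all assumptions made for this example.

```python
from typing import List


def mean_pool(embeddings: List[List[float]], window: int = 4, stride: int = 4) -> List[List[float]]:
    """Average each run of `window` consecutive vectors, stepping by `stride`.

    Assumption: a trailing group shorter than `window` is dropped.
    """
    pooled = []
    for start in range(0, len(embeddings) - window + 1, stride):
        chunk = embeddings[start:start + window]
        dim = len(chunk[0])
        # Element-wise mean over the vectors in this window.
        pooled.append([sum(vec[d] for vec in chunk) / window for d in range(dim)])
    return pooled


# Eight 2-dimensional vectors are pooled down to two vectors.
toy = [[float(i), 2.0 * i] for i in range(8)]
print(mean_pool(toy))
```

With window and stride both equal to 4, the sequence length shrinks by roughly a factor of four, which is in line with the compression ratio of about 4 reported for the meanpool checkpoints.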
## Usage
```python
from huggingface_hub import hf_hub_download

# Download a specific checkpoint
checkpoint_path = hf_hub_download(
    repo_id="ivnle/bad-autoencoding",
    filename="vision_base_h0_lm/model.pt",
    repo_type="model",
)
```
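To fetch other checkpoints by name, a small wrapper can build the filename for you. This is a sketch under one assumption: that every checkpoint listed in the tables is stored as `<checkpoint name>/model.pt`, as in the `vision_base_h0_lm` example (this card only confirms that one path). The `checkpoint_filename` and `download_checkpoint` helpers are hypothetical names introduced for illustration.

```python
REPO_ID = "ivnle/bad-autoencoding"


def checkpoint_filename(name: str) -> str:
    """Return the assumed path of a checkpoint inside the repository.

    Assumption: all checkpoints follow the `<name>/model.pt` layout.
    """
    return f"{name}/model.pt"


def download_checkpoint(name: str) -> str:
    """Download a checkpoint and return its local path in the Hugging Face cache."""
    # Imported here so the filename helper works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=checkpoint_filename(name),
        repo_type="model",
    )
```

Calling `download_checkpoint("vision_base_h0_lm")` is equivalent to the snippet above; for any other name, check the repository's file listing if the download fails.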
## Citation
```bibtex
@article{lee2024optical,
  title={Optical Context Compression Is Just (Bad) Autoencoding},
  author={Lee, Ivan and Yang, Cheng and Berg-Kirkpatrick, Taylor},
  journal={arXiv preprint arXiv:2512.03643},
  year={2025}
}
```