---
license: apache-2.0
tags:
  - vision
  - ocr
  - compression
  - autoencoding
---

# Bad Autoencoding - Model Checkpoints

Checkpoints for the paper: **"Optical Context Compression Is Just (Bad) Autoencoding"**

Ivan Lee, Cheng Yang, Taylor Berg-Kirkpatrick

## Links

- **Paper**: [arXiv:2512.03643](https://arxiv.org/abs/2512.03643)
- **Code**: [https://github.com/ivnle/bad-autoencoding](https://github.com/ivnle/bad-autoencoding)

## Available Checkpoints

Naming convention: `{regime}_{config}_h{N}_{objective}[_recon-init]`
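The convention can be made concrete with a small parser. This is a minimal sketch, not part of the released code: the function name `parse_checkpoint_name` and the regular expression are assumptions inferred from the checkpoint names listed in this card, and the meaning of the `h{N}` field is not documented here, so it is returned as a plain integer.

```python
import re

# Hypothetical pattern inferred from the names in this card:
# {regime}_{config}_h{N}_{objective}[_recon-init]
_NAME_RE = re.compile(
    r"^(?P<regime>[a-z0-9]+)"
    r"_(?P<config>[a-z0-9]+)"
    r"_h(?P<h>\d+)"
    r"_(?P<objective>recon|lm)"
    r"(?P<recon_init>_recon-init)?$"
)


def parse_checkpoint_name(name):
    """Split a checkpoint name into its fields, or raise ValueError."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized checkpoint name: {name!r}")
    return {
        "regime": match.group("regime"),
        "config": match.group("config"),
        "h": int(match.group("h")),  # meaning of N is not stated in this card
        "objective": match.group("objective"),
        "recon_init": match.group("recon_init") is not None,
    }


info = parse_checkpoint_name("vision_base_h0_lm_recon-init")
print(info)
```

The optional `_recon-init` suffix is mapped to a boolean, which matches the "Init" column in the language modeling table below ("From recon" versus "Direct").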

### Reconstruction

| Checkpoint | Regime | CR (compression ratio) | PPL (perplexity) |
|------------|--------|-----|-----|
| `vision_base_h0_recon` | Vision base | 3.60 | 1.03 |
| `meanpool_w4s4_h0_recon` | Meanpool w4s4 | 3.97 | 1.04 |
| `conv1d_t250_h0_recon` | Conv1D t250 | 3.97 | 1.00 |
| `vision_tiny_h0_recon` | Vision tiny | 12.82 | 1.14 |
| `conv1d_t63_h0_recon` | Conv1D t63 | 15.38 | 1.01 |

### Language Modeling

| Checkpoint | Regime | CR (compression ratio) | Init | PPL (perplexity) |
|------------|--------|-----|------|-----|
| `vision_base_h0_lm` | Vision base | 3.60 | Direct | 5.08 |
| `vision_base_h0_lm_recon-init` | Vision base | 3.60 | From recon | 5.06 |
| `text_ctx277_h0_lm` | Text ctx277 (Truncation) | 3.60 | Direct | 5.02 |
| `meanpool_w4s4_h0_lm_recon-init` | Meanpool w4s4 | 3.97 | From recon | 5.02 |
| `conv1d_t250_h0_lm_recon-init` | Conv1D t250 | 3.97 | From recon | 4.96 |

## Model Details

- **Architecture**: DeepSeek-OCR with vision encoder
- **Vision checkpoints**: Trained encoder (base=768x768, tiny=384x384)
- **Text checkpoints**: Truncation baseline (no vision encoder), context=277 tokens
- **Meanpool checkpoints**: Frozen encoder, window=4, stride=4
- **Conv1D checkpoints**: Trained hierarchical encoder (t250=CR 3.97, t63=CR 15.38)
- **Dataset**: 510k samples from FineWiki
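
To illustrate what "window=4, stride=4" means for the meanpool regime, here is a minimal pure-Python sketch of non-overlapping mean pooling over a sequence of embedding vectors. It is an illustration only and not the repository's implementation: the function name `mean_pool` is hypothetical, and dropping a trailing incomplete window is an assumption.

```python
def mean_pool(embeddings, window=4, stride=4):
    """Average each group of `window` consecutive vectors.

    `embeddings` is a list of equal-length lists of floats.
    Assumption: a trailing group shorter than `window` is dropped.
    """
    pooled = []
    for start in range(0, len(embeddings) - window + 1, stride):
        chunk = embeddings[start:start + window]
        dim = len(chunk[0])
        # Element-wise mean across the vectors in this window.
        pooled.append([sum(vec[d] for vec in chunk) / window for d in range(dim)])
    return pooled


# Eight 1-dimensional "token embeddings" are reduced to two pooled vectors.
tokens = [[1.0], [3.0], [5.0], [7.0], [2.0], [2.0], [2.0], [2.0]]
print(mean_pool(tokens))
```

With window and stride both equal to 4, the sequence shrinks by roughly a factor of four, which is close to the CR of 3.97 listed for the meanpool checkpoints; how the exact figure is computed is not stated in this card.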

## Usage

```python
from huggingface_hub import hf_hub_download

# Download a specific checkpoint
checkpoint_path = hf_hub_download(
    repo_id="ivnle/bad-autoencoding",
    filename="vision_base_h0_lm/model.pt",
    repo_type="model"
)
```
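
The same call can be wrapped for any checkpoint in the tables above. This is a sketch under one stated assumption: every checkpoint is stored as `<checkpoint_name>/model.pt`, which is only confirmed here for `vision_base_h0_lm`. The helper names `checkpoint_filename` and `download_checkpoint` are hypothetical.

```python
REPO_ID = "ivnle/bad-autoencoding"


def checkpoint_filename(name):
    """Build the in-repo path, assuming the `<name>/model.pt` layout."""
    return f"{name}/model.pt"


def download_checkpoint(name):
    """Download one checkpoint and return its local cache path."""
    # Imported lazily so the path helper works without the package installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=checkpoint_filename(name),
        repo_type="model",
    )


# Example (requires network access and the huggingface_hub package):
# path = download_checkpoint("conv1d_t250_h0_lm_recon-init")
print(checkpoint_filename("conv1d_t250_h0_lm_recon-init"))
```

If a checkpoint uses a different file layout, `hf_hub_download` raises an error rather than returning a path, so a wrong assumption surfaces immediately.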

## Citation

```bibtex
@article{lee2025optical,
  title={Optical Context Compression Is Just (Bad) Autoencoding},
  author={Lee, Ivan and Yang, Cheng and Berg-Kirkpatrick, Taylor},
  journal={arXiv preprint arXiv:2512.03643},
  year={2025}
}
```