mohdusman001/Text-to-Table-Stage1

Stage‑2 (π₂) text + schema → table model fine‑tuned from meta-llama/Meta-Llama-3.1-8B-Instruct with a 3‑stage schedule (2k → 4k → 8k context). This repo includes merged weights + tokenizer and sample artifacts.

TL;DR

Context window: 8192 tokens (final stage)
Final eval (loss / ppl): 2.395211 / 10.9705
Sanity (json_valid / key_order / type): 0.078 / 0.078 / 0.078
Artifacts: see metrics/ and samples/.
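
The two numbers on the "Final eval" line appear to be related by ppl = exp(loss). The sketch below checks that, assuming the loss is a mean token-level cross-entropy in nats (the card does not say how it was computed); `perplexity` is a hypothetical helper, not part of this repo.

```python
import math


def perplexity(mean_nll: float) -> float:
    """Perplexity for a mean negative log-likelihood given in nats (assumed unit)."""
    return math.exp(mean_nll)


# Reported final eval loss from the TL;DR.
ppl = perplexity(2.395211)
print(round(ppl, 4))
```

If the assumption holds, the printed value should land very close to the reported perplexity.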

How to prompt (JSONL rows)

The model expects a schema and a document snippet. It should emit one JSON object per line, with keys exactly in schema order (no code fences, no prose).

[SCHEMA]
{"fields":[{"name":"order_id","type":"string"},{"name":"item","type":"string"},{"name":"qty","type":"integer"}]}

<|document|>
Orders today:
- O-1003: 2x pencil
- O-1004: 1x notebook
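
A minimal sketch of assembling a prompt in the layout above from Python objects. `build_prompt` is a hypothetical helper, and the compact JSON separators are a choice made here to mirror the example, not a documented requirement of the model.

```python
import json


def build_prompt(fields: list[dict], document: str) -> str:
    """Assemble a [SCHEMA] + <|document|> prompt in the layout shown above."""
    # Compact separators reproduce the schema line from the example verbatim.
    schema = json.dumps({"fields": fields}, separators=(",", ":"))
    return f"[SCHEMA]\n{schema}\n\n<|document|>\n{document.strip()}\n"


fields = [
    {"name": "order_id", "type": "string"},
    {"name": "item", "type": "string"},
    {"name": "qty", "type": "integer"},
]
prompt = build_prompt(fields, "Orders today:\n- O-1003: 2x pencil\n- O-1004: 1x notebook")
print(prompt)
```

Keeping the field list as an ordered Python list makes it easy to reuse the same object later when checking key order in the output.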

Python usage (deterministic table emission)

import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mohdusman001/Text-to-Table-Stage1"
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token  # fall back to EOS for padding

# FlashAttention-2 needs a CUDA GPU and the flash-attn package; use SDPA otherwise.
use_cuda = torch.cuda.is_available()
dtype = torch.bfloat16 if use_cuda else torch.float32
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=dtype,
    device_map="auto",
    attn_implementation="flash_attention_2" if use_cuda else "sdpa",
)

prompt = (
    "[SCHEMA]\n"
    '{"fields":[{"name":"order_id","type":"string"},{"name":"item","type":"string"},{"name":"qty","type":"integer"}]}\n\n'
    "<|document|>\n"
    "Orders today:\n"
    "- O-1003: 2x pencil\n"
    "- O-1004: 1x notebook\n"
)
chat = [{"role": "user", "content": prompt}]
txt = tok.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
# The chat template already emits the BOS token, so do not add special tokens again.
inp = tok(txt, return_tensors="pt", add_special_tokens=False).to(model.device)

with torch.no_grad():
    # Greedy decoding: do_sample=False is sufficient; temperature only applies when sampling.
    out = model.generate(**inp, max_new_tokens=256, do_sample=False,
                         eos_token_id=tok.eos_token_id, pad_token_id=tok.pad_token_id)
# Decode only the newly generated tokens, not the prompt.
generated = tok.decode(out[0][inp["input_ids"].shape[1]:], skip_special_tokens=True)

# Raises json.JSONDecodeError if any non-empty line is not valid JSON.
rows = [json.loads(line) for line in generated.splitlines() if line.strip()]
print(rows)  # list of dicts
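
The one-line `json.loads` parse above raises on the first line that is not valid JSON, which can happen when the model emits prose or code fences. A more tolerant sketch is shown below; `parse_jsonl_lenient` and the `sample` string are hypothetical and only illustrate the idea of keeping valid rows while collecting rejected lines for inspection.

```python
import json


def parse_jsonl_lenient(text: str) -> tuple[list[dict], list[str]]:
    """Return (rows, rejected_lines); keep only lines that parse to JSON objects."""
    rows, rejected = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("```"):
            continue  # skip blank lines and markdown code-fence markers
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            rejected.append(line)
            continue
        if isinstance(obj, dict):
            rows.append(obj)
        else:
            rejected.append(line)  # valid JSON, but not a row object
    return rows, rejected


# Hypothetical model output mixing prose, fences, and one valid row.
sample = 'Here is the data:\n```json\n{"order_id": "O-1003", "item": "pencil", "qty": 2}\n```\n'
rows, rejected = parse_jsonl_lenient(sample)
print(rows, rejected)
```

Logging the rejected lines rather than discarding them silently makes it easier to measure how often the output deviates from pure JSONL.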

Example generations

Sample 1

user: [SCHEMA] {"fields": [{"name": "order_id", "type": "string"}, {"name": "item", "type": "string"}, {"name": "qty", "type": "integer"}]}

<|document|> Orders today:
- O-1003: 2x pencil
- O-1004: 1x notebook

generation (head):

Based on the provided schema, here's a breakdown of the data:

- **order_id**: a unique identifier for each order (e.g., O-1003, O-1004)
- **item**: the product being ordered (e.g., pencil, notebook)
- **qty**: the quantity of the item ordered (e.g., 2, 1)

So, the data can be represented as a list of dictionaries, where each dictionary represents an order:

```json
[
```

Sample 2

user: [SCHEMA] {"fields": [{"name": "name", "type": "string"}, {"name": "department", "type": "string"}, {"name": "start_date", "type": "date"}, {"name": "full_time", "type": "boolean"}]}

<|document|> New hires:

Priya Singh joined Sales on 2024-08-01 full‑time.
Ken Adams joined Engineering on 2024-09-15 part‑time.

generation (head):

Based on the provided schema and the given information, we can create a JSON object for each new hire. However, since the schema specifies that the "full_time" field should be a boolean, we will assume that "part-time" is equivalent to False and "full-time" is equivalent to True.

Here's the JSON data:

```json
[
  {
    "name": "Priya Singh",
    "department": "Sales",
    "start_date": "2024-08-01",
```

Sample 3

system: You convert documents into tabular data strictly under a provided JSON schema. Output ONLY JSON Lines (one JSON object per row), with EXACT columns and order as the schema, and no commentary.

user: <|policy|> [POLICY]

Extract only facts explicitly supported by the document. No guessing, no background knowledge, no synonyms.
Never invent rows or columns. If a value is not present, output an empty string for that cell.
Output exactly the columns listed in [SCHEMA]. The key order in each JSON object MUST match the schema order.
Do not add headers, comments, explanations, or markdown. Emit ONLY raw JSONL (one JSON object per line).
Output must be deterministic for identical input.
Trim leading/trailing whitespace in strings; preserve internal spacing and case from the document.
IDs/codes stay strings (preserve leading zeros). Do not convert units or reformat currencies.
Booleans accept tokens ⊆ {true,false,yes,no,1,0,t,f} (case-insensitive). Keep the surface form unless the schema requires normalization.
Integers: ^[+-]?\d+$ (after trim). Numbers: plain JSON numbers if unambiguous; otherwise keep as strings.
Dates: prefer ISO-like YYYY, YYYY-MM, or YYYY-MM-DD if explicitly present; otherwise keep as strings.
Treat as missing (case-insensitive, after trim): "", -, —, –, N/A, NA, None, Null, Unknown, TBD, and the na_token from [METADATA]. For missing values, emit empty string "".
For pivot-like tables, emit a single row per entity with all columns populated when available.
For key–value (slot/value) tables, emit one row per pair with exactly the two columns from the schema.
No trailing commas. Ensure every line is valid JSON. Do not wrap rows in an array. No code fences.

<|metadata|> [METADATA] {"language": "en", "script": "auto", "direction": "auto", "source_modality": "plain_text", "na_token": "", "document_char_len": 76, "table_count": "auto", "structure_candidates": ["kv_single", "kv_multi", "flat_single_row", "row_grouped"], "table_hints": {"header_rows_max": 3, "header_cols_max": 2, "row_header_possible": true, "col_header_possible": true, "ragged_rows_possible": false, "multiple_tables_possible": true}, "locale": {"numeric": "auto", "decimal_separators": [".", ","], "thousand_separators": [",", ".", " "], "negative_patterns": ["-x", "(x)"], "percent_symbol": "%", "permille_symbol": "‰"}, "parsing_hints": {"kv_markers": [":", "—", "–", "=", "→"], "list_markers": ["- ", "* ", "• ", "•\t", "▪ "], "section_markers": ["###", "##", "--", "__", "**", ":"]}, "missing_tokens": ["", "-", "—", "–", "N/A", "NA", "None", "Null", "Unknown", "TBD"]}

<|schema|> [SCHEMA] {"schema_id": "gold.e2137f3600e1", "structure": "kv_multi", "description": "Schema induced from the gold table.", "fields": [{"name": "slot", "type": "string", "required": true}, {"name": "value", "type": "string", "required": true}], "constraints": {"enums": {"slot": ["name", "area", "familyFriendly"], "value": ["Alimentum", "city centre", "no"]}}}

<|document|> [DOCUMENT] There is a place in the city centre, Alimentum, that is not family-friendly.

<|output_format|> [OUTPUT_FORMAT] Emit ONLY JSON Lines (JSONL), one JSON object per row, with keys exactly as in [SCHEMA] and in the same order. Do NOT wrap in an array. No comments or markdown.

<|task|> [TASK] Fill the table under the given [SCHEMA] using facts from [DOCUMENT]. If a value is not explicitly present, output an empty string for that cell. Ensure every line is valid JSON.

assistant:
{"slot": "name", "value": "Alimentum"}
{"slot": "area", "value": "city centre"}
{"slot": "familyFriendly", "value": "no"}

generation (head):

I've generated the output based on the provided schema and document.
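
The [POLICY] text in Sample 3 spells out rules for missing values, integers, and boolean tokens. A minimal sketch of those three checks follows; the helper names are hypothetical, and restricting `\d` to ASCII digits is an assumption made here, not something the policy states.

```python
import re

# Tokens treated as missing (compared case-insensitively, after trimming).
MISSING_TOKENS = {"", "-", "—", "–", "n/a", "na", "none", "null", "unknown", "tbd"}
# Accepted boolean surface forms (compared case-insensitively).
BOOLEAN_TOKENS = {"true", "false", "yes", "no", "1", "0", "t", "f"}
# Integer pattern from the policy; re.ASCII limits \d to 0-9 (assumption).
INTEGER_RE = re.compile(r"^[+-]?\d+$", re.ASCII)


def is_missing(value: str, na_token: str = "") -> bool:
    """True if the trimmed value is a missing token or the metadata na_token."""
    v = value.strip().lower()
    return v in MISSING_TOKENS or v == na_token.strip().lower()


def is_integer(value: str) -> bool:
    """True if the trimmed value matches the policy's integer pattern."""
    return INTEGER_RE.match(value.strip()) is not None


def is_boolean(value: str) -> bool:
    """True if the trimmed value is one of the accepted boolean tokens."""
    return value.strip().lower() in BOOLEAN_TOKENS


print(is_missing(" N/A "), is_integer("+42"), is_boolean("Yes"))
```

Checks like these can be applied to extracted cells before comparing them with a gold table.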

Notes

Trained with FlashAttention‑2 (FA‑2) and FSDP; LoRA adapters were used in the earlier stages and then merged into the weights for the final artifact.
To maximize the share of valid rows, decode greedily (no sampling) and validate each output line downstream.
Respect the base model license and your data licenses.
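
Downstream validation can mirror the three sanity checks named in the TL;DR (json_valid / key_order / type). The sketch below is one possible reading of those checks, not the script used to produce the reported numbers; `check_line` and `TYPE_CHECKS` are hypothetical, and accepting an empty string for any type follows the missing-value convention in the Sample 3 policy.

```python
import json

# Per-type predicates; types not listed here are accepted as-is (assumption).
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "date": lambda v: isinstance(v, str),
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
}


def check_line(line: str, fields: list[dict]) -> dict:
    """Check one output line for JSON validity, schema key order, and value types."""
    result = {"json_valid": False, "key_order": False, "type": False}
    try:
        row = json.loads(line)
    except json.JSONDecodeError:
        return result
    if not isinstance(row, dict):
        return result
    result["json_valid"] = True
    names = [f["name"] for f in fields]
    result["key_order"] = list(row.keys()) == names
    if result["key_order"]:
        result["type"] = all(
            row[f["name"]] == "" or TYPE_CHECKS.get(f["type"], lambda v: True)(row[f["name"]])
            for f in fields
        )
    return result


fields = [
    {"name": "order_id", "type": "string"},
    {"name": "item", "type": "string"},
    {"name": "qty", "type": "integer"},
]
print(check_line('{"order_id": "O-1003", "item": "pencil", "qty": 2}', fields))
```

Averaging each flag over all output lines gives per-check rates comparable in spirit to the sanity figures, under the assumptions above.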


Model size: 8B params (Safetensors, tensor type BF16)
