Daniel van Strien PRO
davanstrien
AI & ML interests
Machine Learning Librarian
Recent Activity
updated a dataset about 5 hours ago: data-is-better-together/fineweb-c-progress
updated a dataset about 22 hours ago: librarian-bots/dataset-columns
upvoted a collection 1 day ago: K2-V2
Maths reasoning
Maths reasoning datasets found using https://huggingface.co/spaces/librarian-bots/huggingface-datasets-semantic-search
- Semantic Hugging Face Hub Search
  Running • 84 • Find datasets and models using semantic search
- open-r1/OpenR1-Math-220k
  Viewer • Updated • 450k • 13.1k • 677
- simplescaling/s1K-1.1
  Viewer • Updated • 1k • 2.41k • 140
- MU-NLPC/Calc-ape210k
  Viewer • Updated • 404k • 1.7k • 25
sentence-transformers-from-synthetic-data
Example of using distilabel to generate synthetic triplets data for fine-tuning a Sentence Transformer model
- bigcode/self-oss-instruct-sc2-exec-filter-50k
  Viewer • Updated • 50.7k • 847 • 104
- davanstrien/similarity-dataset-sc2-8b
  Viewer • Updated • 2.32k • 89 • 6
- davanstrien/code-prompt-similarity-model
  Sentence Similarity • 0.1B • Updated • 13 • 6
- davanstrien/abstract-wiki
  Viewer • Updated • 5k • 61 • 2
haiku
A collection of synthetic datasets built to help open language models write better haikus using DPO
Probably DPO datasets
A collection of datasets that probably support DPO
- HuggingFaceH4/ultrafeedback_binarized
  Viewer • Updated • 187k • 8.88k • 316
- mlabonne/orpo-dpo-mix-40k
  Viewer • Updated • 44.2k • 564 • 296
- argilla/OpenHermesPreferences
  Viewer • Updated • 989k • 1.36k • 209
- argilla/distilabel-capybara-dpo-7k-binarized
  Viewer • Updated • 7.56k • 1.68k • 182
query-to-hub-datasets-viewer-project
hub-tldr
Creating a smol model for tl;dr-ing the hub
- davanstrien/Smol-Hub-tldr
  Text Generation • 0.4B • Updated • 39 • 11
- Semantic Hugging Face Hub Search
  Running • 84 • Find datasets and models using semantic search
- davanstrien/hub-tldr-dataset-summaries-llama
  Viewer • Updated • 5k • 63 • 1
- davanstrien/hub-tldr-model-summaries-llama
  Viewer • Updated • 5k • 91 • 1
synthetic-data-generation-demos
A collection of demos for various approaches to synthetic data generation
- Genstruct 7B
  Runtime error • 8
- Instruction Synthesizer
  Runtime error • Featured • 86 • Generate instruction-response pairs from text
- Magpie
  Running on Zero • Featured • 72 • Generate and rate instruction-response pairs
- Bonito
  Runtime error • 11 • Generate task-specific instructions and responses from text
Synthetic (text) Dataset Generation
Papers about synthetic dataset generation
- Better Synthetic Data by Retrieving and Transforming Existing Datasets
  Paper • 2404.14361 • Published • 2
- Generative AI for Synthetic Data Generation: Methods, Challenges and the Future
  Paper • 2403.04190 • Published • 1
- Best Practices and Lessons Learned on Synthetic Data for Language Models
  Paper • 2404.07503 • Published • 31
- A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models
  Paper • 2404.14445 • Published
Historic language modeling
This collection contains models, datasets, and Spaces related to historic language models, i.e. language models trained on historical data
Image Preference Optimization Datasets
Datasets suitable for Image Preference Optimization based on their column names
Reasoning Required?
- davanstrien/reasoning-required
  Viewer • Updated • 5k • 228 • 19
- davanstrien/ModernBERT-based-Reasoning-Required
  Text Classification • 0.1B • Updated • 28 • 10
- davanstrien/fineweb-with-reasoning-scores-and-topics
  Viewer • Updated • 10k • 82 • 1
- davanstrien/fine-reasoning-questions
  Viewer • Updated • 244 • 126 • 19