Pretraining Datasets
wikimedia/wikipedia • Updated Jan 9, 2024 • 61.6M • 86.1k • 1.15k
togethercomputer/RedPajama-Data-V2 • Updated Nov 21, 2024 • 5.85k • 397
Skywork/SkyPile-150B • Updated Dec 7, 2023 • 1.76M • 15.5k • 402

Awesome Instruction Tuning Dataset
Open-Orca/OpenOrca • Updated Feb 19, 2025 • 2.94M • 16.6k • 1.51k
glaiveai/glaive-code-assistant • Updated Sep 27, 2023 • 136k • 424 • 99
silk-road/alpaca-data-gpt4-chinese • Updated May 23, 2023 • 52k • 585 • 102
anon8231489123/ShareGPT_Vicuna_unfiltered • Updated Apr 12, 2023 • 127k • 850