Cpp-Code-Large is a large-scale corpus of C++ source code comprising more than 5 million lines of C++ code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and static program analysis for the C++ ecosystem.
By providing a high-volume, language-specific corpus, Cpp-Code-Large enables systematic experimentation in C++-focused model training, domain adaptation, and downstream code understanding tasks.
Cpp-Code-Large addresses the need for a dedicated C++-only dataset at substantial scale, enabling focused research across systems programming, performance-critical applications, embedded systems, game engines, and large-scale native software projects.
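For anyone who wants to poke at it, here is a minimal streaming-load sketch. The Hub repo path and the `code` column name are my assumptions, so check the dataset card for the exact fields:

```python
# Minimal loading sketch; repo path and column name are hypothetical.
from datasets import load_dataset

# Streaming avoids downloading the multi-million-line corpus up front.
ds = load_dataset("Ujjwal-Tyagi/Cpp-Code-Large", split="train", streaming=True)
for example in ds.take(3):
    print(example["code"][:200])  # peek at the first few samples
```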
Python-Code-Large is a large-scale corpus of Python source code comprising more than 2 million rows of Python code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis for the Python ecosystem.
By providing a high-volume, language-specific corpus, Python-Code-Large enables systematic experimentation in Python-focused model training, domain adaptation, and downstream code understanding tasks.
Python-Code-Large addresses the need for a dedicated Python-only dataset at substantial scale, enabling focused research across data science, backend systems, automation, scientific computing, and AI-driven Python environments.
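A quick filtering sketch over the streamed corpus, assuming the same hypothetical repo path and a `code` column (the predicate is only illustrative):

```python
# Filtering sketch; repo path, column name, and predicate are illustrative.
from datasets import load_dataset

ds = load_dataset("Ujjwal-Tyagi/Python-Code-Large", split="train", streaming=True)
# Keep only examples that define at least one function.
defs_only = ds.filter(lambda ex: "def " in ex["code"])
print(next(iter(defs_only))["code"][:200])
```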
Public reports allege that Anthropic gobbled up trillions of tokens of copyrighted material and public data to build their castle. 🏰📄 Now that they're sitting on top, they're begging for special laws to protect their profits while pulling the ladder up behind them. 🪜🚫
But the hypocrisy meter just broke! 📉 They are accusing Chinese labs like DeepSeek, MiniMax, and Kimi of "huge distillation attacks." The reality is that you can't just loot the entire internet's library, lock the door, and then sue everyone else for reading through the window. Stop trying to gatekeep tech you didn't own in the first place. Read the complete article here: https://huggingface.co/blog/Ujjwal-Tyagi/the-dark-underbelly-of-anthropic
PHP-Code-Large is a large-scale corpus of PHP source code comprising more than 12 million lines of PHP code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and static program analysis for the PHP ecosystem.
By providing a high-volume, language-specific corpus, PHP-Code-Large enables systematic experimentation in PHP-focused model training, domain adaptation, and downstream code understanding tasks.
PHP-Code-Large addresses the need for a dedicated PHP-only dataset at substantial scale, enabling focused research across backend systems, CMS platforms, APIs, and full-stack PHP environments.
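Here is a quick line-count sanity check on a streamed sample; as above, the repo path and the `code` column are my assumptions:

```python
# Line-count sketch over a small streamed sample; repo path and column
# name are hypothetical.
import itertools
from datasets import load_dataset

ds = load_dataset("Ujjwal-Tyagi/PHP-Code-Large", split="train", streaming=True)
sample = list(itertools.islice(iter(ds), 1_000))
loc = sum(ex["code"].count("\n") + 1 for ex in sample)
print(f"~{loc:,} lines of PHP across the first 1,000 examples")
```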
JavaScript-Code-Large is a large-scale corpus of JavaScript source code comprising around 5 million JavaScript files. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis for the JavaScript ecosystem.
By providing a high-volume, language-specific corpus, JavaScript-Code-Large enables systematic experimentation in JavaScript-focused model training, domain adaptation, and downstream code understanding tasks.
JavaScript-Code-Large addresses the need for a dedicated JavaScript-only dataset at substantial scale, enabling focused research across frontend, backend, and full-stack JavaScript environments.
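Since file-level corpora often contain byte-identical duplicates, here is a simple exact-dedup sketch over the stream (repo path and `code` column are again my assumptions):

```python
# Exact-duplicate filter sketch; repo path and column name are hypothetical.
import hashlib
from datasets import load_dataset

ds = load_dataset("Ujjwal-Tyagi/JavaScript-Code-Large", split="train", streaming=True)
seen = set()
unique = []
for example in ds.take(10_000):
    digest = hashlib.sha256(example["code"].encode("utf-8")).hexdigest()
    if digest in seen:
        continue  # skip byte-identical files
    seen.add(digest)
    unique.append(example)
print(f"{len(unique)} unique files out of the first 10,000")
```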
Java-Code-Large is a large-scale corpus of publicly available Java source code comprising more than 15 million Java code samples. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis.
By providing a high-volume, language-specific corpus, Java-Code-Large enables systematic experimentation in Java-focused model training, domain adaptation, and downstream code understanding tasks.
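For pretraining-style use, here is a tokenization sketch; the repo path and `code` column are my assumptions, and `gpt2` just stands in for whatever tokenizer you actually train with:

```python
# Pretraining-style tokenization sketch; repo path and column name are
# hypothetical, and the gpt2 tokenizer is only a placeholder.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
ds = load_dataset("Ujjwal-Tyagi/Java-Code-Large", split="train", streaming=True)
tokenized = ds.map(lambda ex: tokenizer(ex["code"], truncation=True, max_length=1024))
print(len(next(iter(tokenized))["input_ids"]))  # token count of the first sample
```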
Qwen 3.5 is here! It supports a 1M context length by default and delivers strong performance, competitive with Claude Opus 4.6. Model: Qwen/Qwen3.5-397B-A17B. GGUF: unsloth/Qwen3.5-397B-A17B-GGUF. Follow me and turn on notifications for the latest news!
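If you want to grab the GGUF programmatically, here is a minimal sketch with huggingface_hub; the quant filename is a guess, so check the repo's file list first:

```python
# Download sketch; the GGUF filename below is hypothetical.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="unsloth/Qwen3.5-397B-A17B-GGUF",
    filename="Qwen3.5-397B-A17B-Q4_K_M.gguf",  # guess: verify against the repo
)
print(path)
```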
🧬 Experimenting with "Dynamic Chaos" in Tamil SLMs
Hi everyone! I just published a new experimental study on Small Language Model (SLM) resilience.
I took the Qwen2.5-0.5B model and put it through a "Chaos Phase" to see how much of its weights a tiny model can lose before its understanding of classical Tamil grammar breaks.
Key highlights of the study:
- Target Data: Fine-tuned on the Thirukkural (1,330 couplets + modern explanations).
- The Chaos Step: Applied 20% random weight pruning, with "Layer Protection" for the Token Embeddings and LM Head to keep the characters readable (see the sketch below).
- Compression: 4-bit (Q4_K_M) quantization for extreme efficiency.
- Result: A surrealist classical Tamil model that is ultra-light (~300MB) and ultra-fast!
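For the curious, here is a minimal sketch of the Chaos Step as described above, using PyTorch's built-in pruning utilities. The protected layer names follow the stock Hugging Face Qwen2 implementation; this is my reconstruction, not the exact training script:

```python
# Sketch of the "Chaos Phase": 20% random weight pruning, with the token
# embeddings and LM head left untouched so the characters stay readable.
import torch.nn as nn
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
PROTECTED = ("embed_tokens", "lm_head")  # "Layer Protection" from the post

for name, module in model.named_modules():
    if isinstance(module, nn.Linear) and not any(p in name for p in PROTECTED):
        prune.random_unstructured(module, name="weight", amount=0.20)
        prune.remove(module, "weight")  # bake the zeroed weights in permanently
```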
There is a new open-source music generation model called HeartMuLa. It offers strong performance, competitive with Suno, and supports English, Chinese, Japanese, Korean, and Spanish. It is optimized to run easily on RTX GPUs and other consumer-grade hardware. Model: HeartMuLa/HeartMuLa-oss-3B. Code: https://github.com/HeartMuLa/heartlib
Korean labs are also making great progress behind the Chinese ones, with two open-source AI models that are actually good at coding: upstage/Solar-Open-100B and skt/A.X-K1.
I’m excited to release hawky-ai-Qwen3-0.6B-Marketing-MoT, a specialized SLM designed for deep strategic reasoning in performance marketing.
While small at 0.6B parameters, this model punches way above its weight class by utilizing a Mixture of Thoughts (MoT) framework. It doesn't just give you an answer; it thinks through the logic of Meta Ads scaling, GA4 attribution, and unit economics before providing a strategic recommendation.
Key Features:
- Thinking-First: Trained on 1,500+ critical thinking scenarios.
- MoT Framework: 5 distinct reasoning styles (Linear, Exploratory, Critical, Deconstructive, Analogical).
- SLM Speed: Perfect for low-latency, high-precision marketing audits.

Check it out on Hugging Face: 🔗 Sri-Vigneshwar-DJ/hawky-ai-Qwen3-0.6B-Marketing-MoT
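A quick usage sketch, assuming the model keeps the standard Qwen3 chat template on the Hub (the prompt is just illustrative):

```python
# Usage sketch; assumes the standard chat template, illustrative prompt.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Sri-Vigneshwar-DJ/hawky-ai-Qwen3-0.6B-Marketing-MoT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [{"role": "user", "content": "CPA doubled after scaling Meta Ads budget 3x. Diagnose and recommend next steps."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```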
I am very excited to see the release of nyuuzyou/gitee-code. This is exactly what I have been looking for. Thank you to @nyuuzyou for his hard work on this.
I'm looking for AI engineers and researchers to join my company as part of the core team. We'll be working on cutting-edge research and hands-on implementation across LLMs and related systems. I'm especially interested in founding engineers for my AI startup who want to build from the ground up and shape both the product and the research direction. If this sounds interesting, message me on Discord (username: "ujjwal_tyagi.shirova"). Please attach your resume and details of your open-source projects (especially any related to LLMs) on Discord rather than sharing them here as a reply to this post.