Cookbooks" and LLM development guides

We at MWS have launched a language model aggregator that lets you work with multiple LLMs through a single interface. MWS GPT offers MTS's own models, external models such as DeepSeek, and the customer's own models, and all of them can easily be connected to any corporate system or chatbot via an API.

We’ve also gathered materials for developers and data scientists who want to sharpen their skills in machine learning and the development of large language models. Here you’ll find the essentials — from basic guides to scientific reference books.

From Developers to Developers

OpenCoder is a family of open-source LLMs designed for coding tasks. The project authors are the AI system development company INF, along with the research community Multimodal Art Projection (M-A-P). They released OpenCoder to enrich the open-source model niche, providing models suitable both for coding and for scientific research.

Interestingly, the developers open-sourced not only the weights and training data, but also the data processing pipeline and experiment results. These processes are described in a scientific paper.

The guide covers every stage of LLM design, including collecting pretraining data from GitHub and the Common Crawl web archive, as well as removing copyright-protected materials and personal data from the corpus. The developers also paid special attention to balancing the dataset by filtering code in languages such as Java and HTML.
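As a loose illustration of what one such filtering step might look like, here is a minimal sketch. The regex, language caps, and function names below are our own placeholders, not OpenCoder's actual pipeline:

```python
import re

# Hypothetical PII pattern and per-language caps, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
MAX_SHARE = {"java": 0.15, "html": 0.05}  # cap overrepresented languages

def scrub_pii(text: str) -> str:
    """Replace e-mail addresses with a placeholder token."""
    return EMAIL_RE.sub("<EMAIL>", text)

def balance_by_language(files: list[dict]) -> list[dict]:
    """Keep files under each language's share cap and scrub PII.

    Each file is a dict like {"lang": "java", "text": "..."}.
    """
    total = len(files)
    kept, counts = [], {}
    for f in files:
        lang = f["lang"]
        counts.setdefault(lang, 0)
        cap = MAX_SHARE.get(lang)
        if cap is not None and counts[lang] >= cap * total:
            continue  # this language is over its cap: drop the file
        counts[lang] += 1
        kept.append({**f, "text": scrub_pii(f["text"])})
    return kept
```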

There’s also a section on model training, covering architecture selection and optimization methods. For example, OpenCoder was trained on a dataset containing topics from theoretical computer science — algorithms, data structures, and network architecture principles. Only after that was the model trained on code examples from GitHub.
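A two-stage schedule of this kind can be sketched with the Hugging Face Trainer. Everything below (the gpt2 placeholder model, the file names, the hyperparameters) is an assumption for illustration, not OpenCoder's real setup:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
tok.pad_token = tok.eos_token                # GPT-2 has no pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")

def tokenize(batch):
    return tok(batch["text"], truncation=True, max_length=512)

# Stage 1: theory-heavy text; stage 2: raw code. File names are made up.
stages = {"theory": "cs_theory.txt", "code": "github_code.txt"}

collator = DataCollatorForLanguageModeling(tok, mlm=False)  # causal LM labels
for name, path in stages.items():
    data = load_dataset("text", data_files=path)["train"]
    data = data.map(tokenize, remove_columns=["text"])
    args = TrainingArguments(output_dir=f"ckpt-{name}",
                             num_train_epochs=1,
                             per_device_train_batch_size=2)
    Trainer(model=model, args=args, train_dataset=data,
            data_collator=collator).train()
```

Because the same model object is passed to both stages, the second run simply continues from the weights the first one produced.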

The provided “cookbook” is perfect for those experimenting with model training.

From a Student to Students

“LLM Cookbook” is a large language model development guide created by a Boston University student. The material is organized in a GitHub repository and includes both the author's notes and instructions from resources like Hugging Face. There are pages with code samples for everything: local and cloud LLM deployment, fine-tuning tips, and running an LLM as a RESTful service. You’ll find both basic instructions, such as adapting a model to your own dataset, and more advanced tips, for example, ways to improve LLM training efficiency on a single GPU.
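To give a flavor of the "run an LLM as a RESTful service" recipes, here is a minimal sketch using FastAPI; the endpoint shape and the placeholder gpt2 model are our assumptions, not code from the repository:

```python
# Serve a local model over HTTP. Run with: uvicorn server:app --port 8000
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
generator = pipeline("text-generation", model="gpt2")  # placeholder model

class Prompt(BaseModel):
    text: str
    max_new_tokens: int = 64

@app.post("/generate")
def generate(req: Prompt):
    out = generator(req.text, max_new_tokens=req.max_new_tokens)
    return {"completion": out[0]["generated_text"]}
```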

The author started building the repository this year, so many topics haven’t yet been fully covered. In the future, they plan to include subsections on LLM inference efficiency, benchmarking, and practical use case analysis. Overall, the author is open to collaboration and invites anyone to help fill out the “cookbook”.

LLM Programming for Developers

In the guide "Hands on introduction to LLM programming for developers," the author covers the basic concepts of machine learning — what embeddings, tokens, temperature are, and provides examples of setting up and using LLM for various tasks. The tutorial is written in simple language, so it will be useful for those who are just starting to dive into the topic of machine learning.

The "Cookbook" covers setting up the development environment, using the API to make requests to LLM, designing and refining prompts, processing the model's output, and integrating LLM capabilities into applications. The author also explains how to use the LangChain framework to write code and analyze document contents using Retrieval Augmented Generation (RAG).

The guide includes a Python Notebook with code available on GitHub.

"Cookbook" by Hugging Face

"Open-Source AI Cookbook" — guide, published on Hugging Face, provides instructions for developing advanced LLMs and will be useful for experienced specialists. The cookbook covers various areas, including natural language processing, computer vision, and also contains instructions on model deployment, performance optimization, and efficient use of datasets. For example, one of the instructions is dedicated to analyzing artistic styles using multimodal embeddings.

At the same time, the authors of the "cookbook" invite anyone interested to contribute to the project: you can suggest an idea for a tutorial, submit a finished notebook with a practical example, or improve the existing guides.

Guide for Scientists

The guide will be useful for scientists who want to use AI systems in their research. The "cookbook" was written by NASA Impact, an interdisciplinary team that develops technologies to support scientific and applied projects.

The guide's sections discuss responsible use of AI systems in science, popular LLMs and their features, and methods for integrating models into research, such as retrieval-augmented generation (RAG) and prompt engineering. The "cookbook" also includes examples of optimizing research workflows with LLMs: using the LangChain framework, for instance, the authors developed an application for searching scientific research data.
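As a small illustration of the prompt-engineering side, here is a toy template for pulling structured facts out of an abstract; the wording and the example abstract are our own, not taken from the NASA guide:

```python
# A reusable prompt template: fixed instructions plus a context slot.
TEMPLATE = """You are assisting with a literature review.
Context (abstract): {abstract}

Task: list the datasets and instruments mentioned above.
Answer as a bulleted list; say "none mentioned" if there are none."""

abstract = ("We analyze aerosol optical depth over the Sahara using "
            "MODIS observations and the MERRA-2 reanalysis dataset.")

prompt = TEMPLATE.format(abstract=abstract)
print(prompt)  # send this string to any chat-completion API
```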

Generative Capabilities of LLMs (Framework)

This guide describes a framework for programmatically generating data templates that enhance the capabilities of large language models. Its authors, researchers from Stanford University, set out to develop a solution for fine-tuning models while avoiding the problems of datasets assembled manually or generated with LLMs (particularly problems involving personal data).

The researchers describe the framework's development in detail, so anyone interested can use the paper as a guide to preparing their own training datasets. The framework is built around representing datasets as Python functions. To boost efficiency across multiple tasks, it combines data from different sets using an optimization algorithm that evaluates each resulting model's accuracy on downstream tasks.
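To make the "datasets as Python functions" idea concrete, here is a hedged toy sketch; the function names, record schema, and the fixed blend at the end are our simplifications, since the real framework weights sources with its optimization step:

```python
import random

def qa_dataset(n: int, seed: int = 0) -> list[dict]:
    """Toy question-answering examples generated from a template."""
    rng = random.Random(seed)
    return [{"prompt": f"Q: What is {a} + {b}?\nA:", "completion": f" {a + b}"}
            for a, b in ((rng.randint(1, 99), rng.randint(1, 99))
                         for _ in range(n))]

def entity_dataset(n: int, seed: int = 0) -> list[dict]:
    """Toy entity-matching examples: do two strings name the same company?"""
    rng = random.Random(seed)
    names = ["Acme Corp", "ACME Corporation", "Globex", "Globex Inc."]
    out = []
    for _ in range(n):
        a, b = rng.sample(names, 2)
        same = a.lower()[:4] == b.lower()[:4]  # crude toy matching rule
        out.append({"prompt": f"Are '{a}' and '{b}' the same entity?\nA:",
                    "completion": " yes" if same else " no"})
    return out

# A fixed blend of two programmatic sources (the framework optimizes this).
train_set = qa_dataset(200) + entity_dataset(100)
```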

The authors also provide example templates for various tasks: question answering, entity matching, and logical reasoning. In addition, the guide shows how template development can be automated using GPT-4. The material is relatively new, yet the authors note that fine-tuning on data generated within this framework can improve LLM performance by more than 52 points.
