From context to jurisdiction: 7 key parameters when choosing an LLM for your project

A year ago, it seemed that simply choosing GPT-4 was enough to solve all AI problems. Today, the language model market resembles a zoo, with new exotic species appearing every day. Claude, Gemini, Mistral, Qwen — and this is just the tip of the iceberg.

Hello, tekkix! I am Sergey, the product manager of the AI direction at Bitrix24. For the past year, we have been actively implementing neural networks into our product, and I want to share some experiences that can save you time and money.

It turns out that choosing the right neural network model is an art in itself. Context sizes, licenses, language support, access methods: these parameters can make your head spin. But understanding them is critical if you don't want to waste a lot of time and money.

A year ago, it seemed that ChatGPT (based on gpt-3.5 and gpt-4) was the pinnacle of language model evolution. But the AI market is developing at a furious pace, and the situation has changed dramatically.

A bunch of new players have appeared, and some of them already outperform ChatGPT in certain tasks. The Russian market is not lagging behind either: YandexGPT, GigaChat, T-lite from T-Bank. And this is where the most interesting (read: most difficult) part begins: how do you choose the one that suits you in this zoo of neural networks? In this article, I will tell you what you really need to pay attention to when choosing an LLM for your business or project. We will go through the key characteristics of the models, their pros and cons, and the pitfalls you might not have suspected.

The parameters by which all neural networks can be compared

1. Context window size

The context window is probably the most well-known parameter of a neural network, the one everyone talks about. It is the maximum amount of data (text or, more precisely, the tokens into which the text is split) that the model can process in a single request.

Note: 1000 tokens is approximately 750 English words (this was the case for gpt-3.5, the count for different models may vary).
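
You don't have to guess at token counts: the tokenizer itself will tell you. A minimal sketch with the tiktoken library (the sample text is arbitrary):

```python
import tiktoken  # pip install tiktoken

# Pick the encoding used by the target model (gpt-3.5-turbo, as in the note above).
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

text = "The context window is the maximum amount of data the model processes in one request."
print(f"{len(text.split())} words -> {len(enc.encode(text))} tokens")
```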

There are models that support a large context window - the leader now is Gemini 1.5, accommodating 2 million tokens.

There are also neural networks with a smaller context. For example, gemma-2-27b supports a context length of only 8192 tokens.

It is also important to consider HOW models work with long texts and whether they pass the so-called "needle in a haystack" tests. For example, the Anthropic team has optimized Claude for this: with a context window of 200 thousand tokens, it "loses" significantly less information from the middle than other models do.

But that's not all. It is important to pay attention to how many tokens the model can output as a response. For example, gpt-4o-mini outputs 16 thousand tokens with a total context window of 128 thousand, while o1-preview outputs 32 thousand tokens as a response, with a total capacity of 128 thousand tokens.

In the product: For routine text writing, the standard 8000 tokens are enough (many solutions have this capacity). But if you need the neural network to work with large files, for example a script, a long dialogue, or a book in PDF format, 8000 tokens are no longer enough, and you will have to choose a solution with a larger context.

2. Data Processing Speed

Data processing speed is, roughly speaking, the rate at which the neural network provides responses to queries.

Usually, within a vendor's lineup, fast models are inferior in quality, while "smart" ones are inferior in speed. For example, GPT-4o outputs about 80 tokens per second, Gemini Flash about 150. So a 1000-token response (for example, a job template for a prompt engineer) will take the first neural network about 12.5 seconds to write, and the second just under 7.

In the product: For some tasks, an instant response is not as important as for others. For example, in scenarios of processing large documents, you send a file and wait for the result, whether the wait lasts 5 or 7 seconds is not so important. But in working with chat scenarios, synchronous dialogue translation, or creating subtitles, the user expects to see an instant result.
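
Throughput is easy to measure yourself. A rough sketch using the OpenAI streaming API (counting one streamed chunk as roughly one token, which is an approximation):

```python
import time
from openai import OpenAI  # pip install openai

client = OpenAI()  # expects OPENAI_API_KEY in the environment

start = time.monotonic()
tokens = 0
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a job template for a prompt engineer."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        tokens += 1  # each content chunk carries roughly one token

print(f"~{tokens / (time.monotonic() - start):.0f} tokens/sec")
```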

3. Language Support

When choosing a neural network, it is worth considering which languages it supports and how well it supports them.

Usually, models are trained on a specific set of languages. Even for gpt-4o, it is a limited (albeit quite large) set. In the recently released Llama 3.2, eight languages are officially supported: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. In these languages the neural network gives its highest-quality responses. This does not mean that it will not work in others, but:

  1. As a rule, models follow instructions worse in a language they were not trained on. That is, the number of errors will be higher and the quality of responses lower.

  2. Due to a suboptimal tokenizer (the tool that breaks text into tokens), more tokens are spent on requests in an unfamiliar language. So, in addition to quality, this also affects the cost of using the model.
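
The tokenizer overhead is easy to see for yourself. A small sketch comparing token counts for the same request in different languages (the sample phrases are mine; exact numbers depend on the model's tokenizer):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")  # gpt-4o's tokenizer

samples = {
    "English": "Please summarize this contract in three sentences.",
    "Russian": "Пожалуйста, кратко изложите этот договор в трёх предложениях.",
    "Thai": "กรุณาสรุปสัญญานี้ในสามประโยค",
}
for lang, text in samples.items():
    print(f"{lang}: {len(enc.encode(text))} tokens")
```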

In the product: If you are going to work with the Chinese market, it is better to specifically choose networks that are well trained to work with Chinese. Conversely, if you choose a model not trained for the target market, you need to take these risks into account.

4. Multimodality

Modality, in the context of LLMs, is the type or format of input data that the model can process. For example, text is one modality, images are another. Some models, like gpt-3.5 Turbo, can only process text; others, like gpt-4o, can handle text, images, and audio.

A year ago, models with a single modality (text, images, or sound) were the norm. Now many large models work simultaneously with images and audio, and some even with video.

In the product: Multimodal models are good for cases where, for example, you need image analysis followed by a dialogue with the LLM about "what it saw". They are convenient because no additional development work is required to handle different types of data. However, they usually cost more.
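
For illustration, here is roughly what a multimodal request to gpt-4o looks like via the OpenAI API (the image URL is a placeholder; base64-encoded images are also accepted):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```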

5. Adaptation for solving specific types of tasks

There are models tailored for writing code, higher mathematics, working with databases, documents. There are those that perform well in CRM scenarios, technical support, or marketing.

Finding the right solution for your type of task without digging through forums and specialized chats is quite a challenge, but in general you can rely on independent benchmarks, for example, the LLM Benchmarks from Trustbit, which include separate measurements for various product scenarios. At the same time, you cannot blindly trust the quality assessments from the model developers themselves, because we all know how much "marketing" goes into them.

In the product: Well, it's simple – we determine what type of tasks we need to solve, and go look at the benchmarks.

Note: If you want a general benchmark/leaderboard for evaluating the overall "level" of a model, the de facto standard is the Chatbot Arena LLM Leaderboard, where the leader is chosen by blind user voting between answers from different models.

6. Additional features available "out of the box"

Some models come with features that could also be implemented through additional development with other models; nevertheless, their presence can influence the choice, since it literally simplifies certain tasks.

This includes function calling, working with files (aka RAG), setting the response format, and built-in code execution.

  • Let's start with function calling, available in GPT-4, Claude, and Gemini.

We describe a specific set of functions for the model: their names and the parameters needed to execute them. While processing a request, the model independently decides whether to call any of the functions we specified, and if so, it forms a JSON with the data needed to call that function. We, in turn, execute the function and return the result to the model.

The simplest example: a user asks about the weather in Nizhny Novgorod.

The neural network sees that there is a suitable function and signals that we need to call the "weather request" function with the data "Location: Nizhny Novgorod". We (from our code) contact the weather service and return the answer to the model; it processes the service's response and answers the user's question.
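
Here is a sketch of this exact scenario via the OpenAI API (the get_weather function and the stubbed weather service response are illustrative):

```python
import json
from openai import OpenAI

client = OpenAI()

# Describe the "weather request" function for the model.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

messages = [{"role": "user", "content": "What is the weather in Nizhny Novgorod?"}]
response = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)

call = response.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)  # e.g. {"location": "Nizhny Novgorod"}

# Here we would contact a real weather service; the value below is a stand-in.
weather = f"+5°C, light rain in {args['location']}"

messages.append(response.choices[0].message)
messages.append({"role": "tool", "tool_call_id": call.id, "content": weather})
final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
print(final.choices[0].message.content)
```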

  • The ability to add files to get answers from them (Retrieval).

This allows the model to extract information from uploaded files (for example, a knowledge base) during the dialogue. It is the very RAG everyone is talking about, only on a minimal scale: we put files into a special storage (defined by the AI vendor), and the model, for example GPT-4, answers with the information in them taken into account.

  • Structured response format

If the model supports this feature, it can provide answers organized in a predefined way.

Example: output structured in JSON format. JSON-formatted responses are convenient when further processing in code is planned (i.e., almost always), since such responses are easier to work with.

In the latest gpt-4o, the developers have gone further and let you enforce a specific JSON response schema.
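
A sketch of what a forced schema looks like in the API (the "deal" schema is an invented example):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",  # structured outputs need a recent gpt-4o snapshot
    messages=[{"role": "user", "content": "Extract the deal: Acme buys 10 licenses for $500."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "deal",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "customer": {"type": "string"},
                    "quantity": {"type": "integer"},
                    "amount_usd": {"type": "number"},
                },
                "required": ["customer", "quantity", "amount_usd"],
                "additionalProperties": False,
            },
        },
    },
)
print(response.choices[0].message.content)  # guaranteed to match the schema
```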

What else is worth considering, even though it did not quite fit into the list of parameters:

Compatibility with the OpenAI API

This determines whether your product can talk to the model through the same API as used for GPT. Nowadays solutions often add compatibility with the OpenAI API, and it has effectively become an industry standard. If there is no compatibility, additional difficulties arise: you will have to implement the API integration yourself, and you will not be able to quickly switch to another model.
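
In practice, compatibility means you often only change the base URL. A sketch assuming a local OpenAI-compatible server (the address and model name are placeholders):

```python
from openai import OpenAI

# The same client can point at any OpenAI-compatible endpoint,
# e.g. a vLLM or Ollama server running inside your own perimeter.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="qwen2.5-7b-instruct",  # whatever model the server exposes
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```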


MaaS vs hosting

The next parameter turned out to be so broad that it deserves a detailed look.

By the type of model access, they can be divided into two large groups:

  1. MaaS (model as a service)

  2. Self-deployment

Let's talk about the advantages and disadvantages of each option:

If the model is provided by the vendor and implemented as a cloud service (MaaS)

Such a neural network is implemented as a cloud service and provides access to pre-trained AI models via API. Users pay for actual use of the models, most often for the tokens consumed by requests and responses.

When a neural network is deployed on the vendor's servers and maintained by them, developers get an API through which data can be sent for processing and results retrieved. The vendor processes and stores user data within the framework of the law and may also train models on it. Such retraining does not always happen; you need to check what the vendor declares. (For example, OpenAI states everywhere that "we do not train models on your data".)

Advantages of vendor models:

  • So far, all the top models are released by vendors. You can use an improved model in your product immediately after release.

  • You can start working quickly and conveniently, without setting up equipment.

  • It is easy to scale resources and pay for each user request (pay as you go).

  • You get quick access to the model almost worldwide, and you can get qualified technical support.

  • The services improve constantly, at no extra investment on your side.

Disadvantages of such models:

  • Your data is received by a third party. This is sensitive for some types of businesses and for government companies that fear leaks.

  • You depend on the supplier, their policies, and their activities.

  • Customization and optimization for the company's needs are limited. The solution may not meet some unique business requirements.

We can compare MaaS models by several parameters:

1. Cost

The cost of incoming and outgoing requests may vary.

For example: when working with Claude 3.5 Sonnet, the user pays $3 for 1 million incoming tokens and $15 for 1 million outgoing tokens.

This is important because a large input may result in a small output or vice versa.

In the product:

Suppose the total context is 5000 tokens. The distribution can be as follows:

200 incoming tokens / 4800 outgoing tokens — for example, if you need to draft a contract.

4800 incoming / 200 outgoing — if you want to summarize a contract.
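
At the Claude 3.5 Sonnet prices above, the two distributions differ in cost by roughly four times; a quick check:

```python
# Claude 3.5 Sonnet pricing from above: $3 / $15 per million input / output tokens.
PRICE_IN, PRICE_OUT = 3 / 1_000_000, 15 / 1_000_000

def cost(tokens_in: int, tokens_out: int) -> float:
    return tokens_in * PRICE_IN + tokens_out * PRICE_OUT

print(f"draft a contract (200 in / 4800 out): ${cost(200, 4800):.4f}")      # $0.0726
print(f"summarize a contract (4800 in / 200 out): ${cost(4800, 200):.4f}")  # $0.0174
```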

2. The ability to delay request processing

If your task does not require an immediate answer, you can send requests in batches that the neural network processes asynchronously. That is, we make a request now and receive a response within, say, 12 to 24 hours.

Yes, it will take more time, but it will save up to 50% of the budget — an excellent option for non-urgent tasks.
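
OpenAI, for example, calls this mode the Batch API. A minimal sketch (assuming requests.jsonl has been prepared in advance):

```python
from openai import OpenAI

client = OpenAI()

# requests.jsonl holds one JSON request per line, e.g.:
# {"custom_id": "1", "method": "POST", "url": "/v1/chat/completions",
#  "body": {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "..."}]}}
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # results arrive within 24 hours at ~50% of the usual price
)
print(batch.id, batch.status)
```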

3. Country of jurisdiction and other legal aspects

Depending on where the neural network is legally registered and where we use it, different legal norms apply. The use of neural networks may also fall under sanctions restrictions; for example, the Claude model currently does not work with the Russian Federation: it blocks requests from Russian IPs and does not accept payments from Russian cards.

4. The ability to fine-tune models on your data (fine-tuning)

Fine-tuning lets you adjust the model, adapting it to your needs and reducing costs. This capability is provided, for example, by gpt-4o.

Given a correct setup and well-collected training data, the model gives more accurate and relevant answers. Using the fine-tuned version of the model is more expensive, and the training itself is paid separately (the cost depends on the amount of data submitted for training).
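
For reference, starting a fine-tuning job via the OpenAI API looks roughly like this (assuming a prepared training.jsonl with chat-formatted examples):

```python
from openai import OpenAI

client = OpenAI()

# training.jsonl holds examples like:
# {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
train_file = client.files.create(file=open("training.jsonl", "rb"), purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    training_file=train_file.id,
    model="gpt-4o-2024-08-06",  # a fine-tunable snapshot; check the vendor's current list
)
print(job.id, job.status)  # a finished job yields a custom model name you can call
```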

If the neural network is deployed on your servers (hosted model)

You can deploy the model on your own server if its capacity allows. Requests will not leave the perimeter in which the model is deployed, whether on-premises or in your cloud. Such models are open source, so they are a little less of a "black box" than ChatGPT: the weights are publicly available.

In this case, the costs will include payment for the physical server or cloud, as well as for the support and administration of such a model. The monthly rent of a server with an A100 graphics card with 40 GB in many large cloud services in Russia now costs about 200 thousand rubles.

Advantages of hosted models:

  • Requests are not sent to third parties on remote servers, so the data is easier to control. You can protect information from unauthorized access and strengthen the protection of confidential data.

  • You can use specific tools and libraries that are not available from vendors, and implement specialized models and algorithms based on business requirements.

  • If it is cost-effective for the business, you can save money by using your own servers.

Disadvantages of such models:

  • This option does not always pay off. It requires significant costs for purchasing and installing servers and for creating and maintaining the infrastructure.

  • Specialists with high technical expertise are required, and you have to update the models yourself.

Neural networks hosted on your own servers differ:

1. By the amount of resources required for operation (physical limitations)

It is difficult to say in advance how many resources a neural network will require. However, when choosing a model, you can evaluate its main characteristics: the number of parameters, the level of quantization, and response speed. If you plan to use servers with graphics cards, you need to carefully evaluate their cost and generation.

Usually, the model documentation indicates how many resources are needed for the models to work at a certain level of compression (quantization).

So, to run Qwen2.5-72B with Int4 quantization at a speed of 11.32 tokens/sec, you will need 48.86 GB of GPU memory for a single stream.
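
A back-of-envelope estimate helps sanity-check such figures: the weights take parameters times bytes per parameter, plus overhead for the KV cache and activations (the 20% overhead below is my rough assumption):

```python
def vram_gb(params_billion: float, bits_per_param: int, overhead: float = 0.20) -> float:
    weights_gb = params_billion * bits_per_param / 8  # 1B params at 8 bits ~= 1 GB
    return weights_gb * (1 + overhead)

print(f"72B at Int4: ~{vram_gb(72, 4):.0f} GB")   # ~43 GB, close to the 48.86 GB above
print(f"72B at FP16: ~{vram_gb(72, 16):.0f} GB")  # ~173 GB: needs several GPUs
```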

2. By license — the model's license may allow tasks only for educational or cultural purposes and prohibit commercial use.

Examples of licenses:

  • Mistral Large 2 has a Mistral Research license — it can only be used for research and education. For commercial use, a separate license purchase is required.

  • Qwen2.5's license allows commercial use if you have up to 100 million active users per month.

  • The Chinese LLM Yi-1.5-34B-Chat ships under Apache 2.0, with almost no restrictions.

Trends to watch

And if the parameters are now clear, here are a few trends that will change the LLM market in the near future.

Hosted models are approaching the capabilities of vendor models

A year ago, this was not the case, but gradually models on local servers began to give good results, close to the top vendor solutions.

Other major players are becoming popular in the market

Until recently, all solutions were built around OpenAI. Now the company has at least two major competitors: Gemini from Google and Claude from Anthropic, founded by former OpenAI employees. Both neural networks are making headlines. For example, in April Claude 3 Opus displaced ChatGPT from the top of the leaderboards for several weeks. In early August, another rare event occurred: Gemini 1.5 Pro led the arena rankings, surpassing ChatGPT-4o and Claude 3.5 Sonnet.

Pricing is becoming more affordable

We consistently see the cost of paid access to neural networks go down, which allows them to be used more often. For example, OpenAI keeps cutting its prices.

The version of gpt-4o released in early August costs $2.5 per million incoming tokens and $10 per million outgoing tokens: half the previous input price and a third off the output price. Gemini 1.5 Flash from Google has also become cheaper: the cost of incoming tokens has decreased by 78%, and outgoing tokens by 71%.

So, we have analyzed the key parameters for choosing an LLM for business tasks. I hope now you can find the optimal model without spending extra.

In conclusion, I want to leave a list of questions that I recommend answering before choosing a model for your goal:

  1. What task do you want to solve with LLM?

  2. Does the task involve processing large amounts of data?

  3. How important is the model's response speed for the task?

  4. What languages should the model work in? In which region will it be applied?

  5. Is it necessary to process different types of data (multimodality)?

  6. How specific is the task?

  7. Do you need the ability to further train the model on your data?

  8. How strict are the requirements for data security and confidentiality?

  9. What is your budget for implementing and using the model?

  10. Is compliance with legal norms and standards required (jurisdiction, licenses)?

  11. Do you have the necessary technical resources and expertise for self-deployment?

Any questions or something to share? I would be happy to discuss in the comments.

If you are interested in learning more about the practical application of AI in work (and not only) tasks, I invite you to my Telegram channel.
