From context to jurisdiction: 7 key parameters when choosing an LLM for your project

A year ago, it seemed that simply choosing GPT-4 was enough to solve all AI problems. Today, the language model market resembles a zoo, with new exotic species appearing every day. Claude, Gemini, Mistral, Qwen — and this is just the tip of the iceberg.

Hello, tekkix! I am Sergey, the product manager of the AI direction at Bitrix24. For the past year, we have been actively implementing neural networks into our product, and I want to share some experiences that can save you time and money.

It turns out that choosing the right neural network model is an art in itself. Context sizes, licenses, language support, access methods: these parameters can make your head spin. But understanding them is critical if you don't want to waste a lot of time and money.

A year ago, it seemed that ChatGPT (based on gpt-3.5 and gpt-4) was the pinnacle of language model evolution. But the AI market is developing at a furious pace, and the situation has changed dramatically.

A bunch of new players have appeared, and some of them already outperform ChatGPT in certain tasks. The Russian market is not lagging behind either: YandexGPT, GigaChat, T-lite from T-Bank. And this is where the most interesting (read: most difficult) part begins: how do you choose the one that suits you in this zoo of neural networks? In this article, I will tell you what you really need to pay attention to when choosing an LLM for your business or project. We will go through the key characteristics of the models, their pros and cons, and the pitfalls you might not have suspected.

The parameters by which all neural networks can be compared

1. Context window size

The context window is probably the most well-known parameter of a neural network, the one everyone talks about. It is the maximum amount of data (text or, more precisely, the tokens into which the text is split) that the model can process in a single request.

Note: 1000 tokens is approximately 750 English words (this was the case for gpt-3.5, the count for different models may vary).
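
You don't have to guess at token counts: the tokenizer itself will tell you. A minimal sketch with the tiktoken library (the sample text is arbitrary):

```python
import tiktoken  # pip install tiktoken

# Pick the encoding used by the target model (gpt-3.5-turbo, as in the note above).
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

text = "The context window is the maximum amount of data the model processes in one request."
print(f"{len(text.split())} words -> {len(enc.encode(text))} tokens")
```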

There are models that support a large context window - the leader now is Gemini 1.5, accommodating 2 million tokens.

There are also neural networks with a smaller context. For example, gemma-2-27b supports a context length of only 8192 tokens.

It is also important to consider HOW models work with long texts and whether they pass the so-called "needle in a haystack" tests. For example, the Anthropic team has optimized Claude for this: with a context window of 200 thousand tokens, it "loses" significantly less information from the middle than other models do.

But that's not all. It is important to pay attention to how many tokens the model can output as a response. For example, gpt-4o-mini outputs 16 thousand tokens with a total context window of 128 thousand, while o1-preview outputs 32 thousand tokens as a response, with a total capacity of 128 thousand tokens.

In the product: For routine text writing, the standard 8000 tokens are enough (many solutions have this capacity). But if you need the neural network to work with large files, for example a script, a long dialogue, or a book in PDF format, 8000 tokens are no longer enough, and you will have to choose a solution with a larger context.

2. Data Processing Speed

Data processing speed is, roughly speaking, the rate at which the neural network provides responses to queries.

Usually, within a vendor's lineup, fast models are inferior in quality, while "smart" ones are inferior in speed. For example, GPT-4o outputs about 80 tokens per second, Gemini Flash about 150. So a 1000-token response (for example, a job template for a prompt engineer) will take the first neural network about 12.5 seconds to write, and the second just under 7.

In the product: For some tasks, an instant response is not as important as for others. For example, in scenarios of processing large documents, you send a file and wait for the result, whether the wait lasts 5 or 7 seconds is not so important. But in working with chat scenarios, synchronous dialogue translation, or creating subtitles, the user expects to see an instant result.
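
Throughput is easy to measure yourself. A rough sketch using the OpenAI streaming API (counting one streamed chunk as roughly one token, which is an approximation):

```python
import time
from openai import OpenAI  # pip install openai

client = OpenAI()  # expects OPENAI_API_KEY in the environment

start = time.monotonic()
tokens = 0
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a job template for a prompt engineer."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        tokens += 1  # each content chunk carries roughly one token

print(f"~{tokens / (time.monotonic() - start):.0f} tokens/sec")
```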

3. Language Support

When choosing a neural network, it is worth considering which languages it supports and how well it supports them.

Usually, models are trained on a specific set of languages. Even for gpt-4o, it is a limited (albeit quite large) set. In the recently released Llama 3.2, eight languages are officially supported: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. In these languages the neural network gives its highest-quality responses. This does not mean that it will not work in others, but:

  1. As a rule, models follow instructions worse in a language they were not trained on. That is, the number of errors will be higher and the quality of responses lower.

  2. Due to a suboptimal tokenizer (the tool that breaks text into tokens), more tokens are spent on requests in an unfamiliar language. So, in addition to quality, this also affects the cost of using the model.
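
The tokenizer overhead is easy to see for yourself. A small sketch comparing token counts for the same request in different languages (the sample phrases are mine; exact numbers depend on the model's tokenizer):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")  # gpt-4o's tokenizer

samples = {
    "English": "Please summarize this contract in three sentences.",
    "Russian": "Пожалуйста, кратко изложите этот договор в трёх предложениях.",
    "Thai": "กรุณาสรุปสัญญานี้ในสามประโยค",
}
for lang, text in samples.items():
    print(f"{lang}: {len(enc.encode(text))} tokens")
```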

In the product: If you are going to work with the Chinese market, it is better to specifically choose networks that are well trained to work with Chinese. Conversely, if you choose a model not trained for the target market, you need to take these risks into account.

4. Multimodality

Modality, in the context of LLMs, is the type or format of input data that the model can process. For example, text is one modality, images are another. Some models, like gpt-3.5 Turbo, can only process text; others, like gpt-4o, can handle text, images, and audio.

A year ago, models with a single modality (text, images, or sound) were the norm. Now many large models work simultaneously with images and audio, and some even with video.

In the product: Multimodal models are good for cases where, for example, you need image analysis followed by a dialogue with the LLM about "what it saw". They are convenient because no additional development work is required to handle different types of data. However, they usually cost more.
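
For illustration, here is roughly what a multimodal request to gpt-4o looks like via the OpenAI API (the image URL is a placeholder; base64-encoded images are also accepted):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```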

5. Adaptation for solving specific types of tasks

There are models tailored for writing code, higher mathematics, working with databases, documents. There are those that perform well in CRM scenarios, technical support, or marketing.

Finding the right solution for your type of task without digging through forums and specialized chats is quite a challenge, but in general you can rely on independent benchmarks, for example, the LLM Benchmarks from Trustbit, which include separate measurements for various product scenarios. At the same time, you cannot blindly trust the quality assessments from the model developers themselves, because we all know how much "marketing" goes into them.

In the product: Well, it's simple – we determine what type of tasks we need to solve, and go look at the benchmarks.

Note: If you want a general benchmark/leaderboard for evaluating the overall "level" of a model, the de facto standard is the Chatbot Arena LLM Leaderboard, where the leader is chosen by blind user voting between answers from different models.

6. Additional features available "out of the box"

Some models come with features that could also be implemented through additional development with other models; nevertheless, their presence can influence the choice, since it literally simplifies certain tasks.

This includes function calling, working with files (aka RAG), setting the response format, and built-in code execution.

  • Let's start with function calling, available in GPT-4, Claude, and Gemini.

We describe a specific set of functions for the model: their names and the parameters needed to execute them. While processing a request, the model independently decides whether to call any of the functions we specified, and if so, it forms a JSON with the data needed to call that function. We, in turn, execute the function and return the result to the model.

The simplest example: a user asks about the weather in Nizhny Novgorod.

The neural network sees that there is a suitable function and signals that we need to call the "weather request" function with the data "Location: Nizhny Novgorod". We (from our code) contact the weather service and return the answer to the model; it processes the service's response and answers the user's question.
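
Here is a sketch of this exact scenario via the OpenAI API (the get_weather function and the stubbed weather service response are illustrative):

```python
import json
from openai import OpenAI

client = OpenAI()

# Describe the "weather request" function for the model.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

messages = [{"role": "user", "content": "What is the weather in Nizhny Novgorod?"}]
response = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)

call = response.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)  # e.g. {"location": "Nizhny Novgorod"}

# Here we would contact a real weather service; the value below is a stand-in.
weather = f"+5°C, light rain in {args['location']}"

messages.append(response.choices[0].message)
messages.append({"role": "tool", "tool_call_id": call.id, "content": weather})
final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
print(final.choices[0].message.content)
```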

  • The ability to add files to get answers from them (Retrieval).

This allows the model to extract information from uploaded files (for example, a knowledge base) during the dialogue. It is the very RAG everyone is talking about, only on a minimal scale: we put files into a special storage (defined by the AI vendor), and the model, for example GPT-4, answers with the information in them taken into account.

  • Structured response format

If the model supports this feature, it can provide answers organized in a predefined way.

Example: output structured in JSON format. JSON-formatted responses are convenient when further processing in code is planned (i.e., almost always), since such responses are easier to work with.

In the latest gpt-4o, the developers have gone further and let you enforce a specific JSON response schema.
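
A sketch of what a forced schema looks like in the API (the "deal" schema is an invented example):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",  # structured outputs need a recent gpt-4o snapshot
    messages=[{"role": "user", "content": "Extract the deal: Acme buys 10 licenses for $500."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "deal",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "customer": {"type": "string"},
                    "quantity": {"type": "integer"},
                    "amount_usd": {"type": "number"},
                },
                "required": ["customer", "quantity", "amount_usd"],
                "additionalProperties": False,
            },
        },
    },
)
print(response.choices[0].message.content)  # guaranteed to match the schema
```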

What else is worth considering, even though it did not quite fit into the list of parameters:

Compatibility with the OpenAI API

This determines whether your product can talk to the model through the same API as used for GPT. Nowadays solutions often add compatibility with the OpenAI API, and it has effectively become an industry standard. If there is no compatibility, additional difficulties arise: you will have to implement the API integration yourself, and you will not be able to quickly switch to another model.
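
In practice, compatibility means you often only change the base URL. A sketch assuming a local OpenAI-compatible server (the address and model name are placeholders):

```python
from openai import OpenAI

# The same client can point at any OpenAI-compatible endpoint,
# e.g. a vLLM or Ollama server running inside your own perimeter.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="qwen2.5-7b-instruct",  # whatever model the server exposes
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```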


MaaS vs hosting

The next parameter turned out to be so broad that it deserves a detailed look.

By the type of model access, they can be divided into two large groups:

  1. MaaS (model as a service)

  2. Self-deployment

Let's talk about the advantages and disadvantages of each option:

If the model is provided by the vendor and implemented as a cloud service (MaaS)

Such a neural network is implemented as a cloud service and provides access to pre-trained AI models via API. Users pay for actual use of the models, most often for the tokens consumed by requests and responses.

When a neural network is deployed on the vendor's servers and maintained by them, developers get an API through which data can be sent for processing and results retrieved. The vendor processes and stores user data within the framework of the law and may also train models on it. Such retraining does not always happen; you need to check what the vendor declares. (For example, OpenAI states everywhere that "we do not train models on your data".)

Advantages of vendor models:

  • So far, all the top models are released by vendors. You can use an improved model in your product immediately after release.

  • You can start working quickly and conveniently, without setting up equipment.

  • It is easy to scale resources and pay for each user request (pay as you go).

  • You get quick access to the model almost worldwide, and you can get qualified technical support.

  • The services improve constantly, at no extra investment on your side.

Disadvantages of such models:

  • Your data is received by a third party. This is sensitive for some types of businesses and for government companies that fear leaks.

  • You depend on the supplier, their policies, and their activities.

  • Customization and optimization for the company's needs are limited. The solution may not meet some unique business requirements.

We can compare MaaS models by several parameters:

1. Cost

The cost of incoming and outgoing requests may vary.

For example: when working with Claude 3.5 Sonnet, the user pays $3 for 1 million incoming tokens and $15 for 1 million outgoing tokens.

This is important because a large input may result in a small output or vice versa.

In the product:

Suppose the total context is 5000 tokens. The distribution can be as follows:

200 incoming tokens / 4800 outgoing tokens — for example, if you need to draft a contract.

4800 incoming / 200 outgoing — if you want to summarize a contract.
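
At the Claude 3.5 Sonnet prices above, the two distributions differ in cost by roughly four times; a quick check:

```python
# Claude 3.5 Sonnet pricing from above: $3 / $15 per million input / output tokens.
PRICE_IN, PRICE_OUT = 3 / 1_000_000, 15 / 1_000_000

def cost(tokens_in: int, tokens_out: int) -> float:
    return tokens_in * PRICE_IN + tokens_out * PRICE_OUT

print(f"draft a contract (200 in / 4800 out): ${cost(200, 4800):.4f}")      # $0.0726
print(f"summarize a contract (4800 in / 200 out): ${cost(4800, 200):.4f}")  # $0.0174
```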

2. The ability to delay request processing

If your task does not require an immediate answer, you can send requests in batches that the neural network processes asynchronously. That is, we make a request now and receive a response within, say, 12 to 24 hours.

Yes, it will take more time, but it will save up to 50% of the budget — an excellent option for non-urgent tasks.
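
OpenAI, for example, calls this mode the Batch API. A minimal sketch (assuming requests.jsonl has been prepared in advance):

```python
from openai import OpenAI

client = OpenAI()

# requests.jsonl holds one JSON request per line, e.g.:
# {"custom_id": "1", "method": "POST", "url": "/v1/chat/completions",
#  "body": {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "..."}]}}
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # results arrive within 24 hours at ~50% of the usual price
)
print(batch.id, batch.status)
```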

3. Country of jurisdiction and other legal aspects

Depending on where the neural network is legally registered and where we use it, different legal norms apply. The use of neural networks may also fall under sanctions restrictions; for example, the Claude model currently does not work with the Russian Federation: it blocks requests from Russian IPs and does not accept payments from Russian cards.

4. The ability to fine-tune models on your data (fine-tuning)

Fine-tuning lets you adjust the model, adapting it to your needs and reducing costs. This capability is provided, for example, by gpt-4o.

Given a correct setup and well-collected training data, the model gives more accurate and relevant answers. Using the fine-tuned version of the model is more expensive, and the training itself is paid separately (the cost depends on the amount of data submitted for training).
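
For reference, starting a fine-tuning job via the OpenAI API looks roughly like this (assuming a prepared training.jsonl with chat-formatted examples):

```python
from openai import OpenAI

client = OpenAI()

# training.jsonl holds examples like:
# {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
train_file = client.files.create(file=open("training.jsonl", "rb"), purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    training_file=train_file.id,
    model="gpt-4o-2024-08-06",  # a fine-tunable snapshot; check the vendor's current list
)
print(job.id, job.status)  # a finished job yields a custom model name you can call
```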

If the neural network is deployed on your servers (hosted model)

You can deploy the model on your own server if its capacity allows. Requests will not leave the perimeter in which the model is deployed, whether on-premises or in your cloud. Such models are open source, so they are a little less of a "black box" than ChatGPT: the weights are publicly available.

In this case, the costs will include payment for the physical server or cloud, as well as for the support and administration of such a model. The monthly rent of a server with an A100 graphics card with 40 GB in many large cloud services in Russia now costs about 200 thousand rubles.

Advantages of hosted models:

  • Requests are not sent to third parties on remote servers, so the data is easier to control. You can protect information from unauthorized access and strengthen the protection of confidential data.

  • You can use specific tools and libraries that are not available from vendors, and implement specialized models and algorithms based on business requirements.

  • If it is cost-effective for the business, you can save money by using your own servers.

Disadvantages of such models:

  • This option does not always pay off. It requires significant costs for purchasing and installing servers and for creating and maintaining the infrastructure.

  • Specialists with high technical expertise are required, and you have to update the models yourself.

Neural networks hosted on your own servers differ:

1. By the amount of resources required for operation (physical limitations)

It is difficult to say in advance how many resources a neural network will require. However, when choosing a model, you can evaluate its main characteristics: the number of parameters, the level of quantization, and response speed. If you plan to use servers with graphics cards, you need to carefully evaluate their cost and generation.

Usually, the model documentation indicates how many resources are needed for the models to work at a certain level of compression (quantization).

So, to run Qwen2.5-72B with Int4 quantization at a speed of 11.32 tokens/sec, you will need 48.86 GB of GPU memory for a single stream.
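
A back-of-envelope estimate helps sanity-check such figures: the weights take parameters times bytes per parameter, plus overhead for the KV cache and activations (the 20% overhead below is my rough assumption):

```python
def vram_gb(params_billion: float, bits_per_param: int, overhead: float = 0.20) -> float:
    weights_gb = params_billion * bits_per_param / 8  # 1B params at 8 bits ~= 1 GB
    return weights_gb * (1 + overhead)

print(f"72B at Int4: ~{vram_gb(72, 4):.0f} GB")   # ~43 GB, close to the 48.86 GB above
print(f"72B at FP16: ~{vram_gb(72, 16):.0f} GB")  # ~173 GB: needs several GPUs
```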

2. By license — the model's license may allow tasks only for educational or cultural purposes and prohibit commercial use.

Examples of licenses:

  • Mistral Large 2 has a Mistral Research license — it can only be used for research and education. For commercial use, a separate license purchase is required.

  • Qwen2.5's license allows commercial use if you have up to 100 million active users per month.

  • The Chinese LLM Yi-1.5-34B-Chat ships under Apache 2.0, with almost no restrictions.

Trends to watch

And if the parameters are now clear, here are a few trends that will change the LLM market in the near future.

Hosted models are approaching the capabilities of vendor models

A year ago, this was not the case, but gradually models on local servers began to give good results, close to the top vendor solutions.

Other major players are becoming popular in the market

Until recently, all solutions were built around OpenAI. Now the company has at least two major competitors: Gemini from Google and Claude from Anthropic, founded by former OpenAI employees. Both neural networks are making headlines. For example, in April Claude 3 Opus displaced ChatGPT from the top of the leaderboards for several weeks. In early August, another rare event occurred: Gemini 1.5 Pro led the arena rankings, surpassing ChatGPT-4o and Claude 3.5 Sonnet.

Pricing is becoming more affordable

We consistently see the cost of paid access to neural networks go down, which allows them to be used more often. For example, OpenAI keeps cutting its prices.

The version of gpt-4o released in early August costs $2.5 per million incoming tokens and $10 per million outgoing tokens: half the previous input price and a third off the output price. Gemini 1.5 Flash from Google has also become cheaper: the cost of incoming tokens has decreased by 78%, and outgoing tokens by 71%.

So, we have analyzed the key parameters for choosing an LLM for business tasks. I hope now you can find the optimal model without spending extra.

In conclusion, I want to leave a list of questions that I recommend answering before choosing a model for your goal:

  1. What task do you want to solve with LLM?

  2. Does the task involve processing large amounts of data?

  3. How important is the model's response speed for the task?

  4. What languages should the model work in? In which region will it be applied?

  5. Is it necessary to process different types of data (multimodality)?

  6. How specific is the task?

  7. Do you need the ability to further train the model on your data?

  8. How strict are the requirements for data security and confidentiality?

  9. What is your budget for implementing and using the model?

  10. Is compliance with legal norms and standards required (jurisdiction, licenses)?

  11. Do you have the necessary technical resources and expertise for self-deployment?

Any questions or something to share? I would be happy to discuss in the comments.

If you are interested in learning more about the practical application of AI in work (and not only) tasks, I invite you to my Telegram channel.
