Processing a Catalog and Products with LLMs
There is a classic problem on any marketplace related to how to sort and analyze the catalog. It is especially exacerbated by the fact that users or employees create convoluted descriptions even for the simplest products. For example, a regular blue t-shirt can be described as "sky blue," "royal blue," or even "dark blue-aquamarine." Some sellers and suppliers manage to shove product information directly into the image itself, drawing over a poorly lit photo in bright green letters: "The best t-shirt in the world!" As a result, two identical products can appear as if they are from different universes.
Consequently, searching, categorizing, and analyzing these products becomes a real headache (and guarantees job security for data specialists). Moreover, under such conditions it is challenging to match products with current trends, since identifying a trend is also, in essence, a categorization task.
Of course, over the years, various ways have emerged to deal with this:
1. Category-specific matcher. This is a separate "matching" model or algorithm for each product category: for electronics, for clothing, for cosmetics, and so on, down to subcategories. This approach is highly specialized and works, but it can turn into quite a hassle if you have a hundred thousand categories. There are many ways to implement it, usually depending on the attribute in question: for example, Named Entity Recognition for brand names, decision trees, or even a huge if-then-else script.
2. Candidate searching using embeddings. Embeddings are vector representations of data (for example, descriptions or product names) used to determine their similarity. By using text or image processing methods (like word2vec or sentence-transformers), similar products can be found based on the proximity of the resulting embeddings. Embeddings can also be generated with LLMs (a minimal sketch appears after this list).
3. Attribute extraction for each product. Product information (such as brand, model, color, size, etc.) is extracted from descriptions, for example, through regex, to analyze and match products at a deeper level.
4. Gradient boosting. Gradient boosting algorithms (CatBoost, etc.) are applied to classification tasks, determining whether products are similar. These models take into account both textual and numerical attributes, but they require pre-labeled training data.
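A minimal Python sketch of approach 2, candidate search with embeddings, assuming the sentence-transformers package and an illustrative model name:

from sentence_transformers import SentenceTransformer, util

# Example model; swap in one trained or fine-tuned on your own catalog
model = SentenceTransformer("all-MiniLM-L6-v2")

titles = [
    "Eco T-shirt made of hemp",
    "100% eco-friendly plant-based top",
    "Stainless steel chest freezer, 7 cu ft",
]

# Normalized embeddings make cosine similarity a simple dot product
embeddings = model.encode(titles, convert_to_tensor=True, normalize_embeddings=True)

# Pairwise cosine similarity; high scores are match candidates for review
scores = util.cos_sim(embeddings, embeddings)
print(scores[0][1])  # similarity between the two "eco" shirts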
All of this really helps up to a certain point, but people are remarkably inventive in describing things. Feature hell is a reality where there are 400 ways to say "comfortable," and visually identical clothing can be named in completely different ways ("Eco T-shirt made of hemp" vs. "100% eco-friendly plant-based top").
For clothing and similar products, identical items may be described in completely different ways, so matching can only be done from photographs, considering color, shape, and fabric structure; the key information is simply not in the product description.
The new approach is to use the multimodal capabilities of large language models (LLMs) and Vision-Language Models (VLMs) to tackle this problem. This approach is already widely used by leading retailers and platforms:
"We've used multiple large language models to accurately create or improve over 850 million pieces of data in a catalog.
Without the use of generative AI, this work would have required nearly 100 times the current headcount to complete in the same amount of time."
Doug McMillon, CEO of Walmart
The overall workflow (examples below):
Extraction of attributes for each product using LLM or VLM
LLMs are used to extract specific attributes (e.g., brand, color, size, material) from product descriptions and images.
For example, “Stylish bright red cotton t-shirt for men” is broken down into: color=red, material=cotton, target demographic=men, etc. Some attributes may be synthetic, i.e., not directly present in the product description, such as the style of the clothing item or its parts (e.g., the style of the neckline on a sweater). VLMs can determine style with good accuracy.
VLMs can also determine, albeit approximately, the condition of the product if it has been used, based on photos.
Grouping attributes into categories with unique values
There may be too many attributes; moreover, trends usually consist of a specific set of attributes, so grouping them may make sense.
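A minimal Python sketch of such grouping, with a hand-written mapping for illustration; in practice the mapping can be produced by an LLM or by clustering value embeddings:

# Collapse raw attribute values into canonical groups
CANONICAL_COLORS = {
    "sky blue": "blue",
    "royal blue": "blue",
    "dark blue-aquamarine": "blue",
    "pearl white": "white",
}

def normalize_color(raw_value: str) -> str:
    value = raw_value.strip().lower()
    return CANONICAL_COLORS.get(value, value)  # fall back to the raw value

print(normalize_color("Sky Blue"))  # -> "blue"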
Matching products based on extracted attributes using LLM
Once the attributes are extracted, LLMs use them to compare two products and determine if they match.
Models can consider both explicit similarities (e.g., the same brand and size) and implicit ones (e.g., “eco-friendly” and “sustainable”).
For example, matching “Men's Nike Air Max sneakers” with “Nike Air Max shoes for men” by recognizing that they are the same product.
Matching based on images using VLM (Vision-Language Models)
Some attributes, such as color, design, or unique patterns, are better analyzed visually rather than textually. VLMs combine visual and textual data for more accurate product matching. These models analyze product images along with their descriptions to better understand the product. For example: matching an image of a black leather bag with another similar bag by identifying visual characteristics (shape, texture, etc.) and combining them with textual data.
Advantages of LLMs
More accurate matching. Fewer matching errors (for example, when you receive a "turquoise" t-shirt instead of a "dark green" one).
Accuracy in the range of 90-99%, especially with fine-tuned models (both precision and recall).
Handling diverse data. Text, images, random emojis — anything.
Improved understanding of product descriptions and visual characteristics.
No specialized training required: no need to create and maintain 50 different specialized matching algorithms.
Disadvantages
Requires much more computational power and is more expensive.
Latency is not great. Works well for batch processing, but not for real-time.
Overall, for some categories it makes no sense to use an LLM if the job can be done without one. If a category is simple and standard algorithms work well, they will be much cheaper. LLMs are suitable for medium and complex categories. In other words, it is best to use LLMs in combination with standard methods to optimize cost, speed, and quality.
Attribute extraction using LLM
Example prompt:
I have a product card from the category "Refrigerators".
I need to extract and format attributes from it.
extracted_attributes = {
"model": "Refrigerator model. Include the brand or company name, but exclude color and size",
"capacity": "Total volume of the refrigerator, usually measured in liters (L). Look for terms like 'Total capacity'. If not available, set the value to null",
"energy_efficiency": "Extract the energy efficiency class, e.g., 'A++', 'A+' or 'B'. Look for terms like 'Energy efficiency class'. If not available, set the value to null",
...
"dimensions": "Include height, width, and depth (e.g., '180x70x60'). Look for terms like 'Dimensions'. If not available, set the value to null"
}
Result:
extracted_attributes = {
"model": "Upstreman UF70",
"capacity": "7.0 cubic feet",
"energy_efficiency": "ENERGY STAR certified",
"number_of_doors": "1",
"freezer_position": "Convertible",
"defrost_system": "Low frost design",
"dimensions": "56.1x21.3x22.76 inches",
}
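A minimal Python sketch of sending such an extraction prompt to an LLM; the openai client and model name here are illustrative, and any chat-style model that returns JSON will do:

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = """I have a product card from the category "Refrigerators".
Extract the attributes: model, capacity, energy_efficiency, dimensions.
If an attribute is missing, set it to null. Answer with JSON only.

Product card:
{card}"""

def extract_attributes(card_text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model
        messages=[{"role": "user", "content": PROMPT.format(card=card_text)}],
        response_format={"type": "json_object"},  # request strict JSON output
    )
    return json.loads(response.choices[0].message.content)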
Matching using LLM
Example color comparison
Prompt:
There are two product listings from the "Freezers" category on the marketplace. Compare the attribute "color". The answer should include:
is_match: A boolean value indicating whether the values match.
reason: A detailed explanation of the comparison result. The response format is JSON.
Product Listing 1:
Model: Upstreman UF70
Color: White

Product Listing 2:
Model: Midea WHS-109FSS1
Color: Stainless Steel
Result:
{
"is_match": false,
"reason": "The color attributes do not match. Product 1 is 'White' while Product 2 is 'Stainless Steel'."
}
Matching multiple attributes
Prompt:
There are two product listings from the "Freezers" category on the marketplace. Compare their attributes to determine if the two products are similar enough to be considered the same model. The attributes for comparison are:
Model name: Remove spaces for comparison.
Capacity: Matches if the difference is less than or equal to 0.2 cubic feet.
Energy efficiency: Matches if both are ENERGY STAR certified or have identical efficiency ratings.
Dimensions: Match if the height, width, and depth differences are no more than 1 inch.
Color: Matches only if the color is identical.
The answer should include:
is_match: A boolean value indicating whether the two products are considered the same model.
reason: A detailed explanation of the comparison result. The response format is JSON.
Product Card 1:
Model: Upstreman UF70
Capacity: 7.0 cubic feet
Energy Efficiency: ENERGY STAR Certified
Dimensions: 56.1x21.3x22.8 inches (H x W x D)
Color: Pearl White

Product Card 2:
Model: Upstreman UF70
Capacity: 6.9 cubic feet
Energy Efficiency: ENERGY STAR Certified
Dimensions: 56x21.2x22.9 inches (H x W x D)
Color: White
Output:
{
"is_match": true,
"reason": "Two freezers are considered the same model based on the following comparisons:
1. The model name matches after normalization ('UpstremanUF70').
2. The capacities are similar (7.0 vs 6.9 cubic feet, a difference of 0.1 cubic feet, which is within the acceptable range of 0.2 cubic feet).
3. Both have ENERGY STAR certification, indicating identical energy efficiency.
4. The dimensions are almost identical (difference: height 0.1 inches, width 0.1 inches, depth 0.1 inches, all within an acceptable margin of error of 1 inch).
5. Both products have similar colors: 'White' and 'Pearl White'."
}
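The numeric tolerances from the prompt above (capacity within 0.2 cubic feet, dimensions within 1 inch) can also be cross-checked deterministically once the attributes are extracted. A minimal Python sketch, assuming the attributes are already parsed; color is left to the LLM, since fuzzy cases like 'Pearl White' vs 'White' need judgment:

def dims_match(d1: str, d2: str, tol_inches: float = 1.0) -> bool:
    # Dimensions are strings like "56.1x21.3x22.8"
    a = [float(x) for x in d1.split("x")]
    b = [float(x) for x in d2.split("x")]
    return all(abs(x - y) <= tol_inches for x, y in zip(a, b))

def freezers_match(p1: dict, p2: dict) -> bool:
    return (
        p1["model"].replace(" ", "") == p2["model"].replace(" ", "")
        and abs(p1["capacity"] - p2["capacity"]) <= 0.2
        and p1["energy_efficiency"] == p2["energy_efficiency"]
        and dims_match(p1["dimensions"], p2["dimensions"])
    )

card_1 = {"model": "Upstreman UF70", "capacity": 7.0,
          "energy_efficiency": "ENERGY STAR", "dimensions": "56.1x21.3x22.8"}
card_2 = {"model": "Upstreman UF70", "capacity": 6.9,
          "energy_efficiency": "ENERGY STAR", "dimensions": "56x21.2x22.9"}
print(freezers_match(card_1, card_2))  # -> True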
Working with Photos and Images
Extracting Attributes
Suppose we have an image like this. We will use the same request as for text extraction:
There is a product card from the category "Refrigerators" on a marketplace.
We need to extract and format attributes from it.
key_attributes_list = [
# ...
{
"name": "dimensions",
"attribute_comment": "Include height, width, and depth in centimeters (e.g., '180x70x60'). Look for terms like 'Dimensions'. If not available, set the value to null."
}
]
Result:
{
"dimensions": "56.10x21.30x22.76 inches"
}
Another one:
Output:
{
"category": "Headphones",
"brand": "KVIDIO",
"color": "Black",
"features": [
"Full-sized cup design",
"Wireless",
"Bluetooth connectivity",
"Soft ear cushions"
]
}
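A minimal Python sketch of how such an image could be sent to a multimodal model for attribute extraction; the openai client, model name, and image URL are illustrative:

import json
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # example multimodal model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Extract category, brand, color, and key features from this "
                     "product photo. Answer with JSON only."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/product-photo.jpg"}},
        ],
    }],
    response_format={"type": "json_object"},
)
print(json.loads(response.choices[0].message.content))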
Matching Clothing Items
Prompt:
As a fashion expert, you should compare two photos of women's tops. The analysis should focus solely on the tops themselves, ignoring:
Any other visible clothing items, even if they are part of the outfit or match the style.
Differences in poses, body movements, or how the tops are worn.
Goal: to determine whether the tops are identical or completely different. Provide a clear answer, "Identical" or "Different," accompanied by a brief explanation. The answer should be in JSON format.
Result:
The LLM was able to detect a slight difference in the neckline, which, being a man, I am still not sure I can see.
{
"result": "Different",
"reason": "The tops have different necklines: the first has a round neckline, while the second has a boat neckline."
}
Here’s another example:
{
"result": "Identical",
"reasoning": "Both tops have the same color, design, and fabric characteristics, including long sleeves, a fitted cut, and a light aqua shade."
}
Perfecto!
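For completeness, a minimal Python sketch of how such a two-photo comparison might be sent to a VLM; the file names, client, and model name are illustrative:

import base64, json
from openai import OpenAI

client = OpenAI()

def as_data_url(path: str) -> str:
    # Encode a local photo as a data URL the API can accept
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",  # example multimodal model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "As a fashion expert, compare only the tops in these two photos. "
                     "Answer in JSON with fields 'result' ('Identical' or 'Different') "
                     "and 'reason'."},
            {"type": "image_url", "image_url": {"url": as_data_url("top_1.jpg")}},
            {"type": "image_url", "image_url": {"url": as_data_url("top_2.jpg")}},
        ],
    }],
    response_format={"type": "json_object"},
)
print(json.loads(response.choices[0].message.content))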
Models and Throughput
Throughput is extremely important for a marketplace, as thousands of products are processed daily and new ones keep arriving. Throughput depends on two factors:
Your hardware
The size and type of the model
If you are using a 70B model, such as Llama or Qwen, it will work well but slowly; without a supercomputer, do not expect high throughput. There are several ways to speed things up:
Using a smaller model trained on your dataset. The size of the model depends on the task, and you will likely need to train several smaller models, as each will only be able to handle a limited number of categories. Overall, you will have to experiment. General observations are as follows:
Models with 7-12b parameters are suitable for extracting attributes from text.
Models smaller than 7b may be suitable for a limited set of finely-tuned attributes, and such a model can possibly be obtained through distillation of a larger model.
This can increase throughput by 10–20 times. However, it should be noted that smaller models struggle to extract many attributes at once and may have difficulty handling complex queries, so you will need to test them on your task.
Quantization. Quantization can increase the number of requests per second by 20–50% without significantly reducing output quality (a loading sketch follows this list).
Scenarios with large volumes of data. In such cases, it makes no sense to use anything other than a self-hosted model, as the costs of using OpenAI or Anthropic would be too high. The commercial models are, however, suitable for prototyping and testing ideas or for handling very complex cases: they parse truly complex descriptions at a level comparable to humans.
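A minimal Python sketch of loading a smaller model with 4-bit quantization, assuming the transformers and bitsandbytes packages; the model name is only an example:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "Qwen/Qwen2.5-7B-Instruct"  # example 7B model

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",  # spread layers across available GPUs
)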
For self-hosting, I would recommend the latest versions of the Llama or Qwen models. Start with 70B for testing, then move down to smaller models for as long as the quality remains acceptable.
You will likely need to further fine-tune the model for specific categories. For example, abbreviations are common in the medical field, as they are in construction. A universal model may struggle with such cases, so it is worth using the LoRA (Low-Rank Adaptation) method here.
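A minimal Python sketch of attaching LoRA adapters with the peft library; the base model and target modules are illustrative, and the training loop itself is omitted:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # example base model

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections, a typical choice
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights will be trained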
In addition, consider the language. For example, Llama works well with English, but with Chinese it is likely to hallucinate; in that case, Qwen is probably the better choice. In Russian, word endings, names, and their declensions often cause failures, so pay attention to products where a name or title is a key attribute: movies, music, books.
Pricing
Extracting a price from a description with an LLM is possible, but can LLMs help with price analysis and pricing? I think not yet, and the problem lies in working with numbers. The latest models from OpenAI can calculate fairly well, but it would be too expensive for analyzing large volumes of data.
For narrow niches with specific pricing, it is possible to build such a system on agents. Most platforms and large retailers are what is called "market-driven," or simply put, they orient themselves to competitors' prices. Therefore, either custom algorithms or pricing management systems, similar to those we helped create for the Keeprise team, are used there.
Leave your questions in the comments.
Wishing everyone well!