Profiling: comparing pt-pmp and perf tools on real examples

Hello, tekkix! The previous article covered the initial analysis of an application: which tools to use to collect information and how to work with them. As a reminder, we talked about two utilities: poor man's profiler (pt-pmp), which gives a comprehensive view of application performance by displaying both the off-cpu and on-cpu parts, and perf, which offers high accuracy and powerful functionality overall. Both tools are used for performance analysis, since combining them lets you examine the "patient" from all sides.


[Figure: pt-pmp flame graph of the consumer threads before optimization]

Only part of the flame graph is shown here, namely the methods of the consumer threads (the root method is Workers::ExecuteTask). You can see that the consumers spend about 77% of their time processing messages and the remaining 23% waiting for a mutex to be released, thereby blocking each other's work. It becomes clear that to optimize the consumers we need to reduce the time they sit idle and redirect those resources to actual message processing.
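For reference, the original scheme that produces this picture looks roughly like the sketch below. It is a reconstruction for illustration, not the actual SysLog code: the member names (m_mutex, m_cond, m_messages) are assumptions. The key point is that every consumer must take the same mutex both to wait for a message and to pop it:

#include <atomic>
#include <condition_variable>
#include <list>
#include <mutex>

std::mutex m_mutex;
std::condition_variable m_cond;
std::list<Message> m_messages;
std::atomic<bool> m_Stopped{false};

void ConsumerLoop()
{
    while (!m_Stopped)
    {
        // All consumers serialize on this lock, both for waiting and for popping
        // (the producer pushes under the same mutex and calls m_cond.notify_one())
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cond.wait(lock, [] { return !m_messages.empty() || m_Stopped; });
        if (m_messages.empty())
            continue;

        Message message = std::move(m_messages.front());
        m_messages.pop_front();
        lock.unlock();  // release before the slow part

        message.process();
    }
}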

Solving the Problem

An effective way to speed up the threads in this example is to abandon the std::condition_variable + std::list scheme and remove the mutex handling entirely. An excellent candidate is a lock-free queue: a data structure designed to store elements without using locking mechanisms. A lock-free queue relies on atomic operations and other synchronization techniques that allow multiple threads to add and remove elements simultaneously without blocking each other (more details about such queues can be found in this article).
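To illustrate the core technique, below is a minimal sketch of a bounded lock-free queue for the simplest case of one producer and one consumer. This is not the implementation used in the module, just a demonstration of how atomic indices replace a mutex:

#include <atomic>
#include <cstddef>
#include <optional>

// Bounded single-producer/single-consumer ring buffer: the two threads
// synchronize only through atomic head/tail indices, with no mutex at all.
template <typename T, std::size_t Capacity>
class SpscRing {
public:
    // Called by the producer thread only
    bool try_push(T item) {
        const auto tail = m_tail.load(std::memory_order_relaxed);
        const auto next = (tail + 1) % Capacity;
        if (next == m_head.load(std::memory_order_acquire))
            return false;  // queue is full
        m_buffer[tail] = std::move(item);
        m_tail.store(next, std::memory_order_release);  // publish the element
        return true;
    }

    // Called by the consumer thread only
    std::optional<T> try_pop() {
        const auto head = m_head.load(std::memory_order_relaxed);
        if (head == m_tail.load(std::memory_order_acquire))
            return std::nullopt;  // queue is empty
        T item = std::move(m_buffer[head]);
        m_head.store((head + 1) % Capacity, std::memory_order_release);
        return item;
    }

private:
    T m_buffer[Capacity];
    std::atomic<std::size_t> m_head{0};  // next slot to read
    std::atomic<std::size_t> m_tail{0};  // next slot to write
};

Supporting multiple consumers, as required here, additionally needs compare-and-swap loops on the consumer side; that is exactly the kind of machinery a full-fledged library implementation provides.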

In the case of the SysLog module, the queue should meet the following requirements:

  • follow the single producer, multiple consumers pattern;

  • be bounded: it must not be possible to add an unlimited number of messages to the queue;

  • have the ability to block threads: if the queue is empty, consumer threads should wait for a new message; if the queue is full, the producer thread should wait for free space.

This description fits concurrentqueue, an implementation of a lock-free queue based on contiguous blocks instead of linked lists. It consists of a set of sub-queues, one per producer. When a consumer thread wants to read a message from the queue, it checks the sub-queues until it finds one that is not empty (in our case only one sub-queue is used, since there is only one producer). A more detailed description with implementation details and benchmarks can be found in the article by the library's author.

After integrating the queue, the producer thread code looks like this:

// Create the queue (using the thread-blocking variant)
static constexpr std::size_t EXPLICIT_PRODUCERS_NUMBER = 1;
static constexpr std::size_t IMPLICIT_PRODUCERS_NUMBER = 0;
static moodycamel::BlockingConcurrentQueue<Message> m_queue{queueSize, EXPLICIT_PRODUCERS_NUMBER, IMPLICIT_PRODUCERS_NUMBER};

// Create a token for the producer thread
static moodycamel::ProducerToken token(m_queue);

// Enqueue the message
if (!m_queue.try_enqueue(token, std::move(message)))
    m_queue.wait_enqueue(token, std::move(message));

First, the try_enqueue method is used: it places the message into the queue if there is space. If there is none, the wait_enqueue method is called, which enqueues the message once space becomes available. At the time of writing, BlockingConcurrentQueue does not provide such a method, so I implemented it as follows:

// Put the thread to sleep, then try to add the message to the queue.
// The sleep time doubles with each iteration (capped at 10 ms).
inline void wait_enqueue(producer_token_t const& token, T&& item)
{
    auto sleepTime = 100;  // nanoseconds

    do
    {
        std::this_thread::sleep_for(std::chrono::nanoseconds(sleepTime));
        sleepTime = std::min(sleepTime * 2, 10000000);

    } while (!add_item_busy_wait(token, std::move(item)));
}

// Try to add the message to the queue. If there is no space, spin-wait
// (it is assumed that space will appear quickly and a short pause is enough)
bool add_item_busy_wait(producer_token_t const& token, T&& item)
{
    for (auto i = 0; i < 1000; i++)
    {
        if (try_enqueue(token, std::move(item)))
            return true;
        _mm_pause();  // x86 spin-wait hint (from <immintrin.h>)
    }

    return false;
}

Below is the consumer thread code:

static constexpr std::int64_t QUEUE_WAITING_PERIOD_USEC = 100000;
// Create a token for the consumer thread
thread_local moodycamel::ConsumerToken token(m_queue);

while (!m_Stopped)
{
    Message message;

    if (!m_queue.wait_dequeue_timed(token, message, QUEUE_WAITING_PERIOD_USEC))
        continue;

    message.process();
}

The logic here is simple: wait for a message in the queue, read it, and process it.

Testing the Optimization

Now let's see how the flame graphs changed after applying the patch. pt-pmp showed the following results:


[Figure: pt-pmp flame graph of the consumer threads after optimization]

You can see that message processing now takes 97.5% of the time; waiting has decreased significantly, and mutex locks are no longer visible on the graph, since a lock-free queue is used. Compared to the previous version, the SysLog module is now about 50% faster. Below are the results of comparing the two versions by MPS (messages per second); the version before optimization is shown in red, the version after in blue:


[Figure: MPS comparison of the SysLog module before (red) and after (blue) optimization]

Comparison with perf flame graphs

It remains to answer the question: what does the perf-based flame graph show in this case? Could the described optimization have been found if only an on-cpu profiler had been used?

The answer: a perf-based flame graph is not enough. Here is what the consumer's work looks like from perf's point of view:


[Figure: perf flame graph of the consumer thread before optimization]

There are no locks here at all, even though the consumer actually spends quite a lot of time in them. This creates the impression that the thread only processes messages and is almost never in a waiting state.

Now let's look at the perf flame graph after the introduction of the lock-free queue:


[Figure: perf flame graph of the consumer thread after switching to the lock-free queue]

The picture has hardly changed: we see 97% useful work compared to 95.5% before. The conclusion is that for analyzing the off-cpu part of an application, pt-pmp is better suited than perf.

Using perf

The main advantage of this profiler over pt-pmp is higher measurement accuracy, so let's consider an example where this feature proved useful. We will analyze the event normalizer (a MaxPatrol SIEM component); its main task is to convert all events into a single form so that other components can process them correctly afterwards.

Case conversion problem

One of the operations frequently used in normalization is converting a string from uppercase to lowercase. The code is based on the open-source ICU library:

inline static std::string toLowerIcu(std::string_view src) {
    static UErrorCode err = U_ZERO_ERROR;
    static const UCaseMap *csm = ucasemap_open(nullptr, U_FOLD_CASE_DEFAULT, &err);
    std::string dst(src.size(), char{});

    ucasemap_utf8ToLower(csm, dst.data(), static_cast<int32_t>(dst.size()),
        src.data(), static_cast<int32_t>(src.size()), &err);

    return dst;
}

On the perf flame graph, the transformation looks as follows:


[Figure: perf flame graph showing the toLowerIcu conversion]

It is clear that the conversion consists entirely of the ICU function and takes 1.5–2% of the total runtime, so there is room for optimization here.

Preparing and Testing the Solution

One of the normalizer's features is that it must correctly handle both ASCII strings and the broader set of UTF-8 strings (which is why ICU is used). However, this method is quite slow, especially on ASCII strings. To convert such a string from uppercase to lowercase, it is enough to call the standard tolower function, which works faster. But first you need to make sure that the input string really is ASCII.

The updated conversion code looks like this:

// Helper function to check ASCII string
inline static void isAsciiAndLowercase(std::string_view str, 
                                       bool &isAscii, 
                                       bool &needsConvToLower)
{
    const uint8_t *pos;
    size_t len;
    uint64_t acc1 = 0;
    uint64_t acc2 = 0x2020202020202020UL; 

    // It is assumed that the input to this function is short enough, 
    // so scanning the entire string in a vectorizable loop should be 
    // faster than checking conditions inside the loop to exit it earlier.
    //
    // The following assumptions are used:
    //
    // - all ASCII characters (and only them) have the most significant bit 
    // (bit #8, counting bits from 1 starting at the least significant) 
    // of their first and only byte set to 0;
    //
    // - no uppercase ASCII character has bit #6 set 
    // ('A' - 'Z' are encoded in binary values 0100 0001 - 0101 1010). 
    // This means that if bit #6 is set for all characters, the string 
    // will definitely not contain uppercase characters 
    // and therefore will not need to be converted to lowercase 
    // (needsConvToLower == false);
    //
    // - Conversely, if all characters in the string are less than 64 
    // (i.e., their bit #7 is cleared), no conversion is required;
    //
    // - if both conditions above are false, the string may or may not have 
    // uppercase letters, but the function will return (needsConvToLower == true), 
    // i.e., false positives for this check are possible, 
    // but not false negatives.      

    for (pos = (const uint8_t *)str.data(), len = str.length(); len >= 8; 
         pos += 8, len -= 8) {
        // 8-byte unaligned load (assumed acceptable on the target x86 CPUs)
        uint64_t val = *(const uint64_t *)pos;
        acc1 |= val;
        acc2 &= val;
    }

    uint8_t acc3 = 0x20;
    for (; len > 0; pos++, len--) {
        acc1 |= *pos;
        acc3 &= *pos;
    }

    isAscii = (acc1 & 0x8080808080808080UL) == 0;
    needsConvToLower =
        !((acc1 & 0x4040404040404040UL) == 0 ||
          ((acc2 & 0x2020202020202020UL) == 0x2020202020202020UL && 
           (acc3 & 0x20) == 0x20));
}

// ASCII string conversion function
inline static std::string asciiToLower(std::string_view src) {
    std::string dst(src.size(), char{});

    // ::tolower is safe here: the caller guarantees the input is pure ASCII
    for (size_t i = 0; i < src.size(); i++) {
        dst[i] = ::tolower(src[i]);
    }

    return dst;
}  

// Conversion function with ASCII string check
inline static std::string toLowerAscii(std::string_view src) {
    bool isAscii;
    bool needsConvToLower;

    isAsciiAndLowercase(src, isAscii, needsConvToLower);
    if (isAscii) {
        return needsConvToLower ? asciiToLower(src) : std::string{src};
    }

    // Call the original ICU-based method 
    // if the string does not fit the ASCII format 
    return toLowerIcu(src);
}
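A quick sanity check of the combined function might look like this (the sample strings are my own, not from the original tests):

#include <cassert>
#include <iostream>

int main() {
    // Pure ASCII with uppercase letters: fast path, conversion performed
    assert(toLowerAscii("Hello WORLD") == "hello world");

    // Pure ASCII, already lowercase: fast path, the string is copied as is
    // (needsConvToLower may give false positives, e.g. for '@' (0x40),
    // but never false negatives, so the result is still correct)
    assert(toLowerAscii("already lower") == "already lower");

    // Non-ASCII input falls back to the ICU-based toLowerIcu
    std::cout << toLowerAscii("Grüße") << '\n';

    return 0;
}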

Let's compare its performance in a microbenchmark:

$ lscpu
...
Vendor ID:            GenuineIntel
CPU family:           6
Model:                62
Model name:           Intel(R) Xeon(R) CPU E5-2695 v2 @ 2.40GHz
...

$ ./bin/benchmark_string_upper_to_lower
2024-07-29T12:31:18+00:00
Running ./bin/benchmark_string_upper_to_lower
Run on (12 X 2400 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x12)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 30720 KiB (x0)
Load Average: 4.60, 4.39, 3.88
------------------------------------------------------------
Benchmark                  Time             CPU   Iterations
------------------------------------------------------------
IcuConversion        1815000 ns      1815000 ns          314
ToLowerConversion     837000 ns       836966 ns          835

It can be seen that the optimized algorithm works approximately twice as fast as the original version. Now let's see how this is reflected in the perf flame graph:


[Figure: perf flame graph of the conversion after optimization]

The method now takes about 0.6% of the total time, roughly one percentage point less than the original version. This also affected the normalizer's overall EPS (events per second), speeding it up by about 1%.


[Figure: EPS comparison of the normalizer before and after optimization]

Comparison with pt-pmp flame graphs

If only pt-pmp had been used as the profiler, the conversion problem would have been difficult to detect: pt-pmp is less accurate than perf and does not show the toLower method on its flame graph at all.

It looks like this:


[Figure: pt-pmp flame graph of the normalizer]

For clarity, I will duplicate the perf flame graph from before the optimization, but at the same scale as the pt-pmp graph, so that the difference is clearly visible:


[Figure: perf flame graph before optimization, shown at the same scale as the pt-pmp graph]

It can be concluded that it is better to use perf rather than pt-pmp to find local optimizations.

Conclusion

The described examples demonstrate the strengths and weaknesses of pt-pmp and perf and show when each of them proved useful in practice. As already mentioned in the first part of the article, there is no need to choose just one: an effective approach is to use both profilers at once. This makes it possible to analyze the application from different angles, both in terms of thread interaction and the various kinds of waiting, and in terms of pinpoint optimizations that speed up specialized algorithms.

It is worth clarifying: although the gains in the examples differ (pt-pmp helped find a 50% optimization, perf a 1% one), this does not mean that perf is weaker and should not be used. The point is the strengths and weaknesses of the tools: with pt-pmp it is hard to find problems whose elimination yields a 1-2% gain. But there can be many such problems, and if they are fixed one by one, the application's performance will improve noticeably.


Appendix

Microbenchmark code for string conversion:

#include <algorithm>
#include <random>
#include <string>
#include <string_view>
#include <vector>

#include <benchmark/benchmark.h>
#include <unicode/ucasemap.h>

static constexpr size_t STR_MAX_SIZE = 100;
static constexpr size_t VECTOR_SIZE = 3000;  

static std::string getRandomString(size_t length)
{
    static constexpr std::string_view CHARACTERS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    // static: otherwise every call would restart the engine from the same seed
    static std::default_random_engine generator;
    // the upper bound is inclusive, so subtract 1 to stay within CHARACTERS
    std::uniform_int_distribution<> distribution(0, CHARACTERS.size() - 1);

    std::string string;
    string.reserve(length);
    for (std::size_t i = 0; i < length; ++i)
    {
        string.push_back(CHARACTERS[distribution(generator)]);
    }

    return string;
}

static std::vector<std::string> getRandomStringVector() {

    std::default_random_engine generator;
    std::uniform_int_distribution<> distribution(1, STR_MAX_SIZE);
    std::vector<std::string> vector(VECTOR_SIZE);

    std::generate(vector.begin(), vector.end(), [&generator, &distribution] {
        return getRandomString(distribution(generator));
    });

    return vector;
}
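The listing above is cut off in the published version. For completeness, here is a sketch of how the two benchmarks from the output (IcuConversion and ToLowerConversion) could be registered with Google Benchmark; the original registration code was not preserved, so treat this as an approximation:

// Approximate reconstruction: benchmark both conversions on the same data set
static void IcuConversion(benchmark::State& state) {
    const auto strings = getRandomStringVector();
    for (auto _ : state)
        for (const auto& s : strings)
            benchmark::DoNotOptimize(toLowerIcu(s));
}
BENCHMARK(IcuConversion);

static void ToLowerConversion(benchmark::State& state) {
    const auto strings = getRandomStringVector();
    for (auto _ : state)
        for (const auto& s : strings)
            benchmark::DoNotOptimize(toLowerAscii(s));
}
BENCHMARK(ToLowerConversion);

BENCHMARK_MAIN();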