Improving applications with high performance constraints has an obvious choice nowadays: multithreading. Threads have been around for quite a long time. In the old days, when most computers had only one processor, threads were mainly used as a way of dividing the total work into smaller execution units, allowing some of them to do actual work (that is, "processing") while others were waiting for resources. An easy example is a network application which listens on a TCP port and does some processing once a request arrives through that port. In a single-threaded approach the application can't respond to new requests until each request has been processed, so potential users might think the application is down when it is just doing its work. In a multithreaded approach, a new thread could be in charge of the processing side, while the main thread would always be ready to respond to potential users.
On single-processor machines though, processing-eager multithreaded applications might not achieve the intended result. All threads might end up "fighting" to get hold of the processor, and the overall performance can be the same as, or even worse than, letting a single processing unit do all the work, because of the overhead needed to communicate and share data between threads.
In a symmetric multiprocessing (SMP) machine though, the former multithreaded application would really do several tasks at the same time (parallelizing). Each thread would have a real physical processor, instead of sharing the only resource available. In an N-processor SMP system an N-thread application could theoretically reduce the time needed by a factor of N (in practice it's always less, since there is still some overhead to communicate or share data across threads).
SMP machines were quite expensive in the past, and only companies with a big interest in this kind of software could afford them, but nowadays multicore processors are reasonably cheap (most computers sold right now have more than one core), so parallelizing this kind of application has become more and more popular due to the big impact it can have on performance.
But multithreading is not an easy task. Threads have to share data and communicate with each other, and soon enough you'll find yourself facing the same old problems: deadlocks, uncontrolled access to shared data, dynamic memory allocation/deletion across threads, etc. Besides, if you are lucky enough to be working on an application with high performance constraints, there will also be a different set of problems which severely affect the overall performance of your beloved multithreaded system:

- Cache trashing caused by tasks being preempted and their contexts switched.
- Contention on the blocking synchronization mechanisms (locks) that protect shared data.
- Dynamic memory allocation performed from several threads at the same time.
This article is about how you can minimize those three performance-related problems using an array based lock-free queue, especially the usage of dynamic memory allocation, since avoiding it was the main objective when this lock-free queue was designed in the first place.
Threads are (from Wikipedia) "the smallest unit of processing that can be scheduled by an operating system". Each OS has its own implementation of threads, but basically a process contains a set of instructions (code) plus a piece of memory local to the process. A thread runs a piece of code but shares the memory space with the process that contains it. In Linux (the OS this queue was first intended to work on) a thread is just another "context of execution"; in this OS there "is no concept of a thread. Linux implements threads as standard processes. The Linux kernel does not provide any special scheduling semantics or data structures to represent threads. Instead, a thread is merely a process that shares certain resources with other processes."[1]
Each one of these running tasks, threads, contexts of execution or whatever you might want to call them, uses a set of CPU registers to run. They contain data internal to the task, such as the address of the instruction being run at the moment, the operands and/or results of certain operations, a pointer to the stack, etc. This set of information is called the "context". Any preemptive OS (most modern operating systems are preemptive) has to be able to stop a running task at almost any time, saving the context somewhere so it can be restored in the future (there are a few exceptions, for instance processes that declare themselves non-preemptable for a while to do something). Once the task is restored it will resume whatever it was doing as if nothing had happened. This is a good thing: the processor is shared across tasks, so tasks waiting for I/O can be preempted, allowing another task to take over. Single-processor systems can behave as if they were multiprocessor ones, but, as with everything in life, there is a trade-off: the processor is shared, but each time a task is preempted there is an overhead to save/restore the context of both the exiting and the entering task.
There is also an extra hidden overhead in the save/restore of the context: data saved in the cache will be useless for the new task, since it was cached for the one leaving the processor. It is important to take into account that processors are several times faster than memory, so a lot of processing time is wasted waiting for memory to read/write data coming to/from the processor. This is why cache memories were put between standard RAM and the processor: they are faster and smaller (and much more expensive) memory slots where data accessed from the standard RAM is copied, because it will probably be accessed again in the near future. In processing intensive applications cache misses are very important for performance, since when data is already in the cache processing is several times faster.
So, any time a task is preempted the cache is likely to be overwritten by the following task(s), which means that once the original task resumes its operation it will take a while for it to be as productive as it was before being preempted. (Some operating systems, for instance Linux, try to restore tasks on the last processor they were running on, but depending on how much memory the intervening tasks needed, the cached data might also be useless.) Of course I don't mean preemption is bad; preemption is needed for the OS to work properly, but depending on the design of your MT system some threads might be suffering from preemption too often, which reduces performance because of cache trashing.
When are tasks preempted then? This is very much dependent on your OS, but interrupt handling, timers and system calls are very likely to cause the OS preemption subsystem to decide to give processing time to some other task in the system. This is actually a very important piece of the OS puzzle, since nobody wants tasks to be idle (starve) for too long. Some system calls are "blocking", which means that the task asks the OS for a resource and waits until it is ready, because it needs that resource to keep going. This is a good example of a preemptable task, since it will be doing nothing until the resource is ready, so the OS will put that task on hold and give the processor to some other task that has work to do.
Resources are basically data in memory, the hard disk, the network or peripherals, but also blocking synchronization mechanisms like semaphores or mutexes. If a task tries to acquire a mutex which is already being held it will be preempted, and once the mutex is available again the thread will be added to the "ready-to-run" queue of tasks. So if you are worried about your task being preempted too often (and about cache trashing) you should avoid blocking synchronization mechanisms when possible.
But, as with anything in life, it is never that easy. If you avoid blocking synchronization mechanisms while using more processing-intensive threads than the number of physical processors you have, the latency of your system might be hit: the fewer times the OS rotates tasks, the longer currently inactive tasks have to wait until they find an idle processor to be restored on. It might even happen that the whole application is hit because it is waiting for a starved thread to finish some calculation before the system can step forward to do something else. There is no valid formula; it always depends on your application, your system and your OS. For instance, in a processing intensive real-time application I would choose to use non-blocking mechanisms to synchronize threads AND fewer threads than physical processors, though this might not always be possible. In some other applications which are idle most of the time, waiting for data coming from, for instance, the network, non-blocking synchronization can be overkill (which can end up killing you). There's no recipe here: each method has its own pros and cons, and it's up to you to decide what to use.
Queues are easily applicable to a wide range of multithreading situations. If two or more threads need to communicate events in order, the first thing that comes to my head is a queue: easy to understand, easy to use, well tested, and easy to teach (that is, cheap). Every programmer in the world has had to deal with queues. They are everywhere.
Queues are easy to use in single-threaded applications, and they can "easily" be adapted to multithreaded systems. All you need is a non-protected queue (for instance std::queue in C++) and some blocking synchronization mechanisms (mutexes and conditional variables, for example). I uploaded with this article an easy example of a blocking queue implemented using GLib (which makes it compatible with the wide range of operating systems GLib has been ported to). There is no real need to reinvent the wheel with such a queue though, since GAsyncQueue[7] is an implementation of a thread-safe queue already included in GLib, but this code is a good example of how to convert a standard queue into a thread-safe one.
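Just for illustration (this snippet is not part of the files uploaded with the article), using GAsyncQueue directly looks roughly like this; both ends are shown in the same thread only to keep the example short:

#include <glib.h>

int main()
{
    GAsyncQueue *queue = g_async_queue_new();

    // the producer side: pointers are pushed into the queue (thread-safe, never blocks)
    int *elem = g_new(int, 1);
    *elem = 42;
    g_async_queue_push(queue, elem);

    // the consumer side: pop blocks until an element is available
    int *popped = static_cast<int*>(g_async_queue_pop(queue));
    g_free(popped);

    g_async_queue_unref(queue);
    return 0;
}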
Let's have a look at the implementation of the most common methods that are expected to be found in a queue: IsEmpty, Push and Pop. The base non-protected queue is a std::queue which has been declared as std::queue<T> m_theQueue;. The three methods we are going to see here show how that non-safe implementation is wrapped with GLib mutexes and conditional variables (declared as GMutex* m_mutex and GCond* m_cond). The real queue that can be downloaded from this article also contains TryPush and TryPop, which don't block the calling thread if the queue is full or empty.
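For orientation, this is a minimal sketch of the class declaration these snippets imply (an assumption for illustration; the exact declaration is the one in the downloadable code):

#include <queue>
#include <glib.h>

template <typename T>
class BlockingQueue
{
public:
    // assumption: the constructor receives the maximum amount of elements allowed
    explicit BlockingQueue(std::size_t a_maximumSize);
    ~BlockingQueue();

    bool IsEmpty();
    bool Push(const T &a_elem);
    void Pop(T &out_data);
    // TryPush and TryPop are left out of this sketch

private:
    std::queue<T> m_theQueue;    // non thread-safe queue being wrapped
    std::size_t   m_maximumSize; // maximum amount of elements in the queue
    GMutex       *m_mutex;       // protects every access to m_theQueue
    GCond        *m_cond;        // signals "queue not empty" / "queue not full"
};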
template <typename T>
bool BlockingQueue<T>::IsEmpty()
{
    bool rv;

    g_mutex_lock(m_mutex);
    rv = m_theQueue.empty();
    g_mutex_unlock(m_mutex);

    return rv;
}
IsEmpty is expected to return true if the queue has no elements, but before any thread can access the non-safe implementation of the queue, it must be protected. This means that the calling thread can get blocked for a while until the mutex is freed.
template <typename T>
bool BlockingQueue<T>::Push(const T &a_elem)
{
    g_mutex_lock(m_mutex);

    while (m_theQueue.size() >= m_maximumSize)
    {
        g_cond_wait(m_cond, m_mutex);
    }

    bool queueEmpty = m_theQueue.empty();

    m_theQueue.push(a_elem);

    if (queueEmpty)
    {
        // wake up threads waiting for stuff
        g_cond_broadcast(m_cond);
    }

    g_mutex_unlock(m_mutex);

    return true;
}
Push inserts an element into the queue. The calling thread will get blocked if another thread owns the lock that protects the queue. If the queue is full the thread will be blocked in this call until someone else pops an element from the queue; however, the calling thread won't be using any CPU time while it is waiting for someone else to pop an element off the queue, since it has been put to sleep by the OS.
template <typename T>
void BlockingQueue<T>::Pop(T &out_data)
{
    g_mutex_lock(m_mutex);

    while (m_theQueue.empty())
    {
        g_cond_wait(m_cond, m_mutex);
    }

    bool queueFull = (m_theQueue.size() >= m_maximumSize) ? true : false;

    out_data = m_theQueue.front();
    m_theQueue.pop();

    if (queueFull)
    {
        // wake up threads waiting for stuff
        g_cond_broadcast(m_cond);
    }

    g_mutex_unlock(m_mutex);
}
Pop extracts an element from the queue (and removes it). The calling thread will get blocked if another thread owns the lock that protects the queue. If the queue is empty the thread will be blocked in this call until someone else pushes an element into the queue; however (as it happens with Push) the calling thread won't be using any CPU time while it is waiting for someone else to push an element into the queue, since it has been put to sleep by the OS.
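As a usage sketch (again, an assumption for illustration rather than the test code shipped with the article, and assuming the constructor takes the maximum size), a single producer and a single consumer could share the queue like this:

#include <pthread.h>
// plus the header that declares BlockingQueue

static void* producer(void *a_arg)
{
    BlockingQueue<int> *queue = static_cast<BlockingQueue<int>*>(a_arg);
    for (int i = 0; i < 100000; ++i)
    {
        queue->Push(i);     // sleeps while the queue is full
    }
    return NULL;
}

static void* consumer(void *a_arg)
{
    BlockingQueue<int> *queue = static_cast<BlockingQueue<int>*>(a_arg);
    int value;
    for (int i = 0; i < 100000; ++i)
    {
        queue->Pop(value);  // sleeps while the queue is empty
    }
    return NULL;
}

int main()
{
    BlockingQueue<int> queue(1000); // assumed constructor: maximum amount of elements

    pthread_t prod, cons;
    pthread_create(&prod, NULL, producer, &queue);
    pthread_create(&cons, NULL, consumer, &queue);

    pthread_join(prod, NULL);
    pthread_join(cons, NULL);
    return 0;
}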
As I tried to explain in the previous section, blocking is not a trivial action. It involves the OS putting the current task "on hold", or sleeping (waiting without using any processor). Once some resource (a mutex for instance) is available, the blocked task can be woken up, which isn't a trivial action either, so it can continue into the mutex. In heavily loaded applications, using these blocking queues to pass messages between threads can lead to contention, that is, tasks spend longer claiming the mutex (sleeping, waiting, waking up) to access the data in the queue than actually doing "something" with that data.
In the easiest case, one thread inserting data into the queue (producer) and another thread removing it (consumer), both threads are "fighting" for the only mutex that protects the queue. If we choose to write our own implementation of a queue instead of just wrapping an existing one, we could use two different mutexes, one for inserting items into and another one for removing items from the queue. Contention in this case only applies to the extreme cases, when the queue is almost empty or almost full. But as soon as we need more than one thread inserting or removing elements, our problem comes back: consumers or producers will be fighting among themselves to take their mutex.
This is where non-blocking mechanisms come in. Tasks don't "fight" for any resource: they "reserve" a place in the queue without being blocked or unblocked, and then they insert/remove data from the queue. These mechanisms need a special kind of operation called CAS (Compare And Swap), which (from Wikipedia) is "a special instruction that atomically compares the contents of a memory location to a given value and, only if they are the same, modifies the contents of that memory location to a given new value". For instance:
volatile int a;
a = 1;

// this will loop while 'a' is not equal to 1. If it is equal to 1 the
// operation will atomically set 'a' to 2 and return true
while (!CAS(&a, 1, 2))
{
    ;
}
Using CAS to implement a lock-free queue is not a brand new topic. There are quite a few examples of implementations of lock-free data structures, most of them based on linked lists; have a look at [2], [3] or [4]. It's not the purpose of this article to describe what a lock-free queue is, but basically it is a queue that several threads can access at the same time without any blocking mechanism, thread-safety being guaranteed only by atomic CAS operations. This is an example of an easy implementation of a linked-list based lock-free queue (copied from [2], who based it on [5]):
typedef struct _Node Node;
typedef struct _Queue Queue;

struct _Node
{
    void *data;
    Node *next;
};

struct _Queue
{
    Node *head;
    Node *tail;
};

Queue* queue_new(void)
{
    Queue *q = g_slice_new(Queue);
    q->head = q->tail = g_slice_new0(Node);
    return q;
}

void queue_enqueue(Queue *q, gpointer data)
{
    Node *node, *tail, *next;

    node = g_slice_new(Node);
    node->data = data;
    node->next = NULL;

    while (TRUE)
    {
        tail = q->tail;
        next = tail->next;
        if (tail != q->tail)
            continue;

        if (next != NULL)
        {
            CAS(&q->tail, tail, next);
            continue;
        }

        if (CAS(&tail->next, NULL, node))
            break;
    }

    CAS(&q->tail, tail, node);
}

gpointer queue_dequeue(Queue *q)
{
    Node *head, *tail, *next;
    gpointer data;

    while (TRUE)
    {
        head = q->head;
        tail = q->tail;
        next = head->next;
        if (head != q->head)
            continue;

        if (next == NULL)
            return NULL; // Empty

        if (head == tail)
        {
            CAS(&q->tail, tail, next);
            continue;
        }

        data = next->data;
        if (CAS(&q->head, head, next))
            break;
    }

    g_slice_free(Node, head); // This isn't safe
    return data;
}
In those programming languages which don't have a garbage collector (C++ is one of them) the last call to g_slice_free is not safe because of what is known as the ABA problem:

1. Thread T1 starts a queue_dequeue operation. It reads the current head but is preempted before performing its CAS operation.
2. Another thread dequeues that node and frees its memory. Later on, a new node is allocated, it happens to land at the same memory address, and it is pushed into the queue.
3. T1 resumes and performs its CAS operation to remove the same node T1 was about to dequeue. The CAS operation succeeds incorrectly, since the address is the same, but it's not the same node: T1 removes the wrong node.

The ABA problem can be fixed by adding a reference counter to each node. This reference counter must be checked before assuming the CAS operation is right, to avoid the ABA problem. The good news is that the queue this article is about doesn't suffer from the ABA problem, since it doesn't use dynamic memory allocation.
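As an illustration only (an assumption for this article, not the exact technique used in [5] nor code from the downloadable queue), the counter idea can be sketched with a 32-bit index packed together with a 32-bit version counter in a single 64-bit word, so that a recycled index makes the stale CAS fail instead of succeeding "incorrectly":

#include <stdint.h>

static const uint32_t EMPTY_MARK = 0xffffffffu;

// head of a free list of array slots: high 32 bits = version, low 32 bits = index
static volatile uint64_t s_head = (static_cast<uint64_t>(0) << 32) | EMPTY_MARK;

// a_next[i] holds the index of the slot that follows slot i in the free list
bool popFreeSlot(uint32_t &out_index, const uint32_t *a_next)
{
    uint64_t oldHead;
    uint64_t newHead;
    do
    {
        oldHead = s_head;
        uint32_t index   = static_cast<uint32_t>(oldHead & 0xffffffffu);
        uint32_t version = static_cast<uint32_t>(oldHead >> 32);
        if (index == EMPTY_MARK)
        {
            return false; // nothing left to pop
        }
        out_index = index;
        // even if 'index' is freed and reused in the meantime, 'version' will
        // have changed, so the CAS below fails and the loop starts over
        newHead = (static_cast<uint64_t>(version + 1) << 32) | a_next[index];
    } while (!__sync_bool_compare_and_swap(&s_head, oldHead, newHead));

    return true;
}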
In multithreaded systems memory allocation must be seriously considered. The standard memory allocation mechanism blocks all the tasks that share the memory space (all the threads of the process) while it is reserving heap space for one of them. This is an easy way of doing things, and it works: there's no way two threads can be given the same memory address, because they can't be allocating space at the same time. But it's quite slow when threads allocate memory often (and it must be noted that small things like inserting elements into standard queues or standard maps allocate memory on the heap).
There are some libraries out there that override the standard allocation mechanism to provide a lock-free memory allocator that reduces contention for the heap, for example libhoard[6]. There are lots of different libraries of this kind, and they can have a great impact on your system if you switch from the standard C++ allocator to one of these lock-free memory allocators. But sometimes they just aren't what the system needs, and the software must go the extra mile and change its synchronization mechanism.
So, finally, this is the circular-array based lock-free queue, the thing this article was intended for in the first place. It has been developed to reduce the impact of the three problems described above, and its main characteristics are:

- It is non-blocking: threads synchronize through CAS operations instead of locks.
- It is array based, so no dynamic memory allocation is performed when elements are pushed or popped.
- It is safe for any number of producer and consumer threads.
The queue is based on an array and three different indexes:

- writeIndex: where a new element will be inserted.
- readIndex: where the next element will be extracted from.
- maximumReadIndex: it points to the place where the latest "committed" data has been inserted. If it's not the same as writeIndex it means there are writes pending to be "committed" to the queue, that is, the place for the data has been reserved (the index in the array) but the data is still not in the queue, so a thread trying to read will have to wait for those other threads to save the data into the queue.

It is worth mentioning that three different indexes are required because the queue allows setting up as many producers and consumers as needed. There is already an article on a queue for the single-producer and single-consumer configuration [11]. Its simple approach is definitely worth reading (I have always liked the KISS principle). Things have got more complicated here because the queue must be safe for all kinds of thread configurations.
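As a reference, this is a sketch of the data members implied by that description (the names are taken from the code snippets shown later in this article; the exact declaration is the one in the downloadable code and may differ):

#include <stdint.h>

template <typename ELEM_T, uint32_t Q_SIZE>
class ArrayLockFreeQueue
{
public:
    bool push(const ELEM_T &a_data);
    bool pop(ELEM_T &a_data);

private:
    ELEM_T m_theQueue[Q_SIZE];             // pre-allocated circular array, no dynamic memory

    volatile uint32_t m_writeIndex;        // where the next element will be inserted
    volatile uint32_t m_readIndex;         // where the next element will be extracted from
    volatile uint32_t m_maximumReadIndex;  // last "committed" slot consumers are allowed to pop

    uint32_t countToIndex(uint32_t a_count); // maps the ever-growing counters into [0, Q_SIZE)
};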
3.1.1 The CAS operation

The synchronization mechanism of this lock-free queue is based on compare-and-swap CPU instructions. CAS operations were included in GCC in version 4.1.0. Since I was using GCC 4.4 to compile this algorithm, I decided to take advantage of the GCC built-in CAS operation called __sync_bool_compare_and_swap (described in the GCC documentation). In order to support more than one compiler this operation was "mapped" to the word CAS using a #define in the file atomic_ops.h:
/// @brief Compare And Swap
///        If the current value of *a_ptr is a_oldVal, then write a_newVal into *a_ptr
/// @return true if the comparison is successful and a_newVal was written
#define CAS(a_ptr, a_oldVal, a_newVal) __sync_bool_compare_and_swap(a_ptr, a_oldVal, a_newVal)
If you plan to compile this queue with some other compiler, all you need to do is define a CAS operation that works with your compiler. It MUST suit the same interface: it takes a pointer to the variable, the expected old value and the new value, and it returns true only if the comparison succeeded and the new value was written.
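For instance (this mapping is an assumption for illustration, it is not part of atomic_ops.h), a Windows build compiled with MSVC could map CAS to the Interlocked family of functions, which return the previous value of the destination instead of a boolean:

#include <windows.h>

// InterlockedCompareExchange returns the value *a_ptr had before the call, so
// comparing that result against the expected old value yields the boolean CAS needs
#define CAS(a_ptr, a_oldVal, a_newVal)                                 \
    (InterlockedCompareExchange(                                       \
         reinterpret_cast<volatile LONG*>(a_ptr),                      \
         static_cast<LONG>(a_newVal),                                  \
         static_cast<LONG>(a_oldVal)) == static_cast<LONG>(a_oldVal))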
3.1.2 Inserting elements into the queue

This is the code responsible for inserting new elements into the queue:
/* ... */

template <typename ELEM_T, uint32_t Q_SIZE>
inline
uint32_t ArrayLockFreeQueue<ELEM_T, Q_SIZE>::countToIndex(uint32_t a_count)
{
    return (a_count % Q_SIZE);
}

/* ... */

template <typename ELEM_T>
bool ArrayLockFreeQueue<ELEM_T>::push(const ELEM_T &a_data)
{
    uint32_t currentReadIndex;
    uint32_t currentWriteIndex;

    do
    {
        currentWriteIndex = m_writeIndex;
        currentReadIndex  = m_readIndex;
        if (countToIndex(currentWriteIndex + 1) == countToIndex(currentReadIndex))
        {
            // the queue is full
            return false;
        }
    } while (!CAS(&m_writeIndex, currentWriteIndex, (currentWriteIndex + 1)));

    // We know now that this index is reserved for us. Use it to save the data
    m_theQueue[countToIndex(currentWriteIndex)] = a_data;

    // update the maximum read index after saving the data. It wouldn't fail if there
    // is only one thread inserting in the queue. It might fail if there is more than
    // 1 producer thread because this operation has to be done in the same order as
    // the previous CAS
    while (!CAS(&m_maximumReadIndex, currentWriteIndex, (currentWriteIndex + 1)))
    {
        // this is a good place to yield the thread in case there are more
        // software threads than hardware processors and you have more
        // than 1 producer thread
        // have a look at sched_yield (POSIX.1b)
        sched_yield();
    }

    return true;
}
The following image describes the initial configuration of the queue. Each square describes a position in the queue. If it is marked with a big X it contains some data; blank squares are empty. In this particular case there are two elements currently inserted into the queue. WriteIndex points to the place where new data would be inserted. ReadIndex points to the slot which would be emptied in the next call to pop.
Basically, when a new element is going to be written into the queue the producer thread "reserves" the space in the queue by incrementing WriteIndex. MaximumReadIndex points to the last slot that contains valid (committed) data.
Once the new space is reserved, the current thread can take its time to copy the data into the queue. It then increments MaximumReadIndex.

Now there are three elements fully inserted into the queue. In the next step another task tries to insert a new element into the queue.
It already reserved the space its data will occupy, but before this task can copy the new piece of data into the reserved slot a different thread reserves a new slot. There are 2 tasks inserting elements at the same time.
Threads now copy their data into the slots they reserved, but it MUST be done in STRICT order: the 1st producer thread will increment MaximumReadIndex, followed by the 2nd producer thread. The strict order constraint is important because we must ensure data saved in a slot is fully committed before allowing a consumer thread to pop it from the queue.
The first thread committed its data into the slot. Now the 2nd thread is allowed to increment MaximumReadIndex as well.
The second thread has also incremented MaximumReadIndex. Now there are five elements in the queue.
3.1.3 Removing elements from the queue

This is the piece of code that removes elements from the queue:
/* ... */

template <typename ELEM_T>
bool ArrayLockFreeQueue<ELEM_T>::pop(ELEM_T &a_data)
{
    uint32_t currentMaximumReadIndex;
    uint32_t currentReadIndex;

    do
    {
        // to ensure thread-safety when there is more than 1 producer thread
        // a second index is defined (m_maximumReadIndex)
        currentReadIndex        = m_readIndex;
        currentMaximumReadIndex = m_maximumReadIndex;

        if (countToIndex(currentReadIndex) == countToIndex(currentMaximumReadIndex))
        {
            // the queue is empty or
            // a producer thread has allocated space in the queue but is
            // waiting to commit the data into it
            return false;
        }

        // retrieve the data from the queue
        a_data = m_theQueue[countToIndex(currentReadIndex)];

        // try to perform now the CAS operation on the read index. If we succeed
        // a_data already contains what m_readIndex pointed to before we
        // increased it
        if (CAS(&m_readIndex, currentReadIndex, (currentReadIndex + 1)))
        {
            return true;
        }

        // it failed retrieving the element off the queue. Someone else must
        // have read the element stored at countToIndex(currentReadIndex)
        // before we could perform the CAS operation

    } while(1); // keep looping to try again!

    // Something went wrong. it shouldn't be possible to reach here
    assert(0);

    // Add this return statement to avoid compiler warnings
    return false;
}
This is the same starting configuration as in the "Inserting elements into the queue" section. There are two elements currently inserted into the queue. WriteIndex points to the place where new data would be inserted. ReadIndex points to the slot which would be emptied out in the next call to pop.
The consumer thread which is about to read copies the element pointed to by ReadIndex and tries to perform a CAS operation over that same ReadIndex. If the CAS operation succeeds the thread has retrieved the element from the queue, and since CAS operations are atomic only one thread at a time can increment ReadIndex. If the CAS operation doesn't succeed the thread will try again with the slot that ReadIndex points to now.

Another thread (or the same one) reads the next element. The queue is now empty.
Now a task is trying to add a new element into the queue. It has successfully reserved the slot, but the change is still pending to be committed. Any other thread trying to pop a value knows the queue is not empty, because another thread already reserved a slot in the queue (writeIndex is not equal to readIndex), but it can't read the value in that slot, because MaximumReadIndex is still equal to readIndex. The thread trying to pop the value will be looping inside the pop call either until the task that is committing the change into the queue increments MaximumReadIndex, or until the queue is empty again (which can happen if, once the producer thread has incremented MaximumReadIndex, another consumer thread pops the element before our thread tries, so writeIndex and readIndex point to the same slot again).
When the producer thread has completely inserted the value into the queue, the size will be 1 and the consumer thread will be able to read the value.
3.1.4 On the need for yielding the processor when there is more than 1 producer thread

The reader might have noticed at this point that the push function may call a function (sched_yield()) to yield the processor, which might seem a bit strange for an algorithm that claims to be lock-free. As I tried to explain at the beginning of this article, one of the ways multithreading can affect your overall performance is cache trashing. A typical way cache trashing happens is on preemption: the OS must save into memory the state (context) of the exiting task and restore the context of the entering one, and, aside from this wasted time, the data saved in the cache will be useless for the new task since it was cached for the one leaving the processor.
So, when this algorithm calls sched_yield() it is specifically telling the OS: "please, can you put someone else on this processor, since I must wait for something to occur?". The main difference between lock-free and blocking synchronization mechanisms is supposed to be the fact that we don't need to be blocked to synchronize threads when working with lock-free algorithms, so why are we requesting the OS to preempt us? The answer to this question is not trivial. It is related to how producer threads store new data into the queue: they must perform 2 CAS operations in FIFO order, one to allocate the space in the queue, and an extra one to notify readers that they can read the data because it has already been "committed".
If our application has only one producer thread (which is what this queue was designed for in the first place), the call to sched_yield() will never take place, since the 2nd CAS operation (the one that "commits" data into the queue) can't fail. There is no chance that this operation can't be done in FIFO order when there is only one thread inserting stuff into the queue (and therefore modifying writeIndex and maximumReadIndex).
The problem begins when there is more than one thread inserting elements into the queue. The whole process of inserting new elements into the queue is explained in section 3.1.2, but basically producer threads do one CAS operation to "allocate" the space in which the new data will be stored, and then a 2nd CAS operation once the data has been copied into the allocated space to notify consumer threads that new data is available to read. This 2nd CAS operation MUST be done in FIFO order, that is, in the same order the 1st CAS operation was done. This is where the issue starts. Let's think of the following scenario, with 3 producer threads and 1 consumer thread:

1. Threads 1, 2 and 3 reserve their slots in the queue (their 1st CAS operations succeed in that order).
2. Thread 1 is delayed (or preempted) before committing its data, that is, before performing its 2nd CAS operation.
3. Threads 2 and 3 have already copied their data, but their 2nd CAS operations can't succeed until thread 1 performs its own, so they spin waiting for it.
In the former scenario, producer threads might be spinning for a good while trying to perform the 2nd CAS operation, waiting for another thread to do the same (in order). In multiprocessor machines where there are more idle physical processors than threads using the queue it might not be that important: threads will be stuck trying to perform the CAS operation on the volatile variable that holds the maximumReadIndex, but the thread they are waiting for has a physical processor assigned, so eventually it will perform its 2nd CAS operation while running on another physical processor, and then the spinning threads will succeed when performing theirs. So, all in all, the algorithm might hold threads looping, but this behaviour is expected (and desired), since this is the way it works as quickly as possible. There is no need, therefore, for sched_yield(). In fact, that call should be removed to achieve maximum performance, since we don't want to tell the OS to yield the processor (with the associated overhead) when there is no real need for it.
However, calling sched_yield() is very important for the queue's performance when there is more than one producer thread and the number of physical processors is smaller than the total number of threads. Think of the previous scenario again, when 3 threads were trying to insert new data into the queue: if thread 1 is preempted just after allocating its new space in the queue but before performing the 2nd CAS operation, threads 2 and 3 will be looping forever, waiting for something that won't ever happen until thread 1 is woken up. This is why the sched_yield() call is needed: the OS must not keep threads 2 and 3 looping, they must be taken off the processor as soon as possible so that thread 1 can perform its 2nd CAS operation and threads 2 and 3 can go forward and commit their data into the queue too.
The main objective of this lock-free queue is to offer a lock-free queue without the need for dynamic memory allocation. This has been successfully achieved, but the algorithm does have a few known drawbacks that should be considered before using it in production environments.
As described in section 3.1.4 (you must carefully read that section before trying to use this queue with more than one producer thread), if there is more than one producer thread they might end up spinning for too long trying to update MaximumReadIndex, since it must be done in FIFO order. The original scenario this queue was designed for had only one producer, and it is definitely true that this queue shows a significant performance drop in those scenarios where there is more than one producer thread.
Besides, if you plan to use this queue with only one producer thread, there is no need for the 2nd CAS operation (the one that commits elements into the queue). Along with the 2nd CAS operation the volatile variable m_maximumReadIndex should be removed, and all references to it changed to m_writeIndex. So push and pop should be replaced by the following snippet of code:
template <typename ELEM_T>
bool ArrayLockFreeQueue<ELEM_T>::push(const ELEM_T &a_data)
{
    uint32_t currentReadIndex;
    uint32_t currentWriteIndex;

    currentWriteIndex = m_writeIndex;
    currentReadIndex  = m_readIndex;
    if (countToIndex(currentWriteIndex + 1) == countToIndex(currentReadIndex))
    {
        // the queue is full
        return false;
    }

    // save the data into the queue
    m_theQueue[countToIndex(currentWriteIndex)] = a_data;

    // increment the write index atomically. Now a consumer thread can read
    // the piece of data that was just stored
    AtomicAdd(&m_writeIndex, 1);

    return true;
}

template <typename ELEM_T>
bool ArrayLockFreeQueue<ELEM_T>::pop(ELEM_T &a_data)
{
    uint32_t currentMaximumReadIndex;
    uint32_t currentReadIndex;

    do
    {
        // m_maximumReadIndex doesn't exist when the queue is set up as
        // single-producer. The maximum read index is described by the current
        // write index
        currentReadIndex        = m_readIndex;
        currentMaximumReadIndex = m_writeIndex;

        if (countToIndex(currentReadIndex) == countToIndex(currentMaximumReadIndex))
        {
            // the queue is empty or
            // a producer thread has allocated space in the queue but is
            // waiting to commit the data into it
            return false;
        }

        // retrieve the data from the queue
        a_data = m_theQueue[countToIndex(currentReadIndex)];

        // try to perform now the CAS operation on the read index. If we succeed
        // a_data already contains what m_readIndex pointed to before we
        // increased it
        if (CAS(&m_readIndex, currentReadIndex, (currentReadIndex + 1)))
        {
            return true;
        }

        // it failed retrieving the element off the queue. Someone else must
        // have read the element stored at countToIndex(currentReadIndex)
        // before we could perform the CAS operation

    } while(1); // keep looping to try again!

    // Something went wrong. it shouldn't be possible to reach here
    assert(0);

    // Add this return statement to avoid compiler warnings
    return false;
}
If you plan to use this queue in scenarios where there is only one producer and one consumer thread, it's definitely worth having a look at [11], where a similar circular queue designed specifically for that configuration is explained in detail.
If the queue is instantiated to hold smart pointers, note that the memory protected by a smart pointer inserted into the queue won't be completely deleted (the smart pointer's reference counter won't drop to 0) until the index where the element was stored is occupied by a new smart pointer. This shouldn't be a problem in busy queues, but the programmer should take into account that once the queue has been completely filled up for the first time, the amount of memory taken up by the application won't go down even if the queue is empty.
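This little sketch illustrates the point using boost::shared_ptr [8] (it is an assumption for illustration, and it assumes a queue template that takes the element type plus the number of slots; the real declaration is the one in the downloadable code):

#include <boost/shared_ptr.hpp>
// plus the header that declares ArrayLockFreeQueue

struct BigBuffer
{
    char data[1024 * 1024]; // 1MB we would like to see freed as soon as possible
};

void smartPointerCaveat(ArrayLockFreeQueue<boost::shared_ptr<BigBuffer>, 1000> &a_queue)
{
    boost::shared_ptr<BigBuffer> element(new BigBuffer);
    a_queue.push(element);

    boost::shared_ptr<BigBuffer> out;
    a_queue.pop(out);

    element.reset();
    out.reset();
    // The BigBuffer is still alive at this point: the array slot written by
    // the push above keeps a copy of the shared_ptr until a later push
    // overwrites that same slot.
}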
The original size function might return bogus values. This is a code snippet of it:
template <typename ELEM_T>
inline uint32_t ArrayLockFreeQueue<ELEM_T>::size()
{
    uint32_t currentWriteIndex = m_writeIndex;
    uint32_t currentReadIndex  = m_readIndex;

    if (currentWriteIndex >= currentReadIndex)
    {
        return (currentWriteIndex - currentReadIndex);
    }
    else
    {
        return (m_totalSize + currentWriteIndex - currentReadIndex);
    }
}
The following scenario describes a situation where this function returns bogus data:
1. When the statement currentWriteIndex = m_writeIndex is run, m_writeIndex is 3 and m_readIndex is 2. The real size is 1.
2. Before the next statement is run the calling thread is preempted. While it is out of the processor, elements are pushed into and popped from the queue, so that m_writeIndex is now 5 and m_readIndex is 4. The real size is still 1.
3. The calling thread resumes and reads m_readIndex. currentReadIndex is 4.
4. currentReadIndex is bigger than currentWriteIndex (which is still 3), so m_totalSize + currentWriteIndex - currentReadIndex is returned, that is, the function reports that the queue is almost full when it is in fact almost empty.

The queue uploaded with this article includes a solution to this problem. It consists of adding a new class member that contains the current number of elements in the queue, which is incremented/decremented using AtomicAdd/AtomicSub operations. This solution adds an important overhead, since these atomic operations are expensive and can't be easily optimized by the compiler.
For instance, in a test run on a Core 2 Duo E6400 at 2.13 GHz (2 hardware processors), it takes 2 threads (1 producer + 1 consumer) about 2.64s to insert 10,000k elements into a lock-free queue initialized with 1k slots without the reliable size variable, and about 3.42s (around 22% more) if that variable is being maintained by the queue. It also takes 22% longer when 2 consumers and 1 producer are used in the same environment: 3.98s for the unreliable-size version of the queue and about 5.15s for the other one.
This is why it has been left up to the developer to decide whether to activate the "reliable size" overhead; it always depends on the kind of application the queue is being used for. There is a preprocessor variable in array_lock_free_queue.h called ARRAY_LOCK_FREE_Q_KEEP_REAL_SIZE. If it is defined, the "reliable size" overhead is activated; if it is undefined, the overhead is deactivated and the size function might end up returning bogus values.
Both the lock-free queue and the Glib blocking queue included in the article are template-based C++ classes. Template code must be placed in header files, since it isn't compiled until it is instantiated from a .cpp file. I have included both queues in a .zip file which contains a sample file per queue to show its usage and to measure how long it takes for each queue to complete a multithreaded test.
Testing code has been written using gomp, the GNU implementation of the OpenMP Application Programming Interface (API) for multi-platform shared-memory parallel programming in C/C++ [9] which is included in GCC since version 4.2. OpenMP is a simple and flexible interface for developing parallel applications for different platforms and it is a very easy way of writing multithreaded code.
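As an illustration of how easy OpenMP makes this kind of test (this snippet is an assumption for illustration, it is not the test code shipped in the .zip file, and it assumes the queue template takes the element type plus the number of slots), one producer and one consumer could be spawned like this:

#include <omp.h>
// plus the header that declares ArrayLockFreeQueue

int main()
{
    const int N_ITERATIONS = 10000000;
    ArrayLockFreeQueue<int, 10000> queue; // assumed template parameters: element type and slots

    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section // the producer
        {
            for (int i = 0; i < N_ITERATIONS; ++i)
            {
                while (!queue.push(i))
                {
                    ; // the queue is full, try again
                }
            }
        }

        #pragma omp section // the consumer
        {
            int value;
            for (int i = 0; i < N_ITERATIONS; ++i)
            {
                while (!queue.pop(value))
                {
                    ; // the queue is empty, try again
                }
            }
        }
    }

    return 0;
}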
So, there are 3 parts in the code attached to this article, each one with different requirements:
1. Array based lock-free queue: array_lock_free_queue.h and array_lock_free_queue_single_producer.h.
- It relies on the GCC built-in atomic operations (CAS, AtomicAdd and AtomicSub), available since GCC 4.1.0. If another compiler is to be used, these operations must be defined in atomic_ops.h (they might depend either on the compiler, on your platform, or both).
- A definition of uint32_t is needed if this type is not included in the implementation of stdint.h of your environment. It is included in GNU/Linux, but it's not in Windows. In most modern environments all you've got to do is:

typedef unsigned int uint32_t; // int is (normally) 32bit in both 32 and 64bit machines

It's also important to note here that this queue hasn't been tested in a 64bit environment. GCC could give compile-time errors if atomic operations are not supported for 64bit type variables, and this is why a 32bit type was picked to implement the queue (on 32bit machines there might not be 64bit atomic operations). If your machine supports atomic operations over 64bit variables, I don't see why the queue would fail operating with 64bit indexes.
- The version of the queue that supports several producer threads (2 CAS operations per push operation) also makes use of a call to yield the processor: sched_yield(). According to the documentation this call is part of POSIX [10], so any POSIX compliant OS should compile it with no problem.

2. Glib based blocking queue: it requires Glib (with its thread support), since it wraps a std::queue with GLib mutexes and conditional variables.

3. Test applications: they are written using gomp/OpenMP as described above, so GCC >= 4.2 is needed to build them [9].
make N_PRODUCERS=1 N_CONSUMERS=1 N_ITERATIONS=10000000 QUEUE_SIZE=1000

Where:
- N_PRODUCERS is the number of producer threads.
- N_CONSUMERS is the number of consumer threads.
- N_ITERATIONS is the total number of elements that will be pushed into (and popped from) the queue.
- QUEUE_SIZE is the number of slots the queue is instantiated with.

The tests use nested OpenMP parallelism, so you might need to add 'OMP_NESTED=TRUE' to the command line before running the test app, as in `OMP_NESTED=TRUE ./test_lock_free_q`.
The following graphs show the result of running the test applications included in this article on a 2-core machine (2 hardware processors) with different settings and thread configurations.
A known issue of the queue is the spare 2nd CAS operation when there is only one producer thread. The following graph shows the impact of removing it on a 2-core machine when there is only 1 producer thread (lower is better). It shows an improvement of around 30% in the amount of seconds it takes to insert and extract 1 million elements simultaneously.
The following graph shows a comparison of the amount of time it takes to push and pop 1 million elements simultaneously (lower is better) depending on the set up of threads (the queue size has been set to 16384). The lock-free queue outperforms the standard blocking queue when there is only one producer thread. When several producer threads (all equally greedy to push elements into the queue) are used, the lock-free queue's performance quickly diminishes.

The following graphs show the overall performance of the different queues when 1 million items are pushed into and popped from them using different thread configurations.
6.3.1 One producer thread
6.3.2 Two producer threads
6.3.3 Three producer threads
6.3.4 One consumer thread
6.3.5 Two consumer threads
6.3.6 Three consumer threads

It would be good to run the same tests on a 4-core machine to study the overall performance when there are 4 physical processors available. Those tests could also be used to have a look at the performance impact that the sched_yield() call has when there isn't a real need for it.
This array based lock-free queue has been proved to work in its two versions: the one which is thread safe in multiple producer thread configurations, and the one which is not (the one similar to [11] but allowing multiple consumer threads). Both queues are safe to be used as a synchronization mechanism in multithreaded applications, since the indexes into the array are only updated through atomic CAS operations and no thread ever holds a lock, so threads pushing or popping elements in parallel can't run into a deadlock. However, it is important to note that even though the algorithm is thread-safe, its performance in environments with more than one producer thread is outperformed by a simple blocking queue. Therefore, choosing this queue over a standard blocking queue makes sense only when there is just one producer thread, or when blocking synchronization mechanisms and dynamic memory allocation must be avoided at whatever cost.
4th January 2011: Initial version
27th April 2011: Highlighting of some key words. Removed unnecessary uploaded file. A few typos fixed.
[1] Love, Robert: "Linux Kernel Development Second Edition", Novell Press, 2005
[2] Introduction to lock-free/wait-free and the ABA problem
[3] Lock Free Queue implementation in C++ and C#
[4] High performance computing: Writing Lock-Free Code
[5] M.M. Michael and M.L. Scott, "Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms," Proc. 15th Ann. ACM Symp. Principles of Distributed Computing, pp. 267-275, May 1996.
[6] The Hoard Memory Allocator
[7] Glib Asynchronous Queues
[8] boost::shared_ptr documentation
[9] GNU libgomp
[10] sched_yield documentation
[11] lock-free single producer - single consumer circular queue
This article, along with any associated source code and files, is licensed under The BSD License
Faustino Frechilla
Software Developer
Spain
I am a software engineer who enjoys facing computer performance challenges and solving them through multi-threading and synchronisation.
I have always liked the free-software (or open-source for that matter) approach, and how it has been proved to be a way for improving our world and sharing knowledge while it's also possible to make a living out of it.
So far I have only tried to release one project into the free-software community, about a board game I love: blokus (blockem.sourceforge.net). I've left the profit bit for some other project to come...