Devil In The Details


OpenClovis is built from the ground up to enable highly reliable systems

Although Linux is entering widespread unattended enterprise use, it was not originally written for this kind of duty. Standard Linux software libraries were not written to be reliable in long-running unattended installations. Because the resulting bugs occur rarely, they can require thousands of hours of run time to reproduce and are very difficult to diagnose, so you are unlikely to find them until after deployment. Here are a few examples:

Process Management

The standard Linux semaphore facility (the sem_xxx calls) is not robust in the face of process failures. If a process fails while a semaphore is “taken”, that semaphore remains “taken” indefinitely, permanently blocking all processes that use it. The same effect occurs if ordinary (non-robust) pthread mutexes are used in shared memory.

To make matters worse, the industry-standard Boost C++ libraries can hang, since they base their “interprocess” semaphores on these Linux semaphores. Use of the Boost interprocess library is therefore dangerous in a high availability environment.

Applications using OpenClovis SAFplus rarely need to use interprocess semaphores directly, since coordination happens through the SAFplus availability, event, checkpoint, and messaging abstractions. But under the hood, OpenClovis semaphores are automatically “given” when a process fails, and shared memory access is carefully managed so that data integrity and consistency are maintained even if a process fails during an update.
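
To make the failure mode and one possible recovery concrete, here is a minimal sketch using the POSIX robust mutex attribute. It is not a description of how SAFplus implements its semaphores; it only shows a standard facility with a similar "released when the owner dies" behavior. The function name lock_after_owner_death is hypothetical. A child process takes a mutex in shared memory and exits without releasing it; the parent's next lock attempt then returns EOWNERDEAD instead of blocking forever.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: returns the result of the parent's lock attempt after
 * a child process died while holding the mutex, or -1 on setup failure. */
int lock_after_owner_death(void)
{
    /* Place the mutex in memory shared between parent and child. */
    pthread_mutex_t *m = mmap(NULL, sizeof *m, PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED)
        return -1;

    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    /* Without this attribute the parent below would block forever. */
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);

    pid_t pid = fork();
    if (pid < 0) {
        munmap(m, sizeof *m);
        return -1;
    }
    if (pid == 0) {
        pthread_mutex_lock(m);
        _exit(0); /* simulate a crash while the lock is held */
    }
    waitpid(pid, NULL, 0);

    int rc = pthread_mutex_lock(m);
    if (rc == EOWNERDEAD) {
        /* The lock is now ours. Repair any half-updated shared data
         * here, then mark the mutex usable again. */
        pthread_mutex_consistent(m);
    }
    if (rc == 0 || rc == EOWNERDEAD)
        pthread_mutex_unlock(m);

    pthread_mutex_destroy(m);
    munmap(m, sizeof *m);
    return rc;
}
```

Note that the robust attribute only hands the lock over; restoring the consistency of whatever the dead process was updating is still the application's job, which is the harder part of the problem described above.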

Timers

Linux timers based on the calendar clock are affected by changes in the system time, such as those produced by daylight saving time adjustments. Applications that base protocol timers on these Linux timers can suffer improper, potentially catastrophic timing exactly twice a year.

OpenClovis provides a high performance timer library that can handle tens of thousands of timers and is not affected by system clock changes.

Standard Input/Output Blocking

When a Linux process spawns other processes, the child processes send their standard output (e.g. “printf”) to the same destination as the parent. But if this destination stops (blocks) for ANY reason, all the child processes will also block, which can cause your entire application stack to stop in an improperly set up system. Additionally, a “runaway” process can send so much data to the I/O system that working applications are slowed to a crawl.

The OpenClovis SAFplus logging system avoids file I/O altogether, instead using an unblockable shared memory architecture that is unaffected by “runaway” processes.
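
The blocking behavior described above can be reproduced, and avoided in a simple way, with a pipe whose reader has stalled. The sketch below is not the SAFplus shared memory design; it only shows the underlying idea that a logger should drop output rather than wait. The function names log_nonblocking and count_drops_on_stalled_pipe are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Returns 1 if the line was written, 0 if it was dropped because the
 * destination is full, and -1 on any other error. Never blocks when fd
 * is in non-blocking mode. */
int log_nonblocking(int fd, const char *line)
{
    ssize_t n = write(fd, line, strlen(line));
    if (n >= 0)
        return 1;
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return 0;
    return -1;
}

/* Simulates a stalled log consumer: nothing ever reads from the pipe.
 * Returns how many lines were dropped, or -1 on setup failure. */
int count_drops_on_stalled_pipe(int lines)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    int flags = fcntl(fds[1], F_GETFL);
    if (flags < 0 || fcntl(fds[1], F_SETFL, flags | O_NONBLOCK) < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    int dropped = 0;
    for (int i = 0; i < lines; i++) {
        if (log_nonblocking(fds[1], "status: heartbeat ok\n") == 0)
            dropped++;
    }

    close(fds[0]);
    close(fds[1]);
    return dropped;
}
```

With a default blocking descriptor the same loop would hang as soon as the pipe buffer filled. Here the call returns and the caller learns how much output was lost.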

Kernel Optimization for High Bandwidth/Low Latency Communications

In a high performance, low latency physically co-located cluster, the vast majority of packet drops occur due to buffering problems within the sending and receiving network stack. OpenClovis recommends specific kernel buffer settings and suggests modified, high performance kernel driver libraries to ensure that communications occur smoothly.
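
The specific settings OpenClovis recommends are not listed here, so the following is only a sketch of how an application can request a larger receive buffer for one socket and check what the kernel actually granted. The function name granted_rcvbuf is hypothetical. On Linux the granted size is capped by the net.core.rmem_max sysctl, which is why the value read back may differ from the value requested.

```c
#include <sys/socket.h>
#include <unistd.h>

/* Requests a receive buffer of requested_bytes on a new UDP socket and
 * returns the size the kernel reports afterwards, or -1 on failure. */
int granted_rcvbuf(int requested_bytes)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    int granted = -1;
    socklen_t len = sizeof granted;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                   &requested_bytes, sizeof requested_bytes) != 0 ||
        getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len) != 0) {
        granted = -1;
    }

    close(fd);
    return granted;
}
```

Comparing the return value of granted_rcvbuf(1 << 20) with the requested size is a quick way to see whether the system-wide limit needs to be raised before the per-socket request can take effect. The 1 MiB figure is a placeholder, not a recommendation.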

In summary, OpenClovis has spent $40M and many man-years perfecting its 99.999% reliable system, so that users can build applications on top of SAFplus and be confident they can be safely deployed in the most exacting environments.


Read more about our products, download our software for free or contact us to talk about your project