Need for Puptime
IRIS runs over twenty-five network services, including SMTP servers and DNS name servers in addition to the expected HTTP servers and databases. Ensuring that every service works and is always available (that is, has a high uptime) can be tricky. An uptime monitoring tool automates the checks and notifies the IRIS team as soon as a service fails, so that they can take corrective action.
Many server uptime monitoring tools are available, from free and open-source projects to enterprise-grade products, but none of them satisfied our requirements. In particular, we were looking for flexibility in integration (for example, recording failures in a database for analytics and posting status updates to a secondary website) and more precision with application protocols (identifying whether Redis itself is working rather than merely whether something is listening on its port).
Puptime, like any other uptime monitor, has two primary functions: querying the state of network services and publishing notifications. Let’s take a closer look at each of them.
Querying the state of a network service means checking whether the service is working and accessible. For an HTTP server, this might include checking whether the host is reachable (the host is active, part of the network, and not blocked by a firewall), whether the port is reachable (a process is listening), and whether the contents of the page are what we expect, typically by looking for specific phrases in the response. For a DNS server, we ask for the IP addresses of known domains and compare the results with the expected ones. Querying a network service therefore depends on its application protocol. The functionality is implemented by Puptime::Service::DNS and other application-specific classes, which inherit common behavior from a shared base class.
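To make the idea of protocol-specific checks concrete, here is a minimal Ruby sketch. It is not Puptime's actual code: the class names `ServiceCheck`, `HTTPCheck`, and `DNSCheck`, the `Result` struct, and the injectable `fetch:` and `lookup:` parameters are all assumptions made for illustration. Only `Net::HTTP.get` and `Resolv::DNS` come from Ruby's standard library.

```ruby
require "net/http"
require "resolv"
require "uri"

# Hypothetical base class holding behavior shared by every check.
class ServiceCheck
  Result = Struct.new(:name, :ok, :message)

  def initialize(name)
    @name = name
  end

  # Runs the protocol-specific #check and reports any exception
  # (unreachable host, closed port, timeout) as a failure.
  def run
    ok, message = check
    Result.new(@name, ok, message)
  rescue StandardError => e
    Result.new(@name, false, "#{e.class}: #{e.message}")
  end
end

# HTTP: fetch the page and look for an expected phrase in the body.
class HTTPCheck < ServiceCheck
  def initialize(name, url, phrase, fetch: ->(u) { Net::HTTP.get(URI(u)) })
    super(name)
    @url = url
    @phrase = phrase
    @fetch = fetch
  end

  def check
    found = @fetch.call(@url).include?(@phrase)
    [found, found ? "phrase found" : "phrase #{@phrase.inspect} missing"]
  end
end

# DNS: ask the server for a known domain and compare the addresses.
class DNSCheck < ServiceCheck
  def initialize(name, nameserver, domain, expected, lookup: nil)
    super(name)
    @domain = domain
    @expected = expected.sort
    @lookup = lookup || ->(d) {
      Resolv::DNS.open(nameserver: [nameserver]) { |dns| dns.getaddresses(d).map(&:to_s) }
    }
  end

  def check
    actual = @lookup.call(@domain).sort
    [actual == @expected, "expected #{@expected.inspect}, got #{actual.inspect}"]
  end
end
```

Calling `run` on either check returns a `Result` whose `ok` field tells the caller whether to raise an alert. The `fetch:` and `lookup:` parameters exist only so the sketch can be exercised without touching the network.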
When a network service fails, Puptime publishes notifications on communication channels. Currently, Puptime can send e-mails and post messages on Microsoft Teams, although the exact nature of the channel is unimportant. For organizations that use Slack for internal communication, Puptime should ideally be extended to notify on Slack (we look at problems with extending Puptime later on). As with services, the functionality resides in channel-specific classes such as Puptime::Notifier::Team, with common behavior in a shared base class.
The services and communication channels are connected as follows. Puptime::ServiceSet is the set of network services queried, with each query running in a new thread. If a service fails, the failure and any diagnostic information are pushed to Puptime::NotificationQueue and consumed by the communication channels.
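The flow described above is a classic producer-consumer arrangement. The following is a rough sketch of that pattern using Ruby's built-in `Queue`, not the real Puptime::ServiceSet or Puptime::NotificationQueue. The `run_checks` function and the shape of its arguments (callables returning an `[ok, diagnostic]` pair) are assumptions.

```ruby
# services:  { name => callable returning [ok, diagnostic] } (assumed shape)
# notifiers: callables that each receive a failure hash (assumed shape)
def run_checks(services, notifiers)
  queue = Queue.new # thread-safe FIFO from Ruby's core library

  # One thread per query; only failures are pushed onto the queue.
  workers = services.map do |name, check|
    Thread.new do
      ok, diagnostic = begin
        check.call
      rescue StandardError => e
        [false, e.message]
      end
      queue << { service: name, diagnostic: diagnostic } unless ok
    end
  end

  # A single consumer hands every failure to every channel.
  consumer = Thread.new do
    while (failure = queue.pop)
      notifiers.each { |notify| notify.call(failure) }
    end
  end

  workers.each(&:join)
  queue.close # pop returns nil once the closed queue is drained
  consumer.join
end
```

Closing the queue after the workers finish is what lets the consumer exit cleanly. A long-running monitor would presumably keep the consumer alive across polling rounds instead.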
Downsides of Current Architecture
While the current architecture works and Puptime is used successfully in production, I cannot help but feel uneasy. The Unix Philosophy summarizes my unease with remarkable accuracy:
Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.
– Doug McIlroy, the inventor of Unix pipes and one of the founders of the Unix tradition.
It’s hard to argue that Puptime does one thing, especially with the increasing number of integrations. Currently, it is supposed to:
- Query network systems about their status.
- Save query results into a log file.
- Save query results into a database.
- Send notifications through Microsoft Teams.
- Send notifications through e-mail.
We are also looking to add a web server for checking whether Puptime itself is up (infinite recursion, anyone?), to integrate with version control so that status updates are published to a secondary website similar to istheservicedown, and to support integrations we have not yet foreseen. The point is that Puptime wears many hats.
Puptime is also a black box to other applications. It takes no input (except a configuration file) and produces no output (again, except log files meant for human consumption). The inability to work with other applications makes it difficult to solve ad-hoc problems with Puptime. Extending Puptime to send notifications on Slack would require code changes upstream, even though a Slack notifier is functionally the same as the Microsoft Teams and e-mail ones. One might implement every requested integration to compensate, but that requires substantial developer investment, grows the codebase (making the application slower and larger), and forces unneeded dependencies on users.
As discussed in the previous section, Puptime does many things and does not work with other programs.
Life is easier when we dump our problems on others.
– Anthony Williams, author of “C++ Concurrency in Action”, discussing how using std::async dumps thread management onto the C++ standard library.
If Puptime only queries network systems and calls the appropriate hooks, we have pulled off an impressive magic trick in shrinking its responsibilities. We dump the problem of writing integrations onto the users (who know their systems better than we do). Suddenly, all sorts of integrations seem possible. After all, it’s only a matter of writing an appropriate hook.
Let’s walk through the new cycle of events:
- Puptime starts and parses configuration.
- For each service in configuration, Puptime queries the service and writes the result to a temporary file.
- Depending on the service status, Puptime calls the appropriate hooks (if any), passing the temporary file, and moves on to the next service.
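The cycle above can be sketched in a few lines of Ruby. This is a proposal-stage illustration rather than existing Puptime code: the `run_cycle` function, the `:success` and `:failure` hook groups, and the JSON keys `service`, `status`, and `details` are all assumed names. Hooks are treated as external commands that receive the path of the result file as their last argument.

```ruby
require "json"
require "tempfile"

# services: { name => callable returning [ok, details] } (assumed shape)
# hooks:    { success: [...], failure: [...] }, where each hook is an
#           executable path or an argv array such as ["ruby", "hook.rb"]
def run_cycle(services, hooks)
  services.each do |name, check|
    ok, details = check.call
    status = ok ? :success : :failure
    result = { service: name, status: status, details: details }

    # The temporary file is removed automatically when the block ends.
    Tempfile.create(["puptime-", ".json"]) do |file|
      file.write(JSON.generate(result))
      file.flush # make sure hooks see the full contents

      # Run each hook for this status without going through a shell.
      hooks.fetch(status, []).each do |hook|
        system(*Array(hook), file.path)
      end
    end
  end
end
```

A configuration might then map failures to something like `{ failure: ["/usr/local/bin/notify-team"] }` (a made-up path), and Puptime itself would need no knowledge of what the hook does.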
I like to call the alternative architecture an “event-driven” architecture, because the focus shifts from a linear procedure to calling hooks when a relevant event occurs. Seen this way, logging is just a hook called after every query, and sending notifications is just a hook called after every failed query.
Integrating new services is a breeze now. For example, integrating Slack no longer requires implementing Puptime::Notifier::Slack, only writing a hook that parses a JSON file and makes use of the Slack API.
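A Slack hook could look roughly like the Ruby script below. It assumes the result file is JSON with `service`, `status`, and `details` keys (a hypothetical layout, since the file format is not yet fixed) and that the webhook URL is supplied through an environment variable named `SLACK_WEBHOOK_URL` (also an assumption). Slack's incoming webhooks accept a JSON body with a `text` field, which is all this sketch relies on.

```ruby
require "json"
require "net/http"
require "uri"

# Builds the JSON body for a Slack incoming webhook from a result file.
# The "service", "status", and "details" keys are an assumed layout.
def slack_payload(result_path)
  result = JSON.parse(File.read(result_path))
  text = "#{result.fetch('service')} is #{result.fetch('status')}: #{result['details']}"
  JSON.generate({ text: text })
end

# Posts the payload to the given incoming-webhook URL.
def notify_slack(result_path, webhook_url)
  Net::HTTP.post(URI(webhook_url), slack_payload(result_path),
                 { "Content-Type" => "application/json" })
end

# When run as a hook, the result file path arrives as the first argument.
if __FILE__ == $PROGRAM_NAME && ARGV[0]
  notify_slack(ARGV[0], ENV.fetch("SLACK_WEBHOOK_URL"))
end
```

Everything Slack-specific lives in this one script, so swapping Slack for another chat service means replacing the hook rather than changing Puptime.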
By shifting responsibilities, Puptime can focus on its niche: targeting more application protocols with a specificity that other uptime monitors lack.
The alternative architecture is not without faults. Writing to and reading from files adds some overhead compared to shared memory access. Executing external files without sandboxing can lead to security vulnerabilities (especially if Puptime is run as root). Puptime also needs to ensure backward compatibility should the format of the result file change. But in my opinion, the positives outweigh the negatives.
Puptime is a tool that grew out of IRIS’s needs that existing uptime services did not satisfy – greater specificity with application protocols, automatic triggers to integrate with our workflow, and low resource consumption. It’s out of the initial prototyping phase and we are looking to add features fast.
We are happy to open source it and look forward to contributions on making the alternative design a reality and more!