Overview

Welcome to the Assimilation project. (README) - proudly sponsored by Assimilation Systems Limited.

We provide open source discovery with zero network footprint integrated with highly-scalable monitoring. Here are the problems we address:

Organizations are vulnerable to attack through forgotten or unknown systems (30% of all intrusions)
Organizations have no automatic infrastructure discovery, or they run it infrequently
- => System configuration information is out of date, or only in people's heads
System discovery is not integrated with monitoring
- Most organizations have no way of knowing they're monitoring everything - and probably aren't
- Most monitoring is time-consuming to configure, incomplete, out-of-date, easily confused
Monitoring is complex and expensive to scale.

What we do: Continually discover and monitor systems, services, switches and dependencies with very low human and network overhead

Discover systems, services, switches and dependencies using zero network footprint techniques
Monitor systems and services with very low overhead and extreme scalability
Make montoring easy to configure and manage

Introduction

The Assimilation Project is designed to discover and monitor infrastructure, services, and dependencies on a network of potentially unlimited size, without significant growth in centralized resources. The work of discovery and monitoring is delegated uniformly in tiny pieces to the various machines in a network-aware topology - minimizing network overhead and being naturally geographically sensitive.

The two main ideas are:

distribute discovery throughout the network, doing most discovery locally
distribute the monitoring as broadly as possible in a network-aware fashion.
use autoconfiguration and zero-network-footprint discovery techniques to monitor most resources automatically. during the initial installation and during ongoing system addition and maintenance.

The original monitoring scalability idea was outlined in two different articles

These two main ideas create a system which will provide significant important capabilities giving both a great out-of-the-box experience for new users and smooth accommodation of growth to virtually all environments.

For a human-driven overview, we recommend our videos from interviews and conference presentations.

We also have a few demos, which demonstrate the ease of use and power of the Assimilation software.

Project Sponsors

Assimilation Systems Limited was founded by project founder Alan Robertson to provide paid support and alternative licenses for the Assimilation Project.

Project Integrity

The project software undergoes a number of rigorous static and dynamic tests to ensure its continued integrity.

Highly restrictive gcc options in all compiles - no warnings allowed (-Werror)
Static Analysis via the Clang static analyzer - zero warnings allowed before changes are pushed to public repository
Static Analysis by Coverity before each release candidate (and other times)
Four collections of regression tests - successful run required before any changes are pushed to public repository
pylint Python code checker - enforces Python coding standards and performs static error checks

Progress Reports on the project

The team currently posts updates in the following places:

#assimilation channel on freenode IRC
[Google+] page for the Assimilation Project.
Twitter - fairly frequent from @OSSAlanR - using hash tag #AssimMon
Assimilation Systems Blog - about weekly
Older Managing Computers with Automation Blog
Assimilation Mailing List - not quite as often - shooting for weekly.
Assimilation Trello project management boards. These give a very good overview of the project goals, current and future work items along with open project roles. Come find your future here!
linux-ha-dev mailing list - parent project

External Links

Ohloh entry for the Assimilation Project.
Assimilation Interview - Fall 2012.
Assimilation Interview - April 2013.
LinuxCon 2012 slides, video.
A picture of what our graphs look like.
Blog articles on our Neo4j schema:

Architecture

This concept has two kinds of participating entities:

a Centralized Management Authority - monitoring the collective, and collecting discovery information
a potentially very large number of lightweight monitoring/discovery agents (aka nanoprobes)

Scalable Monitoring

The picture below shows the architecture for discovering system outages.

Multi-Ring Heartbeating Architecture

Each of the blue boxes represents a server. Each of the connecting arcs represent bidirectional heartbeat paths. When a failure occurs, the systems which observe it report directly to the central collective management authority (not shown on this diagram). Several things are notable about this kind of heartbeat architecture:

It has no single points of failure. Each system is monitored by at least two other systems.
It is simple to detect the difference between a switch failure and a host failure by which systems report the failure, and which ones do not.
Each system talks to no more than 4 systems - no matter how big the collection being monitored. Since the central system only hears from the monitored systems when a failure occurs, the work to perform monitoring of systems does not go up as the number of systems being monitored goes up.
Approximately 96% of all monitoring traffic stays within edge switches.
This architecture is naturally geographically sensitive. Very little traffic goes between sites to monitor multiple sites from a central location.
This architecture is simple and easy to understand.

This is all controlled and directed by the collective monitoring authority (CMA) - which is designed to be configured to run in an HA cluster using a product like Pacemaker. The disadvantage of this approach is the getting started after a complete data center outage/shutdown can take a while - this part is not O(1).

An alternative approach would be to make the rings self-organizing. The advantage of this is that startup after an full datacenter outage would happen much more quickly. The disadvantage is that this solution is much more complex, and embeds knowledge of the desired topology (which is to some degree a policy issue) into the nanoprobes. It also is not likely to work as well when CDP or LLDP are not available, and to properly diagnose complex faults, it is necessary to know the order nodes are placed on rings.

Autoconfiguration through Discovery

One of the key aspects of this system is it be largely auto-configuring, and incorporates discovery into its basic philosophy. It is expected that a customer will drop the various nanoprobes onto the clients being monitored, and once those are running, the systems register themselves and get automatically configured into the system once the nanoprobes are installed and activated.

What is Zero Network Footprint Discovery™?

Zero-network-footprint discovery is a process of discovering systems and services without sendign active probes across the network which might trigger security alarms. Some examples of current and anticipated zero-network-footprint discovery techniques include:

Discovery of newly installed systems by auto-registration
Discovery of network topology using LLDP and CDP aggregation
Discovery of services using netstat -utnlp
Discovery of services using "service" command and related techniques
Discovery of systems using arp -n
Discovery of systems using netstat -utnp
Discovery of service interdependencies using netstat -utnp
Discovery of network filesystem mount dependencies using the mount table

These techniques will not immediately provide a complete list of all systems in the environment. However as nanoprobes are activated on systems discovered in this way, this process will converge to include the complete set of systems and edge switches in the environment - without setting off even the most sensitive security alarms.

In addition, the netstat information correlated across the servers also provides information about dependencies and service groups.

Furthermore, these nanoprobes use zero-network-footprint discovery methods to discover systems not being monitored and services on the systems being monitored. Zero-network-footprint discovery methods are methods which cannot trip even the most sensitive network security alarm - because no probes (packets) are sent over the network to perform discovery.

This discovery process is intended to achieve these goals:

Simplify initial installation
Provide a continuous audit of the monitoring configuration
Create a rich collection of information about the data center

Zero-Network-Footprint Discovery Process

Lightweight monitoring agents

The nanoprobe code is written largely in C and minimizes use of:

CPU
memory
disk
network resources

To do this, we will follow a management by exception philosophy for exception monitoring - when nothing is wrong, nothing will be reported. Although the central part of the code will likely be only available on POSIX systems, the nanoprobes will also be available on various flavors of Windows as well.

Service Monitoring

To the degree possible, we will perform exception monitoring of services on the machine they're provided on - which implies zero network overhead to monitor working services. Stated another way, we follow a management by exception philosophy. Our primary tool for monitoring services is through the use of a re-implemented Local Resource Manager from the Linux-HA project.

Testing Strategy

There are three kinds of testing I see as necessary

junit/pyunit et al level of testing for the python code
Testing for the C nanoprobes in situ
System level (simulated) testing for the CMA Each of these areas is discussed below.

Unit-level testing

We are currently using the Testify software written by the folks at Yelp. Probably will try some of the alternatives as well. Very pleased with the results it's bringing. The nice thing about this is much of the detailed gnarly C code is wrapped by the python code, so when I run the python tests of those wrappers, the C code under it gets well tested as well.

Testing of the Nanoprobes

Not quite sure how to best accomplish this. Some of it can just be my home network, but I suppose I could also spin up some cloud VMs too... Not sure yet... Automation is a GoodThing.

Testing of the Collective Management Code

I have been thinking about this quite a bit, and have what I think is a reasonable idea about it. It involves writing a simulator to simulate up to hundreds of thousands of nanoprobe clients through a separate python process - probably using the Twisted framework. It would accept and ACK requests from the CMA and randomly create failure conditions similar to those in the "real world" - except at a radically faster rate. This is a big investment, but likely worth it. It helps to have this in mind while designing the CMA as well - since there are things that it could do to make this job a little easier.