How to implement a SOLID Watchdog

[Image: close-up photo of a dog wearing sunglasses. Photo by Ilargian Faus on Pexels.com]

We all have components in our software or network whose behaviour we want to monitor.

The piece that does this is usually called a watchdog component. I was not that happy with some of the implementations I had seen, so I decided to put a bit of my private time into it, as a bit of a “technical challenge”.

 

Quick introduction

Recently at my company we had to implement a watchdog to monitor hardware resources and their availability.
The reasoning is that network connectivity is “eventual” and we must react to these states (basically we are talking about trains, their wagons and accessing them from a central system, so things happen like wagons being separated and joined, tunnels, and our wonderful networks that work so flawlessly when we need them the most, right?)
But this could be carried over to other scenarios, like watching over microservices, identifying whether they are behaving properly and, if not, restarting them…

I decided to spend some fun tech time gathering the best ideas I could find on the net and from my colleagues, and to create something close to the best possible solution for a watchdog system that can notify “some other system” when disconnection and/or re-connection events happen 😉
Yes, I wanted this to be efficient but also as decoupled as possible.

Maybe too much for a single blog post? Maybe you are correct, but come with me till the end of the post to see if I managed to get there…

Let’s get to it!!

 

An early beginning..

To monitor networking resources we will use Ping, which lives in the .NET namespace System.Net.NetworkInformation.

Ping sends an ICMP (Internet Control Message Protocol) echo message to the computer with the specified IP address. There is also a parameter for a timeout in milliseconds. Note, however, that if this number is very small, the reply can still be received even after the timeout has elapsed, so for tiny values it is not a strict limit.

With that, we can use its basic construct, so let’s try to ping ourselves..

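// Requires: using System; using System.Net.NetworkInformation;
int timeout = 10; // in ms
using (Ping p = new Ping()) // Ping is disposable, so release it when done
{
    PingReply rep = p.Send("192.168.178.1", timeout);
    if (rep.Status == IPStatus.Success)
    {
        Console.WriteLine("It's alive!");
        Console.ReadLine();
    }
}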

Not very exciting yet, but it works (or it should… as we are pinging ourselves… unless we have a very restrictive firewall..)

 

Now some more timers.. and pinging Asynchronously…

While looking for references, Tim Cooker’s response to a particular post looked to me like the best implementation so far. Ref: https://stackoverflow.com/questions/4042789/how-to-get-ip-of-all-hosts-in-lan/4042887#4042887

He uses a CountdownEvent primitive to synchronise the Ping.SendAsync() responses, which I particularly liked…
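
To make the CountdownEvent idea a bit more tangible, here is a minimal sketch of waiting for a batch of asynchronous completions. It is not the code from the referenced answer: the pings are replaced by simulated work on the thread pool, and the names CountdownDemo and RunBatch are made up for this example.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class CountdownDemo
{
    // Starts "count" simulated asynchronous operations and blocks until
    // every one of them has signalled. Returns the number of completions seen.
    public static int RunBatch(int count)
    {
        int completed = 0;
        var countdown = new CountdownEvent(count);

        for (int i = 0; i < count; i++)
        {
            // Stand-in for Ping.SendAsync(): the work finishes on a thread-pool thread.
            Task.Run(() =>
            {
                Thread.Sleep(10);                     // pretend to wait for a reply
                Interlocked.Increment(ref completed); // thread-safe counter update
                countdown.Signal();                   // what a PingCompleted handler would do
            });
        }

        countdown.Wait(); // blocks the calling thread until the count reaches zero
        return completed;
    }
}
```

Calling CountdownDemo.RunBatch(5) returns once all five simulated operations have signalled. Keep in mind that Wait() blocks the calling thread, which matters for what comes next.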

The takeaway is that synchronising the Ping.SendAsync() calls is a bit confusing.

If you bring it to Windows Forms, you will find some issues due to the nature of Ping.SendAsync().

Read more on it here: https://stackoverflow.com/questions/7766953/asynchronous-code-that-works-in-console-but-not-in-windows-forms/7767632#7767632

Basically, Ping tries to raise PingCompleted on the same thread that SendAsync() was invoked on. But, as we blocked that thread with the CountdownEvent, the ping cannot complete. Welcome, deadlock (I am speaking of the case of executing this code on WinForms).

In a console application it works because there is no synchronisation provider, so PingCompleted is raised on a thread-pool thread.

A solution would be to run the code on a worker thread, but then PingCompleted will also be raised off the UI thread, so the handler must not touch UI controls directly..

Another Pearl on “Ping usage”

I found another article which is a must read: http://www.justinmklam.com/posts/2018/02/ping-sweeper/

It clarifies what we saw regarding Ping.SendAsync(). Basically, there is another method in this .NET namespace with an extremely similar name, SendPingAsync(). At first sight we would think it is redundant and does not sound right… so, what do they do and how are they different?

According to MSDN

  • SendAsync method Asynchronously attempts to send an Internet Control Message Protocol (ICMP) echo message to a computer, and receive a corresponding ICMP echo reply message from that computer.
  • SendPingAsync Sends an Internet Control Message Protocol (ICMP) echo message to a computer, and receives a corresponding ICMP echo reply message from that computer as an asynchronous operation.

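So, translating a bit: both methods are asynchronous, but they follow different patterns. SendAsync is event-based: it returns immediately and later raises the PingCompleted event, with the threading caveats we saw above. SendPingAsync is task-based: it returns a Task<PingReply> that we can simply await.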

So, we should use SendPingAsync().
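
As a small illustration, this is roughly how an awaited ping could look. It is a sketch under my own assumptions: the helper name PingHelper.IsReachableAsync is made up, and treating a PingException as “not reachable” is a choice I made for the example, not something the documentation prescribes.

```csharp
using System;
using System.Net.NetworkInformation;
using System.Threading.Tasks;

public static class PingHelper
{
    // Returns true when the host answered the echo request within the timeout.
    public static async Task<bool> IsReachableAsync(string host, int timeoutMs)
    {
        using (var ping = new Ping())
        {
            try
            {
                // No event handler and no blocking: the method resumes here when the reply arrives.
                PingReply reply = await ping.SendPingAsync(host, timeoutMs);
                return reply.Status == IPStatus.Success;
            }
            catch (PingException)
            {
                // Raised, for example, when the host name cannot be resolved.
                return false;
            }
        }
    }
}
```

From an async method it would be used as: bool up = await PingHelper.IsReachableAsync("192.168.178.1", 500);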

 


 

Putting it all together with our “ol’ friend”, the “Timer”..

The initial implementations I have seen for a watchdog usually followed the same pattern: a timer is created that triggers a “watch this resource” task at a regular interval.

Each tick of the timer triggers the ping process we have seen.

I got two very good tips on this implementation:

  • Once the timer tick is called, we pause the timer by setting its due time and period to infinite. We resume it only when the tasks to be performed, in this case a single ping, are done. This ensures that no overlap happens and that the number of threads does not escalate.
  • Set the timer to a random interval, to add a bit of variability to the execution and help distribute the load, i.e., the pings do not all happen at the same time.

 

Some code:

using System;
using System.Net.NetworkInformation;
using System.Threading;

namespace cappPingWithTimer
{
    class Program
    {
        static Random rnd = new Random();
        static int min_period = 30000; // in ms
        static int max_period = 50000;
        static Timer t;
        static string ipToPing = "192.168.178.1";
        static void Main(string[] args)
        {
            Console.WriteLine("About to set a timer for the ping... press enter to execute >.<");
            Console.ReadLine();
            Console.WriteLine("Creating new timer and executing it right away...");
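            // Create the timer disarmed and only then start it, so that "t" is already assigned when the first tick runs.
            t = new Timer(new TimerCallback(timerTick), null, Timeout.Infinite, Timeout.Infinite);
            t.Change(0, Timeout.Infinite);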
            Console.WriteLine("press Enter to stop execution!");
            Console.ReadLine();
        }

        private static void timerTick(object state)
        {
            Console.WriteLine("Timer tick");
            t.Change(Timeout.Infinite, Timeout.Infinite); // this will pause the timer until the ping completes
            Ping p = new Ping();
            p.PingCompleted += new PingCompletedEventHandler(p_PingCompleted);
            Console.WriteLine("Sending the ping inside the timer tick...");
            p.SendAsync(ipToPing, 500, ipToPing);
        }

        private static void p_PingCompleted(object sender, PingCompletedEventArgs e)
        {
            string ip = (string)e.UserState;
            if (e.Reply != null && e.Reply.Status == IPStatus.Success)
            {
                Console.WriteLine("{0} is up: ({1} ms)", ip, e.Reply.RoundtripTime);
            }
            else if (e.Reply == null) //if the IP address is incorrectly specified, the reply object can be null, so it needs to be handled for the code to be resilient..
            {
                Console.WriteLine("Pinging {0} failed. (Null Reply object?)", ip);
            }
            else
            {
                Console.WriteLine("response: {0}", e.Reply.Status.ToString());
            }

            int TimerPeriod = getTimeToWait();
            Console.WriteLine(String.Format("rescheduling the timer to execute in... {0}", TimerPeriod));
            t.Change(TimerPeriod, TimerPeriod); // this will resume the timer
        }

        static int getTimeToWait()
        {
            return rnd.Next(min_period, max_period);
        }
    }
}

 

The main issue here is that this technique needs one timer “watchdog” for every component/resource we want to monitor, so architecture-wise it can turn into an uncontrolled mess that keeps a lot of threads busy..

 

But this is a Task driven world… since .NET 4.5 at least..

So, now we know how to send a truly asynchronous ping with SendPingAsync(), and how to trigger a ping recurrently with a Timer object instance..

But as of this writing we have better tools in .NET for asynchronous/parallel work.. and it is not BackgroundWorker (which would be the choice if we needed to deploy a pre-.NET 4.5 solution)…

…but async/await Tasks, which we have had since .NET 4.5 (aka TAP).

Basically it becomes a matter of creating a task for every ping operation and waiting for all of them to complete in parallel.

And inside a watchdog, we call this at a regular interval.. maybe not with a Timer… with .NET 4.5 we can try to avoid creating extra threads, right?
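
The “one task per ping, wait for all of them” idea can be sketched as below. To keep it independent from the network, the actual check is passed in as a delegate; ParallelProbe, ProbeAllAsync and the probe parameter are hypothetical names for this example, and in the watchdog the delegate would wrap Ping.SendPingAsync().

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class ParallelProbe
{
    // Starts one probe per distinct address, lets them run concurrently,
    // and returns a map of address -> probe result.
    public static async Task<Dictionary<string, bool>> ProbeAllAsync(
        IEnumerable<string> addresses, Func<string, Task<bool>> probe)
    {
        // Calling probe(address) here starts every task right away.
        var pending = addresses
            .Distinct()
            .ToDictionary(address => address, address => probe(address));

        // Asynchronously wait until all probes have finished; no thread is blocked meanwhile.
        await Task.WhenAll(pending.Values);

        // Every task is complete at this point, so reading Result does not block.
        return pending.ToDictionary(pair => pair.Key, pair => pair.Value.Result);
    }
}
```

With a real ping the call could look like: var results = await ParallelProbe.ProbeAllAsync(ips, async ip => (await new Ping().SendPingAsync(ip, 500)).Status == IPStatus.Success);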

 

And now, discussing our Watchdog implementation…

Wouldn’t it be ideal to have a decoupled watchdog system that we can plug and play anywhere in a SOLID way?

Basically I try to think in simple patterns and provide simple solutions, so the first thing that comes to my mind is implementing the watchdog with the Observer pattern, so that we can register to get updates on the “connectivity” status of different network resources.

To me only two simple connectivity statuses matter:

  • Connected
  • Disconnected

So the code can react on this…

We could add “reconnecting”, but this would mean that the watchdog knows the implementation of whoever uses it. We want to decouple (abstract), so this is something that the “user” app should manage by itself. So no.

To our watchdog, if we get a ping response back from an IP, we are connected; if not, we are disconnected. The consumer app reacts to these two basic facts. Simple, right?

Another thing we would need is a way to add the elements to be watched by the Watchdog.

For now, I believe this list should have this information for each of the elements to watch over:

  • IP
  • ConnectionStatus
  • LastPingResponse (why not keep the latest ping reply)
  • ElapsedFromLastConnected (from the previous time we had a connection)
  • TotalPings
  • TotalPingsSuccessful

 

Timer or no Timer? A timer fires its callback on a thread-pool thread on every tick, whereas with TAP we can simply await a delay between rounds and make the whole process cancellable, so the decision is easy.

And we put the Observer pattern in the mix too, right?

[Image: observer.jpg, Observer pattern diagram]

So, to benefit our “Network” Watchdog subscribers (Observers) we will provide means for:

  • Attach or Register
  • Detach or Unregister
  • Be Notified

Also, we have the fundamental question of how we want to do this… We can let subscribers subscribe to the full list of resources, which makes sense if they are centralised. Otherwise, a resource might be observed by a concrete end component, which in that case might only be interested in that single resource.. so… what to do?

Basically I implemented the Subject interface on both: the resource list and the concrete resource. This gives us a flexible design that covers both cases and keeps the components decoupled.

 

The code is simple: a project implementing a .NET Core component, plus a simple console application that showcases its usage:

[Image 01]

Five interfaces are declared, one for the Network Watchdog so we can:

  1. Add resources to be watched
  2. Configure the Watchdog
  3. Start it
  4. Stop it

Yes, a future one would be to remove resources, but I have not thought that through fully yet. I will probably get to it shortly.

 

The other four are the interfaces for the Observer pattern: in each pair, one for the Subject and the other for the Observer. I am more familiar with the concept of a Publisher and a Subscriber, and these names sound better and feel more meaningful to me than Subject/Observer, so I use them instead.

One pair is meant for the Network Watchdog, covering all the resources, and the other is meant for a concrete resource.

[Image 02: interfaces]

Then we have an enum with the connectivity state, which holds Connected and Disconnected.
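
Since the interfaces and the enum are only shown as an image, here is a sketch of how they could be declared. The member signatures are inferred from how the classes below use them, so the declarations in the repository may differ; the ResourceConnection class here is only a placeholder so that the sketch compiles on its own.

```csharp
using System.Collections.Generic;

// The two connectivity states the watchdog cares about.
public enum ConnectivityState { Connected, Disconnected }

// Placeholder for this sketch only; the full class is listed further down.
public class ResourceConnection
{
    public string IP { get; set; }
    public ConnectivityState ConnectionState { get; set; }
}

// Observer pair for a single, concrete resource.
public interface IResourceSubscriber
{
    void Update(ResourceConnection data);
}

public interface IResourcePublisher
{
    void RegisterSubscriber(IResourceSubscriber subscriber);
    void RemoveSubscriber(IResourceSubscriber subscriber);
    void NotifySubscribers();
}

// Observer pair for the watchdog as a whole (all resources at once).
public interface IWatchdogSubscriber
{
    void Update(List<ResourceConnection> data);
}

public interface IWatchdogPublisher
{
    void RegisterSubscriber(IWatchdogSubscriber subscriber);
    void RemoveSubscriber(IWatchdogSubscriber subscriber);
    void NotifySubscribers();
}

// The four watchdog operations listed above.
public interface INetworkWatchdog
{
    void AddResourceToWatch(string IP);
    void ConfigureWatchdog(int RefreshTime = 30000, int PingTimeout = 500, bool notifyOnlyWhenChanges = true);
    void Start();
    void Stop();
}
```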

Then the next thing to watch is the implementation of ResourceConnection:

using ConnWatchDog.Interfaces;
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Net.NetworkInformation;
using System.Threading.Tasks;

namespace ConnWatchDog
{
    public class ResourceConnection : IResourcePublisher
    {
        public string IP { get; private set; }
        public ConnectivityState ConnectionState { get; private set; }
        public IPStatus LastStatus { get; private set; } // Technically it can be obtained from the PingReply..
        public PingReply LastPingReply { get; private set; }
        public TimeSpan LastConnectionTime { get; private set; }
        public long TotalPings { get; set; }
        public long TotalSuccessfulPings { get; set; }
        private Stopwatch stopWatch = new Stopwatch();
        public bool StateChanged { get; private set; }

        // Member for Subscriber management
        List<IResourceSubscriber> ListOfSubscribers;

        public ResourceConnection(string ip)
        {
            ConnectionState = ConnectivityState.Disconnected; // we assume it is disconnected until proven otherwise.
            LastStatus = IPStatus.Unknown;
            TotalPings = 0;
            TotalSuccessfulPings = 0;
            stopWatch.Start();
            IP = ip;
            StateChanged = false;

            ListOfSubscribers = new List<IResourceSubscriber>();
        }

        public void AddPingResult(PingReply pr)
        {
            StateChanged = false;
            TotalPings++;
            LastPingReply = pr;
            LastStatus = pr.Status;

            if (pr.Status == IPStatus.Success)
            {
                stopWatch.Stop();
                LastConnectionTime = stopWatch.Elapsed;
                TotalSuccessfulPings++;
                stopWatch.Restart();

                if (ConnectionState == ConnectivityState.Disconnected)
                    StateChanged = true;
                ConnectionState = ConnectivityState.Connected;
            }
            else // no success..
            {
                if (ConnectionState == ConnectivityState.Connected)
                    StateChanged = true;
                ConnectionState = ConnectivityState.Disconnected; 
            }

            // We trigger the observer event so everybody subscribed gets notified
            if (StateChanged)
            {
                NotifySubscribers();
            }
        }

        /// <summary>
        ///  Interface implementation for Observer pattern (IPublisher)
        /// </summary>
        public void RegisterSubscriber(IResourceSubscriber subscriber)
        {
            ListOfSubscribers.Add(subscriber);
        }
        public void RemoveSubscriber(IResourceSubscriber subscriber)
        {
            ListOfSubscribers.Remove(subscriber);
        }
        public void NotifySubscribers()
        {
            Parallel.ForEach(ListOfSubscribers, subscriber => {
                subscriber.Update(this);
            });
        }
    }
}

 

This is a simple beast that implements the Observer pattern in case any other component wants to watch over a concrete resource.
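
To look at the notification rule in isolation, here is a trimmed-down sketch of the same idea. It is not the class above: it takes a plain bool instead of a PingReply (which is awkward to fabricate in a test), and the names LinkTracker, ILinkSubscriber and RecordingSubscriber are invented for the example.

```csharp
using System;
using System.Collections.Generic;

public enum LinkState { Connected, Disconnected }

public interface ILinkSubscriber
{
    void Update(LinkTracker source);
}

public class LinkTracker
{
    private readonly List<ILinkSubscriber> subscribers = new List<ILinkSubscriber>();

    // Like the class above, we start as disconnected until proven otherwise.
    public LinkState State { get; private set; } = LinkState.Disconnected;

    public void Register(ILinkSubscriber subscriber) => subscribers.Add(subscriber);

    // Records one probe outcome and notifies subscribers only when the state
    // flips between Connected and Disconnected. Returns true if it flipped.
    public bool Report(bool success)
    {
        LinkState next = success ? LinkState.Connected : LinkState.Disconnected;
        bool changed = next != State;
        State = next;

        if (changed)
        {
            // Iterate over a copy so a subscriber may unregister itself while being notified.
            foreach (ILinkSubscriber subscriber in subscribers.ToArray())
            {
                subscriber.Update(this);
            }
        }
        return changed;
    }
}

// Example subscriber that just remembers every state it was told about.
public class RecordingSubscriber : ILinkSubscriber
{
    public List<LinkState> Seen { get; } = new List<LinkState>();
    public void Update(LinkTracker source) => Seen.Add(source.State);
}
```

Reporting success, success, failure produces exactly two notifications (Connected, then Disconnected); the repeated success in the middle stays silent, which is the behaviour a per-resource subscriber can rely on.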

 

The NetworkWatchdog code:

using ConnWatchDog.Interfaces;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.NetworkInformation;
using System.Threading;
using System.Threading.Tasks;

namespace ConnWatchDog
{
    public class NetworkWatchdogService : IWatchdogPublisher, INetworkWatchdog
    {
        List<ResourceConnection> ListOfConnectionsWatched;
        List<IWatchdogSubscriber> ListOfSubscribers;
        int RefreshTime = 0;
        int PingTimeout = 0;
        public CancellationToken cancellationToken { get; set; }
        bool NotifyOnlyWhenChanges = false;
        bool IsConfigured = false;

        public NetworkWatchdogService()
        {
            ListOfConnectionsWatched = new List<ResourceConnection>();
            ListOfSubscribers = new List<IWatchdogSubscriber>();
        }

        /// <summary>
        ///  Interface implementation for Observer pattern (IPublisher)
        /// </summary>
        public void RegisterSubscriber(IWatchdogSubscriber subscriber)
        {
            ListOfSubscribers.Add(subscriber);
        }
        public void RemoveSubscriber(IWatchdogSubscriber subscriber)
        {
            ListOfSubscribers.Remove(subscriber);
        }
        public void NotifySubscribers()
        {
            Parallel.ForEach(ListOfSubscribers, subscriber => {
                subscriber.Update(ListOfConnectionsWatched);
            });
        }

        /// <summary>
        ///  Interfaces for the Network Watchdog
        /// </summary>
        
        public void AddResourceToWatch(string IP)
        {
            ResourceConnection rc = new ResourceConnection(IP);
            ListOfConnectionsWatched.Add(rc);
        }

        public void ConfigureWatchdog(int RefreshTime = 30000, int PingTimeout = 500, bool notifyOnlyWhenChanges = true)
        {
            this.RefreshTime = RefreshTime;
            this.PingTimeout = PingTimeout;
            this.NotifyOnlyWhenChanges = notifyOnlyWhenChanges;
            cancellationToken = new CancellationToken();
            IsConfigured = true;
        }

        public void Start()
        {
            StartWatchdogService();
        }

        public void Stop()
        {
            cancellationToken = new CancellationToken(true);
        }

        private async void StartWatchdogService()
        {
            if (IsConfigured) {
                while (!cancellationToken.IsCancellationRequested)
                {
                    // Build a fresh task list on every cycle so completed tasks do not accumulate.
                    var tasks = new List<Task>();

                    foreach (var resConn in ListOfConnectionsWatched)
                    {
                        Ping p = new Ping();
                        var t = PingAndUpdateAsync(p, resConn.IP, PingTimeout);
                        tasks.Add(t);
                    }

                    // Always wait for this cycle's pings before notifying, so subscribers never see stale data.
                    // ContinueWith also keeps a failed ping from escaping this async void method.
                    await Task.WhenAll(tasks).ContinueWith(t =>
                    {
                        // Notify on every cycle (heartbeat) or, with NotifyOnlyWhenChanges,
                        // only if some resource changed its state from connected <==> disconnected.
                        if (!NotifyOnlyWhenChanges || ListOfConnectionsWatched.Any(res => res.StateChanged))
                        {
                            NotifySubscribers();
                        }
                    });

                    // After all resources are monitored, we delay until the next planned execution.
                    await Task.Delay(RefreshTime).ConfigureAwait(false);
                }
            }
            else
            {
                throw new Exception("Cannot start Watchdog not configured");
            }
        }

        private async Task PingAndUpdateAsync(Ping ping, string ip, int timeout)
        {
            var reply = await ping.SendPingAsync(ip, timeout);
            var res = ListOfConnectionsWatched.First(item => item.IP == ip);
            res.AddPingResult(reply);
        }
    }
}

Yes, now reading it again, ListOfConnectionsWatched could be named differently, like ListOfNetworkResourcesWatched… I will keep it in mind and update that in the Git repo later on.

As a detail, while writing the demo, I thought that in some cases we want every update to be notified, so we have a permanent heartbeat that we can bind to a UI. Setting NotifyOnlyWhenChanges to false does this; when it is true, we only send notifications if there has been a change in the connected/disconnected state.

 

And.. no timer is used, so at the end we are using:

await Task.Delay(RefreshTime).ConfigureAwait(false);

Which does what we want without extra threading.
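
The delay-based loop can also be written so that cancellation interrupts the wait itself. The following is a sketch of that variation, not the watchdog’s actual code: PeriodicLoop and RunAsync are hypothetical names, and the work to repeat is passed in as a delegate.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class PeriodicLoop
{
    // Calls "work" repeatedly, waiting "interval" between runs, until the token is cancelled.
    // Returns the number of completed runs.
    public static async Task<int> RunAsync(Func<Task> work, TimeSpan interval, CancellationToken token)
    {
        int runs = 0;
        try
        {
            while (!token.IsCancellationRequested)
            {
                await work();
                runs++;

                // Passing the token lets a cancellation cut the wait short
                // instead of sleeping out the whole interval.
                await Task.Delay(interval, token);
            }
        }
        catch (OperationCanceledException)
        {
            // Cancellation during the delay is the normal way out of the loop.
        }
        return runs;
    }
}
```

In the watchdog loop above, Task.Delay(RefreshTime) is called without a token, so a Stop() only takes effect once the current delay has elapsed. Passing the token, as in this sketch, is one way to make stopping more responsive, at the price of having to catch the cancellation exception.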

 

The last piece of code is an example application that uses the presented software component. My “use case” is “I want a ping sweep over my local home network to see who is responding or not”.

The code creates several resources to be watched and sets up the class as a subscriber to receive the connection status updates.

For this we have to implement the IWatchdogSubscriber interface and its Update method.

    public class AsyncPinger : IWatchdogSubscriber
    {
        private string BaseIP = "192.168.178.";
        private int StartIP = 1;
        private int StopIP = 255;
        private string ip;
        private int timeout = 1000;
        private int heartbeat = 10000;
        private NetworkWatchdogService nws;

        public AsyncPinger()
        {
            nws = new NetworkWatchdogService();
            nws.ConfigureWatchdog(heartbeat, timeout, false);
        }

        public void RunPingSweep_Async() // nothing is awaited here; Start() returns while the watchdog keeps running
        {

            for (int i = StartIP; i < StopIP; i++)
            {
                ip = BaseIP + i.ToString();
                nws.AddResourceToWatch(ip);
            }

            nws.RegisterSubscriber(this);
            var cts = new CancellationTokenSource();
            cts.CancelAfter(60000);
            nws.cancellationToken = cts.Token;
            nws.Start();
        }

        public void Update(List<ResourceConnection> data)
        {
            Console.WriteLine("Update from the Network watcher!");
            foreach (var res in data)
            {
                if (res.ConnectionState == ConnectivityState.Connected)
                {
                    Console.WriteLine("Received from " + res.IP + " total pings: " + res.TotalPings.ToString() + " successful pings: " + res.TotalSuccessfulPings.ToString());
                }
            }
            Console.WriteLine("End of Update ");
        }
    }

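Note that the CancellationTokenSource is there for convenience: it will stop the watchdog after 60 seconds, but we could bind that to the UI or any other logic.

From my Main method, only two lines of code are needed (plus something like Console.ReadLine() to keep the console application alive while the watchdog runs in the background):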

var ap = new AsyncPinger();           
ap.RunPingSweep_Async();

I am happy with the results and the time dedicated, and I am already thinking about how to extend such a system, for example:

  • Extending it to other scenarios, like monitoring services
  • Extending it to provide custom actions (ping, service monitor, application monitor, etc…), although this would require custom data as well.
  • Extending it so the watchdog can perform simple actions like a “keep alive” or, in case of a service issue, try to solve the issue on its own with “stop and restart” protocols…
  • Improving the points above by implementing the Strategy pattern “properly”.
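
Just to give an idea of where the Strategy pattern could come in, here is a very rough sketch. Nothing of this exists in the repository: IHealthCheck, DelegateHealthCheck and HealthRunner are names I am making up, and treating a failing check as “unhealthy” is only one possible choice.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical strategy interface: one implementation per kind of check
// (ping, service monitor, application monitor...).
public interface IHealthCheck
{
    string Target { get; }
    Task<bool> IsHealthyAsync();
}

// Adapter that turns any async function into a strategy, handy for quick experiments.
public class DelegateHealthCheck : IHealthCheck
{
    private readonly Func<Task<bool>> check;

    public DelegateHealthCheck(string target, Func<Task<bool>> check)
    {
        Target = target;
        this.check = check;
    }

    public string Target { get; }

    public Task<bool> IsHealthyAsync() => check();
}

public static class HealthRunner
{
    // The watchdog would only see the interface; a check that throws counts as unhealthy.
    public static async Task<bool> RunAsync(IHealthCheck healthCheck)
    {
        try
        {
            return await healthCheck.IsHealthyAsync();
        }
        catch (Exception)
        {
            return false;
        }
    }
}
```

A ping check would then be just one strategy among others, and the watchdog loop would not need to know which kind of resource it is watching.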

 

So what do you think? What would you change, add or remove from the proposed solution to make it better?

 

Happy coding!

The full code with the sample usage can be found here: https://github.com/joslat/NetworkWatchDog