This article isn’t going to be for everyone, but I suspect it will appeal to anyone looking at Windows Server 2012 R2’s Failover Clustering capability.
I’m going to write this in a series of posts, as I think there’s also some merit in looking at diagnostic approaches to finding out what the heck is going wrong with a Failover Cluster, rather than focusing on an ideal end state in isolation.
If you’re not interested in the MassTransit parts, skip this introduction and check out Part 2 (coming soon!) which will focus on Clustered MSMQ roles and diagnosis.
MassTransit: In a Nutshell
We’re looking at Failover Clustering from the point of view of MassTransit, an open source implementation of a lightweight, message queue-backed Service Bus (of sorts). Here’s the official blurb from the GitHub page:
MassTransit is a free, open-source distributed application framework for .NET. MassTransit makes it easy to create applications and services that leverage message-based, loosely-coupled asynchronous communication for higher availability, reliability, and scalability.
Some documentation is also available here.
What are we focusing on?
Well, MassTransit, from version 3.x onwards, only supports RabbitMQ and the Azure Service Bus. As we had started our implementation in late 2014 with version 2, we ignored the abandonment of MSMQ and boldly decided to use it anyway, mainly because it is an out-of-the-box, first-class service that ships as standard with Windows Server 2012 R2 (and earlier versions). So this article won’t feature RabbitMQ or Azure Service Bus, but I might tackle that topic at a later time.
To eliminate a ton of extremely complex code, support for MSMQ was completely ripped out 
Therefore, this series of articles pertains only to MassTransit version 2.x and MSMQ.
We’re also using subscriptions, which means we are using the MassTransit subscriptions queue, and message consumers and subscribers register with a runtime service before interacting with the message bus. This is important to note, because the objective of using a clustered queue is to mitigate service outages by shifting the active queue to another node.
How does MassTransit work, in basic terms?
We’re working with two categories of (.NET) application: message publishers and message consumers. An application can do one or both; in other words, it can publish, subscribe, or publish and subscribe.
There’s one caveat: both need to use a local MSMQ queue, which serves two purposes. It is a local staging location for retrying the publishing of messages, and it holds unprocessed and processed messages for consumers (in case the consumer is offline but has a persistent subscription). There’s also an error queue, which messages will land in if they cannot be processed successfully.
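To make the local queue and error queue behaviour a little more concrete, here is a minimal sketch in Python. It is a toy model only and does not use MassTransit or MSMQ: the `LocalEndpoint` class, its method names and the `max_attempts` retry limit are all assumptions made for illustration, not anything taken from the MassTransit API.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List


@dataclass
class LocalEndpoint:
    """Hypothetical stand-in for one application's local queue and error queue."""
    name: str
    pending: Deque[str] = field(default_factory=deque)
    errors: List[str] = field(default_factory=list)

    def enqueue(self, message: str) -> None:
        # Messages wait here even while the consumer is offline.
        self.pending.append(message)

    def drain(self, handler: Callable[[str], None], max_attempts: int = 3) -> List[str]:
        """Process every pending message and return the ones handled successfully.

        max_attempts is an assumed retry limit, chosen only for this sketch.
        """
        handled: List[str] = []
        while self.pending:
            message = self.pending.popleft()
            for _ in range(max_attempts):
                try:
                    handler(message)
                    handled.append(message)
                    break
                except Exception:
                    continue  # retry until the attempts run out
            else:
                # Every attempt failed, so the message lands in the error queue.
                self.errors.append(message)
        return handled


def sample_handler(message: str) -> None:
    """Hypothetical consumer that cannot process one particular message."""
    if message == "bad":
        raise ValueError("cannot process message")


endpoint = LocalEndpoint("orders")
for msg in ["first", "bad", "second"]:
    endpoint.enqueue(msg)  # queued while the consumer is "offline"

handled = endpoint.drain(sample_handler)  # the consumer comes back online
```

After the drain, `handled` holds the two good messages in order and `endpoint.errors` holds the one that could not be processed, which is roughly the role the error queue plays in the description above.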
What is your High Availability goal?
Essentially, we want to have the subscription queue highly available. In the event of an unplanned outage, the queue will move to the next available node in the cluster. The clustered MSMQ role also has a dedicated network name & IP address which means that it acts as a central address point – no matter which machine acts as the active host – i.e. no need for publishers or subscribers to be aware of the cluster itself.
Since the majority of message consumers in the design also reside on the same server, failing over the HA queue would also fail over the message consumers.
This approach doesn’t rule out scaling horizontally at a later stage if we need to. There’s a plethora of options available, including some home-brew options from MassTransit itself in the form of something called the Distributor.
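The idea of a single network name that follows the active node can be sketched as a toy model. This is not the Failover Clustering API: the `Node` and `ClusteredRole` classes, the node names, the `MSMQ-HA` network name and the `mt_subscriptions` queue name are all hypothetical, and picking the first online node is a simplification of how a real cluster chooses the next owner.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical cluster node."""
    name: str
    online: bool = True
    running_services: List[str] = field(default_factory=list)


@dataclass
class ClusteredRole:
    """Hypothetical clustered MSMQ role with its own network name."""
    network_name: str
    nodes: List[Node]
    services: List[str]
    owner: Optional[Node] = None

    def __post_init__(self) -> None:
        # Place the role on an initial owner as soon as it is created.
        self.fail_over()

    def fail_over(self) -> Node:
        """Move the role to the first online node (a simplification)."""
        for node in self.nodes:
            node.running_services = []  # services stop everywhere first
        self.owner = next((n for n in self.nodes if n.online), None)
        if self.owner is None:
            raise RuntimeError("no online node available")
        # Co-located services run only on the active node.
        self.owner.running_services = list(self.services)
        return self.owner

    def queue_address(self, queue: str) -> str:
        # Clients address the role by network name, never by node name.
        return f"msmq://{self.network_name}/{queue}"


node_a = Node("NODE-A")
node_b = Node("NODE-B")
role = ClusteredRole("MSMQ-HA", [node_a, node_b], ["MassTransit Runtime Service"])

address_before = role.queue_address("mt_subscriptions")
node_a.online = False  # simulate an unplanned outage on the active node
role.fail_over()
address_after = role.queue_address("mt_subscriptions")
```

The address that publishers and subscribers use is the same before and after the failover, while the owner and the running services move from `NODE-A` to `NODE-B`.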
So if you look at my approach, the intention is to cluster two or more Windows Server hosts, and then stick a bunch of Windows Services on each node in the cluster, making them Cluster Resources. The simplistic model is to have the services started only on the active node.
Here’s an illustration of the target solution, with a two node Failover Cluster:
The MassTransit Runtime Service manages message subscriptions from the clustered MSMQ role. None of this will really make sense until you see the deployment in context. The following is a sample diagram of a typical DMZ/LAN architecture:
So that’s the essential scope of my MassTransit HA deployment. What will come in the next article is a closer inspection of how High Availability failover will function, and the mechanics behind it.
If you were looking for Clustered MSMQ guidance, Part 2 is for you!