1. Introduction
When a software organization that is used to synchronous communication starts using asynchronous communication, it takes time before architects, analysts, and programmers make the mental switch and learn to design and implement truly idiomatic asynchronous interfaces.
Old habits die hard.
To help with this changeover, I have collected a list of questions that can be used when reviewing designs and implementations of asynchronous interfaces. If you’re not entirely convinced a checklist is appropriate, check out the book 'The Checklist Manifesto' by Atul Gawande.
2. FAQ - Frequently Asked Questions
A.k.a. The things I hoped you might ask
-
Q: What technologies is this checklist covering?
A: Originally the checklist was developed primarily with the RabbitMQ AMQP implementation in mind. I believe it is fair to claim the checklist would work with other AMQP and AMQP-like implementations such as Apache Qpid, Apache ActiveMQ, etc. In addition, it is largely applicable to Java Message Service (JMS) implementations. And while some items are applicable to Apache Kafka, Apache Kafka might require a dedicated checklist.
-
Q: This isn’t a checklist! The answers to the questions aren’t Yes/No
A: Okay, it is true that it does not resemble 'Oil pressure … Green' and 'Throttle … 1000 RPM'. Marking an item as completed means that the question was asked, alternatives were considered, and a satisfactory answer was provided and documented.
-
Q: I found grammar or spelling errors
A: Please send me an email (or pull request) and I might correct it.
-
Q: I would like to contribute to this checklist!
A: Great! Check first if you agree with the license, and if yes, feel free to fork this project and publish your own amended/extended/reduced version. When you drop me an email, I might add your version to the external references.
-
Q: Can I use this list for commercial purposes?
A: Check the license! You’ll see commercial use is allowed, provided you abide by the other rules. And if the checklist was useful, you can always buy me a beer! https://PayPal.Me/DennisVandePoel
-
Q: This checklist seems to be devoid of security-related topics
A: Security is too important to be based on a checklist that is not reviewed and maintained on a daily basis by dedicated people. A good start would be your CISO or https://owasp.org
-
Q: What is the definition of a term used in the checklist?
A: In future versions of this document I intend to add more precision to the questions, add a glossary, and add more detailed guidance on the interpretation. One day…
3. The Checklist
Answer the questions below for each point-to-point channel. In case of a publish-subscribe channel, consider each publisher-subscriber combination as a point-to-point channel; i.e. if (X) publishes a message and (A) and (B) subscribe to it, then consider this pub-sub as two point-to-point channels: (X) → (A) and (X) → (B).
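As a minimal sketch of this decomposition, the snippet below expands a publish-subscribe topology into the point-to-point channels that each need their own review. The function name and the dictionary layout are assumptions made for illustration only.

```python
def point_to_point_channels(subscriptions: dict) -> list:
    """Expand {publisher: [subscribers]} into (publisher, subscriber) pairs."""
    return [
        (publisher, subscriber)
        for publisher, subscribers in subscriptions.items()
        for subscriber in subscribers
    ]


# (X) publishes, (A) and (B) subscribe: two channels to review.
print(point_to_point_channels({"X": ["A", "B"]}))  # [('X', 'A'), ('X', 'B')]
```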
The questions have been grouped by the different perspectives, i.e. from the side of the producer, from the side of the consumer, and a more holistic end-to-end view. You should assign these perspectives to the relevant team(s). The intention is that the responsible team’s documentation should have captured the answer. It does not mean the teams are responsible for providing the answers, but they should collect the answers, and each answer should be reflected in their documentation and guide their implementation processes.
3.1. End-to-end communication (e.g. Architect / System Analyst)
Note: for questions 1-3, consider the messages traveling between the endpoints. Repeat for each point-to-point channel.
-
What happens if a message gets lost? Do you really need the 'at least once' delivery guarantee
-
from POV of Business Process?
-
from POV of Consumer side?
-
from POV of Producer side?
-
-
What happens if a message is delivered multiple times to the same consumer?
-
Do you really need an 'at most once' or 'exactly once' delivery guarantee?
-
Can you design the message to be idempotent (e.g. by including a UUID that lets the consumer recognize a re-delivery)?
-
-
Is there any (implicit) ordering between messages?
-
When ordering is violated, can this result in wrong business results in the consumer and its dependent systems?
-
What happens if messages occur in the same period and arrive in order, but due to delay all arrive in a separate 'period' (e.g. next day, next fiscal year)?
-
Can you make the messages 'associative'?
-
What happens if messages occur in the same period and arrive in order, but due to delay some of them arrive in one period and some of them in a separate 'period' (e.g. next day, next fiscal year)?
-
What happens if messages occur in the same period and arrive out of order, and due to delay some of them arrive in one period and some of them in a separate 'period' (e.g. next day, next fiscal year)?
-
Can the message structure contain ordering information (so as to offload ordering responsibility from the broker)?
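The questions on duplicates and ordering can be made concrete with a small sketch. It assumes, purely for illustration, that every message carries a producer-assigned `message_id` (a UUID) and a per-stream `sequence` number starting at 1; the checklist itself prescribes neither field. The consumer ignores message IDs it has already seen and holds back out-of-order messages until the gap is filled.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    # Hypothetical envelope: the field names are assumptions for illustration.
    message_id: str   # unique per message, lets the consumer detect re-deliveries
    sequence: int     # assigned by the producer, starts at 1 per stream
    body: str


@dataclass
class OrderedIdempotentConsumer:
    next_sequence: int = 1
    seen_ids: set = field(default_factory=set)
    pending: dict = field(default_factory=dict)   # sequence -> Message
    applied: list = field(default_factory=list)   # bodies in the order applied

    def receive(self, msg: Message) -> None:
        if msg.message_id in self.seen_ids:
            return  # same message delivered again: ignore
        self.seen_ids.add(msg.message_id)
        if msg.sequence < self.next_sequence:
            return  # this position in the stream was already applied
        self.pending[msg.sequence] = msg
        # Apply every message that is now contiguous with what was applied.
        while self.next_sequence in self.pending:
            self.applied.append(self.pending.pop(self.next_sequence).body)
            self.next_sequence += 1


def make_message(sequence: int, body: str) -> Message:
    return Message(str(uuid.uuid4()), sequence, body)


m1, m2, m3 = (make_message(i, f"event-{i}") for i in (1, 2, 3))
consumer = OrderedIdempotentConsumer()
for msg in (m2, m1, m1, m3):   # out of order, with one duplicate delivery
    consumer.receive(msg)
print(consumer.applied)        # ['event-1', 'event-2', 'event-3']
```

In this sketch `seen_ids` grows without bound and lives in memory only; a real consumer would have to decide how long to remember IDs and where to persist them.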
-
3.2. Message Producer
-
What happens if the (RabbitMQ) broker is unreachable/unavailable?
-
Where will messages be stored for the duration of the outage?
-
How will you ensure you do not cascade the (RabbitMQ) broker outage to the Producer’s clients? Where is the relevant Component test for this?
-
What is the pressure valve in case of extended outage duration and a growth of messages to be produced? Where is the relevant Component test for this?
-
-
Does the producer need to be involved in executing a 'replay' in case of broker/consumer failure?
-
Has this been verified with the Broker Infrastructure team?
-
Has this been verified with the Consumer team?
-
Is the producer capable of re-generating messages for a replay?
-
How far back in time can the producer go to do a replay?
-
-
If an 'at least once' delivery guarantee is required for a specific consumer, how will the producer verify that a specific binding is in place for that specific consumer (not just any binding to the exchange by any consumer)?
-
Does the producer expect the broker to provide durability/persistence guarantees?
-
If yes, has this been acknowledged by the team that provides the broker?
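One possible shape of an answer to the outage questions is sketched below. It assumes a hypothetical `send` function that raises `ConnectionError` while the broker is unreachable, and an in-memory buffer with a fixed limit; the pressure valve chosen here (drop the oldest message and count it) is only one of several options and is not prescribed by the checklist.

```python
from collections import deque
from typing import Callable


class BufferingProducer:
    """Keeps accepting messages while the broker is down, up to a fixed limit."""

    def __init__(self, send: Callable[[str], None], max_buffered: int) -> None:
        self._send = send                  # hypothetical broker publish function
        self._buffer = deque()             # messages waiting for the broker
        self._max_buffered = max_buffered
        self.dropped = 0                   # how often the pressure valve opened

    def publish(self, message: str) -> None:
        # A ConnectionError from the broker is not raised to the caller, so
        # the outage does not cascade to the producer's own clients.
        self._buffer.append(message)
        self.flush()
        while len(self._buffer) > self._max_buffered:
            self._buffer.popleft()         # pressure valve: drop the oldest message
            self.dropped += 1

    def flush(self) -> None:
        # Send buffered messages in order; stop at the first failure.
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except ConnectionError:
                return                     # broker still unreachable, keep the rest
            self._buffer.popleft()

    @property
    def buffered(self) -> int:
        return len(self._buffer)
```

A component test for this behavior would replace `send` with a fake that fails on demand and then assert on `buffered` and `dropped`. Note that an in-memory buffer is lost when the producer itself crashes, which is exactly the kind of trade-off the durability question is meant to surface.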
-
3.3. Message Consumer
-
What will be your message acknowledgment strategy?
-
Is it compatible with the delivery guarantee set out by the Architect / System Analyst?
-
Does it avoid unnecessarily exceeding the delivery guarantees required by the relevant point-to-point channels?
-
-
What will a consumer do with messages that are not processable (e.g. not adhering to schema due to producer bug)?
-
What team will monitor and handle this issue in Production? Has that team accepted responsibility?
-
-
Will the consumer use the broker’s reject and/or requeue feature?
-
Does the consumer distinguish between temporary message processing failure and permanent message processing failure? Where is the relevant Component test for this?
-
-
How does the consumer handle 'duplicate messages' (caused by broker/consumer) and duplicate 'bodies of messages' (caused by producer)?
-
Will there be competing consumers?
-
What component test verifies correct behavior in case of a re-delivery of a non-idempotent message?
-
To what extent can the broker group messages for efficiency?
-
-
Are test messages expected in production?
-
Are correlation identifiers expected?
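The distinction between temporary and permanent processing failures can be sketched as a pure decision function, which also makes it easy to cover with a component test. The `TemporaryFailure` exception, the `Decision` names, and the `order_id` field are assumptions made for illustration. In AMQP 0-9-1 terms the three outcomes would roughly correspond to acknowledging the message, rejecting it with requeue, and rejecting it without requeue so that it can be dead-lettered.

```python
import json
from enum import Enum
from typing import Callable


class Decision(Enum):
    ACK = "ack"                  # processed, the broker may discard the message
    REQUEUE = "requeue"          # temporary failure, worth trying again later
    DEAD_LETTER = "dead-letter"  # permanent failure, retrying cannot help


class TemporaryFailure(Exception):
    """Hypothetical error for conditions that may clear up, e.g. a timeout."""


def decide(raw_body: bytes, process: Callable[[dict], None]) -> Decision:
    try:
        payload = json.loads(raw_body)
    except ValueError:
        # Covers invalid JSON and undecodable bytes: a retry gives the same result.
        return Decision.DEAD_LETTER
    if not isinstance(payload, dict) or "order_id" not in payload:
        return Decision.DEAD_LETTER      # violates the (assumed) schema
    try:
        process(payload)
    except TemporaryFailure:
        return Decision.REQUEUE
    return Decision.ACK
```

Any other exception raised by `process` propagates to the caller in this sketch, so the team still has to decide explicitly how unexpected errors are classified.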
3.4. The Broker
-
What is the time-to-live configuration for each queue?
-
If not explicitly configured, what is the default value?
-
Is this sufficient for normal operation?
-
Is this sufficient in case of consumer failure/outage?
-
In case of overflow, where do the messages go?
-
In case of event messages, how old can an event be before becoming irrelevant?
-
-
What is the maximum queue size (bytes or number of messages) for each queue?
-
If not explicitly configured, what is the default value?
-
Is this sufficient for normal operation?
-
Is this sufficient in case of consumer failure/outage?
-
In case of overflow, where do the messages go?
-
-
How are dead letter queues configured for exchanges and queues?
-
What happens to the messages if the dead-letter queue’s time-to-live configuration is exceeded?
-
What happens to the messages if the dead-letter queue’s maximum queue size configuration is exceeded?
-
Is the underlying hardware (RAM/HDD/Network) sufficiently sized for this eventuality?
-
-
Do consumers intend to use re-delivery?
-
How often will the broker attempt to re-deliver a message?
-
What happens to the message if the maximum number of re-delivery attempts has been reached?
-
How do you ensure a message is not re-delivered indefinitely?
-
-
Has a replay mechanism been agreed?
-
Who are the relevant teams to involve in a replay?
-
Is ordering of messages of any importance?
-
Are any queue flushes required?
-
Are there any performance considerations (e.g. a preference for LIFO handling)?
-
-
Is the producer ensuring durability/persistence of the messages (underlying data) or does it expect the broker to provide durability/persistence guarantees?
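For RabbitMQ, several of the broker questions come down to whether optional queue arguments were set explicitly. The sketch below uses RabbitMQ's argument names (`x-message-ttl`, `x-max-length`, `x-max-length-bytes`, `x-overflow`, `x-dead-letter-exchange`); the values, the `orders.dlx` exchange name, and the `unanswered` helper are hypothetical and only illustrate how a review could flag items that still rely on a broker default.

```python
# Optional queue arguments as RabbitMQ names them; the values are illustrative.
ORDERS_QUEUE_ARGUMENTS = {
    "x-message-ttl": 86_400_000,             # milliseconds (here: 24 hours)
    "x-max-length": 100_000,                 # number of messages
    "x-overflow": "reject-publish",          # refuse new messages when full
    "x-dead-letter-exchange": "orders.dlx",  # hypothetical exchange name
}

# Queue argument -> checklist topic it answers.
REVIEW_ITEMS = {
    "x-message-ttl": "time-to-live",
    "x-max-length": "maximum queue size (messages)",
    "x-max-length-bytes": "maximum queue size (bytes)",
    "x-overflow": "overflow behavior",
    "x-dead-letter-exchange": "dead-letter target",
}


def unanswered(arguments: dict) -> list:
    """Return the review topics for which no explicit argument is set."""
    return [label for key, label in REVIEW_ITEMS.items() if key not in arguments]


print(unanswered(ORDERS_QUEUE_ARGUMENTS))  # ['maximum queue size (bytes)']
```

Limits can also be set through broker policies rather than queue arguments, so an empty result here would not by itself prove that every question has been answered.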
4. Credits
-
Dennis Van de Poel
5. External References
6. License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.