116. Event-Driven Bad Signal Detection
Status: Accepted
Date: 2025-07-06
Context
The Maat module is responsible for detecting "Bad Signals" (BS) - anomalies, inconsistencies, or questionable outputs - from various parts of the system. A polling-based approach, where Maat would periodically query every other module to ask for potential issues, would be inefficient, slow, and would tightly couple Maat to the internal workings of every other service. We need a way to capture these potential issues in real-time without creating a web of dependencies.
Decision
The system will use an event-driven architecture for BS detection.
A BSEmitterService will be provided as a shared utility. Any module in the system, when it encounters a situation that it deems potentially anomalous, can use this service to emit a bs.candidate event. This event will contain a payload with all the relevant context about the potential issue (e.g., the function that was running, the inputs it received, the strange output it produced).
The Maat module will contain a BSReportConsumer that listens for these bs.candidate events. When an event is received, the consumer will process it and store the candidate in the appropriate Gist for later review, according to the sampling rules (adr://sampling-based-collection).
This decouples the "producers" of potential BS (any module) from the "consumer" (Maat). Modules don't need to know anything about Maat or how BS is stored; they just need to fire a "here is something weird" event and move on.
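The producer/consumer split described above can be sketched as follows. This is a minimal, in-process illustration and not the actual implementation: the `BSCandidate` payload fields, the `InMemoryBus` class, and the `emitCandidate` method name are all assumptions made for the example. The in-memory bus also delivers synchronously, whereas a queue-backed bus would deliver asynchronously.

```typescript
// Hypothetical payload shape; the ADR only requires "all the relevant context".
interface BSCandidate {
  source: string;    // function or module that noticed the anomaly
  inputs: unknown;   // inputs the function received
  output: unknown;   // the questionable output it produced
  emittedAt: number; // epoch milliseconds
}

type Handler = (candidate: BSCandidate) => void;

const BS_CANDIDATE = "bs.candidate";

// Stand-in for the real event bus (assumption: in-process, synchronous).
class InMemoryBus {
  private handlers = new Map<string, Handler[]>();

  subscribe(topic: string, handler: Handler): void {
    const list = this.handlers.get(topic) ?? [];
    list.push(handler);
    this.handlers.set(topic, list);
  }

  publish(topic: string, candidate: BSCandidate): void {
    for (const handler of this.handlers.get(topic) ?? []) {
      try {
        handler(candidate);
      } catch {
        // Fire-and-forget: a failing consumer must not break the producer.
      }
    }
  }
}

// Shared utility used by any module; it knows the bus, not Maat.
class BSEmitterService {
  private readonly bus: InMemoryBus;

  constructor(bus: InMemoryBus) {
    this.bus = bus;
  }

  emitCandidate(source: string, inputs: unknown, output: unknown): void {
    this.bus.publish(BS_CANDIDATE, { source, inputs, output, emittedAt: Date.now() });
  }
}

// Lives inside Maat; `stored` stands in for the Gist-backed store.
class BSReportConsumer {
  readonly stored: BSCandidate[] = [];

  constructor(bus: InMemoryBus) {
    bus.subscribe(BS_CANDIDATE, (candidate) => {
      this.stored.push(candidate);
    });
  }
}

// Usage: the producing module never references BSReportConsumer.
const bus = new InMemoryBus();
const consumer = new BSReportConsumer(bus);
const emitter = new BSEmitterService(bus);
emitter.emitCandidate("pricing.computeQuote", { qty: 3 }, { total: -12 });
```

The producer depends only on the emitter and the event name, so the consumer can change how candidates are stored without touching any emitting module.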
Consequences
Positive:
- Decoupling: Modules that generate BS candidates are completely decoupled from the Maat module. They have no direct dependency on Maat.
- Real-Time Detection: Issues are captured and sent to Maat the instant they occur, rather than waiting for a polling cycle. This provides a much more immediate view of system health.
- Extensibility: It's incredibly easy to add new sources of BS candidates. Any new module can be instrumented to emit these events without requiring any changes to the Maat module itself.
- Low Overhead: Emitting an event is a lightweight, fire-and-forget operation, so it has minimal performance impact on the source module.
Negative:
- Requires Instrumentation: This approach is not automatic. Developers of other modules must proactively identify areas where BS could occur and add the code to emit the event.
- Potential for Event Storms: If a systemic issue causes a single type of event to be fired very rapidly, it could flood the event bus and the Maat consumer.
Mitigation:
- Developer Discipline: Instrumenting code for observability and validation is a core part of our development practice. This is not seen as an extra task, but as a standard part of writing robust code.
- Sampling and Rate Limiting: The BSEmitterService can have built-in client-side sampling (adr://sampling-based-collection). Furthermore, the BSReportConsumer and the underlying message queue (BullMQ) can be configured with rate limits to prevent a single consumer from being overwhelmed, ensuring the rest of the system remains stable.
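The two mitigations can be illustrated with a small sketch. The `Sampler` and `FixedWindowLimiter` classes below are hypothetical helpers written for this example, not part of any existing service. The random source and clock are injectable so the behaviour can be checked deterministically. In a BullMQ deployment, the worker-level `limiter` option (`max` jobs per `duration` milliseconds) would likely take the place of a hand-rolled limiter on the consumer side.

```typescript
// Client-side sampling: keep roughly `rate` (0..1) of all candidates.
class Sampler {
  private readonly rate: number;
  private readonly random: () => number;

  constructor(rate: number, random: () => number = Math.random) {
    this.rate = rate;
    this.random = random;
  }

  shouldEmit(): boolean {
    return this.random() < this.rate;
  }
}

// Fixed-window rate limit: at most `max` events per `windowMs` milliseconds.
class FixedWindowLimiter {
  private readonly max: number;
  private readonly windowMs: number;
  private readonly now: () => number;
  private windowStart = -Infinity; // forces a fresh window on first use
  private count = 0;

  constructor(max: number, windowMs: number, now: () => number = Date.now) {
    this.max = max;
    this.windowMs = windowMs;
    this.now = now;
  }

  tryAcquire(): boolean {
    const t = this.now();
    if (t - this.windowStart >= this.windowMs) {
      // The previous window has expired: start a new one.
      this.windowStart = t;
      this.count = 0;
    }
    if (this.count >= this.max) {
      return false; // over budget for this window: drop or defer the event
    }
    this.count += 1;
    return true;
  }
}

// Usage: an emitter would check both guards before publishing a candidate.
const sampler = new Sampler(0.1);              // keep about 10% of candidates
const limiter = new FixedWindowLimiter(100, 1000); // at most 100 per second
const mayPublish = sampler.shouldEmit() && limiter.tryAcquire();
```

Sampling lowers the steady-state volume, while the limiter caps bursts. During an event storm the limiter is the guard that bounds load, since a fixed sampling rate still scales with the incoming event rate.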