65. Asynchronous API Pattern for AI Analysis
Status: Accepted Date: 2025-07-06
Context
Many AI and machine learning analysis tasks are long-running; they can take anywhere from a few seconds to many minutes to complete. Attempting to handle these tasks in a single, synchronous HTTP request-response cycle is not feasible. It would lead to client timeouts, tie up server resources, and create a brittle, unresponsive system.
Decision
For all long-running analysis operations exposed through the Morpheus API gateway, we will implement an Asynchronous Job API Pattern.
The workflow will be as follows:
- Request: The client makes an initial `POST` request to an endpoint like `/api/v1/analysis` with the parameters for the analysis.
- Acknowledge & Enqueue: The Morpheus gateway validates the request, creates a job with a unique `jobId`, adds it to the appropriate BullMQ processing queue, and immediately returns a `202 Accepted` response to the client. The body of this response will contain the `jobId`.
- Poll for Status: The client can then periodically make `GET` requests to a status endpoint, like `/api/v1/analysis/status/{jobId}`, to check the status of the job (e.g., `pending`, `in_progress`, `completed`, `failed`).
- Retrieve Result: Once the job status is `completed`, the client can make a final `GET` request to a result endpoint, like `/api/v1/analysis/result/{jobId}`, to retrieve the analysis output.
This pattern decouples the long-running task from the client's request, ensuring the API remains responsive.
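As an illustration only, the gateway side of this workflow might look like the sketch below. It assumes an Express gateway, a BullMQ queue named `analysis`, a local Redis connection, and a mapping from BullMQ job states onto the documented statuses; none of these specifics are mandated by this decision.

```typescript
// Minimal gateway sketch: enqueue on POST, expose status and result endpoints.
// Queue name, Redis settings, and status mapping are illustrative assumptions.
import express from "express";
import { Queue } from "bullmq";
import { randomUUID } from "crypto";

const connection = { host: "localhost", port: 6379 }; // assumed Redis instance
const analysisQueue = new Queue("analysis", { connection });

const app = express();
app.use(express.json());

// Request + Acknowledge & Enqueue: validate, enqueue, return 202 with the jobId.
app.post("/api/v1/analysis", async (req, res) => {
  const jobId = randomUUID();
  await analysisQueue.add("analyze", req.body, { jobId });
  res.status(202).json({ jobId });
});

// Poll for Status: translate BullMQ job states into the documented statuses.
app.get("/api/v1/analysis/status/:jobId", async (req, res) => {
  const job = await analysisQueue.getJob(req.params.jobId);
  if (!job) return res.status(404).json({ error: "unknown jobId" });
  const state = await job.getState(); // "waiting", "active", "completed", "failed", ...
  const statusMap: Record<string, string> = {
    waiting: "pending",
    delayed: "pending",
    active: "in_progress",
    completed: "completed",
    failed: "failed",
  };
  res.json({ jobId: job.id, status: statusMap[state] ?? "pending" });
});

// Retrieve Result: only return the payload once the job has completed.
app.get("/api/v1/analysis/result/:jobId", async (req, res) => {
  const job = await analysisQueue.getJob(req.params.jobId);
  if (!job) return res.status(404).json({ error: "unknown jobId" });
  if ((await job.getState()) !== "completed") {
    return res.status(409).json({ error: "job not completed yet" });
  }
  res.json({ jobId: job.id, result: job.returnvalue });
});

app.listen(3000);
```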
Consequences
Positive:
- Responsiveness & Scalability: The API remains responsive and is not blocked by long-running tasks. The backend processing can be scaled independently by adding more queue workers.
- Reliability: Using a persistent queue (BullMQ) ensures that analysis jobs are not lost if a server or worker process crashes. The job can be retried or picked up by another worker (see the worker sketch after this list).
- Improved Client Experience: Clients are not forced to maintain a long-lived, open HTTP connection. They receive an immediate acknowledgment and can fetch the result at their convenience.
- Handles Timeouts Gracefully: This pattern completely avoids issues with HTTP timeouts for long-running processes.
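To make the scaling and retry behaviour above concrete, a worker process could look roughly like the following sketch. The queue name, the `runAnalysis` helper, the concurrency value, and the connection settings are assumptions for illustration, not part of this decision.

```typescript
// Hypothetical BullMQ worker process; "analysis" queue name and runAnalysis() are illustrative.
import { Worker } from "bullmq";

const connection = { host: "localhost", port: 6379 }; // assumed Redis instance

// Placeholder for the actual AI/ML analysis call.
async function runAnalysis(params: unknown): Promise<unknown> {
  // ... call the model / analysis service ...
  return { summary: "..." };
}

// Each worker process pulls jobs from the persistent queue; adding more
// processes (or raising concurrency) scales the backend independently of the API.
const worker = new Worker(
  "analysis",
  async (job) => {
    // The return value becomes job.returnvalue, which the result endpoint exposes.
    return runAnalysis(job.data);
  },
  { connection, concurrency: 2 }
);

worker.on("failed", (job, err) => {
  // Failed jobs are re-queued according to the attempts/backoff options set when the job was added.
  console.error(`analysis job ${job?.id} failed:`, err.message);
});
```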
Negative:
- Increased Client-Side Complexity: Instead of making a single request, the client must manage a multi-step workflow: submit the initial request, store the `jobId`, poll for status, and fetch the result.
- Eventual Consistency: The system is eventually consistent; the result is not available immediately after the initial request, and clients must be designed to handle that delay.
Mitigation:
- Client Libraries/SDKs: We can provide a simple client library or SDK that abstracts away the complexity of the polling mechanism, making it easier for clients to interact with the asynchronous API (see the sketch after this list).
- WebSockets for Real-Time Updates: For clients that need more real-time updates (like a web dashboard), we can supplement the polling mechanism with WebSockets. The server can push a notification to the client when the job is complete, eliminating the need for polling in those use cases.
- Clear API Documentation: The asynchronous nature of the API will be clearly and prominently documented in our OpenAPI/Swagger specification so that clients understand the expected workflow.
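As a rough illustration of what such a client helper could hide, the sketch below submits a job, polls the status endpoint, and fetches the result. The base URL parameter, the polling interval, and the error handling are assumptions for the sake of the example.

```typescript
// Hypothetical client-side helper that hides the submit/poll/fetch-result workflow.
async function runAnalysis(baseUrl: string, params: object, pollMs = 2000): Promise<unknown> {
  // Submit the job and capture the jobId from the 202 response.
  const submit = await fetch(`${baseUrl}/api/v1/analysis`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(params),
  });
  const { jobId } = await submit.json();

  // Poll the status endpoint until the job completes or fails.
  while (true) {
    const statusRes = await fetch(`${baseUrl}/api/v1/analysis/status/${jobId}`);
    const { status } = await statusRes.json();
    if (status === "completed") break;
    if (status === "failed") throw new Error(`analysis job ${jobId} failed`);
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }

  // Fetch the final result once the status reports completion.
  const resultRes = await fetch(`${baseUrl}/api/v1/analysis/result/${jobId}`);
  return (await resultRes.json()).result;
}
```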