
65. Asynchronous API Pattern for AI Analysis

Status: Accepted
Date: 2025-07-06

Context

Many AI and machine learning analysis tasks are long-running; they can take anywhere from a few seconds to many minutes to complete. Attempting to handle these tasks in a single, synchronous HTTP request-response cycle is not feasible. It would lead to client timeouts, tie up server resources, and create a brittle, unresponsive system.

Decision

For all long-running analysis operations exposed through the Morpheus API gateway, we will implement an Asynchronous Job API Pattern.

The workflow will be as follows:

  1. Request: The client makes an initial POST request to an endpoint like /api/v1/analysis with the parameters for the analysis.
  2. Acknowledge & Enqueue: The Morpheus gateway validates the request, creates a job with a unique jobId, adds it to the appropriate BullMQ processing queue, and immediately returns a 202 Accepted response to the client. The body of this response will contain the jobId.
  3. Poll for Status: The client can then periodically make GET requests to a status endpoint, like /api/v1/analysis/status/{jobId}, to check the status of the job (e.g., pending, in_progress, completed, failed).
  4. Retrieve Result: Once the job status is completed, the client can make a final GET request to a result endpoint, like /api/v1/analysis/result/{jobId}, to retrieve the analysis output.

This pattern decouples the long-running task from the client's request, ensuring the API remains responsive.
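The four-step lifecycle can be sketched with a small in-memory job store. This is illustrative only: a real deployment would back it with BullMQ and Redis, and the `JobStore` class, its method names, and the example payload are assumptions, not the actual Morpheus implementation. The status names match those listed above.

```typescript
// In-memory stand-in for the queue-backed job store described above.
type JobStatus = "pending" | "in_progress" | "completed" | "failed";

interface AnalysisJob {
  jobId: string;
  status: JobStatus;
  result?: unknown;
}

class JobStore {
  private jobs = new Map<string, AnalysisJob>();
  private nextId = 1;

  // Step 2: validate, enqueue, and hand back a jobId (returned with HTTP 202).
  createJob(): string {
    const jobId = `job-${this.nextId++}`;
    this.jobs.set(jobId, { jobId, status: "pending" });
    return jobId;
  }

  // Step 3: the status endpoint reads the current state.
  getStatus(jobId: string): JobStatus | undefined {
    return this.jobs.get(jobId)?.status;
  }

  // Worker-side transition once processing finishes.
  complete(jobId: string, result: unknown): void {
    const job = this.jobs.get(jobId);
    if (job) {
      job.status = "completed";
      job.result = result;
    }
  }

  // Step 4: the result endpoint only serves completed jobs.
  getResult(jobId: string): unknown {
    const job = this.jobs.get(jobId);
    if (!job || job.status !== "completed") {
      throw new Error("result not ready");
    }
    return job.result;
  }
}

const store = new JobStore();
const id = store.createJob();
console.log(store.getStatus(id)); // "pending"
store.complete(id, { sentiment: "positive" });
console.log(store.getStatus(id)); // "completed"
```

The gateway route handlers become thin wrappers over these operations, which keeps each HTTP exchange short regardless of how long the analysis itself runs.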

Consequences

Positive:

  • Responsiveness & Scalability: The API remains responsive and is not blocked by long-running tasks. The backend processing can be scaled independently by adding more queue workers.
  • Reliability: Using a persistent queue (BullMQ) ensures that analysis jobs are not lost if a server or worker process crashes. The job can be retried or picked up by another worker.
  • Improved Client Experience: Clients are not forced to maintain a long-lived, open HTTP connection. They receive an immediate acknowledgment and can fetch the result at their convenience.
  • Handles Timeouts Gracefully: Because each individual HTTP exchange is short, the pattern avoids HTTP timeout issues for long-running processes.
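The reliability property comes from per-job options set when the gateway enqueues work. A configuration sketch (the option names are real BullMQ job options; the specific values and the queue name are assumptions):

```typescript
// BullMQ job options the gateway might apply when enqueuing an analysis job.
const analysisJobOptions = {
  attempts: 3,                                   // retry a failed job up to 3 times
  backoff: { type: "exponential", delay: 5000 }, // wait 5s, 10s, 20s between retries
  removeOnComplete: false,                       // keep the job so the result endpoint can read it
};
// Passed as the third argument to queue.add("analysis", params, analysisJobOptions).
```

Because jobs persist in Redis, a crashed worker's job is picked up again rather than lost, which is what the reliability bullet above relies on.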

Negative:

  • Increased Client-Side Complexity: Instead of a single request, clients must manage a multi-step workflow: submit the job, store the jobId, and poll for the status and result.
  • Eventual Consistency: The result is not available immediately after submission, so clients must be designed to tolerate the delay between acknowledgment and completion.

Mitigation:

  • Client Libraries/SDKs: We can provide a simple client library or SDK that abstracts away the complexity of the polling mechanism, making it easier for clients to interact with the asynchronous API.
  • WebSockets for Real-Time Updates: For clients that need more real-time updates (like a web dashboard), we can supplement the polling mechanism with WebSockets. The server can push a notification to the client when the job is complete, eliminating the need for polling in those use cases.
  • Clear API Documentation: The asynchronous nature of the API will be clearly and prominently documented in our OpenAPI/Swagger specification so that clients understand the expected workflow.
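A polling helper such an SDK might provide can be sketched as follows. The endpoint paths follow this ADR; the function name, the injectable `http` transport, and the default interval are hypothetical, and a production version would likely add exponential backoff:

```typescript
// Hypothetical SDK helper that hides the submit/poll/fetch-result workflow.
// `http` abstracts the transport so it can be stubbed in tests.
type Fetcher = (path: string) => Promise<{ status: string; result?: unknown }>;

async function pollForResult(
  http: Fetcher,
  jobId: string,
  { intervalMs = 1000, maxAttempts = 60 } = {}
): Promise<unknown> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const { status } = await http(`/api/v1/analysis/status/${jobId}`);
    if (status === "completed") {
      // Step 4: fetch the finished analysis output.
      const { result } = await http(`/api/v1/analysis/result/${jobId}`);
      return result;
    }
    if (status === "failed") {
      throw new Error(`analysis job ${jobId} failed`);
    }
    // Wait between polls to avoid hammering the gateway.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`job ${jobId} did not complete within ${maxAttempts} polls`);
}
```

With a helper like this, a client's code collapses back to a single awaited call, which is the point of the SDK mitigation above.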