Mercury Infrastructure Upgrade: From Redis Poverty to Production Beast
Today we diagnosed why Mercury was only scheduling 586 jobs instead of the expected 1,524 for TA availability checks. Spoiler alert: our infrastructure was basically asking Redis to bench press while sitting on the bar!
The Great Redis Mystery
Our Mercury trading system runs 5 variants (A, B, H, R, W) with a sophisticated job scheduling system. Each should schedule:
- 508 USDT markets × 3 timeframes = 1,524 TA availability jobs
But we were only getting 586 jobs. The math was simple: something was dying silently.
Root Cause: Algorithmic Brutality + Hardware Poverty
The Redis Killer Code
// This innocent-looking code was murdering Redis
const existingJobs = await this.shadowOrdersQueue.getJobs(['waiting', 'delayed']);
const jobAlreadyExists = existingJobs.some(
  (job) =>
    job.name === MorpheusJobName.EXECUTE_SHADOW_ORDER &&
    job.data.orderId === orderId
);
What this actually does:
- Fetches ALL 2,500+ jobs from Redis into memory
- Scans through every single job to check for duplicates
- Repeats this 1,524 times during bulk scheduling
- Creates O(n²) complexity when Redis offers O(1) with proper job IDs
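To put a number on that, here is a minimal sketch that counts the work done by each approach. It is a simulation with a hypothetical in-memory queue (`FakeJob`, `scheduleByScanning`, and `scheduleByJobId` are made-up names, not BullMQ's API), and it assumes the queue starts empty and every order ID is unique. Our real queue already held 2,500+ jobs, so the real scan cost was higher still.

```typescript
// Hypothetical in-memory stand-in for a queue; this is NOT BullMQ's API.
interface FakeJob {
  id: string;
  name: string;
  orderId: string;
}

const JOB_NAME = 'execute-shadow-order'; // assumed name, for illustration only

// Strategy 1: look at every existing job before each add (the "Redis killer").
function scheduleByScanning(orderIds: string[]): { jobs: number; comparisons: number } {
  const jobs: FakeJob[] = [];
  let comparisons = 0;
  for (const orderId of orderIds) {
    let exists = false;
    for (const job of jobs) {
      comparisons++;
      if (job.name === JOB_NAME && job.orderId === orderId) {
        exists = true;
        break;
      }
    }
    if (!exists) {
      jobs.push({ id: String(jobs.length), name: JOB_NAME, orderId });
    }
  }
  return { jobs: jobs.length, comparisons };
}

// Strategy 2: derive the ID from the order and do one keyed lookup per add.
function scheduleByJobId(orderIds: string[]): { jobs: number; lookups: number } {
  const jobs = new Map<string, FakeJob>();
  let lookups = 0;
  for (const orderId of orderIds) {
    const id = `execute-order-${orderId}`;
    lookups++;
    if (!jobs.has(id)) {
      jobs.set(id, { id, name: JOB_NAME, orderId });
    }
  }
  return { jobs: jobs.size, lookups };
}

const simulatedOrderIds: string[] = Array.from({ length: 1524 }, (_, i) => `order-${i}`);
const scanResult = scheduleByScanning(simulatedOrderIds);
const keyedResult = scheduleByJobId(simulatedOrderIds);
console.log(scanResult, keyedResult);
```

Under these assumptions the scanning version performs 1,160,526 comparisons to schedule 1,524 jobs, while the keyed version performs 1,524 lookups. In the real system each "comparison" also means pulling job data out of Redis first, which is where the memory pressure comes from.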
The Hardware Reality Check
# Our current "production" setup
redis:
  image: 'redis:alpine' # 1GB memory limit
  command: redis-server --appendonly yes # No limits, no config
Load Analysis:
- 5 Mercury variants + 3 domain apps
- BullMQ queues (high write volume)
- TA cache data for 508 markets × multiple timeframes
- Market data cache
- All running on 2x ARM servers (4 CPU, 8GB RAM each)
Result: Redis memory exhaustion → connection timeouts → silent job failures.
The "Robust by Accident" Discovery
The funniest part? Our TA availability scheduler has built-in resilience:
// Keeps rescheduling until all markets are covered
// Mercury: "Redis failed me? Fine, I'll just keep trying!"
The system was literally self-healing through brute force scheduling. Eventually, after multiple runs, all 1,524 jobs would get scheduled. Peak Mercury engineering!
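The behaviour can be modelled roughly like this. It is a toy sketch, not the actual Mercury scheduler: the market names, the three timeframes, and the "backend accepts at most 586 adds per pass, then fails silently" rule are all assumptions picked to mirror the numbers we observed.

```typescript
// All names below are hypothetical; the post only states 508 markets x 3 timeframes.
type TryAdd = (key: string) => boolean;

function expectedKeys(markets: string[], timeframes: string[]): string[] {
  const keys: string[] = [];
  for (const market of markets) {
    for (const timeframe of timeframes) {
      keys.push(`${market}|${timeframe}`);
    }
  }
  return keys;
}

// Simulated backend: each pass gets a fresh budget; adds beyond it fail silently.
function flakyBackend(capacityPerPass: number): () => TryAdd {
  return () => {
    let accepted = 0;
    return () => {
      if (accepted >= capacityPerPass) return false; // simulated timeout, no error raised
      accepted++;
      return true;
    };
  };
}

// Keep running passes over the missing keys until everything is covered.
function scheduleUntilCovered(
  expected: string[],
  newPass: () => TryAdd,
  maxPasses: number,
): { passes: number; scheduled: Set<string> } {
  const scheduled = new Set<string>();
  let passes = 0;
  while (scheduled.size < expected.length && passes < maxPasses) {
    const tryAdd = newPass();
    for (const key of expected) {
      if (scheduled.has(key)) continue; // already covered by an earlier pass
      if (tryAdd(key)) scheduled.add(key);
    }
    passes++;
  }
  return { passes, scheduled };
}

const toyMarkets: string[] = Array.from({ length: 508 }, (_, i) => `MKT${i}/USDT`);
const toyTimeframes: string[] = ['1h', '4h', '1d'];
const toyExpected = expectedKeys(toyMarkets, toyTimeframes);
const coverage = scheduleUntilCovered(toyExpected, flakyBackend(586), 10);
console.log(coverage.passes, coverage.scheduled.size);
```

With a budget of 586 adds per pass, the toy model needs three passes to cover all 1,524 keys. The real failure pattern was certainly messier, but the shape is the same: skip what is already scheduled, retry what is missing.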
The Great Infrastructure Upgrade Plan
Current State: Poverty Edition
- 2x Hetzner ARM VPS (4 CPU, 8GB RAM) - "not for high CPU load"
- Redis running on bicycle wheels
- Multiple services fighting for 8GB
Future State: Beast Mode
JANUS (The Beast - Xeon, Unlimited Power):
├── Mercury-TA (primary) - Heavy TA-Lib calculations
├── Domain apps (Arcana, Anytracker, Maschine) - 99.9% idle
└── Mercury-TA (failover) - Backup from Hetzner
ARM Server 1 (Hetzner):
├── Mercury variants A, B, H
├── Redis (dedicated namespace)
└── PostgreSQL (mercury-abh)
ARM Server 2 (Hetzner):
├── Mercury variants R, W
├── Redis (dedicated namespace)
└── PostgreSQL (mercury-rw)
Strategy Benefits
- Co-location: Redis + PostgreSQL + App on the same instance = no cross-host network hop, only loopback latency
- Load separation: Heavy computation → JANUS, Trading logic → ARM servers
- Bulletproof: Server death affects only 2-3 variants, not everything
- Cost effective: ARM servers handle what they're good at, beast handles heavy lifting
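The blast-radius claim is easy to sanity-check by writing the planned layout down as data. This is a sketch only: the host names (`janus`, `arm-1`, `arm-2`) and the `Host` shape are made up for illustration, and it deliberately ignores the fact that every variant also depends on Mercury-TA being reachable somewhere.

```typescript
type Variant = 'A' | 'B' | 'H' | 'R' | 'W';

// Hypothetical description of the planned layout; names are illustrative.
interface Host {
  name: string;
  variants: Variant[];
  services: string[];
}

const plannedHosts: Host[] = [
  { name: 'janus', variants: [], services: ['mercury-ta', 'arcana', 'anytracker', 'maschine'] },
  { name: 'arm-1', variants: ['A', 'B', 'H'], services: ['redis', 'postgresql (mercury-abh)'] },
  { name: 'arm-2', variants: ['R', 'W'], services: ['redis', 'postgresql (mercury-rw)'] },
];

// Variants that go down with a given host (empty if the host is unknown).
function variantsLostIf(hostName: string, hosts: Host[] = plannedHosts): Variant[] {
  const host = hosts.find((h) => h.name === hostName);
  return host ? host.variants.slice() : [];
}

// Variants still running on the remaining hosts.
function variantsSurvivingIf(hostName: string, hosts: Host[] = plannedHosts): Variant[] {
  const survivors: Variant[] = [];
  for (const host of hosts) {
    if (host.name === hostName) continue;
    for (const variant of host.variants) survivors.push(variant);
  }
  return survivors;
}

console.log(variantsLostIf('arm-1'), variantsSurvivingIf('arm-1'));
console.log(variantsLostIf('arm-2'), variantsSurvivingIf('arm-2'));
```

Losing `arm-1` takes out A, B, and H while R and W keep trading; losing `arm-2` is the mirror image. Losing `janus` takes out no variant directly, which is exactly why the TA failover path deserves its own test.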
Technical Lessons Learned
1. Redis Performance Isn't the Problem
Redis can handle millions of operations per second. The issue was:
- Algorithmic complexity: O(n²) duplicate checking
- Memory limits: 1GB trying to hold gigabytes
- No configuration: Default limits for production load
2. ARM Servers Have Their Place
ARM architecture is great for:
- Trading logic (sufficient CPU performance)
- I/O bound operations (network, database)
- Cost efficiency for sustained workloads
Not great for:
- Heavy computational tasks (TA-Lib calculations)
- Memory-intensive operations (large Redis datasets)
3. The Power of Proper Job Scheduling
// Fix: Use deterministic job IDs instead of scanning
await queue.add(jobName, data, {
  jobId: `execute-order-${orderId}`, // O(1) duplicate prevention: an add with an existing ID is ignored
  removeOnComplete: false, // Keep finished jobs so the ID keeps blocking re-adds
  removeOnFail: false, // Same for failed jobs
});
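The same idea carries over to the TA availability jobs, where the natural key is market plus timeframe. The helper below is a hypothetical sketch (`taJobId` and the `ta-availability` prefix are not from our codebase, and the market names are placeholders). It URL-encodes the market because, as far as we can tell, recent BullMQ versions reject custom job IDs that contain a colon, and some exchange symbols do; treat that as something to verify against the BullMQ version in use.

```typescript
// Hypothetical helper; only the `jobId` option itself comes from the fix above.
const TA_JOB_PREFIX = 'ta-availability';

// Same (market, timeframe) always yields the same ID, so re-adding is a no-op.
// encodeURIComponent escapes ':', '/' and '@', so '@' is a safe separator here.
function taJobId(market: string, timeframe: string): string {
  return `${TA_JOB_PREFIX}@${encodeURIComponent(market)}@${timeframe}`;
}

// Placeholder inputs: 508 made-up markets and three assumed timeframes.
const demoMarkets: string[] = Array.from({ length: 508 }, (_, i) => `MKT${i}/USDT`);
const demoTimeframes: string[] = ['1h', '4h', '1d'];

const uniqueTaIds = new Set<string>();
for (const market of demoMarkets) {
  for (const timeframe of demoTimeframes) {
    uniqueTaIds.add(taJobId(market, timeframe));
  }
}

console.log(uniqueTaIds.size);
console.log(taJobId('BTC/USDT:USDT', '1h'));
```

Each ID would then be passed as the `jobId` option when adding the job, in the same way as the `execute-order-${orderId}` example above.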
4. Infrastructure Co-location Strategy
Placing related services together eliminates:
- Network latency between Redis and app
- Connection pool exhaustion across servers
- Complex service discovery and networking
- Cascade failures from network issues
The Cheap Ass Engineering Philosophy
Sometimes the best solutions come from constraints:
- Work with what you have until you hit real limits
- Profile before upgrading - understand your bottlenecks
- Horizontal scaling can be cheaper than vertical
- Robust-by-accident designs often work better than over-engineered ones
Our "accidental resilience" through multiple scheduling attempts taught us that eventual consistency can be a feature, not a bug.
Next Steps
- Immediate: Fix the O(n²) scheduler logic
- Short-term: Upgrade Redis configuration with proper memory limits
- Medium-term: Migrate to the JANUS + ARM hybrid architecture
- Long-term: Document this as a case study in "cheap ass engineering that actually works"
Conclusion
International Cheap Ass Day reminded us that constraints breed creativity. Our poverty-spec infrastructure forced us to:
- Understand our bottlenecks deeply
- Design resilient systems (accidentally)
- Optimize algorithms instead of throwing hardware at problems
- Plan sustainable growth without breaking the bank
Sometimes you need to run a Ferrari on bicycle wheels to truly appreciate proper tires!
Happy International Cheap Ass Day! May your infrastructure be robust and your servers be cheap!
