
Capacity Planning

Instance specs, resource usage, indexing lag, disk sizing, and failure recovery behavior for production deployments

dbtrail runs a lightweight Go agent on AWS Graviton EC2 instances. The agent orchestrates binlog streaming, indexing, and archival by delegating heavy work to the bintrail CLI — the agent itself is a thin HTTP server that manages subprocesses. This page covers what you need to size and plan for in production.

Instance specifications

Each plan tier runs on a specific EC2 instance type. Free tier shares an instance with other tenants; paid tiers get a dedicated instance.

|                          | Free                | Pro        | Premium   | Enterprise |
|--------------------------|---------------------|------------|-----------|------------|
| Instance type            | r7g.medium (shared) | t4g.medium | t4g.large | t4g.large  |
| vCPU                     | 1                   | 2          | 2         | 2          |
| RAM                      | 8 GB                | 4 GB       | 8 GB      | 8 GB       |
| Disk                     | 50 GB gp3           | 50 GB gp3  | 50 GB gp3 | 50 GB gp3  |
| InnoDB buffer pool       | 2 GB                | 2 GB       | 4 GB      | 4 GB       |
| Max index DB connections | 100                 | 100        | 200       | 200        |

Free tier resource sharing

Free tier runs as a Docker container on a shared EC2 instance (up to 6 containers per host). CPU and memory are not reserved — performance may vary under load. Paid tiers get a dedicated instance with guaranteed resources.

Agent resource usage

The Go agent itself is minimal. CPU and memory are dominated by the bintrail CLI subprocesses and the MySQL index database, not the agent HTTP server.

Real-world measurements

These numbers come from a production demo running on a t4g.medium (2 vCPU, 4 GB RAM) indexing a WordPress database doing ~2,500 writes/day with binlog_row_image=FULL:

| Component                         | Memory (RSS) |
|-----------------------------------|--------------|
| bintrail-agent (HTTP server)      | 14 MB        |
| bintrail stream (binlog parser)   | 27 MB        |
| bintrail rotate (archival daemon) | 30 MB        |
| Total agent footprint             | ~71 MB       |

The remaining ~2 GB of used memory is the InnoDB buffer pool for the local index database. The agent processes themselves are lightweight.

CPU: 7–9% average across all hours of the day, with occasional spikes to ~50% during archive rotation (Parquet export + S3 upload). These spikes last seconds, not minutes. System load average holds steady at 0.1–0.2.

Connections to your source MySQL

The agent opens a small, fixed number of connections to your monitored database:

| Connection              | Purpose                                      | Count |
|-------------------------|----------------------------------------------|-------|
| Replication             | Binlog streaming via COM_BINLOG_DUMP_GTID    | 1     |
| Connection cache poller | Reads performance_schema.threads every 500ms | 2     |
| Total                   |                                              | 3     |

Audit plugin optimization

If the agent detects an active audit log plugin on your source MySQL (e.g., Percona Audit Log, MySQL Enterprise Audit), the connection cache poller is automatically disabled. This reduces source connections from 3 to 1, since the audit log provides superior historical connection data.

Plan limits

|                           | Free    | Pro     | Premium   | Enterprise |
|---------------------------|---------|---------|-----------|------------|
| Servers                   | 1       | 5       | 20        | Unlimited  |
| Concurrent streams        | 1       | 5       | 20        | 100        |
| History retention         | 7 days  | 30 days | 90 days   | 365 days   |
| API rate limit (tenant)   | 120 RPM | 600 RPM | 2,000 RPM | 10,000 RPM |
| API rate limit (per user) | 60 RPM  | 200 RPM | 600 RPM   | 2,000 RPM  |

Rate limits apply to API calls (query, recover, status, forensics). They do not throttle the binlog stream itself, which runs continuously.

Index size and disk usage

The index database stores one row per binlog change event, including structured before/after column values, timestamps, binlog positions, and schema metadata.

How much disk does the index use?

From our production demo (WordPress, ~2,500 writes/day, binlog_row_image=FULL):

| Metric                                       | Value                              |
|----------------------------------------------|------------------------------------|
| Source binlog throughput                     | ~40 MB/day                         |
| Index payload per day                        | 8–32 MB (varies with write volume) |
| Average event size in index                  | ~5.5 KB                            |
| Total events indexed (19 days)               | 57,130                             |
| Live index size (with 1-day retention)       | 107 MB                             |
| Total disk used (index + MySQL + OS + agent) | 5.5 GB of 50 GB (12%)              |

The index-to-binlog ratio depends on your row width and binlog_row_image setting. With FULL row images (which store complete before/after rows), the index is roughly 50–80% the size of the raw binlog. With MINIMAL row images, the index can be significantly smaller since only changed columns are recorded.
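Those ratios can be folded into a back-of-envelope estimator. The 0.8 below is the worst case of the FULL range above; the 0.25 for MINIMAL is our own guess, since the actual savings depend on how many columns each write touches — measure your workload before trusting either number:

```python
def estimate_index_size_mb(binlog_mb_per_day: float,
                           retention_days: int,
                           row_image: str = "FULL") -> float:
    """Rough live-index size from binlog throughput and retention.

    Ratios are assumptions: 0.8 is the worst case of the 50-80%
    range observed for FULL row images; 0.25 for MINIMAL is a
    guess -- validate against a real 24-48 h measurement.
    """
    ratio = {"FULL": 0.8, "MINIMAL": 0.25}[row_image]
    return binlog_mb_per_day * ratio * retention_days
```

For example, the demo workload (~40 MB/day of binlog, FULL row images) on a Pro plan's 30-day retention would budget roughly `estimate_index_size_mb(40, 30)` ≈ 960 MB of live index.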

Disk monitoring

Two independent mechanisms protect against disk exhaustion. A disk watcher polls every 3 minutes and cancels backup/dump operations if free space drops below 3 GB or 10% of the volume, whichever threshold is stricter. Separately, the /health endpoint checks disk on every request and reports degraded when usage exceeds 95%. Monitor disk_usage_percent and disk_free_gb from your alerting system.
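The watcher's cancellation rule can be restated as a small predicate. This is an illustrative re-statement of the rule above, not the agent's actual code — "whichever is stricter" means the larger of the two absolute thresholds wins:

```python
def should_cancel(free_gb: float, total_gb: float,
                  min_free_gb: float = 3.0,
                  min_free_pct: float = 10.0) -> bool:
    """Disk watcher rule: cancel backup/dump work when free space
    falls below 3 GB or 10% of the volume, whichever threshold is
    stricter (i.e. the larger absolute amount of free space)."""
    threshold_gb = max(min_free_gb, total_gb * min_free_pct / 100.0)
    return free_gb < threshold_gb
```

On the standard 50 GB gp3 volume, the 10% rule (5 GB) is stricter than the 3 GB floor, so operations are cancelled below 5 GB free.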

Sizing recommendation: start a stream, let it run for 24–48 hours, then check disk usage to extrapolate. Schema and table filtering can significantly reduce index size if you only need to track specific tables.

Retention and archiving

Retention by plan

| Plan       | Live index retention   | S3 archive    |
|------------|------------------------|---------------|
| Free       | 7 days (auto-enforced) | Not available |
| Pro        | 30 days                | Included      |
| Premium    | 90 days                | Included      |
| Enterprise | 365 days               | Included      |

How retention works

The rotate daemon runs on a configurable interval (default: every hour). It:

  1. Exports events older than the retention window to Parquet files, partitioned by date and hour
  2. Uploads Parquet files to S3 with checksum verification (size + SHA-256)
  3. Deletes local Parquet files only after S3 verification succeeds
  4. Purges expired rows from the live index
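The ordering of those four steps is the safety property: local data is only deleted after the S3 copy is verified. A sketch of one rotation cycle, with the callables as hypothetical stand-ins for the real bintrail internals:

```python
def rotate_once(export, upload, verify, delete_local, purge, cutoff):
    """One rotation cycle, mirroring the four steps above.

    export/upload/verify/delete_local/purge are hypothetical
    stand-ins for bintrail internals; the point is the ordering:
    local Parquet files are removed only after S3 verification.
    """
    files = export(cutoff)       # 1. events older than cutoff -> Parquet
    for path in files:
        upload(path)             # 2. upload with size + SHA-256 checksum
        if not verify(path):     # 3. verify the S3 copy first...
            raise RuntimeError(f"S3 verification failed for {path}")
        delete_local(path)       # ...then delete the local file
    purge(cutoff)                # 4. drop expired rows from the live index
    return files
```

If verification fails, the local file survives and the next cycle retries the upload, so a transient S3 outage never loses archived events.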

Archive compression

From 19 days of production data:

| Metric                    | Value   |
|---------------------------|---------|
| Parquet files produced    | 450     |
| Total S3 archive size     | 12.8 MB |
| Average per day           | ~670 KB |
| Compression vs raw binlog | ~60:1   |

Parquet's columnar format with built-in compression makes long-term storage extremely efficient. Even with 365 days of retention on a moderately active server, S3 storage costs are typically under $1/month.
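The "under $1/month" claim is easy to sanity-check. A back-of-envelope estimator, assuming S3 Standard at roughly $0.023/GB-month (check your region's current pricing):

```python
def s3_archive_cost_usd(kb_per_day: float, retention_days: int,
                        usd_per_gb_month: float = 0.023) -> float:
    """Rough monthly S3 cost for the Parquet archive.

    0.023 USD/GB-month is an assumed S3 Standard price;
    substitute your region's actual rate.
    """
    total_gb = kb_per_day * retention_days / (1024 * 1024)
    return total_gb * usd_per_gb_month
```

At the demo's ~670 KB/day, a full 365-day archive is about 0.23 GB — roughly half a cent per month, comfortably under the $1 figure.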

S3 lifecycle policies

Retention applies to the local index database. Archived Parquet files in S3 follow your S3 lifecycle policies and are not subject to the rotate daemon's retention window. You can keep archives indefinitely in S3 or move them to Glacier for long-term compliance storage.

Indexing lag

Indexing lag is the delay between when a change occurs on your MySQL server and when it appears in the dbtrail index. It's the sum of binlog replication delay, event parsing time, and index write time.

How lag is exposed

The stream status endpoint returns two lag metrics:

  • lag_seconds — time difference between the most recent indexed event and the current time
  • lag_events — number of binlog events received but not yet written to the index

Both are visible in the dashboard's stream status panel and via the /api/v1/status API endpoint.

Checkpoint mechanism

The stream writes a checkpoint (binlog file + position, or GTID set) to the index database at a configurable interval. On restart, the stream resumes from the last checkpoint — no events are reprocessed or lost, as long as the source MySQL hasn't purged the binlog files covering the gap.
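The resume behavior can be modeled in a few lines. A toy sketch — the real agent stores checkpoints in the index database, and the `dict` here stands in for that table; position 4 is the first-event offset in a MySQL binlog file (after the 4-byte magic header):

```python
class Checkpoint:
    """Toy model of the stream checkpoint: persist (file, position)
    periodically, resume from the last durable value after a restart.
    The dict stands in for the index database table."""

    def __init__(self, store: dict):
        self.store = store

    def save(self, binlog_file: str, position: int) -> None:
        self.store["checkpoint"] = (binlog_file, position)

    def resume_from(self) -> tuple:
        # After a restart, continue from the last durable position;
        # 4 is the first-event offset in a fresh binlog file.
        return self.store.get("checkpoint", ("", 4))
```

Because the checkpoint is written durably before it is trusted, a crash between checkpoints at worst re-reads a few events from the binlog — it never skips them.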

What affects lag

  • Binlog event volume — high-throughput workloads (bulk INSERTs, batch UPDATEs) produce more events per second, increasing parsing time
  • Row width — wide rows with large TEXT or BLOB columns take longer to parse and write to the index
  • Network latency — relevant if the agent connects to the source MySQL over a network (not localhost) or through an SSH tunnel
  • Index DB write performance — gp3 EBS volumes with baseline 3,000 IOPS handle typical workloads easily; extremely write-heavy servers may benefit from provisioned IOPS

Under typical OLTP workloads (hundreds to low thousands of writes per second), expect sub-second lag. High-throughput batch operations may temporarily increase lag, which recovers once the burst subsides.

Failure and recovery

What happens when the agent restarts?

The agent persists stream state to disk (a JSON state file in Docker mode, or systemd journal in systemd mode). On restart:

  1. The agent reads the persisted state and re-launches all previously running streams
  2. Each stream reads its last checkpoint from the index database
  3. Streaming resumes from the checkpoint position — no events are duplicated or lost

If the source MySQL has purged the binlog files that cover the gap between the last checkpoint and the current position, a fresh snapshot (full dump) is needed to re-establish a baseline. This is the same recovery model as MySQL replication.

Auto-restart behavior

If a stream process crashes, the agent restarts it automatically:

| Parameter               | Value                             |
|-------------------------|-----------------------------------|
| Max restart attempts    | 5 (consecutive)                   |
| Initial backoff         | 5 seconds                         |
| Maximum backoff         | 80 seconds (doubles each attempt) |
| Stable uptime threshold | 2 minutes                         |

If the stream runs for 2+ minutes after a restart, the retry counter resets — it's treated as a transient crash. After 5 consecutive failures within the stable threshold, the agent stops retrying and reports the stream as failed.
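Putting the table and the reset rule together, the restart policy can be sketched as two small functions (an illustrative restatement, not the agent's code):

```python
def backoff_schedule(initial: float = 5.0, maximum: float = 80.0,
                     attempts: int = 5) -> list:
    """Delays for consecutive crash restarts: doubling from 5 s,
    capped at 80 s, at most 5 attempts before giving up."""
    delay, schedule = initial, []
    for _ in range(attempts):
        schedule.append(delay)
        delay = min(delay * 2, maximum)
    return schedule

def next_attempt(prev_attempts: int, uptime_seconds: float,
                 stable_threshold: float = 120.0) -> int:
    """A run of 2+ minutes resets the counter (transient crash);
    otherwise the consecutive-failure count advances."""
    return 1 if uptime_seconds >= stable_threshold else prev_attempts + 1
```

So a crashing stream is retried after 5, 10, 20, 40, and 80 seconds; a total of about 2.5 minutes of backoff before the agent marks it failed.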

Graceful shutdown

On SIGTERM, the agent stops all streams gracefully, writes final state to disk, and exits cleanly. Systemd and Docker both send SIGTERM by default on stop/restart operations.

Monitoring

Health endpoint

GET /health — lightweight, unauthenticated. Suitable for load balancer health checks or external monitoring.

{
  "status": "healthy",
  "agent_version": "0.4.1",
  "mysql_index": "connected",
  "disk_usage_percent": 12,
  "disk_total_gb": 47.3,
  "disk_free_gb": 41.8,
  "uptime_seconds": 5422
}

Status is healthy when the index database is reachable and disk usage is below 95%; otherwise it reports degraded.
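That rule is a single conjunction, restated here as an illustrative predicate:

```python
def health_status(index_connected: bool, disk_usage_percent: float) -> str:
    """Health rule described above: healthy only when the index DB
    is reachable AND disk usage is under 95%."""
    if index_connected and disk_usage_percent < 95:
        return "healthy"
    return "degraded"
```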

Stream status endpoint

GET /api/v1/status — returns detailed stream metrics including lag, checkpoint position, and schema coverage. Requires authentication via service token.

Poll /health every 30–60 seconds from your monitoring system (Datadog, Prometheus, Nagios, etc.). Alert on:

  • status != healthy
  • disk_usage_percent > 80% (warning) or > 90% (critical)
  • disk_free_gb < 5 GB

For stream-level monitoring, poll /api/v1/status and alert on lag_seconds exceeding your tolerance (e.g., > 60 seconds sustained).
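The /health alert rules above translate directly into a small evaluator you could drop into a monitoring script. The severity labels are our own convention, not part of the API:

```python
def alerts(health: dict) -> list:
    """Evaluate the alert rules listed above against a /health
    payload. Severity labels are illustrative, not part of the API."""
    fired = []
    if health.get("status") != "healthy":
        fired.append("critical: status != healthy")
    pct = health.get("disk_usage_percent", 0)
    if pct > 90:
        fired.append("critical: disk usage > 90%")
    elif pct > 80:
        fired.append("warning: disk usage > 80%")
    if health.get("disk_free_gb", float("inf")) < 5:
        fired.append("warning: disk free < 5 GB")
    return fired
```

Run against the sample payload shown earlier (12% used, 41.8 GB free), this returns no alerts.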

Sizing recommendations

| Workload                               | Recommended plan | Notes                                                 |
|----------------------------------------|------------------|-------------------------------------------------------|
| Single server, < 1 GB binlog/day       | Free             | Shared instance, 7-day retention                      |
| 1–5 servers, < 10 GB binlog/day        | Pro              | Dedicated t4g.medium, 30-day retention                |
| 5–20 servers, < 50 GB binlog/day       | Premium          | Dedicated t4g.large (8 GB RAM), 90-day retention      |
| 20+ servers or compliance requirements | Enterprise       | Dedicated t4g.large, 365-day retention, custom limits |

These are starting points. Actual requirements depend on row width, change frequency, and how aggressively you filter schemas and tables. Start with a trial and monitor disk usage for 48 hours before committing to a plan.
