
Capacity Planning

Instance specs, resource usage, indexing lag, disk sizing, and failure recovery behavior for production deployments

dbtrail runs a lightweight Go agent on AWS Graviton EC2 instances. The agent orchestrates binlog streaming, indexing, and archival by delegating heavy work to the bintrail CLI — the agent itself is a thin HTTP server that manages subprocesses. This page covers what you need to size and plan for in production.

Instance specifications

Each plan tier runs on a specific EC2 instance type. Free tier shares an instance with other tenants; paid tiers get a dedicated instance.

|                          | Free                | Pro        | Premium   | Enterprise |
|--------------------------|---------------------|------------|-----------|------------|
| Instance type            | r7g.medium (shared) | t4g.medium | t4g.large | t4g.large  |
| vCPU                     | 1                   | 2          | 2         | 2          |
| RAM                      | 8 GB                | 4 GB       | 8 GB      | 8 GB       |
| Disk                     | 50 GB gp3           | 50 GB gp3  | 50 GB gp3 | 50 GB gp3  |
| InnoDB buffer pool       | 2 GB                | 2 GB       | 4 GB      | 4 GB       |
| Max index DB connections | 100                 | 100        | 200       | 200        |

Free tier resource sharing

Free tier runs as a Docker container on a shared EC2 instance (up to 6 containers per host). CPU and memory are not reserved — performance may vary under load. Paid tiers get a dedicated instance with guaranteed resources.

Agent resource usage

The Go agent itself is minimal. CPU and memory are dominated by the bintrail CLI subprocesses and the MySQL index database, not the agent HTTP server.

Real-world measurements

These numbers come from a production demo running on a t4g.medium (2 vCPU, 4 GB RAM) indexing a WordPress database doing ~2,500 writes/day with binlog_row_image=FULL:

| Component                         | Memory (RSS) |
|-----------------------------------|--------------|
| bintrail-agent (HTTP server)      | 14 MB        |
| bintrail stream (binlog parser)   | 27 MB        |
| bintrail rotate (archival daemon) | 30 MB        |
| Total agent footprint             | ~71 MB       |

The remaining ~2 GB of used memory is the InnoDB buffer pool for the local index database. The agent processes themselves are lightweight.

CPU: 7–9% average across all hours of the day, with occasional spikes to ~50% during archive rotation (Parquet export + S3 upload). These spikes last seconds, not minutes. System load average holds steady at 0.1–0.2.

Connections to your source MySQL

The agent opens a small, fixed number of connections to your monitored database:

| Connection              | Purpose                                      | Count |
|-------------------------|----------------------------------------------|-------|
| Replication             | Binlog streaming via COM_BINLOG_DUMP_GTID    | 1     |
| Connection cache poller | Reads performance_schema.threads every 500ms | 2     |
| Total                   |                                              | 3     |

Audit plugin optimization

If the agent detects an active audit log plugin on your source MySQL (e.g., Percona Audit Log, MySQL Enterprise Audit), the connection cache poller is automatically disabled. This reduces source connections from 3 to 1, since the audit log provides superior historical connection data.

Plan limits

|                           | Free    | Pro     | Premium   | Enterprise |
|---------------------------|---------|---------|-----------|------------|
| Servers                   | 1       | 5       | 20        | Unlimited  |
| Concurrent streams        | 1       | 5       | 20        | 100        |
| History retention         | 7 days  | 30 days | 90 days   | 365 days   |
| API rate limit (tenant)   | 120 RPM | 600 RPM | 2,000 RPM | 10,000 RPM |
| API rate limit (per user) | 60 RPM  | 200 RPM | 600 RPM   | 2,000 RPM  |

Rate limits apply to API calls (query, recover, status, forensics). They do not throttle the binlog stream itself, which runs continuously.

Index size and disk usage

The index database stores one row per binlog change event, including structured before/after column values, timestamps, binlog positions, and schema metadata.

How much disk does the index use?

From our production demo (WordPress, ~2,500 writes/day, binlog_row_image=FULL):

| Metric                                       | Value                              |
|----------------------------------------------|------------------------------------|
| Source binlog throughput                     | ~40 MB/day                         |
| Index payload per day                        | 8–32 MB (varies with write volume) |
| Average event size in index                  | ~5.5 KB                            |
| Total events indexed (19 days)               | 57,130                             |
| Live index size (with 1-day retention)       | 107 MB                             |
| Total disk used (index + MySQL + OS + agent) | 5.5 GB of 50 GB (12%)              |

The index-to-binlog ratio depends on your row width and binlog_row_image setting. With FULL row images (which store complete before/after rows), the index is roughly 50–80% the size of the raw binlog. With MINIMAL row images, the index can be significantly smaller since only changed columns are recorded.
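Those ratios can be folded into a back-of-envelope estimator. The 0.8 below is the worst case of the FULL range above; the 0.25 for MINIMAL is our own guess, since the actual savings depend on how many columns each write touches — measure your workload before trusting either number:

```python
def estimate_index_size_mb(binlog_mb_per_day: float,
                           retention_days: int,
                           row_image: str = "FULL") -> float:
    """Rough live-index size from binlog throughput and retention.

    Ratios are assumptions: 0.8 is the worst case of the 50-80%
    range observed for FULL row images; 0.25 for MINIMAL is a
    guess -- validate against a real 24-48 h measurement.
    """
    ratio = {"FULL": 0.8, "MINIMAL": 0.25}[row_image]
    return binlog_mb_per_day * ratio * retention_days
```

For example, the demo workload (~40 MB/day of binlog, FULL row images) on a Pro plan's 30-day retention would budget roughly `estimate_index_size_mb(40, 30)` ≈ 960 MB of live index.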

Disk monitoring

Two independent mechanisms protect against disk exhaustion. A disk watcher polls every 3 minutes and cancels backup/dump operations if free space drops below 3 GB or 10% of the volume, whichever threshold is stricter. Separately, the /health endpoint checks disk on every request and reports degraded when usage exceeds 95%. Monitor disk_usage_percent and disk_free_gb from your alerting system.
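The watcher's cancellation rule can be restated as a small predicate. This is an illustrative re-statement of the rule above, not the agent's actual code — "whichever is stricter" means the larger of the two absolute thresholds wins:

```python
def should_cancel(free_gb: float, total_gb: float,
                  min_free_gb: float = 3.0,
                  min_free_pct: float = 10.0) -> bool:
    """Disk watcher rule: cancel backup/dump work when free space
    falls below 3 GB or 10% of the volume, whichever threshold is
    stricter (i.e. the larger absolute amount of free space)."""
    threshold_gb = max(min_free_gb, total_gb * min_free_pct / 100.0)
    return free_gb < threshold_gb
```

On the standard 50 GB gp3 volume, the 10% rule (5 GB) is stricter than the 3 GB floor, so operations are cancelled below 5 GB free.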

Sizing recommendation: start a stream, let it run for 24–48 hours, then check disk usage to extrapolate. Schema and table filtering can significantly reduce index size if you only need to track specific tables.

Retention and archiving

Retention by plan

| Plan       | Live index retention   | S3 archive    |
|------------|------------------------|---------------|
| Free       | 7 days (auto-enforced) | Not available |
| Pro        | 30 days                | Included      |
| Premium    | 90 days                | Included      |
| Enterprise | 365 days               | Included      |

How retention works

The rotate daemon runs on a configurable interval (default: every hour). It:

  1. Exports events older than the retention window to Parquet files, partitioned by date and hour
  2. Uploads Parquet files to S3 with checksum verification (size + SHA-256)
  3. Deletes local Parquet files only after S3 verification succeeds
  4. Purges expired rows from the live index
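The ordering of those four steps is the safety property: local data is only deleted after the S3 copy is verified. A sketch of one rotation cycle, with the callables as hypothetical stand-ins for the real bintrail internals:

```python
def rotate_once(export, upload, verify, delete_local, purge, cutoff):
    """One rotation cycle, mirroring the four steps above.

    export/upload/verify/delete_local/purge are hypothetical
    stand-ins for bintrail internals; the point is the ordering:
    local Parquet files are removed only after S3 verification.
    """
    files = export(cutoff)       # 1. events older than cutoff -> Parquet
    for path in files:
        upload(path)             # 2. upload with size + SHA-256 checksum
        if not verify(path):     # 3. verify the S3 copy first...
            raise RuntimeError(f"S3 verification failed for {path}")
        delete_local(path)       # ...then delete the local file
    purge(cutoff)                # 4. drop expired rows from the live index
    return files
```

If verification fails, the local file survives and the next cycle retries the upload, so a transient S3 outage never loses archived events.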

Archive compression

From 19 days of production data:

| Metric                    | Value   |
|---------------------------|---------|
| Parquet files produced    | 450     |
| Total S3 archive size     | 12.8 MB |
| Average per day           | ~670 KB |
| Compression vs raw binlog | ~60:1   |

Parquet's columnar format with built-in compression makes long-term storage extremely efficient. Even with 365 days of retention on a moderately active server, S3 storage costs are typically under $1/month.
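The "under $1/month" claim is easy to sanity-check. A back-of-envelope estimator, assuming S3 Standard at roughly $0.023/GB-month (check your region's current pricing):

```python
def s3_archive_cost_usd(kb_per_day: float, retention_days: int,
                        usd_per_gb_month: float = 0.023) -> float:
    """Rough monthly S3 cost for the Parquet archive.

    0.023 USD/GB-month is an assumed S3 Standard price;
    substitute your region's actual rate.
    """
    total_gb = kb_per_day * retention_days / (1024 * 1024)
    return total_gb * usd_per_gb_month
```

At the demo's ~670 KB/day, a full 365-day archive is about 0.23 GB — roughly half a cent per month, comfortably under the $1 figure.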

S3 lifecycle policies

Retention applies to the local index database. Archived Parquet files in S3 follow your S3 lifecycle policies and are not subject to the rotate daemon's retention window. You can keep archives indefinitely in S3 or move them to Glacier for long-term compliance storage.

Indexing lag

Indexing lag is the delay between when a change occurs on your MySQL server and when it appears in the dbtrail index. It's the sum of binlog replication delay, event parsing time, and index write time.

How lag is exposed

The stream status endpoint returns two lag metrics:

  • lag_seconds — time difference between the most recent indexed event and the current time
  • lag_events — number of binlog events received but not yet written to the index

Both are visible in the dashboard's stream status panel and via the /api/v1/status API endpoint.

Checkpoint mechanism

The stream writes a checkpoint (binlog file + position, or GTID set) to the index database at a configurable interval. On restart, the stream resumes from the last checkpoint — no events are reprocessed or lost, as long as the source MySQL hasn't purged the binlog files covering the gap.
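The resume behavior can be modeled in a few lines. A toy sketch — the real agent stores checkpoints in the index database, and the `dict` here stands in for that table; position 4 is the first-event offset in a MySQL binlog file (after the 4-byte magic header):

```python
class Checkpoint:
    """Toy model of the stream checkpoint: persist (file, position)
    periodically, resume from the last durable value after a restart.
    The dict stands in for the index database table."""

    def __init__(self, store: dict):
        self.store = store

    def save(self, binlog_file: str, position: int) -> None:
        self.store["checkpoint"] = (binlog_file, position)

    def resume_from(self) -> tuple:
        # After a restart, continue from the last durable position;
        # 4 is the first-event offset in a fresh binlog file.
        return self.store.get("checkpoint", ("", 4))
```

Because the checkpoint is written durably before it is trusted, a crash between checkpoints at worst re-reads a few events from the binlog — it never skips them.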

What affects lag

  • Binlog event volume — high-throughput workloads (bulk INSERTs, batch UPDATEs) produce more events per second, increasing parsing time
  • Row width — wide rows with large TEXT or BLOB columns take longer to parse and write to the index
  • Network latency — relevant if the agent connects to the source MySQL over a network (not localhost) or through an SSH tunnel
  • Index DB write performance — gp3 EBS volumes with baseline 3,000 IOPS handle typical workloads easily; extremely write-heavy servers may benefit from provisioned IOPS

Under typical OLTP workloads (hundreds to low thousands of writes per second), expect sub-second lag. High-throughput batch operations may temporarily increase lag, which recovers once the burst subsides.

Failure and recovery

What happens when the agent restarts?

The agent persists stream state to disk (a JSON state file in Docker mode, or systemd journal in systemd mode). On restart:

  1. The agent reads the persisted state and re-launches all previously running streams
  2. Each stream reads its last checkpoint from the index database
  3. Streaming resumes from the checkpoint position — no events are duplicated or lost

If the source MySQL has purged the binlog files that cover the gap between the last checkpoint and the current position, a fresh snapshot (full dump) is needed to re-establish a baseline. This is the same recovery model as MySQL replication.

Auto-restart behavior

If a stream process crashes, the agent restarts it automatically:

| Parameter               | Value                             |
|-------------------------|-----------------------------------|
| Max restart attempts    | 5 (consecutive)                   |
| Initial backoff         | 5 seconds                         |
| Maximum backoff         | 80 seconds (doubles each attempt) |
| Stable uptime threshold | 2 minutes                         |

If the stream runs for 2+ minutes after a restart, the retry counter resets — it's treated as a transient crash. After 5 consecutive failures within the stable threshold, the agent stops retrying and reports the stream as failed.
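Putting the table and the reset rule together, the restart policy can be sketched as two small functions (an illustrative restatement, not the agent's code):

```python
def backoff_schedule(initial: float = 5.0, maximum: float = 80.0,
                     attempts: int = 5) -> list:
    """Delays for consecutive crash restarts: doubling from 5 s,
    capped at 80 s, at most 5 attempts before giving up."""
    delay, schedule = initial, []
    for _ in range(attempts):
        schedule.append(delay)
        delay = min(delay * 2, maximum)
    return schedule

def next_attempt(prev_attempts: int, uptime_seconds: float,
                 stable_threshold: float = 120.0) -> int:
    """A run of 2+ minutes resets the counter (transient crash);
    otherwise the consecutive-failure count advances."""
    return 1 if uptime_seconds >= stable_threshold else prev_attempts + 1
```

So a crashing stream is retried after 5, 10, 20, 40, and 80 seconds; a total of about 2.5 minutes of backoff before the agent marks it failed.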

Graceful shutdown

On SIGTERM, the agent stops all streams gracefully, writes final state to disk, and exits cleanly. Systemd and Docker both send SIGTERM by default on stop/restart operations.

Monitoring

Health endpoint

GET /health — lightweight, unauthenticated. Suitable for load balancer health checks or external monitoring.

{
  "status": "healthy",
  "agent_version": "0.4.1",
  "mysql_index": "connected",
  "disk_usage_percent": 12,
  "disk_total_gb": 47.3,
  "disk_free_gb": 41.8,
  "uptime_seconds": 5422
}

Status is healthy when the index database is reachable and disk usage is below 95%; otherwise it reports degraded.
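That rule is a single conjunction, restated here as an illustrative predicate:

```python
def health_status(index_connected: bool, disk_usage_percent: float) -> str:
    """Health rule described above: healthy only when the index DB
    is reachable AND disk usage is under 95%."""
    if index_connected and disk_usage_percent < 95:
        return "healthy"
    return "degraded"
```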

Stream status endpoint

GET /api/v1/status — returns detailed stream metrics including lag, checkpoint position, and schema coverage. Requires authentication via service token.

Poll /health every 30–60 seconds from your monitoring system (Datadog, Prometheus, Nagios, etc.). Alert on:

  • status != healthy
  • disk_usage_percent > 80% (warning) or > 90% (critical)
  • disk_free_gb < 5 GB

For stream-level monitoring, poll /api/v1/status and alert on lag_seconds exceeding your tolerance (e.g., > 60 seconds sustained).
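The /health alert rules above translate directly into a small evaluator you could drop into a monitoring script. The severity labels are our own convention, not part of the API:

```python
def alerts(health: dict) -> list:
    """Evaluate the alert rules listed above against a /health
    payload. Severity labels are illustrative, not part of the API."""
    fired = []
    if health.get("status") != "healthy":
        fired.append("critical: status != healthy")
    pct = health.get("disk_usage_percent", 0)
    if pct > 90:
        fired.append("critical: disk usage > 90%")
    elif pct > 80:
        fired.append("warning: disk usage > 80%")
    if health.get("disk_free_gb", float("inf")) < 5:
        fired.append("warning: disk free < 5 GB")
    return fired
```

Run against the sample payload shown earlier (12% used, 41.8 GB free), this returns no alerts.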

Sizing recommendations

| Workload                               | Recommended plan | Notes                                                 |
|----------------------------------------|------------------|-------------------------------------------------------|
| Single server, < 1 GB binlog/day       | Free             | Shared instance, 7-day retention                      |
| 1–5 servers, < 10 GB binlog/day        | Pro              | Dedicated t4g.medium, 30-day retention                |
| 5–20 servers, < 50 GB binlog/day       | Premium          | Dedicated t4g.large (8 GB RAM), 90-day retention      |
| 20+ servers or compliance requirements | Enterprise       | Dedicated t4g.large, 365-day retention, custom limits |

These are starting points. Actual requirements depend on row width, change frequency, and how aggressively you filter schemas and tables. Start with a trial and monitor disk usage for 48 hours before committing to a plan.
