
Diagnosing Bottlenecks

Quine is a fully backpressured system. When one component can't keep up, it slows down upstream components rather than dropping data. This makes the system resilient, but it also means that a bottleneck in one area can manifest as slowness elsewhere.

This guide helps you identify where bottlenecks are occurring so you can focus optimization efforts effectively.

Common Symptoms

| Symptom | Possible Bottleneck |
|---|---|
| Low ingest rate despite available CPU | Standing queries or persistor |
| High CPU with low ingest rate | Inefficient queries or supernodes |
| Ingest rate drops periodically | Standing query backpressure |
| Standing query results delayed or dropped | Output destination or result queue overflow |

Key Metrics for Diagnosis

Ingest Rate

Metrics: ingest.{name}.count, ingest.{name}.bytes

The ingest rate shows how many records per second are being processed. Low ingest rates can have many causes, so use other metrics to narrow down the bottleneck.

Note: Ingest rate is reported as an exponentially weighted moving average, which can be volatile at the beginning and end of a stream. Allow the rate to stabilize for at least 10 minutes before drawing conclusions.
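For example, the following Python sketch polls the ingest meter until consecutive readings stop drifting. The metrics URL, the stream name (my-stream), and the rate field name are assumptions about how your deployment exposes metrics as JSON; adjust them to match your setup.

```python
# Sketch: wait for the ingest rate EWMA to settle before judging throughput.
# Assumptions: metrics are served as Dropwizard-style JSON at METRICS_URL,
# and the meter is named ingest.<stream>.count with a one-minute rate field.
import json
import time
import urllib.request

METRICS_URL = "http://localhost:8080/api/v1/admin/metrics"  # assumed endpoint
INGEST_METER = "ingest.my-stream.count"                     # hypothetical stream name

def ingest_rate() -> float:
    with urllib.request.urlopen(METRICS_URL) as resp:
        meters = json.load(resp).get("meters", {})
    # Field name is an assumption; some reporters call this "m1_rate".
    return meters[INGEST_METER]["one_minute_rate"]

previous = None
while True:
    rate = ingest_rate()
    if previous:
        drift = abs(rate - previous) / previous
        print(f"ingest rate {rate:.0f}/s, drift {drift:.1%}")
        if drift < 0.05:  # under 5% change between samples: treat as stable
            break
    previous = rate
    time.sleep(60)  # sample once a minute; stabilization can take ~10 minutes
```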

Standing Query Backpressure Valve

Metric: shared.valve.ingest

This is a key diagnostic metric. When this gauge shows a non-zero value, it means ingest is being paused because the standing query result queue is filling up faster than results can be processed and delivered.

Results flow from the queue through the output query and then to the destination. The most common cause of backpressure is output queries that need optimization. If the output query performs expensive operations (such as additional graph traversals or lookups), it can become a bottleneck. The second most common cause is destination performance, including slow network connections, rate-limited APIs, or destinations at capacity.
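A quick read of the valve gauge confirms whether this is happening. The sketch below checks shared.valve.ingest in a JSON metrics snapshot; the endpoint URL and response shape are the same assumptions used in the ingest-rate sketch above.

```python
# Sketch: flag standing query output backpressure from the valve gauge.
# The metrics endpoint and JSON layout are assumptions about your deployment.
import json
import urllib.request

METRICS_URL = "http://localhost:8080/api/v1/admin/metrics"  # assumed endpoint

with urllib.request.urlopen(METRICS_URL) as resp:
    gauges = json.load(resp).get("gauges", {})

valve = gauges.get("shared.valve.ingest", {}).get("value", 0)
if valve != 0:
    print(f"Ingest is being paused (valve = {valve}): review output queries "
          "and destination throughput.")
else:
    print("No standing query backpressure on ingest.")
```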

Persistor Latency

Metrics: persistor.{query-type} timers

These metrics track how long persistence operations take. Key statistics to watch:

  • avg (average): If high across all query types, indicates a general persistor bottleneck
  • p95 (95th percentile): If p95 is high but avg is low, indicates occasional problematic queries, often caused by supernodes
  • p99 (99th percentile): Can reveal rare but severe issues that p95 misses

When persistor latency is the bottleneck, the cause is typically either I/O bound (disk throughput) or compute bound (CPU/memory on the persistor hosts).
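One way to apply these rules is a small classifier over a timer's statistics, as in the sketch below. The field names and units vary by metrics reporter, and the 10 ms / 50 ms thresholds are the rough guides from the Quick Reference table at the end of this page, not hard limits.

```python
# Sketch: classify a persistor timer's statistics into the cases described above.
# Field names ("mean", "p95") and millisecond units are assumptions; some
# reporters use different keys or report durations in nanoseconds.
def classify_persistor_timer(timer: dict) -> str:
    avg_ms = timer["mean"]
    p95_ms = timer["p95"]
    if avg_ms > 50:
        return "high average latency: general persistor bottleneck (I/O or compute)"
    if avg_ms < 10 and p95_ms > 5 * avg_ms:
        return "normal average but high p95: occasional slow operations, check for supernodes"
    return "persistor latency looks healthy"

# Example values only: a low average with a large p95 points at supernodes.
print(classify_persistor_timer({"mean": 4.0, "p95": 60.0}))
```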

Edge Count Histogram

Metrics: node.edge-counts.{bucket}

High counts in the larger buckets (2048-16383 or 16384-infinity) indicate the presence of supernodes. Supernodes are not inherently problematic, but they can cause performance issues in queries that traverse them.
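To turn the histogram into a single number, you can sum the highest buckets, as in this sketch. The bucket metric names follow the node.edge-counts.{bucket} pattern above, but the exact labels are assumptions based on the ranges mentioned in this guide.

```python
# Sketch: count nodes falling into the largest edge-count buckets.
# Bucket names are assumptions derived from the ranges quoted above.
HIGH_BUCKETS = [
    "node.edge-counts.2048-16383",
    "node.edge-counts.16384-infinity",
]

def potential_supernodes(counters: dict) -> int:
    return sum(counters.get(name, {}).get("count", 0) for name in HIGH_BUCKETS)

# Example counter snapshot, not real data.
counters = {"node.edge-counts.16384-infinity": {"count": 12}}
print(f"Nodes in high edge-count buckets: {potential_supernodes(counters)}")
```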

Resource Utilization

Resource metrics (CPU, memory, and network) must be measured outside of Quine. See Operational Considerations for detailed guidance on resource planning. In general:

  • High CPU utilization is normal and indicates good resource usage
  • Low CPU utilization with low ingest rates suggests the bottleneck is elsewhere
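An external probe on a Quine host might look like the sketch below. It uses the third-party psutil package; the 30% figure is an arbitrary illustration of "low CPU", not a Quine threshold.

```python
# Sketch: sample host CPU and memory outside of Quine to spot the
# "low CPU, low ingest" pattern. Requires the third-party psutil package;
# the 30% cutoff is an illustrative placeholder, not a Quine-defined limit.
import psutil

cpu = psutil.cpu_percent(interval=5)   # average CPU % over a 5-second window
mem = psutil.virtual_memory().percent  # memory in use, percent

print(f"cpu={cpu:.0f}% mem={mem:.0f}%")
if cpu < 30:
    print("CPU is mostly idle: if ingest is also low, look at standing query "
          "outputs, the persistor, or the network rather than Quine's CPU.")
```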

Identifying the Bottleneck

Step 1: Check Standing Query Backpressure

Start by checking shared.valve.ingest. If this metric shows non-zero values, standing query outputs are causing backpressure on ingest.

Next steps: First, review output query complexity and optimize any expensive operations. Second, check output destination throughput and capacity.

Step 2: Check Persistor Latency

If standing query backpressure is not the issue, check persistor latency metrics.

High average latency across all operations: The persistor is generally overloaded. Consider:

  • Adding persistor resources
  • Reviewing persistor configuration (journaling, snapshot settings)
  • Checking persistor host disk I/O and CPU

High p95/p99 but normal average: Occasional operations are slow, often due to supernodes. Check the edge count histogram for confirmation.

Step 3: Check Resource Utilization

If neither standing queries nor the persistor appear to be the bottleneck:

  • Low CPU on Quine hosts: Check network throughput (see the sketch after this list).
  • High CPU on Quine hosts: Review ingest query efficiency. Queries that don't anchor by ID cause expensive all-node scans. See Troubleshooting Queries for detailed query debugging techniques.
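A simple way to check network throughput externally is to sample the host's interface counters, as in the sketch below (again using the third-party psutil package). Compare the measured rate against your NIC's actual capacity; the code only reports numbers and makes no assumption about what counts as saturated.

```python
# Sketch: estimate host network throughput over a 5-second window using
# interface byte counters. Uses the third-party psutil package.
import time
import psutil

before = psutil.net_io_counters()
time.sleep(5)
after = psutil.net_io_counters()

rx_mbps = (after.bytes_recv - before.bytes_recv) * 8 / 5 / 1_000_000
tx_mbps = (after.bytes_sent - before.bytes_sent) * 8 / 5 / 1_000_000
print(f"receive {rx_mbps:.0f} Mbit/s, transmit {tx_mbps:.0f} Mbit/s")
```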

Step 4: Check for Supernodes

If the edge count histogram shows significant counts in the high buckets, supernodes may be impacting performance. Supernodes affect:

  • Query performance when traversing edges
  • Persistor performance when reading/writing node state
  • Memory usage for caching node state

Quine Enterprise

Quine Enterprise includes supernode mitigation capabilities for production deployments. Compare editions.

Quick Reference

| Metric | Normal | Indicates Problem |
|---|---|---|
| shared.valve.ingest | 0 | Non-zero values indicate standing query output backpressure |
| persistor.*.avg | < 10 ms | > 50 ms suggests a persistor bottleneck |
| persistor.*.p95 | Similar to avg | Much higher than avg suggests supernodes |
| node.edge-counts.16384-infinity | 0 or low | High values indicate supernodes |
| standing-queries.dropped.{name} | 0 | Non-zero values mean results are being lost |
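The checks in this table can also be applied programmatically to a metrics snapshot. The sketch below is one way to do that; the metrics endpoint, JSON layout, metric names, and units are assumptions about your deployment, and the thresholds are the rough guides above rather than hard limits.

```python
# Sketch: apply the quick-reference checks to a JSON metrics snapshot.
# Endpoint URL, JSON layout, metric names, and units are assumptions; adjust
# them to match how your deployment reports metrics.
import json
import urllib.request

METRICS_URL = "http://localhost:8080/api/v1/admin/metrics"  # assumed endpoint

with urllib.request.urlopen(METRICS_URL) as resp:
    metrics = json.load(resp)

gauges = metrics.get("gauges", {})
counters = metrics.get("counters", {})
timers = metrics.get("timers", {})

findings = []

if gauges.get("shared.valve.ingest", {}).get("value", 0) != 0:
    findings.append("Standing query output backpressure is pausing ingest.")

for name, timer in timers.items():
    if name.startswith("persistor."):
        avg, p95 = timer.get("mean", 0), timer.get("p95", 0)
        if avg > 50:
            findings.append(f"{name}: average {avg:.0f} ms suggests a persistor bottleneck.")
        elif avg and p95 > 5 * avg:
            findings.append(f"{name}: p95 {p95:.0f} ms far above average, check for supernodes.")

if counters.get("node.edge-counts.16384-infinity", {}).get("count", 0) > 0:
    findings.append("Nodes with 16384+ edges present: supernodes may affect performance.")

for name, counter in counters.items():
    if name.startswith("standing-queries.dropped.") and counter.get("count", 0) > 0:
        findings.append(f"{name}: standing query results are being dropped.")

print("\n".join(findings) or "No problems flagged by the quick-reference checks.")
```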