Diagnosing Bottlenecks¶
Quine is a fully backpressured system. When one component can't keep up, it slows down upstream components rather than dropping data. This makes the system resilient, but it also means that a bottleneck in one area can manifest as slowness elsewhere.
This guide helps you identify where bottlenecks are occurring so you can focus optimization efforts effectively.
Common Symptoms¶
| Symptom | Possible Bottleneck |
|---|---|
| Low ingest rate despite available CPU | Standing queries or persistor |
| High CPU with low ingest rate | Inefficient queries or supernodes |
| Ingest rate drops periodically | Standing query backpressure |
| Standing query results delayed or dropped | Output destination or result queue overflow |
Key Metrics for Diagnosis¶
Ingest Rate¶
Metrics: ingest.{name}.count, ingest.{name}.bytes
The ingest rate shows how many records per second are being processed. Low ingest rates can have many causes, so use other metrics to narrow down the bottleneck.
Note: Ingest rate is reported as an exponentially weighted moving average, which can be volatile at the beginning and end of a stream. Allow at least 10 minutes for the rate to stabilize before drawing conclusions.
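If you are scripting around this metric, the sketch below shows one way to decide when the reported rate has settled enough to trust. The stabilization test, window size, tolerance, and sample values are illustrative assumptions, not part of Quine.

```python
def is_stable(samples, window=10, tolerance=0.05):
    """True once the last `window` rate samples vary by less than `tolerance`
    relative spread, i.e. the moving average has settled."""
    if len(samples) < window:
        return False
    recent = samples[-window:]
    lo, hi = min(recent), max(recent)
    return hi > 0 and (hi - lo) / hi <= tolerance

# Illustrative per-minute samples of the ingest.{name}.count rate (records/sec);
# the early values show the warm-up transient before the average settles.
rates = [120.0, 840.0, 2100.0, 4900.0, 5100.0, 5050.0, 5080.0,
         5060.0, 5075.0, 5070.0, 5065.0, 5072.0, 5068.0, 5071.0]
print(is_stable(rates))  # True: the last 10 samples agree within 5%
```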
Standing Query Backpressure Valve¶
Metric: shared.valve.ingest
This is a key diagnostic metric. When this gauge shows a non-zero value, it means ingest is being paused because the standing query result queue is filling up faster than results can be processed and delivered.
Results flow from the queue through the output query and then to the destination. The most common cause of backpressure is output queries that need optimization. If the output query performs expensive operations (such as additional graph traversals or lookups), it can become a bottleneck. The second most common cause is destination performance, including slow network connections, rate-limited APIs, or destinations at capacity.
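To automate this check, a minimal Python sketch follows. It assumes Quine's admin API serves the metric registry as JSON at GET /api/v1/admin/metrics on the host and port shown; the exact path and response shape can vary by version, so treat the fetch and parsing as assumptions and adjust to your deployment. The decision rule itself is the one stated above: any non-zero value of shared.valve.ingest means ingest is being paused by standing query output backpressure.

```python
import requests  # pip install requests

QUINE_URL = "http://localhost:8080"  # adjust to your deployment

def gauge_value(report, name):
    """Look up a gauge by name, tolerating either a dict keyed by metric name
    or a list of {"name": ..., "value": ...} objects in the JSON report."""
    gauges = report.get("gauges", [])
    if isinstance(gauges, dict):
        return gauges.get(name, {}).get("value")
    return next((g.get("value") for g in gauges if g.get("name") == name), None)

report = requests.get(f"{QUINE_URL}/api/v1/admin/metrics", timeout=5).json()
valve = gauge_value(report, "shared.valve.ingest") or 0

if valve:
    print(f"shared.valve.ingest = {valve}: ingest is paused -- review output"
          " query cost first, then destination throughput")
else:
    print("shared.valve.ingest = 0: standing query outputs are keeping up")
```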
Persistor Latency¶
Metrics: persistor.{query-type} timers
These metrics track how long persistence operations take. Key statistics to watch:
- avg (average): If high across all query types, indicates a general persistor bottleneck
- p95 (95th percentile): If p95 is high but avg is low, indicates occasional problematic queries, often caused by supernodes
- p99 (99th percentile): Can reveal rare but severe issues that p95 misses
When persistor latency is the bottleneck, the limiting resource is typically either I/O (disk throughput) or compute (CPU and memory on the persistor hosts).
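As a concrete illustration of how these statistics combine, here is a small Python sketch that applies the interpretation above to one timer's snapshot. The field names (mean, p95) and the 50 ms / 3x thresholds are assumptions chosen for the example, not official guidance.

```python
def classify_persistor_timer(stats):
    """Apply the avg/p95 interpretation to one persistor timer snapshot (ms)."""
    avg, p95 = stats["mean"], stats["p95"]
    if avg > 50:
        return "high average latency: general persistor bottleneck (check disk I/O, CPU, config)"
    if p95 > 3 * avg:
        return "high p95 with normal average: occasional slow operations, often supernodes"
    return "latency looks healthy"

# Illustrative snapshot: the average is fine but the tail is ~20x slower.
print(classify_persistor_timer({"mean": 4.0, "p95": 85.0}))
```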
Edge Count Histogram¶
Metrics: node.edge-counts.{bucket}
High counts in the larger buckets (2048-16383 or 16384-infinity) indicate the presence of supernodes. Supernodes are not inherently problematic, but they can cause performance issues in queries that traverse them.
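A quick way to act on this is to sum the high buckets and flag any non-zero total. The bucket names below follow the ranges mentioned above; confirm the exact metric names your Quine version emits.

```python
HIGH_BUCKETS = ("node.edge-counts.2048-16383", "node.edge-counts.16384-infinity")

def supernode_candidates(metrics):
    """Sum the counts in the two highest edge-count buckets."""
    return sum(int(metrics.get(bucket, 0)) for bucket in HIGH_BUCKETS)

# Illustrative histogram snapshot (counter name -> count).
snapshot = {
    "node.edge-counts.2048-16383": 41,
    "node.edge-counts.16384-infinity": 2,
}
n = supernode_candidates(snapshot)
print(f"{n} nodes in high edge-count buckets" + (" -- check for supernodes" if n else ""))
```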
Resource Utilization¶
Resource metrics (CPU, memory, network) are not reported by Quine and must be measured externally. See Operational Considerations for detailed guidance on resource planning. In general:
- High CPU utilization is normal and indicates good resource usage
- Low CPU utilization with low ingest rates suggests the bottleneck is elsewhere
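One way to combine these two observations is sketched below: pair an externally measured CPU figure (here via the psutil library, run on the Quine host) with the observed and expected ingest rates. The 50% threshold and the rates are illustrative assumptions.

```python
import psutil  # pip install psutil; run on the Quine host for a local reading

def interpret(cpu_percent, ingest_rate, expected_rate):
    """Cross-check externally measured CPU against the observed ingest rate."""
    if ingest_rate >= expected_rate:
        return "ingest is keeping up; high CPU on its own is not a problem"
    if cpu_percent < 50:
        return "low CPU and low ingest: the bottleneck is elsewhere (valve, persistor, network)"
    return "high CPU and low ingest: review ingest and standing query efficiency"

cpu = psutil.cpu_percent(interval=1)                             # external CPU measurement
print(interpret(cpu, ingest_rate=1200.0, expected_rate=5000.0))  # illustrative rates
```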
Identifying the Bottleneck¶
Step 1: Check Standing Query Backpressure¶
Start by checking shared.valve.ingest. If this metric shows non-zero values, standing query outputs are causing backpressure on ingest.
Next steps: First, review output query complexity and optimize any expensive operations. Second, check output destination throughput and capacity.
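To make "expensive operations" concrete, the sketch below contrasts two output queries, held as Python strings for comparison. The $that parameter, the data.id and meta.isPositiveMatch fields, and the CONNECTED_TO relationship are assumptions for illustration; the structural point is that the first query re-enters the graph and fans out on every standing query result, while the second only reshapes data the match already produced.

```python
# Expensive: re-enters the graph and performs a multi-hop traversal per result.
expensive_output = """
MATCH (n) WHERE id(n) = $that.data.id
MATCH (n)-[:CONNECTED_TO*1..3]->(m)
RETURN n, collect(m) AS neighborhood
"""

# Cheap: only reshapes data the standing query already produced.
cheap_output = """
RETURN $that.data.id AS id, $that.meta.isPositiveMatch AS matched
"""
```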
Step 2: Check Persistor Latency¶
If standing query backpressure is not the issue, check persistor latency metrics.
High average latency across all operations: The persistor is generally overloaded. Consider:
- Adding persistor resources
- Reviewing persistor configuration (journaling, snapshot settings)
- Checking persistor host disk I/O and CPU
High p95/p99 but normal average: Occasional operations are slow, often due to supernodes. Check the edge count histogram for confirmation.
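The same decision can be written as a small triage helper that looks across all persistor timers at once. The timer names, snapshot shape, and thresholds below are illustrative assumptions.

```python
def triage_persistor(timers):
    """Step 2 as code: distinguish general overload from supernode tail latency.
    `timers` maps persistor timer names to {"mean": ms, "p95": ms} snapshots."""
    slow_avg = [name for name, s in timers.items() if s["mean"] > 50]
    slow_tail = [name for name, s in timers.items() if s["p95"] > 3 * s["mean"]]
    if timers and len(slow_avg) == len(timers):
        return "high average latency on every operation: persistor overloaded"
    if slow_tail:
        return f"tail latency on {slow_tail}: check the edge-count histogram for supernodes"
    return "persistor latency within normal range"

# Illustrative timer names and values.
print(triage_persistor({
    "persistor.journal-read":  {"mean": 3.1, "p95": 6.0},
    "persistor.journal-write": {"mean": 4.2, "p95": 130.0},
}))
```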
Step 3: Check Resource Utilization¶
If neither standing queries nor the persistor appear to be the bottleneck:
- Low CPU on Quine hosts: Check network throughput.
- High CPU on Quine hosts: Review ingest query efficiency. Queries that don't anchor by ID cause expensive all-node scans. See Troubleshooting Queries for detailed query debugging techniques.
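To show the difference, the sketch below holds two ingest queries as Python strings: the first matches on a property and forces an all-node scan for every record, while the second anchors on an id computed with idFrom from the incoming record. $that (the deserialized record) and idFrom are Quine ingest conventions; the labels and field names here are made up for the example.

```python
# Unanchored: matching on a property makes Quine consider every node
# for each incoming record.
unanchored_ingest = """
MATCH (u) WHERE u.userId = $that.userId
SET u.lastSeen = $that.timestamp
"""

# Anchored: idFrom() derives a deterministic node id from the record,
# so the query goes straight to the right node.
anchored_ingest = """
MATCH (u) WHERE id(u) = idFrom('user', $that.userId)
SET u.userId = $that.userId, u.lastSeen = $that.timestamp
"""
```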
Step 4: Check for Supernodes¶
If the edge count histogram shows significant counts in the high buckets, supernodes may be impacting performance. Supernodes affect:
- Query performance when traversing edges
- Persistor performance when reading/writing node state
- Memory usage for caching node state
Quine Enterprise
Quine Enterprise includes supernode mitigation capabilities for production deployments. Compare editions.
Quick Reference¶
| Metric | Normal | Indicates Problem |
|---|---|---|
| shared.valve.ingest | 0 | Non-zero values indicate SQ output backpressure |
| persistor.*.avg | < 10ms | > 50ms suggests persistor bottleneck |
| persistor.*.p95 | Similar to avg | Much higher than avg suggests supernodes |
| node.edge-counts.16384-infinity | 0 or low | High values indicate supernodes |
| standing-queries.dropped.{name} | 0 | Non-zero means results are being lost |