I’m building a self-service tool that lets users see usage data for their own bucket (total size, file counts, per-subdirectory breakdown). I’ve tried querying the VAST Catalog a few different ways (vast-db-cli, Trino, and the Python SDK), and on our ~4PB cluster I’m struggling to do this safely and quickly regardless of the client.
What I’m seeing:
Simple bucket-scoped queries are slow.
Per-subdirectory aggregation (e.g. size of each immediate child of a path) is much worse.
One query actually saturated cluster bandwidth and starved other workloads, so “just run it” isn’t an option.
What I’d love guidance on:
What’s the right way to query the Catalog at this scale? Are there recommended filter/projection patterns, partition keys, or predicates that keep queries from turning into full scans?
Per-subtree aggregation: is there an efficient pattern for “size grouped by immediate child directory,” or is that inherently expensive on the Catalog?
Safety: can Catalog queries be rate-limited, capped, or isolated so a bad query can’t impact the data path?
Is the Catalog even the right tool for live, interactive, per-user usage queries, or should I be looking at quotas / folder stats / a periodic snapshot into an external store instead?
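For concreteness, the per-subdirectory breakdown I’ve been running looks roughly like the sketch below (via the trino Python client; the table name is a placeholder for however the Catalog is exposed in our Trino setup, and `breakdown_sql`/`run` are just helpers of mine):

```python
def breakdown_sql(base: str) -> str:
    """Naive Trino query: size and file count per immediate child of `base`.
    This is the shape of query that ran away on our ~4PB cluster."""
    base = base.rstrip("/")
    return f"""
        SELECT regexp_extract(parent_path, '^{base}/([^/]+)', 1) AS child,
               SUM(size)  AS total_bytes,
               COUNT(*)   AS files
        FROM vast_catalog_table            -- placeholder table name
        WHERE parent_path LIKE '{base}/%'  -- no optimized predicate here
        GROUP BY 1
        ORDER BY total_bytes DESC
    """

def run(sql: str):
    """Execute via the trino client (connection details illustrative)."""
    import trino  # pip install trino; imported here so breakdown_sql stays standalone
    conn = trino.dbapi.connect(host="trino.example", port=8080, user="usage-tool")
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchall()
```

The `LIKE`-on-`parent_path` filter is the part I suspect is forcing a scan, but I don’t know what the right predicate is.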
Hi @idogross2, in the short term I can offer some guidance on querying with the SDK and Trino. I prefer the SDK, but the same guidance applies to both. Taking your questions in order:
For basic querying at scale I like the SDK, but you do need to make sure your session includes several IP addresses, preferably all of the addresses in the VIP pool unless that list is unmanageable. Each address manifests as a “split,” which gives your searches parallelism. The only built-in data optimization in the Catalog is the search_path field, which leverages skip-markers to narrow searches. If search_path is one of your predicates, it should restrict the query to that subtree and avoid a full scan.
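A minimal SDK sketch of a bucket-scoped sum along those lines. Hedged heavily: the `QueryConfig`/`tx.catalog()` usage and predicate style are from my memory of recent vastdb SDK versions, and the endpoints and credentials are placeholders, so verify each call against the SDK you have installed:

```python
def normalize_search_path(path: str) -> str:
    """search_path values are directory paths; normalize to a leading and
    trailing slash so '/mybucket/data' and 'mybucket/data/' compare equal."""
    return "/" + path.strip("/") + "/"

def query_bucket_usage(path: str, vip_addresses: list[str]) -> int:
    """Sum Catalog 'size' under one subtree. The vastdb calls below are a
    sketch: check QueryConfig, tx.catalog(), and predicate syntax against
    your SDK version before relying on them."""
    import vastdb  # pip install vastdb; imported here so the helper above stays standalone
    from vastdb.config import QueryConfig

    session = vastdb.connect(
        endpoint=f"http://{vip_addresses[0]}",
        access="ACCESS_KEY", secret="SECRET_KEY",  # placeholder credentials
    )
    # More endpoints -> more splits -> more parallelism, per the advice above.
    config = QueryConfig(data_endpoints=[f"http://{ip}" for ip in vip_addresses])

    total = 0
    with session.transaction() as tx:
        table = tx.catalog()  # the Catalog exposed as a queryable table
        reader = table.select(
            columns=["size"],
            # equality vs. prefix semantics of search_path: confirm for your release
            predicate=table["search_path"] == normalize_search_path(path),
            config=config,
        )
        for batch in reader:
            total += sum(batch.column("size").to_pylist())
    return total
```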
There’s a query library attached that might help with this (for the Trino path). Grouping by directory prefixes can be a little ugly, but it’s doable. Honestly, your favorite coding model might be more useful than the library. Per item #1 above, search_path is the only optimized column that can be leveraged to speed up parts of your query.
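One workable pattern for “size grouped by immediate child directory” (a sketch of my own, not taken from the attached library): restrict the query to the subtree, pull `parent_path` and `size`, and bucket each row by its immediate child component client-side, or push the same logic into Trino with `regexp_extract`. Column names here assume the usual Catalog schema (`parent_path`, `size`, `search_path`); verify against yours.

```python
from collections import defaultdict

def immediate_child(parent_path: str, base: str) -> str:
    """Map a row's parent_path to the immediate child of `base` it falls
    under. Rows living directly in `base` map to '.'."""
    base = base.rstrip("/") + "/"
    rest = parent_path.rstrip("/") + "/"
    if not rest.startswith(base):
        raise ValueError(f"{parent_path!r} is not under {base!r}")
    remainder = rest[len(base):]
    return remainder.split("/", 1)[0] or "."

def sizes_by_child(rows, base):
    """rows: iterable of (parent_path, size) pairs, e.g. from a Catalog
    query already filtered to the subtree. Returns {child_dir: total_bytes}."""
    totals = defaultdict(int)
    for parent_path, size in rows:
        totals[immediate_child(parent_path, base)] += size
    return dict(totals)

# The equivalent pushed down to Trino (table name illustrative):
#   SELECT regexp_extract(parent_path, '^/mybucket/data/([^/]+)', 1) AS child,
#          SUM(size) AS total_bytes
#   FROM vast_catalog_table
#   WHERE search_path = '/mybucket/data/'   -- keeps it off a full scan
#   GROUP BY 1
```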
I’ll need to check on rate limiting. As a coarse-grained control, dedicated VIP pools can be used to direct search traffic to specific CNodes, keeping it away from the data path, but I’ll check for more sophisticated approaches.
There is a capacity estimation API call that can be used for some types of assessment (Capacity Analysis with vastpy-cli). It has the benefit of accounting for deduplication, compression, etc., but it lacks fine-grained breakdowns (by user, file extension, or any other Catalog metadata).
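If that coarse view is enough for your tool, a sketch of calling it through vastpy. Hedged: the `client.capacity.get(path=...)` resource name and its parameters are my assumption from the linked write-up (vastpy builds REST calls dynamically), so confirm against your VMS API docs; credentials are placeholders.

```python
def human_bytes(n: int) -> str:
    """Render a byte count for a self-service UI."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    value = float(n)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024

def subtree_capacity(path: str):
    """Coarse capacity for one subtree via the VMS REST API.
    Resource name and response fields are assumptions; verify them."""
    from vastpy import VASTClient  # pip install vastpy; imported here so human_bytes stays standalone
    client = VASTClient(user="admin", password="PASSWORD", address="vms.example")
    # Accounts for deduplication/compression, unlike raw Catalog sums.
    return client.capacity.get(path=path)
```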
All of this said, querying very large catalogs (billions of objects) can be taxing and take time, and it’s also possible that we need to make some performance improvements on our side. If you can let me know the rough object count of your namespace, an example query, and how long it took, I can make some assessments.
Also, I maintain an internal tool (not directly supported), built on the SDK, that simplifies searches and does its best to optimize them. I can supply it if you’re interested.