Hi @idogross2, in the short term I can offer some guidance on querying with the SDK and with Trino. I prefer the SDK, but the guidance (and the open questions) are the same either way. In order:
- Basic querying at scale I like to do with the SDK, but you do need to make sure your session includes several IP addresses, preferably all of the addresses in the VIP pool unless that list is unmanageable. Each address manifests as a “split,” which gives your searches parallelism. The only built-in data optimization in the catalog is the search_path field, which leverages skip-markers to narrow down searches. If search_path is one of your predicates, the query should be restricted to that subtree, avoiding a full scan.
- There’s a query library attached that might help with this (for the Trino path). Grouping by directory prefixes can be a little ugly, but it’s doable. Honestly, your favorite coding model might be more useful than the library. Per item 1 above, search_path is the only optimized column that can be leveraged to speed up parts of your query.
- I’ll need to check on rate limiting. Coarse-grained VIP pools can, of course, be used to direct search traffic to specific CNodes, but I’ll check for more sophisticated approaches.
- There is a capacity estimation API call that can be used for some types of assessment (Capacity Analysis with vastpy-cli). It has the benefit of accounting for deduplication, compression, etc., but it lacks fine-grained breakdowns (by user, file extension, metadata, or anything else in the catalog).
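To make the fan-out idea in the first item concrete, here is a minimal sketch of spreading one search across several VIP addresses so each one acts as a split. The `search_catalog` function, the addresses, and the result shape are all placeholders; substitute your actual SDK session call.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-endpoint search function; stands in for whatever SDK
# call you use to run one slice ("split") of the search.
def search_catalog(endpoint: str, search_path: str) -> list[str]:
    # A real implementation would query `endpoint`, restricted to the
    # search_path subtree to avoid a full catalog scan.
    return [f"{search_path}/result-from-{endpoint}"]

# Preferably every address in the VIP pool; each becomes one split.
vip_pool = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

def parallel_search(search_path: str) -> list[str]:
    # Fan the same predicated search out to every address and merge results.
    with ThreadPoolExecutor(max_workers=len(vip_pool)) as pool:
        futures = [pool.submit(search_catalog, ep, search_path) for ep in vip_pool]
        results: list[str] = []
        for f in futures:
            results.extend(f.result())
    return results

hits = parallel_search("/datasets/project-a")
print(len(hits))  # → 3 (one stub result per VIP address)
```

The point is only the shape: one session, many addresses, each contributing parallelism, with search_path in every predicate.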
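For the Trino path, a group-by on directory prefixes usually means extracting the first N path components as the group key. A sketch, with the SQL built in Python so the prefix depth is parameterized; the table name (`vast_big_catalog_table`), column names (`parent_path`, `size`), and the exact semantics of the `search_path` predicate are assumptions to check against your deployment. Interpolating the subtree directly is for illustration only; use query parameters for untrusted input.

```python
# Build a Trino query that rolls objects up by directory prefix under a
# subtree. Table and column names below are placeholders, not confirmed API.
def prefix_rollup_sql(subtree: str, depth: int = 2) -> str:
    # regexp_extract pulls the first `depth` path components as the group key,
    # e.g. depth=2 turns /a/b/c/file into /a/b.
    pattern = "^(" + "/[^/]+" * depth + ")"
    return (
        "SELECT regexp_extract(parent_path, '" + pattern + "') AS prefix,\n"
        "       count(*) AS n_objects,\n"
        "       sum(size) AS total_bytes\n"
        "FROM vast_big_catalog_table\n"
        f"WHERE search_path = '{subtree}'\n"  # search_path predicate narrows via skip-markers
        "GROUP BY 1\n"
        "ORDER BY total_bytes DESC"
    )

print(prefix_rollup_sql("/datasets/project-a"))
```

Run the resulting string through your Trino client as usual; the only load-bearing parts are the search_path predicate and grouping on the extracted prefix.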
All of this said, querying very large catalogs (billions of objects) can be taxing and take time. It’s also possible that we need to make some performance improvements on our end. If you can share the rough object count of your namespace, an example query, and how long it took, I can make some assessments.
Also, I maintain an internal tool (not directly supported) that works via the SDK; it simplifies searches and does its best to optimize them. I can supply it if you’re interested.