Show HN: VeilStream – prod-like data without the PII
The use cases we're trying to solve are:
- Use production-like data in development environments
- Improve incident handling by masking all data that is not relevant
- Share a subset of your data
- Protect data being shipped into a data lake
- Expose safe data in internal tooling, metrics, or BI dashboards
- Empower non-technical staff to vibe-code against sanitized data
# How it fits in your stack
- Role-based policies: define masking rules in our web dashboard.
- The proxy picks up the configuration and starts applying rules automatically.
## You host it
- It's a Docker container configured with two environment variables: an API key and the database connection URI.
## We host it
- Drop-in proxy: no code changes. Point your connection string at a new endpoint and that's it.
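
To make "point your connection string at a new endpoint" concrete, here's a minimal sketch of the swap. The helper name `point_at_proxy` and the host `proxy.example.com` are made up for illustration, and it assumes credentials, database name, and query parameters stay the same, which may not match what the hosted setup actually requires.

```python
from urllib.parse import urlsplit, urlunsplit


def point_at_proxy(db_uri: str, proxy_host: str, proxy_port: int) -> str:
    """Return db_uri with only the host and port swapped for the proxy's.

    Hypothetical helper: assumes the proxy accepts the same credentials,
    database name, and query parameters as the original database.
    """
    parts = urlsplit(db_uri)
    # Keep "user:password" if present; rpartition yields "" when there is no "@".
    userinfo, _, _ = parts.netloc.rpartition("@")
    host_port = f"{proxy_host}:{proxy_port}"
    netloc = f"{userinfo}@{host_port}" if userinfo else host_port
    return urlunsplit(parts._replace(netloc=netloc))


original = "postgresql://app:secret@db.internal:5432/shop?sslmode=require"
print(point_at_proxy(original, "proxy.example.com", 5432))
```

Everything except the host and port is left untouched, which is the sense in which the application code doesn't change.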
# How it works (and how fast it is)
The proxy restructures the query AST based on the config. AST rewrites depend on the text/structure of the query, not on how many rows the database eventually returns, so the rewrite cost is effectively O(1) with respect to result size.
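
A toy sketch of the idea, not the actual proxy code: the `Select`/`Column` types, the `RULES` mapping, and the `md5` replacement are all assumptions for illustration. The point is that the rewrite loops over the columns in the query, never over rows.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Column:
    name: str                   # output column name
    expr: Optional[str] = None  # SQL expression; None means the bare column


@dataclass(frozen=True)
class Select:
    columns: Tuple[Column, ...]
    table: str


# Hypothetical masking rules: column name -> SQL expression template.
RULES: Dict[str, str] = {
    "email": "md5({col})",
    "ssn": "'***-**-****'",
}


def rewrite(query: Select, rules: Dict[str, str]) -> Select:
    """Replace masked columns with their masking expression.

    Work done is proportional to the number of columns in the query,
    independent of how many rows the database later returns.
    """
    new_cols = tuple(
        Column(c.name, rules[c.name].format(col=c.name)) if c.name in rules else c
        for c in query.columns
    )
    return Select(new_cols, query.table)


def to_sql(query: Select) -> str:
    """Render the toy AST back to SQL text."""
    cols = ", ".join(
        f"{c.expr} AS {c.name}" if c.expr else c.name for c in query.columns
    )
    return f"SELECT {cols} FROM {query.table}"


query = Select((Column("id"), Column("email"), Column("ssn")), "users")
print(to_sql(rewrite(query, RULES)))
```

A real implementation would work on a full SQL parse tree; this only covers a flat column list to show why the rewrite cost tracks query size.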
# Status & feedback wanted
VeilStream is GA, but billing isn't switched on yet, so it's currently free at all tiers. We'd love your thoughts on:
- Throughput / latency in real workloads
- Filter rules & DevX
- Weird edge-case queries (pg_dump, logical replication, etc.)
I’ll be around all day to answer questions and dig into issues.
# Tagline
Ship features with data you can trust and privacy you don't have to worry about.
How do you handle connection pooling? Does this interfere with pgbouncer or similar tools?
Also, does this work with all PostgreSQL extensions (PostGIS, timescaledb, etc.)?
We do not do connection pooling yet. Currently it's a fresh connection per query (which adds a bit of latency). We intend to add basic connection pooling shortly after launch. That said, if you put it in front of pgbouncer, that would work well.
PostGIS and other extensions are on the radar, but are not fully supported yet: the proxy works with the extensions, but can't mask their data yet. If we get requests for specific extensions to be fully supported, we'll implement them (same with extra masking data types). I look forward to the GIS data implementation, as I've met one of the PostGIS contributors and have discussed several of those masking complexities.
- UUIDs: no, but I should. Adding to my list :)
- IP addresses: yes, IPv4 and IPv6, but I want to go further and let you configure the replacement IPs to be within specified CIDR blocks
- Arrays: again, not yet. Do you mind if I ask the use case? Array data is commonly modeled as separate rows with foreign keys/lookups, which we can do.
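
For the CIDR idea in the IP address bullet, a rough sketch of what "replacement IPs within a specified block" could look like. The function name `mask_ip`, the salt, and the choice of a hash-based deterministic mapping are my assumptions, not how the product does or will do it.

```python
import hashlib
import ipaddress


def mask_ip(original: str, cidr: str, salt: bytes = b"demo-salt") -> str:
    """Map an IP address to a stable replacement inside the given CIDR block.

    Hypothetical sketch: the same input always yields the same output, so
    joins and group-bys on the masked column still line up. Different
    inputs can collide, especially in small blocks.
    """
    network = ipaddress.ip_network(cidr)
    addr = ipaddress.ip_address(original)
    if addr.version != network.version:
        raise ValueError("address and CIDR block must be the same IP version")
    # Hash the packed address and fold it into the block's address range.
    digest = hashlib.sha256(salt + addr.packed).digest()
    offset = int.from_bytes(digest, "big") % network.num_addresses
    return str(network[offset])


print(mask_ip("203.0.113.7", "10.0.0.0/24"))
print(mask_ip("2001:db8::1", "fd00::/64"))
```

A salted hash on its own is not strong anonymization for a space as small as IPv4, so treat this as a shape for the feature rather than a privacy guarantee.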
We've internally got the path for adding new filter types (dashboard configuration, API layer storage, and proxy rule implementation) pretty optimized. It takes us a day or two to add simple requested filters, longer for more complex ones.
We were considering allowing users to inject stored procedures themselves and then use those, but currently we're opting to implement them ourselves so we have better control over the user experience. In the future, for very custom stored procedures, I think we may allow the custom path.
A future improvement to that: currently the conditions are all ANDed together, and I'd like to support more types of boolean logic. :)
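
To illustrate the AND-only behavior, here's a small sketch that treats each condition as a predicate over some request context. The context keys (`role`, `table`), the helper names, and the `any_of` combinator are all hypothetical; `any_of` just shows one way richer boolean logic could compose later.

```python
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]
Condition = Callable[[Context], bool]


def is_role(role: str) -> Condition:
    """Condition that holds when the request's role matches."""
    return lambda ctx: ctx.get("role") == role


def on_table(table: str) -> Condition:
    """Condition that holds when the query targets the given table."""
    return lambda ctx: ctx.get("table") == table


def rule_applies(conditions: List[Condition], ctx: Context) -> bool:
    # AND semantics: every condition must hold. An empty list always matches.
    return all(cond(ctx) for cond in conditions)


def any_of(*conditions: Condition) -> Condition:
    # Hypothetical OR combinator; it nests inside the ANDed list.
    return lambda ctx: any(cond(ctx) for cond in conditions)


conds = [is_role("support"), on_table("users")]
print(rule_applies(conds, {"role": "support", "table": "users"}))
print(rule_applies(conds, {"role": "support", "table": "orders"}))
```

Because combinators return ordinary conditions, OR and NOT could be added without changing how the top-level list is evaluated.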