Discussion
thunderbong: This is extremely valuable. Every time we get a problem which we are not able to reproduce, usually an extreme edge case, we end up getting our entire production DB replicated to get to the error. I'll surely try this. Thanks for posting it here.
patpatpat: I made one of these, however I still have to solve the PII issues and convince the data custodians that it's safe to use.
nickzelei: Cool project. I haven’t looked at the code too much (yet). I’d be curious to know how you’re handling some of the more wiry edge cases when it comes to following foreign key constraints. Things like circular dependencies come to mind, as well as complex joins. I feel ok posting this because it’s archived, but this problem is basically what we designed for with Neosync [1]. It was probably the hardest feature to fully solve for the customers that needed it the most, which were the ones with the most complex data sets and foreign key dependencies. To the point where it was almost impossible to do this, at least when syncing it directly to another Postgres database with everything intact. Meaning that if on the other side you want another pg database that has all of the same constraints, it is difficult to ensure you got the full sliced dataset. At least the way we were thinking about it. [1]: https://github.com/nucleuscloud/neosync
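To make the circular-dependency concern concrete, here is a minimal sketch of following foreign keys outward from a seed row without looping forever. It is not how dbslice or Neosync actually do it; the `collect_slice` function, the `references` mapping, and the table names are all hypothetical, and a real tool would query the database instead of an in-memory dict.

```python
from collections import deque

def collect_slice(seeds, references):
    """Return every row reachable from the seed rows via foreign keys.

    Rows are (table, primary_key) tuples. `references` maps a row to the
    rows it points at through its foreign keys (hypothetical in-memory
    stand-in for real database lookups).
    """
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        row = queue.popleft()
        for parent in references.get(row, ()):
            # The `seen` check is what makes cycles terminate:
            # a row is queued at most once.
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

# users.team_id -> teams, teams.owner_id -> users: a circular dependency.
references = {
    ("orders", 10): [("users", 1)],
    ("users", 1): [("teams", 7)],
    ("teams", 7): [("users", 1), ("users", 2)],
    ("users", 2): [("teams", 7)],
}

slice_rows = collect_slice([("orders", 10)], references)
```

Finding the rows is the easy half. Loading them into a target database that enforces the same constraints still needs an insert order, and for a cycle like the one above that typically means deferring constraints or inserting with a NULL foreign key and updating afterwards.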
semiquaver: I built tools like this at several startups to copy production customer data onto a dev instance for the purpose of bug reproduction. When I moved to big tech the rules against doing this were honestly one of the biggest drivers of reduced velocity I encountered. Many, many bugs and customer issues are very data dependent and can’t easily be reproduced without access to the actual customer data. Obviously I get why the rules against data access like that exist but I don’t think it’s appreciated how much it slows down the real world progress of fixing customer-reported issues.
nabroleonx: it is understandable to a certain degree and it is entirely dependent on your company policy. however, with the dbslice browser UI, you can audit every column, make sure nothing falls through the cracks, and get a signed-off config. once you do that you can just use that yaml file to do as many extractions as you need. the compliance profile + UI will be in the next release
nabroleonx: this is the main reason i started doing the compliance profiles. you can choose a compliance profile like HIPAA and dbslice auto-applies masking rules, scans the output for residual PII, and generates an audit manifest your data custodian can review. if you want to see what is being masked before anything runs, i also have a browser UI where you can review every table and column, see which fields each compliance profile covers, adjust mappings as much as you want (select which columns to anonymize and how), and export the config. both will be out in the next release.
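For readers wondering what a profile of per-column masking rules might look like in practice, here is a rough sketch. It is not dbslice's config format or API, which is unreleased at the time of this thread; the `PROFILES` mapping, the `hipaa_like` name, and the `hash` and `redact` rule names are all made up for illustration.

```python
import hashlib

# Hypothetical profile: table -> column -> masking rule.
PROFILES = {
    "hipaa_like": {
        "users": {"email": "hash", "ssn": "redact"},
    },
}

def mask_value(value, rule):
    """Apply one masking rule to one value; NULLs pass through unchanged."""
    if value is None:
        return None
    if rule == "redact":
        return "REDACTED"
    if rule == "hash":
        # Deterministic, so the same input always masks to the same token
        # and joins on the masked column still line up.
        return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]
    raise ValueError(f"unknown masking rule: {rule}")

def mask_row(table, row, profile):
    """Return a copy of `row` with the profile's rules applied."""
    rules = PROFILES[profile].get(table, {})
    return {
        col: mask_value(val, rules[col]) if col in rules else val
        for col, val in row.items()
    }

masked = mask_row(
    "users",
    {"id": 1, "email": "a@example.com", "ssn": "123-45-6789"},
    "hipaa_like",
)
```

A plain truncated hash like this is only pseudonymization: low-entropy values can be recovered by brute force, so a real profile would likely want a keyed hash or generated fake values, which is the kind of choice a data custodian would want to review.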
patpatpat: Sounds fantastic.