Data Cleaning Toolkit
Privacy-first CSV cleaning with transparent algorithms, built in C++.
status
structured csv: working
semi-structured: coming soon
unstructured: coming soon
what it does
this tool helps you clean csv files quickly and repeatably. it is aimed at people who want results they can understand and trust, without sending data to third-party cloud services.
- detect missing values
- detect and remove duplicate rows
- standardise common null values (for example, "n/a", "null")
- download a cleaned csv
- run offline-first using webassembly when possible
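
as a rough illustration of the missing-value and null-standardisation steps listed above, here is a minimal c++ sketch. it is not the toolkit's actual code: the names `Row`, `normalise_token`, `is_null_token`, `count_missing` and `standardise_nulls` are hypothetical, the token list only contains the two examples mentioned above plus the empty string, and rows are assumed to be already parsed into cells.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Assumption: a CSV row that has already been parsed into cells.
using Row = std::vector<std::string>;

// Trim ASCII whitespace and lowercase a cell, for comparison only.
std::string normalise_token(const std::string& cell) {
    auto is_space = [](unsigned char c) { return std::isspace(c) != 0; };
    auto first = std::find_if_not(cell.begin(), cell.end(), is_space);
    auto last = std::find_if_not(cell.rbegin(), cell.rend(), is_space).base();
    if (first >= last) return "";  // empty or whitespace-only cell
    std::string out(first, last);
    std::transform(out.begin(), out.end(), out.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return out;
}

// Hypothetical token list: only the examples from the feature list.
bool is_null_token(const std::string& cell) {
    static const std::unordered_set<std::string> tokens = {"", "n/a", "null"};
    return tokens.count(normalise_token(cell)) > 0;
}

// Count cells that are empty or hold a null-like token.
std::size_t count_missing(const std::vector<Row>& rows) {
    std::size_t missing = 0;
    for (const Row& row : rows)
        for (const std::string& cell : row)
            if (is_null_token(cell)) ++missing;
    return missing;
}

// Rewrite every null-like cell to an empty string.
// Returns how many cells were actually changed.
std::size_t standardise_nulls(std::vector<Row>& rows) {
    std::size_t changed = 0;
    for (Row& row : rows) {
        for (std::string& cell : row) {
            if (is_null_token(cell) && !cell.empty()) {
                cell.clear();
                ++changed;
            }
        }
    }
    return changed;
}
```

calling `count_missing` before and after `standardise_nulls` gives the same number, because standardising only changes how a missing value is written, not whether it is missing.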
why i built it
i built this because many cleaning tools are either slow and manual (lots of clicking) or they require coding. i also wanted a tool that is honest about what it changes, and one that can run offline for privacy.
this is part of my honours work on transparent, auditable data cleaning and its impact on downstream machine learning fairness.
how transparency is achieved
i try to make every cleaning step clear and repeatable.
- open source: the algorithms are in the public github repo
- deterministic steps: same input should give the same output
- simple logic: no hidden "black box" cleaning rules
- traceable changes: the goal is to show what changed and how many values were affected
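
to make "deterministic" and "traceable" a bit more concrete, here is a minimal c++ sketch of duplicate removal that always keeps the first occurrence, preserves row order, and reports exactly which rows were dropped. this is an illustration under assumptions, not the code in the repo: `Row`, `DedupeReport` and `remove_duplicate_rows` are hypothetical names, and duplicates are assumed to mean exact cell-by-cell matches.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Assumption: a CSV row that has already been parsed into cells.
using Row = std::vector<std::string>;

// Hypothetical report type: enough to show what a step changed.
struct DedupeReport {
    std::size_t rows_in = 0;
    std::size_t rows_out = 0;
    std::vector<std::size_t> removed_indices;  // 0-based positions in the input
};

// Remove exact duplicate rows in place, keeping the first occurrence.
// The result depends only on the input order, so the same input
// gives the same output and the same report.
DedupeReport remove_duplicate_rows(std::vector<Row>& rows) {
    DedupeReport report;
    report.rows_in = rows.size();

    std::set<Row> seen;  // vectors of strings compare lexicographically
    std::vector<Row> kept;
    kept.reserve(rows.size());

    for (std::size_t i = 0; i < rows.size(); ++i) {
        if (seen.insert(rows[i]).second) {
            kept.push_back(std::move(rows[i]));  // first time this row appears
        } else {
            report.removed_indices.push_back(i);  // duplicate of an earlier row
        }
    }

    rows = std::move(kept);
    report.rows_out = rows.size();
    return report;
}
```

returning the removed indices, rather than only a count, lets a user check each dropped row against the original file.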
privacy and offline
when the app runs in offline mode (webassembly), your csv stays on your device. when offline mode is not available, the app can fall back to api mode, which runs the same cleaning logic on the server, so your csv is sent off your device.
if you are working with sensitive data, use offline mode and double-check the mode indicator in the app.
feature requests and support
if you find a bug, or you have a good idea, please post it on github. i use it as a simple "featurebase" so ideas are public, searchable, and easy to track.
- bugs and feature requests: github issues
- questions and discussion: github discussions