Data Cleaning Toolkit

Privacy-first CSV cleaning with transparent algorithms, built in C++.

status

  • structured csv: working
  • semi-structured: coming soon
  • unstructured: coming soon


what it does

this tool helps you clean csv files quickly and repeatably. it is aimed at people who want results they can understand and trust, without sending data to random cloud services.

  • detect missing values
  • detect and remove duplicate rows
  • standardise common null values (for example, "n/a", "null")
  • download a cleaned csv
  • run offline-first using webassembly when possible
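
as a rough illustration, the first three steps in the list above could look like this in c++. this is a minimal sketch, not the toolkit's actual code: the `Row` and `Table` types, the function names, and the extra null spellings ("na", "none", and the empty cell) are assumptions made for the example. only "n/a" and "null" come from this readme.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory types for illustration; the real data model may differ.
using Row = std::vector<std::string>;
using Table = std::vector<Row>;

// Trim surrounding whitespace and lowercase, so " N/A " and "n/a" compare equal.
std::string normaliseToken(const std::string& cell) {
    const auto notSpace = [](unsigned char c) { return !std::isspace(c); };
    const auto first = std::find_if(cell.begin(), cell.end(), notSpace);
    const auto last = std::find_if(cell.rbegin(), cell.rend(), notSpace).base();
    std::string out = first < last ? std::string(first, last) : std::string();
    std::transform(out.begin(), out.end(), out.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return out;
}

// Assumed list of null spellings; only "n/a" and "null" are named in the README.
bool isNullToken(const std::string& cell) {
    static const std::set<std::string> nullTokens = {"", "n/a", "na", "null", "none"};
    return nullTokens.count(normaliseToken(cell)) > 0;
}

// Replace every null-like cell with an empty string.
// Returns the number of missing cells found (already-empty cells included).
std::size_t standardiseNulls(Table& table) {
    std::size_t missing = 0;
    for (Row& row : table) {
        for (std::string& cell : row) {
            if (isNullToken(cell)) {
                cell.clear();
                ++missing;
            }
        }
    }
    return missing;
}

// Remove exact duplicate rows, keeping the first occurrence and the original order.
// Returns the number of rows removed.
std::size_t removeDuplicateRows(Table& table) {
    std::set<Row> seen;
    Table kept;
    for (Row& row : table) {
        if (seen.insert(row).second) {
            kept.push_back(std::move(row));
        }
    }
    const std::size_t removed = table.size() - kept.size();
    table = std::move(kept);
    return removed;
}
```

running `standardiseNulls` before `removeDuplicateRows` means rows that differ only in how they spell a missing value (for example "N/A" and "null") are treated as duplicates. whether that ordering is right for your data is a design choice, not a rule.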

why i built it

i built this because many cleaning tools are either slow and manual (lots of clicking) or require coding. i also wanted a tool that is honest about what it changes and that can run offline for privacy.

this is part of my honours work on transparent, auditable data cleaning and its impact on downstream machine learning fairness.

how transparency is achieved

i try to make every cleaning step clear and repeatable.

  • open source: the algorithms are in the public github repo
  • deterministic steps: same input should give the same output
  • simple logic: no hidden "black box" cleaning rules
  • traceable changes: the goal is to show what changed and how many values were affected
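
to make the "deterministic steps" and "traceable changes" points above more concrete, here is a minimal sketch of a pipeline that runs named steps in a fixed order and records how many values each one affected. the names `Step`, `ReportLine`, `runPipeline`, `blankNullTokens`, and `formatReport` are hypothetical and are not taken from the repo.

```cpp
#include <cstddef>
#include <functional>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical in-memory types for illustration.
using Row = std::vector<std::string>;
using Table = std::vector<Row>;

// A named cleaning step. It edits the table in place and returns the
// number of values it affected, so the count can be reported.
struct Step {
    std::string name;
    std::function<std::size_t(Table&)> apply;
};

// One line of the change report.
struct ReportLine {
    std::string step;
    std::size_t affected;
};

// Run the steps in the order given. No clock, randomness, or global state
// is involved, so the same input and steps give the same output and report.
std::vector<ReportLine> runPipeline(Table& table, const std::vector<Step>& steps) {
    std::vector<ReportLine> report;
    report.reserve(steps.size());
    for (const Step& step : steps) {
        report.push_back({step.name, step.apply(table)});
    }
    return report;
}

// Example step (assumed behaviour): blank cells that are exactly "n/a" or "null".
std::size_t blankNullTokens(Table& table) {
    std::size_t changed = 0;
    for (Row& row : table) {
        for (std::string& cell : row) {
            if (cell == "n/a" || cell == "null") {
                cell.clear();
                ++changed;
            }
        }
    }
    return changed;
}

// Render the report as plain text, one line per step.
std::string formatReport(const std::vector<ReportLine>& report) {
    std::ostringstream out;
    for (const ReportLine& line : report) {
        out << line.step << ": " << line.affected << " value(s) affected\n";
    }
    return out.str();
}
```

a simple way to check determinism with this shape is to run the same steps on two copies of the same input and compare both the cleaned tables and the reports.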

privacy and offline

when the app runs in offline mode (webassembly), your csv stays on your device. when offline mode is not available, the app can fall back to api mode, which runs the same cleaning logic on the server. in that case your csv leaves your device.

if you are working with sensitive data, use offline mode and double-check the mode indicator in the app.
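
the fallback described above can be sketched as a small decision function. this is only an illustration under stated assumptions: the `Mode` enum, `selectMode`, `modeLabel`, and especially the `allowServer` flag are hypothetical, and the app may not expose a setting like that.

```cpp
#include <string>

// Hypothetical modes: Blocked means "do nothing rather than upload".
enum class Mode { Offline, Api, Blocked };

// Prefer offline when WebAssembly is available. Fall back to the API only
// if the caller has said server-side processing is acceptable (assumed flag).
Mode selectMode(bool wasmAvailable, bool allowServer) {
    if (wasmAvailable) {
        return Mode::Offline;
    }
    return allowServer ? Mode::Api : Mode::Blocked;
}

// Text for a mode indicator, so the user can see where the data goes.
std::string modeLabel(Mode mode) {
    switch (mode) {
        case Mode::Offline: return "offline (data stays on device)";
        case Mode::Api:     return "api (data sent to server)";
        default:            return "blocked (no processing)";
    }
}
```

for sensitive data, calling `selectMode(wasmAvailable, false)` in this sketch would never pick api mode, which matches the advice to stay in offline mode.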

feature requests and support

if you find a bug, or you have a good idea, please post it on github. i use it as a simple "featurebase" so ideas are public, searchable, and easy to track.

quick links