Honours project

The research angle behind the toolkit, and why it is built this way.

Theme

Transparency, privacy-first design, and repeatable cleaning results.


What I am building

I am building a data cleaning toolkit that aims to be fast, offline-capable, and transparent about what it changes. The purpose is not only to clean data, but to make cleaning decisions easier to explain and audit.

In many real workflows, cleaning is not a small step. It can change which rows survive, which values are treated as missing, and what patterns a model learns later on.

Why transparency matters

A lot of data cleaning happens as a mix of clicks and judgement calls. That can make it hard to reproduce and hard to justify, especially when a dataset feeds into machine learning.

My goal is to keep cleaning behaviour simple and inspectable. If the tool changes your data, you should be able to say what changed, how it changed, and why you chose that step.
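As a rough illustration of "what changed, how, and why", a cleaning step could return a change log alongside the cleaned rows. This is a minimal sketch in Python for brevity, not the toolkit's C++ code. The `Change` record, the `clean_with_log` function, and the list of missing-value tokens are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass

# Hypothetical set of tokens treated as missing; not the toolkit's actual list.
MISSING_TOKENS = {"na", "n/a", "null", "?"}


@dataclass(frozen=True)
class Change:
    row: int       # zero-based row index
    column: str    # column name
    before: str    # original cell value
    after: str     # cleaned cell value
    reason: str    # which rule made the change


def clean_with_log(rows):
    """Trim whitespace and normalise missing tokens, logging every edit."""
    cleaned, log = [], []
    for i, row in enumerate(rows):
        new_row = {}
        for column, value in row.items():
            result = value.strip()
            reason = "trimmed whitespace" if result != value else ""
            if result.lower() in MISSING_TOKENS:
                result, reason = "", "normalised missing token"
            if result != value:
                log.append(Change(i, column, value, result, reason))
            new_row[column] = result
        cleaned.append(new_row)
    return cleaned, log


if __name__ == "__main__":
    rows = [{"age": " 42 ", "city": "N/A"}, {"age": "37", "city": "Perth"}]
    cleaned, log = clean_with_log(rows)
    for change in log:
        print(change)
```

Each log entry answers the three questions directly: the row and column say what changed, the before and after values say how, and the reason names the rule that was applied.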

Privacy-first by design

The offline-first mode uses WebAssembly so the algorithms can run in your browser. In simple terms, that means your CSV can stay on your device while the cleaning runs.

There is also an API fallback mode for compatibility, but the tool is built so that offline mode is the preferred path when it is available.

Technical approach

The core algorithms are implemented in C++. For offline mode, they are compiled to WebAssembly and run in the browser. For the API fallback mode, the same algorithms run on the server behind simple API endpoints.

  • Same overall steps across modes.
  • Deterministic behaviour where possible.
  • Minimal user interface on purpose.
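One way to check that both modes behave the same is to fingerprint the cleaned output and compare fingerprints. The sketch below is an assumption about how such a check could look, not a description of the toolkit's own tests. `output_digest` is a hypothetical helper, and the canonical form (UTF-8, comma-separated, `\n` line endings) is chosen only for this example.

```python
import csv
import hashlib
import io


def output_digest(header, rows):
    """Return a SHA-256 hex digest of a table written in one fixed CSV form."""
    buffer = io.StringIO()
    # A fixed line terminator keeps the bytes identical across platforms.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return hashlib.sha256(buffer.getvalue().encode("utf-8")).hexdigest()


if __name__ == "__main__":
    header = ["id", "city"]
    offline_rows = [["1", "Perth"], ["2", ""]]   # e.g. output of the offline mode
    online_rows = [["1", "Perth"], ["2", ""]]    # e.g. output of the API mode
    same = output_digest(header, offline_rows) == output_digest(header, online_rows)
    print("modes agree:", same)
```

If the digests differ for the same input and settings, the two modes have diverged somewhere and the outputs need a closer look.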

Evaluation plan

I test the toolkit using real benchmark datasets with known quality issues (for example, CleanML datasets). I compare the workflow and the output against tools like OpenRefine and Python-based cleaning scripts.

When I report results, the aim is to report what I actually ran, including timing and before/after counts, rather than unsupported claims.
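A run report of this kind could be produced by wrapping each cleaning step in a small timing harness. The sketch below is illustrative only. `run_step` and `drop_exact_duplicates` are hypothetical names, and the duplicate-removal step stands in for whichever step is being measured.

```python
import time


def run_step(name, step, rows):
    """Apply one cleaning step and report timing plus before/after row counts."""
    start = time.perf_counter()
    result = step(rows)
    elapsed = time.perf_counter() - start
    report = {
        "step": name,
        "rows_before": len(rows),
        "rows_after": len(result),
        "rows_removed": len(rows) - len(result),
        "seconds": elapsed,
    }
    return result, report


def drop_exact_duplicates(rows):
    """Example step: keep the first occurrence of each identical row."""
    seen, kept = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


if __name__ == "__main__":
    rows = [["a", "1"], ["a", "1"], ["b", "2"]]
    _, report = run_step("drop_exact_duplicates", drop_exact_duplicates, rows)
    print(report)
```

Timings vary between machines and runs, so a report like this is most useful when it also states the hardware, the dataset, and how many times the step was repeated.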

Ethics and fairness

Cleaning can change data distributions. If a cleaning rule removes a lot of rows from one group, it can affect fairness and downstream decisions.

This project treats transparency as a practical safety feature. It makes it easier to spot when cleaning steps have unintended effects.
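One simple check for this kind of unintended effect is to compare how many rows each group loses. The sketch below assumes that cleaning only removes rows and never changes the group column. `removal_rate_by_group` and the column name `group` are hypothetical and used only for illustration.

```python
from collections import Counter


def removal_rate_by_group(before, after, group_column):
    """Return the fraction of rows removed for each value of group_column."""
    before_counts = Counter(row[group_column] for row in before)
    after_counts = Counter(row[group_column] for row in after)
    # A Counter returns 0 for groups that disappeared entirely.
    return {
        group: (total - after_counts[group]) / total
        for group, total in before_counts.items()
    }


if __name__ == "__main__":
    before = [{"group": "A"}] * 4 + [{"group": "B"}] * 4
    after = [{"group": "A"}] * 4 + [{"group": "B"}] * 1
    print(removal_rate_by_group(before, after, "group"))
```

A large gap between groups does not prove that a rule is unfair, but it is a signal that the rule deserves a closer look before the cleaned data is used downstream.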

Notes

If you found this page through search, the interactive tool is at /app and the homepage is at /.