Assume Noise: Systems Must Be Built to Assume Mess
By Jake Friedenberg

Systems Must Be Built to Assume Mess
Okta's latest 2025 Businesses at Work Report shows the average company now runs over 100 apps, with large enterprises averaging 231. Year-over-year growth is about 10%, and more apps just mean more IDs, more formats, and more duplicate entities - so every year there is more data to clean up across more silos.
Now add LLM slop. Synthetic content is leaking into the wild and back into training corpora. Researchers call this feedback loop “model collapse”: models trained on AI-generated data gradually drift away from the real distribution they were meant to learn. In the enterprise, it shows up as inconsistent, auto-generated text in CRMs, tickets, notes, and more.
And migrations? They never really end. You “finish” the big move, then spend forever cleaning up edge cases and reconciling new data that arrives with old assumptions. A 2025 survey by Caylent found that only 6% of the toughest database migrations finished on time, and a mere 6% hit zero downtime.
Data scientists already spend an estimated 80% of their time cleaning data - and with sprawl, slop, and never-ending migrations, the problem is compounding.
So what does “cleaning” actually mean? It's mind-numbing stuff: filling in missing values. Fixing typos. De-duping entities. Normalizing formats. Forcing datasets to play nice - like the ones pouring out of your 100+ applications, full of synthetic slop, in the middle of a migration.
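To make the tax concrete, here is a minimal pandas sketch of those four chores on a hypothetical CRM export. The column names and values are invented for illustration, and the mixed-format date parsing assumes pandas 2.0 or later.

```python
import pandas as pd

# Hypothetical export from one of those 100+ apps: inconsistent casing,
# stray whitespace, a typo, mixed date formats, and a missing value.
crm = pd.DataFrame({
    "account": ["Acme Corp", "acme corp ", "Globexx"],
    "signed":  ["2025-01-15", "01/15/2025", None],
    "arr":     [120_000, 120_000, None],
})

# Fill in missing values with an explicit domain default.
crm["arr"] = crm["arr"].fillna(0)

# Normalize formats: trim whitespace, lowercase names, coerce dates
# (format="mixed" requires pandas >= 2.0).
crm["account"] = crm["account"].str.strip().str.lower()
crm["signed"] = pd.to_datetime(crm["signed"], format="mixed", errors="coerce")

# Fix known typos with an explicit mapping.
crm["account"] = crm["account"].replace({"globexx": "globex"})

# De-dupe entities that collide once formats are normalized.
crm = crm.drop_duplicates(subset=["account", "signed", "arr"])
print(crm)
```

Multiply that by every column, every source, and every refresh, and the 80% figure stops being surprising.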
Cleaning Is a Tax, Not a Strategy
The goal is trustworthy answers, not perfect spreadsheets. Yet most analytics tools pretend otherwise, assuming lab-clean data. Teams end up scrubbing, stitching, and torturing tables into shape, only to produce dashboards that show correlations, not causes. But decisions should come from a better place than data butchery. If you can move cleaning from a weeks-long gate to an embedded step inside a system that understands entities, time, and location, you reclaim time without lowering quality.
It's Time to Change the Workload
The job of data science is not to polish spreadsheets. It's to find levers that change outcomes.
RootCause.ai doesn’t live in the sanitized lab. It works in the dirty swamp of enterprise data, in three ways (the sketches below make the first and last concrete):
Automated ingestion: Raw, messy, multi-source data is pulled in as-is and automatically mapped into a unified ontology of entities, time, and location.
Scalable causal discovery: Instead of brute force, causal discovery runs across thousands of signals to uncover the real drivers of outcomes.
Accessible to everyone: Heavy statistical lifting is automated. Data scientists spend their time simulating interventions, not babysitting regressions.
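The post doesn't spell out RootCause.ai's internals, so treat the following as a hedged sketch of what “mapped into a unified ontology of entities, time, and location” can look like in principle; OntologyEvent, from_crm, and from_tickets are invented names, not the product's API.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

# Hypothetical unified record: every source row, whatever its native
# schema, is aligned to the same three axes plus its raw payload.
@dataclass
class OntologyEvent:
    entity: str                    # canonical entity key, e.g. a deduped account
    timestamp: Optional[datetime]  # normalized event time
    location: Optional[str]        # e.g. an ISO country code
    payload: dict = field(default_factory=dict)  # source fields, kept for auditing

# Per-source adapters do the mapping from native schema to the ontology.
def from_crm(row: dict) -> OntologyEvent:
    return OntologyEvent(
        entity=row["account_name"].strip().lower(),
        timestamp=datetime.fromisoformat(row["created_at"]),
        location=row.get("billing_country"),
        payload=row,
    )

def from_tickets(row: dict) -> OntologyEvent:
    return OntologyEvent(
        entity=row["customer"].strip().lower(),
        timestamp=datetime.fromisoformat(row["opened"]),
        location=None,  # this source has no location; the ontology tolerates gaps
        payload=row,
    )

# Two messy silos, one shape out the other end.
events = [
    from_crm({"account_name": " Acme Corp", "created_at": "2025-03-01T09:00:00",
              "billing_country": "US"}),
    from_tickets({"customer": "ACME CORP", "opened": "2025-03-02T14:30:00"}),
]
assert events[0].entity == events[1].entity  # same entity recognized across silos
```

The payoff is the shape: once every source lands on the same three axes, entity resolution and time alignment become one problem instead of one per app.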
The pattern is consistent: weeks of wrangling collapse into hours of alignment. Cleaning becomes a built-in step, not the main event. You can still audit and override assumptions, but you’re no longer hand-stitching CSVs just to get to square one.
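“Simulating interventions” has a precise meaning in causal terms: you set a variable instead of observing it. The toy structural model below is invented end to end - the coefficients describe no real business - but it shows the difference between reading churn off a dashboard and asking what churn would be under do(discount = 0.20).

```python
from typing import Optional
import numpy as np

rng = np.random.default_rng(0)

# Invented structural model: discount pushes churn down directly, but also
# drives support load up, which pushes churn back up.
def mean_churn(n: int, do_discount: Optional[float] = None) -> float:
    """Mean churn, optionally under the intervention do(discount = x)."""
    if do_discount is None:
        discount = rng.uniform(0.0, 0.3, n)   # observed behavior
    else:
        discount = np.full(n, do_discount)    # forced by the intervention
    support_load = 2.0 * discount + rng.normal(0.0, 0.1, n)
    churn = 0.5 - 0.8 * discount + 0.3 * support_load + rng.normal(0.0, 0.05, n)
    return float(churn.mean())

print(f"observed mean churn:     {mean_churn(100_000):.3f}")
print(f"under do(discount=0.20): {mean_churn(100_000, do_discount=0.20):.3f}")
```

That question - what happens if we act - is the one a dashboard of correlations can't answer.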
Why This Matters Now
SaaS sprawl guarantees more schema drift and duplication. Synthetic text contamination is growing fast. Migrations extend cleaning indefinitely. The traditional flow is a slow, expensive journey: months of cleaning, a little modeling, then a dashboard of correlations.
The RootCause.ai flow is a direct path: ingestion, ontology alignment, causal discovery, and interventions - with cleaning reduced to targeted, in-platform adjustments you can version and repeat.
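As for “version and repeat,” here is one hedged way to make cleaning adjustments reproducible: express them as data rather than ad-hoc notebook cells, so every rule set has a stable fingerprint you can log next to each run. The rule format is invented for illustration, not RootCause.ai's.

```python
import hashlib
import json
import pandas as pd

# Hypothetical declarative cleaning rules: steps as data, applied in order.
RULES = [
    {"op": "strip_lower", "column": "account"},
    {"op": "fillna", "column": "arr", "value": 0},
    {"op": "dedupe", "columns": ["account", "arr"]},
]

def rules_version(rules: list) -> str:
    """Stable fingerprint of the rule set, suitable for audit logs."""
    blob = json.dumps(rules, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

def apply_rules(df: pd.DataFrame, rules: list) -> pd.DataFrame:
    # Same rules, same order, same output: the adjustment is repeatable.
    for r in rules:
        if r["op"] == "strip_lower":
            df[r["column"]] = df[r["column"]].str.strip().str.lower()
        elif r["op"] == "fillna":
            df[r["column"]] = df[r["column"]].fillna(r["value"])
        elif r["op"] == "dedupe":
            df = df.drop_duplicates(subset=r["columns"])
    return df

raw = pd.DataFrame({"account": [" Acme ", "ACME", "Globex"],
                    "arr": [100, 100, None]})
print("rules version:", rules_version(RULES))
print(apply_rules(raw, RULES))
```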
Want to see RootCause.ai live? Schedule a demo HERE.


