Titanic Today
By the People, for the People
The 80/20 Rule of AI: Why Data Cleaning is the Real Engineering
Behind every state-of-the-art model is a developer who spent two weeks fighting missing timestamps, ambiguous text, and floating-point errors.
Mar. 3, 2026 at 10:06pm
The article examines the 80/20 rule of AI engineering: developers spend roughly 80% of their time sourcing, cleaning, formatting, and agonizing over data, and only 20% actually building models. It highlights the challenges of working with 'dirty data' in the real world, such as missing time-series records, ambiguous text, and relational graph paradoxes. The author argues that data cleaning is not a hurdle on the way to machine learning but the core of it: decisions made during the cleaning phase have a profound impact on a model's success.
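The missing-timestamp problem the summary mentions can be sketched concretely. The snippet below is a minimal illustration, not code from the article; the sensor readings and timestamps are invented. It shows one common pandas pattern: reindex to a complete hourly range so the gap becomes an explicit NaN, then fill it by interpolation.

```python
import pandas as pd

# Hypothetical sensor readings with a missing hour (gaps are common in real feeds).
ts = pd.Series(
    [10.0, 12.0, 16.0],
    index=pd.to_datetime(["2026-03-01 00:00", "2026-03-01 01:00", "2026-03-01 03:00"]),
)

# Reindex to a complete hourly range, exposing the gap as NaN...
full = ts.reindex(pd.date_range(ts.index.min(), ts.index.max(), freq="h"))

# ...then fill it by time-weighted linear interpolation -- one of several
# defensible choices (forward-fill or dropping the row are others).
clean = full.interpolate(method="time")

print(clean)  # the 02:00 gap is now filled with 14.0
```

Whether interpolation, forward-fill, or dropping rows is right depends on what the gap means in the domain, which is exactly the kind of judgment call the article attributes to the cleaning phase.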
Why it matters
This article offers practical perspective for developers and data scientists working on real-world AI projects. It underscores the 'janitor work' of data cleaning, which is often overlooked or undervalued yet crucial for building reliable, production-ready AI systems. By understanding the challenges of messy, unstructured data, developers can better prepare for the realities of AI engineering and avoid the pitfalls of the 'digital utopia' presented in curated online courses.
The details
The article delves into three main challenges of working with 'dirty data': the time-series trauma in forecasting models, the unstructured nightmare of natural language processing (NLP), and the relational graph paradox. It explains how missing data, ambiguous language, and invalid relationships can all wreak havoc on machine learning models if not properly addressed during the data cleaning phase. The author emphasizes that decisions made during this phase, such as how to handle missing values or preprocess text, have a greater impact on model performance than the choice of architecture.
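The relational graph paradox the article describes (edges pointing at records that no longer exist, or at themselves) can be illustrated with a small sketch. The `users`/`follows` tables here are hypothetical, invented for illustration, not taken from the article.

```python
# Hypothetical social-graph data: a set of valid user ids and a list of
# follow edges, some of which are invalid.
users = {1, 2, 3}
follows = [(1, 2), (2, 99), (3, 3), (1, 3)]  # user 99 never existed; 3 follows itself

# Keep only edges whose endpoints both exist and are distinct -- a basic
# referential-integrity check done before the data ever reaches a model.
valid = [(src, dst) for src, dst in follows
         if src in users and dst in users and src != dst]

print(valid)  # [(1, 2), (1, 3)]
```

Silently training on such dangling edges is one way 'garbage in' becomes 'garbage out': the model learns structure from relationships that were never real.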
The players
Tirth M Shah
The author of the article, who discusses the challenges of data cleaning in the context of building AI systems.
What they’re saying
“If you learn Machine Learning from curated online courses, you are living in a digital utopia.”
— Tirth M Shah, Author
“'Garbage In, Garbage Out' isn't a cliché — it is the fundamental law of machine learning.”
— Tirth M Shah, Author
“Data is the fuel. If you put sludge into a Ferrari, it won't run.”
— Tirth M Shah, Author
The takeaway
Data cleaning is not a mere chore but the core of successful machine learning engineering. By embracing the 'janitor work' of data preprocessing, developers can build reliable, production-ready AI systems that stand up to the messy realities of real-world data.