Don’t just move the data, know the “business context” behind it.
A few days ago, I asked someone: “What’s the business use case behind this data pipeline?” Silence. “We just move the data.” “Someone probably uses the reports.”
The hard truth: moving data without context is just noise with infrastructure costs.
Before you ingest anything, answer these 3 questions:
1. Who consumes this data?
2. What decision does it drive?
3. What breaks if it’s wrong or late?
If you can’t answer these, stop. Go talk to the business.
Before building anything, work through these 10 principles as a data professional:
1. Start with the business problem
2. Know the data consumers
3. Understand the decision it supports
4. Learn the transformation logic deeply
5. Analyze data volume and scale
6. Don’t over-engineer systems
7. Never compromise on data quality
8. Build secure and reliable pipelines
9. Define strong data contracts early
10. Optimize resources and costs wisely
From “I move data” to “I enable decisions.” That’s where data engineers become irreplaceable.
Data without business context is like a chef cooking without a menu: impressive effort, zero orders.
Understanding Data Context for Transformation
Summary
Understanding data context for transformation means recognizing the importance of the meaning, relationships, and business relevance behind your data before making changes or building new systems. It's about making sure data is aligned with business goals and that everyone shares the same definitions, so transformation leads to useful insights rather than confusion.
- Clarify business purpose: Always ask how data supports decision-making and who relies on it before starting any transformation project.
- Align definitions: Standardize and document business terms and metrics across teams to prevent misinterpretation and confusion.
- Map relationships: Examine how data flows through systems and connects to business processes to ensure changes are meaningful and reliable.
-
The missing argument in IBM's CDO Study... corporate context is the real multiplier.
IBM's 2025 CDO Study celebrates that 81% of CDOs now "bring AI to data rather than centralizing data." Good. Necessary. Modern. But here's what the report never actually says…
Accessible data ≠ usable data
Clean data ≠ connected data
Distributed data ≠ contextualized data
Without shared, cross-functional corporate context, AI won't be transformative. It stays departmental, narrow, problematic.
IBM gets the plumbing right, but not the meaning.
The report hits all the right technical notes… federated access, hybrid architectures, real-time pipelines, multimodal data. What it never addresses directly is why AI actually stalls. Organizations don't share a common understanding of what their data means across systems and the organization. No shared definitions. No semantic layer. No cross-domain ontology. No map of how data relates or what causes what. Without that, AI can only work inside whatever system it touches.
The failure in AI isn't siloed data. It's siloed and limited context.
Sure, I agree, distributed data is the future. But not without centralized meaning. Every system defines the world differently: "customer," "close date," "active user," "conversion," "revenue." IBM frames this as a data problem when in fact it's a knowledge problem.
To transform with AI, you need:
🔹 Cross-system meaning (ontologies)
🔹 Contextual relationships (how entities evolve across workflows)
🔹 Temporal alignment (how events in one system affect another)
🔹 Causal understanding (why things happen, not just what)
🔹 Shared business definitions (the enterprise language)
None of this requires centralizing data. But it absolutely requires centralizing context.
IBM notes teams spend "more time hunting for and aligning data than generating insights." That's not a data access problem. Hunting for data = hunting for meaning.
When data centralizes, conflicts become visible and must be reconciled. When data stays distributed, conflicts stay hidden… until an AI agent produces nonsense. The data was available. The meaning never was.
So, if IBM is promoting a distributed data model... it had better emphasize a data context strategy.
Data is a multiplier... but not without shared understanding. You can have perfect pipelines, real-time access, federated governance, clean data... and still fail to transform.
If you want AI to transform the business, you must become obsessed with meaning. Otherwise, you'll spend the next decade fighting… "Why am I getting these results?" To avoid that, you’ll need to do more than perfect your data infrastructure. Build the semantic layer... the ontologies, knowledge graphs, and shared context that makes cross-functional AI actually possible.
That's the multiplier IBM didn't mention.
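A minimal sketch of what "centralizing context without centralizing data" could look like: each system keeps its own data but registers how it defines a shared business term, so conflicting definitions surface before an AI agent trips over them. The `TermDefinition` structure, the system names, and the definitions are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TermDefinition:
    """How one system defines a business term (all values are illustrative)."""
    system: str
    term: str
    definition: str


def find_conflicts(definitions):
    """Return the terms that different systems define differently."""
    by_term = defaultdict(dict)
    for d in definitions:
        by_term[d.term][d.system] = d.definition
    return {
        term: systems
        for term, systems in by_term.items()
        if len(set(systems.values())) > 1
    }


# Hypothetical registry entries; the data itself stays in each system.
registry = [
    TermDefinition("crm", "active user", "logged in within the last 30 days"),
    TermDefinition("billing", "active user", "has a paid subscription"),
    TermDefinition("crm", "close date", "date the contract was signed"),
    TermDefinition("billing", "close date", "date the contract was signed"),
]

conflicts = find_conflicts(registry)
for term, systems in conflicts.items():
    print(term, "->", systems)
```

Here only "active user" is flagged, because the two systems agree on "close date". A real semantic layer would go further (ontologies, relationships, temporal alignment), but even a flat registry like this makes hidden conflicts visible.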
-
15 people sent me the same article in the last 24 hours: OpenAI's announcement of how they built their own in-house data agent. Why does everyone think I need to see this?
Beyond just being interesting, it validates something I've been saying for years: The model isn't the hard part. Context is.
When we started talking about the idea of context being king for AI at Atlan, people would sometimes respond with blank stares: "Why are you building a context platform? Just plug in GPT." Finally, I can send them this article from OpenAI as a response.
As they put it, "CONTEXT IS EVERYTHING. High-quality answers depend on rich, accurate context. Without context, even strong models can produce wrong results, such as vastly misestimating user counts or misinterpreting internal terminology. To avoid these failure modes, the agent is built around multiple layers of context that ground it in OpenAI’s data and institutional knowledge."
To make their data agent successful, OpenAI needed to unify lots of different types of context from different sources, both within and beyond their data platform. They call it "multilayered contextual grounding." Here's what that means:
→ Table usage: Going beyond table names to understand how data flows and gets used (e.g. table schemas, relationships, lineage, usage patterns, and historical queries)
→ Human annotations: Pulling from domain-expert knowledge for each table that goes beyond metadata (e.g. semantics, business meaning, and known caveats)
→ Codex enrichment: Examining the code behind each data table to understand insights like scope and granularity, which can highlight important differences between tables that look similar on the surface
→ Institutional knowledge: Pulling context from Slack, Google Docs, and Notion to understand company specifics (e.g. launches, reliability incidents, internal codenames, key metrics)
→ Memory: Saving and learning from prior user corrections and agent discoveries over time via saved, editable memories
→ Runtime context: Live queries to the data warehouse or other data platform systems when context is missing or stale
Can't wait for the next time someone tells me that context is easy. I'll just send them this article!
Great work by Bonnie Xu, Aravind Suresh and Emma Tang.
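To make the idea of layered context concrete, here is a rough sketch of how several layers for one table might be collected and flattened into text that grounds a model. This is not OpenAI's implementation: the `TableContext` class, its field names, and the sample values are assumptions made up for illustration, and the runtime-context layer (live warehouse queries) is left out.

```python
from dataclasses import dataclass, field


@dataclass
class TableContext:
    """Context gathered for one table; every field name here is illustrative."""
    table: str
    usage: list = field(default_factory=list)          # lineage, common queries
    annotations: list = field(default_factory=list)    # expert notes, caveats
    code_notes: list = field(default_factory=list)     # scope and granularity
    institutional: list = field(default_factory=list)  # docs, chat, incidents
    memory: list = field(default_factory=list)         # saved user corrections

    def render(self):
        """Flatten the non-empty layers into text that can ground a model."""
        layers = [
            ("Table usage", self.usage),
            ("Human annotations", self.annotations),
            ("Code-derived notes", self.code_notes),
            ("Institutional knowledge", self.institutional),
            ("Memory", self.memory),
        ]
        lines = [f"Table: {self.table}"]
        for title, items in layers:
            if items:
                lines.append(f"{title}:")
                lines.extend(f"  - {item}" for item in items)
        return "\n".join(lines)


# Hypothetical table and notes.
ctx = TableContext(
    table="daily_active_users",
    annotations=["Excludes internal test accounts"],
    memory=["'users' means logged-in users, not visitors"],
)
prompt_context = ctx.render()
print(prompt_context)
```

Empty layers are skipped, so the rendered text only carries context that actually exists for the table.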
-
💣 Most FP&A transformations are doomed to fail before they even begin.
Because Step 1, Data Alignment, was never even part of the plan!
It is tempting to jump straight into reporting automation, dashboards, or planning tools. But if your source data is inconsistent, fragmented, or misaligned, every insight you generate will be misleading, late, or flat-out wrong.
This is not a technology issue. In finance, foundational disciplines must be executed flawlessly. Correct data architecture is the bedrock of any successful transformation.
Here is what effective data alignment looks like (and how ABCL approaches it):
🔹 Map the source systems – ERP, CRM, HRIS, Excel: know where everything lives
🔹 Standardise definitions – Revenue, cost, margin, customer: aligned across departments
🔹 Harmonise timeframes – Close calendars, fiscal periods, weekly/daily logic
🔹 Design a unified model – Built around key business decisions, not data dumps
🔹 Implement ownership and governance – Define refresh cycles, owners, and data rules
This is not a ‘tech’ task. It is foundational architecture for finance technology: done once, done right.
At Akshar Business consulting, every transformation starts with this: No dashboards. No models. No forecasts. No AI. Until the data story is aligned.
That is why we never skip Step 1: Data Alignment.
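As one small example of "harmonise timeframes", the sketch below maps calendar dates to a fiscal year and fiscal month so that every source system reports against the same periods. The function name, the April default, and the convention of naming the fiscal year after the calendar year in which it ends are assumptions; companies differ on all three.

```python
from datetime import date


def fiscal_period(day, fiscal_year_start_month=4):
    """Map a calendar date to (fiscal_year, fiscal_month).

    Assumes the fiscal year is named after the calendar year in which it
    ends, and that fiscal month 1 is the start month. Both are conventions
    that vary by company.
    """
    if not 1 <= fiscal_year_start_month <= 12:
        raise ValueError("fiscal_year_start_month must be between 1 and 12")
    fiscal_month = (day.month - fiscal_year_start_month) % 12 + 1
    if fiscal_year_start_month == 1:
        fiscal_year = day.year          # fiscal year equals calendar year
    elif day.month >= fiscal_year_start_month:
        fiscal_year = day.year + 1      # already in the next fiscal year
    else:
        fiscal_year = day.year
    return fiscal_year, fiscal_month


print(fiscal_period(date(2024, 4, 1)))    # first day of a fiscal year
print(fiscal_period(date(2025, 3, 31)))   # last day of the same fiscal year
```

Keeping this logic in one shared function, rather than re-deriving it in each report, is one way to stop two departments from disagreeing about which period a transaction belongs to.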
-
I like to think about lineage in layers, and I separate it into four layers.
⏺️ The Physical Layer, also known as Technical Lineage, is the foundation. This layer captures the actual movement of data across the enterprise. Here, we deal with the concrete components of the information ecosystem: applications, systems, files, databases, schemas, tables, columns, data sets, integration interfaces, transformation code, ingestion tools, transformation logic, filtering logic, etc. Physical Lineage is often extracted automatically from system logs and code. It shows us exactly how and where the data flows.
⏺️ Next is the Semantic Layer. This layer adds meaning and context to the data. Here we have business terms, metrics, definitions, KPIs, calculation logic, relationships, and data classifications. It bridges the gap between raw technical elements and the business understanding of data. It helps users interpret what the data actually means. This is how a user can understand that a cryptic name like F1 means employment status code, and F2 refers to employment status change date. The Semantic Layer relies on a well-defined Business Glossary as its foundation, so we need to create the glossary first before attempting to build this layer.
⏺️ Yet another layer is the Business Layer, which connects data to organizational structures and governance roles. This layer includes business processes and subprocesses. It can also include data domains, business initiatives, projects: pretty much anything that gives context to why we’re interested in lineage in the first place. This is where information about data owners and stewards is captured. Having this info is useful for understanding who is responsible for the data shown in the physical layer. This layer can also capture organizational structures (business units or corporate functions), policies and regulations. This layer puts the data into context: who uses the data, and how.
⏺️ Finally, I want to define the Operational Lineage Layer. There are other names for it: Execution Layer, Runtime Lineage, and many other variations. This layer captures data pipelines in action. It includes validation steps, data quality checks, observability checks, workflow orchestration, job scheduling, error handling, audit logs, and alerts. This layer is critical for tracing not just the path of the data, but also its health and performance.
Together, these four layers form a holistic view of data lineage, from physical flow and operational execution to business processes, governance, and business understanding.
Note: There is no standard terminology for these layers. I like to define them this way because I’m a practitioner, and this is how many lineage implementations are structured and how many data catalogs support this division. However, other thought leaders define these layers differently, which might be a good topic for a future post.
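The four layers can be pictured as different attributes attached to the same column. The sketch below does that for the post's F1 example; the `ColumnLineage` class, its field names, and the system, table, owner, and check values are hypothetical, and real catalogs model this with far richer structures.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnLineage:
    """One column described across the four layers (illustrative fields)."""
    # Physical layer: where the data lives and where it came from.
    system: str
    table: str
    column: str
    upstream: list = field(default_factory=list)
    # Semantic layer: the glossary term that gives the column meaning.
    business_term: str = ""
    # Business layer: the process it supports and the accountable owner.
    process: str = ""
    owner: str = ""
    # Operational layer: checks that run when the pipeline executes.
    quality_checks: list = field(default_factory=list)

    def describe(self):
        """Summarize the column using its business meaning when known."""
        name = self.business_term or self.column
        owner = self.owner or "unassigned"
        return f"{self.table}.{self.column} ({name}), owned by {owner}"


# Hypothetical entry for the cryptic column F1 mentioned above.
f1 = ColumnLineage(
    system="hr_core",
    table="hr_employee",
    column="F1",
    upstream=["hris_export.emp_status"],
    business_term="employment status code",
    process="employee lifecycle management",
    owner="HR Operations",
    quality_checks=["not null", "value in allowed status codes"],
)
print(f1.describe())
```

Without the semantic and business fields, the same record would describe itself only as `hr_employee.F1 (F1), owned by unassigned`, which is the gap those layers are meant to close.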
-
If you don’t understand the data pipeline, you don’t understand the insights you see. Every chart, forecast, or KPI is powered by a series of systems quietly moving data across the organization. This breakdown shows how a modern platform turns raw signals into business value.
1. Data Sources: Where everything begins. Apps, sensors, logs, and systems generating continuous data.
2. Data Ingestion: The entry point that reliably captures real-time or batch data and feeds it into the platform.
3. Raw Data Storage: A safe landing zone that keeps data unchanged so teams can audit, replay, or reprocess anytime.
4. Data Transformation (ETL / ELT): Converts raw input into structured, analytics-ready formats depending on scale and performance needs.
5. Data Processing: Cleans and enriches data while applying logic so it becomes usable for reporting or modeling.
6. Curated Storage Layer: Optimized locations where processed data is stored for fast querying and analytics.
7. Data Modeling: Defines clear schemas so metrics and dimensions stay consistent across teams.
8. Data Quality & Validation: Automated checks ensure accuracy, completeness, and trust in every downstream output.
9. Analytics & BI: Turns processed data into dashboards and insights business teams rely on.
10. Advanced Consumption: Feeds machine learning, real-time decisions, anomaly detection, and AI systems.
11. Monitoring & Observability: Tracks delays, failures, and freshness to keep the entire pipeline healthy and reliable.
12. Governance & Security: Controls access, compliance, and data protection across the platform.
13. Feedback & Iteration: Pipelines evolve as business needs shift; no modern data system is ever static.
Strong data platforms don’t just store information. They transform it, validate it, and deliver insights you can trust.
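Steps 8 and 11 in the list above (validation and freshness monitoring) are easy to illustrate with a small check that runs before a batch is published. This is only a sketch under stated assumptions: `validate_batch` is a made-up helper, rows are plain dictionaries, and every row is assumed to carry a timezone-aware `loaded_at` timestamp.

```python
from datetime import datetime, timedelta, timezone


def validate_batch(rows, required_fields, max_age, now=None):
    """Run completeness and freshness checks; return a list of problems."""
    now = now or datetime.now(timezone.utc)
    if not rows:
        return ["batch is empty"]
    problems = []
    # Completeness: every required field must be present and non-empty.
    for i, row in enumerate(rows):
        for name in required_fields:
            if row.get(name) in (None, ""):
                problems.append(f"row {i}: missing {name}")
    # Freshness: the newest row must be recent enough.
    newest = max(row["loaded_at"] for row in rows)
    if now - newest > max_age:
        problems.append(f"stale data: newest row loaded at {newest.isoformat()}")
    return problems


# Hypothetical batch: the second row lacks an amount and the load is old.
check_time = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
batch = [
    {"order_id": 1, "amount": 40.0,
     "loaded_at": datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)},
    {"order_id": 2, "amount": None,
     "loaded_at": datetime(2024, 1, 1, 5, 0, tzinfo=timezone.utc)},
]
problems = validate_batch(
    batch, ["order_id", "amount"], timedelta(hours=24), now=check_time
)
for problem in problems:
    print(problem)
```

Returning a list of problems, rather than raising on the first one, lets the pipeline decide whether to block the publish, alert an owner, or both.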
-
To build a solid Data Foundation for AI Transformation, enterprises must ensure that data is not only available, but trusted, well-governed, and ready for intelligent use. A strong data foundation bridges the gap between business goals and AI model performance. Below are the main components:
🔷 1. Data Strategy & Governance
- Data Ownership & Stewardship: Clear roles for who owns, curates, and validates data.
- Data Policies: Governance policies for access, usage, privacy, and compliance (e.g. GDPR, HIPAA).
- Master & Reference Data Management: Ensure consistency of critical data entities across systems.
🔷 2. Data Quality & Trust
- Data Profiling & Cleansing: Remove duplicates, fix inconsistencies, fill gaps.
- Validation Rules & Anomaly Detection: Detect data drift or broken pipelines early.
- Lineage & Provenance: Know where data comes from and how it has changed.
🔷 3. Data Architecture & Infrastructure
- Modern Data Platforms: Data lakes, warehouses, lakehouses, or vector databases.
- Real-Time vs Batch Processing: Support both operational and analytical workloads.
- Data Integration & APIs: ETL/ELT pipelines, connectors, and API-based data access.
🔷 4. Security, Privacy & Compliance
- Data De-identification & Masking: Protect PII while preserving utility.
- Role-Based Access Control (RBAC): Ensure only the right users/systems can access the right data.
- Audit Trails & Monitoring: Track who accessed what, when, and why.
🔷 5. AI-Ready Data Practices
- Labeling & Annotation Workflows: For supervised learning and fine-tuning.
- Feature Stores & Embeddings: Reusable, standardized inputs for ML/AI models.
- RAG-Enabling Structures: Chunked, semantically enriched documents for Retrieval-Augmented Generation.
🔷 6. DataOps & Automation
- CI/CD for Data Pipelines: Automate testing and deployment of data workflows.
- Metadata Management & Catalogs: Enable discovery and governance at scale.
- Monitoring & Alerting: Real-time health checks on data pipelines and quality metrics.
🔧 Personal Tip: Build Talent Across Data and Infrastructure
One of the most underestimated success factors in AI transformation? A team that understands both the data science and the engineering foundations beneath it. Many organizations invest heavily in AI skills, but neglect the cloud, DevOps, and data infrastructure expertise needed to scale those models in production.
To make AI real, you need:
- Data engineers who can build resilient, governed pipelines
- Platform and cloud architects who can support scalable, secure compute
- MLOps specialists who bridge model lifecycle with infrastructure operations
📌 AI doesn't run in notebooks. It runs on architecture. And that architecture has to be designed with security, performance, and cost in mind from day one.
#AITransformation #DataEngineering #DataManagement #ArtificalIntelligence
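"Protect PII while preserving utility" (component 4 above) usually means replacing identifiers with values that still allow joins and grouping. The sketch below shows two common moves using only the Python standard library: keyed hashing for pseudonymization and partial masking of an email. The function names, the 16-character truncation, and the sample key are illustrative assumptions; a production setup would also need key management and a review against the relevant regulation.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Replace a value with a keyed hash so joins still work across tables.

    The same input and key always give the same token, but the original
    value cannot be read back from it. Truncating to 16 hex characters
    keeps tokens short at the cost of a higher collision risk.
    """
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def mask_email(email):
    """Keep the domain for analysis; hide most of the local part."""
    local, sep, domain = email.partition("@")
    if not sep:
        return "***"  # not an email address, hide it entirely
    return f"{local[:1]}***@{domain}"


key = b"example-key-store-real-keys-in-a-secret-manager"  # placeholder only
token = pseudonymize("customer-12345", key)
print(token)
print(mask_email("jane.doe@example.com"))
```

Using a keyed hash rather than a plain one matters: without the secret key, an attacker cannot simply hash a list of known identifiers and match them against the tokens.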
-
I've been interviewing data candidates in system design for 10 years! I’ll teach you the key concepts in just 5 minutes:
1. Clearly define the requirements
Without clarity on the output, everything else falls apart.
* Define the expected output precisely.
* Ensure all stakeholders agree on what “success” looks like.
Miss this, and no amount of skill will save the project.
2. Understand the data you have access to
Knowing your inputs is essential to shaping your outputs.
* Identify the business entities (e.g., customers, products).
* Map the business transactions (e.g., orders, payments).
* Define relationships between entities and transactions.
* Know which datasets contribute to specific outputs.
This mapping builds the foundation for your system design.
3. Understand how data is modeled
Inputs generally fall into categories that dictate how they are used.
* Facts: Core data points (e.g., purchase, checkout).
* Dimensions: Descriptive attributes (e.g., product category).
* Rollups: Aggregated views (e.g., total sales).
* Joins: Understand which keys to use to join facts and dimensions.
Understanding data types and their relationships sets you up for transformation success.
4. Transform data effectively
Use SQL to turn inputs into meaningful outputs.
* Write functional SQL first; perfection comes later.
* Don’t over-optimize in the early stages.
Transformation is where the magic happens, but clarity and simplicity matter most at first.
5. Define data quality checks
Poor data quality leads to poor decisions – safeguard against it.
* Focus on key metrics (e.g., revenue, MAU, DAU).
* Set constraints: PK, FK, NOT NULL, ENUMs.
* Check for key metric skews, outliers, and reconciliation issues.
Quality isn’t a “nice-to-have” – it’s a must-have for reliable systems.
6. Optimize performance through partitioning
Partitioning data helps speed up access and processing.
* Identify common filters (e.g., date, type).
* Use low-cardinality columns for partitioning.
* Process data in parallel for large-scale pulls.
Partitioning is your first step toward scaling performance.
7. Advanced optimizations – Clustering & Ordering
Fine-tune your data layout for even greater efficiency.
* Use clustering for high-cardinality columns (e.g., timestamps).
* Sort columns for range queries to minimize scan times.
* Consider Z-order for multi-column range queries.
These optimizations can make the difference between “good enough” and “great.”
8. Reduce data movement in the cluster
Data shuffling is the silent killer of distributed performance.
* Use filters and avoid unnecessary operations.
* Know which actions trigger shuffling (e.g., group by, joins).
* Leverage database engine optimizations (e.g., Spark’s AQE).
Reducing shuffle equals faster processing and lower costs.
- Like this post? Let me know your thoughts in the comments, and follow me for more actionable insights on data engineering and system design. #dataengineering #data
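The partitioning advice in step 6 comes down to this: if data is laid out by a commonly filtered, low-cardinality column, a query with that filter reads one partition instead of every row. The sketch below imitates that in plain Python with an in-memory dictionary standing in for partitioned storage; the helper names and sample orders are hypothetical, and real engines do this pruning at the file or block level.

```python
from collections import defaultdict


def partition_by(rows, key):
    """Group rows into partitions keyed by a low-cardinality column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return dict(partitions)


def total_for_partition(partitions, partition_value, amount_field):
    """Read only the matching partition instead of scanning every row.

    Returns the total and the number of rows that had to be read.
    """
    rows = partitions.get(partition_value, [])
    return sum(row[amount_field] for row in rows), len(rows)


# Hypothetical orders, partitioned by the date they were placed.
orders = [
    {"order_date": "2024-01-01", "amount": 10},
    {"order_date": "2024-01-02", "amount": 20},
    {"order_date": "2024-01-02", "amount": 5},
    {"order_date": "2024-01-03", "amount": 7},
]
partitions = partition_by(orders, "order_date")
total, rows_read = total_for_partition(partitions, "2024-01-02", "amount")
print(f"total={total}, rows read={rows_read} of {len(orders)}")
```

The date filter touches two rows out of four here; on a table with years of history, the same layout is what keeps a single-day query from scanning everything. Partitioning on a high-cardinality column would instead create a huge number of tiny partitions, which is why the post reserves those columns for clustering.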
-
dbt isn’t just a transformation tool; it’s a modeling framework that defines how your data flows, evolves, and becomes analytics-ready. Every dbt project relies on specific model types that each play a different role in shaping the final dataset. Understanding these model categories helps you design cleaner pipelines, reduce complexity, and make your warehouse faster and more reliable.
1. Staging Models
These models clean, normalize, and standardize raw source data before anything else happens. It’s where columns get renamed, types are fixed, and noisy data becomes usable. Think of it as the prep kitchen of your pipeline: making raw ingredients readable and consistent for downstream logic.
2. Intermediate Models
Intermediate layers hold your core business logic. This is where joins, calculations, and reusable transformations come together. They prepare structured data for analytics without worrying about dashboards. It’s the brain of your modeling process: organizing logic before it becomes metrics.
3. Mart Models
Mart models are built for consumption. They produce final KPIs, aggregates, and reporting-friendly structures that BI tools and stakeholders rely on every day. It’s the part of your warehouse that actually gets surfaced in dashboards and business decisions.
4. Incremental Models
These models optimize performance by processing only new or changed data instead of rebuilding full tables. They’re essential when dealing with large datasets or frequent updates, keeping your pipelines fast and efficient without overloading the warehouse.
5. Snapshot Models
Snapshots track how your data changes over time. They capture row-level history so you can analyze trends, deltas, and slowly changing dimensions. It works like version control, preserving past states for accurate historical analysis.
6. Ephemeral Models
Ephemeral models are lightweight, temporary transformations that never get materialized as tables. dbt inlines them into downstream queries, making them perfect for simple, reusable logic without adding extra warehouse storage or clutter.
Great dbt pipelines aren’t built with complex SQL; they’re built with the right model structure. When you know where each piece of logic belongs, your data flows cleaner, runs faster, and becomes far easier to maintain at scale.
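The incremental idea (process only new or changed rows, then merge them into what already exists) can be sketched outside dbt. dbt itself expresses this in SQL and Jinja, for example with the `is_incremental()` macro, so the Python below is not dbt code; `incremental_merge`, the dictionary-based target, and the sample rows are hypothetical stand-ins that only show the logic.

```python
def incremental_merge(target, source, key="id", updated_field="updated_at"):
    """Apply only source rows newer than the target's high-water mark.

    `target` maps each key to its current row; `source` is the full upstream
    data. Returns how many rows were processed in this run.
    """
    high_water = max(
        (row[updated_field] for row in target.values()), default=None
    )
    changed = [
        row for row in source
        if high_water is None or row[updated_field] > high_water
    ]
    for row in changed:
        target[row[key]] = row  # insert new keys, overwrite existing ones
    return len(changed)


# Hypothetical state: one order already loaded, two changes upstream.
target = {1: {"id": 1, "updated_at": "2024-01-01", "status": "new"}}
source = [
    {"id": 1, "updated_at": "2024-01-01", "status": "new"},      # unchanged
    {"id": 1, "updated_at": "2024-01-03", "status": "shipped"},  # updated
    {"id": 2, "updated_at": "2024-01-02", "status": "new"},      # brand new
]
processed = incremental_merge(target, source)
print(processed, target)
```

A second run with the same source processes nothing, which is the whole point. One caveat of a strict high-water mark is that late-arriving rows with an older or equal timestamp are skipped, so real incremental models often reprocess a short lookback window as well.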
-
How can workforce data drive deeper organisational problem-solving?
Showcasing the best HR and people analytics resources for January https://lnkd.in/eeNH3B4Y
"If organizations want to move beyond quick fixes and use work and workforce data to drive deeper—and often more challenging—problem-solving, it is important that they look at the data in context."
The premise of this thoughtful article by Steve Hatfield, Susan Cantrell, and Brad Kreit is that without the right context, even simple measurements can undermine efforts to convert people data into value. They explore several examples – in the workforce, in the workplace, and in the work – where organisations might be limiting their analysis to the surface level, and show how deeper analysis can reveal systemic issues that lead to opportunities for transformation. They also provide guidance on three actions leaders can take to help ensure they are not missing important context in their data analysis:
1️⃣ Bring data from different domains and sources together for analysis.
2️⃣ Make sure you’re measuring what you should, not just what you can.
3️⃣ Identify potential biases in data collection algorithms.
#humanresources #peopleanalytics #workforceplanning #orgdesign #culture #hrtech #diversity #leadership #employeeexperience