Data modelling is one of the most important parts of system design. Get it right, and everything else just works. I've seen too many teams rush into building APIs and UIs before truly understanding their data. It feels fast at first...until requirements change, and those early shortcuts turn into months of rework. That's why it's worth asking the right questions early: → What entities actually exist in the system? → How do they relate to one another? → Which should be immutable vs versioned? → Which fields are truly required vs optional? → How could this existing model evolve without breaking existing consumers? → Are you modelling your access patterns, not just your entities? → Do your models reflect real business concepts or just database convenience? If you're designing a new system or refactoring an existing one, here's some advice I've found helpful: [0] Start with relationships not tables → whiteboard your entities first before writing a single line of code. Truly understand how data connects, and the right structure will reveal itself. [1] Validate against reality → run real queries and flows through your model early, and if your model can't support actual access patterns efficiently, then it's not ready. [2] Be explicit about tradeoffs → every model optimises for something: consistency, availability, latency or simplicity. You can't have them all. Think PACELC not just CAP theorem. [3] Design for change → your model will change over time. The goal isn't to predict all future use cases but rather to make changes safe. Ask yourself how hard would it be to add new fields, relationships or versions without breaking downstream systems? [4] Document the reasoning, not just the result → share context with those around you, write down what you decided, why you decided it and what you explicitly chose not to do. [5] Seek diverse feedback early → share your proposals with experienced engineers before you finalise the model. 
They may catch scaling or indexing problems, or other risks that aren't obvious yet. Because whilst code can be rewritten easily, data lives forever. And changing it later always costs more than designing it right up front. Start with your data. Design everything else around it. #softwareengineering #systemdesign
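To make the "design for change" advice concrete, here is a minimal sketch of one way to keep changes safe: a tolerant reader that defaults a newly added optional field and ignores fields it does not recognise. The `Customer` record and its field names (`loyalty_tier`, `referral_code`) are hypothetical, made up purely for illustration.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass(frozen=True)
class Customer:
    # Required fields: every producer has always sent these.
    customer_id: str
    email: str
    # Optional field added in a later (hypothetical) schema version.
    # The default keeps records written before the change readable.
    loyalty_tier: Optional[str] = None

    @classmethod
    def from_dict(cls, raw: dict[str, Any]) -> "Customer":
        # Tolerant reader: keep the fields this version knows about and
        # ignore anything newer producers may have added.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in raw.items() if k in known})


old_record = {"customer_id": "c-1", "email": "a@example.com"}
new_record = {
    "customer_id": "c-2",
    "email": "b@example.com",
    "loyalty_tier": "gold",
    "referral_code": "XYZ",  # a field this consumer does not know about yet
}

old_customer = Customer.from_dict(old_record)
new_customer = Customer.from_dict(new_record)
print(old_customer)
print(new_customer)
```

The design choice is that additions are non-breaking by construction, while a record missing a required field still fails loudly with a `TypeError` instead of being silently accepted.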
Key Data Modeling Best Practices
Summary
Key data modeling best practices involve designing structured representations of data to support reliable, accurate, and efficient information systems. At its core, data modeling is the process of defining how data is organized, related, and processed to reflect real-world business needs and ensure future adaptability.
- Define clear relationships: Map out how different pieces of data connect to each other before building any systems, which helps prevent confusion and costly mistakes later.
- Document grain and keys: Clearly specify the level of detail (grain) and unique identifiers (keys) for each table so everyone understands exactly what each row represents.
- Design for data quality: Set up processes to regularly check for errors, duplicates, and consistency in your data, ensuring that business decisions rely on trustworthy information.
-
Data modeling is one of those concepts that many data engineers don't formally learn through coursework but often master through experience. Here are my top 5 tips:
1. Understand the Business Context
Before you start modeling, deeply understand the business requirements and analytical needs. Engage with stakeholders to identify:
• Key performance indicators (KPIs)
• Critical dimensions and facts
• Required data granularity
• Expected query patterns
Without this context, even the most technically sound model may fail to deliver value.
2. Follow Dimensional Modeling Principles
For analytical workloads, adopting Kimball's dimensional modeling techniques is often the best approach. Key concepts include:
• Fact Tables: Store measurable business events with numeric values (e.g., sales, transactions).
• Dimension Tables: Store descriptive attributes (e.g., customer details, product categories).
• Star Schema: Optimized for performance with fewer joins and simpler queries.
• Snowflake Schema: Normalizes dimensions to reduce data redundancy, but requires more joins.
3. Prioritize Data Granularity
Choosing the right grain is critical. Ask: What's the most detailed level of data you'll need for reporting? Will data be aggregated or filtered frequently? As a rule, the more granular the better, because detailed data can always be aggregated later but aggregated data cannot be broken back down. A clear understanding of granularity keeps your model efficient and avoids overcomplication.
4. Implement Surrogate Keys
Avoid relying on natural keys directly in your model. Instead, use surrogate keys as primary keys in dimension tables. This can improve join performance, simplifies joins, and protects against changes in natural keys.
5. Ensure Data Quality with Metadata Fields
Add essential metadata fields to your tables:
• created_date and last_modified_date for tracking data freshness
• source_system to identify data origins
• ETL_processed_date for tracking pipeline execution
These fields simplify debugging, lineage tracking, and auditability.
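Tips 4 and 5 can be sketched roughly as follows, using SQLite from Python's standard library. This is an illustration under assumptions, not a reference design: the table names (`dim_customer`, `fact_sales`), the `customer_code` natural key and the sample values are all hypothetical.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key       INTEGER PRIMARY KEY,   -- surrogate key, owned by the warehouse
    customer_code      TEXT NOT NULL UNIQUE,  -- natural key from the source system
    customer_name      TEXT NOT NULL,
    source_system      TEXT NOT NULL,         -- metadata: where the row came from
    created_date       TEXT NOT NULL,         -- metadata: first load
    last_modified_date TEXT NOT NULL,         -- metadata: latest change
    etl_processed_date TEXT NOT NULL          -- metadata: pipeline run
);
CREATE TABLE fact_sales (
    sale_id      INTEGER PRIMARY KEY,
    customer_key INTEGER NOT NULL REFERENCES dim_customer (customer_key),
    amount       REAL NOT NULL
);
""")

now = datetime.now(timezone.utc).isoformat()
cur = conn.execute(
    "INSERT INTO dim_customer (customer_code, customer_name, source_system,"
    " created_date, last_modified_date, etl_processed_date)"
    " VALUES (?, ?, ?, ?, ?, ?)",
    ("CRM-001", "Acme Ltd", "crm", now, now, now),
)
customer_key = cur.lastrowid  # the surrogate key SQLite assigned

# Facts reference the surrogate key, never the natural key.
conn.execute(
    "INSERT INTO fact_sales (customer_key, amount) VALUES (?, ?)",
    (customer_key, 250.0),
)

# The source system re-issues the natural key; only the dimension row changes.
conn.execute(
    "UPDATE dim_customer SET customer_code = ?, last_modified_date = ?"
    " WHERE customer_key = ?",
    ("ERP-9001", datetime.now(timezone.utc).isoformat(), customer_key),
)

rows = conn.execute(
    "SELECT d.customer_name, d.customer_code, f.amount"
    " FROM fact_sales f JOIN dim_customer d USING (customer_key)"
).fetchall()
print(rows)
```

Because the fact table only knows the surrogate key, the natural key change touches one dimension row and no fact rows.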
-
Tools are the fashion; Data Modeling is the skeleton. You can swap Airflow for Prefect, or Spark for DuckDB. But you can’t swap "bad logic" for a faster engine and expect it to work. In one project, I used Airflow. In another, Spark. Lately, it’s all dbt. But 100% of the time, the win came down to Data Modeling fundamentals. Building a data platform without modeling is like building a skyscraper on a swamp. It doesn't matter how expensive your gold-plated elevators (tools) are if the foundation is sinking. Here's what actually matters: 𝗗𝗶𝗺𝗲𝗻𝘀𝗶𝗼𝗻𝗮𝗹 𝗠𝗼𝗱𝗲𝗹𝗶𝗻𝗴 = 𝗦𝗽𝗲𝗲𝗱 Star schemas make queries fast. Facts and dimensions separated = happy analysts. 𝗦𝗖𝗗𝘀 𝗪𝗶𝗹𝗹 𝗕𝗶𝘁𝗲 𝗬𝗼𝘂 Skip SCD Type 2 tracking? Debug why historical reports show wrong data at 2 AM. 𝗡𝗼𝗿𝗺𝗮𝗹𝗶𝘇𝗮𝘁𝗶𝗼𝗻 𝗜𝘀𝗻'𝘁 𝗥𝗲𝗹𝗶𝗴𝗶𝗼𝗻 OLTP systems? Normalize for integrity. OLAP systems? Denormalize for speed. Know your world. Design accordingly. 𝗗𝗮𝘁𝗮 𝗩𝗮𝘂𝗹𝘁 = 𝗙𝗹𝗲𝘅𝗶𝗯𝗶𝗹𝗶𝘁𝘆 Business requirements changing weekly? Data Vault keeps you sane. Verbose but bulletproof. 👉 Here are the real Non-negotiables: • Model for how data will be queried, not just stored • Document your grain—ambiguity kills data trust • Surrogate keys > natural keys (trust me on this) • Test your model with real queries before building pipelines My 2 cents: Master data modeling, and every tool becomes easier. Skip it, and you'll spend your career firefighting broken pipelines. Are you willing to upskill❓Explore these resources: → Michael K.'s KahanDataSolutions - https://lnkd.in/g4JSFPph → Benjamin Rogojan's Seattle Data Guy - https://lnkd.in/ghewnvBX → The Data Warehouse Toolkit by Ralph Kimball - https://lnkd.in/dTynC6yD Image Credits: Shubham Srivastava Every pipeline you build will eventually be replaced. A solid data model? That becomes the language of the company. What's one data modeling mistake that cost you hours of debugging? Let's learn together. 👇
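For the SCD point above, here is a small in-memory sketch of what Type 2 tracking means in practice: a change closes the old version of a dimension row and opens a new one, so historical reports can still look up the value that was valid at the time. The `CustomerVersion` shape and the helper names `apply_scd2` and `region_as_of` are assumptions for illustration; a real warehouse would typically do this in SQL or dbt.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class CustomerVersion:
    customer_id: str          # natural key
    region: str               # the attribute whose history we track
    valid_from: date
    valid_to: Optional[date]  # None marks the current version


def apply_scd2(history, customer_id, new_region, change_date):
    """Return a new history list with the change applied as SCD Type 2."""
    result = []
    needs_new_version = True
    for row in history:
        if row.customer_id == customer_id and row.valid_to is None:
            if row.region == new_region:
                needs_new_version = False  # nothing changed, keep the row open
                result.append(row)
            else:
                # Close the old version instead of overwriting it.
                result.append(replace(row, valid_to=change_date))
        else:
            result.append(row)
    if needs_new_version:
        result.append(CustomerVersion(customer_id, new_region, change_date, None))
    return result


def region_as_of(history, customer_id, as_of):
    """Look up the region that was valid on a given date."""
    for row in history:
        if (
            row.customer_id == customer_id
            and row.valid_from <= as_of
            and (row.valid_to is None or as_of < row.valid_to)
        ):
            return row.region
    return None


history = [CustomerVersion("c-1", "EMEA", date(2023, 1, 1), None)]
history = apply_scd2(history, "c-1", "APAC", date(2024, 6, 1))

print(region_as_of(history, "c-1", date(2023, 12, 31)))
print(region_as_of(history, "c-1", date(2024, 6, 1)))
```

A Type 1 overwrite would keep a single row and lose the first answer, which is exactly how historical reports end up showing the wrong data.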
-
I've been interviewing data candidates in system design for 10 years! I'll teach you the key concepts in just 5 minutes:
1. Clearly define the requirements
Without clarity on the output, everything else falls apart.
* Define the expected output precisely.
* Ensure all stakeholders agree on what "success" looks like.
Miss this, and no amount of skill will save the project.
2. Understand the data you have access to
Knowing your inputs is essential to shaping your outputs.
* Identify the business entities (e.g., customers, products).
* Map the business transactions (e.g., orders, payments).
* Define relationships between entities and transactions.
* Know which datasets contribute to specific outputs.
This mapping builds the foundation for your system design.
3. Understand how data is modeled
Inputs generally fall into categories that dictate how they are used.
* Facts: Core data points (e.g., purchase, checkout).
* Dimensions: Descriptive attributes (e.g., product category).
* Rollups: Aggregated views (e.g., total sales).
* Joins: Understand which keys to use to join facts and dimensions.
Understanding these categories and their relationships sets you up for transformation success.
4. Transform data effectively
Use SQL to turn inputs into meaningful outputs.
* Write functional SQL first; perfection comes later.
* Don't over-optimize in the early stages.
Transformation is where the magic happens, but clarity and simplicity matter most at first.
5. Define data quality checks
Poor data quality leads to poor decisions, so safeguard against it.
* Focus on key metrics (e.g., revenue, MAU, DAU).
* Set constraints: PK, FK, NOT NULL, ENUMs.
* Check for key metric skews, outliers, and reconciliation issues.
Quality isn't a "nice-to-have"; it's a must-have for reliable systems.
6. Optimize performance through partitioning
Partitioning data helps speed up access and processing.
* Identify common filters (e.g., date, type).
* Use low-cardinality columns for partitioning.
* Process data in parallel for large-scale pulls.
Partitioning is your first step toward scaling performance.
7. Advanced optimizations: Clustering & Ordering
Fine-tune your data layout for even greater efficiency.
* Use clustering for high-cardinality columns (e.g., timestamps).
* Sort columns for range queries to minimize scan times.
* Consider Z-order for multi-column range queries.
These optimizations can make the difference between "good enough" and "great."
8. Reduce data movement in the cluster
Data shuffling is the silent killer of distributed performance.
* Use filters and avoid unnecessary operations.
* Know which actions trigger shuffling (e.g., group by, joins).
* Leverage database engine optimizations (e.g., Spark's Adaptive Query Execution, AQE).
Reducing shuffle equals faster processing and lower costs.
Like this post? Let me know your thoughts in the comments, and follow me for more actionable insights on data engineering and system design. #dataengineering #data
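Points 6 and 7 can be illustrated with a toy, in-memory sketch: partition on a low-cardinality column that queries commonly filter on, then keep each partition sorted on a high-cardinality column so range lookups touch only a slice. The event rows, column names and the `query` helper are hypothetical; real engines do this on files or storage blocks, not Python lists.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

events = [
    {"event_date": "2024-01-01", "user_id": 17, "amount": 30},
    {"event_date": "2024-01-01", "user_id": 3, "amount": 10},
    {"event_date": "2024-01-01", "user_id": 42, "amount": 5},
    {"event_date": "2024-01-02", "user_id": 8, "amount": 20},
    {"event_date": "2024-01-02", "user_id": 17, "amount": 15},
]

# Partitioning: group rows by a low-cardinality column (the date).
partitions = defaultdict(list)
for event in events:
    partitions[event["event_date"]].append(event)

# Ordering: within each partition, sort by a high-cardinality column.
for rows in partitions.values():
    rows.sort(key=lambda r: r["user_id"])


def query(event_date, min_user, max_user):
    """Return rows for one date with min_user <= user_id <= max_user."""
    rows = partitions.get(event_date, [])  # partition pruning: skip other dates
    keys = [r["user_id"] for r in rows]
    lo = bisect_left(keys, min_user)       # binary search on the sorted column
    hi = bisect_right(keys, max_user)
    return rows[lo:hi]


result = query("2024-01-01", 10, 50)
print(result)
```

The date filter discards every other partition without reading it, and the sort order means the range filter never scans rows outside the requested bounds.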
-
A data model cost a company $50K last month. Nobody did anything wrong. The system allowed ambiguity. And ambiguity touched money. A B2B SaaS company pays sales commissions based on booked revenue. The data team owns the commission model. Sales ops trusts the dashboard. Finance approves payouts from it. The model joined an orders fact table to a deals-to-reps bridge table. The assumption: one rep per deal. Nobody wrote it down. But about 20% of deals had two reps attached. The commission model implicitly assumed one row per order per rep, but that grain was never defined or enforced. When those orders hit the bridge, revenue duplicated. A $100K order became two rows of $100K each. Total duplicated revenue: $1M. Commission rate: 5%. Overpayment: $50K. The money was paid. The clawback conversation was brutal. Sales morale dropped. 𝐓𝐡𝐞 𝐬𝐜𝐚𝐫𝐲 𝐩𝐚𝐫𝐭? 𝐓𝐡𝐞𝐫𝐞 𝐰𝐞𝐫𝐞 𝐧𝐨 𝐫𝐞𝐝 𝐟𝐥𝐚𝐠𝐬: → Company-level totals still looked reasonable → Overstatement only appeared when grouped by rep → No reconciliation to the billing system existed → No grain definition was documented 𝐓𝐡𝐞 𝟑 𝐦𝐨𝐝𝐞𝐥𝐢𝐧𝐠 𝐦𝐢𝐬𝐭𝐚𝐤𝐞𝐬 𝐭𝐡𝐚𝐭 𝐥𝐞𝐚𝐝 𝐡𝐞𝐫𝐞: 𝟏. 𝐍𝐨𝐭 𝐝𝐞𝐟𝐢𝐧𝐢𝐧𝐠 𝐠𝐫𝐚𝐢𝐧 𝐚𝐧𝐝 𝐤𝐞𝐲𝐬 → Write the grain as a sentence for every fact table → Model many-to-many explicitly with allocation rules → Test uniqueness at the declared grain 𝟐. 𝐓𝐫𝐞𝐚𝐭𝐢𝐧𝐠 𝐝𝐚𝐭𝐚 𝐪𝐮𝐚𝐥𝐢𝐭𝐲 𝐚𝐬 𝐨𝐩𝐭𝐢𝐨𝐧𝐚𝐥 → Reconcile money models to billing or GL every load → Add tests for uniqueness, referential integrity, freshness → Define metrics once and enforce them 𝟑. 𝐌𝐨𝐝𝐞𝐥𝐢𝐧𝐠 𝐟𝐨𝐫 𝐜𝐨𝐧𝐯𝐞𝐧𝐢𝐞𝐧𝐜𝐞 𝐢𝐧𝐬𝐭𝐞𝐚𝐝 𝐨𝐟 𝐰𝐨𝐫𝐤𝐥𝐨𝐚𝐝 → Separate raw ingestion from analytics-ready models → Use star schemas for financial reporting → Design for how data will be queried, not how it's stored 𝐓𝐡𝐞 𝐛𝐨𝐭𝐭𝐨𝐦 𝐥𝐢𝐧𝐞: A $50K data modeling mistake doesn't require incompetence. It only requires ambiguity. When data touches money, ambiguity is a liability. The fix is boring discipline—written contracts in the model and tests that fail before finance finds the problem. 
What's the worst join assumption you've seen—or caught before it became expensive? #DataEngineering #DataModeling #AnalyticsEngineering
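The fan-out in this story is easy to reproduce at a smaller scale. The sketch below assumes five hypothetical $100K orders, one of which (20%) has two reps, and compares a naive join against an explicit allocation rule; the rep names, the 50/50 split and the `check_grain` helper are assumptions for illustration, not details from the actual incident.

```python
from collections import defaultdict

COMMISSION_RATE = 0.05

# Fact table. Declared grain: one row per order.
orders = {
    "o1": 100_000,
    "o2": 100_000,
    "o3": 100_000,
    "o4": 100_000,
    "o5": 100_000,
}

# Bridge table. Declared grain: one row per (order, rep), carrying the share
# of the order credited to that rep. Order o5 has two reps.
bridge = [
    ("o1", "ana", 1.0),
    ("o2", "ben", 1.0),
    ("o3", "ana", 1.0),
    ("o4", "ben", 1.0),
    ("o5", "ana", 0.5),
    ("o5", "ben", 0.5),
]


def check_grain(rows, key):
    """Fail loudly if any key appears more than once at the declared grain."""
    keys = [key(row) for row in rows]
    duplicates = {k for k in keys if keys.count(k) > 1}
    if duplicates:
        raise ValueError(f"grain violated for keys: {sorted(duplicates)}")


def revenue_by_rep(use_allocation):
    """Join orders to the bridge, with or without the allocation share."""
    totals = defaultdict(float)
    for order_id, rep, share in bridge:
        weight = share if use_allocation else 1.0
        totals[rep] += orders[order_id] * weight
    return dict(totals)


check_grain(bridge, key=lambda row: (row[0], row[1]))

booked = sum(orders.values())
naive = revenue_by_rep(use_allocation=False)
allocated = revenue_by_rep(use_allocation=True)

duplicated = sum(naive.values()) - booked      # revenue counted twice
overpayment = duplicated * COMMISSION_RATE     # commission paid on it
reconciles = abs(sum(allocated.values()) - booked) < 0.01

print(naive, allocated, duplicated, overpayment, reconciles)
```

Here the naive join overstates revenue by one order's worth, and only the allocated version reconciles to booked revenue. With the post's numbers the same arithmetic is $1M of duplicated revenue x 5% = $50K.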
-
🧠 Data Modeling: The Hidden Power Behind Every Scalable Data System Before building dashboards or writing complex SQL queries, one critical step shapes the success of your entire pipeline—data modeling. Whether you’re working on transactional systems or analytical platforms, modeling defines how efficiently your data can be stored, queried, and trusted. 🔍 What’s Covered in This Visual: 1️⃣ What is Data Modeling? It’s the process of structuring and organizing data so it’s ready for storage, querying, and analysis. It supports both OLTP (transactional) and OLAP (analytical) systems and evolves through three key stages: conceptual, logical, and physical. 2️⃣ The 3 Levels of Data Modeling: ~ Conceptual Model: A business-level view—no technical constraints, just what data is needed and how it's related. Used by stakeholders and data architects. ~ Logical Model: More technical—it includes attributes, keys, and normalization rules like 3NF. Still independent of any specific DBMS. ~ Physical Model: Now DBMS-specific. Tables, indexes, partitions, datatypes—all optimized for performance. Used by data engineers and DBAs. 3️⃣ Dimensional Modeling for OLAP (Data Warehousing): Here we focus on two key terms: ~ Facts: Quantitative, measurable data like sales or revenue. ~ Dimensions: Descriptive data like customer, region, or time—used to slice and dice metrics. 4️⃣ Schema Design Principles: ~ Star Schema: Simpler, fewer joins, faster queries, but uses more space. ~ Snowflake Schema: More normalized, saves space, but introduces more joins and complexity. 5️⃣ The One Big Table (OBT) Approach: OBT combines facts and dimensions into one wide table—optimized for read-heavy use cases like Power BI or Tableau. It simplifies access and speeds up dashboards but brings trade-offs like duplication and slower ETL. 6️⃣ Choosing the Right Modeling Strategy: ~ Use normalized models (logical → physical) for transactional systems. 
~ Use dimensional models (star/snowflake) for analytics and reporting. ~ Use OBT for self-service BI tools where speed matters more than elegance. ~ For raw ingestion pipelines, minimal modeling or staging layers are preferred. 7️⃣ Modern Tools for Data Modeling: ~ dbt: Manage models as code ~ Snowflake / BigQuery: Schema-on-read for agility ~ Lucidchart / dbdiagram.io: Visual ERDs ~ Apache Hudi / Delta Lake: Handle large-scale physical modeling 📌 Whether you're optimizing a reporting layer or designing for scale—good data modeling is the difference between chaos and clarity. #DataEngineering #DataModeling #ETL #DataWarehouse #AnalyticsEngineering #ModernDataStack #StarSchema #SnowflakeSchema #dbt #PowerBI #BigQuery #DataGovernance #DimensionalModeling #Amigoscode
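To make the star schema versus OBT trade-off tangible, here is a rough sketch using SQLite from Python's standard library. It builds a tiny star schema, materialises a One Big Table from it, and runs the same aggregate both ways. All table names, column names and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY, customer_name TEXT, region TEXT
);
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT
);
CREATE TABLE fact_sales (
    sale_id      INTEGER PRIMARY KEY,
    customer_key INTEGER REFERENCES dim_customer (customer_key),
    product_key  INTEGER REFERENCES dim_product (product_key),
    amount       REAL
);
INSERT INTO dim_customer VALUES (1, 'Acme', 'EMEA'), (2, 'Globex', 'APAC');
INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'), (2, 'Support plan', 'Services');
INSERT INTO fact_sales VALUES (1, 1, 1, 100.0), (2, 1, 2, 40.0), (3, 2, 1, 60.0);

-- One Big Table: facts and dimension attributes flattened into one wide table.
CREATE TABLE obt_sales AS
SELECT f.sale_id, f.amount,
       c.customer_name, c.region,
       p.product_name, p.category
FROM fact_sales f
JOIN dim_customer c ON c.customer_key = f.customer_key
JOIN dim_product  p ON p.product_key  = f.product_key;
""")

# Star schema: the aggregate needs a join to reach the region attribute.
star_result = conn.execute("""
    SELECT c.region, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_customer c ON c.customer_key = f.customer_key
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()

# OBT: same answer with no joins, at the cost of repeated dimension values.
obt_result = conn.execute("""
    SELECT region, SUM(amount)
    FROM obt_sales
    GROUP BY region
    ORDER BY region
""").fetchall()

print(star_result)
print(obt_result)
```

Both queries return the same totals; the OBT version is simpler to read from a BI tool, while the star schema stores each customer and product attribute only once.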
-
Here's a question every data engineer has asked at some point: what's the best way to structure data so it's clean, scalable, and analytics-ready? That's where data modeling comes in. It is the backbone of every well-designed data system. Whether you're designing a warehouse, building pipelines, or optimizing queries, mastering the right modeling technique makes all the difference. Here's the breakdown:
1. Conceptual Data Modeling: Focuses on defining high-level business entities and their relationships without diving into technical details.
2. Logical Data Modeling: Adds structure by defining tables, fields, and relationships while applying normalization rules.
3. Physical Data Modeling: Implements logical designs as actual database schemas, tuned for performance.
4. Star Schema: A simple model with a central fact table linked to multiple dimension tables. A good fit for BI and analytics.
5. Snowflake Schema: An extension of the star schema that normalizes dimension tables to reduce redundancy, at the cost of extra joins.
6. Data Vault Modeling: Ideal for historical data tracking. It separates data into hubs (business keys), links (relationships), and satellites (descriptive, time-stamped attributes) for scalability.
7. Dimensional Modeling: Used in data warehouses; organizes data into facts and dimensions for reporting.
8. Normalized Modeling (3NF): Reduces redundancy and ensures data integrity by following normalization principles.
9. Denormalized Modeling: Speeds up queries by merging tables and duplicating data, trading storage for performance.
10. Anchor Modeling: A flexible and audit-friendly approach that tracks changes over time using anchors, attributes, and ties.
In short: data modeling is not one-size-fits-all. The right technique depends on your use case, scale, and performance goals.
-
Want to know the difference between a junior and senior analytics engineer? It's not just SQL skills—it's mastering the art of data modeling. Most people think data modeling is just "writing SQL transformations". There are more design considerations that go into it. ✅ Facts vs Dimensions - Understanding that facts capture business events while dimensions provide context ✅ Star Schemas - Building central fact tables surrounded by dimension tables to minimize joins and maximize query performance ✅ Slowly Changing Dimensions - Knowing when to overwrite (Type 1) vs. when to preserve history (Type 2) ✅ The Normalization Paradox - Keep your source data clean and normalized (no redundancy), then strategically denormalize for analytics to reduce downstream joins and boost query performance. The reality? Every senior analytics engineer I know didn't just learn these concepts—they practiced them repeatedly until they became second nature. ➡️ What's the most challenging data modeling decision you've faced recently? Drop it in the comments—let's learn from each other's experiences.
-
In my experience, the biggest and least understood barrier to robust data analysis is how data is modeled to capture the core business processes. There are three levels to good, clean data modeling that enable effective analysis:
1. Facts
The foundation is a set of well-defined atomic facts: the transactions or events that capture the core business activity. These should include full contextual information and audit trails. Additionally, proper table and column naming conventions, along with active deprecation of stale datasets, are necessary for a clean and navigable data environment.
2. Metrics
Atomic facts need to be transformed into self-contained datasets that can in turn generate a wide variety of aggregated metrics without the need to inject custom business logic at every turn. Another key aspect of metrics modeling is enabling low-friction creation of metric slices by dimensional attributes. If end users are constantly hopping across a series of joins of varying degrees to calculate basic metrics, then you are weakening the foundation.
3. Metric Relationships
Finally, the true value of data modeling comes when you can inter-relate metrics to capture the underlying business processes. This is the evolving domain of metric trees. This is an area we at HelloTrace are particularly excited about, as it opens up the final data modeling frontier for democratizing analytics across the organization.
With these three components in place (atomic facts, metrics, and metric relationships), analytical rigor will fall into place. By making it easy for organizations to execute data analysis, we can foster natural curiosity rather than making analysis an ordeal to be endured.
-
𝘐𝘴 𝘥𝘢𝘵𝘢 𝘮𝘰𝘥𝘦𝘭𝘪𝘯𝘨 𝘧𝘰𝘳 𝘥𝘢𝘵𝘢 𝘮𝘢𝘳𝘵𝘴 𝘢𝘯𝘥 𝘥𝘢𝘵𝘢 𝘭𝘢𝘬𝘦𝘴 𝘴𝘵𝘪𝘭𝘭 𝘳𝘦𝘭𝘦𝘷𝘢𝘯𝘵? 𝘏𝘰𝘸 𝘥𝘰 𝘺𝘰𝘶 𝘫𝘶𝘴𝘵𝘪𝘧𝘺 𝘪𝘵𝘴 𝘙𝘖𝘐?
A lack of proper data modeling is one of the most impactful forms of tech debt that can slow down data and AI teams. Many data teams incorrectly believe that data modeling is no longer necessary. The temptation to skip modeling in favor of working directly with raw or semi-structured data is real, especially with the rise of cloud platforms, data lakes, and schema-on-read architectures. The benefits of data modeling (improved data quality, faster queries, and reduced rework) are indirect and long-term, making it harder to showcase immediate ROI.
Here are key success measures to track when evaluating the impact of data modeling:
𝟭. 𝗤𝘂𝗲𝗿𝘆 𝗣𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲: Faster query execution translates to quicker insights. Denormalized tables in a star schema, for instance, can significantly speed up aggregations compared to queries spanning multiple normalized tables.
𝟮. 𝗗𝗮𝘁𝗮 𝗤𝘂𝗮𝗹𝗶𝘁𝘆: Improved accuracy and consistency in reports by reducing errors and discrepancies. Without a structured model, inconsistencies like storing "Customer ID" as an integer in one system and a string in another can cause integration failures and flawed insights.
𝟯. 𝗗𝗲𝘃𝗲𝗹𝗼𝗽𝗺𝗲𝗻𝘁 𝗘𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝗰𝘆: Less time spent troubleshooting means more time delivering value. Clear relationships and dependencies in a data model make it easier to trace and resolve root causes, such as upstream schema changes or ETL failures.
𝟰. 𝗖𝗼𝘀𝘁 𝗦𝗮𝘃𝗶𝗻𝗴𝘀: Optimized data structures lower storage and compute expenses. For example, pairing a star schema with columnar formats like Parquet can reduce storage footprint and improve query performance, directly impacting the bottom line.
𝟱. 𝗦𝗰𝗮𝗹𝗮𝗯𝗶𝗹𝗶𝘁𝘆: A well-modeled system simplifies the integration of new data sources. A clear customer data model, for instance, makes it straightforward to map attributes like customer IDs when onboarding a new CRM system, saving both time and effort.
𝘛𝘩𝘦 𝘣𝘰𝘵𝘵𝘰𝘮 𝘭𝘪𝘯𝘦: Data modeling isn’t just a technical exercise—it’s a strategic enabler for analytics, AI, and rapid change management. By tracking these success measures, you can clearly demonstrate its value and align your data strategy with tangible business outcomes. How does your team approach data modeling in today’s fast-paced, AI-first world? #DataModeling #DataLakes #AI #Analytics #DataEngineering #Scalability #DataQuality #ROI