Managing Dynamic Catalogs

Summary

Managing dynamic catalogs means organizing, updating, and controlling collections of data or products that frequently change, such as e-commerce listings or IT service definitions. This process helps businesses ensure their catalogs are accurate, searchable, and easy to maintain, whether it's for online stores, IT services, or large-scale data platforms.

  • Specify storage locations: Always set clear storage paths for catalogs and schemas to maintain proper data isolation and ownership.
  • Streamline updates: Use automation and smart filtering to keep product or service catalogs current and prevent outdated or incorrect information from impacting customers.
  • Maintain clear naming: Establish consistent naming conventions and attributes to make your catalog searchable and bring clarity to your team’s work.
Summarized by AI based on LinkedIn member posts
  • View profile for Shashank Shekhar

    Lead Data Engineer | Solutions Lead | AI-Native Engineering Chapter Lead | Databricks MVP

    6,710 followers

    When I first started working with Databricks Unity Catalog, I wish someone had told me this simple but crucial detail about catalog provisioning and storage locations. If you’re about to set up a new catalog and the objects underneath it, here’s what you need to know:

    ⁉️ What really happens when you create a new catalog?
    💡 Sure, a new catalog gets registered in Unity Catalog. But if you don’t specify a storage root (location) during creation, Databricks will make it a managed catalog, and all your data will automatically land in the default metastore storage location.

    ⁉️ Why does this matter?
    💡 Let’s say you move on to create a schema, again without specifying a storage path. That schema will also default to the metastore location. And when you create tables under that schema, unless you explicitly set a location, those tables will again be managed tables, stored in the central metastore location.

    🤷‍♂️ The hidden impact: If you’re building a data mesh or want clear data ownership boundaries, this can be a big deal. All your data across different catalogs, schemas, and tables ends up in a single, central storage account that you might not fully control. This can complicate data governance, access control, and cost allocation down the road. It can also send too many API calls to the same storage account, which can lead to throttling, because Azure enforces scalability targets (limits) on requests per second for storage accounts.

    ✅ My tips on best practices for catalogs and schemas:
    👉 Always specify the storage location when creating catalogs and schemas if you want true data isolation and ownership.
    👉 Review your Unity Catalog setup to ensure your data lands where you expect it to!
    ☘️ Irrespective of the type of tables (managed or external) you're provisioning, make sure they land in the appropriate storage account; otherwise, migrating them in the future will be a hell of a task.

    ✅ My tips on best practices to avoid throttling (plan ahead):
    👉 Use multiple storage accounts for different catalogs, domains, or high-traffic workloads.
    👉 For blob storage, organise your catalogs, schemas and tables in a well-defined hierarchy.

    ⚠️ Trust me! Beyond the points above, getting this wrong will cause problems if your team ever plans to migrate its UC external tables to managed ones (I'll talk about it in a future post 😉).

    #Databricks #UnityCatalog #DataGovernance #DataEngineering #BestPractices
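
The advice above (always give catalogs and schemas an explicit storage location) can be sketched as a small Python helper that builds the corresponding Databricks SQL statements. This is a minimal sketch, not the author's setup: the catalog name, schema names, and `abfss://` URLs are hypothetical, and it assumes each path is already covered by an external location you are allowed to use.

```python
from typing import Dict, List


def quote_ident(name: str) -> str:
    """Backtick-quote an identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def check_location(url: str) -> str:
    """Reject locations that could break out of a single-quoted SQL literal."""
    if "'" in url or "\\" in url:
        raise ValueError(f"unsupported character in location: {url!r}")
    return url


def provisioning_statements(
    catalog: str, catalog_root: str, schema_roots: Dict[str, str]
) -> List[str]:
    """Build CREATE statements that pin the catalog and each schema to explicit storage."""
    stmts = [
        f"CREATE CATALOG IF NOT EXISTS {quote_ident(catalog)} "
        f"MANAGED LOCATION '{check_location(catalog_root)}'"
    ]
    for schema, root in schema_roots.items():
        stmts.append(
            f"CREATE SCHEMA IF NOT EXISTS {quote_ident(catalog)}.{quote_ident(schema)} "
            f"MANAGED LOCATION '{check_location(root)}'"
        )
    return stmts


# Hypothetical storage account and containers, for illustration only.
for stmt in provisioning_statements(
    "sales",
    "abfss://catalog@examplesalesacct.dfs.core.windows.net/sales",
    {"bronze": "abfss://bronze@examplesalesacct.dfs.core.windows.net/sales"},
):
    print(stmt)
```

Generating the statements from one mapping keeps the storage layout reviewable in a single place, which makes it easier to spot a catalog or schema that would otherwise fall back to the metastore default.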

  • View profile for Andreas Kretz

    I teach Data Engineering and create data & AI content | 15+ years of experience | 3x LinkedIn Top Voice | 230k+ YouTube subscribers

    159,215 followers

    For Data Engineers, Databricks Unity Catalog is the secret to managing data at scale across teams, clouds, and projects. But what exactly is behind that term?

    Unity Catalog is the unified governance layer for all your data and assets. Tables, files, notebooks, ML models, you name it.

    It’s not just another feature; it’s a complete framework for managing your data platform at scale, making data usable, secure, and trustworthy across your whole platform.

    Here’s the core idea:
    - Metastore: A single source of truth for metadata. One per region, no more per-workspace metastores.
    - Catalogs: Group your data by business domains → sales, marketing, operations.
    - Schemas: Organize data logically → Bronze, Silver, Gold layers.
    - Tables & Views: Where your data lives. Structured, secure, discoverable.
    - External Locations: Securely link cloud storage with access policies via Storage Credentials.

    But it’s not just structure. Here’s what Unity Catalog really brings to the table:
    ➡️ Centralized access control: Manage permissions across all Databricks workspaces. No more messy, scattered permission settings.
    ➡️ Fine-grained security: Control access down to the column or even row level. Perfect for sensitive data (PII, anyone?).
    ➡️ Multi-format support: Delta, Iceberg, Hudi. Work with the formats your team needs, no vendor lock-in.
    ➡️ Real-time lineage: See exactly how data flows: from Bronze to Silver to Gold. Great for debugging, impact analysis, and compliance.
    ➡️ End-to-end lineage: Trace data from ingestion to final report, automatically updated in real time.
    and more...

    For Data Engineers, Unity Catalog means fewer headaches and more confidence:
    - No more separate metastores per workspace.
    - Clear separation of storage (where data lives) and metadata (who can access what, how, and why).
    - Full traceability of every transformation, whether it’s a small type cast or a complex data model change.

    In my Azure Databricks project, we put Unity Catalog into practice! Here, you'll:
    ➡️ Set up storage credentials to access Azure Data Lake securely.
    ➡️ Create external locations for raw and processed data.
    ➡️ Organize data in a 3-level namespace: catalog.schema.table → aligned with Medallion Architecture.
    ➡️ Control access: Business users only see Gold, engineers get access to Silver/Bronze for transformation.

    This is how we make complex data systems manageable. And this is what modern Data Engineers need to build.

    🎓 Want to learn how it works, step by step? Check the project link in the comments! 👇
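
The access split described in the post (business users see only Gold, engineers work in Bronze and Silver) could be expressed as Unity Catalog `GRANT` statements generated from a small role table. The sketch below is an illustration under assumptions: the catalog name `sales`, the group names, and the choice of `SELECT`/`MODIFY` privileges are hypothetical and not taken from the author's project.

```python
from dataclasses import dataclass
from typing import List, Tuple

LAYERS = ("bronze", "silver", "gold")  # medallion schemas inside one catalog


@dataclass(frozen=True)
class Role:
    principal: str               # hypothetical account group name
    layers: Tuple[str, ...]      # schemas this group may use
    privileges: Tuple[str, ...]  # data privileges granted on those schemas


def grant_statements(catalog: str, roles: List[Role]) -> List[str]:
    """Build GRANT statements giving each role access to its layers only."""
    stmts = []
    for role in roles:
        # USE CATALOG is needed before anything inside the catalog is reachable.
        stmts.append(f"GRANT USE CATALOG ON CATALOG `{catalog}` TO `{role.principal}`")
        for layer in role.layers:
            if layer not in LAYERS:
                raise ValueError(f"unknown layer: {layer}")
            for privilege in ("USE SCHEMA",) + role.privileges:
                stmts.append(
                    f"GRANT {privilege} ON SCHEMA `{catalog}`.`{layer}` "
                    f"TO `{role.principal}`"
                )
    return stmts


roles = [
    Role("business_users", ("gold",), ("SELECT",)),
    Role("data_engineers", ("bronze", "silver"), ("SELECT", "MODIFY")),
]
for stmt in grant_statements("sales", roles):
    print(stmt)
```

Granting at the schema level relies on privileges being inherited by the tables underneath, so new Gold tables become visible to business users without a per-table grant.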

  • View profile for Chris Good

    Service Optimisation Managing Consultant at Mason Advisory

    3,876 followers

    Most organisations say they have a Service Catalogue. In reality, they usually have one of three things:
    • An application list
    • A Service Desk request catalogue
    • A spreadsheet someone once called “services”

    None of these are actually a Service Catalogue. When done properly, a Service Catalogue becomes the backbone of effective Service Management - informing service design, service levels, resilience planning, costing/charging, and support models.

    Here’s the simple approach we use with clients to build one that actually works:

    1. Start with a clear definition of a Service
    ITIL defines a service as “a means of delivering value to customers by facilitating outcomes customers want to achieve.” That’s correct… but rarely helpful in practice. The first step is turning that definition into something clear and practical for your organisation.

    2. Define the different Service types
    Most organisations actually have around seven different types of services. Each type drives different Service Management behaviours, e.g.:
    • Service Design requirements
    • Support models
    • Disaster recovery expectations
    Without this clarity, catalogues quickly become inconsistent.

    3. Identify real examples
    Theory alone doesn’t work. We normally identify client-specific examples for each service type so stakeholders can immediately see how their services fit into the model. This is often the moment where the catalogue starts to make sense.

    4. Define the attributes for every Service
    A good catalogue should remove the need for multiple disconnected lists. Example attributes:
    • Service Owner
    • Purpose
    • Business criticality
    • Impact of outage
    • Information classification
    • Service levels
    • Disaster recovery approach
    We recommend ~35 attributes - comprehensive but still manageable.

    5. Establish naming conventions
    Service naming seems trivial… until you have three teams calling the same thing three different names. A clear naming convention makes the catalogue usable and searchable.

    6. Populate the catalogue
    Using the examples identified earlier, complete the attributes and finalise the service names. This becomes the initial working catalogue.

    7. Gather any existing “catalogues”
    Most organisations already have several:
    • application lists
    • CMDB exports
    • spreadsheets maintained by teams
    These are useful inputs, but they usually require significant rationalisation.

    8. Apply the agreed definitions and principles
    Finally, reconcile the existing lists against the new model to produce a validated, consistent Service Catalogue.

    A well-designed Service Catalogue doesn’t just document services. It brings clarity to how IT actually delivers value to the organisation.

    If you’re struggling with a Service Catalogue that no one trusts or uses, feel free to message me. I’d be happy to share blueprints.
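
Steps 4 to 6 above (define attributes, agree a naming convention, populate) can be sketched as a small data model with two checks. This is only an illustration: it models the seven example attributes from the post rather than the full recommended set, and the naming rule (Title Case words ending in "Service") is a hypothetical convention, not the author's.

```python
import re
from dataclasses import dataclass, fields
from typing import List

# Hypothetical convention: Title Case words ending in "Service".
NAME_PATTERN = re.compile(r"^[A-Z][a-z]+( [A-Z][a-z]+)* Service$")


@dataclass
class ServiceEntry:
    name: str
    service_owner: str = ""
    purpose: str = ""
    business_criticality: str = ""
    impact_of_outage: str = ""
    information_classification: str = ""
    service_levels: str = ""
    disaster_recovery_approach: str = ""


def follows_naming_convention(name: str) -> bool:
    """Return True when the service name matches the agreed pattern."""
    return NAME_PATTERN.fullmatch(name) is not None


def missing_attributes(entry: ServiceEntry) -> List[str]:
    """List attribute names that are still blank for this entry."""
    return [f.name for f in fields(entry) if not getattr(entry, f.name).strip()]


entry = ServiceEntry(name="Email Service", service_owner="J. Doe", purpose="Staff email")
print(follows_naming_convention(entry.name))
print(missing_attributes(entry))
```

Running both checks over every entry while populating the catalogue gives a simple completeness report, and the same checks can be reused later when reconciling existing lists against the model.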

  • View profile for Christian E. N. Hansen

    Databricks Champion | Lead Data & Machine Learning Engineer at Halfspace - A Data & AI Company

    2,081 followers

    You can now create catalogs directly in Databricks Asset Bundles! That means catalogs join the list of resources that can live inside your databricks.yml:
    • Catalog definitions
    • Schema creation
    • Grants and permissions
    • Metadata and properties

    Instead of creating these through Terraform or manual SQL, you can now deploy them together with the rest of your data product. This has some nice benefits:
    📦 Catalogs and schemas live closer to the code that uses them
    🔁 Everything is version-controlled in the same repository
    🚀 Environments become easier to reproduce
    🔐 Permissions can be defined alongside pipelines and workflows
    🧩 Less infrastructure code for data-product level resources

    Terraform still makes sense for workspace infrastructure, networking, or metastores. But for data-product boundaries, managing catalogs and schemas directly in Databricks Asset Bundles feels like a very natural fit.

    I wrote a short Medium post with a few practical examples of how to define catalogs and schemas in a bundle, including deployment snippets and screenshots. 🔗 https://lnkd.in/esXVUthz

    #Databricks #DataEngineering #UnityCatalog #DataPlatform #DataGovernance #AnalyticsEngineering

  • View profile for Dipankar Mazumdar

    Director, Data/AI @Cloudera | Apache Iceberg, Hudi Contributor | Author of “Engineering Lakehouses”

    18,086 followers

    Catalogs in Iceberg and NOW Delta Lake... Delta Lake has introduced new "catalog-managed" tables in version 4.1.0.

    Delta tables have always been filesystem-managed. The _delta_log/ ("transaction log") in object storage is the authority for table state - every commit appends a new JSON log file, and readers determine the latest version by listing and reading files from storage. Concurrency relies on atomic filesystem operations (like rename semantics) provided by the underlying storage system.

    Now compare that to Apache Iceberg, where the 'catalog' is one of the most critical components. It holds a single authoritative pointer to the current metadata file. A write operation generates a new metadata tree and then performs an atomic compare-and-swap (CAS) against the catalog. Readers resolve table state by asking the catalog for the current metadata location - not by scanning storage. That separation moves transaction coordination out of the object store and into a dedicated control plane.

    Delta Lake is now making that architectural shift as well. When the catalog is authoritative:
    - Concurrency semantics become explicit instead of storage-dependent
    - Governance enforcement moves above the object layer
    - Metadata resolution avoids heavy filesystem interaction
    - Multi-engine interoperability becomes structurally cleaner

    I think back to the time when the mandatory use of 'catalogs' was questioned a lot in the Iceberg world, since it was another component to add to the stack. My point was - lakehouses are database systems, and they need a catalog. Great to see the ecosystem now realizing the importance of this.

    #dataengineering #softwareengineering
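
The compare-and-swap commit described above can be illustrated with a toy in-memory catalog. This is a sketch of the general technique only: the class, method names, and metadata file names are hypothetical, and a real catalog would persist the pointer in a transactional store rather than behind a process-local lock.

```python
import threading
from typing import Dict, Optional


class InMemoryCatalog:
    """Toy catalog holding one authoritative metadata pointer per table."""

    def __init__(self) -> None:
        self._pointers: Dict[str, str] = {}
        self._lock = threading.Lock()

    def current(self, table: str) -> Optional[str]:
        """Readers ask the catalog for the current metadata location."""
        with self._lock:
            return self._pointers.get(table)

    def compare_and_swap(self, table: str, expected: Optional[str], new: str) -> bool:
        """Swap the pointer only if it still equals what the writer started from."""
        with self._lock:
            if self._pointers.get(table) != expected:
                return False  # another writer committed first
            self._pointers[table] = new
            return True


catalog = InMemoryCatalog()
catalog.compare_and_swap("db.orders", None, "v1.metadata.json")

# Two writers start from the same base version.
base = catalog.current("db.orders")
first = catalog.compare_and_swap("db.orders", base, "v2-writer-a.metadata.json")
second = catalog.compare_and_swap("db.orders", base, "v2-writer-b.metadata.json")
print(first, second, catalog.current("db.orders"))
```

The losing writer has to re-read the current pointer, rebase its changes on it, and retry, which is what makes the concurrency semantics explicit rather than dependent on storage behaviour.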

  • View profile for Julia Førde

    Databricks MVP | Senior Consultant | Architect / Senior Data engineer

    4,513 followers

    🎄 Favorite Databricks Releases for December 🎄

    Better late than never, here are my favorite releases from December! Time flies, and I am already excited about digging deeper into the releases from January.

    🔑 Unity Catalog MANAGE Privilege (Public Preview): Admins can now change ownership and manage permissions without being the current owner. This is a huge time-saver, especially in organizations where ownership changes frequently. No more waiting on the original owner to make adjustments, which is perfect for teams with dynamic roles. I know this is a time-saver for my team and many others.

    🌐 Unity Catalog Federates to Hive Metastores and AWS Glue (GA): Federation allows Unity Catalog to connect to external systems' metadata, and this release expands it to include Hive Metastore and AWS Glue, enabling unified governance across multiple platforms. This is an improvement for companies that work with AWS Glue or still have legacy Hive metastore usage. It means that the metadata from these systems is visible directly in Unity Catalog, and you can enforce governance and access controls on that metadata without moving or duplicating data.

    🔒 Credential Vending (Public Preview): Credential vending in Unity Catalog allows external systems to securely access data by generating temporary credentials for reading data from Unity Catalog external locations. These credentials are created on demand, granting time-scoped access to specific storage locations. Credential vending is supported for external systems that can connect via the Unity REST API and Iceberg REST catalog. It is primarily designed for read-only access to Unity Catalog data. This ensures that external systems can interact with the data without compromising governance and metadata management.

    📁 Catalog-Level Storage Isolation (GA): You can now enforce storage isolation at the catalog level. This means that if you have already set a default storage path on the metastore, you can now change that for the whole metastore, or even just for specific catalogs, giving you more control over data storage management. This improvement ensures better security, compliance, and flexibility across different datasets and departments.

    🌉 Cross-Platform View Sharing (Public Preview): Delta Sharing now supports sharing views in addition to Delta tables across platforms. This enables teams to collaborate securely on specific datasets or insights while keeping raw data private. This is useful if you want to expose aggregated or filtered data without sharing the raw data.

    📚 Read more: https://lnkd.in/dgYDzEeH

    What are your favorite(s) from last month?

    #DatabricksMVP #Databricks #UnityCatalog #SqlEditor #DataWarehousing

  • View profile for Amine Kaabachi

    Solutions Architect @Databricks | Architecture SME

    5,072 followers

    I’ve seen many organizations struggle with structuring catalogs within Unity Catalog! 👇 👇 👇 I would like to present my recommended approach for designing data domain catalogs. To establish a strong foundation, I recommend starting with three core catalogs: 🌀Sources: This catalog should contain domain-specific, source-aligned data. 🌀Derived: A flexible catalog for transformed, general-purpose data, supporting a wide range of applications. 🌀Customer-aligned: Here, you can focus on consumer-aligned data, optimized for specific use cases. In addition, I recommend creating two supplementary catalogs: 🌀Published: This catalog is vital for publishing data products and enforcing contracts on datasets, ensuring compliance, access control, and efficient data distribution. 🌀Sandbox: A dynamic space that enables ad-hoc analytics and exploration, providing a flexible environment for real-time data analysis and experimentation. 🎁 Each catalog can accommodate up to 10,000 schemas, which allows you to structure your data environments based on scale. If you anticipate exceeding this limit, you can create catalogs per environment or distribute data across multiple catalogs of the same type. 🎁 In a given region, you have a limit of 1,000 catalogs overall (though this is not a hard limit). Therefore, it’s essential to maintain an optimal schemas-per-catalog ratio in your design to maximize efficiency. 🎁 It’s also important to note that Sources, Derived, and Customer-aligned catalogs are not synonymous with Bronze, Silver, and Gold layers. For instance, Gold data might reside in a Sources catalog if you need to share that data directly with external users. If you’ve implemented a different structure for Unity Catalog in your organization, I’d love to hear about your approach. 
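
The sizing guidance above (split a catalog type across several catalogs when the schema count would exceed the per-catalog limit) can be sketched as a small planning helper. The two limits are the figures quoted in the post; the function names and the `_01`, `_02` suffix scheme are hypothetical choices for illustration.

```python
import math
from typing import Dict, List

SCHEMAS_PER_CATALOG_LIMIT = 10_000  # figure quoted in the post
CATALOGS_PER_REGION_LIMIT = 1_000   # figure quoted in the post (described as not a hard limit)


def catalogs_needed(schema_count: int, limit: int = SCHEMAS_PER_CATALOG_LIMIT) -> int:
    """Number of catalogs required so that no catalog exceeds the schema limit."""
    return max(1, math.ceil(schema_count / limit))


def plan_catalogs(expected_schemas: Dict[str, int]) -> Dict[str, List[str]]:
    """Map each catalog type to the catalog names needed for its expected schema count."""
    plan = {}
    for kind, count in expected_schemas.items():
        n = catalogs_needed(count)
        # A single catalog keeps the plain name; several get numbered suffixes.
        plan[kind] = [kind] if n == 1 else [f"{kind}_{i:02d}" for i in range(1, n + 1)]
    total = sum(len(names) for names in plan.values())
    if total > CATALOGS_PER_REGION_LIMIT:
        raise ValueError(f"plan needs {total} catalogs, above the regional limit")
    return plan


print(plan_catalogs({"sources": 25_000, "derived": 400, "sandbox": 0}))
```

A check like this is most useful early in the design, when expected schema counts are still estimates and the catalog layout is cheap to change.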

  • View profile for Bilal B.

    Data Engineering and Operations Lead & Manager

    7,267 followers

    🚀 Databricks Unity Catalog Managed Tables: What's Current and Coming Deep Dive 🌟

    Databricks Unity Catalog is transforming how organizations manage their data with features that enhance interoperability, observability, and performance. Here are key aspects of Unity Catalog Managed Tables based on the latest current and coming updates.

    🔄 Seamless Support for Legacy Readers
    A standout feature of Unity Catalog Managed Tables is seamless support for legacy readers. This capability addresses a common challenge: ensuring modern table features, which enhance performance and usability, remain accessible to older systems.
    • Uniform Metadata Generation 📊: Unity Catalog generates appropriate metadata for both modern and outdated clients, allowing legacy systems to read the latest tables without compatibility issues.
    • End of Version Stagnation 🚫: By enabling legacy clients to access updated tables, organizations can move away from the practice of limiting all tables to version one solely for compatibility reasons. This promotes a more dynamic data environment where innovation can thrive.

    👀 Enhanced Observability
    Observability is crucial for effective data management, and Unity Catalog is making strides:
    • Predictive Optimization 📈: Users benefit from an out-of-the-box system table that provides insights into operations performed on managed tables, helping users understand performance benefits easily.
    • Dashboard Features 📊: An intuitive dashboard will visualize key metrics related to table performance, simplifying the tracking of changes and optimizing data usage.
    • Delta Table System Table 📋: A dedicated system table will be generated automatically for all UC managed tables, providing comprehensive details about each managed table, including size metrics and query frequency.

    ⚙️ Automated Performance Improvements
    Unity Catalog enhances performance while reducing costs through automation:
    • Liquid Clustering 💧: Automated liquid clustering techniques will optimize data layout, improving access speed.
    • Automated Compression and Archival 📦: Data that is infrequently accessed can be automatically compressed or archived to lower-cost storage tiers. This not only saves costs but also ensures that resources are allocated efficiently.
    • Automated Time-to-Live (TTL) ⏳: Users can set rules for automatic deletion of outdated data. For instance, rows older than 90 days can be automatically purged from the system, streamlining data management processes.

    🌐 Unified Data Management Approach
    The goal of Unity Catalog Managed Tables is to create a unified and open data management framework:
    • Single Copy of Data 🗃️: Unity Catalog aims to maintain one authoritative copy of data accessible by various tools seamlessly.
    • Ease of Use 🙌: The experience is designed to be hands-free while allowing users to observe and manage their data effectively, enabling focus on deriving insights rather than technical details.

    #Databricks #UnityCatalog #ManagedTables #Current #Coming
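
The 90-day time-to-live rule mentioned above can be illustrated in plain Python. This sketch only shows the retention logic itself; it is not a Databricks API, and the row shape and the `event_time` field name are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def apply_ttl(
    rows: List[Dict],
    now: datetime,
    ttl: timedelta = timedelta(days=90),
    ts_key: str = "event_time",
) -> List[Dict]:
    """Keep rows whose timestamp is within the TTL window; older rows are dropped."""
    cutoff = now - ttl
    return [row for row in rows if row[ts_key] >= cutoff]


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
rows = [
    {"id": 1, "event_time": now - timedelta(days=10)},
    {"id": 2, "event_time": now - timedelta(days=91)},
]
print([row["id"] for row in apply_ttl(rows, now)])
```

Passing `now` explicitly instead of reading the clock inside the function keeps the rule deterministic and easy to test.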
