⭐️We are pleased to announce that Professor Themis Palpanas gave a talk on high-dimensional similarity search at Archimedes Talks: "Scalable Vector Analytics: A Story of Twists and Turns."

📊The talk explored the evolution of similarity search over the past fifty years, highlighting why this classic data management problem is more relevant and challenging than ever, from time-series management systems to modern vector databases. It presented state-of-the-art solutions spanning exact and approximate search, in-memory and on-disk methods, and techniques ranging from LSH and product quantization to k-NN graphs and optimized linear scans. A key insight was how recent advances in time-series similarity search now achieve state-of-the-art performance even for general high-dimensional vector data.

We are proud to have contributed to this important discussion and to the ongoing exploration of open research directions. https://lnkd.in/dSvng7wr

The project is implemented under the National Recovery and Resilience Plan "Greece 2.0", with funding from the European Union – NextGenerationEU
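None of the methods mentioned above reduces to a one-liner, but the simplest of them, the optimized linear scan, is easy to sketch. Below is a minimal, illustrative NumPy version of exact k-NN search via a single vectorized scan; function and variable names are my own, not from the talk:

```python
import numpy as np

def knn_linear_scan(data, query, k=3):
    """Exact k-nearest-neighbor search via a vectorized linear scan:
    one distance computation over the whole dataset, then a partial sort."""
    # Squared Euclidean distance from the query to every stored vector.
    dists = np.sum((data - query) ** 2, axis=1)
    # argpartition finds the k smallest in O(n); sort only those k afterwards.
    idx = np.argpartition(dists, k)[:k]
    return idx[np.argsort(dists[idx])]

# Toy example: 1,000 random 64-dimensional vectors.
rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 64))
query = data[42] + 0.01 * rng.standard_normal(64)  # slightly perturbed copy
print(knn_linear_scan(data, query, k=3)[0])  # → 42, the perturbed source vector
```

Approximate methods (LSH, product quantization, k-NN graphs) exist precisely because this exact scan, even vectorized, grows linearly with collection size.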
HAR.S.H. (Hardware-Aware extReme-scale Similarity search)’s Post
More Relevant Posts
-
Data science is a rapidly evolving field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. As the volume, variety, and velocity of data continue to grow, data scientists need sophisticated tools to handle and analyze it. https://lnkd.in/dbNzzVhQ
-
"It’s a knowledge architecture problem. And knowledge architecture is an entirely different discipline, requiring different skills, than defining metric definitions. It’s closer to what librarians, taxonomists, and knowledge engineers have done for decades, and is a drastically different workflow than what data teams typically do today. It requires understanding how to formally represent domain knowledge, skills that fields like library science, information architecture, and semantic web research have been developing for years." https://lnkd.in/dH4nGUes
-
Choosing the right database is a systems decision, not a preference. Most data problems don't come from "the wrong tool." They come from using the right tool in the wrong context. Here's a practical way to think about database choices, based on what the system actually needs to do.

OLTP systems: best when you need fast, reliable transactions with strong consistency. Think orders, payments, user profiles. High write volume, low latency, clear integrity guarantees.

OLAP systems: built for complex queries and analytical workloads. Large scans, aggregations, historical analysis. Optimised for insight, not individual transactions.

Full-text search engines: designed for fast, flexible text queries. Relevance scoring, fuzzy matching, language support. Use them when search quality matters more than strict consistency.

Document stores: great when your data is semi-structured and evolving. JSON-like documents, flexible schemas, easy iteration. Useful for APIs and rapidly changing domains.

Key-value stores: simple data models and extremely fast lookups. Perfect for caching, session storage, feature flags, and counters.

Graph databases: built for highly connected data, where relationships are first-class citizens. Use them when traversals matter more than rows and columns.

Vector databases (embeddings): optimised for similarity search. Critical for modern AI use cases like recommendations, retrieval-augmented generation, and semantic search.

Geospatial databases: specialised for location-based queries. Distance calculations, spatial joins, geofencing. Essential when "where" is part of the question.

The key takeaway: there is no "best" database. There is only the database that matches your access patterns, consistency needs, and failure tolerance. Strong data engineers don't ask, "What database should we use?" They ask, "What problem are we actually solving?" If this kind of systems-level thinking helps, I share deeper breakdowns like this in my newsletter.
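As a toy illustration of the key-value pattern described above (caching, sessions, feature flags): a minimal in-process sketch with per-entry expiry. Real systems would use Redis or Memcached; the class name here is purely illustrative.

```python
import time

class TTLCache:
    """Minimal in-process key-value store with per-entry expiry --
    the access pattern key-value databases are built to serve at scale."""
    def __init__(self, ttl_seconds=60.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expiry timestamp)

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)

    def get(self, key, default=None):
        entry = self._store.get(key)
        if entry is None:
            return default
        value, expires = entry
        if time.monotonic() >= expires:
            del self._store[key]  # lazily evict expired entries
            return default
        return value

cache = TTLCache(ttl_seconds=0.05)
cache.set("session:42", {"user": "ada"})
print(cache.get("session:42"))  # {'user': 'ada'}
time.sleep(0.06)
print(cache.get("session:42"))  # None: the entry has expired
```

The same get/set-with-TTL interface is what you would call on a real key-value store; only the durability and distribution guarantees change.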
Which database decision caused the most pain in a system you’ve worked on? Choosing the right database is just one decision in a long chain. I write a short weekly newsletter on data systems, AI foundations, and engineering judgment. 👉 Subscribe to keep learning - https://lnkd.in/eFPw_cd5 #DataEngineering #Databases #DataArchitecture #ModernDataStack #AIEngineering #TechNewsletter
-
Why Vector Databases belong in your Modern Data Stack

TL;DR: Vector databases are no longer just for data scientists; they are becoming a critical component of modern data pipelines, alongside Data Lakes and Warehouses. Data engineers need to evolve beyond standard SQL to master semantic search, high-dimensional indexing, and hybrid search (combining keywords and vectors). Ultimately, integrating vector DBs is essential for powering AI memory and RAG, requiring a new focus on efficiency and scalability in data architecture.

If you think Vector Databases are just for Data Scientists, think again. The architecture of the modern data pipeline is evolving, and "Embedding Stores" are quickly becoming a core component alongside the Data Lake and Warehouse.

Why should Data Engineers adapt?

1. The shift to semantic search: traditional keyword search (BM25) fails to capture intent. Vector DBs allow us to query by meaning.

2. Handling high dimensionality: storing and querying billions of vectors requires a different approach to indexing and partitioning than standard B-trees.

3. Hybrid search: the future isn't just vector; it's hybrid (keyword + vector). Implementing it requires deep knowledge of both traditional search engines (like Elastic/Solr) and modern vector indices.

The takeaway: don't just learn how to SELECT *. Learn how to generate embeddings, manage vector indices, and optimize for Approximate Nearest Neighbor (ANN) search.

Vector databases serve as the long-term memory for AI, storing data as high-dimensional vectors that represent semantic meaning rather than simple keywords. This allows models to retrieve relevant context from vast datasets, powering applications like Retrieval-Augmented Generation (RAG), where the AI answers questions based on specific, private information.

The tools may change, but the engineering principles remain: efficiency, scalability, and reliability.
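The "query by meaning" idea above boils down to one operation: ranking stored embeddings by similarity to a query embedding. A minimal NumPy sketch follows; the toy 3-dimensional "embeddings" are illustrative (real ones have hundreds of learned dimensions), and production systems replace this exact scan with ANN indices such as HNSW or IVF:

```python
import numpy as np

def cosine_top_k(doc_vecs, query_vec, k=2):
    """Rank documents by cosine similarity to a query embedding --
    the core operation a vector database indexes at scale."""
    doc_norms = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    sims = doc_norms @ q                 # cosine similarity per document
    top = np.argsort(-sims)[:k]          # highest-similarity documents first
    return list(zip(top.tolist(), sims[top].tolist()))

# Toy 3-dimensional "embeddings": rows are documents.
docs = np.array([[1.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0],
                 [0.0, 1.0, 0.0]])
query = np.array([1.0, 0.05, 0.0])
print(cosine_top_k(docs, query, k=2))  # doc 0 ranks first, then doc 1
```

A hybrid system would merge this ranking with a BM25 keyword ranking (e.g. via reciprocal rank fusion) before returning results.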
What’s your experience integrating Vector DBs into existing pipelines? #DataArchitecture #VectorDB #MachineLearning #DataOps #SoftwareEngineering
-
I've been struggling with this idea for a long time: how do you explain and defend data science to folks who don't already know it? I think I finally have the right framing. Data science, and indeed science, is about learning from and adjusting for the past and the present. Essentially, data science allows you to understand what happened and why it happened, if you are careful. There is a ton of art in the field, which is the challenge, because it means it cannot be completely engineered away.

So how do you do data science, and what tools do you use?

Recency: for answering what's happening now, or how to react quickly to recent information. This is where constrained recommender systems are ideal; a multi-armed bandit, or some other dynamical system that reweights probabilities before recommendation, is best. This class of algorithms is "greedy".

Long-term trends: for answering questions about how things typically behave. For systems that don't have a ton of big shocks, but enough that coding a deterministic system is either impossible or very difficult, a statistical model is really the best thing. The inductive biases make strong assumptions that allow you to carefully reason about what your data is telling you, and to optimize for your measure of interest, subject to behavior.

Long-term trends with fads: if there are short-term structural shocks, like a big shift up or down in demand, hierarchical models with a sub-model can really help. Fine-tuning on the last three months or so, then using your long-term data for the big trends, can work exceptionally well.

If I hit 100 likes on this, I'll turn it into a book. I think there could be a lot of great ideas in this framing, but only if the community thinks this is helpful.
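The "greedy" recency tool mentioned above, a multi-armed bandit, fits in a few lines. This is a generic epsilon-greedy sketch with made-up reward rates, not anyone's production recommender:

```python
import random

class EpsilonGreedyBandit:
    """Epsilon-greedy multi-armed bandit: mostly exploit the best-looking
    arm, occasionally explore, reweighting estimates after every reward."""
    def __init__(self, n_arms, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms  # running mean reward per arm
        self.rng = random.Random(seed)

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))  # explore
        return max(range(len(self.counts)), key=lambda a: self.values[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental mean: shifts the arm's estimate toward recent rewards.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

bandit = EpsilonGreedyBandit(n_arms=3, epsilon=0.1)
true_rates = [0.2, 0.8, 0.4]  # arm 1 is actually the best choice
for _ in range(2000):
    arm = bandit.select()
    bandit.update(arm, 1.0 if bandit.rng.random() < true_rates[arm] else 0.0)
print(bandit.values)  # arm 1's estimate should settle near 0.8
```

In a recommender, "arms" would be items or strategies and "reward" a click or conversion; the greediness is exactly why this class reacts fast to recent information.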
-
When data science projects break, we often think about technical issues: weak models, noisy data, insufficient features. In fact, many failures stem from methodological weaknesses.

A common pattern is jumping straight to the latest or largest model, hoping for the best performance. This approach is expensive and often unnecessary. Worse, it bypasses a critical step that I like to call "taming the data": understanding how it behaves and identifying potential pitfalls. Teams that adopt a stepwise approach often save time and money down the road. This isn't just about exploratory analysis. It's about decomposing a problem into smaller steps and starting with simpler, more manageable models before adding complexity.

The same applies to bias and leakage. Many issues don't come from a single obvious blunder, but from a series of design choices that look perfectly reasonable at the time: how data is split or shuffled, how features are encoded. Bias accumulates line by line as the pipeline is built. Treating bias awareness as a design reflex rather than a post-hoc validation step is often what makes the difference between fragile and robust results.

There is no neutral pipeline. Preprocessing, feature engineering, and metric selection all encode assumptions about what we're looking at and how success is defined. This isn't a flaw; it's a constraint that requires awareness.

Coming from experimental research, this incremental, analytical, and skeptical approach was drilled into me early. I've found it translates well to applied data science. In short: data science tends to work best when treated as a hypothesis-testing discipline, not a model-shopping exercise, even in the era of low-code and plug-and-play ML.
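One concrete instance of the "reasonable-looking design choice" that leaks: computing preprocessing statistics before splitting the data. A small NumPy sketch with made-up data; the safe version fits the scaler on the training rows only:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # synthetic feature matrix

# Split FIRST, then fit any preprocessing on the training portion only.
train, test = X[:80], X[80:]

# Leaky version: statistics computed over ALL rows, test set included.
leaky_mean = X.mean(axis=0)

# Safe version: statistics computed from the training rows alone.
safe_mean, safe_std = train.mean(axis=0), train.std(axis=0)
test_scaled = (test - safe_mean) / safe_std  # test never touches the scaler fit

# The two means differ: the leaky scaler quietly absorbed test information.
print(np.allclose(leaky_mean, safe_mean))  # False
```

The same split-then-fit reflex applies to imputation, encoding, and feature selection; each is a line where bias can accumulate unnoticed.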
-
The FAIR Principles for Modern Data Modeling
FAIR: Findable, Accessible, Interoperable, Reusable

🧠 What FAIR Actually Means
F (Findable): persistent IDs + rich metadata. Prevents "I know it's here somewhere…"
A (Accessible): open protocols + documented permissions. Prevents data trapped behind paywalls or dead servers.
I (Interoperable): shared vocabularies + machine readability. Prevents "Excel hell" and schema mismatches.
R (Reusable): licensing + provenance + contextual standards. Prevents "Can we even use this?" and irreproducible science.

A dataset isn't FAIR until it's findable by machines and actionable by humans.

💀 The Counterfactual: When FAIR Is Ignored
The $2.3 B clinical-trial dataset that disappeared.

Without FAIR:
2005–10: 50,000 patients, 10,000 variables
2015: Principal Investigator retires
2020: Hard drive discarded
2023: "Does anyone know where that dataset went?"
💸 $2.3 B asset → $0 value

With FAIR:
2005–10: Same dataset deposited in a FAIR repository
Identifier: DOI 10.1234/clinical-trial-2005-2010
Metadata: indexed + reusable under a CC-BY license
2023: Used in 47 new studies, 23 regulatory filings
💰 Cumulative value > $2.7 B

Moral: FAIR isn't academic idealism; it's economic engineering.

🏗️ FAIR as a Design Principle
1️⃣ Findability → start at schema design. Use resolvable IDs (DOIs, URIs), not opaque column headers.
2️⃣ Accessibility → protocol-aware infrastructure. Move beyond "email the PI" to structured access endpoints, documented authentication, and 24/7 reliability.
3️⃣ Interoperability → semantic grounding. Map every entity to a common ontology. Data that can't talk to other data can't create new knowledge.
4️⃣ Reusability → provenance as code. Track who, how, and why data was generated. Provenance isn't metadata; it's accountability.

🚨 Common Myths
- "FAIR is costly." → It costs far less than rebuilding data later.
- "FAIR means open." → No; accessible ≠ public. It means traceably managed.
- "Standards change." → FAIR handles versioning and mappings by design.

#FAIRData #DataModeling #OpenScience #Interoperability #Metadata #AIReady #Provenance #Reproducibility #KnowledgeGraphs
👉 Follow me for daily nuggets on Knowledge Management and Neuro-Symbolic AI.
👉 Join my group for more insights and community discussions: [Join the Group](https://lnkd.in/d9Z8-RQd)
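To make the design-principle section concrete, here is what a minimal FAIR-ish metadata record could look like. Field names loosely follow common conventions (Dublin Core / DataCite) but are illustrative, not a formal schema; the DOI is the example one from the story above:

```python
import json

# Hypothetical dataset descriptor covering each FAIR principle.
record = {
    "identifier": "doi:10.1234/clinical-trial-2005-2010",  # F: persistent, resolvable ID
    "title": "Clinical trial cohort, 2005-2010",
    "access": {                                            # A: documented access path
        "protocol": "https",
        "authentication": "institutional SSO",
    },
    "vocabulary": "http://purl.obolibrary.org/obo",        # I: shared ontology namespace
    "license": "CC-BY-4.0",                                # R: explicit reuse terms
    "provenance": {                                        # R: who, how, and why
        "creator": "Example PI",
        "created": "2010-12-31",
        "method": "randomized controlled trial",
    },
}

# Machine-readable by construction: any harvester can parse and index it.
serialized = json.dumps(record, indent=2)
print(json.loads(serialized)["license"])  # CC-BY-4.0
```

The point is not the exact keys but that every principle maps to a concrete, machine-checkable field rather than tribal knowledge.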
-
I built an enterprise analytics dashboard that replaces expensive BI tools costing companies $15,000 per month.

Fresh from completing Codecademy's "Mastering Generative AI & Agents for Developers" certification in March 2026, I immediately applied multi-agent architecture concepts to solve a real business problem. Codecademy #CodecademyGenAIBootcamp

The challenge: companies spend thousands on Tableau, Looker, and Power BI, yet employees still wait days for simple data answers. My solution handles 5,000 employees across 6 global regions, with natural language queries responding in under 1 second.

The multi-agent architecture in action:
Type "revenue analysis" and get instant results: $5.75M revenue, $2.3M profit, 40% margin, with interactive visualizations.
Type "market trends" and see we're growing 38% versus an industry average of 26%, outperforming by 45%.
Type "strategic initiatives" and view 5 projects with $9.6M invested, 67% complete, and 214% average ROI.

The platform demonstrates core concepts from my certification: six specialized AI agents, each with department-specific tools and permissions. HR sees payroll data. Executives see everything. Business Support gets analytics. Each agent operates independently while sharing the same data infrastructure. Natural language processing transforms plain English into actionable insights. Context-aware responses adapt based on user permissions and department needs.

Real-world implementation:
- 9 currencies with real-time conversion
- 7 timezones automatically converted
- 26 months of historical data
- Sub-1-second query responses
- Production-ready deployment
- Chatbot interface with conversation history

Cost comparison: traditional BI tools run $5,000–$15,000 monthly for 50 users. This dashboard costs $150–$1,000 monthly for unlimited users. That's a 90% cost reduction.

The platform works for any business. Retail can track inventory and sales. Healthcare can monitor patients and appointments. SaaS can analyze revenue and churn. Manufacturing can track production and equipment.

Technical implementation: React, Vite, Recharts, deployed on Vercel with complete documentation. Optimized from 3.6M records to 130K through intelligent data aggregation. From certification to production in one week. Theory meets practice.

Live demo: https://lnkd.in/eK9t9dhw
Certification: 6942A6176A (Codecademy, February 2026)

#CodecademyGenAIBootcamp #TalentWorldgroupPlc #NSCS #FullSailUniversity #AI #BusinessIntelligence #DataAnalytics #React #EnterpriseSoftware #TechInnovation #Dashboard #SoftwareDevelopment
-
Which database should we use? That's usually the wrong starting point. The real question in 2026 is: what kind of data behavior are we designing for? Because databases are no longer just storage choices; they're architectural decisions.

Over the years, I've seen teams struggle not because they picked the "wrong" database, but because they expected one database to behave like all others. That's why I put together this visual of 12 database types worth keeping on your radar. Not as a checklist. Not as a comparison chart. But as a way to think clearly.

Some patterns I see repeatedly in real systems:
• Relational databases still anchor core business logic, but rarely operate alone
• Vector databases are becoming first-class citizens in AI systems, not add-ons
• Time-series and columnar stores quietly do the heavy lifting in observability and analytics
• Graph databases shine when relationships are the product
• Key-value and in-memory stores are about latency, not elegance
• Document databases succeed when schemas must evolve faster than teams
• NewSQL exists because we want SQL guarantees without giving up scale

Modern architectures aren't "polyglot" because it's trendy. They're polyglot because data workloads are fundamentally different.
OLTP ≠ analytics
Search ≠ reasoning
Events ≠ relationships
Inference ≠ transactions

If you're designing platforms today, especially AI-enabled ones, database literacy matters more than framework knowledge. Not knowing how your data behaves is far more dangerous than not knowing the latest tool.
-
Choose your #DB wisely

The "What" vs. the "How": don't start by asking which database to use; start by defining the data behavior your system requires.
Unique identities: modern databases are specialized tools. Expecting one database to behave like another is a primary cause of architectural failure.
Shifting perspective: teams fail when they treat databases as interchangeable buckets rather than as distinct components that dictate how an application scales and performs.

Thanks for sharing, Brij kishore Pandey