⭐️We are pleased to announce that Professor Themis Palpanas gave a talk on high-dimensional similarity search at Archimedes Talks: "Scalable Vector Analytics: A Story of Twists and Turns."

📊The talk explored the evolution of similarity search over the past fifty years, highlighting why this classic data management problem is more relevant and challenging than ever, from time-series management systems to modern vector databases. It presented state-of-the-art solutions spanning exact and approximate search, in-memory and on-disk methods, and techniques ranging from LSH and product quantization to k-NN graphs and optimized linear scans. A key insight was how recent advances in time-series similarity search now achieve state-of-the-art performance even for general high-dimensional vector data.

We are proud to have contributed to this important discussion and to the ongoing exploration of open research directions. https://lnkd.in/dSvng7wr

The project is implemented under the National Recovery and Resilience Plan "Greece 2.0", with funding from the European Union – NextGenerationEU
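None of the methods mentioned above reduces to a one-liner, but the simplest of them, the optimized linear scan, is easy to sketch. Below is a minimal, illustrative NumPy version of exact k-NN search via a single vectorized scan; function and variable names are my own, not from the talk:

```python
import numpy as np

def knn_linear_scan(data, query, k=3):
    """Exact k-nearest-neighbor search via a vectorized linear scan:
    one distance computation over the whole dataset, then a partial sort."""
    # Squared Euclidean distance from the query to every stored vector.
    dists = np.sum((data - query) ** 2, axis=1)
    # argpartition finds the k smallest in O(n); sort only those k afterwards.
    idx = np.argpartition(dists, k)[:k]
    return idx[np.argsort(dists[idx])]

# Toy example: 1,000 random 64-dimensional vectors.
rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 64))
query = data[42] + 0.01 * rng.standard_normal(64)  # slightly perturbed copy
print(knn_linear_scan(data, query, k=3)[0])  # → 42, the perturbed source vector
```

Approximate methods (LSH, product quantization, k-NN graphs) exist precisely because this exact scan, even vectorized, grows linearly with collection size.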
HAR.S.H. (Hardware-Aware extReme-scale Similarity search)’s Post
More Relevant Posts
-
Data science is a rapidly evolving field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. As the volume, variety, and velocity of data continue to grow, data scientists need sophisticated tools to handle and analyze it. https://lnkd.in/dbNzzVhQ
-
"It’s a knowledge architecture problem. And knowledge architecture is an entirely different discipline, requiring different skills, than defining metric definitions. It’s closer to what librarians, taxonomists, and knowledge engineers have done for decades, and is a drastically different workflow than what data teams typically do today. It requires understanding how to formally represent domain knowledge, skills that fields like library science, information architecture, and semantic web research have been developing for years." https://lnkd.in/dH4nGUes
-
Choosing the right database is a systems decision, not a preference. Most data problems don't come from "the wrong tool." They come from using the right tool in the wrong context. Here's a practical way to think about database choices, based on what the system actually needs to do.

OLTP systems: best when you need fast, reliable transactions with strong consistency. Think orders, payments, user profiles. High write volume, low latency, clear integrity guarantees.

OLAP systems: built for complex queries and analytical workloads. Large scans, aggregations, historical analysis. Optimised for insight, not individual transactions.

Full-text search engines: designed for fast, flexible text queries. Relevance scoring, fuzzy matching, language support. Use them when search quality matters more than strict consistency.

Document stores: great when your data is semi-structured and evolving. JSON-like documents, flexible schemas, easy iteration. Useful for APIs and rapidly changing domains.

Key-value stores: simple data models and extremely fast lookups. Perfect for caching, session storage, feature flags, and counters.

Graph databases: built for highly connected data, where relationships are first-class citizens. Use them when traversals matter more than rows and columns.

Vector databases (embeddings): optimised for similarity search. Critical for modern AI use cases like recommendations, retrieval-augmented generation, and semantic search.

Geospatial databases: specialised for location-based queries. Distance calculations, spatial joins, geofencing. Essential when "where" is part of the question.

The key takeaway: there is no "best" database. There is only the database that matches your access patterns, consistency needs, and failure tolerance. Strong data engineers don't ask, "What database should we use?" They ask, "What problem are we actually solving?" If this kind of systems-level thinking helps, I share deeper breakdowns like this in my newsletter.
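As a toy illustration of the key-value pattern described above (caching, sessions, feature flags): a minimal in-process sketch with per-entry expiry. Real systems would use Redis or Memcached; the class name here is purely illustrative.

```python
import time

class TTLCache:
    """Minimal in-process key-value store with per-entry expiry --
    the access pattern key-value databases are built to serve at scale."""
    def __init__(self, ttl_seconds=60.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expiry timestamp)

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)

    def get(self, key, default=None):
        entry = self._store.get(key)
        if entry is None:
            return default
        value, expires = entry
        if time.monotonic() >= expires:
            del self._store[key]  # lazily evict expired entries
            return default
        return value

cache = TTLCache(ttl_seconds=0.05)
cache.set("session:42", {"user": "ada"})
print(cache.get("session:42"))  # {'user': 'ada'}
time.sleep(0.06)
print(cache.get("session:42"))  # None: the entry has expired
```

The same get/set-with-TTL interface is what you would call on a real key-value store; only the durability and distribution guarantees change.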
Which database decision caused the most pain in a system you’ve worked on? Choosing the right database is just one decision in a long chain. I write a short weekly newsletter on data systems, AI foundations, and engineering judgment. 👉 Subscribe to keep learning - https://lnkd.in/eFPw_cd5 #DataEngineering #Databases #DataArchitecture #ModernDataStack #AIEngineering #TechNewsletter
-
Why Vector Databases belong in your Modern Data Stack

TL;DR: Vector databases are no longer just for data scientists; they are becoming a critical component of modern data pipelines, alongside Data Lakes and Warehouses. Data engineers need to evolve beyond standard SQL to master semantic search, high-dimensional indexing, and hybrid search (combining keywords and vectors). Ultimately, integrating vector DBs is essential for powering AI memory and RAG, requiring a new focus on efficiency and scalability in data architecture.

If you think Vector Databases are just for Data Scientists, think again. The architecture of the modern data pipeline is evolving, and "Embedding Stores" are quickly becoming a core component alongside the Data Lake and Warehouse.

Why should Data Engineers adapt?

1. The shift to semantic search: traditional keyword search (BM25) fails to capture intent. Vector DBs allow us to query by meaning.

2. Handling high dimensionality: storing and querying billions of vectors requires a different approach to indexing and partitioning than standard B-trees.

3. Hybrid search: the future isn't just vector; it's hybrid (keyword + vector). Implementing it requires deep knowledge of both traditional search engines (like Elastic/Solr) and modern vector indices.

The takeaway: don't just learn how to SELECT *. Learn how to generate embeddings, manage vector indices, and optimize for Approximate Nearest Neighbor (ANN) search.

Vector databases serve as the long-term memory for AI, storing data as high-dimensional vectors that represent semantic meaning rather than simple keywords. This allows models to retrieve relevant context from vast datasets, powering applications like Retrieval-Augmented Generation (RAG), where the AI answers questions based on specific, private information.

The tools may change, but the engineering principles remain: efficiency, scalability, and reliability.
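The "query by meaning" idea above boils down to one operation: ranking stored embeddings by similarity to a query embedding. A minimal NumPy sketch follows; the toy 3-dimensional "embeddings" are illustrative (real ones have hundreds of learned dimensions), and production systems replace this exact scan with ANN indices such as HNSW or IVF:

```python
import numpy as np

def cosine_top_k(doc_vecs, query_vec, k=2):
    """Rank documents by cosine similarity to a query embedding --
    the core operation a vector database indexes at scale."""
    doc_norms = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    sims = doc_norms @ q                 # cosine similarity per document
    top = np.argsort(-sims)[:k]          # highest-similarity documents first
    return list(zip(top.tolist(), sims[top].tolist()))

# Toy 3-dimensional "embeddings": rows are documents.
docs = np.array([[1.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0],
                 [0.0, 1.0, 0.0]])
query = np.array([1.0, 0.05, 0.0])
print(cosine_top_k(docs, query, k=2))  # doc 0 ranks first, then doc 1
```

A hybrid system would merge this ranking with a BM25 keyword ranking (e.g. via reciprocal rank fusion) before returning results.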
What’s your experience integrating Vector DBs into existing pipelines? #DataArchitecture #VectorDB #MachineLearning #DataOps #SoftwareEngineering
-
I've been struggling with this idea for a long time: how do you explain and defend data science to folks who don't already know it? I think I finally have the right framing. Data science, and indeed science, is about learning from and adjusting for the past and the present. Essentially, data science allows you to understand what happened and why it happened, if you are careful. There is a ton of art in the field, which is the challenge, because it means it cannot be completely engineered away.

So how do you do data science, and what tools do you use?

Recency: for answering what's happening now, or how to react quickly to recent information. This is where constrained recommender systems are ideal; a multi-armed bandit, or some other dynamical system that reweights probabilities before recommendation, is best. This class of algorithms is "greedy".

Long-term trends: for answering questions about how things typically behave. For systems that don't have a ton of big shocks, but enough that coding a deterministic system is either impossible or very difficult, a statistical model is really the best thing. The inductive biases make strong assumptions that allow you to carefully reason about what your data is telling you, and to optimize for your measure of interest, subject to behavior.

Long-term trends with fads: if there are short-term structural shocks, like a big shift up or down in demand, hierarchical models with a sub-model can really help. Fine-tuning on the last three months or so, then using your long-term data for the big trends, can work exceptionally well.

If I hit 100 likes on this, I'll turn it into a book. I think there could be a lot of great ideas in this framing, but only if the community thinks this is helpful.
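The "greedy" recency tool mentioned above, a multi-armed bandit, fits in a few lines. This is a generic epsilon-greedy sketch with made-up reward rates, not anyone's production recommender:

```python
import random

class EpsilonGreedyBandit:
    """Epsilon-greedy multi-armed bandit: mostly exploit the best-looking
    arm, occasionally explore, reweighting estimates after every reward."""
    def __init__(self, n_arms, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms  # running mean reward per arm
        self.rng = random.Random(seed)

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))  # explore
        return max(range(len(self.counts)), key=lambda a: self.values[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental mean: shifts the arm's estimate toward recent rewards.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

bandit = EpsilonGreedyBandit(n_arms=3, epsilon=0.1)
true_rates = [0.2, 0.8, 0.4]  # arm 1 is actually the best choice
for _ in range(2000):
    arm = bandit.select()
    bandit.update(arm, 1.0 if bandit.rng.random() < true_rates[arm] else 0.0)
print(bandit.values)  # arm 1's estimate should settle near 0.8
```

In a recommender, "arms" would be items or strategies and "reward" a click or conversion; the greediness is exactly why this class reacts fast to recent information.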
-
When data science projects break, we often think about technical issues: weak models, noisy data, insufficient features. In fact, many failures stem from methodological weaknesses.

A common pattern is jumping straight to the latest or largest model, hoping for the best performance. This approach is expensive and often unnecessary. Worse, it bypasses a critical step that I like to call "taming the data": understanding how it behaves and identifying potential pitfalls. Teams that adopt a stepwise approach often save time and money down the road. This isn't just about exploratory analysis. It's about decomposing a problem into smaller steps and starting with simpler, more manageable models before adding complexity.

The same applies to bias and leakage. Many issues don't come from a single obvious blunder, but from a series of design choices that look perfectly reasonable at the time: how data is split or shuffled, how features are encoded. Bias accumulates line by line as the pipeline is built. Treating bias awareness as a design reflex rather than a post-hoc validation step is often what makes the difference between fragile and robust results.

There is no neutral pipeline. Preprocessing, feature engineering, and metric selection all encode assumptions about what we're looking at and how success is defined. This isn't a flaw; it's a constraint that requires awareness.

Coming from experimental research, this incremental, analytical, and skeptical approach was drilled into me early. I've found it translates well to applied data science. In short: data science tends to work best when treated as a hypothesis-testing discipline, not a model-shopping exercise, even in the era of low-code and plug-and-play ML.
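One concrete instance of the "reasonable-looking design choice" that leaks: computing preprocessing statistics before splitting the data. A small NumPy sketch with made-up data; the safe version fits the scaler on the training rows only:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # synthetic feature matrix

# Split FIRST, then fit any preprocessing on the training portion only.
train, test = X[:80], X[80:]

# Leaky version: statistics computed over ALL rows, test set included.
leaky_mean = X.mean(axis=0)

# Safe version: statistics computed from the training rows alone.
safe_mean, safe_std = train.mean(axis=0), train.std(axis=0)
test_scaled = (test - safe_mean) / safe_std  # test never touches the scaler fit

# The two means differ: the leaky scaler quietly absorbed test information.
print(np.allclose(leaky_mean, safe_mean))  # False
```

The same split-then-fit reflex applies to imputation, encoding, and feature selection; each is a line where bias can accumulate unnoticed.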
-
The FAIR Principles for Modern Data Modeling
FAIR: Findable, Accessible, Interoperable, Reusable

🧠 What FAIR Actually Means
F (Findable): persistent IDs + rich metadata. Prevents "I know it's here somewhere…"
A (Accessible): open protocols + documented permissions. Prevents data trapped behind paywalls or dead servers.
I (Interoperable): shared vocabularies + machine readability. Prevents "Excel hell" and schema mismatches.
R (Reusable): licensing + provenance + contextual standards. Prevents "Can we even use this?" and irreproducible science.

A dataset isn't FAIR until it's findable by machines and actionable by humans.

💀 The Counterfactual: When FAIR Is Ignored
The $2.3 B clinical-trial dataset that disappeared.

Without FAIR:
2005–10: 50,000 patients, 10,000 variables
2015: Principal Investigator retires
2020: Hard drive discarded
2023: "Does anyone know where that dataset went?"
💸 $2.3 B asset → $0 value

With FAIR:
2005–10: Same dataset deposited in a FAIR repository
Identifier: DOI 10.1234/clinical-trial-2005-2010
Metadata: indexed + reusable under a CC-BY license
2023: Used in 47 new studies, 23 regulatory filings
💰 Cumulative value > $2.7 B

Moral: FAIR isn't academic idealism; it's economic engineering.

🏗️ FAIR as a Design Principle
1️⃣ Findability → start at schema design. Use resolvable IDs (DOIs, URIs), not opaque column headers.
2️⃣ Accessibility → protocol-aware infrastructure. Move beyond "email the PI" to structured access endpoints, documented authentication, and 24/7 reliability.
3️⃣ Interoperability → semantic grounding. Map every entity to a common ontology. Data that can't talk to other data can't create new knowledge.
4️⃣ Reusability → provenance as code. Track who, how, and why data was generated. Provenance isn't metadata; it's accountability.

🚨 Common Myths
- "FAIR is costly." → It costs far less than rebuilding data later.
- "FAIR means open." → No; accessible ≠ public. It means traceably managed.
- "Standards change." → FAIR handles versioning and mappings by design.

#FAIRData #DataModeling #OpenScience #Interoperability #Metadata #AIReady #Provenance #Reproducibility #KnowledgeGraphs
👉 Follow me for daily nuggets on Knowledge Management and Neuro-Symbolic AI.
👉 Join my group for more insights and community discussions: [Join the Group](https://lnkd.in/d9Z8-RQd)
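To make the design-principle section concrete, here is what a minimal FAIR-ish metadata record could look like. Field names loosely follow common conventions (Dublin Core / DataCite) but are illustrative, not a formal schema; the DOI is the example one from the story above:

```python
import json

# Hypothetical dataset descriptor covering each FAIR principle.
record = {
    "identifier": "doi:10.1234/clinical-trial-2005-2010",  # F: persistent, resolvable ID
    "title": "Clinical trial cohort, 2005-2010",
    "access": {                                            # A: documented access path
        "protocol": "https",
        "authentication": "institutional SSO",
    },
    "vocabulary": "http://purl.obolibrary.org/obo",        # I: shared ontology namespace
    "license": "CC-BY-4.0",                                # R: explicit reuse terms
    "provenance": {                                        # R: who, how, and why
        "creator": "Example PI",
        "created": "2010-12-31",
        "method": "randomized controlled trial",
    },
}

# Machine-readable by construction: any harvester can parse and index it.
serialized = json.dumps(record, indent=2)
print(json.loads(serialized)["license"])  # CC-BY-4.0
```

The point is not the exact keys but that every principle maps to a concrete, machine-checkable field rather than tribal knowledge.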
-
I built an enterprise analytics dashboard that replaces expensive BI tools costing companies $15,000 per month.

Fresh from completing Codecademy's "Mastering Generative AI & Agents for Developers" certification in March 2026, I immediately applied multi-agent architecture concepts to solve a real business problem. Codecademy #CodecademyGenAIBootcamp

The challenge: companies spend thousands on Tableau, Looker, and Power BI, yet employees still wait days for simple data answers. My solution handles 5,000 employees across 6 global regions, with natural language queries responding in under 1 second.

The multi-agent architecture in action:
Type "revenue analysis" and get instant results: $5.75M revenue, $2.3M profit, 40% margin, with interactive visualizations.
Type "market trends" and see we're growing 38% versus an industry average of 26%, outperforming by 45%.
Type "strategic initiatives" and view 5 projects with $9.6M invested, 67% complete, and 214% average ROI.

The platform demonstrates core concepts from my certification: six specialized AI agents, each with department-specific tools and permissions. HR sees payroll data. Executives see everything. Business Support gets analytics. Each agent operates independently while sharing the same data infrastructure. Natural language processing transforms plain English into actionable insights. Context-aware responses adapt based on user permissions and department needs.

Real-world implementation:
- 9 currencies with real-time conversion
- 7 timezones automatically converted
- 26 months of historical data
- Sub-1-second query responses
- Production-ready deployment
- Chatbot interface with conversation history

Cost comparison: traditional BI tools run $5,000–$15,000 monthly for 50 users. This dashboard costs $150–$1,000 monthly for unlimited users. That's a 90% cost reduction.

The platform works for any business. Retail can track inventory and sales. Healthcare can monitor patients and appointments. SaaS can analyze revenue and churn. Manufacturing can track production and equipment.

Technical implementation: React, Vite, Recharts, deployed on Vercel with complete documentation. Optimized from 3.6M records to 130K through intelligent data aggregation. From certification to production in one week. Theory meets practice.

Live demo: https://lnkd.in/eK9t9dhw
Certification: 6942A6176A (Codecademy, February 2026)

#CodecademyGenAIBootcamp #TalentWorldgroupPlc #NSCS #FullSailUniversity #AI #BusinessIntelligence #DataAnalytics #React #EnterpriseSoftware #TechInnovation #Dashboard #SoftwareDevelopment
-
Which database should we use? That's usually the wrong starting point. The real question in 2026 is: what kind of data behavior are we designing for? Because databases are no longer just storage choices; they're architectural decisions.

Over the years, I've seen teams struggle not because they picked the "wrong" database, but because they expected one database to behave like all others. That's why I put together this visual of 12 database types worth keeping on your radar. Not as a checklist. Not as a comparison chart. But as a way to think clearly.

Some patterns I see repeatedly in real systems:
• Relational databases still anchor core business logic, but rarely operate alone
• Vector databases are becoming first-class citizens in AI systems, not add-ons
• Time-series and columnar stores quietly do the heavy lifting in observability and analytics
• Graph databases shine when relationships are the product
• Key-value and in-memory stores are about latency, not elegance
• Document databases succeed when schemas must evolve faster than teams
• NewSQL exists because we want SQL guarantees without giving up scale

Modern architectures aren't "polyglot" because it's trendy. They're polyglot because data workloads are fundamentally different.
OLTP ≠ analytics
Search ≠ reasoning
Events ≠ relationships
Inference ≠ transactions

If you're designing platforms today, especially AI-enabled ones, database literacy matters more than framework knowledge. Not knowing how your data behaves is far more dangerous than not knowing the latest tool.
-
Choose your #DB wisely

The "What" vs. the "How": don't start by asking which database to use; start by defining the data behavior your system requires.
Unique identities: modern databases are specialized tools. Expecting one database to behave like another is a primary cause of architectural failure.
Shifting perspective: teams fail when they treat databases as interchangeable buckets rather than as distinct components that dictate how an application scales and performs.

Thanks for sharing, Brij kishore Pandey