Apache Hive Data Warehouse and external indexes

Mich Talebzadeh (Ph.D.)

1K followers

Published May 5, 2016

I had discussions in Hive user forum user@hive.apache.org especially under this thread titled “ Hive footprint” with the community about the merits of Optimized Row Columnar (ORC) files and storage indexes. See my other article titled ORC File and Storage Index .

Having had a chance using ORC storage index in Hive, I feel that a new indexing strategy is required for Hive as through my experience ORC tables have not delivered the contribution to Hive optimizer as expected. My personal experience has been that ORC provide some improvements on what is already available (Statistics wise), but otherwise unless you bucket your table (i.e. have an effective numeric column (String columns don't work for this purpose. I confirmed it and a Jira has been raised in Hive) with high cardinality that can be used in hash partitioning the table), one cannot make effective use of storage index. In addition, if your tables are smaller than your chosen HDFS block size then ORC indexing strategy does not come into it.

I use Hive 2 with Spark engine (to be precise Spark 1.3.1 as there are issues using newer Spark versions as Hive engines) and there has been some improvements to Hive optimisation performance. I also use Spark 1.6.1 with Hive 2 and that looks very impressive, albeit if your underlying ORC table in Hive is bucketed and transactional, then Spark application fails to write to it. This is probably one of the reasons that ORC format is not widely used within Spark for storing the result set. Parquet tables still remain the favourite in this respect.

So I have made proposals as a contributor to Apache Hive to make Hive external indexes (as distinct from ORC storage index) visible to Hive Cost Based Optimizer (CBO). Visible meaning that the Optimizer can take advantage of these type of indexes. Currently the infrastructure is there to create these indexes but not the functionality. In other words you can create these external indexes but they are not used! I don’t know what it takes to make indexes in Hive accountable for the CBO. Hive is arguably the most popular Data Warehouse within the Hadoop ecosystem and its meta data in the form of Hive database makes it invaluable. We should aim to consolidate the Hadoop ecosystem by investing and improving the existing tools like Hive rather than trying to fragment it further by going in tangent and reinventing the wheel by creating yet another pet project. There seems to be little effort in this area of consolidation and improvement.

Case in point, the following shows bitmap indexes that I created on a FACT table called sales in Hive:

Cool. So the indexes are there but the Hive optimizer does not use them (not cool I guess).

Another challenge in implementing the external indexes in Hive is that the blocks in Hive tables are not stored sequentially. Actually the issue with this is that HDFS lacks the ability to co-locate blocks. So really a table scan in Hive in the sense of a conventional RDBMS does not exist. We need to come up with a different strategy.

Apache Hive Data Warehouse and external indexes

Mich Talebzadeh (Ph.D.)

1K followers

More articles by this author

Others also viewed

Apache Hive LLAP Performance

2.1 Hive - Metastore | Managed | External | Partition(static/Dynamic) | Bucketing.

How to: Spark to read HDFS files, union DataFrames and store in Parquet format (Scala, Spark SQL)

The Importance of Apache Drill to the Big Data Ecosystem

Hive vs Iceberg: Choosing the Right Big Data Technology for Your Use Case

On the Use of Nested Data in Hive

All You Need To Know About Parquet File Structure In Depth

Hive Architecture in Depth

Big Data Analysis Platforms and Tools

Explore content categories

AI-Driven Financial & Digital Forensics: Reconstructing the Truth from Data

Mar 5, 2026

Ink vs Infrastructure: Why Digital Footprints Now Matter More Than Signatures

Feb 17, 2026

How UK Company Law Enables 'Manufactured Consumer Ambiguity'

Jan 17, 2026

A First-Hand Account of Fraud and Document Tampering

Apr 15, 2025

Metadata Analysis: Indications of Tampering with Records

Feb 7, 2025

The Rise of Fake Reviews: How to Spot Online Deception with Data Science

Jan 31, 2025

The Dark Side of the Self Storage Industry: Why Regulation is Long Overdue

Jan 29, 2025

Beyond Digital: The Overlooked GDPR Risks in Handling Physical Data

Oct 30, 2024

Recovering from Critical Data Loss: My Experience with Oracle 12c, RMAN, and Spark

Aug 21, 2024

Feature Engineering for Data Engineers: Building Blocks for ML Success

Aug 3, 2024