Apache Hive Data Warehouse and external indexes
I had discussions in Hive user forum user@hive.apache.org especially under this thread titled “ Hive footprint” with the community about the merits of Optimized Row Columnar (ORC) files and storage indexes. See my other article titled ORC File and Storage Index .
Having had a chance using ORC storage index in Hive, I feel that a new indexing strategy is required for Hive as through my experience ORC tables have not delivered the contribution to Hive optimizer as expected. My personal experience has been that ORC provide some improvements on what is already available (Statistics wise), but otherwise unless you bucket your table (i.e. have an effective numeric column (String columns don't work for this purpose. I confirmed it and a Jira has been raised in Hive) with high cardinality that can be used in hash partitioning the table), one cannot make effective use of storage index. In addition, if your tables are smaller than your chosen HDFS block size then ORC indexing strategy does not come into it.
I use Hive 2 with Spark engine (to be precise Spark 1.3.1 as there are issues using newer Spark versions as Hive engines) and there has been some improvements to Hive optimisation performance. I also use Spark 1.6.1 with Hive 2 and that looks very impressive, albeit if your underlying ORC table in Hive is bucketed and transactional, then Spark application fails to write to it. This is probably one of the reasons that ORC format is not widely used within Spark for storing the result set. Parquet tables still remain the favourite in this respect.
So I have made proposals as a contributor to Apache Hive to make Hive external indexes (as distinct from ORC storage index) visible to Hive Cost Based Optimizer (CBO). Visible meaning that the Optimizer can take advantage of these type of indexes. Currently the infrastructure is there to create these indexes but not the functionality. In other words you can create these external indexes but they are not used! I don’t know what it takes to make indexes in Hive accountable for the CBO. Hive is arguably the most popular Data Warehouse within the Hadoop ecosystem and its meta data in the form of Hive database makes it invaluable. We should aim to consolidate the Hadoop ecosystem by investing and improving the existing tools like Hive rather than trying to fragment it further by going in tangent and reinventing the wheel by creating yet another pet project. There seems to be little effort in this area of consolidation and improvement.
Case in point, the following shows bitmap indexes that I created on a FACT table called sales in Hive:
Cool. So the indexes are there but the Hive optimizer does not use them (not cool I guess).
Another challenge in implementing the external indexes in Hive is that the blocks in Hive tables are not stored sequentially. Actually the issue with this is that HDFS lacks the ability to co-locate blocks. So really a table scan in Hive in the sense of a conventional RDBMS does not exist. We need to come up with a different strategy.