How to Build Optimal Hive Tables Using ORC, Partitions and Metastore Statistics

How a table's data is stored matters a great deal, because it largely determines query performance.

This is a great article that describes some best practices for creating ORC-based files:

Partition schemes require thoughtful design: too many partitions leave you with many small files and a lot of metadata, while too few prevent queries from skipping data they don't need. Time-based partitions, such as the date in the form yyyy-mm-dd, are usually a good start. The right choice also depends on your data ingestion process: how often the data arrives and in what sort of “chunks”.
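To make the date-based scheme concrete, here is a minimal sketch that maps event timestamps to a yyyy-mm-dd partition value and counts how many rows would land in each partition. It assumes timestamps are epoch seconds and that partitions are cut on UTC days; the function names are hypothetical and only for illustration. Running something like this on a sample of incoming data is one way to see whether your "chunks" line up with daily partitions.

```python
from collections import Counter
from datetime import datetime, timezone


def partition_key(epoch_seconds):
    """Return the yyyy-mm-dd partition value (UTC day) for an event timestamp."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime("%Y-%m-%d")


def rows_per_partition(timestamps):
    """Count how many rows would fall into each daily partition."""
    return Counter(partition_key(ts) for ts in timestamps)


if __name__ == "__main__":
    # Hypothetical sample: two events on one day, one on the next.
    sample = [0, 10, 86400]
    for day, count in sorted(rows_per_partition(sample).items()):
        print(day, count)
```

If most days hold only a handful of rows, a coarser key (for example, by month) may suit the data better; if a single day is enormous, a second partition column may be worth considering.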

Key take-aways

  1. Build your table with partitions, ORC format, and SNAPPY compression.
  2. Run ANALYZE on the table whenever you change it, and on the new partition whenever you add one, so the metastore statistics stay current.
  3. Also gather column statistics at the partition level, for the columns you use most often (or all of them), each time you add a partition.
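The three take-aways can be sketched as the HiveQL statements an ingestion script might issue. This is only an illustration: the table name `web_logs`, its columns, and the partition column `dt` are hypothetical, and the helper functions are not part of any Hive client library. They just build statement strings that you could run through whatever Hive client you use.

```python
def create_table_sql(table, columns, partition_col="dt"):
    """Build a CREATE TABLE statement: partitioned, ORC format, SNAPPY compression."""
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({partition_col} STRING)\n"
        "STORED AS ORC\n"
        "TBLPROPERTIES ('orc.compress'='SNAPPY')"
    )


def analyze_partition_sql(table, dt, partition_col="dt"):
    """Build the statement that gathers basic statistics for one partition."""
    return (
        f"ANALYZE TABLE {table} PARTITION ({partition_col}='{dt}') "
        "COMPUTE STATISTICS"
    )


def analyze_columns_sql(table, dt, columns, partition_col="dt"):
    """Build the statement that gathers column statistics for one partition."""
    return (
        f"ANALYZE TABLE {table} PARTITION ({partition_col}='{dt}') "
        f"COMPUTE STATISTICS FOR COLUMNS {', '.join(columns)}"
    )


if __name__ == "__main__":
    # Hypothetical table and columns, for illustration only.
    print(create_table_sql("web_logs", [("user_id", "BIGINT"), ("url", "STRING")]) + ";")
    print(analyze_partition_sql("web_logs", "2016-01-31") + ";")
    print(analyze_columns_sql("web_logs", "2016-01-31", ["user_id", "url"]) + ";")
```

Running the two ANALYZE statements right after each partition is loaded keeps the statistics in step with the data, which is the point of take-aways 2 and 3.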