Slow dashboard queries -- data storage tuning parameters

Slow dashboard load times can be the result of various things.

You want to always start by looking at the query profiles as noted on this post and work backwards:

Assuming you noticed unusually long query times here’s a few things to check:

1) Where is the majority of the time being spent?

In this example, we can see a lot of time is being spent in HDFS scan phase. In particular notice the distribution of time. On average it’s taking ~3min while the max time is ~8min. This means there’s a large variation in reading of data across the cluster.

This can be the result of data skew or nodes in the cluster that are “stragglers”.

2) Storage format and compression

Make sure you are using a columnar storage format like PARQUET or ORC and that compression is enabled.

In this example we see parquet being used, but without compression. Two widely used compressions are SNAPPY or GZIP.

You can read more on how to create tables with compression here

3) Too many small files – Storage block size on HDFS

Having many small files can also hurt performance. It’s recommended to keep the storage block size on HDFS to anywhere in the range of 128MB to 512 MB.

Ensure HDFS block size and parquet block size are the same (1 parquet block per HDFS file).

SET dfs.block.size=134217728;
SET parquet.block.size=134217728;
SET parquet.compression=SNAPPY;

In this example we see over 6K files for relatively small table of ~36GB

4) Analytical view for acceleration

As a best practice, you should also build analytical views if your dashboards are going to be used in production setting. Analytical views are acceleration objects that are stored in an efficient manner back in the cluster.

Read more about how to enable and build them using “Smart acceleration” feature: