Slow dashboard load times can be the result of various things.
You want to always start by looking at the query profiles as noted on this post and work backwards:
Assuming you noticed unusually long query times here’s a few things to check:
In this example, we can see a lot of time is being spent in HDFS scan phase. In particular notice the distribution of time. On average it’s taking ~3min while the max time is ~8min. This means there’s a large variation in reading of data across the cluster.
This can be the result of data skew or nodes in the cluster that are “stragglers”.
Make sure you are using a columnar storage format like PARQUET or ORC and that compression is enabled.
In this example we see parquet being used, but without compression. Two widely used compressions are SNAPPY or GZIP.
You can read more on how to create tables with compression here
Having many small files can also hurt performance. It’s recommended to keep the storage block size on HDFS to anywhere in the range of 128MB to 512 MB.
Ensure HDFS block size and parquet block size are the same (1 parquet block per HDFS file).
In this example we see over 6K files for relatively small table of ~36GB
As a best practice, you should also build analytical views if your dashboards are going to be used in production setting. Analytical views are acceleration objects that are stored in an efficient manner back in the cluster.
Read more about how to enable and build them using “Smart acceleration” feature: