Refreshing table metadata in Arcadia so that it sees the latest information

Hive Metastore stores information such as table structure, file locations, and file format. It’s a central “ledger” that can be shared across disparate tools. Arcadia Data also uses the Hive Metastore, and tools need to be notified when new tables are created or existing tables are updated.

Option 1: Connecting to the Arcadia Analytical Engine (arcengine) from an external shell (ETL pipeline)

One can connect to the Arcadia Analytical Engine from the HiveServer2 command-line shell, Beeline (https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-Beeline–CommandLineShell). Here are the steps to connect to the Arcadia Engine in a kerberized cluster:

  1. Log in to the name node (HOSTNAME).
  2. Get a Kerberos ticket:

    kinit

  3. Start beeline and connect to the engine:

    !connect jdbc:hive2://HOSTNAME:31050/;principal=arcadia/HOSTNAME

  4. Enter USERNAME/PASSWORD when prompted.
    Note: 31050 is the Arcadia Analytics Engine port number.
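The steps above can be sketched as a small shell snippet. Here, namenode.example.com is a placeholder for your own name node host, and the kinit/beeline invocations are shown as comments since they require a live kerberized cluster:

```shell
# Placeholder host -- substitute your own name node.
HOSTNAME=namenode.example.com

# Build the JDBC URL for the Arcadia Analytical Engine (port 31050).
URL="jdbc:hive2://${HOSTNAME}:31050/;principal=arcadia/${HOSTNAME}"
echo "$URL"

# On the cluster you would then run:
#   kinit                 # obtain a Kerberos ticket
#   beeline -u "$URL"     # connect and enter USERNAME/PASSWORD when prompted
```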

Alternatively, you can use Arcadia’s own command-line shell (arcadia-shell). Refer to this link for more info:
http://documentation.arcadiadata.com/4.5.0.0/#pages/topics/arc-shell.html

Updating Hive table metadata in the Arcadia cache

After connecting to the Arcadia Analytics Engine via beeline or another external tool, run the following statement with the
appropriate schema name and table name to refresh the metadata inside Arcadia:

invalidate metadata <schema>.<tablename>


Option 2: Refreshing tables from the Arcadia UI

You can also manually run refreshes of tables from within the Arcadia UI:
http://documentation.arcadiadata.com/4.5.0.0/#pages/topics/data-refresh.html


Whether you connect through Beeline or arcadia-shell, you would typically run this for tables that are not partitioned:

invalidate metadata <db>.<tablename>

And run this instead for tables that are partitioned:

refresh <db>.<tablename>

Refresh commands only reload metadata for new or changed partitions, which is much less expensive on larger tables.
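The partitioned-versus-unpartitioned rule above can be captured in a small helper. This is an illustrative sketch (the function name and signature are my own, not part of any Arcadia API) that simply builds the statement you would send to arcengine:

```python
def metadata_refresh_sql(db: str, table: str, partitioned: bool) -> str:
    """Return the metadata statement to run against arcengine for one table."""
    if partitioned:
        # refresh only reloads new/changed partitions -- cheap on big tables.
        return f"refresh {db}.{table}"
    # invalidate metadata discards and reloads all metadata for the table.
    return f"invalidate metadata {db}.{table}"

print(metadata_refresh_sql("sales", "orders", partitioned=True))
# -> refresh sales.orders
print(metadata_refresh_sql("sales", "customers", partitioned=False))
# -> invalidate metadata sales.customers
```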

Also, avoid running global invalidate metadata statements (without a table name), as this will significantly slow down queries across your cluster.