Avoiding Out of Memory (OOM) errors and managing resources with Query Admission Control

In production environments, avoiding Out of Memory (OOM) errors is critical to prevent a degraded user experience.

To avoid this problem, Query Admission Control can be configured for the Arcadia Analytics Engine (Arcengine) to protect the cluster from unexpected spikes in usage and, if necessary, throttle resources to prevent excessive memory consumption. Query Admission Control queues and limits concurrent queries based on a set of conditions applied as run-time parameters. There are also queue settings that ensure queries do not wait indefinitely, which helps avoid resource-starvation scenarios. For more information on Query Admission Control, see Cloudera’s documentation on the subject.

Setting Query Admission Control parameters

To set additional query parameters for Arcengine in Cloudera Manager, go to the Arcadia Enterprise Configuration menu, search for “flag”, and then enter the parameters in the “Arcadia Analytics Engine Advanced Configuration Snippet (Safety Valve) for flagfile” box.

To set additional query parameters for Arcengine in Ambari, go to the Arcadia Enterprise Config menu, select the Advanced arcadia-analytic-engine tab, and enter the parameters in the “Optional parameters for Arcadia Analytic Engine” box.
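In either interface, parameters are entered one per line in --flag=value form. For example, a snippet that enables admission control and sets a queue timeout (illustrative values drawn from the sections below) would look like:

--disable_admission_control=false
--queue_wait_timeout_ms=60000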

Enabling and Configuring Arcengine Query Admission Control

First, to enable Admission Control with the Arcadia Analytics Engine (Arcengine), set this parameter to false:

--disable_admission_control=false

Next, check the current Memory Limit (mem_limit) set for the Arcengine process in Cloudera Manager or Ambari. This value is important when calculating the memory limit for the entire request pool, and it can be raised later if more memory is needed to reduce query queueing in busier environments.

By default, mem_limit is set to 8 GB on CDH installations and 80% on HDP-based installations.

If mem_limit is set to a percentage (e.g., 80%), Arcengine will use at most 80% of the physical memory (RAM) available on that node.
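For example, assuming a hypothetical node with 128 GB of physical RAM, mem_limit=80% would cap the Arcengine process at roughly 0.8 × 128 ≈ 102 GB on that node.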

Finally, you should consider these optional parameters to control admission of queries based on number of concurrent requests and memory allocation:

--num_cores=<int>

If > 0, it sets the number of cores available to Arcengine. Setting it to 0 means Arcengine will use all available cores on the machine according to /proc/cpuinfo.
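For example, to dedicate 16 cores to Arcengine on a shared host (an illustrative value, not a recommendation):

--num_cores=16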

--default_pool_max_requests=<int>

Maximum number of concurrent outstanding requests allowed to run before queueing incoming requests. A negative value indicates no limit. 0 indicates no requests will be admitted. Ignored if fair_scheduler_config_path and llama_site_path are set.

Default value is -1.
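For example, to admit at most 40 concurrent queries before queueing new ones (an illustrative value; the right limit depends on your workload and node memory):

--default_pool_max_requests=40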

--default_pool_max_queued=<int>

Maximum number of requests allowed to be queued before incoming requests are rejected. A negative value or 0 means that requests are always rejected once the maximum number of concurrent requests is executing. Ignored if fair_scheduler_config_path and llama_site_path are set.

Default value is 200.
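For example, to queue at most 50 waiting requests before rejecting new ones (again, an illustrative value):

--default_pool_max_queued=50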

--default_pool_mem_limit=<string>

A cluster-wide limit on how much memory can be used before new requests to this pool are queued. Specified as a number of bytes ('[bB]?'), megabytes ('[mM]'), gigabytes ('[gG]'), or a percentage of the physical memory ('%'). Defaults to bytes if no unit is given. Ignored if fair_scheduler_config_path and llama_site_path are set.

Example: for a cluster with 5 data nodes and 32 GB of RAM per node allocated to Arcengine (as established by the mem_limit setting above), 5 × 32 = 160 GB, so set --default_pool_mem_limit=160G.

--queue_wait_timeout_ms=<int64>

Maximum amount of time (in milliseconds) that a request will wait to be admitted before timing out.

Default value is 60000 (60 seconds).
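Putting it all together, a complete flagfile snippet for the hypothetical 5-node, 160 GB cluster used in the example above might look like this (all values are illustrative starting points, not tuned recommendations):

--disable_admission_control=false
--default_pool_max_requests=40
--default_pool_max_queued=50
--default_pool_mem_limit=160G
--queue_wait_timeout_ms=60000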
