Useful Impala commands that you can use to improve queries performance are:
CREATING TEMP TABLE
Those statements will make your code smarter.
Performance will be greater and your managers, users and DB's will be happy to work with you.
INSERT OVERWRITE - will make your life much easier when handling deletes.
Improve Impala performance - COMPUTE STATS
You created a table from scratch. Huge table for business users. You know that business will join it to other tables, perform analysis and heavy computational tasks.
To let Impala engine "know" how to query data efficiently, you need to COMPUTE STATS.
This is one use case.
Another one is a huge insert to a table.
You may consider running COMPUTE STATS after the insert.
The third use case will be creating a temporary table, that you will use only to create the output table, but after the output table is created, the temp table is no longer needed.
COMPUTE STATS big_analytics.vehicle_order;
Improve Impala performance - SET MEM_LIMIT
Sometimes it happens that your query needs more memory than quota assigned to you by administrators.
Just to test something.
You know that you are not alone in the Data Lake universe, you know that there are many other users working at the same time, so you will not overuse this statement.
Once more, please don't overuse this trick.
Also, this is a way to test if your query will work with different memory quotas. You can also lower the value and verify if it will work, for example within 1 GB or 2 GB memory limit.
SET MEM_LIMIT = 3 GB
Improve Impala performance - INSERT OVERWRITE
Looking from different perspectives, it can be an advantage or disadvantage: Data Lake does not have updates or deletes.
So, how can you handle situations when you would like to delete something from the table?
Imagine a huge table, you have data there loaded on a snapshot basis. You have more than 1000 days loaded.
Then, your business comes to you and tells you that they need only last 30 days.
INSERT OVERWRITE is one of the ways you can solve this task.
INSERT OVERWRITE replaces data in a table. The benefit of using this approach is that you don't need to use any additional structures.
INSERT OVERWRITE vehicle_order
SELECT * FROM vehicle_order
WHERE snapshot_date >= DATE_TRUNC('DAY', DATE_ADD(NOW, -30))
This code will remove snapshots older than 30 days.
Improve Impala performance - CREATE TEMP TABLE
This is a trick old as the Egyptian pyramids and probably also used by the Egyptians, but don't go with the ancient analogy as far.
How to solve a giant problem? Break it into smaller pieces.
In cases when you have a complicated query with multiple joins, unions, group by's and analytic function is not performing you can break it down into smaller ones.
Build multiple fairly simple temp tables. Compute stats on them and use them in subsequent queries.
This is a good place to ask, but how is the temp table different from the regular table?
Temp table exists only in order to finish your script and load data into the destination table. After your big script is completed, you are dropping temporary tables as they are no longer needed.
Template for creating temporary queries:
DROP TABLE IF EXISTS temp_order_vehicle PRUGRE;
CREATE TABLE IF NOT EXISTS temp_order_vehicle STORED AS PARUQET AS..
DROP TABLE IF EXISTS temp_vehicle_variant PRUGRE;
CREATE TABLE IF NOT EXISTS temp_vehicle_variant STORED AS PARUQET AS..
-- Target table insert
INSERT INTO vehicle_order
SELECT snapshot_date, …
LEFT JOIN temp_order_vehicle ord_veh ON (ord_veh.snapshot_date = … )
LEFT JOIN temp_vehicle_variant veh_variant ON (veh_variant.snapshot_date = … )
-- After the load, drop temporary tables
DROP TABLE IF EXISTS temp_order_vehicle PURGE;
DROP TABLE IF EXISTS temp_vehicle_variant PURGE;
Pro-tip: if your complex query has been built as a series of multiple CTE's they are good candidates for temporary tables.