NDV Function
An aggregate function that returns an approximate value similar to the result of COUNT(DISTINCT col), the "number of distinct values". It is much faster than the combination of COUNT and DISTINCT, and uses a constant amount of memory and thus is less memory-intensive for columns with high cardinality.
Syntax:
NDV([DISTINCT | ALL] expression)
Usage notes:
This is the mechanism used internally by the COMPUTE STATS statement for computing the number of distinct values in a column.
Because this number is an estimate, it might not reflect the precise number of different values in the column, especially if the cardinality is very low or very high. If the estimated number is higher than the number of rows in the table, Impala adjusts the value internally during query planning.
Return type: DOUBLE in Impala 2.0 and higher; STRING in earlier releases
Complex type considerations:
To access a column with a complex type (ARRAY, STRUCT, or MAP) in an aggregation function, you unpack the individual elements using join notation in the query, and then apply the function to the final scalar item, field, key, or value at the bottom of any nested type hierarchy in the column. See Complex Types (CDH 5.5 or higher only) for details about using complex types in Impala.
describe region; +-------------+-------------------------+---------+ | name | type | comment | +-------------+-------------------------+---------+ | r_regionkey | smallint | | | r_name | string | | | r_comment | string | | | r_nations | array<struct< | | | | n_nationkey:smallint, | | | | n_name:string, | | | | n_comment:string | | | | >> | | +-------------+-------------------------+---------+ select r_name, r_nations.item.n_nationkey from region, region.r_nations as r_nations order by r_name, r_nations.item.n_nationkey; +-------------+------------------+ | r_name | item.n_nationkey | +-------------+------------------+ | AFRICA | 0 | | AFRICA | 5 | | AFRICA | 14 | | AFRICA | 15 | | AFRICA | 16 | | AMERICA | 1 | | AMERICA | 2 | | AMERICA | 3 | | AMERICA | 17 | | AMERICA | 24 | | ASIA | 8 | | ASIA | 9 | | ASIA | 12 | | ASIA | 18 | | ASIA | 21 | | EUROPE | 6 | | EUROPE | 7 | | EUROPE | 19 | | EUROPE | 22 | | EUROPE | 23 | | MIDDLE EAST | 4 | | MIDDLE EAST | 10 | | MIDDLE EAST | 11 | | MIDDLE EAST | 13 | | MIDDLE EAST | 20 | +-------------+------------------+ select r_name, count(r_nations.item.n_nationkey) as count, sum(r_nations.item.n_nationkey) as sum, avg(r_nations.item.n_nationkey) as avg, min(r_nations.item.n_name) as minimum, max(r_nations.item.n_name) as maximum, ndv(r_nations.item.n_nationkey) as distinct_vals from region, region.r_nations as r_nations group by r_name order by r_name; +-------------+-------+-----+------+-----------+----------------+---------------+ | r_name | count | sum | avg | minimum | maximum | distinct_vals | +-------------+-------+-----+------+-----------+----------------+---------------+ | AFRICA | 5 | 50 | 10 | ALGERIA | MOZAMBIQUE | 5 | | AMERICA | 5 | 47 | 9.4 | ARGENTINA | UNITED STATES | 5 | | ASIA | 5 | 68 | 13.6 | CHINA | VIETNAM | 5 | | EUROPE | 5 | 77 | 15.4 | FRANCE | UNITED KINGDOM | 5 | | MIDDLE EAST | 5 | 58 | 11.6 | EGYPT | SAUDI ARABIA | 5 | +-------------+-------+-----+------+-----------+----------------+---------------+
Restrictions:
This function cannot be used in an analytic context. That is, the OVER() clause is not allowed at all with this function.
Examples:
The following example queries a billion-row table to illustrate the relative performance of COUNT(DISTINCT) and NDV(). It shows how COUNT(DISTINCT) gives a precise answer, but is inefficient for large-scale data where an approximate result is sufficient. The NDV() function gives an approximate result but is much faster.
select count(distinct col1) from sample_data; +---------------------+ | count(distinct col1)| +---------------------+ | 100000 | +---------------------+ Fetched 1 row(s) in 20.13s select cast(ndv(col1) as bigint) as col1 from sample_data; +----------+ | col1 | +----------+ | 139017 | +----------+ Fetched 1 row(s) in 8.91s
The following example shows how you can code multiple NDV() calls in a single query, to easily learn which columns have substantially more or fewer distinct values. This technique is faster than running a sequence of queries with COUNT(DISTINCT) calls.
select cast(ndv(col1) as bigint) as col1, cast(ndv(col2) as bigint) as col2, cast(ndv(col3) as bigint) as col3, cast(ndv(col4) as bigint) as col4 from sample_data; +----------+-----------+------------+-----------+ | col1 | col2 | col3 | col4 | +----------+-----------+------------+-----------+ | 139017 | 282 | 46 | 145636240 | +----------+-----------+------------+-----------+ Fetched 1 row(s) in 34.97s select count(distinct col1) from sample_data; +---------------------+ | count(distinct col1)| +---------------------+ | 100000 | +---------------------+ Fetched 1 row(s) in 20.13s select count(distinct col2) from sample_data; +----------------------+ | count(distinct col2) | +----------------------+ | 278 | +----------------------+ Fetched 1 row(s) in 20.09s select count(distinct col3) from sample_data; +-----------------------+ | count(distinct col3) | +-----------------------+ | 46 | +-----------------------+ Fetched 1 row(s) in 19.12s select count(distinct col4) from sample_data; +----------------------+ | count(distinct col4) | +----------------------+ | 147135880 | +----------------------+ Fetched 1 row(s) in 266.95s