APPX_COUNT_DISTINCT Query Option (CDH 5.2 or higher only)
Allows multiple COUNT(DISTINCT) operations within a single query, by internally rewriting each COUNT(DISTINCT) to use the NDV() function. The resulting count is approximate rather than precise.
Type: Boolean; recognized values are 1 and 0, or true and false; any other value interpreted as false
Default: false (shown as 0 in output of SET statement)
Examples:
The following examples show how the APPX_COUNT_DISTINCT lets you work around the restriction where a query can only evaluate COUNT(DISTINCT col_name) for a single column. By default, you can count the distinct values of one column or another, but not both in a single query:
[localhost:21000] > select count(distinct x) from int_t; +-------------------+ | count(distinct x) | +-------------------+ | 10 | +-------------------+ [localhost:21000] > select count(distinct property) from int_t; +--------------------------+ | count(distinct property) | +--------------------------+ | 7 | +--------------------------+ [localhost:21000] > select count(distinct x), count(distinct property) from int_t; ERROR: AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters as count(DISTINCT x); deviating function: count(DISTINCT property)
When you enable the APPX_COUNT_DISTINCT query option, now the query with multiple COUNT(DISTINCT) works. The reason this behavior requires a query option is that each COUNT(DISTINCT) is rewritten internally to use the NDV() function instead, which provides an approximate result rather than a precise count.
[localhost:21000] > set APPX_COUNT_DISTINCT=true; [localhost:21000] > select count(distinct x), count(distinct property) from int_t; +-------------------+--------------------------+ | count(distinct x) | count(distinct property) | +-------------------+--------------------------+ | 10 | 7 | +-------------------+--------------------------+
Related information: