Storage Space Planning for Cloudera Manager
Minimum Required Role: Full Administrator
Cloudera Manager tracks metrics of services, jobs, and applications in many background processes. All of these metrics require storage. Depending on the size of your organization, this storage can be local or remote, disk-based or in a database, managed by you or by another team in another location.
Most system administrators are aware of common locations like /var/log/ and the need for these locations to have adequate space. This topic helps you plan for the storage needs and data storage locations used by the Cloudera Manager Server and the Cloudera Management Service to store metrics and data.
Failing to plan for the storage needs of all components of the Cloudera Manager Server and the Cloudera Management Service can negatively impact your cluster in the following ways:
- The cluster might not be able to retain historical operational data to meet internal requirements.
- The cluster might miss critical audit information that was not gathered or retained for the required length of time.
- Administrators might be unable to research past events or health status.
- Administrators might not have historical MR1, YARN, or Impala usage data when they need to reference or report on them later.
- There might be gaps in metrics collection and charts.
- The cluster might experience data loss due to filling storage locations to 100% of capacity. The effects of such an event can impact many other components.
The main theme here is that you must architect your data storage needs well in advance. You must inform your operations staff about your critical data storage locations for each host so that they can provision your infrastructure adequately and back it up appropriately. Make sure to document the discovered requirements in your internal build documentation and run books.
This topic describes both local disk storage and RDBMS storage. This distinction is made both for storage planning and also to inform migration of roles from one host to another, preparing backups, and other lifecycle management events.
The following tables provide details about each individual Cloudera Management service to enable Cloudera Manager administrators to make appropriate storage and lifecycle planning decisions.
Cloudera Manager Server
Configuration Topic | Cloudera Manager Server Configuration |
---|---|
Default Storage Location | RDBMS:
Any Supported RDBMS. For more information, see CDH and Cloudera Manager Supported Databases. Disk: Cloudera Manager Server Local Data Storage Directory (command_storage_path) on the host where the Cloudera Manager Server is configured to run. This local path is used by Cloudera Manager for storing data, including command result files. Critical configurations are not stored in this location. Default setting: /var/lib/cloudera-scm-server/ |
Storage Configuration Defaults, Minimum, or Maximum | There are no direct storage defaults relevant to this entity. |
Where to Control Data Retention or Size | The size of the Cloudera Manager Server database varies depending on the number of managed hosts and the number of
discrete commands that have been run in the cluster. To configure the size of the retained command results in the Cloudera Manager Administration Console, select
|
and edit the following property:
Sizing, Planning & Best Practices | The Cloudera Manager Server database is the most vital configuration store in a Cloudera Manager deployment. This
database holds the configuration for clusters, services, roles, and other necessary information that defines a deployment of Cloudera Manager and its managed hosts.
Make sure that you perform regular, verified, remotely-stored backups of the Cloudera Manager Server database. |
Cloudera Management Service
Configuration Topic | Activity Monitor |
---|---|
Default Storage Location | Any Supported RDBMS. For more information, see CDH and Cloudera Manager Supported Databases. |
Storage Configuration Defaults / Minimum / Maximum | Default: 14 Days worth of MapReduce (MRv1) jobs/tasks |
Where to Control Data Retention or Size |
You control Activity Monitor storage usage by configuring the number of days or hours of data to retain. Older data is purged. To configure data retention in the Cloudera Manager Administration Console:
|
Sizing, Planning, and Best Practices |
The Activity Monitor only monitors MapReduce jobs, and does not monitor YARN applications. If you no longer use MapReduce (MRv1) in your cluster, the Activity Monitor is not required for Cloudera Manager 5 (or higher) or CDH 5 (or higher). The amount of storage space needed for 14 days worth of MapReduce activities can vary greatly and directly depends on the size of your cluster and the level of activity that uses MapReduce. It might be necessary to adjust and readjust the amount of storage as you determine the "stable state" and "burst state" of the MapReduce activity in your cluster. For example, consider the following test cluster and usage:
Sizing observations for this cluster:
|
Configuration Topic | Service Monitor Configuration |
---|---|
Default Storage Location | /var/lib/cloudera-service-monitor/ on the host where the Service Monitor role is configured to run. |
Storage Configuration Defaults / Minimum / Maximum |
Total: ~12 GiB Minimum (No Maximum) |
Where to Control Data Retention or Size |
Service Monitor data growth is controlled by configuring the maximum amount of storage space it can use. To configure data retention in Cloudera Manager Administration Console:
|
Sizing, Planning, and Best Practices | The Service Monitor gathers metrics about configured roles and services in your cluster and also runs active health tests. These health tests run regardless of idle and use periods, because they are always relevant. The Service Monitor gathers metrics and health test results regardless of the level of activity in the cluster. This data continues to grow, even in an idle cluster. |
Configuration Topic | Host Monitor Configuration |
---|---|
Default Storage Location | /var/lib/cloudera-host-monitor/ on the host where the Host Monitor role is configured to run. |
Storage Configuration Defaults / Minimum/ Maximum | Default (and minimum): 10 GiB Host Time Series Storage |
Where to Control Data Retention or Size | Host Monitor data growth is controlled by configuring the maximum amount of storage space it can use.
See Data Storage for Monitoring Data. To configure these data retention configuration properties in the Cloudera Manager Administration Console:
|
Sizing, Planning and Best Practices | The Host Monitor gathers metrics about host-level items of interest (for example: disk space usage, RAM, CPU usage, swapping, etc) and also informs host health tests. The Host Monitor gathers metrics and health test results regardless of the level of activity in the cluster. This data continues to grow fairly linearly, even in an idle cluster. |
Configuration Topic | Event Server Configuration |
---|---|
Default Storage Location | /var/lib/cloudera-scm-eventserver/ on the host where the Event Server role is configured to run. |
Storage Configuration Defaults | 5,000,000 events retained |
Where to Control Data Retention or Minimum /Maximum |
The amount of storage space the Event Server uses is influenced by configuring how many discrete events it can retain. To configure data retention in Cloudera Manager Administration Console,
|
Sizing, Planning, and Best Practices |
The Event Server is a managed Lucene index that collects relevant events that happen within your cluster, such as results of health tests, log events that are created when a log entry matches a set of rules for identifying messages of interest and makes them available for searching, filtering and additional action. You can view and filter events on the tab of the Cloudera Manager Administration Console. You can also poll this data using the Cloudera Manager API. |
Configuration Topic | Reports Manager Configuration |
---|---|
Default Storage Location | RDBMS:
Any Supported RDBMS. For more information, see CDH and Cloudera Manager Supported Databases. Disk: /var/lib/cloudera-scm-headlamp/ on the host where the Reports Manager role is configured to run. |
Storage Configuration Defaults |
RDBMS: There are no configurable parameters to directly control the size of this data set. Disk: There are no configurable parameters to directly control the size of this data set. The storage utilization depends not only on the size of the HDFS fsimage, but also on the HDFS file path complexity. Longer file paths contribute to more space utilization. |
Where to Control Data Retention or Minimum / Maximum |
The Reports Manager uses space in two main locations: on the Reports Manager host and on its supporting database. Cloudera recommends that the database be on a separate host from the Reports Manager host for process isolation and performance. |
Sizing, Planning, and Best Practices | Reports Manager downloads the fsimage from the NameNode (every 60 minutes by default) and stores
it locally to perform operations against, including indexing the HDFS filesystem structure. More files and directories results in a larger fsimage, which consumes more disk
space.
Reports Manager has no control over the size of the fsimage. If your total HDFS usage trends upward notably or you add excessively long paths in HDFS, it might be necessary to revisit and adjust the amount of local storage allocated to the Reports Manager. Periodically monitor, review, and adjust the local storage allocation. |
Cloudera Navigator
Configuration Topic | Navigator Audit Server Configuration |
---|---|
Default Storage Location | Any Supported RDBMS. For more information, see CDH and Cloudera Manager Supported Databases. |
Storage Configuration Defaults | Default: 90 Days retention |
Where to Control Data Retention or Min/Max | Navigator Audit Server storage usage is controlled by configuring how many days of data it can retain. Any older data
is purged.
To configure data retention in the Cloudera Manager Administration Console:
|
Sizing, Planning, and Best Practices | The size of the Navigator Audit Server database directly depends on the number of audit events the cluster’s audited
services generate. Normally the volume of HDFS audits exceeds the volume of other audits (all other components like MRv1, Hive and Impala read from HDFS, which generates additional audit events).
The average size of a discrete HDFS audit event is ~1 KB. For a busy cluster of 50 hosts with ~100K audit events generated per hour, the Navigator Audit Server database would consume ~2.5 GB per day. To retain 90 days of audits at that level, plan for a database size of around 250 GB. If other configured cluster services generate roughly the same amount of data as the HDFS audits, plan for the Navigator Audit Server database to require around 500 GB of storage for 90 days of data. Notes:
To map Cloudera Navigator versions to Cloudera Manager versions, see Product Compatibility Matrix for Cloudera Navigator. |
Configuration Topic | Navigator Metadata Server Configuration |
---|---|
Default Storage Location |
RDBMS: Any Supported RDBMS. For more information, see CDH and Cloudera Manager Supported Databases. Disk: /var/lib/cloudera-scm-navigator/ on the host where the Navigator Metadata Server role is configured to run. |
Storage Configuration Defaults |
RDBMS: There are no exposed defaults or configurations to directly cull or purge the size of this data set. Disk: There are no configuration defaults to influence the size of this location. You can change the location itself with the Navigator Metadata Server Storage Dir property. The size of the data in this location depends on the amount of metadata in the system (HDFS fsimage size, Hive Metastore size) and activity on the system (the number of MapReduce Jobs run, Hive queries executed, etc). |
Where to Control Data Retention or Min/Max |
RDBMS: The Navigator Metadata Server database should be carefully tuned to support large volumes of metadata. Disk: The Navigator Metadata Server index (an embedded Solr instance) can consume lots of disk space at the location specified for the Navigator Metadata Server Storage Dir property. Ongoing maintenance tasks include purging metadata from the system. |
Sizing, Planning, and Best Practices |
Memory: See Navigator Metadata Server Tuning.RDBMS: The database is used to store policies and authorization data. The dataset is small, but this database is also used during a Solr schema upgrade, where Solr documents are extracted and inserted again in Solr. This has same space requirements as above use case, but the space is only used temporarily during product upgrades. Use the Product Compatibility Matrix for Cloudera Navigator product compatibility matrix to map Cloudera Navigator and Cloudera Manager versions. Disk: This filesystem location contains all the metadata that is extracted from managed clusters. The data is stored in Solr, so this is the location where Solr stores its index and documents. Depending on the size of the cluster, this data can occupy tens of gigabytes. A guideline is to look at the size of HDFS fsimage and allocate two to three times that size as the initial size. The data here is incremental and continues to grow as activity is performed on the cluster. The rate of growth can be on order of tens of megabytes per day. |
General Performance Notes
When possible:
-
For entities that use an RDBMS, install the database on a separate host from the service, and consolidate roles that use databases on as few servers as possible.
-
Provide a dedicated spindle to the RDBMS or datastore data directory to avoid disk contention with other read/write activity.
Cluster Lifecycle Management with Cloudera Manager
Cloudera Manager clusters that use parcels to provide CDH and other components require adequate disk space in the following locations:
Parcel Lifecycle Path (default) | Notes |
---|---|
Local Parcel Repository Path (/opt/cloudera/parcel-repo) |
This path exists only on the host where Cloudera Manager Server (cloudera-scm-server) runs. The Cloudera Manager Server stages all new parcels in this location as it fetches them from any external repositories. Cloudera Manager Agents are then instructed to fetch the parcels from this location when the administrator distributes the parcel using the Cloudera Manager Administration Console or the Cloudera Manager API. Sizing and PlanningThe default location is /opt/cloudera/parcel-repo but you can configure another local filesystem location on the host where Cloudera Manager Server runs. See Parcel Configuration Settings. Provide sufficient space to hold all the parcels you download from all configured Remote Parcel Repository URLs (See Parcel Configuration Settings). Cloudera Manager deployments that manage multiple clusters store all applicable parcels for all clusters. Parcels are provided for each operating system, so be aware that heterogeneous clusters (distinct operating systems represented in the cluster) require more space than clusters with homogeneous operating systems. For example, a cluster with both RHEL6.x and 7.x hosts must hold -el6 and -el7 parcels in the Local Parcel Repository Path, which requires twice the amount of space. Lifecycle Management and Best PracticesDelete any parcels that are no longer in use from the Cloudera Manager Administration Console, (never delete them manually from the command line) to recover disk space in the Local Parcel Repository Path and simultaneously across all managed cluster hosts which hold the parcel. Backup ConsiderationsPerform regular backups of this path, and consider it a non-optional accessory to backing up Cloudera Manager Server. If you migrate Cloudera Manager Server to a new host or restore it from a backup (for example, after a hardware failure), recover the full content of this path to the new host, in the /opt/cloudera/parcel-repo directory before starting any cloudera-scm-agent or cloudera-scm-server processes. |
Parcel Cache (/opt/cloudera/parcel-cache) |
Managed Hosts running a Cloudera Manager Agent stage distributed parcels into this path (as .parcel files, unextracted). Do not manually manipulate this directory or its files. Sizing and PlanningProvide sufficient space per-host to hold all the parcels you distribute to each host. You can configure Cloudera Manager to remove these cached .parcel files after they are extracted and placed in /opt/cloudera/parcels/. It is not mandatory to keep these temporary files but keeping them avoids the need to transfer the .parcel file from the Cloudera Manager Server repository should you need to extract the parcel again for any reason. To configure this behavior in the Cloudera Manager Administration Console, select |
Host Parcel Directory (/opt/cloudera/parcels) |
Managed cluster hosts running a Cloudera Manager Agent extract parcels from the /opt/cloudera/parcel-cache directory into this path upon parcel activation. Many critical system symlinks point to files in this path and you should never manually manipulate its contents. Sizing and PlanningProvide sufficient space on each host to hold all the parcels you distribute to each host. Be aware that the typical CDH parcel size is approximately 2 GB per parcel, and some third party parcels can exceed 3 GB. If you maintain various versions of parcels staged before and after upgrading, be aware of the disk space implications. You can configure Cloudera Manager to automatically remove older parcels when they are no longer in use. As an administrator you can always manually delete parcel versions not in use, but configuring these settings can handle the deletion automatically, in case you forget. To configure this behavior in the Cloudera Manager Administration Console, select and configure the following property:
|
Task | Description |
---|---|
Activity Monitor (One-time) |
The Activity Monitor only works against a MapReduce (MR1) service, not YARN. So if your deployment has fully migrated to YARN and no longer uses a MapReduce (MR1) service, your Activity Monitor database is no longer growing. If you have waited longer than the default Activity Monitor retention period (14 days) to address this point, then the Activity Monitor has already purged it all for you and your database is mostly empty. If your deployment meets these conditions, consider cleaning up by dropping the Activity Monitor database (again, only when you are satisfied that you no longer need the data or have confirmed that it is no longer in use) and the Activity Monitor role. |
Service Monitor and Host Monitor (One-time) |
For those who used Cloudera Manager version 4.x and have now upgraded to version 5.x: The Service Monitor and Host Monitor were migrated from their previously-configured RDBMS into a dedicated time series store used solely by each of these roles respectively. After this happens, there is still legacy database connection information in the configuration for these roles. This was used to allow for the initial migration but is no longer being used for any active work. After the above migration has taken place, the RDBMS databases previously used by the Service Monitor and Host Monitor are no longer used. Space occupied by these databases is now recoverable. If appropriate in your environment (and you are satisfied that you have long-term backups or do not need the data on disk any longer), you can drop those databases. |
Ongoing Space Reclamation |
Cloudera Management Services are automatically rolling up, purging or otherwise consolidating aged data for you in the background. Configure retention and purging limits per-role to control how and when this occurs. These configurations are discussed per-entity above. Adjust the default configurations to meet your space limitations or retention needs. |
Log Files
All CDH cluster hosts write out separate log files for each role instance assigned to the host. Cluster administrators can monitor and manage the disk space used by these roles and configure log rotation to prevent log files from consuming too much disk space.
For more information, see Managing Disk Space for Log Files.
Conclusion
Keep this information in mind for planning and architecting the deployment of a cluster managed by Cloudera Manager. If you already have a live cluster, this lifecycle and backup information can help you keep critical monitoring, auditing, and metadata sources safe and properly backed up.