Why is RUNSTATS important? Almost every major database today uses some method of updating catalog statistics to provide the best information possible to its optimizer. Think of the optimizer as an explorer waiting to navigate a mountain represented by the data in the system. Updating the catalog statistics would provide the optimizer with the most current map to navigate quickly through the terrain. The DB2 optimizer uses catalog statistics to determine the best access path, and the main method employed to update these catalog statistics is to run the RUNSTATS utility. When there is any data modification to the user tables, the catalog statistics tables are not automatically modified. A RUNSTATS command has to be performed on tables and indexes so that columns in the catalog tables are updated with the latest information. The following catalog columns can be queried to determine if RUNSTATS has been performed on the tables and indexes:
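For example, CARD and STATS_TIME in SYSCAT.TABLES, and NLEAF, NLEVELS, and STATS_TIME in SYSCAT.INDEXES, can be inspected. A minimal sketch of such a check, assuming the standard SYSCAT catalog views and a placeholder DB2USER schema (a CARD of -1 and a NULL STATS_TIME generally indicate that statistics have not been collected):

SELECT TABNAME, CARD, STATS_TIME FROM SYSCAT.TABLES WHERE TABSCHEMA = 'DB2USER'
SELECT INDNAME, NLEAF, NLEVELS, STATS_TIME FROM SYSCAT.INDEXES WHERE TABSCHEMA = 'DB2USER'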
After data is initially loaded to a table, subsequent changes in the form of updates, deletes, and inserts to tables may be inevitable. Indexes may be added or dropped. Data distribution in particular columns may change over time. None of these changes are reflected in the catalog, unless RUNSTATS is performed on these tables and indexes. Catalog statistics can become stale over time as changes to the table and the data occur. Performing RUNSTATS is important, but running it often enough may be an issue. For small tables, issuing the RUNSTATS command would be a trivial task. However, as the table grows, the RUNSTATS command takes more time, CPU and memory resources to complete. Eventually, you must consider the tradeoff of allocating more time and resources to RUNSTATS versus the potential degradation in performance if you do not run it. It is usually only when critical queries start slowing down that administrators give RUNSTATS the proper attention. You can avoid a knee-jerk reaction to tuning queries and performing RUNSTATS by formulating a strategy for collecting statistics efficiently and effectively.
Collect catalog statistics using RUNSTATS
You should ideally perform RUNSTATS on tables and indexes in situations such as the following: after a table has been loaded with data and its indexes have been created, after a new index has been created, after a table has been reorganized, after extensive insert, update, or delete activity, and before binding application programs whose performance is critical.
You can determine the effects of running RUNSTATS on the access plans by comparing the EXPLAIN output of the queries before and after RUNSTATS. After every RUNSTATS statement completes, you should execute an explicit COMMIT WORK. The COMMIT releases locks and avoids filling up the log when gathering statistics on multiple tables. After collecting statistics with RUNSTATS, rebind packages containing static SQL using the BIND or REBIND command (and optionally re-explain their statements). The following examples illustrate using RUNSTATS to collect statistics:
Example 1: Collect statistics on all columns
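A minimal sketch of such a command, using a hypothetical db2user.department table (modeled on the SAMPLE database DEPARTMENT table referred to elsewhere in this article):

RUNSTATS ON TABLE db2user.department ON ALL COLUMNS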
This is equivalent to:
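Because all columns are considered by default when no column clause is given, the shorter form below should be equivalent (same placeholder table):

RUNSTATS ON TABLE db2user.department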
Example 3 shows how you collect statistics on key columns. The phrase "key columns" means the columns that make up the indexes defined on the table. If there are no indexes, no column statistics are collected. Example 3: Collect catalog statistics on key columns
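A possible form of this command, again with the placeholder table name:

RUNSTATS ON TABLE db2user.department ON KEY COLUMNS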
If new indexes have been created, and the corresponding table has not been modified since the last RUNSTATS execution, then you can collect catalog statistics on indexes only. It is also possible to collect catalog statistics while the index is being created. I'll discuss this method in the section Alternative ways of collecting catalog statistics.
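A hedged sketch of collecting index statistics only, without recollecting table statistics (placeholder table name):

RUNSTATS ON TABLE db2user.department FOR INDEXES ALL
-- or, for detailed index statistics:
RUNSTATS ON TABLE db2user.department FOR DETAILED INDEXES ALL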
Use RUNSTATS WITH DISTRIBUTION
When you have determined that the table contains data that is not uniformly distributed, run RUNSTATS with the WITH DISTRIBUTION clause. Normally, the catalog statistics tables contain information about the highest and lowest values in a column, and the optimizer assumes that the data values are evenly distributed between these end-point values. However, if data values differ widely from one another, are clustered in some spots, or contain many duplicates, the optimizer cannot choose an optimal access path unless distribution statistics are collected. Note that distribution statistics mainly help queries whose predicates use literal values; for predicates with parameter markers or host variables, the optimizer does not know the run-time value and cannot tell whether it will qualify many or few rows. The following examples illustrate the different ways of using RUNSTATS to collect statistics with distribution: Example 9: Collect catalog statistics on the table and indexes with distribution statistics
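One plausible form of this command (placeholder table name):

RUNSTATS ON TABLE db2user.department WITH DISTRIBUTION AND INDEXES ALL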
RUNSTATS with frequency and quantile statistics
When you perform RUNSTATS with the WITH DISTRIBUTION clause, a set of frequency and quantile statistics is collected, depending on the options given in the RUNSTATS command. RUNSTATS collects two types of data distribution statistics: frequency statistics and quantile statistics. Frequency statistics provide information about the column data values that have the highest number of duplicates; the default number of such values collected is controlled by the num_freqvalues database configuration parameter. The default value is 10, and the recommended setting is between 10 and 100. If num_freqvalues is set to zero, no frequent-value statistics are retained. Quantile statistics provide information about how data values are distributed in relation to other values; the default number collected is controlled by the num_quantiles database configuration parameter, which specifies the number of sections into which the column data values should be grouped. The default value is 20, and the recommended range for this parameter is between 20 and 50. If you set this parameter to 0 or 1, no quantile statistics are collected. If num_freqvalues and num_quantiles are not specified at either the column or table level of the RUNSTATS command, their values are taken from the corresponding database configuration parameters. The precision of the frequency and quantile statistics can be modified for an individual column or a group of columns. Increasing the precision of distribution statistics leads to greater CPU and memory consumption, as well as increased catalog space, so consider collecting distribution statistics only for the most important columns that appear in selection predicates of the most important queries. RUNSTATS will not collect distribution statistics when any of the following conditions are present:
The following examples illustrate the different ways of using RUNSTATS to collect catalog statistics and specifying num_freqvalues and num_quantiles: Assume that the num_freqvalues database configuration parameter value is 10, and the num_quantiles database configuration parameter value is 20. Example 14. Collect catalog statistics with distribution
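A possible form of this command (placeholder table; with the assumed configuration values above, num_freqvalues defaults to 10 and num_quantiles to 20):

RUNSTATS ON TABLE db2user.department WITH DISTRIBUTION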
The num_freqvalues parameter is set to 10, and the num_quantiles parameter is set to 20, since no num_freqvalues or num_quantiles parameters were specified on the command line. Example 15: Collect catalog statistics on the table only, on all columns with distribution statistics using a specified num_freqvalues while picking the num_quantiles from the database configuration setting
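A hedged sketch, specifying num_freqvalues at the table level and letting num_quantiles default from the database configuration (placeholder table):

RUNSTATS ON TABLE db2user.department WITH DISTRIBUTION DEFAULT NUM_FREQVALUES 40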
The num_freqvalues parameter is set to 40, and the num_quantiles parameter is set to 20.
For the deptname column, num_freqvalues is 50 and num_quantiles is 100. Example 17: Collect catalog statistics on all indexes and on column deptname with distribution. Distribution statistics for columns not listed will be cleared.
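One plausible form of this command, with column-level values that produce the results described below (placeholder table):

RUNSTATS ON TABLE db2user.department WITH DISTRIBUTION ON COLUMNS (deptname NUM_FREQVALUES 20 NUM_QUANTILES 40) AND INDEXES ALL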
For the deptname column, num_freqvalues is 20 and num_quantiles is 40. Example 18: Collect catalog statistics on all indexes and on columns deptno and deptname. The num_freqvalues and num_quantiles values for the deptno column will be obtained from the default.
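A possible form, in which deptno takes the configuration defaults and deptname has explicit column-level values (placeholder table):

RUNSTATS ON TABLE db2user.department WITH DISTRIBUTION ON COLUMNS (deptno, deptname NUM_FREQVALUES 20 NUM_QUANTILES 40) AND INDEXES ALL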
For the deptname column, num_freqvalues is 20, and num_quantiles is 40.
RUNSTATS with column group statistics
Column group statistics capture the number of distinct combinations of values for a group of columns. In general, the basic statistics available to the DB2 optimizer do not detect data correlation. Column groups provide more accurate estimates of the joint selectivity of multiple predicates. Column group statistics assume the data is uniform; distribution statistics are not yet available on column groups. The product of the individual column cardinalities is compared to the cardinality of the column group to obtain better correlation estimates. The following example illustrates using RUNSTATS to collect catalog statistics that capture column group information:
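A hedged sketch using the two column groups named in the next sentence (placeholder table):

RUNSTATS ON TABLE db2user.department ON COLUMNS ((deptno, deptname), (admrdept, location))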
In this case, there are two column groups: (deptno, deptname) and (admrdept, location).
RUNSTATS with LIKE STATISTICS
When the LIKE STATISTICS clause is specified in RUNSTATS, additional column statistics are collected. These statistics are stored in the SUB_COUNT and SUB_DELIM_LENGTH columns in the SYSIBM.SYSCOLUMNS table. They are collected for string columns only, and they are used by the query optimizer to improve the selectivity estimates for predicates of the type "column LIKE '%abc'" and "column LIKE '%abc%'". The following example illustrates using RUNSTATS to collect catalog statistics that capture LIKE statistics:
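One possible form, collecting LIKE statistics on the string column deptname (placeholder table):

RUNSTATS ON TABLE db2user.department ON COLUMNS (deptname LIKE STATISTICS)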
RUNSTATS with a statistics profile
In DB2 V8.2, it is now possible to set up a statistics profile for RUNSTATS. A statistics profile is a set of options that predefine which statistics are to be collected on a particular table. A statistics profile is registered, or stored in the table descriptor and system catalogs, when the command parameter "SET PROFILE" is added to the RUNSTATS command. To update this statistics profile, use the command parameter "UPDATE PROFILE". There is no delete profile option. The following examples show how this feature is used: Example 21: Register a statistics profile only without collecting catalog statistics
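A hedged sketch that registers a profile with distribution and index options but does not collect statistics (placeholder table):

RUNSTATS ON TABLE db2user.department WITH DISTRIBUTION AND INDEXES ALL SET PROFILE ONLY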
The clause "SET PROFILE ONLY" for RUNSTATS does not collect statistics. Example 22: Register a statistics profile and execute the RUNSTATS command options of the stored statistics profile to collect catalog statistics
RUNSTATS with sampling
With databases growing at an unrelenting pace, the ability to collect statistics by accessing all of the data may be hampered by fixed batch windows or memory and CPU constraints. Normally, a full table scan is performed whenever RUNSTATS is run on a table. With data sampling, only a subset of the data is scanned. If a query attempts to project overall trends and patterns, and approximate answers within a certain margin of error are sufficient to detect these trends and patterns, then data sampling may be a better alternative than a full table scan. In DB2 V8.1, the SAMPLED DETAILED clause was introduced to allow sampling when computing detailed index statistics. Using this clause reduces the amount of background calculation performed for detailed index statistics and the time required, yet produces adequate precision in most cases. Here are some examples of its usage:
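A hedged sketch of the SAMPLED DETAILED clause for index statistics (placeholder table):

RUNSTATS ON TABLE db2user.department AND SAMPLED DETAILED INDEXES ALL
-- index statistics only, without recollecting table statistics:
RUNSTATS ON TABLE db2user.department FOR SAMPLED DETAILED INDEXES ALL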
In DB2 V8.2, two new methods of table data sampling are offered: row-level Bernoulli sampling and system page-level sampling.
RUNSTATS with row-level Bernoulli sampling
Row-level Bernoulli sampling obtains a sample of P percent of the table rows by means of a sargable predicate (a "search-argument-able" predicate, one that can be evaluated by the Data Manager) that includes each row in the sample with probability P/100 and excludes it with probability 1 - P/100. For example, with 10% Bernoulli sampling, each row has a 10% (10/100) chance of being selected and a 90% (1 - 10/100) chance of being rejected. Because the table is still scanned, an I/O is incurred for every page; a random number is generated for each row to determine whether it is included (similar to flipping a coin with probability P/100). Even though every data page is read, row-level Bernoulli sampling can still produce a significant performance improvement because RUNSTATS is CPU intensive and only the sampled rows are processed. If indexes are available, the sampling can be further improved. Row-level sampling may also provide more accurate statistics when the data is clustered, because it obtains a sample that better represents the overall table data.
RUNSTATS with system page-level sampling
System page-level sampling is similar to row-level sampling, except that pages are sampled rather than rows. Each page is selected with probability P/100 and rejected with probability 1 - P/100. For each page selected, all of the rows on that page are included in the sample. The benefit of system page-level sampling over a full table scan or Bernoulli sampling is the savings in I/O. The sampled pages are also prefetched, so this method is faster than row-level Bernoulli sampling. Compared with no sampling, page-level sampling significantly improves performance. The REPEATABLE clause of RUNSTATS allows the same sample to be generated across RUNSTATS statements as long as the table data has not changed. To specify this option, you must also supply an integer that represents the seed to be used for sample generation; by using the same seed, the same sample can be generated. In summary, the accuracy of sampled statistics depends on the sampling rate, the data skew, and the data clustering. Some examples of RUNSTATS using Bernoulli row-level and system page-level sampling follow: Example 29. Collect statistics, including distribution statistics on 10 percent of the rows
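A hedged sketch of Example 29, plus a system page-level variant with a repeatable seed (placeholder table; the seed 1234 is arbitrary):

RUNSTATS ON TABLE db2user.department WITH DISTRIBUTION TABLESAMPLE BERNOULLI(10)
-- page-level sampling of roughly 15 percent of the pages, repeatable across invocations:
RUNSTATS ON TABLE db2user.department AND INDEXES ALL TABLESAMPLE SYSTEM(15) REPEATABLE(1234)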
Alternative ways of collecting catalog statistics
One alternative way of collecting statistics for all tables in a database is to issue a REORGCHK command as follows:
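A sketch of the command; UPDATE STATISTICS tells REORGCHK to gather statistics with the default RUNSTATS options before evaluating its formulas:

REORGCHK UPDATE STATISTICS ON TABLE ALL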
The UPDATE STATISTICS option of the REORGCHK command is similar to invoking the RUNSTATS utility to update catalog statistics for all tables in the database, but statistics on all columns are gathered using the default RUNSTATS options only. Using this command, it is also possible to collect statistics on individual tables or on a group of tables in the same schema. Example 33. Collect catalog statistics for a table using REORGCHK
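A possible form for a single table, plus a schema-level variant (placeholder table and schema names):

REORGCHK UPDATE STATISTICS ON TABLE db2user.department
-- all tables in one schema:
REORGCHK UPDATE STATISTICS ON SCHEMA db2user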
Although this method seems to be a quick way of collecting catalog statistics on multiple tables, it is advisable only for situations where the REORGCHK execution can be completed in a reasonable amount of time. Catalog statistics can also be collected during a LOAD REPLACE of a table, as well as during the creation of an index. In UDB V8.2, to invoke the collection of catalog statistics during a LOAD, the option "STATISTICS USE PROFILE" must be added to the LOAD control statements. The option "STATISTICS YES" still works, but is now obsolete syntax. The goal of issuing "STATISTICS USE PROFILE" during a LOAD is to save elapsed time over the usual process in which a LOAD of a table is followed by a RUNSTATS on that same table. The statistics profile must be created before the LOAD is executed, and it allows you to specify the same statistics options as you would in a RUNSTATS command.
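A hedged sketch, assuming a hypothetical delimited input file dept.del and a profile registered beforehand with SET PROFILE ONLY (placeholder names throughout):

RUNSTATS ON TABLE db2user.department WITH DISTRIBUTION AND INDEXES ALL SET PROFILE ONLY
LOAD FROM dept.del OF DEL MESSAGES dept.msg REPLACE INTO db2user.department STATISTICS USE PROFILE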
In most cases, a RUNSTATS is done after an index has been created. However, it is possible to collect index statistics while a create index operation is being done. This avoids another scan of the index in order to gather statistics. The following examples illustrate some of the options available to do this:
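A possible sketch of collecting index statistics at index creation time, avoiding a separate scan (hypothetical index and table names):

CREATE INDEX db2user.xdeptname ON db2user.department (deptname) COLLECT STATISTICS
-- detailed statistics, optionally sampled:
CREATE INDEX db2user.xdeptloc ON db2user.department (deptno, location) COLLECT SAMPLED DETAILED STATISTICS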
RUNSTATS authorization and user access
To be able to collect statistics against a table or index, you must be able to connect to the database that contains the tables and indexes and have one of the following authorization levels: SYSADM, SYSCTRL, SYSMAINT, or DBADM authority, or CONTROL privilege on the table.
There are two options for user access to a table while a RUNSTATS is running against that table: allow read access and allow write access. The parameter ALLOW READ ACCESS in a RUNSTATS command specifies that other users can have read-only access to the table while statistics are calculated. The parameter ALLOW WRITE ACCESS in a RUNSTATS command specifies that other users can read from and write to the table while statistics are calculated. Use this clause if the table must be up and available at any time.
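A hedged sketch of each access option (placeholder table name):

RUNSTATS ON TABLE db2user.department ON ALL COLUMNS ALLOW READ ACCESS
RUNSTATS ON TABLE db2user.department ON ALL COLUMNS ALLOW WRITE ACCESS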
RUNSTATS in partitioned databases
When a RUNSTATS command is issued in a partitioned database, and a partition of the table exists on the database partition where RUNSTATS was issued, RUNSTATS executes on that database partition. If the table does not have a partition on that database partition, the request is sent to the first database partition in the database partition group that holds a partition for the table, and RUNSTATS executes there. RUNSTATS does not run in parallel in a partitioned database; instead, information is determined for one partition and then extrapolated to produce a suitable estimate for all partitions. There is an implicit assumption that the rows of each table are evenly distributed across all the partitions in each multiple-partition database partition group. The following IBM technote describes a possible problem that may be encountered with RUNSTATS when loading a table. The resolution is to ensure that a partitioning key is defined for the table before a load takes place.
IBM Technote # 1153232 (2004-02-12)
Abstract: If one of the partitions has zero rows in it for a given table following a load and runstats is executed from that partition, it will update the column CARD in syscat.tables to show that the table has zero rows. This only affects multi-partition instances.
Content: The behavior of runstats is such that it will use the data on the partition that it is run from to extrapolate the number of rows in the table. For example, if after a load all the data is on one partition and runstats is executed from another partition, it will update syscat.tables to show zero rows for that table. However, if runstats is run from the partition that has all the data, it will update syscat.tables to show the table having ((num rows on that partition) x (number of partitions)) rows. To get around this issue, which is the default behavior of runstats, a partitioning key needs to be defined for the table before the load takes place. The partitioning key needs to be on a column or set of columns that are highly unique, thus ensuring that the data will be evenly distributed across all the partitions that the table is defined on.
Adjusting the STAT_HEAP_SZ database configuration parameter
The stat_heap_sz (statistics heap size) database configuration parameter indicates the maximum size of the heap used in collecting statistics with the RUNSTATS command. The heap is allocated when the RUNSTATS utility starts and freed when it completes. Stat_heap_sz is part of agent private memory. When gathering distribution statistics, it is recommended to increase the stat_heap_sz parameter so that more columns can fit into the heap; wide tables also require more memory for processing. When executing RUNSTATS with the SAMPLED DETAILED option, an additional 2 MB of memory must be allocated to ensure that RUNSTATS can run successfully.
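A hedged sketch of increasing the parameter for the SAMPLE database (the value 6000 is an arbitrary illustration; stat_heap_sz is specified in 4 KB pages):

UPDATE DATABASE CONFIGURATION FOR sample USING STAT_HEAP_SZ 6000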
Strategies to reduce RUNSTATS impact on system performance
You can use the following strategies to help reduce the performance impact of RUNSTATS on the system:
Scheduling RUNSTATS to run only during low-activity periods is a good way of minimizing the impact on the system. However, for a 24x7 system, there may be no available window or low-activity period. One way to handle this situation is to use the RUNSTATS throttling option. Throttling limits the amount of resources taken up by a utility based on the current database activity levels. In UDB, the interaction of the util_impact_lim and UTIL_IMPACT_PRIORITY parameters determines the behavior of RUNSTATS when throttled. The UTIL_IMPACT_PRIORITY keyword is used in a clause of a utility command such as RUNSTATS, while util_impact_lim is an instance configuration parameter. The util_impact_lim parameter is the percentage of allowable impact that all throttled utilities can have on the instance workload. If util_impact_lim is 100 (the default value), no utility invocations such as RUNSTATS are throttled. If, for example, util_impact_lim is set to 10, then a throttled RUNSTATS invocation is limited to consuming no more than 10 percent of the workload. The UTIL_IMPACT_PRIORITY keyword serves as a switch that specifies whether RUNSTATS subscribes to the throttling policy. Example 39. Use of UTIL_IMPACT_PRIORITY keyword
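A hedged sketch: the instance is first limited to a 10 percent impact, and RUNSTATS is then invoked with a priority of 10 (placeholder table name):

UPDATE DBM CFG USING UTIL_IMPACT_LIM 10
RUNSTATS ON TABLE db2user.department WITH DISTRIBUTION AND INDEXES ALL UTIL_IMPACT_PRIORITY 10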
where 10 is the priority, or the level at which RUNSTATS will be throttled. When util_impact_lim is not 100 and RUNSTATS is invoked with UTIL_IMPACT_PRIORITY, it will be throttled. If no priority is specified for UTIL_IMPACT_PRIORITY and util_impact_lim is not 100, RUNSTATS will have a default priority of 50 and will be throttled. If the UTIL_IMPACT_PRIORITY keyword is omitted from the RUNSTATS command, RUNSTATS will run unthrottled. If a priority is specified but util_impact_lim is set to 100, RUNSTATS will run unthrottled. If the priority is set to zero, RUNSTATS will run unthrottled. When RUNSTATS throttling is defined and operational, the RUNSTATS utility is expected to take longer, but it will have less impact on the system.
Miscellaneous RUNSTATS issues
RUNSTATS on declared global temporary tables
The RUNSTATS command can be issued on a declared global temporary table (DGTT), but the resulting statistics are not stored in the system catalogs because declared temporary tables do not have catalog entries. Instead, these statistics are stored in memory structures that represent the catalog information for declared temporary tables. The recommendation is to issue RUNSTATS if the savings in total query elapsed time is much greater than the additional time needed to run RUNSTATS.
Can the RUNSTATS command be issued from within an SQL routine (procedure, UDF, method)?
The following IBM technote discusses the question: IBM Technote # 1153051. It is possible to create a C stored procedure named "runstats" that invokes the db2Runstats() API function. This C stored procedure can then be invoked using the CALL command.
How do I perform a RUNSTATS to gather statistics from the call level interface (CLI) or Java?
The following IBM technote answers the question: IBM Technote # 1023788.
DB2 automatic statistics collection
DB2 automatic statistics collection was introduced in DB2 UDB Version 8.2. Automatic statistics collection is part of a completely automated table maintenance solution. The objective is to allow DB2 to determine which statistics are required by the workload, and to automatically run the RUNSTATS utility in the background periodically to update statistics as needed. To set up the SAMPLE database for automatic statistics collection, set the database configuration parameters for the automatic maintenance switches as follows:
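A hedged sketch of enabling the switches for the SAMPLE database (the hierarchy runs from AUTO_MAINT at the top, through AUTO_TBL_MAINT, down to AUTO_RUNSTATS; the profiling switches AUTO_STATS_PROF and AUTO_PROF_UPD sit below AUTO_RUNSTATS and are discussed later):

UPDATE DATABASE CONFIGURATION FOR sample USING AUTO_MAINT ON AUTO_TBL_MAINT ON AUTO_RUNSTATS ON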
The following diagram (showing only the statistics collection and profiling options) shows the hierarchy of the automatic maintenance parameters for statistics collection and profiling, and their dependencies. With this structure, it is possible to quickly turn off the parent automatic maintenance parameter, AUTO_MAINT, at the highest level without losing the configuration settings for the lower levels, such as AUTO_RUNSTATS.
Figure 1. Automatic maintenance settings
The parameters AUTO_STATS_PROF and AUTO_PROF_UPD are turned on if automatic statistics profiling is desired. Automatic statistics profiling goes one step further by determining when and how to collect statistics. A statistics profile is generated automatically for use by the automatic statistics collection process to schedule RUNSTATS. An internal algorithm compares newly collected statistics with a saved set of statistics, and RUNSTATS with sampling is issued when certain conditions are triggered. When automatic statistics profiling is enabled, information about database activity is collected and stored in a query feedback warehouse. Statistics profiles are then generated based on the data in the query feedback warehouse. To enable automatic statistics profile generation, set two database configuration parameters on:
AUTO_STATS_PROF: Turning this parameter on activates the collection of query feedback data.
AUTO_PROF_UPD: Turning this parameter on specifies that the RUNSTATS profile is to be updated with recommendations from DB2's analysis of the query feedback data. Before automatic statistics profiling is activated, a query feedback warehouse must be created by running the SYSINSTALLOBJECTS stored procedure. This stored procedure is invoked as follows:
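A hedged sketch of the general form, assuming the four-parameter SYSPROC.SYSINSTALLOBJECTS signature (tool name, action, table space name, schema name):

CALL SYSPROC.SYSINSTALLOBJECTS('toolname', 'action', 'tablespacename', 'schemaname')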
where toolname is "ASP" or "AUTO STATS PROFILING" and action is "C" for create or "D" for drop. The parameter schemaname is not currently used. For example, to create the feedback warehouse, run the following call:
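One plausible invocation, passing NULL for the unused table space and schema names (the CASTs are a precaution so the CALL accepts untyped NULLs):

CALL SYSPROC.SYSINSTALLOBJECTS('ASP', 'C', CAST(NULL AS VARCHAR(128)), CAST(NULL AS VARCHAR(128)))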
This procedure creates the tables that store recommendations and information about the predicates encountered during query execution.
When the AUTO_PROF_UPD parameter is disabled, recommendations are stored in the SYSTOOLS.OPT_FEEDBACK_RANKING table. The recommendations stored in this table can then be reviewed and refined when manually updating the RUNSTATS profile. DB2 automatic statistics collection and profile generation can only be activated in DB2 serial mode; they are not available in federated, SMP, and MPP environments.
Conclusion Understanding and properly using RUNSTATS is one of the keys to maintaining a high-performing database. In this article, I've discussed the importance of RUNSTATS and explained the many options you can use to make it more useful, including new options in V8.2 which ease administration by letting you set up a profile specifying which statistics you want gathered for each table. The examples I've shown will make it easy for you to use RUNSTATS to gain optimal performance for your own databases.
Source: Larry Pay. Larry Pay is an Engagement Specialist from the DB2 Data Warehouse Consulting Group, IBM Software Services for DB2 Information Management at the IBM Toronto Laboratory. He has been helping customers implement DB2 and Business Intelligence solutions for over 5 years. Larry is an IBM Certified Database Administrator for DB2 UDB.