Experimenting With SQL Tuning In DB2

This paper provides a basic introduction into the SQL query tuning process in regards to the DB2 DBMS. Additionally, this paper will explore a few different SQL query optimization techniques while also analyzing the effectiveness of each method. The most effective method of decreasing SQL query cost appears to be column indexing.

April 29, 2014

DB2

This paper was originally published in APA format on April 24, 2014 as part of a school project. Also, these tests were run on DB2 Express-C edition installed on an Ubuntu VM, so your mileage may vary.

Abstract

This paper provides a basic introduction into the SQL query tuning process in regards to the DB2 DBMS (Database Management System). Additionally, this paper will explore a few different SQL query optimization techniques while also analyzing the effectiveness of each method. The most effective method of decreasing SQL query cost appears to be column indexing.

The DB2 SQL Tuning Toolset

DB2 provides a number of tools which are designed to assist application developers and database administrators in the optimization of databases. This paper will introduce and utilize two tools which were designed to help tune SQL queries. The first of these tools is is the RUNSTATS SQL query. When used, the RUNSTATS command provides DB2 with statistical information, which can then be used by the DBMS and various DB2 utilities (Pay, 2002). The RUNSTATS query has a lot of options, but this paper will use the following RUNSTATS query format:

runstats on table TABLE_NAME on all columns;

The above SQL query tells DB2 to gather statistical information on all columns of the specified table.

The second tool that this study used was db2expln. The db2expln command line utility was designed to use statistics provided by RUNSTATS to provide detailed information on the data access plan that DB2 uses for various database queries (db2expln). There is also a GUI version of the db2expln called Visual Explain, but this paper will only explore the use of the command line version. As with the RUNSTATS query, the db2expln command has many options, but this study only used three. The format of the db2expln command used in this study was as follows:

db2expln -database DATABASE_NAME -statement "SQL_STATEMENT..." -terminal

Running a command like the above will output various information about the provided SQL query in the terminal. This information includes estimated cost of the query, estimated cardinality of the query, and the data access path used by DB2. This paper will primarily use the estimated cost to determine the efficiency of various SQL queries in relation to other queries.

Experiment Structure

Both the RUNSTATS query and the db2expln command require an actual DB2 database with data in it to be useful. This study of various SQL queries used a database with tables of a structure similar to the following (within a schema named bank):

create table bank.product (
    product_cd varchar(10) not null,
    name varchar(50) not null,
    constraint pk_product primary key (product_cd)
);
create table bank.customer (
    cust_id integer not null generated always as identity(start with 1, increment by 1, no cache),
    fed_id varchar(12) not null,
    cust_type_cd char(1) not null check(cust_type_cd IN ('I', 'B')), address varchar(30),
    city varchar(20),
    state varchar(20),
    postal_code varchar(10),
    constraint pk_customer primary key (cust_id)
);
create table bank.account (
    account_id integer not null generated always as identity ( start with 1, increment by 1, no cache),
    avail_balance decimal(10,2),
    product_cd varchar(10) not null,
    status char(6) check (status IN ('ACTIVE','CLOSED','FROZEN')), constraint fk_product_cd foreign key (product_cd)
    references bank.product (product_cd), constraint pk_account primary key (account_id)
);
create table bank.acct_tran (
    txn_id integer not null generated always as identity (start with 1, increment by 1, no cache),
    account_id integer not null, amount decimal(10,2) not null,
    constraint fk_t_account_id foreign key (account_id) references bank.account (account_id),
    constraint pk_transaction primary key (txn_id)
);

The actual database was larger and more complex than the above set of tables, but the given tables will suffice for the purposes of this paper. All tables were populated with semi-randomly generated data using an online data generation tool. After each alteration to the database structure and/or dataset was imported, then the RUNSTATS query was run for all affected tables. Finally, the db2expln command was then used to find the estimated cost associated with various SQL queries.

Simple Select Queries

The first type of queries tested were basic select queries with a single ‘where’ clause. For example, here is the first tested query:

select * from bank.account where status = 'ACTIVE';

According to db2expln, the above query had a cost of 25.807417. At that point, the status column on the account table had not been indexed. Next, as shown below, an index was created for the status column and RUNSTATS was run again.

create index index__account_status on bank.account (status);
runstats on table bank.account on all columns;

After the creation of the index, db2expln reported a cost of 14.259101 for the exact same select query. Adding the index almost doubled the estimated speed of the query.

The next step was to see what effect that a ‘limit’ clause would have on a select query. The following query was analyzed with db2expln:

select * from bank.account where status = 'ACTIVE' limit 1;

This query had a reported cost of 12.870489, which was slightly lower than the query without a ‘limit’ clause. It is worth noting that the account table had only about 100 rows of data in it at the time of these tests, so it is possible that the gap in cost between these queries could increase significantly with a larger dataset. However, even with a small dataset, it is very clear that properly configured indexes can greatly speed up select queries.

The last select query tested for this group also accessed the account table, but the ‘where’ clause did a comparison on the primary key column instead of the indexed status column. Below is the query:

select * from bank.account where account_id = 1;

The db2expln utility reported that the above query had an estimated cost of 12.870740, which is extremely close to the estimated cost of the previous query. This indicates that selecting by a primary key and selecting a single row by an indexed column have roughly the same cost.

Count Queries

After testing basic select queries, I was eager to see how quick count queries were in comparison. According to the db2expln utility, all of the following count queries had an estimated cost of only 0.048595 and returned the exact same results:

select count(*) from bank.account where status = 'ACTIVE';
select count(1) from bank.account where status = 'ACTIVE';
select count(*) from (select * from bank.account where status = 'ACTIVE');

In comparison to the previously measured select queries, the above count queries are all extremely quick. This means two things. First of all, unless used excessively, count queries will not be a bottleneck for the DBMS. Secondly, it shows that DB2 appears to handle syntactically different but semantically identical count queries equally efficiently.

Cross Operations And Join Queries

Up until this point, all of the discussed queries have been relatively simple, so it was no surprise that DB2 handled the semantically identical count queries all the same. It was time to see how DB2 would handle different queries that touched multiple tables. To start things off, the following cross operation was run through db2expln:

select * from bank.product, bank.account;

The above query had a whopping estimated cost of 52.059380. This is no surprise though, because it returned a result set with a cardinality of 840. Clearly, indiscriminately using cross query operations can be extremely expensive.

Next, I looked at three different versions of a basic join query on the same two tables. The three queries looked like this:

select * from bank.product p, bank.account a
    where p.product_cd = a.product_cd;
select * from bank.product p
    join bank.account a on a.product_cd = p.product_cd;
select * from bank.account a
    join bank.product p on p.product_cd = a.product_cd;

Despite the small variations in syntax, db2expln reported that all three of the above queries had the same estimated cost of 38.671654. So, a join across two tables appears to be quicker than a cross operation. This is likely due to the significantly lower estimated cardinality of 84, which was ten times less than the cross operation.
It made sense that returning less data would be quicker, so the next step was to limit the selected result set to a subset of columns.

select a.status, p.name from bank.account a
    join bank.product p on p.product_cd = a.product_cd;

The above query had a reported estimated cost of 38.671432, which was only slightly less than the join queries which selected all columns. However, it is worth noting that neither table contained columns with large amounts of data. The presence of more data in the excluded columns would have increased the cost gap.

Lastly, I decided to see what would happen when query complexity was increased. The following three queries all selected customers whom had at least one account that had made a transaction:

select distinct c.cust_id, c.cust_type_cd from bank.customer c
    join bank.account a on a.cust_id = c.cust_id
    join bank.acct_tran t on t.account_id = a.account_id;
select c.cust_id, c.cust_type_cd from bank.customer c
    where exists (
        select * from bank.account a where a.cust_id = c.cust_id
            and exists (
                select * from bank.acct_tran t
                    where t.account_id = a.account_id
            )
    );
select c.cust_id, c.cust_type_cd from bank.customer c
    where exists (
        select * from bank.account a where a.cust_id = c.cust_id
            and exists (
                select * from bank.acct_tran t
                    where t.account_id = a.account_id
                    limit 1
            ) limit 1
    );

According to db2expln, the two first of the above queries both had an estimated cost of 53.940853. The estimated cost of these two queries were higher than the other previously studied queries, but that was expected because these queries were run against significantly more data. Additionally, if this operation were to be split into multiple queries by the client, then the total estimated net cost would likely be much higher.

Lastly, there was one strange finding regarding the third query. The third query had an estimated cost of 103.375275, which was nearly double of the other two. The only syntactic difference between the second and third queries was that the third had two added ‘limit’ clauses. I do not have an explanation for this phenomenon. However, this does show that although many semantically equivalent queries also have the same estimated cost, some do not. Therefore, always be cautious when attempting to tune SQL queries.

Conclusion

With DB2’s provided RUNSTATS and db2expln utilities, analyzing SQL queries is a straightforward process. For this study, I used these two utilities to determine the cost of a number of SQL select queries. Of the tested queries, it was found that semantically identical but syntactically dissimilar select queries had the same estimated cost (with only one exception). Additionally, it was found that utilizing indexes and primary keys can greatly reduce estimated query cost. According to the tests that were run, indexing appears to be an effective way to tune SQL queries, while tweaking query syntax generally does not.

References

db2expln - SQL and XQuery Explain command. (n.d.). IBM DB2 9.7 for Linux, UNIX and Windows Information Center. Retrieved April 18, 2014.
Pay, L. (2002, December 14). IBM developerWorks. RUNSTATS in DB2 UDB Version 8.2. Retrieved April 21, 2014.