from onjava.com
by Dejan Bosanac
01/31/2007
There is a big difference in the behavior of a database when it is populated with a small amount of test data and when it holds a large amount of real data. Usually, you would not address database performance issues early in the development process, but when the time comes, you should take some action to ensure that the application still works correctly with large amounts of data.
The all-Java open source database Derby is no exception, so you'll have to make sure it does not become a bottleneck in your application. Although you can find comprehensive material on this topic in Derby's manuals, I would like to focus on certain issues in more detail and give some examples from my own experience. I will focus on application performance related to selecting data from large tables.
First of all, there are various tips on how to tune Derby properties such as the page size and the size of the cache. Playing with these parameters can help you improve performance to some degree, but usually the bigger problem lies in your application and database design, so you should focus on those issues first and leave Derby properties for the end.
In the following sections, I will cover some techniques that can help you optimize problematic parts of your application. But as with all other performance-tuning activities, measure and positively identify problems before optimizing.
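A quick way to take such measurements is to time the query in code before and after each change. Below is a minimal sketch; the wall-clock timing is crude, but it is usually enough to spot a slow statement (the stmt variable is a java.sql.Statement created as in the examples that follow):
long start = System.currentTimeMillis();
ResultSet rs = stmt.executeQuery("SELECT * FROM tbl");
while (rs.next()) {
    // consume the result set so the driver actually fetches the rows
}
System.out.println("Query took "
        + (System.currentTimeMillis() - start) + " ms");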
A Simple Example
Let's start with a simple example: we have a "search"/"list" page in our web application that has to deal with a table of nearly 100,000 rows, and let's say that the table is not trivial (i.e., it has at least 10 columns). I will write the example in plain JDBC so we can focus on database and JDBC issues. The principles explained in this article should be applicable to object-relational mapping tools as well.
In order to give your users the ability to list a large table, you would normally start with the simple query:
select * from tbl
The resulting JDBC code snippet for this would be similar to the following:
Class.forName("org.apache.derby.jdbc.ClientDriver").newInstance();
Connection connection = DriverManager.getConnection(
        "jdbc:derby://localhost:1527/testDb");
Statement stmt = connection.createStatement();
ResultSet rs = stmt.executeQuery("select * from tbl");
ArrayList allResults = new ArrayList();
while (rs.next()) {
    // O-R mapping code that populates a domain
    // object from the current result set row
    DomainObject domainObject = populate(rs);
    allResults.add(domainObject);
}
System.out.println("Results Size: " + allResults.size());
Here, we encounter our first problem. Trying to execute code like this and populate 100,000 (or even more) domain objects will almost certainly lead to a java.lang.OutOfMemoryError as Java runs out of heap space. So for starters, we have to find a way to make this code just work.
Paging Result Sets
As the amount of data in your application grows, the first thing you will want to do is add paging support for certain pages (or views in general). As you saw in the introductory example, simple queries that try to fetch large result sets can easily produce OutOfMemoryErrors.
Many database servers support specialized SQL constructs that can be used to retrieve a specified subset of query results. For example, in MySQL you'll find the LIMIT and OFFSET keywords, which can be used in SELECT queries. So if you execute a query like this:
select * from tbl LIMIT 50 OFFSET 100
your result set will contain 50 rows starting after the first 100 rows, even if the query matches 100,000 rows. Many other database vendors provide similar functionality through different constructs. Unfortunately, Derby does not provide such functionality, so you have to stay with the original select * from tbl query and implement a paging mechanism at the application level. Let's look at the following example:
Class.forName("org.apache.derby.jdbc.ClientDriver").newInstance();
Connection connection = DriverManager.getConnection(
        "jdbc:derby://localhost:1527/testDb");
Statement stmt = connection.createStatement();
ResultSet rs = stmt.executeQuery("SELECT * FROM tbl");
ArrayList allResults = new ArrayList();
int i = 0;
while (rs.next()) {
    if (i > 50 && i <= 100) {
        // O-R mapping code that populates a domain
        // object from the current result set row
        DomainObject domainObject = populate(rs);
        allResults.add(domainObject);
    }
    i++;
}
System.out.println("Results Size: " + allResults.size());
With these extra few lines, we have provided "paging" functionality. Although the entire result set is still fetched from the database server, only the rows of interest are actually mapped to Java objects. Now we are safe from the OutOfMemoryError problem we had before and can be sure that this code will actually work with large tables.
But still, with this solution the database scans the whole table and returns all rows, and that is certainly a time-consuming task. For my example database, this operation takes up to ten seconds, which is certainly not acceptable behavior for the application.
So, we have to come up with a better solution: we do not want to retrieve all database rows, but only those of current interest (or at least the smallest possible superset of them). The trick we'll use here is to explicitly tell the JDBC driver how many rows we need, using the setMaxRows() method of the java.sql.Statement interface. Let's look at this example:
Class.forName("org.apache.derby.jdbc.ClientDriver").newInstance();
Connection connection = DriverManager.getConnection(
        "jdbc:derby://localhost:1527/testDb");
Statement stmt = connection.createStatement();
// fetch no more rows than the last one we need
stmt.setMaxRows(101);
ResultSet rs = stmt.executeQuery("SELECT * FROM tbl");
ArrayList allResults = new ArrayList();
int i = 0;
while (rs.next()) {
    if (i > 50 && i <= 100) {
        // O-R mapping code that populates a domain
        // object from the current result set row
        DomainObject domainObject = populate(rs);
        allResults.add(domainObject);
    }
    i++;
}
System.out.println("Results Size: " + allResults.size());
Notice that we have set the max rows value to the last row that we need, incremented by one. So, with this solution we didn't fetch only the 50 rows we wanted: we fetched the first hundred rows and then filtered out the 50 of interest. Unfortunately, there is no way to tell the JDBC driver to start with a certain row, so we must specify the maximum row of the page that will be displayed. This means that performance will be good for early pages and degrade as the user browses toward later pages. The good news is that in most cases, the user will not go far, but will usually either find what he's looking for in the first few pages or refine the search query. In my environment, execution time dropped from 8 seconds to 0.8 seconds for the above example.
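If paging like this is needed in several views, the pattern can be extracted into a small helper. Here is a minimal sketch, reusing the populate() method and DomainObject class assumed in the snippets above (imports from java.sql and java.util are omitted, as in the other examples):
public static List fetchPage(Connection connection,
                             int pageStart, int pageSize)
        throws SQLException {
    List page = new ArrayList();
    Statement stmt = connection.createStatement();
    try {
        // never fetch more rows than the last one on the requested page
        stmt.setMaxRows(pageStart + pageSize);
        ResultSet rs = stmt.executeQuery("SELECT * FROM tbl");
        int i = 0;
        while (rs.next()) {
            if (i >= pageStart) {
                // map only the rows of the requested page
                page.add(populate(rs));
            }
            i++;
        }
    } finally {
        stmt.close();
    }
    return page;
}
A call such as fetchPage(connection, 50, 50) then retrieves the second page of 50 rows, much like the hand-rolled loop above.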
This was an easy example of how to browse the whole table. But when you add certain WHERE conditions and ordering instructions to your query, things can change dramatically. In the following section, I will explain why this happens and how we can ensure acceptable application behavior in those cases.
Make Sure Indexes Are Used (Avoid Table Scans)
Indexes are a very important concept in database design. Since the scope of this article is limited, I will not go into detail about indexing theory. Briefly, though: indexes are special database structures that allow quick access to table rows. They are usually created on one or more columns, and since they are much smaller than the whole table, their primary use is to enable quick searching of values in a column (or columns).
Derby automatically creates indexes for primary and foreign key columns and for columns that have unique constraints on them. For everything else, we must explicitly create indexes. In the following sections, we'll go through a few examples and explain where and how indexes can be helpful.
But first, we have to make some preparations. Before we can start tuning performance, we need to be able to see what is going on in the database when our query is executing. For that purpose, Derby provides the derby.language.logQueryPlan parameter. When this parameter is set, Derby will log the query plan for all executed queries in the derby.log file (located in the derby.system.home folder). You can achieve this through the appropriate derby.properties file or by executing the following Java statement:
System.setProperty("derby.language.logQueryPlan", "true");
before you start the server.
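If you take the derby.properties route instead, the equivalent entry is a single line in that file:
derby.language.logQueryPlan=true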
By examining the query plan, we can see whether Derby uses indexing for a query or performs a full table scan, which can be a time-consuming operation.
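For readers who want to reproduce these experiments, a table along the following lines will do. Only the owner and time_create columns are actually named in this article; the rest of this definition is an assumption, added just to make the table non-trivial:
CREATE TABLE tbl (
    id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    owner VARCHAR(128),
    time_create TIMESTAMP
    -- plus enough additional columns to reach the ten or so
    -- mentioned in the introductory example
)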
Now that we have our environment set, we can proceed with the example. Let's say that in our previously used tbl example table, we have an unindexed column called owner. Because the sorting of the result is the usual suspect for poor query performance, I will illustrate all performance-tuning examples on problems related to sorting. Now, if we wanted to modify the previous example to sort our results by the value of this column, we would change our query to something like this:
SELECT * FROM tbl ORDER BY owner
If we now run our example with this query instead of the original one, the execution time will be an order of magnitude higher than before. Despite the fact that we paginated the results and dealt carefully with the number of rows to be fetched, the total execution time will again be about 8 seconds.
If we look at the query execution plan in the derby.log file, we can easily spot the problem:
Table Scan ResultSet for TBL at read committed isolation
level using instantaneous share row locking chosen
by the optimizer
This means that Derby scanned the entire table in order to sort the row set. Now, what can we do to improve this situation? The answer is simple: create an index on this column. We can do that by issuing the following SQL statement:
CREATE INDEX tbl_owner ON tbl(owner)
If we now repeat our previous example, we should get a result similar to the one we got without ordering (under one second in my case).
Also, if you look into derby.log now, you will see a line like this (instead of a line like the previous one):
Index Scan ResultSet for TBL using index TBL_OWNER
at read committed isolation level using share row locking
chosen by the optimizer
which means you can be sure that Derby used our newly created index to get the appropriate rows.
Use Appropriate Index Order
We have seen how indexes helped us improve the performance of sorting data by a column value. But what would happen if we tried to reverse the order of sorting? For example, let's say that we want to sort our example data by the owner column, but in descending order. In that case, our query would be something like this:
SELECT * FROM tbl ORDER BY owner DESC
Notice the added DESC keyword, which sorts our result set in descending order. If you run the example with this modified query, you'll notice that the execution time climbs back to the earlier 8 to 9 seconds. Also, in the logfile, you will see that a full table scan was performed in this case.
The solution is to create a descending index for the column in question. For our owner column, we can do that with the following SQL statement:
CREATE INDEX tbl_owner_desc ON tbl(owner DESC)
Now we have two indexes for this column (one in each direction), so our query will execute with acceptable performance this time. Notice the following line in the query log:
Index Scan ResultSet for TBL using index TBL_OWNER_DESC
at read committed isolation level using share row locking
chosen by the optimizer
which confirms that our newly created index was used. So, if you often use queries that sort results in descending order, you may consider creating a suitable descending index to achieve better performance.
Recreate Indexes
Over time, index pages can fragment, which can cause serious performance degradation. For example, let's say we have an index, created some time ago, on the time_create column of our tbl table.
If we execute the query:
SELECT * FROM tbl ORDER BY time_create
we can get poor performance, much as if we didn't have an index at all. If we look into the query plan log, we can find the source of our problem. You will see that an index scan was used, but you will usually also find a line similar to the following one in the log:
Number of pages visited=1210
This means the database performed a lot of I/O operations during the index scan, which is the main bottleneck in this query's execution.
The solution in this case is to recreate the index (i.e., drop it and create it again). This will defragment the index and save us a lot of I/O operations. We can do this by issuing the following SQL statements:
DROP INDEX tbl_time_create
CREATE INDEX tbl_time_create ON tbl(time_create)
You'll notice that execution time drops to an acceptable value (under one second). Also, in the log file you will now find the following line:
Number of pages visited=5
As you can see, the execution time dropped significantly because the database had to perform only a few I/O operations.
So, the general rule of thumb is to make your application recreate indexes on a regular basis. It's best to schedule a background job in your application to do this every now and then.
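What such a job might look like depends on your stack; below is a minimal sketch using java.util.concurrent, assuming a hypothetical application-managed dataSource and the index from this section:
final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
scheduler.scheduleAtFixedRate(new Runnable() {
    public void run() {
        try {
            Connection connection = dataSource.getConnection();
            try {
                Statement stmt = connection.createStatement();
                // recreate the index to defragment its pages
                stmt.executeUpdate("DROP INDEX tbl_time_create");
                stmt.executeUpdate(
                    "CREATE INDEX tbl_time_create ON tbl(time_create)");
                stmt.close();
            } finally {
                connection.close();
            }
        } catch (SQLException e) {
            // log and carry on; the next scheduled run will retry
            e.printStackTrace();
        }
    }
}, 1, 1, TimeUnit.DAYS);
Scheduling the job for off-peak hours keeps the rebuild from competing with user queries.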
Multiple-Column Indexes
Thus far, we have concentrated on simple, single-column indexes and simple queries that can be tuned in this way. Single-column indexes created on the owner and time_create columns will help with queries that filter or sort on their values. Even the following query:
SELECT * FROM tbl
WHERE owner = 'dejan' AND time_create > '2006-01-01 00:00:00'
ORDER BY time_create
will perform acceptably. But if you try to execute this query:
SELECT * FROM tbl WHERE owner = 'dejan' ORDER BY time_create
you will get a very long execution time. This is because of the extra step the database needs to perform to sort the selected rows.
The solution for these types of queries is to create an index that covers both the owner and time_create columns. We can achieve this by executing the following statement:
CREATE INDEX tbl_owner_time_create ON tbl(owner, time_create)
With this index in place, query performance improves dramatically. Notice the following lines in the query plan log:
Index Scan ResultSet for TBL using index TBL_OWNER_TIME_CREATE
at read committed isolation level using share row locking
chosen by the optimizer
We have helped the database by letting it use a handy index to quickly find already sorted data.
The important thing to notice in this example is that the column order in the CREATE INDEX statement matters. A multiple-column index can only be used when the query involves its leading column (the first column defined during index creation). So, if we had created the following index:
CREATE INDEX tbl_time_create_owner ON tbl(time_create, owner)
instead of the one we used previously, we wouldn't see any performance benefit. That is because the Derby optimizer could not consider this index as the best execution path for our query, and it would simply be ignored.
Index Drawbacks
Indexes can help us improve performance when selecting data, but they slow down insert and delete operations, and possibly updates as well. Since we no longer maintain just the table structure but various index structures too, it takes the database longer to keep all these structures up to date when data changes.
For example, when we insert a row into a table, the database must update all indexes related to columns of that table. That means it has to insert each indexed column value in the right place in the appropriate index, and that takes time. The same thing happens when you delete a row, because the indexes must be kept ordered. Update actions affect indexes only when you update indexed columns, since the database must relocate those entries to keep the indexes sorted.
So, the point is to optimize your database and application design according to your needs. Don't index every column: you might never use some of those indexes, and you might need to optimize your database for fast inserts instead. Measure your performance early and identify bottlenecks; only then should you try to apply some of the techniques described in this article.
Conclusion
In this article, we have focused on just a small subset of the performance-related issues you can encounter in everyday development tasks. Most of the principles shown here can be applied (with some modifications) to any relational database system. There are many other techniques that can help you improve the performance of your application. Caching is certainly one of the most effective and widely used approaches: there are many caching solutions for Java developers (some of them, such as OSCache or EHCache, have open source licenses) that can serve as a buffer between the application and the database and thus improve overall application performance. Also, many object-relational frameworks used in Java projects (such as Hibernate) have built-in caching capabilities, so you should consider those solutions as well, but that's material for another discussion.
Dejan Bosanac is a software developer, technology consultant and author. He is focused on the integration and interoperability of different technologies, especially the ones related to Java and the Web.