Monday, August 29, 2011

Smallworld Technical Paper No. 9 - The Why and the How of the Long Transaction

by Richard G. Newell

Abstract

The recent literature on GIS technology has seen the emergence of a new set of terminology including "long transaction", "version management", "check-out", "check-in", "seamless mapbase" and so on. As is common with new terminology, few people understand what these things mean, why they are important and what they are for. This paper attempts to explain what a long transaction is, why it is necessary and how it is managed in a GIS.

Introduction

Any user of a GIS, mapping system, CAD system, word processor or indeed any system which involves updating data over a significant period of time is in fact engaged in a long transaction. Contrast this with the user of a commercial DBMS application such as banking or airline reservation. In such an application, a user may prepare an input screen over a period of a few seconds which then updates the system resulting in a transaction which lasts a small fraction of a second.

What is the difference between these two uses of a computer system? Why is it that the short transaction mechanisms implemented in all of today's commercial DBMSs do not satisfy the requirements of design-type systems?

Long transaction applications in a GIS involve the modification of data that is relatively static. These include:

  • Data conversion
  • Map-base and asset management
  • Analysis which produces large amounts of intermediate results
  • Design studies with multiple alternative designs

Short transaction examples occur in the everyday operational use of a system. These might include:

  • Vehicle tracking
  • Customer service bureau
  • Fault logging
  • Emergency planning

Drawing office managers using paper drawings are forced into implementing a long transaction mechanism, known as drawings management. It is impractical to allow two draftsmen to have a drawing out for update at the same time. There is only one master copy of every drawing and only one draftsman at a time can modify it.

Any other user can have read-only access to a copy of a drawing, in which case the information he has may be out of date, and this is usually deemed acceptable. In fact it is not uncommon in organisations such as local authorities for there to be multiple copies of the same set of maps, all independently updated and maintained. Although this is deemed to be less acceptable, these organisations put up with it because there is no alternative where the maps are on paper.

Sheet-Based or Tiled Systems

The early digital mapping systems were based on CAD systems where the mapbase was held as a collection of CAD drawings. The methods of managing such a system where multiple users wish to access and update the mapbase are based on the manual drawing office approach. Indeed there is a market for Document Management systems in which an intelligent drawing register is held in a database alongside the drawings to record the status of each drawing in the registry. The advantage of these sheet based digital mapping systems is their simplicity, but their disadvantage is that handling any object which lies across a tile boundary becomes exceedingly cumbersome.

However, as digital mapping systems tried to evolve into truly seamless databases, it was found that the information needed to store the relationships between parts of objects on adjacent map-sheets was not nearly so easy to manage as the strictly partitioned map-base of the earlier sheet-based digital mapping systems. Indeed, so much code was required in these map-management systems that more modern systems took a radically different approach, which abandoned the concepts of sheets and tiles to implement a truly seamless database.

This appeared attractive because it allowed implementors to move from a file-based system to an implementation based on a database. It was now feasible to hold all of the map data in a commercial relational database management system (RDBMS). Much of the functionality that is provided by these systems is required to handle the large data volumes involved in a GIS.

Commercial Relational Database Management Systems

The commercial relational database vendors have invested man-centuries of effort in producing very robust systems which ensure the safety and integrity of data at all times. They also include rich facilities for designing and building data models, an aspect which everybody now realizes is the most important point of departure in building a GIS. However, vendors who build their systems on such engines have to overcome three things which are not provided by the database vendors:

  • Spatial modelling and queries
  • Performance of spatial queries
  • Long transaction handling

On the first two of these, the database vendors and the standards organisations are beginning to make progress. Indeed, provided the RDBMS provides facilities to control data clustering, it is not too hard to obtain adequate spatial performance. However, on the issue of long transactions, there is little sign yet that the vendors are doing anything. This may be because the GIS market is considered to be small compared to the total DBMS market, and it does not get the attention that it deserves. Long transaction support also requires fundamental changes to existing approaches.

Commercial DBMSs are designed to handle short transactions and to maximize transaction throughput. In theory, one could use the short transaction mechanism to handle multiple users of a GIS, but it is not effective, for the following reasons.

Commercial DBMS vendors adopt one of two approaches to transaction locking, known as the pessimistic approach and the optimistic approach. In the pessimistic approach the system requests locks on all records that are to be updated before commencing a transaction. Thus all other users are locked out of accessing these records. When the transaction is finally closed the locks are released. The problem with this is that if one imagines many users holding locks for a long period of time, the system becomes totally unusable because other users are denied access to the locked records.
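The pessimistic approach can be illustrated with a minimal sketch. The `LockManager` class, the user names and the record identifiers below are hypothetical and do not correspond to any particular DBMS; the sketch only models locks taken for update, and it assumes that a transaction requests all of its locks up front.

```python
class LockManager:
    """Hypothetical record-lock table: at most one owner per record id."""

    def __init__(self):
        self._owners = {}  # record id -> user currently holding the lock

    def acquire(self, user, record_ids):
        # All-or-nothing: refuse if any record is held by a different user.
        blocked = [r for r in record_ids if self._owners.get(r, user) != user]
        if blocked:
            return False
        for r in record_ids:
            self._owners[r] = user
        return True

    def release(self, user):
        # Closing the transaction releases every lock the user holds.
        self._owners = {r: u for r, u in self._owners.items() if u != user}


locks = LockManager()
alice_ok = locks.acquire("alice", ["pipe-1", "pipe-2"])
bob_ok = locks.acquire("bob", ["pipe-2", "valve-7"])     # pipe-2 is held by alice
locks.release("alice")
bob_retry = locks.acquire("bob", ["pipe-2", "valve-7"])  # succeeds after release
```

If alice's transaction stays open for days rather than milliseconds, bob is refused for the whole of that period, which is the problem described above.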

In the optimistic approach, each user carries on updating records within the privacy of his own transaction and if, at the time the transaction is closed, a conflict is detected, he may well lose all of the work that he has just completed. For a small amount of work this is deemed to be acceptable, but if somebody has been working for days or weeks it certainly is not.
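A minimal sketch of the optimistic approach follows. The names `Store`, `OptimisticTxn` and `ConflictError` are invented for illustration, and the per-record revision counter is one common way of detecting conflicts, assumed here rather than taken from any specific product.

```python
class ConflictError(Exception):
    """Raised at commit time when another transaction got there first."""


class Store:
    def __init__(self):
        self.rows = {}  # record id -> (revision, value)

    def read(self, rid):
        return self.rows.get(rid, (0, None))


class OptimisticTxn:
    def __init__(self, store):
        self.store = store
        self.seen = {}    # record id -> revision when first read
        self.writes = {}  # private, uncommitted changes

    def read(self, rid):
        rev, value = self.store.read(rid)
        self.seen.setdefault(rid, rev)
        return self.writes.get(rid, value)

    def write(self, rid, value):
        self.read(rid)  # remember the revision this change is based on
        self.writes[rid] = value

    def commit(self):
        # Validate first: every record touched must still be at the seen revision.
        for rid, rev in self.seen.items():
            if self.store.read(rid)[0] != rev:
                self.writes.clear()  # all private work is thrown away
                raise ConflictError(rid)
        for rid, value in self.writes.items():
            self.store.rows[rid] = (self.seen[rid] + 1, value)


store = Store()
t1, t2 = OptimisticTxn(store), OptimisticTxn(store)
t1.write("road-5", "widened")
t2.write("road-5", "renamed")
t1.commit()
try:
    t2.commit()
    t2_lost_work = False
except ConflictError:
    t2_lost_work = True
```

In this sketch the second transaction loses everything it has written, which is tolerable for a few seconds of work but not for weeks of it.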

In either method, should there be a system failure while a transaction is open, then all work is lost. Thus these methods can handle transactions which are open for a few seconds or minutes, but certainly not those which last for hours or days.

So in order to get round this problem, GIS vendors who base their systems on commercial RDBMS avoid using the short transaction mechanism completely and instead implement a long transaction mechanism of their own called "check-out".

Check-out and check-in

In a system which employs check-out, the user who wishes to update the database asks the system to copy the part of the database he wishes to work on into a single-user database. Whether the single-user database is proprietary or a commercial RDBMS does not matter, as it only handles temporary data.

One advantage that stems from doing this is that the checked out database can be held on the local disk of a workstation, and so the user puts no load at all on the database server or the network while he is working. The only work that the server has to deal with is checking out data and then later checking in the updates.

However, there are a number of disadvantages of using check-out:

  • Check-out may take a long time
  • The user is restricted to a subset of the database
  • The handling of alternatives is cumbersome
  • It is difficult to maintain relationships between data which is checked out and data which is not

From the vendor's point of view, the biggest disadvantage of check-out is that making it work effectively requires an enormous amount of development effort, and so the gains made by relying on the R&D resources of the DBMS vendors are lost in overcoming the limitations of the short transaction.

A further problem is that check-out faces the same choice as the two short transaction mechanisms: either one locks the data intended for update, or one employs a system of conflict resolution on check-in. In the former case, the system suffers the same problems as pessimistic locking. In the latter case one can at least salvage most of the work in the event that a conflict is detected. However, conflict resolution is not a trivial matter.
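The conflict-resolution variant of check-out can be sketched as follows. `MasterDB` and the object identifiers are hypothetical, and comparing each object against the value it had at check-out time is an assumed, deliberately simple way of detecting conflicts.

```python
class MasterDB:
    """Hypothetical master database supporting check-out and check-in."""

    def __init__(self, objects):
        self.objects = dict(objects)  # object id -> value

    def check_out(self, ids):
        # Copy a subset into a private working set, remembering the base values.
        base = {i: self.objects[i] for i in ids}
        return {"base": base, "data": dict(base)}

    def check_in(self, work):
        # Apply non-conflicting changes; return the ids that need resolving.
        conflicts = []
        for i, value in work["data"].items():
            if value == work["base"][i]:
                continue                 # this user did not touch the object
            if self.objects[i] != work["base"][i]:
                conflicts.append(i)      # master changed since check-out
            else:
                self.objects[i] = value  # safe to apply
        return conflicts


master = MasterDB({"main-1": "100mm", "main-2": "150mm", "valve-3": "open"})
a = master.check_out(["main-1", "main-2"])
b = master.check_out(["main-2", "valve-3"])
a["data"].update({"main-1": "125mm", "main-2": "200mm"})
b["data"].update({"main-2": "180mm", "valve-3": "closed"})
a_conflicts = master.check_in(a)
b_conflicts = master.check_in(b)
```

Here the second user keeps the change to `valve-3` and only the overlapping edit to `main-2` is reported for manual resolution, so most of the work is salvaged.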

Given that there are still problems in using check-out to overcome the long transaction problem, one has to ask what the alternatives are. Either one has to wait for the commercial DBMS vendors to provide long transaction support (and indeed some of the early implementations of object-oriented database management systems claim to address this issue), or the vendor has to implement his own long transaction mechanism.

The lack of long transaction support is the most serious shortcoming of today's commercial RDBMSs for the support of GIS.

One of the most powerful approaches to handling long transactions is to implement a mechanism for version management deep in the database engine itself.

Version Management

A version managed database is capable of holding any number of versions of the whole database without replicating data that is common between versions. Thus all users can see the whole database at all times, subject to any changes made within the privacy of their own versions.

A long transaction commences with the creation of a new version from an existing version. At the start, the new version will look identical in all respects to the parent version from which it was created. However, as the user proceeds in modifying the database, the database stores the effects of his changes, but no other user operating in a different version can see these changes. The user of course works within a sequence of short transactions, each of which can be committed at any time. Thus the database can store persistently the results of a long transaction at all stages in its evolution. Intermediate commit stages may sometimes be given a name, in which case they are known as checkpoints.
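One way to picture versions that share common data is as stacks of change layers. The sketch below is an illustration only, not a description of any actual database engine: the `Version` class is hypothetical, it uses Python's standard `collections.ChainMap`, and it ignores deletions and persistence.

```python
from collections import ChainMap


class Version:
    """Hypothetical version: a private writable layer over shared read-only layers."""

    def __init__(self, layers=None):
        # maps[0] is this version's private delta; later maps are shared.
        self.view = ChainMap({}, *(layers or []))
        self.checkpoints = {}

    def branch(self):
        # Freeze the current layers so that parent and child share them,
        # then give each side a fresh private layer for further changes.
        shared = self.view.maps
        child = Version(shared)
        self.view = ChainMap({}, *shared)
        return child

    def checkpoint(self, name):
        # A named checkpoint is simply a frozen branch that nobody writes to.
        self.checkpoints[name] = self.branch()


top = Version()
top.view["road-5"] = "2 lanes"
top.view["pipe-9"] = "100mm"

design = top.branch()            # starts identical to its parent
design.view["road-5"] = "4 lanes"
design.checkpoint("first-draft")
design.view["road-5"] = "6 lanes"
top.view["pipe-9"] = "150mm"     # not visible in design until a merge
```

No record is copied when `design` is created; each version stores only its own changes, and reads fall through to the shared layers.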

The operation of closing a long transaction is achieved by merging any changes that have been made by other users to the parent version, followed by posting the combined changes back up to the parent. The step of merging the parent's changes is where conflicts may be detected and dealt with.
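The merge-then-post step can be sketched as a three-way comparison. The function `merge_and_post` and its plain-dictionary representation of a version are assumptions made for illustration; `base` stands for the parent's state at the time the version was created, and a missing key is treated as a deleted or absent object.

```python
def merge_and_post(base, parent, child):
    """Close a long transaction.

    base   -- parent state when the child version was created
    parent -- parent state now (may include other users' posted changes)
    child  -- the user's private state
    Returns (new_parent_state, conflicts).
    """
    merged = dict(child)
    conflicts = {}
    for key in set(base) | set(parent) | set(child):
        b, p, c = base.get(key), parent.get(key), child.get(key)
        if p == b:
            continue                    # parent unchanged: keep the child's value
        if c == b:
            if p is None:
                merged.pop(key, None)   # parent deleted it, child never touched it
            else:
                merged[key] = p         # only the parent changed: take its change
        elif c != p:
            conflicts[key] = (p, c)     # both sides changed it differently
    if conflicts:
        return parent, conflicts        # nothing is posted until resolved
    return merged, {}


base = {"road-5": "2 lanes", "pipe-9": "100mm"}
parent = {"road-5": "2 lanes", "pipe-9": "150mm"}   # another user posted a change
child = {"road-5": "4 lanes", "pipe-9": "100mm"}
new_parent, conflicts = merge_and_post(base, parent, child)

clashing_child = {"road-5": "4 lanes", "pipe-9": "200mm"}
unchanged_parent, clash = merge_and_post(base, parent, clashing_child)
```

In the first call both sets of changes are combined and posted; in the second, the parent is left untouched and the clash on `pipe-9` is reported for the user to resolve.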

As in the case of check-out, version management also minimizes the load on the database server by maximizing the utilization of the workstation. This allows good performance of many workstations on one server. This contrasts with the situation in most commercial DBMSs in which query processing for all users is carried out by the database server, thus giving it an enormous workload in a large system.

Version management has many advantages over both map-management and check-out in handling the long transaction issue:

  • No delay before commencing update
  • Always access to the whole database by all users
  • Simultaneous alternatives can be handled

One of the disadvantages of version management is that it is extremely difficult to implement on top of a commercial DBMS. The vendor is thus forced either into the compromise of check-out or into implementing his own database engine which supports the concept. However, one of the most difficult aspects of database implementation is handling short transactions efficiently. Since this is not required for most GIS applications, it is much easier to build a robust engine with good performance than it would otherwise be.

Short Transactions in a GIS

There is much data in corporate DBMSs which needs to be accessed from a GIS. Most of this data is maintained in a commercial DBMS in a short transaction environment. Short transactions are important where it is essential for all users of the system to see the most up-to-date version of the database. GIS access to such data is typically for read purposes only and thus a simple interface mechanism will normally suffice. It is of course desirable from the user's point of view to hide differences in the user interface between the GIS and the external DBMS. In cases where it is also desirable to maintain short transaction data via the GIS user interface, then it makes a lot of sense to use a commercially available database engine.

Summary

All multiple-user GIS systems maintain their data by using a long transaction mechanism. This paper has examined simple mapping systems which maintain multiple map-sheets using a document management approach, through systems which try to maintain continuity between map-sheets by means of an extension of document management known as map management, ultimately leading to truly seamless GIS systems which need a different approach. Two approaches available in the market place are explored, namely check-out and version management. The drawbacks and benefits of all approaches are described. The conclusion is reached that the most elegant and powerful solution is version management and the lack of support for this in today's commercial RDBMSs is a major drawback to using these systems to underpin a GIS. Thus today's vendors who wish to bring the benefits of version management to their customers must implement it themselves.
