sworldwatch: Dedicated to sharing interesting news and innovations related to GE Smallworld GIS software. This blog is not related to, endorsed by or supported by General Electric.<br />
<br />
<h2>Esri and Smallworld in danger of irrelevance by open geo? (2014-01-02)</h2>
(The headline is simple sensationalism to get eyeballs)<br />
<br />
But I do think irrelevance is something that GE Smallworld needs to think about.<br />
<br />
Read <a href="http://boundlessgeo.com/press-release/top-esri-sales-executive-joins-opengeo/">http://boundlessgeo.com/press-release/top-esri-sales-executive-joins-opengeo/</a><br />
<br />
I wonder whether GE's Smallworld product will experience the same move-to-open pressures that ESRI is responding to. Maybe it already is, but I have not seen any public-facing evidence of that.<br />
<br />
ESRI is actively acknowledging Open Source, and I imagine that it is doing so in order to stay ahead of the game (<a href="http://www.esri.com/news/arcnews/spring11articles/open-source-technology-and-esri.html">http://www.esri.com/news/arcnews/spring11articles/open-source-technology-and-esri.html</a>). But when you read the <a href="http://boundlessgeo.com/press-release/top-esri-sales-executive-joins-opengeo/" target="_blank">article</a> reporting that one of ESRI's top federal agency account managers has moved to OpenGeo, it makes me wonder whether all that "getting in front of open source" is enough to control the trend.<br />
<br />
One of the first blog posts I ever wrote was <a href="http://sworldwatch.blogspot.com/2007/02/smallworld-open-source-software.html" target="_blank">"Smallworld Open Source Software"</a>. I read it again today and was amazed that not much has changed in the Smallworld community as it relates to open source. The comments to the post are definitely worth reading.<br />
<br />
I used to think that the <a href="http://sworldwatch.blogspot.com/2011/08/smallworld-technical-paper-no-8-gis.html" target="_blank">version managed datastores</a> (VMDS) were something that set Smallworld above its competition. In my unscientific analysis, I actually think VMDS did (and does) provide better functionality than the ESRI and Oracle workspaces/versioning. What has snuck up without much fanfare are open-source projects like <a href="https://github.com/boundlessgeo/GeoGit" target="_blank">GeoGit</a> that are also trying to address this issue. In addition, GitHub announced in June 2013 that it now supports rendering GeoJSON and TopoJSON. I realize that GeoJSON and TopoJSON are not necessarily good matches for some of the complex networks modelled by Smallworld users. Still, it is interesting that open tools like git are being pushed to support the same issues that Smallworld was originally designed to address (versioning and distributed data).<br />
<br />
I think that if GE does not open up Smallworld to a wider community it runs the risk of becoming irrelevant. I am sure that the irrelevance will not happen overnight. But if GE keeps fighting the GIS war with ESRI without considering projects like GeoGit, we could be in for an interesting few years in the Smallworld realm.
<br />
<h2>Web Maps Connector for Smallworld Supports basemap.at (2013-11-05)</h2>
As you may be aware, iFactor Consulting's <a href="http://www.ifactorconsulting.com/products/web-maps-connector/" target="_blank">Web Maps Connector</a> (WMC) for Smallworld allows Smallworld users to seamlessly integrate both <a href="http://sworldwatch.blogspot.com/2013/03/smallworld-welcome-to-2013.html" target="_blank">global and regional data sets</a> in their Smallworld thick client.<br />
<br />
Today I want to share with you that WMC now supports data layers from <a href="http://basemap.at/" target="_blank">http://basemap.at</a> (the administrative basemap of Austria).<br />
<br />
<iframe allowfullscreen="" frameborder="0" height="360" src="//www.youtube-nocookie.com/embed/zzMRhWIkREc" width="480"></iframe><br />
<br />
While I have mentioned in a <a href="http://sworldwatch.blogspot.com/2013/03/smallworld-welcome-to-2013.html" target="_blank">previous post</a> that access to datasets like Google's does provide value to the Smallworld user, I have also seen a movement in this domain towards regionally-generated data sources. basemap.at joins the regional datasets already supported by WMC from Thailand (Longdo Maps), Japan (NTT Geospace Imagery and Road Tile) and South Korea (Naver Maps).<br />
<br />
If you have a regional tile-based map that you would like to show in Smallworld, please contact the <a href="http://www.ifactorconsulting.com/products/web-maps-connector/" target="_blank">Web Maps Connector team</a> with your request.
<br />
<h2>Insert coordinates from a text file into a Smallworld database (2013-08-23)</h2>
I recently received a query from a reader asking if I could show him how to insert coordinates from a text file into a Smallworld database. I have put together a gist that shows how to do this. I started off using an LWT approach but then remembered that the recommended approach is to use the record transaction APIs, so the gist includes both styles of database writing.<br />
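The gist itself is Magik; just to picture the input side in a language-neutral way, here is a minimal Python sketch, assuming a hypothetical whitespace-separated "id x y" line format (the Magik write APIs are in the gist below):<br />
<pre><code># Minimal sketch of the input side only; the line format "id x y" is a
# hypothetical example, not the format used in the gist.
def read_points(path):
    points = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue                          # skip blanks and comments
            rec_id, x, y = line.split()
            points.append((rec_id, float(x), float(y)))
    return points

# e.g. read_points("poles.txt") -> [("P1", 512100.0, 176400.0), ...]
</code></pre>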
<br />
<br />
<script src="https://gist.github.com/boulderalf/d25277f8ae7568a712c7.js"></script>Alfred Sawatzkyhttp://www.blogger.com/profile/15845723068853240174noreply@blogger.com1tag:blogger.com,1999:blog-4451657389087868432.post-61466878996378217892013-03-12T08:20:00.000-06:002013-03-12T08:20:14.146-06:00Smallworld Welcome To 2013I recently saw a tweet welcoming Smallworld-using utilities to 2007...
<br />
<blockquote class="twitter-tweet">
Welcome to 2007, utilities. Google, GE Join Forces in Energy Sector Mapping Tool <a href="http://t.co/axZiGiMnGp" title="http://energy.aol.com/2013/02/27/google-ge-join-forces-in-energy-sector-mapping-tool#.US-Rw2uAzfo.twitter">energy.aol.com/2013/02/27/goo…</a><br />
— Helen Fairman (@helen_fairman) <a href="https://twitter.com/helen_fairman/status/307178777048387584">February 28, 2013</a></blockquote>
<script async="" charset="utf-8" src="//platform.twitter.com/widgets.js"></script>
... while it may be true that the GE/Google announcement brings back memories of 2007, it is also true that Smallworld users have had the option of accessing web maps since at least 2009. You can see many examples of that <a href="http://sworldwatch.blogspot.com/search/label/WMC" target="_blank">here</a> and <a href="http://sworldwatch.blogspot.com/2013/02/google-ge-smallworld-demo.html" target="_blank">here</a>.<br />
<br />
The web mapping world has changed a lot since 2007. In my opinion, one of the cooler web mapping options involves <a href="http://www.mapbox.com/" target="_blank">MapBox</a>. I have blogged about them <a href="http://sworldwatch.blogspot.com/2012/10/view-beautiful-and-offline-maps-in-swgis.html" target="_blank">here</a>. MapBox provides a way for users to quickly create beautiful maps and host them on MapBox infrastructure. You have access to pre-packaged Terrain, Satellite and Road layers that you can then customize to your style and visibility requirements. In addition, you can create your own map layers and upload them to MapBox.<br />
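To give a feel for how lightweight the hosting side is: a MapBox-hosted layer of this era is ultimately just tiles behind a URL template. The sketch below fetches one tile in Python; the map id is made up, and the v3-style URL template is my assumption about the hosting API, so treat both as illustrative only:<br />
<pre><code># Hedged sketch: MAP_ID is hypothetical and the v3 URL template is an
# assumption about MapBox's tile-hosting API of this era.
from urllib.request import urlopen

MAP_ID = "youraccount.map-abcd1234"   # hypothetical MapBox map id
z, x, y = 12, 850, 1550               # slippy-map tile address

url = "http://api.tiles.mapbox.com/v3/%s/%d/%d/%d.png" % (MAP_ID, z, x, y)
with open("tile.png", "wb") as f:
    f.write(urlopen(url).read())
</code></pre>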
<br />
The following video clip demonstrates Smallworld access to MapBox maps with a specific example around styling your own contour lines and hillshading to integrate with your asset data layers.<br />
<br />
<iframe allowfullscreen="" frameborder="0" height="360" src="http://www.youtube-nocookie.com/embed/xDpbb8_waJI" width="480"></iframe>
<br />
While having Smallworld access Google map layers may seem "last decade", it definitely still has its value. Extending Smallworld even more to access powerful platforms like MapBox solidly moves Smallworld users from 2007 into 2013. <br />
<br />
Welcome to the future! You can find out more about Smallworld web maps integrations <a href="http://www.ifactorconsulting.com/products/web-maps-connector/" target="_blank">here</a>.
<br />
<h2>Google GE Smallworld demo (2013-02-26)</h2>
The recent partnership announcement by GE and Google is very exciting. Finally there will be GE product support for cloud-based mapping (no mention was made of the various GE partners that have supplied this functionality for years, to varying degrees).<br />
<br />
There were lots of articles written about this partnership on the web and the major online news services covered it. Clearly there is a desire to see GE move in this direction.<br />
<br />
The first focus has been on integrating GE's Smallworld enterprise GIS system with Google basemaps. The GE partner webinar today provided an introduction to some of the tools and applications that will benefit from this partnership. I have not found online demo videos for the new Smallworld tools, but I expect that the tools will feature prominently at GE demo booths at the upcoming <a href="http://www.gedigitalenergysummit.com/index.html" target="_blank">GE 2013 Digital Energy International Software Summit</a> this May.<br />
<br />
While we all wait for public demos and evaluation installations, I thought it would be helpful to highlight some of the functionality that Smallworld partners have already developed in the space that GE is entering.<br />
<br />
<b><u>Basemaps</u></b><br />
If GE were to demonstrate a Google basemap add-on that allows Smallworld desktop users to view Google maps as base layers under their asset layers, it might look like this (plus a bit more, as presented by <a href="http://www.ifactorconsulting.com/products/web-maps-connector/" target="_blank">iFactor Consulting</a>)<br />
<br />
<iframe allowfullscreen="" frameborder="0" height="360" src="http://www.youtube-nocookie.com/embed/C7ZvyHcOYLc" width="480"></iframe><br />
<br />
<br />
Other vendors that provide similar functionality include <a href="http://www.field-csi.com/smallworld/osm.html" target="_blank">FCSI</a> and <a href="http://www.dpgeo.com/en/" target="_blank">dpgeo</a>.<br />
<br />
<br />
<b><u>Network Viewer</u></b><br />
Another way to view Smallworld data is to push data snapshots into the cloud and then use browser-based tools to overlay the Smallworld asset data there. If GE were to demo this functionality it might look like this (plus a bit more as presented by <a href="http://www.ubisense.net/en/consulting/geo-solutions/myWorld.html" target="_blank">Ubisense</a>)<br />
<br />
<iframe allowfullscreen="" frameborder="0" height="315" src="http://www.youtube-nocookie.com/embed/1Ll_n1Axt6c" width="560"></iframe><br />
<br />
Another vendor that provides Smallworld-in-the-cloud functionality is <a href="http://www.cliffhanger-solutions.com/webmaps/" target="_blank">Cliffhanger Solutions</a>.<br />
<br />
<b><u>Mobile Apps</u></b><br />
According to press releases, GE will also use the Google partnership to push them into the mobile app space. Lucky for them, there are partners that have already moved into this space and gleaned valuable market experience. Partners include Astec <a href="http://gislet.net/" target="_blank">Gislet</a> and Ubisense <a href="http://www.ubisense.net/en/consulting/geo-solutions/myWorld.html" target="_blank">myWorld</a>.<br />
<br />
<iframe allowfullscreen="" frameborder="0" height="360" src="http://www.youtube-nocookie.com/embed/HnRBYlvpG9I" width="480"></iframe>
I'm sure there are other partner offerings related to web maps and Smallworld. If you know of others, please comment on this post.
<br />
<h2>Sprinkle Some [#swgis] Magik on that Java Virtual Machine (2012-11-07)</h2>
Jim Connor's Weblog published an article a few days ago explaining Smallworld's move from a Magik virtual machine to the Java Virtual Machine. Interesting reading.<br />
<br />
You can read the article here: <a href="https://blogs.oracle.com/jtc/entry/sprinkle_some_magik_on_that" target="_blank">https://blogs.oracle.com/jtc/entry/sprinkle_some_magik_on_that</a><br />
<br />
Thanks to <a href="http://twitter.com/aardvark179" target="_blank">@aardvark179</a> for bringing this to my attention.Alfred Sawatzkyhttp://www.blogger.com/profile/15845723068853240174noreply@blogger.com0tag:blogger.com,1999:blog-4451657389087868432.post-20698565392776542742012-10-22T17:36:00.001-06:002013-02-26T06:25:41.743-07:00View Beautiful (and Offline) Maps in #swgis Recently, I have seen much excitement around <a href="http://mapbox.com/" target="_blank">MapBox</a>'s <a href="http://mapbox.com/tilemill/" target="_blank">TileMill</a> and the beautiful maps that can be created with that tool. You can definitely do some cool stuff with TileMill.<br />
<br />
That made me think back to a year or two ago when I was creating custom map tiles using <a href="http://maperitive.net/" target="_blank">Maperitive</a>. So I updated my version of Maperitive and had a look. The GUI has had many improvements since I last used it and I really like the Python-based scripting that is now supported. You can also do some really cool map creation with Maperitive.
A closer look showed me that both applications allow a user to generate output in the MBTiles format.<br />
<br />
So I thought that it might be helpful to see if we could view these formats within Smallworld. iFactor Consulting's <a href="http://www.ifactorconsulting.com/products/web-maps-connector/" target="_blank">Web Maps Connector</a> (disclosure: I am the product manager for this product) has been focused on web-based map sources. But because of its flexible framework, it was easy to modify it to read MBTiles.<br />
<br />
What does that mean for Smallworld users? Now you can create your own beautiful maps using CartoCSS or Maperitive style sheets and integrate them into Smallworld as additional layers. That is pretty cool, in my opinion. In addition to beautiful maps, you can also use Maperitive or TileMill to generate your background tiles for a predefined service area. Then you can copy the file to a laptop or mobile field device running Smallworld and your field crew will have access to beautiful custom-styled background tiles even when disconnected from the web.<br />
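For the curious, MBTiles is nothing exotic: it is a single SQLite file with a tiles table keyed by zoom/column/row. A minimal Python sketch of reading one tile (the file name here is made up; the schema comes from the MBTiles spec):<br />
<pre><code># Minimal sketch: read one tile out of an MBTiles file (an SQLite database).
# The file name is hypothetical; the "tiles" table schema comes from the
# MBTiles specification.
import sqlite3

def read_tile(mbtiles_path, z, x, y):
    con = sqlite3.connect(mbtiles_path)
    try:
        # MBTiles stores rows in the TMS scheme, so the y axis is flipped
        # relative to the XYZ addressing used by most web map clients.
        tms_y = (2 ** z - 1) - y
        row = con.execute(
            "SELECT tile_data FROM tiles "
            "WHERE zoom_level=? AND tile_column=? AND tile_row=?",
            (z, x, tms_y)).fetchone()
        return row[0] if row else None   # PNG/JPEG bytes, or None if missing
    finally:
        con.close()

tile = read_tile("service_area.mbtiles", 12, 850, 1550)
</code></pre>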
<br />
The following video provides some more details of how this all works.<br />
<br />
<iframe allowfullscreen="allowfullscreen" frameborder="0" height="360" src="http://www.youtube.com/embed/YCYnCNbPXdQ" width="480"></iframe>Alfred Sawatzkyhttp://www.blogger.com/profile/15845723068853240174noreply@blogger.com0tag:blogger.com,1999:blog-4451657389087868432.post-26143763223880334362012-09-24T15:16:00.002-06:002012-09-24T15:16:34.315-06:00@MDTnet #swgis webinar on 23 October<a href="http://www.astec.net/" target="_blank">ASTEC</a> will be hosting a webinar for their <a href="http://www.mdt.net/index.php" target="_blank">Magik Development Tools (MDT)</a> on 23 October.<br />
<br />
I have been an Emacs traditionalist when it comes to Magik development. Recently, however, I have been working on other projects that use the Eclipse IDE for Java work. I can tell you that based on the cool refactoring, debugging and code assist functionality available to Java in the IDE, I have reconsidered my Emacs traditionalism and would recommend that you learn more about MDT and how it can help you become more efficient in your Magik development.<br />
<br />
You can find more details about the webinar <a href="http://www.mdt.net/index.php/news-item/items/mdt-webinar-10-off.html" target="_blank">here</a>.
<br />
<h2>@iFactor Consulting Presents Money-Saving Solutions at #SWUC (2012-09-14)</h2>
This blog post's title "iFactor Consulting Presents Money-Saving Solutions at #SWUC" was meant to give the post an air of respectability. In reality, it should have been titled...<br />
<br />
<b><i>When You Leave Two Magik Hacks In Charge of the @ifactor #swuc Booth, This is What You Get</i></b><br />
<br />
The <a href="http://www.kinsleymeetings.com/ge/americas/Default.aspx" target="_blank">GE Energy Americas Smart Grid Software Summit</a> (formerly known as Smallworld Americas User Conference) is starting on Monday and I wanted to write a bit about what my company (<a href="http://www.ifactorconsulting.com/" target="_blank">iFactor Consulting</a>) will be doing there. <a href="http://www.linkedin.com/pub/graham-garlick/3/591/9b7" target="_blank">Graham Garlick</a> and I will be staffing the booth with a cameo appearance by <a href="http://www.linkedin.com/in/bradsileo" target="_blank">Brad Sileo</a>. <br />
<br />
Come see us at the booth and talk to us about all your Smart Grid questions. We might even be able to talk about Smallworld while we are at it :)<br />
<br />
Here are some of the cool things we want to share with you...<br />
<br />
<b><u>Dialog Designer (DD)</u></b><br />
Dialog Designer is the best Magik GUI design tool in the world! No one, not even GE, offers a tool like this that allows you to create SWAF-based GUIs with drag-n-drop ease. Each year Graham Garlick adds new functionality and language support so if you have not seen the product recently, please ask for a demo and we can show you the latest and greatest. And did I mention that it's free, courtesy of iFactor Consulting?<br />
<br />
You can find out more about Dialog Designer <a href="http://www.ifactorconsulting.com/services/dialog-designer" target="_blank">here</a>. You can even download your free copy from that site.<br />
<br />
<b><u>Vector Rubber Sheeter (VRS)</u></b><br />
When you are ready to conflate your data to match a more accurate landbase (see WMC below), you will need to start some kind of data conflation program. Come talk to Graham about how VRS can automate some of the tedious control point creation tasks you have to do using your current conflation tool. With VRS, your conflation work flow is also improved with support for merging conflated data in different alternatives. Graham's the man to see about your data conflation needs.<br />
<br />
You can find out more about VRS <a href="http://www.ifactorconsulting.com/?s=vrs" target="_blank">here</a>.<br />
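VRS's actual algorithms are iFactor's own, but to give a flavour of the arithmetic behind control-point conflation, here is a Python sketch that fits a least-squares affine transform from matched control points (real rubber sheeting typically uses piecewise or local transforms rather than one global fit, so treat this purely as an illustration):<br />
<pre><code># Illustrative only: not VRS's algorithm. Fits one global affine transform
# from source->target control point pairs by least squares.
import numpy as np

def fit_affine(src, dst):
    """src, dst: (N, 2) arrays of matching control points, N >= 3."""
    src = np.asarray(src, float)
    dst = np.asarray(dst, float)
    # Each point contributes x' = a*x + b*y + c and y' = d*x + e*y + f.
    A = np.hstack([src, np.ones((len(src), 1))])       # (N, 3)
    coeffs, *_ = np.linalg.lstsq(A, dst, rcond=None)   # (3, 2)
    return coeffs

def apply_affine(coeffs, pts):
    pts = np.asarray(pts, float)
    return np.hstack([pts, np.ones((len(pts), 1))]) @ coeffs

# Three surveyed control points and their positions on the accurate landbase.
coeffs = fit_affine([(0, 0), (100, 0), (0, 100)],
                    [(5, 3), (104, 4), (6, 102)])
moved = apply_affine(coeffs, [(50, 50)])
</code></pre>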
<br />
<b><u>Version Viewer (VV)</u></b><br />
If you are a Smallworld administrator or super user, you will often need to know what is different between alternatives and then based on that you might want to copy data between alternatives. Version Viewer (VV) is a great graphical tool that every Smallworld super user should have in her toolkit to take full advantage of Smallworld's powerful core difference stream functionality.<br />
<br />
Use Version Viewer to:<br />
<ul>
<li>find out exactly what database changes your batch scripts are making</li>
<li>review work done by users in an alternative</li>
<li>recover "lost" data that has been deleted in an alternative but still exists behind an older alternative or checkpoint.</li>
</ul>
You can find out more about Version Viewer <a href="http://www.ifactorconsulting.com/?s=vv" target="_blank">here</a>.<br />
<br />
<b><u>Google Charts API Magik client</u></b><br />
The Google Charts API is a rich web-based API that allows you to create professional-looking charts, tables and other visualizations. With our Google Charts API Magik product, you can now embed those charts directly in a Magik canvas. But this is not just a simple image copy/paste. These charts in Magik also have hotspots enabled that allow a user to click on the various chart components and drill down to other detailed charts. It is some very cool Magik and you should take the time to see it at our booth this year.<br />
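To give a feel for the chart-in-a-URL idea, here is a Python sketch using the image-chart form of the API, where the whole chart description is encoded in the query string and the response is a PNG (the data values and labels are made up, and the hotspot drill-down in our Magik client is beyond this sketch):<br />
<pre><code># Sketch of the image-chart flavour of the Google Charts API: the chart is
# described entirely in the URL and comes back as a PNG. Values and labels
# below are invented for illustration.
from urllib.request import urlopen
from urllib.parse import urlencode

params = urlencode({
    "cht": "p3",                    # 3-D pie chart
    "chs": "400x160",               # image size in pixels
    "chd": "t:60,25,15",            # data series (text encoding)
    "chl": "Gas|Electric|Water",    # slice labels
})
png = urlopen("https://chart.googleapis.com/chart?" + params).read()
with open("chart.png", "wb") as f:
    f.write(png)
</code></pre>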
<br />
<b><u>Web Maps Connector (WMC)</u></b><br />
WMC presents web-based tile and static maps as graphics layers in Smallworld. Our list of supported formats and functionality is growing every year; a sketch of the tile-addressing scheme most of these services share follows the list.<br />
<ul>
<li>World Wide</li>
<ul>
<li>Bing Maps</li>
<li>Google Maps</li>
<li>OpenStreetMap Mapnik rendered tiles</li>
<li>MapQuest Open data</li>
<li>WeatherBug Lightning Strikes</li>
<li>ESRI ArcGIS Data Appliance</li>
<ul>
<li>if your organization has an ArcGIS Data Appliance server within the firewall, we can configure WMC to let you view that data in a Smallworld client.</li>
</ul>
</ul>
<li>South Korea</li>
<ul>
<li>Naver Maps</li>
</ul>
<li>Thailand</li>
<ul>
<li>Longdo Maps</li>
</ul>
<li>Japan</li>
<ul>
<li>NTT Geospace Imagery and Road Tile</li>
</ul>
<li>USA</li>
<ul>
<li>Digital Map Products showing lot parcel data nation-wide</li>
<li>WeatherBug Doppler Radar with 3 hour history</li>
</ul>
</ul>
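Most of the tile services listed above address their map tiles with the same "slippy map" scheme: a zoom level z plus a column/row pair (x, y) derived from longitude and latitude in the spherical Mercator projection (Bing's quadkeys are an interleaved encoding of the same numbers). A minimal Python sketch of that arithmetic:<br />
<pre><code># Standard slippy-map tile math: which tile contains a given lon/lat at zoom z.
import math

def lonlat_to_tile(lon_deg, lat_deg, z):
    lat = math.radians(lat_deg)
    n = 2 ** z                                   # tiles per axis at zoom z
    x = int((lon_deg + 180.0) / 360.0 * n)
    y = int((1.0 - math.log(math.tan(lat) + 1.0 / math.cos(lat)) / math.pi) / 2.0 * n)
    return x, y

# Tile containing Boulder, Colorado at zoom 12 (hypothetical example point).
print(lonlat_to_tile(-105.27, 40.02, 12))    # -> (850, 1550)
</code></pre>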
We also recently embedded Google Street View in Smallworld to give users that view of the world. Come see a demo of this new functionality.<br />
<br />
<ul>
<li>We have customers that use the aerial/satellite imagery to digitize buildings and other facilities into their database.</li>
<li>We have customers that have permanently turned off their legacy landbase datasets and use WMC exclusively to provide their land basemap.</li>
</ul>
WMC is product-agnostic and works equally well in PowerOn, PNI, Electric/Gas Office and all manner of custom CST environments.<br />
<b><i><br /></i></b>
<b><i>Let WMC get you out of the landbase maintenance business and let you focus on your company's core business.</i></b><br />
<br />
You can find out more about Web Maps Connector <a href="http://www.ifactorconsulting.com/products/web-maps-connector/" target="_blank">here</a>.<br />
<b><u><br /></u></b>
<b><u>Linking Smallworld With a True "Killer App"</u></b><br />
At this conference you will likely see more than a few applications that show Smallworld maps on mobile devices. I think that is very cool. But what we would like to show you is how you can integrate your Smallworld data and processes with two of the all-time killer apps already out there: voice and SMS. It doesn't sound nearly as exciting as mobile apps, but when you have finished our interactive demo, you will see these true "killer apps" in a new light.<br />
<b><br /></b>
<b>Scenario#1</b>. Your field technician wants to attach a picture or a voice recording to an existing work order or service point in Smallworld. We can show you how he can do it without having any special software on his phone. The image and voice attachments are available to the back-office Smallworld users within seconds of the technician sending them from the field, all without any complex configurations on your part.<br />
<b><br /></b>
<b>Scenario#2</b>. You have a storm situation and many out-of-state crews have arrived to help you restore power. You cannot deploy mobile data terminals to these crews but still want some type of automated tracking of crew statuses. See our demo for how you can use some paper and SMS/Voice-based technologies coupled with Smallworld applications to get the visiting crews hooked into your automated work flow with minimal cost.<br />
<b><br /></b>
<b>Scenario#3</b>. Your utility hasn't introduced automated meter reading yet and you are looking for a way to reduce meter reading costs. See our demo that shows how you can engage your customers using SMS text messages and have them assist you with some of the meter reads.
Bring your phone with you and be part of the SMS/Voice demo.<br />
<b><u><br /></u></b>
<b><u>HTML Rendering and JavaScript Support in Magik</u></b><br />
Magik does not support HTML rendering or JavaScript... but iFactor does! Based on some crazy skunk-works R&D we are doing at iFactor Consulting, we can now show you an early release of our yet-to-be-named HTML rendering and JavaScript engine framework for Magik. Many years from now you can tell your children that <i>you saw it here first!</i><br />
<b><u><br /></u></b>
<b><u>FME Server Lite</u></b><br />
I put this in here to see if the FME folks would read this far in the article :) There is no FME Server Lite. But I have been starting to dabble with node.js and various queuing technologies and it seems to me that it could be coupled with an existing FME Desktop session to give you a kind of poor man's FME Server. Come by to talk about new technologies and how you can use them to improve your Smallworld experience.<br />
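For the curious, here is a toy sketch of that idea, with Python standing in for the node.js and queuing technologies mentioned above: a queue of translation requests drained by a worker that shells out to a local FME Desktop install. The workspace name and published parameter are hypothetical:<br />
<pre><code># Toy "poor man's FME Server" sketch. Python stands in for the node.js +
# queuing stack; the workspace path and parameter name are hypothetical.
import queue
import subprocess
import threading

jobs = queue.Queue()

def worker():
    while True:
        source = jobs.get()          # e.g. a dataset path submitted by a user
        # fme runs a workspace from the command line; published parameters
        # are passed as --NAME value pairs.
        subprocess.run(["fme", "export_to_kml.fmw", "--SourceDataset", source])
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()
jobs.put(r"c:\data\feeder_123.gdb")
jobs.join()                          # wait for the queue to drain
</code></pre>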
<br />
OK, now that I've written this post, I realize we have a ton of stuff to show you. So come early to get the best seats. And if you spend the entire day at our booth, that's OK.<br />
<br />
Remember to tag your conference-related tweets with <a href="https://twitter.com/i/#!/search/%23swuc" target="_blank">#swuc</a>. I think <a href="http://www.linkedin.com/pub/mark-field/2/276/9b4" target="_blank">Mark</a> at <a href="http://field-csi.com/" target="_blank">FCSI</a> might be offering some incentives for the person with the most #swuc tweets. I'm thinking we can exceed the 10+ tweets that were generated at last year's conference :)
<br />
<h2>Exporting Smallworld Data to KML (2011-10-10)</h2>
There was a recent <a href="http://tech.groups.yahoo.com/group/sw-gis/message/21587" target="_blank">question on the sw-gis Yahoo group</a> asking how to export a Smallworld trail to KML for use in other applications. It turns out this can be done easily using the open source Magik Components Library (mclib). The following video demonstrates how. Important information:<br />- The Magik Components Library is <a href="http://www.assembla.com/spaces/magik-components" target="_blank">here</a><br />- Thanks to <a href="http://www.linkedin.com/in/bradsileo" target="_blank">Brad Sileo</a> (<a href="http://www.ifactorconsulting.com/" target="_blank">iFactor Consulting</a>) for contributing this code to the MCLIB project.<br /><br /><iframe width="640" height="480" src="http://www.youtube.com/embed/COq56M27eTo" frameborder="0" allowfullscreen></iframe><br /><br />Give it a try and let me know what kind of cool export applications you are using. And if you have any improvements to make to the functionality, please feel free to contribute to MCLIB.
<br />
<h2>Smallworld Technical Paper No. 14 - GIS in the Cable Market (2011-08-29)</h2>
<p><em>by John Rand MSCTE, Design Manager, Cambridge Cable Ltd, Cambridge U.K.</em></p><p>John Rand has considerable experience in the cable TV and telecommunications industry, originally specialising in the Local Network with British Telecom. He joined Cambridge Cable on its formation and was instrumental in the development of the integrated cable TV and telecommunications network design, with responsibility for selecting and implementing GIS, including advising on development. He studied Telecommunications at Southgate College, London and cable television in Atlanta, Georgia, and is currently studying Management at Anglia University, Cambridge, specialising in Operations and Project Management.</p><h3>Abstract</h3><p>This paper will look at the cable industry's requirements for a Geographical Information System (GIS) and the contribution a GIS will make to the cable TV and telecommunications business. Analysis will be made of existing CAD systems and the GIS capabilities applicable to the unique UK cable industry and the reasons for changing systems. This industry is expanding at a rapid rate and I have illustrated how GIS can be used to meet the business plan goals.
The areas of implementation, finance and marketing are also discussed.</p><p>The paper concludes that to reap the full business opportunities presented to this unique industry a GIS is the only credible system available for utilising the company's resources to their maximum benefit.</p><h3>Introduction</h3><p>GIS is only just starting to emerge as a useful tool within the cable industry. Its presence has been known for many years and although a few vendors have tried to develop a successful product, none until recently appear to have achieved this. However, that scene now appears to be changing and we are at a point where before us lies a seemingly endless vista of possibilities for the use of the system within the industry. Though the nature of the system is far-reaching in every aspect of our business, the implementation and development of our requirements are by no means straightforward and indeed are proving to be quite complex. This paper will look at how Cambridge Cable intends to use this "state of the art" technology, through its benefits, to reach our vision of becoming the premier provider of entertainment, information and communications services for the benefit of the community and our customers, employees and shareholders, and how this technology can benefit the whole industry. The presence of GIS will eventually be experienced in just about every department of the company, providing the core information to drive the company forward. It will eventually be as commonplace as any other information system, only more crucial.</p><p>The leading cable companies in the UK are beginning to implement GIS. This paper will look at the benefits of GIS to the cable industry, the newest of the utility companies.</p><h3>The UK Cable Market</h3><p>In order to appreciate fully the importance of the GIS industry within the cable market it is essential to understand the unique nature of the cable industry within the UK and the implications. A brief illustration of the industry follows.</p><p>Within the UK, 139 cable television franchise areas have been created by the Department of Trade and Industry (DTI). Franchise areas cover areas of dense population, therefore only 70% of the UK is covered; however, expansion and creation of other franchise areas are possible. The first 12 franchises were awarded in the mid 80s and the remaining 127 franchise areas in the last five years. Cable television companies holding franchise area licences can also apply for "Public Telecommunication Operator Licences" (PTO) for their areas, therefore creating a dual service industry: cable TV and telecommunications. The regulatory bodies for cable television and telecommunications are the Independent Television Commission (ITC) and the Office of Telecommunications (OFTEL) respectively. The provision of two major service products by one company makes us unlike any other utility company.</p><p>[ Figure 1 not available ]</p><h3>Cambridge Cable's Position within the UK Market</h3><p>Cambridge Cable Limited (CCL) was formed in July 1988. It was awarded the Cambridge franchise in June 1990 and started constructing the network in June of 1991. The Anglia franchise was acquired in December 1992, thus making a total of approximately 200,000 homes covered by our operation.
CCL is jointly owned by Comcast Communications of Philadelphia, USA and Singapore Telecom International.</p><h3>How the Industry Works</h3><p>The measure of the size of a franchise or company is how many homes fall within the franchise boundary; the penetration of our services into this number is one of the core statistics to watch. This represents the expected revenue with which to repay investment. The income from the two services is subtly different in that cable TV income is a set flat monthly rate, depending on the chosen package, whereas telecommunications revenue is dependent upon usage. As owners of a cable TV franchise, we face no competition for broadband services (British Telecom cannot operate cable television services on its network until 1997); however, British Telecom is our main competitor for "Local Loop" services. This is where the cable companies must use all their resources to succeed and gain the upper hand. The leading UK cable companies pride themselves on using "state of the art" technologies and practices to achieve this, and thus the importance of using GIS becomes abundantly clear.</p><h3>Economics of the Industry</h3><p>The basic economics of the industry are similar to those of any other; finance is raised and used to construct an infrastructure network over which our services can be carried. Both services will be constructed as one network. The incremental costs for the second network are minimal as the greatest investment lies within the civil construction costs.</p><p>Cambridge Cable has four main strategic goals. With the assistance of GIS all these objectives can be achieved and maintained efficiently.</p><p>1. To create and develop profitable market opportunities. Through geographical market analysis, correct products and services can be determined and potential opportunities exploited.</p><p>2. To provide a wide range of differentiated quality services and products at competitive prices. Again GIS will be invaluable as a tool for market analysis.</p><p>3. To ensure the network is "future-proof", user-friendly and cost-effective. GIS will be used to simulate different architectural models and assess new technologies, giving us the required information to build an economical and reliable network.</p><p>4. To hire, develop, and retain the right people at the right time. The implementation of GIS as "state of the art" equipment demonstrates commitment by Cambridge Cable to new technology and to providing people with the right tools and information to develop careers.</p><p>[ Figure 2 not available ]</p><p>GIS will be a significant force in achieving these goals and contributing to the company's success.</p><p>The obstacle to this is that the existing situation relies on manual interaction between the different departments. Currently, design is drafted on a CAD system and from there on is printed and used in paper format. Other systems exist within the cable TV operation (Subscriber Management System, Network Management Systems and the Telecom Network Circuit Assignment Systems); however, none of these interact, leaving numerous opportunities for miscommunication and "information-error".</p><p>The supra-system of the business, with its requirement for return on investment and instant, current information on the network and customers, is causing stress on the sub-systems of:</p><ul type="disc"><li>Subscriber Management System</li><li>Network Management System</li><li>Telecom Circuit Assignment System</li></ul><p>plus various manual systems. This creates the need for a global system.
The answer is the implementation of a GIS which can facilitate the interaction of all these sub-systems.</p><h3>The Cable Industry and GIS</h3><h3>Cable Requirements of a GIS</h3><p>The process of obtaining customers to bring in revenue begins with constructing the network; therefore, design is required. The principal requirement is for a system able to produce comprehensive designs, information on the areas already constructed and on the network status, which will be readily available to those requiring it at any time. An ideal example of how a complete system would work is as follows:</p><p>1. Survey information would be collated from the field on a portable PC and input directly on to the digital Ordnance Survey map. This information would then be downloaded into the GIS and the design created thereon.</p><p>2. On completion of the design, customer addresses would be transferred to the Subscriber Management System and automatically "populated". Telecom assignment information would also be transferred to the Circuit Assignment System, and purchasing would automatically receive Bill Of Materials (BOM) information.</p><p>3. The GIS information would be linked to the cable television and Telecommunication Network Management Systems. Should there be a network performance problem or outage, instant geographical information/reports can be generated alongside instantaneous information for customer services.</p><p>4. Black spot analysis for maintenance purposes becomes effortless and marketing can identify meaningful information.</p><p>[ Figure 3 not available ]</p><h3>History of GIS in the Cable Industry</h3><p>As mentioned earlier, GIS has never quite found its feet within the cable industry. The majority of cable companies have either used pen and paper or CAD systems which are extremely stylised (Newell and Sancha 1990). A few GIS vendors have tried to develop GIS systems for the cable market. Cambridge Cable purchased a CAD system early in 1991 and had been using it as a successful tool until recently. Many functions available on GIS were not available on CAD, and this was compounded by the lack of support given to the product. The CAD system imposes severe limitations on effective use within this rapidly growing industry. Newell and Sancha (1990) commented "Several of the established CAD vendors tried to adapt their CAD systems for GIS applications. This resulted in most unsatisfactory compromises"; "CAD vendors continued to try to convince the industry that they had a viable product by integrating their databases with the CAD function"; and referred to "marrying together two inadequate systems" (Newell and Sancha 1990). The two technologies of database and CAD do not integrate easily. In late 1992 the industry started to talk more about GIS. No single GIS system proved to be totally reliable and no single vendor stood out. Cambridge Cable were approached by Smallworld - a company well established in GIS and based locally in Cambridge - to work together to develop the combined cable TV and telecommunications model and to establish an unrivalled product for the industry.
Fundamental requirements included: data capture, performance, customisation and integration (Newell and Theriault 1989). These aspects were severely limited or non-existent with Cambridge Cable's current system.</p><p>[ Figure 4 not available ]</p><h3>Implementation of GIS</h3><p>It is clear now that no improvements to the CAD system could have provided our business with its requirements. The GIS is now installed within the design department and is already proving beneficial through its ability to calculate system performance, thus helping to obviate the need to use design contractors.</p><p>The next stage of implementation is to integrate the Network Management Systems and Subscriber Management System etc.</p><p>The final and more idealistic stage of implementation is to integrate GIS throughout the company to provide full, up to date network information to everyone. "As the number of users sharing information in this way increases, the system will constitute a continually improving Geographic Information System for the benefit of all" (Bernhardsen and Tveitdal 1986). This will also improve the "work conditions for the specific personnel groups" (Kubik, Merchant et al 1987) in that having relevant current information immediately at hand will be of enormous benefit for optimum performance.</p><h3>Why Invest in GIS</h3><p>Should a cable company invest in GIS? Considering the magnitude of the initial investment, should they stay with their manual paper or CAD methods? When, also, is the best time to invest? The analogy I put forward is that of comparing GIS to computing in the 60s. The first generation have high purchase costs, high maintenance and few benefits, but as things have progressed no business would be without one. GIS is now sufficiently developed to be useful to the cable industry. Eventually, as with computing in the 60s, no cable operation will exist without a GIS. With networks growing at a tremendous rate any wise company would invest in GIS. The thought of transferring enormous amounts of data at an advanced stage of the build is horrifying and expensive.</p><p>The investment in a GIS system as a proportion of the total investment in constructing the total network is only a fraction of the costs. Considering this will be controlling the network assets and providing the benefits described later, it can be seen as an essential long-term investment.</p><h3>What are the Benefits to the Cable Business</h3><ul type="disc"><li>Improve the quality of network design by enforcing engineering rules and standards which can be preset and fine tuned.</li><li>Facilitate the achievement of cost reduction and quality improvements in passing addresses to the Subscriber Management System with implementation of an automated interface.</li><li>Improve repairs to the network through the provision of visual aids on fault investigation.</li><li>Simulation of different architectural network models to evaluate the most cost-effective solution to design scenarios, e.g.
"Fibre to the Feeder" architecture versus "Fibre to the Kerb".</li><li>A GIS is a very useful tool in evaluating new technologies and their impact on existing networks, e.g., PDH Versus SDH</li><li>Analysis of financial comparisons of percentage turnover ploughed in against speculation.</li><li>Having advanced equipment attracts the right calibre of staff and enables them to advance their careers with current technology.</li><li>Attraction of investment within the company by being seen as innovative and conscious of the need to have accessibility to vital information.</li><li>Accurate inventory of assets and asset management for capital accounts. Also analysis of potential acquisitions including identification of existing or potential plant within those areas.</li><li>Interactive queries for precise retrieval of information concerning the network.</li><li>Quicker response times to customer orders, due to readily available information, especially regarding telecom enquiries and indication of likely installation dates.</li><li>Substantial savings can also be made through integrating GIS with purchasing, warehousing, and the construction programme. Bill of Materials (BOM) created by GIS can be transferred to the purchasing computer system where they can be ordered on minimum lead time in relation to the construction schedule and received in the warehouse for "just in time" materials management.</li><li>Geographic survey information captured on GIS can be sold commercially to any other parties interested in such data.</li><li>It can also query information without the need to survey.</li></ul><h3>Economic Benefits</h3><p>Many of the above points can represent very tangible cost savings and through collation of data this can be proved. However with GIS there are considerable intangible cost benefit savings that only GIS can give, as opposed to improving existing systems. An example of this is that, after the initial investment in GIS, savings in staff can be made without the usual element of human error.</p><p>Intangible benefits:-</p><p>More information (marketing, customer service, fault locates etc) Better analysis with less labour time (marketing, new technologies) Ability to do analysis not possible before (new RF and telecom technologies) Better decisions (build areas, new technologies) Better planning (network design, business plan) Better understanding and analysis of highly complicated systems</p><p>[ Figure 5 not available ]</p><h3>Return on Investment</h3><p>Deciding where to build currently concentrates on areas of highest density. This is desirable because eventually we want to provide service to every home in the franchise area. The denser areas are typically those with the best demographics although, currently, no marketing analysis is done to determine if any of these areas are better than others. Through using GIS as a sophisticated marketing tool and analysing areas, the best potential dense areas can be built first. Early high penetration will be achieved and high revenue will be received, yielding a high return on investment.</p><h3>Cable Marketing and GIS</h3><h3>Marketing</h3><p>It is essential to hit the right potential customer base with the right services. Traditionally lower socio-economic groups are better target groups for cable TV while the higher groups are more likely to be interested in telecommunication services. 
However, it is now being found that socio-economic groups mean slightly less than customer "Lifestyles", which concentrate on the use of disposable income. If this is to be used, data set analysis can be done prior to design in order to identify the correct market and concentrate on building in that area first. Without GIS the process would be a very lengthy and laborious task.</p><p>Since telecommunications is regarded as an essential service, churn is not experienced to the same extent as with cable TV customers. If market identification and "right sizing" can be advised prior to a sale, enormous savings can be made in the areas of abortive sales calls, installation and equipment retrieval. Sophisticated marketing analysis can be done on the remaining potential customer base to determine the required product.</p><h3>Conclusion</h3><p>Clear benefits can be seen in implementing a GIS system within a cable business and the advantages are clearly defined. There is also confidence within the industry that there are credible vendors with a tremendously useful product of enormous value to a company's operation.</p><p>The near future for GIS looks exciting and in the long term there will be far-reaching effects on our business. It is an essential tool in effective competition. A culture change in the working environment will be required to make this proliferation of invaluable information acceptable. Precise marketing is the key and, by using GIS to interact with and analyse all available information, cable companies will be able to achieve greater success within their market.</p><h3>References</h3><p>Bernhardsen, T. and Tveitdal, S., 1986. Community Benefit of Digital Spacial Information. VIAK A/S - Auto Carto London, Vol. 2.</p><p>Dickinson, H.J. and Calkins, H.W., 1988. The Economic Evaluation of Implementing a GIS. International Journal of Geographical Information Systems, Vol. 2, No. 4, pp. 307-327.</p><p>Joint Nordic Project, 1987. Digital Map Data Bases, Economics and User Experiences in North America (Helsinki, Finland: Publications Division of the National Board of Survey, Finland).</p><p>Kubik, K., Merchant, D. and Schenk, A., 1987. Design Considerations for Urban Information Systems. A-ASPRS-ACSM, Vol. 5.</p><p>Marble, D.F. and Peuquet, D.J., 1983. Geographic Information Systems and Remote Sensing. Manual of Remote Sensing, 2nd ed., American Society of Photogrammetry, Vol. 1.</p><p>Newell, R.G. and Sancha, T.L., April 1990. The Difference Between CAD and GIS. Computer Aided Design magazine.</p><p>Newell, R.G. and Theriault, D.G., September 1989. Ten Difficult Problems in Building a GIS. Presented at British Cartographic Society Symposium, Cambridge.</p><p>Theriault, D.G., April 1989. An Overview of Geographical Information - the technology and its users. Presented at conference, Geographic Information Systems.</p><h3>Acknowledgements</h3><p>David Theriault, Smallworld Systems Ltd. Keith New, Cambridge Cable Ltd.</p><h3>Glossary</h3><ul type="disc"><li>CHURN - Turnover of customers; disconnections after connection.</li><li>OUTAGE - Complete loss of service.</li><li>PDH - Plesiochronous Digital Hierarchy.</li><li>SDH - Synchronous Digital Hierarchy.</li></ul>
<br />
<h2>Smallworld Technical Paper No. 13 - The Wide Area Connection (2011-08-29)</h2>
<p><em>by John Rowland, Grampian Regional Council</em></p><h3>Abstract</h3><p>GIS data is voluminous and demanding upon bandwidth, and therefore normally requires high speed network links. This has served to constrain "real time" wide area distribution of GIS data. In conjunction with British Telecom, Gandalf Digital Communications Ltd and Smallworld Systems Ltd, Grampian Regional Council believes it has been able to implement a realistic solution to this problem using Smallworld's recently developed "Persistent Cache" functionality running over British Telecom "Kilostream" links.</p><p>This paper:</p><ul type="disc"><li>briefly explains Grampian Regional Council's requirement for wide area GIS;</li><li>overviews wide area communication options;</li><li>explains the basic concept of intelligent bridging;</li><li>describes the key features of the Smallworld System which have been used to implement wide area connections;</li><li>reviews experience to date;</li><li>briefly considers what the future may hold.</li></ul><h3>Grampian Regional Council & its Corporate GIS</h3><p>Grampian Regional Council administers a land area of approximately 8,000km<sup>2</sup> which is home to a population of 530,000, half of whom live in the City of Aberdeen. As with other Scottish Regional Councils, its responsibilities include the provision of water, drainage, roads, economic development, strategic planning, fire, police, education and social services.</p><p>The Council's main headquarters is Woodhill House in Aberdeen; some departments also operate from a number of divisional and other offices located throughout the Region.</p><p>In 1992 the Council commenced implementation of a Corporate GIS which was installed in Woodhill House for use by four departments (Economic Development and Planning, Property, Roads and Water Services). The Council selected Smallworld GIS running under UNIX as its core system. At present all departments share a single corporate GIS database which is managed by a Sun MP630 file server. With ongoing data capture this database continues to increase in size; at the time of writing it held 4GB (gigabytes) of GIS data.</p><p>Responsibility for maintaining this database and ongoing implementation of the system on behalf of user departments is vested in a six-person team called the "GIS Unit". To date the Council has acquired a total of twenty nine GIS "seats" from Smallworld with more on the way. Ten of these seats have recently been acquired by the Department of Water Services for installation at six different office locations remote from Woodhill House (see figure 1).</p><p>Until recently it had not been viable for the Council to operate their Corporate GIS over a wide area network. However, Smallworld's recently developed Persistent Cache database management software combined with "state of the art" network bridge technology has enabled the Council to implement wide area connections using leased British Telecom Kilostream lines.
At the time of writing two of the Department of Water Services' remote offices have been connected to the main file server in Woodhill House.</p><p>[ Figure 1 not available ]</p><h3>Wide Area Connection Components</h3><p>The wide area connection has four key components: a physical communication link (British Telecom 64Kbps Kilostream in the first instance), intelligent bridging (Gandalf LANLine), the Smallworld version managed GIS database and Smallworld Persistent Cache software (2).</p><h3>Physical Communication Links</h3><p>A 500m x 500m Ordnance Survey vector tile of an urban area typically contains around 250Kbytes of uncompressed data. In order to pass such a tile over a network and display it in a total elapsed time of less than 45 seconds, the network has to pass data at a speed in excess of 44Kbps (kilobits per second). In order to view 1km<sup>2</sup> of similar data in the same time the speed would have to increase to in excess of 180Kbps.</p><p>This should not be a problem over a local area network with a bandwidth of 10Mbps (megabits per second). However, if all that there is between office locations is a public telephone network, a couple of high speed modems operating at 14.4Kbps and the inherent "dial up" delay of analogue communications, then there clearly is a problem.</p><p>There is no alternative but to seek a digital communications link. Depending upon what you are prepared to pay, digital links can provide effective line speeds of 64Kbps up to in excess of 8Mbps with minimal "dial up" delay. They can either be ISDN ("pay when you use") dial up links or dedicated "Kilostream" or "Megastream" leased lines.</p><h3>ISDN</h3><p>In the United Kingdom ISDN (Integrated Services Digital Network) is available either as ISDN2, providing an effective 128Kbps line speed using two 64Kbps channels, or ISDN30, providing an effective 1.92Mbps using thirty 64Kbps channels. At the time of writing British Telecom ISDN2 socket installations were being charged at approx £400 per site, line rent at £84 per quarter and transmission at normal telephone call rate charges.</p><h3>Leased Lines</h3><p>Leased lines normally incur an initial installation charge and a subsequent annual rental charge which varies according to distance from the nearest digital exchange. At the time of writing British Telecom were charging £900 per site to install 64Kbps "Kilostream". The annual line rent of a link between two sites varies according to distance and proximity to BT exchanges; some indicative figures are quoted in the ISDN2 v Kilostream comparison below.</p><p>In contrast, 2Mbps "Megastream2" currently costs £6,200 per site plus £750 per link for a first installation, and 8Mbps "Megastream8" £9,734 per site plus £2,625 per link. Line rents vary according to distance between BT exchanges; for example, if two exchanges were 50km apart, Megastream2 would currently cost £15,740 per annum to rent and Megastream8 £55,108 per annum. Even the most optimistic GIS cost benefit analysis may have difficulty in justifying expenditure of this magnitude!</p>
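<p>A quick check of the tile-display arithmetic quoted at the start of this section, assuming 1Kbyte = 1024 bytes (which reproduces both quoted figures):</p>
<pre><code># Line speed needed to move an uncompressed OS tile in a given elapsed time.
def required_kbps(kbytes, seconds):
    return kbytes * 1024 * 8 / (seconds * 1000.0)   # Kbytes -> kilobits/second

print(required_kbps(250, 45))    # 500m x 500m urban tile -> ~45.5 ("in excess of 44Kbps")
print(required_kbps(1000, 45))   # 1 km^2 of similar data -> ~182  ("in excess of 180Kbps")
</code></pre>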
<p>Despite current talk of information super highways it is of little surprise that many multi-site GIS installations are still reliant upon using tapes, discs and couriers to transfer data between individual sites.</p><p>The wide area connections to Grampian Regional Council's six Water Services remote offices are being implemented using a single 64Kbps Kilostream channel to each office.</p><p>[ Figure 2 not available ]</p><h3>Intelligent Bridging</h3><p>Bridge or gateway devices are needed to connect the physical wide area communication link between two remote sites to the local area networks (LANs) at those sites.</p><p>A bridge is effectively a filter which joins two network segments such that data will only pass through the bridge to a second segment if it is destined for a device connected to it. Bridges are commonly used to segment local area ethernets so that unwanted data packets are not allowed to flow along segments where they are not needed.</p><p>In a UNIX environment bridging is achieved using the IP (Internet Protocol) part of the TCP/IP protocol (1). Every device connected to an ethernet has its own unique IP address. A data packet being transmitted from one device to another always carries with it the IP address of the device to which it is being sent. In the case of a data packet which is broadcast to all devices on a network, the IP address is coded so as to indicate that it needs to be delivered to every device.</p><p>Gateways are special devices for transferring data between two different networks which adhere to different network protocols. As such they actually have to restructure the data packets which pass through them and are therefore inherently slower than bridges.</p><p>Wide area physical communication links between sites are nearly always slower than the local area networks they connect together; hence bridge or gateway devices are needed to prevent unwanted local area traffic from escaping to, and causing congestion on, the physical wide area link. Bridges supplied by Gandalf and other vendors for this purpose incorporate a number of intelligent features to enhance their performance.</p><h3>Data Compression</h3><p>Data compression algorithms are used to compress transferred data, so as to achieve actual throughput which exceeds the quoted bandwidth of the physical wide area communication link, the degree of compression depending upon the extent to which the data is already compressed. For example, tests at Grampian Regional Council indicate that their Gandalf "LANLine" bridges operating over 64Kbps Kilostream are able to compress raw NTF files by ratios in excess of 3:1 and already compressed TIFF files by ratios of around 2:1, thus achieving effective throughput of data in excess of 192Kbps for raw NTF and 128Kbps for TIFF. Even higher compression ratios of up to 8:1 can be achieved with these devices.</p><p>[ Figure 3 not available ]</p><h3>Transparent Automatic Dial Up</h3><p>Bridges built specifically for connecting local area networks to "dial up" links such as ISDN embody an "automatic dial up" facility whereby (for UNIX networking) the bridge is configured with a table which maps different network IP addresses to the phone numbers to which they are connected.
<p>ISDN bridges will normally also have a configurable "time out" connection period which specifies how long an ISDN connection should remain connected after a packet has been transmitted. For example, if the time out were set to 30 seconds then the connection would close every time there is a break of 30 seconds between transmitted packets. Given that an ISDN connection can be dialled up in as little as 5 seconds, it is quite feasible to make several very short connections during the course of the working day and only incur a relatively small phone bill.</p><p>Automatic dial up and subsequent timed-out disconnection are totally transparent to the user; thus the ISDN bridge provides a virtual permanent connection.</p><h3>Bandwidth On Demand</h3><p>ISDN2 incorporates two individual 64Kbps channels which can either be used in parallel to achieve an effective 128Kbps bandwidth (with compression, actual throughput will be even faster), or separately to send data to two different destinations at the same time. Similarly, Kilostream can be installed in multiples of 64Kbps channels and used in much the same way.</p><p>"Bandwidth on demand" characteristics of local to wide area bridges enable individual ISDN and Kilostream channels to be automatically opened and closed to different destinations according to actual traffic volumes. With the Gandalf "LANLine" bridges it is also possible to mix and match Kilostream and ISDN together, such that an ISDN connection can be opened when a single permanent Kilostream channel becomes overloaded.</p><h3>Virtual Extended Local Area Networks</h3><p>The net effect of state of the art intelligent bridging used in conjunction with digital wide area communication links such as ISDN and Kilostream is to create a virtual extended local area network. In a UNIX environment, client workstations located at one site can access server devices at another site several kilometres away as if both devices were connected to the same local area network, albeit with degraded performance if the volume of data being transferred between sites exceeds the available bandwidth of the physical wide area link.</p><p>Not only does this permit remote offices to access main office data, but also to output data to peripheral devices, such as expensive large format electrostatic plotters, located in the main office.</p><h3>Database Version Management</h3><p>In order to understand how Persistent Cache is being used to provide Grampian Regional Council's "wide area connection" it is first necessary to provide a brief explanation of their implementation of Smallworld's version managed database.</p><p>Smallworld Version Management permits several versions of the database to exist simultaneously. In Grampian's case these versions are organised hierarchically, as illustrated by figure 2. There is a single definitive "top alternative" which is normally never written to directly. Each department is then provided with its own version of the "top alternative", which again is normally never written to directly; instead, all users who are required to write to the database are each provided with their own "personal writable alternative".</p><p>For routine data capture work, users are usually asked to update their departmental alternative on a daily basis by "posting up" their own personal alternative to it. This has to be preceded by a "merge down" of all changes which have already been posted to their departmental alternative. Once all personal versions have been "merged and posted", a departmental administrator then ensures that their own department's alternative is "merged and posted" to the "top" definitive alternative, thereby inheriting changes and updates made by other departments.</p>
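<p>The shape of this hierarchy, and of the "merge down then post up" procedure, can be sketched as follows (a toy Python model of the workflow, not Smallworld's version management API; sets of change records stand in for real database deltas):</p><pre>
class Alternative:
    """Toy model of a version-managed alternative in a hierarchy."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.changes = set()          # change records visible here

    def merge_down(self):
        """Pull changes already posted to the parent alternative."""
        if self.parent:
            self.changes |= self.parent.changes

    def post_up(self):
        """Publish local changes to the parent alternative."""
        self.merge_down()             # must precede the post
        if self.parent:
            self.parent.changes |= self.changes

top = Alternative("top")
water = Alternative("water dept", parent=top)
user = Alternative("user 1", parent=water)

user.changes.add("new main, Woodhill Road")  # day's data capture
user.post_up()                               # daily post to department
water.post_up()                              # administrator posts to top
assert "new main, Woodhill Road" in top.changes
</pre>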
<p>Grampian Regional Council's GIS Unit is responsible for maintaining the Ordnance Survey map base and other shared corporate datasets such as a number of different gazetteers. Within the alternative structure the GIS Unit is treated as another department; thus departments, and in turn end users, have their map base maintained for them by virtue of the "post and merge" procedures.</p><p>Within the UNIX file system the GIS database is held in a set of files storing different types of data (e.g. geometrical points, lines, areas, associated attributes, etc.). Database alternatives can be created so as either to be located totally within a file set held in a single directory, or to reside in a separate sub-directory with the same file structure. Thus the UNIX file system can, if desired, be configured so as to totally or partially mirror the database alternative structure (figure 3). This in turn implies that different alternative versions of the database can be stored on different storage devices on the same network.</p><h3>Persistent Cache</h3><p>Smallworld's Persistent Cache software (2) enables all or a subset of a GIS database to be cached to a local disc attached to a client workstation which is in turn configured to be a local cache server to both itself and other clients. By maintaining a copy of frequently accessed data in the local cache, it is an elegant and transparent way of providing large systems with high performance over low speed communication links.</p><p>In figure 4, workstation A is a local cache server located at a remote site along with client workstation B. GIS read transactions generated by workstations A and B look first to the local cache to retrieve data. If the requested data has not been cached, it is retrieved from the main file server via the wide area connection and then cached.</p><p>The local cache has a configurable operating capacity; once this capacity has been filled, old cached data is deleted from the cache on a "least recently used" basis. The cache capacity can be set to be large or small depending upon the size of the required database subset. If need be (local disc space permitting) it could be set to be large enough to replicate the original database.</p><p>When using Persistent Cache, remote site users are able to retrieve cached data very quickly and uncached data at the speed of the wide area connection. Hence, if a subset of the main database is cached, there will be occasions when read transactions may suddenly appear to slow down as data is retrieved over the wide area connection.</p><p>Write transactions write directly to the user's alternative every time a database record is inserted, updated or deleted; the changed data is then subsequently copied back to the local cache.</p><p>At appropriate periods of time, remote site users initiate merging and posting of their changed data with higher order alternative versions. The merge and post processes are run on whichever machine holds the various alternatives. The local cache is updated where new "merged down" change data falls within a geographical area that is already held in cache.</p>
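<p>The read path described above - look in the local cache first, haul from the main file server only on a miss, and evict on a "least recently used" basis once the configured capacity is full - can be sketched as follows (a generic read-through LRU cache in Python, for illustration only; it is not the Persistent Cache implementation):</p><pre>
from collections import OrderedDict

class ReadThroughCache:
    """Generic sketch of a least-recently-used read-through cache."""
    def __init__(self, capacity, fetch_remote):
        self.capacity = capacity          # configurable operating capacity
        self.fetch_remote = fetch_remote  # haul over the wide area link
        self.store = OrderedDict()

    def read(self, key):
        if key in self.store:
            self.store.move_to_end(key)     # mark as recently used
            return self.store[key]          # fast: served locally
        value = self.fetch_remote(key)      # slow: e.g. 64Kbps Kilostream
        self.store[key] = value
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used
        return value
</pre>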
<p>By virtue of being able to map alternative versions of the database onto different UNIX directories (see figures 2 and 3), users' alternatives can either be held on the main server back at headquarters or locally at the remote site. This provides organisations with a high degree of flexibility as to how they operate over wide area connections.</p><h3>Holding Remote Site Alternatives on Main Server</h3><p>If users' alternatives are located at headquarters then all write data is passed over the wide area connection whenever a database record is inserted or updated. In a data capture environment this implies that relatively small amounts of data are passed frequently over the wide area connection.</p><p>Database commits and alternative version posting are processed back on the main server and therefore no data is passed over the wide area connection. Similarly the merge process (merging down of changed data from higher order alternatives) is also undertaken back on the main server; however, the amount of changed data passed back across the wide area connection will depend upon the volume of merged down changed data which maps onto currently cached "geography". By holding all remote site alternative change data on the main server, the remote site users do not need to be concerned with data backup and other routine system administration tasks, which can all be undertaken back at headquarters.</p><p>[ Figure (diagram) not available ]</p><h3>Holding Remote Site Alternatives Locally</h3><p>By holding users' alternative change data locally, no write data is passed over the wide area connection until the locally held alternative versions are merged and posted with and to higher order versions located back on the main file server. If daily posting and merging is undertaken, then this implies a daily transfer of a larger volume of change data over the wide area connection.</p><p>The volume of changed data merged back down to the locally held alternatives is entirely dependent upon the amount of data which has been recently posted to the top (definitive) version of the database by other users. This could be considerable if, say, a new batch of Ordnance Survey maps had been recently loaded.</p><h3>Populating the Local Cache</h3><p>The local cache is essentially an extended reflection of the data which a local client workstation holds in memory. It is therefore composed of a subset of object class layers for "blocks" of geographical extent. For example, Grampian's Water Service divisional offices cache background map and water supply object class layers for all or part of their divisional areas of operation.</p><p>Upon initial creation the local cache is "empty" and must be populated. Users can be left to do this during the course of natural usage; upon first access all data is "hauled" over the wide area connection and then cached. This could be a little tedious if two or more users at the local site are simultaneously hauling data over a 64Kbps line.
They could therefore instead arrange to "zoom out" to a large extent of geography as they leave for home, so that the area in which they wish to work the following day has been cached upon their return to work the following morning.</p><p>Alternatively, initial cache data can be written to tape by staff back at headquarters and then copied into the local cache in order to "kick start" it.</p><h3>Grampian Regional Council's Wide Area Connection</h3><p>Kilostream v. ISDN2</p><p>Although the Council already had some operational wide area communication links, it was decided that the Corporate GIS would have its own dedicated links because of difficulties in extending heavily subscribed existing facilities to sites where GIS was required. The lowest cost option able to provide acceptable performance was therefore sought. This turned out to be a choice between ISDN2 and single channel Kilostream. Capital installation costs were very similar for both (approx £2,500 per site); however, in the case of ISDN2, ongoing running costs varied considerably according to degree of use.</p><p>For total daily connection times of less than about four hours per working day, ISDN2 is cheaper to operate than fixed fee Kilostream, as illustrated below for a notional 247 working days per year at current British Telecom day rate call charges:</p><p>[ Figure (cost notes) not available ]</p><p>The above costings indicate that the most cost effective option is dependent upon the nature of GIS use at the remote site. If there is a low level of write transaction at a site where a significant proportion of the database is held on the local cache, then ISDN2 provides a very flexible and potentially inexpensive wide area link. However, if there is a high level of regular write transaction or considerable regular "hauling" of uncached data throughout the working day, then Kilostream is going to be the more viable option.</p>
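<p>The break-even arithmetic behind this comparison can be sketched as follows. Since the original cost table is not reproduced here, the call rate and Kilostream rent below are placeholder figures chosen only to illustrate a break-even near four hours per day; they are not actual BT tariffs:</p><pre>
# Placeholder tariffs for illustration only; substitute real BT charges.
WORKING_DAYS = 247
ISDN2_RENT_PER_YEAR = 4 * 84.0     # quarterly line rent, from the text
CALL_RATE_PER_HOUR = 2.50          # assumed day-rate call charge
KILOSTREAM_RENT_PER_YEAR = 3000.0  # assumed fixed annual rent

def isdn2_annual_cost(hours_per_day: float) -> float:
    return ISDN2_RENT_PER_YEAR + hours_per_day * WORKING_DAYS * CALL_RATE_PER_HOUR

for hours in (1, 2, 4, 8):
    cheaper = ("ISDN2" if isdn2_annual_cost(hours) < KILOSTREAM_RENT_PER_YEAR
               else "Kilostream")
    print(f"{hours} h/day: ISDN2 costs £{isdn2_annual_cost(hours):.0f} -> {cheaper}")
</pre>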
<p>Because it was known that the first two Water Service offices to be connected were "heavy" GIS users (they had previously been using GIS in a standalone capacity), and there still appeared to be technical problems handling broadcast messages over ISDN, it was decided to adopt Kilostream for the first wide area connections.</p><h3>Experience to date</h3><p>Initial use indicates that the successful operation of the wide area links is more dependent upon operational management than technical factors. The two remote sites connected to date each comprise two locally networked GIS workstations currently used for data capture work. By its very nature, data capture work does not involve frequent extended panning across the map base; hence "hauling" of uncached data has not been a problem with a relatively large capacity cache which was pre-populated prior to installation.</p><p>Data transfer across the wide area connection performs rather like a motorway contraflow, inasmuch as if there is very little traffic on the motorway then, ignoring speed limits, traffic flow is virtually as quick as if there were no contraflow. However, as the volume of traffic increases, the actual throughput speed decreases almost exponentially.</p><p>1km<sup>2</sup> of inner city water data takes only slightly longer to display when retrieved over the wide area connection than when retrieved straight from cache. However, 1km<sup>2</sup> of inner city water data plus all Landline OS data takes significantly longer to display.</p><p>Grampian's two Water Service offices have been configured so that local users' alternative versions are stored back on the main server; consequently data is passed over the Kilostream every time a record is inserted or updated. Users have noticed a degradation of write transaction time when they both write simultaneously. The degree of degradation is acceptable, but does indicate that sites with a number of writing users may need either to store their alternative versions locally or to be provided with access to additional communication channels over the wide area link.</p><p>The conclusion to date is that the nature of GIS usage needs to be understood in order to specify and configure a wide area connection for optimum performance.</p><p>[ Figure 4 not available ]</p><h3>What Of The Future</h3><p>Grampian Regional Council believes that it has been able to implement wide area networked GIS at realistic cost using technology which is available today. It has been proven that a single channel Kilostream link operating at 64Kbps is adequate for the scale of the present implementation. Furthermore, this has been achieved with a great deal of "behind the scenes" activity which is totally transparent to the user.</p><p>The computer press makes great play of cheap high speed local and wide area ATM (Asynchronous Transfer Mode) networks being the way of the future (3); however, the technology is not yet available, and until it is, it is difficult to see how GIS data can be viably transferred between different systems in anything like real time.</p><p>In the longer term the Council is keen to reduce the cost of providing wide area connections to more marginal GIS users by using ISDN2 instead of Kilostream. It is also keen to exploit the potential for transfer of data between different organisations using ISDN. The cost of operating ISDN2 between locations over 35 miles apart is the same no matter whether they are 36 or 500 miles apart. Unlike "fixed" Kilostream links, ISDN connections can be made between any two locations which can dial to one another.</p><p>Persistent Cache has also been seen as a way of relieving congestion on heavily used local area networks. The Council is currently planning a six seat GIS sub-network in its headquarters which will use Persistent Cache to reduce the volume of GIS data over the building's main backbone LAN.</p><h3>Acknowledgements</h3><p>The authors wish to thank British Telecom, Gandalf Digital Communications Limited, Grampian Regional Council and Smallworld Systems Limited for their support and assistance in compiling this paper. Particular thanks go to Alistair Reid, Andrew Swanson and George Wallace of Grampian Regional Council for their part in installing the wide area connection components, and to Andrew Reid of Gandalf for his enthusiastic support, also to the staff of the Department of Water Services for acting as "test drivers".</p><h3>References</h3><p>1. SOUTHERTON A. Modern UNIX, Chapter 4, Wiley 1992.</p><p>2. NEWELL R.G. and BATTY P.M. GIS databases are different. Proceedings of the AGI 93 Conference, Part 3.</p><p>3. UNIX NEWS No 56, October 1993, ATM is the wave of the future, pp 63-65.</p><p>Copyright © 1996 Smallworld Systems, Inc.
All rights reserved</p></span>Alfred Sawatzkyhttp://www.blogger.com/profile/15845723068853240174noreply@blogger.com0tag:blogger.com,1999:blog-4451657389087868432.post-22184126957602405622011-08-29T07:58:00.001-06:002011-08-29T07:58:44.946-06:00Smallworld Technical Paper No. 12 - AM/FM Data Modelling For Utilities<span class="Apple-style-span" style="font-family: 'Times New Roman'; font-size: medium; "><p><em>by Peter M. Batty</em></p><h3>Abstract</h3><p>Data modelling for AM/FM is more complex than for traditional applications for a variety of reasons, in particular because of the need to model spatial and topological relationships. This paper examines the general differences between data modelling for AM/FM and other applications, and looks at a range of common modelling issues in utility applications, including the efficient handling of various types of tracing and network analysis, outage management, and generation and maintenance of schematics.</p><h3>Introduction</h3><p>There are some significant differences between data modelling for traditional applications and GIS or AM/FM, which arise from the need to model spatial and topological relationships. Utility applications have some additional complications compared to many GIS applications, due to the need to model complex networks, and these complications are not well handled by traditional GIS data models (the term GIS data model is used in two different contexts in this paper: one is the system data model used by the GIS vendor to model spatial and topological aspects of data stored in the system, and the other is the user-specific data model developed on top of the core system for a specific application).</p><p>This paper starts by discussing the general differences between data modelling for GIS and other applications, from a user application perspective. It goes on to look at a variety of modelling issues relevant to utility applications, primarily from a system data model point of view. In particular a number of non-traditional modelling techniques which offer benefits for utility applications are examined. A more detailed look is then taken at a fairly complex application which is common to most utilities, outage management, considering how the system data model features which have been discussed can be applied in practice, and what sort of trade-offs need to be considered in the design. Finally, the creation and maintenance of the user-specific data model is discussed, and in particular the use of CASE tools is examined.</p><h3>Why Is GIS Data Modelling Different?</h3><p>A traditional data modelling exercise goes through two major stages: the production of a logical model, and then a physical model. The logical model represents a model of the real world and is completely independent of how the system is physically implemented. The physical model represents the actual data structures (typically tables) which are used to implement the logical model. By far the most common way of representing a data model is using some form of entity-relationship diagram in which the objects (or entities) which need to be modelled are displayed as boxes, and the relationships between them are represented by lines.</p><h3>Dangers in Trying to Represent Spatial Relationships</h3><p>A data model for a traditional application explicitly records all relationships between entities which may be of interest to the application. 
This provides a big trap for the unwary person producing a GIS logical data model, because in a typical GIS application there are many relationships which will be derived implicitly from the spatial characteristics of objects. For example, in a utility application you might want to know which facilities are in a given work area, which are in a given tax area, which cables are underneath roads, and so on. In a traditional logical model this would result in defining relationships between work area and all facilities, between tax area and all facilities, between cable and road, and so on. It is possible to very quickly end up with a logical model in which almost everything is related to everything, like the following:</p><p>[ Figure not available ]</p><p>It should be fairly obvious that this is not a very helpful model, although the author has seen a number of logical models like this. The second variant of the not terribly useful logical model is the following, which again the author has seen in real situations:</p><p>[ Figure not available ]</p><p>In this case the designer has realised that the previous logical model is not useful, and has tried to show that objects are related by their location. While this model represents a little better what we are trying to model, it is still not very helpful. Firstly, objects are in the majority of cases not related because they have exactly the same location, but because their locations have some more complex relationship between them, such as one crosses the other, one is inside the other, or one is within 100 feet of the other. This could in theory be modelled by defining suitable relationships between locations, but in practice it is totally impractical to maintain explicit relationships between all locations. Secondly, since all objects are shown as having a relationship to location on this diagram, this again clutters things up (if non-spatial relationships were also shown on this diagram it would be very confusing). If we understand that any object may have a spatial component to it, and that through this it can be related to any other object with a spatial component, then it is generally clearer not to try to show this on our logical model.</p><p>Although the previous two models were somewhat simplified examples, as a general rule it makes sense to omit spatial relationships from a logical model. This is not an absolute rule though. In particular there are some relationships which could be derived spatially but which could also be regarded as more explicit. A common example is a hierarchical aggregation of areas. For example, in a utility there is usually a hierarchy of organizational areas, such as local office areas, which are aggregated into districts, which are aggregated into regions (use of the terms district, division and region, and their position in the hierarchy, seems to vary from one utility to another). Since a district is always the aggregation of a number of local office areas, this is quite a strong relationship and it may be useful to show this on the logical model (whether this relationship is modelled explicitly in the GIS is a separate implementation decision which will be discussed later).</p><h3>Representation of Topological Relationships</h3><p>Similar issues apply to the representation of topological relationships: which objects can be connected to which other objects?
As with spatial relationships, topological relationships can be quite extensive, especially in a utility application, and showing them explicitly may clutter up the data model diagram. However, there is a better case for explicitly showing topological relationships than spatial relationships. The main one is that in a utility application there are usually definite rules which are associated with topological relationships; for example, a cable can be connected to a transformer but not to a gas main. This is not generally the case with spatial relationships; for example, you would typically not prohibit a cable from crossing a gas main (there may be some such spatial constraints, but they are normally far fewer, and too complex to represent on a data model diagram).</p><p>It is very important to make sure that topological relationships (or connectivity rules) are documented in some form, but this does not necessarily need to be done by showing them on the data model diagram. There are various other options, such as creating a separate entity-relationship type diagram just showing topological relationships, using a matrix or table to show which pairs of objects can be connected, or just listing the valid connections for an object with the rest of the documentation about that object.</p><p>A diagrammatic or tabular representation of connectivity rules can generally only convey high level information about the rules. Rules can be quite complex, so it is often necessary to include some additional description in addition to a diagram. For example, a rule might be that it is only valid to connect two primary electrical conductors if the phases of one are a subset of the other - so a cable with phase BC could be connected to a cable with phase B, but not to a cable with phase AB. Another rule might be that it is not possible to connect two gas mains with different diameters unless they have a suitable fitting in between them.</p><p>Some people might debate whether defining these sorts of complex rules is part of the data model or part of the application. There is a strong argument for including them as part of the data model, since enforcing them is fundamental to the integrity of the database. Also, when using an object-oriented approach to design and development, which is widely recognized as being of benefit in complex applications like GIS, the behaviour of an object is an important part of its definition, and when taking this approach connectivity rules should definitely be regarded as part of the data model.</p><p>In summary, it is difficult to give a hard and fast rule for how to document connectivity rules. In many cases it may be appropriate to display the high level rules on the data model diagram (ignoring complex constraints), if this can be done without detracting from the clarity of the diagram. Whether this is practical depends on the number of topological relationships and the number of explicit relationships which need to be shown on the diagram. In any case, it is likely that additional documentation will be required for each object to explain more complex aspects of the rules.</p>
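<p>For instance, the phase rule above reduces to a simple subset test; a minimal sketch (a hypothetical helper for illustration, not a Smallworld routine):</p><pre>
def phases_compatible(a: set[str], b: set[str]) -> bool:
    """Two primary conductors may only be connected if the phases
    of one are a subset of the phases of the other."""
    return a <= b or b <= a

assert phases_compatible({"B", "C"}, {"B"})           # BC to B: valid
assert not phases_compatible({"B", "C"}, {"A", "B"})  # BC to AB: invalid
</pre>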
<h3>Constraints Imposed by the System Data Model</h3><p>We have already mentioned that we are really discussing two separate types of data model in this paper: the application-specific or user data model, which has been the main topic of discussion so far, and the system data model which is provided by the GIS vendor.</p><p>It is the system data model which handles fundamental issues such as how spatial data and topological relationships are modelled. When implementing a GIS, you typically cannot change the system data model. This puts constraints on the way that you do data modelling, which at times may hinder you but at other times (hopefully most of the time!) should help you. It helps you because the GIS vendor has implemented a model and functionality which handles the most complex fundamental aspects of the system. It also gives you a frame of reference in which to think about how to specify spatial and topological relationships, which as we have seen already can be difficult to do if you start with a blank sheet of paper. For example, most systems use some variant on the traditional GIS model in which an object can be a point, line or area. As you define the objects in your data model, you categorise each one as a point, line or area, which makes the modelling process much simpler than if you did not have this framework to work within.</p><p>However, while this simplification generally helps you, there is a danger that the system data model may not be sufficiently rich to handle the complexity of what you want to model. The experience of this author is that the traditional point-line-area model has a number of shortcomings, especially for modelling utility networks. The next section of this paper looks at various aspects of the system data model, and in particular looks at some examples of non-traditional modelling approaches which offer benefits for utility applications.</p><h3>System Data Model Issues</h3><p>This section considers various aspects of the system data model provided by the GIS vendor which are important in being able to model utility networks effectively.</p><h3>Sheet-based Versus Continuous Models</h3><p>At a fundamental level, it is very important that the database is seamless, so that an object like a cable is always stored as a single object regardless of how long it is, and does not have to be split into multiple objects because it crosses arbitrary map sheet or tile boundaries. This significantly simplifies application development and data maintenance.</p><p>Most systems now provide this capability at a basic level. However, it is important to consider how you can access these objects as a user or application developer. Because of the very large data volumes in typical GIS databases, most systems only allow you to work on a subset of the database at one time for analysis or update purposes. Being able to access the whole database at once without constraints offers significant advantages for many applications, such as network analysis and outage management.</p><p>Providing this seamless access to a database which may be tens or even hundreds of gigabytes in size, and obtaining good performance, is obviously not a simple task. There are two key issues in achieving this. The first is a good spatial indexing mechanism. The most common approach is to use some form of quadtree index. There is extensive literature on this topic - for example see Samet, 1990.</p>
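<p>To give a flavour of the idea, a toy point quadtree (insertion only; a production spatial index is far more sophisticated) might look like this:</p><pre>
class QuadTree:
    """Toy point quadtree: a square cell splits into four when full."""
    MAX_POINTS = 4

    def __init__(self, x, y, size):
        self.x, self.y, self.size = x, y, size  # lower-left corner + edge
        self.points = []
        self.children = None

    def insert(self, px, py):
        if self.children is not None:
            self._child_for(px, py).insert(px, py)
        elif len(self.points) < self.MAX_POINTS:
            self.points.append((px, py))
        else:
            self._split()
            self.insert(px, py)

    def _split(self):
        half = self.size / 2
        self.children = [QuadTree(self.x + dx, self.y + dy, half)
                         for dx in (0, half) for dy in (0, half)]
        for p in self.points:
            self._child_for(*p).insert(*p)
        self.points = []

    def _child_for(self, px, py):
        index = 2 * (px >= self.x + self.size / 2) + (py >= self.y + self.size / 2)
        return self.children[index]
</pre>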
<p>The second key issue is the underlying DBMS architecture. The server-oriented architecture used by standard commercial DBMSs like Oracle, Ingres and Sybase is fundamentally unsuitable for providing the required performance in a networked environment with many users. A client-oriented DBMS architecture can provide an order of magnitude better performance for this sort of application. This DBMS architecture is actually far more important in achieving good performance in multi-user networked environments than the spatial indexing mechanism used, but it has received far less attention in the literature. See Newell and Batty, 1994, for more details on this topic.</p><p>In summary, the important things to look for in a system data model in this area are that it is seamless, that it allows unconstrained access to the whole database for update or analysis, and that it delivers good performance in a production environment.</p><h3>Spatial Object Versus Spatial Attribute Model</h3><p>As we mentioned earlier, most system data models are based on some variant of the point-line-area model, in which each object belongs to one of these categories. Some systems have extended this model, for example by adding a "control point feature" which is like a point, but has two connection nodes. This is suitable for modelling certain electrical objects like transformers and simple switches. However, the basic approach is still that each object has a (single) spatial type. Each object will then have a number of alphanumeric attributes defined for it (size, material, equipment number, etc.).</p><p>A much more flexible model, especially for utility applications, can be obtained by looking at the spatial aspects of an object in a different way. Instead of insisting that an object has a single spatial type, we can allow an object to have multiple spatial attributes, each of which has a spatial type such as point, line or area. This simple step of moving spatial information from an object level to an attribute level gives lots of new modelling possibilities. We will look at some examples to illustrate this.</p><p>Depending on the application, you may wish to regard a road either as a line or as an area. If doing some kind of route analysis, you will be interested in tracing along its centreline. If looking at access to properties along the road, you will be interested in the right of way area associated with the road. With the traditional spatial object model you would need to model these two things as separate objects, and typically you would need to write some specific code to create and maintain the relationship between these two objects. With the spatial attribute model, you can simply give the road two spatial attributes: a centreline which is linear, and a right of way which is an area.</p><p>A very common requirement in utility applications is to be able to display an object at a location which is offset from the location where it really exists. For example, many electric utilities display transformers offset from the cable to which they are attached. This can be very simply handled by the spatial attribute model: one spatial attribute can be used to store the actual location, and another spatial attribute can store the location where its picture is to be displayed. This applies to many situations, for example where multiple conductors are running through the same duct and you may want to display each conductor offset by a different amount.</p>
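<p>The difference is simply where the geometry lives; a minimal sketch (illustrative Python classes, not a vendor data model):</p><pre>
from dataclasses import dataclass

Point = tuple[float, float]
Line = list[Point]
Area = list[Point]           # boundary ring

@dataclass
class Road:
    """One object, two spatial attributes with different spatial types."""
    name: str
    centreline: Line         # used for route analysis
    right_of_way: Area       # used for property access queries

@dataclass
class Transformer:
    equipment_number: str
    location: Point          # where it really is
    display_location: Point  # where its symbol is drawn, offset from the cable
</pre>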
<p>Another area in which the spatial attribute model greatly simplifies modelling is in the handling of multiple representations of the same object. It is particularly common in utility applications for the same object to appear on multiple different types of map, including various schematics. Again this can be handled simply by defining multiple spatial attributes on an object: one which represents the actual location, and additional ones which represent the position of the object in each type of schematic representation. There are additional modelling techniques which can be useful in handling schematics, such as the use of multiple worlds: we will return to this subject later.</p><h3>Basic Network Topology Modelling</h3><p>The way in which network topology is modelled is obviously of fundamental importance to utility applications. With the traditional model, a linear object typically has two nodes, one at each end. Connected objects are defined as those sharing a node, so other objects can only be connected to the end of a linear object. If a new service line needs to be connected in the middle of a cable, for example, the cable must be split into two separate cables to model the connectivity correctly. This can lead to having to split something which is really a single object into many different objects, typically replicating the attributes on every instance, which causes problems in terms of data storage, data maintenance and performance. The following diagram shows the sort of situation we are talking about:</p><p>[ Figure (drawing) not available ]</p><p>These issues can be overcome by using a two level linear network model. With this approach, every high level linear object - we call this a chain - is made up of one or more (continuous) low level linear objects - we call these links. In the above drawing, the main (secondary) cable geometry would be a single chain consisting of nine links, and each service cable geometry would be a chain with just a single link. The links define the connectivity, but the chains are the spatial attributes associated with an object - so we can just have a single secondary cable object with one set of attributes, rather than having to create nine secondary cable objects. This approach obviously allows points to be connected in the middle of a chain too; for example, a single primary conductor could have many transformer connection points along its length.</p>
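<p>In sketch form (illustrative Python classes, not the system data model), connectivity lives on the links while the single cable object owns the whole chain and a single set of attributes:</p><pre>
class Link:
    """Low-level linear element; its end nodes define connectivity."""
    def __init__(self, start_node, end_node):
        self.start_node, self.end_node = start_node, end_node

class Chain:
    """High-level linear geometry made up of continuous links."""
    def __init__(self, links):
        self.links = links

class SecondaryCable:
    """One object, one set of attributes, however many links its chain has."""
    def __init__(self, size, material, chain):
        self.size, self.material, self.chain = size, material, chain

# The drawing's main cable: a single object whose chain has nine links,
# so a service can connect at any intermediate node without splitting it.
nodes = list(range(10))
main = SecondaryCable("95mm", "Al",
                      Chain([Link(a, b) for a, b in zip(nodes, nodes[1:])]))
assert len(main.chain.links) == 9
</pre>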
<h3>Modelling Complex Network Topology</h3><p>Tracing through a linear network is a fundamental requirement for many utility applications. Common requirements for controlling a trace include stopping at specified objects, possibly qualified by attribute, for example stopping at all open valves or switches. Another important requirement, for electrical networks in particular, is being able to do directional tracing - upstream or downstream.</p><p>For simple networks, these requirements can be met by the traditional linear network model consisting of links and nodes, where a link runs between two nodes, and two links are connected if they share a common node. However, utility networks often include objects whose connectivity cannot be easily modelled using this simple model, in particular various types of switching or control facilities. For example, consider the following diagram of a transfer switch:</p><p>[ Figure (diagram) not available ]</p><p>This switch has three connections, one input and two outputs. The switch can be in one of two positions. In position 1, shown in the drawing, current from the input goes out on output 1, and in position 2 it goes out on output 2.</p><p>There are several things which can help us produce an elegant solution to the problem of modelling these complex objects. The first is the spatial attribute model: we could model our transfer switch with three point geometries (spatial attributes), to represent the input connection and the two output connections. The second is that we need some way of defining the behaviour of this object within a trace - the trace needs to know that if it reaches the input connection point, then if the position attribute of the transfer switch is equal to "Position 1" it should continue tracing from the output 1 connection point, and otherwise it should continue tracing from the output 2 connection point.</p><p>This is a situation where object-oriented programming is very useful. With conventional procedural programming, we would need to modify our tracing code each time we wanted to handle a special case object like this, which makes it impossible to create a general purpose trace routine, and causes support and maintenance problems. In an object-oriented system, we define the trace behaviour on each object, so that the general trace code does not need to be modified and new object behaviour can be introduced very easily. For example, our trace code could be set up to check whether any object it hit had a special method defined called trace_outputs, and if it did then it would call this method to get a list of nodes from which to continue the trace (a method is similar to a function in a procedural programming language - see Batty, 1993, for more information on object-oriented programming in GIS). The following is an example of what this method would look like for the transfer switch:</p><pre>
method transfer_switch.trace_outputs(trace_input)
    if trace_input = input_connection then
        if self.position = 1 then
            return {output_connection1}
        else
            return {output_connection2}
        endif
    elif trace_input = output_connection1 and self.position = 1 then
        return {input_connection}
    elif trace_input = output_connection2 and self.position = 2 then
        return {input_connection}
    else
        return {}
    endif
endmethod
</pre><p>This method can contain any kind of programming logic, so the behaviour can be extremely sophisticated if necessary. This gives us an elegant way of handling the transfer switch.</p><p>The transfer switch is a fairly simple example - we also need to model more complex devices such as the following switch cabinet (this particular example is an S&C model PMH9):</p><p>This is displayed as a single object on the map, but internally it contains four switches which can be operated independently. Each switch controls current on three phases (A, B and C). The left hand diagram shows all three phases combined, and the right hand diagram shows the phases separately. The two switches on the left hand side are group operated switches, which means that they are either open on all three phases or closed on all three phases, and the two switches on the right hand side are fuse switches, which can be independently open or closed on each phase.
The trace behaviour we want needs to recognise the positions of all these switches and derive the correct output for a given input appropriately.</p><p>The behaviour of the PMH9 switch cabinet is significantly more complex than that of the transfer switch, so we really need something more to help us model it. For this we will introduce another couple of new concepts: multiple worlds and hypernodes.</p><p>[ Figure (diagram) not available ]</p><p>A good approach to this problem is to model the internal structure of the switch cabinet as a separate set of GIS objects with their own attributes and topology - in this case we need to model busbars, fuse switches and group operated switches. We will lay these out in a schematic representation as in the diagram above (the simpler left hand representation is sufficient providing that we store separate attributes for the switch position on each phase, and that the tracing function used can stop based on complex predicates involving these attributes). An issue we need to resolve is where to place these objects - they provide more detail than we really want in our main geographic database. This is one area where the concept of multiple worlds is useful. A world is an independent coordinate system within the same database. Many different worlds can be created, and it is possible for a world to be related to an object. In this case we create a new world which we think of as being owned by the switch cabinet. We then create the objects representing the internals of the switch cabinet in its internals world. These can be created from a list of standard templates, or objects can be created and edited individually. These objects are connected using ordinary topological rules, and normal trace constraints will apply, like not going through open switches.</p><p>The one thing which is still missing is a link between the cables which are connected to the switch cabinet, which exist in the main GIS world, and the internal switches and busbars, which exist in a separate world belonging to the switch cabinet. This is where the hypernode comes in. A hypernode is an object which has two point geometry attributes, and special tracing behaviour defined on it, similar to that which we defined on the transfer switch. In this case the special behaviour is quite simple - it just says that if the trace hits a point belonging to a hypernode, then it should continue tracing from the other point belonging to the hypernode. In this way a hypernode can be used to make a trace jump ("through hyperspace" - hence the name!) from one point to another. The two points (or "ends") of a hypernode can be in different worlds. Hence we can add hypernodes which connect the cables coming into the switch cabinet to the appropriate connection points in the internal model. The nice thing about this approach is that we do not have to define any special trace behaviour on any objects, even though we are modelling some very complex behaviour. We simply specify that the switch cabinet is an object which has internals, and all the special behaviour we need is already defined on the hypernode, which is a standard system object. For a more detailed discussion of the use of multiple worlds, see Newell and Doe, 1994.</p>
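<p>The essentials of the hypernode jump, and of the generic trace which drives it, can be sketched as follows (toy Python structures for illustration, not the system implementation; the trace simply asks any object it hits that defines trace_outputs where to continue):</p><pre>
class Hypernode:
    """Joins two points, possibly in different worlds; a trace entering
    one end continues from the other."""
    def __init__(self, end_a, end_b):
        self.ends = (end_a, end_b)

    def trace_outputs(self, entry):
        a, b = self.ends
        return [b] if entry == a else [a]

def trace(start, connections, objects_at):
    """Generic network trace: follow ordinary connectivity, and let any
    object defining trace_outputs redirect the trace (e.g. hypernodes)."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(connections.get(node, []))   # ordinary topology
        for obj in objects_at.get(node, []):      # special behaviour
            if hasattr(obj, "trace_outputs"):
                stack.extend(obj.trace_outputs(node))
    return seen

# A hypernode carries the trace from the geographic world ("g1") into
# the cabinet's internals world ("w1"), where tracing continues as normal.
h = Hypernode("g1", "w1")
assert "w1" in trace("g0", {"g0": ["g1"]}, {"g1": [h]})
</pre>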
<h3>Schematics</h3><p>We have already mentioned that it is a common requirement in utility applications for an object to have multiple representations, appearing not only on one or more types of geographical map, but also potentially on various types of schematic diagram. Several of the modelling techniques which have already been discussed are very useful in handling schematics. Multiple spatial attributes can be used to store the location of the object in each schematic. The geometry for each different schematic can be stored in a different world, to provide a clean separation between each schematic and the geographic representation.</p><p>In some types of schematic there may not be a one to one correspondence between objects in the geographic world and objects in a schematic. For example, many cables shown in the geographic world may be combined into a single line section in a schematic. The spatial attribute model is not sufficient to handle this case: we need to be able to handle (explicit) relationships between objects. The ability to define and maintain explicit relationships of various types (such as one to many and many to many) is important for many aspects of data modelling. The system should be able to automatically enforce rules relating to relationships, like referential integrity, or more complex rules. For example, in the case of a schematic line section which is related to multiple cables, there must be a mechanism for ensuring that the schematic is updated appropriately if any of the cables are modified. These data modelling requirements can be met using a DBMS feature known as a trigger, which is discussed in the next section.</p><h3>Maintaining a Complex Model</h3><p>In a GIS there are often complex relationships which need to be maintained, and complex rules which need to be validated whenever certain objects are updated. It is difficult to consistently validate rules via specific code in the application, because we need to ensure that validation is always done, whichever mechanism is used to update the record. For example, whether the record is created or updated by a data translator, or via one of a number of interactive menus, we always want to make sure that the same validation is done. This can be implemented by the use of a DBMS which supports triggers. A trigger is a function (or method in object-oriented terms) which is invoked whenever a specified object or attribute is inserted, updated or deleted. Ideally, it should be possible to invoke the full range of GIS functions from within a trigger, and it should be possible to cause the current transaction to be rolled back if an invalid condition is found within a trigger.</p><p>Triggers can be used for a wide variety of functions. At a very simple level, for example, a trigger could be defined to create a given annotation at a standard offset from a certain type of point whenever the point was inserted or updated. A trigger could also be used to implement complex connectivity rules, for example checking that the phases of two connected cables are compatible, and returning an error condition which will roll back the current transaction if they are not. A trigger could also implement more complex functionality such as updating an associated schematic geometry when the geographical representation of a record is updated.</p>
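<p>In outline, the trigger mechanism amounts to the following sketch (a generic registry in Python for illustration; this is not Smallworld's or any particular DBMS's trigger API). An exception raised inside a trigger aborts, and therefore rolls back, the enclosing transaction:</p><pre>
TRIGGERS = {}   # (object_class, event) -> list of trigger functions

def trigger(object_class, event):
    """Register a function to run on insert/update/delete of a class."""
    def register(fn):
        TRIGGERS.setdefault((object_class, event), []).append(fn)
        return fn
    return register

class RollbackTransaction(Exception):
    """Raised by a trigger to abort the current transaction."""

@trigger("cable", "update")
def check_phases(record):
    # Connectivity rule: phases of connected cables must be compatible.
    for other in record.get("connected", []):
        a, b = set(record["phases"]), set(other["phases"])
        if not (a <= b or b <= a):
            raise RollbackTransaction(f"incompatible phases {a} / {b}")

def update(object_class, record):
    """Apply an update, firing any registered triggers first."""
    for fn in TRIGGERS.get((object_class, "update"), []):
        fn(record)    # may raise RollbackTransaction
    # ... perform the actual write here ...
</pre>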
<h3>An Application Example</h3><p>This section looks briefly at some design issues involved in a common utility application, outage management, as an example of the sort of trade-offs which need to be considered when designing a data model.</p><h3>Outage Management</h3><p>We will consider the design of an outage management system for a radial electricity network. The basic idea of this application is to record calls from customers whose power is out and from this information predict which device is most probably causing the outage. The customer calls may be entered into a separate system by the telephone operators, and then passed on to the GIS for outage analysis. In a hierarchical network, several customers will be fed from a single transformer. A number of transformers will typically be on a section of network which is isolated by a fuse switch (i.e. if that fuse switch is out, all customers served from all transformers downstream of that fuse switch will be out). Further up the hierarchy there will be other devices, such as reclosers, which are possible causes of an outage.</p><p>In order to predict which device or devices are the likely cause of an outage, we need to look at the pattern of calls. If we receive several calls from customers served by the same transformer, then we would predict that the transformer was the probable cause of the outage. However, if we had predicted several transformers beneath the same fuse switch, then we would change our prediction to say that it was most likely that the fuse switch was out rather than all of the individual transformers. Exactly how many devices need to be predicted before a device further upstream is predicted depends on the type of device upstream and a range of other factors. A detailed discussion of all the design requirements is beyond the scope of this paper. However, it is sufficient to know that for any predictable device on the network, we need to be able to efficiently identify the switchable devices immediately downstream of that device and the transformers directly fed from this device, and if the device is predicted as being out we need to be able to calculate the number of customers and the total load (kVA) downstream of that device.</p><p>To meet the requirement of being able to efficiently identify the immediately downstream predictable devices and transformers from a given device, we have several options. We could dynamically trace downstream each time we needed this information. This could potentially involve tracing downstream for some distance, which could be a performance issue. A second possibility is that we could construct a separate network, using a similar approach to that which was discussed for schematics, which contained only the devices we were interested in for outage management, connected by linear objects which could be formed from an aggregation of several cables in the detailed geographic network. We would still have to do a trace each time we needed the downstream devices, using the "outage network", but performance should be better as the network is simpler. However, creating a separate network has a data storage overhead, and we need to write some application code (probably a set of triggers) which creates and maintains the second network automatically. A third option is that we could maintain a set of explicit relationships which models the hierarchy of predictable devices. This would again have some storage overhead, and would need some triggers writing to maintain the hierarchy, but would probably give the best performance of any of the approaches.</p><p>So we have (at least) three possible approaches to this problem. The simplest one in terms of the data model and application development requirements is likely to be the least efficient in terms of performance. This is a case where prototyping is very useful to try the different approaches. This application was recently implemented by the author, and the first approach which was tried as a prototype was the first option described, which was most attractive because of its simplicity. The performance obtained with this approach was tested and found to be good, so it was decided that it was not worth prototyping the other options.</p>
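<p>To make the roll-up at the heart of the prediction concrete, the following is a deliberately crude sketch (illustration only, not the author's implementation; real prediction logic weights device types, call history and other factors):</p><pre>
# upstream[d] is the predictable device immediately upstream of d;
# a device is predicted out once "enough" of its children are predicted.
def predict(calls_by_transformer, upstream, child_threshold=2, call_threshold=2):
    predicted = {t for t, calls in calls_by_transformer.items()
                 if calls >= call_threshold}
    changed = True
    while changed:
        changed = False
        counts = {}
        for device in predicted:
            parent = upstream.get(device)
            if parent and parent not in predicted:
                counts[parent] = counts.get(parent, 0) + 1
        for parent, n in counts.items():
            if n >= child_threshold:     # enough children out: blame parent
                predicted.add(parent)
                changed = True
    return predicted

# Two transformers under the same fuse are predicted out, so the fuse
# (the most upstream predicted device) becomes the reported cause.
upstream = {"tx1": "fuse1", "tx2": "fuse1", "fuse1": "recloser1"}
out = predict({"tx1": 3, "tx2": 2, "tx3": 1}, upstream)
roots = {d for d in out if upstream.get(d) not in out}
assert roots == {"fuse1"}
</pre>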
<p>This sort of situation occurs quite frequently, where you have a choice of deriving complex relationships between objects on the fly, or of creating more explicit relationships which will give you better performance when querying the relationship, but which have overheads in terms of data storage, application development and performance of updates. On a case by case basis you need to evaluate the pros and cons of the different options, and this may often involve prototyping some of the options.</p><h3>Managing The Data Model</h3><p>This section briefly discusses technology which can help in the development and maintenance of a data model, in particular the use of CASE technology.</p><h3>Problems in Designing and Maintaining a Data Model</h3><p>One of the largest costs in most large GIS implementations is the cost of customising the system to meet an organisation's specific requirements. In turn, designing and maintaining the data model for the GIS is typically one of the most significant elements of this customisation. Data modelling is a complex task for most applications, but this is particularly true for GIS, as we have seen already. GIS projects typically have quite long life cycles, and the technology is relatively new to most users, which means that at the outset of the project they typically do not realise the full capabilities of the system. Both of these things contribute to the fact that requirements are very likely to change during the course of the project, and these changes often require the data model to change. Also, as new applications are developed and added to the system, there may well be requirements for further changes to the data model. Since GIS projects involve capturing large amounts of data, it is critical that these changes to the data model can be made without losing any data which is already stored in the system.</p><p>With most traditional GIS software, it has been very difficult to address these issues. Typically the design of the data model takes a long time and is difficult to subsequently change. This usually means that a very long period of time is spent at the beginning of the project doing requirements analysis and data model design to try to make sure that it is exactly right (which of course it never will be) before any other work begins, as it is so difficult to make changes subsequently. This section briefly discusses how a CASE tool can be used to address these issues, and how this in turn radically changes the way in which one can approach the problem of customising a GIS.</p><h3>What is a CASE Tool?</h3><p>The acronym CASE stands for Computer Aided Software Engineering, and it is used to describe a variety of computer-based tools which can be used to assist in the design and development of computer programs. Such tools have been developed for various purposes, including the analysis and documentation of procedures and data flows, and the design and documentation of a data model. It is the latter function which we consider here: the use of a tool which can be used to define a graphical representation of a data model, in the form of an entity-relationship diagram as discussed earlier. By clicking on individual objects it is possible to define more detailed information about them, such as what attributes they have, what the types of these attributes are, and so on. With some CASE tools, it is possible to automatically generate code which will create the data structures which have been designed.</p>
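<p>That generation step is essentially mechanical; as a toy illustration (not representative of any particular CASE tool, and ignoring spatial datatypes), a model description can be turned into relational DDL like this:</p><pre>
# Toy model description: entity -> {attribute: SQL type}
MODEL = {
    "transformer": {"equipment_number": "VARCHAR(20)", "kva": "INTEGER"},
    "cable":       {"size": "VARCHAR(10)", "material": "VARCHAR(10)"},
}

def generate_ddl(model):
    """Emit a CREATE TABLE statement for each entity in the model."""
    for entity, attrs in model.items():
        columns = ",\n  ".join(f"{name} {sqltype}"
                               for name, sqltype in attrs.items())
        yield f"CREATE TABLE {entity} (\n  {columns}\n);"

print("\n".join(generate_ddl(MODEL)))
</pre>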
<h3>Compatibility Between the CASE Tool and the DBMS</h3><p>For doing a high level conceptual design, you do not necessarily need a close correspondence between the CASE tool being used and the DBMS which will eventually be used to implement the actual system. However, if the CASE tool is to be used to do the physical database design and to actually generate code, then clearly a much closer link is required between the CASE tool and the DBMS and application development environment used to implement the actual system.</p><p>This requirement can really be split into two. The first is that the CASE tool supports all the data modelling concepts supported by the DBMS: all its datatypes, all the types of relationships it supports, and other concepts which we have discussed such as triggers. In particular, for GIS the CASE tool needs to understand the way in which spatial information and relationships are stored in the DBMS. The CASE tool may extend to covering aspects of the user interface of the application, such as which fields are visible to the user and what sort of interface is used to edit objects and their individual fields.</p><p>The second requirement is that the CASE tool must be able to generate code in an appropriate format which will implement the data model which has been designed. Providing that common data modelling concepts are supported, it is obviously technically possible for one CASE tool to output code in multiple formats suitable for different DBMSs.</p><p>These requirements are to a certain extent independent of each other. If the first one is met but not the second, then it is at least possible to use the CASE tool to do the physical design for the system, but the code to implement this then has to be written manually. On the other hand, it is possible to have a CASE tool which supports a subset of the concepts supported by the DBMS (with GIS, for example, a CASE tool which supports alphanumeric datatypes but not spatial datatypes), which could be made to output code in an appropriate format for the DBMS, but further development work would then be required in the DBMS environment (outside the CASE tool) to incorporate any of these additional concepts in the application. In this situation it is likely to be much more difficult to usefully maintain the data model using the CASE tool after its initial creation, since the CASE tool does not know about any of the changes which have been made to the data model within the DBMS environment. Clearly, the CASE tool is much more useful if it meets both of these requirements in full.</p><h3>Maintenance of the Data Model</h3><p>While the requirements in the previous section demand a very close link between the CASE tool and the DBMS, the biggest benefits the author has found from the use of a CASE tool come from taking this integration one stage further and providing the ability to update the data model of an existing DBMS which is already populated with data, without losing any existing data.
<p>This is particularly important in GIS projects, since, as we mentioned, they tend to last a long time, requirements are particularly prone to change during the course of the project, and typically the database will contain large amounts of data when these changes have to be made.</p><p>It is highly desirable to be able to make data model changes on a test version of the database so that they can be tested before applying them to the production version. Ideally one would like to be able to do this without replicating data in the master database.</p><h3>Use of an Incremental Development Methodology</h3><p>If suitable tools are available to allow the data model of a populated database to be easily changed, this can significantly change the approach which is taken to an application development project. Instead of taking the traditional approach of trying to completely design the data model at the beginning of the project, which typically takes a very long time, it is possible to start with a much simpler core data model and develop it over time in parallel with the development of application prototypes. This allows benefits to be delivered to users much more quickly. For a more detailed discussion of the use of CASE tools with GIS, see Kendrick and Batty, 1994.</p><h3>Conclusion</h3><p>This paper has covered a range of issues relating to GIS and AM/FM data modelling. We started by considering how best to represent spatial and topological relationships when designing an application data model, which is not specifically covered by any of the common approaches to data model design. We then considered how new features in the GIS system data model could help the user model certain things more easily. From this perspective it is important that GIS vendors continue to look at enhancing their system data models rather than just continuing to use the simple point-line-area spatial object model which is still the most commonly used approach. It is also important for users to consider the system data model of any system which they are evaluating. Finally, we discussed CASE technology which can simplify the creation and maintenance of a data model, and which in doing so can radically change the approach which is taken to a GIS development project, by allowing the data model to be developed incrementally rather than having to completely design it at the beginning of the project.</p><h3>References</h3><p>Batty, P.M., 1993: Object-Orientation - some objectivity please!: Proceedings of GIS 93 Conference, Birmingham, UK.</p><p>Kendrick, G., and Batty, P.M., 1994: Use of an Integrated CASE Tool for GIS Customisation: Proceedings of EGIS 94.</p><p>Newell, R.G., and Batty, P.M., 1994: GIS databases are different: Proceedings of AM/FM Conference XVII, pp 279-288.</p><p>Newell, R.G., and Doe, M., 1994: Discrete Geometry with Seamless Topology in a GIS.</p><p>Samet, H., 1990: The design and analysis of spatial data structures: Addison-Wesley, Reading, Massachusetts, 493 p.</p></span>Alfred Sawatzkyhttp://www.blogger.com/profile/15845723068853240174noreply@blogger.com0tag:blogger.com,1999:blog-4451657389087868432.post-61799193906778519982011-08-29T07:57:00.002-06:002011-08-29T07:58:14.086-06:00Smallworld Technical Paper No. 
11 - Use of an Integrated CASE Tool for GIS Customization<span class="Apple-style-span" style="font-family: 'Times New Roman'; font-size: medium; "><p><em>by Gillian Kendrick and Peter Batty</em></p><h3>Abstract</h3><p>The implementation of large corporate GIS systems places heavy demands on the customisability of GIS products. This paper examines the use of an integrated CASE tool in GIS customisation. The authors start by discussing some of the common problems which designers of GIS systems face in developing new applications. The paper describes the core features which a CASE tool should include in order to address these issues. The similarities between the technology which can be used to underpin both a CASE tool and a GIS will be mentioned.</p><p>The paper describes the practical use of a CASE tool in application development. It also discusses the management of multi-user application development and the benefits gained from utilising a CASE tool in the development process. It is shown that the use of an integrated tool facilitates the adoption of a new incremental design methodology.</p><h3>Introduction</h3><p>One of the greatest costs in large GIS implementations is that of customising the basic system to meet an organisation's specific requirements. There are many aspects to this customisation. The user interface of the system may be tuned to speed up data capture or to streamline the execution of particular queries. The way in which objects' spatial attributes are displayed must be specified. However, one of the most significant parts of any customisation involves the design and maintenance of the application's data model.</p><p>Data modelling is a complex task for most computer applications, but this is particularly true in GIS for several reasons. The first of these is the large number of classes of objects which are involved in these systems. Each implementation will typically involve hundreds of different object classes. The second factor is the number and variety of relationships between these objects. Relationships are of three main types: aggregation and association, e.g. a building is 'part of' a school; spatial, e.g. a house is 'near to' a lake; and topological, e.g. a valve is 'connected to' a pipe. Of these types, only the first may be explicitly represented in terms of traditional relational joins. The others will be derived indirectly from spatial attributes of the objects or else through the interaction between these spatial attributes.</p><p>For many organisations, the GIS will only form a part of the corporate computing system. There will be existing databases holding data such as customer information. Some of this data may be relevant to the GIS, and therefore part of the customisation will involve integrating these existing data models with that of the GIS (Bundock & Theriault 1992). This integration will again increase the size and complexity of the overall design.</p><p>Another aspect of GIS implementations, which has an impact on the maintenance of the data model, is that user requirements change. GIS projects typically have quite long life cycles, and the technology is relatively new to most users. This means that at the outset of the project they may not realise the full capabilities of the system. 
As new applications are developed and added to the system, changes are required in the data model.</p><p>Since GIS projects involve capturing large amounts of data, it is critical that these changes to the data model can be made without losing or compromising any data which is already stored in the system.</p><p>With most traditional GIS software, data model evolution is a difficult issue. This has meant that a long period of time has been spent at the beginning of the project on requirements analysis and data model design to try to make sure that it is exactly right, which of course is rarely achieved. This has had to happen before any other work could begin, meaning that there has been a very long delay for users between the purchasing of a GIS and the start of productive use of the customised system.</p><p>This paper describes how an integrated CASE tool can be used to address all of these data modelling issues. This in turn radically changes the way in which one can approach the problem of customising a GIS, using an interactive design methodology.</p><h3>What is a CASE Tool?</h3><p>The acronym CASE stands for Computer Aided Software Engineering. It is used to describe a variety of computer-based tools which can be used to assist in the design and development of computer programs. Such tools have been developed for various purposes, including the analysis and documentation of procedures and data flows, and the design and documentation of a data model. It is the latter function which we consider in this paper as this is the most appropriate kind of tool to help with GIS customisation.</p><p>A CASE tool which is to be used for data model design must display a graphical representation of the model. This is usually in the form of an entity-relationship diagram which shows the entities, or classes of objects, and the relationships between them. Such a facility gives the designer a good picture of a large design. The tool allows plots of the diagram to be made. These can be included as part of the documentation of the overall design.</p><p>The tool allows the designer to interact with the spatial representations of object classes and relationships. A graphical user interface (GUI) provides an environment in which the attributes and behaviour of the objects and relationships can be edited and queried. The tool produces automatic documentation on selected parts of the design.</p><p>One of the ways in which a CASE tool can reduce the time taken to implement a design is by automatically generating the code which creates the data model. This facility has two main benefits. Firstly, the designer no longer has to worry about implementation details; he can concentrate on the task of data modelling. Secondly, the automatic code generation speeds up the process of creating data models, and avoids programming errors.</p><p>The tool should also provide facilities which enforce the correctness of the data model. It should validate that the design is consistent. Checks made by the tool at the design stage will reduce the amount of time spent in tracing bugs by the designers at a later date. This will again speed up the delivery of finished applications.</p><p>This section has described general requirements for a CASE tool which is to be used for data model design. The next two sections describe in more detail some of the requirements for CASE tools which are to be used in GIS. The first section covers general requirements for tools for GIS data model design. 
The second considers the extra requirements for an integrated tool and covers some of the benefits provided by such a tool.</p><h3>CASE Tools for GIS</h3><p>Multi-User Design</p><p>It was mentioned in the introduction that one of the reasons for the complexity of GIS data models was that they involved a large number of classes of objects. It will usually be the case that more than one designer will be involved in the development of these objects, their relationships and their behaviour. This means that it is important that the tool used to develop the data model is multi-user.</p><p>Each designer should be able to work on additions or modifications to the data model without being affected by changes which others are making to different parts of the design. These changes will also be made over a long time scale - maybe hours, days or weeks. The designer may wish to try out several alternative versions of the data model in order to settle on a best design. When a new part of the model has been completed or re-engineered, the tool must offer support for 'merging' the new work in with the rest of the design.</p><p>The requirements of CASE tool database technology in this area, involving long transactions and version management, are very similar to those of GIS (Easterfield et al. 1990).</p><p>Modelling of Spatial Data and Topological Relationships</p><p>GIS data models are special because they require the user to model spatial attributes of objects. New data types such as point, area, chain, raster and grid are supported. The tool also models the topological relationships which will exist between some objects. For example, a valve location 'connects to' a pipe centreline, and a land parcel 'shares' its boundary with another land parcel.</p><p>These aspects of GIS data models should be supported in a CASE tool which is to be used in GIS customisation.</p><h3>Integrated CASE Tools for GIS</h3><p>For high level conceptual design, it is not necessary to have a close coupling between the CASE tool and the DBMS used to implement the system. However, if the CASE tool is to be used for the physical database design and to generate code, then a much closer link is required between the tool, the DBMS, and the application development environment used to implement the actual system. The requirements for the tool if it is to achieve this higher level of integration can be separated into two parts.</p><p>First, the CASE tool must support all of the data modelling concepts supported by the GIS. These include things such as the datatypes, relationships and other system specific concepts such as triggers, validators, and enumerators. For object oriented systems, the CASE tool should also manage the behaviour for the object classes (Booch 1991). In particular for GIS the CASE tool needs to manage spatial information and topological relationships. The CASE tool may extend to covering aspects of the user interface of the application, such as which fields are visible to the user and what sort of interface is used to edit objects and their individual fields.</p><p>Second, the CASE tool must generate code in an appropriate format which implements the data model which has been designed. In the case of the work reported here, the tool produces a script in the language Smallworld Magik.</p><p>These requirements are to a certain extent independent of each other.</p>
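<p>To make the second requirement concrete, the following toy generator (a Python sketch; SQL DDL is used as a neutral stand-in for the Magik script which the tool actually produces, and all names are invented) walks a model definition and emits schema-creation code:</p><pre>
# Toy code generator: walk the model, emit schema-creation code.
# SQL DDL stands in for the generated Magik script; illustration only.
TYPE_MAP = {"string": "TEXT", "integer": "INTEGER",
            "point": "BLOB", "chain": "BLOB"}  # spatial types need richer
                                               # support in a real target
def generate_ddl(class_name, attributes):
    cols = ",\n  ".join(f"{name} {TYPE_MAP[dtype]}"
                        for name, dtype in attributes)
    return f"CREATE TABLE {class_name} (\n  {cols}\n);"

print(generate_ddl("valve", [("status", "string"),
                             ("location", "point")]))
</pre>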
<p>If the first one is met, but not the second, then it is at least possible to use the CASE tool to do the logical design for the system, but the code to implement this then has to be written manually. On the other hand, it is possible to have a CASE tool which supports a subset of the concepts supported by the DBMS (with GIS, for example, a CASE tool which supports alphanumeric datatypes but not spatial datatypes). This could be made to output code in an appropriate format for the DBMS, but further development work would then be required in the DBMS environment (outside the CASE tool) to incorporate any of these additional concepts in the application. In this situation it is likely to be much more difficult to usefully maintain the data model using the CASE tool after its initial creation, since it does not know about any of the changes which have been made to the data model within the DBMS environment. Clearly, the CASE tool is much more useful if it meets both of these requirements in full.</p><p>A CASE tool which can generate the data model for a GIS application automatically is a very useful part of a development tool kit. There are, however, two more facilities which are necessary in order to make the tool truly useful for corporate GIS design.</p><p>Integration of Existing Data Models in the GIS Data Model</p><p>Something that was mentioned in the introduction to this paper is that a common requirement in GIS implementations is that existing (legacy) databases, and therefore data models, should be integrated into the system. It will often be the case that there are relationships between objects which have been created as part of the GIS and those which belong to these other databases.</p><p>As an example, consider a large utility company that has a customer information database. It is undesirable to move this data to the GIS as many other applications will already be designed to run on it. Part of the requirement of the GIS application is to associate spatial information with those customer records.</p><p>To facilitate this integration, the CASE tool can 'reverse engineer' the data models from these existing databases. The objects can then be integrated into the whole design within the domain of one tool.</p><p>Maintenance of the Data Model</p><p>GIS applications involve large amounts of data, and as mentioned before, they are also particularly prone to having the user requirements change during the life cycle of the project. If a change in the user requirements means that the data model must be modified, then it is essential that facilities are provided to 'forward engineer' the populated database to be consistent with the new data model.</p><p>A common feature of CASE tools is that they generate the code which creates a design. It is less common to generate the code which will 'forward engineer' an existing design when the data model definition held in the tool has been changed. 
A CASE tool for GIS should offer support to evolve a database from one schema to another.</p><p>Features of an Integrated CASE Tool for GIS</p><p>Based on the previous sections, we can summarise that an integrated CASE tool should have the following features:</p><ul><li>Support all of the data modelling concepts which are supported by the GIS DBMS</li><li>Generate the code which will implement the data model in the GIS DBMS</li><li>'Reverse engineer' existing data models so that their schemas can be integrated into the overall design for the system</li><li>'Forward engineer' populated databases when the design of the objects and relationships is updated.</li></ul><p>Such a tool provides significant benefits in the speed of GIS customisation and also in the maintenance of the applications throughout their life cycles.</p><h3>Similarities Between a GIS Application and a CASE Tool</h3><p>Several of the capabilities provided by a CASE tool are similar to those which are found in a GIS.</p><p>Firstly, in both systems, the user creates objects which combine both spatial and alpha-numeric attributes. In the GIS, these objects are things such as houses, pipes or rivers. The alpha-numeric attributes of a house are things such as its address and owner, and its spatial attributes are its position and footprint. In the CASE tool, the objects which the user works with are the definitions of the classes of objects which are to be used in the GIS application. For example, a CASE tool object has an alphanumeric attribute which holds the name of an object class; this could be 'House'. The spatial attributes of these objects are used to position them in the entity-relationship diagrams. Similarly, the CASE tool stores relationship objects, whose attributes would include the type of the relationship, e.g. 'part of' or 'connected to'.</p><p>In both the CASE tool and the GIS, users can query and edit the properties of the objects through a GUI. This interface provides facilities for plotting and reporting.</p><p>One of the areas in which a GIS and CASE tool are most similar is in their need to provide support for multi-user working in a long transaction database. This should not be surprising as both systems are used as design tools as well as information systems.</p><h3>The Smallworld CASE Tool</h3><p>After a consideration of the benefits of using an integrated CASE tool, and a survey of the tools currently available on the market, Smallworld chose to implement its own. Because of the similarities outlined above between GIS and CASE, it was possible to implement the tool as a specific GIS application with its own data model.</p><p>Working with the CASE Tool</p><p>The tool is activated during a normal GIS session. The designer can have both the GIS application and the CASE tool running at the same time.</p><p>[Figure not yet available]</p><p>Figure 1. Integration of GIS and CASE in one environment</p><p>Application Development with the CASE Tool</p><p>Smallworld GIS contains a powerful data dictionary which supports many advanced data modelling features. These include the definition of new data types, triggers and validators. The data dictionary is capable of storing the definition of object classes, their attributes, behaviour and also the details of the topological and associative relationships between them.</p>
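<p>The version management behaviour described next can be caricatured in a few lines (a Python sketch only, ignoring conflict detection; all names, including the designer Phil, follow the worked example below):</p><pre>
# Sketch: a data dictionary whose alternatives see their parent's
# definitions unless they override them (copy-on-write lookup).
class Alternative:
    def __init__(self, parent=None):
        self.parent = parent
        self.changes = {}                 # class name -> definition

    def define(self, name, definition):
        self.changes[name] = definition   # private until posted

    def lookup(self, name):
        if name in self.changes:
            return self.changes[name]
        return self.parent.lookup(name) if self.parent else None

    def post(self):
        """Merge this alternative's changes up into its parent
        (a real system would detect conflicts at this point)."""
        self.parent.changes.update(self.changes)
        self.changes.clear()

top = Alternative()
top.define("pipe", ["material"])
design1 = Alternative(parent=top)
design1.define("pipe", ["material", "diameter"])   # only Phil sees this
assert top.lookup("pipe") == ["material"]
design1.post()                                     # now everyone does
assert top.lookup("pipe") == ["material", "diameter"]
</pre>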
<p>The data dictionary can manage many different versions of a data model and allows the designer to look at these different versions within one GIS session.</p><p>The CASE tool operates directly on the GIS application's data dictionary. If the GIS application involves integration with external databases, the data models held in these databases are reverse engineered into the tool. Once present in the tool, the user can add spatial attributes to the objects, and can also define relationships between them and the objects which live in the main GIS database. The user can also add behaviour to these external objects.</p><p>The CASE tool supports a steady progression from a logical to a physical definition of the required data model. It does not demand that the data model is completely defined but will advise on areas which are not yet complete. Facilities are provided which allow the designers to document the model extensively as they are working.</p><p>Developing the Data Model</p><p>As already mentioned, the CASE tool is a multi-user facility. We will now describe how many users can work on the same model.</p><p>[Figure not yet available]</p><p>Figure 2. Version management of both schema and GIS data</p><p>In Figure 2, the boxes to the left of the dotted line represent the different alternatives in the CASE tool's database. Beneath the Top alternative there are two others where the designers Phil and Betty can work independently. The user Phil also has another two sub-alternatives where he can try out different designs. The area to the right of the dotted line shows the alternative tree in the main GIS application. Beneath the Top, there are two alternatives, Live Data and Test. The alternatives beneath the Live Data alternative are those where the GIS application is being used. The GIS applications may include data capture, analysis of existing data and design. The part of the tree under the 'Test' alternative is used by CASE designers to try out their developments. When Phil decides that he has completed a part of a design in the alternative 'Design 1', he can make the CASE tool 'apply' the new data model to the alternative 'Test 1'. At this point, he can immediately start testing the new data model in the GIS application. If, after testing, the new part of the design is found to be a success, the changes he has made can be 'posted' up to the Top of the CASE database. If the design was unsatisfactory, he can proceed to improve it and then 'apply' the new changes.</p><p>When changes to the design have been thoroughly tested, they can be applied to the Top alternative of the GIS application's database. That alternative will have its data model 'forward engineered'. Data model changes then spread through the alternative tree as users at lower alternatives request to see them.</p><p>[Figure not yet available]</p><p>Figure 3. Part of a GIS application's data model</p><p>The use of an integrated CASE tool in GIS allows designers to develop data models iteratively. There no longer needs to be such a great reliance on getting the design right first time. This in turn means that the end users can rapidly be presented with a prototype system which allows them to more easily understand what the system can offer. 
They are therefore able to offer much more feedback in the design of the system, leading to the development of systems which are much better suited to their requirements.</p><p>Data Model Re-use</p><p>Another feature of the CASE tool is that it facilitates the re-use of pieces of design. It can archive complete sections of data model in a format which can be read in by another tool. This has enabled the setting up of a library of designs. This greatly speeds up the development of new applications as they rarely have to be designed from scratch. An existing 'template' data model can be loaded into an application's CASE tool, and this can be thought of as the first prototype.</p><h3>Conclusions</h3><p>Since Smallworld commenced using an integrated CASE tool for application development, the following benefits from this approach have been found:</p><ul><li>Customisation of the system has been made more accessible. The tool has reduced the size of the knowledge barrier which designers have to overcome, as they no longer have to be trained in how to talk to the underlying DBMS directly. They can instead spend more of their time on design considerations. This means that more people are able to customise the system with less initial training.</li><li>The tool has enabled the creation of applications which better meet the customer's requirements through the interactive development approach. Users are given the chance to interact with a version of the application at an early stage. They can then refine their requirements, the developers can evolve the design and the tool will evolve the database.</li><li>Designs are more easily re-used. The archiving facility in the CASE tool allows parts of designs to be stored in a form where they can be loaded into another tool. This means that a library of commonly used parts of designs can be set up. When a new application is being developed, much of the initial data model can be created from these standard components. This has two main benefits: firstly, the development time for new applications is greatly reduced; secondly, the quality of developed applications is improved. Time invested in improving the quality of the basic data model components leads to improvements in the overall system quality.</li><li>Multi-user working is easier to manage. It is always a difficult problem when many designers are working on the same project to avoid duplication of effort, and conflicts between their different designs. The CASE tool helps with these problems by incorporating a number of validation and completeness checks which will prevent the design from becoming inconsistent. The users can develop parts of the data model independently in separate alternatives. When they incorporate these into the total design by 'merging' their changes, the DBMS automatically spots any conflicts in the design which the designer has introduced and allows him/her to reconcile them.</li><li>An incremental development methodology means that designers can get production systems in place much more quickly.</li></ul><h3>References</h3><p>Booch, G. (1991): Object Oriented Design with Applications, Benjamin/Cummings, Redwood City, California, 1991.</p><p>Bundock, M. & Theriault, D. (1992): Integration of CASE Technology into the GIS Environment, in AGI92 Conference Papers, Birmingham, November 1992.</p><p>Easterfield, M., Newell, R.G. & Theriault, D. 
(1990): Version Management in GIS - Applications and Techniques, EGIS 90 Conference Proceedings, Amsterdam, April 1990.</p></span>Alfred Sawatzkyhttp://www.blogger.com/profile/15845723068853240174noreply@blogger.com0tag:blogger.com,1999:blog-4451657389087868432.post-72626083383081837262011-08-29T07:57:00.001-06:002011-08-29T07:57:47.922-06:00Smallworld Technical Paper No. 10 - Discrete Geometry with Seamless Topology in a GIS<span class="Apple-style-span" style="font-family: 'Times New Roman'; font-size: medium; "><p><em>by R. G. Newell and M. Doe</em></p><h3>Abstract</h3><p>Modern GIS systems go to great lengths to model a seamless world containing a rich topological model. However, there is a requirement, particularly among utility companies such as electricity and telecommunications companies, to model their network in both geometric and, possibly, multiple schematic coordinate systems. The schematics may hold an alternative representation or a detail of part of the network. This paper describes an elegant method of modelling such situations so that topology is handled seamlessly and data common to the alternative representations is not replicated. The technique, known as 'multiple worlds', is a natural and straightforward extension of the spatial indexing techniques used to handle a seamless database efficiently. One of the benefits of the approach is that any network analysis application written to work in a single coordinate world will work unchanged in a multiple world situation. Applications include the integration of electricity substation schematics within a geographic database, facilities management, multi-sheet schematic applications such as power systems diagrams as well as P&ID diagrams in chemical plant applications.</p><h3>Introduction</h3><p>The work reported in this paper was carried out as a result of the requirement to model complex network applications, particularly in the electricity supply and telecommunications industries. The ideas presented have been developed within Smallworld GIS. Smallworld GIS is one of a number of systems whose origins lay more in database management foundations than in file and CAD foundations (Chance et al. 1990, Easterfield et al. 1990). In such systems, the underlying principle is that the GIS manages a single logical database that holds both spatial and aspatial data together. The principle extends to cases where the data is held in multiple existing (legacy) databases.</p><p>One of the benefits of this approach is that the spatial data can be modelled as it really is - a seamless carpet of topologically structured geometry. Such an architecture requires different solutions to those implemented in tile-based systems. In particular, spatial organisation is very important so that the system gives a good response time for spatial queries on very large seamless databases. There are many successful spatial indexing schemes (Abel 1983, Abel 1984, Batty 1990, Bundock 1987, Guttman 1984, Kriegel 1990, Libera 1986, Newell 1991, Samet 1989, Seeger 1988, Smith 1990). In this paper we briefly describe the method employed in Smallworld GIS. For a fuller explanation see Newell et al. 
1991.</p><p>Although much has been written about managing spatial data within one coordinate system, less has been written on managing data in multiple coordinate systems, including cases where there is a topological structure, but possibly no corresponding spatial coordinate data: for example, a complex junction where a large number of inputs are connected to corresponding outputs using a connection 'matrix'.</p><p>A common example of multiple coordinate systems in the electricity supply and telecommunications industries is schematic diagrams of parts of a network where there is too much detail to represent the structure in the 'geographic' coordinate system. Such a case arises with substation schematics. Figure 1 shows such an example, where a substation at Barnes Road has three separate representations, one geographic and two schematic, as well as a duct containing cables represented geographically and as a cross-section.</p><p>[Figure not yet available]</p><p>Figure 1: Geographic, schematic and cross-section representations</p><p>On the face of it, one might think that a tile-based architecture is more suited to handling such situations, since presumably such systems have ways of handling connections across tile boundaries. We go on to show in this paper an elegant approach that maintains connections across multiple coordinate systems without incorporating any other compromises into the system.</p><h3>Handling Topology</h3><p>Smallworld GIS models the world as a collection of objects that have relationships and attributes. Complex object relationships are modelled by one-to-one, one-to-many or many-to-many relations that are defined in a CASE tool. Object attributes are aspatial (represented as numbers or characters, for example) and spatial (represented as points, chains and areas). An aspatial attribute may be defined as a 'logical' attribute; such an attribute is not defined in the database, but is always derived by a method on the object when needed.</p><p>[Figure not yet available]</p><p>Figure 2: Smallworld Object Structure</p><p>Topological relationships are modelled by points, chains and areas sharing nodes, links or polygons. A chain consists of a linear sequence of links (xy or xyz coordinate strings) and an area consists of a contiguous set of polygons. The two-level structure allows for a much richer modelling of topology than a layer-based approach. Figure 2 shows an idealised representation of the structure underpinning the Smallworld topology model. One object may have multiple geometries and many objects may share one geometry.</p><h3>Spatial Organisation</h3><p>The Smallworld GIS Version Managed Datastore holds all of its spatial and aspatial data seamlessly; there are no tiles or layers. The basic storage structure is tabular, but unlike a conventional relational database, the access language is the object-oriented language, Smallworld Magik. Fast spatial retrievals are achieved by means of an improved quadtree indexing technique, which neither complicates nor compromises the logical data structure described above. The primary key of all spatial data contains an automatically generated 'spatial key'. The method both clusters data spatially on disk and provides an efficient index; Figure 3a illustrates how a simple quadtree index can sometimes produce an inefficient index. 
Figure 3b illustrates the overlapping quadtree index in which a small overlap produces additional digits in the key.</p><p>[Figure not yet available]</p><p>Figure 3a: Traditional Quadtree Index</p><p>[Figure not yet available]</p><p>Figure 3b: Smallworld Overlapping Quadtree Index</p><p>Because the spatial key is not necessarily unique for two objects in proximity, it is extended by an automatically generated unique identifier (see Figure 3c).</p><pre>
 +--------------+-----------+
 | Quadtree key | Unique id |
 +--------------+-----------+
</pre><p>Figure 3c: Unique spatial key</p><h3>Multiple Worlds</h3><p>Although it is essential in many cases to model a seamless real world with a seamless geographic database, there are also cases where some information needs to be handled in a different coordinate system - a different world. Examples include applications where part of the model is represented as a drawing schematic, either as an alternative representation or as an expanded detail of part of the geographic world. An example of the latter is a schematic that represents the internal structure of an electricity substation. In such cases, although the geometry may be spread across different coordinate systems, it is important to handle the topology seamlessly, so that applications such as network analysis can work across multiple worlds unhindered by modelling artifacts.</p><h3>Implementation</h3><p>Smallworld GIS handles multiple worlds as an extension to the quadtree indexing scheme described above. The spatial index can be extended to include extra high order bits. In this way topological tracing operations work in an identical manner regardless of whether the geometry is in a single coordinate system or multiple coordinate systems. Figure 4 shows how the spatial key is extended to handle multiple worlds:</p><pre>
 +----------+--------------+-----------+
 | World Id | Quadtree Key | Unique id |
 +----------+--------------+-----------+
</pre><p>Figure 4: Structure of the Smallworld spatial index</p><p>A further extension of this idea allows the data to be partitioned according to various non-spatial criteria. In Smallworld GIS, this technique is used both to control drawing priority and to implement the idea of multiple worlds. A number of high order bits are reserved to store a world ID. The next block of bits is reserved to store drawing priority (see Figure 5).</p><p>The clustering properties of the spatially indexed database will ensure that all geometry in a particular world will be clustered on disc and it is therefore very efficient to scan for geometry within one world.</p><p>The spatial scanners, used by the GIS in hit-searching and in screen refresh, can be constrained to find data which has a particular world ID. Hence, viewing and interaction with the spatial database is made world-specific.</p><p>Types of World</p><p>There may be many types of world in a GIS. Each world type typically has a different meaning for the geometry it contains.</p><p>Geographic worlds represent the location of objects in the real world. This geometry may or may not have associated topology.</p><p>Schematic worlds have geometry which represents the existence of objects and possibly their topology, but does not carry any real-world positional information.</p><p>Cross Section worlds are essentially geographic, i.e. 
geometry represents real world locations of objects, but they use a coordinate system in which the x axis represents distance along some arbitrary direction and the y axis represents vertical position.</p><p>Worlds and Universes</p><p>There is a fixed number of bits available to be shared between the spatial index, drawing priority and world ID.</p><p>The number of bits used for the spatial index (together with the overall extents of the scanned area) determines the size of the smallest quad-tree box and hence the resolution of the quad-tree. This, in turn, affects the efficiency of hit-searching and graphics re-draws. In a world which is densely populated and which has many geographically small objects, a relatively fine resolution would be required. Conversely, a world with very little geometry, or with very sparse geometry, can be scanned with coarser resolution.</p><p>The number of bits used for the world ID and for drawing priority determines the total number of worlds and priorities that can exist in the database. Hence, there are competing requirements which have to be juggled to achieve a compromise allocation of bits.</p><p>The topmost few bits of the spatial index store a 'universe id', and each 'universe' uses a different allocation of the remaining bits. This allows each universe to contain worlds with certain characteristics. A geographic universe may contain just one world and hence uses its remaining bits to obtain maximum spatial resolution. A substation internals universe or cross-section universe would need to contain a large number of worlds, but each world is small and has little geometry in it. We can therefore use a large number of bits to store the world ID, leaving relatively few for the spatial index without performance loss. A universe for network schematics lies somewhere between the previous two examples: there may be a relatively small number of schematic diagrams and the diagrams will be relatively sparsely populated compared with the geographic world. Figure 5 shows the full structure of the spatial key.</p><pre>
 +-------------+----------+------------------+--------------+-----------+
 | Universe Id | World Id | Drawing priority | Quadtree Key | Unique id |
 +-------------+----------+------------------+--------------+-----------+
</pre><p>Figure 5: Complete structure of spatial key</p><p>Alternative Representations of Objects and Networks</p><p>An object may be represented in several worlds and it may take a different form in each. For example, an electricity substation is represented in the geographic network by an area which is its boundary. In a schematic diagram, however, the substation is represented as a collection of transformers, busbars and switches.</p><p>A cable or pipe may be represented by its centre line (chain) geometry in the geographic world, but by a point in the world owned by a cross section object. The cable may appear in many cross sections.</p><p>A network may also have more than one representation. An electricity network, for example, may be represented in the geographic world by a connected network of cables, overhead lines and joints, and in a schematic world as a connected network of load sections. Both collections of geometry describe the same network.</p><p>Connections Between Worlds</p><p>Topological connections from one world to another may be made with a special object called a hypernode. This object has two point geometries, one in each world. This allows network traces to jump from world to world.</p>
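<p>As a rough illustration of how the complete spatial key of Figure 5 might be packed into a single integer (a Python sketch only, using the simple quadtree of Figure 3a rather than the overlapping variant; every field width below is invented and does not reflect the actual Smallworld allocation):</p><pre>
# Sketch: pack universe id, world id, drawing priority, quadtree key
# and unique id into one integer key.  Field widths are invented.
UNIVERSE_BITS, WORLD_BITS, PRIORITY_BITS = 4, 12, 4
QUAD_BITS, UID_BITS = 24, 20

def quad_key(x, y, levels=12):
    """Simple (non-overlapping) quadtree key: interleave one x bit and
    one y bit per level, for coordinates normalised to [0, 2**levels)."""
    k = 0
    for level in reversed(range(levels)):
        k = (k << 2) | (((x >> level) & 1) << 1) | ((y >> level) & 1)
    return k

def make_spatial_key(universe, world, priority, quad, uid):
    key = universe
    key = (key << WORLD_BITS)    | world
    key = (key << PRIORITY_BITS) | priority
    key = (key << QUAD_BITS)     | quad
    key = (key << UID_BITS)      | uid
    return key

# All geometry in one world shares the same high-order bits, so a
# range scan over the key retrieves that world's data contiguously.
k = make_spatial_key(universe=1, world=7, priority=2,
                     quad=quad_key(1027, 2049), uid=42)
print(hex(k))
</pre>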
<p>For example, a network trace in an electricity network may arrive at a substation, jump into its internals world, follow the geometry-related topology inside it, and then jump back out into the network again.</p><p>Hypernodes are a simple example of a multi-pin device and can be given special behaviour to change the trace state, or to prevent certain types of trace from jumping.</p><p>Complex Objects</p><p>An object may have a very simple representation in the network but may have considerable internal topological complexity which is hidden from the outside world. This internal topology may be implemented in a number of ways:</p><p>Objects with Internals Worlds</p><p>Each object owns its own internals world. The internal topology of the object is associated with geometry in that world.</p><p>For example, an electricity substation has components: busbars, transformers and switches, which have their geometry in the owning substation's internals world. These components form an internal topological network for the substation.</p><p>Multi-pin Devices</p><p>These are objects with several connection pins (point geometries). The number of pins may be fixed, as in a simple switch or junction box, or dynamically variable - pins may be added or removed as required by the user.</p><p>External connections to the network are made to these pins, but the internal topology of the object is encapsulated as behaviour on the object itself. This topology may be implemented as a hard-coded set of rules, which may refer to attributes of the object itself or to other conditions, or as a connectivity table which may be updated by the user.</p><p>For example, the following pseudo-code describes a method on a simple switch object which responds to the message 'connectivity_trace' from the network tracer.</p><pre>
 method simple_switch.connectivity_trace(pin)
     if self.closed then
         if pin = input_pin then
             return output_pin
         else
             return input_pin
         endif
     else
         return unset
     endif
 endmethod
</pre><p>A multi-input/output junction box may have an associated connectivity table. When asked for output pins connected to a particular input pin, the junction box will perform a lookup of this table and return a number of pins. Records may be added to or removed from this table as connections are made or broken in the junction box.</p><p>Objects with Trace Behaviour</p><p>These are objects with simple, geometry-related topology but which have behaviour which modifies that simple topology in some way.</p><p>For example, a road_junction may have a single node in the road network, which may be connected to several roads. However, when asked for exit roads given a particular approach road, the junction object overrides the simple topology (in which all roads which join that junction are connected) and may take into account such things as one-way streets or restricted access. The road junction may also assign various costs to these transitions; turning right (in the UK) may be a more costly move than going straight on or turning left.</p><p>As with the multi-pin device, this internal topological complexity may be defined algorithmically or by table lookup, or some combination of the two. There may be a dependence on various internal and external parameters such as the day of the week, or the angle between approach road and exit road at a roundabout.</p><p>Encapsulating topology inside objects in this way allows us to simplify the geometry of topologically complex networks.</p>
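<p>A minimal sketch of the connectivity-table idea just described, in the same spirit as the switch pseudo-code above (Python for illustration; the pin names and methods are invented):</p><pre>
# Sketch: a junction box whose internal topology is a lookup table
# rather than geometry.  Names are hypothetical.
class JunctionBox:
    def __init__(self):
        self.table = {}                   # input pin -> set of output pins

    def connect(self, in_pin, out_pin):
        self.table.setdefault(in_pin, set()).add(out_pin)

    def disconnect(self, in_pin, out_pin):
        self.table.get(in_pin, set()).discard(out_pin)

    def connectivity_trace(self, pin):
        """Output pins reachable from this input; empty if none."""
        return self.table.get(pin, set())

jb = JunctionBox()
jb.connect("in_1", "out_3")
jb.connect("in_1", "out_4")
assert jb.connectivity_trace("in_1") == {"out_3", "out_4"}
assert jb.connectivity_trace("in_2") == set()
</pre>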
<p>In a fibre-optic cable network, for instance, each cable may contain 256 fibres, and at each junction the connections between individual fibres must be modelled. However, the cable can be modelled by a single chain and when network traces are carried out, the current fibre number is carried as an attribute of the trace. The junction objects can use this attribute to determine connectivity and will modify it when passing back output cables. Hence, the network tracer will ask the junction "I have arrived along this cable on fibre number 125. Give me connected output cables and corresponding fibre numbers".</p><h3>Conclusion</h3><p>The approach described in this paper to handling topologically structured data in multiple coordinate systems is a simple and natural extension to the spatial indexing techniques employed to organise a single large seamless geographic database. The method is elegant in that it adds no complexity to the topological capabilities of the system implemented for a single coordinate world. Even though geometry is stored seamlessly, the spatial key organises the data physically so that data belonging to each coordinate system is stored contiguously in different parts of the base datastore tables. The approach has been highly effective for implementing complex models for a range of network applications.</p><h3>References</h3><p>Abel, D.J. & Smith, J.L. (1983): A Data Structure and Query Algorithm Based on a Linear Key for a Rectangle Retrieval Problem, Computer Vision, Graphics, and Image Processing 24,1, October 1983.</p><p>Abel, D.J. & Smith, J.L. (1984): A Data Structure and Query Algorithm for a Database of Areal Entities, The Australian Computer Journal, Vol 16, No 4.</p><p>Batty, P. (1990): Exploiting Relational Database Technology in GIS, Mapping Awareness magazine, Volume 4 No 6, July/August 1990.</p><p>Bundock, M. (1987): An Integrated DBMS Approach to Geographical Information Systems, Autocarto 8 Conference Proceedings, Baltimore, March/April 1987.</p><p>Chance, A., Newell, R.G. & Theriault, D.G. (1990): An Object-Oriented GIS - Issues and Solutions, EGIS '90 Conference Proceedings, Amsterdam, April 1990.</p><p>Easterfield, M.E., Newell, R.G. & Theriault, D.G. (1990): Version Management in GIS - Applications and Techniques, EGIS '90 Conference Proceedings, Amsterdam, April 1990.</p><p>Guttman, A. (1984): R-trees: A Dynamic Index Structure for Spatial Searching, Proceedings of ACM SIGMOD Conference on Management of Data, Boston, June 1984.</p><p>Kriegel, H., Schiwietz, M., Schneider, R. & Seeger, B. (1990): Performance Comparison of Point and Spatial Access Methods, in Design and Implementation of Large Spatial Databases: Proceedings of the First Symposium SSD '89, Santa Barbara, July 1989.</p><p>Libera, F.D. & Gosen, F. (1986): Using B-trees to Solve Geographic Range Queries, The Computer Journal, Vol 29, No 2.</p><p>Newell, R.G., Easterfield, M. & Theriault, D.G. (1991): Integration of Spatial Objects in a GIS, Proceedings of Auto-carto 10, Baltimore, March 1991, Volume 7, pp 408-415.</p><p>Samet, H. (1989): The Design and Analysis of Spatial Data Structures, Addison Wesley, 1990, ISBN 0-201-50255-0.</p><p>Seeger, B. & Kriegel, H. (1988): Techniques for Design and Implementation of Efficient Spatial Access Methods, Proceedings of the 14th VLDB Conference, Los Angeles, California, 1988.</p><p>Smith, T.R. & Gao, P. 
(1990): Experimental Performance Evaluations on Spatial Access Methods, Proceedings of the 4th Spatial Data Handling Symposium, Zurich, 1990, Vol 2, p991.</p></span>Alfred Sawatzkyhttp://www.blogger.com/profile/15845723068853240174noreply@blogger.com0tag:blogger.com,1999:blog-4451657389087868432.post-49534163631828042302011-08-29T07:56:00.002-06:002011-08-29T07:57:20.959-06:00Smallworld Technical Paper No. 9 - The Why and the How of the Long Transaction<span class="Apple-style-span" style="font-family: 'Times New Roman'; font-size: medium; "><p><em>by Richard G. Newell</em></p><h3>Abstract</h3><p>The recent literature on GIS technology has seen the emergence of a new set of terminology including "long transaction", "version management", "check-out", "check-in", "seamless mapbase" and so on. As is common with new terminology, few people understand what these things mean, why they are important and what they are for. This paper attempts to explain what a long transaction is, why it is necessary and how it is managed in a GIS.</p><h3>Introduction</h3><p>Any user of a GIS, mapping system, CAD system, word processor or indeed any system which involves updating data over a significant period of time is in fact engaged in a long transaction. Contrast this with the user of a commercial DBMS application such as banking or airline reservation. In such an application, a user may prepare an input screen over a period of a few seconds which then updates the system resulting in a transaction which lasts a small fraction of a second.</p><p>What is the difference between these two uses of a computer system? Why is it that the short transaction mechanisms implemented in all of today's commercial DBMSs do not satisfy the requirements of design-type systems?</p><p>Long transaction applications in a GIS involve the modification of data that is relatively static. These include:</p><ul type="disc"><li>Data conversion</li><li>Map-base and asset management</li><li>Analysis which produces large amounts of intermediate results</li><li>Design studies with multiple alternative designs</li></ul><p>Short transaction examples occur in the everyday operational use of a system. These might include:</p><ul type="disc"><li>Vehicle tracking</li><li>Customer service bureau</li><li>Fault logging</li><li>Emergency planning</li></ul><p>Drawing office managers, using paper drawings, are forced into implementing a long transaction mechanism, known as drawings management. It is impractical to allow two draftsmen to have a drawing out for update at the same time. There is only one master copy of every drawing and only one draftsman at a time can modify it.</p><p>Any other user can have read-only access to a copy of a drawing, in which case the information he has may be out of date, and this is usually deemed acceptable. In fact, it is not uncommon in organisations such as local authorities for there to be multiple copies of the same set of maps, all independently updated and maintained. Although this is deemed to be not so acceptable, these organisations put up with it because there is no alternative where the maps are on paper.</p><h3>Sheet-Based or Tiled Systems</h3><p>The early digital mapping systems were based on CAD systems where the mapbase was held as a collection of CAD drawings. The methods of managing such a system where multiple users wish to access and update the mapbase are based on the manual drawing office approach.</p>
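<p>The essence of that discipline can be sketched in a few lines (Python for illustration; a real drawing register records far more than this):</p><pre>
# Sketch: a drawing register enforcing one-writer-at-a-time check-out.
class DrawingRegister:
    def __init__(self):
        self.checked_out = {}               # drawing id -> user holding it

    def check_out(self, drawing, user):
        holder = self.checked_out.get(drawing)
        if holder is not None:
            raise RuntimeError(f"{drawing} is out for update by {holder}")
        self.checked_out[drawing] = user
        return f"master copy of {drawing}"  # only the holder may modify it

    def check_in(self, drawing, user):
        if self.checked_out.get(drawing) == user:
            del self.checked_out[drawing]

register = DrawingRegister()
register.check_out("sheet_17", "alice")
# register.check_out("sheet_17", "bob") would raise: one master copy only.
# Anyone may still take a read-only copy, which may be out of date.
</pre>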
<p>Indeed, there is a market for Document Management systems in which an intelligent drawing register is held in a database alongside the drawings to record the status of each drawing in the registry. The advantage of these sheet-based digital mapping systems is their simplicity, but their disadvantage is that handling any object which lies across a tile boundary becomes exceedingly cumbersome.</p><p>However, as digital mapping systems tried to evolve into truly seamless databases, it was found that the information that needed to be held to store the relationships between parts of objects on adjacent map-sheets was not nearly so easy to manage as the strictly partitioned map-base that one found in digital mapping systems. Indeed, it was found that so much code was required in map-management systems that more modern systems took a radically different approach which abandoned the concepts of sheets and tiles to implement a truly seamless database.</p><p>This appeared attractive because it allowed implementors to move from a file-based system to an implementation based on a database. It was now feasible to hold all of the map data in a commercial relational database management system (RDBMS). Much of the functionality that is provided by these systems is required to handle the large data volumes involved in a GIS.</p><a name="rdbms"><h3>Commercial Relational Database Management Systems</h3></a><p>The commercial relational database vendors have invested man-centuries of effort in producing very robust systems which ensure the safety and integrity of data at all times. They also include rich facilities for designing and building data models, an aspect which everybody now realizes is the most important point of departure in building a GIS. However, vendors who build their systems on such engines have to overcome three things which are not provided by the database vendors:</p><ul type="disc"><li>Spatial modelling and queries</li><li>Performance of spatial queries</li><li>Long transaction handling</li></ul><p>On the first two of these, the database vendors and the standards organisations are beginning to make progress in addressing them. Indeed, provided the RDBMS provides facilities to control data clustering, it is not too hard to obtain adequate spatial performance. However, on the issue of long transactions, there is little sign yet that the vendors are doing anything. This may be because the GIS market is considered to be small compared to the total DBMS market, and it does not get the attention that it deserves. Also, it requires fundamental changes to existing approaches.</p><p>Commercial DBMSs are designed to handle short transactions and to maximize transaction throughput. In theory, one could use the short transaction mechanism to handle multiple users of a GIS, but it is not effective, for the following reasons.</p><p>Commercial DBMS vendors adopt one of two approaches to transaction locking, known as the pessimistic approach and the optimistic approach. In the pessimistic approach the system requests locks on all records that are to be updated before commencing a transaction. Thus all other users are locked out of accessing these records. When the transaction is finally closed the locks are released.</p>
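<p>Sketched crudely (in Python; this is not how any particular DBMS implements it), the pessimistic approach amounts to the following:</p><pre>
# Sketch: pessimistic locking - acquire every lock up front,
# release them all when the transaction closes.
class LockManager:
    def __init__(self):
        self.locks = {}                    # record id -> transaction id

    def begin(self, txn, record_ids):
        if any(self.locks.get(r) not in (None, txn) for r in record_ids):
            raise RuntimeError("records locked by another transaction")
        for r in record_ids:
            self.locks[r] = txn            # others are now locked out

    def close(self, txn):
        self.locks = {r: t for r, t in self.locks.items() if t != txn}

lm = LockManager()
lm.begin("txn_a", ["record_1", "record_2"])
# lm.begin("txn_b", ["record_2"]) raises until txn_a closes.
lm.close("txn_a")
</pre>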
<p>The problem with this is that if one imagines many users holding locks for a long period of time, the system becomes totally unusable because other users are denied access to the locked records.</p><p>In the optimistic approach, each user carries on updating records within the privacy of his own transaction and if, at the time the transaction is closed, a conflict is detected, he may well lose all of the work that he has just completed. For a small amount of work this is deemed to be acceptable, but if somebody has been working for days or weeks it certainly is not.</p><p>In either method, should there be a system failure while a transaction is open, then all work is lost. Thus these methods can handle transactions which are open for a few seconds or minutes, but certainly not those which last for hours or days.</p><p>So in order to get round this problem, GIS vendors who base their systems on commercial RDBMS avoid using the short transaction mechanism completely and instead implement a long transaction mechanism of their own called "check-out".</p><a name="check_in_out"><h3>Check-out and Check-in</h3></a><p>In a system which employs check-out, the user who wishes to update the database requests that the part of the database he wishes to work on be copied into a single user database. Whether the single user database is proprietary or a commercial RDBMS does not matter as it only handles temporary data.</p><p>One advantage that stems from doing this is that the checked out database can be held on the local disk of a workstation, and so the user puts no load at all on the database server or the network while he is working. The only work that the server has to deal with is checking out data and then later checking in the updates.</p><p>However, there are a number of disadvantages of using check-out:</p><ul type="disc"><li>Check-out may take a long time</li><li>The user is restricted to a subset of the database</li><li>The handling of alternatives is cumbersome</li><li>It is difficult to maintain relationships between that data which is checked out and that which is not.</li></ul><p>From the vendor's point of view, the biggest disadvantage of check-out is that to make it work effectively requires an enormous amount of development effort, and so the gains made by relying on the R&D resources of the DBMS vendors are lost in overcoming the limitations of the short transaction.</p><p>A further problem is that the same two locking approaches used for short transactions also apply to check-out. Either one locks the data intended for update or one employs a system of conflict resolution on check-in. In the former case, the system suffers the same problems as in pessimistic locking. At least in the latter case one can salvage most of the work in the event that a conflict is detected. However, conflict resolution is not a trivial matter.</p><p>Given that there are still problems of using check-out to overcome the long transaction problem, one has to ask what the alternatives are. 
Either one has to wait for the commercial DBMS vendors to provide long transaction support (indeed, some of the early implementations of object-oriented database management systems claim to address this issue), or the vendor has to implement his own long transaction mechanism.</p><p>The lack of long transaction support is the most serious shortcoming of today's commercial RDBMSs for the support of GIS.</p><p>One of the most powerful approaches to handling long transactions is to implement a mechanism for version management deep in the database engine itself.</p><h3>Version Management</h3><p>A version managed database is capable of holding any number of versions of the whole database without replicating data that is common between versions. Thus all users can see the whole database at all times, subject to any changes made within the privacy of their own versions.</p><p>A long transaction commences with the creation of a new version from an existing version. At the start, the new version will look identical in all respects to the parent version from which it was created. However, as the user proceeds in modifying the database, the database stores the effects of his changes, but no other user operating in a different version can see these changes. The user of course works within a sequence of short transactions, each of which can be committed at any time. Thus the database can store persistently the results of a long transaction at all stages in its evolution. Intermediate commit stages may sometimes be given a name, in which case they are known as checkpoints.</p><p>The operation of closing a long transaction is achieved by merging any changes that have been made by other users to the parent version, followed by posting the combined changes back up to the parent. The step of merging the parent's changes is where conflicts may be detected and dealt with.</p><p>As in the case of check-out, version management also minimizes the load on the database server by maximizing the utilization of the workstation. This allows good performance of many workstations on one server. This contrasts with the situation in most commercial DBMSs in which query processing for all users is carried out by the database server, thus giving it an enormous workload in a large system.</p><p>Version management has many advantages over both map-management and check-out in handling the long transaction issue:</p><ul type="disc"><li>No delay before commencing update</li><li>Access to the whole database by all users at all times</li><li>Simultaneous alternatives can be handled</li></ul><p>One of the disadvantages of version management is that it is extremely difficult to implement it on top of a commercial DBMS, so the vendor is forced either into the compromise of check-out or into implementing his own database engine which supports the concept. However, one of the most difficult aspects of database implementation is handling short transactions efficiently. Since this is not required for most GIS applications, it is much easier to build a robust system with good performance.</p><h3>Short Transactions in a GIS</h3><p>There is much data in corporate DBMSs which needs to be accessed from a GIS. Most of this data is maintained in a commercial DBMS in a short transaction environment. Short transactions are important where it is essential for all users of the system to see the most up-to-date version of the database. 
GIS access to such data is typically for read purposes only and thus a simple interface mechanism will normally suffice. It is of course desirable from the user's point of view to hide differences in the user interface between the GIS and the external DBMS. In cases where it is also desirable to maintain short transaction data via the GIS user interface, it makes a lot of sense to use a commercially available database engine.</p><h3>Summary</h3><p>All multiple-user GIS systems maintain their data by using a long transaction mechanism. This paper has traced a progression from simple mapping systems which maintain multiple map-sheets using a document management approach, through systems which try to maintain continuity between map-sheets by means of an extension of document management known as map management, to truly seamless GIS systems which need a different approach. Two approaches available in the marketplace are explored, namely check-out and version management. The drawbacks and benefits of all approaches are described. The conclusion is reached that the most elegant and powerful solution is version management, and that the lack of support for this in today's commercial RDBMSs is a major drawback to using these systems to underpin a GIS. Thus today's vendors who wish to bring the benefits of version management to their customers must implement it themselves.</p></span>Alfred Sawatzkyhttp://www.blogger.com/profile/15845723068853240174noreply@blogger.com0tag:blogger.com,1999:blog-4451657389087868432.post-86335294637127497652011-08-29T07:56:00.001-06:002011-08-29T07:56:51.301-06:00Smallworld Technical Paper No. 8 - GIS Databases are Different<span class="Apple-style-span" style="font-family: 'Times New Roman'; font-size: medium; "><p><em>by Peter Batty and Richard G. Newell</em></p><h3>Synopsis</h3><p>There has been much debate in the GIS industry about the suitability of standard commercial database management systems (DBMS's) for use in GIS. Historically most GIS's used proprietary storage mechanisms for their geographical data, as the performance of commercial DBMS's did not really make them a viable proposition. However, in the last few years, advances in hardware and software performance have made it possible to develop GIS products on top of commercial DBMS's. This approach has several obvious attractions.</p><p>However, the main argument of this paper is that current commercial DBMS technology has fundamental restrictions for GIS, and that a radically different client-server architecture based on version management has major advantages over traditional DBMS architectures. This new architecture is described and its benefits are explained. Clearly integration with existing databases and conformance to existing standards are also very important, and these issues are also discussed.</p><h3>The Use of Standard DBMS's for GIS</h3><p>The attractions of using standard DBMS's for GIS have been described in some detail by various people, including one of the authors of this paper (see Batty (1), Seaborn (2)). In summary, the advantages of this approach are that the GIS vendor should be able to take advantage of functions provided by the DBMS vendor and concentrate on developing GIS functions.</p><p>In particular, functions such as security and backup and recovery are well proven in standard DBMS's and the GIS can take advantage of these. The GIS user can exploit existing database skills and use common procedures for many database administration tasks, for both GIS and non-GIS data.
Integration between GIS and non-GIS data should be easier to achieve when using this approach. There is increasingly good capability available for integration of multiple DBMS's from different vendors in a heterogeneous distributed environment.</p><p>The drawbacks of the standard DBMS approach are perhaps less obvious, with the exception of performance. There is an argument that performance will not be an issue in the longer term because of advances in technology. Certainly performance has improved, and it is now possible to implement a reasonably efficient spatial indexing system on top of a standard DBMS, especially one which supports physical clustering of data. However, the performance issue is more complex than this, and we will discuss later how a client-server version managed architecture can offer an order of magnitude better performance than conventional systems in a networked GIS with a large number of users.</p><p>The biggest single drawback of standard DBMS's is their lack of capability in the area of handling long transactions and version management, which is discussed in the next section. Handling long transactions is a fundamentally different problem from that of handling short transactions, which is what standard DBMS's are designed for, and handling the former would require significant re-architecting of existing DBMS products.</p><p>Some of the other apparent advantages of standard DBMS's are not as clear cut as they might appear. While the GIS vendor may not have to worry about writing code to cater for some DBMS tasks such as backup and recovery, this advantage may be outweighed by the amount of code which has to be written to implement other functions which are not provided by the standard DBMS, such as the storage and retrieval of spatial data, and the provision of some sort of long transaction handling mechanism such as checkout.</p><p>The integration of geographic data into external applications is also not necessarily simplified by storing it in a standard DBMS. Because spatial data types are not explicitly supported by most standard DBMS's, the GIS vendor typically has to use complex record structures and retrieval algorithms for the geographic data, which means that it cannot be easily read or updated by external applications. The external application also needs to understand any special locking mechanisms which are used by the GIS to handle long transactions, since the standard DBMS locking does not handle this. The most obvious short term solution to this problem is for the GIS vendor to provide an API (Application Programming Interface) which can be used by external applications, but that approach can be used equally well whether or not a standard DBMS is used for the GIS data. This is an area where progress is being made by the standard DBMS vendors, as they begin to provide the ability to define new data types and operators within SQL. Capabilities are included in the draft SQL3 standard which will be useful in this respect.</p><h3>Long Transactions</h3><p>A database transaction is a group of related updates against a database which form a single "unit of work". Standard DBMS's are designed to handle "short transactions", so called because they usually take a short time - typically a few seconds at most. A simple example of a short transaction in a banking system would be transferring some money from a current account to a savings account.
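</p><p>In outline, such a transfer might be coded as follows. This is only an illustrative sketch in the Magik-style notation used later in this paper: begin_transaction() and commit_transaction() stand in for whatever transaction bracketing the particular DBMS provides.</p><pre><tt>method bank.transfer(amount, current_account, savings_account)
  # the whole transfer is one short transaction: either
  # both updates happen, or neither does
  self.begin_transaction()
  current_account.balance << current_account.balance - amount
  savings_account.balance << savings_account.balance + amount
  self.commit_transaction()
endmethod
</tt></pre><p>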
The system first has to subtract the appropriate amount from an entry in the current account table, and then add the same amount to an entry in the savings account table. If the system fails halfway through the transaction then the database is in an inconsistent state, and it must be rolled back to the state at the start of the transaction to make sure that the database is still consistent. An inconsistent state could also arise if someone else were able to update the records involved while the transaction was in progress, so these records will be locked and nobody else will be able to update them for the duration of the transaction. Standard DBMS's are designed around being able to handle this sort of transaction in a very robust way which will handle system failures appropriately.</p><p>In contrast, a typical GIS transaction is something very different. For example, an engineer designing an extension to a gas utility network will need to digitise a number of pipes, valves and other objects, which may take hours, days or even weeks. While he is in the process of doing this design, the database will not be in a consistent state - if someone else looked at the work he was doing when he was halfway through it, it would probably not make engineering sense. So he in some sense needs to have a private copy of the data for the duration of the transaction, which he can make available to other users when he has finished his work. He may also want to create multiple different designs for his network extension and do some analysis on each alternative design before deciding which one to use. Once the work has been carried out, it may be necessary to make some modifications to the database to reflect any changes which were made to the original design when the work was done ("as-built" changes).</p><p>This whole process may well take weeks or months, and since separate copies of the data have to be managed, the issue of concurrency control also has to be addressed. Some mechanism is required to handle the situation where multiple people want to work in the same area at the same time. There are two basic approaches to the concurrency problem: optimistic or pessimistic. The pessimistic approach prevents more than one person working on the same area at the same time by locking out any area where a user is currently working. The optimistic approach allows multiple users to work in the same area with no constraints, on the assumption that conflict in what they are doing is unlikely to occur. Conflicts are checked for at the time that the changes are posted to the master version of the database and they can be resolved at that time. The optimistic approach is generally most suitable for GIS, since typically the volume of updates is small in relation to the total amount of data, and also working practices often dictate that there should not be conflicts even if two people are working in the same area.</p><h3>Checkout</h3><p>The most common approach which has been used to address the long transaction problem is checkout. In this approach, the user specifies an area in which he wishes to work, and the data selected is copied to a separate working area which is just used by that user. The working area may or may not use the same DBMS and data structures as the master database. Updates are made to this working data set, and when the work has been completed the changes are applied to the master database.
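</p><p>The overall cycle might be sketched as follows; the names (check_out, check_in, changes and so on) are invented for illustration rather than taken from any particular product:</p><pre><tt># copy the area of interest into a private working database
working_db << master_db.check_out(area_of_interest)    # may take minutes

# ... hours, days or weeks of design work against working_db ...

# apply the completed changes back to the master database
conflicts << master_db.check_in(working_db.changes())
# any conflicting records must now be resolved and checked in again
</tt></pre><p>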
The concurrency control mechanism used may either be pessimistic, in which case all data retrieved by a user is locked and can only be viewed by other users, or it can be optimistic, in which case multiple users can retrieve overlapping areas and conflicts are identified when changes are passed back to the master database. For a more detailed description of a checkout based system, see Batty (3).</p><p>Checkout has a number of disadvantages. The first is that the initial retrieval of data can take a long time - in general, checkout times tend to be measured in minutes rather than seconds. A second drawback is that the user has a restricted subset of the database to work in. If he discovers that he needs to work outside the area which he originally requested, then it is necessary to do another retrieval, which may again take a significant time. If relationships (including topological relationships) are allowed between objects in the database, this introduces further complications in deciding exactly how much data to check out - some mechanism is required for controlling updates to objects which have been extracted which are related to objects which have not been extracted.</p><h3>Version Management</h3><p>Another approach to handling long transactions is to use a version managed database. In such a database it is possible to create different versions of the database called alternatives. Only one user can update an alternative at one time, but any number of users can read an alternative. Changes made by a user within an alternative are only seen within that alternative. The whole database can be seen within an alternative, but data is not replicated: only the changes relative to the parent version are stored in an alternative. To implement this efficiently requires version management to be built in at a fundamental level in the DBMS. When changes to an alternative have been completed, they can be posted up to the parent alternative. An optimistic approach to concurrency control is used, and any conflicts are detected and corrected at this time. It is also possible to have a tree structure of alternatives, so in addition to handling simple long transactions, this approach also provides a mechanism for handling alternative designs in an elegant way. Version management overcomes all the problems with checkout mentioned above: there is no initial retrieval time, no copying of data is required, and the user has access to the whole database at all times. For a more detailed description of version management see Newell (4) and Easterfield (5).</p><h3>A Client-server Implementation of Version Management</h3><p>This section looks at the way in which version management is implemented in the Smallworld Version Managed Data Store (VMDS). It is necessary to consider the structure of the database at quite a low level in order to appreciate the difference between this architecture and that of traditional DBMS's in terms of the performance which is achievable. This section will just summarise the most important points about the implementation - for a more detailed discussion see Easterfield (5).</p><p>Tables in VMDS are implemented using a standard structure called a B-tree, which allows efficient navigation based on a key value to find a record with that key. Many DBMS's implement their tables using B-trees.
However, the key difference between the VMDS approach and more traditional approaches, for the purposes of this discussion on performance, is that the datastore never updates a disk block directly while any version of the database still refers to it. Whenever some data is changed and committed to disk, a copy of the original disk block containing that data is created and the update is made to that copy. In turn, any other disk blocks in the tree which referred to the old disk block are also copied and updated so that they point to the new version of the data. These other disk blocks may still point to other disk blocks which have not been changed - so common data is shared between versions.</p><p>This is in contrast to traditional DBMS architectures, where an update will cause the data in the original block in which that data was stored to be modified. In such a short transaction based DBMS, a change to a record is immediately seen by all users. In contrast, when an update is made by a user of VMDS, it is only seen by that user until a version management operation such as a merge is carried out by another user.</p><p>A huge benefit of the fact that a disk block is never updated by VMDS is that it becomes possible to cache disk blocks on client workstations. In any subsequent processing, if a disk block is required and it has been cached on the local workstation, that block can be used immediately, in the knowledge that it has not been updated on the server. This is not the case with a standard DBMS, since as soon as a block was updated by any user, a mechanism would be required for updating or uncaching all cached copies of that block (in a way which will work consistently even in the event of hardware, software or network failures on any part of the system). This is a very complex unsolved problem.</p><h3>Performance</h3><p>This ability to cache blocks on the client workstation also means that most of the processing can be done by the client. Thus this approach removes the two main bottlenecks in GIS performance which arise with standard DBMS's - processing on the server and network traffic. The requirement for a typical GIS redraw transaction is to be able to retrieve hundreds or thousands of records, and potentially megabytes of data, from the database in a few seconds. In standard DBMS's, all the optimisation and query processing is done by the server, which quickly leads to a processing bottleneck on the server when processing large numbers of concurrent queries of this complexity. The second bottleneck is that all the data which is returned from the query needs to be transferred across the network, which leads to serious problems in terms of network traffic.</p><p>The version managed client-server approach offers enormous improvements in both of these areas, as very little processing is done by the server - it is essentially just a very simple fast block server - and typically much (or all) of the data required will be cached on the client, in which case network traffic is reduced and processing on the server is also further cut down. This makes it possible to have very large numbers of users concurrently updating a single continuous database. Tests carried out at one customer showed no significant loss in performance as the number of users accessing a single server on a local area network was increased from 1 to 35. The users were all running an intensive data capture application.
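</p><p>The essential point can be expressed in a few lines. Because a committed block is never modified on the server, a client can satisfy many requests without touching the server or the network at all. The sketch below is illustrative only, with invented names (get_block, read_block and so on):</p><pre><tt>method vmds_client.get_block(block_id)
  # a cached copy can never be stale, because the server
  # never updates a block in place
  if self.cache.includes?(block_id)
  then
    return self.cache.block(block_id)   # no server load, no network traffic
  endif
  new_block << self.server.read_block(block_id)
  self.cache.add(block_id, new_block)
  return new_block
endmethod
</tt></pre><p>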
The only way in which anywhere near this level of performance could be achieved with the standard DBMS approach is by checking out data and storing it locally on each workstation, but this approach has other significant drawbacks which were discussed earlier.</p><h3>Distributed Database</h3><p>A very common requirement for companies who use GIS is to be able to access the GIS data in multiple geographically separate locations, which may not be connected by very fast communications. However, for some applications it may be necessary to view the database as a whole, and people working in one district may occasionally want to look at data in other districts. The disk block caching described in the previous section has a natural extension to this sort of distributed database environment (Newell (6)).</p><p>It is possible to have a central master database, and then at each remote location which is accessing the database, have a slave database which stores cached blocks on disk rather than in memory. This persistent cache will store blocks from one session to another, and these cached blocks can be accessed by all users working at this remote site. If users at the remote site typically access a specific subset of the master database (which could be a geographic subset, such as one of ten districts of a utility, or a functional subset of the database, such as water network objects), then the local disk cache will automatically become populated with blocks containing data from that subset. Updates made at remote sites still go directly back to the master database.</p><p>This persistent cache approach to distributed database further extends the capabilities described in the previous section, of reducing the load on the central server and reducing network traffic. It provides a very elegant solution to the distributed database problem, in that data is automatically distributed to the appropriate location based on demand. If the usage of the database at a remote location changes - for example, if the areas managed by different district offices are changed - then the data cached at the location will automatically change over time as users look at the new areas.</p><p>Once again, this whole approach will only work in a version managed environment because of the problem of synchronising updates to multiple copies of a block in a short transaction environment.</p><p>The persistent cache approach can also be used in a multi-level way. A district could have its own large persistent cache, and outlying depots have their own persistent cache. A request for a block from a user in the depot would first check the cache on that client machine, then look in the persistent cache on the depot server, and then the persistent cache on the district machine, before finally looking in the master server.</p><h3>Benefits of Version Management for System Development</h3><p>Version management can also be extremely useful when doing development, or when making changes to a production database. Not only can changes be made to data within a version, but it is also possible to make changes to the data model within a version without affecting the data model (or data) in any other versions. This is extremely useful for an application developer, who can test any changes to the data model against the whole database in their own version, before applying the same changes to the master version of the database. 
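</p><p>Expressed as a sketch (the message names here are invented for illustration, not the actual VMDS API), the workflow is simply:</p><pre><tt># create a private version of the whole database
dev_version << database.create_alternative("data_model_test")

# change the data model and exercise it against real data
dev_version.apply_schema_change(new_road_definition)
run_tests(dev_version)

# if the tests pass, post the changes up to the parent version;
# if not, simply discard dev_version - the master is untouched
dev_version.post()
</tt></pre><p>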
Similarly, any operations which do large scale changes to existing data in the production databases can be run in a version and the results can be thoroughly tested before posting the changes to the master version. If any problems are found then that version can just be thrown away without having done any harm to the master version of the database.</p><p>Version management is also an important requirement for CASE (Computer Aided Software Engineering) applications, and indeed other aspects of GIS technology are useful in implementing a CASE tool, such as the ability to define objects with data which can be represented graphically, and the ability to maintain complex relationships between them. Smallworld have used this fact to implement a CASE tool which can be used to define and modify data models for GIS applications. This uses version management techniques extensively to help provide functions like the ability to create alternative versions of the data model, apply these to alternative versions of a populated database for testing and development, and finally apply the changes to the master version of the database with a mechanism for propagating the schema changes down to other versions in the database.</p><h3>Integration</h3><p>Integration between GIS and non-GIS data and applications is generally accepted as being very important in gaining maximum business benefits from the GIS. So although significant advantages can be obtained by using a new type of DBMS for GIS, it is also very important that any new DBMS should provide good integration with existing DBMS's and applications.</p><p>The Smallworld system caters for this by allowing tables to be defined either in the Smallworld version managed data store or in a standard DBMS such as Oracle or Ingres. To both users and application developers, access to either type of table is identical (an example of encapsulation in an object-oriented system). The only difference is that any updates to the tables in a short transaction database will be immediately visible to all users in all versions of the database, whereas updates made to the Smallworld VMDS will only be visible in the alternative in which the change is made. Any distributed functions provided by the external DBMS can also be used to access other DBMS's on remote machines. It is also possible to use SQL gateway products like SequeLink to access remote DBMS's.</p><p>External applications may also need to access data stored in the GIS. This can be done by providing an SQL server which allows external applications to query the data using standard SQL. There are some issues with this approach in that standard SQL does not support the spatial operators or data types which are supported by the Smallworld database. However, some standard DBMS's now include the ability to add new operators and datatypes, and the SQL3 standard also addresses these areas. Nevertheless, there are certainly shortcomings in SQL in this area at the moment. Another option for providing external access to the GIS data is to provide an API which allows queries or commands written in the GIS development language (Smallworld Magik, in the case of the Smallworld system) to be executed by an external application and to get the results back.
This provides the potential for executing more complex GIS queries such as network traces, which would be very difficult to express in SQL.</p><p>One issue which still requires some careful thought is the exact nature of the integration between long and short transaction data and how commits in both environments interact. Although any table can be stored in either the short or the long transaction database depending on what type of updates are most frequently carried out against that table, there may be occasions where related updates are required across both environments. There may be some cases where controlled replication of certain data in both environments is a valid option.</p><h3>Acknowledgements</h3><p>The authors would like to acknowledge the work of the developers of the technology described in this paper. Mark Easterfield developed the Version Managed Data Store and most of the concepts explained in this paper. Nicola Terry developed the Data Dictionary and Gillian Kendrick developed the CASE tool, which were touched on briefly in this paper and will be revisited in more detail in a future paper. Betty Barber developed the capability to integrate external DBMS's with VMDS.</p><h3>Conclusion</h3><p>The conclusion of this paper is that, although there are some attractions to using standard DBMS's for GIS, an alternative client-server database architecture based on version management has very significant advantages over traditional DBMS architectures for GIS. These advantages are firstly in the area of handling long transactions and managing alternative versions of data, which are critical issues in handling large multi-user GIS databases, and secondly in the area of performance with large numbers of users in a networked environment.</p><p>The authors have personal experience of implementing both approaches, and believe that, although progress is being made on some of the shortcomings of standard DBMS's for GIS, it is hard to see how they can provide the advantages of the database architecture described in this paper without some major re-architecting, which seems unlikely to happen. Thus it is our belief that the best way forward is for us to continue to develop the database technology described in this paper, and further enhance its ability to integrate with commercial DBMS's using SQL standards. The alternative is to wait for the day when the DBMS vendors provide equivalent functionality to what we now have.</p><h3>References</h3><p>1. Batty P.M., An Introduction to GIS database issues, Proceedings of AGI 92, Birmingham, 1992.</p><p>2. Seaborn D., 1995: The Year GIS Disappeared, AM/FM XV Conference Proceedings, San Antonio, April 1992, pp 822-826.</p><p>3. Batty P.M., Exploiting Relational Database Technology in a GIS, Mapping Awareness magazine, July/August 1990.</p><p>4. Newell R.G., Theriault D.T. and Easterfield M.E., Temporal GIS - modelling the evolution of spatial data in time, Computers and Geosciences, Vol 18, No 4, pp 427-433, 1992.</p><p>5. Easterfield M.E., Newell R.G. and Theriault D.G., Version Management in GIS: Applications and Techniques, Proceedings EGIS '90, Amsterdam, April 1990.</p><p>6. Newell R.G., Distributed Database versus Fast Communications, AM/FM XVI Conference Proceedings, Orlando, March 1993, pp 647-659.</p></span>Alfred Sawatzkyhttp://www.blogger.com/profile/15845723068853240174noreply@blogger.com0tag:blogger.com,1999:blog-4451657389087868432.post-7323374684962249722011-08-29T07:55:00.002-06:002011-08-29T07:56:18.251-06:00Smallworld Technical Paper No. 
7 - Object-orientation: Some Objectivity, Please!<span class="Apple-style-span" style="font-family: 'Times New Roman'; font-size: medium; "><p><em>by Peter Batty</em>, Senior Applications Consultant, Smallworld Systems</p><h3>Abstract</h3><p>"Object" would appear, to most people, to be a fairly innocuous, uninteresting word. Strange, then, that the objective of many GIS vendors these days appears to be to mention the word object as frequently as possible, objectivity being no object in labelling a system as object-oriented, to the extent that the word object is in danger of becoming an object of ridicule in the industry. Many people object to this rather objectionable state of affairs, where the word object is used as frequently and in as many different contexts as in this abstract, so the object of this paper is to try to introduce some objectivity into the use of the word object in relation to GIS.</p><p>The paper explains the difference between object-based systems, object-oriented interfaces, object-oriented programming and object-oriented databases. It concentrates in particular on explaining object-oriented programming, using real examples from a GIS application which the author has just implemented in an object-oriented environment.</p><p>The paper also asks the question, "So what?" Even if the user can work out in what respects a particular system is object-oriented, need they be concerned about the answer?</p><p>Peter Batty is a Senior Applications Consultant with Smallworld Systems of Cambridge, England. He has seven years of experience in the GIS industry. Most of this time was spent working for IBM in both the UK and the USA, and he moved to Smallworld Systems in 1992. His experience includes a wide range of GIS development and implementation projects in many different industries and countries. He has written and presented many articles and papers on technical issues in GIS. He has a BA in Mathematics and an MSc in Computing from Oxford University.</p><h3>Outline</h3><p>The term object-orientation has become so widely used in GIS that describing a system as object-oriented has become fairly meaningless. Rather than try to produce a single definition of what constitutes an object-oriented system, this paper attempts to outline the various ways in which the terms object and object-oriented are used in GIS, and produces a summary checklist which can be used to clarify exactly what a vendor means when they describe their system as object-oriented.</p><p>Topics which will be covered include the following:</p><p></p><ul><li>Object-based (as opposed to sheet-based or tile-based) systems</li><li>Object-centred (as opposed to geometry-centred) systems</li><li>Object-oriented user interfaces</li><li>Object-oriented programming</li><li>Object-oriented databases</li></ul><p>The first two topics are largely a question of defining terminology and are fairly straightforward to understand. The third topic is rather vague, and is only briefly discussed. The fourth area of object-oriented programming is somewhat more complex to understand and it is this area that this paper will primarily focus on. The final topic of object-oriented databases is not well defined, and various definitions are discussed.
Most of the (sensible) definitions of an object-oriented database relate back to object-oriented programming, which emphasises the importance of understanding this area in order to make sense of all the other definitions of object-orientation which one may meet.</p><p>The following sections explain each of the above terms, and in each case discuss the relevance of the topic to GIS.</p><h3>Object-based Rather Than Map-based or Tile-based Systems</h3><p>Perhaps the lowest level of functionality which is sometimes described as object-oriented in the context of GIS is what this author would describe as an object-based or feature-based approach to storing geographic data. Many GIS, especially those derived from CAD systems, split the database into map sheets, tiles, or geographic partitions. In such systems, any feature which crosses a map or tile boundary needs to be physically stored as multiple geometric objects, although the system will usually contain some functionality to make this split largely invisible to the end user.</p><p>In contrast, an object-based system does not partition the database into tiles, and stores geographic objects or features as a fundamental unit in the database. In such a system, linear or area objects never need to be artificially split because they cross a tile boundary.</p><p>Although, as mentioned above, systems which use a tile-based approach can make this reasonably transparent to the end user, extra code is required to do this, and typically application development is more complex because of the need to handle special cases where objects are split. There are also potential data integrity problems in trying to ensure that all parts of the object are correctly maintained. It is therefore generally agreed that an object-based approach is preferable to a tile-based approach. The main reason for using a tile-based approach is that this makes it simpler to achieve reasonable performance. However, modern spatial indexing techniques make it possible to get very good performance with a seamless object-based approach.</p><p>An object-based system as described in this section has nothing to do with the term object-orientation as it is used in the rest of the computer industry. However, in the experience of this author, many of the GIS which are described as object-oriented by their vendors fall into this category.</p><h3>Object-centred Systems (as Opposed to Geometry-centred Systems)</h3><p>Another sense in which the word object is used in relation to GIS data modelling is the term "object-centred" used by Newell (1). He contrasts what he terms object-centred and geometry-centred data models for use in GIS.</p><p>A geometry-centred model is one in which the primary classification of objects is a geometric one - for example each object is either a point, line or area. Each of these geometric types is then subdivided into classes which represent objects in the real world; for example, a line might represent a road, a river or a gas pipe, and have appropriate attributes associated with it in each case.</p><p>In contrast, in an object-centred model, the primary classification of objects is based on the real world - so an object might be a road or a school. This object has multiple attributes, which may be either alphanumeric or geometric. Hence an object could have multiple different geometries of different types with this model.
For example, a road might have a line geometry representing its centreline, which could be used for route tracing applications, and an area geometry representing its extent, which might be used in cadastral applications. This approach also facilitates generalisation, by allowing multiple geometric representations of an object which can be used at different scales.</p><p>In general, an object-centred data model provides a number of advantages over a geometry-centred model. However, as with the previous case, this use of the word object has nothing to do with object-orientation. At the risk of confusing the issue, it is possible to have an object-oriented system which is either geometry-centred or object-centred as defined in this section (real-world-object-centred might be a more accurate, if rather long, description for the latter data modelling approach). These two approaches are just different ways of classifying the objects within the system.</p><h3>Object-oriented User Interfaces</h3><p>The term object-oriented interface is a somewhat nebulous one. Some people use it to describe any graphical user interface which makes use of windows, icons, etc., such as Microsoft Windows or X-windows. However, almost all GIS use a standard windowing system, so this is hardly a distinguishing factor in comparing different systems.</p><p>Some people use the term object-oriented interface in more specific ways, for example to describe the sort of interface used by systems such as the Macintosh, where the general approach is to select an object first and then choose an action to be carried out upon it, rather than choosing an action first and then an object. However, there is no general agreement on a precise definition of an object-oriented interface.</p><p>It is also true to say that, while user interfaces are obviously important in GIS, they are not really a major consideration in terms of the use of object-orientation in GIS, so we will not consider them any further here.</p><h3>Object-oriented Programming</h3><p>As stated earlier, understanding object-oriented programming is really the key to understanding definitions of object-orientation in relation to other areas such as databases, so it is this area that this paper will focus on.</p><p>One of the main challenges in explaining object-oriented programming is to find examples which are detailed enough to show how it can give significant benefits in practice, without being too long and difficult to understand. This section attempts to provide some such examples, based on real GIS applications.</p><p>The language used in the following examples is Smallworld Magik. This paper will just explain the minimum amount about the language syntax which is necessary to understand the examples, since the aim is to explain the important concepts of object-oriented programming in general, rather than any specific language. For a more detailed introduction to the Magik language, see (2).</p><p>First we will introduce the basic ideas of object-oriented programming: objects, classes, messages and methods. We will then look in turn at the concepts of encapsulation, polymorphism and inheritance, which are defined by most authors to be the key things which characterise an object-oriented programming language.</p><h3>Objects, Classes, Methods and Messages</h3><p>Somewhat predictably, the idea of an object is central to object-oriented programming. An object is an item of data, very much like a variable (or constant) in a conventional programming language.
Every object belongs to an object class, which is analogous to a data type in a conventional language. So for example, the number 1 is an object belonging to the class integer, and the letter x is an object belonging to the class character. These basic classes are defined as part of the system, as are slightly more complicated classes analogous to other data types in conventional languages, such as arrays.</p><p>However, one of the most important things about object classes is that new classes can be defined by the programmer, based on existing classes. For example, we could define a coordinate class, specifying that each coordinate has an x component and a y component, each of which is a floating point number. This is similar to defining a structure in a conventional programming language such as C. We can access the components of an object as shown in the following example, which creates a new coordinate object with an x coordinate of 100 and a y coordinate of 200, and then prints out the x and y coordinates separately:</p><p></p><pre><tt>c << coordinate.new(100, 200)
write("x = ", c.x)
write("y = ", c.y)
</tt></pre><p>This would produce the output:</p><pre><tt>x = 100
y = 200
</tt></pre><p>The first line creates a new coordinate object and stores this in a variable called c (<< is the Magik assignment operator, like = in C or := in Pascal). In the second line, the expression c.x sends the message x to the object c, and this returns the value 100 (we will look further at messages in a moment). Components of an object, like x and y in this example, are known as slots in Magik or instance variables in Smalltalk.</p><p>As an aside, in Magik a slot does not have a fixed type - we could store a character string like "Hello" in the x component of a coordinate if we wanted to (although this might not be a good idea in this case, we will look at examples later where this capability is very useful). In some other languages like C++, the type (or class) of a slot has to be declared in advance and cannot be changed (this is known as strong typing).</p><p>Now we will look at messages and methods. So far everything we have discussed relating to objects has an equivalent in conventional languages, like C, which support the definition of composite data types or structures. However, a key difference in an object-oriented programming language is that an object class not only defines the data stored in objects of that class (as we have just briefly discussed), but it also defines all the functions which can operate on objects of that class. These functions are known as methods in an object-oriented system. Data in an object can only be accessed via methods defined on its object class. The significance of this will be discussed in the next section on encapsulation. A method is invoked on an object by sending a message to that object, which causes a method of the same name to be invoked. The distinction between messages and methods can be confusing at first, but the same message could be sent to objects of different classes and result in different methods being executed, because the method was defined differently on each class. This will be discussed further in the section on polymorphism.</p><p>To finish this section, we will look at a few examples of invoking methods on objects by sending messages to them. The Magik syntax for sending a message to an object is of the form</p><p></p><pre><tt>object_name.message_name</tt></pre><p>Many methods will return a value (more accurately, they return an object).
For example, suppose we had an object called a_road. The following example shows several Magik expressions in the left hand column and the object which is returned on the right:</p><p></p><pre><tt>a_road.name                  "High Street" (a character string object)
a_road.centre_line           a chain object (a chain is a basic
                             geometric object in the Smallworld GIS)
a_road.centre_line.length    255.0 (the length of the road centre-line
                             in metres)
</tt></pre><p>The last example shows how we can send another message to an object which is returned from another method - expressions like this are evaluated from left to right.</p><p>As well as simply returning objects, as shown so far, methods can change data or cause other actions. For example, the message draw() will invoke a method which draws an object on all current windows in the GIS:</p><p></p><pre><tt>a_road.draw()</tt></pre><p>Parameters can be passed to a method - for example, the method draw_on() will draw an object on a specified window:</p><p></p><pre><tt>a_road.draw_on(a_window)</tt></pre><p>Some methods create or change objects. In our first example we saw the method new(), which creates a new object:</p><p></p><pre><tt>c << coordinate.new(100, 200)</tt></pre><p>There is a special message syntax for assigning data to slots, as in the following example:</p><p></p><pre><tt>c.y << 300</tt></pre><p>This sets the y component of the coordinate c to 300.</p><h3>Encapsulation</h3><p>We mentioned in passing in the previous section that the only way that data within an object (i.e. data in a slot) can be accessed or changed is via methods defined on that object's class. This is known as encapsulation, and we will consider its significance in this section. The most important thing about encapsulation is that it provides a well-defined and strictly enforced external interface to an object. This makes it possible to change the internal implementation of an object without affecting any of the other code which uses the object. This is a great advantage when building large and complex systems. We will look at some examples of how encapsulation could be used.</p><p>First consider the coordinate example we have already looked at. Our coordinate object class has two slots, and we have methods x and y which allow us to access these slots, and methods x<< and y<< which allow us to directly assign values to those slots. For some operations it may be more convenient to work with coordinates expressed in terms of a polar coordinate system, as a radius and angle. We could define two new methods on the coordinate object class, called radius and angle, as follows:</p><pre><tt>method coordinate.radius
  return sqrt(x*x + y*y)
endmethod

method coordinate.angle
  return atan2(y, x)
endmethod
</tt></pre><p>Now executing the following . . .</p><p></p><pre><tt>c << coordinate.new(3, 4)
write("x = ", c.x, " y = ", c.y, " radius = ", c.radius, " angle = ", c.angle)
</tt></pre><p>. . . would produce this output:</p><p></p><pre><tt>x = 3 y = 4 radius = 5 angle = 0.9273</tt></pre><p>Notice that there is no visible difference between the methods which directly access data in the slots (x and y) and the methods which access derived data (radius and angle). At the moment we have no way of setting the radius or angle directly though, as we have not defined methods to do this.
However, we could do this as follows:</p><p></p><pre><tt>method coordinate.radius << new_radius
  current_angle << self.angle   # self.angle tells this object to
                                # send the message angle to itself
  x << new_radius * cos(current_angle)
  y << new_radius * sin(current_angle)
endmethod

method coordinate.angle << new_angle
  current_radius << self.radius
  x << current_radius * cos(new_angle)
  y << current_radius * sin(new_angle)
endmethod
</tt></pre><p>We can now change and access radius and angle just as though they were slots. If we now discovered that our application was using the polar form of the coordinates much more than the cartesian form, we could redefine our coordinate object to have slots called radius and angle instead of x and y, for efficiency, and define appropriate methods x, y, x<< and y<< so that all the methods which were previously defined were still available. Any existing programs using coordinates would run without any change, even though the underlying implementation of coordinate has completely changed. Note that slot access and update methods need not exist for all slots in an object, so slots can be hidden from the external programming interface for an object.</p><p>Encapsulation is a technique which can actually be used in non-object-oriented languages, but it is usually not enforced by the language itself. For example, one could define a coordinate data structure in a language like C, and define functions called set_x, set_y, get_x and get_y, which were analogous to the slot access methods we have described. However, we are entirely reliant on the discipline of the programmers who use this data structure: whenever they access or update a coordinate, they must use the specially provided access functions for doing so, rather than accessing the underlying data structure directly. An object-oriented system strictly enforces this principle of encapsulation.</p><h3>Polymorphism</h3><p>Polymorphism is the ability for the same variable to refer at different times to different classes of object. We have found this particularly useful in GIS applications, where there is often a requirement to handle heterogeneous groups of objects. We can send a message to an object without knowing its class, and the appropriate method for that class will be invoked on the object.</p><p>We will consider as an example a function to carry out Quality Assurance (QA) on electrical network data which has just been captured. We have a set of rules such as the following, for each object class which is relevant:</p><p></p><ul><li>All low voltage (LV) joints must have at least 1, and no more than 4, cables connected to them.</li><li>All pole mounted transformers must have at least 1, and no more than 2, lines or cables connected to each of the low voltage and high voltage connections.</li></ul><p>If we find an object which does not satisfy all the specified rules, we want to tell the user, highlight the object, and change the currently displayed area so that the object is in the centre of the screen. The way in which we will implement this function is to define a method called valid? on each relevant object class, which returns true or false depending on whether the object satisfies all the rules or not.</p><p>At capture time we run interactive checks to ensure that the only object which can be connected to an LV joint is an LV cable, so all we need to check at this stage is the number of objects which are connected to the LV joint.
In this particular data model, an LV joint has a single point geometry called location. Thus our validation method can be written as follows:</p><p></p><pre><tt>method lv_joint.valid?
  num_cables << self.location.all_connected_geometry.size
  if num_cables < 1 or num_cables > 4
  then
    return false
  else
    return true
  endif
endmethod
</tt></pre><p>In the first line, self.location returns the point geometry associated with this lv_joint object. We then send this point object the message all_connected_geometry, which returns a set containing all the geometries which are connected to that object. In this case we do not wish to look at the individual items in this set, we just want to know the size of the set, so we just send the set the message size. All these methods are already defined in the standard class libraries (i.e. class definitions and methods) which are provided with the system. This last object which is returned (the size of the set, i.e. the number of cables connected to this joint) is assigned to the variable num_cables. We then do a simple test on the value of num_cables to check whether this is valid or not, and return true or false accordingly.</p><p>We can define a similar method for pole mounted transformers as follows. This is slightly more complicated since this object has two point geometries, called lv_connection and hv_connection. These represent the distinct connection points for low voltage and high voltage cables or lines belonging to this transformer. Again we validate interactively that only LV cables and LV lines can be connected to the lv_connection, and that only HV cables and HV lines can be connected to the hv_connection, so we just need to check the number of objects connected to each of these geometries.</p><p></p><pre><tt>method pm_transformer.valid?
  num_lv_conns << self.lv_connection.all_connected_geometry.size
  num_hv_conns << self.hv_connection.all_connected_geometry.size
  if min(num_lv_conns, num_hv_conns) < 1 or
     max(num_lv_conns, num_hv_conns) > 2
  then
    return false
  else
    return true
  endif
endmethod
</tt></pre><p>This method is very similar in principle to the last one, except that in this case we have to check the connections to each of the two geometries belonging to the object.</p><p>We can now define our QA validation function as follows:</p><p></p><pre><tt>method qa_menu.validate_objects()
  for an_object over grs.objects_inside_area(current_qa_area)
  loop
    if not an_object.valid?
    then
      grs.current_object << an_object
      an_object.goto()
      grs.show_message("Invalid object found")
      return
    endif
  endloop
  grs.show_message("QA completed successfully")
endmethod
</tt></pre><p>This is the complete code for this application. We define the validation function as a method on an object called qa_menu, which is a menu we have created from which the user will initiate the QA function by pressing a button. The qa_menu object has a couple of slots which are referred to in this method. The first is called grs, which is the graphics system we are currently running - this is quite a complex object which is essentially the whole GIS application, which has slots referring to the current database, all the menus displayed, the currently selected object, etc. There is also a slot which stores the QA area we are currently working within. We just want to check objects inside this area, so we send the message objects_inside_area() to the graphics system. This is a special type of method called an iterator method - it returns objects one at a time to the loop which follows.
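</p><p>As an aside, it is easy to write iterator methods of one's own: inside an iter method, each call to loopbody() hands one object back to the caller's loop. A minimal sketch in the notation of this paper, using the bounding_box class which appears later on, might be:</p><pre><tt>iter method bounding_box.corners()
  # yield the four corner coordinates one at a time
  loopbody(coordinate.new(self.xmin, self.ymin))
  loopbody(coordinate.new(self.xmax, self.ymin))
  loopbody(coordinate.new(self.xmax, self.ymax))
  loopbody(coordinate.new(self.xmin, self.ymax))
endmethod
</tt></pre><p>A loop of the form <tt>for c over bb.corners() ... endloop</tt> would then receive each corner in turn. Returning now to the QA method: 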
Inside the loop, we send the message valid? to each object which is found inside the area. This is where polymorphism is important - even though we don't know the type of object which has been returned (we could find out, but we don't need to), we can send it the same message, valid?, and the appropriate method called valid? gets invoked depending on the class of the object. This example should clarify the difference between a message and a method. We have defined two distinct methods called valid?, one on the class lv_joint and one on the class pm_transformer. However, we can send exactly the same message called valid? to an object of either class and the appropriate method will be invoked.</p><p>If the object fails the validation test then we make it the current object in the graphics system, which causes its geometry to be highlighted and its object class and attributes to be displayed. We then send the object the message goto(), which causes this object to be displayed in the current graphics view, and finally we display an alert message to the user and exit from the loop (and the method) with a return statement.</p><p>The great beauty of this approach is that we can add a new object class to our application, define a method called valid? for it, and the validation code will work immediately without requiring any changes. You don't even need to compile or link anything. In contrast, with a conventional procedural language it would be very hard to write an equivalent QA function which could be extended to accommodate new object classes and rules without having to modify the source code of the validation routine itself. Allowing customers to directly modify product source code is highly undesirable for a software vendor (and indeed for the customer, as it makes support and problem resolution much more difficult), so it is much easier to produce systems which can be easily and cleanly extended in an object-oriented environment like the one we are discussing.</p><p>As another aside, one important point which has been touched on in passing is that the Magik programming environment is interactive. One can be running the GIS, modify a validation method like those above while the system is running, and immediately test the effects of the change without having to compile or link anything. The same is true of Smalltalk, but not of C++, which requires you to compile, link, and re-run your application before you can test the change. Having an interactive programming environment makes a huge difference to development productivity.</p><h3>Inheritance</h3><p>The third main area which characterises object-oriented programming is inheritance. Inheritance allows new object classes to be defined in terms of existing object classes, inheriting both data structure (i.e. definition of slots) and behaviour (definition of methods) from the defining parent class or superclass. A class which inherits from another class is said to be a subclass of its parent. It is possible to define additional slots and additional methods on a subclass. It is also possible to define a method in a subclass with the same name as a method in its parent class, and this new method will override the method from the parent class. We will look at examples of all these things shortly. In overview though, the inheritance mechanism provides a very powerful way of writing generic code which can be shared by many classes, whilst at the same time allowing any differences from this generic behaviour to be easily defined in subclasses.
This results in much smaller amounts of code overall, which again greatly helps the reliability and maintainability of a system.</p><p>The value of inheritance is most apparent in quite complicated systems, so it is difficult to illustrate its full benefits in a short paper such as this. To illustrate the basic concept of inheritance though, we will return to our coordinate example. Suppose that for some applications we need to handle 3-D coordinates, and that for the most part these will be used in the same way as 2-D coordinates (displayed on 2-D maps etc), but that in some cases 3-D coordinates will have additional, or different, behaviour.</p><p>First we will define a few examples of behaviour on 2-D coordinates. We will assume that we have the access methods x and y which we used before.</p><p>We can define a method to measure the distance between two coordinates as follows:</p><p></p><pre><tt>method coordinate.distance_to(another_coordinate)
  dx << self.x - another_coordinate.x
  dy << self.y - another_coordinate.y
  return sqrt(dx*dx + dy*dy)
endmethod
</tt></pre><p>We could also define a method to check whether a coordinate was inside a bounding box (this is a horizontal rectangular area, often used for initial area comparisons in a GIS, which is defined by its bottom left corner (xmin, ymin) and its top right corner (xmax, ymax)).</p><p></p><pre><tt>method coordinate.inside?(a_bounding_box)
  bb << a_bounding_box
  if self.x >= bb.xmin and self.x <= bb.xmax and
     self.y >= bb.ymin and self.y <= bb.ymax
  then
    return true
  else
    return false
  endif
endmethod
</tt></pre><p>There would obviously be a lot more methods defined on a coordinate in practice, but these will suffice for this example. A simple example of creating and using some coordinates and related objects is as follows:</p><p></p><pre><tt># Create a coordinate
c1 << coordinate.new(5, 5)
# Create another coordinate
c2 << coordinate.new(5, 15)
# Create a bounding box
bb << bounding_box.new(0, 0, 10, 10)
# Check if c1 is inside the box
write(c1.inside?(bb))
# Check if c2 is inside the box
write(c2.inside?(bb))
# Calculate the distance from c1 to c2
write(c1.distance_to(c2))
</tt></pre><p>This would produce the following output:</p><p></p><pre><tt>True
False
10
</tt></pre><p>We can now define a subclass of coordinate called 3d_coordinate which inherits from coordinate and has an additional slot called z. This will immediately inherit all the methods we have defined on coordinate, so the operations we have defined above will still work in the same way, accessing the x and y coordinate of the 3d_coordinate and ignoring the z coordinate. We could define a new method to calculate the 3d distance between two 3d coordinates as follows:</p><p></p><pre><tt>method 3d_coordinate.3d_distance_to(another_3d_coordinate)
  dx << self.x - another_3d_coordinate.x
  dy << self.y - another_3d_coordinate.y
  dz << self.z - another_3d_coordinate.z
  return sqrt(dx*dx + dy*dy + dz*dz)
endmethod
</tt></pre><p>In this way we can easily extend the behaviour of existing classes. We can also modify the behaviour of a subclass relative to its parent by overriding methods. We will look at a different example to illustrate this. As mentioned earlier, Magik allows multiple inheritance, i.e. inheritance from more than one parent class. It is possible to define special classes called mixins, which do not have any slots but are just used to define behaviour (methods) which can be inherited by other classes.</p><p>The example we will consider is a data conversion application.
<p>The example we will consider is a data conversion application. We will look at defining methods which specify how objects are interactively created. Within Smallworld GIS, there is standard functionality provided to allow the user to create and manipulate an object called a trail, which is just a general piece of geometry. The trail is a multi-point line, and functions are provided to add points to the trail, move and delete them, generate points by raster line following, etc. The geometry in the trail is used to define the geometry of objects which are added to the system, such as cables or poles. Point objects can be defined either with a single-point trail, for an object with fixed orientation, or with a two-point trail for an object with variable orientation, where the first point defines the centre of the object and the direction from the first to the second point defines the orientation of the object. This is illustrated in the following diagram:</p><p>[Figure not available at this time]</p><p>There are various types of behaviour common to point objects in this data capture application, so we define a class called point_object, on which we will define behaviour common to point objects which can be inherited by application objects such as joints, poles and transformers.</p><p>We will define a general method for creating the main geometry of a point object from a trail which will cover both of the cases above. It turns out to be useful to allow a point object to be added at the end of a long trail in certain situations, for example when digitising linear objects such as cables. We will therefore specify that point objects without orientation will be added at the location of the last point in the trail with an orientation of zero, whilst point objects with orientation will be added at the last but one point in the trail, and the orientation of the last trail segment will define the orientation of the object. This is illustrated in the following diagram:</p><p>[Figure not available at this time]</p><p>In this application, when the user presses the insert button, a new object of the current type is created with no geometry, and then this object is sent the message create_geometry_from_trail(), so that the appropriate default geometry will be created from the current trail. Since we have two different sets of behaviour, point objects with and without orientation, we can define two new classes called point_object_with_orientation and point_object_without_orientation, on which we can define the appropriate behaviour to create geometry from the trail. Both of these classes inherit from point_object, so that any behaviour which applies to any point object (with or without orientation) can be defined on the point_object class, and it will automatically be inherited by these two subclasses.</p><p>We now define our methods as follows:</p><pre><tt>method point_object_without_orientation.create_geometry_from_trail(grs)
    new_point << point.new_at(grs.trail.coords.last)
    self.default_geometry << new_point
endmethod

method point_object_with_orientation.create_geometry_from_trail(grs)
    trail << grs.trail
    if trail.size > 1
    then
        new_point << point.new_at(trail.coords[trail.size - 1])
        new_point.orientation << trail.segment_angle
    else
        new_point << point.new_at(trail.coords.last)
    endif
    self.default_geometry << new_point
endmethod</tt></pre><p>The first method creates a new point at the last coordinate in the trail. This is done by sending the graphics system object, grs, the message trail, which returns a trail object. 
This in turn is sent the message coords, which returns a vector (array) of coordinates, and this is sent the message last, which returns the last element of any ordered collection. So we now have a coordinate, and we create a new point at this coordinate (a point has more information than a coordinate, such as an orientation, and information on other geometries which are connected to that point). No orientation is specified for the point here, since the default orientation is zero, which is what we want in this case. We then assign the default geometry of the new object to the point we have created. This assignment causes user-definable rules to be invoked to connect this geometry to other specified geometries within a given tolerance, as appropriate.</p><p>The second method is similar, but in this case we define the location of the point to be at the last but one point of the trail, provided that the trail has more than one point. To do this we use the indexing method [n], which accesses the nth element of any ordered collection. We also assign an orientation to the point, which we obtain by sending the trail the standard message segment_angle, which returns the angle of the last segment in the trail. If there is only one point in the trail, we create the new point in the same way as for a point object without orientation.</p><p>When we define application point objects like joints, poles and transformers, each of them will inherit either from point_object_with_orientation or point_object_without_orientation. We could also define a new create_geometry_from_trail() method on any of these specific objects if we wished its behaviour to be different in terms of how its geometry is created from the trail. For example, we might wish to regard a substation as a point object with orientation, since like the other point objects we have considered it is a valid end point for a cable, so it shares behaviour in this respect. However, we wish to represent the primary geometry of a substation as a rectangular area geometry, of a size which depends on the voltage level of the substation.</p><p>We would like to define the location of the substation by placing a point at its bottom left corner and placing a second point to indicate its angle. This could be done with the following method:</p><pre><tt>method substation.create_geometry_from_trail(grs)
    trail << grs.trail
    # Define the bottom left corner and the angle
    # from the trail
    if trail.size > 1
    then
        base_coord << trail.coords[trail.size - 1]
        orientation << trail.segment_angle
    else
        base_coord << trail.coords.last
        orientation << 0
    endif
    # Set the substation size (in mm) depending
    # on the voltage
    if self.voltage = "LV"
    then
        xsize << 5000
        ysize << 3000
    else
        xsize << 12000
        ysize << 8000
    endif
    # Now create the relevant area geometry
    new_area << area.new_rectangle(base_coord, xsize, ysize, orientation)
    self.default_geometry << new_area
endmethod</tt></pre><p>So now our substation object has all the behaviour of a point object with orientation, except for the way in which its geometry is created from the trail. It can be seen that in this way inheritance gives us a very powerful technique for sharing code between object classes - we only need to write additional code for a new object class where its behaviour differs from its parent class. 
This example also illustrates the flexibility of a "real world object centred" data model rather than a "geometry centred" data model: we can define a range of objects to be regarded as "point objects" for the purposes of this application, even though they have different geometry types.</p><h3>Object-oriented Databases</h3><p>Whilst there is a reasonable degree of agreement as to what constitutes object-oriented programming, as described in the previous section, there is less agreement as to what constitutes an object-oriented database. Some people, including some well-known figures in the GIS industry, seem to use the term for any database which can store "blobs" (binary large objects, such as images) in addition to traditional data types such as numbers and character strings.</p><p>An alternative definition is that it is a system which provides a persistent store for objects in an object-oriented programming environment, so that objects continue to exist when a program finishes running. To be regarded as a proper database management system (DBMS), such a system should also support multi-user access to the data, handle the associated issues of concurrent update, and provide other standard database functions such as security, backup and recovery. The interface to objects in such an object-oriented database, from a programming point of view, is usually exactly the same as the interface to non-persistent objects.</p><p>The advantages of an object-oriented DBMS are essentially an extension of those for object-oriented programming: with such a DBMS it is possible to use all the same data modelling techniques on objects which need to be stored in the database.</p><p>However, whilst there is general agreement in the computer industry that object-oriented programming is a good thing, database experts seem to be divided over the virtues of object-oriented databases. There is a well-developed theory behind relational databases, a lot of experience has been gained with them, and there are established standards such as SQL. No such formal theory has been developed for object-oriented databases, and there are still some unresolved issues with them, such as providing a general query language and optimising queries. There is therefore a school of thought which says that rather than regarding object-oriented databases as something completely independent of current database technology, relational database systems should be extended to accommodate object-oriented ideas and to allow them to be used within an object-oriented programming environment.</p><p>Smallworld has taken an approach which uses a version managed relational database management system with an object-oriented programming interface added to it. Accessing objects in the database is very similar to accessing non-database objects, but with a couple of restrictions. The first restriction is that slots in database objects have to have a fixed type (class) which is declared in advance, as with any relational database. This is in contrast to non-database objects in Magik, whose slots can be used to store any object of any class. The second restriction is that in the current version of the system, behavioural inheritance (inheritance of methods) is supported on database objects, but structural inheritance (inheritance of slots) is not. 
However, exploratory work has been done on structural inheritance and it is planned to support this in a future release of the product.</p><p>A database table is regarded as a collection in Magik. A collection is a general class which stores a group of objects. There are many standard subclasses of collection which are provided with the system, such as sets, arrays, ordered collections, etc. Database tables form a class called ds_collection (datastore collection). There are various standard methods which apply to all collections, for example size, which returns the number of elements in a collection, and elements(), which is an iterator method that returns all the elements of the collection in turn.</p><p>The following example gives a brief flavour of how database objects can be accessed. Suppose that we have a cost attribute on pipes which we wish to set for all pipes, based on other attributes of the pipe (in reality we would probably set this interactively, triggered by any change in the pipe, but this sort of batch update is a reasonable illustration of access to the database).</p><pre><tt>for p over pipe_table.elements()
loop
    material_cost << material_table.at(p.material).unit_cost
    p.cost << p.length * material_cost
endloop</tt></pre><p>In this example we loop over each of the objects (records) in the pipe table. For each one we obtain the material cost by looking in another table. The method at() returns the database object at a specific primary key value in a table, which is a very efficient means of accessing a record. We can also access records using generic SQL-like predicates. The record object returned has slots like any other object, so we access the unit_cost slot of the material record returned. We then assign the cost attribute of the current pipe database record to the pipe length multiplied by the material cost. Since this is a database object, this value is automatically stored in the database.</p>
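<p>The paper does not show what such a predicate-based query looks like. As a hedged sketch only (the predicate class, its eq() constructor and the select() method follow the style of later Magik class libraries and are assumptions here, as are the field values):</p><pre><tt># A sketch, not from the paper: selecting pipe records with an
# SQL-like predicate instead of a primary key lookup.
steel_pipes << pipe_table.select(predicate.eq(:material, "steel"))
for p over steel_pipes.elements()
loop
    write(p.length)
endloop</tt></pre>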
<h3>Summary</h3><p>We have discussed a number of uses of the term object-oriented in relation to GIS. The following is a summary set of questions which you should ask of anyone who calls their system object-oriented, in order to clarify what they mean:</p><ol><li>Does it store objects as a fundamental unit in the database, with no need to split objects across tile boundaries or partitions? This is what we called an object-based system: we would not call such a system object-oriented.</li><li>Does it have a "real world object centred" data model rather than a "geometry centred" model, as described above? The answer to this question has no bearing on whether or not a system is object-oriented.</li><li>Does it provide an object-oriented programming environment which supports the following:<ul><li>a) Encapsulation</li><li>b) Polymorphism</li><li>c) Inheritance</li></ul></li><li>Does it provide a set of standard class libraries which can be extended by the customer?</li><li>Does it provide a database system which supports each of the previous concepts?</li></ol><p>We will not be pedantic about trying to specify a precise set of answers to these questions which would mean that a system is or is not object-oriented, since this seems a rather pointless exercise.</p><h3>Conclusion: So What?</h3><p>Even if you can obtain answers to these questions, what does all this mean? To an end user of the system, it really makes little direct difference. When sitting in front of a system, you cannot tell whether or not it is object-oriented. The primary benefits of object-orientation are in ease of customisation and maintenance of the system, so the person who really sees the benefits of object-orientation is the application developer. In turn, this of course benefits the end user, who can expect to see applications delivered, and bugs fixed, much more quickly in an object-oriented system. It is Smallworld's experience after several years' work developing a GIS using an object-oriented programming environment that this approach is significantly more productive than traditional approaches to development.</p><h3>References</h3><p>1. Richard G. Newell. Practical experiences of using object-orientation to implement a GIS, Proceedings of GIS/LIS 92.</p><p>2. Arthur Chance, Richard G. Newell and David G. Theriault. An Overview of Smallworld Magik, Smallworld Technical Paper no. 9.</p></span>Alfred Sawatzkyhttp://www.blogger.com/profile/15845723068853240174noreply@blogger.com0tag:blogger.com,1999:blog-4451657389087868432.post-56797473664601472642011-08-29T07:55:00.001-06:002011-08-29T07:55:41.201-06:00Smallworld Technical Paper No. 6 - Integrated Data Capture<span class="Apple-style-span" style="font-family: 'Times New Roman'; font-size: medium; "><p><em>by David G. Theriault, Nicola Terry, Cees van der Leeden and Richard G. Newell</em></p><p>For the past 15 years, GIS users who enthusiastically embarked on grandiose projects have, with very few exceptions, spent enormous amounts of time and effort mired in the initial phase of data capture. The organisations worst affected are those where the capture effort is concerned with the conversion of old documents to a digital, structured database, e.g. utilities, local authorities and cadastral agencies. The difficulties stem mainly from inappropriate tools being applied to the problem. This paper presents an approach in which the problem has been completely re-examined at four levels. The first consideration is at the project management level, where a proper analysis of goals and targets is required. The next level is concerned with the rapid development of specific applications in an ergonomic, integrated object-oriented environment. These applications must be supported by facilities for rapid take-on of data (e.g. raster-vector integration, interactive vectorisation tools and multiple grey-level images) and a topologically structured, seamless spatial database. Finally, powerful workstations with a visually pleasing window environment will be needed. Benchmarks so far seem to indicate that improvements in time by factors of 3 to 5 may be achieved.</p><h3>Introduction</h3><p>Everybody now recognises that one of the largest costs (possibly the largest cost) of implementing a GIS is data conversion. In particular, the problem of converting data held on maps into a database containing points, lines and polygons which are intelligently structured is particularly burdensome. Much effort has been invested in trying to improve the process beyond the traditional methods based on manual digitizing. These include the use of scanned raster data, automatic conversion of raster into structured vector, automatic symbol and text recognition and "heads up" (on-screen) digitizing. All of these have their place, but we will advocate in this paper that other aspects of a system's functionality and its structure can have more dramatic effects in reducing conversion costs in many cases. 
Our experiences stem from applying these methods to the capture of utility foreground data and cadastral polygon data, where automatic approaches based on artificial intelligence break down.</p><p>Putting it another way, it seems that, with the current state of the art of automatic recognition methods for data conversion, there will always be a large proportion of documents which can only be interpreted and input by human operators. The exercise then becomes one of providing the best possible environment for humans to operate in. This includes not just providing much improved user interfaces, but also improving the very structure of the system itself. Badly structured, poorly integrated systems contribute greatly to the costs of conversion.</p><p>We contend that data capture is not a separate function carried out in isolation from the GIS it is to feed: the data capture system itself should embody much of what one would expect from a modern corporate GIS.</p><h3>Goals and Targets</h3><p>It is important at the start of a GIS implementation to identify what it is that the GIS is intended for. This may be an obvious thing to say, but it is amazing how, in many cases in the past, the conversion exercise has borne no relationship to any pre-assigned goals. An organisation embarking on GIS can define the business case for the whole project. This will result in:</p><ul type="disc"><li>identifying the end users of the system (including those who might never actually sit down in front of a terminal).</li><li>a detailed breakdown of activities on the system.</li><li>an outline of possible future requirements.</li></ul><p>From this analysis it is possible to know the exact data requirements, including the coverage required and the level of intelligence required in the data structures.</p><p>This business case analysis has not always been done rigorously. The result is that frequently far more is converted than is strictly necessary. It does not seem very many years since the silly arguments about the relative merits of raster and vector. Now that discs are cheap and processors and displays are fast, everybody recognizes that a raster representation for a landbase is an excellent and cost-effective way to get started. Raster not only provides a perfectly adequate representation for producing displays and plots, but is also a good basis for the capture of intelligent data.</p><p>For a utility company, the data may be divided into background and foreground data. The source of the background data is good quality paper maps from the national mapping agency (e.g. the Ordnance Survey in the UK and the IGN in France). A complete digital vector representation of these background maps is usually not available or is expensive, so we would advocate that these maps should be scanned using a good quality scanner. The cost of producing a scanned background mapbase is a small fraction (perhaps one tenth) of the cost of producing a vector background mapbase.</p><p>It is thus essential to decide at an early stage what the objects in the real world are that need to be represented in the system, in order that the foreground capture can be limited to these.</p><p>After all, a GIS is no more than a corporate database management system in which one of the query and report environments is a map. Thus the same data modelling and analysis that goes into DBMS implementation needs to be done for GIS implementation. 
This data model will determine that which needs to have an intelligently structured graphical representation and that which does not.</p><p>It is essential that goals are set so that a complete coverage of a defined and useful area can be achieved in about one year to eighteen months. If the time frame is longer than this, the management will get impatient to see results, the maintenance problem will overtake the primary data capture, and the whole thing will fall into disrepute. We would go as far as to say that if the organisation cannot commit the money and resources to achieve this, then it should wait until it can.</p><h3>Development of Specific Applications</h3><p>For the purpose of the data conversion project, specific applications for input and edit need to be developed for each data item. In parallel, it is necessary to implement and test the end applications of the system. Traditionally in a GIS project, these tasks have consumed an enormous amount of time and valuable resource. For sound reasons, the data capture project is postponed until these applications show robust results. The advances that have been made in data processing and database management, such as rapid prototyping and 4GL environments, have only recently surfaced in some commercial GIS systems.</p><p>This task of applications development is made much easier if the GIS already has an extensive library of generic functions which can be applied to the specific data model implemented. These functions might include generic editors for geometry, attributes and topology, network functions, as well as polygon operations including buffering and overlay. It is in this area that interactive object-oriented environments really come to the fore, as new functionality can inherit much of the existing functionality of a generic system, as well as benefiting from very rapid prototyping and development capabilities.</p><h3>Integrated System Approach</h3><p>This section contains the main burden of our argument. In the past, most data capture was carried out using conventional digitizers on single-user systems, one sheet at a time, in isolation from the GIS. We contend that data capture should be carried out using on-screen digitizing in a multi-user environment, using a continuous mapbase, in the environment of a fully integrated GIS.</p><p>Such an approach gives considerable savings in time and cost in making a complete coverage. The basis of the approach is to eliminate many of the activities which take time in systems which are not well integrated.</p><p>The procedure starts by capturing an accurate background landbase. This may well be a base constructed of vectors and polygons, if it is available, such as the data available from the Ordnance Survey. Alternatively, a base constructed of scanned raster is much cheaper. Scanning at 200 dots per inch is barely adequate for this purpose, and it is worth the marginal additional expense of scanning at a higher resolution of say 300 or even 400 dots per inch. With modern compression techniques, 400 dpi only takes twice, not four times, the space of 200 dpi. Semi-automatic vectorisation (see below) is considerably improved at the higher resolutions.</p><p>[ Figure 1 not available ]</p><p><em>Figure 1: Basic Data Model</em></p><p>The next stage is to input data which can be used to locate objects and mapsheets, as well as providing an essential geometric reference base. This includes such things as street gazetteers, map sheet locators and street centre-lines. 
Thus locating any street or map sheet and getting it onto the screen should be very efficient. Conversely, the system should be able to identify the mapsheet(s) containing any point or object indicated on the screen. Some locational data may already exist in existing company databases. Doing this before the foreground capture can improve overall efficiency.</p><p>Finally, the source maps containing the foreground data of interest can be scanned, and the process of on-screen data capture can begin in an environment in which many of the hurdles in conventional systems are removed. Figure 1 illustrates the order in which information is assembled.</p><p>The following is a discussion of several important issues, including:</p><ul type="disc"><li>On-screen digitizing</li><li>Digitizing across many sheets without interruption</li><li>Geometry, topology and attribute capture in one integrated GIS environment</li><li>Multi-user data capture into one continuous database</li><li>The full facilities of a GIS for validation and running test applications</li></ul><h3>On-screen Digitizing</h3><p>The conventional way of digitizing map data is to use a tablet or digitizing table. While, for certain kinds of data, this is usually of sufficient accuracy, a significant amount of time is taken by the operator in having to validate the input on a workstation screen, which is a separate surface from that on which the digitizing is performed. Thus it is difficult to check for digitizing errors and also for completeness, as the two images are not together.</p><p>Further, it is common for there to be minor misalignments between a quality background map sheet and the map sheet on which the foreground data is drawn. With on-screen digitizing it is possible to do a quick local realignment to compensate for such errors.</p><p>The process can be considerably improved with facilities for semi-automatic vectorisation of the raster data, so that high accuracy can be obtained while working at a relatively small scale. This reduces the need to zoom in and out frequently, as the automatic vectorising facilities are usually more accurate than positioning coordinates by eye.</p><p>It is also necessary to be able to handle greyscale scanned maps in cases where the source maps are of low quality, or contain a wide variety of line densities, such as pencil marks on a map drawn in ink. Compression techniques should allow a large number of greyscale or colour maps to be stored efficiently.</p><h3>Digitizing Across Many Map Sheets Without Interruption</h3><p>With conventional digitizing, it is not possible to digitize at one time any object which lies across a map sheet boundary. Objects such as pipes or electricity cables frequently cross map sheet boundaries at least once. In these cases, the parts of the object must be connected together by an additional interactive process. Ad hoc techniques such as "off sheet connectors" and artificial segmentation of the object must be avoided.</p><p>With on-screen digitizing, the process starts by scanning all (or at least a significant number of) adjacent foreground maps into the database before digitizing commences. This, in effect, simulates a single large continuous raster mapbase. 
Thus the digitizing of complete objects can be achieved without interruption, as all of the information about the object is presented together on the screen.</p><h3>Geometry, Topology and Attribute Capture in one Integrated GIS Environment</h3><p>It is also necessary for the input, edit and query functions of the GIS to be available to the operator during data capture. Thus, little time is wasted moving from one user interface environment to another. Objects with similar attributes to objects already in the system can use those attributes as a starting template for modification.</p><p>The integration of the digitizing capability with the geometric CAD-type facilities means that constructive geometry can be mixed with digitized geometry at will. Further, the integration of the geometry capability with the topology facilities means that most topology can be generated automatically by the system at the time that objects are positioned on the screen. For example, a water valve placed on a water pipe causes a node to be inserted in the water pipe, and the correct connectivity to be inserted in the database.</p><p>The inclusion of gazetteering functions considerably reduces the amount of time spent locating objects in the database.</p><h3>Multi-user Data Capture into one Continuous Database</h3><p>Approaches based on digitizing one sheet at a time by means of single-user data capture systems result in a lot of time being spent merging and managing the many independent data files generated. The problem becomes particularly acute when edits are required to data which has already been input.</p><p>The system should permit many users to simultaneously update one continuous mapbase. This can only be achieved efficiently if the system allows any number of versions and simultaneous alternatives to be held in the database. This makes the task of administering many users engaged in simultaneous additions and modifications to the database much easier.</p><p>This single issue is one of the dominant reasons why data capture can be very time consuming and costly.</p><h3>The GIGO Myth</h3><p>A popular cliché these days is the GIGO (garbage in, garbage out) story. Well, it is not actually true in many cases, or at least the situation may be recoverable. It is our experience that many of today's GIS do in fact contain so-called garbage. The reason we say this is that frequently the geometry is good enough, but the structure of the data is defective: often objects are split across map sheet boundaries, and object attributes are wrong or just plain missing. In other words, the data cannot be used for the purpose for which it was intended.</p><p>In our experience, there is usually enough in these databases to convert the data to a satisfactory quality by a process of "laundering". This is a process where the data is imported into a GIS environment, automatic processors can be applied to the data to infer missing topology and missing attributes, and a range of validators can be run to detect and highlight problems graphically for a human operator to deal with. The cost of doing this is usually very much less than the cost of the original data capture, but it does require a rich GIS environment in which to do it.</p><p>It is particularly useful if the system contains rapid prototyping and development facilities, making it more feasible to write special programs to clean up particular datasets. 
These programs may well be used only once, then thrown away.</p><h3>User Interface and Platform Issues</h3><p>Much of what has been proposed above has only been made possible by the huge improvements in hardware and software in the last three years. On the hardware front, the drop in price, combined with the huge increases in performance, memory and disk sizes, means that data capture can be carried out in the environment of a large seamless database managed by a modern GIS. The improvement in emerging user interface standards, coupled with the arrival of object-oriented development environments for the rapid development of custom-built user interfaces for particular data capture tasks, is another contributing factor.</p><h3>Conclusion</h3><p>Whereas most effort in the past in improving data capture has been expended on fully automated methods for converting source map data into a structured database, we believe that for many types of documents the human operator cannot be eliminated, but that the present situation can be considerably improved by providing a fully integrated system.</p><p>Many of the hurdles which contribute to the cost of data capture can be eliminated by several levels of integration, including:</p><ul type="disc"><li>The integration of source and target images in one database and on one screen via the use of raster scanned maps.</li><li>The integration of all map sheets in one seamless mapbase.</li><li>The integration of map sheets and locational data.</li><li>The integration of data capture, validation and GIS functions in one environment.</li><li>The integration of many users capturing data into one version managed multi-user database.</li></ul><p>As well as these issues of integration, the application of robust intelligent techniques, such as semi-automatic vectorisation and automatic topology generation, further improves the process. Finally, the advent of modern user interfaces and object-oriented development environments leads to the production of optimised custom-built user interfaces ideally tuned to the task in hand.</p></span>Alfred Sawatzkyhttp://www.blogger.com/profile/15845723068853240174noreply@blogger.com0tag:blogger.com,1999:blog-4451657389087868432.post-6801566500087357342011-08-29T07:54:00.000-06:002011-08-29T07:55:01.225-06:00Smallworld Technical Paper No. 5 - An Overview of Smallworld Magik<span class="Apple-style-span" style="font-family: 'Times New Roman'; font-size: medium; "><p><em>by Arthur Chance, Richard G. Newell & David G. Theriault</em></p><h3>Introduction</h3><p>Smallworld Magik is an extremely powerful language for the implementation of large interactive systems. The language is a hybrid of the procedural and object-oriented approaches, and program development is carried out in an interactive environment.</p><p>It would be pertinent for the reader to ask: why would a small company like Smallworld devote considerable resources to the creation of such a development environment? Are not the common languages available on the market today adequate for the job? Well, the simple answer to the second question is no, and it is the purpose of this paper to try to give an answer to the first.</p><h3>Background</h3><p>Traditionally, large interactive systems were developed in a non-interactive procedural language, such as Fortran or C. In order that an end user could drive such a system, an interactive command language was provided so that the user could type in his commands. 
Many command languages evolved into programming languages, usually by borrowing the programming concepts of Basic. In more modern user interfaces, this command language may well be hidden behind a system of screen menus, tablet menus or other input devices. Large modular systems were glued together using an operating system command language or script. Within the system, developers and customisers had a number of other languages available to them to define such things as syntax, menus, data descriptions, graphic codes, etc, all running as different processes communicating by files.</p><p>The structure of such systems was commonly organized at the highest level around the command syntax, and complex commands were structured in a top-down approach from there.</p><p>If one examines systems that have been put together in this manner over the last five to ten years, they all suffer from a number of difficulties:</p><ul type="disc"><li>Development is slow. Users' requests for enhancement have to wait for the next release, which is usually over a year away.</li><li>They are difficult and expensive to maintain. During the life cycle of the system, probably 90% of development goes into maintenance.</li><li>Major restructuring in the light of five or ten years of hindsight is unthinkable.</li><li>Customisation is arbitrarily done in one or more of the many languages used to put the system together, typically Fortran or the command language.</li><li>Integration with other systems is nearly impossible.</li></ul><p>Smallworld has implemented Magik in order to avoid all these difficulties. The way this is achieved is by embodying the following features in the language and its development environment:</p><ul type="disc"><li>There is but one language for system, application and customisation development.</li><li>Both object-oriented and procedural methodologies are supported.</li><li>Development is in an interactive environment.</li><li>The language is expressive and very readable.</li><li>There is an extensive library of standard object classes, methods and procedures.</li><li>The language is built as a platform suitable for delivering commercial systems.</li><li>Applications can be transferred with a minimum of effort between hardware platforms.</li></ul><p>It is Smallworld's belief that the presence of all these features is essential if commercial systems are to be developed, maintained and customised with a minimum of programmer effort. It is the lack of a viable language with a sufficient subset of these facilities that has stimulated Smallworld to produce its own which embodies all of them.</p><p>Magik allows programs to be developed in one seamless environment, meaning that systems programming, applications development, system integration, and customisation are all written in one environment in the same language. Thus, end users who wish to customise the system can be confident in the quality of the tools provided, because they are identical to the development tools used by the core and application system developers. 
Further, existing systems, such as most database management systems, can be fully integrated, so that to the user they appear as part of one homogeneous system.</p><p>We have already started to use some of the jargon of object orientation, and in order to understand Magik it will be necessary for us to try and explain object orientation and a number of the other terms used to describe it.</p><h3>Object Orientation</h3><p>Object orientation is not an easy concept to explain, and we are not likely to fully succeed here; however, suffice it to say that experienced computer industry observers are in no doubt as to the power of the technique, and indeed are convinced that within the next few years object orientation will be the dominant approach to structuring and building large complex systems. The argument is subtle, but the benefits of it are profound.</p><p>Object orientation refers to a way of structuring software systems; thus people who refer to object-oriented databases may be misusing the term, as it is not clear (at least to this writer) what they mean. Putting it at its most terse, object orientation structures software around the things being processed, not around the functions being performed. In order to try and get over some of the flavour of what object-oriented programming is all about, let us try and describe some of the common terminology surrounding it.</p><h3>Object</h3><p>An object comprises two things: its own state (manifested as a set of instance variables), which no other part of the system can access directly, and a set of procedures (called methods) which describe its behaviour. Everything about an object is encapsulated within it, and the only way of getting data out of it, or changing it, or getting it to do something, is by sending messages to it. An object is a rather sophisticated extension of the concept of a variable in other languages. The fact that data is hidden and that the only way of communicating with an object is via a rigorously defined system of message passing (see below) means that extremely robust systems can be created.</p><h3>Class</h3><p>Every object belongs to a unique class. Some classes are primitive, such as real numbers, integers and text strings. A class is very similar to a type in normal procedural languages. All objects of a given class will exhibit the same behaviour, i.e., have the same set of methods. In an object-oriented language, the programmer himself defines new classes in order that he can later make instances of them. To be more precise here, we should say that Magik is based on "exemplars", in that a new class is defined by having a special way of defining the first instance of it, i.e. an example of it. From then on, all further instances are cloned from the exemplar.</p><h3>Method</h3><p>A method is a procedure, no more, no less. The reason a different term is used is because of the way in which it is called. In a normal procedural language, if one sees a reference to a procedure, then it refers unambiguously to a particular piece of code. However, a reference to a method in a message (see below) may well refer to any one of many pieces of code, depending on the class of the object receiving the message. Further, it will not be possible to ascertain which method will be executed until run time. What this means is that routines can be written which do not depend on the type of the data thrown at them, and therefore the same code can be reused over and over again in different contexts. Such code, written without a knowledge of what it is to be used for, is sometimes referred to as a software IC.</p>
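<p>A tiny sketch may make this concrete. Here the same message, area, is sent to objects of two different classes, and which method runs is chosen at run time by the class of the receiver. The rectangle class is defined in the appendix of this paper; the circle class, its area method, and the assumption that a vector supports the standard elements() iterator are purely for illustration:</p><pre># A hedged sketch of dynamic dispatch: the loop does not know or
# care which class each element belongs to.
shapes << vec(rectangle.new(2, 3), circle.new(5))
for s over shapes.elements()
loop
    write(s.area)   # invokes rectangle.area or circle.area as appropriate
endloop</pre>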
<h3>Inheritance</h3><p>Quite frequently, different classes might be similar, in that one might exhibit all the behaviour of another plus some additional behaviour. In other words, one class may share a number of another class's methods, but also have a different version of some methods and some new methods of its own. In these cases, one class can be defined to be a subclass of another. This is another way of exploiting code reusability. Note that the behaviour of a subclass is not a subset of the behaviour of its superclass, because typically it does things in addition to the superclass. A subclass is a specialization of its superclass in the same sense that a duck-billed platypus is a specialization of a monotreme, which is a specialization of a mammal, which is a specialization of a vertebrate, which belongs to the animal kingdom. A duck bill is unique to the platypus, whereas egg-laying behaviour is common to all monotremes, warm-bloodedness is common to all mammals, and all vertebrates have a backbone.</p><p>Usually, classes are organized in a strict hierarchy, each subclass in the hierarchy inheriting its behaviour (methods) from its superclass. In Magik, the concept of multiple inheritance is also supported (equivalent to a hybrid in our example from the animal kingdom), so that a class may inherit behaviour from any number of superclasses (thus forming a heterarchy). Some objects are special in that their only function is to supply behaviour to other classes, and it is not possible to create an object instance from them; these are called "mixins".</p><p>The point of the class heterarchy is to try and maximize code reusability.</p><h3>Message Passing</h3><p>Message passing in an object-oriented language means exactly the same thing as a procedure call in a procedural language. "Message expression" is the term used to describe how a message is sent. A message expression comprises three parts: an object, a message name, and a list of zero or more parameters, which are themselves objects.</p><p>In Magik a message expression would look like:</p><pre>p3 << point.new(10,20)</pre><p>Here a new object "p3" is manufactured by taking the object "point" and sending it the message "new" with the parameter objects 10 and 20 (note: "<<" is the Magik assignment operator).</p><pre>x3 << p3.x</pre><p>This makes the object "x3" by sending the message "x" to "p3" (thus x3 becomes 10).</p><pre>p3.y << 50</pre><p>This could be regarded as changing p3 by sending the message "y<<" to "p3" with parameter 50.</p><h3>Magik - a Hybrid Language</h3><p>We have said in the introduction to this paper that Magik is a hybrid language. This is because, whereas object orientation is the preferred method of organizing large systems (programming in the large), it can be cumbersome and contorted for writing certain kinds of routines, especially pure functions. Such routines are usually not very large and are reasonably self-contained. 
It is therefore preferable to write such parts of the system using a conventional procedural approach, as is often the case for small programs (programming in the small).</p><p>In Magik, object-oriented code can be freely mixed with procedural code, and indeed the message expression has been made to look rather like a procedure call, where the procedure name can be thought of as a concatenation of the object class with the message.</p><h3>Interactive Development Environment</h3><p>At least as important as the power of a language is the nature of the development environment provided for programmers. Progress with conventional system programming languages is considerably slowed by having to re-link every time a new piece of code needs to be tested. If one provides programmers with an interactive language with powerful tools to explore and discover existing code in the extensive library of standard objects, then programmer productivity is improved by a large factor, regardless of the quality of the language itself. In Magik, the development environment and code browsing facilities are inspired by Smalltalk.</p><p>The inheritance relationships for the objects in the Magik environment can be complex, but inheritance diagrams can be generated automatically. Also, the implementing class of any particular method can be easily identified. The annotated source code for most of the environment is available online. There are code browsers and object inspectors; the system can locate methods and variables given only part of the name. Debugging tools include tracing of calls and access to global variables. It is possible to fix and continue after an error.</p><h3>Magik - a Readable Language</h3><p>Some popular languages, although very powerful, are extremely difficult to read. Occasionally language constructions are so cryptic that even the original author has difficulty deciphering the code that he has written, let alone passing it to anyone else. The same criticism might also be levelled at very low level languages such as most assemblers. In order that large systems can be maintained, it is essential that the language syntax is designed so that the majority of the programming population does not get immediate dyslexia. As far as procedural programming is concerned, Magik has been designed to fit in with the Algol style of syntax, so that conditionals, procedure calls, expressions and loops will all appear familiar and readable. The object-oriented facilities, such as method definitions and message expressions, have been designed to fit in with the same Algol style.</p><h3>Library of Object Classes, Methods and Procedures</h3><p>At the heart of the programming environment is a variety of useful classes that can be used directly or subclassed. These classes include:</p><ul><li>Different kinds of collections for grouping objects (sets, association tables, ordered lists, stacks, queues and so on). Some of these collections support relational algebra.</li><li>Multi-dimensional arrays.</li><li>Streams of objects or text, either to or from an external channel such as a file, or to and from buffers.</li><li>A text editor.</li><li>An application framework and menus for putting together interactive applications. This is in turn built on a library of objects that do graphics.</li></ul><p>There are also packages of common mathematical and statistical functions.</p>
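<p>As a flavour of what using such library classes might look like, here is a hedged sketch: the class names hash_table and rope, and the methods shown, follow later Magik class libraries and are assumptions here, not taken from this paper:</p><pre># An association table mapping keys to values (names assumed)
ages << hash_table.new()
ages[:alice] << 34

# An ordered list (a rope), using the standard collection protocol
jobs << rope.new()
jobs.add_last(:first_job)
write(jobs.size)</pre>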
<h3>Magik - a Delivery Platform</h3><p>Magik is designed with three strategies specifically aimed at delivery platforms. The first strategy is simply a defensive programming style. The environment and application framework are designed to encourage this, and the error-handling mechanisms in the programming environment that are intended for debugging can easily be replaced in a delivery version with appropriate error recovery procedures.</p><p>A second strategy is to protect application and system code. Online source code can be removed from the delivery version if preferred, and method tables can be locked so that methods cannot be changed or removed from sensitive classes. Global variables can be made constant. If necessary, the compiler can be removed from the platform and users restricted to menus instead.</p><p>A third strategy is to separate text and pictures intended for human interpretation from other data wherever possible. This is vital for internationalization, so that words and symbols can be translated into different languages.</p><h3>Appendix - Snippets of Smallworld Magik code</h3><p>Simple assignments and expressions:</p><pre><tt>y << x + 1/x
x << (-b + sqrt(b*b - 4*a*c))/(2*a)</tt></pre><p>Multiple assignments. Given two objects a and b, both the following examples swap a and b, regardless of their classes:</p><pre><tt>(b,a) << (a,b)   # Parallel assignment
a << b ^<< a     # ^<< Boot and becomes</tt></pre><p>Calling procedures and sending messages:</p><pre><tt>largest << max(p1, p2, p3)
(red, green, blue) << chair.colour.rgb()</tt></pre><p>Conditionals:</p><pre><tt>if chair.height < min then
    report('This chair is too small')
elif chair.height > max then
    report('This chair is too big')
else
    report('This chair is just right')
endif</tt></pre><p>Loops:</p><pre><tt>member << for mem over membership_list.elements()
loop
    # statements to be looped over
    if mem.age < 18 then continue endif
    # more statements
    if mem.available then leave with mem endif
endloop</tt></pre><p>Procedures:</p><pre><tt>quadratic << proc(a, b, c)
    if a = 0 then
        if b <> 0 then return -c/b endif
    else
        discriminant << b*b - 4*a*c
        if discriminant < 0 then
            return
        else
            s << sqrt(discriminant)
            ta << a + a
            return (-b+s)/ta, (-b-s)/ta
        endif
    endif
endproc

# leaves x1 with 3 and x2 with 2
(x1,x2) << quadratic(1,-5,6)
# leaves x1 with -2 and x2 unset
(x1,x2) << quadratic(0,2,4)
# leaves both x1 and x2 unset
(x1,x2) << quadratic(4,1,3)</tt></pre><p>Returning objects. The normal way to return objects from a procedure or method is to use the return statement.</p><pre><tt># returns three objects
return a, b, c

# returns a single list of three objects
return vec(a, b, c)

# Any statement, simple or complex, can also return
# objects using the form '>>'
# "a" ends up being 42 or 99
a << if p > q then >> 42 else >> 99 endif</tt></pre><p>Iterators. Iterators generate values used in loops. This example uses the fibonacci_numbers iterator to supply values for f.</p><pre><tt># The fibonacci series is the series 0, 1, 1, 2, 3, 5, 8, 13 ...
for f over fibonacci_numbers(10)
loop
    # statements
endloop

# The iterator fibonacci_numbers could be defined as follows:
fibonacci_numbers << iter proc(n)
    if n < 1 then return endif
    loopbody(0)
    if n < 2 then return endif
    loopbody(1)
    f0 << 0
    f1 << 1
    for i over range(1, n)
    loop
        (f0, f1) << (f1, f0 + f1)
        loopbody(f1)
    endloop
endproc</tt></pre><p>Method definition:</p><pre><tt>method rectangle.new(width, length)
    clone.init(width, length)
endmethod

method rectangle.area
    >> width*length
endmethod</tt></pre><p>Iterator methods. 
Iterator methods are very similar to iterators, except that they are defined as part of the behaviour of a particular class. The expression membership_list.elements() under Loops above is an example of the use of an iterator method.</p><p>Protected code. The protection code in a protect statement is guaranteed to be executed, even if an error occurs:</p><pre><tt>input << text_fileinput.new(filename)
protect
    # code to process the file
protection
    input.close()
endprotect</tt></pre></span>Alfred Sawatzkyhttp://www.blogger.com/profile/15845723068853240174noreply@blogger.com0tag:blogger.com,1999:blog-4451657389087868432.post-78011785267468969702011-08-29T07:53:00.001-06:002011-08-29T07:53:34.735-06:00Smallworld Technical Paper No. 4 - Version Management in GIS - Applications and Techniques<span class="Apple-style-span" style="font-family: 'Times New Roman'; font-size: medium; "><p><em>by Mark E. Easterfield, Richard G. Newell & David G. Theriault</em></p><h3>Abstract</h3><p>The majority of today's GIS systems adopt an approach to multi-user working which is based upon the standard model underpinning conventional commercial RDBMS systems. This model is appropriate in applications where only the most recent state of the database is of interest, but it does not work well over long transactions, and is inappropriate for manipulating multiple versions. Many GIS activities involve transactions lasting days or weeks, and others could benefit from the ability to develop, compare, and retain or merge different versions of data. This paper proposes an alternative approach to multi-user working, based around deferred detection of conflict. The paper reviews a number of benefits which the approach brings, such as multi-user data capture, developing and assessing alternative designs, and the ability to incorporate external mapbase updates over time. The paper concludes by considering the features required of the underlying system in order to support the approach.</p><h3>Introduction</h3><p>The majority of today's GIS systems adopt an approach to multi-user working based on the one designed for conventional data processing requirements. This approach is based on the principle that it is only permissible to make changes to the most up-to-date version of the data, a rule which is satisfied by a combination of locking and enforced rollback.</p><p>This rule is entirely appropriate in situations where changes tend to be small, where the data represents a single state, and where conflict must be detected at the earliest opportunity. A good example is a ticket reservation system, where it is vital to know with absolute certainty whether a seat has been sold or is still available.</p><p>The approach is much less appropriate to most GIS applications, where changes may be substantial and be self-consistent only over a considerable period of time, and where it may be legitimate to construct alternative versions of data before deciding upon one or another to represent the real world. However, most of the GIS products currently available are locked into the conventional model, as they tend to be heavily dependent upon standard commercial DBMS offerings, which are designed for the DP environment.</p><p>One particular deficiency of conventional DBMS systems, their inability to handle long transactions, has been recognised, particularly where graphical data is concerned. 
However, attribute data is still generally managed in the traditional manner, and there are problems in maintaining consistency between the two forms of data. Currently, there is little support for the management of alternative versions of data.</p><p>This paper examines the benefits of a data management system which supports alternative versions. It illustrates how such a system also provides a much more appropriate multi-user model in the GIS environment, and provides additional spin-off benefits in a distributed environment.</p><p>The paper goes on to consider the requirements of such a system, independent of its implementation, and concludes with one possible implementation strategy.</p><p>[ Figure 1 not available ]</p><h3>The Benefits Of Enabling Alternative Versions</h3><p>Traditional database technology has encouraged a view of the world, as represented in the database, in which there is a single, canonical state. It is possible to modify the database from one state to the next, but it is not permissible to have multiple canonical states simultaneously.</p><p>At first sight this seems most reasonable: there is only one true state of the world (in real life). Furthermore, it precludes incompatible changes from being made in a very simple way: there need never be any doubt about the logical integrity of the data. The approach is good at representing a static past, or a present which evolves in small, simple, discrete steps.</p><p>However, many applications are actually used to model the future, or to model a loose form of the present in which changes in the real world are not represented as a continuous series of small transitions, but rather as a few discrete modifications, each of which is substantial.</p><p>When modelling the future, the ability to represent alternatives is of great value. For example, when planning a new development, a number of different possibilities can be reviewed and costed before one is chosen. A system which permits alternative models of the (future) world to co-exist independently of one another must also inherently permit these alternative views to remain separate from the accepted view of the present, so that if there is a canonical view of the world it is not affected by "what ifs".</p><p>This requirement to keep a canonical data set apart from changes to it also occurs during customisation. In this case, the alternatives do not necessarily represent real futures at all, but simply experiments to check that the customisation work does what is intended. Traditionally, such verification is performed on a small subset of the data which has been copied out of the master database. However, this approach tends to conceal problems of scale until the customisation is released, when it is too late to rectify them. Using alternatives, customisations can be tested on the real database, highlighting problems immediately, but without interfering with users in any way.</p><p>When modelling the present, there is benefit in being able to defer making individual changes visible until a complete set of changes is in a logically consistent state. In the case of adding a newly acquired data set, or of making substantial changes to the use of a piece of land, say, the data may remain inconsistent for some time. The type and duration of change are more akin to those in a CADCAM environment than to the small transitional changes encountered in DP. 
In an environment which provides alternatives, the canonical state represents one possibility, and the new state resulting from the modifications represents another. The new state can remain hidden from general view for as long as is required, and only released when the whole change is considered complete.</p><p>Looking further afield, there are other advantages in an environment which supports alternatives. If a solution has been found to the problem of merging independently developed versions (which is fundamental to the entire alternative strategy), then it is also possible to contemplate merging changes which have occurred entirely outside the particular database.</p><p>One example is that of the map base vendor, who issues changes to the map base from time to time to a set of customers. Each customer may have wished to modify his existing base over time, to keep it in step with the real world. At the time the new base is issued, some of the changes will have ceased to be relevant, because they have now been incorporated by the vendor, but others, for reasons of timing or of the nature of the data, will still be required. In such a situation, it is necessary to be able to identify the differences from the original base, and to re-apply those modifications which are still relevant to the new base.</p><p>This is illustrated in Figure 1, where the act of bringing the map base up to date would actually prove undesirable if allowed to proceed in an uncontrolled manner. The alternative management approach permits the user's foreground to be reconciled with the new map base in a controlled fashion, so that day-to-day users of the data never see a dangerous and misleading combination.</p><p>A second example where the comparison and merging of disjoint versions is of value is in a distributed database environment. To be truly distributed, parts of the database must be able to function whilst they are cut off from other parts. This necessarily demands that changes must be able to be made which cannot be wholly checked for logical consistency until all the database is visible again. An approach which accepts the mutual coexistence of different versions, with consistency validation deferred until late in the day, makes this feasible.</p><p>An approach which allows alternative (and therefore necessarily conflicting) versions of the state of the database to exist is mutually exclusive with an approach which guarantees a single, fully consistent state. In many DP applications, the latter approach is absolutely necessary. We contend, however, that in GIS applications transactions are naturally long, that concurrent alternative scenarios are desirable, and that in practice mutually conflicting changes are either rare or wholly unavoidable; a model based upon alternatives is therefore more appropriate.</p><p>The last of the three points is the most crucial to this argument. If there is a large amount of avoidable conflict which has to be resolved late in the day, then the amount of work required to create a consistent state outweighs the benefits which have been proposed. We believe, however, that day-to-day conflict will generally be rare in a reasonably well managed environment, and that the other problem of substantial unavoidable conflict is one which is outside the scope of conventional database management techniques.</p><h3>The Relationship Between Versions And The Management Of Time</h3><p>The management of versions and the management of time are often referred to synonymously. 
However, the two are not the same, and should not be confused (Langran 1988).</p><p>Version management, with which this paper is principally concerned, offers control over database time, which runs strictly forwards. As will be seen later, the management of versions may be conveniently implemented by successive overlaying of changes.</p><p>Management of real-world time is different, because it is possible to modify the past by adding new data (and therefore time in one sense is no longer strictly forwards). For example, the data in a map base may reflect the external world at the current time, say 1990; later on, data may be added which describes what the world was like at an earlier date, for example in 1953. Database time and real-world time are thus running in opposite directions.</p><p>It is possible to implement mechanisms which manage time in a database, but the methods described here are not generally appropriate. As with alternative version management, an efficient solution demands functions built into the database manager itself which are outside the scope of this paper.</p><h3>The Requirements Of A System Providing Alternative Versions</h3><p>Any system which permits alternative states of data to exist concurrently must minimally provide a certain set of features. These are considered in this section, and comprise:</p><ul type="disc"><li>The ability for different users to work on seemingly wholly independent views of the database, so that other users remain unaware of the changes they are making</li><li>The ability to commit changes to disk without the changes becoming visible to other users</li><li>The ability to make changes potentially visible to other users when they have been authorised as a consistent set</li><li>The ability of a user to defer, until it is convenient, the making visible of changes effected by someone else</li><li>The ability to determine precisely what has changed between two versions</li><li>The ability to validate the consistency of a set of changes to the database.</li></ul><p>The first point is an underlying prerequisite for alternative version management. Conventional GIS solutions to the problem of data access which are based on check-out and check-in suffer from restricting access through excessive locking of data, perform poorly during the check-in/check-out operations themselves, and are expensive in their use of disk. A system providing effective alternative version management demands solutions more directly in tune with that specific approach.</p><p>The separation of the action of committing changes to disk from that of making changes visible to others is actually more related to long transactions than to alternative management per se. However, since an alternative itself may evolve through a number of states, and may exist for an indefinite period of time, there is the de facto requirement to be able to save individual states of an alternative while it is still hidden from general view.</p><p>The traditional DP approach to transaction management combines the act of committing to disk with that of making changes visible to others. It also incorporates rollback using the same technique, for good measure. These three functions, although related, do not have to be combined, and can actually be considered separately. In an environment which supports a number of individually identifiable alternatives, any one can be committed to disk and rolled back without affecting the others. The act of making the changes visible to a wider audience then becomes a separate administrative function, rather akin to the formal approval and release of drawings or documents.</p>
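<p>To make the separation of these three functions concrete, they can be sketched as a toy model. This is illustrative Python only, not the datastore's actual interface; all the names in it (<tt>Alternative</tt>, <tt>commit</tt>, <tt>post</tt>) are invented for the purpose:</p><pre><tt># A toy model of an alternative whose saved state stays private.
# Purely illustrative; nothing here is the product's API.

class Alternative:
    def __init__(self, parent_state):
        self.working = dict(parent_state)    # uncommitted, in-memory changes
        self.committed = dict(parent_state)  # saved to disk, still private

    def commit(self):
        # Save work to disk WITHOUT making it visible to anyone else.
        self.committed = dict(self.working)

    def rollback(self):
        # Discard uncommitted work; other alternatives are unaffected.
        self.working = dict(self.committed)

    def post(self):
        # The separate, administrative act: release the last committed
        # (self-consistent) state for others to refresh against.
        return dict(self.committed)</tt></pre>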
<p>In an environment where conflicting alternatives may co-exist, the merging of two such alternatives requires a high degree of formality. Three distinct phases are required. The first, making a set of changes potentially visible, has already been described. However, before a user working on another alternative can be allowed to see those changes, he too must take positive action, to prevent the view of the database he is using suddenly going from a self-consistent state to an inconsistent one. Just as the user has been buffered from changes going on elsewhere up until now, so he should continue to be protected until there is a good chance that the uptake of the new view will not cause conflict. This process of accepting a set of changes which have been issued from another alternative is called refreshing. The refresh process itself consists of two parts: the first validates that the new combined state is logically consistent, and the second actually makes the combined state the one to be used.</p><p>When two alternatives are developed independently, the only data which can truly be considered common between the two is that which existed in a common ancestor. The difference between a pair of alternatives is a function of the modifications made to that ancestor in each case. And as these modifications are made in isolation from one another, there is no physical relationship between them, so they may only be compared by considering what logical changes have been made. In particular, it is not possible simply to merge two versions together physically; it is necessary to re-apply the logical changes that have been made.</p><p>Thus the ability to refresh demands the ability to determine what changes have been applied to the database during the life of an alternative version.</p><p>Before a refresh can be completed, the new view of the database which would be created by incorporating the changes of one alternative into the changes of another must be validated to determine whether any integrity constraints have been broken. In databases based on the relational model, two sorts of conflict can arise. There is a conflict if the same record (identified by primary key value) has been modified in a different way in the two alternatives; and there may be a conflict arising from an application constraint. The former is easy to detect, the latter more difficult, requiring application-specific code. However, things are not as bad as they may seem, since if the application possesses the ability to verify the integrity of data generally (for example, as it is entered) then precisely the same verification functions can be used on the set of changed records during the refresh process.</p><p>The detection of either form of conflict can be built upon the ability to retrieve the set of records which have changed between the version being refreshed and the common ancestral version. The ancillary problems of how to report or display conflict, and of providing tools to rectify the situation, fall into the application and user interface domain, and are not considered here.</p>
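<p>The record-level differencing and the first sort of conflict test can be sketched in a few lines. Again this is a toy model for illustration only: versions are represented here as plain mappings from primary key to record, which is not how the real datastore holds them:</p><pre><tt># Illustrative only: a "version" is a dict mapping primary key to record.

def changes_since(version, ancestor):
    """The logical changes made to ancestor to reach version:
    a dict of {key: 'insert' | 'delete' | 'modify'}."""
    changes = {}
    for key, record in version.items():
        if key not in ancestor:
            changes[key] = 'insert'
        elif record != ancestor[key]:
            changes[key] = 'modify'
    for key in ancestor:
        if key not in version:
            changes[key] = 'delete'
    return changes

def key_conflicts(mine, theirs, ancestor):
    """The first sort of conflict: the same record changed in a
    different way in the two alternatives. (Conflicts arising from
    application constraints need the application's own checks.)"""
    ours = changes_since(mine, ancestor)
    others = changes_since(theirs, ancestor)
    return {key for key in ours.keys() & others.keys()
            if mine.get(key) != theirs.get(key)}</tt></pre>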
<p>The foregoing has described a set of requirements for a platform which supports the management of alternatives. We next describe one possible mechanism by which these requirements can be fulfilled.</p><h3>An Implementation Strategy</h3><p>Many of the demands made of a system which manages alternative versions of data concern the ability to identify the differences between them. Thus we are considering different states of the data, and, most importantly, what has happened to get from one state to the next.</p><p>Using the relational model, the changes between one state and another can be described by the set of changed records, identified by primary key value, and the type of change (insertion, deletion or modification). Non-relational models present more problems, because there is no simple way of comparing entities between two independently generated sets of data. This paper only considers the relational case.</p><p>The implementation described here is based on a tabular datastore constructed from conventional B*-Trees, with the fundamental versioning mechanism built into the underlying data structure itself. This has the properties of low space overhead and almost zero performance overhead during conventional access. This level of the implementation permits multiple versions of the data to co-exist, with blocks which are common to two or more versions being shared between them.</p><p>Note that the approach does not demand that all blocks reside in the same file. In particular, benefit can be gained in the long term by storing a version on a mass storage medium such as CD-ROM and maintaining the changes on conventional disk.</p><p>In order to determine the differences between a version and one of its ancestors, it is only necessary to examine those blocks in that version which differ from the ancestor's version, rather than all the blocks in the database.</p><p>Different versions of the same tree are simply accessed via their different root blocks, and the action of committing a new version to disk simply involves storing the new root block number in an appropriate place. The old version continues to remain accessible while its root block number is still known.</p><p>[ Figure 2 not available ]</p><p>Alternative versions form a hierarchy, as illustrated in Figure 3. At the top (the root) is the master database, which would probably correspond to what is considered the canonical state. Nodes which are derived from this could be used, for example, to differentiate between themes (e.g. pipework, electrical, cadastral, etc.), between different spatial areas, or between any other aspects where the probability of conflict is low. At the leaves of the tree, each "user" has an alternative in which to work without conflicting with others. In many cases, the levels of the tree will have close correspondence with levels of approval.</p><p>[ Figure 3 not available ]</p><p>Changes are posted (i.e. made more widely visible) one level at a time. Provided that the up-to-date version at the higher level is the one from which the changed version was derived, only application validation in the form of "approval" is required: the version itself will be self-consistent.</p><p>If the version at the higher level has since been updated, the version about to post must first refresh, so that it is the up-to-date state which is being posted into. If this formality were not undertaken, the intervening changes to that higher-level version would be lost.</p><p>The posting process can generally be implemented using purely physical means; the refresh process cannot. The refresh process is effected by determining which records have changed since the last refresh, and the form of each change, and re-applying those changes into the new state.</p><p>Of all the processes, refreshing is the most complex and potentially time-consuming. However, the time taken is proportional to the degree of change, not the overall size of the database.</p><p>The cycle of modification, posting and refresh is illustrated in Figure 4.</p><p>[ Figure 4 not available ]</p>
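<p>In the same toy model as above, a refresh amounts to re-applying the alternative's logical changes on top of the parent's newer state and then validating the result. The sketch below is illustrative only (the conflict test from the previous sketch would be run first), and a real implementation works block by block rather than record by record:</p><pre><tt># Illustrative refresh: re-apply my logical changes onto the parent's
# newer state. mine, parent_now and ancestor map primary key to record.

def refresh(mine, parent_now, ancestor, validate=lambda state: True):
    merged = dict(parent_now)
    # The keys this alternative has touched since the common ancestor;
    # the real datastore finds these by examining differing blocks only.
    touched = {key for key in mine.keys() | ancestor.keys()
               if mine.get(key) != ancestor.get(key)}
    for key in touched:
        if key in mine:
            merged[key] = mine[key]   # re-apply an insert or a modify
        else:
            merged.pop(key, None)     # re-apply a delete
    if not validate(merged):          # application-level integrity checks
        raise ValueError("refresh would break an application constraint")
    return merged                     # the alternative's new, refreshed state</tt></pre><p>The re-application loop runs over the touched records only, which is in keeping with the observation above that the time taken depends on the degree of change rather than on the size of the database.</p>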
<h3>Accessing Conventional RDBMSs</h3><p>A GIS may be called upon to access and maintain data which is handled using a conventional DBMS. We know of no system around today that supports the sort of functionality described in this paper, so if the functionality is to be provided to the end-user it must be achieved by means of a version-managed front end to the non-version-managed underlying system. This solution is bound not to be wholly satisfactory, and includes such problems as:</p><ul><li>Finding an optimal way of populating the version-managed datastore tables which performs well and doesn't require too much disk</li><li>Coping with situations where the conventional database is updated by external access (i.e. outside the version-managed environment)</li><li>Retaining version information once data is posted back to the conventional database</li></ul><p>The first of these problems is almost certainly soluble by techniques such as remembering frequently used queries and dynamic downloading of data. The latter two problems are much more difficult to resolve. However, we believe that the benefits of a system which supports multiple versions and alternatives in GIS applications outweigh these potential difficulties.</p><h3>Conclusions</h3><p>In this paper we have suggested that the conventional DP transaction approach to multi-user working is inappropriate for GIS applications. There are a number of essential differences between the conventional approach and the one described here. These include the ability to maintain, merge and administer a hierarchy of alternative versions in a controlled manner and to safely allow multi-user access without locking.</p><p>We contend that a system which uses deferred detection of conflict to permit alternative versions of data to be worked on concurrently provides many benefits. These include the ability to experiment with and cost alternative scenarios, to develop and test customisations on a realistic size of dataset, and to resolve many of the current problems relating to long transactions and distributed working.</p><p>We have identified a number of requisites of a system which is to successfully support alternative working, which include the ability to separate the process of making one user's changes visible to another into discrete stages, and the ability to determine what has changed between one version and another.</p><p>Finally, we have demonstrated the feasibility of the approach by proposing one possible implementation technique which fulfils these requirements efficiently.</p><h3>References</h3><p>Armstrong, Marc P. (1988). Temporality in Spatial Databases. Proceedings GIS/LIS '88, San Antonio, 1988, pp 880-889.</p><p>Barrett, Myles (1989). Object-Oriented Language Extensions for CAD/CAM. Journal of Object-Oriented Programming, Vol. 2, No. 2, July/Aug 1989.</p><p>Chance, A., Newell, R. G. and Theriault, D. G. (1990). An Object-Oriented GIS: Issues and Solutions. 
Conference Proceedings of EGIS, Amsterdam, April 1990.</p><p>Chou, H. and Kim, W. (1986). A Unifying Framework for Version Control in a CAD Environment. Proceedings of VLDB '86.</p><p>Ecklund, D. J., Ecklund, E. F., Eifrig, R. O. and Tonge, F. M. (1987). DVSS: A Distributed Version Storage Server for CAD Applications. Proceedings of VLDB '87.</p><p>Katz, R. H. and Chang, E. (1987). Managing Change in a Computer-Aided Design Database. Proceedings of VLDB '87.</p><p>Kent, William (1989). An Overview of the Versioning Problem. ACM SIGMOD 1989 Conference, Portland, Oregon. SIGMOD Record, Vol. 18, No. 2, June 1989.</p><p>Langran, Gail and Chrisman, Nicholas R. (1988). A Framework for Temporal Geographic Information. CARTOGRAPHICA, Vol. 25, No. 3, 1988, pp 1-14.</p></span>Alfred Sawatzkyhttp://www.blogger.com/profile/15845723068853240174noreply@blogger.com0tag:blogger.com,1999:blog-4451657389087868432.post-58155831390693408982011-08-29T07:52:00.001-06:002011-08-29T07:52:47.648-06:00Smallworld Technical Paper No. 3 - An Object-Oriented GIS - Issues and Solutions<span class="Apple-style-span" style="font-family: 'Times New Roman'; font-size: medium; "><p><em>by Arthur Chance, Richard G. Newell & David G. Theriault</em></p><h3>Abstract</h3><p>The current generation of software tools is inadequate to satisfy the wide diversity of GIS requirements in a seamless manner. In addition, the tools provided to develop user applications or to customise current GIS offerings exacerbate the problem. Object-oriented programming systems (OOPS) are now recognised as a key component in building powerful applications which are robust and maintainable, and which are also seamlessly extendible. Unfortunately, many myths surround OOPS: that they are difficult, simply fashionable, or inherently slow. This paper will propose that an OOPS coupled with an interactive programming environment can be highly effective when applied to the demanding requirements of GIS. It describes a rich, but user-friendly, polymorphic, exemplar-based environment which supports today's emerging standards and which has proven highly appropriate to the development and implementation of GIS.</p><h3>Introduction</h3><p>The largest costs incurred for any organisation embarking on the implementation of a GIS are data conversion, hardware and software, and system implementation, particularly customisation. It is now recognised that no GIS on the market today does it all, thus implying that the missing functionality has to be added to the base system in order to satisfy any particular customer. Yet rarely, in the many papers describing how to choose a GIS, do you see any mention of the need to assess how readily a system can be customised. It is common for the next most expensive item after data conversion to be customisation.</p><p>Most checklists for choosing a GIS concentrate on long lists of superficial functionality, and little is said of an assessment of the quality of basic system fundamentals and foundations. The most important fundamentals to get right in a technology as complex as GIS are the basic database architecture, and the data models implemented within that architecture, together with the software architecture needed to extend the system at low cost.</p><p>During the last ten years or so, enormous strides have been made in the development of hardware. Processors are at least 20 times faster, memory is 500 times bigger, disks are hundreds of times bigger. The advance in hardware has fuelled a parallel advance in fundamental database software. 
The relational model has advanced from the era when there were doubts as to whether it could be made to work at all to now being the dominant technology.</p><p>In contrast, the advance in the development of tools to develop and adapt software systems is nothing like so dramatic. We all know that most of the world's commercial software is still written in Cobol, and most of the technical software is still written in Fortran. Even "modern" languages such as C do not give a whole lot more in productivity and maintainability when compared to the huge advances in other technologies.</p><p>The advent of CASE tools and 4GLs makes some attempt to address the issue. It is our belief that the fundamental system architecture itself has to be rethought if large strides are to be made. This paper proposes an interactive, hybrid object-oriented, procedural language as a key component of such an architecture. We are well aware that the term "object-oriented" is frequently bandied about with very few people understanding what it means. We also attempt in this paper to explain what the technology means, and why it should be of significance to the implementation of GIS.</p><h3>Conventional System Structure</h3><p>Traditionally, large interactive systems were developed in a non-interactive procedural language, such as Fortran or C. In order that an end user could drive such a system, an interactive command language was provided so that the user could type in his commands. Many command languages evolved into programming languages, sometimes by borrowing the programming concepts of Basic. In more modern user interfaces, this command language may well be hidden behind a system of screen menus, tablet menus or other input devices. Large modular systems were glued together using operating system commands or scripts (see Figure 1). Within the system, developers and customisers had a number of other languages available to them to define such things as syntax, menus, data descriptions, graphic codes, etc., all running as different processes communicating by files.</p><p>The structure of such systems was commonly organised at the highest level around the command syntax, and complex commands were structured in a top-down approach from there. If one examines systems that have been put together in this manner over the last five to ten years, they all suffer a number of difficulties:</p><ul type="disc"><li>Development is slow. Users' requests for enhancement have to wait for the next release, which is usually over a year away.</li><li>They are difficult and expensive to maintain. During the life cycle of the system, probably 90% of development goes into maintenance.</li><li>Major restructuring in the light of five or ten years of hindsight is unthinkable.</li><li>Customisation is arbitrarily done in one or more of the many languages used to put the system together, typically Fortran or the command language.</li><li>Integration with other systems is nearly impossible.</li></ul><p>[ Figure 1 not available ]</p><p>Much effort has gone into toolkits for developing and customising systems, including standard graphics libraries, user interface managers, data managers, windowing systems, etc. However, if one wishes to get in and program or customise any of these systems, one is confronted with operating system commands, Fortran, C, SQL, embedded SQL, some online command language, domain-specific 4GLs or a combination of these; not to mention auxiliary "languages" to define user syntax, menus, data definitions, data dictionaries, etc. 
With these kinds of programming tools it can take many man-months of skilled programmer effort to achieve even modest system customisation.</p><h3>Development Languages</h3><p>It is a common problem with systems that contain parts front-ended by different languages that it is not possible to integrate them properly. For example, a graphics system for mapping, which is "hooked into" a database, typically does not allow the full power of the database to be accessed from within the graphics command language, nor can the power of the graphics system be invoked from within the database query language. What is really needed is a system such that all data and functions can be accessed and manipulated in one seamless programming environment (Butler 1988).</p><p>What has been shown by a number of organisations is that the same development carried out with an on-line object-oriented programming language can cut such development times by a very large factor (e.g. 20). Object orientation does not just mean that there is a database with objects in it, but that the system is organised around the concept of objects which have behaviour (methods). Objects belong to classes which are arranged in a hierarchy (preferably a heterarchy). Subclasses inherit behaviour, and communication between objects is via a system of message passing, thereby providing very robust interfaces.</p><p>Firstly, such a language should be able to provide facilities covering the normal requirements of an operating system, customisation, applications programming and most systems programming. Secondly, the language should have a friendly syntax that allows casual users to write simple programs for quick online customisation. Larger programs should be in a form that is easily readable and debuggable. Thirdly, the language must be usable in an interactive mode.</p><p>There are several languages around which satisfy some of these requirements: Basic is alright for simple programs (large impressive systems have been implemented using some of the more advanced modern Basics); Lisp has much of the power and speed, but is hardly readable (however, much of the success of Autocad may be attributed to Autolisp). Smalltalk has both speed and object orientation, but with the total exclusion of any other programming construct. Postscript is very widely used and has a number of the desired features, but is another write-only language (i.e. "unreadable" by anyone, including the original programmer). Hypertalk is wonderful, but you would not write a large system in it. C++ has much of the required syntax and semantics, but it is not available as a front-end language and can therefore only be accessed by a select few system builders, normally employed by the system vendor.</p><p>Having dismissed most of the well-known languages developed during the last 30 years, what, then, is required? It is an on-line programming language, with a friendly Algol-like control structure and the powerful object-oriented semantics of languages like Smalltalk.</p><h3>What Is Object Orientation?</h3><p>The dominant style of programming has always been procedural, the structure of the program being organised around the functions being performed, usually in a top-down hierarchy of procedure calls. 
An object-oriented programming language is one where the program is organised around the objects being processed, usually in a hierarchy of objects which can share (inherit) the procedures (methods) belonging to other objects.</p><p>Object orientation is not an easy concept to explain; however, its importance is not in doubt. Object orientation will become the dominant approach to structuring and building large complex systems in the future.</p><p>Efficiency in the development of computer systems depends on how easily they are modified and enhanced. Changes in an evolving system are either concerned with changes in function or changes in data structure. Procedural programming does a reasonable job in localising changes to function by means of such devices as routine libraries, but changes to data structures usually mean a cascade of side effects as the data structures are referred to in many parts of the system. Object-oriented programming goes a long way to localising these changes also.</p><p>An object comprises two things: its own state (manifested as a set of instance variables), which no other part of the system can access directly, and a set of procedures (called methods) which describe its behaviour. Everything about an object is encapsulated within it, and the only way of getting data out of it, or changing it, or getting it to do something is by sending messages to it. An object is a rather sophisticated extension of the concept of a variable in other languages.</p><p>Quite frequently, objects of different classes might be similar, in that one class might exhibit all the behaviour of another plus some additional behaviour. In other words, one class may share a number of other classes' methods, but also has a different version of some methods and some new methods of its own. In these cases one class can be defined as a subclass of another so that it can inherit the shared behaviour.</p><p>Object orientation, which embodies the above concepts of encapsulation and inheritance, brings the following benefits:</p><ul type="disc"><li>Code sharing is greatly increased, thereby increasing programmer productivity.</li><li>System modules communicate via well-defined interfaces, which means it is easier to find bugs.</li><li>It is easier to maintain software: both minor enhancements and sometimes major restructuring. Maintenance costs are reduced.</li><li>The environment is not only extremely good for prototyping, but also can be used as a base for a production system.</li><li>User interfaces can be iterated to their optimum much more easily.</li><li>Object orientation is very suited to the manipulation of heavily structured data.</li></ul><p>One of the commonest reservations echoed about object orientation is whether it can be made to work with acceptable performance. One recalls the same remarks being made about relational databases nearly 15 years ago.</p><p>It is true that object-oriented languages do run slower than procedural ones. This is mainly because the message expression may take more time to evaluate, and also because programming style tends to lead to large numbers of messages being sent (procedures being called) to rather small methods (procedures). This performance issue is now considerably offset by improved compiler techniques, faster hardware, and using a procedural approach where appropriate.</p><p>In any case, much customisation today is carried out by writing macros in the system command language. As these are run via an interpreter, they are far slower to execute than a properly implemented object-oriented language.</p>
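<p>The concepts of encapsulation, inheritance and message passing discussed above can be made concrete in a few lines. The fragment below is in Python purely for illustration (the environment this paper describes uses its own language, introduced later), and every name in it is invented:</p><pre><tt># Illustrative only. State is reachable solely through methods;
# Python enforces the privacy by convention (the leading underscore).

class Counter:
    def __init__(self):
        self._count = 0          # instance variable: the private state

    def increment(self):         # a method: part of the object's behaviour
        self._count += 1

    def value(self):             # the only way to read the state
        return self._count

class StepCounter(Counter):
    """A subclass inheriting Counter's behaviour and redefining part of it."""
    def __init__(self, step):
        super().__init__()
        self._step = step

    def increment(self):         # a different version of the inherited method
        self._count += self._step

# "Sending a message" is invoking a method: both classes respond to the
# same messages, so the sender need not know which kind it holds.
for counter in (Counter(), StepCounter(5)):
    counter.increment()
    print(counter.value())       # prints 1, then 5</tt></pre>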
<p>At the end of the day, the benefits to the system customiser are that he can perform major system enhancements in a fraction of the time taken with a conventional system, and for the end user, a much richer functionality in his system and far fewer delays in waiting for enhancements.</p><h3>Object Oriented Databases</h3><p>One often hears the term object-oriented applied (sometimes wrongly) to many kinds of systems. So what does the term object-oriented database mean? At first sight it seems strange for a term which was originally used to describe a programming language, usually in comparison to procedural languages.</p><p>Now when one comes to databases, all conventional databases are (sort of) object-based. In a relational database, the objects around which the data are organised are tables, records and fields. There are some higher-level objects, such as views, but there is no explicit representation of the highest-level objects such as real-world things. Much of the more recent development has been concerned with how to embody high-level semantics in the model, but still, the database itself does not embody any of the behaviour definitions of the objects it contains (Oxborrow 1989).</p><p>Now, one doesn't often hear the term procedural applied to databases, although one recalls the work by Martin Newell (Newell 1975) on procedure models, in which he built a modelling system out of procedures, and all operations on the model (such as rendering) made calls to the procedure models. These then made the right responses and did the required things when asked. The behaviour of the objects to be rendered was encapsulated within the procedure models and not within the rendering algorithms. This is similar to the encapsulation concept of object-oriented programming languages.</p><p>The term object-oriented database is commonly used to mean that the unit of communication with the database is an object: you put an object in, and you get an object out. But this is a facile use of the term, since the crucial thing about object orientation is that the objects contain their own behaviour, and therefore the database needs to manage the procedures (methods) that define that behaviour. Further, communication with the database should be by a system of message passing where the user is isolated from the actual internal representations of the objects.</p><p>One view of an object-oriented database is that it is an extension of an object-oriented language to handle persistence, queries, concurrency (multi-user), security and integrity. Another view is that it is an extension of a conventional database to handle procedures, inheritance and message passing. It is a moot point whether the conventional divide between the ephemeral working data of the programming environment and the persistent data in the database should be maintained or not.</p><p>Commercial object-oriented database systems are only just beginning to appear in the marketplace. Relational databases suffer a number of serious shortcomings for applications in GIS (Egenhofer 1989, Frank 1988, Oxborrow 1989). Some of these problems stem from the particular features of the implementation of current commercial relational databases, from not using the facilities of those databases in the most appropriate manner, and from the awful syntax and current limitations of SQL (Herring et al 1988). 
However, the absence of real-world semantics in the relational model itself means that the tools provided are at a very low level.</p><p>All implementations of the relational model compromise Codd's rules to some extent, ranging from those which are no more than tabular representations to those which satisfy most of them. In particular, the semantics of range queries and versions are missing. However, tabular representations are a good way to make an efficient engine for managing persistent data. It is therefore our belief that, at the moment, a practical way to implement an object-oriented database across many platforms is to combine relational (or tabular) technology with an object-oriented language. This allows higher-level semantics to be embodied in the object-oriented environment of the language.</p><h3>The Magik Object-Oriented Programming Environment</h3><p>Magik is an extremely powerful language for the implementation of large interactive systems. The language is a hybrid of the procedural and object-oriented approaches, and program development is carried out in an interactive environment. The interactive environment allows changes to the system to be immediately tested, without a prolonged linking process and regardless of the size of the system.</p><p>We have implemented Magik in order to build an open, seamless development environment. The way this is achieved is by embodying the following features in the language and its development environment:</p><ul type="disc"><li>There is but one language for system, application and customisation development.</li><li>Both object-oriented and procedural methodologies are supported.</li><li>Development is in an interactive environment.</li><li>The language is expressive and very readable.</li><li>There is an extensive library of standard object classes, methods and procedures.</li><li>The language is built as a platform suitable for delivering commercial systems.</li><li>Applications can be transferred with a minimum of effort between hardware platforms.</li></ul><p>It is our belief that the presence of all these features is essential if commercial systems are to be developed, maintained and customised with a minimum of programmer effort. It is the lack of a viable language with a sufficient subset of these facilities that has stimulated us to produce our own which embodies all of them.</p><p>[ Figure 2 not available ]</p><p>Magik allows programs to be developed in one seamless environment, meaning that systems programming, applications development, system integration, and customisation are all written in one environment in the same language. Thus, end users who wish to customise the system can be confident in the quality of the tools provided because they are identical to the development tools used by the core and application system developers. Further, existing systems, such as most database management systems, can be fully integrated so that to the user they appear as part of one homogeneous system.</p><h3>The Virtual Database</h3><p>It has been said that GIS could be regarded as an integrating technology providing a window into many disparate distributed databases. If this goal is to be achieved, then an architecture is needed in which the databases to be integrated are set up as servers to the single client GIS.</p><p>There are a number of shortcomings in existing available database technology for the building of a GIS. 
Nevertheless, there are now many organisations which have committed in a big way to one of the emerging de facto standard database systems. It is not acceptable for a GIS vendor to try to displace such a database with something else tuned for GIS applications. It is necessary to engineer a solution which preserves the user's investment while at the same time doing as good a job as possible in providing a GIS capability. As mentioned above, if one tries to handle all data in the commercial DBMS, then it is highly likely that a serious performance problem will result. If one runs a geometric DBMS alongside, then serious problems of backout, recovery and integrity may result. It would seem that what is needed is some "virtual DBMS" which can act as a front end to two or more physical DBMSs, and that this should handle versioning (Easterfield et al 1990, Newell et al 1990) and have a knowledge of all aspects to do with data dictionary, integrity, and access to the various data. Data modelling of objects allows the user's models to be built with full recognition of their semantic content, a key feature not provided in the relational world (Worboys et al 1989).</p><p>We have built a low-level interface between the object-oriented and tabular worlds in which a table maps onto an object class, a record maps onto an instance and a field maps onto a slot. Higher-level abstractions are then modelled wholly in the object-oriented world. Such a representation is ideal for many of the navigational-style queries that one undertakes in a GIS.</p><h3>The User's View Of An Object-Oriented GIS</h3><p>Any system built using an object-oriented environment could also be built by other methods, and the end user may well be hard pressed to tell the difference. Sometimes, systems claim to be object-oriented because they are built out of an object-oriented language which is non-interactive. In such systems, end users are denied the major merit of the approach, which is the ability to modify and extend the system on line. The claim to object orientation may be valid, but so what, if the rest of the world cannot use it?</p><p>Icon-driven user interfaces are sometimes called object-oriented. This description might be justified if the icons represent data objects and, when clicked, they know what to do. For example, on the Macintosh, clicking on a document icon results in the appropriate word processor (drawing program, spreadsheet, etc.) being started. This differs from function icons, to which you must later supply the data.</p><p>For a GIS to deserve the term object-oriented it needs rather more than an object-oriented systems language and user interface.</p><p>The customiser of a system built with an interactive object-oriented front-end language is provided with an extremely open architecture in which he can access and use many existing classes and their methods. He is provided with browsers to explore this rich environment of existing functionality in order that he can utilise it and modify it to make his own extensions to the system. The analogy has been made to the hardware designer who is given an extensive library of standard components whose responses to given inputs he knows, even though it is unnecessary to know how they work internally. However, in the object-oriented world he can also make his own components (classes) which can borrow behaviour from one or more existing classes (multiple inheritance).</p>
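<p>The table-to-class mapping described above, together with the building example discussed in the following paragraphs, can be rendered as a toy sketch. Python stands in for Magik here purely for illustration, and every name in the fragment (the classes, the field list, the gas-consumption formula) is invented:</p><pre><tt># Illustrative only: a table maps onto a class, a record onto an
# instance, a field onto a slot. Nothing here is the real interface.

class Building:
    fields = ('id', 'footprint', 'height')     # the table's columns

    def __init__(self, record):
        # record to instance: each field value fills one slot
        for field, value in zip(self.fields, record):
            setattr(self, field, value)

    def volume(self):
        # behaviour understood by every kind of building
        return self.footprint * self.height

class House(Building):
    fields = Building.fields + ('occupants',)

    def gas_consumption(self):
        # behaviour specific to houses; the formula is invented,
        # standing in for the gas board's rules mentioned below
        return 50.0 * self.volume() + 300.0 * self.occupants

# One record from the houses table becomes one self-contained object:
house = House((1, 120.0, 7.5, 4))
print(house.volume(), house.gas_consumption())</tt></pre>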
<p>The GIS applications programmer perceives all items as objects which have their own behaviour. Although the data and behaviour may eventually be stored in separate locations (in our case, in a Magik object library and an underlying database), from the user's point of view the objects are self-contained items.</p><p>As a simple example, consider the following fragment from a GIS shown in Figure 3.</p><p>The object type BUILDING understands messages which are relevant for all types of building, such as footprint (square metres) and volume (cubic metres), which are then automatically inherited by HOUSE and OFFICE.</p><p>HOUSE extends the behaviour of BUILDING with, for example, approximate gas consumption according to the rules used by the gas board, knowing the volume of the house and the number of occupants.</p><p>[ Figure 3 not available ]</p><p>From the application programmer's point of view, he can retrieve HOUSEs from the database using queries on stored or calculated values.</p><p>Within a GIS context, for all these objects, spatial properties such as AREA may be inherited from SPATIAL_BEHAVIOUR. BUILDING and its sub-classes could understand and respond to messages like ADJACENT_TO, CONTAINED_IN, NEAR_TO, etc. Should a user wish to enforce his own definition of, say, NEAR_TO for a HOUSE, he could do so.</p><p>One step further is that geometry (one of the spatial attributes) is also treated as an attribute of an object. That is, the geometry of a HOUSE can be retrieved or changed using methods which operate on the object HOUSE; the fact that the actual geometry may be stored in many separate tables in an underlying database is irrelevant for the GIS applications programmer.</p><h3>Conclusions</h3><p>We have been engaged in developing a new kind of software architecture for building and maintaining large, interactive, database applications such as GIS. The main issue that we have tried to address is the large costs involved in developing such systems, and particularly the costs of implementation and customisation. It is our conclusion that object-oriented programming technology is sufficiently well understood and gives such astounding benefits that now is the time to apply it to real commercial systems. We strongly advocate that such an environment itself should be interactive and that access to all objects in the system should be available in the user interface, and not hidden deep in the bowels of the system where only the vendor's system programmers have access.</p><p>We advocate implementing an object-oriented database capability by front-ending a version-managed tabular datastore with an object-oriented language. Existing databases should also be accommodated in the same way, as large amounts of data are already committed to these databases.</p><p>We observe that GIS is particularly well suited to object orientation, and so the benefits of the approach are considerable.</p><h3>References</h3><p>Butler, R. (1988). The Use of Artificial Intelligence in GIS. Mapping Awareness and Integrated Spatial Information Systems, Vol. 2, No. 3.</p><p>Easterfield, M. E., Newell, R. G. and Theriault, D. G. (1990). Version Management in GIS: Applications and Techniques. Conference Proceedings, EGIS, Amsterdam, April 1990.</p><p>Egenhofer, M. J. and Frank, A. U. (1989). 
Object-Oriented Modelling in GIS: Inheritance and Propagation. Auto-Carto 9, Baltimore, April 1989.</p><p>Frank, A. U. (1988). Requirements for a Database Management System for a GIS. PE & RS, Vol. LIV, No. 11, November 1988.</p><p>Herring, J. R., Larsen, R. C. and Shivakumar, J. (1988). Extensions to the SQL Query Language to Support Spatial Analysis in a Topological Database. Proceedings of GIS/LIS '88, Vol. 2, San Antonio, November 1988.</p><p>Newell, M. E. (1975). The Utilization of Procedure Models in Digital Image Synthesis. Doctoral Dissertation, Dept. of Computer Science, University of Utah, Summer 1975.</p><p>Newell, R. G., Theriault, D. G. and Easterfield, M. E. (1990). Temporal GIS: Modelling the Evolution of Spatial Data in Time. Conference Proceedings, GIS Design Models and Functionality, Leicester, March 1990.</p><p>Oxborrow, E. and Kemp, Z. (1989). An Object-Oriented Approach to the Management of Geographical Data. Conference Proceedings: Managing Geographical Information Systems and Databases, Lancaster University, September 1989.</p><p>Worboys, M., Hearnshaw, H. and Maguire, D. (1989). The IFO Object-Oriented Data Model. Conference Proceedings: Managing Geographical Information Systems and Databases, Lancaster University, September 1989.</p></span>Alfred Sawatzkyhttp://www.blogger.com/profile/15845723068853240174noreply@blogger.com0tag:blogger.com,1999:blog-4451657389087868432.post-31412599636454040992011-08-29T07:51:00.000-06:002011-08-29T07:52:05.439-06:00Smallworld Technical Paper No.2 - The Difference Between CAD and GIS<span class="Apple-style-span" style="font-family: 'Times New Roman'; font-size: medium; "><p><em>by Richard G. Newell & Tom L. Sancha</em></p><h3>Abstract</h3><p>Although there are some similarities between CAD and GIS, there are many more differences. The most fundamental difference is that GIS models the world as it exists, whereas CAD models artifacts yet to be produced. As a result, the data that GISs manipulate is an order of magnitude larger and more complex than CAD systems have to deal with, and the nature of the data, its sources and its uses are quite different. This paper compares the two fields in terms of their technology, data, market, user applications and vendor organisations.</p><h3>Introduction</h3><p>CAD is used to design new objects, which have not existed in the world before, whereas GIS is used to build a model of the world as it exists, including its history, in order to understand, analyse and manage resources and facilities. The data necessary to represent the world as it is is enormously larger and more complex than the data necessary to represent new products. This fact leads to there being major differences between CAD and GIS. Anybody who has worked in both fields will immediately be struck by the difference in emphasis between the two, basically stemming from the vast differences in data volumes and the fundamental raison d'etre of the two kinds of system. 
The following are typical lists of terminology and jargon that one might hear if one attends a conference on CAD or GIS:</p><table><tbody><tr><th>CAD</th><th>GIS</th></tr><tr><td>drafting</td><td>graphical editor</td></tr><tr><td>layer</td><td>theme</td></tr><tr><td>solid modelling</td><td>terrain modelling</td></tr><tr><td>FE analysis</td><td>spatial analysis</td></tr><tr><td>NC manufacture</td><td>map production</td></tr><tr><td>drawing database</td><td>seamless mapbase</td></tr><tr><td>document management</td><td>version management</td></tr><tr><td>design system</td><td>information system</td></tr></tbody></table><p>Even a superficial examination of the parts of GIS and CAD that one might think are similar immediately reveals some important differences, namely the graphics:</p><ul type="disc"><li>CAD geometry is primarily constructed by a draftsman, whereas GIS geometry is scanned, digitized or surveyed.</li><li>CAD geometry contains many horizontal and vertical lines. Lines at regular angles are common. GIS geometry contains virtually no horizontal or vertical lines and, apart from right angles, other regular angles are rare.</li><li>In CAD, circular arcs and curves are essential; in GIS they are virtually non-existent. Some GISs do not even have a way of representing a curve, despite their frequent occurrence in urban areas.</li><li>In CAD a typical polygon has few vertices, often four; in GIS a polygon may have many thousands of vertices.</li><li>In GIS, operations such as mirror, rotate, scale and copy are unusual; on the other hand, lines of a 'fractal' nature, such as contours and coastlines, are common.</li><li>In CAD, schematic drawings, such as those used to represent electrical circuitry, are extremely stylised; but in GIS, the layout bears a close resemblance to the real world. However, the topology in both cases may well be very similar. There are a few forms of stylised map, a good example being the London Underground map, but these are not common.</li></ul><p>The use of databases in CAD is often peripheral to the main task, and is commonly not provided by the CAD vendors. Databases are used to hold such things as catalogues of standard components and drawing registers; seldom are they used in the mainstream of the design itself, save for a few systems which handle the design of complex assemblies [1]. In GIS, the database is the most important aspect of the system [2].</p><p>The authors of this paper have designed several major CAD systems (e.g. PDMS, the Plant Design Management System developed by CADCentre Ltd., and Medusa, developed by Cambridge Interactive Systems Ltd., now owned by Prime Computer Inc.). One of us is now developing an advanced GIS and has discovered that many fundamentals have to be rethought.</p><p>This paper describes many of the differences in not only the technology, but also the applications, the market, the typical users and the vendors.</p><h3>System Evolution - Darwin versus the Creationists</h3><p>One of the earlier applications of computer graphics was to apply it to problems of design, typically in mechanical and electrical engineering [3]. In order to try and broaden their market presence, early CAD vendors applied their systems to mapping to produce what is now known as a "CAD mapping system" [4]. 
In order to be effective, major extensions were needed in the handling of geometry, symbology, and data volumes.</p><p>In general, CAD systems are geared to handle individual parts; they seldom use a database except as a loosely interfaced adjunct on the side to hold catalogues or manage drawings. So in order to try and address the data volume problem, the next extension of CAD mapping systems was to hook them up to a proprietary database to try and produce what is known as a GIS. Unfortunately, this is trying to stretch the original CAD technology too far.</p><p>Early computer graphics was seen as an exciting new technological development with wide potential in many fields. Some had to wait a long time before realising the benefits of that early vision, and it was in the process of this evolution that the subject diversified into a wide range of different applications. From early point and vector graphics came the first electrical drafting systems, which then evolved into 2D mechanical design systems. The 2D systems evolved into 3D wireframe systems, which because of their roots soon became obsolete. What was really needed was not the Darwinian approach, but the Creationist approach, since as we all now know, solid modellers appeared from an entirely different beginning.</p><p>The process of evolution is ideal for refining and improving an initial concept, but if the concept is wrong or has important shortcomings, then there is a limit to what can be done. Unless a radical new beginning is made, an evolving system can become over-developed, software fatigue will set in and eventually the product will die, because better alternatives will emerge from elsewhere. It is a shame that users are not more courageous about moving from one system to another, because if they wait too long, they are only stacking up far worse problems for themselves in the future.</p><p>Owning a CAD system is rather like owning a dog: the prudent owner knows that the dog is likely to die at about the age of 12 or so, and therefore when his dog is about 8 or 9, he acquires a new dog, so that when the time comes, his new dog is well trained, has become part of the family, and the grief of losing his old dog is minimised. (This parable is not original; it was related to one of us by Roger Breuleux.) If one wants to see a more relevant example, just witness how the world is switching its software development from Fortran to C.</p><p>What has all this got to do with CAD and GIS? The early CAD vendors assumed two things: first, that utility companies, municipalities, environmental resource agencies etc. required mapping systems; and second, that these so-called mapping systems could be evolved from their electrical and mechanical CAD systems. Well, they were wrong on both counts: these users require GIS systems, which cannot just evolve from CAD systems.</p><p>It is commonplace these days to debase new terminology. How many database systems that handle tabular data label themselves "relational"? How many systems that have some concept of an object call themselves "object-oriented", and how many graphics and CAD systems that can put a map on the screen of a CRT call themselves "GIS"? Building a GIS requires a radically new approach technologically. The state-of-the-art systems now come from specialist companies, not from the major CAD vendors.</p><h3>Technology</h3><p>The normal approach to tackling a major design project with a CAD system is to segment the problem into manageable pieces. 
Each piece corresponds to a part, a subassembly or a main assembly, the limitation being the amount of data that one designer can hold in his head at one time and then get on one drawing. Partitioned databases seem to be the order of the day. Such databases may be managed by an additional continuous database which performs the function of document management. In most design situations, the problem is straightforwardly segmentable, and therefore document management techniques are appropriate for handling such functions as drawing issue, approval, archive, revision and change control. One area where this approach breaks down is in plant design which, due to the three-dimensional complexity and the difficulty of segmenting the problem either by discipline or space, requires a continuous database. However, most modern plant design systems segment the database, which makes the system much easier to implement for the CAD vendor, but much less convenient to use for a large design team.</p><p>GIS systems which have evolved from CAD systems work with partitioned databases, corresponding to map sheets, but this is not what the user requires. Our world is continuous: there are extremely few places where you could naturally segment a continuous landscape into pieces, and therefore the GIS system must always work on a continuous database. Further, some attempts to extend a CAD system into GIS involve hooking the CAD system into some third-party database management system. Unfortunately, today's generation of database systems does not address all aspects needed for a GIS. One might hope that marrying together two inadequate technologies would solve the problem; however, these two technologies do not integrate too easily. In such systems, the data management facilities of the database system are not fully accessible from the CAD system command language, nor are the graphics facilities of the CAD system available from the database query language.</p><p>Whereas CAD is based on a database of parts, managed by a document control system, GIS should be based on a continuous database management system with integrated graphics facilities. If one had to choose, there is a better chance of extending a database system into a GIS than extending a CAD system into a GIS. Partitioned databases do make life much easier when trying to develop new systems. Continuous databases introduce some particularly difficult problems:</p><ul type="disc"><li>Performance</li><li>Concurrent users</li><li>Efficient spatial retrieval</li><li>Multiple version management</li></ul><p>In GIS these issues must be addressed, save for a small number of niche applications of limited scope.</p><h3>Data Capture</h3><p>In CAD no sensible company would insist that the whole of their existing drawing archive was input to the computer before they could start work. With a CAD system you can start working with a blank sheet of paper. In GIS, you cannot even start before you have an adequate mapbase. People will argue as to what is regarded as "adequate" (e.g. raster backdrop versus vectorised versus full topological model), but all will agree that some form of mapbase completely covering the area of the GIS is essential.</p><p>Some people think that it would be the mechanical designer's dream if a machine existed which was capable of turning his drawing archive into a CAD database automatically. 
Assuming that the resulting database is to have value beyond merely the replacement of the paper archive (which may well be meritorious in its own right), there is a requirement that the database must contain elements at the highest possible level, ideally objects. This means not only vectors, circles, arcs and curves, text strings, and symbols (all of which are now looking practical to achieve), but also topology and geometric precision. These latter two are essential if the CAD drawing is to be used for anything other than editing it and redrawing it again on paper. This might include parts listing, connectivity analysis and generating manufacturing information.</p><p>There is a parallel in the problem of trying to capture a map base which is to be incorporated into a GIS, but again, if only vectors, text and symbols are recognized then the resulting map database has little use beyond that of a backdrop. Again, the meanings of lines need to be captured together with their topological relationships in order to be able to represent point, line and area features. Further, there is the additional requirement of being able to associate these geometric features with other real-world data held in the database. It is interesting to compare and contrast these two apparently similar problems. It could be true that up to the point of obtaining vectors, text and symbols the two problems are similar (which is probably why many scanner and vectoriser vendors address both markets), but the methods of obtaining topology and precision are different.</p><p>A typical map has many more differing line styles than a mechanical drawing. Each line in a map will correspond to one or more features whose codes need to be associated with it. The identification of topology in a map means recognizing point, line and area features and how they relate. In the case of mechanical drawings, lines need to be collected into objects, and such things as dimensions need to be assembled and then associated with dimension points.</p><p>In GIS it is often the case that the precision of the lines on paper, when properly corrected for paper distortion by means of software rectification procedures, is good enough for inclusion in the database. This is rarely the case with mechanical drawings. The precise geometry is specified by the dimensioning and other annotation and not by the lines of the drawing. This information has to be extracted by sophisticated techniques such as variational geometry; it is usually more efficient to recreate the drawing via a suitably tuned drafting system.</p><p>Both CAD and GIS systems have a requirement for geometric construction, and in the introduction to this paper we described a number of the differences in emphasis. However, in GIS there is an additional difference in the processing of surveyed data. Here it is not just a simple matter of applying Euclidean geometry, because surveyed data usually contains deliberate redundancies for cross-checking purposes. Closing a traverse and triangulating a new survey point from more than two existing points are common examples. For small-scale work, there is the additional complication of taking account of the earth's curvature.</p><p>Although the widespread application of CAD technology is about ten years ahead of the application of GIS, bulk data capture for GIS is far more widespread than it is in CAD.
However, this may change as the realisation dawns that CAD is but a small part of the overall requirement to handle design documentation and technical publication in the large. The use of scanned, unvectorised drawings is now considered to be an acceptable approach in systems such as desktop publishing. It is quite likely that systems of the future will address the overall problem of technical and design documentation using inputs from whatever technique seems appropriate, including scanned input, desktop publishing, word processing and of course CAD.</p><p>A unique characteristic of map data is that it is in many cases inaccurate and fuzzy. Often a map base will contain data of different standards and qualities. Sometimes the absolute positions of things are not known accurately, but relative positions may be known precisely. For example, it may be known exactly where a pipe is buried in a road, even though the road centre-line is only known approximately.</p><p>All this leads to an important requirement for the GIS of the future to hold quality metadata in the database. Such metadata will describe the source, accuracy, reliability and completeness of all data in the database. Without this, a GIS in the wrong hands could become a very dangerous tool. CAD does not seem to have anything remotely resembling this requirement.</p><p>In CAD databases we are faced with the inverse problem: the database model of an object may be exact, but the real-world manifestation of the object can only be an approximation to the database, the limits on the amount of variation allowed being specified by means of tolerances.</p><p>One whole area unique to GIS, but having no equivalent in CAD, is remote sensing, for example by means of satellite. Here, the problem of capturing the data has been solved; the question arises as to how to handle the vast amounts of it that rain down upon us from the heavens. For example, to cover the UK at a resolution of 30 metres (e.g. Landsat) requires about 300 million data samples, and that's just one coverage at one time. The image processing techniques for handling this data have not found much application in CAD yet.</p><p>Lastly, we mentioned at the beginning that GIS systems are not only meant to model the existing world, but also its history. This comes under the heading of handling temporal data, which is still a research topic [6]. CAD has an allied problem in version management, which it usually neatly sidesteps by pushing the problem into the document management system. This of course is possible if the database is partitioned into documents, but is more difficult if one has a large, continuous, multiply accessed design database.</p><h3>Applications</h3><p>The emphasis of CAD is on the creation, modification and documentation of design information. The emphasis of GIS is on modelling, analysis and management of geographically related resources. CAD is used to create a model of something that does not yet exist in order that it can be made; GIS is used to create a model of the existing real world in order that it can better be understood, that we can monitor change and plan for a better use of our earth's resources in the future.</p><p>Whereas CAD is used to design and create many of the products that we humans use to destroy this planet, hopefully GIS will be a tool for environmentalists to use to prevent this disaster. This is a CAD journal, so most readers are familiar with applications of CAD; if not, they need only read a few back numbers.
As the readers may be less familiar with GIS applications, we list here a few which illustrate the potential of the technique:</p><p><strong>Birds - </strong>The Royal Society for the Protection of Birds were interested in finding out the distribution of a small wading bird, the Dunlin (<em>Calidris alpina</em>), on Scottish moorlands. They had commenced a ground survey of moorland birds and came to the conclusion that, given the resources available, it would take 25 years to complete. A preliminary investigation which showed a correlation between known breeding grounds and just one spectral band of Landsat data led them to investigate other potential areas. Despite initial scepticism, RSPB representatives investigated some of the early predictions, with the pleasant surprise that a number of them were well populated with breeding Dunlin. The potential of this technique for predictive modelling is obvious.</p><p><strong>Vineyards - </strong>An organisation in Spain wished to plan the best locations for future vineyards. Vineyards can only be planted on land that is available for agricultural use, where the soil type is suitable, the ground slope is within certain limits and the aspect is towards the sun. This problem is solved by taking separate map overlays (themes) representing land use, soil type, ground slope and aspect to perform what is known as an overlay analysis, so that all areas with the desired properties can be identified.</p><p><strong>Water - </strong>A water company receives a number of reports from customers that they have low water pressure. The water company has a GIS containing a complete representation of the network, together with the relationship between all branches of the network and the consumers' addresses, so the system can make an accurate prediction of the location of the pipe burst.</p><p><strong>Planning - </strong>A planning department receives an application from a house owner to extend his house. One of the initial tasks is to find the address on the map; from the map, all other property owners that might be affected are deduced, and a letter to each of them is drafted. The planning application can be compared against other applications, master planning schemes and urban classification zones, as well as checked against possibly affected services such as water, sewerage and electricity.</p><p>The above applications of GIS are typical examples which serve to demonstrate the entirely different nature of GIS from CAD. In short, GIS is much more of an information retrieval and analysis system than it is a system to create new data.</p><h3>The Market</h3><p>If one asks a CAD marketeer to classify the market according to different industry sectors, then he will come up with categories like:</p><ul type="disc"><li>Aerospace</li><li>Automotive</li><li>AEC (a confusing and vague term)</li><li>Electronics</li></ul><p>If one asks the same question of a GIS marketeer, then one will get the following:</p><ul type="disc"><li>Utilities</li><li>Local and regional government</li><li>Environmental resource agencies</li><li>Transportation</li><li>Cartographic institutes</li></ul><p>Within individual organisations of course there may be a requirement for both technologies, but this will frequently be in different departments. For example, the planning department may use a GIS to plan the location of a new hospital, but the architects' department would use a CAD system to design the hospital.</p><h3>The Vendors</h3><p>It is true to say that CAD system sales have polarized.
On the one hand there are standardised, low-price PC products sold via large distributor networks, and on the other there are high-priced, customised, special-purpose CAD systems running on more expensive hardware. The latter kind of sale requires a much greater degree of consultancy, customisation and long-term relationship with the customer than the former. The days of the expensive standard system are nearly over, save for those users who must maintain compatibility with their existing investments.</p><p>The distribution structure of PC-based products is different from the sales organisation of larger systems. The software is usually distributed by a network of dealers who make the actual sale to the end user. The dealer is dependent on the hardware component of the sale in order to maintain his margin. He will be disinclined to get involved in lengthy education, training and customisation, since the amount he can afford to spend on each sale is limited. His geographic area is usually relatively small, and he will not be focused on the particular product. He will usually be selling other CAD products and a wide range of hardware as well. Some distributors may have a separate division in order to provide customisation services, but this is often not the case.</p><p>Large system sales require a different approach. The customer requires extensive benchmarking, often involving some preliminary customisation. The sale is accomplished by direct selling, often to a fairly high level in the company; the sales team must contain highly qualified people who are capable of understanding the broader requirements of the customer in order that they can translate them into a customised system at the end of the day. To make customisation economic, the demands placed on the product itself to provide suitable tools are greater.</p><p>The vendor organisation of a GIS system is nearer to the one that is now required for selling up-market CAD systems. That is not to say that there are no PC GIS systems; there is clearly a market for these systems, which can tackle certain niche applications of limited scope. However, it will be a few years before PCs can handle the data volumes and CPU requirements of a full-blown GIS. Today, implementing a GIS in a new organisation can be a very large task, which will involve the vendor for far longer than the typical CAD installation. The GIS vendor will need to provide extensive consultancy and customisation in the areas of data modelling, capturing the map base and interfacing to existing (sometimes very large) database systems, which implies extensive systems integration. Few CAD vendors are geared up for this scale of effort.</p><h3>Conclusion</h3><p>The CAD market is now mature, but the GIS market is still in its infancy. Thinking that GIS is simply digital mapping, several of the established CAD vendors tried to adapt their CAD systems for GIS applications. This resulted in most unsatisfactory compromises.</p><p>Despite the great expectations of CAD revolutionising the design processes, today most CAD systems are only used to produce drawings and pictures. Indeed the market is dominated by one PC-based drawing system. This paper has shown the differences between the technology, the data, the applications and the usage of CAD and GIS systems.
Will the evolution of GISs follow the same lines as the evolution of CAD and, if so, what will be the surviving core that is most used?</p><p>It will be interesting in the future to see how each technology will borrow from the other. One application, facilities management, could be considered to be positioned somewhere between the two, and future solutions will need to borrow from both, particularly the ability to handle large databases and seamless mapbases from GIS and the design capabilities (sometimes 3D) of CAD.</p><p>The CAD users of today have their drafting systems and their modellers, and the main problem that they now face is the management of the large amount of design information generated, as well as the integration with manufacturing and other corporate activities. This is a problem in managing and coordinating large distributed database systems, a problem also encountered in GIS [7]. References to the shortcomings of today's commercially available database offerings are all too common in the literature now [8]; this means that new generations of database will appear in the future to meet the challenge.</p><p>The requirement of a GIS to be modified and customised will force the pace of development of much more powerful software tools. The object-oriented approach to structuring major software systems is becoming much more widely understood now, even though commercial systems embodying the approach are few and far between. We are beginning to see the arrival of systems which attempt to address the database issue in an object-oriented environment. This technology will enable a quantum leap to be made in both GIS and the CAD systems of the future.</p><h3>References</h3><p>1. Neumann T. CAD Data Base Requirements and Architectures, Computer Aided Design Modelling, Systems Engineering, CAD-Systems (Ed. J. Encarnacao), Springer-Verlag, Berlin, 1980, pp 262-292.</p><p>2. Smith T. R., Menon S., Star J. L. and Estes J. E. Requirements and principles for the implementation and construction of large-scale geographic information systems, Int. J. Geog. Info. Systems, 1987, 1(1), 13-31.</p><p>3. Elliott W. S. Computer-aided mechanical engineering: 1958 to 1988, Computer Aided Design, 21(5), 275-288, 1989.</p><p>4. Boyle A. R. and Boone D. Computer aided map compilation, Computer Aided Design, Conference Publication No. 86, IEE, London, 1972, pp 400-406.</p><p>5. Burrough P. A. Principles of Geographic Information Systems for Land Resources Assessment, Clarendon Press, Oxford, 1986, Chapter 6.</p><p>6. Hunter G. J. Non-current data and geographical information systems: a case for data retention, Int. J. Geog. Info. Systems, 1988, 2(3), 281-286.</p><p>7. Cowen D. J. GIS versus CAD versus DBMS: What are the Differences? Photogrammetric Engineering & Remote Sensing, Vol. LIV, No. 11, November 1988.</p><p>8. Ecklund J. D., Ecklund F. E., Eifrig R. O. and Tonge F. M. DVSS: A Distributed Version Server for CAD Applications, Proceedings of the 13th VLDB Conference, Brighton, 1987.</p></span>Alfred Sawatzkyhttp://www.blogger.com/profile/15845723068853240174noreply@blogger.com0tag:blogger.com,1999:blog-4451657389087868432.post-70705180592669001182011-08-29T07:49:00.001-06:002011-08-29T07:51:02.291-06:00Smallworld Technical Paper No. 1 - Ten Difficult Problems in Building a GIS<span class="Apple-style-span" style="font-family: 'Times New Roman'; font-size: medium; "><em>Richard G. Newell & David G.
Theriault</em></span><span class="Apple-style-span" style="font-family: 'Times New Roman'; font-size: medium; "><h3>Abstract</h3><p>Building a GIS is a fruitful area if one likes the challenge of having difficult technical problems to solve. Some problems have been solved in other technologies such as CAD or database management. However, GIS throws up new demands that require new solutions. This paper examines ten difficult problems, explains why they need to be solved, and gives some indication of the state of the art of current solutions.</p><h3>Introduction</h3><p>The subject of Geographical Information Systems has moved a long way from the time when it was thought to be concerned only with digital mapping. Whereas digital mapping is limited to solving problems in cartography, GIS is much more concerned with the modelling, analysis and management of geographically related resources. At a time when the planet is coming under increasing pressure from an ever more demanding, growing population, the arrival of GIS technology has come none too soon.</p><p>However, there is a widespread lack of awareness as to the true potential of GIS systems in the future. When the necessary education has been completed, will the systems be there to handle the challenge? It has to be said that the perfect GIS system has not yet been developed.</p><p>Many of the fundamental problems have now been identified, and indeed attempts (sometimes good ones) to solve them have been embodied in today's commercially available systems. However, no system has an adequate solution to all of them and indeed, some are not addressed by any system (Rhind and Green, 1988).</p><p>In this paper we have chosen to examine ten important problems which we believe need to be solved. Ten is a somewhat arbitrary number since it would have been easy to find many more. However, one has to draw the line somewhere. Also, we call the problems difficult, as opposed to unsolved, because for most of them there is now some light at the end of the tunnel and, indeed, implemented solutions exist.</p><p>The fundamental causes of difficulty in GIS stem from the huge volumes of spatially related data, the nature of the data itself, and the wide spectrum of different organisations with diverse application requirements. Problems can be classified under the following four categories:</p><ul><li>Data capture - this will be the single biggest cost in implementing a new system.</li><li>Performance - most systems strain under the volume of data and scale of the tasks invoked.</li><li>Customisation - every organisation is different; no black box system is going to be adequate.</li><li>Integration - today's systems are disjoint; in the future, data and functions will need to be provided in one seamless environment.</li></ul><p>Our ten problems are really different manifestations of these four, and so the rest of this paper is devoted to a commentary on each of them in turn. As most organisations are at the data capture stage, it is not surprising that this is the area where most work has been done in trying to improve the process.</p><h3>The Topology Capture Problem</h3><p>The lack of a really good solution to this problem is probably going to contribute most to the slow speed of implementation of GIS systems in the future.
Although much progress has been made using scanning and vectorisation technology to capture map geometry (Nimtz, 1988), the addition of feature information, the generation of meaningful topology and the association of user data involve a large human element. Some modern topological editors do help to ease the burden (Visvalingam et al, 1986), but essentially the only solution is to throw man-hours at it (Chrisman, 1987). Thus for some time we will be seeing expedient solutions using raster backdrops without topology.</p><h3>Large Data Volumes</h3><p>The volume of data required to represent the resources and infrastructure of a municipality is of the order of 0.5 to 1 gigabyte per 100,000 of population; supporting examples can be found where an urban region would require 14 million points simply to represent the map planimetry (Bordeaux, 1988). Not only are the municipal records required to be handled in the database, but also large amounts of map data in the form of coordinates, lines, polygons and associations of these. Today's database technology is barely up to the task of allowing the handling of geographic data by large numbers of users with adequate performance. Serious questions have been raised as to whether the most popular form of database, the relational model, will be able to handle the geometric data with adequate response. Certainly, if this data is accessed via the approved route of SQL calls, the achievable speed is orders of magnitude less than that which can be achieved by a model structure built for the task (Frank, 1988).</p><p>A common approach to this problem is to partition the database into separate map sheets. Frequently this is achieved by interfacing a sheet-based CAD drafting system to a database (but see continuous mapping below). Another approach for avoiding the shortcomings of commercially available database systems is to build a special purpose database system to handle the graphics, which lives alongside the commercially available database system. A third approach, based on checking out subsets of the database, followed by checking in the changes, can overcome the problem of reviewing and editing (but see continuous mapping and version management below).</p><p>The idealistic solution of implementing one database system to handle both kinds of data with optimum performance is a large undertaking, and even then does not satisfy the user who has already committed to an existing commercially available database.</p><h3>Continuous Mapping - the seamless database</h3><p>As one begins to build a GIS, the problem of large data volumes appears to be the most important one to solve. A solution in the form of the map sheet approach, chosen by most systems today, simply mirrors what happens with physical map sheets. This is then combined with map management techniques to try to hide the cracks. While this partitioning helps to solve the first problem, it destroys the end user's dream of a single, continuous, seamless landscape. At best this may be termed contiguous mapping, but in all cases such schemes lead to difficulties when one is trying to deal with objects which span more than one map sheet.
The problem of dealing with variable-density data, for example between urban and rural areas, is tackled in an ad hoc manner, as there is no single size of map sheet to yield optimal performance in all cases.</p><p>The reason that a map sheet approach is popular is that it can easily be implemented with a conventional filing system and it is straightforward to find which files should be searched for a given spatial query. However, the system can hardly be called integrated, since the geometry, topology and the relationship between graphic and application data are all handled in an ad hoc manner.</p><p>What is now demanded is that map data is handled just as it is in the real world: truly continuously, with no cracks. But this must be done so that spatial queries of the form "inside an area" or "near to an object" can be handled efficiently. A common query is of the form "conforming to such-and-such and is on the screen", i.e. filtering out the rest of the world. Normal indexing methods provided with conventional database systems are inadequate to answer this type of query, and so a variety of spatial indexing methods have been invented to try to solve this problem.</p><p>The general idea is to construct a tree-structured index (e.g. quad tree, R-tree, field tree) whose leaves are not disjoint: they overlap in two-dimensional space, and this accommodates objects of different sizes at different levels. In this respect, the quad tree (Samet, 1984) is beautifully simple but has serious deficiencies for large data volumes. The R-tree (Guttman, 1984) is rather more complicated but is more efficient. Systems which are optimal for retrievals are not necessarily optimal for updates and vice versa. As data is read far more often than it is written, the emphasis should be on providing maximum retrieval performance, while still maintaining an acceptable rate of interactive update.</p><p>However, given today's disk technology, it matters little how efficient the index is if the data is scattered on disk. Thus it is essential that data is clustered on disk so that spatially similar data have similar addresses (Bundock, 1987). This can be achieved in the relational and other tabular models provided that data is stored ordered by a properly generated spatial key. There are a lot of good ideas now in this area, but it is yet to be seen which are the best.</p><h3>Accommodating Existing Databases</h3><p>There are a number of shortcomings in existing available database technology for the building of a GIS. Nevertheless, there are now many organisations which have committed in a big way to one of the emerging de-facto standard database systems. It is not acceptable for a GIS vendor to try to displace such a database with something else tuned for GIS applications. It is necessary to engineer a solution which preserves the users' investment while at the same time doing as good a job as possible in providing a GIS capability. As mentioned above, if one tries to handle all data in the commercial DBMS, then it is highly likely that a serious performance problem will result. If one runs a geometric DBMS alongside, then serious problems of backout, recovery and integrity may result.
It would seem that what is needed is some "virtual DBMS" which can act as a front end to two or more physical DBMSs, and that this should have a knowledge of all aspects to do with data dictionary, integrity, and access to the various data.</p><h3>Version Management - the problem of the long transaction</h3><p>It is necessary that large databases can be accessed simultaneously by multiple users. The normal assumption in a commercial DBMS is that a transaction is a short-term affair, and that a locking system can be used to keep users from unintentionally interfering with one another. Such methods are used to preserve the integrity of the database at all times. As far as any particular user is concerned, there are two versions of the database of which he is aware: the version when he started his transaction and the version he can see at the instant prior to committing it. The problem is that when a user is in mid-transaction in such systems, all other users are locked out of doing certain things, particularly updating any records that have been locked. This is all fine, provided that the transactions are short.</p><p>However, in a GIS system, as in design database systems in general, a transaction can be very long indeed, possibly days or even weeks. A planner, for example, may take a considerable time to complete his work before he gets it approved and therefore available to the other users of the system. In these circumstances it is totally unreasonable to expect other users to avoid the area.</p><p>Copying the whole database is completely impractical in many cases. A system of check-out/check-in can be used, but this gives rise to the problem of achieving integrity through combining incompatible changes in different checked-in versions. It is quite possible for two separate sets of changes to have integrity, but for the combination to lack it.</p><p>Users who are starting the implementation of a GIS system may be blissfully unaware of this problem. To start a project, most organisations will initiate a pilot study: a simplified data-model, only one or two users of the system and test data that can be thrown away if the pilot database becomes corrupted.</p><p>The situation in a production environment is the opposite of the above: a complex data-model, a large number of users and concurrent activities, and a large investment in the creation of the database. The problem of version management could dominate all database activities in a production system and yet does not manifest itself in a pilot project.</p><p>Versioning is not addressed by today's round of DBMSs. The complete problem covers the management of alternatives, the management of chronology (what did my database look like three months ago?), the management and policing of change, and various forms of semantic versioning such as time series, variant designs and "as built" records.</p><h3>Hybrid Vector Raster Database</h3><p>It is an amusing paradox that whereas the bottleneck with vector-based GIS is in capturing the map base, the problem with remotely sensed data is the sheer volume of it that pours down upon us from the heavens, a veritable fire-hose of data. For example, SPOT1, at 50Mbits/second, captures an area 60km x 60km in 9 seconds.
Increasing the resolution to 5 metres would mean boosting transmission to 200Mbits per second for every hour of daylight.</p><p>Whereas the value of an attribute-less map base is limited, the value of satellite data is potentially immense, for here surely we can observe and monitor the catastrophic way in which we humans are changing and damaging this paradise in which we live. Green issues aside, the vector people are still struggling with data capture while the image processing people have a major computational problem with the sheer bulk of data.</p><p>These two forms must come together in one system, since the combination will yield far more than the parts. Currently vector-based GIS systems tend to be separate from raster-based image processing systems. Raster data is derived from a variety of sources, e.g. rasterized paper maps (from scanners), grid data derived from digital terrain models and remotely sensed data (from satellites).</p><p>Many applications, such as overlay analysis (described below), recording changes in land use and changes in the environment, can be tackled much more effectively when the two forms of data can be manipulated, viewed and analysed in one integrated, seamless environment.</p><h3>Overlay Analysis - a problem of robustness</h3><p>Given a multi-thematic map base, overlay analysis is used to answer questions of the type: "show me all areas with sandy soil and agricultural land use, where the ground slope is less than 5 degrees, because I want to grow potatoes".</p><p>The idea behind overlay analysis is very neat: one starts by producing a single combined topological model of all the themes which are to contribute to the analysis. Producing the combined topology is both a geometric as well as a topological problem. Once it has been produced, all queries and analyses can be carried out wholly in the topological domain without any further reference to the geometry, and are therefore amenable to being tackled via conventional query languages such as SQL (Frank, 1987).</p><p>The problem is that producing the combined topology robustly is far less trivial than it seems at first sight. Anyone who has wrapped their brain around the problem of producing a robust solid modeller will know the kind of pathological cases that can arise. Fortunately, the problem is not quite as bad as producing a robust solid modeller; however, any system implementor who underestimates the difficulty of this problem had better beware: schoolboy geometry just isn't good enough.</p><p>But robustness is not the only problem. It turns out that, not infrequently, two or more themes may share the same boundary curve. Typically these will have been digitized separately and with different accuracies, therefore their representations will differ in detail. If two such themes are now combined, then many superfluous small polygons (called slivers) will be produced. Slivers can easily outnumber the real polygons, leading to an explosion in data volumes, messy graphics, a collapse in efficiency and questionable analysis. Sliver removal is just another of those irritating issues which must be addressed if a practical system is to result.</p><p>In any analysis, it is important to try to make some assessment of the accuracy of the results. In practice, each theme of data will contain inaccuracies and errors, the bounds of which may be known.
The consequent effects of these individual errors on the results from many combined themes may be impossible to gauge (Walsh et al, 1987); overlay analysis systems available today should carry the caveat "let the user beware". (As an aside here, the whole subject of quality metadata, including such aspects as source, accuracy and coverage, would be a good candidate for the eleventh problem.)</p><p>Overlay analysis is one of those techniques which is mind-blowingly tedious to do without a GIS. However, without robust, clean algorithms there is a grave risk that the results produced will be meaningless.</p><h3>Very Large Polygons</h3><p>When Slartibartfast designed the coast of Norway (The Hitch Hiker's Guide to the Galaxy, Adams, 1979), he was really setting a major challenge for the designers of Scandinavian GIS systems. He probably also designed the Stockholm Archipelago, with its 23,000 islands, as a challenge to those trying to do spatial analysis in the Baltic. Whereas very large linear features can be straightforwardly subdivided into manageable pieces, very large polygonal areas, including those with many embedded islands, are not amenable to the same techniques.</p><p>However, one doesn't need to travel to scenic northern regions to encounter this problem, since very large polygons can occur in everyday applications. The GIS designer is confronted with the problem of doing spatial analysis on small parts of such polygons, as well as answering questions to do with inside and outside, without processing the millions of points which there may be in such polygons.</p><p>As with very large databases, the goal is to produce systems whose performance is dependent only on the data of interest and not burdened with the irrelevant.</p><h3>The Front End Language - the seamless programming environment</h3><p>A problem with GIS customers, seen from the system vendor's viewpoint, is that they are all different. Each customer will have a different data-model and will want substantial enhancements to his user interface and functionality. Whereas there are good facilities for data-modelling, the tools for developing and customising today's systems leave a lot to be desired. Much effort has gone into toolkits for developing and customising systems, including standard graphics libraries, user interface managers, data managers, windowing systems etc. However, if one wishes to get in and program any of the systems, one is confronted with operating system commands, Fortran, C, SQL, embedded SQL, some online command language, domain-specific 4GLs or a combination of these; not to mention auxiliary "languages" to define user syntax, menus, data definitions, data dictionaries etc. With these kinds of programming tools it can take many man-months of skilled programmer effort to achieve even modest system customisation.</p><p>A common problem with systems whose parts are front-ended by different languages is that they cannot be properly integrated. For example, a graphics system for mapping which is "hooked into" a database typically does not allow the full power of the database to be accessed from within the graphics command language, nor can the power of the graphics system be invoked from within the database query language.
What is really needed is a system such that all data and functions can be accessed and manipulated in one seamless programming environment (Butler, 1988).</p><p>What has been shown by a number of organisations is that the same development carried out with an on-line object-oriented programming language can cut such development times by a very large factor (e.g. 20). Object orientation does not just mean that there is a database with objects in it, but that the system is organised around the concept of objects which have behaviour (methods). Objects belong to classes which are arranged in a hierarchy. Subclasses inherit behaviour, and communication between objects is via a system of message passing, thereby providing very robust interfaces.</p><p>Firstly, such a language should be able to provide facilities covering the normal requirements of an operating system, customisation, applications programming and most systems programming. Secondly, the language should have a friendly syntax that allows casual users to write simple programs for quick online customisation. Larger programs should be in a form that is easily readable and debuggable. Thirdly, the language must be usable in an interactive mode as well as being able to be "supercharged" by means of full compilation.</p><p>There are several languages around which satisfy some of these requirements: Basic is all right for simple programs (but look at what some people now achieve in the micro world with some of the more advanced modern Basics); Lisp has much of the power and speed, but is hardly readable (but just see what Autocad has achieved with Autolisp); Smalltalk has both speed and object orientation, but with the total exclusion of any other programming construct. Postscript is very widely used and has a number of the desired features, but is another write-only language (i.e. "unreadable" by anyone, including the original programmer). Hypertalk is wonderful, but you would not write a large system in it. C++ has much of the required syntax and semantics, but it is not available as a front end language and can therefore only be accessed by a select few system builders, normally employed by the system vendor.</p><p>Having dismissed most of the well-known languages developed during the last 30 years, what then is required? It is an on-line programming language, with a friendly Algol-like control structure and the powerful object-oriented semantics of languages like Smalltalk.</p><h3>The Query Language</h3><p>Modern query languages such as SQL are not sufficient in either performance or sophistication for much of the major development required in a GIS system; but then one would argue that they were not intended for this. One can see why people like SQL; it can give immense power in return for some fairly simple "select" constructs. A problem which has to be addressed is spatial queries within the language, since trying to achieve this with the standard set of predicates provided is extremely difficult and clumsy. (An example of a spatial query is to select objects "inside" a given polygon.)</p><p>If the route adopted is to provide two databases in parallel, a commercial one driven by SQL and a geometry database to hold the graphics, then there is a problem constructing queries that address both databases. Ideally, the query language should be a natural subset of the front end language, allowing access to the same seamless environment that the front end language provides.
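</p><p>(A present-day aside: to make the "inside" example concrete, here is a minimal sketch in Python, which is not a language the paper proposes, of how a spatial predicate might read when the query language is simply a subset of the front-end language. The feature records and the select helper are hypothetical stand-ins for whatever a real GIS environment would provide.)</p><pre>
# Minimal sketch: a spatial "inside" query written directly in the
# front-end language rather than in a separate query language.
# Feature records and select() are hypothetical stand-ins.

def point_in_polygon(point, polygon):
    """Ray-casting test: is point (x, y) inside polygon (a vertex list)?"""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Toggle for each edge crossed by a horizontal ray to the right.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def select(features, polygon):
    """The spatial predicate, usable like any other language construct."""
    return [f for f in features if point_in_polygon(f["location"], polygon)]

features = [
    {"name": "valve 1", "location": (1.0, 1.0)},
    {"name": "valve 2", "location": (5.0, 5.0)},
]
triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
print([f["name"] for f in select(features, triangle)])  # -> ['valve 1']
</pre><p>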
Much work needs to be done in the area of query languages for GIS.</p><h3>Conclusion</h3><p>Ten years ago the evolution of CAD systems was at about the same stage as the evolution of GIS systems is now. One can recall the arguments over whether mechanical design systems should be 2D or 3D; what was it that allowed a 3D system to claim to be a solid modeller? The purists endlessly discussed CSG versus BReps; Euler's rules had to be obeyed at all costs, while in the meantime the faceted modeller brigade stole the show. The surface designers insisted that C<sup>2</sup> was a minimum and some pronounced that C<sup>7</sup> was essential! Your surface system was definitely not macho unless it was founded on NURBS, even though the humble conic can adequately handle 95% of practical problems. Today, most of the world seems to be happy hacking away on a fairly modest little drafting system called Autocad.</p><p>So it is now with GIS: What is a GIS and what isn't a GIS? Do the ever-longer synonyms for GIS say anything new? Should the system be based on raster or vector? One of us recently asked a vector bigot what percentage of spatial analysis problems can be tackled in the raster domain; back came his answer: "None!". Arguments about topological models are rife, while protagonists argue about what are the acceptable ways to break the strict rules laid down for data-modelling in order to achieve a working system.</p><p>Let us beware in GIS: too much concentration on how things are done, as opposed to what we are trying to do, will lead to much wasted time and wrong decisions being made. It is all very well for the salesman to try to identify unique selling features in his product in order to exclude another vendor from the sale, but let us learn from the experience of CAD when this is happening. It is important to identify the real application requirements of a system; meeting these can be crucially dependent on that system having strong fundamentals. These can be exposed by careful benchmarking (McLaren, 1988).</p><p>In this paper we have attempted to lay bare ten difficult problems in GIS. In fact they are not wholly independent problems; indeed, some are a consequence of trying to solve another. The encouraging thing is that most of them have now been identified, and indeed for many there are some promising solutions.</p><h3>Acknowledgements</h3><p>The authors gratefully acknowledge Robin McLaren for reading the draft of this paper and thank him for his thoughtful comments.</p><h3>References</h3><p>Adams, Douglas (1979). The Hitch Hiker's Guide to the Galaxy, Pan Books.</p><p>Bordeaux, La Communauté Urbaine de (1988). Programme du Concours pour un système interactif graphique de gestion de données urbaines localisées.</p><p>Bundock, Michael (1987). Integrated DBMS Approach to Geographical Information Systems, Auto Carto 8.</p><p>Butler, R. (1988). The Use of Artificial Intelligence in GIS, Mapping Awareness and Integrated Spatial Information Systems, Vol. 2, No. 3.</p><p>Chrisman, N. R. (1987). Efficient digitizing for error detection and editing, International Journal of Geographical Information Systems, Vol. 1, No. 3.</p><p>Frank, Andrew U. (1987). Overlay Processing in Spatial Information Systems, Auto Carto 8.</p><p>Frank, Andrew U. (1988). Requirements for a Database Management System for a GIS, Photogrammetric Engineering and Remote Sensing, Vol. 54, No. 11.</p><p>Guttman, Antonin (1984). R-Trees: A Dynamic Index Structure for Spatial Searching, Proceedings ACM SIGMOD.</p><p>McLaren, Robin A.
(1988). Benchmarking within the GIS Procurement Process, Mapping Awareness and Integrated Spatial Information Systems, Vol. 2, No. 4.</p><p>Nimtz, H. (1988). Scanning und Vektorisierungs Software zur Erfassung von Katasterplanwerken, Proceedings Automated Mapping/Facilities Management European Conference IV.</p><p>Rhind, D. W., Green, N. P. A. (1988). Design of a Geographical Information System for a Heterogeneous Scientific Community, International Journal of Geographical Information Systems, Vol. 2, No. 2.</p><p>Samet, H. (1984). The Quadtree and Related Hierarchical Data Structures, ACM Computing Surveys, Vol. 16, No. 2.</p><p>Visvalingam, M., Wade, P., Kirby, G. H. (1986). Extraction of area topology from line geometry, Proceedings Auto Carto London.</p><p>Walsh, Stephen J., Lightfoot, Dale R., Butler, David R. (1987). Assessment of Inherent and Operational Errors in Geographic Information Systems, Technical Papers, 1987 ASPRS-ACSM Annual Convention, Vol. 5.</p></span>Alfred Sawatzkyhttp://www.blogger.com/profile/15845723068853240174noreply@blogger.com0tag:blogger.com,1999:blog-4451657389087868432.post-31511111630962958142011-08-24T11:03:00.004-06:002011-08-24T11:28:30.278-06:00Embedding Interactivity in a Static Paper MapIf you have ever traveled through an airport or read a magazine in the last year, you are likely to have seen a QR code.
<br />
<br />Here's a QR code with my contact info. (I'll see if I can fit it on my name badge at the upcoming Smallworld Users Conference.)
<br /><img src="http://chart.apis.google.com/chart?cht=qr&chs=120x120&chl=MECARD%3AN%3AAlfred+Sawatzky%3BTEL%3A%2B13036817663%3BURL%3Ahttp%3A%2F%2Fwww.ifactorconsulting.com%3BEMAIL%3Aalfred%40ifactorconsulting.com%3B%3B"/>
<br />
<br />A QR code is a 2D scannable image that can have content associated with it (e.g., URL, text, contact info, location, etc.). It turns out that most mobile devices and smart phones now either have built-in capability to read these codes, or their respective application stores/markets provide a host of QR reader apps. QR codes are being used a lot by product marketers to let people launch an interactive and immersive experience simply by pointing their smartphone camera at a magazine page or a billboard.
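<br />
<br />In case you want to generate codes like mine above: the Google Chart API that renders the image takes the payload in its chl parameter. Here is a minimal Python sketch (my own, and only one of many ways to do it) that builds such a URL from a MECARD payload; swap in your own contact details:
<pre>
# Minimal sketch: build a Google Chart API QR-code URL like the one above.
# The contact details are the ones shown in the image; substitute your own.
from urllib.parse import quote_plus

def qr_url(payload, size=120):
    """Return a chart.apis.google.com URL rendering payload as a QR code."""
    return ("http://chart.apis.google.com/chart"
            "?cht=qr&chs={0}x{0}&chl={1}".format(size, quote_plus(payload)))

mecard = ("MECARD:N:Alfred Sawatzky;"
          "TEL:+13036817663;"
          "URL:http://www.ifactorconsulting.com;"
          "EMAIL:alfred@ifactorconsulting.com;;")
print(qr_url(mecard))
</pre>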
<br />
<br />I have begun to think about how we could start integrating interactive experiences into utilities' paper maps. While it is true that many utilities are moving to mobile apps for data access, I know that there are still many paper maps being generated. I also know that smart phones are becoming more prevalent throughout society, which means they are also finding their way to field crews.
<br />
<br />So, if a field crew has paper maps and smart phones, what kind of interactivity can we add to those maps? Right away I can think of things like adding attribute information or geolocation to paper maps (see the sketch below for one way the geolocation idea might work). I am sure that there are many other applications, and that is why I am posting about this technology. I would love to hear from the community about other ideas concerning how to add interactivity to the old paper maps.
<br />
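<br />To make the geolocation idea concrete, here is one way it might work, again sketched in Python: for each feature printed on a map sheet, generate a QR code whose payload is a geo: URI, and place the small image in the map margin next to the feature callout. Scanning it would open the location in the phone's map application. The feature names and coordinates below are made up purely for illustration:
<pre>
# Sketch: one QR code per mapped feature, encoding its location as a
# geo: URI (RFC 5870). Feature names and coordinates are illustrative.
from urllib.parse import quote_plus

CHART = "http://chart.apis.google.com/chart?cht=qr&chs={s}x{s}&chl={p}"

features = [
    ("transformer T-101", 39.7392, -104.9903),
    ("switch S-17", 39.7410, -104.9878),
]

for name, lat, lon in features:
    payload = "geo:{:.4f},{:.4f}".format(lat, lon)
    print(name, "->", CHART.format(s=120, p=quote_plus(payload)))
</pre>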
<br />The following video shows a demonstration of how a simple paper map was endowed with dynamic content.
<br />
<br /><iframe width="640" height="510" src="http://www.youtube.com/embed/t16-blM-SvI" frameborder="0" allowfullscreen></iframe>
<br />
<br />I would enjoy discussing this more with anyone attending the Smallworld Users Conference (<a href="http://www.kinsleymeetings.com/ge/smallworld/" target="_blank">http://www.kinsleymeetings.com/ge/smallworld</a>) in September.Alfred Sawatzkyhttp://www.blogger.com/profile/15845723068853240174noreply@blogger.com0