Monday, August 29, 2011

Smallworld Technical Paper No. 6 - Integrated Data Capture

by David G. Theriault, Nicola Terry, Cees van der Leeden and Richard G. Newell

For the past 15 years, GIS users who enthusiastically embarked on grandiose projects have, with very few exceptions, spent enormous amounts of time and effort mired in the initial phase of data capture. The organisations worst affected are those where the capture effort is concerned with the conversion of old documents to a digital, structured database, e.g. utilities, local authorities and cadastral agencies. The difficulties arise mainly because inappropriate tools are applied to the problem. This paper presents an approach in which the problem has been completely re-examined at four levels. The first consideration is at the project management level, where a proper analysis of goals and targets is required. The next level is concerned with the rapid development of specific applications in an ergonomic, integrated object-oriented environment. These applications must be supported by facilities for rapid take-on of data (e.g. raster-vector integration, interactive vectorisation tools and multiple grey-level images) and a topologically structured, seamless spatial database. Finally, powerful workstations with a visually pleasing window environment will be needed. Benchmarks so far seem to indicate that a three- to five-fold improvement in time may be achieved.


Everybody now recognises that one of the largest costs (possibly the largest cost) of implementing a GIS is data conversion. Converting data held on maps into a database containing intelligently structured points, lines and polygons is particularly burdensome. Much effort has been invested in trying to improve the process beyond the traditional methods based on manual digitizing. These efforts include the use of scanned raster data, automatic conversion of raster into structured vector, automatic symbol and text recognition and "heads up" (on-screen) digitizing. All of these have their place, but we will advocate in this paper that other aspects of a system's functionality and its structure can, in many cases, have more dramatic effects in reducing conversion costs. Our experiences stem from applying these methods to the capture of utility foreground data and cadastral polygon data, where automatic approaches based on artificial intelligence break down.

Putting it another way, it seems that, with the current state of the art of automatic recognition methods for data conversion, there will always be a large proportion of documents which can only be interpreted and input by human operators. The exercise then becomes one of providing the best possible environment for humans to operate in. This means not just providing much improved user interfaces, but also attending to the very structure of the system itself. Badly structured, poorly integrated systems contribute greatly to the costs of conversion.

We contend that data capture is not a separate function carried out in isolation from the GIS it is to feed; rather, the data capture system itself should embody much of what one would expect from a modern corporate GIS.

Goals and Targets

It is important at the start of a GIS implementation to identify what the GIS is intended for. This may be an obvious thing to say, but it is amazing how often in the past the conversion exercise has borne no relationship to any pre-assigned goals. An organisation embarking on GIS should first define the business case for the whole project. This will result in:

  • identifying the end users of the system (including those who might never actually sit down in front of a terminal).
  • a detailed breakdown of activities on the system.
  • an outline of possible future requirements.

From the analysis it is possible to know the exact data requirements, including the coverage required and the level of intelligence required in the data structures.

This business case analysis has not always been done rigorously. The result is that frequently far more is converted than is strictly necessary. It does not seem very many years since the silly arguments about the relative merits of raster and vector. Now that discs are cheap and processors and displays are fast, everybody recognizes that a raster representation for a landbase is an excellent and cost-effective way to get started. Raster not only provides a perfectly adequate representation for producing displays and plots, but is also a good basis for the capture of intelligent data.

For a utility company, the data may be divided into background and foreground data. The source of the background data is good quality paper maps from the national mapping agency (e.g. the Ordnance Survey in the UK and the IGN in France). A complete digital vector representation of these background maps is usually not available or is expensive, so we would advocate in this proposal that these maps should be scanned using a good quality scanner. The cost of producing a scanned background mapbase is a small fraction (perhaps one tenth) of the cost of producing a vector background mapbase.

It is thus essential to decide at an early stage which real-world objects need to be represented in the system, so that the foreground capture can be limited to these.

After all, a GIS is no more than a corporate database management system in which one of the query and report environments is a map. Thus the same data modelling and analysis that goes into DBMS implementation needs to be done for GIS implementation. This data model will determine what needs to have an intelligently structured graphical representation and what does not.

It is essential that goals are set so that a complete coverage of a defined and useful area can be achieved in about one year to 18 months. If the time frame is longer than this, the management will get impatient to see results, the maintenance problem will overtake the primary data-capture and the whole thing will fall into disrepute. We would go as far as to say, if the organisation cannot commit the money and resources to achieve this, then it should wait until it can.

Development of Specific Applications

For the purpose of the data conversion project, specific applications for input and edit need to be developed for each data item. In parallel, it is necessary to implement and test the end applications of the system. Traditionally in a GIS project, these tasks have consumed an enormous amount of time and valuable resource. For sound reasons, the data capture project is postponed until these applications show robust results. The advances that have been made in data processing and database management, such as rapid prototyping and 4GL environments, have only recently surfaced in some commercial GIS systems.

This task of applications development is made much easier if the GIS already has an extensive library of generic functions which can be applied to the specific data model implemented. These functions might include generic editors for geometry, attributes and topology, network functions as well as polygon operations including buffering and overlay. It is in this area that interactive object-oriented environments really come to the fore as new functionality can inherit much of the existing functionality of a generic system as well as providing very rapid prototyping and development capabilities.

Integrated System Approach

This section contains the main burden of our argument. In the past, most data capture was carried out using conventional digitizers on single-user systems, one sheet at a time, in isolation from the GIS. We contend that data capture should be carried out using on-screen digitizing in a multi-user environment, using a continuous mapbase within a fully integrated GIS.

Such an approach gives considerable savings in time and cost in making a complete coverage. The basis of the approach is to eliminate many of the activities which take time in systems which are not well integrated.

The procedure starts by capturing an accurate background landbase. This may well be a base constructed of vectors and polygons, if it is available, such as the data available from the Ordnance Survey. Alternatively, a base constructed of scanned raster is much cheaper. Scanning at 200 dots per inch is barely adequate for this purpose and it is worth the marginal additional expense of scanning at a higher resolution of, say, 300 or even 400 dots per inch. With modern compression techniques, 400 dpi only takes twice, not four times, the space of 200 dpi. Semi-automatic vectorisation (see below) is considerably improved at the higher resolutions.
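The storage claim can be checked with a small experiment. The sketch below is only an illustration: it assumes a simple run-length scheme (the paper does not say which compression technique is meant), and `render_sheet` and `run_count` are hypothetical helpers that draw a synthetic line map and count the runs an encoder would have to store. Doubling the resolution quadruples the pixel count, but each scan line still crosses the same number of drawn lines, so the number of runs only about doubles.

```python
def render_sheet(size):
    """Rasterise a synthetic bilevel 'map' on a size x size grid.

    Three vertical lines and one diagonal are drawn; the line width
    grows with the resolution, as it would for a real scanned sheet.
    """
    rows = [[0] * size for _ in range(size)]
    thickness = max(1, size // 100)
    for y in range(size):
        for t in range(thickness):
            for tenths in (2, 5, 8):
                rows[y][min(size - 1, size * tenths // 10 + t)] = 1
            rows[y][min(size - 1, y + t)] = 1  # diagonal line
    return rows


def run_count(rows):
    """Count the black/white runs a simple run-length encoder would store."""
    total = 0
    for row in rows:
        total += 1 + sum(1 for a, b in zip(row, row[1:]) if a != b)
    return total


low = render_sheet(200)    # stands in for a 200 dpi scan
high = render_sheet(400)   # the same sheet at 400 dpi

pixel_ratio = (len(high) * len(high[0])) / (len(low) * len(low[0]))
storage_ratio = run_count(high) / run_count(low)
print(f"pixels: x{pixel_ratio:.2f}, run-length storage: x{storage_ratio:.2f}")
```

Real maps with text and symbols will not scale as cleanly as this synthetic sheet, so the factor of two should be read as a rule of thumb.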

[ Figure 1 not available ]

Figure 1: Basic Data Model

The next stage is to input data which can be used to locate objects and mapsheets as well as providing an essential geometric reference base. This includes such things as street gazetteers, map sheet locators and street centre-lines. Thus locating any street or map sheet and getting it onto the screen should be very efficient. Conversely, the system should be able to identify the mapsheet(s) containing any point or object indicated on the screen. Some locational data may already exist in existing company databases. Doing this before the foreground capture can improve overall efficiency.
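A minimal sketch of such locational data is given here, assuming rectangular map sheets and streets stored as centre-line polylines. The `Sheet` and `Locator` classes and their methods are hypothetical names chosen for illustration, not part of any particular GIS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sheet:
    """A rectangular map sheet in world coordinates."""
    name: str
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def contains(self, x, y):
        return self.xmin <= x <= self.xmax and self.ymin <= y <= self.ymax

    def overlaps(self, xmin, ymin, xmax, ymax):
        return not (xmax < self.xmin or xmin > self.xmax or
                    ymax < self.ymin or ymin > self.ymax)


class Locator:
    """Street gazetteer plus map sheet locator."""

    def __init__(self):
        self.sheets = []
        self.streets = {}

    def add_sheet(self, sheet):
        self.sheets.append(sheet)

    def add_street(self, name, centreline):
        self.streets[name.lower()] = list(centreline)

    def sheets_at(self, x, y):
        """Names of the sheet(s) containing a point indicated on screen."""
        return sorted(s.name for s in self.sheets if s.contains(x, y))

    def street_extent(self, name):
        """Bounding box to bring onto the screen for a named street."""
        points = self.streets[name.lower()]
        xs = [p[0] for p in points]
        ys = [p[1] for p in points]
        return (min(xs), min(ys), max(xs), max(ys))

    def sheets_for_street(self, name):
        """Sheets whose rectangle overlaps the street's bounding box."""
        extent = self.street_extent(name)
        return sorted(s.name for s in self.sheets if s.overlaps(*extent))


locator = Locator()
locator.add_sheet(Sheet("A", 0, 0, 500, 500))
locator.add_sheet(Sheet("B", 500, 0, 1000, 500))
locator.add_street("High Street", [(400, 100), (600, 120)])
print(locator.sheets_at(250, 250), locator.sheets_for_street("high street"))
```

A linear scan over the sheets is enough for a sketch; a production system would use a spatial index.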

Finally, the source maps containing the foreground data of interest can be scanned and the process of on-screen data capture can begin in an environment in which many of the hurdles in conventional systems are removed. Figure 1 illustrates the order in which information is assembled.

The following is a discussion of several important issues, including:

  • On-screen digitizing
  • Digitizing across many sheets without interruption
  • Geometry, topology and attribute capture in one integrated GIS environment
  • Multi-user data capture into one continuous database
  • The full facilities of a GIS for validation and running test applications

On-screen Digitizing

The conventional way of digitizing map data is to use a tablet or digitizing table. While, for certain kinds of data, this is usually of sufficient accuracy, a significant amount of time is taken by the operator in having to validate the input on a workstation screen, which is a separate surface from that on which the digitizing is performed. Thus it is difficult to check for digitizing errors and also for completeness, as the two images are not together.

Further, it is common for there to be minor misalignments between a quality background map sheet and the map sheet on which the foreground data is drawn. With on-screen digitizing it is possible to do a quick local realignment to compensate for such errors.
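One simple way to perform such a local realignment is sketched here, under the assumption that the operator picks two features visible on both sheets. The function `fit_similarity` is a hypothetical helper: it derives a shift, rotation and uniform scale from the two point pairs, using complex arithmetic to keep the code short. The paper does not state which transformation was actually used.

```python
def fit_similarity(src_a, src_b, dst_a, dst_b):
    """Return a function mapping foreground-sheet points onto the background.

    src_a, src_b: two control points picked on the foreground sheet.
    dst_a, dst_b: the same two features as they appear on the background.
    """
    s1, s2 = complex(*src_a), complex(*src_b)
    d1, d2 = complex(*dst_a), complex(*dst_b)
    if s1 == s2:
        raise ValueError("control points must be distinct")
    a = (d2 - d1) / (s2 - s1)   # rotation and uniform scale
    b = d1 - a * s1             # translation

    def apply(point):
        z = a * complex(*point) + b
        return (z.real, z.imag)

    return apply


# Foreground sheet is rotated by 90 degrees and shifted relative to the background.
realign = fit_similarity((0, 0), (10, 0), (1, 2), (1, 12))
print(realign((5, 0)))
```

With more than two control points a least-squares fit would be the natural extension.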

The process can be considerably improved with facilities for semi-automatic vectorisation of the raster data, so that high accuracy can be obtained while working at a relatively small scale. This reduces the need to zoom in and out frequently as the automatic vectorising facilities are usually more accurate than positioning coordinates by eye.

It is also necessary to be able to handle greyscale scanned maps in cases where the source maps are of low quality, or contain a wide variety of line densities, such as pencil marks on a map drawn in ink. Compression techniques should allow a large number of greyscale or colour maps to be stored efficiently.

Digitizing Across Many Map Sheets Without Interruption

With conventional digitizing, it is not possible to digitize at one time any object which lies across a map sheet boundary. Objects such as pipes or electricity cables frequently cross map sheet boundaries at least once. In these cases, the parts of the object must be connected together by an additional interactive process. Ad hoc techniques such as "off sheet connectors" and artificial segmentation of the object must be avoided.

With on-screen digitizing, the process starts by scanning all (or at least a significant number of) adjacent foreground maps into the database before digitizing commences. This in effect simulates a single large continuous raster mapbase. Thus the digitizing of complete objects can be achieved without interruption, as all of the information about the object is presented together on the screen.
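The idea of a continuous raster mapbase can be sketched as a thin layer that answers pixel queries in world coordinates and hides which sheet the answer came from. The `RasterSheet` and `Mosaic` classes below are hypothetical, and the sketch assumes one pixel per map unit with row 0 at the southern edge of each sheet.

```python
class RasterSheet:
    """One scanned sheet: pixels[row][col], positioned by its south-west corner."""

    def __init__(self, origin_x, origin_y, pixels):
        self.origin_x = origin_x
        self.origin_y = origin_y
        self.pixels = pixels
        self.height = len(pixels)
        self.width = len(pixels[0])


class Mosaic:
    """Presents many adjacent sheets as one continuous raster."""

    def __init__(self, sheets):
        self.sheets = list(sheets)

    def pixel_at(self, x, y):
        for sheet in self.sheets:
            col, row = x - sheet.origin_x, y - sheet.origin_y
            if 0 <= row < sheet.height and 0 <= col < sheet.width:
                return sheet.pixels[row][col]
        return 0  # nothing has been scanned at this location

    def window(self, x0, y0, width, height):
        """Pixels for a screen window, which may straddle sheet boundaries."""
        return [[self.pixel_at(x0 + c, y0 + r) for c in range(width)]
                for r in range(height)]


west = RasterSheet(0, 0, [[1, 1], [0, 0]])
east = RasterSheet(2, 0, [[1, 0], [0, 1]])
mosaic = Mosaic([west, east])
print(mosaic.window(1, 0, 2, 1))  # a window crossing the sheet boundary
```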

Geometry, Topology and Attribute Capture in one Integrated GIS Environment

It is also necessary for the input, edit and query functions of the GIS to be available to the operator during data capture. Thus, little time is wasted moving from one user interface environment to another. Objects with similar attributes to objects already in the system can use those attributes as a starting template for modification.

The integration of the digitizing capability with the geometric CAD type facilities means that constructive geometry can be mixed with digitized geometry at will. Further, the integration of the geometry capability with the topology facilities means that most topology can be generated automatically by the system at the time that objects are positioned on the screen. For example, a water valve placed on a water pipe causes a node to be inserted in the water pipe, and the correct connectivity to be inserted in the database.
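The water valve example can be sketched as follows. This is an illustration under simplifying assumptions, not the method of any particular system: a pipe is a plain list of vertices, `split_pipe_at` is a hypothetical helper, and connectivity is represented by the two links sharing the new node.

```python
import math


def split_pipe_at(pipe, valve_xy, tolerance=0.5):
    """Insert a node in a pipe where a valve has been placed on it.

    pipe: list of (x, y) vertices. Returns (node, [link_1, link_2]), where
    both links share the new node, or None if the valve is further than
    `tolerance` from the pipe.
    """
    vx, vy = valve_xy
    best = None
    for i in range(len(pipe) - 1):
        (x1, y1), (x2, y2) = pipe[i], pipe[i + 1]
        dx, dy = x2 - x1, y2 - y1
        length_sq = dx * dx + dy * dy
        if length_sq == 0:
            continue  # skip degenerate segments
        # Project the valve onto the segment, clamped to its end points.
        t = ((vx - x1) * dx + (vy - y1) * dy) / length_sq
        t = max(0.0, min(1.0, t))
        px, py = x1 + t * dx, y1 + t * dy
        dist = math.hypot(vx - px, vy - py)
        if best is None or dist < best[0]:
            best = (dist, i, (px, py))
    if best is None or best[0] > tolerance:
        return None
    _, i, node = best
    # Avoid duplicating a vertex when the valve lands exactly on one.
    link_1 = pipe[:i + 1] + ([node] if node != pipe[i] else [])
    link_2 = ([node] if node != pipe[i + 1] else []) + pipe[i + 1:]
    return node, [link_1, link_2]


pipe = [(0, 0), (10, 0), (10, 10)]
print(split_pipe_at(pipe, (4, 0.2)))
```

The operator only indicates an approximate position; the snapped node and the two resulting links are what would be written to the database.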

The inclusion of gazetteering functions considerably reduces the amount of time spent locating objects in the database.

Multi-user Data Capture into one Continuous Database

Approaches based on digitizing one sheet at a time by means of single user data capture systems result in a lot of time being spent merging and managing the many independent data files generated. The problem becomes particularly acute when edits are required to data which has already been input.

The system should permit many users to simultaneously update one continuous mapbase. This can only be achieved efficiently if the system allows any number of versions and simultaneous alternatives to be held in the database. This makes the task of administering many users engaged in simultaneous additions and modifications to the database much easier.

This single issue is one of the dominant reasons why data capture can be very time consuming and costly.
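The paper does not describe how versions and alternatives are implemented, so the following is only a toy model of the idea. Each operator works in an `Alternative` (a hypothetical class) that records changes on top of a shared parent, reads fall through to the parent, and `post` merges finished work upwards. Conflict detection is deliberately left out.

```python
_DELETED = object()  # marker for a record removed in an alternative


class Alternative:
    """A set of changes layered on top of a parent alternative."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.changes = {}

    def put(self, key, value):
        self.changes[key] = value

    def delete(self, key):
        self.changes[key] = _DELETED

    def get(self, key, default=None):
        alternative = self
        while alternative is not None:
            if key in alternative.changes:
                value = alternative.changes[key]
                return default if value is _DELETED else value
            alternative = alternative.parent
        return default

    def post(self):
        """Merge this alternative's changes into its parent."""
        if self.parent is None:
            raise ValueError("the top alternative has no parent to post to")
        self.parent.changes.update(self.changes)
        self.changes = {}


top = Alternative("top")
top.put("pipe-1", "cast iron")
operator_a = Alternative("operator-a", parent=top)
operator_b = Alternative("operator-b", parent=top)

operator_a.put("pipe-2", "pvc")            # A adds a new pipe
operator_b.put("pipe-1", "ductile iron")   # B edits an existing one
print(operator_a.get("pipe-1"), operator_b.get("pipe-1"))
```

Neither operator sees the other's unfinished work, and no per-sheet files have to be merged by hand afterwards.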

The GIGO myth

A popular cliché these days is the GIGO (garbage in, garbage out) story. In many cases it is not actually true, or at least the situation may be recoverable. It is our experience that many of today's GIS do in fact contain so-called garbage. We say this because frequently the geometry is good enough, but the structure of the data is defective: often objects are split across map sheet boundaries and object attributes are wrong or just plain missing. In other words, the data cannot be used for the purpose for which it was intended.

In our experience, there is usually enough in these databases to convert the data into a satisfactory quality by a process of "laundering". This is a process where the data is imported into a GIS environment, and automatic processors can be applied to the data to infer missing topology and missing attributes and a range of validators can be run to detect and highlight problems graphically for a human operator to deal with. The cost of doing this is usually very much less than the cost of the original data capture, but it does require a rich GIS environment in which to do it.

It is particularly useful if the system contains rapid prototyping and development facilities, which make it more feasible to write special programs to clean up particular datasets. These programs may well be used only once, then thrown away.
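A throwaway laundering program of this kind might look like the sketch below. The record layout (`id`, `points`, `attrs`) and the functions `join_split_objects` and `validate` are assumptions made for illustration. The first rejoins objects that were split at a map sheet boundary; the second flags missing attributes for an operator to deal with.

```python
def _close(p, q, tolerance):
    return abs(p[0] - q[0]) <= tolerance and abs(p[1] - q[1]) <= tolerance


def join_split_objects(segments, tolerance=0.01):
    """Merge pieces whose end meets another piece's start.

    Pieces are joined only if their attributes are identical. For brevity
    only head-to-tail joins are handled; reversed pieces are left alone.
    """
    pending = [dict(s, points=list(s["points"])) for s in segments]
    merged = True
    while merged:
        merged = False
        for a in pending:
            for b in pending:
                if a is b or a["attrs"] != b["attrs"]:
                    continue
                if _close(a["points"][-1], b["points"][0], tolerance):
                    a["points"] += b["points"][1:]
                    pending.remove(b)
                    merged = True
                    break
            if merged:
                break  # restart the scan after every merge
    return pending


def validate(objects, required=("material",)):
    """Return (id, problem) pairs for an operator to review."""
    problems = []
    for obj in objects:
        for name in required:
            if not obj["attrs"].get(name):
                problems.append((obj["id"], f"missing {name}"))
    return problems


raw = [
    {"id": 1, "points": [(0, 0), (100, 0)], "attrs": {"material": "pvc"}},
    {"id": 2, "points": [(100, 0), (180, 0)], "attrs": {"material": "pvc"}},
    {"id": 3, "points": [(0, 50), (100, 50)], "attrs": {}},
]
clean = join_split_objects(raw)
print(len(clean), validate(clean))
```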

User Interface and Platform Issues

Much of what has been proposed above has only been made possible by the huge improvements in hardware and software in the last three years. On the hardware front, the drop in price combined with the huge increases in performance, memory and disk sizes means that data capture can be carried out in the environment of a large seamless database, managed by a modern GIS. The improvement in emerging user interface standards, coupled with the arrival of object-oriented development environments for the rapid development of custom built user interfaces for particular data capture tasks is another contributing factor.


Whereas most effort in the past in improving data capture has been expended on fully automated methods for converting source map data into a structured database, we believe that for many types of documents the human operator cannot be eliminated, but the present situation can be considerably improved by providing a fully integrated system.

Many of the hurdles which contribute to the cost of data-capture can be eliminated by several levels of integration including:

  • The integration of source and target images in one database and on one screen via the use of raster scanned maps.
  • The integration of all map sheets in one seamless mapbase.
  • The integration of map sheets and locational data.
  • The integration of data capture, validation and GIS functions in one environment.
  • The integration of many users capturing data into one version managed multi-user database.

As well as these issues of integration, the application of robust intelligent techniques, such as semi-automatic vectorisation and automatic topology generation, further improves the process. Finally, the advent of modern user interfaces and object-oriented development environments leads to the production of optimised custom-built user interfaces ideally tuned to the task in hand.
