The Clarion Advisor: Detecting Duplicate Records

by Gordon Smith

Published 1999-10-26    Printer-friendly version

Recently on the newsgroups someone mentioned that TopScan didn't identify duplicate record errors while building the keys. After digging around in the source code I confirmed this to be indeed true and set about writing a "duplicate record" detector (DRD) which was file neutral. This article will take a look at the resulting class.

The Requirement

The DRD would have to perform the following functions:

  • Locate all duplicate errors, for each key.
  • For each duplicate error, locate all clashing records (remember each duplicate error can have more than two associated records)

The Solution Overview

The solution ended up being quite a bit simpler than I had originally envisaged, although on the down side it involves a brute force approach where every record in the file will need to be tested. The logical flow goes something like this:

  1. Rebuild all keys (this is essential, as keys must be up to date to enable location of clashing records).
  2. Loop through each record in file sequence.
  3. For each record use the DUPLICATE(FILE) function to see if it is a duplicate.
  4. For each duplicate record, loop through all the keys to find which ones are reporting clashes (remembering that it can be more than one key causing the error).
  5. For each duplicate key, find the one valid record and associate the current (error) record with it.

It is important to note that when duplicate records exist in a given key there will always be one valid record, while the rest of the duplicates will simply not exist for that key.

There are several ways a duplicate record can be created, the following are probably the most popular:

  1. APPEND: Since append doesn't update any keys it is probably the easiest way to create duplicate records.
  2. File conversion (common type A): If the conversion utility simply creates a new empty file and appends all the records from the "before" file, then this will be the same as 1.
  3. File conversion (Common type B): It is possible in some file drivers to modify a key structure (remove a DUP attribute for example) and then simply delete the key (somefile.k01 for example) from the hard drive, allowing Clarion to rebuild it afterwards.

The Solution

The solution has been presented in the form of a class. (The attached source code also has a small example PRJ and CLW file). The definition looks something like this:

FindDupClass    class, type
init              procedure(file f)
kill              procedure
buildKeys         procedure
findDup           procedure
display           procedure
                end

The class would typically be used as follows (and since it is a class, the actual code is appropriately simple):

TestFileForDup procedure(FILE TestFile)

cFind FindDupClass

  code
  cFind.Init(TestFile)
  cFind.BuildKeys()
  cFind.FindDup()
  cFind.Display()
  cFind.Kill()

The code "notables" now follow:

BuildKeys Method

Rather than calling BUILD(FILE), this method builds each key individually by using the PROP:Keys and PROP:Key file properties. The main reason for this is to avoid a nasty side effect of BUILD(FILE) where it aborts BUILDing all keys when an error is encountered (in this case a duplicate record error!). A nice addition to this method would be to actually check if any errors occurred during this operation, if not then there will be no need to check for duplicates. Another useful addition would be to use the PROP:Completed and PROP:ProgressEvents properties to display a nice progress window, with the option to gracefully cancel.

FindDupClass.BuildKeys procedure

i unsigned, auto
k &key

  code
  loop i = 1 to self.f{PROP:Keys}
    k &= self.f{PROP:Key, i}
    build(k)
  end

FindDup Method

The FindDup method is the brute force part of the solution (it loops through the entire file, record by record). Since files can be large it uses the EVENT:Timer event on a simple status window. When a duplicate is found, a private method is called (CalcDupInfo) to analyse the duplicate.

FindDupClass.findDup procedure

Prog long(0)
Found long(0)

Window WINDOW('Caption'),AT(,,200,44)cr.gif (846 bytes)
     ,FONT('MS Sans Serif',8,,FONT:regular)cr.gif (846 bytes)
     ,TIMER(1),SYSTEM,GRAY,DOUBLE
       PROGRESS,USE(Prog),AT(5,10,190,10),RANGE(0,100)
       PROMPT('Duplicates:'),AT(5,27)
       BUTTON('&Cancel'),AT(150,25,45,14),USE(?ButtCancel)
       STRING(@s64),AT(50,27),USE(Found)
     END

i unsigned, auto
FinFlag byte(FALSE)

  code
  open(self.f)
  assert(~errorcode())
  stream(self.f)
  assert(~errorcode())
  set(self.f)

  open(window)
  window{PROP:Text} = self.f{PROP:Name}
  ?Prog{PROP:RangeHigh} = records(self.f)
  accept
    case event()
    of EVENT:Timer
      if ~FinFlag
        loop 100 times
          next(self.f)
          if errorcode()
            FinFlag = TRUE
            post(EVENT:CloseWindow)
            break
          end
          Prog += 1; display(?Prog)
          if duplicate(self.f)
            Found += 1; display(?Found)
            self.CalcDupInfo()
          end
        end
      end
    end
  end
  close(window)

  close(self.f)
  return FinFlag

calcDupInfo Method

This method goes through the following steps to find duplicates.

  1. Save the current pointer and position for the current record.
  2. Check each key to find the ones reporting the error.
  3. For each duplicate record there will be one valid entry. This is located using GET(SELF.F,K) procedure. Since the current record buffer for the "error" record "matches" the one valid record, when the GET is called it will return the one valid record!
  4. For each duplicate AddResult is called twice (NOTE: AddResult will only add records to the result queue if they haven't already been added):
    1. To add the valid record to the result.
    2. To append the duplicate record to its associated valid record.

The original record is then restored with the RESET, NEXT combination (you must use the RESET(FILE) form of RESET, for obvious reasons).

FindDupClass.calcDupInfo procedure

j unsigned, auto
k &key
tmpPos like(ResultQueue.Pos)
tmpPoint long, auto

  code
  tmpPos = position(self.f)
  tmpPoint = pointer(self.f)
  loop j = 1 to self.f{PROP:Keys}
    k &= self.f{PROP:Key, j}
    if duplicate(k)
      get(self.f, k)
      assert(~errorcode())
      self.AddResult(pointer(self.f), k)
      self.AddResult(tmpPoint, k, TRUE)
      reset(self.f, tmpPos)
      next(self.f)
      assert(~errorcode())
    end
  end

AddResult Method

This method adds the duplicate record information to the result queue for displaying later (in the display method). It checks that this particular duplicate record hasn't been added already (based on the key name and its POSITION information), and if it has it will append the current record number to it.

FindDupClass.addResult procedure(string pnt, key k, byte AppendPos = FALSE)

  code
  clear(self.qResult)
  self.qResult.Rec = pnt
  self.qResult.Key = k{PROP:Label}
  self.qResult.Pos = position(k)
  get(self.qResult, +self.qResult.Key, +self.qResult.Pos)
  if errorcode()
    add(self.qResult, self.qResult.Rec)
  elsif AppendPos
    self.qResult.Rec = clip(self.qResult.Rec) & ', ' & pnt
    put(self.qResult)
  end

Summary

All in all it was quite pleasing that the solution ended up being so straightforward. Something similar will undoubtedly find its way into TopScan. One more point: to verify the sample program, try using TopScan to view the created file.

Download the source


Prior to joining TopSpeed Development Centre, Gordon Smith worked for an Irish company developing software for multi-national pharmaceutical companies. He was also a member of the 3rd party accessories program (Compile Manager 2) and developed the Clarion Class Browser. He is currently working with several of the original TopSpeed Developers at Seisint Inc..

Printer-friendly version

Reader Comments

Posted on Tuesday, June 12, 2001 by Greg Salyer

This is good except that it only shows the record number and key name for each dupe found. It needs a way to specify an additional field to display such as an ID number field.

To add a comment to this article you must log in.

 
 

Search

 

Advanced Search
Topical Index

Related Articles

Subscribe to
ClarionMag

One year: $189

(includes all back issues since '99)

Renewals from $139

Two years: $289

Renewals from $239

More Info

Subscribe Now!

ClarionMag Blog

RSS Feeds

Updates via Email

Enter your Email


Powered by FeedBlitz

Quick Links