![]() |
|
Published 1998-10-01 Printer-friendly version
Database Replication. Data Warehousing. Datamarts. The hottest new buzzwords in the database developers arena. Just the very words strike fear in the hearts of Clarion For Windows programmers, as they arent mentioned anywhere in the Clarion documentation. But these emerging technologies have tremendous possibilities for Clarion developers, even though Clarions language does not include statements that directly offer support for it. In this article, we will discuss the features and uses of database replicationthe magic behind the scenes of remote data entry and reporting, datamarts, and data warehousing. And, oh yes, of course! We will show you ways to implement these technologies using Clarion For Windows!
Database replication is a technique whereby multiple copies of the database tables are spread around servers on a network. Even though the term "replication" is fairly new, the concept is not. It has its roots in, you guessed it, mainframe era technology. As records are added and modified in one copy of the table (a replica) in off-line fashion, those changes will get propagated to the other replicas when the remote networks are connected together periodically. This is a useful alternative to remote control programs, such as PcAnywhere, that only send screens and keystrokes over a remote connection, and which require a full time connection while data entry or querying/reporting occurs.

For example, suppose you have a traveling salesman that moves from city to city and stays in a hotel at night. He carries his notebook computer with him at all times. He enters orders throughout the day into his notebook using a Clarion For Windows program that saves them in a local table. At night, when he arrives at the hotel, he connects his notebook computer and modem to the phone line and calls the home office. The home office then synchronizes its copy of the database table with his, effectively uploading all new orders the salesman has created from the notebook computer for the day. At the same time, any orders that were created by other salesman are downloaded into the notebook computer. Also, any pricing changes made by home office employees are replicated into local tables on the notebook computer. Now the salesman is ready to go for the next day. Note that all new information is entered "off line". Using remote control software such as PCAnywhere, the salesman must enter while he is "on line". Clearly not as simple or powerful.
Another example would be a remote office that must communicate routinely with the home office. Perhaps this company has several shipping points around the country. At night, each remote office can "poll" the home office for the new orders that came in that day. At the end of the synchronizing process, all sites have a heterogeneous data set that is fully up-to-date.
A third use for database replication is for data warehousing or datamarts. These are centralized repositories for databases that make it easy for any number of individuals to perform complex analysis of vast amounts of data. CEOs and board executives might want to "mine" this data to extract tidbits of information that can help make the company more competitive or have an edge. This data could easily be processed by mainframes or supercomputers, or using servers running SQL-based DBMSs.
For example, suppose your company travels around the world doing seminars, and you collect income from various sources, including bookings, tickets for the late bloomers to enter the arena, and product sales of books, video and audiotapes, and other items. You normally might be interested in just producing information on daily sales of these items and summarizing these for the accounting department. You might also make a series of reports that help the employees running the sales tables decide which products to stock and which should be dropped. The speaker that gives the spiel and signs autographs might be interested in knowing which customers have bought books and audio tapes that relate to certain topics (especially those he has written). Using data mining tools, executives can "drill down" into the sales data and extract not only the information, but can produce mailing labels that can be sent out to those who have bought a particular quantity on such topics. So, if John Q. Public bought $200 worth of audio tapes and video tapes, he gets a catalog. The guy that bought the $5 bookmark doesnt interest the executive. And on and on. Ordinarily, only the highest performance (and most expensive) databases would be used for the "drilling down". Therefore, getting the data to the highest performance computer reliably and timely is a must.
Database replication comes in many different forms and flavors. Some types of replication involves a "two-phase commit" (2PC), where all of the replicas must agree to accept a change to the database before a transaction will be replicated. 2PC requires that all servers be live and communicating with each other at all times, and unless a full time high performance WAN is already in place and has the extra bandwidth to accommodate the extra activity, it is not usually cost-effective, as WAN links are expensive and, in most cases, slower than LAN links. Therefore, most organizations choose a "warm" update procedure that updates records between the branches in periodic batches, rather than a "hot" update, where all transactions are available immediately after entry.
You might be wondering why a company might choose a 2PC scheme in the first place if they needed live access to data at all the branches. Why not just access the records via the WAN on a central file server and be done with it? Well, the answer is not that simple, Im afraid. You see, even using a SQL DBMS such as SQL server, the remote branches will be required to know certain pieces of lookup information from time to time. For example, suppose you are a payroll company that pays your employees at the remote branches. You are entering time cards into the payroll program, and you realize that you do not know the employees social security number that is used as the employee ID. Most Clarion programmers would write their program to pop up a drop list or lookup table of employee names and SS numbers on the time card form window. The problem is that, depending on the types of controls used, the entire database of 30000 employees might be sent via the WAN link to the remote system. Browse windows are page loaded, but even if you have ISDN, for example, even a one-page browse is SLOW. In fact, even a SQL-based database would have to return such a candidate record set if a lookup is needed. So now the payroll clerk must have a printout that lists all of the employees and their social security numbers, and manually look it upyou get the idea. Replication would allow the clerk to have a local copy of the employee table. Still, its usually better to update this periodically rather than every time a transaction is entered from anywhere in the world. Most system analysts would agree that remote control software like PCAnywhere is a better solution if live updates are needed to centralized PC-based databases, even though you cannot type as fast.
Warm updates use a replication scheme where the added and changed records are sent between the replicas periodically. This precludes the need to have full-time WAN access if, in fact, a WAN is needed at all. Much of the file transfer can be handled using modems. Many installations will find that a pair of very simple dedicated 386 or 486 computers with modems is all that is needed. Their job is to read the databases periodically and transfer the replicate set over the telephone line back and forth to each other and to other peers on other remote networks. They might be programmed to dial once per hour or once per day. This type of replication is cheap, and gets the job done without reliance on expensive ISDN, frame relay, or ATM WAN links.
First, your database must be carefully designed (or retrofitted <g>) to support replicas. One of the most important aspects of programming replicas is to insure that you have created a "GLOBALLY UNIQUE ID", also known as a GUID, for each file. This is important. Most programmers would call this a SysID. The main difference is that your code cannot just use a simple auto-incrementing field for this type of ID. For instance, in our remote salesman example above, we might be tempted to use an invoice number as the primary key for the ORDERS table. This is fine for a stand-alone office, but we must be able to insure that as records are added by different salesmen, remote offices, and the home office, that a unique invoice number is generated for each record, regardless of where it is generated. In an ordinary order entry application, we would simply ask Clarion to generate the next available order number for us. However, we cannot do this in a replicable database. Suppose the next order number is 10003. Now John Smith adds his first record for the day into his local replica (Order 10004), and Jane Doe enters her first record for the day into her local replica (also Order 10004). When all tables are synchronized, Order 10004 will have generated a conflict. Therefore, a common solution is to assign a "Location ID" to the beginning of the primary key. For instance, a prefix of 1 would be John Smiths orders, ie. 01-10004. Jane Does first order might be 02-10004. Clarion For Windows can still generate the second field of the key in its normal sequential manner. You can use a variable from an INI file to hold the location ID portion. In other words, you can use the Location ID and Order Number together in a composite key so that AppGen can use auto-increments for the Order Number portion.
I have found that it makes sense, though, to write a function call to generate the GUID. This function could be housed in a TXA file or DLL file that can be imported into an application. What makes it neat is that it can be called from your dictionary without too much fuss (you use the "Initial Value" attribute for the field in the dictionary to call the function), and you will have an easy way to change your primary key generation method should the time arise where you need to do so, without having to change the rest of your code.
Also, make sure that EACH TABLE includes a GUID in this way, including tables that do not normally need serialized numberseven on tables that are normally the MANY side of a one to many relationship. It is vitally important that your replicated table include a GUID as a primary key, even if the synchronizer program is the only program that uses it. Suppose you have a CUSTOMER table but you do not normally use account numbers, but instead select the name. You still need to have the system generate a globally unique key. This is necessary so that if a particular customer is modified or deleted at some later date, it will be able to find the correct record (i.e. more than one customer could be named ACME). You can hide this key value from the end-users view in browses and on forms, and Clarion For Windows can still generate the key either by using auto-inc with a composite key or by having a function call. Also, suppose a record in a table on the MANY side of a relationship is modified by a user; e.g., the third line item on an invoice. Without a GUID, there would be NO WAY to find the original record to be replicated across multiple copies of the table, since there would be nothing to identify individual records; i.e., all line items would only have an order number to identify them, which is by itself insufficient to indicate that we need to modify individual line item number 3. Note that we CANNOT rely upon using the physical record pointer assigned by Clarion to the record/row as a reference, because due to such things as re-use of deleted records and different numbers of records being added by different users, the record will probably not be assigned to the same physical row in each copy of the table.
Note also, conflict detection/resolution will have to be implemented at some level in the program. Suppose John modifies the zip code of a customer record at his Tampa office, and Janet modifies that same customers phone number from her Miami office. John replicates his change first and the change takes. Now Janet replicates her change, BUT SHE STILL HAS THE OLD ZIP CODE. Alas, a conflict has just sprung up! Janet must not be allowed to continue her replication of that particular record until the conflict has been resolved by an administrator!
The most common way of dealing with these conflicts is simply to use date and time stamps on the records. The person that gets his modification to the record in first is the "winner". The second person must deal with the conflict. Time and date stamps work very well for this, but keep in mind that the two computers might be in different time zones. You need a "time offset" that can be added into the comparison so that the synchronizing program does not make the wrong assumption about which update is the newer one. Date and time stamps are very common in replication, but it is not the only way to detect conflicts. You could use a series of generation numbers as a conflict detection method. In effect, every time a replica is updated, the generation field stored as a column within the table to be replicated is incremented by one. At the same time, the transaction log contains an "expected generation" field that is also incremented. During synchronization, the generation field in the remote replica can be compared to the expected generation. If they dont match, you have a conflict.
Regardless of the conflict detection method, you must still deal with the conflict. I am a firm believer that conflicts should be dealt with by a human being; therefore, I like to include a "Conflict Flag" in the transaction log. If, after synchronization, a transaction log entry is still present, I can use the Conflict Flag to draw one of two conclusions. If it has been set, a conflict was found and, therefore, the record could not be transferred. If it has not been set, the opportunity to synchronize this record with the subscriber never presented itself; e.g., maybe the phone line got disconnected, or the user never ran the synchronizer program for that subscriber. What is neat is that you could write a simple report in Report Writer that uses the Conflict Flag as a filter. Any records that show up in this report after synchronization could be checked for the cause of the conflict, the record redone (or at least the date/time stamps "touched"), and the synchronizer run again.
In part two, we will examine the implementation of database replication using Clarion for Windows. We will discuss how to set up the dictionary for your application, how to set up a "subscriber" file that identifies the recipients of your replica modifications, how to insert needed source code to keep up the replica transaction log for later synchronization, how to make the remote connection using dial-up networking, and how to write a synchronizer program that handles two-way replica updating and conflict detection.
See you next month!
Copyright © 1999-2008 by CoveComm Inc. All Rights Reserved. Reproduction in any form without the express written consent of CoveComm Inc., except as described in the subscription agreement, is prohibited.
Clarion Magazine ISSN 1718-9942
One year: $184
(includes all back issues since '99)
Renewals from $134
Two years: $274
Renewals from $224