The SQL Processes By 宁辉  

2011-06-26 02:13:07

Introduction

Most batch processes for business applications today are still implemented in procedural languages. The typical approach is to process one record (or one set of related records) at a time until all the input data is exhausted. This record-oriented processing method has been with us since the dawn of data processing. It has served us well for many years and is still the only processing method used by a lot of programmers. However, it starts to reveal its weakness with the arrival of large-scale relational databases.

Business data in a relational database is stored in tables. Data from these tables is retrieved by SQL statements and stored in variables within the program. Sometimes data from dozens of tables is required to perform a single business operation. In that situation assembling the input data is itself a major effort, and the coding becomes more tedious as the number of input tables increases. It is not unusual for a programmer to spend more time assembling the input data than implementing the business logic.

To make matters worse, not all the data can be retrieved at the beginning of the program, because some of these retrievals are conditional upon the outcomes of computations at a later stage. Debugging also becomes more difficult as the number of IF-THEN-ELSE conditional commands grows.

Experienced programmers have noticed that the number of variables for their input data can be substantially reduced if they do more screening and checking in the SQL SELECT statements before passing the data to the host program. In practice this means retrieving input data with a more complex SQL statement that joins several input tables. In this way SQL plays a more important role, because part of the processing logic is actually handled by its WHERE clause. It is true that such an approach can cause a performance problem when the programmer doesn't have the knowledge to properly tune the SQL statement, or when the existing index structure can't efficiently support the operations performed by such a complex SELECT. For illustration, a combined retrieval of this kind might look like the sketch below.
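All table and column names in this sketch are hypothetical; it only shows the shape of such a statement:

select o.order_no, o.amount, c.credit_limit, r.region_name
from orders o, customers c, regions r
where o.customer_no = c.customer_no and
      c.region_code = r.region_code and
      c.status = 'ACTIVE' and
      o.order_date >= to_date('2011-01-01', 'YYYY-MM-DD');

/* Three tables are joined and screened in one statement, so the host
   program receives only the rows that pass every condition. */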

The work table approach

We do like the idea of letting SQL do more in the process. In fact, we want to let SQL do everything. Our approach is to create one or more work tables to house all the required data, which includes data retrieved from the database and the temporary data required to carry out the computation. At the beginning of the process these work tables are populated by a series of Insert and Update SQL statements that draw data from the database. These are relatively simple SQL statements because each of them only copies data from one or two database tables. The operation is done within the SQL environment and no procedural language commands are involved. At the end of this step all the required data is available in these work tables.

The next step is to continue our computation on the work tables to produce all the data for the output. This is also done by Update or Insert SQL statements.

When all the output data is available, additional Insert or Update statements are used to update the database. If necessary, the procedural language can take over control at this stage to produce a report from the work table.

In such an approach we are not trying to reduce the number of SQL statements. On the contrary, we will have more SQL statements in the program. But each SQL statement is only responsible for a simple task, and its result is visible in the work table. To debug such a program you simply execute these statements one by one and inspect the content of the work table at each step.

An example

The following example is for illustration purposes only.

Suppose budgets for a given year have been determined manually for each department and the amounts have to be distributed to all the expense accounts within the same department proportional to last year's actual expenses.

Assuming input budget data for each department has been created in a table BUDGET_INPUT, a program is required to insert into the LEDGER table the budget figures for each expense account within these departments. The work table for this program will be called BUDGET_WORK_FILE.

Tables and columns used in this process: 

1. BUDGET_INPUT table - the table that contains the budget figure for each department. 
BUDGET_INPUT (year, department, record_status, department_budget) 

2. LEDGER table - the database table for the account balance by year, department, and ledger_type (Actual or Budget) 
LEDGER (account, department, year, ledger_type, amount) 

3. ACCOUNT_TYPE table 
ACCOUNT_TYPE (account, account_type) 

4. BUDGET_WORK_FILE table - Temporary work table created for this process 
BUDGET_WORK_FILE (account, department, budget_amount, last_year_actual_amt, last_year_department_total, process_step)
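One possible definition of this work table is sketched below; the data types shown are assumptions and should match the corresponding LEDGER and BUDGET_INPUT columns:

create table budget_work_file
( account                    varchar2(20),
  department                 varchar2(20),
  budget_amount              number,
  last_year_actual_amt       number,
  last_year_department_total number,
  process_step               varchar2(20) );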


**** The Program ****

/* Statement 1: To delete the old data in the work file */

delete from budget_work_file;

/* Statement 2: To copy to work table the actual expense from last year by department and account */

insert into budget_work_file
(account, department, last_year_actual_amt, process_step)
select a.account, a.department, sum(a.amount), 'LAST_YR_ACTUAL'
from account_type b, ledger a
where a.account = b.account and
      b.account_type = 'EXPENSE' and
      a.ledger_type = 'ACTUAL' and
      a.year = 2011 - 1
group by a.account, a.department;

/* Remark: We use 2011 - 1 instead of 2010 because the year entry can be replaced by a bind variable later. */

/* Statement 3: To calculate last year's total expense by department */

update budget_work_file a
set last_year_department_total
  = (select sum(last_year_actual_amt)
     from budget_work_file b
     where b.department = a.department and
           b.process_step = 'LAST_YR_ACTUAL')
where a.process_step = 'LAST_YR_ACTUAL';

/* Statement 4: To distribute the budget figures to each department based on last year's expense */

insert into budget_work_file
(account, department, budget_amount, process_step)
select a.account, a.department,
       b.department_budget * a.last_year_actual_amt / a.last_year_department_total,
       'BUDGET_OUTPUT'
from budget_input b, budget_work_file a
where b.department = a.department and
      a.process_step = 'LAST_YR_ACTUAL' and
      b.record_status = 'APPROVED' and
      b.year = 2011 and
      a.last_year_department_total <> 0;

/* Remark: We assume that there is only one row in budget_input per department for the same year */

/* Statement 5: To copy the budget by account and department to the LEDGER table */

insert into LEDGER 
(account, department, year, ledger_type, amount) 
select account, department, 2011, 'BUDGET', budget_amount 
from budget_work_file 
where process_step = 'BUDGET_OUTPUT';

Please note that statements 4 and 5 can be combined to insert the budget figures into LEDGER directly from the two source tables in statement 4. But keeping an audit trail of the output data in the work table is not a bad thing.

The main difference between this approach and the procedural approach is that each INSERT or UPDATE operation can generate or update many rows in the table at once, whereas the procedural approach in general handles only one row at a time.

The procedural language can still have a minor role in this approach. For example, these SQL statements can be embedded in a procedural language program so that the year entry (2011) is replaced by a bind variable whose value comes from the input parameters, as in the sketch below.
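A minimal PL/SQL sketch of such a wrapper; the procedure name and parameter are hypothetical, not part of the original program:

create or replace procedure load_budget (p_year in number) as
begin
  /* Statement 1: clear the work table */
  delete from budget_work_file;

  /* Statement 2: the literal year becomes the bind variable p_year */
  insert into budget_work_file
  (account, department, last_year_actual_amt, process_step)
  select a.account, a.department, sum(a.amount), 'LAST_YR_ACTUAL'
  from account_type b, ledger a
  where a.account = b.account and
        b.account_type = 'EXPENSE' and
        a.ledger_type = 'ACTUAL' and
        a.year = p_year - 1
  group by a.account, a.department;

  /* Statements 3 to 5 follow in the same way, with p_year in place of 2011 */
end;
/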

We have to emphasize that the example is constructed solely for demonstration. It does not take into account all the details. For example, there could be new accounts set up in 2011 that were not in the 2010 ledger.

The Control Column

You may have already noticed that the column PROCESS_STEP plays an important role in identifying the set of data to be processed. This is what we call the Control Column. The control column should have a distinct value for each Insert statement. It should also be reassigned a new value in any important Update statement to identify those rows that have actually participated in the update, as in the sketch below.
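For instance, a hedged sketch against the example above: after statement 4 has distributed the budget, the source rows that actually took part could be stamped with a new value (the value 'DISTRIBUTED' is an assumption, not part of the original program):

update budget_work_file
set process_step = 'DISTRIBUTED'
where process_step = 'LAST_YR_ACTUAL' and
      last_year_department_total <> 0;

/* Later statements can now tell the distributed rows apart from rows
   belonging to departments whose last-year total was zero. */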

The work table should have sufficient columns to support all the computations. Since it will only be used for the process itself, we are free to do anything necessary to give the process maximum efficiency, including building suitable indexes to support our operations. For example, statement 3 will need an index on the DEPARTMENT column (see the sketch below). We may even want to break the rules of normalization and keep some redundant data if that makes the program run faster.
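A possible index for statement 3; the index name is an assumption:

create index budget_work_file_ix1
on budget_work_file (department, process_step);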

Some programmers may like to delete all the rows of the work table at the end of the process; others would prefer to keep the work table data as an audit trail until the next run.

Concurrent processing

Some applications may have to allow two or more sessions of the same program to run at the same time. In that case additional effort is required to structure our work table. Normally the business requirement will give us a hint on how to avoid the potential contention. For example, if the process is to update the LEDGER with a batch of journal entries identified by a batch number, then two different sessions can work simultaneously as long as all the operations in the program are applied only to work table rows carrying the session's own batch number.

If you cannot find a key item such as the batch number to "partition" your work table, there is always a way to generate a unique number (e.g. a session counter) for your session to serve that purpose, as sketched below. How to support concurrent runs in a batch process is always an issue no matter what approach you take, but in this approach the issue is more noticeable.
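A hedged sketch of that idea: a BATCH_NO column is added to the work table, an Oracle sequence supplies the unique number, and every statement is qualified by it (the sequence name and host variable are hypothetical):

create sequence work_session_seq;   /* created once, outside the program */

select work_session_seq.nextval into :my_batch from dual;

delete from budget_work_file where batch_no = :my_batch;

/* Every subsequent Insert supplies :my_batch in its column list, and every
   Update or Select adds "and batch_no = :my_batch" to its WHERE clause,
   so two sessions never touch each other's rows. */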

In more recent versions of the Oracle DBMS you can create temporary tables whose rows are accessible only within your own session. That could be a good vehicle for your work tables.
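For example, the work table from our example could be declared as a global temporary table; with ON COMMIT PRESERVE ROWS each session keeps its own private rows until the session ends (data types, as before, are assumptions):

create global temporary table budget_work_file
( account                    varchar2(20),
  department                 varchar2(20),
  budget_amount              number,
  last_year_actual_amt       number,
  last_year_department_total number,
  process_step               varchar2(20) )
on commit preserve rows;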

For a complicated batch process you may need more than one work table. Suppose we are writing a payroll program that calculates the incomes, benefits, deductions, taxes, and net pay for each employee. We may need a "pointer" work table to identify all the employees that will be included in this run (see the sketch below). We may have another work table to house the related data for their incomes, benefits, and deductions. If the salary data such as hourly rates is not stored in a suitable format for our SQL operations, we may need yet another work table to store the related hourly rates in exactly the format we want. The execution time to populate such a temporary hourly rate table is negligible in comparison with the benefit it provides.
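A hedged sketch of the pointer work table; all table and column names here are hypothetical:

insert into payroll_run_employees (employee_no, pay_period)
select employee_no, :pay_period
from employees
where status = 'ACTIVE';

/* Every later statement in the run joins to payroll_run_employees,
   so the set of employees being processed is fixed in one place. */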

When to Commit

The next issue that we want to discuss is when to apply the database commit. In the example above we update the permanent table LEDGER once, at the end of the process. If any internal or external condition causes the program to terminate before that step, we can always restart from the beginning because no permanent database table would have been updated. Therefore abnormal termination of this process will not cause problems for the database.

However, if there are two updates on two permanent tables in the program, then we should try not to Commit the database transaction between these two update statements. Otherwise, if the system terminates abnormally between these two steps, the integrity of the database will be violated. If you can, you should place all the permanent-table update statements at the end of the program, with no Commit between them, as outlined below.
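In outline, continuing the budget example (LEDGER_SUMMARY is a hypothetical second permanent table added only for illustration):

insert into ledger
(account, department, year, ledger_type, amount)
select account, department, 2011, 'BUDGET', budget_amount
from budget_work_file
where process_step = 'BUDGET_OUTPUT';

update ledger_summary
set last_budget_year = 2011
where department in (select department from budget_input where year = 2011);

commit;

/* Both permanent tables are updated together; the single Commit at the
   end makes the pair atomic. */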

We understand that there are situations where we need to Commit database transactions between two database updates. In that case sufficient data should be captured in your work table to make these updates reversible.

When to place your database commit is also an issue for the other approaches. But it is more important in the work table approach because each Update or Insert statement could change a lot of data in one step. If possible, we should keep enough data to make the entire process reversible.

Beyond batch processes

In this article we use the term Batch Process casually. In fact, this work table approach is not restricted to Batch Processes in the technical sense. In my previous job, an on-line program that gathers data from 7 tables in an Asset Management System to produce transactions for the mass transfer of assets was successfully implemented with this work table approach.

The strength

In this approach overhead is reduced to an absolute minimum. Everything you need to know about a statement can normally be displayed on the same page of your screen; there is no need to go elsewhere in the program for cross-referencing. The highly independent nature of each SQL statement and its compact format not only simplify the debugging process, but also make program modification a much easier task.

A friend of mine said the predicate logic in the WHERE clause of a SQL statement is one step closer to human thinking than the If-Then-Else type of logic. I came to agree completely after implementing a large batch program in a project costing system. I remember my user gave me the following list of instructions:

  1. Identify all the capital projects.
  2. Copy the total outstanding amounts in the transaction file for these projects by account, department,
     and cost type (labor or material).
     Remark: There is no account code in the transaction file for the labor cost type.
  3. For each project and department, prorate the total labor cost by material accounts.
  4. .......

I was amazed to find out later that after setting up the work table I was able to translate her instructions into SQL statements one by one.


The weakness

The work table approach cannot handle a recursive formula or recursive process directly. A recursive formula is normally implemented with iteration commands in a procedural language. However, we can still use the work table method inside the procedural loop when that is beneficial, as the sketch below shows.
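A hedged PL/SQL sketch with hypothetical table and column names: the recurrence balance = balance * (1 + rate) - payment must be iterated, but each pass is still a set-wise update on the work table:

begin
  for i in 1 .. 12 loop
    /* one set-wise pass per month: every loan row advances one period */
    update loan_work_table
    set balance = balance * (1 + monthly_rate) - payment;
  end loop;
end;
/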

We also noticed that the work table approach demands above-average skill in SQL programming. Sometimes knowledge beyond the core SQL comes in handy for sophisticated problems. For example, we often use functions like DECODE, TO_CHAR, TO_DATE, and ADD_MONTHS in Oracle SQL. Programmers who are not well versed in the SQL language may not be able to take full advantage of this approach.


Conclusion

What we are describing here is not a new approach. A lot of programs have been written in a similar fashion, and sometimes the term Set Processing is used for this kind of program. But I haven't seen any systematic discussion of the work tables. Actually, most Set Processing programs do direct Updates and Inserts on the permanent tables; without work tables and control columns the method suffers from a severe handicap.

I have been using the work table approach for many years and I have reduced my development and maintenance costs substantially. I will not go back to the procedural approach.

