Oracle: Load and transform external data into Oracle 9i
Overview
Up until Oracle 9i, SQL*Loader was the tool to load a lot of data into an Oracle database from external sources. Oracle9i introduces several interesting new features that change the way you look at the data-loading and transformation process. These is performed by the ETL toolkit (Extraction, Transformation, and Loading) and collectively provide you with a powerful ETL toolkit. Three ETL features are particularly interesting
Typical Loading Scenario
In the following example we use external tables and table functions, two of these powerful ETL features. Let's assume that you have the following revenue data to load into a database and that you want to normalize it so you end up with one row for each person/month combination:
Person Jan Feb Mar Apr Mai Jun Jul Aug Sep Oct Nov Dez
--------------------------------------------------------
Schnyder,345,223,122,345,324,244,123,123,345,121,345,197
Weber,234,234,123,457,456,287,234,123,678,656,341,567
Keller,596,276,347,134,743,545,216,456,124,753,346,456
Meyer,987,345,645,567,834,567,789,234,678,973,456,125
Holzer,509,154,876,347,146,788,174,986,568,246,324,987
Müller,456,125,678,235,878,237,567,237,788,237,324,778
Binggeli,487,347,458,347,235,864,689,235,764,964,624,347
Stoller,596,237,976,876,346,567,126,879,125,568,124,753
Marty,094,234,235,763,054,567,237,457,325,753,577,346
Studer,784,567,235,753,124,575,864,235,753,864,634,678
Person Month Revenue
-------------------------
Schnyder Jan 345
Schnyder Feb 223
Schnyder Mar 122
Schnyder Apr 345
Schnyder Mai 324
Schnyder Jun 244
Schnyder Jul 123
Schnyder Aug 123
Schnyder Sep 345
Schnyder Oct 121
Schnyder Nov 345
Schnyder Dez 197
........ ... ...Using Oracle8i, you would need to use at least two discrete steps to load and transform this data. The first step is to load the the data into a staging table using SQL*LOADER, the next step is to transform and insert the data in the final table using a PL/SQL procedure.
Oracle9i's ETL features, however, allow for a more interesting approach to the task, one that loads and transforms in just one step.
New in Oracle9i is the concept of an external table. This is a table that you define in the database's data dictionary, but for which the data itself is stored outside of the database. It's possible, for example, to define an external table that derives its data from the type of text file you would load using SQL*Loader. This is great, because the revenue data in the above example you need to load resides in just such a text file. Before you can create an external table, you need to create an Oracle directory object that points to the operating system directory in which your text file resides.
CONNECT sys/manager AS SYSDBA;
CREATE OR REPLACE DIRECTORY dat_dir AS 'C:\Oradata\Data';
CREATE OR REPLACE DIRECTORY log_dir AS 'C:\Oradata\Log';
CREATE OR REPLACE DIRECTORY bad_dir AS 'C:\Oradata\Bad';GRANT READ ON DIRECTORY dat_dir TO scott;
GRANT WRITE ON DIRECTORY log_dir TO scott;
GRANT WRITE ON DIRECTORY bad_dir TO scott;You can now use a new form of the CREATE TABLE statement that looks like a cross between a SQL statement and a SQL*Loader control file:
CONNECT scott/tiger;
CREATE TABLE revext (person VARCHAR2(20),
rev_jan NUMBER(4),
rev_feb NUMBER(4),
rev_mar NUMBER(4),
rev_apr NUMBER(4),
rev_mai NUMBER(4),
rev_jun NUMBER(4),
rev_jul NUMBER(4),
rev_aug NUMBER(4),
rev_sep NUMBER(4),
rev_oct NUMBER(4),
rev_nov NUMBER(4),
rev_dez NUMBER(4))
ORGANIZATION EXTERNAL
(
TYPE ORACLE_LOADER
DEFAULT DIRECTORY dat_dir
ACCESS PARAMETERS
(
records delimited by newline
badfile bad_dir:'revext%a_%p.bad'
logfile log_dir:'revext%a_%p.log'
fields terminated by ','
missing field values are null
( person,
rev_jan,
rev_feb,
rev_mar,
rev_apr,
rev_mai,
rev_jun,
rev_jul,
rev_aug,
rev_sep,
rev_oct,
rev_nov,
rev_dez
)
)
LOCATION ('revext.dat')
)
PARALLEL 4
REJECT LIMIT UNLIMITED;When you create an external table, you're really only creating some data dictionary entries. Nothing exciting happens until you query the table. Realize though, that you can query the table as you would any other SQL table. Let's assume you had the following production table:
CREATE TABLE revenue (
person VARCHAR2(20),
month VARCHAR2(3),
revenue NUMBER,
CONSTRAINT revenue_pk PRIMARY KEY (person,month));Given this table, you can use the following INSERT...SELECT FROM statement to extract revenue data from your external file in a normalized format and insert it into your production table.
INSERT INTO revenue (person,month,revenue)
SELECT person,'Jan',rev_jan
FROM revext
WHERE rev_jan IS NOT NULL
UNION ALL
SELECT person,'Feb',rev_feb
FROM revext
WHERE rev_feb IS NOT NULL
UNION ALL
SELECT person,'Mar',rev_mar
FROM revext
WHERE rev_mar IS NOT NULL
UNION ALL
SELECT person,'Apr',rev_apr
FROM revext
WHERE rev_apr IS NOT NULL
UNION ALL
SELECT person,'Mai',rev_mai
FROM revext
WHERE rev_mai IS NOT NULL
UNION ALL
SELECT person,'Jun',rev_jun
FROM revext
WHERE rev_jun IS NOT NULL
UNION ALL
SELECT person,'Jul',rev_jul
FROM revext
WHERE rev_jul IS NOT NULL
UNION ALL
SELECT person,'Aug',rev_aug
FROM revext
WHERE rev_aug IS NOT NULL
UNION ALL
SELECT person,'Sep',rev_sep
FROM revext
WHERE rev_sep IS NOT NULL
UNION ALL
SELECT person,'Oct',rev_oct
FROM revext
WHERE rev_oct IS NOT NULL
UNION ALL
SELECT person,'Nov',rev_nov
FROM revext
WHERE rev_nov IS NOT NULL
UNION ALL
SELECT person,'Dez',rev_dez
FROM revext
WHERE rev_dez IS NOT NULL;Because you set the degree of parallelism to four when you created the external table, the database will divide the file to be read by four processes running in parallel. This parallelism happens automatically, with no additional effort on your part, and is really quite convenient. To parallelize this load using SQL*Loader, you would have had to manually divide your input file into multiple smaller files.
Disadvantage of UNION ALL
The INSERT statement shown previously works off the union of 12 SELECT statements. This means that the external data file gets read 12 times, once for each SELECT. Reading the input file 12 times isn't desirable, especially if the file is very large. Fortunately, this happens to be just the type of problem that you can solve using table functions. You can think of a table function as a highly streamlined transformation engine. A table function takes a set of rows as input and returns a different set of rows as output. Unlike traditional functions, table functions are designed to be invoked from a SELECT statement's FROM clause.
Table Function as highly streamlined transformation engine
For the revenue data, we want to take each input row and transform it into 12 output rows. Each input row holds data from 12 different revenue months. Your normalized destination table, however, requires that each month get its own row, so your table function must transform the 12 population counts from each input row into 12 separate output rows.
Table functions always return a collection of records
Before creating the table function, you need to create some types. Table functions always return a collection of records, so to begin, create a table type that corresponds to the definition of your ultimate destination table. To do this, first create an object type to define the record, and then create a table type based on that object type.
connect scott/tiger;
CREATE TYPE revenue_row AS OBJECT (
person VARCHAR2(20),
month VARCHAR2(3),
revenue NUMBER
);
/CREATE TYPE revenue_tab
AS TABLE OF revenue_row;
/The input to your function will be rows returned by a SELECT statement against the external table REVEXT, so you'll need an appropriate REF CURSOR type. The following statement creates a package containing a REF CURSOR type REVENUE_CUR that matches the record structure of the REVEXT table. The package also defines a table function that takes such a cursor as an input parameter. Note that you have to create the table type REVENUE_TAB first, so you can use that type in your table function's RETURN clause:
CREATE OR REPLACE PACKAGE revenue_pkg
AS
TYPE revenue_cur IS REF CURSOR RETURN revext%ROWTYPE;
FUNCTION revenue_fun (revenue_arg IN revenue_cur)
RETURN revenue_tab
PARALLEL_ENABLE (PARTITION revenue_arg BY ANY)
PIPELINED;
END;
/The PARALLEL_ENABLE clause in the function specification allows the database to parallelize the function's execution. The PARTITION revenue_arg BY ANY clause indicates that the input rows can be arbitrarily divided into any number of buckets, all of which can then be processed in parallel. The PIPELINED clause enables the function to return the result set incrementally while other input data is still being processed. Think of the rows as "flowing through" the function during the execution of a query invoking the function. The function code defined in the package body takes care of transforming each input row into the desired 12 output rows.
CREATE OR REPLACE PACKAGE BODY revenue_pkg AS
FUNCTION revenue_fun (revenue_arg IN revenue_cur)
RETURN revenue_tab
PARALLEL_ENABLE (PARTITION revenue_arg BY ANY)
PIPELINED IS revenue_rec revext%ROWTYPE;
outrow_jan revenue_row := revenue_row('','',0);
outrow_feb revenue_row := revenue_row('','',0);
outrow_mar revenue_row := revenue_row('','',0);
outrow_apr revenue_row := revenue_row('','',0);
outrow_mai revenue_row := revenue_row('','',0);
outrow_jun revenue_row := revenue_row('','',0);
outrow_jul revenue_row := revenue_row('','',0);
outrow_aug revenue_row := revenue_row('','',0);
outrow_sep revenue_row := revenue_row('','',0);
outrow_oct revenue_row := revenue_row('','',0);
outrow_nov revenue_row := revenue_row('','',0);
outrow_dez revenue_row := revenue_row('','',0);
BEGIN
LOOP
FETCH revenue_arg INTO revenue_rec;
EXIT WHEN revenue_arg%NOTFOUND;
IF revenue_rec.rev_jan IS NOT NULL THEN
outrow_jan.person := revenue_rec.person;
outrow_jan.month := 'Jan';
outrow_jan.revenue := revenue_rec.rev_jan;
PIPE ROW (outrow_jan);
END IF;
IF revenue_rec.rev_feb IS NOT NULL THEN
outrow_feb.person := revenue_rec.person;
outrow_feb.month := 'Feb';
outrow_feb.revenue := revenue_rec.rev_feb;
PIPE ROW (outrow_feb);
END IF;
IF revenue_rec.rev_mar IS NOT NULL THEN
outrow_mar.person := revenue_rec.person;
outrow_mar.month := 'Mar';
outrow_mar.revenue := revenue_rec.rev_mar;
PIPE ROW (outrow_mar);
END IF;
IF revenue_rec.rev_apr IS NOT NULL THEN
outrow_apr.person := revenue_rec.person;
outrow_apr.month := 'Apr';
outrow_apr.revenue := revenue_rec.rev_apr;
PIPE ROW (outrow_apr);
END IF;
IF revenue_rec.rev_mai IS NOT NULL THEN
outrow_mai.person := revenue_rec.person;
outrow_mai.month := 'Mai';
outrow_mai.revenue := revenue_rec.rev_mai;
PIPE ROW (outrow_mai);
END IF;
IF revenue_rec.rev_jun IS NOT NULL THEN
outrow_jun.person := revenue_rec.person;
outrow_jun.month := 'Jun';
outrow_jun.revenue := revenue_rec.rev_jun;
PIPE ROW (outrow_jun);
END IF;
IF revenue_rec.rev_jul IS NOT NULL THEN
outrow_jul.person := revenue_rec.person;
outrow_jul.month := 'Jul';
outrow_jul.revenue := revenue_rec.rev_jul;
PIPE ROW (outrow_jul);
END IF;
IF revenue_rec.rev_aug IS NOT NULL THEN
outrow_aug.person := revenue_rec.person;
outrow_aug.month := 'Aug';
outrow_aug.revenue := revenue_rec.rev_aug;
PIPE ROW (outrow_aug);
END IF;
IF revenue_rec.rev_sep IS NOT NULL THEN
outrow_sep.person := revenue_rec.person;
outrow_sep.month := 'Sep';
outrow_sep.revenue := revenue_rec.rev_sep;
PIPE ROW (outrow_sep);
END IF;
IF revenue_rec.rev_oct IS NOT NULL THEN
outrow_oct.person := revenue_rec.person;
outrow_oct.month := 'Oct';
outrow_oct.revenue := revenue_rec.rev_oct;
PIPE ROW (outrow_oct);
END IF;
IF revenue_rec.rev_nov IS NOT NULL THEN
outrow_nov.person := revenue_rec.person;
outrow_nov.month := 'Nov';
outrow_nov.revenue := revenue_rec.rev_nov;
PIPE ROW (outrow_nov);
END IF;
IF revenue_rec.rev_dez IS NOT NULL THEN
outrow_dez.person := revenue_rec.person;
outrow_dez.month := 'Dez';
outrow_dez.revenue := revenue_rec.rev_dez;
PIPE ROW (outrow_dez);
END IF;
END LOOP;
RETURN;
END;
END;
/By enabling parallel DML and then using the table function, it's now possible to load and transform the revenue data in one operation:
ALTER SESSION ENABLE PARALLEL DML;
INSERT /*+ APPEND PARALLEL (t,4) */ INTO revenue t
SELECT *
FROM TABLE (revenue_pkg.revenue_fun (
CURSOR(SELECT person,
rev_jan,
rev_feb,
rev_mar,
rev_apr,
rev_mai,
rev_jun,
rev_jul,
rev_aug,
rev_sep,
rev_oct,
rev_nov,
rev_dez
FROM revext)));The revenue data will be read from the external table. That process will be parallelized, using intrafile parallelism. The output from each of those parallel operations will feed into separate processes, also running in parallel, that transform each input row into the 12 output rows you desire. Because the table function handles the transformation, only one pass through the external file is necessary. It's not necessary to stage the data in any sort of work table. The data is pipelined from the external data file, through the table function, and directly into the destination table. Fewer copies of the data results in less disk space being consumed.
The MERGE statement solves the long-standing problem of reloading data that you have loaded previously. Prior to Oracle9i, you needed to write procedural code to detect whether a row existed, and to issue an INSERT or UPDATE statement as appropriate. In Oracle9i, you can simply use the MERGE statement and let the database handle the details.
INPUT or UPDATE is the Question
Let's consider one final, and very common, data loading problem with respect to our scenario. Let's assume that we had previously loaded revenue data into the REVENUE table, and that our current input file contained both new data and updates to existing data. In cases where data in the file represents an update, we want to update existing rows in the REVENUE table. When data in the file represents new data, we want to insert new rows into the REVENUE table. This is a common problem, and one that's often handled by implementing some procedural logic using PL/SQL. For example, the following PL/SQL code reads rows one at a time from REVEXT, and issues either UPDATE or INSERT statements as appropriate.
Oracle 8i Solution
DECLARE
BEGIN
FOR inrec IN (
SELECT person,rev_jan,rev_feb,rev_mar,rev_apr,rev_mai,rev_jun,
rev_jul,rev_aug,rev_sep,rev_oct,rev_nov,rev_dez
FROM revext
) LOOP
/* Jan Data */
IF inrec.rev_jan IS NOT NULL THEN
UPDATE revenue
SET revenue = inrec.rev_jan
WHERE person = inrec.person
AND month = 'Jan';
IF SQL%ROWCOUNT = 0 THEN
INSERT INTO revenue
(person, month, revenue)
VALUES (inrec.person, 'Jan', inrec.rev_jan);
END IF;
END IF;
/* Feb Data */
IF inrec.rev_feb IS NOT NULL THEN
UPDATE revenue
SET revenue = inrec.rev_feb
WHERE person = inrec.person
AND month = 'Feb';
IF SQL%ROWCOUNT = 0 THEN
INSERT INTO revenue
(person, month, revenue)
VALUES (inrec.person, 'Feb', inrec.rev_feb);
END IF;
END IF;
/* Mar Data */
IF inrec.rev_mar IS NOT NULL THEN
UPDATE revenue
SET revenue = inrec.rev_mar
WHERE person = inrec.person
AND month = 'Mar';
IF SQL%ROWCOUNT = 0 THEN
INSERT INTO revenue
(person, month, revenue)
VALUES (inrec.person, 'Mar', inrec.rev_mar);
END IF;
END IF;
/* Apr Data */
IF inrec.rev_apr IS NOT NULL THEN
UPDATE revenue
SET revenue = inrec.rev_apr
WHERE person = inrec.person
AND month = 'Apr';
IF SQL%ROWCOUNT = 0 THEN
INSERT INTO revenue
(person, month, revenue)
VALUES (inrec.person, 'Apr', inrec.rev_apr);
END IF;
END IF;
/* Mai Data */
IF inrec.rev_mai IS NOT NULL THEN
UPDATE revenue
SET revenue = inrec.rev_mai
WHERE person = inrec.person
AND month = 'Mai';
IF SQL%ROWCOUNT = 0 THEN
INSERT INTO revenue
(person, month, revenue)
VALUES (inrec.person, 'Mai', inrec.rev_mai);
END IF;
END IF;
/* Jun Data */
IF inrec.rev_jun IS NOT NULL THEN
UPDATE revenue
SET revenue = inrec.rev_jun
WHERE person = inrec.person
AND month = 'Jun';
IF SQL%ROWCOUNT = 0 THEN
INSERT INTO revenue
(person, month, revenue)
VALUES (inrec.person, 'Jun', inrec.rev_jun);
END IF;
END IF;
/* Jul Data */
IF inrec.rev_jul IS NOT NULL THEN
UPDATE revenue
SET revenue = inrec.rev_jul
WHERE person = inrec.person
AND month = 'Jul';
IF SQL%ROWCOUNT = 0 THEN
INSERT INTO revenue
(person, month, revenue)
VALUES (inrec.person, 'Jul', inrec.rev_jul);
END IF;
END IF;
/* Aug Data */
IF inrec.rev_aug IS NOT NULL THEN
UPDATE revenue
SET revenue = inrec.rev_aug
WHERE person = inrec.person
AND month = 'aug';
IF SQL%ROWCOUNT = 0 THEN
INSERT INTO revenue
(person, month, revenue)
VALUES (inrec.person, 'Aug', inrec.rev_aug);
END IF;
END IF;
/* Sep Data */
IF inrec.rev_sep IS NOT NULL THEN
UPDATE revenue
SET revenue = inrec.rev_sep
WHERE person = inrec.person
AND month = 'Sep';
IF SQL%ROWCOUNT = 0 THEN
INSERT INTO revenue
(person, month, revenue)
VALUES (inrec.person, 'Sep', inrec.rev_sep);
END IF;
END IF;
/* Oct Data */
IF inrec.rev_oct IS NOT NULL THEN
UPDATE revenue
SET revenue = inrec.rev_oct
WHERE person = inrec.person
AND month = 'Oct';
IF SQL%ROWCOUNT = 0 THEN
INSERT INTO revenue
(person, month, revenue)
VALUES (inrec.person, 'Oct', inrec.rev_oct);
END IF;
END IF;
/* Nov Data */
IF inrec.rev_nov IS NOT NULL THEN
UPDATE revenue
SET revenue = inrec.rev_nov
WHERE person = inrec.person
AND month = 'Nov';
IF SQL%ROWCOUNT = 0 THEN
INSERT INTO revenue
(person, month, revenue)
VALUES (inrec.person, 'Nov', inrec.rev_nov);
END IF;
END IF;
/* Dez Data */
IF inrec.rev_dez IS NOT NULL THEN
UPDATE revenue
SET revenue = inrec.rev_dez
WHERE person = inrec.person
AND month = 'Dez';
IF SQL%ROWCOUNT = 0 THEN
INSERT INTO revenue
(person, month, revenue)
VALUES (inrec.person, 'Dez', inrec.rev_dez);
END IF;
END IF;
END LOOP;
END;
/An approach like this has several problems. For one, we're no longer using table functions, so we've just destroyed our ability to parallelize the transformation part of our load process. We're also forced to issue up to 12 DML statements for each input record, and we have to worry about whether the most likely case will be an UPDATE or an INSERT. Finally, it's a downright pain to have to code this type of logic. Wouldn't it be nice of there were a single statement to do all this work for us? In Oracle9i, there is such a statement. It's the MERGE statement, and it's designed to either INSERT new rows or UPDATE existing row as appropriate.
Oracle 9i Solution
When writing a MERGE statement, you must specify the following items:
- The name of the destination table
- A SELECT statement to serve as the source of the data
- A condition for use in identifying input rows that represent updates to existing data
- UPDATE and INSERT clauses
The following MERGE statement will properly merge data from our external table REVEXT (src) into the REVENUE (dest) table. The ON clause references the primary key columns for the destination table. When the database reads a row from the source query, it evaluates the ON condition against each row in the destination table (via the primary key index in this case). If a row is found in the destination table for which the ON condition evaluates to TRUE, then the database treats the source row as an update.
MERGE INTO revenue dest
USING (SELECT * FROM TABLE (revenue_pkg.revenue_fun(
CURSOR(SELECT person,
rev_jan,
rev_feb,
rev_mar,
rev_apr,
rev_mai,
rev_jun,
rev_jul,
rev_aug,
rev_sep,
rev_oct,
rev_nov,
rev_dez
FROM revext)))) src
ON (dest.person = src.person
AND dest.month = src.month)
WHEN MATCHED THEN
UPDATE SET dest.revenue = src.revenue
WHEN NOT MATCHED THEN
INSERT (person, month, revenue)
VALUES (src.person, src.month, src.revenue);
COMMIT;This is really exciting! With one statement, we can read data from our external data file, pass it through our table function REVENUE_FUN in order to transform it to match our database table, and then either update existing rows or insert new rows into our database table as appropriate. What's more, no temporary staging tables are needed, and all of this work can easily be parallelized to any degree.
Conclusion
The problem of merging data has been around forever, and it's great that we finally have a way to deal with it at the database level that also takes advantage of the database's scalability. External tables promise to blur the distinction between external and internal data, making it easier to load data from external systems into the database. Using these new ETL features, you can take a multistep, single-threaded load process, and convert it to a single-step, pipelined, and highly parallelized load process
Source: http://www.akadia.com/services/ora_etl.html
Comments