Miscellaneous Ramblings on Hacking

November 14, 2006

Source File Archiving: SSIS versus DataStage

Filed under: DataStage, ETL, SQL Server Integration Services, SSIS — Zack Bethem @ 4:42 pm

I thought it would be interesting to compare and contrast SSIS and IBM DataStage. It's easy to get married to one particular technology, so a side-by-side look at how each product handles the same task may be useful.

In this example, we're working with source files. After a load, I believe it's best practice to archive the files you've loaded for later reference. These files can then be absorbed by whatever corporate archive strategy is already in place. The goal of the process is as follows:

  1. Open File
  2. Process data in file and load to target
  3. Archive File

SSIS File Archive

In SSIS, steps 1 and 2 are handled by the Data Flow of the package. Step 3 is executed within the Control Flow. There are two methods available for archiving:

  1. Use two File System tasks in sequential order: the first moves the file to the desired directory, and the second renames it with the appropriate date stamp. The downside is that each File System task requires both a source and a destination connection manager to be defined.
  2. Use a Script Task that executes code to move and rename the file. This offers far more control and can do the move and rename in one step. The downside is that code reuse requires either building a custom task or copying the script into each package.

Using this script as a reference, I went the scripting route as it seemed cleaner. The script was defined with the following variables:

User::FileName – data file name, e.g. loandata.txt
User::FileDir_DATA – data directory on the staging server, e.g. 'c:\data\sources\'
User::FileDir_TYPE – data sub-directory, e.g. 'loandata\'

Here’s the code:

Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Runtime
Imports System.IO

Public Class ScriptMain

    'Reference: http://blogs.conchango.com/jamiethomson/archive/2005/09/14/2149.aspx

    Public Sub Main()

        'Check that the package variables exist before using them
        If Dts.Variables.Contains("User::FileName") = True AndAlso _
           Dts.Variables.Contains("User::FileDir_DATA") = True AndAlso _
           Dts.Variables.Contains("User::FileDir_TYPE") = True Then

            Dim srcFile, tgtFile, tgtFileName As String

            'Full path of the source data file, e.g. c:\data\sources\loandata\loandata.txt
            srcFile = CStr(Dts.Variables("User::FileDir_DATA").Value) + _
                      CStr(Dts.Variables("User::FileDir_TYPE").Value) + _
                      CStr(Dts.Variables("User::FileName").Value)

            'Insert a timestamp before the file extension, e.g. loandata_2006111416.txt
            tgtFileName = CStr(Dts.Variables("User::FileName").Value)
            tgtFileName = Left$(tgtFileName, InStrRev(tgtFileName, ".") - 1) + _
                          "_" + CStr(Format(Now, "yyyyMMddHH")) + _
                          Mid(tgtFileName, InStrRev(tgtFileName, "."))

            'System.Windows.Forms.MessageBox.Show(tgtFileName)

            'Target path in the Archive sub-directory
            tgtFile = CStr(Dts.Variables("User::FileDir_DATA").Value) + _
                      CStr(Dts.Variables("User::FileDir_TYPE").Value) + _
                      "Archive\" + tgtFileName
            'System.Windows.Forms.MessageBox.Show(tgtFile)

            Try
                File.Move(srcFile, tgtFile)
                Dts.Events.FireInformation(0, "", "File moved to: " + tgtFile, "", 0, True)
                Dts.TaskResult = Dts.Results.Success
            Catch ex As Exception
                Dts.Events.FireInformation(1, "", "File move failure. Tried to move to: " + tgtFile, "", 0, False)
                Dts.TaskResult = Dts.Results.Failure
            End Try

        End If

    End Sub

End Class

After the Script Task is defined, the package can move and rename a source data file into the designated Archive subdirectory. This is great if you like to code. In my opinion, though, it's too much logic for what amounts to a simple 'move' command.

DataStage File Archive

In DataStage, the process of archiving files can be controlled through the job's properties. (Note: a DataStage job is the equivalent of an SSIS package.) One only needs to open the job's properties and set the 'After-job Subroutine'.

[Screenshot: DataStage job properties with the After-job Subroutine set to ExecDOS]

In the above screenshot, you'll see that the 'After-job Subroutine' is set to 'ExecDOS', which executes a simple DOS move command. It doesn't get any easier than DOS. The InputValue is set to the actual command syntax. For clarification, the command uses job parameters (the #...# references), which are as follows:

#$DrvPath# – drive letter, e.g. c:\

#$SrcPath# – source data directory, e.g. data\sources\

The command also uses %date to pull out the parts of the date in yyyymmdd format.
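
For illustration, the InputValue might look something like the line below. The loandata.txt file name, the loandata\Archive\ sub-directory, and the %date substring offsets (which depend on the Windows regional date format, assumed here to be the US default) are assumptions carried over from the SSIS example rather than the exact command in the screenshot:

move #$DrvPath##$SrcPath#loandata\loandata.txt #$DrvPath##$SrcPath#loandata\Archive\loandata_%date:~10,4%%date:~4,2%%date:~7,2%.txt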

Conclusion

If you love to code and revel in the idea of working with VB.NET, then of course SSIS is your preference. But the point of this post is that one shouldn't have to write code for what is essentially a single line at the DOS command prompt.

September 16, 2006

Building a Calendar Dimension in DataStage

Filed under: DataStage, ETL — Zack Bethem @ 11:55 pm

Inaugural post!

One of the first ETL tasks commonly required is building a time or calendar dimension. The ETL job is run on an ad-hoc basis to populate the target dimension. In this example, I'll show how a single transformer can generate one row per day over a date range. The resulting rows are then inserted into the target.

Target definition:
-- note: only mo_date carried an explicit data type in the original definition; the remaining types are assumed
create table calendar (
    mo_date datetime,
    mo_start_date datetime,
    mo_work_day int,
    mo_work_day_remaining int,
    mo int,
    qtr int,
    yr int,
    fiscal_mo int,
    fiscal_qtr int,
    fiscal_yr int );

Now that the table is set up, it's time to start up DataStage and create a new job. The job should be laid out with the following objects, as shown.

As you can see, there is a post-load update to the calendar dimension. The update is responsible for setting the mo_work_day_remaining column, which represents the number of workdays remaining in the month.
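
As a rough sketch of what that post-load update could look like (the actual SQL isn't shown here; this version assumes a SQL Server target with the default DATEFIRST setting and treats Monday through Friday as workdays):

-- for each calendar row, count the remaining Mon-Fri dates in that row's month, including the row's own date
update calendar
set mo_work_day_remaining = (
    select count(*)
    from calendar c2
    where c2.yr = calendar.yr
      and c2.mo = calendar.mo
      and c2.mo_date >= calendar.mo_date
      and datepart(dw, c2.mo_date) between 2 and 6 );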

Variables

To ensure that the job is dynamic and can be re-used in the future, it is defined with the following job variables:
StartDate – defines the first day to generate records for
EndDate – defines the last day to generate records for

Transformer: X_GenerateCalendar

The meat of the job is in the first transformer. It is in this transformer that records are generated for each interval (variable: Interval) from the first day (job variable: StartDate) to the last day (job variable: EndDate).

For this transformer, we've set up the following variables:

These variables are later defined as stage variables within the X_GenerateCalendar transformer.

The stage variables are now set up for the initial transformer. You'll notice that we've made extensive use of the ICONV and OCONV functions within DataStage. These functions translate dates into internal and external formats, respectively. Once a date is in the internal format, we can apply the available D-code conversion values. For example, the D-code "DM" returns the numerical month and "DY[4]" the four-digit year.
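
For illustration, derivations along these lines could feed the month, quarter, and year columns. The names CurrDate, MonthNum, QtrNum, and Year4 are hypothetical stage variable names, and the Iconv format string assumes the StartDate parameter arrives as YYYY-MM-DD:

CurrDate = Iconv(StartDate, "D-YMD[4,2,2]")  ;* external date converted to DataStage internal format
MonthNum = Oconv(CurrDate, "DM")             ;* numerical month, e.g. 9
QtrNum   = Oconv(CurrDate, "DQ")             ;* quarter, e.g. 3
Year4    = Oconv(CurrDate, "DY[4]")          ;* four-digit year, e.g. 2006

These are written as DataStage BASIC assignments for readability; in the job itself each right-hand expression would sit in the corresponding stage variable or output column derivation.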

The transformer output link is set up as follows:

Note the constraint defined on the output link. The constraint @OUTROWNUM > 0 ensures the stage variables are defined before any records are sent, and we continue generating rows until StartingDate equals EndingDate.
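
A constraint along these lines would capture both conditions; the exact expression isn't reproduced above, and the comparison assumes StartingDate and EndingDate hold dates in DataStage's internal format:

@OUTROWNUM > 0 And StartingDate <= EndingDate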

Summary

That's it. Through the use of a single, initial transformer we were able to generate calendar rows across a date range based on the input parameters: start date and end date. Starting with the initial date provided, we generated several columns (mo, yr, qtr) that will be useful for dissecting and aggregating our fact tables at varying levels of granularity.

Hope it helps. I've skipped the rest of the transformation for this particular job, as that is simply controlled by standard output SQL from Tgt_Calendar.
