Miscellaneous Ramblings on Hacking

October 20, 2009

SSIS: Waiting for a file

Filed under: ETL, SQL Server Integration Services, SSIS — Zack Bethem @ 4:50 pm

With longer-running processes, it is common practice to create a 0-byte file as a signal to a dependent process. An example would be an FTP transfer of several files to a staging directory, with a 0-byte file dropped last to signal that the transfer is complete. This post shows how to make an SSIS package wait for the 0-byte file.

Control Flow

When complete, the control flow for this package will have the following items defined:

  • For Loop Container – this keeps the process looping until the 0-byte file is found
  • Check File script task – this checks for the presence of the file and sets a Boolean variable to True or False
  • Sleep script task – this adds a time delay so that the process isn't checking every nanosecond
  • Move CheckFile script task – this moves the 0-byte file to avoid conflicts with subsequent data runs

The control flow will appear as follows:

[Screenshot: completed control flow, with the For Loop Container holding the Check File and Sleep script tasks, followed by the Move CheckFile task]

Variables

The following variables need to be defined:

  • CheckFile (String) – Filename of the 0-byte file to look for
  • CheckFileExists (Boolean) – Flag that is set to True upon detection of the 0-byte file
  • FFDataPath (String) – Staging directory where the 0-byte file is expected to be placed
  • FFArchivePath (String) – Archive directory the 0-byte file is moved to after detection (used by the Move CheckFile task)
  • SleepMinutes (Int32) – Number of minutes to sleep between checks for the 0-byte file

Definition: For Loop Container

Configured to loop while the User variable User::CheckFileExists is equal to False.

InitExpression: (none)
EvalExpression: @[User::CheckFileExists] == False
AssignExpression: (none)
Definition: CheckFile Script Task

Configured to set the value of the User variable User::CheckFileExists to True or False.

ReadOnlyVariables User::CheckFile, User::FFDataPath
ReadWriteVariables User::CheckFileExists

The script task will then execute the following code:

Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Runtime
Imports System.IO


Public Class ScriptMain

    ' The execution engine calls this method when the task executes.
    ' To access the object model, use the Dts object. Connections, variables, events,
    ' and logging features are available as static members of the Dts class.
    ' Before returning from this method, set the value of Dts.TaskResult to indicate success or failure.
    ' 
    ' To open Code and Text Editor Help, press F1.
    ' To open Object Browser, press Ctrl+Alt+J.

    Public Sub Main()
        '
        ' Created by: Zack Bethem @ AmberLeaf
        '
        Dim fileLoc, fileName As String

        ' Make sure both variables are defined on the task before using them.
        If Dts.Variables.Contains("User::FFDataPath") = True AndAlso _
        Dts.Variables.Contains("User::CheckFile") = True Then
            fileLoc = CStr(Dts.Variables("User::FFDataPath").Value)
            fileName = CStr(Dts.Variables.Item("User::CheckFile").Value)
            'debug System.Windows.Forms.MessageBox.Show("FileDir:" + fileLoc + "FileName:" + fileName)

            ' Set the flag that the For Loop Container evaluates.
            If File.Exists(fileLoc + fileName) Then
                Dts.Variables.Item("User::CheckFileExists").Value = True
                'debug System.Windows.Forms.MessageBox.Show("File exists")
            Else
                Dts.Variables.Item("User::CheckFileExists").Value = False
                'debug System.Windows.Forms.MessageBox.Show("File not exists")
            End If
            Dts.TaskResult = Dts.Results.Success
        Else
            ' Fail the task if the expected variables are missing.
            Dts.TaskResult = Dts.Results.Failure
        End If
    End Sub

End Class
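
For example, with User::FFDataPath set to C:\staging\ and User::CheckFile set to handoff.txt, the task tests for C:\staging\handoff.txt. Note that the two values are concatenated directly, so the path must include a trailing backslash.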

The script task is connected to the Sleep task with a precedence constraint. This is defined by double-clicking on the connector and setting the following:

[Screenshot: Precedence Constraint Editor, Evaluation operation set to Expression, presumably with @[User::CheckFileExists] == False so the Sleep task only runs when the file has not yet appeared]

Definition: Sleep Script Task

Configured to pause the process between checks for the 0-byte file.

ReadOnlyVariables User::SleepMinutes
ReadWriteVariables (none)

The script task will then execute the following code:

Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Runtime

Public Class ScriptMain

    ' The execution engine calls this method when the task executes.
    ' To access the object model, use the Dts object. Connections, variables, events,
    ' and logging features are available as static members of the Dts class.
    ' Before returning from this method, set the value of Dts.TaskResult to indicate success or failure.
    ' 
    ' To open Code and Text Editor Help, press F1.
    ' To open Object Browser, press Ctrl+Alt+J.

    Public Sub Main()
        '
        ' Zack Bethem @ AmberLeaf
        ' Based on: http://blogs.pragmaticworks.com/mike_davis/2009/06/make-an-ssis-package-delay-or-wait-for-data.html
        '
        If Dts.Variables.Contains("User::SleepMinutes") = True Then
            Dim min As Double = Convert.ToDouble(Dts.Variables("SleepMinutes").Value) * 60
            Dim ms As Int32 = Convert.ToInt32(min * 1000)
            System.Threading.Thread.Sleep(ms)

            Dts.TaskResult = Dts.Results.Success
        Else
            Dts.TaskResult = Dts.Results.Failure
        End If

    End Sub

End Class
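
With User::SleepMinutes set to 5, for example, the script sleeps for 5 * 60 * 1000 = 300,000 milliseconds between checks.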

The For Loop container is connected to the Move CheckFile task with a precedence constraint. This is defined by double-clicking on the connector and setting the following:

[Screenshot: Precedence Constraint Editor, Evaluation operation set to Expression, presumably with @[User::CheckFileExists] == True]

Definition: Move CheckFile Script Task

Configured to move the 0-byte handoff file to an Archive location with the appended suffix of the date in yyyyMMddHH format.

ReadOnlyVariables User::CheckFile, User::FFDataPath, User::FFArchivePath
ReadWriteVariables (none)

The script task will then execute the following code:

Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Runtime
Imports System.IO

Public Class ScriptMain

    ' The execution engine calls this method when the task executes.
    ' To access the object model, use the Dts object. Connections, variables, events,
    ' and logging features are available as static members of the Dts class.
    ' Before returning from this method, set the value of Dts.TaskResult to indicate success or failure.
    ' 
    ' To open Code and Text Editor Help, press F1.
    ' To open Object Browser, press Ctrl+Alt+J.

    Public Sub Main()
        '
        ' Check that the expected variables exist before using them.
        If Dts.Variables.Contains("User::CheckFile") = True AndAlso _
        Dts.Variables.Contains("User::FFDataPath") = True AndAlso _
        Dts.Variables.Contains("User::FFArchivePath") = True Then

            Dim srcFile, tgtFile, tgtFileName As String

            srcFile = CStr(Dts.Variables("User::FFDataPath").Value) + _
            CStr(Dts.Variables("User::CheckFile").Value)

            tgtFileName = CStr(Dts.Variables("User::CheckFile").Value)

            ' Add a datestamp (yyyyMMddHH) to the handoff file name, just before the extension.
            tgtFileName = Left$(tgtFileName, InStrRev(tgtFileName, ".") - 1) + _
            "_" + CStr(Format(Now, "yyyyMMddHH")) + _
            Mid(tgtFileName, InStrRev(tgtFileName, "."))

            'debug System.Windows.Forms.MessageBox.Show(tgtFileName)

            tgtFile = CStr(Dts.Variables("User::FFArchivePath").Value) + tgtFileName

            'debug System.Windows.Forms.MessageBox.Show(tgtFile)

            Try
                File.Move(srcFile, tgtFile)
                Dts.Events.FireInformation(0, "", "File moved to: " + tgtFile, "", 0, True)
                Dts.TaskResult = Dts.Results.Success
            Catch ex As Exception
                Dts.Events.FireInformation(1, "", "File move failure. Tried to move to: " + tgtFile, "", 0, False)
                Dts.TaskResult = Dts.Results.Failure
            End Try

        Else
            ' Fail the task if the expected variables are missing.
            Dts.TaskResult = Dts.Results.Failure
        End If

    End Sub


End Class
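
For example, a handoff file named checkfile.txt archived at 4 p.m. on October 20, 2009 lands in the archive directory as checkfile_2009102016.txt.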


July 24, 2007

Using WinZip within SSIS

Filed under: ETL, SSIS — Zack Bethem @ 11:30 pm

Windows comes with the ability to compress files. In practice, however, I found that the standard utility starts to choke when compressing larger flat files (greater than 2GB). To get around this problem within SSIS, I used WinZip. There were some tricks to getting it working, however.

Using an Execute Process Task in the Control Flow of a package, I configured the following via Expressions in the Expression Editor:

  • set the WorkingDirectory to the expected location of the data file
  • set the Arguments to evaluate as: -min -a <fullpath>\<zipfile>.zip <fullpath>\<sourcefile>.txt

The Execute Process Task's Executable property was set to the absolute path of the WINZIP32.EXE program (e.g. C:\Program Files\WinZip\WINZIP32.EXE).
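
As a sketch, the Arguments expression might look like the following, where User::DataDir is a hypothetical variable holding the data directory and the file names stand in for your own (note the doubled backslashes required in SSIS string literals):

"-min -a " + @[User::DataDir] + "\\archive.zip " + @[User::DataDir] + "\\loandata.txt"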

Automating FTP

Filed under: ETL, SQL Server Integration Services, SSIS — Zack Bethem @ 11:19 pm

Processing flat files is a common task when building a data warehouse. The need typically arises when integrating with 3rd-party providers or data outsourcers. A common integration technique is to use flat files with FTP as the transfer mechanism. One should use secure FTP, but SSIS out-of-the-box only supports the weaker, less-secure FTP.

When configuring the FTP connection, SSIS cannot store the login information. This is fine if your final solution requires a human to push a button and enter the security information every time, but hardly ideal for those of us who like to automate things and sleep at night. To get around the problem, include a Script Task in the package's Control Flow. The Script Task will set the connection properties of the FTP Connection automatically.

Here’s how you do it:

1. Add a Script Task within the Control Flow before the FTP

[Screenshot: sample control flow with the Script Task placed before the FTP Task]

2. Assign the following user-defined variables as ReadOnlyVariables:

FtpServer – server connection
FtpPort – port number used for the FTP connection
FTPuser – FTP user account
FTPpwd – FTP password for the account

3. Add the following code to the Script Task to set the connection properties for the FTP Connection Manager:

Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Runtime

Public Class ScriptMain

    Public Sub Main()

        ' Look up the FTP connection manager by the name it has in the package.
        Dim ftpConnectionManager As ConnectionManager
        ftpConnectionManager = Dts.Connections("FTP Connection Manager")

        ' Push the user variables into the connection manager's properties.
        ftpConnectionManager.Properties("ServerName").SetValue(ftpConnectionManager, Dts.Variables("FtpServer").Value)
        ftpConnectionManager.Properties("ServerPort").SetValue(ftpConnectionManager, Dts.Variables("FtpPort").Value)
        ftpConnectionManager.Properties("ServerUserName").SetValue(ftpConnectionManager, Dts.Variables("FTPuser").Value)
        ftpConnectionManager.Properties("ServerPassword").SetValue(ftpConnectionManager, Dts.Variables("FTPpwd").Value)

        Dts.TaskResult = Dts.Results.Success

    End Sub

End Class
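
Because the credentials now come from package variables, they can be supplied at run time rather than hard-coded. As a sketch (the package file name and values are hypothetical), dtexec's /SET option can override the variables when the package is scheduled:

dtexec /FILE "LoadFiles.dtsx" ^
  /SET "\Package.Variables[User::FtpServer].Properties[Value]";ftp.example.com ^
  /SET "\Package.Variables[User::FTPpwd].Properties[Value]";MyPassword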

November 29, 2006

Variables in Source SQL

Filed under: ETL, SQL Server Integration Services, SSIS — Zack Bethem @ 11:09 pm

When dealing with source database systems, it's common to only want to process the "delta" of changed or new records. These systems typically have created and last_updated datestamps on each record for audit purposes. This entry details how one can do this in SSIS.

Last Run Date

In my scenario, I opted to maintain a transaction log table that stores the package execution history. This table is queried by a SQL Task in the control flow. The SQL Task is set to retrieve only one row and store the last_load_date into a variable. This variable is then used in the WHERE clause of my source SQL in later data flows.

The creation script for the table is below.

CREATE TABLE dbo.load_history (
    mapping_name nvarchar(50) NULL,
    last_load_date smalldatetime NULL,
    start_date smalldatetime NULL,
    end_date datetime NULL,
    status nvarchar(20) NULL
)
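
The SQL Task query shown below expects an open 'Pending' row for the current run. As a sketch of that bookkeeping (the mapping name and the 'Complete' status value are hypothetical conventions), the surrounding process might maintain the table like this:

-- Open a run before the load starts
INSERT INTO dbo.load_history (mapping_name, last_load_date, start_date, status)
VALUES ('customer_load', '2006-11-28 00:00', GETDATE(), 'Pending');

-- Close the run once the load completes
UPDATE dbo.load_history
SET end_date = GETDATE(), status = 'Complete'
WHERE mapping_name = 'customer_load' AND status = 'Pending';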

The setup of the SQL Task to return the last_load_date into a variable is done as follows:

  1. Define a variable LastRunDate with a data type of DateTime.
  2. Under the General tab, set the Result Set to 'Single Row'. The SQLStatement is set to:
    select last_load_date from load_history
    where mapping_name = <mapping_name>
    and end_date is null
    and status = 'Pending'

    [Screenshot: Execute SQL Task Editor, General tab]

  3. Under the Result Set tab, map the Result Name 'last_load_date' to the defined variable. In my case, I named my variable LastRunDate.

    [Screenshot: Execute SQL Task Editor, Result Set tab]

The Hidden Magic of Variable Properties

Next, define a variable to hold the source SQL.

  1. Add a variable of data type String. In my case, I named the variable SQL.
  2. With the variable selected in the Variables window, hit F4 to open the variable's properties.
  3. Set the property 'EvaluateAsExpression' to True.
  4. Set the Expression property using the variable defined earlier. Don't forget that the expression must produce a string, so the literal portions are wrapped in double quotes and the DateTime variable is cast before concatenation. For example:

    "select col_1, col_2, … from table_1
    where
    last_update_date >= '" + (DT_WSTR, 25) @[User::LastRunDate] + "'
    or created_date >= '" + (DT_WSTR, 25) @[User::LastRunDate] + "';"

    [Screenshot: variable properties with EvaluateAsExpression set to True and the Expression defined]
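
The (DT_WSTR, 25) cast is required: @[User::LastRunDate] is a DateTime, and the SSIS expression language will not implicitly convert a DateTime when concatenating it to a string, so the expression fails to evaluate without an explicit cast.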

November 14, 2006

Source File Archiving: SSIS versus DataStage

Filed under: DataStage, ETL, SQL Server Integration Services, SSIS — Zack Bethem @ 4:42 pm

I thought it would be interesting to compare and contrast SSIS and IBM DataStage. It is easy to become married to one particular technology, so a side-by-side comparison should be interesting for some.

In this example, we're in the situation of working with source files. After a load, I believe it's best practice to archive the files that you've loaded for later reference. These files can then be absorbed by the corporate archive strategy already in place. The goal of the process is as follows:

  1. Open File
  2. Process data in file and load to target
  3. Archive File

SSIS File Archive

In SSIS, steps 1 and 2 are handled by the Data Flow of the package. Step 3 is executed within the Control Flow. There are two methods available for archiving:

  1. Utilize two File System Tasks in sequential order: the first moves the file to the desired directory, and the second renames it with the appropriate date stamp. The downside: each File System Task requires that source and destination connection managers be defined.
  2. Utilize a Script Task that executes code to move and rename the file in one step. This gives far more control. The downside: code reuse requires either building a custom transformer or copying the script into each package.

Using this script as a reference, I went the scripting route as it seemed cleaner. The script was defined with the following variables:

User::FileName – Data file name, eg. loandata.txt
User::FileDir_DATA – Data directory on the staging server, eg. ‘c:\data\sources\’
User::FileDir_TYPE – Data sub-directory, eg. ‘loandata\’

Here’s the code:

Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Runtime
Imports System.IO

Public Class ScriptMain

    ' Reference: http://blogs.conchango.com/jamiethomson/archive/2005/09/14/2149.aspx

    Public Sub Main()

        ' Check that the expected variables exist before using them.
        If Dts.Variables.Contains("User::FileName") = True AndAlso _
        Dts.Variables.Contains("User::FileDir_DATA") = True AndAlso _
        Dts.Variables.Contains("User::FileDir_TYPE") = True Then

            Dim srcFile, tgtFile, tgtFileName As String

            srcFile = CStr(Dts.Variables("User::FileDir_DATA").Value) + _
            CStr(Dts.Variables("User::FileDir_TYPE").Value) + _
            CStr(Dts.Variables("User::FileName").Value)

            tgtFileName = CStr(Dts.Variables("User::FileName").Value)

            ' Add a datestamp (yyyyMMddHH) to the file name, just before the extension.
            tgtFileName = Left$(tgtFileName, InStrRev(tgtFileName, ".") - 1) + _
            "_" + CStr(Format(Now, "yyyyMMddHH")) + _
            Mid(tgtFileName, InStrRev(tgtFileName, "."))

            'debug System.Windows.Forms.MessageBox.Show(tgtFileName)

            tgtFile = CStr(Dts.Variables("User::FileDir_DATA").Value) + _
            CStr(Dts.Variables("User::FileDir_TYPE").Value) + _
            "Archive\" + tgtFileName

            'debug System.Windows.Forms.MessageBox.Show(tgtFile)

            Try
                File.Move(srcFile, tgtFile)
                Dts.Events.FireInformation(0, "", "File moved to: " + tgtFile, "", 0, True)
                Dts.TaskResult = Dts.Results.Success
            Catch ex As Exception
                Dts.Events.FireInformation(1, "", "File move failure. Tried to move to: " + tgtFile, "", 0, False)
                Dts.TaskResult = Dts.Results.Failure
            End Try

        Else
            ' Fail the task if the expected variables are missing.
            Dts.TaskResult = Dts.Results.Failure
        End If

    End Sub

End Class

After the Script Task is defined, the package can now move and rename a source data file to the designated Archive subdirectory. This is great if you like to code. My personal opinion: it's too much logic for a simple 'move' command.

DataStage File Archive

In DataStage, the process of archiving files can be controlled through the job's properties. (Note: a DataStage job is the equivalent of an SSIS package.) One only needs to go to the job's properties and set the 'After-job subroutine'.

[Screenshot: DataStage Job Properties dialog with the After-job subroutine set to ExecDOS and the move command as its input value]

In the above screenshot, you'll see that the 'After-job subroutine' is set to 'ExecDOS', which executes a DOS command; it doesn't get any easier than DOS. The InputValue is set to the actual command syntax. For clarification, the command uses job variables, which are as follows:

#$DrvPath# – drive letter, eg. c:\
#$SrcPath# – source data directory, eg. data\sources\

The command also uses %date to retrieve parts of the date in yyyymmdd format; a sketch of the full command appears below.
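
As a sketch (the file name is hypothetical, and the %date% substring offsets assume a US-style date such as Tue 11/14/2006), the InputValue might look like:

move #$DrvPath##$SrcPath#loandata.txt #$DrvPath##$SrcPath#Archive\loandata_%date:~10,4%%date:~4,2%%date:~7,2%.txt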

Conclusion

If you love to code and revel in working with VB.NET, then of course SSIS is your preference. But what this post intends to point out is that one shouldn't have to write code for what is essentially one line at the DOS command prompt.

November 1, 2006

SSIS: Test for Data File Existence

Filed under: ETL, SQL Server Integration Services, SSIS — Zack Bethem @ 6:15 pm

Flat files are a common data source for my packages. Unlike the master package's schedule, however, the flat files may not be staged every single day. Left as-is, the packages responsible for loading the files would report a false failure. This leads to the following script task in the control flow. The task reads input variables (the same ones the file connection manager should be using) to test whether the file exists. If it does not, the subsequent data flow does not execute and the package still returns success.

Defined variables:
ExecDir
FileExists
FileName

[Screenshot: package variables]

Defined Data Flow:

[Screenshot: the package's data flow]

Script Task

In the Control Flow, add a Script Task. The task is defined with ReadOnlyVariables of:
User::ExecDir
User::FileName

The ReadWriteVariables is set to:
User::FileExists

Add the following code to the script task:

Imports System
Imports System.Data
Imports System.Math
Imports System.IO
Imports Microsoft.SqlServer.Dts.Runtime

Public Class ScriptMain
    ' Created: Zack Bethem – AmberLeaf
    Public Sub Main()
        Dim fileLoc, fileName As String

        ' Make sure both variables are defined on the task before using them.
        If Dts.Variables.Contains("User::ExecDir") = True AndAlso _
        Dts.Variables.Contains("User::FileName") = True Then
            fileLoc = CStr(Dts.Variables("User::ExecDir").Value)
            fileName = CStr(Dts.Variables.Item("User::FileName").Value)
            'debug System.Windows.Forms.MessageBox.Show("FileDir:" + fileLoc + "FileName:" + fileName)

            ' Flag whether the file is present; the precedence constraint reads this.
            If File.Exists(fileLoc + fileName) Then
                Dts.Variables.Item("User::FileExists").Value = True
                'debug System.Windows.Forms.MessageBox.Show("File exists")
            Else
                Dts.Variables.Item("User::FileExists").Value = False
                'debug System.Windows.Forms.MessageBox.Show("File not exists")
            End If

            ' The task itself succeeds either way; a missing file is not an error.
            Dts.TaskResult = Dts.Results.Success
        Else
            Dts.TaskResult = Dts.Results.Failure
        End If

    End Sub
End Class

Now you're set. The script task reads in the variables User::ExecDir and User::FileName. Using those, it calls System.IO to determine whether the file exists. If it does, the User::FileExists variable is set to True. Otherwise, the variable is set to False.

Conditions in Control Flow

The next step is to define a condition in the control flow. This is done by double-clicking on the path line from the Script Task to the Data Flow Task, which leads to the following dialog:

Here is where you add the conditional logic. The Evaluation operation is set to 'Expression'. Using our User::FileExists variable, a check is done to see if the value is equal to True. If it is, the data flow executes. Otherwise, we have our graceful exit and the package succeeds.
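
The expression on the precedence constraint is simply:

@[User::FileExists] == True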


October 30, 2006

Creating the Master Package

Filed under: ETL, SQL Server Integration Services, SSIS — Zack Bethem @ 9:37 pm

It’s come to that point where all of the SSIS packages for the data warehouse are complete. As other searches have pointed out, there are multiple ways to glue everything together. Coming from a DataStage or Informatica background, however, I prefer the creation of a “master package”. This package is strictly a control flow that utilizes several ‘Execute Package’ tasks.

Step 1: Define File Connection

For this implementation, all of our packages will exist as files. This requires that each package have a defined file connection within the master package. Under the Connection Managers pane, right-click to add a new file connection, which opens the following dialog.

[Screenshot: new file connection dialog]

Click on Browse to add the package of choice in the following dialog.

[Screenshot: file selection dialog]

After selecting the file, click on the File Name listbox. This will automatically highlight the filename. Copy the filename to the clipboard by pressing Ctrl-C. This will save time in the later steps.

Once the file connection is added, an expression will need to be defined for the ConnectionString property. In my installation, I defined a variable User::ExecDir_DWL that corresponds to the project folder location for all of my packages. This allows me to define the ConnectionString as follows (hint: use Ctrl-V to paste in the package name to save time and reduce errors). Once complete, this will ensure that the file connection is not tied to any specific location. In turn, this allows for the master package to be easily deployed to any folder structure.
[Screenshot: ConnectionString property expression]
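
As a sketch (the package file name is hypothetical, and User::ExecDir_DWL is assumed to carry no trailing backslash), the ConnectionString expression looks something like:

@[User::ExecDir_DWL] + "\\LoadCustomerDim.dtsx"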

Step 2: Add Execute Package Task

The final piece of the pie is to add the Execute Package task to the control flow. Here, I use the file name text copied to the clipboard earlier to save time.

  • First, rename the Execute Package task to 'EXEC <package_name>'.
  • Second, explicitly set the PackageName property using the expression builder. I've noticed that the package name isn't set when using a package template; setting it explicitly overrides that issue.

October 16, 2006

SSIS Script: RowCounter

Filed under: ETL, SQL Server Integration Services, SSIS — Zack Bethem @ 10:06 pm

Straight from the MSDN archives, I found this little gem to count rows within a data pipe as they are processed. The key phrase is: “as they are processed”. You’d think the Row Count transformation would work if you wanted a sequential row number. Unfortunately, this transformation only updates the assigned variable at the end of processing for the given Data Flow task.

As a lover of code snippets for reuse, here’s what you have to do:

1. Add an Output Column. I called mine rowCount.

2. Add the following script lines:

Public Class ScriptMain
    Inherits UserComponent

    ' Running counter; initialized once when the data flow starts.
    Dim iCount As Integer = 0

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        ' Stamp the current count on the output column, then increment for the next row.
        Row.rowCount = iCount
        iCount = iCount + 1
    End Sub
End Class
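
This code lives in a Script Component configured as a transformation in the data flow. The Input0Buffer class and the Row.rowCount property are generated automatically from the component's input metadata and the output column added in step 1; Input0_ProcessInputRow fires once per row, so each row is stamped with the next sequential number as it passes through the pipe.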

October 3, 2006

SQL Server: Unit Testing Loads

Filed under: ETL, SQL Server Integration Services — Zack Bethem @ 10:48 pm

Through the magic of the SQL Server system tables sysobjects, systypes, and syscolumns, I've found a new favorite query.

Scenario

ETL packages must account for trimming the extra padding placed on data fields by the source. This is usually a problem with data sourced from flat files, especially fixed-width files. Some may argue that not addressing these issues results in data quality issues. I just prefer that things are perfect for future users. (and yes, data quality is on my mind too)

Solution

The following query dynamically generates a query for every column in a table. Each generated query compares the length of a field against its trimmed length and counts the rows that differ. This is a huge time savings over typing everything manually. Enjoy!

-- SQL Server: generate column-trimming check SQL
select
'select ''' + so.name + '.' + sc.name + ''', sum(case when datalength(' + sc.name + ') > datalength(ltrim(rtrim(' + sc.name + '))) then 1 else 0 end) as err_count from ' + so.name + ';'
from sysobjects so, syscolumns sc, systypes st
where
sc.id = so.id
and sc.type = st.type
and sc.xtype = st.xusertype
and so.name in ('individual')
and so.xtype = 'U'
order by sc.name

** Note: The query above refers to 'individual' as a specific table name. You'll want to change that based on your validation.
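
For a hypothetical column individual.first_name, the generated statement would be:

select 'individual.first_name', sum(case when datalength(first_name) > datalength(ltrim(rtrim(first_name))) then 1 else 0 end) as err_count from individual;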

Update: The SQL Server function DATALENGTH() should be used instead of LEN(). LEN() will automatically rtrim a column, masking any trailing padding and making a padded column look clean. This behavior may be limited to VARCHAR columns, but be warned.

September 20, 2006

SSIS: No Dynamic Lookup?

Filed under: ETL, SQL Server Integration Services — Zack Bethem @ 1:50 am

New project, and moving on to SSIS. As I learn the "nuances" of another ETL tool, one quickly learns its shortcomings. A standard staple, the dynamic lookup, seems to be missing.

Guess I’ll have to conjure up a solution. I’ll post the developed solution, if I find one.

