SSIS bits 'n' bobs BLOG
Paul Ibison
SSIS bits 'n' bobs

Escaping Double Quotes in Flat Files: SSIS is different to DTS

Just a heads-up as I noticed today that the import behaviour of escaped double-quotes in a string is different in DTS and SSIS.

If we have the following single entry in a text file:

"xxx""999""xxx"

Using a Transform Data task in DTS, the resulting database entry will be:

xxx"999"xxx

Using a Data Flow task in SSIS we will get:

xxx""999""xxx

In both packages the double quote is defined as a text delimiter and apart from that there don't seem to be any relevant configuration possibilities to finetune the behaviour.

It seems DTS interpretets this as an intentionally escaped double-quote, and removes the extra characters automatically. SSIS on the other hand assumes that this is a valid string entry.

I'm currently converting DTS packages to SSIS at the moment and as the import behaviour has changed in this way, my solution is to run an immediate update (ExecuteSQL task) on the destination data and replace
adjacent double quotes with a single one - not nice. I could alternatively strip them out in the data flow, using a derived column transformation - also not nice. Hopefully this'll be exposed as a property of the Flat File connection manager some time.

When to Not use SSIS - directory / file iterations!!!!

I've become a bit too much of an enthusiast about using SSIS for everything recently and I came unstuck with a simple file iteration problem........

I simply wanted to iterate through a set of network directories and do some file copying. Basically I needed to sync up the UAT and live directories of source files, so that the testing in the UAT environment could be a true parallel test of live. So, if the process came across a recent file in a live directory that wasn't in the corresponding test
subdirectory, it should be copied over.

I agree that this sounds like a straightforward issue. It's the sort of thing I can easily do in VBScript or VB6.0 but I'm trying to do everything in SSIS at the moment to centralise the code for maintenance, so I decided to do it that way.

In the designer we see that there is a task which seems custom-made for this type of problem - "Foreach Loops based on File Enumerator" so away I went and created the package below:



In the SSIS package above, the outer Foreach_FeedDirectory loop iterates through a recordset of directories and stores the directory path in a local variable. The inner ForeachFeedSourceFile loop iterates over the files in the specific directory.

Here's where you begin to come unstuck - all the usual file attributes you'd expect to see are simply not exposed to the Foreach File Enumerator. In the SSIS loop we can get the filename with or without the full path / extension and that's about it - no "Created Date", no "Last Modified Date", no "Size" etc etc. To get any of these useful attributes you have resort to scripting!

So in the package shown above, there is an inner script task which uses System.IO.FileInfo, System.IO.File.Exists, System.IO.File.Copy etc to decide whether to do the copy or not and applies to the file name found in the inner Foreach File Enumerator.

Let's recap: I am looping through each directory. Once I find a diectory I loop through the files. Once I find a file I use the FileInfo method to get the file info in script, then copy over if it is recent and if it doesn't exist in the destination directories.

On my 33 directories, some of which contain 1000s of files, this takes approx  1.5 to 2 hours to run in SSIS

As this is a process I need to run often, this is a complete disaster . I can see that the iteration over the files is the bottleneck and this gives precious little useful info in the task, so as a trial, I start thinking of c# as a replacement. Now I'm definitely not a c# developer (and never will be ), but I managed to knock together the console application below to replace the entire package in an hour or so using my pal Google. Amazingly it c# does the same task but takes just 70secs!!!!!!

So - if you're just doing iterative directory / file manipulation and have to incorporate scripting, I'd definitely recommend resorting to c# rather than SSIS.

C# code:

        private static void CheckDirectory(String sLiveDirectory, String sTestDirectory)
        {
            Console.WriteLine("Processing: " + sLiveDirectory);
            DirectoryInfo dirCustom = new DirectoryInfo(sLiveDirectory);
            FileInfo[] filCustom;
            filCustom = dirCustom.GetFiles();
            foreach (FileInfo filFile in filCustom)
            {
                TimeSpan ts = DateTime.Now - filFile.CreationTime;
                if (ts.Days < 2)
                {
                    if (!File.Exists(sTestDirectory + filFile.Name))
                    {
                        File.Copy(filFile.FullName, sTestDirectory + filFile.Name);
                        Console.WriteLine("Copying: " + filFile.FullName);
                    }
                }
            }
        }

Simple Method for Importing Ragged files / Files with Different Row Types in SSIS / Files with a problematic final row

 

There may come a time when you are not in control of the input file format and it is decidely non-standard. I recently had to process such a feed file which had a metadata row at the start and the end of the file. The file could be represented something like this, with only the x,y,z records being of importance (along with the column headers if we can get this info).

Col1   Col2  Col3
 xxx start    
 x  y  z
 x  y  z
 x  y  z
 xxx end    

The problem is that for the red rows there is only one column: "xxx start" and no comma separators, while the second row would be "x,y,z".

This is not an easy problem to describe but I tried various keyword combinations on Googling to see how others were coping with this issue.

Some people advise using the script task: http://www.sql-server-performance.com/article_print.aspx?id=1056&type=art
Some use the Conditional Split: http://www.sqlis.com/54.aspx
I've also seen packages which call vbscript code to open the text file, read all the lines and then write them back into the text file minus the problemmatic rows.

All of these are interesting technical solutions, however they weren't suitable for a lazy guy like me

  1. I didn't want to write any code
  2. in my case there are dozens of columns, so I wanted the designer to define the columns for me by looking at the text file, rather than defining them one-by-one by hand
  3. finally I didn't want to use script outside the package which would make maintenance more difficult.

In short, my aim was to have the entire process encapsulated in a single package with no calls being made to outside processes, and have it as simple as possible to build and maintain.

So, how to do this?

The way I do it now is quite simple (below):

 

Initially I import the entire file into a staging table. The Flat File source is defined as "Ragged right" and has one column which is 8000 chars long. This is imported into a staging table which has one column - defined as varchar(8000). This way the entire file is always imported successfully, along with the 2 problem rows.

Next we remove the extra rows. In my case this is a simple delete statement as the length of the problem rows is constant and is significantly shorter than a data row. You'll need some way of distinguish this row(s) from the others and if you're lucky it'll be as simple as my case. If not, you might have to filter using the column delimiters or use some other similar logic.

Next I export to a temporary file. This is a Flat File defined as "Ragged right", again with one column. The staging table above is the source and data is exported from it to the staging Flat File.

Finally, this staging file is treated as a standard Flat File and imported as per usual in another data flow task. This means that the column names can be determined from the source file by the designer.

It's all really simple and transparent. If / when a recordset source can be used, we can do away with the persisted staging Flat File to make it even neater in the future.

Thorough Package Logging - Some Lessons Learned and Recommendations

At some point you'll need to decide on a logging methodology, and this is my 2ps worth:

(1) As a starting point I'd say take a look at Jamie's excellent article on event-handler logging: http://blogs.conchango.com/jamiethomson/archive/2005/06/11/SSIS_3A00_-Custom-Logging-Using-Event-Handlers.aspx. Initially I used the same setup and we standardised all packages on it but with hindsight I'd now recommend to do things slightly differently. The main difference would be to not use property expressions to do the insert to the log table but instead for event-handling, call a stored procedure. Why? Because the property expression will become completely illegible when you try to escape single and double quotes - both of which might appear in a server-generated error message.
We currently just log the OnError and OnPostExecute events, but I'm looking again to see if any of the others would be particularly useful. Finally I'd say that this goes hand-in-hand with using a decent naming convention (Jamie has some standards in his blog). The reason being that the events will log the task name so if they are all called "Data Flow", "Data Flow 1", "Data Flow 2" etc it's just a complete pain when debugging!

(2) As well as event logging to a central table, do some 'simple workflow message logging' to the same table. In my package template I have 2 tasks which sandwich the rest. This makes reports nice and easy.

 

(3) Make sure each end task in the control flow is a logging message task. In this task use a message that is either constant, or which begins consistently. That way you know that a package always has a start and end message pair or a start and an error message pair. This makes reporting really simple.

(3) Add another parameter to the logging table - "PackageSessionGUID". Have an intial task which generates a guid (use newid() against a SQL OLEDB connection) and which then saves the value to a package variable. Use this variable in all subsequent logging. Why? Because this allows you to distinguish between parallel executes of a package.

(4) Also log to a text file. This can be really useful when the problem lies with the normal logging setup. EG what happens if the connection to the database which is used as a logging repository is not correct? Normally you'd not get any info because you rely on this to write the log entries, but the text file works as a kind of backup log.

(5) Also log the output of the job itself, assuming the package is scheduled.
This will help debug any other problems associated with the package and it's batch file.

Retain NULL values vs Keep NULLs - which to use?


There is some confusion as to what the various NULL settings all do in SSIS. In fact in my team we had created 15 packages before realising the full implications of the various default settings. Anyway, hopefully this blog will help to clarify a bit......

Suppose you are doing a bog-standard import of a text file like the one below. We have a data flow task with a flat-file source going to a OLEDB destination. The flat file contains 3 columns: MyDate, MyInt and MyText. In the file there are 2 rows - one containing normal values and the other one just containing blank column values. So, what happens in the case of a blank value?




By default, what will happen is shown in the first dataset below (2nd row). A blank date will become "1st Jan 1753", an integer will become "0" and text will become "". Is this what you want? Usually not - especially if "0" is also a legitimate explicit numerical value in the text file!




So, we can designate that the null values from the flat file (really an absence of a value) will become a null in the resulting OLEDB destination. There is a little checkbox on the flat file source which controls this:



The imported dataset will now look like this:



So far so good. However there is another similar setting available here. This is a checkbox on the OLEDB destination labelled as "Keep NULLs" - what is that for? Basically this controls what happens when you have defaults on the destination table. If there is a default constraint on each column - eg the date should be today's date, the number should be -1 and the text "n/a" - do we want these default values to be overwritten by NULL values or to simply apply their defaults?

The option is shown below. In this particluar case the defaults won't get applied and will be overwritten by NULLs from the source file.



If it was unchecked, and the defaults mentioed above existed, we would have the values below.



Hopefully this makes these 2 setting and the various permutations s a little clearer

3 Tips when using (SQL) Indirect Configurations

Plenty of info on setting up indirect configurations in BOL and in other sites so I won't bother posting up info on how to get it all going.

In my case I like to hold the configuration info in SQL Server rather than XML files, so it is set up centrally in one table. I'll do a more detailed blog on this some other time but basically there is a separate configurations database in Dev, Test and Live which holds relevant connection info, so a package will automatically configure itself when placed on the deployment server - the so called "keeping packages in the dark" method.

So, what else to look out for?

(1) Make sure there is a unique index on the "PackagePath" column.
There isn't one in the table which is automatically created by the design environment. Perhaps it should be a composite constraint on "ConfigurationFilter, PackagePath and ConfiguredValueType" but in all likelihood the "ConfigurationFilter and ConfiguredValueType" will be constant anyway - it depends on how you use the table. Anyway, you'll need this unique constraint because sooner or later someone will add a duplicate PackagePath with a different ConfiguredValue with interesting consequences!

(2) Once it's set up, create a rollback trigger on this table.
Future entries can then only be entered by script with an initial explicit disabling of the trigger. Why? Because in some cases the configuration values can be written from a package into the table directly from the designer
without warning and the trigger will prevent this occurring. It also acts as an extra security on the table which will be the most crucial table for our packages.

(3) Don't store connection strings directly in the configuration file.

ie never have:

\Package.Connections[ServerName_DatabaseName].Properties[ConnectionString]

But instead have the connection string stored in a package variable and have the variable's value stored in the configuration file:

\Package.Variables[User::s_DatabaseName_ConnectionString].Properties[Value]

Why jump through an extra hoop and do it this way?

Well if the PackagePath maps directly to a connection manager which exists in Package A but not in Package B, Package B will error when running. However if the PackagePath mapping to the variable is missing, you'll just get a warning but no error so the package will run ok.

The one exception to this is the connectionstrings to the Configurations Database which are hardcoded for each environment (more on this in a blog on logging some other time).

I'm sure I came across other problems with using the SQL table, but I didn't note them down. If anything else comes back to me I'll pots it up in a comment here.

Coping with No Column Names in the OLEDB Data Source Editor

Sooner or later you'll create a DataFlow and come across this type of dialog box for the OLE DB source:





No source columns listed and noting to map - great
 
So, why does this problem occur, when all previous attempts have worked? Well the most likely case is that you are selecting from a temporary table as the last statement in the source stored procedure, and in this case there is no metadata returned to the designer. Just a simple proc like the one below will cause this problem:

CREATE PROC myproc AS
SELECT TOP 10 organisation_name INTO #MyOrganisations FROM dbo.Organisation
SELECT * FROM #MyOrganisations

So, what can we do? There are 2 main possibilities (there is also a third one that I'll post up some other time).

(1) One possibility is to change the SQL Command Text as follows and add the italicised text:

SET FMTONLY OFF
SET NOCOUNT ON

exec myproc

This forces the parser to actually run the query to return the metadata and reverse the SET FMTONLY ON that it normally uses. This works, and you'll see the column names as per normal. However there is a serious downside that seems to be missed when this solution is advised. My colleague John Gillies added an auditing step to the proc "myproc" to test this and found that the proc itself was executed 5 times when running the package!!! This could seriously slow things down if the initial stored proc itself is slow anyway.

(2) Use a Table Variable instead

CREATE PROC myproc AS
declare @mytable table (organisation_name varchar(100))
INSERT INTO @mytable(organisation_name)
SELECT TOP 10 organisation_name FROM dbo.Organisation
SELECT * FROM @mytable

You'll now need to set the SQL Command Text to be:

SET NOCOUNT ON
exec myproc

Otherwise you'll get the dreaded "A rowset based on the SQL command was not returned by the OLE DB provider" error.

Using this technique will make sure that the proc only runs once only! OK - you can't put nonclustered indexes on a table variable and all the other restrictions, so if really needed you can use option (1) but in most cases I'd reason that option (2) is the preferable solution.

But I only wanted one log file!

Suppose you are using a Log File into which you do your logging. Setting it up is straightforward, and you want it to go to "C:\SSIS\Logs\PackageName.log" during testing.

Later on you decide to go for a more sophisticated version and use a configuration file to determine where the log will be placed. This makes sense, as your deployment server might not have the "C:\SSIS\Logs\PackageName.log" directory, and perhaps you set up the configuration file to have the log's connection string set as "D:\SSIS\Logs\PackageName.log".

When you run the package, where will your log file be found?

You'd assume that it'll be "D:\SSIS\Logs" because the real log file location will be read from the configuration file - yes?

Actually what happens is that you'll have 2 log files - one on "C:\SSIS\Logs" and one on "D:\SSIS\Logs"! The first will be quite small, while the second contains most of the info.

At first I was surprised to see this after a colleague (Reno Messina) pointed it out. Then I decided it made sense. The first log file ("C:\SSIS\Logs") has to be there to log the process up to the loading of the configurations. Otherwise if the log location in the configuration file itself was incorrect we'd have no way of getting any info at all and be even more confused

So, if you know the location of your log file on the destination server, set this up in the package as well as the configuration file. That way your log info will all be combined together. If not, then make sure it's a location that exists on all your servers, just to be sure that the initial log info will go somewhere without breaking the package.

Data Flows and Problems with Partial Inserts

For a data flow task using the fast load option I had assumed that the process copies all of the records or in the case of failure none at all. In fact this behaviour had been confirmed to me by an SSIS guru. However, I've since noticed that partial imports are in fact possible, and we currently have big problems with them .
Suppose you have a large text file and are doing a flat-file import into a SQL table. The package could be simply one single data flow task.



The data in the flat file is one varchar column and a counter column, and all the text goes into a NOT NULL varchar column on the SQL destination. The text file always has values for the varchar column, apart from one particular row, so it looks like that below - what will happen as a result of row 3?

 www
 xxx  2
   3
 zzz  4

In fact, rows up to the problem row ("www" and "xxx") are inserted then the package errors. There's no rollback and you have a partial insert. Changing the Rows per batch or Commit size doesn't stop this happening, and we've already started out withe the Fast Load option which uses a "Insert Bulk" methodology, so changing to a non-Fast_Load which does a row-by-row iterative approach also won't improve things.

So - what can we do to ensure all-or-nothing in the case of NULLs?
As far as I can tell, We have 2 potential solutions:

(1) change the "Transaction Option" to "Required" for the Data Flow task.



This works ok but has 2 disadvantages:

(a) you must ensure that MSDTC is running. There is potentially now another point of failure because someone else might change this service and not realise the impact. And this service will need to be running on all the servers you intend to move this package to.

(b) all the transformation components will be contained within the transaction, which might cause unacceptable blocking to other users.

(2) Have a task which runs on failure of the data flow and which cleans up the partial import. In fact this is what I have now gone for. I add a derived column to the data flow which adds a column with a flag value of -1. An ExecuteSQL task runs on failure and which deletes the -1 records. Success of the data flow results in another ExecuteSQL task running to update the flag to 0. We only use "0" records for processing further down the pipeline.

Hopefully this can be useful for anyone else experiencing this problem. Interestingly, this behaviour is the opposite of what happens with a check constraint. When a check constraint is violated, the whole insert fails without any treaks being made. So, perhaps this will be configurable at some later time. 

Upgrade Advisor - don't trust it 100%!

We recently tested the upgrade path for 6 large SQL 2000 databases. We'd faithfully used the Upgrade Advisor and implemented all the changes it recommended to TSQL code throughout the databases. So - everything should work - yes?

Well we thought so, until I tried an additional test. I scripted out all the user stored procedures (upgraded ones) in SQL 2005, then deleted them, then I ran the same script. I hoped to not find anything at all but it failed for 2 generic types of error:

(1) Illogical syntax no longer acceptable in SQL 2005 but works fine in SQL 2000:

Declare @count int
Set Set @count = 1
Select @count

Declare @MyDate datetime
Set @mydate = cast('1/1/99' as datetime(103))
Select @mydate

Yes - I know. This code makes no sense. But if you have 100s or 1000s of procedures, someone may have written something strange like this somewhere.

(2) Reserved Keywords.

We have several columns called "External". I was surprised that this wasn't picked up by the advisor, and it's easily fixed by just adding square brackets, but is still a bit of a pain.

If I find anything else it missed I'll post it up.