Data Flows and Problems with Partial Inserts

For a data flow task using the fast load option I had assumed that the process copies all of the records or in the case of failure none at all. In fact this behaviour had been confirmed to me by an SSIS guru. However, I've since noticed that partial imports are in fact possible, and we currently have big problems with them .
Suppose you have a large text file and are doing a flat-file import into a SQL table. The package could be simply one single data flow task.



The data in the flat file is one varchar column and a counter column, and all the text goes into a NOT NULL varchar column on the SQL destination. The text file always has values for the varchar column, apart from one particular row, so it looks like that below - what will happen as a result of row 3?

 www
 xxx  2
   3
 zzz  4

In fact, rows up to the problem row ("www" and "xxx") are inserted then the package errors. There's no rollback and you have a partial insert. Changing the Rows per batch or Commit size doesn't stop this happening, and we've already started out withe the Fast Load option which uses a "Insert Bulk" methodology, so changing to a non-Fast_Load which does a row-by-row iterative approach also won't improve things.

So - what can we do to ensure all-or-nothing in the case of NULLs?
As far as I can tell, We have 2 potential solutions:

(1) change the "Transaction Option" to "Required" for the Data Flow task.



This works ok but has 2 disadvantages:

(a) you must ensure that MSDTC is running. There is potentially now another point of failure because someone else might change this service and not realise the impact. And this service will need to be running on all the servers you intend to move this package to.

(b) all the transformation components will be contained within the transaction, which might cause unacceptable blocking to other users.

(2) Have a task which runs on failure of the data flow and which cleans up the partial import. In fact this is what I have now gone for. I add a derived column to the data flow which adds a column with a flag value of -1. An ExecuteSQL task runs on failure and which deletes the -1 records. Success of the data flow results in another ExecuteSQL task running to update the flag to 0. We only use "0" records for processing further down the pipeline.

Hopefully this can be useful for anyone else experiencing this problem. Interestingly, this behaviour is the opposite of what happens with a check constraint. When a check constraint is violated, the whole insert fails without any treaks being made. So, perhaps this will be configurable at some later time. 

 

What did you think of this article?




Trackbacks
  • No trackbacks exist for this entry.
Comments

  • 11/14/2007 10:24 PM Phil Brammer wrote:
    Did you check the "Retain null values from the source as null values in the data flow" box in the Flat File Source component?

    Also, were there actually NULLs in the file, or just empty strings. They are different, and as such, empty strings should be able to get inserted to the destination NOT NULL column.
    Reply to this
    1. 11/15/2007 10:12 PM Paul Ibison wrote:
      Hi Phil and thanks for being my very first replier!

      Yes - the "Retain nulls values..." checkbox was checked - otherwise there wouldn't have been a problem with the NULL values (sorry - couldn't resist ).

      We don't have text qualifiers in our csv feed files, so the (NULL vs empty string) interpretation of adjacent commas ",," is left to the above configuration setting. As it is checked, ",," gives us a NULL, and consequently a pain. However unchecking this option is not an option for us


      Reply to this
Leave a comment

Submitted comments will be subject to moderation before being displayed.

 Enter the above security code (required)

 Name

 Email (will not be published)

 Website

Your comment is 0 characters limited to 3000 characters.