There may come a time when you are not in control of the input file format and it is decidely non-standard. I recently had to process such a feed file which had a metadata row at the start and the end of the file. The file could be represented something like this, with only the x,y,z records being of importance (along with the column headers if we can get this info).
Col1
Col2
Col3
xxx start
x
y
z
x
y
z
x
y
z
xxx end
The problem is that for the red rows there is only one column: "xxx start" and no comma separators, while the second row would be "x,y,z".
This is not an easy problem to describe but I tried various keyword combinations on Googling to see how others were coping with this issue.
Some people advise using the script task: http://www.sql-server-performance.com/article_print.aspx?id=1056&type=art
Some use the Conditional Split: http://www.sqlis.com/54.aspx
I've also seen packages which call vbscript code to open the text file, read all the lines and then write them back into the text file minus the problemmatic rows.
All of these are interesting technical solutions, however they weren't suitable for a lazy guy like me
In short, my aim was to have the entire process encapsulated in a single package with no calls being made to outside processes, and have it as simple as possible to build and maintain.
So, how to do this?
The way I do it now is quite simple (below):
Initially I import the entire file into a staging table. The Flat File source is defined as "Ragged right" and has one column which is 8000 chars long. This is imported into a staging table which has one column - defined as varchar(8000). This way the entire file is always imported successfully, along with the 2 problem rows.
Next we remove the extra rows. In my case this is a simple delete statement as the length of the problem rows is constant and is significantly shorter than a data row. You'll need some way of distinguish this row(s) from the others and if you're lucky it'll be as simple as my case. If not, you might have to filter using the column delimiters or use some other similar logic.
Next I export to a temporary file. This is a Flat File defined as "Ragged right", again with one column. The staging table above is the source and data is exported from it to the staging Flat File.
Finally, this staging file is treated as a standard Flat File and imported as per usual in another data flow task. This means that the column names can be determined from the source file by the designer.
It's all really simple and transparent. If / when a recordset source can be used, we can do away with the persisted staging Flat File to make it even neater in the future.
At some point you'll need to decide on a logging methodology, and this is my 2ps worth:
(1) As a starting point I'd say take a look at Jamie's excellent article on event-handler logging: http://blogs.conchango.com/jamiethomson/archive/2005/06/11/SSIS_3A00_-Custom-Logging-Using-Event-Handlers.aspx. Initially I used the same setup and we standardised all packages on it but with hindsight I'd now recommend to do things slightly differently. The main difference would be to not use property expressions to do the insert to the log table but instead for event-handling, call a stored procedure. Why? Because the property expression will become completely illegible when you try to escape single and double quotes - both of which might appear in a server-generated error message.
We currently just log the OnError and OnPostExecute events, but I'm looking again to see if any of the others would be particularly useful. Finally I'd say that this goes hand-in-hand with using a decent naming convention (Jamie has some standards in his blog). The reason being that the events will log the task name so if they are all called "Data Flow", "Data Flow 1", "Data Flow 2" etc it's just a complete pain when debugging!
(2) As well as event logging to a central table, do some 'simple workflow message logging' to the same table. In my package template I have 2 tasks which sandwich the rest. This makes reports nice and easy.
(3) Make sure each end task in the control flow is a logging message task. In this task use a message that is either constant, or which begins consistently. That way you know that a package always has a start and end message pair or a start and an error message pair. This makes reporting really simple.
(3) Add another parameter to the logging table - "PackageSessionGUID". Have an intial task which generates a guid (use newid() against a SQL OLEDB connection) and which then saves the value to a package variable. Use this variable in all subsequent logging. Why? Because this allows you to distinguish between parallel executes of a package.
(4) Also log to a text file. This can be really useful when the problem lies with the normal logging setup. EG what happens if the connection to the database which is used as a logging repository is not correct? Normally you'd not get any info because you rely on this to write the log entries, but the text file works as a kind of backup log.
(5) Also log the output of the job itself, assuming the package is scheduled.
This will help debug any other problems associated with the package and it's batch file.
| www | 1 |
| xxx | 2 |
| 3 | |
| zzz | 4 |