Simple Method for Importing Ragged files / Files with Different Row Types in SSIS / Files with a problematic final row
There may come a time when you are not in control of the input file format and it is decidely non-standard. I recently had to process such a feed file which had a metadata row at the start and the end of the file. The file could be represented something like this, with only the x,y,z records being of importance (along with the column headers if we can get this info).
Col1
Col2
Col3
xxx start
x
y
z
x
y
z
x
y
z
xxx end
The problem is that for the red rows there is only one column: "xxx start" and no comma separators, while the second row would be "x,y,z".
This is not an easy problem to describe but I tried various keyword combinations on Googling to see how others were coping with this issue.
Some people advise using the script task: http://www.sql-server-performance.com/article_print.aspx?id=1056&type=art
Some use the Conditional Split: http://www.sqlis.com/54.aspx
I've also seen packages which call vbscript code to open the text file, read all the lines and then write them back into the text file minus the problemmatic rows.
All of these are interesting technical solutions, however they weren't suitable for a lazy guy like me
- I didn't want to write any code
- in my case there are dozens of columns, so I wanted the designer to define the columns for me by looking at the text file, rather than defining them one-by-one by hand
In short, my aim was to have the entire process encapsulated in a single package with no calls being made to outside processes, and have it as simple as possible to build and maintain.
So, how to do this?
The way I do it now is quite simple (below):
Initially I import the entire file into a staging table. The Flat File source is defined as "Ragged right" and has one column which is 8000 chars long. This is imported into a staging table which has one column - defined as varchar(8000). This way the entire file is always imported successfully, along with the 2 problem rows.
Next we remove the extra rows. In my case this is a simple delete statement as the length of the problem rows is constant and is significantly shorter than a data row. You'll need some way of distinguish this row(s) from the others and if you're lucky it'll be as simple as my case. If not, you might have to filter using the column delimiters or use some other similar logic.
Next I export to a temporary file. This is a Flat File defined as "Ragged right", again with one column. The staging table above is the source and data is exported from it to the staging Flat File.
Finally, this staging file is treated as a standard Flat File and imported as per usual in another data flow task. This means that the column names can be determined from the source file by the designer.
It's all really simple and transparent. If / when a recordset source can be used, we can do away with the persisted staging Flat File to make it even neater in the future.

once it was in the database, wouldn't an sql-insert (with a group by and filter to remove empty rows) do, instead of exporting again as a file?
just a thought, and even if so, tests for time to complete vs. file-based ops like above.
Reply to this
No need for a Group By - if you wanted, you could define the logic in TSQL by using Substring for each column and it ends up like the conditional split method. But when you have 100s of columns, this IMO isn't really viable and is difficult/impossible to support solutions with non-explicit mappings like this.
Cheers.
Reply to this