Can metadata be used to adapt parsing for an unescaped in field use of the delimiter?

3

I have data coming from a source system that is pipe delimited. Pipe was selected over comma since it was believed no pipes appeared in field, while it was known that commas do occur. After ingesting this data into Hive however it has been discovered that rarely a field does in fact contain a pipe character.

Due to a constraint we are unable to regenerate from source to escape the delimiter or change delimiters in the usual way. However we have the metadata used to create the Hive table. Could we use knowledge of the fields around the problem field to reprocess the file on our side to escape it or to change the file delimiter prior to reloading the data into Hive?

Chris Simokat

Posted 2014-05-20T22:14:02.927

Reputation: 131

1It may be easier to help if you show some two or three entries of your input file. – Rubens – 2014-05-20T22:38:19.070

For next time, use ASCII Delimited Text if you are unsure of which delimiter to select.

– Amir Ali Akbari – 2014-05-30T06:16:20.733

@Rubens yes I will mock up some entries and edit this weekend. – Chris Simokat – 2014-05-31T06:56:25.183

@AmirAliAkbari unfortunately I cannot get the source system to change the format otherwise I wouldn't have the issue, and SASB on AIX has limited options, but thank you for the link. – Chris Simokat – 2014-05-31T07:00:48.387

Answers

4

So, a few of your rows will have too many columns by one or more as a result. That's easy to detect, but harder to infer where the error was -- which two columns are actually one? which delimiter is not a delimiter?

In some cases, you can use the metadata, because it helps you know when an interpretation of the columns can't be right. For example, if just the one column can have a text value, and all the others must be numeric, it's unambiguous where the error is. Any additional columns created by this error occur right after the text column.

If they're all text, this doesn't work of course.

You might be able to leverage more than the metadata's column type. For example you may know that some fields are from an enumerated set of values, and use that to determine when a column assignment is wrong.

Sean Owen

Posted 2014-05-20T22:14:02.927

Reputation: 5 987

Besides the metadata used to declare the hive table there is also information on the original informat from the SAS environment. Even if the case is more obscure than int, string, int I am hoping that that additional information could be used to get there if the table metadata is insufficient due to potential ambiguity (e.g. string, string, string) – Chris Simokat – 2014-05-31T07:03:29.753