How to speed up interpreting numbers from strings?

8

3

In processing a large text file, I read batches of 100,000 records with ReadList. The records have the following form:

SeedRandom[123];
fromReadList = 
  MapThread[
   "1\t2023.203\t" <> #1 <> "\t4.932\t" <> #2 <> "\t" <> #3 <>"\t430.334" &,
   {RandomChoice[{.95, .05} -> {"73.030", "4.3103e+008"}, 100000],
    RandomChoice[{.95, .05} -> {"23.335", "-1.02847e+007"}, 100000],
    RandomChoice[{.4, .4, .2} -> {"Cake", "Cookies", "Muffins"}, 100000]}];

Numbers in a scientific (e-notation) format that ToExpression does not recognize are scattered throughout the file. The problem appears when I process a batch like so:

res = MapThread[#1[#2] &, {Insert[Identity, 6]@ConstantArray[ToExpression, 6], #}] & /@ 
 Select[ContainsAny[{#[[6]]}, {"Cake", "Cookies"}] &]@
  StringSplit[fromReadList]

Some entries that should be numbers are not (e.g. res[[2, 3]] gives 8 + 4.3103 e).

If I replace ToExpression with Interpreter["Number"], one batch takes far too long to complete, considering that the file has 400+ batches to process.

MapThread[#1[#2] &, {Insert[Identity, 6]@ConstantArray[Interpreter["Number"], 6], #}] & /@ 
 Select[ContainsAny[{#[[6]]}, {"Cake", "Cookies"}] &]@
  StringSplit[fromReadList]

The unrecognized number strings can be in any of the number locations. How can I either speed up Interpreter["Number"] or apply some other technique to process the numbers in the batch?

I am placing the results of each batch into a database ("HSQL(Standalone)") using SQLInsert and am open to any shortcuts that could take advantage of this.

Edmund

Posted 2016-06-14T16:11:18.727

Reputation: 35 657

Try using Internal`StringToDouble in place of ToExpression. – Leonid Shifrin – 2016-06-14T16:30:29.933

@LeonidShifrin That works. How safe is it considering that it is in Internal` and that is undocumented? – Edmund – 2016-06-14T16:33:37.803

I have a small and primitive LibraryLink based package that can parse numbers in this format or tell you that the input is not a number. Here's a usage example. Yes, it's clumsy, it's for my own use. It's very fast though. If you want, I can make the repo public, but I can't support the package. – Szabolcs – 2016-06-14T16:37:33.887

@Leonid The problem with that function is that it won't tell you if the conversion fails. – Szabolcs – 2016-06-14T16:38:12.127

@Szabolcs Yes, that sucks. I wasn't aware of that. – Leonid Shifrin – 2016-06-14T16:39:39.727

@Szabolcs It would be nice if you make the repo public - this functionality is often needed. – Leonid Shifrin – 2016-06-14T16:40:34.943

@Edmund It is safe in the sense that it is very unlikely that this function will go away. But, there is this problem mentioned by Szabolcs, which is a really bad one. – Leonid Shifrin – 2016-06-14T16:41:52.070

@Szabolcs Yes. That would be very generous of you to share the repository. – Edmund – 2016-06-14T16:42:46.510

Import and ImportString handle the e number format okay. res2 = fromReadList~StringRiffle~"\n"~ImportString~"TSV"~Cases~{__,"Cake"|"Cookies",_}; – Simon Woods – 2016-06-14T18:05:01.017

@SimonWoods Please add as an answer. Not only does it import the e number format but it is twice as fast as my existing code. (+1) – Edmund – 2016-06-14T19:31:56.930

Have you tried either of the other two methods I posted in (1737)? Particularly System`Convert`TableDump`ParseTable seemed quite fast the last time I had need of it. – Mr.Wizard – 2016-06-15T00:12:42.057

@Mr.Wizard I had not seen that post. Thanks for sharing. However, @SimonWoods's post below is the way I think I will go. No undocumented functions, and it is twice as fast as my ToExpression method above, which does not work with the "e" format. – Edmund – 2016-06-15T00:33:59.170

Answers

5

Import and ImportString handle the e number format okay. You might be able to Import directly from file, or use ImportString to process the data you've already read in:

res = fromReadList ~StringRiffle~ "\n" ~ImportString~ "TSV" ~Cases~ {__,"Cake"|"Cookies",_};

Simon Woods

Posted 2016-06-14T16:11:18.727

Reputation: 81 905

(+1) It's also faster than my initial code. – Edmund – 2016-06-14T21:05:29.580

To those who found the infix difficult to decipher: Cases[ImportString[StringRiffle[fromReadList, "\n"], "TSV"], {__, "Cake" | "Cookies", _}] – Henrik Hansen – 2016-06-18T13:58:51.247

14

I would like to link this Wolfram Community thread here, where I asked for functionality like this in September 2015 and explained why it's critical to have it. I can't link to individual posts, but you can find it by searching the page for "StringToDouble".


As Leonid mentioned, there is Internal`StringToDouble. This function is very fast, but it does not report errors. This makes it unsuitable for applications where not all inputs are numbers, especially when the type of the input is unpredictable.

Internal`StringToDouble["1e2"]
(* 100. *)

Internal`StringToDouble["foo"] (* not a number *)
(* 0. *)

As a workaround we can make a small LibraryLink function that parses numbers in this format. Fortunately it's very easy to do in C++.

I am going to use LTemplate for reasons of laziness (as I always do recently). LTemplate is absolutely not needed here, it just makes it quicker for me to set everything up.

First, put this C++ code in Parser.h:

#include <sstream>

class Parser {
    bool good = false; // result of the most recent parse; false until a parse succeeds

public:
    double parseReal(const char *s) {
        std::istringstream str(s);
        mma::disownString(s);
        double res;
        str >> res;
        if (str.fail() || ! str.eof()) {
            good = false;
            return 0;
        }
        good = true;
        return res;
    }

    mint parseInteger(const char *s) {
        std::istringstream str(s);
        mma::disownString(s);
        mint res;
        str >> res;
        if (str.fail() || ! str.eof()) {
            good = false;
            return 0;
        }
        good = true;
        return res;
    }

    bool success() const { return good; }
};

Then from Mathematica, make sure that Directory[] is where Parser.h is and evaluate:

<<LTemplate`

template = LClass[
  "Parser",
  {
    LFun["parseReal", {"UTF8String"}, Real],
    LFun["parseInteger", {"UTF8String"}, Integer],
    LFun["success", {}, True | False]
  }
];

CompileTemplate[template]
LoadTemplate[template]

Then use like this:

[Image: screenshot of a usage example; not reproduced here]

The parseReal method parses a real number in e-notation. The success method tells us if the parsing was successful. Then we can build on top of this.
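The check that parseReal relies on can also be tried outside Mathematica. Below is a minimal standalone sketch of the same idea, without LibraryLink or LTemplate. The name tryParseReal and its bool/out-parameter interface are my own assumptions for illustration; they are not part of the package.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not part of the original package).
// Parses the whole string as a double, accepting e-notation such as "4.3103e+008".
// Returns true and stores the value in `out` only if the entire string is a number;
// otherwise returns false and leaves `out` untouched.
bool tryParseReal(const std::string &s, double &out) {
    std::istringstream str(s);
    double res;
    str >> res;
    // fail(): no number could be read from the start of the string.
    // !eof(): a number was read, but characters remain after it (e.g. "12abc").
    if (str.fail() || !str.eof())
        return false;
    out = res;
    return true;
}
```

With this, "4.3103e+008" and "-1.02847e+007" are accepted, while "Cake" and "12abc" are rejected instead of silently turning into 0. Note that operator>> skips leading whitespace but not trailing whitespace, so fields should be trimmed before parsing.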

I used this for personal projects. You can get the code here, but LTemplate must be installed first, and you must also have a working C++ compiler installed... Remember that I made this for personal use. It's a really basic package and you will probably be better off writing your own, tailored to your own needs.

The only thing this provides over Internal`StringToDouble is a way to check for errors.

Szabolcs

Posted 2016-06-14T16:11:18.727

Reputation: 213 047

I didn't realize that it doesn't throw a message on fail and that it needed the extra bits. Not certain it will work for deploying to Player Pro. Thanks for sharing. (+1) – Edmund – 2016-06-14T17:08:27.317

@Edmund I don't have Player Pro, but you're right ... it might not work ... it would also make deployment more complicated because you would need to ship a package which has binaries for all platforms: that means one binary for Windows, one carefully compiled one for Linux to make sure it works on many distros, and two for OS X (for 10.0-10.3 and a different one for 10.4) ... – Szabolcs – 2016-06-14T17:20:50.363

Amazing work, this and the LTemplate package. Thank you very much for documenting and sharing. – Theo Tiger – 2018-10-08T08:32:14.690