How can I read in specific information from a TeX-file?



I have several TeX-files and they contain information of the form

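\commandnameA{Bob}
blabla
\commandnameB{It wasn't Bob}
\commandnameC{2012}
bla
\commandnameD{$\frac{1}{3}$}
\commandnameE{\commandnameF{It was Bob!}}
\commandnamefilename{29}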

Is it possible to read that information into Mathematica, for example into a list like this?

{Bob,It wasn't Bob,2012,$\frac{1}{3}$,\commandnameF{It was Bob!},It was Bob!,29}

The purpose would be to use the enumeration of the files (here 29) together with the other content of the files to calculate statistics and generate graphics about this bunch of TeX-files as a whole.


Posted 2012-04-29T16:46:12.353


Do the commands always appear literally (i.e. not as the result of other command expansion)? Because otherwise I guess a better solution would be to run the LaTeX file with redefinitions of \commandnameA etc. that write into some file which you can then read into Mathematica. – celtschk – 2012-04-29T19:19:32.727

@celtschk: They appear literally. Nevertheless, I am interested in what you say. I only ever compile the files with the LaTeX tools, so I don't even know how to write output into another file for the process you're describing. – Nikolaj-K – 2012-04-29T19:46:27.807

Well, the idea would be to modify the macros commandnameA etc. to write their content into a special file (in addition to what they normally do). That way you'd even get the information if someone invoked them indirectly, e.g. using \let\cnA=\commandnameA and then later \cnA{Bob}. However, if the macros always appear literally, that would be overkill. Anyway, the process of writing to a file from LaTeX is described in

– celtschk – 2012-04-29T20:04:40.860



The regular expression approach is my favorite, but I would do it a little differently to make it more robust. The approach by David didn't quite get the } treated right. The approach by R.M relied on the newline characters in the file (but newlines are optional in $\TeX$). So here is what I believe fixes these problems.

First define the example $\TeX$ content:

tex = "\\commandnameA{Bob}
    \\commandnameB{It wasn't Bob}
    \\commandnameC{2012}
    \\commandnameD{$\\frac{1}{3}$}
    \\commandnameE{\\commandnameF{It was Bob!}}
    \\commandnamefilename{29}";

Now comes the function that does the translation:

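translate[t_] := 
 Module[{regex = 
    RegularExpression[
     "\\\\commandname[a-zA-Z]+\\{((?:[^{}]|\\{[^{}]*\\})*)\\}"]},
  (* assumed pattern: a \commandname keyword, then a braced argument with at most one level of nested braces *)
  Flatten[{t, StringCases[t, regex :> translate["$1"]]}]]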

And finally the application:

Rest[translate[tex]]

{"Bob", "It wasn't Bob", "2012", "$\\frac{1}{3}$", "\\commandnameF{It was Bob!}", "It was Bob!", "29"}

The translate function finds matching braces following any of the \commandname keywords, and applies itself recursively to the resulting content. It returns the supplied argument plus the result of the recursive translation.

Therefore, the first entry in the result of translate is always the original text. That is why I use Rest to print the desired sub-strings.
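For arguments with deeper brace nesting, such as ${2012^{}}$, one option is to count brace depth explicitly instead of encoding the nesting in the regex. The following is only a sketch under that assumption: bracedArgument and extractArguments are hypothetical helper names, not part of the code above, and escaped braces such as \{ are not treated specially. Because every \commandname prefix in the string is located on its own, nested commands are picked up without recursion.

```mathematica
(* Hypothetical helper: given the position of an opening "{" in s,
   return the text up to its matching "}" by counting brace depth.
   Returns $Failed if the braces never balance. *)
bracedArgument[s_String, start_Integer] :=
 Module[{chars = Characters[s], depth = 0, pos = start, end = 0},
  While[pos <= Length[chars] && end == 0,
   Switch[chars[[pos]],
    "{", depth++,
    "}", depth--; If[depth == 0, end = pos],
    _, Null];
   pos++];
  If[end == 0, $Failed, StringJoin[chars[[start + 1 ;; end - 1]]]]]

(* Hypothetical helper: collect the argument of every \commandnameX{...} *)
extractArguments[s_String] :=
 Module[{starts},
  (* the end of each match is the position of the opening brace *)
  starts = StringPosition[s,
     RegularExpression["\\\\commandname[a-zA-Z]+\\{"]][[All, 2]];
  Select[bracedArgument[s, #] & /@ starts, StringQ]]

extractArguments[
 "\\commandnameC{${2012^{}}$} \\commandnameE{\\commandnameF{It was Bob!}}"]
```

The arguments come back in order of appearance in the string, so an outer command's argument precedes the arguments of the commands nested inside it.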



The problem with a regex only solution (which is the reason David's solution also fails) is that regex cannot adequately handle tex, which is not a regular language (i.e., not Chomsky type 3). You cannot handle bracket matching perfectly and no matter what solution you come up with, you can always create valid tex that will break it. For example, try yours by replacing 2012 with $2012^{}$. :) Relying on newlines is not very robust either, but that was how the OP mentioned their code was (I'm very used to writing tex that way because it's easier to put it under version control that way) – rm -rf – 2012-04-29T23:55:31.167

That one is trivial to fix. I've edited it in. We're talking about a very special parsing problem, so it should be doable. Of course I'm not relying entirely on regex, that's why I have the recursion in there. – Jens – 2012-04-30T00:43:03.150

Well, it now breaks for ${2012^{}}$ ;) I'll stop here, but my point was largely that it's a very complex problem to make a truly robust solution and that even in this simple case, the regex is barely readable (you could say that about most non-trivial regexes). This isn't even outrageous tex code; I've seen similar stuff in people's documents, where they used to have a superscript and then deleted it, but left the {} in just in case they needed to reinsert a superscript. Nevertheless, a +1 from me because it does answer the OP's question and doesn't have the issue that David's solution has – rm -rf – 2012-04-30T01:02:23.813

That's a valid point: if you need deeper nesting of braces then it will require additional recursion logic. It can be done, but at least the example didn't require it... – Jens – 2012-04-30T01:20:15.100

If you want to be completely robust, I guess the only way is to extract the information directly in TeX anyway. Unless you want to re-implement the TeX interpreter, of course. – celtschk – 2012-04-30T14:46:57.880

Absolutely true. I just didn't want to assume that a TeX engine is installed, and took the example file as delineating all the requirements of the import. – Jens – 2012-04-30T14:59:58.727


Here's a way using StringSplit and StringCases. The file test.tex is a file with your tex example.

tex = StringSplit[Import["test.tex", "Text"], "\n"]
StringCases[tex, "\\commandname" ~~ LetterCharacter .. ~~ "{" ~~ x__ ~~ "}" :> x] // Flatten
(* Out[1]= {"Bob", "It wasn't Bob", "2012", "$\frac{1}{3}$", 
    "\commandnameF{It was Bob!}", "29"} *)
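Since the question mentions several files, the same pattern could be mapped over a directory. A minimal sketch, assuming the files sit in a directory called texfiles (a made-up name) and using a hypothetical helper extractFromFile that wraps the two lines above:

```mathematica
(* Hypothetical helper: the line-wise extraction from above, for one file *)
extractFromFile[file_String] :=
 Flatten[StringCases[
   StringSplit[Import[file, "Text"], "\n"],
   "\\commandname" ~~ LetterCharacter .. ~~ "{" ~~ x__ ~~ "}" :> x]]

(* "texfiles" is an assumed directory name; adjust it to where the files live *)
files = FileNames["*.tex", "texfiles"];

(* One rule per file: file name -> list of extracted strings *)
results = Thread[files -> (extractFromFile /@ files)]
```

Keeping the file name next to its extracted strings makes it possible to group or sort the results later, for instance by the number stored in \commandnamefilename.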

rm -rf



Regex! The main problem here will be that you've got syntax-based rules in the case of nested expressions, i.e. when you're matching \commandnameE{ ... } you don't want to match up to the first }, but up to the point where the brace balance is even again. I don't know how to take care of that using only Regex. Anyway,

(* Your string condensed into one line *)
tex = "\\commandnameA{Bob}\nblabla\n\n\n\n \\commandnameB{It wasn't Bob}\n\n \\commandnameC{2012}\n\nbla\n\n \\commandnameD{$\\frac{1}{3}$}\n\n \\commandnameE{\\commandnameF{It was Bob!}}\n\n \\commandnamefilename{29}";
(* Regex that matches the inside of a parenthesis of
   a `commandname` instruction *)
regex = RegularExpression["\\\\commandname[a-zA-Z]+\\{(.+)\\}"];
(* Apply regex *)
StringCases[tex, regex -> "$1", Overlaps -> True]
    Bob,
    It wasn't Bob,
    2012,
    $\frac{1}{3}$,
    \commandnameF{It was Bob!},
    It was Bob!},
    29

Note the trailing } in the It was Bob!} line, which you may have to take care of manually.

The regex is replaced by $1, which is the first matched sub-pattern, i.e. the first parenthesis expression in the regex. $0 would have been the entire expression, i.e. the resulting list would contain all the \commandnameX, the curly braces and so on.

The Overlaps parameter lets the regex match single characters multiple times, i.e. after matching commandE, the already matched character sequence is searched again, yielding the contents of commandF.

If you want to incorporate a dictionary of possible \commandX, simply replace the command in the regex by (command1|foo|bar|command12|...).
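As a sketch of that last suggestion, the alternation can be built from a list of names. The command list and the shortened sample string below are only illustrations. Writing the group as non-capturing, (?:...), keeps "$1" pointing at the braced argument rather than at the command name.

```mathematica
(* Illustrative subset of command names to accept *)
commands = {"commandnameA", "commandnameC", "commandnamefilename"};

(* Join the names with | inside a non-capturing group *)
regex = RegularExpression[
   "\\\\(?:" <> StringJoin[Riffle[commands, "|"]] <> ")\\{(.+)\\}"];

(* Shortened sample input, one command per line *)
tex = "\\commandnameA{Bob}\n\\commandnameB{It wasn't Bob}\n\\commandnameC{2012}\n\\commandnamefilename{29}";

StringCases[tex, regex -> "$1"]
```

With this sample string the call should pick out Bob, 2012 and 29 and skip the argument of \commandnameB, which is not in the list.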

