File-backed lists/variables for handling large data

Background

Currently I am working with some large data (most of it generated by Mathematica itself), and I usually find this a hassle. For example, I just exported a large amount of data to WDX on a machine with a lot of memory, only to find that I can't read it on my own machine (which has little memory), because the file can only be read as a whole. It is also extremely slow to import (and using MX was not an option due to different architectures).

Mathematica is excellent when working with in-memory data, as its paradigms of operating on data as a whole (Map, Outer, Table, etc.) are very convenient. But it is not great at working with data that is too large to fit into memory, and it is not good at sequential processing of on-disk data either.

There have been discussions about this (see the comment discussions on this and this question), and the following idea came up more than once: it would be great to be able to use Mathematica's native paradigms to work on large on-disk data. The data would be loaded on demand from disk and discarded when no longer needed.

I'd love to hear some ideas on how to implement a framework that does this, but read the next section for a more practical question.

Question

How can I conveniently work with data that doesn't fit in memory? Can we implement a list-like structure which fetches the data from disk as needed? (For example, when indexed, it would load only the requested list item directly from disk. Once processing of this item has finished, the memory it took up would be freed.) Alternatively, could we implement variables which are loaded from disk on demand, but can be unloaded from memory?

I'd prefer not to have to deal with things like file names (if this is backed by multiple files). I'm hoping for a nice abstraction where I never need to do explicit reading from disk; I'd like to work with an object which acts and works similarly to an in-memory list.

Ideas

  • This could be backed by the MX format, which is very fast to read and can store any expression. Unfortunately it's not portable between machines. For machine numbers, a flat binary file and BinaryReadList could be useful (see the sketch after this list).

  • Stub might or might not be useful.

  • Are databases the right way to go? I'm not familiar with them.
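
For the flat-binary-file idea mentioned in the first bullet, here is a minimal sketch of reading only the n-th fixed-size chunk of machine reals from disk, without importing the whole file. The file name "data.bin" and the chunk length are made up for illustration:

readChunk[file_String, chunkLength_Integer, n_Integer] :=
  Module[{str = OpenRead[file, BinaryFormat -> True], data},
    (* jump to the start of the n-th chunk: 8 bytes per "Real64" number *)
    SetStreamPosition[str, (n - 1)*chunkLength*8];
    data = BinaryReadList[str, "Real64", chunkLength];
    Close[str];
    data];

(* usage, assuming "data.bin" exists: readChunk["data.bin", 10^6, 3] *)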

Szabolcs

Posted 2012-01-17T22:36:02.517

@MikeB that doesn't scale though, and is inconvenient. I generally do the same (having access to machines with 512GB of RAM helps), but I'd like to know how to do things in a more reasonable way. – acl – 2012-01-18T01:18:32.610

@Szabolcs as an FYI I had previously stored all my (economic/financial) data as WDX prior to switching it over to MySQL a couple of years ago. It has made life so much easier now to update and retrieve. For your problem it seems to me that databases are designed for these sorts of tasks. Also see Sal Mangano's talk about kdb+ if extracting columns rather than rows is better for what you specifically want to do. The advantages appear to be many orders of magnitude speed enhancement. I don't have links handy but should be easy to find. – Mike Honeychurch – 2012-01-18T01:22:27.537

@LeonidShifrin's answer and comments like the one from telefunkenvf14 suggest that the direction of research for such problems is that of database technology. We are in the era of NoSQL databases, and if you are not going to reinvent the wheel on the low-level I/O details of such a DBMS, then at a higher level you have to switch the way you normally think, i.e. from records (rows) to fields (columns) to values (cells). In data modeling terms the problem you face is that of redundancy. Single-instance storage and associative technology like that in Qlikview are in the right direction. – Athanassios – 2016-02-03T21:17:01.157

@Athanassios Well, I wasn't setting too ambitious goals for this answer, I just tried to get a minimal framework to address the basic needs specific to Mathematica workflows. – Leonid Shifrin – 2016-02-03T21:30:14.347

Sure, I understand, and many thanks for sharing this with the rest of us. As I said, playing at the data model level is certainly easier, but you have to rely on the DBMS data storage. This is how I have been researching solutions on data modeling. I believe Mathematica has to be enhanced with a data structure and in-memory processing similar to those of Qlikview. That will boost popularity and it will make it super efficient with large volumes of data. Then you only have to combine this with a similar type of DBMS. – Athanassios – 2016-02-03T21:45:59.317

@Szabolcs I know this question is mad old, but it turns out WDX can support access to certain explicit positions. I dredged this up when investigating how to work with data paclets. See this: https://mathematica.stackexchange.com/a/146139/38205 for a quick rundown of the layout. Unless, of course, you already knew this and there's a subtlety I'm missing. If so please do let me know because it's always good to learn these things.

– b3m2a1 – 2017-06-06T06:53:54.967

Rather than DumpSave["mydata.mx", mydata] try Save["mydata.sav", mydata]. I find it very useful. – Chris Degnen – 2012-12-05T16:36:42.907

@Chris Unfortunately that is very very slow with large data, and also produces huge files. – Szabolcs – 2012-12-05T22:16:01.380

Using databases in Mathematica was discussed in Using Mathematica in MySQL databases. I know that QLink is used in some fairly large Feynman diagram calculations...

– Simon – 2012-01-17T22:57:15.557

Personally I find life so much easier by having data in a database and linking to Mma. – Mike Honeychurch – 2012-01-17T23:11:21.853

@Mike I have never done that, it's good to hear experiences in how well that works. E.g. how long would it take to load 2 GB of data into Mathematica, compared to MX files? – Szabolcs – 2012-01-17T23:12:58.573

@Szabolcs I haven't worked with stuff that large and I would imagine that you will run into Mma limitations. From your background and question I thought you only wanted to bring into Mma "chunks" of data on demand. In other words do you really need an entire 2GB or can you do some SQL operations to pick out what you need? – Mike Honeychurch – 2012-01-17T23:20:55.643

This is one of the times that I've taken the brute force approach: Just throw more memory at the problem. On my last update I upgraded my machine to 12 GB of RAM. On some simulations I was running it was easily eating up in excess of 2 GB per kernel (= 8 GB total). It would be convenient to have some way of streaming data in and out of kernels though. – Mike Bailey – 2012-01-17T23:31:05.547

@MikeHoneychurch You're right, I only need chunks. I asked about such a large amount of data to get a good feel for the loading speed. Loading the data will hopefully not be the bottleneck. (Compare WDX loading speed to MX - there's a huge difference.) – Szabolcs – 2012-01-18T00:14:15.940

Answers

Preamble

Over the last two days, I spent some time designing and implementing a tiny framework to deal with this problem. Here is what I've got. The main ideas involve implementing a simple key-value store in Mathematica based on the file system, heavy use and automatic generation of UpValues, some OOP-inspired ideas, Compress, and a few other things. Those who know my posts should be warned that this is going to be an unusually long one.

The problem and ideas behind the solution


Let me describe the limitations of my system right away. Since the general problem is tough, I consider a very simplified version, but one which can be useful in its own right and which can serve as a good starting point for future developments. The problem is how to file-back a large ragged numerical list, whose sublists are possibly packed but generally of different lengths. Let me say from the start that since I cannot use .mx files (to keep things platform-independent), the performance of this won't be stellar. This is a clear speed/memory trade-off situation, and the performance will be merely average. Perhaps one could make a few tweaks. The overall design was my main concern here, and I hope I've got a few things right in that department.

Let us say we have a large list already constructed in memory in Mathematica; call it testList. Its elements are lists themselves. What I will do is traverse it element by element. For a given element (sub-list), we will analyze how much memory it occupies, and if this amount exceeds a certain threshold that we specify, we will create a key-value pair for it. The key will be some dummy generated symbol, and the value will be the name of a file where we will save the contents of this element. We will actually Compress the element first, and save the compressed data.

Low-level OOP-style data exchange API

EDIT

Since using .mx files is so much faster, I added some switches which allow one to choose between using ordinary files and .mx files:

ClearAll[$fileNameFunction,fileName, $importFunction,import, $exportFunction, 
  export,  $compressFunction, $uncompressFunction]

$fileNameFunction = fileName;
$importFunction  = import;
$exportFunction = export;
$compressFunction = Compress;
$uncompressFunction = Uncompress;

fileName[dir_, hash_] := 
   FileNameJoin[{dir, StringJoin["data", ToString[hash], ".dat"]}];
mxFileName[dir_, hash_] := 
   FileNameJoin[{dir, StringJoin["data", ToString[hash], ".mx"]}];
import =  
   Function[fname, Import[fname, "String"]];
export = 
   Function[{fname, compressedValue}, 
      Export[fname, compressedValue, "String"]];
mxImport = 
   Function[fname, Block[{data}, Get[fname]; data]];
mxExport = 
   Function[{fname, compressedValue}, 
       Block[{data = compressedValue}, DumpSave[fname, data]]];

In addition, compression / uncompression can also be switched on and off. Note also that other functions further down the page have been modified accordingly.

END EDIT

As a second component, we need some high-level structure, which will represent the "skeleton" of the original list, and which will manage the on-demand data fetching and saving. As such a structure, I will use just a single symbol, say s. Here is the function which implements the management (the large one):

ClearAll[definePartAPI];
definePartAPI[s_Symbol, part_Integer, dir_String] :=
 LetL[{sym = Unique[], hash = Hash[sym], 
     fname = $fileNameFunction[dir, hash]
   },
   sym := sym =  $uncompressFunction@$importFunction[fname];
   s /: HoldPattern[Part[s, part]] := sym;

   (* Release memory and renew for next reuse *)
   s /: releasePart[s, part] :=
       Replace[Hold[$uncompressFunction@$importFunction[fname]], 
          Hold[def_] :> (ClearAll[sym]; sym := sym = def)];

   (* Check if on disk *)
   s /: savedOnDisk[s, part] := FileExistsQ[fname];

   (* remove from disk *)
   s /: removePartOnDisk[s, part] := DeleteFile[fname];

   (* save new on disk *)
   s /: savePartOnDisk[s, part, value_] :=
      $exportFunction[fname, $compressFunction @value];

   (* Set a given part to a new value *)
   If[! TrueQ[setPartDefined[s]],
     s /: setPart[s, pt_, value_] :=
       Module[{},
         savePartOnDisk[s, pt, value];
         releasePart[s, pt];
         value
       ];
     s /: setPartDefined[s] = True;
   ];
(* Release the API for this part. Irreversible *)
s /: releaseAPI[s, part] := Remove[sym];
];

How it works

Let me now explain what happens here. First, LetL is a sequentially-binding version of With, which I will show in a minute. It allows one to avoid nested With statements. The parameters of the function are the main top-level symbol s, the part index, and the directory where our key-value store will be located. Basically, in OO terms, this function creates an instance of a class with these methods: Part (part extraction), releasePart (releases the memory occupied by the part and gets ready to extract it from the file again), savedOnDisk (checks whether the part has been backed by a file), removePartOnDisk (deletes the backing file for the part), savePartOnDisk (saves the part contents to a file), and releaseAPI (needed to release resources at the end).

All this is implemented via UpValues for s. In particular, Part is overloaded, so when I now call s[[part]], it looks and feels as if I extracted a part of s (not true of course, but very convenient). The content of the part is stored in the generated symbol sym, which is unique for a given part. Notice that the definition is lazy and self-uncompressing; this is similar to a technique I used in this answer. Upon the first call, sym loads the content from the file, uncompresses it, and then assigns the result to itself. All subsequent calls take constant time, with the content of the part stored in sym. Note also that when I call releasePart, I remove the direct part content from sym, hand it to the garbage collector, and reconstruct the lazy definition for sym. This is my mechanism for releasing the part content when it is no longer needed, while still being able to load it back on demand.
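
As a side note, the lazy self-uncompressing definition can be illustrated in isolation. In this toy sketch, expensiveLoad is a hypothetical stand-in for reading and uncompressing a file:

ClearAll[expensiveLoad, sym];
expensiveLoad[] := (Pause[1]; Range[10]);  (* pretend this is slow *)
sym := sym = expensiveLoad[];

sym  (* the first evaluation takes about a second; all later ones are instant *)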

There are two important points to note regarding Compress. One is that it does not unpack packed arrays. Another is that it is cross-platform. Both are huge wins for us. Note that, essentially, for each part I create an instance of a class, where sym plays the role of an instance variable. Note also that I use the Hash of the name of sym to construct the file name. There are actually two flaws with this approach. One is that, in principle, there can be hash collisions, and currently I don't handle them at all. Another is that the symbols sym are unique only within a single session, while, as we'll see, I will be exporting their definitions. Both problems are surmountable, but for the sake of simplicity I ignore them for now. So, the above code represents the low-level data-exchange API at the level of a single list's part.

Here is the code for the LetL macro:

(* A macro to bind sequentially. Generates nested With at run-time *)

ClearAll[LetL];
SetAttributes[LetL, HoldAll];
LetL /: Verbatim[SetDelayed][lhs_, rhs : HoldPattern[LetL[{__}, _]]] :=  
  Block[{With},
    Attributes[With] = {HoldAll};
    lhs := Evaluate[rhs]];
LetL[{}, expr_] := expr;
LetL[{head_}, expr_] := With[{head}, expr];
LetL[{head_, tail__}, expr_] :=
  Block[{With}, Attributes[With] = {HoldAll};
   With[{head}, Evaluate[LetL[{tail}, expr]]]];

How it works is explained in much detail here.
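
A quick sanity check of LetL (a toy example, not part of the framework): later bindings may refer to earlier ones, which a single With does not allow:

LetL[{a = 1, b = a + 1, c = a + b}, {a, b, c}]

(* {1, 2, 3} *)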

Higher-level interface: the list-building function

This is the main function used in list building. Its name pretty much tells what it does: it extends the list by one more element. This, however, does not cost us a performance penalty, since our "list" is fake - it is a symbol s which pretends to be a list but in fact is not (it is more like a hash table filled with class instances).

ClearAll[appendTo];
Options[appendTo] = {
   ElementSizeLimit :> $elementSizeLimit,
   DestinationDirectory :> $destinationDirectory
 };
appendTo[s_Symbol, value_, opts : OptionsPattern[]] :=
  LetL[{len = Length[s], part = len + 1,
     dir = OptionValue[DestinationDirectory],
     blim = OptionValue[ElementSizeLimit]
    },
    definePartAPI[s, part, dir];
    s /: Length[s] = part;
    If[ByteCount[value] > blim,
       definePartAPI[s, part, dir];
       savePartOnDisk[s, part, value];
       releasePart[s, part],
       (* else *)
       With[{compressed = $compressFunction @value}, 
         s /: Part[s, part] := 
            (s /: Part[s, part] = $uncompressFunction@compressed);
         s /: Part[s, part, parts___] := Part[s, part][[parts]];
  ]]];

As you can see from this code, not all parts of the list are backed by files. Those which are below the size threshold are merely compressed and assigned to s via UpValues and the overloaded Part, but are not stored on disk. The code of this function is pretty self-explanatory, so I will move on.
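
To get a feel for which branch a given element takes, one can compare its ByteCount against the threshold ($elementSizeLimit, defined in the Settings section below, defaults to 50000 bytes). The particular sizes here are just illustrative:

ByteCount[RandomInteger[100, 1000]]    (* roughly 8 kB: stays in memory, compressed *)
ByteCount[RandomInteger[100, 20000]]   (* roughly 160 kB: gets backed by a file on disk *)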

Integration with the system and initialization

The following function (partially) integrates my construction with some commands that we all love. This will help our symbol s masquerade better, so that in many respects it now behaves as an ordinary list.

ClearAll[initList];
initList[s_Symbol] :=
  Module[{},
   ClearAll[s];
   (* Set a new value for part, including update on disk *)
   s /: Length[s] = 0;
   s /: HoldPattern[Take[s, {n_}]] := s[[n]];
   s /: HoldPattern[Take[s, n_]] := Take[s, {1, n}];
   s /: HoldPattern[Take[s, {m_, n_}]] := Table[s[[i]], {i, m, n}];
   s /: HoldPattern[Drop[s, {n_}]] := Drop[s, {n, n}];
   s /: HoldPattern[Drop[s, n_]] := 
      Table[s[[i]], {i, n + 1, Length[s]}];
   s /: HoldPattern[Drop[s, {m_, n_}]] :=
        Table[s[[i]], {i, Range[m - 1] ~Join~ Range[n + 1, Length[s]]}];
   s /: Map[f_, s] := Table[f[s[[i]]], {i, Length[s]}];
   s /: HoldPattern[First[s]] := s[[1]];
   s /: HoldPattern[Last[s]] := s[[Length[s]]];
   s /: HoldPattern[Rest[s]] := Drop[s, 1];
   s /: HoldPattern[Most[s]] := Take[s, {1, Length[s] - 1}];
   s /: Position[s, patt_] :=
      If[# === {}, {}, First@#] &@
        Reap[Do[If[MatchQ[s[[i]], patt], Sow[{i}]], {i, Length[s]}]][[2]]
  ];

The above code probably does not need any comments.
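
Assuming test has been initialized and filled as in the Examples section below, these overloaded commands can then be used as if test were an ordinary list, for example:

First[test] // Short
Map[Total, test] // Short          (* loads, uncompresses and totals every part *)
Position[test, {___, 42, ___}]     (* positions of sub-lists containing 42 *)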

Settings

There are a few settings I use, basically defaults for the directory and the size threshold.

ClearAll[releasePart, savedOnDisk, removePartOnDisk,
   savePartOnDisk, releaseAPI]
$destinationDirectory = $TemporaryDirectory ;
$elementSizeLimit = 50000;

Higher-level and management-level functions

The following functions realize the higher-level API, which is what the end user is supposed to work with.

ClearAll[appendList];
appendList[s_Symbol, l_List, opts : OptionsPattern[]] :=
   Do[appendTo[s, l[[i]], opts], {i, 1, Length[l]}];

ClearAll[removeStorage];
removeStorage[s_Symbol] :=
   Do[If[savedOnDisk[s, i], removePartOnDisk[s, i]], {i, Length[s]}];

ClearAll[releaseAllMemory];
releaseAllMemory[s_Symbol] :=
   Do[releasePart[s, i], {i, Length[s]}];

The last several functions are concerned with disk management and with storing the main structure / definitions on disk. The point is that, in the process of creating our key-value store, we generated lots of UpValues for s, and all those private symbols sym for the individual parts must also be saved together with s if we want to fully reconstruct the environment on a fresh kernel.

This will find the dependencies of the main symbol s. We only use UpValues, so this is quite straightforward.

(* Our current system only has one-step dependencies*)
ClearAll[getDependencies];
getDependencies[s_Symbol] :=
 Thread[
   Prepend[
     Union@Cases[UpValues[s],
     sym_Symbol /; Context[sym] =!= "System`" :> HoldComplete[sym],
     {0, Infinity}, Heads -> True],
   HoldComplete[s]
  ],
  HoldComplete] 
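
For a populated list such as test from the Examples section below, this returns the main symbol together with all the generated part symbols, wrapped in HoldComplete (the sym$nnn names shown here are session-dependent placeholders):

getDependencies[test]

(* HoldComplete[{test, sym$552, sym$553, ...}] *)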

This generates a file name. It is important that the extension for the main file is .m (Mathematica package) - I will come to that later.

ClearAll[getMainListFileName];
Options[getMainListFileName] = {
   DestinationDirectory :> $destinationDirectory,
   ListFileName -> Automatic
 };
getMainListFileName[s_Symbol, opts : OptionsPattern[]] :=
  LetL[{fn = OptionValue[ListFileName],
    fname = If[fn === Automatic, ToString[s] <> ".m", fn],
    fullfname = FileNameJoin[{OptionValue[ DestinationDirectory], fname}]},
   fullfname];

This function saves the main symbol s and those on which it depends (definitions) in a plain .m format to the disk.

ClearAll[storeMainList];
storeMainList[s_Symbol, opts : OptionsPattern[]] :=
  LetL[{filteredOpts  = 
      Sequence @@ FilterRules[{opts}, Options[getMainListFileName]],
      fname  = getMainListFileName[s, filteredOpts]},
    releaseAllMemory[s];
    If[FileExistsQ[fname], DeleteFile[fname]];
    Replace[getDependencies[s],
       HoldComplete[syms_] :> Save[fname , Unevaluated[syms]]]];

A call to releaseAllMemory is important, since it converts all possibly expanded definitions of the sym-s for the various parts back to their lazy form, and it is in that form that they will be saved.

This function does the inverse: it loads the environment on a fresh kernel:

ClearAll[retrieveMainList];
retrieveMainList[s_Symbol, opts : OptionsPattern[]] :=
  LetL[{filteredOpts  = 
      Sequence @@ FilterRules[{opts}, Options[getMainListFileName]],
      fname  = getMainListFileName[s, filteredOpts],
      imported =  Import[fname , "HeldExpressions"]
     },
    ReleaseHold[imported /.
       {TagSet -> TagSetDelayed, UpSet -> UpSetDelayed}
       ] /; imported =!= $Failed;
    ];

 retrieveMainList[___] := $Failed;

There are a few subtleties here. The problem is that Save converts delayed UpValue definitions (made with TagSetDelayed or UpSetDelayed) into immediate ones (which looks like a bug to me, but anyway). Therefore, I have to load the package in unevaluated form and do the back-replacements manually before I allow it to run.

The last function here will completely remove all the generated files from the file system:

ClearAll[deleteListComplete];
deleteListComplete[s_Symbol, opts : OptionsPattern[]] :=
 LetL[{filteredOpts  = 
    Sequence @@ FilterRules[{opts}, Options[getMainListFileName]],
    fname  = getMainListFileName[s, filteredOpts]},
    removeStorage[s];
    If[FileExistsQ[fname], DeleteFile[fname]];
    Do[releaseAPI[s, i], {i, Length[s]}];
    ClearAll[s]]; 

This completes the current version of the system, and now we are ready to start using it.

Examples and benchmarks

Initialization

The following may be considered a quick guide to the usage.

$HistoryLength = 0

We first generate a reasonably small piece of data, to have something to play with:

smallTest = RandomInteger[100, #] & /@ RandomInteger[{10000, 20000}, 300];

I will choose our top-level symbol to have the name test. Before we do anything, we must initialize it:

initList[test]

Converting a list

We now convert our list into our key-value structure:

In[83]:= appendList[test,smallTest,DestinationDirectory:>"C:\\Temp\\LargeData"];//Timing
Out[83]= {2.906,Null}

This was about 18 MB:

In[84]:= ByteCount[smallTest]
Out[84]= 18193688

And we generated about 230 files:

In[87]:= FileNames["*.dat",{"C:\\Temp\\LargeData"}]//Short
Out[87]//Short= {C:\Temp\LargeData\data530106946.dat,<<234>>,
      C:\Temp\LargeData\data530554672.dat}

Details and tests...

Note that I intentionally chose a high enough threshold that not all parts of smallTest ended up in files; some were assigned in-memory only:

In[95]:= Length[test]
Out[95]= 300

In[97]:= Position[Table[savedOnDisk[test,i],{i,Length[test]}],False]//Short
Out[97]//Short= {{3},{5},{7},{33},{34},{35},{39},<<50>>,{277},{280},{287},{290},{298},{299},{300}}

Let us now test that our file-backed system returns the right results. We pick some random positions:

In[99]:= randomPos = RandomSample[Range[Length[test]],20]
Out[99]= {287,214,9,294,32,263,12,141,282,85,213,108,22,197,77,67,41,286,146,38}

And test:

In[100]:= test[[#]]==smallTest[[#]]&/@randomPos//Timing
Out[100]= {0.203, {True,True,True,True,True,True,True,True,True,True,
True,True,True,True,True,True,True,True,True,True}}

Note that the second time the test is instant, since memoization is now at work, and there's no need to uncompress again:

In[101]:= test[[#]]==smallTest[[#]]&/@randomPos//Timing
Out[101]= {0.,{True,True,True,True,True,True,True,True,True,True,True,
True,True,True,True,True,True,True,True,True}}

Another test:

In[102]:= Take[test, {10, 20}] == Take[smallTest, {10, 20}]
Out[102]= True

Adding new elements

Let us append some elements to our list now:

appendTo[test, Range[10000]]

We check the length:

In[105]:= Length[test]
Out[105]= 301

We can also test directly:

In[116]:= test[[301]]//Short
Out[116]//Short= {1,2,3,4,5,6,7,8,9,10,<<9980>>,9991,9992,
9993,9994,9995,9996,9997,9998,9999,10000}

In[117]:= Last@test//Short
Out[117]//Short= {1,2,3,4,5,6,7,8,9,10,<<9980>>,9991,9992,
 9993,9994,9995,9996,9997,9998,9999,10000}

We can append wholesale as well:

In[118]:= appendList[test, Partition[Range[10000, 60000], 10000]]

In[119]:= Length[test]
Out[119]= 306

Memory management

I will now illustrate memory management: we will force it to load from disk and uncompress all parts:

In[120]:= MemoryInUse[]
Out[120]= 49040104

In[121]:= Take[test, {1, Length[test]}];

In[122]:= MemoryInUse[]
Out[122]= 64273408

We now release all memory, and return to lazy self-uncompressing definitions.

In[123]:= releaseAllMemory[test];

In[124]:= MemoryInUse[]
Out[124]= 49079560

Saving and reconstructing the environment

Let us now save our environment:

In[125]:= 
storeMainList[test, DestinationDirectory :> "C:\\Temp\\LargeData"] // AbsoluteTiming

Out[125]= {1.1015625, Null}

We now quit the kernel:

Quit[]

and now try to reconstruct it back:

In[126]:= 
retrieveMainList[test, 
   DestinationDirectory :> "C:\\Temp\\LargeData"] // AbsoluteTiming

Out[126]= {1.2294922, Null}

We can see that we are in business:

In[127]:= Length[test]
Out[127]= 306

In[128]:= test[[301]]//Short
Out[128]//Short= {1,2,3,4,5,6,7,8,9,10,<<9980>>,9991,9992,9993,
9994,9995,9996,9997,9998,9999,10000}

Removing the key-value store - uninstall

Finally, this will remove all the files from the system completely:

In[129]:= deleteListComplete[test,DestinationDirectory:>"C:\\Temp\\LargeData"]//Timing
Out[129]= {0.031,Null}

Larger tests

I will throw in a few larger tests, which are still kind of toy tests, but a bit more representative. We start with this:

In[130]:= MemoryInUse[]
Out[130]= 44668800

Now we create a reasonably large dataset:

In[131]:= mediumTest = RandomInteger[100,#]&/@RandomInteger[{100000,200000},1000];

This tells us how large it is:

In[132]:= ByteCount[mediumTest]
Out[132]= 607800752

In[133]:= initList[test]

It takes slightly more than a minute to convert it to our data store:

In[134]:= 
appendList[test, mediumTest, 
   DestinationDirectory :> "C:\\Temp\\LargeData",
   ElementSizeLimit:>20000]; //Timing
Out[134]= {73.906,Null}

The memory consumption is just amazing (the lack of it!):

In[135]:= MemoryInUse[]
Out[135]= 657753176

This is pretty much what the initial memory use was, plus the memory occupied by mediumTest - our construction itself takes almost no extra memory, because the large parts live on disk, the small ones are stored compressed, and everything is loaded lazily.

Here we extract some element (which is not that small):

In[136]:= test[[10]]//Short//Timing
Out[136]= {0.047,{1,19,82,24,54,12,25,5,11,4,74,7,75,
   <<176964>>,93,5,12,25,97,89,56,59,46,35,95,1,49}}

From now on, access to this particular element will be instant, until we decide to release the cache. We take some more now:

In[137]:= Take[test,{10,30}]//Short//Timing
Out[137]= {0.5,{<<1>>}}

In[138]:= ByteCount[Take[test,{10,30}]]
Out[138]= 13765152

We now take about a third of the total data set - it takes several seconds:

In[139]:= (chunk = Take[test,{1,300}]);//Timing
Out[139]= {6.75,Null}

In[140]:= ByteCount[chunk]
Out[140]= 180658600

Need for speed: Turning on .mx files

If we sacrifice being cross-platform for speed, we get a 10-40x speedup by using .mx files, and in this regime I'd be hard-pressed to see any database solution beating this in terms of performance. Here are the same benchmarks as before, done with .mx files.

First, switch to .mx:

$fileNameFunction = mxFileName;
$importFunction  = mxImport ;
$exportFunction = mxExport ;
$compressFunction = Identity;
$uncompressFunction = Identity;

Note also that I disabled compression, for maximal speed. The benchmarks:

In[57]:= MemoryInUse[]
Out[57]= 18638744

In[58]:= mediumTest = RandomInteger[100,#]&/@RandomInteger[{100000,200000},1000];

In[59]:= ByteCount[mediumTest]
Out[59]= 594434920

In[60]:= initList[test]

In[61]:= appendList[test,mediumTest,DestinationDirectory:>"C:\\Temp\\LargeData"];//Timing
Out[61]= {14.797,Null}

In[62]:= MemoryInUse[]
Out[62]= 618252872

Extraction of a single list element (including loading from disk) is now instant:

In[63]:= test[[10]]//Short//Timing
Out[63]= {0.,{7,17,36,41,54,62,49,78,63,62,84,83,14,42,42,
    <<184520>>,83,0,64,25,86,84,89,17,71,94,84,3,6,23,38}}

Extracting 20 elements is also pretty fast:

In[64]:= Take[test,{10,30}];//Timing
Out[64]= {0.047,Null}

In[65]:= ByteCount[Take[test,{10,30}]]//AbsoluteTiming
Out[65]= {0.,12279632}

We now extract about 300 elements, with a total size of about 180 MB:

In[66]:= (chunk = Take[test,{1,300}]);//AbsoluteTiming
Out[66]= {0.3281250,Null}

In[67]:= ByteCount[chunk]
Out[67]= 178392632

To my mind, this is blazing fast.

Summary and conclusions

I presented here a tiny but complete implementation of a key-value store, which may make it possible to work with data that doesn't fit in memory, notably large lists. From the technical viewpoint, this is by far the most serious application of UpValues I have ever written. I think the simplicity of the code illustrates the power of UpValues well. They also made it possible to have nice syntactic sugar and to use familiar commands such as Part, Take, etc.

The implementation has many flaws, and it is still not clear to me whether it is efficient enough to be useful, but I think this may represent a good starting point.

EDIT

As it turns out, using .mx files gives a huge speedup (which is not unexpected, of course). If speed is absolutely crucial, one can use .mx files for all computations and only use normal files to import from or export to another computer. I plan to build a layer which would automate that, but so far this can be done manually, based on the single-part API in the code above.
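
As a small convenience, the manual switching shown above could be wrapped into two helpers. These are not part of the framework, just a suggestion; they only package the assignments already used in this post:

ClearAll[useMXBackend, usePortableBackend];
useMXBackend[] := (       (* fast, but not portable across architectures *)
   $fileNameFunction = mxFileName;
   $importFunction = mxImport;
   $exportFunction = mxExport;
   $compressFunction = Identity;
   $uncompressFunction = Identity;);
usePortableBackend[] := ( (* slower, but cross-platform *)
   $fileNameFunction = fileName;
   $importFunction = import;
   $exportFunction = export;
   $compressFunction = Compress;
   $uncompressFunction = Uncompress;);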

END EDIT

All ideas, suggestions etc - most welcome!

Leonid Shifrin

Posted 2012-01-17T22:36:02.517

@LeonidShifrin For open-source projects GitHub is free, but otherwise it seems to get expensive. Maybe http://gitorious.org/ ? So my idea is that I, you, and others can open-source some of our Mathematica code and people can fork it, improve it, use it, etc.

– Rolf Mertig – 2012-01-23T01:27:44.623

@Rolf Yes, I have the same idea. But why do you think GitHub isn't a good match? It is only expensive for private development, but the code we will share will be public anyway, at least most of it. – Leonid Shifrin – 2012-01-23T07:20:19.907

@LeonidShifrin I am probably not qualified to make a judgement, but I like this answer: http://stackoverflow.com/a/1037005/887505

– Rolf Mertig – 2012-01-23T10:42:28.120

@Rolf Let me look at these options more carefully. I'll do my homework and report my findings / opinion back in a few days. – Leonid Shifrin – 2012-01-23T11:01:49.407

This is for sure one of the top three answers I've ever read on any Q&A site on internet! – VividD – 2014-10-13T07:13:01.203

@VividD Thanks :). This isn't as complete as I wanted it to be, and perhaps I will add a few things here some time soon. – Leonid Shifrin – 2014-10-13T07:18:30.670

I need a bit more time before I can provide proper feedback, but please do let me know when you put this on GitHub! – Szabolcs – 2012-01-26T16:23:54.200

@Szabolcs Sure, you and Rolf will be the first to know. It won't probably happen until next week though, since I pretty much went over all my timing limitations here on this site, and will have to reduce my presence here, and also won't have much time for SE-related stuff until then. – Leonid Shifrin – 2012-01-26T16:33:59.147

@LeonidShifrin Why is the ByteCount of smallTest always about 35 MB on my mma 10.2?! With my recent experience shown in this post http://mathematica.stackexchange.com/questions/97498/making-large-calculation-results-persistent I am wondering if v10.2 has some bugs in storing data, for it easily gets frozen when saving and the storage takes more space.

– matheorem – 2015-10-25T11:14:39.437

@matheorem I can't look into this right away, but will try not to forget to do that later. It's hard to say what's going wrong. ByteCount is not a very reliable way to measure the byte size, but it's all we've got, and in any case it shouldn't behave the way you described. – Leonid Shifrin – 2015-10-25T11:42:20.843

Wow Leonid! Many thanks for this! I won't read it tonight (some more work to do), but it'll be the first thing in the morning, with a fresh mind. – Szabolcs – 2012-01-18T20:40:03.443

@Szabolcs This had to be done. We were ignoring this problem for too long. – Leonid Shifrin – 2012-01-18T20:46:00.183

It's a really good time to do it, because I need it right now (this was a practical problem too) – Szabolcs – 2012-01-18T20:47:33.697

@Szabolcs Well, maybe you will be able to start using this right away! I really hope so! I am not quite sure about the efficiency of this stuff - how acceptable it is. Man, I was thinking all night about how to do it, but only came to this idea several hours ago. – Leonid Shifrin – 2012-01-18T20:50:17.763

Is there any way your framework could be used on existing data (created outside Mathematica)? For instance, one 4GB file of data? – b.gates.you.know.what – 2012-11-15T12:58:08.513

@b.gatessucks Yes, as long as you have some means to read that file into Mathematica piece by piece. This can be, for example, some custom file-reading function based on streams, either pure Mathematica (e.g. using BinaryRead or BinaryReadList), or say Java. Then, you just have to read the next element of the list (assuming your file represents one large list - which is the current assumption of the framework), and use appendTo from the above API to append it to the file-backed list (or you can use appendList to append many elements at once). If you give a specific file format /... – Leonid Shifrin – 2012-11-15T17:44:07.050

@b.gatessucks ... /example (say, as a separate question), I can try to illustrate how it can be done with some specific code. Once you convert your file to the file-backed list and save it, you can work with it fully within the framework and get all the benefits it can give you. – Leonid Shifrin – 2012-11-15T17:45:39.933

@LeonidShifrin Many thanks Leonid, your suggestion gets me started. No need to use more of your time for now. – b.gates.you.know.what – 2012-11-15T19:41:39.537

@LeonidShifrin - Have you seen this before? (1) Sal Mangano's presentation on linking to Kdb+ (lightweight and super fast in-memory database for financial crowd)? The 32-bit version was free last time I checked---but I believe it's still pretty darn fast, when files can't be completely stored in memory. http://www.youtube.com/watch?v=AGGGU7tVdEk (2) Also, I asked about a Hadoop-link during a live conference and was told it was being worked on. Dunno if that's for v9 or if it was included in the Finance Platform. (3) Why not just dl a preconfigured PostgreSQL appliance and use DB link? Slower?

– telefunkenvf14 – 2012-11-27T18:14:32.760

@telefunkenvf14 Yes, I was actually lucky to have attended that talk in person. Very impressive, but Sal was more talking about how he was generating K by Mathematica - so some knowledge of K / KDB would be needed. Re: Hadoop link: so far I only know that it exists, and in fact has been open-sourced on GitHub IIRC. But I may have a chance to get to know it much better in the near future. Re: PostgreSQL: much slower than .mx files, I am sure about it (did not benchmark however), plus potentially my framework is more flexible, since the data in a list can be any Mathematica expression. – Leonid Shifrin – 2012-11-27T18:43:24.730

@telefunkenvf14 The biggest advantage of the above code is that it is only 200 lines, and provides a rather natural high-level interface. Of course, it does not do much, compared to the alternatives you mentioned. – Leonid Shifrin – 2012-11-27T18:46:34.947

Nice idea, but I do not think that platform dependency is so important here. I would not rule out mx files. I think for truly large problems (I had one with nearly 40 GB in about 100 mx files) your approach will be too slow and the only alternative is really to use fast databases (which I did not do yet). – Rolf Mertig – 2012-01-19T00:16:31.687

@RolfMertig it depends. For instance, I often produce lots of data on the large machines at work and then have trouble analyzing them on my laptop (which is where I always work). They run a different OS, so mx files are out (in addition to which, I don't have enough memory to load the data in RAM on my laptop). So this looks very useful, if I can make it do what I want. – acl – 2012-01-19T01:02:47.853

Holy cow! +1 for the monster effort! – Markus Roellig – 2012-12-06T15:19:46.433

@MarkusRoellig Thanks:). Well, it's been already a while ago, time's running. – Leonid Shifrin – 2012-12-06T15:36:39.163

+1, longer than usual is a bit of an understatement. New chapter in the book? – rcollyer – 2012-01-19T03:33:42.753

@Rolf Thanks, your input is very valuable. I had your problem in mind as well. I tried .mx file and everything is blazing fast (10-20 times speedup w.r.t. normal). I will post the changes soon, they are minimal. I think, no database can give you the speed I am able to achieve with this framework and .mx files. For example, I am getting 180Mb from file in 300 chunks (300 files) in 0.3 s. For databases, you will have the link overhead which is likely do be the dominant bottleneck. – Leonid Shifrin – 2012-01-19T13:54:33.203

@Rolf Done now. Please have a look. I think this is pretty fast. – Leonid Shifrin – 2012-01-19T15:10:40.943

@Leonid: Fantastic! Thanks. I have no time now, but next week I will try to use your setup for my old project. Shouldn't we start a github or google-code or some place like that, so these useful programs can be more easily obtained, developed, etc.? – Rolf Mertig – 2012-01-20T23:02:40.727

This will be really interesting to see on github once it is up. Looking forward to it! – nixeagle – 2012-03-23T04:15:57.360

@nixeagle Thanks! I hope I'll get to that soon. – Leonid Shifrin – 2012-03-23T07:50:21.930

@Rolf Yes, I thought of Github already. I actually have this in my plans. – Leonid Shifrin – 2012-01-21T08:57:32.613