Distribute and analyze large datasets



I need to analyze very large datasets that obviously don't fit in a computer's memory. I need to produce some statistics on these datasets, and this brings me here:

How can I perform statistical analysis on distributed datasets?

I imagine Mathematica may split the data (hopefully automagically) over many kernels, whether local or remote.

Has anyone had any experience with this?


Posted 2013-05-24T10:49:20.630

Reputation: 493

Good question. I'm not aware of anything like this existing at the moment, but surely it is possible (though not trivial) to implement it. Would you like to give it a try? – Oleksandr R. – 2013-05-24T11:50:30.963

I've provided an initial answer below and find the question interesting. Still, I do think it would benefit from more detail, specificity, and an example of what the OP may have tried. – Jagra – 2013-05-24T17:22:57.637

Well... today MMA is not a good choice for this kind of operation. Maybe in MMA 10. – Murta – 2013-05-25T04:15:52.097

@OleksandrR. yes, I will definitely try to implement this. – senseiwa – 2013-05-28T07:58:23.953

@Murta that's a shame; some operations might be easily parallelized, and considering that MMA already contains parallel primitives, I find this very strange. – senseiwa – 2013-05-28T07:59:56.163



This is not exactly an answer to the question as posed, but rather a suggestion for how one can potentially cause the separate subkernels to be able to communicate with one another. I have previously posted this on MathGroup, but there seems not to be any good reason not to present it here as well.

The approach demonstrated is based on the idea that MathLink connections can be established between subkernels on a peer-to-peer basis. Then, one has a message passing mechanism that could in principle be employed similarly to MPI. However, the code shown here does not include most essential elements of MPI, e.g. collective communication primitives, which would need to be implemented first.

Firstly, we start the subkernels between which MathLink connections will later be established. Since the environment will no longer be stateless after the subkernels are able to communicate with one another, we must set the appropriate options so that there will be no attempt at recovery in case any subkernel fails.

SetSystemOptions["ParallelOptions" -> {
    "RecoveryMode" -> "Abandon", "RelaunchFailedKernels" -> False
   }];
Quiet[CloseKernels[]; LaunchKernels[4], LaunchKernels::nodef];

Internally, the Parallel` package is based on anonymous kernel objects signifying locations to which expressions may be submitted in a queue for evaluation. The $KernelIDs correspond to these abstract objects and may be duplicated or reassigned arbitrarily; they are consequently unsuitable to our present purpose, necessitating the use of $RankID instead.

ranks = Range@Length[
    kernels = Kernels[]
   ];
With[{rankAssignments = Unevaluated[$RankID = #] & /@ ranks},
  Parallel`Developer`ParallelDispatch[rankAssignments, kernels]
 ];

Next, we define the Rank object, which will store (as downvalues) the properties of each of the ranks and implement (through upvalues) the context-sensitive substitution of these properties into various expressions for which they are required. Note that KernelObjects and Parallel` evaluation functions are not ordinarily propagated to the subkernels since they refer to and operate on the subkernels' $ParentLink. Therefore, we define the following relations only in the master kernel.

SetAttributes[Rank, NHoldFirst];

Rank[ranks_List?(VectorQ[#, IntegerQ] &), "KernelObject"] := 
  Rank[#, "KernelObject"] & /@ ranks;
MapThread[Set, {Rank[ranks, "KernelObject"], kernels}];
Rank[_, "KernelObject"] = $Failed;

Rank /: ParallelEvaluate[cmd_, Rank[ranks : _Integer | _List?(VectorQ[#, IntegerQ] &)]] :=
  ParallelEvaluate[cmd, Rank[ranks, "KernelObject"]];
Rank /: Parallel`Developer`ParallelDispatch[
    cmds_, Rank[ranks : _Integer | _List?(VectorQ[#, IntegerQ] &)]
   ] /; Length[cmds] == Length[ranks] :=
  Parallel`Developer`ParallelDispatch[cmds, Rank[ranks, "KernelObject"]];

Having created a mapping between subkernels and ranks, we are now in a position to open MathLink connections between them. For each pair of ranks we require that one endpoint creates a link, to which its partner subsequently connects. A convenient way to divide this work (and that does not depend on the number of ranks) is to partition the matrix of connections into upper and lower triangular portions. Here we form the lower triangle and hence determine by which ranks the links are to be created.
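To make the pairing concrete, here is a small illustration (not part of the construction itself) of the (creator, connector) pairs that the lower triangle yields for four ranks:

    (* Each pair {i, j} with i > j: rank i creates the link, rank j connects to it. *)
    Flatten[Table[{i, j}, {i, 4}, {j, i - 1}], 1]
    (* -> {{2, 1}, {3, 1}, {3, 2}, {4, 1}, {4, 2}, {4, 3}} *)

Every unordered pair of ranks appears exactly once, so exactly one link is created per pair regardless of the total number of ranks.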

With[{
   allRanks = $AllRanks = ranks,
   rankCount = $RankCount = Length[ranks],
   machineRules = ParallelEvaluate[$RankID -> $MachineID]
  },
  ParallelEvaluate[
    $AllRanks = allRanks;
    $RankCount = rankCount;
    locallyLinkedRanks = Take[$AllRanks, $RankID - 1];
    localLinks = If[#1 === #2, (* Same $MachineID on both link endpoints? *)
        (* Optimization for intra-machine links. *)
        LinkCreate[LinkProtocol -> "SharedMemory"],
        (* Inter-machine link; must use TCP/IP. *)
        LinkCreate[LinkProtocol -> "TCPIP"]
       ] & @@@ Thread[{$RankID, locallyLinkedRanks} /. machineRules];
    (* We also create a loopback link. Although this is technically unnecessary,
       it may prove useful for testing or help to simplify the construction of
       various collective communication algorithms. Loopback links do not have
       any LinkProtocol. *)
    ownLink = LinkCreate[LinkMode -> Loopback];
   ]
 ];

With the necessary links created, all that remains is for the remaining rank in each pair to connect to the link hosted by its partner. After the links have been opened from both ends, the partners each call LinkActivate on the connection in order to acknowledge each other.

With[{
   linkRules = ParallelEvaluate[
     (* Results must be returned in rank order or otherwise links will
        be established between incorrectly permuted pairs of ranks. *)
     Thread[locallyLinkedRanks -> localLinks] /.
      linkObject : LinkObject[linkName_String, __] :> {
        linkName,
        LinkProtocol -> "LinkProtocol" /. MathLink`LinkDeviceInformation[linkObject]
       }
    ]
  },
  ParallelEvaluate[
    remoteLinks = LinkConnect @@@ Flatten[ReplaceList[$RankID, linkRules], 1];
    (* LinkActivate is blocking at both endpoints and so the calls
       should be made in a particular order so as to avoid deadlock.
       The rotation is not required for correctness but rather overlaps
       pairwise calls that have no serial dependence. *)
    LinkActivate /@
     RotateRight[$RankLinks = Flatten[{localLinks, ownLink, remoteLinks}], $RankID - 1]
   ]
 ];

The links having been established, it is now useful to be able to call MathLink functions with a rank ID rather than a LinkObject, so here we overload Rank in order to accomplish this. This time, the definitions are rank-specific and meaningful only to the subkernels.

ParallelEvaluate[
  SetAttributes[Rank, NHoldFirst];

  Rank[ranks_List?(VectorQ[#, IntegerQ] &), "LinkObject"] := 
   Rank[#, "LinkObject"] & /@ ranks;
  MapThread[Set, {Rank[$AllRanks, "LinkObject"], $RankLinks}];
  Rank[_, "LinkObject"] = $Failed;

  Rank /: LinkWrite[Rank[rank_Integer], expr_] := 
   LinkWrite[Rank[rank, "LinkObject"], Unevaluated[expr]];
  Rank /: LinkWrite[Rank[ranks_List?(VectorQ[#, IntegerQ] &)], expr_] :=
   LinkWrite[#, Unevaluated[expr]] & /@ Rank[ranks, "LinkObject"];

  Rank /: LinkFlush@Rank[rank_Integer] := 
   LinkFlush@Rank[rank, "LinkObject"];
  Rank /: LinkFlush@Rank[ranks_List?(VectorQ[#, IntegerQ] &)] := 
   LinkFlush /@ Rank[ranks, "LinkObject"];

  Rank /: LinkReadyQ@Rank[rank_Integer] := 
   LinkReadyQ@Rank[rank, "LinkObject"];
  Rank /: LinkReadyQ@Rank[ranks_List?(VectorQ[#, IntegerQ] &)] := 
   LinkReadyQ /@ Rank[ranks, "LinkObject"];

  Rank /: LinkRead[Rank[rank_Integer], h_: Identity] := 
   LinkRead[Rank[rank, "LinkObject"], h];
  Rank /: LinkRead[Rank[ranks_List?(VectorQ[#, IntegerQ] &)], h_: Identity] :=
   LinkRead[#, h] & /@ Rank[ranks, "LinkObject"];
 ];

To demonstrate the general principle, let's perform a simple all-to-all collective communication:

    "Message from rank " ~~ ToString[$RankID] ~~ " to rank " ~~ ToString[#]
      ] & /@ $AllRanks;
(* -> {{True, False, False, False}, {True, True, False, False},
       {True, True, True, False}, {True, True, True, True}} *)

ParallelEvaluate[LinkFlush@Rank[$AllRanks]; LinkReadyQ@Rank[$AllRanks]]
(* -> {{True, True, True, True}, {True, True, True, True},
       {True, True, True, True}, {True, True, True, True}} *)

ParallelEvaluate[LinkRead@Rank[$AllRanks]]
(* -> {{"Message from rank 1 to rank 1", "Message from rank 2 to rank 1", 
        "Message from rank 3 to rank 1", "Message from rank 4 to rank 1"},
       {"Message from rank 1 to rank 2", "Message from rank 2 to rank 2",
        "Message from rank 3 to rank 2", "Message from rank 4 to rank 2"},
       {"Message from rank 1 to rank 3", "Message from rank 2 to rank 3",
        "Message from rank 3 to rank 3", "Message from rank 4 to rank 3"},
       {"Message from rank 1 to rank 4", "Message from rank 2 to rank 4",
        "Message from rank 3 to rank 4", "Message from rank 4 to rank 4"}} *)

So, that's very nice, but what can we do productively? As an example, let's try a slightly more complicated communication pattern: the dissemination barrier, which can be used for synchronization between ranks. Each rank waits at the barrier for as long as necessary for all ranks to catch up; they then exit the barrier simultaneously.

ParallelEvaluate[
  (* This implementation of the dissemination barrier algorithm will
     deadlock if any of the ranks are duplicated because a blocking read
     is used for synchronization. The ordering of the participating ranks
     must be consistent between calls from different ranks for the same reason. *)
  Barrier@Rank[ranks_List?(VectorQ[#, IntegerQ] && MemberQ[#, $RankID] &)] :=
   MapThread[
    Module[{msg},
      LinkWrite[#1, BarrierMessage[ranks]]; LinkFlush[#1];
      (* Here we simply discard irrelevant messages. 
         A receive queue would be a considerable improvement over this. *)
      While[msg =!= BarrierMessage[ranks], msg = LinkRead[#2]]
     ] &,
    With[{
       rankCount = Length[ranks],
       thisRankPosition = Position[ranks, $RankID][[1, 1]]
      },
     Map[Rank,
      ranks[[#]] & /@ Mod[
        0~Range~Log[2, rankCount] //
         {thisRankPosition + 2^#, thisRankPosition - 2^#} &,
        rankCount, 1],
      {2}]
     ]
   ];
  Barrier@Rank[_] = Null;
 ];

And, the result:

ParallelEvaluate[
   (Through@{Print, Pause}@RandomReal[10];
     Barrier@Rank[$AllRanks]) // AbsoluteTiming // First
  ] // AbsoluteTiming
(* -> (prints) 6.43307
      (prints) 6.14898
      (prints) 7.92761
      (prints) 1.00838
      {7.9375000, {7.9375000, 7.9375000, 7.9375000, 7.9375000}} *)

Finally, to clean up, we close all of the links and clear the symbols defined above.

ParallelEvaluate[
  ClearAll[BarrierMessage, Barrier];
  LinkClose /@ $RankLinks;
  $RankLinks =.;
  $RankCount =.;
  $AllRanks =.;
  $RankID =.;
 ];
$RankCount =.;
$AllRanks =.;

Oleksandr R.

Posted 2013-05-24T10:49:20.630

Reputation: 22 073

This is a very good answer, although it strikes me that even a simple matrix-vector product isn't parallelizable with simple tools, e.g. by splitting the data evenly over different kernels and reducing with a sum. – senseiwa – 2013-05-28T07:49:37.940

Yes, you're right--that's not an easy thing to implement. Actually, a matrix-vector product (amongst many other things) is highly memory bandwidth limited, while the MathLink protocol is quite slow. So, unless you have a distributed matrix to begin with, there are probably not going to be any performance benefits to doing that, simply because of Amdahl's law. – Oleksandr R. – 2013-05-28T08:31:38.550

Yes, but I planned to make a simple HDF5 (or even a custom binary format) and read it in parallel. In MPI this is straightforward, provided that I always use Lustre FS on all machines. Well, MMA is always a very nice product, but I'd be happier with distributed routines rather than being able to query Alpha or plot Facebook clusters of friends. Just my opinion, FWIW. – senseiwa – 2013-05-29T12:44:45.887

I think Mathematica is just not built with HPC in mind--it probably isn't a big enough market to care about, especially since nobody who buys 2000 licences is going to pay anything close to full price. Social network integration, even if largely useless, probably has a significant marketing benefit associated with it, and is nowhere near as sensitive to implementation quality. – Oleksandr R. – 2013-05-29T13:54:11.070


More an extended comment than a complete answer (you've posed a rather general question).

I like this question.

Mathematica certainly includes the components and operations to do this. Unless some feature exists that hasn't come to mind, you'll need to combine those components and operations as a bespoke solution.

That said...

Kernels and RAM - You need more than just kernels. You need kernels on different machines (physical or virtual) each with dedicated RAM.
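For instance, kernels on other machines can be launched through the SubKernels`RemoteKernels` package (the host names below are hypothetical placeholders for your own machines):

    Needs["SubKernels`RemoteKernels`"]
    (* "compute1" and "compute2" stand in for your own machine names. *)
    LaunchKernels[RemoteMachine["compute1"]]
    LaunchKernels[RemoteMachine["compute2"]]

Each remote kernel then runs in its own process with its own memory, which is what makes the scheme below viable for data that exceeds any single machine's RAM.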

Assign data responsibility - Consider linking a machine ID ($MachineID) to a specific chunk of data readable from some central file or database. You could put all of these into a function that you distribute (DistributeDefinitions) to the remote kernels. To make this more easily scalable, you could create a table of machine ID's paired with ranges of data (positions in the file or specific records in a database). Place the table in a .csv file or some equivalent in a central location and have your distributed function read it into each kernel. This would allow you to add or delete kernels and reassign their responsibility without needing to rewrite your code.
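As a rough sketch of this idea (the file names, the 8-byte "Real64" record layout, and the CSV format are assumptions for illustration, not something given in the question):

    (* Each row of the assignment table: {machine ID, first record, last record}. *)
    assignments = Import["/shared/assignments.csv"];
    readMyChunk[] :=
     Module[{row, stream, data},
      row = First@Select[assignments, First[#] == $MachineID &];
      stream = OpenRead["/shared/data.bin", BinaryFormat -> True];
      SetStreamPosition[stream, 8 (row[[2]] - 1)]; (* 8 bytes per "Real64" record *)
      data = BinaryReadList[stream, "Real64", row[[3]] - row[[2]] + 1];
      Close[stream];
      data];
    DistributeDefinitions[assignments, readMyChunk];

Each kernel then calls readMyChunk[] to load only its own slice, and reassigning responsibility is just a matter of editing the table.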

None of this is unusual or particularly insightful; I'm just thinking through one way to accomplish what you want to do.

Read the data - Look at the relevant file-reading functions in the documentation to read in discrete chunks of the data to each remote kernel.

Collect the results - Run your analytics then collect the results in the main kernel.
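As a sketch of this step (assuming each subkernel has already loaded its slice into a variable chunk, a numeric list): combining the sufficient statistics {n, Σx, Σx²} lets the main kernel recover the global mean and variance without ever holding the full dataset:

    partials = ParallelEvaluate[{Length[chunk], Total[chunk], Total[chunk^2]}];
    {n, s, sq} = Total[partials]; (* element-wise sum over the kernels *)
    mean = s/n;
    variance = (sq - n mean^2)/(n - 1);

The same pattern works for any statistic that decomposes into per-chunk summaries plus a cheap combine step.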

Seems very workable to me and maybe not as daunting a task as @Oleksandr R. suggests.

Sorry no Mathematica automagic comes to mind ;-)

I just don't know enough about your specific problem to suggest more at this point. Any additional detail you can provide would help.

Still, a very interesting question.


Posted 2013-05-24T10:49:20.630

Reputation: 12 373

It's not a specific question; I wanted to see if, for very basic operations, I could use several kernels on different machines. Examples are endless, for instance calculating the sum of a list: splitting and sum-reducing is very straightforward, if we allow Mathematica to decide some defaults for us. – senseiwa – 2013-05-28T07:52:29.813
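For in-memory data at least, the split-and-reduce described in this last comment is available directly: ParallelCombine splits a list over the available kernels, applies a function to each piece, and combines the partial results, with Mathematica choosing the split for you.

    data = RandomReal[1, 10^6];
    ParallelCombine[Total, data, Plus] (* partial sums on each kernel, reduced with Plus *)

The harder part, as discussed above, is when the data is too large to distribute from the master kernel in the first place.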