How to select minimal subsets?

34

19

I am a newbie, so please point me in the right direction if you feel this question has been answered somewhere else before. Here goes:

Suppose I have a list like this:

{{a, b}, {b, c}, {a, b, c}, {a, b, e}, {a, c, e}, {a, e, d, f}} 

I want to strip this list of all its non-minimal sublists, by which I mean that I want to check whether each set contains a subset that's already somewhere else in the set. So, in this case, the output would need to be:

{{a,b},{b,c},{a,c,e},{a,e,d,f}}

Where {a,b,c} is dropped either because it contains {a,b} or {b,c} and {a,b,e} is dropped because it contains because it contains {a,b}

Edited to add:

I have found one solution so far which works like this:

list = {{a, b}, {b, c}, {a, b, c}, {a, b, e}, {a, c, e}, {a, e, d, f}}
Intersection[DeleteDuplicates[Apply[Intersection, Tuples[list, 2], {1}]], list]

which generates the desired result:

{{a, b}, {b, c}, {a, b, c}, {a, b, e}, {a, c, e}}

What this does:

It generates all 2-tuple subsets of the list with itself. Then, the intersection of all these tuples are calculated and all duplicates are deleted. Finally, the resulting list is compared with the original list: the intersection is than the desired result.

But: this list has length 6, so the tuple-list is 6^2 = 36. I would like this formula to also work on lists of lengths around 500 to 1000, which would mean the tuple-list is between 250 000 and 1 000 000.

If anyone is able to point me to an easier way to do this calculation, I would be very much obliged.

MrDas

Posted 2012-07-10T16:36:58.107

Reputation: 443

Answers

23

You could do something like

minSubsets[lst_] := DeleteDuplicates[SortBy[lst, Length], Intersection[#1, #2] === Sort[#1] &]

Then for the example in the question you get

lst = {{a, b}, {b, c}, {a, b, c}, {a, b, e}, {a, c, e}, {a, e, d, f}};

minSubsets[lst]

(* out: {{a, b}, {b, c}, {a, c, e}, {a, e, d, f}} *)

Heike

Posted 2012-07-10T16:36:58.107

Reputation: 34 748

You fixed that real fast. I did not expect a solution based on DeleteDuplicates to be as fast. +1. – Leonid Shifrin – 2012-07-10T17:18:23.003

1Heike you're playing my role. Usually I give the simple, direct answer, and Leonid follows with a longer but more efficient, general, etc. one. Now what am I supposed to post? :^) – Mr.Wizard – 2012-07-10T17:18:42.673

1@Mr.Wizard sorry about that. I'll let you answer the next question that asks for fancy graphics. – Heike – 2012-07-10T17:19:49.413

@LeonidShifrin Thanks. It's still a factor 10 slower than yours though. – Heike – 2012-07-10T17:20:19.163

Since Sort already sorts by length I think you can replace SortBy[lst, Length] with Sort@lst. – Mr.Wizard – 2012-07-10T17:24:18.423

Heike, thanks for that one! Let me see if I can think of a fancy graphics question for Mr. Wizard ;) – MrDas – 2012-07-10T17:26:58.733

@Heike Actually, the speed difference depends on the size of the list of subsets. DeleteDuplicates has quadratic complexity, but alas mine has as well (for list modifications in NestWhile). I will see if I can avoid that. – Leonid Shifrin – 2012-07-10T17:27:16.723

@Mr.Wizard You're right, but I thought SortBy[..., Length] would make the purpose of sorting clearer. – Heike – 2012-07-10T17:27:32.443

Heike, I like my new role: post third and beat the performance of Leonid's solution with something short and sweet. \(^o^)/ Who am I kidding, that can only happen once is a blue moon. – Mr.Wizard – 2012-07-10T22:56:56.950

Hi all, I choose Heike's answer as the accepted answer, not because all the other very brilliant pieces of code aren't good, but because this is the one I am sure I fully understand. When I'm optimizing my program later on I will be returning to the other solutions to squeeze out extra seconds. Thanks for some incredibly answers, all of you! – MrDas – 2012-07-16T20:14:15.803

37

Solution

minimal[sets_] :=
  Module[{f},
    f[x__] := (f[x, ___] = Sequence[]; {x});
    SetAttributes[f, Orderless];
    f @@@ Sort @ sets
  ]

If the original order in the subsets must be retained one may introduce an auxiliary symbol without loss of performance:

minimal2[sets_] :=
  Module[{f, g},
    f[x__] := (f[x, ___] = True; False);
    g[a_] /; f @@ a = Sequence[];
    g[a_] := a;
    SetAttributes[f, Orderless];
    g /@ Sort @ sets
  ] 

Given that many definitions are created during this process a significant amount of time is spent ordering them. By using SetSystemOptions["DefinitionsReordering" -> "None"] we can eliminate this time, making an already fast function 2X faster.

minimalFast[sets_] :=
  Module[{f, g, op = SystemOptions["DefinitionsReordering"]},
    g[f[x__]] := (f[x, ___] = 1; {x});
    g[1] = Sequence[];
    SetAttributes[f, Orderless];
    SetSystemOptions["DefinitionsReordering" -> "None"];
    # &[
      g[f @@ #] & /@ Sort@sets,
      SetSystemOptions[op]
    ]
  ]

Timings

Using Lenoid's data and top-level function, and Heike's minSubsets:

randomSets = Table[Range@# ~RandomSample~ RandomInteger@{3, #} & @ 30, {8000}]; 

(r0 = minimal[randomSets]);         // Timing // First

(r1 = minimalFast[randomSets]);     // Timing // First

(r2 = selectMinimalHT[randomSets]); // Timing // First

(r3 = minSubsets[randomSets]);      // Timing // First

r0 === r1 === Sort /@ r2 === Sort /@ r3

0.234

0.109

1.482

15.257

True


Explanation

An explanation of this code was requested. First an understanding of the basic form of this method is required. Its mechanism is explained in this answer.

What remains is the working of the Orderless attribute. This is fairly simple in concept but rather tricky in application.

The first property is that arguments are automatically sorted before anything else is done, even before the function sees them: f[2, 1, 3, 4] becomes f[1, 2, 3, 4].

The second property, and the one at the heart of this answer, is that the pattern-matching engine takes into account Orderless such that MatchQ[f[5, 7, 2], f[7, __]] is True, because there is an ordering of 5, 7, 2 that matches 7, __.

Putting this together with the version 4 UnsortedUnion function and you have a function that deletes a set if it contains all the elements of a previously seen set.


There is a complication however. The third property of Orderless is the effect it has on the creation of definitions. Among other things it changes the order in which rules are tried. Normally Mathematica orders DownValues by specificity. Because 1, ___ is more specific than __ this returns "Match":

ClearAll[f];

f[__] = "Fail"; f[1, ___] = "Match";

f[1, 2, 3]

"Match"

Orderless changes this behavior:

ClearAll[f];

SetAttributes[f, Orderless];

f[__] = "Fail"; f[1, ___] = "Match";

f[1, 2, 3]

"Fail"

I credit Simon Woods for showing me how to get around this: the definitions made before the attribute is set are still automatically ordered relative to the other DownValues. Here __ is tried after 1, __ because it is less specific:

ClearAll[f]

f[__] = "Fail";

SetAttributes[f, Orderless]

f[1, ___] = "Match";

f[1, 2, 3]

"Match"

Mr.Wizard

Posted 2012-07-10T16:36:58.107

Reputation: 259 163

1A remarkably fast solution for the top-level M code. Only 1.5 times slower then my Java solution, which is 10 times more code in 2 languages. Big +1. – Leonid Shifrin – 2012-07-10T22:11:25.447

@Simon Thanks! I said you were one of the teachers. – Mr.Wizard – 2012-07-10T22:58:51.520

@Leonid I'll have to mark this on my calendar as an auspicious day; it may be a long time before this happens again. (Thanks.) – Mr.Wizard – 2012-07-10T22:59:54.250

This is a truly amazing solution. It has been a while since I saw such a powerful use of the pattern-matching with such a great performance. I was thinking of how I would replicate this (in Java say), and did not yet find an easy way to do this. Great example of utilizing the core language. I am actually tempted to delete my long and ugly post, since with your solution, mine does not bring much to the table (slight speed increase does not justify the use of Java and the overall program size etc). – Leonid Shifrin – 2012-07-10T23:02:09.043

I edited it for you. Do you have a link to where Simon used this or was it private communication? I don't recall seeing it – rm -rf – 2012-07-13T05:38:59.810

@R.M thanks; his use is here. Go vote! In defense of my answer there his method uses more memory, and memory conservation was the goal of the question.

– Mr.Wizard – 2012-07-13T05:44:29.257

Great one!!! :) – Rojo – 2012-07-16T00:27:54.850

@Rojo, thanks. Since I know you looked at Orderless yourself, were you aware of the DownValues ordering and the workaround I describe? – Mr.Wizard – 2012-07-16T01:01:52.543

I hadn't looked in depth at Orderless as I have with Flat, and I didn't know this. It is more or less similar to some things that happen with Flat, that are also fixed by changing the attribute setting order. I think this behaviour can have some interesting potential applications... – Rojo – 2012-07-16T01:18:24.430

@Rojo I got the feeling (from chat?) that you looked at multiple attributes. Anyway, it seems that these things are not well documented or understood, but I'm sure there have been presentations about this. Please let me know if you find a good one. – Mr.Wizard – 2012-07-16T01:21:10.507

At the time I investigated Flat I did look at Orderless too. But even in Flats case that's where my focus was, some subtleties were left for the next round of learning attributes (together with OneIdentity). This language is neverending and at the same time simple. I'm still discovering things every week. Some pretty revealing... You let me know too – Rojo – 2012-07-16T01:24:07.890

2@Leonid I was able to make my function considerably faster. I'm doing my best to impress you. – Mr.Wizard – 2012-07-16T07:00:58.043

1@Mr.Wizard "I'm doing my best to impress you" - what can I say - you are more than successful :-) – Leonid Shifrin – 2012-07-16T08:17:07.413

@Leonid may I turn the tassel, Professor? :D – Mr.Wizard – 2012-07-16T17:07:17.730

@Mr.Wizard I thought you've done that already long time ago :-). – Leonid Shifrin – 2012-07-16T17:09:59.737

@Leonid would you compare getMinSubsets, minimalFast and bclMinima on your machine again? – Mr.Wizard – 2012-07-16T17:15:57.873

@Mr.Wizard On the same test as in your update, I get 0.4785156 for minimal, 0.2490234 for minimalFast, and 0.3447266 for getMinSubsets, so your code is faster than my Java code. What I don't know yet is whether or not my Java code is dominated by the data transfer on the way back from Java. Will find this out when I get more time. I will also look into a better algorithm than the one I used, since it is hugely inefficient. – Leonid Shifrin – 2012-07-16T17:25:50.897

With my modifications to this answer, it is interesting to note if you set Orderless after the definition of poker, the definitions are re-ordered, but not if it is placed before. poker[1, 2, 3, 1, 2] evaluates to Straight, then, not Two Pair. It seems by placing Orderless after the definitions, I no longer benefit from its ability to simply the patterns themselves. I think this is a case of caveat emptor, and care needs to be taken to avoid subtle bugs.

– rcollyer – 2013-06-03T13:35:22.867

@rcollyer The behavior you observe appears to match that which I described in this answer and does not come as a surprise to me, unless I misunderstand. Are you reporting something different, or only commenting on the subtle nature of this behavior? (It certainly falls into the "advanced" category.) – Mr.Wizard – 2013-06-04T05:42:01.603

@Mr.Wizard commenting on the subtle behavior, but even after playing with your example, it was still somewhat surprising. The simplifications in particular were a nice touch. – rcollyer – 2013-06-04T11:36:20.077

17

Hybrid Mathematica - Java solution

Since the top-level solution from EDIT is still rather slow, here is a Java port of it. To use it, you have to first load the Java reloader into your session.

Code

Having done that, we have to compile this class:

JCompileLoad@"import java.util.*;

   public class MinSubsets{
      public static Object[] getMinimalSubsets(int[] lsortedflat, 
                    int[] lengths){
          int[][] lsorted = new int[lengths.length][];
          int ctr = 0;
          for(int i=0;i<lengths.length;i++){
             lsorted[i] = new int[lengths[i]];
             for(int j=0;j<lengths[i];j++){
                lsorted[i][j] = lsortedflat[ctr++];
             }
          }
          int[] positions = new int[lsorted.length];
          for(int i=0;i<lsorted.length;i++){
             positions[i]=i;
          }
          Map<Integer,Set<Integer>> hash = new HashMap<Integer,Set<Integer>>();
          for(int i=0;i<lsorted.length;i++){
             for(int elem:lsorted[i] ){
                if(!hash.containsKey(elem)){
                   hash.put(elem,new HashSet<Integer>());
                }
                hash.get(elem).add(i);
             }
          }
          List<int[]> aux = new ArrayList<int[]>();
          for(int i=0;i<lsorted.length;i++){
             if(positions[i]==-1) continue;
             Set<Integer> containing = 
                new HashSet<Integer>(hash.get(lsorted[i][0]));
             for(int j = 1; j<lsorted[i].length;j++){
                containing.retainAll(hash.get(lsorted[i][j]));
             }          
             for(int elem : lsorted[i]){ 
                hash.get(elem).removeAll(containing);
             }          
             for(int pos : containing){             
                if( pos == i)continue;              
                positions[pos]=-1;
             }
             aux.add(lsorted[i]);
          }     
          return aux.toArray(); 
      }
   }"

Now, here is the Mathematica part:

ClearAll[getMinSubsets];
getMinSubsets[l : {{__Integer} ..}] :=
  With[{sorted = Sort@l},
     MinSubsets`getMinimalSubsets[Flatten[sorted ], Length /@ sorted]
  ];

getMinSubsets[l_List] :=
  With[{rules = 
      Thread[# -> Range[Length[#]]] &[DeleteDuplicates[Flatten[l]]]
    },
    Map[ Developer`ToPackedArray,
      getMinSubsets[l /. Dispatch[rules]]
    ] /. Dispatch[Reverse[rules, {2}]]
  ];

The idea is that for integer elements, I send a flattened list of them to Java plus the list of the lengths of subsets, while for general elements I first map unique elements to inetegers, then do the same thing, then map those back.

Tests and benchmarks

For our test example:

getMinSubsets[sets]

(*  {{a, b}, {b, c}, {a, c, e}, {a, e, d, f}}  *)

Now, the real sample (you will need to load the definiton of selectMinimalHT below, and also Heike's minSubsets, for comparison:

(res=getMinSubsets [ randomSets])//Length//AbsoluteTiming
(res1=selectMinimalHT[randomSets ])//Length//AbsoluteTiming
(res2 = minSubsets[randomSets ])//Length//AbsoluteTiming
res==res1==res2

(*
     {0.8750000,1177}
     {7.4492188,1177}
     {63.5615234,1177}
     True
*)

Conclusions

Depending on the data (how large are subsets on the average, and how big is a fraction of subsets containing other subsets), the hybrid Java - Mathematica solution can be 10-20 times faster than top-level Mathematica solution, and 50-100 times faster than Heike's one-liner, which I believe is the fastest of other posted solutions (the truth is that her and other posted solutions have quadratic complexity in the size of the subset list, so the larger it is, the more dramatic will be the performance difference).

EDIT The solution of @Mr.Wizard is actually the fastest top-level Mathematica solution, being only 1.5 times slower than this Java one, but also much shorter and more memory efficient END EDIT

This shows once again what can be a successful optimization path: prototype the algorithm in Mathematica first, get the asymptotic complexity right, and then move heavy part to Java.

The Java solution is also memory-hungry, like my Mathematica top-level one (and unlike Heike's solution which is very memory-efficient). So, for truly large lists, one may have to proceed iteratively, and / or also have lots of RAM available.

In any case, this Java solution may be fast enough to process your real sets in realistic time.

Top - level optimized solution using nested hash tables (used in the above Java solution as a prototype)

EDIT Apparently @Mr.Wizard's latest code is much faster than this and also much shorter END EDIT

Since you mentioned that you need to process rather large lists of subsets, I tried to optimize my code. Here is the fastest top-level implementation I was able to come up with:

Clear[selectMinimalHT];
selectMinimalHT[sets_List] :=
  Module[{hash, sorted = Transpose[{#, Range@Length@#} &@Sort@sets], 
     result},
   Do[hash[elem] = Unique[], {elem, Union@Flatten@sets}];
   Reap[Sow[#, First@#] & /@ sorted, _, 
       Do[hash[#1][set] = True, {set, #2}] &
   ];
   result  = 
     Reap[Do[
        If[sorted[[i]] == {}, Continue[]];
        Sow[sorted[[i, 1]]];
        With[{containing = 
          Apply[Intersection,
            Map[
              With[{sym = hash[#]},
                 DownValues[sym, Sort -> False][[All, 1, 1, 1]]
              ] &,
              sorted[[i, 1]]
            ]
          ]},
          Do[
             With[{sym  = hash[elem]},
               If[ValueQ[sym[set]], Unset[sym[set]]]
             ],
             {set, containing}, 
             {elem, First@set}
          ];
          sorted[[containing[[All, 2]]]] = {};
        ], (* With *)
        {i, Length[sorted]}
      ]
     ][[2, 1]];
     Remove @@ DownValues[hash][[All, 1, 1]];
     result
  ];

This is based on nested hash-tables, which are modified at run-time, but other than that, it is the same algorithm as in my original code. But, using hash-tables allows me to avoid requent copying of large lists, and, more importantly, the rules telling us which subsets are still potentially valid are updated at each step, which wasn't the case for Dispath-based rules. This allows to at least have a good asymptotic complexity, although perhaps with a large constant factor coming from a large overhead of top-level Mathematica code.

You use is as:

selectMinimalHT[sets]

(* {{a, b}, {b, c}, {a, c, e}, {a, e, d, f}} *)

Here is a more realistic sample:

randomSets = 
  Table[RandomSample[#,RandomInteger[{3,Length[#]}]]&@Range[100],{50000}];

selectMinimalHT[randomSets]//Short//AbsoluteTiming

(*
  {93.8876953,{{1,15,24},<<4703>>,
   {14,70,12,9,31,90,18,65,64,92,26,48,84,57,62,1,76,7,2,4,44,67,22}}}
*)

The complexity is approximately n*l, where n is the size of the list, and l is the average size of a subset. Note that this solution becomes quite memory-hungry, so you may want to split your list in chunks and feed those iteratively to it, combining the result with the remainder to obtain a list to be used in a new iteration.

If your subset elements are numbers, the code can be significantly sped up, by, e.g., porting the above algorithm to Java (Mathematica's Compile won't do since we need hash tables).

Initial moderately fast solution

I think, the following will be reasonably fast (although, perhaps, not the fastest):

Clear[selectMinimal]
selectMinimal[sets_List] :=
  With[{rules = Dispatch[Reap[Sow[#, #] & /@ sets, _, Rule][[2]]]},
    If[# === {}, {}, First@#] &@
       Reap[
         NestWhile[
           With[{set  = Sow@First@#},
             Complement[Rest@#, Apply[Intersection, set /. rules]]
           ] &, 
           Sort[sets], 
           # =!= {} &]
       ][[2]]
  ];

In your case, you use it as

selectMinimal[sets]

(*  {{a, b}, {b, c}, {a, c, e}, {a, e, d, f}}  *)

For some larger example, I will generate a large list of random subsets of another list:

randomSets = 
  Table[RandomSample[#, RandomInteger[{3, Length[#]}]] &@ Range[30], {1000}];

I get then

selectMinimal[randomSets]//Short//AbsoluteTiming
{0.3535156,{{1,15,10},{2,30,11},<<182>>,{22,5,9,4,2,13,24,21,11,10,27},   
  {27,30,11,5,8,29,28,18,14,15,21}}}

Leonid Shifrin

Posted 2012-07-10T16:36:58.107

Reputation: 108 027

Leonid, thank you! Your example looks very good and very fast. – MrDas – 2012-07-10T17:11:59.063

@MrDas You are welcome. Fast, yes, more or less, but the code is not very straighforward. – Leonid Shifrin – 2012-07-10T17:14:20.353

I also see the intermediate solution I later added doesn't work at all, it ignores the {a,e,d,f} which it does need to count. – MrDas – 2012-07-10T17:15:31.883

I am positively amazed at both Mathematica's facility in providing a neat solution for this and your creativity in finding this one. Thanks a million! – MrDas – 2012-07-10T20:23:43.500

1@MrDas Java solution which is another 20 times faster is on the way - check back in 10 - 15 minutes. Great problem by the way. – Leonid Shifrin – 2012-07-10T20:30:13.330

Would your trie code be adapted for doing this? Mostly just idle speculation, but a trie is a data structure that would seem ideal to do this with. – rcollyer – 2012-07-11T19:31:35.967

@rcollyer I thought about it as well, but it would likely have to be more complicated than the trie I implemented before. Alas, no time at the moment to look at it further, but I will keep that in my todo list. – Leonid Shifrin – 2012-07-11T20:12:32.540

15

I'll show a method based on an algorithm by Bentley, Clarkson, and Levine.

--- edit ---

Their idea is to presort so that any obviously minimal elements are at the front. In this case, minimal length suffices for the test of being "obviously minimal".

Then loop over remaining elements. For each one: Loop from beginning until we hit elements of same length (as they cannot be proper subsets of the element under scrutiny). If any along the way is a proper subset then this one is not minimal and we break out of the loop. Else we add it to the minimal set, at the position one past the last added element.

Any time we find a minimizer we swap it with the first element on the list. This is a heuristic improvement from the BCL paper. I suspect there are other tweaks that might improve my code in terms of speed. Probably still would not be competitive with the Orderless pattern match.

Reference:

J. Bentley, K. Clarkson, D. Levine. Fast linear expected-time algorithms for computing maxima and convex hulls. Proceeding SODA '90 Proceedings of the first annual ACM-SIAM symposium on Discrete algorithms Pages 179 - 187

There appears to be a later journal version in Algorithmica Volume 9, Number 2 (1993), 168-183.

--- end edit ---

bclMinima[ll_] := Module[
  {newl, n, len, j = 1, k = 0, lenj, l, keep}, 
  newl = Union[Map[Sort, ll]];
  newl = newl[[Ordering[Map[Length, newl]]]];
  n = Length[newl];
  len = Length[newl[[1]]];
  While[Length[newl[[j]]] == len && j <= n, j++; k++];
  While[j <= n,
   lenj = Length[newl[[j]]];
   l = 1;
   keep = True;
   While[lenj > Length[newl[[l]]], 
    If[Complement[newl[[l]], newl[[j]]] === {},
     newl[[{1, l}]] = newl[[{l, 1}]];
     keep = False;
     Break[];
     ];
    l++;];
   If[keep,
    k++;
    newl[[k]] = newl[[j]]];
   j++;
   ];
  Take[newl, k]
  ]

It performs reasonably well. Wizard's code is faster on tests I tried. I believe there is a dependency on lengths though, and if the elements are fairly long the pattern match might start to get slower. Here is an example where minimal lengths are 10.

SeedRandom[12345];
randomSets = 
  Table[RandomSample[#, RandomInteger[{10, Length[#]}]] &@
    Range[100], {5000}];

The codes come from other responses. As RM did not use a named function I recast it as minRM.

In[381]:= Timing[mins1 = minimal[randomSets];]

Out[381]= {10.38, Null}

In[382]:= Timing[mins2 = selectMinimal[randomSets];]

Out[382]= {125.29, Null}

In[383]:= Timing[mins3 = bclMinima[randomSets];]

Out[383]= {22.71, Null}

Timing[mins4 = minRM[randomSets];]

Out[387]= {66.14, Null}

In[388]:= Timing[mins5 = minSubsets[randomSets];]

Out[388]= {86.5, Null}

In[391]:= Timing[mins6 = selectMinimalHT[randomSets];]

Out[391]= {169.99, Null}

minRM gives a different result from the rest. I believe I copied it correctly but I do not rule out the possibility of error at my end.

In[395]:= SameQ[Sort[Map[Sort, mins1]], Sort[Map[Sort, mins2]], 
 Sort[Map[Sort, mins3]], Sort[Map[Sort, mins5]], 
 Sort[Map[Sort, mins6]]]

Out[395]= True

In[406]:= {Length[mins1], Length[mins4]}

Out[406]= {2833, 4935}

--- edit #2 ---

RM provided a corrected version. It gives the same result as the others for the example above and took 211.4 seconds.

--- end edit #2 ---

Mine will eventually beat Wizard's, for sufficiently large values of "eventually".

In[425]:= SeedRandom[12345];
randomSets = 
  Table[RandomSample[#, RandomInteger[{200, Length[#]}]] &@
    Range[400], {1000}];

In[427]:= Timing[mins1 = minimal[randomSets];]

Out[427]= {5.53, Null}

In[428]:= Timing[mins3 = bclMinima[randomSets];]

Out[428]= {5.17, Null}

But the list sizes are ridiculously long for the stated purpose of the original query. Also at this length there are probably more efficient ways of determining the sublist property. All in all, I'm glad I gave his an upvote. Okay, I gave a bunch of upvotes, but I'd give his another if I could.

Daniel Lichtblau

Posted 2012-07-10T16:36:58.107

Reputation: 52 368

1Since you undertook doing comparative timings would you also include Leonid's Java solution getMinSubsets? BTW, I suspect Leonid may implement your algorithm in Java; I'd like to see that. – Mr.Wizard – 2012-07-11T16:41:30.860

@Mr.Wizard I had tried it, but was not able to successfully get the JCompileLoad to work on it. – Daniel Lichtblau – 2012-07-11T16:48:31.773

Strange; I was. I guess he'd be the one to ask about that but may I know what errors you got (if any)? – Mr.Wizard – 2012-07-11T16:49:42.160

@Mr.Wizard Errors are Import::nffil: File not found during Import. >>

JCompileLoad::cmperr: The following compilation errors were encountered: $Failed – Daniel Lichtblau – 2012-07-11T16:53:04.373

1+1, for "for sufficiently large values of 'eventually'." – rcollyer – 2012-07-11T16:54:10.567

+1, very interesting. I am glad you found time to join the fun. I will definitely look into Java reloader problems, you are the second person to report them. What platform were you on for the tests? – Leonid Shifrin – 2012-07-11T23:33:10.900

I just tested on your large test, and my Java solution performs 8 times faster than yours and 4 times faster than @Mr.Wizard's, on my machine. Which means, that both your and his solutions are very fast, being written in the top-level Mathematica code. – Leonid Shifrin – 2012-07-11T23:47:20.477

@Leonid if I read that right mine is twice as fast as Daniel's on your system? Are you going to implement Daniel's algorithm in Java or is it too much bother? – Mr.Wizard – 2012-07-12T02:01:43.150

@Mr.Wizard Yes, you read it right. As to implementing in Java, no time for it at the moment, but once I get some time, why not - sure, I will give it a shot. In this respect, it is a good example, because Mathematica's Compile can not directly handle this problem (without padding to a rectangular array) due to the limitation that arrays must be rectangular. – Leonid Shifrin – 2012-07-12T09:52:50.710

Daniel, I have added a new version of my function to my answer. I think you will find its performance significant, especially on your final test set. Please consider editing your answer to include it. – Mr.Wizard – 2012-07-16T07:06:08.600

@Mr.Wizard Looks interesting. Timings at my end will need to wait a few days. I'm away from the office, sans Mathematica (but anone else could do the timings I did...) – Daniel Lichtblau – 2012-07-16T17:53:41.990

Daniel, do you have time to update this yet? – Mr.Wizard – 2012-07-29T18:37:05.690

I doubt I'll get a chance (been pulled into too many things lately). Sorry. No time to try to speed up my own method, for that matter. – Daniel Lichtblau – 2012-07-30T19:22:29.857

7

Since @DanielLichtblau used Internal`ListMin in an answer to a recent question, I will use it in an answer to this question. Given the following partial order on lists of vectors:

$$v \leq w \leftrightarrow v_i \leq w_i \text{ for all }i$$

Internal`ListMin will return the maximal subset of a set of vectors, where each member of the maximal subset is not smaller than any other vector in the set. This is exactly what is requested by the OP. The only issue is that the OP representation of a set element is not in the right form. To this end, we need the following functions to convert to and from the list form expected by Internal`ListMin:

toList[sets_] := Block[{d = ConstantArray[0, Max[sets]]},
    Internal`InheritedBlock[{d}, d[[#]]=1; d]& /@ sets
]

fromList[lists_] := Block[{d = Range @ Length @ First @ lists},
    Pick[d, #, 1]& /@ lists
]

OP example:

l = toList @ Replace[
    {{a, b}, {b, c}, {a, b, c}, {a, b, e}, {a, c, e}, {a, e, d, f}},
    Thread[{a, b, c, d, e, f} -> Range[6]],
    {2}
]

{{1, 1, 0, 0, 0, 0}, {0, 1, 1, 0, 0, 0}, {1, 1, 1, 0, 0, 0}, {1, 1, 0, 0, 1, 0}, {1, 0, 1, 0, 1, 0}, {1, 0, 0, 1, 1, 1}}

Now, we can use Internal`ListMin:

res = fromList @ Internal`ListMin @ l

{{1, 2}, {2, 3}, {1, 3, 5}, {1, 4, 5, 6}}

Converting back to the OP form of the set elements yields the same result:

Replace[
    res,
    Thread[Range[6] -> {a, b, c, d, e, f}],
    {2}
]       

{{a, b}, {b, c}, {a, c, e}, {a, d, e, f}}

Finally, a comparison with @ciao's excellent answer (the other answers are much, much slower):

RandomSets = Table[Range@#~RandomSample~RandomInteger@{3, #} &@4000, {8000}];

r1 = minsets[RandomSets]; //AbsoluteTiming
r2 = fromList @ Internal`ListMin @ toList @ RandomSets; //AbsoluteTiming

Sort[Sort/@r1] === Sort[Sort/@r2]

{18.1689, Null}

{4.43527, Null}

True

Carl Woll

Posted 2012-07-10T16:36:58.107

Reputation: 112 778

1When I saw this question had returned I kind of suspected there might be (1) A ListMin approach and (2) Carl Woll involvement. Re speed, it uses the Bentley-Clarkson-Levine algorithm. I hope others take note of the efficiency and upvote. – Daniel Lichtblau – 2017-08-03T16:33:28.530

1I forgot I had shown an implementation of the BCL method in this elderly thread (we have interns not so much older). So I should add that ListMin is a version of same, though coded in C. – Daniel Lichtblau – 2017-08-04T02:59:03.273

1Neat. +1....... – ciao – 2017-08-04T04:14:42.193

7

check[l_] :=
 If[
   ++$pos; Length@$minimals === Total @ Unitize[BitNot[l] ~BitAnd~ $minimals],
       $minimalIndicator += 2^$pos; AppendTo[$minimals, l]
 ]

binary[data2_, alphabet_] :=
  Total[2^(Length@alphabet - #), {2}] &[
    data2 /. Dispatch@MapIndexed[# -> #2[[1]] &, alphabet] ]

minimalR[data_] :=
 Block[{$minimals = {}, $pos = -1, $minimalIndicator = 0, sdat = Sort@data},
       Scan[check, binary[sdat, Union @@ data]];
       Pick[sdat, Reverse @ IntegerDigits[$minimalIndicator, 2, Length@data], 1]
 ]

Explanation:

binary[data, alphabet] receives the list of sets (data), and a list of symbols that include those in the set. It returns a list of integers, each representing one of data sets, whose bit representation are indicators of the elements of the alphabet in the set.

$minimals accumulates the already found minimal sets (as integers).

$pos stores the number of sets already checked -1

$minmalIndicator is just an integer whose bitwise representation indicates whose sets were found minimals. Having an indicator helps avoid having to reconstruct the original sets from the integer minimals and the alphabet, being able to simply use Pick. Doing it "as integer" was probably mostly due to the fact that I was already doing that for the sets and I felt like it.

check[l_], receives an integer, and every time it is called, if the integer is a minimal, it appends it to $minimals

Rojo

Posted 2012-07-10T16:36:58.107

Reputation: 40 993

1@MrWizard, it seems that this has it's place. If I make the second dataset bigger in some way, such as Table[RandomSample[#, RandomInteger[{200, Length[#]}]] &@ Range[800], {2000}]; this takes the lead – Rojo – 2012-07-17T17:04:10.893

Rojo I finally looked through this answer. Very interesting! Your method has complementary strengths to mine, meaning that yours is fast where mine is slow and vice versa. That may lend itself to a super-method using both selectively. I see some ways to streamline your code: may I make the edits? – Mr.Wizard – 2012-07-18T01:02:35.490

@Mr.Wizard, I'd rather you added the parameters as arguments to check and make it HoldRest. My coding style varies with my mood and whatever, and now I'm in a splitting-definitions phase. I'd rather not make everything belong to the same cell – Rojo – 2012-07-18T02:06:27.560

I think it's best as it is then. I hope I haven't made the style too unpleasant to you at this point. – Mr.Wizard – 2012-07-18T02:09:46.043

@Mr.Wizard I would never let you do that ;) – Rojo – 2012-07-18T02:15:27.147

7

Time from another necro badge, courtesy of the related questions sidebar...

My take:

minsets[sets_] := Module[{ut = Union@sets, jut, gb, jj,pw},
  jut = Join @@ ut;
  pw = 2^Range[Length@ut - 1, 0, -1];
  gb = GatherBy[Range@Length@jut, jut[[#]] &];
  jj = Join @@ MapThread[ConstantArray, {pw, Length /@ ut}];
  Pick[ut, 
   IntegerDigits[BitOr @@ BitAnd[BitAnd @@@ 
                 Replace[ut, Dispatch[Thread[jut[[gb[[All, 1]]]] -> 
                                       Total[jj[[#]] & /@ gb, {2}]]], {2}], 
                 Subtract[pw, 1]], 2, Length@ut], 0]];

Using the generator from what appears to have been the fastest answer and adjusting to more closely reflect the OP example (that is, rather than a result that drops over 90% of the sets, this results in closer to 50%, still less retention than OP example):

RandomSets = Table[Range@#~RandomSample~RandomInteger@{3, #} &@4000, {8000}];

this handily outperformed that answer. I did not test against others, since a few quick shorter tests showed them to perform/scale worse.

With data generated to reflect what I believe to better mimic the OP data, performance delta was greater still.

I think there's more optimization in this, since I just whipped it together based on an answer I did for maximal sets...

ciao

Posted 2012-07-10T16:36:58.107

Reputation: 23 752

Happy to cast the vote that should give you the badge you expected. :-) – Mr.Wizard – 2018-01-15T05:17:07.813

@Mr.Wizard - LOL, and happy new year to you and yours! – ciao – 2018-01-15T09:17:40.913

3

As always, there are many possible ways of doing things. Here's an example using Fold:

Fold[If[With[{u = #1~Join~{Union @@ #1}}, 
   MemberQ[u, Alternatives @@ Function[{x}, Intersection[x, #2]] /@ u]], 
   #1, {Sequence @@ #1, #2}] &, {First@#}, Rest@#] &@Map[Sort, list, {0, 1}];
(* {{a, b}, {b, c}, {a, c, e}, {a, e, d, f}} *)

rm -rf

Posted 2012-07-10T16:36:58.107

Reputation: 85 395

A large test I just did shows this variant giving a different result from the others. – Daniel Lichtblau – 2012-07-11T16:14:44.887

@DanielLichtblau Thanks, I've corrected it now. Please check it against your test (and possibly include the new timings). – rm -rf – 2012-07-11T17:02:56.273

You're using obscure Unicode. ;-p (More seriously, any reason for Sequence over Append?) – Mr.Wizard – 2012-07-11T17:10:01.963

The code below took 395 seconds on that example and returned the something the same size as the input. minRM[list_] := Fold[If[With[{u = #1~Join~{Union @@ #1}}, MemberQ[u, Alternatives @@ (x [RightTeeArrow] x â© #2) /@ u]], #1, {Sequence @@ #1, #2}] &, {First@#}, Rest@#] &@ Map[Sort, list, {0, 1}]; – Daniel Lichtblau – 2012-07-11T17:24:48.530

@DanielLichtblau Apparently the unicode that I used to make it look pretty isn't interpreted by Mathematica correctly as Function and Intersection respectively. Please see the edit for the verbose code – rm -rf – 2012-07-11T17:25:45.193

@Mr.Wizard Bitten by it again! Sadly, clip is not available for the Mac. No reason for Sequence... just artifacts of different trials hanging around – rm -rf – 2012-07-11T17:26:09.717

Does Jens' solution above mine work for you? (copyAsUnicode) – Mr.Wizard – 2012-07-11T17:36:39.373

Corrected version took 211.4 seconds. I inserted a remark about that in an edit; I do not wish to change the original post. – Daniel Lichtblau – 2012-07-11T17:44:40.813

@Mr.Wizard No. It does not copy to clipboard and the result in the temporary file it creates is also incorrect. For instance, right tee arrow is copied as subset – rm -rf – 2012-07-11T17:51:51.880

2

DeleteDuplicates + ContainsOnly

In versions 10.2+, you can use ContainsOnly as the test function

minS = DeleteDuplicates[#, ContainsOnly]&;
minS @ {{a, b}, {b, c}, {a, b, c}, {a, b, e}, {a, c, e}, {a, e, d, f}}

{{a, b}, {b, c}, {a, c, e}, {a, e, d, f}}

Alternatively, SubsetQ[#2,#]& or ContainsAll[#2, #]& instead of ContainsOnly.

TransitiveReductionGraph + RelationGraph + ContainsOnly

Also in versions 10.2+, you can use a combination of RelationGraph and TransitiveReductionGraph and get the source vertices:

minS2 = GraphComputation`SourceVertexList @ 
   TransitiveReductionGraph[RelationGraph[ContainsOnly, #]]&;

minS2 @ {{a, b}, {b, c}, {a, b, c}, {a, b, e}, {a, c, e}, {a, e, d, f}}

{{a, b}, {b, c}, {a, c, e}, {a, e, d, f}}

kglr

Posted 2012-07-10T16:36:58.107

Reputation: 302 076