Computing intersections between a large number of sets: doing it sequentially vs. sending everything to Intersection[…]

Say we want to compute the largest common subset among a very large number of sets $Q_i$. One option would be to simply run the command:

Intersection[Q1, Q2, Q3, Q4, Q5, Q6, Q7, …];

However, this seems to choke for very large numbers of input sets ($10^4$ or so, with on the order of $10^2$ elements each). Another option would be to compute Q12 = Intersection[Q1, Q2], then Q123 = Intersection[Q12, Q3], then Q1234 = Intersection[Q123, Q4], and so forth.
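For concreteness, here is a minimal sketch of that sequential procedure, assuming the sets are gathered into a (hypothetical) list qs:

qs = {q1, q2, q3, q4};  (* hypothetical list of sets *)
running = First[qs];
(* fold in one set at a time: Q12, Q123, Q1234, … *)
Do[running = Intersection[running, qs[[i]]], {i, 2, Length[qs]}];
running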

Are these two procedures equivalent? Is there a better way?

=================

They are not equivalent. I think nesting them is a bad idea, because the function Intersection[] does support multiple lists. In general I always try to use functions that do exactly what I need and no more, because it seems quicker to me that way.
– Coolwater
Apr 4 ’14 at 18:34

What you are asking for is Fold[Intersection, First@#, Rest@#] &@list but it is slower.
– Kuba
Apr 4 ’14 at 18:39

@Coolwater the final result should be equivalent though.
– Yves Klett
Apr 4 ’14 at 18:41

@Kuba Just to check, is the procedure you suggest the same as the one I’m talking about, where the number of intersection operations required is proportional to the number of sets?
– CA30
Apr 4 ’14 at 18:44

@CA30 Take a look at Fold in the documentation. It does exactly n-1 intersections.
– Kuba
Apr 4 ’14 at 18:46

=================

1 Answer

=================

As Kuba notes, using Fold is the canonical way to implement your “sequential” algorithm in Mathematica. Performance is best determined by simply Timing your operation.

SeedRandom[1]
sets = RandomInteger[200, {10000, 1600}];

Intersection @@ sets // Timing
Fold[Intersection, #, {##2}] & @@ sets // Timing

{0.983, {12, 90, 94, 101, 164}}

{1.045, {12, 90, 94, 101, 164}}

(Timings performed in Mathematica 7 under Windows 7.)

So we see that the sequential method is, if anything, slightly slower than the direct application of Intersection. I think it is probable that Intersection already uses this algorithm internally.

Regarding your second question: is there a better way? There can be, depending on your data. In the example above there is a lot of duplication within each set, and eliminating it beforehand greatly speeds up the operation of Intersection:

Intersection @@ (DeleteDuplicates /@ sets) // Timing

{0.172, {12, 90, 94, 101, 164}}
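Why does this help so much here? Each set was drawn with RandomInteger[200, ...], so despite having 1600 elements it can contain at most 201 distinct values (0 through 200). A quick way to confirm the duplication, using the same sets variable as above (a sketch, untested):

(* the largest number of distinct values in any one set is at most 201 *)
Max[Length /@ (DeleteDuplicates /@ sets)]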