Find specific rows based on column entry

Suppose we have a table like the following:

A B C
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1

Say, if I want to find all rows satisfying (col A = 0 and col C = 1). Is there a way to do such query in Mathematica?

The real dataset is 4GB of hundreds of columns and hundreds of thousands of rows consists of strings, integers, and floating numbers, so doing a pattern matching might not be a feasible solution.

The end result should be a list of indices of matching rows.

=================

1

 

What is the format of your list? Please enter your list in properly formatted Mathematica syntax. What do you mean by “find all rows”? Do you want the function return the positions of the values? Anyway, try Position[list, {0, _, 0}].
– march
Aug 2 at 17:43

  

 

doing a pattern matching costs a lot of typing because my dataset is so large.
– Kaa1el
Aug 2 at 17:54

  

 

Then you need to more completely specify your actual problem in your post. You can keep your particular example (minimal examples are best), but you need to explain what exactly you need: how big your datasets are, what the format of the list is, etc. The pattern that I proposed can be automated, specifying particular columns in the automation, but we you need to give us more information first.
– march
Aug 2 at 17:57

1

 

I’m sorry, but that doesn’t really clarify to me what the problem with pattern matching is, and what the actual use case is. Let me ask some clarifying questions: (1) Do you want to be able to specify which columns contain certain numbers as arguments to a function, and have the the function spit out the rows that match that criterion? (2) Is pattern-matching too slow, or is the typing-out of the pattern the problem? (3) Do you care about performance optimization? Answers will differ depending on that. Etc.
– march
Aug 2 at 18:01

1

 

Have you looked into representing your data as an Association? I think this would be useful for you. As it is, for such large lists, searching in general is a computationally intensive problem. The right data structure helps a lot, I think.
– march
Aug 2 at 18:09

=================

3 Answers
3

=================

Here are two options. The first relies on pattern-matching, creating a pattern on the fly based on your choice of columns and values. The second relies on using Sow and Reap while Selecting those rows that match.

findRows1[data_List, columns : {__Integer}, values_List] /; Length@columns == Length@values :=
Flatten@Position[data,
Normal@SparseArray[Thread[columns -> values]~Append~(_ -> _), {Last@Dimensions@data}]
]

findRows2[data_List, columns : {__Integer}, values_List] /; Length@columns == Length@values :=
Flatten@Module[{i = 0},
Last@Reap@Select[data, If[i++; #[[columns]] === values, Sow[i]; True, False] &]
]

(I think there’s a more elegant way of spitting out the current row in the second function, but this works.)

Let’s take a sample list the following 1000 by 500 list:

SeedRandom[0]
mat = RandomChoice[{“a”, “b”}, {500, 1000}];

Then, let’s specify that we want to pick columns 1 and 3 to contain “a” and “b”, respectively:

findRows1[mat, {1, 3}, {“a”, “b”}]
findRows2[mat, {1, 3}, {“a”, “b”}];
% === %%

Once we get more answers, someone can do these timings, but in the case above (when you’re only testing two columns, the first evaluated in 0.05 seconds, and the second evaluated in 0.0014 seconds.

  

 

Thanks! I will definitely do a benchmark testing about performance after I tryout all these methods.
– Kaa1el
Aug 2 at 18:53

  

 

Here are some timing results: method1: 6.84375 6.70313 6.625 6.73438 7.20313 method2: 9.71875 9.4375 11.2344 9.85938 9.5 so it appears your first method wins
– Kaa1el
Aug 2 at 19:04

  

 

@Kaa1el. One last thing: depending on how many columns you match, one or another of the different methods will work better. Performance optimization often requires specifics.
– march
Aug 2 at 19:26

Import the data as a table, convert to a Association to enable fast lookups (if you have 4GB of data then I don’t recommend using a Dataset

list = ImportString[“A B C
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1”, “Table”];
dset =
Map[AssociationThread[First@list , #] &, Rest@list]
(* {<|"A" -> 0, “B” -> 0, “C” -> 0|>, <|"A" -> 0, “B” -> 0,
“C” -> 1|>, <|"A" -> 0, “B” -> 1, “C” -> 0|>, <|"A" -> 0, “B” -> 1,
“C” -> 1|>, <|"A" -> 1, “B” -> 0, “C” -> 0|>, <|"A" -> 1, “B” -> 0,
“C” -> 1|>, <|"A" -> 1, “B” -> 1, “C” -> 0|>, <|"A" -> 1, “B” -> 1,
“C” -> 1|>} *)

Use Select to get the rows you want,

Select[dset, #A == 0 && #C == 1 &]
(* {<|"A" -> 0, “B” -> 0, “C” -> 1|>, <|"A" -> 0, “B” -> 1,
“C” -> 1|>} *)

From here, it’s not too painful to get the indices from the rows,

Select[dset, #A == 0 && #C == 1 &] // (Position[dset,
Alternatives @@ #] &) // Flatten
(* {2, 4} *)

  

 

Why not use AssociationThread[] directly?
– J. M.♦
Aug 2 at 18:32

1

 

Because I have no experience with it and you hadn’t pointed it out to me yet 😛
– JasonB
Aug 2 at 18:33

1

 

Don’t say I never look out for you. 😛
– J. M.♦
Aug 2 at 18:34

  

 

So MapAssociationThread still shows up in blue……
– JasonB
Aug 2 at 18:35

1

 

Well, you still need Map[] after all. 🙂 Map[AssociationThread[First[list], #] &, Rest[list]]
– J. M.♦
Aug 2 at 18:36

If all of the columns are part of one table, try

matrix = RandomInteger[{0, 1}, {8, 3}];

For[i = 1, i <= Length[matrix], i++, If[matrix[[i, 1]] == 0 && matrix[[i, 3]] == 1, Print[i]]]; 1   You can do RandomInteger[{0, 1}, {8, 3}] instead of Table[RandomInteger[{0, 1}], {i, 8}, {j, 3}]. – march Aug 2 at 18:19