Notebook on nbviewer (original) (raw)

  1. sciruby-notebooks
  2. [Data Analysis](/github/SciRuby/sciruby-notebooks/tree/master/Data Analysis) Notebook

Searching and Combining data in daru

Arel-like query syntax

Arel is a very popular ruby gem that is one of the major components of the most popular ruby frameworks, Rails. It is an ORM-helper of sorts that exposes a beautiful and intuitive syntax for creating SQL strings by chaining Ruby methods.

In daru, we have successfully adopted this syntax and the result is a very intuitive and readable syntax for obtaining any sort of data from a DataFrame or Vector.

Lets see how this syntax works with some examples:

To use this syntax we call any of the comparator methods defined on Daru::Vector and pass the results from these to the #where method available for both DataFrame and Vector.

To demonstrate with a quick example:

In [2]:

vector = Daru::Vector.new([2,4,5,51,5,16,2,5,3,2,1,5,2,5,2,1,56,234,6,21]) vector.where((vector.eq(5) | vector.eq(1)) & vector.mt(2))

Out[2]:

Daru::Vector:21491980 size: 5
nil
2 5
4 5
7 5
11 5
13 5

** Note that we use union OR (|) and union AND (&) and not logical OR (||) or logical AND (&&) **

Thus the results returned by the Vector#eq method are evaluated by Vector#where and the generated Vector is returned. The index is also preserved.

The where clause can also be used with DataFrame, with similar results.

In [3]:

df = Daru::DataFrame.new({ a: [1,2,3,4,5,6]*100, b: ['a','b','c','d','e','f']*100, c: [11,22,33,44,55,66]*100 }, index: (1..600).to_a.shuffle)

df.where(df[:a].eq(2) | df[:c].eq(55))

Out[3]:

Daru::DataFrame:19261300 rows: 200 cols: 3
a b c
263 2 b 22
248 5 e 55
339 2 b 22
160 5 e 55
505 2 b 22
311 5 e 55
199 2 b 22
232 5 e 55
410 2 b 22
125 5 e 55
350 2 b 22
572 5 e 55
29 2 b 22
166 5 e 55
68 2 b 22
270 5 e 55
273 2 b 22
416 5 e 55
11 2 b 22
492 5 e 55
85 2 b 22
198 5 e 55
306 2 b 22
364 5 e 55
589 2 b 22
220 5 e 55
467 2 b 22
341 5 e 55
269 2 b 22
190 5 e 55
158 2 b 22
354 5 e 55
... ... ... ...
514 5 e 55

The comparator methods on Vector return an object of type Daru::Core::Query::BoolArray which lets us perform OR and AND operations on it. See this blog post for more information on BoolArray and other comparator methods.

As a convenience for readability, you can also you the #and or #or methods instead of #& and #|.

In [4]:

vector.where(vector.eq(2).or(vector.eq(5)).and(vector.mt(2)))

Out[4]:

Daru::Vector:18369620 size: 5
nil
2 5
4 5
7 5
11 5
13 5

One of the major advantages of using the where clause over other more robust methods like DataFrame#filter or Vector#keep_if is that it is much faster (though not destructive). These benchmarks prove my point.

Performing joins between DataFrames

Daru::DataFrame offers the #join method for performing SQL style joins between two DataFrames. Currently #join supports inner, left outer, right outer and full outer joins between DataFrames.

To demonstrate:

In [5]:

left = Daru::DataFrame.new({ :id => [1,2,3,4], :name => ['Pirate', 'Monkey', 'Ninja', 'Spaghetti'] }) right = Daru::DataFrame.new({ :id => [1,2,3,4], :name => ['Rutabaga', 'Pirate', 'Darth Vader', 'Ninja'] }) nil

To perform an inner join on the :name column:

In [6]:

left.join(right, on: [:name], how: :inner)

Out[6]:

Daru::DataFrame:12529040 rows: 2 cols: 3
id_1 name id_2
0 1 Pirate 2
1 3 Ninja 4

An outer left join can be done with:

In [7]:

left.join(right, on: [:name], how: :left)

Out[7]:

Daru::DataFrame:12260160 rows: 4 cols: 3
id_1 name id_2
0 1 Pirate 2
1 2 Monkey
2 3 Ninja 4
3 4 Spaghetti

An outer right join can be done like so:

In [8]:

left.join(right, on: [:name], how: :right)

Out[8]:

Daru::DataFrame:11720640 rows: 4 cols: 3
id_1 name id_2
0 Rutabaga 1
1 1 Pirate 2
2 Darth Vader 3
3 3 Ninja 4

And finally, a full outer join:

In [9]:

left.join(right, on: [:name], how: :outer)

Out[9]:

Daru::DataFrame:10555680 rows: 6 cols: 3
id_1 name id_2
0 1 Pirate 2
1 2 Monkey
2 3 Ninja 4
3 4 Spaghetti
4 Rutabaga 1
5 Darth Vader 3