Notebook on nbviewer (original) (raw)
- sciruby-notebooks
- [Data Analysis](/github/SciRuby/sciruby-notebooks/tree/master/Data Analysis) Notebook
Searching and Combining data in daru¶
Arel-like query syntax¶
Arel is a very popular ruby gem that is one of the major components of the most popular ruby frameworks, Rails. It is an ORM-helper of sorts that exposes a beautiful and intuitive syntax for creating SQL strings by chaining Ruby methods.
In daru, we have successfully adopted this syntax and the result is a very intuitive and readable syntax for obtaining any sort of data from a DataFrame or Vector.
Lets see how this syntax works with some examples:
To use this syntax we call any of the comparator methods defined on Daru::Vector
and pass the results from these to the #where
method available for both DataFrame
and Vector
.
To demonstrate with a quick example:
In [2]:
vector = Daru::Vector.new([2,4,5,51,5,16,2,5,3,2,1,5,2,5,2,1,56,234,6,21]) vector.where((vector.eq(5) | vector.eq(1)) & vector.mt(2))
Out[2]:
Daru::Vector:21491980 size: 5 | |
---|---|
nil | |
2 | 5 |
4 | 5 |
7 | 5 |
11 | 5 |
13 | 5 |
** Note that we use union OR (|
) and union AND (&
) and not logical OR (||
) or logical AND (&&
) **
Thus the results returned by the Vector#eq
method are evaluated by Vector#where
and the generated Vector is returned. The index is also preserved.
The where
clause can also be used with DataFrame, with similar results.
In [3]:
df = Daru::DataFrame.new({ a: [1,2,3,4,5,6]*100, b: ['a','b','c','d','e','f']*100, c: [11,22,33,44,55,66]*100 }, index: (1..600).to_a.shuffle)
df.where(df[:a].eq(2) | df[:c].eq(55))
Out[3]:
Daru::DataFrame:19261300 rows: 200 cols: 3 | |||
---|---|---|---|
a | b | c | |
263 | 2 | b | 22 |
248 | 5 | e | 55 |
339 | 2 | b | 22 |
160 | 5 | e | 55 |
505 | 2 | b | 22 |
311 | 5 | e | 55 |
199 | 2 | b | 22 |
232 | 5 | e | 55 |
410 | 2 | b | 22 |
125 | 5 | e | 55 |
350 | 2 | b | 22 |
572 | 5 | e | 55 |
29 | 2 | b | 22 |
166 | 5 | e | 55 |
68 | 2 | b | 22 |
270 | 5 | e | 55 |
273 | 2 | b | 22 |
416 | 5 | e | 55 |
11 | 2 | b | 22 |
492 | 5 | e | 55 |
85 | 2 | b | 22 |
198 | 5 | e | 55 |
306 | 2 | b | 22 |
364 | 5 | e | 55 |
589 | 2 | b | 22 |
220 | 5 | e | 55 |
467 | 2 | b | 22 |
341 | 5 | e | 55 |
269 | 2 | b | 22 |
190 | 5 | e | 55 |
158 | 2 | b | 22 |
354 | 5 | e | 55 |
... | ... | ... | ... |
514 | 5 | e | 55 |
The comparator methods on Vector return an object of type Daru::Core::Query::BoolArray
which lets us perform OR and AND operations on it. See this blog post for more information on BoolArray and other comparator methods.
As a convenience for readability, you can also you the #and
or #or
methods instead of #&
and #|
.
In [4]:
vector.where(vector.eq(2).or(vector.eq(5)).and(vector.mt(2)))
Out[4]:
Daru::Vector:18369620 size: 5 | |
---|---|
nil | |
2 | 5 |
4 | 5 |
7 | 5 |
11 | 5 |
13 | 5 |
One of the major advantages of using the where
clause over other more robust methods like DataFrame#filter
or Vector#keep_if
is that it is much faster (though not destructive). These benchmarks prove my point.
Performing joins between DataFrames¶
Daru::DataFrame offers the #join
method for performing SQL style joins between two DataFrames. Currently #join supports inner, left outer, right outer and full outer joins between DataFrames.
To demonstrate:
In [5]:
left = Daru::DataFrame.new({ :id => [1,2,3,4], :name => ['Pirate', 'Monkey', 'Ninja', 'Spaghetti'] }) right = Daru::DataFrame.new({ :id => [1,2,3,4], :name => ['Rutabaga', 'Pirate', 'Darth Vader', 'Ninja'] }) nil
To perform an inner join on the :name
column:
In [6]:
left.join(right, on: [:name], how: :inner)
Out[6]:
Daru::DataFrame:12529040 rows: 2 cols: 3 | |||
---|---|---|---|
id_1 | name | id_2 | |
0 | 1 | Pirate | 2 |
1 | 3 | Ninja | 4 |
An outer left join can be done with:
In [7]:
left.join(right, on: [:name], how: :left)
Out[7]:
Daru::DataFrame:12260160 rows: 4 cols: 3 | |||
---|---|---|---|
id_1 | name | id_2 | |
0 | 1 | Pirate | 2 |
1 | 2 | Monkey | |
2 | 3 | Ninja | 4 |
3 | 4 | Spaghetti |
An outer right join can be done like so:
In [8]:
left.join(right, on: [:name], how: :right)
Out[8]:
Daru::DataFrame:11720640 rows: 4 cols: 3 | |||
---|---|---|---|
id_1 | name | id_2 | |
0 | Rutabaga | 1 | |
1 | 1 | Pirate | 2 |
2 | Darth Vader | 3 | |
3 | 3 | Ninja | 4 |
And finally, a full outer join:
In [9]:
left.join(right, on: [:name], how: :outer)
Out[9]:
Daru::DataFrame:10555680 rows: 6 cols: 3 | |||
---|---|---|---|
id_1 | name | id_2 | |
0 | 1 | Pirate | 2 |
1 | 2 | Monkey | |
2 | 3 | Ninja | 4 |
3 | 4 | Spaghetti | |
4 | Rutabaga | 1 | |
5 | Darth Vader | 3 |