Usage of Daru::DataFrame
Daru::DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and vectors).
Arithmetic operations align on both row and vector labels. It can be thought of as a container for Daru::Vector objects, and it is the primary data structure used by daru and the gems that depend on it (like statsample).
You should use DataFrame because it lets you easily store, access and manipulate labelled data, plot it using an interactive graph library and perform various statistical operations while ignoring missing data.
Basic Creation and Access
Daru offers many options for creating DataFrames. You can create them from Hashes, Arrays, Daru::Vectors or even load them from CSV files, Excel spreadsheets or SQL databases.
From Array of Arrays
In the example below, I specify the vertical Vectors of the DataFrame as an Array of Arrays and name them with the :order option, which takes an Array of the names the vectors should be called by.
The :index option specifies the names of the rows of the DataFrame. If :index is not given, DataFrame will assign numerical indexes starting from 0 to each row.
In [2]:
df = Daru::DataFrame.new([[1,2,3,4], [1,2,3,4]],order: [:a, :b], index: [:one, :two, :three, :four])
Out[2]:
Daru::DataFrame:22605100 rows: 4 cols: 2 | ||
---|---|---|
| | a | b |
one | 1 | 1 |
two | 2 | 2 |
three | 3 | 3 |
four | 4 | 4 |
From Hash of Arrays
A similar DataFrame can be created from a Hash. In this case the keys of the Hash are the names of the vectors in the DataFrame. The :order option, if specified, only decides the order in which the Vectors appear in the DataFrame. Not specifying :order in this case will align the vectors alphabetically.
In [3]:
df = Daru::DataFrame.new({a: [1,2,3,4], b: [1,2,3,4]},order: [:b, :a])
Out[3]:
Daru::DataFrame:22188400 rows: 4 cols: 2 | ||
---|---|---|
| | b | a |
0 | 1 | 1 |
1 | 2 | 2 |
2 | 3 | 3 |
3 | 4 | 4 |
From Hash of Vectors
A DataFrame can be created from a Hash of Daru::Vectors and their names. The name of the vector will be the key and the corresponding value, a Daru::Vector.
The values of the DataFrame are aligned according to the index of each Daru::Vector. A nil is assigned whenever a particular index is not available for one Vector but is present in any of the other Vectors, and the resulting index of the DataFrame is a union of the indexes of all the Vectors in alphabetical order.
The sizes or indexes of the supplied Vectors don't matter.
In [4]:
v1 = Daru::Vector.new([1,2,3,4,5], index: [:a, :b, :c, :d, :e])
v2 = Daru::Vector.new([11,22,33,44], index: [:b, :e, :a, :absent])
Daru::DataFrame.new({v1: v1, v2: v2})
Out[4]:
Daru::DataFrame:21716520 rows: 6 cols: 2 | ||
---|---|---|
| | v1 | v2 |
a | 1 | 33 |
absent | | 44 |
b | 2 | 11 |
c | 3 | |
d | 4 | |
e | 5 | 22 |
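To make the alignment rule concrete, here is a minimal plain-Ruby sketch of it. It does not use daru at all: each Vector is stood in for by a Hash of label => value, and `align_by_index` is a hypothetical helper name of my own, not a daru method. It only illustrates the union-of-indexes and nil-filling behaviour described above.

```ruby
# Hypothetical helper (not part of daru): aligns any number of
# label => value Hashes on the sorted union of their labels.
def align_by_index(*vectors)
  index = vectors.flat_map(&:keys).uniq.sort
  # Hash#[] returns nil for a missing label, which models the nil that
  # is assigned when a Vector lacks an index present in another Vector.
  index.map { |label| [label, *vectors.map { |v| v[label] }] }
end

v1 = { a: 1, b: 2, c: 3, d: 4, e: 5 }
v2 = { b: 11, e: 22, a: 33, absent: 44 }

# Prints one [label, v1 value, v2 value] row per label.
align_by_index(v1, v2).each { |row| p row }
```

Because the labels are sorted as a whole, :absent lands between :a and :b, just as in the output table.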
The 'clone' option
If you have Vectors that have exactly the same index, you can specify the :clone option to DataFrame. Setting :clone to false directs daru to use the very Vector objects you supplied in the Hash when building the DataFrame, instead of cloning them. Thus the object IDs of the Vectors will remain the same.
Be wary of making changes in the DataFrame or the supplied vectors if you set :clone to false.
In [5]:
v1 = Daru::Vector.new([1,2,3,4,5])
v2 = Daru::Vector.new([11,22,33,44,55])
df = Daru::DataFrame.new({a: v1, b: v2}, clone: false)
puts "equalness a : #{v1.object_id == df[:a].object_id}\nequalness b : #{v2.object_id == df[:b].object_id}"
equalness a : true
equalness b : true
Creating with rows
If you want to create a DataFrame by specifying the rows, you can do so by passing an Array of Arrays or an Array of Vectors to the .rows method.
Let's first look at creating a DataFrame from an Array of Arrays:
In [6]:
Daru::DataFrame.rows([
  [1,11,10,'a'],
  [2,22,20 ,4 ],
  [3,33,30,'g'],
  [4,44,40, 3 ]
], order: [:a, :b, :c, :d])
Out[6]:
Daru::DataFrame:20876660 rows: 4 cols: 4 | ||||
---|---|---|---|---|
| | a | b | c | d |
0 | 1 | 11 | 10 | a |
1 | 2 | 22 | 20 | 4 |
2 | 3 | 33 | 30 | g |
3 | 4 | 44 | 40 | 3 |
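Conceptually, building from rows is a transposition: every inner Array is one row, and the n-th element of each row goes to the n-th name in :order. A rough plain-Ruby sketch of that idea follows; `columns_from_rows` is a hypothetical helper of mine, not something daru exposes, and daru's real implementation may well differ.

```ruby
# Hypothetical helper (not part of daru): turns an Array of rows into a
# Hash of column name => column values.
def columns_from_rows(rows, order)
  # Array#transpose flips rows into columns; zip pairs each name in
  # `order` with the column at the same position.
  order.zip(rows.transpose).to_h
end

rows = [
  [1, 11, 10, 'a'],
  [2, 22, 20, 4],
  [3, 33, 30, 'g'],
  [4, 44, 40, 3]
]

p columns_from_rows(rows, [:a, :b, :c, :d])
```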
If you supply an Array of Vectors to the .rows method, the index of the Vectors will be automatically assigned as the names of the vectors of the DataFrame. Moreover, elements will be aligned by their indexes in the completed DataFrame.
If a Vector does not have a particular index that is present in other Vectors, a nil will be placed in that position.
The :order option should be set in this case to whatever values you want to keep in your DataFrame to avoid unexpected behaviour.
In [7]:
r1 = Daru::Vector.new([1,2,3,4,5], index: [:a, :b, :c, :d, :e])
r2 = Daru::Vector.new([11,22,33,44,55], index: [:a, :c, :e, :b, :odd])
Daru::DataFrame.rows([r1,r2], order: [:a, :b, :c, :d, :odd])
Out[7]:
Daru::DataFrame:20467260 rows: 2 cols: 5 | |||||
---|---|---|---|---|---|
| | a | b | c | d | odd |
0 | 1 | 2 | 3 | 4 | |
1 | 11 | 44 | 22 | | 55 |
Loading data from different data sources
Daru::DataFrame currently supports loading data from CSV files, Excel spreadsheets and SQL databases. You can also write your DataFrames to these kinds of files using some simple functions. Daru also supports saving and loading data by Marshalling. Let's go through them one by one.
CSV (Comma Separated Values) files
To demonstrate loading and writing to CSV files, we'll read some sales data from this CSV file.
In [8]:
Daru::DataFrame.from_csv 'data/sales-funnel.csv'
Out[8]:
Daru::DataFrame:18079560 rows: 17 cols: 8 | ||||||||
---|---|---|---|---|---|---|---|---|
| | Account | Manager | Name | Price | Product | Quantity | Rep | Status |
0 | 714466 | Debra Henley | Trantow-Barrows | 30000 | CPU | 1 | Craig Booker | presented |
1 | 714466 | Debra Henley | Trantow-Barrows | 10000 | Software | 1 | Craig Booker | presented |
2 | 714466 | Debra Henley | Trantow-Barrows | 5000 | Maintenance | 2 | Craig Booker | pending |
3 | 737550 | Debra Henley | Fritsch, Russel and Anderson | 35000 | CPU | 1 | Craig Booker | declined |
4 | 146832 | Debra Henley | Kiehn-Spinka | 65000 | CPU | 2 | Daniel Hilton | won |
5 | 218895 | Debra Henley | Kulas Inc | 40000 | CPU | 2 | Daniel Hilton | pending |
6 | 218895 | Debra Henley | Kulas Inc | 10000 | Software | 1 | Daniel Hilton | presented |
7 | 412290 | Debra Henley | Jerde-Hilpert | 5000 | Maintenance | 2 | John Smith | pending |
8 | 740150 | Debra Henley | Barton LLC | 35000 | CPU | 1 | John Smith | declined |
9 | 141962 | Fred Anderson | Herman LLC | 65000 | CPU | 2 | Cedric Moss | won |
10 | 163416 | Fred Anderson | Purdy-Kunde | 30000 | CPU | 1 | Cedric Moss | presented |
11 | 239344 | Fred Anderson | Stokes LLC | 5000 | Maintenance | 1 | Cedric Moss | pending |
12 | 239344 | Fred Anderson | Stokes LLC | 10000 | Software | 1 | Cedric Moss | presented |
13 | 307599 | Fred Anderson | Kassulke, Ondricka and Metz | 7000 | Maintenance | 3 | Wendy Yule | won |
14 | 688981 | Fred Anderson | Keeling LLC | 100000 | CPU | 5 | Wendy Yule | won |
15 | 729833 | Fred Anderson | Koepp Ltd | 65000 | CPU | 2 | Wendy Yule | declined |
16 | 729833 | Fred Anderson | Koepp Ltd | 5000 | Monitor | 2 | Wendy Yule | presented |
You can pass the .from_csv function all the options that you would pass to Ruby's CSV.read() function, since this is what is used internally.
For example, if the columns in your CSV file are separated by something other than commas, you can use the :col_sep option. If you want to convert numeric values to numbers and not keep them as strings, you can use the :converters option and set it to :numeric.
The .from_csv function uses the following defaults for reading CSV files (that are passed into the CSV.read() function):
{ :col_sep => ',', :converters => :numeric }
The #write_csv function is used for writing the contents of a DataFrame to a CSV file.
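Since the options are handed to Ruby's CSV library, their effect can be tried out with CSV alone. The sketch below is an illustration under my own assumptions, not daru's code: `parse_like_from_csv` is a hypothetical helper, the sample text is made up, and `headers: true` is my own addition so that columns can be looked up by name.

```ruby
require 'csv'

# Hypothetical helper (not part of daru): applies the defaults listed
# above, lets the caller override them, and parses CSV text.
def parse_like_from_csv(text, **opts)
  defaults = { col_sep: ',', converters: :numeric }
  CSV.parse(text, headers: true, **defaults.merge(opts))
end

# Made-up sample data that uses semicolons instead of commas.
text = "Product;Price;Quantity\nCPU;30000;1\nMonitor;4999.5;2\n"
table = parse_like_from_csv(text, col_sep: ';')

p table['Price']    # numeric fields are converted to Integer/Float
p table['Product']  # non-numeric fields stay Strings
```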
Excel Files
The ::from_excel method can be used for loading Excel files. The spreadsheet gem is used in the background in this case, so whatever variants of Excel-compatible files spreadsheet can load should be easily loadable here too.
Let me demonstrate this using this Excel file.
In [9]:
df = Daru::DataFrame.from_excel 'data/test_xls.xls'
Out[9]:
Daru::DataFrame:16647660 rows: 6 cols: 5 | |||||
---|---|---|---|---|---|
| | id | name | age | city | a1 |
0 | 1 | Alex | 20 | New York | a,b |
1 | 2 | Claude | 23 | London | b,c |
2 | 3 | Peter | 25 | London | a |
3 | 4 | Franz | | Paris | |
4 | 5 | George | 5.5 | Tome | a,b,c |
5 | 6 | Fernand | | | |
Likewise, the #write_excel method can be used for writing data stored in the DataFrame to an Excel file.
SQL Databases
Similar to the examples above, you can use the ::from_sql and #write_sql methods for interacting with SQL databases.
Plaintext Files
In case your data is stored as columns in plaintext (for example this file), you can use the ::from_plaintext method for loading data from the file.
Querying and accessing data
Daru::DataFrame consists of rows and vectors, both of which can be accessed by their labels using an intuitive syntax.
Consider the following DataFrame:
In [10]:
df = Daru::DataFrame.new({ a: [1,2,3,4,5,6,7], b: ['a','b','c','d','e','f','g'], c: [11,22,33,44,55,66,77] }, index: [:a,:b,:c,:d,:e,:f,:g])
Out[10]:
Daru::DataFrame:14984040 rows: 7 cols: 3 | |||
---|---|---|---|
| | a | b | c |
a | 1 | a | 11 |
b | 2 | b | 22 |
c | 3 | c | 33 |
d | 4 | d | 44 |
e | 5 | e | 55 |
f | 6 | f | 66 |
g | 7 | g | 77 |
You can access any Vector using the #[] operator. The resultant Vector is returned as a Daru::Vector which preserves the index of the DataFrame.
In [11]:
df[:b]
Out[11]:
Daru::Vector:14980940 size: 7 | |
---|---|
| | b |
a | a |
b | b |
c | c |
d | d |
e | e |
f | f |
g | g |
You can also specify a Range inside #[] to return a DataFrame which contains the columns within the Range.
In [12]:
df[:b..:c]
Out[12]:
Daru::DataFrame:14029820 rows: 7 cols: 2 | ||
---|---|---|
| | b | c |
a | a | 11 |
b | b | 22 |
c | c | 33 |
d | d | 44 |
e | e | 55 |
f | f | 66 |
g | g | 77 |
A row can be accessed using the #row[] method. The row is also returned as a Daru::Vector, so any operation that is valid on a Daru::Vector is valid on the row too.
The index of the returned row corresponds to the names of the Vectors.
In [13]:
df.row[:c]
Out[13]:
Daru::Vector:13588820 size: 3 | |
---|---|
| | c |
a | 3 |
b | c |
c | 33 |
Here too, you can specify a Range, in which case you will receive a Daru::DataFrame containing the rows covered by the Range instead of a Daru::Vector.
In [14]:
df.row[:d..:f]
Out[14]:
Daru::DataFrame:24490780 rows: 3 cols: 3 | |||
---|---|---|---|
| | a | b | c |
d | 4 | d | 44 |
e | 5 | e | 55 |
f | 6 | f | 66 |
Rows can be accessed using numerical indices too (this works for columns too).
In [15]:
df.row[3]
Out[15]:
Daru::Vector:24061940 size: 3 | |
---|---|
| | 3 |
a | 4 |
b | d |
c | 44 |
You can get the top 3 rows by passing an argument to the #head method (or the bottom 3 using #tail).
In [16]:
df.head(3)
Out[16]:
Daru::DataFrame:23701640 rows: 3 cols: 3 | |||
---|---|---|---|
| | a | b | c |
a | 1 | a | 11 |
b | 2 | b | 22 |
c | 3 | c | 33 |
Filtering, selecting, adding and deleting data
A column can be added by simply specifying its name and value using the #[]= operator.
In [17]:
df[:d] = df[:a] * df[:c]
df
Out[17]:
Daru::DataFrame:14984040 rows: 7 cols: 4 | ||||
---|---|---|---|---|
| | a | b | c | d |
a | 1 | a | 11 | 11 |
b | 2 | b | 22 | 44 |
c | 3 | c | 33 | 99 |
d | 4 | d | 44 | 176 |
e | 5 | e | 55 | 275 |
f | 6 | f | 66 | 396 |
g | 7 | g | 77 | 539 |
You can delete a vector with the #delete_vector method.
In [18]:
df.delete_vector :b
Out[18]:
Daru::DataFrame:14984040 rows: 7 cols: 3 | |||
---|---|---|---|
| | a | c | d |
a | 1 | 11 | 11 |
b | 2 | 22 | 44 |
c | 3 | 33 | 99 |
d | 4 | 44 | 176 |
e | 5 | 55 | 275 |
f | 6 | 66 | 396 |
g | 7 | 77 | 539 |
If you try to insert a Daru::Vector that does not conform to the index of the DataFrame, the values will be appropriately placed such that they conform to the DataFrame's index.
nil is inserted wherever a similar index cannot be found on the DataFrame.
Inserting an Array will require the Array to be of the same length as that of the DataFrame.
In [19]:
df[:b] = Daru::Vector.new(['a',33,'b','c','d',88,'e'], index: [:a,:c,:d,:b,:e,:f,:extra])
df
Out[19]:
Daru::DataFrame:14984040 rows: 7 cols: 4 | ||||
---|---|---|---|---|
| | a | c | d | b |
a | 1 | 11 | 11 | a |
b | 2 | 22 | 44 | c |
c | 3 | 33 | 99 | 33 |
d | 4 | 44 | 176 | b |
e | 5 | 55 | 275 | d |
f | 6 | 66 | 396 | 88 |
g | 7 | 77 | 539 | |
Inserting a row also works similarly.
In [20]:
df.row[:latest] = Daru::Vector.new([10,20,30,40], index: [:c,:b,:a,:d])
df
Out[20]:
Daru::DataFrame:14984040 rows: 8 cols: 4 | ||||
---|---|---|---|---|
| | a | c | d | b |
a | 1 | 11 | 11 | a |
b | 2 | 22 | 44 | c |
c | 3 | 33 | 99 | 33 |
d | 4 | 44 | 176 | b |
e | 5 | 55 | 275 | d |
f | 6 | 66 | 396 | 88 |
g | 7 | 77 | 539 | |
latest | 30 | 10 | 40 | 20 |
In both row and vector insertion, if the specified index is not present in the DataFrame, a new index is created and appended; if it is present, the existing entry is overwritten.
For filtering out certain rows/vectors based on their values, use the #filter method. By default it iterates over vectors and keeps those vectors for which the block returns true. It accepts an optional axis argument which lets you specify whether you want to iterate over vectors or rows.
In [21]:
# Filter vectors.
# The `type` method returns either :numeric or :object. The :numeric type states
# that the Vector consists only of numerical data (combined with missing data).
# If the type happens to be :object, it contains non-numerical data like strings
# or symbols. Statistical operations will not be possible on Vectors of type :object.
df.filter do |vector|
  vector.type == :numeric and vector.median < 50
end
Out[21]:
Daru::DataFrame:20876140 rows: 8 cols: 2 | ||
---|---|---|
| | a | c |
a | 1 | 11 |
b | 2 | 22 |
c | 3 | 33 |
d | 4 | 44 |
e | 5 | 55 |
f | 6 | 66 |
g | 7 | 77 |
latest | 30 | 10 |
In [22]:
# Filter rows
df.filter(:row) do |row|
  row[:a] + row[:d] < 100
end
Out[22]:
Daru::DataFrame:20409180 rows: 3 cols: 4 | ||||
---|---|---|---|---|
| | a | c | d | b |
a | 1 | 11 | 11 | a |
b | 2 | 22 | 44 | c |
latest | 30 | 10 | 40 | 20 |
A DataFrame can be transposed using the #transpose method.
In [23]:
df.transpose
Out[23]:
Daru::DataFrame:18063520 rows: 4 cols: 8 | ||||||||
---|---|---|---|---|---|---|---|---|
| | a | b | c | d | e | f | g | latest |
a | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 30 |
c | 11 | 22 | 33 | 44 | 55 | 66 | 77 | 10 |
d | 11 | 44 | 99 | 176 | 275 | 396 | 539 | 40 |
b | a | c | 33 | b | d | 88 | | 20 |
Arithmetic
All arithmetic operations can be performed on a Daru::DataFrame, and you can combine a DataFrame with another DataFrame, a Vector or a scalar.
Indexes are aligned appropriately whenever an operation is performed with a non-scalar quantity.
With a Scalar
Adding a scalar quantity will add that number to all the numeric type vectors, keeping :object type Vectors the way they originally were.
In [24]:
df + 10
Out[24]:
Daru::DataFrame:17731620 rows: 8 cols: 4 | ||||
---|---|---|---|---|
| | a | c | d | b |
a | 11 | 21 | 21 | a |
b | 12 | 32 | 54 | c |
c | 13 | 43 | 109 | 33 |
d | 14 | 54 | 186 | b |
e | 15 | 65 | 285 | d |
f | 16 | 76 | 406 | 88 |
g | 17 | 87 | 549 | |
latest | 40 | 20 | 50 | 20 |
With another DataFrame
Performing arithmetic between two DataFrames aligns the elements by the row and column indexes of both.
If a column is present in one DataFrame but not in the other, the resulting DataFrame will contain a column of that name filled with nils.
DataFrames need not be of the same size for this operation to succeed.
In [25]:
df1 = Daru::DataFrame.new({ a: 7.times.map { rand(100) }, f: 7.times.map { rand(100) }, c: 7.times.map { rand(100) } }, index: [:a,:b,:c,:d,:latest,:older,:f])
df1 + df
Out[25]:
Daru::DataFrame:16665280 rows: 9 cols: 5 | |||||
---|---|---|---|---|---|
| | a | b | c | d | f |
a | 69 | | 32 | | |
b | 72 | | 56 | | |
c | 38 | | 108 | | |
d | 26 | | 47 | | |
e | | | | | |
f | 84 | | 101 | | |
g | | | | | |
latest | 73 | | 31 | | |
older | | | | | |
Statistics
Statistical methods perform basic statistics on numerical Vectors only.
For the whole list of methods see the Daru::Maths::Statistics::DataFrame module in the docs.
To demonstrate, the #mean method calculates the mean of each numeric vector and returns a Daru::Vector with each vector's name as the index along with the corresponding value.
In [26]:
df.mean
Out[26]:
Daru::Vector:14533320 size: 3 | |
---|---|
| | mean |
a | 7.25 |
c | 39.75 |
d | 197.5 |
The #describe method can be used for getting various statistics in one shot.
In [27]:
df.describe
Out[27]:
Daru::DataFrame:14352440 rows: 5 cols: 3 | |||
---|---|---|---|
| | a | c | d |
count | 8 | 8 | 8 |
mean | 7.25 | 39.75 | 197.5 |
std | 9.40744386111339 | 25.06990227344335 | 190.99214643539665 |
min | 1 | 10 | 11 |
max | 30 | 77 | 539 |
#cov will return a covariance matrix of the DataFrame, and it will be properly indexed so you can see the data clearly.
In [28]:
df.cov
Out[28]:
Daru::DataFrame:13991820 rows: 3 cols: 3 | |||
---|---|---|---|
| | a | c | d |
a | 88.5 | -66.5 | -233.0 |
c | -66.5 | 628.5 | 4637.0 |
d | -233.0 | 4637.0 | 36478.0 |
Likewise, #corr computes the correlation matrix.
In [29]:
df.corr
Out[29]:
Daru::DataFrame:12502180 rows: 3 cols: 3 | |||
---|---|---|---|
| | a | c | d |
a | 1.0 | -0.28196640612394586 | -0.12967873822641748 |
c | -0.28196640612394586 | 0.9999999999999998 | 0.9684315851062977 |
d | -0.12967873822641748 | 0.9684315851062977 | 1.0 |
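For readers who want to see what these matrices contain, the following plain-Ruby sketch recomputes a few of the entries above from the a and c columns of df. It assumes sample statistics (dividing by n - 1), which is consistent with the numbers shown in the outputs. The helper names are hypothetical, and this is not how daru itself is implemented.

```ruby
# Hypothetical helpers (not daru methods).

# Mean that skips missing (nil) values.
def sample_mean(values)
  present = values.compact
  present.sum.to_f / present.size
end

# Sample covariance; assumes both Arrays are complete (no nils)
# and of equal length.
def sample_covariance(xs, ys)
  mx = sample_mean(xs)
  my = sample_mean(ys)
  xs.zip(ys).sum { |x, y| (x - mx) * (y - my) } / (xs.size - 1)
end

# Pearson correlation derived from the covariances.
def sample_correlation(xs, ys)
  sample_covariance(xs, ys) /
    Math.sqrt(sample_covariance(xs, xs) * sample_covariance(ys, ys))
end

# The a and c columns of df, copied from the tables above.
a = [1, 2, 3, 4, 5, 6, 7, 30]
c = [11, 22, 33, 44, 55, 66, 77, 10]

puts sample_mean(a)                      # mean of a, as in #mean
puts sample_covariance(a, c)             # the a/c entry of #cov
puts Math.sqrt(sample_covariance(a, a))  # std of a, as in #describe
puts sample_correlation(a, c)            # the a/c entry of #corr
```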
You can use report builder to create a quick summary of the DataFrame using the #summary method.
In [30]:
puts df.summary
= 7ebe63b4-aa3b-42f4-a0d1-c5b7d6813b77
Number of rows: 8
Element:[a]
== a
n :8
n valid:8
median: 4.5
mean: 7.2500
std.dev.: 9.4074
std.err.: 3.3260
skew: 1.6908
kurtosis: 1.3190
Element:[c]
== c
n :8
n valid:8
median: 38.5
mean: 39.7500
std.dev.: 25.0699
std.err.: 8.8635
skew: 0.1381
kurtosis: -1.7271
Element:[d]
== a
n :8
n valid:8
median: 137.5
mean: 197.5000
std.dev.: 190.9921
std.err.: 67.5259
skew: 0.5945
kurtosis: -1.3406
Element:[b]
== b
n :8
n valid:7
factors: a,c,33,b,d,88,20
mode: a
Distribution
+----+---+--------+
| a  | 1 | 14.29% |
| b  | 1 | 14.29% |
| c  | 1 | 14.29% |
| d  | 1 | 14.29% |
| 20 | 1 | 14.29% |
| 33 | 1 | 14.29% |
| 88 | 1 | 14.29% |
+----+---+--------+
Looping and iterators
Daru::DataFrame offers many iterators to loop over either rows or columns.
#each
#each works exactly like Array#each. The default mode for each is to iterate over the columns of the DataFrame. To iterate over rows you must pass the axis, i.e. :row, as an argument.
In [31]:
# Iterate over vectors
e = []
df.each do |vector|
  e << vector[:a].to_s + vector[:latest].to_s
end
puts e
["130", "1110", "1140", "a20"]
In [32]:
# Iterate over rows
r = []
df.each(:row) do |row|
  r << row[:a] * row[:c]
end
puts r
[11, 44, 99, 176, 275, 396, 539, 300]
#map
The #map iterator works like Array#map. The value returned by each run of the block is added to an Array and the Array is returned.
This method also accepts an axis argument, like #each. The default is :vector.
In [33]:
# Map over vectors.
# The `only_numerics` method returns a DataFrame which contains vectors
# with only numerical values. Setting the `:clone` option to false will
# return the same Vector objects that are contained in the original DataFrame.
df.only_numerics(clone: false).map do |vector|
  vector.mean
end
In [34]:
# Map over rows.
# Calling `only_numerics` on a Daru::Vector will return a Vector with only numeric and
# missing data. Data marked as 'missing' is not considered during statistical computation.
df.map(:row) do |row|
  row.only_numerics.mean
end
Out[34]:
[7.666666666666667, 22.666666666666668, 42.0, 74.66666666666667, 111.66666666666667, 139.0, 207.66666666666666, 25.0]
#recode
#recode works similarly to #map, but an important difference between the two is that recode returns a modified Daru::DataFrame instead of an Array. For this reason, #recode expects every run of the block to return a Daru::Vector.
Just like map and each, recode also accepts an optional axis argument.
In [35]:
# Recode vectors
df.only_numerics(clone: false).recode do |vector|
  vector[:a] = vector[:d] + vector[:c]
  vector[:b] = vector.mean + vector[:a]
  vector # <- return the vector to the block
end
Out[35]:
Daru::DataFrame:22133080 rows: 8 cols: 3 | |||
---|---|---|---|
| | a | c | d |
a | 7 | 77 | 275 |
b | 15.0 | 125.0 | 505.5 |
c | 3 | 33 | 99 |
d | 4 | 44 | 176 |
e | 5 | 55 | 275 |
f | 6 | 66 | 396 |
g | 7 | 77 | 539 |
latest | 30 | 10 | 40 |
In [36]:
# Recode rows
df.recode(:row) do |row|
  row[:a] = row[:c] - row[:d]
  row[:b] = row[:b].to_i if row[:b].is_a?(String)
  row
end
Out[36]:
Daru::DataFrame:21467720 rows: 8 cols: 4 | ||||
---|---|---|---|---|
| | a | c | d | b |
a | 0 | 11 | 11 | 0 |
b | -22 | 22 | 44 | 0 |
c | -66 | 33 | 99 | 33 |
d | -132 | 44 | 176 | 0 |
e | -220 | 55 | 275 | 0 |
f | -330 | 66 | 396 | 88 |
g | -462 | 77 | 539 | |
latest | -30 | 10 | 40 | 20 |
#collect
The #collect iterator works similarly to #map, the only difference being that it returns a Daru::Vector comprising the results of each block run. The resultant Vector has the same index as that of the axis over which collect has iterated.
It also accepts the optional axis argument.
In [37]:
# Collect Vectors
df.collect do |vector|
  vector[:c] + vector[:f]
end
Out[37]:
Daru::Vector:20466840 size: 4 | |
---|---|
| | nil |
a | 9 |
c | 99 |
d | 495 |
b | 121 |
In [38]:
# Collect Rows
df.collect(:row) do |row|
  row[:a] + row[:d] - row[:c]
end
Out[38]:
Daru::Vector:20062900 size: 8 | |
---|---|
| | nil |
a | 1 |
b | 24 |
c | 69 |
d | 136 |
e | 225 |
f | 336 |
g | 469 |
latest | 60 |
#vector_by_calculation
#vector_by_calculation is a DSL that can be used for generating a Daru::Vector based on the results returned by the block.
This DSL lets you refer to elements directly as methods inside the block.
In [39]:
df.vector_by_calculation { a + c + d }
Out[39]:
Daru::Vector:17919800 size: 8 | |
---|---|
| | nil |
a | 23 |
b | 68 |
c | 135 |
d | 224 |
e | 335 |
f | 468 |
g | 623 |
latest | 80 |
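If you are curious how a block can refer to vectors as bare names, one common Ruby technique is to evaluate the block with instance_eval against an object that answers to those names. The sketch below shows that technique on plain Arrays. It is only a guess at the general idea, not daru's actual implementation, and `vector_by_calculation_sketch` is a hypothetical name.

```ruby
# Hypothetical helper (not part of daru): evaluates the block once per
# row, exposing each column's value for that row as a method named
# after the column.
def vector_by_calculation_sketch(columns, &block)
  size = columns.values.first.size
  (0...size).map do |i|
    context = Object.new
    columns.each do |name, values|
      context.define_singleton_method(name) { values[i] }
    end
    # Inside the block, bare names such as `a` resolve to the
    # singleton methods defined on `context`.
    context.instance_eval(&block)
  end
end

# First three rows of the a, c and d vectors from the tables above.
columns = { a: [1, 2, 3], c: [11, 22, 33], d: [11, 44, 99] }

p vector_by_calculation_sketch(columns) { a + c + d }
```

For these three rows the block yields the same values as the first three entries of the output above.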
Sorting
Daru::DataFrame offers a robust #sort function which can be used for hierarchically sorting the Vectors in the DataFrame.
Here are a couple of examples to demonstrate a lot of the options:
In [40]:
df = Daru::DataFrame.new({ a: ['g', 'g','g','sort', 'this'], b: [4,4,335,32,11], c: ['This', 'dataframe','is','for','sorting'] })
Out[40]:
Daru::DataFrame:17606280 rows: 5 cols: 3 | |||
---|---|---|---|
| | a | b | c |
0 | g | 4 | This |
1 | g | 4 | dataframe |
2 | g | 335 | is |
3 | sort | 32 | for |
4 | this | 11 | sorting |
The Array passed as an argument to #sort tells the method the order of priority in which the Vectors are sorted.
The :ascending option tells DataFrame the direction in which each Vector should be sorted: true for an ascending sort and false for a descending sort.
The :by option lets you define a custom attribute for each vector to sort by. This works similarly to passing a block to Array#sort_by.
In [41]:
df.sort([:a,:b,:c], ascending: [true, false, true], by: {c: lambda { |a| a.size }})
Out[41]:
Daru::DataFrame:17102340 rows: 5 cols: 3 | |||
---|---|---|---|
| | a | b | c |
2 | g | 335 | is |
0 | g | 4 | This |
1 | g | 4 | dataframe |
3 | sort | 32 | for |
4 | this | 11 | sorting |
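The ordering in this example can be reproduced in plain Ruby by sorting on a composite key, one entry per vector in priority order. This is just an illustrative sketch under my own assumptions: the helper `hierarchical_order` is hypothetical, and negating a value to get descending order only works for numeric vectors. daru's #sort is more general.

```ruby
# Hypothetical helper (not part of daru): returns row positions ordered
# by a ascending, then b descending, then the length of c ascending.
def hierarchical_order(a, b, c)
  # Arrays compare element by element, so earlier key entries take
  # priority and later ones only break ties.
  a.each_index.sort_by { |i| [a[i], -b[i], c[i].size] }
end

a = ['g', 'g', 'g', 'sort', 'this']
b = [4, 4, 335, 32, 11]
c = ['This', 'dataframe', 'is', 'for', 'sorting']

p hierarchical_order(a, b, c)  # row positions in sorted order
```

The positions printed here correspond to the row labels in the leftmost column of the sorted output.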
Additional examples
Sort a dataframe with a vector sequence.
In [42]:
df = Daru::DataFrame.new({a: [1,2,1,2,3], b: [5,4,3,2,1]})
df.sort [:a, :b]
Out[42]:
Daru::DataFrame:15834560 rows: 5 cols: 2 | ||
---|---|---|
| | a | b |
2 | 1 | 3 |
0 | 1 | 5 |
3 | 2 | 2 |
1 | 2 | 4 |
4 | 3 | 1 |
Sort a dataframe without a block. Here nils will be handled automatically and appear at the top.
In [43]:
df = Daru::DataFrame.new({a: [-3,nil,-1,nil,5], b: [4,3,2,1,4]})
df.sort([:a])
Out[43]:
Daru::DataFrame:15003920 rows: 5 cols: 2 | ||
---|---|---|
| | a | b |
1 | | 3 |
3 | | 1 |
0 | -3 | 4 |
2 | -1 | 2 |
4 | 5 | 4 |
Sort a dataframe with a block with nils handled automatically.
In [44]:
df = Daru::DataFrame.new({a: [nil,-1,1,nil,-1,1], b: ['aaa','aa',nil,'baaa','x',nil] })
# df.sort [:b], by: {b: lambda { |a| a.length } }
# This would give "NoMethodError: undefined method `length' for nil:NilClass"
# Instead you could do the following if you want the nils to be handled automatically
df.sort [:b], by: {b: lambda { |a| a.length } }, handle_nils: true
Out[44]:
Daru::DataFrame:14432560 rows: 6 cols: 2 | ||
---|---|---|
| | a | b |
2 | 1 | |
5 | 1 | |
4 | -1 | x |
1 | -1 | aa |
0 | | aaa |
3 | | baaa |
Sort a dataframe with a block with nils handled manually.
In [45]:
df = Daru::DataFrame.new({a: [nil,-1,1,nil,-1,1], b: ['aaa','aa',nil,'baaa','x',nil] })
# To print nils at the bottom one can use lambda { |a| (a.nil?)?[1]:[0,a.length] }
df.sort [:b], by: {b: lambda { |a| (a.nil?)?[1]:[0,a.length] } }, handle_nils: true
Out[45]:
Daru::DataFrame:14040080 rows: 6 cols: 2 | ||
---|---|---|
| | a | b |
4 | -1 | x |
1 | -1 | aa |
0 | | aaa |
3 | | baaa |
2 | 1 | |
5 | 1 | |
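The trick in the lambda above is that Arrays compare element by element, so putting a flag in the first position decides where the nils go before the lengths are even looked at. Here is a small plain-Ruby sketch of the same idea. `order_with_nils_last` is a hypothetical helper, and the row position is appended to the key only because Ruby's sort_by is not guaranteed to be stable.

```ruby
# Hypothetical helper (not part of daru): returns row positions sorted
# by string length, with nil entries pushed to the bottom.
def order_with_nils_last(values)
  values.each_index.sort_by do |i|
    v = values[i]
    # Leading 1 sorts after leading 0, so nils come last; the trailing
    # position keeps rows with equal keys in their original order.
    v.nil? ? [1, 0, i] : [0, v.length, i]
  end
end

b = ['aaa', 'aa', nil, 'baaa', 'x', nil]

p order_with_nils_last(b)  # row positions in sorted order
```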