
A Glimpse of Daru::Vector

In daru, a Daru::Vector is a one-dimensional array with axis labels.

Labels should be unique. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index. Statistical methods automatically exclude missing data (currently represented by default as nil).

Operations between Vectors (+, -, *, /, **) align values based on their associated index values. The Vectors need not be the same length. The result index will be the sorted union of the two indexes.

Daru::Vector is similar to pandas.Series.

The examples below demonstrate how a simple Daru::Vector can be created and its data viewed.

The most basic way to create a Vector is to pass an Array of values to the constructor.

Index labels can be specified with the :index option, and the Vector can be given a name with the :name option. If :index isn't specified, the Vector is assigned an integer index starting from 0.

In [2]:

a = Daru::Vector.new([1,2,3,4,5], index: [:a, :b, :c, :d, :e], name: :bazinga)

Out[2]:

Daru::Vector:30420060 size: 5
bazinga
a 1
b 2
c 3
d 4
e 5

Values can be accessed using their labels with the #[] operator.

You can also pass a range of labels to select several values at once.

Out[4]:

Daru::Vector:29850660 size: 3
bazinga
b 2
c 3
d 4
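The label and label-range lookups described above can be illustrated in plain Ruby. This is only a sketch of the semantics (an inclusive range resolved by position in the index), not daru's implementation; `slice_by_labels` is a hypothetical helper name and no gems are needed.

```ruby
# Plain-Ruby sketch of label-based access; not daru's internals.
# `slice_by_labels` is a hypothetical helper introduced for illustration.
def slice_by_labels(index, values, first, last)
  from = index.index(first)
  to   = index.index(last)
  raise IndexError, "unknown label" if from.nil? || to.nil?

  # Both ends of the label range are included, as in the output above.
  index[from..to].zip(values[from..to]).to_h
end

index  = [:a, :b, :c, :d, :e]
values = [1, 2, 3, 4, 5]

# Single label: find the label's position, then read the value there.
puts values[index.index(:b)] # prints 2

# Label range :b..:d keeps the labels attached to the selected values.
slice_by_labels(index, values, :b, :d).each do |label, value|
  puts "#{label} #{value}"
end
```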

Values can be assigned with the #[]= operator.

Out[5]:

Daru::Vector:30420060 size: 5
bazinga
a 1
b 999
c 3
d 4
e 5

If you want to treat values apart from nil as missing, you can specify them using the :missing_values option.

The #only_valid method can then be used for obtaining all the non-missing values of the Vector. Notice that only_valid preserves the indexes (labels) of the data.

In [6]:

a = Daru::Vector.new([1,2,3,5,5,4,6,nil,nil], missing_values: [5,nil])
a.only_valid

Out[6]:

Daru::Vector:29284260 size: 5
nil
0 1
1 2
2 3
5 4
6 6
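To make the "preserves the indexes" point concrete, here is a plain-Ruby sketch of the same filtering on the data from the cell above. It assumes a default integer index and is not how daru implements #only_valid; `only_valid_sketch` is a hypothetical name.

```ruby
# Plain-Ruby sketch: drop missing values but keep each survivor's
# original position as its label. Not daru's implementation.
def only_valid_sketch(values, missing_values)
  values.each_with_index
        .reject { |value, _position| missing_values.include?(value) }
        .map { |value, position| [position, value] }
        .to_h
end

data  = [1, 2, 3, 5, 5, 4, 6, nil, nil]
valid = only_valid_sketch(data, [5, nil])

# Prints the same label/value pairs as the output above:
# labels 0, 1, 2, 5 and 6 survive; 3, 4, 7 and 8 are dropped.
valid.each { |position, value| puts "#{position} #{value}" }
```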

The Vector.[] class method creates a vector from almost any object that has a #to_a method defined on it. It is similar to R's c method.

In [7]:

b = Daru::Vector[1,2,3,4,6..10]

Out[7]:

Daru::Vector:28825140 size: 9
nil
0 1
1 2
2 3
3 4
4 6
5 7
6 8
7 9
8 10
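The flattening behaviour shown above (plain values and a Range combined into one sequence) can be approximated in plain Ruby. This is a sketch under the assumption that anything responding to #to_a is expanded in place; daru's actual rules may differ, and `combine` is a hypothetical name.

```ruby
# Plain-Ruby sketch of combining scalars and to_a-capable objects into
# one flat list. An approximation, not daru's implementation.
def combine(*items)
  items.flat_map { |item| item.respond_to?(:to_a) ? item.to_a : [item] }
end

flat = combine(1, 2, 3, 4, 6..10)
puts flat.size # prints 9
puts flat.last # prints 10
```

Note that in plain Ruby `nil` and Hash also respond to #to_a, so this sketch would expand them too.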

The new_with_size class method lets you create a Daru::Vector by specifying the size as the argument. The optional block, if supplied, is run once for populating each element in the Vector.

The result of each run of the block is the value that is ultimately assigned to that position in the Vector.

In [8]:

a = Daru::Vector.new_with_size(1000, name: :new_vector) { r = rand(5); r == 4 ? nil : r }

Out[8]:

Daru::Vector:28500640 size: 1000
new_vector
0 2
1 3
2 0
3
4
5 2
6 2
7 2
8 1
9 1
10 3
11 0
12 2
13 0
14 3
15 1
16 3
17 1
18 3
19 0
20 1
21 1
22 2
23 2
24 2
25 3
26 3
27 2
28 0
29
30 3
31 3
... ...
999 3
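The "run the block once per element" behaviour has a close analogue in core Ruby's `Array.new(size) { ... }`. The sketch below uses that analogue with the same block as the cell above; it is an illustration only, `build_with_size` is a hypothetical name, and whether daru passes the position to the block is not assumed here.

```ruby
# Plain-Ruby analogue of size-plus-block construction.
# The block runs once per position and its result fills that position.
def build_with_size(size)
  return Array.new(size) unless block_given?

  Array.new(size) { yield }
end

sample = build_with_size(1000) do
  r = rand(5)
  r == 4 ? nil : r # about one value in five becomes missing (nil)
end

puts sample.size       # prints 1000
puts sample.count(nil) # varies per run
```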

Use the #head method to obtain the first 10 values of the Vector.

Out[9]:

Daru::Vector:27175540 size: 10
new_vector
0 2
1 3
2 0
3
4
5 2
6 2
7 2
8 1
9 1

Sorting

The Daru::Vector#sort method will sort the Vector and preserve the indexes.

In [10]:

a = Daru::Vector.new([23,144,332,11,2,5,6765,3])

Out[10]:

Daru::Vector:25317760 size: 8
nil
0 23
1 144
2 332
3 11
4 2
5 5
6 6765
7 3

Out[11]:

Daru::Vector:24840120 size: 8
nil
4 2
7 3
5 5
3 11
0 23
1 144
2 332
6 6765
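The effect of sorting while keeping each value's original label can be reproduced in plain Ruby. This is a sketch assuming a default integer index, not daru's implementation; `sort_with_index` is a hypothetical name.

```ruby
# Plain-Ruby sketch: sort values in ascending order while carrying
# each value's original position along as its label.
def sort_with_index(values)
  values.each_with_index
        .sort_by { |value, _position| value }
        .map { |value, position| [position, value] }
end

data = [23, 144, 332, 11, 2, 5, 6765, 3]

# Prints "4 2", "7 3", "5 5", ... matching the sorted output above.
sort_with_index(data).each { |position, value| puts "#{position} #{value}" }
```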

Basic Math

Arithmetic operations done between two vectors will always perform the arithmetic on corresponding elements of the same index.

The vectors need not have the same size or even the same index. In case of a mismatch, a sorted union of the indexes of both Vectors is used as the index of the resulting vector.

In case a particular index exists in one vector but not in the other, the result Vector has a nil placed in that index position.

Daru::Vector supports +, -, *, / and ** operators.

In [12]:

a = Daru::Vector.new([1,2,3,4,5,6], index: [:a, :b, :c, :d, :five, :f])
b = Daru::Vector.new([1,2,3,4,5], index: [:a, :b, :c, :ff,:five])

a + b

Out[12]:

Daru::Vector:24525720 size: 7
nil
a 2
b 4
c 6
d
f
ff
five 10

Out[13]:

Daru::Vector:24243560 size: 7
nil
a 1
b 4
c 27
d
f
ff
five 3125
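The alignment rule behind the two outputs above (sorted union of labels, nil wherever a label is missing from either side) can be sketched in plain Ruby with Hashes standing in for Vectors. This is an illustration of the rule, not daru's code; `align_and_apply` is a hypothetical name, and the sketch assumes all labels are mutually comparable so they can be sorted.

```ruby
# Plain-Ruby sketch of index-aligned arithmetic. Hashes map label => value.
def align_and_apply(left, right)
  labels = (left.keys | right.keys).sort # sorted union of both indexes
  labels.to_h do |label|
    l = left[label]
    r = right[label]
    # A label absent from either side yields nil in the result.
    [label, l.nil? || r.nil? ? nil : yield(l, r)]
  end
end

a = { a: 1, b: 2, c: 3, d: 4, five: 5, f: 6 }
b = { a: 1, b: 2, c: 3, ff: 4, five: 5 }

sum   = align_and_apply(a, b) { |x, y| x + y }
power = align_and_apply(a, b) { |x, y| x**y }

sum.each   { |label, value| puts "#{label} #{value.inspect}" }
power.each { |label, value| puts "#{label} #{value.inspect}" }
```

The labels come out in the order a, b, c, d, f, ff, five, with values only at a, b, c and five, as in the outputs above.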

Performing arithmetic with a single number will perform the operation on each element in the Vector and return the resultant Vector.

Out[14]:

Daru::Vector:23813900 size: 6
nil
a 5
b 10
c 15
d 20
five 25
f 30
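Element-wise arithmetic with a single number can likewise be sketched in plain Ruby, again with a Hash standing in for the Vector. The factor 5 is inferred from the output above, and `scale` is a hypothetical name; handling of missing values is deliberately left out of this sketch.

```ruby
# Plain-Ruby sketch: apply one scalar to every element. Unlike
# vector-with-vector arithmetic, the original label order is kept.
def scale(labelled_values, factor)
  labelled_values.transform_values { |value| value * factor }
end

a = { a: 1, b: 2, c: 3, d: 4, five: 5, f: 6 }
scale(a, 5).each { |label, value| puts "#{label} #{value}" }
```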

Statistics

Daru::Vector defines a host of statistics methods, which are useful for quick, ad hoc statistics on numeric data. All the statistics methods ignore missing values and work only on the valid data.

For a complete list of statistics functions see the Daru::Maths::Statistics::Vector module in the docs.

In [15]:

v = Daru::Vector.new([1,2,3,4,5,nil,6,nil,7])
v.mean
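What "ignore the missing values" means for a mean can be checked by hand in plain Ruby: the nil entries are dropped from both the sum and the count. This is a sketch of the arithmetic, not daru's implementation; `mean_ignoring_missing` is a hypothetical name.

```ruby
# Plain-Ruby sketch: mean over the valid (non-nil) values only.
def mean_ignoring_missing(values)
  valid = values.compact
  return nil if valid.empty?

  valid.sum.to_f / valid.size
end

# Seven valid values summing to 28, so the mean is 28 / 7.
puts mean_ignoring_missing([1, 2, 3, 4, 5, nil, 6, nil, 7]) # prints 4.0
```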

Plotting

Daru uses nyaplot internally for generating interactive plots.

You can also use rubyvis through statsample for quickly generating scatter plots, histograms and box plots.

A simple scatter plot can be generated by calling the #plot method on a Daru::Vector. Feel free to interact with the generated plot.

In [18]:

v = Daru::Vector.new((0..360).step(7).map { |i| Math.sin((i*Math::PI)/180) })
v.plot

Now, let's take some dummy survey data giving the number of respondents in each age group. We want to plot the number of people in each age group as a bar graph.

For this purpose we use the #plot function again, but this time supply it with the :type option set to :bar. The plot function yields the corresponding Nyaplot::Plot object to the block, which can then be used for setting different parameters of the final plot. For more configuration methods see the Nyaplot::Plot documentation.

In [19]:

v = Daru::Vector.new([40,50,20,70,10], index: ['18-24', '24-30', 'Under 18', '30-40', '40-50'], name: "Age Range")
v.plot(type: :bar) do |plt|
  plt.x_label "Age Groups"
  plt.y_label "Number of People Surveyed"
end

The third kind of plot that Daru::Vector can easily generate from nyaplot is the histogram.

To demonstrate, we'll prepare some sample data using the rnorm function from the statsample Ruby gem. The rnorm function generates normally distributed random numbers (1000 in this case) and returns a Daru::Vector object that contains these numbers (stored in variable a).

The cell below plots a histogram of this normally distributed sample.

In [20]:

require 'statsample'
include Statsample::Shorthand

a = rnorm(1000)
a.plot type: :histogram do |p|
  p.yrange [0,200]
  p.y_label "Frequency"
  p.x_label "Bins"
end

More plotting support

Apart from interfacing with nyaplot, Daru::Vector also works out-of-the-box with rubyvis through statsample. To see plot generation with statsample and rubyvis in action, check out the following notebooks: