DataFrame Processing Operations API - ArcticDB (original) (raw)

Build a query to process read results with. Syntax is designed to be similar to Pandas:

q = adb.QueryBuilder()
q = q[q["a"] < 5] (equivalent to q = q[q.a < 5] provided the column name is also a valid Python variable name)
dataframe = lib.read(symbol, query_builder=q).data

For Group By and Aggregation functionality please see the documentation for the groupby. For projection functionality, see the documentation for the apply method.

Supported arithmetic operations when projection or filtering:

Supported filtering operations:

isin/isnotin accept lists, sets, frozensets, 1D ndarrays, or *args unpacking. For example:

l = [1, 2, 3]
q.isin(l)

is equivalent to...

q.isin(1, 2, 3)

Boolean columns can be filtered on directly:

q = adb.QueryBuilder()
q = q[q["boolean_column"]]

and combined with other operations intuitively:

q = adb.QueryBuilder()
q = q[(q["boolean_column_1"] & ~q["boolean_column_2"]) & (q["numeric_column"] > 0)]

Arbitrary combinations of these expressions is possible, for example:

q = q[(((q["a"] * q["b"]) / 5) < (0.7 * q["c"])) & (q["b"] != 12)]

See tests/unit/arcticdb/version_store/test_filtering.py for more example uses.

Timestamp filtering

pandas.Timestamp, datetime.datetime, pandas.Timedelta, and datetime.timedelta objects are supported. Note that internally all of these types are converted to nanoseconds (since epoch in the Timestamp/datetime cases). This means that nonsensical operations such as multiplying two times together are permitted (but not encouraged).

Restrictions

String equality/inequality (and isin/isnotin) is supported for printable ASCII characters only. Although not prohibited, it is not recommended to use ==, !=, isin, or isnotin with floating point values.

Exceptions

inf or -inf values are provided for comparison Column involved in query is a Categorical Symbol is pickled Column involved in query is not present in symbol Query involves comparing strings using <, <=, >, or >= operators Query involves comparing a string to one or more numeric values, or vice versa Query involves arithmetic with a column containing strings

METHOD DESCRIPTION
apply Apply enables new columns to be created using supported QueryBuilder numeric operations. See the documentation for the
date_range DateRange to read data for. Applicable only for Pandas data with a DateTime index. Returns only the part
groupby Group symbol by column name. GroupBy operations must be followed by an aggregation operator. Currently the following five aggregation
head Filter out all but the first n rows of data. If n is negative, return all rows except the last n rows.
optimise_for_memory Reduce peak memory usage during the query, at the expense of some performance.
optimise_for_speed Process query as fast as possible (the default behaviour)
prepend Applies processing specified in other before any processing already defined for this QueryBuilder.
resample Resample a symbol on the index. The symbol must be datetime indexed. Resample operations must be followed by
row_range Row range to read data for. Inclusive of the lower bound, exclusive of the upper bound.
tail Filter out all but the last n rows of data. If n is negative, return all rows except the first n rows.
then Applies processing specified in other after any processing already defined for this QueryBuilder.

apply

Apply enables new columns to be created using supported QueryBuilder numeric operations. See the documentation for the QueryBuilder class for more information on supported expressions - any expression valid in a filter is valid when usingapply.

PARAMETER DESCRIPTION
name Name of the column to be created
expr Expression

Examples:

`>>> df = pd.DataFrame( { "VWAP": np.arange(0, 10, dtype=np.float64), "ASK": np.arange(10, 20, dtype=np.uint16), "VOL_ACC": np.arange(20, 30, dtype=np.int32), }, index=np.arange(10), )

lib.write("expression", df) q = adb.QueryBuilder() q = q.apply("ADJUSTED", q["ASK"] * q["VOL_ACC"] + 7) lib.read("expression", query_builder=q).data VOL_ACC ASK VWAP ADJUSTED 0 20 10 0.0 207 1 21 11 1.0 238 2 22 12 2.0 271 3 23 13 3.0 306 4 24 14 4.0 343 5 25 15 5.0 382 6 26 16 6.0 423 7 27 17 7.0 466 8 28 18 8.0 511 9 29 19 9.0 558 `

RETURNS DESCRIPTION
QueryBuilder Modified QueryBuilder object.

date_range

date_range(date_range: DateRangeInput)

DateRange to read data for. Applicable only for Pandas data with a DateTime index. Returns only the part of the data that falls within the given range. If this is the only processing clause being applied, then the returned data object will use less memory than passing date_range directly as an argument to the read method, at the cost of possibly being slightly slower.

PARAMETER DESCRIPTION
date_range A date range in the same format as accepted by the read method. TYPE: DateRangeInput

Examples:

`>>> q = adb.QueryBuilder()

q = q.date_range((pd.Timestamp("2000-01-01"), pd.Timestamp("2001-01-01"))) `

RETURNS DESCRIPTION
QueryBuilder Modified QueryBuilder object.

groupby

Group symbol by column name. GroupBy operations must be followed by an aggregation operator. Currently the following five aggregation operators are supported:

For usage examples, see below.

PARAMETER DESCRIPTION
name Name of the column to group on. Note that currently GroupBy only supports single-column groupings. TYPE: str

Examples:

Average (mean) over two groups:

`>>> df = pd.DataFrame( { "grouping_column": ["group_1", "group_1", "group_1", "group_2", "group_2"], "to_mean": [1.1, 1.4, 2.5, np.nan, 2.2], }, index=np.arange(5), )

q = adb.QueryBuilder() q = q.groupby("grouping_column").agg({"to_mean": "mean"}) lib.write("symbol", df) lib.read("symbol", query_builder=q).data `

           to_mean
 group_1  1.666667
 group_2       2.2

Max over one group:

`>>> df = pd.DataFrame( { "grouping_column": ["group_1", "group_1", "group_1"], "to_max": [1, 5, 4], }, index=np.arange(3), )

q = adb.QueryBuilder() q = q.groupby("grouping_column").agg({"to_max": "max"}) lib.write("symbol", df) lib.read("symbol", query_builder=q).data `

         to_max
group_1  5

Max and Mean:

`>>> df = pd.DataFrame( { "grouping_column": ["group_1", "group_1", "group_1"], "to_mean": [1.1, 1.4, 2.5], "to_max": [1.1, 1.4, 2.5] }, index=np.arange(3), )

q = adb.QueryBuilder() q = q.groupby("grouping_column").agg({"to_max": "max", "to_mean": "mean"}) lib.write("symbol", df) lib.read("symbol", query_builder=q).data `

         to_max   to_mean
group_1     2.5  1.666667

Min and max over one column, mean over another:

`>>> df = pd.DataFrame( { "grouping_column": ["group_1", "group_1", "group_1", "group_2", "group_2"], "agg_1": [1, 2, 3, 4, 5], "agg_2": [1.1, 1.4, 2.5, np.nan, 2.2], }, index=np.arange(5), )

q = adb.QueryBuilder() q = q.groupby("grouping_column") q = q.agg({"agg_1_min": ("agg_1", "min"), "agg_1_max": ("agg_1", "max"), "agg_2": "mean"}) lib.write("symbol", df) lib.read("symbol", query_builder=q).data `

         agg_1_min  agg_1_max     agg_2
group_1          1          3  1.666667
group_2          4          5       2.2
RETURNS DESCRIPTION
QueryBuilder Modified QueryBuilder object.

head

Filter out all but the first n rows of data. If n is negative, return all rows except the last n rows.

PARAMETER DESCRIPTION
n Number of rows to select if non-negative, otherwise number of rows to exclude. TYPE: int DEFAULT: 5
RETURNS DESCRIPTION
QueryBuilder Modified QueryBuilder object.

optimise_for_memory

Reduce peak memory usage during the query, at the expense of some performance.

Optimisations applied:

optimise_for_speed

Process query as fast as possible (the default behaviour)

prepend

Applies processing specified in other before any processing already defined for this QueryBuilder.

PARAMETER DESCRIPTION
other QueryBuilder to apply before this one in the processing pipeline.
RETURNS DESCRIPTION
QueryBuilder Modified QueryBuilder object.

resample

resample( rule: Union[str, DateOffset], closed: Optional[str] = None, label: Optional[str] = None, offset: Optional[Union[str, Timedelta]] = None, origin: Union[str, Timestamp] = "epoch", )

Resample a symbol on the index. The symbol must be datetime indexed. Resample operations must be followed by an aggregation operator. Currently, the following 7 aggregation operators are supported:

Note that not all aggregators are supported with all column types:

Note that time-buckets which contain no index values in the symbol will NOT be included in the returned DataFrame. This is not the same as Pandas default behaviour. Resampling is currently not supported with:

The resample results match pandas resample with origin="epoch". We plan to add an 'origin' argument in a future release and will then change the default value to '"start_day"' to match the Pandas default. This will change the results in cases where the rule is not a multiple of 24 hours.

PARAMETER DESCRIPTION
rule The frequency at which to resample the data. Supported rule strings are ns, us, ms, s, min, h, and D, and multiples/combinations of these, such as 1h30min. pd.DataOffset objects representing frequencies from this set are also accepted. TYPE: Union[str, DateOffset]
closed Which boundary of each time-bucket is closed. Must be one of 'left' or 'right'. If not provided, the default is left for all currently supported frequencies. TYPE: Optional[str] DEFAULT: None
label Which boundary of each time-bucket is used as the index value in the returned DataFrame. Must be one of 'left' or 'right'. If not provided, the default is left for all currently supported frequencies. TYPE: Optional[str] DEFAULT: None
offset Offset the start of each bucket. Supported strings are the same as in pd.Timedelta. If offset is larger than rule then offset modulo rule is used as an offset. TYPE: Optional[Union[str, Timedelta]] DEFAULT: None
origin The timestamp on which to adjust the grouping. Supported string are: epoch: origin is 1970-01-01 start: origin is the first value of the timeseries start_day: origin is the first day at midnight of the timeseries end: origin is the last value of the timeseries end_day: origin is the ceiling midnight of the last day start, start_day, end, end_day origin values are not supported in conjunction with date_range. TYPE: Union[str, Timestamp] DEFAULT: 'epoch'
RETURNS DESCRIPTION
QueryBuilder Modified QueryBuilder object.
RAISES DESCRIPTION
ArcticDbNotYetImplemented A frequency string or Pandas DateOffset object are provided to the rule argument outside the supported frequencies listed above.
ArcticNativeException The closed or label arguments are not one of "left" or "right"
SchemaException Raised on call to read if: If the aggregation specified is not compatible with the type of the column being aggregated as specified above. The library has dynamic schema enabled, and at least one of the columns being aggregated is missing from at least one row-slice. At least one of the columns being aggregated contains sparse data.
UserInputException start, start_day, end, end_day is used in conjunction with date_range origin is not one of start, start_day, end, end_day, epoch or a pd.Timestamp

Examples:

Resample two hours worth of minutely data down to hourly data, summing the column 'to_sum':

`>>> df = pd.DataFrame( { "to_sum": np.arange(120), }, index=pd.date_range("2024-01-01", freq="min", periods=120), )

q = adb.QueryBuilder() q = q.resample("h").agg({"to_sum": "sum"}) lib.write("symbol", df) lib.read("symbol", query_builder=q).data `

                     to_sum
2024-01-01 00:00:00    1770
2024-01-01 01:00:00    5370

As above, but specifying that the closed boundary of each time-bucket is the right hand side, and also to label the output by the right boundary:

`>>> q = adb.QueryBuilder()

q = q.resample("h", closed="right", label="right").agg({"to_sum": "sum"}) lib.read("symbol", query_builder=q).data `

                     to_sum
2024-01-01 00:00:00       0
2024-01-01 01:00:00    1830
2024-01-01 02:00:00    5310

Nones, NaNs, and NaTs are omitted from aggregations:

`>>> df = pd.DataFrame( { "to_mean": [1.0, np.nan, 2.0], }, index=pd.date_range("2024-01-01", freq="min", periods=3), )

q = adb.QueryBuilder() q = q.resample("h").agg({"to_mean": "mean"}) lib.write("symbol", df) lib.read("symbol", query_builder=q).data `

                     to_mean
2024-01-01 00:00:00      1.5

Output column names can be controlled through the format of the dict passed to agg:

`>>> df = pd.DataFrame( { "agg_1": [1, 2, 3, 4, 5], "agg_2": [1.0, 2.0, 3.0, np.nan, 5.0], }, index=pd.date_range("2024-01-01", freq="min", periods=5), )

q = adb.QueryBuilder() q = q.resample("h") q = q.agg({"agg_1_min": ("agg_1", "min"), "agg_1_max": ("agg_1", "max"), "agg_2": "mean"}) lib.write("symbol", df) lib.read("symbol", query_builder=q).data `

                     agg_1_min  agg_1_max     agg_2
2024-01-01 00:00:00          1          5      2.75

row_range

row_range(row_range: Tuple[int, int])

Row range to read data for. Inclusive of the lower bound, exclusive of the upper bound. Should behave the same as df.iloc[start:end], including in the handling of negative start/end values.

PARAMETER DESCRIPTION
row_range Row range to read data for. Inclusive of the lower bound, exclusive of the upper bound. TYPE: Tuple[int, int]
RETURNS DESCRIPTION
QueryBuilder Modified QueryBuilder object.

tail

Filter out all but the last n rows of data. If n is negative, return all rows except the first n rows.

PARAMETER DESCRIPTION
n Number of rows to select if non-negative, otherwise number of rows to exclude. TYPE: int DEFAULT: 5
RETURNS DESCRIPTION
QueryBuilder Modified QueryBuilder object.

then

Applies processing specified in other after any processing already defined for this QueryBuilder.

PARAMETER DESCRIPTION
other QueryBuilder to apply after this one in the processing pipeline.
RETURNS DESCRIPTION
QueryBuilder Modified QueryBuilder object.