Two-dimensional Datasets — Apache Arrow v20.0.0
Record Batches#
class RecordBatch#
Collection of equal-length arrays matching a particular Schema.
A record batch is a table-like data structure that is semantically a sequence of fields, each a contiguous Arrow array.
Public Functions
Result<std::shared_ptr<StructArray>> ToStructArray() const#
Convert record batch to struct array.
Create a struct array whose child arrays are the record batch’s columns. Note that the record batch’s top-level field metadata cannot be reflected in the resulting struct array.
Result<std::shared_ptr<Tensor>> ToTensor(bool null_to_nan = false, bool row_major = true, MemoryPool *pool = default_memory_pool()) const#
Convert record batch with one data type to Tensor.
Create a Tensor object with shape (number of rows, number of columns). The generated Tensor is row-major by default; pass row_major = false for a column-major Tensor, whose strides are (type size in bytes, type size in bytes * number of rows).
Parameters:
- null_to_nan – [in] if true, convert nulls to NaN
- row_major – [in] if true, create row-major Tensor else column-major Tensor
- pool – [in] the memory pool to allocate the tensor buffer
Returns:
the resulting Tensor
bool Equals(const RecordBatch &other, bool check_metadata = false, const EqualOptions &opts = EqualOptions::Defaults()) const#
Determine if two record batches are exactly equal.
Parameters:
- other – [in] the RecordBatch to compare with
- check_metadata – [in] if true, check that Schema metadata is the same
- opts – [in] the options for equality comparisons
Returns:
true if batches are equal
bool ApproxEquals(const RecordBatch &other, const EqualOptions &opts = EqualOptions::Defaults()) const#
Determine if two record batches are approximately equal.
Parameters:
- other – [in] the RecordBatch to compare with
- opts – [in] the options for equality comparisons
Returns:
true if batches are approximately equal
inline const std::shared_ptr<Schema> &schema() const#
Returns:
the record batch’s schema
Result<std::shared_ptr<RecordBatch>> ReplaceSchema(std::shared_ptr<Schema> schema) const#
Replace the schema with another schema with the same types, but potentially different field names and/or metadata.
virtual const std::vector<std::shared_ptr<Array>> &columns() const = 0#
Retrieve all columns at once.
virtual std::shared_ptr<Array> column(int i) const = 0#
Retrieve an array from the record batch.
Parameters:
i – [in] field index; not bounds-checked
Returns:
an Array object
std::shared_ptr<Array> GetColumnByName(const std::string &name) const#
Retrieve an array from the record batch.
Parameters:
name – [in] field name
Returns:
an Array or null if no field was found
virtual std::shared_ptr<ArrayData> column_data(int i) const = 0#
Retrieve an array’s internal data from the record batch.
Parameters:
i – [in] field index; not bounds-checked
Returns:
an internal ArrayData object
virtual const ArrayDataVector &column_data() const = 0#
Retrieve all arrays’ internal data from the record batch.
virtual Result<std::shared_ptr<RecordBatch>> AddColumn(int i, const std::shared_ptr<Field> &field, const std::shared_ptr<Array> &column) const = 0#
Add column to the record batch, producing a new RecordBatch.
Parameters:
- i – [in] field index, which will be bounds-checked
- field – [in] field to be added
- column – [in] column to be added
virtual Result<std::shared_ptr<RecordBatch>> AddColumn(int i, std::string field_name, const std::shared_ptr<Array> &column) const#
Add new nullable column to the record batch, producing a new RecordBatch.
For non-nullable columns, use the Field-based version of this method.
Parameters:
- i – [in] field index, which will be bounds-checked
- field_name – [in] name of field to be added
- column – [in] column to be added
virtual Result<std::shared_ptr<RecordBatch>> SetColumn(int i, const std::shared_ptr<Field> &field, const std::shared_ptr<Array> &column) const = 0#
Replace a column in the record batch, producing a new RecordBatch.
Parameters:
- i – [in] field index; bounds-checked
- field – [in] the replacement field
- column – [in] the replacement column
virtual Result<std::shared_ptr<RecordBatch>> RemoveColumn(int i) const = 0#
Remove column from the record batch, producing a new RecordBatch.
Parameters:
i – [in] field index; bounds-checked
const std::string &column_name(int i) const#
Name of the i-th column.
int num_columns() const#
Returns:
the number of columns in the record batch
inline int64_t num_rows() const#
Returns:
the number of rows (the corresponding length of each column)
Result<std::shared_ptr<RecordBatch>> CopyTo(const std::shared_ptr<MemoryManager> &to) const#
Copy the entire RecordBatch to destination MemoryManager.
This uses Array::CopyTo on each column of the record batch to create a new record batch where all underlying buffers for the columns have been copied to the destination MemoryManager. This uses MemoryManager::CopyBuffer under the hood.
Result<std::shared_ptr<RecordBatch>> ViewOrCopyTo(const std::shared_ptr<MemoryManager> &to) const#
View or Copy the entire RecordBatch to destination MemoryManager.
This uses Array::ViewOrCopyTo on each column of the record batch to create a new record batch where all underlying buffers for the columns have been zero-copy viewed on the destination MemoryManager, falling back to performing a copy if it can’t be viewed as a zero-copy buffer. This uses Buffer::ViewOrCopy under the hood.
virtual std::shared_ptr<RecordBatch> Slice(int64_t offset) const#
Slice each of the arrays in the record batch.
Parameters:
offset – [in] the starting offset to slice, through end of batch
Returns:
new record batch
virtual std::shared_ptr<RecordBatch> Slice(int64_t offset, int64_t length) const = 0#
Slice each of the arrays in the record batch.
Parameters:
- offset – [in] the starting offset to slice
- length – [in] the number of elements to slice from offset
Returns:
new record batch
std::string ToString() const#
Returns:
PrettyPrint representation suitable for debugging
std::vector<std::string> ColumnNames() const#
Return names of all columns.
Result<std::shared_ptr<RecordBatch>> RenameColumns(const std::vector<std::string> &names) const#
Rename columns with provided names.
Result<std::shared_ptr<RecordBatch>> SelectColumns(const std::vector<int> &indices) const#
Return new record batch with specified columns.
virtual Status Validate() const#
Perform cheap validation checks to determine obvious inconsistencies within the record batch’s schema and internal data.
This is O(k) where k is the total number of fields and array descendants.
virtual Status ValidateFull() const#
Perform extensive validation checks to determine inconsistencies within the record batch’s schema and internal data.
This is potentially O(k*n) where n is the number of rows.
virtual const std::shared_ptr<Device::SyncEvent> &GetSyncEvent() const = 0#
EXPERIMENTAL: Return a top-level sync event object for this record batch.
If all of the data for this record batch is in CPU memory, then this will return null. If the data for this batch is on a device, then if synchronization is needed before accessing the data the returned sync event will allow for it.
Returns:
null or a Device::SyncEvent
Result<std::shared_ptr<Array>> MakeStatisticsArray(MemoryPool *pool = default_memory_pool()) const#
Create a statistics array of this record batch.
The created array follows the C data interface statistics specification. See https://arrow.apache.org/docs/format/StatisticsSchema.html for details.
Parameters:
pool – [in] the memory pool to allocate memory from
Returns:
the statistics array of this record batch
Public Static Functions
static std::shared_ptr<RecordBatch> Make(std::shared_ptr<Schema> schema, int64_t num_rows, std::vector<std::shared_ptr<Array>> columns, std::shared_ptr<Device::SyncEvent> sync_event = NULLPTR)#
Parameters:
- schema – [in] The record batch schema
- num_rows – [in] number of rows in the record batch; each array must have this length
- columns – [in] the record batch fields as vector of arrays
- sync_event – [in] optional synchronization event for non-CPU device memory used by buffers
static std::shared_ptr<RecordBatch> Make(std::shared_ptr<Schema> schema, int64_t num_rows, std::vector<std::shared_ptr<ArrayData>> columns, DeviceAllocationType device_type = DeviceAllocationType::kCPU, std::shared_ptr<Device::SyncEvent> sync_event = NULLPTR)#
Construct record batch from vector of internal data structures.
This function is intended for internal use or for advanced users.
Since
0.5.0
Parameters:
- schema – the record batch schema
- num_rows – the number of semantic rows in the record batch. This should be equal to the length of each field
- columns – the data for the batch’s columns
- device_type – the type of the device that the Arrow columns are allocated on
- sync_event – optional synchronization event for non-CPU device memory used by buffers
static Result<std::shared_ptr<RecordBatch>> MakeEmpty(std::shared_ptr<Schema> schema, MemoryPool *pool = default_memory_pool())#
Create an empty RecordBatch of a given schema.
The output RecordBatch will be created with DataTypes from the given schema.
Parameters:
- schema – [in] the schema of the empty RecordBatch
- pool – [in] the memory pool to allocate memory from
Returns:
the resulting RecordBatch
static Result<std::shared_ptr<RecordBatch>> FromStructArray(const std::shared_ptr<Array> &array, MemoryPool *pool = default_memory_pool())#
Construct record batch from struct array.
This constructs a record batch using the child arrays of the given array, which must be a struct array.
This operation will usually be zero-copy. However, if the struct array has an offset or a validity bitmap then these will need to be pushed into the child arrays. Pushing the offset is zero-copy but pushing the validity bitmap is not.
Parameters:
- array – [in] the source array, must be a StructArray
- pool – [in] the memory pool to allocate new validity bitmaps
class RecordBatchReader#
Abstract interface for reading stream of record batches.
Subclassed by arrow::TableBatchReader, arrow::csv::StreamingReader, arrow::flight::sql::example::SqliteStatementBatchReader, arrow::flight::sql::example::SqliteTablesWithSchemaBatchReader, arrow::ipc::RecordBatchStreamReader, arrow::json::StreamingReader
Public Functions
virtual std::shared_ptr<Schema> schema() const = 0#
Returns:
the shared schema of the record batches in the stream
virtual Status ReadNext(std::shared_ptr<RecordBatch> *batch) = 0#
Read the next record batch in the stream.
Returns null for batch when the end of the stream is reached.
Example:
while (true) {
  std::shared_ptr<RecordBatch> batch;
  ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
  if (!batch) {
    break;
  }
  // Handle the batch; note that batch->num_rows()
  // might be 0.
}
Parameters:
batch – [out] the next loaded batch, null at end of stream. An empty batch (zero rows) is valid data and does not by itself signal end of stream.
inline Result<std::shared_ptr<RecordBatch>> Next()#
Iterator interface.
inline virtual Status Close()#
Finalize the reader.
inline virtual DeviceAllocationType device_type() const#
EXPERIMENTAL: Get the device type for record batches this reader produces.
The default implementation returns DeviceAllocationType::kCPU.
inline RecordBatchReaderIterator begin()#
Return an iterator to the first record batch in the stream.
inline RecordBatchReaderIterator end()#
Return an iterator to the end of the stream.
Result<RecordBatchVector> ToRecordBatches()#
Consume entire stream as a vector of record batches.
Result<std::shared_ptr<Table>> ToTable()#
Read all batches and concatenate as arrow::Table.
Public Static Functions
static Result<std::shared_ptr<RecordBatchReader>> Make(RecordBatchVector batches, std::shared_ptr<Schema> schema = NULLPTR, DeviceAllocationType device_type = DeviceAllocationType::kCPU)#
Create a RecordBatchReader from a vector of RecordBatch.
Parameters:
- batches – [in] the vector of RecordBatch to read from
- schema – [in] schema to conform to. Will be inferred from the first element if not provided.
- device_type – [in] the type of device that the batches are allocated on
static Result<std::shared_ptr<RecordBatchReader>> MakeFromIterator(Iterator<std::shared_ptr<RecordBatch>> batches, std::shared_ptr<Schema> schema, DeviceAllocationType device_type = DeviceAllocationType::kCPU)#
Create a RecordBatchReader from an Iterator of RecordBatch.
Parameters:
- batches – [in] an iterator of RecordBatch to read from.
- schema – [in] schema that each record batch in iterator will conform to.
- device_type – [in] the type of device that the batches are allocated on
class RecordBatchReaderIterator#
class TableBatchReader : public arrow::RecordBatchReader#
Compute a stream of record batches from a (possibly chunked) Table.
The conversion is zero-copy: each record batch is a view over a slice of the table’s columns.
The table is expected to be valid prior to using it with the batch reader.
Public Functions
explicit TableBatchReader(const Table &table)#
Construct a TableBatchReader for the given table.
virtual std::shared_ptr<Schema> schema() const override#
Returns:
the shared schema of the record batches in the stream
virtual Status ReadNext(std::shared_ptr<RecordBatch> *out) override#
Read the next record batch in the stream.
Returns null for batch when the end of the stream is reached.
Example:
while (true) {
  std::shared_ptr<RecordBatch> batch;
  ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
  if (!batch) {
    break;
  }
  // Handle the batch; note that batch->num_rows()
  // might be 0.
}
Parameters:
batch – [out] the next loaded batch, null at end of stream. An empty batch (zero rows) is valid data and does not by itself signal end of stream.
void set_chunksize(int64_t chunksize)#
Set the desired maximum number of rows for record batches.
The actual number of rows in each record batch may be smaller, depending on actual chunking characteristics of each table column.
Tables#
class Table#
Logical table as sequence of chunked arrays.
Public Functions
inline const std::shared_ptr<Schema> &schema() const#
Return the table schema.
virtual std::shared_ptr<ChunkedArray> column(int i) const = 0#
Return a column by index.
virtual const std::vector<std::shared_ptr<ChunkedArray>> &columns() const = 0#
Return vector of all columns for table.
inline std::shared_ptr<Field> field(int i) const#
Return a column’s field by index.
std::vector<std::shared_ptr<Field>> fields() const#
Return vector of all fields for table.
virtual std::shared_ptr<Table> Slice(int64_t offset, int64_t length) const = 0#
Construct a zero-copy slice of the table with the indicated offset and length.
Parameters:
- offset – [in] the index of the first row in the constructed slice
- length – [in] the number of rows of the slice. If there are not enough rows in the table, the length will be adjusted accordingly
Returns:
a new object wrapped in std::shared_ptr