[Roadmap] Phasing out the support for old binary format. · Issue #7547 · dmlc/xgboost

XGBoost has used a custom binary model format since day 1. In 1.0 we introduced the JSON
format as an alternative; it has a schema and is easier to extend. The JSON format is
already the default for memory snapshot serialization (pickle, rds, etc.) and supports
extra features, including categorical data, feature names, and feature types. However,
for performance and compatibility reasons we have continued to support the old binary
format. In 1.6 we plan to add Universal Binary JSON as an extension to the current JSON
format and as a replacement for the old binary format.

Motivation

The old binary format essentially copies internal structures, such as parameters and tree
nodes, into a memory buffer, so it has a fixed memory layout that is difficult to change
and debug. The Learner class is full of conditions that work around issues the binary
format has accumulated over time. These issues stem from the fact that we cannot change
the binary output in any way, which also indirectly constrains how we write code. For
instance, we cannot change the RegTree structure, the very core of XGBoost, because of
how its nodes are stored in the output. To overcome these issues and clear some room for
future development, we need to phase out the binary format.
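
To make the fixed-layout problem concrete, here is a minimal Python sketch. It is not XGBoost's actual serialization code: the node fields, the `NODE_V1`/`NODE_V2` layouts, and the `is_categorical` flag are all hypothetical and chosen only for illustration. It shows that adding a single field to a packed record breaks readers of existing files, while a keyed format such as JSON lets a newer reader fall back to a default.

```python
import json
import struct

# Hypothetical node layouts for illustration only; NOT XGBoost's real RegTree node.
NODE_V1 = struct.Struct("<iiIf")   # parent, left child, split index, split value
NODE_V2 = struct.Struct("<iiIfB")  # same fields plus a new 1-byte flag
KEYS = ("parent", "left", "split_index", "split_value")


def dump_binary(nodes):
    """Concatenate fixed-size V1 records, mimicking a raw copy of node structs."""
    return b"".join(NODE_V1.pack(*node) for node in nodes)


def load_binary(buf, layout):
    """Read back as many whole records as fit, assuming they all match `layout`."""
    last_start = len(buf) - layout.size
    return [layout.unpack_from(buf, off) for off in range(0, last_start + 1, layout.size)]


def dump_json(nodes):
    """Store each node as a keyed object instead of a positional record."""
    return json.dumps([dict(zip(KEYS, node)) for node in nodes])


def load_json(text):
    """A newer reader can default a field that older files do not contain."""
    return [dict(n, is_categorical=n.get("is_categorical", False)) for n in json.loads(text)]


nodes = [(-1, 1, 2, 0.5), (0, -1, 0, 0.0), (0, -1, 0, 0.0)]
buf = dump_binary(nodes)

same_layout = load_binary(buf, NODE_V1)   # round-trips correctly
new_layout = load_binary(buf, NODE_V2)    # record boundaries no longer line up
from_json = load_json(dump_json(nodes))   # old data, new reader, still loads
```

With the packed records, the reader must know the exact byte layout that the writer used, so any change to the node struct needs version checks and workarounds in the loader. With the keyed format, the new field is simply absent from old files and the reader supplies a default.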

Roadmap

If the Universal Binary JSON implementation is accepted, I propose the following roadmap
for phasing out support for the old binary format:

note