GitHub - greenfish77/gaenari: c++ incremental decision tree
gaenari
gaenari is the Korean name for spring-blooming forsythia in East Asia. it is a plant with small yellow flowers.
Here, gaenari means:
when machine learning works with real world data, its accuracy decreases over time. it is quite different from the AI industry news or the success stories of academic journals. what is the cause of the problem?
concept drift
concept drift is one of the biggest obstacles to machine learning in real-world AI.
can data from before the COVID-19 pandemic predict future information? not easily. data trends are always changing, so current models cannot easily predict even one step ahead.
we live in an incomplete real world rather than a toy world that separates dog and cat images.
solutions
- assume that perfect modeling is impossible.
  it relies on heuristic technique instead of mathematical and statistical algorithms.
- update the model through incremental learning.
- accumulate data and provide insights.
- minimize the use of complex hyper-parameters.
this is an example of resolving the decrease in accuracy due to data trend change by calling rebuild().
supul
supul means forest in Korean, and it is a metaphor for multiple decision trees. supul is another library in gaenari.
- support for incremental learning through inserts, updates, and rebuilds.
- combine multiple decision trees.
- manage a database.
generation
the supul expands by generation. rebuild trains a single decision tree on the weak parts and then combines it with the existing tree.
the goal is to increase (or keep) accuracy through rule segmentation. the effect is similar to sharpening a photo, as shown in the picture below.
i learned that overfitting is bad.
as shown above, the tree seems to overfit over time, so it is natural to have doubts.
in traditional machine learning, the training data is a sampled subset of the population. so there is a sampling error (the training data are not representative of all cases in the population), and overfitting is something to avoid.
however, the goal of the supul is to learn the whole data, not the samples. therefore, in this case, it adapts to new data through continuous incremental learning. this reduces the risk of overfitting.
library design
a single decision tree and dataset are implemented in gaenari. supul implements the public supul methods that can be called externally. database and model processing for incremental learning are key.
the dataframe repository is implemented as an interface and can be easily extended. the same goes for databases: databases other than sqlite are also possible.
build
gaenari is a header-only library, so only an include is needed. but some external libraries, such as sqlite, require c/c++ compilation for linking.
build gaenari
$ cd /path/to/gaenari
$ mkdir build
$ cd build
$ cmake ..
$ cmake --build . --config release
test
find the executables in build/tests/* and run them. the results are displayed in color.
build with gaenari
wrapper/wrapper.cpp
#include "gaenari/gaenari.hpp"

int main(void) {
    gaenari::logger::init1("/temp/_log.txt");
    using supul_t = supul::supul::supul_t;
    supul_t::api::project::create("/temp/supul_dir");
    supul_t::api::project::add_field("/temp/supul_dir", "x1", "REAL");
    supul_t::api::project::add_field("/temp/supul_dir", "x2", "INTEGER");
    supul_t::api::project::add_field("/temp/supul_dir", "x3", "TEXT_ID");
    supul_t::api::project::add_field("/temp/supul_dir", "y0", "TEXT_ID");
    supul_t::api::project::x("/temp/supul_dir", {"x1", "x2", "x3"});
    supul_t::api::project::y("/temp/supul_dir", "y0");
    supul_t::api::project::set_property("/temp/supul_dir", "db.type", "sqlite");
    supul_t supul;
    supul.api.lifetime.open("/temp/supul_dir");
    supul.api.model.insert_chunk_csv("/temp/dataset.csv");
    supul.api.model.update();
    // ...
    supul.api.model.rebuild();
    supul.api.lifetime.close();
    return 0;
}
wrapper/CMakeLists.txt
cmake_minimum_required(VERSION 3.6)
project(wrapper)
# call order is important.
add_subdirectory(</path/to/gaenari>)
check_cpp17_gaenari()
add_executable(wrapper wrapper.cpp)
add_gaenari(wrapper)
build
wrapper/build$ cmake ..
wrapper/build$ cmake --build . --config release
walkthrough
you can call supul.api.<category>.<function>(...). the functions are implemented as noexcept, so check for errors by testing whether the return value is false (or std::nullopt). for convenience, the return value checks are omitted below. see the comments for more details.
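the return value convention can be sketched without the library. the three fake_* functions below are hypothetical stand-ins (they are not gaenari functions) that mimic the bool, std::optional, and struct return styles, and first_failure shows one way a caller might check each of them.

```cpp
#include <cassert>
#include <optional>
#include <string>

// hypothetical result struct with an error flag; not the library's type.
struct fake_result {
    bool error = false;
    std::string label;
};

// bool style: false means error.
bool fake_open(const std::string& dir) noexcept { return !dir.empty(); }

// optional style: std::nullopt means error.
std::optional<std::string> fake_report(bool ready) noexcept {
    if (!ready) return std::nullopt;
    return std::string("{}");
}

// struct style: a bool member reports the error.
fake_result fake_predict(int x) noexcept {
    fake_result r;
    if (x < 0) { r.error = true; return r; }
    r.label = "G1";
    return r;
}

// returns the name of the first failing step, or an empty string on success.
std::string first_failure(const std::string& dir, bool ready, int x) {
    if (!fake_open(dir)) return "open";
    auto json = fake_report(ready);
    if (!json) return "report";
    auto r = fake_predict(x);
    if (r.error) return "predict";
    return "";
}
```

with the real api, the same three checks would apply to whichever supul.api call returns each type.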
walkthrough :: ready
since gaenari has only headers, a single include is required.
#include "gaenari/gaenari.hpp"
initialize log:
gaenari::logger::init1("/temp/log_gaenari.log");
walkthrough :: project
supul works on a project, and each project is one directory. the project directory contains the configuration and sqlite database files. create a project as follows.
supul::supul::supul_t::api::project::create("/temp/my_project");
all functions in project category are static, so they can be called directly without an object. the main files in the project directory are:
| file name | note |
|---|---|
| property.txt | project configuration |
| attributes.json | schema definition |
| *.db | sqlite database file |
/temp/my_project/property.txt
# supul configuration.
ver = 1.0.0
# supported db type : sqlite.
db.type = sqlite
# set default database name.
db.dbname = supul
# set table name prefix.
db.tablename.prefix =
# if the treenode is less accurate(<=) than this value, it is weak. the higher the value, the more aggressive the rebuild, and the more complex the tree.
model.weak_treenode_condition.accuracy = 0.8
# it is weak when the number of treenode's instances is greater(>=) than this. the lower the value, the more aggressive the rebuild, and the more complex the tree.
model.weak_treenode_condition.total_count = 5
you must choose db.type after project::create(). for example, select sqlite. you can edit property.txt manually or set it with the following function.
supul::supul::supul_t::api::project::set_property("/temp/my_project", "db.type", "sqlite");
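the two model.weak_treenode_condition properties in property.txt can be read as one predicate. the sketch below is only an interpretation of the property comments, not the library's implementation: it assumes that a treenode is weak when both conditions hold, and the names weak_condition and is_weak are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// thresholds mirroring the two property.txt entries (defaults 0.8 and 5).
struct weak_condition {
    double accuracy = 0.8;
    std::int64_t total_count = 5;
};

// assumption: a treenode is weak when it holds enough instances
// (>= total_count) and its accuracy is at or below the threshold (<=).
bool is_weak(std::int64_t correct_count, std::int64_t total_count,
             const weak_condition& cond = {}) {
    assert(correct_count >= 0 && correct_count <= total_count);
    if (total_count == 0) return false;                 // nothing to judge
    if (total_count < cond.total_count) return false;   // too few instances
    double accuracy = static_cast<double>(correct_count) /
                      static_cast<double>(total_count);
    return accuracy <= cond.accuracy;
}
```

under this reading, raising the accuracy threshold or lowering total_count marks more treenodes as weak, which matches the "more aggressive rebuild" wording in the comments.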
attributes.json
{
  "revision": 0,
  "fields": {
    "salary": "REAL",
    "commission": "REAL",
    "age": "INTEGER",
    "elevel": "TEXT_ID",
    "car": "TEXT_ID",
    "zipcode": "TEXT_ID",
    "hvalue": "REAL",
    "hyears": "INTEGER",
    "loan": "REAL",
    "group": "TEXT_ID"
  },
  "x": [ "salary", "commission", "age", "elevel", "car", "zipcode", "hvalue", "hyears", "loan" ],
  "y": "group"
}
the above json is an example for the agrawal dataset. the dataset is created with the dataset generator provided by weka and is divided into two groups. there are 9 functions, and data from the same function has the same data trend. it is used for concept drift experiments.
INTEGER, REAL, and TEXT_ID are supported as data types. TEXT_ID uses an index stored in a string table. it is nominal data.
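the idea of storing nominal text as a string table index can be sketched as follows. this is a minimal illustration under my own assumptions, not how supul stores its string table; the class string_table and its member names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// hypothetical string table: maps each distinct text to a stable index.
class string_table {
public:
    // returns the index of text, adding it when it is seen for the first time.
    std::int64_t id_of(const std::string& text) {
        auto it = index_.find(text);
        if (it != index_.end()) return it->second;
        auto id = static_cast<std::int64_t>(texts_.size());
        texts_.push_back(text);
        index_.emplace(text, id);
        return id;
    }

    // returns the text stored at id.
    const std::string& text_of(std::int64_t id) const {
        assert(id >= 0 && static_cast<std::size_t>(id) < texts_.size());
        return texts_[static_cast<std::size_t>(id)];
    }

private:
    std::vector<std::string> texts_;
    std::unordered_map<std::string, std::int64_t> index_;
};
```

a value such as L2 or C16 would then be kept as a small integer, and the original text is recovered with text_of when needed.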
these fields should be included in the header of the csv where the instances are stored. not all fields need to be included in x (e.g. internal id values needed for tracking). explicitly choose the x items from the fields, and choose one y item as well.
you can edit the json manually or use the function below, too.
using supul_t = supul::supul::supul_t;
std::string base_dir = "/temp/my_project";
supul_t::api::project::add_field(base_dir, "salary", "REAL");
supul_t::api::project::add_field(base_dir, "commission", "REAL");
...
supul_t::api::project::add_field(base_dir, "group", "TEXT_ID");
supul_t::api::project::x(base_dir, {"salary", "commission", ..., "loan"});
supul_t::api::project::y(base_dir, "group");
walkthrough :: create a supul object
after project creation, create a supul object.
supul::supul::supul_t supul;
if you want to use a supul object as a function return, you can use unique_ptr.
auto supul = std::make_unique<supul::supul::supul_t>();
...
return supul;
we can get supul api hints from ide tools (ex, visual studio).
you can use the lifetime api to open and close your project.
supul.api.lifetime.open("/temp/my_project");
walkthrough :: insert a csv file
supul supports incremental learning. it trains on a continuous series of datasets, and each dataset is called a chunk.
prepare the csv in the same format as the definition in attributes.json.
| salary | commission | age | elevel | car | zipcode | hvalue | hyears | loan | group |
|---|---|---|---|---|---|---|---|---|---|
| 111811.9025 | 0 | 50 | L2 | C16 | Z2 | 135000 | 9 | 374566.1561 | G1 |
| 62308.5782 | 33338.59959 | 52 | L3 | C3 | Z0 | 135000 | 6 | 64557.41339 | G1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
when creating a csv with weka, elevel, car, zipcode, and group are expressed only as numbers(actually nominal).
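since the csv header has to match the field names in attributes.json, a small check before inserting a chunk can catch a misspelled column. the sketch below is a hypothetical helper, not part of supul; it assumes a comma-separated header without quoting, embedded commas, or a trailing carriage return.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// splits one csv header line on commas (assumes no quoting).
std::vector<std::string> split_header(const std::string& line) {
    std::vector<std::string> names;
    std::stringstream ss(line);
    std::string name;
    while (std::getline(ss, name, ',')) names.push_back(name);
    return names;
}

// true when every field name appears in the header line.
bool header_has_fields(const std::string& header_line,
                       const std::vector<std::string>& fields) {
    auto names = split_header(header_line);
    std::unordered_set<std::string> present(names.begin(), names.end());
    for (const auto& field : fields) {
        if (present.count(field) == 0) return false;
    }
    return true;
}
```

for example, a header that spells the column as commision would fail the check against a field named commission.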
we can create agrawal dataset.csv as below.
$ java -classpath weka.jar weka.datagenerators.classifiers.classification.Agrawal -r temp -S 0 -n 100 -F 0 -P 0.005 > dataset.arff
$ java -classpath weka.jar weka.core.converters.CSVSaver -i dataset.arff -o dataset.csv
(see create_agrawal_dataset() function.)
insert the instances in the csv into the database.
supul.api.model.insert_chunk_csv("/temp/dataset.csv");
supul inserts all new incoming data into the database, so the database size increases continuously. techniques to keep it at a limited scale are required. this is in the TO-DO.
walkthrough :: update
insert stores only instance data, so we need to call update() for the next step. the update information includes things like the evaluation results for the current model. so, unlike insert, update requires extra time.
supul.api.model.update();
when update is called, the first model training will automatically proceed if the model has not yet been built.
update also stores statistical data(accuracy, etc.) for each chunk. this allows you to see how well the currently trained model reflects the new chunk.
walkthrough :: rebuild
when the trend in the data changes, the accuracy of the chunks decreases. rebuild() finds weak instances, re-trains only those parts, and combines them with the existing tree to overcome the loss of accuracy.
supul.api.model.rebuild();
if the rebuild results in somewhat lower accuracy, it rolls back to the previous state.
rebuild increases the size of the model because it is a continuous method of combining models. a way to keep the model at a limited scale is in the TO-DO.
rebuild is not yet automatically invoked by a trigger. calling rebuild under certain conditions is not yet implemented.
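until a trigger exists in the library, the caller has to decide when to call rebuild. one possible rule, shown purely as a hypothetical sketch, is to look at the mean accuracy of the most recent chunks. the function should_rebuild, the window size, and the threshold are all my own assumptions and not part of supul.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// hypothetical trigger: true when the mean accuracy of the last `window`
// chunks is below `threshold`. chunk_accuracy is ordered oldest to newest.
bool should_rebuild(const std::vector<double>& chunk_accuracy,
                    std::size_t window, double threshold) {
    assert(window > 0);
    if (chunk_accuracy.size() < window) return false;  // not enough history yet
    double sum = 0.0;
    for (std::size_t i = chunk_accuracy.size() - window;
         i < chunk_accuracy.size(); ++i) {
        sum += chunk_accuracy[i];
    }
    return (sum / static_cast<double>(window)) < threshold;
}
```

the per-chunk accuracy values would come from the statistics that update stores for each chunk. a caller could run this check after every update and call rebuild when it returns true.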
walkthrough :: predict
predict the y value for the x parameters passed to the current model. when changes (insert, update, rebuild, etc.) are in progress, database transactions ensure that the previous model is used. the x parameter is a map of (key, value), where key and value are strings. each value is automatically converted according to attributes.json.
std::unordered_map<std::string, std::string> x;
x = {{"salary", "1000.0"}, {"commission", "0.0"}, {"age", "25"},
     {"elevel", "3"}, {"car", "1"}, {"zipcode", "1"},
     {"hvalue", "132000"}, {"hyears", "3"}};
auto ret = supul.api.model.predict(x);
returned information of predict:
// predict result.
struct predict_result {
    bool error = false;
    std::string errormsg;
    int64_t label_index = 0;
    std::string label;
    int64_t correct_count = 0;
    int64_t total_count = 0;
    double accuracy = 0.0;
};
the label value is the predicted y value. label_index is the string table index of the label. correct_count, total_count, and accuracy are information of the leaf tree node classified in the decision tree.
these three values can be used as confidence information for prediction.
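one way to use these values as confidence is a simple acceptance rule. the sketch below copies the predict_result struct shown above. the function is_confident and its two thresholds are hypothetical choices for illustration, not something the library provides.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// copy of the predict_result struct shown above.
struct predict_result {
    bool error = false;
    std::string errormsg;
    std::int64_t label_index = 0;
    std::string label;
    std::int64_t correct_count = 0;
    std::int64_t total_count = 0;
    double accuracy = 0.0;
};

// hypothetical policy: accept a prediction only when the leaf treenode has
// seen at least min_total instances with at least min_accuracy.
bool is_confident(const predict_result& r, double min_accuracy,
                  std::int64_t min_total) {
    assert(min_accuracy >= 0.0 && min_accuracy <= 1.0);
    if (r.error) return false;
    return r.total_count >= min_total && r.accuracy >= min_accuracy;
}
```

a leaf with high accuracy but very few instances is rejected by the min_total check, which is why all three values are useful together.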
walkthrough :: report
current status can be output as json and gnuplot charts.
install gnuplot and add it to the path.
to get report as json:
auto ret = supul.api.report.json("");
if (not ret) {/* error */}
auto& json = ret.value();
to get report as gnuplot (png):
supul.api.report.gnuplot(json, {
{"terminal", "pngcairo"},
{"terminal_option", "font Times-New-Roman,10 size 800,800"},
{"output_filepath", "/tmp/chart.png"},
{"plt_filepath", "/tmp/gnuplot_script.plt"},
});
see configuring gnuplot for terminal and terminal option. if the terminal is dumb, it will output an ascii chart.
pretty processed json:
{
  "doc_ver": 1,
  "error": false,
  "category": {
    "global": {
      "schema_version": 1,
      "instance_count": 5000,
      "updated_instance_count": 5000,
      "instance_correct_count": 3766,
      "instance_accuracy": 0.7532,
      "acc_weak_instance_count": 4029
    },
    "confusion_matrix": {
      "label_name": [ "1", "0" ],
      ...
chart.png:
chunk_history allows you to see the current accuracy trend of the model and call rebuild if necessary for better accuracy.
the chunk_history above is the result of _develop.hpp::report(). it was processed in the following order (agrawal dataset).
- insert and update 10 chunks (func=1)
- insert and update 10 chunks (func=2)
- rebuild
- insert and update 10 chunks (func=2)
- rebuild
- insert and update 10 chunks (func=2)
- insert and update 10 chunks (func=1)
gnuplot_script.plt
# ${SET_TERMINAL}
# common
# data block
$data_block_chunk_history << EOD
0 0.99 100
...
# multiplot: chunk_history
set origin 0, 0.67
set size 1, 0.34
...
reset
unset key
unset multiplot
gnuplot's script leaves the terminal configuration blank to use the system defaults. set it yourself if necessary.
walkthrough :: wrap-up
stage 1: create a project. this is called only once at the start.
#include "gaenari/gaenari.hpp"
...
gaenari::logger::init1("/temp/log_gaenari.log");
std::string base_dir = "/temp/my_project";
supul::supul::supul_t::api::project::create(base_dir);
supul::supul::supul_t::api::project::set_property(base_dir, "db.type", "sqlite");
supul::supul::supul_t::api::project::add_field(base_dir, "salary", "REAL");
supul::supul::supul_t::api::project::add_field(base_dir, "commission", "REAL");
supul::supul::supul_t::api::project::add_field(base_dir, "group", "TEXT_ID");
// ... omit ...
supul::supul::supul_t::api::project::x(base_dir, {"salary", "commission", ..., "loan"});
supul::supul::supul_t::api::project::y(base_dir, "group");
stage 2: continuous insert of new data(chunks).
#include "gaenari/gaenari.hpp"
...
gaenari::logger::init1("/temp/log_gaenari.log");
supul::supul::supul_t supul;
supul.api.lifetime.open("/temp/my_project");
supul.api.model.insert_chunk_csv("/temp/dataset1.csv");
supul.api.model.update();
supul.api.model.insert_chunk_csv("/temp/dataset2.csv");
supul.api.model.update();
supul.api.model.insert_chunk_csv("/temp/dataset3.csv");
supul.api.model.update();
stage 3: predict an instance.
#include "gaenari/gaenari.hpp"
...
std::unordered_map<std::string, std::string> instance = {{"salary", "3"}, ...};
...
gaenari::logger::init1("/temp/log_gaenari.log");
supul::supul::supul_t supul;
supul.api.lifetime.open("/temp/my_project");
auto ret = supul.api.model.predict(instance);
auto& predicted = ret.label;
stage 4: rebuild due to data trend change.
#include "gaenari/gaenari.hpp"
...
gaenari::logger::init1("/temp/log_gaenari.log");
supul::supul::supul_t supul;
supul.api.lifetime.open("/temp/my_project");
supul.api.model.rebuild();
stage 5: analyze the report.
#include "gaenari/gaenari.hpp"
...
gaenari::logger::init1("/temp/log_gaenari.log");
supul::supul::supul_t supul;
supul.api.lifetime.open("/temp/my_project");
auto ret = supul.api.report.json("");
if (not ret) {/* error */}
auto& json = ret.value();
supul.api.report.gnuplot(json, {
{"terminal", "pngcairo"},
{"terminal_option", "font Times-New-Roman,10 size 800,800"},
{"output_filepath", "/tmp/chart.png"},
{"plt_filepath", "/tmp/gnuplot_script.plt"},
});
database
database is at the heart of supul. so, it is helpful to understand the database structure.
- the primary key for all tables is id.
- fields in the instance table are dynamically determined by attributes.json.
- fields with *ref_* are references to the id of another table.
- implementations that rely on a specific database are prohibited.
- fields used in the where clause are added to the index.
- numerous treenode queries are required while running predict, so the cache is used for performance.
- use prepared statements for security and performance.
a tool like DB Browser for SQLite makes it easier to understand the structure.

the sqlite database file with the extension .db is located under the project directory. the agrawal instances are in the instance table.
let's look for misclassified instances.
execute this query:
select instance.* from instance join instance_info on instance.id = instance_info.ref_instance_id where instance_info.correct = 0

5629 instances were found (id: 1, 3, 7, ...).

30000 - 24371 = 5629, matches the calculation result of the global value.
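the same arithmetic can be written against the global report values. the sketch below assumes that 30000 and 24371 correspond to instance_count and instance_correct_count in the global category of the report json, and that accuracy is the ratio of the two. the struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// subset of the "global" report values; field names follow the report json.
struct global_stats {
    std::int64_t instance_count = 0;
    std::int64_t instance_correct_count = 0;
};

// number of misclassified instances: total minus correct.
std::int64_t misclassified_count(const global_stats& g) {
    assert(g.instance_correct_count >= 0 &&
           g.instance_correct_count <= g.instance_count);
    return g.instance_count - g.instance_correct_count;
}

// assumption: accuracy is correct / total (0.0 when there are no instances).
double accuracy(const global_stats& g) {
    if (g.instance_count == 0) return 0.0;
    return static_cast<double>(g.instance_correct_count) /
           static_cast<double>(g.instance_count);
}
```

with instance_count 30000 and instance_correct_count 24371, misclassified_count gives 5629, the number of rows returned by the query.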
error handling
the supul api functions in the api categories are noexcept, so no exceptions are thrown. check for errors with the return value.
| return type | error |
|---|---|
| bool | false |
| std::optional | std::nullopt |
| struct | some bool member variable |
if an error occurs, check the reason by:
auto msg = supul.api.misc.errmsg();
errmsg() is not supported for static functions.
example
if x in the predict call does not have the required value:
std::unordered_map<std::string, std::string> x = {{"foobar", "1"}};
auto result = supul.api.model.predict(x);
if (result.error) {
    std::cout << "* supul.api.misc.errmsg(): " << supul.api.misc.errmsg() << std::endl;
}
the error occurred because x does not have a required age value. the code location and reason are printed. it is also returned by calling errmsg().
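this kind of error can also be caught before calling predict by comparing the map keys with the x fields. the helper below is a hypothetical sketch that is not part of supul, and the list of required names is assumed to be copied by the caller from the x array in attributes.json.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// returns the names in `required` that are absent from `x`,
// in the order they appear in `required`.
std::vector<std::string> missing_fields(
    const std::unordered_map<std::string, std::string>& x,
    const std::vector<std::string>& required) {
    std::vector<std::string> missing;
    for (const auto& name : required) {
        if (x.find(name) == x.end()) missing.push_back(name);
    }
    return missing;
}
```

if the returned vector is not empty, the caller can report the missing names directly instead of waiting for predict to fail.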
api list
here is the list of supported apis.
see the comments in the code for detail.
| category | static | name |
|---|---|---|
| project | O | create |
| project | O | set_property |
| project | O | add_field |
| project | O | x |
| project | O | y |
| lifetime | | open |
| lifetime | | close |
| model | | insert_chunk_csv |
| model | | update |
| model | | rebuild |
| model | | predict |
| report | | json |
| report | O | gnuplot |
| misc | O | version |
| misc | | errmsg |
| property | | set_property |
| property | | get_property |
| property | | save |
| property | | reload |
| test | | verify |
property
the property.txt file in the project directory is the configuration file.
call set_property() or modify it yourself. see the comments in property.txt for detail.
| name | change possible | type | default | desc |
|---|---|---|---|---|
| ver | | str | | library version |
| db.type | | str | none | support sqlite |
| db.tablename.prefix | | str | | set prefix table name |
| model.weak_treenode_condition.accuracy | O | double | 0.8 | see comment |
| model.weak_treenode_condition.total_count | O | int | 5 | see comment |
| limit.chunk.use | O | bool | true | see comment |
| limit.chunk.instance_lower_bound | O | int | 1000000 | see comment |
| limit.chunk.instance_upper_bound | O | int | 2000000 | see comment |









