NEWS
groupdata2 2.0.5
- Fixes bug in all_groups_identical() when there were different numbers of groups in the two input vectors.
groupdata2 2.0.4
- Updates documentation.
groupdata2 2.0.3
- Fixes some warnings.
- Fixes rounding error issue on PowerPC (https://github.com/LudvigOlsen/groupdata2/issues/10). Thanks @barracuda156.
groupdata2 2.0.2
- Makes use of suggested packages conditional.
- Makes testing conditional on the availability of xpectr.
- Fixes tidyselect-related warnings.
- Removes hydroGOF from suggested packages.
groupdata2 2.0.1
- Regenerates documentation.
groupdata2 2.0.0
Summary
This version introduces collapse_groups() and friends, as well as summarize_balances() and ranked_balances(). It also improves numerical balancing in fold(), which breaks reproducibility.
Changes
- Breaking: The numerical balancing (num_col) in fold() gets multiple improvements. This breaks reproducibility in some contexts.
  - Fixes bug with selection of groups to redistribute when extreme_pairing_levels > 1. The groupings were likely to be fine, but the fix should give better groupings on average.
  - When possible, it redistributes the smallest and/or largest group if they are 1 standard deviation from the second smallest/largest group to avoid imbalances due to very small/large scores.
  - Adds use of extreme triplet grouping when too few grouping columns are created with extreme pairing. This can lead to an increase in the number of created fold columns. In some cases, these groupings may be more balanced than with extreme pairing, but on average extreme pairing leads to more balanced groupings. See rearrr::triplet_extremes() for more on extreme triplet grouping.
  - Adds argument use_of_triplets in fold() to allow using extreme triplet grouping instead of extreme pairing or disabling it completely.
- Adds collapse_groups() for collapsing a set of existing groups into a smaller set of groups. Can balance the new groups by size and by numeric, categorical and ID columns. The more of these you balance at a time, the less balanced each will tend to be. Compare settings by summarizing the balances with summarize_balances() afterwards. For creating the most balanced groups, enable auto_tune.
- Adds collapse_groups_by_size(), collapse_groups_by_numeric(), collapse_groups_by_levels(), and collapse_groups_by_ids(). These are wrappers of collapse_groups() for a simplified interface.
- Adds summarize_balances() for inspecting the balance of numeric, categorical, and ID columns in-and-between groups.
- Adds ranked_balances() for extracting the across-group standard deviations of balances from the output of summarize_balances(). The standard deviations are a measure of how balanced a split is.
- Adds "every" method to grouping functions. Groups every n data points together.
- Prepares package’s tests for checkmate 2.1.0.
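
The idea that an across-group standard deviation measures how balanced a split is can be illustrated in base R. This is a minimal sketch, not the package's implementation: the data, the helper across_group_sd(), and the choice of per-group means as the summarized quantity are all assumptions made for illustration.

```r
# Hypothetical data: a numeric score and two candidate groupings.
df <- data.frame(
  score = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  grp_a = factor(c(1, 2, 3, 1, 2, 3, 1, 2, 3)),  # interleaved split
  grp_b = factor(c(1, 1, 1, 2, 2, 2, 3, 3, 3))   # consecutive split
)

# Hypothetical helper: standard deviation of the per-group means.
# A lower value means the groups are more alike on `values`.
across_group_sd <- function(values, groups) {
  group_means <- tapply(values, groups, mean)
  sd(group_means)
}

sd_a <- across_group_sd(df$score, df$grp_a)
sd_b <- across_group_sd(df$score, df$grp_b)

# The grouping with the smaller across-group SD is the more balanced one.
sd_a < sd_b
```

Here the interleaved split has the smaller standard deviation, so it would be preferred when comparing the two.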
groupdata2 1.5.0
- Breaking: Rewrites large parts of the numerical balancing engine used in fold() and partition(). This produces different groups in some cases. Outsources extreme pairing to rearrr::pair_extremes(). Now uses hierarchical shuffling (rearrr::shuffle_hierarchy()) in partition() and some cases of fold() (relevant when extreme_pairing_levels > 1). If you need reproducibility, the last version prior to this breaking change can be installed with devtools::install_github("ludvigolsen/groupdata2@v1.4.2").
- Imports rearrr for use in numerical balancing.
- Minor improvements to vignettes.
groupdata2 1.4.2
- Improves documentation for core grouping functions.
groupdata2 1.4.1
- Adds summarize_group_cols() for finding the number of groups per fold column along with statistics about the number of rows per group.
- Breaking: Fixes internal sorting of fold columns. This sometimes changes the order of fold columns, compared to the previous version.
- Adds tidyr as a required dependency. Previously, it was suggested.
groupdata2 1.4.0
- Breaking: In fold(), the k argument can now be a multi-element vector with one k (number of folds) per fold column. This functionality required a minor rewrite, which is why you might see interchanged fold column names in comparison to the previous versions.
- Bug fix: In fold() and partition(), specifying multiple cat_col columns and num_col in the same call would fail. This now works.
groupdata2 1.3.0
- Breaking: The following functions now work with grouped data.frames (meaning that they are applied group-wise): fold(), partition(), group(), group_factor(), splt(), balance(), upsample(), downsample(), differs_from_previous(), and find_missing_starts(). A message is generated once per session, when the input is grouped, to help users understand why their code is breaking.
groupdata2 1.2.1
- checkmate compatibility.
- Small speed up of n_dist grouping method.
groupdata2 1.2.0
- Adds Zenodo DOI for easier citation.
- Adds lifecycle badges to function documentation.
- Adds argument handle_na to differs_from_previous() and find_starts().
- Bug fix: In grouping functions with method l_starts and n = "auto", NAs are now replaced by a unique value before finding group starts. E.g. c(1,1,1,2,2,NA,NA,4,4) yields 4 groups.
- More explicit: the data argument in fold() and partition() takes a data frame, not a vector.
- Possibly breaking change: Adds checkmate input checks. Improves error messages but also restricts behavior.
- Adds xpectr as suggested package. Doubles number of unit tests.
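
The NA handling described for l_starts with n = "auto" can be sketched in base R. This is an illustration only, not the package's code: the choice of sentinel value and the variable names are assumptions.

```r
x <- c(1, 1, 1, 2, 2, NA, NA, 4, 4)

# Assumption: replace every NA with one sentinel value that does not
# otherwise occur in x, so a run of NAs behaves like a run of equal values.
sentinel <- max(x, na.rm = TRUE) + 1
x_filled <- ifelse(is.na(x), sentinel, x)

# A group starts at the first element and wherever the value changes.
is_start <- c(TRUE, x_filled[-1] != x_filled[-length(x_filled)])

# Numbering the starts cumulatively gives one group id per element.
group_ids <- cumsum(is_start)
group_ids
```

Without the replacement, comparing against NA would yield NA instead of TRUE or FALSE, and the starts could not be counted this way.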
groupdata2 1.1.2
- Adds all_groups_identical() for testing if two grouping factors contain the same groups, looking only at the group members, allowing for different group names / identifiers.
- Unit tests were made compatible with R versions lower than 3.6.
- Adds badges to README, including travis-ci status, AppVeyor status, Codecov, min. required R version, CRAN version and monthly CRAN downloads. Note: Zenodo badge will be added post release.
groupdata2 1.1.1
- Bug fix: fold() ungroups dataset before removing existing fold columns.
- Unit tests are skipped on R versions lower than 3.6.
groupdata2 1.1.0
- New main function: balance() used for up- and downsampling of data to balance sample size within categories and IDs. Thanks for the request from @jjesusfilho (#3).
- New wrapper function: upsample() wraps balance() with size="max".
- New wrapper function: downsample() wraps balance() with size="min".
- Adds parameter num_col to fold() and partition() for balancing on a numeric column.
- Adds parameter id_aggregation_fn to fold() and partition(). Used when balancing on both id_col and num_col.
- Adds helper tool differs_from_previous(). Finds values in a vector that differ from the previous value by some threshold. Similar to find_starts().
- Adds parameter num_fold_cols to fold(). Useful for creating multiple fold columns for repeated cross-validation.
- Adds parameter unique_fold_cols_only to fold(). Whether to only include unique fold columns or not.
- Adds parameter max_iters to fold(). How many times to attempt creating unique fold columns. Note that it is possible to get fewer fold columns than specified in num_fold_cols.
- Adds parameter parallel to fold(). When creating multiple unique fold columns, we can run the column comparisons in parallel. Requires a registered parallel backend.
- Adds parameter handle_existing_fold_cols to fold(). When calling fold() on a data frame that already contains columns with names starting with ".folds", we can either keep them and add more, or replace them.
- Fixed behavior in fold() when k is a percentage (between 0-1). It is now interpreted as the approximate size of each fold and used to calculate the number of folds. E.g. k=0.2 will lead to 5 folds.
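
The percentage interpretation of k can be sketched in base R. This is a rough illustration under stated assumptions: the helper folds_from_percentage() is hypothetical, and rounding 1 / k to the nearest whole number is an assumption, since these notes do not say how the package rounds.

```r
# Hypothetical helper: turn a fold size given as a percentage into a
# number of folds. Assumption: round 1 / k to the nearest whole number.
folds_from_percentage <- function(k) {
  stopifnot(is.numeric(k), length(k) == 1, k > 0, k < 1)
  as.integer(round(1 / k))
}

n_folds <- folds_from_percentage(0.2)

# Spread 20 hypothetical rows over the folds as evenly as possible.
# This is only for illustration; fold() assigns rows in its own way.
fold_ids <- rep_len(seq_len(n_folds), 20)
table(fold_ids)
```

With k = 0.2, each fold holds roughly 20% of the rows, which is where the 5 folds come from.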
groupdata2 1.0.0
- New main function: partition() - used for creating balanced partitions by partition sizes.
- New method category: l_ methods - n is passed as a list.
- New method: l_sizes - Uses list of group sizes to create grouping factor. Can be used for partitioning (e.g. n = c(0.2, 0.3) returns 3 groups with 0.2 (20%), 0.3 (30%) and the remaining 0.5 (50%) of the data points).
- New method: l_starts - Uses list of start positions to create groups. Define which values from a vector to start a new group at. Skip to later appearances of a value. Use n = "auto" to automatically find starts using find_starts().
- New helper tool: find_starts() - Finds start positions in a vector. I.e. values that differ from the previous value. Get the values or indices of the values. Output can be used as n in l_starts method.
- New helper tool: find_missing_starts() - Returns the start positions that would be recursively removed when using the l_starts method with remove_missing_starts set to TRUE.
- Added argument remove_missing_starts to grouping functions. Recursively remove the starting positions not found with l_starts method.
- New method: primes - similar to staircase but with primes as steps (e.g. group sizes 2,3,5,7..).
- New remainder tool: %primes% - similar to %staircase% but for the new primes method.
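
The group sizes produced by the l_sizes and primes methods can be sketched in base R. This is an illustration under stated assumptions, not the package's code: the row count is made up, percentages are rounded to whole rows, and the last primes group simply takes whatever rows are left. How the package itself rounds and handles leftover rows is not covered in these notes.

```r
n_rows <- 20  # hypothetical number of data points

# l_sizes-style: 20% and 30% are requested, the rest forms a third group.
# Assumption: percentages are rounded to whole rows.
sizes <- round(c(0.2, 0.3) * n_rows)
sizes <- c(sizes, n_rows - sum(sizes))
l_sizes_groups <- rep(seq_along(sizes), times = sizes)
table(l_sizes_groups)

# primes-style: group sizes follow the primes 2, 3, 5, 7, ...
# Assumption: the final group takes the rows that remain.
primes <- c(2, 3, 5, 7, 11, 13)
ends <- cumsum(primes)  # last row index of each full-sized group
primes_groups <- findInterval(seq_len(n_rows) - 1, ends) + 1
table(primes_groups)
```

With 20 rows, the first sketch gives groups of 4, 6 and 10 rows, and the second gives groups of 2, 3, 5 and 7 rows plus a final group with the 3 remaining rows.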
groupdata2 0.1.0
- Submitted package to CRAN.
- Main functions and tools of this version are group_factor(), group(), splt(), fold(), and %staircase%.
groupdata2 0.0.0.9000
- Created package :)