LSMTree compaction creates duplicate timestamped indexes that are not cleaned up

Description

When creating indexes on large datasets (33.8M records), ArcadeDB's LSMTree compaction process creates multiple timestamped duplicate indexes that persist in the database instead of being cleaned up after compaction completes.

Steps to Reproduce

Import a large dataset (e.g., MovieLens ml-latest with 33,832,163 ratings)
Create indexes on the imported data:

CREATE INDEX ON Movie (movieId) UNIQUE
CREATE INDEX ON Rating (userId) NOTUNIQUE
CREATE INDEX ON Rating (movieId) NOTUNIQUE
CREATE INDEX ON Link (movieId) UNIQUE
CREATE INDEX ON Tag (movieId) NOTUNIQUE

Query the schema to see all indexes:

SELECT name, typeName, properties, unique, automatic FROM schema:indexes ORDER BY typeName, name

Expected Behavior

Expected 5 indexes total (one per CREATE INDEX command).

Actual Behavior

Found 80 indexes instead of 5, with 15+ timestamped duplicates per table:

Movie[movieId] (expected)
Movie_0_172987397898984 (duplicate)
Movie_1_172987421520553 (duplicate)
Movie_2_172987445142122 (duplicate)
... (13 more duplicates)
All duplicates are marked as automatic=true.
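
To quantify the duplication, the index names returned by the schema query can be grouped by type. The sketch below is plain Python over a list of names and does not call ArcadeDB. It assumes the duplicates follow the <Type>_<n>_<digits> shape seen in the listing above, and count_timestamped is a hypothetical helper.

```python
import re
from collections import Counter

# Assumption: duplicates look like "<Type>_<n>_<digits>", as in the listing above.
TIMESTAMPED = re.compile(r"^(?P<type>.+)_\d+_\d+$")

def count_timestamped(index_names):
    """Return a Counter mapping type name -> number of timestamped index names."""
    counts = Counter()
    for name in index_names:
        match = TIMESTAMPED.match(name)
        if match:
            counts[match.group("type")] += 1
    return counts

names = [
    "Movie[movieId]",
    "Movie_0_172987397898984",
    "Movie_1_172987421520553",
    "Movie_2_172987445142122",
]
print(count_timestamped(names))  # Counter({'Movie': 3})
```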

Analysis

Based on source code review:

LSMTreeIndexMutable.java (line 168):

public LSMTreeIndexCompacted createNewForCompaction() {
  final String newName = componentName.substring(0, last_) + "_" + System.nanoTime();
  return new LSMTreeIndexCompacted(..., newName, ...);
}

LSMTreeIndex.java (line 548):

protected LSMTreeIndexMutable splitIndex(...) {
  final String newName = mutable.getName().substring(0, last_) + "_" + System.nanoTime();
  final LSMTreeIndexMutable newMutableIndex = new LSMTreeIndexMutable(..., newName, ...);
}

These timestamped index files are created during compaction but appear not to be properly cleaned up after compaction completes.
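
The naming scheme in both excerpts can be mimicked outside the JVM to see why each call yields a fresh name. This Python sketch assumes last_ is the position of the last underscore in the component name (the excerpts do not show its definition) and uses time.monotonic_ns() as a rough analog of System.nanoTime(). new_component_name is a hypothetical helper, not ArcadeDB code.

```python
import time

def new_component_name(component_name, now_ns=None):
    """Mimic substring(0, last_) + "_" + System.nanoTime() from the excerpts."""
    # Assumption: last_ is the index of the last underscore in the name.
    last_underscore = component_name.rfind("_")
    if last_underscore < 0:
        raise ValueError("expected a name containing '_'")
    if now_ns is None:
        now_ns = time.monotonic_ns()  # rough analog of System.nanoTime()
    return component_name[:last_underscore] + "_" + str(now_ns)

# Only the old suffix is replaced, so the "Movie_0" prefix is kept.
print(new_component_name("Movie_0_172987397898984", now_ns=172987421520553))
# Movie_0_172987421520553
```

If the previous component is never removed, each such call would leave one more name sharing the same prefix, which matches the listing under Actual Behavior.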

Impact

Functional: ✅ Queries work correctly using the main indexes
Performance: ⚠️ Duplicates don't affect query speed but waste disk space
Storage: ❌ 16x storage overhead for index files

Environment

Dataset: MovieLens ml-latest (33,832,163 ratings, 86,538 movies, 2,328,316 tags, 9,742 links)
ArcadeDB: Python bindings via arcadedb_embedded
JVM Heap: 8GB
Database: Embedded mode

Logs

During index creation on large dataset:

⚠️ Index creation failed: Command failed: com.arcadedb.exception.NeedRetryException:
Cannot create a new index while asynchronous tasks are running (LSMTreeIndexCompactor)

LSMTree compaction logs show:

LSMTreeIndex 'Movie[movieId]' compacted 50 pages, remaining 0 pages
(totalKeys=289037 totalValues=2251732)

Questions

Are timestamped index files intended to be temporary during compaction?
Should they be automatically cleaned up after compaction completes?
Is there a configuration to control compaction cleanup behavior?

Suggested Fix

After compaction completes, cleanup logic should:

Identify timestamped index files matching pattern {indexName}_\d+
Remove them from schema if they're marked as temporary/compaction artifacts
Delete the corresponding physical files
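
The identification step could look like the sketch below. It is Python rather than ArcadeDB code, and it widens the suggested pattern to <typeName>_<n>_<digits> so that it matches the names in the listing; artifacts_for is a hypothetical helper. A name match alone cannot prove that an index is a compaction artifact, so a real fix would presumably also need an explicit temporary marker.

```python
import re

def artifacts_for(type_name, index_names):
    """Return names of the form <type_name>_<n>_<digits>, in input order."""
    # re.escape keeps special characters in a type name from acting as regex syntax.
    pattern = re.compile(re.escape(type_name) + r"_\d+_\d+")
    return [name for name in index_names if pattern.fullmatch(name)]

names = ["Movie[movieId]", "Movie_0_172987397898984", "MovieExtra_0_1", "Rating_3_42"]
print(artifacts_for("Movie", names))  # ['Movie_0_172987397898984']
```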

Workaround

Users can manually drop timestamped indexes:

DROP INDEX Movie_0_172987397898984; -- Repeat for all timestamped duplicates
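
Writing these statements by hand for 75 names is error-prone. One hedged alternative is an allowlist: name the indexes created on purpose and treat every other index name as a candidate. The Python sketch below only builds statement text for review and executes nothing; drop_statements is a hypothetical helper, and the output should be checked before any statement is run.

```python
def drop_statements(all_index_names, user_created):
    """Build DROP INDEX statements for every index name not in the allowlist."""
    keep = set(user_created)
    # Same statement form as the workaround above; names are emitted unquoted.
    return ["DROP INDEX " + name + ";" for name in all_index_names if name not in keep]

all_names = ["Movie[movieId]", "Movie_0_172987397898984", "Movie_1_172987421520553"]
for statement in drop_statements(all_names, user_created=["Movie[movieId]"]):
    print(statement)
```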

However, this requires knowing which indexes are duplicates vs. legitimate user-created indexes.

Tested on 25.10.1-SNAPSHOT