LSMTree compaction creates duplicate timestamped indexes that are not cleaned up
Description
When creating indexes on large datasets (33.8M records), ArcadeDB's LSMTree compaction process creates multiple timestamped duplicate indexes that persist in the database instead of being cleaned up after compaction completes.
Steps to Reproduce
- Import a large dataset (e.g., MovieLens ml-latest with 33,832,163 ratings)
- Create indexes on the imported data:
CREATE INDEX ON Movie (movieId) UNIQUE
CREATE INDEX ON Rating (userId) NOTUNIQUE
CREATE INDEX ON Rating (movieId) NOTUNIQUE
CREATE INDEX ON Link (movieId) UNIQUE
CREATE INDEX ON Tag (movieId) NOTUNIQUE
- Query the schema to see all indexes:
SELECT name, typeName, properties, unique, automatic FROM schema:indexes ORDER BY typeName, name
Expected Behavior
Expected 5 indexes total (one per CREATE INDEX command).
Actual Behavior
Found 80 indexes instead of 5 - with 15+ timestamped duplicates per table:
- Movie[movieId] (expected)
- Movie_0_172987397898984 (duplicate)
- Movie_1_172987421520553 (duplicate)
- Movie_2_172987445142122 (duplicate)
- ... (13 more duplicates)
All duplicates are marked as automatic=true.
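To quantify the duplication, the rows returned by the schema query can be grouped by type. The following is a minimal sketch, assuming the query result has already been converted to a list of dicts with `name` and `typeName` keys (that conversion depends on the bindings and is not shown) and that the timestamped entries report the same `typeName` as the expected index. `count_indexes_by_type` is a hypothetical helper, and the sample rows reuse only the names listed above.

```python
from collections import Counter


def count_indexes_by_type(rows):
    """Count index entries per type from schema:indexes rows (as dicts)."""
    return Counter(row["typeName"] for row in rows)


# Sample rows built from the names reported above; not a full query result.
# Assumption: timestamped entries carry the same typeName as the main index.
rows = [
    {"name": "Movie[movieId]", "typeName": "Movie"},
    {"name": "Movie_0_172987397898984", "typeName": "Movie"},
    {"name": "Movie_1_172987421520553", "typeName": "Movie"},
    {"name": "Movie_2_172987445142122", "typeName": "Movie"},
]

counts = count_indexes_by_type(rows)
for type_name, n in sorted(counts.items()):
    print(f"{type_name}: {n} index entries")
```

Any type reporting more entries than the number of CREATE INDEX commands issued for it is showing the behavior described here.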
Analysis
Based on source code review:
- LSMTreeIndexMutable.java (line 168):
public LSMTreeIndexCompacted createNewForCompaction() {
  final String newName = componentName.substring(0, last_) + "_" + System.nanoTime();
  return new LSMTreeIndexCompacted(..., newName, ...);
}
- LSMTreeIndex.java (line 548):
protected LSMTreeIndexMutable splitIndex(...) {
  final String newName = mutable.getName().substring(0, last_) + "_" + System.nanoTime();
  final LSMTreeIndexMutable newMutableIndex = new LSMTreeIndexMutable(..., newName, ...);
}
These timestamped index files are created during compaction but appear not to be properly cleaned up after compaction completes.
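The naming scheme in both snippets can be mirrored in a few lines of Python to show why the observed names look the way they do. This is only an illustrative sketch: it assumes `last_` is the position of the last underscore in the component name, which the excerpts above do not confirm, and `timestamped_name` is a hypothetical helper rather than anything in ArcadeDB.

```python
def timestamped_name(component_name: str, nano_time: int) -> str:
    """Rebuild a name as <prefix up to last underscore>_<nano_time>."""
    # Assumption: `last_` in the Java excerpts is the last underscore index.
    last_underscore = component_name.rfind("_")
    if last_underscore < 0:
        raise ValueError(f"no underscore in component name: {component_name!r}")
    return component_name[:last_underscore] + "_" + str(nano_time)


# Each call keeps the prefix and swaps in a fresh timestamp suffix.
print(timestamped_name("Movie_0_172987397898984", 172987421520553))
# prints: Movie_0_172987421520553
```

Under that assumption, every compaction or split yields a new name with the same prefix and a different suffix, which matches the shape of the duplicates listed under Actual Behavior.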
Impact
- Functional: ✅ Queries work correctly using the main indexes
- Performance: ⚠️ Duplicates don't affect query speed but waste disk space
- Storage: ❌ 16x storage overhead for index files
Environment
- Dataset: MovieLens ml-latest (33,832,163 ratings, 86,538 movies, 2,328,316 tags, 9,742 links)
- ArcadeDB: Python bindings via arcadedb_embedded
- JVM Heap: 8GB
- Database: Embedded mode
Logs
During index creation on large dataset:
⚠️ Index creation failed: Command failed: com.arcadedb.exception.NeedRetryException:
Cannot create a new index while asynchronous tasks are running (LSMTreeIndexCompactor)
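Since the exception name suggests the operation can be retried, one way to cope with this on the client side is a small retry wrapper. The sketch below assumes the bindings surface the Java exception class name in the error message, as in the log line above. `run_with_retry` and `flaky_create_index` are hypothetical names, and the actual index-creation call is left to the caller because the binding API is not shown in this report.

```python
import time


def run_with_retry(action, attempts=5, delay_seconds=1.0, sleep=time.sleep):
    """Call action(); retry while the error text mentions NeedRetryException."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:
            retryable = "NeedRetryException" in str(exc)
            if not retryable or attempt == attempts:
                raise
            sleep(delay_seconds * attempt)  # linear backoff between attempts


# Stand-in for an index-creation call that fails twice while compaction runs.
attempts_seen = []


def flaky_create_index():
    attempts_seen.append(1)
    if len(attempts_seen) < 3:
        raise RuntimeError(
            "Command failed: com.arcadedb.exception.NeedRetryException: ..."
        )
    return "created"


result = run_with_retry(flaky_create_index, sleep=lambda _seconds: None)
print(result, "after", len(attempts_seen), "attempts")
```

This only works around the failed CREATE INDEX call; it does not address the duplicate indexes themselves.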
LSMTree compaction logs show:
LSMTreeIndex 'Movie[movieId]' compacted 50 pages, remaining 0 pages
(totalKeys=289037 totalValues=2251732)
Questions
- Are timestamped index files intended to be temporary during compaction?
- Should they be automatically cleaned up after compaction completes?
- Is there a configuration to control compaction cleanup behavior?
Suggested Fix
After compaction completes, cleanup logic should:
- Identify timestamped index files matching pattern {indexName}_\d+
- Remove them from schema if they're marked as temporary/compaction artifacts
- Delete the corresponding physical files
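The identification step could look roughly like the following. This is a sketch of the matching logic only, not a proposed patch: the regular expression is inferred from the names reported in this issue (for example `Movie_0_172987397898984`), and `split_indexes` is a hypothetical helper. A user-created index whose name happens to have the same shape would also match, so name matching alone is probably not a safe deletion criterion.

```python
import re

# Assumption: shape inferred from the reported names,
# i.e. <type>_<number>_<numeric timestamp>, e.g. Movie_0_172987397898984.
TIMESTAMPED = re.compile(r"(?P<type>.+)_(?P<seq>\d+)_(?P<stamp>\d+)")


def split_indexes(names):
    """Split names into (other, candidates): candidates look timestamped."""
    other, candidates = [], []
    for name in names:
        if TIMESTAMPED.fullmatch(name):
            candidates.append(name)
        else:
            other.append(name)
    return other, candidates


other, candidates = split_indexes(
    ["Movie[movieId]", "Movie_0_172987397898984", "Movie_1_172987421520553"]
)
print("keep:", other)
print("cleanup candidates:", candidates)
```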
Workaround
Users can manually drop timestamped indexes:
DROP INDEX Movie_0_172987397898984;
-- Repeat for all timestamped duplicates
However, this requires knowing which indexes are duplicates vs. legitimate user-created indexes.
Tested on 25.10.1-SNAPSHOT
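For the manual workaround, one way to avoid guessing which indexes are duplicates is to start from the list of indexes that were created on purpose and treat everything else as a candidate. The sketch below generates the DROP statements in the form used above. `drop_statements` is a hypothetical helper; only `Movie[movieId]` is a name confirmed in this report, and the other entries in `keep` are assumed to follow the same `Type[property]` form. The output should be reviewed before anything is executed.

```python
def drop_statements(all_names, keep_names):
    """Return DROP INDEX statements for every name not in the keep list."""
    keep = set(keep_names)
    return [f"DROP INDEX {name};" for name in all_names if name not in keep]


# Assumption: the five intentionally created indexes are named Type[property].
keep = [
    "Movie[movieId]",
    "Rating[userId]",
    "Rating[movieId]",
    "Link[movieId]",
    "Tag[movieId]",
]
all_names = [
    "Movie[movieId]",
    "Movie_0_172987397898984",
    "Movie_1_172987421520553",
]

statements = drop_statements(all_names, keep)
for statement in statements:
    print(statement)
```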