No logging during FULL_TEXT index build phase (original) (raw)

Problem Description

When creating FULL_TEXT indexes on large datasets, the index build process is completely silent for extended periods (hours), making it impossible to determine if:

This is particularly problematic for datasets with millions of records where the build phase can take 5-10+ hours with zero feedback.

Current Behavior

LSM_TREE indexes provide good logging during compaction:

2025-11-08 15:34:58.737 WARNI [LSMTreeIndex] - Creating a new entry in index 'PostLink_0_...' root page [2024-02-12...]->385
2025-11-08 15:34:59.334 WARNI [LSMTreeIndex] Index 'PostLink_0_...' compacted in 83381ms (keys=5758994 values=6541209...)

FULL_TEXT indexes are silent during the build phase:

15:34:59 - Previous index completes
[5-10 HOURS OF SILENCE - NO LOGS]
23:45:00 - Compaction starts (finally some logs)

Reproduction

  1. Create a FULL_TEXT index on a large text field with millions of records:
    database.getSchema().createTypeIndex(
    SchemaImpl.INDEX_TYPE.FULL_TEXT,
    false,
    "Post",
    new String[] { "Body" }
    );
  2. Observe complete silence during the initial build phase (scanning records, tokenizing text, inserting into LSM tree)
  3. Only when compaction starts do logs appear

Root Cause

In LSMTreeIndex.java (line 626-657), the build() method has no logging:

public long build(final int buildIndexBatchSize, final BuildIndexCallback callback) { // ... setup code ...

db.scanBucket(db.getSchema().getBucketById(associatedBucketId).getName(), record -> {
    db.getIndexer().addToIndex(LSMTreeIndex.this, record.getIdentity(), (Document) record);
    total.incrementAndGet();
    
    // NO LOGGING HERE - could be processing for hours!
    
    if (total.get() % buildIndexBatchSize == 0) {
        db.getWrappedDatabaseInstance().commit();
        db.getWrappedDatabaseInstance().begin();
    }
    
    if (callback != null)
        callback.onDocumentIndexed((Document) record, total.get());
    
    return true;
});

// ... cleanup code ...

}

For FULL_TEXT indexes, this means:

Impact on Large Datasets

Based on testing with Stack Overflow dataset:

Dataset Size Records Build Time Logging
Medium (425K posts) 425,735 3m 25s ❌ None
Large (60M posts) 59,819,048 5-7 hours ❌ None

During the 5-7 hour build phase, there is zero indication that work is being done.

Proposed Solution

Add periodic progress logging in LSMTreeIndex.build():

public long build(final int buildIndexBatchSize, final BuildIndexCallback callback) { checkIsValid(); final AtomicLong total = new AtomicLong(); final long LOG_INTERVAL = 10000; // Log every 10K records

if (propertyNames == null || propertyNames.isEmpty())
    throw new IndexException("Cannot rebuild index '" + name + "' because metadata information are missing");

final DatabaseInternal db = getDatabase();

if (status.compareAndSet(INDEX_STATUS.AVAILABLE, INDEX_STATUS.UNAVAILABLE)) {
    
    LogManager.instance().log(this, Level.INFO, 
        "Building index '%s' on %d properties...", name, propertyNames.size());
    
    final long startTime = System.currentTimeMillis();
    
    db.scanBucket(db.getSchema().getBucketById(associatedBucketId).getName(), record -> {
        db.getIndexer().addToIndex(LSMTreeIndex.this, record.getIdentity(), (Document) record);
        total.incrementAndGet();
        
        // ADD: Periodic progress logging
        if (total.get() % LOG_INTERVAL == 0) {
            final long elapsed = System.currentTimeMillis() - startTime;
            final double rate = total.get() / (elapsed / 1000.0);
            LogManager.instance().log(this, Level.INFO, 
                "Building index '%s': processed %d records (%.0f records/sec)...", 
                name, total.get(), rate);
        }
        
        if (total.get() % buildIndexBatchSize == 0) {
            db.getWrappedDatabaseInstance().commit();
            db.getWrappedDatabaseInstance().begin();
        }
        
        if (callback != null)
            callback.onDocumentIndexed((Document) record, total.get());
        
        return true;
    });
    
    // ADD: Completion logging
    final long elapsed = System.currentTimeMillis() - startTime;
    LogManager.instance().log(this, Level.INFO, 
        "Completed building index '%s': processed %d records in %dms", 
        name, total.get(), elapsed);
    
    status.set(INDEX_STATUS.AVAILABLE);
    
} else
    throw new NeedRetryException("Error on building index '" + name + "' because not available");

return total.get();

}

Expected Output (After Fix)

2025-11-08 15:35:00 INFO [LSMTreeIndex] Building index 'Post[Body]' on 1 properties...
2025-11-08 15:35:12 INFO [LSMTreeIndex] Building index 'Post[Body]': processed 10000 records (833 records/sec)...
2025-11-08 15:35:24 INFO [LSMTreeIndex] Building index 'Post[Body]': processed 20000 records (833 records/sec)...
2025-11-08 15:35:36 INFO [LSMTreeIndex] Building index 'Post[Body]': processed 30000 records (833 records/sec)...
...
2025-11-08 21:45:23 INFO [LSMTreeIndex] Completed building index 'Post[Body]': processed 59819048 records in 22823000ms
2025-11-08 21:45:24 INFO [LSMTreeIndex] Compacting index 'Post_0_...' (pages=150000 pageSize=262144...)

Benefits

  1. Progress visibility: Users can monitor index build progress in real-time
  2. Time estimation: Records/sec rate allows estimating time to completion
  3. Debugging: Immediately identify if process is stuck vs. just slow
  4. Consistency: Matches existing logging behavior during compaction phase
  5. Operations: Easier to monitor production index builds

Environment