[TEXT-228] StringTokenizer performance degradation when parsing large lines

After upgrading from Apache Commons Text 1.9 to 1.10.0, we noticed that our system "hangs" (or, more likely, takes an excessively long time) when splitting large lines (100MB+ in size) with StringTokenizer.

Mitigation: Revert to Apache Commons Text 1.9

Scala version:

scala -version
Scala code runner version 2.12.14 -- Copyright 2002-2021, LAMP/EPFL and Lightbend, Inc.

Java version:

java -version
openjdk version "1.8.0_382"
OpenJDK Runtime Environment (build 1.8.0_382-b05)
OpenJDK 64-Bit Server VM (build 25.382-b05, mixed mode)

Reproduction Steps:

Generate a sample large file

echo -n '"SOME TEXT WITH SPACE" "SOME TEXT WITH SPACE" ' > largefile
dd if=/dev/zero bs=100MB count=1 >> largefile
sed -ie "s/\x0/0/g" largefile
echo -n "\0" >> largefile
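The commands above depend on shell- and GNU-specific behavior: the `100MB` suffix in dd, `sed -ie` (which GNU sed reads as in-place editing with a backup suffix of `e`), and an `echo` that interprets `\0`. The following is a minimal sketch of an equivalent generator, assuming the intent is one line made of the two quoted tokens, 100,000,000 `0` characters, and a single trailing NUL byte (which is what zsh's builtin echo would append). The function name `make_largefile` and its parameters are hypothetical, and `head -c` is assumed to be available (it is on GNU and BSD systems).

```shell
# Hypothetical helper: make_largefile OUT [SIZE_BYTES]
# Writes the two quoted tokens, SIZE_BYTES '0' characters, and one NUL byte.
# SIZE_BYTES defaults to 100000000, matching GNU dd's bs=100MB count=1.
make_largefile() {
  out="$1"
  size="${2:-100000000}"

  # Leading quoted tokens, with no trailing newline.
  printf '%s' '"SOME TEXT WITH SPACE" "SOME TEXT WITH SPACE" ' > "$out"

  # Zero bytes from /dev/zero, translated to the ASCII character '0'.
  head -c "$size" /dev/zero | tr '\0' '0' >> "$out"

  # Assumption: the original final echo was meant to append one NUL byte.
  printf '\0' >> "$out"
}

# Usage (not run automatically):
#   make_largefile largefile
```

Passing a smaller size as the second argument (for example `make_largefile largefile 1000000`) may help when checking how the run time grows with line length.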

Setup reproduce.scala

import org.apache.commons.text.StringTokenizer
val lines = scala.io.Source.fromFile("./largefile").getLines.toList
val st: StringTokenizer = new StringTokenizer(lines(0))
val res = st.getTokenArray()

Download the Apache Commons Text jars (commons-text-1.9.jar and commons-text-1.10.0.jar)
Run program with a 10 second timeout (1.10 seems to hang for >1 minute)

time timeout 10 scala -J-Xmx2g -cp commons-text-1.9.jar reproduce.scala
timeout 10 scala -J-Xmx2g -cp commons-text-1.9.jar reproduce.scala  2.60s user 0.83s system 121% cpu 2.818 total

time timeout 10 scala -J-Xmx2g -cp commons-text-1.10.0.jar reproduce.scala
timeout 10 scala -J-Xmx2g -cp commons-text-1.10.0.jar reproduce.scala  0.02s user 0.00s system 0% cpu 10.002 total

As the output above shows, 1.9 finishes in ~3 seconds, whereas 1.10.0 is killed by the 10-second timeout. I have not measured how long 1.10.0 takes to complete, but it seems to run for more than 1 minute.