Issue 11740: difflib html diff takes extremely long
If you diff the attached files with difflib and generate an HTML diff, it takes 10 minutes or more. In comparison, other diff tools such as WinDiff and Araxis Merge show the diff within a second.
The example code I'm using is:
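import difflib

# Read both files as lists of lines.
sourceText = open("source.xml", "rU").readlines()
targetText = open("target.xml", "rU").readlines()

# Build a side-by-side HTML diff showing 10 lines of context around each change.
html_diff = difflib.HtmlDiff(tabsize=4)
result = html_diff.make_file(sourceText, targetText, "Source", "Target", context=True, numlines=10)

# Write the report to disk.
f = open('c:/libdiff_html.html', 'w')
f.write(result)
f.close()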
The culprit seems to be Differ._fancy_replace. It contains a nasty quadratic loop with fairly complex internal code. I have made a quick fix that makes the example run in under a second, at the expense of not calling _fancy_replace for longer chunks and using _plain_replace instead.
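
To illustrate the idea (this is not the attached patch, only a minimal sketch of the same fallback), a Differ subclass can route long replace blocks to _plain_replace. The class name CappedDiffer and the MAX_FANCY_LINES threshold are hypothetical, and the sketch relies on private difflib methods whose signatures may change between Python versions.

```python
import difflib


class CappedDiffer(difflib.Differ):
    """Differ that skips the intraline comparison for long replace blocks."""

    # Hypothetical threshold, not taken from the patch: blocks with more
    # lines than this on either side are dumped without '?' hint lines.
    MAX_FANCY_LINES = 100

    def _fancy_replace(self, a, alo, ahi, b, blo, bhi):
        too_long = (ahi - alo > self.MAX_FANCY_LINES
                    or bhi - blo > self.MAX_FANCY_LINES)
        if too_long:
            # Linear time: emit the old lines and the new lines as-is.
            yield from self._plain_replace(a, alo, ahi, b, blo, bhi)
        else:
            # Short block: keep the quadratic but more readable fancy output.
            yield from super()._fancy_replace(a, alo, ahi, b, blo, bhi)
```

Note that HtmlDiff builds its own differ internally, so a subclass like this only affects code that calls CappedDiffer().compare() directly; changing HtmlDiff behaviour would need the fix inside difflib itself.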
Another solution for long chunks would be to split them into smaller parts and process each part separately. This way the quadratic cost is paid only within each part, and we can still benefit from the _fancy_helper logic.
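
A rough sketch of the splitting idea, under stated assumptions: the class name ChunkedDiffer and the CHUNK size are hypothetical, the split is a naive fixed-size one, and the code again leans on private difflib methods (_fancy_replace, _dump). It is meant to show the shape of the approach, not a tuned implementation.

```python
import difflib


class ChunkedDiffer(difflib.Differ):
    """Differ that runs the fancy comparison on fixed-size slices."""

    CHUNK = 50  # hypothetical slice size, chosen only for illustration

    def _fancy_replace(self, a, alo, ahi, b, blo, bhi):
        if max(ahi - alo, bhi - blo) <= self.CHUNK:
            # Small enough: use the stock algorithm unchanged.
            yield from super()._fancy_replace(a, alo, ahi, b, blo, bhi)
            return
        # Walk both sides in lockstep, one slice at a time, so the
        # quadratic work is bounded by CHUNK * CHUNK per slice.
        while alo < ahi and blo < bhi:
            amid = min(alo + self.CHUNK, ahi)
            bmid = min(blo + self.CHUNK, bhi)
            yield from super()._fancy_replace(a, alo, amid, b, blo, bmid)
            alo, blo = amid, bmid
        # Whatever is left on one side is a pure deletion or insertion.
        if alo < ahi:
            yield from self._dump('-', a, alo, ahi)
        if blo < bhi:
            yield from self._dump('+', b, blo, bhi)
```

The trade-off is that slices are paired by position, so similar lines that land in different slices are not matched with each other, and the output can differ from an unsplit run.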