Issue 914575: difflib side by side diff support, diff.py s/b/s HTML option (original) (raw)

Created on 2004-03-11 23:50 by dmgass, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (58)

msg45507 - (view)

Author: Dan Gass (dmgass)

Date: 2004-03-11 23:50

lib/difflib.py:

Added support for generating side by side differences.
Intended to be used for generating HTML pages but is generic where it can be used for other types of markup.

tools/scripts/diff.py:

Added -m option to use above patch to generate an HTML page of side by side differences between two files. The existing -c option when used with the new -m option controls whether contextual differences are shown or whether the entire file is displayed (number of context lines is controlled by existing -l option).

NOTES: (1) Textual context diffs were included as requested. In addition, full and contextual HTML side by side differences (that this patch generated) are also included to assist in analyzing the differences and showing an example.

(2) If this functionality is worthy of inclusion in the standard python distribution, I am willing to provide more documentation or make modifications to change the interface or make improvements.

(3) When using Internet Explorer some font sizes seem to skew things a bit. Generally I've found the "smallest" to work best. If someone knows why, I'd be interested in making any necessary adjustments in the generated HTML.

msg45508 - (view)

Author: Dan Gass (dmgass)

Date: 2004-04-09 13:30

Logged In: YES user_id=995755

In the time since submission I've found that the interface to the chgFmt and lineFmt functions (arguments of mdiff) should include both the line number and an indication of side (from/to). The use for it I've found is for dropping anchors into the generated markup that it can be hyperlinked from elsewhere.

msg45509 - (view)

Author: Dan Gass (dmgass)

Date: 2004-04-29 04:30

Logged In: YES user_id=995755

Also, I will need to submit the fix for making it behave nice when there are no differences!!

msg45510 - (view)

Author: Jim Jewett (jimjjewett)

Date: 2004-06-11 14:35

Logged In: YES user_id=764593

Thank you; I have often wished for side-by-side, but not quite badly enough to write it.

That said, I would recommend some tweaks to the formatting.

"font" is deprecated; "span" would be better.

On my display, the "next" lines don't always seem to be counted (except the first one), so that the anchors column is not lined up with the others. (This would also be fixed by the separate rows suggestion.)

Ideally, the line numbers would be in a separate column from the code, to make cut'n'paste easier. (Then you could replace font.num (or span.num) with td.linenum.)

Ideally, each change group should be a separate row in the table. (I realize that this probably means a two-layer iterator, so that the line breaks and blank lines can be inserted correctly in the code columns.)

msg45511 - (view)

Author: Dan Gass (dmgass)

Date: 2004-06-24 06:23

Logged In: YES user_id=995755

I just attached an updated patch. I based the patch on diff.py(CVS1.1) and difflib.py(CVS1.20) which was the latest I saw today on viewCVS.

The following enhancements were made:

  1. user interface greatly simplified for generating HTML (see diff.py for example)
  2. generated HTML now 4.01 Transitional compliant (so says HTML Tidy)
  3. HTML color scheme for differences now matches that used by viewCVS.
  4. differences table now has a row for each line making the HTML less susceptible to browser quirks.
  5. took care of all issues to date enumerated here in this patch.

It would be great if I could get some help on: A) getting some JavaScript? written to be able to select and cut text from a single column (right now text is selected from the whole row including both "from" and "to" text and line numbers. B) solving the line width issue. Currently the "from" / "to" column is as wide as the widest line. Any ideas on wrapping or scrolling?

As of now the only feature I may want to add in the near future is optional tab expansion.

Thanks to those that have commented here and emailed me with suggestions and advice!

msg45512 - (view)

Author: Walter Dörwald (doerwalter) * (Python committer)

Date: 2004-06-25 18:01

Logged In: YES user_id=89016

Submitting difflib_context.html to validator.w3.org gives 2333 errors. The character encoding is missing and there's no DOCTYPE declaration. The rest of the errors seem to be mostly related to the nowrap attribute (which must be written as nowrap="nowrap", as this is the only allowed value). Furthermore the patch contains a mix of CR and CRLF terminated lines.

msg45513 - (view)

Author: Dan Gass (dmgass)

Date: 2004-07-02 18:17

Logged In: YES user_id=995755

With the updated patch code I just posted, both full and contextual differences pass the validator.w3.org check (XHTML 1.0 Transitional). Also the extra carriage returns put in by DOS were removed from the difflib.py patch.

msg45514 - (view)

Author: Martin v. Löwis (loewis) * (Python committer)

Date: 2004-07-15 06:30

Logged In: YES user_id=21627

The pack lacks changes to the documentation (libdifflib.tex) and changes to the test suite. Please always submit patches as single unified or context diffs, rather than zip files of the revised files.

msg45515 - (view)

Author: Dan Gass (dmgass)

Date: 2004-07-16 09:13

Logged In: YES user_id=995755

Since I have not gotten any feedback on the user interface I took the liberty to tweek it the best I thought how. Since I consider this new functionality very solid I went ahead and created changes to the documentation and test suite.

Per Martin v. Löwis (loewis) instructions I generated a patch which hopefully meets his needs. My CVS access is limited to the web interface and thus the patch is based on the latest checked in as of today:

python/python/dist/src/Lib/difflib.py -- rev 1.21 python/python/dist/src/Tools/scripts/diff.py -- rev 1.2 python/python/dist/src/Lib/test/test_difflib.py -- rev 1.10 python/python/dist/src/Doc/lib/libdifflib.tex -- rev 1.17

Note, I am not very familiar with .tex but it seemed straight forward. I editted the file by hand and it should be very close to what I intended. Unfortunately I am not set up to convert the .tex to HTML. I may try that next week.

msg45516 - (view)

Author: Dan Gass (dmgass)

Date: 2004-07-16 09:16

Logged In: YES user_id=995755

Since I have not gotten any feedback on the user interface I took the liberty to tweek it the best I thought how. Since I consider this new functionality very solid I went ahead and created changes to the documentation and test suite.

Per Martin v. Löwis (loewis) instructions I generated a patch which hopefully meets his needs. My CVS access is limited to the web interface and thus the patch is based on the latest checked in as of today:

python/python/dist/src/Lib/difflib.py -- rev 1.21 python/python/dist/src/Tools/scripts/diff.py -- rev 1.2 python/python/dist/src/Lib/test/test_difflib.py -- rev 1.10 python/python/dist/src/Doc/lib/libdifflib.tex -- rev 1.17

Note, I am not very familiar with .tex but it seemed straight forward. I editted the file by hand and it should be very close to what I intended. Unfortunately I am not set up to convert the .tex to HTML. I may try that next week.

msg45517 - (view)

Author: Adam Souzis (asouzis)

Date: 2004-07-17 19:12

Logged In: YES user_id=57486

The diffs look cool but if the to and from lines are identical an exception is thrown:

File "", line 23, in makeDiff File "c:_dev\rx4rdf\rx\htmldiff.py", line 1687, in make_file summary=summary)) File "c:_dev\rx4rdf\rx\htmldiff.py", line 1741, in make_table if not diff_flags[0]: IndexError: list index out of range

(This is on python 2.3.3 -- I renamed your difflib.py to htmldiff.py and put it in site-packages)

Perhaps you should add the case of two identical files to test_difflib.py.

thanks

msg45518 - (view)

Author: Raymond Hettinger (rhettinger) * (Python committer)

Date: 2004-07-19 13:12

Logged In: YES user_id=80475

Unfortunately, I do not have time to give this more review.

Several thoughts:

msg45519 - (view)

Author: Dan Gass (dmgass)

Date: 2004-07-23 11:53

Logged In: YES user_id=995755

I intend on implementing all the suggestions from the last two comment submissions when I hear back from Raymond Hettinger. I'm questioning making the templates private global variables. I intentionally made them members of a class so that they could be overridden when the class is subclassed.

msg45520 - (view)

Author: Jim Jewett (jimjjewett)

Date: 2004-07-23 13:24

Logged In: YES user_id=764593

I agree that users should be able to override the template.

I think Raymond was concerned about

(1) backwards compatibility if you want to change the interface. For instance, you might later decide that there should be separate templates for header/footer/match section/ changed line/added line/deleted line.

(2) matching the other diff options, so this doesn't look far more complicated.

Unfortunately, at the moment I don't see a good way to solve those; using a private member and access functions wouldn't really simplify things much.

msg45521 - (view)

Author: Dan Gass (dmgass)

Date: 2004-07-23 13:57

Logged In: YES user_id=995755

Maybe a good compromise would be to leave them members of the class but name them to imply they are private (add a leading underscore). That way if someone wants to create their own subclass they can at their own risk. They can always write a little extra logic to insure the interface didn't change and raise a custom exception explaining what happened and what they need to look at.

I didn't read (2) in Raymond's comments. Could you expand on those concerns?

msg45522 - (view)

Author: Jim Jewett (jimjjewett)

Date: 2004-07-23 15:17

Logged In: YES user_id=764593

For a context or unified diff, you don't really need to parametrize it very much. Having a much larger API for side- by-side puts things out of balance, which can make side-by- side look either more difficult or preferred.

msg45523 - (view)

Author: Dan Gass (dmgass)

Date: 2004-07-23 15:48

Logged In: YES user_id=995755

I agree the API for make_file() and make_table() has alot of optional parameters that can look daunting. The alternative of providing accessor methods is less appealing to me. It sounds like Jim & I are in agreement on this.

As far as the compromise of making the templates private members, I am assuming Jim is in favor of it. I'm in favor of them being members, it doesn't matter to me if they are private. I'll wait until Raymond weighs in on this one before I move on it.

Jim -- Thanks for your input.

msg45524 - (view)

Author: Neal Norwitz (nnorwitz) * (Python committer)

Date: 2004-07-23 16:10

Logged In: YES user_id=33168

I have not reviewed the code yet, so this is just a general comment. I don't know if it is applicable.

You could create a configuration class that has attributes. A user would only need to assign to the params that they want to update. If you change the names you could add properties to deal with backwards compatibility.

msg45525 - (view)

Author: Jim Jewett (jimjjewett)

Date: 2004-07-23 16:31

Logged In: YES user_id=764593

Actually, I'm not convinced that the templates should be private, though I can certainly see protected. (double- underscore private is mangled; single-underscore protected will be left out of some imports and doco, but acts like a regular variable if people choose to use it anyway.)

I'm still thinking about nnorwitz's suggestion; the config option could be ignored by most people ("use the defaults") but would hold persistent state for people with explicit preferences (since they'll probably want to make the same changes to all compares).

msg45526 - (view)

Author: Adam Souzis (asouzis)

Date: 2004-07-23 20:01

Logged In: YES user_id=57486

certainly you need to be able to access the templates and modify them. And should be documented too -- one of the first things i wanted to know is how can i customize the output. But a config object seems like conceptual clutter. keyword arguments can be ignored and are just as persistent if you use ** on a dict.

few more suggestions:

thanks

msg45527 - (view)

Author: Raymond Hettinger (rhettinger) * (Python committer)

Date: 2004-07-24 04:20

Logged In: YES user_id=80475

To clarify, the private variables are just a way to move the defaults out of the class definition and give the OP more control over formatting:

_legend = """

Legends
Colors
 Added 
Changed
Deleted
Links
(f)irst change
(n)ext change
(t)op
"""

class HtmlDiff(object):

file_template = _file_template
styles = _styles
linenum_template = _linenum_template
table_template = _table_template
legend = _legend

def __init__(self,prefix=['from','to'], linejunk=None,
  .  .  .

msg45528 - (view)

Author: Dan Gass (dmgass)

Date: 2004-07-24 15:29

Logged In: YES user_id=995755

I will be implementing the following suggestions:

Raymond 7/19 & 7/23 [all] Adam 7/17 [all] Adam 7/23 [partial, but I need feedback]

Unless the template issue (public/protected/private) is discussed further, I will assume everyone is in agreement or can live with Raymond's suggestion & clarification.

I will not be implementing any changes to the API right now.
I lean somewhat strong toward leaving the optional arguments as they are but this can be talked about further if anyone thinks there are strong arguments to the contrary.

Thanks Raymond, Adam, Jim, and Neal for your latest comments. Unless I get further suggestions, expect the patch hopefully by the end of the weekend.

msg45529 - (view)

Author: Adam Souzis (asouzis)

Date: 2004-07-25 06:51

Logged In: YES user_id=57486

Why move the attributes to a class for the two tables? In general its good practice to separate the style info from the html. For this case in particular i had to do this because i'm embedding the diff tables in a generic page template that applies general style rules to tables in the page. By adding a class for those tables i was able to add a rule for it that overrode the more general table rule.

Can I get a recommended solution for the font issue? Not sure, I added this rule: .diff_table th, .diff_table td { font-size: smaller; }
But its a bit problematic as Mozilla defaults to a considerably smaller monotype font size than IE so its hard to be consistent across browsers.

msg45530 - (view)

Author: Dan Gass (dmgass)

Date: 2004-07-25 13:51

Logged In: YES user_id=995755

Adam's suggestion seems to make alot of sense regarding moving the table attributes to a class. I will implement it in the next update. I will experiment with the font issue but will likely hold off on implementing anything.

Adam -- thanks for the quick feedback.

msg45531 - (view)

Author: Dan Gass (dmgass)

Date: 2004-07-26 06:34

Logged In: YES user_id=995755

Updated the patch as follows:

  1. handles no differences now in both full and context mode (Adam discovered that it crashed when no differences in context mode).
  2. handles empty string lists (I'm pretty sure it would have crashed had someone tried it).
  3. "links" column in the middle now appears on the left as well.
  4. Moved prefix argument from constructor to make_file and make_table methods. Also made it work by default so that if you are generating multiple tables and putting them on the same HTML page there are no anchor name conflicts.
  5. mdiff() function now is protected: _mdiff() so we are at liberty to change the interface in the future
  6. templates moved to protected global variables (still public members of the class) so that the indenting could be improved.
  7. Improved the indenting in other areas so that the HTML is now much more readable.
  8. Inlined the special character escaping so that the xml.sax library function is not used (this seems to have improved the performance quite a bit).
  9. Moved as many attributes as possible to a style sheet class. Adam, please review this incase I missed some.
  10. Expanded test suite to cover the above changes and made it easier to baseline.
  11. Updated documentation to reflect above changes.

NOTES

N1) Raymond, you had mentioned this crashes when the newlines are stripped. I modified the test to include stripping and not and have found both to work without having to fix anything. Can you duplicate what you saw and give me more info?

N2) I've noticed the HTML does not render tabs very well (at all). Is this OK or does anyone have any good ideas?

N3) I've attached the patch (you will also need the new file test_difflib_expect.html). I've zipped the patched code as well as example side by side differences of the patch. Diffs ending with _UPDATE.html show differences between the patched code and the last version of the patched code (what I've done since my last posting). Diffs ending with _PATCH.html show differences between the patched code and what I obtained from CVS a couple weeks back:

python/python/dist/src/Lib/difflib.py -- rev 1.21 python/python/dist/src/Tools/scripts/diff.py -- rev 1.2 python/python/dist/src/Lib/test/test_difflib.py -- rev 1.10 python/python/dist/src/Doc/lib/libdifflib.tex -- rev 1.17

msg45532 - (view)

Author: Jim Jewett (jimjjewett)

Date: 2004-07-27 14:11

Logged In: YES user_id=764593

Technically, HTML treats all whitespace (space, tab, newline, carriage return, etc) as interchangable with a single space, except sometimes when you are in a

 block.

In practice, user agents often honor it anyhow. If tab isn't working, you might have better luck replacing it with a series of 8  , ((ampersand, then "nbsp", then semicolon)*8) but I'm not sure the ugliness is worth it.

msg45533 - (view)

Author: Dan Gass (dmgass)

Date: 2004-07-27 14:49

Logged In: YES user_id=995755

I am considering making an optional argument to the constructor for specifying tab indentation. If nothing was passed it defaults to not handling tabs (better performance). If a number is passed, the string sequences (lists of lines) would be wrapped with a generator function to convert the tabs in each line with the expandtabs string method. The other option is to expand tabs all the time. If one is going to handle tabs it must be done on the input because once it is processed (markup added) the algorithm for expanding tabs becomes really compilicated. Any opinions regarding these options are welcome.

I think I should change the default prefix (_default_prefix) to be a class member rather than initialize it with the constructor. (The default prefix is used to generate unique anchor names so there are no conflicts between multiple tables on the same HTML page). I'm leaning this way because a user may create separate instances of HtmlDiff (to have different ndiff or tab settings) but place the tables on the same page. If I left it the hyperlinks in the second table would jump to the first table.

msg45534 - (view)

Author: Jim Jewett (jimjjewett)

Date: 2004-07-27 15:12

Logged In: YES user_id=764593

Your tab solution sounds as good as any.

I'm not sure I understand your intent with the default context.

module-level is shared. class-level is shared unless/until assigned. instance-level (including "set by constructor") lets different HtmlDiff have different values.

msg45535 - (view)

Author: Walter Dörwald (doerwalter) * (Python committer)

Date: 2004-07-27 16:28

Logged In: YES user_id=89016

Formatting one line for output should be the job of an overwrittable method. This makes it possible to implement various tab replacement schemes and possibly even syntax highlighting. (BTW, I'd like my tabs to be replaced by ·  )

msg45536 - (view)

Author: Dan Gass (dmgass)

Date: 2004-07-27 17:25

Logged In: YES user_id=995755

You can replace tabs with markup by overriding format_line(). The disadvantage is that doing smart replacements (such as expandtabs() does) is more difficult because there could already be markup in there which doesn't count toward the tab stops. You can accomplish Walter's substition easily by overriding format_line(). This substitution cannot be done at the front end because the markup will be escaped and displayed. I'm seeing this as a supporting argument for making the tab filtering optional on the front end (dependent on how much of a performance hit it is to do it all the time).

I intend on making the default prefix class-level so that different HtmlDiff instances share (and increment) the same value to avoid anchor name conflicts between two tables placed on the same HTML page. Jim, does that help clarify?

msg45537 - (view)

Author: Dan Gass (dmgass)

Date: 2004-07-28 19:39

Logged In: YES user_id=995755

Unless I hear opposing arguments I will be updating the patch to handle tabs and to change the default prefix to be class-level either tonight or tommorow night.

I plan on adding both pre and post processing to handle tabs.

Preprocessing will be controlled by an an optional keyword argument in the HtmlDiff constructor specifying the number of spaces per tab stop. If absent or None is passed, no preprocessing will occur. Otherwise the tabs in each line of text will be expanded to the specified tab stop spacing (note, I will use the expandtabs() string method but will convert the resulting expanded spaces back into tabs so they get handled by post processing).

Post processing will always occur. Any tabs remaining in the HTML markup will be substituted with the contents of a class-level variable (which can be overridden). By default, this variable will be set to ' '

These two changes will allow tab expansion to the correct tab stop without sacrificing the ability to see differences in files where tabs were converted to spaces or vice versa. It also provides a mechanism where by default, tabs are reasonably represented (or can be easily changed to your custom representation).

msg45538 - (view)

Author: Jim Jewett (jimjjewett)

Date: 2004-07-28 21:36

Logged In: YES user_id=764593

Are you saying that the default will replace tabs with a single non-breaking space? Not 3-4, as in many programming environments, or 8 as in standard keyboard, but 1?

No objections other than that.

msg45539 - (view)

Author: Dan Gass (dmgass)

Date: 2004-07-28 22:02

Logged In: YES user_id=995755

For the case where you instantiate HtmlDiff saying you want tabs expanded it will insert non-breaking space(s) up to the next tab stop.

For the case where you do NOT specify tab expansion it will substitute one non-breakable space unless you override it with a different string (where you could choose 3,4, or 8 spaces). We could make 3,4, or 8 spaces the default but it makes it more complex because it would require two overridable class-level members ... spaces_per_tab = 8 tab_space = ' ' ... and the post processing would look like ... return s.replace('\t',self.tab_space*self.spaces_per_tab) ... and the pre-processing setup in the constructor would need to override the class-level member used for the post processing: self.spaces_per_tab=1

We could also use a special character for tabs. We could even attempt to use a combination of nbsp and a special character to show the tab stops. I'd need to play with re's to do that.

msg45540 - (view)

Author: Jim Jewett (jimjjewett)

Date: 2004-07-28 22:41

Logged In: YES user_id=764593

If you're dealing with tabs anywhere except at the start of a line, then you probably can't solve it in a general way -- tabstops become variable.

If you're willing to settle for fixed-width tabstops (as on old typewriters, still works in some environments, works fine in tab-initial strings) then

tab="        "

Does everything you want in a single variable for tab-initial, and has all the information you need for fancy tab-internal processing (which I don't recommend).

msg45541 - (view)

Author: Dan Gass (dmgass)

Date: 2004-07-29 14:32

Logged In: YES user_id=995755

Updated the patch as follows:

  1. Now handles expanding tabs and representing them with an appropriate HTML representation.
  2. moved default prefix to class-level

NOTES

N1) See _libdifflib_tex_UPDATE.html and test_difflib_expect.html for an example of how tab differences get rendered.

N2) I've attached the patch (you will also need the new file test_difflib_expect.html). I've zipped the patched code as well as example side by side differences of the patch. Diffs ending with _UPDATE.html show differences between the patched code and the last version of the patched code (what I've done since my last posting). Diffs ending with _PATCH.html show differences between the patched code and what I obtained from CVS a couple weeks back:

python/python/dist/src/Lib/difflib.py -- rev 1.21 python/python/dist/src/Tools/scripts/diff.py -- rev 1.2 python/python/dist/src/Lib/test/test_difflib.py -- rev 1.10 python/python/dist/src/Doc/lib/libdifflib.tex -- rev 1.17

msg45542 - (view)

Author: Dan Gass (dmgass)

Date: 2004-07-29 17:13

Logged In: YES user_id=995755

I think we are getting very close to having something for the next alpha release for Python 2.4.

One exception is the last patch update I used a list comprehension that calls a function for every line of text. I'm thinking I should have called the function with the list and have it pass back a newly constructed list. To be sure which is the better way I want to do a performance measurement.

I also would like to measure performance with and without "smarttabs". If it does not cost much I might be in favor of eliminating the option and just doing "smarttabs" all the time. In addition to performance degredation it would eliminate the ability to doing straight tab for spaces substitution (is this bad?).

msg45543 - (view)

Author: Raymond Hettinger (rhettinger) * (Python committer)

Date: 2004-07-29 19:37

Logged In: YES user_id=80475

Rather than spending time on performance measurement, it is best to focus on other goals. Aim for the smallest amount of code, the neatest output, greatest ease of use, and the simplest way to change the output appearance.

The size of the docs is one metric of ease of use. Ideally, it would take a paragraph or two to explain.

Also, write some sample code that produces a different output appearance (XML for example). How easy was it to do.

The goal is to focus the code into three parts: the underlying ndiff, converting ndiff to side-by-side, and output formatting.

msg45544 - (view)

Author: Dan Gass (dmgass)

Date: 2004-07-29 20:10

Logged In: YES user_id=995755

Rather than spending time on performance measurement, it is best to focus on other goals. Aim for the smallest amount of code, the neatest output, greatest ease of use, and the simplest way to change the output appearance.

Noted.

The size of the docs is one metric of ease of use. Ideally, it would take a paragraph or two to explain.

So far I've patched the libdifflib.tex file with about that amount of material. It details the "out-of-box" methods for generating HTML side by side differences. It does not address templates that we are leaving public to adjust the output. Should the documentation be left simple with just a reference to the doc string documentation within the module for further information about using the templates to adjust the output?

Also, write some sample code that produces a different output appearance (XML for example). How easy was it to do.

The _mdiff() function could be used by those interested in doing side by side diffs with other markup such as XML. Previously you had mentioned that we should hide this function for now so we can reserve the right to change the interface. Truthfully I did not mind this decision because I don't think there is much need for it and it does avoid alot of documentation work to explain how to use it :-)

The goal is to focus the code into three parts: the underlying ndiff, converting ndiff to side-by-side, and output formatting.

  1. ndiff() is what it is and I had no need to change it.

  2. converting ndiff() to side-by-side is handled (and packaged neatly) by _mdiff(). The code in my opinion is well written and well commented. It is not public code so documentation for the python manuals is not required.

  3. There has been a great deal of discussion regarding output formatting (early on a fair amount of it was done through private emails, but as of late everything is being logged here). IMHO I think the interface has shaped up very well, but I am still open for more suggestions. To date most of the feedback I have gotten is on the API and the output. I haven't heard much about the code.

msg45545 - (view)

Author: Dan Gass (dmgass)

Date: 2004-08-03 20:37

Logged In: YES user_id=995755

Is this patch being considered for going into Python 2.4 (and hence being checked into an alpha release)? FWIW, Tim Peters was +1 in favor of having this functionality in Python (even before the API was significantly enhanced) before washing his hands of it.

I'd be interested in knowing whether this is going into 2.4, being shelved until the next release or whether the development/distribution of this functionlity should be moved to its own project. I'm getting a little anxious because I know an alpha release is in the works this week but I don't know how many more opportunities we have beyond that for 2.4. (I'm assuming there will be a long time before the next one.)

Are there any more outstanding issues with it that need to be resolved? I am currently under the assumption no one has any objections or further recommendations.

msg45546 - (view)

Author: Martin v. Löwis (loewis) * (Python committer)

Date: 2004-08-04 04:41

Logged In: YES user_id=21627

Given that this code was developed from scratch in this tracker, and given the loads of changes it has seen, I'd be in favour of postponing it after 2.4.

msg45547 - (view)

Author: Tim Peters (tim.peters) * (Python committer)

Date: 2004-08-04 05:08

Logged In: YES user_id=31435

The first version I saw from Dan worked fine for my tastes, and he's been very responsive to the unusually large number of suggestions made by the unusually large number of reviewers. The number of reviewers demonstrates real interest, and while all have their own UI and API agendas to push, I don't see anyone opposed to the patch. This should go in.

msg45548 - (view)

Author: Jim Jewett (jimjjewett)

Date: 2004-08-04 12:59

Logged In: YES user_id=764593

Martin -- there have been a fair number of changes, and I agree that major new functionality is often asked to cook for a release outside the core.

That said, the changes were mostly the sort of things that could be done even after it was in the core; they're just being done early. So I'm inclined to agree with Tim in this particular case.

I feel even more strongly that if it doesn't go in 2.4, then it should at least be targeted for 2.5, rather than an indefinate "future".

Also note that

(1) This doesn't require much in the way of changes to the previous code; it could be backed out if there are problems during the testing phase.

(2) there are a few alphas left, plus betas, and the author has been very fast with fixes; if there are problems, I'm confident that he can fix them in plenty of time.

msg45549 - (view)

Author: Dan Gass (dmgass)

Date: 2004-08-04 14:00

Logged In: YES user_id=995755

FWIW here is some background that may help determine your comfort level:

The basis of this code was written late last year. Just before submitting it as a patch early this year I cleaned it up quite a bit. Since that time its "core" _mdiff() has been extremely stable. I (we) have been using it a fair amount where I work without problems (early on its use flushed out some additional requirements). Obviously I've been using it quite a bit in this patch to show what I'm changing.

All of the changes to date have been to 1) improve the API to make it really easy to use yet still flexible to customize 2) change the output to meet XHTML 1.0 Transitional standards and 3) improve output HTML functionality and 4) correct some corner case bugs.

Jim's notes are correct, 1) this patch is strictly an addition to difflib.py as it required no changes to the existing code and 2) I am currently only involved in this patch and my config-py sourceforge project so providing support has not and will not be a problem.

msg45550 - (view)

Author: Dan Gass (dmgass)

Date: 2004-08-06 04:45

Logged In: YES user_id=995755

I did a rough performance measurement where I addressed a concern I had in a previous comment (function call in a list comprehension). The time difference was so small I could not determine which was better. Until I hear complaints about its speed I am not planning on looking into performance improvements.

As far as the "smarttab" option, I have not gotten any feedback. I'm inclined to leave it as is because it allows the most flexibility.

I will not be monitoring this patch (or my email) from now until next Monday (August 9).

msg45551 - (view)

Author: Dan Gass (dmgass)

Date: 2004-08-10 05:00

Logged In: YES user_id=995755

Raymond -- I am new to the Python-Dev community so I can only assume you would be the one to apply this patch and check it into CVS since this patch is assigned to you.
Because of that and the fact that you have looked it over to some degree I would like to hear your opinion on when you think this should get it. It would be helpful to know if it is going in 2.4 and what work needs to be done on my part. I would rather do the work sooner than later for everyone's comfort level. It may be a good break from reading about @|decorators :-)

I have found documentation on how to generate the HTML documentation from the .tex file I patched in order to check my work more accurately. My suspicion is that Linux is the platform that would be easiest to build the documentation since it would already have most of the tools (I have a Suse Linux system at home but am not very familiar with it).
Windows is still the most convenient platform for me to work under right now. Unfortunately I do not yet have broadband (or internet) on my Linux home system only dial up on my home Windows system :-( and high speed on my Windows system at work. Any suggestions on which platform to use and the easiest way to obtain all the source files to get started would be appreciated -- I can be reached directly at dan.gass_at_gmail_dot_com.

msg45552 - (view)

Author: Dan Gass (dmgass)

Date: 2004-08-13 13:49

Logged In: YES user_id=995755

Quick poll to interested parties: Do you want me to add logic to handle line wrapping now, later, or never? Its been something I wanted to address and I thought of a relatively easy way to do it this morning while in the shower. In the code that inserts the XHTML markup I would put temporary markers (perhaps \0,\1) around the markup so that I could easily count visible characters accurately and perform the break. I'd add a "wrapcolumn" attribute to the init method that would default to None (no wrapping). I would break on the exact column number (not worrying about breaking on whitespace). I'll also put some type of continuation marker in the line number column (probably a single ">").

I did get a hold of Raymond. He suggests recruiting either Tim or another developer to work on inserting the patch as his time is tight right now. I'll put out a plea for help on Python-Dev this weekend (I'll wait until after I implement line wrapping if feedback is positive) unless I'm lucky and get a volunteer based on this posting.

Raymond recommended a "tex" checker script that comes in the standard distribution to run on my modifications to the .tex documentation file. My changes are OK.

msg45553 - (view)

Author: Jim Jewett (jimjjewett)

Date: 2004-08-13 15:36

Logged In: YES user_id=764593

(1) I don't have checkin privs, so I can't help there. (2) Line wrapping is tricky - I couldn't think of a good way to do it, which is why I didn't mention it earlier. Your way sounds as good as any, so long as it is easily overridden (including a "no wrap" option)

msg45554 - (view)

Author: Dan Gass (dmgass)

Date: 2004-08-16 16:53

Logged In: YES user_id=995755

I'm pretty far implementing the (optional) line wrapping feature, I'll post it in the next day or two.

Assuming this goes into the next alpha, does anyone have any objections to changing everything (except some of the templates) to be protected (via naming everything with a leading underscore) for this release? It would discourage anyone from counting on any of the internal workings so we would be at liberty to change it in future releases should anything shake out from its more widespread use. I think the templates and the API allow enough control to tailor the output without exposing the remainder of the implementation.

msg45555 - (view)

Author: Raymond Hettinger (rhettinger) * (Python committer)

Date: 2004-08-16 17:04

Logged In: YES user_id=80475

+1 on exposing as little of the API as possible.

msg45556 - (view)

Author: Brett Cannon (brett.cannon) * (Python committer)

Date: 2004-08-17 02:25

Logged In: YES user_id=357491

+1 on minimizing the API as well. Easier to expose more of an API later than to retract any part of it.

msg45557 - (view)

Author: Dan Gass (dmgass)

Date: 2004-08-17 16:58

Logged In: YES user_id=995755

I'm still in the process of implementing the line wrapping and am gaining a greater appreciation for minimizing the API (in this process I have been simplifying some of the code). Based on this and the feedback so far I am proposing the following API simplifications:

class HtmlDiff():

I will be making everything protected by convention (adding a leading underscore) except for the make_file() and make_table() methods. This should warn those looking at the internals that things may change in future releases. We may want to consider making additional public interfaces in the future when more experience is gained.

HtmlDiff.init():

  1. remove 'smarttabs' argument (I will always expand tabs using expand_tabs() string method).

HtmlDiff.make_file(): HtmlDiff.make_table():

  1. remove 'fromprefix' and 'toprefix' arguments (vast majority of applications don't need this, for now corner cases can solve it with a string search/replace algorithm after the fact).

  2. remove 'summary' argument (this added a summary attribute to the

    tag and I believe would not be used much if at all). A user could always do a string search and replace on the
    tag to insert this after the fact.

HtmlDiff.make_file():

  1. leave 'title' and 'header' arguments alone (I could be talked into removing these). These arguments are for controlling the window title and any markup to be inserted just above the table. Although these could be inserted after the fact using string search and replace methods I think these will be commonly used and should be convenient (plus they are easy to understand).

ANY OTHER API SIMPLIFICATION IDEAS?

msg45558 - (view)

Author: Brett Cannon (brett.cannon) * (Python committer)

Date: 2004-08-19 03:30

Logged In: YES user_id=357491

Don't have to look this up to see if this is already supported, but the only thing that I could see people wanting is a way in inject their own stylesheet. Otherwise it sounds good to me.

msg45559 - (view)

Author: Dan Gass (dmgass)

Date: 2004-08-19 14:08

Logged In: YES user_id=995755

Should we expect the user to search/replace the style sheet in the generated HTML? I'm on a simplification kick and this would keep the API from getting bigger. I'm inclined to do the same for title and headers so I could eliminate them from the API.

I have completed the line wrapping functionality and the simplifications/streamlining talked about to date. I'm currently testing and updating the documentation. Hope to post it late tonight or tommorow.

If there is anything else that should be changed, now would be a good time to speak up!

Thanks again for everyone's help.

msg45560 - (view)

Author: Dan Gass (dmgass)

Date: 2004-08-20 08:12

Logged In: YES user_id=995755

Updated the patch as follows:

  1. added optional line wrapping using a wrapcolumn argument to make_file and make_table methods.
  2. eliminated a number of optional arguments to simplify API (as discussed).
  3. made everything protected (by naming convention) except for the public make_table and make_file methods (as discussed).

NOTE1

I've attached the patch (you will also need the new file test_difflib_expect.html). I've zipped the patched code as well as example side by side differences of the patch. Diffs ending with _UPDATE.html show differences between the patched code and the last version of the patched code (what I've done since my last posting). Diffs ending with _PATCH.html show differences between the patched code and what I obtained from CVS a month or so back:

python/python/dist/src/Lib/difflib.py -- rev 1.21 python/python/dist/src/Tools/scripts/diff.py -- rev 1.2 python/python/dist/src/Lib/test/test_difflib.py -- rev 1.10 python/python/dist/src/Doc/lib/libdifflib.tex -- rev 1.17

NOTE2

I will not be monitoring emails or this patch from Saturday Aug21 thru Wednesday Aug25. When I break my internet silence I'll see if I've blundered this update (unlikely because I tested it pretty well but none the less possible). If everything went well I will be soliciting Python-Dev to try to get this in alpha3. I need someone with checkin privledges to do it. Should I be trying to gather more support for its inclusion or does this only need to be done to convince someone with checkin privledges to apply the patch?

msg45561 - (view)

Author: Martin v. Löwis (loewis) * (Python committer)

Date: 2004-08-29 16:36

Logged In: YES user_id=21627

Based on Tim's approval, I have applied this patch as

libdifflib.tex 1.18 difflib.py 1.23 test_difflib.py 1.11 test_difflib_expect.html 1.1 ACKS 1.279 NEWS 1.1118 diff.py 1.4

If you want to make further changes to the code, please submit them as diff against the current CVS, in a new SF patch. Please generate this diff through "cvs diff -u" if possible, as this will put the proper file path into each chunk, so that patch does not need to ask what file each chunk applies to.

msg45562 - (view)

Author: Tim Peters (tim.peters) * (Python committer)

Date: 2004-08-29 19:40

Logged In: YES user_id=31435

Dan, there's a problem you need to fix: all .py files in the core are run thru reindent.py, which, among other things, chops invisible trailing whitespace off lines. Some of your new lines in difflib.py and in test_difflib.py do have invisible trailing whitespace (especially inside triple-quote strings), and test_difflib fails when that junk is removed.

Repairing this is probably just a matter of running reindent.py on those files, then generating a new test_difflib_expect.html. Please verify that's all that's needed. If you say it is, I'll do that and check in the result.
Else we'll have to revert the checkin.

msg45563 - (view)

Author: Dan Gass (dmgass)

Date: 2004-08-29 21:20

Logged In: YES user_id=995755

Tim, sorry for the inconvenience. I ran reindent.py over all the .py files and duplicated what you saw -- test_difflib.py does fail due to white space being removed from triple-quoted template strings.

You are correct, the proper action is to run reindent.py over the .py files and regenerate test_difflib_expect.html. (It can be done by temporarily uncommenting three lines in test_difflib.py on the lines following "# temporarily uncomment".) I went through this exercise and verified the new expectations are OK. (After regenerating, you may want to open test_difflib_expect.html just to make sure it renders reasonably for a sanity check.)

I would volunteer to generate a new patch file but I do not have direct access to the CVS archives and this would probably create as much work for you as I save. I appreciate your offer to help. If there is anything else I should do let me know by posting here or emailing me directly at dan.gass at gmail.com.

Thanks, Dan

msg45564 - (view)

Author: Tim Peters (tim.peters) * (Python committer)

Date: 2004-08-29 22:40

Logged In: YES user_id=31435

Thanks, Dan! I checked the changes in. The new HTML file wasn't obviously damaged, but since I'm not intimately familiar with what it's intending to show, I could well be missing something. If it looked OK to you when you tried it, I'm not gonna argue . Looks good!