msg56569 - (view) |
Author: Christian Heimes (christian.heimes) *  |
Date: 2007-10-19 22:18 |
The patch fixes the output for profile and cProfile. Another patch from Alexandre and me added additional calls to the UTF-8 codec. |
|
|
msg56570 - (view) |
Author: Brett Cannon (brett.cannon) *  |
Date: 2007-10-19 22:29 |
There is no patch. =) |
|
|
msg56571 - (view) |
Author: Christian Heimes (christian.heimes) *  |
Date: 2007-10-19 22:31 |
Sure there is a patch ... well it's ... *uhm* ... it's hidden under your bed. O:-) |
|
|
msg56621 - (view) |
Author: Christian Heimes (christian.heimes) *  |
Date: 2007-10-21 04:31 |
The new patch does a far more better job. I had the idea after a discussion with Alexandre on #python and a small debugging session. The tests for profile, cProfile and doctest are failing on my Linux box because Py_FileSystemDefaultEncoding is "UTF-8" but PyUnicode_Decode() has a shortcut for "utf-8" only. I've added a tolower loop and shortcuts for UTF-16 and UTF-32 to PyUnicode_Decode(). I've also disabled the tests on system with a fs encoding that isn't built in. |
|
|
msg56635 - (view) |
Author: Alexandre Vassalotti (alexandre.vassalotti) *  |
Date: 2007-10-22 01:10 |
I don't think it's possible to add shortcuts in PyUnicode_Decode for UTF-16 and UTF-32 because the byte-order can be different depending of the platform. So, these two need to pass through the codecs module. I am sure if it's better, but I factored out the normalization routine into its own function. |
|
|
msg56637 - (view) |
Author: Christian Heimes (christian.heimes) *  |
Date: 2007-10-22 01:17 |
Alexandre Vassalotti wrote: > I don't think it's possible to add shortcuts in PyUnicode_Decode for > UTF-16 and UTF-32 because the byte-order can be different depending of > the platform. So, these two need to pass through the codecs module. utf-16 and utf-32 are the the names for the native codecs. The explicit names are e.g. utf-16-be or utf-32-le. The last argument 0 also means "native byte order". I used a shorter algorithm to optimize the normalization for the special cases of the strcmp() shortcuts. Your version is fine but takes several CPU cycles longer. I don't think it has a large performance impact. ;) Christian |
|
|
msg56718 - (view) |
Author: Guido van Rossum (gvanrossum) *  |
Date: 2007-10-24 19:07 |
I like Chris's patch better, but he should use (sizeof(lower)-1) instead of 8, and he should use the TOLOWER macro from bytes_methods.h. Please submit a new one and I'll check it in! |
|
|
msg56739 - (view) |
Author: Christian Heimes (christian.heimes) *  |
Date: 2007-10-25 10:47 |
Why should I use (sizeof(lower)-1)? Do you mean PyMem_Malloc(strlen(encoding) + 1)? Changes since last patch: * added #include "bytes_methods.h" in unicodeobject.c * fix all is* to use the macros from bytes_methods.h * use malloc/free the lower version instead of using a fixed buffer * replace '_' with '-' in encoding I still think that a fixed buffer of 12 chars (strlen(iso-8859-1) + 1 + \0) is sufficient and faster. * also check for iso-8859-1 |
|
|
msg56764 - (view) |
Author: Guido van Rossum (gvanrossum) *  |
Date: 2007-10-25 23:48 |
Committed revision 58659. I'm sorry I confused you; I was fine with the version that has char lower[N] but I wanted you to not repeat N-1 later in the code. See what I checked in. :-) |
|
|
msg56765 - (view) |
Author: Christian Heimes (christian.heimes) *  |
Date: 2007-10-26 00:00 |
Guido van Rossum wrote: > Guido van Rossum added the comment: > > Committed revision 58659. > > I'm sorry I confused you; I was fine with the version that has char > lower[N] but I wanted you to not repeat N-1 later in the code. See what > I checked in. :-) I still don't understand why you are using (sizeof lower) - 2 and what &lower[(sizeof lower) - 2] returns. Is it the memory address of lower[17]? |
|
|
msg56766 - (view) |
Author: Alexandre Vassalotti (alexandre.vassalotti) *  |
Date: 2007-10-26 00:19 |
> I still don't understand why you are using (sizeof lower) - 2 It is simply to avoid duplicating the constant (a.k.a. the Don't Repeat Yourself (DRY) rule). > and what &lower[(sizeof lower) - 2] returns. Is it the memory address > of lower[17]? It the address of lower[18] to be exact. (l < &lower[(sizeof lower) - 2]) is simply tricky notation to check the bound of the array. Personally, I used like to subtract pointer, ((lower - l + 1) < (sizeof lower)) to get the bound. But now, I find Guido's trick more cute (and less error-prone). :) |
|
|
msg56767 - (view) |
Author: Alexandre Vassalotti (alexandre.vassalotti) *  |
Date: 2007-10-26 00:29 |
> Personally, I used like to subtract pointer, ((lower - l + 1) < > (sizeof lower)) to get the bound. Doh. I got it backward. It's (l - lower + 1), not (lower - l + 1). > But now, I find Guido's trick more cute (and less error-prone). At least, I got that right. :) |
|
|
msg56768 - (view) |
Author: Christian Heimes (christian.heimes) *  |
Date: 2007-10-26 00:30 |
Alexandre Vassalotti wrote: > It the address of lower[18] to be exact. (l < &lower[(sizeof lower) - > 2]) is simply tricky notation to check the bound of the array. > Personally, I used like to subtract pointer, ((lower - l + 1) < (sizeof > lower)) to get the bound. But now, I find Guido's trick more cute (and > less error-prone). :) Wow, that's a cool trick. The old wizard has still some astonishing tricks in the sleeves of his robe. :] |
|
|