[Tutor] alphabetizing a file by lines (original) (raw)
Magnus Lyckå magnus at thinkware.se
Sun Jul 18 01:36:58 CEST 2004
- Previous message: [Tutor] alphabetizing a file by lines
- Next message: [Tutor] alphabetizing a file by lines
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
At 17:28 2004-07-17 -0400, Brian van den Broek wrote:
the advice at the top is good. The only thing I would add is that this will sort your lines by their entire contents and will perform a case sensitive sort.
Another problem is that is has to read the whole file into memory at once.
It seemed from your problem description that you wanted only the first letter taken into account regardless of case so that the lines:
abc ABC azz acd would not be rearranged.
This is really good if the file is so large that you don't want to read it into memory at once.
Both the case-insensitivity and the consideration of just the first letter can easily be obtained by writing a custom compare function to pass in with the sort method call.
The case insensitivity part can be accomplished like so: def alphabeticalsort(first, second): return cmp(first.lower(), second.lower()) (You could easily extend this to look at just the leading n-characters of the two strings being compared.) You'd use it like so: lines = open(file).readlines() lines.sort(alphabeticalsort) print lines
the problem with this is that it's slow for big lists, since you have to call a Python function over and over from inside the sort function.
The common approach is to do something like this.
sort_list = [] for line in open(file): sort_list.append((line[0].lower(), line)) sort_list.sort() for first_char, line in sort_list: print line
In the coming Python 2.4, there will be more features added to the sort method which will make this simpler.
Here we still have a problem with large files, but if we only need to sort on the first character, that's easy to overcome.
Just do something like this (untested) to store lines for each starting letter in a separate file and cat the files together in the end:
files = {}
def store(line): name = line[0].lower() if not name in files: f = open(name, 'w+') files[name] = f files[name].write(line+'\n')
for line in open(file): line = line.rstrip() if line: store(line)
file_names = files.keys() file_names.sort()
big_file = open('big_file.txt', 'w') for file_name in file_names: files[file_name].seek(0) chunk = files[file_name].read() big_file.write(chunk) big_file.close()
This approach never reads need to use up more memory than required for all lines starting with one letter. For a more memory conservative approach, make an inner loop in the last loop, and read one line at a time, or a few lines using something like files[file_name].readlines(100).
If I recall correctly, Bentley's "Programming Pearls" has a chapter on something like this.
-- Magnus Lycka (It's really Lyckå), magnus at thinkware.se Thinkware AB, Sweden, www.thinkware.se I code Python ~ The Agile Programming Language
- Previous message: [Tutor] alphabetizing a file by lines
- Next message: [Tutor] alphabetizing a file by lines
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]