Issue 1451466: reading very large files

Created on 2006-03-16 17:21 by richardchristen, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (10)
msg27792 - (view)	Author: christen (richardchristen)	Date: 2006-03-16 17:21
I work on the human genome. I extracted words from chromosomes using a suffix tree (C compiled for 64-bit, run on a SUN with 300 GB RAM, since my suffix tree requires 150 GB RAM for chromosome 1, the largest one). This gave some files larger than 5 GB, for example one with 163763326 lines for chr 4, the one presently being analyzed.

Using Python 2.4.2 on a 32-bit Windows computer (1.5 GB RAM), I read this file line by line with either

    for li in file: do something

or

    while li!='': li=file.readline()

I got problems seemingly around the 4 GB boundary (after reading the first problematic line): for some lines (not all), li contained the correct content but with the first word of the next line appended to it (see below).

As a result, a simple copy loop

    =open('1')
    =open('2','w')
    li=.readline()
    while li!='':
        .write(li)
        li=.readline()

produced a second file of only 163754385 lines. The problem lines were "seemingly random", i.e. not in a row, with the last line being OK.

The same code on the same file, but on my 64-bit dual-core OS X machine, went fine, despite the use of the default Python 2.2.3 and "file Python" showing it is a Mach-O executable ppc, i.e. a 32-bit app. Everything was run from the command line.

The first file looks like this:

...
TCAGCCACAGCAGAAAGTGA:\t33240 551212 751185
TCAGCCACAGCAGAAAGTGC:\t131324047
TCAGCCACAGCACTGTGTTA:\t61641912
....

The second file contains lines like this one:

TCAGCCACAGCAGAAAGTGC:\t131324047TCAGCCACAGCAGAAGAAGA:

which is 'first line' + 'first word of next line'.

PS1: There is no problem reading the big file with UEdit on the Windows machine, therefore the OS itself is not the problem. (Also, I transferred the big file from the Windows machine to the Mac; if the file had had problems, it would have been corrupted on the Mac.)
PS2: I tried Python 2.3.5 on Windows, with the same problem.
PS3: If needed, I can run the same test on a similar file for chromosome 8, which is slightly below the 4 GB limit (3.99).
PS4: I think I remember having done a similar parsing on a single-CPU Linux Athlon 64 a month ago, with no trouble.
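A minimal sketch of how the line-count discrepancy reported above could be checked without writing a second file. It compares the number of newline bytes seen in binary mode with the number of lines yielded by text-mode iteration. The function names `count_newline_bytes` and `count_lines_text` are hypothetical and not part of the original report; the sketch assumes the file ends with a newline and contains no lone carriage returns.

```python
def count_newline_bytes(path, chunk_size=1 << 20):
    """Count b'\\n' bytes by reading the file in binary mode, chunk by chunk."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total


def count_lines_text(path):
    """Count lines the way the report does: iterate a text-mode file object."""
    n = 0
    with open(path, "r") as f:
        for _ in f:
            n += 1
    return n
```

Under the stated assumptions the two counts should be equal. A smaller text-mode count would match the symptom described in the report, where two lines are returned as one.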
msg27793 - (view)	Author: Josiah Carlson (josiahcarlson) *	Date: 2006-03-18 00:35
Sounds like an issue with file objects on certain platforms not being able to handle offsets of 2**32 or larger. I personally have read and written files > 4 GB on the Windows platform, but I seem to recall having issues on 32-bit Linux some time in the past.
msg27794 - (view)	Author: Tim Peters (tim.peters) *	Date: 2006-03-18 02:33
"windows 32-computer" is too vague. Which operating system (Win95, Win98, WinME, NT, Win2K, WinXP), and which filesystem (FAT, FAT32, NTFS)? Are you sure this is a text file? If it's a binary file, then all sorts of bad things can happen opening it in text mode (which your sample code does).
msg27795 - (view)	Author: christen (richardchristen)	Date: 2006-03-18 07:29
In reply to the previous comment:

"Are you sure this is a text file?"
Yes, I made it myself. Besides, I transferred it from the UX machine to the Windows one by FTP, with conversion of the end-of-line character to the Windows kind. I checked with "type myfile" that the control character was indeed changed. Also, as I mentioned, I manually checked the awkward lines with UEdit, in both ASCII and HEX modes.

"windows 32-computer" is too vague.
I agree, I should have been more specific:
System: Microsoft Windows 2000 Professional, Version 5.0.2195, Service Pack 4, build 2195
Motherboard: ASUSTek, System Model A7N8X-E
BIOS: Phoenix AwardBIOS v6-00PG
Memory: 1.5 GB
Swap: 2.4 GB
File system: NTFS

Best regards
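The manual hex check described in this message could be approximated in Python as follows. This is only a sketch: `hex_window` is a hypothetical helper, not something used in the thread. It opens the file in binary mode, so no newline translation takes place, and returns the raw bytes around a given offset as a hex string.

```python
import binascii


def hex_window(path, offset, radius=16):
    """Return (start, hex string) for the bytes surrounding `offset`.

    The window covers up to `radius` bytes before and `radius` bytes
    from `offset` onward, clipped at the start of the file.
    """
    start = max(0, offset - radius)
    with open(path, "rb") as f:
        f.seek(start)
        data = f.read(offset - start + radius)
    return start, binascii.hexlify(data).decode("ascii")
```

In the hex string, a Windows line ending appears as `0d0a` and a bare line feed as `0a`, which makes it possible to confirm that the terminators are intact at the position where a parser goes wrong.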
msg27796 - (view)	Author: christen (richardchristen)	Date: 2007-07-02 07:11
In 2006, I reported a bug on 32-bit Windows when reading very large files: python-Bugs-1451466. I have now tried with a 64-bit Windows machine and Python 2.5, and I find the same bug. For very large files (the two I tried were around 7-8 GB), the end of line is sometimes not taken into account. The file is fine: as viewed in hex, the end-of-line characters are perfectly OK at the place where the parser goes wrong. Everything seems to be OK with the same script on my Mac OS X machine. Example: the original file reads: ########################### ......... Query= 10\|ENSG00000203288	pseudogene	105829416
msg63154 - (view)	Author: Joseph Armbruster (JosephArmbruster)	Date: 2008-03-01 01:04
I believe this may be related to issue 1672853. http://bugs.python.org/issue1672853
msg63441 - (view)	Author: Joseph Armbruster (JosephArmbruster)	Date: 2008-03-10 13:00
Note: If this issue is related to 1672853, I ran through the test code provided in the issue recently and it appeared to pass for both the trunk and 2.5 maint.
msg105542 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2010-05-11 20:49
Is this still an issue for 2.7?
msg105583 - (view)	Author: christen (Richard.Christen@unice.fr)	Date: 2010-05-12 12:30
I have no idea, because:
- I am using 2.5 (Windows) or 2.6 (2.5 because of old stuff that I compiled to be compatible with 2.5, not 2.6);
- I am using open(file, 'U'), which solved the problem under Windows, and the problem does not exist on Linux.

Best,
Richard
msg116719 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2010-09-17 20:22
The superseding issue ("Newline skipped in "for line in file" for huge file") describes why it's probably a bug in the C library. Possible workarounds are to open the files in universal newline mode, to use io.open(), or to switch to Python 3!
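A minimal sketch of the io.open() workaround mentioned above, applied to the copy loop from the original report. The function name `copy_lines` and the choice of the latin-1 encoding are assumptions made for illustration: latin-1 is used only because it maps every byte to a character, so the copy cannot fail on unexpected bytes. With `newline=None`, io.open() performs universal newline translation in the io module itself, which is what the 'U' mode workaround relied on in Python 2.

```python
import io


def copy_lines(src_path, dst_path):
    """Copy a text file line by line and return the number of lines copied."""
    count = 0
    # newline=None: "\r\n" and "\r" are translated to "\n" on input.
    with io.open(src_path, "r", newline=None, encoding="latin-1") as src:
        # newline="\n": lines are written with "\n", with no translation.
        with io.open(dst_path, "w", newline="\n", encoding="latin-1") as dst:
            for line in src:
                dst.write(line)
                count += 1
    return count
```

The returned count can be compared with the expected number of lines (163763326 for the chr 4 file in the report) to check that no line endings were skipped.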

History
Date	User	Action	Args
2022-04-11 14:56:15	admin	set	github: 43040
2010-09-17 20:22:25	amaury.forgeotdarc	set	status: open -> closed; nosy: + amaury.forgeotdarc; messages: +; superseder: Newline skipped in "for line in file" for huge file; resolution: wont fix
2010-05-12 12:30:50	Richard.Christen@unice.fr	set	nosy: + Richard.Christen@unice.fr; messages: +
2010-05-11 20:49:45	terry.reedy	set	nosy: + terry.reedy; messages: +
2009-03-30 06:32:45	ajaksu2	set	dependencies: + Error reading files larger than 4GB; type: behavior; stage: test needed; versions: + Python 2.6, - Python 2.5
2008-03-10 13:00:50	JosephArmbruster	set	messages: +
2008-03-01 01:04:08	JosephArmbruster	set	nosy: + JosephArmbruster; messages: +
2006-03-16 17:21:35	richardchristen	create