Issue 33896: Document what components make up the filecmp.cmp os.stat signature.

Created on 2018-06-18 18:18 by Dean Morin, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (7)

msg319904 - (view)

Author: Dean Morin (Dean Morin)

Date: 2018-06-18 18:18

By default, filecmp.cmp() has shallow=True, which can produce some surprising behavior.

In the docs it states:

If shallow is true, files with identical os.stat() signatures are taken to be equal.

However, the "signature" only considers the file mode, file size, and file modification time, which is not sufficient. cmp() will return True (files are equal) in some circumstances for files that actually differ. Depending on the underlying file system, the same Python script will return True or False when cmp() is called on the exact same files. I'll add the long-winded details at the bottom.
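A minimal sketch of the collision described above. The helper `stat_signature` is a hypothetical stand-in that mirrors the three components named here (file type from the mode, size, and mtime); the real helper is the private `filecmp._sig`, which is an implementation detail and may change. The file names and CSV contents are made up for illustration.

```python
import filecmp
import os
import stat
import tempfile


def stat_signature(path):
    """Return (file type, size, mtime); a stand-in for the private filecmp._sig."""
    st = os.stat(path)
    return (stat.S_IFMT(st.st_mode), st.st_size, st.st_mtime)


def make_colliding_pair(directory):
    """Write two same-sized files with different content and the same mtime."""
    first = os.path.join(directory, "old.csv")
    second = os.path.join(directory, "new.csv")
    with open(first, "w") as handle:
        handle.write("date\n2018-06-15\n")
    with open(second, "w") as handle:
        handle.write("date\n2018-06-16\n")
    # Force both files to the same whole-second (atime, mtime) pair.
    for path in (first, second):
        os.utime(path, (1529108360, 1529108360))
    return first, second


with tempfile.TemporaryDirectory() as directory:
    first, second = make_colliding_pair(directory)
    signatures_match = stat_signature(first) == stat_signature(second)
    # shallow=True is the default, so equal signatures short-circuit to True.
    shallow_result = filecmp.cmp(first, second)
    print(signatures_match, shallow_result)
```

Here the two files differ by one character, yet both the signature comparison and the default `filecmp.cmp()` call report them as equal.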

To fix, I believe st.st_ino should be included in _sig (https://github.com/python/cpython/blob/3.7/Lib/filecmp.py#L68).

I'm in the middle of a move, but I can make a PR in the next couple weeks if this seems like a reasonable fix and no one else gets around to it.

The long version is that we're migrating some existing reports to a new data source. The goal is to produce identical csv files from both data sources. I have a python script that pulls down both csv files and uses cmp() to compare them.

On my machine, the script correctly discovers the differences between the two. One of the date columns has incorrect dates in the new version.

However, on my colleague's machine, the script fails to discover the differences and shows that the csv files are identical.

The difference is that on my machine, os.stat(f).st_mtime is a timestamp which includes fractional seconds (1529108360.1955538), but on my colleague's machine it only includes the whole seconds (1529108360.0). Since only the dates differed within the csvs, both files had the same file mode and file size, and both were downloaded within the same second.
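To illustrate the arithmetic, here is a small sketch of how two distinct modification times collapse to the same value at one-second resolution. The first timestamp is the one quoted above; the second is a hypothetical value for a download finishing later in the same second. Whether a given file system truncates or rounds is an assumption here; the sketch uses truncation.

```python
import math

# Value quoted in the report from the machine with sub-second timestamps.
subsecond_mtime = 1529108360.1955538
# Hypothetical second download finishing later within the same second.
later_mtime = 1529108360.8731204


def whole_seconds(mtime):
    """Drop the fractional part, as a one-second-resolution file system might."""
    return float(math.floor(mtime))


distinct_at_full_precision = subsecond_mtime != later_mtime
equal_when_truncated = whole_seconds(subsecond_mtime) == whole_seconds(later_mtime)
print(distinct_at_full_precision, equal_when_truncated)
```

With full precision the mtime component keeps the signatures apart; once truncated, only mode and size remain to distinguish the files.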

We got a few more people to see what they got for st_mtime. The link could be the file system used. We're all using Macs, but for those of us using an APFS Volume disk, st_mtime returns a timestamp which includes fractional seconds, and for those of us using a Logical Volume Mac OS Extended disk, it returns a timestamp which only includes the seconds (1529108360.0).

When comparing os.stat() between the two differing csv files, the only difference (other than fractional seconds for various timestamps) was st_ino which is why I believe it should be included in _sig().

msg319912 - (view)

Author: R. David Murray (r.david.murray) * (Python committer)

Date: 2018-06-18 20:46

I understand your concern, but this is working as designed and documented. Using st_ino would mean you would return true if and only if it was the same file. That is not the intent. See issue 27396 for some background discussion.
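A rough sketch of why adding st_ino would change the meaning of the comparison. `signature_with_inode` is a hypothetical variant written only for this illustration, not anything in filecmp; the file names are made up. It assumes a platform where distinct files on the same volume report distinct inode numbers, as on typical POSIX file systems.

```python
import filecmp
import os
import tempfile


def signature_with_inode(path):
    """Hypothetical signature that also includes st_ino (not what filecmp does)."""
    st = os.stat(path)
    return (st.st_mode, st.st_size, st.st_mtime, st.st_ino)


with tempfile.TemporaryDirectory() as directory:
    original = os.path.join(directory, "report.csv")
    duplicate = os.path.join(directory, "report_copy.csv")
    # Two separate files with byte-identical content and the same mtime.
    for path in (original, duplicate):
        with open(path, "w") as handle:
            handle.write("date\n2018-06-15\n")
        os.utime(path, (1529108360, 1529108360))

    same_content = filecmp.cmp(original, duplicate, shallow=False)
    inode_signatures_match = (
        signature_with_inode(original) == signature_with_inode(duplicate)
    )
    # os.path.samefile is the existing tool for "is this the same file?".
    same_file = os.path.samefile(original, duplicate)
    print(same_content, inode_signatures_match, same_file)
```

The two files have identical contents, but an inode-based signature would never match for them, which turns the check into an identity test rather than an equality test.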

msg319913 - (view)

Author: R. David Murray (r.david.murray) * (Python committer)

Date: 2018-06-18 20:47

For your problem, just don't use the default shallow setting :)
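As a sketch of that workaround: passing shallow=False makes filecmp.cmp() read and compare the file contents whenever the sizes match, instead of trusting the os.stat() signature. The wrapper name `contents_equal` and the sample files are hypothetical.

```python
import filecmp
import os
import tempfile


def contents_equal(path_a, path_b):
    """Compare file contents, bypassing the os.stat() signature shortcut."""
    return filecmp.cmp(path_a, path_b, shallow=False)


with tempfile.TemporaryDirectory() as directory:
    old_report = os.path.join(directory, "old.csv")
    new_report = os.path.join(directory, "new.csv")
    with open(old_report, "w") as handle:
        handle.write("date\n2018-06-15\n")
    with open(new_report, "w") as handle:
        handle.write("date\n2018-06-16\n")
    # Same size and same mtime, so the signatures alone cannot tell them apart.
    for path in (old_report, new_report):
        os.utime(path, (1529108360, 1529108360))

    deep_result = contents_equal(old_report, new_report)
    print(deep_result)
```

The cost is that both files are read in full whenever their sizes match, which is usually acceptable for report-sized CSV files.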

msg319919 - (view)

Author: Dean Morin (Dean Morin)

Date: 2018-06-18 22:10

Fair enough, how about just making it clearer in the documentation? Currently you need to look at the source code to see what would be required for a signature clash to occur. Maybe something like:

Note that the os.stat() signatures only consider st_mode, st_size, and st_mtime. In some circumstances it's possible for differing files to be considered equal when shallow is True.

msg319970 - (view)

Author: R. David Murray (r.david.murray) * (Python committer)

Date: 2018-06-19 13:54

I think it might be OK to document what goes into the signature, but probably in a footnote, as it is somewhat of an implementation detail and could conceivably change. We could then also add a caution about mtime imprecision being a particular risk on some file systems. I wouldn't want to put such a caution in the main doc paragraphs, but in a footnote I think it would be OK.

msg379797 - (view)

Author: Stavros Macrakis (macrakis)

Date: 2020-10-27 19:42

I agree completely that the documentation should be more explicit. The concept of "os.stat signature" is not defined anywhere as far as I can tell. The naive reader (like me) might mistakenly assume that the "os.stat signature" is a "digital signature" (i.e. a hash) of all of os.stat, and in particular that it includes the st_ino.

I would suggest that the documentation be updated to read "If shallow is true, files with the same length, modification time, and mode are taken to be equal."

msg398308 - (view)

Author: Andrei Kulakov (andrei.avk) * (Python triager)

Date: 2021-07-27 16:30

Can be closed as a dupe of https://bugs.python.org/issue42958, which has a PR ready for review.

History

Date

User

Action

Args

2022-04-11 14:59:01

admin

set

github: 78077

2021-08-16 23:30:26

ned.deily

set

status: open -> closed
superseder: filecmp.cmp(shallow=True) isn't actually shallow when only mtime differs
stage: needs patch -> resolved
resolution: duplicate
versions: + Python 3.9, Python 3.10, Python 3.11, - Python 2.7, Python 3.7, Python 3.8

2021-07-27 16:30:44

andrei.avk

set

nosy: + andrei.avk
messages: +

2020-10-27 19:42:50

macrakis

set

nosy: + macrakis
messages: +

2019-01-04 19:47:04

cheryl.sabella

set

assignee: docs@python

nosy: + docs@python
components: + Documentation, - Library (Lib)
versions: - Python 3.6

2018-06-19 13:54:18

r.david.murray

set

versions: + Python 2.7

2018-06-19 13:54:06

r.david.murray

set

status: closed -> open
versions: + Python 3.8, - Python 2.7, Python 3.4, Python 3.5
title: filecmp.cmp returns True on files that differ -> Document what components make up the filecmp.cmp os.stat signature.
messages: +

resolution: rejected -> (no value)
stage: resolved -> needs patch

2018-06-18 22:10:28

Dean Morin

set

messages: +

2018-06-18 20:47:08

r.david.murray

set

messages: +

2018-06-18 20:46:17

r.david.murray

set

status: open -> closed

nosy: + r.david.murray
messages: +

resolution: rejected
stage: resolved

2018-06-18 18:18:51

Dean Morin

create