Handling UUID Data - PyMongo 4.13.2 documentation

PyMongo ships with built-in support for dealing with UUID types. It is straightforward to store native uuid.UUID objects to MongoDB and retrieve them as native uuid.UUID objects:

    from pymongo import MongoClient
    from bson.binary import UuidRepresentation
    from uuid import uuid4

    # use the 'standard' representation for cross-language compatibility.
    client = MongoClient(uuidRepresentation='standard')
    collection = client.get_database('uuid_db').get_collection('uuid_coll')

    # remove all documents from collection
    collection.delete_many({})

    # create a native uuid object
    uuid_obj = uuid4()

    # save the native uuid object to MongoDB
    collection.insert_one({'uuid': uuid_obj})

    # retrieve the stored uuid object from MongoDB
    document = collection.find_one({})

    # check that the retrieved UUID matches the inserted UUID
    assert document['uuid'] == uuid_obj

Native uuid.UUID objects can also be used as part of MongoDB queries:

    # find_one() returns a single document; find() would return a cursor
    document = collection.find_one({'uuid': uuid_obj})
    assert document['uuid'] == uuid_obj

The above examples illustrate the simplest of use cases: one where the UUID is generated by, and used in, the same application. The situation can be significantly more complex when a MongoDB deployment contains UUIDs created by other drivers, because the Java and C# drivers have historically encoded UUIDs using a byte-order that is different from the one used by PyMongo. Applications that require interoperability across these drivers must specify the appropriate UuidRepresentation.

In the following sections, we describe how drivers have historically differed in their encoding of UUIDs, and how applications can use the UuidRepresentation configuration option to maintain cross-language compatibility.

Attention

New applications that do not share a MongoDB deployment with any other application and that have never stored UUIDs in MongoDB should use the standard UUID representation for cross-language compatibility. See Configuring a UUID Representation for details on how to configure the UuidRepresentation.

Legacy Handling of UUID Data

Historically, MongoDB Drivers have used different byte-ordering while serializing UUID types to Binary. Consider, for instance, a UUID with the following canonical textual representation:

00112233-4455-6677-8899-aabbccddeeff

This UUID would historically be serialized by the Python driver as:

00112233-4455-6677-8899-aabbccddeeff

The same UUID would historically be serialized by the C# driver as:

33221100-5544-7766-8899-aabbccddeeff

Finally, the same UUID would historically be serialized by the Java driver as:

77665544-3322-1100-ffee-ddccbbaa9988
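The three byte layouts above can be reproduced with the standard library alone. The following is a minimal sketch rather than PyMongo's own implementation: the helper names `python_legacy_bytes`, `csharp_legacy_bytes`, `java_legacy_bytes`, and `as_text` are hypothetical, and the Java layout assumes the reversal of bytes 0-7 and bytes 8-15 described in the JAVA_LEGACY section.

```python
import uuid


def python_legacy_bytes(u: uuid.UUID) -> bytes:
    # Big-endian field order, exactly as exposed by uuid.UUID.bytes.
    return u.bytes


def csharp_legacy_bytes(u: uuid.UUID) -> bytes:
    # uuid.UUID.bytes_le reverses bytes 0-3, bytes 4-5, and bytes 6-7.
    return u.bytes_le


def java_legacy_bytes(u: uuid.UUID) -> bytes:
    # Reverse bytes 0-7 and bytes 8-15 independently.
    raw = u.bytes
    return raw[0:8][::-1] + raw[8:16][::-1]


def as_text(raw: bytes) -> str:
    # Render 16 raw bytes in the canonical 8-4-4-4-12 grouping.
    return str(uuid.UUID(bytes=raw))


example = uuid.UUID("00112233-4455-6677-8899-aabbccddeeff")
print(as_text(python_legacy_bytes(example)))  # 00112233-4455-6677-8899-aabbccddeeff
print(as_text(csharp_legacy_bytes(example)))  # 33221100-5544-7766-8899-aabbccddeeff
print(as_text(java_legacy_bytes(example)))    # 77665544-3322-1100-ffee-ddccbbaa9988
```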

This difference in the byte-order of UUIDs encoded by different drivers can result in highly unintuitive behavior in some scenarios. We detail two such scenarios in the next sections.

Scenario 1: Applications Share a MongoDB Deployment

Consider the following situation:

This example demonstrates how the differing byte-order used by different drivers can hamper interoperability. To work around this problem, users should configure their MongoClient with the appropriate UuidRepresentation (in this case, the client in application P can be configured to use the CSHARP_LEGACY representation to avoid the unintuitive behavior), as described in Configuring a UUID Representation.
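As a rough illustration of this kind of mismatch, the sketch below uses only the standard library and no MongoDB deployment. The in-memory `stored_documents` list and the `find_by_uuid_bytes` helper are hypothetical stand-ins for a collection and a query, and the sketch assumes the document was written by an application using the C# legacy byte order.

```python
import uuid

example = uuid.UUID("00112233-4455-6677-8899-aabbccddeeff")

# Hypothetical stand-in for a collection: one document whose UUID bytes were
# written in the C# legacy byte order (uuid.UUID.bytes_le).
stored_documents = [{"_id": example.bytes_le, "origin": "application C"}]


def find_by_uuid_bytes(raw: bytes):
    # Compare the encoded query value byte-for-byte against stored values.
    for doc in stored_documents:
        if doc["_id"] == raw:
            return doc
    return None


# Encoding the query value in Python legacy order (uuid.UUID.bytes) misses.
miss = find_by_uuid_bytes(example.bytes)

# Encoding the query value in the order the writer used finds the document.
hit = find_by_uuid_bytes(example.bytes_le)

print(miss)           # None
print(hit["origin"])  # application C
```

The same UUID value therefore behaves like two different keys unless reader and writer agree on the byte order.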

Scenario 2: Round-Tripping UUIDs

In the following examples, we see how using a misconfigured UuidRepresentation can cause an application to inadvertently change the Binary subtype, and in some cases, the bytes of the Binary field itself when round-tripping documents containing UUIDs.

Consider the following situation:

    from bson.codec_options import CodecOptions, DEFAULT_CODEC_OPTIONS
    from bson.binary import Binary, UuidRepresentation
    from uuid import uuid4

    # Using UuidRepresentation.PYTHON_LEGACY stores a Binary subtype-3 UUID
    python_opts = CodecOptions(uuid_representation=UuidRepresentation.PYTHON_LEGACY)
    input_uuid = uuid4()
    collection = client.testdb.get_collection('test', codec_options=python_opts)
    collection.insert_one({'_id': 'foo', 'uuid': input_uuid})
    assert collection.find_one({'uuid': Binary(input_uuid.bytes, 3)})['_id'] == 'foo'

    # Retrieving this document using UuidRepresentation.STANDARD returns a Binary instance
    std_opts = CodecOptions(uuid_representation=UuidRepresentation.STANDARD)
    std_collection = client.testdb.get_collection('test', codec_options=std_opts)
    doc = std_collection.find_one({'_id': 'foo'})
    assert isinstance(doc['uuid'], Binary)

    # Round-tripping the retrieved document yields the exact same document.
    # Read it back through std_collection so both copies are decoded the same way.
    std_collection.replace_one({'_id': 'foo'}, doc)
    round_tripped_doc = std_collection.find_one({'uuid': Binary(input_uuid.bytes, 3)})
    assert doc == round_tripped_doc

In this example, round-tripping the document using the incorrect UuidRepresentation (STANDARD instead of PYTHON_LEGACY) changes the Binary subtype as a side-effect. Note that this can also happen when the situation is reversed, i.e. when the original document is written using the STANDARD representation and then round-tripped using the PYTHON_LEGACY representation.

In the next example, we see the consequences of incorrectly using a representation that modifies byte-order (CSHARP_LEGACY or JAVA_LEGACY) when round-tripping documents:

    from bson.codec_options import CodecOptions, DEFAULT_CODEC_OPTIONS
    from bson.binary import Binary, UuidRepresentation
    from uuid import uuid4

    # Using UuidRepresentation.STANDARD stores a Binary subtype-4 UUID
    std_opts = CodecOptions(uuid_representation=UuidRepresentation.STANDARD)
    input_uuid = uuid4()
    collection = client.testdb.get_collection('test', codec_options=std_opts)
    collection.insert_one({'_id': 'baz', 'uuid': input_uuid})
    assert collection.find_one({'uuid': Binary(input_uuid.bytes, 4)})['_id'] == 'baz'

    # Retrieving this document using UuidRepresentation.JAVA_LEGACY returns a native UUID
    # without modifying the UUID byte-order
    java_opts = CodecOptions(uuid_representation=UuidRepresentation.JAVA_LEGACY)
    java_collection = client.testdb.get_collection('test', codec_options=java_opts)
    doc = java_collection.find_one({'_id': 'baz'})
    assert doc['uuid'] == input_uuid

    # Round-tripping the retrieved document silently changes the Binary bytes and subtype
    java_collection.replace_one({'_id': 'baz'}, doc)
    assert collection.find_one({'uuid': Binary(input_uuid.bytes, 3)}) is None
    assert collection.find_one({'uuid': Binary(input_uuid.bytes, 4)}) is None
    round_tripped_doc = collection.find_one({'_id': 'baz'})
    assert round_tripped_doc['uuid'] == Binary(input_uuid.bytes, 3).as_uuid(UuidRepresentation.JAVA_LEGACY)

In this case, using the incorrect UuidRepresentation (JAVA_LEGACY instead of STANDARD) changes the Binary bytes and subtype as a side-effect. Note that this happens when any representation that manipulates byte-order (CSHARP_LEGACY or JAVA_LEGACY) is incorrectly used to round-trip UUIDs written with STANDARD. When the situation is reversed, i.e. when the original document is written using CSHARP_LEGACY or JAVA_LEGACY and then round-tripped using STANDARD, only the Binary subtype is changed.
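The byte change described in this second example can be traced by hand with the standard library. This is a sketch of the behavior described above rather than PyMongo code: `java_legacy_reorder` is a hypothetical helper implementing the reversal of bytes 0-7 and bytes 8-15, the `(subtype, payload)` tuples are stand-ins for stored Binary values, and a fixed UUID is used so the output is reproducible.

```python
import uuid


def java_legacy_reorder(raw: bytes) -> bytes:
    # Reverse bytes 0-7 and bytes 8-15 independently.
    # Applying the function twice restores the original order.
    return raw[0:8][::-1] + raw[8:16][::-1]


input_uuid = uuid.UUID("00112233-4455-6677-8899-aabbccddeeff")

# Written with STANDARD: subtype 4, payload in uuid.UUID.bytes order.
stored = (4, input_uuid.bytes)

# Misconfigured read: the subtype-4 payload is decoded without reordering,
# so the application still sees the UUID it expects.
decoded = uuid.UUID(bytes=stored[1])

# Write-back with JAVA_LEGACY: subtype 3, payload reordered.
round_tripped = (3, java_legacy_reorder(decoded.bytes))

print(decoded == input_uuid)     # True
print(stored[1].hex())           # 00112233445566778899aabbccddeeff
print(round_tripped[1].hex())    # 7766554433221100ffeeddccbbaa9988
```

Both the subtype and the payload now differ from what was originally stored, even though the application never observed a wrong UUID.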

Note

Starting in PyMongo 4.0, these issues are resolved: the STANDARD representation decodes Binary subtype 3 fields as Binary objects of subtype 3 (instead of uuid.UUID), and each of the *_LEGACY representations decodes Binary subtype 4 fields as Binary objects of subtype 4 (instead of uuid.UUID).

Configuring a UUID Representation

Users can work around the problems described above by configuring their applications with the appropriate UuidRepresentation. Configuring the representation modifies PyMongo’s behavior while encoding uuid.UUID objects to BSON and decoding Binary subtype 3 and 4 fields from BSON.

Applications can set the UUID representation in one of the following ways:

  1. At the MongoClient level using the uuidRepresentation URI option, e.g.:
    client = MongoClient("mongodb://a:27107/?uuidRepresentation=standard")
    Valid values are:

    Value           UUID Representation
    unspecified     UNSPECIFIED
    standard        STANDARD
    pythonLegacy    PYTHON_LEGACY
    javaLegacy      JAVA_LEGACY
    csharpLegacy    CSHARP_LEGACY
  2. At the MongoClient level using the uuidRepresentation kwarg option, e.g.:
    from bson.binary import UuidRepresentation
    client = MongoClient(uuidRepresentation=UuidRepresentation.STANDARD)
  3. At the Database or Collection level by supplying a suitable CodecOptions instance, e.g.:
    from bson.codec_options import CodecOptions
    csharp_opts = CodecOptions(uuid_representation=UuidRepresentation.CSHARP_LEGACY)
    java_opts = CodecOptions(uuid_representation=UuidRepresentation.JAVA_LEGACY)

    # Get database/collection from client with csharpLegacy UUID representation
    csharp_database = client.get_database('csharp_db', codec_options=csharp_opts)
    csharp_collection = client.testdb.get_collection('csharp_coll', codec_options=csharp_opts)

    # Get database/collection from existing database/collection with javaLegacy UUID representation
    java_database = csharp_database.with_options(codec_options=java_opts)
    java_collection = csharp_collection.with_options(codec_options=java_opts)

Supported UUID Representations

We now detail the behavior and use-case for each supported UUID representation.

UNSPECIFIED

Attention

Starting in PyMongo 4.0, UNSPECIFIED is the default UUID representation used by PyMongo.

The UNSPECIFIED representation prevents the incorrect interpretation of UUID bytes by stopping short of automatically converting UUID fields in BSON to native UUID types. Decoding a UUID when using this representation returns a Binary object instead. If required, users can coerce the decoded Binary objects into native UUIDs using the as_uuid() method and specifying the appropriate representation format. The following example shows what this might look like for a UUID stored by the C# driver:

    from bson.codec_options import CodecOptions, DEFAULT_CODEC_OPTIONS
    from bson.binary import Binary, UuidRepresentation
    from uuid import uuid4

    # Using UuidRepresentation.CSHARP_LEGACY
    csharp_opts = CodecOptions(uuid_representation=UuidRepresentation.CSHARP_LEGACY)

    # Store a legacy C#-formatted UUID
    input_uuid = uuid4()
    collection = client.testdb.get_collection('test', codec_options=csharp_opts)
    collection.insert_one({'_id': 'foo', 'uuid': input_uuid})

    # Using UuidRepresentation.UNSPECIFIED
    unspec_opts = CodecOptions(uuid_representation=UuidRepresentation.UNSPECIFIED)
    unspec_collection = client.testdb.get_collection('test', codec_options=unspec_opts)

    # UUID fields are decoded as Binary when UuidRepresentation.UNSPECIFIED is configured
    document = unspec_collection.find_one({'_id': 'foo'})
    decoded_field = document['uuid']
    assert isinstance(decoded_field, Binary)

    # Binary.as_uuid() can be used to coerce the decoded value to a native UUID
    decoded_uuid = decoded_field.as_uuid(UuidRepresentation.CSHARP_LEGACY)
    assert decoded_uuid == input_uuid

Native uuid.UUID objects cannot directly be encoded to Binary when the UUID representation is UNSPECIFIED, and attempting to do so will result in an exception:

    >>> unspec_collection.insert_one({'_id': 'bar', 'uuid': uuid4()})
    Traceback (most recent call last):
    ...
    ValueError: cannot encode native uuid.UUID with UuidRepresentation.UNSPECIFIED. UUIDs can be manually converted to bson.Binary instances using bson.Binary.from_uuid() or a different UuidRepresentation can be configured. See the documentation for UuidRepresentation for more information.

Instead, applications using UNSPECIFIED must explicitly coerce a native UUID using the from_uuid() method:

    explicit_binary = Binary.from_uuid(uuid4(), UuidRepresentation.STANDARD)
    unspec_collection.insert_one({'_id': 'bar', 'uuid': explicit_binary})

STANDARD

Attention

This UUID representation should be used by new applications or applications that are encoding and/or decoding UUIDs in MongoDB for the first time.

The STANDARD representation enables cross-language compatibility by ensuring the same byte-ordering when encoding UUIDs from all drivers. UUIDs written by a driver with this representation configured will be handled correctly by every other driver, provided it is also configured with the STANDARD representation.

STANDARD encodes native uuid.UUID objects to Binary subtype 4 objects.

PYTHON_LEGACY

Attention

This UUID representation should be used when reading UUIDs generated by existing applications that use the Python driver but don’t explicitly set a UUID representation.

Attention

PYTHON_LEGACY was the default UUID representation in PyMongo 3.

The PYTHON_LEGACY representation corresponds to the legacy representation of UUIDs used by PyMongo. This representation conforms with RFC 4122 Section 4.1.2.

The following example illustrates the use of this representation:

    from bson.codec_options import CodecOptions, DEFAULT_CODEC_OPTIONS
    from bson.binary import Binary, UuidRepresentation
    from uuid import uuid4

    # No configured UUID representation
    collection = client.python_legacy.get_collection('test', codec_options=DEFAULT_CODEC_OPTIONS)

    # Using UuidRepresentation.PYTHON_LEGACY
    pylegacy_opts = CodecOptions(uuid_representation=UuidRepresentation.PYTHON_LEGACY)
    pylegacy_collection = client.python_legacy.get_collection('test', codec_options=pylegacy_opts)

    # UUIDs written by PyMongo 3 with no UuidRepresentation configured
    # (or PyMongo 4.0 with PYTHON_LEGACY) can be queried using PYTHON_LEGACY
    uuid_1 = uuid4()
    pylegacy_collection.insert_one({'uuid': uuid_1})
    document = pylegacy_collection.find_one({'uuid': uuid_1})

PYTHON_LEGACY encodes native uuid.UUID objects to Binary subtype 3 objects, preserving the same byte-order as uuid.UUID.bytes:

    from bson.binary import Binary

    # uuid_1 was inserted through pylegacy_collection above
    document = pylegacy_collection.find_one({'uuid': Binary(uuid_1.bytes, subtype=3)})
    assert document['uuid'] == uuid_1
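The byte-order that PYTHON_LEGACY preserves is the big-endian field layout of RFC 4122. As a small standard-library check, assuming nothing beyond Python's `uuid` and `struct` modules, the sketch below packs the individual UUID fields in network order and compares the result with `uuid.UUID.bytes`.

```python
import struct
import uuid

example = uuid.UUID("00112233-4455-6677-8899-aabbccddeeff")

# uuid.UUID.fields yields: time_low (32 bits), time_mid (16),
# time_hi_and_version (16), clock_seq_hi_and_reserved (8),
# clock_seq_low (8), and node (48).
time_low, time_mid, time_hi, clock_hi, clock_low, node = example.fields

# Pack the first five fields big-endian, then append the 6-byte node.
packed = struct.pack(">IHHBB", time_low, time_mid, time_hi, clock_hi, clock_low)
packed += node.to_bytes(6, "big")

print(packed == example.bytes)  # True
```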

JAVA_LEGACY

Attention

This UUID representation should be used when reading UUIDs written to MongoDB by legacy applications (i.e. applications that don’t use the STANDARD representation) using the Java driver.

The JAVA_LEGACY representation corresponds to the legacy representation of UUIDs used by the MongoDB Java Driver.

Note

The JAVA_LEGACY representation reverses the order of bytes 0-7, and bytes 8-15.

As an example, consider the same UUID described in Legacy Handling of UUID Data. Let us assume that an application used the Java driver without an explicitly specified UUID representation to insert the example UUID 00112233-4455-6677-8899-aabbccddeeff into MongoDB. If we try to read this value using PYTHON_LEGACY, we end up with an entirely different UUID:

UUID('77665544-3322-1100-ffee-ddccbbaa9988')

However, if we explicitly set the representation to JAVA_LEGACY, we get the correct result:

UUID('00112233-4455-6677-8899-aabbccddeeff')

PyMongo uses the specified UUID representation to reorder the BSON bytes and load them correctly. JAVA_LEGACY encodes native uuid.UUID objects to Binary subtype 3 objects, while performing the same byte-reordering as the legacy Java driver’s UUID to BSON encoder.
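The two reads above can be reproduced without a server. In this sketch the stored payload is written out by hand and `decode_java_legacy` is a hypothetical helper, not a PyMongo function; it only applies the reversal of bytes 0-7 and bytes 8-15 noted earlier.

```python
import uuid


def decode_java_legacy(raw: bytes) -> uuid.UUID:
    # Undo the Java legacy layout by reversing bytes 0-7 and bytes 8-15.
    return uuid.UUID(bytes=raw[0:8][::-1] + raw[8:16][::-1])


# Payload as a legacy Java application would have stored the example UUID.
stored = bytes.fromhex("7766554433221100ffeeddccbbaa9988")

# Reading in PYTHON_LEGACY order takes the bytes at face value.
misread = uuid.UUID(bytes=stored)

# Reading in JAVA_LEGACY order recovers the UUID that was inserted.
recovered = decode_java_legacy(stored)

print(misread)    # 77665544-3322-1100-ffee-ddccbbaa9988
print(recovered)  # 00112233-4455-6677-8899-aabbccddeeff
```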

CSHARP_LEGACY

Attention

This UUID representation should be used when reading UUIDs written to MongoDB by legacy applications (i.e. applications that don’t use the STANDARD representation) using the C# driver.

The CSHARP_LEGACY representation corresponds to the legacy representation of UUIDs used by the MongoDB C# Driver.

Note

The CSHARP_LEGACY representation reverses the order of bytes 0-3, bytes 4-5, and bytes 6-7.

As an example, consider the same UUID described in Legacy Handling of UUID Data. Let us assume that an application used the C# driver without an explicitly specified UUID representation to insert the example UUID 00112233-4455-6677-8899-aabbccddeeff into MongoDB. If we try to read this value using PYTHON_LEGACY, we end up with an entirely different UUID:

UUID('33221100-5544-7766-8899-aabbccddeeff')

However, if we explicitly set the representation to CSHARP_LEGACY, we get the correct result:

UUID('00112233-4455-6677-8899-aabbccddeeff')

PyMongo uses the specified UUID representation to reorder the BSON bytes and load them correctly. CSHARP_LEGACY encodes native uuid.UUID objects to Binary subtype 3 objects, while performing the same byte-reordering as the legacy C# driver’s UUID to BSON encoder.
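The C# legacy layout matches the little-endian form that Python's `uuid` module already exposes through `bytes_le`, so the two reads above can be sketched with the standard library. The stored payload below is written out by hand for illustration; no PyMongo calls are involved.

```python
import uuid

# Payload as a legacy C# application would have stored the example UUID.
stored = bytes.fromhex("33221100554477668899aabbccddeeff")

# Reading in PYTHON_LEGACY order takes the bytes at face value.
misread = uuid.UUID(bytes=stored)

# Reading in CSHARP_LEGACY order treats bytes 0-3, 4-5, and 6-7 as
# little-endian, which is what uuid.UUID(bytes_le=...) does.
recovered = uuid.UUID(bytes_le=stored)

print(misread)    # 33221100-5544-7766-8899-aabbccddeeff
print(recovered)  # 00112233-4455-6677-8899-aabbccddeeff
```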