Make parsers CharacterStream aware by ashleysommer · Pull Request #1145 · RDFLib/rdflib
Went to do a simple fix for #1144 and ended up creating quite a big (and IMHO important) set of changes.
There are two NTriples parsers in RDFLib: one in /plugins/parsers/nt.py called NTParser, and another in /plugins/parsers/ntriples.py called NTriplesParser.
The latter is the original reference implementation of the NTriples W3C standard as provided by W3C.
It is a legacy-style parser which takes an open file pointer as input and, when run, emits triples into a Sink.
The other, NTParser in nt.py, is a wrapper around the legacy parser. It adds rdflib compatibility: it takes an rdflib.InputSource as input and emits triples to an rdflib.Graph.
This PR puts both in the same file and renames the legacy NTriplesParser to W3CNTriplesParser to avoid confusion.
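To make the parser/sink/wrapper relationship concrete, here is a minimal sketch using only the standard library. Every name in it (`ToyLineParser`, `SetGraphSink`, `parse_into_graph`) is hypothetical and exists only for illustration. It is not the rdflib code, and the toy parser only splits whitespace-separated terms rather than implementing the NTriples grammar.

```python
import io


class ToyLineParser:
    """Stand-in for a legacy-style parser: reads an open file, feeds a sink.

    Hypothetical and deliberately naive: it splits each line on whitespace
    and does not implement the real NTriples grammar.
    """

    def __init__(self, sink):
        self.sink = sink

    def parse(self, f):
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            # Drop the trailing " ." terminator, then split into three terms.
            s, p, o = line.rstrip(" .").split(None, 2)
            self.sink.triple(s, p, o)
        return self.sink


class SetGraphSink:
    """Adapter sink: forwards each emitted triple into a graph-like object.

    A plain set stands in for the graph here (assumption for illustration).
    """

    def __init__(self, graph):
        self.graph = graph

    def triple(self, s, p, o):
        self.graph.add((s, p, o))


def parse_into_graph(stream, graph):
    """Wrapper: hides the sink plumbing and fills the graph directly."""
    ToyLineParser(SetGraphSink(graph)).parse(stream)
    return graph


graph = parse_into_graph(
    io.StringIO("<a> <b> <c> .\n# a comment\n<a> <b> <d> .\n"), set()
)
```

The wrapper owns the sink, so callers only see "stream in, graph out", which is roughly the role the description gives to NTParser.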
The most important change here is adding CharacterStream support to the rdflib InputSource. This allows parsers to read unicode streams directly from the input source, as opposed to reading from the inputsource.ByteStream and then converting to str with data.decode(). Often the InputSource was already a string to begin with. This PR changes some parsers to prefer reading from the inputsource.CharacterStream, if available, instead of the ByteStream. This removes many useless string->bytes->bytestream->textstream->string conversions which were happening in the parser pipelines.
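The "prefer the character stream, fall back to the byte stream" idea can be sketched with the standard library's `xml.sax.xmlreader.InputSource`, which provides the `getCharacterStream()` and `getByteStream()` accessors. The helper name `open_text` and the UTF-8 default are assumptions made for this sketch. It is not the code from this PR.

```python
import io
from xml.sax.xmlreader import InputSource


def open_text(source, encoding="utf-8"):
    """Return a text stream for `source` (hypothetical helper).

    Prefers the character stream when one is set, so text that is already
    a str is never encoded to bytes and decoded again.
    """
    char_stream = source.getCharacterStream()
    if char_stream is not None:
        return char_stream
    byte_stream = source.getByteStream()
    if byte_stream is None:
        raise ValueError("input source has neither a character nor a byte stream")
    # Fallback: decode the bytes on the fly.
    return io.TextIOWrapper(byte_stream, encoding=encoding)


data = '<http://example.org/s> <http://example.org/p> "caf\u00e9" .\n'

# A source that already holds text: no bytes round trip is needed.
text_source = InputSource()
text_source.setCharacterStream(io.StringIO(data))

# A source that only holds bytes: decoding happens in the fallback branch.
byte_source = InputSource()
byte_source.setByteStream(io.BytesIO(data.encode("utf-8")))

from_text = open_text(text_source).read()
from_bytes = open_text(byte_source).read()
```

Both paths hand the parser the same str; the difference is that the first skips the encode/decode work entirely.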
- Merged the two NTriples parser files
- Renamed NTriplesParser to W3CNTriplesParser, as it is the legacy parser
- Populated the CharacterStream attr on several types of rdflib InputSource to provide a unicode text stream in addition to the ByteStream
- Added support to the N3, Trig, NTriples, and NQuads parsers for using the CharacterStream instead of the ByteStream where possible
- Removed many useless string->bytes->string conversions in parsers
- Added tests for the #1144 fix (N-triples parser: reading a file fails without binary mode on Python 3.6)
- All tests pass after these changes
- Fixes N-triples parser: reading a file fails without binary mode on Python 3.6 #1144