Python Source Reader

Author: David Goodger
Contact: docutils-develop@lists.sourceforge.net
Revision: 9906
Date: 2024-08-15

This document has been placed in the public domain.

This document explores issues around extracting and processing docstrings from Python modules.

For definitive element hierarchy details, see the "Python Plaintext Document Interface DTD" XML document type definition, <pysource.dtd> (which modifies the generic docutils.dtd). Descriptions below list 'DTD elements' (XML 'generic identifiers' or tag names) corresponding to syntax constructs.

Contents

Model
Docstring Extractor
Interpreted Text

Model

The Python Source Reader ("PySource") model that's evolving in my mind goes something like this:

1. Extract the docstring/namespace [1] tree from the module(s) and/or package(s).
2. Run the parser on each docstring in turn, producing a forest of doctrees (per nodes.py).
3. Join the docstring trees together into a single tree, running transforms:
   - merge hyperlinks
   - merge namespaces
   - create various sections like "Module Attributes", "Functions", "Classes", "Class Attributes", etc.; see <pysource.dtd>
   - convert the above special sections to ordinary doctree nodes
4. Run transforms on the combined doctree. Examples: resolving cross-references/hyperlinks (including interpreted text on Python identifiers); footnote auto-numbering; first field list -> bibliographic elements.
   (Or should step 4's transforms come before step 3?)
5. Pass the resulting unified tree to the writer/builder.
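As a rough illustration of step 1 only, the following sketch builds a docstring/namespace tree from source text with the standard library's ast module, without importing the module being documented. The NamespaceNode class and the extract_tree function are hypothetical names invented for this example; they are not part of Docutils, and the real extractor may be organized quite differently.

```python
import ast
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NamespaceNode:
    """One entry in the docstring/namespace tree (hypothetical structure)."""
    kind: str                      # "module", "class", "function", or "method"
    name: str
    docstring: Optional[str]       # None when the object has no docstring
    children: List["NamespaceNode"] = field(default_factory=list)


def extract_tree(source: str, module_name: str = "<module>") -> NamespaceNode:
    """Build a docstring/namespace tree from Python source text.

    Only definitions that are direct children of a module, class, or
    function body are visited; definitions nested inside ``if`` or ``try``
    blocks are ignored in this sketch.
    """
    def visit(node: ast.AST, kind: str, name: str) -> NamespaceNode:
        result = NamespaceNode(kind, name, ast.get_docstring(node))
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                result.children.append(visit(child, "class", child.name))
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                # A function defined directly in a class body is a method.
                child_kind = "method" if kind == "class" else "function"
                result.children.append(visit(child, child_kind, child.name))
        return result

    return visit(ast.parse(source), "module", module_name)
```

Each node's docstring could then be handed to the parser in step 2, with the tree shape preserved for the merging transforms in step 3.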

I've had trouble reconciling the roles of input parser and output writer with the idea of modes ("readers" or "directors"). Does the mode govern the transformation of the input, the output, or both? Perhaps the mode should be split into two.

For example, say the source of our input is a Python module. Our "input mode" should be the "Python Source Reader". It discovers (from __docformat__) that the input parser is "reStructuredText". If we want HTML, we'll specify the "HTML" output formatter. But there's a piece missing. What kind or style of HTML output do we want? PyDoc-style, LibRefMan style, etc. (many people will want to specify and control their own style). Is the output style specific to a particular output format (XML, HTML, etc.)? Is the style specific to the input mode? Or can/should they be independent?
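One way the reader might discover __docformat__ is to look for a module-level string assignment in the parsed source. The sketch below assumes exactly that; find_docformat is a hypothetical helper written for this example, and how the resulting string is mapped to a parser is left open.

```python
import ast
from typing import Optional


def find_docformat(source: str) -> Optional[str]:
    """Return the first module-level ``__docformat__`` string, or None.

    Only plain assignments of a string literal at the top level of the
    module are recognized; anything computed at run time is ignored.
    """
    for node in ast.parse(source).body:
        if not isinstance(node, ast.Assign):
            continue
        for target in node.targets:
            if isinstance(target, ast.Name) and target.id == "__docformat__":
                value = node.value
                if isinstance(value, ast.Constant) and isinstance(value.value, str):
                    return value.value
    return None
```

Because the source is parsed rather than imported, the lookup has no side effects on the documented module.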

I envision interaction between the input parser, an "input mode", and the output formatter. The same intermediate data format would be used between each of these, being transformed as it progresses.

Interpreted Text

DTD elements: package, module, class, method, function, module_attribute, class_attribute, instance_attribute, variable, parameter, type, exception_class, warning_class.

To classify identifiers explicitly, the role is given along with the identifier in either prefix or suffix form:

Use :method:`Keeper.storedata` to store the object's data in `Keeper.data`:instance_attribute:.

The role may be one of 'package', 'module', 'class', 'method', 'function', 'module_attribute', 'class_attribute', 'instance_attribute', 'variable', 'parameter', 'type', 'exception_class', 'exception', 'warning_class', or 'warning'. Other roles may be defined.
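To make the prefix and suffix forms concrete, here is a minimal sketch that pulls (role, identifier) pairs out of a line of text with a regular expression. It assumes reStructuredText's backquote-delimited interpreted text and role names made of letters and underscores; find_roles is a hypothetical helper for illustration, not how the reStructuredText parser itself recognizes roles.

```python
import re
from typing import List, Tuple

# Prefix form:  :role:`identifier`     Suffix form:  `identifier`:role:
_INTERPRETED = re.compile(
    r":(?P<prefix>[A-Za-z_]+):`(?P<ptext>[^`]+)`"
    r"|`(?P<stext>[^`]+)`:(?P<suffix>[A-Za-z_]+):"
)


def find_roles(text: str) -> List[Tuple[str, str]]:
    """Return (role, identifier) pairs for explicitly classified identifiers."""
    pairs = []
    for match in _INTERPRETED.finditer(text):
        if match.group("prefix"):
            pairs.append((match.group("prefix"), match.group("ptext")))
        else:
            pairs.append((match.group("suffix"), match.group("stext")))
    return pairs


line = ("Use :method:`Keeper.storedata` to store the object's data "
        "in `Keeper.data`:instance_attribute:.")
print(find_roles(line))
# prints [('method', 'Keeper.storedata'), ('instance_attribute', 'Keeper.data')]
```

The sketch does not check role names against the list above, so additional roles defined later would be picked up without changes.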