[Python-Dev] status of development documentation

Sun Dec 25 05:43:08 CET 2005

bom = '\xef\xbb\xbf'
compile(bom + 'print 1\n', '', 'exec')
compile() peels off the first byte (\xef) and says "syntax
error" at that point.  The call chain is:

Py_CompileStringFlags ->
PyParser_ASTFromString ->
PyParser_ParseStringFlagsFilename ->
parsetok ->
PyTokenizer_Get

The last call, PyTokenizer_Get, sets `a` to point at the start of the string, `b` to point at the
second character, and returns type==51.  Then `len` is set to 1,
`str` is malloc'ed to hold 2 bytes, and `str` is filled in with
"\xef\x00" (the first byte of the input, as a NUL-terminated C
string).
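
A rough model of that copy step, written in Python 3 rather than C, may make the result easier to see. This is only a sketch: copy_token_text is a hypothetical name, `a` and `b` are modelled as integer offsets instead of pointers, and nothing here is taken from the actual parser source.

```python
def copy_token_text(source: bytes, a: int, b: int) -> bytes:
    """Model the copy described above: take source[a:b], then NUL-terminate it."""
    length = b - a                    # corresponds to `len` in the description
    buf = bytearray(length + 1)       # zero-filled, so the last byte is the NUL
    buf[:length] = source[a:b]        # copy the token's bytes into the buffer
    return bytes(buf)


# The failing input: a UTF-8 BOM followed by ordinary source text.
source = b"\xef\xbb\xbf" + b"print 1\n"

# a = start of string, b = second byte, as reported for the first token.
token_text = copy_token_text(source, 0, 1)
```

With those offsets, token_text holds only the first byte of the BOM plus the terminator, which matches the "\xef\x00" described above.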

PyParser_AddToken then calls classify(), which falls off the end of
its last loop and returns -1:  syntax error.

I'm getting a strong suspicion that I'm the only developer to _try_
building the trunk on WinXP since the AST merge was done, and that
something obscure is fundamentally broken with it on this box.  For
example, in tokenizer.c, these functions don't even exist on Windows
today (because an enclosing #ifdef says not to compile them):

error_ret
new_string
get_normal_name
get_coding_spec
check_coding_spec
check_bom
fp_readl
fp_setreadl
fp_getc
fp_ungetc
decoding_fgets
decoding_feof
buf_getc
buf_ungetc
buf_setreadl
translate_into_utf8
decode_str

OK, that's not quite true.  "Degenerate" forms of three of those
functions exist on Windows:

static char *
decoding_fgets(char *s, int size, struct tok_state *tok)
{
       return fgets(s, size, tok->fp);
}

static int
decoding_feof(struct tok_state *tok)
{
       return feof(tok->fp);
}

static const char *
decode_str(const char *str, struct tok_state *tok)
{
       return str;
}

In the simple failing test, that degenerate decode_str() is getting
called.  If the "fancy" decode_str() were being used instead, that one
_does_ call check_bom().  Why do we have two versions of these
functions?  Which set is supposed to be in use now?  What's the
meaning of "#ifdef PGEN" today?  Should it be true or false?
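
To make the difference between the two decode_str() variants concrete, here is a minimal Python 3 sketch. It is not the tokenizer.c code: both function names are hypothetical, and the BOM-aware one models only the single step of dropping a leading UTF-8 BOM, which is an assumption about what the check_bom() path amounts to for this input.

```python
import codecs

UTF8_BOM = codecs.BOM_UTF8  # the three bytes b'\xef\xbb\xbf'


def degenerate_decode_str(source: bytes) -> bytes:
    """Hypothetical model of the stub: hand the input back untouched."""
    return source


def bom_aware_decode_str(source: bytes) -> bytes:
    """Hypothetical model of the fancy path: drop a leading UTF-8 BOM."""
    if source.startswith(UTF8_BOM):
        return source[len(UTF8_BOM):]
    return source


src = UTF8_BOM + b"print 1\n"

# What the tokenizer would see as its first byte under each variant.
first_byte_stub = degenerate_decode_str(src)[:1]
first_byte_fancy = bom_aware_decode_str(src)[:1]
```

Under the stub the first byte handed on is still \xef, which is where the syntax error is reported; under the BOM-aware variant it is the `p` of `print`.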