The attached patch fixes test_eval by handling Unicode input correctly. I'm also compiling tokenizer.c directly rather than tokenizer_pgen.c. tokenizer_pgen.c simply defines PGEN and then includes tokenizer.c, and defining PGEN disables some of the Unicode support. I don't think tokenizer_pgen.c is needed.
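
For anyone unfamiliar with the wrapper pattern, here is a minimal sketch of how a PGEN-style macro can compile out a Unicode code path. This is not the real tokenizer code: `unicode_support_enabled` and `skip_bom` are hypothetical names made up for illustration, and the BOM check only stands in for whatever the real `#ifdef PGEN` blocks guard.

```c
#include <string.h>

/* Uncommenting the next line mimics what a wrapper file does when it
   defines PGEN and then includes the main source file. */
/* #define PGEN */

/* Hypothetical helper: reports whether the Unicode-aware path was compiled in. */
int unicode_support_enabled(void)
{
#ifdef PGEN
    return 0;   /* PGEN build: Unicode handling compiled out */
#else
    return 1;   /* normal build: Unicode handling present */
#endif
}

/* Hypothetical helper: skips a leading UTF-8 byte order mark, but only
   when the Unicode-aware path is compiled in. */
const char *skip_bom(const char *s)
{
#ifndef PGEN
    if (strncmp(s, "\xEF\xBB\xBF", 3) == 0)
        return s + 3;
#endif
    return s;
}
```

With PGEN undefined, `skip_bom` drops the three BOM bytes. With PGEN defined, the same source compiles to a version that hands the input back untouched, which is the kind of silent difference that makes building the wrapper file instead of the main file risky.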