Proposal: Embed sources in PDBs · Issue #12625 · dotnet/roslyn

Implementation progress

Windows PDB support is tracked by #13707.

This proposal addresses #5397, which requests a feature for embedding source code inside of a PDB.

Scenarios

Recap from #5397

Also

Command Line Usage

Since common usage will already leverage a source server and only require generated code to be embedded, we need to be able to specify the files to embed individually.

Proposal: Add a new /embed switch for vbc.exe and csc.exe:

Examples

csc /debug+ src\*.cs /embed:generated\*.cs

#line directives

There is also a scenario where debugging requires external files that are not part of the compilation and are lined up to the actual source code via #line directives.

Proposal: A file targeted by a #line directive shall be embedded in the PDB if either the target file or the referencing source file is embedded.

Example

source.cs

class P
{
    static void Main()
    {
#line 1 "example.xyz"
        System.Console.WriteLine("Hello World");
    }
}

example.xyz

csc source.cs /embed:example.xyz /debug+   
csc source.cs /embed /debug+
csc source.cs /embed:source.cs /debug+
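The #line rule above can be sketched as a small set computation. This is only an illustration of the stated rule, not the compiler's implementation: the `LineDirectiveEmbedding` class, its `Resolve` method, and the dictionary of #line targets per source file are all hypothetical names introduced here.

```csharp
using System;
using System.Collections.Generic;

static class LineDirectiveEmbedding
{
    // Returns the files to embed: everything requested explicitly, plus every
    // #line target referenced from a source file that is itself requested.
    // (Hypothetical helper; the proposal does not define such an API.)
    public static HashSet<string> Resolve(
        IEnumerable<string> requested,
        IReadOnlyDictionary<string, string[]> lineTargetsBySource)
    {
        var requestedSet = new HashSet<string>(requested, StringComparer.Ordinal);
        var result = new HashSet<string>(requestedSet, StringComparer.Ordinal);

        foreach (var pair in lineTargetsBySource)
        {
            // A target is embedded when the source file referencing it is embedded.
            if (requestedSet.Contains(pair.Key))
            {
                result.UnionWith(pair.Value);
            }
        }

        return result;
    }

    static void Main()
    {
        // source.cs contains: #line 1 "example.xyz"
        var targets = new Dictionary<string, string[]>
        {
            ["source.cs"] = new[] { "example.xyz" },
        };

        // Models "/embed:example.xyz": only the target was requested.
        Console.WriteLine(string.Join(", ", Resolve(new[] { "example.xyz" }, targets)));

        // Models "/embed:source.cs" (and bare "/embed", which expands to all sources):
        // the referencing file is embedded, so its #line target comes along.
        Console.WriteLine(string.Join(", ", Resolve(new[] { "source.cs" }, targets)));
    }
}
```

Under this reading of the rule, each of the three command lines above results in example.xyz being embedded, and the last two also embed source.cs.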

Source Generators

This feature would pair nicely with https://github.com/dotnet/roslyn/blob/features/source-generators/docs/features/generators.md if/when both land, allowing generator output to be debugged without any requirement to acquire (or regenerate) the output by some other means.

We might choose to handle embedding source generator output in one of 3 ways:

  1. Always embed generator output if a PDB is being emitted.
  2. Add a way to decorate a generator as opting in (or out) of having its output embedded.
  3. Add a command-line

After much discussion about an earlier version of this proposal, there was a strong desire to keep the command-line interface minimal, so I think (1) or (2) should be preferred. I personally think always embedding generator output is the best option as it means that generators get good debuggability with no fuss. We could always add a command-line or generator API opt-out later if there was anyone pushing back on embedding the generator output.

I propose that we open a separate follow-up issue to track how to integrate these two features after both have arrived in a common branch and discuss 1-3 or other alternatives there.

Command Line API

Proposal: Add a property to Microsoft.CodeAnalysis.CommandLineArguments to indicate a list of files to be embedded in the PDB.

public class CommandLineArguments
{
    ...
    // New property: files to be embedded in the PDB.
    public IEnumerable EmbeddedFiles { get; }
}

Note that if /embed is specified without arguments it is surfaced here by appending the full set of source files to this list and not via a separate API.

Emit API

It should be possible to embed source and additional text via public API without routing through the command-line compiler interface.

Proposal:
NOTE: Additions of optional parameters below to be done in the usual binary-compat-preserving way.

namespace Microsoft.CodeAnalysis.Text
{
    // ...
    public abstract class SourceText
    {
        // ...
        public static SourceText From(
            // existing parameters
            Stream stream,
            Encoding encoding = null,
            SourceHashAlgorithm checksumAlgorithm = SourceHashAlgorithm.Sha1,
            bool throwIfBinaryDetected = false,

            // new parameter: capture enough information to save exact original bytes to PDB
            bool canBeEmbedded = false);

        public static SourceText From(
            // existing parameters
            byte[] buffer,
            int length,
            Encoding encoding = null,
            SourceHashAlgorithm checksumAlgorithm = SourceHashAlgorithm.Sha1,
            bool throwIfBinaryDetected = false,

            // new parameter: capture enough information to save exact original bytes to PDB
            bool canBeEmbedded = false);

        // new property: indicates if it is possible to create EmbeddedText from instance.
        // Either canBeEmbedded=true must have been specified with original bytes, or,
        // if not constructed from bytes/stream, must have Encoding.
        public bool CanBeEmbedded { get; }
    }
}

namespace Microsoft.CodeAnalysis
{
    public abstract class Compilation
    {
        // ...
        public EmitResult Emit(
            // Existing parameters
            Stream peStream,
            Stream pdbStream = null,
            Stream xmlDocumentationStream = null,
            Stream win32Resources = null,
            IEnumerable<ResourceDescription> manifestResources = null,
            EmitOptions options = null,
            IMethodSymbol debugEntryPoint = null,

            // New parameter: specify the texts (with their paths) to embed
            IEnumerable<EmbeddedText> embeddedTexts = null,

            // Existing parameter
            CancellationToken cancellationToken = default(CancellationToken));
    }

    // new type
    public sealed class EmbeddedText
    {
        private EmbeddedText();

        public string FilePath { get; }
        public SourceHashAlgorithm ChecksumAlgorithm { get; }
        public ImmutableArray<byte> Checksum { get; }

        // create embedded text from source text, SourceText.CanBeEmbedded must be true
        public static EmbeddedText FromSource(string filePath, SourceText text);

        // create embedded text from a stream (for file that is not source)
        public static EmbeddedText FromStream(string filePath, Stream stream, SourceHashAlgorithm checksumAlgorithm = SourceHashAlgorithm.Sha1);

        // create embedded text from bytes in memory (for file that is not source)
        public static EmbeddedText FromBytes(string filePath, ArraySegment<byte> bytes, SourceHashAlgorithm checksumAlgorithm = SourceHashAlgorithm.Sha1);
    }
}
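For orientation, here is roughly how a caller might use the API proposed above. It is a sketch, not a reference: it assumes a Microsoft.CodeAnalysis.CSharp package that already contains these additions, and the file name `generated.cs`, the assembly name `demo`, and the source string are made up for illustration.

```csharp
using System;
using System.IO;
using System.Text;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.Emit;
using Microsoft.CodeAnalysis.Text;

static class EmbedViaEmitApi
{
    static void Main()
    {
        const string path = "generated.cs"; // hypothetical file name
        byte[] bytes = Encoding.UTF8.GetBytes("class C { }");

        // canBeEmbedded: true asks SourceText to retain the exact original bytes.
        SourceText text = SourceText.From(bytes, bytes.Length, Encoding.UTF8, canBeEmbedded: true);
        SyntaxTree tree = CSharpSyntaxTree.ParseText(text, path: path);

        var compilation = CSharpCompilation.Create(
            "demo",
            new[] { tree },
            new[] { MetadataReference.CreateFromFile(typeof(object).Assembly.Location) },
            new CSharpCompilationOptions(OutputKind.DynamicallyLinkedLibrary));

        using var peStream = new MemoryStream();
        using var pdbStream = new MemoryStream();

        // The caller decides which texts to embed and passes them with their paths.
        EmitResult result = compilation.Emit(
            peStream,
            pdbStream,
            options: EmitOptions.Default.WithDebugInformationFormat(DebugInformationFormat.PortablePdb),
            embeddedTexts: new[] { EmbeddedText.FromSource(path, text) });

        Console.WriteLine($"Emit succeeded: {result.Success}, PDB bytes: {pdbStream.Length}");
    }
}
```

The path given to `EmbeddedText.FromSource` is the same one given to the syntax tree, since that is how the embedded text is matched to its debug document.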

Note that it is the caller's responsibility to gather the source and non-source text as appropriate. Text will line up with the corresponding source/sequence points through the existing mechanism for de-duping debug documents generated by source trees, #line, and #pragma checksum: i.e. paths will be normalized and then compared case-insensitively for VB and case-sensitively for C#.
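The path matching described above might look like the following sketch. Two assumptions are made here that the proposal does not spell out: that "normalized" means resolving to a full path against a base directory, and that the comparison is ordinal. `DebugDocumentPaths` and its members are hypothetical names, and `Path.GetFullPath(path, basePath)` requires .NET Core 2.1 or later.

```csharp
using System;
using System.IO;

static class DebugDocumentPaths
{
    // Assumption: normalization resolves relative segments against a base directory.
    public static string Normalize(string path, string baseDirectory) =>
        Path.GetFullPath(path, baseDirectory);

    // VB compares document paths case-insensitively, C# case-sensitively.
    public static bool SameDocument(string left, string right, string baseDirectory, bool isVisualBasic)
    {
        StringComparer comparer = isVisualBasic
            ? StringComparer.OrdinalIgnoreCase
            : StringComparer.Ordinal;

        return comparer.Equals(Normalize(left, baseDirectory), Normalize(right, baseDirectory));
    }

    static void Main()
    {
        string baseDirectory = Directory.GetCurrentDirectory();

        // Same location after normalization, differing only in the case of the file name.
        Console.WriteLine(SameDocument("gen/../Gen/A.cs", "Gen/a.cs", baseDirectory, isVisualBasic: true));
        Console.WriteLine(SameDocument("gen/../Gen/A.cs", "Gen/a.cs", baseDirectory, isVisualBasic: false));
    }
}
```

A practical consequence for callers is that the path passed with an embedded text should be spelled the same way as the path of the syntax tree or #line target it is meant to match, at least for C#.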

Compression

Files beyond a trivial size should be compressed in the PDB, using the Deflate format. Tiny files do not benefit from compression, which can even waste cycles while making them bigger, so we should have a size threshold at which we start to compress.
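To make the threshold argument concrete, the sketch below deflates a tiny input and a repetitive larger one and prints the sizes before and after. The cutoff of 200 bytes is a placeholder chosen for illustration only, since the proposal does not fix a number, and `CompressionThreshold` is a hypothetical helper.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;

static class CompressionThreshold
{
    // Placeholder value; the proposal leaves the actual threshold open.
    public const int ThresholdBytes = 200;

    public static bool ShouldCompress(byte[] input) => input.Length >= ThresholdBytes;

    // Compresses the input with raw Deflate (no zlib or gzip wrapper).
    public static byte[] Deflate(byte[] input)
    {
        using var output = new MemoryStream();
        using (var deflate = new DeflateStream(output, CompressionLevel.Optimal, leaveOpen: true))
        {
            deflate.Write(input, 0, input.Length);
        } // disposing the DeflateStream flushes the final block into 'output'

        return output.ToArray();
    }

    static void Main()
    {
        byte[] tiny = Encoding.UTF8.GetBytes("class C { }");
        byte[] large = Encoding.UTF8.GetBytes(
            string.Concat(Enumerable.Repeat("System.Console.WriteLine(\"Hello World\");\n", 500)));

        foreach (byte[] input in new[] { tiny, large })
        {
            Console.WriteLine(
                $"{input.Length} bytes -> {Deflate(input).Length} bytes deflated, " +
                $"compress: {ShouldCompress(input)}");
        }
    }
}
```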

Encoding

Any source text created from raw bytes/stream shall be copied (or compressed and copied) to the PDB without decoding and re-encoding bytes -> chars -> bytes. This is required since encodings do not always round-trip and the checksum must match the original stream.

A source text created by other means (e.g. string + encoding), whose checksum is calculated by encoding it to bytes via SourceText.Encoding, will have its text encoded with SourceText.Encoding when embedded.

See also the CanBeEmbedded requirements above.

Portable PDB Representation

In portable PDBs, we will put the embedded source as a custom debug info entry (with a new GUID allocated for it) parented by the document entry.

The blob will have a leading int32 which, when zero, indicates that the remaining bytes are the raw, uncompressed text and, when positive, indicates that the remaining bytes are compressed by deflate and that the positive value is the byte size when decompressed.
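The blob layout described above can be sketched with BCL types alone. This is an illustration rather than the compiler's writer or the debugger's reader: `EmbeddedSourceBlob` is a hypothetical helper, and the little-endian byte order of the leading int32 is an assumption, since the text above does not state it.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;

static class EmbeddedSourceBlob
{
    // Layout: int32 header (assumed little-endian), then the payload.
    // Header 0 => payload is the raw text bytes.
    // Header N > 0 => payload is deflate data that decompresses to N bytes.
    public static byte[] Encode(byte[] raw, bool compress)
    {
        // An empty input is always stored raw, because a header of 0 means "uncompressed".
        bool useDeflate = compress && raw.Length > 0;

        using var blob = new MemoryStream();
        using (var writer = new BinaryWriter(blob, Encoding.UTF8, leaveOpen: true))
        {
            writer.Write(useDeflate ? raw.Length : 0);
        }

        if (useDeflate)
        {
            using var deflate = new DeflateStream(blob, CompressionLevel.Optimal, leaveOpen: true);
            deflate.Write(raw, 0, raw.Length);
        }
        else
        {
            blob.Write(raw, 0, raw.Length);
        }

        return blob.ToArray();
    }

    public static byte[] Decode(byte[] blob)
    {
        using var input = new MemoryStream(blob);
        using var reader = new BinaryReader(input, Encoding.UTF8, leaveOpen: true);

        int uncompressedSize = reader.ReadInt32();
        if (uncompressedSize < 0)
        {
            throw new InvalidDataException("Negative size header.");
        }

        if (uncompressedSize == 0)
        {
            return reader.ReadBytes(blob.Length - sizeof(int));
        }

        var result = new byte[uncompressedSize];
        using var deflate = new DeflateStream(input, CompressionMode.Decompress);
        int read = 0;
        while (read < uncompressedSize)
        {
            int n = deflate.Read(result, read, uncompressedSize - read);
            if (n == 0)
            {
                throw new InvalidDataException("Compressed data ended early.");
            }
            read += n;
        }

        return result;
    }

    static void Main()
    {
        byte[] source = Encoding.UTF8.GetBytes("class P { static void Main() { } }");

        foreach (bool compress in new[] { false, true })
        {
            byte[] blob = Encode(source, compress);
            bool roundTrips = Decode(blob).SequenceEqual(source);
            Console.WriteLine($"compress={compress}: blob is {blob.Length} bytes, round-trips: {roundTrips}");
        }
    }
}
```

Because the payload is the original byte stream rather than re-encoded text, the decoded bytes can be hashed directly and compared with the document checksum.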

Portable PDB spec is being updated accordingly: dotnet/corefx#10560

Windows PDB Representation

The traditional Windows PDB already had a provision for embedded source, which we will use via ISymUnmanagedDocumentWriter::SetSource.

The corresponding method for reading back the embedded source returned E_NOTIMPL until recently, but I have made the change to implement it and an update to the nuget package is pending.

The blob format will be identical to the portable PDB. This is already a diasymreader custom PDB "injected source" so we can define the source portion as we wish. Using the same blob for Windows and portable PDBs opens up optimizations in the implementation (less copying) and also simplifies it.