In this post I take a short look at the new StringBuilder.MoveChunks() API introduced in .NET 11 preview 5. First we'll describe what the API does and how to use it, then we'll look at how it's implemented. Finally, we'll look at why this API was introduced.
Building text efficiently with StringBuilder
StringBuilder is a mainstay of almost all .NET applications. You can use StringBuilder to efficiently concatenate strings, characters, and other ToString()ed objects without creating a lot of intermediate strings. Recent versions of .NET added even more efficient rendering with ISpanFormattable implementations.
A few years ago I wrote a deep-dive into the implementation of StringBuilder. Things have changed a bit since then, but hopefully you'll still find it an interesting read!
StringBuilder is one of the first basic optimisations you learn when doing .NET development, because it avoids generating lots of intermediate strings, which in turn reduces the pressure on the garbage collector (GC).
For example:
// Create a StringBuilder object with no text.
var sb = new StringBuilder();
// Append some text.
sb.Append('*', 10)
    .Append(" Adding Text to a StringBuilder Object ")
    .Append('*', 10);
sb.AppendLine("\n");
sb.AppendLine("This avoids allocations");
var result = sb.ToString();
// ********** Adding Text to a StringBuilder Object **********
//
// This avoids allocations
Without using a StringBuilder, you might end up with many intermediate string objects.
As I describe in my deep dive, the StringBuilder type achieves this by using multiple char[] buffers, called chunks internally, and writing to these buffers when you call Append(). Only when you call ToString() are these buffers converted into a string object.
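If you want to see the chunks for yourself, the public GetChunks() method (available since .NET Core 3.0) lets you enumerate them. The following is a minimal sketch rather than anything definitive: the loop count and the text are arbitrary values I picked, and the number of chunks you see depends on the runtime's internal growth strategy, so I don't show an expected chunk count.

```csharp
using System;
using System.Text;

// Append enough text that the builder will likely need more than one chunk.
var sb = new StringBuilder();
for (int i = 0; i < 100; i++)
{
    sb.Append("some text ");
}

int chunkCount = 0;
int totalChars = 0;

// GetChunks() exposes each internal buffer as a ReadOnlyMemory<char>.
foreach (ReadOnlyMemory<char> chunk in sb.GetChunks())
{
    chunkCount++;
    totalChars += chunk.Length;
}

// The chunk lengths always add up to the builder's Length.
Console.WriteLine($"Chunks: {chunkCount}, chars in chunks: {totalChars}, Length: {sb.Length}");
```

No string is created while enumerating; the chunks are views over the builder's existing buffers.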
Using the new StringBuilder.MoveChunks() method
.NET 11 preview 5 introduces a new static method on the StringBuilder type, MoveChunks():
public static StringBuilder MoveChunks(StringBuilder source);
Conceptually, this method works somewhat similarly to the following code (though much more efficiently, as we'll see shortly):
StringBuilder source = new("some existing text"); // any existing builder
// Create a new StringBuilder with the contents of the original
var snapshot = new StringBuilder(source.ToString());
// The original source is "reset"
source.Clear();
So the MoveChunks() method moves the contents of the source StringBuilder into a new StringBuilder instance, leaving source empty.
The main difference between the code above and MoveChunks() (other than performance) is that when you call Clear(), any allocated char[] chunks in the StringBuilder source are kept, and it's just the length etc that is reset. With MoveChunks, the source instance won't have any internal chunks any more.
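You can observe the Clear() side of that comparison today with existing APIs. This is a small sketch, assuming a current .NET runtime: the exact Capacity values are an implementation detail and may differ between runtime versions, so I only print them rather than promise specific numbers.

```csharp
using System;
using System.Text;

var sb = new StringBuilder();
sb.Append('x', 10_000);

int capacityBefore = sb.Capacity;

// Clear() resets the length, but the builder holds on to buffer space.
sb.Clear();

Console.WriteLine(sb.Length); // 0
Console.WriteLine($"Capacity before: {capacityBefore}, after: {sb.Capacity}");
```

On the runtimes I'd expect you to be using, the capacity after Clear() stays large, which is exactly what makes a cached, reused StringBuilder cheap to append to again.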
You can see the effect of MoveChunks with the simple example below, where the contents of source is effectively moved to the new instance:
var source = new StringBuilder();
source.AppendLine("Adding ");
source.AppendLine("some text");
StringBuilder snapshot = StringBuilder.MoveChunks(source);
Console.Write(snapshot.ToString()); // "Adding " and "some text", each followed by a line break
Console.WriteLine(source.ToString()); // ""
Console.WriteLine(source.Length); // 0
source.Append("ready for reuse");
Now, if you're like me, you might wonder why you can't just do the following instead:
StringBuilder source = new("some existing text"); // any existing builder
// Point new StringBuilder at previous source
StringBuilder snapshot = source;
// Create a new instance and point source at it
source = new();
This is almost identical to what MoveChunks does, and in many cases it will be identical, but there's still a good reason for introducing the API, which I'll talk about later.
But before we get to that, let's look at the implementation of MoveChunks itself.
Looking at the implementation of MoveChunks()
The MoveChunks API works by moving the "guts" of the StringBuilder source into a new StringBuilder, and resetting the guts of the original StringBuilder to a "fresh" state:
namespace System.Text;
public sealed partial class StringBuilder
{
    public static StringBuilder MoveChunks(StringBuilder source)
    {
        ArgumentNullException.ThrowIfNull(source);
        // Create a new instance, destination, from the contents of source
        StringBuilder destination = new StringBuilder(source);
        // Reset all the internal fields to a "fresh" state
        source.m_ChunkChars = [];
        source.m_ChunkPrevious = null;
        source.m_ChunkLength = 0;
        source.m_ChunkOffset = 0;
        return destination;
    }

    // "Move" constructor
    private StringBuilder(StringBuilder from)
    {
        // Set all the private fields of the new StringBuilder
        // from the source instance. This essentially "copies" all the state
        // from the old StringBuilder to the new one (no allocation, we're just
        // copying references and value types)
        m_ChunkLength = from.m_ChunkLength;
        m_ChunkOffset = from.m_ChunkOffset;
        m_ChunkChars = from.m_ChunkChars;
        m_ChunkPrevious = from.m_ChunkPrevious;
        m_MaxCapacity = from.m_MaxCapacity;
        // Only runs in Debug mode, ensures everything is valid
        // (which it always will be, unless you have done anything weird with reflection)
        AssertInvariants();
    }
}
And that's it for the API implementation; it's really very simple. Which brings us back to the question: why do we need this API?
Why do we need MoveChunks?
The question of "why do we need this API?" was discussed extensively in the .NET API Review session, with a lot of pushback precisely for the reasons described above: you can effectively already do this today, so why do you need it?
The main motivation is an interesting one, and it comes down to source generators. You can use source generators to generate additional source code at compile time, and a very common way to do that is to use the classic StringBuilder! It's common to build up the source file in a StringBuilder and then create a SourceText instance.
There are lots of existing APIs for creating a SourceText in Roslyn:
namespace Microsoft.CodeAnalysis.Text;
public abstract class SourceText
{
    public static SourceText From(string text, Encoding? encoding = null, SourceHashAlgorithm checksumAlgorithm = SourceHashAlgorithm.Sha1);
    public static SourceText From(TextReader reader, int length, Encoding? encoding = null, SourceHashAlgorithm checksumAlgorithm = SourceHashAlgorithm.Sha1);
    public static SourceText From(Stream stream, Encoding? encoding = null, SourceHashAlgorithm checksumAlgorithm = SourceHashAlgorithm.Sha1, bool throwIfBinaryDetected = false, bool canBeEmbedded = false);
    public static SourceText From(byte[] buffer, int length, Encoding? encoding = null, SourceHashAlgorithm checksumAlgorithm = SourceHashAlgorithm.Sha1, bool throwIfBinaryDetected = false, bool canBeEmbedded = false);
}
Of these APIs, the only one that's really applicable is the string text overload. Which means if you build up the text using StringBuilder, you have to call ToString() on it afterwards. Now, obviously that's not terrible, but it could still mean a big extra allocation for large strings. But is it really necessary?
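To make the pattern concrete, here's a minimal sketch of a generator that builds its output in a StringBuilder and hands it to Roslyn via the existing string overload. The HelloGenerator type, the generated Hello class, and the "Hello.g.cs" hint name are all hypothetical names I made up for illustration; the sketch assumes a project that references the Microsoft.CodeAnalysis.CSharp package.

```csharp
using System.Text;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.Text;

// Hypothetical generator, for illustration only.
[Generator]
public sealed class HelloGenerator : IIncrementalGenerator
{
    public void Initialize(IncrementalGeneratorInitializationContext context)
    {
        context.RegisterPostInitializationOutput(ctx =>
        {
            // Build up the generated file in a StringBuilder.
            var sb = new StringBuilder();
            sb.AppendLine("namespace Generated");
            sb.AppendLine("{");
            sb.AppendLine("    public static class Hello");
            sb.AppendLine("    {");
            sb.AppendLine("        public const string Message = \"Hello from the generator\";");
            sb.AppendLine("    }");
            sb.AppendLine("}");

            // ToString() allocates one string containing the whole file
            // before SourceText.From(string, Encoding) can wrap it.
            SourceText text = SourceText.From(sb.ToString(), Encoding.UTF8);
            ctx.AddSource("Hello.g.cs", text);
        });
    }
}
```

For a tiny file like this the extra string is irrelevant, but generators that emit thousands of lines pay for that ToString() call on every run.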
The proposal is to add an additional overload, which takes a StringBuilder instance directly:
public abstract class SourceText
{
    public static SourceText From(StringBuilder stringBuilder, Encoding? encoding = null, SourceHashAlgorithm checksumAlgorithm = SourceHashAlgorithm.Sha1);
}
The big problem with the above API is that Roslyn can't guarantee who is still holding a reference to the StringBuilder instance. In order to not violate all sorts of downstream constraints, Roslyn would have to create a copy of the contents of the StringBuilder, which would nullify a lot of the performance advantages.
And this is where the MoveChunks() API comes in.
The good thing about the MoveChunks() API is that anyone who was previously holding a reference to the StringBuilder continues to hold a valid instance (albeit one that has been cleared out). But importantly, they can't mutate the instance that Roslyn is holding onto. That's vital for Roslyn to be able to safely rely on the StringBuilder as a backing value for the SourceText.
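This is also why simply swapping references isn't enough. The sketch below uses only existing APIs to show the aliasing problem; heldElsewhere is a hypothetical stand-in for any other code that kept a reference to the original builder.

```csharp
using System;
using System.Text;

var source = new StringBuilder("class A { }");

// Hypothetical: some other component kept its own reference to the same builder.
StringBuilder heldElsewhere = source;

// The "just swap the references" approach.
StringBuilder snapshot = source;
source = new StringBuilder();

// The other component can still append, and the snapshot changes with it,
// because snapshot and heldElsewhere are the same object.
heldElsewhere.Append(" // sneaky edit");

Console.WriteLine(snapshot.ToString());                      // class A { } // sneaky edit
Console.WriteLine(ReferenceEquals(snapshot, heldElsewhere)); // True
Console.WriteLine(source.Length);                            // 0
```

Reassigning the source variable only changes what that one variable points at. Every other reference still reaches the original object, so the "snapshot" isn't really a snapshot at all.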
So when is the new Roslyn API going to be available?
The API proposal above is just a proposal; it's not implemented yet. One current problem is whether Roslyn can actually use the new MoveChunks() API. Today, to write a source generator, you have to target netstandard2.0, which means this new API being in .NET 11 isn't much use. That's not to say it can't be used internally (there's already a StringBuilderText implementation in the library). They can't just make this type public, because it doesn't preserve the non-mutability requirement.
If this fundamental issue is resolved, then it seems like there's an easy analyzer path for adding a warning on SourceText.From(stringBuilder.ToString()) that can recommend a switch to SourceText.From(stringBuilder). And then you get a "free" performance boost 🎉
Although, I guess that you might find that if you're using a single StringBuilder instance cache, and resetting it, performance might not be that much better, given you're going to be allocating a bunch of extra char[] compared to previously (but fewer string obviously!)
One thing I haven't played with is simply vendoring in the internal StringBuilderText into a source generator to get the best of both worlds. I don't really see why it shouldn't work (and doesn't require .NET 11), but I haven't tried it yet!
Summary
In this post I discussed the new StringBuilder.MoveChunks() API added in .NET 11. I described how it works, compared it to the Clear() method, and showed how it's implemented. I then discussed why it was added, primarily so that Roslyn can avoid calling ToString() when creating SourceText instances. That Roslyn API hasn't been added yet, so for now this is somewhat academic, but if you have similar APIs in your own libraries, it's maybe something worth looking into.
