Ok, this week's theme is serialization (no porting work at all). I also foresee the work continuing like this until it's complete, and that will take a while. From an outside perspective, and in the grand scheme of things, it looks like yet another rabbit hole (game -> nope, port to Godot -> well, let's redo the serialization from scratch before finishing the port). So, why bother?

Motivation/background

I've been using BinaryFormatter since my first foray into Unity, several years ago. BinaryFormatter can serialize anything as long as you tag your class with [Serializable] -- fantastic! In some cases I had serious performance issues, especially with arrays of simple datatypes. I wrote a few specialised converters, and the issue was resolved. On top of that, I added some LZ4 compression to the bytestream, and I thought I was done. I was not.
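
For reference, the pattern is about as low-effort as serialization gets -- a minimal sketch, not my actual types:

```csharp
using System;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

// Tag the class and BinaryFormatter handles the entire object graph.
[Serializable]
class GameState
{
    public int Seed;
    public float[] Heights = Array.Empty<float>();
}

static class SaveSystem
{
    public static void Save(GameState state)
    {
        using var stream = File.Create("save.dat");
        new BinaryFormatter().Serialize(stream, state); // obsolete and unsafe, hence this whole post
    }

    public static GameState Load()
    {
        using var stream = File.OpenRead("save.dat");
        return (GameState)new BinaryFormatter().Deserialize(stream);
    }
}
```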

A couple of years ago now, I discovered that BinaryFormatter has very serious security issues. Like, a bad actor can infect a savefile, and while you're loading it you might end up executing arbitrary code. So, yeah... bad. It's bad enough that it's slowly being made obsolete. "Best" thing is that Microsoft will not offer an alternative; they say "just use JSON or XML instead". Gee thanks Microsoft, very useful. So, since I don't want to potentially be sued for damages if something like that happens, I knew I had to boot it out, but I kept postponing.

Another issue is the robustness of save files. Currently, because the game has complex state (overworld, potentially hundreds of levels active, potentially thousands of entities active, destructible terrain support, so I need to store the full map rather than just changes), I do NOT use any "save objects". The game state is dumped as-is to disk. With my optimisations, a save/load like that currently (with few entities and levels) happens really quickly: less than a second. But of course, we can only ever load a single version: ANY variable change in the game state invalidates the save file. That's ok for early development, but later on I know it will give me lots of headaches. So, how to solve this?

I've done some rudimentary investigation into serialization libraries, meaning I've been looking at graphs and reading about features and limitations rather than testing them. Plenty out there: JSON, Utf8Json, MessagePack, Protobuf, FlatBuffers, etc. There's a new one out now, from the developers of MessagePack (who seem to be very experienced on the topic), called MemoryPack, which benchmarks as the most performant of them all. Intriguing! Ok, let's test that thing.

First attempt: MemoryPack

The way MemoryPack works is by generating source code at compile time for each of your serializable classes, which are marked as such with a [MemoryPackable] attribute. So, it looks like a safer drop-in for BinaryFormatter's [Serializable]. So, I went through the entire codebase and changed most things, so that I could test it on some real-world data structures. Results? Good, but with limitations. I tested saving and loading the world generation config, which contains the biome data per tile (that's a quarter million tiles), the resources of the world, and all cities and their configurations. Testing involved using MemoryPack without compression, and with its built-in Brotli compression. LZ4 compression can still be applied using my own code on the uncompressed bytestream. Some numbers:

  • Uncompressed: 16MB save file; serializes in 20ms, deserializes in 20ms.
  • With LZ4: 5.4MB; saves in 40ms, loads in 20ms.
  • With Brotli "fast": 3.5MB; saves in 70ms, loads in 60ms.
  • With Brotli "best": 3.2MB; saves in 270ms, loads in 50ms.

So, this tells me that for now LZ4 is fantastic, and if the size goes wild I'll consider the Brotli "fast" preset. Right, so this little test went nicely, so I confidently started porting more types. And I hit a few limitations:

  • Polymorphism is not well supported. If I have a variable of class Foo, which can hold either a Foo, a FooDerived1 or a FooDerived2, MemoryPack cannot pick and choose correctly. It can only do that if Foo is abstract or an interface, and it requires some extra code (see the sketch after this list).
  • WeakReference, which I've been using, is not supported. Oops! What the hell do I do now?
  • Versioning is very limited, comes with a list of "you can/cannot do this" rules, and possibly makes things slower.
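
The polymorphism workaround looks roughly like this -- MemoryPack's union attributes, shown on made-up types for illustration:

```csharp
using MemoryPack;

// MemoryPack handles polymorphism only through an abstract class or
// interface, with every concrete type registered up front via a union tag.
[MemoryPackable]
[MemoryPackUnion(0, typeof(FooDerived1))]
[MemoryPackUnion(1, typeof(FooDerived2))]
public abstract partial class Foo { }

[MemoryPackable]
public partial class FooDerived1 : Foo { public int A; }

[MemoryPackable]
public partial class FooDerived2 : Foo { public string? B; }
```

That works, but it doesn't cover my case, where Foo itself is a concrete, serializable class.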

So, this ended up being a bit disheartening. I asked on reddit and got a few opinions; one poster described his system and gave me a few performance numbers. What I got out of that was that I need to implement something similar, with "SaveObjects" rather than a state dump. But maintaining save objects by hand is error-prone, and I'm very forgetful. Plus, I can't use JSON, as I know for a fact that performance will plummet. So, what do I do?

Plan: Source Generation Squared

So, MemoryPack uses source code generators. When I change my MemoryPackable classes, new source files are generated and automatically become part of the project. These generated classes are responsible for (de)serialization.
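
The contract on my side is tiny -- a minimal sketch:

```csharp
using MemoryPack;

// The attribute makes MemoryPack's generator emit a formatter at compile
// time; the class must be partial so the generated code can attach to it.
[MemoryPackable]
public partial class CitySaveData
{
    public string Name { get; set; } = "";
    public int Population { get; set; }
}

static class Demo
{
    static void RoundTrip(CitySaveData city)
    {
        byte[] bytes = MemoryPackSerializer.Serialize(city);
        CitySaveData? back = MemoryPackSerializer.Deserialize<CitySaveData>(bytes);
    }
}
```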

I want to use "SaveObjects" from now on, so that I can export the state to a SaveObject, which can then be serialized in and out. SaveObjects should use MemoryPack; the normal game code should not.

I want to dynamically generate SaveObjects because, let's face it, I'm not going to maintain SaveObject datatypes after every change I make to the game state. For that, I want to use source generators.

So, effectively, I want to use source generators to generate code decorated with [MemoryPackable], which will in turn trigger more source generators. What is the benefit of doing this? My generator should be able to create code in a "latest save version" namespace, while SaveObjects from previous versions are also kept alive. The game state can only import/export the latest SaveObject version.
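
Sketching what I expect the generated output to look like (all names hypothetical):

```csharp
using MemoryPack;

// Generated: the latest SaveObject lives in a versioned namespace...
namespace SaveData.V3
{
    [MemoryPackable]
    public partial class EntitySaveObject
    {
        public int Id;
        public float Health;
    }
}

// ...while older versions stay in the project, frozen, for migration only.
namespace SaveData.V2
{
    [MemoryPackable]
    public partial class EntitySaveObject
    {
        public int Id;
        public int Mana; // dropped in V3
    }
}

// Generated: the game state only ever talks to the latest version.
public partial class Entity
{
    public int Id;
    public float Health;

    public SaveData.V3.EntitySaveObject ToSaveObject()
        => new() { Id = Id, Health = Health };

    public static Entity FromSaveObject(SaveData.V3.EntitySaveObject so)
        => new() { Id = so.Id, Health = so.Health };
}
```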

To be able to load old saves, I can provide very targeted migration logic for particular datatypes; otherwise, the default behaviour would be to 1) copy a type that exists in both versions, 2) initialize with defaults a type that didn't exist in the past, and 3) ignore a type that used to exist but doesn't anymore. By providing code to move from each version to the one immediately after it, I can (theoretically) port a save from any version.
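
A sketch of those default rules and the version chaining, continuing the hypothetical example above:

```csharp
// Hypothetical migration step, V2 -> V3, showing the three default rules.
static SaveData.V3.EntitySaveObject MigrateV2ToV3(SaveData.V2.EntitySaveObject old)
{
    return new SaveData.V3.EntitySaveObject
    {
        Id = old.Id,      // 1) exists in both versions: copy
        Health = default, // 2) new in V3: initialize with the default
        // old.Mana       // 3) removed in V3: ignored
    };
}

// Chaining per-version steps lets a save from any version reach the latest:
// var latest = MigrateV2ToV3(MigrateV1ToV2(v1Save));
```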

This is the plan, anyway. I hope it works. But hope is not reliable, so I need to test. I made a new "proof of concept" project with some datatypes and simple class hierarchies, to try to get part of the whole thing working. How to proceed? Roughly, in 4 stages:

  • Stage 1: Proof of concept, manual. Implement the target classes that I hope to generate, and make sure that we can go between State <-> current SaveObject <- older SaveObject <- even older SaveObject.
  • Stage 2: Proof of concept, automated. Actually write the source generator that creates code identical to what I wrote by hand, and verify it works. This will generate ALL SaveObject classes based on the saveable datatypes, including all the partial State classes that implement the appropriate "ToSaveObject" and "FromSaveObject" functions (a skeleton sketch follows this list).
  • Stage 3: Prepare codebase. This can be done in parallel with Stage 2. Here, I need to make sure that my codebase is appropriately decorated with some custom attributes on classes and fields, so that the generator will "just work". This follows a similar approach to MemoryPack and many other serializers. I also need to refactor out the WeakReference usage somehow.
  • Stage 4: Code refactor. Well, here I should try the generator, test it, and fix all the bugs that will inevitably appear, since I'll be applying it to a vastly larger hierarchy.
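
For Stage 2, the skeleton I have in mind is a Roslyn incremental generator that picks up the custom attribute from Stage 3 and emits the MemoryPackable SaveObjects. All names here are assumptions, not final code:

```csharp
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp.Syntax;

// Hypothetical skeleton: find classes tagged with my custom [Saveable]
// attribute and emit a MemoryPackable SaveObject shell for each.
[Generator]
public sealed class SaveObjectGenerator : IIncrementalGenerator
{
    public void Initialize(IncrementalGeneratorInitializationContext context)
    {
        var saveables = context.SyntaxProvider.ForAttributeWithMetadataName(
            "Game.Persistence.SaveableAttribute", // assumed attribute name
            predicate: static (node, _) => node is ClassDeclarationSyntax,
            transform: static (ctx, _) => (INamedTypeSymbol)ctx.TargetSymbol);

        context.RegisterSourceOutput(saveables, static (spc, symbol) =>
        {
            // The real version would walk symbol.GetMembers() to mirror every
            // saveable field, and also emit the To/FromSaveObject partials.
            var source = $$"""
                namespace SaveData.Latest;

                [MemoryPack.MemoryPackable]
                public partial class {{symbol.Name}}SaveObject
                {
                }
                """;
            spc.AddSource($"{symbol.Name}SaveObject.g.cs", source);
        });
    }
}
```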

That's it! So, when I come out of this rabbit hole, I should have 1) better, refactored code, 2) a save system that is as secure as it gets, and 3) a performant, automated, versioned save system. Currently, I've done some of Stage 1 and some of Stage 2, handling the various types except collections and generics. Crossing fingers for the rest.