Disco Narrator - Data Scraping
Part 1/4 of a series
To train an AI Text-To-Speech (TTS) model, we’ll need to obtain a Labelled Dataset with two things:
- Clean audio files, containing only the voice we’re cloning
- The dialogue transcript (text) for each audio file
I refer to these points as (1) and (2) in the next section.
Now, I’ve never reverse-engineered a Unity game before. But I knew, at bare minimum, that getting (2) done was definitely possible, because of Disco Reader – a third-party app that renders dialogue trees from the game.
A bit of googling later, and I find an informative reddit thread:
I needed a structured export of all the conversations in the game […] I now use AssetStudio to extract the dialoguebundle myself.
While googling, I also found another project, Disco Courier, which suggested much of the same:
place a copy of your exported data in /data/dialog.json
So, I’m looking to get a .json file from AssetStudio somehow. Let’s try that.
I download and open the program. It looks like this:
Okay, “Load file/folder”. The data has to be somewhere in the game files, so… I could try loading Steam\steamapps\common\Disco Elysium\, and look from there?
A crash and a reboot later, and I realise I should’ve read the README in closer detail:
When AssetStudio loads AssetBundles, it decompresses and reads it directly in memory, which may cause a large amount of memory to be used. You can use File-Extract file or File-Extract folder to extract AssetBundles to another folder, and then read.
So I extracted the folder elsewhere, and then I tried
The first thing I see are the audio files for the game. This is good – we’ve solved for (1) – but I still haven’t found the dialogue data yet.
After clicking around in the UI a bit, I luck out again: sorting the assets by size, I find an asset named Disco Elysium, stored at /Assets/Dialogue Databases/Disco Elysium.asset. Loading the preview for this asset nearly crashes my computer again because of how large it is: every piece of writing (dialogue, thoughts, item descriptions, etc.) in the entire game is bundled in this one asset. Exported, this amounts to a 266MB .json file, as we expected earlier.
After extracting the audio files as well, I have the following data:
AudioClip/ contains the files for (1),
MonoBehaviour/ contains the lines for (2).
Linking audio to dialogue
I had hoped the AudioClip assets would contain some labelled metadata, but to my consternation, the audio files and dialogue text were bundled separately. So the .wav files themselves aren’t very useful on their own, and I need to gather more information.
Specifically, for each audio file, I need to know
- Which voice actor is speaking, and
- What lines are being read
The first thing I tried was to locate the information in
The Disco Courier project I mentioned earlier was made to work with it, so I started with an
The app was a little bit old – last commit in 2021. One year is about two centuries in the NodeJS ecosystem, so the app obviously crashed on the first try:
So, I did a little bit of debugging and added the fixes to my fork.
To start, I grep the source for
the version of your data, and quickly find the problem:
The version is hardcoded… along with some metadata about the json file. I extracted these with a oneliner:
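For illustration, here is a Python sketch of this kind of extraction (it is not the original one-liner). It assumes the metadata sits in scalar top-level fields of the exported json, which may not match the real file layout; the function names are hypothetical.

```python
import json


def scalar_fields(data):
    """Keep only the top-level fields whose values are not lists or dicts."""
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return {k: v for k, v in data.items() if not isinstance(v, (dict, list))}


def read_metadata(path):
    """Load an exported json file and return its scalar top-level fields.

    Note: json.load reads the whole file into memory, which is slow but
    workable for a file of a few hundred megabytes.
    """
    with open(path, encoding="utf-8") as f:
        return scalar_fields(json.load(f))
```

Calling `read_metadata` on the exported file would print small fields such as a version string while skipping the large nested collections, assuming the export is shaped that way.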
And patched that output into the source. After that, I tried running the project with one of the suggested example commands:
Naturally, this didn’t do what it said it would do. After fixing another minor bug (the output directory, ./src/data/json, did not exist), I found a list of conversations from The Player to Cunoesse at
After accounting for bugs, I get a json file of dialogue entries from courier. A single dialog entry looks like this:
And although there’s a lot of information in that entry, there’s nothing that tells me what audio file (if any) this line is linked to. Dead end.
So, a third-party CLI app failed to give me the information I wanted. Maybe I should’ve been a little less credulous, and checked the game files myself?
disco-courier is doing about the best it can. The only thing I really notice is that what the CLI app calls refIds are named articyIds in the game files.
In desperation, I try to run grep on the raw assets to see if I’d get anything.
To my surprise, it did find something.
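The same kind of brute-force search can be sketched in Python. This is an illustration rather than the command used here: the function name, the chunk size, and whatever byte string you search for are assumptions, and it only finds strings stored uncompressed in the files.

```python
from pathlib import Path


def files_containing(root, needle, chunk_size=1 << 20):
    """Yield every file under `root` whose raw bytes contain `needle`.

    Files are read in chunks so large asset files never need to fit in
    memory; a small tail of each chunk is carried over so matches that
    straddle a chunk boundary are still found.
    """
    if not needle:
        raise ValueError("needle must be a non-empty bytes object")
    overlap = len(needle) - 1
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        tail = b""
        with path.open("rb") as f:
            while chunk := f.read(chunk_size):
                window = tail + chunk
                if needle in window:
                    yield path
                    break
                # Keep the last len(needle) - 1 bytes for the next round.
                tail = window[-overlap:] if overlap else b""
```

For example, `list(files_containing("extracted/", b"some-identifier"))` would list the files that mention that identifier, where both the folder and the identifier are placeholders.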
sharedassets1; what’s in there?
A lot of things, unfortunately. Let's try harder.
As before, let’s start by sorting by size:
That won't work; textures are big. Filter by
I click on the asset, hit Esc after a file browser pop-up appears inexplicably, and I see:
An empty object?
Ah. I failed to read the documentation.
One installation of Il2CppDumper.exe later, and I create the fake .dll files AssetStudio is looking for:
Re-opening AssetStudio, I follow the README, and…
AssetNames match up with the .wav filenames we extracted, and the ArticyIDs match exactly to the expected dialogue for each audio file. Taking the first example in the image, 0x010000060001BDCA refers to this part of the massive Dialogue json:
With all the information we need (theoretically) in tow, we can move on to Data Formatting.
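The link between audio and text can be sketched as a join on the Articy ID. This is a minimal Python sketch under stated assumptions: the AssetName and ArticyID field names follow the asset described above, while the shape of the dialogue entries (an "articyId" and a "text" key) and both function names are hypothetical.

```python
def articy_key(articy_id):
    """Normalise an Articy ID given as an int or a hex string like '0x0100...'."""
    if isinstance(articy_id, int):
        return articy_id
    return int(articy_id, 16)  # base 16 accepts an optional '0x' prefix


def link_audio_to_dialogue(voice_rows, dialogue_entries):
    """Pair each audio asset name with its dialogue text via the shared ID.

    Returns (linked, unmatched): a dict of asset name -> text, and a list
    of asset names whose ID has no dialogue entry.
    """
    text_by_id = {articy_key(e["articyId"]): e["text"] for e in dialogue_entries}
    linked, unmatched = {}, []
    for row in voice_rows:
        key = articy_key(row["ArticyID"])
        if key in text_by_id:
            linked[row["AssetName"]] = text_by_id[key]
        else:
            unmatched.append(row["AssetName"])
    return linked, unmatched


# Placeholder rows: the asset name and text are made up for illustration.
rows = [{"AssetName": "example-clip", "ArticyID": "0x010000060001BDCA"}]
entries = [{"articyId": "0x010000060001bdca", "text": "placeholder line"}]
linked, unmatched = link_audio_to_dialogue(rows, entries)
```

Comparing the IDs as integers rather than strings avoids mismatches from letter case or zero padding, should the two sources format them differently.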