Disco Narrator - Data Scraping
Part 1/4 of a series
To train an AI Text-To-Speech (TTS) model, we’ll need to obtain a Labelled Dataset with two things:
- Clean audio files, containing only the voice we’re cloning
- The dialogue transcript (text) for each audio file
I refer to these points as (1) and (2) in the next section.
Asset extraction
Now, I’ve never reverse engineered a Unity game before. But I knew, at bare minimum, that getting (2) done was definitely possible, because of Disco Reader – a third-party app that renders dialogue trees from the game.
A bit of googling later, and I find an informative reddit thread:
I needed a structured export of all the conversations in the game […] I now use AssetStudio to extract the dialoguebundle myself.
While googling, I also found another project, Disco Courier, which suggested much of the same:
place a copy of your exported data in /data/dialog.json
So, I’m looking to get a .json file from AssetStudio somehow. Let’s try that.
I download and open the program. It looks like this:
Okay, “Load file/folder”. The data has to be somewhere in the game files, so… I could try loading Steam\steamapps\common\Disco Elysium\, and look from there?
A crash and a reboot later, and I realise I should’ve read the README in closer detail:
When AssetStudio loads AssetBundles, it decompresses and reads it directly in memory, which may cause a large amount of memory to be used. You can use File-Extract file or File-Extract folder to extract AssetBundles to another folder, and then read.
So I extracted the folder elsewhere, and then I tried Load Folder:
The first thing I see are the audio files for the game. This is good – we’ve solved for (1) – but I still haven’t found the dialogue data yet.
After clicking around in the UI a bit, I luck out again: sorting the assets by size, I find an asset named Disco Elysium, stored at /Assets/Dialogue Databases/Disco Elysium.asset. Loading the preview for this asset nearly crashes my computer again because of how large it is – every piece of writing (dialogue, thoughts, item descriptions, etc.) in the entire game is bundled in this one asset. Exported, this amounts to a 266MB .json file, as we expected earlier.
After extracting the audio files as well, I have the following data:
AudioClip/ contains the files for (1), and MonoBehaviour/ contains the lines for (2).
Linking audio to dialogue
I had hoped the AudioClip assets would contain some labelled metadata, but to my consternation, the audio files and dialogue text were bundled separately. So the .wav files themselves aren’t very useful, and I need to gather more information.
Specifically, for each audio file, I need to know
- Which voice actor is speaking, and
- What lines are being read
Dead-end: disco-courier
The first thing I tried was to locate the information in dialog.json.
The Disco Courier project I mentioned earlier was made to work with it, so I started with an npm install:
The app was a little bit old – last commit in 2021. One year is about two centuries in the Node.js ecosystem, so the app obviously crashed on the first try:
So, I did a little bit of debugging and added the fixes to my fork.
To start, I grep the source for the version of your data, and quickly find the problem:
The version is hardcoded… along with some metadata about the json file. I extracted these with a one-liner:
And patched that output into the source. After that, I tried running the project with one of the suggested example commands:
Naturally, this didn’t do what it said it would do. After fixing another minor bug (the output directory, ./src/data/json, did not exist), I found a list of conversations from The Player to Cunoesse at ./src/data/json/conversations/conversations.dialog.json.
After accounting for bugs, I get a json file of dialogue entries from courier. A single dialogue entry looks like this:
And although there’s a lot of information in that entry, there’s nothing that tells me what audio file (if any) this line is linked to. Dead end.
Further sleuthing
So, a third-party CLI app failed to give me the information I wanted. Maybe I should’ve been a little less credulous, and checked the game files myself?
Answer: no, disco-courier is doing about the best it can. The only thing I really notice is that what the CLI app calls refIds are named articyIds in the game files.
In desperation, I try to run grep on the raw assets to see if I’d get anything.
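The exact grep invocation is not preserved here, so the following is only a minimal Python sketch of the same idea: scan every file under a folder of raw assets and report which ones contain a given byte string. The function name `find_files_containing`, the chunk size, and whatever folder and search string you pass in are assumptions for illustration, not the command actually used.

```python
from pathlib import Path


def find_files_containing(root: Path, needle: bytes, chunk_size: int = 1 << 20) -> list[Path]:
    """Return the files under `root` whose raw bytes contain `needle`.

    Files are read in chunks so that large asset bundles are never
    loaded into memory all at once.
    """
    if not needle:
        raise ValueError("needle must not be empty")
    overlap = len(needle) - 1  # bytes to carry over between chunks
    matches = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        tail = b""
        with path.open("rb") as handle:
            while chunk := handle.read(chunk_size):
                window = tail + chunk
                if needle in window:
                    matches.append(path)
                    break
                # Keep the end of this window so a match that straddles
                # two chunks is still found on the next iteration.
                tail = window[-overlap:] if overlap > 0 else b""
    return matches
```

Calling it with the extracted game data folder and an ID encoded as bytes (for example `b"0x0100..."`, a placeholder) returns a list of paths, which plays the same role as grep printing the names of matching binary files.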
To my surprise, it did find something. sharedassets1; what’s in there?
As with before, let’s start by sorting by size:
That won't work; textures are big. Filter by MonoBehaviour?
Hello.
I click on the asset, hit Esc after a file browser pop-up appears inexplicably, and I see:
An empty object?
please check https://github.com/Perfare/AssetStudio#export-monobehaviour
Ah. I failed to read the documentation.
Again.
One installation of Il2CppDumper.exe later, and I create the fake .dll files AssetStudio is looking for:
Re-opening AssetStudio, I follow the README, and…
It’s there! AssetName, ArticyID. The AssetNames match up with the .wav filenames we extracted, and the ArticyIDs match exactly to the expected dialogue for each audio file. Taking the first example in the image, 0x010000060001BDCA refers to this part of the massive Dialogue json:
With all the information we need (theoretically) in tow, we can move on to Data Formatting.
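As a rough illustration of the link found above, here is a minimal Python sketch that joins clip records to dialogue lines on their Articy ID. The field names AssetName, ArticyID and articyId come from the assets described in this post. Everything else is an assumption for the sake of the example: the flat dict shapes, the "text" key, and the helper names `normalise_id`, `link_clips` and `LabelledClip`. The real exported json is nested differently and would need flattening first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelledClip:
    asset_name: str  # name of the extracted audio asset
    articy_id: str   # normalised ID shared by the clip and its dialogue line
    text: str        # transcript for the clip


def normalise_id(raw: str) -> str:
    """Lower-case a hex ID and make sure it carries a 0x prefix."""
    value = raw.strip().lower()
    return value if value.startswith("0x") else "0x" + value


def link_clips(clips, lines):
    """Join clip records to dialogue lines on their Articy ID.

    clips: iterable of dicts with "AssetName" and "ArticyID" keys.
    lines: iterable of dicts with "articyId" and "text" keys
           (hypothetical flat shape, not the raw export format).
    Returns (linked, unmatched): LabelledClip objects for clips with a
    matching line, and the asset names of clips with no match.
    """
    text_by_id = {normalise_id(line["articyId"]): line["text"] for line in lines}
    linked, unmatched = [], []
    for clip in clips:
        key = normalise_id(clip["ArticyID"])
        if key in text_by_id:
            linked.append(LabelledClip(clip["AssetName"], key, text_by_id[key]))
        else:
            unmatched.append(clip["AssetName"])
    return linked, unmatched
```

Returning the unmatched asset names separately makes it easy to see how many audio files have no transcript, which is worth checking before any training data is built from the result.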