Disco Narrator - Data Scraping

Part 1/4 of a series

To train an AI Text-To-Speech (TTS) model, we’ll need to obtain a Labelled Dataset with two things:

  1. Clean audio files, containing only the voice we’re cloning
  2. The dialogue transcript (text) for each audio file

I refer to these points as (1) and (2) in the next section.

Asset extraction

Now, I’ve never reverse-engineered a Unity game before. But I knew, at bare minimum, that getting (2) done was definitely possible, because of Disco Reader, a third-party app that renders dialogue trees from the game.

A bit of googling later, I find an informative Reddit thread:

I needed a structured export of all the conversations in the game […] I now use AssetStudio to extract the dialoguebundle myself.

While googling, I also found another project, Disco Courier, which suggested much of the same:

place a copy of your exported data in /data/dialog.json

So, I’m looking to get a .json file from AssetStudio somehow. Let’s try that.


Using AssetStudio

I download and open the program. It looks like this:

Okay, “Load file/folder”. The data has to be somewhere in the game files, so… I could try loading Steam\steamapps\common\Disco Elysium\, and look from there?

[Screenshot, captioned: uh oh]

A crash and a reboot later, and I realise I should’ve read the README in closer detail:

When AssetStudio loads AssetBundles, it decompresses and reads it directly in memory, which may cause a large amount of memory to be used. You can use File-Extract file or File-Extract folder to extract AssetBundles to another folder, and then read.

So I extracted the folder elsewhere, and then I tried Load Folder:

The first thing I see are the audio files for the game. This is good – we’ve solved for (1) – but I still haven’t found the dialogue data yet.

After clicking around in the UI a bit, I luck out again: sorting the assets by size, I find an asset named Disco Elysium, stored at /Assets/Dialogue Databases/Disco Elysium.asset. Loading the preview for this asset nearly crashes my computer again because of how large it is: every piece of writing (dialogue, thoughts, item descriptions, etc.) in the entire game is bundled in this one asset. Exported, this amounts to a 266MB .json file, which is the export we were looking for earlier.


After extracting the audio files as well, I have the following data:

~/DiscoAudioSources$ tree 
.
├── AudioClip
│   ├── Abandoned Lorry-JAM  INSTIGATOR CABIN-118.wav
│   ├── Abandoned Lorry-JAM  INSTIGATOR CABIN-120.wav
│   ├── Abandoned Lorry-JAM  INSTIGATOR CABIN-123.wav
.....
│   └── Yellow Man Mug-INVENTORY  MUG-9.wav
└── MonoBehaviour
    └── Disco Elysium.json

AudioClip/ contains the files for (1), MonoBehaviour/ contains the lines for (2).
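
Before going further, it's worth sanity-checking that the export actually landed where I think it did. Here's a minimal sketch, assuming the directory layout from the tree output above (AudioClip/ for the .wav files, MonoBehaviour/ for the .json export). The function name summarize_export is my own and isn't part of AssetStudio.

```python
from pathlib import Path


def summarize_export(root: Path) -> dict:
    """Count extracted .wav clips and list exported .json files under root.

    Assumes the layout shown in the tree output: root/AudioClip/*.wav and
    root/MonoBehaviour/*.json.
    """
    wavs = sorted((root / "AudioClip").glob("*.wav"))
    jsons = sorted((root / "MonoBehaviour").glob("*.json"))
    return {
        "wav_count": len(wavs),
        "wav_bytes": sum(p.stat().st_size for p in wavs),
        "json_files": [p.name for p in jsons],
    }
```

Calling summarize_export(Path.home() / "DiscoAudioSources") should report one .json file and however many clips were extracted.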

Linking audio to dialogue

I had hoped the AudioClip assets would contain some labelled metadata, but to my consternation, the audio files and dialogue text were bundled separately. So the .wav files themselves aren’t very useful, and I need to gather more information.

Specifically, for each audio file, I need to know

  1. Which voice actor is speaking, and
  2. What lines are being read

Dead-end: disco-courier

The first thing I tried was to locate the information in dialog.json.

The Disco Courier project I mentioned earlier was made to work with it, so I started with an npm install:

[Screenshot, captioned: only 3 critical!]

The app was a little bit old: last commit in 2021. One year is about two centuries long in the Node.js ecosystem, so the app obviously crashed on first try:

So, I did a little bit of debugging and added the fixes to my fork.

Details of the bug, if you care

To start, I grep the source for the version of your data, and quickly find the problem:

The version is hardcoded… along with some metadata about the json file. I extracted these with a one-liner:

$ fx src/data/dialog.json 'd => ({version: d.version, rowCounts: { locations: d.locations.length, actors: d.actors.length, items: d.items.length, variables: d.variables.length, conversations: d.conversations.length }})'
{
  "version": "5/20/2022 12:05:57 PM",
  "rowCounts": {
    "locations": 0,
    "actors": 424,
    "items": 259,
    "variables": 10645,
    "conversations": 1501
  }
}
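
If you don't have fx installed, roughly the same summary can be pulled out with a few lines of Python. This is a sketch that assumes the export has the same top-level keys the one-liner reads (version, locations, actors, items, variables, conversations). The function names are mine, not part of disco-courier, and the utf-8-sig encoding is a defensive guess in case the export starts with a byte-order mark.

```python
import json

# Top-level tables counted by the fx one-liner above.
TABLES = ("locations", "actors", "items", "variables", "conversations")


def summarize_dialog(data: dict) -> dict:
    """Build the same {version, rowCounts} summary from parsed JSON."""
    return {
        "version": data["version"],
        "rowCounts": {name: len(data[name]) for name in TABLES},
    }


def summarize_dialog_file(path: str) -> dict:
    """Load the export from disk and summarize it.

    Note: json.load reads the whole file into memory, which is heavy for
    an export of this size.
    """
    # utf-8-sig accepts files with or without a BOM (an assumption, see above).
    with open(path, encoding="utf-8-sig") as f:
        return summarize_dialog(json.load(f))
```

Printing json.dumps(summarize_dialog_file("src/data/dialog.json"), indent=2) should give output in the same shape as the fx result.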

I patched those values (the version string and row counts) into the source. After that, I tried running the project with one of the suggested example commands:

$ courier -- --export=json --actor=3 --OR=true --conversant=6 conversations.dialog
# "Creates a detailed json export where the speaker is Kim, OR the conversant is Garte."

Naturally, this didn’t do what it said it would do. After fixing another minor bug (the output directory, ./src/data/json, did not exist), I found a list of conversations from The Player to Cunoesse at ./src/data/json/conversations/conversations.dialog.json.

After accounting for bugs, I get a json file of dialogue entries from courier. A single dialog entry looks like this:

{
  "parentId": 1039,
  "dialogId": 222,
  "isRoot": 0,
  "isGroup": 0,
  "refId": "0x0100005A0000E5F6",
  "isHub": false,
  "dialogShort": "Little Lily: \"\"Ll... Luby... Rr... R-luuby.\" Sudd...\"",
  "dialogLong": "\"Ll... Luby... Rr... R-luuby.\" Suddenly the girl gets all serious and leans in, as if she's about to tell you a secret.",
  "actorId": 101,
  "actorName": "Little Lily",
  "conversantId": 0,
  "modifiers": [],
  "conditionPriority": 2,
  "userScript": "",
  "inputId": "0x0100002100000B63",
  "outputId": "0x0100002100000B70"
}

And although there’s a lot of information in that entry, there’s nothing that tells me what audio file (if any) this line is linked to. Dead end.
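
For what it's worth, that check can be done mechanically instead of by eyeballing one entry. The sketch below flags any field whose name hints at audio, or whose value looks like a .wav filename. The hint list is my own guess at what an audio reference might be called, and the sample is a trimmed copy of the entry above.

```python
# Substrings that would suggest an audio reference. This list is my own
# guess, not something taken from disco-courier or the game files.
AUDIO_HINTS = ("audio", "clip", "wav", "sound", "voice")


def audio_like_fields(entry: dict) -> list[str]:
    """Return keys whose name hints at audio, or whose string value ends in .wav."""
    hits = []
    for key, value in entry.items():
        name_hit = any(hint in key.lower() for hint in AUDIO_HINTS)
        value_hit = isinstance(value, str) and value.lower().endswith(".wav")
        if name_hit or value_hit:
            hits.append(key)
    return hits


# Trimmed copy of the entry shown above.
SAMPLE_ENTRY = {
    "parentId": 1039, "dialogId": 222, "refId": "0x0100005A0000E5F6",
    "actorId": 101, "actorName": "Little Lily", "conversantId": 0,
    "userScript": "", "inputId": "0x0100002100000B63",
    "outputId": "0x0100002100000B70",
}

print(audio_like_fields(SAMPLE_ENTRY))  # prints []
```

Running the same function over every entry in the courier export would be the thorough version of this check.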

Further sleuthing

So, a third-party CLI app failed to give me the information I wanted. Maybe I should’ve been a little less credulous, and checked the game files myself?

Answer: no, disco-courier is doing about the best it can. The only thing I really notice is that what the CLI app calls refIds are named articyIds in the game files.

In desperation, I try to run grep on the raw assets to see if I’d get anything.

~/disco_Data$ grep 'Ruud Hoenkloewen-PLAZA  KORTENAER-74' *
Binary file sharedassets1.assets matches
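
On a machine without grep (this is a Windows game, after all), the same search is easy to sketch in Python. The version below assumes, as the grep hit suggests, that asset names are stored as plain bytes inside the asset files. Like `grep ... *`, it only looks at files directly inside the directory. The function name files_containing is hypothetical.

```python
from pathlib import Path


def files_containing(directory: Path, needle: str, chunk_size: int = 1 << 20) -> list[str]:
    """Return names of files directly under `directory` that contain `needle`.

    Files are read in chunks so large asset files are never loaded whole.
    """
    pattern = needle.encode("utf-8")
    matches = []
    for path in sorted(directory.iterdir()):
        if not path.is_file():
            continue  # skip subdirectories, like a non-recursive grep
        overlap = b""
        with path.open("rb") as f:
            while chunk := f.read(chunk_size):
                window = overlap + chunk
                if pattern in window:
                    matches.append(path.name)
                    break
                # Keep the tail so a match spanning two chunks is not missed.
                overlap = window[-(len(pattern) - 1):] if len(pattern) > 1 else b""
    return matches
```

Calling files_containing(Path("disco_Data"), "Ruud Hoenkloewen-PLAZA  KORTENAER-74") is the equivalent of the grep above.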

To my surprise, it did find something. sharedassets1; what’s in there?

A lot of things, unfortunately. Let's try harder.

Shared Asset Extraction

As before, let’s start by sorting by size:

That won't work; textures are big. Filter by MonoBehaviour?


Hello.

I click on the asset, hit Esc after a file browser pop-up appears inexplicably, and I see:


An empty object?

No, wait. An error. I search through GitHub Issues and...

please check https://github.com/Perfare/AssetStudio#export-monobehaviour

Ah. I failed to read the documentation.

Again.

One installation of Il2CppDumper.exe later, and I create the fake .dll files AssetStudio is looking for:

C:\Program Files\Steam\steamapps\common\Disco Elysium\disco_Data>"C:\Program Files\Il2CppDumper-net6-v6.7.25\Il2CppDumper.exe" ..\GameAssembly.dll .\il2cpp_data\Metadata\global-metadata.dat C:\il2cpp_out
Initializing metadata...
Metadata Version: 27
Initializing il2cpp file...
Il2Cpp Version: 27
Searching...
Change il2cpp version to: 27.1
CodeRegistration : 18216e9c0
MetadataRegistration : 182173350
Dumping...
Done!
Generate struct...
Done!
Generate dummy dll...
Done!
Press any key to exit...

C:\Program Files\Steam\steamapps\common\Disco Elysium\disco_Data>

Re-opening AssetStudio, I follow the README, and…

It’s there! AssetName, ArticyID. The AssetNames match the .wav filenames we extracted, and the ArticyIDs map exactly to the expected dialogue for each audio file. Taking the first example in the image, 0x010000060001BDCA refers to this part of the massive Dialogue json:

With all the information we need (theoretically) in tow, we can move on to Data Formatting.
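
As a rough preview of where this is going, the linking step could be sketched like so. The AssetName and ArticyID field names come from the asset above, but the shapes I'm assuming here are hypothetical: clip_rows as a list of dicts with those two keys, and dialogue_by_articy_id as a lookup from articy ID to dialogue text that would have to be built from the big json first.

```python
def link_audio_to_text(clip_rows, dialogue_by_articy_id):
    """Pair each audio clip with its dialogue text via the shared articy ID.

    clip_rows: iterable of dicts with "AssetName" and "ArticyID" keys
        (hypothetical shape for the exported MonoBehaviour data).
    dialogue_by_articy_id: dict mapping articy ID -> dialogue text
        (hypothetical lookup built from the dialogue export).

    Returns (linked, missing): a list of (wav filename, text) pairs, and a
    list of asset names whose articy ID had no matching dialogue.
    """
    linked, missing = [], []
    for row in clip_rows:
        text = dialogue_by_articy_id.get(row["ArticyID"])
        if text is None:
            missing.append(row["AssetName"])
        else:
            linked.append((row["AssetName"] + ".wav", text))
    return linked, missing
```

Keeping the missing list around seems worthwhile, since any clip without a transcript can't be used as labelled training data.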