Sonic 2 Compression Spec ------------------------ version 2.00 by Brett Kosinski Introduction ------------ This document has been written to describe the compression format used in Sonic the Hedgehog 2 for compressing various level data, graphics, and other stuff. This format has baffled many people, making it very difficult to create Sonic 2 mods and hacks, since the data structure is hard to modify without disrupting the file format. So, here, I describe the format used in the game. The Basics ---------- It's pretty clear that, in level maps, graphics, and so on, there is a great deal of data redundancy. And, in the days when Sonic 2 was being produced, ROM space was at somewhat of a premium. So, being able to compress this data down and reduce some of the redundancy can be very beneficial and mean the difference between a 1 or 2 megabyte ROM. However, this format must be easily decompressed, requiring little in the way of computation, while still providing reasonable compression ratios. In Sonic 2, the compression format chosen relies on the fact that segments in a piece of data may be repeated in other places. So, it is designed similar to a run-length encoding scheme. Essentially, during decompression, the format allows you to specify data copies from previously decompressed data. So, given an offset into the currently decompressed buffer and a length, you can exploit the data repetition by copying previously decompressed segments. This provides a great deal of compression (for example, in EHZ1, the compressed level is 1/10th the size of the original) while being easy to decompress. Overview of the Format ---------------------- The compression scheme used in Sonic 2 provides four different ways for storing data. The first method simply specifies uncompressed data. The other three are different methods of indicating the copy offset and length for decompression. These methods each have tradeoffs in terms of amount of bytes needed to represent them, the offset range, and the maximum count expressible. Thus, during compression, one could choose the best method given a particular set of data. Now, the disadvantage of such a format is thatit does create added complexity, and requires some way of indicating the compression method for each segment. In the next section I'll go over the details of the compression format and we'll see how this indication is achieved. The Details ----------- The compressed data is composed of 16-bit bitfields stored in little-endian format, which are used to indicate the compression methods used for the next chunk of data, and the data itself. The bitfield is read starting at the right, and each bit is used to determine how to treat the next byte (or two or three) of data. There are three bit combinations: 1 - Straight byte copy (no compression) 10 - Embedded/Separate compression 00 - Inline compression In the straight byte copy format, the data is simply not compressed. So, read it as is. The Embedded/Separate compression formats are slightly more complex. The next two or three bytes contain data about the compression format. The format for these bytes is as follows: OOOOOCCC OOOOOOOO [CCCCCCCC] <- optional O - Offset bit C - Count bit To construct the offset from this data, take the five offset bits from the first byte and prepend them to the offset bits from the second byte. The result is a 13-bit offset. The count is slightly more complicated. By default, take the count from the bottom three bits of the first byte and add one. However, if these bits are all zero then you must read the next byte to get the count. If, at this point, the count is zero, we're at the end of the data buffer. If the count is a one, we do nothing. Now, the first case (where the 3 count bits are non-zero) is what I refer to as the Embedded format. The second format is the Separate compression format. The Inline compression format is the easiest compression format. The next byte contains data about the compressed format. This byte is used as an offset. The count is acquired by taking the next two bits from the bitfield, reversing them, and adding one. So, for example, if the next two bits are a 1 and a 0, then the count is 01 + 1 (binary), or 2. Now, in ALL cases, the number of bytes actually copied during decompression is equal to the count *PLUS 1*. So, in the above example, three bytes would actually be copied from the specified offset. As well, the offsets are all stored in signed two's complement. Also, if, at any time during reading bits from the bitfield, you run out of bits, *immediately* read the next two bytes from the data buffer and use them as your bitfield. Decompression Pseudo-Code ------------------------- Read bitfield from data buffer MainLoop: Shift bit from right side of bitfield If no bits left, read next bitfield If bit = 1 Append next byte from data buffer to decompressed buffer Position = Position + 1 Goto MainLoop Else Shift bit from right side of bitfield If no bits left, read next bitfield If bit = 1 Low = Next byte from buffer Hi = Next byte from buffer Offset = (Hi << 5) | Low Count = Hi & 0x07 If Count != 0 Count = Count + 1 Else Count = Next byte from buffer If Count = 0 terminate If Count = 1 goto MainLoop Else Shift bit from right side of bitfield If no bits left, read next bitfield Count |= Bit << 1 Shift bit from right side of bitfield If no bits left, read next bitfield Count |= Bit Offset = Next byte from buffer Count = Count + 1 while (Count > 0) decompressed_buffer[position] = decompressed_buffer[offset + position] position = position + 1 Decompression C Code -------------------- /* To run, compile and then run it as: decomp [file.rom] [output_file.dat] [compressed data start location] */ #include #include int get_bit(FILE *file, unsigned short int *field, int *bit_cnt) { int bit = *field & 1; *field >>= 1; (*bit_cnt)--; if (*bit_cnt < 0) { fread(field, sizeof(short), 1, file); *bit_cnt = 15; } return bit; } int main(int argc, char *argv[]) { FILE *file; FILE *output; int file_pos; unsigned char buffer[65535]; int buf_pos = 0; unsigned short field; int bit_cnt; unsigned char high, low, buf; int count; signed short int offset; if (argc < 4) { printf("Invalid params!\n"); exit(0); } file = fopen(argv[1], "r"); output = fopen(argv[2], "w"); sscanf(argv[3], "%x", &file_pos); memset(buffer, 0, sizeof(buffer)); fseek(file, file_pos, SEEK_SET); fread(&field, sizeof(field), 1, file); bit_cnt = 15; while(1) { if (get_bit(file, &field, &bit_cnt)) { fread(&buf, sizeof(buf), 1, file); buffer[buf_pos++] = buf; continue; } else { if (get_bit(file, &field, &bit_cnt)) { fread(&low, sizeof(low), 1, file); fread(&high, sizeof(high), 1, file); offset = (0xFF00 | high) << 5; offset = (offset & 0xFF00) | low; high &= 0x07; if (high) { count = high + 1; } else { fread(&buf, sizeof(buf), 1, file); if (buf == 1) continue; count = buf; if (count == 0) break; } } else { count = (get_bit(file, &field, &bit_cnt) << 1) | get_bit(file, &field, &bit_cnt); count++; fread(&buf, sizeof(buf), 1, file); offset = 0xFF00 | buf; } } if ((buf_pos + offset) < 0) { printf("Invalid offset found at %i, terminating.\n", ftell(file)); fwrite(buffer, 1, buf_pos, output); fclose(output); exit(1); } while (count >= 0) { buf = buffer[buf_pos + offset]; buffer[buf_pos++] = buf; count--; } } printf("Read: %i bytes, Wrote: %i bytes\n", ftell(file) - file_pos, buf_pos); fclose(file); fwrite(buffer, 1, buf_pos, output); fclose(output); } Contact Info ------------ This document was written by me, Brett Kosinski. You can reach me at: brettk@frodo.dyn.gno.org Acknowledgements ---------------- Well, I *definitely* couldn't have done this without the work of Dave, Joe Groff, Phil K. Hornung, and everyone else who wrote/worked on DGen. Damn, this program is sweet... :) Damian Grove and the Sonic 2 Hacking Guide website (http://ssrg.emulationzone.org/s2hack/). The information on level data locations is just what I needed to get going. :) My girlfriend Lenore, for being fairly understanding about me staying up 'til 3:00am working on this damn stuff.