Sonic 2 Compression Spec
			   ------------------------

			         version 2.00
			       by Brett Kosinski

Introduction
------------

This document has been written to describe the compression format used
in Sonic the Hedgehog 2 for compressing various level data, graphics,
and other stuff.  This format has baffled many people, making it very
difficult to create Sonic 2 mods and hacks, since the data structure
is hard to modify without disrupting the file format.  So, here, I
describe the format used in the game.

The Basics
----------

It's pretty clear that, in level maps, graphics, and so on, there is a
great deal of data redundancy.  And, in the days when Sonic 2 was
being produced, ROM space was at somewhat of a premium.  So, being
able to compress this data down and reduce some of the redundancy can
be very beneficial and mean the difference between a 1 or 2 megabyte
ROM.  However, this format must be easily decompressed, requiring
little in the way of computation, while still providing reasonable
compression ratios.

In Sonic 2, the compression format chosen relies on the fact that
segments in a piece of data may be repeated in other places.  So, it
is designed similar to a run-length encoding scheme.  Essentially,
during decompression, the format allows you to specify data copies
from previously decompressed data.  So, given an offset into the
currently decompressed buffer and a length, you can exploit the data
repetition by copying previously decompressed segments.  This provides
a great deal of compression (for example, in EHZ1, the compressed
level is 1/10th the size of the original) while being easy to
decompress.

Overview of the Format
----------------------

The compression scheme used in Sonic 2 provides four different ways
for storing data.  The first method simply specifies uncompressed
data.  The other three are different methods of indicating the copy
offset and length for decompression.  These methods each have
tradeoffs in terms of amount of bytes needed to represent them, the
offset range, and the maximum count expressible.  Thus, during
compression, one could choose the best method given a particular set
of data.

Now, the disadvantage of such a format is thatit does create added
complexity, and requires some way of indicating the compression method
for each segment.  In the next section I'll go over the details of the
compression format and we'll see how this indication is achieved.

The Details
-----------

The compressed data is composed of 16-bit bitfields stored in
little-endian format, which are used to indicate the compression
methods used for the next chunk of data, and the data itself.  The
bitfield is read starting at the right, and each bit is used to
determine how to treat the next byte (or two or three) of data.  There
are three bit combinations:

1  - Straight byte copy (no compression)
10 - Embedded/Separate compression
00 - Inline compression

In the straight byte copy format, the data is simply not compressed.
So, read it as is.

The Embedded/Separate compression formats are slightly more complex.
The next two or three bytes contain data about the compression format.
The format for these bytes is as follows:

OOOOOCCC OOOOOOOO [CCCCCCCC]  <- optional

O - Offset bit
C - Count bit

To construct the offset from this data, take the five offset bits from
the first byte and prepend them to the offset bits from the second
byte.  The result is a 13-bit offset.

The count is slightly more complicated.  By default, take the count
from the bottom three bits of the first byte and add one.  However, if
these bits are all zero then you must read the next byte to get the
count.  If, at this point, the count is zero, we're at the end of the
data buffer.  If the count is a one, we do nothing.

Now, the first case (where the 3 count bits are non-zero) is what I
refer to as the Embedded format.  The second format is the Separate
compression format.

The Inline compression format is the easiest compression format.  The
next byte contains data about the compressed format.  This byte is
used as an offset.  The count is acquired by taking the next two bits
from the bitfield, reversing them, and adding one.  So, for example,
if the next two bits are a 1 and a 0, then the count is 01 + 1
(binary), or 2.

Now, in ALL cases, the number of bytes actually copied during
decompression is equal to the count *PLUS 1*.  So, in the above
example, three bytes would actually be copied from the specified
offset.  As well, the offsets are all stored in signed two's complement.

Also, if, at any time during reading bits from the bitfield, you run
out of bits, *immediately* read the next two bytes from the data
buffer and use them as your bitfield.

Decompression Pseudo-Code
-------------------------

Read bitfield from data buffer

MainLoop:
  Shift bit from right side of bitfield
  If no bits left, read next bitfield

  If bit = 1
    Append next byte from data buffer to decompressed buffer
    Position = Position + 1

    Goto MainLoop
  Else
    Shift bit from right side of bitfield
    If no bits left, read next bitfield

    If bit = 1
      Low = Next byte from buffer
      Hi = Next byte from buffer

      Offset = (Hi << 5) | Low
      Count = Hi & 0x07

      If Count != 0
        Count = Count + 1
      Else
        Count = Next byte from buffer

        If Count = 0 terminate
        If Count = 1 goto MainLoop

    Else
      Shift bit from right side of bitfield
      If no bits left, read next bitfield

      Count |= Bit << 1

      Shift bit from right side of bitfield
      If no bits left, read next bitfield

      Count |= Bit

      Offset = Next byte from buffer

  Count = Count + 1

  while (Count > 0)
    decompressed_buffer[position] = decompressed_buffer[offset + position]
    position = position + 1

Decompression C Code
--------------------

/*
  To run, compile and then run it as:

    decomp [file.rom] [output_file.dat] [compressed data start location]
*/

#include <stdlib.h>
#include <stdio.h>

int get_bit(FILE *file, unsigned short int *field, int *bit_cnt)
{
  int bit = *field & 1;

  *field >>= 1;

  (*bit_cnt)--;

  if (*bit_cnt < 0)
    {
      fread(field, sizeof(short), 1, file);
      *bit_cnt = 15;
    }

  return bit;
}

int main(int argc, char *argv[])
{
  FILE *file;
  FILE *output;
  int file_pos;

  unsigned char buffer[65535];
  int buf_pos = 0;

  unsigned short field;
  int bit_cnt;

  unsigned char high, low, buf;

  int count;
  signed short int offset;

  if (argc < 4) { printf("Invalid params!\n"); exit(0); }

  file = fopen(argv[1], "r");
  output = fopen(argv[2], "w");
  sscanf(argv[3], "%x", &file_pos);

  memset(buffer, 0, sizeof(buffer));

  fseek(file, file_pos, SEEK_SET);

  fread(&field, sizeof(field), 1, file);
  bit_cnt = 15;

  while(1)
    {
      if (get_bit(file, &field, &bit_cnt))
        {
          fread(&buf, sizeof(buf), 1, file);
          buffer[buf_pos++] = buf;

          continue;
        }
      else
        {
          if (get_bit(file, &field, &bit_cnt))
            {
              fread(&low, sizeof(low), 1, file);
              fread(&high, sizeof(high), 1, file);

              offset = (0xFF00 | high) << 5;
              offset = (offset & 0xFF00) | low;

              high &= 0x07;

              if (high)
                {
                  count = high + 1;
                }
              else
                {
                  fread(&buf, sizeof(buf), 1, file);
                  if (buf == 1) continue;

                  count = buf;

                  if (count == 0) break;
                }
            }
          else
            {
              count = (get_bit(file, &field, &bit_cnt) << 1) |
                       get_bit(file, &field, &bit_cnt);
              count++;

              fread(&buf, sizeof(buf), 1, file);

              offset = 0xFF00 | buf;
            }
        }

      if ((buf_pos + offset) < 0)
        {
          printf("Invalid offset found at %i, terminating.\n", ftell(file));

          fwrite(buffer, 1, buf_pos, output);
          fclose(output);
          exit(1);
        }

      while (count >= 0)
        {
          buf = buffer[buf_pos + offset];
          buffer[buf_pos++] = buf;

          count--;
        }
    }

  printf("Read: %i bytes, Wrote: %i bytes\n",
         ftell(file) - file_pos,
         buf_pos);

  fclose(file);

  fwrite(buffer, 1, buf_pos, output);
  fclose(output);
}

Contact Info
------------

This document was written by me, Brett Kosinski.
You can reach me at: brettk@frodo.dyn.gno.org


Acknowledgements
----------------

Well, I *definitely* couldn't have done this without the work of Dave,
Joe Groff, Phil K. Hornung, and everyone else who wrote/worked on
DGen.  Damn, this program is sweet... :)

Damian Grove and the Sonic 2 Hacking Guide website
(http://ssrg.emulationzone.org/s2hack/).  The information on level
data locations is just what I needed to get going. :)

My girlfriend Lenore, for being fairly understanding about me staying
up 'til 3:00am working on this damn stuff.