An ILOC Simulator

W. A. Barrett

Computer Engineering Dept.

San Jose State University

March 1, 2007, version 1.1

Comments and/or suggestions are invited: wbarrett

 

Appendix A of Cooper and Torczon’s compiler design book (Keith Cooper and Linda Torczon, Engineering a Compiler, Morgan-Kaufman 2004. ISBN 1-55860-698-X) describes an intermediate language, ILOC. 

The language is essentially an assembler language, but one that is not dedicated to any particular platform.  The instruction set assumes that the implementing machine has an unlimited number of registers.

ILOC supports “integers”, which we will assume are 32-bit signed integers.  Some instructionsl work with “characters”, apparently to provide a capability of reading and writing text.

Arithmetic, logical and comparison operators are provided for, as well as several forms of branch instruction.

An “output” instruction provides a simple way to send numbers and characters to the screen.

Floating-point is not supported.

There are no explicit instructions to call and return from a function, nor any to manage a stack and stack frames.  These can be formed from the ILOC instruction set, however.

 

Unfortunately, there appears to be no syntax checker, parser or simulator available for these.  A simulator obtained through the author supports only a subset of the language’s instructions, and it provides no way to debug a program.

 

Installing the Simulator

This is very easy.  It’s in one executable file -- sim.exe for Windows and sim for Debian Linux, Intel platform.  Other Linux platforms will require obtaining the source code and compiling, using a C++ compiler.

Download the appropriate file and link it into your PATH variable.

Using the Simulator

This works through a command prompt, not a window-like interface.  (Perhaps a future version will have a windows interface).

You can view the execution requirements by calling it with no parameters:

wbarrett@ce3:~/lab1/sim$ ./sim

no filename

Usage: main [-d][-c] {-i memloc {contents}} <filename>

   accepts a block of ILOC instructions,

   executes the code

Author:  W. A. Barrett, version 1.0

 -d: execute under debug mode

 -c: check syntax only

 -i parameter -- initialize some memory locations

        with 4-byte integers

 

Option “-d” causes it to run under a debugging mode.  Syntax checking is done before execution.

Option “-c” causes it to just check your assembler syntax.  This includes whether your instructions are part of the legal list, and also whether your register, constant and/or label fields are correct.  If you use labels with branches, it will also verify that every label in an instruction has a target, and that every labelled instruction has at least reference.

Option “-i” can be used repeatedly.  Follow it with a memory address (use 1024 and higher addresses), then one or more values to be carried in that address.  A “value” may be a signed integer expressed in decimal, hex or octal.  It may also be single-quoted character.  A double-quoted string is also a legal value, causing its bytes to be written to memory.

A character occupies 1 memory byte, while an integer occupies 4.  These sizes are reflected in the addresses assigned to the various components.

Note that these are initial values only.  Your program can change any of these memory cells.

Rather than using “-i”, we recommend using the new “mem” instruction in your program, described later.

 

The “<filename>” is the name of your ILOC text file.

Comments

As in the author’s examples, a comment opens with // and continues to the end of its line.

Register r0

Cooper and Torczon seem inconsistent about register r0.  Some ILOC programs seem to assume that it carries 0, while others explicitly set it to 0.  Others seem to treat it as any other register.

This simulator assumes that r0 contains a constant 0, and cannot be changed.  Any setting of this register, including to 0, will generate an error.  Some of the C/T programs will have to be changed to accommodate this.

New storeI Instruction

I was puzzled about why C/T provided no storeI instruction, although there is a similar loadI instruction.  However, there is a storeAI r1 => r2,c which stores register r1 into memory(r2+c). If this is used with r2 = r0, and r0 is zero, then it serves as a storeI.

So this simulator supports storeI, defined as follows:

storeI r1 => C   // r1 -> memory(C)

New loadL Instruction

loadL accepts a code label (an identifier), and sends its simulator index number to a register:

loadL  LABEL2 => r3

This is different than loadI, which copies a constant value to a register.

The register value is not the line number, rather a zero-based index of the target instruction.

You can later execute this command against such a register:

jump  => r3

which expects a valid index value in r3, and transfers control to it.  If the register value is out of range, an error message is generated.

nop Instruction

The authors refer to “nop” in the book, but it isn’t in the appendix.

This is supported by the simulator -- it does nothing.

New output Instructions

Two new output instructions are provided that print single characters:

outputC  r1     // output the LSB 8 bits in r1 as an ASCII character

outputCI c      // output ‘c’ as an ASCII character

Neither of these write anything else, so you can chain a bunch of them in sequence to generate text messages.

The character set includes a line feed, written '\n', and that will cause a carriage return-line feed in the command window.

Change in output and outputI

These instructions will print a number in decimal to the output stream.

Unlike the author’s simulator, NO line feed is printed after that.  You should add a second instruction to generate a line feed, space or whatever, for example:

outputCI '\n'

New mem Macro Instruction

A memory initialization instruction is provided.  This causes various memory positions to be initialized before the program executes.  It must appear before any executable instructions.  Note that these memory locations are not identified by a label, though a future enhancement along this line is feasible.  For example:

mem 1026 15 16 0x7FFF abcde” '\n'

will initialize the locations 1026 to 15, 1030 to 16, 1034 to 0x7FFF, 1038 to ‘a’, 1039 to ‘b’, etc.  The last byte filled will be a line feed.

C constant forms supported

Constant integers in the instructions or the command line may be written in any of the three C styles -- decimal, octal or hex.  You can place an optional sign in front:  + or -.

If there’s no leading 0, the number is assumed to be in decimal.

With a leading 0, but no following “X” or ‘x’, the number is assumed to be in octal.

With a leading 0x or 0X, the number is assumed to be hexadecimal.

An ASCII constant can be written with single quotes.  Inside the single-quote pair, you may write any of the standard C special characters, for example:

'\''   '\n'  '\r'  '\0'  '\”'   '\222'

A double-quoted ASCII string can be used in the input “-i” sequence or in the mem instruction.  It expands into the appropriate single-byte values, placed in memory in successive byte positions.

Labels

Any instruction can carry a branch label, which is an ordinary identifier followed by a colon “:”.   This may be on a line by itself, or be followed by an ILOC instruction.  You can also write several of these referring to the same ILOC line.

You may not declare a label twice in an ILOC file -- there’s no “local” structuring provided.

The branch instructions typically require one or two labels, so use one the program labels in them.

Using the Debugging Tools

When you start the simulator with “-d”, it will stop just before executing the first instruction, and expect a single-letter command.

Type “?” to get a list of the available instructions:

 

wbarrett@ce3:~/lab1/sim$ ./sim -d sblock1.i

  18:     loadI    1032 => r1 // ##18

>> ?

 c: continue w/o debugging

 n, CR: next instruction w debugging

 b index: set or clear breakpoint at index

 B: show all breakpoints

 i: show next instruction

 p: show previous instruction

 l [index]: list group of instructions

 m: show memory

 M: show memory each time (toggle)

 r: show registers

 R: show registers each time (toggle)

 q: quit

>> 

 

c” will continue the program without debugging, except that it will stop if a breakpoint is reached.

n” or CR will single-step.  At a minimum, the instruction about to be executed will be displayed.

b index” will set or clear a breakpoint on a particular line index.  The index should be the same as the source line number in your ILOC text file.  It makes sense to open a separate window with a text editor on your source file, with an editor that shows the line numbers.  You can also use the “l” command to sniff around among your instruction list.

‘B’ shows all breakpoints currently set.

i’ displays the next instruction, which is about to be executed.

p’ displays the previous instruction, which has just been executed.  (“Previous” means the previous one in the file, not the dynamically previous instruction -- another bug to fix)

l’ shows several instructions in your input list

m” and ‘M’ shows all the memory cells.  M will show them on every step, until you do another M.

r’ and ‘R’ shows all the registers.  R will show them on every step, until you do another R.

How to Call and Return from a Function

You will notice there is no “function call” instruction.  However, this can be done with the ILOC instruction set.

Functions are best called through a pushdown stack mechanism, which supports recursion.  This requires allocating two registers, one that points to the current stack top, and a second one that points to the current “stack frame”.  These are used as address pointers.

The stack space allocation is a bit problematic, since there is no mechanism for allocating a block of memory.  That’s up to the programmer.

Assume that register r1 points to the “stack top”, and r2 is the current “stack frame pointer”.  We’ll also assume that the stack top is at the high end of the stack allocation, which means that as items are pushed in the stack, r1 is decreased.  No mechanism for stack underflow will be provided, though that can be done by a test on whether r1 exceeds some lower limit or not.

The function calling scheme is then reasonably standard, summarized as follows:

o   allocate stack space for the return value.  This amounts to decrementing r1 by the size of the return value.

o   push each actual parameter onto the stack.  This amounts to decrementing r1 by the size, then storing the parameter through r1.

o   push a return address on the stack.  This is where we use the loadL instruction.  The return address should be labelled as the instruction following the function call.  So loadL goes into a free register, then its value is pushed in the stack.

o   call” the function.  This is just a jump to the (labelled) instruction that starts the function.

We are now “inside” the function.  A few operations are needed before the function’s body is executed:

o   push the stack frame register r2 into the stack.

o   copy r1 to r2.  This sets the stack frame register r2 to point to the current stack.

o   decrement r1 by as many bytes as needed for the local variables.  These will be referenced through the stack frame register, using a negative offset.

In the function body, any reference to a formal parameter is made through the stack frame register, using a positive offset.  A reference to a local variable is made the same way, using a negative offset.

When the function is about to return, these operations are required:

o   copy any return value into the return value stack location, again using the frame register r2 with an appropriate offset.

o   copy r2 to r1.  This sets the stack top at the position of the frame register

o   pop the stack into r2.

o   pop the stack into a free register, say r3, then execute this:

jump => r3

The location just after the function call instruction should be labelled -- and that label is used in the loadL instruction executed prior to calling the function.

The formal parameters and the return value are still in the stack.  Under C calling conventions, these need to be popped off by the caller, which is a matter of incrementing r1 by the appropriate amount.  That will leave the return value on the stack top, and again, the caller may choose to remove that or use it in some manner.

How to Encode a Switch Branching

The C switch statement can be encoded in two different ways --

o   as a sequence of “if-if-if-if-else” statements, or

o   through a table of branch addresses.

Note that the “switch” asks for a cardinal number, then executes the branch based on a match of that number with a list of “case” statements.

A branch table will be very efficient, provided that the range of numbers in the switch is limited.  The general idea here is this:

o   first test the switch value (call it V) against the upper and lower bounds of the case statement values. 

o   If V is out of range, branch to the default switch (which may be the statement following the switch).

o   If V is in range, then there should be an indexable table of memory addresses through which a branch can be executed.  The index used will be V-lower, where lower is the least value of the case statement values.

Do that in ILOC like this:

load  V => r3   # the switch value

(test for lower/upper limits)

subI  r3,lower => r3   # make r3 zero-based

loadL TABLE => r4   # index of first TABLE instruction

add   r3,r4 => r3   # r3 now points into the table

jump => r3

TABLE:

jump => B0  # branch to B0 for r3 == 0

jump => B1  # branch to B1 for r3 == 1

jump => B2  # etc.

...

What makes this work is the indexing scheme used in the simulator -- each of the jump instructions in TABLE uses one code location.  So the first jump to B0 is at TABLE+0, effectively.  The branch at B2 is at TABLE+2, etc.

Memory Pecularities

The memory cells are simulated by carrying their values in a symbol table, rather than as binary numbers strung out in an array of bytes.  That should be changed in a future version, but for now, you will notice that two ints (4 bytes?) can be declared in two byte-adjacent cells, say 1024 and 1025, and they are considered distinct.

To be more realistic, memory should be carried in a contiguous array of bytes, so that any overlapping of two adjacent memory cells will be obvious, and the simulator more realistic.