General rules for writing a X compiler to Z in Y

Question

Suppose X is the input language and Z is the output language; then f is the compiler, which is written in language Y.

f = X -> Z

Since f is only a program, I think Y can be any language, right? So we can have compilers f1 and f2, written in Y1 and Y2 respectively.

f1 = f Y1
f2 = f Y2
g = Z -> M
h = g . f # We get a compiler X -> M
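As a rough illustration of h = g . f, each compiler can be modeled as a plain function from source text to target text. The three toy languages below (infix sums for X, postfix text for Z, prefix s-expressions for M) and the helper compose are assumptions made up for this sketch; they are not part of the question.

```python
from typing import Callable

Compiler = Callable[[str], str]

def compose(g: Compiler, f: Compiler) -> Compiler:
    """Return h = g . f: run f first, then feed its output to g."""
    return lambda source: g(f(source))

def f(x_source: str) -> str:
    """Toy compiler X -> Z: infix sums such as '1 + 2 + 3' to postfix text."""
    terms = [term.strip() for term in x_source.split("+")]
    out = [terms[0]]
    for term in terms[1:]:
        out += [term, "+"]
    return " ".join(out)

def g(z_source: str) -> str:
    """Toy compiler Z -> M: postfix text to a prefix s-expression."""
    stack = []
    for token in z_source.split():
        if token == "+":
            right = stack.pop()
            left = stack.pop()
            stack.append(f"(+ {left} {right})")
        else:
            stack.append(token)
    return stack.pop()

h = compose(g, f)          # a compiler X -> M
print(f("1 + 2 + 3"))      # 1 2 + 3 +
print(h("1 + 2 + 3"))      # (+ (+ 1 2) 3)
```

The language each function is written in (Y) never appears in the composition, which matches the intuition that Y can be anything.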

Take the cpython compiler, for example: X is Python, Z is the Python VM code, and Y is C.

cpython = Python -> PythonVMCode C
interpreter = PythonVMCode -> Nothing
interpreter2 = PythonVMCode -> MachineCode

Python sources are compiled to the Python VM code (the .pyc files), then interpreted by the interpreter. It looks like a compiler could exist that does Python -> MachineCode directly, though it would be much harder to implement:

 hardpython = interpreter2 . cpython

We can also write another compiler to do the work Python -> PythonVMCode in another language, say Python itself.

mypython = Python -> PythonVMCode Python
mypython2 = Python -> PythonVMCode Ruby

Now, here is the complicated example, PyPy. I'm new to PyPy, so correct me if I'm wrong:

PyPy doc http://doc.pypy.org/en/latest/architecture.html#pypy-the-translation-framework

Our goal is to provide a possible solution to the problem of language implementers: having to write l * o * p interpreters for l dynamic languages and p platforms with o crucial design decisions.

We can think of l as X and p as Y. There exists a program which translates all RPython programs to C:

rpython_compiler = RPython -> C Python
pypy = Python -> Nothing RPython
translate = compile the program pypy written in RPython using rpython_compiler
py2rpy = Python -> RPython Python
py2c = Python -> C Python
py2c = rpython_compiler . py2rpy

RPython programs are just like VM instructions, and rpython_compiler is the VM.

q1. pypy is the interpreter, an RPython program which can interpret Python code. There is no output language, so we can't consider it a compiler, right?

Added:

I just found that even after translation, pypy is still an interpreter, only this time written in C.
If we look deep into the interpreter pypy, I believe there must be some kind of compiler inside it, which compiles the Python source to some AST and then executes it, like this:

compiler_inside_pypy = Python -> AST_or_so

q2. Can a compiler py2rpy exist that transforms all Python programs to RPython? Which language it's written in is irrelevant. If yes, we get another compiler, py2c. What's the difference in nature between pypy and py2rpy? Is py2rpy much harder to write than pypy?

q3. Are there any general rules or theory available about this?

More compilers:

gcc_c = C -> asm? C # not sure, gimple or rtl?
g++ = C++ -> asm? C
clang = C -> LLVM_IR C++
jython = Python -> JVMCode Java
ironpython = Python -> CLI C#

q4. Given f = X -> Z and a program P written in X, what can we do when we want to speed up P? Possible ways:

rewrite P with a more efficient algorithm
rewrite f to generate better Z
if Z is interpreted, write a better Z interpreter (is this where PyPy fits?)
speed up programs written in Z recursively
get a better machine

P.S. This question is not about the technical details of how to write a compiler, but about the feasibility and complexity of writing a certain kind of compiler.

Not directly related, but a somewhat similar concept: en.wikipedia.org/wiki/Supercompilation
– SK-logic, Commented May 28, 2011 at 13:32
I'm not sure this question really fits Stack Overflow, especially as there are so many subquestions in it, but I still admire the thought that went into this.
– John Y, Commented May 28, 2011 at 14:24
Despite what you may have been taught, an AST is not required - it's simply a strategy some compilers use.
– Neil Butterworth, Commented May 28, 2011 at 14:26
PyPy's Python implementation, like most "interpreters", is really a bytecode compiler and an interpreter for that bytecode format in one.
– user7043, Commented May 28, 2011 at 15:32

Lie Ryan · Accepted Answer · 2011-05-28 15:47:22Z

q1. pypy is the interpreter, an RPython program which can interpret Python code. There is no output language, so we can't consider it a compiler, right?

PyPy is similar to CPython: both have a compiler plus an interpreter. CPython has a compiler written in C that compiles Python to Python VM bytecode, then executes the bytecode in an interpreter written in C. PyPy has a compiler written in RPython that compiles Python to Python VM bytecode, then executes it in the PyPy interpreter, which is also written in RPython.
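The two stages can be observed from inside Python itself with the built-in compile and exec functions and the standard dis module. This is only a small sketch of the compile-then-interpret split; the source string and the filename label "<example>" are arbitrary choices for illustration.

```python
import dis

source = "result = 6 * 7"

# Stage 1: the compiler turns Python source into a code object (bytecode).
code_object = compile(source, "<example>", "exec")

# Stage 2: the interpreter executes that bytecode.
namespace = {}
exec(code_object, namespace)
print(namespace["result"])  # 42

# Show the bytecode. The exact instructions differ between Python
# versions and implementations, so no output is listed here.
dis.dis(code_object)
```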

q2. Can a compiler py2rpy exist that transforms all Python programs to RPython? Which language it's written in is irrelevant. If yes, we get another compiler, py2c. What's the difference in nature between pypy and py2rpy? Is py2rpy much harder to write than pypy?

Can a compiler py2rpy exist? Theoretically yes. Turing completeness guarantees it.

One method to construct py2rpy is to simply include the source code of a Python interpreter written in RPython in the generated source code. An example of a py2rpy compiler, written in Bash:

# suppose that /pypy/source/ contains the source code for pypy (i.e. Python -> Nothing RPython)
mkdir -p /tmp/py2rpy
cp -r /pypy/source /tmp/py2rpy/pypy
# suppose $inputfile contains arbitrary Python source code
cp "$inputfile" /tmp/py2rpy/prog.py
# generate the main.rpy
echo "import pypy; pypy.execfile('prog.py')" > /tmp/py2rpy/main.rpy
cp -r /tmp/py2rpy/ "$outputdir"

Now, whenever you need to translate Python code to RPython code, you call this script, which produces in $outputdir an RPython main.rpy, the source code of the Python interpreter written in RPython, and a binary blob prog.py. You can then execute the generated RPython script by calling rpython main.rpy.

(Note: since I'm not familiar with the rpython project, the syntax for calling the rpython interpreter, the ability to import pypy and do pypy.execfile, and the .rpy extension are purely made up, but I think you get the point.)
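The same bundling trick can be shown end to end with a language small enough to fit in a few lines. The sketch below is an analogy, not PyPy: the toy source language (statements "add N" and "print"), the names INTERPRETER_SOURCE and toy_to_python, and the choice of Python as the target language are all assumptions made for illustration.

```python
# Interpreter for a hypothetical toy language, kept as text so it can be
# pasted into every generated program.
INTERPRETER_SOURCE = '''
def interpret(program):
    total = 0
    outputs = []
    for line in program.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "add":
            total += int(parts[1])
        elif parts[0] == "print":
            outputs.append(total)
    return outputs
'''

def toy_to_python(toy_source: str) -> str:
    """'Compile' toy code to Python by bundling the interpreter with the program."""
    return (
        INTERPRETER_SOURCE
        + f"\nPROGRAM = {toy_source!r}\n"
        + "RESULT = interpret(PROGRAM)\n"
    )

generated = toy_to_python("add 2\nadd 3\nprint\n")
namespace = {}
exec(generated, namespace)   # run the generated Python program
print(namespace["RESULT"])   # [5]
```

The generated text is a valid program in the target language, so by the question's notation this is a toy -> Python translator, even though no real translation of the toy statements takes place.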

q3. Are there any general rules or theory available about this?

Yes, any Turing complete language can theoretically be translated to any other Turing complete language. Some languages may be much more difficult to translate than others, but if the question is "is it possible?", the answer is "yes".

q4. ...

There is no question here.

Your py2rpy compiler is really clever. It leads me to another idea. 1. Does pypy have to be written in RPython in your compiler? All you need is something that can interpret Python files, right? 2. os.system('python $inputfile') may also work if it's supported in RPython. Not sure whether it can still be called a compiler, at least not literally.
– jaimechen, Commented May 28, 2011 at 16:56
Is pypy still using the Python VM? Now it's clear. pypy_the_compiler = Python -> PythonVMCode RPython, pypy_the_interpreter = PythonVMCode -> Nothing RPython, cpython_the_compiler = Python -> PythonVMCode C, cpython_the_interpreter = PythonVMCode -> Nothing C
– jaimechen, Commented May 28, 2011 at 17:08
@jaimechen: Does pypy have to be written in RPython in your compiler? No, it does not need to be written in RPython, but RPython must be able to tell the "auxiliary interpreter"/"runtime" to execute Python code. Yes, it's true this isn't a "compiler" in a practical sense, but it is a constructive proof that it is possible to write Python -> RPython. Is pypy still using the Python VM? I believe pypy does not use CPython at all (I could be wrong); instead PyPy has its own implementation of the "Python VM" that is written in RPython.
– Lie Ryan, Commented May 28, 2011 at 17:41
@jaimechen: a more practical compiler could analyze the input file for code sequences that it knows how to compile, compile these separately, and provide a way to jump back and forth between the "recompiled-to-RPython" Python and the "interpreter-aided" Python. It may also use techniques common in JIT compilation to detect whether a particular input may produce different output due to differences between the semantics of RPython and Python, and fall back to interpretation in those cases. All of these are refinements that might be seen in a more practical Python -> RPython compiler.
– Lie Ryan, Commented May 28, 2011 at 17:56
Maybe a constraint should be added here: transform state machine X to state machine Z without the aid of an existing third machine. This is the case when X is completely new and no compiler or interpreter for it exists yet.
– jaimechen, Commented May 29, 2011 at 3:54

user207421 · Accepted Answer · 2011-05-28 23:25:16Z

To answer q2 only, there is a compiler book by William McKeeman in which the theory of compilers for language X written in language Y producing output language Z is explored via a system of T-diagrams. Published in the 1970s, title not to hand, sorry.
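One way to get a feel for T-diagrams is to model each compiler as a record of three languages and write down the rule for running one compiler through another. The sketch below is a simplified reading of that idea, not McKeeman's notation; the Compiler class and the translate function are hypothetical names, and the example values are borrowed from the question and its comments.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Compiler:
    """One T-diagram: translates `source` to `target`, written in `written_in`."""
    source: str
    target: str
    written_in: str

def translate(program: Compiler, using: Compiler) -> Compiler:
    """Compile the compiler `program` with the compiler `using`.

    Only the implementation language changes; what `program` translates
    from and to stays the same.
    """
    if program.written_in != using.source:
        raise ValueError(f"{using} cannot read {program.written_in}")
    return Compiler(program.source, program.target, using.target)

# rpython_compiler = RPython -> C, written in Python
rpython_compiler = Compiler("RPython", "C", "Python")
# pypy_the_compiler = Python -> PythonVMCode, written in RPython
pypy_the_compiler = Compiler("Python", "PythonVMCode", "RPython")

print(translate(pypy_the_compiler, rpython_compiler))
# Compiler(source='Python', target='PythonVMCode', written_in='C')
```

The check in translate is the rule that two T shapes only fit together when the implementation language of one matches the input language of the other.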

Yes, this is it, thanks. en.wikipedia.org/wiki/Tombstone_diagram
– jaimechen, Commented May 29, 2011 at 4:16

John R. Strohm · Accepted Answer · 2011-05-29 03:58:37Z

q1. Generally, an interpreter is not a compiler. The key difference between a compiler and an interpreter is that an interpreter starts fresh, with source code in the source language, every time. If your pypy was instead pyAST, or pyP-code, and then you had an AST or P-code interpreter, then you could call pyAST a compiler. This is how the old UCSD PASCAL compiler worked (as well as quite a few others): they compiled to some P-code, which was interpreted when the program was run. (Even .NET provides something like this, when compactness of the generated object code is far more important than speed.)
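The compile-once, interpret-later split can be sketched with a tiny expression language. This is a hedged illustration rather than how UCSD Pascal worked: the instruction names (PUSH, ADD, SUB, MUL) and the functions compile_to_pcode and run_pcode are made up, and Python's standard ast module is borrowed as a ready-made parser.

```python
import ast

def compile_to_pcode(expression: str) -> list:
    """Compile an integer arithmetic expression to stack-machine instructions."""
    ops = {ast.Add: "ADD", ast.Sub: "SUB", ast.Mult: "MUL"}
    pcode = []

    def emit(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            pcode.append(("PUSH", node.value))
        elif isinstance(node, ast.BinOp) and type(node.op) in ops:
            emit(node.left)    # operands first ...
            emit(node.right)
            pcode.append((ops[type(node.op)],))  # ... then the operator
        else:
            raise ValueError("unsupported syntax")

    emit(ast.parse(expression, mode="eval").body)
    return pcode

def run_pcode(pcode):
    """Interpret the instructions; the source text is no longer needed here."""
    stack = []
    for instr in pcode:
        if instr[0] == "PUSH":
            stack.append(instr[1])
            continue
        right = stack.pop()
        left = stack.pop()
        if instr[0] == "ADD":
            stack.append(left + right)
        elif instr[0] == "SUB":
            stack.append(left - right)
        elif instr[0] == "MUL":
            stack.append(left * right)
    return stack.pop()

pcode = compile_to_pcode("2 + 3 * 4")   # done once
print(pcode)
print(run_pcode(pcode))                 # can be repeated without re-parsing: 14
```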

q2. Yes, of course. See UCSD PASCAL (and a bunch of others).

q3. Dig through the classic texts in computer science. Read up on Concurrent PASCAL, by Per Brinch Hansen (if memory serves me). A lot has been written about compilers and code generation. Generating a machine-independent pseudocode is usually a lot easier than generating machine code: the pseudocode is usually free of the quirks that real machines invariably contain.

q4. If you want your generated object to run faster, you make the compiler smarter, to do better optimization. If your object is interpreted, you consider pushing more complex operations down into primitive pseudoinstructions (CISC vs. RISC is the analogy), then you do your best to optimize the frack out of your interpreter.
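Pushing a common sequence down into one pseudoinstruction can be sketched as a peephole pass over stack code. Everything here is hypothetical: the instruction set, the fused ADD_CONST instruction, and the dispatch counter, which stands in for interpreter overhead rather than measuring real time.

```python
def fuse(pcode):
    """Peephole pass: replace PUSH n followed by ADD with one ADD_CONST n."""
    fused = []
    i = 0
    while i < len(pcode):
        if (pcode[i][0] == "PUSH" and i + 1 < len(pcode)
                and pcode[i + 1][0] == "ADD"):
            fused.append(("ADD_CONST", pcode[i][1]))
            i += 2
        else:
            fused.append(pcode[i])
            i += 1
    return fused

def run(pcode):
    """Interpret the pseudocode; return (result, number of dispatched instructions)."""
    stack = []
    dispatched = 0
    for instr in pcode:
        dispatched += 1
        if instr[0] == "PUSH":
            stack.append(instr[1])
        elif instr[0] == "ADD":
            right = stack.pop()
            stack.append(stack.pop() + right)
        elif instr[0] == "ADD_CONST":
            stack.append(stack.pop() + instr[1])
    return stack.pop(), dispatched

plain = [("PUSH", 1), ("PUSH", 2), ("ADD",), ("PUSH", 3), ("ADD",)]
print(run(plain))        # (6, 5)
print(run(fuse(plain)))  # (6, 3)
```

Both versions compute the same result, but the fused program goes through the interpreter's dispatch loop fewer times, which is the kind of saving the CISC analogy points at.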

If you want your compiler to run faster, you have to look at EVERYTHING it does, including rethinking your source code. After loading the compiler itself, the most time-consuming part of compilation is ALWAYS reading the source code into the compiler. (Consider C++ for example. All other things being relatively equal, a compiler that has to chomp down 9,000 (or maybe 50,000) lines of #include files to compile a simple "Hello, World" program is never going to be as fast as one that only has to read four or five lines.)

I don't remember where I read it, but the original Oberon compiler at ETH-Zurich had a very sophisticated symbol table mechanism, quite elegant. Wirth's benchmark for compiler performance was the time it took for the compiler to compile itself. One morning, he went in, yanked out the gorgeous multiply-linked ultra-tree symbol table, and replaced it with a simple linear array and straight linear searches. The graduate students in his group were SHOCKED. After the change, the compiler was faster, because the modules it was compiling were always small enough that the elegant monster imposed more total overhead than the linear array and linear search.

Thanks. A compiler 'compiles', while an interpreter 'executes'. Can there be more insight about the two kinds of programs, like their types being different?
– jaimechen, Commented May 29, 2011 at 4:53

Alex ten Brink · Accepted Answer · 2011-05-29 12:14:18Z

Your questions as stated lead me to believe that what you really want/need is an explanation of what a compiler is, what an interpreter is and the differences between the two.

A compiler maps a program written in language X to a functionally equivalent program written in language Y. As an example, a compiler from Pascal to C might compile

function Square(i: Integer): Integer;
begin
  Square := i * i
end;

to

int Square(int i) { return i * i; }

Most compilers compile 'downwards', so they compile higher-level programming languages into lower-level languages, the lowest-level language being machine code.

Most compilers compile directly to machine code, but some (notably Java and .NET languages) compile to 'bytecode' (Java bytecode and CIL). Think of bytecode as machine code for a hypothetical computer. This bytecode is then interpreted or JITted when it is run (more on that later).

An interpreter executes a program written in some language Z. An interpreter reads a program bit by bit, executing it as it goes along. For instance:

int i = 0;
while (i < 1) { i++; }
return i;

Imagine the interpreter looking at that program line by line: examining the line, executing what it does, looking at the next line and so forth.
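A minimal sketch of such an interpreter follows. To keep it short it skips parsing: the three statements are written directly as tuples, and that encoding (assign, while_less, incr, return) is an assumption invented for this example.

```python
# Hypothetical tuple encoding of the three statements.
program = [
    ("assign", "i", 0),                        # int i = 0;
    ("while_less", "i", 1, [("incr", "i")]),   # while (i < 1) { i++; }
    ("return", "i"),                           # return i;
]

def execute(statements, variables):
    """Execute statements one by one; return a value once 'return' is reached."""
    for stmt in statements:
        kind = stmt[0]
        if kind == "assign":
            variables[stmt[1]] = stmt[2]
        elif kind == "incr":
            variables[stmt[1]] += 1
        elif kind == "while_less":
            _, name, limit, body = stmt
            while variables[name] < limit:
                result = execute(body, variables)
                if result is not None:
                    return result
        elif kind == "return":
            return variables[stmt[1]]
    return None

print(execute(program, {}))  # 1
```

Nothing is translated into another language here; each statement is examined and acted on immediately, which is the difference from a compiler.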

The best example of an interpreter is the CPU of your computer. It interprets machine code and executes it. How the CPU works is specified by how it is physically built. How an interpreter program works is specified by what its code looks like. The CPU therefore interprets and executes the interpreter program, which in turn interprets and executes its input. You can chain interpreters this way.

A JITter is a Just-In-Time compiler. A JITter is a compiler. The only difference is the time at which it runs: most programs are written, compiled, shipped to their users and then executed, but Java bytecode and CIL are shipped to their users first, and then compiled to machine code on the users' machines just before they are executed.

C# -> (compile) -> CIL -> shipped to customer -> (compile just before execution) -> machine code -> (execute)
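The timing aspect can be imitated in a few lines: ship the uncompiled form, compile it on first use, and reuse the compiled result afterwards. This is only an analogy under stated assumptions: Python's built-in compile produces bytecode rather than machine code, and the LazyFunction class with its compile_count field is a made-up helper.

```python
class LazyFunction:
    """Holds shipped source text and compiles it only on the first call."""

    def __init__(self, source, name):
        self.source = source
        self.name = name
        self.compiled = None
        self.compile_count = 0

    def __call__(self, *args):
        if self.compiled is None:
            # Compile just before the first execution, then cache the result.
            namespace = {}
            exec(compile(self.source, "<shipped>", "exec"), namespace)
            self.compiled = namespace[self.name]
            self.compile_count += 1
        return self.compiled(*args)

square = LazyFunction("def square(i):\n    return i * i\n", "square")
print(square(7), square(8), square.compile_count)  # 49 64 1
```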

The last thing you'll want to know about is Turing completeness. A programming language is Turing complete if it can compute everything a 'Turing machine' can, i.e. it is at least as 'powerful' as a Turing machine. The Church-Turing thesis states that a Turing machine is at least as powerful as any machine we can ever build. It follows that every Turing complete language is exactly as powerful as the Turing machine, and therefore all Turing complete languages are equally powerful.

In other words, as long as your programming language is Turing complete (nearly all of them are), it doesn't matter which language you pick, since they can all compute the same things. This also means that it is not very relevant which programming language you pick to write your compiler or your interpreter. Last but not least, it means you can always write a compiler from language X to Y if X and Y are both Turing complete.

Note that being Turing complete doesn't say anything about whether your language is efficient, nor about all the implementation details of your CPU and other hardware, or the quality of the compiler you use for the language. Also, your operating system might decide your program doesn't have the rights to open a file, but that doesn't impede your ability to compute anything - I deliberately did not define computing, since that would take another wall of text.
