The University of Nottingham Homepage The University of Nottingham Homepage School of Computer Science Homepage

G53CMP Compilers: Frequently Asked Questions and Common Mistakes

Index

Frequently Asked Questions

  1. Named Fields and Updating Values
  2. Reading Files into GHCi
  3. Are late submissions of the coursework accepted?
  4. How to write your own EBNF grammar
  5. Character Literals in HMTC
  6. Abstract syntax
  7. Command vs Commands
  8. Relational operators in expressions
  9. My grammar allows an arbitrary type on the left of a conditional expression, is this wrong?
  10. Do we need to regenerate the Parser code?
  11. Can't we treat the conditional expression as a function?
  12. If there is a type error inside a conditional expression, what is the overall type?
  13. In Happy, what does %prec mean?
  14. What is the meaning of the type checker output for variable declarations?
  15. HMTC doesn't recompile when I copy it from my personal computer to clyde.
  16. What is the difference between the Abstract Syntax Tree (AST) and the Intermediate Representation (IR)?

Common Mistakes

  1. Pattern Matching on Lists
  2. Tabs and Carriage Returns

Back to main page

FAQ

  1. Named Fields and Updating Values

    In (standard) Haskell you do not update values as you would in an imperative language. You instead create a new variable and instantiate it to the value of the old variable, with any modifications. Consider the following Java and Haskell programs:

    ----------------------------------------------------------------
    
    class Person {
    
    	public int age;
    	public String name;
    
    	public Person (int a, String n) {
    		age = a;
    		name = n;
    	}
    }
    
    class AgeUpdater {
        
        public static void main(String args[]) {
    	Person henrik = new Person(25, "Henrik");
    	System.out.println(ageDif(henrik));
        }
        
        private static int ageDif(Person p) {
    	Person oldHenrik = new Person(p.age = 29, p.name);
    	int d = oldHenrik.age - p.age;
    	return d;
        }
    }
    
    ----------------------------------------------------------------
    
    module AgeUpdater where
    
    data Person = MkPerson {age :: Int, name :: String}
    
    main     :: IO ()
    main     =  let
                    henrik = MkPerson {age = 25, name = "Henrik"}
                 in
    	        putStrLn (show (ageDif henrik))
    
    ageDif   :: Person -> Int
    ageDif p =  let
                    oldHenrik  =  p {age = 29}
                    d          =  age oldHenrik - age p
                 in
    		d
    
    ----------------------------------------------------------------
    

    Both programs appear the same, but they give different answers. The reason for this is that the Java program destructively updates the value of the "age" field in "henrik", whereas the Haskell program only modifies it for that expression (in the same way that typing "y = x + 1" (in either language) has no effect on the value of x). So in the Java program, "p.age" evaluates to 29, whereas in the Haskell program, "age p" evaluates to 25.

    What actually happens behind the scenes is that

    let oldHenrik = p {age = 29}
    gets translated to
    let oldHenrik = f p
    where
    f Person{name = n} = Person{name = n, age = 29}
    Back to top
  2. Reading Files into GHCi

    You can read the contents of a file into GHCi by typing the following:
    Prelude> x <- readFile "filepath"
    
    x is just an arbitrary variable name, while filepath needs to be the file path of the file you wish to read in. If you are in the same directory as the target file, then filepath is just the name of the file. The contents of the file are read in and saved as a string, which can be accessed under the variable name given (in this case x). For example, if you have a TXL file called test.txl in the current directory, containing the code
    let x = 'a' + '\r' in x + 1
    
    then you could type
    Prelude> test1 <- readFile "test.txl"
    
    to instantiate the value of test1 to the contents of the file. You may now use test1 in future statements, such as
    Prelude> :t test1
    test1 :: String
    Prelude> test1
    "let x = 'a' + '\\r' in x + 1"
    Prelude> head test1
    'l'
    
    For those of you taking the advanced functional programming module, note that readFile is a function mapping a String to a computation in the IO monad that produces a string.
    type FilePath = String
    
    readFile :: FilePath -> IO String
    
    Have a look at the function "main" in the source code and see if you can understand it.

    Back to top

  3. Are late submissions of the coursework accepted?

    The School's standard late submission policy applies to the coursework. Thus a late submission will be accepted, but earns you less marks. Also, in any case, you need to be ready in time for your oral examination slots.

    Back to top

  4. How to write your own EBNF grammar

    1. Take a (recursive) BNF grammar.
    2. Eliminate left-recursion (and if it is the case, left-factor the grammar) as explained in the lectures. Remember, a production is left-recursive when the non-terminal on the left hand side appears first on the right hand side in some of its productions. The typical pattern is
      A → X
        | A Y
      

      where X and Y stand for any sequence of terminals and/or non-terminals that do not contain A. For example

      Commands →  Command
               |  Commands ; Command
      

      which follows the pattern where

      A = Commands
      X = Command
      Y = ; Command
      
      Rewrite
      A → X 
        | A Y
      
      to
      A → X Y*
      
      In our Commands example we would obtain
      Commands → Command (; Commands)*
      
    3. The obtained grammar can be used now to produce a systematic parser implementation as explained in the lectures.

    Back to top

  5. Character Literals in HMTC

    There is quite a bit of confusion about adding character literals to HMTC, so here's some clarifications:

    1. Once a character literal has been read by the scanner it needs to be stored in a character literal token. This character literal token will have an associated Char to store the character literal, in much the same way as the integer literals.
    2. Once scanned, the escape sequences should be treated exactly the same as other characters - there should be no difference between the character 'x' and the carriage return character '\r'. This should also be the case when you modify the parser to accept character literals.
    3. Bear in mind the different representations of escape sequences in the various environments. In the .mt file, there are already newline characters but your text editor interprets them and so displays following code on the next line, rather than actually showing a newline character. Haskell on the other hand does not do this, but instead represents newline characters as '\n'. Now, the initial scanner you are given already throws away any newline characters, considering them white space. What you are asked to extend the scanner for are sequences of characters which your scanner will then interpret as character literals, some of which you will represent as escape characters. Thus, the sequence of four characters '\n' needs to be read by the scanner, and then stored in a character literal token with the escape sequence \n as the associated character. Note that when you read in single quotes or slashes into Haskell, it represents them with its own escape sequences (which happen, somewhat confusingly, to be the same as the ones you're scanning for). So reading in the sequence of four characters ' \ n ' actually gives you this list of characters: ['\'', '\\', 'n', '\''].

      However, Haskell formats lists so as to make them more readable. Thus, rather than displaying 'h' : 'e' : 'l' : 'l' : 'o' : [] it displays "hello". Perhaps unfortunately, one other formatting trick it performs it to display any '\'' characters as just a single quote character (not as an escape sequence). This can be done without creating ambiguity, as its known we're inside a string. Thus, 'i' : 't' : '\'' : 's' : [] is displayed as "it's", not as "it\'s" as you may expect. And this happens in reverse as well. However, the escape sequence for a single quote also works as normal inside a string, meaning there are two ways to represent the same character within a string. Thus, "aaa\'bbb" == "aaa'bbb". Try typing

      head "'twas"
      and
      head "\'twas"
      into Haskell.

      So, now consider the earlier example of ' \ n ' which is represented as the Haskell list ['\'', '\\', 'n', '\'']. When displayed to the screen as a string, this would look like "'\\n'". The single quotes are not displayed as escape sequences, while the backslash still is.

      You may find it easier to test the scanner by writing HMTC programs in their own files, and then reading the files into GHCi (see above), rather than writing them directly into GHCi, as there is a lot of potential to make an error with the escape sequences.

    Back to top
  6. Abstract syntax

    A student once put forth this relevant question:

    Looking at MiniTriangle's Abstract Syntax ... my assumption is that there should be a line reading

      | begin Command end  BeginCommand
    

    Abstract syntax describes only the relevant aspects of syntax. For example, we do not care whether a variable declaration starts with an identifier and is followed by a type-denoter (as in MiniTriangle) or the other way around (as in Java). Abstract syntax only cares about what components make up a language construct, not about their order or the tokens in between, like punctuation tokens, that only play their role during parsing.

    Nonetheless, abstract syntax must be specified in some way. If we choose to write them down on a line we need to pick an order, but such order is just for presentation. For instance, instead of writing the abstract syntax of a MiniTriangle Declaration in grammatical form as in

    Declaration → Identifier Type-denoter
    
    we could have used a tree representation with Declaration at the root node and two sub-nodes Identifier and Type-denoter. The following two trees would be valid representations:
              Declaration                  Declaration
                 /  \                       /      \
                /    \                     /        \
               /      \                   /          \
        Identifier  Type-Denoter    Type-Denoter    Identifier
    
    In the coursework and the book, MiniTriangle's abstract syntax is written as a grammar. In the book the authors have also included the tokens in the specification of the abstract syntax, which is a bit confusing. The reason is that in constructs like Command, where there are two different types of command with Identifier and Expression components, we can't just write
    Command → Identifier Expression
    
    for we cannot distinguish between an assignment or a call command. The tokens aid in the distinction and that's why they are shown. The provided UML diagrams for the abstract syntax trees should clear the confusion.

    Back to top

  7. Command vs Commands

    MiniTriangle's grammar as described in the coursework has the production
    Program → Command
    
    which may strike you in the sense that it seems ridiculous to have a program composed of a single statement, but if you look at the productions for Command, you find the two interesting cases:
    Command  → ...
             |  let Declaration in Command
                ...
             |  begin Commands end
    
    which allow you to write more stuff as part of the program. Hence, if we want to write a short program in MiniTriangle that prints two integers, according to this grammar we cannot write
    putint(1);
    putint(2)
    
    but instead we must write
    begin
       putint(1);
       putint(2);
    end
    
    which makes perhaps more sense.

    We can have more than one command inside a let command by using the same device:

    let
      var m := integer
    in
      begin
        putint(1);
        m := 1;
        putint(m);
      end
    
    The Parser code contains a few comments on this matter.

    Back to top

  8. Relational operators in expressions

    Notice that, although expressions like x>y>4 do not usually make sense (indeed, x>y returns a boolean value and in most languages you cannot compare a boolean to an integer), they are valid syntactically. They should get parsed with the right precedence, e.g. ((x>y)>4).

    It is the Checker, not the Parser, that invalidates such expressions. Notice that expressions like

     5 > 3 == (7 > 2) 
    

    which should be parsed as (5 > 3) == (7 > 2), also have three relational operators but do make sense.

    If we don't allow more than one relational operator in an expression, the only way to handle expressions like the previous one in the parsing phase is to distinguish between "<" and ">" on one side and "==" on the other (modifying the grammar).

    In MiniTriangle, operator "==" cannot be used associatively as in "5 == 3 == 2" (the C language allows this). It can only be used conjunctively as in "5 == 3 and 3 == 2" which has a completely different meaning. Indeed, you would be overloading "=" as an operator that can be applied to booleans ("false == true", yields false) and integers ("7 == 8" yields false) which is not the case in our MiniTriangle language.

    An easier solution is to allow all relational operators to be used associatively in the parsing phase and leave the job of checking their validity to the type-checker (contextual analysis).

    Back to top

  9. The expression on the left of '?' in a conditional expression must be a boolean, but my grammar allows an expression of arbitrary type here. Is my grammar wrong?

    This is as it should be.

    Typing rules (except in extremely trivial cases) are examples of contextual constraints: loosely speaking, one needs to take the context (such as an environment that lists all variables that are in scope at some particular point in a program and their types) in order to decide if the constraint is met or not.

    A context-free grammar cannot express such rules. Hence the name! And if you look at a production in a context free grammar, for example,

        A → aBc
    

    it is easy to see why this is so: the production says that the non-terminal A can be replaced by aBc in any string of grammar symbols, regardless of the context (surrounding symbols) in that string.

    You will also note that the grammar as given also allows ill-typed expressions. For example, the following is a syntactically correct MiniTriangle expression,

        1 + (2 < 3)
    

    but it is not type correct.

    And the reason is that the MiniTriangle grammar makes no attempt to capture any of the typing rules as that isn't possible to do using a context-free grammar for a language like MiniTriangle. MiniTriangle may be small and simple, but it exhibits most of the characteristics of a "real" programming language.

    You might think you could create separate productions for the binary and arithmetic operators, and indeed it very instructive to try. If you do, you will quickly and concretely experience why the formalism of context-free grammar simply isn't powerful enough to express what you need to express.

    Thus to express the typing rules we need something else. And for this, the MiniTriangle uses typing rules expressed in logic.

    Back to top

  10. Do we need to regenerate the Parser code?

    When you need to extended the Happy parser, you need to change the Happy source code Parser.y and re-generate Parser.hs from that. (The build-system should do this for you: just invoke "gmake".)

    This can be generalised: Never edit generated code unless it is absolutely unavoidable! If you find yourself thinking about editing generated code, then think again!

    As always, there are exceptions. Some tools do generate code that is intended to be edited. Usually some kind of "boiler-plate code" generator that aims at saving work by generating tedious, repetitive stub code from some kind of specification, allowing the programmer to then add the interesting missing bits. This isn't a great design in my opinion, as this does not address the need to save work when maintaining code, only to ease writing the code the first time round.

    However, some really sophisticated such tools may even be able to re-generate the boiler plate code and add any user-added bits from the previously generated and edited version back in.

    That's a better design, but complicated. It would seem to be a better idea to embed code fragments in a specification in Yacc/Bison/Happy style, and generate everything from there. A single spec/source file is much simpler and easier to maintain.

    Finally, in some unfortunate circumstances, it simply may not be possible to re-generate code. Such as if the original source is gone, or the tool used to generate code can no longer be run for whatever reason. Then one does what one has to, if starting over from scratch isn't an option.

    Back to top

  11. Can't we treat the conditional expression as a function, just like the other operators?

    After all, in Haskell we could define a working conditional function like this:
    cond          :: Bool -> a -> a -> a
    cond b a1 a2  =  if b then a1 else a2
    

    (Haskell only supports binary operators, though, not trinary ones, hence the name "cond" instead of "?:".)

    The answer is yes, in principle, but:

    • The type of "cond" uses parametric polymorphism. The type checking infrastructure of HMTC is not set up to handle parametric polymorphism. Doing this properly would be quite a bit of work. Moreover, HMTC supports a simple notion of subtyping, even if it isn't developed very far. Mixing parametric polymorphism and subtyping adds another layer of complexity. (Haskell and ML do not have subtypes as such.)
    • "cond" only works as a conditional because Haskell is a non-strict ("lazy") language! MiniTriangle is intended to be strict (because it is an imperative language, which does not blend well with laziness).

      So, if :? indeed was truly handled just like a function in MiniTriangle, it wouldn't work when we try to run programs using it!

      Of course, this point is kind of moot at present, since there isn't a MiniTriangle code generator, and hence we cannot run MiniTriangle programs. But nevertheless, the issue is there.

    To see the latter point, consider:
    f     :: [Int] -> Int
    f xs  =  cond (xs == []) 0 (head xs)
    

    This works as intended in Haskell. If the list "xs" is empty, the expression "head xs" will not be evaluated thanks to the non-strict semantics of Haskell, and thus the result of the conditional expression is 0.

    Note that a conditional expression is not just about selecting one of two values, but also about making sure only one of two expressions gets evaluated.

    In a strict language, on the other hand, all three arguments to "cond" would be evaluated before the function is called. Thus "head xs" will be evaluated regardless of whether the list "xs" is empty or not, and thus, in case "xs" is empty, the result of the conditional expression is not 0, but that the program stops and reports an error.

    Try to implement something like "cond" in a strict language like Java, and you'll see what I mean. It's quite instructive!

    Back to top

  12. For a conditional expression (in the form of c?e1:e2) the type of the overall expression can be determined as the type t1 of the expression e1 is identical to type t2 of expression e2. However, if there is a type error (t1 ≠ t2), what type should I return for the whole conditional expression?

    Strictly speaking: no type! The type system can only be used to infer types for well-typed expressions.

    However, I suppose the question really is: "What should the compiler do in the event that it discovers this particular type error?"

    There are a number of possible answers to that question. The approach that is is used in the MiniTriangle compiler is that the type checker should first report any errors, and then try to "recover" so that further type errors may be found and reported. As opposed to just finding and reporting a single error at a time.

    So, how can we recover? Three obvious alternatives suggest themselves:

    • Make the type of the whole expression t1. After all, that could be the intended type!
    • Make the type of the whole expression t2. After all, that might also be the right type!
    • As we don't know which of t1 or t2 is the intended type (maybe neither!), admit defeat and return the special type "SomeType".

    The question here, then, is what would work best in practise. That's not easy to say for sure. But at least we can say that in the first two cases, there are some source programs for which either guess is going to be the wrong guess. That would likely lead to reporting false type errors: that is, type errors that are a mere consequence of earlier type errors. This could be confusing for a user, and one can thus argue that this should be avoided.

    The type "SomeType" on the other hand, means "the type should be some type (all expressions have a type in a type correct program), but unclear which one". Thus, "SomeType" is compatible with any other type! Therefore, if "SomeType" is returned, the compiler will not report a later error simply because we guessed wrong as to the type of the conditional.

    On the other hand, being this forgiving, might mean that we miss a genuine type error this time round. That's not a big problem, though, as it would be found once the first error has been corrected and the source program is recompiled.

    So, pick the approach you think seems best, and explain why you think so!

    Back to top

  13. In Happy, what does %prec mean?

    %prec is used at the end of some of the productions of the grammar to set the precedence and associativity of those productions.

    If you have a production that explicitly mentions X, like

    | foo X fum
    
    then you don't need the %prec bit. So
    | expression '+' expression
    
    would be fine as it is. However, in the actual production '+' is not mentioned explicitly (because there are often multiple productions, and essentially repeating the production for each one individually would be tedious and bad style). Instead the permissible operators are given by the production for, in this case, opclass_additive. But Happy does not attempt to figure out what operators opclass_additive might stand for in order to figure out what the precedence should be (as it would not work in general). So instead the grammar writer has to explicitly state the desired precedence in a case like this.
    | expression opclass_additive expression %prec '+'
    
    The actual precedences (and associativity) are set in a list of declarations elsewhere in the Happy code. The order of the declarations sets the precedence, and the left, right and nonassoc keywords set the associativity. In the Happy code for the MiniTriangle parser, this looks like:
    %left '||'
    %left '&&'
    %nonassoc '<' '<=' '==' '!=' '>=' '>'
    %left '+' '-'
    %left '*' '/'
    %right '^'
    

    Back to top

  14. What is the meaning of the type checker output for variable declarations?

    If you have some MiniTriangle code staring with
    let var v : Integer in
    
    and you typecheck it with ./hmtc --print-after-checking, the output is
    CmdLet <line 1, column 1>
    
      DeclVar <line 1, column 5>
    
        "v@1" : Ref Integer
    

    What this means is that v is declared to be a variable.

    In an imperative language, a variable is essentially a memory cell into which a piece of data can be written or from which a piece of data can be read. In the case of v, the type of the data is declared to be Integer. Thus, the way MiniTriangle is set up, v is a reference to (a memory cell that can hold an) Integer.

    On the other hand, a literal like 1 has type Integer because it is an integer.

    Thus

    v := 1
    
    makes sense, because it means "write the integer 1 into the cell capable of holding an integer that v refers to".
    v + 1
    

    also makes sense because it means "read whatever integer that is stored in the cell that v refers to and add the integer 1 to it, producing a new integer".

    But

    1 := v
    

    does not make sense because it is not possible to overwrite an integer!

    Similarly, you will also find that there are notions of "Source" and "Sink" in MiniTriangle. These are references that only has, respectively, "read-only" and "write only" capabilities, while a reference has both.

    Back to top

  15. HMTC compiles fine on my personal computer, but it won't recompile on clyde.

    The problem could be that tuck and your personal computer have different versions of Happy and GHC installed.

    If you've already compiled HMTC on your personal computer, then if you copy the entire directory to tuck you also copy the Happy-generated "Parser.hs" file. Even if you "gmake clean", this "Parser.hs" file is not deleted. When you try to recompile, this existing "Parser.hs" file may be used, rather than being regenerated.

    The solution is to delete the "Parser.hs" file before recompiling.

    Back to top

  16. What is the difference between the Abstract Syntax Tree (AST) and the Intermediate Representation (IR)?

    The main difference between the two is that the IR contains type information whereas the AST does not. The purpose of the type checker is to convert an AST into an IR, rejecting ill-typed programs.

    Additionally, identifiers in the AST have been replaced by symbols in the IR. These symbols carry semantic information such as: type, scope level, internal or external, and the value of the symbol if external.

    Generally, MTIR is an abstract syntax tree annotated by semantic information. The module documentation says:

    Module: MTIR
    Purpose: MiniTriangle Internal Representation
    (typechecked AST with semantical annotations)
    
    -- MiniTriangle Internal Representation. The definitions mirror the
    -- AST, but the program should be type correct at this stage and semantical
    -- annotations have been added. In particular, variables are represented
    -- by symbols (with name, type info, and source position) as opposed to just
    -- their names.
    

    Back to top

Common Mistakes

  1. Pattern Matching on Lists

    The most common mistakes with Part I of the coursework last year stemmed from failing pattern match on lists. In particular, while some pattern matched on the head and tail of a list, very few pattern matched to greater depth. In Haskell you can pattern match to arbitrary depth on any datatype (though only on the constructors of that datatype).

    Basic examples of pattern matching on lists:

    head          :: [a] -> a
    head (a : _)  =  a
    
    tail          :: [a] -> [a]
    tail (_ : as) =  as
    
    (++)            :: [a] -> [a] -> [a]
    []       ++ bs  =  bs
    (a : as) ++ bs  =  a : (as ++ bs)
    
    But you can pattern match deeper if you want:
    secondelement              :: [a] -> a
    secondelement (_ : a2 : _) =  a2
    
    You can also pattern match on the structure of the elements within the list:
    iffy                       :: [Bool] -> Bool
    iffy (True  : b1 : _)      =  b1
    iffy (False : _  : b2 : _) =  b2
    
    hasPrefixABC                        :: [Char] -> Bool
    hasPrefixABC ('A' : 'B' : 'C' : cs) =  True
    hasPrefixABC _                      =  False 
    
    There's a guide to pattern matching on the Haskell Wiki that you may find useful.

    Back to top

  2. Tabs and Carriage Returns

    People are often caught out by issues related to how different operating systems (like Windows and Unix) represent text files, and differences in how different tools (in particular editors, like Emacs and Wordpad) choose to interpret certain characters in text files.

    Here is a couple of suggestions to help you get around these issues.

    First of all, as you know, parsing of Haskell programs take layout (indentation) into account (unless you explicitly group things using curly braces and semicolons). If tab characters are used in a Haskell file, it thus become a critical question just how wide (in spaces) a tab stop is supposed to be. The Haskell language standard has a precise definition (to ensure that Haskell programs always are interpreted the same way): a tab stop is 8 spaces wide. This is also the default in many (most?) text editors, like Emacs.

    However, for example Wordpad, which is a popular text editor for Windows users at least, seems to have a different idea about the default: it opts for tab stops being 4 spaces wide.

    To avoid unnecessary grief caused by these seemingly inexplicable parse errors, I recommend that you, when working on the G53CMP labs, go into the Wordpad settings menu and change the width of tabs to 8, and that you also tick the box "treat tabs like spaces".

    The second issue is the difference between Windows and Unix (and hence also Linux, Solaris, Mac OS) when it comes to how end of lines are encoded in text files. The Unix convention is to use a single new line character (NL, ASCII/UNICODE character 10, denoted by the escape sequence "\n" in languages like C, Java, Haskell). The Windows convention is to use a carriage return (CR, ASCII/UNICODE character 13, denoted by the escape sequence "\r" in languages like C, Java, Haskell), followed by NL (see en.wikipedia.org/wiki/Newline).

    This can cause issues with the MiniTriangle scanner if the input to the MiniTriangle compiler are files that were created under Windows: the scanner might complain about "illegal character '\r'".

    There are a number of ways around this. One is to create any test files under Linux. Another is to work exclusively within GHCi and provide the input to the scanner as a Haskell string.

    Yet another is to use a tool that translates between the two different conventions. For example, the program "dos2unix" converts from Windows conventions to Unix conventions. The dos2unix command takes two arguments: the original filename and the filename to save the converted file as.

    dos2unix inputFile outputFile 
    

    However, as you have the source code for the scanner, you can do one better and simply fix the problem once and for all by extending it a little bit. The idea is to essentially make it ignore CR characters. Here is how. The interpretation of the carriage return character is that it returns the printing position to the beginning of the line (column 1), but does not advance the position to the next line. So we should to the same thing when we keep track of the source code position within the scanner. Here's the extra line you need to add to the scanner, in context:

    scan l c ('\n' : s) = scan (l + 1) 1 s
    scan l c ('\r' : s) = scan l 1 s
    scan l c ('\t' : s) = scan l (nextTabStop c) s
    

    Back to top


Last updated 22nd November 2013.

Valid XHTML 1.0 Strict