Koodal
Kood Scribbles

Kood Scribbles

Let’s translate programming languages like Google Translate - part 1

Understanding the internals of a compiler

Koodal's photo
Koodal

Published on Aug 17, 2020

4 min read

Subscribe to my newsletter and never miss my upcoming articles

Background:

When I first moved to a non-English speaking country I realized the importance of google translate. I wondered if there was a tool similar to Google Translate for translating between programming languages it would be cool, so I decided to build one.

Parallelly I was getting started with machine learning and I found it difficult with all the python examples out there because I was comfortable with C-style languages that use curly braces like Javascript and Java but the python syntax was initially difficult to comprehend. So I decided to create a translator that converts Python code to Javascript.

To translate between programming languages we need to create a source to source compiler aka a transpiler.

What is a transpiler?

According to Wikipedia:

Transpiler is a type of translator that takes the source code of a program written in a programming language as its input and produces an equivalent source code in the same or a different programming language.

Babel and Typescript compiler and are commonly used transpilers in the javascript community. Before creating our own Google Translate, let's understand a transpiler from the inside of it.

Transpiler from the inside

Let's learn how a transpiler works by creating a simple transpiler that compiles URLs. Our URL compiler will take an URL and transform it into a different URL based on a provided transformer.

image.png

The common set of components in a transpiler are:

  • Lexer / Tokenizer
  • Parser
  • Transformer
  • Code Generator

Lexer / Tokenizer:

Lexer is a program that takes a stream of characters, breaks it into tokens based on some rule.

giphy.webp

Our Url lexer will categorize chars in URL string into two items

  1. Delimiter - Special chars in a URL such as ":", "/", "?", "=", "#"
  2. Identifier - All other chars apart from delimiters

ezgif.com-gif-maker.gif

function tokenizer(input) {
  const delimiters = [":", "/", "?", "=", "#"];

  if (delimiters.indexOf(char) > -1) {
    tokens.push({
      type: "delimiter",
      value: char
    });
  } else {
    while (delimiters.indexOf(char) === -1) {
      identifier += char;
    }
    tokens.push({
      type: "identifier",
      value: identifier
    });
  }
  return tokens;
}

Parser

ezgif.com-gif-maker.webp

The output of the lexer is fed into the parser to create a hierarchical structure. The parser is a program that takes input data (tokens) and builds a data structure – often some kind of parse tree, abstract syntax tree, or other hierarchical structure.

Our URL parser will try to convert the tokens from the lexer into a tree-like structure so it can be easily traversed while modifying the URL.

ezgif.com-gif-maker (1).gif

function parser(input) {
  const root = {
    type: "url",
    body: [],
  };
  let tokens = [...input];
  let token;

  while (tokens.length) {
    /* Traversing the tokens */
    token = tokens.shift();
    if (token.type === "identifier") {
      if (isColon(tokens[0]) && isSlash(tokens[1]) && isSlash(tokens[2])) {
        root.body.push({
          type: "Protocol",
          value: token.value,
        });
      }
    }
    /* Check for port */
    if (isColon(token)) {
      if (Number.isInteger(parseInt(tokens[0].value, 10))) {
        root.body.push({
          type: "Port",
          value: parseInt(tokens.shift().value, 10),
        });
      }
    }
  /* Other checks ...*/

  }
  return root;
}

Transformer

giphy.webp

A transformer takes the syntax tree generated by the parser and decorates it based on a set of provided rules or using a visitor.

Our Url transformer receives a visitor object and modifies the tree created by the parser by traversing each node in the tree and applying the rules provided in the visitor object.

ezgif.com-gif-maker (2).gif

function transformer(ast, visitor) {
    ast.body = ast.body.map(node => {
      const visitorFunc = visitor[node.type];
      if (typeof visitorFunc === 'function') {
        return visitorFunc(node);
      } else { return node; }
    });
    return ast;
}

In layman's terms, the visitor is a simple JSON object that contains functions for each node type that needs to be modified in the AST. When we traverse the AST and find a visitor function matching the current node type, we trigger the visitor function in the scope of the current node.

Below are some examples of visitor objects

  • Modify a lang query parameter node from value EN to NL.
const visitor = {
  QueryParamater(node) {
    if ((node.body.name === "lang") && (node.body.value === "EN")) {
      node.body.value = "NL";
    }
    return node;
  }
};
  • Modify the host from video.google.in to youtube.com
const anotherVisitor = {
  Host(node) {
    node.body = {
      domain: "youtube",
      subDomain: "",
      tld: "com",
    };
    return node;
  },
};

Code Generator

giphy1.webp Code Generator is the final piece of our transpiler puzzle. The code generator takes the modified AST and converts it back to text.

Our URL generator traverses each node of the modified AST and converts it back to an URL.

function generator(ast) {
  if (ast.type === "url") {
    let modifiedUrl = "";
    ast.body.forEach((node) => {
      switch (node.type) {
        case "Protocol":
          modifiedUrl += node.value + COLON + SLASH + SLASH;
          break;
        case "Port":
          modifiedUrl += COLON + node.value;
          break;
        /** Other cases */
      }
    });
    return modifiedUrl;
  }
}

The working principle of most compilers is similar to our url-compiler . Now that we have understood how a compiler works we can use this knowledge to create a compiler to translate programming languages in the next part of this blog. The source code for the url-compiler is available in this Github repo.

Hope you enjoyed reading this. Ciao for now :)

 
Share this