Parsing Text Using a DSL Grammar

We have a task to parse the following file:

6 pcs apple
12 pcs orange
3 pcs peach  
custname : John Tanazaki
address : 5382 Foo Ave
address2 : 
zip : 380802
city : Kawaguchi-shi
state : SA
country : JAPAN
phone : 
custmail : foo@cool.jp
shipby : priority
invoice : 1544
comment : 

Our goal is to convert this file into JSON format:

{
  "items": [
    {
      "item": "apple",
      "quantity": 6
    },
    {
      "item": "orange",
      "quantity": 12
    },
    {
      "item": "peach",
      "quantity": 3
    }
  ],
  "custname": "John Tanazaki",
  "address": "5382 Foo Ave",
  "address2": "",
  "zip": "380802",
  "city": "Kawaguchi-shi",
  "state": "SA",
  "country": "JAPAN",
  "phone": "",
  "custmail": "foo@cool.jp",
  "shipby": "priority",
  "invoice": "1544",
  "comment": ""
}

The top three lines change depending on what the customer bought, with one line per item type. The first thing that comes to mind is to step through the file line by line with a long list of rules for each line. But this would be a nightmare, as we will see later.

There is a better way. We can treat this text block as a Domain-Specific Language (DSL) and build a grammar to parse it. The grammar tries to reconstruct the logic that originally generated the text block. Usually this logic is unknown and has to be reverse-engineered. In our order example, we know that each of the first few lines encodes a quantity/item pair.

What is this Grammar that we speak of?

We will be using a variant of Backus-Naur Form (BNF) to describe a context-free grammar. It turns out that BNF is used everywhere; GNU Bison, for example, is based on BNF and is used to build many modern compilers.

Steps for parsing context-free grammar

The parsing pipeline is composed of three major steps. The first step is Lexical Analysis: the process of defining the types of lego bricks, so to speak, that we will be using to build the grammar. These are formally referred to as "tokens".

Knolling, Source: https://www.reddit.com/r/knolling/comments/6ivd68/almost_all_of_the_603_pieces_of_the_lego_technic/

The second step is Semantic Parsing: the process of building an abstract syntax tree, or AST. An AST is a representation of what the final lego build is comprised of: at the root is the finished build, and at the leaves are the individual "knolled" lego pieces.

And the last step is Assembly. This is the process of traversing the AST and doing something at each node. In a compiler, each node would be translated into its corresponding operation, e.g. an assembly-language instruction.
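To make the traversal idea concrete, here is a toy sketch (not part of the order parser) that walks a small hand-built AST bottom-up and computes a result at each node, the same way our assembler will later produce JSON:

```python
# Toy illustration of "assembly": walk a tiny hand-built AST bottom-up,
# computing a result at each node (arithmetic here, JSON in the order example).
def visit(node):
    kind, *children = node
    if kind == "num":
        return children[0]                      # leaf: return the literal value
    if kind == "add":
        return sum(visit(c) for c in children)  # inner node: combine child results

ast = ("add", ("num", 2), ("num", 3), ("num", 4))
print(visit(ast))  # -> 9
```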

Lexical Analysis

In our example order, "knolling" the tokens means identifying the fundamental building blocks of the grammar. They are as follows:

Token description                   Token name   Token value
New lines                           newline      [\r\n]+
White space (excluding new lines)   ws           [^\S\r\n]
Value (any text)                    value        .*
Key                                 key          'custname' / 'address2' / 'address' / 'zip' / 'city' / 'state' / 'country' / 'phone' / 'custmail' / 'shipby' / 'comment' / 'invoice' / 'printed'
Key/value separator                 colon        :
Quantity                            quantity     \d+
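Since the terminal token values are plain regular expressions, they can be tried out with Python's stdlib re module before committing them to a grammar. A quick sketch (the names mirror the table above):

```python
import re

# Terminal tokens from the table, as plain regexes
QUANTITY = re.compile(r"\d+")
WS       = re.compile(r"[^\S\r\n]+")  # whitespace, excluding newlines
KEY      = re.compile(r"custname|address2|address|zip|city|state|country"
                      r"|phone|custmail|shipby|comment|invoice|printed")

line = "6 pcs apple\n"
print(QUANTITY.match(line).group())      # -> 6
print(KEY.match("address2 : ").group())  # -> address2
```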

Using these fundamental tokens, we can build more complex tokens that describe the entire text block:

Token description   Token name   Token value
Record              record       key ws colon ws value newline
Item                item         quantity ws "pcs" ws value newline
Items               items        item+
File                file         (newline / items / record)+

Here, / is the ordered-choice (OR) operator and + means one-or-more repetition. The terminal token values themselves are standard regular expressions.
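Ordering matters with /: like regex alternation, the first matching alternative wins. This is why 'address2' is listed before 'address' in the key token. The same pitfall can be demonstrated with stdlib regex alternation:

```python
import re

# With the shorter literal first, matching stops at the prefix and the
# trailing '2' is never consumed; PEG ordered choice behaves the same way.
print(re.match(r"address|address2", "address2").group())   # -> address
print(re.match(r"address2|address", "address2").group())   # -> address2
```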

Parsimonious library for Python (EBNF Parser)

To build the DSL grammar, we can use Parsimonious, a Python library based on Parsing Expression Grammars (PEGs). Quoting the docs:

PEG parsers don't draw a distinction between lexing and parsing; everything is done at once. As a result, there is no lookahead limit, as there is with, for instance, Yacc. And, due to both of these properties, PEG grammars are easier to write: they're basically just a more practical dialect of EBNF. With caching, they take O(grammar size * text length) memory (though I plan to do better), but they run in O(text length) time.

PEGs can describe a superset of LL(k) languages, any deterministic LR(k) language, and many others—including some that aren't context-free (http://www.brynosaurus.com/pub/lang/peg.pdf). They can also deal with what would be ambiguous languages if described in canonical EBNF. They do this by trading the | alternation operator for the / operator, which works the same except that it makes priority explicit: a / b / c first tries matching a. If that fails, it tries b, and, failing that, moves on to c. Thus, ambiguity is resolved by always yielding the first successful recognition.

The syntax for building tokens is based on a simplified Extended Backus-Naur Form (EBNF). The token syntax is detailed in the Parsimonious README.

Abstract Syntax Tree (AST)

The token relationship can be described as an AST, starting with file as the root node:

Order Abstract-Syntax Tree (AST)

All nodes break down into the fundamental building blocks. I've added a newline node to file since there can be empty lines at the end of the file; this node will return None, as we will see in the next section. The tree can be expressed in Parsimonious syntax as follows:

from parsimonious.grammar import Grammar

grammar = Grammar(
    r"""
    file = (newline / items / record) +
    items = item +
    item = quantity ws "pcs" ws value newline
    quantity = ~"\d+"
    record = key ws colon ws value newline
    colon = ~":"
    # ordered choice: 'address2' must come before 'address' so the longer literal is tried first
    key = 'custname' / 'address2' / 'address' / 'zip' / 'city' / 'state' / 'country' / 'phone' / 'custmail' / 'shipby' / 'comment' / 'invoice' / 'printed'
    value = ~".*"i
    ws = ~"[^\S\r\n]*"
    newline = ~"[\r\n]+"
    """
)

Assembly

We can use the parsimonious library to "assemble" the nodes and create a JSON-serializable object. We first create a class called Order that inherits from parsimonious's NodeVisitor class. Each node type in the tree gets a method with a visit_ prefix; for example, the item node is handled by a visit_item method.
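Under the hood, this dispatch is just a method lookup by node name. A minimal sketch of the same pattern (not the actual NodeVisitor implementation):

```python
# Minimal sketch of visitor dispatch: look up visit_<kind>, else fall back
# to a generic handler -- the same pattern NodeVisitor uses by node name.
class MiniVisitor:
    def visit(self, node):
        kind = node[0]
        method = getattr(self, f"visit_{kind}", self.generic_visit)
        return method(node, node[1:])

    def generic_visit(self, node, visited_children):
        return visited_children or node

class Demo(MiniVisitor):
    def visit_item(self, node, visited_children):
        product, quantity = visited_children
        return product, int(quantity)

print(Demo().visit(("item", "apple", "6")))  # -> ('apple', 6)
```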

The full Order class is defined as follows:

from parsimonious import NodeVisitor

class Order(NodeVisitor):

    def visit_file(self, node, visited_children):
        """ Assembles items and records for a file """
        result = {'items': []}
        for i in visited_children:
            if i[0] is None:
                continue
            try:
                result.update(dict(i))
            except Exception as e:
                print(i)
                print(e)
                pass

        return result

    def visit_newline(self, node, visited_children):
        """ Newline node does nothing and returns None """
        return None

    def visit_items(self, node, visited_children):
        """ Constructs a dictionary for each item from items """
        return 'items', [{'item': p, 'quantity': q} for p, q in visited_children]

    def visit_item(self, node, visited_children):
        """ Creates a tuple of item, quantity """
        quantity, _, _, _, product, _ = visited_children
        print(f'ITEM: {product.text} : {quantity.text}')
        return product.text, int(quantity.text)

    def visit_record(self, node, visited_children):
        """ Assembles a record (tuple) of key, values """
        key, _, _, _, value, *_ = node.children
        print(f'RECORD: {key.text} : {value.text}')
        return key.text, value.text

    def visit_value(self, node, visited_children):
        """ Returns a value as is """
        return node

    def generic_visit(self, node, visited_children):
        """ Returns all """
        return visited_children or node

We can now convert the parse tree to JSON:

import json

# read the order text shown at the top of the post (file name assumed)
with open('order.txt') as f:
    data = f.read()

order = Order()
tree = grammar.match(data)
output = order.visit(tree)
output_json = json.dumps(output, indent=2)
print(output_json)

Which prints our JSON order object:

{
  "items": [
    {
      "item": "apple",
      "quantity": 6
    },
    {
      "item": "orange",
      "quantity": 12
    },
    {
      "item": "peach",
      "quantity": 3
    }
  ],
  "custname": "John Tanazaki",
  "address": "5382 Foo Ave",
  "address2": "",
  "zip": "380802",
  "city": "Kawaguchi-shi",
  "state": "SA",
  "country": "JAPAN",
  "phone": "",
  "custmail": "foo@cool.jp",
  "shipby": "priority",
  "invoice": "1544",
  "comment": ""
}

Why not rule-based line-by-line parsing?

One must wonder: why write a grammar when we could simply put some rules in a loop that traverses the file line by line?

Let's imagine our order format has evolved over time and now includes a category field.

fruits:
    6 pcs apple
    12 pcs orange
    3 pcs peach
vegetables:
    3 pcs onions
    5 pcs carrots
custname : John Tanazaki
address : 5382 Foo Ave
address2 : 
zip : 380802
city : Kawaguchi-shi
state : SA
country : JAPAN
phone : 
custmail : foo@cool.jp
shipby : priority
invoice : 1544
comment : 

which should parse as:

{
  "items": {
    "fruits": [
      {
        "item": "apple",
        "quantity": 6
      },
      {
        "item": "orange",
        "quantity": 12
      },
      {
        "item": "peach",
        "quantity": 3
      }
    ],
    "vegetables": [
      {
        "item": "onions",
        "quantity": 3
      },
      {
        "item": "carrots",
        "quantity": 5
      }
    ]
  },
  "custname": "John Tanazaki",
  "address": "5382 Foo Ave",
  "address2": "",
  "zip": "380802",
  "city": "Kawaguchi-shi",
  "state": "SA",
  "country": "JAPAN",
  "phone": "",
  "custmail": "foo@cool.jp",
  "shipby": "priority",
  "invoice": "1544",
  "comment": ""
}

It is easy to extend the original grammar (and with it the AST) to handle categories:

grammar = Grammar(
    r"""
    file = (newline / categories / items / record) +
    categories = category +
    category = key_cat ws colon ws newline items
    items = ws item +
    item = ws quantity ws "pcs" ws value newline
    quantity = ~"\d+"
    record = key ws colon ws value newline
    colon = ~":"
    key_cat = 'fruits' / 'vegetables'
    key = 'custname' / 'address2' / 'address' / 'zip' / 'city' / 'state' / 'country' / 'phone' / 'custmail' / 'shipby' / 'comment' / 'invoice' / 'printed'
    value = ~".*"i
    ws = ~"[^\S\r\n]*"
    newline = ~"[\r\n]+"
    """
)

And modify the Order class as:

class Order(NodeVisitor):

    def visit_file(self, node, visited_children):
        """ Assembles items and records for a file """
        """ Returns file """
        result = {}
        for i in visited_children:
            if i[0] is None:
                continue
            try:
                result.update(dict(i))
            except Exception as e:
                print(i)
                print(e)
                pass

        return result


    def visit_newline(self, node, visited_children):
        """ Newline node does nothing and returns None """
        return None


    def visit_categories(self, node, visited_children):
        """ Creates a tuple of items and dictionary of categories """
        return 'items', dict(visited_children)


    def visit_category(self, node, visited_children):
    """ Creates a tuple of category and list of items """
        key, _, _, _, _, items = visited_children
        print(f'CATEGORY: {key[0].text} : {items[1]}')
        return key[0].text, items[1]


    def visit_items(self, node, visited_children):
        """ Constructs a dictionary for each item from items """
        return 'items', [{'item': p, 'quantity': q} for p, q in visited_children[1]]


    def visit_item(self, node, visited_children):
        """ Creates a tuple of item, quantity """
        _, quantity, _, _, _, product, _ = visited_children
        print(f'ITEM: {product.text} : {quantity.text}')
        return product.text, int(quantity.text)


    def visit_record(self, node, visited_children):
        """ Assembles a record (tuple) of key, values """
        key, _, _, _, value, *_ = node.children
        print(f'RECORD: {key.text} : {value.text}')
        return key.text, value.text


    def visit_value(self, node, visited_children):
        """ Returns a value as is """
        return node


    def generic_visit(self, node, visited_children):
        """ Returns all """
        return visited_children or node

It's possible to parse such files line by line, but the result would be laborious, error-prone, and incredibly fragile to maintain and debug. Lexical analysis and semantic parsing form a very powerful tool, one to keep in your pocket for when you need to parse weirdly formatted log files, text blobs, or even build your own mock compiler for learning purposes.

The end.
