Introduction to Grok (original) (raw)

Introduction to the Grok Language

Ric Holt,� 5 May 2002

Abstract.� This note introduces Grok, a notation for manipulating binary relations.� Grok has an interpreter, which can be considered to be a relational calculator.� This interpreter has been used extensively for analyzing factbases produced by parsers that extract information from source programs.�

Background and Overview.� The initial version of Grok was created by the author in 1995.� It has evolved to become a language for manipulating factbases.� Grok operates at the level of a relational database, in that operators generally apply across entire relations, and not just to single entities.� The Grok interpreter has been optimized to handle large factbases (up to several hundred thousands of facts).� It keeps all of its data structures in memory.� It is written in the Turing language.�

The term grok means to understand, which is appropriate in that the Grok language is used to help understand large-scale software.� This term was coined in the book Stranger in a Strange Land, by Robert A. Heinlein.

This report consists of five parts.� Part 1 introduces Grok�s set and relation operators using an example involving Garfield the cat.� These operators are summarized in Part 2.

Part 3 shows how Grok can be used to interactively query factbases.� It uses an example to introduce schemas (also called schemes) that characterize the form of factbases.� The reader may want to consult the related report on TA (the Tuple Attribute languages) when reading this part.� Part 4 shows how scripts written in Grok can be used to query and transform factbases.� Finally, Part 5 summarizes the features of Grok.

Table of Contents

Part 1. Introduction to Grok�s Operators

Part 2. Summary of Grok�s Operators

Part 3. Grok Queries, Factbases and Schemas

Part 4. Writing Grok Scripts

Part 5. Summary of Grok Features

Part 1.

Introduction to Grok�s Operators

We will introduce Grok�s set and relation operators using an example in which there are two cats (Garfield and Fluffy), two mice (_Mickey_and Nancy) and two pieces of cheese (Roquefort and Swiss).� We will begin by defining sets for these cats, mice and cheese.

Set constructors.� In the Grok language we write

cat:= {�_Garfield_�, �_Fluffy_�}

to construct the set variable named cat containing the strings Garfield and Fluffy.� Similarly, we can create set variables _mouse_and cheese:

mouse:= {�_Mickey_�, �_Nancy_�}

cheese:= {�_Roquefort_�, �_Swiss_�}

Set union, intersection and subtraction.� We can combine sets.� For example, the Grok statement

animal:= cat + mouse

creates the set called _animal_which is the union of the cat and mouse sets.� As a result, animal contains the four strings Garfield, Fluffy, Mickey and Nancy.� Similarly, we can take the union of the _mouse_and cheese sets to create the set of items that are food:

food:= mouse + cheese

As a result, the _food_set will contain Mickey, Nancy, Swiss and Roquefort.

Intersecting animals with food, we can find the set of animals that are food:

animalsWhichAreFood:= animals ^ food

These animals are _Mickey_and Nancy.� We can use set subtraction to find those animals, which are not food:

����������� animalsWhichAreNotFood := _animals_� food

These animals are _Garfield_and Fluffy.

Set cardinality.� We can ask how many items are food:

����������� howManyThingsAreFood :=� # food

This creates a new scalar (string) variable called howManyThingsAreFood, which is set to 4.� This statement uses the cardinality operator # to return the size of the set

����������� {�Mickey�, �Nancy�, �Swiss�, �Roquefort�}.�

Command line statements.� When Grok is used as a command line interpreter, you can directly enter statements, such as the example Grok statements we have given so far.�You can type an expression or value, and the Grok interpreter will evaluate it and print it.� For example, if you have already entered the above statements and you type

����������� mouse

then Grok will print back to you

����������� Mickey

����������� Nancy

Sets are not ordered. �Since ordering is immaterial in sets, we do not know if Grok will output Mickey before or after Nancy.

Script language. �Grok is also a script language, so you can write a Grok program, consisting of Grok statements, to be executed by the Grok interpreter.

Set comparisons.� We can compare sets, for example:

����������� mouse <= food

This determines whether_mouse_ is a subset of _food. �_�Both elements of mouse (_Mickey_and Nancy), are elements of food, so the Grok interpreter will output true.

The Grok set comparison operators are given in Figure 1.1.

== equal, which can also be written as a single equal sign
~= not equal
<= subset
>= superset
< strict subset
> strict superset

**Figure 1.1.**� Grok Comparison Operators

Summary of set operators.� This section has illustrated how sets are constructed using the notation {_item1, item2,_�}, how these sets can be combined using the union (+), intersection (^), and subtraction (-) operators, and how they can be compared, using <=, =, etc. As well, we introduced the cardinality operator (#).�

Relations in Grok

We will now introduce Grok�s relations and relational operators.� In Grok, a relation is a set of ordered pairs.� For example, a relation might consist of the two pairs (Garfield, Fluffy) and (Mickey, Nancy).� This relation might be the _love_relation, signifying that Garfield loves Fluffy and� Mickey loves Nancy.�

Cross product. �We can use the cross product operator X (capital letter X) to create relations, for example:

����������� chase := cat X mouse

This can be considered to mean that cats chase mice.� As defined by this statement, the chase relation consists of these four pairs

����������� Garfield��������� Mickey

����������� Garfield��������� Nancy

����������� Fluffy������������� Mickey

����������� Fluffy������������� Nancy

In other words, both cats chase both mice.� In general, the cross product of sets S and T produces every pair (x, y) such that x is in S and y is in T.

We can create the eat relation this way

����������� eat := _chase_� +� _mouse_X cheese

This creates the _eat_relation to be the same as �chase, but with the addition (union) of the cross product of mice and cheese.� As a result, the eat relation will contain these eight pairs:

����������� Garfield��������� Mickey

����������� Garfield��������� Nancy

����������� Fluffy������������� Mickey

����������� Fluffy������������� Nancy

����������� Mickey����������� Swiss

����������� Mickey����������� Roquefort

����������� Nancy������������� Swiss

����������� Nancy������������� Roquefort

We can interpret this to mean that cats eat mice and mice eat cheese.

Operator precedence.� Grok uses the traditional precedence of operators from mathematics, namely, additions (as well unions and subtractions) are done after multiplications (as well as intersections and cross products).� For example, in the statement eat := _chase_� +� mouse X cheese, the cross product X is done before union (+).

Relation cardinality.� We can use the cardinality operator # to ask how many pairs are in a relation:

����������� howManyPairs := # chase

This will create the string variable howManyPairs and will assign it the value 4.�

RSF.� We sometimes represent the facts of a collected set of relations as triples of the form (R, x, y), where relation R contains the pair (x, y).� For example, the� _chase_and eat relations, taken together, can be represented this way:

chase�������������� Garfield��������������� Mickey

chase�������������� Garfield��������������� Nancy

chase�������������� Fluffy������������������� Mickey

chase�������������� Fluffy������������������� Nancy

eat������������������ Garfield��������������� Mickey

eat������������������ Garfield��������������� Nancy

eat������������������ Fluffy������������������� Mickey

eat������������������ Fluffy������������������� Nancy

eat������������������ Mickey����������������� Swiss

eat������������������ Mickey����������������� Roquefort

eat������������������ Nancy������������������� Swiss

eat������������������ Nancy������������������� Roquefort

This list notation is called RSF (Rigi Standard Format), because it was developed for use in the Rigi reverse engineering tool.� We sometimes call a triple a fact, and we sometimes call these lists factbases.� Ordering is immaterial in relations, so we can change the ordering of the lines in this list without changing the meaning of the list.�

Set projectors.� We can use the set projector operator, written as a dot,�to determine how a set is related to another set.� For example, we can determine what Mickey eats, by asking the question: How is the set {�_Mickey_�} related, via _eat,_�to another set:

����������� mickeyEatsWhat :=� {�_Mickey_�} . eat

This statement assigns _mickeyEatsWhat_the set containing Swiss and Roquefort. Similarly, we can ask what eats Mickey:

����������� whatEatsMickey := eat . {�_Mickey_�}

This statement assigns _whatEatsMickey_the set containing Garfield and Fluffy.� In the general case, when s is a set and R is a relation, the form� s . R produces the set� t such that if _s_contains x and R contains (x, y) then t contains y.� The form R . s produces the set _t_such that if R contains (x, y) and s contains _y_then t contains x.

Domains and ranges.� The dom (domain) of a relation is the set of all beginnings (left ends) of pairs in a relation.� For example

����������� eater := dom eat

This statement assigns _eater_the set of things that eat.� This set consists of Garfield, Fluffy, Mickey and Nancy.� The statement

����������� food := rng eat

assigns food �the set of things that are eaten.� This set consists of _Mickey, Nancy, Swiss_and Roquefort.

The _sink_of a relation R is the set of items that are in _R_�s range but not in its domain.� We can compute the sink for the eat relation as follows:

����������� bottomOfFoodChain := rng eat � dom eat

This assigns� bottomOfFoodChain the set containing _Swiss_and Roquefort.� The _source_of a relation R is the set of items that are in the relation�s domain, but not in its range, as in:

����������� topOfFoodChain :=� dom eat � rng eat

This assigns _topOfFoodChain_the set containing Garfield and Fluffy.

Relation union, intersection and subtraction.� There are a number of Grok operators that combine relations.� For example, we can intersect the _eat_and chase relations to tell us about both eating and chasing:

����������� bothEatAndChase :=� eat ^ chase

This statement assigns _bothEatAndChase_�this set of pairs:

����������� Garfield��������� Mickey

����������� Garfield��������� Nancy

����������� Fluffy������������� Mickey

����������� Fluffy������������� Nancy

This means that _Garfield_both chases and eats Mickey, that Garfield both eats and chases Nancy, etc.� Relation intersection (^), returns the sets of pairs that are common to two relations.� Analogously, relation union (+), returns all pairs that exist in either of the two relations.

Subtraction (-) is another relation operator, for example,

����������� eatButNotChase := eatchase

This assigns _eatButNotChase_those pairs in eat that are not in chase.� As a result, eatButNotChase records that fact that mice eat but do not chase cheese.

Similarly, we can determine the relation for chasing but not eating:

����������� chaseButNotEat := chaseeat

It turns out that this is the empty relation, because in our example, all chased things (the mice) are eaten (by the cats).�

Empty relation.� We could as well have written

����������� chaseButNotEat �:= EMPTYREL

EMPTYREL is Grok notation for the relation containing no pairs.�Another way to accomplish the same thing is to write the cross product of two empty sets:

����������� ChaseButNotEat �:=� { }� X�{ }

Relation comparisons.� It turns out that the relation_bothEatAndChase_ and chase are identical, which we could check using comparison of relations, by writing:

����������� bothEatAndChase == chase

This will return the value true.� We can compare relations using the comparison operators that apply to sets, namely, ==, ~=, >=, <=. > and <.� Each relation is actually a set of pairs, and these operators have the same meanings as they do for proper sets.

Relation composition and transitive closure.� We will now introduce the composition operator (o).� Relation R1 composed with relation R2 produces a new relation by a join in which each result pair is formed by attaching the right end of a pair in R1 to the left end of a pair in R2. Consider this example,

����������� secondOrderEat :=� _eat_�o� eat

As a result, _secondOrderEat_will contain pairs such as (Garfield, Swiss) because _Garfield_eats something which eats Swiss.�In general, if R1 contains (x, y) and R2 contains (y, z), then the result of R1 o R2 will contain (x, z)

The transitive closure of a relation R composes R with itself repeatedly, for example:

����������� anyOrderEat := eat +

We can think of the result as �eats at any level of indirection�.�The relation anyOrderEat will contain the pair that are in eat, such as (Garfield, Mickey) and (Mickey, Swiss), as well as second order edges such as (Garfield, Swiss).�In this example, there are no higher level indirections of eating, due to the simple nature of our eat relation.

We can also compute the reflexive transitive closure of a relation R, written as R*, which is the same as transitive closure, with the addition of self loops, (x, x), on all items in the factbase.�

Set id (identity) operator.� A set of self loops, on a particular set of nodes, can be computed this way:

����������� isMouse := id (mouse)

This assigns _isMouse_�this set of edges:

����������� Mickey����������� Mickey

����������� Nancy������������� Nancy

In general, if set _s_contains x, then id s contains (x, x).

The _id_�(identity) operator, and its self loops, are handy for restricting a relation to start or to end with a given set of items.� For example, we can select the pairs in the eat relation that are eaten by mice as follows:

����������� eatByMouse:= id (mouse)� o� eat

The resulting _eatByMouse_relation contains only the pairs corresponding to eating by mice.� This composition (o) is computed by starting with the self loops in id (mouse).�This guarantees that each result pair starts with a mouse.� These self loops are �joined� with pairs from eat.� The resulting pairs corresponding to eating by mice.

The identity of all nodes in the current Grok factbase is written as ID.

Relation inverse.� Another relation operation is the inverse (or reverse) operator, for example:

����������� chasedBy := inv chase

For each pair (x, y) in chase, this produces the pair (y, x) in chasedBy.� So if Garfield chases Mickey then Mickey is chasedBy Garfield.

Part 2.

Summary of Grok�s Operators

This section summarizes the Grok operators that were introduced using the example involving Garfield et al.

Union, intersection, subtraction, comparison and cardinality. Using the example of the cats, mice and cheese, we have introduced the basic set and relation operators of Grok.� These include the following five operations given in Figure 2.1.

Operator Form
Union x + y
Intersection x ^ y
Subtraction x - y
Comparison x = y, x <= y, �
Cardinality # x

**Figure 2.1.**� Operators used with sets and with relations

The two operands (_x_and y in above examples) of union, intersection, subtraction and comparison must both be sets or else both be relations.

Set and relation constructors.� Sets can be created using a set constructor, for example:

����������� { x, y, z }

The empty set can be written as

����������� { }

or as

����������� EMPTYSET

Relations can be created using cross products, for example:

����������� { a, b }� X� { x, y, z }

The result of the cross produce of sets s and t,s_X t, contains all pairs (x, y) such that x is in s and_y is in t.

A collection of relations can be represented in RSF format, i.e., as a list of triples.� RSF format is used for files, but cannot be used in a Grok command.

The empty relation can be written as

����������� EMPTYREL

or as

����������������� {} X {}

Domain and range.� The domain (set of initial nodes) and range (set of final nodes) of relation R are written as:

����������� dom R

����������� rng R

Set projectors.� A set s can be combined with relation R in a set projector using the form

����������� s . R����������������

or

����������� R . s

The form� s . R produces the set� t such that if s contains _x_and R contains (x, y) then t contains y.� The form R . s produces the set _t_such that if R contains (x, y) and s contains _y_then t contains x.

**Identity relations.**� The identity relation on set s is written as:

����������� id s

If_s_ contains node x then id s contains the pair (x, x). The identity relation over all nodes in relations in Grok�s current factbase is written as:

����������� ID

Relation inverse.� The inverse (or reverse) of relation R is written as
����������������� inv R

If_R_ contains pair (x, y) then its inverse contains (y, x).

Relation composition.� The composition of relations R1 and R2 is written as:

����������� R1 o R2

If_R1_ contains the pair (x, y) and R2 contains the pair (y, z), then their composition is a relation that contains the pair (x, z).� (Composition is much like a �join� in a conventional relational database.)

Transitive closure. The (non-reflexive) transitive closure of relation _R_is written as:

����������� R +

The transitive closure consists the union of R composed with itself� any number of times:

����������� R +�� =�� _R_�� +��(R o R)� +� (R o R o R)� +� �

The symmetric transitive closure is written as

����������� R *

The value of R* is the same as _R_+ with the addition of self loops (ID):

����������� R *�� =�� ID�� +�� R+

�Part 3.

Grok Queries, Factbases and Schemas

This section shows how the Grok interpreter can be used as an interactive query engine.� Before doing this, we will introduce an example C program and will show how facts about that program can be represented as a factbase.� We will then give example Grok queries that probe these facts.

The EG Example: A Very Simple C Program

We will use a very simple C program, called EG, as an example. Our C program consists of three files: main.c, putint.c and putint.h. The first, main.c, sets a global variable to 5 and calls the putint function with a parameter of 10. The _putint.h_file declares a global variable to be an integer (as external) and gives the header of the putint function (as external). The putint.c program gives the body of the putint function, which computes the product of its parameter and the global variable and prints the result. These files are listed in Figure 5.1.

File main.c:

����� #include "putint.h"

�����

������void main(void){

��������� global_var=5;

��������� putint(10);

����� }

File putint.h:

����� #ifndef PUTINT_H

����� #define PUTINT_H

�����

������extern int global_var;

����� extern int putint(int param);

�����

������#endif

File putint.c:

����� #include "putint.h"

�����

������int global_var;

�����

������int putint(int param) {

��������� int local_var;

��������� local_var=global_var*param;

�����

����������printf("The number is:%d\n",local_var);

����� }

Figure 5.1. EG: A Very Simple C Program

Facts Extracted from EG

The PBS project, at the Universities of Toronto and Waterloo, developed a special purpose version of the GCC compiler front end called CFX (C Fact Extractor), which translates C programs into graphs.�In the case of the EG program the translation produces the graph as the stream of triples shown in Figure 5.2. (To be technically correct, we should say that CFX produces tables, which are translated into the stream by a program called fbgen.� To keep things simple, we will consider that CFX directly produces the stream.) CFX stores this output stream as RSF triples in a file called factbase.rsf.�

 funcdef main.c main

defloc main main.c:3

include main.c putint.h

linkcall main putint

linkref main global_var

macrodef Initialization GCC_NEW_VARARGS

macrodef Initialization GNUC_MINOR

macrodef Initialization GNUC

macrodef Initialization __sparc

macrodef Initialization sparc

macrodef Initialization __sun

macrodef Initialization sun

macrodef Initialization __unix

macrodef Initialization unix

macrodef Initialization sparc

macrodef Initialization sun

macrodef Initialization unix

funcdcl putint.h putint

dclloc putint putint.h:5

vardcl putint.h global_var

vardclloc global_var putint.h:4

macrodef putint.h PUTINT_H

funcdef putint.c putint

defloc putint putint.c:4

vardef putint.c global_var

vardefloc global_var putint.c:3

include putint.c putint.h

include putint.c putint.h

librarycall putint printf

Figure 5.2. The EG Program, as Translated by CFX to RSF Triples

As a reader, you do not need to concern yourself with the details of this stream.� Our purpose in giving this stream is to illustrate the kinds of streams that Grok deals with.� In this stream, the triple

�� funcdcl F p

means file F contains a function declaration (function header) for p.� The triple

���� funcdef F p

means file F contains a function definition (function body) for _p._�

 CFX Schema

 Figure 5.3 gives the list of the relations produced by CFX, in the form of a TA schema, with brief explanations of each relation.� (Aside: in the last part of the schema, the string file:lineNum has been placed in quotes.� This is done because the Grok interpreter puts quotes around strings containing colons.)

SCHEME������ TUPLE���� :

funcdef����� file����� function������� // function is defined in file

funcdcl����� file����� function������� // function declared in file

funcdcl����� file����� libraryfunction // libraryfunction declared in file

vardef������ file����� variable������� // var defined in file

vardcl������ file����� variable������� // var declared in file

macrodef���� file�����macro��������� �// macro defined in file

typedef����� file����� type����������� // type defined in file

structdef��� file�����struct��������� // struct defined in file

uniondef���� file�����union���������� // union defined in file

enumdef����� file����� enum���������� �// enum defined in file

include����� file����� file����������� // file 1 includes file 2

sourcecall�� function�function������� // function 1 calls function 2

�������������������������������������� // ( function 2 definition is

�������������������������� ������������//�� in same compilation unit

�������������������������������������� //���������� OR

�������������������������������������� //�� is included in same compilation

�������������������������������������� //�� unit as function 1 )

sourceref��� function�variable������� // function references variable

�������������������������������������� //� ( variable definition is

�������������������������������������� //��� in same compilation unit

�������������������������������������� //���������� OR

������� �������������������������������//��� is included in same compilation

�������������������������������������� //��� unit as variable )

linkcall���� function�function������� // function 1 calls function 2

�������������������������������������� //� ( function 2's definition

�������������������������������������� //��� must be linked to )

linkref����� function� variable������� // function references variable

�������������������������������������� //� ( variable definition must

��������������������������������� �����//���be linked to )

usemacro���� file�����macro���������� // file uses macro

usetype����� file����� type����������� // file uses type

usestruct��� file�����struct��������� // file uses struct

useunion���� file�����union���������� // file uses union

useenum����� file����� enum����������� // file uses enum

librarycall� function�libraryfunction // function calls libraryfunction

�������������������������������������� //� ( a libraryfunction is a func

�������������������������������������� //��� whose definition does not

�������������������������������������� //��� exist in the analysed code )

// Attribute tuples extracted by CFX

defloc������ function� �file:lineNum�������� // func definition location

dclloc������ function� �file:lineNum�������� // func declaration location

deflocend��� function��file:lineNum�������� // func definition end location

vardefloc��� variable��file:lineNum�������� // var definition location

vardclloc��� variable��file:lineNum�������� // var declaration location

macrodefloc� macro�����file:lineNum�������� // macro definition location

typedefloc�� type������file:lineNum�������� // type definition location

enumdefloc�� enum������file:lineNum�������� // enum definition location

structdefloc struct��� �file:lineNum�������� // struct definition location

uniondefloc� union�����file:lineNum�������� // union definition location

enumdeflocend�� enum������file:lineNum����� // enum definition end location

structdeflocend struct��� �file:lineNum����� // struct definition end location

uniondeflocend� union�����file:lineNum����� // union definition end location

Figure 5.3. The CFX Schema (Also Called the PBS Low Level Schema)

 The notation for the schema given in Figure 5.3 has, almost, the form of a TA scheme (schema).� However, the form of this schema is a bit dated (CFX and this schema were created before TA was defined), in that the attributes are written as relations.� For example, this line in the schema

��� defloc function���� �file:lineNum�

would be expected to be written in TA as

����� function { defloc = �file:lineNum� }

However, we can treat attributes as relations with no harm done, so we will not be concerned with this difference.

 CFX Schema as an Entity/Relation Diagram

 We can draw a schema as a diagram.� For example, the CFX schema is shown as a diagram in Figure 5.4.� This diagram has the form (almost) of an Entity/Relation Diagram; E/R Diagrams are commonly used in the design of relational databases.

 

Figure 5.4. The CFX Schema as a Diagram

Querying EG�s CFX Factbase

We can use the stream shown in Figure 5.2 as the basis for answering questions about the EG program.� In particular, we can use Grok to ask queries about the EG program.� To do this, we can start up the Grok interpreter (the command for Grok is grok or grok.x).� Our first command to Grok is:

�� getdb �factbase.rsf�

This reads the stream into Grok�s internal memory.� (Aside: In interactive mode, Grok is forgiving of syntactic details, so when we are typing this interactively, we can omit the quotation marks and type simply getdb factbase.rsf.)

We can ask Grok to list the entire database by typing

�� listdb

When we do this, Grok produces a list of triples much like that in Figure 5.2.� However, its ordering of triples may be different, because the order of relations is immaterial to Grok.�

We can determine the number of triples by typing

���� dbsize

Grok will output the number 29, which is the count of the triples in the list.�

For unknown reasons, the CFX extractor sometimes produces repeated copies of triples.� In Figure 5.2, the _factbase.rsf_stream contains two copies of the line

���� include putint.c putint.h

which appear near the bottom of the list.�� We can delete duplicate lines by typing

����������� deldups

After we have typed deldups, when we again type dbsize, Grok will tell us that there are only 28 triples, because the one of the duplicated copies was deleted.

Querying about File Inclusions�

Perhaps we want to check to see what includes were in the EG program.� To do this, we can type

�� include

This asks Grok to list all the pairs in the include relation, so it outputs:

���� main.c�� putint.h

���� putint.c putint.h

These two pairs tell us that both main.c and putint.c include putint.h.�

We can ask the question: What files does main.c include, by typing:

���� main.c . include

This form is a forward set projector.� As a result, Grok will type

���� putint.h

In this example, we have cut down on the amount of we needed to type by taking advantage of Grok�s forgiving nature.� Grok�s set projectors actually expect to receive a set value.�To be perfectly correct, and to avoid Grok error messages, we should really have written:

���� { �main.c� } . include

But when we are working interactively, we often use these shortcuts.�

We can determine what files include putint.h by typing:

���� include . putint.h

This form is a backward set projector, and again we are minimizing the amount we need to type.� As a result, Grok will type

���� main.c

���� putint.c

Querying about Function Calls

We might want to know more about calls in the EG program. �We could start out by asking the names of all relations, by typing

�� relnames

As a result, Grok will output the list of names of relations:

������ funcdef

������ defloc

������ include

������ linkcall

������ linkref

������ macrodef

������ funcdcl

������ dclloc

������ vardcl

������ vardclloc

������ vardef

������ vardefloc

������ librarycall

From this list, we can surmise that the fact base contains two kinds of calls: linkcall and librarycall.�� From the schema (Figure 5.3) we know that a_linkcall_ is a call that is linked from one user compilation to another, while a librarycall is a call to a library function.� To inspect all linkcalls, we type

���� linkcall

As a result Grok outputs

���� main���� putint

To find out about library calls, we type

���� librarycall

and Grok outputs

���� putint�� printf

We have determined that the only linked function call is from main to putint, and the only call to a library is from putint to printf.

We may want to know what things putint is related to.� To find out, we type

�� putint :

It is helpful to think of the colon operator (:) as two eyeballs, looking forward from putint to see what it is related to.� As a result, Grok will output:

dclloc������������� putint.h:5

defloc������������� putint.c:4

librarycall�������� printf

From looking at the comments in the CFX schema, we can surmise from this that putint is declared on line 5 of putint.h and defined on line 4 of putint.c.� We can also re-affirm that _putint_calls the library function printf.

We might want to know what depends onthe putint function.� To find out we type

�� : putint

It is helpful to think of this as two eyeballs looking at _putint.�_As a result, Grok will output:

���� main���� linkcall

���� putint.h funcdcl

���� putint.c funcdef

This means that function_main_ calls putint.� It also means that putint is declared in putint.h and is defined in putint.c.

The C488 Compiler Example

As our next example, we will consider a larger program.�This program is a compiler, written in C, for a Pascal-like language.� It has a traditional compiler structure, with a scanner, a parser, a semantics pass and a code generator.� We have run C488 through CFX and have obtained the factbase for it.�This factbase, factbase.rsf, for C488 is much larger than the factbase for EG.� Further, we have run this factbase though the PBS pipeline and as a result, we have created a landscape (structural diagram) for it.� You may want to view it on the web by this URL:

http://swag.uwaterloo.ca/pbs/examples/C488/V1/index.html

If you navigate around through this landscape, you will discover that the C488 compiler consists of a number of files, such as parser.c, parser.h, codgen.c, codegen.h, globalvars.h, tokendef.h, etc.� You can observe that the only relations shown in the diagram are call, reference, implementby and otherdep.� As you may recall, the CFX schema has many more relations than this.� Why does the diagram have fewer and different relations?�The answer is that a number of Grok scripts manipulated the factbase, combining and renaming relations, to make the diagrams easier to comprehend.� Another thing to notice is that the diagram contains higher level entities, such as the _semantics_subsystem, which group together a number of files.� In particular, the semantics subsystem groups these files:� semantics.h, semantics.c �and _�semanticext.h._�

When landscape diagrams are inspected, it is common to observe surprising dependencies.� For example, in the C488 top view, there is a function call from the codegen subsystem to the _semantics_subsystem.��� Why should the code generator call the semantic analysis part?�It seems that semantic analysis should be completed before code generation starts, so such a call would be wrong or maybe just peculiar.� We want to know what actual function call(s) are causing this high level interaction from the code generator to semantics.

Since this kind of question happens repeatedly, we will give it a name: the _Where-Did-This-Arrow-Come-From_problem.� Ideally, we would just press a button on the landscape diagram display, and we would get the answer.� The reason there is no button is that there has been considerable abstraction and simplification from the source program to the CFX factbase to the stream used as a basis for drawing the landscape diagrams.� However, by using the factbase and Grok, we can work our way backward to answer this question.

The _Where-Did-This-Arrow-Come-From_Problem

We will consider the following question.

Q1: Where did this arrow from subsystem S to subsystem T come from?

As an example, we will consider the call arrow from S, the code generator subsystem, to T, the semantics subsystem.�We will answer this question in two steps.

Step 1. Using the landscape diagram,� we will partly answer the question by finding out what files in system S call what files in subsystem T.� Given that we have discovered that file F in S calls file G in T, we are now ready for the easier question:

Q2: Where did the arrow from file F to file G come from?�

Step 2. �We will answer question Q2 by using Grok to query the program�s CFX factbase.

Aside: This example uses a factbase created by the CFX fact extractor. If we were using a different fact extractor, such as CPPX, we would need to modify our approach, but the same general method would apply.

In Step 1, we can use the diagrammatic querying facilities of the landscape drawing program to file out what file(s) in the code generator call what file(s) in semantic analysis.� By concentrating first on the codegen subsystem, we can look into its internal diagram to determine that the only file in it that makes calls is codegen.c.� We can conclude that F (the file in S) is codegen.c.� Then we can look inside the semantics subsystem, where we can immediately discover that the only file that is called is semantics.c.�So we can conclude that G (the file in T) is semantics.c.� In a more complex example, we would find more than one file F that calls more than one file G.� Also, in a more complex example, we may have to spend some time diving deeper into the landscape diagrams to locate the files that are causing the interaction.� At any rate, once files F and G are isolated, using the diagrams, we are ready to go to step 2.

To carry out Step 2, we use the program�s factbase.rsf file, which was created by CFX.� This factbase contains interactions at the level of functions and global variables.� After starting up Grok, we type this command to load the factbase:

�� getdb factbase.rsf

Once this is loaded, it is a good idea to check its size (using dbsize) and to take a look at the names of its relations (using relnames),� just to gain confidence that what we are inspecting makes sense.� It is also important to understand the schema reasonably well; it is recommended that you refer to the CFX schema (Figures 5.3 and 5.4).� From the schema, we can tell that the funcdef relation records when file F defines function P.� We can find the names of all functions declared in file codegen.c by typing:

���� P := codegen.c . funcdef

To check that F seems reasonable, we can find out how� many functions are in it, by typing:

���� # P

When we do this, Grok tells us that P contains 21 items.�Since P is not large, it is a good idea to list these items, by typing:

���� P

As a result, Grok outputs all the functions defined in codegen.c:

��� BBS_Pop

��� BBS_Push

��� Code

��� Code31

��� Code32

��� Code46

��� Code47

��� FBS_Pop

��� FBS_Push

��� FBS_Swap

��� LES_Pop

��� LES_Push

��� Pop_FuncName

��� Pop_arrayName

��� Push_FuncName

��� Push_arrayName

��� Return_Func_NumArgs

��� allocate_storage

��� codegenFinalize

��� codegenInitialize

��� generateCode

This list seems reasonable, i.e., it is not surprising that these functions are part of codegen.c.�

We will next find all functions in semantics.c.� In querying to find these functions, we will ask for all functions that� are either declared or defined in semantics.c, by typing:

�� Q := semantics.c . (funcdef + funcdcl)

We could have written simply

�� Q := semantics.c . funcdef

However, there is the possibility that semantics.c might contain a function header, without a corresponding body, and that the arrow we are trying to locate might be pointing to this header.� So we are using the first (the longer) query.�Once we have the value of Q, it is a good idea to see how big it is and to list the names of functions contained in it.� We have illustrated these operations for P, so we won�t show them for Q.

Now we are ready to give a query that asks for all linkcalls from functions in codegen.c(in P) to functions in semantics.c (in Q).� Here is the query:

id (P)�o� linkcall� o� id (Q)

This query makes use of the id function twice.� The first id produces self loops on each member of P.� These self loops are composed with _linkcall_so that the result includes only� those pairs in linkcall which start with a function in P.� These pairs are next composed with _id (Q)_so that the resulting relation contains only those pairs that end with a function in Q.� The result contains all those link calls from a function in codegen.c to a function in semantics.c.� As a result of this query, Grok outputs

���generateCode�� ErrorHandler

From this we conclude that there is only one function, generateCode, in codegen.c, which calls a function in semantics.c.� The only such called function is ErrorHandler.�

To find out more about the generateCode function, we can type

���� generateCode :

As a result, Grok outputs:

��� defloc�������� codegen.c:365

��� deflocend����� codegen.c:869

��� linkref������� errorFile

��� linkref������� scanInteger

��� linkref������� traceCodeGen

��� linkref������� traceFile

��� linkcall������ ErrorHandler

��� linkref������� PrevScope

��� linkref������� ScopeLevel

��� librarycall��� fprintf

��� librarycall��� strlen

��� dclloc�������� codegen.h:21

From this list, we can see that generateCode is declared (defloc) on line 21 of _codegen.h_and is defined (defloc) on line 365 of _codegen.c_� We could make a similar query for ErrorHandler, but instead we will specialize the query to give us the locations of _ErrorHandler_�s declaration and definition, by typing:

���� ErrorHandler . (dclloc + defloc)

As a result of this query, Grok outputs:

��� "semantics.c:138"

This output is surprising, in that it indicates that the Error Handler does not have both a definition and a declaration.� To find out more, we can limit our query to try to find only the declaration, by typing:

���� ErrorHandler . defloc

Intriguingly, Grok produces no output for this query, indicating that ErrorHandler has no declaration.� We can conclude that _ErrorHandler_�has a definition (we could write a query to explicitly check this), but no declaration.

What we have learned is that ErrorHandler is programmed with a peculiar style, with a definition but no declaration.�While this usage is allowed by the C language, it a frowned upon as being bad style by many programmers.�This is particularly suspicious, because we have also discovered that this function is part of semantics.c, but is called from codegen.c.� To see how widely ErrorHandler �is used, we type:

�� linkcall . ErrorHandler

As a result Grok outputs only

���� generateCode

This tells us that _ErrorHandler_is called (outside of semantics.c, externally via linking) by only the generateCode_function.� This is further evidence that_ErrorHandler is used in a peculiar fashion, and perhaps should be moved to a new location, or perhaps that generateCode should be calling a function other than ErrorHandler.�

We will not explore this peculiarity further here.�Our goal was to show how to solve the Where-Did-This-Arrow-Come-From Problem.� To keep the explanation simple, we have used a straightforward example.� In more complex cases, it is easy to get lost in the details.� Keeping track of the queries, and avoiding mistakes when typing the queries, is tedious and can be error prone.� At the same time, it can also be rewarding, leading to discoveries of unexpected and sometimes incorrect parts of the program.� Indeed, it is intellectually interesting to be able to navigate through the structure of a piece of software.

Credits.� Gerry Kovan, with assistance from Ric Holt, created the CFX TA schema in Figure 5.3.�Ivan Bowman created the E/R Diagram for this schema, in Figure 5.4.

Part 4.

Writing Grok Scripts

The preceding section showed how the Grok interpreter can execute user command interactively.� We showed how this facility can be used to determine what functions in file codegen.c call what functions in file semantics.c.

While it is handy to execute Grok commands interpretively, it is also handy to bundle up a set of commands and run them as a script.�Figure 6.1 gives a script that solves this same _Where-Did-This-Arrow-Come-From_Problem.

�1� % Find call edges from P to Q

�2�getdb "factbase.rsf"

�3� P := {"codegen.c"} . funcdef

�4�countP := # P

�5� put "codegen.c contains ", countP, " function defs"

�6� Q := {"semantics.c"} . (funcdef + funcdcl)

�7�countQ := # Q

�8� put "semantics.c contains ", countQ, " function defs and dcls"

�9� put "Here are the linkcalls from codegen.c to semantics.c"

10� id (P)�o� linkcall� o� id (Q)

Figure 6.1. Grok Script comefrom.grk

The Grok script in Figure 6.1 contains roughly the same commands that were given in the previous section as interactive command.�(The line numbers in the figure are only for reference purposes, and so not actually appear in the script.)� In the figure, we have inserted a comment in line 1; as you can see, comments start with a percent % character.�

In line 2, we added quotation remarks around factbase.rsf.� In lines 3 and 5, we added quotation marks as well as set brackets { � } around the file names.� These additions make the commands syntactically correct.� In the interactive commands, we avoided some typing by taking syntactic shortcuts, which are, officially speaking, illegal, but since the interpreter is forgiving, we saved ourselves a few keystrokes.

Aside: While it is convenient to take shortcuts when typing interactive commands, you should avoid these and eliminate syntax errors in a Grok script.� Grok scripts are hard enough to write without getting confused by error messages.�Besides, error messages often help isolate serious bugs in the script.

Lines 5, 8 and 9 illustrate the use of the put statement.� It accepts sequences of items separated by commas.� These items can be strings, as well as names of variables.

We can run this script from the command line by typing

�� grok� comefrom.grk

When we do this, we will get this output:

�� codegen.c contains 21 file defs

�� semantics.c contains 17 file defs and dcls

�� Here are the linkcalls from codegen.c to semantics.c

����generateCode�� ErrorHandler

There is another way to run this script, namely, we can give an interactive command to invoke it.� To do this, we can first start up the Grok interpreter by typing

���� grok

Once Grok starts up, we can give this command:

���� exec comefrom.grk

This runs the script and produces the same output as has just been shown.� When the script finishes running, Grok will still in interactive mode, so we can continue giving it commands.�For example, we could type this simple command:

���� P

This command lists the items contained in set P (these are the functions defined in codegen.c).� More generally, using the _exec_command in interactive mode is helpful because it allows us to inspect the values of relations and variables that have been computed by the script.� The exec command can also be used in a script, to all another script as a subroutine.� For example, the command exec comefrom.grk �could appear in a script.�

Using Arguments in a Script

Our script solves one particular Where-Did-This-Arrow-Come-From Problem.� However, it works only for two particular files, codegen.c and semantics.c and only for a factbase called factbase.rsf.� We can generalize it to handle any files and any factbase by using arguments, as is illustrated in Figure 6.2.

�1� % Find call edges from P to Q

�2� if $n ~= 3 then� % Are there 3 arguments?

�3����put "Usage:� grok comefrom.grk factbaseRsf fromFileName toFileName"

�4����put "Finds linkcalls from fromFileName to toFileNamec "

�5����put "Author: Ric Holt,��Date: 22 Apr 2002"

�6����quit

�7� end if

�8�

�9�factbaseRsf := $1

10� fromFileName := $2

11� toFileName := $3

12�

13� getdb factbaseRsf

14� P := { fromFileName } . funcdef

15� countP := # P

16� put fromFileName, " contains ", countP, " function defs"

17� Q := { toFileName } . (funcdef + funcdcl)

18� countQ := # Q

19� put fromFileName, " contains ", countQ, " function defs and dcls"

20� put "Here are the linkcalls from ", fromFileName, " to ", toFileName

21� id (P)�o� linkcall� o� id (Q)

Figure 6.2. Grok Script comefrom.grk Rewritten with Arguments

In the Grok script, the notation n(line2)givesthenumberofarguments,andn (line 2) gives the number of arguments, and n(line2)givesthenumberofarguments,and1, 2,2, 2,3, etc. (lines 9 �11) give the values of arguments 1, 2, 3, etc.� Grok�s form of if statement is illustrated by lines 2 to 7.� The _quit_statement (line 6) aborts the script.

The script can be run by giving the command

���� grok comefrom.grk factbase.rsf codegen.c semantics.c

or in interactive mode by typing

���� exec comefrom.grk factbase.rsf codegen.c semantics.c

Other file names and another name of a factbase can be given as arguments.�

If a script is used often, it is good to create a shell script that invokes it, in order to simplify use of the script.�For example, we might create a shell script named comefrom that consists of this line:

�� grok comefrom.grk $@

The form $@ matches all the arguments passed to Grok. This shell script allows us to use comefrom by just typing _comefrom factbase.rsf codegen.c semantics.c._���For this to be generally useful, the shell script should provide an absolute path (or a path determined by a shell variable) so that _comefrom.grk_can be found from any directory.

Invoking a Script from a Library

This_exec_ command can be used in another Grok script.� When this is done, care must be taken to be sure that the script, comefrom.grk in our example, can be located by Grok.� By default, Grok will expect to find the script in the local directory.�The SwagKit project, at the University of Waterloo, has a library of commonly used Grok scripts.� These are stored in a directory, which is located by a shell variable called SWAGKIT_PATH.� Figure 6.3 gives an example of using this shell variable.

execPath := getenv "SWAGKIT_PATH"

linkProcedure := execPath cat "/lib/pipeline/" cat "lift.grp"

exec linkProcedure

**Figure 6.3.**� Using a Shell Variable to Locate a Grok Script

In the figure, the Grok built-in function getenv fetches the value of the _SWAGKIT_PATH_shell variable.� This is assigned to string variable execPath.� The complete path to the Grok script is built up by catenating execPath, /lib/pipeline/ and lift.grp.�If SWAGKIT_PATH has the value /home/holt/dev_swagkit (as it currently does for the author of this note), the complete path for the Grok script will be

���� /home/holt/dev_swagkit/lib/pipeline/lift.grp

(Note: The suffix _.grp_is used by convention to label a Grok procedure.)� If you build up a library of Grok scripts, you can use a shell variable to locate your scripts, as is done for SwagKit.

Abstractions by Grok Scripts

A common use of Grok scripts is to read a factbase and to simplify or abstract it.� One way of doing this is to combine sets of similar relations into a single relation.�Figure 6.4 gives a script, named cfxadddep.grk, which combines all dependencies in a CFX factbase into a single relation called dep.� The factbase, with dep added to it, is written out to a new factbase.

�1� % Add dependency 'dep' to CFX fact base.

�2� if $n ~= 2 then� % Are there 2 arguments?

�3�����put "Usage:� grok cfxadddep.grk factbaseRsf newFactBaseRsf"

�4�����put "Combines all dependencies into a single dependency 'dep'"

�5�����put "Author: Ric Holt,��Date: 22 Apr 2002"

�6�����quit

�7� end if

�8�

�9�factbaseRsf := $1

10� newFactbaseRsf := $2

11�

12� cfxDep := { "include", "sourcecall", "sourceref", "linkcall", \

13���� "linkref", "usemacro", "usetype", "usestruct", "useunion", \

14���� "useenum", "librarycall" }

15�

16� for d in cfxDep

17����� $ d := EMPTYREL

18� end for

19�

20� getdb factbaseRsf

21�

22� dep := EMPTYREL

23� for d in cfxDep

24����� dep := dep + $ d

25� end for

26�

27� putdb newFactbaseRsf

**Figure 6.4.**� Combining Dependency Relations

Lines 12-14 of the script create the set cfxDep, which contains the names of all dependencies in the CFX schema (see Figure 5.3 and 5.4).� Note that a backslash \ has been used at the ends of lines 12 and 13 to extend the command to multiple lines.

Lines 16-18 and 23-25 are examples of Grok for statements.� These iterate across elements of a set of strings.� In each of these, the loop is executes with d assigned each of the values contained in set cfxDep.� We know that the iterations will assign _d_each value from the set, but we are not guaranteed what the ordering will be.

Line 17 contains the statement

�� $ d := EMPTYREL

The dollar sign $ means that the value of_d_ is to be treated as a relation or variable.� For example, when d has the value sourcecall, this statement will effectively be

�� sourcecall := EMPTYREL

The reason that we have assigned the value EMPTYREL is a bit subtle.� This assignment acts like a �declaration� of the relation.� This �declaration� needs to precede the statement getdb factbaseRsf, so that, if the factbase does not contain any pairs in the relation, then the relation still exists (is declared), even though it is empty.� Without this �declaration�, if there were no pairs in the relation, the relation would not exist, as far as the Grok interpreter is concerned.� This is in turn would cause problems in line 24, in which the _dep_relation is constructed from all the dependency relations.� If d has a value (such as sourcecall) in line 24 which isn�t the name of a relation, the Grok interpreter will not know how to execute the line.� To avoid this difficulty, any relation being read from an RSF or TA stream, which might not have any pairs in the stream, should be assigned EMPTYREL before reading the stream.

Combining dependencies into �higher level� dependencies is a common way to abstract information in a graph.� Another way is to eliminate low-level entities, such as statements, or even functions within files.� When this is done, the dependencies from the low level entities are usually lifted (also called_raised_) to the level of their parent nodes.� Grok scripts to do this lifting are part of the PBS and SwagKit pipelines that prepare graphs for visualization.

Other Grok Features

We have introduced key features of Grok by means of examples.� We will finish by mentioning a couple more features.�

It is common to need to delete some relations or some sets of nodes from a factbase.� For example, in the script in Figure 6.4, after adding the dep relation we might want to delete all the relations, which were combined into dep.� �This can be done as follows

for d in cfxDep

����delrel $ d

end for

This loop uses the _delrel_command to delete relations from the factbase.�To remove a set of nodes from the factbase, whose names are in a set _nodesToDelete,_we can use this statement

�� delset nodesToDelete

This command removes all triples in the factbase that reference nodes in nodesToDelete.

When debugging a Grok script, the following command can be very helpful:

�� option echo

This tells the Grok interpreter to trace the script, outputting each command as it is executed.� The command

���� option timing

outputs the time taken to execute each Grok command.

We will now give a final observation about debugging Grok scripts.� Grok does not have a compiler, so there is no direct way to see if your script is syntactically legal.� However, the following trick works reasonably well: run the script with dummy values for the arguments.� This makes Grok run the script (with some bogus messages about bad arguments), but more importantly, gives messages about other syntax errors.� You should fix such errors before testing your script on real data.

There are a number of other commands, which are summarized in the following section.� This concludes our introduction to Grok.

Part 5.

Summary of Grok Features

1. Grok input/output commands

Grok input commands:

�� getdb F������������� Read RSF factbase from file F

�� adddb F������������� Add new tuples to data base from file F

�� getta F������������� Read TA factbase from file F

�� addta F������������� Add more TA to date base from file F

�� getset s [F [D]]���� Read� set s [from file F [in dir D]]

Grok output commands:

�� putdb F������������� Write RSF data base to file F

�� relToFile R F������� Write rel R to file F

�� appendRelToFile R F� Append rel R to file F

�� putta F������������� Write TA factbase to file F

�� putset s [F [D]]���� Write set s [to file F [in dir D]]

�� put msg {, msg}����� Print msgs; each is a string or a variable name

The file format for relations for get/put is ASCII triples, eg,

����� R a b��where R relates a to b (this format is called 'RSF').

The file format for sets is blank-separated strings.

Beware that the read/write (compressed) format isn't portable.

When TA is read in, attributes are treated as relations (with @_

added as a prefix).� The relations and attributes at the scheme level

have $_ prepended to their names.

2. Grok relation operators:

�� R1 + R2������������� Union of relations

�� R1 - R2������������� Difference of relations

�� R1 ^ R2������������� Intersection of relations

�� R1 o R2������������� Relational composition [was previously '*']

�� inv R��������������� Inverse of R

�� s . R��������������� Project set s through rel R [was *]

�� R . s��������������� Project set s through rel R backwards [was *]

�� R+������������������ Transitive closure of R

�� R*������������������ Reflective transitive closure of R

�� R1 cmp R2����������� Compare: cmp is == [or =], ~=, <, <=, >, >=

�� EMPTYREL������������ The empty relation

�� # R����������������� Cardinality (number of tuples) of R

�� dom R��������������� Domain of R (a set)

�� rng R��������������� Range of R (a set)

�� ent R��������������� Range + domain of R (all entities in R)

�� id s���������������� Identity relation on set s

�� DOM����������������� Domain of all relations in factbase (a set)

�� RNG�����������������Range of all relations in factbase (a set)

�� ENT����������������� Range + domain of all relations (a set)

�� ID������������������ The identity relation (based on ENT)

�� head R�������������� Head of R, eg (x y) in R produces ((x R y) head y)

�� tail R�������������� Tail of R, eg (x y) in R produces (x tail (x R y))

�� rel2tripleset R����� Make R into a set, eg (x y) produces (x R y)

�� tripl2relname������� Given '(x R y)' produce 'R'

3. Other relation commands:

�� relnames������������ Set of the names of relations

�� delrel R������������ Remove R's tuples factbase

�� listdb�������������� Print triples in factbase

�� sample R [b [n] ]��� Print sample from R.�Begin with

������������������������� number b (default 1) & print n (10)

�� slice s����� ��������Remove tuples that reference other than s

�� wslice s������������ Remove tuples that do not reference s

�� delset s������������ Remove tuples from factbase that reference s

�� delsetset s t������� Remove tuples with source from s and target from t

�� deldups������������� Delete any duplicate tuples in factbase

�� dbsize�������������� Number of tuples in factbase

�� cleardb������������� Delete all relations

4. Grok set operators:

�� s1 + s2������������� Union of sets

�� s1 - s2������������� Difference of sets

�� s1 ^ s2������������� Intersection of sets

�� s1 cmp s2����������� Compare: cmp is == [or =], ~=, <, <=, >, >=

�� # s����������������� Cardinality (number of elements) of s

�� s1 X s2������������� Cross product s1 X s2 (a relation)

�� { e1, e2, ... }����� Construct set containing e1, e2, ...

�� EMPTYSET������������ The empty set.�You can use { } instead

�� pick s�������������� Pick a value (a string) from s

�� prefix Q s���������� Set of items in s with prefix Q

�� suffix s Q���������� Set of items in s with suffix Q

5. Other set commands:

�� setnames������������ Print the names of set variables

�� sample s [b [n] ]��� Print sample from s.�Begin with

������������������������� number b (default 1) & print n (10)

�� clearsets����������� Delete all sets

�� addprefix Q s������� Add prefix Q to items in set s

�� addsuffix s Q������� Add suffix Q to items in set s

�� addprefixrel Q s���� Add prefix Q to rel names in set s

�� addsuffixrel s Q���� Add suffix Q to rel names in set s

�� addprefixdb Q s �����Add prefix Q to src & trg of rel names in set s

�� addprefixsrc Q s���� Add prefix Q to src of rel names in set s

�� addsuffixsrc s Q���� Add suffix Q to src of rel names in set s

�� addprefixtrg Q s���� Add prefix Q to trg of rel names in set s

�� addsuffixtrg s Q���� Add suffix Q to trg of rel names in set s

�� delprefix Q s������� Delete prefix Q to items in set s

�� delsuffix s Q������� Delete suffix Q to items in set s

�� delprefixrel Q s���� Delete prefix Q to rel names in set s

�� delsuffixrel s Q���� Delete suffix Q to rel names in set s

�� delprefixsrc Q s���� Delete prefix Q to src of rel names in set s

�� delsuffixsrc s Q���� Delete suffix Q to src of rel names in set s

�� delprefixtrg Q s���� Delete prefix Q to trg of rel names in set s

�� delsuffixtrg s Q���� Delete suffix Q to trg of rel names in set s

�� duplicate s��������� Creates new set with elements e.dup for

�������������������������� each element e of s, with same

�������������������������� connecting relations as in s

6. Grok number operators:

�� n1 + n2������������� Addition

�� n1 - n2������������� Subtraction

�� n1 * n2������������� Multiplication

�� n1 / n2������������� Real division

�� n1 div n2����������� Truncating division

�� n1 mod n2����������� Modulo

�� n1 cmp n2�������� ���Compare: cmp is == [or =], ~=, <, <=, >, >=

7. Grok string operators

�� S1 cat S2����������� Catenate strings S1 and S2

�� S1 cmp S2����������� Compare: cmp is == [or =], ~=, <, <=, >, >=

�� s :����������������� Out edges from node s

�� : s����������� ������In edges to node s

8. Grok boolean operators:

�� b1 and b2����������� Conjunction

�� b1 or b2������������ Disjunction

�� not b��������������� Boolean negation

9. Levels of precedence are, from low to high (most binding):

�� and

�� or

�� cmp��������������������(==, ~=, <, <=, >, >=)

�� + - cat���������������� (additions)

�� * X

�� prefixOperators�������� (inv dom rng ent - not ~ # pick $)

�� suffixOperators�������� (*, +, ie, transitive closure)

Parentheses (...) can be used to specify order of operations

10. Statements:

�� x := expn����������� Assign expn to variable x (an identifier)

�$ x := expn����������� Assign expn to variable whose name is in x

������ (The 'x = expn' form is considered to be obsolete.)

�� if expn then������� Cascaded 'if' statement, with optional

������ statements����� 'elsif' and 'else' clauses

�� {elsif expn then

������ statements}

�� [else

������ statements]

�� end if

�� loop��������������� Loop statement, stopped by 'exit'

������ statements

�� end loop

�� for e in s���������Repeat with each element e from set s

������ statements���������� (exits are supported)

�� end for

�� exit��������������� Exit from loop/for

�� exit when expn����� Exit from loop/for when expn is true

�� return������������� Return from current Grok script

�� quit��������������� Quit (halt Grok)

11. Other Grok commands:

varnames��������������� List of all relation, set and string variables

�� $ str��������������� String 'str' is taken as a variable's name

����������������������� If value is integer i, evaluates to arg i

�� $i������������������ Value of parameter i, i is 0, 1, 2 ...

�� $n������������������ Number of parameters

�� getenv s������������ Get environment value s

�� sysexec s����������� Execute shell command s. Value is return code

�� expn���������������� The value of expn (expression) is printed

�� exec F p1 ...������� Exec file F (Grok script) with parameters p1 ...

���������������������������� where t is each member of s

�� randdb t r n�������� Generate t tuples with r relations and

��� �������������������������n nodes.

�� ?������������������� Print the help menu

�� reset��������������� Restart (clear relations and variables)

Use '\' in last column to extend commands across line boundaries

12. Setting Grok options:

���� option Q��� Set option Q

The options that can be set include:

���� timing����� Print time to complete each operation

���� echo������� Echo each command (useful for tracing scripts)

���� debug������ Interpret 'debug comand' as 'command' (default off)

���� verbal����� Print out extra info during commands

���� progress��� Show progress on large operations (verbose)

���� optimize��� Run fast (default)

���� dumpsets��� Dump all sets and set variables (sets.tu)

���� dumpnames�� Dump all stored names (storage.tu)

The first of these are turned off using 'no', as in 'option notiming'.

13. Limits on Sizes in Grok

Many/most data sizes are dynamically increased up to the maximum available memory.� The dynamically adjusted sizes are for:

������ Triples

������ Names

��������� (Along with internal tables using for sorting, etc.)

Sizes which are statically fixed are:

������ Maximum length of names; strSize=255 bytes, can increase to 255 at most

������ Maximum tokens on a source Grok line; maxTokens (can be increased)