Summer Internship at LexisNexis

I am starting this post to write about my experiences during my internship at LexisNexis.

Week 1:

This week was spent mostly on setting up the platform for the main work. The main completed tasks were:

  • Cloned the HPCC system on my local machine, built it and installed it. Ran the regression and compiler tests on it.
  • Signed up for Jira, GitHub and Gitter. Forked the repo on GitHub.
  • Went through the GitHub documentation to be able to do the essential tasks. Also learnt how to clean up commits.
  • Sent a pull request for the six examples to master.
  • Started exploring the main issue. Ran gsoc5 and observed that the filter operation is performed twice. I am now going through the source code: ecl/hqlcpp/hqlinline.cpp and ecl/hqlcpp/hqlhtcpp, as mentioned in the email.

Week 2:

This week was not easy because I was often stuck on a small problem for a long time. Here are the updates:
1. Installed gdb and Eclipse to step through the code as eclcc compiles an ECL program. I got stuck at various places while doing so.
2. I stepped through the compilation of gsoc5.ecl, where the problem is that it executes two similar filters twice using inline code. I was trying to figure out which part of the code decides that an expression can be evaluated inline. I could not find the exact place, and several questions came up, listed below:
2.1 I saw that canProcessInline is called quite often during compilation. But I was unable to figure out which expression it is evaluating at each call. The two parameters of this function are ctx (BuildCtx) and expr (IHqlExpression). I thought that expr would help me figure out which part of the ECL code is being evaluated, but I could not work it out.
2.2 I see that a node operator is extracted from expr in the function calcInlineFlags. Overall, only a handful of node operators came out of it: alias, translated, workunit_dataset, createrow, filter and select (each of them prefixed with no_). I noted down what the function does with these operators to compute the inline flags in the file 06-10.txt (in the Dropbox folder). But I do not understand what these operators correspond to in the ECL query.
3. I plan to create a new dataset so that I can generate ECL queries to investigate the child query problem. For this, I started going through the code here: https://github.com/hpcc-systems/ecl-samples/blob/master/bundles/CellFormatter/GenData/GenData.ecl, to see how this dataset was generated. I thought this would be a good example for understanding the database concepts of ECL, as it involves multiple aspects like normalization, indexing etc. I came up with the following questions as I went through the code:
3.1 What is a column name called in ecl? In traditional databases it is often referred to as an attribute or a column.
3.2 What is the difference between a recordset and a dataset? My intuition is that a dataset can have a child dataset but a recordset cannot have any child record.
3.3 NORMALIZE(BlankKids,TotalChildren,CreateKids(LEFT,COUNTER)); This definition calls the CreateKids function TotalChildren times on the dataset of BlankKids. How does it help in normalizing, i.e. removing redundant data?
3.4 base_people   := ITERATE(base_fln_dist,PopulateRecs(LEFT,RIGHT,HASHCRC(RIGHT.FirstName,RIGHT.LastName)),LOCAL); What are LEFT and RIGHT here? I see only one parameter, base_fln_dist, which could probably be LEFT.
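
To make question 3.3 concrete, here is a minimal C++ sketch of how I read that NORMALIZE call. It assumes the transform is invoked TotalChildren times for each input record, with COUNTER running from 1. The Parent and Child record layouts and the normalize helper are hypothetical names used only for illustration; they are not taken from GenData.ecl or from the HPCC source.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical record layouts, for illustration only.
struct Parent { std::string name; };
struct Child  { std::string parentName; unsigned index; };

// Sketch of NORMALIZE(recordset, n, transform(LEFT, COUNTER)):
// the transform is called n times for every input record,
// with the counter running from 1 to n.
std::vector<Child> normalize(
    const std::vector<Parent>& recordset,
    unsigned n,
    const std::function<Child(const Parent&, unsigned)>& transform)
{
    std::vector<Child> out;
    for (const Parent& left : recordset)
        for (unsigned counter = 1; counter <= n; ++counter)
            out.push_back(transform(left, counter));
    return out;
}
```

Under this reading, each input record is expanded into n output records, so two input records with n = 3 would give six child records.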

Week 3:

In the first half of the week, I was able to come up with an ECL query and a new dataset which exposed some redundant inline operations performed in the COUNT function. In the second half, I went through the canProcessInline function and got a better idea of the code from your reply. Today I did the following:

I went closely through the generated XML and C++ code for gsoc6.ecl. The XML looked pretty self-explanatory. Here are the two observations I made while going through it:

1. There can be a node inside a graph and a graph inside a node.
2. I suppose the XML represents a graph which is different from the graph passed to the canProcessInline function. I think this XML graph (xgmml) is created after it has been determined which operations will be executed inline.
In the C++ code, I noted down what I observed and understood; my observations are pasted below along with the C++ code, in blue font. Please go through them and let me know if my current understanding is correct.

Week 4:

I was finally able to make the changes to the code and figure out how they affect the eventual generated code. With this change most of the operations are done in a child query. The code change seems to have fixed the problem with each of the gsoc queries. To figure out which operations should be processed inline, I put in a breakpoint to find out which operators cause an error for the gsoc queries if I return false for every operator. But it does everything in a child graph, including the creation of the inline dataset.
I first created the a.out file using the eclcc command (in Eclipse). Then I was able to see the generated code by creating a workunit using the ecl deploy command. But I could not figure out how to get the cpp files. So, I installed the changed code.
Other than that, this week was spent on solving one small issue or another, for example setting up the argument (I had to add a - and an f) and dealing with the problem of libraries. The proposal work also took time: setting up the room for the presentation and preparing the speech.

Week 5:

I added a new option to the code which currently makes sure that each source node is assigned inline. I analyzed how it affects the generated files in combination with the option minimalOperationsInline.
I found out which operators are not assigned inline when the option minimalOperationsInline is used but are otherwise assigned inline, for both the compiler and the runtime regression tests. I have pasted the results at the end of the email. After looking at some of the generated code and the code of canAssignInline, I came up with this list of operators which I think should probably always be assigned inline:
no_activerow
no_rows
no_null
no_fromjson
no_fromxml
no_fail
no_getresult
no_call
no_id2blob
no_left
no_matchrow
no_right
no_typetransfer
no_xmlproject
no_datasetfromdictionary
But I need to look further in the generated code to make sure that this is true.
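
A candidate list like the one above could be expressed as a simple lookup. The following is only a sketch: alwaysAssignInline is a hypothetical helper of my own, and it matches operators by name for readability, whereas the compiler works with node operator values (the no_ prefix) rather than strings.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical helper, not part of the HPCC source: returns true if the
// operator is on my tentative "should always be assigned inline" list.
// Strings are used purely for illustration.
bool alwaysAssignInline(const std::string& op)
{
    static const std::unordered_set<std::string> candidates = {
        "no_activerow", "no_rows", "no_null", "no_fromjson", "no_fromxml",
        "no_fail", "no_getresult", "no_call", "no_id2blob", "no_left",
        "no_matchrow", "no_right", "no_typetransfer", "no_xmlproject",
        "no_datasetfromdictionary"
    };
    return candidates.count(op) != 0;
}
```

An operator that is not on the list, such as no_filter, would fall through to whatever the existing inline decision logic does.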

Week 6:

I came back from the conference in Cyprus on Tuesday.
This week was spent mostly on understanding the problem and the new code that Gavin sent me. I tried to bring Gavin’s branch into my repo but got into a mess. After spending too much time on this, I forked the repo again and started from scratch. I was able to push the new changes and the diff can be seen here:
https://github.com/hpcc-systems/HPCC-Platform/compare/master...aranjan1002:childquery-2.1
I started experimenting with the code sent.

Week 7:

I made the following observations with the new code:
  • Using the option optimizeInlineOperations makes little difference to the generated C++ and XML code for gsoc5min.ecl compared with not using it.
  • None of the gsoc queries run with the new option. gsoc5 gives the following error, while the others go into infinite recursion leading to a segmentation fault:
    • gsoc5.ecl(53,8): error C9999: Internal Error at /home/aranjan/HPCC/HPCC-Platform/ecl/hql/hqlattr.cpp(3658
  • I made some changes to the code which can be seen here: https://github.com/aranjan1002/HPCC-Platform/compare/childquery-2.1...aranjan1002:childquery-2.1.1. I see that no splitter is generated for gsoc5 or gsoc6, which confused me. (It was later resolved: adding the case of no_createrow to the mustAssignInline function fixes the issue.)
  • I made some more changes in this new branch: https://github.com/aranjan1002/HPCC-Platform/compare/childquery-2.1...aranjan1002:childquery-2.2
    • I see that splitters are created in this case with the option minimalOperationsInline, while with the option optimizeInlineOperations infinite recursion occurs.

In general the aim of the experimentation was to figure out how to make split operations inline. Since no splitter was generated, this had to be investigated. I made some changes to make the splitter inline, which can be found here:
https://github.com/aranjan1002/HPCC-Platform/compare/childquery-2.1...aranjan1002:childquery-2.3

The generated C++ file seems to do the job of the splitter with this change. To test it I tried to run the ecl file, but I was getting this error:
aranjan@aranjan-GX776AA-ABA-a6342p:~/HPCC/ECLQueries$ ecl run gsoc5.ecl -t=thor
Program was terminated by signal 11
Error creating archive
I rebuilt and reinstalled the system, but I am still getting the same error.

Week 8:

The first thing I did was to make subgraphs inline selectively, based on the kind of activities they contain. Specifically, a subgraph is inlined only if all of its activities can be inlined. The code can be seen here:

https://github.com/hpcc-systems/HPCC-Platform/compare/master...aranjan1002:childquery-2.4
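
The rule described above (inline a subgraph only if every one of its activities can be inlined) can be sketched in a few lines of C++. This is an illustration, not the code from the branch: Activity, Subgraph and canInlineSubgraph are hypothetical names, and the per-activity canInline flag stands in for whatever check the compiler actually performs.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical types, for illustration only.
struct Activity {
    std::string kind;   // e.g. "filter"
    bool canInline;     // stand-in for the real per-activity check
};

struct Subgraph {
    std::vector<Activity> activities;
};

// A subgraph is inlined only if all of its activities can be inlined.
// Note that std::all_of returns true for an empty subgraph.
bool canInlineSubgraph(const Subgraph& graph)
{
    return std::all_of(graph.activities.begin(), graph.activities.end(),
                       [](const Activity& a) { return a.canInline; });
}
```

A single activity that cannot be inlined is enough to keep the whole subgraph as a child graph, which is the conservative behaviour I was aiming for.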

Doing this caused the following error to occur in gsoc6:

Graph[13], workunitwrite[24]: MP link closed. Master exception : Error aborting job, will cause thor restart

But it worked fine for roxie and showed a different error for hthor. The source of the error was identified by comparing the XML of thor and roxie: an attribute was missing in the child graph of the generated XML. Adding it removed the error and also gave the correct output. But it did not work for gsoc1 to 4 and gave this error for each:

error C4821: INTERNAL: Graph context not found

Gsoc1 to 4 are different because they have consecutive child graphs and the result of one is the source for the other. So it had to be figured out how the result can be passed if one of them is inlined. It was observed that the code threw the error when it was trying to inline the getGraphResult activity.

After quite a bit of debugging and going through the flow of control, I came to the conclusion that the error was caused because an IHqlExpression was missing this IAtom: externalAtom. With quite a few hacks here and there, I was able to make it compile for gsoc1, and the code can be seen here:

https://github.com/hpcc-systems/HPCC-Platform/compare/master...aranjan1002:childquery-2.6.3

In essence the changes were to save the getgraphresult expression with the attribute externalAtom in a global variable, graphResult, and then use this variable in the function buildGetLocalResult instead of the passed parameter. Also, instead of inlining the second child graph in the function optimizeInlineGraph, it is inlined at the end of the function generateGraph. I did that because I wanted to do the inlining after the first child graph has been properly generated.

The next step is to make it compile for gsoc 1 to 6 properly and make it run at least for gsoc 5 and 6 (as it was doing before).

Week 9:

So, the last week is over. I spent this whole week creating documentation about the whole internship experience. All the documentation can be accessed here:
https://github.com/aranjan1002/code4life/tree/master/HPCC

Here are some details about the files:

Diary.md – Covers all the work from start to finish.

Experiences.md – A summary of my experience and recommendations.

CodeGenerator.md – Some questions I have regarding the code generator

DocumentationSuggestions.md – Some suggestions about how to improve the documentation for the system

The remaining files are referred to in the four docs above.

I plan to continue working on the project in my free time and hopefully finish it before I graduate in December.
