0:01 Hi, my name is Steve O'Hara. This is part one of three talking about the
0:08 software from Eagle Legacy Modernization LLC.
0:13 This part, part one, is focused completely on parsing computer
0:19 programming languages. Part two will focus on dynamic
0:27 analysis as well as interpreting, and in part three we'll talk about transformation and
0:34 generation. Each of these should be fairly brief. I'm not trying to go into a lot of detail, just trying to give an
0:41 overview. The first thing I'd like to point out is that parsing a computer programming language is not all that
0:50 difficult. We've been doing this as an industry for many, many decades. But it doesn't solve the whole
0:59 problem. There's a lot more to analysis of programming languages, metaprogramming if you will, than just
1:06 parsing. In my former career, one of the things that I did was I was the CTO of a legacy modernization
1:16 company. We did this for a business, and the most frequent problem we had was that
1:23 the grammars were done by one group of people and the analyses were done by a second group of people, and they often
1:31 got out of sync. And the problem with them getting out of sync is the analysis people would be using XPath-like
1:38 formulations to try to get elements out of the AST, but the AST was changing,
1:47 and it would return an empty result. It would blindly not fail, would not give an error, and we very frequently ran into
1:55 this problem of things getting out of sync. So after doing some research and experimentation
2:03 we came up with this new idea of what we call a program grammar, or programmar
2:11 for short. No, it's not a typo; it ends in "ar" instead of "er": program grammar. The idea of a program grammar
2:19 is that we put everything in code, in Java in this case, but it could be C,
2:27 whatever, and let the compiler tell us when things get out of sync. And I'm going to demo that for you. So, let
2:36 me start off by showing you this. So in the C programming language
2:44 there are a lot of constructs, and one of them is the do-while loop. So as long as some condition is
2:51 true, keep performing this set of actions. And what you can see on your screen, hopefully, is the implementation of that,
3:01 and this is a token sequence. A token sequence is equivalent to a production rule in a traditional
3:09 grammar. It has a sequence of things, and that annotation of S is the
3:16 sequence number. It says first that there is a keyword called do. Then
3:23 there's optionally a comment. Then there is a statement and another keyword called while, a parenthesis, an expression,
3:31 and so on. And you can see how that kind of correlates with the traditional BNF-like grammar, if you will,
3:40 that you're used to seeing in the more traditional Yacc, ANTLR, Bison kinds of
3:47 tools. Okay. These program grammars are context-sensitive.
3:54 In addition to showing this declarative form, you can actually inject code in there to check conditions
4:02 and have the parsing depend on the context, which is not something you easily get in a more traditional
4:09 grammar. The style used here is recursive descent, well documented and well understood, as opposed to
4:17 the LR, the lookahead-type grammars.
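To make the token sequence idea concrete, here is a minimal Java sketch of how the do-while rule described above might look as annotated fields. The `@S` annotation, its `optional` flag, and the element classes (`Keyword`, `Punctuation`, `Comment`, `Statement`, `Expression`) are hypothetical names invented for this illustration; the actual Eagle Legacy classes are not shown in this transcript. The sketch reads the fields back with standard Java reflection and builds a BNF-like rule from them.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class DoWhileSketch {
    // Hypothetical annotation: the position of a field within the token sequence.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface S {
        int value();
        boolean optional() default false;
    }

    // Hypothetical placeholder element types; a real grammar would give them parsing behavior.
    static class Literal {
        final String text;
        Literal(String text) { this.text = text; }
    }
    static class Keyword extends Literal { Keyword(String text) { super(text); } }
    static class Punctuation extends Literal { Punctuation(String text) { super(text); } }
    static class Comment {}
    static class Statement {}
    static class Expression {}

    // The rule itself: do [comment] statement while ( expression ) ;
    static class DoWhileStatement {
        @S(1) Keyword doKeyword = new Keyword("do");
        @S(value = 2, optional = true) Comment comment;
        @S(3) Statement body;
        @S(4) Keyword whileKeyword = new Keyword("while");
        @S(5) Punctuation leftParen = new Punctuation("(");
        @S(6) Expression condition;
        @S(7) Punctuation rightParen = new Punctuation(")");
        @S(8) Punctuation semicolon = new Punctuation(";");
    }

    // Uses reflection to turn the annotated fields into a BNF-like production.
    static String describe(Class<?> rule) {
        try {
            Object instance = rule.getDeclaredConstructor().newInstance();
            List<Field> fields = new ArrayList<>();
            for (Field f : rule.getDeclaredFields()) {
                if (f.isAnnotationPresent(S.class)) {
                    fields.add(f);
                }
            }
            // Order the fields by their declared sequence number.
            fields.sort(Comparator.comparingInt((Field f) -> f.getAnnotation(S.class).value()));
            StringBuilder sb = new StringBuilder(rule.getSimpleName()).append(" ::=");
            for (Field f : fields) {
                Object value = f.get(instance);
                String name = (value instanceof Literal)
                        ? "'" + ((Literal) value).text + "'"
                        : f.getType().getSimpleName();
                sb.append(' ').append(name);
                if (f.getAnnotation(S.class).optional()) {
                    sb.append('?');
                }
            }
            return sb.toString();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(DoWhileStatement.class));
    }
}
```

Because the rule is ordinary Java, analysis code that refers to a field such as `condition` would stop compiling if the grammar renamed or removed it, rather than quietly returning an empty result.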
4:23 This is using reflection. I'm not sure if everybody's familiar with reflection in detail, but I'm pretty sure you all understand it in general
4:31 terms. Reflection is the ability of a program to essentially read itself. So this Java code that you can see on the
4:40 screen infers or implies the representation of the grammar, and
4:49 this is what is actually used for parsing. One important thing to point out is that the terminal nodes, so a
4:58 terminal node is one of the leaves on the syntax tree, are simply code
5:06 in this case. So for example, the terminal nodes could be things like keywords, identifiers, numbers, strings
5:14 and so on. Well, these things are actually just code, and they share code.
5:20 So for example, here's how a literal in C, which starts with double quotes, has an
5:27 escape character which is a backslash, and so on, is all processed. This is significantly faster
5:35 than expressing it all in the syntax, but it also works way better
5:43 because, I don't know if you've ever tried to write a grammar for a floating-point number, but it's very
5:50 painful. It's very challenging. But if you're doing it in code, it's much easier. And in fact, if you do it in
5:57 shared code across multiple languages, it gets even easier. You don't have to share across things. You can have your
6:05 own way of doing it in each language, but it's much easier if you can share that kind of stuff.
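As an illustration of a terminal node written as plain code, here is a small Java sketch of a quoted-literal scanner of the kind described above. The class and method names are hypothetical and this is not the Eagle Legacy implementation. The quote and escape characters are parameters, which is one way the same code could be shared by several languages.

```java
class StringLiteralScanner {
    /**
     * Scans a quoted literal that begins at index start.
     * Returns the index just past the closing quote, or -1 if no
     * well-formed literal starts there.
     */
    static int scan(CharSequence source, int start, char quote, char escape) {
        if (start < 0 || start >= source.length() || source.charAt(start) != quote) {
            return -1;
        }
        int i = start + 1;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (c == escape) {
                i += 2;          // skip the escape character and whatever it escapes
            } else if (c == quote) {
                return i + 1;    // closing quote found
            } else if (c == '\n') {
                return -1;       // literal not terminated on this line
            } else {
                i++;
            }
        }
        return -1;               // ran off the end of the input
    }

    public static void main(String[] args) {
        // C source text: printf("say \"hi\"\n");
        String c = "printf(\"say \\\"hi\\\"\\n\");";
        int start = c.indexOf('"');
        int end = scan(c, start, '"', '\\');
        System.out.println(c.substring(start, end));
    }
}
```

A grammar for another language could call the same method with a different quote or escape character instead of restating the rule in its own syntax.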
6:12 Okay. So, what I want to do is show you some different ways of parsing and running and doing all
6:19 this kind of stuff. Okay. So the first thing I want to do is show you a
6:27 command line tool for parsing. So this little example
6:34 here is parsing a collection of COBOL programs. And let me just run it so you can kind of see what it does.
6:44 Now, COBOL has two different ways of representing source code.
6:49 There's fixed format, which is very column-specific: it only goes up to column 72, columns 1 to 6 are a
6:56 label, and so on. What this is doing is it's going out there and
7:03 parsing each of the programs in that project. That project is really just a group of programs. Okay.
7:13 All of these languages that we're dealing with are using a single parser. So
7:21 each of these languages that you can see on the right-hand side is using the same parser. Python, for example, uses
7:30 the same parser. COBOL, assembler, they're all using that same parser, which is a very
7:38 powerful idea. The next thing I want to do is go through a little bit of detail on this screen. This is
7:47 essentially a parser management or project management kind of a thing.
7:53 In this case we're dealing with 33 different projects, I guess, and 34 different languages. Okay.
8:07 Programming languages, although some of these really aren't: a property file is not really much of a language, and XML is not really a
8:15 programming language, but nonetheless it's a parsable entity. And the expectation is that you don't have a perfect
8:24 system when you're dealing with thousands or millions of programs. So in this case we have 19,000 source
8:34 files spread out into multiple groups. Some have a few thousand, some have a few dozen, some only have one, but most
8:43 of them are actually parsing. So, you can see that out of this AP pack, which is a COBOL
8:50 accounting package, 17 of the files aren't parsing, but 2,349 of them are. That's over a million,
8:59 1.4 million lines, and 99% of them parse successfully.
9:06 And again, by language you can see the same kind of thing. So the left and the right totals match up.
9:12 They're both representing the same logic. Okay. Now,
9:20 one of the things that you can do is you can see essentially a BNF-like
9:27 representation of COBOL. So we take the COBOL grammar, the programmar that you saw, and
9:36 generate a BNF-like grammar. So this is generated, but in addition to generating,
9:45 we show statistics, and the statistics show you how often each of the pieces is present. Okay. So here's a choice.
9:53 There's a COBOL data section, which is composed of either a COBOL comment or a COBOL file section or COBOL working storage. And 28% of the time it's a
10:02 comment. 19% of the time it's a file section. 19% it's working storage, and so
10:07 on. So these tiny little numbers here give you detailed statistics. And there
10:15 are 4,375 instances of this COBOL data section in this set of code that we're dealing with. Very powerful concept.
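The per-alternative percentages can be thought of as simple counters attached to each choice in the grammar. The Java sketch below is a hypothetical illustration of that bookkeeping, not the tool's actual code; the class name, the method names, and the toy counts in `main` are invented and are not the figures from the demo.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ChoiceStatistics {
    // How many times each alternative of a choice was matched, in first-seen order.
    private final Map<String, Integer> counts = new LinkedHashMap<>();
    private int total;

    // Called once each time the parser matches this choice.
    void record(String alternative) {
        counts.merge(alternative, 1, Integer::sum);
        total++;
    }

    // Share of matches that took the given alternative, rounded to a whole percent.
    long percent(String alternative) {
        if (total == 0) {
            return 0;
        }
        return Math.round(100.0 * counts.getOrDefault(alternative, 0) / total);
    }

    String report(String ruleName) {
        StringBuilder sb = new StringBuilder(ruleName)
                .append(" (").append(total).append(" instances):");
        for (String alternative : counts.keySet()) {
            sb.append(' ').append(alternative)
              .append(' ').append(percent(alternative)).append('%');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ChoiceStatistics stats = new ChoiceStatistics();
        // Toy data for illustration only.
        stats.record("CobolComment");
        stats.record("CobolFileSection");
        stats.record("CobolComment");
        stats.record("CobolWorkingStorage");
        System.out.println(stats.report("CobolDataSection"));
    }
}
```

Keeping the counters on the grammar elements themselves is what would let a generated BNF-like listing show usage figures next to each alternative.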
10:28 You can also look to see why things failed. What happened? What is it about this set of things? So you can
10:37 tell (I don't know if you can see this very clearly) by the pre-fail and post-fail that this is the point between these two
10:45 where things stopped parsing, for whatever reason. So you can look at these and it'll help you figure out,
10:53 well, what do I need to work on to get parsing closer to 100%.
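One simple way to produce a pre-fail and post-fail view is to cut the source text at the offset where parsing stopped and show a little context on either side. The Java sketch below assumes the parser reports that offset; the class and method names are hypothetical, and the COBOL snippet is made up for the example.

```java
class FailurePoint {
    /**
     * Splits source around failOffset and returns two strings: up to context
     * characters before the failure (pre-fail) and up to context characters
     * starting at the failure (post-fail).
     */
    static String[] split(String source, int failOffset, int context) {
        // Clamp everything so that out-of-range offsets cannot throw.
        int at = Math.max(0, Math.min(source.length(), failOffset));
        int from = Math.max(0, at - context);
        int to = Math.min(source.length(), at + context);
        return new String[] { source.substring(from, at), source.substring(at, to) };
    }

    public static void main(String[] args) {
        String source = "MOVE A TO B.\nDISPLY B.\n";
        int failOffset = source.indexOf("DISPLY");   // pretend the parser stopped here
        String[] parts = split(source, failOffset, 8);
        System.out.println("pre-fail:  [" + parts[0] + "]");
        System.out.println("post-fail: [" + parts[1] + "]");
    }
}
```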
10:59 And you can also look at the ones that are successfully parsed. I don't know, maybe this one. And you can
11:08 actually see a sort of smart
11:15 version of the programs that's color-coded for
11:21 various things, variables. You can click on things and you can see where they're defined, what their scope is, all the
11:30 cross-references, and so on. There's quite a lot of stuff in here that helps to understand what's going on in the bigger picture.
11:41 All right, the next thing I want to show you is the
11:50 debuggers.
11:59 Not this one. I want to show this one here.
12:07 All right. So, here is a way of looking at the
12:16 results of parsing. So, this is a representation of the tree that comes out, and I don't call it
12:23 an AST anymore because it does have semantics. I call it a programmar semantic tree, PST.
12:30 And this is a representation of the contents of the parse results. What happened during parsing, what was
12:38 successful. And you can see all the details of line numbers and all that kind of good
12:45 stuff. A lot of information going on in here. Okay.
12:51 And we can also have a debugger. So, let me pull this thing here.
12:58 And this is a parser debugger. And you can do things like set a breakpoint
13:06 and then you can run it and it'll run until it gets to that spot. This is a little clunky. Personally I don't
13:16 find it to be really all that useful, but it's there and maybe we can make it more useful in the future. But you
13:23 can step into things, step over things, do stuff, whatever. Watch things happen. If you
13:31 can see in the background, it's telling you a little bit about what it's doing. Just let it run.
13:38 And you can see that in this case it was successful. Well, anyway, there is a debugger, which is a useful tool on
13:47 occasion. Most of the time, presumably, things parse successfully, but if they don't, that certainly helps. And the next
13:57 thing I want to show you or the last thing I guess in this part
14:03 I want to show you is the website, eaglegacy.com, kind of smashed together:
14:11 Eagle and Legacy. It's eaglegacy.com.
14:19 Actually, I think I can get it to show eaglegacy.com.
14:25 So this is the website and it has a lot of information much more than we're covering in here. And one of the things
14:32 it has is you can actually try the parser.
14:36 So let's try the parser. Notice that this is not a secure system. So you have to
14:44 be aware that it's public.
14:48 Okay. There's no privacy available here. So you can take a random
14:55 source file and you can say, hey, let me parse this thing, and yes, I know it's not secure.
15:05 No thank you.
15:07 And you can see different things that came out of here. This parse tree is the same that we just saw a minute ago.
15:15 Here's the equivalent grammar for C.
15:20 And it shows you the statistics and what we just ran. And these are the terminal nodes. I told you these are not really
15:27 part of the grammar per se. They are really just code. And you can see the
15:34 source code again. So this is actually available. This is
15:42 part of the Eagle Legacy website. You can get to it. You can run it. There are some limitations on it, of course, but
15:49 it is generally available for you.
15:56 Okay. So I want to stop here, and remember, this is part one of three.
16:04 Thank you.