0:01 Hi, my name is Steve O'Hara. This is part one of three talking about the software from Eagle Legacy Modernization LLC. This part, part one, is focused completely on parsing computer programming languages. Part two will focus on dynamic analysis as well as interpreting, and part three will talk about transformation and generation. Each of these should be fairly brief; I'm not trying to go into a lot of detail, just trying to give an overview.

0:41 The first thing I'd like to point out is that parsing a computer programming language is not all that difficult. We've been doing this as an industry for many, many decades. But it doesn't solve the whole problem. There's a lot more to analysis of programming languages, metaprogramming if you will, than just parsing.

1:06 In my former career, one of the things I did was serve as the CTO of a legacy modernization company. We did this as a business, and the most frequent problem we had was that the grammars were done by one group of people and the analysis was done by a second group of people, and they often got out of sync. The problem with them getting out of sync is that the analysis people would be using XPath-like formulations to try to get elements out of the AST, but the AST was changing, and a query would blindly return an empty result. It would not fail, it would not give an error, and we very frequently ran into this problem of things getting out of sync.

1:55 So after doing some research and experimentation, we came up with this new idea of what we call a program grammar, or "programmar" for short. No, it's not a typo: it ends in "ar" instead of "er". The idea of a program grammar is that we put everything in code, in Java in this case, but it could be C or whatever, and let the compiler tell us when things get out of sync. I'm going to demo that for you, so let me start off by showing you this.
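The out-of-sync failure described above can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not the Eagle Legacy code: the `Node` class, the `select` path query, and the typed `Expression` and `DoWhileStatement` classes are all hypothetical names invented for this example. It contrasts a string-based, XPath-like lookup, which quietly returns nothing after a node is renamed, with typed field access, which the compiler checks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OutOfSyncSketch {

    /** Generic tree node identified only by a string label (hypothetical). */
    static final class Node {
        final String label;
        final List<Node> children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }
    }

    /** Hypothetical XPath-like query: follows child labels separated by '/'. */
    static List<Node> select(Node root, String path) {
        List<Node> current = new ArrayList<>();
        current.add(root);
        for (String step : path.split("/")) {
            List<Node> next = new ArrayList<>();
            for (Node node : current) {
                for (Node child : node.children) {
                    if (child.label.equals(step)) {
                        next.add(child);
                    }
                }
            }
            current = next; // an unknown label simply yields an empty list
        }
        return current;
    }

    /** Typed alternative: the tree shape is expressed as ordinary Java fields. */
    static final class Expression {
        final String text;

        Expression(String text) {
            this.text = text;
        }
    }

    static final class DoWhileStatement {
        final Expression condition;

        DoWhileStatement(Expression condition) {
            this.condition = condition;
        }
    }

    public static void main(String[] args) {
        // Suppose the grammar group renamed the "condition" node to "test".
        Node tree = new Node("program", new Node("doWhile", new Node("test")));

        // The analysis still asks for the old name: no error, just no matches.
        System.out.println(select(tree, "doWhile/condition").size()); // prints 0

        // With typed access, renaming the field 'condition' would make this
        // line fail to compile until the analysis code is updated as well.
        DoWhileStatement loop = new DoWhileStatement(new Expression("i < 10"));
        System.out.println(loop.condition.text);
    }
}
```

The string query and the typed access ask the same question; the difference is only in when a mismatch is discovered, at run time as an empty result or at compile time as an error.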
2:36 In the C programming language there are a lot of constructs, and one of them is the do-while loop: as long as some condition is true, keep performing this set of actions. What you can see on your screen, hopefully, is the implementation of that, and this is a token sequence. A token sequence is equivalent to a production rule in a traditional grammar. It has a sequence of things, and that annotation, S, is the sequence number. It says first that there is a keyword called do. Then there's optionally a comment. Then there is a statement, another keyword called while, a parenthesis, an expression, and so on. You can see how that correlates with the traditional BNF-like grammar, if you will, that you're used to seeing in the more traditional Yacc, ANTLR, Bison kinds of tools.

3:47 These program grammars are context-sensitive. In addition to this declarative form, you can actually inject code in there to check conditions and have the parsing depend on the context, which is not something you easily get in a more traditional grammar. The style used here is recursive descent, well documented and well understood, as opposed to LR, the lookahead-type grammars.

4:23 This is using reflection. I'm not sure if everybody's familiar with reflection in detail, but I'm pretty sure you all understand it in general terms. Reflection is the ability of a program to essentially read itself. So this Java code that you can see on the screen implies the representation of the grammar, and this is what is actually used for parsing.

4:49 One important thing to point out is that the terminal nodes, the leaves of the syntax tree, are simply code in this case. For example, the terminal nodes could be things like keywords, identifiers, numbers, strings, and so on.
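The on-screen code is not reproduced in this transcript, so the following is only a rough sketch of how a do-while token sequence might be declared as a Java class and read back through reflection. Everything here is an assumption made for illustration: the `S` annotation borrows its name from the talk, but its definition, the `Opt` and `Literal` annotations, the placeholder element types, and the `toRule` method are hypothetical and are not the Eagle Legacy API.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TokenSequenceSketch {

    /** Sequence number of an element; stands in for the S annotation in the talk. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface S { int value(); }

    /** Hypothetical marker: the element may be absent. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Opt { }

    /** Hypothetical marker: the element must match this exact text. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Literal { String value(); }

    // Placeholder element types; a real grammar would give these parsing logic.
    static class Token { }
    static class Comment { }
    static class Statement { }
    static class Expression { }

    /** The do-while loop as a token sequence: one annotated field per element. */
    static class DoWhileStatement {
        @S(10) @Literal("do") Token doKeyword;
        @S(20) @Opt Comment comment;
        @S(30) Statement body;
        @S(40) @Literal("while") Token whileKeyword;
        @S(50) @Literal("(") Token leftParen;
        @S(60) Expression condition;
        @S(70) @Literal(")") Token rightParen;
        @S(80) @Literal(";") Token semicolon;
    }

    /** Reads the class through reflection and renders a BNF-like rule. */
    static String toRule(Class<?> sequence) {
        List<Field> fields = new ArrayList<>();
        for (Field field : sequence.getDeclaredFields()) {
            if (field.isAnnotationPresent(S.class)) {
                fields.add(field);
            }
        }
        // getDeclaredFields() promises no particular order, so sort by S.
        fields.sort(Comparator.comparingInt((Field f) -> f.getAnnotation(S.class).value()));

        StringBuilder rule = new StringBuilder(sequence.getSimpleName()).append(" ::=");
        for (Field field : fields) {
            Literal literal = field.getAnnotation(Literal.class);
            String symbol = literal != null
                    ? "'" + literal.value() + "'"
                    : field.getType().getSimpleName();
            if (field.isAnnotationPresent(Opt.class)) {
                symbol = "[" + symbol + "]";
            }
            rule.append(' ').append(symbol);
        }
        return rule.toString();
    }

    public static void main(String[] args) {
        // prints: DoWhileStatement ::= 'do' [Comment] Statement 'while' '(' Expression ')' ';'
        System.out.println(toRule(DoWhileStatement.class));
    }
}
```

One plausible reason for an explicit sequence number, and the reason this sketch needs one, is that Java reflection does not guarantee the order in which declared fields are returned. Because each element is a typed field, analysis code that refers to `condition` stops compiling if the grammar renames or removes it.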
5:14 These terminal nodes are actually just code, and they share code. For example, here's how a string literal in C, which starts with double quotes, has an escape character which is a backslash, and so on, is all processed. This is significantly faster than expressing it all in the syntax, but it also works way better. I don't know if you've ever tried to write a grammar for a floating-point number, but it's very painful. It's very challenging. If you're doing it in code, it's much easier, and in fact if you do it in shared code across multiple languages, it gets even easier. You don't have to share across languages; you can have your own way of doing it in each language, but it's much easier if you can share that kind of stuff.

6:12 So what I want to do is show you some different ways of parsing and running and doing all this kind of stuff. The first thing I want to show you is a command-line tool for parsing. This little example here is parsing a collection of COBOL programs. Let me just run it so you can see what it does.

6:44 Now, COBOL has two different source formats. There's fixed format, which is very column-specific: it only goes up to column 72, columns 1 to 6 are a label, and so on. What this is doing is going out there and parsing each of the programs in that project. That project is really just a group of programs.

7:03 All of these languages that we're dealing with are using a single parser. Each of the languages you can see on the right-hand side is using the same parser. Python, for example, uses the same parser. COBOL, assembler: they're all using that same parser, which is a very powerful idea. The next thing I want to do is go through a little bit of detail on this screen.
7:38 This is essentially a parser management or project management kind of thing. In this case we're dealing with 33 different projects, I guess, and 34 different programming languages, although some of these aren't really programming languages: a property file is not really much of a language, and XML is not really a programming language, but nonetheless it's a parsable entity.

8:15 The expectation is that you don't have a perfect system when you're dealing with thousands or millions of programs. In this case we have 19,000 source files spread out into multiple groups. Some have a few thousand, some have a few dozen, some only have one, but most of them are actually parsing. You can see that out of this AP pack, which is a COBOL accounting package, 17 of the files aren't parsing, but 2,349 of them are. That's 1.4 million lines, and 99% of them parse successfully. And again, by language you can see the same kind of thing. The left and the right totals match up; they're both representing the same logic.

9:12 Now, one of the things you can do is see essentially a BNF-like representation of COBOL. We take the COBOL programmar that you saw and generate a BNF-like grammar. So this is generated, but in addition to generating it we show statistics, and the statistics show you how often each of the pieces is present.

9:45 Here's a choice. There's a COBOL data section, which is composed of either a COBOL comment, a COBOL file section, or COBOL working storage. 28% of the time it's a comment, 19% of the time it's a file section, 19% it's working storage, and so on. These tiny little numbers here give you detailed statistics. There are 4,375 instances of this COBOL data section in the set of code that we're dealing with. Very powerful concept.

10:15 You can also look to see why things failed.
10:28 What happened? What is it about this set of things? I don't know if you can see this very clearly, but there's pre-fail and post-fail, and the point between these two is where things stopped parsing, for whatever reason. You can look at these and it'll help you figure out what you need to work on to get parsing closer to 100%.

10:59 You can also look at the ones that are successfully parsed. I don't know, maybe this one. You can actually see a sort of smart version of the program that's color-coded for various things, such as variables. You can click on things and see where they're defined, what their scope is, all the cross-references, and so on. There's quite a lot of stuff in here that helps to understand what's going on in the bigger picture.

11:41 All right, the next thing I want to show you is the debuggers. Not this one. I want to show this one here.

12:07 So here is a way of looking at the results of parsing. This is a representation of the tree that comes out. I don't call it an AST anymore because it does have semantics; I call it a programmar semantic tree, PST. This is a representation of the contents of the parse results: what happened during parsing, what was successful. You can see all the details of line numbers and all that kind of good stuff. A lot of information going on in here.

12:51 We can also have a debugger. Let me pull this thing up here. This is a parser debugger. You can do things like set a breakpoint, and then you can run it and it'll run until it gets to that spot. This is a little clunky. Personally, I don't find it to be all that useful, but it's there, and maybe we can make it more useful in the future. You can step into things, step over things, do stuff, whatever.
13:23 Watch things happen. If you can see in the background, it's telling you a little bit about what it's doing. Just let it run. You can see that in this case it was successful. Anyway, there is a debugger, which is a useful tool on occasion. Most of the time, presumably, things parse successfully, but if they don't, that certainly helps.

13:47 The next thing I want to show you, or the last thing I guess in this part, is the website: Eagle and Legacy kind of smashed together, eaglegacy.com. Actually, I think I can get it to show. eaglegacy.com.

14:25 This is the website, and it has a lot of information, much more than we're covering here. One of the things it has is that you can actually try the parser. So let's try the parser. Notice that this is not a secure system, so you have to be aware that it's public. There's no privacy available here. You can take a random source file and say, hey, let me parse this thing. And yes, I know it's not secure. No, thank you.

15:07 You can see different things that came out of here. This parse tree is the same one we just saw a minute ago. Here's the equivalent grammar for C, and it shows you the statistics for what we just ran. These are the terminal nodes. I told you these are not really part of the grammar per se; they are really just code. And you can see the source code again. So this is actually available. This is part of the Eagle Legacy website. You can get to it. You can run it. There are some limitations on it, of course, but it is generally available for you.

15:56 So I want to stop here. Remember, this is part one of three. Thank you.