223 RR Oga and Parsing with Yorick Peterse - Ruby Rogues

Hosted by:

Charles Max Wood

Jessica Kerr

Special Guests:

Yorick Peterse

Show Notes

02:35 - Yorick Peterse Introduction

03:07 - oga

nokogiri

06:38 - Fixing vs Writing an Alternative Feature

14:01 - Doing a Document Instead of a Programming Language

16:01 - Modifying XML Documents

17:19 - Inputting in Memory

19:09 - Extending oga with C

22:44 - Parsing

25:16 - Resources

LL Parser

28:57 - Lessons Learned Building oga

30:14 - Writing Parsers in Other Languages

31:19 - Getting Started

34:19 - Making oga and Using oga at Work

35:42 - Did it make a better API?

37:23 - The Community and Contribution

Documentation

Picks

AirPair (Chuck)
CAL(1) Shell Command (Jessica)
fish shell (Yorick)
asciinema (Yorick)

Transcript

[This episode is sponsored by Hired.com. Every week on Hired they run an auction where over a thousand tech companies in San Francisco, New York, and L.A. bid on Ruby developers, providing them with salary and equity upfront. The average Ruby developer gets an average of 5 to 15 introductory offers and an average salary offer of $130,000 a year. Users can either accept an offer and go right into interviewing with the company or deny them without any continuing obligations. It’s totally free for users. And when you’re hired, they give you a $2,000 signing bonus as a thank you for using them. But if you use the Ruby Rogues link, you’ll get a $4,000 bonus instead. Finally, if you’re not looking for a job and know someone who is, you can refer them to Hired and get a $1,337 bonus if they accept the job. Go sign up at Hired.com/RubyRogues.]

[This episode is sponsored by Codeship.com. Codeship is a hosted continuous delivery service focusing on speed, security and customizability. You can set up continuous integration in a matter of seconds and automatically deploy when your tests have passed. Codeship supports your GitHub and Bitbucket projects. You can get started with Codeship’s free plan today. Should you decide to go with a premium plan, you can save 20% off any plan for the next three months by using the code RubyRogues.]

[Snap is a hosted CI and continuous delivery that is simple and intuitive. Snap’s deployment pipelines deliver fast feedback and can push healthy builds to multiple environments automatically or on demand. Snap integrates deeply with GitHub and has great support for different languages, data stores, and testing frameworks. Snap deploys your application to cloud services like Heroku, Digital Ocean, AWS, and many more. Try Snap for free. Sign up at SnapCI.com/RubyRogues.]

[This episode is sponsored by DigitalOcean. DigitalOcean is the provider I use to host all of my creations. All the shows are hosted there along with any other projects I come up with. Their user interface is simple and easy to use. Their support is excellent and their VPS’s are backed on Solid State Drives and are fast and responsive. Check them out at DigitalOcean.com. If you use the code RubyRogues, you’ll get a $10 credit.]

CHUCK:

Hey everybody and welcome to episode 223 of the Ruby Rogues Podcast. This week on our panel we have Jessica Kerr.

JESSICA:

Good morning.

CHUCK:

I'm Charles Max Wood from DevChat.TV. And this week we have a special guest. That's Yorick Peterse.

YORICK:

[Chuckles] Hello.

CHUCK:

I'm sure I said it wrong.

YORICK:

Don't worry.

CHUCK:

It was American enough, right?

YORICK:

[Laughs] It lacks the certain amount of bald eagles and explosions and [inaudible].

JESSICA:

[Laughs]

CHUCK:

Oh, there we go. Do you want to introduce yourself?

YORICK:

Yeah, sure. So, I'm Yorick from the Netherlands. I've been doing Ruby for several years, lost count. I do all sorts of [nutty] little things. I work on Rubinius. I recently, well recently about a year ago or so, I started working on an XML parser, HTML parser, that [inaudible] stuff. Besides, I'm basically all over the place with fixing gems, that kind of stuff.

CHUCK:

We brought you on to talk about Oga and parsing and Ruby.

YORICK:

Mmhmm.

CHUCK:

So, do you want to give us a quick overview of Oga?

YORICK:

Where do I begin? [Chuckles] It's always difficult to…

CHUCK:

What does it do?

YORICK:

Oga is in short a parser for XML and HTML written in Ruby. And it also has support for querying documents using XPath, CSS, basically the standard set of features you'd expect from an XML parser. I started working on this a little over a year ago, I think. At that point, I had a bunch of applications that were using Nokogiri, a similar library, probably one of the most popular XML parsing libraries in Ruby, if not the most popular. And I was trying to run applications that were using that on Rubinius at the time and bumped into this particular problem where, for reasons still not entirely clear, it would crash quite often, to the point where I couldn't just decide like, “Okay, we'll just deal with the occasional crash and see if it actually yields us any benefits.”

But it just crashed so often that basically the application couldn't even do anything because it would already crash at that point. So, I spent some time together with a bunch of people trying to see if we could fix it. And after several weeks I decided like, “Okay, this is way over my head. [Chuckles] So, you know what? I'll write an alternative, because that is not way over my head.” Initially I actually thought that, “Oh, maybe I can do that in two, three months.” [Chuckles] It took a little bit longer than that. No, I actually started in February 2014. So, it's almost two years, actually.

CHUCK:

Wow, that's impressive.

YORICK:

Today it's at a point where it's actually useful. It's in certain areas not as fast. In certain areas it will be faster than Nokogiri in the near future. But already people are starting to use it. And perhaps one of the most common compliments I get is that it's so much easier to install because it's much smaller and doesn't require any system libraries whatsoever. So, that already is a nice achievement. Although [chuckles] if I look back at it, I might have been a little bit over-optimistic thinking I could do this in two, three months because it's the classical example of trying to shave a yak. You start and initially it looks okay. Then you bump into this problem and like, “Hmm, now we have to do this.” And then you fix that and two months later it's like, “Oh, now we also have to do that.” And it keeps repeating itself.

CHUCK:

So, I really want to… I know people are going to have questions about this. And I'm looking at it and it seems like for the most part it has at least the common features that you expect from a library like this.

YORICK:

Mmhmm.

CHUCK:

So, you can look things up by CSS. You can look it up by XPath. You can look it up by several other things. You can extract the data out of it. I want to, I'm torn because I want to talk a little bit about Oga but really, the thing that fascinates me is parsing.

YORICK:

Mmhmm.

CHUCK:

So, since we're already talking about Oga let's go ahead and talk about that for a little bit longer. And then yeah, then let's go into parsing because I think that's… anyway, that's a really interesting topic to me. I don't know about you, Jessica.

JESSICA:

Oh, I'm super fascinated and we can get back to this by the end of the show, by the part where he thought it would take two to three months. Also where he thought fixing was over his head, but Yorick you said that writing an alternative is not over your head.

YORICK:

Mmhmm.

CHUCK:

Yeah, let's [inaudible] that.

JESSICA:

It fascinates me.

YORICK:

It might be easier to start with that because that's probably shorter to explain. Basically the problem at the time is that a large part of Nokogiri's codebase is in C or Java, depending on whether they use JRuby or CRuby or Rubinius.

CHUCK:

Right, because it depends on libxml and libxslt.

YORICK:

Yes. So, the way they have it is for CRuby and Rubinius, it uses libxml and libxslt. And then for JRuby they use some different Java library. And they basically replicated the whole gem in Java for that. So, it's essentially two gems in one. The thing there is for me my C knowledge at least at that time was not extensive enough that I felt comfortable digging through both Nokogiri and potentially the libxml codebase trying to figure out what on earth was going on. There were also certain things where I felt like, “Okay, this I would want to do differently or this I don't like or whatever.” So, after several weeks I decided that initially I would sort of as a prototype try to see if I could make something that would work. [Chuckles] And that sort of escalated from there on, basically.

JESSICA:

[Chuckles] Did you find that there was a lot more to Nokogiri than you thought there was?

YORICK:

So for me, it has never really been that. I dislike certain behaviors Nokogiri has, or the fact that it's a pretty big thing to install and whatever. But the code in itself, not necessarily Nokogiri but libxml based on the knowledge I have now, I can definitely see why these libraries tend to be as big as they usually are or as complex as they are. Because the subject they're trying to deal with is also quite complex. So, definitely an understanding came from that.

CHUCK:

Yeah, I find it interesting that you went and implemented an XML parser completely in Ruby because Nokogiri essentially extends to wrap around libxml. So, all the parsing actually happens there. And then the rest of Nokogiri is just exposing an API that the programmer can use.

YORICK:

Yeah, so initially I think for the first six months I guess into the project, it was pure Ruby, like literally everything. And then once it got complete enough that I could comfortably say like, “Okay, this is sort of XML compatible,” I started benchmarking things. And especially with larger input documents where large is 10 megabytes or more, which isn't really big but it's a size where if you start testing parsers like this you'll begin to notice any performance problems and so on.

And so, here I noticed that, I don't remember the exact number, but I believe that the pure Ruby version was 10, whatever, many times slower than Nokogiri with no real room for improvement. That already was kind of problematic, if I recall correctly. And I had this file that was 10 megabytes. And basically, only the first phase of the whole parsing process would already take I believe five seconds. Now five seconds for 10 megabytes for me is just way too much. I can't tell people, “You should use this. But it will take you quite a long time,” especially if Nokogiri can do it in maybe 400 milliseconds for example.

So then, I made the decision to basically move part of the parsing loop or process to native code. So, for CRuby and Rubinius that will be C. And for JRuby that will be Java. But most of the actual logic, for example the part of the code that determines when certain tags have to be inserted automatically to deal with invalid XML for example, that would all be done in Ruby. And so basically the way it works now is that the loop that basically goes through the input, that is in native code. And that dispatches back to Ruby methods that do the actual work in terms of adding bits to the parse tree and then so on. That made it much faster. But it came at the drawback that it's no longer pure Ruby. So, right now it's about, I think the GitHub statistics, if they're any accurate, they show about 91% being Ruby and the other 9% being the rest, basically.

CHUCK:

So, you basically just moved what you had into C to get it to speed up?

YORICK:

Yeah. And the interesting thing there I found is that at some point I felt like, “Okay, what is the overhead of Ruby itself in this process?” So, I took the first phase called the lexer which basically takes the input and breaks it up in little chunks that… usually of a type and a value. They're called tokens. And I basically took that part and basically stubbed out all the Ruby method calls so that it would essentially just go through the input and so on. And that would take… I don't remember exactly but it could process data like a few gigabytes per second. And you add Ruby to the mix and it suddenly becomes a hundred megabytes for example. So, that was quite a bit of a shock, because I know Ruby would add overhead but I didn't expect it to be that bad. However, it did teach me fairly early on to just accept this as this is basically the limit, instead of being frustrated and trying to make it fast and not really gaining anything.

JESSICA:

Ooh.

CHUCK:

Now, I want to just clarify a couple of things here. When we're talking about lexers usually that's something that takes the string of code, or in your case a string of XML or HTML and breaks it up into tokens which just represent the structure.

YORICK:

Mmhmm.

CHUCK:

Of the document.

YORICK:

Mmhmm. So, the way you could see, say you have, let's say it's the string '10 plus 10' right? The lexer, what it would do, it would take that string. It would say, “Okay, 10 is a number. Plus is an operator. And hey, that other 10 is also a number.” And so, the result you get could be for example an array where it's like, three different arrays in it. It has a type and a value. So, it's like, array integer 10, array operator plus, whatever, et cetera. And these tokens are then used in the second stage which is the actual parsing stage. And they're used here to basically always build a parse tree, which is as the name suggests a tree that defines the structure of the document. There are a bunch of different algorithms. Some parsing libraries for example, they don't have a separate lexing phase. They do that in one go, which has benefits or drawbacks depending on what side you're coming from.
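[Editor's note: the lexing step described above can be sketched in a few lines of Ruby. This is an illustration only and not Oga's lexer; the method name lex and the token type names are assumptions made for the example, and it uses "+" rather than the spoken word "plus".]

```ruby
require 'strscan'

# Hypothetical helper, not part of Oga: turns a string such as "10 + 10"
# into an array of [type, value] pairs.
def lex(input)
  scanner = StringScanner.new(input)
  tokens  = []

  until scanner.eos?
    if scanner.skip(/\s+/)
      next # whitespace separates tokens but produces none
    elsif (digits = scanner.scan(/\d+/))
      tokens << [:integer, digits.to_i]
    elsif (op = scanner.scan(%r{[-+*/]}))
      tokens << [:operator, op]
    else
      raise ArgumentError, "unexpected character #{scanner.getch.inspect}"
    end
  end

  tokens
end

p lex('10 + 10')
# prints [[:integer, 10], [:operator, "+"], [:integer, 10]]
```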

CHUCK:

That makes sense. Usually you hear about this with languages. And I tend to talk about this a lot more on the JavaScript podcast that I do because there are so many languages that basically transpile to JavaScript. And so, it's much better understood over there and much more commonly used. In this case you're doing a document instead of an actual programming language.

YORICK:

Mmhmm.

CHUCK:

Do you find that there's a difference between the two in approach?

YORICK:

Yes. So, within Oga there are actually two types of parse trees that I build. For the XML stuff it basically directly builds the whole XML document, like these Ruby objects that you can query, et cetera. Whereas for the CSS selectors and XPath selectors, it will first compile it to an actual syntax tree that on itself can't really do anything. It's just a more refined definition of the syntax. And then there's in those cases an extra step which actually evaluates that syntax tree in order to query your document.

In the case of languages usually that extra step takes the syntax tree and turns it into for example bytecode which then a virtual machine can run. Or in case of a compiler, it will maybe turn it into an intermediate format that a compiler can then turn into machine code. So, depending on your use case, it's usually, lexer, parse, and then the thing that uses it. Or there can be several steps after. If your program is simple enough, some people they also only have a lexing phase and the parser parses it but also evaluates it on the fly. For example, if you're building a calculator, that could be an option.

With Oga, you have for the XML and HTML stuff, it just builds the document object right away. And then for everything else there's an extra step after it. But the core concept of it, it's not that different from programming languages.
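[Editor's note: a minimal sketch of the "lexer, parser, then the thing that uses it" pipeline, assuming tokens in the [type, value] shape discussed earlier. The TreeParser class, the evaluate method and the nested-array tree format are hypothetical and chosen for brevity; they are not how Oga represents XPath or CSS syntax trees.]

```ruby
# Hypothetical mini-parser: tokens in, nested-array syntax tree out.
class TreeParser
  def initialize(tokens)
    @tokens = tokens.dup
  end

  def parse
    tree = expression
    raise ArgumentError, 'unexpected trailing tokens' unless @tokens.empty?
    tree
  end

  private

  # expression = term { ("+" | "-") term }
  def expression
    left = term
    while operator?('+', '-')
      op = @tokens.shift[1]
      left = [op.to_sym, left, term]
    end
    left
  end

  # term = integer { ("*" | "/") integer }
  def term
    left = integer
    while operator?('*', '/')
      op = @tokens.shift[1]
      left = [op.to_sym, left, integer]
    end
    left
  end

  def integer
    type, value = @tokens.shift
    raise ArgumentError, "expected an integer, got #{value.inspect}" unless type == :integer
    [:int, value]
  end

  def operator?(*symbols)
    type, value = @tokens.first
    type == :operator && symbols.include?(value)
  end
end

# The separate, later step: walk the finished tree and compute a result.
def evaluate(node)
  type, left, right = node
  return left if type == :int

  evaluate(left).public_send(type, evaluate(right))
end

tokens = [[:integer, 10], [:operator, '+'], [:integer, 10],
          [:operator, '*'], [:integer, 2]]
tree = TreeParser.new(tokens).parse
p tree           # prints [:+, [:int, 10], [:*, [:int, 10], [:int, 2]]]
p evaluate(tree) # prints 30
```

[A calculator that evaluates on the fly would compute the numbers inside expression and term directly instead of returning tree nodes.]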

CHUCK:

Right. So, the other question I have then is does Oga allow you to modify XML documents or build XML documents?

YORICK:

Yes. You can modify, insert, the whole shebang basically.

CHUCK:

And when you do that, do you insert things into the parse tree and then go back the other way? Or export in a particular manner?

YORICK:

The way it works is that you can… say you have your parsed document you can create a certain element and then you can just append directly to its list of child nodes.

CHUCK:

Mmhmm.

YORICK:

Or you can change attributes directly. But you can also just create an element on its own and then add elements to it. The actual XML classes, the element class or the document class or whatever, they're used by the parser but they don't require the parser. They're basically just the result of it.

CHUCK:

Okay.

YORICK:

So, you can use them standalone. So, you in theory similar to for example the builder gem which you can use to build XML documents using this DSL, similar to that you could build an XML document without ever using the parser, by just creating an element and adding stuff to it and setting attributes, et cetera.
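[Editor's note: the idea of building a document from plain objects, without ever running a parser, can be illustrated with a stand-in class. The Element class below is hypothetical and deliberately tiny; it is not Oga's element class and it does not escape special characters.]

```ruby
# Hypothetical stand-in for an XML element class (not Oga's own class).
class Element
  attr_reader :name, :attributes, :children

  def initialize(name, attributes = {})
    @name       = name
    @attributes = attributes
    @children   = []
  end

  # Serialises the element and its children. No escaping is done here,
  # so this is only suitable for illustration.
  def to_xml
    attrs = attributes.map { |key, value| %( #{key}="#{value}") }.join
    body  = children.map { |child| child.is_a?(Element) ? child.to_xml : child.to_s }.join
    "<#{name}#{attrs}>#{body}</#{name}>"
  end
end

root   = Element.new('people')
person = Element.new('person', 'id' => '1')

person.children << 'Alice'           # append a text child
root.children << person              # append an element child
person.attributes['role'] = 'admin'  # the objects stay mutable afterwards

puts root.to_xml
# prints <people><person id="1" role="admin">Alice</person></people>
```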

JESSICA:

The XML document class is mutable, then?

YORICK:

Yes.

JESSICA:

Does Oga have to hold the entire document in-memory at once to work with it?

YORICK:

The resulting document, yes, if you use the standard parsing method. However during the parsing process, it doesn't have to keep everything in terms of input in-memory. You can provide different types of input. You can for example just give a string. So, if you have a small chunk of XML that's probably the easiest way. But you can also give a file object or an enumerator, for example. So, you could in theory do stuff where you download an XML file from the network and then as you're downloading it you have it parsed. And then those chunks of input, they only stay in-memory as long as they're basically being parsed. However, the result, that stays in-memory for as long as there are any references to it.
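[Editor's note: a sketch of how one entry point can accept a string, a file-like object, or an enumerator and only hold one chunk of input at a time. The helper names each_chunk and count_open_brackets are assumptions for this example, and counting "<" characters merely stands in for real parsing work.]

```ruby
require 'stringio'

# Hypothetical helper: normalises the three kinds of input into a
# stream of string chunks, yielding one chunk at a time.
def each_chunk(input, size = 4096)
  return enum_for(:each_chunk, input, size) unless block_given?

  if input.is_a?(String)
    yield input
  elsif input.respond_to?(:read)   # File, StringIO, sockets
    while (chunk = input.read(size))
      yield chunk
    end
  elsif input.respond_to?(:each)   # Enumerator or any Enumerable
    input.each { |chunk| yield chunk }
  else
    raise ArgumentError, "unsupported input: #{input.class}"
  end
end

# Stand-in for parsing: each chunk is inspected and then dropped.
def count_open_brackets(input)
  each_chunk(input, 8).sum { |chunk| chunk.count('<') }
end

xml = '<root><a>1</a><b>2</b></root>'

# An enumerator that produces the document in 7-character pieces,
# as a network download might.
chunks = Enumerator.new do |yielder|
  xml.each_char.each_slice(7) { |slice| yielder << slice.join }
end

p count_open_brackets(xml)               # prints 6
p count_open_brackets(StringIO.new(xml)) # prints 6
p count_open_brackets(chunks)            # prints 6
```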

JESSICA:

Okay.

YORICK:

There are ways to do stream parsing, for example. Sorry, pull parsing. And it has a SAX parsing API as well.

JESSICA:

Wow.

YORICK:

Those are typically APIs you would use if you have really big documents and you want to have a bit more control over how it's parsed and so on. And those, they typically use far less memory but they're a bit more difficult to use as a result of being much more low-level.

JESSICA:

Right. So, Oga has SAX parsing as well?

YORICK:

Yes.

JESSICA:

What does it not have?

YORICK:

[Laughs] It doesn't support XSLT. And I don't really have any intentions for that, either. At some point somebody suggested it. I looked at it and I was like, “Oh, I prefer not to do that.” [Laughs]

CHUCK:

[Chuckles]

YORICK:

It'll probably depress me a little bit too much.

JESSICA:

[Chuckles]

CHUCK:

Yeah. When you talked about extending Oga with C, I'm assuming you used FFI and that works both in JRuby and C. How do you evaluate your program to say, “This would greatly benefit from C and here are the ways that I need to drop down to native code in order to do the work”?

YORICK:

No, it doesn't use FFI. I actually looked into that. But I find it to be too difficult to use. The reason for that is that Oga is essentially from the core built up to support streaming. And that will mean that during the parsing phase the parser would have to be able to be paused at any stage, because it would have to wait for this extra chunk of input to arrive from wherever. The tricky thing there is that certain rules of the parser can emit multiple tokens at once.

And so, if you would write a classical C program, whenever you would emit a token either you would push it into some sort of array or you would return it. If you would push into an array, that's probably what you would use if you just want to give back the final result in one go. To return it, then you could actually do streaming because you would return every chunk without having to keep it around. This however is quite difficult if you have to return multiple tokens at once. It's a bit tricky to explain. But the setup I have now is that the native code are actual Ruby extensions. So, for JRuby it uses the JRuby extension API and for CRuby and Rubinius the C API.

The way it works is that whenever it has to emit a token, it calls back into Ruby. And that essentially yields a block. So, every time that block just gets one token. That way, you don't have to keep anything in-memory for too long. And the setup I have here is essentially that there's a Ruby class which has these callback methods, let's call them that, that call on_element, on_comment, et cetera. And then the lexer calls back into these methods whenever it needs to. And that Ruby code then determines what has to be done with it.

Maybe if I look back at it, it would have been nice because you could use the same code on both CRuby and JRuby. Whereas now I have to essentially mirror some code between the different implementations. Though the way I've done that is in the lexer there's the set of rules that determine how it has to operate. Those are in a separate file. And I've made sure that the syntax of this file and everything that it uses is compatible with both C and Java.

So for example, if you would look at this file, you would see something like a function call called callback and it gives a bunch of arguments. And the syntax is compatible with both C and Java. And then in the extensions, they define those functions so that they do whatever they have to do for that specific language. So, the actual set of rules is reused. And then there's this little bit of support code for both C and Java that then hooks that up to Ruby. That way, I can do as little in these low-level languages as possible.
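[Editor's note: the callback arrangement described above, where the scanning loop hands over one token at a time and Ruby methods decide what to do with it, can be sketched in pure Ruby. The callback names on_element and on_comment come from the discussion; everything else (each_token, Handler, on_element_end, on_text, the regular expressions) is hypothetical, and the loop here is ordinary Ruby standing in for the native C or Java part.]

```ruby
require 'strscan'

# Hypothetical, heavily simplified lexer. Every token is yielded as soon
# as it is recognised, so no token list is kept in memory.
def each_token(input)
  return enum_for(:each_token, input) unless block_given?

  scanner = StringScanner.new(input)
  until scanner.eos?
    if scanner.scan(/<!--(.*?)-->/m)
      yield :comment, scanner[1]
    elsif scanner.scan(%r{</(\w+)>})
      yield :element_end, scanner[1]
    elsif scanner.scan(/<(\w+)>/)
      yield :element, scanner[1]
    elsif (text = scanner.scan(/[^<]+/))
      yield :text, text
    else
      raise ArgumentError, "cannot lex input at position #{scanner.pos}"
    end
  end
end

# The Ruby side: callback methods that decide what to do with each token.
class Handler
  attr_reader :events

  def initialize
    @events = []
  end

  def on_element(name)
    @events << "open #{name}"
  end

  def on_element_end(name)
    @events << "close #{name}"
  end

  def on_comment(text)
    @events << "comment #{text.strip}"
  end

  def on_text(text)
    @events << "text #{text}"
  end
end

handler = Handler.new
each_token('<p>hi<!-- note --></p>') do |type, value|
  handler.public_send("on_#{type}", value)
end

p handler.events
# prints ["open p", "text hi", "comment note", "close p"]
```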

CHUCK:

Yeah, that makes sense.

YORICK:

And then the parser has a similar setup. Though the parser's a little bit more complex. There I also had to duplicate some logic in the actual parsing library that I use, which brings us to parsing. The parser, it used to use this library called Racc which is R-A-C-C, [inaudible] also used by Rails for the routes files they have. And that was, Racc is also written in C. I used that for quite a while. At some point, I bumped into the problem where at least for me it wasn't fast enough. And also Racc parsers tend to be a bit tricky to debug due to the algorithm they use. So, if there will be a conflict anywhere in your rules, it would give this weird error saying something like, “Shift reduce conflict,” and you then have to figure out, “Okay, what's actually going on?”

So, I wrote a library called Ruby-LL for that which is again, it's part C and Ruby. The actual parsing loop happens in C or Java. And then again, it calls back in Ruby for as much as possible. There I went a little bit further with doing stuff in low-level code than in Oga itself. For example, for this particular parser I needed dynamic arrays, which in Java are very easy. They're built-in in the language. But in C, they don't exist. You have to either write them manually or you have to use one out of 15,000 libraries out there that supports it. Or you have to use Ruby arrays, but I found those would slow things down too much.

So, what I did there is for example for the C code I had to use this library which basically provided you with dynamic arrays in C. But then I would have to, because that code in itself is not compatible with Java syntax-wise also, you have to duplicate that with whatever Java equivalents there are. So, in this particular Ruby-LL library there are basically two bits of code for C and Java that are similar. If you put them next to each other, you can see the similarities between the two. Basically the Java version is a direct port of the C version adjusted to actually be Java code. That was a bit more tricky to write I guess than I initially anticipated. That whole process took me I think about a month or three, of which the vast majority was spent reading up all these resources. How you do parsing, how you implement them, what different algorithms are there, which one is suitable for what I need, and so on.

CHUCK:

What are some of the resources you looked at for that?

YORICK:

When I started doing that, my goal was to write a parser that wasn't necessarily just faster but also easy to debug and generally just less code to worry about. Racc there, in general it's a stable library but it has accumulated quite a bit of code over the years. It's a library from initially I think 2001 or 2 or something. So, it's been around for quite a while. And it has these weird conditions where I believe it still supports Ruby 1.8 or 1.9. I'm not sure which one of the two. So, debugging it is [sighs] it's not as convenient.

So, when I started this out I basically knew nothing about parsing algorithms. I knew a little bit from maybe what I've read on Wikipedia or what I've heard from others. But if somebody would have asked, “Hey, write your own parser,” I'd be like, “I don't know how to do that.” So, the first step there was to find resources that would explain how, what different algorithms there are, et cetera. And I was actually quite surprised that there's not that much available that is really useful when you're completely new to it.

I found like tons of PowerPoint presentations from universities for example that explain how you construct parse tables that determine, “Oh, if there's this type of input then you have to go to this row and do this thing, or else you have to do that,” et cetera. And there were a lot of papers that describe these complex algorithms that you could use and so on. The recurring problem I found there is that there were very little that actually showed code. And I don't have any form of academic background, so reading these papers for me was pretty difficult. Even though English is a language I'm pretty proficient in, I found it very difficult to read. So, I gave up on that fairly quickly. [Laughs]

But yeah, I think I spent probably one, one and a half, maybe two months, just digging up as much resources as I could before the frustration then came from it. And I think in the end the most useful bit was the Wikipedia page on this particular parsing algorithm called LL(1), because that one actually had an example. It had one written in Python and one in C++. And certainly, the C++ code wasn't even complete. So, that was not particularly useful. But at least it had an example and the Python one worked. So, I then spent probably two, three weeks trying to decipher that Python example. I translated that to Ruby code and from there on I started climbing up the hill and building on that knowledge.

And initially, basically when I had enough knowledge I wrote this parser by hand with all the rules that I calculated manually. I had a notebook with pages full like, “Okay, if it's this type of input, it has to go here and then do that,” et cetera. And then I benchmarked that to see if, when I would generate this code, it would actually be better. And that particular code was I believe already a little bit faster than Racc. Not significantly but fast enough that it was worth continuing. And I think the end result that I have now, it's only up to 1.7 times faster or whatever. It really depends on the parser, how you've written it, et cetera. But it was quite a painful process.

And one of the things I still want to do is write about it, like write a whole guide. Like, “Oh, if you want to write a parser using this particular algorithm, this is how you do it. This is everything you need to know.” But that too will probably be a month or two of writing work, because it's pretty difficult to explain.
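[Editor's note: for readers who, like Yorick, learn best from code, here is a minimal table-driven LL(1) parser in Ruby. It uses a small textbook grammar that is often used to introduce the algorithm; the grammar, the hand-written parse table and the name ll1_parse are choices made for this sketch and are not taken from ruby-ll. The parser reports which grammar rules it applied, in order.]

```ruby
# Grammar (rule numbers appear in the output):
#   1. S -> F
#   2. S -> ( S + F )
#   3. F -> a
RULES = {
  1 => [:F],
  2 => ['(', :S, '+', :F, ')'],
  3 => ['a']
}.freeze

# Parse table worked out by hand: [non-terminal, lookahead] => rule number.
TABLE = {
  [:S, '('] => 2,
  [:S, 'a'] => 1,
  [:F, 'a'] => 3
}.freeze

def ll1_parse(input)
  tokens   = input.chars + ['$'] # '$' marks the end of the input
  stack    = ['$', :S]           # the top of the stack is the last element
  applied  = []
  position = 0

  until stack.empty?
    top       = stack.pop
    lookahead = tokens[position]

    if top.is_a?(Symbol)
      # Non-terminal: one token of lookahead selects the rule to expand.
      rule = TABLE[[top, lookahead]]
      raise ArgumentError, "unexpected #{lookahead.inspect} at #{position}" unless rule

      applied << rule
      stack.concat(RULES[rule].reverse)
    elsif top == lookahead
      # Terminal that matches the input: consume it.
      position += 1
    else
      raise ArgumentError, "expected #{top.inspect}, got #{lookahead.inspect} at #{position}"
    end
  end

  raise ArgumentError, 'trailing input' unless position == tokens.length

  applied
end

p ll1_parse('(a+a)')
# prints [2, 1, 3, 3]
```

[The notebook pages mentioned above correspond to TABLE here: for each non-terminal and each possible next token, the table says which rule to apply. A parser generator computes that table from the grammar instead of having it written by hand.]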

CHUCK:

I'm curious. What have you learned in the process of building Oga?

YORICK:

Oof, a ton. [Laughs]

CHUCK:

What stands out as something that you don't think you would have learned doing other projects?

YORICK:

Writing parsers, in particular. Because the idea of writing an XML/HTML parser itself or XML library, let's call it, to differentiate it from regular parsers, that you can do without knowledge of how to implement an actual parser because you could just use a library like Racc for example. And you read up on how you define the rules and that's basically all you need to know. But if you want to understand these algorithms, there is suddenly a ton of stuff you need to know.

I guess you can compare it to maybe transpiling to JavaScript versus writing your own virtual machine for example. There are similarities between the two, but one is significantly more difficult than the other. And just by doing that, you'll learn a ton of things. So for me, the biggest gain in terms of knowledge was knowing how to write parsers myself. Also understanding in particular this LL(1) algorithm.

JESSICA:

Now that you've done that in Ruby, do you feel like you could do it pretty quickly in a different language?

YORICK:

The particular parser I wrote, an LL(1), yes. It's a fun thing that, well fun thing but at that time it's rather frustrating, that these algorithms when you start out, they look very daunting. And you think, “Oh dear lord, how am I going to do this?” But then once you've done it, you look back at it like, “Huh, that was actually not that bad.” And if I look back at this particular algorithm, the algorithm itself is relatively, it's not that difficult. It's just that all the resources are pretty thick and difficult to understand. So, you spend a lot of time reading before you're actually into what you are reading. So, if I would look at this now, I've actually been thinking about porting this particular Ruby-LL library to Rust, because I might have a need for it myself. There, I've decided to wait with it, because it will probably take quite some time. But the actual algorithm, yeah I could probably port it to different languages by this point.

JESSICA:

It sounds like you… you said that the algorithm wasn't that hard but you say that after you wrote it.

YORICK:

Yes.

JESSICA:

Sounds like it wasn't easy to get started with.

YORICK:

So, there are definitely parts you have to understand first in order to understand the algorithm. But if you look at just the code for example, the actual code needed for it is not that complex. It's not as if you're trying to write operating system level of complexity. It's just that, it's like a really big doorstep, basically. And that's for me, if I look back at it, was the biggest problem in this whole process. And I hope that if I find the time to write about it, that becomes much easier for others. So, they don't have to go through that same amount of trouble trying to figure out just how they should start, basically.

JESSICA:

Yeah, that would be a big contribution to build a ramp up that doorstep.

CHUCK:

Yup.

JESSICA:

I also find it interesting that the most helpful information that you found had an actual code example, because people can talk about it in English all day but there's no language as precise as working code.

YORICK:

Yeah, exactly. And I suppose maybe technical papers, if you're native English or if you have an academic background, I suppose papers might be more convenient. But for me, they… basically unless people really recommend them, I avoid them because I know I just won't be able to finish reading them. I just can't get myself through it. However, if there's an example in code, I can reverse-engineer that pretty quickly and then see like, “Oh, so that's how it works. Oh, that's not that difficult after all.” But yeah, in the end, code is yeah the best example you could probably ever get. So, in that case, I don't know who wrote the particular Python example of this Wikipedia page. But whoever did it, I probably owe them several beers or tea or coffee or whatever they drink.

JESSICA:

[Chuckles]

YORICK:

Because without that, I probably wouldn't have gotten as far as I am now.

JESSICA:

And you're clearly a very persistent person.

YORICK:

Very stubborn.

JESSICA:

No, I don't think it's you.

YORICK:

[Laughs]

JESSICA:

I think it's the papers.

YORICK:

[Kooky]. When I was doing this, basically every evening I was banging my head against the wall and I was really asking myself, “Okay, why am I doing this? I should just stop and do something less troublesome.” But if I look back at it, I'm glad I just pushed it through. I'm glad I'm stubborn because otherwise probably I wouldn't have gotten this far.

JESSICA:

Yeah. It took the stubbornness, too. But now you've accomplished it. You've got something useful. If you write something up, then you might become famous.

YORICK:

[Laughs] Fame isn't really a goal for me. I guess it's fun if people go, “Oh hey, you're that person.” For me ultimately it's about trying to make people's lives easier whether that's by writing something in terms of text or code or so on.

JESSICA:

Did you do all of this in your own time, just for your own excitement?

YORICK:

Yes. Basically, 99% was in my spare evenings, weekends, et cetera. If I had five days a week, unlimited time, I probably could have done it much faster, of course. Now that for example, at my employer, we're using it, I have an excuse to also spend some company time on it, if there are certain parts that need to be improved or fixed or whatever. But it's still largely a project in my spare time.

JESSICA:

How long was it before you were able to use it at work?

YORICK:

I think I didn't get comfortable with it until two or three months back. By that point, most of the core features we needed were there. Performance was good enough to at least start testing it. And I put it in production for a few smaller applications that we have that are a bit more standalone than the rest. So, if they will break or not perform as well, it wouldn't really matter that much. I'm now at a point where I will be comfortable replacing all our uses of Nokogiri with it, although that's quite a lot of work, because we use it basically everywhere. So, in total it probably took a little over a year to get from zero to good enough.

JESSICA:

Did implementing this in 90% Ruby, 10% C instead of the other way around in Nokogiri, did it let you make a better API?

YORICK:

That I don't know. The way APIs are designed is more about the people that write them than what they are written in or what things they use. The language certainly plays a role. For example, Ruby makes it much easier to write elegant DSLs than for example C. I think that the API I have now, Nokogiri could have done just that while still using libxml and so on. It's more that in this case I had a clean start. I knew what I wanted based on my usage of Nokogiri but also other libraries. And I also knew what I didn't want. But if any library would basically start over again, the things they use don't necessarily limit them in how they can make their API. The tools I used didn't really affect the resulting API. It was more that I felt they had to be a certain way.

JESSICA:

So, it was your understanding of the problem and the usage of the library that helped?

YORICK:

Part of it, yes. And part of it is also bluntly put, personal opinion. [Chuckles]

JESSICA:

[Chuckles]

YORICK:

I have certain opinions on how I want to do things or how I want to make things and so on. And when you write your own code, you have that freedom. Whether that has ultimately been the best choice, I don't know yet. That's something you can only judge several years after when people use it, get feedback, et cetera. However, so far I've been quite happy with it.

JESSICA:

I noticed that so far, there is only a few small… that you've got some contributors with some small pull requests in Oga. Are you hoping to build more of a community around it?

YORICK:

Definitely. I would love for more people to help out with it. I think the difficult thing here is that the topic itself is maybe a bit daunting for a lot of people, because it does require some reading up. And you have to be familiar with the concepts and so on. And for some, XML parsing just doesn't sound interesting. I don't know. They might want to write a, whatever, a Twitter gem or some library for Amazon or whatever. I've had a few people that have consistently submitted patches for various things. That's absolutely wonderful. The tricky thing there is up until fairly recently, there was a lot of big plumbing going on. Like I would move these big parts around, especially prior to version 1. It was basically a battlefield, code flying all over the place.

JESSICA:

[Chuckles]

YORICK:

Like one day I would make this small patch and then the other day there would be thousands of lines changed, et cetera. That's now starting to settle down, which should make it much more convenient for people to contribute. Because I don't have to tell them, “Oh, wait with this because I'm refactoring this whole thing.” But I'm definitely looking forward to having more people submit changes but also suggestions or bug reports, basically just people using it. I've been keeping an eye on the gems that depend on it. It's slowly increasing. But I think realistically it will probably take another let's say year or two before it's really popular in the sense that people say, “Oh yeah, I'm using Oga,” and, “Oh yeah, that's that thing,” instead of, “What?” [Chuckles] But I don't know. It might change. Maybe in six months from now, everybody's using it. I can only wait and see.

JESSICA:

This is one of the places where if you write up how to parse, how to understand this parsing algorithm, that would help you with contributors.

YORICK:

Part of it, yes. Another part there that's something I've been toying with for a long time, if you look at code and in particular documentation for example, usually what you see is the result, the actual code and then documentation. A good documentation will show intents and reasoning, for example. Like if there's this weird piece of code it might say, “Oh, we did this because of that reason.” However, it's very difficult using these mediums to describe thought. And it's for me, at least the thought process that guides a lot of things.

So, I hope that I can find a way to explain it to people so that they can know, “Oh, okay. So, if I want to do this, this is the process I would have to take, or this is the reason why it was taken.” Because I feel that will make it more easy to contribute, not just to Oga but basically any project that would have that information available. Because it would let people, how would you describe it, transfer themselves as if they… like transfer themselves into the mind of the original author so they can think alike, which would hopefully improve the whole contribution process. Because one thing I do notice, certain parts that are totally obvious for me, people are like, “Oh, I would have done it this way.” I'm like, “No, it's better that way because I've already done this that many times.” How I will explain it, I don't know yet. Probably just a lot of blog posts. [Chuckles]

CHUCK:

[Chuckles]

YORICK:

Basically there's quite a bit of writing up I would have to do before I think more people will be interested besides maybe the ones that already have a slightly higher technical understanding of this particular topic. But yeah, I will basically wait and see and lots of writing to be done there.

JESSICA:

I know. I've found in my experience that when I take the time to write something up, it takes a ton of time. But then it's like I've taken that little piece of thought and it's gone from useful to me to useful to hundreds of people. And years later, I'll be surprised. Somebody will be like, “Oh, that was really helpful.”

YORICK:

Oh yeah, definitely. And for example, doing this whole process of writing these tools, I took a ton of notes in my notebooks. One of the things I've been thinking is trying to scan those. I don't know how useful it will be if it's just scanned papers. But at least you would see the notes I took, the thoughts, the problems I experienced, that you might not see in the code directly. But yeah, writing all that takes about as much time as writing the actual code itself, if not more. Because with code you can basically translate your knowledge directly into code in a way that you understand it. But if you have to… if you want other people to understand it, probably you have to make pretty radical changes, not just to the code but also the way you explain things and maybe how you think about things to ease that process. Yeah, which is very time-consuming. Although it's ultimately, if you would write, it can be tremendously useful.

I think for example the best example there, if you look at Rails, the code itself is pretty clear usually. But the documentation is perhaps one of the bigger reasons it's gotten so popular. The documentation is probably one of the best examples of good code documentation, and all the Rails guides, et cetera. So, they did a phenomenal job in explaining things, how you do things, et cetera.

But yeah, that was the work of I think dozens, if not hundreds of people over many, many years.

JESSICA:

Well, that's a great point, that Rails being easy to use, it's not just about the code. It's not just about the API. It's about the documentation. People built that ramp.

YORICK:

Yeah. The thing is, especially if you climb the complexity ladder, if you look at typically more low-level libraries, the documentation is what makes or breaks it. For example, I've been meaning to learn about LLVM which is this library toolkit you can use to write compilers for example. I've been meaning to learn that but the library itself is really complex and the documentation is so-so. They have API documentation which is great if you know what you're looking for, for, “Oh there was that class with that one method. What was it called again and what arguments do I have to pass it?” And besides that, they have some basic guides. But in between these absolute basics like the 'hello world' level documentation and the API documentation there's nothing. So, I've tried it numerous times. Maybe I just couldn't find good resources. But if there was a lot better documentation I probably would have been a lot further in that process than I am now.

And if you have a very simple gem, probably you can get away with just a basic example like, “Oh, this is how you use it. That's it.” But I would say that maybe the required quality of your documentation scales exponentially relative to the complexity of the code. But that's just a theory.

CHUCK:

Alright. Well, if people want to find or follow you or follow up on Oga or follow up on parsing stuff or things like that, where should they go?

YORICK:

My Twitter account is probably the best place. It's also usually filled with sarcastic and slightly ranty tweets, so just be warned. [Laughs] But that is @YorickPeterse on Twitter, just my full name. You could basically find me on that name pretty much anywhere. And probably my website. That's where I would write the bigger articles, which again is just YorickPeterse.com. I'm probably the easiest person to find on the internet.

CHUCK:

Alright then. Before we get to the picks, I just want to acknowledge our silver sponsors.

[This episode is sponsored by Code School. Code School is an online learning destination for existing and aspiring developers that teaches through entertaining content. They provide immersive video lessons with in-browser challenges which means that each course has a unique theme and storyline and feels much more like a game. Whether you've been programming for a long time or have only just begun, Code School has something for everyone. You can master Ruby on Rails or JavaScript as well as Git, HTML, CSS, and iOS. And more than a million people around the world use Code School to improve their development skills by learning or doing. You can find more information at CodeSchool.com/RubyRogues.]

CHUCK:

I'll go ahead and start us off with the picks this time. The first pick I have is AirPair. If you're not familiar with AirPair, it's AirPair.com. And what it is, is it's a place where people can go and they can get help with coaching or with different areas of interest or topics. It's mostly focused around programming. Anyway, I've been able to help several people there with Rails, with databases, with things like that. I'm starting to get into helping people with Angular. So, it's a great place. I'm one of the folks that they match people up with on there. But there are a whole bunch of other people. So, if you're running into a problem, you want to sit down with a coach for an hour or so, then that's a great way to go.

That's the only pick I have today. Jessica, do you have some picks?

JESSICA:

Okay. Well, how about I type and you read my pick?

CHUCK:

That's pretty dangerous. I could say anything. [Laughter]

CHUCK:

So, her pick is a shell command. It's 'cal'. It displays a calendar in ASCII on the terminal. Now I got to check this out. Oh, that's cool.

JESSICA:

[Chuckles]

CHUCK:

I'm so going to use that. Because I can never remember what day it is.

YORICK:

[Laughs]

CHUCK:

And I'm always on the terminal. So, that's really freaking handy. If you do 'cal 2015' then yeah, it shows you the entire year. And 'cal oct 2015' shows you October 2015. But the thing that I'm seeing is that it highlights the date. So, if I type in cal, it shows me this month and highlights today, which is super handy. She also points out that 'cal 15' will show you the year 15 A.D. So, [chuckles] just keep that in mind.

Yorick, do you have some picks for us?

YORICK:

Yes. I have two, actually. First one is called Fish. It's a shell. I started using it last week, absolutely love it. It basically does a whole bunch of things like auto-suggestion, syntax highlighting. It has a better shell language compared to Bash at least. And I've been absolutely loving it so far.

The second one, I'd have to see if I pronounce it correctly. It's Asciinema. It's basically ASCII cinema combined into one word. And it's a screencasting application for terminals. And so, you start your terminal, you run this command, and it will start recording whatever you type and then upload it. And that's actually been pretty useful for showing examples, so how we use things. For example, I used it to show what Fish does. Because that [inaudible] highlighting that works better when it's an actual video instead of just the final text output.

So, those are my picks.

CHUCK:

Awesome. Well, thank you for coming. It was a fun conversation. And I'm going to have to look into using Oga for some of my XML parsing needs.

YORICK:

Mmhmm. Yeah, it was great being here.

JESSICA:

Yeah, thank you very much. I loved your thoughts on documentation and just how hard it is to [do these things].

YORICK:

Mmhmm.

CHUCK:

Yeah. And yeah, 'cal' doesn't highlight today's date in my terminal on my Mac but it does on the Linux machine where I typed it in initially.

JESSICA:

Oh.

YORICK:

Oh. It doesn't highlight it for me on my Linux laptop, though. So, that's an interesting difference there.

CHUCK:

Yeah, I wonder if it's just some terminal setting or something there. Alright, anyway, thanks for coming and we'll catch everyone next week.

[Hosting and bandwidth provided by the Blue Box Group. Check them out at BlueBox.net.]

[Bandwidth for this segment is provided by CacheFly, the world’s fastest CDN. Deliver your content fast with CacheFly. Visit CacheFly.com to learn more.]

[Would you like to join a conversation with the Rogues and their guests? Want to support the show? We have a forum that allows you to join the conversation and support the show at the same time. You can sign up at RubyRogues.com/Parley.]
