Show Notes

Check out JS Remote Conf!
 
02:07 - Rob Miller Introduction
03:11 - Why does text processing matter?
07:32 - One-off Data Processing, Core Competency
10:36 - Processing Less-structured, Unstructured Data
12:45 - The Command Line 
29:15 - Abstractions and Refactoring
35:12 - Munging: Tools and Practices
40:57 - Text Processing for Textual Visual Things
42:57 - Parallelization 
45:45 - Fanning Data
Picks
Special Guest: Rob Miller.

Transcript

 

DAVID:

That’s adorable.

JESSICA:

Oh, there he goes. [cat sound]

AVDI:

That’s my cat saying hello.

[Laughter]

[This episode is sponsored by Hired.com. Every week on Hired, they run an auction where over a thousand tech companies in San Francisco, New York, and L.A. bid on Ruby developers, providing them with salary and equity upfront. The average Ruby developer gets an average of 5 to 15 introductory offers and an average salary offer of $130,000 a year. Users can either accept an offer and go right into interviewing with the company or deny them without any continuing obligations. It’s totally free for users. And when you’re hired, they give you a $2,000 signing bonus as a thank you for using them. But if you use the Ruby Rogues link, you’ll get a $4,000 bonus instead. Finally, if you’re not looking for a job but know someone who is, you can refer them to Hired and get a $1,337 bonus if they accept the job. Go sign up at Hired.com/RubyRogues.]

[Snap is a hosted CI and continuous delivery that is simple and intuitive. Snap's deployment pipelines deliver fast feedback and can push healthy builds to multiple environments automatically or on demand. Snap integrates deeply with GitHub and has great support for different languages, data stores, and testing frameworks. Snap deploys your application to cloud services like Heroku, DigitalOcean, AWS, and many more. Try Snap for free. Sign up at SnapCI.com/RubyRogues.]

[This episode is sponsored by DigitalOcean. DigitalOcean is the provider I use to host all of my creations. All the shows are hosted there along with any other projects I come up with. Their user interface is simple and easy to use. Their support is excellent and their VPS’s are backed on Solid State Drives and are fast and responsive. Check them out at DigitalOcean.com. If you use the code Ruby Rogues, you’ll get a $10 credit.]

CHUCK:

Hey everybody and welcome to episode 235 of the Ruby Rogues Podcast. This week on our panel, we have Jessica Kerr.

JESSICA:

Good morning.

CHUCK:

Avdi Grimm.

AVDI:

Hello from Tennessee.

CHUCK:

David Brady.

DAVID:

Do not get me started about units of weights and measures.

CHUCK:

I’m Charles Max Wood from Devchat.TV. Quick shout-out for JS Remote Conf. Go check it out. We also have a special guest this week, and that is Rob Miller.

ROB:

Hello from a very windy and chilly London.

CHUCK:

Do you want to introduce yourself?

ROB:

Sure. I’m Rob Miller. I live in London where I work for a marketing company called Big Fish. I love Ruby and I just published a book called Text Processing with Ruby and I think that’s what we’re going to talk about today.

CHUCK:

I was going to say our topic is Text Processing with Ruby and what qualifies you to talk about this but you beat me to it, so…

ROB:

Sorry.

CHUCK:

Do you have a place that you like to start with this topic?

DAVID:

Like [inaudible]. [Laughter]

ROB:

Sure, I can start with my motivation for why I wrote the book.

CHUCK:

That’s a good one.

AVDI:

Rob, why did you get into text processing with Ruby?

ROB:

Well, it was actually accidental. I got into text processing because my job frequently demands that I manipulate lots of data. You know, system A has data in this format and it needs to be in system B, or there's this big amount of data but it's on a website that has no API so you need to scrape it, or something like that. And I got into doing it with Ruby because I love Ruby and I enjoy programming in it.

CHUCK:

So why does text processing matter? It seems like it’s pretty straightforward as far as, "Oh, I've got a string," or, "Oh, I've got some paragraphs and I want to munge it." You can break, you can join, you can do all kinds of nifty things in Ruby. So since Ruby makes it so easy to manipulate strings and things, why write a whole book on text processing?

ROB:

I guess the thing is there's so much data in the world, so much stuff out there is in a textual format. It’s not in some structured database that’s really easy to consume and manipulate. It’s just raw text, because we as humans are linguistic animals and produce lots of text.

So it’s frequently necessary to go in there and process stuff that exists only in raw, big blobs of text. And then there’s so much you can actually do with that, from parsing a little bit of structure out of the text through to doing natural language processing and interpreting it as actual structured English language. It’s a really huge field with lots of useful things you can do in it.

DAVID:

Do you prefer the term ‘Markov chain’ or ‘drunken walk’? [Laughter]

ROB:

Markov chain, I guess. I’ve never used one except to make an ironic Twitter bot. [Chuckles] I think that’s probably the only general use, isn’t it?

DAVID:

I actually built a gibberish bot once that used text processing, and the Markov chain was actually at the [inaudible] level. So, it basically takes consonants and vowels, and then it would string them back together into randomly mixed-up words. And so you would get this garbled stuff, but you could pronounce it.

ROB:

Yeah, but it would do it in a pronounceable order because it’s the same syllables that you find in the actual source text.

AVDI:

That’s cool.

DAVID:

Yeah.

ROB:

It seems [inaudible] like that: password generation stuff, where you generate a random, very long string of characters for a password but you make it pronounceable in the same way.

DAVID:

Pronounceable, right.

CHUCK:

That actually sounds really weird. I don’t know why but it does.

DAVID:

I’ll pull out the Markov chain stuff that I wrote and show you, Chuck. It’s a lot of fun.

CHUCK:

It sounds like fun.

JESSICA:

Can you show samples of gibberish?

DAVID:

I've got a friend who draws Schlock Mercenary. I was watching the video wearing a Schlock Mercenary t-shirt which says, Maxim 34: If you’re leaving scorch-marks, you need a bigger gun. And the aliens in Schlock Mercenary, it's set 3000 years in the future, and Howard started giving weird names to his aliens. So I started talking to him one day and I said, give me the general vowels and consonants that these aliens would speak with, because he introduced a character named [inaudible] and I’m like, "Okay, this got [inaudible]," whatever. And he gave me like 10 or 15 things for frequencies, and I turned around and generated entire books in [inaudible] language, and it looked like it was his language. I mean, it was plausible. If you don’t speak German and you run Markov chains on German words and then spit them out as letters, a German will go, "What the crap is he saying?" But anybody who doesn’t speak German will go, "Oh, that’s German."
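
For readers who want to play along, here is a minimal sketch of the kind of character-level Markov chain David describes; the two-character window, the corpus.txt file name, and the cleanup regex are all assumptions for the example:

    corpus = File.read('corpus.txt').downcase.gsub(/[^a-z ]/, '')

    # learn which character tends to follow each two-character window
    chain = Hash.new { |h, k| h[k] = [] }
    corpus.chars.each_cons(3) do |a, b, c|
      chain[[a, b]] << c
    end

    # then walk the chain at random to emit plausible-sounding gibberish
    state = chain.keys.sample
    output = state.join
    200.times do
      nxt = chain[state].sample or break   # stop if we hit a dead end
      output << nxt
      state = [state[1], nxt]
    end
    puts output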

ROB:

So yeah, I frequently encounter the kind of junior developer who’s only ever really done web development and who struggles when they come across these text processing tasks. And the worst outcome of that is that they end up not enjoying them and viewing it as a kind of drudge work, something that is not actually enjoyable.

Whereas, I, perhaps, because I'm strange, get quite a lot of enjoyment out of turning something, some big block of inert text into some valuable and interesting data. I find that actually quite rewarding.

I’ve always been more drawn to the kind of practical, getting-results side of coding than to the theoretical side of things, I guess. So the aim with the book is to hopefully make it so that people feel enthusiastic and interested in working in what's often viewed as a slightly boring but necessary area. It’s actually quite interesting, and there are tons of interesting areas that you can come into, especially things like natural language processing and writing parsers and things like that.

CHUCK:

What in the book that you’ve put in there do you find yourself using the most often?

ROB:

The thing I've used most often since developing the skill, whenever I did, is probably the kind of ad hoc, exploratory sort of data processing on the command line, throwing together shell one-liners. Very often, the type of work that I have to do is a one-off. It’s the kind of write-once, run-once type of data processing. You just want to figure out what the data looks like or answer a quick question, or even understand what question you need to ask and then later on ask that question in a full-blown script. So, in that kind of harnessing of the existing Unix tools that are out there, throwing together these pipelines that are actually quite quick to create, that suddenly becomes incredibly powerful and can chew through enormous quantities of data really quickly.

That’s the point where I feel most productive, and you get that kind of Zen-master feeling of, "Well, I’ve just written that in 5 seconds but it’s amazingly powerful and capable."
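
For a concrete flavor of that, a throwaway pipeline like the following answers "who are the top talkers in this log?" in one line; an access.log with a client identifier in its first field is an assumption:

    cut -d' ' -f1 access.log | sort | uniq -c | sort -rn | head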

AVDI:

I kind of feel like that’s almost Ruby’s core competency. Maybe I’m on the fringe thinking that.

ROB:

No, absolutely. I mean, Ruby existed for over 12 years before Rails even existed. And if you look at Ruby’s Perl heritage as well, Ruby was a general purpose programming language with a huge emphasis on text processing, on that general kind of day-to-day ad hoc stuff, for longer than it has been a language that is viewed as primarily centered around web development. It has more of that in its heritage than it does web development, and I feel like, as suited as it is to web development, it's even more suited to this sort of task.

AVDI:

Yeah. I mean, for those who don’t know, one of the big reasons Perl was created was to take all of the little tools that are included in the Unix command line for munging text, like sed, awk, tr and grep, and tail and all these things, and sort of put them together into a programming language that makes them easier to glue together. And then Ruby basically just copied all of that in its [inaudible] from Perl.

ROB:

Yeah, absolutely. I think it’s quite telling that Larry Wall, the creator of Perl, was a linguist. He has a linguistics background. It came from that whole heritage of processing text and understanding text and language. And Perl, to me, feels like a very language-oriented programming language, if that makes sense. There is some kind of spectrum that programming languages exist on between very linguistic ones and very mathematical ones. Perl and Ruby both occupy a very extreme position at the linguistic end of that spectrum for me. And that’s where I personally enjoy playing [inaudible], partly because of my background and partly because it’s just what interests me. I have much more interest in the expressiveness and the language that you get with Ruby and Perl than I do with Haskell or something like that at the other end of the spectrum.

AVDI:

Right. When people think these days about chewing through a lot of data, a lot of times we automatically think about big data sets that are in something like JSON or CSV or even just sitting on a database or something like that. Where do you often find that you have to chew through less structured data using these kinds of techniques?

ROB:

All the time, unfortunately. I wish I got to work with nicely structured data more often. I mean, very often it’s exports; for better or for worse, CSV has become the lowest-common-denominator export format. But it’s also things like unstructured data entered by users into websites, into free-form text fields, comments and things like that. And it’s really often data that’s out there and exists on websites and exists in the world, but it’s not available in a nice, clean form, not even a CSV file that you can download. It’s just embedded in a website that’s intended for human consumption, and very often a website that was made with Microsoft FrontPage in 1997 or something. So, I just frequently find that the data I need for my day-to-day job is in that raw, unstructured, horrible form and I have to deal with that.

It’s not necessarily that the data sets are huge in a big data sense. They almost never are. But then I guess that's true for most people. Most data sets they deal with aren't huge in a big data sense.

CHUCK:

You know, some of us might kind of have an issue with what you’re saying there, that CSVs are somewhat unstructured.

DAVID:

Not unstructured, they’re just always malformed.

CHUCK:

Yes.

ROB:

[Chuckles] Yes, absolutely.

CHUCK:

I was trying to think of the right way to say that. Thank you, Dave.

ROB:

They're definitely structured, but it’s harder to do things with them than with things that are in a relational database or something that’s really easy to [inaudible]. Very often you have these ingestions, where you’re taking in a whole lot of that data and then you're having to query it using Ruby, or you’re doing kind of stream-based processing where you’re just processing it record by record and things like that. It takes a different approach than data that’s in a database or something like that.

CHUCK:

One thing that we talked about a minute ago was the command line tools, the shell one-liners that you mentioned. This book is about Ruby. So do you use kind of wrappers around those one-liners in Ruby, or do you actually use the command line to munge some of the data to a certain point and then, when you need something that is a little bit more intelligent or a little bit more intelligible, pull in something else that’s…

ROB:

Yeah, almost always the latter. One of those amazing things that Ruby inherited from Perl is that it plays really nicely with the command line, and with shell pipelines in particular. So, you can invoke Ruby from the command line and pass code to be executed straight from the command line. And it also comes with all sorts of flags that let you do line-by-line processing of input, manipulate lines and also automatically print them out, all those sorts of things.

So my approach generally, the one I find really useful, is to go as far as I can with the existing tools that are there, [inaudible] text processing tools, you know, [inaudible] and things like that. And then you always get to a little bit of the problem, the little bit that’s a bit too complicated. And for me, instead of dipping into awk and writing a complicated awk script, I reach for Ruby, purely because it’s something I know, but it’s also just as capable as awk and really straightforward to put in as that one slot in the pipeline that needs to do the complicated bit, and then the rest is [inaudible].

And that seems to be a really nice balance between having to implement lots of things yourself and you get the productivity of all these things having been implemented for you while still getting to solve your specific problem.
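
A sketch of that "one complicated slot" idea, with Ruby standing where an awk script might otherwise go; the log layout (status code in the ninth field, response size in the tenth) is assumed for the example:

    grep ' 200 ' access.log | ruby -ne 'BEGIN { $sum = 0 }; $sum += $_.split[9].to_i; END { puts $sum }'

With -n, the BEGIN and END blocks run once before and after the per-line loop, just as they would in awk.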

DAVID:

Rob, how often do you use ruby -i -n -e?

ROB:

[Chuckles] Not very often, because -i is in-place editing. I use -n and -e all the time, but probably not -i too often. You didn't catch me out there.

DAVID:

So for the listeners, do you want to go through what ruby -n -e would do?

ROB:

Yes, sure. Passing the -e option to Ruby when you’re invoking it from the command line is a way to, rather than execute a Ruby script as you normally would by saying ruby [inaudible], pass in some code there and execute it directly, which allows you to run Ruby code without committing it to a script. So that’s really useful in those little one-liners.

And then Ruby also comes with some other useful options specifically for text processing. -n is probably the most commonly used one. It will loop over every line, either in standard input, if you’re piping text into the Ruby script, or in the files that you pass on the command line after your code to the Ruby interpreter. It will loop over every line of the input and allow you to process the text line by line.

And line-by-line processing is an incredibly common thing to want to do; lots of file formats are line-oriented, web server logs, things like that. So processing line by line is really useful.
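
To make those flags concrete, two tiny examples (app.log is a stand-in file name): -e runs code given on the command line, and adding -n wraps that code in a loop over every input line, with each line available as $_:

    ruby -e 'puts 1 + 2'                           # -e: run code without a script file
    ruby -ne 'puts $_ if $_ =~ /error/i' app.log   # -n: run the code once per input line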

DAVID:

So -i is modify-the-file-in-place, and I only use -i and -e to change all of the source code in, like, an entire tree.

ROB:

Yes, so -i lets you do a mass find-and-replace across your entire directory if you want to, if you pass in recursively [inaudible] lots of files and things like that.
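
A hedged sketch of that kind of mass find-and-replace, using find to recurse; the method names being replaced are invented for the example, and -i.bak keeps a backup of each file it touches:

    find . -name '*.rb' -exec ruby -i.bak -pe '$_.gsub!(/old_name/, "new_name")' {} +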

DAVID:

Good times.

ROB:

The point Avdi made about big data was an interesting one, I think: thinking about big data sets, because that's a personal [inaudible] of mine, and everyone else thinks the same way.

I know it’s a bit passé to target big data these days and the mass hysteria has kind of passed, which is good. But you’ll still find a lot of people thinking that they have a big data problem when they’ve got a data set that is, you know, a gigabyte or two in size. And that’s not a big data problem.

DAVID:

That’s adorable. [Laughter]

ROB:

And people seem to frequently be quite amazed by the amount of data you can process on a single laptop with very straightforward tools, especially if you’re doing kind of stream-based processing. Most people don’t have big data problems. And trying to treat a small data problem as if it’s a big data problem is a road to horrible pain and suffering, I think.

AVDI:

Are there particular antipatterns that people naively use that lead them to thinking that their data set is too big? Like things you find people do that just cause them to process like a gigabyte-sized data set really, really slowly?

ROB:

I think the number one that comes to mind is definitely not using streaming on a problem where a streamable solution is possible. So instead of processing a large amount of data chunk by chunk, so that only one particular chunk of the file at a time is in memory, people read a huge amount of data into Ruby and end up allocating an array of several gigabytes of memory, and their computer [inaudible] so hard. Everything swaps to disk and they think, "Oh, this problem is too big for a single machine, clearly." But actually, if you think about the problem in a different way, it's very trivially solvable on a single machine.

My favorite is an article that did the rounds some years back by the data scientist Adam Drake. It's called 'Command-line Tools Can Be 235x Faster Than Your Hadoop Cluster'. I really enjoyed it.

So a data scientist had looked at the results of thousands upon thousands of chess games, and he’d used a 7-node Hadoop cluster running on AWS, medium EC2 instances. Big, powerful, lots of servers to process this 2 gigabytes of data, and his Hadoop solution took 26 minutes to complete, running in parallel on seven different machines.

In the blog post, Adam Drake writes a solution that uses three commands in a pipeline chain that runs 235x quicker. It ends up running in 12 seconds on a single laptop, not a particularly amazingly specced laptop. That’s just a perfect example of framing the problem in the wrong way and thinking about it in the wrong way, when actually taking a step back and thinking, "Do I just need to process this particular chunk of the data at one time? Can I stream it?" allows you to chew through it almost infinitely quicker.

AVDI:

This is sort of broad but I mean, are there particular libraries or methods that lend themselves that kind of stream-oriented processing or techniques?

ROB:

For me, it usually depends on the shape of your input data, really. It doesn’t have to be line-oriented, though; that's obviously quite straightforward, and we were just talking about the lots of easy ways to do line-by-line streaming in Ruby. But it's anything that you can delimit and only read in up to a certain delimiter at a time. That's quite an abstract thing, but anything that fits those kinds of parameters.

It's anything that you can think of in terms of chunks and only process a chunk at a time. And that might not seem immediately like it's everything, but there are solutions for things like even XML and JSON parsing, where you don’t have to read the whole JSON or XML file into memory at once. You can stream it, and it fires off events when it encounters particular things in [inaudible]. So there are lots of problems that lend themselves to that solution, but it’s generally dependent on your input more, I think, than the specific answer that you’re trying to get or the specific problem that you have, if that makes sense.
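
As one concrete example of that event-based streaming, here is a minimal sketch using Nokogiri's SAX parser; the catalog.xml file and the 'product' element name are assumptions:

    require 'nokogiri'

    class ProductCounter < Nokogiri::XML::SAX::Document
      attr_reader :count

      def initialize
        @count = 0
      end

      # called for every opening tag as the file streams past
      def start_element(name, attrs = [])
        @count += 1 if name == 'product'
      end
    end

    handler = ProductCounter.new
    Nokogiri::XML::SAX::Parser.new(handler).parse(File.open('catalog.xml'))
    puts handler.count

The whole document never has to fit in memory; the parser pushes events to the handler as it reads.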

AVDI:

Yeah. A lot of the classic Unix tools have a sort of generic concept of a record. And a record was usually defined in terms of some kind of record separator. There wasn’t one specific kind of record, one-size-fits-all. It was that you could tell the tool what the record separator looks like, and at that point we can process things in a streaming fashion, record by record or chunks of records by chunks of records. Ruby actually has that, doesn’t it? It has an input record separator variable that you can set either from the command line or in code, that it kind of inherits from Perl.

ROB:

Yeah, absolutely. And then methods like gets, for example. People think of gets as getting a line of input, but what it does is actually read up to the input record separator. So if you have a huge amount of input that is on a single line and comma-separated, by default gets is going to give you the whole input. Whereas if you change your input record separator to a comma, you can read those records and stream them and process them one by one. And then Ruby also has the input field separator as well, which is also really useful.
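
A small sketch of that behavior; the file name is an assumption, and the point is that gets takes a separator argument (or honors $/), so one enormous line becomes a stream of records:

    File.open('huge_single_line.txt') do |f|
      while record = f.gets(',')   # read up to the next comma, not the next newline
        puts record.chomp(',')
      end
    end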

So if you think of a CSV file, for example, you have records which are separated by newlines and you have fields which are separated by commas. And just like Ruby allows you to use the [inaudible] to process input line by line, it can also split those records into fields for you. You can pass it the -F option with the delimiter and it will split the lines on that character, and then pass the -a option to auto-split. So it makes processing tab-separated data, and lots of data formats that have that kind of record-and-field paradigm, really straightforward. And again, it can be done in a streaming fashion, which is really useful.
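
Those two flags in action, printing the second field of each comma-separated line of an assumed data.csv:

    ruby -F, -ane 'puts $F[1]' data.csv   # -a autosplits each line into $F; -F sets the separator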

DAVID:

There’s a great thing; I'll put a link in the show notes. I was going to say it’s on Wikipedia, but it’s actually in Wikibooks, on Ruby. I think it’s from the Pickaxe book. They talk about the pre-defined variables. These are all the dollar-sign variables that are available in Ruby. And the thing that you can set with the -F flag is $;, so you can take $; = ',' and then you can take an input record and just say split, and it will explode it on commas instead of on whitespace, because that’s the input field separator. Using all of these $ extended ASCII characters will get a [inaudible] at your desk, which is awesome, if you need to know how to [inaudible] one of those. [Laughter]
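
The same idea from within code, as a two-line illustration:

    $; = ','                  # the input field separator David mentions
    p "one,two,three".split   # => ["one", "two", "three"], no argument needed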

ROB:

Yes. It’s another one of those things we talked about with Ruby’s Perl heritage. There’s a sort of, you [inaudible] about that in the Ruby community, that especially manifests in the hatred for the cryptic globals: this idea that somehow Ruby's Perl heritage is not one of its assets. For me, that’s totally untrue. I think loads of amazing things in Ruby were stolen wholesale and verbatim from Perl, and that’s really great. And I must admit, I’m occasionally a user of the cryptic globals without requiring English or anything like that. [Laughter]

ROB:

I actually use the globals, they're useful just not in libraries that other people use and things like that. But they’re definitely useful in little one-off scripts and things.

DAVID:

Chuck or Avdi, have I bored our listeners with why I threw out Python and switched to Ruby?

CHUCK:

I don’t think I’ve heard this one.

DAVID:

This is a real quick story, but when I first got into scripting, I got into Perl scripting, and I was doing a ton of IRC at the time, which is, you know, the messaging on the internet, and you have one line for all of your input. And I had bots that I could give code to execute, or my IRC client could be modified on the fly by sending code to the client itself. You only have one line with which to enter everything. And I was trying to learn Python; I fell in love with Python in one day and threw away all of my Perl scripts to switch to Python. And then I discovered that whitespace dependence means you need multiple lines to enter code. So I went back to Perl whenever I was on IRC. And then along came Ruby, and you could spread out your program like Python and make it nice and readable, but if you had to crunch it down, you could. I’ve written thousand-character programs in Ruby and Perl on a single line of IRC. [Laughter]

ROB:

Yeah, the "there's more than one way to do it" heritage is something people say Ruby inherited from Perl. It’s a really fundamental philosophy in Ruby. It's totally acceptable in Ruby for there to be 10 or 15 ways of achieving the same thing, because of exactly that. Sometimes there are scenarios where one method is obviously superior to the others; it's all context-dependent. That's rather than the Python [inaudible], there should be one clear way to do things, except sometimes there's a scenario where that one clear way of doing things is horrible and [inaudible], and it kind of falls down.

DAVID:

I don’t want to kick the Python guys too hard because they kick back. But the reason I just threw Python out was I ran into a case where the one right way to do it had kind of expired. Times had changed and there was a new one right way to do it, and people were still doing it the old way because it matched all their existing programs. And the thing that I love about Perl and Ruby is they both say there’s more than one way to do it.

In Perl, it is morally acceptable to choose the worst possible way to do it, whereas in Ruby, there’s more than one way to do it, please use the most appropriate one.

ROB:

Yeah, Ruby is Perl with taste, isn’t it?

DAVID:

Yeah, I like that. Yes.

AVDI:

Ruby often makes the elegant way also extraordinarily accessible. I mean, I really like the way that Ruby takes a lot of these Perl-isms and builds some structure around them. Like we were talking about, gets actually gets the next record depending on the input record separator, but then you also have something that’s more Ruby-ish, like each_line.

You can call each_line on any IO. It could be a stream coming in, it could be a file, whatever. And by default, that’s going to be a newline-delimited line, and this is nice stream processing. You’re only getting a line at a time, but it also respects the input record separator, or you can manually pass the input record separator in as an argument. And now your lines are separated by whitespace or they’re separated by some other character, but you're using a very Ruby-ish block syntax. Or you can use that without the block and now you have an enumerator, and now you can do all kinds of neat chaining things with the enumerator, which is extraordinarily Ruby-ish but it’s also extraordinarily accessible.
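
A sketch of that block-less, enumerator-chaining style; app.log and the ERROR marker are assumptions, and .lazy keeps the whole thing streaming a line at a time:

    File.open('app.log') do |f|
      f.each_line                                  # no block: returns an Enumerator
       .lazy                                       # stay streaming, line by line
       .select { |line| line.include?('ERROR') }
       .first(5)
       .each { |line| puts line }
    end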

ROB:

Yeah, absolutely. From a text processing perspective, Enumerable is my favorite part of Ruby. And Enumerable plus blocks, for me, is the single reason why Ruby is such a lovely language to use. It goes a long way toward the reason why Ruby is brilliant.

And, from a text processing perspective, you're able to combine all that power and flexibility of Perl with that kind of functional programming influence in Enumerable, and all those methods that you get for free with Enumerable.

And like you were saying, each_line is the same method on a File, for instance, which means the File class gives you all the power of Enumerable. You can suddenly select lines based on particular criteria, or you can group them, everything that you'd expect to do with a collection in Ruby. And you can do it with stream-based processing of text from a file, which is just an amazing combination of Ruby's strengths plus the strengths that come from its Perl-y heritage.

DAVID:

I don’t usually swoon over the British accent but I love how you say 'Perl-y'. [Laughter]

ROB:

Yeah, I knew I was going to get mocked for my ridiculous accent at some point.

DAVID:

It could have been worse.

CHUCK:

Perl.

[Laughter]

JESSICA:

That was pretty good.

DAVID:

But coming out of you, that sounds wrong! Sounds wrong… [Laughter]

ROB:

The reason that it sounds wrong is because it is wrong. It’s Perl. [Laughter]

DAVID:

Excellent!

AVDI:

I also love the fact that you have these abstractions that kind of gradually build up as you need them, because you have what we were just talking about, like each_line and the enumerable chaining you can do based on that. And then you have some more class-based abstractions, like StringScanner, when that stuff isn’t enough.
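
For a taste of that next abstraction up, a minimal StringScanner sketch that tokenizes a tiny arithmetic expression (the token set is invented for the example):

    require 'strscan'

    scanner = StringScanner.new('12 + 34 * 5')
    tokens = []
    until scanner.eos?
      scanner.skip(/\s+/)                  # ignore whitespace between tokens
      if number = scanner.scan(/\d+/)
        tokens << [:number, number.to_i]
      elsif op = scanner.scan(/[+*\/-]/)
        tokens << [:op, op]
      else
        raise "unexpected input at offset #{scanner.pos}"
      end
    end
    p tokens   # => [[:number, 12], [:op, "+"], [:number, 34], [:op, "*"], [:number, 5]]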

ROB:

Yeah, absolutely. And frequently, we were talking before about that kind of ad hoc, exploratory sort of processing, and it's so nice that maybe the initial questions that you start to ask of some data start off on the command line, and you start to build up these pipelines that give you some information.

If you discover that you’re asking the right questions and you need to go further, it’s so straightforward to then take your Ruby solution from the command line and expand it into a full-blown script, and the two things feel the same. They both feel like Ruby. It doesn’t feel like you’ve gone from your command line processing approach and then you suddenly have to completely rethink your whole problem to bring it into Ruby. That continuum is really great. And then eventually, you might turn something into an application that you’re using every single day rather than a one-liner that you run once. But you can start from that one-liner and progress right through to the full application. Ruby can take you on that whole journey, which is one of its greatest strengths, I think.

JESSICA:

Yeah, that’s beautiful. So Ruby scales in that way. It scales from the exploratory, quick one-off command line into a production program gradually, and then it’s a refactor, not a rewrite.

ROB:

Yeah, absolutely.

JESSICA:

Also, the Ruby community, with its good idioms and the consistency of "there is a right way to do it": there are many ways to do it, but one of them is more Ruby than the others. That gives people guidance on how to refactor, unlike Perl.

ROB:

Yeah, absolutely. There’s a right way to do it for a given situation, and I think that’s the thing, isn’t it? You can always look at a variety of solutions at that particular moment and say, "That is the most sensible one; these are the ones that would be a mistake." But for the same problem in a different context, there are different approaches, and it has that flexibility.

DAVID:

Everything is a trade-off and different approaches have different trade-offs. My favorite thing, when you’re moving… I love that Ruby is just malleable, and you can just roll it from these command line things up to things that exist in a script file. And for me, I [inaudible] the silly grin on my face when I have this bit running with ruby -e and I’m transform-mungling the data, but I’m also doing -r and requiring a class file that knows how to read and write the data that I’m manipulating, at least at the low level. So I end up, now my -e is just, "Hey, for each record do this, do this, do this," because you can just say Data.new($_) and you’re done.
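
A sketch of that pattern under invented names: a record.rb holding a hypothetical Record class, pulled into one-liners with -r:

    # record.rb: the class file that knows the low-level format
    class Record
      def initialize(line)
        @fields = line.chomp.split("\t")
      end

      def email
        @fields[2]
      end

      def active?
        @fields[4] == 'active'
      end
    end

Then each throwaway application is just a one-liner:

    ruby -r./record -ne 'r = Record.new($_); puts r.email if r.active?' customers.tsv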

ROB:

Yeah, absolutely, and you’ve created a kind of command line interface to that data, where the actual questions you’re asking and the processing you’re doing are different every time, but you’re doing the same initial processing each time, and the command line is already a brilliant interface for that.

DAVID:

Yeah. It’s that moment when you realize, "Okay, I’ve got a common data model, but every time I touch it I need a different app, a completely different application." And so, yeah, you write the application with -e and you pull in your data with -r.

ROB:

I've thought of the answer to a question from earlier, an interesting thing.

DAVID:

Oh, excellent.

ROB:

It's a processing thing. We do a lot of packaging for food brands, packaging design, and often you need to print something onto the packs; maybe you have to put a unique code on if you’re doing some kind of competition thing. So we had this scenario where we had to print a unique URL onto the packs. It was fine; it was for some kind of redeemable coupon, not amazingly interesting. But they designed the packs with a space where the URL goes, and before the packs go to print, the day before they go to print, we discover that the designer has left about 10 characters of space for the URL that needs to go on these packs.

The unique URLs that the coupon people generated are about 150 characters long. And by the way, it’s on 50 million packs. [Chuckles] So who wants to go and put those URLs into a URL [inaudible]? So we just wrote this very quick one-liner; it took 5 minutes. We were talking just then about the ability to require libraries from a one-liner, so it [inaudible] the CSV, requires the CSV library from the standard library, and then chews through the URLs, and for each URL you find in the CSV, replaces it with the Bitly version and writes the file back again.

So, a 10-million-line file, quite a big file, but it’s a CSV so you can process it as a stream. And then it has to make the request to Bitly, so it’s not the fastest thing in the world, but it was really quick to write. Surprisingly enough, I’ve never had to do that again, so it was completely suited to a one-liner; you can just sit there and let it chew through, and problem solved. You get to shorten those URLs, send the files to the printers and save the day, which always feels quite good when you get this panicked problem and you can save the day.
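
A hedged reconstruction of that one-liner, expanded into script form for readability; shorten_url is a stub standing in for the real Bitly API call, and the URL living in the first column is an assumption:

    require 'csv'

    def shorten_url(long_url)
      # call your URL-shortening service of choice here; stubbed for the sketch
      "https://bit.ly/#{long_url.hash.abs.to_s(36)[0, 7]}"
    end

    CSV.open('packs_short.csv', 'w') do |out|
      CSV.foreach('packs.csv') do |row|   # CSV.foreach streams one row at a time
        row[0] = shorten_url(row[0])
        out << row
      end
    end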

DAVID:

I was waiting for you to say that you had 50 million unique random strings. And the day before you went to press, somebody found out that every swear-word in the English language was in the URLs and those URLs needed to be skipped. [Chuckles]

ROB:

We have also had that problem before, as an irate customer wrote in saying, "Why is this word on my pack? My children, my children went to the website." And so we now have a swear-word filter that runs over any of these random codes that we generate.

CHUCK:

I bet those kids were excited.

ROB:

Yeah.

CHUCK:

Mom will never know.

ROB:

It’s just like the thing you read on the yogurt [inaudible], I would imagine.

AVDI:

When I’m doing this kind of exploratory text munging coding, I often find it helpful to enlist some kind of tools to give me more of a sandbox environment, like a workspace environment. Those who watch Ruby Tapas know that I’m a big fan of the seeing_is_believing gem. Previously, I used the xmpfilter gem, where, in my editor, I can write a little bit of code, then execute it and see the results pasted inline in special comment blocks, and then sort of fiddle with it a bit, add to it, and then execute again with a key binding and see the results filled in.

What kind of tools or practices, if any, do you use to sort of have that sandbox experience, where you don’t have to go back to square one all the time?

ROB:

I don’t have any tools, necessarily. I mean, the practice I do have is always to work on a subset of the data, if only for the sake of my own boredom. I don’t want to chew through a large dataset.

AVDI:

That’s a good one.

ROB:

So, working on some feasible subset of the data, even if that’s only putting head in as the second step of your pipeline or something like that; you just work on the first ten lines until you get to the point where you feel like you’ve got a solution.

But generally, I find that, especially for the stuff on the command line, it’s really easy to do that in an iterative sort of way, because each step of the pipeline produces its own output. So I kind of start off with a cat of a file or the initial command, and you see it, you look at the shape of it, and then you think, okay, the next step is this, and then you do that first step and then you see the output. It’s a really good feedback loop. You get to see at each stage, okay, that command that I’ve added into the pipeline chain has manipulated the data in this way, or, oh no, that’s clearly a mistake, I can go back. I find the feedback loop of that sort of development already quite tight, so long as it’s quick enough that I’m not distracted as soon as I hit enter, because it’s taking 10 or 15 seconds to run rather than less than a second. I feel like it's already quite well suited to that kind of feedback.

AVDI:

So you generally work at the command line rather than an IRB or pry command prompt?

ROB:

Yeah, usually. Even if it's something I'm going to turn into a script, I usually inevitably start it from that kind of exploratory testing, even if I don't realize it yet, because that's just my first instinct, to go there. And then pry, maybe, if there's a particular Ruby step within that process that's quite complicated; maybe I'll play with that in pry. I find the command line quite a natural environment for that sort of working.

DAVID:

I’ll do that, but I will also, if I get to a point where I just don’t know what my data looks like, especially if the client has supplied some data, so I genuinely don’t know, I have a good reason for not knowing what the data looks like yet, I’ll end up writing, like, the first 20 lines of the script to open the file and read it in [inaudible]. But I’ll do that in IRB, using -r to bring that file in. And now I can investigate a record: "Oh, it's pass_word," like nobody else in the world does it. [Chuckles]

DAVID:

Then exit back out and then, you know. If it’s something that simple, then you can just do it with a puts statement. But if it’s like a nested JSON string, object of object of object, sometimes I like to jump into IRB that way as well.

ROB:

Well, that kind of array-of-array-of-array of something is going to be one of my picks. There’s an amazing library by Piotr Solnica, I'm sure I’ve mangled his name, called transproc, if any of you guys have played with it. It’s basically a way of defining transformations applied to data in a very functional programming style.

So you basically compose together these different methods, each of which performs a particular transformation on the data. And it comes with things like map_values for hashes and map_array to apply a particular function to every element in an array. And you can basically write these very short little functions for renaming the keys of a hash. So you have that input where it's pass_word, like nobody else in the world uses; one of the steps of the transformation that you're doing is to rename that key to password, because the [inaudible] expects it to be password. And you can just compose together all of these different small methods. They're obviously highly reusable because they're quite abstract things, like renaming keys and nesting hashes, taking a flat hash and nesting it into multiple hashes and things like that, and unnesting, and all sorts of things. So I do a lot of importing and exporting of data, and I find transproc amazingly useful; it makes me a hundred times more productive than I would have been [inaudible].
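
A small sketch in the style of the transproc README, using the pass_word example Rob mentions:

    require 'transproc/all'

    module T
      extend Transproc::Registry
      import Transproc::HashTransformations
      import Transproc::ArrayTransformations
    end

    # compose two small steps: symbolize each hash's keys, then fix the odd key name
    transform = T[:map_array, T[:symbolize_keys] >> T[:rename_keys, pass_word: :password]]

    p transform.call([{ 'name' => 'Ada', 'pass_word' => 's3cret' }])
    # => [{:name=>"Ada", :password=>"s3cret"}]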

DAVID:

That’s awesome.

ROB:

It's kind of the same philosophy as on the command line, where each step of the text processing pipeline is performing this one transformation on the source data. It doesn't have to concern itself; it doesn't know that it's in a pipeline chain or what the other bits of the pipeline chain are doing. It just knows some data is coming in and it needs to do this one thing, which makes it highly modular, highly reusable. And transproc has that same thing, but applied to data structures that are in memory in Ruby. When you have those hugely nested data structures, once you've parsed your CSV, what do you then do next? I find it's really useful for that stage. It's pretty great.

DAVID:

Awesome.

CHUCK:

One thing that came to mind when Dave was talking about his tweets: I wonder, do you ever use this text processing stuff on kind of visual, like textual-visual things? What’s coming to mind for me is that I’ve done Game of Life with several of the companies that I’ve done coaching for. I've used it as an exercise; I’ve run a couple of code retreats. And in a lot of cases, what happens is people want to represent the live cells and the dead cells.

I’m not sure if you’re familiar with the exercise, but the live cells and the dead cells are different characters, or they use them to delineate a map or things like that. Could you use some of these techniques to basically store your information, as far as which cells are alive or dead, or some other visual representation, in text? And actually have that be the canonical version of your data as well?

ROB:

Yeah, I mean, what you're essentially describing is a kind of save game. You can set the initial conditions of the board. So that’s a really useful thing, because the text, the ASCII version of the board, is then humanly editable in a really intuitive way, and it would be really straightforward to parse. I mean, it’s almost not text processing because it’s just a grid, isn’t it? And that kind of thing, thinking about another file format for your data that enables someone to more effectively edit it, if it’s going to be human-editable. I think very often our first resort is to go, "Oh, [inaudible] the grid is an array of arrays and then I’ll just dump that as a JSON file," or something, and that will be my save file. But a more imaginative solution is that the representation on the screen also becomes the representation in a file, which then makes it really straightforward to edit, and you don’t have to think differently when you’re viewing the game as it's running versus when you’re editing the starting conditions. That’s a really, really nice thing.
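
A minimal sketch of that idea, where the on-screen grid and the save file are the same text; '#' for live and '.' for dead is just one possible convention:

    def parse_board(text)
      text.lines.map { |line| line.chomp.chars.map { |c| c == '#' } }
    end

    def render_board(board)
      board.map { |row| row.map { |alive| alive ? '#' : '.' }.join }.join("\n")
    end

    board = parse_board(File.read('board.txt'))   # board.txt is an assumed save file
    puts render_board(board)                      # round-trips: the display is the format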

AVDI:

Have you ever run into a situation where you did find yourself wanting to do something in parallel, like either parallel I/O or splitting something across CPU nodes or splitting something across, even machines. I mean, did you ever run into something that was big enough or slow enough for that?

ROB:

Absolutely. So there are a few different levels to that, I guess. The first thing that's great, especially about the command line processing, is that it's parallelized already. If you think about the different commands in a pipeline chain: as soon as you hit enter on a command line, all of those commands start. Sometimes you'll think about it as though the first command runs and then it passes its output to the second and then that finishes, that it's sequential in that way, but it's not. All those commands are running at the same time. So if you've got something that is quite CPU-bound, those other processes that are processing the data later on can be doing their work at exactly the same time.

So often that alone gives you some kind of performance gain, and that's really great. And then there are a few extra things you can do. Say you discover that one part of your text processing pipeline is really slow. Well, if you're doing streaming and processing things line by line, that's really trivially parallelizable, by working on different chunks of the file in different processes at the same time. Say with xargs, the command that lets you take lines of input and pass them as arguments to another process.

That has a really useful -P option which lets you specify how many processes to invoke. So that can be commonly useful; say you've got a step, maybe grepping, or you've got a Ruby step that is quite CPU-bound because you're doing some kind of complicated text processing there. You can then run four at once, if you've got four cores on your machine, and parallelize just that one step without then having to consider parallelization from within your Ruby script, which would make your Ruby script a lot more complicated. It just knows that it's processing some chunk of data; it doesn't know whether it's the whole file or a quarter of the file or whatever it is. That's really useful. And also, the GNU parallel command, which lets you invoke multiple instances of a command at the same time, is also really useful.
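
A sketch of that per-step parallelization; process_one.rb is a hypothetical script that handles a single file:

    ls data/*.csv | xargs -n 1 -P 4 ruby process_one.rb   # four worker processes at once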

I’ve not yet encountered a problem that forced me to scale up beyond that to multiple machines. That’s maybe the point where you do genuinely have a big data problem and this approach is no longer the right one, maybe. Yeah, I’d love to know if this approach is scalable to that extreme level.

DAVID:

Have you run into a similar [inaudible], have you run into cases where you have to fan in data or fan data out? By which I mean, you might take a database full of customer [inaudible] text processing but, okay, whatever, a CSV file full of customer records, and we've got the customer's name and the customer's address and then the customer's phone 1, phone 2, phone 3 and fax, because we still live in the 19th century.

CHUCK:

[Laughs]

DAVID:

And you want to emit a customers list and an addresses list referencing those customers and a phone numbers list referencing those customers. That would be like the fan-out version of it.

The fan-in version, of course, is that somebody has built two separate files that [inaudible] by the same index that really shouldn't be two separate files. And so you actually want to munge them together into a single output. Do you run into that very much?

ROB:

Yeah, occasionally. So there are some great command line tools for that, like join, for example, which lets you basically perform, as you would in a relational database, a join between two files. So, say those CSVs that you're fanning in have both got an email column or a user ID column or something; that's the thing that you're joining on. It's really trivial then, as maybe the first step in your pipeline, to combine those together and then continue to process it in a streaming way, to be able to do those joins, which is really useful.
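
A sketch of that fan-in with join(1); the shared key being the first column of both (hypothetical) files is an assumption, and join requires both inputs sorted on that field:

    join -t, -1 1 -2 1 <(sort users.csv) <(sort emails.csv) > combined.csv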

I can’t think of a time when I’ve encountered the opposite problem, the kind of fanning out. I don’t know whether anyone else does, needing to spit things out into multiple files. But it’s generally somewhat trivial to do that. As you're going along, instead of writing everything to standard output, you have a few different file handles that you’re writing to. The fanning out is probably generally the more straightforward part of the problem than the combination [inaudible], especially since often that involves getting everything in memory at once.

DAVID:

Yeah, I run into this more often with database transformations than I do with text processing. And yeah, the gotcha isn’t reading from multiple sources or writing to multiple sources. The gotcha is how much stuff you have to hold in memory so that you can hit it correctly. Because the two files might not be indexed; when I say indexed, I mean they might have an ID column as opposed to, like, line-number indexing. If it's line-number indexing, you’re good to go. Just join the files together and away you go. But if they’re joined by IDs, then you’ve got this mess where, okay, I need to load this file and process it and hold it in RAM, now load this other one and process it, and munge those two datasets together before we do output.

ROB:

Yeah, that's generally the tricky part, maintaining your ability to stream the data. But yeah, it depends on the shape of the data, generally.

DAVID:

Yeah.

CHUCK:

This has been really interesting. And it’s something that I think a lot of times we just kind of take for granted. I’m going to have some pile of text and I can cut through it this way or that way or the other way.

ROB:

Yes.

DAVID:

I’ve been sitting on my hands through most of the show because I spent the first five to ten years of my programming career doing nothing but munging text, and doing things like, on a 16K computer, you can’t fit all the zip codes in America into memory. And so, if you want to write a phone billing program, you have to put a diskette, a 180KB diskette, into Drive B or Drive C or Drive D on one of the really [inaudible] machines and write your program to read stuff in and go look up zip codes off the disk. That’s just a tiny example of how boring I could have been on this show.

ROB:

[Laughs] But no, it’s exactly that kind of approach. I think developers who've come up in the last few years are unused to a world in which resources are constrained, because they have effectively infinite quantities of CPU and memory and disk at all times. And I think it’s probably that that leads to the assumption that you have a big data problem, when really you just need to be inventive and figure out a way to cut the problem in a certain way that enables you to process it. I worry that those are lost techniques, especially among younger and more junior developers.

DAVID:

Yeah.

CHUCK:

Alright, let’s do some picks. Avdi, do you have some picks for us?

AVDI:

I believe that I do. So first off, I’m going to go ahead and pick Rob’s book. He was nice enough to give me an early look at it awhile ago, and it’s really good stuff. There’s some coverage in there of techniques and classes that, honestly, I had not seen given any decent coverage in Ruby books before, things like the StringScanner class and [inaudible] and stuff like that. They really deserve more attention. So yeah, Rob’s book is great. Definitely pick up a copy. It’s fun stuff.

Now, if you want to play with these kinds of techniques and you are just sort of at a loss for a fun dataset to play with, there’s an email newsletter that I discovered recently called Data Is Plural. And basically, once a week, the author of this newsletter sends out a curated list of links to interesting public datasets. So just to cite the most recent example, we have an arms transfer database which tracks the international flow of major weapons; somebody has got nearly 70,000 images from candidates' social media accounts, okay, that’s not really text processing, so let’s see… Oh, the National Registry of Exonerations has dumped all the criminal exonerations in the United States. There’s a listing of major breaches of HIPAA-protected health data. So this is a dump of the breaches, the breach events. And then, finally…

DAVID:

The events or the actual data? [Laughter]

AVDI:

And then finally, there’s a public record of the contents of the official United Kingdom Wine Cellar, all 34,052 bottles. So yeah, it’s fun. Again, it’s called Data Is Plural. It’s an email newsletter; once a week, the author sends out interesting public datasets. So often you have something you want to play with but you can’t find a good dataset lying around, and this is a good source of them.

And finally, I just got back from RubyConf. I am biased because I was on the program committee, but I heard from a lot of people that they really felt like it was one of the best RubyConfs ever. And I've seen that the videos recorded at the conf are already going out. There’s already a bunch of them up now.

CHUCK:

Oh, wow!

AVDI:

By the time this episode ships, they might even all be out. So check those out. There are a lot of really great talks. I think we had some fantastic speakers and it was a really great program. So we’ll put a link to that in the show notes. And that is it for me.

CHUCK:

Alright. Jessica, do you have some picks for us?

JESSICA:

Yeah, I do have a pick. I'm in San Francisco this week in the Stripe office, and last night I got to go to Papers We Love, which is a meet-up. It started in New York but it has been franchised, so I’m going to pick Papers We Love. If you go to PapersWeLove.org, it's in a whole bunch of cities. And the premise of the meet-up is somebody picks an academic paper or an industry paper and does a presentation about it. And hypothetically, maybe you can read the paper beforehand, but I don’t know if anybody actually does that. The presentations last night, for instance: Tony Arcieri made a cool one on security and keychains and macaroons as layered cookies. And then there was one by Gareth on the rendering equation and how all 3D graphics is a solution to that. I’m not, like, into either of these things, but they were super informative, and it’s fabulous to hear a paper presented in context.

So I would recommend seeing if you have a local Papers We Love meet-up, and yeah, maybe go to it. That’s my pick.

CHUCK:

Awesome. David, what are your picks?

DAVID:

Okay. So in descending order of relevance, I was going to pick migratrix which is a tool that I wrote for migrating data from Rails applications because all the freaking time, I run into the problem where somebody says,

“Hey, can you please pull the data out of our legacy database and import it into the new database? But, we’re still taking sign-ups in the old database, so can you kind of keep that going? Oh, and by the way, it turns out [inaudible] the beta sites six months ahead of date. So when somebody signs up on the new site, can you actually back-migrate that data?”

So there’s been this one-off application and then all of a sudden it’s this 24/7 bi-directional synchronization tool. And so migratrix turned into this thing where I was going to have extract, transform, and load classes. I never got it fully ported to Ruby 2.0. I had a brainwave during the show that I should rip out all the Rails stuff and just do it all with transforms, and it turns out that the transproc thing is way ahead of me. So all that talking was just to tell you that I’m poaching Rob’s pick ahead of him.

ROB:

[Laughs] Damn you.

DAVID:

Yes. Actually, it’s cool. Multiple people can pick the same thing. It’s fine.

CHUCK:

[Laughs]

DAVID:

The second thing: I love this website, YourDataFitsInRAM.com. The question is, does your data fit in RAM? Does it fit in your memory? Can you manipulate it? And they just give you a box to type in how big your data is, from bytes all the way up through petabytes. If you pick something that is too big, then it says, "No, it probably doesn't fit in RAM, but it might." But if you pick something that would fit in RAM, like 6 terabytes, it says that will fit in RAM, and they’ll give you a link to a machine that will hold it. So you can actually click on it and it will take you to a Dell PowerEdge server that has 6 terabytes of RAM in it, and you can have it for the low, low price of under $10,000.

CHUCK:

I’m getting one tonight.

DAVID:

That’s like 60 bucks or 60 pounds in the UK.

ROB:

That’s right.

DAVID:

So I mean, I don’t see any reason why you shouldn’t have one already. My last pick is advisor. I freaking love advisor. Aspect-oriented programming has been the Holy Grail/[inaudible] of programming for like 20 years: the ability to just take a class and say, "I want to log every call to every method in this class." How do you do that without opening every method and changing the source code? And the answer is: you require the advisor gem, you open the class and say "extend advisor loggable", and then you say "log calls to" and you give it a list of the methods that you want logged. And bam! You’re done. This thing injects all of the necessary framework so that when you call that class, it’s prepended in front of it; it intercepts the call, logs the call with all the arguments, and then it yields to the method.

And when it returns, there are instructions there for how to write your own thing. I’ve built a call tracer with it that I’m using to eliminate illegal database calls in our functional test suite, and I’m absolutely freaking loving it. It’s aspect-oriented programming in pure Ruby. You don’t have to put a tag over a Cucumber thing and have the DSL handle it. It’s just "extend this module", and now you just say, "I want to do something with these method calls," and it takes care of it. I love it. So, those are my picks.

CHUCK:

Awesome. I’ve got a couple of picks. The first one is Swarm Simulator. I’ve just been leaving it running for days, and what it is, is just this dumb game on the internet. But yeah, I just leave it running and I come in periodically and upgrade my stuff. You start out with drones, and then once you have enough drones, you get queens, and you have larvae and you get territory, and the territory gets you more larvae. Anyway, it’s been a lot of fun and I’ve just kind of let it ride.

Yesterday, I hardly got on my computer at all. So when I got to it this morning, I had all kinds of resources in it because it’s been running all night and all day and all night.

DAVID:

So, it’s like Progress Quest, only in the real-time strategy genre?

CHUCK:

Kind of. It’s all text-based, but it’s kind of fun. And yeah, it’s online. Anyway, I want to pick that. The other thing: I heard about it on those 15-minute calls that I do with listeners, so I’m going to pick those as well.

If you want to talk to me for 15 minutes on Skype, webcam-to-webcam, face-to-face for 15 minutes, you can go to RubyRogues.com/15minutes, and you can see and hear me for 15 minutes, I can see and hear you, and we can talk. And it’s been fun. I’ve talked to a whole bunch of people that are brand new to programming. I’ve talked to a whole bunch of people that are not brand new to programming, that have been doing it for a long, long time, longer than I have. And it’s just been a lot of fun to kind of get a feel for the people out there.

My next pick is Toastmasters. I picked it on the show before, but I’m super excited because I just completed my Competent Communicator, which means I’ve given 10 speeches at Toastmasters, and that was a lot of fun. So I’m going to pick that as well.

Finally, the last pick I have: I’ve decided to do a remote conference every month next year. I know that sounds nuts, but I’ve been enjoying them and people seem to want more of them. I’m covering general programming topics, I’m covering the topics that I have podcasts on, and then I’m covering just some other things that sound like fun. You can get the full list at AllRemoteConf.com. Not you guys on the call, because it’s not up yet, but it will be up by the time this goes live. So go check it out. And JS Remote Conf is the first one next year.

You can actually buy tickets in groups of 3, 6, or 9, or all 12 remote conferences if you want. And you can also submit proposals for the first three or four. Anyway, if you’re interested in speaking, say at Ruby Remote Conf, which will be in March, then by all means go submit a talk. And yeah, some of the other ones are Git, Postgres, NoSQL, robots, newbies. So anyway, I've got all kinds of stuff in there. Go check it out; there’s stuff for everybody. So there are my shameless self-promotion plugs. Rob, what are your picks?

ROB:

I can’t be the only person who listens to Ruby Rogues and comes to [inaudible] every week about stuff that they would pick on the show. So I had to whittle down my list slightly. I've got two technical picks; I've limited myself to three in total.

DAVID:

Rob, my record is 12 minutes. That’s the score to beat. Go. [Laughter]

CHUCK:

Only 12, Dave?

ROB:

One of my technical picks was cruelly stolen away from me earlier in the show, so I'll have to mention transproc, which I will second: it's very, very good. And my other technical recommendation would be the Sequel database library. The other half of my toolkit when I have to do kind of database-based ETL tasks is Sequel, by Jeremy Evans. It’s both a lovely database access library and also a really nice ORM that’s much nicer than ActiveRecord. So those are my two technical picks.

And then a book that I read recently that I really enjoyed is called Priceless: The Hidden Psychology of Value, by William Poundstone. It's a pop psychology book, but I found it really interesting. It's all about the psychological foibles that we have, the cognitive biases and things that enable us to be tricked by cold-hearted shops and marketers and people like that into not understanding the true value of things and therefore paying too much for them. It's a great book. And that's it.

CHUCK:

Alright. Thanks for coming, Rob. If people want to follow up, find out more about you or about your book or about what you do, where should they go?

ROB:

So, you can follow me on Twitter @robmil. And I have a blog, robm.me.uk, where I infrequently but occasionally blog about various text processing related things and other things that take my fancy.

CHUCK:

Alright. I guess we’ll wrap up the show. And we’ll catch you all next week.

[Hosting and bandwidth provided by the Blue Box Group. Check them out at BlueBox.net.]

[Bandwidth for this segment is provided by CacheFly, the world’s fastest CDN. Deliver your content fast with CacheFly. Visit CacheFly.com to learn more.]

[Would you like to join a conversation with the Rogues and their guests? Want to support the show? We have a forum that allows you to join the conversation and support the show at the same time. You can sign up at RubyRogues.com/Parley.]
