184 RR What We Actually Know About Software Development and Why We Believe It's True with Greg Wilson and Andreas Stefik
The Rogues talk SCIENCE! with Greg Wilson and Andreas Stefik.
Show Notes
Special Guests: Andreas Stefik and Greg Wilson.
Transcript
GREG:
Sure, anything programmers ignore is probably evidence.
[This episode is sponsored by Hired.com. Every week on Hired, they run an auction where over a thousand tech companies in San Francisco, New York, and L.A. bid on Ruby developers, providing them with salary and equity upfront. The average Ruby developer gets an average of 5 to 15 introductory offers and an average salary offer of $130,000 a year. Users can either accept an offer and go right into interviewing with the company or deny them without any continuing obligations. It’s totally free for users. And when you’re hired, they also give you a $2,000 signing bonus as a thank you for using them. But if you use the Ruby Rogues link, you’ll get a $4,000 bonus instead. Finally, if you’re not looking for a job and know someone who is, you can refer them to Hired and get a $1,337 bonus if they accept a job. Go sign up at Hired.com/RubyRogues.]
[This episode is sponsored by Codeship.com. Don’t you wish you could simply deploy your code every time your tests pass? Wouldn’t it be nice if it were tied into a nice continuous integration system? That’s Codeship. They run your code. If all your tests pass, they deploy your code automatically. For fuss-free continuous delivery, check them out at Codeship.com, continuous delivery made simple.]
[This episode is sponsored by Rackspace. Are you looking for a place to host your latest creation? Want terrific support, high performance all backed by the largest open source cloud? What if you could try it for free? Try out Rackspace at RubyRogues.com/Rackspace and get a $300 credit over six months. That’s $50 per month at RubyRogues.com/Rackspace.]
[Snap is a hosted CI and continuous delivery that is simple and intuitive. Snap’s deployment pipelines deliver fast feedback and can push healthy builds to multiple environments automatically or on demand. Snap integrates deeply with GitHub and has great support for different languages, data stores, and testing frameworks. Snap deploys your application to cloud services like Heroku, Digital Ocean, AWS, and many more. Try Snap for free. Sign up at SnapCI.com/RubyRogues.]
CHUCK:
Hey everybody and welcome to episode 184 of the Ruby Rogues Podcast. This week on our panel, we have Avdi Grimm.
AVDI:
Hello from Pennsylvania.
CHUCK:
David Brady.
DAVID:
Today I’ll be providing an even mixture of computer science folk medicine and unrefuted hypotheses based on personal observation.
CHUCK:
Jessica Kerr.
JESSICA:
Hello from St. Louis which is not a pretty place to be today.
CHUCK:
I’m Charles Max Wood from DevChat.TV. Quick reminder: go check out JSRemoteConf.com if you want to learn cool JavaScript stuff online live. Anyway, we also have two special guests this week.
We have Greg Wilson.
GREG:
Good day.
CHUCK:
And is it Andreas Stefik?
ANDREAS:
That’s me.
CHUCK:
Do you guys want to introduce yourselves really quick?
GREG:
Sure. Andreas, go.
ANDREAS:
Sure. I’m an Assistant Professor at the University of Nevada, Las Vegas. I study the science of programming languages, in the context of how programming languages impact people or communities. We test a lot of, I guess you could say, old myths in programming language design.
GREG:
And for the last 16 years I’ve been teaching scientists the basics of programming. About halfway through that, I realized I really ought to start teaching programmers the basics of science.
So, that’s part of what Andreas and I will be ranting about today.
CHUCK:
Boom.
JESSICA:
So, you put the coding into science and the science into coding?
GREG:
Absolutely.
DAVID:
You’re like the Reese’s Peanut Butter commercial. [Laughter]
DAVID:
You got coding on my science! You got science in my coding! [Laughter]
DAVID:
Mm.
GREG:
The difference is scientists are willing to pay attention when you go and teach them version control. Most programmers stick their fingers in their ears and say, “La-la-la. I can’t hear you. XP!” when you say let’s look at evidence.
CHUCK:
[Laughs]
DAVID:
I’m really excited to have you on the show because I watched your talk and you’re here to talk to us about the theory of science, which I am hugely excited about.
GREG:
So, let’s dive straight into that. Well, let me ask the panel a question. Do you believe that geographic separation between members of the development team has any impact on the quality of their work?
DAVID:
So…
CHUCK:
I’m going to feel really dumb at the end of this episode. [Laughs]
DAVID:
Here’s where we find out who’s watched the video and who hasn’t. [Laughter]
DAVID:
Chuck?
GREG:
So, the [inaudible] is you’ve got to stick everybody in one room so they can communicate. Well, during the construction of Windows Vista, Microsoft collected huge reams of digital data. Not just every change to the code and every test that was run and every compile failure, but every meeting that took place, every message that was sent, every phone call that was scheduled, the whole thing. And then they threw a bunch of machine learning algorithms at it and said, “What can we find in here that correlates with quality?” We’re looking for predictors of bugs per DLL shipped, because there’s a good metric. How many faults were there per module in the code we sent off to our customers?
And it turns out that geography doesn’t really matter. It turns out there is something that matters a whole lot more, and that’s how far apart the developers are in the org chart. The higher you have to go to find a common parent in the org chart who can resolve disputes or just tell them what the project’s actually about, the worse the software is. And once you hear that, you’re not surprised.
JESSICA:
Where I work, at Outpace, we’re all remote. All the developers are remote. So, we sort of embody the outcome of that research. But also, we have a relatively flat org chart. What I liked, Greg, about the talk was when you mentioned the distance in the org chart being an indicator of more bugs, that the org chart is a proxy for goal alignment.
GREG:
Yeah.
JESSICA:
That’s something we work really hard at, at Outpace: making sure we all share the same goals.
GREG:
Absolutely. And once you hear this result, as a programmer you go, “Aha. It’s a management problem,” and you’re happy. You’ll accept it. [Laughter]
GREG:
Now, in the same book where we report on that, there’s a meta-study compiling all of the evidence that we had in 2010 about test-driven development. And it turns out that on balance, there is no evidence that it has any impact up or down on the quality of software or the speed with which it’s produced.
DAVID:
La-la-la. I can’t hear you.
GREG:
Right.
CHUCK:
[Chuckles]
JESSICA:
Okay, so TDD is dead. There we have our answer.
GREG:
Oh, no, no. No.
CHUCK:
[Laughs]
DAVID:
Well, well, well, well. TDD is no more alive nor more dead, right?
JESSICA:
Oh great, now it’s undead.
CHUCK:
Schrodinger’s…
DAVID:
I’m fine with that.
CHUCK:
Test-driven development.
[Chuckles] Schrodinger’s TDD.
CHUCK:
[Chuckles]
GREG:
So, the reaction I get when I give this talk is really interesting. If you say, “Here’s some evidence that confirms that it’s management’s fault,” every programmer in the room will go, “Excellent.” If you say, “Here’s something that is now part of the catechism of programming, this is something you’re supposed to believe, and you say actually the evidence doesn’t support it. It doesn’t contradict it either,” the reaction isn’t, “Huh. Let me go and have a look at those studies because that doesn’t jibe with my experience. Maybe there’s a fault in the study. Maybe you’re measuring the wrong things.” No, the reaction is, “Well, you have to be wrong. That can’t be right, because I know X.” My dad is in his 80s and flat out does not believe that smoking causes cancer because he smoked two packs a day since he was 14. And he doesn’t have cancer, therefore smoking doesn’t cause cancer.
CHUCK:
Oh, that is so interesting, because it’s so dangerous.
DAVID:
Yeah.
GREG:
Absolutely.
JESSICA:
It didn’t cause cancer in him. The system worked for him. What’s the problem?
DAVID:
Right.
GREG:
Exactly.
CHUCK:
Right.
DAVID:
And Richard Stallman has RSI, therefore Emacs causes RSI because its inventor has RSI.
GREG:
Sure. And Andreas, talk about what happened when your first study saying that a randomly designed language was just as hard or easy to learn as Perl got on the [web]. [Laughter]
JESSICA:
Oh yeah, you guys have to hear this. This is good.
GREG:
Go ahead. Andreas, walk them through what happened. The study and then what happened afterwards.
ANDREAS:
Sure. So, as I said oftentimes we try to analyze how programming languages impact human beings. And this actually came about from a totally different purpose actually. We were working with a bunch of blind children because in the United States there was effectively no program for blind kids to learn programming. But what we noticed was that traditional languages like C or Java or these sorts of tools were actually really hard to use when you had to listen to them through audio. So, you might have to hear, literally hear for left paren, int i equals zero semicolon, i less than ten semicolon, i plus plus, right paren, left brace, which is a pain in the ass. It’s just hard to hear.
So we thought, “Well maybe that’s true for the blind. But I’m curious whether if we just took typical people that can see just fine and we ran a test, what would be the impact on comprehension or accuracy or productivity or anything of the above?” So, we designed an accuracy study. And we thought, “Well if we’re going to do a study we should probably have some kind of a control group,” just standard science, right? Since the late 1700s typical procedure is to use a control group.
And so, what we thought is, “Well I don’t know what a control group would be for programming languages. So, let’s see if we can come up with a concept of placebo.” So, we took a programming language. We took Quorum which is the language that I’ve designed. We ripped out all the tokens, or a large majority of the tokens. And then we sat around the lab predominantly laughing our asses off rolling dice and then replacing the characters with random symbols. [Laughter]
ANDREAS:
Now the purpose was we thought, well maybe this placebo will give us some kind of a baseline. If a programming language, if a human uses that language four times better than a randomly designed language, maybe that gives us a baseline against something really bad. But then we ran the study. And it turned out…
CHUCK:
Dun-dun-dun.
ANDREAS:
Yeah [chuckles]. We ran the experiment and to our surprise, we found that people using Perl could not perform any better than with the language we designed by sitting around rolling dice.
DAVID:
Wow.
ANDREAS:
And that surprised us a lot. But what would surprise us even more is that the moment that it got on Slashdot, my students and I started getting a tremendous amount of hate mail. And this really hasn’t ebbed much since that time.
DAVID:
Wow.
ANDREAS:
For example, we put out another paper just a few weeks ago containing evidence related to a lot of the studies we’ve run, only to get an email not long ago saying that I was starting a language eugenics program, and that I was no better than Hitler or Sauron apparently.
JESSICA:
Sauron [laughs]. [Laughter]
DAVID:
Wow.
ANDREAS:
Yeah, maybe they were…
CHUCK:
Oh, that they put them on the same level.
ANDREAS:
Maybe they were joking, but my German colleague didn’t think it was that funny given the history. But you know…
CHUCK:
Oh, wow.
DAVID:
Neither did the hobbits.
ANDREAS:
Yeah.
[Laughter]
ANDREAS:
But so you know, [inaudible]
DAVID:
I just want to know, Andreas I want to know how many Google Summer of Code projects were proposed in Quorum after you released your findings.
ANDREAS:
Oh [chuckles]. Well, we didn’t propose any.
DAVID:
No, I mean how many people picked it up that actually said, “Oh, I want to learn that language.”
CHUCK:
Nope.
JESSICA:
Ah, too easy.
ANDREAS:
Probably none I would imagine.
GREG:
Right, because, because…
DAVID:
I would disagree with you just on the grounds that languages like Brainf*ck and Befunge are deliberately written to be hard to learn and hard to compile. And…
GREG:
Brain [inaudible].
DAVID:
Brain F has been ported to every language out there. I’ve written a Brain F to Ruby… [Laughter]
DAVID:
Compiler. We exist. We are sick people.
JESSICA:
Well, exactly. We don’t pick a problem because it’s easy.
GREG:
But…
DAVID:
Right.
JESSICA:
We pick a problem for its technical difficulty.
DAVID:
Yeah.
GREG:
But it comes back to something that I first heard from my father that he attributed to Winston Churchill but the quote goes back much longer. I’ve still not traced this one down to its roots. Most people would rather fail than change.
DAVID:
Mm.
CHUCK:
That, oh, wow.
GREG:
Given the choice between changing your beliefs about something or continuing to burn 140 calories an hour banging your head against a wall, and yes that study has been done and that is the number…
DAVID:
Wow.
GREG:
Then most people would rather fabricate an excuse to justify all the pain they went through to get this far. It’s like a hazing ritual. You have to say it builds character, because otherwise it was just pointless misery. And you can’t admit that to yourself.
DAVID:
Yeah.
GREG:
And I’ve given a talk several times now asking people who are software engineering researchers who teach software engineering classes at colleges and universities why we don’t rebuild that standard intro to software engineering class along more useful lines. Right now what happens is you take your intro to software engineering and there’s a group project in groups of typically four to six. And you have to draw up a requirements spec and then you have to do an implementation plan. And you probably have to draw some UML diagrams. And all those things that nobody ever actually does in that order in the real world.
ANDREAS:
Yeah, exactly.
GREG:
And if you’re forward-thinking, you say, “No, we’re going to be agile and we’re just going to do this in sprints.” Both are fictions, as one of my students explained to me several years ago. She always left everything until the last moment because that was the only rational choice. If you’ve got five bosses, five professors who don’t talk to each other about deadlines and due dates, then the only sensible thing is to wait until the last possible moment when the professor has run out of energy and isn’t going to fix the assignment spec again under your feet. [Laughter]
GREG:
So, they lied to us. When we say track your hours and show that you did a few hours every day because we’re supposed to be emulating extreme programming or scrum or something like that, they just make stuff up. And we know that. And we pretend…
JESSICA:
That’s kind of realistic though.
GREG:
It is. But what we could do instead is have a software engineering course where the very first assignment looks something like this: here is a Git repo for a Java project that’s had 2,000 commits and is now about 20,000 lines long. And over here is the bug tracker showing where bugs were found over the lifetime of those 2,000 commits. Hypothesis: short methods have a higher bugs-per-line count than long methods.
GREG:
Go and prove or disprove. Well, I don’t know if that’s true or not. What I want you to do is go off and look at the data we’ve just given you for this one project and tell me if it’s true. Now, think about what you have to do in order to answer that question. You have to learn some code analysis tools. You have to decide how you’re going to measure the length of a method. Is it lines or is it lines of source code? Is it number of semicolons? There’s a bunch of ways to do that.
You have to decide how you’re going to attribute bugs to particular methods. You have to do science. You have to take this fuzzy idea that sounds interesting and you have to operationalize it and come up with a particular experiment which might be different from the one that your colleagues come up with. And you might come up with different answers because you’re actually measuring different things.
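To make that concrete, here is a minimal sketch of what a first pass at that assignment might look like in Python, in the spirit of the notebook workflow discussed later in the episode. Everything specific in it is an assumption for illustration: the repo path, the heuristic that a commit message containing “bug” or “fix” marks a bug-fix commit, and the very crude regex used to count Java methods.

```python
# Rough sketch: per file, estimate average method length and count the churn
# from bug-fix commits, then eyeball whether the two move together.
# Repo path, bug-fix heuristic, and method regex are all illustrative assumptions.
import re
import subprocess
from collections import defaultdict
from pathlib import Path

REPO = Path("path/to/java-project")          # hypothetical project location
METHOD_SIG = re.compile(r"^\s*(public|protected|private|static).*\(.*\)\s*\{")

def bugfix_churn_per_file(repo):
    """Lines added+deleted by commits whose message mentions 'bug' or 'fix'."""
    log = subprocess.run(
        ["git", "-C", str(repo), "log", "--numstat", "-i",
         "--grep=bug", "--grep=fix", "--pretty=format:"],
        capture_output=True, text=True, check=True).stdout
    churn = defaultdict(int)
    for line in log.splitlines():
        parts = line.split("\t")
        if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
            churn[parts[2]] += int(parts[0]) + int(parts[1])
    return churn

def avg_method_length(path):
    """Crude proxy: non-blank lines divided by number of method signatures."""
    lines = [l for l in path.read_text(errors="ignore").splitlines() if l.strip()]
    methods = sum(1 for l in lines if METHOD_SIG.match(l))
    return len(lines) / methods if methods else None

churn = bugfix_churn_per_file(REPO)
for java_file in REPO.rglob("*.java"):
    rel = str(java_file.relative_to(REPO))
    length = avg_method_length(java_file)
    if length is not None:
        print(f"{rel}\tavg_method_len={length:.1f}\tbugfix_churn={churn.get(rel, 0)}")
```

A real analysis would attribute bugs to individual methods rather than whole files, but even this crude version forces the measurement decisions Greg is describing.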
But would you hire a programmer whose first instinct was to go off and say, “Hmm, I wonder if this is true. Let me pull together some data and I can answer the question”? Hell, yes. That would be better than where we are now. That’s what engineers do, is go and gather data, measure a few things and then say, “Well, maybe this doesn’t apply everywhere and at all times. But at least now we have some kind of pointer.” I’m bitterly disappointed that the people designing Go and Rust and Julia…
DAVID:
Dart.
GREG:
And Dart have just ignored the techniques that Andreas and his team have developed. It doesn’t matter if you believe the answers or not. He’s shown that you can actually go and measure these things and get a better language as a result.
DAVID:
So, Greg you have just touched on the one magic question that I wanted to ask you in this call. And I was going to save it ‘til the end, but I’d like to throw it out at the beginning and see if we can just weave it through the whole call, because I think it’s kind of what you’re on about here. How can I do science? I literally mean can I as a college dropout, can I do studies and get them published? Or do I have to find a grad student and a PhD candidate or whatever, I guess that’s the same thing, or some professor and convince them to spend their underfunded, over-budgeted time studying my pet project which none of them want to do?
I have a thousand questions that I would love to gather evidence on. And I’m just convinced that I’m going to write something up just completely filled with biases and incorrect assumptions and that sort of thing. Can I learn to design experiments and do science and gather this evidence?
GREG:
Absolutely. There are hundreds of thousands of amateur scientists in the United States right now doing real work, most of them with tenured faculty but not all of them. Rachel Carson changed the world. She didn’t have a degree. She just went out there and started doing the science. You can think of thousands of people who have gone out and said, “I can measure things. I can count things.” Andreas, how many times have you not run a study because you couldn’t round up enough programmers?
ANDREAS:
Well, in my case never because I work really hard to make sure I get them.
GREG:
Okay. How many of your colleagues have never run a study because they couldn’t round up enough programmers?
ANDREAS:
That would be probably all of them.
GREG:
Okay. [Chuckles]
GREG:
So, right there is something that a community like this can do. In the same way that public health research relies on having people in the community who will go out and ask questions, go door to door and, “Here’s the questionnaire. Can you fill it in?” or go and interview people. We can teach you how to do this. And you don’t have to wait for us. You can go and start asking questions about your personal history by logging through your repos on GitHub, the bug trackers you use.
Tavish Armstrong who at the time was a student at Concordia, he’s now down in California, just started asking some of these questions. When does he commit? And is he more likely to commit bugs on a Friday afternoon than on a Thursday morning? You can do that. And who cares if it gets published academically?
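For anyone who wants to try what Tavish did on their own history, a small sketch is below. It only assumes you have git installed and run it inside a repository; bucketing commits by weekday is the simplest possible version of the “Friday afternoon” question.

```python
# Sketch: when do my commits happen? Bucket author timestamps by weekday.
import subprocess
from collections import Counter
from datetime import datetime

def commits_by_weekday(repo="."):
    out = subprocess.run(
        ["git", "-C", repo, "log", "--pretty=format:%aI"],   # ISO 8601 author dates
        capture_output=True, text=True, check=True).stdout
    counts = Counter()
    for stamp in out.splitlines():
        counts[datetime.fromisoformat(stamp).strftime("%A")] += 1
    return counts

if __name__ == "__main__":
    for day, n in commits_by_weekday().most_common():
        print(f"{day:<10} {n}")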
DAVID:
Yeah.
GREG:
You’ll be listened to.
ANDREAS:
The other thing you’ve got to remember here too is that, Greg didn’t mention it maybe because he’s being more political than I am, but in fact the academic side of the science here actually has significant problems. So, let me give you a few examples. In 1996 there was a paper, an excellent paper by Walter Tichy where he talked about the nature of science and software engineering. And he found very strong evidence that software engineers were simply not using evidence even in academia.
Now, think about that for a second. If you as an individual want to publish a paper at an academic conference and you know with some level of certainty given the evidence from Walter Tichy and actually later work by Andy Ko, and also we’ve shown it in the language design community as well, if you know that they’re not using evidence, what purpose is there even of publishing in those venues, for an individual person? For example…
DAVID:
Right, just my ego. [Chuckles]
ANDREAS:
Let me give you one very specific example that I like. You’re all familiar with functional programming I assume, right?
CHUCK:
Mmhmm.
DAVID:
Sure.
ANDREAS:
So, there’s a conference, a very famous conference, called the International Conference on Functional Programming. We have actually read and tracked formally every single paper they have ever written. Out of that entire academic history, guess how many of those papers actually investigated the impact of those features on human beings?
GREG:
Zero.
ANDREAS:
That would be two.
GREG:
Oops.
ANDREAS:
And neither had a control group. [Chuckles]
CHUCK:
Wow.
DAVID:
Wow.
ANDREAS:
I want to be clear about that. Whenever you hear a functional programming language feature’s being added to a new language, I want you to remember, every single paper in their entire history at the top academic conference.
DAVID:
That’s because functional papers are immutable. [Laughter]
GREG:
You know, I am normally a peaceful man. You are disturbing my calm. [Laughter]
GREG:
I just threw a link into the chat area of the Skype call to a thing that Tavish Armstrong threw together on a Friday night, because Fernando Perez who’s one of the creators of IPython said eight minutes from bug on my box to review to merge PR. And I’ve got Tavish, an undergrad at a school over in Montreal thinking, “How long does it usually take to do a review on the IPython project?” So, he answered the question. So, flip over to that link and look. Here’s how you do it. You import a bunch of tools. Well, we understand data analysis, we kind of invented that, and then you turn the Git log into some JSON and then you go and you do some stats and you’ve got the distribution. It’s like, “Oh, this is how long it takes and notice that there are some spikes.” Okay.
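The notebook itself isn’t reproduced here, but a stripped-down version of that kind of analysis might look like the following. It uses the public GitHub REST API; the target repository, the number of pages fetched, and the unauthenticated rate limit are assumptions for illustration.

```python
# Sketch: how long do pull requests take to merge on a given project?
import requests
from datetime import datetime
from statistics import median

def merge_times_hours(owner, repo, pages=3):
    hours = []
    for page in range(1, pages + 1):
        resp = requests.get(
            f"https://api.github.com/repos/{owner}/{repo}/pulls",
            params={"state": "closed", "per_page": 100, "page": page},
            timeout=30)
        resp.raise_for_status()
        for pr in resp.json():
            if pr.get("merged_at"):
                opened = datetime.fromisoformat(pr["created_at"].replace("Z", "+00:00"))
                merged = datetime.fromisoformat(pr["merged_at"].replace("Z", "+00:00"))
                hours.append((merged - opened).total_seconds() / 3600)
    return hours

hours = merge_times_hours("ipython", "ipython")
print(f"{len(hours)} merged PRs, median {median(hours):.1f} hours from open to merge")
```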
Anybody can do this. And I think this is what engineering looks like. I trained as an engineer originally. I switched over to programming because I’m very clumsy. There was actually an afternoon in 1982 when I picked up a soldering iron the wrong way around twice in the space of an hour. [Chuckles]
GREG:
And I still have the scar on my right hand. And the second time I had the hand under the cold water tap, the lab tech came over and said, “Greg, have you thought about the programming option?” [Laughter]
GREG:
Probably saved my life, right? Because the next term was the power transmission course with the 50,000 volt transformers and I wouldn’t be here. [Laughter]
DAVID:
Greg, can we be spiritual brothers if I tell you that I got into software after I flipped molten solder into my eye?
GREG:
Ouch. Okay, you win. [Laughter]
DAVID:
Okay. Well, no honestly I think burning my hand on the iron was about as bad.
GREG:
Yeah. So, the whole thing that makes engineering different from the craft that came before it is that engineers go off and measure things. They build their tables. They understand strengths of materials. The Romans knew a whole lot about the strength of concrete. Some of their structures are standing 2500 years later. But they also had a lot fall down. And once engineers started adopting calculus and the experimental method in the early 1800s, first in Germany and then in France and the UK, things stopped falling down as often. And there’s a reason why we can build big chemical plants. There’s a reason why the power transmission grid mostly works. It’s applied science. It’s science with a purpose rather than science for the sake of curiosity.
And software engineering by that metric isn’t an engineering discipline but it could be. We now have the data that they didn’t have 30 years ago. SourceForge revolutionized the study of software development because for the very first time there’s a huge amount of information, admittedly from a rather odd community. But there’s a huge amount of information out there in the open. And now anybody who wants to can go and do things like look at how long it takes to close an issue compared to the size of the eventual patch and see if there’s a correlation. Are bigger fixes slower to land? Are security fixes faster or slower?
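That issue-size question is exactly the kind of thing a rank correlation answers. As a sketch, assuming you have already exported the data to a CSV (the file name and column names here are invented), the analysis itself is only a few lines:

```python
# Sketch: do bigger fixes take longer to land? Rank correlation on exported data.
import csv
from scipy.stats import spearmanr

sizes, hours = [], []
with open("fixes.csv") as f:                      # hypothetical export: one row per fix
    for row in csv.DictReader(f):
        sizes.append(int(row["patch_lines"]))
        hours.append(float(row["hours_to_close"]))

rho, p = spearmanr(sizes, hours)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")   # rho > 0 would mean bigger fixes land slower
```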
Here’s another interesting thing. So, until recently I was working at Mozilla. So, they’ve shifted to a much shorter release cycle for Firefox. They’re now on a six-week cycle. And what they have found is that it reduces the number of bugs that get shipped that cause Firefox to absolutely crash. That’s interesting. But what’s really interesting is when Firefox crashes, it does so quicker after launch than it used to, even normalizing for speed of machine. And we have no idea why.
DAVID:
So, from an engineering standpoint, agile short cycles reduce the mean time between failures. This is a very bad thing, right?
GREG:
Well, is it? I mean, we’ve put…
DAVID:
I’m yanking your chain, dude. [Laughter]
GREG:
But the thing is we’ve got data that we didn’t use to have.
DAVID:
Yeah.
GREG:
And we’re not training future programmers to think in terms of data. Here’s a scary stat. Just doing a lap around the table, well let me ask David. David, what did you do before you were a programmer?
DAVID:
I was a grade school student.
GREG:
Okay, so you went straight from that to this?
DAVID:
I’ve loved this field ever since day one. I’ve done retail jobs and that kind of stuff, sure.
GREG:
Okay. What about Charles?
CHUCK:
Before I was a programmer?
GREG:
Yeah.
CHUCK:
I did QA and before that I ran a tech support department and before that I did IT.
GREG:
Okay. And Jessica?
JESSICA:
I was a Physics major.
GREG:
Okay.
JESSICA:
Straight from undergrad into programming. And I will say that the scientific method that was a large focus of the Physics curriculum has helped me tremendously in programming. I haven’t done any aggregate level studies, but just the individual experiments, that is, any sort of debugging, the scientific method has helped tremendously with that.
GREG:
But it’s helped in another way. When you did your undergrad in Physics, how many lab experiments did you do over the course of four years?
JESSICA:
If you count the little ones, then a couple of dozen.
GREG:
At least, right? A biology program in Canada cannot stay accredited if there aren’t at least six hours of lab work per week over the course of four years. So, by the time you finish a four-year bio degree, you’ve done dozens of experiments.
The computer science student who shares an apartment with that biologist will do on average one experiment in four years. It’s probably in the operating systems course where she probably collects a bit of performance data, throws away the outliers, takes an average, and pretends that means something. If she does the…
CHUCK:
Wow.
GREG:
Human-computer interaction course she might do a second experiment on something like eye-hand coordination or color perception.
DAVID:
Maybe, yeah.
GREG:
Maybe.
DAVID:
But both of those experiments will be to replicate existing experiments in the textbook, correct?
GREG:
Right, which is true of most undergrad physics as well.
DAVID:
Okay.
CHUCK:
Yeah, I saw that in my computer and electrical engineering courses as well.
GREG:
Yeah. So, people spill out of college and university trained in computer science hopefully knowing how to program without the direct experience of constructing a model, constructing an experiment to test it, gathering the data, analyzing the data. And that’s exactly what we want them to do when they’re doing performance engineering on a cluster. It’s exactly what we wish they were doing on their own development processes and on their code. Are there features in programming languages that are more likely to cause or be correlated with, I should be careful, more likely to be correlated with gnarly bugs? The answer in our gut tells us yes. But I bet we wouldn’t agree on which ones.
DAVID:
Right.
GREG:
But we can go, we’ve got that data. Does garbage collection reduce the number of bugs in programs? We all know what it’s like to manually manage memory in C and C++. Are there fewer bugs in code when you start having the machine manage it? I have never seen any data that proves one way or the other.
ANDREAS:
One thing that I think is important to say with this too is that whenever I give talks on the science of a programming language… by the way I should say before I say that, I think Greg is spot on. One of the biggest problems here is definitely academia isn’t really teaching computer scientists the scientific method, which involves experimentation. But anyway, what I was going to say is that I think there’s another issue to think about here besides simply issues like bugs. With source code repositories you can learn things like bug reports and figure out correlative information that way. But in some cases, there are certain types of people that you can’t gather information on that way.
For example, in the UK right now Simon Peyton Jones, one of the people at Cambridge, is starting a program because the UK is now essentially mandating that computer science be taught to children. And this, to the best of my knowledge, this is children ages 6 to 17, right? Now, on the one hand we might say well, these are just children. They’re not professionals. Maybe we don’t care about them. But I think that would be a little bit naïve because this generation that’s coming up in the UK is going to beat the pants off anybody taught in the United States because they’ll have 11 years of programming experience.
But the thing is though is if we have these students using programming languages, the question then is which ones and what is the actual impact on seven-year-olds? On 12-year-olds? What about 13-year-olds? Are there differences? What about a child that can’t read using Scratch? Does that provide transfer of learning to a programming language? And if so, which ones and under which conditions does that matter? In other words, I’ve…
GREG:
Right, and…
GREG:
I was going to say, and people like Mark Guzdial…
ANDREAS:
Yeah.
GREG:
Mark Guzdial and Barbara Ericson at Georgia Tech, the Scratch Group at MIT, they have done so much high quality work on exactly these questions over the course of, what is it now, 15 or 16 years that Mark and Barbara have been at Georgia Tech. Mark Guzdial and Barbara…
ANDREAS:
Ericson.
GREG:
Ericson, they’re at Georgia Tech. And if you go to Mark’s blog, computinged.wordpress.com, the man is a blogging machine. Almost every day, he puts up something interesting…
DAVID:
Cool.
GREG:
About computing and education. He isn’t just an academic researcher. He’s been part of several consortia that have been trying to shift the needle in the southeast, Georgia, Tennessee, that region, trying to get this stuff into schools. And they’ve identified that the problem isn’t a technical one. The problem is that cycle: schools won’t offer courses unless they’re sure they can get teachers.
Teachers won’t specialize in computing until they’re sure the courses are going to be offered.
ANDREAS:
Well, I know Mark pretty well and there’s other problems there as well.
GREG:
Yeah.
ANDREAS:
And part of it is this: that in the US oftentimes computing doesn’t meet the requirements under things like No Child Left Behind. And as such basically what happens is a school district says, “Alright. Our math skills at our school are lower than we would expect and we’re going to lose funding unless we focus on that. Therefore, we better not teach computer science because that’s bad.” When the reality is what they should be saying is, “This is the 21st century and students at our school need to have technical skills because that’s the century they live in,” sort of just the opposite response of what they should have. But they don’t have a choice because it’s actually mandated. They lose actual financing if that happens. It’s a big problem.
GREG:
And there’s a bit of pushback [inaudible] though and I’ve been on the receiving end of this. Everybody can agree that there ought to be more computing in schools. I’ve never heard anybody argue against that. The problem is what do you take out to make room?
Here in Toronto for example, should the Toronto District School Board have less health and physical education to make room for computing? Well, I think that’s a bad idea. We’re facing an obesity epidemic like everybody else. Should there be less math? I’d fight that. Given the choice between math and programming, I think it’s more important the kids learn math. But should there be less on social studies? Easy for me to say, but that’s where kids learn the history of their country and how the legal system works and what the rest of this province and country look like.
Every single thing that’s in the curriculum has a defender. Somebody fought hard to get it in there and keep it in there. So, it’s not enough to say we’re right. We have to be more right than somebody else. We have to displace something, because it’s a zero-sum game. And people try to fudge this by saying, “Oh, if everybody knew how to program they could learn these other subjects faster.” Again, where’s the evidence for that?
ANDREAS:
Absolutely.
GREG:
So, I agree that there ought to be computing in the schools. But I’m damned if I know what I want my daughter to not learn so that she can learn programming.
DAVID:
And I know this will make you crazy but I would go in and argue that computer science should be taught almost as an advanced math class, just on the basis that, on the computer, I wanted to learn how to draw angles and circles and arcs. And I went to a 2A school in Southern Utah. So, pre-calculus and trigonometry were the advanced-track, senior-level classes. There just was no more math I could learn at my school. And I got into trigonometry and I could do it all in my head. I knew approximately what the sine of an angle was because I had spent three years writing programs to draw circles and polygons and that sort of thing.
JESSICA:
Speaking of learning from math and programming, Greg talked about and posted a link, which will be on the page, to an IPython notebook describing the pull requests. And this is a great example of how this IPython notebook tool is used by scientists to analyze data reproducibly, because the code is right there. And it’s basically like IRB with pictures. And it graphs the results of your commands for you so that you can look at them and analyze the data visually and really think about it.
This is something that I have found lacking. In Clojure there is one, but it was really hard to get running. In Scala there is one and it was even harder to get running. But this notebook, this idea of really looking at our data mathematically not just sampling it, like whenever I’m working in the REPL I just say give me the first element of this map so I can look at it. This is something that the scientists do that we could learn from and that Greg’s group teaches there. Being able to program gives us the ability to learn from this data in ways that scientists are just learning in Greg’s Software Carpentry program.
GREG:
And the irony is…
JESSICA:
There’s still room for Ruby.
GREG:
Yeah, the irony is that we as programmers are telling everybody else to use analytics. Go off and analyze traffic on your website. See who converts. See who clicks. Look at correlations in your sales patterns. There are a lot of programmers that have become data obsessed and stats obsessed. And the one thing they won’t turn that lens on is themselves.
ANDREAS:
I’ve got to say, this is one of the things that I found so odd about some of the emails we get about many of the studies we run. So for example, that Perl study, we ran it again on a larger sample, added a bunch of languages. Ruby does quite well in those studies actually. We learned a lot on the Quorum project from Ruby; specifically, the way it structures if statements was much better than what we had in Quorum, for example.
But anyway, the funny thing is though, when we get a letter from somebody saying, “Oh, you suck. You’re just bashing Perl,” whatever, the thing that’s so weird about it is I don’t really see why people even care. For example, suppose that the language that you use and love is no better than one designed randomly. Well, why not just fix it over time? Maybe not in Perl, maybe not in whatever language. But it seems to me that the next person designing a new language should just follow the data, because why wouldn’t you do that? It just makes no sense to not follow the evidence out of some kind of feeling of advantage or feeling of belief or something. It just doesn’t make any sense.
GREG:
Well, psychologists actually have a technical term. I had to look this up. There’s a term in psychology for somebody who will ignore evidence that could lead them to doing something better or more profitable, or healthier, or something like that.
ANDREAS:
[Really?]
GREG:
Yeah, the term is stupid.
[Laughter]
ANDREAS:
I was totally expecting you to say something [inaudible] [Laughter]
DAVID:
I was expecting you to cite the recent study on solution avoidance bias. But stupid is much funnier. [Laughter]
GREG:
And we all do it. Back when I was faculty at the University of Toronto, I had students who smoked. And I don’t know if you’ve ever had to bury a student, but you don’t get over that. And so, I would tell them, “Do you not know what this is doing to your heart, to your lungs, to your teeth?” And they all know. And they keep doing it. How many of you ate three healthy meals today? Hell, how many of you flossed this morning? [Chuckles]
GREG:
Right? So, we all do it somewhere. But the hypocrisy of telling everybody else that they should go and analyze their data, the measured self with the Fitbits and things like that, or all these personal monitors for how many steps did you take, what was your heart rate? We build this stuff. We tell everybody else how cool it is. We could turn that on our own practice.
Another example: a guy from the Firefox team, a fairly senior team lead, had some time on a flight and a bunch of data on his laptop. So, he went and he built a very, very simple linear regression model, basically just drawing a straight line saying, “If this is the rate at which new issues are coming in, when are we going to ship?” because we can look at that historically. Using that model, he said the next release of Firefox isn’t going to be ready until two months after it’s currently slated to go out. And everybody said, “No, no, no. We’re on track. We’ll be okay,” which means politically it’s not acceptable yet to say we’re going to miss our target date.
Well, guess what? They shipped two and a half months after they were originally slated to. Now, at that point people started paying attention. But it shouldn’t have been an astounding bit of rocket science. It should have been normal.
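The model being described is nothing fancier than fitting a straight line to the bug burn-down and extrapolating. A back-of-the-envelope version, with invented numbers standing in for the real Firefox data, might look like this:

```python
# Sketch: fit a line to the count of open release-blocking bugs over time and
# project when it reaches zero. The weekly counts below are made up for illustration.
import numpy as np

week = np.arange(8)
open_blockers = np.array([120, 116, 108, 104, 97, 95, 88, 84])

slope, intercept = np.polyfit(week, open_blockers, 1)   # simple least-squares line
projected_week = -intercept / slope                     # where the line crosses zero
print(f"burning down ~{-slope:.1f} bugs/week; projected ready around week {projected_week:.0f}")
```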
DAVID:
It’s in ‘Code Complete’.
GREG:
Sure.
DAVID:
This was published in 1996. [Chuckles]
GREG:
And I think the problem is the blind leading the blind. Most faculty in computer science departments have never shipped a product. They’ve certainly never shipped version two of a product. And they weren’t taught to think about software engineering as an engineering discipline. Most of what is taught in software engineering is actually project management.
ANDREAS:
That’s exactly right.
GREG:
Now you can…
ANDREAS:
In fact, even the textbooks are overwhelmingly just project management. It’s silly.
GREG:
Well, no it isn’t. There needs to be project management. And most engineering programs have a project management class. And you can be objective. You can be analytical about project management. If you’re doing lots of projects in the construction industry you can tell, how many kilometers of road are we going to get built this month? My father-in-law used to do exactly that. You can’t do it nearly as accurately in software development. But you can do it a lot better than we do. If you’re building websites for shopping malls you can predict pretty much how long it’s going to take, plus or minus a factor of two maybe. Whereas right now, we’re just guessing.
ANDREAS:
Well, what I mean is that when you read one of these major software engineering books, almost any of them really, they talk about project management but it’s very fuzzy. It’s not like, “Here’s how we use evidence to try to gauge things.” It’s more like, “Here are the different process models that you could use. You could use waterfall. You could use spiral. You could use agile.” And then that’s it.
GREG:
Yeah.
ANDREAS:
That’s the end. The end, all we need to know.
GREG:
Let me ask the panelists. How many of you have watched the show called Mad Men?
DAVID:
Oh, like one episode.
GREG:
So, ad agency in New York in the 1960s. And one of the things that struck me was that they were all blind. It was six months from where you started an ad campaign until you had even a little bit of sales data back to tell you if you were on the right track. Everybody was flying blind. Today, it is 15 minutes from the time an item leaves the shelves in Walmart until head office knows that item has been sold. That’s globally. That’s not just in the United States. And their…
JESSICA:
And on the website, there’s no reason for it to be that long.
GREG:
Everything gets aggregated, archived. It’s just, we have this and it makes things more efficient. But it took marketing two decades, three decades, to turn themselves into a data-driven discipline. And parts of it still aren’t. Medicine is still going through this. There are very few medical programs in North America whose curriculum is driven by empirical evidence. The one at McMaster University here in Ontario is one of the few. There’s nothing they teach, absolutely nothing they teach for which they do not have a study saying, “This is beneficial.” And the thing that makes them really special is that they have this rolling survey. What do doctors still remember and use five to ten years after graduation? Because if you’re not still using it five years after graduation, we won’t bother to teach it.
CHUCK:
Wow.
GREG:
So, they’ve got the four-year curriculum down to three years. And they can prove that their graduates are just as good at being GPs as anybody else’s.
ANDREAS:
That’s incredible.
GREG:
Right? And this is…
ANDREAS:
Especially in medicine, where the history of lack of data is so astounding. There were the studies on mesmerism by Benjamin Franklin in the late 1700s, yet that ridiculous theory lasted 100 years. Homeopathy, which came out in the 1830s, was discredited by the 1900s, yet is still funded by the UK today. It’s staggering.
CHUCK:
That makes me think…
AVDI:
After watching the…
CHUCK:
Oh, I was just going to say that makes me feel better about going to the doctor. [Laughter]
AVDI:
After watching both of your talks, I kind of wanted to start calling myself a Software Witch Doctor. [Laughter]
DAVID:
I have this in my notes, yes.
AVDI:
It’s really a lot of what we do is folk medicine.
JESSICA:
But there’s so much more glory in that, in just…
AVDI:
Yeah, and…
JESSICA:
[inaudible] right compared to proving it. [Inaudible]
CHUCK:
I am so starting, I’m going to start up a dev shop now that’s called Shrunken Heads Software. [Chuckles]
AVDI:
It reminds me a lot of, I love reading authors from the 19th century. You had these guys with incredibly impressive beards who would write about things like psychology. You had William James writing about psychology and Sir James George Frazer writing about what became anthropology. And some of that writing is so great. But it really was just a guy stroking his beard and taking all these stories that people would, anecdotal stories that people would submit from the field. And then he’d stroke his beard for a while and come up with a grand theory to explain it all. And some of it turned out to be somewhat apt. And a lot of it turned out to be bunk. But boy, some of that writing sure is fun to read.
GREG:
Yeah. And the idea that if you make something quantitative, that if you measure it the fun goes out, is completely untrue. And I’ve had this argument with people in the arts. They say, “If you know the science behind the rainbow, the rainbow stops being beautiful.” And the answer is no. If you read literary criticism it doesn’t make you enjoy the novel less. If you learn music theory, it doesn’t make you enjoy the song less. If I know why the rainbow has those colors, I think I appreciate it more. Knowing about evolution lets me see more beauty in the world around me rather than less.
And I think the same is true of engineering. I think that both of my brothers can look at a house and see things that I can’t because they know how it’s put together. They see through it, not just the surface. I think that understanding more about the relationship between choice of variable names and how readable the code is makes me a better programmer and also makes me enjoy programming more, because I know I’m doing the right thing.
Pair programming, some of the studies that were done, and there were a lot of flaky studies done, but it looks like there’s a technique that actually does improve coding, but only in short bursts. If you pair program with the same person for an extended period, the benefits wear off. Now, one theory is that you start thinking alike. Another theory is it’s the Facebook effect. When you’re first sitting beside somebody, you don’t want them reading your email. If you’ve been sitting beside somebody for three weeks, you don’t care if they see you with your fuzzy slippers on. So, maybe pairing with somebody forces you to be on your best game for a while. But then you slack off, just as you do after you get to know your roommates.
We don’t know. But now we can ask a more intelligent question. Now we can go and dig deeper. And that’s fun. So, my ideal software engineering course is something where we ask the students to collect and analyze data. I don’t know that they have to do experiments any more than astronomers do experiments. You don’t go and construct a galaxy and then measure its rotation rate. You use the ones that we have, because the environmental impact it’s just a killer otherwise. [Chuckles]
GREG:
But there’s so much data out there that we could give to our students and say, “Okay, analyze this. See what you can find in this.” Are good programmers really ten times better than average programmers? Let’s get back to that. Well, there’s GitHub. Tell me what you’re going to measure and how you’re going to measure it. And there is no right answer. There’s just a right method. You change exactly what you’re measuring and how you’re measuring it and you can get whatever answer you want. But at least now, I know exactly what you mean.
JESSICA: And…
GREG:
Now, I can agree or disagree. Go ahead please, Jessica.
JESSICA:
And all of those studies are done in a particular context. But if we get good at gathering data and analyzing it ourselves, we could find out what works in our particular context.
GREG:
Yes, absolutely.
ANDREAS:
Also, I think that’s a good point because when you run experiments the context matters a lot. But so, when I was in London at Code Mesh and I think Jessica was there in the room, there was an individual that got up and said, “The only thing that matters for an experiment on static versus dynamic typing is if you had professional programmers that were working on projects of one million lines of code or larger.”
JESSICA:
Was that individual Joe Armstrong? I believe it was.
ANDREAS:
It may have been. [Chuckles] But the point is that oftentimes I think as developers we have these beliefs that certain contexts matter or don’t. But the reality is in most scientific fields what you look for is patterns across sets of contexts.
And so, I want to give a tangible example. In the studies that we have on static versus dynamic typing, the evidence is pretty clear that if you have static typing in method declarations, it improves programmer productivity with a variance-accounted-for measure between 0.3 and 0.5. So, what that means in plain English is if you look at the wobble, the variance between programmers, about 30 to 50% of that variance is accounted for by the difference you observe. And when you videotape you can get a pretty good estimate of why: developers go in and look at other files and try to find what type to pass and all this kind of stuff.
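For readers unfamiliar with “variance accounted for”: eta-squared is one common version of that measure, the share of total variance in some outcome (say, task time) explained by group membership. A toy computation, with invented task times rather than data from the studies Andreas mentions, looks like this:

```python
# Toy illustration of variance accounted for (eta-squared) between two groups.
# The task times are invented; they are tuned so eta^2 lands near the 0.3-0.5
# range mentioned above, purely for illustration.
import numpy as np

static_group  = np.array([25.0, 41.0, 30.0, 38.0, 26.0])   # minutes per task
dynamic_group = np.array([36.0, 48.0, 31.0, 50.0, 45.0])

everyone = np.concatenate([static_group, dynamic_group])
grand_mean = everyone.mean()

ss_total = ((everyone - grand_mean) ** 2).sum()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2
                 for g in (static_group, dynamic_group))

print(f"eta^2 = {ss_between / ss_total:.2f}")   # ~0.35 here: ~35% of the wobble explained
```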
But the thing is that if you look at novices using static typing or also if you look at generics when you’re trying to change files, you see different properties. So for example, for novices you find that these individuals actually miss the type annotations 92% of the time inside of a method declaration. Exactly the same spot that a professional or someone with more experience would garner benefits from. On the other hand, if you look on the inside of a function, like you’re writing code within a function as opposed to the declaration, what you observe is that novices miss it less. So, you can actually look across those contexts and try to find compromises.
So, that’s why in Quorum right now we allow a certain amount of type inference on the inside of functions because it helps novices a little bit. But we don’t allow inference on method declarations because the evidence points to the fact that it causes a drop in productivity for developers.
DAVID:
Wow.
ANDREAS:
So, we might as well look across these and do what helps the most people possible under as many contexts as we can reasonably study in a lifetime.
GREG:
One of my profs when I was in engineering said, “Engineers are not allowed to use the words right and wrong. They’re only allowed to use the words better and worse.” That’s exactly what Andreas and his team are helping us get towards. There are always going to be tradeoffs. Let’s face it. If you are Guy Steele Jr. you don’t need types of any kind, right? If you are Simon Peyton Jones, I don’t think he uses the backspace key when he codes. [Laughter]
GREG:
I’ve met the man. I’ve watched him type. I don’t think he uses the backspace key when he codes. But we shouldn’t be designing all of our systems around or for them, nor should we be designing them around or for the 87-year-old grandfather who has never used a computer before. If we know what the distribution is, we can do what the HCI people do which is if 90% of the people are happy 90% of the time, you call that a win and you keep moving. But we don’t even know what our distributions look like and we could.
DAVID:
Yeah.
GREG:
We could. This is what’s so frustrating, is this is purely self-inflicted damage at this point. 20 years ago, we didn’t have the data. I grant that. 10 years ago, we had it but we weren’t really sure what to do with it. Now, we don’t have all the answers. We don’t even have most of the important answers.
But we have some of them and we can act on them. And I’m just embarrassed that we don’t.
DAVID:
Here’s a question I have which is, there’s actually a philosophical field of study dedicated to the philosophy of science itself which is, how do we know science works? Like, how do you tell pseudoscience from real science? And one of the bugbears in the philosophy of science is that it’s actually impossible to discern pseudoscience from a genuine astonishing discovery in the early days without being able to read the mind of the discoverer. You have to have this longitudinal study. And I’m curious. You mentioned in your talk, show me the evidence and the plural of anecdote is not data and that sort of thing. If everyone goes off and demands evidence, who’s going to perform studies? I’m asking this for the sake of my job security as a folk medicine witch doctor here.
GREG:
Sure. So, that’s the role of people like Andreas. That’s why we fund universities. They are trained to do this. The thing that we need to tell them is what questions we have. Now, I’m going to have to jump but Andreas might be able to pull up a paper by Andrew Begel, B-E-G-E-L, who’s at Microsoft Research. He went and asked several hundred developers inside Microsoft, what questions do you most want software engineering researchers to answer?
DAVID:
Awesome.
GREG:
Okay. Now we’re starting to close the loop, because going back to medicine, I think they have a much clearer idea there of what clinicians want answers for. There’s a wide gulf between what software engineering researchers study and what programmers care about. Again as an example, when I asked one of the leads for the Firefox team, what do you most want to know? His question was, how long should our release cycle be? We’ve picked six weeks but we just, we picked that. Should it be shorter or longer? Okay. I’ve never seen anybody do a study that looks at how long a release cycle should be for a product where you’ve got the possibility of near continuous deployment. I wish there was more of that.
I would love to see a hundred software engineering researchers show up to the next OSCON or the next Strange Loop. I keep trying to get them to go. I would similarly like to see more people from those sorts of conferences coming back to things like the International Conference on Software Engineering so that people could meet each other. And I keep trying to engineer that. And it keeps not happening.
DAVID:
No, I think that’s fantastic. I think that’s fantastic. Thank you.
GREG:
Sure.
DAVID:
Normally Andreas, I would say that you agreeing with Greg would make you redundant to this call. But in this case, this is a great case of scheduling failover.
[Laughter]
ANDREAS:
Well, my lab’s in the trenches for this stuff. We conduct a lot of studies. So, I think Greg has the luxury of being able to look big picture at a lot of this and ask questions about what’s needed and stuff like that. We can do that too, but we have to actually publish and try to prove a lot of this stuff. And it’s a hard thing, especially given that there is so little data out there oftentimes. Even knowing exactly what to test is not easy, even foundational questions. You would have thought that static versus dynamic typing, what the impact is on programmer productivity would have been answered 30 or 40 years ago. But in fact, it’s really only been the last three to four years that people have started investigating, at least in any serious sense, in an actual research line where they conduct multiple replicable studies over time on various kinds of groups and stuff like that. But believe it or not, that’s the truth. People didn’t even really investigate for decades, which is very strange.
CHUCK:
So, I have a question here. And it kind of stems from something Greg was talking about earlier where he mentioned, what’s our optimal release cycle? We arbitrarily picked six months. And it seems like when we’re following agile methodologies we have our retrospectives and we sit down and we say, “We have this problem. We want to overcome it.” So, we’re releasing buggy software, for example. Or, we’re really curious to see if this works better than this. But when we have the retrospective, we all sit around the table and we go, so how did you feel about what we did? And then the other guy, “Well, when we were pairing it really helped me with this. And it really didn’t help me with this.” And that’s useful I think in feeling productive.
But is there a way for us as development teams to actually demonstrate or prove our theories where we actually then can say definitively, yeah, pairing made a big difference on this team, or whatever?
ANDREAS:
Well, okay. So, this is actually a hard question. And the reason it’s hard is because when you’re developing tools for real, so for example even though my lab is a research lab I also build Quorum. So, I probably spend 20 to 30 hours a week working on the compiler or whatever it is I feel like, or some libraries or whatever it is we’re doing. And the thing is that when you’re in the trenches and you’re writing code, oftentimes you don’t have time to take a step back and do a ton of scientific analysis on how you’re coding. And also as scientists, sometimes it’s hard when it comes to this sort of work, because you don’t want to spend a lot of time trying to answer scientific questions that don’t actually matter, that aren’t foundational, at least as a research lab.
But see, this is the rub, because if you’re a development company you often don’t really care whether the metrics and analysis that you have are answering foundational questions. You may not care about static versus dynamic typing or syntax or something like that. But you might very much care about how much money it costs you when your developers spend an extra two months on something. The problem is, since we track almost nothing about our software development processes, at the individual company level or in general, oftentimes the first thing you can do is track anything you can track easily and see if anything sticks. That’s the honest truth.
In other words: at first, if you have no idea what you’re doing, then just have somebody on the team dedicated to tracking stuff, anything you think might be relevant, because that’s better than nothing. And if you find nothing, then at the end of the day, you pick other stuff for the next round, right? See what I mean? When you know nothing you have to start with something. And that something might be bogus. It might be completely wrong, totally misguided. But nonetheless, over time our discipline will get better at figuring out what things matter by people just trying all sorts of crazy ideas to figure out their productivity concerns.
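[For illustration, a tiny Ruby sketch of the “just track something” idea: append whatever the team decides to measure to a CSV each sprint and revisit it at the retro. The file name and metric names here are made up, not from the discussion:

  require "csv"

  path = "team_metrics.csv"
  # Hypothetical sprint metrics; swap in anything the team thinks might be relevant.
  metrics = {
    sprint:        "sprint-42",
    stories_done:  11,
    bugs_reported: 4,
    pairing_hours: 16
  }

  new_file = !File.exist?(path)
  CSV.open(path, "a") do |csv|
    csv << metrics.keys if new_file   # write a header row the first time
    csv << metrics.values
  end
]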
JESSICA:
That’s a win. It would be lovely if we could have data in our retro for anything that can conceivably be measured and we think is important. And also talk about feelings for things that feelings matter for, like our own experiences.
ANDREAS:
Yeah, I think that’s important. Oftentimes we, at least as researchers in my lab, we focus pretty clearly on quantitative measurement. So, we boil things down to numbers and then we run tests on distributions and such. But at the end of the day too, there’s more than just numbers. Oftentimes these qualitative studies asking developers things can be very useful. Or even on a team, oftentimes you can’t boil down the release of a product necessarily to a number because no one number would necessarily give you the whole story about what you just released. So sometimes, you just need to sit down and talk about it, too. So, this is the hard part about science, is that oftentimes you have incomplete measures and nothing’s perfect. So, it kind of is what it is in that context.
JESSICA:
That’s the story of our lives, as humans and as programmers, acting on incomplete information.
ANDREAS:
Yeah. So, in our studies on syntax for example, one of the results that we had a few years ago that I thought was funny concerned looping constructs, like a for loop or something. Well, it turns out that when we do surveys, and I mean scientific surveys that we’ve done at large scale, in multiple universities, with vetting by other scientists and peer review, what we find is that the words that are consistently rated the worst choices for understanding the concept of loops, that concept of repetition or iteration or whatever you want to call it, are the words for, while, and for each, in that order.
[Laughter]
ANDREAS:
Yeah, sort of funny. But on the other hand, the question then immediately becomes one of context, whether that matters. So for example, if you’re in a developer shop and you’re using Ruby or Java or whatever, you can’t change that even if you wanted to. It would make no difference, because the only people that can change that are language designers. But on the other hand, we know that the language design community isn’t using evidence.
[Laughter]
ANDREAS:
And even if they do use evidence, that particular result is generally done with novices, because they have no innate biases, right? So, the question then is, what kinds of people does that impact? For us on projects like Quorum, we have to effectively make a choice. Do we want to use that kind of evidence to try to influence the design of the language? Or does it have a negative impact for somebody else? And that’s a really hard question to answer sometimes. Granted, I think that most professional developers aren’t totally, what was the word Greg used, stupid?
[Laughter]
ANDREAS:
If you use a phrase like repeat ten times in a programming language, I think most professionals can figure out what the hell that means. So, I don’t know how much of a big deal that particular one really is. But what’s interesting in that context is that a professional can probably figure out what that means; a novice is statistically more likely to. So, is that the right decision? Well, it’s still hard to know. And that’s just language design. For individual companies, all bets are off.
DAVID:
Yeah.
AVDI:
And that sort of leads into something that I want to ask about. And this is kind of a big ball of a question. I’m not sure where to start into it, so bear with me. And first of all I just want to say, I think you’re doing God’s work. [Laughter]
AVDI:
But…
ANDREAS:
You’re the one.
AVDI:
So, I’ll take the role of the voice of resistance. We’ve talked a little bit about how there is a lot of resistance to these ideas. And I will be that voice a little bit, because I can sort of recognize in myself some of those resistances when I hear this stuff. I’m not sure of the best place to start, but I’ll start here.
I spent, I don’t know, roughly nine years in a giant aerospace corporation. And it wasn’t a particularly happy place to work. And they were definitely big on engineering, or at least what they called engineering. When academia impinged upon my work at all, it was usually in the form of management fads like CMMI and Six Sigma and EFQM and stuff like that, which came along with a lot of at least supposed metrics and data. And they usually seemed to be focused on how to squeeze another 2% of productivity out of a giant mass of developers who are basically miserable cogs in wheels. So, starting from the assumption of a bunch of miserable, replaceable cogs in wheels, how do we squeeze another 2% of productivity out of them?
And I guess something that I wonder about is, in an organization like that you could do a study. A lot of times it’s only the big organizations that collect any data, like Microsoft. So, a lot of the data we see comes out of places like Microsoft. And you can see that, say, something like static typing might make a percentage increase in that organization. But what I worry about a little is whether there’s a possibility of optimizing for local maxima there, because…
So, I think everyone on this panel probably has anecdotal evidence or experience of a smaller team that picked a set of tools and practices and just completely gelled on them. And this team just delivered and delivered and delivered. They were on target. And when you look at stuff like that, it often seems like it doesn’t even really matter what tools and practices they chose. What seemed to matter was either the organization itself or the fact that the set of tools and practices they chose worked well together. Which is sort of the idea in the original XP: that these 12 practices, or however many there were, all work well together but fall apart on their own.
JESSICA:
Or that the people worked well together.
AVDI:
Right. I guess some of the resistance that I feel in myself when I look at studies like this and I wonder if others share it with me is that sense that, what if this is just going to make us start optimizing for local maxima and ignoring either the organizational effects or the synergistic effects that would render those local maxima meaningless noise?
ANDREAS:
Yeah, so you know, actually I appreciate this question a lot because it’s the type of thing that comes up in this kind of research. Or actually, it’s also the type of thing that came up in medicine when they were transitioning from a non-evidence-based discipline to an evidence-based discipline. So, I’ll give you a classic example from medicine in just a second.
But let me tackle this local maxima issue first, because I think it’s really important. And that is this. If I, or some set of researchers, optimize programming languages as well as possible, let’s for the sake of argument say that there’s a best possible language. I think that’s BS. I don’t think there is such a thing or probably ever will be. But suppose for the sake of argument that there is. There’s no guarantee that even if you put that best possible language, whatever that is, at an organization, it will improve productivity by more than X percent for some value of X.
So, think about it. Even if you have the best possible language at an organization, that doesn’t mean that the people don’t suck, right? It doesn’t mean that the manager isn’t a dick, or that you didn’t hire colleagues with the wrong expertise. It doesn’t mean that you won’t lose funding at the last second for reasons that have nothing to do with the language technology. The problem is that these are human organizations. For example, a medical doctor can conduct a heart transplant and a certain percentage of the patients are probably going to die. But really good procedures might lower that percentage.
So, I think what’s important here when we talk about this concept of local maxima or context is to realize that even though we’ve done studies on things like static typing and syntax and stuff like that, at the end of the day we’ve conducted maybe six to ten, something like that, studies so far. But what we need is six to ten thousand studies.
In other words, I’ll use medicine as an example, in the 1830s or 40s or 50s or whenever it was. There’s a really famous quotation that I recall reading from this fella named [inaudible]. It’s a paper on the history of randomized control trials in that discipline. And he has this wonderful quotation that I’m going to say slightly wrong, but I’ll do the best I can. A doctor said something along the lines of, I’m not going to test homeopathy because it’s abundantly clear that bloodletting is the appropriate procedure for medical doctors to use.
[Laughter]
JESSICA:
Nice.
ANDREAS:
But think about that for just a second. At the end of the day, what the doctors at the time were missing wasn’t that one particular procedure, like static typing or something, was better. What the doctors at the time were missing was that they were a non-evidence-based discipline. And that’s the core problem that we see today. So, we don’t even really know exactly whether we’re hitting local maxima or not. And in the case of our static typing studies, I think we’re pretty confident that the impact of static typing varies across different kinds of people.
Actually, you may not know this result, so let me say it. There’s a 2013 paper out of [Hoops] or SPLASH by Stefan Hanenberg. They had people use generics for static typing. And what’s interesting is that the static typing results actually varied when you add in generics, compared to the types of API studies that Stefan and I did together earlier. This one was all just him. Anyway, if you use a generic, like you have a list of string, something like that in a Java-like language, it turns out that does bump your productivity under the conditions of the test just a little bit. The effect size is pretty small but it does exist. However, if you change a generic class itself, change is the key thing, it decreases your productivity tenfold, tenfold.
DAVID:
Wow. So, if you…
ANDREAS:
Now, think about that.
DAVID:
Even if you change the interface from list string to list integer or something like that?
ANDREAS:
No, no, no, no, no. I mean the list class itself.
DAVID:
Oh, okay.
JESSICA:
That we’re doing it right.
ANDREAS:
Well, in other words…
JESSICA:
By accident.
ANDREAS:
Yes. That’s the key, by accident, by accident.
DAVID:
[Chuckles]
AVDI:
Well, it’s sort of the natural process. We’re not accelerating ourselves yet. We’re still in the process of evolution, where nature selects the things that work out better than others. Or maybe it’s the animal that learns that when it goes over here it gets an electric shock; it doesn’t really understand electricity yet. [Chuckles] But it…
JESSICA:
And yet the whole interesting part of thinking is, can we do better than natural selection?
AVDI:
Right, right.
[Laughter]
DAVID:
Because if we don’t, we die.
ANDREAS:
[Chuckles] Well, true. There are all sorts of wonderful studies on artificial selection too, that are very fascinating. [I never] read them. Anyway…
DAVID:
In an interesting way, it sounds like what you’re kind of praising is what we’re doing, what Greg talked about in his talk about folk medicine. They would go to the jungle and they would find these indigenous people that were putting tree sap on a wound and it would get better very quickly. And they would then go, “Okay. We can see that that works. Let’s sit down and figure out why it works.” And I feel like that’s kind of it. Avdi said witch doctor. That made me laugh because I have in my notes, I am a computer science witch doctor, because all of my stuff is not based on evidence. It’s just based on my own experience of which tree saps make my programs go faster and which make them go slower. And it’s terrifying. [Laughs]
ANDREAS:
Well you know, I have another story here just because I think it’s interesting. When we ran our last study, which came out in December, we essentially gathered tons and tons of data on syntax. We had some people with experience look at it and do surveys and stuff like that. We tested across lots of languages like Go and Smalltalk and had people answer questions about different variations in syntax, basically lots of things that did semantically the same thing but where the representation of the language changed. And from my perspective, the most interesting result for me as a language designer on the Quorum project was that that study disproved a lot of beliefs that I held in regard to language design. So, I want to give you an example.
We thought in Quorum 1 that natural language would be very effective for certain kinds of constructs, especially if statements, right? So, we had if, and then we had some parens, and then we had A equals B or whatever. And then we had the word then, so like the if-then. But we also wanted a terminator, like the left brace and the right brace, so we had the word end. So, if you had an if statement, it’d say something like if A equals B then … end else … end, that sort of idea.
DAVID:
[Boo].
ANDREAS:
Yeah, well, you say that. But then we tested it, and we tested it against a whole bunch of other languages, one of which was Ruby. And we used this technique which comes from DNA processing. We call it a token accuracy map. It’s a way to predict, highly accurately, which tokens cause problems for people in a particular context. It works very well and is highly [replicable].
DAVID:
Huh. Cool.
ANDREAS:
Yeah, it’s sort of funny. That’s why I can say something like 92% of novices miss the type annotation on a method declaration, because this DNA processing technique helps us sort this out, basically. Anyway, what we found is that Ruby just demolished Quorum in its design of if statements. The thens at the top of the statement, only 67% of novices got right. The ends on the inside of statements, only 8% of people got right. It just totally destroyed us. And so, we could have done two things. We could have said, “Ah, Ruby’s wrong. Screw Ruby.” But what we really did…
JESSICA:
[Laughs]
ANDREAS:
Is we stole everything that worked. So now…
DAVID:
This is how I know you’re not a PHP programmer.
[Laughter]
ANDREAS:
Yeah [chuckles]. Actually, we were fighting with PHP yesterday, but that’s another story.
DAVID:
[Chuckles]
ANDREAS:
So, the only exception that Ruby did not get right, that Quorum did get right in terms of the if statement design, was the equals sign. Ruby uses the double equals. But it turns out that in the four languages we tested, for any language that used the double equals sign, between 0 and 8% of novices (getting the exact number is slightly hard for complicated reasons) correctly used it inside of an if statement. That’s fascinating. But what’s even more fascinating is when you realize that we always test against placebo, right? So, we also had a group that just by chance used the right slash to indicate equality. And it turns out they did just as well as the Ruby group for that particular token.
DAVID:
[Chuckles]
ANDREAS:
And that’s interesting because if you use a single equals sign, it raises a novice’s performance to about 67% accuracy for that token, so about two-thirds of people, give or take. So, what that means is we can take the pieces Ruby did right that we got wrong, keep the pieces that we did right, and over time make languages that only use the things that win, that use the best features we can according to a set of standardized metrics and controlled studies, essentially.
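[For illustration, a hedged sketch of the if statement contrast being described. The Ruby below runs as-is; the Quorum 1 form is paraphrased in the comments from Andreas’s description, not quoted from the language:

  # Quorum 1 style, roughly as described: if a = b then ... end else ... end
  # (the then and the inner end were reportedly the tokens novices missed most).
  # Ruby's form, which reportedly scored far better on their token accuracy maps:
  a = 1
  b = 1
  if a == b        # the double equals was reportedly the one token Ruby got wrong;
    puts "equal"   # novices did better when a single equals denoted equality
  else
    puts "not equal"
  end
]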
AVDI:
Okay, I’m going to be devil’s advocate here real quick, and I actually am disagreeing with myself as I say this.
ANDREAS:
[Chuckles]
AVDI:
But how much does it matter, or do we know how much it matters to make languages that are easy
for novices? In other words, is there a difference that disappears as people get more experienced with programming in general? Obviously, the double equals comes from C. It’s there to make things comfortable for C programmers, not for novices.
ANDREAS:
Yeah, we don’t know. The number of studies is astonishingly small, astonishingly small. However, we do know that compiler errors are actually an issue for professionals, from a Google study that was done at ICSE 2014, a big software engineering conference. What we know from their study is that the types of issues that professionals at Google have, and I want to say that very clearly because the context is important, are different than some of the issues that novices have.
So for example, the Google study, if I recall correctly, showed that the biggest predictor of developers losing productivity time to compiler errors, again not syntax but compiler errors, was anything related to module dependencies. You know what I’m talking about, how you don’t just have a compiler error. You have a compiler error because you need to load a library on your path or whatever that is. Apparently, those errors are a major problem for developers at Google. However, when you look at errors for novices, things change a little bit.
So, there’s a researcher named Paul Denny and he showed that type system errors account for something like, I think it was 33% of the time for people in a first programming class. So, not just the beginning, which is what we test in our lab, but over the course of a semester. But we also know that by the time you’re in your third year of a program, these type system details improve your productivity. But no one’s ever tested children, to my knowledge. So for example, if you can’t read then syntax doesn’t make any difference at all. But you might want to use a tool like Scratch.
So, in other words, we don’t know very much about the various kinds of contexts that these things matter in order to optimize across these groups. But nonetheless we’re starting to get reliable and replicable data at least under some conditions, which I think is better than nothing. So, or at least a start.
DAVID:
I’m seeing lots of studies in computer science lately, and I’m seeing people like Gary Bernhardt and others gripe on Twitter about studies in computer science that have fewer than 60 participants. And the most common cohort is college students who desperately want free pizza. So, they tend to be novices or first or second-year programmers. They tend to fall into a specific cohort. I’ve heard an argument that big businesses want to be studied for their CMM level and their whatever compliance, but startups often don’t want people doing studies while they’re busy trying to make money and that sort of thing. So, that was a long rambling question. Studies with too few participants and with too isolated a cohort, how do we deal with that?
ANDREAS:
Actually, there’s a standard. And that’s called statistical significance. So basically, how this works in modern science, and I want to be very clear, in every single other discipline except computer science, every single other discipline, is we do the following: we conduct an experiment and we gather two metrics. One is called the p-value. The p-value gives, well, people disagree about exactly what it means but…
DAVID:
[Chuckles]
ANDREAS:
Some people interpret it as the probability that a particular study will replicate. That’s not strictly true. But nonetheless, it’s a rough guesstimate for that type of idea.
DAVID:
Okay.
ANDREAS:
However, there’s a second thing that we have to do. And that is we have to gather what’s called the variance accounted for, or the effect size. Now, what this means is, if I run a study, I get a p-value against a particular standard; in different disciplines it works differently. In psychology or computer science it’s typically a 5% chance that you’re wrong, so 1 in 20, which is not very good. But in physics, it’s [Six Sigma] often, so I think it’s 1 in half a billion, or I forget the exact numbers. But it’s really, really rigorous because you can conduct very large scale things.
But here’s the catch. When you conduct scientific studies on small samples, that’s okay so long as you conduct replication studies and get the same variance-accounted-for values and the other measures. So for example, if we conducted many, many studies on static typing, which we’re doing, and we observed different effect size values across particular studies, we would know that the original sample, the original data, was probably wrong. However, if we conduct scientific replications and we get approximately the same answer on a different sample, then we know that at least that measure’s reliable. So, let me be very specific. In the original study on Perl, we had six people in that group, which is nothing.
DAVID:
[Laughs]
ANDREAS:
That’s a tiny amount. We ran a replication. And the replication was accurate to within 0.3% on a new sample.
DAVID:
Oh, wow.
ANDREAS:
So, should we run a study that has 100,000 people in it just to make sure that we get even more accurate? Hell no, because that’s a waste of money. So, if you want to pitch that to the National Science Foundation, what [inaudible] instead is to start running studies under different contexts. So, children, professionals, people in various years, third year engineers at Google, fifth year, people close to retirement, people’s [inaudible] experience that they’re close to death, whatever you want to test. In other words, we use probabilities and we use effect sizes in order to detect replication. And that’s the keystone to science, not necessarily the sample size itself. That’s actually such a common misunderstanding that it has a name. It’s called “The Fallacy of Large Numbers”.
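[For illustration only, a small Ruby sketch of the replication check being described: compute an effect size for two groups (Cohen’s d, one common standardized measure) in an original study and again in a replication, and compare them. All the numbers are made up:

  def mean(xs)
    xs.sum.to_f / xs.size
  end

  def sample_variance(xs)
    m = mean(xs)
    xs.sum { |x| (x - m)**2 } / (xs.size - 1.0)
  end

  # Cohen's d: difference between group means, scaled by the pooled standard deviation.
  def cohens_d(a, b)
    pooled_sd = Math.sqrt(((a.size - 1) * sample_variance(a) + (b.size - 1) * sample_variance(b)) /
                          (a.size + b.size - 2.0))
    (mean(a) - mean(b)) / pooled_sd
  end

  # Made-up task times (minutes) for treatment and control groups.
  original    = { treatment: [12, 14, 11, 13, 12, 15], control: [18, 17, 20, 19, 16, 21] }
  replication = { treatment: [13, 12, 14, 11, 15, 12, 13, 14, 12, 13, 15, 11],
                  control:   [19, 18, 17, 20, 21, 18, 19, 17, 20, 18, 19, 21] }

  puts cohens_d(original[:treatment], original[:control]).round(2)
  puts cohens_d(replication[:treatment], replication[:control]).round(2)
  # Similar effect sizes from independent samples are the replication signal,
  # even when each individual sample is small.
]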
DAVID:
Jessica asked a great question in the backchannel about “every single other discipline.” She says maybe that implies that our discipline isn’t. I love the fact that when I first got into software engineering, I was a student member of the ACM and the IEEE. And this was in the early 90s. And there was a furious set of arguments going around back then that software engineers should not be allowed to call themselves engineers, because we were the only branch of engineering that did not have licensing exams, board certifications, bonding, standard practices for safe building of what we do. It’s like a janitor calling themselves a sanitation engineer. It’s a mockery. A sanitation engineer, okay, that’s a funny joke. A software engineer, that’s an insult to mechanical engineers and civil engineers and the like. Yeah, it’s crazy.
I want to make sure I understand what you said about the Fallacy of Large Numbers. So, you did a study with six participants and you got this, I’m assuming…
ANDREAS:
Well, that group had six participants in it, which is tiny.
DAVID:
Okay.
ANDREAS:
By any standards.
DAVID:
Right, so you replicated it with six more? Or did you go get 60 the second time?
ANDREAS:
The second time we had 12 and we added new contexts.
DAVID:
Okay.
ANDREAS:
So, in that case we went from three languages to six that we tested, because we wanted to see if different languages held different properties. So for example, it turns out Java has the same property as Perl in that it doesn’t beat a randomly designed language. And in fact…
DAVID:
[Laughs]
ANDREAS:
It does about 3% worse. And the reason is that…
DAVID:
[Laughs]
ANDREAS:
Because of the type annotations. So, Perl has the weird funny dollar signs. And Java uses names like int or float or whatever. And it turns out that that makes a difference. It’s not very much. It makes a small difference. But it makes some kind of a difference. So in other words, usually what we do in my lab is we follow the doubling rule. It’s not a very common thing, but it’s something we do.
So it’s like, we run an experiment and when we start we use one person. And we don’t publish that, because obviously that’s worthless. Then we run two, then four, then eight. So, in the first study where we did that, we ran six. And then we ran 12 for each statistic, for each group. But the thing is, we replicated so tightly that it wasn’t really worth it to go to 24. So, that’s why you follow a rule like that, because then you can decide, should we run a different study or should we do the same one again? That’s why you do that.
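[For illustration, a rough Ruby sketch of the doubling rule as described: keep doubling the sample and stop once the measurement stabilizes across rounds. run_group is a hypothetical stand-in for actually putting participants through the study:

  # Hypothetical: run n participants through a condition and return their scores.
  def run_group(n)
    Array.new(n) { rand(60..100) }   # stand-in for real task scores
  end

  previous_mean = nil
  [1, 2, 4, 8, 16, 32].each do |n|
    scores = run_group(n)
    m = scores.sum.to_f / scores.size
    puts "n=#{n} mean=#{m.round(1)}"
    # Stop doubling when two consecutive rounds agree closely; otherwise keep
    # doubling, or decide the design itself needs rethinking.
    break if previous_mean && (m - previous_mean).abs < 2.0
    previous_mean = m
  end
]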
JESSICA:
That’s really interesting, because that’s a technique that we could use at work. When we’re picking, I don’t know what to measure so let’s measure something. We can measure it on ourselves.
ANDREAS:
Yeah.
JESSICA:
And then on our pair. And then on our team.
ANDREAS:
Yeah.
JESSICA:
And if we have a really good idea that this could help, maybe even across teams.
ANDREAS:
That’s exactly right. And that’s why my lab right now is running, I think, 15 studies in parallel, because we want to answer all sorts of questions. We don’t know which of them we can find reliable data about, or even whether the techniques we’re using to study them are any good. So, you start with one. And when you’re doing one, you’re generally not expecting to find anything interesting. You don’t even have a control.
But you can test things like, are our measurement techniques accurate? Is the tool that we’re having people use working? There are all sorts of things you can do when you’re running these small scale pilots and building them up over time. And I really appreciate that Jessica, because that’s exactly what you can do in industry. Start really small. Do it with one person and then scale as you learn.
DAVID:
Yeah. So, Gary said right before he signed off that, when I asked him…
ANDREAS:
You mean Greg?
DAVID:
Oh, I’m sorry, Greg. I asked, how can I do science? Can I do studies and get published and that sort of thing. And he said, sure we’d be happy to teach you. Is there a short answer to where I can go to learn this doubling rule and p-values and variance?
ANDREAS:
Well, it depends on what you want to run. So, if you want to run really controlled experiments, it does take some practice. But the nice thing is that many kinds of randomized control trials are actually astonishingly easy. So for example, here’s the simplest kind of control trial. You have group A, and it gets one thing. And then you have group B, and it gets another thing. And you randomly assign the participants to the two groups. Ta-da!
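[For illustration, a minimal Ruby sketch of the between-subjects assignment just described, with hypothetical participant IDs and conditions:

  # Randomly assign participants to two conditions (between-subjects design).
  participants = %w[p01 p02 p03 p04 p05 p06 p07 p08]
  group_a, group_b = participants.shuffle.each_slice(participants.size / 2).to_a

  # Each group gets exactly one treatment, e.g. lambdas versus no lambdas.
  assignments = group_a.map { |p| [p, :lambdas] } +
                group_b.map { |p| [p, :no_lambdas] }

  assignments.each { |person, condition| puts "#{person} -> #{condition}" }
]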
JESSICA:
Oh, so one pair drinks coffee and one pair drinks beer. And then you [inaudible] the results.
ANDREAS:
And that pair has more fun.
JESSICA:
[Laughs] Yes, which pair has more fun? Exactly. [Chuckles]
DAVID:
My pair had to drink bleach.
ANDREAS:
Yeah, that’s exactly right. That’s called a between-subjects randomized control trial. That’s what that’s called. And actually, when you think about it, that’s pretty easy. That’s like Coke or Pepsi. It’s not that hard. Now, when you need to analyze it, the nice thing is that since there’s a growing group of people in academia that know how to do this stuff, if you ran a control trial like that and you sent us the assumptions and you sent us the data, we could run the little tests needed for the stats for you. It would take ten minutes. It’s not that hard.
DAVID:
That’s cool.
ANDREAS:
So actually, there’s a lot that people can do even if they don’t have all the training to do really complex designs. And in fact most of the time you don’t even need complex designs to answer some of these basic questions, because actually they’re easy. Like, should I have lambdas? Well, have one group use them and have one group use something else. Should I have enums? Same bloody question. It’s really easy most of the time.
JESSICA:
That could make a really interesting collaboration, because we have information that you in academia can use and access to people.
ANDREAS:
Hell, yeah.
JESSICA:
Who can answer questions for you. And if we make that collaboration, I see a really good conference talk could come out of that.
ANDREAS:
Yeah, me too. And you know I should say just as [inaudible] because [inaudible] I guess, [inaudible] something. On the Quorum project we have this rule and that is this: if somebody anywhere sends us a randomized control trial and it shows that something we believe is wrong, we will change the language, because we’re scientists. We’re not believers.
But the thing is, that means if you people in industry say, “Hey, I think the Quorum guys are full of crap on this one particular issue,” you have a great opportunity because you could just send us the data and show us your methodology and all that kind of stuff. We’ll have social scientists or psychologists review it and we’ll send it to experts in the area that are real scientists that can do all sorts of stuff. And if it turns out you’re right, we have no qualms. It’s no big deal at all. In fact, we like that because that means it’ll increase the productivity of our users over time, so that’s perfect.
It’s a win-win.
DAVID:
Yeah. But it has to be evidence, not an essay on how you’re Hitler and Sauron. [Laughter]
ANDREAS:
Yeah. No, yeah, exactly. No, if it’s not evidence-based, we throw it away.
DAVID:
Okay.
JESSICA:
[Inaudible] evil.
ANDREAS:
Yeah. Well, just because we don’t have time to just read everybody’s random rantings.
DAVID:
Yeah.
AVDI:
So, speaking of random ranting. Again, I want to play the voice of resistance.
DAVID:
[Chuckles]
AVDI:
A little bit here.
ANDREAS:
Go ahead.
AVDI:
So, to bring this back to Ruby a little bit. Ruby is famously optimized for developer happiness. And that is one of those wonderful statements without evidence. What I do have is anecdotal: it seems like more than half of the Ruby developers I have met have a story of basically self-selecting themselves into the language because they hated the languages they were working with. They hated Java or PHP or .NET or what have you, and were so much happier once they were using Ruby.
So, let’s say I am a Ruby developer and I self-selected into the language because it made me so much happier than what I used to use. And I see a result that says let’s say, lambdas make you less productive. We have some good evidence that lambdas make you less productive. But I have resistance to that because I hated using anonymous inner classes in Java.
ANDREAS:
Huh, yeah you and me both.
AVDI:
And the idea of giving up lambdas makes my fingers curl into claws and it makes me feel all defensive. Now, my actual question here is, what is the next question that I should ask? When I’m confronted with that dichotomy between evidence that says this thing reduces productivity and internal resistance that says not using it makes me miserable, what is the next question I should be asking to move forward?
ANDREAS:
Okay, that’s a good question. And I should say too that on the lambda issue, this is something I’m really interested in but it’s also exceptionally hard to study. I should say without telling the results, we’ve actually run three randomized control trials in two different countries and they’ve all failed because we just screwed up the design in many ways. But that’s normal. This is why we do these doubling and all this kind of stuff. But anyway, so…
JESSICA:
And by fail, you don’t mean they didn’t give you the results you wanted. You mean the methods were wrong.
ANDREAS:
Yeah. It means we screwed it up. That’s what it means. Like, one of them, we took time data but we didn’t do the timing properly. And so, we weren’t sure if we were just seeing an artifact or whether it was real. And it’s hard to know who to test on for these studies, because if you test with novices, novices can’t understand lambdas. They’re too complicated. But on the other hand, if you test with professionals that have been using them for a decade, well, what are you really going to learn? So, the question is, which group should you start with? And since there are, so far as we know, no other studies that have really tested this rigorously against alternatives, the question then is, well, who do you start with? Because you can only run one study at a time.
But anyway, the issues of lambdas aside, I think the heart of your question is, if I find out that a particular result doesn’t support the language that I’m using in some way, what do I do? And unfortunately the answer right now is that there’s probably not much that you can do. Because if you find out that feature X in the language you’re using is invalidated, that doesn’t necessarily mean you can even switch.
So for example, even though we’ve never tested C# in our syntax studies, since C# basically has similar syntax to Java for the most part, not in all cases, but it’s pretty similar, it would be unsurprising if C# held the same property that Randomo and Perl and Java all have, that the syntax doesn’t make much sense to human beings. But if you’re working at Microsoft, there’s probably not much you can do if your team is using C#. And that’s the way it is.
But in 15 or 20 years, people are going to invent more programming languages. And I think the value of this research is that it may not benefit you and me in our generation, but in 15 to 20 years, as people are inventing newer products and making newer versions of these products, it might help them over time.
So for example, on JDK 8 they added lambdas to the language. Now, I think many of us including myself tend to think the lambdas are probably an improvement over anonymous inner classes, in part because anonymous inner classes have all sorts of complexity and problems with them. But that’s just my opinion. That’s not a scientific statement. That’s just a guess.
But on the other hand, in the JDK process and the process for making these new versions of the language, why is it that even a company as big as Oracle doesn’t bother to put 20 people in a room and do a comparative study? It would be dirt easy for them to do that. And if that data was public over time, it would help a lot of people. I mean, think about it. Oracle always argues that Java’s on three billion devices. If you’re going to distribute a product to over three billion devices, it makes sense to actually vet it using scientific methods instead of just letting their engineers do whatever the hell they feel like doing, right?
DAVID:
That’s why calling us engineers is an insult to engineers.
[Chuckles]
ANDREAS:
To a certain extent, yeah.
DAVID:
Right? Isn’t that, that’s hugely irresponsible in a way.
ANDREAS:
Well you know, it’s not a term that I’m using. It’s just the name of the discipline. It’s called software engineering.
DAVID:
Yeah.
ANDREAS:
That’s not to say that you’re wrong, because you’re kind of right. But at the same time, it is what it is. It’s called what it is.
DAVID:
Yeah.
ANDREAS:
So anyway, so the point is sometimes as an individual there’s nothing you could do. But that doesn’t mean that over time if you have influence on the language or influence on the tools or influence on other areas that you can’t say to the people that do have control over those products, “Hey, when you make the new version, let’s use some evidence. Let’s look at the issues. Let’s run a study. Let’s do some analysis. Let’s think about it scientifically and try to answer the questions as opposed to just believing that we’re right or wrong.”
AVDI:
I just wonder if there’s extra data that’s being missed there. When there’s a feature which, let’s say there’s evidence that says it makes you less productive, but you’re attached to it because it really seems like it makes you happier as a developer, it seems like that fact isn’t completely data-free. It seems like that’s saying something. And I don’t know…
ANDREAS:
Oh actually, I forgot. I was going to mention this. So, we ran one of our surveys on the intuitiveness of syntax with people between 0 and 10 to 15 years of experience or so. In other words, what do people believe, independent of productivity? What do people believe about their languages? And we found an interesting result. So, we tested people. We had them rate how intuitive something was on a scale from 0 to 10. Like I said, it’s just a survey. There’s nothing special about it. And then we had them do this across sets of languages. And the people that we happened to be working with mostly knew C++. So, they might have had one year of experience in C++. They might have had 10. They might have had 20, somewhere around there.
And what we found was that regardless of the C++ syntax we gave them, for every year of experience that they had in that language, they rated that syntax as half a point higher on this [inaudible] scale. So, what that means is that developers as they garner experience increasingly tend to think that what they’re looking at makes sense. But the problem is that belief may not be actually true. It’s just a belief. That’s survey data. So, it doesn’t mean that they’ll be more effective.
It just means that Bjarne Stroustrup probably thinks that C++ is really intuitive, right?
AVDI:
[Chuckles]
JESSICA:
Every study, every result, has a context around it. And when you’re talking about yourself you are your own context. If this works for you, you know what? It works for you. And it doesn’t matter whether it works for someone else until you start working with another developer, getting a new person on your team. And then you start inflicting your context on them. So, when a new person comes in and something that seems really intuitive to us is not obvious to them, that’s a good time to go to the studies and say, “Oh wait. I’m the weird one here. They’re right. This is not actually intuitive to human beings in general.”
ANDREAS:
Yeah, exactly. And in these studies, too, we take averages and then we discuss distributions. But just because you have an average and you know that something helps people on average, it doesn’t mean that any one individual will benefit. So for example, if I had a particular language feature X, it could be that on average that feature is terrible for most people. But it could be that for one person somewhere, that’s just the feature they need for their competence.
That’s real.
JESSICA:
That one person who really understands generics should totally write a bunch of Scala libraries.
ANDREAS:
Yeah.
AVDI:
[Chuckles]
ANDREAS:
And actually, generics are a good example, because even if, when you’re writing generic classes, you do get a tenfold decrease in productivity (let’s just say that that’s true for the sake of argument), we also know that you get a slight bump for using those libraries. So, maybe that’s okay. Maybe it’s one of those features where, yeah, somebody’s going to have to take a hit and write those libraries. But it’s going to benefit the rest of the team. So, maybe that’s okay.
AVDI:
Yeah. I just wonder if sometimes it means, or could mean, that there’s a third way for feature X. That maybe feature X isn’t a completely bad idea, but the way that it’s implemented syntax-wise or the way it’s presented or commonly used or something is not so great.
JESSICA:
Good point. Maybe [inaudible]
ANDREAS:
Yeah, I agree completely.
AVDI:
Should we throw it away or should we try to find a third way that says, “Okay how can we incorporate feature X in such a way that it improves other people’s productivity the same way it improves my productivity?”
ANDREAS:
Actually, you’re totally right. This is actually one reason why we struggled so much with doing a lambda study because it’s hard to know exactly what to test. For example, some of the syntax that you see in various kinds of lambdas in various languages is pretty weird and obtuse, right?
AVDI:
Yes.
ANDREAS:
At least I think so. But on the other hand, people love that feature. People really like lambdas under a lot of conditions, or so it seems. And it seems like there are other advantages. Like in JDK 8 they have all these cool things you can do with parallelism with lambdas, and that’s really cool. So, do you test under that condition, or do you run a debugging study? But what about syntax? The syntax, if it impacts anybody, probably only influences novices. So, maybe that doesn’t matter as much. I’m not sure. This is the hard part.
Since we’re such a non-evidence based discipline, what that means is that we can’t even really tell yet under what contexts particular features matter. I think with lambdas it wouldn’t surprise me if we needed 200 studies before we got a good sense of who it impacted, and when. That wouldn’t surprise me at all. But we have to run…
JESSICA:
It is very much a matter of how you think.
DAVID:
When Josh was on the show, he loved to point out that there were studies about the intuitiveness of various types of code and font size and color scheme and Dvorak versus Colemak versus QWERTY. And at the top…
ANDREAS:
Oh yeah, yeah.
DAVID:
And at the top of the call you and Greg both lamented that nobody’s paying attention to the data.
Where is this data? I’ll be that guy. I don’t even know where this data is.
ANDREAS:
Greg means something very specific when he talks about it. And in his context, he’s right, given what he means. In his particular case, he’s talking about data from source repositories. And he’s saying that those exist now. But he doesn’t mean data in the [broad]. If you want to talk about data from randomized control trials, you are totally right. It does not exist, especially on languages.
And actually, you don’t have to take my word for this. We’ve actually run what’s called a meta-analysis. This is where we’re going back and reading all of the papers in academia. So far, we’ve read about 2,200 papers from academia to try to find the evidence on different kinds of language features. And the short and unfortunate answer is that it largely doesn’t exist. A lot of academics don’t believe this. They think that we just have to look nostalgically back at the 70s and 80s and we’ll find all these wonderful studies. And it’s just not true. When you actually read the papers, there’s either no data or very little.
So for example, you may have heard the claim before that the Smalltalk teams did a substantial amount of empirical data analysis on their tools. I don’t know if you’ve heard that claim.
DAVID:
I’ve heard that.
ANDREAS:
Now if you look at the original studies, it’s actually really unclear what they actually did. It’s not clear whether they ran really controlled experiments. It’s not clear what the evidence was. I’ve even asked for the data and I can’t seem to find it.
DAVID:
Huh.
ANDREAS:
So, as it stands, from one scientist to another, I would say, that’s weird. But at the time there was nothing going on in that area, so it’s like a diamond in the rough to a certain degree. But on the other hand, there’s almost no evidence. And then there are other conferences where there’s a little bit.
So, there’s a very little-known one called Empirical Studies of Programmers, easily the best resource ever on empirical data for how programmers work. But no one has heard of it because it’s not very well-known in programming language design. But it’s a treasure trove. It includes some of the only studies ever on inherited systems. It includes a direct comparison between object-oriented and imperative styles, which we actually use on the Quorum project. That’s why in languages like Ruby or Quorum or some others, you can say something like output “Hello world” as opposed to needing to wrap it in public class, void main, all this kind of crap. So, these sorts of studies can give us real insight, but there are so few of them, so few.
I’ll give you one last example on this foundation of evidence issue. And that is that if you look at even conferences dedicated to these issues supposedly, one of them is called the Psychology of Programming Interest Group. It’s a UK conference. It’s a workshop but it’s academic. It’s run by people that have influence at Cambridge and other places in the UK. Anyway, we did a systematic analysis of all their papers that are online. They don’t have all of them online but they have a good chunk of them. And we found that less than 1% of the papers were actually related to programming languages and actually used reliable data collection techniques.
DAVID:
Wow.
ANDREAS:
Less than 1% at a conference dedicated to this issue. Take the major programming language conferences, the answer is that it’s effectively zero except for the papers that we’ve written.
DAVID:
[Chuckles]
ANDREAS:
So, to answer your question, there is no evidence when it comes to randomized control trials. But what Greg meant was that there are source code repositories and we can probably learn a lot from them potentially. I don’t know if that answers your question or not.
DAVID:
Yes, yes it does. It does. And it actually leads into one of my picks. Should we wrap up and do picks?
JESSICA:
Yes, picks. It’s time for picks.
GREG:
So, pick number one: ‘Trick or Treatment’. This is a great book by Simon Singh and Edzard Ernst. Now, if you don’t know Edzard Ernst’s name, he was the first tenured professor of alternative medicine in the United Kingdom. He trained as a homeopath, got tenured, and ten years later said, “Okay, this stuff doesn’t actually work.” The courage it takes to have put 20, 25 years of your life into something and then turn around and say, “Okay, on balance, looking at the evidence, I no longer believe this works,” is tremendous. And the flak that he got… Andreas gets hate mail. You should see the hate mail that people like Edzard get.
But he and Simon Singh wrote a book where they go through chapter by chapter different alternative medicines like acupuncture, like homeopathy, like chiropractic. What is the evidence? Does it actually work? And it turns out that for some kinds of ailments, acupuncture can be marginally effective. There are some herbal remedies that are effective for some minor disorders. But what they’re really doing is showing you how you tell the good from the bad, the true from the bogus. It’s a really good book. And if one day in my lifetime somebody is able to write a book like that for software engineering, I’ll think we’ve made real progress.
DAVID:
[Chuckles]
GREG:
I’m not expecting it. But man, it would be wonderful if we could.
The second of my picks is a book called ‘Seeing like a State’. It’s nothing to do with programming except it is.
DAVID:
[Chuckles]
GREG:
It could be half the length it is, but what James Scott is pointing out is that over and over again, large organizations of all kinds value uniformity over productivity. They want everybody playing by the same rules so that they can manage from the center. And that is more important than allowing people the freedom at the grassroots level to improvise and adapt, because then yes, they’re being more productive, but you no longer have that central control.
Now, the reason for this isn’t just authoritarianism. It’s also that the more local adaptation and custom you allow, the more mistakes people can make. Let’s face it. We’re all glad that civil rights legislation was introduced in the United States in the 1960s, that local customs were banned in certain parts. But every software development organization and every political party that I’ve ever been part of suffers from this effect, that we lower everybody’s productivity to the point where they’re interchangeable so that they can be managed. And if I had a decade and a large budget, I would go off and try to figure out how this affects software organizations. Is it necessarily the case that every large software development team has to be mediocre?
And then the third one is ‘Code Complete’ which should need no introduction. And if you can find me a university software engineering course that uses that as their textbook, I’ll buy you a beer.
ANDREAS:
I might actually use it just to get a beer from you.
[Laughter]
JESSICA:
There it is. Beer-driven evidence again. [Laughter]
GREG:
But seriously, Steve McConnell knows the literature backwards and forwards. He’s got it all in his head. If there’s a significant paper from the last 30 years that he hasn’t read and filed away in his head, I’ve never bumped into it. And he’s just organized it all and put it down there. And the book is now 16 years on from its first edition. It changed my life when I discovered we actually knew stuff about things. It should be the core text in every undergrad software engineering program and isn’t. So, if anybody listening to this podcast hasn’t read it, switch off the podcast. There’s nothing more important to you as a professional programmer right now than going and getting that book and finding out what we actually know.
DAVID:
Jessica, do you want to do picks?
JESSICA:
Alright. So, I am going to echo one of Greg’s picks because it was on my list but for a different reason. ‘Seeing like a State’ is an amazing book. And I think it’s drastically changed the way I look at software, not for the same reason as Greg talked about but because it shows why what we do is hard. ‘Seeing like a State’ talks about all the subtleties of human systems and human interactions at the local context level. It talks about all the improvisation that everyone does on a day-to-day basis and how in real human communities, we’re constantly changing the system to adjust to a slightly different reality, to corner cases we hadn’t seen before but now we have. It’s shifting and it’s not well-defined. And suddenly it makes complete sense that the hardest part of software is figuring out what we want to do. That’s it. It’s a great book.
DAVID:
That’s awesome. Avdi?
AVDI:
I am pick-less this week.
JESSICA:
Pick-less.
AVDI:
Yeah, I slacked off.
DAVID:
I was going to do a hot sauce pick today. And we’re so overtime and the last pick I did for hot sauce went 14 minutes.
[Laughter]
DAVID:
So, I’m going to pick really quick. The first one is that I get to be on the show today. I’ve been freelancing for, oh gosh, five years pretty solidly. And I went and interviewed with CoverMyMeds. And they’re just a fantastic team and they’re hiring. And so, I’m plugging them as my first pick because they’re just absolutely amazing. They’re doing really good high tech work. They’re doing work that saves people’s lives. And they have a no assholes rule for people that hire on there. So, if you’re interested you should give them a shout at their website.
My second set of picks is actually related to the show today. The first one is a really interesting bias. If you want to measure something, how do you determine what your placebo or your control is? Well, there’s the fact that if you track your weight, you will lose weight. That’s your weight loss plan: just keep track of it. Basically, it plays off Peter Drucker’s old standby that what gets measured gets managed. So, you can’t use people who are tracking as your control group, because they’re going to lose weight too. So, it gets kind of interesting. How do you measure somebody’s weight without them knowing? Sneak in at night and weigh the bed, I guess.
Andreas, do you have picks for us today?
ANDREAS:
Sure. I came up with two. And of course, they’re both related to my area. The first one is an old paper that I think more software engineers or whatever word you want to say should know, and that is by Walter Tichy. And it’s called ‘Should Computer Scientists Experiment More?’ It was published in the late 1990s. And basically what Tichy did is he talks about this idea of what the evidence is. And this is the late 90s. And I want to just repeat one little quotation from him from 1998. He says, “There are plenty of computer science theories that haven’t been tested. For instance, functional programming, object-oriented programming, and formal methods…”
DAVID:
Wow.
ANDREAS:
“Are all thought to improve programmer productivity, programming quality, or both. It is surprising that none of these obviously important claims have ever been tested systematically even though they’re all 30 years old and a lot of effort has gone into developing programming languages and formal techniques.”
DAVID:
Wow.
ANDREAS:
This was 1998, and this wasn’t just his belief. He actually did a very similar type of meta-analysis to what this fellow named Andy Ko at the University of Washington did pretty recently, finding a similar result, and to what we’ve done on the language design communities in our own work. So, that’s my first pick. I think people should go read Walter Tichy’s fantastic paper describing this problem and also giving some context for how long computer scientists have ignored evidence, just totally ignored it.
Okay, second pick. This is on the opposite end of the spectrum. This is a paper by Leo Meyerovich and Ariel Rabkin that came out of OOPSLA 2013 called ‘Empirical Analysis of Programming Language Adoption’. And the reason I’m pointing this particular paper out is not because many developers might be that interested in adoption, but because it’s an excellent example of what you can learn from a very, very rigorous evidence-based investigation into a topic. Specifically, Meyerovich and Rabkin are kind of trying to start a sociology of programming languages, or at least that’s one way that I’ve heard Leo describe it when I talked to him.
And what’s interesting about this is that we learn all sorts of stuff. And I won’t spoil it all, because some of the results are surprising, related to how people adopt or stop adopting functional languages and what kinds of features cause languages to be adopted, stuff like that. It’s a fascinating read. And if the people listening haven’t read it, it’s just an amazing tour de force paper that is worth reading.
Oh, you know I didn’t even plug it at all. See, I’m a scientist so I tend not to plug my own stuff. But we actually just released a new online version of Quorum. It’s still a little bit in beta. But number one, it lets you run Quorum right on the web as a JavaScript alternative, at least kind of, sort of. It will get that way at least eventually. But also more than that, we also put up a bunch of video tutorials because we were accepted to the Hour of Code.
DAVID:
Oh, cool.
ANDREAS:
That Code.org puts up. So, we’ve actually got all these cute little online tutorials with this high school student that is learning programming. And they’re designed to be kind of fun and silly and stuff. So, if people that are listening want to give it a try, that’d be awesome. Because when people use it, we get some data. And then if it turns out we’re wrong, we change the language.
DAVID:
That is awesome.
ANDREAS:
They’re a lot of fun, these little video tutorials and stuff, I think. But hopefully people on ye olde internet will like it too. We’ll see.
DAVID:
Yeah, awesome.
AVDI:
Cool.
DAVID:
Thank you for being on the show, Andreas. This was awesome.
[This episode is sponsored by MadGlory. You’ve been building software for a long time and sometimes it’s get a little overwhelming. Work piles up, hiring sucks, and it’s hard to get projects out the door. Check out MadGlory. They’re a small shop with experience shipping big products. They’re smart, dedicated, will augment your team and work as hard as you do. Find them online at MadGlory.com or on Twitter at MadGlory.]
[This episode is sponsored by Ninefold. Ninefold provides solid infrastructure and easy setup and deployment for your Ruby on Rails applications. They make it easy to scale and provide guided help in migrating your application. Go sign up at Ninefold.com.]
[Hosting and bandwidth provided by the Blue Box Group. Check them out at Blubox.net.]
[Bandwidth for this segment is provided by CacheFly, the world’s fastest CDN. Deliver your content fast with CacheFly. Visit CacheFly.com to learn more.]
[Would you like to join a conversation with the Rogues and their guests? Want to support the show? We have a
forum that allows you to join the conversation and support the show at the same time. You can sign up at RubyRogues.com/Parley.]