JSJ 476: Understanding Search Engines and SEO (for devs) - Part 1 - JavaScript Jabber

JSJ 476: Understanding Search Engines and SEO (for devs) - Part 1

If you're building a website or web-app, there's a good chance that you want people to find it so that they will access it. These days this mostly means that you want it to appear in the relevant search engine results pages (SERP). In this episode we are joined by Martin Splitt, DevRel at Google for the Search & Web ecosystem, who explains in detail how search engines work, and what developers and SEOs need to know and do in order to be on their good side.


Show Notes

Panel

Aimee Knight
AJ O'Neal
Dan Shappir
Steve Edwards

Guest

Martin Splitt

Transcript

DAN_SHAPPIR: Hello everybody and welcome to another episode of JavaScript Jabber. I'm Dan Shappir coming to you from Tel Aviv and today on our panel, we have Steve Edwards.

STEVE_EDWARDS: Hello from Portland.

DAN_SHAPPIR: Aimee Knight.

AIMEE_KNIGHT: Hello from Nashville.

DAN_SHAPPIR: AJ O'Neal.

AJ_O’NEAL: Yo, yo, yo. Coming at you live from they changed trash day to Tuesday.

DAN_SHAPPIR: And our special guest for today is Martin Splitt from Google. Hi, Martin.

MARTIN_SPLITT: Hi there, and hello from Zurich, Switzerland.

DAN_SHAPPIR: Oh, it must be very nice weather over there right now.

MARTIN_SPLITT: Yeah, it's actually sunny, warm, and blue sky, and not too bad. Yeah, surprisingly.

DAN_SHAPPIR: And how's the temperature like?

MARTIN_SPLITT: I think today we had 14 degrees centigrade. I think we'll need to maybe try to convert it for American. Nice and toasty.

MARTIN_SPLITT: Yeah, I think in American units, it's 25.3 caterpillars in a nutshell or something.

AJ_O’NEAL: Yeah, well, what you do is you divide by two and then multiply by five-ninths.

MARTIN_SPLITT: Right. Sounds legitimate.

AJ_O’NEAL: Plus 32. Don't forget the plus 32. You got to carry the 32.

MARTIN_SPLITT: I think it's in the upper 50s for you guys.

STEVE_EDWARDS: So where do the caterpillars and the nutshell fit in there? I missed that in the calculation.

MARTIN_SPLITT: Yeah, I don't know. I have no idea how the exact, like, there are coefficients there and you have to figure out the units, I guess, but I'm not really good at this stuff, so I don't know.

AJ_O’NEAL: It's either 39 or 57, depending on whether it was supposed to be five-ninths or nine-fifths.

DAN_SHAPPIR: I'll tell you one thing, whenever I need to convert Celsius to Fahrenheit or vice versa, I just use this thing called Google Search. Martin, have you heard about it?

MARTIN_SPLITT: I heard about it. It's apparently like a really hot startup from the Mountain View area right now.

DAN_SHAPPIR: Yeah, for those of you who don't know, Martin is actually involved with Google Search. Can you tell us your role there?

MARTIN_SPLITT: Yes, so I'm a developer advocate at the Google Search Relations team, so my job is to both help everyone build websites that can be discovered through search engines, more specifically through Google Search, and also to basically bring back developer and SEO feedback to the relevant product teams at Google, in Google Search, more specifically, to make sure that Google Search works the way it's supposed to do.

DAN_SHAPPIR: And we brought Martin on our show to explain to us exactly how the Google Search algorithm works on the inside. Right, Martin?

MARTIN_SPLITT: Yes, correct. That's exactly what I'm here for.

This episode is brought to you by Dexecure, a company that helps developers make websites load faster automatically. With Dexecure, you no longer need to constantly chase new compression techniques. Let them do the work for you and focus on what you love doing, building products and features. Not only is Dexecure easy to integrate, it makes your website 40% faster, increases website traffic, and better yet, keeps your website running faster than your competitors'. Visit dexecure.com/JSJabber to learn more about how their products work.

AJ_O’NEAL: What's the latest name, like caffeine or cocoa bear or juju?

MARTIN_SPLITT: Oh, no, no, you're mixing up all the things. I think the latest name is BERT, or is it now passage indexing that people are freaking out about? I don't know. There's something new popping up in the community every now and then, and then it's like, oh my god, the big update, and we're like, yeah, it's not really that big of an update. But yeah. Okay, I guess.

AJ_O’NEAL: I mean, other than turning search results completely upside down, sure, not that big of an update. It's like butterfly effect. One small tweak to the algorithm and then all of a sudden your site can't be found anymore.

MARTIN_SPLITT: You say that, but then if you look at it, the thing is there's lots of websites and search results, right? And for many, an update doesn't change anything. And then for some it changes quite a lot. For some it changes positively, for others it changes negatively. It's really hard to say what the impact is; you can't summarize it as, oh, yeah, this had lots of impact, or, oh, yeah, this had no impact whatsoever. And you see that quite interestingly when we talk about different things that happen in search. So for instance, a couple of years back, we felt that we should promote use of HTTPS, so SSL or TLS, in your website serving infrastructure. And that is a ranking factor. It's just one out of lots of them. So people are like, oh my god, we're going to see a big change if you're not using HTTPS. And then that wasn't the case. Except that for some websites, it did make a difference: if you have competitors that are on HTTPS and are pretty much as good as you, and now you are the only one in that niche that is not actually supporting HTTPS, then you would see, oh my god, we are losing traffic and ranking here from search. Whereas in other areas, people are like, yeah, no, we don't really get that much of a ranking boost, and there might even be non-HTTPS websites outranking you. Like, you told us that HTTPS is a ranking factor. Yes, it's one ranking factor. But if I have a really bad page that is on HTTPS, why would I outrank a really good page that just happens to not have HTTPS? So yeah, it's always wild and complicated.

DAN_SHAPPIR: I forget who said it, but it was a really nice explanation by some Googler who told me, maybe it was even you, who told me that Google tries to find for you the pages that you want to find. And usually, that's based on what the page contains, the content of the page, and the authority of the page and whatnot. But literally, if you know, you would have gone through the list of all the pages, you know, obviously, that's not possible. Google tries to put at the top the ones that you would have put at the top.

MARTIN_SPLITT: Yeah. And that can make a big difference between different users as well. So if I am searching for restaurant open right now, I'll very likely get very different results from you all because none of you is in Zurich. And I probably want a restaurant that is open right now in Zurich, not in Nashville or Portland or wherever else in the world, right? So ranking is a really complex monster of a machine. It is amazingly fine-tuned. We are doing lots of experiments at any given time. And we are looking at whether the outcome looks like what we want it to look like, in the sense that we have what's called quality raters. Those are actual human beings who are presented with a search query and a set of search results. And then they are also presented with an alternative set of search results. And then they get to say, I think this is the better search results set, or I think this other thing is the better search results set. And that way we are also fine-tuning our ranking infrastructure.

AJ_O’NEAL: Well, that sounds, I guess the whole thing is just kind of scary actually.

MARTIN_SPLITT: Well, it's not as scary as you think.

AJ_O’NEAL: Okay, let me bring up one specific scenario, okay? It seems to me that bots are taking over, in that fake auto-generated content is getting higher priority than real human content. And part of that is probably because there's like 10 to 20 times as much bot-generated content out there as human-generated content. But specifically, if you search best XYZ, you are not going to get a page where a human has had any input. It's like 99.999% of the time, the first 20 results on the Google page are all bot sites that are just using GPT-3 on Amazon reviews and generating really crappy Amazon link farms. That's my biggest problem with search right now: I can never find out any real information about a product, because it's all just parroting what's in the description, and I don't know what other stuff they use, but I always get garbage results when I'm searching for product information and I have no idea how to fix it.

MARTIN_SPLITT: I don't know. That's an interesting example, but I would disagree that that's usually the case. So for instance, I use search as well, right? I'm a searcher as everyone else is. And for many things like, I don't know, best vegan restaurant in Boston, or best price for this and that, or, I don't know, used car dealer Zurich, I actually get pretty good, useful results. I think it is a question of how intense the competition is. And I think especially with product search, that is a bit of a tricky situation, because sometimes some products are just so specific that if you are searching for a very specific product name, then you are in a very small part of the internet to begin with. There is just not that much information. And then it is hard for automated systems, including our bot, to understand what is good content here. And if everything looks the same, then it's really hard to say which of the 10,000 pages that are pretty much the exact same we should rank higher or lower. If someone actually pays some attention to what people want and what people need and what query intention they're coming with, then there is potential to optimize for that and actually give human-generated content that ranks better. It's just that for very specific products, oftentimes no one does. And it doesn't,

AJ_O’NEAL: If I were to search best USB-C hub or best iPhone glass protector, or pretty much anything the average consumer would be buying, I always get to these junk sites, and what I end up having to do is find some type of person on YouTube to explain it, which is a little less than ideal because I don't want to watch it.

MARTIN_SPLITT: But that was surprising, because if I search literally for best USB-C hub, I get CNET, Laptop Mag, TechRadar, Business Insider, New York Times, which I would suggest are not spam-generated bad pages. Or so I hope. I do hope that the New York Times does not have a terrible, terrible, terrible page on the best USB-C hubs, or Business Insider doesn't. So yes, there are bad results in some cases, in which case that's something that we definitely are trying to improve constantly. Product search, that's a little tricky because there is only so much information. And then in that case, you can at least try to sidestep that by providing something a little different in terms of search intent. So for instance, you could ask for a comparison, where that might filter out a bunch of pages that are just general product information rather than a specific review and comparison. So of course, everyone is trying to game our algorithms all the time. And that's one of the challenges that we are constantly dealing with. And that's one specific reason for not giving out too much information on the ranking algorithm, actually.

DAN_SHAPPIR: A question about that. Because at the beginning of our conversation, I kind of jokingly said that you would be explaining the algorithms, which of course you won't and you can't. But my question is, theoretically, if you wanted to, could you, I mean, is it actually something that human beings can still understand? Or is it just like some huge AI, machine learning, monstrosity that nobody really has, no person has a total grasp of how it actually works?

MARTIN_SPLITT: I would say it's not a non-understandable or black-boxy kind of AI system. It is actually a collection of smaller, relatively understandable bits and pieces that are being mixed together and weighted to form a final ranking in the end. And even though there is not one person that understands it all, because just no one needs to do that, if you really sit down and want to understand how everything works, you can totally do that. It's just a little pointless, because you might be more interested in your specific area of expertise. So for instance, if I'm a natural language processing or programming expert, then I might be looking more into the subsystem that tries to understand the relevance for a given topic and a given page. But you can totally look further and basically go, my system that I am controlling here gets this weight attached to it, and then this other thing comes in as well with this other weight attached to it, and you can comprehend what's going on. It's not all machine learning, it's not all a black box.
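
The idea of separate signals being weighted and mixed into one final ranking can be sketched in a few lines of Python. This is only an illustration: the signal names, the weights, and the two example pages are hypothetical, and nothing here reflects Google's actual signals or how they are weighted.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    # Hypothetical per-page signals, each normalized to the range 0.0-1.0.
    signals: dict[str, float]


# Hypothetical weights: how much each signal contributes to the final score.
WEIGHTS = {"relevance": 0.7, "freshness": 0.2, "https": 0.1}


def score(page: Page) -> float:
    """Combine a page's signals into one number using a weighted sum."""
    return sum(WEIGHTS[name] * page.signals.get(name, 0.0) for name in WEIGHTS)


def rank(pages: list[Page]) -> list[Page]:
    """Return the pages sorted from highest to lowest score."""
    return sorted(pages, key=score, reverse=True)


pages = [
    # A weak page that happens to be served over HTTPS.
    Page("https://weak.example/", {"relevance": 0.3, "freshness": 0.5, "https": 1.0}),
    # A strong page that is not served over HTTPS.
    Page("http://good.example/", {"relevance": 0.9, "freshness": 0.5, "https": 0.0}),
]

for page in rank(pages):
    print(f"{score(page):.2f}  {page.url}")
```

With these made-up numbers the strong page without HTTPS still ranks above the weak page with HTTPS, which mirrors the earlier point that HTTPS is only one ranking factor among many.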

DAN_SHAPPIR: Oh, interesting. When I prepared for our conversation, one of the things that I thought might be worthwhile for our listeners, especially given that our listeners are generally JavaScript developers and are not necessarily familiar with this particular field, is maybe explaining some of the terminology that's common, even the basic things like what is SEO? What is a SERP? What is a slug? Or maybe some other terms.

MARTIN_SPLITT: Yeah,

DAN_SHAPPIR: I'm detecting maybe you can do that.

MARTIN_SPLITT: Sure. Actually, we do have an internal class that new Googlers are encouraged to sign up for and take, where we discuss how search works. We also have a public page that explains how search works on a very high level. I do hope that we will eventually actually make a video that explains it in a little more detail. But basically, I could give us a quick run through the infrastructure. Basically, you make a website. That's fantastic. Eventually, this website somehow either gets linked to or you submit it to Google, which then starts the process where the crawler, which is basically an automated bot program called Googlebot, makes an HTTP request to that URL that we found somewhere and then downloads the response. Then we analyze what's on the page. That's the indexing step, where we basically try to figure out, OK, so this website is about cats, but it's not just about cats. It's specifically about medical diets for cats with specific medical needs. And then that basically files it in a database with lots of signals attached to it. Like how old is the page and how recently has it been updated? How much information about each specific diet is there? Cats, blah, blah, blah. How many images might there be? How many pages link to it? How many pages does it link to? All this kind of stuff gets collected and stored in the index, which is basically a database. And then you go to Google.com or you use the Google app on your phone or you use Google Assistant or you use Google Search through Assistant through your Google Home. Who cares? Basically, what you do is you phrase a query. So you are searching for something. So you ask us something. You might ask us for the best USB-C hub in 2021. You may ask us for the best vegan restaurant, or you may say, my cat needs a low-sodium diet. What can I do to make that happen? Like low-sodium recipes for cat food or cat food brands. And then what happens is that this request gets passed into what we call the fulfilling layer. The fulfilling layer is rewriting the query a little bit. So it tries to figure out variations of the query. You don't necessarily have to use a full sentence. If you do, there might be stop words that don't mean anything. There might be typos. So we filter those out. A bunch of stuff happens there, and then basically it asks the index for potential candidate documents and says, okay, so which pages on the low-sodium cat food brand do we have? And then a bunch of pages show up, probably millions or billions of pages or even more. And then each of these documents is looked at. We have, as I said, collected lots of signals. One of them, for instance, being if it's HTTPS or not, the other being how recently it has been updated, and so on and so forth.
So it's like, oh, there's this very recent page that has lots of links coming to it and actually links to a bunch of relevant information as well. And it is an HTTPS-served page. That's also great. It's in the right language, because I don't think you want to read a German page for that specific query. You probably want an English page for this. And all these kinds of factors come in together and are being weighted. And that way, we get a sorted list of candidates. And then this sorted list of candidates is being served back to you as a search result page, in short, SRP or SERP. So sometimes you will hear the term SERP, so search engine result page. That's the page of search results that you'll see. Each of these results can look very, very different. So the simplest one that you probably know is what we call the blue links. So you have a title which is in blue, below it in green is usually the URL or the breadcrumbs, and then you have a little text that describes a little bit what the page is about. That's what we call the snippet. You can control this to a certain degree by providing a meta description in your website, which we can then use as a snippet. So that's the simplest search result we have, but we might also have, so if it's like a medical query, then maybe we have some medical information from Wikipedia or elsewhere that we might highlight on the right-hand side. So it's like, oh yeah, low-sodium cat food is being served to cats suffering from these medical conditions, blah, blah, blah. You might find images, you might find products if it's a shopping result or relevant query. There are also these so-called rich results. So if you look for a low-sodium cat diet, for instance, you might actually find recipes that explain how to make your own cat food to serve cats with that specific need. 
In that case, you might, yeah, you might actually have like a oh yeah, this recipe takes 20 minutes to prepare and has this many calories and takes these ingredients. So there's different kinds of search results. And yeah, that's, I think, SERP, so the search result page, is an important term that you'll often hear. Snippet, or meta description, is a term you'll often hear. Blue link, or simple results, rich results, things that are usually being thrown around. Rich results are powered by structured data. So for enabling a rich result, you have to add a specific piece of JSON markup to your page to tell us what this page is about specifically. In this case, that's a recipe, for instance, and what the steps are and what the calories are and so on and so forth. Yeah, we had covered indexing, ranking, and crawling. Anything else? Anything you think is missing?
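
To make the structured data and meta description mentioned above more concrete, here is a minimal Python sketch that generates both for a recipe page. The property names (`prepTime`, `nutrition`, `recipeIngredient`, `recipeInstructions`, `suitableForDiet`) come from schema.org's Recipe type and the markup is embedded as JSON-LD; the recipe itself, its values, and the helper function names are made up for illustration, and which properties a search engine actually uses for a rich result should be checked against its documentation.

```python
import html
import json

# Hypothetical recipe data. The keys are schema.org Recipe properties;
# the values are invented for this example.
recipe = {
    "@context": "https://schema.org",
    "@type": "Recipe",
    "name": "Low-Sodium Cat Food",
    "description": "A homemade low-sodium recipe for cats on a restricted diet.",
    "prepTime": "PT20M",  # ISO 8601 duration: 20 minutes
    "suitableForDiet": "https://schema.org/LowSaltDiet",
    "nutrition": {"@type": "NutritionInformation", "calories": "120 calories"},
    "recipeIngredient": ["200 g cooked chicken", "50 g cooked rice"],
    "recipeInstructions": [
        {"@type": "HowToStep", "text": "Shred the chicken."},
        {"@type": "HowToStep", "text": "Mix the chicken with the rice."},
    ],
}


def json_ld_script(data: dict) -> str:
    """Serialize structured data into a JSON-LD script element."""
    payload = json.dumps(data, indent=2)
    # Escape "</" so the payload cannot close the script element early.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


def meta_description(text: str) -> str:
    """Build the meta description tag that may be used as the snippet."""
    return f'<meta name="description" content="{html.escape(text, quote=True)}">'


# Both fragments belong in the page's <head>.
head_fragment = "\n".join(
    [meta_description(recipe["description"]), json_ld_script(recipe)]
)
print(head_fragment)
```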

DAN_SHAPPIR: I mentioned two. First of all, SEO itself.

MARTIN_SPLITT: All right, yeah.

DAN_SHAPPIR: And also the term slug gets thrown around.

MARTIN_SPLITT: Oh, yeah. Oh, god. SEO itself is search engine optimization.

DAN_SHAPPIR: I mean, we've gotten to the point where you're invoking deities really early in our conversation.

MARTIN_SPLITT: Yeah, no, the slugs one is an interesting one, but I'll come to that in a moment. So SEO itself is the work of optimizing pages for search engines. That can be done well and not so well, let's put it that way.

DAN_SHAPPIR: So I think before you delve into that, maybe I'll lead into it with a really basic question. My basic question is, why? I mean, we have HTML, which is a semantic representation of our content. It contains all the content that I want to provide. It also describes certain sections of the content in, like I said, a semantic way: this is the title, this is the header, this is the list, this is the table, links, and so on. Given that, and given how well the browsers themselves, when I visit a web page, are able to present this data, and that the data is also available in an accessible way, hopefully, so that I can use whatever input or output device best works for me. Given all of that, why do I actually even need to do SEO? Why isn't the HTML enough?

MARTIN_SPLITT: That's a really good question. So search engine optimization should be about helping you optimize the site for the moment when users interact with search engines and search engines interact with your site. Certain things happen there. To give you an example, or make an analogy maybe, I'll try that instead. So if you are thinking about building a website and you make all the HTML really fantastic, but it's a website about photography and all your pictures are really, really shoddy, like really low resolution, like 100 by 100 pixels, is that a good photography website? Well, from the technical perspective, yeah, it's valid HTML. All the images have alt text. The images are served fantastically fast. They don't take that much bandwidth. They load really quickly. Awesome. But for the purpose that you're trying to make the website for, which is showing off your photography skills, it's terrible. Because if I want to see your pictures, I think I need more than a 100 by 100 pixel resolution to actually see lots of the detail and how well you captured the light and how great the composition is. All these thumbnails, if that's all you provide me, are not really going to do a good job of showing how great of a photographer you are. And it's similar with any website you make. You are trying to serve a certain purpose, and you have to ask yourself, what is the purpose that I'm trying to serve? And one beautiful example of that is recipes. If you want to share recipes, for whatever purpose that might be, it might be just the popularity, it might be people coming to your site and seeing ads, or because you are selling the ingredients or you're selling a recipe book or whatever, I don't know. 
But you want people to come to your website and read these recipes and be delighted by the recipe and try the recipe out and be happy with it and tell others about your website and maybe even post on all sorts of social media about it, so that even more people come and cook your recipe, see your ads, buy your book, whatever it is. And that's often where people fall short. And it's not only a technical concern, because obviously, you can make technical mistakes and then have problems. You're like, oh yeah, this recipe website actually doesn't show me anything on my mobile phone because it's broken in some very interesting specific way. Or oh, this website is super slow. I don't want to actually wait for this recipe, so I'll go somewhere else. So those are the technical considerations there. But then also, if I come to the website and everything works on the technical side, but then the recipe goes on for five screens on the life backstory or like the life background: when I grew up in the south of France, my mother used to say... No, no one cares. Like, just tell us, what is this? Why would I care for this recipe? And how do I make this actual thing? And maybe illustrate it with a video or pictures or something. And that's where SEO comes in. That's where SEO starts and says, OK, so what's the intention with which people are probably going to end up here? What's the interest that you're serving with this page? And maybe this recipe is a real quick recipe for vegans. But if your entire recipe site never mentions that this is a quick recipe and this is a vegan recipe, then how would Google know? So if it's like, oh yeah, there's this cupcake recipe that's really great and it's like rainbow cupcakes like grandmother did them or something, then this would not be something that would show up for quick vegan cupcakes, even though it is a quick and vegan cupcake recipe. 
And then people might be like, oh, why are my customers not showing up on my site? Why is it that I have to like pay a lot for ads or even if I don't pay for ads like no one shows up, this is not, this is frustrating. And then SEOs usually help you by saying like, okay, so what are you trying to convey? Well, this is a really great vegan recipe blog for quick vegan recipes. Well, then say so, right? And then search engines being bots, not humans that actually like try out all the recipes, they'll be like, okay, so this says a quick vegan recipe. It has structured data that says it is a recipe and it's vegan. And it also says that it only takes five minutes to prepare. So that's great, too. And then like these things would benefit in highlighting and showcasing your work and your content and search engines. And semantic HTML is not enough because that's just the foundation. You have to be in HTML to begin with, to make it a website. If it's semantic, it's even better because then we understand the semantics of it. But that does still not tell us if a visitor and far less so a bot, an automated piece of software, can actually understand what the website is about or if that aligns with what you want it to be about. That's the other thing. Okay, so now we understand this is a recipe for cupcakes. But how do we want to present this? Just as a recipe for cupcakes or a quick recipe for vegan cupcakes? That's a very specific thing. So that's why search engine optimization is necessary. And it's not just that. It's not just the content part. It's also the technical part because you have to also understand Google and other search engines have to make trade-offs. If you are building a program, you know that you pretty much make trade-offs all the time, any decision you make, and even which framework you're going with implies trade-offs. Yes, of course you can, I don't know, use, let's say like TypeScript with React. That works. That is totally possible. 
But you could also just use Angular where TypeScript is like a given. It's like a default configuration. You don't have to, but you can. You can also decide differently and then figure that one out yourself or work with other people's work. But Google Search has to, or other search engines as well, have to make these trade-offs of like, on one hand, we want to download a bunch of stuff from your website as quickly as possible. So if you are a huge retailer, let's say, or like a newspaper, with a bazillion articles. Let's say you started a newspaper like 10 years ago and now you have a website for your newspaper. You might easily have 10 million articles about all sorts of stuff and current affairs. So we want to get all of your articles as quickly as possible because new articles will probably pop up relatively frequently and quickly and they are current affairs. So we want to see them quickly. At the same time, we don't want to make so many requests to your website that your server starts to break down. So we need to slow down and pace ourselves a little bit. But now this is a trade-off. How much do we have to slow down? How quickly can we still go? Is there something you can do? Yeah, there's something you can do. You can, for instance, tell us if there's different variations of your website. Let's say newspaper.example, HTTPS www.newspaper.example, and HTTPS newspaper.example, and maybe HTTPS mobile.newspaper.example. And all of these are different sites, but all of them show the same content. And you can tell us, oh, by the way, I would be interested in you just going to, I don't know, HTTPS www.newspaper.example. All the others are just alternative versions of this URL, but it's the same content. So then you're telling us, oh, we don't actually need to download all these three pages for mobile, dub, dub, dub and non-dub, dub, dub. You're telling us, okay, there's one that we actually have to look at and all the others are just duplicates. 
We figured that out ourselves, but why make us waste work when we could use just this one base URL and then basically just triple our throughput with URLs that actually make sense because they're unique content. These kind of things are very specific requirements, very specific considerations that I would say most developers don't know about or don't care about. And that's not too bad, because as long as you have an SEO that tells you when something goes wrong, you don't need to worry about this. If you don't have an SEO that helps you, you need to figure that out yourself. And then it's just one more consideration that you need to keep in mind. And there's lots of these trade-offs and lots of these things that can potentially go wrong. Because if you ever had to interface with an API, you know that it might look easy on the surface, but there are all these edge cases and weirdnesses that any API has as an inherent characteristic of a computer system consuming something from another computer system. And that's exactly what happens in search engines. It's a computer system consuming your content. And yeah, that's why you need SEO.
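
One common way to tell a crawler which of several duplicate URLs is the preferred one is a `rel="canonical"` link element in the head of each variant. The Python sketch below maps the host variants from the newspaper example onto a single preferred URL and emits that element. The host list, the choice of `https://www.newspaper.example` as the preferred variant, and the function names are assumptions made for illustration.

```python
import html
from urllib.parse import urlsplit, urlunsplit

# Hypothetical setup: every host variant serves the same articles,
# and the site owner prefers the https://www. variant.
CANONICAL_SCHEME = "https"
CANONICAL_HOST = "www.newspaper.example"
DUPLICATE_HOSTS = {
    "newspaper.example",
    "www.newspaper.example",
    "mobile.newspaper.example",
}


def canonical_url(url: str) -> str:
    """Map any known variant of a URL to the single preferred URL."""
    parts = urlsplit(url)
    if parts.hostname not in DUPLICATE_HOSTS:
        raise ValueError(f"not one of our hosts: {parts.hostname!r}")
    # Keep path and query, drop the fragment, swap scheme and host.
    return urlunsplit(
        (CANONICAL_SCHEME, CANONICAL_HOST, parts.path or "/", parts.query, "")
    )


def canonical_link_tag(url: str) -> str:
    """Build the link element to place in the head of every variant."""
    href = html.escape(canonical_url(url), quote=True)
    return f'<link rel="canonical" href="{href}">'


for variant in (
    "http://newspaper.example/articles/42",
    "https://mobile.newspaper.example/articles/42",
    "https://www.newspaper.example/articles/42",
):
    print(variant, "->", canonical_link_tag(variant))
```

All three variants produce the same link element, so a crawler that honors it only needs to treat one URL per article as the real one.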

Did you work your tail off to get that senior developer gig just to realize that senior dev doesn't actually mean dream job? I've been there too. My first senior developer job was at a place where all of our triumphs were the boss's and all the failures were ours. The second one was a great place to continue to learn and grow only for it to go under due to poor management. And now I get job offers from great places to work all the time. Not only that, but the last job interview I actually sat in was a discussion about how much my podcast had helped the people interviewing me. If you're looking for a way to get into your dream job, then join our Dev Heroes Accelerator. Not only will we help you get the kind of exposure that makes you attractive to your dream employer, but you'll be able to ask them for top dollar as well. Check it out at devheroesaccelerator.com.

DAN_SHAPPIR: So listening to your explanation, I have a quick comment. First of all, it seems to me that part of the thing that the SEO does is actually make your website essentially better. I mean, if I'm thinking about that analogy that you gave with the recipe, it's probably beneficial that I mention that it's for vegans and that it's a five-minute recipe for humans, not just for search bots. And SEO can just help me organize the content in a way that makes it more accessible to bots, but also to people. So that's thought number one that occurs to me. And thought number two, it seems that effectively you've kind of extended the API of the web, the web server to browser API, to accommodate the fact that there's a machine on the other end, and some of the context and some of the assumptions cannot be made automatically. Theoretically, this could have been a part of the standard of the web itself. You've effectively turned it into a standard by saying, well, Google, the bigger search engine, that's the way that we work, and effectively made it the standard.

MARTIN_SPLITT: Pretty much, yes. It's not just Google Search. Pretty much all the other search engines have the same design decisions to make, and it's an inherent characteristic of crawling the web, I would say. Yeah. Cool.

STEVE_EDWARDS: So I got a few questions for you, Martin. I'm going to jump in here.

MARTIN_SPLITT: Sure.

STEVE_EDWARDS: I'm always a detail-oriented person. So you were just talking about how, you know, we can tell you about the different variations of the site, where there's a dub dub dub versus non dub dub dub, HTTPS versus HTTP, etc. Can you talk in some detail about how we tell you about those? What is it, like robots.txt? Is it a sitemap? What are the best ways to communicate that sort of information to Googlebot?

MARTIN_SPLITT: Absolutely. Yeah. Very happy to answer that, Stephen. I think that's a good point. We do have a bunch of documentation available for this on developers.google.com slash search that has lots of different bits and pieces that might clarify things like this, but very briefly, you have different ways of interacting with us. The first and foremost crawling-related one is the robots.txt that you mentioned. So before a robot like Googlebot would make an HTTP request to your site, it would basically go to your site slash robots.txt and look into that file to figure out, does the owner of this website even want me to make requests to this website? If so, fantastic. Then a crawling request happens and the content is then further processed. You can very granularly say, oh, I don't want this URL pattern or this folder or whatever to be crawled. And then you can exclude things from crawling in robots.txt. And you can do that for specific bots as well. So for instance, you might block Googlebot from accessing certain files that are within the members area. And we wouldn't make HTTP requests to this. You have to be a little careful though, because there's one step that I kind of skipped over earlier, which at least Google Search does, I don't know about the other search engines, but as far as I am aware, other search engines do this as well, which is rendering, where we are basically opening the website in a browser to execute the JavaScript and get whatever content comes from the JavaScript or is generated or patched by the JavaScript. And what I have seen in the past is that people are like, oh, we'll save on requests from Googlebot by disallowing our API. But then it's a client-side rendered application. So basically the website is pretty much empty in terms of the content that comes from the server.
And then the JavaScript is executed in the browser, making a bunch of requests to the API. If you do that, then Googlebot will basically go through the crawler and check the robots.txt, see, oh, the API is not allowed to be fetched. So we can't make an HTTP request to the API, which means the JavaScript gets an error, which means we are not seeing your content. So robots.txt is a very powerful weapon. On the other hand, it's not the almighty gun for everything. Because if you prevent us from crawling something, that does not stop us from indexing it. That sounds weird and paradoxical, but think about it: there is a website somewhere on the internet that says, look at this high school photo of Martin Splitt, and then links to a photo on my website. Now my website is blocked by robots.txt, which means Googlebot sees this other website, sees, oh, there's a link with the context, Martin Splitt's high school photo, and then tries to make a network request to that specific photo on my website, reads the robots.txt, and says, okay, I can't go there because I'm not allowed, thanks to the robots.txt. So I don't know what is behind this link, but I know that there's probably a high school image of Martin Splitt. So it doesn't have any information on the content there. It doesn't get many signals, but it can still put it in the index as, I know someone says that there's a high school photo of Martin Splitt. It won't usually rank very high. And in most cases, it will never show up in search results. But if we don't have anything else that ranks well enough for Martin Splitt high school photo, that thing might still come up. It might come up with just the URL or just the title, Martin Splitt's high school photo, from the other URL that links to it. But it won't show a meta description or anything because we don't have that information. We haven't been able to actually crawl it and get any information about the page. It might even be a 404. Who knows?
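A minimal sketch of the robots.txt check described here, using Python's standard `urllib.robotparser`. The robots.txt content, the `example.com` URLs, and the helper name `may_fetch` are assumptions made up for illustration; this is not Googlebot's actual implementation, only the same allow/deny decision on a toy file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every bot is kept out of /api/, and Googlebot is
# additionally kept out of /members/. Paths are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /api/

User-agent: Googlebot
Disallow: /api/
Disallow: /members/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that was already downloaded from the site."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the given crawler is allowed to request the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for url in (
        "https://example.com/blog/my-post/",
        "https://example.com/members/profile",
        "https://example.com/api/posts",
    ):
        print(url, may_fetch(rules, "Googlebot", url))
```

With a file like this, a client-side rendered page that loads its content from `/api/posts` would render empty for the crawler, which is the pitfall described in the episode. Note also that this only governs fetching; it says nothing about whether a blocked URL can still end up in the index.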
But it can still potentially show up as a low-quality search result if we don't have anything else to show in that scenario or nothing better to show in that scenario. If I want to prevent indexing, oh, I can totally do that. There are the so-called robots meta tags, which can be either a meta tag in your HTML or an HTTP header. And with the robots meta tag, basically, we make the HTTP request, and this is important, because we have to make an HTTP request to read the header or the HTML that you have there, which means you don't block this page in robots.txt. You allow us to crawl it. We crawl it. We download it. We look at the meta robots tag. And then it says, do not index this. noindex is the value that you put in. And we'll say, oh, OK, so we can fetch this from your server, but then you tell us, please don't put this in the index. Then we will not put it in the index. We stop processing right at that moment when you say noindex, and it will not show up in search results. That's one thing that you can do as a developer: manage your site both through the lens of robots.txt and also of the robots meta tag. Then I said there are the different variations, like multiple URLs pointing at the same content. For that, there's another HTML tag that you can use: the link relation canonical. So you know how you link your CSS style sheets with link rel stylesheet. In this case, you would have a link rel canonical where you tell us, this URL for this page is what we think is the URL that we care about. If you come here from another URL, you can disregard that URL. We think it should be this one. Oftentimes people use this incorrectly, which either leads to weird issues with some pages not showing up in search results afterwards or to us ignoring that link rel canonical. So we take that as a hint, as a signal, but we might choose a different URL as the canonical. That sometimes leads to weird situations.
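As a rough illustration of the two signals just described, the sketch below reads the robots meta tag (or the `X-Robots-Tag` HTTP header) and the `rel="canonical"` link from a page that has already been fetched. The class and function names (`IndexingHints`, `read_hints`, `may_index`) and the sample URLs are hypothetical; it uses only Python's standard `html.parser` and makes no claim about how a search engine processes these tags internally.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


def split_directives(value: str) -> List[str]:
    """Split a comma-separated directive list such as 'noindex, follow'."""
    return [part.strip().lower() for part in value.split(",") if part.strip()]


class IndexingHints(HTMLParser):
    """Collects robots directives and the canonical URL from an HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.robots: List[str] = []
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            self.robots.extend(split_directives(attributes.get("content") or ""))
        elif tag == "link":
            rel_values = (attributes.get("rel") or "").lower().split()
            if "canonical" in rel_values:
                self.canonical = attributes.get("href")


def read_hints(html: str, headers: Optional[Dict[str, str]] = None) -> IndexingHints:
    """Parse the page body and merge in an X-Robots-Tag header, if present."""
    hints = IndexingHints()
    hints.feed(html)
    hints.close()
    hints.robots.extend(split_directives((headers or {}).get("X-Robots-Tag", "")))
    return hints


def may_index(hints: IndexingHints) -> bool:
    """A page stays indexable unless a noindex directive was found."""
    return "noindex" not in hints.robots
```

The page has to be crawlable for any of this to be read, which is why blocking the same URL in robots.txt would defeat a `noindex` on it.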
For instance, there's this fantastic phenomenon that in Germany, Switzerland, and Austria, German is spoken. So you sometimes see the situation that a website in Germany, for instance, says, oh, we have this product here for 20 euros, and then Switzerland has the same product for 35 francs. So the only difference between these two pages is the price. But to the company in Germany it's probably very important that we are showing both the German and the Swiss version, which we will do. But the way that this works is that we say, oh, these are the same page. But then to users in Switzerland we'll show the Swiss version of the same page. And users in Germany will be shown the German version from DE, from Germany. Even though it's the same page, we kind of make the connection that there are different language versions. If you have such a situation, that you have different language versions of the same content, you can even tell us with a link relation, hreflang. You can tell us, oh yeah, the Spanish version of this page is here, the English version of the page is here, the Argentinian version, no, the Arabic version, that's what I wanted to say, the Arabic version of the page is here, the Hebrew version of the page is here, the German version of the page is there. And that way we can also figure out which language versions we would show something in. So if someone from Israel would request something, we might show the Hebrew version if they have their device and their browser and their Google Search settings pointing to Hebrew instead of English, even though the query might be... Yeah, so canonical tags. Title tags were already mentioned by Dan. Have a good title that actually describes what's on the page. So if you have an e-commerce store, it should not just always say Martin's e-commerce store on every page.
You should have a title that describes each product and each category and each description, whatever article, whatever you're putting on your website. A meta description, meta name description, with a small snippet that can potentially show up in search results is useful. Yeah.
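The language-version hint mentioned in this answer can be sketched as a small helper that renders one `<link rel="alternate" hreflang="...">` tag per version of a page. The function name `hreflang_links`, the locale codes, and the `example.com` URLs are assumptions for illustration, based on the German and Swiss product pages used as the example above.

```python
from html import escape
from typing import Dict


def hreflang_links(versions: Dict[str, str]) -> str:
    """Render alternate-language link tags for the <head> of each version.

    `versions` maps a language or language-region code to the URL of that
    version of the page.
    """
    tags = [
        '<link rel="alternate" hreflang="{}" href="{}">'.format(
            escape(code, quote=True), escape(url, quote=True)
        )
        for code, url in versions.items()
    ]
    return "\n".join(tags)


if __name__ == "__main__":
    # Hypothetical product page available for Germany and Switzerland.
    print(hreflang_links({
        "de-DE": "https://example.com/de-de/product",
        "de-CH": "https://example.com/de-ch/product",
    }))
```

In this sketch the same set of tags would be emitted on every language version, so each page lists itself and its siblings.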

DAN_SHAPPIR: Another one that I think has become very much a best practice is the nofollow attribute, no?

MARTIN_SPLITT: Oh yeah. I know exactly what you mean. Yeah, aha, best practice. So I hinted at this earlier, and I think it's probably one of the most popular patterns when it comes to Google. In 1999 or 1997, I think, when it started, the thing was that we didn't just have a catalog of keywords, but we also evaluated pages based on how well linked they were. As in, the hypothesis is, if a page is good, lots of other pages will link to it as reference material, kind of. That came to be the PageRank system, which has since been improved. We're not really using PageRank. We're using something similar in ranking. It's not the number one ranking thing, but it's a ranking factor. And the problem there is that obviously when that came out, people were like, aha, so I'll just make a hundred spam bullshit websites and have all of them link to this website that I want to rank. And then Google will see, oh, a hundred websites link to this, so it must be good. So we are taking site quality into account when that comes up. So it's like, okay, so now you have created 100 bullshit pages, but they are not linked to from anywhere. Or maybe there's a cluster of 100 bullshit pages that are linking to each other, but nothing else links to them. So we are basically building what's called a link graph over the web so that we understand certain clusters of topics and certain relations between different pages. And that's not just between different sites, that's also within your own site. So what's the information architecture looking like on your site? Which pages are linking where? Do these things mean that if I'm searching for this, not only this page is relevant, but also this other page might be relevant? And a bunch of spammers are still trying to play that system by offering basically paid links. So we are hunting down link abuse basically, or link spam, by looking at the links and whether the link looks like it's an organic link.
So someone said, oh, yeah, I think this makes sense in this context, I'll link to that other page. Or if it's spam. And oftentimes, spam happens when you have, I don't know, a blog with comments. So you have a comment form, and people will post links to their websites there in the hopes that we think that these are organic links. We are pretty good at filtering these out. But we can potentially get the feeling that your website is being primarily used as a link dump for other websites. So we might not feel so happy about your website being involved in this. To avoid that, on every link that you have on your page, you have the opportunity to set a relation attribute, a rel attribute. And if you know that the link comes from a user comment, you can always add the link relation nofollow, which means you're saying, I don't really trust this link. I don't know what is behind this link. I haven't specifically curated this. This is not an organic link. This is something else. We have recently introduced other attribute values there as well, so you can give us a little more information on why you don't trust the link. So for instance, you can say UGC for user-generated content. So the comments example would be a user on your website, so not you, the owner of the website, but a user on your website posted that link. You don't know if it's good, you don't vouch for its quality. So you can say this is a user-generated link. Don't consider this please, because it's a user-generated link. You shouldn't do that on all the links, because that way you also kind of undermine the idea of having a healthy link graph. And I think distrusting all the links that you put out there by default is not a good practice either. I think it makes sense for all the links that you don't really have control over, where you don't know if they are any good.
But if you write a blog post about something and you really like that something that you blog about, link to them. There's no harm in linking to them. That's an organic link. You liked it, you put it in, organic, all fine. If, however, you are posting a hundred posts a day that are mostly just linking out to one page, then that's maybe questionable. But yeah, nofollow is an important tool, but don't overuse it.
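A small sketch of the comment-form case discussed above: links submitted by users are rendered with `rel="ugc nofollow"`, while links the site owner writes are left alone. The helper name `comment_link` and the decision to drop non-HTTP schemes are assumptions of this example, not something stated in the episode.

```python
from html import escape
from urllib.parse import urlparse


def comment_link(url: str, text: str) -> str:
    """Render a user-submitted link so that the site does not vouch for it.

    Links with a scheme other than http or https are not rendered as links
    at all; only the escaped text is returned.
    """
    if urlparse(url).scheme not in ("http", "https"):
        return escape(text)
    return '<a href="{}" rel="ugc nofollow">{}</a>'.format(
        escape(url, quote=True), escape(text)
    )


if __name__ == "__main__":
    print(comment_link("https://example.com/?a=1&b=2", "my site"))
```

Keeping this in a separate helper that is only called for comment content matches the advice given here: editorial links in your own posts stay plain, organic links.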

STEVE_EDWARDS: So I want to come back to the canonical tag, and just to give you a little background, a few months ago I spun up a little blog site, and I'm using a statically generated site. So, you know, from an SEO standpoint, that should be really good because you're not having to query an API. It's all, you know, right there, obviously. But one of the issues that I keep running into in my Google Search Console is under coverage errors, you get "alternate page with proper canonical tag", even though I have a canonical tag in every post on the blog site. So is a canonical link something you want in every post, that would point to itself, or should that only be put in when your source of truth is actually at another URL?

MARTIN_SPLITT: So with Google Search Console, that's a tricky one, because the way that Google Search Console works is that it warns you or informs you about things that are potentially not what you want, but they don't necessarily mean that there is a problem. So in the end, if you... A brilliant example is static sites, depending on how your static site generator works. For instance, with Hugo, which I use for some of mine, oftentimes I have a folder and then an index.html file. So I can get to the same post with a trailing slash and without a trailing slash. Now Google has the problem that it might see both URLs somehow. Someone might link to it with a trailing slash. Someone might remove the trailing slash. And then it sees both versions. And then it determines, OK, so one of them is a duplicate of the other. And now it has to pick. And what can happen is that it sometimes picks one, sometimes picks the other, which looks really weird in reporting, because it looks as if certain pages come in and out of search results, which is not really what happens. But that's how it looks in the reporting, at least. Yeah, and that's not generally a problem, especially if it's a small blog. I would say everyone with less than a million pages on their site does not really need to be concerned with crawl budget or anything. If you're a really large site, that is actually a real problem. That's something that you definitely need to look into. So you can specify a canonical if you want on every blog post that points to the trailing slash version, the non-trailing slash version, the HTTPS version, the non-dub-dub-dub version, whatever it is, just make it consistent. Or alternatively, you just leave it out and say, that's not a problem. And that's also okay.
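One way to "make it consistent", sketched under assumptions: a static-site build step that maps every variant of a post URL to a single form and emits that as the canonical link. The policy chosen here (HTTPS, no `www.`, trailing slash, no `index.html`) and the function names are this example's own choices; the episode only asks that whichever form you pick is used consistently.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Normalize a page URL to one consistent form.

    Assumed policy: https scheme, host without 'www.', directory-style paths
    with a trailing slash, and no explicit 'index.html'.
    """
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]

    path = parts.path or "/"
    last_segment = path.rsplit("/", 1)[-1]
    if last_segment == "index.html":
        path = path[: -len("index.html")]
    elif not path.endswith("/") and "." not in last_segment:
        # Looks like a folder-style post URL without the trailing slash.
        path += "/"

    # Fragments are dropped because they never reach the server.
    return urlunsplit(("https", host, path, parts.query, ""))


def canonical_link_tag(url: str) -> str:
    """Render the <link rel="canonical"> tag for a page's <head>."""
    return '<link rel="canonical" href="{}">'.format(canonical_url(url))
```

With this policy, `http://www.example.com/blog/my-post`, `https://example.com/blog/my-post/` and `https://example.com/blog/my-post/index.html` all yield the same canonical URL, so both trailing-slash variants of a post would report one preferred form.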

This episode is sponsored by Sentry. Sentry is the thing that I put into all of my apps first thing. I figure out how to deploy them. I get them up on the web and then I run Sentry on them. And the reason why is because I need to know what's going on in my app all the time. The other thing is, is that sometimes I miss stuff. I'll run things in development, works on my machine. We've all been there, right? And then it gets up into the cloud or up on a server and stuff happens, stuff breaks. I didn't configure it right. AWS credentials, something like that, right? And so I need to get the error reporting back. But the other thing is, and this is something that my users typically don't give me information on is I need to know if it's performing well, right? I need to know if it's slowing down because I don't want them getting lost into the Twitterverse because my app isn't fast enough. So I put Sentry in, I get all of the information about what's going right and what's going wrong, and then I can go in and I can fix the issues right away. So if you have an app that's running slow, you have an app that's having errors, you have an app that you're just getting started with, go check it out, sentry.io slash four, that's F-O-R slash JavaScript and use the code JSJabber for three months of their base team plan.

DAN_SHAPPIR: Okay, so that'll push us to picks. Steve, let's start with you then. Did you come up with a pick for today?

STEVE_EDWARDS: Yeah, I think I did. So I'm going to go with a TV show again, and it's one that I watched quite a bit when it was on, and it went away for a while, and now it's come back on one of the various cable channels that I have. I don't think I've picked it before; if I have, I'm sorry. It was initially a cable-only show, when those first started coming around as compared to everything being initially on the networks. It's called In Plain Sight, a show that takes place in Albuquerque, New Mexico. And it's about a couple of U.S. federal marshals that are part of the witness protection program, which is where the government puts witnesses that have testified for them against really dangerous people, and the dangerous people want to kill them. So they're trying to protect them. And it was produced and written mostly by the lead actress, Mary McCormack. Just one of those shows that I really got into. I think it has seven or eight seasons and ended six or seven years ago. Again, I don't know the dates, but In Plain Sight, really, really good show, really fun to watch.

DAN_SHAPPIR: Cool. AJ, how about you? You usually have lots of excellent picks.

AJ_O’NEAL: All right, well, today is no exception. I'm gonna pick the Randall Munroe book trilogy that's not really a trilogy because they're not related: What If?, How To, and Thing Explainer. Randall Munroe is the XKCD guy and, you know, he's just funny. He's quippy and witty, and he's got these books, which you should get in hardcover so that you can keep them forever and ever in your collection and keep them prominently on display for your guests, or put them in your office waiting room. And it's just absurd explanations of things, like an overly detailed explanation of buried pirate treasure and the economics of it. And whether you can jump out of a plane with a helium tank and realistically be able to fill enough balloons with helium before you hit the ground to slow your descent, which apparently you can. And the Thing Explainer is

DAN_SHAPPIR: Don't forget to drop the tank, once you fill all the balloons.

AJ_O’NEAL: I, yeah, I don't know exactly how it works, but I think you've got to have those big, big balloons and, you know, yeah. Of course. I'm sure that's really important. Yeah. And then Thing Explainer is, using only ten hundred words, he explains the world's most complex topics, such as nuclear fission. So it's basically using words that a second or third grader would know. So the international space boat, for example, stuff like that. So I'm picking that. I'm also going to pick, we were talking about the NSA, and there's this nice little route between the NSA building and the Microsoft building. It's not that much different from the Oracle building or whatever big company you want to name, the Adobe building, et cetera, because all of those buildings are right there together. But yeah. So anyway, just for fun, there's the Google Maps link to see that. I am a little bit worried about big tech invading Utah though. I mean, it's been doing it for years and years. You know, Utah has always been a tech hub since the eighties or seventies or whatever it was. But I'm a little bit concerned with how things are changing around here. And it's starting to feel a little too much like San Francisco, a lot more. Hmm. The things I don't like, anyway. But it's interesting just to kind of see how close those things are together, the tech hub and the NSA data center. And then I'm also, uh, see, no, not going to pick that. Okay. Last thing I'm going to pick is this nice video that's a parody of a user focus group, which isn't that far off, because you have to do focus groups, right? Otherwise you get bad answers from your respondents. But I just thought of it as the perfect explanation of traditional 1970s run-of-the-mill public key cryptography and blockchain technology. That's what I came to in my mind, but it's a user study of cavemen.
It's like a one-minute clip of cavemen discovering the wheel and their feedback on what they should do with it. And I just thought it was hilarious. And it's like, oh, that's like what they did to cryptography to come up with the idea of the blockchain. Excellent. So those are the things that I am gonna leave you with some links to check out. That is all.

DAN_SHAPPIR: Okay, so I only have one pick for today. It's a pick about our very own Aimee Knight, who unfortunately had to drop off before this section because of work stuff, who would have thought. So yeah, one thing that I really like that Wix Engineering, at the company that I work at, does is that we organize a lot of technical meetups and share a lot of useful content, and then usually the videos for these meetups go online and you can find them on YouTube in the Wix Engineering Tech Talks channel. And recently we had a meetup where our very own Aimee Knight spoke, and she spoke about the technicalities of CSS, you know, how CSS actually works inside the browser, how it combines with the HTML, the DOM, to actually form the visual representation of the page, how to debug CSS, and so on and so forth. A really excellent talk, obviously, because it's Aimee. So I'm going to link to that. And that will be my pick for today. So with that, we'll go over to you, Martin. Do you have any picks for us?

MARTIN_SPLITT: Yes, I actually do. I usually fall into all sorts of weird rabbit holes on the internet. And I come up with random.

DAN_SHAPPIR: And why do you want to find them?

MARTIN_SPLITT: Yeah, it's wild. It's actually not Google, interestingly enough. So I'll pick an interesting article about the curious tale of Tegel, as in Berlin Tegel, the old airport, Tegel's Boeing 707, where they recapitulate the history of a Boeing 707 that was standing around in one of the more remote areas of Tegel Airport in Berlin, and how it got there and what happened there, and a bunch of backstory about plane hijacking and stuff. So yeah, it's quite a ride, but it's quite an interesting one, I think. Then I'm also picking a fantastic article that covers a thing that happened in Belgium. So apparently, a bunch of mutant crayfish, as in genetically manipulated female crayfish, escaped a facility in Belgium and are now proliferating and living a happy life in a Belgian cemetery, which I think is such a bizarre story. But I love that it's true. And it's actually happening. And I'm like, oh my God, this is fantastic. I for one welcome our crayfish overlords. And last but not least, because I feel compelled to share something useful as well: I think that technical writing is tricky and important to get right, and it's a useful skill. And I haven't really seen much in terms of how to educate yourself about it. I work with fantastic tech writers, like Lizzie Harvey on our team, who is an amazing tech writer. But I wanted to brush up my skills a little bit and I found, or actually was pointed to this by Marie Schweitzer, I think, an article from the Duke Graduate School, or Duke University Graduate School, on scientific writing. And it explains why it's important, how you can get better at it, what makes communication effective, and basically walks you through a bunch of lessons. I think it's three lessons on how to write better.

DAN_SHAPPIR: Excellent. So Martin, I want to thank you very, very much for coming on our show. It's a lot of amazing content that you've provided us with. I think this was really excellent, if I may say so myself. And with that, we conclude another episode of JavaScript Jabber. So thank you to our listeners. Bye-bye.

STEVE_EDWARDS: Adios.

MARTIN_SPLITT: Thanks. Thanks for having me. Bye-bye.

AJ_O’NEAL: Adios.

Bandwidth for this segment is provided by CacheFly, the world's fastest CDN. Deliver your content fast with CacheFly. Visit c-a-c-h-e-f-l-y dot com to learn more.
