orderedtrash.com/blog/2015/05/22/how-web-crawlers-work/
======
bifrost
The TL;DR version: "I don't know". This guy writes some bad-quality code and
no comments! WTF!?
~~~
x1798DE
I didn't read the article, but just for reference's sake, this doesn't
translate to "I don't know" in Spanish. You literally can't translate this
into Spanish. It's just nonsensical.
~~~
gkya
The author writes "no, I do not" though.
------
sp332
As a more technical take on this subject,
[http://www.webmonkey.com/2010/05/how_google_work_part_3_the_...](http://www.webmonkey.com/2010/05/how_google_work_part_3_the_crawl)
says "The crawler can make decisions based on what the page sends back in the
HTTP response, including page content type, length of the page, redirects
(HTTP code 303), and so on."
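Concretely, that kind of decision logic might look something like this (a rough Python sketch using the third-party requests library, not anything Google-specific):

```python
import requests  # third-party HTTP library

def classify_response(url):
    """Fetch a URL and decide, from the HTTP response alone, whether it
    is worth handing to the HTML parser."""
    resp = requests.get(url, timeout=10, allow_redirects=False)

    # Redirects (301/302/303...): remember the target instead of parsing.
    if resp.is_redirect:
        return ("redirect", resp.headers.get("Location"))

    # Only HTML is worth parsing for links.
    if "text/html" not in resp.headers.get("Content-Type", ""):
        return ("skip", "not html")

    # Skip suspiciously large pages.
    if len(resp.content) > 2_000_000:
        return ("skip", "too large")

    return ("parse", resp.text)
```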
And another article explains how this is a really important security feature
for modern browsers, making sure that a malicious or buggy browser doesn't get
to control the web.
[https://thenextweb.com/dd/2011/10/06/why-you-should-care-4-m...](https://thenextweb.com/dd/2011/10/06/why-you-should-care-4-million-google-search-users-use-firefox/)
~~~
a3n
Google is a spider, not a browser.
~~~
yen223
Google indexes pages, just like a browser does.
~~~
a3n
And a spider collects info from a browser?
------
josh2600
I don't know much about search engines, but it seems like this article is
comparing apples and oranges, or maybe even lemons and cherries. I'm not sure
how Google would know that it's not in the link list without making that a
property of the link. After all, your index and your link list aren't synced,
right? The whole point of a search engine is to index, not just copy a text
file of a website.
My theory is that Google bots crawl from page to page looking for new links,
and that it's possible a bot crawling from page A to page C (which is not on
the link list) will discover page B, which is on the link list (B <-> A). Then,
because A is on the link list, the link from B to A shows up in the list.
This works, right?
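Something like this toy breadth-first crawl is what I have in mind (hypothetical Python, just the discovery step; real crawlers obviously do far more):

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=50):
    """Breadth-first crawl: a fetched page can reveal links to pages we
    never started with (page A can lead us to B, and B back to A)."""
    seen, queue, fetched = {seed}, deque([seed]), 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue  # unreachable page; move on
        fetched += 1
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            absolute = urljoin(url, href)
            if absolute.startswith(("http://", "https://")) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen
```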
~~~
smudgymcscmudge
The article does state that Google can learn when the links it indexes change.
I do agree with you, though: it's not a clear explanation of how it works, but
that's what I got out of it. It also didn't explicitly say that all links from
the link list to a page are crawled, just that a certain proportion are.
~~~
tasuki
Yes, Google can detect when a link changes (either because it notices that the
domain has moved or because a third-party site has posted a link to that page
and linked it back to itself). There are lots of places on the web where that
happens. But this article only shows how Google _can_ do that. It's not clear
how it actually _does_ it.
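One plausible mechanism (a sketch only, not a claim about what Google actually runs) is to re-fetch with the validators the server handed back last time, so an unchanged page costs almost nothing to check:

```python
import requests

def fetch_if_changed(url, etag=None, last_modified=None):
    """Re-fetch a page only if the server says it changed since the last
    visit, using the validators it sent back previously."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified

    resp = requests.get(url, headers=headers, timeout=10)
    if resp.status_code == 304:   # Not Modified: nothing to re-index
        return None, etag, last_modified
    return resp.text, resp.headers.get("ETag"), resp.headers.get("Last-Modified")
```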
------
bifrost
It's not really that hard; it's really just a case of 'how often is this page
changed'. It's pretty easy to do with regular expressions.
The article didn't talk about whether they're plain links or embedded links,
what order they're crawled in, etc.
But it's also likely to get you de-indexed from Google for crawling your own
pages, so it's not really an exhaustive article either.
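If by regular expressions we mean pulling hrefs straight out of the markup, the quick-and-dirty version is roughly this (fragile on real-world HTML, so treat it as a sketch only):

```python
import re

# Crude: breaks on unquoted attributes, comments, scripts, etc.
HREF_RE = re.compile(r'<a\s[^>]*href=["\']([^"\']+)["\']', re.IGNORECASE)

def extract_links(html):
    """Pull href values out of anchor tags with a regex."""
    return HREF_RE.findall(html)

print(extract_links('<a href="/about">About</a>'))  # ['/about']
```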
------
jondubois
Would Google not need to learn how to parse HTML? In other words, you would
need to make an exhaustive search of the web before Google can determine what
links are important. Otherwise it could potentially crawl your page and then
stop because it has reached a point where it needs more knowledge -
"information not yet in its database".
~~~
cma
This is about GoogleBot, not Google index or google search. GoogleBot knows
how to parse HTML, even if the site doesn't send its full HTML to users.
------
gcb0
Because links are in your page in every HTML instance.
But this is not enough: you could be crawling all the web and still be missing
the links.
~~~
blotter_paper
[http://robert.ocallahan.org/2009/07/getting-started-with-the...](http://robert.ocallahan.org/2009/07/getting-started-with-the-robots-exclusion-standard/)
------
dandare
I do not understand the question "how often is this page changed". Can anyone
explain in plain words?
~~~
jakeogh
I'm going to guess that the "how often" varies greatly between sites, and that
the page was written for human consumption. It's hard to ask a bot "Do you need
this again in X time?" You have to ask it in a context. Maybe you can say the
page has N changes between Y and Z, but not really.
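The most you can compute is an observed rate and turn it into a revisit interval, something like this (made-up numbers, just to make the "N changes between Y and Z" idea concrete):

```python
def recrawl_interval_hours(changes_observed, hours_observed,
                           min_hours=1, max_hours=24 * 30):
    """Turn "the page had N changes between Y and Z" into a crude
    revisit interval: check roughly once per expected change."""
    if changes_observed == 0:
        return max_hours                  # looks static, so back off
    interval = hours_observed / changes_observed
    return min(max(interval, min_hours), max_hours)

print(recrawl_interval_hours(6, 48))  # changed 6 times in 48h -> ~8.0
```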
------
smcleod
I'm guessing this is referring to Googlebot rather than Google search - I've
never seen that information made available or used by the search engine
itself.
~~~
smcleod
It appears that this is the case [https://developers.google.com/search/docs/data-crawling/sched...](https://developers.google.com/search/docs/data-crawling/scheduling)
I don't think this is Google's primary way of crawling web sites, though; it's
just a part of a bot used for keeping search results updated. For example,
YouTube crawls channels, which is likely how this site works too. I imagine it
would be useful for the search engine, as it would mean you only needed to check
what content of yours has changed, as opposed to having to keep entire copies
of the site in the crawler.
That said I'm not a Google engineer so I'm not really qualified to say for
sure.
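Keeping just a fingerprint per URL would be enough for the "check what changed without keeping a full copy" part; a sketch of my own, not anything from the linked docs:

```python
import hashlib

class ChangeTracker:
    """Keep one SHA-256 fingerprint per URL instead of a full copy, so the
    crawler can cheaply tell whether a page changed since the last visit."""
    def __init__(self):
        self.fingerprints = {}

    def has_changed(self, url, page_bytes):
        digest = hashlib.sha256(page_bytes).hexdigest()
        changed = self.fingerprints.get(url) != digest
        self.fingerprints[url] = digest
        return changed

tracker = ChangeTracker()
print(tracker.has_changed("https://example.com/", b"<html>v1</html>"))  # True (first visit)
print(tracker.has_changed("https://example.com/", b"<html>v1</html>"))  # False (unchanged)
print(tracker.has_changed("https://example.com/", b"<html>v2</html>"))  # True (changed)
```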
~~~
sli
The robots.txt directive [https://github.com/hootsuite/webflow-ui/blob/master/docs/runn...](https://github.com/hootsuite/webflow-ui/blob/master/docs/running_docker/README.md#robots-txt-directive) is a
useful way of telling Googlebot that you never want it to crawl the page.
Whether this is relevant to you depends on what you have on your site (which
we're not told) and what you want to achieve by letting Google index your site
in the first place (which we're not told either).
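For what it's worth, Python's standard library can show what a directive like that does; the Disallow line below is just an illustration, not taken from the linked doc:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Example robots.txt lines; a real crawler would fetch /robots.txt itself.
rp.parse([
    "User-agent: Googlebot",
    "Disallow: /private/",
])

print(rp.can_fetch("Googlebot", "https://example.com/private/page"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/public/page"))   # True
```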