A Breakdown of HTML Usage Across ~8 Million Pages (& What It Means for Modern SEO)


Not long ago, my colleagues and I at Advanced Web Ranking came up with an HTML study based on about 8 million index pages gathered from the top twenty Google results for more than 30 million keywords.

We wrote about the markup results and how the top twenty Google results pages implement them, then went even further and obtained HTML usage insights on them.

What does this have to do with SEO?

The way HTML is written dictates what users see and how search engines interpret web pages. A valid, well-formatted HTML page also reduces possible misinterpretation (of structured data, metadata, language, or encoding) by search engines.

This is intended to be a technical SEO audit, something we wanted to do from the beginning: a breakdown of HTML usage and how the results relate to modern SEO techniques and best practices.

In this article, we’re going to address things like meta tags that Google understands, JSON-LD structured data, language detection, headings usage, social links & meta distribution, AMP, and more.

Meta tags that Google understands

When talking about the main search engines as traffic sources, sadly it’s just Google and the rest, with DuckDuckGo gaining traction lately and Bing almost nonexistent.

Thus, in this section we’ll be focusing solely on the meta tags that Google listed in the Search Console Help Center.

Figure: Pie chart showing the total numbers for the meta tags that Google understands, described in detail in the sections below.

<meta name="description" content="…">

The meta description is a ~150 character snippet that summarizes a page’s content. Search engines show the meta description in the search results when the searched phrase is contained in the description.

SELECTOR | COUNT
<meta name="description" content="*"> | 4,391,448
<meta name="description" content=""> | 374,649
<meta name="description"> | 13,831

At the extremes, we found 685,341 meta elements with content shorter than 30 characters and 1,293,842 elements with content longer than 160 characters.
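To make those buckets concrete, here is a minimal Python sketch of how a single page could be checked against the same 30 and 160 character cut-offs. It is not the tooling used for the study; the names `MetaDescriptionParser`, `classify_description`, and `audit_descriptions` are hypothetical, and only the standard library is assumed.

```python
from html.parser import HTMLParser


class MetaDescriptionParser(HTMLParser):
    """Collects the content of every <meta name="description"> tag."""

    def __init__(self):
        super().__init__()
        self.descriptions = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").strip().lower() == "description":
            # A missing content attribute is recorded as None.
            self.descriptions.append(attributes.get("content"))


def classify_description(content, short=30, long=160):
    """Buckets a description using the 30/160 character cut-offs (assumed defaults)."""
    if content is None:
        return "no content attribute"
    text = content.strip()
    if not text:
        return "empty"
    if len(text) < short:
        return "too short"
    if len(text) > long:
        return "too long"
    return "ok"


def audit_descriptions(html):
    """Returns one bucket label per meta description found in the markup."""
    parser = MetaDescriptionParser()
    parser.feed(html)
    return [classify_description(d) for d in parser.descriptions]


if __name__ == "__main__":
    print(audit_descriptions('<meta name="description" content="Too short">'))
```

A page with more than one description tag returns more than one label, which is itself worth flagging in an audit.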

<title>

The title is technically not a meta tag, but it’s used in conjunction with meta name="description".

This is one of the two most important HTML tags when it comes to SEO. It’s also a must according to W3C, meaning no page is valid with a missing title tag.

Research suggests that if you keep your titles under a reasonable 60 characters then you can expect your titles to be rendered properly in the SERPs. In the past, there were signs that Google’s search results title length was extended, but it wasn’t a permanent change.

Considering all the above, from the full 6,263,396 titles we found, 1,846,642 title tags appear to be too long (more than 60 characters) and 1,985,020 titles had lengths considered too short (under 30 characters).

Figure: Pie chart showing the title tag length distribution, with a length less than 30 chars being 31.7% and a length greater than 60 chars being about 29.5%.

A title being too short shouldn’t be a problem; after all, it’s a subjective thing depending on the website’s business. Meaning can be expressed with fewer words, but it’s definitely a sign of a wasted optimization opportunity.

SELECTOR | COUNT
<title>*</title> | 6,263,396
missing <title> tag | 1,285,738

Another interesting thing is that, among the sites ranking on page 1–2 of Google, 351,516 (~5% of the total 7.5M) are using the same text for the title and h1 on their index pages.

Also, did you know that with HTML5 you only need to specify the HTML5 doctype and a title in order to have a perfectly valid page?
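For anyone who wants to run the same two title checks on their own pages, the following Python sketch applies the 30 and 60 character thresholds and compares the title with each h1. The class and function names are hypothetical, and the whitespace-normalized, case-insensitive comparison is our assumption for this sketch, not necessarily the rule the study applied.

```python
from html.parser import HTMLParser


class TitleAndH1Parser(HTMLParser):
    """Captures the text of the first <title> and of every <h1>."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.h1_texts = []
        self._current = None  # tag whose text is currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1" or (tag == "title" and self.title is None):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return
        # Collapse runs of whitespace so formatting differences do not matter.
        text = " ".join("".join(self._buffer).split())
        if tag == "title":
            self.title = text
        else:
            self.h1_texts.append(text)
        self._current = None


def audit_title(html, short=30, long=60):
    """Reports the title, its length bucket, and whether an h1 repeats it."""
    parser = TitleAndH1Parser()
    parser.feed(html)
    parser.close()
    title = parser.title
    if not title:
        return {"title": None, "length": "missing", "same_as_h1": False}
    if len(title) < short:
        length = "short"
    elif len(title) > long:
        length = "long"
    else:
        length = "ok"
    same = title.casefold() in (h.casefold() for h in parser.h1_texts)
    return {"title": title, "length": length, "same_as_h1": same}
```

Calling `audit_title(page_source)` on a fetched index page returns a small dictionary that can be aggregated across a site.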

<!DOCTYPE html>
<title>red</title>

<meta name="robots|googlebot">

“These meta tags can control the behavior of search engine crawling and indexing. The robots meta tag applies to all search engines, while the “googlebot” meta tag is specific to Google.”
– Meta tags that Google understands

SELECTOR | COUNT
<meta name="robots" content="..., ..."> | 1,577,202
<meta name="googlebot" content="..., ..."> | 139,458

Figure: HTML snippet with a meta robots tag and its content parameters.

So the robots meta directives provide instructions to search engines on how to crawl and index a page’s content. Leaving aside the googlebot meta count, which is kind of low, we were curious to see the most frequent robots parameters, considering that a huge misconception is that you have to add a robots meta tag in your HTML’s head. Here’s the top 5:

SELECTOR | COUNT
<meta name="robots" content="index,follow"> | 632,822
<meta name="robots" content="index"> | 180,226
<meta name="robots" content="noodp"> | 115,128
<meta name="robots" content="all"> | 111,777
<meta name="robots" content="nofollow"> | 83,639
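Since indexing and following are the default behavior, values like index,follow or all add nothing, and only restrictive directives change anything. The Python sketch below illustrates this by collecting the directives from both tags and reporting a page as blocked only when noindex or none is present. The names `RobotsMetaParser` and `is_indexable` are hypothetical, and the sketch ignores the X-Robots-Tag HTTP header and robots.txt.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = {"robots": set(), "googlebot": set()}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").strip().lower()
        if name in self.directives:
            content = attributes.get("content") or ""
            # Directives are comma-separated and treated as case-insensitive.
            for token in content.split(","):
                token = token.strip().lower()
                if token:
                    self.directives[name].add(token)


def is_indexable(html):
    """Returns False only when a robots or googlebot meta tag asks not to index."""
    parser = RobotsMetaParser()
    parser.feed(html)
    blocking = {"noindex", "none"}  # "none" is shorthand for noindex, nofollow
    found = parser.directives["robots"] | parser.directives["googlebot"]
    return not (found & blocking)
```

A page with no robots meta tag at all comes back as indexable, which is the point of the misconception mentioned above.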

<meta name="google" content="nositelinkssearchbox">

“When users search for your site, Google Search results sometimes display a search box specific to your site, along with other direct links to your site. This meta tag tells Google not to show the sitelinks search box.”
– Meta tags that Google understands

SELECTOR | COUNT
<meta name="google" content="nositelinkssearchbox"> | 1,263

Unsurprisingly, not many websites choose to explicitly tell Google not to show a sitelinks search box when their site appears in the search results.

<meta name="google" content="notranslate">

“This meta tag tells Google that you don’t want us to provide a translation for this page.” – Meta tags that Google understands

There may be situations where providing your content to a much larger group of users is not desired. Just as it says in the Google support answer above, this meta tag tells Google that you don’t want them to provide a translation for this page.

SELECTOR | COUNT
<meta name="google" content="notranslate"> | 7,569

<meta name="google-site-verification" content="…">

“You can use this tag on the top-level page of your site to verify ownership for Search Console.”
– Meta tags that Google understands

SELECTOR | COUNT
<meta name="google-site-verification" content="..."> | 1,327,616

While we’re on the subject, did you know that if you’re a verified owner of a Google Analytics property, Google will now automatically verify that same website in Search Console?

<meta charset="…" >

“This defines the page’s content type and character set.”
– Meta tags that Google understands

This is basically one of the good meta tags. It defines the page’s content type and character set. Considering the table below, we noticed that just about half of the index pages we analyzed define a meta charset.

SELECTOR | COUNT
<meta charset="..." > | 3,909,788

<meta http-equiv="refresh" content="…;url=…">

“This meta tag sends the user to a new URL after a certain amount of time and is sometimes used as a simple form of redirection.”
– Meta tags that Google understands

It’s preferable to redirect your site using a 301 redirect rather than a meta refresh, especially when we assume that 30x redirects don’t lose PageRank and the W3C recommends that this tag not be used. Google is not a fan either, recommending you use a server-side 301 redirect instead.

SELECTOR | COUNT
<meta http-equiv="refresh" content="...;url=..."> | 7,167

From the total 7.5M index pages we parsed, we found 7,167 pages that are using the above redirect method. Authors do not always have control over server-side technologies and apparently they use this technique in order to enable redirects on the client side.

Also, using Workers is a cutting-edge alternative in order to overcome issues when working with legacy tech stacks and platform limitations.
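If you want to find out whether any of your own pages still rely on a meta refresh redirect, a rough Python sketch like the one below can flag them. The helper names `parse_refresh`, `MetaRefreshParser`, and `find_meta_redirects` are hypothetical, and the parsing is deliberately simplified: it assumes the delay and the target are separated by a semicolon, as in the selector above.

```python
import re
from html.parser import HTMLParser


def parse_refresh(content):
    """Splits a refresh value such as "5;url=/new" into (delay_seconds, url)."""
    delay_part, _, target = content.partition(";")
    try:
        delay = float(delay_part.strip())
    except ValueError:
        return None  # not a usable refresh value
    # Drop an optional "url=" prefix, then any surrounding quotes.
    target = re.sub(r"^url\s*=\s*", "", target.strip(), flags=re.IGNORECASE)
    target = target.strip("'\"")
    return delay, (target or None)


class MetaRefreshParser(HTMLParser):
    """Collects parsed values of every <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.refreshes = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").strip().lower() == "refresh":
            parsed = parse_refresh(attributes.get("content") or "")
            if parsed is not None:
                self.refreshes.append(parsed)


def find_meta_redirects(html):
    """Returns (delay, url) pairs for refresh tags that point to another URL."""
    parser = MetaRefreshParser()
    parser.feed(html)
    return [(delay, url) for delay, url in parser.refreshes if url]
```

A refresh value without a URL only reloads the current page, so it is left out of the result; every pair that is returned is a candidate for a server-side 301.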

<meta name="viewport" content="…">

“This tag tells the browser how to render a page on a mobile device. Presence of this tag indicates to Google that the page is mobile-friendly.”
– Meta tags that Google understands

SELECTOR | COUNT
<meta name="viewport" content="..."> | 4,992,791

Starting July 1, 2019, mobile-first indexing is enabled by default for all new websites. Lighthouse checks whether there’s a meta name="viewport" tag in the head of the document, so this meta should be on every webpage, no matter what framework or CMS you’re using.

Considering the above, we would have expected more websites than the 4,992,791 out of 7.5 million index pages analyzed to use a valid meta name="viewport" in their head sections.

Designing mobile-friendly sites ensures that your pages perform well on all devices, so make sure your web page is mobile-friendly.

<meta name="rating" content="…" />

“Labels a page as containing adult content, to signal that it be filtered by SafeSearch results.”
– Meta tags that Google understands

SELECTOR | COUNT
<meta name="rating" content="..." /> | 133,387

This tag is used to denote the maturity rating of content. It was not added to the list of meta tags that Google understands until recently. Check out this article by Kate Morris on how to tag adult content.

JSON-LD structured data

Structured data is a standardized format for providing information about a page and classifying the page content. The format of structured data can be Microdata, RDFa, or JSON-LD; all of these help Google understand the content of your site and trigger special search result features for your pages.

In a conversation, the awesome Dan Shure came up with a good idea: to look for structured data, such as the organization’s logo, in search results and in the Knowledge Graph.

In this section, we’ll be using JSON-LD (JavaScript Object Notation for Linked Data) only in order to gather structured data info. This is what Google recommends anyway for providing clues about the meaning of a web page.

Some useful bits on this:

  • At Google I/O 2019, it was announced that the structured data testing tool will be superseded by the rich results testing tool.
  • Now Googlebot indexes web pages using the latest Chromium rather than the old Chrome 41, meaning you can mitigate the SEO issues you may have had in the past, with structured data support as well.
  • Jason Barnard had an interesting talk at SMX London 2019 on how Google Search ranking works and, according to his theory, there are seven ranking factors we can count on; structured data is definitely one of them.
  • Builtvisible’s guide on Microdata, JSON-LD, & Schema.org contains everything you need to know about using structured data on your website.
  • Here’s an awesome guide to JSON-LD for beginners by Alexis Sanders.
  • Last but not least, there are lots of articles, presentations, and posts to dive into on the official JSON for Linking Data website.

Advanced Web Ranking’s HTML study relies on analyzing index pages only. What’s interesting is that even though it’s not stated in the guidelines, Google doesn’t seem to care about structured data on index pages, as stated in a Stack Overflow answer by Gary Illyes several years ago. Yet, for the JSON-LD structured data types that Google understands, we found a total of 2,727,045 features:

Figure: Pie chart showing the structured data types that Google understands, with Sitelinks searchbox having the highest value at 49.7%.

STRUCTURED DATA FEATURES | COUNT
Article | 35,961
Breadcrumb | 30,306
Book | 143
Carousel | 13,884
Corporate contact | 41,588
Course | 676
Critic review | 2,740
Dataset | 28
Employer aggregate rating | 7
Event | 18,385
Fact check | 7
FAQ page | 16
How-to | 8
Job posting | 355
Livestream | 232
Local business | 200,974
Logo | 442,324
Movie | 1,274
Occupation | 0
Product | 16,090
Q&A page | 20
Recipe | 434
Review snippet | 72,732
Sitelinks searchbox | 1,354,754
Social profile | 478,099
Software app | 780
Speakable | 516
Subscription and paywalled content | 363
Video | 14,349
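To see which JSON-LD types a given page declares, a sketch along the following lines can help. It reads every script type="application/ld+json" block and counts the @type values it finds, including nested ones. The names `JsonLdParser`, `iter_types`, and `count_jsonld_types` are hypothetical, and the sketch only lists schema.org types; mapping them to Google’s feature names (a Sitelinks searchbox, for instance, is a WebSite with a SearchAction) is left out.

```python
import json
from collections import Counter
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._inside = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").strip().lower()
        if tag == "script" and script_type == "application/ld+json":
            self._inside = True
            self._buffer = []

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._inside:
            self.blocks.append("".join(self._buffer))
            self._inside = False


def iter_types(node):
    """Yields every @type value found anywhere in a decoded JSON-LD document."""
    if isinstance(node, dict):
        declared = node.get("@type")
        if isinstance(declared, str):
            yield declared
        elif isinstance(declared, list):
            yield from (t for t in declared if isinstance(t, str))
        for value in node.values():
            yield from iter_types(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_types(item)


def count_jsonld_types(html):
    """Counts schema.org types declared in the page's JSON-LD blocks."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    counts = Counter()
    for block in parser.blocks:
        try:
            document = json.loads(block)
        except json.JSONDecodeError:
            continue  # invalid JSON is skipped rather than guessed at
        counts.update(iter_types(document))
    return counts
```

Blocks that fail to parse are silently skipped here; in a real audit you would probably want to report them, since broken JSON-LD is invisible to search engines too.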

rel=canonical

The rel=canonical element, often called the “canonical link,” is an HTML element that helps webmasters prevent duplicate content issues. It does this by specifying the “canonical URL,” the “preferred” version of a web page.

SELECTOR | COUNT
<link rel=canonical href="*"> | 3,183,575

meta name="keywords"

It’s not new that <meta name="keywords"> is obsolete and Google doesn’t use it anymore. It also appears as though <meta name="keywords"> is a spam signal for most of the search engines.

“While the main search engines don’t use meta keywords for ranking, they’re very useful for onsite search engines like Solr.”
– JP Sherman on why this obsolete meta might still be useful nowadays.

SELECTOR | COUNT
<meta name="keywords" content="*"> | 2,577,850
<meta name="keywords" content=""> | 256,220
<meta name="keywords"> | 14,127

Headings

Within 7.5 million pages, h1 (59.6%) and h2 (58.9%) are among the twenty-eight elements used on the most pages. Still, after gathering all the headings, we found that h3 is the heading with the largest number of appearances: 29,565,562 h3s out of 70,428,376 total headings found.

Random facts:

  • The h1–h6 elements represent the six levels of section headings. Here are the full stats on headings usage, but we found 23,116 h7s and 7,276 h8s too. That’s a funny thing because plenty of people don’t even use h6s very often.
  • There are 3,046,879 pages with missing h1 tags; within the remaining 4,502,255 pages, h1 is used about 2.6 times per page on average, for a total of 11,675,565 h1 elements.
  • While there are 6,263,396 pages with a valid title, as seen above, only 4,502,255 of them are using an h1 within the body of their content.
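The heading checks above (missing h1, more than one h1, and invalid levels such as h7) are easy to reproduce on a single page. Here is a small Python sketch; `HeadingCounter` and `heading_report` are hypothetical names, and the sketch counts start tags only, without looking at the heading text or nesting order.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Matches h1, h2, ... and also invalid levels such as h7 or h8.
HEADING = re.compile(r"^h[1-9]\d*$")


class HeadingCounter(HTMLParser):
    """Counts start tags that look like headings, valid or not."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if HEADING.match(tag):
            self.counts[tag] += 1


def heading_report(html):
    """Summarizes heading usage for one page."""
    parser = HeadingCounter()
    parser.feed(html)
    valid = {f"h{level}" for level in range(1, 7)}
    return {
        "counts": dict(parser.counts),
        "missing_h1": parser.counts["h1"] == 0,
        "multiple_h1": parser.counts["h1"] > 1,
        "invalid": sorted(set(parser.counts) - valid),
    }
```

Anything listed under "invalid" is not a heading as far as browsers and search engines are concerned, so it is usually a templating mistake worth fixing.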

Missing alt tags

This eternal SEO and accessibility issue still seems to be common after analyzing this set of data. From the total of 669,591,743 images, about 88% are missing the alt attribute or use it with a blank value.

Figure: Pie chart showing the img tag alt attribute distribution, with missing alt being predominant: 81.7% of a total of about 670 million images we found.

SELECTOR | COUNT
img | 669,591,743
img alt="*" | 79,953,034
img alt="" | 42,815,769
img w/ missing alt | 546,822,940
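The three buckets in the table can be reproduced for any page with a few lines of Python. The sketch below is only an illustration with hypothetical names (`ImageAltCounter`, `alt_report`); it treats a bare alt attribute the same as alt="", which is our assumption.

```python
from collections import Counter
from html.parser import HTMLParser


class ImageAltCounter(HTMLParser):
    """Sorts <img> tags into with_alt, empty_alt, and missing_alt buckets."""

    def __init__(self):
        super().__init__()
        self.counts = Counter({"with_alt": 0, "empty_alt": 0, "missing_alt": 0})

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        if "alt" not in attributes:
            self.counts["missing_alt"] += 1
        elif (attributes["alt"] or "").strip():
            self.counts["with_alt"] += 1
        else:
            self.counts["empty_alt"] += 1  # alt="" or a bare alt attribute


def alt_report(html):
    """Returns the number of images in each alt bucket."""
    parser = ImageAltCounter()
    parser.feed(html)
    return dict(parser.counts)
```

Keep in mind that an empty alt is the correct choice for purely decorative images, so only the missing_alt bucket is unambiguously a problem.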

Language detection

According to the specs, the language information specified via the lang attribute may be used by a user agent to control rendering in a variety of ways.

The part we’re interested in here is about “assisting search engines.”

“The HTML lang attribute is used to identify the language of text content on the web. This information helps search engines return language specific results, and it is also used by screen readers that switch language profiles to provide the correct accent and pronunciation.”
– Léonie Watson

A while ago, John Mueller said Google ignores the HTML lang attribute and recommended the use of link hreflang instead. The Google Search Console documentation states that Google uses hreflang tags to match the user’s language preference to the right variation of your pages.

Figure: Bar chart showing that 65% of the 7.5 million index pages use the lang attribute on the html element, while 21.6% use at least one link hreflang.

Of the 7.5 million index pages that we were able to look into, 4,903,665 use the lang attribute on the html element. That’s about 65%!

When it comes to the hreflang attribute, suggesting the existence of a multilingual website, we found about 1,631,602 pages, which means around 21.6% of index pages use at least one link rel="alternate" href="*" hreflang="*" element.
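Both signals can be read from a page in one pass. The following Python sketch returns the lang value of the html element together with any link rel="alternate" hreflang annotations it finds. `LanguageSignalParser` and `language_signals` are hypothetical names, and hreflang annotations delivered through HTTP headers or sitemaps are not covered.

```python
from html.parser import HTMLParser


class LanguageSignalParser(HTMLParser):
    """Reads <html lang> and <link rel="alternate" hreflang> from a page."""

    def __init__(self):
        super().__init__()
        self.html_lang = None
        self.alternates = {}  # hreflang value -> href

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "html" and self.html_lang is None:
            self.html_lang = (attributes.get("lang") or "").strip() or None
        elif tag == "link":
            rel_tokens = (attributes.get("rel") or "").lower().split()
            hreflang = (attributes.get("hreflang") or "").strip()
            href = (attributes.get("href") or "").strip()
            if "alternate" in rel_tokens and hreflang and href:
                self.alternates[hreflang.lower()] = href


def language_signals(html):
    """Returns the declared page language and its hreflang alternates."""
    parser = LanguageSignalParser()
    parser.feed(html)
    return {"lang": parser.html_lang, "hreflang": parser.alternates}
```

A page that returns a lang value but an empty hreflang mapping matches the most common pattern in the numbers above: a declared language with no alternate versions.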

Google Tag Manager

From the beginning, Google Analytics’ main task was to generate reports and statistics about your website. But if you want to group certain pages together to see how people are navigating through that funnel, you need a unique Google Analytics tag. This is where things get complicated.

Google Tag Manager makes it easier to:

  • Manage this mess of tags by letting you define custom rules for when and what user actions your tags should fire
  • Change your tags whenever you want without actually changing the source code of your website, which sometimes can be a headache due to slow release cycles
  • Use other analytics/marketing tools with GTM, again without touching the website’s source code

We searched for *googletagmanager.com/gtm.js references and saw that about 345,979 pages are using Google Tag Manager.

rel="nofollow"

“Nofollow” provides a way for webmasters to tell search engines “don’t follow links on this page” or “don’t follow this specific link.”

Google does not follow these links and likewise does not transfer equity. Considering this, we were curious about rel="nofollow" numbers. We found a total of 12,828,286 rel="nofollow" links within 7.5 million index pages, with a computed average of 1.69 rel="nofollow" per page.

Last month, Google announced two new link attribute values that should be used in order to mark the nofollow property of a link: rel="sponsored" and rel="ugc". I’d recommend you go read Cyrus Shepard’s article on how Google’s nofollow, sponsored, & ugc links impact SEO, and learn why Google changed nofollow, the ranking impact of nofollow links, and more.

Figure: A table showing how Google’s nofollow, sponsored, and UGC link attributes impact SEO, from Cyrus Shepard’s article.

We went a bit further and looked up these new link attribute values, finding 278 rel="sponsored" and 123 rel="ugc". To make sure we had the relevant data for these queries, we updated the index pages set specifically two weeks after the Google announcement on this matter. Then, using Moz authority metrics, we sorted out the top URLs we found that use at least one of the rel="sponsored" or rel="ugc" pair:

  • https://www.seroundtable.com/
  • https://letsencrypt.org/
  • https://www.newsbomb.gr/
  • https://thehackernews.com/
  • https://www.ccn.com/
  • https://www.chip.pl/
  • https://www.gamereactor./
  • https://www.tribes.co.uk/
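Counting these values on a page comes down to reading the rel attribute as a list of space-separated tokens, so a link marked rel="nofollow ugc" counts for both. The Python sketch below does exactly that; `LinkRelCounter` and `rel_report` are hypothetical names, and the sketch only looks at a elements that carry an href.

```python
from collections import Counter
from html.parser import HTMLParser

TRACKED = ("nofollow", "sponsored", "ugc")


class LinkRelCounter(HTMLParser):
    """Counts links and how many of them carry each tracked rel value."""

    def __init__(self):
        super().__init__()
        self.links = 0
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        if not attributes.get("href"):
            return  # anchors without a destination are not links
        self.links += 1
        # rel holds space-separated tokens, compared case-insensitively.
        tokens = set((attributes.get("rel") or "").lower().split())
        for token in TRACKED:
            if token in tokens:
                self.counts[token] += 1


def rel_report(html):
    """Returns the link total plus one count per tracked rel value."""
    parser = LinkRelCounter()
    parser.feed(html)
    return {"links": parser.links, **{t: parser.counts[t] for t in TRACKED}}
```

Dividing the nofollow count by the number of pages checked gives the same kind of per-page average quoted above.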

AMP

Accelerated Mobile Pages (AMP) are a Google initiative which aims to speed up the web. Many publishers are making their content available in the AMP format in parallel.

To let Google and other platforms know about it, you need to link AMP and non-AMP pages together.

Within the millions of pages we looked at, we found only 24,807 non-AMP pages referencing their AMP version using rel=amphtml.

Social

We wanted to know how shareable or social a website is nowadays, so knowing that Josh Buchea made an awesome list with everything that could go in the head of your webpage, we extracted the social sections from there and got the following numbers:

Facebook Graph

Figure: Bar chart showing the Facebook Graph meta tags distribution, described in detail in the table below.

SELECTOR | COUNT
meta property="fb:app_id" content="*" | 277,406
meta property="og:url" content="*" | 2,909,878
meta property="og:type" content="*" | 2,660,215
meta property="og:title" content="*" | 3,050,462
meta property="og:image" content="*" | 2,603,057
meta property="og:image:alt" content="*" | 54,513
meta property="og:description" content="*" | 1,384,658
meta property="og:site_name" content="*" | 2,618,713
meta property="og:locale" content="*" | 1,384,658
meta property="article:author" content="*" | 14,289

Twitter card

Figure: Bar chart showing the Twitter Card meta tags distribution, described in detail in the table below.

SELECTOR | COUNT
meta name="twitter:card" content="*" | 1,535,733
meta name="twitter:site" content="*" | 512,907
meta name="twitter:creator" content="*" | 283,533
meta name="twitter:url" content="*" | 265,478
meta name="twitter:title" content="*" | 716,577
meta name="twitter:description" content="*" | 1,145,413
meta name="twitter:image" content="*" | 716,577
meta name="twitter:image:alt" content="*" | 30,339
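A quick way to check a page against these tables is to collect its Open Graph and Twitter Card tags and list the ones that are absent. In the Python sketch below, the expected sets are our assumption: the four properties the Open Graph protocol marks as required (og:title, og:type, og:image, og:url) and twitter:card. `SocialMetaParser` and `missing_social_tags` are hypothetical names.

```python
from html.parser import HTMLParser


class SocialMetaParser(HTMLParser):
    """Collects Open Graph (property=...) and Twitter Card (name=...) meta tags."""

    def __init__(self):
        super().__init__()
        self.open_graph = {}
        self.twitter = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        content = (attributes.get("content") or "").strip()
        if not content:
            return  # tags without a value are treated as absent
        prop = (attributes.get("property") or "").strip().lower()
        name = (attributes.get("name") or "").strip().lower()
        if prop.startswith(("og:", "fb:", "article:")):
            self.open_graph.setdefault(prop, content)
        if name.startswith("twitter:"):
            self.twitter.setdefault(name, content)


def missing_social_tags(html):
    """Lists the expected social tags that the page does not declare."""
    parser = SocialMetaParser()
    parser.feed(html)
    expected_og = ("og:title", "og:type", "og:image", "og:url")  # assumed baseline
    expected_twitter = ("twitter:card",)  # assumed baseline
    return {
        "open_graph": [p for p in expected_og if p not in parser.open_graph],
        "twitter": [n for n in expected_twitter if n not in parser.twitter],
    }
```

Two empty lists mean the page covers the baseline; anything else is a short to-do list for the template.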

And speaking of links, we grabbed all of them that were pointing to the most popular social networks.

Figure: Pie chart showing the external social links distribution, described in detail in the table below.

SELECTOR | COUNT
<a href*="facebook.com"> | 6,180,313
<a href*="twitter.com"> | 5,214,768
<a href*="linkedin.com"> | 1,148,828
<a href*="plus.google.com"> | 1,019,970

Apparently there are lots of websites that still link to their Google+ profiles, which is probably an oversight considering the not-so-recent Google+ shutdown.

rel=prev/next

According to Google, using rel=prev/next is not an indexing signal anymore, as announced earlier this year:

“As we evaluated our indexing signals, we decided to retire rel=prev/next. Studies show that users love single-page content, aim for that when possible, but multi-part is also fine for Google Search.”
– Tweeted by Google Webmasters

However, in case it matters for you, Bing says it uses them as hints for page discovery and site structure understanding.

“We’re using these (like most markup) as hints for page discovery and site structure understanding. At this point, we’re not merging pages together in the index based on these and we’re not using prev/next in the ranking model.”
– Frédéric Dubut from Bing

Nevertheless, here are the usage stats we found while looking at millions of index pages:

SELECTOR | COUNT
<link rel="prev" href="*"> | 20,160
<link rel="next" href="*"> | 242,387

That’s pretty much it!

Knowing how the average web page looks, using data from about 8 million index pages, can give us a clearer idea of trends and help us visualize common usage of HTML when it comes to modern and emerging SEO techniques. But this may be a never-ending story: while we have lots of numbers and stats to explore, there are still lots of questions that need answering:

  • We know how structured data is used in the wild now. How will it evolve and how much structured data will be considered enough?
  • Should we expect AMP usage to increase somewhere in the future?
  • How will rel="sponsored" and rel="ugc" change the way we write HTML on a daily basis? When coding external links, besides the target="_blank" and rel="noopener" combo, we now have to consider the rel="sponsored" and rel="ugc" combinations as well.
  • Will we ever learn to always add alt attribute values for images that have a purpose beyond decoration?
  • How many more additional meta tags or attributes will we have to add to a web page to please the search engines? Do we really need the newly announced data-nosnippet HTML attribute? What’s next, data-allowsnippet?

There are other things we would have liked to address as well, like “time-to-first-byte” (TTFB) values, which correlate highly with ranking; I’d highly recommend HTTP Archive for that. They periodically crawl the top sites on the web and record detailed information about almost everything. According to the latest info, they’ve analyzed 4,565,694 unique websites, with complete Lighthouse scores, and have recorded the use of particular technologies like jQuery or WordPress for the whole set. Huge props to Rick Viscomi who does an amazing job as its “steward,” as he likes to call himself.

Performing this large-scale study was a fun ride. We learned a lot and we hope you found the above numbers as interesting as we did. If there is a tag or attribute in particular you would like to see the numbers for, please let me know in the comments below.

Once again, check out the full HTML study results and let me know what you think!


