SEO How-to, Part 9: Diagnosing Crawler Issues


Editor’s note: This post continues our weekly primer in SEO, touching on all the foundational aspects. In the end, you’ll be able to practice SEO more confidently and converse about its challenges and opportunities.

In order to rank in natural search, your site has to first be crawled and indexed. Sites that can’t be accessed by search engine bots will drive neither the traffic nor the sales needed for true natural search performance.

This is the ninth installment in my “SEO How-to” series. Previous installments are:

  • “Part 1: Why Do You Need It?”;
  • “Part 2: Understanding Search Engines”;
  • “Part 3: Staffing and Planning for SEO”;
  • “Part 4: Keyword Research Concepts”;
  • “Part 5: Keyword Research in Action”;
  • “Part 6: Optimizing On-page Elements”;
  • “Part 7: Mapping Keywords to Content”;
  • “Part 8: Architecture and Internal Linking.”

In “Part 2: Understanding Search Engines,” I discussed how search engines crawl and index content for near-instant retrieval when needed for search results. But what happens when they can’t access the content on your site?

Accidentally Limiting the Crawl

It’s one of the worst-case scenarios in search engine optimization. Your company has redesigned its site and suddenly performance tanks. You check your analytics and notice that home page traffic is relatively stable, product traffic is quite a bit lower, and your new category pages are nowhere to be found.

What happened? It could be that, for search engine bots, your category pages are literally nowhere to be found.

Bots have come a long way, and the major engines have declared that their bots can crawl JavaScript. That’s true to an extent. How developers choose to write each piece of JavaScript determines whether search engines can access and understand the content that code produces.


It’s possible for content that renders perfectly on screen for users to be uncrawlable by bots. It’s also possible for content that renders for both users and bots to be essentially orphaned, with no links pointing to it, because the navigation has been coded with noncrawlable technology.
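As a simplified illustration, consider two versions of the same navigation item. The first is a plain anchor with an href that any link-following bot can discover. The second has no href at all; the destination URL exists only inside a click handler, so a bot that does not execute the JavaScript, or never fires the click event, has no link to follow. The URL and markup here are hypothetical.

```html
<!-- Crawlable: a standard anchor whose href a bot can follow. -->
<a href="/category/widgets/">Widgets</a>

<!-- Not reliably crawlable: no href; the destination exists only
     inside a click handler that a bot may never execute. -->
<div class="nav-item" onclick="window.location='/category/widgets/'">
  Widgets
</div>
```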

In some cases, the content itself may not render correctly or at all for most bots. The most advanced bots can render the page as we humans see it in our latest-version browsers, take snapshots of that page in its various states, and compare the different states to extract meaning.

But that relies on several things happening: (a) the most advanced bot getting around to crawling your pages; (b) the bot being able to identify and trigger the necessary elements, such as navigation and video elements, to control the experience; and (c) the bot correctly assessing the meaning and relevance of the different states based on that comparison.

Compare that with the traditional approach of the more common bots, which crawl accessible content via links to assess relevance and authority. That doesn’t mean we must stick with old-school HTML links and text, but we do need to work with developers to make sure content will be crawlable by more than just the most advanced bots.

Testing Crawlability

Unfortunately, the tools publicly available to most SEO practitioners aren’t capable of determining with certainty whether something will be crawlable before launch. Some of the major agencies skilled in technical SEO and development can assist with this issue, but make sure to screen them carefully and ask for references and case studies.

The publicly available tools for diagnosing crawlability issues are not foolproof. They can confirm that content is definitely crawlable, but because they use less sophisticated technology than modern search bots, they can return a negative result even when a search bot might actually be able to access the content.

First, check Google’s cache. This is quick and easy, but it only works on a site that is already live and indexed. In Google’s search bar, type “cache:” before any URL you want to check. For example, you might type cache:www.mysite.com/this-page/. This returns the rendered page that Google has stored for www.mysite.com/this-page/.

Now click “Text-only cache” at the top of the page. This shows you the code that Google has accessed and cached for the page, without the fancy bells and whistles of the rendered page that can trick you into thinking the page is functional for SEO. Look for elements that are missing. Content delivered by vendors and injected into a page is a common culprit, as are navigational links and cross-linking elements. In the text-only cache, only blue underlined words are links, so check that everything that should be a link appears blue and underlined.
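For a quick supplementary check in your own browser, a short console snippet (a rough sketch, not part of Google’s tooling) lists every element that is a true anchor with an href, which is roughly the set of links a link-following crawler can discover:

```javascript
// Run in the browser's developer console on the live page.
// Lists real anchor links (href present), which link-following bots can discover.
document.querySelectorAll('a[href]').forEach(function (a) {
  console.log(a.getAttribute('href'), '-', a.textContent.trim());
});
```

Anything in your main navigation that does not appear in this list is being linked by some other mechanism and deserves a closer look in the text-only cache.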

Using the cache method, if everything looks like it should in the text-only cache, congratulations, your page is crawlable and indexable. If pieces are missing or links aren’t registering as links, you need to dig deeper to determine if there’s really an issue. It could be a problem, or it could be a false negative.

Also try the Fetch as Googlebot tool in Google Search Console. It allows you to fetch any publicly accessible page and shows both the code of the page that Google sees and the rendered page. If you like the result, you can also request that Google index the page. Be careful not to squander these requests; each account has a limited number available. As with the cache method, if both the rendered version and the text version look OK, your page is fine. If not, it could be a real problem or a false negative, so keep investigating.
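You can approximate the “code of the page” half of that comparison outside Search Console by requesting the raw HTML your server returns to a Googlebot user-agent string. The sketch below is only an approximation, not the Fetch tool itself: it does not execute JavaScript, the URL is hypothetical, and some servers respond differently to the Googlebot user agent, so treat the output as a rough first look.

```javascript
// Node.js 18+ (built-in fetch). Prints the raw HTML the server returns when
// the request identifies itself as Googlebot. No JavaScript is executed, so
// content injected client-side will be missing from this output.
const url = process.argv[2] || 'https://www.mysite.com/this-page/'; // hypothetical

fetch(url, {
  headers: {
    'User-Agent':
      'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
  },
})
  .then((response) => response.text())
  .then((html) => console.log(html))
  .catch((error) => console.error('Fetch failed:', error.message));
```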


Next, crawl the site using your favorite crawler. This can also be done in preproduction environments. Try to do this step before you launch a site or a major site change, if possible, so you can resolve major issues or at least know what you’re dealing with when it goes live. I recommend ScreamingFrog SEO Spider or DeepCrawl, both of which have instructions for crawling JavaScript.

Let the crawler run against your site, and again look for areas that are missing. Are all pages of a certain type — usually category, subcategory or filtered navigation pages — missing from the crawl log? What about products? If the category pages aren’t crawlable, then there’s no path to products.
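If you want a very rough, do-it-yourself version of that check, the sketch below crawls a site using only the raw HTML, the way a non-rendering bot would, and counts how many discovered URLs contain the path segments you expect to see. The start URL, path patterns, and page limit are all hypothetical placeholders; it ignores robots.txt and JavaScript entirely, and a dedicated tool such as ScreamingFrog SEO Spider or DeepCrawl is the right choice for a real audit.

```javascript
// Node.js 18+. A minimal raw-HTML crawl: it follows <a href> links only, so
// anything reachable solely through JavaScript navigation will not show up.
const START = 'https://www.mysite.com/';      // hypothetical start URL
const PATTERNS = ['/category/', '/product/']; // hypothetical page-type paths
const LIMIT = 50;                             // keep the sketch small

async function crawl() {
  const seen = new Set([START]);
  const queue = [START];

  while (queue.length > 0 && seen.size <= LIMIT) {
    const url = queue.shift();
    let html = '';
    try {
      html = await (await fetch(url)).text();
    } catch {
      continue; // skip pages that fail to fetch
    }
    // Naive href extraction from the raw markup; no JavaScript is executed.
    for (const match of html.matchAll(/<a[^>]+href="([^"#]+)"/gi)) {
      try {
        const link = new URL(match[1], url).href;
        if (link.startsWith(START) && !seen.has(link)) {
          seen.add(link);
          queue.push(link);
        }
      } catch {
        // ignore malformed hrefs
      }
    }
  }

  for (const pattern of PATTERNS) {
    const count = [...seen].filter((u) => u.includes(pattern)).length;
    console.log(`${pattern}: ${count} URL(s) found in the raw-HTML crawl`);
  }
}

crawl();
```

If a page type you know exists comes back with zero URLs, that is the kind of hole worth investigating before the redesign goes live.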

Using the crawler method, if you don’t see holes in the crawl, then congratulations: your site is crawlable. Search bots are more capable than the crawlers available to us, so if our crawlers can get through a site’s content, so can the actual search bots. If you do see gaps in the crawl, it could be a real issue or a false negative.

If all of the publicly available tests have returned negative results, showing gaps in the content and pages crawled, and your analytics show performance issues that coincide with the launch of a new site or site feature, it’s time to get help. Go to your developers and ask them to investigate, or call an agency or consultant you trust who has experience in this area.

Source: Practical Ecommerce.