Protecting staging environments: the ultimate guide
Prevent embarrassing situations by making sure you properly protect your staging environments.
The best way to go about this is using HTTP authentication.
Is your staging environment already indexed by search engines? No sweat. With this guide, you'll learn how to quickly and effectively reverse that.
We see it happen all the time: staging or development environments containing websites that are works in progress, left available for all the world to see. And often they're getting indexed by search engines too!
Don't believe me?
Check out this query – inspired by Peter Nikolow's tweet (follow him on Twitter, he's both funny and smart!).
Why having accessible staging environments is a bad thing
Or rather, a doubly-bad thing: bad from the viewpoints of both business and SEO.
The business viewpoint
Do you want others to see your "lorem ipsum" content and laugh, or even—god forbid—read about a huge announcement such as an acquisition or rebranding that should have been kept secret until the new site was launched?
It's unprofessional, and above all it's not very smart. It's even a sales tactic for certain agencies: they look for other agencies that are making these mistakes and then pitch to their clients, leveraging the embarrassing situation.
The SEO viewpoint
Besides the embarrassment, having a staging environment that's indexed by search engines can lead to duplicate content issues when the staging environment and the production environment are highly similar.
Having accessible and indexed staging environments is totally unnecessary, because it's easy to prevent. In this article we'll show you how to do it, what methods you can use, and what to do if your staging environment is already indexed.
From here forward, when we refer to staging environments, we may be referring to both development and staging environments.
What are development and staging environments?
When you're working on a completely new website or new functionality, you don't just do that on your live website (often called your "production environment"), because websites are easy to break. It's a best practice to work with different, separate environments where you develop and test new functionality.
So what different environments are there besides a production environment?
- Development environment: this is where developers initially test their code. Often they do this on their local machines, so if that's the case, there isn't any danger at all of this environment being accessible to others and getting indexed by search engines. If it's not kept locally, but instead for instance on a dev.example.com subdomain, there is a risk of it being accessible to others and being indexed.
- Staging environment (often also called the "test environment"): this is where releases are staged and new functionality is tested before a release. New content is published here so it can be checked to ensure it looks as intended. Staging environments often aren't run locally: different team members need to be able to easily access them, so they usually run on a subdomain or a separate domain.
Since you're reading up on protecting staging environments, it's likely you're going to be doing a website migration soon. To migrate flawlessly, check our website migration guide to make sure you aren't forgetting any crucial steps in the migration process!
Security through obscurity is not a feasible strategy
Not telling anyone about your "secret" staging environment is a case of "security through obscurity". It's not a feasible strategy, especially not as the only layer of protection.
What if someone accidentally publishes a link to the staging environment? Or pushes some code to production that accidentally includes canonical or hreflang references to the staging environment?
Not only does this create issues in your production environment, it also leads to search engines picking up your staging environment's scent. And they will queue it up for crawling unless you make it impossible for them to access the staging environment, or give them rules of engagement to follow.
How to protect your development and staging environments
Now it's clear why you need to protect your development and staging environments. But how do you do it? There are multiple ways to go about this, but which one is best?
We'll discuss the pros and cons of every method, taking into account:
- User-friendliness: the degree to which the method doesn't add extra inconvenience.
- Third-party access: the degree to which the method prevents third parties from accessing an environment.
- SEO-friendliness: the degree to which the method keeps search engines from indexing an environment.
- Monitoring-friendliness: the degree to which the method lets you monitor the protected environments for SEO purposes.
- Low risk of human error: the degree to which the method avoids human errors that impact SEO.
Method | User-friendliness | Third-party access | SEO-friendliness | Monitoring-friendliness | Low risk of human error |
---|---|---|---|---|---|
HTTP auth | | | | | |
VPN | | | | | |
Robots directives | | | | | |
Robots.txt | | | | | |
Canonical links | | | | | |
Whitelisting specific user-agents | | | | | |
Method 1: HTTP authentication - your best choice 🏆
The best way to prevent both users and search engines from gaining access to your development and staging environments is to use HTTP authentication. Be sure to implement it over HTTPS, because you don't want usernames and passwords to travel over the wire in plaintext.
We recommend whitelisting the IP addresses at your office, and providing external parties and remote team members access via a username/password combination.
This way search engines can't access anything, and you have total control over who can see what. You can prepare your staging environment with the same robots.txt that you'll be using on the production environment, as well as the correct robots directives and canonicals. This lets you gain a representative picture of your staging environment when you're monitoring it for issues and changes prior to launching.
Another benefit of this is that it's not prone to developers forgetting to publish the right robots.txt, robots directives, and canonicals on the production environment.
This is a much better approach than using robots.txt and/or robots noindex directives and canonical links, because those don't prevent other people from accessing your staging environment, and search engines will not always honor such directives.
What's more, when using HTTP authentication it's still possible to use Google's testing tools, such as the AMP, Mobile-Friendly, and Structured Data testing tools. Just set up a tunnel.
How do I set up HTTP authentication?
Below you'll find some resources on how to set up HTTP Authentication on Apache, nginx, and IIS:
- Setting up HTTP Authentication with Apache and a handy HTPasswd Generator
- Setting up HTTP Authentication with nginx
- Setting up HTTP Authentication with IIS: basic authentication and IP restrictions
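
Whatever server you use, the access decision recommended above boils down to the same logic: let whitelisted office IP addresses straight through, and ask everyone else for a username and password. Here's a minimal Python sketch of that decision, not a production implementation. The network range, the username, and the password are made-up values for illustration, and a real setup would store hashed passwords (as htpasswd does) and leave this check to the web server itself.

```python
import base64
import hmac
import ipaddress
from typing import Optional

# Hypothetical values for illustration only. A real server would store
# hashed passwords (for example in an htpasswd file), never plaintext.
OFFICE_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]
USERS = {"agency": "s3cret-passphrase"}


def is_office_ip(remote_addr: str) -> bool:
    """Return True if the client address falls inside a whitelisted network."""
    address = ipaddress.ip_address(remote_addr)
    return any(address in network for network in OFFICE_NETWORKS)


def has_valid_credentials(authorization_header: Optional[str]) -> bool:
    """Check an 'Authorization: Basic <base64>' header against USERS."""
    if not authorization_header or not authorization_header.startswith("Basic "):
        return False
    try:
        encoded = authorization_header[len("Basic "):]
        decoded = base64.b64decode(encoded, validate=True).decode("utf-8")
    except ValueError:  # covers malformed base64 and invalid UTF-8
        return False
    username, separator, password = decoded.partition(":")
    expected = USERS.get(username)
    if not separator or expected is None:
        return False
    # Constant-time comparison avoids leaking information through timing.
    return hmac.compare_digest(password.encode(), expected.encode())


def staging_status(remote_addr: str, authorization_header: Optional[str]) -> int:
    """Return 200 for office IPs or valid credentials, otherwise 401."""
    if is_office_ip(remote_addr) or has_valid_credentials(authorization_header):
        return 200
    return 401
```

A crawler that sends no credentials from an unknown IP address gets a 401 response, so there's nothing for it to crawl or index.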
Method 2: VPN access
VPN stands for "virtual private network." You basically connect your local machine so as to become part of the company network. And now that you're part of the company network, you can access the staging environment. Anyone who's not part of the network cannot access it. This means that neither third parties nor search engines can access the staging environment.
Having access through a VPN offers most of the benefits of HTTP authentication. However, there's one big drawback: SEO monitoring solutions that aren't running locally may not work out of the box, or at all. Not being able to track your development team's progress is troublesome, and it becomes a real problem when you're dealing with very large websites.
Method 3: Robots directives
Robots directives are used to communicate preferences surrounding crawling and indexing. You can for instance ask search engines not to index certain pages, and not to follow (certain) links.
You can define robots directives in a page's HTTP header (the X-Robots-Tag header), or via the meta robots directive in its <head> section. Because you'll have other content types besides just pages on your staging environment, it's recommended that you use the X-Robots-Tag header to make sure that PDF files, for instance, don't get indexed.
Robots directives, like the name implies, are meant for robots ("crawlers"). They don't prevent third-party access. They do send a moderately strong signal to search engines not to index pages. I say "moderately" because search engines can still decide to ignore the robots directives and index your pages. It's also not a monitoring-friendly solution, as it may, similar to robots.txt, lead to false positives being reported by SEO tools.
On top of that, there's a huge risk of human error, as staging robots directives are often accidentally carried over into the production environment.
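
To make the X-Robots-Tag approach concrete, here's a minimal sketch, assuming the staging site is a Python WSGI application. The wrapper adds the header to every response, including non-HTML files such as PDFs. The names add_noindex_header and pdf_app are made up for this example.

```python
def add_noindex_header(app):
    """Wrap a WSGI app so every response carries X-Robots-Tag: noindex, nofollow."""

    def wrapped(environ, start_response):
        def start_with_header(status, headers, exc_info=None):
            # Drop any existing X-Robots-Tag header, then add our own.
            headers = [(k, v) for k, v in headers if k.lower() != "x-robots-tag"]
            headers.append(("X-Robots-Tag", "noindex, nofollow"))
            return start_response(status, headers, exc_info)

        return app(environ, start_with_header)

    return wrapped


def pdf_app(environ, start_response):
    """Hypothetical stand-in for a staging app that serves a PDF file."""
    start_response("200 OK", [("Content-Type", "application/pdf")])
    return [b"%PDF-1.4 ..."]


staging_app = add_noindex_header(pdf_app)
```

Because of the human-error risk mentioned above, a wrapper like this should only ever be enabled in the staging configuration, never in the production one.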
Method 4: Robots.txt
The robots.txt file states the rules of engagement for crawlers, so by using robots.txt, you can ask search engines to keep out of your staging environment. The most common way to do this is by including the following contents in robots.txt:
    User-agent: *
    Disallow: /
This prevents search engine crawlers from crawling the site, but they may still index it if they find links to it, leading to listings that show the URL without a proper description.
Some people include the unofficial Noindex directive in their robots.txt. We don't recommend doing this: because it's an unofficial directive, it's an even less reliable way to keep your staging environment out of search engines than the Disallow directive.
Your robots.txt doesn't offer any actual protection against third-party access to the site, and it throws off SEO monitoring tools as well, potentially leading to false positives. Plus, you're creating a huge risk of human error: here once again, the robots.txt from the staging environment is often accidentally carried over into the production environment.
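
One way to guard against that carry-over is an automated check before each release. The sketch below uses the robots.txt parser from Python's standard library to detect a blanket Disallow. The function name blocks_all_crawling and the sample file contents are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser

# Sample contents: a blanket block as used on staging, and a hypothetical
# production file that only blocks one directory.
STAGING_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

PRODUCTION_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""


def blocks_all_crawling(robots_txt: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt forbids crawling the homepage."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


if __name__ == "__main__":
    # A release script could fail when the production file blocks everything.
    if blocks_all_crawling(PRODUCTION_ROBOTS_TXT):
        raise SystemExit("robots.txt blocks all crawling: staging file carried over?")
```

Note that this only inspects the rules in the file. As described above, it says nothing about whether the URLs are indexed or accessible.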
Method 5: Canonical links
The canonical link informs search engines of the canonical version of a page. If the staging environment is referencing the production environment, all signals should be consolidated with the production environment.
Otherwise, canonical links resemble robots directives, especially in their downsides:
- They still let third parties access the staging environment.
- They're not a monitoring-friendly solution, as they may lead to false positives being reported by SEO tools.
- There's a risk of human error, as canonical directives from staging are sometimes accidentally carried over into the production environment.
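
To verify which host a page's canonical link points to, on staging or on production, you can extract it from the HTML. Here's a minimal sketch using only Python's standard library. The class and function names are made up for this example, and the sample page is hypothetical.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urlparse


class CanonicalFinder(HTMLParser):
    """Collect the href of the first <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values and self.canonical is None:
            self.canonical = attributes.get("href")


def canonical_host(html: str) -> Optional[str]:
    """Return the hostname the canonical link points to, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    if finder.canonical is None:
        return None
    return urlparse(finder.canonical).hostname


# Hypothetical staging page that references the production environment.
staging_page = (
    '<html><head><link rel="canonical" '
    'href="https://www.example.com/pricing/"></head></html>'
)
print(canonical_host(staging_page))
```

Running the same check against production pages and comparing the result with the production hostname would catch canonicals that accidentally point to staging.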
Method 6: Whitelisting specific user agents
The whitelisting of specific user agents for access to a staging environment can be used to allow SEO specialists to monitor a staging environment, as long as their SEO tooling supports setting custom user agents. They could create a made-up user agent and use that, while blocking all other user agents (including browsers).
But this isn't a very user-friendly approach, because manual verification through your browser is made harder. It's not a very secure approach either, since any client can send whatever user agent it likes: when third parties know you're working for or at company X, and they're aware of your user agent (perhaps because they're a disgruntled customer), they may be able to gain access to the staging environment.
How can you find out if your staging environment is being indexed?
Here are three common ways to find out if your staging environment is being indexed:
Option 1: site query
If you know that your staging environment is running on a subdomain, you can try a site query such as: site:example.com -inurl:www
This query returns all the Google-indexed pages for the domain example.com except the ones containing "www".
Here's a link to an example query
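
If you need to run this check for several domains, you can generate the search URLs yourself. Here's a small sketch. The function name site_query_url is made up for this example, and the URL is meant to be opened manually in a browser, not fetched by a script.

```python
from urllib.parse import urlencode


def site_query_url(domain: str, excluded: str = "www") -> str:
    """Build a Google search URL for 'site:<domain> -inurl:<excluded>'."""
    query = f"site:{domain} -inurl:{excluded}"
    return "https://www.google.com/search?" + urlencode({"q": query})


print(site_query_url("example.com"))
# prints "https://www.google.com/search?q=site%3Aexample.com+-inurl%3Awww"
```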
Option 2: Google Analytics
If you don't know the URL of your staging environment, you can try checking in Google Analytics:
- Navigate to Audience > Technology and choose Network.
- Select Hostname as the Primary Dimension.
- Look for hostnames that have a different domain, or contain subdomains such as staging, test or dev.
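
If the hostname report is long, the last step can be automated on an exported list of hostnames. The following sketch flags hostnames on a different domain and hostnames containing a staging-like label. The label set and the sample hostnames are assumptions for illustration, so adjust them to your own naming conventions.

```python
# Subdomain labels that commonly indicate a non-production environment.
STAGING_LABELS = {"staging", "test", "dev"}


def suspicious_hostnames(hostnames, production_domain):
    """Return hostnames on another domain or containing a staging-like label."""
    flagged = []
    for hostname in hostnames:
        host = hostname.lower().rstrip(".")
        labels = host.split(".")
        other_domain = (
            host != production_domain
            and not host.endswith("." + production_domain)
        )
        staging_label = any(label in STAGING_LABELS for label in labels)
        if other_domain or staging_label:
            flagged.append(hostname)
    return flagged


# Hypothetical export of the Hostname dimension.
report = ["www.example.com", "staging.example.com", "example-agency.net", "dev.example.com"]
print(suspicious_hostnames(report, "example.com"))
# prints ['staging.example.com', 'example-agency.net', 'dev.example.com']
```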
Option 3: Google Search Console
With the consolidation of properties in Google Search Console, it's now much easier to spot pages that shouldn't be indexed.
Whether your staging environment is set up on a separate domain, a subdomain or a subfolder: if you've verified the domain, you'll be able to see all pages that are indexed, and all queries that your domain is ranking for, right in the overviews you're used to looking at:
- Performance > Queries
- Performance > Pages
- Index > Coverage
Our special thanks goes out to Rhea Drysdale and Martijn Oud for mentioning this to us!
Getting your already indexed staging environment removed from the index
Uh-oh. Your staging environment has already been indexed by search engines, and you're the one who has to fix it. Well, the good news is: if you follow the steps below, you're good. And they're easy.
Step 1: hide search results
Verify the staging environment in webmaster tools such as Google Search Console and Bing Webmaster Tools, and request URL removal (see Google's documentation and Bing's documentation on this). For Google, this request is often granted within hours (Bing takes a little longer), and then your staging environment won't show up in any search results. But here's the catch: it's still in Google's and Bing's indexes; it's just not shown. In Google's case, the staging environment is only hidden for 90 days. So within this timeframe, you need to make sure to request removal of your pages from search engines' indexes in the right way: via the robots noindex directive.
Step 2: applying the noindex directive and getting pages recrawled
Make sure you apply the robots noindex directive on every page in your staging environment. To speed up the process of search engines recrawling these pages, submit an XML sitemap. Now watch your server logs for search engine crawlers' requests for your previously indexed (and now "noindexed") pages, to make sure they've "gotten the message."
In most cases, these 90 days are enough time to signal to search engines that they should remove the staging-environment pages from their index. But if they aren't, just rinse and repeat.
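
The log check from step 2 can be sketched as follows, assuming your server writes access logs in the common "combined" format. The regular expression, the crawler tokens, and the sample log lines are assumptions for illustration. Keep in mind that user agents can be spoofed, so this is a quick check, not proof that the request came from a real search engine crawler.

```python
import re

# Matches the request, status, and final quoted field (the user agent)
# of a log line in the "combined" format.
LOG_PATTERN = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) .*"(?P<agent>[^"]*)"$'
)
CRAWLER_TOKENS = ("googlebot", "bingbot")


def recrawled_paths(log_lines, noindexed_paths):
    """Return the noindexed paths that a search engine crawler has requested."""
    seen = set()
    for line in log_lines:
        match = LOG_PATTERN.search(line.strip())
        if not match:
            continue
        agent = match["agent"].lower()
        if not any(token in agent for token in CRAWLER_TOKENS):
            continue
        if match["path"] in noindexed_paths:
            seen.add(match["path"])
    return seen


# Hypothetical log lines: one crawler request and one regular visitor.
sample_lines = [
    '66.249.66.1 - - [10/Oct/2023:13:55:36 +0000] "GET /pricing/ HTTP/1.1" 200 2326 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '198.51.100.4 - - [10/Oct/2023:13:56:01 +0000] "GET /about/ HTTP/1.1" 200 1500 '
    '"-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]
print(recrawled_paths(sample_lines, {"/pricing/", "/about/"}))
# prints {'/pricing/'}
```

Any path that's still missing from the result hasn't been recrawled yet, so the crawler can't have seen its noindex directive.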
Once it's all done, protect the staging environment using HTTP authentication to make sure this doesn't happen again, and remove the XML sitemap from Google Search Console and Bing Webmaster Tools.
Here are some other useful resources on protecting staging environments:
- How to keep your staging or development site out of the index by Patrick Stox
- Protect your Staging Environments by Barry Adams
- Devs. by Dean Cruddace