How to remove URL(s) from Google

- Removing URL(s) from a search engine index.
- How to prevent incorrect URL crawls.
This topic comes up a lot in forums across the WWW, but the advice given is
rarely accurate. Today we will clear up much of the confusion about the best
way to remove your URL(s) from search engines, specifically Google. We are
going to go over the most solid ways to do it, and how to prevent incorrect
URL crawls.
Why would you want to remove a URL from Google? Here are a few practical
examples:
1) Say your online forum has been spammed by Ukrainian porn bots and it
shows pornographic advertisements as far as the eye can see. This happens all
the time and is not going away, which makes it a good example. Since you do
not know WHEN Google is going to crawl your directories again, it is a good
idea to redirect the spiders until the problem can be fixed.
2) (Preventative Measure) You are developing a new website and it
is not quite ready to be published to the WWW yet. This is a good reason to
put measures in place that keep Google from crawling your website until you
are ready. It is not good enough to just stick everything in a subdirectory
and expect Google not to find it, because it will.
3) By mistake you have allowed various URLs from your website to be
indexed when the links to them should have carried rel="nofollow".
Here is an example of a URL that leads to the printer-friendly version of a page: http://www.involutionmedia.com/kbase/index.php/article/printer/involution
Go ahead and click on it and you will see the problem.
BEST PREVENTATIVE MEASURES
First, let’s list some preventative measures to ensure that you will never have to actually go through the process of removing a URL.
1) .htaccess (password protect)
This is probably the most robust method if you want to be 100% sure that search engine robots do not visit a specified directory. Clearly, if a robot cannot get to your files then it cannot cache anything. The .htaccess file is an Apache feature; if you do not have one on your web server, all you have to do is create a plain text file and save it as .htaccess.
In your .htaccess file you can password protect a specified directory. No robot is going to try to guess a password; it will simply leave. Once your problem is fixed, remove the rule from the .htaccess file.
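The actual .htaccess file typically contains four Apache directives: AuthType Basic, AuthName (the label shown in the login prompt), AuthUserFile (the full server path to the password file), and Require valid-user.
The password file, conventionally named .htpasswd, holds one line per user in the form username:hashed-password.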
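As a rough illustration of what those two files contain, here is a minimal Python sketch that builds their text. The function names, the realm label, the path /home/example/.htpasswd, and the user "alice" are all made up for this example. The password line uses Apache's {SHA} scheme only because it needs nothing beyond the standard library; Apache's own htpasswd utility can produce stronger bcrypt hashes and is the better choice on a real server.

```python
import base64
import hashlib


def htaccess_text(password_file, realm="Restricted Area"):
    """Return the text of an .htaccess file that turns on HTTP Basic auth.

    password_file is the full server path to the password file
    (a hypothetical path in this example).
    """
    return "\n".join([
        "AuthType Basic",
        f'AuthName "{realm}"',
        f"AuthUserFile {password_file}",
        "Require valid-user",
        "",
    ])


def htpasswd_entry(user, password):
    """Return one password-file line using Apache's {SHA} scheme.

    The stored value is the base64-encoded SHA-1 digest of the password.
    This scheme is weak and is used here for illustration only.
    """
    digest = hashlib.sha1(password.encode("utf-8")).digest()
    return f"{user}:{{SHA}}{base64.b64encode(digest).decode('ascii')}"


if __name__ == "__main__":
    print(htaccess_text("/home/example/.htpasswd"))
    print(htpasswd_entry("alice", "s3cret"))
```

Save the first output as .htaccess inside the directory you want to protect, and the second as the password file named in AuthUserFile, ideally somewhere outside the public web root.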
2) robots.txt
Web site owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.
For quite some time this has been the industry standard for telling search engine robots which locations they may or may not access. A well-behaved robot requests the robots.txt file before it crawls anything else on the site. For example:
User-agent: *
Disallow: /bad_directory/
The "User-agent: *" line means this section applies to all robots. The "Disallow: /bad_directory/" line tells the robot that it should not visit the directory bad_directory.
If you want to disallow your entire website then just use a single / like this:
Disallow: /
There are two important considerations when using /robots.txt:
- Robots can ignore your /robots.txt. Malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention to it.
- The /robots.txt file is publicly available. Anyone can see which sections of your server you don't want robots to visit.
This method is not foolproof, but it works well with most robots.
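Before publishing a robots.txt file it can help to check that the rules do what you expect. The sketch below uses Python's standard urllib.robotparser module to test the Disallow rule shown above against two URLs. The helper name is_allowed and the www.example.com addresses are invented for this example, and Python's parser only approximates how a particular search engine reads the file.

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the example above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /bad_directory/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network
    # request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.example.com/bad_directory/page.html",
        "http://www.example.com/index.html",
    ):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, "Googlebot", url) else "blocked"
        print(url, verdict)
```

Passing "User-agent: *" followed by "Disallow: /" to the same function should report every URL on the site as blocked.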
THREE METHODS TO REMOVE URL(S) FROM A SEARCH ENGINE'S INDEX
1) <meta name="robots" content="noindex,nofollow" />
Adding this between the <head></head> tags tells robots not to index the page
and not to follow its links. The robot has to be able to fetch the page in
order to see the tag, so do not block the same page in robots.txt. Some say
that this is not totally reliable, but my personal experience has been very
good. I have always had 100% success in telling a robot to go away with this
method.
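To confirm that a page really carries the tag, you can parse its HTML and look for it. This is a minimal sketch using Python's standard html.parser module; the class name RobotsMetaParser and the function robots_directives are made up for this example, and the sample page is a stand-in for your own HTML.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also arrive here.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html):
    """Return the set of robots directives declared in the given HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex,nofollow" /></head></html>'
    print(sorted(robots_directives(page)))
```

An empty result means the page has no robots meta tag, so robots are free to index it.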
2) Google Webmaster Tools (URL REMOVAL TOOL)
With Google Webmaster Tools you can use the URL removal tool to remove a single URL. This method has proven to work very well.
3) rel="nofollow"
"Nofollow" provides a way for webmasters to tell search engines "Don't follow links on this page" or "Don't follow this specific link."
Originally, the nofollow attribute appeared in the page-level meta tag, and
instructed search engines not to follow (i.e., crawl) any outgoing links on
the page. For example:
<meta name="robots" content="nofollow" />
Before nofollow was used on individual links, preventing robots from following
individual links on a page required a great deal of effort (for example,
redirecting the link to a URL blocked in robots.txt). That's why the nofollow
attribute value of the rel attribute was created. This gives webmasters more
granular control: instead of telling search engines and bots not to follow any
links on the page, it lets you easily instruct robots not to crawl a specific
link. For example:
<a href="signin.php" rel="nofollow">sign in</a>
The use of this is clear: you can specify individual links instead of an entire page.
NOTE: You can also use the robots.txt file to disallow individual URLs, not just directories.
URLs can also be removed from Yahoo via the Yahoo removal tool.
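To see which links on a page are marked rel="nofollow" and which are not, a short script can list them. The sketch below again relies on Python's standard html.parser module; the names LinkAuditor and audit_links are invented for this example, and about.php is a hypothetical second link added for contrast.

```python
from html.parser import HTMLParser


class LinkAuditor(HTMLParser):
    """Sort the links on a page by whether they carry rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollowed = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if href is None:
            return  # an anchor without a target is not a link
        # rel may hold several space-separated values, e.g. "nofollow noopener".
        rel_values = (attributes.get("rel") or "").lower().split()
        if "nofollow" in rel_values:
            self.nofollowed.append(href)
        else:
            self.followed.append(href)


def audit_links(html):
    """Return (followed, nofollowed) lists of href values found in html."""
    auditor = LinkAuditor()
    auditor.feed(html)
    auditor.close()
    return auditor.followed, auditor.nofollowed


if __name__ == "__main__":
    page = (
        '<a href="signin.php" rel="nofollow">sign in</a> '
        '<a href="about.php">about</a>'
    )
    followed, nofollowed = audit_links(page)
    print("followed:", followed)
    print("nofollowed:", nofollowed)
```

Any printer-friendly or sign-in link that shows up in the followed list is a candidate for rel="nofollow".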





