Unicode, UTF-8 and Google searching

Unicode and Google searches

My memory sometimes gives up on me. I recently could not remember what the ⎋ symbol stands for.
(It is the glyph for the Escape key on Macintosh keyboards.)
It turns out that Google will also attempt to interpret it as ⎋, according to this page.

A search with the query string “what is ⎋ used for” returns 4,350,000,000 results, and that only for Greek, German and English pages.

Take a look at the relevant result page to see what I mean. What this basically means is that the whole application framework does not have real Unicode support.

If you just enter the non-ASCII glyph as is as a search term, Google returns 0 results.

The results improve a little if we mix ASCII and non-ASCII characters in our query string.
Example: Bill Μπίλ Γκάτες Gates

RESULTS: 1 according to this page.

What could possibly be wrong with Google's search results, you ask?

Google's preference pane recommends selecting the option “Search for pages written in any language”, so I tried switching back to that default option instead.
The result remains the same, though.
How about Google's specialized University search?

The query remains the same for our experiment of course, which is: .

Things change a little: many results are returned, all from different universities, though surprisingly they are completely irrelevant to our search term.
It even goes so far as to place itself among the results again, with a good page rank, demonstrating good SEO practices.
What if I wanted to look up a non-ASCII symbol from my fully UTF-8 capable platform and have Google search for it as is? Escape it? Enclose it in CDATA blocks?

How do search engines deal with mixed data that contains non-ASCII strings? This looks like a complex matter, similar to form submissions with non-ASCII data and I18N.
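One common answer to the “escape it?” question is percent-encoding: the query is encoded as UTF-8 and every non-ASCII byte is written as %XX in the URL. The sketch below only illustrates that idea. The helper names build_search_url and decode_query are made up for this example, and the endpoint with its "q" parameter is an assumption about the request format; it says nothing about how Google processes the query internally.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

# Assumption: a search endpoint that takes the query in a "q" parameter.
SEARCH_BASE = "https://www.google.com/search"


def build_search_url(query: str, base: str = SEARCH_BASE) -> str:
    """Return a search URL with the query percent-encoded as UTF-8."""
    # urlencode() encodes str values as UTF-8 by default and writes spaces as "+".
    return base + "?" + urlencode({"q": query})


def decode_query(url: str) -> str:
    """Recover the original query string from a URL built above."""
    return parse_qs(urlsplit(url).query)["q"][0]


if __name__ == "__main__":
    # The three UTF-8 bytes of the escape glyph, percent-encoded: %E2%8E%8B
    print(quote("\u238b"))
    for query in ("what is \u238b used for", "Bill Μπίλ Γκάτες Gates"):
        url = build_search_url(query)
        print(url)
        # Decoding the URL gives back the exact mixed ASCII / non-ASCII query.
        assert decode_query(url) == query
```

The round trip shows that nothing is lost on the way to the server; whether the search engine then indexes or matches the symbol is a separate question.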

For a more in-depth read on these matters and how the data is processed, the following document might provide some insight.
(I haven’t read it at all, but it can give an idea.)

In Unicode, according to the Mac OS Character Palette, it belongs to the Miscellaneous Technical symbol range of the Basic Multilingual Plane.

Hopefully we can make our life a lot easier by using the right tool for the right job.
This will quickly reveal the desired answer to our simple question.
⎋ is BROKEN CIRCLE WITH NORTHWEST ARROW.
Selecting the misc option returns the following details:

• ISO 10646 Comment: escape
• Extended Properties: Pattern_Syntax
• Derived Core Properties: Grapheme_Base
• Script: Common
• Block: Miscellaneous Technical
• Arabic/Syriac Shaping:
    Schematic Name:
    Joining Type: U (Non_Joining)
    Joining Group:
• Designated in Unicode 3.0

And according to Unicode Checker, its code point is U+238B (hexadecimal).
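The same lookup can be done without a dedicated tool. Here is a minimal sketch using Python's standard unicodedata module; the describe helper and the MISC_TECHNICAL constant are names invented for this example, and the name you get back depends on the Unicode database bundled with your Python version.

```python
import unicodedata

# The Miscellaneous Technical block spans U+2300 through U+23FF.
MISC_TECHNICAL = range(0x2300, 0x2400)


def describe(ch: str) -> dict:
    """Collect a few basic facts about a single character."""
    cp = ord(ch)
    return {
        "name": unicodedata.name(ch),
        "code_point": f"U+{cp:04X}",
        # The bytes the character occupies once encoded as UTF-8.
        "utf8_bytes": " ".join(f"{b:02X}" for b in ch.encode("utf-8")),
        # The Basic Multilingual Plane covers U+0000 through U+FFFF.
        "in_bmp": cp <= 0xFFFF,
        "in_misc_technical": cp in MISC_TECHNICAL,
    }


if __name__ == "__main__":
    for key, value in describe("\u238b").items():
        print(f"{key}: {value}")
```

Note the difference between the code point (U+238B, a Unicode property) and its UTF-8 encoding (the three bytes E2 8E 8B): blocks and planes belong to Unicode itself, not to any particular encoding.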

Now we know.

Update: Lately discovered. Happy searching then.



Posted ·2006-04-29


© 2006-2008 marios buttner
