
Emoji already existed in documented (not to mention frequent) use when they were added to Unicode. As other commenters noted, the emoji originated in Japan and were encoded in the character sets used by mobile phones. Not adding them was not an option, because it would mean Unicode wouldn't be able to encode Japanese emails and text messages.

What happened in the past few years is that the emoji set was updated, because the Japanese core set missed symbols when considered from a global perspective. In some cases emoji have also been added pro-actively in cooperation with smartphone OS vendors; this is a point of contention, but I think it makes sense in this case.

> We end up with standard that is full of short lived symbols from 2000-2025 […]

That's fine. There is plenty of space in the Unicode standard. It also contains whole swaths of glyphs that were only used in some small country for a few decades, or that are obscure to the point of being almost irrelevant. But that is what Unicode is for. It is a means to encode the texts we produce as humans¹, not a prescriptive table of characters we ought to use.

1: Aliens too I guess, as long as they can provide proof of prior and significant enough use.



> That's fine. There is plenty of space in the Unicode standard.

But there is not infinite time from a font creator. Already it is impossible to know which of the newer characters will be supported by which OS/fonts/etc. Unicode characters are all but useless if they can't be displayed.


That's fine too. There is no need for every font to have support for Byzantine Musical Symbols or Coptic Epact Numbers. Operating Systems tend to have a collection of fonts installed that can handle every common use case. If you depend on glyphs introduced in special use areas you are more than likely aware of the need to install a special font.

Sure, you may need to be aware as a developer that a new character introduced in last month's Unicode update isn't likely to be supported yet, but that's no different from any other technology like CSS.


It would be nice if there was a reference set of glyphs. Like if in order to add a character to Unicode you had to also add a default glyph. Then at least if a font doesn't have that character the default could be displayed instead of a black box.


When I'm running Linux/X, if the current font doesn't know a character it renders the character using another font on the system that does. So then it's just a matter of including a set of fonts that covers every character, and I think a reasonable attempt has been made to do that - although maybe that's based on the fonts I've chosen to install, since that's something that's moderately important to me.
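To make the fallback idea concrete, here is a minimal sketch of per-character font fallback. The font names and coverage tables are invented for illustration; a real system would read coverage from the installed fonts rather than from a hard-coded dict.

```python
# Hypothetical coverage tables: font name -> set of code points it has glyphs for.
FONT_COVERAGE = {
    "PrimarySans": set(range(0x20, 0x7F)),      # printable ASCII only
    "FallbackGreek": set(range(0x370, 0x400)),  # Greek and Coptic block
    "FallbackSymbols": {0x2603, 0x1F600},       # snowman, grinning face
}


def pick_font(ch, preferred, coverage=FONT_COVERAGE):
    """Return the first font covering ch, trying the preferred font first.

    Returns None when no font covers the character, which is the case
    where a renderer would draw a box.
    """
    cp = ord(ch)
    order = [preferred] + [name for name in coverage if name != preferred]
    for name in order:
        if cp in coverage.get(name, ()):
            return name
    return None


def plan_runs(text, preferred):
    """Pair each character with the font chosen for it."""
    return [(ch, pick_font(ch, preferred)) for ch in text]
```

With these made-up tables, plan_runs("aλ", "PrimarySans") assigns "a" to the current font and "λ" to the Greek fallback, and a character nobody covers comes back as None.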

When I'm running Windows (for work) it only renders characters for the current font, and I get a box instead of a character. (Traditionally at least; maybe this has changed? - it's not important to my work so I haven't really investigated, but as a feature it's so useful it's hard to believe it hasn't happened yet.)

But as you can see this isn't a problem with Unicode but the configuration of your system. If it chooses to show you a character from another font, then that's good and convenient.

Note that Windows support for emoji is significantly better. I think Gtk+3/Gnome apps and Firefox have improved significantly this year (wow, lots of color!), but they're still lagging.


> So then it's just a matter of including a set of fonts that covers every character

Have you ever tried to literally do this? You won't succeed.

The Google Noto font family is getting there, but I think they're only caught up with like Unicode 6, and we're on Unicode 11 now. There are some recently-added scripts that you won't find in any font. The Unicode tables render them in proprietary fonts that are obfuscated in the PDF, usually without even any information about how to buy them.


The list of font contributors can be found here: http://www.unicode.org/charts/fonts.html. Choice quote:

> The Unicode Consortium currently uses over 390 different fonts to publish the code charts and figures for The Unicode Standard. The overwhelming majority of these fonts are specially tailored for this purpose and have been donated to the Consortium with a restricted license for use in documenting the standard.

You might be able to reach out to the vendors listed and try to buy the font, but it seems the majority were produced only for Unicode's use.


That doesn't make sense for some complex (shaped) scripts where you have things like zero width joiners.


Are there characters that really can't be drawn individually? Surely there's something better than a black box for pretty much any character.


Yes, I disagree that it doesn't make sense. You're not supposed to see a ZWJ, so the correct render of it is not visible. So even if you include it in your font - well, the correct thing for your font engine to do is to apply the semantics of it correctly and change the render of the surrounding characters.
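For anyone who wants to poke at this, a small sketch using Python's standard unicodedata module shows that ZWJ is classified as a format character (category Cf) rather than something with a shape of its own. The helper names and the sample sequence are my own choices; KA + VIRAMA + ZWJ is commonly described as the way to request a half-form of KA in Devanagari.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def describe(text):
    """List (code point, general category, name) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch),
         unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


def strip_format_chars(text):
    """Drop format characters (category Cf) such as ZWJ.

    This is only a crude stand-in for "has no visible glyph of its own";
    a real shaping engine uses the ZWJ to change neighbouring glyphs
    instead of simply discarding it.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


# Devanagari KA + VIRAMA + ZWJ
sample = "\u0915\u094d" + ZWJ
```

describe(sample) reports the third code point as U+200D with category Cf, and strip_format_chars(sample) leaves only the two Devanagari code points.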


Yes, but there are glyphs, especially in Indic languages, that have no representation in a presentation form in Unicode. If you are familiar with Arabic, each glyph has forms - isolated, beginning, medial, and final. Your "typical" Arabic string is composed of isolated code points, they go through a text shaper, and out come glyph indices that also have a Unicode code point associated with them. You do the same thing with an Indic language, except imagine that the glyph indices that come out have no corresponding Unicode code points. It's very surprising at first and some software can't handle these unassigned or "hidden" glyphs.
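The Arabic half of this can be checked from Python, because the positional forms do have code points (in the Arabic Presentation Forms-B block) and their compatibility decompositions name the position. This is only a sketch with a helper name I made up, and it cannot show the Indic case, since by definition those shaped glyphs have no code points to look up.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the code point normally found in text

# Presentation forms of BEH, U+FE8F through U+FE92.
BEH_FORMS = ["\ufe8f", "\ufe90", "\ufe91", "\ufe92"]


def positional_form(ch):
    """Return (position, base character) for a presentation form.

    Reads the compatibility decomposition, which looks like
    "<initial> 0628". Returns None for characters without a tagged
    decomposition, such as the plain letter itself.
    """
    decomp = unicodedata.decomposition(ch)
    if not decomp.startswith("<"):
        return None
    tag, *code_points = decomp.split()
    base = "".join(chr(int(cp, 16)) for cp in code_points)
    return tag.strip("<>"), base


forms = [positional_form(ch) for ch in BEH_FORMS]
```

Here forms contains the isolated, final, initial and medial forms in that order, each mapping back to the same base letter, while positional_form(BEH) is None.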

So my point is: how much value would there be in requiring a representation of a small fraction of the glyphs (only those with Unicode code points, many of which are ZWJs) in a script in a standards document when potentially hundreds of forms are necessary after shaping to represent the language?


That’s a big problem with the standard in my opinion: standardizing new characters requires submitting a font for them, but this font does not need to be open. So you have standardized characters that cannot be displayed without someone else implementing another font. That’s a lot of duplicated effort.


I'm not sure having over 100k glyphs in random fonts solves anything. I'm also not convinced font samples are something a character encoding should care about or try to standardize.

For practical use, Google's Noto font is under an open font license and covers so many glyphs its collection is split into multiple OTF files because of the 65k glyph limit per font. The goal of Noto is much the same as the one you propose - to have an open representation of every character (and in a consistent font).


> covers so many glyphs its collection is split into multiple OTF files because of the 65k glyph limit per font.

Not because of that. Out of the hundred Noto files, several are the same CJK characters with different country and Sans/Serif/Mono styles, and everything else combined would fit into a single file.


Even removing CJK, there are still more than 65,535 glyphs necessary to represent everything in the SMP and BMP less CJK. If you look in BMP without CJK, surrogates, and private use areas, you are looking at around 27,000 code points. If you look at the SMP (supplementary multilingual plane), there are around 90 blocks of 4096 code points assigned. That total is well over 65,535. And keep in mind many scripts also require unassigned glyphs which are not Unicode code points themselves. These unassigned glyphs count against the 65,535 TTF limit, though.

https://en.wikipedia.org/wiki/Plane_%28Unicode%29


Sure, it would take two files to do everything outside CJK. What I said is still true, that everything covered by the non-CJK Noto fonts would fit in a single file (50k glyphs total).

My point is that you only need 3-4 files to cover Unicode. Noto is not split into a hundred different files for that reason, but for other reasons.

(Also the used space on the SMP is roughly 90 * 256.)
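The two estimates are easy to compare with a few lines of arithmetic. The figures are the rough ones quoted in this subthread, and the sketch assumes one glyph per code point, which undercounts shaped scripts that need extra unencoded glyphs.

```python
TTF_GLYPH_LIMIT = 65_535   # per-font glyph limit cited in the thread
BMP_NON_CJK = 27_000       # rough BMP count without CJK, surrogates, private use
SMP_BLOCKS = 90            # rough number of assigned SMP blocks


def files_needed(glyphs, limit=TTF_GLYPH_LIMIT):
    """Minimum number of font files needed to hold this many glyphs."""
    return -(-glyphs // limit)  # ceiling division


# Parent comment's estimate: blocks of 4096 code points.
total_4096 = BMP_NON_CJK + SMP_BLOCKS * 4096

# Reply's correction: blocks of roughly 256 code points.
total_256 = BMP_NON_CJK + SMP_BLOCKS * 256
```

With 256 per block the total comes to 50,040, which fits under the limit in a single file; with 4096 per block it comes to 395,640, which would not.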


People are going to use popular icons whether they're in unicode or not. The burden of making it work is going to exist either way. And something like an emoji doesn't need to be in a general-purpose font.



