
Python has str.casefold() for caseless comparisons that handles the example in the OP[1]:

> str.casefold()

> Return a casefolded copy of the string. Casefolded strings may be used for caseless matching.

> Casefolding is similar to lowercasing but more aggressive because it is intended to remove all case distinctions in a string. For example, the German lowercase letter 'ß' is equivalent to "ss". Since it is already lowercase, lower() would do nothing to 'ß'; casefold() converts it to "ss".

> The casefolding algorithm is described in section 3.13 of the Unicode Standard.

[1] https://docs.python.org/3/library/stdtypes.html#str.casefold
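To make the difference between lower() and casefold() from the quoted docs concrete, here is a small sketch. The word "Straße" is just an illustrative choice, not something from the OP:

```python
# Illustrative word only; any string containing 'ß' behaves the same way.
german = "Straße"

lowered = german.lower()      # 'ß' is already lowercase, so it is kept
folded = german.casefold()    # casefold() expands 'ß' to "ss"

print(lowered)  # straße
print(folded)   # strasse

# Caseless comparison: casefold both sides before comparing.
matches = folded == "STRASSE".casefold()
print(matches)  # True
```

Note that plain lower() on both sides would not match here, since "straße" != "strasse".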



I believe this does work for German (or at least I can't think of an example where it doesn't). One case where it doesn't work, though, is standard modern Greek. Accents are omitted in all-caps words, but casefold() (at least in Python's implementation) keeps them, so the all-caps and lowercase versions of a word come out unequal:

    >>> str.casefold("παράδειγμα")
    'παράδειγμα'

    >>> str.casefold("ΠΑΡΑΔΕΙΓΜΑ")
    'παραδειγμα'

It's at least consistent with this, though:

    >>> str.upper("παράδειγμα")
    'ΠΑΡΆΔΕΙΓΜΑ'

But I'd consider that incorrect, or at least nonstandard. The Greek alphabet does have accented capital letters, but they are only used as the first letter of a mixed-case word (e.g. if a sentence starts with έλα, you write it Έλα), never in the middle of an all-caps word. Then again, maybe this slides too far toward the "language" side rather than the "encoding" side, and so falls into the space that Unicode considers outside its purview.
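For what it's worth, one possible workaround if you need the two spellings to compare equal is to casefold and then strip combining marks. This is only a sketch: greek_caseless_key is a made-up helper name, not a standard library function, and it is deliberately lossy (it also drops the diaeresis and would conflate words that differ only by accent).

```python
import unicodedata


def greek_caseless_key(text: str) -> str:
    """Hypothetical helper: casefold, then drop all combining marks.

    Lossy by design: accents AND diaeresis are removed, so this is only
    suitable as a comparison key, never for display.
    """
    # NFD splits e.g. 'ά' into 'α' followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text.casefold())
    # unicodedata.combining() returns a non-zero class for combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


same = greek_caseless_key("παράδειγμα") == greek_caseless_key("ΠΑΡΑΔΕΙΓΜΑ")
print(same)  # True
```

Whether that trade-off is acceptable depends on the application; for search it is probably fine, for anything that has to distinguish accented homographs it is not.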


Yes, the Python standard library (like many other implementations) chose the "easy" way over the "correct" one.

Correct case-folding in Greek is complicated, since removing the accents can require adding a diaeresis to the next vowel: Μάιος - ΜΑΪΟΣ.

All this means that correct handling is not reversible, which introduces a slew of other problems.
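A quick sketch of the Μάιος example in Python, to show both halves of the problem. The variable names are mine, and "correct_caps" is simply the all-caps spelling given above, typed in by hand:

```python
word = "Μάιος"
correct_caps = "ΜΑΪΟΣ"  # all-caps spelling from the example above

# Python's upper() keeps the accent and does not add the diaeresis.
python_caps = word.upper()
print(python_caps)                  # ΜΆΙΟΣ
print(python_caps == correct_caps)  # False

# Going the other way, lowercasing the correct all-caps form cannot
# recover the accent, because that information is no longer in the string.
round_trip = correct_caps.lower()
print(round_trip == word.lower())   # False
```

So even an implementation that produced ΜΑΪΟΣ would have no way to get back to Μάιος without a dictionary.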



