latin1 is the default for text, including HTML, if you don't specify a charset in protocols such as HTTP (modulo some stupidity from the WHATWG where it might be Win-1252 instead), and Windows-1252 is the default encoding in Windows in the USA (at least, prior to the Unicode APIs being added; the old APIs probably still exist though…). So these codecs pop up a lot in places where people who don't know what they're doing end up touching text.
The WHATWG HTML spec requires UTF-8 for conforming documents and scripts [WHATWG 4.2.5.4]. In both HTML specs, charset declarations, if provided, must be UTF-8 [4.2.5].
If the transport, content-type, lack of charset declaration, and sniffing fail to determine an encoding, both specs use defaults based on the configured locale; for English that's windows-1252 [WHATWG: 12.2.3.2 W3C: 8.2.2.2]. latin1/ISO-8859-1 is prohibited [WHATWG: 12.2.3.3 W3C: 8.2.2.3].
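To make the latin1 vs windows-1252 distinction concrete, here is a small Python sketch. The sample bytes are made up for illustration; the codec names `latin-1` and `cp1252` are Python's standard ones. The two encodings only disagree in the 0x80-0x9F range, where latin1 has C1 control characters and windows-1252 has printable punctuation.

```python
# Illustrative bytes: two from the 0x80-0x9F range, two from above it.
raw = bytes([0x80, 0x93, 0xE0, 0xA9])

as_latin1 = raw.decode("latin-1")
as_cp1252 = raw.decode("cp1252")

# Above 0x9F the two codecs agree (here: "à" and "©").
print(as_latin1[2:] == as_cp1252[2:])

# In 0x80-0x9F they differ: latin1 yields control characters,
# windows-1252 yields the euro sign and a curly quote.
print([hex(ord(ch)) for ch in as_latin1[:2]])
print([hex(ord(ch)) for ch in as_cp1252[:2]])
```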
I ran across some code once for descrambling data that had been incorrectly processed like that, which I found common in legal documents. It's an interesting problem, because strictly speaking, it's lossy, but you can use probabilities to figure out something plausible. You can decode/encode one thing as another, or you can decode/encode multiple times...
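For the simplest version of that descrambling, here is a rough Python sketch. It is not the code I ran across, and it skips the probabilistic part entirely: it assumes the text was UTF-8 that got decoded as windows-1252 one or more times, and just reverses that until it stops working. The function name `fix_mojibake` and its parameters are made up for this example.

```python
def fix_mojibake(text, wrong="cp1252", right="utf-8", max_rounds=3):
    """Undo UTF-8 text that was mistakenly decoded as `wrong`, possibly repeatedly.

    Assumption: the damage is exactly "bytes in `right`, decoded as `wrong`".
    Anything else is left untouched.
    """
    for _ in range(max_rounds):
        try:
            # Reverse one layer: recover the bytes, then decode them correctly.
            candidate = text.encode(wrong).decode(right)
        except (UnicodeEncodeError, UnicodeDecodeError):
            # The text no longer looks like mis-decoded UTF-8, so stop here.
            break
        if candidate == text:
            break
        text = candidate
    return text


# Simulate the damage: once, and twice in a row.
once = "café ©".encode("utf-8").decode("cp1252")
twice = once.encode("utf-8").decode("cp1252")

print(fix_mojibake(once))
print(fix_mojibake(twice))
```

One caveat: Python's `cp1252` codec leaves bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined, so text that passed through a more permissive decoder may not round-trip with this sketch. That is part of why the problem is lossy and the real tools lean on heuristics.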
Any chance you have a link? I’ve had to implement solutions to this myself and it’s very tedious. If someone has built a more complete solution I would love to just use that instead.
That might be what I'm remembering; then again, I don't really do Python, so maybe it was something else. I doubt it was anything better than the link above, regardless.
As latin1 (ISO-8859-1) or Win-1252; ASCII doesn't have either à or ©.
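A quick way to check this in Python, as a sketch using only the standard codecs: both characters encode to a single byte in latin1 and windows-1252, and ASCII rejects them outright.

```python
text = "à©"

# Both single-byte encodings map these characters to the same bytes.
latin1_bytes = text.encode("latin-1")
cp1252_bytes = text.encode("cp1252")
print(latin1_bytes.hex(), cp1252_bytes.hex())

# ASCII only covers 0x00-0x7F, so encoding fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```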