Hacker News

That could become an automated improvement as well... although it was probably published in some paper 40 years ago and I just don't know the name of it.

Basically a second pass where you look for outliers in each character group, reassign them to another group, and keep the change if it improves the similarity scores of the character groups involved. Then iterate until nothing seems to be improving.

For example, one might find "the 3 that is least like all the other 3s", temporarily reassign it to "8", and then keep that change if it means an improvement in the scores for "how closely all sub-pictures of 3s resemble each other" and "how closely all sub-pictures of 8s resemble each other".
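The reassignment pass described above might look something like this sketch. The glyph representation (tiny flattened bitmaps), the similarity metric (count of matching pixels), and all names are illustrative assumptions, not a known published algorithm:

```python
# Hypothetical sketch of the second-pass outlier reassignment. Glyphs are
# flattened binary bitmaps (tuples of 0/1); similarity is the number of
# matching pixels, so higher is better. All of this is made up for illustration.

def similarity(a, b):
    """Count of matching pixels between two equal-sized bitmaps."""
    return sum(1 for x, y in zip(a, b) if x == y)

def group_score(glyphs):
    """Average pairwise similarity within one character group."""
    if len(glyphs) < 2:
        return 0.0
    pairs = [(i, j) for i in range(len(glyphs)) for j in range(i + 1, len(glyphs))]
    return sum(similarity(glyphs[i], glyphs[j]) for i, j in pairs) / len(pairs)

def reassign_outliers(groups, max_iters=10):
    """Repeatedly find each group's least-similar glyph and move it to
    whichever other group raises the combined scores; stop when a full
    sweep makes no improvement."""
    for _ in range(max_iters):
        improved = False
        for label in list(groups):
            glyphs = groups[label]
            if len(glyphs) < 2:
                continue
            # Index of "the glyph least like all the others" in this group.
            k = min(range(len(glyphs)),
                    key=lambda i: sum(similarity(glyphs[i], glyphs[j])
                                      for j in range(len(glyphs)) if j != i))
            outlier = glyphs[k]
            for other in groups:
                if other == label:
                    continue
                before = group_score(glyphs) + group_score(groups[other])
                after = (group_score(glyphs[:k] + glyphs[k + 1:])
                         + group_score(groups[other] + [outlier]))
                if after > before:  # keep the reassignment only if it helps
                    groups[label] = glyphs[:k] + glyphs[k + 1:]
                    groups[other] = groups[other] + [outlier]
                    improved = True
                    break
        if not improved:
            break
    return groups
```

With a toy input where an "8"-shaped bitmap was misfiled under "3", the misfit gets moved into the "8" group and the loop then settles.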

That might backfire if a document has different typefaces in it, though: lumping all the "3"s from different typefaces into one group would ruin the group-similarity scores.



For typefaces, you check the distribution of similarities in each group. If it has large clusters, split the group into 3, 3', etc., then run your outlier check on each of those sub-groups.
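One crude way to do that split is greedy threshold clustering: attach each glyph to the first sub-group whose representative it resembles closely enough, otherwise start a new sub-group. The threshold, the bitmaps, and the single-pass clustering here are all illustrative assumptions:

```python
# Hypothetical sketch of splitting one character group into per-typeface
# sub-groups ("3", "3'", ...) before running the outlier check on each.

def similarity(a, b):
    """Count of matching pixels between two equal-sized bitmaps."""
    return sum(1 for x, y in zip(a, b) if x == y)

def split_by_typeface(glyphs, threshold):
    """Greedy clustering: join the first sub-group whose first member this
    glyph matches at or above `threshold` pixels, else open a new one."""
    subgroups = []
    for g in glyphs:
        for sub in subgroups:
            if similarity(g, sub[0]) >= threshold:
                sub.append(g)
                break
        else:
            subgroups.append([g])
    return subgroups
```

A real implementation would probably pick the threshold from the similarity distribution itself (e.g. looking for a gap between within-typeface and cross-typeface similarities) rather than hard-coding it.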

Still would risk some weirdness, but would help a bit I'd hope.

I wonder if it would be worth running some kind of language analysis or spelling/grammar check to verify the scan too. At least for text; you'd need another solution for tables of numbers.




