
The best read for something like this is The Practice of Programming, which is a great little book to have in general. This link to a sample chapter covers the entire analysis and implementation of a Markov chain program, none of which requires many lines of code.

https://ptgmedia.pearsoncmg.com/images/9780201615869/samplep...

One thing I've noticed from playing with these types of programs is how much the number of prefix words used as the hash key matters. Two prefix words usually generate nonsense; the sweet spot for believable output is three or four, while five or six will quickly start reproducing the sample text verbatim. The larger and more varied the sample text, the better the results. Also, breaking words only on whitespace (leaving punctuation attached) produces better output than assuming you need to tinker with the punctuation.
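The experiment above is easy to reproduce. Here is a minimal sketch in Python (the PoP chapter uses C, but the structure is the same): build a table mapping each `order`-word prefix to the words that follow it, then walk the table. The function and variable names are my own, not from the book.

```python
import random
from collections import defaultdict

def build_chain(text, order=3):
    """Map each `order`-word prefix to the list of words that follow it.

    Words are split on whitespace only; punctuation stays attached,
    which tends to improve output quality.
    """
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        prefix = tuple(words[i:i + order])
        chain[prefix].append(words[i + order])
    return chain

def generate(chain, length=50):
    """Generate up to `length` words by repeatedly sampling a suffix
    for the most recent prefix."""
    prefix = random.choice(list(chain))
    out = list(prefix)
    for _ in range(length):
        suffixes = chain.get(tuple(out[-len(prefix):]))
        if not suffixes:
            break  # dead end: this prefix only occurs at the end of the text
        out.append(random.choice(suffixes))
    return " ".join(out)
```

Changing `order` from 2 up to 5 on the same corpus shows the nonsense-to-verbatim transition directly: with a small corpus most prefixes have exactly one suffix at higher orders, so the walk can only retrace the original text.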



The question of how many words are used is discussed here: https://en.wikipedia.org/wiki/N-gram#n-gram_models

See also https://en.wikipedia.org/wiki/Google_Ngram_Viewer which recently popularized the term n-gram.


> which recently popularized the term n-gram.

Eh? It's well known in information retrieval; it goes back at least to Salton's IR group at Cornell in the 1970s.


I know "popularized" is often used colloquially as a synonym for "introduced", but I think I'm well within the accepted meaning of the word if I use it to mean "made popular with or accessible to a larger audience". In fact, that meaning is closer to the dictionary definition (and the etymology) than the colloquial sense of "introduced".


I also recommend everyone read PoP.



