
This is great! Looks significantly more verbose, though I admit I haven't looked through all of your documentation. I'm very interested in knowing how it's performing!


Claude is middling, Codex is great with it - I think Codex has significantly better math reasoning, and it's all very mathy compared to e.g. TypeScript. Everything in the repo is LLM-generated at the moment - I expect to have to start doing some things manually in the near future, but I'm not sure where the sticking point will be.

It's already good enough that I'm thinking self-hosting will come relatively quickly, which is a huge deal, at least in my opinion. Having proper locque-in-locque self-hosting and tools written in locque within the first ~6 months would be superlative.


Looks like my tokenization review method was incorrect - honestly a little embarrassing on my part. I think it would have been a lot longer before I discovered it, so thanks for the comment!

I did just go back through and run equivalent code samples in the GlyphLang repo (vs. the sample code I posted, which I'm assuming you ran) through tiktoken and found slightly lower percentages, but still not insignificant: on average 35% fewer tokens than Python and 56% fewer than Java. I've updated the README with the corrected figures and methodology if you want to check: https://github.com/GlyphLang/GlyphLang/blob/main/README.md#a...
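For transparency, the counting itself is nothing fancy - roughly the following (a minimal sketch assuming tiktoken's `o200k_base` encoding; the actual samples and script in the repo may differ):

```python
# Rough sketch of the token-count comparison, not the repo's actual script.
# Assumes tiktoken with the o200k_base encoding; the samples are placeholders.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

def count_tokens(source: str) -> int:
    """Number of tokens the encoder produces for a code sample."""
    return len(enc.encode(source))

glyph_src = "@ GET /users/:id { $ user = query(...) > user }"
python_src = "def get_user(id):\n    user = query(...)\n    return user"

saving = 1 - count_tokens(glyph_src) / count_tokens(python_src)
print(f"{saving:.0%} fewer tokens than the equivalent Python sample")
```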


Yeah Java IS verbose. Thanks!


GlyphLang is intended to be a whole standalone language. It's implemented in Go, but it doesn't transpile to or from Go. It has its own lexer, parser, type checker, bytecode compiler, and stack-based VM. If it helps, the compilation pipeline currently looks like this:

source (.glyph) -> AST -> bytecode (.glyphc) -> VM.

While the original intent was to have something tailored to AI that a human could still manage, I'm realizing (to your point) that the human-manageable part likely won't be necessary for much longer. I've started working on making GlyphLang itself significantly more token-friendly and am adding a top layer that will essentially do what I think you've suggested. I'm adding expand and compact commands for bidirectional conversion between symbols and keywords, which will let engineers keep developing with more familiar syntax on a top layer (.glyphx) while LLMs generate actual .glyph code. Once completed, the pipeline will look like this:

.glyphx (optional) -> .glyph -> AST -> bytecode -> VM
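To make the expand/compact idea concrete, here's a deliberately naive sketch (Python pseudocode, not the actual Go implementation - the symbol/keyword pairs below are hypothetical, not the real .glyphx grammar):

```python
# Conceptual sketch only: bidirectional symbol <-> keyword rewriting for the
# planned expand/compact commands. The real table lives in the Go compiler;
# the pairs here are hypothetical illustrations.
SYMBOL_TO_KEYWORD = {
    "@": "route",   # hypothetical mapping
    "$": "let",     # hypothetical mapping
    ">": "return",  # hypothetical mapping
}
KEYWORD_TO_SYMBOL = {v: k for k, v in SYMBOL_TO_KEYWORD.items()}

def expand(glyph_src: str) -> str:
    """Rewrite compact .glyph tokens into keyword-style .glyphx tokens."""
    return " ".join(SYMBOL_TO_KEYWORD.get(tok, tok) for tok in glyph_src.split())

def compact(glyphx_src: str) -> str:
    """Rewrite keyword-style .glyphx tokens back into compact .glyph tokens."""
    return " ".join(KEYWORD_TO_SYMBOL.get(tok, tok) for tok in glyphx_src.split())
```

Because the mapping is one-to-one, the two forms round-trip, which is the property that would let engineers work in .glyphx while LLMs generate .glyph.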

Regarding #2, that's a great point and actually something I considered, though admittedly maybe not for long enough. Regardless, I've tried to develop this with a value proposition that isn't purely about cost (though that does drive a lot of this). I'm also working on these 3 points:

1. Reduced hallucinations: symbols are unambiguous - there shouldn't be confusion between def/fn/func/function across languages (no formal benchmarks yet, but they're planned).

2. Context window efficiency: fitting more code in context allows for better reasoning about larger codebases, regardless of cost.

3. Language neutrality (someone else brought this up): symbols work the same whether the model was trained on English, Spanish, or code.

I think even if tokens become free tomorrow, fitting 2x more code in a context window will still significantly improve output quality. Hopefully it will be necessary or at the very least helpful in the next 12-18 months, but who knows. I really appreciate the questions, comments, and callout!


The collision point is interesting, but I'd argue context disambiguates. If I'm understanding you correctly, I don't think the models are confused about whether they're looking at an email address when `@` appears before a route pattern. These symbols are heavily represented in programming contexts (e.g. Python decorators, shell scripts, etc.), so LLMs have seen them plenty of times in code. I'd be interested if you shared your findings, though! It's definitely an issue I'd like to avoid, or at least mitigate somewhat.

That's an absolutely fair point about tokenizer variance - vocabularies do differ - but the symbols GlyphLang uses are ASCII characters that tokenize as single tokens across the GPT-4, Claude, and Gemini tokenizers. The optimization isn't model-specific; rather, it's targeting the common case of "ASCII char = 1 token". I could definitely reword my post though - looking at it more closely, it does read more as "fix-all" rather than "fix-most".

Regardless, I'd genuinely be interested in seeing failure cases. It would be incredibly useful data to see if there are specific patterns where symbol density hurts comprehension.
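If anyone wants to poke at the single-token claim on the OpenAI side, it's a quick check (sketch only - tiktoken covers the GPT tokenizers; Claude and Gemini would have to be checked through their own tooling):

```python
# Sketch: confirm the structural symbols encode as single tokens under an
# OpenAI tokenizer. This only demonstrates the GPT side of the claim;
# Anthropic/Google tokenizers aren't exposed through tiktoken.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")
for sym in ["@", "$", ">", "{", "}"]:
    print(sym, len(enc.encode(sym)))  # typically 1 token each
```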


Collision is perhaps the wrong word, but LLMs definitely have trouble disambiguating different symbols of a language that map to similar tokens.

Way back in the GPT-3.5 days I could never get the model to parse even the simplest grammar until I replaced the one-letter production rules with one-word production rules, e.g. S vs Start. A bit like how they couldn't figure out the number of r's in "strawberry".


If context disambiguates, then you have to rely on attention, which is even more resource-intensive.

You want to be as state-free as possible. Your tokenizer should match your vocab and be unambiguous. I think your goal is sound, but you're golfing for the wrong metric.


That's an awesome tool! I think textclip.sh solves a different problem though (correct me if I'm wrong - this is the first I've been exposed to it). Compression at the URL/transport layer helps with sharing prompts, but the token count still hits you once the text is decompressed and fed into the model. The LLM sees the full uncompressed text.

The approach with GlyphLang is to make the source code itself token-efficient. When an LLM reads something like `@ GET /users/:id { $ user = query(...) > user }`, that's what gets tokenized (not a decompressed version). The reduced tokenization persists throughout the context window for the entire session.

That said, I don't think they're mutually exclusive. You could use textclip.sh to share GlyphLang snippets and get both benefits.
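To illustrate the distinction, here's a rough sketch of the deflate-raw + base64url scheme textclip.sh describes (the real implementation is the site's own, not this Python):

```python
# Sketch: transport-layer compression shrinks the URL payload, but the model
# still tokenizes the full decompressed text, so the context cost is unchanged.
# Assumes tiktoken's o200k_base encoding for the token count.
import base64
import zlib
import tiktoken

text = "a long prompt " * 200
enc = tiktoken.get_encoding("o200k_base")

# deflate-raw: negative wbits omits the zlib header/trailer
co = zlib.compressobj(wbits=-15)
payload = base64.urlsafe_b64encode(co.compress(text.encode()) + co.flush()).decode()

print("URL payload chars:", len(payload))              # much smaller than the text
print("tokens the model sees:", len(enc.encode(text))) # unchanged by compression
```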


Yes, the tool here is just for sharing the prompt - sorry, the first example I had handy was the one describing the service itself.

Here it is in plain text to be more visible:

```
textclip.sh→URL gen: #t=<txt>→copy page | ?ask=<preset>#t=→svc redirect | ?redirect=<url>#t=→custom(use __TEXT__ placeholder). presets∈{claude,chatgpt,perplexity,gemini,google,bing,kagi,duckduckgo,brave,ecosia,wolfram}. len>500→auto deflate-raw #c= base64url encoded, efficient≤16k tokens. custom redirect→local LLM|any ?param svc. view mode: txt display+copy btn+new clip btn; copy→clipboard API→"Copied!" feedback 2s. create mode: textarea+live counters{chars,~tokens(len/4),url len}; color warn: tokens≥8k→yellow,≥16k→red; url≥7k→yellow,≥10k→red. badge gen: shields.io md [!alt](target_url);
```

It uses math notation to heavily compress the representation while keeping the information content relatively preserved (similarly to GlyphLang). Later, an LLM can comfortably use it to describe the service in detail and answer users' questions about it. The same is applicable to arbitrary information, including source code/logic.

