Same. Without context it is speculation. Even with context, if the person creating the story controls the window, a window can be found that supports the story.
Does running the command to remove the anti-goblin ask from the local prompt increase performance slightly, because there is less "cognitive load" from having to hold its tongue?
If you work at OpenAI or another LLM company, I have a clear message I want you to hear:
I don't give a shit if my agents say goblins or not.
They are coding monkeys to me, researchers, etc.
I only care about their performance: perf per token / cost.
Please don't load their context with a bunch of style rules or safety theater shit. Really, the context is for me.
Do you de-goblin before you run all the benchmarks? Because that is what I am paying for: the performance as benchmarked. Please don't benchmark and then ship a bunch of one-shot context mods to my install by default.
The article is cute and interesting but doesn't rise to the level of a thing I give a shit about for my use.
I think what is missing in this discussion is the latent background that one needs to actually create "AI art" that has merit.
It is a strange combination of skills that some have developed to create styles, pipelines, techniques to create bespoke, recognizable, artistic output from these tools.
Anyone who is typing a prompt into a publicly available generative model is likely not getting anything of high value from it.
I will commend the parascene project for trying to court this special sort of advanced user, who can create custom pipelines and then connect them to the parascene system for anyone to use.
That process, and the implied skills to create a custom generator and host it, can all be broken down so that more people can do it, but I don't think enough people realize it is even something they can do. We are so trained to be consumers of AI services.
It's unfortunate that we have so few words for where we are in history right now. AI is AI is AI to some. A lot gets lost in that collapse.
Thanks for seeing it, though. It's very difficult to wave a flag to call people over when the only flag you have also looks a lot like the flag for danger, trouble, or nefarious deeds. At least if some people get it, there is hope that others will, too.
I am not an expert, but it seems like model distillation could work to get the behavior you need running on a cheap end-user processor (Raspberry Pi 4/5 class). I chatted with Claude Opus about your project and got the following advice:
For the compute problem, you don't need a Jetson. The approach you want is knowledge distillation: train a large, expensive teacher model offline on a beefy GPU (cloud instance, your laptop's GPU, whatever), then distill it down into a tiny student network like a MobileNetV3-Small or EfficientNet-Lite. Quantize that student to int8 and export it to TFLite. The resulting model is 2-3 MB and runs at 10-20 FPS on a Raspberry Pi 4/5 with just the CPU - no ML accelerator needed. For even cheaper, an ESP32-S3 with a camera module can run sub-500KB models for simpler tasks. The preprocessing is trivial: resize the camera frame to 224x224, normalize pixel values, feed the tensor to the TFLite interpreter. The CNN learns its own feature extraction internally, so you don't need any classical CV preprocessing.
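To make the distillation step a bit more concrete, here is a minimal, framework-free sketch of the loss a student network is usually trained on: a blend of matching the teacher's softened outputs and matching the true label. The function names, the `temperature` and `alpha` defaults, and the example logits are all assumptions for illustration, not anything from your project; a real training loop would use the equivalent ops in your ML framework.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=4.0, alpha=0.5):
    """Blend of soft-target KL divergence and hard-label cross-entropy.

    temperature and alpha are hypothetical defaults; tune them for your data.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the softened distributions.
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # The commonly used T^2 factor keeps the soft term on a comparable scale.
    soft = (temperature ** 2) * kl
    # Ordinary cross-entropy against the ground-truth label (temperature 1).
    hard = -math.log(softmax(student_logits)[true_label])
    return alpha * soft + (1 - alpha) * hard


# Example: a student that disagrees with the teacher pays a higher loss.
teacher = [4.0, 1.0, 0.5]
print(distillation_loss([3.5, 1.2, 0.4], teacher, true_label=0))
print(distillation_loss([0.5, 3.0, 1.0], teacher, true_label=0))
```

When the student's logits equal the teacher's, the soft term is zero and only the hard-label term remains, which is a quick sanity check to run on any real implementation.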
Looking at your observations, I think the deeper issue is what you identified: there's not enough signal in single frames. Your validation loss not converging even after augmentation and ImageNet pretraining confirms this. The fix is exactly what you listed in your future work - feed stacked temporal frames instead of single images. A simple approach is to concatenate 3-4 consecutive grayscale frames into a multi-channel input (e.g., 224x224x4). This gives the network implicit motion, velocity, and approach-rate information without needing to compute optical flow explicitly. It's the same trick DeepMind used in the original Atari DQN paper - a single frame of Pong doesn't tell you which direction the ball is moving either.
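The frame-stacking idea above can be sketched in a few lines. This is only an illustration using plain nested lists. The `FrameStacker` name, the default of 4 frames, and the choice to repeat the first frame at startup are my assumptions. In practice you would do the same thing with arrays in whatever framework you train in.

```python
from collections import deque


class FrameStacker:
    """Keeps the last k grayscale frames and exposes them as one H x W x k input."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)  # the oldest frame drops off automatically

    def push(self, frame):
        """Add a frame (a list of rows of pixel values) and return the stacked input."""
        if not self.frames:
            # At startup, repeat the first frame so the stack is always full.
            for _ in range(self.k):
                self.frames.append(frame)
        else:
            self.frames.append(frame)
        return self.stacked()

    def stacked(self):
        """Return an H x W x k nested list, with the oldest frame in channel 0."""
        height, width = len(self.frames[0]), len(self.frames[0][0])
        return [
            [[f[r][c] for f in self.frames] for c in range(width)]
            for r in range(height)
        ]


# Example with tiny 2x2 "frames" standing in for 224x224 camera images.
stacker = FrameStacker(k=3)
stacker.push([[1, 2], [3, 4]])
stacked = stacker.push([[5, 6], [7, 8]])
print(stacked[0][0])  # pixel (0, 0) across the three stacked frames
```

The same stacker has to run at inference time on the robot, so the model always sees inputs shaped the way they were during training.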
On the action space: your intuition about STOP being problematic is right. It creates a degenerate attractor - once the model predicts STOP, there's no recovery mechanism. The paper you referenced that only uses STOP at goal-reached is the better design. Also consider that TURN_CW and TURN_CCW have no obvious visual signal in a single frame (which way to turn is a function of where you've been and where you're going, not just what you see right now), which is another reason temporal stacking or adding a small recurrent/memory component would help. Even a simple LSTM or state tuple fed alongside the image could encode "I've been turning left for 3 steps, maybe try something else."
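As a rough sketch of both points, the snippet below masks out STOP unless a goal signal is set, and keeps a short action history that can be fed to the model next to the image. `FORWARD`, the `goal_reached` flag, the history length of 3, and the one-hot encoding are assumptions made for the example; only STOP, TURN_CW and TURN_CCW come from your write-up.

```python
import math
from collections import deque

# FORWARD is an assumed action name; the other three are from the project.
ACTIONS = ["FORWARD", "TURN_CW", "TURN_CCW", "STOP"]


def select_action(logits, goal_reached):
    """Pick the highest-scoring action, allowing STOP only once the goal is reached."""
    masked = list(logits)
    if not goal_reached:
        masked[ACTIONS.index("STOP")] = -math.inf
    best = max(range(len(ACTIONS)), key=lambda i: masked[i])
    return ACTIONS[best]


class ActionHistory:
    """Fixed-length record of recent actions, encoded as a flat one-hot vector."""

    def __init__(self, length=3):
        self.length = length
        self.recent = deque(maxlen=length)

    def record(self, action):
        self.recent.append(action)

    def as_vector(self):
        """Oldest slot first; slots with no action yet are all zeros."""
        vec = []
        for _ in range(self.length - len(self.recent)):
            vec.extend([0] * len(ACTIONS))
        for action in self.recent:
            vec.extend(1 if action == name else 0 for name in ACTIONS)
        return vec


# Example: STOP has the top score but is masked until the goal is reached.
history = ActionHistory(length=3)
action = select_action([0.1, 0.2, 0.3, 5.0], goal_reached=False)
history.record(action)
print(action, history.as_vector())
```

Masking at inference time is the cheapest version of this. Dropping STOP from the training labels except at goal states would be the cleaner fix if you retrain.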
For the longer term, consider a hybrid architecture: use the distilled neural net for obstacle detection and free-space classification, but pair it with classical SLAM or even simple odometry-based mapping for path planning and coverage. Pure end-to-end behavior cloning for the full navigation stack is a hard problem - even the commercial robots use learned perception with algorithmic planning. And your data collection would get easier too, because you'd only need to label "what's in front of me" rather than "what should I do," which decouples perception from decision-making and makes each piece easier to train and debug independently.
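Here is a toy sketch of that split, assuming a unit grid and dead-reckoning odometry. The `perceive` callback stands in for the distilled network and only answers "is the cell in front of me free?", while the planner is plain code that prefers unvisited cells. Every name here, the grid model, and the turn-clockwise fallback are illustrative assumptions, nowhere near a real SLAM or coverage planner.

```python
# Unit steps for headings N, E, S, W, listed in clockwise order.
DIRS = [(0, 1), (1, 0), (0, -1), (-1, 0)]


def plan_action(pose, visited, front_is_free):
    """Algorithmic planner: drive into free, unvisited cells; otherwise turn."""
    x, y, heading = pose
    dx, dy = DIRS[heading]
    ahead = (x + dx, y + dy)
    if front_is_free and ahead not in visited:
        return "FORWARD"
    return "TURN_CW"


def apply_action(pose, action):
    """Dead-reckoning odometry update on the grid."""
    x, y, heading = pose
    if action == "FORWARD":
        dx, dy = DIRS[heading]
        return (x + dx, y + dy, heading)
    if action == "TURN_CW":
        return (x, y, (heading + 1) % 4)
    if action == "TURN_CCW":
        return (x, y, (heading - 1) % 4)
    return pose


def run(perceive, steps, start=(0, 0, 0)):
    """Alternate perception and planning; returns the final pose and visited cells."""
    pose = start
    visited = {(pose[0], pose[1])}
    for _ in range(steps):
        action = plan_action(pose, visited, perceive(pose))
        pose = apply_action(pose, action)
        visited.add((pose[0], pose[1]))
    return pose, visited


def fake_perception(pose):
    """Stand-in for the neural net: a 2x2 room with walls all around."""
    x, y, heading = pose
    dx, dy = DIRS[heading]
    return 0 <= x + dx < 2 and 0 <= y + dy < 2


final_pose, covered = run(fake_perception, steps=5)
print(final_pose, sorted(covered))
```

This naive planner just spins in place once every neighbor has been visited, which is where a real coverage or frontier-based planner would take over. The point is only that the perception model and the planning logic can be swapped and tested independently.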
Can you please design a version for kids to ride on?
With a seat and handle similar to "wooden bee ride on" by b. toys?
I want a vacuum that kids can actually drive, ride on, and do real vacuuming with, and that has minimal levels of safety: turning it over halts the vacuum, stairs/ledges are avoided, and there are no rollers or other parts that could snare a kid's hair, etc.
There may be benefits to fusing child input signals with the vacuum's supervisory route goals. It would be age dependent; older kids would want full manual, I think.
Kids like to do real jobs, and as a parent I prefer purchasing real items for my kids rather than toy versions if practical.
Real vacuums are _so_ difficult for kids though, they're the wrong size and way too heavy. A zamboni-vacuum-for-kids is definitely not a general purpose thing, but does hit a nice balance between functional and kid-friendly.
You may find bits of Shiro's code useful as it has massive shimming work to get Claude Code, npm/node, git, various grep tools and isomorphic git and git diffs to work, and some weird features like virtual servers that create virtual ports to communicate to frames.
All the Unix tools that Claude Code uses are supported as well. It is also a TypeScript project with a similar architecture and an MIT license, so there may be parts you can just import straight in without much hassle.
Probably the hardest part to keep architecturally clean is the shimming required in the JS eval environment to make Claude Code and non-browser-native packages run. But it is very nice once you have an agent able to work inside your browser OS.
Great job and thanks for sharing Lifo. I am certain this will catch on once the implementations become more solid.
Wes Anderson is positioning cinematographers on American soil with extremely powerful telephoto lenses, filming actors performing in meticulously designed miniature sets across the Canadian border. The film will be titled "The Asymmetrical Tax Avoidance" and will star Bill Murray as a customs agent with daddy issues.