The T5 encoder is ~5B parameters, so a back-of-the-envelope estimate is ~10 GB of VRAM (it's in bfloat16). So 360p should take ~15 GB of RAM (+/- a few GB depending on the duration of the video generated).
We can update the code over the next day or two to provide the option to delete the VAE after the text encoding is computed (to save on RAM), and then report back the GB consumed for 360p and 720p at 2-5 seconds on GitHub so there are more accurate numbers.
Beyond the 10 GB from the T5, there's just a lot of VRAM taken up by the context window of 720p video (even though the model itself is 2B parameters).
The 5B text encoder feels disproportionate for a 2B video model. If the text portion is dominating your VRAM usage, it really hurts the inference economics.
Have you tried quantizing the T5? In my experience you can usually run these encoders in 8-bit or even 4-bit with negligible quality loss. Dropping that memory footprint would make this much more viable for consumer hardware.
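As a rough sketch of the arithmetic (counting weights only and assuming a ~5B-parameter encoder; activations and context memory come on top, so these are lower bounds, not measurements):

```python
# Back-of-the-envelope weight memory for a ~5B-parameter encoder.
PARAMS = 5e9

def weight_gb(bits_per_param: float) -> float:
    """Memory in GB for the model weights alone at a given precision."""
    return PARAMS * bits_per_param / 8 / 1e9

print(weight_gb(16))  # bfloat16: ~10 GB (the figure quoted above)
print(weight_gb(8))   # int8:     ~5 GB
print(weight_gb(4))   # 4-bit:    ~2.5 GB
```

Even if 8-bit costs a point or two of text fidelity, halving the encoder's footprint changes what consumer cards can run.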
Nice! I built a similar system in the past using a servo-controlled Traxxas buggy with an LTE hat, which let us do open-space driving. Latency (over the internet) was still a challenge, and finding cameras and lenses that performed well across varying lighting conditions turned out to be a bit of a pain, but it was pretty fun stuff.
I've been using raylib for years to power generative digital paintings on embedded systems (RPi and the like). I've been really impressed with its performance and accessible API. Plus, it's a very active and welcoming open-source project; kudos to the maintainer.
I've been using raylib for years now to implement digital signage art, and it's been a pleasure to work with, especially thanks to its excellent multi-platform support (I've used many Raspberry Pis). Really well-thought-out, intuitive API; kudos to the author.
Amazing to think they were actually spawning 150k headless browsers to simulate the traffic. That sounds like throwing money at the problem and it probably worked (for a while anyway).
Having built a load-test tool as well, I can say that making it realistic enough, and keeping it that way, is possibly the hardest challenge. The maintenance cost is high, especially in a feature-focused environment.
The new tool seems like an early version as well, with pretty basic functionality.
In the example where it is supposed to be "viewing a message, marking the message as read, and finally calling reactions.add", it doesn't really do those things as a real chain. There's just a 5-second delay after "view a message", then "mark message as read" runs, then a 60-second delay, then a call to reactions.add. I'm not sure that mimics real end-user behavior terribly well.
It seems like they could have used JMeter rather than building a home-grown WebSocket test client. Perhaps there's some requirement that existing tools don't handle well.
For those who haven't read the article yet: this story is about stopping the money-throwing and switching to a more scalable (i.e., cheaper) solution.
It's kind of interesting to see them choose a rather "declarative" (that is, JSON-centric) approach instead of adopting a small language like Lua for scenario-based scripting.
Maybe the declarative approach is better suited to auto-generation from the user stats data they described? After all, there are often fewer people who like writing stress tests than writing the features that should be stress-tested.
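A declarative scenario in that style might look something like the following. This is a hypothetical sketch, not the actual schema of their tool; the action and field names are made up:

```json
{
  "scenario": "read-and-react",
  "steps": [
    { "action": "view_message" },
    { "wait_seconds": 5 },
    { "action": "mark_message_read" },
    { "wait_seconds": 60 },
    { "action": "reactions.add" }
  ]
}
```

The appeal is that a structure like this can be emitted mechanically from aggregated usage stats, whereas Lua scripts would need a human in the loop for every behavior change.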
This is great and overdue. Hopefully all major browsers will add some support for open source/royalty free codecs.
Emscripten/WebAssembly actually worked rather well for audio (Opus is just awesome), but when it comes to video it's just infeasible, especially if you are looking at doing low-latency streaming. That said, I cannot fail to mention the incredible effort by ogv.js [1] to make A/V decoding possible almost anywhere.
At Mattermost we went for the do-it-yourself option and wrote a custom tool for the job [1]. After a lot of research into all the existing open-source frameworks, we couldn't really find anything that fit our use case. We are quite happy with the result, although, as the OP mentioned, there's a significant maintenance cost attached: as new features get implemented and more API calls are added, you need to go back and make sure the logic defining your user behaviour stays in sync with the real world.
If I were to do it all over again, I'd probably give k6 [2] a chance, but I'm still convinced a tailored solution was the best choice.
Yes, they are. Technically, though, it's not mixed: all arriving audio data is dumped into a single buffer. Mixing would now be a great way to get rid of the stream's choppiness.
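The difference is roughly this (a sketch assuming signed 16-bit PCM streams of equal length, which may not match the actual buffer format):

```python
def mix_streams(streams):
    """Mix several equal-length int16 PCM streams by summing them
    sample-wise and clamping to the int16 range, instead of dumping
    each stream into the output buffer one after another."""
    mixed = []
    for samples in zip(*streams):
        total = sum(samples)
        # Clamp to avoid integer wrap-around when streams add up loud.
        mixed.append(max(-32768, min(32767, total)))
    return mixed

# Two one-sample streams that would overflow int16 if summed naively:
print(mix_streams([[30000], [10000]]))  # -> [32767]
```

In practice you'd probably also attenuate or soft-clip rather than hard-clamp, or many simultaneous players will distort badly.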
Now that so many users are playing simultaneously, I'm a bit annoyed that some leave the website open and stream silence.
I'm wondering how I could give each user a fair slot to perform now...
Anyways, the stream has become a great source of entropy now :D
Lack of ambient light and atmospheric attenuation. Significantly more direct light vs indirect light.
If you fly at 35,000 ft, the horizon is about 221.3 miles away, and most of that sight line passes through dense air. If you look straight down from the ISS, there is less than ten miles of thick atmosphere between the camera and the target.
If you ray trace a scene with a single light source, few objects, and no effects that simulate the atmosphere, you are effectively simulating how the scene looks in a vacuum.
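The horizon figures above can be sanity-checked with the geometric horizon formula (ignoring atmospheric refraction, and assuming a spherical Earth and approximate altitudes):

```python
import math

EARTH_RADIUS_M = 6_371_000

def horizon_distance_m(height_m):
    """Straight-line distance to the geometric horizon for an observer
    at height_m above a spherical Earth, with no refraction."""
    return math.sqrt(2 * EARTH_RADIUS_M * height_m + height_m ** 2)

cruise = horizon_distance_m(35_000 * 0.3048)  # ~369 km (~229 mi)
iss = horizon_distance_m(400_000)             # ~2,290 km
```

This gives roughly 229 miles at cruise altitude, the same ballpark as the ~221-mile figure above, and shows how much more of the ISS view is empty space rather than thick air.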
I suspect another contributing factor is that the setting here looks more like what you usually see in computer graphics than in real life: very few moving parts.
In real life there are insects and birds moving around, wind blowing all sorts of things (leaves, blades of grass, trash, etc.), individual strands of hair, and so on. All things we can't really reproduce with graphics.
Here there is just a sphere with a surface texture and some volumetric effects.