
I’m working on uvmap.ai, a browser tool for editing 3D model textures directly from the model view instead of bouncing between a UV map, an image editor, and an AI tool. You load a glTF, click the part you want to change, it uses SAM3 to mask that region, then sends it to Nano Banana and puts the result back onto the texture. Still early, but the goal is to make texture iteration much less tedious.

Soon here: https://github.com/RefactorHQ/UVMapAI


One thing this article proved is that the Dead Internet Theory is real. Look at all these Claudy comments!


“Photo of a red sphere on top of a blue cube. Behind them is a green triangle, on the right is a dog, on the left is a cat”

https://pbs.twimg.com/media/GG8mm5va4AA_5PJ?format=jpg&name=...


That's _amazing_.

I imagine this doesn't look impressive to anyone unfamiliar with the scene, but this was absolutely impossible with any of the older models. Though, I still want to know if it reliably does this; so many other things are left to chance that if I also need to hit a one-in-ten chance of the composition being right, it still might not be very useful.


It’s the transformer making the difference. The original Stable Diffusion uses convolutions, which are bad at capturing long-range spatial dependencies. The diffusion transformer chops the image into patches, mixes them with a positional embedding, and then just passes that through multiple transformer layers, as in an LLM. At the end, the model unpatchifies (yes, that term is in the source code) the patched tokens to generate output as a 2D image again.

The transformer layers perform self-attention between all pairs of patches, allowing the model to build a rich understanding of the relationships between areas of an image. These relationships extend into the dimensions of the conditioning prompts, which is why you can say “put a red cube over there” and it actually is able to do that.
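
A rough, toy-sized sketch of that patchify → attend → unpatchify flow (dimensions and layer choices are made up for illustration, not SD3's actual code):

  import torch
  import torch.nn as nn

  B, C, H, W, P, D = 1, 4, 64, 64, 2, 256     # batch, latent channels, latent HxW, patch size, token width
  n_tokens = (H // P) * (W // P)              # 32 * 32 = 1024 patch tokens

  patchify   = nn.Conv2d(C, D, kernel_size=P, stride=P)    # each PxP patch -> one D-dim token
  pos_embed  = nn.Parameter(torch.zeros(1, n_tokens, D))   # positional embedding, one per patch
  layer      = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)
  blocks     = nn.TransformerEncoder(layer, num_layers=4)  # self-attention across *all* patches
  unpatchify = nn.Linear(D, C * P * P)                      # each token back to a PxP pixel patch

  x = torch.randn(B, C, H, W)                               # noisy latent image
  tokens = patchify(x).flatten(2).transpose(1, 2) + pos_embed   # (B, 1024, D)
  tokens = blocks(tokens)                                   # every patch attends to every other patch
  out = unpatchify(tokens).transpose(1, 2).reshape(B, C * P * P, H // P, W // P)
  out = nn.functional.pixel_shuffle(out, P)                 # reassemble into a (B, C, H, W) image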

I suspect that the smaller model versions will do a great job of generating imagery but may not follow the prompt as closely; that’s just a hunch, though.


Convolutions are bad at long range spatial dependencies? What makes you say that - any chance you have a reference?


Convolution filters attend to a region around each pixel, not to every other pixel (or patch, in the case of DiT). In that sense, they are not good at establishing long-range dependencies. The U-Net in Stable Diffusion does add self-attention layers, but these operate only in the lower-resolution parts of the model. The DiT model does away with convolutions altogether, going instead with a linear sequence of blocks containing self-attention layers. The dimensionality is constant throughout this sequence of blocks (i.e. there is no downscaling), so each block gets a chance to attend to all of the patch tokens in the image.

One of the neat things they do with the diffusion transformer is to enable creating smaller or larger models simply by changing the patch size. Smaller patches require more GFLOPs, but the attention is finer-grained, so you would expect better output.
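
Back-of-envelope numbers for that trade-off, assuming a 64x64 latent grid (my assumption, just to make the scaling concrete):

  for p in (8, 4, 2):
      n = (64 // p) ** 2     # 64, 256, 1024 tokens
      print(p, n, n * n)     # patch size, token count, attention pairs (cost grows roughly quadratically)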

Another neat thing is how they apply conditioning and the time step embedding. Instead of adding these in a special way, they simply inject them as tokens, no different from the image patch tokens. The transformer model builds its own notion of what these things mean.

This implies that you could inject tokens representing anything you want. With the U-Net architecture in Stable Diffusion, for instance, we have to hook onto the side of the model to control it in various sort-of-hacky ways. With DiT, you would just add your control tokens and fine-tune the model. That’s extremely powerful and flexible, and I look forward to a whole lot more innovation happening simply because training in new concepts will be so straightforward.
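
A toy sketch of the "conditioning is just more tokens" idea (shapes and names are illustrative, following the description above, not the real SD3/DiT implementation): timestep, prompt, and any hypothetical control tokens are simply concatenated with the patch tokens before the transformer blocks.

  import torch
  import torch.nn as nn

  D = 256
  t_token       = torch.randn(1, 1, D)     # timestep embedding, projected to a single token
  text_tokens   = torch.randn(1, 77, D)    # prompt embeddings, projected to the token width
  control_token = torch.randn(1, 1, D)     # hypothetical extra control you could fine-tune in
  patch_tokens  = torch.randn(1, 1024, D)  # image patches, as in the earlier sketch

  seq = torch.cat([t_token, text_tokens, control_token, patch_tokens], dim=1)   # (1, 1103, D)
  layer  = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)
  blocks = nn.TransformerEncoder(layer, num_layers=4)
  out = blocks(seq)                        # attention freely mixes conditioning with every patch
  patch_out = out[:, -1024:, :]            # keep only the image-patch positions for unpatchify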


My understanding of this tech is pretty minimal, so please bear with me, but is the basic idea something like this?

Before: Evaluate the image in a little region around each pixel against the prompt as a whole -- e.g. how well does a little 10x10 chunk of pixels map to a prompt about a "red sphere and blue cube". This is problematic because maybe all the pixels are red but you can't "see" whether it's the sphere or the cube.

After: Evaluate the image as a whole against chunks of the prompt. So now we're looking at a room, and then we patch in (layer?) a "red sphere" and then do it again with a "blue cube".

Is that roughly the idea?


It kinda makes sense, doesn't it? What are the largest convolutions you've heard of -- 11 x 11 pixels? Not much more than that, surely? So how much can one part of the image influence another part 1000 pixels away? But I am not an expert in any of this, so an expert's opinion would be welcome.


Yes, it makes sense a bit. Many popular convnets operate on 3x3 kernels, but the number of channels increases per layer. This, coupled with the fact that the receptive field grows with every layer and lets convnets essentially see the whole image relatively early in the model's depth (especially with pooling operations, which increase the receptive field rapidly), makes this intuition questionable. Transformers, on the other hand, operate on attention, which lets them weight each patch dynamically; it's clear to me that this lets them attend to all parts of the image in a way different from convnets.
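
A quick back-of-envelope of how fast the receptive field grows for stacked 3x3 convs with an occasional stride-2 downsample (a hypothetical layout, just to illustrate the point):

  rf, jump = 1, 1
  for i, stride in enumerate([1, 1, 2, 1, 1, 2, 1, 1, 2, 1, 1, 2], start=1):
      rf += (3 - 1) * jump     # each 3x3 conv widens the receptive field by 2 * current jump
      jump *= stride           # every stride-2 layer doubles the jump from then on
      print(f"layer {i:2d}: receptive field {rf}x{rf} pixels")
  # reaches 91x91 pixels by layer 12, so a convnet does "see" broadly after a
  # modest number of layers, though never all pairs of pixels at once the way
  # full self-attention does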


I put the prompt into ChatGPT and it seemed to work just fine: https://imgur.com/LsRM7G4


You got lucky! Here's a thread where I attempted the same just now: https://imgur.com/a/xiaiKXp

It has a lot of difficulty with the orientation of the cat and dog, and by the time it gets them in the right positions, the triangle is lost.


I dislike the look of chatGPT images so much. The photo-realism of stable diffusion impresses me a lot more for some reason.


This is just stylistic, and I think it’s because chatgpt knows a bit “better” that there aren’t very many literal photos of abstract floating shapes. Adding “studio photography, award winner” produced results quite similar to SD imo, but this does negatively impact the accuracy. On the other side of the coin, “minimalist textbook illustration” definitely seems to help the accuracy, which I think is soft confirmation of the thought above.

https://imgur.com/a/9fO2gxN

EDIT: I think the best approach is simply to separate out the terms in separate phrases, as that gets more-or-less 100% accuracy https://imgur.com/a/JGjkicQ

That said, we should acknowledge the point of all this: SD3 is just incredibly incredibly impressive.


This is adjustable via the API, but not in ChatGPT. The API offers styles of "vivid" and "natural", but ChatGPT only uses "vivid".


It looks terrible to me though, very basic rendering and as if it’s lower resolution then scaled up.


What was difficult about it?


From my experience, the thing that makes AI image gen hard to use is nailing specificity. I often find myself having to resort to generating all of the elements I want in an image separately and then compositing them together in Photoshop. This isn't a bad workflow, but it is tedious (I often equate it to putting coins in a slot machine, hoping it 'hits').

Generating good images is easy, but generating good images with very specific instructions is not. For example, try getting Midjourney to generate a shot of a road from the side (i.e. standing on the shoulder of a road, taking a photo of the shoulder on the other side, with the road crossing the frame from left to right)... you'll find Midjourney only wants to generate images of roads coming at the "camera" from the vanishing point. I even tried feeding in an example image with the correct framing for Midjourney to analyze, to help inform what prompts to use, but this still did not result in the expected output. This is obviously not the only framing + subject combination that model(s) struggle with.

For people who use image generation as a tool within a larger project's workflow, this hurdle makes the tool swing back and forth from "game changing technology" to "major time sink".

If this example prompt/output is an honest demonstration of SD3's attention to specificity, especially as it pertains to framing and composition of objects + subjects, then I think it's definitely impressive.

For context, I've used SD (via comfyUI), midjourney, and Dalle. All of these models + UIs have shared this issue in varying degrees.


It's very difficult to improve text-to-image generation to do better than this because you need extremely detailed text training data, but I think a better approach would be to give up on it.

> I often find myself having to resort to generating all of the elements I want out of an image separately and then comp them together with photoshop. This isn't a bad workflow, but it is tedious

The models should be developed to accelerate this then.

i.e. you should be able to say layer one is this text prompt plus this camera angle, layer two is some mountains you cheaply modeled in Blender, and layer three is a sketch you drew of today's anime girl.


Totally agree. I am blown away by that image. Midjourney is so bad at anything specific.

On the other hand, SD has just not been on the level of the quality of images I get from Midjourney. I don't think the people who counter this know what they're talking about.

Can't wait to try this.


Previous systems could not compose objects within the scene correctly, not to this degree. What changed to allow for this? Could this be a heavily cherry-picked example? Guess we will have to wait for the paper and model to find out.


From the original paper with this technique:

  We introduce Diffusion Transformers (DiTs), a simple transformer-based backbone for diffusion models that outperforms prior U-Net models and inherits the excellent scaling properties of the transformer model class. Given the promising scaling results in this paper, future work should continue to scale DiTs to larger models and token counts. DiT could also be explored as a drop-in backbone for text-to-image models like DALL E 2 and Stable Diffusion.
Afaict the answer is that combining transformers with diffusers in this way means that the models can (feasibly) operate in a much larger, more linguistically-complex space. So it’s better at spatial relationships simply because it has more computational “time” or “energy” or “attention” to focus on them.

Any actual experts want to tell me if I’m close?


Would be nice if it were just more attention. There could be something else, though.


One thing that jumps out to me is that the white fur on the animals has a strong green tint due to the reflected light from the green surfaces. I wonder if the model learned this effect from behind the scenes photos of green screen film sets.


The models do a pretty good job at rendering plausible global illumination, radiosity, reflections, caustics, etc. in a whole bunch of scenarios. It's not necessarily physically accurate (usually not in fact), but usually good enough to trick the human brain unless you start paying very close attention to details, angles, etc.

This fascinated me when SD was first released, so I tested a whole bunch of scenarios. While it's quite easy to find situations that don't provide accurate results and produce all manner of glitches (some of which you can use to detect some SD-produced images), the results are nearly always convincing at a quick glance.


One thing they don't do so far is maintain consistent perspective and vanishing points.

https://arxiv.org/abs/2311.17138


As well as light and shadows, yes. It can be fixed explicitly during training, as the paper you linked suggests, by adding a classifier, but it will probably also keep getting better in new models on its own, just as a result of better training sets, lower compression ratios, and better understanding of the real world by models.


I think you have to consider how diffusion models work: once the green triangle has been put into the image in the early steps, the later steps are influenced by its presence and fill in fine details like the reflection as they go along.

The reason it knows this is that this is how light works in any real photograph, not just CGI.

Or if your prompt was “A green triangle looking at itself in the mirror”, then early generation steps would have two green-triangle-like shapes. It doesn’t need to know about the concept of light reflection. It does know about the composition of an image based on the word “mirror”, though.
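
A schematic of why early structure persists (a toy loop, not a real sampler): every denoising step conditions on the image produced so far, so whatever is committed early keeps steering the detail added later.

  import torch

  def toy_denoiser(x, t):
      # stand-in for the trained network; the real model predicts noise from (image, timestep, prompt)
      return 0.1 * x

  x = torch.randn(1, 4, 64, 64)         # start from pure latent noise
  for t in reversed(range(50)):
      predicted_noise = toy_denoiser(x, t)
      x = x - predicted_noise           # crude update: each step refines what is already there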


It's just diffuse irradiance, visible in most real (and CGI) pictures although not as obvious as that example. Seems like a typical demo scene for a 3D renderer, so I bet that's why it's so prominent.


It does make sense though. Accurate global illumination is very strongly represented in nearly all training data (except illustrations) so it makes sense that the model learned an approximation of it.


Wow - is it doing pre-render-ray-tracing?


EDIT: Wrong window folks....

What if you can | a scene to a model and just have it calc all the ray-paths and then | any color/image... if you pre-calc various ray angles, you can then just map your POV and allow for the volume as it pertains to your POV be mapped with whatever overlay you want.

Here is the crazy cyberpunk part:

IT (whatever 'IT' is) keeps a lidar of everything EVERYONE senses in that space and can overlap/time/sequence anything about each experience and layer (barometer/news/blah tied to that temporal marker)

Micro resolution of advanced lidar is used in signature creation to ensure/verify/detect fake places vs IRL.

Secret nodes are used to anti-lidar the sensors... so a place can be hidden from drones attempting to map it.

These anomalies are detectable though, and GIS experts with terraforming skills are the new secOps.

Fn dorks.

-- so, you already have an asset, let's say it's a CUBOID room - with walls and such of wood texture_05.png


I think you've read too far into this. Ray tracing is not a useful real-world primitive for extracting information from most scenes. Sure, "everything is shiny", but most surfaces are diffuse and don't contain useful visual information besides the object they illuminate. Many supposedly "pure" reflections like mirrors and glass are actually subtle caustics that introduce too much nuance to account for.

Also, "pipe" isn't considered harmful terminology (yet) just FYI. I was confused seeing the "|" mononym in it's place.


Thanks for that - I like | .

I was being lazy....

But I realize you are correct in the mirroring - I immediately thought it was ray tracing the green hue from the reflection onto a surface that could see it...

Inference is far more efficient - however, it would be really interesting to know HOW an AI 'thinks' about such reflections.

What's the current status of AIs documenting themselves?


Not bad. I'm curious what the output would be if you ask for a mirrored sphere instead.


This is actually the approach of one paper to estimate lighting conditions. Their strategy is to paint a mirrored sphere onto an existing image: https://diffusionlight.github.io/


That's very impressive!


It is! This isn't something previous models could do.


Interesting that left and right are taken from the viewer's perspective instead of the red sphere's perspective.


How do you know which way the red sphere is facing? A fun experiment would be to write two prompts for "a person in the middle, a dog to their left, and a cat to their right", and have the person either facing towards or away from the viewer.


Now try “a highway being held up by an airplane”

Tried all morning and ChatGPT could not do it.


That's hard for me to parse as a human. Do you mean the plane is on the highway and causing a traffic jam?

Or is the highway literally being held by a humanoid plane?


Would it be a highway resting on top of a plane, like the plane is a pillar?


We're getting strong holodeck vibes here.


"When in doubt, scale it up." - openai.com/careers


I don't see why people go crazy over this; the Meissner effect in superconductors looks different.


I don't understand what you're not seeing. Nobody is saying that the whole LK-99 sample is superconducting. Why would you expect the whole sample to float, instead of "partial floating" as we're seeing, if you have a material that's only partially superconducting?


I was wondering this: why do all the samples not float completely? If the sample is only partially superconducting, could you just smash it into smaller pieces and find a few floating ones? Or would the different domains be so small that you would have to turn the sample into dust before you get a completely superconducting piece? It probably cannot be answered in general, as it heavily depends on the synthesis process.


Someone asked:

> If you were to cut off the part that isn't lifting, do you reckon the entirety of the floaty half would float? AKA is this semi-float thing a purity issue?

And Andrew responded [0]:

> We wanted to do rock surgery, but I was scared we would shatter our biggest piece. It definitely has a piece that isn't contributing lift.

I'm guessing once they've prepared more samples and had existing samples tested they'll break some into smaller parts to try and get a piece that floats independently.

[0] https://twitter.com/andrewmccalip/status/1687408423179370497



That batch seems to be a dud:

https://twitter.com/andrewmccalip/status/1687425301184393216

"Reaction looks like garbage. Very incomplete"


洗芝溪 answered this on Zhihu. All his answers are extremely good. This user knows something.

这应该还是材料的各向异性, 也就是它的准一维结构导致. 因为材料当中有一个固定晶相, 那个方向的抗磁性最强, 所以磁感线总会沿着这个方向走, 导致所有样品都是竖起来的.

This should be due to the anisotropy of the material, that is, its quasi-one-dimensional structure. Because there is a fixed crystal phase in the material, and diamagnetism is strongest along that direction, the magnetic field lines always follow this direction, causing all the samples to stand upright.

https://www.zhihu.com/question/615044128/answer/3145337312


If the material is a one-dimensional superconductor where current can only flow with zero resistance along one direction, aligning that direction with the magnetic field lines should produce the weakest diamagnetism, because the magnetic force is proportional to the cross product of current and magnetic field. If the two are parallel, the force is zero.

Of course, this could explain why the sample doesn't lift off: it starts out with the superconducting direction at an angle to the magnetic field, giving relatively strong diamagnetism and a righting force, until the two directions are nearly aligned and the diamagnetism is too weak to overcome gravity.
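
In symbols (the standard force on a current element, just restating the argument above):

  d\vec{F} = I \, d\vec{\ell} \times \vec{B},
  \qquad d\vec{\ell} \parallel \vec{B} \;\Rightarrow\; d\vec{F} = 0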


That's what is currently being done! A few videos have surfaced of smaller grains of LK-99 fully levitating. We're waiting for additional confirmation.


I thought it was clear that the process for creating the material is low yield, and the reaction produces other materials, so most of the material we're seeing is not LK-99. See the wiki section on synthesis: https://en.wikipedia.org/wiki/LK-99 . Right now we don't have a good way to produce the material with better yield.


Imagine you take red paint and blue paint and mix them together. It is certainly true that there is red paint and blue paint still in there, but however small the samples you take, they are still going to be purple. You might get some samples that are reddish purple and some that are bluish purple, however.


The Russian molecular biologist “Iris” had a tiny speck that floats, and there was another video as well (I don’t remember the source, but I saw it on Twitter). However, the credentials of the people or the labs they work for are unknown, so take it with a grain of salt.


So if we increased the magnitude of the magnetic field (e.g. by using an electromagnet), why couldn't we still have the sample float? Or does the Meissner effect not relate to the magnitude of the magnetic flux?

Unless, of course, the other end is attracting the magnet.


Some speculate it's a monocrystalline "1D" superconductor, which could explain its different behavior.

But that is out of my depth. I am still trying to wrap my head around a directional superconductor/resistor...


I'm guessing this is the crappy v1 that barely does what it's supposed to. Still represents a giant leap for mankind


How so?


The rock should float for a start.


The Meissner effect does not mean it has to float. It only floats if that's the easiest way to expel the magnetic field. Otherwise it aligns normally with the magnetic field, which is supposedly what's happening here.


As has been theorized elsewhere, it may be that the production process produces a material with only bits of the theorized superconductor in it, leading to the wiggling effect we see here as well as in the Chinese replication video.


It is not a purity issue. On the other hand, 1D superconductors are expected to behave this way. See my other comments.


How do we know it isn't a purity issue?

