Hacker News | domh's comments

I have an M4 Max with 48GB RAM. Anyone have any tips for good local models? Context length? Using the model recommended in the blog post (qwen3.5:35b-a3b-coding-nvfp4) with Ollama 0.19.0 and it can take anywhere between 6-25 seconds for a response (after lots of thinking) from me asking "Hello world". Is this the best that's currently achievable with my hardware or is there something that can be configured to get better results?

> it can take anywhere between 6-25 seconds for a response (after lots of thinking) from me asking "Hello world".

Qwen thinking likes to second-guess itself a LOT when faced with simple/vague prompts like that. (I'll answer it this way. Generating output. Wait, I'll answer it that way. Generating output. Wait, I'll answer it this way... lather, rinse, repeat.) I suppose this is their version of "super smart fancy thinking mode". Try something more complex instead.


Indeed. Qwen doesn’t just second guess itself, it third and fourth guesses itself.

Solid Terry Pratchett reference right there.

OK thanks! That's helpful. I ignorantly assumed simpler prompt == faster first response.

I did not know that NVFP4 was handled at the silicon level... until I dug deeper here - https://vectree.io/c/llm-quantization-from-weights-to-bits-g...
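For a feel for what the format actually is: NVFP4 packs each weight into a 4-bit float (E2M1: one sign, two exponent, one mantissa bit), so a weight can only land on a handful of values. The toy quantizer below rounds to the nearest E2M1 value; the per-block FP8 scaling that real NVFP4 adds on top is omitted, and the function names are mine, not from any library.

```python
# E2M1 magnitudes representable in a 4-bit float: two subnormals (0, 0.5)
# and six normals (1..6). Sign is a separate bit.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_e2m1(x: float) -> float:
    """Round x to the nearest E2M1-representable value (toy sketch,
    ignoring NVFP4's shared per-block scale factors)."""
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), 6.0)  # magnitudes beyond the grid saturate at 6
    return sign * min(E2M1_GRID, key=lambda g: abs(g - mag))

print(quantize_e2m1(2.7))    # rounds up to 3.0
print(quantize_e2m1(-0.7))   # rounds to -0.5
print(quantize_e2m1(100.0))  # saturates at 6.0
```

The coarseness of that grid is why the shared scale factor per block matters so much: it is what maps each block's dynamic range onto those few representable values.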

I still don't think I understand it. I saw those nvfp4 models pop up by chance yesterday and tried them on my Linux PC with a 5060TI 16gb. Ollama refused to pull them, saying they were macOS only.

I assumed it was a metadata bug and posted an issue, but apparently nvfp4 doesn't necessarily mean nvidia-fp4.

https://github.com/ollama/ollama/issues/15149


They are nvidia-fp4 weights, but CUDA support isn't _quite_ ready yet; we've got that cooking.

I made my M2 Max generate a biryani recipe for me last night with 64gb ram and the baseline qwen3.5:35b model. I used the newest ollama with MLX.

https://gist.github.com/kylehotchkiss/8f28e6c75f22a56e8d2d31...

Under 3 minutes to get all that. The thinking is amusing and my laptop got quite warm, but for a 35b model on nearly 4-year-old hardware, I see the light. This is the future.


The 35b-a3b-coding-nvfp4 model has the recommended hyperparameters set for coding, not chatting. If you want to use it to chat, you can pull the `35b-a3b-nvfp4` model (it doesn't need to re-download the weights, so it will pull quickly), which has a presence penalty enabled that stops it from thinking so much. You can also try `/set nothink` in the CLI, which will turn off thinking entirely.
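The same no-think toggle is also available over the local API for thinking-capable models, via the `think` field on Ollama's `/api/generate` endpoint (check your Ollama version supports it). The sketch below just builds and inspects the request body; nothing is sent, and the model name is the one from this thread.

```python
import json

def build_generate_request(model: str, prompt: str, think: bool = False) -> str:
    """Build the JSON body for a POST to http://localhost:11434/api/generate.
    Setting think=False is the API-side equivalent of `/set nothink`."""
    payload = {
        "model": model,
        "prompt": prompt,
        "think": think,   # False suppresses the thinking phase
        "stream": False,
    }
    return json.dumps(payload)

body = build_generate_request("qwen3.5:35b-a3b-nvfp4", "Hello world")
print(body)
```

You would then POST that body with curl or `urllib.request` to the local Ollama server.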

> it can take anywhere between 6-25 seconds for a response (after lots of thinking) from me asking "Hello world".

That's not a surprising result given the pretty ambiguous query, hence all the thinking. Asking "write a simple hello world program in python3" results in a much faster response for me (m4 base w/ 24gb, using qwen3.6:9b).


Well, two things. First, “hi” isn’t a good prompt for these thinking models. They’ll have an identity crisis trying to answer it. Stupid, but it’s how it is. Stick to real questions.

Second, for the best performance on a Mac you want to use an MLX model.


Thanks! I assumed simpler == faster, but my ignorance is showing itself.

I am using the model they recommended in the blog post - which I assumed was using MLX?


Avoid reasoning models in any situation where you have low tokens/second
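The arithmetic behind that advice: all the thinking tokens are generated *before* the first answer token, so time-to-first-useful-token scales with thinking length divided by throughput. The numbers below are illustrative assumptions, not benchmarks.

```python
def seconds_to_first_answer(thinking_tokens: int, tokens_per_second: float) -> float:
    """Rough time before any answer text appears: the whole thinking
    trace has to be generated first."""
    return thinking_tokens / tokens_per_second

# A model that "thinks" for 1500 tokens:
print(seconds_to_first_answer(1500, 60.0))  # 25.0 s at a decent local speed
print(seconds_to_first_answer(1500, 10.0))  # 150.0 s when throughput is low
```

A non-reasoning model at the same low throughput starts answering almost immediately, which is why it can feel dramatically faster even if its total token rate is identical.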

When MLX comes out you will see a huge difference. I moved to LM Studio for now, since it already supports MLX.

I keep emailing my (Labour) MP about this, I suggest you do the same! I get the standard "protecting the children" response. I am not voting Labour again if this madness is still in place (or worse!) at the next GE.


MPs are pretty bad at dealing with anything that doesn't come from the party or the newspapers. I'm donating to the Open Rights Group to care about this on my behalf.

(my MP is SNP, so I benefit from not being in the two party trap)


Hey! The demo didn't work in Firefox. It said something about setting up the database, then the tab crashed.


It doesn't work in Incognito mode. Did you try it without incognito?


Why?



Oh actually, sorry, I lied. I recently switched to Vanadium as my default browser, which is the modified Chromium build that ships with GrapheneOS. Apologies!


You can use tailscale services to do this now:

https://tailscale.com/docs/features/tailscale-services

Then you can access stuff on your tailnet by going to http://service instead of http://ip:port

It works well! The only thing missing now is TLS.


This would be perfect with TLS. The docs don't make this clear...

> tailscale serve --service=svc:web-server --https=443 127.0.0.1:8080

> http://web-server.<tailnet-name>.ts.net:443/

> |-- proxy http://127.0.0.1:8080

> When you use the tailscale serve command with the HTTPS protocol, Tailscale automatically provisions a TLS certificate for your unique tailnet DNS name.

So is the certificate not valid? The 'Limitations' section doesn't mention anything about TLS either:

https://tailscale.com/docs/features/tailscale-services#limit...


I think maybe TLS would work if you were to go to https://service.yourts.net domain, but I've not tried that.


It works, I’m using tailscale services with https


Thanks for clarifying :) I'll try it out this weekend.


NatWest and Monzo work fine on my Pixel 9a running GrapheneOS. Community maintained list of supported banking apps here:

https://privsec.dev/posts/android/banking-applications-compa...

Google Wallet is not supported at all.


Curve works and you can set that up as a replacement for Google Pay.


with avbroot ?


I didn't have to do any resigning or repacking of apks. It just worked when installed from the Play Store.


The UK is no longer in the EU; it is still in Europe and very much European.


Here's a community maintained list of apps and whether or not they work:

https://privsec.dev/posts/android/banking-applications-compa...

This is linked to from the Banking Apps section on GrapheneOS docs: https://grapheneos.org/usage#banking-apps

Sample size of 1: my UK banking apps all work fine.


Yeah spot on. I think this is the only thing that's been announced so far: https://www.androidauthority.com/graphene-os-major-android-o...


This is similar to Deno Sandbox[1], which was announced a couple of weeks back. Apparently something similar is also done with fly.io's tokenizer[2][3].

[1]: https://deno.com/blog/introducing-deno-sandbox

[2]: https://news.ycombinator.com/item?id=46874959

[3]: https://github.com/superfly/tokenizer


My friend made this site to try and surface the best place to buy music: https://streamtoshelf.com/

He also made a section of the site that allows you to log in via Spotify; it aggregates your listening history and tells you how much it would cost to buy all of your most-listened-to albums. Annoyingly, Spotify seems to restrict the OAuth app creation process, so users have to be invited by email to access that.
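The aggregation that feature describes can be sketched in a few lines: count plays per album from a listening history, then price the most-listened albums. Field names, the flat per-album price, and the function name below are all my assumptions for illustration, not streamtoshelf's actual implementation.

```python
from collections import Counter

def cost_of_top_albums(history, top_n=3, price_per_album=25.0):
    """Return the top_n most-played albums and a rough cost to buy them,
    assuming a flat price per album."""
    plays = Counter(track["album"] for track in history)
    top = [album for album, _ in plays.most_common(top_n)]
    return top, len(top) * price_per_album

history = [
    {"track": "a1", "album": "OK Computer"},
    {"track": "a2", "album": "OK Computer"},
    {"track": "b1", "album": "Blue"},
    {"track": "c1", "album": "Kind of Blue"},
    {"track": "b2", "album": "Blue"},
]

albums, cost = cost_of_top_albums(history, top_n=2)
print(albums, cost)  # ['OK Computer', 'Blue'] 50.0
```

The real feature would pull `history` from Spotify's listening-history endpoints after the OAuth dance, which is the part the app-creation restriction gets in the way of.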

