
MIT already has an excellent class on Ethics for Engineers: https://e4e.mit.edu/


Clearly failing the students, seeing how they still work for evil companies.


We (@anishathalye, @jjgo, @jonhoo) returned to MIT during IAP (January term) 2026 to teach a new iteration of Missing Semester, a class covering topics that are missing from the standard computer science curriculum.

Over the years, the three of us helped teach several classes at MIT, and over and over again we saw that students had limited knowledge of tools available to them. Computers were built to automate manual tasks, yet students often perform repetitive tasks by hand or fail to take full advantage of powerful tools such as version control and IDEs. Common examples include manually renaming a symbol across many source code files, or using the nuclear approach to fix a Git repository (https://xkcd.com/1597/).
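The "manually renaming a symbol across many files" example above is the kind of task the class shows how to automate. As a rough illustration (not course material — a real refactor would use an IDE or a syntax-aware tool to avoid false matches), a plain textual rename can be scripted in a few lines of Python:

```python
import pathlib

def rename_symbol(root: str, old: str, new: str, glob: str = "*.py") -> int:
    """Replace every occurrence of `old` with `new` in files under `root`.

    Returns the number of files changed. This is a naive textual replace;
    it will also hit strings and comments, which may or may not be what
    you want.
    """
    changed = 0
    for path in pathlib.Path(root).rglob(glob):
        text = path.read_text()
        if old in text:
            path.write_text(text.replace(old, new))
            changed += 1
    return changed
```

The same job is a one-liner with `grep`/`sed`, which is closer to what the class actually teaches.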

At least at MIT, these topics are not taught as part of the university curriculum: students are never shown how to use these tools, or at least not how to use them efficiently, and thus waste time and effort on tasks that should be simple. The standard CS curriculum is missing critical topics about the computing ecosystem that could make students’ lives significantly easier both during school and after graduation (most jobs do not formally teach these topics either).

To help mitigate this, the three of us developed a class, originally called Hacker Tools in 2019 and then renamed to Missing Semester in 2020 (some great past discussion here: https://news.ycombinator.com/item?id=22226380, https://news.ycombinator.com/item?id=19078281). Over the past several years, we’ve seen the course translated into over a dozen languages, inspire similar courses at other universities, and be adopted by several companies as part of their standard onboarding materials.

Based on feedback and discussions here and elsewhere, along with our updated perspective from working in industry for several years, we have developed a new iteration of the course. The 2026 edition covers several new topics such as packaging/shipping code, code quality, agentic coding, and soft skills. Some things never change, though; we’re still using this hacky Python DSL for editing our multi-camera-angle lecture videos: https://github.com/missing-semester/videos.

As always, we’d love to hear any feedback from the community to help us improve the course content!

—Anish, Jon, and Jose


Yeah, that was just a design choice that I made: I wanted a library that worked with `Iterator`s, felt more lightweight to me / fit my immediate needs better. I'm personally not a huge fan of Pandas DataFrames for certain applications.

LOTUS (by Liana Patel et al., folks from Stanford and Berkeley; https://arxiv.org/abs/2407.11418) extends Pandas DataFrames with semantic operators, you could check out their open-source library: https://github.com/lotus-data/lotus

Semlib does batch requests, that was one of the primary motivations (I wanted to solve some concrete data processing tasks, started using the OpenAI API directly, then started calling LLMs in a for-loop, then wanted concurrency...). Semlib lets you set `max_concurrency` when you construct a session, and then many of the algorithms like `map` and `sort` take advantage of I/O concurrency (e.g., see the heart of the implementation of Quicksort with I/O concurrency: https://github.com/anishathalye/semlib/blob/5fa5c4534b91aa0e...). I wrote a bit more about the origins of this library on my blog, if you are interested: https://anishathalye.com/semlib/

ETA: I interpreted “batching” as I/O concurrency. If you were referring to the batch APIs that some providers offer: Semlib does not use those. They are too slow for the kind of data processing I wanted to do / not great when you have a lot of data dependencies. For example, a semantic Quicksort would take forever if each batch is processed in 24 hours (the upper bound when using Anthropic’s batch APIs, for example).
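For anyone curious, the I/O-concurrency pattern described above (not Semlib's actual code — a minimal sketch, with a stub standing in for the LLM call) boils down to a semaphore-bounded async map:

```python
import asyncio

async def bounded_map(fn, items, max_concurrency: int = 10):
    """Apply async `fn` to every item, with at most `max_concurrency`
    calls in flight at once. Results come back in input order."""
    sem = asyncio.Semaphore(max_concurrency)

    async def run(item):
        async with sem:
            return await fn(item)

    return await asyncio.gather(*(run(x) for x in items))

# Stub standing in for an LLM API call; a real `fn` would await an
# HTTP request to a provider.
async def shout(s: str) -> str:
    await asyncio.sleep(0)  # simulate I/O
    return s.upper()
```

Usage: `asyncio.run(bounded_map(shout, ["a", "b", "c"], max_concurrency=2))`.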


That was a small self-contained example that fit above the fold in the README (and fwiw even last year’s models like GPT-4o give the right output there). That `sort` is based on pairwise comparisons, which is one of the best ways you can do it in terms of accuracy (Qin et al., 2023: https://arxiv.org/abs/2306.17563).
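To make "sort based on pairwise comparisons" concrete: Python's `sorted` only ever asks for pairwise comparisons, so the shape of the approach (not Semlib's implementation — a deterministic stub stands in for the LLM judgment here) can be sketched with `functools.cmp_to_key`:

```python
import functools

def llm_compare(a: str, b: str) -> int:
    """Stub pairwise comparator. A real version would ask an LLM
    'which of these two ranks higher?' and return -1 or 1 accordingly;
    here we compare lengths so the example is deterministic."""
    return -1 if len(a) < len(b) else (1 if len(a) > len(b) else 0)

def pairwise_sort(items, compare=llm_compare):
    # sorted() drives the comparator; each call is one pairwise judgment.
    return sorted(items, key=functools.cmp_to_key(compare))
```

One caveat: `sorted` issues its comparisons sequentially, which is why a concurrent quicksort (as linked above) is a better fit when each comparison is a slow API call.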

I think there are many real use cases where you might want a semantic sort / semantic data processing in general, when there isn’t a deterministic way to do the task and there is not necessarily a single right answer, and some amount of error (due to LLMs being imperfect) is tolerable. See https://semlib.anish.io/examples/arxiv-recommendations/ for one concrete example. In my opinion, the outputs are pretty high quality, to the point where this is practically usable.

These primitives can be _composed_, and that’s where this approach really shines. As a case study, I tried automating a part of performance reviews at my company, and the Semlib+LLM approach did _better_ than me (don’t worry, I didn’t dump AI-generated outputs on people, I first did the work manually, and shared both versions with an explanation of where each version came from). See the case study in https://anishathalye.com/semlib/

There’s also some related academic work in this area that also talks about applications. One of the most compelling IMO is DocETL’s collaboration to analyze police records (https://arxiv.org/abs/2410.12189). Some others you might enjoy checking out are LOTUS (https://arxiv.org/abs/2407.11418v1), Palimpzest (https://arxiv.org/abs/2405.14696), and Aryn (https://arxiv.org/abs/2409.00847).


As you compose fuzzy operations, your errors multiply! Nobody is asking for perfection, but this tool seems to me a straightforward way to launder bad data. If you want to do a quick check of an idea then it's probably great, but if you're going to be rigorous and use hard data and reproducible, understandable methods then I don't think it offers anything. The plea for citations at the end of the readme also rubs me the wrong way.


I think semantic data processing in this style has a nonempty set of use cases (e.g., I find the fuzzy sorting of arXiv papers to be useful, I find the examples in the docs representative of some real-world tasks where this style of data processing makes sense, and I find many of the motivating examples and use cases in the academic work compelling). At the same time, I think there are many tasks for which this approach is not the right one to use.

Sorry you didn't like the wording in the README; that was not the intention. I like to give people a canonical form they can copy-paste if they want to cite the work. Citations have been a mess for many of my other GitHub repos, which makes it hard to find out who is using the work (which can be really informative for improving the software, and I often follow up with authors of papers via email etc.). For example, I heard about Amazon MemoryDB because they use Porcupine (https://dl.acm.org/doi/pdf/10.1145/3626246.3653380). Appreciate you sharing your feelings; I stripped the text from the README. If you have additional suggestions, I'd appreciate your comments or a PR.


FWIW it doesn't serve as a great example because the ordering is not obvious. I think that is what GP was reacting to. When I say "sort a list of presidents by how right-leaning they are" in any other context, people would probably assume the MOST right-leaning president to be listed first. It took me a moment to remember that Python's `sort` is in ascending order by default.


Good point, I see how the example can be confusing. Updated the example to have `reverse=True` and a comment, hopefully that clarifies things.


Thank you for engaging with me so politely and constructively. I care probably more than I should about good science and honesty in academia, and so I feel compelled to push back in cases where I see things like: blatant overstating of capabilities, artificially farming citations.

This case seems to have been a false positive. Surely people will misuse your tool, but that's not your responsibility, as long as you haven't misled them to begin with. Good luck with the project, I hope to someday need to cite the software myself.


For sure! I share your feelings about good science and honesty in academia :)


Hi HN!

I've been thinking a lot about semantic data processing recently. A lot of the attention in AI has been on agents and chatbots (e.g., Claude Code or Claude Desktop), and I think semantic data processing is not well-served by such tools (or frameworks designed for implementing such tools, like LangChain).

As I was working on some concrete semantic data processing problems and writing a lot of Python code (to call LLMs in a for loop, for example, and then adding more and more code to do things like I/O concurrency and caching), I wanted to figure out how to disentangle data processing pipeline logic from LLM orchestration. Functional programming primitives (map, reduce, etc.), common in data processing systems like MapReduce/Flume/Spark, seemed like a natural fit, so I implemented semantic versions of these operators. It's been pretty effective for the data processing tasks I've been trying to do.
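The "for loop plus caching" starting point described above can be sketched like this (a toy, not Semlib's code — the stub stands in for a real LLM client):

```python
import functools

# Stub standing in for a real LLM API call (hypothetical; swap in your
# provider's client here).
def llm_call(prompt: str) -> str:
    return f"summary of: {prompt}"

@functools.lru_cache(maxsize=None)
def cached_llm_call(prompt: str) -> str:
    # Identical prompts hit the cache instead of the (slow, paid) API.
    return llm_call(prompt)

def semantic_map(prompts):
    # The naive for-loop version of a semantic map; I/O concurrency is
    # the natural next layer on top of this.
    return [cached_llm_call(p) for p in prompts]
```

Disentangling this orchestration (caching, concurrency, retries) from the pipeline logic (map, reduce, sort) is the point of the library.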

This blog post shares some more details on the story here and elaborates what I like about this approach to semantic data processing. It also covers some of the related work in this area (like DocETL from Berkeley's EPIC Data Lab, LOTUS from Stanford and Berkeley, and Palimpzest from MIT's Data Systems Group).

Like a lot of my past work, the software itself isn't all that fancy; but it might change the way you think!

The software is open-source at https://github.com/anishathalye/semlib. I'm very curious to hear the Hacker News community's thoughts!


Implementing an ACME client is part of the final lab assignment for MIT’s security class: https://css.csail.mit.edu/6.858/2023/labs/lab5.html


Nice, thanks! I've been wanting to learn it, as dealing with cert expirations every year is a pain. My guess is that we will have 24-hour certs at some point.


I don’t know about 24 hours, but it will be 47 days in 2029.


Looks like a good class; is it only available to enrolled students? The videos seem to be behind a login wall.


Looks like the 2023 lectures weren't uploaded to YouTube, but the lectures from earlier iterations of the class, including 2022, are available publicly. For example, see the YouTube links on https://css.csail.mit.edu/6.858/2022/

(6.858 is the old name of the class; it was recently renamed to 6.5660.)


I see someone posted this before I was able to do it :)

Hi HN! For the last six years, I've been working on techniques to build high-assurance systems using formal verification, with a focus on eliminating side-channel leakage. I'm defending my PhD thesis next week, where I'll talk about our approach to verifying hardware security modules with proofs covering the entire hardware and software system down to the wire-I/O-level. In terms of the artifacts we verify: the biggest example is an ECDSA signature HSM, implemented in 2,300 lines of C code and 13,500 lines of Verilog, and we verify its behavior (capturing correctness, security, and non-leakage) against a succinct 50-line specification.

One of the components that I'm most excited about is how we formally define security for a system at the wire-I/O-level --- we do this with a new security definition called "information-preserving refinement," inspired by the real/ideal paradigm from theoretical cryptography.

HN has been a huge part of my life since I started undergrad about 10 years ago (I post occasionally but mostly read). I would love to see some of the HN community there, whether in-person or over Zoom --- PhD thesis defense talks are open to the public, and my talk is aimed at a general CS/systems audience!


This is really neat!

We've been working on some research to formally verify the hardware/software of such devices [1, 2]. Neat how there are so many shared ideas: we also use a PicoRV32, run on an iCE40 FPGA, use UART for communication to/from the PicoRV32 to keep the security-critical part of the hardware simple, and use a separate MCU to convert between USB and UART.

Interesting decision to make the device stateless. Given that the application keys are generated by combining the UDS, USS, and the hash of the application [3], it seems this rules out software updates? Was this an intentional tradeoff, to have a sort of "forward security"?

In an earlier project I worked on [4], we had run into a similar issue (no space for this in the write-up though); there, we ended up using the following approach: applications are _signed_ by the developer (who can use any keypair they generate), the signature is checked at application load time, and the application-specific key is derived using the hash of the developer's public key instead of the hash of the application. This does have the downside that if the developer is compromised, an adversary can use this to sign a malicious application that can leak the key.

[1]: https://github.com/anishathalye/knox-hsm [2]: https://pdos.csail.mit.edu/papers/knox:osdi22.pdf [3]: https://tillitis.se/blog/2023/03/31/on-tkey-key-generation/ [4]: https://pdos.csail.mit.edu/papers/notary:sosp19.pdf


Thank you. Interesting paper!

As you've already noted, the TKey's KDF is Hash(UDS, Hash(TKey device app), USS), which means every device+application combination gets its own unique key material. As you conclude, this means an update to the loaded application changes the key material, which changes any public key the application might derive. This is a hassle and not very user friendly.

However, nothing prevents the loaded application (A1) from loading another application (A2) in turn. This is a key feature, as it allows A1 to define a verified boot policy of your choice. The immutable firmware would do the KDF using A1's machine code. When running, A1 accepts a public key, a digital signature, and A2 as arguments. A1 measures the public key as context, verifies the digital signature, and then hands off its own contextualized key material to A2. In this example A1 is doing verified boot using some policy, and A2 is the application the end user uses for authentication: FIDO2, TOTP, GPG, etc.
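The derivation chain described above can be sketched as follows (a simplification for illustration; SHA-256 stands in for the device's actual hash function, and signature verification is omitted):

```python
import hashlib

def h(*parts: bytes) -> bytes:
    """Stand-in hash; for illustration only, not the TKey's actual primitive."""
    d = hashlib.sha256()
    for p in parts:
        d.update(p)
    return d.digest()

def derive_key(uds: bytes, app: bytes, uss: bytes) -> bytes:
    # Per-device, per-application key material: Hash(UDS, Hash(app), USS).
    # Changing the app (e.g., an update) changes the derived key.
    return h(uds, h(app), uss)

def handoff(key_a1: bytes, a2: bytes, context: bytes) -> bytes:
    # A1 verifies A2's signature against a developer public key (omitted
    # here), measures that public key as `context`, then mixes its own
    # key material with what it measured before handing off to A2.
    return h(key_a1, h(a2), context)
```

The key property: `derive_key` pins the key to the exact A1 binary, while `handoff` lets A2 be updated without changing keys, as long as it is signed under the same measured public key.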

Regarding compromise of the developer's key, you might want to look into transparency logs. Another project I'm a co-designer of is Sigsum, a transparency log with distributed trust assumptions. We recently tagged it v1, and it should be small enough to fit into a TKey application. We haven't done it yet, though. Too many other things to do. :)


Very cool! That's a nice design that gives the developer the choice on the trade-off between being upgradeable and being future-proof against developer key compromise.

Transparency logs indeed are a neat ingredient to use here. I've heard of other software distributors (e.g., Firefox) using binary transparency logs but hadn't heard of anyone use them in the context of HSMs/security tokens/cryptocurrency wallets yet.


Thank you! We think so too. It is inspired by TCG DICE, which came out of Microsoft Research if I recall correctly. This approach has several other benefits as well (ownership transfer, etc.), which I've outlined in another comment in this thread.

Here's a cool application we've yet to make: Instead of only using the transparency log verification for the verified boot stage, use it in the signing stage as well - imagine a USB authenticator that only signs your software release if the hash to be signed is already discoverable in a transparency log. You could also rely on cosigning witnesses for secure time with distributed trust assumptions, and create policies like "only sign stuff if the current time is Monday-Friday between 09-17". That would require a challenge-response with the log though.

Regarding binary transparency I think Mozilla only considered doing it, but never actually did it. In part this was probably because CAs and CT log operators didn't want CT to be used for BT as well. Speaking of transparency, you might be interested in another project I'm involved with - System Transparency - which aims to make the reachable state space of a remote running system discoverable.


Sharing some context here: in grad school, I spent months writing custom data analysis code and training ML models to find errors in large-scale datasets like ImageNet, work that eventually resulted in this paper (https://arxiv.org/abs/2103.14749) and demo (https://labelerrors.com/).

Since then, I’ve been interested in building tools to automate this sort of analysis. We’ve finally gotten to the point where a web app can do automatically, in a couple of hours, what I spent months doing in Jupyter notebooks back in 2019-2020. It was really neat to see the software we built automatically produce the same figures and tables that are in our papers.

The blog post shared here is results-focused, talking about some of the data and dataset-level issues that a tool using data-centric AI algorithms can automatically find in ImageNet, which we used as a case study. Happy to answer any questions about the post or data-centric AI in general here!
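The core idea behind this style of label-error detection can be sketched in a few lines (a simplification of cleanlab's actual confident-learning algorithm, shown here only to convey the intuition): flag examples whose model-predicted probability for their given label falls below a per-class confidence threshold.

```python
def find_label_issues(pred_probs, labels):
    """Flag likely label errors, in the spirit of confident learning.

    pred_probs: per-example lists of per-class probabilities from a
                trained model (ideally out-of-sample predictions)
    labels:     the given (possibly noisy) integer labels
    An example is flagged when the model's confidence in its given label
    is below the average confidence the model has for that class.
    """
    n_classes = len(pred_probs[0])
    sums = [0.0] * n_classes
    counts = [0] * n_classes
    for probs, y in zip(pred_probs, labels):
        sums[y] += probs[y]
        counts[y] += 1
    # Per-class threshold: mean self-confidence over that class's examples.
    thresholds = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return [i for i, (probs, y) in enumerate(zip(pred_probs, labels))
            if probs[y] < thresholds[y]]
```

The real algorithm (in the cleanlab repo linked below) does considerably more, e.g., estimating the full label-noise joint distribution, but this is the gist.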

P.S. all of our core algorithms are open-source, in case any of you are interested in checking out the code: https://github.com/cleanlab/cleanlab


The class website has a list of resources that includes open-source DCAI tools: https://dcai.csail.mit.edu/resources/

