Vibe Coding

Everyone’s calling it “vibe coding”, and I’m quite enjoying it. It’s fast and fun, like a game where you create your own levels and keep going till you exhaust the model. I go from “working” to “working” and unlock new challenges every day. For example, I’d never coded in TypeScript, but I’ve now built two apps: one for personal finance using bank transactions and a second to estimate income taxes. Next, I want to fine-tune a model and evaluate it using model-generated datasets, things I’d only dreamt of doing last year…

Surprisingly, it’s fun because it’s not always smooth sailing. People are comparing generative AI coding skills with human coding skills, which I think misses the point. When I understand what the model is doing and why it’s doing it, it feels like moving at the speed of logic rather than being constrained by language and syntax.

I must admit that part of the fun is discovering the model’s limits. Maybe these are Cursor’s limits rather than the model’s? For the record, I’ve been using Cursor with its default model, Claude 3.5 Sonnet. I’ve also used Claude Code, but that was a different experience; I might write about it at a later point. Yesterday, someone asked me what it was like to use Cursor. So here’s a short roundup of what I discovered.

Always be vigilant

Confabulation: Even when I provided reference documentation from Stripe, the model (Cursor?) confidently responded with something along the lines of “Looking at the documentation, this is correct.” It only relented when I asked it to provide a reference. If I hadn’t read the documentation myself, it might have taken me much longer to debug situations where it generated handlers for completely made-up webhook events.

Blindness: The model uses its previously generated code as ground truth. When I manually fixed code, the fix didn’t seem to affect the model’s subsequent generations, though asking it to fix the mistake helped it recognize the “fix” before I could continue. For example, when I renamed a field in a TypeScript interface (e.g. short_term_capital_gains instead of just capital_gains), it would not pick this up in the code it generated afterwards. At least a couple of times, it created new files without taking the repo structure into account. For example, it created a new scripts folder for database migration at the root level when it had previously created similar scripts under src/app/db.

Specificity is no antidote to eagerness

Overreaching: When I asked it to add a new button and selector to a web interface, it added half a dozen new packages and four new files. We already had Tailwind CSS and basic components for these features, and I wasn’t expecting it to make the task 10x more complex than it needed to be. Fortunately, reverting was simple.

Overassuming: On another occasion, I prompted it to add specific new fields to one TypeScript interface, but it also added them to another interface for no reason. This worried me about what else it had assumed… maybe most of its assumptions were good?

Don’t bank on its stamina

Anti-endurance: Ironically, sessions that don’t go well last longer. The session where it confidently misinterpreted Stripe’s documentation was also the session that dragged on for a couple of hours. At some point, it froze in the middle of responses, leaving the repo code in an intermediate state. Rerunning the prompt sometimes resolved the issue. After a point, it stopped responding altogether; it would only answer short and simple questions after that.

Anti-recall: On a few occasions, it added dependencies even after we had explicitly decided against them. For example, I began the project by asking for options and trade-offs, and directed it to proceed without taking a dependency on the Prisma ORM. When I tried to add new features, it would forget this instruction and add code requiring Prisma. This occurred often enough that I asked it to find a way to remember it. It proposed creating a project plan without realizing that we had created one at the beginning of the project. I wasn’t sure if updating it would help it remember better, but the issue hasn’t reappeared since.

Intelligence != Ownership

No Reflections: On many occasions, it would leave behind spurious code or files, making no effort to clean up after I had redirected its suggested solution. Sometimes this cleanup became difficult as I lost track of all the changes that had to be reversed. Maybe this is a good thing, because I wouldn’t want it getting too eager and overthinking its past results!

Inventing Factories

“If I had asked people what they wanted, they would have said faster horses,” Henry Ford is famously quoted as saying. Yet Ford didn’t invent cars to replace horses. He invented factories to make cars cheaper and more accessible. Factories that would make the car, according to Ford, ‘so low in price that no man making a good salary will be unable to own one.’

When Ford developed the Model T, cars had been around for decades, made ‘artisanally’, if you wish. They were expensive and unreliable. Ford’s goals were affordability and reliability. His first idea was consistently interchangeable auto parts. His second was the moving assembly line, which reduced the time workers spent walking around the shop floor to procure and fit components into a car. ‘Progress through cautious, well-founded experiments,’ is a real quote from Ford. The 1908 Model T was the 20th in a line of models that began with the Model A in 1903.

Borrowing a leaf from the start of the last century: more than inventing new foundation models (FMs), we need to invent the factories that make such models trivially affordable and reliable.

Billions of Models

If FMs are the new ‘central processing units’, technically there’s no limit to the number and variety of ‘programmable’ CPUs we can now produce. A typical CPU chip requires $100-500M in research and design, $10-20B in fabrication capacity, and 3-5 years from concept to market. Today, FMs can be pretrained within months (if not weeks) for much less. On this path, Google and DeepSeek, more than OpenAI, have accelerated affordability.

| Model Provider | Small Model | Mid-Size Model | Reasoning Model |
|---|---|---|---|
| Anthropic | Haiku 3.5: $0.8/MIT, $4/MOT | Sonnet 3.7: $3/MIT, $15/MOT | N/A |
| OpenAI | GPT 4o-mini: $0.15/MIT, $0.60/MOT | GPT 4o: $2.5/MIT, $10/MOT | o3-mini: $1.1/MIT, $4.4/MOT |
| DeepSeek | N/A | V3: $0.27/MIT, $1.10/MOT | R1: $0.55/MIT, $2.19/MOT |
| Google | Gemini 2.0 Flash: $0.10/MIT, $0.40/MOT | Gemini 1.5 Pro: $1.25/MIT, $5/MOT | Gemini 2.0 Flash Thinking: Pricing N/A |

Note: MIT: Million Input Tokens, MOT: Million Output Tokens

Affordability

Affordability is admittedly relative. Software developers may be willing to pay more than marketing content creators. The value of research for knowledge workers depends on the decisions it enables them to make. Customer support may be valuable, but no more than the total cost of employing and managing human support representatives. When the returns (or savings) are clear, there is quantifiable demand.

OpenAI loses more than twice what it makes, and it needs to cut costs by roughly an order of magnitude to be sustainably profitable. If Nvidia’s Blackwell chips deliver the promised 40x price-to-performance improvement, this will be the year of non-absurd business models. More power to them.

It’s possible that DeepSeek is already there. More importantly, DeepSeek might represent the price level of an API provider that doesn’t have an application business to cannibalize. Is it ironic that OpenAI is facing an innovator’s dilemma of its own?

Meanwhile, Anthropic charges a premium over OpenAI’s application-level rates. They also need an order-of-magnitude reduction. They might already be there with TPUs, or, with Trainium 2’s 75% price drop, they’re likely getting there. It’s unclear if they have a cannibalization issue yet, though their CPO definitely wants their product teams to iterate faster.

Training and adapting the model to meet specific and evolving customer expectations is the business need. On this point, popular applications such as Perplexity and Cursor/Windsurf are arguably underrated. Just as Midjourney provides a delightful experience by getting the combination of model and user experience just right, these applications are taking their shot. After all, the model is a software component, and application developers want to shape it endlessly for their end users. The faster these developers iterate on their models based on feedback from their applications, the faster they’ll see product-market fit. They can then figure out how to grow more efficient. Finding product-market fit is the only path to affordability.

People mistake such applications for ‘wrappers’ around the model or ‘just’ interface engineering. That’s a bit like saying Google is just interface engineering over PageRank.

Reliability

For a given use case, reliability is a function of: How often does the model ‘break’? How easy and/or expensive is it to detect and fix?

In creative work, there’s often no wrong answer. And checking the result is generally easier than generating the result.

For automation, it’s more of a spectrum ranging from infeasible to non-compliant, to varying degrees of risky, to safe & expensive, to safe & cost-effective.

What makes the application risky vs. safe? And who underwrites the risk?

One answer is tool use. Multiple tool-use protocols, such as the Model Context Protocol, aim to make FMs more aware of available tools in addition to making tool use more effective and efficient. However, there’s no significant reason (yet) for any major model provider to use another’s protocol. I expect protocols to emerge from most if not all model providers, and I feel that standardization is at least a year or two away. Even then, new standards may usurp older ones, and different economic and geopolitical agendas could shape them in weird ways.

However, a sophisticated ‘tool’ or service really wants to be an agent. When multiple agents need to work together, we need distributed ownership, separation of concerns, authentication, authorization, auditability, interoperability, control, non-repudiation, and a lot more. Much of this plumbing already exists with OAuth 2.0 and can be repurposed for service agents, but a lot still needs to be built. Whoever builds the most reliable multi-agent collaboration systems will likely grow to become the most trusted.

The Industrial Revolution

Unlike fantasy AI factories that spew pure intelligence as tokens, these factories will ship affordable and reliable engines that can safely power software applications. While we urgently kick off the next manufacturing build-out in the U.S., my guess is that these software factories will take years to build. We need to have started yesterday…

Static Site Creator

The First Project: Setting up a Static Site

This past week, I worked on three projects. The smallest, shortest one was simply setting up this blog using my husband’s very brief but very effective instructions. It took me a couple of hours to get the site running.

The site itself uses Zola, a Rust-based static site engine that renders the whole site as static files, eliminating the need to manage a server or a database. The site’s code is trivial to generate. Alongside all publishable content, this code is maintained on GitHub. We use GitHub Actions to build and deploy the site on AWS. All this is relatively simple and takes no more than a few minutes.

Figuring out how to stitch together AWS resources is a whole different ballgame. I’m usually pretty slow finding my way around the AWS Management Console, and this is what took the majority of the “couple of hours”. It required:

  1. Creating an S3 bucket.
  2. Creating a Hosted Zone in Route53. Adding the name servers from the Hosted Zone to the domain registrar.
  3. Requesting a public certificate from the AWS Certificate Manager. Creating domain verification records in the Hosted Zone.
  4. Creating a Viewer Request CloudFront function to redirect to index.html (a minimal sketch follows this list).
  5. Creating a CloudFront distribution using the ACM certificate and the CloudFront function.
  6. Updating the S3 bucket policy to allow the CloudFront distribution to access the bucket.
  7. Adding A and AAAA records in the Hosted Zone using the CloudFront distribution as the alias.
  8. Adding a GitHub OpenID Connect identity provider in AWS to enable GitHub Actions to update the S3 bucket and create invalidations on the CloudFront distribution.
  9. Creating an IAM policy to allow these actions.
  10. Creating an IAM role with this associated policy so that GitHub Actions can assume this role to perform the permitted actions, with GitHub as the principal.
  11. Updating the GitHub workflow file in our repository to use the appropriate S3 bucket and CloudFront distribution when deploying the site.
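
For the curious, the viewer-request function in step 4 is only a handful of lines. Here’s a minimal sketch (it closely mirrors AWS’s documented sample for this use case, not necessarily my exact code) that rewrites directory-style URLs to the index.html files Zola emits; CloudFront Functions only accept plain JavaScript:

```js
// Sketch: CloudFront Functions viewer-request handler that maps "pretty" URLs
// to the index.html files a static site generator like Zola produces.
function handler(event) {
  var request = event.request;
  var uri = request.uri;

  if (uri.endsWith('/')) {
    // e.g. /projects/ -> /projects/index.html
    request.uri += 'index.html';
  } else if (!uri.includes('.')) {
    // e.g. /projects -> /projects/index.html
    request.uri += '/index.html';
  }

  return request;
}
```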

This was a win, but it wasn’t a real win unless it kicked off another worthy project, right? What if we could automate this couple of hours of effort?

Coincidentally, Anthropic rewarded me with an invite to Claude Code the same day. Before I build generative AI stuff, it’d be nice to use it to build something plain ’n boring… was my reasoning.

The Second Project: Automating Static Site Creation

I started with Claude Code, which was now flush with $25 in credits in addition to my regular Claude subscription. It was immensely fun, besides being incredibly rewarding.

I started by cleaning up and fortifying the documentation that I used to create my blog (above), and sharing this with Claude using the following prompt.

I want to create an agent that follows the instructions in the README.md.

We need to figure out what inputs are needed from the user up front, and what authentication is required from the user as the agent progresses.

Then let’s design the application. Ideally, I’d be able to invoke the agent using the terminal.

The result was a reasonably well designed “application” that included:

  1. utils for logging, configuration, and prompting for credentials
  2. commands for initializing the project as well as setting up GitHub and AWS
  3. services with API calls to perform various tasks using Zola, Git, GitHub, and AWS
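
To make that shape concrete, here’s a hedged TypeScript sketch of how a command might delegate to a service. The names (setupAws, SiteService, and so on) are illustrative assumptions, not the code Claude generated, and the real implementation prompts for credentials and configuration through the utils layer:

```ts
// Illustrative sketch only: names and shapes are assumptions, not the generated code.

// services/: thin wrappers around external APIs (Zola, Git, GitHub, AWS).
interface SiteService {
  createBucket(name: string): Promise<void>;
}

class LoggingS3Service implements SiteService {
  async createBucket(name: string): Promise<void> {
    // The real service would call the AWS SDK here.
    console.log(`[s3] creating bucket ${name}`);
  }
}

// commands/: user-facing entry points invoked from the terminal.
async function setupAws(service: SiteService, siteName: string): Promise<void> {
  // utils/ would handle logging, config, and credential prompts.
  await service.createBucket(`${siteName}-static-site`);
}

// e.g. `npx ts-node cli.ts my-blog`
setupAws(new LoggingS3Service(), process.argv[2] ?? "my-blog").catch(console.error);
```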

Later I learned that this code structure aligns with Anthropic’s Model Context Protocol recommendations on separating “client” and “server” implementations, where the client is embedded in the application (e.g. commands exposed to the user) while the tools (aka services) are implemented on the server side. In situations where the client and server are combined, this unit may notionally represent a complete “agent”.

In this case, the agent is simply automating a series of steps and isn’t using a large language model (LLM). However, at a later stage, we could decide that it needs one to, say, generate a new theme.

Authentication & Authorization

One reason I wanted to avoid any LLM-driven execution is that I wanted to learn about authentication and authorization flows without an additional layer of abstraction. When setting up my blog, I used a personal access token from GitHub to push the code to the remote repo. When implementing the automation, I wanted to use GitHub’s Device Flow for authorizing OAuth apps. While it seems more secure, I’m not sure if it improves the user experience much…
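
For reference, here’s roughly what the Device Flow looks like from the agent’s side. This is a simplified sketch, not my actual implementation: the client ID is a placeholder for a registered OAuth app, the repo scope is an assumption, and error handling (including GitHub’s slow_down response) is omitted.

```ts
// Simplified sketch of GitHub's OAuth Device Flow (endpoints per GitHub's docs).
const CLIENT_ID = "<oauth-app-client-id>"; // placeholder

async function deviceFlowToken(): Promise<string> {
  // Step 1: request a device code and a user code.
  const codeRes = await fetch("https://github.com/login/device/code", {
    method: "POST",
    headers: { Accept: "application/json" },
    body: new URLSearchParams({ client_id: CLIENT_ID, scope: "repo" }),
  });
  const { device_code, user_code, verification_uri, interval } = await codeRes.json();

  // Step 2: ask the user to authorize the app in their browser.
  console.log(`Open ${verification_uri} and enter the code ${user_code}`);

  // Step 3: poll until the user authorizes (or the device code expires).
  while (true) {
    await new Promise((resolve) => setTimeout(resolve, (interval ?? 5) * 1000));
    const tokenRes = await fetch("https://github.com/login/oauth/access_token", {
      method: "POST",
      headers: { Accept: "application/json" },
      body: new URLSearchParams({
        client_id: CLIENT_ID,
        device_code,
        grant_type: "urn:ietf:params:oauth:grant-type:device_code",
      }),
    });
    const data = await tokenRes.json();
    if (data.access_token) return data.access_token;
    if (data.error !== "authorization_pending") throw new Error(data.error);
  }
}
```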

Unfortunately, authentication options for AWS are limited. It doesn’t support OAuth in general (except for Cognito, QuickSight, etc.). For this scenario, I decided to create an IAM user, assign this user the relevant permissions, and use the credentials for this user to perform the required actions1. We might return to replace this with an IAM role in the future2.
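
In code, that boils down to a handful of SDK calls. Here’s a minimal sketch with the AWS SDK for JavaScript v3; the user name and policy ARN are placeholders, and the real automation attaches a policy scoped to the specific bucket, distribution, and hosted zone:

```ts
import {
  IAMClient,
  CreateUserCommand,
  AttachUserPolicyCommand,
  CreateAccessKeyCommand,
} from "@aws-sdk/client-iam";

// Sketch only: in practice the policy should be narrowly scoped, and these
// long-lived credentials are exactly what footnote 2 suggests replacing with a role.
async function createDeployUser(userName: string, policyArn: string) {
  const iam = new IAMClient({});
  await iam.send(new CreateUserCommand({ UserName: userName }));
  await iam.send(new AttachUserPolicyCommand({ UserName: userName, PolicyArn: policyArn }));
  const { AccessKey } = await iam.send(new CreateAccessKeyCommand({ UserName: userName }));
  return AccessKey; // contains AccessKeyId and SecretAccessKey
}
```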

Questions More Than Thoughts

First, while this automation might shrink a couple of hours of work to a few minutes, I suspect it will still require the user to be somewhat technically aware. For example, they must know about name servers to update their domain registrar. What if all domain registrars also provided APIs?

Second, while my “new” site currently doesn’t offer much beyond my project notes, lessons, and reflections, what if I offered a new tool or service? Perhaps I should set up a Model Context Protocol (MCP) server? And then perhaps register it with Anthropic’s (WIP) MCP Registry to become a “well known” provider of the service?

Third, while I could use Google/GitHub as an identity provider for my service, what if I wanted to charge for my services? Would that be a separate “token” to be negotiated with the user? Would the user be required to register and enter their card information on my site again?

Fourth, if I were able to negotiate a payment token with the user, how would I trust each request as being valid? Could the user get a refund if they never made the request, or weren’t satisfied with the service? Would this be negotiated through the application owner invoking the service, or directly with the service provider?

Does it feel like we’re reinventing the web and digital commerce?


1

Amazon does provide authorization grants to read a customer’s profile, but it’s unclear if this is useful to perform any actions on Amazon.com or its associated properties.

2

Use of IAM users is recommended only for specific use cases such as emergency access, applications that don’t use IAM roles, third party clients, or when IAM Identity Center isn’t available. An IAM role is intended to be assumable by anyone who needs it and is not associated with long-term credentials such as a password or access keys.