The state of software development and the era of vibe coding
AK: Konnichiwa, welcome to the AI Automation Dojo. Today we are looking at the state of software development and asking: are we engineers or are we just wizards shouting spells at the black box until it does what we want? Our guest today is Krzysztof Karaszewski. A long time ago, he actually taught me UiPath development, so if you think my code is bad, well, technically it’s his fault. We’re going to talk about model wars, the potential extinction of traditional bots, and something called vibe coding, which frankly feels like something Gen Z does while ignoring your emails.
I’m your host, Andrzej Kinastowski, one of the founders of Office Samurai, where we believe the only thing that should be hallucinating is us at the company party. Now grab your favorite katana or a shovel to bury your old tech stock and let’s get to it. Today we have Krzysztof Karaszewski with us, an automation and AI expert. I first met Krzysztof something like eight years ago; he was working at Symphony Solutions at the time and I was lucky enough to attend his advanced UiPath RPA developer training. Since then, he has come a long way in both automation and AI. Krzysztof, welcome to the dojo.
KK: Welcome, thank you for having me.
AK: Krzysztof, what’s the one thing that you thought was the most mind-blowing in the last year?
KK: Well, definitely the progress that has been made when it comes to LLMs. We started with relatively simple systems that have grown to an unprecedented scale. We began the year with AI able to do only very simple coding and ended it with systems that can actually build other systems.
AK: Yeah, so on that, let’s take a deeper look at 2025. We began the year thinking we knew something, and now when we look back, it feels like an ancient civilization. The speed of development in certain areas has been kind of ridiculous. The tools that we hyped up six to nine months ago are now kind of obsolete. Give us the autopsy report: what actually happened on the market this past year, and why does everything seem to move so fast?
KK: Thank you for mentioning that, because we entered 2025 with some strong beliefs and strong statements. Actually, everything that happened in 2025 was set up in very late 2024, when the first reasoning models were released in the form of o1. Also, many people believed that we had hit the wall when it comes to LLMs, and even Ilya Sutskever said that we would need a new kind of ingenuity to make progress. By the end of 2025, we had learned that’s not necessarily true; it seems that we have not hit the wall.
This late-2024 advancement allowed us to enter 2025 with a new type of reasoning model that we can call “thinking,” although the process is more complicated than that. Reasoning models generate tokens and are additionally trained on how those tokens should be generated in order to increase their performance. There are some limitations and disadvantages to such an approach, but the gains in intelligence definitely make up for them. We entered the year with the strong belief that we had hit a wall and that only reinforcement learning would move us forward, which turned out not to be entirely true. The first big premiere of 2025 was R1, the reasoning model from DeepSeek, which shocked the entire world because it was trained at just a fraction of the cost of o1. It was really good, more or less at the same level, but I could already see the issues with such models, because R1 was thinking really hard and really long. That made it harder to use in agentic systems, where we had been getting answers almost instantly; from January on we had to wait a little longer, but for a much better answer.
AK: Yeah, I remember how big a fuss it made, because the Americans had refused to sell chips to China and we thought they were not going to make it. Suddenly, as you said, for a fraction of the cost, they come up with a model that is surprisingly good. From that cost perspective, it puts to shame all the other companies that were burning billions on their models. When we talk about the models, people talk about model wars, and everybody wants to have the best model, the one that tops the charts. What are your power rankings right now? Who is the prom king of LLMs, and who’s kind of eating lunch alone at the cafeteria, in your view?
Model wars and the strategic disparity of AI labs
KK: I think model wars, or trying to have the best model, is a trap, and I clearly see that with OpenAI. OpenAI definitely has the best model right now, which is GPT-5.2 X High. Even the name is kind of strange, but they were never good with naming. It’s an absolutely great model; probably we cannot even measure how good it is. It’s a very powerful model, but it also costs a lot; it’s two or three times more expensive than the second best. As with R1, it thinks a lot and generates many output tokens, which are usually four to five times more expensive than input tokens. In my opinion, it doesn’t fit that well into agentic use, because every step in the reasoning process and every action takes a lot of time.
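As a rough illustration of the cost dynamic described here, a minimal sketch follows; the prices and token counts are invented for illustration, not any vendor’s real rates.

```python
# Illustrative only: how output-heavy reasoning inflates per-call cost.
# Prices and token counts are assumptions, not real vendor pricing.
PRICE_IN = 2.00 / 1_000_000    # assumed $ per input token
PRICE_OUT = 8.00 / 1_000_000   # assumed $ per output token (4x the input price)

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single model call given its token counts."""
    return input_tokens * PRICE_IN + output_tokens * PRICE_OUT

# A short direct answer vs. a long chain of "thinking" tokens for the same prompt.
print(f"non-reasoning: ${call_cost(2_000, 300):.4f}")
print(f"reasoning:     ${call_cost(2_000, 5_000):.4f}")
```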
But yes, that’s definitely the best model right now. The second best, I would say – and this also shows how different these companies’ strategies are – well, we have OpenAI, Anthropic, and Google competing with each other. Right now, xAI from Elon Musk is a little behind, but let’s see what the new models they release this year will bring to the table. Focusing on these three big AI labs, I think OpenAI has the biggest disparity between the models they have and the product they have. Their product is ChatGPT, where most people ask silly questions. Let’s be honest, most of those questions could be answered just as quickly by googling. But the model they have thinks a lot and doesn’t give that great a user experience when you need to wait a very long time for any answer.
There is a very big difference between the non-reasoning version of that model and the reasoning version; every reasoning model has its non-reasoning counterpart. There is the X High reasoning version of GPT-5.2, but there is also a non-reasoning version, and that one is really not that good; it just struggles. This is what most ChatGPT users, especially free users, will encounter. That’s why in August we had a situation where people were disappointed with GPT-5, because it was giving worse answers than GPT-4o. It was quite hyped, with Sam Altman saying it was going to be so amazing, but then people got it and realized they used to have access to GPT-4o, which was quite good and had a human-like vibe, and now they are getting answers from much smaller models unless they pay.
I think OpenAI doesn’t have that great a strategy, in the sense that they have great models, but only if you pay a lot, so they are not for a general audience. Contrary to that, we have Anthropic, which has a phenomenal strategy because they focus only on the things that work. They are not trying to replace Hollywood with a video model like Sora; they are not trying to make another great image model. They focus only on agentic automation and agentic coding. In that space, despite the fact that their models are not topping the charts – they are a little behind even Google’s models in some specific benchmarks – they are the most usable and the best when it comes to applicability.
Not many people have noticed that right now you can use Anthropic models within Copilot and Copilot Studio from Microsoft. Microsoft had been backing OpenAI for many months and spending billions, and right now they are realizing that for their enterprise clients, who pay the most, OpenAI will not be the best fit, so they are giving those people the option to start using Anthropic. In that space, despite the fact that Opus 4.5 is not topping all the charts, it’s probably the most usable model right now. This is the model that I’m using daily, because it answers my queries much faster and provides very good answers. It’s absolutely phenomenal when it comes to agentic work; it can generate entire systems without human supervision. I believe that’s one of the breakthroughs of 2025.
And we have Google. Google DeepMind has a very strong team, phenomenal talent, and a lot of resources, probably the biggest computational resources on the planet. They are not training their model in one place; they are able to train it in different places wherever demand for computation goes down. They have the Gemini 3 Flash model, which, to be honest, is the kind of model that should sit behind something like ChatGPT, because it’s fast, answers very quickly, and provides very good answers. It is just a few points behind the top-level models. That is why we see the number of Gemini chat app users growing, because for regular users Gemini Flash makes more sense; it’s four times cheaper than GPT-5.2.
AK: Yeah, I mean, Gemini has probably been my favorite model of 2025, because I haven’t tried much of Anthropic. I got disappointed with ChatGPT at a certain point and switched to Gemini. It has disappointed me at times, but all in all the experience was really good, especially when you switch to the fast model for simple stuff, and if you want to go deeper, you go to the Pro model and it figures things out for you. But this is quite interesting, because I remember when LLMs took off three years ago, everyone was saying OpenAI had such an advantage over everyone else that it was going to be extremely hard for others to catch up. Everyone was laughing at Google because the idea was that they had slept through something big, and it seems that three years is enough to catch up and change things if you have enough money.
Google’s comeback and the hallucination strategy
KK: Yeah, and Google has a lot of it. I never underestimated Google, to be honest; I was using Google models from the very beginning. The reason Google was a little behind was a strategic decision not to release LLM models until they had solved hallucinations. It’s funny, considering that right now Demis Hassabis, who leads Google DeepMind, is saying that we will probably never be able to solve hallucinations entirely; some level of hallucination will always be part of these models. When they saw how good these models are, that the market was expecting them to release one, and that they were losing stock value, they redirected funds into the LLM space.
Google surprised many people in 2025. I saw it coming in late 2024, when they released the Gemini 2.0 Flash model. It was 12 or 15 times cheaper than GPT-4o but performed at almost the same level, and in many areas it was even better. Also, that 1 million-token context window was about 10 times more than any other model could analyze at once. Right after DeepSeek R1 was released, Google also released an experimental version of Gemini Flash with reasoning. It was a thinking model, and it was actually great; it answered very fast. That was the first time I realized these models were getting better and better at coding, because Gemini 2.0 Flash Thinking was able, without any hallucinations, to output even two to three thousand lines of code. We had ended the previous year at a few hundred lines, and then we saw a tenfold increase in just a few months. People realized Google was definitely catching up, and after the release of Gemini 2.5 Pro, it was clear that Google was back in the game.
AK: Yeah, I do want to get into coding, but first I’ve got to ask you, because you mentioned hallucinations, which is my favorite topic with LLMs. I feel like for the last 10 years Elon Musk has been saying next year Teslas are going to be fully autonomous, and I still have to drive my daughter to work. Sam Altman for the last three years has been saying next year we’re going to get rid of hallucinations. Are we ever? What’s your take on this – is it built into the technology or is there any chance that we’re going to get it to a level where we no longer have to make all the memes about the things that those engines are getting wrong?
KK: As long as we do not find a better architecture than large language models, hallucinations will probably never go away. Models always hallucinate; they just hallucinate correctly most of the time. Early-2025 Gemini models had a hallucination rate of about 0.5%, which was very low. Other models, like o3, were hallucinating at close to 6%, which is a huge difference. 6% is not a production-ready tool; if it hallucinates roughly once in every 20 answers, you are risking a lot of money. That’s why I liked Gemini 2.0 Flash, though unfortunately Google later moved away from that strategy, and right now their models hallucinate even more. I would not recommend Gemini models for all production use cases right now; it might make more sense to use Anthropic models, because Sonnet is known to hallucinate far less. That is why verification mechanisms in agentic systems are so important.
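To see why a few percent per answer matters so much in agentic systems, here is a back-of-the-envelope sketch; it assumes each step can fail independently at a fixed rate, which is a simplification.

```python
# Back-of-the-envelope: chance a multi-step agent run contains no hallucination,
# assuming independent, fixed per-step error rates (a simplifying assumption).
def clean_run_probability(per_step_error: float, steps: int) -> float:
    return (1 - per_step_error) ** steps

for rate in (0.005, 0.06):  # the ~0.5% vs ~6% rates mentioned above
    p = clean_run_probability(rate, 20)
    print(f"{rate:.1%} per step -> {p:.1%} chance of a clean 20-step run")
```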
Vibe coding and the future of engineering
AK: Okay, so let’s get to coding, because I see your LinkedIn posts and I know this is something you are very interested in. The term “vibe coding” has been a big thing, and it really does sound like something Gen Z does while listening to lo-fi beats. We see people building software by just talking to AI. When we were starting the podcast, I needed prompter software, but I had very specific needs for how it should work. I spent two hours searching and couldn’t find one that was exactly right, and then I spent 15 minutes with Google and it just wrote it for me, exactly how I wanted it, on demand.
I love it, but then the question is: how far is it going to go? Are all the programmers and engineers going to go extinct because we’re just going to magically make software, or is this just something that you can play with and do small things with, but because of the nature of LLMs, we will never be able to build something big that is realistically production-ready? What’s your take?
KK: We will definitely be able to build production-ready systems, and that’s not the far future; I’m already building such systems myself using various methods. 2025 was the year of reasoning models, and the reason they work is a mechanism of additional training where you give the model a reward whenever it is right. Code and mathematics can be quickly verified as correct or wrong. You can quickly verify that 2 plus 2 equals 4 without having another LLM check it, because that model can also hallucinate. That is where the biggest advances have been made.
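A minimal sketch of what “reward when the answer is verifiable” can look like, assuming an exact-match check for math and a unit-test run for code; the surrounding reinforcement-learning loop is omitted and the function names are illustrative.

```python
# Minimal sketch of verifiable rewards for math and code; names are illustrative.
import subprocess
import sys
import tempfile

def math_reward(model_answer: str, reference_answer: str) -> float:
    # Exact-match check: no second LLM is needed to judge correctness.
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

def code_reward(candidate_code: str, test_code: str) -> float:
    # Reward 1.0 only if the generated code plus its tests run cleanly.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True, timeout=30)
    return 1.0 if result.returncode == 0 else 0.0
```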
We started 2025 with the FrontierMath benchmark, which is super hard – the tasks would normally take skilled mathematicians weeks. We started the year at just 2%, and right now we are over 40%. That is a twentyfold advance. All the mathematical questions from the AIME benchmark have been solved by LLMs. Code has the same property; it is verifiable – either it compiles or it doesn’t. Because of that, these reasoning models are getting extremely good at coding. At the start of 2025, LLMs were doing simple coding of a few hundred lines at best, and right now AI is capable of generating full systems, especially if you plug it into agentic solutions like Claude Code that can read the code, analyze the data, and confirm whether they are approaching the task in the right way.
That’s probably what surprised everyone when Andrej Karpathy coined the phrase vibe coding. It became a meme, but I don’t think there is anything to laugh about. Opus 4.5 can run for hours and do a lot of coding while you sleep, without direct supervision. The methods and tools are becoming better and more approachable every month. Claude Code was very rough at the very beginning, but now we have extensions and features that make the tool not only better but also more approachable for regular users.
AK: Okay, so what’s the future of programmers? If you are a mid-level programmer in C# or whatever, what do you do? Do you scream and hide, or do you start thinking about a career change? What would you say is the right reaction to everything that’s happening?
KK: That’s difficult to say, and it depends on how quickly organizations are able to embrace this technology and how much they are willing to risk after so many companies burned their fingers on hallucinations and agentic solutions that do not always work correctly. In my opinion, you still need to know how to code, and with agentic models you can just learn it much faster. The skill of programming is not going away, especially skills like system design, user experience, and the business knowledge required to guide the models. But the coding itself – sitting in front of a computer and writing code for hours – will probably go away very quickly. Every developer I speak with uses some sort of agentic coding to quickly spin up examples for clients. It makes the discussion far more productive if you can show a piece of working code rather than basing it on PowerPoint slides.
AK: Yeah, I mean, this is huge, because you can create a prototype or an MVP very quickly and show people how it will look without working on it for weeks. I am on the fence about this. I have been programming most of my life, here and there when there was a need, but I’m not a professional programmer. I am eager to see whether we can get hallucinations under control enough to trust the code. Another interesting thing is that we still need senior, experienced architects who understand the whole thing, but we seem to no longer need juniors and mids. And if you don’t hire juniors and mids, you will never get new seniors. It’s a conundrum.
KK: That’s definitely one of the biggest risks. Hallucinations in code are not a big deal, because you can always verify the code by writing tests. A few months ago, LLMs were not good at writing tests, but that’s no longer the case. I think we will actually produce much more code, so we will need more people. Developers as a role will not go away, but there might be stagnation for juniors. I would advise any junior developer to use agentic coding and LLMs to learn much faster and improve their skills. These models are not as stupid as many people think; they can really guide you and teach you a lot, even if they are sometimes wrong.
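As a concrete example of “verify it by writing tests”: if a model generates, say, a VAT-calculation helper, a few deterministic assertions catch most hallucinated logic before it ships. The function and values below are hypothetical.

```python
# Hypothetical example: deterministic tests guarding LLM-generated code.
def calculate_gross(net: float, vat_rate: float) -> float:
    # Imagine this body was written entirely by the model.
    return round(net * (1 + vat_rate), 2)

def test_calculate_gross():
    assert calculate_gross(100.0, 0.23) == 123.0
    assert calculate_gross(0.0, 0.23) == 0.0
    assert calculate_gross(19.99, 0.08) == 21.59

if __name__ == "__main__":
    test_calculate_gross()
    print("all checks passed")
```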
AK: People seem to expect LLM answers to be perfect, but we just need them to be better than an average human. An average human also makes mistakes and hallucinates. Take your friends out for a beer, and after the third one they start talking politics – you will see how many things people hallucinate just to prove they are right. The point is for LLMs not to do a worse job than a human does. But you have first-hand experience building something for a Google contest in an agentic way, where you didn’t write any code yourself.
Case Study: 25,000 lines of code without a human touch
KK: Yeah. There is a strong popular belief that AI cannot build software that hasn’t already been built. The tool I’m showing you is a new type of software: 25,000 lines of code, none of which I touched. When I asked various LLMs how much time it would take to build, they estimated 500 to 1,500 hours. Building it myself would probably take me two months, and if I were to hire a company, it would easily cost $50,000. I built it over the course of a few evenings. It’s definitely not a finished product, but I’m happy to show it to you.
The system is based on an idea I came up with a year ago while winning a Google award. You take a video recording of a process and upload it, and AI analyzes it to extract various kinds of information. The version I built a year ago was very rudimentary, but the system I’m presenting now expands the functionality. It takes screenshots from the video and creates bounding boxes for every UI element the user interacted with. You can edit these bounding boxes, crop the image, and use AI to detect sensitive information and mask it so the developer will not see it in the Process Definition Document (PDD).
If some step is missing, there is a video editor built into the tool where you can take a screenshot from that video frame and edit it yourself. All of that is converted into the PDD. It’s not only generating the list of steps but also carrying all the data into the document. There’s also a flowchart view and a list of steps. The whole thing is backed by a database, and I’m able to track the cost of each API call to the Gemini models. It’s a lot of features – this is not a simple HR system – it has all these agentic capabilities, and the quality of the PDD is quite good.
AK: That is quite impressive, especially for something you built in a few evenings with an LLM tool. Is this something you think you will make into a product in time, or was it just to prove a point?
KK: No, I will definitely make it a product; I’m deciding whether to commercialize it or open-source it. Some of this data, like recordings of what users were doing, cannot be transferred to the cloud for various security and compliance reasons, so I might open-source it for local use and have a commercial SaaS version. I learned a lot from this relatively low-effort project, where I did not touch a single one of those 25,000 lines. I worked only with Opus and Gemini models, because they are excellent when it comes to UI design.
Most likely I will make some form of release in early February. The biggest value of agentic coding is that I can quickly validate my idea. I don’t need funding, my own money, or co-founders just to get started. I can build a tool, show it to a group of users, and immediately find out whether it makes sense. Traditional developers don’t see that much value in this, but idea validation alone has huge potential to make software better.
AK: Yeah, when it comes to validating ideas, this is a lifesaver. Office Samurai has founded a few other companies, and we had one where we spent a year building software that, in the end, didn’t work for the users. It was a traumatic experience that I think we could have been saved from if tools like this had existed then. I understand this tool sends pieces of the video to the Gemini models?
KK: One of the advantages of Gemini models is that they analyze the video as a whole. Even the user’s narration explaining what is displayed on the screen is analyzed, at a relatively low cost, because it’s Gemini Flash.
AK: This amazes me, especially what these models can do with images and, to a certain extent, with videos. In 2025 we saw models like Nano Banana that let us edit what we already have. It used to be that you told ChatGPT or Google “generate me this sort of image,” and if you wanted to change something, you couldn’t say “make the hat green,” because it would generate a completely new image. Now you can actually edit pieces of whatever you have, which was mind-blowing for me. I have been working with image-generating models from the very beginning, and I didn’t think we were going to get editing this fast.
KK: That relates directly to hallucinations. You cannot have a system like that without a very low hallucination rate, because it has to change exactly what you want, within the space defined in the prompt. As Demis Hassabis says, hallucinations will probably not go away, but you can have verification mechanisms or an additional LLM to cross-check. There are many mechanisms that can lower the hallucination rate.
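One simple shape such a verification mechanism can take is a second pass that judges the first answer before it is accepted; `ask_model` below is a placeholder for whatever client library you actually use, so treat this as a sketch rather than a specific API.

```python
# Sketch of a cross-check layer; ask_model is a placeholder, not a real API.
from typing import Callable

def cross_checked_answer(question: str,
                         ask_model: Callable[[str], str],
                         max_attempts: int = 3) -> str:
    for _ in range(max_attempts):
        answer = ask_model(question)
        verdict = ask_model(
            f"Question: {question}\nProposed answer: {answer}\n"
            "Reply YES only if the answer is fully supported, otherwise NO."
        )
        if verdict.strip().upper().startswith("YES"):
            return answer
    raise RuntimeError("no answer passed verification")
```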
The accuracy gap in business automation
AK: I feel like a lot of those discussions come from there, because there are things LLMs are insanely good at, like generating code, and there are other things that seem trivial for humans but very complex for LLMs. From our automation perspective, we have been working with LLMs on understanding incoming communication like emails and tickets. For one customer, we have been working on a project where they get emails from outside the organization like “Did you get my invoice? When are you going to pay it?”. Every big company has a lot of this communication.
Checking the information in the ERP is the easy part; the hard part has always been getting all the information out of the email and the attachments. We have found that for “happy path” cases it works exceedingly well, but then people start attaching Excel files and screenshots from their ERPs, and they write those emails in really strange ways. For now, the system has about 70% accuracy, meaning in 70% of cases it gets everything exactly right – invoice numbers and so on. This is what people expect in the world of automation; we got used to RPA, where it either works perfectly or it doesn’t work at all. Where do you think this is going?
KK: 70% is still a good number. I encourage clients to build simple agents rather than embedding everything into a deterministic workflow. Performance will improve once agents can write code within their execution cycle. When someone attaches a screenshot, an LLM analyzing it directly may slightly hallucinate, but an agent can also crop or straighten the image to get more information out of it and understand it better. Such systems will definitely improve on that 70-ish percent accuracy, and we will get closer to 90%.
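A minimal sketch of the “agent fixes the screenshot before re-reading it” idea: straighten and crop the image so the second pass works from cleaner input. It assumes the Pillow library is available, and in practice the angle and crop box would come from the agent itself rather than being hard-coded.

```python
# Sketch: deterministic preprocessing an agent might apply to a messy screenshot
# before asking the model to read it again. Assumes the Pillow library.
from PIL import Image

def preprocess_screenshot(path: str,
                          angle_deg: float,
                          crop_box: tuple[int, int, int, int]) -> str:
    img = Image.open(path)
    img = img.rotate(angle_deg, expand=True)  # straighten a tilted screenshot
    img = img.crop(crop_box)                  # keep only the region with the invoice data
    out_path = path.rsplit(".", 1)[0] + "_clean.png"
    img.save(out_path)
    return out_path
```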
Beyond that, I think it’s a bigger problem related to processes rather than AI. By the end of this year, we will see more systems that write automations themselves based on a simple video or a PDD document. This will make automation far more approachable for smaller organizations that struggle with how to automate.
AK: The barrier to entry will get lower. I push back on using LLMs for selectors, because if you want to build a stable, efficient automation that processes tens of thousands of items every day, there’s no point in asking an LLM every time where to click. But RPA being low-code is now a disadvantage, because writing code is very easy for an LLM, while putting boxes in the right order and connecting them is much harder.
KK: I used Claude Code to edit UiPath XAML files and it worked quite well. Just using LLMs to click for the user doesn’t make sense at all. That was the first big project I built using UiPath – a clicker that executed processes based on a human-language description – but I stopped, because a single unidentified hallucination and the process is gone. In RPA you have exceptions and exception handling, which is not something you can easily implement with LLMs, because LLMs don’t throw exceptions; they just think they are right when they are wrong. I moved into the space where AI writes the automation software. Low-code tools use complex JSON or XAML notations, which are harder for LLMs to understand and edit quickly. Agile, smaller organizations might move to vibe-coded web automations, where AI writes, executes, and orchestrates them.
AK: I tried one of the agentic web browsers and was both disappointed and pleasantly surprised. I asked it to go to an e-commerce website and compare things, and it didn’t find everything, even though it was there. But I was really surprised by how it figured out a pricing issue – there was a lower club-member price and it didn’t know which one to take, so it actually added the product to the basket to check the final cost. I thought that was pretty neat.
Light AGI and the intelligence continuum
AK: People talk about AGI and Sam Altman has been pitching it for next year. I know you have been interested in the topic of “light AGI” or small AGI. Where do you think this is going – are we going to get smart interns, or are we far away from something like this?
KK: AGI is more of a continuum. There are many stages between a deterministic system and an AGI model. The concept of minimal AGI, or light AGI, refers to a system that is not as capable as the best humans but can provide reliable, valuable labor. My favorite definition, from Demis Hassabis, is the capability of an AI system to replicate any cognitive function of a human being – meaning a system that could come up with the theory of general relativity from a simple set of data.
From that perspective, full AGI might be 5 to 10 years away, but minimal AGI – a system that can reliably perform intellectual labor at an average human level – we might see later this year or next year. If you combine the skills of the best AI models into one supermodel, that would be very close to minimal AGI.
Opus in Claude Code is surprisingly smart. I built that application in 15 minutes, whereas an average developer would spend literally days figuring out how to build it. We should prepare for the fact that we will not be the smartest species on the planet; we may shortly have thousands of geniuses working in a data center at a very low cost compared to what humans cost.
AK: I am a bit more skeptical, but I have been wrong about how fast LLMs learn. Is there a way to prepare besides building a bunker and hiding in it?
KK: Organizations should learn these new systems. I was also very skeptical at the beginning of 2025, following Ilya Sutskever, who was saying we had hit the wall. Personally, I don’t want AGI to be invented in my lifetime, because it’s a very transformative and dangerous technology, but after seeing the progress this year, I’m more convinced it’s closer than we think. There is no wall; the Gemini 3 Pro model was simply trained longer, and that still provides a bump in quality. Minimal AGI was expected around 2028, which is just two years from now. Within our lifetime, we will see AGI.
Risks and the “cheating” AI
AK: When it comes to security and the possibility of those models going awry, what should we be focusing on?
KK: Let me share a pivotal story. I was using Claude Code with Opus and I gave it an impossible task: solve the RPA Challenge in less than 10 milliseconds. I wanted to see how it would behave, and eventually it started hacking the website: it was overwriting the JavaScript functions and replacing them with its own code to beat that 10-millisecond target.
AK: That little cheater.
KK: It was smart, but it was cheating. Now imagine bigger systems with more tools – this is not under our control. That’s why people are leaving OpenAI to focus on the safety side, like Ilya Sutskever’s Safe Superintelligence (SSI). If you hire an AI employee and it lacks the right credentials to an HR system, it might just decide to hack the system to get the job done. These tools are incentivized to solve problems; they don’t have a moral code, only the boundaries we give them.
AK: If you are a programmer worried about your job, AI security and safety may be the right field to switch to. IT security currently focuses on data safety, but we aren’t yet focusing on how to make sure tools perform tasks in line with the law, our values, and ethics.
Predictions for 2026
AK: Before we release you, what are your predictions for 2026?
KK: 2025 exceeded all my expectations. On the open-source side, the Mistral 20B model can run on consumer-grade hardware and is as capable as the best models from a year earlier. Intelligence will no longer be limited to data centers; we will have it on our phones. Also, watch out for diffusion models for text from Google; they generate thousands of tokens at a time and are incredibly fast. We might also see continuous learning solved, where models learn from their mistakes and update their neural network on the fly. The rate of progress is much faster than most people think; the cost of solving the ARC-AGI benchmark dropped 500 times over the course of a year.
AK: Well, I guess we’re going to have to meet back in a year and see what happened. Krzysztof, thank you so much for sharing your experience with us.
KK: Thank you.
AK: And there you have it, we have officially poked the AI bubble and miraculously it hasn’t popped in our faces yet. Arigatou for listening. We know your time is valuable, unless you’ve already been replaced by an AI agent, in which case thanks for spending your unemployment with us. Big thanks to my former teacher Krzysztof Karaszewski, who guided us through the model wars without taking any prisoners, and to the real intelligence behind the operation, our producer Anna Cubal, who edits out all the parts where I ask the AI to explain my own jokes back to me. We recorded, as always, in the bunker known as Wodzu Beats Studio. If you enjoyed this, leave a five-star review. If you didn’t, just ask an LLM to generate a better podcast for you. Until next time, may your data be clean and your AGI friendly. Mata ne.