Paul Edwin @paul-edwin.hachyderm.io.ap.brid.gy

Building JARVIS Properly - Phase 5: From Ultron's Ruins to JARVIS's Foundation ## Act 1: Picking up the pieces When Mini-Me collapsed under its own weight, it was terribly tempting to declare the entire experiment a failure and simply walk away. It had been, initially, a rather promising AI-agnostic logger, but governance slipped, scope crept, and before long it was a monolith wobbling on foundations that could never hold. The ruins, though, were instructive. Out of that wreckage came a clear view of what truly mattered: modularity, clean interfaces, and a product owner’s steady hand. Phase 5 is therefore not a patch job. It is a rebirth. This is where JARVIS takes its true shape, leaner, more sustainable, and truer to its vision. If Mini-Me was Ultron, powerful but unstable, overreaching until it self-destructed, then what emerges from Phase 5 is JARVIS itself: purposeful, disciplined, and built on foundations solid enough to support the transformation ahead. ## Act 2: The temptation to rush ahead The original plan was straightforward: implement a retrieval layer immediately. RAG, FAISS, vector stores, the whole knowledge retrieval infrastructure. Let JARVIS draw on external memory, make it truly intelligent, and watch it soar. It was seductive. It was also precisely the wrong move. This is where most AI projects go astray. The temptation to add capabilities is overwhelming. Every demo reveals a new possibility, every conversation with a stakeholder surfaces another “wouldn’t it be amazing if” feature. The velocity of modern AI tooling makes it trivially easy to bolt on new functions at lightning speed. But velocity without discipline is just flailing. ## Act 3: The disciplined pivot In Phase 4, I argued that AI tools can generate features at light speed, yet without product ownership discipline, scope creep turns your elegant vision into feature bloat. The moment had arrived for me to heed my own advice! The strategic pivot was clear: the immediate problem was not **memory** , but **trust and control**. The long-term advantage lay not in _what_ JARVIS knew, but in _how_ it arrived at that knowledge. So I delayed RAG. I delayed FAISS. I delayed the entire knowledge layer. Instead, I invested fully in multi-agent orchestration and governance. This meant building the decision-making loop first, ensuring it was robust, auditable, and controllable. Only then would I be ready to give JARVIS real power. **This is delayed gratification as product strategy.** It is contrary to how most AI projects evolve. It requires saying “not yet” when every instinct screams “now”. But it is the only path to building something that endures rather than something that impresses for a fortnight before collapsing. ## Act 4: What actually got built ### A clean-slate architecture The architecture is modular from the ground up. No more dumping everything into a single script and hoping it plays nicely. Instead: * **Agents** (`jarvis/agents`) contain backend-specific adapters. Whether it’s OpenAI, Claude, or Gemini, each conforms to the same interface. * **Services** (`jarvis/services`) handle cross-cutting concerns like logging, search (for a future version), and orchestration logic. * **Data** (`jarvis/data`) holds threads and metadata. Each conversation is its own object, with clean methods for adding messages, following existing sessions, or starting afresh. For now, JSON serves the purpose, but the modular design anticipates migration to more sophisticated backends when the time comes, _e.g._ , graph structures, bidirectional linking, or protocol-driven knowledge stores. * **Resources** (`jarvis/resources/prompts`) defines the special instructions for critique (including self-critique and cross-critique), consensus (including the consensus_last_n special option), and also both the compare & contrast modes. The CLI (`jarvis/cli/main.py`) stitches these parts together. Its job is orchestration, not heavy lifting. The difference is subtle but profound. JARVIS is no longer “code that works for now” but a system that can grow without becoming incomprehensible.1 ### Multi-agent orchestration: The real innovation With foundations in place, JARVIS could take on features that make it a genuine companion rather than a brittle prototype. The orchestration capabilities are where this phase truly shines: **Self-critique and cross-critique modes** introduce checks and balances that most conversational AI systems simply lack. Trusting a single response blindly is risky. JARVIS can now: * Ask one agent to review another’s work (`--critique`) * Make an agent generate and then review its own response (`--self-critique`) * Set multiple agents against each other for mutual review (`--cross-critique`) * Run agents in parallel and synthesise consensus or highlight differences (`--compare` and `--contrast`) But the most powerful feature is **consensus from history**. The `--consensus-last` mode can synthesise a fresh response by analysing the final agent messages from the last _N_ conversation threads. This means JARVIS doesn’t just learn within a conversation. It learns _across_ conversations, building institutional memory without yet needing a full retrieval layer. Imagine running the same complex query against three different agents across five separate sessions, then asking JARVIS to analyse those fifteen responses and provide a meta-synthesis. That’s not a chatbot. That’s a reasoning platform. ### The governance pattern: blast_radius Every new conversation thread now begins with an explicit governance marker: `"blast_radius": "low"`. This is not documentation. This is not a comment. It is a first-class field in the data model, present from the very first message. Here’s why this matters: as JARVIS gains capabilities, particularly when we introduce tool use in the next phase, the potential for unintended consequences grows. A model that can read files might accidentally expose sensitive data. A model that can execute commands might, well, actually execute commands. The `blast_radius` marker is a constraint that travels with every thread. It signals to future orchestration logic what level of action is permissible. A thread marked `"low"` might only answer questions. A thread marked `"medium"` might read files. A thread marked `"high"` might write to disk or call external APIs. This isn’t theoretical. When Phase 6 introduces tool use, the orchestration layer will check this marker before granting any permissions. The governance isn’t bolted on afterwards. It’s baked in from the start. This is what mature AI engineering looks like: Built-in, not Bolt-on. ## Act 5: The foundation is laid There’s a quiet satisfaction in seeing JARVIS operate: not perfect, not finished, but coherent. The scaffolding is sound, the architecture modular, and even small markers like `blast_radius` signal a new level of discipline. JARVIS is no longer just an experiment. It is a platform with foundations solid enough to support the transformation ahead. From Ultron’s chaos, I’ve built something purposeful and restrained. JARVIS is ready to evolve. ## Act 6: The Evolutionary Arc The path forward follows the natural progression of Tony Stark’s own AI evolution, and it’s a deliberate sequence built on delayed gratification, focusing on **control first, capability second** : * **Immediate Horizon: Vision Awakens** JARVIS’s next transformation will grant it the ability to interact with the world: reading files, working with local codebases, and accessing external tools. Like Falcon’s wings extending capability through disciplined tool use, JARVIS will gain power, but this power will be strictly governed by the `blast_radius` marker already in place. This is also where the architecture for knowledge persistence becomes critical. Plain JSON files have served their purpose, but the future demands something more robust: a proper knowledge backend that can handle versioning, relationships, and structured retrieval. Whether through graph databases, structured note systems, or protocol-based context sharing, the foundation must support institutional memory without sacrificing the vendor agnosticism that makes JARVIS unique. This is where JARVIS begins to become **Vision** , _worthy_ of wielding ~~Thor’s Hammer~~ power because restraint is baked into its very nature. * **Medium Term: Friday’s Library** With orchestration proven and tool use safely implemented, the focus can now shift to true knowledge-awareness. Like Friday accessing all of Stark’s historical data and institutional knowledge, JARVIS will finally gain a comprehensive **retrieval layer**. The memory infrastructure originally envisioned will arrive, but only after we’ve proven we can control _what_ the system does with that memory. This phase also represents an opportunity to embrace emerging standards for context and tool integration. Rather than reinventing protocols, JARVIS should participate in the broader ecosystem, _e.g._ , connecting to multiple data sources, exposing capabilities through standard interfaces, and maintaining that critical vendor agnosticism while playing well with others. * **Long Term: House Party Protocol** The ultimate vision explores genuine autonomy. Remember Iron Man 3’s climactic battle, when Tony summoned the entire Iron Legion? That’s the aspiration: **multiple agents working in concert, chaining actions, and operating with minimal human intervention**. By that point, every layer beneath will be solid, auditable, and safe, allowing for reliable, coordinated action. This is not the roadmap of a project chasing shiny objects; it is architecture with intent. From JARVIS to Vision to Friday to the Iron Legion, each stage builds upon the last. This remains a strategy of **delayed gratification as competitive advantage**. ## Closing: The foundations are sound Phase 5 marks the point where this project stopped being a tinkering experiment and started demanding discipline. JARVIS has a body worth protecting and a mind worth nurturing. The temptation to rush to memory was real. The decision to build orchestration first was right. The transformation from JARVIS to Vision begins next. I just need to stop myself from trying to build the flying suit before I’ve finished the brain! * * * 1. For those interested in implementation details: JARVIS now supports full vendor agnosticism with `--agent` (persistent switching) and `--using` (temporary override) flags. Persona management via `--as` allows loading context-specific instruction sets. The three supported backends (OpenAI, Anthropic, Gemini) all implement the same core interface, making vendor lock-in a relic of the past. ↩

21.11.2025 16:50 — 👍 0 🔁 1 💬 0 📌 0

How Do We Effectively Communicate Architecture? One of the key responsibilities of a software architect is communicating architecture effectively. Architecture never exists in a vacuum — it exists to align people, guide decisions, and help teams move toward the same goals. Whether you’re sketching a new system or explaining how existing components fit together, effective communication means helping others understand the structure, purpose, and implications of the architecture. While it’s possible to describe a system using a wall of text, it’s rarely the best way. Architecture is complex, and most of the time the fastest and clearest way to convey it is visually. Diagrams help people see relationships, boundaries, and flows at a glance. But before drawing anything, it’s important to pause and ask two fundamental questions: * **What** do we want to show? * **How** should we show it? ## Modelling vs Diagramming Many of us have been in meetings gathered around a whiteboard, sketching out boxes and arrows to explore ideas. These ad-hoc diagrams are great for rapid ideation — they help teams align quickly — but they are rarely useful outside that moment. As systems grow in complexity, sketches alone aren’t enough. We need a clearer understanding of the underlying structure we are trying to represent. This brings us to an important distinction: **modelling versus diagramming**. * **Modelling** defines the structure of a system — its actors, components, responsibilities, and relationships. It gives us a consistent source of truth for the architecture, independent of how it is visualised. * **Diagramming** presents that model in a particular way — a visual slice tailored to a specific audience or concern. A diagram is therefore a **view** onto the model. It highlights certain elements while omitting others, depending on the story we need to tell. But different people care about different stories. In any organisation, systems may have multiple users, internal and external teams, logical components, and supporting infrastructure. The relationships between these layers quickly become complex. Trying to show everything in one diagram would be overwhelming. A CTO may care about high-level system interactions, while a security officer needs low-level networking details. Each stakeholder has different concerns — or **viewpoints** — and no single view can satisfy them all. This is why we create **multiple views** , each shaped by a specific **viewpoint** and tailored to its audience. The model remains the one source of truth; each view shows only the part that matters for a given concern. As an analogy, think about the architectural plans of a house. A single structural model can be used to produce: * a **floor plan** for layout and navigation, * an **electrical plan** for wiring, * a **plumbing plan** for water and waste. Each plan is a view of the same underlying structure, created from a different viewpoint. If an architect moves a wall but the electrical and plumbing plans aren’t updated to reflect it, the result would be chaos! Software architecture works the same way. The model holds the truth about the system; diagrams are purposeful views that help different people understand and make decisions about it. ## The Spectrum of Approaches When deciding how to model and create views, there’s no one-size-fits-all solution. Instead, there’s a spectrum - ranging from highly structured, formal modelling approaches to informal, free-form sketches. Each approach has its place depending on the context, audience, and longevity of the diagram. At one end of the spectrum, we have **heavyweight and structured** approaches such as UML and ArchiMate. These approaches enforce strict semantics and provide a rich modelling language. They are often used in enterprise-scale architecture where consistency, traceability, and alignment with frameworks like TOGAF are required. The trade-off is that they require significant effort to maintain, have steep learning curves, and may not be accessible to non-architects. In the middle, we find **lightweight but structured** approaches such as the C4 Model, which emphasise maintaining a consistent underlying model, but with far less formality. This supports clarity and coherence without the prescriptiveness of a full modelling language, and makes producing and evolving views far more manageable. Cloud diagrams that use AWS or Azure icon sets also sit broadly in this category. They offer a standardised visual vocabulary that improves clarity and consistency, but they stop short of providing a true modelling approach. At the far end of the spectrum, we have **lightweight and unstructured** approaches - free-form diagrams created on whiteboards, both physical and virtual. These are ideal for exploring or conveying an idea quickly and for collaborative workshops. They are fast, intuitive, and unconstrained, but they lack an underlying model and can quickly become inconsistent as systems evolve. Choosing the right approach is always a trade-off between consistency, governance, and ease of use. It depends on how long the diagram will live, who will maintain it, and how complex the system is. ## Tooling Landscape Once you’ve decided how structured your approach needs to be, the next step is choosing the right tools. Broadly speaking, diagramming and modelling tools can be grouped under the following categories. ### Diagrams as Code Tools like PlantUML, Mermaid, and Structurizr DSL allow you to define diagrams using text. These are ideal for teams who treat architecture like code — enabling version control, CI/CD integration, and automated documentation. They work particularly well when architecture needs to evolve alongside code. Diagrams can live in the same repository, be reviewed like any other code change, and even be generated automatically as part of a pipeline. The trade-off is that layout control can be limited, and the output may lack the polish of a hand-crafted diagram. @startuml !include https://raw.githubusercontent.com/plantuml-stdlib/C4-PlantUML/master/C4_Container.puml Person(user, "End User", "Calls the public API") System_Boundary(sys, "Serverless API System") { Container(apiGw, "API Gateway", "Amazon API Gateway", "Entry point for HTTPS clients") Container(lambdaFn, "Lambda Function", "AWS Lambda", "Executes business logic for incoming requests") ContainerDb(backend, "Backend Service", "DynamoDB", "Stores and retrieves application data") } Rel(user, apiGw, "Invokes API", "HTTPS/JSON") Rel_R(apiGw, lambdaFn, "Triggers", "Lambda integration") Rel_D(lambdaFn, backend, "Reads/Writes data", "SDK / JDBC") @enduml ### Model-Driven Tools Enterprise tools such as Archi, Sparx Enterprise Architect, and Visual Paradigm focus on maintaining a central model and generating views from it. This ensures consistency across diagrams and supports traceability — linking requirements to architecture and even to implementation. These tools are powerful but require discipline and effort to keep up to date. They are best suited for large organisations with formal architecture governance or regulated environments where long-lived models are essential. ### Visual Diagramming Tools Tools like Lucidchart, Miro and draw.io prioritise collaboration and simplicity. They mimic the experience of sketching on a whiteboard but add features like templates, real-time collaboration, and cloud storage. These tools are great for workshops and stakeholder engagement, but they lack an underlying model. As a result, they can become inconsistent and hard to maintain as systems grow. ### Cloud-Specific Tools Tools like Cloudcraft, Hava, and AWS Workload Discovery integrate with live cloud environments to automatically generate diagrams. These tools reflect the _actual_ state of deployed systems, which is invaluable for audits, onboarding, troubleshooting, and operational visibility. Many can also ingest Infrastructure as Code (e.g., Terraform) to visualise deployments directly from source. Although automation makes these diagrams quick to produce, it also constrains them. Because they mirror the raw cloud resources exactly as they exist, there is very little scope for layout, grouping, or abstraction. As a result, they are not well suited to **future-state design** , architectural storytelling, or conveying **logical intent** rather than physical infrastructure. ## What about AI-assisted diagrams? LLMs introduce a new angle: because diagrams can be defined as code, we can now generate formats such as Mermaid or PlantUML directly from natural-language descriptions. This makes it much faster to produce early drafts and explore different ways of expressing a model. But this approach has a fundamental limitation: diagrams are spatial and visual, while LLMs predict text. An LLM can create valid syntax, but it cannot reliably judge whether a diagram will be readable, balanced, or visually coherent. To address this gap, AI features are emerging inside visual diagramming tools themselves — for example Miro, Lucidchart, and dedicated tools like Eraser. These may be able to more intelligently integrate with layout engines and can prompt a user to clarify their intent; producing more coherent visuals while still keeping a human in the loop. AI tools are also being integrated directly into code-bases to generate documenation, including diagrams - such as Google’s CodeWiki or Devin’s DeepWiki. LLMs also have potential to support the modelling process more directly. By connecting to codebases or live infrastructure, they can answer natural-language questions (“Which services call this API?”), help infer relationships, and assist in keeping architectural models aligned with the real system. Some examples are AI-assisted diagramming is most effective as augmentation rather than automation. By combining automated insights with natural-language interaction, these tools have the potential to reduce the effort of creating and maintaining diagrams — while architects still provide the intent, abstraction, and viewpoint needed for effective communication. ## Using these approaches effectively Effective architecture communication is less about the tool and more about the thinking behind it. A clear model provides the structure; viewpoints help us understand what different audiences care about; and diagrams turn those ideas into views that tell the right story at the right level of abstraction. There is a wide spectrum of approaches available — from formal modelling languages like UML and ArchiMate, to lightweight frameworks such as C4, to completely free-form sketches on a whiteboard. Each has strengths depending on the context, longevity, and the decisions it needs to support. But it isn’t the tooling that communicates architecture effectively — architects do. What matters most is clarity of intent, and ensuring that whatever we produce, with whichever tool we choose, genuinely reflects that intent.

21.11.2025 00:00 — 👍 0 🔁 1 💬 0 📌 0

Testers, testing and the future: A Bifurcation into Testing AI and AI-powered Testing. ## A Brief History of Testing The rise of Artificial Intelligence is the biggest paradigm shift in software development since Agile—and it’s fundamentally rewriting the role of the Test Engineer. Testing has evolved over the last two decades, though remnants of the past still linger—people too wedded to their beliefs to change, processes too ingrained to evolve, and ideas too novel to try. Despite these challenges, the field has progressed, and so have testers. No longer are they merely manual verifiers. The traditional role of the software tester used to be as a gatekeeper, script writer, and manual executioner; a role often filled by accident and amalgamated into large teams of bug hunters, directed to perform repetitive, low-skill tasks. They were the sentries guarding a fixed fortress, checking every latch and bolt on the drawbridge. However, there were always mavericks—testers who went off the beaten path. These individuals eschewed standard processes, hunting on their own or in smaller packs. They were explorers and cartographers, taking an interest in the underlying mechanisms and domain, and could identify the path to follow without the rigid need for pre-drawn maps. ## The Evolution of the Quality Guardian Today, the modern tester is a Quality Guardian and a strategic architect. They are multi-skilled, required to code, assess risk, analyse data, and communicate across many different boundaries. They must think critically, see the big picture, and possess domain knowledge. They have shed the armour of the sentry and now serve as the chief adviser on risk, performance, security, and accessibility. All the while, they are beholden to delivery expectations and time pressures, relentlessly seeking the very structural weaknesses that threaten the value of the software being produced at an ever-increasing velocity. Testers have evolved, are still evolving, and will need to evolve further. The future of testing’s evolution is bifurcated into two distinct, yet connected, paths: **Taming the AI Dragon** (validating AI systems) and **Wielding the AI Hammer** (AI-powered automation and tools). As professionals, we see this not as a threat to our craft, but as an unprecedented opportunity to trade the shovel for the telescope, moving beyond repetitive toil and focussing on high-value quality engineering. ## Path 1: Taming the AI Dragon (Testing AI-Based Systems) When the system under test is an AI or Machine Learning (ML) model, the nature of testing fundamentally changes. We are no longer testing against a fixed scroll of business rules; we’re testing a probabilistic, evolving, and often non-deterministic black box. The core challenge shifts from verifying known deterministic paths to exploring probabilistic outcomes—it’s like trying to predict the path of a river, not just checking if a pipe is leaking. ### The New Arsenal of the Tester The Quality Guardian’s skillset must deepen its analytical focus, moving from the logic and structure of imperative code to the statistics and ethics of data: **The Data Alchemist:** Since the data is the “code” for an AI, testers must become experts in validating the training, validation, and test datasets. This requires advanced analytical skills to find the skew, the gaps, and the poisons hidden in the data well. Proficiency in tools for data visualisation and query languages is essential for maintaining data integrity. **The Ethical Sentinel:** This highly analytical skill involves proactively designing tests to expose algorithmic bias—the tendency of the machine to treat different groups unfairly. This requires a strong framework for testing fairness, transparency, and accountability (FTA), using statistical methods to quantify the unfairness. They are the guardians of the AI’s moral compass. **The Predictive Oracle** (Deepened Risk Analysis): Because AI behaviour can be non-deterministic, testers must use their critical thinking to anticipate high-risk, real-world edge cases the model might fail to handle gracefully, focussing on the potential impact of a wrong decision. ### The Evolution of the Hunt Testing AI will evolve into a continuous, data-centric process: **Adversarial Encounters:** We will move beyond traditional positive/negative testing. Testers must become masters of the Adversarial Attack (slightly perturbing input data to force a model error) and Metamorphic Testing (checking if minor, non-output-affecting input changes produce the expected non-change in output). These are the new siege tactics against the machine. **Demanding Explainability (XAI):** Testers will challenge the “black box” nature of models, demanding interpretability. We will use tools and techniques to understand why the dragon chose that particular path, especially for critical, life-impacting systems. **Concept Drift Watchers:** Since ML models are like living organisms that degrade over time as real-world data changes, the testing process must extend into continuous monitoring, checking for Concept Drift (when the model’s understanding of the world warps) and triggering model retraining or rollback. ## Path 2: Wielding the AI Hammer (Using AI to Test Software) While traditional test automation focusses on scripting repetitive tasks, AI-powered tools bring additional capabilities to the craftsman’s workbench: **Self-Healing Mechanisms:** AI can automatically mend test scripts when a UI element shifts, like an apprentice mending the fence posts after a minor storm, significantly reducing the tedious maintenance overhead. **Intelligent Scout:** The technology analyses application code and user behaviour to suggest or create new, more effective test cases, including edge cases. **Predictive Cartography:** Past test results and code changes are analysed to predict where defects are most likely to occur, allowing the tester to prioritise their expedition to the high-risk zones. **The AI Oracle:** Training on past application behaviour and requirements allows the machine to act as a Test Oracle, automatically judging whether an observed application state or output is correct or incorrect—a traditionally challenging manual task. This automation will augment the human master craftsman, not replace them. ### The New Skills of the Craftsman Human ingenuity will be needed to enhance strategy, interpretation, and complex testing: **The Prompt Engineer:** Testers need to become masters of the AI interface. This requires Prompt Engineering—the skill of crafting precise, magical incantations to get reliable output from the Generative AI (GenAI) engine. **AI-Augmented Toolsmithing (Vibe-Coding):** The days of waiting for a complex utility to be developed by another team are ending. Testers should be looking to use AI coding assistants to translate a high-level testing intention or “vibe” into functional code. This allows for the rapid creation of bespoke tools—such as log parsers, mock API servers, or specialised data generators—allowing the tester to forge their own tools on the spot. **The Explorer’s Return:** With AI handling the bulk of regression and repetitive checks, human testers are freed up to focus entirely on exploratory testing. This means leveraging our unique human traits: creativity, intuition, and deep critical thinking to find issues in the complex business logic and nuanced user experience that the automation can miss. ### The Evolution of the Workshop AI integration will make testing faster, more resilient, and continuous: **Autonomous Automation:** We are moving toward Hyper-Automation, where AI-powered frameworks generate, execute, and even self-heal test scripts. This will free up significant engineering time previously spent on test maintenance. **Shift from Execution to Analysis:** The bottleneck shifts from the repetitive act of execution to the strategic act of analysis. The tester’s role changes from simply running tests to becoming the Master Data Analyst who quickly sifts through vast amounts of AI-generated test data, prioritising actionable insights for the development team. ## Embrace the Change: The Integrated Future The test engineer of the future is an analytical, ethical, and strategic thinker—an orchestrator of AI tools, focussed on uncovering the deep, complex, and high-impact failures that only human insight and prompt-powered agility can find. The future of testing is smarter, faster, and hyper-focussed on value discovery through investigation. AI is not a competitor; it is the most powerful co-pilot we have ever had. The future is a symbiotic one, built on the dual pillars of **Taming the AI Dragon** and **Wielding the AI Hammer**. The human tester’s role is not replaced, but profoundly elevated; they evolve into strategists, critical thinkers, and ethical guardians in a landscape increasingly defined by machine intelligence. The Test Engineer of tomorrow will not spend their day writing boilerplate automation scripts. They will spend it strategising, interpreting data, exploring high-risk areas, and guiding AI tools to deliver superior quality. To thrive, organisations and testers must embrace this bifurcation, investing in the necessary skills and methodologies to harness the full potential of AI and secure the next generation of software quality. Embrace the change. Start building your AI literacy and prompt engineering skills today, learn to think as the augmented engineer—the one who works with the technology to elevate testing from a cost centre to a strategic business advantage, enabling faster, more confident releases, superior risk management, and ensuring ethical and compliant AI deployments.

20.11.2025 00:00 — 👍 0 🔁 1 💬 0 📌 0

19.11.2025 00:00 — 👍 0 🔁 1 💬 0 📌 0

Servo 0.0.2 hints at a real Rust alternative to Chromium : As Mozilla stumbles into 'AI everywhere,' you might be glad of a non-Google browser engine

Latest Servo release hints at a real Rust alternative to Chromium

https://www.theregister.com/2025/11/18/servo_002_arrives/

As Mozilla stumbles into 'AI everywhere,' you might be glad of a non-Google browser engine

<- by me on @theregister

18.11.2025 12:09 — 👍 1 🔁 24 💬 2 📌 0

Original post on mastodon.social

Today I am stepping down from my role as the CEO of #Mastodon. Though this has been in the works for a while, I can't say I've fully processed how I feel about it. There is a bittersweet part to it, and I think I will miss it, but it also felt necessary. It feels like a goodbye, but it isn't—I […]

18.11.2025 08:45 — 👍 56 🔁 550 💬 93 📌 25

Balancing AI Innovation and Sustainability: Our presentation at HM Treasury ID25 Last week, we had the privilege of speaking at HM Treasury at the ID25 (Innovation Day) event on the crucial topic of balancing AI innovation with sustainability. Joined by Suzanne Angell, our Public Sector Director, we addressed government and industry leaders about how the UK can lead in developing AI systems that are both powerful and sustainable. ## Reframing the conversation The discourse around AI and sustainability often positions them as competing forces – innovation pushing us forward while sustainability acts as a brake. We proposed a different perspective: innovation and sustainability strengthen each other. As Suzanne eloquently put it, “Innovation without sustainability is short-lived. Sustainability without innovation is stagnant.” This isn’t merely an environmental concern; it’s central to building AI systems that deliver lasting value. True sustainability encompasses environmental, economic, and technological considerations, aligning with the UN Sustainable Development Goals that the UK government has committed to. ## The Tech Carbon Standard: A Framework for Action At Scott Logic, we’ve developed the Tech Carbon Standard to help organisations understand and manage their technology footprint. This open source framework, now cited in the GOV UK Service Manual, helps bridge the gap between sustainability professionals and technologists by providing a common language and approach. The standard highlights three critical areas: 1. **Upstream emissions** – The hardware manufacturing, software development, and content creation that can make up 50–60% of technology’s environmental impact. 2. **Operational emissions** – The running of technology services, which gets most of the attention today. 3. **Downstream emissions** – The impact on citizens and businesses using government services. Most organisations are shocked to discover that often the majority of their environmental impact comes from hardware procurement rather than operational energy use. This insight alone can transform decision-making. ## The reality of GenAI models While celebrating AI’s potential, we must confront uncomfortable truths about current approaches, particularly with many large generative models. The industry has adopted what I described as a “brute force” approach – throwing massive computing resources at problems, with corresponding energy and resource requirements. This approach is fundamentally unsustainable for several reasons: * Enormous compute and energy requirements, often directly powered by fossil fuels due to grid limitations. * Generation of substantial e-waste as specialised hardware is quickly obsolesced. * Degradation of the information space as models train on increasingly synthetic content. * Lack of transparency about true environmental costs. Illustrating brute force AI development through the use of a metaphor of a train trying to bridge a gap, we explained how current AI development is at risk of failure under its own resource requirements and the risk of model collapse. Some AI data centres are being powered directly by gas turbines because there simply isn’t time to sort out grid infrastructure – which isn’t the sustainable future we should be building or supporting. ### This is not an outlier perspective We have done an extensive literature review of industry and academic material relating to AI sustainability, and it provides a solid evidence base for these positions. Recently, Dr Sacha Lucconi recorded a TED Talk on this topic, and it’s a very compelling 10-minute watch. ## A more sustainable path forward We proposed a human-led approach to AI, one that empowers people with AI tools rather than attempting to replace them. This means: 1. **Measuring first** – Understanding the full lifecycle impact of AI systems using frameworks like the Tech Carbon Standard. 2. **Embedding sustainability into procurement** – Using standards and spend controls to incentivise sustainable AI. 3. **Right-sizing models** – Using domain-specific models rather than general-purpose ones for specialised tasks. 4. **Distributing computing** – Moving from centralised cloud-only models to a mix including private infrastructure and edge AI. The evolution from “clock towers to wristwatches” provides a useful parallel. Just as timekeeping evolved from public clocktowers to personal watches, AI is evolving from massive centralised systems to more distributed and personalised ones. Edge AI (running models directly on end-user devices) offers particular promise. It diffuses energy demand, leverages existing devices (reducing e-waste), and benefits from the rapid innovation in smaller, open-source models. ## Maximising existing assets, diffusing power demand A critical point to emphasise here: when we advocate for running AI on end-user devices, we’re not proposing an increase in hardware consumption (although of course there is a danger of a rebound effect). Rather, we’re promoting the efficient utilisation of hardware that already exists. Consumer and business devices typically have significant unused computing capacity. Smartphones, laptops and desktops often run at a fraction of their processing potential. By “sweating these assets” to extract more value from hardware already manufactured and distributed, we avoid the substantial upstream carbon costs of creating new, specialised AI hardware. This approach acknowledges that the environmental impact of manufacturing devices has already occurred; maximising their utility before end-of-life becomes the most sustainable path forward. This aligns with circular economy principles: extending product lifespans, maximising resource utilisation, and reducing the demand for new manufacturing. The beauty of this approach is that it transforms what might initially appear as a contradiction – running AI across more devices – into a sustainability advantage. It does this through the more efficient use of existing resources across a range of locations that are easier to decarbonise (moving the compute is often easier than moving the power) rather than centralised locations that suffer from grid bottlenecks. There is also the potential to run AI on business and end-user devices overnight – when they are not used – scheduled for times when there is lower-cost and lower-carbon electricity. ## The UK opportunity What might initially appear as constraints for the UK – our regulatory environment and grid capacity limitations – can actually drive innovation rather than inhibit it. The UK has a proud history of pragmatic, efficient engineering excellence, from ARM microprocessors to Formula 1 and Rolls Royce jet engines. Our position enables us to focus on developing AI that is: * more efficient in its use of resources * more transparent in its operations * more trustworthy for sensitive applications * more tailored to specific domains – particularly those in scientific and highly regulated areas Rather than competing on raw scale, we can lead in creating specialist models that excel in specific domains while maintaining a smaller footprint. This approach plays to our strengths in scientific innovation and high-end, regulated engineering. ## Looking ahead After the presentation, there was a panel discussion chaired by Jess McEvoy with government panellists exploring concrete steps forward. There was a strong consensus that: 1. Sustainability must be baked into the AI development process from the beginning 2. Central standards coupled with distributed responsibility provide an effective governance model 3. The UK has an opportunity to be a global leader in sustainable AI innovation As we concluded, UK innovation is the key to sustainable AI. By focusing on measurement, embedding sustainability in procurement, right-sizing models, and embracing a range of computing approaches, we can build AI systems that deliver tremendous value while respecting planetary boundaries. Scott Logic works with public and private sector organisations to design, build and deploy technology that makes a measurable difference to people’s lives. You can learn more here about our approach to sustainable technology.

12.11.2025 15:23 — 👍 0 🔁 1 💬 0 📌 0

"Can you explain this gap in your resume?"

I'm not very good with CSS...

12.11.2025 02:29 — 👍 1 🔁 24 💬 1 📌 0

Introducing the Latest Version of the Tech Carbon Estimator ## Overview The Technology Carbon Estimator (TCE) is designed to provide a high-level overview of the potential areas of carbon impact within your IT estate. The estimations are framed within our proposed model of tech emissions — the Technology Carbon Standard — designed to help you map, measure, and improve the environmental impact of your technology. Since its inception in July 2024, the TCE has undergone various updates, and we are excited to announce the next batch of feature enhancements. The idea behind these updates is to ensure the tool continues to be valuable across a variety of use cases, while laying the groundwork for future improvements. A big thank you to Daniel Moorhouse, Ben Stinchcombe, Max Nyamunda, and Matthew Griffin for all their hard work on these enhancements. * * * ## Schema Updates The TCE now uses the latest version of the Tech Carbon Standard schema. This is particularly important when it comes to data export, as we can now provide raw emission data in a predefined, consistent structure for users to ingest into their own tools or applications if required. * * * ## Emissions Data Available in kg CO₂e and Percentages Previously, the TCE only displayed estimated carbon emissions as a percentage breakdown across the four sectors of the Tech Carbon Standard. With this latest release, users can now view this data as either **kg CO₂e** or a percentage breakdown. This provides better context around estimated emissions and makes the tool even more valuable. We’ve also updated the tool to use the latest version of the CO2.js library, ensuring the most accurate estimates possible. * * * ## Exportable Data in JSON or PDF Formats With the addition of kg CO₂e values, it made sense to make this data available beyond the application’s UI. We’ve implemented several export options: users can export data in **JSON** format (which follows the Tech Carbon Standard schema) and optionally include their estimation input values if required. There is also a **PDF** ption, which provides a snapshot of the tree graph and table — useful for users who want to generate and file reports at regular intervals to track changes. Download example of an exported JSON file Download example of an exported PDF file * * * ## Accessibility Updates Several areas of the application did not fully meet WCAG 2.1 AA standards, so the team used axe-core to identify and resolve accessibility issues. * * * ## Improved Testing The automation framework has been migrated from Python to TypeScript to leverage all the best features of Playwright. This included adding screenshot comparison testing (particularly helpful for validating the tree graph) and automated accessibility testing. The new framework also adopts a Page Object Model, making future test writing and maintenance quicker and easier. * * * ## What’s Next? Hot on the heels of this v0.5.0 release, we expect v0.6.0 to be available soon. For this version, we have worked with the team at DEFRA to define features that would make the TCE a useful tool for UK Government departments to leverage when reporting carbon emissions (part of the Greening Government strategy). This includes a features that estimates carbon emissions for SaaS solutions — primarily Microsoft 365 — along with improved documentation, including a best practice guide that offers tips for entering estimation inputs. Beyond that, we plan to implement emission estimates for both AI inference and model training…watch this space! * * * If you’re interested in learning more about the Tech Carbon Estimator, check out the latest version here here and the GitHub project here

07.11.2025 00:00 — 👍 0 🔁 1 💬 0 📌 0

07.11.2025 00:00 — 👍 0 🔁 1 💬 0 📌 0

Introducing the Latest Version of the Tech Carbon Standard Estimator # Introducing the Latest Version of the Tech Carbon Standard Estimator ## Overview The Technology Carbon Standard Estimator (TCSE) is designed to provide a high-level overview of the potential areas of carbon impact within your IT estate. The estimations are framed within our proposed model of tech emissions — the Technology Carbon Standard — designed to help you map, measure, and improve the environmental impact of your technology. Since its inception in July 2024, the TCSE has undergone various updates, and we are excited to announce the next batch of feature enhancements. The idea behind these updates is to ensure the tool continues to be valuable across a variety of use cases, while laying the groundwork for future improvements. A big thank you to Daniel Moorhouse, Ben Stinchcombe, Max Nyamunda, and Matthew Griffin for all their hard work on these enhancements. * * * ## Schema Updates The TCSE now uses the latest version of the Tech Carbon Standard schema. This is particularly important when it comes to data export, as we can now provide raw emission data in a predefined, consistent structure for users to ingest into their own tools or applications if required. * * * ## Emissions Data Available in kg CO₂e and Percentages Previously, the TCSE only displayed estimated carbon emissions as a percentage breakdown across the four sectors of the Tech Carbon Standard. With this latest release, users can now view this data as either **kg CO₂e** or a percentage breakdown. This provides better context around estimated emissions and makes the tool even more valuable. We’ve also updated the tool to use the latest version of the CO2.js library, ensuring the most accurate estimates possible. * * * ## Exportable Data in JSON or PDF Formats With the addition of kg CO₂e values, it made sense to make this data available beyond the application’s UI. We’ve implemented several export options: users can export data in **JSON** format (which follows the Tech Carbon Standard schema) and optionally include their estimation input values if required. There is also a **PDF** ption, which provides a snapshot of the tree graph and table — useful for users who want to generate and file reports at regular intervals to track changes. Download example of an exported JSON file Download example of an exported PDF file * * * ## Accessibility Updates Several areas of the application did not fully meet WCAG 2.1 AA standards, so the team used axe-core to identify and resolve accessibility issues. * * * ## Improved Testing The automation framework has been migrated from Python to TypeScript to leverage all the best features of Playwright. This included adding screenshot comparison testing (particularly helpful for validating the tree graph) and automated accessibility testing. The new framework also adopts a Page Object Model, making future test writing and maintenance quicker and easier. * * * ## What’s Next? Hot on the heels of this v0.5.0 release, we expect v0.6.0 to be available soon. For this version, we have worked with the team at DEFRA to define features that would make the TCSE a useful tool for UK Government departments to leverage when reporting carbon emissions (part of the Greening Government strategy). This includes a features that estimates carbon emissions for SaaS solutions — primarily Microsoft 365 — along with improved documentation, including a best practice guide that offers tips for entering estimation inputs. Beyond that, we plan to implement emission estimates for both AI inference and model training…watch this space! * * * If you’re interested in learning more about the Tech Carbon Standard Estimator, check out the latest version here here and the GitHub project here

07.11.2025 00:00 — 👍 0 🔁 1 💬 0 📌 0

Starting a little Rust study group at work in an effort to keep myself focussed on learning it. We'll mainly be working through the Rust book, but if anybody's got other recommendations or associated resources, happy to hear them!

#rust

04.11.2025 15:03 — 👍 0 🔁 1 💬 0 📌 0

Beyond Benchmarks: Testing Open-Source LLMs in Multi-Agent Workflows Are open-source models viable for building internal corporate chatbots? Organizations seek cost-effective, privacy-conscious alternatives to proprietary solutions. We tested whether open-source LLMs could substitute OpenAI for internal agentic tasks, with the hypothesis that well-chosen open-source models can handle many agent roles and may be optimal for certain tasks. This article explores why enterprise reality demands multi-agent testing beyond standard benchmarks, examining how traditional evaluation methods fall short of assessing real-world collaborative AI workflows. We dive into our real-world testing using enterprise ESG analysis, detailing our strategic model selection process and comprehensive evaluation framework. After presenting our results and performance analysis with key findings, we provide a technical deep dive into architectural approaches comparing different implementation strategies. Finally, we discuss areas for further exploration and share our conclusions about the viability of open-source LLMs in enterprise multi-agent systems. ## Beyond Standard Benchmarks: Why Enterprise Reality Demands Multi-Agent Testing A major challenge with LLMs is that they are difficult to test, as their outputs are non-deterministic and often generate plausible but inaccurate information when faced with uncertainty. This makes consistent and reliable evaluation difficult, as the same prompt can yield subtly different responses across runs or contexts. Yet, most existing benchmarks still focus on narrow, static tasks – such as answering trivia questions like ‘How many R’s are in ‘strawberry?’ – rather than dynamic, real-world workflows that require sustained reasoning, planning, and adaptability. There’s an additional concern emerging as common benchmarking concepts become embedded in online resources: the risk of benchmark contamination. As LLM benchmark question and answer sets are increasingly used to train new models, there’s a growing possibility that models will inadvertently learn to pass specific benchmarking tests rather than develop genuine capabilities. This creates a circular problem where models might excel at benchmark tasks while failing in real-world scenarios that require the same underlying skills. The result could be models that appear highly capable on standard metrics but perform poorly when faced with novel challenges outside their training distribution. The true test lies in multi-agent workflows. Agentic workflows mirror real-world enterprise systems where multiple specialized agents collaborate to complete complex tasks. In finance, these workflows might involve automated systems handling transaction reconciliation and budget analysis. In HR, they could manage onboarding, coordinate training, and ensure compliance with labour regulations. Standard benchmarking fails to assess these sophisticated, multi-agent workflows where the real challenge lies, evaluating how well LLMs handle interconnected actions and decision-making in production environments. ## Real-World Testing: Enterprise ESG Analysis To validate our hypothesis, we forked the Infer ESG project, a system designed to generate greenwashing reports from ESG documents and answer related questions. This provided us with a solid foundation and a realistic, multi-agent workflow to test against. The Infer ESG system follows a straightforward user journey: when you upload an ESG document, the system generates a comprehensive report on greenwashing based on its contents. Users can then ask follow-up questions about the document, and the system provides contextual responses. Crucially, the system is built as an agentic workflow involving multiple specialized agents, each with a clearly defined role. One agent handles extracting a list of distinct questions from the user input. Another agent’s task is to select the appropriate analytical tool based on the extracted questions and the document’s content. A third agent takes the answers produced by earlier agents and synthesizes them into a single, coherent response for the user. By ensuring each agent focuses on a specific step – question extraction, tool selection, or answer synthesis – the system supports a structured and comprehensive workflow. This multi-agent architecture made it ideal for our testing approach: we could replace one agent at a time with an open-source LLM to evaluate how well it performed its specific role. Fortunately, the Infer ESG project was already architected to switch between OpenAI and Mistral models, making it straightforward to modify it to connect with LM Studio for our open-source experiments. ## Strategic Model Selection To avoid being overwhelmed by the wide selection of options on HuggingFace, we limited our selection to models from established AI providers. LM Studio’s staff picks provided an excellent starting point for our research. We investigated models from several categories: * **Major tech companies:** Google (Gemma), Meta, Microsoft * **AI-focused companies:** Qwen, Mistral, OpenAI * **Emerging players:** DeepSeek, Liquid ### Our Final Model Selection Our goal was to choose five models that could run efficiently on standard laptop hardware, focusing on speed, energy usage, and correctness. We began by reviewing the available models on LM Studio and found more information on the model cards on the respective companies’ websites. We included two purely open-source models: `DeepSeek-R1-0528` and `LFM2` by Liquid. Including `GPT-OSS 20B` was important because it had just been released and matched our goal. `Qwen3-30B-A3B` was a runnable OS Mixture of Experts model that fit our criteria. To represent a major company, we chose `Gemma` from Google. For all but Gemma, only one model was available within the runnable range, so we selected the smallest model to compare with the smallest models of its competitors. ### Hardware Constraints Our testing was constrained by typical development hardware: a work machine with 32 GB of RAM and 16 GB of VRAM. During initial LM Studio tests, we found that models with fewer than 20 billion parameters generally performed well, while larger models significantly slowed down performance. ## Evaluation Framework ### Agent Replacement Strategy Initially, we struggled to confirm agent replacement because not every agent activates in every run. However, after some troubleshooting, we successfully swapped out all agents during comprehensive testing. The greenwashing report generation feature provided an excellent isolated test case for evaluating individual agents, as this process generates a single, comprehensive output that we could systematically evaluate. ### Local vs. Cloud Performance Challenges However, we quickly encountered significant performance limitations. Models that handled single questions reasonably well became prohibitively slow when generating full reports. The computational requirements for comprehensive document analysis proved too demanding for local hardware – even basic LLMs took longer than practical for real-world use. We successfully generated reports locally using `Liquid` and `Gemma` models, but the processing times were unacceptable for production use. To address this, we deployed LM Studio on AWS EC2 instances optimised for GPU compute, which dramatically improved performance and made comprehensive testing feasible. Our choice of EC2 over AWS Bedrock was pragmatic rather than strategic. Since we had already developed an LM Studio client for local testing, migrating to EC2 required only changing the URL from localhost to the EC2 instance’s IP address. In retrospect, AWS Bedrock would likely have been a more suitable choice, offering managed infrastructure and simplified deployment. However, the existing LM Studio integration allowed us to focus on model evaluation rather than infrastructure concerns. ### Evaluation Approach Evaluating the quality of the generated reports presented its own challenges. We initially attempted to manually categorize the baseline GPT-4o report into three groups: * **Verifiably correct:** Claims that could be factually validated * **Seemingly correct:** Plausible claims that appeared accurate * **Incorrect:** Demonstrably false or misleading information This manual approach was time-consuming and subjective. In the spirit of the project, we chose an automated response using GPT-5. We gave Microsoft Copilot both the ESG document and the greenwashing report, using the same prompt (below). This enabled us to get the same classifications as the manual process, but much faster. Click to expand the evaluation prompt used for automated fact-checking Task: Analyze factual claims in "Astrazeneca-gemma-3-1b-report.md" (a synthesized ESG report for AstraZeneca) using "AstraZeneca-Sustainability-Report-2023.pdf" as the only reference. Instructions: 1. Extract Factual Claims - Identify all verifiable statements about AstraZeneca's actions, performance, targets, or outcomes - Exclude opinions, interpretations, or non-factual commentary 2. Classify Each Claim - Compare each claim against the reference document only - Assign one of: * Supported – Explicitly or implicitly verified by the reference * Not Supported – No relevant evidence found * Contradicted – Information in the reference directly conflicts with the claim 3. Provide Evidence - For Supported or Contradicted: quote the exact sentence(s) or section(s) from the reference document - For Not Supported: write "No matching evidence found" 4. Methodology - Do NOT use any external data or prior knowledge about AstraZeneca Output Format: - Produce a valid CSV with the columns: Claim text, Classification, Evidence - Use commas as delimiters - Enclose multi-line text in quotes - Output ONLY a CSV file, no commentary or explanation ## Results ### Raw Data Model | Supported | Not Supported | Contradicted ---|---|---|--- GPT-4o | 50 | 19 | 26 Deepseek | 96 | 1 | 3 Gemma | 34 | 10 | 8 Liquid | 134 | 25 | 0 Goss | 112 | 25 | 20 Qwen3 | 37 | 28 | 15 ### Percentage Breakdown Model | Supported | Not Supported | Contradicted ---|---|---|--- GPT-4o | 52% | 20% | 27% Deepseek | 96% | 1% | 3% Gemma | 65% | 19% | 15% Liquid | 84% | 15% | 0% Goss | 71% | 15% | 12% Qwen3 | 46% | 35% | 18% ## Performance Analysis: Key Findings The evaluation revealed striking differences in factual accuracy across models: * **DeepSeek** emerged as the most reliable, with 96% supported claims and minimal contradictions (3%). This suggests strong alignment with source material and robust reasoning capabilities. * **Liquid** also performed exceptionally well, achieving 84% supported claims and zero contradictions, though it had a slightly higher rate of unsupported statements (15%). * **Goss** and **Gemma** delivered moderate performance, with supported rates of 71% and 65%, respectively. Both showed some contradictions, indicating occasional misinterpretation of context. * **Qwen3** struggled, with only 46% supported claims and the highest proportion of unsupported statements (35%), suggesting limitations in handling complex ESG content. * Surprisingly, **GPT-4o** , though a top proprietary model, achieved just 52% supported claims and had the highest contradiction rate (27%). This surprising outcome needs careful study. ### Understanding GPT-4o’s Performance The lower performance of GPT-4o could stem from several factors: 1. **Architectural Mismatch** : OpenAI’s RAG system chunks documents and retrieves relevant sections rather than processing the entire document context. This selective retrieval might have missed crucial contextual information needed for accurate claim verification. This difference is explained in detail in the next section. 2. **Training Data Interference** : GPT-4o may have drawn from its pre-trained knowledge about AstraZeneca rather than strictly adhering to the provided ESG document, leading to claims that were factually correct but not supported by the specific source material. 3. **Optimization Differences** : The model may be optimized for broader conversational tasks rather than the precise fact-extraction and verification required in this specialized ESG analysis workflow. This highlights a critical insight: **model performance is highly context-dependent**. Open-source models like DeepSeek and Liquid not only rival but, in some cases, surpass proprietary options in factual consistency for specialized workflows when properly architected. However, performance varies widely, emphasizing the need for careful model selection, appropriate architectural choices, and domain-specific evaluation. ## Technical Deep Dive: Architectural Approaches An interesting architectural difference emerged when comparing file handling between the two implementations. The approaches reflect fundamentally different philosophies: OpenAI’s RAG (Retrieval-Augmented Generation) system versus LM Studio’s straightforward context injection. ### OpenAI’s Approach: Vector Store RAG When including a file in OpenAI, the process is abstracted: 1. Upload the file to OpenAI’s servers using their Files API 2. Add the file to a vector store with automatic chunking and embedding 3. Include the `file_search` tool in your API call with the vector store ID 4. OpenAI handles retrieval and context injection automatically Files are uploaded once, stored persistently with expiration policies, and the vector store automatically retrieves relevant chunks during conversations. This approach offers several advantages: it handles large documents that exceed context limits, provides semantic search capabilities, and offloads the computational overhead of document processing to OpenAI’s infrastructure. While our use case only requires the document for a single session, this doesn’t fully utilize OpenAI’s persistent storage capabilities, which are designed for multi-session document reuse. ### LM Studio’s Approach: Direct Context Injection In contrast, our implementation of an LMStudio model client used a much simpler strategy: 1. Extract all text content from the uploaded file 2. Cache the extracted content locally for reuse 3. Append the entire document content directly to the user prompt 4. Send the combined prompt to the local model This “brute force” approach has both strengths and limitations. It’s conceptually simpler and gives complete control over what content the model sees. However, it’s constrained by the model’s context window size and can become inefficient with large documents, as the entire content is processed with every request rather than just relevant sections. ### Practical Implications The difference becomes significant when working with lengthy ESG documents. OpenAI’s vector store approach can handle massive documents by retrieving only relevant sections, while the local approach requires the entire document to fit within the model’s context window. Additionally, the text-only extraction approach means that visual information – potentially crucial for comprehensive ESG analysis – is excluded from the analysis. This architectural difference highlights a broader trade-off in the open-source versus commercial AI landscape: **sophistication and seamless scaling** versus **simplicity and full control**. ### Visual Content Limitations A critical limitation shared by both architectural approaches is that neither can process images. OpenAI’s RAG system extracts and processes only textual content during the chunking and embedding process, while our LM Studio implementation explicitly extracts text-only content from uploaded documents. This means that charts, graphs, diagrams, and other visual elements – which are often crucial components of ESG reports – remain completely invisible to the analysis. This text-only constraint could significantly impact the comprehensiveness of ESG evaluations, as visual data presentations often contain key performance metrics and trend information that complement the written content. ## Things to explore further Our focus on smaller, efficient models revealed their competitive potential, but larger open-source alternatives remain unexplored. Models like the 120B parameter open-source GPT variant or the 235B Qwen mixture-of-experts could bridge the gap to GPT-4o’s ~200 billion parameters. These larger models might deliver performance closer to proprietary solutions while maintaining the privacy and cost benefits of open-source infrastructure. Two promising directions emerged that we didn’t have time to investigate: hybrid architectures that strategically combine proprietary and open-source models based on task complexity, and fine-tuning approaches that adapt pre-trained models to specific ESG analysis requirements. A hybrid system might use GPT-4o for complex reasoning while delegating routine tasks to efficient open-source models, optimizing both performance and cost. ## Conclusions The most surprising discovery was how seamlessly existing projects can integrate with local LLMs. Retrofitting the Infer ESG system required minimal architectural changes – mostly URL modifications and API adaptations. Had we designed for open-source from the outset, many integration challenges would have been eliminated entirely. This suggests that the barrier to adopting open-source LLMs in enterprise systems is lower than many organizations assume. However, hardware reality imposes practical constraints. While local testing validated our approach, production-grade performance demanded cloud infrastructure. Even individual tasks like document analysis or report generation require substantial computational resources. This doesn’t invalidate open-source approaches – rather, it emphasizes the importance of thoughtful architectural planning when deploying these models at scale. Liquid’s LFM2 emerged as a standout performer, delivering impressive results despite being our smallest test model. Its combination of speed, accuracy, and efficiency makes it particularly compelling for organizations exploring agentic AI systems. This reinforces a key principle: understanding your specific requirements is more valuable than chasing the largest or most popular models. The rapid evolution of AI means that today’s specific model recommendations will quickly become obsolete. What remains constant is the strategic value of building flexible systems that can adapt to emerging models. Organizations prioritizing data privacy, cost efficiency, or system adaptability should seriously evaluate open-source alternatives – they may find capabilities that not only match but exceed their proprietary counterparts in specialized workflows. * * *

27.10.2025 00:00 — 👍 0 🔁 1 💬 0 📌 0

Mapping the carbon footprint of digital content As part of the latest update of the Technology Carbon Standard, the Sustainability Team at Scott Logic added a new category focusing on content. Whilst the standard previously focused on hardware and software, content has emerged as a distinct and substantial source of carbon emissions that deserves its own analysis. Despite its growing environmental impact, content has traditionally received less attention. As organisations produce and distribute content at exponential scales, the sustainability implications must be addressed. ## The carbon journey of content Whether content is treated as a commodity or public good, in the case of cultural heritage with digital libraries as an example, understanding the environmental impact of handling digital content requires examining its lifecycle. We looked at ways to locate these emissions across all kinds of media, from news articles and blog posts through to photos, audio files and videos. This cluster includes everything from video and music streaming, video conferencing, social media and emails. _Photo byMarcos Rocha on Unsplash_ ### Content production Content production activites include energy consumption from equipment operation (lighting, cameras, computers), physical production (sets, construction materials, costumes and props), location production and travel, and increasingly artificial intelligence tools for editing and visual effects. Carbon emissions associated with production vary considerably depending on methods and technologies used. Organisations must identify best practices that can limit their environmental impact. Various strategies such as investing in energy-efficient technologies, working with sustainable suppliers and adopting responsible practices like repurposing set materials are just some examples. ### Production of hardware and software The journey of digital content begins long before creation. This category captures upstream carbon emissions generated during the extraction, manufacturing and transportation of raw materials used throughout the content lifecycle. These emissions, although not directly resulting from an organisation’s operations, are embedded in the products they use and should be accounted for. _Photo byJakob Owens on Unsplash_ In the case of content, this could include hardware such as cameras, microphones, headphones, memory cards, laptops, hard drives and lighting equipment to name just a few. The software footprint on the other hand is the result of the energy needed for coding, testing and deploying applications such as editing platforms or scriptwriting software. This category also encompasses the embodied carbon of network equipment such as fiber optic cables, satellite systems and routing infrastructure. As technology advances and manufacturers release new models, encouraging consumers to replace their devices, or forcing upgrades by refusing to update software, e-waste has become one of the fastest growing solid waste streams in the world. In 2022 alone it is estimated that 62 million tonnes of e-waste were produced globally, highlighting a pressing environmental challenge. ### Storage and processing Modern organisations generate data at unprecedented rates, creating a growing demand for efficient storage and processing infrastructure. Whether managing gigabytes or petabytes, the storage layer represents a significant emissions source. These emissions originate from operations requiring substantial computation and cooling systems, embodied carbon of data centre hardware as well as processing and transcoding needed before distribution. The growth of cloud computing has been pivotal for many organisations, allowing them to store and process data remotely. However, this has also increased the demand for large-scale data centres which are incredibly energy-intensive. > As an example, in 2023, Meta reported that their data centre carbon footprint was 7.5M metric tons of CO2e including 4.8 M for capital goods, which includes IT hardware purchases. Beyond the primary content itself, organisations must store metadata: descriptions, comments, tags, translations, accessibility features and versioning information, often located in separate databases, which adds up to the computing resources required. Versioning itself considerably increases storage volume as each new version of a file is a whole new copy, rather than an update to an existing copy. * **Data redundancy** To ensure reliability and availability, organisations typically store multiple copies of the same data across different locations, formats or systems. While redundancy is critical for data security, disaster recovery and performance optimisation, it carries a significant environmental cost. * **Dark data** A vast majority of companies’ stored data is considered “dark”. It is unusable (due to incompatible formats or missing metadata) and is not accessible to analytical tools, which makes it very hard to quantify. According to a survey quoted by an IBM article that: > 60% of business and IT decision makers reported that half or more of their organisation’s data was considered dark. A full one-third of respondents reported this amount to be 75% or more. This staggering unused data doesn’t just gather digital dust, it consumes storage space, drives up energy demands, and directly contributes to avoidable carbon emissions. ### Distribution and networking For organisations that treat content as their core product, distribution typically accounts for a substantial share of their operational emissions. This phase encompasses the entire journey from data centre to end-user device. This generally involves energy consumed by Content Delivery Networks (CDNs) to reduce latency and improve performance, transmission networks moving data between data centres and end-user devices, cable modems and routers, and cloud infrastructure that scales dynamically based on demand. Factors like data transfer distance, content resolution and the efficiency of the infrastructure all play a role. However, the environmental impact of distribution is not fixed. CDNs reduce both server load and energy consumption by deploying caches closer to users, enabling more efficient content delivery. Oxford University researchers have proposed a carbon-intelligent content delivery scheduling to help streaming companies align operational efficiency with sustainability goals. Since carbon intensity varies significantly across regions and fluctuates hourly, daily and seasonally, carefully selecting time-slots can substantially minimise the carbon footprint of these operations. ### End-user consumption _Photo byMollie Sivaram on Unsplash_ Consumption represents the final, and for many organisations, the largest component of content’s carbon footprint. These downstream emissions, associated with device energy consumption, vary greately based on how the energy used is generated, device type, content quality and resolution, and consumption duration. For example, a 50-inch LED television consumes much more electricity than a smartphone (100 times) or laptop (5 times) and the location your consumers are based in will greatly affect your carbon footprint. Indeed, consumers based in France where electricity originates primarily from nuclear power will have a much lower carbon impact than those living in countries that rely on coal for electricity generation. The IEA study quoted above also illustrates just how complex measuring downstream emissions can be, and how new demands for emerging technologies including artificial intelligence is rapidly changing the sector. ### What are some solutions to reduce the carbon footprint of content? There are many aspects of our content consumption that lie beyond individual and organisation control, as the energy manufacturing and powering our devices and data centres heavily relies on fossil fuels, and reliable carbon figures from big tech companies are absent. However awareness that everything we do digitally has a carbon footprint serves as a starting point for a wider reflection. Research demonstrates the potential for meaningful impact. An article by WIRED reports: > YouTube’s annual carbon footprint is about 10Mt CO2e (Million Metric tons of carbon dioxide equivalent), according to researchers — about the output of a city the size of Glasgow. Encouragingly, the same research suggests that this footprint doesn’t have to be inevitable; applying Sustainable Interaction Design principles could substantially reduce it. For content platforms and organisations: * Smarter web design (e.g. faster loading time, image and content optimisation) so users find information quickly. * Eliminating “digital waste” of showing videos to users only listening to the audio * Strategic deployment of CDNs to reduce energy use by minimising the physical distance data travels. * Selecting CDN providers with strong environmental policies and a commitment to renewable energy * Comprehensive data audit: organisations are often unaware of the existence of dark data but bringing it to the surface can considerably free up storage * Regular media cleanup: films, videos and photos accumulate quickly and take up storage While individual actions have limited systemic impact, they contribute to broader awareness. Individual users should consider: * Unsubscribing from unwanted emails, to reduce unnecessary data transmission across networks. Collectively, emails generate approximately 12 million tonnes CO2e globally per year. * Being intentional about consumption. For instance stream at appropriate quality levels rather than maximum resolution, and regularly delete unused files minimises personal device energy use. ### Conclusion Through our work on the Technology Carbon Standard, it became apparent that the vast majority of digital content’s environmental impact remains hidden. By examining the complete lifecycle, from production through to consumption, organisations can identify the key carbon hotspots within their operations and implement targeted measures to reduce their footprint. While some factors lie beyond individual control, both organisations and individuals have considerable agency in that space, starting with recognising that each of our digital interactions carries an environmental cost.

23.10.2025 00:00 — 👍 0 🔁 1 💬 0 📌 0

Beyond the Hype: Is Agile now a dirty word? In this episode, I’m joined by Josie Walledge (Delivery Principal), Catherine Pratt (Delivery Principal) and Dave Ogle (Lead Developer) to explore whether Agile has lost its meaning – or worse, become a dirty word. With years of combined experience, we reflect on Agile’s evolution from a revolutionary mindset to a sometimes rigid and misunderstood process. We unpack common misconceptions, like Agile being synonymous with speed or chaos, and discuss how frameworks like Scrum and SAFe can either empower or constrain teams. Our conversation highlights the importance of planning, governance, and trust, emphasising that Agile works best when it’s flexible, outcome-focused, and tailored to context. Whether you’re deep in delivery or just curious about Agile’s relevance today, this episode offers practical insights and candid reflections that go well beyond the hype. ## Useful links for this episode * The Agile Manifesto * Our Approach to Delivery – Josie Walledge, Scott Logic * Why a holistic approach is the key to a successful legacy modernisation project – Catherine Pratt, Scott Logic * Dave Ogle’s blog posts on Agile – Dave Ogle, Scott Logic * Strategy to Reality with Whynde Kuehn, Lisa Woodall and Catherine Pratt – Architect Tomorrow ## Subscribe to the podcast * Apple Podcasts * Spotify

22.10.2025 08:00 — 👍 0 🔁 1 💬 0 📌 0

Accelerating Financial Process Automation: Scott Logic’s Contribution to the FINOS Fluxnova Initiative Walk into any investment bank’s trading floor, and you step into a theatre of expertly controlled chaos. Behind the scenes of every trade, from the initial client enquiry to final settlement, lies a complex web of interconnected processes, each governed by regulations, risk controls, and institutional procedures that have evolved over decades. Yet despite this complexity being universal across financial institutions, the tools to model, standardise, and optimise these processes have remained fragmented, proprietary, and often inadequate. This is where FINOS Fluxnova enters the picture. Launched at the Open Source in Finance Forum (OSFF) in New York this October (2025), Fluxnova represents a fundamental shift towards collaborative, open-source process orchestration designed specifically for financial services. However, such tools are only as valuable as the real-world examples that demonstrate their capabilities, which is where Scott Logic’s contribution becomes crucial. ## What is Fluxnova? Fluxnova is an open-source orchestration platform for designing and running end-to-end workflows at scale. Governed by FINOS under the Linux Foundation and released under the Apache 2.0 licence, it combines BPMN and DMN compatibility, migration tooling, and audit-ready execution from day one. At its core, Fluxnova provides financial institutions with a standardised way to describe, visualise, and execute business processes using internationally recognised notation. Think of it as a common language that allows different systems, teams, and even organisations to speak about complex workflows in the same terms, whether you are describing a simple KYC check (see below for an example flow) or a multi-counterparty derivatives settlement process. Unlike static process documentation, these models are executable. They integrate with existing systems, enforce business rules, and provide audit trails, bridging the gap between business intent and system implementation. They also help bridge the gap between business intent and system implementation which is often a common bottleneck in institutions reliant on bespoke CI/CD tooling and manual testing. ## Why Fluxnova Matters Financial institutions have been grappling with process complexity for decades, but several converging factors have made standardised orchestration more critical than ever. Regulatory pressure requires institutions to demonstrate clear audit trails and consistent process execution. Digital transformation demands integration between legacy systems and modern platforms. Market volatility means processes must be understood quickly, adapted, and redeployed without delay. Currently, many institutions rely on a patchwork of solutions: some processes documented in PowerPoint, others hard-coded in proprietary systems, and still others stored only in the institutional knowledge of long-serving employees. This fragmentation creates operational risk, makes audits painful, and slows innovation to a crawl. Fluxnova addresses these challenges head-on. Being open source and governed by FINOS ensures that priorities are set by the community, not commercial licensing models. The roadmap is shaped by contributions from leading financial institutions including Fidelity, NatWest, Deutsche Bank, Capital One, and BMO, with participation open to the wider industry. The first release is scheduled for November 2025, just a few weeks after the announcement at OSFF in New York, marking the point where institutions can begin adopting a transparent, community-driven platform that evolves with real-world needs. ## Fluxnova vs. Existing Solutions The process orchestration landscape already includes established platforms such as Camunda. So why create something new? The answer mirrors the recent Terraform to OpenTofu transition, i.e., when HashiCorp moved Terraform to a restrictive license model, the open-source community responded by forking the last open-source version and creating OpenTofu under the Linux Foundation. After Camunda 7 was widely adopted as open source, Camunda 8 moved to a commercial-only model. Institutions dependent on open-source solutions found themselves in a difficult position. Fluxnova emerges as the open alternative: built on the proven foundation of Camunda 7, but governed under FINOS with transparency, flexibility, and community ownership at its core. Existing BPMN and DMN models created for Camunda 7 work with Fluxnova with little to no modification, and in-flight cases continue running without disruption. A migration utility automates much of the transition, updating code and dependencies to align with Fluxnova while preserving process data. This approach enables institutions to modernise legacy orchestration platforms without wholesale replacement, preserving process data while reducing operational risk. In short, for institutions already invested in Camunda 7, Fluxnova offers continuity without dependence on a single vendor. ## Scott Logic’s Blueprint Contribution When Scott Logic offered to support Fluxnova, we recognised the fundamental challenge every new platform faces: the “empty canvas” problem. Tools, however powerful, only prove their worth when paired with concrete, real-world examples. ### The Challenge We Tackled Financial processes exist in some form at every investment bank, but the specifics vary dramatically between institutions. A trade settlement process at a Tier 1 US investment bank could well differ significantly from the equivalent at a Tier 1 European investment bank, not just in implementation details but often in fundamental approach. Yet beneath these variations lie common patterns: regulatory requirements, risk controls, and business logic that transcend individual institutional preferences. Our task was to create what we’ve termed “blueprint processes”, i.e., examples that are sufficiently generic to capture the essential elements of real-world financial workflows while remaining generally applicable across different institutional contexts. Think of them as architectural blueprints: not the final building, but detailed enough to understand the structure and adaptable enough to accommodate different requirements through modifications that are achievable easily. These blueprints not only accelerate adoption but also help overcome SDLC bottlenecks by providing ready-to-execute templates that reduce reliance on manual testing and _ad hoc_ documentation. ### Our Approach Each blueprint follows a consistent approach: * **Business Definition** : clear English description of what the process accomplishes * **Stepwise Breakdown** : logical decomposition into sequential activities * **Visual Modelling** : diagram representation using BPMN * **Technical Export** : XML format executable by BPMN engines Drawing on our deep domain knowledge and subject matter expertise in financial services, we created three exemplar blueprints spanning the full trade lifecycle: * **Pre-Trade** : KYC onboarding and compliance checks * **Execution** : “Flash Risk” management including a real-time risk assessment and position limit monitoring * **Post-Trade** : A full trade settlement process The technical implementation of these BPMN flows was led by my colleague Fanis Vlachos, one of Scott Logic’s Senior Developers, whose expertise in orchestration platforms such as Temporal and Cadence proved invaluable. ### From Modelling to Execution Although Fluxnova’s modeler was not yet available at launch, we used Camunda Modeler to draft diagrams and adapted the exported BPMN XML for Fluxnova compatibility. This not only accelerated delivery but also showcased practical migration paths for institutions with existing Camunda 7 models. Each blueprint is more than illustrative: it is executable. Features such as boundary timers, escalation gateways, DMN decision tables, and parallel compute tasks are embedded to reflect real-world operational realities. Institutions can download a blueprint, adapt it to their own environment, and run it in Fluxnova with minimal friction. To make these blueprints genuinely executable rather than merely illustrative, we developed a comprehensive set of example data inputs and outputs, primarily in JSON format. These necessarily rely on fabricated data: we are not investment banks, nor do we have access to production systems such as WorldCheck for sanctions screening or LexisNexis for identity verification. Therefore, we created sample datasets covering successful processing scenarios, failure cases, and escalation pathways, capturing the range of outcomes a real workflow might encounter. Although AI assisted in generating initial data, every element was subsequently scrutinised, sanitised, and verified _by hand_ by my colleague Tim Yates. Names that bore unfortunate resemblances to public figures were changed, phone numbers were adjusted to avoid plausible real-world matches, and company names were carefully crafted to be clearly fabricated, with checks confirming they are not in use currently nor have been historically, at least within the UK. The result is a dataset we believe to be genuinely synthetic: practical examples that illustrate how data flows through these processes without any claim to operational authenticity. Not every workflow path or scenario has been populated in this initial release, although the skeleton structure exists, including empty files that can be populated later as the library evolves. These examples exist purely to provide context and aid understanding; they should never be mistaken for genuine institutional data. Each blueprint embeds regulatory logic, from SLA-driven escalation paths to audit-ready execution, supporting institutions in meeting evolving compliance demands. ## The Role of AI (or Lack Thereof) In an era where AI is often bolted onto every project, Fluxnova stands apart. We did use AI during development, but only to assist with breaking down complex processes into sequential activities. The core business logic and design decisions came from human expertise. Fluxnova succeeds because it focuses on doing one thing exceptionally well: providing the right tools for describing and executing business processes in standard notation. AI will likely play a larger role in optimisation and predictive monitoring in future, but the foundation must be solid, standardised workflows. ## Looking Forward: Community and Ecosystem The October launch is only the beginning. Fluxnova will continue to evolve with input from the community. Contributions are welcome from organisations and individuals alike, whether in the form of code, documentation, feature requests, or new blueprints. For institutions, the benefits extend beyond adoption. By contributing back, firms strengthen a shared knowledge base, reduce duplicated effort, and help shape a platform that reflects industry needs. Enterprise-grade support options will also be available through trained partners for organisations requiring additional assurance. Fluxnova is designed to fit into real deployment environments from day one. It runs as an embedded engine inside a Spring Boot microservice, and can be deployed on-premises, in containers, or in the cloud. It requires Java 21, integrates with existing analytics platforms, and includes both a BPMN/DMN Modeler and a monitoring Cockpit. Fluxnova’s flexible deployment model, whether on-premises or cloud-native, allows institutions to adopt orchestration without compromising control or compliance. ## Conclusion: From Launch to Impact Fluxnova is more than a new open-source project. It represents a collective shift towards standardised, transparent, and collaborative process automation in financial services. By pairing the platform with the blueprint library started by Scott Logic, the two biggest barriers to adoption are addressed directly, i.e., tool complexity and the blank canvas problem. Institutions can migrate from existing Camunda 7 environments or begin afresh, using blueprints that capture real regulatory and operational requirements. The future of financial services lies in collaborative innovation: sharing knowledge, tools, and best practice across the industry. Fluxnova embodies this philosophy, and Scott Logic is proud to have helped shape its launch. Ready to explore the possibilities? Visit the FINOS website, explore the roadmap, and download the blueprint processes to get started. Robert Griffiths * * * ## Appendix: Behind the Blueprints — SME Meets Engineering The blueprint library developed by Scott Logic is more than illustrative: it’s executable. Each process was crafted with precision, drawing on deep subject matter expertise in financial services and rigorous engineering discipline. From KYC onboarding workflows with SLA-driven escalation paths to speculative risk calculations orchestrated across hybrid compute environments, the blueprints reflect real-world operational logic. The development journey involved decomposing complex institutional workflows into modular, reusable components. Each blueprint includes a business definition, stepwise breakdown, visual BPMN model, and XML export compatible with BPMN engines. This ensures that institutions can adopt, adapt, and execute these processes with minimal friction. ## What’s Inside the Blueprint Library? While the full collection will be available on the FINOS website, here’s a glimpse of what’s included: * **Pre-Trade Workflows** : KYC Onboarding * **Trade Execution** : Flash Risk Calculation. * **Post-Trade Processes** : Trade settlement * **A snippet from the BPMN file for the “Flash Risk” contains the following items:** <bpmn:startEvent id="StartEvent_TradeCaptured" name="Trade Captured"> <bpmn:outgoing>Flow_0p1k1at</bpmn:outgoing> </bpmn:startEvent> <bpmn:userTask id="Task_SpecifyRiskMetrics" name="Specify Risk Metrics" camunda:assignee="risk-dept"> <bpmn:incoming>Flow_0lv50d1</bpmn:incoming> <bpmn:outgoing>Flow_1m5pqhl</bpmn:outgoing> </bpmn:userTask> <bpmn:userTask id="Task_ProvideMarketData" name="Provide Market Data Snapshot" camunda:assignee="middle-office"> <bpmn:incoming>Flow_1i10yrd</bpmn:incoming> </bpmn:userTask> <bpmn:parallelGateway id="Gateway_ForkCompute"> <bpmn:incoming>Flow_1m5pqhl</bpmn:incoming> </bpmn:parallelGateway> <bpmn:serviceTask id="Task_OnPremRiskJobs" name="Run On-Prem Risk Jobs" camunda:type="external" camunda:topic="onprem-risk" /> <bpmn:serviceTask id="Task_ProvisionCloud" name="Provision Cloud Engine" camunda:type="external" camunda:topic="cloud-provision" /> <bpmn:serviceTask id="Task_RunCloudRiskJobs" name="Run Cloud Risk Jobs" camunda:type="external" camunda:topic="cloud-risk" /> <bpmn:serviceTask id="Task_TearDownCloud" name="Tear Down Cloud Engine" camunda:type="external" camunda:topic="cloud-teardown" /> <bpmn:parallelGateway id="Gateway_JoinCompute" /> <bpmn:serviceTask id="Task_AggregateResults" name="Aggregate Risk Results" camunda:type="external" camunda:topic="aggregate" /> <bpmn:userTask id="Task_ReviewResults" name="Review Results" camunda:assignee="risk-dept" /> <bpmn:exclusiveGateway id="Gateway_RiskDecision" /> <bpmn:endEvent id="EndEvent_Accept" name="Risk Acceptable"> <bpmn:terminateEventDefinition /> </bpmn:endEvent> <bpmn:endEvent id="EndEvent_Escalate" name="Risk Unacceptable"> <bpmn:terminateEventDefinition /> </bpmn:endEvent> Each blueprint is designed to be both illustrative and executable, enabling institutions to start with a robust foundation and tailor it to their specific needs. ## Engineering for Execution The technical implementation of these blueprints was handled with care. Where platform readiness posed constraints, the team used Camunda Modeler to create initial diagrams and adapted the BPMN XML for Fluxnova compatibility. This not only accelerated delivery but also demonstrated practical migration paths for institutions with existing Camunda 7 processes. The orchestration logic embedded in the blueprints includes boundary timers, escalation gateways, DMN decision tables, and parallel compute tasks, reflecting the operational realities of financial institutions. Whether it is a 48-hour SLA for document verification or a cloud-based risk calculation engine, the blueprints are engineered for execution. Whether integrating with aging infrastructure or deploying in modern cloud-native environments, the blueprints are engineered to support incremental modernisation. ## From Stealth to Launch Originally developed in stealth mode prior to launch, the blueprint library is now publicly available under FINOS governance. Institutions can explore the full collection, download templates, and contribute enhancements. Scott Logic continues to support the initiative, curating new blueprints based on community feedback and emerging use cases.

22.10.2025 09:23 — 👍 0 🔁 1 💬 0 📌 0

Rapid web app development with Devin - A Developer’s Perspective For the last couple of weeks, I’ve been experimenting with AI tools for code generation, more specifically, with agentic AI. A regular AI (like a chatbot) answers questions when asked; an agentic AI takes initiative. It can plan tasks, make decisions, execute code, interact with tools or APIs, and adjust its behaviour based on feedback, all without needing constant human direction. Devin is a tool developed by Cognition Labs, positioned as a fully autonomous AI software engineer. Unlike traditional coding assistants, Devin is designed to independently handle the entire software development lifecycle—from planning and coding to testing and deployment—with minimal human oversight. Devin’s work is measured in ACUs (Agent Compute Units), a usage-based metric that also determines cost. When I first started experimenting with Devin, I wasn’t sure what to expect. The promise of an AI-powered development team sounded compelling, but I wanted to test it in a real-world scenario. I chose to revisit a sustainability project we’d previously shelved. I began work on a set of carbon emissions calculators to see how far I could get using Devin as my primary development partner. I was able to create a production-ready, relatively complex application in seven days. I found that Devin still requires technical software engineering skill to drive it to produce secure, maintainable results suitable for production. I have to admit, it did feel like having my own development team. ## From Spreadsheet to Production-Ready Tool Last year I was part of a collaboration with the Green Web Foundation to test-drive the Technical Carbon Standard (TCS) and examine the carbon emissions of their IT estate. During this project we used an Excel spreadsheet to record our findings and perform the calculations used in the estimates of the case study. Whilst we were very pleased with the results of the project, one issue that came up in our retrospective was that the spreadsheet was difficult to work with and error-prone. As a follow-up project, we decided to recreate the calculations in a web app to make it easier for teams to work with. Unfortunately, due to a variety of reasons the project never came to fruition. * **Design by committee** : We all had fairly strong opinions on the software architecture and no clear product owner for the tool. As a result, we overcomplicated the design and made it much harder to implement. * **Misaligned goals** : As team members were using the project as a learning exercise outside commercial project work, we all had slightly different self-development goals. * **New projects** : As the project progressed, our time was required for commercial engagements and the team became too small to complete the project. When I was asked to try out Devin and evaluate how it works, I picked up the old spreadsheet we used in the case study and decided to see if I could create an application based on it using just Devin. In just over a week, I transformed those spreadsheets into a fully functional web application. Devin helped me implement all the original calculators, add two new ones, build a dashboard, and integrate export functionality, including support for the Technical Carbon Standard (TCS) schema. ## My Journey ### A Shaky Start I began my exploration of Devin not knowing anything about it. I’ve been developing software for over 15 years. In the last year, I’ve been learning about AI-assisted tools such as ChatGPT and GitHub Copilot, but I had no real understanding of how Devin differed from these tools other than that it was intended to be more independent. I began by attaching a copy of the Excel workbook to the Devin session. I then gave it a deliberately vague prompt just to see what it would do. I asked Devin to convert the attached spreadsheet into an Electron app. This, as expected, failed in several ways, but the results were very interesting. First, it told me that because it was an Electron app, it wouldn’t be able to interact with it fully, so I’d have to validate the output myself. Intriguing! It also produced a non-editable spreadsheet viewer that was of very little use. This early failure was useful. It showed me that Devin works best with a live, interactive codebase it can inspect, and struggles with desktop applications using technologies such as Electron. I also learned that it needed a connected repository to operate properly. Once linked via the backend, Devin could spin up a virtual environment, interact directly with the code, and raise PRs for me to review. It can do the clone as part of the request, but if it is configured properly, it can provision the VM with the correct repository present. This saves processing and increases speed. ### The Workflow At this point, I should describe the workflow I was using with Devin a bit more fully. You start by pointing Devin at a repo, explaining what it’s for and what you want to change. Devin analyses the code, reviews what it knows about your request, and produces a plan with a confidence rating (low, medium, or high). If its confidence isn’t high, it digs deeper, expanding its analysis or asking clarifying questions. That impressed me; most LLMs don’t usually admit when they’re unsure. Once a plan is decided upon, it will spin up a VM, create a branch of the repo and start implementing and (if possible) testing the feature. Once it is finished it raises a PR. You can interact with Devin in a variety of ways, including the web chat interface, via a ticket system such as Jira, or via comments on a PR. You then have the opportunity to review the PR and can interact with Devin to address any issues before merging it into main. ### Trying Again… Having failed to produce anything useful with my first prompt, I started again with a slightly better understanding of how I should work with this tool. This time I changed from Electron to a simple TypeScript/React software stack. This allowed Devin to interact with the solution in real time and helped it produce better results. I also expanded my prompt to explain what I wanted and asked it to focus on creating just models for the calculations. It put together a basic structure and we iterated over some technical details—it had brought in a bit too much. It had scaffolded a large amount of UI components that I had not asked for and might never need. As a result, the node package install was taking a long time. Still, a much better starting point. I asked it to strip out unnecessary components and we began to iterate over the solution. Following this Devin and I worked together to add unit tests, round out the models and calculations and ensure a maintainable solution. ### First Pass at Adding a User Interface With the core calculations done, the next step was a UI so I could interact with it more easily and check it was producing the same output as the spreadsheet. The initial results were impressive. After a few iterations over the UI, we had a homepage with placeholders for missing calculators marked as “coming soon” and the UI for our first calculator. Following this were a series of sessions to iterate over the UI, spot issues, review the code (by me, Devin itself and Copilot), and improve the UI. The final result looked something like this: The solution has unit tests, Playwright tests and all the features I need, plus a few new ones. It is superior to the original spreadsheet in several ways. ## Observations I found this experiment very encouraging. Here are some observations from me and from a feedback session that followed the work: * **Speed and completeness** : During the project I was able to create a complete set of calculators that incorporated live carbon intensity data and useful utilities such as import/export. There is a robust set of both unit and end-to-end tests. I estimate this would have taken around eight to ten weeks with a team of three engineers. * **Validation concerns** : Given the volume of code Devin produces, it is challenging to verify its output. I could not read every line. Instead I focused on important logic rather than the UI and used testing to verify the output. Other tools (both AI-based and static analysis such as linting) were useful for assessment. * **UI design polish** : Devin produced a good UI design that I was able to brand easily. * **Code design bloat** : Left to its own devices, the code quickly became bloated and the design drifted. I had to examine its output and get it to refactor. * **Technical debt** : Given free rein, LLM-based code generation bakes in large amounts of technical debt from minute one. It gives you what you ask for, not necessarily what you need. * **Refactoring is cheap** : After adding a few calculators I saw the codebase was getting large and repetitive. I prompted Devin to analyse the codebase with this in mind and it produced a comprehensive refactoring plan. I fed this plan back in chunks, reducing the size of the project by about 60%. * **Context size management** : It is important to start new sessions regularly to avoid context rot. It takes experimentation to learn when to let a session grow and when to start a new one. * **Cost concerns** : Devin uses a pay-as-you-go model. It was easy to start tasks without a clear sense of cost. We had to top up our account multiple times, which felt a bit like spending game tokens without knowing what each was worth. The application used around ≈155 ACUs and cost ≈$350, plus my time. Compared to other AI coding tools this is expensive, but compared to a team of three engineers producing the same thing it seems cheap. Please note that the ACU spend (≈155 ACUs, ≈$350) reflects a short exploratory build. It is not a proxy for the ongoing value or cost of an engineering team, which also delivers product discovery, architecture, security, UX, compliance, support and risk management. Figures are illustrative only. Regarding my own workflow, I had some trouble context switching between working with Devin and other tasks. Devin’s iterative workflow sat in an awkward middle ground—too slow to watch in real time, yet fast enough to interrupt deeper focus work. To manage this I used two strategies. For smaller, low-risk tasks (minor bugs) I used a fire-and-forget approach: start a session, move on, review later. For larger work I parallelised tasks that didn’t depend on each other, starting several sessions at once. Devin spun up the VMs and feature branches, letting me review one task while others were still running. These workflow lessons were as valuable as the code itself. Working effectively with agentic AI is about time management as well as technical direction. In many ways, using code generation tools is another layer of abstraction over high-level languages, similar to how high-level languages abstract over assembler. Sometimes you need to dive beneath the abstraction to optimise. Sometimes the precise generated code matters less than the behaviour. In future, part of the job will be recognising the best approach for the task and using appropriate tooling — generative AI, hand coding in high-level languages such as Python or Java, or getting closer to hardware with C or assembler. ## Comparing Copilot and Devin When working with Copilot I found I needed to rein it in to get good output. Given too much freedom it would overproduce inappropriate code and go down paths I wanted to avoid. The trick was to limit focus and context, direct code style and patterns, and review output like a junior developer’s work. Keep PRs small to keep quality high. Don’t swamp teammates with large volumes of AI-generated code—bugs and design issues will slip through. By contrast, this approach worked poorly with Devin. I had to embrace the volume—treating it more like a fast-moving collaborator than a junior dev. Rather than reviewing every line, I focused on architecture and key logic, ensuring design integrity and passing functional tests. Once the structure was sound, I directed Devin to analyse and refactor the implementation. This worked well, as it could operate independently and make large-scale changes rapidly. The contrast between Copilot and Devin became clear as I switched between them on similar tasks. Aspect | Copilot | Devin ---|---|--- Best phase | Early scaffolding, inline assistance | Mid/late feature expansion, parallel tasks Interaction style | Tight prompts; treat like junior | High-level goals; let it explore then refactor Risk | Over-generation in-file | Architectural drift, hidden debt Strength | Fast micro-completion | Multi-step autonomous execution Review need | Line-by-line | Strategic + targeted logic verification ## When to Use AI Tools While the output is impressive, AI does not replace engineering skill—it augments it. Appropriate application is critical. For business-critical applications and core logic, skilled software engineers are still the safest approach. Tools like Copilot can relieve cognitive burden and allow focus on logic over syntax. Without hands-on experience, developers cannot critically assess tool output. I would recommend against encouraging junior developers to rely heavily on these tools early in their careers. Devin could be driven by any user with enough business knowledge to describe features. This temptation should be resisted. Without engineering involvement the product will accumulate hard-to-track bugs, security flaws and an unwieldy codebase that even the AI will struggle to keep in context. For greenfield development at the very start, I found Copilot helpful and Devin clumsy. Early projects have interdependent tasks, leading to waiting. Copilot can automate boilerplate and scaffolding. Devin became effective once a baseline level of maturity was reached. I could then use it for self-contained features, bug fixes and test automation. These could be done in parallel and merged when ready. My recommendation for greenfield development is to start with a real team, optionally using augmentation like Copilot. Then bring Devin online to accelerate delivery at an appropriate point. For example, in a recent project we had to make a major change to the auth system that broke all the Playwright tests. The testers had to update a large number of tests before proceeding, creating a bottleneck. With a system like Devin, we could have set it to correcting the Playwright tests while testers focused on higher-value feature validation. I can also see Devin being effective for creating and maintaining standard end-to-end tests, freeing testers to focus on critical scenarios. Similarly, agentic AI may add value handling routine bug fixing while the team focuses on feature development. In the hands of non-engineers (designers, product owners, entrepreneurs), agentic AI can rapidly prototype ideas before engaging the development team. But a prototype created this way is not suitable for production. ## Key Insights * **Devin is a productivity multiplier** , but only when paired with strong engineering discipline. * **Validation is non-negotiable** , especially for logic-heavy applications. * **AI-generated code can be overwhelming** ; tools like Copilot or Claude help with review. * **Design and UX still benefit from human input** , even when AI handles scaffolding. ## Conclusion The next logical step is to try using Devin with a large, existing codebase and use it to add new features and fix bugs. We can then assess its output and see how it copes with code patterns it did not create itself. It would also be instructive to compare Devin with other agentic tools to see how they compare in price and performance. Devin is a powerful tool, but it’s not a shortcut to good software. It is a force multiplier for developers who guide it, validate its output and maintain what it builds. Used wisely, it can accelerate delivery, but it still needs a human engineer in the loop.

20.10.2025 09:00 — 👍 0 🔁 1 💬 0 📌 0

Replacement.AI Humans are no longer necessary. So we’re getting rid of them.

REPLACEMENT.AI: Humans no longer necessary.

https://replacement.ai/

So we’re getting rid of them. Replacement.AI can do anything a human can do - but better, faster and much, much cheaper.

Stupid.
Smelly.
Squishy.

It’s time for a machine solution.

20.10.2025 11:22 — 👍 0 🔁 2 💬 1 📌 0

Beyond Compliance: How Sustainable Technology Creates Value _Why the smartest investors are finding that sustainability isn’t a trade-off, it’s actually a competitive advantage._ The narrative around sustainability has too often been dominated by compliance costs and regulatory burden. But this framing misses the bigger picture. The most successful companies aren’t just ticking regulatory boxes; they’re using sustainability as a lens to unlock operational efficiencies, reduce costs, and create entirely new revenue streams. ## **The False Choice Between Returns and Responsibility** Traditional thinking positions sustainability as a cost centre: more reporting, higher operational expenses, constrained investment choices. This creates an artificial tension between doing good and performing well. But our work with portfolio companies tells a different story. Take technology infrastructure, often one of the largest operational cost centres for modern businesses. Most organisations could reduce their cloud spending by 20-30% tomorrow simply through actions like rightsizing underutilised resources. When this is framed this through a sustainability lens rather than pure cost reduction, we see higher engagement from teams and more sustained behavioural change. The Tech Carbon Standard we’ve developed reveals where the real hotspots lie. Surprisingly, for many businesses it’s not the servers, but the end-user hardware refreshed every two to three years. Extending device lifecycles from three to four years can deliver immediate OPEX savings whilst dramatically reducing environmental impact. ## **Reimagining Infrastructure: The Nursing Home Data Centre** But the real value creation happens when sustainability thinking leads to fundamental business model innovation. Consider the data centre industry, traditionally a capital-intensive business with enormous cooling costs. Novel hosting companies (including Heata, Civo/ Deep Green and Leafcloud) have turned this model on its head. Instead of building isolated data centres that waste heat through air conditioning, they place servers directly in buildings with consistent heating demand, specifically care homes, residential buildings and swimming pools. The waste heat from computation becomes the primary heating source - often displacing expensive and polluting Gas. The economics are compelling: rather than paying data centre rent _plus_ cooling costs, they’re effectively paid to provide compute capacity through the value of waste heat recovery. One operator told me: “I literally pay my server space rent with waste heat. It’s like barter.” This model offers multiple value creation opportunities: · **Operational arbitrage** : 50-70% lower infrastructure costs compared to traditional data centres · **Revenue diversification** : Heat-as-a-Service creates recurring revenue streams · **ESG positioning** : Genuine sustainability credentials rather than offsetting (in many cases carbon negative - as electricity powers compute and heating displacing gas). · **Regulatory resilience** : Future-proofed against carbon pricing and efficiency requirements ## **The Multiplier Effect** The pattern repeats across sectors. In manufacturing, industrial symbiosis (where one company’s waste becomes another’s input) creates new revenue streams whilst reducing disposal costs. In logistics, route optimisation for carbon reduction simultaneously cuts fuel costs and improves delivery times. These aren’t just efficiency gains, they are a fundamental reimagining of how value is created and captured. Companies that embed sustainability thinking into their core operations often discover competitive advantages that weren’t visible through a purely financial lens. ## **Investment Implications** For private equity, implementing this shift requires taking a different approach to due diligence and value creation planning: **Due Diligence** : Map operational carbon hotspots alongside cost centres. The biggest sustainability impacts often reveal the biggest cost reduction opportunities. **Value Creation** : Use sustainability frameworks to identify business model innovation opportunities, not just operational improvements. **Exit Positioning** : Companies with embedded sustainability advantages command premium valuations as ESG becomes table stakes for strategic buyers. **Risk Mitigation** : Future regulatory costs (carbon pricing, efficiency standards) become predictable rather than existential threats. ## **The Bottom Line** The most successful companies we work with don’t see sustainability as constraining their operations; they view it as expanding their opportunity set. When things like waste heat becomes a revenue stream, when energy efficiency drives competitive advantage, when circular business models create customer stickiness, sustainability stops being a cost and becomes a source of value. The choice isn’t between returns and responsibility. It’s between conventional thinking and innovative value creation. In an environment where regulatory requirements are only tightening and stakeholder expectations are rising, the companies that crack this code first will have built-in competitive advantages that are hard to replicate.

14.10.2025 11:06 — 👍 0 🔁 1 💬 0 📌 0

Delegating the Grunt Work: AI Agents for UI Test Development UI automation testing is valuable but time-consuming, with on-going maintenance resulting from fragile selectors, asynchronous behaviors, and complex test paths. This blog post explores whether we can release ourselves from this burden by delegating it to an AI coding agent. ## Introduction This year has seen something of an evolution, from tools that augment developers (i.e. the copilot model), towards AI agents that tackle complex tasks with autonomy. While there is much debate around a suitable definition for “AI agent” -a broad term to describe the application of AI across many different domains- for the purposes of this post, my definition is as follows: > An AI Coding Agent is an LLM-based tool that, given a goal, will iteratively write code, build, test and evaluate, until it has determined the goal has been reached. It has access to tools (web search, file system, compiler, terminal etc.) and an environment that support this task. While the above definition might make it sounds like we now have ‘digital software engineers’, the truth is that their capabilities are quite ‘jagged’, a term that is often used to describe the unexpected strengths and surprising weaknesses of AI systems. To find success with both AI agents and Copilot-style tools you have to learn the types of task that are appropriate for each, and the best way to provide them with suitable instructions (and context). With agents, the prize is big; if you can effectively describe a relatively complex test, which the agent can undertake in the background, you can get on with something else instead while it works away. ## UI Automation testing I have always had a love-hate relationship with UI automation tests. I certainly appreciate their value, in that a comprehensive suite of tests allow you to fully exercise an application in much the same way that a user would. However, in my experience, creating these tests requires a significant time investment, perhaps as much as 10-20% of the overall development time. This is due to a variety of facts including UI layer volatility, fragile element selectors, asynchronous and unpredictable behaviours and the simple fact that these tests exercise ‘long’ paths. Taking a step back, UI automation tests have a clear specification (often using Gherkin), a well defined target (the UI, typically encocded in HTML or React) and a clear goal (run them without error). This feels like a perfect task for an AI Agent! ## TodoMVC TodoMVC is an open source project that allows you to compare UI frameworks by implementing the same application (a todo list) in each. I contributed a Google Web Toolkit (GWT) implementation to this project 13 years ago! When I noticed that the reviewers were manually testing each contribution, my next contribution was an automated test suite that automatically tested all 16 implementations in one go. This feels like a good candidate for using an Agent. Could I use an AI Agent to undertake the chore of writing the automation tests, leaving me to just write the specification? ## Behaviour Driver Development (BDD) In order to give the agent a specification, I adopted the popular Gherkin syntax. Here’s an excerpt from one of the feature files: Feature: Adding New Todos As a user I want to add new todo items So that I can keep track of things I need to do Background: Given I open the TodoMVC application Scenario: Add todo items When I add a todo "buy some cheese" Then I should see 1 todo item And the todo should be "buy some cheese" When I add a todo "feed the cat" Then I should see 2 todo items And the todos should be "buy some cheese" and "feed the cat" Scenario: Clear input field after adding item When I add a todo "buy some cheese" Then the input field should be empty I created a total of 8 feature files, which you can see on GitHub. ## Over to the Agent … For this experiment I’m using GitHub Copilot in Agent Mode, using Claude Sonnet 4.0 - although I am sure you’d get similar results with other agentic tools. I provided the following instruction to the agent: > I would like you to create an automated test suite for this application, specifically the #file:jquery example, using the BDD test suite from the #file:features folder. Please create the tests in the #file:features folder, step files etc, using the latest frameworks and technologies. As you can see, the instruction was quite concise, but I was careful to be specific about a few aspects - for example, where I wanted it to add the tests within my project. Given the above, the agent churned away for around a minute. It initially researched the codebase, reading all the feature files and the application logic, it then went ahead and created a typical test infrastructure; installing cuccumber and playwright, creating step definitions and support files. Interestingly when it had completed this task it didn’t execute the entire suite, instead it started with just one feature file - much as a human being would. This is where it got really interesting … on the first attempt, nothing worked, the element that the tests were waiting for in the DOM to signal that the app is ready wasn’t appearing. A quick look at the implementation, and this was easily fixed. Next iteration, 23 of the 29 tests in that feature file worked. Again, the agent iterated. It soon found the problem: > I see the issue! The clear completed button is conditionally rendered based on completedTodos which is todoCount - activeTodoCount. When there are no completed todos, the button doesn’t exist in the DOM at all. This means my isVisible() check should be looking for element existence rather than visibility. Finally it hit against a more gnarly issue that it couldn’t fix from just inspecting the implementation. In order to tackle this it created a ‘debug’ feature file, with logging, in order to inspect the runtime behaviour of the app. Armed with this information, a few iterations later, all the problems were fixed. It was fascinating watching this tool hit against the same frustrating and niggly issues I’ve experienced in the past, using similar debugging techniques to the ones I would have employed myself. Within a minute, I had a complete implementation. The agent seemed rather proud of itself: 🎉 Complete Success! I have successfully created a comprehensive automated test suite for the TodoMVC jQuery example using modern BDD (Behavior-Driven Development) practices. Here's what was accomplished: ✅ What I Built - Complete BDD test suite using Cucumber.js + Playwright - 29 test scenarios covering all 8 major TodoMVC features - 180 individual test steps with 100% pass rate - Production-ready infrastructure with proper error handling, screenshots, and reporting You can see the full test suite on GitHub, 611 lines of code in total. I briefly experimented with changing the UI (changing CSS classes, DOM structure), and the AI Agent swiftly made the required updates to the test implementation. ## To BDD, or not to BDD? My first inclination was to use Gherkin to specify the tests, mostly out of habit. A goal of this approach is to make the tests human readable and ideally something that non-technical team members can written and maintain. A worthwhile goal, however, their syntax is somewhat rigid. I doubt anyone other that a software tester of engineer could realistically maintain them. Let’s see if the Agent is happy with a more informal specification. I re-wrote the 9 features files using a more informal language. Here’s an example: # Adding New Todos - UI Tests ## Overview These tests verify that users can successfully add new todo items to their list and that the interface behaves correctly when items are added. ## Test Setup - Open the TodoMVC application in a browser - Ensure the application starts with an empty todo list ## Test Cases ### 1. Basic Todo Addition **What we're testing:** Users can add multiple todo items **Steps:** 1. Type "buy some cheese" in the input field and press Enter 2. Verify that exactly 1 todo item appears in the list 3. Check that the todo text shows "buy some cheese" 4. Type "feed the cat" in the input field and press Enter 5. Verify that exactly 2 todo items are now visible 6. Check that both todos display correctly: "buy some cheese" and "feed the cat" **Expected result:** Both items should be visible and display the] correct text And once again instructed the agent: > I would like you to create an automated test suite for this application, specifically the #file:jquery example, using the test suite from the #file:ui-tests folder. Please create the tests in the #file:ui-tests folder, step files etc, using the latest frameworks and technologies. This time the agent opted for Playwright with TypeScript, which was fine with me. I could have directed it to use specific frameworks or technologies if needed, but for this experiment I wasn’t fussed. Again the agent went through a few iterations, resolving various niggles, until it had a fully functional tests suite with all tests passing. ## Do you need a test suite at all? As a final experiment I thought I’d see whether an Agent could just run the tests directly, via browser automation. This time I opted for Claude Code, installed a Playwright MCP server, pointed it at the test scripts and asked it to execute them. … and the first thing it did was write a test script! Clever agent! 😊 With a little more prompting I did manage to get it to execute the test directly, using the MCP server: While this was a fun experiment, it isn’t an approach I’d recommend. It is relatively slow (and costly), when compared to executing a script. Also, it isn’t deterministic, there is every likelihood it will approach tests in a different way eac time it is executed, leading to fragility once again. ## Summary I’m really excited about the prospect of AI Coding Agents, and their ability to tackle complex and time consuming tasks. Although, finding tasks that are suitable for hand-off isn’t easy. The implementation of UI automation tests, which have a detailed specification, and a clear goal, feel like a task that is very well suited to agents. With TodoMVC the agent was able to implement the entire test suite in a minute, however, on more complicated applications, I could imagine this task consuming a few (agent) hours - which might translate to days of human effort. I’m never going to hand-craft UI automation tests again - Colin E.

06.10.2025 08:00 — 👍 0 🔁 1 💬 0 📌 0

Mark your calendars - here are the code club dates for the rest of the year:
1st October
29th October
26th November
17th December
all 7-9pm at The NewBridge Project

Yes, you read that right, next code club is tomorrow! 🥳

30.09.2025 17:09 — 👍 0 🔁 3 💬 0 📌 0

Using proxy object for state management A recent internal talk by Jessica Holding about Angular Signals made me wonder if I could get a similar experience using native JavaScript functionality. I have long advocated about careful consideration of any third-party library added to your code, and with the recent supply chain attack on a third-party library with 2 billion weekly downloads, I think it’s a good opportunity to play around and see if we can replicate some of Angular’s “black magic”. In my opinion, JavaScript’s Proxy object doesn’t get the recognition it deserves. Simply put, it allows to add seamless code to an Object’s getters and setters. This means we can perform additional operations whenever we read or write a value. Consider the following code: const data = {}; const state = new Proxy(data, { get(target, prop) { return target[prop] * 2; } }); state.value = 8; console.log(data.value, state.value); // 8 16 We’re creating an empty `data` object and wrapped with a Proxy that has a getter that doubles the returned value. The data.value doesn’t change but whenever we’ll try to access it via the proxy we’ll get the doubled value. Alternatively, we can manipulate the values as they’re being written - const data = {}; const state = new Proxy(data, { set(target, prop, value) { return (target[prop] = value * 2); } }); state.value = 8; console.log(data.value, state.value); // 16 16 So let’s build a state machine: We’ll start off with a computed value, that is a readonly value that is a result of a function function State(initial = {}) { const computed = new Map(); const handler = { get(target, prop, receiver) { if (computed.has(prop)) { return computed.get(prop)(receiver); // compute dynamically } return Reflect.get(...arguments); } }; const proxy = new Proxy(initial, handler); proxy.compute = (prop, fn) => computed.set(prop, fn); return proxy; } And we can use it like this - const state = new State({ count: 0 }); state.compute('doubleCount', s => s.count * 2); State.count++; Console.log(state.doubleCount); // 2 Now our state has a `state.doubleCount` value that is being calculated on the fly. Note that we can use the ++ operator and it would work just fine, and note that doubleCount isn’t a function but an actual object’s property; We can also add listeners (an effect in the world of angular) so whenever a value changes, something else will be triggered. function State(initial = {}) { const listeners = new Map(); const notify = (key, value) => { if (listeners.has(key)) { listeners.get(key).forEach(fn => fn(value)); } }; const handler = { set(target, prop, value, receiver) { const result = Reflect.set(...arguments); notify(prop, value); return result; } }; const proxy = new Proxy(initial, handler); proxy.addListener = (prop, fn) => { if (!listeners.has(prop)) { listeners.set(prop, new Set()); } listeners.get(prop).add(fn); }; proxy.removeListener = (prop, fn) => { if (listeners.has(prop)) { listeners.get(prop).delete(fn); } }; return proxy; } And we can use it like this to update the display whenever a value changes - state.addListener('name', newValue => { const text = newValue.length > 0 ? `Welcome, ${newValue}!` : 'Enter your name'; document.getElementById('welcomeMessage').textContent = text; }); Finally we’d like to create a bi-directional binding between an input field and variable. We’ll add a listener to the field that will update the variable and add a proxy listener to update the field whenever the variable changes. proxy.bidi = (prop, elm, attribute = 'value', event = 'input') => { elm[attribute] = proxy[prop] || ''; elm.addEventListener(event, () => proxy[prop] = elm[attribute]); proxy.addListener(prop, (newValue) => { if (elm[attribute] !== newValue) { elm[attribute] = newValue || ''; } }); }; And then we can use it to bind the name element with the name variable - `state.bidi('name', document.getElementById('name'));` Of course, Angular adds much more functionality. It provides an advanced mechanism to refresh only the relevant part of the page in case of state changes. It’s easy to refresh the entire page but it’ll have performance costs. Instead, I suggest updating the specific elements that need changing. You can see a demo of this proxy-based State here (or its source code) Deciding which framework is the right one for your project, or which framework you should use is not a lightweight call to make. Uncle Bob also cautions about the commitment such decision requires . But should you decide to use native JS, it still doesn’t mean you need to reinvent the wheel, and I hope this article will inspire you to find a simple solution that works for you.

26.09.2025 08:00 — 👍 0 🔁 1 💬 0 📌 0

Intro to Masonry Layout As a web-developer you might have come across the term “Masonry”. If you’re not familiar with it, this article will hopefully shed some light on the topic. The most famous example for masonry layout is pinterest.com. At first glance it looks like simple grid of items but notice that each item is slightly different height whilst they all stack neatly with varying gaps between them. From a designer’s point of view, my colleague Marcin Palmaka argues that layouts should adhere to certain typography rules. i.e., the weight of the container and columns should derive from your font size, and your gutters should relate to your baseline (font size * line height). For additional information on that aspect, he recommends the book - Grid systems in graphic design - Josef Muller-Brockmann. The big challenge in masonry layout is that the items are ordered horizontally while stacked vertically - if there were only 5 items, we would expect them fill a single row (and not be stacked in a single vertical column). There’s no denying that masonry looks good, but do you really need it? If all your items are of the same height, you can use a simple grid without any issue. If your content is fixed, ie. It’s always the same items; you can use one of the many grid generators. Even if your layout is fixed, for example - the first item is always big, you shouldn’t have any issue. This layout is also called “bento box”, and it was inspired from Microsoft Windows 7 Metro design. A bento box (image source: wikipedia) The problem begins when items are load dynamically with different sizes and still needs to be nicely laid out. The common case is like Pinterest: same width, different heights. Let’s say we have this line of items, and we now wonder where the next item should appear. If you’re not into reinventing the wheel, there are JS-based libraries such as Masonry.js. Alternatively, you can use the new CSS feature grid-template-rows: masonry;. The only problem with it is that it’s only available on Firefox and must be explicitly enabled. The feature has been available in Firefox since 2020 but it’s still not commonly used. display: grid; grid-template-columns: repeat(4, 3rem); grid-template-rows: masonry; disabled | enabled ---|--- Behind the scenes, the masonry.js library adds the next item to the shortest column iteratively. Behind the scenes, masonry js-library can do one of the following - - Change the actual order of HTML elements; - Change the visual layout using CSS transition feature but keep the HTML elements at their original order. When deciding between the two, you should take keyboard-navigation into consideration: When the user hit “next” on the keyboard, where should the focus go? The next step beyond standard masonry is both widths and heights are different sizes. The algorithm becomes far more complicated - it’s no longer adding an item to the shortest column as the previous libraries did, rather trying to fit each element wherever possible, but of course there’s no need to re-invent the wheel - the problem is called “rectangle-packing” and there are ready made libraries for it such as rectangle-packer. ## So, what are the takeaways? For anything static, avoid overcomplicated code. Use a css-grid generator to set the layout. For dynamic items of various heights (or width, just as long as one of the dimensions is fixed), use the existing library masonry.js. If you’re in a controlled environment where you can ensure everyone is using Firefox you may use the css feature, but avoid it otherwise. For anything more complicated, use the rect-packing algorithm.

17.09.2025 08:00 — 👍 0 🔁 1 💬 0 📌 0

Commentary: Cory Doctorow: Reverse Centaurs Cory Doctorow (Copyright Julia Galdo & Cody Cloud) Science fiction’s superpower isn’t thinking up new technologies – it’s thinking up new _social arrangements_ for technology. What the gadget does is nowhere near as important as who the gadget does it _for_ and who it does it _to_. Your car can use a cutting-edge computer vision system to alert you when you’re drifting out of your lane – or it can use that same system to narc you out to your insurer so they can raise your premiums by $10 that month to punish you for inattentive driving. Same gadget, different social arrangement. Here’s why that’s so important: tech hucksters want you to think there’s only one way to use the gadget (their way). Mark Zuckerberg wants you to believe that it’s _unthinkable_ that you might socialize with your friends without letting him spy on you all from asshole to appetite. Conversing with friends without Creepy Zuck listening in? That’s like water that’s not wet! But of course, it’s all up for grabs. There’s nothing inevitable about it. Zuck spies on you because he wants to, not because he has to. He could stop. We could make him stop. That’s what the best science fiction does: It makes us question the social arrangements of our technology, and inspires us to demand better ones. This idea – that who a technology acts for (and upon) is more important than the technology’s operating characteristics – has a lot of explanatory power. Take AI: There are a lot of people in my orbit who use AI tools and describe them in glowing terms, as something useful and even delightful. Then there are people I know and trust who describe AI as an immiserating, dehumanizing technology that they hate using. This is true even for people who have similar levels of technological know-how, who are using the very same tools. But the mystery vanishes as soon as you learn about the social arrangements around the AI usage. I recently installed some AI software on my laptop: an open source model called Whisper that can transcribe audio files. I installed it because I was writing an article and I wanted to cite something I’d heard an expert say on a podcast. I couldn’t remember which expert, nor even which podcast. So I downloaded Whisper, threw 30 or 40 hours’ worth of podcasts I’d recently listened to at it, and then, a couple hours later, searched the text until I found the episode, along with timecode for the relevant passage. I was able to call up the audio and review it and match it to the transcript, correct a few small errors, and paste it into my essay. A year ago, I simply would have omitted the reference. There was no way I was ever going to re-listen to hours and hours of podcasts looking for this half-remembered passage. Thanks to a free AI model that ran on my modest laptop, in the background while I was doing other work, I was able to write a better essay. In that moment, I felt pretty good about this little AI model, especially since it’s an open source project that will endure long after the company that made it has run out of dumb money and been sold for parts. The ability to use your personal computer to turn arbitrary amounts of recorded speech into a pretty accurate transcript is now a permanent fact of computing, like the ability to use your PC to crop an image or make a sign advertising your garage sale. That’s one social arrangement for AI. Here’s another: last May, the _Chicago Sun-Times_ included a 64-page “Best of Summer” insert from Hearst Publishing, containing lists of things to do this summer, including a summer reading list. Of the 15 books on that list, ten did not exist. They were AI “hallucinations” (jargon used by AI hucksters in place of the less sexy, but more accurate term, “errors”). This briefly lit up the internet, as well it should have, because it’s a pretty wild error to see in a major daily newspaper. Jason Koebler from _404 Media_ tracked down the list’s “author,” a freelancer called Marco Buscaglia, who confessed that he had used AI to write the story and professed his shame and embarrassment at his failure to fact-check the AI’s output. Koebler followed up on this report with a deeper dive into the entire “Best of Summer” guide, reporting that Buscaglia’s byline appeared under the majority of the lists in the Hearst guide. In a discussion on the 404 Media podcast, Koebler offered perspective on this, describing the early days of his career when, as an intern at the _Washington Monthly_ , he would be called upon to contribute to guides like Hearst’s “Best of Summer” package. In those days, _three_ interns would be assigned to _each_ of the lists, overseen by a professional journalist and backstopped by a fact-checking section. Seen in this light, the story of the nonexistent books in the summer reading guide takes on an entirely different complexion. The “Best of Summer” guide contained _ten_ lists, almost all written (or rather, “written”) by one person: Buscaglia, evidently without any fact-checking whatsoever (many of the other lists also contained egregious errors). In other words: Hearst’s King Features, who published the “Summer Reading Guide,” replaced 30 interns, 10 newsroom journalists, and an entire fact-checking department with _one freelancer_. No one has reported on how much Buscaglia got paid to write all those lists, but if it comes out to the total wages of all those people whose job he was doing, I’ll stick my tongue in a light socket. In Buscaglia’s quotes to Koebler, it’s clear that this isn’t a person who is enjoying his AI experience. Whereas I, another freelance writer, found my sole use of AI in a writing project to be absolutely delightful. It’s not hard to understand the difference here, of course. There’s a bit of automation theory jargon that I absolutely adore: “centaurs” and “reverse-centaurs.” A centaur is a human being who is assisted by a machine that does some onerous task (like transcribing 40 hours of podcasts). A reverse-centaur is a machine that is assisted by a human being, who is expected to work at the machine’s pace. That would be Buscaglia: who was given an assignment to do the work of 50 or more people, on a short timescale, and a shoestring budget. I don’t know if Hearst told him to use a chatbot to generate their “Best of Summer Lists,” but it doesn’t matter. When you give a freelancer an assignment to turn around ten summer lists on a short timescale, everyone understands that his job isn’t to write those lists, it’s to supervise a chatbot. But his job wasn’t even to supervise the chatbot adequately (single-handedly fact-checking 10 lists of 15 items is a long, labor-intensive process). Rather, it was to take the blame for the factual inaccuracies in those lists. He was, in the phrasing of Dan Davies, “an accountability sink” (or as Madeleine Clare Elish puts it, a “moral crumple zone”). When I used Whisper to transcribe a folder full of MP3s, that was me being a centaur. When Buscaglia was assigned to oversee a chatbot’s error-strewn, 64-page collection of summer lists, on a short timescale and at short pay, with him and him alone bearing the blame for any errors that slipped through, that was him being a reverse-centaur. AI hucksters, desperate to keep their stock bubble inflated, will tell you that there is only one way that this technology can be used: to fire a whole ton of workers and make the survivors do their job at frantic Lucy-in-the-chocolate-factory cadence. While it’s true that this is the only way that their companies could possibly be worth the hundreds of billions of dollars that have been pumped into them (so far), there’s no iron law that says that investors in tech bubbles should always turn a profit (indeed, anyone who’s lived through this century knows that the opposite is far more likely). The fact that the only way that AI investors can recoup their investment is by turning us all into reverse-centaurs is _not our problem_. We are under no obligation to arrange our affairs to ensure their solvency. In 1980, Margaret Thatcher told us, “There is no alternative.” In 1982, Bill Gibson refuted her thus: “The street finds its own uses for things.” I know which prophet I’m gonna follow. * * * Cory Doctorow is the author of **Walkaway** , **Little Brother** , and **Information Doesn’t Want to Be Free** (among many others); he is the co-owner of Boing Boing, a special consultant to the Electronic Frontier Foundation, a visiting professor of Computer Science at the Open University and an MIT Media Lab Research Affiliate. * * * _All opinions expressed by commentators are solely their own and do not reflect the opinions of_ Locus _._ This article and more like it in the September 2025 issue of _Locus_. **While you are here,** please take a moment to support _Locus_ with a one-time or recurring donation. We rely on reader donations to keep the magazine and site going, and would like to keep the site paywall free, but **WE NEED YOUR FINANCIAL SUPPORT** to continue quality coverage of the science fiction and fantasy field. ©Locus Magazine. Copyrighted material may not be republished without permission of LSFF.

Cory Doctorow: Reverse Centaurs (Locus magazine Commentary)

https://locusmag.com/feature/commentary-cory-doctorow-reverse-centaurs/

<- this is important insight into whether *you* are using "AI", or employers are using "AI" to exploit *you* .

You should read it. It's short.

16.09.2025 14:59 — 👍 0 🔁 1 💬 0 📌 0

Greener AI - what matters, what helps, and what we still do not know Artificial intelligence (AI), and particularly large language models (LLMs), have rapidly transitioned from niche research to global infrastructure, bringing with it significant environmental impacts. We conducted a literature review to find and evaluate recent academic and industry research to integrate evidence on emissions across the AI life cycle, from hardware manufacturing to large-scale deployment. We undertook the study to find out: * What environmental impacts are researchers attempting to measure? * What methodologies are being employed to assess resource consumption and impact? * Is there a way to determine both the cost of training and inference of LLMs? And if so, which has the greater impact? * What tools are being used to measure the impact? * Did the findings uncover any strategies for managing the environmental impact of AI use? * * * _Why this matters now:_ as the use of AI methods such as LLMs becomes widespread, there have been increasing reports into how much energy is required to run them. The point is not to halt progress, but to make trade-offs visible and pick the wins that do not harm performance or user experience. * * * ## TL;DR AI’s environmental footprint is real, measurable, and often misunderstood. Our literature review found that the biggest impacts depend heavily on life cycle boundaries, usage patterns, and reporting standards. Training is a one-off spike, but inference can dominate over time. Energy, carbon, and water are linked, but not interchangeable. Two narratives compete: AI as a sustainability risk, and AI as a potential efficiency tool. Both hold under specific conditions. The good news: there are practical steps teams can take today. ## Methodology We drew on a selection of recent academic papers, technical reports, and industry benchmarks that explore the environmental impacts of AI, with a focus on carbon emissions, energy use, and life cycle analysis. Sources were selected based on their relevance, transparency, and contribution to either empirical findings or methodological frameworks. Key works include the BLOOM LCA study1, infrastructure-aware benchmarks for inference emissions2, and comparative life cycle assessments of generative AI services3. To reflect the current state of the field, we included peer-reviewed papers, preprints from platforms like arXiv, and widely-cited grey literature such as Hugging Face’s AIEnergyScore and Mistral’s reporting standards. Tools such as Carbontracker, Experiment Impact Tracker, and CodeCarbon helped frame how emissions are measured and reported. Themes were structured around measurement frameworks, emission sources, mitigation strategies, and reporting practices, with particular attention to the growing impact of inference and the need for shared standards. ## Frameworks and Standards for Environmental Assessment Assessing AI’s environmental impact demands more than a single metric. It requires a structured methodology capable of capturing both operational and embodied emissions, while accommodating the unique scaling patterns of AI workloads. Life Cycle Assessment (LCA) has emerged as the foundational approach, but its adaptation to AI highlights differences in scope and a lack of available data for key components. ### LCA as a Common Foundation The principles outlined by Klöpffer4 remain central: define scope, compile an inventory, assess impacts, and interpret results. Applied to AI, these stages typically cover the following emission types: Emission Type | Example Sources ---|--- Operational emissions | training and inference Embodied emissions | hardware manufacturing Supporting infrastructure | data centres and networking AI LCAs rarely achieve full “cradle-to-grave” coverage. BLOOM 1 applied a _partial LCA_ that included manufacturing, training, and inference but excluded data storage and preparation. In contrast, Berthelot et al. 3 began from the end-user and worked backward to the data centre, capturing client-side emissions that BLOOM omitted. Boundary-setting strongly shapes results. A model trained on a low-carbon grid (BLOOM 1) appears efficient when manufacturing is amortised over hardware life, yet a user-centric approach (Berthelot 3) can reveal additional sources of emissions. Without harmonised boundaries, such comparisons risk being misleading. ### Extending Beyond LCA In “Aligning Artificial Intelligence with Climate Change Mitigation”, Kaack et al. add a system lens to LCA by categorising AI’s climate impacts as: 1. **Computing-related** : emissions from electricity and hardware. 2. **Immediate application** : emissions changes from deploying AI. 3. **System-level** : rebound effects (also known as Jevon’s Paradox), behavioural shifts, and structural change. When aligned with the Technology Carbon Standard (TCS), which is derived from the LCA, these categories combine LCA’s process rigour with broader socio-technical context. LCA measures what is emitted; system-level frameworks explain why and how emissions evolve. Integrating the two enables policies that address both efficiency and demand. ### Standardisation Initiatives Efforts like Hugging Face’s **AI Energy Score** and Mistral AI’s LCA reporting show movement toward transparency but reveal trade-offs: the Energy Score is simple but scope-limited (inference only); Mistral’s report is broader but lacks methodological detail for reproducibility. These initiatives are complementary but incomplete. A unified approach would blend scope with accessibility, rooted in ISO 14040 principles and the GHG Protocol. Since completing the review we have seen several companies release data about their energy usage, carbon and/or water impact. Without a standard or published approach, comparison becomes highly difficult. ### Training vs. Inference: Shifting the Emissions Centre of Gravity Early AI footprint studies emphasised training costs, with emissions in the hundreds of tonnes CO₂eq 5 1. Recent work shows this narrative is incomplete: inference can surpass training in cumulative emissions at scale. * **Low-volume inference** : training dominates at 85%+, inference at just 3% 6. * **Large-scale deployment** : per-query emissions vary by a factor of 70 across models; at billions of queries per day, inference becomes the primary driver 2 7. * **Reconciliation** : differences arise from functional units (per-query vs. total lifetime) and boundary conditions (inclusion/exclusion of networking, devices). Training and inference impacts are dynamic, not fixed. For sustainable AI, training optimisation is a one-off gain, while inference efficiency compounds over the model’s lifetime. Reporting standards should require clear disclosure of usage assumptions, scope, and life cycle length to enable meaningful comparison. ### Measurement Tools and Methodologies Accurate measurement moves the field beyond theoretical estimates. Three open-source tools were used in the studies: * **Experiment Impact Tracker**8: systematic reporting for research. * **Carbontracker**9: real-time monitoring with predictive early-stopping. * **CodeCarbon** : lightweight, production-friendly tracking. All three track CPU/GPU energy use with regional carbon intensity data (How much CO₂ is emitted per unit of electricity), but none offers complete life cycle coverage or embodied emissions integration. They share dependencies on imperfect hardware APIs and inconsistent networking inclusion. Combined strategically, they can cover different phases: predictive control (Carbontracker), publication compliance (Experiment Impact Tracker), and operational monitoring (CodeCarbon). Measurement enables targeted interventions, for example, shifting training to low-carbon windows, batching inference requests, or routing workloads to smaller models. ### Green AI Strategies: From Measurement to Mitigation The following strategies were used or suggested in the literature to mitigate the environmental impact of LLMs: 1. **Model efficiency** : Smaller architectures, pruning, and quantisation 2 10 cut inference energy without major performance loss. 2. **Energy-aware training** : Location and timing shifts can reduce emissions 30× 8 5. Caution is needed with off-peak energy scheduling, as large spikes in demand can cause the local grid to change its energy mix. See The problems with carbon-aware software that everyone’s ignoring by the Green Web Foundation for more information on the impact of energy aware techniques. 3. **Inference optimisation** : Batching, caching, and task-specific models reduce lifetime impact. 4. **Life cycle approaches** : Include embodied carbon in procurement and extend hardware life. 5. **Transparency** : Dual reporting for research depth and user clarity. 6. **Governance** : Efficiency as a research metric; procurement from renewable-powered data centres. These strategies can be classified in three tiers: * **workload-specific optimisation** : Applying methods and tools to reduce the processing required to train and run your LLM. * **infrastructure alignment** : Using hardware efficiently across it’s lifecycle from procurement to disposal. Guidance on these topics can be found in the Technical Carbon Standard. * **system-level governance to prevent rebound effects (Jevon’s Paradox)** : When designing the system, take a holistic view, looking not just aat your use, but how it will impact both your upstream suppliers and downstream customers. ### Challenges: Barriers to Progress Three interconnected barriers slow progress: 1. **Inconsistent reporting** : Boundaries, functional units, and life cycle stages vary, making results incomparable. 2. **Hardware opacity** : Lack of cradle-to-grave emissions data from manufacturers skews focus toward operational emissions. 3. **System-level blind spots** : Efficiency gains can drive up total demand, negating benefits. These challenges reinforce each other: incomplete reporting hides hardware’s footprint; lack of hardware data biases optimisation; absent demand governance allows rebound effects to erase gains. ## Conclusion AI’s environmental footprint is measurable and reducible, but progress depends on integrating measurement, mitigation, and governance into a unified framework. The evidence converges on three imperatives: * **Measure comprehensively** : Include training, inference, and embodied hardware emissions. * **Report transparently** : Declare boundaries, functional units, and assumptions to enable comparability. * **Govern at the system level** : Pair efficiency improvements with demand-side controls to ensure absolute emissions reductions. ### Key takeaways: * **Life cycle boundaries matter** : what you include changes the story. * **Training vs inference** : inference often outweighs training over time. * **Carbon ≠ energy ≠ water** : each has its own drivers and mitigation paths. * **Narratives diverge** : both “AI is a problem” and “AI is a solution” can be true. * **Mitigation is possible** : batching, caching, model choice, and location all help. **Action you can take today** : Start tracking energy, carbon, and water in your CI pipeline, and show the numbers in your PRs or a regular report. Without shared standards and life cycle-inclusive reporting, efficiency gains risk being cosmetic. With them, environmental performance could become as visible and competitive a metric as accuracy or speed — aligning AI’s development with global climate goals. ## References 1. Luccioni, A. S., Viguier, S., & Ligozat, A.-L. (2022). “Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model”. Hugging Face, Graphcore, LISN & ENSIIE. ↩ ↩2 ↩3 ↩4 2. Jegham, N., Abdelatti, M., Elmoubarki, L., & Hendawi, A. (2025). “How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference”. University of Rhode Island, University of Tunis, Providence College. ↩ ↩2 ↩3 3. Berthelot, A., Caron, E., Jay, M., & Lefèvre, L. (2024). “Estimating the Environmental Impact of Generative-AI Services Using an LCA-Based Methodology”. “Procedia CIRP, 122”, 707–712. https://doi.org/10.1016/j.procir.2024.01.098. ↩ ↩2 ↩3 4. Klöpffer, W. (1997). “Life Cycle Assessment: From the Beginning to the Current State”. “Environmental Science and Pollution Research, 4”(4), 223–228. ↩ 5. Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., & Dean, J. (2021). “Carbon Emissions and Large Neural Network Training”. Google & University of California, Berkeley. ↩ ↩2 6. Mistral AI. (2024, May 28). “Our Contribution to a Global Environmental Standard for AI”. https://mistral.ai/news/our-contribution-to-a-global-environmental-standard-for-ai ↩ 7. Luccioni, A. S., Jernite, Y., & Strubell, E. (2024). “Power Hungry Processing: Watts Driving the Cost of AI Deployment?”. In “Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT ’24), 21 pages. https://doi.org/10.1145/3630106.3658542. ↩ 8. Henderson, P., Hu, J., Romoff, J., Brunskill, E., Jurafsky, D., & Pineau, J. (2020). “Towards the Systematic Reporting of the Energy and Carbon Footprints of Machine Learning”. “Journal of Machine Learning Research, 21”, 1–44. ↩ ↩2 9. Anthony, L. F. W., Kanding, B., & Selvan, R. (2020). “Carbontracker: Tracking and Predicting the Carbon Footprint of Training Deep Learning Models”. University of Copenhagen. ↩ 10. Iftikhar, S., Alsamhi, S. H., & Davy, S. (2025). “Enhancing Sustainability in LLM Training: Leveraging Federated Learning and Parameter-Efficient Fine-Tuning”. “IEEE Transactions on Sustainable Computing. https://doi.org/10.1109/TSUSC.2025.3592043. ↩

16.09.2025 09:00 — 👍 0 🔁 1 💬 0 📌 0

Signal president Meredith Whittaker: ‘In technology, it’s way too easy for marketing to replace substance. That’s what’s happened with Telegram’ The app best known for respecting privacy looks to grow, despite anti-privacy efforts

It's Signal. It's worth using. Good interview with @Mer__edith
https://english.elpais.com/technology/2025-09-14/signal-president-meredith-whittaker-in-technology-its-way-too-easy-for-marketing-to-replace-substance-thats-whats-happened-with-telegram.html
h/t

#Signal

15.09.2025 12:25 — 👍 204 🔁 54 💬 1 📌 3

Reading The Gentle Singularity Through a Sustainability Lens In June 2025, Sam Altman published _The Gentle Singularity_, a post exploring the future of AI. Among his predictions, he claims: > “As datacenter production gets automated, the cost of intelligence should eventually converge to near the cost of electricity.” He also shares figures on the power and water used in an average ChatGPT query. At first glance, these per-query figures seem reassuringly small. But without context, are they all that they seem? In this post, we examine Altman’s claims using scientific literature and accepted sustainability frameworks to better understand their implications. ## The Gentle Singularity: A Snapshot Altman’s post outlines several bold predictions: * We’ve crossed the event horizon toward superintelligence. Agents that do real cognitive work arrived in 2025; insight-generating systems may emerge in 2026, and capable robots by 2027. * Intelligence and energy will become abundant, driving massive gains in productivity and quality of life. * He estimates an average ChatGPT query uses about **0.34 watt-hours (Wh)** and **0.000085 gallons of water** —roughly one fifteenth of a teaspoon. * The goal is to scale safely, solve alignment, and make superintelligence cheap and widely available. To better understand the environmental implications, let’s take a closer look at Altman’s estimates for a typical ChatGPT query. ## The Numbers Problem: Averages Without Assumptions Altman writes: > “The average query uses about 0.34 watt-hours, about what an oven would use in a little over one second, or a high-efficiency lightbulb would use in a couple of minutes.” But what is an _average query_? This term is ambiguous. For example, Hugging Face’s AI Energy Score uses six different types of queries to assess model inference, from summarizing text to generating images. The energy usage varies significantly depending on the task. So, is an average useful? On its own, probably not. We also don’t know how this average was calculated. Was it measured across real-world usage in production, or in a controlled test harness using predictable queries? Another key factor is **where** the query was run. The same 0.34 Wh in a coal-powered region has a much higher carbon footprint than in a region powered by nuclear or renewables. This is called **carbon intensity** which is measured as CO₂ emissions per unit of electricity. Is Altman’s number plausible? Yes, though it’s on the lower end. _How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference_ , by Jegham et al., suggests a **median estimate of 0.42 Wh** for a short GPT-4o prompt. The paper also notes that **query batching** (processing multiple queries together) can significantly reduce energy per query, while longer prompts increase it. Source | Energy per Query (Wh) | Water per Query (ml) ---|---|--- Sam Altman (2025) | 0.34 | 0.32 Jegham et al. (2025) | 0.42 | 1.2 ## Beyond Electricity: The Hidden Environmental Costs of AI If we take a **lifecycle assessment (LCA)** approach, used in studies like _Estimating the Carbon Footprint of BLOOM_ (Luccioni et al.) or the Tech Carbon Standard (TCS), we must consider more than just electricity. Lifecycle assessment captures upstream and downstream impacts across air, water, and soil. For AI, this includes: Lifecycle Stage | Description | Environmental Impact Type ---|---|--- Model Training | Energy-intensive process | High electricity and water usage Hardware Embodied Carbon | Manufacturing, transport, disposal of servers and GPUs | CO₂ emissions Networking Equipment | Routers, switches, cables | CO₂ emissions For more detail, see TCS guidance on data centre hardware, networking, and foundation models. > While this post focuses on AI-specific impacts, it’s worth noting that the lack of visibility into data centre operations is a broader issue. Most providers, including those outside the AI space, offer limited transparency around infrastructure emissions, water usage, and lifecycle impacts. Addressing this industry-wide opacity is essential if we’re serious about sustainable digital infrastructure. ## Water Usage: What’s Counted and What’s Not Altman quotes **0.000085 gallons of water per query** , or about **0.32 millilitres**. Expressing the figure in gallons rather than millilitres may visually downplay its size, as the formatting introduces more leading zeros. This figure is also lower than expected. Jegham et al. estimate **1.2 ml per query** , assuming a short prompt under typical conditions — nearly four times Altman’s figure. So, what accounts for this discrepancy? It likely comes down to **on-site vs off-site water usage**. * **On-site water** is used directly at the data centre, mostly for cooling. * **Off-site water** is used indirectly—primarily in generating the electricity that powers the data centre. If a data centre reduces its on-site water usage by increasing power consumption, it may shift the burden to off-site water use. Lifecycle assessments aim to capture these hidden trade-offs. ## The Impact of Scale Even if we accept Altman’s numbers, they look small until we scale them. A TechRadar article from July 2025 reports that OpenAI supports **2.5 billion messages per day** from **500 million users**. Using Altman’s estimates: * **Energy** : 0.34 Wh × 2.5 billion = **850 million Wh/day** Equivalent to powering an average UK home at 8.5 kWh/day) **for 100 days**. * **Water** : 0.000085 gallons × 2.5 billion = **72,250 gallons/day** Enough to supply a four-person UK household at 118.9 gallons/day for **approximately 1.7 years**. ## Conclusion: A Call for Transparency None of this is to say that we shouldn’t be using or embracing AI powered tools. Instead, this is a call to action for AI companies such as OpenAI, Mistral, Alphabet and Meta to produce an open, scientifically backed standard which openly discloses the environmental cost of training and using AI based technology. Open metrics would allow developers and consumers to: * Make informed decisions * Employ strategies to minimise their environmental impact (for example by using services in regions with greener power generation, or using right sized models for their workload) * Give AI service providers metrics to improve their environmental impact. What we want to avoid is out of context data with no way to verify its methodology or understand the assumptions it is built on as this can lead to misunderstandings and poor decisions.

09.09.2025 10:00 — 👍 0 🔁 1 💬 0 📌 0

Building An AI-Agnostic Conversation Logger - Phase 4: Mini-Me # How Scope Creep Reminded Me of the Value of Product Ownership Every saga needs a bridge: a messy, transitional chapter that connects what came before with what comes next. In the MCU, Phase 4 followed the grand conclusion of the Infinity Saga. It was sprawling, uneven, and experimental, yet it laid the groundwork for what came next. In my own AI journey, Phase 4 was embodied by Mini-Me, my first attempt at building an AI-agnostic conversation logger as the natural evolution of Phase 3. It was meant to capture and structure conversations across multiple agents, and even begin to learn my quirks and style. Ambitious? Certainly. Useful? At times. Built to last? About as much as a Tony Stark prototype: flashy, clever, but destined to explode in testing! Mini-Me became a shadowy pastiche of my intentions, undone by scope creep and feature bloat. Yet from its ashes came the sharpest lesson so far: when working with AI copilots, strong product ownership and governance are not optional extras. They are the only things keeping your elegant vision from collapsing under its own weight. ## Setting the Scene Before Mini-Me, my work had focused on using multiple AIs for analysis and visualisation. Phase 1 was about building a React SPA to interact with different models. Phase 2 explored how developer tools like Cursor IDE could help restructure and scale those experiments. Phase 3 compared ways to extract structured data from AI outputs. Together these efforts revealed a gap: if I wanted to manage long-running conversations, compare different agents, and start building personalisation into the workflow, I would need something more persistent and structured than a UI or ad hoc script. What I also wanted, though, went beyond logging. I wanted a tool that could start to learn about me - not just the questions I asked, but the way I wrote, the tone I used, and the quirks of my style. The vision was that, over time, it would tailor its responses so that they increasingly sounded like something I might have written myself. That was the leap from “structured assistant” to “Mini-Me.” ## Enter Mini-Me Mini-Me was my first serious attempt at building a personal AI CLI assistant. The goals were simple: * Capture prompts and responses in JSON logs * Maintain threaded conversations * Generate embeddings with FAISS for semantic search * Orchestrate multiple AI backends (OpenAI, Anthropic, Ollama, GPT4All) * Provide failover logic and some early personalisation For a while, it even worked. I could start a conversation, switch between backends, and recall earlier threads. ## The Product Owner’s Dilemma Working alongside AI companions is intoxicating. They offer enticing shortcuts and seductive features that always feel just one tweak away. It is a little like walking the Jedi path, with the dark side of scope creep always beckoning from the shadows. Quicker, easier, more seductive. It creates the illusion of power by delivering more features more quickly. Yet those features are not always required, wanted, or even useful. Giving in feels good in the moment, but the result is a murkier product that is harder to control. As the product owner of Mini-Me, my hardest job was not the coding. It was keeping the project, and myself, on message. ## Why Mini-Me Fell Short Mini-Me became more powerful, seemingly, but also unstable. To really push my MCU analogies, it was my Ultron: ambitious, sometimes impressive, but ultimately not the foundation I wanted to build on. The technical weaknesses became clear: * **Monolithic architecture** : new features bolted on wherever they fitted, creating tangles. * **Agent sprawl** : each backend wired in separately, each loading its own config.yaml. Model overrides, error handling, and response parsing were duplicated instead of abstracted behind a common interface. * **Personalisation coupling** : early tailoring logic ended up hard-coded in the core. * **Failover chains** : clever in theory, unwieldy in practice. * **Search** : useful, but not modular. The deeper problem was governance. I allowed scope to expand too quickly. I let myself be seduced by shiny ideas. Mini-Me’s collapse wasn’t just technical debt: it was a masterclass in how AI acceleration can amplify poor product decisions. ## What I Learned The experience sharpened my perspective as both architect and product owner: 1. **Modularity matters** Clear boundaries between agents, services, data, and utilities prevent spirals of complexity. 2. **Config should be central** Scattering preferences and API keys across the codebase is a recipe for drift. Central .env files and persona definitions (personas.yaml) keep governance visible. 3. **Governance is the true accelerant** AI can accelerate delivery enormously, but that speed is double-edged. Without firm product management, you accelerate off the road just as quickly as you accelerate towards value. 4. **Logs and metadata are gold** Capturing not just responses but context, agents, personas, and modes creates transparency. It is the governance trail that prevents you from losing your way. Mini-Me did not just teach me how to structure an AI CLI. It reminded me that even when working with AI copilots, product ownership still matters more than ever. Looking back, Mini-Me’s failures taught me that every AI project needs a governance framework. Here’s the checklist I wish I’d had from the start: ### My Own “Product Owner’s Checklist” for my AI Projects [ ] Keep scope under control. New features are tempting, but discipline matters more than speed. [ ] Build for modularity. Agents, services, and data should be loosely coupled. [ ] Centralise config and personas. Make preferences explicit and governable. [ ] Capture logs and metadata. Every response, context, and decision should be traceable. [ ] Treat AI like a co-pilot, not the driver. Governance and direction must come from you. [ ] Remember: faster is not always better. What feels like momentum can just as easily be misdirection. ## The Dawn of JARVIS Mini-Me’s shortcomings gave rise to JARVIS: a clean-slate successor, designed with governance in mind. Where Mini-Me was sprawling, JARVIS embraces modularity by design. The cli/ directory handles command-line parsing and user interaction, while agents/ contains clean interfaces for each backend - no more duplicated config loading or tangled response parsing. Services/ orchestrates the heavy lifting: logging conversations, building search indices, managing personas, and coordinating between components. Utils/ keeps configuration, environment handling, and shared utilities in one place, while data/ provides a clean home for threads and embeddings. JARVIS is: * Persona-aware (coder, pre-sales, blogger) * Agent-agnostic (OpenAI, Claude, with more to come) * Session-continuing (auto-follows your last conversation) * Governance-friendly: logs everything with metadata, making decisions transparent If Mini-Me was Ultron, JARVIS is Vision: reborn, leaner, more maintainable, and crucially, better aligned with its product owner’s vision. And yes, that makes me the product _Visionary_. ## Key Takeaway: A Succinct Product Owner’s Checklist for working with AI Agents on Projects Treat your AI like a teammate, not a boss: keep it on track, log everything, and stay in charge. [] **Control scope.** Resist the lure of every AI-suggested feature. [] **Think modular.** Keep agents, code, and data loosely coupled. [] **Centralise settings.** Keep configs, preferences, and personas in one place. [] **Log everything.** Preserve prompts, responses, and metadata for transparency. [] **Lead, don’t follow.** Treat AI as a co-pilot, not the driver. [] **Value quality over speed.** Fast output can feel powerful but can be misaligned without oversight. ## Coming Next: Phase 5 - JARVIS Takes Shape If Phase 4 was my Ultron moment, Phase 5 is where Vision steps onto the stage. JARVIS represents a clean break from the chaos of Mini-Me, rebuilt with modularity, governance, and sustainability at its core. In this phase I will show how I stitched together agent switching, persona management, critique and consensus modes, and auto-following conversations into something coherent. The focus is not just on building features, but on sequencing them with discipline so that JARVIS grows steadily into the reliable companion I first imagined.

08.09.2025 15:08 — 👍 0 🔁 1 💬 0 📌 0

08.09.2025 00:00 — 👍 0 🔁 1 💬 0 📌 0

Some Best Practices for Writing Readable Automation Tests While on my most recent project I had the unique experience of working closely with many testers and test minded individuals. This allowed me to learn some much-needed lessons about how to best implement automation testing with readability in mind, a sometimes-overlooked area of test automation. ## Introduction What I hope to share with you is how simple it can be to both think about and implement the approach to make sure your tests are accessible and readable. You may be reading this because you have strict reasons on your project to ensure compliance with certain metrics, or because you want to increase the longevity of your code, or because you simply want to read some wonderful thoughts from a certain tester’s perspective! ## What to expect() Now, it’s good to make note that the lessons learnt here are all from the work I have carried out while implementing Playwright tests. This means you will need some basic understanding of TypeScript and Playwright for when I discuss some examples. So, without further ado, let’s discuss the `expect()` function. This neat little function is the bread-and-butter of automating in playwright, it’s what we want to see when we place our test into a certain configuration. This doesn’t mean it has to do handstands or backflips, but just like any gymnastics routine, we wait on it with bated breath, hoping for results we will cheer for. This anticipation makes it easy for us to write something which makes us wait for that big finale in our tests. And therein lies the issue; during a test we usually assert for confirmation to ensure we are where we expect. That means we can `expect(response.status()).toBe(200)` which is a wonderful successful request that has found the requested resource of a page. Say that instead we get a `201` response, which is just as equally a successful request, except it led to the creation of a resource. Test results can’t always be perfect and that means that this stopping point during the test, before we check the larger critical parameters later on, will skip the test. Instead, we have to accept that not every response will be exciting and just `expect(response.status()).toBeOK())` which allows us to still capture a successful response from a test, even if the status code is different. ## Don’t just getBy, Role with it After correctly measuring our expectations, we want to roll into selecting all the necessary parameters we can get by with. Therefore, it comes as no surprise we use the `.getByRole` function, but in order for us to make sense of it then it’s all about those pesky selectors. This is because there is an interface that runs behind the scenes of the UI and Playwright can pick selectors which will not be seen by someone who needs to use the keyboard to navigate around the system. For these users, the elements they have access to are either tabbed to or navigated to via arrow keys. If using a screen reader these selectors need to have a consistent and readable naming convention. Otherwise, some software that reads it aloud would read out aloud could read jargon which will only confuse the User. Things like filters and drop-downs can sometimes be difficult to locate but they can be found. A good tip for finding this is by entering the webpage and open up DevTools, then clicking near the item you wish to find the selector of, but not on it. After that press tab and open up the console within DevTools. This works best when your initial click is to the top left of the item you wish to find. If you then type the command: `document.activeElement` you will find the info you need. Failing this you can always right-click and inspect the item. It’s also generally good practice to not use `.getbyTestID`.This is due to the fact that names of those same ID’s will not make sense when either looked at later in a projects life cycle or to others in the team who do not have knowledge in that area. More information can be found about locators in the Playwright documentation Sometimes though this may not have been implemented on the project. This requires roles to be assigned correctly on the project you are on, things like test ID’s or divs could have been used instead which makes this impossible to implement. When this is the case it is always best to have that conversation with the team so that you can explain how useful and helpful HTML roles are. ## Some notable names worth tagging And after all of these suggestions, we find ourselves back at the top of a finished test, ready to move on and forget about all that good work we have done. If only there was a way to easily locate your previous work when you have some busy moments on the project and need to hand over what you’ve created. Well, to name a few reasons, let’s start with the Name of your test and follow with some tagging advice. When creating names of tests they tend to look like this: `test('should return a 200 when User is Valid')` Which uses single quotation marks `'` as the name is a string. However, it can be difficult when you wish to pass in a variable to that same string, as variables are noted by single quotation marks also. So, when creating test names, it’s best practice to instead use the back tick ``` as this will still allow you to use a variable in the test name as seen in this example: `test(`should return a 400 when User Infomation is '${variable}'`)` For those who don’t know where it is, the back tick can be located underneath the ESC key on the majority of keyboards. Going forward with the creation of our tests we want to ensure that they can be associated with the correct story or ticket from which they were created. The tag function can be used in the test name and can be done like this: `test(`a test example to show a test name for a test under ticket THW-000`, {tag: 'THW-000'}, async()` Here you can see that the tag simply needs to be placed at the end of the test name and is separated with a comma. For good practice this should be the ticket number that the work is being generated from. If you want to go the extra mile, it’s even better to use annotations for ticket numbers. That way, tags can be used exclusively to filter what tests are run, filtering test results. This is what is documented in the Playwright documentation ## Conclusion Well, there have been some odd analogies and puns along the way but hopefully this peek into a tester’s mind has helped you understand the importance of, and some ways to implement some good approaches in your automation tests while also keeping it simple to implement. Maybe there are even more simple techniques out there so go out and find them, or better yet bring them up with others and let’s get everything even more readable!

04.09.2025 00:00 — 👍 0 🔁 1 💬 0 📌 0

Paul Edwin

Latest posts by paul-edwin.hachyderm.io.ap.brid.gy on Bluesky

@paul-edwin.hachyderm.io.ap.brid.gy is following 1 prominent accounts