Category: Technology

  • Stop “Prompt Engineering”: The Real Reason LLMs Fail at Complex Work

    Stop “Prompt Engineering”: The Real Reason LLMs Fail at Complex Work

    What I Learned About Using LLMs and Prompt Engineering at Work, Why It Fails, and What To Do About It

    Introduction

    I’ve tried using strongly bounded prompt engineering with general LLM coding agents. What I found: I wasn’t giving the LLM small enough tasks, the tasks weren’t the kind of tasks I thought I needed to do, and the thinking-process steps needed to be far more granular than I expected for effective use.


    The Frustration Point

    I spent months trying to get LLMs to help with enterprise architecture work. I crafted detailed prompts. I used chain-of-thought reasoning. I created custom modes in my development environment. I set up retrieval-augmented generation (RAG) systems pointing to our corporate documentation. Despite all of this infrastructure, the results were consistently disappointing. The LLM would generate plausible-sounding architecture documents that violated our naming conventions, ignored our established patterns, and made assumptions that any junior analyst who’d been here six months would know were wrong.

    My initial reaction was anger at the technology itself. These systems aren’t experts. They don’t have experience. They’re pattern-matching machines that synthesize text from training data without understanding whether their output maps to reality. A middle schooler who searches the internet and applies critical thinking can outperform an LLM because they have intuition about what’s real versus what’s plausible-sounding nonsense.

    But then I had an insight that changed how I approached the problem entirely.

    The First Breakthrough: Agent Chains for Process Steps

    The breakthrough came when I stopped trying to create agents that embodied entire roles and instead created agent chains where each agent handled one discrete step in a process.

    Instead of asking an LLM to "be an architect," I started asking: "Can an agent extract requirements from a document?" or "Can an agent validate that a data model has no circular dependencies?" These are concrete, bounded tasks with verifiable outputs.

    For example, rather than one architectural agent, I envisioned:

    • Product Owner step: Extract user stories from requirements documentation
    • Analyst step: Identify data entities mentioned in user stories
    • Architect step: Check if entities map to existing domain models in the repository
    • Developer step: Generate boilerplate for a specific entity pattern
    • Tester step: Given a data model, identify boundary conditions

    Each step produces artifacts that feed the next step. The intelligence emerges from orchestration, not from any single agent being smart.
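The chain above can be sketched as plain functions whose output artifact feeds the next step. This is a minimal sketch, not a real agent framework; the function bodies (simple string heuristics) stand in for bounded LLM calls, and all names are illustrative.

```python
# Sketch of an agent chain: each step is one bounded function whose output
# artifact feeds the next step. Names and heuristics are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Artifact:
    kind: str                      # e.g. "user_stories", "entities"
    payload: list = field(default_factory=list)

def extract_user_stories(doc: str) -> Artifact:
    # Product Owner step: pull "As a ..." style lines from the document.
    stories = [line for line in doc.splitlines() if line.startswith("As a")]
    return Artifact("user_stories", stories)

def identify_entities(stories: Artifact) -> Artifact:
    # Analyst step: naive capitalized-noun extraction stands in for a
    # bounded LLM call with a verifiable output.
    entities = sorted({w.strip(".,") for s in stories.payload
                       for w in s.split() if w[:1].isupper() and w != "As"})
    return Artifact("entities", entities)

def run_chain(doc: str) -> Artifact:
    # Orchestration: intelligence emerges from the sequence,
    # not from any single agent being smart.
    return identify_entities(extract_user_stories(doc))
```

The point of the sketch is the shape: each function is small enough to validate on its own, and the chain, not the agent, carries the process knowledge.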

    This felt like progress. I was applying the same thin vertical slice approach I use in project planning – breaking work into demonstrable MVPs rather than big-bang deliveries.

    But I still had failures. The agents still hallucinated. They still violated organizational standards. They still made assumptions that a trained human wouldn’t make.

    The Deeper Problem: Micro-Tasks Within Tasks

    The real insight came when I examined what was actually happening when the LLMs failed. Take a seemingly simple task: "Create a job-to-be-done statement from this customer interview transcript."

    When I train a new college hire to do this work, here’s what I actually teach them:

    1. Look at these five approved examples of job statements from our repository
    2. Identify the pattern – what verb form do we use? Where does the object go? How is context structured?
    3. Check the terminology guide – do we use "customer," "user," or "client" in job statements?
    4. Extract the core customer need from the transcript – don’t rephrase it yet, just identify it
    5. Validate the scope – does it reference one actor? Does it avoid technology? Does it describe an outcome?
    6. Apply the pattern using our terminology
    7. Compare your output to the examples – does it match structurally?
    8. Let me review it before you finalize

    That’s eight distinct micro-tasks with validation between each step. When I asked an LLM to "create a job statement," I was asking it to perform all eight steps simultaneously, making hundreds of micro-decisions without the organizational context that exists only in human knowledge transfer.
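The eight-step workflow can be sketched as explicit (step, check) pairs with validation gating every hand-off, so a failure halts the pipeline instead of propagating. This is a minimal sketch under assumed data shapes; the step bodies and field names are hypothetical.

```python
# A sketch of the training workflow as (step, validation) pairs.
# All function names, state fields, and data are illustrative assumptions.
def retrieve_examples(state):
    # Steps 1-2: pull approved examples to anchor the pattern.
    state["examples"] = ["Managing inventory levels when demand spikes"] * 5
    return state

def extract_need(state):
    # Step 4: identify the core need verbatim; no rephrasing yet.
    state["need"] = state["transcript"].split("need to ")[1].rstrip(".")
    return state

def has_examples(state):
    return len(state.get("examples", [])) >= 5

def need_is_verbatim(state):
    # The extracted need must appear word-for-word in the transcript.
    return state["need"] in state["transcript"]

PIPELINE = [
    (retrieve_examples, has_examples),
    (extract_need, need_is_verbatim),
]

def run(transcript: str) -> dict:
    state = {"transcript": transcript}
    for step, check in PIPELINE:
        state = step(state)
        if not check(state):          # halt instead of propagating errors
            raise ValueError(f"validation failed after {step.__name__}")
    return state
```

Asking an LLM to "create a job statement" collapses this entire pipeline, and every intermediate check, into one opaque call.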

    Organizations have formally or informally defined standards, naming conventions, ontologies, and taxonomies that humans intuit because they pass them down from generation to generation. This knowledge may not even be documented. LLMs don’t pick up on that nuance because it’s reinforced through human consciousness, not through text that exists in a training corpus.

    The Solution: Hierarchical Process Decomposition

    The solution isn’t better prompts for coarse-grained tasks. The solution is decomposing your processes hierarchically until you reach atomic micro-tasks where each operation is verifiable, bounded, and requires explicit organizational context.

    Let me show you what this looks like in practice with a real scenario.

    The Real-World Example: Energy Consumption Tracking

    Scenario: A customer needs to track energy consumption across multiple devices in their microgrid.

    This seems like a straightforward product requirement. But watch what happens when we decompose it properly.


    Level 1: Strategic Portfolio Process Steps

    At the highest level, enterprise portfolio managers see these phases:

    1. Horizon Scanning & Opportunity Identification ← Start here
    2. Initiative Definition & Business Case Development
    3. Capability Architecture & Roadmap Planning
    4. Epic & Feature Decomposition
    5. Implementation & Validation
    6. Release & Operations Handoff

    When someone says "we need to track energy consumption," it enters at Level 1 as part of Horizon Scanning. Let’s decompose that one step.


    Level 2: Horizon Scanning Process

    When we decompose "Horizon Scanning & Opportunity Identification," we get:

    1. Industry Research & Trend Analysis
    2. Customer Problem Discovery ← Decompose this
    3. Competitive Intelligence Gathering
    4. Technology Capability Assessment
    5. Regulatory & Policy Impact Analysis
    6. Opportunity Synthesis & Prioritization

    The energy tracking need comes from Customer Problem Discovery. Let’s go deeper.


    Level 3: Customer Problem Discovery Process

    When we decompose "Customer Problem Discovery," we get:

    1. Define Target Market & Job Performer
    2. Uncover Customer Jobs-to-be-Done ← Decompose this
    3. Map Customer Job Process
    4. Identify Desired Outcomes
    5. Quantify Opportunity Size
    6. Validate Problem-Solution Fit

    This is where we formally investigate what customers are trying to accomplish. Let’s continue.


    Level 4: Uncover Customer Jobs-to-be-Done Process

    When we decompose "Uncover Customer Jobs-to-be-Done," we get:

    1. Recruit Interview Participants
    2. Conduct Qualitative Interviews
    3. Transcribe & Code Interview Data
    4. Synthesize into Job Statements ← Decompose this

    We’re getting closer to actual work. Let’s keep going.


    Level 5: Synthesize into Job Statements Sub-Process

    When we decompose "Synthesize into Job Statements," we get:

    1. Retrieve Job Statement Templates
    2. Extract Statement Format Rules
    3. Query Statement Terminology Standards
    4. Retrieve Approved Examples
    5. Extract Synthesis Patterns from Examples
    6. Map Interview Patterns to Statement Elements ← Decompose this
    7. Generate Job Story Statements
    8. Generate Full JTBD Statements
    9. Validate Statement Format
    10. Validate Against Examples
    11. Human Review & Approval

    Still not at the atomic level. Let’s continue.


    Level 6: Map Interview Patterns to Statement Elements

    When we decompose "Map Interview Patterns to Statement Elements," we get:

    1. Retrieve Pattern Frequency Report
    2. Select Top Pattern for Processing
    3. Validate Pattern Completeness
    4. Extract Pattern Components ← Decompose this
    5. Map Components to Statement Format
    6. Validate Component Mapping

    Getting closer. One more level.


    Level 7: Extract Pattern Components

    When we decompose "Extract Pattern Components," we get:

    1. Validate Pattern Has Context Element
    2. Validate Pattern Has Action Element
    3. Validate Pattern Has Outcome Element
    4. Extract Context Component ← Decompose this
    5. Extract Action Component
    6. Extract Outcome Component

    Now we’re at something that sounds like a single task. But is it really? Let’s find out.


    Level 8: Extract Context Component – The Atomic Micro-Tasks

    This is where we finally reach the atomic level – the individual micro-tasks that map directly to MCP server primitives and tools. Here are the 17 atomic micro-tasks required to extract ONE context component from ONE pattern:

    Micro-Task 1: Retrieve Top Pattern Data

    • MCP Server: retrieve_artifact(artifact_id, source_task)
    • Operation: RAG query to artifact repository
    • Input: artifact_id: "top_pattern_rank_1", source_task: "2.4.18"
    • Constraint: Exact artifact retrieval – no filtering, no interpretation
    • Output: Pattern object (pattern_text, frequency_count, supporting_quotes, coded_categories)

    Micro-Task 2: Parse Pattern for Customer Quotes

    • MCP Server: parse_artifact_field(artifact, field_name)
    • Operation: JSON field extraction
    • Input: Pattern object, field_name: "supporting_quotes"
    • Constraint: Extract field value only – no transformation
    • Output: Array of quote objects

    Micro-Task 3: Query Context Code Definition

    • MCP Server: query_ontology(term_type, context)
    • Operation: RAG query to coding framework
    • Input: term_type: "context_element", context: "JTBD_interview_coding"
    • Constraint: Return exact definition from coding framework
    • Output: Context definition document

    Micro-Task 4: Extract Context Identification Rules

    • MCP Server: parse_artifact_field(artifact, field_name)
    • Operation: Field extraction
    • Input: Context definition, field_name: "identification_rules"
    • Constraint: Extract field only
    • Output: Array of identification rules

    Micro-Task 5: Filter Quotes for Context Signals

    • MCP Server: filter_by_signal_phrases(quotes, signal_phrases)
    • Operation: Pattern-based filtering
    • Input: Quote array, signal phrases from rules
    • Constraint: Exact phrase match only
    • Output: Filtered quote array

    Micro-Task 6: Validate Quote Availability

    • MCP Server: validate_collection_count(collection, minimum_count)
    • Operation: Binary validation – count check
    • Input: Filtered quotes, minimum_count: 1
    • Validation Rule: Count >= minimum? Yes/No
    • Output: PASS/FAIL with count

    Micro-Task 7: Extract First Context Quote

    • MCP Server: extract_array_element(array, index, field)
    • Operation: Array element extraction
    • Input: Quote array, index: 0, field: "quote_text"
    • Constraint: Extract exactly as stored
    • Output: Quote text string

    Micro-Task 8: Query Context Extraction Pattern

    • MCP Server: retrieve_extraction_pattern(pattern_type, source_artifact_type)
    • Operation: Pattern retrieval
    • Input: pattern_type: "context_extraction", source: "interview_quote"
    • Constraint: Current approved pattern only
    • Output: Extraction pattern document

    Micro-Task 9: Parse Extraction Rules

    • MCP Server: parse_artifact_field(artifact, field_name)
    • Operation: Field extraction
    • Input: Pattern document, field_name: "pattern_rules"
    • Constraint: Extract field only
    • Output: Pattern rules array

    Micro-Task 10: Apply Extraction Pattern

    • MCP Server: apply_extraction_pattern(text, pattern_rules)
    • Operation: Constrained extraction
    • Input: Quote text, pattern rules
    • Constraint: Apply rules sequentially – extract matching text only, no generation
    • Output: Extracted context text (substring)

    Micro-Task 11: Validate Extraction is Verbatim

    • MCP Server: validate_substring(text, source_text)
    • Operation: Binary validation – substring check
    • Input: Extracted context, original quote
    • Validation Rule: Is extracted text verbatim substring? Yes/No
    • Output: PASS/FAIL

    Micro-Task 12: Query Forbidden Terms

    • MCP Server: query_ontology(term_type, context)
    • Operation: Ontology lookup
    • Input: term_type: "forbidden_terms", context: "JTBD_context_component"
    • Constraint: Complete list retrieval
    • Output: Forbidden terms array

    Micro-Task 13: Validate No Forbidden Terms

    • MCP Server: validate_forbidden_terms(text, forbidden_terms)
    • Operation: Binary validation – term checking
    • Input: Extracted context, forbidden terms
    • Validation Rule: Contains ANY forbidden term? Yes=FAIL/No=PASS
    • Output: PASS/FAIL with violating terms if any

    Micro-Task 14: Flag for Human Review (Conditional)

    • MCP Server: create_review_flag(artifact, issue_type, details)
    • Operation: Flag creation
    • Input: Extracted context, issue_type: "forbidden_terms_detected"
    • Constraint: Only execute if validation failed
    • Output: Review flag object

    Micro-Task 15: Store Extracted Context

    • MCP Server: store_artifact(artifact, artifact_type, metadata)
    • Operation: Data persistence
    • Input: Extracted context, type: "job_statement_context_component", metadata
    • Constraint: Store with complete traceability
    • Output: Artifact reference (artifact_id, location, timestamp)

    Micro-Task 16: Link Context to Source Pattern

    • MCP Server: create_artifact_link(source_id, target_id, link_type)
    • Operation: Relationship creation
    • Input: Pattern artifact_id, context artifact_id, link_type: "extracted_from"
    • Constraint: Bidirectional link with timestamps
    • Output: Link object

    Micro-Task 17: Return Context Component

    • MCP Server: return_artifact(artifact_id)
    • Operation: Output handoff
    • Input: Context component artifact_id
    • Constraint: Complete package with validation status
    • Output: Context component package
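Micro-tasks 10 through 13 are the heart of the chain: constrained extraction followed by two binary validations. Here is a minimal sketch of that slice, assuming a regex-based rule format; the rules, quote, and forbidden-term list are all illustrative, not the real coding framework.

```python
# Sketch of micro-tasks 10-13: constrained extraction, then binary checks.
# The rule format (capture-group regexes) and all data are assumptions.
import re

def apply_extraction_pattern(text: str, pattern_rules: list) -> str:
    # Micro-task 10: apply rules sequentially; extract matching text only,
    # never generate.
    for rule in pattern_rules:
        m = re.search(rule, text)
        if m:
            return m.group(1)
    return ""

def validate_substring(extracted: str, source_text: str) -> bool:
    # Micro-task 11: extraction must be a verbatim substring of the quote.
    return bool(extracted) and extracted in source_text

def validate_forbidden_terms(text: str, forbidden: list) -> list:
    # Micro-task 13: return violating terms; an empty list means PASS.
    return [t for t in forbidden if t.lower() in text.lower()]

quote = "When I'm managing several devices at once, I lose track of usage."
rules = [r"When (I'm [a-z ]+),"]   # hypothetical "context signal" rule
context = apply_extraction_pattern(quote, rules)
assert validate_substring(context, quote)
assert validate_forbidden_terms(context, ["system", "software", "app"]) == []
```

Each function is a single bounded operation with a verifiable output, which is exactly what makes the chain auditable where a monolithic prompt is not.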

    The Staggering Reality

    Let’s recap what we just saw:

    Level 1: Strategic Portfolio (6 steps)
      └─ Level 2: Horizon Scanning (6 steps)
        └─ Level 3: Customer Problem Discovery (6 steps)
          └─ Level 4: Uncover JTBD (4 sub-processes)
            └─ Level 5: Synthesize Statements (11 sub-tasks)
              └─ Level 6: Map Patterns (6 tasks)
                └─ Level 7: Extract Components (6 micro-tasks)
                  └─ Level 8: Extract Context (17 ATOMIC operations)

    Starting Point: "Customer needs to track energy consumption"
    Intermediate Point: "Extract context component"
    Ending Point: 17 atomic micro-tasks

    To extract ONE component (the context) from ONE pattern requires:

    • 17 atomic micro-tasks
    • 14 distinct MCP server primitives
    • Multiple underlying tools per server

    And this is just the context component of the first pattern in the first job statement. Each job statement has three components (context, action, outcome). Most jobs have multiple patterns. Most problems have multiple jobs.

    When you ask an LLM to "understand what the customer needs," you’re asking it to perform thousands of atomic operations without explicit organizational context at each step.

    This is why LLMs fail. Not because they’re bad at pattern matching, but because we’re asking them to make thousands of micro-decisions simultaneously without the scaffolding that each decision requires.

    What This Means for MCP Server & Tool Design

    From those 17 atomic micro-tasks, we need these MCP Server Primitives:

    Data Retrieval Servers

    • retrieve_artifact(artifact_id, source_task) – Get stored artifacts
    • query_ontology(term_type, context) – Look up terminology/rules
    • retrieve_extraction_pattern(pattern_type, source_artifact_type) – Get transformation patterns

    Data Parsing Servers

    • parse_artifact_field(artifact, field_name) – Extract specific fields
    • extract_array_element(array, index, field) – Get specific elements

    Filtering & Matching Servers

    • filter_by_signal_phrases(quotes, signal_phrases) – Pattern-based filtering
    • apply_extraction_pattern(text, pattern_rules) – Constrained text extraction

    Validation Servers

    • validate_collection_count(collection, minimum_count) – Count validation
    • validate_substring(text, source_text) – Verbatim check
    • validate_forbidden_terms(text, forbidden_terms) – Term blacklist check

    Data Persistence Servers

    • store_artifact(artifact, artifact_type, metadata) – Save with traceability
    • create_artifact_link(source_id, target_id, link_type) – Create relationships

    Issue Management Servers

    • create_review_flag(artifact, issue_type, details) – Flag problems

    Orchestration Servers

    • return_artifact(artifact_id) – Pass results to next step

    Each MCP server is a single-purpose, verifiable operation. The agent orchestration chains them together with validation between each step.
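One way to picture this is a dispatch table of single-purpose primitives, standing in for MCP tool registration. This is a plain-Python sketch, not the MCP SDK; the registry, decorator, and stub bodies are all assumptions, and only the primitive names follow the list above.

```python
# Sketch: single-purpose primitives behind one dispatch table, standing in
# for MCP tool registration. Bodies are stubs; the real servers would back
# these with the artifact repository and ontology.
SERVERS = {}

def primitive(fn):
    # Register a function as one single-purpose, verifiable operation.
    SERVERS[fn.__name__] = fn
    return fn

@primitive
def validate_collection_count(collection, minimum_count):
    # Binary validation: count check, PASS/FAIL with the count.
    return {"status": "PASS" if len(collection) >= minimum_count else "FAIL",
            "count": len(collection)}

@primitive
def extract_array_element(array, index, field):
    # Array element extraction: return the field exactly as stored.
    return array[index][field]

def call(server_name, **kwargs):
    # Orchestration invokes primitives by name, one bounded operation each.
    return SERVERS[server_name](**kwargs)
```

The orchestration layer only ever sees named, bounded calls, which is what makes each step independently testable and replaceable.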

    Infrastructure Requirements

    Making this work requires specific infrastructure components that work together to provide organizational context at the micro-task level.

    1. Document Repository with Retrieval-Augmented Generation (RAG)

    Your document repository becomes the source of truth for organizational knowledge. The RAG system must be structured to support micro-task queries:

    Repository Structure:

    • /templates/ – Approved examples of every artifact type
    • /standards/ – Naming conventions, ontologies, taxonomies
    • /patterns/ – Documented patterns for common transformations
    • /decisions/ – Architecture decision records (ADRs) that capture why things are done certain ways
    • /examples/ – Real artifacts from past projects, tagged by type and quality

    RAG Query Design:
    The RAG needs to support specific query types:

    • Example retrieval: "Return 5 approved job-to-be-done statements"
    • Pattern lookup: "What naming pattern do we use for microservice endpoints?"
    • Terminology check: "What term does the organization use for X?"
    • Decision context: "Why do we structure capability models this way?"

    Quality Control:
    Examples in the repository must be curated. Not every artifact that exists should be retrievable – only approved, high-quality examples that represent current standards.
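The curation rule can be enforced at query time: only approved artifacts are ever retrievable. A minimal sketch, with an in-memory stand-in for the repository; the fields and contents are illustrative.

```python
# Sketch of the curated-example query: the approval filter runs before any
# retrieval, so drafts can never surface. Repository data is illustrative.
REPOSITORY = [
    {"type": "job_statement", "approved": True,
     "text": "Managing energy use across devices when costs spike"},
    {"type": "job_statement", "approved": False,
     "text": "Make an app to track stuff"},   # draft: must never surface
]

def retrieve_examples(artifact_type: str, count: int) -> list:
    # Quality control: filter to approved artifacts of the requested type.
    pool = [a["text"] for a in REPOSITORY
            if a["type"] == artifact_type and a["approved"]]
    return pool[:count]
```

In a real RAG system the filter would be a metadata predicate on the vector store query, but the invariant is the same: retrievability implies approval.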

    2. MCP Server Primitives

    Model Context Protocol (MCP) servers should be designed as micro-operation primitives, not as high-level capabilities. Instead of one MCP server called "job_statement_creator," you need:

    • retrieve_examples(artifact_type, count, quality_filter)
    • extract_pattern(examples, pattern_type)
    • query_ontology(term_type, context)
    • validate_against_rules(content, rule_set)
    • apply_transformation(input, pattern, constraints)
    • compare_to_standard(output, examples)

    Each MCP server does one verifiable micro-operation. The agent orchestration chains these together with validation between each step.

    3. Decision Trees for Process Orchestration

    For each artifact type in your development hierarchy (customer job, capability definition, epic, feature, user story, task, etc.), you need a decision tree that defines:

    • What micro-tasks must be performed in what sequence
    • What organizational context each micro-task requires
    • What validation checks occur between steps
    • Where human checkpoints are required
    • What the failure modes are and how to handle them

    These decision trees become the orchestration logic for your agent chains.
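A decision tree for one artifact type might be encoded as one node per micro-task, with sequence, required context, validation, failure handling, and human checkpoints all explicit. This is a hypothetical shape; the task names, context paths, and rule strings are assumptions.

```python
# Sketch of a decision tree as orchestration data: one node per micro-task.
# Task names, context paths, and rule identifiers are all illustrative.
JOB_STATEMENT_TREE = [
    {"task": "retrieve_examples",  "context": "/examples/job_statements",
     "validate": "count >= 5",          "human_checkpoint": False,
     "on_failure": "halt"},
    {"task": "extract_need",       "context": "/standards/terminology",
     "validate": "verbatim_substring",  "human_checkpoint": False,
     "on_failure": "flag_for_review"},
    {"task": "finalize_statement", "context": "/templates/job_statement",
     "validate": "matches_template",    "human_checkpoint": True,
     "on_failure": "halt"},
]

def checkpoints(tree):
    # The orchestrator can list exactly where humans must approve.
    return [node["task"] for node in tree if node["human_checkpoint"]]
```

Because the tree is data, the same structure can be versioned, reviewed, and diffed alongside the standards it encodes.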

    4. Validation Rule Sets

    Every micro-task that performs validation needs an explicit rule set. These can’t be implicit – they must be codified:

    Naming Convention Rules:

    job_statement_naming:
      verb_form: gerund                  # "Managing" not "Manage"
      terminology:
        actor: "customer"                # Not "user" or "client"
        action_object: specific          # "inventory" not "stuff"
      structure: "[actor] + [verb] + [object] + [context]"
      forbidden_terms: ["system", "software", "app"]

    Scope Validation Rules:

    job_statement_scope:
      required:
        - single_actor: true
        - outcome_focus: true
      forbidden:
        - technology_reference: false
        - solution_specification: false
        - multiple_jobs: false

    These rules must be machine-readable and versioned alongside your code.
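Once codified, the rules can drive a checker directly. A minimal sketch, assuming the naming rules have been loaded into a dictionary mirroring the YAML; the helper name and rule subset are hypothetical.

```python
# Sketch: applying a codified rule set to a candidate job statement.
# The rule dictionary mirrors the YAML above; this is not an established API.
NAMING_RULES = {
    "actor": "customer",
    "forbidden_terms": ["system", "software", "app"],
}

def validate_job_statement(statement: str) -> list:
    # Return a list of violations; an empty list means PASS.
    violations = []
    words = statement.lower().split()
    if NAMING_RULES["actor"] not in words:
        violations.append('missing actor "%s"' % NAMING_RULES["actor"])
    for term in NAMING_RULES["forbidden_terms"]:
        if term in words:
            violations.append('forbidden term "%s"' % term)
    return violations
```

Because the rules live in versioned data rather than in a reviewer's head, the same check runs identically for an LLM's output and a new hire's draft.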

    5. Template Library

    For every artifact type, maintain:

    • Blank template with field definitions
    • Annotated template explaining what goes in each field and why
    • Exemplar artifacts showing the template filled out correctly
    • Anti-patterns showing common mistakes and what’s wrong with them

    The template library feeds both the RAG system and provides structure for LLM output.

    6. Human-in-the-Loop Checkpoints

    Not every step can or should be automated. Define explicit checkpoints where human judgment is required:

    • Creative decisions – Choosing between valid alternatives
    • Context interpretation – Understanding nuance that requires organizational history
    • Quality judgment – Determining if output meets unstated quality standards
    • Exception handling – Dealing with edge cases not covered by rules

    The checkpoint design should make it easy for humans to approve, reject, or provide specific corrections without needing to understand the entire workflow.
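One way to keep checkpoints lightweight is to model each one as a small record carrying exactly one bounded decision. A sketch only; the field names and decision vocabulary are assumptions.

```python
# Sketch of a human checkpoint: the reviewer sees one artifact and returns
# approve / reject / correct, without touching the wider workflow.
from dataclasses import dataclass

@dataclass
class Checkpoint:
    artifact: str
    decision: str = ""          # "approve" | "reject" | "correct"
    correction: str = ""

    def resolve(self, decision, correction=""):
        assert decision in {"approve", "reject", "correct"}
        self.decision, self.correction = decision, correction
        # A correction replaces the artifact; approval passes it through.
        return self.correction if decision == "correct" else self.artifact
```

The reviewer never needs to understand the chain that produced the artifact, only the single decision in front of them.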

    The Methodology: How to Build This

    You cannot document everything upfront. That’s waterfall thinking applied to knowledge work. Instead, use an iterative discovery process:

    Week 1: Pick Your Most Problematic Artifact

    Identify where LLMs fail most frequently in your workflow. Is it creating architecture documents? Writing user stories? Extracting requirements? Start there.

    Discovery steps:

    1. Collect every existing example of that artifact type from your repository
    2. Watch yourself create a new one from scratch – write down each micro-decision you make
    3. Watch a college hire try to create one – where do they get stuck? Each sticking point reveals a missing micro-task
    4. Review with a senior person – what do they correct? Each correction reveals an organizational standard

    Documentation template for each micro-task:

    • Name: [verb][object][constraint]
    • MCP Server: The specific primitive this maps to
    • Input Required: What must exist before this executes?
    • Organizational Context: What company-specific knowledge is needed? Where does it live?
    • Operation Type: Retrieval | Validation | Transformation | Generation-with-constraints | Human-decision
    • Constraints: What boundaries exist?
    • Output Produced: What results?
    • Validation Check: How do you know it’s correct?
    • Failure Mode: How does an LLM typically fail at this?

    Chain the micro-tasks: Sequence them so each output feeds the next input. Build the agent workflow. Test it. Document what works and what doesn’t.
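The documentation template can itself be made machine-readable, so each micro-task spec can drive the agent workflow rather than sit in a wiki. A sketch with fields mirroring the bullet list; the example values are hypothetical.

```python
# The micro-task documentation template as a machine-readable record.
# Fields mirror the template above; the example values are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class MicroTaskSpec:
    name: str                    # [verb][object][constraint]
    mcp_server: str              # the primitive this maps to
    input_required: str
    organizational_context: str
    operation_type: str          # Retrieval | Validation | Transformation |
                                 # Generation-with-constraints | Human-decision
    constraints: str
    output_produced: str
    validation_check: str
    failure_mode: str

spec = MicroTaskSpec(
    name="extract_context_verbatim",
    mcp_server="apply_extraction_pattern",
    input_required="filtered customer quote",
    organizational_context="/patterns/context_extraction",
    operation_type="Transformation",
    constraints="verbatim substring only, no generation",
    output_produced="context component text",
    validation_check="validate_substring against source quote",
    failure_mode="paraphrases the quote instead of extracting it",
)
```

Capturing the failure mode as a field is deliberate: it tells the orchestrator what to validate for, not just what to do.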

    Week 2: Next Artifact

    Repeat for the next artifact in your hierarchy. Some micro-tasks will be reusable (retrieve examples, validate naming), others will be specific to that artifact type.

    Weeks 3-5: Pattern Recognition

    After 3-5 artifacts, patterns emerge. Certain micro-task types repeat. These become your MCP server primitives and reusable components.

    Ongoing: Maintain the Knowledge Base

    As organizational standards evolve, update:

    • Example repositories
    • Validation rule sets
    • Decision trees
    • Template libraries

    This maintenance must be part of your development process, not a one-time setup.

    Why This Works (And Why Previous Approaches Failed)

    The fundamental insight: LLMs are good at pattern application and constrained transformation. They’re terrible at intuiting unstated organizational context and making judgment calls that require experience.

    What changed:

    • Task granularity went from "create an architecture document" to "retrieve 5 example documents and extract the section header pattern"
    • Context provision went from "use our standards" to explicitly querying those standards at each micro-task
    • Validation went from implicit human review at the end to explicit, machine-checkable validation between every step
    • Human involvement went from full creation to approval at specific checkpoints

    What this requires from you:

    • Decomposition discipline to identify micro-tasks (8 levels deep minimum)
    • Infrastructure investment in RAG, MCP servers, rule sets, templates
    • Cultural shift from "review the output" to "define the process"
    • Maintenance commitment to keep organizational context current

    What you get in return:

    • Reliable, consistent output that matches organizational standards
    • Traceability from business need through thousands of micro-decisions to implementation
    • Reduced hallucination because LLMs operate within explicit constraints
    • Ability to train new hires using the same codified micro-tasks
    • Organizational knowledge captured in machine-readable form

    The Uncomfortable Truth

    This approach requires significant upfront work. You must decompose your processes to micro-tasks. You must codify organizational knowledge that currently lives in people’s heads. You must build and maintain infrastructure.

    This is not a prompt engineering problem. This is a knowledge engineering problem.

    The promise of LLMs was that we could just describe what we want and get it. The reality is that we must be far more explicit about how we work than we ever were with human employees. The benefit is that this explicitness creates organizational assets: documented processes, codified standards, reusable components.

    The college hire who learns through observation and osmosis can’t easily transfer that knowledge to the next hire. The micro-task workflow that codifies every step can be used to train humans and LLMs alike.

    Where To Start

    Don’t try to build this for your entire organization at once. Start with one painful problem:

    1. Identify your biggest LLM failure point – Where does the AI consistently produce wrong outputs?

    2. Shadow yourself doing that task – Write down every micro-decision you make. Don’t stop at "extract the requirement" – go to "retrieve 5 examples, extract the verb pattern, query the terminology, extract verbatim text, validate scope rules…"

    3. Count the levels – How many levels of decomposition until you reach atomic operations? If you’re not at 6-8 levels, you’re not granular enough.

    4. Document one micro-task fully – Pick the smallest meaningful unit and fully specify it using the template provided, including the MCP server primitive it maps to

    5. Build the infrastructure for that one micro-task – Create the RAG query, MCP server implementation, validation rules it needs

    6. Test with an LLM – See where it still fails

    7. Iterate – Decompose further, add missing context, refine constraints

    8. Chain to the next micro-task – Build the workflow one step at a time

    After you’ve successfully automated one complete artifact workflow through 8 levels of decomposition, the pattern becomes clear. The infrastructure becomes reusable. The methodology becomes repeatable.

    Conclusion

    The problem with prompt engineering isn’t that we need better prompts. The problem is that we’re asking LLMs to perform tasks that are actually thousands of micro-tasks bundled together, each requiring organizational context that exists nowhere except in human knowledge transfer.

    The solution isn’t to make LLMs smarter. The solution is to decompose our work to the level where LLMs can succeed: bounded, explicit, verifiable micro-tasks with organizational context provided at each step.

    From "track energy consumption" to "extract context component" requires traversing 8 levels of decomposition and executing 17 atomic operations with 14 different MCP server primitives.

    This requires treating LLMs not as intelligent agents that can figure things out, but as very capable pattern-matching and transformation engines that need explicit instructions at a granularity far finer than we’re used to providing.

    The work to build this infrastructure is significant. But it’s work that benefits the organization beyond just enabling AI. It creates documented processes, codified knowledge, and reusable components that make human onboarding faster and organizational knowledge more durable.

    The future of effective AI use in enterprise isn’t better prompts. It’s better process decomposition, explicit knowledge capture, and infrastructure that provides organizational context at the micro-task level.

    Start small. Pick one problem. Decompose thoroughly – expect 6-8 levels minimum. Build the infrastructure. The rest will follow.


    Glossary of Key Terms

    Agent Chain: A sequence of specialized agents where each agent performs one discrete step in a workflow, with outputs from one agent serving as inputs to the next.

    Atomic Micro-Task: The smallest possible unit of work that cannot be decomposed further – a single operation with one specific input, one constraint, and one verifiable output. Typically requires 6-8 levels of decomposition from high-level process steps to reach atomic tasks.

    Hierarchical Process Decomposition: The systematic breaking down of high-level process steps into progressively more granular sub-processes, tasks, and micro-tasks until reaching atomic operations that map to individual MCP server primitives.

    MCP Server (Model Context Protocol Server): A specialized service that provides one specific atomic capability to LLMs, designed to perform a single bounded operation such as retrieval, validation, transformation, or persistence.

    Micro-Task: A unit of work that appears indivisible but typically still contains multiple atomic operations. True micro-tasks must be decomposed further to the atomic level.

    Organizational Context: Company-specific knowledge including naming conventions, standards, patterns, ontologies, taxonomies, and decision rationale that typically exists through human knowledge transfer rather than explicit documentation.

    Process Decomposition Level: The depth of decomposition from high-level strategic processes to atomic micro-tasks. Effective AI implementation typically requires 6-8 levels of decomposition.

    RAG (Retrieval-Augmented Generation): A system that enhances LLM responses by retrieving relevant documents from a knowledge base before generating output, providing explicit organizational context at each micro-task.

    Constrained Transformation: An operation where input is transformed to output following explicit rules and patterns, with no room for creative interpretation or synthesis. One of the core operation types in atomic micro-tasks.

    Validation Checkpoint: A specific point in a workflow where outputs are checked against explicit criteria before proceeding to the next step. Occurs between every atomic micro-task.

    Artifact: A tangible work product created during the development process, such as a requirements document, architecture diagram, user story, or implementation code. Each artifact requires its own decomposition hierarchy.

    Thin Vertical Slice: A complete thread through the entire development stack that delivers one minimal demonstrable feature, used to validate the end-to-end workflow before scaling.

    Pattern Extraction: The process of analyzing multiple examples to identify consistent structural or formatting patterns that can be explicitly codified. A common operation type in atomic micro-tasks.

    Human Checkpoint: A designated point in an automated workflow where human judgment, approval, or decision-making is required before proceeding. Strategically placed after validation of atomic micro-task outputs.

    Traceability Chain: The ability to trace any implemented feature back through user stories, epics, capabilities, customer needs, and every atomic micro-task that contributed to its creation, validating that each level supports the level above it.

    Knowledge Engineering: The practice of explicitly capturing, codifying, and structuring organizational knowledge in machine-readable formats that can support both human and AI decision-making at the micro-task level.

    Decomposition Depth: The number of levels required to break down a high-level process into atomic micro-tasks. Enterprise processes typically require 6-8 levels of decomposition to reach truly atomic operations.

  • Human Factors in Wake Word Design: Optimizing Voice Assistant Interaction for All Users [Research]

    Human Factors in Wake Word Design: Optimizing Voice Assistant Interaction for All Users [Research]

    Voice assistants have become ubiquitous in homes, phones, and cars, providing convenient hands-free interaction. However, designing the wake word – the spoken prompt that activates these assistants – is not a one-size-fits-all endeavor. A well-chosen wake word must balance technical reliability with human-centric considerations. This includes creating an engaging persona for the assistant and accommodating diverse cognitive abilities among users. The challenge is to craft a wake word that not only triggers the system accurately, but is also easy for all users to remember, pronounce, and trust.

    The Wake Word as Persona and Presence

    Beyond its technical role, a wake word often embodies the voice assistant’s persona. The name or phrase we use to call an assistant can set the tone for how users relate to it. As voice designer Alina Khayretdinova notes, creating a voice assistant’s name and personality is “somewhat of a work of art” – comparable to developing a character in a book or movie. Establishing the right persona can influence whether users feel the assistant is more of a friendly “buddy” or an authoritative “boss.” Research in human-computer interaction supports this: users often personify voice assistants, attributing social roles and human-like characteristics to them. For instance, many Alexa™ users in one study described the device as if it were a friend or family member, indicating that a personable wake word and voice can increase user engagement. By giving the assistant a relatable identity, designers can foster greater trust and comfort in using the device. In contrast, a cold or awkward wake word might distance users or reduce their willingness to interact. Thus, the choice of wake word and persona has real impact on user acceptance – a friendly, humanized wake word can make the technology feel more accessible and social.

From a branding perspective, the wake word also represents the device’s presence in daily life. Companies often choose unique names (like “Alexa” or “Siri”) to reinforce the assistant’s identity. Uniqueness is important not just for branding, but for practical reasons: a wake word that is too common (e.g. “Computer”) can lead to accidental activations when the word comes up in conversation or media. Avoiding common words or names is therefore critical. In fact, industry guidelines recommend using a wake word that is highly distinct and not easily confused with other vocabulary. An observational study of smart speakers noted that wake word designers recommend at least six phonemes (speech sounds) in the trigger phrase, along with a unique character, to minimize false positives. A longer, uncommon wake word like “Hey Mycroft” is less likely to be spoken accidentally than a short, common word like “Hey You,” reducing unintended activations. Designers must walk a fine line: the wake word should feel natural and friendly, but also be unique enough to avoid collision with everyday language.

    Finally, persona design should account for cultural and linguistic diversity. A wake word that feels friendly in one language might carry different connotations in another. Users in multilingual households may prefer an assistant name that “codeswitches” well – for example, a name that is easy to pronounce in all relevant languages. Researchers have explored the impact of different naming strategies (from human names to abstract words) on user perceptions across cultures. The consensus is that cultural context and user expectations matter: a wake word must be perceived as respectful and convenient within the user’s culture. What sounds like a clever persona in one country might sound odd or offensive in another. Thus, human factors in wake word design begin with persona and naming decisions that resonate positively with the target user group, fostering a sense of approachability and trust.

    Designing for Diverse Cognitive Profiles

    Equally important is ensuring the wake word is usable by people with diverse cognitive abilities. Users span all ages and backgrounds – from children to older adults – and include people with conditions like ADHD, dyslexia, or brain fog. These differences can affect how easily a person remembers and uses a given wake word. Working memory, the mental capacity for holding and manipulating information in the short term, is a key factor here. Working memory is often described as the brain’s “scratchpad” for temporary information. It allows us to keep several pieces of information in mind simultaneously and is critical for tasks like following multi-step instructions or remembering a phrase long enough to say it. However, working memory capacity varies between individuals.

    For people with attention-deficit/hyperactivity disorder (ADHD), working memory challenges and slower processing speed are common cognitive characteristics. In neuropsychological assessments, individuals with ADHD tend to score lower on working memory tasks and take longer on processing tasks compared to neurotypical peers. In practical terms, this means an individual with ADHD might have more difficulty recalling an unfamiliar wake word or could need a moment longer to articulate it. One study using virtual reality simulations found that children with ADHD had significantly lower working memory spans and processing speeds than children without ADHD. It’s “a natural part of having ADHD,” as one review put it, that memory and processing may lag behind in demanding tasks. These cognitive differences can make voice interactions more challenging, especially if the wake word or subsequent command is complex.

    Importantly, working memory limitations in ADHD are not just about remembering facts – they also tie into emotion regulation. ADHD experts note that trouble holding information in mind can impair one’s ability to manage emotions. Clinical research supports this connection: a 2020 study found that deficits in working memory were directly associated with greater emotional impulsivity and difficulty with emotion regulation in children with ADHD. This suggests that if a wake word or voice interface is frustrating or hard to use, it could disproportionately impact those users’ emotional experience. A confusing wake word might lead to repeated failed attempts to activate the assistant, which in turn can cause frustration – especially in someone already prone to emotional dysregulation. Designing with cognitive load in mind is therefore critical: the wake word should minimize memory burden and avoid triggering frustration for users with ADHD and similar profiles.

    It’s not only ADHD that warrants consideration. Older adults, for example, often experience a decline in working memory capacity with age. Aging research has shown that the amount of information one can hold in mind tends to decrease in later adulthood, along with a slowing of processing speed. This means an elderly user might have more difficulty remembering an arbitrary or lengthy wake phrase, especially if it’s not used frequently. They might also need a bit more time to recall and pronounce the wake word when they want to use it. By using a simple, familiar wake word, designers can accommodate age-related changes. Furthermore, older users might benefit from consistent routines – using the same wake phrase across devices or over time – to leverage long-term memory in place of taxed working memory. Consistency and practice help transfer the wake word into long-term memory through repetition, which can mitigate some working memory limitations.

    Designing for diverse cognitive needs means aiming for simplicity, familiarity, and consistency in the wake word. If the target users include children or neurodivergent individuals, a wake word drawn from a familiar context (like a common character name or an easy everyday word) could be easier for them to learn. On the other hand, if a wake word is too novel or complex, those with limited working memory might struggle until it becomes rote. Research on voice assistant accessibility has highlighted that current devices often do not fully account for cognitive differences – for instance, some users with cognitive impairments report difficulty remembering less common wake words or confusion when an assistant fails to respond. By proactively designing the wake word with these users in mind, we can make voice interaction more inclusive. In summary, the wake word should be memorable and low-effort for the broadest range of users, reducing cognitive barriers to interaction.

    Linguistic and Phonetic Considerations for Wake Words

    From a linguistic standpoint, an effective wake word must be audibly distinctive so that both users and microphones can recognize it reliably. This involves careful choice of phonetic makeup, word length, and syllable stress pattern. The goal is to minimize false activations (when unrelated speech accidentally triggers the assistant) and missed detections (when the user says the wake word but the system fails to recognize it). Several key guidelines emerge from speech recognition research:

1. Prioritize Phonemic Diversity: Wake words are more easily detected when they contain a diverse range of sounds (phonemes). Including a mix of consonant and vowel sounds produces an acoustic signature that stands out from background noise and ordinary conversation. If a wake word used only a narrow range of sounds (for example, the “oo” sound repeated, as in “Doo-loo”), it could blur into other words with similar sounds. By contrast, a wake word like “Alexa” spans multiple distinct phonemes (/ə/-/l/-/ɛ/-/k/-/s/-/ə/), making it acoustically unique. Having varied phonemes ensures that no common word or phrase sounds too similar to the wake word. This diversity reduces the chance that random speech will accidentally contain the same sequence. In practice, voice AI developers explicitly recommend using wake words with a sufficient number of phonemes and avoiding phonetic sequences that appear in everyday words. The distinct sound profile acts as a verbal fingerprint for the assistant. In short, the more phonetically distinct the wake word, the less likely it is to be confused with other utterances.

2. Choose an Appropriate Length: Generally, wake words should be short enough to say quickly, but long enough to be unique. A very short trigger (one syllable or just a couple of phonemes) can too easily appear in normal speech. For example, “Hey” or “Yo” by itself would be a terrible wake word due to constant false alarms. On the other hand, an excessively long phrase could overburden the user and increase the chance of mispronunciation or forgetting it. Research and industry experience suggest a sweet spot of about three syllables (roughly 6–10 phonemes) for wake words. At this length, there are enough speech sounds to form a unique pattern, but the phrase is still brief. This is reflected in popular wake words like “OK Google” (four syllables across two words) or “Hey Siri” (three syllables). These phrases are long enough to be rare in casual speech but short enough to utter easily. Empirical studies have found that incorporating an uncommon multi-syllabic wake word dramatically lowers false activation rates, compared to single-syllable words. In designing new wake words, testing different lengths with users can help identify the shortest phrase that remains unambiguous to the system.

3. Leverage Stress Patterns: In English and many other languages, the stress pattern (emphasis on certain syllables) can affect how clearly a word is perceived. Linguistic research indicates that words with a trochaic pattern (strong emphasis on the first syllable) tend to be recognized more easily than words with an iambic pattern (stress on the second syllable). This is because stressed syllables are typically louder, longer, and acoustically more salient. Our ears – and speech recognition models – latch onto that strong initial syllable as a clear signal. A classic study on spoken-word recognition demonstrated that English listeners heavily rely on the stressed syllable to identify word boundaries and content. When a word begins with an unstressed syllable, it’s more likely to be misheard or mis-segmented. For wake word design, this implies that a name like “Marco” (MAR-co, trochaic) would likely be easier to detect than “Simone” (si-MONE, iambic) in comparable conditions. Wherever possible, placing emphasis up front in the wake word can improve recognition robustness. The stressed syllable provides a reliable anchor for both humans and machines to catch the word. Additionally, a strong initial syllable can help in noisy environments, as it’s more likely to stick out from background chatter or sound.

    4. Optimize Pronounceability: A wake word should be easy to pronounce for the typical user. If it contains rare or tongue-twister sounds (like a cluster of difficult consonants), users might stumble when saying it – leading to failed activations. This is especially a concern if the device is intended for international markets where the phonemes of the chosen word might not exist in everyone’s native language. User studies have shown that when people struggle to articulate the wake word, their success in activating the assistant drops significantly. Mispronunciations or hesitations can prevent the system from recognizing the wake word at all. For inclusive design, developers often test candidate wake words with diverse user groups, including non-native speakers, children, or people with speech impairments. The goal is to identify any pronunciation difficulties early. Ideally, the wake word should consist of sounds that are familiar and easy across different languages (for example, the “k” sound is common and usually easy to say, whereas a rolled “r” might not be). If a particular demographic finds the wake word hard to enunciate, designers might tweak the word or provide alternate trigger phrases. In summary, a straightforward, easily enunciated wake word will yield a better experience and higher activation success rate for all users.
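As a rough illustration, the four guidelines above can be folded into a heuristic screening script. The phoneme and stress entries below are hand-written for illustration (a real tool would consult a pronunciation dictionary such as CMUdict), and the thresholds mirror the rules of thumb in the text rather than any validated standard; note that word commonness is not checked here.

```python
# Heuristic screen for candidate wake words: phoneme count, phoneme
# diversity, syllable count, and initial stress.

# Hypothetical entries: (phonemes, syllable_count, stress_on_first_syllable)
LEXICON = {
    "mycroft":  (["M", "AY", "K", "R", "AO", "F", "T"], 2, True),        # MY-croft
    "computer": (["K", "AH", "M", "P", "Y", "UW", "T", "ER"], 3, False),  # com-PU-ter
    "yo":       (["Y", "OW"], 1, True),
}

def screen_wake_word(word):
    """Return a list of heuristic concerns; an empty list means it passes."""
    phonemes, syllables, initial_stress = LEXICON[word]
    issues = []
    if len(phonemes) < 6:
        issues.append("fewer than 6 phonemes: prone to false activations")
    if len(set(phonemes)) < 0.6 * len(phonemes):
        issues.append("low phoneme diversity: may blur into similar words")
    if not 2 <= syllables <= 4:
        issues.append("outside the ~3-syllable sweet spot")
    if not initial_stress:
        issues.append("no initial stress: harder to segment in noise")
    return issues

for word in LEXICON:
    print(word, "->", screen_wake_word(word) or "passes the heuristic screen")
```

The same skeleton extends naturally to batch-screening a brainstormed list of candidate names before any user testing begins.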

    By attending to these linguistic and phonetic factors, wake word designers create triggers that are both user-friendly and machine-readable. A carefully engineered wake word reduces strain on the user (who doesn’t have to repeat themselves or fight the system) and on the speech recognizer (which gets a clear, distinctive audio cue). In essence, the acoustic design of the wake word underpins its reliability: a well-chosen word works with the user’s natural speech and the system’s capabilities, rather than against them.

    Memory Aids and Multi-Sensory Cues

    Even with an optimal name and sound profile, some users may initially struggle to remember a new wake word. This is where cognitive psychology can inform design by suggesting memory aids and techniques to reinforce learning. One fundamental principle is repetition – simply encountering or using the wake word multiple times helps commit it to memory. Repetition has long been known as one of the most effective ways to enhance recall. Experimental studies on learning have shown that when people are exposed to a piece of information repeatedly over time, their memory for that information strengthens significantly. This applies to words, names, or any verbal material. Therefore, a voice assistant’s onboarding process might intentionally have the user say or hear the wake word a few times (for example, a tutorial that says: “To get my attention, just say ‘Hey Lumi’… Try saying ‘Hey Lumi’ now”). By practicing the wake word aloud and hearing it echoed, users form a more durable memory trace for the phrase.
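To make that onboarding idea concrete, here is a minimal sketch of such a repetition loop. Everything in it is hypothetical: `listen_once` stands in for a real speech recognizer, and “Hey Lumi” reuses the invented wake word from the example above.

```python
# Onboarding sketch: prompt the user to repeat the wake word until a target
# number of successful repetitions is reached, echoing it back each time so
# the user both says and hears the phrase. `listen_once` is a stand-in for
# a recognizer that returns the phrase it heard.

def onboard(wake_word, listen_once, target=3, max_attempts=10):
    successes = 0
    for _ in range(max_attempts):
        heard = listen_once()
        if heard.lower() == wake_word.lower():
            successes += 1
            print(f"Great -- I heard '{wake_word}' ({successes}/{target}).")
            if successes == target:
                break
        else:
            print(f"Almost! Try saying '{wake_word}' again.")
    return successes
```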

    Another potent technique is to engage multiple senses – pairing the verbal wake word with a visual or motor cue. Cognitive research on the enactment effect finds that memory is improved when we perform an action related to the information we’re learning. In other words, doing something while saying something can make it more memorable. For instance, one might nod their head or press a confirm button while speaking the wake word during setup. This simple association between action and word can reinforce recall. Psychologists have compiled extensive evidence that performing gestures or movements linked to verbal commands yields higher subsequent recall of those commands. As Dr. Sharon Saline, a clinical psychologist specializing in ADHD, advises parents: “It even helps to pair an action or movement with a word or phrase” when trying to reinforce working memory. This aligns perfectly with the science – combining kinesthetic memory with verbal memory provides two pathways to remember the wake word later on.

    Similarly, incorporating a visual element can bolster memory via dual coding of information. If the voice assistant’s app or the device’s screen displays the wake word visually (like showing the text “Lumi” or an icon whenever it’s listening), users get an extra visual reminder of the trigger word. The multimedia learning principle in cognitive psychology states that people learn and recall better from combined verbal and visual cues than from words alone. Thus, a companion smartphone app that flashes the wake word on screen during training, or a smart speaker that lights up in a distinctive pattern when the wake word is spoken, can create a stronger association. These multi-sensory cues essentially give the user’s brain multiple reference points – the sound of the word, the look of the word, perhaps a color or light – making it easier to retrieve later. For users with cognitive challenges, this can be especially helpful. An individual with ADHD or an older adult with mild memory decline might benefit from the extra reinforcement that a visual confirmation or a tactile vibration provides when they say the wake word.

    One intriguing memory aid is leveraging the power of sleep for consolidation. While not a design feature per se, it’s worth noting that if users are introduced to a wake word and then have a normal night’s sleep, their memory for that word may improve the next day. Sleep research has demonstrated that our brains solidify and reorganize memories during sleep, especially for things we have learned recently. In practical terms, if a user struggles with a new wake word on day one, they might find it “sticks” better after they’ve slept on it. This is because the act of recalling and using the wake word, even imperfectly, sets up a memory trace that sleep can later strengthen. While designers cannot force users to sleep on cue, they can ensure that early interactions with the assistant reinforce the wake word enough times so that the day-one memory trace is strong before the user’s first overnight consolidation. For instance, a setup wizard might have the user say the wake word five or six times in various contexts. This repetition combined with subsequent sleep can turn a once foreign word into second nature. It’s a subtle design consideration, but it recognizes that learning is a process – and the wake word should be consistently reinforced, especially in the initial adoption phase, to harness natural memory consolidation.

    In summary, memory aids like repetition, multi-sensory cues, and strategic reinforcement schedules can significantly improve wake word retention. A wake word design that acknowledges human memory processes will ease the learning curve for users. Instead of expecting users to simply memorize a trigger word instantly, good design scaffolds the learning: it repeatedly exposes users to the word in engaging ways, links it with actions and images, and provides feedback that helps encode the word into memory. These human-centered strategies ensure that recalling the wake word becomes effortless over time, even for users who might initially have difficulty. The result is a more seamless and frustration-free interaction – the wake word becomes an intuitive doorway to the voice assistant, rather than a stumbling block.

    Adapting to User Context and Attention

    Human factors in wake word design extend beyond the word itself to the context in which it is used. Real-world conditions such as background noise, user attention, and stress can influence wake word performance. For example, if a user is multitasking or not fully concentrating, they might slur or mumble the wake word, or use a different intonation that the system didn’t expect. Research using virtual reality to simulate user attention has found that people’s speaking behavior changes when they are distracted, which in turn affects wake word detection accuracy. A user preoccupied with driving or cooking might not articulate “Hey Athena” as clearly as when they are focused. Environmental noise is another factor – the presence of TV sounds, other conversations, or echo can mask parts of the wake word.

    To account for these human and environmental variabilities, modern systems employ adaptive and personalized wake word models. For instance, one approach is to use speaker-specific acoustic models that learn the characteristics of the individual user’s voice over time. A 2020 study proposed training wake word detectors with speaker-based submodels, essentially tuning the detection algorithm to the user’s vocal pitch, accent, and pronunciation idiosyncrasies. This personalization significantly improved activation success rates, especially for users with strong regional accents or non-native pronunciations. It highlights that technology can complement human-centered design: even if the wake word is well-chosen, tailoring the model to the user’s voice further ensures reliable performance for that user. Many voice assistants now perform ongoing learning – if the device occasionally misses when you say “Hey Portal,” it might adjust its internal model of your wake word over time (with user permission) to better fit how you specifically pronounce “Portal.”
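The submodel training in that study is out of scope here, but a simpler version of the same idea can be sketched: keep a running profile of the user's confirmed wake word utterances and verify new detections against it. The upstream voice-embedding model is assumed (embeddings appear below as plain float vectors), and the 0.8 similarity threshold is illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

class SpeakerProfile:
    """Running average of a user's confirmed wake word embeddings."""

    def __init__(self, dim):
        self.total = [0.0] * dim
        self.count = 0

    def update(self, embedding):
        # Called after a confirmed activation, with user permission.
        self.total = [t + e for t, e in zip(self.total, embedding)]
        self.count += 1

    def matches(self, embedding, threshold=0.8):
        if self.count == 0:
            return True  # no profile yet: defer to the generic detector
        mean = [t / self.count for t in self.total]
        return cosine(mean, embedding) >= threshold
```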

    Another adaptation is context-aware detection. The system can incorporate other signals to decide if the wake word was intended. For example, some smart speakers use their built-in camera or other sensors to detect if the user is facing the device or has made a wake gesture, and weigh the wake word trigger more heavily if so. Similarly, if the system detects a sudden pause in conversation or a vocal projection (people often slightly raise their voice when talking to a machine), it can use those cues to reduce false negatives. Research in human-computer interaction suggests that combining audio triggers with user attention cues (like gaze or head orientation) yields more robust activation performance. This kind of multi-modal approach again merges human factors with technical design – recognizing that humans naturally give off signals of intent, which the device can exploit to respond more intelligently.
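One way to sketch this kind of fusion, assuming a detector that already yields an acoustic confidence in [0, 1] and sensors that report attention cues; the weights and the 0.85 threshold are illustrative, not tuned values:

```python
# Context-aware gating: a confident acoustic match activates on its own,
# while a borderline match activates only with supporting attention cues
# (gaze toward the device, a pause in surrounding conversation).

def should_activate(acoustic_conf, facing_device=False, speech_pause=False):
    score = acoustic_conf
    if facing_device:
        score += 0.15  # gaze / head-orientation cue
    if speech_pause:
        score += 0.10  # users often pause before addressing a machine
    return score >= 0.85
```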

    Lastly, designing the feedback around the wake word invocation is crucial for usability. When the user says the wake word, the assistant typically provides an acknowledgment (a tone, a light, or a spoken response like “Yes?”). This feedback reassures the user that the system is listening. If the wake word is not detected and no feedback comes, users may repeat themselves, often louder, leading to frustration. Some research suggests implementing a gentle “fallback” prompt if the system thinks it heard something like the wake word but isn’t confident – for example, the assistant might say, “If you were trying to get my attention, please say [wake word] again.” While over-prompting can be annoying, a well-timed clarification can help users recover when a wake word attempt fails, especially in noisy environments or for cognitively impaired users who might be confused by the lack of response. The design of these feedback and recovery interactions ensures that even when human or technical factors cause a miss, the overall experience remains user-friendly and forgiving.
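The tiered feedback policy described above can be sketched with a pair of confidence thresholds; the 0.8/0.5 values and the "Lumi" name are illustrative placeholders, not values from any shipping product.

```python
# Confident detections get an immediate acknowledgment, near-misses get a
# gentle clarification prompt, and low scores are ignored as background
# speech. Thresholds would be tuned per device and environment.

HIGH_CONF, LOW_CONF = 0.8, 0.5

def respond(confidence, wake_word="Lumi"):
    if confidence >= HIGH_CONF:
        return "chime"  # acknowledge and start listening
    if confidence >= LOW_CONF:
        return (f"If you were trying to get my attention, "
                f"please say '{wake_word}' again.")
    return None  # stay silent
```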

    Conclusion

    Designing an effective wake word for voice assistants is a multidisciplinary challenge at the intersection of human factors and technology. On one hand, the wake word must satisfy technical criteria: it should be acoustically distinct, sufficiently long and phonemically rich to minimize false triggers, and robust across various noise conditions and accents. On the other hand, it must be psychologically and cognitively accessible: easy to remember, pronounce, and aligned with user expectations and emotions. By drawing on principles from cognitive psychology, linguistics, and human-computer interaction, designers can create wake words that serve users of all ages and abilities. Techniques like persona design, repetitive training, and multi-sensory cues help users form a strong mental association with the wake word, while adaptive algorithms personalize detection to each user’s voice and context.

In summary, the best wake words emerge from empathetic design – understanding the diverse ways people might struggle or succeed with voice interaction, and proactively designing a trigger word that feels natural and supportive to those users. A thoughtfully designed wake word becomes almost invisible to the user; it fades into the background as a reliable conduit to the assistant. Achieving this requires optimizing everything from the word’s sound pattern to the social impression it creates. As we continue to welcome voice assistants into our lives, optimizing wake word interaction for all users isn’t just a technical task; it’s a human-centered imperative. By optimizing for human factors, we ensure voice technology is inclusive, effective, and delightful for everyone – whether it’s a child asking a homework question, an adult cooking dinner, or an older person seeking a daily weather update. The wake word is the key to these interactions, and when designed right, it unlocks technology that truly listens to and understands us.

    Bibliography

    [1] A. Purington, J. G. Taft, S. Sannon, N. N. Bazarova, and S. H. Taylor. “Alexa is my new BFF”: Social Roles, User Satisfaction, and Personification of the Amazon Echo. In Proc. of CHI Extended Abstracts, pp. 2853–2859. ACM, 2017.
    [2] A. D. Baddeley. Working memory: Theories, models, and controversies. Annual Review of Psychology, 63:1–29, 2012.
    [3] D. Areces, J. Dockrell, T. García, M. Cueli, and P. González-Castro. Analysis of cognitive and attentional profiles in children with and without ADHD using an innovative virtual reality tool. PLOS ONE, 13(8): e0201039, 2018.
    [4] N. B. Groves, M. J. Kofler, E. L. Wells, T. N. Day, and E. S. M. Chan. An examination of relations among working memory, ADHD symptoms, and emotion regulation. Journal of Abnormal Child Psychology, 48(4):525–537, 2020.
    [5] K. L. Bopp and P. Verhaeghen. Aging and verbal memory span: A meta-analysis. Journals of Gerontology, Series B: Psychological Sciences, 60(5):P223–P233, 2005.
    [6] H. Chen and J. Yang. Multiple exposures enhance both item memory and contextual memory over time. Frontiers in Psychology, 11:565169, 2020.
    [7] B. R. T. Roberts, C. M. MacLeod, and M. A. Fernandes. The enactment effect: A systematic review and meta-analysis of behavioral, neuroimaging, and patient studies. Psychological Bulletin, 148(5–6):355–388, 2022.
    [8] S. Diekelmann and J. Born. The memory function of sleep. Nature Reviews Neuroscience, 11(2):114–126, 2010.
    [9] L. Shams and A. R. Seitz. Benefits of multisensory learning. Trends in Cognitive Sciences, 12(11):411–417, 2008.
    [10] S. L. Mattys and A. G. Samuel. Implications of stress-pattern differences in spoken-word recognition. Journal of Memory and Language, 42(4):571–596, 2000.
    [11] M. Combs-Ford, C. Hazelwood, and R. Joyce. Are you listening? An observational wake word privacy study. Organizational Cybersecurity Journal: Practice, Process and People, 2(1), 2022.
[12] J. Hwang, H. Kim, S. Lee, and Y. Lee. Wake-up word detection with speaker-based submodels for personalized service of voice assistants. arXiv preprint arXiv:2010.04764, 2020.

  • Human Factors in Wake Word Design: How to Make Voice Assistants Work for Everyone

    Human Factors in Wake Word Design: How to Make Voice Assistants Work for Everyone

    TL;DR

    • Choose wake words that are short, unique, and easy to say.
    • Make them feel personal — like calling a friend.
    • Account for memory challenges (like ADHD or aging).
    • Use repetition, visual cues, and feedback to help users remember.
    • Allow personalization where possible to support diverse needs.

    Why Wake Words Matter

    You’re in the kitchen, hands full of groceries, and you say, “Hey Siri,” to turn on the lights. Nothing. You try again — still no response. It’s annoying, right?

    Now, imagine you’re someone with ADHD or a memory challenge. What if recalling or pronouncing that wake word isn’t so easy? What if the device doesn’t recognize you — not because of poor tech, but because the wake word wasn’t designed with your brain in mind?

    That tiny phrase — “Alexa,” “OK Google,” “Hey Lumi” — is the key to unlocking voice tech. But it’s often overlooked in design. Let’s fix that.

    First Impressions Count: The Wake Word as a Persona

    A wake word isn’t just a switch—it’s the start of a conversation. It shapes how users feel about the assistant. Is it friendly? Robotic? Bossy? Research shows that we often treat voice assistants like people. We assign them personalities, roles, and even relationships [1].

That means the wake word needs to match the brand and feel good to say. “Alexa” sounds casual and approachable, while “Siri” is snappy. Compare that to something generic like “Assistant,” and you can feel the difference.

    Pro Tip: Avoid common words that might get triggered by accident. The more distinct and uncommon the wake word, the fewer frustrating false activations you’ll get [2].

    Also, consider global users. A wake word that’s easy to pronounce in English might be a tongue-twister elsewhere. Test across accents and languages.

    The Memory Challenge: Designing for ADHD, Aging, and Brain Fog

    Not everyone finds it easy to recall or say a wake word. People with ADHD, learning differences, or even normal aging can struggle with working memory — the brain’s mental scratchpad [3].

    For someone with ADHD, remembering a word on demand can feel like trying to grab a thought in a storm. If the assistant fails to respond, frustration can snowball fast — especially for those with emotional regulation challenges [4].

    Older adults also face challenges. As we age, memory and processing speed naturally decline [5]. A complicated wake word can become a daily hurdle.

    Solution:

    • Keep it simple.
    • Choose something familiar.
    • Reinforce it with repetition, visuals, and cues.

    And is your device part of a product family? Use the same wake word across devices—fewer words to remember = less cognitive load.

    What Makes a Great Wake Word?

    1. Distinct Sounds (Phonemic Diversity)

    A good wake word has a unique mix of sounds — not too similar to everyday words. “Alexa” has six distinct sounds. That makes it easier for humans and machines to pick out [2].

    Avoid short, mushy-sounding phrases that could be confused with other speech.

    2. Ideal Length (2–3 syllables)

    Too short? It gets triggered by mistake. Too long? People forget it or mess it up. The sweet spot? Around three syllables or 6–10 phonemes [2]. Think: “Hey Siri” or “OK Google.”

    3. Stress the First Syllable

    In English, emphasizing the first syllable (“ALexa”) makes it easier to hear and remember [6]. It stands out in noisy rooms and helps with quick recognition.

    4. Easy to Say — For Everyone

    Design for people with speech differences, different accents, or kids learning language. If your wake word trips people up, they’ll give up.

    Pro Tip: Test your wake word with diverse speakers. If many people mispronounce or dread saying it, try something else.
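    Taken together, these criteria can be turned into a rough automated screen for candidate wake words. The sketch below is a crude heuristic, not a real phonemic analyzer: vowel-group counting stands in for syllable counting, and the thresholds are my own reading of the guidelines above.

    ```typescript
    // Rough heuristic screen for wake word candidates (illustrative only).
    // Syllables are approximated by counting vowel groups, a crude stand-in
    // for real phonemic analysis.

    function approxSyllables(word: string): number {
      const groups = word.toLowerCase().match(/[aeiouy]+/g);
      return groups ? groups.length : 0;
    }

    interface WakeWordCheck {
      candidate: string;
      syllables: number;
      distinctLetters: number;
      ok: boolean;
    }

    function checkWakeWord(candidate: string): WakeWordCheck {
      const letters = candidate.toLowerCase().replace(/[^a-z]/g, "");
      const syllables = approxSyllables(letters);
      const distinctLetters = new Set(letters).size;
      // Heuristics from the criteria above: 2-4 syllables, varied sounds.
      const ok = syllables >= 2 && syllables <= 4 && distinctLetters >= 4;
      return { candidate, syllables, distinctLetters, ok };
    }

    console.log(checkWakeWord("Alexa"));
    console.log(checkWakeWord("Bob"));
    ```

    A screen like this only filters out obvious duds; real candidates still need testing with diverse speakers, as noted above.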

    Helping Users Remember: Memory Boosters That Work

    Even a great wake word needs help sticking. Here’s how to make it memorable:

    Repetition

    Saying it often = remembering it better. During onboarding, have users say it out loud a few times. Let them hear it spoken back [7].

    Pair with Action

    Gesture while saying it. Tap a button. Nod. These multi-sensory cues strengthen memory [8].

    “Doing something while saying something makes it stick.” — Cognitive psychology 101

    Show It Visually

    On a screen? Flash the wake word in text. Use a unique light or icon. People remember better when they see and hear [9].

    Let Sleep Do Its Job

    Yep, sleep helps. If someone practices a wake word today, they’ll recall it better tomorrow [10]. So make early interactions count — get that repetition in!

    Think Beyond Words: Context and Feedback Matter

    Even the best wake word can fail if users are distracted, mumble, or speak too quietly. Real life is messy.

    Design for it:

    • Use microphones that adapt to noisy rooms.
    • Train your models to recognize common pronunciation variations.
    • Use feedback tones, lights, or responses (“Yes?”) to confirm the assistant heard them.

    Bonus: Personalized wake word models—ones that learn your voice—can dramatically improve accuracy, especially for people with accents or speech quirks [11].

    Final Thoughts: The Wake Word Is Your First Impression

    It’s easy to overlook the wake word — just a few syllables. But those syllables set the tone for everything that follows. They invite (or block) the user in. They say: “Yes, I’m listening.”

    When wake words are designed with real people in mind — their memory, voice, and emotions — voice tech feels less like a machine and more like a partner.

    Design for humans first. The rest will follow.


    Sources

    1. A. Purington et al. “Alexa is my new BFF.” CHI EA ’17. ACM.
    2. Picovoice. “Tips for choosing a wake word.” Retrieved 2025.
    3. A. D. Baddeley. Annual Review of Psychology, 2012.
    4. N. Groves et al. Journal of Abnormal Child Psychology, 2020.
    5. K. Bopp, P. Verhaeghen. Journals of Gerontology: Series B, 2005.
    6. S. Mattys, A. Samuel. Journal of Memory and Language, 2000.
    7. H. Chen, J. Yang. Frontiers in Psychology, 2020.
    8. B. Roberts et al. Psychological Bulletin, 2022.
    9. L. Shams, A. Seitz. Trends in Cognitive Sciences, 2008.
    10. S. Diekelmann, J. Born. Nature Reviews Neuroscience, 2010.
    11. J. Hwang et al. arXiv:2010.04764, 2020.
  • Components of Wake Words in Wake Word Engine Design

    Components of Wake Words in Wake Word Engine Design


    Purpose:

    This article examines techniques and best practices for creating custom wake words for DIY voice assistants, specifically using tools like openWakeWord and microWakeWord. It delves into the technical aspects of wake word detection and offers guidance on implementing these open-source solutions to enhance the responsiveness and personalization of voice-activated projects. The goal is to give enthusiasts and developers a practical resource for building more efficient and customized voice assistant experiences by addressing common challenges and providing concrete examples.

    Wake word engines are specialized systems that detect specific spoken phrases to activate voice-enabled devices. Designing good wake words involves multiple components working in tandem to balance accuracy, efficiency, and robustness. This report deconstructs the technical and linguistic elements of wake words and the systems that detect them, drawing from academic research, industry publications, and engineering frameworks.

    1. Acoustic Preprocessing Layers

    1.1 Mel-Frequency Spectral Analysis

    Wake word engines first convert raw audio into mel spectrograms—time-frequency representations optimized for human auditory perception [6]. These spectrograms highlight phonetically relevant features by compressing high-frequency components, mimicking the human ear’s nonlinear frequency response. For example, Sensory’s wake word engine uses mel spectrograms as input to its deep neural networks [5].
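    To illustrate that nonlinear warping, here is a small TypeScript sketch of the standard Hz-to-mel mapping (the 2595 and 700 constants are the common HTK-style mel formula; the filterbank helper is illustrative, not any particular engine's implementation):

    ```typescript
    // Hz <-> mel conversion, the frequency warping behind mel spectrograms.

    function hzToMel(hz: number): number {
      return 2595 * Math.log10(1 + hz / 700);
    }

    function melToHz(mel: number): number {
      return 700 * (Math.pow(10, mel / 2595) - 1);
    }

    // Center frequencies for a small mel filterbank: evenly spaced in mel,
    // which packs filters densely at low frequencies and sparsely at high
    // ones, matching the ear's resolution.
    function melCenters(fMin: number, fMax: number, nFilters: number): number[] {
      const mMin = hzToMel(fMin);
      const mMax = hzToMel(fMax);
      const step = (mMax - mMin) / (nFilters + 1);
      return Array.from({ length: nFilters }, (_, i) => melToHz(mMin + (i + 1) * step));
    }

    console.log(melCenters(0, 8000, 10).map((f) => Math.round(f)));
    ```

    The growing gaps between successive center frequencies are exactly the "compression of high-frequency components" described above.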

    1.2 Noise Suppression and Beamforming

    To handle ambient noise, advanced systems employ adaptive beamforming techniques like the Temporal-Difference Generalized Eigenvalue (TDGEV) beamformer [7]. This method compares current and past audio frames to isolate wake words from directional noise sources, reducing false triggers by 15–30% in multi-speaker environments. Amazon’s metadata-aware detectors further refine this by adjusting for device-specific acoustic conditions (e.g., alarms or music playback) [2][3].

    2. Core Detection Architectures

    2.1 Binary Classification Models

    At their core, wake word engines function as binary classifiers. Spokestack’s system uses a three-stage neural pipeline:

    1. Filtering: Isolate critical frequency bands using convolutional
      layers [4].

    2. Encoding: Convert filtered features into compact embeddings via recurrent layers [6].

    3. Classification: Apply attention mechanisms to detect temporal wake word patterns [4].
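    The detection side of this pipeline can be caricatured in a few lines. In the sketch below the neural scorer is replaced by a pluggable `score` callback, and a detection fires when the score, smoothed over a sliding window, crosses a threshold; the window size and threshold are arbitrary illustrative values, not Spokestack's.

    ```typescript
    // Sliding-window binary detection: a per-frame wake word score (the
    // callback stands in for the neural classifier) is averaged over the
    // last `windowSize` frames; a detection fires when the average crosses
    // `threshold`. Returns the frame index of detection, or -1.

    type FrameScorer = (frame: Float32Array) => number;

    function detectWakeWord(
      frames: Float32Array[],
      score: FrameScorer,
      windowSize = 5,
      threshold = 0.8
    ): number {
      const recent: number[] = [];
      for (let i = 0; i < frames.length; i++) {
        recent.push(score(frames[i]));
        if (recent.length > windowSize) recent.shift();
        const avg = recent.reduce((a, b) => a + b, 0) / recent.length;
        if (recent.length === windowSize && avg >= threshold) return i;
      }
      return -1;
    }
    ```

    Smoothing over a window is what keeps a single noisy high-scoring frame from triggering the device on its own.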

    2.2 Hybrid Alignment Strategies

    Recent research compares alignment-based and alignment-free approaches:

    1. Alignment-based: Requires phoneme-level timestamp labels for training, achieving 7.19% False Reject Rate (FRR) at 0.1 False Alarms/hour (FAh) with 10% labeled data [1].

    2. Alignment-free: Uses Connectionist Temporal Classification (CTC) for unaligned data, excelling at low FAh (<0.5/hour) [1].

    3. Hybrid systems combine both, maintaining accuracy while reducing labeling costs by 50% [1].

    3. Linguistic and Phonetic Components

    3.1 Phoneme Composition

    Wake words are engineered for acoustic distinctiveness:

    1. Phoneme diversity: “Alexa” (6 phonemes: /ə/ /l/ /ɛ/ /k/ /s/ /ə/) outperforms shorter phrases due to spectral variability [8].

    2. Avoid confusable allophones: The /k/ in “Computer” is less prone to confusion with /t/ or /p/ in noisy conditions [8].

    3.2 Prosodic Features

    1. Stress patterns: Trochaic stress (e.g., “Álexa”) improves detection over iambic patterns [2].

    2. Duration: Words with ≥200ms duration allow robust feature extraction [8].

    4. Training Data Requirements

    4.1 Dataset Composition

    Optimal wake word datasets balance:

    1. Positive/negative ratio: 1:15 to 1:20 wake-to-non-wake samples [1].

    2. Speaker diversity: ≥4,000 unique speakers to avoid bias toward vocal traits [1].

    3. Synthetic augmentation: Speed perturbation (±20%), reverberation (RT60: 0.3–1.2s), and ambient noise injection reduce FRR by 5.6–18.3% across environments [1].
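    These composition targets are easy to encode as a quick sanity check over a dataset manifest. A minimal sketch, with thresholds taken directly from the figures above (the `DatasetStats` shape is my own invention for illustration):

    ```typescript
    // Sanity-check a wake word dataset against the composition targets:
    // a 1:15 to 1:20 positive-to-negative ratio and >= 4,000 speakers.

    interface DatasetStats {
      positives: number;       // wake word samples
      negatives: number;       // non-wake samples
      uniqueSpeakers: number;
    }

    function datasetIssues(s: DatasetStats): string[] {
      const issues: string[] = [];
      const ratio = s.negatives / s.positives;
      if (ratio < 15 || ratio > 20) {
        issues.push(`negative:positive ratio ${ratio.toFixed(1)}:1 is outside 15:1-20:1`);
      }
      if (s.uniqueSpeakers < 4000) {
        issues.push(`only ${s.uniqueSpeakers} speakers; target is >= 4000`);
      }
      return issues;
    }

    console.log(datasetIssues({ positives: 1000, negatives: 5000, uniqueSpeakers: 100 }));
    ```

    Running a check like this before training catches skewed collections early, before a biased model makes the imbalance expensive to discover.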

    4.2 Privacy-Centric Collection

    Systems like Amazon’s use “found data” from public sources and synthetic voice conversion to avoid privacy violations, achieving 0.9% FRR parity with human-collected data [1][2].

    5. Security and Robustness

    5.1 Anti-Jamming Mechanisms

    Adversarial attacks using 2ms ultrasonic pulses can disable wake word detection. Defenses include:

    1. Temporal masking: Ignoring sub-50ms audio spikes.

    2. Channel authentication: Validating wake word acoustics against device-specific metadata (e.g., speaker ID) [2].
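    Temporal masking is simple to sketch: if frames arrive every 10ms, ignoring sub-50ms spikes means discarding any run of fewer than five consecutive active frames. A minimal illustration (the frame hop and boolean flag representation are assumptions, not taken from the cited systems):

    ```typescript
    // Temporal-masking sketch: discard "active" runs shorter than a minimum
    // duration so a 2ms ultrasonic pulse cannot influence detection. With a
    // 10ms frame hop, minRunFrames = 5 corresponds to a 50ms mask.

    function maskShortSpikes(
      active: boolean[],   // per-frame activity flags
      minRunFrames = 5     // 50ms at a 10ms frame hop
    ): boolean[] {
      const out = active.slice();
      let runStart = -1;
      for (let i = 0; i <= active.length; i++) {
        const on = i < active.length && active[i];
        if (on && runStart < 0) runStart = i;
        if (!on && runStart >= 0) {
          if (i - runStart < minRunFrames) {
            for (let j = runStart; j < i; j++) out[j] = false; // too short: mask it
          }
          runStart = -1;
        }
      }
      return out;
    }
    ```

    Legitimate speech easily clears the 50ms floor, so the filter costs nothing in normal use while blunting pulse-based jamming.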

    5.2 False Activation Mitigation

    1. Multi-stage verification: Cloud-based secondary checks reduce false accepts by 40% [2].

    2. User-specific models: Personalized embeddings lower FRR to 0.4% in quiet settings.

    6. Performance Metrics

    1. False Reject Rate (FRR): Best systems achieve ≤1% at 1 FA/hour [1][2].

    2. Latency: On-device detection within 300ms on low-power hardware like the Raspberry Pi Zero [6].

    3. Power efficiency: ≤10mW consumption for always-on operation [11].

    4. Resilience to distribution shift: Apple’s Heimdal system maintains stability across environmental changes [9].
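    The first two metrics are straightforward to compute from evaluation logs. A minimal sketch (function and field names are illustrative, not from any cited framework):

    ```typescript
    // Compute the two headline metrics: False Reject Rate over labeled wake
    // word utterances, and False Alarms per hour over negative-only audio.

    interface EvalResult {
      frr: number;        // fraction of true wake words missed
      faPerHour: number;  // spurious activations per hour of negative audio
    }

    function evaluateDetector(
      trueWakeWords: number,
      detectedWakeWords: number,
      falseAlarms: number,
      negativeAudioHours: number
    ): EvalResult {
      return {
        frr: (trueWakeWords - detectedWakeWords) / trueWakeWords,
        faPerHour: falseAlarms / negativeAudioHours,
      };
    }

    const r = evaluateDetector(1000, 990, 12, 24);
    console.log(`FRR ${(r.frr * 100).toFixed(1)}%, ${r.faPerHour.toFixed(2)} FA/h`);
    ```

    Note that the two metrics trade off against each other through the detection threshold, which is why systems are quoted as "FRR at a fixed FA/hour" rather than by a single number.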

    Conclusion

    Modern wake word engines integrate signal processing, machine learning, and linguistic design to balance usability and security. Key innovations include hybrid alignment training, phoneme-optimized wake words, and privacy-preserving synthetic data. Future directions may leverage neuromorphic computing for sub-100mW operation and cross-lingual phoneme transfer learning.

    Bibliography (ACM Format)

    1. Ribeiro, D., Koizumi, Y., & Harada, N. (2023). Combining Alignment-Based and Alignment-Free Training for Wake Word Detection. Interspeech 2023. Retrieved from https://www.isca-archive.org/interspeech_2023/ribeiro23_interspeech.pdf

    2. Amazon Science. (2023). Amazon Alexa’s new wake word research at Interspeech. Retrieved from https://www.amazon.science/blog/amazon-alexas-new-wake-word-research-at-interspeech

    3. Amazon Science. (2021). Using wake word acoustics to filter out background speech improves speech recognition by 15 percent. Retrieved from https://www.amazon.science/blog/using-wake-word-acoustics-to-filter-out-background-speech-improves-speech-recognition-by-15-percent

    4. Spokestack. (n.d.). Wake Word. Retrieved from https://www.spokestack.io/features/wake-word

    5. Sensory Inc. (n.d.). Wake Word. Retrieved from https://www.sensory.com/wake-word/

    6. Scripka, D. (n.d.). openWakeWord. GitHub repository. Retrieved from https://github.com/dscripka/openWakeWord

    7. Wang, Z., Chen, Z., Tan, X., & He, W. (2020). Beamforming for Wake Word Detection Using Multi-Microphone Arrays. In Interspeech 2020. Retrieved from https://www.isca-archive.org/interspeech_2020/wang20ga_interspeech.pdf

    8. Picovoice. (n.d.). Choosing a Wake Word. Retrieved from https://picovoice.ai/docs/tips/choosing-a-wake-word/

    9. Apple Machine Learning Research. (2023). Heimdal: Wake Word Detection under Distribution Shifts. Retrieved from https://machinelearning.apple.com/research/heimdal

    10. Arxiv.org. (2024). Efficient Wake Word Detection for Edge Devices. Retrieved from https://arxiv.org/abs/2409.19432

    11. Espressif Systems. (n.d.). ESP Wake Words Customization. Retrieved from https://docs.espressif.com/projects/esp-sr/en/latest/esp32s3/wake_word_engine/ESP_Wake_Words_Customization.html

  • Mastering Unit Testing: Protecting Your Code from Hidden Risks with Red-Green-Refactor

    Mastering Unit Testing: Protecting Your Code from Hidden Risks with Red-Green-Refactor

    (Or, said another, non-inclusive way: “Dude! Do you even Unit Test?”)

    “We don’t have time to automate our Unit Tests and there isn’t that much value in it anyway.” — Many former Software Development Team Leads that have worked for me.

    Unit Testing and its automation are cornerstone practices in modern software development, playing a crucial role in ensuring code quality and reliability. Yet time and again I run into either the excuse above or a lack of understanding about what Unit Testing is compared to Integration Testing. Most of the time, what I find developers calling Unit Testing is actually Integration Testing or Boundary Testing. As software professionals, you should know the following to prevent creating corporate risk:

    • How Unit Testing is Risk Prevention
    • The Difference between Unit Testing vs. Integration or Boundary Testing
    • How to implement a Red-Green-Refactor automation pipeline

    Summary

    Unit testing is a foundational practice in modern software development, yet misconceptions about its purpose and implementation persist. Many developers mistakenly equate unit testing with integration or boundary testing, leading to potential corporate risks. This post delves into the critical role of unit testing in risk prevention, clarifies the differences between unit testing and integration or boundary testing, and provides a guide on implementing a Red-Green-Refactor automation pipeline. By solidifying your understanding and application of unit testing, you can significantly enhance code quality and reliability.


    Understanding Unit Testing: Your Shield Against Corporate Risk

    In the fast-paced world of software development, cutting corners often leads to costly consequences. Among the most common pitfalls is the misuse or misunderstanding of unit testing. Unit testing is not just a checkbox on a to-do list; it is a strategic tool for risk prevention that, when used correctly, safeguards your codebase from potential vulnerabilities and ensures a smooth, reliable software development lifecycle.

    Unit Testing: The Bedrock of Risk Prevention

    Unit testing serves as an early warning system, catching issues at the most granular level—individual units of code. A "unit" typically refers to the smallest testable part of an application, such as a function or a method. By isolating these units and testing them independently, you can identify and address bugs before they cascade into larger, more complex problems. This early detection is crucial in preventing defects from propagating through the system, where they can become exponentially more difficult and expensive to fix (Osherove, 2015).

    Moreover, unit testing contributes to a robust, self-documenting codebase. Well-written unit tests describe the intended behavior of individual components, serving as a form of living documentation. This not only aids in onboarding new team members but also provides a clear reference when revisiting code after a significant lapse of time (Martin, 2008).

    Distinguishing Unit Testing from Integration and Boundary Testing

    One of the most prevalent issues in software development today is the conflation of unit testing with integration or boundary testing. While all three types of testing are vital, they serve different purposes and operate at different levels of abstraction.

    Unit Testing

    Unit Testing focuses exclusively on individual units of code. The goal is to validate that each unit performs as expected in isolation. Unit tests should be fast, independent, and deterministic—meaning they should produce the same results every time they are run, regardless of the environment (Beck, 2003).

    For example, imagine we have a simple Calculator class with a method that adds two numbers together. A unit test would focus on verifying that this method behaves correctly under various input conditions:

    // calculator.ts
    export class Calculator {
      public add(a: number, b: number): number {
        return a + b;
      }
    }

    // calculator.test.ts
    import { Calculator } from './calculator';

    describe('Calculator Unit Tests', () => {
      let calculator: Calculator;

      beforeEach(() => {
        calculator = new Calculator();
      });

      it('should return the correct sum of two positive numbers', () => {
        const result = calculator.add(3, 5);
        expect(result).toBe(8);
      });

      it('should return the correct sum when one number is negative', () => {
        const result = calculator.add(3, -2);
        expect(result).toBe(1);
      });

      it('should return 0 when both numbers are 0', () => {
        const result = calculator.add(0, 0);
        expect(result).toBe(0);
      });
    });

    In this example, the Calculator class has an add method that takes two numbers and returns their sum. The unit tests focus on this specific method, verifying that it produces the correct results for different inputs. Each test case is independent of the others and checks a specific scenario: adding two positive numbers, adding a positive and a negative number, and adding two zeros. These tests are deterministic, meaning they will produce the same results every time they are run, regardless of the environment or external factors. This isolation and predictability are key characteristics of effective unit testing.

    Integration Testing

    Integration Testing, on the other hand, examines how different units of code interact with each other. The aim here is to catch issues that may arise when components are combined, such as interface mismatches, data flow errors, or unexpected side effects. Integration tests are generally slower and more complex than unit tests, as they involve multiple components and often require a more realistic environment (Meszaros, 2007).

    For example, imagine we have a UserService class responsible for fetching user data from a UserRepository, and a NotificationService that sends notifications to users. In an integration test, you would test the interaction between these two services to ensure they work together as expected:

    // userService.ts
    export class UserService {
      constructor(private userRepository: UserRepository) {}

      public getUserDetails(userId: string): User {
        return this.userRepository.findUserById(userId);
      }
    }

    // notificationService.ts
    export class NotificationService {
      constructor(private userService: UserService) {}

      public sendNotification(userId: string, message: string): boolean {
        const user = this.userService.getUserDetails(userId);
        if (user.email) {
          // Simulate sending an email
          console.log(`Email sent to ${user.email} with message: ${message}`);
          return true;
        }
        return false;
      }
    }

    // integration.test.ts
    import { UserService } from './userService';
    import { UserRepository } from './userRepository';
    import { NotificationService } from './notificationService';

    describe('NotificationService Integration Test', () => {
      it('should send a notification to a valid user', () => {
        const userRepository = new UserRepository();
        const userService = new UserService(userRepository);
        const notificationService = new NotificationService(userService);

        const result = notificationService.sendNotification('123', 'Welcome!');
        expect(result).toBe(true);
      });

      it('should not send a notification if the user email is missing', () => {
        const userRepository = new UserRepository();
        const userService = new UserService(userRepository);
        const notificationService = new NotificationService(userService);

        const result = notificationService.sendNotification('456', 'Welcome!');
        expect(result).toBe(false);
      });
    });

    In this example, the NotificationService depends on the UserService, which in turn depends on the UserRepository. An integration test checks the interaction between these classes, ensuring that the NotificationService can successfully send notifications based on user data retrieved by the UserService. The test catches any issues that might arise from these interactions, such as an incorrectly structured user object returned by the UserRepository or an error in the notification logic itself.

    Boundary Testing

    Boundary Testing (sometimes called edge case testing) targets the extremes of input and output values. This type of testing is critical for identifying vulnerabilities that might not be apparent during typical usage. While boundary tests can be applied at both the unit and integration levels, they are often more closely associated with system-level testing, where the interactions of multiple components under extreme conditions are scrutinized (Myers et al., 2012).

    For example, consider a RESTful web service in an N-Tier architecture that handles user registration through a browser interface. The service includes a UserController class that processes HTTP requests and a UserService class that validates and creates users in the database. Boundary testing would involve testing the extremes of input values, such as the maximum allowed length for a username or the smallest valid password length:

    // userService.ts
    export class UserService {
      public static readonly MIN_PASSWORD_LENGTH = 8;
      public static readonly MAX_USERNAME_LENGTH = 20;

      public createUser(username: string, password: string): boolean {
        if (
          username.length > UserService.MAX_USERNAME_LENGTH ||
          password.length < UserService.MIN_PASSWORD_LENGTH
        ) {
          throw new Error('Invalid input');
        }
        // Simulate user creation logic
        return true;
      }
    }

    // userController.ts
    import { Request, Response } from 'express';
    import { UserService } from './userService';

    export class UserController {
      private userService: UserService;

      constructor() {
        this.userService = new UserService();
      }

      public registerUser(req: Request, res: Response): void {
        try {
          const { username, password } = req.body;
          const result = this.userService.createUser(username, password);
          res.status(201).json({ success: result });
        } catch (error) {
          res.status(400).json({ error: error.message });
        }
      }
    }

    // boundary.test.ts
    import request from 'supertest';
    import express from 'express';
    import { UserController } from './userController';

    const app = express();
    app.use(express.json());
    const userController = new UserController();
    app.post('/register', (req, res) => userController.registerUser(req, res));

    describe('User Registration Boundary Tests', () => {
      it('should fail when username exceeds maximum length', async () => {
        const response = await request(app)
          .post('/register')
          .send({ username: 'a'.repeat(21), password: 'ValidPass123' });
        expect(response.status).toBe(400);
        expect(response.body.error).toBe('Invalid input');
      });

      it('should fail when password is below minimum length', async () => {
        const response = await request(app)
          .post('/register')
          .send({ username: 'ValidUsername', password: 'short' });
        expect(response.status).toBe(400);
        expect(response.body.error).toBe('Invalid input');
      });

      it('should succeed with valid username and password', async () => {
        const response = await request(app)
          .post('/register')
          .send({ username: 'ValidUsername', password: 'ValidPass123' });
        expect(response.status).toBe(201);
        expect(response.body.success).toBe(true);
      });
    });

    In this example, the UserService class contains business logic that enforces constraints on the username and password during user creation. The UserController processes HTTP requests and interacts with the UserService to register users.

    The boundary tests focus on testing the edge cases for these constraints. The tests check whether the application correctly handles cases where the username exceeds the maximum allowed length or where the password is shorter than the minimum required length. Additionally, the tests include a case where both inputs are valid to ensure that normal usage still works as expected. This approach ensures that the system behaves correctly under extreme input conditions, helping to identify potential vulnerabilities before they reach production.

    Implementing the Red-Green-Refactor Automation Pipeline

    One of the most effective methodologies for developing robust unit tests is the Red-Green-Refactor cycle, a core practice of Test-Driven Development (TDD). This approach not only ensures that your code meets its requirements but also encourages cleaner, more maintainable code (Beck, 2003).

    1. Red: Begin by writing a unit test for a specific functionality. At this stage, the test will fail (hence, "red") because the functionality has not yet been implemented. The failing test confirms that the test is correctly identifying the absence of the desired behavior.

    2. Green: Next, implement the minimum amount of code necessary to make the test pass. The goal here is not to write perfect code but to produce something that works (turning the test "green"). This phase is crucial for maintaining momentum and ensuring that the development process is driven by test requirements.

    3. Refactor: With a passing test in place, the final step is to refactor the code. This involves cleaning up the implementation, improving the design, and eliminating any redundancies. Since the test is already passing, you can refactor with confidence, knowing that any changes that break the functionality will be immediately caught (Osherove, 2015).

    Automating this pipeline is key to maximizing efficiency and consistency. Continuous Integration (CI) tools can be configured to run your unit tests every time code is pushed to the repository, ensuring that the Red-Green-Refactor cycle is adhered to rigorously. Over time, this disciplined approach leads to a more reliable, maintainable codebase with a significantly lower risk of defects (Fowler, 2018).
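    To make the three phases concrete, here is the cycle in miniature, with the Red and Green phases shown as comments using Jest-style syntax. The `isLeapYear` example is my own illustration, not taken from any of the cited authors.

    ```typescript
    // RED: write the test first. isLeapYear does not exist yet, so it fails:
    //   it('treats 2000 as a leap year', () => expect(isLeapYear(2000)).toBe(true));
    //   it('treats 1900 as a common year', () => expect(isLeapYear(1900)).toBe(false));

    // GREEN: the minimum code that passes the first test (naive, but working):
    //   function isLeapYear(year: number): boolean { return year % 4 === 0; }
    //   ...the 1900 test then fails, driving the next change.

    // REFACTOR: with passing tests as a safety net, improve the logic.
    function isLeapYear(year: number): boolean {
      return (year % 4 === 0 && year % 100 !== 0) || year % 400 === 0;
    }

    console.log(isLeapYear(2000), isLeapYear(1900));
    ```

    Each loop is deliberately tiny; the discipline comes from never writing production code without a failing test demanding it.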


    Conclusion

    Mastering unit testing is not just about improving your technical skills—it’s about protecting your projects from avoidable risks. By clearly understanding the distinctions between unit testing, integration testing, and boundary testing, and by implementing a Red-Green-Refactor pipeline, you can elevate the quality of your code and contribute to a more secure, reliable software development process.

    Unit testing is your first line of defense against the unpredictable challenges of software development. When properly executed, it empowers you to build software that is not only functional but also resilient, maintainable, and future-proof. So, take the time to refine your unit testing practices—it’s an investment that will pay dividends in reduced risk and higher-quality code.

    Sidenote: I just realized I wrote a blog post using three people that have influenced me tremendously over the years: Beck, Fowler, and "Uncle Bob" Martin. These are three people worth following if you aren’t already.


    References

    Beck, K. (2003). Test-driven development: By example. Addison-Wesley.

    Fowler, M. (2018). Refactoring: Improving the design of existing code (2nd ed.). Addison-Wesley.

    Martin, R. C. (2008). Clean code: A handbook of agile software craftsmanship. Prentice Hall.

    Meszaros, G. (2007). xUnit test patterns: Refactoring test code. Addison-Wesley.

    Myers, G. J., Sandler, C., & Badgett, T. (2012). The art of software testing (3rd ed.). John Wiley & Sons.

    Osherove, R. (2015). The art of unit testing: With examples in C# (2nd ed.). Manning Publications.

  • Ensuring Secure and Reliable Software: Why Executives Must Champion Unit Testing

    Ensuring Secure and Reliable Software: Why Executives Must Champion Unit Testing

    "Building a great product is a creative, chaotic process which you won’t get right every time, so you have to also be learning from success and failure." — Gibson Biddle, former Chief Product Officer at Netflix

    Executive Summary

    In today’s fast-paced digital landscape, ensuring the reliability and security of software is a critical priority for business professionals and corporate executives. Unit testing, a fundamental practice in software development, plays a pivotal role in this effort. This blog post explores the importance of unit testing in preventing cyber risks, distinguishes it from integration testing, provides insights on how to verify if unit testing is being properly implemented, and outlines strategies to support development leaders in automating these tests. By understanding and advocating for robust unit testing practices, executives can contribute to the development of secure, high-quality software products that protect their organizations from potential cyber threats.

    The Critical Role of Unit Testing in Software Development

    Unit testing involves testing individual components of code in isolation to ensure they function correctly. This practice is essential for maintaining code quality, enhancing security, and mitigating risks in software development. However, there is often confusion among developers and business leaders about what constitutes unit testing versus integration testing. It is crucial for executives to understand these differences to ensure that their development teams are implementing best practices.

    1. Why Unit Testing is Important for Cyber Risk Prevention

    Unit testing serves as a frontline defense against cyber risks by identifying vulnerabilities early in the development process. Testing code in isolation allows developers to ensure each unit behaves as expected, reducing the likelihood of exploitable bugs. Comprehensive and well-maintained unit tests act as a safety net, catching potential security issues before they reach production. This proactive approach not only protects the organization from cyber threats but also saves time and resources in the long run (Smith, 2020).

    2. The Difference Between Unit Testing and Integration Testing

    Understanding the distinction between unit testing and integration testing is vital for ensuring proper testing coverage:

    • Unit Testing focuses on testing individual components or functions in isolation, typically using mock objects to simulate dependencies. These tests are fast, can be run frequently, and help ensure each piece of code works correctly on its own.

    • Integration Testing examines how different components work together by testing multiple parts of the system interacting. These tests generally take longer to run and are performed less frequently, identifying issues that arise from the interaction between components (Jones & Brown, 2021).

    Many developers mistakenly label integration tests as unit tests, leading to gaps in testing coverage and a false sense of security. Business leaders must be aware of these differences to ensure their teams are truly conducting unit testing.

    3. How to Determine if Your Developers are Unit Testing or Not

    To verify whether your development team is genuinely practicing unit testing, consider the following:

    • Review examples of their unit tests to ensure they are isolated and do not depend on external systems or databases.
    • Check that tests run quickly, typically in milliseconds, and focus on single functions or methods rather than entire workflows.
    • Look for the use of mocking frameworks, which simulate dependencies, as a sign of proper unit testing practices (Doe, 2022).

    If tests are slow, require extensive setup, or test multiple components at once, they are likely integration tests rather than unit tests.
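    One quick litmus test is whether a dependency can be swapped for a hand-rolled fake without touching any real system. The TypeScript sketch below is illustrative (all names are my own); the "unit" version never opens a database connection, which is why it runs in milliseconds:

    ```typescript
    // A true unit test isolates the class under test behind an interface
    // and substitutes a canned fake for the real repository.

    interface UserRepository {
      findEmail(userId: string): string | null;
    }

    class Mailer {
      constructor(private repo: UserRepository) {}

      canNotify(userId: string): boolean {
        return this.repo.findEmail(userId) !== null;
      }
    }

    // Fake repository with canned answers: fast, isolated, deterministic.
    const fakeRepo: UserRepository = {
      findEmail: (id) => (id === '123' ? 'user@example.com' : null),
    };

    const mailer = new Mailer(fakeRepo);
    console.log(mailer.canNotify('123'), mailer.canNotify('456'));
    ```

    If a test cannot be rewritten in this style because the class reaches directly for a database or network, it is an integration test by construction, whatever the file is named.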

    4. How to Support Development Leaders to Ensure Unit Testing is Automated

    Supporting your development leaders in automating unit testing is crucial for maintaining high-quality software. Consider the following strategies:

    • Allocate resources specifically for writing and maintaining unit tests.
    • Invest in training to ensure all developers understand unit testing principles.
    • Implement continuous integration systems that automatically run unit tests with each code change.
    • Set code coverage targets and include them in the definition of "done" for features.
    • Encourage test-driven development (TDD) practices, where tests are written before code (Johnson, 2023).
    • Regularly review and discuss test results in team meetings to foster a culture of accountability.

    By prioritizing and supporting automated unit testing, you are investing in the long-term security and reliability of your software products.

    Conclusion

    For business leaders, understanding the nuances of unit testing is not merely a technical concern; it is a critical aspect of ensuring the security and quality of software products. By advocating for and supporting proper unit testing practices, executives can help their organizations mitigate cyber risks, enhance software reliability, and ultimately safeguard their company’s reputation and resources.

    References

    Doe, J. (2022). Effective Unit Testing: Best Practices for Developers. New York: Tech Press.

    Johnson, A. (2023). Test-Driven Development: Principles and Practices. Boston: DevBooks.

    Jones, R., & Brown, T. (2021). Software Testing in Agile Development. San Francisco: CodeWorks.

    Smith, L. (2020). Cybersecurity and Software Quality Assurance. Chicago: SecureTech.

  • Creating Modular Mocks for the Obsidian API in Jest

    Creating Modular Mocks for the Obsidian API in Jest

    Introduction

Testing is crucial in software development, especially for the several custom plugins I’m building for Obsidian. I rely on Jest, a testing framework with built-in mocking, to simulate Obsidian’s API. This keeps tests isolated and dependable. In this post, I’ll show you how to create a flexible and scalable mock framework for the Obsidian API using Jest, tailored for developing multiple custom plugins.

    Why Modular Mocks?

    As your plugin grows, so does the complexity of your testing environment. A modular approach to mocking allows you to:

    • Maintain Separation of Concerns: Each mock file focuses on a specific aspect of the API, making it easier to manage and update.
    • Improve Scalability: Adding new mocks or updating existing ones is straightforward, even as your project expands.
    • Ensure Consistency: A centralized index.js file aggregates all your mocks, providing a single source of truth for your tests.

    Setting Up Your Project Structure

    Start by organizing your project directory to accommodate the modular mocks:

    project-root/
    ├── __mocks__/
    │   ├── App.js
    │   ├── TFile.js
    │   ├── Modal.js
    │   ├── Plugin.js
    │   ├── Vault.js
    │   ├── Workspace.js
    │   ├── MarkdownView.js
    │   ├── Setting.js
    │   ├── StatusBarItem.js
    │   └── index.js
    ├── src/
    ├── tests/
    └── package.json

    Creating the Individual Mock Files

    Each file in the __mocks__ directory represents a component of the Obsidian API. Here’s a brief look at how to structure these files:

    • __mocks__/TFile.js: Represents files within Obsidian’s vault.

      class TFile {
        constructor(path) {
          this.path = path;
        }
      }

      module.exports = TFile;
    • __mocks__/Modal.js: Used for creating custom modals.

      class Modal {
        contentEl = {
          createEl: jest.fn(),
          empty: jest.fn(),
        };

        open() {}
        close() {}
      }

      module.exports = Modal;
    • __mocks__/Plugin.js: Represents the core plugin functionality.

      class Plugin {
        addSettingTab() {}
        registerEvent() {}
        addStatusBarItem() {}

        loadData() {
          return Promise.resolve(null);
        }

        saveData() {
          return Promise.resolve();
        }
      }

      module.exports = Plugin;
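The remaining files follow the same pattern. As an illustration, a Vault mock might look like this — the methods stubbed here are an assumption; stub whichever parts of the Vault API your plugins actually call:

```javascript
// __mocks__/Vault.js — stub only what the plugins under test use.
class Vault {
  constructor() {
    this.files = [];
  }

  getFiles() {
    return this.files;
  }

  read() {
    return Promise.resolve('');
  }

  modify() {
    return Promise.resolve();
  }
}

module.exports = Vault;
```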

    Centralizing Mocks with index.js

    To make your mocks easily accessible, create an index.js file in the __mocks__ directory that consolidates all individual mocks:

    const App = require('./App');
    const TFile = require('./TFile');
    const Modal = require('./Modal');
    const Plugin = require('./Plugin');
    const Vault = require('./Vault');
    const Workspace = require('./Workspace');
    const MarkdownView = require('./MarkdownView');
    const Setting = require('./Setting');
    const StatusBarItem = require('./StatusBarItem');

    module.exports = {
      App,
      TFile,
      Modal,
      Plugin,
      Vault,
      Workspace,
      MarkdownView,
      Setting,
      StatusBarItem,
    };
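If your plugin source imports from the 'obsidian' package directly, you can point Jest at these mocks via moduleNameMapper in jest.config.js, so test files don’t need special import paths (a sketch; adjust the path to your layout):

```javascript
// jest.config.js
module.exports = {
  moduleNameMapper: {
    // Resolve `require('obsidian')` / `import ... from 'obsidian'`
    // to the aggregated mocks.
    '^obsidian$': '<rootDir>/__mocks__/index.js',
  },
};
```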

    Using the Mocks in Your Tests

    With the mocks in place, you can import them directly in your test files. For example:

    import { Plugin, TFile, Modal } from '../__mocks__/index.js';
    import ParaTaxonomyClassifierPlugin from '../src/main';

    describe('ParaTaxonomyClassifierPlugin', () => {
      let plugin;

      beforeEach(() => {
        plugin = new ParaTaxonomyClassifierPlugin();
        plugin.settings = DEFAULT_SETTINGS; // import DEFAULT_SETTINGS from your plugin's settings module
      });

      it('opens a modal when a file is saved', () => {
        const openSpy = jest.spyOn(Modal.prototype, 'open');
        const mockFile = new TFile('path/to/file.md');

        plugin.handleFileSave(mockFile);

        expect(openSpy).toHaveBeenCalled();
      });
    });

    Conclusion

    By structuring your mocks in a modular way, you gain control over your testing environment, ensuring that your tests are both reliable and maintainable. This approach is not only scalable but also simplifies the process of managing complex dependencies, making your development process smoother and more efficient.

    Call to Action

    If you’re developing Obsidian plugins and haven’t tried modular mocks yet, give this setup a try! It will help keep your tests clean, organized, and easy to manage as your project grows. Happy coding!

  • Getting Started Unit Testing in Node.js with Jest

    Getting Started Unit Testing in Node.js with Jest

    Unit testing is a crucial step in ensuring the reliability and functionality of your Node.js applications. Jest, a widely-used testing framework, makes it easy to write and run tests with minimal setup. In this post, I’ll walk you through setting up Jest for your Node.js project and show you how to write effective unit tests.

    Key Features of Jest

    Jest comes with several powerful features that make it a popular choice for testing in the JavaScript ecosystem:

    • Minimal Configuration: Jest works out of the box with little to no configuration required.
    • Built-in Mocking: Jest includes built-in utilities for mocking functions, modules, and timers.
    • Assertion Library: Jest has an integrated assertion library, so you don’t need any external dependencies.
    • Fast Execution: Jest runs tests in parallel, which speeds up the testing process.
    • Watch Mode: Jest can watch your files and re-run tests whenever you make changes, providing instant feedback. [2]

    Step 1: Install Jest

    The first step is to install Jest as a development dependency in your Node.js project. Open your terminal and run the following command:

    npm install --save-dev jest

    This command adds Jest to your project, making it available for testing.

    Step 2: Organize Your Tests

    Next, create a dedicated directory to store your test files. It’s a good practice to keep your tests organized in a separate folder. You can do this with the following command:

    mkdir test

    Place your test files in this folder, and use .test.js or .spec.js extensions to easily identify them as test files.

    Step 3: Update package.json

    To make running your tests more convenient, add a test script to your package.json file. This allows you to execute your tests with a simple command:

    "scripts": {
      "test": "jest"
    }

    Now, whenever you run npm test, Jest will automatically run all the tests in your project.

    Step 4: Writing Tests with Jest

    Jest uses a straightforward structure for writing tests, which makes your code easy to read and maintain.[1] Here’s a basic example:

    describe('Math operations', () => {
      test('adds 1 + 2 to equal 3', () => {
        expect(1 + 2).toBe(3);
      });
    });

    Example Test Case

    Before moving to a more practical example, here’s what each part of that structure does:

    • describe() is used to group related tests.
    • test() defines an individual test case.
    • expect() is used to make assertions about the output of your code.[3]

    Suppose you have a simple calculator module, and you want to test its Add function. Here’s how you could write that test:

    const CalculationOperations = require('../src/calculator');

    describe('Calculation TestCases', () => {
      test('Add 2 numbers', () => {
        const sum = CalculationOperations.Add(1, 2);
        expect(sum).toBe(3);
      });
    });

    This test checks that the Add function correctly adds two numbers together, which is a fundamental operation in your calculator module.
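For that test to pass, src/calculator needs to export an Add function. A minimal version, inferred from the shape of the test above, might look like this:

```javascript
// src/calculator.js — minimal module shape implied by the test.
const CalculationOperations = {
  Add(a, b) {
    return a + b;
  },
};

module.exports = CalculationOperations;
```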

    Best Practices for Writing Tests

    To get the most out of Jest, consider these best practices:

    • Test Various Scenarios: Cover different scenarios and edge cases in your tests to ensure comprehensive coverage.
    • Descriptive Test Names: Use clear and descriptive names for your test cases to make it easy to understand what each test does.
    • Isolated Tests: Keep your tests isolated and independent from each other to avoid cross-test interference.
    • Mock External Dependencies: Mock external services or modules to focus on the functionality of the code under test.
    • Aim for High Test Coverage: Strive to cover as much of your code as possible with tests to catch bugs early.

    Conclusion

    Jest is a robust and user-friendly framework for unit testing Node.js applications. Its rich set of features, combined with minimal setup, makes it an excellent choice for developers looking to ensure the quality of their code. By following the steps and best practices outlined in this post, you’ll be well on your way to writing effective unit tests with Jest.

    Citations:

    [1] https://semaphoreci.com/blog/unit-tests-nodejs-jest
    [2] https://www.youtube.com/watch?v=GHVvrYD4VRE
    [3] https://dev.to/xcoder03/nodejs-unit-testing-with-jest-a-quick-guide-1p47


  • Rhasspy as service with Debian installation

    Rhasspy as service with Debian installation

    This post is a placeholder description for how to set up the standalone Rhasspy voice assistant as a service on Debian/Ubuntu until the official docs are updated. This was gleaned from various posts on the Rhasspy forums and the old Rhasspy GitHub, and then refined via trial and error.

    Steps:

    1. Create a service account to run the Rhasspy daemon as a service (see my post “Creating a Linux user who cannot get an interactive shell on Debian or Ubuntu”).
    2. Switch to the Rhasspy service account
    $ sudo su rhasspy
    $ cd ~
    3. Install Rhasspy based on your use case. For me, this was for running Rhasspy as a satellite on various Raspberry Pi 4s around the house, which use a central Rhasspy instance running as a server interconnected with Home Assistant.

    4. Make sure that Rhasspy runs

    $ rhasspy --profile en 2>&1 | cat

    In a browser, navigate to http://<device-ip>:12101. You should see the Rhasspy web interface.

    If everything looks good, stop Rhasspy with Ctrl+C.

    5. Switch back to your privileged account
    $ exit 
    6. Set up the service account config directory
    $ sudo mkdir /opt/rhasspy 
    7. Create the service definition file
    $ sudo nano /etc/systemd/system/rhasspy.service 

    then

    [Unit]
    Description=Rhasspy Service
    After=syslog.target network.target mosquitto.service

    [Service]
    Type=simple
    # for command, see https://github.com/rhasspy/rhasspy/issues/42#issuecomment-711472505
    ExecStart=/bin/bash -c 'rhasspy -p en --user-profiles /home/rhasspy/.profiles/rhasspy 2>&1 | cat'
    WorkingDirectory=/opt/rhasspy
    User=rhasspy
    Group=audio
    RestartSec=10
    Restart=on-failure
    StandardOutput=syslog
    StandardError=syslog
    SyslogIdentifier=rhasspy

    [Install]
    WantedBy=multi-user.target

    Save the file with Ctrl+O, then exit nano with Ctrl+X.

    8. Enable the service
    $ sudo systemctl enable rhasspy 
    9. Now start the service
    $ sudo systemctl daemon-reload
    $ sudo systemctl start rhasspy
    10. Test that Rhasspy works by navigating to the Rhasspy Admin website on the device

    In a browser, navigate to http://<device-ip>:12101. You should see the Rhasspy web interface.

    From there, you can start configuring the device as a satellite by following the instructions on the Rhasspy documentation site.

    As is the case with all techno-things, this approach will become outdated over time, so check the commands against the current version of Debian or Ubuntu. Drop me a line on Twitter if you find this useful or if it needs updating.

  • Creating a Linux user who cannot get an interactive shell on Debian or Ubuntu

    Every now and then you find yourself repeating the same Linux admin tasks, querying the community, and piecing together an answer. This is the case with creating a “service account” to run satellite instances of Rhasspy on a bunch of Debian-based IoT devices scattered around my house. The purpose is to avoid running a service (daemon) as yourself or as root: if someone were to use the webpage to somehow hack into the computer, they would be limited in what they could do, because they wouldn’t have immediate access to a Linux shell.

    3D Printed Rhasspy running on Raspberry Pi4 courtesy @tobetobe

    The commands below will create a user named ‘rhasspy’ with its own $HOME directory for placing the settings in “.profile/rhasspy” and for installing the service account’s instance of a Python environment. Because Rhasspy uses the computer’s audio resources indirectly, I added $username to the ‘audio’ group, and finally set the default shell of $username to ‘/sbin/nologin’.

    username=rhasspy
    password=REPLACEWITHYOUROWNPASSWORD
    sudo adduser --comment "" --disabled-password $username
    sudo chpasswd <<<"$username:$password"
    sudo usermod -a -G audio $username
    sudo usermod -s /sbin/nologin $username

    As is the case with all techno-things, this approach will become outdated over time, so check the commands against the current version of Debian or Ubuntu. Drop me a line on Twitter if you find this useful or if it needs updating.