fix(grammargen): css parity and generation fast path #16
- Add run_grammargen_focus_targets.sh to run high-value grammargen lanes (javascript, typescript, tsx, c, cpp, c_sharp, cobol, fortran) with per-language Docker isolation
- Add test_race_serial.sh for host-safe race testing one package at a time
- Update AGENTS.md with stricter non-negotiables: no gts-suite profiling, no repo-wide host tests, Docker-only for heavy work, narrow to one language when chasing OOMs
- Update gate presets to use Docker runners with scoped -run filters instead of broad host-side sweeps
- Update README.md testing section to prefer single-language Docker parity commands
- Update cgo_harness/README.md to recommend single-language commands over unified gate runner for local OOM diagnosis
- Improve run_single_grammar_parity.sh to detect and report PARITY vs MISMATCH vs OOM vs FAIL states
- Add scripts/README.md guidance preferring Docker runners over host-side helpers for heavy correctness work
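The PARITY / MISMATCH / OOM / FAIL distinction mentioned for run_single_grammar_parity.sh could be sketched roughly as below. This is not the script's actual code: the function name `classify_parity_run`, the use of exit status 137 as the OOM signal, the order of the checks, and the `real-corpus[css]:` prefix in the sample are all assumptions made for illustration. Only the `no-error N/M, sexpr parity N/M, deep parity N/M` summary shape comes from the log format matched elsewhere in this PR.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: classify one parity run from its exit code and log.
# Assumes the log contains a summary line shaped like
#   real-corpus[css]: no-error 25/25, sexpr parity 25/25, deep parity 25/25
classify_parity_run() {
  local exit_code="$1"
  local log_path="$2"
  local summary

  # 137 = 128 + SIGKILL, the usual exit status of an OOM-killed container.
  if [ "$exit_code" -eq 137 ]; then
    echo OOM
    return
  fi

  # Use the last summary line in the log.
  summary="$(grep -E 'real-corpus\[' "$log_path" 2>/dev/null | tail -n 1 || true)"
  if [ -z "$summary" ]; then
    echo FAIL
    return
  fi

  # Extract every N/M pair and require N == M for all of them.
  printf '%s\n' "$summary" | awk '
    {
      pairs = 0; equal = 0
      rest = $0
      while (match(rest, /[0-9]+\/[0-9]+/)) {
        split(substr(rest, RSTART, RLENGTH), nm, "/")
        pairs++
        if (nm[1] == nm[2]) equal++
        rest = substr(rest, RSTART + RLENGTH)
      }
      if (pairs >= 3 && equal == pairs) print "PARITY"; else print "MISMATCH"
    }'
}
```

A caller would pass the container's exit status and the captured log, for example `classify_parity_run "$?" run.log`, and branch on the printed state.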
- Add raceEnabled constants via build tags to detect -race builds
- Skip Lox grammar tests under -race in grammargen (LALR builds, C generation, blob round-trip)
- Skip Scala real-world corpus parser tests under -race in root package
- Aligns with project workflow to avoid host-side race sweeps causing memory pressure
- Add DFA token source logic to split `>>` tokens into separate `>` tokens in TypeScript, TSX, and Dart
- Implement context-aware splitting when only single `>` has parse actions, action specificity favors `>`, or delimiter/identifier follows in type assertion patterns
- Add helper methods for action specificity comparison, symbol lookup by name, and lookahead/lookback for context detection
- Guard against splitting actual shift expressions by checking for reduce-only action sharing and type assertion style openers
- Add unit tests for single close angle action, right shift preservation, delimiter following, and identifier following cases
- Add integration tests for TypeScript shift expressions, type assertions over ternaries, and type assertions over call expressions
…dence-aware conflict
- Add comprehensive TypeScript parity tests for conditional types, type assertions over ternaries, generic calls, and namespace/enum constructs
- Fix external symbol suppression to only hide non-visible symbols, preserving aliased externals like TypeScript ternary '?'
- Preserve explicit zero precedence for trailing automatic semicolon cases in rule flattening
- Add conditional type context tagging to distinguish TypeScript conditional type RHS from ternary expressions
- Include close brace as boundary lookahead for JS/TS grammars with automatic semicolons
- Enhance reduce/reduce conflict resolution with precedence order awareness and type/value token ambiguity handling
- Add bitset clear method and lookahead contributor tracking for LALR diagnostics
…fe single-worker
- Add --cpus, --pids, --gomaxprocs, and --goflags options to all three Docker parity scripts
- Default run_grammargen_focus_targets.sh to single-worker profile (cpus=1, pids=512, GOMAXPROCS=1, GOFLAGS=-p=1) to prevent OOM during local runs
- Document the safe local lane pattern in README.md for high-value grammar testing
- Pass resource limits through to Docker containers and include in metadata output
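As a rough illustration of how the single-worker profile above might translate into container limits, the sketch below prints a `docker run` command rather than executing it. The function name `build_docker_cmd`, the image name `grammargen-parity:local`, and the mapping of the scripts' `--pids` option to Docker's `--pids-limit` are assumptions, not taken from the PR. The defaults mirror the profile described in the commit.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: print the docker command for a single-worker parity lane.
# Arguments (all optional): cpus, pids, GOMAXPROCS, GOFLAGS.
# Defaults mirror the documented profile: cpus=1, pids=512, GOMAXPROCS=1, GOFLAGS=-p=1.
build_docker_cmd() {
  local cpus="${1:-1}"
  local pids="${2:-512}"
  local gomaxprocs="${3:-1}"
  local goflags="${4:--p=1}"
  # The image name is an assumption made for illustration only.
  local image="grammargen-parity:local"

  printf 'docker run --rm --cpus %s --pids-limit %s -e GOMAXPROCS=%s -e GOFLAGS=%s %s\n' \
    "$cpus" "$pids" "$gomaxprocs" "$goflags" "$image"
}

# Dry run: show the command that would be executed with the default profile.
build_docker_cmd
```

Printing the command first makes it easy to record the effective limits alongside the run's metadata before anything heavy starts.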
- Add applyImportGrammarShapeHints to auto-enable BinaryRepeatMode for JavaScript, TypeScript, and TSX grammars during import
- Binary repeat helper shape preserves tree-sitter's lowering for heavy repeat constructs, avoiding state blowups and maintaining JSX/type-parameter ambiguity handling
- Apply hints consistently in both ImportGrammarJS and ImportGrammarJSON codepaths
- Add TypeScript and TSX corpus snippet parity tests covering generic call constructs and JSX ambiguity cases
- Parity tests validate generated vs reference parser output on minimal reproducer snippets
…S normalization
- Remove call_expression from interchangeable value types in parity tests to catch real JS/TS generic-call regressions like f<T>(x)
- Refactor TypeScript normalization to match node types by name instead of symbol ID for clarity
…arity
- Add call precedence normalization for unary/binary expressions wrapping calls
- Add unary precedence normalization to handle unary vs binary operator conflicts
- Add binary precedence normalization with operator precedence table (?? through **)
- Add TypeScript instantiation_expression normalization for generic call parity
- Add TypeScript as-expression normalization for assignments, ternaries, and type chains
- Extend normalization context with symbols for union/intersection/object types
- Add comprehensive test coverage for all precedence scenarios
…comments
- Handle JavaScript/TypeScript binary expressions where comments create extra nodes between operands
- Rewrite call target by locating operator and right operand via field names instead of fixed child indices
- Add commented logical-or chain test case to TypeScript corpus parity suite
…tual keyword and import
- Extend TypeScript/TSX test coverage with edge cases: import aliases, module identifiers, async arrow functions, and computed member access in control flow
- Add contextual identifier preference for 'get'/'set' keywords based on lookahead (dot, bracket, call chain) to match C parser GLR behavior
- Add identifier keyword alias normalization to strip synthetic children and field IDs when keyword nodes masquerade as identifiers
- Mark standalone 'import' tokens as not named to align with reference C parser output
- Add direct C regression deep parity tests comparing full tree structure between generated and reference parsers
…ontexts
- Add lookahead logic in tsxScanAutoSemicolon to distinguish JSX contexts from TypeScript type annotations after `}`
- Introduce tsxLooksLikeJSXAttributeContinuation helper to detect attribute continuations and avoid false ASI triggers
- Add test coverage for destructured_function_type_parameter parity cases in TypeScript and TSX suites
- Prevent premature semicolon insertion that caused parse errors on destructured function type parameters
- Add normalizeJavaScriptProgramStart to align program node start bounds with first non-trivia child
- Add normalizeJavaScriptTypeScriptOptionalChainLeaves to prune redundant token children from optional_chain nodes
- Integrate both normalizers into JavaScript, TypeScript, and TSX language pipelines
- Add unit tests covering program start adjustment and optional chain leaf pruning
…tion fixes
- Add focused C parity tests for concatenated strings, multiline preprocessor directives, and primitive type parameters
- Add tryRelexCurrentStateDFA to re-lex lookahead tokens after reduce chains change parser state
- Fix stale DFA lookahead handling when all live stacks converge on same reduced state
- Add normalizeCBuiltinPrimitiveTypeIdentifiers to convert known built-in types to primitive_type nodes
- Add normalizeCVariadicParameterEllipsis to populate variadic_parameter children
- Fix preprocessor directive range consumption logic to check endByte bound
- Remove looksLikeCTypedefName heuristic that over-classified type identifiers
- Replace typedef heuristic with precise isCBuiltinPrimitiveTypeName list
- Detect COBOL fixed format sequence number area (cols 1-6) and use column 7 as adjusted start point
- Change hardcoded single-child lookups to iterate and find target nodes by type
- Add normalizeCobolPeriodChildren to ensure period nodes contain a dot child
- Expand GLR suffix exclusions to include _section and _paragraph for raw span table
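The sequence-number adjustment in the first item can be pictured with a small sketch. It is only an illustration of the idea, not the detection logic in the PR: the function name `cobol_start_column` and the heuristic (six leading digits mean a sequence number is present) are assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: pick the 1-based start column for a COBOL source line.
# Assumption: a line whose first six characters are all digits carries a
# fixed-format sequence number in columns 1-6, so content starts at column 7.
cobol_start_column() {
  case "$1" in
    [0-9][0-9][0-9][0-9][0-9][0-9]*) echo 7 ;;
    *) echo 1 ;;
  esac
}
```

For example, `cobol_start_column '000100 IDENTIFICATION DIVISION.'` prints `7`, while a line with no leading digits prints `1`.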
…d helper
- Increase scala grammar generation timeout from 45s to 180s in cgo harness tests
- Add TypeScript existential_type AST normalization to restore collapsed nodes for C parity
- Extract generic normalizeCollapsedNamedLeafChildren from cobol-specific implementation
- Enable reuse of collapse-restoration logic across multiple language normalizers
…as promotion
- Resolve auxiliary symbols (repeat helpers, alias wrappers) to parent symbols in allInDeclaredConflict for proper conflict group matching
- Fix promoteDefaultAliases to only mark extras as unaliased when they have no recorded aliases, enabling alias promotion for wrapper nonterminals like COBOL's _LINE_COMMENT_ALIAS
- Extend c_sharp grammar generation timeout from 90s to 300s to accommodate slower generation paths
- Relax dash fence detection in corpus parser to accept any run of 3+ dashes (not just exactly '---')
- Add C# predefined type normalization for parity testing
- Cover common types: void, int, string, bool, object, float, double, char, byte, long, short, decimal
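The relaxed fence rule in the first item amounts to a simple line check. The sketch below shows one way to express it. The function name `is_dash_fence` is hypothetical, and it assumes a fence line contains nothing but dashes (no trailing whitespace), which the commit message does not specify.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the relaxed fence rule: a separator line is any run of
# three or more dashes, not only the exact string '---'.
# Assumption: the line holds nothing but dashes.
is_dash_fence() {
  case "$1" in
    ---*) ;;             # must start with at least three dashes
    *) return 1 ;;
  esac
  case "$1" in
    *[!-]*) return 1 ;;  # reject any character that is not a dash
  esac
  return 0
}
```

With this check, `is_dash_fence '----------'` succeeds just like `is_dash_fence '---'`, while `--` and `--- x` are rejected.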
…ifiers
- Add normalization for C preprocessor newline spans to handle consecutive newlines and whitespace in preproc regions
- Add normalization for bare C type_identifier expression statements to wrap them in expression_statement nodes
- Both passes integrate into normalizeKnownSpanAttribution for C/C++ parity improvements
- Remove C# predefined type normalization from attribution pipeline
- These types no longer need special handling for parity correctness
…pace lex modes
- Add CSS function value parity test for rgba/hsla parsing
- Support optional diagnostics in generate report for faster language builds
- Add after-whitespace lex mode selection in DFA token source
- Add fallback parity seed path resolution for offline testing
- Add per-stage generation diagnostic tests with memory tracking
- Fix immediate token detection to check preceding whitespace
…rity-guardrails-20260319
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly refines the grammar generation and parsing capabilities across several languages, with a particular focus on C, CSS, and TypeScript/TSX. It introduces performance optimizations by allowing diagnostic steps to be skipped when not needed and enhances parsing accuracy through improved lexer behavior and extensive parse tree normalizations. The changes also include updates to the testing infrastructure to promote more efficient and isolated debugging practices.
Code Review
This pull request significantly enhances the grammar generation and parsing capabilities across multiple languages, with a strong focus on C, JavaScript, and TypeScript. Key improvements include updating testing infrastructure to promote Docker-isolated, focused tests, especially for OOM diagnosis, and introducing new scripts and options for resource control. The grammar generation process itself has been optimized for performance by making diagnostic generation conditional and automating language-specific shape hints. Major changes were made to LR table generation and conflict resolution, incorporating new context tagging for TypeScript conditional types, refined handling of assignment and repetition shifts, and more nuanced precedence resolution. The parser core benefits from improved immediate token scanning, complex logic for splitting >> tokens in TypeScript/TSX to resolve ambiguities, and enhanced error recovery through re-lexing. Extensive post-parsing normalization functions have been added for C, COBOL, JavaScript, and TypeScript to align generated parse trees with reference C Tree-sitter behavior, addressing various parsing ambiguities and structural differences. A new suite of parity and diagnostic tests has been added to cover these changes and ensure correctness.
```bash
real_corpus_status_from_log() {
  local log_path="$1"
  local summary
  summary="$(grep -E 'real-corpus\[' "$log_path" 2>/dev/null | tail -1 || true)"
  if [[ -z "$summary" ]]; then
    echo fail
    return
  fi
  if [[ "$summary" =~ no-error[[:space:]]+([0-9]+)/([0-9]+),[[:space:]]+sexpr[[:space:]]+parity[[:space:]]+([0-9]+)/([0-9]+),[[:space:]]+deep[[:space:]]+parity[[:space:]]+([0-9]+)/([0-9]+) ]]; then
    local no_error="${BASH_REMATCH[1]}"
    local eligible_a="${BASH_REMATCH[2]}"
    local sexpr="${BASH_REMATCH[3]}"
    local eligible_b="${BASH_REMATCH[4]}"
    local deep="${BASH_REMATCH[5]}"
    local eligible_c="${BASH_REMATCH[6]}"
    if [[ "$no_error" == "$eligible_a" && "$sexpr" == "$eligible_b" && "$deep" == "$eligible_c" ]]; then
      echo ok
    else
      echo fail
    fi
    return
  fi
  echo fail
}
```
This function can be simplified by using `grep -E` with back-references to check for parity, similar to the logic in run_single_grammar_parity.sh. This would make the code more concise and less prone to errors from manual `BASH_REMATCH` indexing.
```bash
real_corpus_status_from_log() {
  if grep -Eq 'real-corpus\[.*no-error[[:space:]]+([0-9]+)/\1,[[:space:]]+sexpr[[:space:]]+parity[[:space:]]+([0-9]+)/\2,[[:space:]]+deep[[:space:]]+parity[[:space:]]+([0-9]+)/\3' "$1" 2>/dev/null; then
    echo ok
  else
    echo fail
  fi
}
```
- Add normalization for Python pass_statement collapsed named leaf children
- Improves Go/C parser parity for Python grammar correctness gate
- Relax single-token wrapper requirement in collapsible unary self-reduction to permit collapsing when child and parent symbols share the same name
- Add sameSymbolName helper to compare symbols via SymbolMetadata or SymbolNames
- Update DFA guard test expectations for state counts reflecting changed reduction behavior
Summary
- `rgba(...)` and `hsla(...)` stop misparsing as units after numeric values

Local Verification
- `go run ./cmd/gen_linguist -manifest grammars/languages.manifest -languages-yml grammars/languages.yml -out grammars/linguist_gen.go && git diff --exit-code -- grammars/linguist_gen.go`
- `.github/workflows/ci.yml`
- `go build ./...`
- `go vet ./...`
- `go test . -run '^TestNextDFATokenUsesAfterWhitespaceLexState$' -count=1 -v`
- `go test ./grammargen -run '^(TestGenerateWithReportCtxSkipsDiagnosticsWhenNotRequested|TestConflictDiagnostics|TestCombinators|TestCSSFunctionValueParity|TestImportRealCSSGrammarJS)$' -count=1 -v`
- `bash cgo_harness/docker/run_single_grammar_parity.sh --cpus 1 --gomaxprocs 1 --goflags -p=1 --no-build css`
- `bash cgo_harness/docker/run_grammargen_c_parity.sh --cpus 1 --gomaxprocs 1 --goflags -p=1 --langs css --max-cases 10 --max-bytes 262144 --no-build --label css-release-preflight-post-main`

Notes
- `25/25` no-error, `25/25` sexpr parity, `25/25` deep parity.
- `8/10` tree parity on two small structural divergences, but that lane is green and CSS is not currently ratcheted in `grammargen_cgo_parity_floors.json`.