Finding incorrect YAML headers

Question

I am trying to identify which files in my project have incorrect headers. The files all start like this:

--- header: . . . title: some header: . . . more headers: level: . . . ---

Here . . . simply stands for more headers. The headers contain no indentation. Using the following expression, I have been able to extract the YAML header from every file.

grep -Przo --include=\*.md "^---(.|\n)*?---" .

Now I want to list the incorrect YAML headers.

Every YAML header must have a title: some text
Every YAML header must have language: [a-z]{2}
It must contain either an external: .* or an author: .*.
The placement of title:, level:, external: and language: varies.

I tried to do something like

grep -L --include=\*.md -e "external: .*" -e "author: .* ."

However the problem with this is that it searches the entire file, not just the YAML header. So I guess solving the issues above boils down to how I can feed the YAML header result from my previous search into grep again. I tried

grep -Przo --include=\*.md "^---(.|\n)*?---" . | xargs -0 grep "title:";

However this gave me an error "No such file or directory", so I am a bit uncertain how to proceed.

Examples:

---
title: Rull-en-ball
level: 1
author: Transkribert og oversatt fra [Unity3D](http://unity3d.com)
translator: Bjørn Fjukstad
license: Oversatt fra [unity3d.com](https://unity3d.com/learn/tutorials/projects/roll-ball-tutorial)
language: nb
---

Correct YAML, has an author, language and title.

---
title: Mini Golf
level: 2
language: en
external: http://appinventor.mit.edu/explore/ai2/minigolf.html
---

Correct YAML, has a title, language, and external instead of author.

---
title: 'Stjerner og galakser'
level: 2
logo: ../../assets/img/ccuk_logo.png
license: '[Code Club World Limited Terms of Service](https://github.com/CodeClub/scratch-curriculum/blob/master/LICENSE.md)'
translator: 'Ole Andreas Ramsdal'
language: nb
---

Incorrect YAML header, missing both author and external.

Could you replace the . . . with actual data, including "correct" headers as well as "incorrect" headers, so that we know when a solution is working as intended?
– Jeff Schaller ♦, Commented Jul 20, 2018 at 13:35
Also, the yaml I've seen (for Ansible) has indentation; does yours?
– Jeff Schaller ♦, Commented Jul 20, 2018 at 13:36
@JeffSchaller, no indentation. I will update my question accordingly.
– Øistein Søvik, Commented Jul 20, 2018 at 13:40

Jeff Schaller · Accepted Answer · 2018-07-20 21:28:16Z

Here's one way to do it. I assume you have bash (to loop recursively through the files), sed, and awk. Instead of using bash, you could alternatively use find with -exec to search for the files.

The general flow is:

ask bash for the list of *.md files, recursively
pass each file to sed to extract the YAML header
pass that YAML header to awk for validation
if the header fails validation, print the filename

The script:

#!/bin/bash
shopt -s globstar
for file in **/*.md
do
  # use sed for the header
  sed -n '/^---$/,/^---$/p' "$file" | awk '
    BEGIN {
      good_title=0
      good_lang=0
      good_extaut=0
    }
    /^title: .*/ { good_title=1 }
    /^language: [a-z][a-z]$/ { good_lang=1 }
    /^author: .*/ { good_extaut=1 }
    /^external: .*/ { good_extaut=1 }
    END {
      if (good_title && good_lang && good_extaut)
        exit 0
      else
        exit 1
    }
    ' \
    || printf "Incorrect header found in %s\n" "$file"
done

You can easily adjust the regex matching patterns in the awk script to be stricter or looser, depending on your exact requirements (perhaps you want alphanumeric characters instead of "any" character, which is what the . in your example matches).

The sed statement extracts the YAML header by:

suppressing default-printing (-n)
asking for a range of lines (addresses) delimited by two matches of the pattern: beginning of line, ---, end of line; the second match must occur after the first.
that range of addresses is then printed
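To see the range address in action, here is a small sketch on a throwaway file. The extract_header name and the sample content are made up for illustration. It also shows one thing to watch for: a later --- line in the body, such as a Markdown horizontal rule, opens a new range.

```shell
#!/bin/sh
# extract_header is a hypothetical helper name, not part of the script above.
extract_header() {
    sed -n '/^---$/,/^---$/p' "$1"
}

# Throwaway sample file: a header, some body text, then a stray --- line.
sample=$(mktemp)
printf '%s\n' '---' 'title: Demo' 'language: en' '---' \
    'Body text' '---' 'after the rule' > "$sample"

extract_header "$sample"
# Prints the four header lines, then the stray --- and "after the rule":
# the --- in the body starts a second range, and since no closing ---
# follows, that range runs to the end of the file.

rm -f "$sample"
```

For validation this only matters if the text after such a line happens to contain something like author: at the start of a line.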

The awk script is a little over-built, but I wanted to spell it out for clarity. Each time awk is called, it sets three flag variables to zero or false. If we see lines that match our criteria, we set the corresponding flag to one/true. Once all the lines have been seen, we return success or failure based on the status of those flags -- they must all be true in order to "pass" validation.
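For comparison, a more compact sketch of the same checks might look like this. It relies on awk treating unset variables as zero, so the BEGIN block is dropped. The validate_header name is hypothetical, and the sample headers are trimmed-down versions of the ones in the question.

```shell
#!/bin/sh
# validate_header is a hypothetical helper name; it reads a header on stdin
# and exits 0 only if all three requirements are met.
validate_header() {
    awk '
        /^title: /               { title = 1 }
        /^language: [a-z][a-z]$/ { lang = 1 }
        /^(author|external): /   { extaut = 1 }
        END { exit !(title && lang && extaut) }
    '
}

# A header with title, language and external passes.
printf '%s\n' '---' 'title: Mini Golf' 'level: 2' 'language: en' \
    'external: http://appinventor.mit.edu/explore/ai2/minigolf.html' '---' |
    validate_header && echo "valid"

# A header with neither author nor external fails.
printf '%s\n' '---' 'title: Stjerner og galakser' 'language: nb' '---' |
    validate_header || echo "invalid"
```

The longer version in the answer does the same thing; which one to keep is mostly a matter of readability.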

With these appropriately-named sample files scattered into the current directory and a subdirectory:

$ tree .
.
├── bad1.md
├── good1.md
├── good2.md
└── subdir
    ├── bad1.md
    └── good1.md

1 directory, 5 files

... the script outputs:

Incorrect header found in bad1.md
Incorrect header found in subdir/bad1.md
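The find alternative mentioned at the start of this answer could be sketched roughly as follows. This is an untested-in-your-tree illustration, not a drop-in replacement: list_bad_headers is a hypothetical wrapper name, and the awk checks are written in a shortened form that should behave the same as the ones in the script above. The inner script uses double quotes so that it can sit inside the single-quoted sh -c argument, which is why the $ anchors are escaped.

```shell
#!/bin/sh
# list_bad_headers is a hypothetical wrapper name; its argument is the
# directory to search (defaults to the current directory).
list_bad_headers() {
    find "${1:-.}" -type f -name '*.md' -exec sh -c '
        for file do
            sed -n "/^---\$/,/^---\$/p" "$file" |
                awk "
                    /^title: /                { title = 1 }
                    /^language: [a-z][a-z]\$/ { lang = 1 }
                    /^(author|external): /    { extaut = 1 }
                    END { exit !(title && lang && extaut) }
                " || printf "Incorrect header found in %s\n" "$file"
        done
    ' sh {} +
}

list_bad_headers .
```

Unlike the globstar loop, this needs only a POSIX sh and find, and the reported paths carry the search directory as a prefix (for example ./bad1.md).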

Kusalananda · Accepted Answer · 2022-08-27 19:22:06Z

To extract the header of a file, we may use sed like so:

sed -e '1,/^---$/!d' -e '/^---$/d' filename

This removes everything from the file apart from the lines between line 1 and the next line that is exactly ---. The second expression additionally deletes all --- lines from the data so that you are left with only the YAML header.
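A quick way to convince yourself of this is to run the command on a throwaway file. The header_only name and the sample content below are made up for the demonstration.

```shell
#!/bin/sh
# header_only is a hypothetical helper name wrapping the sed command above.
header_only() {
    sed -e '1,/^---$/!d' -e '/^---$/d' "$1"
}

# Throwaway sample file: a header, body text, and a later --- line.
sample=$(mktemp)
printf '%s\n' '---' 'title: Mini Golf' 'language: en' '---' \
    'Body text' '---' 'More text' > "$sample"

header_only "$sample"
# title: Mini Golf
# language: en

rm -f "$sample"
```

Because the range starts at line number 1 rather than at a pattern, the later --- line in the body cannot start a second range, so only the header is kept.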

I will use the Python-based yq utility from Andrey Kislyuk. Since this is a handy wrapper around the versatile JSON parser jq, we may easily detect whether the values corresponding to the keys are null, non-null, or a particular string, etc.

In the jq syntax, we may test whether a key, keyname, exists in an object with has("keyname"). We may also test whether the value of a key matches a particular regular expression, RE, using .keyname | test("RE").

The tests that are mentioned in the question could be translated into the following jq expression:

has("title") and (.title | test(".")) and has("language") and (.language | test("[a-z]{2}")) and (has("external") or has("author"))

or, shorter but less expressive,

(.title? != null) and (.language? | test("[a-z]{2}")) and (has("external") or has("author"))

This ensures that each key exists and that the values for the keys that need to have non-null values are correct.
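One detail to be aware of: the pattern [a-z]{2} is unanchored, so it matches any value that contains two consecutive lowercase letters somewhere. If the language code must be exactly two letters, anchoring the pattern is one option. The sketch below uses jq directly (assuming it is installed) with made-up sample strings.

```shell
#!/bin/sh
# Assumes jq is installed. The sample strings are made up for illustration.
jq -n '"english" | test("[a-z]{2}")'     # true: two letters occur inside the value
jq -n '"english" | test("^[a-z]{2}$")'   # false: the whole value must be two letters
jq -n '"nb" | test("^[a-z]{2}$")'        # true
```

The same anchored pattern can be dropped into either of the expressions above in place of the unanchored one.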

Running this on the three example files, with our tests in the script file validate:

$ sed -e '1,/^---$/!d' -e '/^---$/d' file1.md | yq -f validate
true
$ sed -e '1,/^---$/!d' -e '/^---$/d' file2.md | yq -f validate
true
$ sed -e '1,/^---$/!d' -e '/^---$/d' file3.md | yq -f validate
false

We may generalize this to test all the .md files in the current directory or below using find like so:

find . -name '*.md' -type f -exec sh -c '
    for pathname do
        if ! sed -e "1,/^---\$/!d" -e "/^---\$/d" "$pathname" |
            yq -e -f validate >/dev/null
        then
            printf "Invalid YAML header: %s\n" "$pathname"
        fi
    done' sh {} +

or, with any shell that supports the ** globbing pattern (enabled with shopt -s globstar in bash):

for pathname in ./**/*.md
do
    if ! sed -e '1,/^---$/!d' -e '/^---$/d' "$pathname" |
        yq -e -f validate >/dev/null
    then
        printf 'Invalid YAML header: %s\n' "$pathname"
    fi
done

Here, we additionally throw away the output from yq and instead use the tool with its -e option. This makes the exit status of the utility reflect the value of the last evaluated expression, i.e., zero for true, and non-zero for false in this case. This makes it easy to use our sed+yq pipeline directly in an if statement.

Running this with our three test files, we get

Invalid YAML header: ./file3.md
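The effect of the -e option described above can be checked in isolation. The sketch below calls jq directly, on the assumption that it is installed and that the yq wrapper passes the option through as this answer describes.

```shell
#!/bin/sh
# Assumes jq is installed; -n makes jq evaluate the expression without input.
jq -e -n 'true'  >/dev/null; echo "true  -> exit status $?"
jq -e -n 'false' >/dev/null; echo "false -> exit status $?"
jq -e -n 'null'  >/dev/null; echo "null  -> exit status $?"
```

With -e, a last output of false or null gives exit status 1 and any other value gives 0, which is what lets the pipeline sit directly after if.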
