📝 9 Feb 2025
If the Daily Test fails for Apache NuttX RTOS … Can we Auto-Rewind and discover the Breaking Commit? Let’s try this (pic above)
Every Day at 00:00 UTC: Ubuntu Cron shall trigger a Daily Build and Test of NuttX for QEMU RISC-V (knsh64 / 64-bit Kernel Build)
If The Test Fails: Our Machine will Backtrack The Commits, rebuilding and retesting each commit (on QEMU Emulator)
When it discovers the Breaking Commit: Our Machine shall post a Mastodon Alert, that includes the (suspicious) Pull Request
Bonus: The Machine will draft a Polite Note for our NuttX Colleague to investigate the Pull Request, please
Why are we doing this?
If NuttX Fails on QEMU RISC-V: High chance that NuttX will also fail on RISC-V SBCs like Ox64 BL808 and Oz64 SG2000.
Thus it’s important to Nip the Bud and Fix the Bug early, before it hurts our RISC-V Devs. (Be Kind, Rewind!)
We wrote a script that will Rewind the NuttX Build and discover the Breaking Commit…
## Set the GitLab Token, check that it's OK
## export GITLAB_TOKEN=...
. $HOME/gitlab-token.sh
glab auth status
## Set the GitLab User and Repo for posting GitLab Snippets
export GITLAB_USER=lupyuen
export GITLAB_REPO=nuttx-build-log
## Download the NuttX Rewind Script
git clone https://github.com/lupyuen/nuttx-build-farm
cd nuttx-build-farm
## Find the Breaking Commit for QEMU RISC-V (64-bit Kernel Build)
nuttx_hash= ## Optional: Begin with this NuttX Hash
apps_hash= ## Optional: Begin with this Apps Hash
./rewind-build.sh \
rv-virt:knsh64_test \
$nuttx_hash \
$apps_hash
Our Rewind Script runs 20 Iterations of Build + Test like so…
## Build and Test: Latest NuttX Commit
git reset --hard HEAD
tools/configure.sh rv-virt:knsh64
make -j
qemu-system-riscv64 -kernel nuttx
## Build and Test: Previous NuttX Commit
git reset --hard HEAD~1
tools/configure.sh rv-virt:knsh64
make -j
qemu-system-riscv64 -kernel nuttx
...
## Build and Test: 20th NuttX Commit
git reset --hard HEAD~19
tools/configure.sh rv-virt:knsh64
make -j
qemu-system-riscv64 -kernel nuttx
## Roughly One Hour for 20 Rewinds of Build + Test
(What about Git Bisect? We’ll come back to this)
Build and Test 20 times! Won’t it look mighty messy?
Ah that’s why we present neatly the 20 Outcomes (Build + Test) as the NuttX Build History, inside our NuttX Dashboard…
What’s inside our script? We dive in…
(Which Apps Hash to use? NuttX Build History can help)
How to find the Breaking Commit?
We zoom out and explain slowly, from Micro to Macro. This script will Build and Test NuttX for One Single Commit on QEMU: build-test-knsh64.sh
## Build and Test NuttX for QEMU RISC-V 64-bit (Kernel Build)
## Download NuttX and Apps
git clone https://github.com/apache/nuttx
git clone https://github.com/apache/nuttx-apps apps
## Switch to this NuttX Commit and Apps Commit
pushd nuttx ; git reset --hard $nuttx_hash ; popd
pushd apps ; git reset --hard $apps_hash ; popd
## Configure the NuttX Build
cd nuttx
tools/configure.sh rv-virt:knsh64
## Build the NuttX Kernel
make -j
## Build the NuttX Apps
make -j export
pushd ../apps
./tools/mkimport.sh -z -x ../nuttx/nuttx-export-*.tar.gz
make -j import
popd
## Boot NuttX on QEMU RISC-V 64-bit
## Run OSTest with our Expect Script
wget https://raw.githubusercontent.com/lupyuen/nuttx-riscv64/main/qemu-riscv-knsh64.exp
expect qemu-riscv-knsh64.exp
(Expect Script shall validate the QEMU Output)
The script above is called by build_nuttx below. Which will wrap the output in the Log Format that NuttX Dashboard expects: rewind-commit.sh
## Build and Test One Commit
function build_nuttx { ...
## NuttX Dashboard expects this Log Format
echo "===================================================================================="
echo "Configuration/Tool: rv-virt/knsh64_test,"
echo "$timestamp"
echo "------------------------------------------------------------------------------------"
## Build and Test Locally: QEMU RISC-V 64-bit Kernel Build
$script_dir/build-test-knsh64.sh \
$nuttx_commit \
$apps_commit
res=$?
## Omitted: Build and Test Other Targets
echo "===================================================================================="
}
Our Build-Test Log becomes this GitLab Snippet…
What’s the other stuff inside the GitLab Snippet above?
For Every Commit, we bundle Three Commits into a single Log File: This Commit, Previous Commit, Next Commit: rewind-commit.sh
## Build and Test This Commit
build_nuttx $nuttx_hash $apps_hash
## If Build / Test Fails...
if [[ "$res" != "0" ]]; then
echo "BUILD / TEST FAILED FOR THIS COMMIT: nuttx @ $nuttx_hash / nuttx-apps @ $apps_hash"
## Rebuild / Retest with the Previous Commit
build_nuttx $prev_hash $apps_hash
if [[ "$res" != "0" ]]; then
echo "BUILD / TEST FAILED FOR PREVIOUS COMMIT: nuttx @ $prev_hash / nuttx-apps @ $apps_hash"
fi
## Rebuild / Retest with the Next Commit
build_nuttx $next_hash $apps_hash
if [[ "$res" != "0" ]]; then
echo "BUILD / TEST FAILED FOR NEXT COMMIT: nuttx @ $next_hash / nuttx-apps @ $apps_hash"
fi
fi
## Why the Long Echoes? We'll ingest them later
Our Three-In-One Log becomes a little easier to read, less flipping back and forth. Let’s zoom out…
Who calls the script above: rewind-commit.sh?
The script above is called by run_job below: rewind-build.sh
## Build and Test This Commit
## And capture the Build-Test Output
function run_job { ...
script $log_file \
$script_option \
" \
$script_dir/rewind-commit.sh \
$target \
$nuttx_hash $apps_hash \
$timestamp \
$prev_hash $next_hash \
"
}
Which captures the Build-Test Output into a Log File.
What happens to the Log File? We upload and publish it as a GitLab Snippet: rewind-build.sh
## Build and Test One Commit for the NuttX Target
function build_commit { ...
## Build and Test This Commit
## And capture the Build-Test Output into a Log File
run_job \
$log $timestamp \
$apps_hash $nuttx_hash \
$prev_hash $next_hash
clean_log $log
find_messages $log
## Upload the Build-Test Log File
## As GitLab Snippet
upload_log \
$log unknown \
$nuttx_hash $apps_hash \
$timestamp
}
(upload_log creates the GitLab Snippet)
(GitHub Gists are supported too)
Remember we need to Build and Test 20 Commits? We call the script above 20 times: rewind-build.sh
## Build and Test the Latest 20 Commits
num_commits=20
num_success=0
count=1
for commit in $(
TZ=UTC0 \
git log \
-$(( $num_commits + 1 )) \
--date='format-local:%Y-%m-%dT%H:%M:%S' \
--format="%cd,%H"
); do
## Extract the Commit Timestamp and Commit Hash
## Commit looks like 2024-11-24T09:52:42,9f9cc7ecebd97c1a6b511a1863b1528295f68cd7
prev_timestamp=$(echo $commit | cut -d ',' -f 1) ## 2024-11-24T09:52:42
prev_hash=$(echo $commit | cut -d ',' -f 2) ## 9f9cc7ecebd97c1a6b511a1863b1528295f68cd7
## Build and Test the NuttX Hash + Apps Hash
## If It Fails: Build and Test the Previous NuttX Hash + Previous Apps Hash
build_commit \
$tmp_dir/$nuttx_hash.log \
$timestamp \
$apps_hash $nuttx_hash \
$prev_hash $next_hash
## Shift the Commits
## Omitted: Skip the First Commit (because we need a Previous Commit)
next_hash=$nuttx_hash
nuttx_hash=$prev_hash
timestamp=$prev_timestamp
((count++)) || true
## Stop when we have reached the
## Minimum Number of Successful Commits
if [[ "$num_success" == "$min_commits" ]]; then
break
fi
done
With the Test Logs: We have everything we need to identify the Breaking Commit. Just dig up the Earliest Commit that fails the NuttX Test…
$ nuttx_hash=657247bda89d60112d79bb9b8d223eca5f9641b5
$ qemu-system-riscv64 -semihosting -M virt,aclint=on -cpu rv64 -kernel nuttx -nographic
riscv_exception: EXCEPTION: Instruction page fault. MCAUSE: 000000000000000c, EPC: 000000018000001a, MTVAL: 000000018000001a
Error: Test Failed
Though there’s an easier way…
Why publish the Test Log as a GitLab Snippet?
That’s because we’ll Ingest the Test Log into our NuttX Dashboard. (So we can present the logs neatly as NuttX Build History)
This is how we Ingest a Test Log into our Prometheus Time-Series Database (that powers our NuttX Dashboard)
# TYPE build_score gauge
# HELP build_score 1.0 for successful build, 0.0 for failed build
build_score{
## These fields shall be rendered in Grafana (NuttX Dashboard and Build History)
timestamp="2025-01-11T10:54:36",
user="rewind",
board="rv-virt",
config="knsh64_test",
target="rv-virt:knsh64_test",
url="https://gitlab.com/lupyuen/nuttx-build-log/-/snippets/4800059#L85",
## Here's the NuttX Hash and Apps Hash for This Commit
nuttx_hash="657247bda89d60112d79bb9b8d223eca5f9641b5",
apps_hash="a6b9e718460a56722205c2a84a9b07b94ca664aa",
## Previous Commit is OK (Score=1)
nuttx_hash_prev="be40c01ddd6f43a527abeae31042ba7978aabb58",
apps_hash_prev="a6b9e718460a56722205c2a84a9b07b94ca664aa",
build_score_prev="1",
## Next Commit is Not OK (Score=0)
nuttx_hash_next="48846954d8506e1c95089a8654787fdc42cc098c",
apps_hash_next="a6b9e718460a56722205c2a84a9b07b94ca664aa",
build_score_next="0"
} 0 ## Means This Commit Failed (Score=0)
Hello Prometheus: We’re sending you this Test Log at the Specified URL…
NuttX Target is QEMU RISC-V (rv-virt:knsh64)
Previous Commit is OK (Previous Score = 1)
Next Commit is NOT OK (Next Score = 0)
This Commit is NOT OK (This Score = 0)
Which is transformed and transmitted by our Rust App, from GitLab Snippet to Prometheus: ingest-nuttx-builds/main.rs
// Post the Test Log to Prometheus Pushgateway:
// Compose the Pushgateway Metric containing the Test Log
let body = format!(r##"
build_score ... url="{url}" ... {build_score}
"##);
// Send the Metric to Pushgateway via HTTP POST
let client = reqwest::Client::new();
let pushgateway = format!("http://localhost:9091/metrics/job/{user}/instance/{target_rewind}");
let res = client
.post(pushgateway)
.body(body)
.send()
.await?;
(How we fetch GitLab Snippets)
(And Extract the Fields from Test Logs)
We promised to find the Earliest Commit that fails the NuttX Test…
$ nuttx_hash=657247bda89d60112d79bb9b8d223eca5f9641b5
$ qemu-system-riscv64 -semihosting -M virt,aclint=on -cpu rv64 -kernel nuttx -nographic
riscv_exception: EXCEPTION: Instruction page fault. MCAUSE: 000000000000000c, EPC: 000000018000001a, MTVAL: 000000018000001a
Error: Test Failed
Finally we’re ready…
Test Logs are now inside Prometheus Database. How will Prometheus tell us the Breaking Commit?
Recall that our Prometheus Database contains…
20 Test Logs and their Outcomes:
Commit is OK or Failed
Each Test Log contains Three Outcomes:
This Commit vs Previous Commit vs Next Commit
The Test Logs in Prometheus will look like this…
Test Log #1 | This Commit FAILED Previous Commit FAILED |
Test Log #2 | This Commit FAILED Previous Commit FAILED |
… | |
Test Log #6 | This Commit FAILED Previous Commit is OK |
Test Log #7 | This Commit is OK |
Test Log #8 | This Commit is OK |
Test Log #9 | This Commit is OK |
Ding ding: Test Log #6 will reveal the Breaking Commit!
Inside Prometheus: How to find Test Log #6?
We fetch the Breaking Commit with this Prometheus Query…
build_score{
target="rv-virt:knsh64_test",
build_score_prev="1"
} == 0
Dear Prometheus: Please find the Test Log that matches the following…
NuttX Target is QEMU RISC-V (rv-virt:knsh64)
Previous Commit is OK (Previous Score = 1)
This Commit is NOT OK (This Score = 0)
Prometheus returns the Breaking Commit that we seek…
Coded in our Rust App like so: nuttx-rewind-notify/main.rs
// Query Prometheus for the Breaking Commit
let query = format!(r##"
build_score{ ...
target="{TARGET}",
build_score_prev="1"
... } == 0
"##);
// Send query to Prometheus via HTTP Form Post
let params = [("query", query)];
let client = reqwest::Client::new();
let prometheus = format!("http://{prometheus_server}/api/v1/query");
let res = client
.post(prometheus)
.form(¶ms)
.send()
.await?;
// Process the Query Results (Breaking Commit)
let body = res.text().await?;
let data: Value = serde_json::from_str(&body).unwrap();
let builds = &data["data"]["result"];
One more thing to do with our Breaking Commit…
Great! Our Machine has auto-discovered the Breaking Commit. But Our Machine can’t fix it right?
Here comes the Human-Computer Interface: Our Machine (kinda) Escalates the Breaking Commit to a Human Expert for fixing, politely please…
Sorry @USERNAME: The above PR is failing for rv-virt:knsh64_test. Could you please take a look? Thanks! nuttx-build-log/snippets/4800059
$ git clone https://github.com/apache/nuttx
$ git clone https://github.com/apache/nuttx-apps apps
$ pushd nuttx
$ git reset --hard 657247bda89d60112d79bb9b8d223eca5f9641b5
HEAD is now at 657247bda8 libc/modlib: preprocess gnu-elf.ld
$ popd
NuttX Source: https://github.com/apache/nuttx/tree/657247bda89d60112d79bb9b8d223eca5f9641b5
NuttX Apps: https://github.com/apache/nuttx-apps/tree/a6b9e718460a56722205c2a84a9b07b94ca664aa
$ cd nuttx
$ tools/configure.sh rv-virt:knsh64
$ make -j
...
$ qemu-system-riscv64 -semihosting -M virt,aclint=on -cpu rv64 -kernel nuttx -nographic
riscv_exception: EXCEPTION: Instruction page fault. MCAUSE: 000000000000000c, EPC: 000000018000001a, MTVAL: 000000018000001a
riscv_exception: Segmentation fault in PID 2: /system/bin/init
This goes to our Mastodon Server for NuttX Continuous Integration. I’ll copy this and paste it as a PR Comment, after my vetting.
Our Machine writes this based on the Breaking Commit? From the Previous Section?
Exactly! We won’t explain the Dull Bits. They involve…
Extracting the Test Log
(Only the Important Parts)
Fetching the Breaking PR from GitHub
(Based on the Breaking Commit)
Composing the Mastodon Post
(And Posting to Mastodon)
Without any AI or LLM
(Because they ain’t cheap)
But Mastodon Posts are limited to 500 chars?
Bummer. That’s why we Create a GitLab Snippet for our Polite Note. And embed the Hyperlink in our Mastodon Post.
How to get the Breaking PR from GitHub?
We call the GitHub API…
## Fetch the Pull Request for this Commit
$ commit=be40c01ddd6f43a527abeae31042ba7978aabb58
$ curl -L \
-H "Accept: application/vnd.github+json" \
-H "X-GitHub-Api-Version: 2022-11-28" \
https://api.github.com/repos/apache/nuttx/commits/$commit/pulls
[{ html_url: "https://github.com/apache/nuttx/pull/15444",
title: "modlib: preprocess gnu-elf.ld for executable ELF",
user: { login: "GITHUB_USERID", ...
Which becomes this function in our Rust App.
(Searching for NuttX Commit in Prometheus)
We coded plenty of goodies over the Lunar New Year. How will they be triggered?
Via Ubuntu Cron. Every Day it shall trigger the Daily Test and Rewind (pic above)
## Add a Cron Job
$ crontab -e
## Then insert this...
## Test and Rewind: Every Day at 00:00 UTC
0 0 * * * /home/luppy/nuttx-build-farm/cron.sh 2>&1 | logger -t nuttx-rewind-build
## Or For Testing...
## Test and Rewind: Every Hour at 00:16, 01:16, 12:16, ...
16 * * * * /home/luppy/nuttx-build-farm/cron.sh 2>&1 | logger -t nuttx-rewind-build
## Exit and Monitor our Cron Job
$ tail -f /var/log/syslog
(cron.sh will start rewind-build.sh)
We’ll see the Test and Rewind in action…
(luppy) CMD (/home/luppy/nuttx-build-farm/cron.sh 2>&1 | logger -t nuttx-rewind-build)
+ ./rewind-build.sh rv-virt:knsh64_test HEAD HEAD 1 20
/tmp/rewind-build-rv-virt:knsh64_test/apps /tmp/rewind-build-rv-virt:knsh64_test
#1 of 20: Building nuttx @ 8995e5a66e14819e2bfda467d4f9fb8719fd9134 / nuttx_apps @ 43439a6b16a435bce7d9ac85f05c3a6013f91348
+ build_commit /tmp/rewind-build-rv-virt:knsh64_test/8995e5a66e14819e2bfda467d4f9fb8719fd9134.log 2025-02-03T08:21:26 43439a6b16a435bce7d9ac85f05c3a6013f91348 8995e5a66e14819e2bfda467d4f9fb8719fd9134 dc5251f9c8db878ac9706586eb85ad7e201286b6 8995e5a66e14819e2bfda467d4f9fb8719fd9134
+ run_job /tmp/rewind-build-rv-virt:knsh64_test/8995e5a66e14819e2bfda467d4f9fb8719fd9134.log 2025-02-03T08:21:26 43439a6b16a435bce7d9ac85f05c3a6013f91348 8995e5a66e14819e2bfda467d4f9fb8719fd9134 dc5251f9c8db878ac9706586eb85ad7e201286b6 8995e5a66e14819e2bfda467d4f9fb8719fd9134
+ script /tmp/rewind-build-rv-virt:knsh64_test/8995e5a66e14819e2bfda467d4f9fb8719fd9134.log -c ' /home/luppy/nuttx-build-farm/rewind-commit.sh rv-virt:knsh64_test 8995e5a66e14819e2bfda467d4f9fb8719fd9134 43439a6b16a435bce7d9ac85f05c3a6013f91348 2025-02-03T08:21:26 dc5251f9c8db878ac9706586eb85ad7e201286b6 8995e5a66e14819e2bfda467d4f9fb8719fd9134 '
+ /home/luppy/nuttx-build-farm/build-test-knsh64.sh 8995e5a66e14819e2bfda467d4f9fb8719fd9134 43439a6b16a435bce7d9ac85f05c3a6013f91348#015
...
+ glab snippet new --repo lupyuen/nuttx-build-log --visibility public --title '[unknown] CI Log for rv-virt:knsh64_test @ 2025-02-03T08:21:26 / nuttx @ 8995e5a66e14819e2bfda467d4f9fb8719fd9134 / nuttx-apps @ 43439a6b16a435bce7d9ac85f05c3a6013f91348' --filename ci-unknown.log
Creating snippet in https://gitlab.com/lupyuen/nuttx-build-log/-/snippets/4802191
Done!
And the Polite Note? That goes to our Mastodon Server?
Every 15 Minutes: Ubuntu Cron shall trigger the Mastodon Notification (pic above)
## Add a Cron Job
$ crontab -e
## Then insert this...
## Notify Mastodon: Every 15 minutes
*/15 * * * * /home/luppy/nuttx-rewind-notify/cron.sh 2>&1 | logger -t nuttx-rewind-notify
## Or For Testing...
## Notify Mastodon: Every Hour at 00:16, 01:16, 12:16, ...
16 * * * * /home/luppy/nuttx-rewind-notify/cron.sh 2>&1 | logger -t nuttx-rewind-notify
## Exit and Monitor our Cron Job
$ tail -f /var/log/syslog
(cron.sh will start our Rust App)
The Mastodon Notification appears like so…
(luppy) CMD (/home/luppy/nuttx-rewind-notify/cron.sh 2>&1 | logger -t nuttx-rewind-notify)
+ cargo run
build_score{
target="rv-virt:knsh64_test",
build_score_prev="1"
} == 0
rv-virt : KNSH64_TEST - Build Failed (rewind)
Breaking PR: https://github.com/apache/nuttx/pull/15444
Build History: https://nuttx-dashboard.org/d/fe2q876wubc3kc/nuttx-build-history?var-board=rv-virt&var-config=knsh64_test6
Sorry @USERNAME: The above PR is failing for rv-virt:knsh64_test. Could you please take a look? Thanks!
Wow this looks super complicated. Does it work?
Dunno, we’re still testing? Hopefully the New System will make my Daily Routine a little less painful…
Every Morning: I check the NuttX Daily Test
Oops Daily Test failed! I run a script to Rewind or Bisect the Daily Build
I write a Polite Note (depending on my mood)
And post it to the Breaking Pull Request
That’s why we’re Fast Tracking the complicated new system: Right now it runs Every Hour. (Instead of every day)
What if it’s a smashing success?
We might extend the Daily Rewind to a Real Board: Oz64 SG2000 RISC-V SBC.
Or maybe SG2000 Emulator and Ox64 Emulator, since they’re quicker and more consistent than Real Hardware. (Though less accurate)
Plus other QEMU Targets: qemu-armv8a:nsh, rv-virt:nsh / nsh64 / knsh
Suppose we wish to add Our Own Boards to the System?
Let’s assume we have Automated Board Testing. Then we could upload the NuttX Test Logs (in the prescribed format) to GitLab Snippets or GitHub Gists. They’ll appear in NuttX Dashboard and Build History.
(Rewinding the Build on Our Own Boards? Needs more work)
Why Rewind every commit? Isn’t Git Bisect quicker?
Ah remember that we’re fixing Runtime Bugs, not Compile Errors. Git Bisect won’t work if the Runtime Bug is Not Reliably Reproducible.
When we Rewind 20 Commits, we’ll know if the bug is Reliably Reproducible.
Why aren’t we using Docker?
Docker doesn’t run OSTest correctly on QEMU RISC-V 64-bit.
Any more Grand Plans?
Our new PR Test Bot will allow a PR Comment to trigger a Build + Test on QEMU. For example, this PR Comment…
@nuttxpr test rv-virt:knsh64
Will trigger our PR Test Bot to download the PR Code, and run Build + Test on QEMU RISC-V. Or on Real Hardware…
@nuttxpr test milkv_duos:nsh
Super helpful for Testing Pull Requests before Merging. But might have Security Implications. We’ll chat in the Next Article!
Special Thanks to Mr Gregory Nutt for your guidance and kindness. I’m also grateful to My Sponsors, for supporting my writing.
Got a question, comment or suggestion? Create an Issue or submit a Pull Request here…
Daily Test + Rewind is hosted on this hefty Ubuntu Xeon Workstation