Another timing bug squashed

I spent several hours trying to figure out why the circuit on my second M0 microcontroller was having significant timing issues when the first M0 wasn’t. The second M0 didn’t exhibit the problem in my NTP test rig either. I finally figured it out. As you can see below, when I built the second M0 I didn’t have a second DHT11 temperature and humidity sensor (the blue box–I forgot to order a second one), but I kept the code the same. If I use the preprocessor flag I put in my code to disable all the DHT-related calls, the timing issues go away.
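For reference, the guard looks roughly like this; the flag name, data pin, and function are illustrative rather than my exact code:

    // Compile-time guard around the DHT11 calls. ENABLE_DHT, pin 5, and
    // readEnvironment() are illustrative names, not my actual code.
    #define ENABLE_DHT 0            // set to 1 only on boards that actually have the sensor

    #if ENABLE_DHT
    #include <DHT.h>
    DHT dht(5, DHT11);              // Adafruit DHT library; dht.begin() would go in setup()
    #endif

    void readEnvironment() {
    #if ENABLE_DHT
      float tempC = dht.readTemperature();   // these calls block while bit-banging the sensor
      float rh    = dht.readHumidity();
      // ... report tempC / rh alongside the audio data ...
    #else
      // No sensor fitted: skip the reads entirely so nothing can stall the clock.
    #endif
    }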

Two microphones listening for booms

Here’s my theory as to what’s happening. The DHT11 product page on Adafruit’s site says that reading the chip requires careful timing. To get the timing right, I suspect that it disables interrupts while reading the chip. If the chip is there then this only takes a few microseconds, and nobody is the wiser. However, if the chip isn’t there, it has to wait for some timeout to happen, and that means interrupts are disabled for a while, and that throws off the millis() and micros() timing functions, which won’t increment while interrupts are disabled. Since my NTP library uses micros() calls frequently to calculate the time since the last server sync, it was accumulating significant errors which were causing the readings to go all over the place.

By the time I’d figured this out, I’d already ripped out a lot of the hairy math I had implemented for calculating gradual skews and correction factors, and replaced it with slightly less hairy math. It’s far more straightforward and easy to understand, and the only downside is that a lot of it is now floating point math, which is slower on some microprocessors.

I’m still seeing about a 1-2 millisecond skew between the readings on the two microphones, and I’m going to have to find a way to adjust for that.

Okay results with 2 microphones

I fixed the previously mentioned time jump problem, at least well enough.

I also now have 2 (roughly) identical microphones set up, and while the initial results were frightening, they settled in and started behaving reasonably.

My first attempt was right after I had plugged them both in, and the result was this:

Initial skew from M0 microphones
Nowhere close

There was about a 90-100 millisecond skew in their readings, which, since they were right next to each other, seemed very odd. At roughly 343 m/s, that’s the kind of delta I would expect if they were over 30 meters (about 100 ft) apart. I thought that maybe my time sync wasn’t nearly as stable as I thought it was, but I decided to give them some time to settle in and converge, and that helped. After 10 seconds of being on, I got this result:

Microphone skew after settling in
almost synced

This was much more encouraging. The waveforms are almost synced. If I zoom in on the peaks, I can see a better estimate of their delta:

Zoomed in on skew after settling
Looking much better zoomed in.

The actual time difference was about 400-500 microseconds, and at least part of that (maybe 100-200 microseconds) might be due to them being 6 inches apart. This is much closer to the kind of synchronization I’m going to need for this project.

Update: It turns out the celebration was premature. I still have serious NTP sync issues that are causing the two microphones to fall out of sync. I’ve ripped out my previous hairy math and am trying to fix it now, but I’m still having problems with the corrections oscillating wildly.

Timing problems

I definitely still have timing problems. I happened to catch a squeak from my office chair (hurrah for squeaky chairs) which spanned two captured buffers. Here’s what it looks like:

Overlapping waveforms
Overlapping times

As you can clearly see, there’s a definite backwards time jump. With a little more gnuplot-fu, I managed to get a clearer picture that shows a ~6.5 ms backwards jump.

it’s not supposed to do that

There are several sources of error that I need to track down for this.

  1. Assuming the timer interrupt is perfect
  2. Assuming that time never jumps backward
  3. Assuming that every reading takes exactly the same amount of time

My first poor assumption is that the timer interrupt fires exactly every 100 microseconds (as it would for a perfect 10kHz signal). However, if it were actually (say) every 110 microseconds (approximately a 9kHz signal), then over 512 readings (the size of the buffer), it would have a cumulative error of 5120 microseconds, or ~5 milliseconds. A negative 6.5 ms error would indicate an actual sample period of about 87 microseconds and a clock frequency of about 11.5 kHz. I can correct for this on average by calculating the time it takes to read a whole buffer and dividing by the size of the buffer. I’ll do that next.
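Something like this sketch is what I have in mind; BUFFER_SIZE and captureBuffer are made-up names, and in the real code the timer interrupt does the reads:

    // Planned correction: timestamp the whole buffer and derive the average
    // per-sample period instead of assuming exactly 100 us per reading.
    const size_t BUFFER_SIZE = 512;
    uint16_t samples[BUFFER_SIZE];

    float captureBuffer() {
      uint32_t start = micros();
      for (size_t i = 0; i < BUFFER_SIZE; i++) {
        samples[i] = analogRead(A2);          // in the real code the timer ISR does this
      }
      uint32_t elapsed = micros() - start;

      // Average period per reading. A -6.5 ms jump over 512 readings works out
      // to roughly 87 us per sample instead of the nominal 100 us.
      float periodMicros = (float)elapsed / BUFFER_SIZE;

      // Timestamp of sample i = (buffer start time) + i * periodMicros.
      return periodMicros;
    }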

My second poor assumption is that my timer never jumps backwards. As I’ve documented, it does, though I’ve only observed jumps of a few tens of microseconds. That doesn’t mean it doesn’t jump more at other times, however, and I need to find a way to model that.

My third poor assumption is that every reading takes the same time, whether that’s 100, 110, or 90 microseconds. This is the hardest to measure, because my clock just isn’t accurate enough to resolve a 10 microsecond skew on a single reading. I suspect that I’m just going to have to punt on this one and assume the buffer-wide average is good enough. It’s probably good enough to account for the drift in clock frequency with temperature.

Further timekeeping improvements

One of the problems I’d been seeing was that every time I synced with the NTP server, the clock would jump forward or back anywhere from 15 milliseconds to 100 milliseconds or more. I even had some cases where readings were appearing out of order because they happened right after one of these jumps.

With a bit more hairy math, I’ve got it spreading out the corrections over a full second, and trying to adaptively adjust its mapping of the internal clock to the real world by continually skewing its reading of the micros() call slightly. It’s still jumping slightly (and backwards, which shouldn’t be possible now, so I likely have some subtle bug), but the jumps are fairly consistently -70 to -85 microseconds, which is shorter than the duration of a single reading, so I doubt it will cause much trouble.
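The core of the “spread the correction out” idea looks something like this sketch; the names are illustrative, not the actual code in my NTPClient fork:

    // Instead of stepping the clock by the full NTP offset, queue it up and
    // hand it out gradually so it gets smeared over roughly one second.
    int32_t  pendingCorrection = 0;  // microseconds of NTP offset still to apply
    uint32_t lastAdjust = 0;

    void noteNtpOffset(int32_t offsetMicros) {
      pendingCorrection += offsetMicros;     // don't step the clock; queue the correction
    }

    int32_t adjustmentSinceLastCall() {
      uint32_t now = micros();
      uint32_t dt  = now - lastAdjust;
      lastAdjust   = now;
      // Apply the pending correction in proportion to elapsed time (out of 1 s).
      int32_t step = (int32_t)(((int64_t)pendingCorrection * (int64_t)dt) / 1000000);
      pendingCorrection -= step;
      return step;   // add this to the local notion of "real" time
    }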

That’s not to say that everything is perfect. The offsets I’m measuring, while smaller, are still large (+/-1000-2000 microseconds), but at least they’re not causing the reading to jump around as much. This might be due to clock instability on the server side, or on the microcontroller side. If it’s on the microcontroller side then my hopes of making the individual mic controllers stay in sync might be in trouble. I’ll have to wait ’til I have more than one fully functional to check out this theory.

It’s got rhythm…

After a lot of debugging (and swearing) and very hairy math, I have my version of the NTP libraries mostly keeping time consistently. It does the full offset and delay calculations. There are still a few quirks that I need to track down, but the skew is far more predictable now–though still larger than I would like. In 10 seconds, the clock drifts between -4000 and -14000 microseconds (about 1000 ppm, which is pretty high). With a little adaptive frequency shifting I think I can get that drift down quite a bit.

I have started thinking about submitting a pull request to merge my NTPClient changes back into the main library, but I committed a cardinal sin. I modified an older version of the code rather than starting with the latest. My attempts to reconcile their changes (mostly making all the getters idempotent) and mine (changing the fundamental time unit from seconds to microseconds while not breaking the API) have proven very clunky, and I’ve given up for the moment. I’m thinking of heading in the opposite direction and just forking my effort into a whole new refactored library with a more elegant way to handle complex time calculations, similar to the DateTime class in the RTClib, but more flexible.

What time is it, really?

A big part of the success of this project will hinge on getting all the microphones to agree on what time it is. I don’t really care if that time is particularly accurate, as long as all the sensors agree on it. Of course, they’ll never agree exactly, but if I can get them in agreement to within 50-100 microseconds, that should be good enough.

I’ve started experimenting with the Network Time Protocol (NTP). In theory, this should be satisfactory for what I’m doing. However, the implementation I’ve found for Arduino-based systems is pretty basic, and doesn’t implement the full protocol. First, it only exposes the whole number of seconds, and doesn’t expose the fractional part of the second. I’ve modified it to expose down to microseconds. Second, it simply takes the server’s transmit timestamp (t2 in the clock synchronization algorithm). It doesn’t do the full offset and round trip delay calculation. This is good enough if all you care about is rough accuracy (+/- 1 second), but apparently isn’t sufficient for what I need. But even if I extend the library further I may not be able to achieve the accuracy I need.
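For the record, the full calculation uses four timestamps and is pretty simple; here it is sketched in microseconds (my variable names, not the library’s):

    // Standard NTP offset / round-trip-delay calculation:
    //   t0 = client transmit, t1 = server receive,
    //   t2 = server transmit, t3 = client receive.
    #include <stdint.h>

    int64_t ntpOffset(int64_t t0, int64_t t1, int64_t t2, int64_t t3) {
      return ((t1 - t0) + (t2 - t3)) / 2;    // how far the local clock is off
    }

    int64_t ntpDelay(int64_t t0, int64_t t1, int64_t t2, int64_t t3) {
      return (t3 - t0) - (t2 - t1);          // time actually spent on the network
    }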

I have my devices syncing to the NTP server on my home box every 10 seconds or so, and I’ve had them log the amount of skew (change in current time) they are seeing every refresh. Over long stretches, it averages about 5,000 microseconds fast per 10 seconds of clock time–which is pretty good and easily compensated for, but individual skews are all over the place. It jumps back and forth by 5,000-120,000 microseconds (5-120 milliseconds) at a time. Keep in mind that with a 20 meter baseline, the delta time is at most about 58ms (20 m at roughly 343 m/s), and usually much less.

What I’m seeing in the logs

This might partly be due to the differing transmission delays that the algorithm currently ignores (that’s one reason I’m syncing to a local server: to minimize the transmission delays), but 120ms is a pretty big transmission delay for a local network, and is closer to what I would expect for a US-to-Europe round trip.

So what are the sources of error?

  1. The shortcomings in the current library.
  2. The stability of the clock on the microcontroller.
  3. The stability of the clock on the server.
  4. The inevitable jitter caused by the vagaries of WiFi.

Of those, I can only address the first three sources of error. Time will tell if that’s good enough.

First look at the data

I did some simple plotting of measurements taken by a single microphone. I clapped several times a few feet away from it. The first two claps were quite loud:

Clap 1
Clap 2

Full scale on the ADC is 12 bits, or 0-4095 units. Each unit represents about 3.3V/4095 = 0.8 millivolts. You can see that when idle it hovers around 2090, which is approximately half scale, so that’s good. A very loud sound sends it almost full scale (+/- 1500 units), which is also good. The RMS variation when quiet is around 10-20 units, which indicates there’s not a lot of noise in the system.
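For reference, the conversions I’m using look roughly like this (Arduino-flavored sketch, names are mine):

    #include <math.h>

    // 12-bit ADC against a 3.3 V reference: roughly 0.8 mV per count, mid-scale ~2048.
    float countsToMillivolts(uint16_t counts) {
      return counts * (3300.0f / 4095.0f);
    }

    // RMS deviation from the mean over one buffer, in counts.
    float rmsDeviation(const uint16_t *buf, size_t n, float mean) {
      float sumSq = 0;
      for (size_t i = 0; i < n; i++) {
        float d = (float)buf[i] - mean;
        sumSq += d * d;
      }
      return sqrtf(sumSq / n);
    }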

I’m trying to decide how to identify the time of the arrival of the wavefront. I could pick the reading that is furthest from the mean value and key off of the time of that, and I think this is what I’ll try first. I’m figuring that this translates into the peak of the initial attack, and that will be close to the same position in all the microphones.
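A sketch of that first strategy, with illustrative names:

    #include <math.h>

    // Return the index of the sample furthest from the buffer's mean value;
    // its timestamp becomes the estimated arrival time of the wavefront.
    size_t findPeakIndex(const uint16_t *buf, size_t n) {
      float mean = 0;
      for (size_t i = 0; i < n; i++) mean += buf[i];
      mean /= n;

      size_t best = 0;
      float bestDev = 0;
      for (size_t i = 0; i < n; i++) {
        float dev = fabsf((float)buf[i] - mean);
        if (dev > bestDev) { bestDev = dev; best = i; }
      }
      return best;   // convert to a time via the per-sample period
    }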

The other strategy is to pick the first reading that is more than a handful of standard deviations away from the mean. That requires a more sophisticated analysis of the data leading up to the wavefront, and when the wavefront is at the beginning of the packet of data, I may not have enough points to get a meaningful statistical picture. That would mean getting long-term averages from the microphone controllers themselves, which means more computation on the microcontrollers.

Quieter sounds (soft clapping) produced simpler and easier to interpret waveforms:

Experiment 1 clap 3
Clap 3

Here you can see that there is only really one big spike with a 300 unit deviation, and so keying off of that single value is likely the right strategy.

I’m going to have to wait till I have 2 microphones to see if I can synchronize their detection.

Another contender

When I priced out the BeagleBoard Black design, I didn’t know that there was a cheaper BeagleBoard available. It’s the same price as the Feather M0 WiFi, and it’s got several analog inputs, plus I can run Linux on it. The only thing it doesn’t have is WiFi, but there’s a USB port available, and a second one can be attached. So I’d just need a $5 WiFi dongle and a microphone, and I’d be all set.

Initial Mic Prototypes

So I’ve gotten initial prototypes of the microphones done and am ready to start playing more seriously with the software.

First the Feather M0 prototype:

Feather M0 WiFi mic prototype
Sorry about the messy workspace

It’s got nice clean lines because most of the magic happens on the existing boards. I had to jumper the 8kHz square wave output of the RTC board to pin 10 so that it could generate timing interrupts, and I connected the microphone directly to power and the A2 analog input.
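In code, that arrangement looks roughly like this sketch; the buffer handling is elided and the names are mine:

    // The RTC's 8 kHz square wave is jumpered to pin 10, and each edge
    // triggers one ADC read of the microphone on A2.
    #define SQW_PIN 10   // RTC square-wave output
    #define MIC_PIN A2   // microphone output

    volatile uint16_t latestSample = 0;

    void onTick() {
      latestSample = analogRead(MIC_PIN);   // one sample per RTC clock edge
    }

    void setup() {
      pinMode(SQW_PIN, INPUT_PULLUP);       // pull-up in case the RTC's SQW pin is open-drain
      attachInterrupt(digitalPinToInterrupt(SQW_PIN), onTick, RISING);
    }

    void loop() {
      // the real code accumulates samples into buffers and ships them over WiFi
    }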

The Feather Huzzah ESP8266 microphone prototype is a little messier:

Feather ESP8266 mic prototype
The need for that voltage divider really tripped up the elegance of the design.

Since the microphone generates voltages from 0-3.3V, and the ADC on the ESP8266 can only handle a maximum input of 1 V, I needed a voltage divider to bring down the mic voltage before feeding it into the analog pin. The proto board is overkill, but I didn’t want to wire it up freeform. I’ll put the DHT11 temperature sensor on this board as well, once it arrives.
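The divider math, with example resistor values that aren’t necessarily what’s on my proto board:

    Vout = Vin × R2 / (R1 + R2)
    e.g. R1 = 27 kΩ, R2 = 10 kΩ:  3.3 V × 10/37 ≈ 0.89 V  (comfortably under the 1 V limit)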

So far, the M0 board has behaved well, at least once I managed to update its WiFi firmware. The ESP8266 has been more problematic. As long as I keep the sampling rate at 500 Hz or less, it behaves pretty well, but as soon as I try to push it past that, it starts dropping packets (sent and received) on the floor and returning errors. As near as I can tell, this is because more of the WiFi handling is done by the main processor (as opposed to being passed off to the dedicated subprocessor in the M0), and when I push the read rate up, it starves the main processor of the time it needs to do the WiFi work. 500 Hz is a factor of 20 below my target rate of 10kHz though, so unless I can solve this problem by changing strategies, I’m going to have to switch to using the M0, and as I said, that will strain the budget. (Ideally I’d like to keep this under $500, but if I have to spend $55 on each microphone, that will blow almost half the budget there.) Maybe I don’t actually need the RTC chip though. There are cheaper oscillators and I can compensate for temperature using NTP.

Trials and Tribulations of debugging

I’ve gotten both the Huzzah ESP8266 and the M0 WiFi set up as monitors, though the Huzzah is missing both the mic and the temperature sensor for now. The code compiles cleanly on both, but isn’t completely working on either so far.

The ESP8266 seems to do most of its I/O in software rather than firmware or hardware, and that means resource contention can cause it to misbehave. If I have the timer interrupt library set for more than a few hundred interrupts/second (and I need 8K to 10K for the project), it doesn’t seem to be able to send or receive UDP packets, and eventually stops accepting new packets to be queued (probably because some buffer is full). I’ve tried adding a yield() call in the sending routine, and that doesn’t appear to have helped.
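The structure I’m using looks roughly like this sketch (the server address, credentials, buffer size, and the way the callback is hooked to the timer library are all illustrative, and this is what I tried rather than a fix):

    // ESP8266 sketch: the timer callback only stores samples, and loop()
    // does the UDP send and yields to the WiFi stack.
    #include <ESP8266WiFi.h>
    #include <WiFiUdp.h>

    WiFiUDP udp;
    volatile uint16_t samples[512];
    volatile size_t count = 0;

    void onSampleTick() {                       // registered with the timer interrupt library
      if (count < 512) samples[count++] = analogRead(A0);
    }

    void setup() {
      WiFi.begin("my-ssid", "my-password");     // placeholder credentials
      while (WiFi.status() != WL_CONNECTED) delay(100);
    }

    void loop() {
      if (count == 512) {
        udp.beginPacket("192.168.1.10", 9999);  // local collection server (illustrative)
        udp.write((const uint8_t *)samples, sizeof(samples));
        udp.endPacket();
        count = 0;
      }
      yield();  // give the WiFi/UDP machinery time to run
    }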

For the M0, it seems like it’s caching old library code. I had forked the NTPClient library to add the ability to get fractional parts of the second, since the existing library throws that data away. In theory, the new measurements are precise to the microsecond (not necessarily accurate, but precise, and hopefully accurate to within about 50-100 microseconds if synced to the same server). However, in my first pass I introduced a bug that caused it to jump about 5 years into the future (I’ve invented time travel!), and while I’ve squashed it in all my test cases, it continues to show up in the microphone code, though only when compiled for the M0. My best guess is that the Arduino IDE is caching the compiled library object files somewhere in the project and I can’t convince it to recompile. It’s also possible that I haven’t actually completely squashed the bug, but that seems less likely given that my test cases pass.

They’re both dumping data to the simplistic server I’ve written in Python, though–at least when the Huzzah can transmit UDP at all. So that’s gratifying.

[Update] It was a real bug, but not in the NTPClient library. I was passing an IPAddress object in place of the hostname parameter, which expects a const char *. The compiler helpfully converted the IPAddress object to a const char *, but in so doing smashed the stack of the NTPClient object, setting the time_offset parameter to about 159 million seconds, which is roughly 5 years.