Math is hard…

I haven’t had much time to work on this for the last month or so because work got a little crazy. Now that the Big Project is over, I’ve started back into this.

My first attempt at the triangulation is more complex than I initially expected. First I take each sound pair and, using the skew and the locations of the microphones, calculate a pair of bearings originating from the midpoint between those two microphones. Then I take those midpoints and bearings and calculate intersections. Each pair of midpoints yields four intersections, and there can be up to 6 midpoints, therefore 15 midpoint pairs and 60 candidate locations. However, I ignore midpoint pairs that are too close together (2 meters or less), and I also discard any location more than a couple of kilometers away as a false signal, but that still leaves between 5 and 25 locations.
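The core of the intersection step looks something like this (a simplified sketch on a flat plane in meters, with made-up function names, not my actual code):

```python
import itertools
import math

def intersect_lines(p1, d1, p2, d2):
    """Intersect the lines p1 + t*d1 and p2 + s*d2 on a flat plane.
    Returns the intersection point, or None if the lines are parallel."""
    # Solve t*d1 - s*d2 = (p2 - p1) with Cramer's rule.
    det = d1[0] * (-d2[1]) - d1[1] * (-d2[0])
    if abs(det) < 1e-9:
        return None
    rx, ry = p2[0] - p1[0], p2[1] - p1[1]
    t = (rx * (-d2[1]) - ry * (-d2[0])) / det
    return (p1[0] + t * d1[0], p1[1] + t * d1[1])

def candidate_locations(midpoints):
    """midpoints: list of (point, [bearing_a, bearing_b]) tuples, one per
    microphone pair; the bearings are unit vectors, and each midpoint
    carries two of them (hence four intersections per pair of midpoints)."""
    candidates = []
    for (p1, dirs1), (p2, dirs2) in itertools.combinations(midpoints, 2):
        if math.dist(p1, p2) <= 2.0:            # midpoint pair too close together
            continue
        for d1 in dirs1:
            for d2 in dirs2:                    # four combinations per midpoint pair
                loc = intersect_lines(p1, d1, p2, d2)
                if loc is None:
                    continue
                if math.dist(loc, p1) > 2000:   # more than a couple of km: false signal
                    continue
                candidates.append(loc)
    return candidates
```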

Next I try to cluster the locations by finding locations that are within 100 m of each other. This isn’t working quite as well as I had hoped, but that is likely due to poor data quality. I’ve just got the microphones scattered around my office, rather than at the corners of my property, but the algorithm assumes the mics are at their final locations. Because of the relatively low skew times, the calculations all come out as if the sounds originate in the center of my house.
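The clustering itself is nothing fancy; roughly this (a greedy sketch, not necessarily the exact code I’m running):

```python
import math

def cluster_locations(locations, radius=100.0):
    """Greedy clustering: each location joins the first cluster whose
    centroid is within `radius` meters, otherwise it starts a new cluster."""
    clusters = []
    for loc in locations:
        for cluster in clusters:
            cx = sum(p[0] for p in cluster) / len(cluster)
            cy = sum(p[1] for p in cluster) / len(cluster)
            if math.dist(loc, (cx, cy)) <= radius:
                cluster.append(loc)
                break
        else:
            clusters.append([loc])
    return clusters

# The centroid of the largest cluster becomes the best guess for the source.
```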

I’ve got to work on the enclosures so I can get the microphones to their proper locations. Unfortunately the fireworks that were pretty common near the 4th of July have abated, so I will have relatively few data points to work with.

The timing is problematic

I’ve got the server end of the solution a little farther along. It now recognizes multiple buffers as belonging to the same sound, groups those sounds into likely pairs for comparison, automatically determines their time skews (offsets), and associates them with an event.

There are still problems getting the microcontrollers to agree on what time it is, and now the server is problematic as well. As I mentioned previously, the algorithm I use to determine the time offsets is CPU intensive, and now that it’s being run automatically the server is under much heavier load. It’s got 8 cores, so dedicating 4-6 of them to running parallel computations shouldn’t be an issue. Apparently, however, the heat sink on the CPU isn’t as effective as it could be, so first the CPU frequency scales up to handle the load, which causes the temperature to shoot up to nearly 80℃ (see below), which in turn causes the CPU to scale back its frequency to deal with the heat. I suspect that all that scaling up and down is playing havoc with its internal timekeeping, because it suddenly loses track of the time by 500 ms or more, causing it to inappropriately break up new sounds into multiple buffers as the time jumps back and forth. It’s also messing up the microphone microcontrollers, which suddenly get large offsets from the server, quickly followed by large offsets in the opposite direction as the server time corrects itself.

It’s getting hot in here

So I’m going to take two steps to combat this. First, I’m going to take the heatsink off the CPU and see if I can get it to be more effective by cleaning the surfaces and applying a thin coat of new thermal compound. Second, I’m going to move the role of NTP server to a new box that doesn’t have as much of an issue with spiky CPU load. (And if that isn’t stable enough, I may dedicate a Raspberry Pi to act as an NTP server.)

Pyramid of abstractions

I’m starting to tackle the problem of identifying the individual sounds and correlating them with things detected by all the microphones. In order to do this I’m going to come up with abstractions for the data at each stage in the process, designing from the bottom up. The term for each abstraction is presented in bold the first time it is used.

The hierarchy of abstractions

At the base of the pyramid of abstractions are samples. These are individual measurements of the voltage the microphone element produces. The samples are grouped together into buffers. Currently the size of a buffer is 512 samples, collected at approximately 20kHz. The microcontroller decides whether a buffer is interesting by looking for samples where the voltage rises above a threshold. If it is interesting, it forwards that buffer, along with the preceding and following buffers, to the server.
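In pseudo-Python (the real check runs on the microcontroller, and the threshold value here is purely illustrative):

```python
THRESHOLD = 2000     # ADC counts above the quiet baseline; illustrative only
BUFFER_SIZE = 512

def is_interesting(buffer):
    """A buffer is worth forwarding if any sample rises above the threshold."""
    return any(sample > THRESHOLD for sample in buffer)
```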

The server receives these sound buffers from an individual microphone controller. The server groups these buffers together based on time. Each buffer spans about 22 milliseconds, and buffers from the same controller that arrive within 50 ms of each other are considered to be part of the same sound (50 ms chosen so that 1 or 2 dropped buffers don’t break up a single sound.)

Sounds from different microphone controllers are grouped together by the server into sound clusters that occur within some small time frame, bounded by the time it takes for sound to travel between microphones (plus a small amount to allow for timing errors). If a sound cluster contains sounds from more than 2 microphones, those sounds are considered to have originated from the same event.
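Sketched out in Python (simplified: timestamps in milliseconds, and the window sizes other than the 50 ms gap are illustrative):

```python
SOUND_GAP_MS = 50         # buffers closer than this belong to the same sound
CLUSTER_WINDOW_MS = 120   # illustrative: max mic-to-mic travel time plus slack

def group_buffers_into_sounds(buffers):
    """buffers: list of (mic_id, start_ms) from one controller, sorted by time."""
    sounds = []
    for buf in buffers:
        if sounds and buf[1] - sounds[-1][-1][1] <= SOUND_GAP_MS:
            sounds[-1].append(buf)      # close enough: same sound
        else:
            sounds.append([buf])        # gap too big: start a new sound
    return sounds

def group_sounds_into_clusters(sounds):
    """sounds: list of (mic_id, start_ms) across all mics, sorted by time."""
    clusters = []
    for snd in sounds:
        if clusters and snd[1] - clusters[-1][0][1] <= CLUSTER_WINDOW_MS:
            clusters[-1].append(snd)
        else:
            clusters.append([snd])
    # Only clusters heard by more than two microphones are treated as events.
    return [c for c in clusters if len({mic for mic, _ in c}) > 2]
```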

The server then tries to determine where the event happened. It does this by calculating the time-offsets and similarities of each pair of sounds in a cluster. If the similarity of a pair of sounds falls below some threshold, then the time-offset for that pair is discarded. The remaining time-offsets for a cluster are then combined with the physical arrangement of their corresponding microphone assemblies and used to calculate a best guess for the location of the event.
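The filtering step is the simple part; something like this (the threshold value is illustrative):

```python
SIMILARITY_THRESHOLD = 0.6   # illustrative; matches worse than this get ignored

def usable_offsets(pairs):
    """pairs: list of (mic_a, mic_b, offset_us, similarity) for one cluster.
    Keep only the time-offsets from pairs whose waveforms actually match;
    the survivors feed the bearing/intersection code sketched earlier."""
    return [(a, b, off) for a, b, off, sim in pairs if sim >= SIMILARITY_THRESHOLD]
```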

To skew or not to skew

After writing up my (computationally intensive) code to measure the skew between the signals from two microphones, I’ve made a discovery. It works great for stuff with complex, low-frequency sounds like my chair creaking, but not so well in other cases. For sustained, constant-frequency sounds (like beeps) it gets confused about which of several possible alignments is “best”. Take for example this short beep as heard by two adjacent microphones:

My visual best fit says the green waveform needs to be shifted a few hundred microseconds to the right, and that these were almost in alignment already. However, my algorithm shifted it ~13,000 microseconds to the left.

It did make the wave peaks line up, but since this is a more or less steady tone, that happens every couple of milliseconds. I’m also sure it maximized my fit function, but to my eye the overall envelopes don’t match nearly as well. I think there are two factors working against my algorithm here. First, the waveforms weren’t complete: the beginning of the waveform was cut off by different amounts in the different samples. I’ve taken measures to reduce the likelihood of that happening, but I can’t eliminate it altogether. Second, as I already mentioned, this was a fairly steady tone, so there were lots of “pretty good” fits for it to choose from.

The other situation that it doesn’t handle well is more problematic. It appears that for short, sharp sounds, like a clap, whip crack, fireworks, or gunshots, there is too much high-frequency information that the two mics will sample differently. Since my sampling rate is about 20kHz, I really can only differentiate frequencies below about 10kHz (5kHz for a good fit); see the Nyquist-Shannon theorem for a more complete discussion as to why. So, when I have a signal with a lot of high-frequency information, I can’t really match it effectively. Take this example of a clap when the mics were a few feet apart (1-2 meters):

The apparent shift shouldn’t need to be large, but the algorithm doesn’t pay attention to that, and it came up with a fit that looked like:

This is a much worse fit according to my eye. I think a better technique in this case is to line up the beginnings of the loud sounds, but I need to come up with a way to identify those algorithmically. I’ll probably use some heuristic like looking for the time of the first samples to fall significantly further from the mean than I’d been seeing previously, but that requires a nice quiet section before the sound happens. I’ve taken steps to try to make sure that I have that (by sending the prior buffer as well when I detect an anomaly), but it doesn’t always work out, as you can see in the purple curve.
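Something along these lines is what I have in mind (a sketch; the quiet-window size and the “significantly further” cutoff are illustrative guesses):

```python
import statistics

def onset_index(samples, quiet_window=256, k=5.0):
    """Find the first sample that falls well outside the quiet baseline.
    Assumes the first `quiet_window` samples are background noise; `k` is how
    many standard deviations away counts as significant."""
    baseline = samples[:quiet_window]
    mean = statistics.fmean(baseline)
    sd = statistics.pstdev(baseline) or 1e-9
    for i, s in enumerate(samples[quiet_window:], start=quiet_window):
        if abs(s - mean) > k * sd:
            return i
    return None  # no clear onset; fall back to the fit-function alignment
```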

The good, the bad, and the skewed

One of the technical challenges in this project is to figure out the exact time offset of two waveforms. I think I’ve solved that sufficiently.

The Good

My algorithm correctly detects the time skew of two waveforms. Here’s the raw data from two mics:

Without correction, the two waveforms look disjoint

And here’s after the skew is corrected for:

With the curve from Mic 004 skewed forward by 2.44 milliseconds

The two waveforms are a very good match.

The Bad

The algorithm is very computationally intensive. With my first pass at the code, finding the skew took 10-20 minutes for two 50 millisecond waveforms. With a little optimization from caching interpolation results and discarding excess precision, I got it down to 1-2 minutes (much better, but still pretty slow). I may be able to get another factor of 2 or 3 by switching to C++ from Python, but getting the code right will be more difficult.
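The general shape of the search is a brute-force scan over trial shifts. The caching win came from interpolating each waveform onto a shared grid once, instead of re-interpolating for every shift, and the coarser grid spacing is the “discarded precision”. A rough sketch (not my actual code; the dot product is a stand-in for my fit function):

```python
import numpy as np

def estimate_skew(t_a, v_a, t_b, v_b, max_shift_us=20000, step_us=50):
    """Resample both waveforms onto a shared grid once, then slide one
    against the other and keep the shift that scores best."""
    grid = np.arange(min(t_a[0], t_b[0]), max(t_a[-1], t_b[-1]), step_us)
    a = np.interp(grid, t_a, v_a)   # interpolate once, reuse for every shift
    b = np.interp(grid, t_b, v_b)
    a -= a.mean()
    b -= b.mean()
    best_shift, best_score = 0, -np.inf
    for shift in range(-max_shift_us // step_us, max_shift_us // step_us + 1):
        # np.roll wraps around at the ends, which is fine for a sketch
        score = float(np.dot(a, np.roll(b, shift)))
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift * step_us     # skew of B relative to A, in microseconds
```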

The Skewed

It has occasionally detected skews in the range I’d expect for two microphones next to each other (a few hundred microseconds), but most of the skews have been in the 2-2.5 millisecond range, which is about 10 times what I’m hoping for. More work on time sync is needed apparently.

And now for the time and temperature…

I’ve been noticing that the microphone that has the DHT-11 temperature sensor consistently under-performs the other microphone in terms of how well it stays in sync with the NTP server. I have documented previously that trying to read a non-existent sensor caused major sync issues, but I now know that even if the sensor is working properly, it still throws the sync off slightly.

On mic 001 (the one with the temperature sensor), I was seeing an average offset of somewhere around +/-1300 microseconds, whereas on mic 004 (the one without the sensor or the code to read it), I was typically seeing offsets of +/-300-400 microseconds (1/4 to 1/3 as large). So I disabled the DHT-11 on mic 001, and within 15 minutes the average offset was +/-400-500 microseconds, and the timing of received sound waves was much more in sync.

I have no idea what this was, but both mics agree when it was.

And zooming in on that first big positive peak you can see that they’re only about 400 microseconds apart, which is pretty good.

0.909730 - 0.909344 = 0.000386 seconds, or 386 microseconds difference

It’s not the 100 to 200 microsecond offset I’m looking for, but I can live with this level of error.

Now what am I going to do with the temperature sensor? I still need to be able to measure ambient air temperature to calculate the speed of sound accurately, but it was never a requirement that I have 4 of them or that they be co-located with microphones. I have some spare ESP8266 feathers now, and while they’re not good for the microphones, I can easily re-purpose them as a couple of temperature sensors. I’ll play with low-power deep sleep and have them wake up every 30 seconds or so to check the temp and report in. That should give me a fairly accurate and current air temperature.
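For converting temperature to speed of sound, the standard linear approximation should be plenty accurate for this:

```python
def speed_of_sound(temp_c):
    """Approximate speed of sound in dry air, in meters per second."""
    return 331.3 + 0.606 * temp_c

# About 343 m/s at 20 degC versus about 331 m/s at 0 degC: a few percent
# difference, which matters when converting millisecond skews to distances.
```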

ESP32 not working out

Since the ESP8266 wasn’t working for this project, I ordered a Feather Huzzah ESP32, which features a slightly newer Wifi chip from Espressif. I had high hopes that this would be a cheaper alternative to the M0 feather, since it had a math-coprocessor and dual cores. I thought the second core could handle the 10kHz interrupt routine to read the microphone while the primary core did all the I/O with the WiFi.

It seemed to work well initially, but the processor started panicking and resetting every few seconds. I built a test rig, and it was able to handle reading the mic at 10kHz with no problem; however, as soon as I added my NTP routines to timestamp the buffers, it reset almost immediately. My first thought was that the NTP estimation routine was slow enough to cause the interrupt service routine (ISR) to bleed into the next firing of the timer interrupt, but after a little research there appears to be another problem.

When I ripped out the hairy math that did time skew estimation and replaced it with slightly less hairy math, I used floating point calculations. I thought that the ESP32, with its dedicated floating-point co-processor, would make quick work of these, and it probably does, but doing floating point math in an ISR is apparently a no-no. Maybe the coprocessor uses interrupts to signal that it has completed its calculation, and having interrupts within interrupts was causing a race condition of some kind that was occasionally resetting the chip.

So now I either have to go back to hairy integer math in the skew estimation routine, or I need to stick to the M0. I think I’ll stick with the M0, and swallow the additional $15 per microphone.

Another timing bug squashed

I spent several hours trying to figure out why my circuit on the second M0 microcontroller was having significant timing issues when the first M0 wasn’t. The second M0 didn’t seem to exhibit the problem in my NTP test rig either. I finally figured it out. As you can see below, when I built the second M0, I didn’t have a second DHT-11 temperature and humidity sensor (the blue box–I forgot to order a second one), but I kept the code the same. If I use the preprocessor flag I put in my code to disable all the DHT related calls, the timing issues go away.

Two microphones listening for booms

Here’s my theory as to what’s happening. The DHT11 product page on Adafruit’s site says that reading the chip requires careful timing. To get the timing right, I suspect that it disables interrupts while reading the chip. If the chip is there then this only takes a few microseconds, and nobody is the wiser. However, if the chip isn’t there, it has to wait for some timeout to happen, and that means interrupts are disabled for a while, and that throws off the millis() and micros() timing functions, which won’t increment while interrupts are disabled. Since my NTP library uses micros() calls frequently to calculate the time since the last server sync, it was accumulating significant errors which were causing the readings to go all over the place.

By the time I’d figured this out, I’d already ripped out a lot of the hairy math I had implemented for calculating gradual skews and correction factors, and replaced it with slightly less hairy math. It’s far more straightforward and easy to understand, and the only downside is that a lot of it is now floating point math, which is slower on some microprocessors.

I’m still seeing about a 1-2 millisecond skew between the readings on the two microphones, and I’m going to have to find a way to adjust for that.

Timing problems

I definitely still have timing problems. I happened to catch a squeak from my office chair (hurrah for squeaky chairs) which spanned two captured buffers. Here’s what it looks like:

Overlapping waveforms
Overlapping times

As you can clearly see, there’s a definite backwards time jump. With a little more gnuplot-fu, I managed to get a clearer picture that shows a ~6.5 ms backwards jump.

it’s not supposed to do that

There are several sources of error that I need to track down for this.

  1. Assuming the timer interrupt is perfect
  2. Assuming that time never jumps backward
  3. Assuming that every reading takes exactly the same amount of time

My first poor assumption is that the timer interrupt fires exactly every 100 microseconds (as it would for a perfect 10kHz signal). However, if it were actually firing (say) every 110 microseconds (approximately a 9kHz signal), then over 512 readings (the size of the buffer) it would have a cumulative error of 5120 microseconds, or ~5 milliseconds. A negative 6.5 ms error would indicate an actual period of about 87 microseconds and a clock frequency of about 11.5 kHz. I can correct for this on average by calculating the time it takes to read a whole buffer and dividing by the size of the buffer. I’ll do that next.
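The arithmetic, spelled out (the numbers are the ones from the paragraph above):

```python
BUFFER_SIZE = 512
NOMINAL_PERIOD_US = 100                      # a perfect 10 kHz timer

observed_error_us = -6500                    # the backwards jump I measured
actual_period_us = NOMINAL_PERIOD_US + observed_error_us / BUFFER_SIZE
actual_freq_khz = 1000 / actual_period_us
print(actual_period_us, actual_freq_khz)     # ~87.3 us per reading, ~11.5 kHz

# The planned correction: timestamp the start and end of each buffer and
# divide by the number of samples to get the average per-reading period.
def average_period_us(buffer_start_us, buffer_end_us, n_samples=BUFFER_SIZE):
    return (buffer_end_us - buffer_start_us) / n_samples
```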

My second poor assumption is that my timer never jumps backwards. As I’ve documented, it does, though I’ve only observed jumps of a few tens of microseconds. That doesn’t mean it doesn’t jump more at other times, however, and I need to find a way to model that.

My third poor assumption is that every reading takes the same time, whether that’s 100 micros, 110 micros, or 90 micros. This is the hardest to measure, because my clock just isn’t accurate enough to measure a 10 microsecond skew. I suspect that I’m just going to have to punt on this one and assume the buffer-wide average is good enough. It’s probably good enough to account for the drift in clock frequency with temperature.

Further timekeeping improvements

One of the problems I’d been seeing was that every time I synced with the NTP server, the clock would jump forward or back anywhere from 15 milliseconds to 100 milliseconds or more. I even had some cases where readings were appearing out of order because they happened right after one of these jumps.

With a bit more hairy math, I’ve got it spreading out the corrections over a full second, and trying to adaptively adjust its mapping of the internal clock to the real world by continually skewing its reading of the micros() call slightly. It’s still jumping slightly (and backwards, which shouldn’t be possible now, so I likely have some subtle bug), but the jumps are fairly consistently -70 to -85 microseconds, which is shorter than the reading of a single data point, so I doubt it will cause much trouble.
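The idea, boiled down to a sketch (this is the concept, not the microcontroller code itself, which also has to skew its micros() mapping to track drift):

```python
SLEW_WINDOW_US = 1_000_000   # spread each NTP correction over one second

class SlewedClock:
    """Instead of stepping the clock when a sync reports an offset,
    amortize the correction over the next second so timestamps never jump."""
    def __init__(self):
        self.pending_offset_us = 0
        self.slew_start_us = None

    def apply_sync(self, offset_us, now_us):
        # Simplified: a new sync just restarts the slew.
        self.pending_offset_us = offset_us
        self.slew_start_us = now_us

    def corrected(self, now_us):
        if self.slew_start_us is None:
            return now_us
        frac = min(1.0, (now_us - self.slew_start_us) / SLEW_WINDOW_US)
        return now_us + int(self.pending_offset_us * frac)
```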

That’s not to say that everything is perfect. The offsets I’m measuring, while smaller, are still large (+/-1000-2000 microseconds), but at least they’re not causing the reading to jump around as much. This might be due to clock instability on the server side, or on the microcontroller side. If it’s on the microcontroller side then my hopes of making the individual mic controllers stay in sync might be in trouble. I’ll have to wait ’til I have more than one fully functional to check out this theory.