Variation between sensors

It seems to me that sensor readings tend to correlate between geographically separate sensors. I wonder whether or not it would be a good idea to calculate a variation value based on how closely readings match.

For example, let’s presume the following. Sensor 1 reports a one hour PM2.5 average of 10 and sensor 2 (located 500 meters away) reports a one hour PM2.5 average of 9. Take the maths with a pinch of salt as it’s only an example and it’s been a long time since I did statistics (or maths for that matter lol). I’m sure somebody cleverer than me could work out the specifics, but here’s my bash at it…

  • The mean of these values is (9+10)/2 = 9.5
  • Standard deviation can be worked out by subtracting the mean from each sensor’s reading, squaring the result, and taking the mean of the squared differences to get the variance (strictly the population variance, since we divide by the number of readings rather than by one less). Square rooting that number gives the standard deviation. Squaring means the sign of each difference doesn’t matter, so…
  • (9 - 9.5)² = 0.25
  • (10 - 9.5)² = 0.25 (the same as we only have two sensors in the example)
  • The mean of the squared differences is (0.25 + 0.25) / 2 = 0.25, and sqrt(0.25) = 0.5 (standard deviation)
  • We could then use the coefficient of variation (standard deviation divided by mean: 0.5 / 9.5 = 0.053, or 5.3%) to show (in this example) that there is little variation between the sensors.
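The steps above can be checked with Python's standard statistics module. This is only a sketch using the two example readings from the post. Note that `pstdev` is the population standard deviation (divide by N), which is what the hand calculation does; `stdev` (divide by N-1) would give roughly 0.71 for the same two readings.

```python
import statistics

readings = [10, 9]  # one-hour PM2.5 averages from the example above

mean = statistics.mean(readings)
sd = statistics.pstdev(readings)  # population SD, matching the hand calculation
cv = sd / mean                    # coefficient of variation

print(f"mean={mean}, sd={sd}, cv={cv:.1%}")
```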

If we did this, people building applications on top of the data (say, a system which alerts users to breached air quality in their area) could put rules in place so that the application doesn’t alert unless the variation value is <10%, for example. This would help prevent temporary and highly localised sources of degraded AQ, like BBQs, from triggering air quality warnings on systems built on top of our data.
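A rule like that might look something like the sketch below. The function name, the PM2.5 limit of 35 and the 10% cut-off are all assumptions for illustration, not agreed values, and the readings are made up.

```python
import statistics

def should_alert(readings, pm_limit=35.0, max_cv=0.10):
    """Alert only if the area mean breaches the limit AND the sensors agree.

    pm_limit and max_cv are illustrative defaults, not project values.
    """
    if len(readings) < 2:
        return False  # cannot judge agreement from a single sensor
    mean = statistics.mean(readings)
    if mean <= 0:
        return False  # avoid dividing by zero when computing the CV
    cv = statistics.pstdev(readings) / mean
    return mean > pm_limit and cv < max_cv

print(should_alert([40, 42, 41]))  # sensors agree, mean above the limit
print(should_alert([20, 20, 90]))  # one sensor spiking, so variation is high
```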

Now waiting for somebody to point out the critical error in my maths and logic :joy:

Nothing wrong with your thinking @chris or your stats. That is my understanding of coefficient of variation and SD. The only slight question is how far apart, and under what conditions, we can assume the ACTUAL PM levels are the same. Devices that are co-located, for sure. 500m? Even between yours and mine, about 500m apart? I would be looking out for short-term variations, so I guess averaging over a full hour would eliminate these. To do this calculation we should probably let the device run freely for an hour, taking a reading every second, and then do the averaging.

It would be even better to co-locate a few devices and do the same calculation.

I agree that we need evidence of the correlation/variance between the sensors we are using, but I would have thought it should be done on co-located sensors. I have two SDS011 sitting in the same cabinet (check the Kingswood blob) and one PMS7003, so it should be possible to derive a (better?) correlation using the readings from those devices (CHASW-55CF1D-1, CHASN-7412C-1 & CHASW-0451C4-1).

Also, two of my devices use a BME280, so those can be correlated too.
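As a rough sketch of how two co-located sensors could be compared, here is a plain Pearson correlation. The readings below are made up for illustration and are NOT real data from the devices named above. (Python 3.10+ also has `statistics.correlation`, which does the same job.)

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical hourly PM2.5 averages from two sensors in the same cabinet
sensor_a = [8.0, 9.5, 12.0, 15.5, 11.0, 9.0]
sensor_b = [8.5, 9.0, 12.5, 16.0, 10.5, 9.5]

print(round(pearson(sensor_a, sensor_b), 3))
```

A value close to 1 would suggest the two sensors track each other well; it says nothing about whether either one is accurate in absolute terms.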

At some point in the future I plan to fall back to just one device measuring pm, temp and humidity.

That too, but the particular use-case I had in mind was an application warning people subscribed to a certain ‘area’ that air quality is degraded. People might not be interested in medium-term, highly localised AQ degradation caused by, say, a BBQ, which may go on for a couple of hours and would be seen on only one sensor.

There is certainly also a use-case for measuring the agreement between co-located sensors, just to prove the sensors are working accurately.

For the maths, just use the built-in stats functions of whatever language you’re working in; you don’t even need to think about the calculation now. You do still need to know what the numbers mean.
(such as in Python: https://docs.python.org/3/library/statistics.html )

Basically you’re damping out localised spikes from a specific sensor. Over enough sensors, an average, be it mean or median, should do that.

So take Priory Wood Cemetery, for example.

It has three nearby sensors, and you wish to report on the average for all three, as that’s the ‘area’. But if one goes high because somebody is smoking under it, you don’t want the alert to go out on the strength of that one device. With only three sensors, each one has a large weighting: if two were at 10 and the other went up to 80, you’d see a mean of about 33 µg/m³ even though the air isn’t actually bad. You could use the past hourly averages to drop a short spike?
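To put numbers on the three-sensor example (two at 10, one at 80), here is a quick comparison of mean and median using the standard library. It is just an illustration of why the choice of average matters with so few sensors.

```python
import statistics

readings = [10, 10, 80]  # values from the example above

area_mean = statistics.mean(readings)      # pulled up by the single spike
area_median = statistics.median(readings)  # ignores the single spike

print(f"mean={area_mean:.1f}, median={area_median}")
```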

My thoughts are, we give the data.
If people build on top of it, then they can do the maths they want. We just need to make it easy to fetch the data for an area, so they don’t have to download all of it and filter it locally.

Hence pull data by location coordinates and radius?
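A radius query could be sketched roughly like this, using the haversine great-circle distance. The sensor list, its field names and the coordinates are all hypothetical; a real service would more likely do this in the database than in Python.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def sensors_within(sensors, lat, lon, radius_m):
    """Return the sensors no further than radius_m from (lat, lon)."""
    return [s for s in sensors
            if haversine_m(lat, lon, s["lat"], s["lon"]) <= radius_m]

# Made-up sensors for illustration only
sensors = [
    {"id": "A", "lat": 51.4600, "lon": -2.5000},
    {"id": "B", "lat": 51.4630, "lon": -2.5000},  # roughly 330 m north of A
    {"id": "C", "lat": 51.5000, "lon": -2.5000},  # several km north of A
]

nearby = sensors_within(sensors, 51.4600, -2.5000, 500)
print([s["id"] for s in nearby])
```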

The reason I say that is they are likely to use data analysis techniques.
Treat it as amplitude against time, then do spectral analysis or signal processing to see patterns, perform damping/noise filtering, prediction etc.
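As one small example of the damping/noise filtering idea, a rolling median will knock out a single-sample spike from a time series. This is a sketch with a made-up series; the window size of 3 is an arbitrary choice, and the window is simply shortened at the two ends.

```python
import statistics

def rolling_median(values, window=3):
    """Median of each value and its neighbours; window should be odd."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(statistics.median(values[lo:hi]))
    return out

series = [10, 11, 10, 80, 11, 10, 12]  # hypothetical readings with one spike
print(rolling_median(series))
```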

Ah, so there are two different use cases. One is about reproducibility between devices and the other is about providing a type of alert mechanism. For the former, device co-location is important. For the latter, there probably needs to be some algorithm for choosing devices to include - within a certain radius maybe? Next is how to combine readings from multiple devices - some interesting maths possible here. Average of last reading from each device included? The average of devices but using say an hour of data? Ignore any outliers? Good discussion to have. Keep comments coming!

Combining readings could be done using a weighting based on distance from the devices since pm values may tail off from position of measurement. Just a thought.
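One common way to do that is inverse-distance weighting, sketched below. The thread doesn't settle on a scheme, so the function name, the power of 2 and the minimum distance are assumptions, and the sample distances and readings are made up.

```python
def idw_estimate(samples, power=2.0, min_distance_m=1.0):
    """Inverse-distance-weighted average of (distance_m, pm_value) pairs.

    min_distance_m stops a sensor at zero distance causing a divide by zero.
    """
    weights = [1.0 / max(d, min_distance_m) ** power for d, _ in samples]
    total = sum(weights)
    return sum(w * v for w, (_, v) in zip(weights, samples)) / total

# Hypothetical sensors: two close ones reading 10, a far one reading 80
samples = [(100, 10.0), (200, 10.0), (400, 80.0)]
print(round(idw_estimate(samples), 1))
```

With these numbers the distant high reading counts for much less than it would in a plain mean, which is the effect being suggested.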