From January 23-25, the Los Angeles Homeless Services Authority (LAHSA) conducted its annual count of unsheltered homeless people with the help of more than 8,000 volunteers. On the evening of the 24th, I joined the #TheyCountWillYou effort in Los Angeles to count the homeless in the city of Cudahy. As you’ll see in the description of the steps, the process is thorough and intentional. I was proud to contribute in a small but important way. If you live in the greater LA area, I recommend you participate next year.
In this blog, we'll discuss the current homeless survey process, and describe how data quality might be improved in the future.
At a high level, the following steps outline the process that the Los Angeles Homeless Services Authority uses during the count.
This review is necessarily high level, since I only participated in the count as a volunteer, but from the outside looking in I found a number of data quality controls that could be implemented to ensure higher-quality data. First and foremost, if the count were conducted with digital data entry (e.g. by tablet or smartphone), the following validations could be performed.
The list of attributes on the right is the extent of the data currently collected on the paper forms. The 2017 dataset listing all data collected by LAHSA for the survey can be downloaded here (including metadata).
There are at least two ways to improve the data quality of the survey data. Although these two options are discussed separately, they aren’t mutually exclusive. Continuous improvement typically includes both incremental steps and bigger technology-focused changes.
In the Reactive & Process Focus, all collected data could be profiled using software tools (proprietary or open-source depending on the funding available). This is likely already done, but surely additional business rules, expectations for quality levels and governance could be documented and improvements made for a relatively small investment in time and computing resources. The following are a few observations about where to start this effort.
Discuss with stakeholders what types of analysis they will conduct using the data, and document the expected quality levels in terms of dimensions of data quality (preferably the Conformed Dimensions). Then profile the collected data to measure the gap between expected and actual data quality levels.
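As a rough illustration of that profiling step, here is a minimal sketch in Python. The field names (`tract`, `adults`, `tents`), the sample records, the six-digit tract rule and the expected quality levels are all assumptions made up for this example; they are not taken from the LAHSA forms or dataset.

```python
# Hypothetical tally records as they might look after paper forms are keyed in.
# Field names and values are illustrative only.
records = [
    {"tract": "534301", "adults": "12", "tents": "3"},
    {"tract": "534301", "adults": "", "tents": "2"},
    {"tract": "5343", "adults": "7", "tents": "-1"},
]

def is_count(value):
    """A count is valid if it is a non-negative whole number."""
    return value.isascii() and value.isdigit()

def is_tract(value):
    """Assumption: a tract identifier is exactly six digits."""
    return len(value) == 6 and value.isascii() and value.isdigit()

# One validity rule per field, plus the quality level stakeholders expect
# (share of valid values). Both are assumptions for this sketch.
rules = {"tract": is_tract, "adults": is_count, "tents": is_count}
expected = {"tract": 1.0, "adults": 0.95, "tents": 0.95}

def profile(rows, field_rules):
    """Return the share of rows whose value passes the rule, per field."""
    total = len(rows)
    return {
        field: sum(1 for row in rows if rule(row.get(field, ""))) / total
        for field, rule in field_rules.items()
    }

def gap_report(actual_levels, expected_levels):
    """Return expected-minus-actual for every field that falls short."""
    return {
        field: round(expected_levels[field] - actual_levels[field], 4)
        for field in expected_levels
        if actual_levels[field] < expected_levels[field]
    }

actual = profile(records, rules)
gaps = gap_report(actual, expected)
```

The gap report gives stakeholders a short list of fields where the collected data falls below what their analysis needs, which is a natural place to start adding business rules.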
If only the Reactive & Process Focus is followed, there is a limit to how much control and validation can be applied, because the data collection medium is paper. The biggest problem is that errors in data collection are all discovered after the fact, making the data difficult to correct. At best, erroneous observations could be removed from the analysis set (so as not to bias the final analysis). If a digital data collection method (e.g. a smartphone or tablet) is used, the following methods of data quality control become possible. More specifically, this is where the Conformed Dimensions of Data Quality are helpful.
Using the Proactive & Technology Focus, computers, tablets, or smartphones could be used to collect the data, allowing the data entry application to enforce validation rules at the moment data is collected. Assuming a mobile device is used with either a standard web page for data entry or a platform-specific app (e.g. an Apple or Android app), the following Conformed Dimensions could be programmed into the software to ensure that only valid data is collected.
Conformed Dimensions of Data Quality (CDDQ), release 3.3

| Dimension | Underlying Concept | Definition |
| --- | --- | --- |
| Validity | Values in Specified Range | Values must be between some lower number and some higher number. |
| Validity | Values Conform to Business Rule | Validity measures whether values adhere to some declarative formula. |
| Validity | Domain of Predefined Values | This is a set of permitted values. |
| Validity | Values Conform to Data Type | Validity measures whether values have a specific characteristic (e.g. Integer, Character, Boolean). Data types restrict what values can exist, the operations that can be used on them, and the way that the data is stored. |
| Validity | Values Conform to Format | Validity measures whether the data are arranged or composed in a predefined way. |
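To make the five Validity concepts concrete, here is a minimal sketch of how a data entry app might check a single entry before accepting it. Everything specific in it is an assumption for illustration: the field names, the upper limit of 500, the list of dwelling types, the six-digit tract format and the "total equals adults plus youth" business rule are hypothetical and not LAHSA's actual rules.

```python
import re

# All of the following constants are hypothetical examples.
COUNT_FIELDS = ("adults", "youth", "total_persons")
MAX_COUNT = 500                                  # Values in Specified Range
DWELLING_TYPES = {"car", "van", "rv", "tent", "makeshift"}  # Predefined domain
TRACT_FORMAT = re.compile(r"\d{6}")              # Values Conform to Format

def validate_entry(entry):
    """Return a list of validation errors; an empty list means the entry is valid."""
    errors = []
    counts = {}

    for field in COUNT_FIELDS:
        value = entry.get(field)
        # Values Conform to Data Type: counts must be integers (bool excluded).
        if isinstance(value, bool) or not isinstance(value, int):
            errors.append(f"{field}: not an integer")
        # Values in Specified Range: counts must lie between 0 and MAX_COUNT.
        elif not 0 <= value <= MAX_COUNT:
            errors.append(f"{field}: outside 0..{MAX_COUNT}")
        else:
            counts[field] = value

    # Values Conform to Business Rule: only checked when all counts are usable.
    if len(counts) == len(COUNT_FIELDS):
        if counts["total_persons"] != counts["adults"] + counts["youth"]:
            errors.append("total_persons: must equal adults + youth")

    # Domain of Predefined Values: dwelling type must be a permitted value.
    if entry.get("dwelling_type") not in DWELLING_TYPES:
        errors.append("dwelling_type: not a permitted value")

    # Values Conform to Format: tract identifier must be exactly six digits.
    if not TRACT_FORMAT.fullmatch(str(entry.get("tract", ""))):
        errors.append("tract: expected six digits")

    return errors

good = {"tract": "534301", "adults": 4, "youth": 1,
        "total_persons": 5, "dwelling_type": "tent"}
bad = {"tract": "5343", "adults": -1, "youth": "2",
       "total_persons": 3, "dwelling_type": "boat"}
good_errors = validate_entry(good)
bad_errors = validate_entry(bad)
```

Because the checks run at entry time, the volunteer can correct a mistake on the spot instead of it being discovered after the count.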
Additionally, given that most devices are equipped with GPS technology, spatial data collection (and thereby additional validation) could be conducted. This would, for instance, allow LAHSA not only to collect counts of persons (available today) but also to know where those counts were made. This provides a much more Precise accounting of the nature of homelessness and enables further services to be deployed to the most appropriate locations. Below is a screenshot of a census tract surveyed (534301, part of the City of Cudahy). Using the LAHSA Census Tract finder page, you can get a feel for the size of each area surveyed by volunteers.
Using spatial data quality rules embedded in the app, additional data quality can be ensured. Here are a few that may provide value:
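One such spatial rule, checking that an observation was recorded inside the assigned census tract and with an acceptable GPS accuracy, can be sketched as follows. This is only an illustration under stated assumptions: the rectangle below is a made-up stand-in and not the real boundary of tract 534301, the observation fields (`lon`, `lat`, `accuracy_m`) are hypothetical, and the 50 meter accuracy threshold is an arbitrary example value.

```python
# Made-up rectangular boundary as (longitude, latitude) pairs.
# It is NOT the real boundary of any census tract.
EXAMPLE_TRACT = [
    (-118.20, 33.95),
    (-118.17, 33.95),
    (-118.17, 33.97),
    (-118.20, 33.97),
]

def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: count how many polygon edges a ray cast from the
    point toward increasing longitude crosses; an odd count means inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

def validate_location(observation, tract_polygon, max_accuracy_m=50.0):
    """Return a list of spatial validation errors for one observation."""
    errors = []
    # Reject fixes whose reported horizontal accuracy is too coarse.
    if observation["accuracy_m"] > max_accuracy_m:
        errors.append("GPS accuracy too low; wait for a better fix")
    # Reject observations recorded outside the assigned tract.
    if not point_in_polygon(observation["lon"], observation["lat"], tract_polygon):
        errors.append("location is outside the assigned census tract")
    return errors

in_tract = {"lon": -118.185, "lat": 33.96, "accuracy_m": 8.0}
out_of_tract = {"lon": -118.10, "lat": 33.96, "accuracy_m": 120.0}
in_tract_errors = validate_location(in_tract, EXAMPLE_TRACT)
out_of_tract_errors = validate_location(out_of_tract, EXAMPLE_TRACT)
```

A real app would load actual tract boundaries and would probably use a geospatial library rather than a hand-written test; the point here is only that the rule is cheap to enforce once coordinates are captured.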
Additional reasons for conducting data collection via smartphones include:
In Conclusion
I was very impressed with the general survey structure used by LAHSA, and I hope that future surveys can be done digitally, maximizing data quality in the ways discussed in this blog and improving the general precision of the data collected. I'm sure many of you (readers) will think of additional ideas for validation and data collection, so please email those to me (dan[at]dqmatters[dot]com) so that I can add a postscript, adding value for survey administrators and similar organizations around the world.