Cluster analysis of poorly graded restaurants and public health cases: the example of New York City

 · 5 min read
 · Tomáš Hána

In this task, I modelled the relationship between restaurant violations and public health cases in New York City (NYC) neighbourhoods. The objective was to find out how the two datasets relate, to detect neighbourhoods with poor food-related health conditions, and possibly to uncover other effects visible in the processed data.

The process consisted of writing a Python script that processes the data with the pandas library, clustering the data with the K-means method, and then visualizing the results in the Datawrapper tool. This makes it possible to present the findings in an interactive and clear way, so that anyone can easily understand them.

The process

First, from the dataset of inspection results, I kept only the records that fall in the same time period as the second dataset of 311 complaints, which was provided in the task assignment.

As a result, both datasets describe the situation in November and December 2022.
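The filtering step can be sketched with pandas as follows. This is only a minimal illustration: the column names (`zipcode`, `inspection_date`, `grade`) and the sample rows are assumptions, not the actual schema used in main.py.

```python
import pandas as pd

# Hypothetical sample of inspection records; real column names may differ.
inspections = pd.DataFrame({
    "zipcode": ["10001", "10001", "11201", "10451"],
    "inspection_date": ["2022-10-30", "2022-11-15", "2022-12-31", "2023-01-02"],
    "grade": ["A", "B", None, "C"],
})
inspections["inspection_date"] = pd.to_datetime(inspections["inspection_date"])

# Keep only inspections from November and December 2022.
# The interval is half-open: start is included, end is excluded.
start, end = pd.Timestamp("2022-11-01"), pd.Timestamp("2023-01-01")
in_period = (inspections["inspection_date"] >= start) & (inspections["inspection_date"] < end)
filtered = inspections.loc[in_period].reset_index(drop=True)

print(filtered)
```

Using a half-open interval avoids accidentally dropping records from the last day of December when the dates carry a time component.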

Second, I calculated the % of poor inspections (grade B or C) out of the total count of inspections in each ZIP area. This normalizes the values for the varying size of ZIP areas.

Because some of the records did not include a grade, I dived into the documentation. I decided not to exclude records without grades, because they simply represent a different stage of inspection, and limiting the dataset further would hurt the validity of the results.

A similar approach was applied to the other dataset. Here I expressed the food-related complaints as a share of all health-related complaints. I chose this approach to reduce the impact of unrelated reports; in high-traffic areas, for example, many of the reports concern traffic.
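The percentage calculation, including the decision to keep ungraded records in the total, might look like the sketch below. The column names and sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical inspection records; ungraded rows (None) stay in the denominator.
inspections = pd.DataFrame({
    "zipcode": ["10001", "10001", "10001", "10001", "11201", "11201"],
    "grade": ["A", "B", "C", None, "A", None],
})

# Flag poor results (grade B or C); a missing grade counts as "not poor".
inspections["is_poor"] = inspections["grade"].isin(["B", "C"])

# The mean of a boolean column is the fraction of True values per ZIP area.
pct_poor = (
    inspections.groupby("zipcode")["is_poor"]
    .mean()
    .mul(100)
    .rename("pct_poor")
)

print(pct_poor)
```

Because ungraded records remain in the total, a ZIP area with many ungraded inspections gets a lower percentage than it would if those rows were dropped.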
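A sketch of the same idea for the complaints, under the assumption that the ratio is food-related complaints divided by all health-related complaints in a ZIP area. The complaint type labels and the two category sets are illustrative, not the ones used in main.py.

```python
import pandas as pd

# Hypothetical 311 complaints; the complaint type labels are illustrative only.
complaints = pd.DataFrame({
    "zipcode": ["10001", "10001", "10001", "11201", "11201", "11201"],
    "complaint_type": [
        "Food Poisoning", "Rodent", "Traffic",
        "Food Poisoning", "Food Poisoning", "Noise",
    ],
})

# Assumed category sets: which types are health-related, and which of those are food-related.
HEALTH_TYPES = {"Food Poisoning", "Rodent"}
FOOD_TYPES = {"Food Poisoning"}

# Drop unrelated reports (traffic, noise, ...) so they cannot dilute the ratio.
health = complaints[complaints["complaint_type"].isin(HEALTH_TYPES)].copy()
health["is_food"] = health["complaint_type"].isin(FOOD_TYPES)

pct_food = (
    health.groupby("zipcode")["is_food"]
    .mean()
    .mul(100)
    .rename("pct_food_complaints")
)

print(pct_food)
```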

After processing the data (each logical step is explained in the main.py file alongside the script itself) and running the clustering, I plotted the clusters on a 2D chart.
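The clustering step could look roughly like this. The use of scikit-learn is an assumption (the article only states that K-means was used), and the feature values are made up. The cluster numbers in the findings (0 to 12) suggest 13 clusters; the sketch uses 3 to keep the sample small.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical per-ZIP features: % poor inspections and % food-related complaints.
features = pd.DataFrame(
    {
        "pct_poor": [0.0, 2.0, 1.0, 40.0, 42.0, 41.0, 5.0, 6.0, 4.0],
        "pct_food_complaints": [30.0, 32.0, 31.0, 5.0, 4.0, 6.0, 5.0, 4.0, 6.0],
    },
    index=["z1", "z2", "z3", "z4", "z5", "z6", "z7", "z8", "z9"],
)

# A fixed random_state makes the cluster labels reproducible between runs.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
features["cluster"] = kmeans.fit_predict(features[["pct_poor", "pct_food_complaints"]])

print(features)
```

Both features are percentages on the same 0 to 100 scale, so the sketch skips feature scaling; with differently scaled inputs, standardizing them first would usually be advisable.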

The scatter plot analysis

After plotting the data and colouring the clusters accordingly, several interesting facts emerge immediately:

  1. There is a link between these two quantities. Where people report food-related gastrointestinal illness, it is often reflected in the percentage of poorly graded restaurants. This is clearly visible in the blue region of the plot (clusters 1, 3, 8, 11, 12).
  2. We see one-dimensional extremes both in the % of poor inspection results and in the % of food-related health complaints. People in clusters 2 and 4 report problems with food more often, but inspection results turn out slightly better. In contrast, clusters 5 and 9 show a markedly high share of poor inspection results but a lower share of complaints.
  3. There is a non-negligible number of cases (clusters 0, 7, 10) where people report gastrointestinal problems but the inspections do not reflect that. Interestingly, many of these cases have 0% poor results.

These are my main findings from the plot. Together with the linear trend line, we can also see at a finer level that restaurants in clusters 6, 8 and 9 are, combining these two metrics, in worse health condition than the others. However, some of the findings may be biased by limited data and high variance.
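The mismatch described in the third point can be isolated with a simple filter. This is a sketch on made-up values; the column names are assumptions.

```python
import pandas as pd

# Hypothetical per-ZIP metrics.
zips = pd.DataFrame({
    "zipcode": ["10451", "10452", "11201", "10301"],
    "pct_poor": [0.0, 0.0, 25.0, 10.0],
    "pct_food_complaints": [20.0, 0.0, 30.0, 0.0],
})

# ZIP areas with food-related complaints but no poor inspection result at all.
mismatch = zips[(zips["pct_poor"] == 0) & (zips["pct_food_complaints"] > 0)]

print(mismatch)
```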
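A linear trend line like the one mentioned above can be fitted with an ordinary least-squares line. The sketch below uses NumPy and invented values; how the original chart computed its trend line is not stated, so treat this as one possible approach.

```python
import numpy as np

# Hypothetical per-ZIP values: % poor inspections (x) and % food-related complaints (y).
pct_poor = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
pct_food = np.array([5.0, 9.0, 16.0, 19.0, 26.0])

# Fit y = slope * x + intercept by least squares (polynomial of degree 1).
slope, intercept = np.polyfit(pct_poor, pct_food, deg=1)

# A positive residual means more complaints than the trend line predicts.
residuals = pct_food - (slope * pct_poor + intercept)

print(slope, intercept)
print(residuals)
```

Points far above or below the line are the ones where complaints and inspection results disagree the most.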

Still, the cluster analysis showed us the three main categories of data we are dealing with. Time to project them onto the map.

The geographical representation

Here is a map of NYC at the resolution of ZIP areas. In the blue-shaded regions, the relation between food-related complaints and worse inspection results is strong. In these areas (mainly Staten Island and Brooklyn), the % of poor inspections tracks the % of food-related health complaints. This doesn't necessarily mean there are bad restaurants in these neighbourhoods; rather, it shows us that these areas are not flying under the radar.

This is in stark contrast to the red-shaded regions with hatching, where there are records of food-related complaints but the restaurant inspection results do not reflect them. Additionally, the count of inspections in these parts is significantly lower, which may indicate that they are being overlooked. These areas are concentrated mainly in the Bronx and the border areas of Queens.

The yellow-shaded areas represent the ones with notable extremes. There is no clear evidence that could explain this behaviour. Nevertheless, the cluster 2 areas (where the % of complaints exceeds the % of poor inspections) are concentrated in eastern Queens, neighbouring the red-clustered areas.

Conclusion

The objective of the task was to find out how these datasets relate, to find areas with good and poor restaurants in terms of food-health conditions, and eventually to find something interesting in the data.

Although correlation analysis would be more appropriate for quantifying the relation itself (which was not the objective of this task), the clustering still showed that there is a link between these two datasets.

But there is wide variance across ZIP areas. Thus, we cannot clearly say that any part of NYC has markedly worse restaurants. The clusters where such a situation is observed do not form a larger geographical unit. And, as already mentioned, data limitations play a role too.

The situation is different in the red areas, where an insufficient inspection rate is observed.

From these datasets alone, we can hardly find an explanation for this trend. The first step to eliminate possible bias would be to extend the time period, as two months is too short to draw such a conclusion.

If this doesn't help, another option would be to link the results with economic data. Parts of the Bronx and Queens are generally perceived as poorer.

There could also be bias among inspectors. Although there are rules for how different violations are graded, it is still people who apply the rules.

But again, all of this is just speculation, as the two selected datasets cannot tell us more.
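For completeness, a correlation analysis on the per-ZIP metrics could be sketched as follows. It was not part of this task; the values and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical per-ZIP metrics.
zips = pd.DataFrame({
    "pct_poor": [0.0, 10.0, 20.0, 30.0, 40.0],
    "pct_food_complaints": [5.0, 9.0, 16.0, 19.0, 26.0],
})

# Pearson correlation (the default of Series.corr) measures linear association.
pearson = zips["pct_poor"].corr(zips["pct_food_complaints"])

# Spearman correlation is the Pearson correlation of the ranks;
# it measures monotonic association and is less sensitive to outliers.
spearman = zips["pct_poor"].rank().corr(zips["pct_food_complaints"].rank())

print(pearson, spearman)
```

With only two months of data and high variance between ZIP areas, a rank-based coefficient may be the more robust of the two.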

This project was created as a task solution for Digital Talent Lab at MUNI