Category Considerations: Assigning Categories to Factual’s Global Places

One of the hardest attributes to pin down in our Global Places data is our Category Labels field. This is where we describe what a place is or what service it provides. It’s what tells you that LaRocco’s Pizzeria is a Pizza Restaurant, whereas Rocco’s Tavern, a mere 200 feet down the street, is both a Sports Bar and a Pizza Restaurant.

Building data is complex in and of itself, and a number of issues that affect categories specifically make it harder still. Categorizing a business is often a subjective endeavour, and it can be extremely difficult to guess a place’s category from location- or name-based cues alone. For example, it is difficult to know that Public School 310 is a restaurant, as opposed to a school, solely by looking at the name.

We work hard to build high quality data, so our customers receive correct information about the places they care about. Having said that, when you provide data for over 90 million places across 50 countries, there will be some warts. Since categorization poses such a thorny problem, some of these imperfections rise to the surface from time to time. With that in mind, we decided to show some of these challenges, explain why they arise, and describe how we address them so that our data is always of the highest quality.

Where Categories Come From

In order to understand where challenges in categorization come from, it’s important to first understand how categories are assigned to places. We build data by gathering information from multiple sources (feedback from different partner apps, trusted data contributors, data from the web, etc.). Pieces of information about the same place are called inputs, and a key part of our process is determining which value is correct if inputs from different sources have conflicting values.

For categories, every source has its own taxonomy, or way of defining how a business or point of interest fits in the world. Likewise, we have our own taxonomy at Factual. So a preliminary, essential part of our category process is mapping all of those source taxonomies onto ours. For example, in our taxonomy, pizza restaurants fall under Social > Food and Dining > Restaurants > Pizza. Other sources can express this in a seemingly limitless number of ways, such as restaurants - pizza, pizza restaurants, pizzerias, pizza places, and pizza-restaurants, all of which must point to the correct node in our taxonomy. We have 466 categories in our taxonomy, other sources can have even more, and we are constantly gathering information from new sources, so this mapping must be continually maintained to stay accurate and up to date. We maintain it both by adding specific taxonomy mappings one at a time and via a machine learning-based approach that can add thousands of mappings at a time (developed with one of our former interns, Sarah Krasnik).
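As a rough illustration of the mapping step, here is a minimal sketch in Python. It assumes a simple lookup table keyed on a normalized source label; the names `SOURCE_LABEL_TO_NODE`, `normalize`, and `map_label` are hypothetical and are not Factual’s actual implementation.

```python
import re

# Hypothetical mapping table: normalized source label -> node in our taxonomy.
PIZZA = "Social > Food and Dining > Restaurants > Pizza"
SOURCE_LABEL_TO_NODE = {
    "restaurants pizza": PIZZA,
    "pizza restaurants": PIZZA,
    "pizzerias": PIZZA,
    "pizza places": PIZZA,
}


def normalize(label: str) -> str:
    """Lowercase the label and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()


def map_label(label: str):
    """Return the taxonomy node for a source label, or None if it is not mapped yet."""
    return SOURCE_LABEL_TO_NODE.get(normalize(label))


print(map_label("restaurants - pizza"))
print(map_label("pizza-restaurants"))
print(map_label("hookah lounge"))  # unmapped label
```

Normalizing before the lookup lets spelling variants such as “pizza restaurants” and “pizza-restaurants” share one entry, while a label with no entry simply returns `None` rather than a guessed category.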

The next step in the process is analyzing the data gathered from all of our millions of unique sources to build our places dataset. Once we have identified the many data points that refer to a single place, we algorithmically determine the most factual representation of that place. To select which categories get assigned to our places, we systematically consider the category information from all of the inputs associated with each place to discover which category or set of categories is most likely correct. Since places can often be represented by more than one category (think bar + restaurant), we surface up to three categories per record. We also keep a small list of categories that can co-occur (one place can be labeled both a gas station and a convenience store, for example), so that we don’t end up with bizarre groupings like Pharmacy + Museum (a place can be either a pharmacy or a museum, not both).
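The selection step can be sketched as follows. This is only an illustration under stated assumptions: it uses a plain vote count over input categories, which the post does not claim is the real algorithm, and `ALLOWED_PAIRS` and `select_categories` are hypothetical names.

```python
from collections import Counter

# Hypothetical allowlist of category pairs that may appear on the same place.
ALLOWED_PAIRS = {
    frozenset({"Gas Station", "Convenience Store"}),
    frozenset({"Bar", "Restaurant"}),
}
MAX_CATEGORIES = 3  # up to three categories are surfaced per record


def select_categories(input_categories):
    """Pick up to MAX_CATEGORIES labels by vote count, honoring the allowlist."""
    votes = Counter(input_categories)
    chosen = []
    for category, _count in votes.most_common():
        # Keep a category only if it may co-occur with everything already kept.
        if all(frozenset({category, kept}) in ALLOWED_PAIRS for kept in chosen):
            chosen.append(category)
        if len(chosen) == MAX_CATEGORIES:
            break
    return chosen


print(select_categories(["Gas Station", "Convenience Store", "Gas Station"]))
print(select_categories(["Pharmacy", "Pharmacy", "Museum", "Pharmacy"]))
```

In this sketch the most-voted category always wins, and a lower-ranked category is added only when the allowlist says it can sit alongside the ones already chosen, so Pharmacy + Museum never appears together.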

Category Confusions

Here are a few examples of particularly tricky place categorizations that we have seen in our data and how we caught and fixed them.

Bethsaida Seventh Day Church

One of the obvious problems associated with categories is simply not having one. When they are unsure of what a place is, some location services fill in the blank with a non-informative category label such as “local business” or “establishment.” We prefer to leave the label off altogether if we cannot provide a meaningful one (it’s worth noting that this is rare; over 97% of our places in the US have a category assigned). This can happen either when we can’t get reliable information about what a place is, or when the description of the place has not been mapped to our taxonomy yet. When Bethsaida Seventh Day Church first surfaced in our data, it was a case of the former problem.

Name | Category Before | Category After
--- | --- | ---
Bethsaida Seventh Day Church | null | Churches

Explanation

You might be thinking: “Why don’t you just look at the name? You can plainly see that it’s a church!” It turns out that assigning categories based strictly on business names is an unreliable approach. While it’s true that “church” often shows up in the names of churches, it also shows up in the names of other things, such as the radio station Wxmc-1310 AM-New Jerusalem Church Lines. So, categorizing based on the word “church” would get that wrong. Now you might be thinking something along the lines of: “But it will work if Church is the first or last word. You could use regular expressions, right?” However, even when you restrict the search with these types of rules, you’ll still run into things like Church & Dwight Co., which manufactures cleaning products, or Thomas Carton Church, an ophthalmologist.

Name | Category
--- | ---
Wxmc-1310 AM-New Jerusalem Church Lines | Media
Church & Dwight Co. | Manufacturing
Thomas Carton Church | Ophthalmologists
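To make the failure mode concrete, here is a small Python sketch of the “first or last word” rule applied to the names above. The rule and the helper `guess_is_church` are hypothetical and exist only to illustrate the point; the categories are the ones listed in the tables.

```python
import re

# Naive rule: "church" appears as the first or the last word of the name.
CHURCH_RULE = re.compile(r"^church\b|\bchurch$", re.IGNORECASE)

# Names and their actual categories, taken from the examples above.
actual_categories = {
    "Bethsaida Seventh Day Church": "Churches",
    "Wxmc-1310 AM-New Jerusalem Church Lines": "Media",
    "Church & Dwight Co.": "Manufacturing",
    "Thomas Carton Church": "Ophthalmologists",
}


def guess_is_church(name: str) -> bool:
    """Return True if the naive name rule would label this place a church."""
    return CHURCH_RULE.search(name) is not None


# Places the rule labels as churches even though their real category differs.
false_positives = [
    name
    for name, category in actual_categories.items()
    if guess_is_church(name) and category != "Churches"
]
print(false_positives)
```

The rule correctly skips the radio station, but it still flags the manufacturer and the ophthalmologist, which is why name cues alone are not enough.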

When businesses without categories pop up in our data, they typically get categories assigned quickly since we have a dynamic categorization system in place that allows us to make changes at any time to rectify any problems. We add new taxonomy mappings every month, which leads to both adding new categories and correcting existing errors. On top of that, we’re constantly adding new sources to continually improve the quality of our places data. With new data and new category mappings, we assign meaningful categories to Factual places on an ongoing basis.

Vape Star

The growing popularity of vaping has led to a rise in vape-associated businesses, e.g. retail stores and vape bars, where people can gather to vape together. One such establishment, Vape Star, was initially mislabeled as providing home improvement services.

Name | Category Before | Category After
--- | --- | ---
Vape Star | Home Improvement | Tobacco

Explanation

As mentioned above, one of the challenges in assigning categories is mapping the components of our sources’ taxonomies onto our own. The problem for Vape Star is that one of the sources provided a category of “pipe and smoker”, which we had incorrectly mapped to our home improvement node. This happened because, in many cases, “pipe” legitimately shows up in descriptions of home improvement businesses. For example, businesses like Southland Pipe and Star Pipe Products, which sell supplies for pipe-specific construction, are often described using just the words “pipe” or “pipes”. So you can see how “pipe and smoker” could be grouped with those other descriptions. In cases like this, we simply update that mapping in our system, and any businesses with this description automatically get assigned to the appropriate Factual category.
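The effect of correcting a single mapping can be sketched like this. The table, the per-place source labels, and the `categorize` helper are hypothetical, chosen only to mirror the Vape Star example; they are not Factual’s actual data or code.

```python
# Hypothetical mapping from source labels to Factual categories.
label_to_category = {
    "pipe": "Home Improvement",
    "pipes": "Home Improvement",
    "pipe and smoker": "Home Improvement",  # the incorrect mapping
}

# Hypothetical source label attached to each place (an assumption for illustration).
place_labels = {
    "Southland Pipe": "pipe",
    "Star Pipe Products": "pipes",
    "Vape Star": "pipe and smoker",
}


def categorize(places, mapping):
    """Look up each place's source label in the current mapping table."""
    return {name: mapping.get(label) for name, label in places.items()}


before = categorize(place_labels, label_to_category)

# Fix the one bad mapping; no per-place edits are needed.
label_to_category["pipe and smoker"] = "Tobacco"
after = categorize(place_labels, label_to_category)

print(before["Vape Star"], "->", after["Vape Star"])
```

Because places are categorized through the mapping rather than labeled individually, one corrected entry fixes every place that carries the same description, while the legitimate pipe suppliers keep their category.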

Conclusion

Assigning the correct categories to places is deceptively tricky, but it is exactly the type of problem that our engineers love to take on at Factual. We embrace the challenge and allot a considerable amount of time and resources towards ensuring that our Global Places data continue to have the most accurate and comprehensive category coverage possible.
