Reports are expected to organize data consistently, but genealogical location data is not consistently encoded. The familiar way to ensure that like locations are grouped together is to provide a column for city, county, state and so on. Familypedia's location hierarchy is currently
- country
- subdivision1
- county
- locality
- street address (repeated)
- locality
- county
- subdivision1
Such classifications are deliberately simplified so that encoding is easy and quickly comprehensible for naive users. However, because they are simplified, they often break even for common examples. For example, up to now "locality" has served as a synonym for hamlets, settlements, towns, cities and megalopolises. For the large percentage of the global population living in large cities, a single locality field is insufficient. If an event occurred in wikipedia:Williamsburg, Brooklyn, would a report that included this event ever show the value "Brooklyn"? What would the column title be? How about for Williamsburg? Is London a city? The naive user would say yes. Other contributors would point out that when people nowadays say London, they generally mean "Greater London", which in formal governmental terms is classified as a county, not a city. Shanghai and St. Petersburg are administratively classified as subdivision1's. Prior discussions have pointed out classification dilemmas such as these, including some that violate set logic, such as the problem that some locations do not follow strict containment hierarchies. Williamsburg, for example, is in Kings County, New York State, but other parts of New York City are not in Kings County. Other genealogy sites have not handled this well. WR, for example, doesn't even categorize Williamsburg as being in New York City or Brooklyn. Many would consider that an error.
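The non-strict containment problem can be pictured as a graph in which a place may have more than one parent, rather than a single chain of ever-larger units. What follows is only a minimal Python sketch under that assumption, not Familypedia's or SMW's actual data model. The `PARENTS` mapping and the `ancestors` and `located_in` functions are hypothetical names, and the handful of entries are illustrative.

```python
from collections import deque

# Hypothetical child -> parents mapping (illustrative entries only).
# A place may list several parents, so this is a graph, not a strict tree.
PARENTS = {
    "Williamsburg": ["Brooklyn"],
    "Brooklyn": ["Kings County", "New York City"],
    "Manhattan": ["New York County", "New York City"],
    "Kings County": ["New York (state)"],
    "New York County": ["New York (state)"],
    "New York City": ["New York (state)"],
    "New York (state)": ["United States"],
}


def ancestors(place):
    """Return every place that contains `place`, following all parents."""
    seen = set()
    queue = deque(PARENTS.get(place, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already reached through another parent
        seen.add(current)
        queue.extend(PARENTS.get(current, []))
    return seen


def located_in(place, container):
    """True if `place` is `container` or lies inside it via any parent path."""
    return place == container or container in ancestors(place)
```

With this shape, `located_in("Williamsburg", "Kings County")` and `located_in("Williamsburg", "New York City")` both hold, while `located_in("Manhattan", "Kings County")` does not, and none of the answers depend on whether a place is labelled a city, a county or a borough.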
Familypedia is not attempting to create an ultimate classification scheme. We are driven by the requirements of our typical visitors. Let's look at what they would expect:
- Visitors will want to query for the term London or Shanghai and get a hit. We already support this because a query can ask for birth location, and it doesn't matter how London or Shanghai was classified to get a return.
- Visitors don't want to know about overlapping categories. They need to be able to search for births in a county regardless of whether the birth is in a city that has portions in multiple counties. SMW already supports this. The visitor does not need to know whether Qingpu is classified as a county or a district or something else. If they search on birth location they get a hit on it regardless of how it is classified.
- Do we expect to have reports that list Los Angeles, London, Shanghai and Moscow in the same tabular column? If yes, then we should ignore cases where official administrative status contradicts user expectation of what a city is.
- Would Brooklyn or Lichtenberg ever appear in a tabular column? If so, we need to support a locality subdivision for "boroughs" (NYC, Berlin) or "districts" (Shanghai).
- Would Williamsburg, Karlshorst, or Zhujiajiao (a 1,700-year-old town that is part of the district of Qingpu, itself part of the municipality of Shanghai) ever appear in such a tabular column? If so, we need to expand our classification scheme to support a locality subdivision2 for neighborhoods (Williamsburg, Karlshorst), alongside the locality subdivision for boroughs (Brooklyn, Lichtenberg).
- Do users expect to have tables with the wikipedia:Regions of England listed in a column? Similarly for the 333 Prefectures of China? If so, we need to add subdivision2's.
- Considering the above questions, we have invented 3 additional location fields bringing the total up to 8 (9 if you count landmarks) that contributors need to understand. Can we reasonably expect our contributors to understand the restrictions of all these categories of place, much less methodically review all of their contributions and enter the correct values for each field for each life event? If this is an unrealistic expectation, then practicality dictates that we introduce a less formal method that can be used for generation of such values.
Note that the automated option only needs to be more thorough than, and as accurate as, typical contributors. This is a low threshold, so even rudimentary generation would be acceptable for first iterations, as long as the generation of placenames can be rerun. This is the foundation of the rationale for having some location values be owned essentially by bots, even though it is fully recognized that no bot can ever be as good as a trained human classifier. Our escape clause is that locations can have a lock-out property so that bots will no longer manipulate them.
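To make the rerun-plus-lock-out idea concrete, here is a rough Python sketch. It assumes each event location carries a boolean lock flag. `EventLocation`, `regenerate` and `toy_classify` are hypothetical names invented for illustration, and the lookup table stands in for whatever classifier a bot would actually use.

```python
from dataclasses import dataclass, field


@dataclass
class EventLocation:
    place: str
    # Generated column values, e.g. {"county": "Kings County"}.
    fields: dict = field(default_factory=dict)
    # Hypothetical lock-out flag: once set, bots leave this record alone.
    locked: bool = False


def regenerate(events, classify):
    """Re-run `classify` on every unlocked event; return the number updated."""
    updated = 0
    for event in events:
        if event.locked:
            continue  # human-owned value, never overwritten by the bot
        event.fields = classify(event.place)
        updated += 1
    return updated


def toy_classify(place):
    """Stand-in classifier backed by a tiny illustrative lookup table."""
    table = {
        "Williamsburg": {"locality": "New York City", "county": "Kings County"},
    }
    return dict(table.get(place, {}))
```

Because `regenerate` overwrites only unlocked records, a better classifier can be swapped in and rerun at any time without disturbing values that a contributor has corrected by hand and locked.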
Automatic location classification will rely on en:wikipedia as the primary authority. However, many placename articles lack structured information in the form of infoboxes, and among those that do use infoboxes, the classification is not uniform. ~ Phlox 00:03, 14 July 2009 (UTC)