Familypedia


How we encode our data


Info Pages

This is about metadata. Feel free to skip the theory and get to the meat under "Using /info subpages".

HOLD IT! Just when most of us have learned how to use info pages, there may be an even better way (unexpectedly earlier than User:Phlox believed possible 19 months ago), using forms or something a little different from info pages but with the same idea: instead of writing skeleton biographies with lines like "Married: 1897 at London", you can put the 1897 and the London in their own boxes and sit back while the software arranges them sensibly and sometimes even gets the info from somewhere else so that you don't have to type it again. Read up on info pages below the first heading, but also please have a look at Help:Semantic MediaWiki. I know (from using XML in advanced searching) the sorts of marvellous things it can do, but I don't know how yet. Phlox doesn't know exactly either, but he's slaving away at it and deserves heaps of admiration. -- Robin Patterson (Talk) 13:54, 12 May 2009 (UTC)


There are some in the wiki and internet community who advocate representing information in a way that computers can evaluate. This movement is relevant to genealogy enthusiasts because, unlike most genealogy programs, a wiki has none of the database-like features for reusing information. With genealogy programs and "databasey" genealogy sites like Genealogics, when you change a birthdate for a subject, every page using that birthdate automatically gets the change. Such information sharing makes it easier to keep the database tidy. Beyond this housekeeping advantage, declaring such information formally has high-tech advantages, including the ability to make inferences from it. Genealogy information is simpler than more general domains of information representation, so problems such as identifying inconsistencies do not involve complicated logic. E.g.:

  1. Joe was married to Mary.
    • The date of this event was X.
  2. Child Y's mother was Mary.
    • The source of this idea is A.
  3. Child Y's mother was Jane.
    • The source of this idea is B.

Situations with conflicting information, such as item 2's version of the truth and item 3's alternate view, are well known to anyone who has dabbled even briefly in genealogy research. Strict genealogy systems sometimes have problems representing inconsistent or ambiguous information, but wikis have no such constraints. The two approaches are not mutually exclusive. In some future genealogy wikia, such ambiguous and contradictory information could alter the probabilistic confidence of particular views of a family history, using techniques like Bayesian inference.
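To make the idea concrete, here is a minimal Python sketch of how contradictory claims like items 2 and 3 could be flagged. The tuple layout, the `SINGLE_VALUED` set and the `find_conflicts` name are assumptions made for illustration only; nothing like this exists on the wiki, and no confidence weighting is attempted.

```python
from collections import defaultdict

# Hypothetical claim records: (subject, relation, value, source).
claims = [
    ("Joe", "spouse", "Mary", None),
    ("Child Y", "mother", "Mary", "A"),
    ("Child Y", "mother", "Jane", "B"),
]

# Relations for which a person can have only one true value (an assumption).
SINGLE_VALUED = {"mother", "father"}


def find_conflicts(claims):
    """Return {(subject, relation): {value: [sources]}} for contradictory claims."""
    grouped = defaultdict(lambda: defaultdict(list))
    for subject, relation, value, source in claims:
        if relation in SINGLE_VALUED:
            grouped[(subject, relation)][value].append(source)
    # A conflict exists when one subject has more than one distinct value.
    return {
        key: dict(values)
        for key, values in grouped.items()
        if len(values) > 1
    }


conflicts = find_conflicts(claims)
```

Because each value keeps its list of sources, both views stay on record with their provenance instead of one silently overwriting the other.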


Gedcom represents some of this information, but LDS's goal[1] for Gedcom was that it be a format for exporting or importing data to various programs or internet sites, nothing more. Gedcom 6.0 (XML) format continues to confine itself to that goal, and explicitly states this in its draft spec.[1]


In the wiki community, many infobox templates record information conforming to the hCard microformat. This sort of encoding can potentially support a superset of the information that Gedcom 6.0's XML format will support. If we wish to follow that sort of direction, the Java program that converts Gedcom 5.5 to Gedcom 6.0-like XML, or alternatively to Resource Description Framework (RDF) format, might be of interest. Further information on the program and discussion of the issues for such semantic representation of genealogical information may be found on Jay Askren's site.


Another method of encoding metadata for a person has been advanced by the Biography project. They use "Persondata" information contained in commented text placed at the end of the article. The advantage of this is that it is unobtrusive: no one is required to use infoboxes for their articles. The disadvantage is that if people don't see the information in the resulting article, there is no incentive to keep it consistent with other information in the article. When push comes to shove, I think that sooner or later we will have some kind of standard infobox to normalize the appearance of articles.


What does the hassle of conforming to such templates or supporting these hidden blocks of information buy us? Well, brushing aside all the gee whiz applications of semantic databases, our genealogy wikia would eventually benefit from very practical features, such as the simple idea that it allows information to be shared between articles. Meta's Semantic Mediawiki extension supports encoding data in a central way that can be accessed anywhere in the wiki. It looks like normal wikitext. For example, a person article for Joseph Hester might mention a fact as part of the body of the text of the article:

...Joseph's parents were [[father is::Elias Hester (c1832)]] and...

Previously, one would make a link to another article. With semantic markup, you are asked to do just a little more work so that the semantic relation is indicated (father is). Any time this information is updated, everyone who wants the change can get it. E.g. I have a family tree, and for one of the cells I can hardcode Elias Hester, or I can simply put

[[father of::Joseph Hester (c1858)| ]]
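As a rough illustration of why such annotations are machine-readable, the Python sketch below pulls property/value pairs out of wikitext written in the `[[property::value]]` form shown above. The regular expression and the `extract_annotations` name are assumptions for illustration; this is not how the Semantic MediaWiki extension itself parses pages.

```python
import re

# Matches [[property::value]] or [[property::value|label]].
ANNOTATION = re.compile(r"\[\[([^:\[\]|]+)::([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def extract_annotations(wikitext):
    """Return a list of (property, value) pairs found in the wikitext."""
    return [
        (match.group(1).strip(), match.group(2).strip())
        for match in ANNOTATION.finditer(wikitext)
    ]


text = "A tree cell: [[father of::Joseph Hester (c1858)| ]]"
pairs = extract_annotations(text)
```

Ordinary links such as `[[Joseph Hester (c1858)]]` contain no `::` and are ignored, which is what lets the annotated and plain styles coexist on one page.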

Some of this stuff is working today (see the example for California at the ontology semantic wiki page [2]). When it matures, it is surely something that future contributors to Genealogy wikia will want to begin to use. Note that it is just another wikitext operator, and it doesn't impose any radical demands on authors. It can be ignored by the majority of contributors, but I expect it will gradually gain many converts simply due to the time savings. It can be used in an evolutionary way, and I expect the transition will be fairly gradual, with a mixture of hardcoding and re-using data. This will suit wikia managers very well, because the server load created by complex templates using such queries is not well understood. It could be that caching will make it a non-issue, but note that data dependencies are multiplied. Change the father is:: relation declared for William the Conqueror, and you could potentially invalidate the cached copies of hundreds and hundreds of pages using this information. It's also impossible to predict what the issues will be with vandals. The same issue arose when Wikipedia first started (the objection was that if users are allowed great power, they will abuse it). Come to think of it, I think the nobility said the same thing about allowing the rabble to vote. Anyhow, a gradual transition allows everyone to learn and adapt.
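The cache concern can be pictured with a small Python sketch: every page that reads a fact is remembered, so a change to that fact yields the list of cached pages that would need regenerating. The `DependencyIndex` class and the page names are hypothetical; how the wiki software actually tracks this is not described here.

```python
from collections import defaultdict


class DependencyIndex:
    """Tracks which pages read which (article, key) facts. Hypothetical helper."""

    def __init__(self):
        self._readers = defaultdict(set)

    def record_use(self, page, article, key):
        """Note that `page` displays the value of `key` taken from `article`."""
        self._readers[(article, key)].add(page)

    def pages_to_invalidate(self, article, key):
        """Pages whose cached copy is stale once this fact changes."""
        return sorted(self._readers.get((article, key), set()))


index = DependencyIndex()
index.record_use("Tree page 1", "William I", "Father")
index.record_use("Tree page 2", "William I", "Father")
index.record_use("Tree page 2", "William I", "Spouse")
stale = index.pages_to_invalidate("William I", "Father")
```

With hardcoded values the index stays empty and nothing is invalidated, which is one way to see why a gradual mix of hardcoding and re-use keeps the load predictable.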


Other explorations of interest:

  • Microformats and genealogy information [3]
  • Inline queries using Semantic mediawiki extension [4]
  • Meta's article on the extension: Semantic MediaWiki


For the foreseeable future, we cannot predict how the data representation formats of the genealogy community will evolve, and can only adapt along with them. At some point, it is inevitable that Genealogy wikia will have a data mass sufficient to earn us a seat at the table so that we may positively influence such evolution.


For the near term, we should encourage folks to encode information using standard templates such as Template:Person. This will help the future upgrade of the data to representations such as the above.


Secondly, an important point was made by Askren on his page. A fundamental issue with data interchange is deciding whether Person A in an input file is the same individual as a Person B who has the same name, birthdate and birth location but a different parent. Jay Askren noted that globally unique identifiers have been in use by LDS for some time to deal with this. Askren considered using AFNs (Ancestral File Numbers) for that purpose. The problem he noted is that the mechanism for creating new ones is controlled by the LDS organization, and it is not clear how open that process is to other contributors. Perhaps it is no big deal: if LDS would parcel off authority for ranges of numbers and trust other organizations (e.g. genealogy wikia) to see that they are being used properly, that seems like it would be acceptable.


Another proposal is to carry our own globally unique identifiers (GUIDs, pronounced "gooeed"). Data import programs would specify the AFNs if they are passed in a Gedcom file, but as part of the import we would also specify our own unique identifiers for all new records. E.g. when we start exporting data from genealogy wikia, we make a pass over all articles and generate GUIDs for them using something like a GUID from a site like this. And we just periodically update the persondata (or alternatively template:person) UID field of all new pages with these GUIDs. A bot would also periodically resurrect any inadvertently deleted GUIDs. These GUIDs are not typically displayed, but are used for matching when importing/exporting data and when looking up data.
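The periodic pass could look something like the following Python sketch, which fills in a GUID only where one is missing. It uses the standard library's `uuid.uuid4()`; the record layout and the `assign_guids` name are assumptions, and restoring a deleted GUID to its previous value would additionally need a saved copy of the old value, which is not modelled here.

```python
import uuid


def assign_guids(records):
    """Give every record lacking a 'GUID' field a new random one; keep existing ones."""
    for record in records:
        if not record.get("GUID"):
            record["GUID"] = str(uuid.uuid4())
    return records


# Hypothetical records standing in for articles' metadata.
records = [
    {"name": "Person with an identifier", "GUID": "keep-me"},
    {"name": "Newly imported person"},
]
assign_guids(records)
```

Running the pass again changes nothing, so it is safe to schedule repeatedly.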


Which brings me to why I am thinking about any of this now. It is my intention to add an AFN and a GUID field to /info subpages of articles. The /info subpage holds a superset of the information in Template:Person. It supports Wikipedia's hCard metadata as well as the Persondata metadata that the Wikipedia Biography project is using. This is good stuff for drawing in new folks, because Google and other search engines recognize these fields.

Using /info subpages

This metadata approach for Genealogy allows authors to re-use data now, without any SQL queries or waiting for some unknown date when Semantic wiki extensions will arrive.

As of today, it is possible to do queries. E.g.:

{{get|William I, King of England (1027-1087)|key=Spouse}}
produces: Template:Get

Similarly,

Father is:Template:Get
Image is:[[Image:Template:Get|100px]]

Of course, if the "Get" occurs in the William article, the query is compact:

{{get|key=Father}}
gives:
  • Template:Get
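For readers who think in code, the behaviour of the get query can be approximated by the Python sketch below. The dictionary of pages, the sample names and the `get` function are stand-ins invented for illustration; the real template is written in wikitext and its internals are not shown here.

```python
# Hypothetical stand-in for the wiki: page title -> key/value data on its /info subpage.
INFO_PAGES = {
    "Joe Smith (1800-1870)/info": {
        "Father": "John Smith (1770-1830)",
        "Spouse": "Mary Jones (1805-1880)",
    },
}


def get(key, article=None, current_page=None, info_pages=INFO_PAGES):
    """Return the value stored under `key` on an article's /info subpage.

    If `article` is omitted, the page the query appears on is used,
    mirroring the compact form of the query.
    """
    title = article if article is not None else current_page
    if title is None:
        raise ValueError("need an article name or the current page")
    return info_pages.get(title + "/info", {}).get(key, "")


spouse = get("Spouse", article="Joe Smith (1800-1870)")
father = get("Father", current_page="Joe Smith (1800-1870)")
```

A missing key or a missing /info page yields an empty string in this sketch; what the actual template displays in that case is not stated above.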

This is not much longer than the syntax required for semantic wikitext, but semantic wikitext will be better for many reasons, not the least of which is that it is definitely simpler and more natural to specify. More importantly, such a future approach will be more robust. Currently, if someone moves an article, they have to remember to move the /info page as well. The advantage of /info subpages is that they deliver the bacon today, and support all the goodies of the microformats as well as the Persondata initiative. In my opinion, the GEDCOM bot should produce /info-oriented pages. That means this will not just be some rare thing: a substantial number of Genealogy wikia's pages could be using it.


Want to kick the tires? Take a look at the William the Conqueror example and simply create a /info subpage for any article. Or play with William's /info page. Give it a whirl.


One of the benefits of doing this encoding is that family trees will now update automatically, and with richer info. No more specifying all the levels of the tree, or searching for and updating all the fricking trees that might be affected by a newly found ancestor, or worse, fixing a mistaken parentage, an error that cascades to many cells of a tree. Since the entire tree can be inferred from the head node, all you have to do is plop a single Ahnentafel template with no parameters on the page and you are done. You need not specify anything, since it will assume the name of the article as the head node. This will all be very simple, unless you have moved the article 12 times because you keep renaming it to overspecify middle names, death dates etc. Whatever. Each time you fiddle with stuff this way, you must now move the metadata page too. To each his own.

Yes folks, we get rich trees without all the bother of cutting and pasting repetitious info- which of course no one will do unless they are super dedicated to a particular ancestor. Template:Ancestors2

Naturally, you can specify the start of the tree so that you can display the tree of any ancestor from another article. Eg. the example above was generated with:

{{Ancestors2|William I, King of England (1027-1087)|display=full}}

By omitting the display=full option, you get a compact tree without pictures and birth/death dates.
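How a whole tree falls out of a single head node can be sketched in Python. The sketch uses standard Ahnentafel numbering (the subject is 1, the father of person n is 2n, the mother is 2n+1) and follows the Father and Mother keys from one /info page to the next. The page data, the names and the `ahnentafel` function are hypothetical, and `levels` here means generations of ancestors above the head, which is an assumption about how the template counts.

```python
def ahnentafel(info_pages, head, levels=2):
    """Return {ahnentafel_number: article} for `levels` generations above `head`."""
    tree = {1: head}
    # Persons numbered 1 .. 2**levels - 1 have parents within the requested depth.
    for number in range(1, 2 ** levels):
        person = tree.get(number)
        if person is None:
            continue  # this branch of the tree is unknown
        info = info_pages.get(person + "/info", {})
        if info.get("Father"):
            tree[2 * number] = info["Father"]
        if info.get("Mother"):
            tree[2 * number + 1] = info["Mother"]
    return tree


# Hypothetical /info data: only parent links are needed to build the tree.
pages = {
    "Child/info": {"Father": "Dad", "Mother": "Mum"},
    "Dad/info": {"Father": "Grandpa"},
}
tree = ahnentafel(pages, "Child", levels=2)
```

Correcting a parent on one /info page changes every tree built this way the next time it is generated, which is the update-once benefit described above.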

Globalization: Reuse means that a lot of the drudgery of keeping various language versions in sync will now be removed. EG. slip the lang parameter in there, and you have:

Template:Ancestors2


For the purposes of these examples, I only supported the 2 level tree. I will fix the 6 level one in due course.

References: Indicating unique identifiers used by other sites is the perfect way of inserting a friendly link to our peer sites on the web. We are all working to preserve our ancestors' history, and by standing on each other's shoulders we work together toward that goal. Plus, Google values our hits higher the more links we have to high-value sites. So insert this template into your notes or links section:

{{get references}}
for William example gives:

Template:Get references Don't worry. Be happy.

~ Phlox 02:48, 5 October 2007 (UTC)

Notes

  1. ^ The church of Latter Day Saints (LDS) authored the Gedcom spec. It has been a great contribution to the community.

In MediaWiki 1.13, being introduced in September 2008, "The option to move subpages is presented as a check box at the bottom of Special:Movepage whenever you move a page, so you can choose whether or not to move subpages on a case-by-case basis. --KyleH@Wikia (talk) 04:20, 7 September 2008 (UTC)" (Kyle's note on Central Wikia)

Robin's first response

Haven't read every word above, but I think I've read enough to be highly appreciative. I'm somewhat familiar with GEDCOM format and with XML. Your ideas seem to be the right way to go, especially if it can remain compatible with larger organizations such as WP and LDS. My guess is that Bill will like it too.

You mentioned the GEDCOM bot - I hope that's Brian's Help:Loading Gedcoms program and I hope you can

  1. make it more usable (or the instructions clearer) for those of us who are not sure whether we could import or use all the necessary Java etc
  2. tweak it so that it produces pages that conform with your above ideas or any variations of them that might achieve consensus

Robin Patterson 11:21, 5 October 2007 (UTC)

response re Gedcom bot on Forum:Gedcom bot -~ Phlox 23:53, 5 October 2007 (UTC)

Project status

  • Things done
    • Project page with standard form and subpages for key definitions, technical notes etc. created.
    • Upgraded format, key naming to initial upper case, mirroring as much as possible the order of the old Person template.
    • Substitute of the Template:Person created that autoloads data from the info page. Template:Showinfo person.
    • Tab code changed to show info tab.
    • Small example network of nodes (William the Conqueror)
    • Mocked up an event info page for Lady Di's wedding: Diana, Princess of Wales (1961-1997)/marriage. I (~ Phlox) am not at all convinced there will be very much demand to encode this level of granularity at this point, other than to save off data recorded that way in a Gedcom file.


  • Near term things to be done:
    • Add the fam group to the data model, modelled as a children subpage with its own info. (This is Gedcom's abstraction of the group of children born of two parents.) At that point we will have a respectable model to load data into, since we will have both the INDividual and Fam structures.
  • Things deferred:
    • Event objects: I don't know about creating an abstraction for events as discussed in the Gedcom 6.0 spec. My main hesitation is that I want to go slow and not burden the servers with very fine-grained queries. It seems to me that encoding events and unique entities could present a significant load, so we should just ease into this and not get too fancy too fast.

-~ Phlox 17:18, 8 October 2007 (UTC)

Info pages are not vulnerable to renames of articles

Info pages access the info pages of other persons by the name of the article, not by a unique identifier. One might think that if someone renamed an article from Joe Smith to Joe H. Smith, they would have to go into all the info pages for the wife, children and parents of Joe Smith and update their article pointers to Joe H. Smith so that templates using the old name would not break. However, page redirects are cooler than that: if I access data in the info page, it works just fine even if I use the old name.


For example, the death date was added to the Princess Diana article using a move. But the templates using the old name still work. EG:

{{get|Diana, Princess of Wales (1961)|key=Father}}
works just fine, returning Template:Get

I'm not sure why folks would need or want to use AFNs, Genealogics IDs, or our GUIDs directly, but if we did, we could support this by simply redirecting the info page with that name to the article corresponding to the GUID. For example,

{{get|8b5a3e7c-70e4-4e5f-973d-d43ad2a7c541|key=Father}} 

will work because there is a page 8b5a3e7c-70e4-4e5f-973d-d43ad2a7c541/info that #REDIRECTs to Diana, Princess of Wales (1961-1997). Like I said though, I am not sure why we would want to use such cryptic names in place of the casual names we assign to articles. The casual names are no less robust than formal identifiers, so why use the identifiers at all (except for data matching applications, as I described elsewhere)?


This is all very cool, because it means the structured database features that info pages enable are a lot less fragile than one might suppose. -~ Phlox 23:47, 9 October 2007 (UTC)
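The redirect behaviour described in this section can be modelled with a short Python sketch: a lookup follows redirects from an old article name, or from a GUID-named page, until it reaches the page that actually holds the data. The `resolve` and `get` functions, the redirect table and the all-zero GUID are made up for illustration, and placing the redirects on the /info titles is an assumption; MediaWiki's own redirect handling is not reproduced here.

```python
def resolve(title, redirects, max_hops=10):
    """Follow redirects from `title` to the final page; guard against loops."""
    seen = set()
    while title in redirects:
        if title in seen or len(seen) >= max_hops:
            raise ValueError("redirect loop or chain too long: " + title)
        seen.add(title)
        title = redirects[title]
    return title


def get(key, article, info_pages, redirects):
    """Look up `key` on the /info page of `article`, following redirects."""
    title = resolve(article + "/info", redirects)
    return info_pages.get(title, {}).get(key, "")


# Hypothetical data: the article was moved from "Joe Smith" to "Joe H. Smith".
INFO_PAGES = {"Joe H. Smith/info": {"Father": "John Smith"}}
REDIRECTS = {
    "Joe Smith/info": "Joe H. Smith/info",
    # Made-up GUID alias pointing at the same info page.
    "00000000-0000-0000-0000-000000000000/info": "Joe H. Smith/info",
}

by_old_name = get("Father", "Joe Smith", INFO_PAGES, REDIRECTS)
by_guid = get("Father", "00000000-0000-0000-0000-000000000000", INFO_PAGES, REDIRECTS)
```

Old names, new names and identifier aliases all land on the same data, so pointers stored in relatives' info pages keep working after a move.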

Comment about AFNs

I would caution that AFNs are not actually GUIDs: someone like Mary Queen of Scots has at least 5 different AFNs, and I have found plenty of obscure people who have more than one. Thurstan 08:38, 8 September 2008 (UTC)