Changes: Forum:Gedcom bot and Rtol's improvements to Yewenyi's process

Latest revision as of 09:17, 27 April 2021


Most of this is valuable background reading; however, further discussion should be on Forum:Regarding importing GEDCOM data, quoting sections of this forum wherever appropriate. -- Robin Patterson (Talk) 03:55, March 26, 2013 (UTC)

Is there any interest in a Gedcom bot? The idea is that folks would place their Gedcom data in an otherwise blank article. It would get approved or disapproved for upload; then all those approved would get articles auto-created for them using the Gedcom data and a standard article structure. -~ Phlox 00:08, 29 September 2007 (UTC)

Good idea! -AMK152 (Talk • Contributions) 01:25, 29 September 2007 (UTC)

OK, any proposals for standard components to add, chime in. I figure:

Person template
Children of each spouse (if multiples) in a subpage so that it can be included into each parent's page.
By option, an ancestor tree using Ahnentafel template
Surname template
Notes section with <references/>
Template:Stub-incomplete inviting folks to update/ fix the article.
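A rough sketch of how a bot might assemble the components listed above into wikitext. The template names (Person, Ahnentafel, Surname, Stub-incomplete) come from the list; the template parameter names, the "/children with ..." subpage convention and the `build_article` helper are assumptions made up for illustration, not the actual Familypedia template signatures.

```python
def build_article(person, with_ahnentafel=False):
    """Return wikitext for one person page (illustrative layout only)."""
    parts = []
    # Person template: the parameter names used here are hypothetical.
    parts.append(
        "{{Person|name=%s|birth=%s|death=%s}}"
        % (person["name"], person.get("birth", "?"), person.get("death", "?"))
    )
    # Children of each spouse sit on a subpage so that both parents' pages
    # can transclude the same list.
    for spouse in person.get("spouses", []):
        parts.append("{{:%s/children with %s}}" % (person["title"], spouse))
    # The ancestor tree is optional.
    if with_ahnentafel:
        parts.append("{{Ahnentafel|root=%s}}" % person["title"])
    parts.append("{{Surname|%s}}" % person["surname"])
    parts.append("==Notes==\n<references/>")
    # Invite readers to update or fix the generated article.
    parts.append("{{Stub-incomplete}}")
    return "\n".join(parts)
```

Keeping the whole layout in one function would make it easy to change the standard structure once people agree on it, before any bulk run.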

We'll see if I can scrounge some time up for this, but it seems like this would help us bulk up pretty dang fast. No promises though. The wife has been packing around a list of things to do around the house...

~ Phlox 01:49, 29 September 2007 (UTC)

Categories are a big one too. -AMK152 (Talk • Contributions) 02:15, 29 September 2007 (UTC)

Yes, of course cats. I think an Ahnentafel can be generated for each individual, as a separate "Family tree" tab. The Ahnentafel template could be modified to autocollapse, but that will be a tedious task I will leave for a rainy day. It would be nice to record not just ancestors but descendants, with links from any node to the tree page of the other individuals- this could easily be generated by such a program.

Building on the Past? Upcoming Gedcom changes

The Gedcom spec is surprisingly archaic- e.g.: "Logical GEDCOM record sizes should be constrained so that they will fit in a memory buffer of less than 32K. GEDCOM files with records sizes greater than 32K run the risk of not being able to be loaded in some programs." Jeez. Programs from which decade? I don't think we have to run on Robin's Atari (or was it Amiga?) computer anymore, do we? ~ Phlox 17:07, 4 October 2007 (UTC)

Commodore 16. 12kB total memory if using BASIC. — Robin Patterson (Talk)

LDS announced that they are moving their sites and apps completely to XML based Gedcom. That is nice, because at the most rudimentary level, character encoding will not use this weird non unicode thing that Gedcom is using. (The very latest draft 5.5.1 Gedcom spec proposed UTF-8, and it isn't even approved yet? Jeez I guess all those folks using characters with double bytes aren't interested in genealogies. Sheesh.) But the semantic representations are awfully primitive. Not only are there the weaknesses in representing ambiguous information, the way they are representing the data is not leveraging standards in content representation such as the microformats that are being supported by searchers and browsers. Anyway, I would much rather build a parser to input a format that is going to be around a few years. It looks like everyone is giving Gedcom the boot.

So the strategy would be to pick some format that is one of the best bets for what will replace Gedcom (either a Gedcom iteration like Gedcom XML, or some other thing). These proponents usually write a converter app- so we still support old gedcom but by first running the converter over it- then running the real parser.

Anyone have any opinion about which of these new formats is best, feel free to chime in here. ~ Phlox 17:28, 4 October 2007 (UTC)

Question about Brian's Java Gedcom thing

(copied from Forum:How we encode our data)

You mentioned the GEDCOM bot - I hope that's Brian's Help:Loading Gedcoms program and I hope you can

make it more usable (or the instructions clearer) for those of us who are not sure whether we could import or use all the necessary Java etc
tweak it so that it produces pages that conform with your above ideas or any variations of them that might achieve consensus

Robin Patterson 11:21, 5 October 2007 (UTC)

As a matter of principle, I think all high volume bots should use the Pywikipedia framework. My quick look at it implies he wants to do managed conflict resolution, which is a good thing. With a more populated network of ancestors, we will want to have a tool so that folks can do uploads on their own, perhaps as part of a .JS wikimedia tool extension. Writing code for my personal purposes is one thing- writing it for end users requires about 100 times more effort. As a former pro programmer, I make no exaggeration here. Handling errors gracefully, thinking about UIs that make sense, dealing with the special cases- it is just a son of a gun of a task. Maybe if I get really crazy about genealogy I might do it, but for now, it is just a hobby and I can't invest that kind of time. ~ Phlox 23:53, 5 October 2007 (UTC)

You seem to have bypassed or leapfrogged Brian's program. I think I approve highly. So I can toss my 13,000-strong GEDCOM into a near-page that you will name or otherwise set up, and PhloxBot will create pages for everyone except that if there is a matching page it will merely add the info at the bottom and leave a note saying data should be checked and integrated? Robin Patterson 13:59, 8 October 2007 (UTC)

I don't know that I have leapfrogged his thing because I have not examined what it is. He was leveraging some GEDCOM code. If I maintain my attention, I will be leveraging a GEDCOM library for python. It may not have all the 5.5 support I want, but I am comfortable translating from other computer languages. Anyway, there is something to be said for programs that don't progress until you resolve the conflict. Errors tend to cascade.
We could add the conflicting info a couple different ways- maybe create a subpage or subpages with "Gedcom alternates?" The way ancestry does it, they create a separate tree per gedcom upload, and you can jump between them. They tell you how many nodes agree on certain information. I think the voting means nothing- just because someone downloaded some junk and had it in a GEDCOM does not mean they checked it. In fact, the opposite is usually the case. I don't think we want to junk up the article space that way.
Let's think hard about where we want to record the apparent duplicates/ slightly conflicting info. I just threw out the subpage idea, but I am sure there must be a few other good ways to do this. The nice thing about a subpage is that the junkiness of the info need not muck up the site- the tab only displays if a person was interested (viewing could be enabled by monobook setting). Also, I am uncertain how well this bot will detect duplicates. If there is an AFN or a Genealogics Person or Fam ID, then we are good to go. Otherwise we are in murky territory with a certain J. Jones born 1923 or 1924 in New York City. I wouldn't go too far down that road of ad hoc hard coded rules- It's a mess to do things that way. The right way would be to write a bot with bayesian belief network support. Anyway, for all I know I will lose all interest/ will have to change projects in a few months, but for the meantime I'll code it how folks agree they want it, and we can try it out for a while and make adjustments as we learn what gets used and what doesn't. The nice thing about the bot is that if I keep a log of the additions I make, and mark them properly with comments, I (or another bot operator) can easily back them out. ~ Phlox 16:57, 8 October 2007 (UTC)

Duplicate detection and handling

Phlox may be over-pessimistic there. J. Jones born 1923 or 1924 in New York City will be translated into a page [[J Jones (c1923-?)]] (or similar, depending on how Phloxbot is told to resolve year-ambiguities) and given a birth decade category and a surname category. People interested in individuals with all or some of those data elements may browse those categories, or search the site, and see apparent near-duplicates probably quite close together. If not, no harm done. If another gedcom has the same vaguely-described person it should go on the same page, as I suggested above (and as I think Brian's program would have done for it). The page may then look a bit ugly until one of the people interested in it (eg, the submitter of one of the gedcoms) looks at it and does the necessary integration (or separation if the individuals are in fact different). Robin Patterson 13:45, 9 October 2007 (UTC)
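The naming idea above can be sketched roughly as follows. The rule "ambiguous year becomes c + earliest candidate" and the category name patterns are assumptions for illustration; the thread leaves both open ("or similar, depending on how Phloxbot is told to resolve year-ambiguities").

```python
def page_title(given, surname, birth_years, death_year=None):
    """Build a title like 'J Jones (c1923-?)' from possibly ambiguous years."""
    if not birth_years:
        birth = "?"
    elif len(set(birth_years)) == 1:
        birth = str(birth_years[0])
    else:
        # Assumed rule: an ambiguous year becomes 'c' + the earliest candidate.
        birth = "c%d" % min(birth_years)
    death = str(death_year) if death_year else "?"
    return "%s %s (%s-%s)" % (given, surname, birth, death)


def categories(surname, birth_years):
    """Surname category plus birth-decade category (names are hypothetical)."""
    cats = ["[[Category:%s (surname)]]" % surname]
    if birth_years:
        decade = min(birth_years) // 10 * 10
        cats.append("[[Category:Born in the %ds]]" % decade)
    return cats
```

With these assumptions, `page_title("J", "Jones", [1923, 1924])` gives `J Jones (c1923-?)`, and two uploads describing the same vague person would land on the same title and in the same two categories, where a human can spot them.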

OK, the proposal is to err on the side of creating a new article, unless the bot is certain that it is a duplicate (e.g. identical fully specified birth and death dates, or identical AFN or Genealogics Person ID). Then a real genealogist comes along and does whatever is necessary to resolve the collision.
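A minimal sketch of that "certain duplicate" test, under stated assumptions: records are plain dicts, dates are GEDCOM-style strings, and the field names (`afn`, `genealogics_id`, `name`, `birth`, `death`) are hypothetical. Requiring the name to match as well as the dates is an extra assumption not spelled out in the proposal.

```python
def fully_specified(date):
    """True for 'DD MON YYYY' dates such as '12 MAR 1923'."""
    parts = (date or "").split()
    return len(parts) == 3 and parts[0].isdigit() and parts[2].isdigit()


def is_certain_duplicate(a, b):
    """True only when two person records are certainly the same individual."""
    # Rule 1: a shared external identifier (AFN or Genealogics Person ID).
    for key in ("afn", "genealogics_id"):
        if a.get(key) and a.get(key) == b.get(key):
            return True
    # Rule 2: same name plus identical, fully specified birth and death dates.
    dates_ok = all(
        fully_specified(r.get("birth")) and fully_specified(r.get("death"))
        for r in (a, b)
    )
    return (
        dates_ok
        and a.get("name") == b.get("name")
        and a.get("birth") == b.get("birth")
        and a.get("death") == b.get("death")
    )
```

Anything that fails both rules (for example a record with only "ABT 1923") is left for a human to resolve rather than merged automatically.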

The enormity of resolving such collisions is a little staggering though. You have your 15K gedcom file, and jeez- if this thing became a popular site, we are going to have hundreds of folks with such huge files, and much of it duplicated. So if the record is a complete identical, I don't have to add a duplicate. But think about it. If someone has added any info to J. Jones (c1923-??) and the very same Gedcom data comes in again, I am forced to create yet another J. Jones (c1923-??). How does the bot know if it is the same J. Jones? I suppose maybe the bot would know if we store the original GEDCOM record for the INDI in a subpage for comparison.

This might not be a bad idea, because as I think about it, this would help if we ever want to do GEDCOM export. It would reduce the headache of data fidelity that has been troubling me- that is- for cases where we don't store all the info the gedcom has in it. Data fidelity demands you do every feature the other guys are doing, no matter how ill-structured. Well, if you just keep it in the Gedcom file, you just poop that back out.
Yeah- another benefit of doing this would be that the author could refer to the gedcom to second guess/ error check the bot. Also to copy the stuff that it is not yet understanding (maybe old Gedcom format- bad encoding from one vendor- etc.)
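One way the stored-record comparison could work, as a sketch only: keep the raw INDI record text on the subpage and compare a fingerprint of the incoming record against it. The helper names are hypothetical, and the whitespace normalization is an assumption about what counts as "the very same Gedcom data".

```python
import hashlib


def normalize_record(raw):
    """Collapse whitespace noise so identical INDI records compare equal."""
    lines = [" ".join(line.split()) for line in raw.splitlines()]
    return "\n".join(line for line in lines if line)


def record_fingerprint(raw):
    """Stable hash of a normalized raw GEDCOM record."""
    return hashlib.sha256(normalize_record(raw).encode("utf-8")).hexdigest()


def is_reupload(incoming_raw, stored_raw):
    """True if the incoming record matches the one stored on the subpage."""
    return record_fingerprint(incoming_raw) == record_fingerprint(stored_raw)
```

Because the comparison is against the stored original rather than the edited article, later human edits to the page would not cause the same upload to spawn a second J. Jones. It would not catch the same person exported from a different program with different cross-reference IDs.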

Unrelated to Gedcom bot- I see it might be helpful for such maintenance to maybe create a category of possible collision individuals as a separate bot operation- but this would not be integrated with gedcom because such collisions are a general case.

Reactions? -~ Phlox 17:21, 9 October 2007 (UTC)

It all sounds workable. I suggest a trial using my gedcom: there will be duplicates with some of my 20-odd relatives whom I've placed here and maybe some of the others whom other people have placed here. Robin Patterson 01:53, 18 November 2007 (UTC)

However, you've misrepresented my approach. I said "If another gedcom has the same vaguely-described person it should go on the same page". Create a new page only if fairly certain it's not the same person (not "if not certain it is the same" or "err on the side of creating a new article, unless the bot is certain that it is a duplicate"). Recent discussions suggest a subpage rather than adding to the bottom of the original page; that would be: (1) less messy; and (2) possibly easier to merge if merger were found to be correct. — Robin Patterson (Talk) 04:34, 26 May 2009 (UTC)

Just my opinion

then TWO people's opinions

Well, it's not surprising really, a lot of people would say I am very opinionated....

I'll point out a couple of things about my knowledge or lack to start: when people are writing about creating the bot I'm lost. Not my field, I don't really understand except the fact that there's problems. The other important thing you should know is that I came to genealogy after I discovered wikis. This actually is an important point because, quite frankly, I would not have started my work on here if I'd found the other genealogy sites first. All my family history was on paper. Then I found this wiki and typed in a heap of info. Then I found a genealogy program I could put on my computer (Gramps - I use Linux at home). Then I found out about the gedcom standard. Then I found websites that could take my ged file and transform it. Then I realised how much work I've done on here.....

I have to say I'm not particularly attracted to a site I have to type everything into. I'll leave my work here, but updating it doesn't seem a huge priority right now.

To tell the truth Familypedia seems antiquated now that I've discovered I can simply upload my files to other websites. I can see problems with the other sites as well, especially as other people can't really contribute to the work I've done. Then again I don't see anyone who shares my family history interested in typing information in here when they are all simply uploading their data.

And I disagree that gedcom itself is a problem or archaic. Or that others are dropping it. It seems to be the basis for almost every genealogy on computer and is used to convert to various website formats. Why would you bother with a double conversion? Just convert from gedcom to whatever you want. The fact that it is so simple is what makes it so useful. Anyone who uses a gedcom standard program can share their info. We can send the ged files to each other and load them in our separate programs. Even across platforms. Everyone I know uses windoze but I can use their ged files in my Linux program. How cool is that? Using the ged standard anyone who has the knowhow and time and inclination should be able to write a conversion program to transfer the data into almost any format. Seems to me. Correct me if I'm wrong ...... Jayoval 01:05, 12 January 2008 (UTC)
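To illustrate how simple the format is to read: each GEDCOM line is a level number, an optional cross-reference ID such as `@I1@`, a tag, and an optional value. The following is a minimal sketch of a line parser and record grouper; it ignores character-set handling and CONT/CONC continuation lines, and the function names are made up for this example.

```python
import re

# level, optional @XREF@, tag, optional value
LINE = re.compile(r"^\s*(\d+)\s+(?:(@[^@]+@)\s+)?(\w+)(?: (.*))?$")


def parse_line(line):
    """Split one GEDCOM line into (level, xref, tag, value)."""
    m = LINE.match(line.rstrip("\r\n"))
    if m is None:
        raise ValueError("not a GEDCOM line: %r" % line)
    level, xref, tag, value = m.groups()
    return int(level), xref, tag, value or ""


def records(lines):
    """Group parsed lines into top-level records (each starts at level 0)."""
    current = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        parsed = parse_line(line)
        if parsed[0] == 0 and current:
            yield current
            current = []
        current.append(parsed)
    if current:
        yield current
```

For example, `parse_line("1 NAME John /Jones/")` returns `(1, None, "NAME", "John /Jones/")`. Converting from this structure to wiki pages, or to any other format, is then a matter of walking each record.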

These are all valid concerns. I think the beauty of using a wiki in genealogy is that it enables a community to contribute to information that you add... or you contribute to others. If you simply wish to publish your info and have it out there, then a wiki is probably not for you. The problem with this approach is that "stovepipes" are created; the resulting data is only inter-related within a single GEDCOM upload. If you just wish to edit your own genealogy and not worry about the rest of the world, then there are a number of stand-alone programs that can help you. If, on the other hand, you wish to contribute a piece to the puzzle, then I think wikis are the way to go. I'm not saying that one approach is better than the other, and I certainly hope you don't feel I am criticizing you for your choices. It all depends on why you wish to research your ancestry in the first place.

I do dispute your claim that Familypedia is antiquated. I feel it is quite the opposite. I will admit that the GEDCOM to Wiki conversion is currently cumbersome. That is only because using a Wiki is somewhat of a radical departure to the old "GEDCOM created locally/exported to web" paradigm. That really is the crux of this discussion: How to leverage all the research (creation of GEDCOM files) that has been done to make the site something extremely useful? Currently, it has a very large barrier to entry. I'll admit I have only created 3 entries and have over 700 sitting in my GEDCOM.

The problems described above with respect to conflict resolution are best resolved on a site such as this. If a tool such as the GEDCOM bot can detect potential conflicts/duplicates, it can highlight them to the community. Once the community is aware, the contributors can communicate to resolve those conflicts in an appropriate manner. Even in the absence of an automated conflict detector, in the form of a GEDCOM bot, having many eyes on the data can certainly help and the Wiki gives an easy means of modifying pages to resolve. Conflicts will be resolved where people are interested in the data.

I do have concerns with the potential volume of conflicts/duplicates that could get created with automated GEDCOM bots. It may be too much for the community to keep up with. Can it be throttled in some way? Can it be distilled (instead of showing conflicting individuals, show conflicting branches?) It seems very complex and as Phlox said, being able to invest the time will be a challenge. Wwf jr 20:38, 7 February 2008 (UTC)

Later developments

See Category:Wikia help for some useful background reading about GEDCOMs and info pages.

Discussion into early 2009 is on Help talk:Loading Gedcoms, but active discussion is probably better here because people can more easily notice when a forum page has been changed. — Robin Patterson (Talk) 00:54, 21 April 2009 (UTC)

Caution

Before we mass create articles with a GEDCOM bot, I strongly advise that we get standardization complete. Second, when we have a GEDCOM bot, we should test it. Have the bot create a few articles, then have contributors inspect the articles. Keep doing this until we have it as perfect as we can get it. It would be very painful to fix 100,000 articles AND info pages. -AMK152 (talk • contribs) 02:03, 9 April 2009 (UTC)

I agree 100%, and I think Phlox has said enough to indicate that he would too. — Robin Patterson (Talk) 03:35, 9 April 2009 (UTC)

New version of GEDCOM upload

I adjusted User:Yewenyi's code to make an info page and a person page from a standard GEDCOM file. As argued above, this should be thoroughly tested before it is widely applied. There is no immediate danger, as the software still requires copying and pasting; that is, the info page is generated but not uploaded.

To test this, follow the instructions at Help:Loading Gedcoms. Then, overwrite the contents of two files, compile, and run.

Note that most of the code is hocus-pocus. Only two things matter:

    code += "This adds a line to the info page.\n";
    nameCode += "This adds a line to the person page.\n";

Rtol 16:50, 10 April 2009 (UTC)

This does not look good. It looks much better in the edit window. Rtol 16:52, 10 April 2009 (UTC)

Demo (results only, not program)

Starts at Klaas BALTUS (?-?). Four people in as many minutes, without typing anything. Rtol 09:56, 11 April 2009 (UTC)

How to get this program

Drop me a note on my talk page and I'll get it to you. Rtol 19:01, 19 April 2009 (UTC)

AWB

(first paragraph copied from another forum)

If I can maintain my focus on this project, I envision adding gedcom input to AWB. The idea is that a Familypedia contributor would be able to make sure the article was reasonable before adding it, in the same way that AWB currently works. In this way we will be able to avoid the problems on other genealogy sites with shoveled content. We want a human check (no obvious clones of other individuals, no mangling of critical data like dates, locations or names) before accepting it. Our competitors can't do it because it requires human eyeballs on these problems, and they don't have dedicated collaborators. We do. ... We shall prevail. -~ Phlox 16:28, 25 May 2009 (UTC)

Exactly. I've updated Yewenyi's WikiFormatGenerator, which makes info pages from GEDCOM files. The GEDCOMs I found are all rubbish, and I end up spending much time repairing the data. rtol 05:03, 26 May 2009 (UTC)

I didn't intend in any way to slight the good work that you and Yewenyi have done with the gedcom thing. In fact it is superior to AWB because it is more multiplatform. AWB is slightly multiplatform in that many Wikipedia folks run it on a Mac with compatibility software, but theoretically a Java approach is cleaner. So it would be possible to add SMW queries in that app in order to do error checking. But I say "theoretically" in regard to the Java/GDBI thing because an application requires a lot of work, and AWB is strongly maintained by a dedicated cadre of decent developers. So in my view, our tool has to be an augment to some existing thing that has lots of continued support from developers. At the end of the day, it was a toss-up between AWB and GRAMPS. Since AWB is something we need for bot runs anyway (no one except me seems to like pywikipedia), it is AWB now, maybe Gramps later. -~ Phlox 22:32, 26 May 2009 (UTC)

