(Initial material was copied from Watercooler in September 2006.)
Brian Yap announces program, Sep 2005
Weary of manually converting my html pages into pages for this site, I have taken an easier, if somewhat less elegant solution of writing a program. As I only started it two days ago, it is still somewhat crude and is not picking up all of the gedcom details. If anyone else is interested I can supply the program. I really should publish it properly on Sourceforge, but at the moment that seems like a lot of effort. If any of you are interested in the program, with somewhat limited support, you can contact me. Yewenyi 11:01, 12 Sep 2005 (UTC)
Pleased responses; cautions
Could be very useful for some people. My family stuff, however, is on Family Tree Maker v 3.4 at home and uploaded to WorldConnect now and again, so I just paste or type from the latter for my new entries here after using the person's nearest relative as a starter for the page. Robin Patterson 22:24, 12 Sep 2005 (UTC)
Ok, I'm convinced. The vast quantity you just slammed up here was quite impressive. I'd be interested in seeing this code. We went from ~600 legit pages to almost 1,000 more than that in no time at all. Very nice. Jtc 07:38, 18 Sep 2005 (UTC)
Might have a bug. Check out http://genealogy.wikicities.com/index.php?title=Category:Minty_Surname&rcid=5201. It's a surname page about Minty, but contains a link to Bagnall instead of Minty. Jtc 08:02, 18 Sep 2005 (UTC)
- You are right about this one. As I use copy and paste, perhaps I accidentally pasted in the wrong one. Early on I was copying and editing. Only later did I write some code. Yewenyi
Too many entries? Ok, I know it's probably bad to suggest this, but this may be able to create more entries per day than is good. This 'Recent Changes' list only allows 500 entries to be seen, but this swamps that. There's no way now to see all the recent changes because there were so many. I'm not sure about the right solution. Jtc 08:02, 18 Sep 2005 (UTC)
- His program is marvellous: a new page every 25 seconds or so, for the half-hour I counted. But Jerrod, if you want to be good with that program it may pay to get really proficient in manipulating the address bar. Next time you reach that "500", look up and change the "500" near the end of the address bar to something bigger. (I thought you got a "Next 500" option anyway, but I see that you don't. Wikipedia doesn't either. Probably because, for that page in particular, it's a bit of a drain on the server, so the managers want to make it available only for people who are so keen that they go to the trouble of editing the address bar.) (8-D Robin Patterson 12:38, 18 Sep 2005 (UTC)
Loading Gedcoms is a page I have created to show more than perhaps belongs on this page. I will include some screen shots and the code. It is still a prototype and I cannot convince it to run standalone. I need to get out that Java manual to work out what I am doing wrong. Yewenyi
- GEDCOM importing is a must if you want the volume of data you are looking for (for technical-oriented people not familiar with GEDCOM, think of it as the XML for genealogy – in fact, GEDCOM 6, the XML version, is in draft mode). People interested in contributing the family trees of “ordinary” people most likely already have that data in a genealogy program and will be unwilling to manually copy it elsewhere (not to mention the transcription errors likely to creep in if they do). GEDCOM 5.5 (pre-XML) is the current standard that most or all genealogy packages support.
- However, if you let contributors submit vast quantities of data using GEDCOMs, you might end up with a bunch of unlinked family trees – and there are other web sites (such as RootsWeb) that already do this. Adding a few links to Wikipedia articles will not really add much value over RootsWeb. So the only value would be if you would strongly encourage people to match their contributions to existing ones (merging all information about one person onto a single page), so that the family trees are inter-linked. This is, in fact, where Wiki technology adds benefit – by allowing people to link their family trees together, and include reasoning for why they believe that one John Doe is the same person as another John Doe (or split them apart again if they have evidence to the contrary).
- I plan to send an e-mail to I Faqeer with more suggestions around this.
- (Those last paragraphs were from User:Janet Bjorndahl Revision as of 2005-10-03T12:39:52 (her first contribution; nothing more by June 2007))
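For readers who have never looked inside a GEDCOM 5.5 file, here is a minimal sketch in Python of how one line of it can be read. This is not Yewenyi's Java program; the names `GedcomLine` and `parse_line` and the sample record are made up for illustration. It assumes only the basic line layout of the format: a level number, an optional `@...@` cross-reference ID, a tag, and an optional value.

```python
from typing import NamedTuple, Optional


class GedcomLine(NamedTuple):
    level: int            # nesting depth: 0 starts a new record
    xref: Optional[str]   # record ID such as "@I1@", if present
    tag: str              # e.g. INDI, NAME, BIRT, DATE
    value: str            # rest of the line, "" if none


def parse_line(line: str) -> GedcomLine:
    """Split one GEDCOM line into level, optional xref, tag and value."""
    parts = line.strip().split(" ", 2)
    level = int(parts[0])
    if len(parts) > 1 and parts[1].startswith("@"):
        # Record header, e.g. "0 @I1@ INDI": the ID comes before the tag.
        rest = parts[2] if len(parts) > 2 else ""
        tag, _, value = rest.partition(" ")
        return GedcomLine(level, parts[1], tag, value)
    value = parts[2] if len(parts) > 2 else ""
    return GedcomLine(level, None, parts[1], value)


# Hypothetical sample record, not taken from any real file.
sample = [
    "0 @I1@ INDI",
    "1 NAME John /Doe/",
    "1 BIRT",
    "2 DATE ABT 1690",
]
for raw in sample:
    print(parse_line(raw))
```

A real importer would also have to handle character sets, continuation lines (CONC/CONT) and malformed input, which this sketch ignores.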
(Brian continued using his program until his last couple of contributions in December 2005, and did another dozen edits in 2008.)
One "Typical" User's Thoughts, 2007
I'm a recent contributor here on GeneWikia and an active and experienced contributor over on Rootsweb and Worldconnect. I've got tons of good data I'd love to add over here, but I am a Wiki-newbie, and finding the going very slow (but interesting, and much helped by folks like Robin and Bill). I also have plenty of HTML experience (but not Java or even CSS, just plain HTML). The GEDCOM2WIKI is definitely a must-- not only for people like me with some HTML experience, but for other researchers with much less technology experience. I just spent hours working on one PERSON page. Granted, it will take less time once I get the hang of it, and choose the best way to format it, etc. But even one hour per person page is going to be too much time for most contributors. I've got a full time job for goodness' sake! (It just so happens I'm on holiday this week.)
I'm glad to have found this page and this discussion. But I have a couple of concerns:
- The help page here indicates you have to have and run Java separately from the gedcom2wiki program. That requirement is not going to cut it for folks like myself. If I have to learn yet another program AND learn Java, too? I don't think so. But maybe you intended ultimately to bundle it all into one program. If not, I think you'll need to if you want most people to use it.
- (And this one really worries me. Warning: rant approaching) As much as I want a GEDCOM2WIKI tool, I also really really do not want to see happen to this wiki space what has happened over at WorldConnect. Between ancestry.com making it easy as pie for people to copy information willy-nilly (apologies for the U.S. slang), and the wonderful ease that WorldConnect allows for updating and uploading GEDCOMs, WorldConnect now has a slew of, excuse me, CRAP, online.
Maybe the hypothesis you're all working from here is that the GeneWikia community will clean up after itself and each other. But how many Robins and Bills and whoever elses is that going to take once you support the ability to easily upload THOUSANDS of pages that need cleaning? How do we make it easier for folks to add their data without turning this Wiki space into yet another genealogical out-house?
Respectfully (really, despite my language), Jillaine 15:31, 6 July 2007 (UTC)
- I agree with much of what Jillaine says there. Jtc and Janet also had doubts about aspects of the system. But at least here we have the authorship preserved, so that errors can be discussed with any unwitting "perpetrators" and can be fixed by later editors even if the initial uploaders have gone. Even the rubbish in a huge GEDCOM has grains of gold in it that may lead some researcher in the right direction. I hope someone can dig into Yewenyi's program and fix the errors he was talking about and make it useful for those of us who don't want to learn Java. Robin Patterson 12:18, 7 July 2007 (UTC)
- AND... WeRelate's GEDCOM2WIKI utility also enables the generation of THOUSANDS of pages and category names that may make finding things difficult, and cleaning up bad data also difficult. Sigh. But the interface ROCKS. Jillaine 18:08, 7 July 2007 (UTC)
- It's certainly easy. Too bad the initial size limit is specified only in the small print several pages in. — Robin Patterson (Talk) 05:08, 3 April 2009 (UTC)
Let's Talk About GEDCOM's
I believe for the good of this Wiki, we need an effective way for folks to add their GEDCOM's. The current version can apparently be used, but I'm not sure it's wise to use it in its present form. Things I'd like to see in a GEDCOM program include
- a) a mechanism that ensures that duplicate entries are not created and that existing articles are not overwritten
- b) a mechanism that "flags" articles with very similar names to existing articles.
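As a rough illustration of what (a) and (b) could mean in practice, here is a minimal Python sketch that compares a proposed page title with existing titles. The function name `similar_titles`, the 0.85 threshold and the sample titles are assumptions for illustration only; nothing like this exists in the current program. It uses the standard-library `difflib.SequenceMatcher`, which is a generic string comparison, not a genealogy-aware matcher.

```python
from difflib import SequenceMatcher


def similar_titles(new_title, existing_titles, threshold=0.85):
    """Return (title, score) pairs for existing pages that look like new_title.

    A score of 1.0 is an exact (case-insensitive) duplicate; anything at or
    above the threshold is a near miss that a human should look at.
    """
    new_key = new_title.casefold()
    hits = []
    for title in existing_titles:
        score = SequenceMatcher(None, new_key, title.casefold()).ratio()
        if score >= threshold:
            hits.append((title, round(score, 2)))
    # Strongest candidates first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Made-up page titles for illustration.
existing = [
    "Jon Doe (1690-1754)",
    "John Doe (1690-1754)",
    "Mary Smith (1800-1870)",
]
for title, score in similar_titles("John Doe (1690-1754)", existing):
    action = "duplicate: do not create" if score == 1.0 else "flag for review"
    print(title, score, action)
```

The threshold would need tuning against real page names; set too low it floods the uploader with false alarms, set too high it misses spelling variants.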
While I believe we need GEDCOM capability simply because such capability is part of what most people would expect from a genealogy site, I have to say that I personally am not a big fan of GEDCOM's, in wiki's or otherwise.
Here are my real concerns:
- I grant that if the object is to add as many pages as possible, GEDCOM dumps really shortcut the process.
- Gee, if one wanted to, with all the GEDCOM's out there available for downloading, one could easily input millions of pages in a matter of hours.
- ...If one wanted to. The shortcomings of that are:
- accuracy of data
- GEDCOM's are far too easy to appropriate. They tend to create genealogies that are not well thought out. They encourage "snatch and grab" genealogists, who seem to be mostly interested in creating large GEDCOM's rather than having an accurate understanding of their family history.
- problems with differences between different GEDCOM entries for the same person (if they differ, say in DOB, which one do you accept? or do you create different articles for every point of difference?)
On that last point, here's some data:
If you go to Ancestry, and search for John Houston bc 1690, ancestor of General Sam Houston, you'll get about 170 hits. If you pare that down by looking only at the entries with "Notes" (usually better quality and more complete), you get 60 entries.
I recently compared those 60 entries in terms of key differences. Here's what I found:
Given name:
He appeared under 4 different combinations of given names, including
- 57 John Houston
- 1 John Huston Houston
- 1 John McLung Houston
- 1 John Samuel Houston
His date of birth was given as:
- 4 1689
- 26 1690
- 15 1689-1690
- 2 1686-1690
- 12 c1690
- 1 Unstated
Place of Birth
- 27 Ireland
- 11 Antrim, Ulster, Ireland, and variants
- 5 N. Ireland
- 2 Ireland or Scotland
- 1 Lanarkshire, Scotland
- 5 Londonderry, N. Ireland
- 2 Scotland
- 2 Wigtonshire, Scotland
- 5 Unstated
Father's Name
- 21 John Samuel
- 18 John
- 4 Samuel
- 26 Unstated
Mother's Name
- 38 Margaret McLung
- 1 Esther Watson
- 6 Mary McLung
- 15 Unstated
Spouse's Name
- 32 Margaret Cunningham
- 27 Margaret Mary Cunningham
- 1 Mary Margaret Cunningham
These data give some perspective on the amount of variation encountered even for a single person. True, this particular person, as the ancestor of Gen. Sam Houston, is something of a genealogical magnet, and may have abnormally high amounts of misinformation.
(Side bar: This happens in part because people tend to try to fit the data to make their personal connection "make sense". They know they are connected to Gen. Sam 'cause "my momma told me so", but if their ancestor was a Samuel Houston who married a Mary Cunningham, and the most common data for Gen. Sam's ancestor says it should be John Houston and Margaret Cunningham---then they conclude that clearly their names were really John Samuel Houston and Mary Margaret Cunningham, thus making their connection fit the data. Yes, that does happen, and I can point to specific instances where DNA evidence supports that conclusion.)
On the other hand, this line is heavily-researched, so for the most part really outrageous variations have probably been eliminated.
So, what's the "right data" for John Houston ancestor of Gen. Sam Houston? Each one of those 60 entries has a GEDCOM. Which is right? Is any one GEDCOM entry exactly right? How can you tell? I'm sure each of the 60 people who entered that data thinks (or at least assumes) that THEIR version is correct---(the other 59, to the extent they disagree with them, must be wrong).
The answer to the last question is that you can't tell. The data in the typical GEDCOM is not amenable to verification and validation in the usual sense of the word. You can tell what the most commonly accepted data is, but deciding genealogical connections by popularity is not really the way to go about deciding what's the right data.
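To put a number on "most commonly accepted", here is a small Python sketch that takes two of the tallies listed above (date of birth and spouse's name) and reports the most popular value and its share of the 60 entries. The helper name `most_common_share` is made up for illustration. The result measures popularity only and says nothing about which value is right.

```python
from collections import Counter

# Tallies copied from the lists above (60 Ancestry entries with notes).
birth_dates = Counter({
    "1689": 4, "1690": 26, "1689-1690": 15,
    "1686-1690": 2, "c1690": 12, "Unstated": 1,
})
spouses = Counter({
    "Margaret Cunningham": 32,
    "Margaret Mary Cunningham": 27,
    "Mary Margaret Cunningham": 1,
})


def most_common_share(tally):
    """Return the most frequent value and the fraction of entries using it."""
    value, count = tally.most_common(1)[0]
    return value, count / sum(tally.values())


for label, tally in (("Date of birth", birth_dates), ("Spouse", spouses)):
    value, share = most_common_share(tally)
    print(f"{label}: {value} ({share:.0%} of {sum(tally.values())} entries)")
# Date of birth: 1690 (43% of 60 entries)
# Spouse: Margaret Cunningham (53% of 60 entries)
```

Even the "winning" date of birth is given by fewer than half of the entries, which is the point: a majority vote cannot substitute for primary sources.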
The only way to discriminate between all of these variations is the old-fashioned way of verifying and validating the data based on primary sources. Few of the GEDCOM's present in Ancestry do a very good job of indicating primary sources. Most rely on tertiary sources such as other people's GEDCOM's.
Bill 13:07, 9 July 2007 (UTC)
Robin's request for program feature
The idea of "uploading a whole bunch of "unsupported" GEDCOM's from the Web" is very far from anything I'm hoping for! Brian Yap (User:Yewenyi) was using his own GEDCOM when he wrote his program. Pages weren't created instantly; I think 17 seconds was about the fastest, which suggests that there was some manual input such as pasting a block of GEDCOM or derived material into something else.
I'd expect any enhancements to be along the lines that the program takes one person at a time, converts to our page name standard - First Middleifknown Last (YOB-YOD) - then adds the data at the bottom of an existing page if any, or creates a new page only if there isn't a match. Where there's a match, the new uploader and previous contributor(s) can be encouraged to talk about it and either
- merge if it's a real match, or
- create separate pages with the same name and dates but also some distinguishing feature, and a page - the basic First Middleifknown Last (YOB-YOD) - that lists them.
(The above was sent to dallan and Beth)
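The naming and append-or-create rule requested above might be sketched like this in Python. The function names `page_name` and `plan_upload` are hypothetical, and leaving an unknown year blank, as in "(1690-)", is only an assumption for the sketch, since the request does not say how missing years should be shown.

```python
def page_name(given, surname, born=None, died=None):
    """Build a title in the form 'First Middleifknown Last (YOB-YOD)'.

    Assumption: an unknown year is simply left blank, e.g. '(1690-)'.
    """
    names = " ".join(part.strip() for part in (given, surname) if part and part.strip())
    years = f"({born if born is not None else ''}-{died if died is not None else ''})"
    return f"{names} {years}"


def plan_upload(title, existing_titles):
    """Decide what to do with one person: never create a second page of the same name."""
    if title in existing_titles:
        return "append"   # add the data at the bottom and invite discussion
    return "create"       # no match, so a new page is safe


# Made-up example data.
existing = {"John Doe (1690-1754)"}
for person in (("John", "Doe", 1690, 1754), ("Jane Mary", "Doe", 1695, None)):
    title = page_name(*person)
    print(title, "->", plan_upload(title, existing))
```

An exact-title test like this catches only identical names and dates; near misses would still need the category checks or a similarity search.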
Mechanisms to "flag" similar names
On Bill's desideratum near the top above - "b) a mechanism ... that "flags" articles with very similar names to existing articles" - we have a manual version of that in our category systems: Surname, Birth year, Death year. When you put a page in a Surname category, click on the category: similar names will show up close together if the first name starts with the same few letters. Then look at the "Births in ..." category; as its recommended arrangement is by surname, people with similar names will be very close together if the surname has no markedly different spelling, eg Caspar/Kaspar. "Deaths in ...." is a third check. Between them, those should pick up most similarities. Sure, it's tedious, but at one or two clicks and maybe a scroll per check it's probably easier than trolling through several Ancestry.com search or microfiche or IGI index pages. For an extra "flag" if you like, try a Search. If those don't flag near-misses because the names and dates are all just too far apart to be noticed, maybe children of that person will match better and lead to a realisation that the parent should be looked at. Robin Patterson 04:49, 10 July 2007 (UTC)
GEDCOM upload - new trial
Other users make it work too. See User talk:Christin. Robin Patterson 09:08, 7 September 2008 (UTC)
- (Next couple of paragraphs started on a user talk page)
Is it possible to upload a GEDCOM file here, so that I don't have to enter each person by hand? Bergsmit 20:58, 29 March 2009 (UTC)
Very good question. Some contributors have done it. See Help talk:Loading Gedcoms. — Robin Patterson (Talk) 23:35, 29 March 2009 (UTC)
Looks like Fred has mastered LoadingGedCom. We'll double in size soon. Is there a bot to clean up? Rtol 10:30, 1 April 2009 (UTC)
Now you say Fred has mastered it!!! (Not an April Fool prank?) I hope you and he don't go overboard too fast, because much of the format that User:Yewenyi had is out of date and would not fit in well with our current page styles. Please continue discussion on Help talk:Loading Gedcoms and one of us can post an alert to the mailing list. — Robin Patterson (Talk) 12:30, 1 April 2009 (UTC)
Fred is moving at superhuman speed so I guess he has software support. I had a look at the software, but I do not understand it well enough to update it (and I have a day job unlike most of you guys). Perhaps we can ask User:Yewenyi? Rtol 12:42, 1 April 2009 (UTC)
Christin's pages showed that Yewenyi's output needs much updating to fit in with our current page styles. Is there any way one of you can print out on a wiki page ( <--- that one!) - in plain text if necessary - the coding or output or something that can let the rest of us see and improve it? (Some of us do not have new enough computers to run it for ourselves.) We did not have info pages when Yewenyi was active. The whole procedure might be very much more useful if it could be adapted to create them while updating other aspects of the output. — Robin Patterson (Talk) 12:43, 1 April 2009 (UTC)
I haven't mastered the gedcom problems and I don't have software help, so I am experimenting. Robin, if you are on speaking terms with Dallan Quass, might he help you with a program for automatically uploading a gedcom? Bergsmit 13:03, 1 April 2009 (UTC)
I fear that, without his help, it is impossible to upload here a gedcom file with 500,000 (five hundred thousand) descendants of Charlemagne! Bergsmit 13:07, 1 April 2009 (UTC)
I have the source text of WeRelate's gedcom upload; that software works perfectly there. I tried it here: their templates work here, but the source text does not. Perhaps there is somebody here who can understand and manage this? Fred Bergman 09:41, 2 April 2009 (UTC)
- WeRelate may be revising its process. See http://www.werelate.org/wiki/WeRelate_talk:Watercooler#New_GEDCOM_import_function_and_a_new_Sandbox_site_.5B27_March_2009.5D - second sentence repeats the warning several people have given here. — Robin Patterson (Talk) 05:21, 3 April 2009 (UTC)
- I thought Fred was using Yewenyi's program "at superhuman speed". What happened there? — Robin Patterson (Talk) 05:21, 3 April 2009 (UTC)
WeRelate has a good gedcom upload program, but (and I like that) they are perfectionists. Dallan is improving the upload procedure so that, during the upload process, each person and family is compared with existing ones and, where possible, merged immediately! The second improvement he is making is to the size limit: the maximum gedcom file was 5000 persons and will now be much bigger. Fred Bergman 06:20, 3 April 2009 (UTC)
New version
I think I now sufficiently understand User:Yewenyi's code. I should have time over Easter to adjust the software to make info and person pages from Gedcom files -- unless User:Christin beats me to it. Rtol 07:15, 6 April 2009 (UTC)
We'll get there. Rtol 07:18, 6 April 2009 (UTC)
- That's great news, Rtol! (Christin has been swamped with work on a dissertation, so is unlikely to do much here in the next few weeks.) I hope you can upload some of the code and/or output to Help:Loading Gedcoms/output/Yewenyi so that the rest of us can see and discuss things such as text that might need changing or conversely might demand info template changes. — Robin Patterson (Talk) 08:45, 6 April 2009 (UTC)
- Useful background reading includes Forum:Gedcom bot. — Robin Patterson (Talk) 00:13, 9 April 2009 (UTC)
GEDCOM upload under SMW
I think I begin to see how this is all coming together. Here's my outline of what should be in store for any user:
- 1. Upload .ged file
- Give it a name that the uploader and others will recognize, eg Robin13333.ged
- 2. Get invited to incorporate it
- Yes please! but realize that it will take several seconds per person (though need not be all in one edit session; see step "11")
- SMW makes working copy of GEDCOM and assigns date ranges for any dateless persons if possible from their parents or children
- 3. View first new person
- Facts extracted from GEDCOM and listed on left-hand side of screen
- 4. View result of search
- SMW checks facts (using criteria in notes to step 6); if no possible match, invites user to accept new page for person
- 5. If no match, accept as new page?
- SMW will include as "facts" everything extractable from the working copy of the GEDCOM, with "source" GEDCOM noted; user option of hiding some details if person actually or possibly still living; new page recorded as having been created by that user
- Or decline to have that person on separate page (and it will be removed from working copy of GEDCOM)
- 6. If any possible match(es), SMW will list it or them
- On right-hand side of screen, list all with same surname (extending via soundex if possible), at least one same given name or patronymic etc, and not inconsistent dates (e.g. if new person has no dates it will include all, but if new person has dates it will list only existing persons who could be the same); beside each listed name and dates will be three checkboxes: "Same", "Maybe", "Different"
- 7. For each listed person, choose one of "Same", "Maybe", "Different"
- If all get "Different", SMW will revert to step 5 for that new person
- If there's a (or conceivably more than one!) "Same person", SMW presents full facts of it or each, one at a time, to set against facts of existing person for final check
- 8. If "Same", then for each fact choose to add or discard
- This will be tricky; string facts can simply have the new facts appended, but others will have to have a mechanism for storing the alternatives somewhere.
- If more than one "Same", maybe the program can attach the new GEDCOM person to each page and note on each that all seem to be the same.
- 9. Final acceptance of merger
- Person gets facts merged or whatever, with original GEDCOM noted as "source"; page edits recorded as having been made by that user; and working GEDCOM and original GEDCOM tagged with new pagename of that person
- 10. If no "Same" finally accepted, repeat process for all the "Maybe".
- 11. Suspend process indefinitely at user's option (e.g. dinnertime or bedtime)
- Working GEDCOM remains in place; user's talk page gets a note advising of how to resume process
— Robin Patterson (Talk) 04:43, 5 June 2009 (UTC)
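The search criteria in step 6 of the outline above (same surname extended via soundex, at least one shared given name, dates not inconsistent) could be sketched as follows in Python. This is only an illustration, not anything SMW provides: the `Person` class, the function names and the two-year tolerance are assumptions, and the Soundex routine is a plain implementation of the usual American Soundex rules.

```python
from dataclasses import dataclass
from typing import Optional

# American Soundex digit for each coded consonant.
CODES = {c: d for d, group in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                               "4": "L", "5": "MN", "6": "R"}.items() for c in group}


def soundex(name: str) -> str:
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    digits, prev = [], CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":          # H and W do not separate repeated codes
            continue
        code = CODES.get(c, "")
        if code and code != prev:
            digits.append(code)
        prev = code            # vowels reset prev, so repeats across them count
    return (letters[0] + "".join(digits) + "000")[:4]


@dataclass
class Person:
    given: str
    surname: str
    born: Optional[int] = None
    died: Optional[int] = None


def years_compatible(a, b, tolerance=2):
    """Unknown years never rule a match out; known years must be close."""
    return a is None or b is None or abs(a - b) <= tolerance


def possible_matches(new, existing, tolerance=2):
    new_given = {g.casefold() for g in new.given.split()}
    matches = []
    for p in existing:
        if soundex(p.surname) != soundex(new.surname):
            continue
        if not new_given & {g.casefold() for g in p.given.split()}:
            continue
        if years_compatible(new.born, p.born, tolerance) and \
           years_compatible(new.died, p.died, tolerance):
            matches.append(p)
    return matches


# Made-up existing pages for illustration.
pages = [Person("John", "Huston", 1689), Person("John", "Houston", 1750),
         Person("Mary", "Houston", 1690), Person("Samuel", "Houston")]
for hit in possible_matches(Person("John Samuel", "Houston", 1690), pages):
    print(hit)   # each hit would get the Same / Maybe / Different checkboxes
```

Soundex keys on the first letter, so it would not pair spellings such as Caspar and Kaspar; a fuller version would need a second name-variant check.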
- The details of what we can do with the UI and how much magic we can get out of SMW are items I can't comment on yet. I am unclear what the best solutions are that people have done on websites and in software. What are the favorites and why? What is the state of the art with Gedcom merging? I read a secondhand account of how ancestry does it, but haven't gotten around to trying it yet.
- It seems to me that the deep lineages are where the high quality content is (where we can leverage WP) and that much of the needless replication in the Gedcoms has to do with these deep ancestries that everyone has simply copied into their genealogy programs. Articles like William the Conqueror and Charlemagne will enjoy the highest traffic because of the greatest commonality. The more important factor is that these lineages form a core organizing structure. Although many genealogists are very serious and have researched large volumes of individuals that are unique, I suspect that a substantial audience of people buying $29 genealogy programs have gone back several generations in straight lines, usually the paternal lineage, but after they tie to some gedcoms that connect them to some deeper ancestries and famous people in side branches, they basically don't elaborate on any of that part of the tree. Our workflow could then be driven by working on just the parts of one's gedcom that are unique. A methodical way to test this assumption would be to do some volume tests on gedcoms for the commonalities. If the assumption holds, that would reduce the volume of manual merging that has to be done. For example, we would then proceed to show the user a graph of their tree from the gedcom, indicating, say, in green the portions that look like stuff we already know about. If it happens to be a gedcom file from their own program, they may recall the stuff they just copied, and verify by click-accepting the portions of the visual tree that they will accept / don't want to hand verify. So they see a guide indicating their goal (connection to the ancient lineages we have put substantial effort into) and their progress through the nodes, turning each from red to green. I think we have to deal with Robin's idea of iterative workflows, where a person will want to stop for the night and continue later where they left off.
Further, the node color codes might indicate by hue how rich in content they are (pictures?/ sources?/ volume of text narrative?). This progress indicator would give the positive feedback of accomplishing something in a high quality way. We want to avoid UIs and workflows that overwhelm the user with innumerable relatives and tend to promote reducing their forebears to anonymous stepping stones to notable lineages. We want very high quality stuff in the branches, not just the trunk and main limbs of the trees.
- I think someone needs to survey the state of the art and scan for programs and sites that do well with these sorts of goals. (Robin? Anyone else fancy going on a scouting expedition?) I don't really care what people are doing with their database-y features. Those are just straight programming tasks. The hard part is to channel people towards high value contributions rather than towards mindlessly repetitive or redundant copying tasks that dilute the quality of the knowledgebase and burn them out.
- As for the tree UI- picture Jewage's image of a tree that has an extension supporting hotspots and zooming. The color coding indicates quality, and we have user stats that show numbers of stub articles versus those that correspond to B and C level articles on Wikipedia. The primary levels are indicated by simple SMW statistics- numbers of properties filled out, size of text entries without a real qualitative assessment. But later these might be periodically assessed as WP articles are. That is one way that might allow us a greater measure of the "pay later" approach I elaborated on in response to a Forum post that admired high volume Gedcom imports [[Forum:Google_rank#Dutchies|(forum note here)]]. -~ Phlox 16:01, 6 June 2009 (UTC)
- Robin's suggestions are a good start for a discussion.
- Step 0.0 is to translate gedcom data into wiki data. This is not hard (ASCII to ASCII) but gedcom data works with individual and family IDs, while name and vital stats are our form of identification and hypertext our form of linking.
- A main concern is that every individual has multiple links: two parents, M spouses, N children. We need to get those links right as well.
- My main worries about deep ancestries are naming and quality. There are multiple spellings of older names, even within a single language, and birth and death dates are often uncertain. People are often mixed up, and too many amateur genealogists happily report an unlikely ancestor as a fact.
- As Step 0.1, I would scan each individual in the gedcom file and see whether there is a match on Familypedia. I would take the children of the youngest match as the first people to be added.
- On merging information, I would be careful. Phlox has experimented with alternative theories, and we will need something like that (Step -1). Unfortunately, at present we only have a flag {{Verify}} for the entire content. We could in fact do with a quality flag for every connection. We have also discussed an overall quality assessment (Step -2). I would think that the pages in the highest quality category should be most resistant to mergers.
- I studied the most popular "let's all upload our gedcoms" sites. None of them had the correct children of Charlemagne. The ones that did not procreate are typically missing, the ones with a large offspring are often listed under different names, while the obscure ones are used to claim ancestry that isn't there. We don't want to repeat that. rtol 11:18, 8 June 2009 (UTC)
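The Step 0.0 translation described above, from GEDCOM individual and family IDs to page names and hypertext links, might look roughly like this in Python. It is a sketch under stated assumptions: the input dictionaries stand for an already-parsed file (HUSB, WIFE and CHIL are the standard GEDCOM 5.5 family tags), while the function names, the dictionary layout and the sample people are hypothetical.

```python
def family_links(individuals, families):
    """Turn GEDCOM family records into per-person lists of linked page titles.

    individuals: {"@I1@": "Page title", ...}
    families:    {"@F1@": {"HUSB": "@I1@", "WIFE": "@I2@", "CHIL": ["@I3@"]}, ...}
    """
    links = {xref: {"parents": [], "spouses": [], "children": []}
             for xref in individuals}
    for fam in families.values():
        # Ignore references to people who are missing from the file.
        partners = [x for x in (fam.get("HUSB"), fam.get("WIFE")) if x in individuals]
        children = [x for x in fam.get("CHIL", []) if x in individuals]
        for p in partners:
            links[p]["spouses"] += [individuals[o] for o in partners if o != p]
            links[p]["children"] += [individuals[c] for c in children]
        for c in children:
            links[c]["parents"] += [individuals[p] for p in partners]
    return links


def wikilink(title):
    return f"[[{title}]]"


# Made-up people and family for illustration.
people = {"@I1@": "John Doe (1690-1754)", "@I2@": "Mary Roe (1695-1760)",
          "@I3@": "Robert Doe (1720-1790)"}
fams = {"@F1@": {"HUSB": "@I1@", "WIFE": "@I2@", "CHIL": ["@I3@"]}}

for xref, rel in family_links(people, fams).items():
    print(people[xref])
    for kind, titles in rel.items():
        print("  ", kind + ":", ", ".join(wikilink(t) for t in titles) or "(none)")
```

Because a person can appear in several family records, the lists simply accumulate, which covers the "M spouses, N children" case; deciding whether a linked title already exists on the wiki is a separate matching step.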
- What if we begin by inputting only Gedcoms with sources indicated?
- Are there high value trees at any of these sites? I know of some of the criticisms of the LDS site, but what about fetching all the biography articles at WP whose birth and death dates agree exactly with matches in the familysearch database? I could pull those in by Pywikipedia bot pretty easily. I was thinking that the AFN records were more high quality, but actually everything is spotty. Take Charles Babbage. His AFN 11DC-X5P has his birth at about 1794 (WP: 26 Dec 1791), and none of the LDS records has any of his descendants as does wikipedia (3 listed, 8 total, with a link to a genealogy site now defunct). Not a happy picture. -~ Phlox 05:33, 9 June 2009 (UTC)
- Whatever we do, we'll need eyeballs to track the bot. As a matter of procedure, it would be good if the bot-master were to announce that (s)he uploaded, say, the House of Valois from Wikipedia, and that nothing else is uploaded until someone has checked the new pages and the links. rtol 06:02, 9 June 2009 (UTC)