GEDCOM Import

Importing GEDCOM into XYFT

2018-03-11

I have also been working on importing old GEDCOM files into XY Family Tree. The problem I have found with GEDCOM files is that they can be encoded in all sorts of different character sets. What might look like this as a source file, evidently encoded as UTF-8

Notepad++ shows characters encoded in UTF-8

gets imported like this

UTF-8 without BOM displays incorrectly

It appears that despite being encoded as UTF-8 it nonetheless translates as rubbish. I found the answer is to read the original ged file into Notepad++ and convert the encoding to UTF-8-BOM so it looks like this

Notepad++ shows characters encoded in UTF-8-BOM

Then the import works to show like this

UTF-8-BOM is correct encoding

And when XY exports that file as GEDCOM and reimports it to XY, it comes back the same.

BOM is Byte Order Mark and is way beyond me. I use Notepad++ to check and edit pretty much all text on my PC (this website is written with Notepad++). You can get it free at https://notepad-plus-plus.org/. Suffice to say that experiments can yield results.
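For anyone who would rather script the conversion than open each file in Notepad++, here is a minimal sketch in Python. The one fact it relies on is that the UTF-8 BOM is the three bytes EF BB BF at the very start of the file. The function names are my own invention for illustration and have nothing to do with how XYFT works internally.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the bytes start with the UTF-8 byte order mark (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def add_utf8_bom(data: bytes) -> bytes:
    """Return the same bytes with a UTF-8 BOM in front, if not already present.

    Assumption: the input is already valid UTF-8. The decode call below
    raises UnicodeDecodeError otherwise, so a file in some other character
    set is not silently mislabelled.
    """
    if has_utf8_bom(data):
        return data
    data.decode("utf-8")  # validation only; the result is discarded
    return codecs.BOM_UTF8 + data
```

To convert a file, read it in binary mode (`open(path, "rb")`), pass the bytes through `add_utf8_bom`, and write the result to a new file in binary mode, keeping the original as a backup.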

Here’s a nice piece in Greek

Greek letters are fine

And here’s a snip from a test file

UTF-8-BOM shows letters with hook above

I found a file of 33MB online and used it as a test – all perfect for import and export (but of course, not quite perfect as GEDCOM is illogical). Nonetheless, on my 4-year-old, low-spec PC all 65,508 names could be imported from GEDCOM in just over 10 minutes and exported as GEDCOM in just under 10 minutes. Needless to say, the GEDCOM files before and after were slightly different because that’s the way GEDCOM is – NOT a standard. What was also interesting was that although the import was relatively slow, once the 65,508 names were imported, loading into memory took 10 seconds and navigation between the names was just as fast as it is for 20 names – instant!

But the big question is, would you have 65,000 pieces of verifiable evidence and where is this evidence? For certain the evidence is not contained in a GEDCOM file.

2018-03-16

More experiments. I bravely tried the Tamura Jones fan thingy. 16 generations (65,535 names and 32,767 marriages) took about 10 minutes to import from the .ged file into XYFT format. So pushing further I tried 17 generations (131,071 names and 65,535 marriages) and this took 50 minutes to import. Mind you, my PC is low spec and it’s not something that needs to be done twice. Once the import is complete it only takes 15 seconds to load all 131k people and 65k marriages into memory and navigation between them is instant. But the thing is, no one should ever have over 100k names in a single family tree. If you find a file with more than 1k names, how many would you rely on? Just one mistake invalidates the whole lot. And anything found on the Internet must always be regarded as most likely wrong, either by accident or mischief.
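Those counts are exactly what a complete fan gives. Assuming every person in all but the last generation has both parents present, and each pair of parents counts as one marriage, n generations hold 2^n - 1 people and 2^(n-1) - 1 marriages. A small Python sketch of that arithmetic (the function name is just for illustration):

```python
from typing import Tuple


def fan_counts(generations: int) -> Tuple[int, int]:
    """Return (people, marriages) for a complete ancestor fan.

    Assumes generation 1 is a single person and every person outside the
    last generation has exactly two parents who form one marriage.
    """
    people = 2 ** generations - 1            # 1 + 2 + 4 + ... + 2^(n-1)
    marriages = 2 ** (generations - 1) - 1   # one per person who has parents
    return people, marriages
```

For 16 and 17 generations this reproduces the figures quoted above: (65535, 32767) and (131071, 65535).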

And yet more: importing a variety of different character sets.
On this site http://heiner-eichmann.de/gedcom/charset.htm there are several challenges.

GEDCOM challenge character sets

The only one I couldn’t import was the file with
Byte order: Hi-Lo, no BOM, Line terminator: CR+LF
which looked like this

Ansel text in Notepad

All the rest yielded and were imported with all their characters correct.

Even the Ansel page imported OK, but I think the original was a bit skew-whiff with its descriptions.

There are other challenges and torture tests available involving thousands of wives/husbands/children, but XY Family Tree would have no problem with these apart from the difficulty all programs would have in displaying such large numbers on a screen. As for marriages, XYFT can reorder these in any sequence even if dates are not known.

2018-03-28

Looking at the 33,736 names imported from a GEDCOM file I found on the Internet, I thought it would be interesting to run some of the checks that XY Family Tree can do. So

Checking option 3

these people have no connection with anyone else in the data. No obvious reason for them to be there then.

People in this family tree with no connection to anyone else
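A check like this can also be run directly against the GEDCOM text. The sketch below is only my guess at one way to do it, not a description of what XYFT does. It assumes the usual GEDCOM layout: `0 @I1@ INDI` and `0 @F1@ FAM` records, with `HUSB`, `WIFE` and `CHIL` lines inside families and `FAMC` and `FAMS` lines inside individuals. A person counts as connected only if they share a family with at least one other person.

```python
from collections import defaultdict


def unconnected_individuals(gedcom_text):
    """Return xref ids of INDI records that share no family with anyone else."""
    individuals = []
    families = defaultdict(set)   # family xref -> set of member xrefs
    rec_id = rec_tag = None       # the level-0 record currently being read
    for raw in gedcom_text.splitlines():
        parts = raw.strip().split(" ", 2)
        if len(parts) < 3:
            if parts[0] == "0":   # e.g. "0 HEAD" or "0 TRLR"
                rec_id = rec_tag = None
            continue
        level, second, rest = parts[0], parts[1], parts[2].strip()
        if level == "0":
            if second.startswith("@"):
                rec_id, rec_tag = second, rest.split(" ")[0]
                if rec_tag == "INDI":
                    individuals.append(rec_id)
            else:
                rec_id = rec_tag = None
        elif level == "1":
            if rec_tag == "FAM" and second in ("HUSB", "WIFE", "CHIL"):
                families[rec_id].add(rest)
            elif rec_tag == "INDI" and second in ("FAMC", "FAMS"):
                families[rest].add(rec_id)
    linked = set()
    for members in families.values():
        if len(members) >= 2:     # a family of one connects nobody
            linked |= members
    return [xref for xref in individuals if xref not in linked]
```

The membership is collected from both directions (family to person and person to family) because GEDCOM files found in the wild do not always keep the two in step.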

And checking to see that people are not their own ancestor (how does that happen?)

Checking option 4

we get

William Thornton is an ancestor of himself

In fact XY finds that 3 people are ancestors of themselves, as you can see in this snip.

XYFT showing 3 people are their own ancestors
3 Self-Ancestors
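How does a program spot this? One simple approach, sketched here in Python and not necessarily the way XYFT does it, is to start from each person, walk up through their parents, and see whether the walk ever arrives back at the starting person. The `parents` mapping is a hypothetical structure for the example: person id to a list of parent ids.

```python
def self_ancestors(parents):
    """Return the people who appear among their own ancestors.

    parents: dict mapping a person id to an iterable of that person's
    parent ids (hypothetical structure for this sketch).
    """
    found = []
    for person in parents:
        stack = list(parents.get(person, ()))
        seen = set()
        while stack:
            ancestor = stack.pop()
            if ancestor == person:      # walked back to where we started
                found.append(person)
                break
            if ancestor in seen:        # already explored this branch
                continue
            seen.add(ancestor)
            stack.extend(parents.get(ancestor, ()))
    return found
```

The `seen` set matters: without it, a loop further up the tree that does not include the starting person would keep the walk going forever. Restarting the walk for every person is slow on paper but easy to follow, and a faster cycle search could be swapped in for very large files.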

More: not only are people ancestors of themselves in this file but people share the same UUID. Admittedly the UUID attaches to a very similar person in each case but the essence of a UUID is that it belongs to a single instance of a person, not to two similar people. Without checking all duplicates it quickly becomes apparent that this GEDCOM file has too many errors to be anything more than a test of credibility. But the real point is that you can’t trust a large GEDCOM file. By the same token you can’t trust an XYXchange file unless you know you can trust it. But at least XYXchange files are much smaller and have no ambition to contain a whole forest.

Big GEDCOM files may have duplicated _UID
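Shared identifiers are easy to hunt for in the raw file. This Python sketch assumes the identifier sits on a level-1 `_UID` line inside each `INDI` record, as in the file shown. `_UID` is a vendor extension tag, so other files may store it differently or not at all.

```python
from collections import defaultdict


def duplicated_uids(gedcom_text):
    """Return {uid: [INDI xrefs]} for every _UID used by more than one person."""
    owners = defaultdict(list)
    current = None                      # xref of the INDI record being read
    for raw in gedcom_text.splitlines():
        parts = raw.strip().split(" ", 2)
        if parts[0] == "0":
            is_indi = len(parts) == 3 and parts[2].split(" ")[0] == "INDI"
            current = parts[1] if is_indi else None
        elif parts[0] == "1" and current and len(parts) == 3 and parts[1] == "_UID":
            owners[parts[2].strip()].append(current)
    return {uid: ids for uid, ids in owners.items() if len(ids) > 1}
```

An empty result only means no two people share a `_UID`. It says nothing about whether the people themselves are duplicates.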

As I’ve said before, large GEDCOM files are most likely wrong and here is some proof.
