GEDCOM import into XYFT

Normal import

Most GEDCOM files import into XYFT without any problems. This is for three reasons:

  1. GEDCOM insists that children descend from families where the parents are married, and in most cases this is true
  2. most GEDCOM files record little more than the basics for each person: name, date of birth etc
  3. most GEDCOM files are encoded in UTF-8, either with a BOM or with no need of one (see below about BOM)

For convenience all GEDCOM imports include the original GEDCOM data in the Private Notes field so it’s easy to check if anything is missing or needs moving.

GEDCOM character set

The biggest problem I have found with GEDCOM import into XYFT is that GEDCOM files are encoded in all sorts of different character sets. What might look like this as a GEDCOM file, evidently encoded as UTF-8:

GEDCOM encoded in UTF-8
Notepad++ shows characters encoded in UTF-8

gives a GEDCOM import that looks like this:

GEDCOM import displays incorrectly
GEDCOM UTF-8 without BOM displays incorrectly

It appears that despite being encoded as UTF-8 it nonetheless imports wrongly. The solution is to read the original GEDCOM file into Notepad++ and convert the encoding to UTF-8-BOM, so it looks like this after you’ve converted it:

GEDCOM encoded in UTF-8-BOM
Notepad++ shows characters encoded in UTF-8-BOM

Then the GEDCOM import works to show like this, with all letters correctly displayed:

GEDCOM import success
UTF-8-BOM is correct encoding

And when XYFT exports that file as GEDCOM and reimports it from GEDCOM, it comes back correctly.

BOM is Byte Order Mark and is way beyond me. I use Notepad++ to check and edit pretty much all text on my PC (many of my websites are written with Notepad++). You can get it free at https://notepad-plus-plus.org/. Suffice to say that experiments can yield results.
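For the curious, the UTF-8 BOM is just the three bytes EF BB BF at the very start of the file. If you’d rather not open every file in Notepad++, the same conversion can be sketched in a few lines of Python. This is only an illustration, assuming the file really is valid UTF-8 to begin with; the function names are made up for the example and have nothing to do with XYFT itself.

```python
import codecs


def add_utf8_bom(data: bytes) -> bytes:
    """Return data with a UTF-8 BOM at the front, adding one only if missing."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    # Decoding first confirms the bytes really are valid UTF-8;
    # a UnicodeDecodeError here means the file is in some other encoding.
    data.decode("utf-8")
    return codecs.BOM_UTF8 + data


def convert_file(src_path: str, dst_path: str) -> None:
    """Hypothetical helper: copy a GEDCOM file, making sure it starts with a BOM."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(add_utf8_bom(data))
```

Writing to a separate output file keeps the original .ged untouched in case the conversion turns out not to be what was needed.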

Here’s a nice piece in Greek:

Greek letters in XYFT
Greek letters are fine

And here’s a snip from a test file:

UTF-8-BOM shows letters with hook above

GEDCOM import using found .ged files

I found a file of 33MB online and used it as a test – all perfect for import and export (but of course, not quite perfect as GEDCOM is illogical). On my ancient, low-spec PC all 65,508 names imported from GEDCOM in just over 10 minutes and exported as GEDCOM in just under 10 minutes. Needless to say, the GEDCOM files before and after were slightly different because that’s the way GEDCOM is – NOT a standard. What was also interesting was that although the import was relatively slow, once the 65,508 names were imported, loading into memory took 10 seconds and navigation between the names was just as fast as it is for 20 names – instant!

But the big question is, would you have 65,000 pieces of verifiable evidence and where is this evidence? For certain the evidence is not contained in a GEDCOM file.

GEDCOM import using test files

Tamura Jones

More experiments. I bravely tried the Tamura Jones fan test. 16 generations (65,535 names and 32,767 marriages) took about 10 minutes to import from the .ged file into XYFT format. So pushing further I tried 17 generations (131,071 names and 65,535 marriages) and this took 50 minutes to import. Mind you, as I said before, my PC is ancient and low spec, and it’s not something that needs to be done twice. Once the import was complete it only took 15 seconds to load all 131k people and 65k marriages into memory and navigation between them was instant. But the thing is, no one should ever have over 100k names in a single family tree. If you find a file with more than 1k names how many would you rely on? Just one mistake casts doubt on the whole tree. And anything found on the Internet must always be regarded as most likely wrong, either by accident or mischief.
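The numbers in the fan test are not arbitrary. Assuming the fan is a complete pedigree (one starting person, two parents, four grandparents and so on, with every couple married), the totals follow from powers of two. The little function below is my own sketch of that arithmetic, not anything taken from the test files themselves.

```python
def fan_counts(generations: int) -> tuple[int, int]:
    """People and marriages in a complete ancestor fan of the given depth."""
    # Generation k holds 2**(k-1) people, so the total is 2**generations - 1.
    people = 2 ** generations - 1
    # Everyone except the oldest generation has one married pair of parents.
    marriages = 2 ** (generations - 1) - 1
    return people, marriages


for n in (16, 17):
    print(n, fan_counts(n))
```

Each extra generation roughly doubles the file, which is why going from 16 to 17 generations is such a big step.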

Heiner Eichmann

And yet more: importing a variety of different character sets.
On this site http://heiner-eichmann.de/gedcom/charset.htm there are several challenges:

GEDCOM challenge character sets

The only one I couldn’t import was the file with
Byte order: Hi-Lo, no BOM, Line terminator: CR+LF
which looked like this before and not any better after:

Ansel text in Notepad

All the rest yielded eventually and were imported with all their characters correct.

Even the Ansel page imported OK, but I think the original was a bit skew-whiff with its descriptions.
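For the stubborn Hi-Lo, no BOM file, one idea is to guess the encoding from the first two bytes. Every GEDCOM file starts with the line 0 HEAD, so a 16-bit file without a BOM gives itself away by where the zero byte sits. The sketch below assumes that Hi-Lo means UTF-16 big-endian; the function name is hypothetical and this is not how XYFT itself decides. ANSEL is left out because Python has no built-in codec for it.

```python
import codecs


def sniff_gedcom_encoding(data: bytes) -> str:
    """Guess a Python codec name from the first bytes of a GEDCOM file."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"   # this codec strips the BOM while decoding
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"      # this codec reads the BOM to pick the byte order
    # No BOM: the first character should be the digit "0" (byte 0x30).
    if data[:2] == b"\x00\x30":
        return "utf-16-be"   # high byte first, i.e. Hi-Lo
    if data[:2] == b"\x30\x00":
        return "utf-16-le"   # low byte first, i.e. Lo-Hi
    return "utf-8"           # assumption: fall back to plain UTF-8


def read_gedcom_text(path: str) -> str:
    with open(path, "rb") as handle:
        data = handle.read()
    return data.decode(sniff_gedcom_encoding(data))
```

Text read this way could then be saved again as UTF-8 with a BOM before trying the import.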

Torture tests

There are other challenges and torture tests available involving thousands of wives/husbands/children, but XYFT would have no problem with these apart from the difficulty all programs would have in displaying such large numbers on a screen. As for marriages, XYFT can reorder these in any sequence even if dates are not known.

GEDCOM – errors in original files

Looking at the 33,736 names imported from a GEDCOM file I found on the Internet I thought it would be interesting to run some of the checks that XYFT can do.

This check finds people in the tree that have no connection to anyone else in the tree:

Checking option 3

The check finds that these people below have no connection with anyone else in the data. There is no obvious reason for them to be there:

People in this family tree with no connection to anyone else
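A rough version of this check can be run on the raw GEDCOM text. The sketch below is not XYFT’s own check, just a simplified stand-in: it assumes a person is unconnected when their INDI record has neither a FAMC (child of family) nor a FAMS (spouse in family) line. The function name is made up for the example.

```python
def isolated_individuals(gedcom_text: str) -> list[str]:
    """Return the IDs of INDI records that link to no family at all."""
    linked: dict[str, bool] = {}
    current = None  # ID of the INDI record being read, if any
    for raw in gedcom_text.splitlines():
        parts = raw.strip().split(" ", 2)
        if len(parts) < 2:
            continue
        level, second = parts[0], parts[1]
        if level == "0":
            # A new record starts; only "0 @ID@ INDI" lines are people.
            if len(parts) == 3 and parts[2].strip() == "INDI":
                current = second
                linked[current] = False
            else:
                current = None
        elif level == "1" and current is not None and second in ("FAMC", "FAMS"):
            linked[current] = True
    return [xref for xref, has_link in linked.items() if not has_link]
```

It will miss someone whose only link is to a family with nobody else in it, so treat the result as a first pass rather than the final word.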

And checking to see that people are not their own ancestor (how does that happen?):

Checking option 4

finds that William Thornton is his own ancestor:

William Thornton is an ancestor of himself

In fact XYFT finds that 3 people are ancestors of themselves, as you can see in this snip:

XYFT showing 3 people are their own ancestors
3 Self-Ancestors
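How does a program spot this? One way, sketched below, is to walk up from each person through their parents and see whether the walk ever arrives back at the same person. This is my own illustration, not XYFT’s code; it assumes the parent links have already been pulled out of the file into a dictionary, and the IDs in the usage example are invented.

```python
def self_ancestors(parents: dict[str, list[str]]) -> set[str]:
    """Return everyone who turns up among their own ancestors."""
    result = set()
    for person in parents:
        stack = list(parents.get(person, []))  # ancestors still to visit
        seen = set()                           # ancestors already visited
        while stack:
            ancestor = stack.pop()
            if ancestor == person:
                result.add(person)             # the walk came back to the start
                break
            if ancestor in seen:
                continue
            seen.add(ancestor)
            stack.extend(parents.get(ancestor, []))
    return result


# Invented example: I1 -> I2 -> I3 -> I1 is a loop, I4 merely descends from it.
tree = {"I1": ["I2"], "I2": ["I3"], "I3": ["I1"], "I4": ["I1"]}
print(sorted(self_ancestors(tree)))
```

Note that one bad link makes everyone in the loop their own ancestor, which may be why such errors tend to show up in small groups rather than singly.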

More: not only are people ancestors of themselves in this file but people share the same UUID. Admittedly, the UUID attaches to a very similar person in each case but the essence of a UUID is that it belongs to a single instance of a person, not to two similar people. Without checking all duplicates it quickly becomes apparent that this GEDCOM file has too many errors to be anything more than a test of credibility. But the real point is that you can’t trust any large GEDCOM file. By the same token you can’t trust an XYXchange file unless you know you can trust it. But at least XYXchange files are much smaller and have no ambition to contain a whole forest.

People with the same UUID

As I’ve said before, large GEDCOM files are most likely wrong and here is more proof. The snip below shows 2 pairs of people in the same GEDCOM file having the same unique identity. Actually there were more (like in the previous snip). I have no idea how GEDCOM files can contain these errors – but they can and do. If an individual in your file has a UUID (or _UID) of 24B82…00580 it should not be possible to have another individual in the same file with that same UUID. Something is seriously wrong.

Big GEDCOM files may have duplicated _UID
Two examples of two people with the same unique ID
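Checking a file for this yourself is not hard. _UID is not part of standard GEDCOM (the leading underscore marks it as a program’s own extension), so the sketch below simply assumes each person carries it as a level 1 line inside their INDI record. The function name and sample values are invented for the illustration.

```python
from collections import defaultdict


def duplicate_uids(gedcom_text: str) -> dict[str, list[str]]:
    """Map each _UID used by more than one person to the IDs that share it."""
    owners = defaultdict(list)
    current = None  # ID of the INDI record being read, if any
    for raw in gedcom_text.splitlines():
        parts = raw.strip().split(" ", 2)
        if len(parts) < 2:
            continue
        if parts[0] == "0":
            is_person = len(parts) == 3 and parts[2].strip() == "INDI"
            current = parts[1] if is_person else None
        elif parts[0] == "1" and parts[1] == "_UID" and current and len(parts) == 3:
            owners[parts[2].strip()].append(current)
    # Keep only the identifiers that appear on two or more people.
    return {uid: xrefs for uid, xrefs in owners.items() if len(xrefs) > 1}
```

An empty result means every _UID in the file is used only once, which is how it should be.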

Veracity of GEDCOM data

Simply, GEDCOM import from the Internet, or indeed any import from an unverified source, is probably more trouble than it’s worth. The veracity of every Internet GEDCOM is questionable and the GEDCOM file does not answer any questions. However, if your GEDCOM file is your own and properly researched, you’re safe.
