Tech Issue:Redundant articles

This page concerns multiple articles on the same individual that are created inadvertently.

There are other problems that are not treated in depth here:

  • Multiple articles on the same place or other non-person subject. These can be handled in the normal Wikipedia way of manual identification and a suggested merge using the merge template.
  • Articles deliberately created on the same individual in order to record an alternate theory about that individual. These are normal in genealogical research and should be retained even after research has conclusively shown that an alternate version is incorrect. Others may come across data suggesting the same disproved theory and will be interested in reviewing prior work on it.

The issue: inadvertently created duplicates

Wikipedia has over 10 million pages, including redirects. One might think its problem would be much more difficult than ours: in our case the variants of a person's name are limited, whereas the number of ways of titling an arbitrary topic of human knowledge is virtually limitless. Yet Wikipedia manages to identify articles about the same topic under different names and to merge them. This works because everyone is an editor and can spot such problems. For the same reason, we have an inherent advantage over other genealogy sites.

However, our problem is fundamentally different from Wikipedia's in an important respect. Wikipedia will have a smaller number of pages that are of common interest to many people. Genealogy Wikia will have a much larger number of pages, mostly about recent individuals of interest only to their descendants. Within its first decade, Genealogy Wikia should overtake Wikipedia in sheer number of articles. It is not difficult to see why. Already, it is not uncommon for individuals running their own genealogy sites to have a million individuals in their databases. We will be pulling in this data via bot, so we will quickly reach multiple millions. Further, because of the nature of migrations, genealogy crosses language boundaries, and our site is therefore multilingual: we will have not just English-language genealogies but those from a broad global community.

When we crack 100 million individuals, we are really going to need some serious tools and infrastructure to automatically identify such duplicates.

Mechanisms and infrastructure to deal with redundancy

  • Unique identifiers. We are tracking two kinds of external unique identifiers in info pages: AFNs (Ancestral File Numbers) and Genealogics person IDs. Unfortunately, we cannot mint new identifiers of these kinds ourselves, so we will use our own GUIDs to keep things straight. Our GUIDs will be invisible to users and will only be generated when we export information to genealogy programs; most likely we will stuff each GUID in a note field and read it back on import (a minimal sketch appears after this list).
  • Structured data encoding needed: This is the style of encoding used in databases. What is common to any programmatic solution is that the code should not also be saddled with performing natural-language processing on free text. If the contributor placed the name of the father in a field, the program does not have to parse the various cell-table formats to extract that information. The problem would be worse if it had to work out other useful data, such as birth county, from free text. The decisive way to deal with ambiguous data is to encourage people not to add it in free-form ways. This is one reason why most of the current "Create a page" templates are seriously lacking.
    • Whatever data encoding formalism we use, so long as we have the crucial data encoded in a controlled way, we will be in a position to build CloneKillers.
    • Info pages are encoded in a structured way so that the information can easily be used by a CloneKiller (a minimal sketch of reading such structured fields appears after this list).
  • "CloneKillers" (Programs to identify likely duplicate individuals):
    • Interim solutions using hard-coded heuristics will be used until we can incorporate a probabilistic system. Some rules might be the following (a sketch of the first rule appears after this list):
      • Father's surname and mother's surname match, birth county matches, and the birth or death date falls within an acceptable range.
      • A relative with a UID match, e.g. an identical Genealogics or AFN number (assuming the UID data is correct).
      • A matrix representation of the relationship graph can be used for fast computation of proximity using matrix math; this is the technique used for calculating paths for airline connections. The matrix is sparse and high-dimensional, so it may not be practical.
      • Hash the GEDCOMs: create signatures and search those (see the signature sketch after this list). There appears to be a lot of identical and near-identical data that contributors are simply copying from each other.
      • Check whether an identical record exists on WorldConnect; if so, probe for near-identical records.
  • Ultimate solutions to the redundancy problem will likely employ probabilistic approaches such as Bayesian belief networks, which are better suited to the genealogy domain, where nothing is black and white. We think the father's surname was probably "Foo" because Aunty Matilda said so and her data has been wrong only 5% of the time. With a belief network, instead of being forced to state that the surname was definitely Foo, you can say Foo with 95% probability to capture the contributor's confidence. Then, when we examine a number of sources to decide whether two articles describe the same person, we can weigh many factors of varying uncertainty: the confidence that the birth locations are near enough, that the marriage locations are near enough, and likewise for death events, residence and occupation data, taking into account historic versus modern place names and ambiguous place names. A decision might turn decisively on primary information, such as determining the wife's maiden name from high confidence in the father's surname and the child relation, but it may also rest on a large volume of lower-confidence corroborative evidence; such secondary data can be used in whole or in part to shore up lower-confidence primary data. As one can see, such a network of beliefs is highly interconnected and more accurately models the problems genealogists face with ambiguous and sometimes changing data. New data, such as DNA evidence, can and will become available that cascades uncertainty into large numbers of existing genealogical trees. Probabilistic approaches are very good at dealing with such dynamic changes; hard-coded heuristic rules are not. (A minimal sketch of combining such evidence probabilistically appears below.)
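
The following is a minimal sketch of the GUID idea in Python, assuming a GEDCOM-style note field. The marker text "WIKIA-GUID", the function names, and the exact export pipeline are illustrative assumptions, not a settled design.

    import uuid

    def export_guid_note(existing_guid=None):
        """Return (guid, note text); reuse a GUID that was read on a previous import."""
        guid = existing_guid or str(uuid.uuid4())
        return guid, "WIKIA-GUID: " + guid

    def read_guid_note(note):
        """Recover a GUID from a note written by export_guid_note, or None."""
        marker = "WIKIA-GUID: "
        return note.split(marker, 1)[1].strip() if marker in note else None

    guid, note = export_guid_note()
    assert read_guid_note(note) == guid  # the GUID round-trips through the note field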
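
Next, a minimal sketch of why structured encoding matters: with controlled field names, a CloneKiller reads values directly instead of guessing at free text. The template name "Info person" and field names such as "father_surname" are hypothetical, not the actual info-page schema.

    def parse_info_fields(wikitext):
        """Extract |name=value pairs from a simple info-template block."""
        fields = {}
        for line in wikitext.splitlines():
            line = line.strip()
            if line.startswith("|") and "=" in line:
                name, _, value = line[1:].partition("=")
                fields[name.strip()] = value.strip()
        return fields

    example = """{{Info person
    |father_surname=Smith
    |mother_surname=Jones
    |birth_county=Kent
    |birth_year=1745
    }}"""
    print(parse_info_fields(example)["birth_county"])  # -> Kent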
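
A minimal sketch of the first heuristic rule above, assuming each record is already a dictionary of structured fields (field names are again illustrative):

    def likely_duplicate(a, b, year_tolerance=2):
        """Apply the surname / birth-county / date-range rule to two person records."""
        # All three name/place fields must be present and agree.
        for key in ("father_surname", "mother_surname", "birth_county"):
            if not a.get(key) or a.get(key, "").lower() != b.get(key, "").lower():
                return False
        # At least one of the birth or death years must be close.
        for key in ("birth_year", "death_year"):
            if key in a and key in b and abs(a[key] - b[key]) <= year_tolerance:
                return True
        return False

    john1 = {"father_surname": "Smith", "mother_surname": "Jones",
             "birth_county": "Kent", "birth_year": 1745}
    john2 = {"father_surname": "SMITH", "mother_surname": "Jones",
             "birth_county": "Kent", "birth_year": 1746}
    print(likely_duplicate(john1, john2))  # -> True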
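
A minimal sketch of the signature idea: hash a normalised subset of fields so that records copied verbatim (or nearly so) between contributors collide in a simple index, rather than requiring pairwise comparison. The field names and choice of hash are illustrative.

    import hashlib

    def record_signature(record):
        """Hash a normalised subset of fields of a person record."""
        key_fields = ("surname", "given", "birth_year", "birth_county")
        normalised = "|".join(str(record.get(f, "")).strip().lower() for f in key_fields)
        return hashlib.sha1(normalised.encode("utf-8")).hexdigest()

    records = [
        {"surname": "Smith", "given": "John", "birth_year": 1745, "birth_county": "Kent"},
        {"surname": " smith", "given": "JOHN", "birth_year": 1745, "birth_county": "Kent"},
    ]
    buckets = {}
    for rec in records:
        buckets.setdefault(record_signature(rec), []).append(rec)
    # Both sample records land in the same bucket despite case and whitespace differences.
    print([len(group) for group in buckets.values()])  # -> [2]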
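
Finally, a minimal sketch of combining several pieces of evidence of varying confidence into a single probability of a match. A real Bayesian belief network would also model dependencies between the pieces of evidence; this sketch treats them as independent (naive-Bayes style), and every number in it is invented purely for illustration.

    import math

    def posterior_match_probability(prior, evidence):
        """Combine independent likelihood ratios into a posterior match probability."""
        log_odds = math.log(prior / (1.0 - prior))
        for _name, likelihood_ratio in evidence:
            log_odds += math.log(likelihood_ratio)
        odds = math.exp(log_odds)
        return odds / (1.0 + odds)

    # Each ratio says how much more likely the observation is if the two records
    # really describe the same person than if they do not.
    evidence = [
        ("father's surname agrees (Aunty Matilda, right ~95% of the time)", 12.0),
        ("birth places within a few miles", 4.0),
        ("occupation agrees (weak: common occupation)", 1.5),
    ]
    print(round(posterior_match_probability(prior=0.01, evidence=evidence), 3))  # -> about 0.42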
