Phlox got a drilldown working for the Smiths. It goes like this:

*'''[ Filter on birth, death date and location]'''
*This category uses the filters [[Has filter::Filter:Birth period]], [[Has filter::Filter:Locality of Birth]] and [[Has filter::Filter:Nation of Birth]]
*This category has the drilldown title [[Has drilldown title::Narrowing down persons with last name: Smith]]

I think the wording could be improved a little. For example, the actual search makes no mention of death date and does offer birth nation as well as locality (which is why the display should not have a heading "City of Birth").

It seems to work OK for Smith, though I haven't checked exhaustively.

It doesn't work for others with just a surname change: see "Capet". Let's get the procedure sorted out and documented so that almost anyone can get it working for any surname that has dozens of individuals. Start, maybe, with Special:Contributions/Phlox around 12 September 2009?

-- Robin Patterson (Talk) 14:19, March 30, 2013 (UTC)

Discussion best continued at Talk:Extension:Semantic Drilldown. -- Robin Patterson (Talk) 03:54, April 5, 2013 (UTC)

Explaining the surname numbers


It was puzzling enough when there were two. Now there are three. Would one of you knowledgeable people please replace my speculative introduction with something that tells readers where the numbers come from? -- Robin Patterson (Talk) 07:50, April 4, 2013 (UTC)

Detailed answer

We have problems with automatically producing a number that represents the number of people with a particular surname. The main problems are:

(1) Surname categories contain a lot of other junk, such as images of census records, hndis pages and "How contributor X is descended from royalty" pages.

(2) Pages that are in subcategories (such as Category:Kennedy family) get counted twice (or spuriously, because they don't even have the same surname!).

The second number is the one returned by the "PAGESINCATEGORY" magic word. The third number is returned by an SMW "ask" query with "format=count". In an attempt to get around problem (1), I created the (new) first number, which uses "ask" to count the pages in the category that are also in either "Non-SMW people articles" or "Facts articles- person", so as to exclude the "extras" mentioned in (1).

If we look at some examples, we can see the problem:

For "Johnson", we have

{|
! Category !! Subcats !! Pages !! Files
|-
| (main) || 1 || 57 (including 1 surname article and 2 cemeteries) || 71
|-
| Baker-Johnson (surname) || 0 || 1 || 0
|}

The numbers we get are: |- | Johnson | align="right" | | align="right" | 74 | align="right" | 17 | align="right" | 91 | align="right" | 167: so 54 = 57-1-2 is the "right" answer, and 129=1+57+71 is much too large (I don't know how you get 113!).
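The arithmetic behind the two Johnson figures can be written out as a short Python check. This is only a sketch of the sums quoted above; the variable names are invented for illustration and are not part of any template on the wiki.

```python
# Figures for the main "Johnson" category, as quoted in the table above.
subcats = 1
pages = 57
files = 71

# Non-person pages that are counted among the 57 pages.
surname_articles = 1
cemeteries = 2

# The "right" answer: person pages only.
people = pages - surname_articles - cemeteries

# The inflated figure: everything the category lists, of any kind.
everything = subcats + pages + files

print(people)      # 54
print(everything)  # 129
```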

However, an example like "Welf" suffers from problem (2): the table looks like:

{|
! Category !! Subcats !! Pages !! Files
|-
| (main) || 1 || 72 (all real people pages) || 0
|-
| House of Welf || 4 || 29 || 0
|-
| colspan="4" | ... more subcats
|}

Here the numbers are |- | Welf | align="right" | | align="right" | 194 | align="right" | 0 | align="right" | 194 | align="right" | 195, and only the 2nd is "about right": the 2 SMW counts include the pages from all the subcategories.
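Problem (2) can be illustrated with a toy model in Python. Everything in it is hypothetical: the page titles are invented, the tree is far smaller than the real Welf categories, and it does not claim to reproduce how SMW or MediaWiki count internally. It only shows how including subcategory members pushes a count above the number of pages listed directly in the main category.

```python
# Hypothetical category tree; every page title here is invented.
members = {
    "Welf": {"Person A", "Person B", "Person C"},
    "House of Welf": {"Person B", "Person D"},
}
subcats = {"Welf": ["House of Welf"], "House of Welf": []}


def direct_count(cat):
    """Pages listed directly in the category, ignoring subcategories."""
    return len(members[cat])


def summed_count(cat):
    """Add each subcategory's count separately, so shared pages count twice."""
    return len(members[cat]) + sum(summed_count(s) for s in subcats[cat])


def distinct_pages(cat):
    """All distinct pages in the category and its subcategories."""
    found = set(members[cat])
    for sub in subcats[cat]:
        found |= distinct_pages(sub)
    return found


print(direct_count("Welf"))         # 3
print(summed_count("Welf"))         # 5 ("Person B" is counted twice)
print(len(distinct_pages("Welf")))  # 4 (still includes "Person D")
```

Even the de-duplicated figure is larger than the direct count, because a subcategory can hold people who do not carry the surname at all.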

Thurstan (talk) 22:41, April 4, 2013 (UTC)

Where do we go from here with surnames?

I can't see how to get a better count automatically. I think we need to go back to having a "manual" count as well, to use for the ranking, as I don't see any of these numbers as suitable for the primary ranking. Any suggestions welcome.

On a different note:

  • do we want to present the results as a table with several columns?
  • I would suggest that we can drop the 3rd number, if we are going to continue to report using both methods.

Thurstan (talk) 22:41, April 4, 2013 (UTC)


Columns would be a distinct improvement since we clearly want to (and can easily) have more than one number per surname. -- Robin Patterson (Talk) 02:23, April 5, 2013 (UTC)
Okay, I will start on an upgrade. Thurstan (talk) 02:30, April 5, 2013 (UTC)

Removing junk

I like the results that remove the "junk", though the junk will usually be an insignificant proportion of any large category - except for the hndis, but they will be similar in similar-sized categories and thus not distort rankings much. -- Robin Patterson (Talk) 02:23, April 5, 2013 (UTC)

Primary number for sorting

I suggest that our primary number (the one we sort on) be the standard SMW-article-property-derived "Facts articles- person" number; clearly away below the total category number in many cases but it could be an incentive to get one's ancestors upgraded. Second number could be from the "Category AND Non-SMW people articles". -- Robin Patterson (Talk) 02:23, April 5, 2013 (UTC)

OK, be realistic, add the SMW and non-SMW. But we have a few oddities, where the total is away above the category total (notably among the medieval surnames). Is that a fault of the subcategory-inclusion? If so, maybe our sorting criterion should be the lower of the two? That will still under-report the non-SMW pages that don't have a surname categorized (e.g. pages straight from WP with inadequate adjustment; and yes I know I've done a few of those). -- Robin Patterson (Talk) 05:11, April 5, 2013 (UTC)

I think that is the subcategory-inclusion, that is why I sorted them by manual count (as I suggested above). Thurstan (talk) 05:15, April 5, 2013 (UTC)
Using the property rather than the category reduces the double counting for the SMW pages. Thurstan (talk) 20:33, April 5, 2013 (UTC)
So the second number (SMW-property-based) is now much less subject to duplication? Is that reflected in the following block of examples, suggesting another "manual count" (copying the third programmed number) and re-ordering is now due? I can do them if you say I've got it right.
Tol...... 184 108 __0 108 193
Walker 160 _46 114 160 168
Korver. 160 152 __0 152 163
Brown. 157 _85 _72 157 201
-- Robin Patterson (Talk) 05:21, April 6, 2013 (UTC)
I think the second-last column above is now the right thing to sort by (160 for Walker, 152 for Korver, etc) and I think it is very close to the manual count now (I see that the manual count for Tol is wrong because of people who are categorized by their married names as well). I would be happy for you to reorder them now. Thurstan (talk) 06:05, April 6, 2013 (UTC)
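The re-ordering discussed here can be sketched in Python using the four rows quoted above. The column labels are an assumption read off the discussion (manual count, SMW count, non-SMW count, their total, category total); they are not the headings of the actual report table.

```python
# Rows as quoted above. Assumed column order:
# (surname, manual, smw, non_smw, total, category_total)
rows = [
    ("Tol", 184, 108, 0, 108, 193),
    ("Walker", 160, 46, 114, 160, 168),
    ("Korver", 160, 152, 0, 152, 163),
    ("Brown", 157, 85, 72, 157, 201),
]

# In each quoted row the second-last figure is the SMW and non-SMW counts added.
for name, manual, smw, non_smw, total, category_total in rows:
    assert smw + non_smw == total, name

# Sort on the second-last column, largest first.
ranked = sorted(rows, key=lambda row: row[4], reverse=True)

for name, manual, smw, non_smw, total, category_total in ranked:
    print(name, total)
# Walker 160
# Brown 157
# Korver 152
# Tol 108
```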

Done. I generally left the (formerly-named) "manual" column alone except where it produced a higher figure than the "accurate" one. Then I rearranged except ignoring some that were only one number bigger than an entry above. -- Robin Patterson (Talk) 06:23, April 7, 2013 (UTC)

Wondering now about the "subcategories" mention in the heading to the middle columns since Thurstan made the first of those more precise. -- Robin Patterson (Talk) 06:23, April 7, 2013 (UTC)

Now we want a note here about the formula for producing a report that lists all the categories that have a "total" figure of 50 or more (or maybe "60 or more" so that we don't feel obliged to alter the table for a cat that's only just qualified for the bottom group). Something quicker than looking at all 10,000 subcats. -- Robin Patterson (Talk) 06:23, April 7, 2013 (UTC)

It seems that the indefatigable Thurstan has found the formula and used it in recent months but is keeping it secret!! That's fine as long as he keeps on updating the table. -- Robin Patterson (Talk) 01:43, August 29, 2014 (UTC)

A procedure for producing the list

My "formula" for producing the list of top surnames is as follows:

  1. dump the list of surnames into a text file (I use AWB to read the category:surnames)
  2. edit each entry to produce (eg) "{{:User:Thurstan/temp|Smith}}" for surname "Smith". This template checks whether the count is above a threshold, currently set at 40. I use a VBScript script for this step; I am not sure why I don't just use Notepad!
  3. copy-and-paste these entries about 100-150 at a time into the sandbox. For each batch, hit "preview", then "select all" and copy-and-paste the screen into the output file. This step takes some hours.
  4. the output list is then edited so that (eg) the entry for "Smith" reads "{{Surname report entry|Smith}}". "Previewing" the result in the sandbox gives the final count, which is used to sort the list. Again, I use a VBScript script here.

Any suggestion for improvement welcome! Thurstan (talk) 03:40, August 29, 2014 (UTC)

Thurstan, you're a STAR! And here was I thinking it might be a "simple" one-liner #ask listing including things like "count>40". This page needs more publicity, to inspire people (as suggested above) to add pages for more of their relatives! -- Robin Patterson (Talk) 01:31, August 30, 2014 (UTC)
I had been resisting writing down the details while I was trying to think of a quicker way to do it! (Maybe with a python script and pywiki....) Thurstan (talk) 05:33, August 31, 2014 (UTC)
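As a starting point for such a script, the text-editing parts of the procedure (steps 2 to 4) might look like the Python sketch below. It only builds the wikitext strings and splits them into batches; it does not talk to the wiki, so the previewing would still be done by hand. The function names and the sample surnames are invented for illustration, and the batch size of 100 is just the lower end of the "100-150" mentioned above.

```python
def template_calls(surnames, template=":User:Thurstan/temp"):
    """Step 2: wrap each surname in a call to the threshold-checking template."""
    return ["{{" + template + "|" + name + "}}" for name in surnames]


def batches(items, size=100):
    """Step 3: split the list into sandbox-sized chunks for previewing."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def report_entries(surnames):
    """Step 4: produce the final report lines for surnames that passed."""
    return ["{{Surname report entry|" + name + "}}" for name in surnames]


if __name__ == "__main__":
    # Sample input; in practice this would be the dumped list of surnames.
    surnames = ["Smith", "Johnson", "Welf"]
    for number, batch in enumerate(batches(template_calls(surnames), size=2), 1):
        print("Batch", number)
        print("\n".join(batch))
    print("\n".join(report_entries(["Smith"])))
```

Each printed batch can be pasted into the sandbox in one go, which keeps the manual work to the preview-and-copy step.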

Other websites' reports

Quick check on one of the best of our competitors. It suffers from errors too!


Robin Patterson (Talk) 01:43, August 29, 2014 (UTC)

Births etc per century

I'd like to see a table of centuries with the total number of FP births, baptisms, etc, rather like a "Century/bdm" but just the count, not the individuals. Do we have anything like that already? -- Robin Patterson (Talk) 08:19, December 7, 2016 (UTC)

