23000 surpassed, considering even more glosses

I’ve nearly completed adding Karakhanid data from Dīwān Luγāt at-Turk. This has brought me to 23000 entries. This also means that I’m nearly out of sources to consult until I can visit a bigger library (which is still difficult due to COVID).

I have considered adding 50 new glosses, which would bring my total up to 500. I’m considering new body parts/functions (palm, feces, pus, sole, hoof, vein), some plants and animals (cockroach, juniper), directional/positional terms (top, bottom, interior, side), and a few random conceptual and cultural terms (wedding, color, thief). I have 38 terms so far; the Dīwān index was very helpful in choosing these. Once I have decided, I’ll post them here. I’ll also do some background work to ensure that I’m not going back to the same sources and looking for terms that aren’t in there. It’s frustrating.

As a side note, I have really enjoyed reading the following article: Janhunen, Juha. “Issues of Comparative Uralic and Altaic Studies (3): The Turkic Plural in *-s.” Altai Hakpo, 2017. He breaks down a lot of issues relating to the unusual number of paired items ending in /z/, the z~r controversy, and typological issues related to paired/plural items.

Mishär Tatar

I’ve added yet another language to the list: Mishär Tatar. I’ve found a good German-language source and have been adding a few forms. I’m up to 16,800 entries, so 17,000 isn’t too far off.

I have had a bit more time to work on this project, so updates will be fairly continuous.


I’ve been tipped off to some data on a new variety: Xyzyl. Xyzyl is often considered a dialect of Khakas, but because it is so divergent from the standard, some authorities consider it a distinct language. I’ve created a page for it and added it to the map. I should have some lexical data added to the database in the next few days.

New Languages!

I’ve been adding new material left and right. As noted previously, I recently added Ili Salar. I’ve also found some decent Gagauz materials, and have been adding entries as I come across them. My Afghan Uzbek-Turkish dictionary just arrived, so I’ve added Afghan Uzbek (Sar-e Pol). I’m still waiting for the work on the Samangan/Aybak dialect, which I believe is more Kipchak in nature.

I found the most incredible open access journal: Tehlikedeki Diller Dergisi. They occasionally have grammatical sketches, which are really fantastic to have. I’ve added a ton of Dolgan forms, and will be adding a new language, Kalmak. I’ve previously treated Kalmak as a variety of Tomsk Tatar, but the author has convinced me that it’s distinct enough to warrant its own page. I’m well over 15000 entries and will likely pass 15400 very soon.

While I’m talking about great sources of information, I would be remiss if I didn’t mention CyberLeninka, which is a source for open access Russian journals. Also, the Russian State Library has begun digitizing a lot of its collections, which means that tons of dissertations and other materials are now freely available.

I went for a long time without finding much new material, and now it just won’t stop. Here’s to hitting 16000 soon.


I’ve added a lot more Salar material. The source I drew from uses Pinyin, which is not at all ideal for writing a Turkic language. I’m beginning to doubt whether it was worth adding this new material at all…

I’ve created a completely new page for the Ili variety of Salar, which is different enough that Dwyer considers it a separate dialect. The data for this dialect is really good. I’ve surpassed 15,000 entries in the spreadsheet I use to collect my data, but I haven’t entered them all into the actual database yet. I imagine it will all be up in a day or two.

Still here…

I’ve been a bit slower about adding new forms lately. I have reached 14,450 entries.

I’ve added 191 Cuman forms from the Codex Cumanicus. Cuman is challenging because it’s transcribed using medieval Italian conventions, which don’t do well with a lot of Turkic sounds.

I’ve also been working through Urum. This has been incredibly tedious, as its mixed nature means that every gloss has a ton of forms. Occasionally you run across a Greek word, which is exciting.

After I’m done with Urum I’d like to work on some more medieval Turkic languages. It’s harder to find data on these, but I’ve got hopes that I can get some Mamluk or Bolgar data.


Well, I’ve been on a roll. I’ve just added entry no. 14,000. This latest is the Krymchak word for ‘flower’ – čiček.

I’ve got a lot more Krymchak to add, too. Every time I think I’m running out of languages or sources I find more and more. Krymchak is an interesting case because the dictionary my library holds was shelved under PJ; the Library of Congress classification scheme arbitrarily classes ‘Other languages used by Jews’ at the end of the Hebrew range. Our Karaim materials, however, were classed under PL with the rest of the Turkic language material. Weird.

I’m still considering what to do with the glosses I proposed in the previous post – still no decision.

Also, I’d like to set up a page devoted to Crimean Turkic, maybe even incorporating a fancy interactive map. We’ll see…


I’ve been adding tons of forms here and there, finishing up certain languages and working towards finishing others (Uzbek is on the list right now).

I’ve also begun an interesting foray into Cuman, a medieval language spoken by an early Kipchak people in the steppes of Ukraine and Eastern Europe. Even in the 13th and 14th centuries there’s quite a bit of Persian influence.

Unfortunately, the Codex Cumanicus is written in very confusing medieval Latin, so there are bound to be tons of mistakes on my part. My favorite so far is lupi ceruerij. When I saw the Cuman gloss was silausun I thought “Huh, that looks like the Kipchak word for lynx.” Sure enough, a little searching reveals that lupi cervieri was a term that early Italian traders and costumiers used for lynx fur. The term (which literally just means “wolf-deer”) seems to have had other meanings, but finding those out is a bit beyond me now.

The transcription is very inconsistent, which makes figuring out the original form very difficult. The letter x, for example, seems to represent what I transcribe as z, č, and s. Basically, if I don’t have a modern word to check the Cuman form against, I can’t reconstruct anything.
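To make the problem concrete, here is a rough Python sketch of the check described above: expand each ambiguous letter into every sound it might stand for, then keep only the readings that match a modern form. Only the x → z, č, s mapping comes from this post. The function names and the test strings are made up for illustration and are not real Cuman data.

```python
from itertools import product

# Only the "x" mapping is taken from the post; every other letter is
# assumed (for this sketch) to pass through unchanged.
AMBIGUOUS = {"x": ["z", "č", "s"]}


def candidate_readings(spelling):
    """Return every reading obtained by expanding each ambiguous letter."""
    options = [AMBIGUOUS.get(ch, [ch]) for ch in spelling]
    return ["".join(combo) for combo in product(*options)]


def match_against_modern(spelling, modern_forms):
    """Keep only the candidate readings that occur among the modern forms."""
    return [c for c in candidate_readings(spelling) if c in modern_forms]


# Placeholder strings, not attested forms.
print(candidate_readings("axa"))
print(match_against_modern("axa", {"ača"}))
print(match_against_modern("axa", set()))
```

With no modern form to compare against, the last call returns an empty list, which is exactly the situation where I can’t settle on a reading.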

11000 forms!

I’ve hit 11000 forms in the database. This one’s not too exciting – it’s Turkish for “egg” – yumurta. Whenever I can’t hit an even number but want to insert a batch of forms, I’ll fill in the gaps with one of the easier languages. In this case it’s Turkish. Because Turkish is the easiest language to get data on, I’ve saved it for last.

In other news, I’ve come across discussion of a language I’d never even heard of: Chanto. Chanto is a Turki variety spoken in Western Mongolia. The speakers identify as Uyghurs, although many sources call them Uzbeks. Their language is distinct enough to have its own entry, so I’ve set that up here. The same sources give some description of Altay Tuvan, and their data doesn’t neatly align with what I’ve already found. I’m not convinced that this new data is very good, but it’s all I’ve got to work with.