IPhOD Details


Details on IPhOD: organization and measures

The IPhOD was computed over a large word set, with phonotactic probability and neighborhood density calculated after the approach of Vitevitch and Luce (1999). Phonotactic probability refers to the likelihood that a given sequence of sounds occurs in a word. Phonological neighborhood density counts the number of words that share all but one phoneme with a particular word or pseudoword. Positional probability refers to the average likelihood of each phoneme occurring in each position of a word. These counts and probabilities were also weighted by frequency and log frequency to reflect their occurrence in natural language.
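As a rough illustration of the density definition above, the sketch below counts neighbors of a transcription in a toy lexicon. It takes "share all but one phoneme" literally, as a single substitution between transcriptions of equal length; whether IPhOD also counts neighbors formed by adding or deleting a phoneme is not stated here, so treat that choice as an assumption. The function names and the toy lexicon are hypothetical and are not part of the database.

```python
from typing import Dict, Tuple

Phonemes = Tuple[str, ...]


def is_substitution_neighbor(a: Phonemes, b: Phonemes) -> bool:
    """True when a and b have equal length and differ in exactly one position."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) == 1


def neighborhood_density(target: Phonemes, lexicon: Dict[str, Phonemes]) -> int:
    """Unweighted count of lexicon pronunciations that are neighbors of target."""
    # Work on distinct pronunciations so that homophones are counted once.
    pronunciations = set(lexicon.values())
    return sum(is_substitution_neighbor(target, p) for p in pronunciations)


# Toy lexicon (illustrative only), using unstressed CMU-style phoneme symbols.
toy_lexicon: Dict[str, Phonemes] = {
    "cat": ("K", "AE", "T"),
    "bat": ("B", "AE", "T"),
    "cut": ("K", "AH", "T"),
    "cap": ("K", "AE", "P"),
    "dog": ("D", "AO", "G"),
}

# The word "cat" has three neighbors here: bat, cut, cap.
print(neighborhood_density(("K", "AE", "T"), toy_lexicon))
# A pseudoword is scored with the same function: G AE T neighbors cat and bat.
print(neighborhood_density(("G", "AE", "T"), toy_lexicon))
```

The same function serves words and pseudowords, which mirrors the idea that both kinds of entry can be selected with identical criteria.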

The IPhOD measures extend the definitions of Vitevitch and Luce (1999) by performing each calculation twice: once distinguishing vowels by syllable stress, and once ignoring stress. In the stressed calculations, otherwise identical vowel sounds are treated as distinct phonemes depending on whether they carry primary, secondary, or no stress. The so-called unstressed calculations collapse the stress variants of each vowel into a single phoneme category. This may allow hypotheses about syllable stress to be tested, since stress information is available in the transcriptions of the Carnegie-Mellon Pronouncing Dictionary (Weide, 1994).
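The difference between the two kinds of calculation can be sketched by collapsing a stressed transcription into an unstressed one. The example assumes CMU-style phoneme symbols in which vowels end in a stress digit (0, 1, or 2) and phonemes are separated by periods; the helper name strip_stress is hypothetical.

```python
import re


def strip_stress(stressed: str) -> str:
    """Drop the trailing 0/1/2 stress digit from each period-separated phoneme."""
    phonemes = stressed.split(".")
    return ".".join(re.sub(r"[012]$", "", ph) for ph in phonemes)


# "about" in CMU-style notation: unstressed AH0, primary-stressed AW1.
print(strip_stress("AH0.B.AW1.T"))  # AH.B.AW.T

# In a stressed calculation AH0 and AH1 are different phonemes;
# after collapsing they fall into the same category.
print("AH0" == "AH1", strip_stress("AH0") == strip_stress("AH1"))  # False True
```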

Version 2.0 versus 1.4

IPhOD version 2.0 contains phonotactic and density estimates, American English transcriptions of 1-28 phonemes, and word frequencies for 54,030 word and 814,840 pseudoword entries. Each entry contains an American English phonetic transcription from the Carnegie-Mellon Pronouncing Dictionary (Weide, 1994), and written word frequencies from the SUBTLEXus database (Brysbaert & New, 2009). Neighborhood density and word-averaged phoneme-sequence probabilities were calculated from those data using the same formulas for words and pseudowords, so that entries of either type could be chosen using identical criteria.

IPhOD version 1.4 contains transcriptions of 1-17 phonemes, and word frequencies for 33,432 words and 814,840 pseudowords. Each entry contains an American English phonetic transcription from the Carnegie-Mellon Pronouncing Dictionary (Weide, 1994), and Kucera-Francis written word frequencies (1967) from the MRC Psycholinguistic Dictionary (Wilson, 1988).

Additional Information about Version 2.0

Different ways of saying the same thing? IPhOD version 2.0 introduced homophones and homographs to the database. This addition required special steps to avoid double-counting pronunciations or double-weighting with written word frequencies. For each measure in the database, homophones were counted separately in weighted counts, because words that are pronounced identically but spelled differently have different written frequencies in SUBTLEXus. However, homophone entries were counted only once in raw counts, since their pronunciations are indistinguishable. Homographs were handled the opposite way: weighted counts used only one entry, since there was no way of assigning a written word frequency to multiple pronunciations of the same orthographic item, while each of the various pronunciations of a homographic entry was counted separately in the raw count. Previous versions of IPhOD had a single pronunciation for each spelling, and did not treat homographs differently from other words.
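The two counting rules can be sketched on a handful of entries. Everything in this example is illustrative: the Entry type, the helper names, and the frequency values are invented for the sketch, and keeping the first entry seen for a homograph is an assumption, since the text does not say which entry IPhOD retains.

```python
from typing import Dict, List, NamedTuple, Set


class Entry(NamedTuple):
    spelling: str
    pronunciation: str  # unstressed, period-separated phonemes
    frequency: float    # invented values, for illustration only


entries: List[Entry] = [
    Entry("pair", "P.EH.R", 40.0),
    Entry("pear", "P.EH.R", 5.0),   # homophone of "pair"
    Entry("lead", "L.IY.D", 60.0),
    Entry("lead", "L.EH.D", 60.0),  # homograph: one spelling, two pronunciations
]


def raw_items(items: List[Entry]) -> Set[str]:
    """Raw counts: one item per distinct pronunciation.

    Homophones collapse into one item; homograph pronunciations stay separate.
    """
    return {e.pronunciation for e in items}


def weighted_items(items: List[Entry]) -> Dict[str, float]:
    """Weighted counts: one written frequency per distinct spelling.

    Homophones stay separate; a homograph contributes its frequency once.
    """
    freq: Dict[str, float] = {}
    for e in items:
        freq.setdefault(e.spelling, e.frequency)  # keep the first entry (assumption)
    return freq


print(len(raw_items(entries)))               # 3 pronunciations
print(sum(weighted_items(entries).values())) # 105.0 = 40 + 5 + 60
```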

Version 2.0: All Values by Column Number and Title:

Column #	Column Name	Description
1	Indx	Index number for word or pseudoword collections.
2	Word	Orthographic form of word, or altered word that generated pseudoword
3	UnTrn	CMU Pronouncing Dictionary transcription. Phoneme glyphs separated by period marks. Unstressed; contains no syllable stress information.
4	StTrn	Stressed transcription; the digits 0, 1, and 2 mark unstressed, primary-stressed, and secondary-stressed syllables, respectively.
5	NSYL	Number of syllables
6	NPHON	Number of phonemes
7,8,9,10	unsDENS, unsFDEN, unsLDEN, unsCDEN	Unstressed phonological neighborhood density.*
11,12,13,14	strDENS, strFDEN, strLDEN, strCDEN	Stressed phonological neighborhood density.*
15,16,17,18	unsBPAV, unsFBPAV, unsLBPAV, unsCBPAV	Unstressed, word-average biphoneme probability (relative frequencies for ordered phoneme pairs).*
19,20,21,22	strBPAV, strFBPAV, strLBPAV, strCBPAV	Stressed, word-average biphoneme probability.*
23,24,25,26	unsTPAV, unsFTPAV, unsLTPAV, unsCTPAV	Unstressed, word-average triphoneme probability (relative frequencies for ordered phoneme triplets).*
27,28,29,30	strTPAV, strFTPAV, strLTPAV, strCTPAV	Stressed, word-average triphoneme probability.*
31,32,33,34	unsPOSPAV, unsFPOSPAV, unsLPOSPAV, unsCPOSPAV	Unstressed, word-average positional probability (frequency of each phoneme occurring in a specific position, e.g. first, second, etc.).*
35,36,37,38	strPOSPAV, strFPOSPAV, strLPOSPAV, strCPOSPAV	Stressed, word-average positional probability.*
39,40,41,42	unsLCPOSPAV, unsFLCPOSPAV, unsLLCPOSPAV, unsCLCPOSPAV	Unstressed, length-constrained word-average positional probability. Similar to positional probability, but phonemes in a given position are counted only among words that contain the same number of phonemes.*
43,44,45,46	strLCPOSPAV, strFLCPOSPAV, strLLCPOSPAV, strCLCPOSPAV	Stressed, length-constrained word-average positional probability.*
47	SFreq	SUBTLEXus word frequency. **
48	SCDcnt	SUBTLEXus CD count, another measure of word frequency. **

*Note: all measures above that are listed in groups of four were calculated either as unweighted counts or weighted with different frequency measures. Each group of four is ordered: unweighted, SUBTLEXus frequency weighted, log (base 10) SUBTLEXus frequency weighted, and SUBTLEXus context count (CD count) weighted, respectively.

**Note: SUBTLEX word frequency columns (47,48) are only available for words (not pseudowords).
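Because the version 2.0 column names follow a regular pattern, the name-to-number mapping in the table above can be rebuilt programmatically, which may be convenient when selecting columns from a downloaded file. The sketch below reproduces only the names and numbers listed in the table; the helper names are hypothetical, and nothing is assumed about the file's delimiter or header row.

```python
from typing import Dict, List

SINGLE_COLUMNS = ["Indx", "Word", "UnTrn", "StTrn", "NSYL", "NPHON"]
MEASURES = ["DENS", "BPAV", "TPAV", "POSPAV", "LCPOSPAV"]
# Quad order: unweighted, frequency, log frequency, context (CD) count.
WEIGHT_PREFIXES = ["", "F", "L", "C"]


def quad_names(stress: str, measure: str) -> List[str]:
    """Return the four column names for one measure, in table order."""
    if measure == "DENS":
        # Density names shorten the suffix when weighted: DENS, FDEN, LDEN, CDEN.
        return [stress + name for name in ("DENS", "FDEN", "LDEN", "CDEN")]
    return [stress + prefix + measure for prefix in WEIGHT_PREFIXES]


def build_column_index() -> Dict[str, int]:
    """Map each version 2.0 column name to its 1-based column number."""
    names = list(SINGLE_COLUMNS)
    for measure in MEASURES:
        for stress in ("uns", "str"):  # unstressed quad precedes stressed quad
            names.extend(quad_names(stress, measure))
    names.extend(["SFreq", "SCDcnt"])
    return {name: number for number, name in enumerate(names, start=1)}


COLUMNS = build_column_index()
print(COLUMNS["unsDENS"], COLUMNS["strLBPAV"], COLUMNS["SCDcnt"])  # 7 21 48
```

When a row has been split into a Python list, subtract 1 from these numbers to get the list index.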

Additional Information about Version 1.4

Words & Pseudowords: nearly identical file structure, with 2 key differences. The word and pseudoword text files have identically organized columns, with two exceptions. First, in the word collection the last two columns hold Kucera-Francis frequencies, while pseudowords have neither value. Second, the first column of the pseudoword file shows the *word that was changed to produce the pseudoword*. The pseudoword files can seem confusing at first, since many people read the "word" column entry and overlook the transcription, which differs from that word and is the pseudoword as it is pronounced. Each pseudoword was generated by changing one phoneme of a real word, so it helps to see what that word was when you try to pronounce the pseudoword correctly. For example, "Fox" might show up as the pseudoword "word" entry, but the transcription columns read "F AH Z", so it is pronounced "Foz".

Version 1.4: Summary of Columns and Contents

Each file of the database contains columns 1-44, and all word entries also contain Kucera-Francis word frequencies (columns 45, 46). IPhOD values can be used to find items, or to quantify properties of a specified word list (number of neighbors, etc.).

Version 1.4: All Values by Column Number and Title:

Column	Heading	Description
1	Word	Orthographic form of word, or altered word that generated pseudoword
2	NPHON	Number of phonemes
3	NSYL	Number of syllables
4 ... 20	PH01...17	CMU Pronouncing Dictionary phonetic transcription (1, 2, 0 stress)
21	strDENS	Stressed phonological neighborhood density; distinct stressed-vowels
22	strFDEN	strDENS weighted with Kucera-Francis frequency of neighbors
23	strLDEN	strDENS weighted with Kucera-Francis log frequency of neighbors
24	unsDENS	Unstressed phonological neighborhood density; vowel-stress ignored
25	unsFDEN	unsDENS weighted with Kucera-Francis frequency of neighbors
26	unsLDEN	unsDENS weighted with Kucera-Francis log frequency of neighbors
27	strBPAV	Stressed biphoneme probability average; distinct stressed-vowels
28	strFBPAV	strBPAV weighted with Kucera-Francis word frequency
29	strLBPAV	strBPAV weighted with log Kucera-Francis word frequency
30	unsBPAV	Unstressed biphoneme probability average; vowel-stress ignored
31	unsFBPAV	unsBPAV weighted with Kucera-Francis word frequency
32	unsLBPAV	unsBPAV weighted with log Kucera-Francis word frequency
33	strTPAV	Stressed triphoneme probability average; distinct stressed-vowels
34	strFTPAV	strTPAV weighted with Kucera-Francis frequency
35	strLTPAV	strTPAV weighted with log Kucera-Francis frequency
36	unsTPAV	Unstressed triphoneme probability average; vowel-stress ignored
37	unsFTPAV	unsTPAV weighted with Kucera-Francis frequency
38	unsLTPAV	unsTPAV weighted with log Kucera-Francis frequency
39	strPOSPAV	Stressed positional probability average; distinct stressed-vowels
40	strFPOSPAV	strPOSPAV weighted with Kucera-Francis frequency
41	strLPOSPAV	strPOSPAV weighted with log Kucera-Francis frequency
42	unsPOSPAV	Unstressed positional probability; vowel-stress ignored
43	unsFPOSPAV	unsPOSPAV weighted with Kucera-Francis frequency
44	unsLPOSPAV	unsPOSPAV weighted with log Kucera-Francis frequency
45	KFFREQ	Kucera-Francis Written Word Frequency for real words
46	LOGFRQ	log Kucera-Francis Written Word Frequency for real words
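To read a pseudoword correctly from a version 1.4 file, take the pronunciation from the phoneme columns (4 to 20) rather than from the "Word" column. The sketch below does this for a single row. It assumes tab-separated fields, and the row itself is made up around the "Fox" example given earlier; the function name is hypothetical. NPHON (column 2) is used to decide how many phoneme slots to read, so nothing is assumed about how unused slots are filled.

```python
from typing import List


def transcription(fields: List[str]) -> List[str]:
    """Return the phonemes of one version 1.4 row.

    fields[0] is Word, fields[1] is NPHON, fields[2] is NSYL,
    and fields[3] onward are the phoneme slots PH01, PH02, ...
    """
    nphon = int(fields[1])
    return fields[3:3 + nphon]


# Made-up pseudoword row: columns 1-3, then 17 phoneme slots (3 of them used).
# Measure columns 21-44 are omitted because they are not needed here.
line = "\t".join(["Fox", "3", "1", "F", "AH", "Z"] + [""] * 14)
fields = line.split("\t")

print(fields[0])              # Fox: the real word that was altered
print(transcription(fields))  # ['F', 'AH', 'Z']: the pseudoword as pronounced
```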