Introduction to Text Analysis for English Language Education Research

Yuichiro Kobayashi (Toyo University)
2nd Special Seminar, Basic Research Group on Foreign Language Education (February 18, 2016, Nagoya University)

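About the speaker
•  Yuichiro Kobayashi
   –  Assistant Professor, Department of Media and Communications, Faculty of Sociology, Toyo University
   –  Ph.D. (Language and Culture, Osaka University)
•  Research interests
   –  Learner corpus research (writing, speaking)
   –  Automated scoring systems (writing, speaking)
   –  Stylometry and text mining (English, Japanese)
   –  Language statistics, etc.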

https://twitter.com/langstat

http://langstat.hatenablog.com/

How this talk came about
It started as a casual joke...

...and then things moved fast (laughs)

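Contents
•  Introduction
•  The R statistical environment
•  Text analysis for English education research (basics)
•  Text analysis for English education research (advanced)
•  Conclusion
•  Key references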

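Introduction
•  Uses of corpora in English language education
   –  Developing teaching materials based on authentic language data
      •  Textbooks (e.g., Gillard & Gadsby, 1998)
      •  Dictionaries (e.g., Kaszubski, 1998)
   –  Identifying ESP/EAP vocabulary with special-purpose corpora (e.g., Coxhead, 2000)
   –  Developing language tests that reflect what learners actually do (e.g., Barker, 2006)
   –  Building data-driven learning systems (e.g., Granger & Tribble, 1998)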

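•  Corpora and their educational applications (Tono, 2015)
   Perspective: mode of use
   –  Direct use: classroom use (data-driven learning, DDL); teacher training
   –  Indirect use: reference materials (learner vocabulary lists); teaching materials (dictionaries, grammar books, textbooks, etc.); syllabi and curricula; language tests; CALL systems
   –  Building corpora for education: learner corpora; ESP/EAP corpora; difficulty-adjusted corpora
   Perspective: corpus information
   –  Lexis: lexical statistics (frequency, distribution), collocation, domain-specific keywords, etc.
   –  Syntax: parts of speech, part-of-speech sequences, parsing (dependency), colligation, verb subcategorization, length and complexity of noun phrases, etc.
   –  Discourse: cohesion, coherence, discourse markers, etc.
   Perspective: learners
   –  External: learning environment (EFL vs. ESL, teachers' instructional skills, IT skills, school IT infrastructure, etc.); mode of learning (group vs. individual, etc.)
   –  Internal: cognitive factors (cognitive and learning styles, age, first language, foreign-language proficiency, aptitude, etc.); affective factors (motivation, personality, needs, etc.)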

•  Information added to corpora (e.g., Garside, Leech, & McEnery, 1997)
   –  Part-of-speech tags
   –  Syntactic tags
   –  Semantic tags
   –  Discourse tags
   –  Error tags, etc.

Yes. Uh. Usually, <at odr="1" crr="the"></at> museum <v_agr odr="2" crr="opens">open</v_agr> on <n_num odr="3" crr="Saturdays">Saturday</n_num> and <n_num odr="4" crr="Sundays">Sunday</n_num>. (NICT JLE Corpus)

Well/UH what/WP do/VBP you/PRP think/VB about/IN the/DT idea/NN of/IN ,/, uh/UH ,/, kids/NNS having/VBG to/TO do/VB public/JJ service/NN work/NN for/IN a/DT year/NN ?/. Do/VBP you/PRP think/VBP it/PRP 's/BES a/DT ,/, (Penn Treebank Switchboard Corpus)

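•  Methods of corpus analysis (Meunier, 1998)
   –  Word counts (e.g., tokens, types)
   –  Word and sentence statistics (e.g., mean sentence length, T-unit)
   –  Frequency analysis of vocabulary (e.g., word lists, distribution and dispersion plots)
   –  Contextual analysis of vocabulary (e.g., KWIC concordance, collocation, n-gram)
   –  Lexical indices (e.g., TTR, lexical diversity)
   –  Grammatical analysis (e.g., part-of-speech tag sequences)
   –  Syntactic analysis (e.g., analysis of parsed data), etc.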

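•  Corpora are getting larger
   –  Advantage: findings that generalize better
   –  Drawback: data handling becomes harder
•  Analytical methods are getting more sophisticated
   –  Advantage: linguistic phenomena that could not be examined before can now be observed
   –  Drawback: advanced data-analysis skills are required
•  New analysis tools
   –  Advantage: cutting-edge techniques become available to everyone
   –  Drawback: the processing becomes a black box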

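•  (Over)reliance on tools
   –  Researchers use only the "lowest common denominator" functions that most users need (Asao & Li, 2013)
   –  Because the data processing is a black box, it is hard to verify that the output is correct (Ohna, 2012)
   –  Each tool defines a "word" differently (Anthony, 2013), and "word counts" computed by different tools can differ by up to 10% (Meunier, 1998)
   –  When a study depends on an existing tool, the limits of the tool become the limits of the research (Gries, 2010)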
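The point about differing definitions of a "word" can be illustrated with a minimal sketch. The sentence below is invented for illustration and is not from the talk's data; the two tokenization rules are simply two plausible choices, not the rules of any particular tool.

```r
# Toy sentence (hypothetical; not taken from the talk's data)
text <- "Don't over-rely on tools: e-mail the co-author's well-known script."

# Definition 1: a word is any run of characters between whitespace
tokens_ws <- strsplit(text, "\\s+")[[1]]

# Definition 2: a word is a run of letters/digits (split on non-word characters)
tokens_nw <- strsplit(text, "\\W+")[[1]]
tokens_nw <- tokens_nw[tokens_nw != ""]

length(tokens_ws)  # 9 tokens
length(tokens_nw)  # 15 tokens
```

The same sentence yields 9 or 15 "words" depending on how hyphens and apostrophes are treated, which is why counts from different tools should not be mixed.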

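•  The case for writing your own programs (Biber, Conrad, & Reppen, 1998)
   –  Analyses that existing tools cannot do become possible
   –  Searches become faster and more accurate
   –  Output can be tailored to your own research
   –  You are not constrained by data size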

•  Programming languages suitable for text analysis
   –  Perl (e.g., Akasegawa & Nakao, 2004; Hammond, 2003)
   –  Python (e.g., Asao & Li, 2013; Johnson, 2013)
   –  Ruby (e.g., Ogino & Tanomura, 2012)
   –  Java (e.g., Hammond, 2002; Mason, 2001)
   –  AWK (e.g., Schmitt, Christianson, & Gupta, 2010)
   –  PHP (e.g., Saito & Takahashi, 2014)
   –  R (e.g., Gries, 2009; Jockers, 2014), etc.

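The R statistical environment
•  Advantages of using R (Funao & Takanami, 2005)
   –  It is free software, so anyone can use it at no cost
   –  It runs on many operating systems, including Windows and Mac
   –  The language design is very easy to understand
   –  The graphics facilities are highly refined
   –  A very rich set of free extension packages is available
•  R is the lingua franca of statistical computing (Everitt & Hothorn, 2010)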

http://blog.revolutionanalytics.com/2012/05/r-now-a-major-programming-language-sees-a-127-growth-in-book-sales.html
http://blog.revolutionanalytics.com/2012/08/r-language-popularity-for-data-mining.html

•  R packages for text analysis (a selection)
   –  corpora
   –  koRpus
   –  languageR
   –  RMeCab
   –  SnowballC
   –  stringi
   –  stringr
   –  stylo
   –  tm
   –  wordcloud
   –  zipfR, etc.

Text analysis for English education research (basics)

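Lexical diversity
•  Lexical diversity in English education research
   –  Analysis of government-approved textbooks and graded readers (e.g., Kobayashi & Kitao, 2010)
   –  Analysis of the vocabulary in learners' speech (e.g., Crossley, Salsbury, & McNamara, 2009), etc.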

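•  Measures of lexical diversity
   –  Type-token ratio (TTR): the ratio of types (distinct words) to tokens (total words)
   –  Guiraud Index: the ratio of types to the square root of tokens (Guiraud, 1954)
   –  Standardized type/token ratio (STTR): the text is split into segments of a fixed number of words, and the TTRs of the segments are averaged
   –  Moving-average type-token ratio (MATTR): the moving average of TTR over a window of fixed size (window size) (Covington & McFall, 2010)
   –  Measure of textual lexical diversity (MTLD): the mean number of consecutive words needed for the TTR to fall to a fixed value (factor size) (McCarthy & Jarvis, 2010), etc.
   (e.g., Kojima, 2012; Malvern, Richards, Chipere, & Duran, 2004)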
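The first few measures are simple enough to write by hand. The following is a minimal base-R sketch under stated assumptions: the function names `ttr`, `guiraud`, and `mattr` and the toy token vector are made up for illustration, and tokens are assumed to be already lower-cased and split. It is not the implementation used by any package.

```r
# Type-token ratio: distinct words divided by total words
ttr <- function(tokens) length(unique(tokens)) / length(tokens)

# Guiraud index: distinct words divided by the square root of total words
guiraud <- function(tokens) length(unique(tokens)) / sqrt(length(tokens))

# MATTR: mean TTR over every window of `window` consecutive tokens
mattr <- function(tokens, window = 100) {
  n <- length(tokens)
  if (n < window) stop("text is shorter than the window")
  starts <- seq_len(n - window + 1)
  mean(vapply(starts,
              function(s) ttr(tokens[s:(s + window - 1)]),
              numeric(1)))
}

# Toy data (hypothetical): 10 tokens, 7 types
toy <- c("the", "cat", "sat", "on", "the", "mat", "and", "the", "dog", "sat")
ttr(toy)                # 0.7
guiraud(toy)            # 7 / sqrt(10)
mattr(toy, window = 5)  # 0.9
```

Writing the measures this way makes the definition of a "type" explicit, so the same tokenization is guaranteed across all indices.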

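•  Tools for computing lexical diversity (Koizumi, 2012)
   –  vocd (McKee, Malvern, & Richards, 2000)
   –  D_Tools (Meara & Miralpeix, 2007)
   –  Gramulator (McCarthy, 2011), etc.
   ↓
   –  As noted earlier, each tool defines a "word" differently, so the same index computed with different tools can take different values
   –  When computing several indices, it is best to compute them all with the same tool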

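Computing lexical diversity in R
•  The koRpus package
   –  Implements advanced text-analysis functions
   –  Can compute lexical diversity and readability
   ↓
•  Analyzing the English section of the National Center Test
   –  The long reading passage in Question 6 of the English (written) test, Heisei 27 (2015) main examination
   –  Data obtained from the website of the National Center for University Entrance Examinations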

•  Data analyzed (excerpt)

In 1877, Thomas Edison invented the phonograph, a new device that could record and play back sound. For the first time, people could enjoy the musical performance of a full orchestra in convenience of their own homes. A few years later, Bell Laboratories developed a new phonograph that offered better sound quality; voices and instruments sounded clearer and more true-to-life. These early products represent two major focuses in the development of audio technology — making listening easier and improving the sound quality of music we hear. The advances over the years have been significant in both areas, but it is important not to let the music itself get lost in all the technology.

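The koRpus package in R
•  R script
> # Install the packages
> install.packages(c("koRpus", "plotrix"), dependencies = TRUE)
> # Load the package
> library(koRpus)
> # Note: recent koRpus releases keep language support in separate packages,
> # so library(koRpus.lang.en) may also be needed before tokenizing English text
> # Read in the text
> tok <- tokenize("eigo2014A.txt", lang = "en")
> # Check a summary of the loaded data
> summary(tok)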

  Sentences: 29
  Words:     646 (22.28 per sentence)
  Letters:   3169 (4.91 per word)

  Word class distribution:
              num        pct
  word        642 99.3798450
  unknown       3  0.4651163
  number        1  0.1550388
  comma        35         NA
  fullstop     29         NA
  punctuation   9         NA

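> # Check the stylistic features
> textFeatures(tok)
  uniqWd    complx sntCt   sntLen   syllCt
1    322 0.4783282    29 22.27586 1.673375
  charCt lttrCt      FOG   flesch
1   3899   3169 16.77412 42.65751
> # uniqWd = number of types (distinct words)
> # complx = TTR
> # sntCt  = number of sentences
> # sntLen = mean sentence length
> # syllCt = mean number of syllables per word
> # charCt = number of characters excluding whitespace
> # lttrCt = number of characters excluding whitespace, punctuation, and digits
> # FOG    = readability index (discussed later)
> # flesch = readability index (discussed later)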

•  Functions for computing lexical diversity
   –  C.ld: Herdan's C
   –  CTTR: Carroll's corrected type-token ratio
   –  HDD: the HD-D
   –  K.ld: Yule's K
   –  maas: Maas' indices
   –  MSTTR: Mean segmental type-token ratio
   –  R.ld: Guiraud index
   –  S.ld: Summer's index
   –  U.ld: Dugast's Uber index, etc.

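•  Which lexical diversity index should be used?
   –  TTR is strongly affected by the number of tokens in a text, so it is not suitable for comparing texts whose lengths differ greatly (e.g., Baayen, 2008)
   –  Recent vocabulary research reports that MATTR and MTLD are robust (e.g., Koizumi, 2012)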
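The length sensitivity of TTR can be seen in a small simulation. This is a sketch under explicit assumptions: the "text" is artificial, drawn at random from a hypothetical 2,000-word vocabulary with Zipf-like probabilities, and none of it comes from the talk's data.

```r
set.seed(1)

# Hypothetical vocabulary with Zipf-like probabilities (rank r has weight 1/r)
vocab <- paste0("w", 1:2000)
prob  <- (1 / seq_along(vocab)) / sum(1 / seq_along(vocab))

# One artificial text of 5,000 tokens
long_text <- sample(vocab, 5000, replace = TRUE, prob = prob)

# TTR of the first n tokens, for increasing n
sizes <- c(100, 500, 1000, 5000)
ttr_by_size <- sapply(sizes, function(n) length(unique(long_text[1:n])) / n)
names(ttr_by_size) <- sizes
round(ttr_by_size, 3)
```

Although every prefix comes from the same source, the TTR drops steadily as the sample grows, so a lower TTR in a longer text says little about its vocabulary.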

> # Compute MATTR
> MATTR(tok)
Language: "en"
Total number of tokens: 646
Total number of types:  309

Moving-Average Type-Token Ratio
   MATTR: 0.74
   SD of TTRs: 0.03
   Window size: 100

Note: Analysis was conducted case insensitive.

> # Compute MTLD
> MTLD(tok)
Language: "en"
Total number of tokens: 646
Total number of types:  309

Measure of Textual Lexical Diversity
   MTLD: 106.03
   Number of factors: 6.09
   Factor size: 0.72
   SD tokens/factor: 41.5 (all factors)
                     35.15 (complete factors only)

Note: Analysis was conducted case insensitive.
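To make "factors" and "factor size" concrete, here is a deliberately simplified sketch of the MTLD idea. It is an assumption-laden illustration, not the koRpus implementation: it makes a single forward pass only, whereas the published procedure (McCarthy & Jarvis, 2010) also runs the pass backwards and averages the two. The function name `mtld_forward` and the toy input are hypothetical.

```r
mtld_forward <- function(tokens, factor.size = 0.72) {
  factors <- 0
  types   <- character(0)
  count   <- 0
  for (tok in tokens) {
    count <- count + 1
    if (!(tok %in% types)) types <- c(types, tok)
    # A factor is complete once the running TTR drops below the threshold
    if (length(types) / count < factor.size) {
      factors <- factors + 1
      types   <- character(0)
      count   <- 0
    }
  }
  # Credit the leftover segment as a partial factor
  if (count > 0) {
    rest_ttr <- length(types) / count
    factors  <- factors + (1 - rest_ttr) / (1 - factor.size)
  }
  length(tokens) / factors
}

# Toy input (hypothetical): "a b a b ..." completes a factor every 3 tokens
toy <- rep(c("a", "b"), 10)
mtld_forward(toy)  # 20 tokens / 6 factors
```

A highly repetitive text completes factors quickly and gets a low score; a varied text needs many tokens per factor and gets a high one.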

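Readability
•  Indices for measuring how easy a text is to read
   –  Widely used, for example in drafting official documents and evaluating English compositions
   –  The average American adult reads at about an 8th-grade level (educational level and reading level do not necessarily match) (DuBay, 2006)
   –  Most bestsellers in the United States are written at a 7th-grade level
   –  Some laws require health and safety information to be written at a 5th-grade level (Doak, Doak, & Root, 1996)
   –  Also applied to the evaluation of approved English textbooks, graded readers, and entrance examination questions (Chujo & Hasegawa, 2004; Kobayashi & Kitao, 2010)
https://en.wikipedia.org/wiki/Readability
https://ja.wikipedia.org/wiki/%E5%8F%AF%E8%AA%AD%E6%80%A7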

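•  Main readability indices
   –  Flesch-Kincaid Grade Level (FKGL): uses the mean number of words per sentence and the mean number of syllables per word (Kincaid, Fishburne, Rogers, & Chissom, 1975)
   –  Coleman-Liau Index (CLI): uses the mean number of letters per word and the number of sentences per 100 words (Coleman & Liau, 1975)
   –  Automated Readability Index (ARI): uses the mean number of words per sentence and the mean number of characters per word (Senter & Smith, 1967), etc.
   –  Incidentally, the correlation coefficients among these three indices are around 0.95 to 0.98 (Someya, 2009)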
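The three indices can be computed directly from a handful of counts. The sketch below uses the standard published formulas and plugs in the counts that koRpus reported for the 2014A passage earlier in the talk (646 words, 29 sentences, 3,169 letters, 1.673375 syllables per word). The variable names are made up, and using the letter count as the "characters" in ARI is an assumption.

```r
# Counts reported for the 2014A passage
n_words       <- 646
n_sentences   <- 29
n_letters     <- 3169
syll_per_word <- 1.673375

words_per_sentence <- n_words / n_sentences

# Flesch-Kincaid Grade Level
fkgl <- 0.39 * words_per_sentence + 11.8 * syll_per_word - 15.59

# Coleman-Liau Index: letters and sentences per 100 words
cli <- 0.0588 * (n_letters / n_words * 100) -
       0.296  * (n_sentences / n_words * 100) - 15.8

# Automated Readability Index
ari <- 4.71 * (n_letters / n_words) + 0.5 * words_per_sentence - 21.43

round(c(FKGL = fkgl, CLI = cli, ARI = ari), 2)  # 12.84 11.72 12.81
```

All three place the passage at roughly a 12th to 13th US grade level, which is consistent with the high correlations reported among them.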

> # Compute the Flesch-Kincaid Grade Level
> flesch.kincaid(tok)
Flesch-Kincaid Grade Level
  Parameters: default
       Grade: 12.84
         Age: 17.84

Text language: en
> # Grade is the score expressed as a US school grade
> # Age is the score expressed as the typical age of a US student in that grade

> # Compute the Coleman-Liau Index
> coleman.liau(tok)
Coleman-Liau
  Parameters: default
         ECP: 41% (estimated cloze percentage)
       Grade: 11.72
       Grade: 11.72 (short formula)

Text language: en
> # Compute the Automated Readability Index
> ARI(tok)
Automated Readability Index (ARI)
  Parameters: default
       Grade: 12.81

Text language: en

•  Functions for computing readability
   –  bormuth: Bormuth mean cloze and grade placement
   –  coleman: Coleman's readability formulas
   –  dale.chall: New Dale-Chall readability formula
   –  danielson.bryan: Danielson-Bryan formula
   –  DRP: Degrees of reading power
   –  ELF: Fang's easy listening formula
   –  FOG: Gunning FOG index
   –  harris.jacobson: Revised Harris-Jacobson readability formulas
   –  linsear.write: Linsear Write index, etc.

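Comparing multiple texts
•  Lexical diversity and readability of National Center Test (English) passages
   –  Heisei 25 (2013) main examination (eigo2012A.txt)
   –  Heisei 25 (2013) makeup examination (eigo2012B.txt)
   –  Heisei 26 (2014) main examination (eigo2013A.txt)
   –  Heisei 26 (2014) makeup examination (eigo2013B.txt)
   –  Heisei 27 (2015) main examination (eigo2014A.txt)
   –  Heisei 27 (2015) makeup examination (eigo2014B.txt)
   (Only the long reading passage in Question 6 of the English test)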

> # Read in the texts
> tok2012A <- tokenize("eigo2012A.txt", lang = "en")
> tok2012B <- tokenize("eigo2012B.txt", lang = "en")
> tok2013A <- tokenize("eigo2013A.txt", lang = "en")
> tok2013B <- tokenize("eigo2013B.txt", lang = "en")
> tok2014A <- tokenize("eigo2014A.txt", lang = "en")
> tok2014B <- tokenize("eigo2014B.txt", lang = "en")
> # Compute the stylistic features
> w2012A <- textFeatures(tok2012A)
> w2012B <- textFeatures(tok2012B)
> w2013A <- textFeatures(tok2013A)
> w2013B <- textFeatures(tok2013B)
> w2014A <- textFeatures(tok2014A)
> w2014B <- textFeatures(tok2014B)
> # Combine the six results into a single table
> df <- rbind(w2012A, w2012B, w2013A, w2013B, w2014A, w2014B)
> rownames(df) <- c("2012A", "2012B", "2013A", "2013B", "2014A", "2014B")
> # Check the combined table
> df

      uniqWd    complx sntCt   sntLen   syllCt
2012A    282 0.4424342    33 18.42424 1.554276
2012B    286 0.4417077    30 20.30000 1.518883
2013A    327 0.4968254    37 17.02703 1.601587
2013B    320 0.4678899    37 17.67568 1.600917
2014A    322 0.4783282    29 22.27586 1.673375
2014B    324 0.4605067    45 14.91111 1.420268
      charCt lttrCt       FOG   flesch
2012A   3574   2898 12.369697 56.64262
2012B   3445   2766 13.899967 57.73296
2013A   3723   3016 13.604462 54.05828
2013B   3883   3119 13.064154 53.45657
2014A   3899   3169 16.774122 42.65751
2014B   3700   2927  9.600808 71.54553

–  The FOG Index uses the mean number of words per sentence and the proportion of words with three or more syllables among all words (Gunning, 1952)
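The flesch column in this table can be reproduced from the sntLen and syllCt columns using the standard Flesch Reading Ease formula. The sketch below copies the values from the table; the variable names are made up for illustration.

```r
exams  <- c("2012A", "2012B", "2013A", "2013B", "2014A", "2014B")
sntLen <- c(18.42424, 20.30000, 17.02703, 17.67568, 22.27586, 14.91111)
syllCt <- c(1.554276, 1.518883, 1.601587, 1.600917, 1.673375, 1.420268)
flesch_reported <- c(56.64262, 57.73296, 54.05828, 53.45657, 42.65751, 71.54553)

# Flesch Reading Ease: longer sentences and longer words lower the score
flesch <- 206.835 - 1.015 * sntLen - 84.6 * syllCt
names(flesch) <- exams

round(flesch, 2)
max(abs(flesch - flesch_reported))  # differences are only rounding noise
```

Unlike FOG and the grade-level indices, a higher Flesch Reading Ease score means an easier text, so its direction has to be reversed before it is compared with the other columns.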

> # Standardize the data (z-scores per column)
> df.scale <- data.frame(scale(df))
> # Reverse the sign of the standardized Flesch Reading Ease scores
> # (higher Flesch scores mean easier text, unlike the other indices)
> df.scale[, "flesch"] <- df.scale[, "flesch"] * -1
> # Draw a radar chart
> library("plotrix")
> radial.plot(df.scale, rp.type = "p", labels = colnames(df.scale), line.col = 1 : 6, lty = 1 : 6, lwd = 3, show.grid.labels = NA)
> legend("topright", legend = rownames(df.scale), col = 1 : 6, lty = 1 : 6, lwd = 3)

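The text of the Heisei 27 main examination (2014A) is relatively difficult, while the texts of the Heisei 25 main examination (2012A) and the Heisei 25 makeup examination (2012B) are relatively easy.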

Text analysis for English education research (advanced)

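More serious text processing
•  Lexical diversity and readability
   –  Summarize the characteristics of a whole text in a single value
   ↓
•  Detailed analysis that focuses on specific words and sentences (corpus analysis)
   –  Concordance searches
   –  Frequency analysis of vocabulary
   –  Co-occurrence analysis of vocabulary, etc.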

Vocabulary analysis of a research article
•  Data analyzed
   –  Fukuta, J., & Yamashita, J. (2015). Effects of cognitive demands on attention orientation in L2 oral production. System, 53, 1-12.

> # Install the packages
> install.packages(c("tm", "wordcloud"), dependencies = TRUE)
> # Read in the file
> corpus <- scan("fukuta.txt", what = "char", sep = "\n", quiet = TRUE)
> # Convert upper case to lower case
> corpus.lower <- tolower(corpus)
> # Convert the text into a vector of words
> word.vector <- unlist(strsplit(corpus.lower, "\\W"))
> # Remove empty strings
> not.blank <- which(word.vector != "")
> word.vector2 <- word.vector[not.blank]
> # Check the data
> head(word.vector2, 20)
 [1] "effects"     "of"          "cognitive"   "demands"
 [5] "on"          "attention"   "orientation" "in"
 [9] "l2"          "oral"        "production"  "junya"
[13] "fukuta"      "junko"       "yamashita"   "abstract"
[17] "in"          "the"         "field"       "of"

•  Concordance search
> # Get the positions where the search word occurs
> word.positions <- which(word.vector2 == "task")
> # Set how many words of context to show on each side (5 words in this example)
> context <- 5
> # Build the KWIC concordance
> for (i in seq_along(word.positions)) {
+   pos <- word.positions[i]
+   # Clamp the window so it never runs past the start or end of the text
+   start <- max(1, pos - context)
+   end <- min(length(word.vector2), pos + context)
+   before <- if (pos > start) word.vector2[start : (pos - 1)] else character(0)
+   after <- if (pos < end) word.vector2[(pos + 1) : end] else character(0)
+   keyword <- word.vector2[pos]
+   cat("--------------------", i, "--------------------", "\n")
+   cat(before, "[", keyword, "]", after, "\n")
+ }

-------------------- 1 --------------------
supported by many researchers the [ task ] is a critical concept in
-------------------- 2 --------------------
demands reasoning demand and dual [ task ] demand influenced the occurrence and
-------------------- 3 --------------------
accuracy scores but the dual [ task ] demand did not both types
-------------------- 4 --------------------
did not both types of [ task ] demands reduced fluency scores but
-------------------- 5 --------------------
verbal protocol analysis suggested that [ task ] demands inhibited learners attention to
(...)
-------------------- 101 --------------------
learner interaction the relationship among [ task ] characteristics learners internal cognitive process

Each individual example can then be examined qualitatively.
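The concordance loop can also be wrapped in a reusable function. The following is a minimal sketch: the function name `kwic` and the toy token vector are hypothetical, and it assumes the text has already been split into a vector of lower-case tokens as above.

```r
# Return one concordance line per hit, with up to `context` words on each side
kwic <- function(tokens, keyword, context = 5) {
  hits <- which(tokens == keyword)
  vapply(hits, function(pos) {
    left  <- tail(tokens[seq_len(pos - 1)], context)   # words before the hit
    right <- head(tokens[-seq_len(pos)], context)      # words after the hit
    paste(paste(left, collapse = " "),
          "[", keyword, "]",
          paste(right, collapse = " "))
  }, character(1))
}

# Toy data (hypothetical)
toy <- c("the", "task", "is", "hard", "but", "the", "dual", "task",
         "demand", "matters")
kwic(toy, "task", context = 2)
# "the [ task ] is hard"  "the dual [ task ] demand matters"
```

Because `head()` and `tail()` simply return fewer words near the edges of the text, hits in the first or last few tokens need no special handling.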

> # Visualize where the search word occurs
> plot(word.vector2 == "task", type = "h", yaxt = "n", main = "task")

Concordance plot (Anthony, 2005)
Shows in which parts of the text a given expression is used most often.

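•  Co-occurrence analysis of vocabulary
> # Set the node word ("task" in this example)
> search.word <- "\\btask\\b"
> # Set the span (up to 3 words on each side in this example)
> span <- 3
> span <- (-span : span)
> # Set the output file
> output.file <- "output.txt"
> # Find the positions where the node word occurs
> positions.of.matches <- grep(search.word, word.vector2, perl = TRUE)
> # Count the collocates at each position in the span
> results <- list()
> for (i in seq_along(span)) {
+   collocate.positions <- positions.of.matches + span[i]
+   # Drop positions that fall outside the text
+   in.range <- collocate.positions >= 1 & collocate.positions <= length(word.vector2)
+   collocates <- word.vector2[collocate.positions[in.range]]
+   sorted.collocates <- sort(table(collocates), decreasing = TRUE)
+   results[[i]] <- sorted.collocates
+ }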

> # Write the header of the table
> cat(paste(rep(c("W_", "F_"), length(span)), rep(span, each = 2), sep = ""), "\n", sep = "\t", file = output.file)
> # Write the counts
> lengths <- sapply(results, length)
> for (k in 1 : max(lengths)) {
+   output.string <- paste(names(sapply(results, "[", k)), sapply(results, "[", k), sep = "\t")
+   output.string.2 <- gsub("NA\tNA", "\t", output.string, perl = TRUE)
+   cat(output.string.2, "\n", sep = "\t", file = output.file, append = TRUE)
+ }

This identifies the words that co-occur most frequently with a given word.

•  Output (excerpt): collocation table

W_-3     F_-3  W_-2     F_-2  W_-1         F_-1  W_0   F_0  W_1              F_1  W_2        F_2  W_3        F_3
the      15    the      17    dual         22    task  101  condition        17   a          5    the        10
in       7     a        7     of           14               demands          17   is         4    learners   6
that     6     and      7     complex      6                complexity       7    the        4    attention  3
a        3     by       4     the          5                demand           6    by         3    be         3
as       2     effects  4     monologic    4                characteristics  5    cognitive  3    demands    3
but      2     finger   4     tapping      4                conditions       5    dc         3    and        2
by       2     that     4     demanding    3                engagement       3    in         3    as         2
demand   2     an       3     interactive  3                features         3    on         3    condition  2
demands  2     in       3     that         3                and              2    were       3    fluency    2
form     2     1        2     this         3                inherent         2    along      2    learner    2
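Each column pair in a collocation table is just a frequency count of the words found at one fixed offset from the node word. A minimal sketch of that single step follows; the function name `collocates_at` and the toy data are hypothetical.

```r
# Count the words found `offset` positions away from each occurrence of `node`
collocates_at <- function(tokens, node, offset) {
  pos <- which(tokens == node) + offset
  pos <- pos[pos >= 1 & pos <= length(tokens)]  # ignore positions outside the text
  sort(table(tokens[pos]), decreasing = TRUE)
}

# Toy data (hypothetical)
toy <- c("dual", "task", "demand", "and", "dual", "task", "condition",
         "the", "task", "demand")

left  <- collocates_at(toy, "task", -1)  # one word to the left
right <- collocates_at(toy, "task",  1)  # one word to the right
left
right
```

In the toy data "dual" appears twice immediately before "task" and "demand" twice immediately after it, mirroring the W_-1 and W_1 columns of the table.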

•  Frequency analysis of vocabulary
> # Remove digits and punctuation from the text
> library(tm)
> corpus.cleaned <- removeNumbers(word.vector2)
> corpus.cleaned <- removePunctuation(corpus.cleaned)
> # Build the word list
> freq.list <- table(corpus.cleaned)
> sorted.freq.list <- sort(freq.list, decreasing = TRUE)
> sorted.table <- paste(names(sorted.freq.list), sorted.freq.list, sep = ": ")
> # Check the word list (top 20 by frequency)
> head(sorted.table, 20)
 [1] "the: 448"     "of: 223"        "to: 193"
 [4] "and: 189"     "in: 149"        "task: 101"
 [7] "that: 94"     "attention: 91"  "a: 71"
[10] "learners: 69" "this: 64"       "as: 60"
[13] "resource: 60" "were: 58"       "form: 55"
[16] "was: 53"      "by: 52"         "on: 50"
[19] "condition: 45" "demands: 45"

> # Draw a word cloud
> library(wordcloud)
> wordcloud(word.vector2, min.freq = 5, random.order = FALSE)

The frequency of each word in the text is reflected in its font size.

•  Analysis of word sequences (n-grams)
> # Extract 2-grams
> x <- length(word.vector2) - 1
> results <- vector()
> for (i in seq_len(x)) {
+   ngram <- paste(word.vector2[i], word.vector2[i + 1])
+   results <- append(results, ngram)
+ }
> # Count frequencies
> ngram.freq <- table(results)
> sorted.ngram.freq <- sort(ngram.freq, decreasing = TRUE)
> sorted.table <- paste(names(sorted.ngram.freq), sorted.ngram.freq, sep = ": ")
> # Show the top 20 by frequency
> head(sorted.table, 20)

Analysis of word sequences is used, for example, in research on lexical bundles (e.g., Biber, Conrad, & Coates, 2004; Coates, 2001; Hyland, 2008).

 [1] "attention to: 62"      "of the: 51"
 [3] "in the: 42"            "resource model: 31"
 [5] "linguistic form: 27"   "that the: 25"
 [7] "to linguistic: 24"     "limited resource: 23"
 [9] "the results: 23"       "dual task: 22"
[11] "to the: 21"            "multiple resource: 20"
[13] "cognitive demands: 19" "on the: 18"
[15] "this study: 18"        "learners attention: 17"
[17] "task condition: 17"    "task demands: 17"
[19] "the limited: 17"       "to form: 17"
> # With a small change to the script, you can count not only 2-word sequences (2-grams)
> # but also 3-word sequences (3-grams) and 4-word sequences (4-grams)
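One way to make that small change is to turn the sequence length into a parameter. This is a minimal sketch, not the talk's own script: the function name `ngrams` and the toy token vector are hypothetical.

```r
# Build all sequences of n consecutive tokens
ngrams <- function(tokens, n = 2) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Toy data (hypothetical)
toy <- c("attention", "to", "linguistic", "form", "and",
         "attention", "to", "form")

bigram.freq <- sort(table(ngrams(toy, 2)), decreasing = TRUE)
bigram.freq        # "attention to" occurs twice
ngrams(toy, 3)     # six 3-grams, starting with "attention to linguistic"
```

The frequency table is built exactly as for 2-grams, so `ngrams(word.vector2, 3)` or `ngrams(word.vector2, 4)` can replace the loop without other changes.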

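Conclusion
•  Empirical research based on data
   –  Reproducibility is required
   ↓
•  Benefits of writing your own programs (Ohna, 2012)
   –  The analyst understands the data-processing steps
   –  The analyst becomes aware of how differences in processing affect the results
   –  Even when using existing tools, the analyst can infer to some extent what the black-boxed processing does, and so avoids misusing them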

Key references
•  Gries, S. Th. (2009). Quantitative corpus linguistics with R: A practical introduction. New York: Routledge.
•  Jockers, M. L. (2014). Text analysis with R for students of literature. New York: Springer.
•  Kobayashi, Y. (2015). 語彙多様性とリーダビリティを用いたテキスト分析 [Text analysis using lexical diversity and readability]. 外国語教育メディア学会中部支部外国語教育基礎研究部会2014年度報告論集, 49-59.
•  Kobayashi, Y. (2015). Rによる英文テキスト解析 [Analyzing English texts with R]. 東洋大学社会学部紀要, 53(1), 51-64.

Yuichiro Kobayashi [email protected]
