A Serverless Japanese Morphological Analysis API Built with AWS API Gateway + Python Lambda + NEologd

Satoru Kadowaki
September 09, 2017

A Serverless Japanese Morphological Analysis API Built with AWS API Gateway + Python Lambda + NEologd #pyconjp2017

Transcript

  1. Self-introduction
     • KADOWAKI Satoru
     • CTO, BakFoo, Inc.
     • PyCon JP 2015 talk: "Tornado/ElasticSearch for large-volume tweet …"
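  2. Purpose
     • A MeCab environment with a dictionary that includes new words
     • We want to use it as casually as possible
     • No need to go as far as a dedicated server ... (serverless)
     • Setting up MeCab is surprisingly tedious
     • OS-dependent character encoding problems
     • The "what should we do about the dictionary" problem
     • Which dictionary to use, and what kind of dictionary is needed
     • An environment that everyone can share at relatively low cost
     • Overview from build to deployment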
  3. Build environment
     • EC2 for AWS Lambda (required)
     • AMI: amzn-ami-hvm-2016.03.3.x86_64-gp2
     • Python 3.6.1 + Miniconda
     • An example for Python 2.7
     • "AWS Lambda で MeCab を動かす" (Running MeCab on AWS Lambda)
     • http://dev.classmethod.jp/cloud/improved-aws-lambda-with-mecab/
     • Building the NEologd dictionary
     • Ubuntu 16.04
     • Callout: compile mecab on this AMI!
  4. 1. Setting up the build environment
     • EC2 instance for Lambda
     • AMI: amzn-ami-hvm-2016.03.3.x86_64-gp2
     • Installation steps
         $ sudo yum install gcc
         $ sudo yum install gcc-c++
         $ sudo yum install git
         $ sudo yum install patch
         $ sudo rpm -ivh \
             http://packages.groonga.org/centos/groonga-release-1.1.0-1.noarch.rpm
  5. 1. Setting up the build environment
     • Python environment
         $ wget --quiet \
             https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh \
             -O ~/miniconda3.sh
         $ /bin/bash ~/miniconda3.sh -b -p $HOME/miniconda3
         $ echo 'export PATH="$HOME/miniconda3/bin:$PATH"' >> ~/.bashrc
         $ pip install boto3
  6. 1. Setting up the build environment
     • Install mecab for Lambda
         $ mkdir $HOME/lambda_neologd
         $ curl -L \
             "https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7cENtOXlicTFaRUE" \
             -o mecab-0.996.tar.gz
         $ tar zxvf mecab-0.996.tar.gz && cd mecab-0.996
         $ ./configure --prefix=$HOME/lambda_neologd/local --enable-utf8-only
         $ make && make install
     • Callout: don't forget!
  7. 1. Setting up the build environment
     • Install mecab-ipadic in the same way
         $ curl -L \
             "https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7MWVlSDBCSXZMTXM" \
             -o mecab-ipadic-2.7.0-20070801.tar.gz
         $ tar xvzf mecab-ipadic-2.7.0-20070801.tar.gz
         $ cd mecab-ipadic-2.7.0-20070801/
         $ export PATH=$HOME/lambda_neologd/local/bin:$PATH
         $ export \
             LD_LIBRARY_PATH=$HOME/lambda_neologd/local/lib:$LD_LIBRARY_PATH
         $ ./configure --prefix=$HOME/lambda_neologd/local \
             --enable-utf8-only --with-charset=utf8
         $ make && make install
     • Callout: don't forget the environment variables
  8. 1. Setting up the build environment
     • Check that mecab works
         $ mecab -F "%m\t%h,%H,%pw,%pc,%pn\n"
         mecabテスト
         mecab   41,名詞,固有名詞,一般,*,*,*,mecab,メカブ,メカブ,6518,6208,6208
         テスト  36,名詞,サ変接続,*,*,*,*,テスト,テスト,テスト,3637,9107,2899
     • Install the Python MeCab binding
         $ cd $HOME/sandbox/lambda_neologd
         $ pip install mecab-python3 -t .
     • Callout: don't forget!
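  9. 1. Setting up the build environment
     • Generating the NEologd dictionary
     • A dictionary built the standard way cannot be deployed on Lambda
     • Build the dictionary in a minimal configuration
         $ git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
         $ cd mecab-ipadic-neologd
         $ ./bin/install-mecab-ipadic-neologd -y \
             -p $HOME/neologd -n --eliminate-redundant-entry
     • Callout: with --eliminate-redundant-entry the dictionary is about 400 MB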
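  10. 1. Setting up the build environment
      • --eliminate-redundant-entry: what do we do about it?
      • The NEologd dictionary registers spelling variants
      • Examples:
        • AMAZON, amazon, Amazon
        • すごい, すごーい, すごーーーーーい
      • Callout: build a normalized dictionary
      • What we do concretely
        • Convert alphabetic characters to half-width lowercase
        • AMAZON, amazon, Amazon → "amazon"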
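The half-width lowercase folding described on this slide can be sketched with the standard library alone. This is a minimal illustration, assuming NFKC normalization followed by `lower()` (the combination that seed_normalize.py applies later in the deck); `normalize_entry` and the sample variants are names made up for this example.

```python
import unicodedata


def normalize_entry(text: str) -> str:
    """Fold width variants with NFKC, then lowercase (hypothetical helper)."""
    return unicodedata.normalize("NFKC", text).lower()


# Spelling variants of one word, including a full-width form.
variants = ["AMAZON", "amazon", "Amazon", "ＡＭＡＺＯＮ"]
normalized = {normalize_entry(v) for v in variants}
print(normalized)

# NFKC also folds half-width katakana into full-width katakana.
print(normalize_entry("ｱﾏｿﾞﾝ"))
```

All four variants collapse to a single key, which is why the same folding has to be applied to the input text before tokenizing.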
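  11. 1. Setting up the build environment
      • Source data for the dictionary
      • mecab-ipadic-neologd/seed/*.csv.xz
      • (Example) mecab-ipadic-neologd/seed/mecab-user-dict-seed.20170828.csv.xz
      • The words are registered inside these CSV.XZ files
      • Normalize these files, then build the dictionary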
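  12. 1. Setting up the build environment
      - After the first dictionary build
          $ python seed_normalize.py
      - Rebuild the dictionary (without updating it)
          $ ./bin/install-mecab-ipadic-neologd -y \
              -p $HOME/neologd --eliminate-redundant-entry
      ① Normalize the seed
      ② Rebuild the dictionary without updating its source data == "do not pass the -n option"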
  13. 1. Setting up the build environment
      • seed_normalize.py
          # coding: utf-8
          import lzma
          import os
          import unicodedata

          seeds_dir = 'mecab-ipadic-neologd/seed/'
          files = os.listdir(seeds_dir)
          for file in files:
              if 'xz' in file:
                  print('normalizing mecab seed...', file)
                  # Read the whole seed file, then rewrite it in place.
                  f = lzma.open(seeds_dir + file, 'rb')
                  lines = f.readlines()
                  f.close()
                  with lzma.open(seeds_dir + file, "w") as f:
                      for line in lines:
                          norm_text = unicodedata.normalize(
                              'NFKC', line.decode()).lower()
                          f.write(norm_text.encode('utf-8'))
      • Callout: it only normalizes
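  14. 1. Setting up the build environment
      • $HOME/lambda_neologd/local/etc/mecabrc
          dicdir = /tmp/neologd
      • Change the dictionary path to /tmp
      • In the build environment, use a symbolic link
          $ ln -s $HOME/neologd /tmp/neologd
      • Callout: the build environment is ready!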
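  15. 2. Writing the Lambda script
      • What the Python script does
        1. Download the dictionary from S3 when the Lambda instance initializes
           • Place the dictionary in /tmp/neologd to match mecabrc
        2. Preprocessing before tokenizing
           • Normalize the text the same way as when generating the dictionary
           • Sanitize the text
           • Change the implementation depending on the kind of text
        3. Add implementations of compound-noun and phrase extraction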
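  16. 2. Writing the Lambda script
      • What is sanitizing?
      • In Japanese text processing
      • Replace decorative strings that carry little meaning within a sentence (dirty text) with their ordinary forms
      • Examples:
        • 同点ゴーーーール！ → 同点ゴール！
        • Collapse repeated long vowel marks (ー)
        • Collapse repeated punctuation marks
        • Fill a space between strings with a period (。)
        • Remove consecutive spaces used for layout (indentation)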
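Three of the sanitizing rules listed above can be sketched with `re.sub`. This is an illustrative sketch, not the deck's normalize.py: the function name `sanitize` and the exact patterns are assumptions, and the rule that turns a space between strings into 。 is left out.

```python
import re


def sanitize(text: str) -> str:
    """Collapse decorative repetition (hypothetical helper)."""
    # Repeated long vowel marks: ゴーーーール -> ゴール
    text = re.sub(r"ー{2,}", "ー", text)
    # Repeated punctuation: 。。。 -> 。
    text = re.sub(r"([。、！？])\1+", r"\1", text)
    # Runs of spaces, tabs, or full-width spaces used for layout -> one space
    text = re.sub(r"[ \t\u3000]{2,}", " ", text)
    return text.strip()


print(sanitize("同点ゴーーーール！"))
print(sanitize("すごい。。。"))
print(sanitize("    見出し"))
```

Each substitution receives the output of the previous one, so the rules accumulate rather than overwrite each other.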
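  17. 2. Writing the Lambda script
      • "NHKマイルチャンピオンシップ"
        → sometimes abbreviated to "NHKマイル"
        → gets split into "NHK" and "マイル"
        → join "NHK" - "マイル"
      • What is a compound noun?
      • Consecutive nouns that are not in the dictionary are combined and treated as a single word (a new word?)
      • "ストーリー・メイキングが素晴らしい" ("The story-making is wonderful")
        → join "ストーリー" - "・" - "メイキング"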
  18. 2. Writing the Lambda script
      • Flow of lambda_function.py
        1. Imports
        2. Initialize the S3 connection (boto3)
        3. Download the mecab dictionary
        4. Load the MeCab module
        5. Tokenizing (core) processing
  19. 2. Writing the Lambda script
      • lambda_function.py
      • 1. Imports
          import os
          import logging
          import traceback
          import unicodedata
          import json
          import boto3
          import ctypes
          import normalize
          import termextract

          libdir = os.path.join(os.getcwd(), 'local', 'lib')
          libmecab = ctypes.cdll.LoadLibrary(os.path.join(libdir,'libmecab.so'))
      • Callout: load libmecab.so using ctypes
  20. 2. Writing the Lambda script
      • lambda_function.py
      • 2. Initialize S3 (boto3)
          # Configuration: AWS
          AWS_S3_BUCKET = 'mecabdic'
          BOTOCONF = {
              'aws_access_key_id': 'AKI.....',
              'aws_secret_access_key': 'pTWl.....',
              'region_name': 'ap-northeast-1'
          }
          boto3.setup_default_session(**BOTOCONF)
          session = boto3._get_default_session()
          session_region = session._session.get_config_variable('region')
          s3 = boto3.client('s3')
  21. 2. Writing the Lambda script
      • lambda_function.py
      • 3. Download the mecab dictionary
          MECAB_DIC_FILES = [  # list of mecab dictionary files
              'char.bin', 'dicrc', 'left-id.def', 'matrix.bin', 'pos-id.def',
              'rewrite.def', 'right-id.def', 'sys.dic', 'unk.dic',
          ]
          DICDIR = '/tmp/neologd/'  # where the dictionary is stored

          # Download the dictionary from S3
          def prepareMecabDic():
              if not os.path.exists(DICDIR):
                  os.mkdir(DICDIR)
              for mdic in MECAB_DIC_FILES:
                  dest_dic = DICDIR + mdic
                  # Fetch only files that are missing or empty
                  if not os.path.exists(dest_dic) or os.path.getsize(dest_dic) == 0:
                      with open(dest_dic, 'wb') as f:
                          s3.download_file(AWS_S3_BUCKET, mdic, dest_dic)

          prepareMecabDic()
      • Callout: download into /tmp/neologd
      • Callout: put the required dictionary files in the list
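The "download only what is missing" check above matters because a warm Lambda container can reuse files already written to /tmp, so only a cold start pays for the transfer. The sketch below isolates that logic so it can be run without AWS. `ensure_files` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a call such as `s3.download_file(bucket, name, path)`.

```python
import os
import tempfile


def ensure_files(names, dest_dir, fetch):
    """Fetch each missing or empty file; return the names actually fetched."""
    os.makedirs(dest_dir, exist_ok=True)
    fetched = []
    for name in names:
        path = os.path.join(dest_dir, name)
        if not os.path.exists(path) or os.path.getsize(path) == 0:
            fetch(name, path)
            fetched.append(name)
    return fetched


def fake_fetch(name, path):
    # Stand-in for the real S3 download (assumption for this example).
    with open(path, "wb") as f:
        f.write(b"dummy")


with tempfile.TemporaryDirectory() as tmp:
    first = ensure_files(["sys.dic", "unk.dic"], tmp, fake_fetch)   # cold start
    second = ensure_files(["sys.dic", "unk.dic"], tmp, fake_fetch)  # warm start
    print(first, second)
```

The second call fetches nothing, which mirrors what a warm invocation would do when the dictionary is still present.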
  22. 2. Writing the Lambda script
      • lambda_function.py
      • 4. Load the MeCab module
          import MeCab

          # init MeCab
          MECABRC = os.path.join(os.getcwd(), 'local', 'etc', 'mecabrc')
          tagger = MeCab.Tagger("-r %s" % MECABRC)

          import normalize    # sanitizing module
          import termextract  # compound-noun generation and phrase extraction module
      • Callout: specify the mecabrc path
      • Callout: sanitizing and compound-noun generation modules
  23. 2. Writing the Lambda script
      • lambda_function.py
      • 5. Tokenizing (core) part
          def lambda_handler(event, context):
              text = event['queryStringParameters'].get('text', '')
              is_termext = event['queryStringParameters'].get('termextract', '')
              is_phrase = event['queryStringParameters'].get('phrase', '')
              if text:
                  text = unicodedata.normalize('NFKC', text.strip()).lower()
                  text = normalize.cleansingText(text)
                  # parse('') before parseToNode is a commonly used workaround
                  # so that node.surface is read correctly in mecab-python3
                  node = tagger.parse('')
                  node = tagger.parseToNode(text)
                  (continued on the next slide ....)
      • Callout: read the HTTP query string
      • Callout: normalize / sanitize / tokenize
  24. 2. Writing the Lambda script
      • lambda_function.py
      • 5. Tokenizing (core) part
                  node = tagger.parse('')
                  node = tagger.parseToNode(text)
                  ### continued
                  tokens = []
                  if is_termext:    # compound-noun generation
                      tokens = termextract.tokenize(node)
                  elif is_phrase:   # phrase extraction
                      tokens = termextract.phrases(node)
                  else:             # plain mecab output
                      while node:
                          token = {
                              'surface': node.surface,
                              'posid': node.posid,
                              'feature': node.feature,
                          }
                          tokens.append(token)
                          node = node.next
      • Callouts: compound-noun generation / phrase extraction / mecab as-is
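The slides stop before the handler returns anything. Since the handler reads `event['queryStringParameters']`, it appears to sit behind an API Gateway Lambda proxy integration, which expects a dict with `statusCode`, `headers`, and a string `body`. The following is a sketch under that assumption; `get_params`, `make_response`, and the `tokens` key in the body are names invented for this example.

```python
import json


def get_params(event):
    # With proxy integration, queryStringParameters is None when the
    # request carries no query string, so fall back to an empty dict.
    return event.get("queryStringParameters") or {}


def make_response(tokens, status=200):
    """Wrap the token list in the shape a proxy integration returns."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json; charset=utf-8"},
        # ensure_ascii=False keeps Japanese surfaces readable in the body
        "body": json.dumps({"tokens": tokens}, ensure_ascii=False),
    }


params = get_params({"queryStringParameters": None})
resp = make_response([{"surface": "テスト", "posid": 36}])
print(params, resp["body"])
```

Guarding against a missing query string avoids an `AttributeError` on requests without parameters.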
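  25. Environment setup
      • normalize.py: text sanitizing
          import re

          CLEANSING_PATTERNS = [
              [r'\r\n|\r|\n|\\n', '。'],           # remove newline codes
              [r'\t+|\s+', ' '],                   # consecutive tabs or consecutive spaces
              [u'\u30FC+', u'\u30FC'],             # collapse consecutive long vowel marks into one
              [r'([^\w])\s([^\w])', r'\1。\2'],    # replace a space between two-byte characters with a period
              [u'・・+', '・'],                    # collapse consecutive middle dots into one
              [u'。。+', '。'],                    # collapse consecutive periods into one
          ]

          def cleansingText(text):
              # Feed each result into the next pattern so every rule applies
              for src, dst in CLEANSING_PATTERNS:
                  text = re.sub(src, dst, text)
              return text
      • Callout: [before, after] pairs in a list
      • Callout: replace with re.sub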
  26. 2. Writing the Lambda script
      • termextract.py
      • 1. Define the part-of-speech (POS) IDs to use
          import re

          APPLY_IDS = [30, 36, 38, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
                       52, 53, 54, 56, 57]
          APPLY_JOIN_IDS = [10, 13, 18, 20, 24, 25, 31, 32, 33, 51, 58]
          NOT_ENDOFWORDS = [
              [13, "から"], [18, "が"], [18, "ものの"], [18, "と"], [24, "の"],
          ]
          EX_APPLY_IDS = [
              [18, "て"], [18, "で"], [25, "た"], [25, "ず"], [25, "ない"],
              [25, "ます"], [25, "ん"], [25, "まし"], [25, "ませ"], [31, "い"],
              [54, "そう"]
          ]
          IGNORE_WORDS = [u"そう", u"した", u"あと", u"何度も", u"何か"]
      • Callout: POS IDs eligible for joining
      • Callout: pairs of a POS ID and a function word
  27. 2. Writing the Lambda script
      • termextract.py
      • Writing the compound-noun generation rules (example)
          def wordFilter(token, buf):
              if len(buf) != 0:
                  if token.posid == 4 \
                          and re.search(r'[ー・]', token.surface) \
                          and re.search(r'^[ぁ-んァ-タダ-ヶー・]+$', buf[-1].surface) \
                          and buf[-1].posid != 30:
                      return True
                  for pid, w in EX_APPLY_IDS:
                      if pid == token.posid and w == token.surface:
                          return True
              if token.posid in APPLY_IDS:
                  # Prevent a run of alphabetic characters from being judged a noun
                  # (two patterns: ID 45, proper noun/organization, and ID 38, general noun)
                  if len(token.surface) < 3 and re.search(r'[a-zA-Z:]+', token.surface):
                      return False
                  if len(buf) == 0 and token.surface in IGNORE_WORDS:
                      return False
              return False
      • Callout: join when (middle dot or long vowel mark) and (the last string is hiragana/katakana) and the POS ID of the last word is not 30 (noun-connecting)
      • Callout: depending on the conditions, some tokens are not joined ...
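  28. 2. Writing the Lambda script
      • Concretely ...
      • "ストーリー・メイキングが素晴らしい"
          ストーリー    POS ID: 38 (general noun)
          ・            POS ID: 4 (symbol)
          メイキング    POS ID: 41 (proper noun)
      • POS IDs eligible for joining
          APPLY_IDS = [30, 36, 38, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
                       52, 53, 54, 56, 57]
      • Callout: IDs eligible for joining (hiragana, katakana, and the middle dot are allowed)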
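The joining step in this example can be sketched without MeCab by working on (surface, POS ID) pairs. This is a simplified illustration, not the deck's termextract.py: it uses only the three POS IDs named on the slide (38, 41, and 4 for the middle dot), and `Token`, `join_compounds`, and the POS IDs given to "が" and "素晴らしい" in the sample are assumptions for this example.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    surface: str
    posid: int


NOUN_IDS = {38, 41}  # general noun, proper noun (IDs taken from the slide)
SYMBOL_ID = 4        # symbol, used here only for the middle dot


def join_compounds(tokens: List[Token]) -> List[str]:
    """Merge runs of nouns, optionally linked by '・', into one surface."""
    result: List[str] = []
    buf: List[str] = []

    def flush() -> None:
        trailing = []
        while buf and buf[-1] == "・":  # a run must not end with the dot
            trailing.append(buf.pop())
        if buf:
            result.append("".join(buf))
        result.extend(trailing)
        buf.clear()

    for tok in tokens:
        is_noun = tok.posid in NOUN_IDS
        is_linking_dot = (
            tok.posid == SYMBOL_ID and tok.surface == "・" and bool(buf)
        )
        if is_noun or is_linking_dot:
            buf.append(tok.surface)
        else:
            flush()
            result.append(tok.surface)
    flush()
    return result


sample = [
    Token("ストーリー", 38),
    Token("・", 4),
    Token("メイキング", 41),
    Token("が", 13),          # illustrative POS ID
    Token("素晴らしい", 10),  # illustrative POS ID
]
joined = join_compounds(sample)
print(joined)
```

Buffering the run and flushing it at the first non-joinable token keeps the rule set easy to extend with further POS IDs.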
  29. 3. Deploying to AWS
      • Deployment flow
        1. Bundle the program into a zip file
        2. Configure the AWS user role
        3. Configure the Lambda → S3 route
        4. Create a security group
        5. Create the Lambda function
        6. Configure API Gateway
  30. 3. Deploying to AWS
      • 1. Bundle lambda_neologd into a zip file
      • List unneeded files in exclude (zip: about 1.5 MB)
      • Create the zip file
          $ zip lambda_mecabneologd.zip -r * -x@exclude
      • Contents of the exclude file (files left out of the zip)
          $ vim lambda_neologd/exclude
          *.dist-info/*
          *.egg-info
          *.pyc
          env/*
          exclude
          lambda_mecabneologd.zip
          local/bin/*
          local/include/*
          local/libexec/*
          local/share/*
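The same packaging step can be approximated in Python when the `zip` command is not available. This sketch assumes that glob patterns matched with `fnmatch` are close enough to `zip -x` for the patterns on this slide; `build_zip` and the sample file names are hypothetical, and the archive is written outside the source directory so it cannot include itself.

```python
import fnmatch
import os
import tempfile
import zipfile


def build_zip(src_dir, zip_path, exclude):
    """Zip src_dir, skipping relative paths that match any exclude pattern."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full = os.path.join(root, name)
                rel = os.path.relpath(full, src_dir).replace(os.sep, "/")
                if any(fnmatch.fnmatchcase(rel, pat) for pat in exclude):
                    continue
                zf.write(full, rel)
    with zipfile.ZipFile(zip_path) as zf:
        return sorted(zf.namelist())


with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as out:
    for rel in ["lambda_function.py", "local/lib/libmecab.so",
                "local/bin/mecab", "cache/mod.pyc"]:
        path = os.path.join(src, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            f.write("x")
    names = build_zip(src, os.path.join(out, "bundle.zip"),
                      ["*.pyc", "local/bin/*"])
    print(names)
```

Only the handler and the shared library remain in the archive, which is the effect the exclude file is meant to have.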
  31. 3. Deploying to AWS
      • 3. Create the route from Lambda to S3
      • Create a VPC and a subnet
      • Create a VPC endpoint
      • Not required: done to avoid getting stuck on permission issues
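  32. Summary
      • As a MeCab environment with a dictionary that includes new words
      • Use AWS Lambda + API Gateway
      • NEologd can be run in the Lambda environment
      • The dictionary has to be kept to a minimum
      • Running on Lambda has the advantage of being easy to use casually
      • The biggest problem when running on Lambda is the slow first startup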
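  33. Summary
      • On the "what should we do about the dictionary" problem
      • The NEologd dictionary is very well made
      • But it is huge!!
      • Whatever the downside of minimizing the dictionary, normalizing and sanitizing Japanese spelling variation before tokenizing is worthwhile regardless of the dictionary
      • Implementing noun joining makes it possible to produce compound nouns
      • Very useful for tasks such as keyword extraction
      • For phrases, implement POS joining as needed
      • By the way, about the document comparison shown in the demo ....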