The 7th All-Japan Computer Vision Study Group: A Multiplexed Network for End-to-End, Multilingual OCR

Yamato.OKAMOTO

July 28, 2021

Transcript

  1. The 7th All-Japan Computer Vision Study Group
    A Multiplexed Network for End-to-End, Multilingual OCR
    Yamato.OKAMOTO

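  2. Self-introduction (briefly!)
    Yamato Okamoto (岡本大和)
    [email protected]DESU
    ✓ Majored in image recognition at Kyoto University as a student
    ✓ At OMRON, worked across both business and technology, handling everything from BizDev to R&D
    ✓ Now a member of the Computer Vision Lab Team at LINE Corporation ★New
    ➡ Currently building the team so that LINE CV Lab becomes "the most fun team to work on in the universe"
    ➡ Dream: to create a research hub in Kyoto that people aspire to join
    ※ Today's talk introduces a publicly available paper.
    ※ It is unrelated to my affiliation.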

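  3. The paper introduced today
    ● A Multiplexed Network for End-to-End, Multilingual OCR (Facebook AI)
    https://arxiv.org/pdf/….pdf
    ✓ Research on optical character recognition (OCR: Optical Character Recognition)
    ✓ Most prior work focused on the "alphabet"
    ✓ Even multilingual methods often took the brute-force route of simply growing the number of output classes from tens to thousands
    ✓ This paper proposes a smarter multilingual model that supports eight languages in total, including Arabic and Japanese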

  4. Preliminaries: what is OCR in the first place?

  5. The goal of OCR is to extract text from images
    • Input: an image
    • Output: the text regions in the image and their character strings
    https://arxiv.org/pdf/2007.09629.pdf

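  6. Before the paper, let me carefully define the terms
    Question: does OCR detect a Sentence? a Word? a Character?
    • Sentence ➔ "I am done with mankind JOJO"
    • Word ➔ "I", "am", "done", "with", "mankind", "JOJO" (punctuation marks count as separate tokens)
    • Character ➔ "m", "a", "n", "k", "i", "n", "d"
    Answer: none of them
    • OCR detects strings in "structural units", not "semantic units" (also called instance-level)
    • From here on, this deck calls a string that is structurally grouped to some degree a "Text"
    • Targeting single characters (Character) requires different annotations and methods, so the two are clearly distinguished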

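  7. Before the paper: which annotations are used
    • Text label ← not drawn in the image below; here "BALLYS" is the Text label. This is usually available.
    • Text-level bounding box (blue)
    • Text polygonal bounding box (red)
    • Bounding box formed by the smallest rectangle enclosing the polygonal bounding box (green) ← used when the blue box is not available
    • Character-level bounding box (yellow) ← almost never available; in most cases the data and its labels are generated by image synthesis.
    • Character label (red letters on a yellow fill) ← almost never available; in most cases the data and its labels are generated by image synthesis.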
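
The green box on this slide can be derived from the red polygon. The sketch below assumes the "smallest rectangle" is axis-aligned (the slide does not say whether a rotated rectangle is meant); the function name and the coordinates are made up for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def enclosing_box(polygon: List[Point]) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max) of the axis-aligned box around a polygon."""
    if not polygon:
        raise ValueError("polygon must contain at least one vertex")
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical polygon annotation for a slanted word (coordinates are invented).
polygon = [(10.0, 40.0), (90.0, 20.0), (95.0, 45.0), (15.0, 65.0)]
box = enclosing_box(polygon)
```

For slanted text the resulting box is noticeably larger than the polygon, which is the drawback of rectangular boxes discussed on the Text Detection slide.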

  8. Before the paper: the processing stages of OCR
    Let's understand three tasks
    ✓ Text Detection
    ✓ Text Recognition
    ✓ Text Spotting ← the task this paper addresses

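  9. What is Text Detection?
    Extract every Text region in the image as a rectangle or a polygon
    Example: detect with a sliding window, as in Fast-RCNN
    • Existing detection models such as Fast-RCNN can be reused
    • When the text is slanted, the rectangle becomes larger than necessary
    • Features from outside the text region often get mixed in
    Example: segment the image into [Text / NOT Text]
    • Annotation is labor-intensive
    • With extra steps such as masking, features can be extracted from the text region only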

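  10. What is Text Recognition?
    Recognize the sequence of Characters contained in a detected Text region (example image: "LOVE")
    Example: predict Characters with semantic segmentation
    • Segmentation over every Character class plus a background class
    • A character is assumed to exist wherever a coherent blob is predicted as foreground
    • The Character class is decided by a majority vote over all pixels in the blob
    • Requires Character-level annotation, so the labeling cost is high
    • Another issue is that the order of the characters cannot be obtained
    Example: output one character at a time with Seq2Seq (Enc-Dec)
    • The features of the Text region are converted into sequence data
    • Inference uses an RNN-based module (LSTM, GRU, etc.)
    • A Spatial Attention structure is often added
    • No Character-level annotation is needed, so it is easy to adopt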
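
The majority vote in the segmentation-based approach can be sketched as follows. This is a minimal illustration, not code from any of the cited papers: the per-pixel class map, the `"_"` background symbol, and the function name are all assumptions.

```python
from collections import Counter
from typing import List, Tuple


def pixel_vote(class_map: List[List[str]],
               region: List[Tuple[int, int]],
               background: str = "_") -> str:
    """Pick the character class predicted by most foreground pixels in a region."""
    votes = Counter(
        class_map[r][c] for r, c in region if class_map[r][c] != background
    )
    if not votes:
        return background  # no foreground pixel: treat the region as empty
    return votes.most_common(1)[0][0]


# Hypothetical per-pixel predictions for one character candidate region.
class_map = [
    ["_", "L", "L", "_"],
    ["_", "L", "I", "_"],
    ["_", "L", "L", "_"],
]
region = [(r, c) for r in range(3) for c in range(4)]
winner = pixel_vote(class_map, region)
```

The single stray "I" pixel is outvoted by the five "L" pixels. The vote says nothing about the order of characters, which is the weakness noted in the bullets above.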

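  11. What is Text Spotting?
    Performing both Text Detection and Text Recognition is called Text Spotting
    (It may sound like simply chaining two steps, but there are many challenges and design choices, such as how to train the whole system)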

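  12. Before the paper: why OCR is hard
    • Text appears in many orientations: vertical, horizontal, diagonal, curved, and so on
    • Several text orientations can be mixed within a single image
    • One text may be split into two or more pieces, or two or more different texts may be merged and detected as one
    • Annotations are limited; Character-level annotations in particular are almost never available
    • We want to train several different functional modules (feature extraction, Detection, Recognition) end to end (E2E)
    (because training is more efficient, the design is more natural, and the different modules can be expected to complement each other)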

  13. (Sorry to keep you waiting)
    Let me also introduce prior OCR research first

  14. Frequently cited OCR methods
    E2E-MLT
    • Czech Technical University, Carnegie Mellon University
    • E2E-MLT - an Unconstrained End-to-End Method for Multi-Language Scene Text
    • https://arxiv.org/pdf/….pdf
    CharNet
    • Malong Technologies
    • Convolutional Character Networks
    • https://arxiv.org/pdf/….pdf
    TextSpotter
    • Facebook AI, Huazhong University
    • Mask TextSpotter v3: Segmentation Proposal Network for Robust Scene Text Spotting
    • https://arxiv.org/pdf/….pdf
    CRAFTS
    • Clova AI Research, NAVER Corp.
    • Character Region Attention For Text Spotting
    • https://arxiv.org/pdf/….pdf

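  15. E2E-MLT (ACCV'18)
    E2E-MLT
    Whole-image feature extraction: an FPN (Feature Pyramid Network) based on ResNet34.
    Text region detection: at each coordinate of the 1/4-scale map, predicts Text / NOT Text, the b-box, and the angle. (Anchors are not used.)
    Text region feature extraction: applies a spatial transformer layer to each detected b-box, estimating its parameters, to reduce rotation and distortion.
    Text recognition: the character recognizer is built from Conv layers.
      Input: features whose width alone is variable.
      Output: log-softmax over the character set (about 7,500 classes).
      The number of output characters grows in proportion to the width of the features.
    Training tricks: multilingual training data is built by image synthesis.
    ▲ Overview of the proposed model. The detection part and the recognition part are arranged in series.
    ▲ Synthetic images used for training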

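  16. CharNet (ICCV'19)
    ▲ Compared with Mask TextSpotter and CRAFTS, text-level and character-level recognition are carried out in parallel in a single pass.
    CharNet
    Whole-image feature extraction: a combination of ResNet-50 and Hourglass networks (Newell 2016).
    Text region detection, text region feature extraction, and text recognition: a Text Detection Branch and a Character Branch are placed in parallel.
    ・Text Detection Branch
      Applies existing methods that can handle slanted and curved text (EAST, TextField).
      Outputs the text regions.
    ・Character Branch
      Three modules are arranged in parallel:
      -(1) [Text / NOT Text] segmentation
      -(2) Character detection with b-boxes
      -(3) multi-class segmentation over the character set
      Outputs the b-box and label of each Character.
      The output is the set of characters contained in each detected text region (so, strictly speaking, the output is not a text string).
    Training tricks: training needs both Text-level and Character-level annotations, so the model is first trained on synthetic data and then trained on real data with Weakly Supervised Learning.
    ▼ The Text Detection Branch detects the text regions.
    ▼ The Character Branch runs three processes in parallel internally to perform Character-level detection and recognition.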

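  17. TextSpotter v3 (ECCV'20)
    ▲ The model is trained with shrunk text regions as the ground truth. This prevents neighboring texts from sticking together (the regions are unshrunk before moving on to the next step).
    ▲ The features are masked with [0 or 1]
    ▼ The string is obtained in two ways. The lower method (SAM) needs no character-level annotation.
    TextSpotter (v1~v3)
    Whole-image feature extraction:
      An FPN on top of ResNet50 (v2)
      A U-Net on top of ResNet50 (v3)
    Text region detection:
      Anchor-based detection built on Fast-RCNN (v2)
      Segmentation into Text / NOT Text (v3)
    Text region feature extraction:
      RoI Align is applied to the rectangular regions detected with anchors (v2)
      The segmentation result is cropped with its smallest enclosing rectangle, then RoI Align and a mask are applied to the features (v3)
    Text recognition:
      (1) Segmentation over each character class plus background. The character is decided by a majority vote (PixelVoting) inside each character candidate region.
      (2) The features are converted into a sequence and an attention-based seq2seq model outputs the string.
      After obtaining both predictions (1) and (2), the one with the higher confidence is adopted.
    Training tricks: (2) can be trained without Character-level annotations (※ they are required to train (1)).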

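  18. CRAFTS (ECCV'20)
    ▲ Rectification with TPS (in practice the Feature Map is transformed)
    ▲ Once the character regions are predicted, the polygon region can be derived from them
    ▼ Real data has no character-level annotations, so training on real data uses Weakly Supervised Learning with pseudo labels
    CRAFTS
    Whole-image feature extraction: a U-Net structure on top of ResNet50.
    Text region detection: trained with synthetic data and custom ground-truth labels:
      (1) a Gaussian score centered on each character
      (2) a score indicating the link between adjacent characters
      (3) the character orientation
      The text region is extracted as a polygon from predictions (1) and (2) by a fixed post-processing computation.
    Text region feature extraction: the features are concatenated with predictions (1) and (2). A thin-plate spline then transforms the polygon text region into a fixed-size rectangle, from which the features are extracted.
    Text recognition: the features are converted into a sequence and an attention-based seq2seq model outputs the string.
    Training tricks: custom training targets are created from the text annotations, and the model is trained to predict them.
    ▲ When the synthetic training data is created, custom ground-truth labels like those above are generated for training. For example, the Link Score helps improve the recognition rate of vertically written text.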
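
Label (1) above, a Gaussian score centered on each character, can be sketched as a small heat map. This is a simplified illustration only: it assumes an isotropic Gaussian on a pixel grid and does not warp the Gaussian to the character box; the function name and parameters are made up.

```python
import math
from typing import List


def gaussian_score_map(height: int, width: int,
                       center_y: float, center_x: float,
                       sigma: float) -> List[List[float]]:
    """Return a height x width map that peaks at 1.0 on the character center."""
    return [
        [
            math.exp(-((y - center_y) ** 2 + (x - center_x) ** 2) / (2 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


# Hypothetical 5x5 patch with one character centered at (2, 2).
score = gaussian_score_map(5, 5, center_y=2.0, center_x=2.0, sigma=1.0)
```

A soft target like this needs only the character center and extent, so it can be generated automatically for synthetic data.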

  19. Thank you for waiting; now, finally, on to the paper.
    ● A Multiplexed Network for End-to-End, Multilingual OCR
    Facebook AI  https://arxiv.org/pdf/….pdf

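  20. The problem this paper focuses on
    Most OCR research targets English, but the world has many other languages.
    In prior work on multilingual OCR, many designs merely expand the number of output classes of the model. Is that really good enough?
    The goal is an architecture that is easy to maintain, for example when adding or removing a language!
    ※ "Language" here means, strictly speaking, the writing system (Script)
    ※ English is simply over-represented; Japanese OCR and Korean OCR do of course exist.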

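  21. Supplement: the writing systems of the world
    Reference: Wikipedia
    What is a Language? ➔ "English", "German", "Japanese", etc.
    What is a Script? ➔ "Latin", "Korean", etc.
    What is a Character? ➔ "A", "あ", "勝", etc.
    ※ Different languages may share the same script.
    ※ From here on, the explanation treats "Language ≈ Script" and uses the term "language" throughout.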

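  22. Contribution: an easily extensible multilingual model built on TextSpotter
    (The authors are Facebook AI researchers, as with TextSpotter v3)
    ★ Contribution of this paper: a newly added language recognizer, which identifies the language from the features of each text region.
    ★ Contribution of this paper: per-language character recognizers arranged as multiple heads; the head is switched according to the language recognition result.
    Character recognition uses a neural network for sequential data (a GRU here) and, with the help of attention, outputs its prediction one character at a time.
    Text regions are extracted by segmentation, with tricks such as shrinking and expanding so that different text regions do not stick together.
    Masking extracts only the features of the text region.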

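  23. Idea: multilingual support through Multi-Head
    Problem: we want multilingual support, and we want it to be easy to extend!
    ➡ The simple approach is to expand the number of output classes to cover every character in the 8 languages
    ➡ However, handling Japanese, Chinese, and Korean pushes the number of classes up to around 10,000
    ➡ Moreover, languages differ greatly in their characteristics: logographic characters, vertical writing, horizontal writing, and so on
    ➡ Is it appropriate to handle all of these with one recognizer (Single-Head)?
    ➡ Adding or removing even a single character could force retraining from scratch
    Idea
    ➡ Adopt a Multi-head structure: eight character recognizers in total, each specialized for one language.
    ➡ In addition, place a language recognizer (Language Prediction Network; LPN).
    ➡ For each text region, predict the language, select the single most suitable character recognizer, and run inference with it.
    (Figure labels) A different character recognizer is used for each text region / The character recognizer is switched according to the language identification result / Multiple character recognizers are placed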
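
The routing idea on this slide can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: the heads are stand-in functions, and the language names, probabilities, and function names are invented for the example.

```python
from typing import Callable, Dict, List

# A recognition head maps region features to a string (stand-in type).
Head = Callable[[List[float]], str]


def pick_language(language_probs: Dict[str, float]) -> str:
    """Return the language the (hypothetical) LPN considers most likely."""
    return max(language_probs, key=language_probs.get)


def recognize(features: List[float],
              language_probs: Dict[str, float],
              heads: Dict[str, Head]) -> str:
    """Route one text region to the head of its predicted language."""
    return heads[pick_language(language_probs)](features)


# Stand-in heads; real heads would be trained recognizers.
heads: Dict[str, Head] = {
    "latin": lambda f: "latin-head output",
    "japanese": lambda f: "japanese-head output",
}
result = recognize([0.1, 0.2], {"latin": 0.2, "japanese": 0.8}, heads)

# Adding a language only adds one entry; existing heads are untouched.
heads["korean"] = lambda f: "korean-head output"
```

Keeping the heads in a dictionary mirrors the maintainability argument: a language can be added or removed without changing the output layer of any other recognizer.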

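  24. Idea: training that does not require Language labels
    Problem: a language recognizer was added, but there are too few language annotations to train it.
    ➡ If language annotations existed, the language recognizer could simply be trained to predict them.
    ➡ In practice there are almost no such annotations. Besides, we want to train end to end (E2E) rather than module by module.
    Idea
    ➡ Devise a way to train E2E from the text labels alone, even without language annotations.
    ➡ Apply a penalty when a character recognizer receives characters it does not support (characters of another language).
    When language annotations are available
      The language recognizer predicts the language and is trained with a cross-entropy loss.
      l: language
      p(l): probability inferred by the model
      l_lang: total number of languages handled
    When language annotations are not available
      Training uses the Text label only.
      A penalty β is applied when a character recognizer receives a character it does not support.
      As a result, the language recognizer learns to select the appropriate character recognizer.
      c_t: the t-th character
      y_t: the ground-truth label of the t-th character
      C_r: the set of characters handled by the character recognizer
      I: returns a value of 0 or 1
      T: maximum number of output characters (fixed parameter)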
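
The two training signals can be sketched as below. The exact formulas are on the slide images and are not reproduced in this transcript, so the expressions here are an assumed form built only from the symbol definitions above: per-step negative log-likelihood when y_t is in C_r, a constant penalty β otherwise. Averaging over the label length instead of the fixed T is a simplification, and all names and numbers are illustrative.

```python
import math
from typing import Dict, List, Set


def language_loss(language_probs: Dict[str, float], true_language: str) -> float:
    """Cross-entropy for one region when a language annotation exists."""
    return -math.log(language_probs[true_language])


def head_loss(step_probs: List[Dict[str, float]], target: str,
              charset: Set[str], beta: float) -> float:
    """Assumed recognition loss of one head on one text label."""
    if not target:
        raise ValueError("target must not be empty")
    total = 0.0
    for probs, y_t in zip(step_probs, target):
        if y_t in charset:
            total += -math.log(probs[y_t])  # supported character: usual NLL
        else:
            total += beta                   # unsupported character: penalty
    return total / len(target)


# Hypothetical Latin head with a two-character set.
latin = {"a", "b"}
steps = [{"a": 0.5, "b": 0.5}, {"a": 0.25, "b": 0.75}]
supported = head_loss(steps, "ab", latin, beta=5.0)
unsupported = head_loss(steps, "あb", latin, beta=5.0)
```

Because a head is penalized on text outside its character set, routing such text to it raises the total loss, which gives the language recognizer a training signal without any language labels.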

  25. Experimental results and discussion

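  26. Multilingual Text Spotting results → "We beat everything except CRAFTS!"
    Performance was compared with existing multilingual OCR models on a Text Spotting task covering 8 languages
    ※ The experiments use the multilingual "MLT19 Dataset"
    Result: generally strong performance; CRAFTS is the only method it did not reach
    Table columns: F-score, Precision, Recall
    CRAFTS (paper): results reported in the CRAFTS paper
    CRAFTS: results from the authors' own reimplementation of CRAFTS
    Single-head TextSpotter: the proposed method with a Single-Head instead of Multi-Head, covering all characters of the 8 languages (about 9,000 classes)
    Multiplexed TextSpotter: the proposed method introduced in this deck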
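
The tables on these slides report an F-score alongside Precision and Recall. Assuming it is the usual F1 (the harmonic mean of the two), it can be computed as below; the numbers are illustrative and are not taken from the paper.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both are zero
    return 2 * precision * recall / (precision + recall)


# Illustrative values only.
balanced = f_score(0.5, 0.5)
skewed = f_score(0.8, 0.6)
```

The harmonic mean sits closer to the lower of the two values, so a method cannot score well by maximizing only precision or only recall.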

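  27. Multilingual Text Detection results → "We beat everything except CRAFTS (paper)!"
    Text detection performance was compared on multilingual data (MLT19)
    Result: generally strong performance; CRAFTS (paper) is the only entry it did not reach
    Table columns: Average Precision, F-score, Precision, Recall
    CRAFTS (paper): results reported in the CRAFTS paper
    CRAFTS: results from the authors' own reimplementation of CRAFTS
    Single-head TextSpotter: the proposed method with a Single-Head instead of Multi-Head, covering all characters of the 8 languages (about 9,000 classes)
    Multiplexed TextSpotter: the proposed method introduced in this deck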

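  28. Note: even without SoTA, the contribution is worthy of CVPR acceptance!
    Looking at the text detection task language by language, performance improves (especially for Arabic and Chinese)
    Table columns: Average Precision, F-score, Precision, Recall
    The original goal is to propose a multilingual model that is easy to handle
    ✓ The proposed model is Multi-Head, so adding and removing languages is easy 👍
    ✓ It is arguably a cleaner design than a Softmax over about 10,000 classes 👍
    ✓ It can still be trained E2E, just like other OCR models 👍
    It did not reach the performance of CRAFTS, but there is still plenty of room for improvement
    ✓ Thanks to its Link Score, CRAFTS is also strong on vertically written text
    ✓ The authors attribute the gap to the difference in recognition performance on vertical text
    ✓ The tricks used in CRAFTS could also be added to the proposed method (perhaps a TextSpotter v4 implementing them will be announced soon...!?)
    ✓ In addition, CRAFTS has far more character recognizer parameters and far more pre-training data, among other differences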

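  29. Summary: multilingual OCR realized in an architecture that is easy to train and easy to modify!
    Problem: we want multilingual support, and we want it to be easy to extend!
    ➡ Adopted a Multi-Head structure with eight character recognizers in total, each specialized for one language
    ➡ Added a language recognizer and switched the character recognizer according to the language predicted for the input text.
    Problem: a language recognizer was added, but there are too few language annotations to train it.
    ➡ Applied a penalty when a character recognizer receives characters it does not support (characters of another language).
    ➡ As a result, the whole model (including the language recognizer) was trained E2E from text labels alone, without language annotations.
    Experimental results:
    ● Although it did not reach SoTA, it achieved high performance roughly on par with existing methods in Text Detection and Text Recognition.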

  30. References
    Mask TextSpotter v1
    https://arxiv.org/pdf/….pdf
    Mask TextSpotter v2
    https://arxiv.org/pdf/….pdf
    Mask TextSpotter v3
    https://arxiv.org/pdf/….pdf
    CRAFT
    https://arxiv.org/pdf/….pdf
    CRAFTS
    https://arxiv.org/pdf/….pdf
    What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis
    https://arxiv.org/pdf/….pdf
    CharNet
    https://arxiv.org/pdf/….pdf
    TextField: Learning A Deep Direction Field for Irregular Scene Text Detection
    https://arxiv.org/pdf/….pdf
    EAST: An Efficient and Accurate Scene Text Detector
    https://arxiv.org/pdf/….pdf
    Stacked Hourglass Networks
    https://arxiv.org/pdf/….pdf
    Dataset and Model Analysis
    https://arxiv.org/pdf/….pdf
    Towards Unconstrained End-to-End Text Spotting
    https://arxiv.org/pdf/….pdf

  31. Supplement: datasets
    Note that none of them has Character-level annotation, and that the languages are skewed.
    ICDAR 2017 MLT dataset (MLT17)
      Language & Script: 9 languages representing 6 different scripts equally
      Data Difficulty: multi-oriented scene text
      Annotation: annotated using quadrangle bounding boxes.
    ICDAR 2019 MLT dataset (MLT19)
      Language & Script: 10 languages representing 7 different scripts.
      Data Difficulty: multi-oriented scene text
      Annotation: annotated using quadrangle bounding boxes.
    Total-Text dataset
      Language & Script: English language.
      Data Difficulty: wide variety of horizontal, multi-oriented and curved text
      Annotation: annotated at word-level using polygon bounding boxes.
    ICDAR 2019 ArT dataset (ArT19)
      Language & Script: English and Chinese languages
      Data Difficulty: highly challenging arbitrarily shaped text
      Annotation: annotated using arbitrary number of polygon vertices
    ICDAR 2017 RCTW dataset (RCTW17)
      Language & Script: Chinese
      Data Difficulty: scene text in Chinese
      Annotation: drawing polygons to surround every text line
    ICDAR 2019 LSVT dataset (LSVT19)
      Language & Script: Chinese, but also has about 20% of its labels in English words.
      Data Difficulty: street view text in Chinese
      Annotation: drawing polygons to surround every text line
    ICDAR 2013 dataset (IC13)
      Language & Script: English language
      Data Difficulty: horizontal text
      Annotation: annotated at word-level using rectangular bounding boxes
    ICDAR 2015 dataset (IC15)
      Language & Script: English language
      Data Difficulty: multi-oriented scene text
      Annotation: annotated at word-level using quadrangle bounding boxes.