
The 7th All-Japan Computer Vision Study Group: A Multiplexed Network for End-to-End, Multilingual OCR

Yamato.OKAMOTO

July 28, 2021

Transcript

  1. The 7th All-Japan Computer Vision Study Group: A Multiplexed Network for End-to-End, Multilingual OCR. Yamato OKAMOTO

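  2. Self-introduction (briefly!)
    Yamato Okamoto. Twitter: Roadroller_DESU
    ✓ Majored in image recognition at Kyoto University as a student.
    ✓ At OMRON, worked across both business and technology, handling everything from BizDev to R&D.
    ✓ Recently joined the Computer Vision Lab Team at LINE Corporation. ★New
    ➡ Currently team building, aiming to make LINE CV Lab "the most fun team to work in, in the whole universe".
    ➡ My dream is to create a research hub in Kyoto that people aspire to join.
    ※ Today's talk introduces a paper based on publicly available information.
    ※ It is unrelated to my affiliation.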
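  3. The paper being introduced
    • A Multiplexed Network for End-to-End, Multilingual OCR (Facebook AI)
      https://arxiv.org/pdf/….pdf
    ✓ Research on optical character recognition (OCR).
    ✓ Prior work has mostly focused on the Latin alphabet.
    ✓ Even multilingual work has often taken the brute-force route of growing the number of output classes from dozens to thousands.
    ✓ This paper proposes a smarter multilingual model that supports 8 languages in total, including Arabic and Japanese.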
  4. Checking the premises: what is OCR in the first place?

  5. The goal of OCR is to extract text from images
    • Input: an image
    • Output: the text regions in the image, plus their character strings
    https://arxiv.org/pdf/2007.09629.pdf

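  6. Before the paper, let me carefully define the terminology
    Question: does OCR detect a Sentence? A Word? A Character?
    • Sentence ➔ "I am done with mankind, JOJO!"
    • Word ➔ "I", "am", "done", "with", "mankind", ",", "JOJO", "!"
    • Character ➔ "m", "a", "n", "k", "i", "n", "d"
    Answer: none of them.
    • OCR detects strings in structural units, not semantic units (this is also called instance-level).
    • From here on, this deck calls a string that is structurally grouped to some extent a "Text".
    • Targeting one character at a time (Character) changes both the annotations and the methods, so the two are clearly distinguished.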
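  7. Before the paper, a summary of the annotations that are used
    • Text label: not drawn in the image on the slide, but here "BALLYS" is the text label. This is usually available.
    • Text-level bounding box (blue)
    • Text polygonal bounding box (red)
    • Bounding box formed by the minimal rectangle around the polygonal bounding box (green): used when the blue box is not available.
    • Character-level bounding box (yellow): almost never available. In most cases the data and ground-truth labels are generated by image synthesis.
    • Character label (red characters on yellow fill): almost never available. In most cases the data and ground-truth labels are generated by image synthesis.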
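The green box on slide 7 is derived from the polygon annotation rather than labeled by hand. The sketch below shows that derivation under one simplifying assumption: the "minimal rectangle" is taken to be axis-aligned. The function name and the sample polygon are hypothetical and do not come from the slides or the paper.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def enclosing_box(polygon: List[Point]) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max) of the axis-aligned box around a polygon."""
    if not polygon:
        raise ValueError("polygon must contain at least one vertex")
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


# Hypothetical polygon annotation around a slanted word.
slanted_word = [(10.0, 40.0), (90.0, 10.0), (100.0, 30.0), (20.0, 60.0)]
box = enclosing_box(slanted_word)
```

For slanted text the resulting box is noticeably larger than the polygon, which is the drawback of rectangular detection noted on slide 9.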
  8. Before the paper, a summary of OCR processing
    Let's understand three kinds of processing:
    ✓ Text Detection
    ✓ Text Recognition
    ✓ Text Spotting (this is what the paper addresses)

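  9. What is Text Detection?
    Extract every text region in the image as a rectangle or a polygon.
    Example: detect with sliding windows, as in Fast-RCNN
    • Existing detection models such as Fast-RCNN can be reused.
    • When the text is slanted, the rectangle becomes larger than necessary.
    • Features from outside the text region are often mixed in.
    Example: segment the image into [Text / NOT Text]
    • Annotation is laborious.
    • With additions such as masking, features can be extracted from the text region only.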
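  10. What is Text Recognition?
    Recognize the character sequence contained in a detected text region (the example on the slide reads "LOVE").
    Example: predict characters with semantic segmentation
    • Segment into the character classes plus a background class.
    • Regions confidently predicted as foreground are assumed to contain a character.
    • The character class is decided by majority vote over all pixels in the region.
    • Character-level annotation is required, so the labeling cost is high.
    • Another problem is that the order of the characters is not obtained.
    Example: output one character at a time with Seq2Seq (Enc-Dec)
    • Convert the text-region features into sequence form.
    • Infer with an RNN-based module (LSTM, GRU, etc.).
    • A spatial attention structure is often added.
    • Character-level annotation is not required, so it is easy to adopt.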
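The majority vote in the segmentation-based approach can be sketched as follows. This is only an illustration: it assumes that class id 0 means background and that a character candidate region arrives as a small 2-D grid of predicted class ids. The names `BACKGROUND` and `vote_character` are hypothetical.

```python
from collections import Counter
from typing import List, Optional

BACKGROUND = 0  # assumed id of the background class


def vote_character(class_map: List[List[int]]) -> Optional[int]:
    """Majority vote over the non-background pixels of one character candidate region."""
    votes = Counter(c for row in class_map for c in row if c != BACKGROUND)
    if not votes:
        return None  # the region contains background only
    return votes.most_common(1)[0][0]


# Hypothetical per-pixel predictions: class 12 dominates, class 15 is noise.
region = [
    [0, 12, 12, 0],
    [12, 12, 15, 0],
    [0, 12, 12, 0],
]
winner = vote_character(region)
```

The vote yields one label per region but says nothing about the order of the regions, which is the ordering problem listed in the bullets above.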
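  11. What is Text Spotting?
    Processing that performs both Text Detection and Text Recognition is called Text Spotting.
    (It may sound like simply chaining two steps, but there are many challenges and design choices, such as how to train it.)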

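  12. Before the paper, a list of what makes OCR hard
    • Text direction varies: vertical, horizontal, diagonal, curved, and so on.
    • Multiple text directions are mixed within a single image.
    • One text is split into two or more detections, or two or more different texts are merged into one detection.
    • Annotations are limited. Character-level annotations in particular are almost never available.
    • We want to train several different functional modules (feature extraction, detection, recognition) end to end (E2E).
      (Training is efficient, the design is natural, and the modules can be expected to complement one another.)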
  13. (Sorry to keep you waiting.) Let me also introduce prior OCR research.

  14. Frequently cited OCR methods
    E2E-MLT
    • Czech Technical University, Carnegie Mellon University
    • E2E-MLT: an Unconstrained End-to-End Method for Multi-Language Scene Text
    • https://arxiv.org/pdf/….pdf
    CharNet
    • Malong Technologies
    • Convolutional Character Networks
    • https://arxiv.org/pdf/….pdf
    TextSpotter
    • Facebook AI, Huazhong University
    • Mask TextSpotter v3: Segmentation Proposal Network for Robust Scene Text Spotting
    • https://arxiv.org/pdf/….pdf
    CRAFTS
    • Clova AI Research, NAVER Corp
    • Character Region Attention For Text Spotting
    • https://arxiv.org/pdf/….pdf
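  15. E2E-MLT (ACCV)
    Whole-image feature extraction: FPN (Feature Pyramid Network) based on ResNet34.
    Text region detection: at each coordinate of the 1/4-scale map, infer Text / NOT Text, the b-box, and the angle. (Anchors are not used.)
    Text-region feature extraction: a spatial transformer layer is applied to the detected b-box, with its parameters estimated, to reduce rotation and distortion.
    Text recognition: the character recognizer is built from conv layers. Input: features whose width alone is variable. Output: log-softmax over the character classes (about 7,500). The number of output characters grows in proportion to the feature width.
    Training tricks: multilingual training data is built by image synthesis.
    ▲ Overview of the proposed model. The detection part and the recognition part are arranged in series.
    ▲ Synthetic images used for training.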
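  16. CharNet (ICCV)
    ▲ Compared with Mask TextSpotter and CRAFTS, text-level and character-level recognition are performed in parallel in one pass.
    Whole-image feature extraction: a combination of ResNet-50 and Hourglass networks (Newell 2016).
    Text region detection: a Text Detection Branch and a Character Branch are placed in parallel.
      ・Text Detection Branch: applies existing methods that handle slanted and curved text (EAST, TextField) and outputs the text regions.
      ・Character Branch: three modules arranged in parallel:
        (1) [Text / NOT Text] segmentation
        (2) character detection with b-boxes
        (3) multi-class segmentation over the character classes
        It outputs the b-box and label of each character.
      The final output is the set of characters contained in each detected text region (so, strictly speaking, the output is not a text string).
    Text-region feature extraction and text recognition: handled inside the two branches above.
    Training tricks: training needs both text-level and character-level annotations, so the model is first trained on synthetic data and then trained on real data with weakly supervised learning.
    ▼ The Text Detection Branch detects the text regions.
    ▼ The Character Branch runs its three processes in parallel internally and performs character-level detection and recognition.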
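  17. TextSpotter v3 (ECCV)
    ▲ The model is trained with shrunken text regions as the ground truth. This prevents neighboring texts from sticking together (the region is simply unshrunk before moving to the next step).
    ▲ The features are masked with [0 or 1].
    ▼ The character string is obtained in two ways. The lower method on the slide, SAM, needs no character-level annotation.
    TextSpotter (v1 to v3)
    Whole-image feature extraction: FPN on a ResNet50 base (v2); U-Net on a ResNet50 base (v3).
    Text region detection: anchor-based detection built on Fast-RCNN (v2); Text / NOT Text segmentation (v3).
    Text-region feature extraction: RoI Align applied to the rectangular regions detected by anchors (v2); the segmentation result is cropped with its minimal rectangle, then RoI Align and a mask are applied to the features (v3).
    Text recognition:
      (1) Segment into each character class plus background, then decide the character by majority vote (pixel voting) within each character candidate region.
      (2) Convert to sequential features and output the string with an attention-based seq2seq model.
      After both predictions (1) and (2) are obtained, the one with higher confidence is adopted.
    Training tricks: (2) can be trained without character-level annotation (※ character-level annotation is needed to train (1)).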
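  18. CRAFTS (ECCV)
    ▲ Rectification to a rectangle by TPS (in practice the feature map is transformed).
    ▲ Once character regions are predicted, the polygon region can be obtained from them.
    ▼ When training on real data there is no character-level annotation, so weakly supervised learning with pseudo labels is used.
    Whole-image feature extraction: a U-Net structure on a ResNet50 base.
    Text region detection: trained with synthetic data plus custom ground-truth labels:
      (1) a Gaussian score with its origin at the character center
      (2) a score indicating the link between adjacent characters
      (3) character orientation
      The text region is extracted as a polygon from the predictions of (1) and (2) by a fixed post-processing computation.
    Text-region feature extraction: the features are concatenated with the predictions of (1) and (2). A thin-plate spline transforms the polygon text region into a fixed-size rectangle, from which features are extracted.
    Text recognition: convert to sequential features and output the string with an attention-based seq2seq model.
    Training tricks: custom training targets are created from the text annotations, and the model is trained to predict them.
    ▲ When the synthetic training data is created, custom ground-truth labels like those above are generated for training. For example, the Link Score contributes to better recognition of vertically written text.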
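The "Gaussian score centered on the character" label from slide 18 can be illustrated with a small heatmap generator. This is a simplified sketch, not the CRAFTS implementation: it assumes an isotropic Gaussian drawn directly in pixel coordinates, with a hypothetical `sigma` parameter, and ignores any warping of the Gaussian to the shape of the character box.

```python
import math
from typing import List


def gaussian_score_map(width: int, height: int, cx: float, cy: float,
                       sigma: float) -> List[List[float]]:
    """Build a height x width map whose value peaks at 1.0 on the character center."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


# Hypothetical 5x5 patch with a character centered at pixel (2, 2).
score = gaussian_score_map(width=5, height=5, cx=2, cy=2, sigma=1.0)
```

A link score could be produced the same way by centering the Gaussian on the midpoint between two adjacent characters. That, too, is an illustrative assumption rather than something stated on the slide.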
  19. Thank you for waiting. Now, finally, on to the paper.
    • A Multiplexed Network for End-to-End, Multilingual OCR (Facebook AI)
      https://arxiv.org/pdf/….pdf

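  20. The problem this paper focuses on
    • Most OCR research targets English, but the world has many other languages.
    • Many multilingual OCR studies simply enlarge the number of output classes of the model. Is that really good enough?
    • The goal is a configuration that is easy to maintain, for example when adding or removing languages.
    ※ "Language" here means, strictly speaking, a writing system (script).
    ※ English is merely over-represented. Japanese OCR and Korean OCR do exist, of course.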
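  21. Supplement: the world's writing systems (source: Wikipedia)
    • What is a language? ➔ "English", "German", "Japanese", and so on.
    • What is a script (writing system)? ➔ "Latin", "Korean", and so on.
    • What is a character? ➔ "A", "あ", "勝", and so on.
    ※ Different languages can share the same script.
    ※ From here on, the explanation treats Language ≒ Script and uses the single term "language".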
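  22. Contribution: an easily extensible multilingual model built on TextSpotter
    (The authors are Facebook AI researchers, as with TextSpotter v3.)
    • Text regions are extracted by segmentation, with measures such as shrinking and expanding the regions so that different text regions do not merge.
    • The features are masked so that only the features of the text region are extracted.
    • ★ Contribution of this paper: a newly added language recognizer predicts the language from the text-region features.
    • ★ Contribution of this paper: one character recognizer per language is installed as multiple heads, and the head is switched according to the language recognition result.
    • Character recognition uses a neural network for sequential data (here, a GRU) and outputs predictions one character at a time, with attention.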
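  23. Idea: multilingual support with multiple heads
    Problem: we want multilingual support, and we want it to be easy to extend.
    ➡ The simple approach is to enlarge the number of output classes to cover every character in the 8 languages.
    ➡ However, handling Japanese, Chinese, and Korean pushes the number of classes up to about 10,000.
    ➡ Moreover, the languages differ greatly in their characteristics: logographic characters, vertical writing, horizontal writing, and so on.
    ➡ Is it appropriate to handle all of these with one recognizer (single head)?
    ➡ Adding or removing even one character could force retraining from scratch.
    Idea
    ➡ Adopt a multi-head structure: 8 character recognizers in total, each specialized for one language.
    ➡ Add a language recognizer (Language Prediction Network, LPN) alongside them.
    ➡ For each text region, recognize the language, select the single most suitable character recognizer, and run inference with it.
    Figure labels: multiple character recognizers are placed; the recognizer is switched according to the language identification result; a different recognizer is used per text region.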
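The routing described on slide 23 can be sketched in a few lines. Everything here is a stand-in: `language_probs` plays the role of the LPN output for one text region, and the heads are stub functions rather than real recognizers. All names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Features = List[float]
Recognizer = Callable[[Features], str]


def route_to_head(features: Features,
                  language_probs: Dict[str, float],
                  heads: Dict[str, Recognizer]) -> Tuple[str, str]:
    """Pick the most probable language and run only that language's recognition head."""
    language = max(language_probs, key=lambda name: language_probs[name])
    return language, heads[language](features)


# Stub heads standing in for the per-language character recognizers.
heads: Dict[str, Recognizer] = {
    "latin": lambda feats: "<latin head output>",
    "japanese": lambda feats: "<japanese head output>",
}

# Adding a language is a dictionary update; the existing heads are untouched.
heads["korean"] = lambda feats: "<korean head output>"

result = route_to_head([0.0], {"latin": 0.1, "japanese": 0.2, "korean": 0.7}, heads)
```

The dictionary update mirrors the maintainability argument on the slide: a language can be added or removed without changing the output layer of the other recognizers.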
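  24. Idea: training that does not need language ground truth
    Problem: a language recognizer has been added, but there are too few language annotations to train it.
    ➡ If language annotations existed, the language recognizer could simply be trained to predict them.
    ➡ In practice there are almost none. Also, the model should be trained end to end (E2E), not module by module.
    Idea
    ➡ A method was devised that trains E2E from text labels alone, without language annotations.
    ➡ A penalty is applied when a character recognizer is given characters it does not support (characters of another language).
    When language annotations are available: the language recognizer predicts the language and is trained with a cross-entropy loss.
      $l$: a language
      $p(l)$: the probability inferred by the model
      $N_{lang}$: the total number of languages handled
    When language annotations are not available: the model is trained with the text labels only. A penalty $\beta$ is applied when a character recognizer receives a character it does not support, so the language recognizer learns to select the appropriate character recognizer.
      $c_t$: the $t$-th character
      $y_t$: the ground truth for the $t$-th character
      $C_r$: the set of characters handled by character recognizer $r$
      $I$: an indicator that returns 0 or 1
      $T$: the maximum number of output characters (a fixed parameter)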
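The slide's equations did not survive in the transcript, so the following is only a sketch of the penalty idea built from the symbol definitions ($y_t$, $C_r$, $\beta$, $T$). It assumes that each head's loss is a negative log-likelihood for supported characters and a constant $\beta$ for unsupported ones, averaged over $T$. It further assumes that the per-head losses are combined by weighting them with the language probabilities, which is one way the language recognizer could receive a training signal from text labels alone. The exact form used in the paper may differ, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Set


def head_loss(target: str, p_correct: List[float], charset: Set[str],
              beta: float, max_len: int) -> float:
    """Loss of one head: -log p for supported characters, a flat penalty beta otherwise."""
    total = 0.0
    for y_t, p in zip(target, p_correct):
        total += -math.log(p) if y_t in charset else beta
    return total / max_len  # normalize by the fixed maximum length T


def expected_loss(language_probs: Dict[str, float], losses: Dict[str, float]) -> float:
    """Weight each head's loss by the probability assigned to its language (assumed)."""
    return sum(language_probs[name] * losses[name] for name in losses)


# Hypothetical example: the label "ab" is supported by the Latin head only.
losses = {
    "latin": head_loss("ab", [0.5, 0.5], {"a", "b"}, beta=5.0, max_len=2),
    "japanese": head_loss("ab", [0.5, 0.5], {"あ"}, beta=5.0, max_len=2),
}
good_routing = expected_loss({"latin": 0.9, "japanese": 0.1}, losses)
bad_routing = expected_loss({"latin": 0.1, "japanese": 0.9}, losses)
```

Under these assumptions, putting probability on a head that cannot represent the label raises the loss, so lowering the loss pushes the language recognizer toward the right head without any language annotation.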
  25. Experimental results and discussion

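  26. Multilingual Text Spotting results ➔ "We beat everything except CRAFTS!"
    Performance was compared with existing multilingual OCR models on a Text Spotting task covering 8 languages.
    ※ The experiments use the multilingual MLT19 dataset.
    Result: generally high performance, falling short of CRAFTS only.
    (Table columns: F-measure, Precision, Recall)
    Legend:
    • CRAFTS (paper): the results reported in the CRAFTS paper.
    • CRAFTS: the results of the authors' own re-implementation and evaluation of CRAFTS.
    • Single-head TextSpotter: the proposed method with a single head instead of multiple heads, covering all characters of the 8 languages (about 9,000 classes).
    • Multiplexed TextSpotter: the proposed method introduced in this deck.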
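  27. Multilingual Text Detection results ➔ "We beat everything except CRAFTS (paper)!"
    Text detection performance was compared on the multilingual data (MLT19).
    Result: generally high performance, falling short of CRAFTS (paper) only.
    (Table columns: Average Precision, F-measure, Precision, Recall)
    Legend:
    • CRAFTS (paper): the results reported in the CRAFTS paper.
    • CRAFTS: the results of the authors' own re-implementation and evaluation of CRAFTS.
    • Single-head TextSpotter: the proposed method with a single head instead of multiple heads, covering all characters of the 8 languages (about 9,000 classes).
    • Multiplexed TextSpotter: the proposed method introduced in this deck.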
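  28. Point to note: even without SoTA, the contribution is worthy of CVPR acceptance!
    On the text detection task, the per-language results show improved performance (especially for Arabic and Chinese).
    (Table columns: Average Precision, F-measure, Precision, Recall)
    The original goal was to propose a multilingual model that is easy to handle.
    ✓ The proposed model is multi-head, so adding and removing languages is easy. 👍
    ✓ It is arguably a more elegant configuration than a softmax over about 10,000 classes. 👍
    ✓ It can still be trained E2E like other OCR models. 👍
    It did not reach the performance of CRAFTS, but there is still plenty of room for improvement.
    ✓ CRAFTS is also strong on vertical text thanks to its Link Score.
    ✓ The authors consider that the gap in recognition performance on vertical text decided the outcome.
    ✓ The techniques in CRAFTS could also be introduced into the proposed method (perhaps a TextSpotter v4 implementing them will be announced soon...!?).
    ✓ In addition, CRAFTS has far more character-recognizer parameters and far more pre-training data, among other differences.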
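  29. Summary: multilingual OCR realized in a configuration that is easy to train and easy to modify!
    Problem: we want multilingual support, and we want it to be easy to extend.
    ➡ A multi-head structure was adopted, with 8 character recognizers in total, each specialized for one language.
    ➡ A language recognizer was added, and the character recognizer is switched according to the language inferred for the input text.
    Problem: a language recognizer has been added, but there are too few language annotations to train it.
    ➡ A penalty is applied when a character recognizer is given characters it does not support (characters of another language).
    ➡ As a result, the whole model (including the language recognizer) is trained E2E from text labels alone, without language annotations.
    Experimental results:
    • Although it did not reach SoTA, it achieved high performance comparable to prior methods in Text Detection and Text Recognition.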
  30. References
    • Mask TextSpotter v1, https://arxiv.org/pdf/….pdf
    • Mask TextSpotter v2, https://arxiv.org/pdf/….pdf
    • Mask TextSpotter v3, https://arxiv.org/pdf/….pdf
    • CRAFT, https://arxiv.org/pdf/….pdf
    • CRAFTS, https://arxiv.org/pdf/….pdf
    • What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis, https://arxiv.org/pdf/….pdf
    • CharNet, https://arxiv.org/pdf/….pdf
    • TextField: Learning A Deep Direction Field for Irregular Scene Text Detection, https://arxiv.org/pdf/….pdf
    • EAST: An Efficient and Accurate Scene Text Detector, https://arxiv.org/pdf/….pdf
    • Stacked Hourglass Networks, https://arxiv.org/pdf/….pdf
    • Towards Unconstrained End-to-End Text Spotting, https://arxiv.org/pdf/….pdf
  31. Supplement: datasets
    ICDAR 2017 MLT dataset (MLT17)
      Language & Script: 9 languages representing 6 different scripts equally
      Data Difficulty: multi-oriented scene text
      Annotation: quadrangle bounding boxes
    ICDAR 2019 MLT dataset (MLT19)
      Language & Script: 10 languages representing 7 different scripts
      Data Difficulty: multi-oriented scene text
      Annotation: quadrangle bounding boxes
    Total-Text dataset
      Language & Script: English
      Data Difficulty: wide variety of horizontal, multi-oriented and curved text
      Annotation: word-level polygon bounding boxes
    ICDAR 2019 ArT dataset (ArT19)
      Language & Script: English and Chinese
      Data Difficulty: highly challenging arbitrarily shaped text
      Annotation: arbitrary number of polygon vertices
    ICDAR 2017 RCTW dataset (RCTW17)
      Language & Script: Chinese
      Data Difficulty: scene text in Chinese
      Annotation: polygons drawn to surround every text line
    ICDAR 2019 LSVT dataset (LSVT19)
      Language & Script: Chinese, with about 20% of the labels in English words
      Data Difficulty: street view text in Chinese
      Annotation: polygons drawn to surround every text line
    ICDAR 2013 dataset (IC13)
      Language & Script: English
      Data Difficulty: horizontal text
      Annotation: word-level rectangular bounding boxes
    ICDAR 2015 dataset (IC15)
      Language & Script: English
      Data Difficulty: multi-oriented scene text
      Annotation: word-level quadrangle bounding boxes
    Note that none of these datasets has character-level annotation, and that the languages are unevenly represented.