Lucene Index Deep Dive

Slides used at an internal Lucene study group. The talk covers the index format of Apache Lucene https://lucene.apache.org/.
Sample code: https://github.com/takuyaa/hello-lucene
Blog post (draft): https://stop-the-world.hatenablog.com/draft/entry/NXbtOo1V1X261E94FKRPr5eNXmM

Takuya Asano

August 02, 2021

Transcript

  1. Lucene Index Deep Dive
    Takuya Asano (@takuya_b), Lucene Reading

  2. The story so far

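  3. What is Apache Lucene?
    https://github.com/apache/lucene
    • A full-text search engine library written in Java
    • Currently the de facto standard for full-text search
    • Implements nearly every feature that full-text search needs
    • Used as the core search component of Elasticsearch and Solr
        • Elasticsearch and Solr expose Lucene to users through a REST API
        • They also implement features that Lucene itself lacks, such as distributed search (not covered here)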
  4. Full-text Search Basics on Lucene
    Inverted index: the core data structure of search engines
    • In full-text search, the content to be searched is modeled as documents
        • Products, for e-commerce search
        • Web pages, for web search
        • In Lucene, a document corresponds to the Document class
    • Lucene indexes documents with an inverted index
        • Full-text search based on an inverted index is well suited to searching large document collections
        • In Lucene, each document is identified by its doc ID
    [Figure: an example of the inverted index structure for the documents "Lucene in Action" and "Lucene Cookbook". Each of the terms action, cookbook, in and lucene maps to a postings list.]
    The set of all terms is often referred to as a "term dictionary" or simply "dictionary".
    ref. "Information Retrieval and Web Search" summary: inverted index, on the stop-the-world blog (https://stop-the-world.hatenablog.com/)
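The term-to-postings mapping described on this slide can be sketched in a few lines of plain Java. This is only a toy model, not how Lucene stores its index: the class name `InvertedIndexSketch`, whitespace tokenization, lowercasing, and the assumption that doc IDs are simply positions in the input list are all illustrative choices.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class InvertedIndexSketch {
    /** Builds term -> ascending list of doc IDs. Doc IDs are positions in the input list (an assumption). */
    static Map<String, List<Integer>> build(List<String> docs) {
        // TreeMap keeps terms sorted, similar in spirit to a term dictionary.
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String token : docs.get(docId).toLowerCase(Locale.ROOT).split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                List<Integer> postings = index.computeIfAbsent(token, t -> new ArrayList<>());
                // Record each doc ID at most once per term.
                if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                    postings.add(docId);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println(build(List.of("Lucene in Action", "Lucene Cookbook")));
    }
}
```

Looking up a term is then a single map access, and the returned postings list names every document containing it, which is why this layout scales to large collections.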
  5. Indexing Process Overview
    Core components for indexing
    • Document: construct a document object
    • Analyzer: analyze text contents (preprocessing)
    • IndexWriter: build a Lucene index
    • Directory: write an index to a storage
  6. Document Representation
    Model object to be indexed
    • A document is represented as an object of the Document class
    • A Document consists of multiple Fields
        • A data structure similar to a key-value map
    • A Field has a field name, content, and a field type

        Document doc1 = new Document();
        doc1.add(new Field("title", "Lucene in Action", TextField.TYPE_STORED));
        Document doc2 = new Document();
        doc2.add(new Field("title", "Lucene Cookbook", TextField.TYPE_STORED));

    • The three Field arguments are the field name, the field content, and the field type ("stored" or "not stored")
    • Stored fields are stored in an index. This allows you to retrieve the field contents at search time.
  7. Analyzer
    Text preprocessors
    • Analyzer = Tokenizer + Filters
    • Tokenizer
        • Splits a text string into a sequence of Tokens
    • Filter
        • Removes Tokens according to a fixed rule (e.g. StopFilter)
        • Replaces the text of Tokens according to a fixed rule (e.g. LowerCaseFilter)
    • Example of an Analyzer: StandardAnalyzer
        • StandardAnalyzer = StandardTokenizer + StopFilter + LowerCaseFilter
        • StandardTokenizer is a Tokenizer based on Unicode Text Segmentation
    • Example: "Lucene in Action"
        → StandardTokenizer → "Lucene", "in", "Action"
        → StopFilter* → "Lucene", "Action"
        → LowerCaseFilter → "lucene", "action"
        * If the StopFilter has "in" as a stop word
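The tokenizer-then-filters chain above can be imitated with plain Java collections. This is a rough sketch, not Lucene's `Analyzer` API: the class `AnalyzerSketch`, its method names, and the split-on-non-alphanumerics tokenizer are stand-ins invented for illustration, and the filters run in the order shown on the slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class AnalyzerSketch {
    /** Tokenizer stand-in: split on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Stop-filter stand-in: drop tokens that exactly match a stop word. */
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!stopWords.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    /** Lowercase-filter stand-in: rewrite each token's text. */
    static List<String> lowerCase(List<String> tokens) {
        List<String> lowered = new ArrayList<>();
        for (String t : tokens) {
            lowered.add(t.toLowerCase(Locale.ROOT));
        }
        return lowered;
    }

    /** Chains the three steps in the order shown on the slide. */
    static List<String> analyze(String text, Set<String> stopWords) {
        return lowerCase(removeStopWords(tokenize(text), stopWords));
    }

    public static void main(String[] args) {
        System.out.println(analyze("Lucene in Action", Set.of("in")));
    }
}
```

Because each step takes and returns a token list, filters can be added, removed, or reordered without touching the tokenizer, which is the point of the Tokenizer + Filters split.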
  8. Basic Indexing API
    A code example to build a Lucene index

        // Create a directory for storing Lucene index
        Path indexDirPath = Files.createDirectory(Path.of("index"));
        Directory directory = FSDirectory.open(indexDirPath);

        // Set up IndexWriter
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        IndexWriter indexWriter = new IndexWriter(directory, config);

        // Index a document: "Lucene in Action"
        Document doc1 = new Document();
        doc1.add(new Field("title", "Lucene in Action", TextField.TYPE_STORED));
        indexWriter.addDocument(doc1);

        // Index a document: "Lucene Cookbook"
        Document doc2 = new Document();
        doc2.add(new Field("title", "Lucene Cookbook", TextField.TYPE_STORED));
        indexWriter.addDocument(doc2);

        // Write index to the directory
        indexWriter.close();
  9. Index Segments
    Lucene index files
    Index after 1st commit:
        Segment: _0.fdm _0.fdt _0.fdx _0.fnm _0.nvd _0.nvm _0.si _0_Lucene84_0.doc _0_Lucene84_0.pos _0_Lucene84_0.tim _0_Lucene84_0.tip _0_Lucene84_0.tmd
        Segments File: segments_1
        Lock File: write.lock
    • An index consists of multiple segments
        • All of them are stored in the same directory
    • A segment is a sub-index
        • On its own, a segment functions almost as a complete Lucene index
    • A segment consists of multiple files
        • File names take the form _gen.ext or _gen_Lucene84_0.ext
        • E.g. _0.fnm, _0_Lucene84_0.pos, …
        • gen: the generation of the segment
        • ext: the extension for each format (e.g. fnm, pos, …)
    • A segment is created when the IndexWriter flushes
        • It is referenced from segments_N only once it has been committed
        • N is the commit generation (segments_1 after the first commit)
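The naming scheme above can be checked mechanically. The following sketch splits a file name into its generation, optional format suffix, and extension; the class `SegmentFileNameSketch` and the regular expression are assumptions made for illustration (in particular, the character set allowed for `gen` is a guess), not something taken from Lucene's own code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SegmentFileNameSketch {
    // Assumed shape: _<gen>[_<suffix>].<ext>, e.g. _0.fnm or _0_Lucene84_0.pos
    private static final Pattern SEGMENT_FILE =
        Pattern.compile("^_([0-9a-z]+)(?:_([A-Za-z0-9_]+))?\\.([a-z0-9]+)$");

    /** Returns {gen, suffix (empty if absent), ext}, or null for files that are not per-segment files. */
    static String[] parse(String fileName) {
        Matcher m = SEGMENT_FILE.matcher(fileName);
        if (!m.matches()) {
            return null; // e.g. segments_1 or write.lock
        }
        String suffix = m.group(2) == null ? "" : m.group(2);
        return new String[] {m.group(1), suffix, m.group(3)};
    }

    public static void main(String[] args) {
        for (String name : new String[] {"_0.fnm", "_0_Lucene84_0.pos", "segments_1", "write.lock"}) {
            String[] parts = parse(name);
            System.out.println(name + " -> " + (parts == null ? "not a segment file" : String.join(" | ", parts)));
        }
    }
}
```

Grouping a directory listing by the parsed `gen` value is a quick way to see which files belong to which segment.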
  10. Index File Formats
    Format Name (Extension; Related Class): Description
    • Segment File (segments_N; SegmentInfos): holds the commit point
    • Lock File (write.lock; N/A): lock file for thread safety
    • Segment Info (.si; SegmentInfoFormat): segment metadata
    • Field Infos (.fnm; FieldInfosFormat): field information (field info) such as field names
    • Term Metadata (.tmd; PostingsFormat): statistics about terms and pointers into the FST
    • Term Index (.tip; PostingsFormat): pointers into the term dictionary (FST binary)
    • Term Dictionary (.tim; PostingsFormat): terms, their statistics, and pointers to postings lists
    • Frequencies and Skip Data (.doc; PostingsFormat): per-term postings lists and term frequencies
    • Positions (.pos; PostingsFormat): positions at which terms occur within documents
    • Payloads and Offsets (.pay; PostingsFormat): per-position metadata (such as character offsets)
    • Field Data (.fdt; StoredFieldsFormat): field data
    • Field Index (.fdx; StoredFieldsFormat): pointers to the field data
    • Field Meta (.fdm; StoredFieldsFormat): metadata about the field data
    • Norms Data (.nvd; Lucene90NormsFormat): norm data for documents and fields
    • Norms Metadata (.nvm; Lucene90NormsFormat): metadata about the norms of documents and fields
    • Live Documents (.liv; Lucene50LiveDocsFormat): information about which documents are live (not deleted)
  11. Overview of the Lucene index

  12. Lucene Architecture
    How Lucene uses index files
    • Segments are written out as files
    • At query time, they are read back from those files
    • The files are basically used via mmap (relying on the OS file system cache)
        • Loading the whole index into main memory would need more memory than could ever be provisioned
    Figure source: Białecki, Andrzej, et al. "Apache Lucene 4." SIGIR 2012 workshop on open source information retrieval. 2012. (hosted at opensearchlab.otago.ac.nz)
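What "using a file via mmap" means can be shown with the JDK alone. The sketch below maps a file read-only with `FileChannel.map` and copies its bytes out; it does not use Lucene's directory classes, and the class `MmapSketch` and its helper methods are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MmapSketch {
    /** Maps the whole file read-only and copies its bytes out of the mapping. */
    static byte[] readViaMmap(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping is backed by the OS page cache rather than the Java heap.
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            byte[] bytes = new byte[buffer.remaining()];
            buffer.get(bytes);
            return bytes;
        }
    }

    /** Writes data to a temporary file, reads it back through mmap, and cleans up. */
    static byte[] roundTrip(byte[] data) {
        try {
            Path tmp = Files.createTempFile("segment-sketch", ".bin");
            try {
                Files.write(tmp, data);
                return readViaMmap(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] back = roundTrip(new byte[] {0x3f, (byte) 0xd7, 0x6c, 0x17});
        System.out.println("read " + back.length + " bytes through a memory mapping");
    }
}
```

A real reader would access the mapped buffer in place instead of copying it; the copy here only keeps the example easy to verify.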
  13. Index format abstraction
    Codec API
    • The Codec API was introduced in Lucene 4.0
    • Codec ≈ index file formats
        • For each index format, it defines how data is turned into binary (the encoding)
    • It abstracts the reading and writing of index files
    • Format implementations became easy to swap
        • Special-purpose Codecs such as SimpleTextCodec can be specified
    • A Codec is implemented by extending the abstract class Codec
        • An implementation is specified for each index format
    • A Codec can be set on the IndexWriterConfig used when writing

    Excerpt from the org.apache.lucene.codecs.Codec class:

        /** Encodes/decodes postings */
        public abstract PostingsFormat postingsFormat();

        /** Encodes/decodes docvalues */
        public abstract DocValuesFormat docValuesFormat();

        /** Encodes/decodes stored fields */
        public abstract StoredFieldsFormat storedFieldsFormat();

        /** Encodes/decodes term vectors */
        public abstract TermVectorsFormat termVectorsFormat();

        /** Encodes/decodes field infos file */
        public abstract FieldInfosFormat fieldInfosFormat();

        /** Encodes/decodes segment info file */
        public abstract SegmentInfoFormat segmentInfoFormat();

        /** Encodes/decodes document normalization values */
        public abstract NormsFormat normsFormat();

        /** Encodes/decodes live docs */
        public abstract LiveDocsFormat liveDocsFormat();

        /** Encodes/decodes compound files */
        public abstract CompoundFormat compoundFormat();

        /** Encodes/decodes points index */
        public abstract PointsFormat pointsFormat();

        /** Encodes/decodes numeric vector fields */
        public abstract KnnVectorsFormat knnVectorsFormat();
  14. Index format compatibility
    Codec API
    • Index formats can now be versioned
        • Formats of different versions can be mixed
        • For example, the LiveDocs format can keep using the one from an older Lucene release
    • Which format versions are tied together is collected in a LuceneXXCodec class
        • e.g. Lucene87Codec, Lucene90Codec
        • These extend the abstract class Codec
    • Codec-related implementations mostly live under the org.apache.lucene.codecs package
        • Packages are split by version
        • For example, lucene90 holds the Codec newly defined in Lucene 9.0
        • Backward-compatible Codecs are in org.apache.lucene.backward_codecs

    Excerpt from org.apache.lucene.backward_codecs.lucene87.Lucene87Codec (XX stands for each format's version number):

        public class Lucene87Codec extends Codec {
          …
          private final TermVectorsFormat vectorsFormat = new LuceneXXTermVectorsFormat();
          private final FieldInfosFormat fieldInfosFormat = new LuceneXXFieldInfosFormat();
          private final SegmentInfoFormat segmentInfosFormat = new LuceneXXSegmentInfoFormat();
          private final LiveDocsFormat liveDocsFormat = new LuceneXXLiveDocsFormat();
          private final CompoundFormat compoundFormat = new LuceneXXCompoundFormat();
          private final PointsFormat pointsFormat = new LuceneXXPointsFormat();

          @Override
          public TermVectorsFormat termVectorsFormat() { return vectorsFormat; }

          @Override
          public final FieldInfosFormat fieldInfosFormat() { return fieldInfosFormat; }

          @Override
          public SegmentInfoFormat segmentInfoFormat() { return segmentInfosFormat; }
          …
        }
  15. Analyzing the Lucene index

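  16. Analyzing Lucene Index Binary
    Indexing settings
    • Build a minimal index and analyze it
        • Code used for the analysis: https://github.com/takuyaa/hello-lucene
        • Built against a lucene-core SNAPSHOT release
    • Indexed documents (each identified by its doc ID)
        • "lucene in action"
        • "lucene cookbook"
    [Figure: inverted index with the terms action, cookbook, in and lucene, each mapped to its postings list]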
  17. Analyzing Lucene Index Binary
    Index files

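  18. Binary Format: Basic Types
    How Lucene writes data as binary
    • The binary format used when reading and writing a Lucene index
    • Low-level reading and writing of the basic types is collected in methods of the following classes
        • DataOutput: writing
        • DataInput: reading
        • These are abstract classes; in practice FSDirectory.FSIndexOutput / FSDirectory.FSIndexInput are used
    • Every write ultimately comes down to the writeByte() and writeBytes() methods of DataOutput
        • All data is written as a byte or a byte sequence
    • Some types (BEInt, BELong) have their methods in CodecUtil
        • The actual writing is still delegated to the methods of DataOutput and DataInput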
  19. Binary Format: Basic Types
    Integer types
    • Short: DataOutput#writeShort(short i)
        • Writes a 2-byte integer in little-endian order
    • Int: DataOutput#writeInt(int i)
        • Writes a 4-byte integer in little-endian order
    • Long: DataOutput#writeLong(long i)
        • Writes an 8-byte integer in little-endian order
    • BEInt: CodecUtil#writeBEInt(DataOutput out, int i)
        • Writes a 4-byte integer in big-endian order
    • BELong: CodecUtil#writeBELong(DataOutput out, long i)
        • Writes an 8-byte integer in big-endian order

        public static void writeBEInt(DataOutput out, int i) throws IOException {
          out.writeByte((byte) (i >> 24));
          out.writeByte((byte) (i >> 16));
          out.writeByte((byte) (i >> 8));
          out.writeByte((byte) i);
        }

        public void writeInt(int i) throws IOException {
          writeByte((byte) i);
          writeByte((byte) (i >> 8));
          writeByte((byte) (i >> 16));
          writeByte((byte) (i >> 24));
        }
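To see the difference between the two byte orders on a concrete value, the following self-contained sketch reproduces the shift-and-write logic of the two methods above against a `ByteArrayOutputStream`. The class `EndianSketch` and its `hex` helper are illustrative names, not part of Lucene.

```java
import java.io.ByteArrayOutputStream;

class EndianSketch {
    /** Same byte order as DataOutput#writeInt on the slide: least significant byte first. */
    static byte[] littleEndian(int i) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(i);
        out.write(i >> 8);
        out.write(i >> 16);
        out.write(i >> 24);
        return out.toByteArray();
    }

    /** Same byte order as CodecUtil#writeBEInt on the slide: most significant byte first. */
    static byte[] bigEndian(int i) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(i >> 24);
        out.write(i >> 16);
        out.write(i >> 8);
        out.write(i);
        return out.toByteArray();
    }

    /** Formats bytes as space-separated two-digit hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int value = 0x3fd76c17;
        System.out.println("little-endian: " + hex(littleEndian(value))); // 17 6c d7 3f
        System.out.println("big-endian:    " + hex(bigEndian(value)));    // 3f d7 6c 17
    }
}
```

`ByteArrayOutputStream.write(int)` keeps only the low eight bits of its argument, which plays the role of the `(byte)` casts in the original methods.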
  20. Binary Format: Basic Types
    VInt type
    • DataOutput#writeVInt(int i)
    • Writes a 4-byte integer (int) in a variable-length format
    • The most significant bit (MSB) of each byte is used as a flag
        • If the MSB is 1, read the next byte; if it is 0, reading ends with that byte
    • 0 to 127 are represented as they are, in 1 byte
        • 0: 0000 0000 (00)
        • 1: 0000 0001 (01)
        • 127: 0111 1111 (7f)
    • 128 to 16383 are represented in 2 bytes (little-endian)
        • 128: 1000 0000 0000 0001 (80 01)
        • 129: 1000 0001 0000 0001 (81 01)
        • 16383: 1111 1111 0111 1111 (ff 7f)
    • 16384 and above are represented in 3 bytes or more (little-endian)
        • 16384: 1000 0000 1000 0000 0000 0001 (80 80 01)

        public final void writeVInt(int i) throws IOException {
          while ((i & ~0x7F) != 0) {
            writeByte((byte) ((i & 0x7F) | 0x80));
            i >>>= 7;
          }
          writeByte((byte) i);
        }

    • The original 4-byte integer is processed 7 bits at a time
        1. Take the bitwise AND with 0111 1111 (7f) (look at the low 7 bits only)
        2. Take the bitwise OR with 1000 0000 (80) (set the MSB to 1)
        3. Write the resulting byte with writeByte()
        4. Unsigned right shift by 7 bits (move on to the next 7 bits)
        5. Repeat until nothing is left to write
    • @moco_beta's presentation slides on Speaker Deck (https://speakerdeck.com/mocobeta) explain this clearly
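The encoding loop above, together with a matching decoder, can be tried without Lucene. In this sketch `encode` mirrors the `writeVInt` loop on the slide, while `decode` is a reading counterpart written for illustration (the slide only shows the writer); the class name `VIntSketch` is hypothetical.

```java
import java.io.ByteArrayOutputStream;

class VIntSketch {
    /** Mirrors the writeVInt loop: 7 payload bits per byte, MSB set while more bytes follow. */
    static byte[] encode(int i) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
        return out.toByteArray();
    }

    /** Illustrative reader: accumulates 7-bit groups, lowest group first, until a byte with MSB 0. */
    static int decode(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return value;
            }
            shift += 7;
        }
        throw new IllegalArgumentException("truncated VInt");
    }

    public static void main(String[] args) {
        for (int n : new int[] {0, 127, 128, 16383, 16384}) {
            byte[] encoded = encode(n);
            System.out.println(n + " -> " + encoded.length + " byte(s), decodes to " + decode(encoded));
        }
    }
}
```

Small values, which dominate in postings data such as doc ID deltas and frequencies, therefore cost a single byte instead of four.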
  21. Binary Format: Basic Types
    VLong type
    • DataOutput#writeVLong(long i)
    • Same as VInt, except that it takes a long as its argument

        public final void writeVLong(long i) throws IOException {
          if (i < 0) {
            throw new IllegalArgumentException("cannot write negative vLong (got: " + i + ")");
          }
          writeSignedVLong(i);
        }

        private void writeSignedVLong(long i) throws IOException {
          while ((i & ~0x7FL) != 0L) {
            writeByte((byte) ((i & 0x7FL) | 0x80L));
            i >>>= 7;
          }
          writeByte((byte) i);
        }
  22. Binary Format: Basic Types
    String type
    • DataOutput#writeString(String s)
    • First, the length is written as a VInt
    • Then the UTF-8 encoded string is written with writeBytes()

        public void writeString(String s) throws IOException {
          final BytesRef utf8Result = new BytesRef(s);
          writeVInt(utf8Result.length);
          writeBytes(utf8Result.bytes, utf8Result.offset, utf8Result.length);
        }
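The length-prefixed layout can be reproduced with the JDK's own UTF-8 encoder. The sketch below is a stand-in for `writeString`, not Lucene code: `StringSketch`, `encodeString`, and the private `writeVInt` helper are names made up for this example, and `String.getBytes(UTF_8)` replaces `BytesRef`.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class StringSketch {
    /** Variable-length int, as in the VInt slide. */
    private static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    /** Length prefix (UTF-8 byte count as a VInt) followed by the UTF-8 bytes. */
    static byte[] encodeString(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, utf8.length);
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] ascii = encodeString("Lucene90PostingsWriterPos");
        System.out.println("prefix byte: 0x" + Integer.toHexString(ascii[0] & 0xff) + ", total bytes: " + ascii.length);
        // The prefix counts UTF-8 bytes, not Java chars: two kanji occupy six bytes.
        byte[] kanji = encodeString("検索");
        System.out.println("prefix byte: " + (kanji[0] & 0xff) + ", total bytes: " + kanji.length);
    }
}
```

A reader can skip over a string it does not need by decoding the prefix and advancing that many bytes, without decoding the text itself.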
  23. Binary Format: Basic Types
    Map<String, String>
    • DataOutput#writeMapOfStrings(Map<String, String> map)
    • First, the number of entries is written with writeVInt()
    • Then, for each entry, the key and the value are written with writeString() (repeated once per entry)

        public void writeMapOfStrings(Map<String, String> map) throws IOException {
          writeVInt(map.size());
          for (Map.Entry<String, String> entry : map.entrySet()) {
            writeString(entry.getKey());
            writeString(entry.getValue());
          }
        }
  24. Binary Format: Basic Types
    Set<String>
    • DataOutput#writeSetOfStrings(Set<String> set)
    • First, the number of elements is written with writeVInt()
    • Then each value is written with writeString() (repeated once per element)

        public void writeSetOfStrings(Set<String> set) throws IOException {
          writeVInt(set.size());
          for (String value : set) {
            writeString(value);
          }
        }
  25. Binary Format: Index Header
    IndexHeader
    • Every Lucene index file starts with a header written in a common format
    • The header of an index file (IndexHeader) consists of CodecHeader, ObjectID and ObjectSuffix
    • CodecUtil#writeIndexHeader()
        • CodecUtil#writeHeader() writes the CodecHeader
            • Writes the magic number CODEC_MAGIC (= 0x3fd76c17) as a BEInt
            • Writes the codec name codec as a String
            • Writes the file version version as a BEInt
        • DataOutput#writeBytes() writes the ObjectID
        • The length of SuffixBytes is written as a fixed 1 byte, followed by SuffixBytes itself

        public static void writeIndexHeader(
            DataOutput out, String codec, int version, byte[] id, String suffix) throws IOException {
          if (id.length != StringHelper.ID_LENGTH) {
            throw new IllegalArgumentException("Invalid id: " + StringHelper.idToString(id));
          }
          writeHeader(out, codec, version);
          out.writeBytes(id, 0, id.length);
          BytesRef suffixBytes = new BytesRef(suffix);
          if (suffixBytes.length != suffix.length() || suffixBytes.length >= 256) {
            throw new IllegalArgumentException(
                "suffix must be simple ASCII, less than 256 characters in length [got " + suffix + "]");
          }
          out.writeByte((byte) suffixBytes.length);
          out.writeBytes(suffixBytes.bytes, suffixBytes.offset, suffixBytes.length);
        }

        public static void writeHeader(DataOutput out, String codec, int version) throws IOException {
          BytesRef bytes = new BytesRef(codec);
          if (bytes.length != codec.length() || bytes.length >= 128) {
            throw new IllegalArgumentException(
                "codec must be simple ASCII, less than 128 characters in length [got " + codec + "]");
          }
          writeBEInt(out, CODEC_MAGIC);
          out.writeString(codec);
          writeBEInt(out, version);
        }
  26. Binary Format: Index Header
    IndexHeader
    • Let's look at the header of the _0_Lucene90_0.pos file
        • It spans addresses 0x00 through 0x3c
    • Expressed as YAML, the header looks like this:

        IndexHeader:
          CodecHeader:
            Magic: 1071082519 # BEInt (3f d7 6c 17): magic number
            CodecName: "Lucene90PostingsWriterPos" # String (length: 0x19 = 25): codec name
            Version: 0 # BEInt (00 00 00 00): file version
          ObjectID: [] # Bytes^16
          ObjectSuffix: # the "_Lucene90_0" part that follows "_0" in the file name
            SuffixLength: 10 # Byte (0a)
            SuffixBytes: "Lucene90_0" # Bytes^SuffixLength
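The address range quoted above can be cross-checked by laying out the same fields by hand. The following sketch rebuilds the header with a big-endian `ByteBuffer`; it is an approximation for illustration, not Lucene's `CodecUtil`. The class `IndexHeaderSketch` is hypothetical, the 16-byte ID is filled with zeros because the real ID is not shown on the slide, and the single-byte length prefix relies on the codec name being shorter than 128 bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class IndexHeaderSketch {
    static final int CODEC_MAGIC = 0x3fd76c17;
    static final int ID_LENGTH = 16; // matches "Bytes^16" in the YAML dump

    /** Lays out CodecHeader + ObjectID + ObjectSuffix as described on the slides. */
    static byte[] header(String codec, int version, byte[] id, String suffix) {
        byte[] codecBytes = codec.getBytes(StandardCharsets.UTF_8);
        byte[] suffixBytes = suffix.getBytes(StandardCharsets.UTF_8);
        if (codecBytes.length >= 128 || suffixBytes.length >= 256 || id.length != ID_LENGTH) {
            throw new IllegalArgumentException("invalid header field");
        }
        // ByteBuffer is big-endian by default, which matches BEInt.
        ByteBuffer buf = ByteBuffer.allocate(4 + 1 + codecBytes.length + 4 + ID_LENGTH + 1 + suffixBytes.length);
        buf.putInt(CODEC_MAGIC);             // Magic (BEInt)
        buf.put((byte) codecBytes.length);   // String length: a VInt below 128 is a single byte
        buf.put(codecBytes);                 // CodecName
        buf.putInt(version);                 // Version (BEInt)
        buf.put(id);                         // ObjectID
        buf.put((byte) suffixBytes.length);  // SuffixLength
        buf.put(suffixBytes);                // SuffixBytes
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] h = header("Lucene90PostingsWriterPos", 0, new byte[ID_LENGTH], "Lucene90_0");
        System.out.printf("header occupies 0x00..0x%02x (%d bytes)%n", h.length - 1, h.length);
    }
}
```

Summing the fields gives 4 + 26 + 4 + 16 + 1 + 10 = 61 bytes, which agrees with the header ending at address 0x3c.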
  27. Binary Format: Footer
    CodecFooter
    • Every index file ends with a footer (CodecFooter) written in a common format
    • CodecUtil#writeFooter()
        • Writes the magic number FOOTER_MAGIC (= ~0x3fd76c17) as a BEInt
        • Writes 0 as a BEInt (the fixed AlgorithmID)
        • Writes the CRC32 checksum as a BELong (see the java.util.zip.CRC32 class for details)

        public static void writeFooter(IndexOutput out) throws IOException {
          writeBEInt(out, FOOTER_MAGIC);
          writeBEInt(out, 0);
          writeCRC(out);
        }
  28. Binary Format: Footer
    CodecFooter
    • Let's look at the footer of the _0_Lucene90_0.pos file
        • It spans addresses 0x42 through 0x51
    • Expressed as YAML, it looks like this:

        CodecFooter:
          Magic: 3223884776 # BEInt (c0 28 93 e8)
          AlgorithmID: 0 # BEInt (00 00 00 00)
          Checksum: 1437576107 # BELong (00 00 00 00 55 af ab ab)
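The footer layout can also be exercised with `java.util.zip.CRC32` directly. The sketch below appends a 16-byte footer to arbitrary bytes and verifies it again. It is illustrative only: `FooterSketch` and its methods are made-up names, and it assumes the checksum covers every byte that precedes it in the file, including the footer magic and the AlgorithmID.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

class FooterSketch {
    static final int FOOTER_MAGIC = ~0x3fd76c17; // c0 28 93 e8

    /** Appends magic (BEInt), AlgorithmID 0 (BEInt) and a CRC32 (BELong) to the given bytes. */
    static byte[] appendFooter(byte[] body) {
        ByteBuffer buf = ByteBuffer.allocate(body.length + 16); // big-endian by default
        buf.put(body);
        buf.putInt(FOOTER_MAGIC);
        buf.putInt(0);
        CRC32 crc = new CRC32();
        // Assumption: the checksum covers everything written before it.
        crc.update(buf.array(), 0, body.length + 8);
        buf.putLong(crc.getValue());
        return buf.array();
    }

    /** Recomputes the checksum and compares all three footer fields. */
    static boolean verify(byte[] file) {
        if (file.length < 16) {
            return false;
        }
        ByteBuffer buf = ByteBuffer.wrap(file);
        int magic = buf.getInt(file.length - 16);
        int algorithmId = buf.getInt(file.length - 12);
        long stored = buf.getLong(file.length - 8);
        CRC32 crc = new CRC32();
        crc.update(file, 0, file.length - 8);
        return magic == FOOTER_MAGIC && algorithmId == 0 && stored == crc.getValue();
    }

    public static void main(String[] args) {
        byte[] file = appendFooter(new byte[] {1, 2, 3, 4, 5});
        System.out.println("footer magic (unsigned): " + Integer.toUnsignedLong(FOOTER_MAGIC));
        System.out.println("intact file verifies: " + verify(file));
        file[0] ^= 1; // flip one bit in the body
        System.out.println("corrupted file verifies: " + verify(file));
    }
}
```

Since CRC32 yields a 32-bit value, the upper four bytes of the 8-byte checksum field are always zero, as in the dump above.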
  29. Analyzing Index File Formats
    Other file formats
    • The analysis results for the files of each format are being collected in the following blog post (work in progress): https://stop-the-world.hatenablog.com/draft/entry/NXbtOo1V1X261E94FKRPr5eNXmM

  30. License
    • The Lucene source code quoted in these slides is licensed under the Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0).