Lucene Index Deep Dive

Slides from an in-house Lucene study group, covering the index format of Apache Lucene (https://lucene.apache.org/).
Sample code: https://github.com/takuyaa/hello-lucene
Blog post (draft): https://stop-the-world.hatenablog.com/draft/entry/NXbtOo1V1X261E94FKRPr5eNXmM

Takuya Asano

August 02, 2021

Transcript

  1. Takuya Asano @takuya_b

    Lucene Index Deep Dive

    Lucene Reading

  2. The story so far

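  3. What is Apache Lucene?
    https://github.com/apache/lucene
    • A full-text search engine library written in Java
    • As of this writing, the de facto standard for full-text search
    • Implements nearly every feature needed for full-text search
    • Used as the core search component of Elasticsearch and Solr
      • Elasticsearch and Solr expose Lucene to users through a REST API
      • They also implement features that Lucene lacks (such as distributed search), which are not covered here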

  4. Full-text Search Basics on Lucene
    Inverted index: The core data structure of search engines
    • In full-text search, the content to be searched is modeled as documents
      • For e-commerce search, a document is a product
      • For web search, a document is a web page
      • In Lucene, a document corresponds to the Document class
    • Lucene indexes documents with an inverted index
      • Full-text search based on an inverted index is well suited to searching large document collections
      • In Lucene, each document is identified by a doc id

    Figure: An example of an inverted index structure for the documents "Lucene in Action" and "Lucene Cookbook", mapping each term (action, cookbook, in, lucene) to its postings list.
    A set of all terms is often referred to as a "term dictionary" or simply "dictionary".

    ref: "Information Retrieval and Web Search" summary: inverted index (stop-the-world), https://stop-the-world.hatenablog.com/entry/…
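
The term-to-postings mapping described on this slide can be sketched in plain Java without Lucene. This is a minimal illustration, not Lucene's implementation: the class name `InvertedIndexSketch`, the whitespace tokenization, and the choice of list positions (0, 1, ...) as doc ids are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene.
class InvertedIndexSketch {
    /** Builds term -> ascending list of doc ids; a doc id is the position in the input list (assumption). */
    static Map<String, List<Integer>> build(List<String> docs) {
        Map<String, List<Integer>> index = new TreeMap<>(); // sorted terms act as the "term dictionary"
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : docs.get(docId).toLowerCase(Locale.ROOT).trim().split("\\s+")) {
                if (term.isEmpty()) {
                    continue;
                }
                List<Integer> postings = index.computeIfAbsent(term, t -> new ArrayList<>());
                // Record each document at most once per term.
                if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                    postings.add(docId);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        build(List.of("Lucene in Action", "Lucene Cookbook"))
            .forEach((term, postings) -> System.out.println(term + " -> " + postings));
    }
}
```

A lookup for a term is then a single map access, which is why this structure scales to large collections better than scanning every document.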

  5. Indexing Process Overview
    Core components for indexing
    1. Construct a document object
    2. Analyze text contents (preprocessing)
    3. Build a Lucene index
    4. Write an index to a storage

    Components: Document, Analyzer, IndexWriter, Directory

  6. Document Representation
    Model object to be indexed
    • A document is represented as an object of the Document class
    • A Document consists of multiple Fields
      • A data structure similar to a key-value map
    • A Field has a field name, its content, and a field type

    Document doc1 = new Document();
    doc1.add(new Field("title", "Lucene in Action", TextField.TYPE_STORED));

    Document doc2 = new Document();
    doc2.add(new Field("title", "Lucene Cookbook", TextField.TYPE_STORED));

    The three arguments of Field are the field name, the field content, and the field type ("stored" or "not stored").
    Stored fields are stored in an index. This allows you to retrieve the field contents at search time.

  7. Analyzer
    Text preprocessors
    • Analyzer = Tokenizer + Filters
      • Tokenizer
        • Splits a text string into a sequence of Tokens
      • Filter
        • Removes Tokens according to a rule (e.g. StopFilter)
        • Rewrites the text of Tokens according to a rule (e.g. LowerCaseFilter)
    • Example of an Analyzer: StandardAnalyzer
      • StandardAnalyzer = StandardTokenizer + LowerCaseFilter + StopFilter
      • StandardTokenizer is a Tokenizer based on Unicode Text Segmentation

    "Lucene in Action"
      → StandardTokenizer → "Lucene", "in", "Action"
      → LowerCaseFilter → "lucene", "in", "action"
      → StopFilter → "lucene", "action" (if the StopFilter has "in" as a stop word)
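
The tokenizer-plus-filters chain can be imitated with the Java standard library alone. The sketch below is only an analogy: `AnalyzerSketch` is a hypothetical name, splitting on whitespace stands in for Unicode Text Segmentation, and the stop-word set is supplied by the caller. It lowercases before removing stop words, which is the order StandardAnalyzer applies its filters.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical stand-in for an Analyzer; it does not use Lucene classes.
class AnalyzerSketch {
    static List<String> analyze(String text, Set<String> stopWords) {
        return Arrays.stream(text.trim().split("\\s+"))   // tokenizer stand-in: split on whitespace
            .map(token -> token.toLowerCase(Locale.ROOT)) // LowerCaseFilter stand-in
            .filter(token -> !token.isEmpty() && !stopWords.contains(token)) // StopFilter stand-in
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(analyze("Lucene in Action", Set.of("in"))); // [lucene, action]
    }
}
```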

  8. Basic Indexing API
    A code example to build a Lucene index

    // Create a directory for storing Lucene index
    Path indexDirPath = Files.createDirectory(Path.of("index"));
    Directory directory = FSDirectory.open(indexDirPath);

    // Set up IndexWriter
    Analyzer analyzer = new StandardAnalyzer();
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    IndexWriter indexWriter = new IndexWriter(directory, config);

    // Index a document: "Lucene in Action"
    Document doc1 = new Document();
    doc1.add(new Field("title", "Lucene in Action", TextField.TYPE_STORED));
    indexWriter.addDocument(doc1);

    // Index a document: "Lucene Cookbook"
    Document doc2 = new Document();
    doc2.add(new Field("title", "Lucene Cookbook", TextField.TYPE_STORED));
    indexWriter.addDocument(doc2);

    // Write index to the directory
    indexWriter.close();

  9. Index Segments
    Lucene index files

    Index after 1st commit:

    Segment:
      _0.fdm
      _0.fdt
      _0.fdx
      _0.fnm
      _0.nvd
      _0.nvm
      _0.si
      _0_Lucene84_0.doc
      _0_Lucene84_0.pos
      _0_Lucene84_0.tim
      _0_Lucene84_0.tip
      _0_Lucene84_0.tmd
    Segments File:
      segments_1
    Lock File:
      write.lock

    • An index consists of multiple segments
      • All of them are stored in the same directory
    • A segment is a sub-index
      • Each one works almost as a standalone Lucene index
    • A segment consists of multiple files
      • File names have the form _gen.ext or _gen_Lucene84_0.ext
        • e.g. _0.fnm, _0_Lucene84_0.pos, …
        • gen: the generation of the segment (e.g. 0)
        • ext: the extension of each format (e.g. fnm, pos)
    • One segment is created each time IndexWriter flushes
      • A segment is referenced from segments_N only once it has been committed
        • N is the commit generation (segments_1 after the first commit)
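
To make the `_gen.ext` / `_gen_Lucene84_0.ext` naming rule concrete, here is a small parser sketch. The regular expression and the class name `SegmentFileNameSketch` are assumptions written for this example from the file names listed on the slide; Lucene has its own utilities for this, which are not used here.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that splits a per-segment file name into its parts.
class SegmentFileNameSketch {
    // _<gen>[_<suffix>].<ext>  (pattern assumed from the names shown on the slide)
    private static final Pattern NAME = Pattern.compile("_([0-9a-z]+)(?:_(.+))?\\.([a-z0-9]+)");

    /** Returns {gen, suffix or "", ext}, or null when the name is not a per-segment file. */
    static String[] parse(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            return null;
        }
        String suffix = m.group(2) == null ? "" : m.group(2);
        return new String[] {m.group(1), suffix, m.group(3)};
    }

    public static void main(String[] args) {
        for (String name : new String[] {"_0.fnm", "_0_Lucene84_0.pos", "segments_1", "write.lock"}) {
            System.out.println(name + " -> " + Arrays.toString(parse(name)));
        }
    }
}
```

`segments_1` and `write.lock` do not start with an underscore, so the sketch reports them as not belonging to any single segment.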

  10. Index File Formats

    Format Name | Extension | Related Class | Description
    Segment File | segments_N | SegmentInfos | Holds the commit point
    Lock File | write.lock | N/A | Lock file for thread safety
    Segment Info | .si | SegmentInfoFormat | Segment metadata
    Field Infos | .fnm | FieldInfosFormat | Field information (field info) such as field names
    Term Metadata | .tmd | PostingsFormat | Term statistics and pointers into the FST
    Term Index | .tip | PostingsFormat | Pointers into the term dictionary (FST binary)
    Term Dictionary | .tim | PostingsFormat | Terms, their statistics, and pointers to postings lists
    Frequencies and Skip Data | .doc | PostingsFormat | Per-term postings lists and term frequencies
    Positions | .pos | PostingsFormat | Positions of term occurrences within documents
    Payloads and Offsets | .pay | PostingsFormat | Per-position metadata (character offsets, etc.)
    Field Data | .fdt | StoredFieldsFormat | Field data
    Field Index | .fdx | StoredFieldsFormat | Pointers to field data
    Field Meta | .fdm | StoredFieldsFormat | Metadata of field data
    Norms Data | .nvd | Lucene90NormsFormat | Norms data of documents and fields
    Norms Metadata | .nvm | Lucene90NormsFormat | Norms metadata of documents and fields
    Live Documents | .liv | Lucene50LiveDocsFormat | Information on documents that have not been deleted (live)

  11. Overview of the Lucene Index

  12. Lucene Architecture
    How Lucene uses index files
    • Segments are written as files
    • At query time they are read back from those files
    • In general the files are accessed through mmap (taking advantage of the OS file system cache)
    • Loading the entire index into main memory would exhaust memory no matter how much is available

    http://opensearchlab.otago.ac.nz/paper_….pdf
    Białecki, Andrzej, et al. "Apache lucene 4." SIGIR 2012 workshop on open source information retrieval. 2012.
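
What "reading a file through mmap" looks like at the JDK level can be shown with `FileChannel.map`. This is a generic sketch with a hypothetical class name (`MmapReadSketch`) and a throwaway temp file; it does not show how Lucene's own Directory implementations are written.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical example: map a file read-only and read it through the OS page cache.
class MmapReadSketch {
    /** Maps the whole file and returns its bytes as lowercase hex. */
    static String hexDump(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            StringBuilder hex = new StringBuilder();
            while (buffer.hasRemaining()) {
                hex.append(String.format("%02x", buffer.get() & 0xFF));
            }
            return hex.toString();
        }
    }

    /** Writes four bytes to a temp file and reads them back via the mapping. */
    static String demo() {
        try {
            Path tmp = Files.createTempFile("mmap-sketch", ".bin");
            tmp.toFile().deleteOnExit();
            Files.write(tmp, new byte[] {0x3f, (byte) 0xd7, 0x6c, 0x17});
            return hexDump(tmp);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The mapped buffer is backed by the file system cache rather than by a heap copy, which is the property the slide relies on.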

  13. Index format abstraction
    Codec API
    • The Codec API was introduced in Lucene 4
    • Codec = Index File Formats
      • For each index format, it defines how data is turned into binary (the encoding)
    • It abstracts reading and writing of index files
      • Format implementations became easier to swap
      • Special Codecs such as SimpleTextCodec can be specified
    • A Codec is implemented by extending the abstract class Codec (see the excerpt below)
      • An implementation is specified for each index format
    • A Codec can be set on the IndexWriterConfig used for writing

    Excerpt from the org.apache.lucene.codecs.Codec class:

    /** Encodes/decodes postings */
    public abstract PostingsFormat postingsFormat();

    /** Encodes/decodes docvalues */
    public abstract DocValuesFormat docValuesFormat();

    /** Encodes/decodes stored fields */
    public abstract StoredFieldsFormat storedFieldsFormat();

    /** Encodes/decodes term vectors */
    public abstract TermVectorsFormat termVectorsFormat();

    /** Encodes/decodes field infos file */
    public abstract FieldInfosFormat fieldInfosFormat();

    /** Encodes/decodes segment info file */
    public abstract SegmentInfoFormat segmentInfoFormat();

    /** Encodes/decodes document normalization values */
    public abstract NormsFormat normsFormat();

    /** Encodes/decodes live docs */
    public abstract LiveDocsFormat liveDocsFormat();

    /** Encodes/decodes compound files */
    public abstract CompoundFormat compoundFormat();

    /** Encodes/decodes points index */
    public abstract PointsFormat pointsFormat();

    /** Encodes/decodes numeric vector fields */
    public abstract KnnVectorsFormat knnVectorsFormat();

  14. Index format compatibility
    Codec API
    • Index formats can now be versioned
      • Formats of different versions can be mixed
      • e.g. using the LiveDocs format from an older Lucene version
    • Which format versions are tied together is defined in the LuceneXXCodec classes
      • e.g. Lucene87Codec, Lucene90Codec
      • They extend the abstract class Codec
    • Codec-related implementations mostly live under the org.apache.lucene.codecs package
      • The packages are split by version
      • The lucene90 package holds the Codecs newly defined in Lucene 9.0
      • Backward-compatible Codecs are in org.apache.lucene.backward_codecs

    Excerpt from org.apache.lucene.backward_codecs.lucene87.Lucene87Codec:

    public class Lucene87Codec extends Codec {
      …
      private final TermVectorsFormat vectorsFormat = new Lucene50TermVectorsFormat();
      private final FieldInfosFormat fieldInfosFormat = new Lucene60FieldInfosFormat();
      private final SegmentInfoFormat segmentInfosFormat = new Lucene86SegmentInfoFormat();
      private final LiveDocsFormat liveDocsFormat = new Lucene50LiveDocsFormat();
      private final CompoundFormat compoundFormat = new Lucene50CompoundFormat();
      private final PointsFormat pointsFormat = new Lucene86PointsFormat();

      @Override
      public TermVectorsFormat termVectorsFormat() { return vectorsFormat; }

      @Override
      public final FieldInfosFormat fieldInfosFormat() { return fieldInfosFormat; }

      @Override
      public SegmentInfoFormat segmentInfoFormat() { return segmentInfosFormat; }
      …
    }

  15. Analyzing a Lucene Index

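  16. Analyzing Lucene Index Binary
    Indexing settings
    • Build a minimal index and analyze it
      • Code used for the verification: https://github.com/takuyaa/hello-lucene
      • Based on a lucene-core SNAPSHOT build at the time
    • Indexed documents (each identified by a DocID)
      • "lucene in action"
      • "lucene cookbook"

    The resulting inverted index maps each term (action, cookbook, in, lucene) to its postings list.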

  17. Analyzing Lucene Index Binary
    Index files

  18. Binary Format: Basic Types
    How Lucene writes data as binary
    • The binary format used when reading and writing a Lucene index
    • Low-level reads and writes of the basic types are collected in the methods of the following classes
      • DataOutput: writing
      • DataInput: reading
      • These are abstract classes; in practice FSDirectory.FSIndexOutput / FSDirectory.FSIndexInput are used
    • Every write ultimately comes down to DataOutput's writeByte() and writeBytes() methods
      • All data is written as a byte or a byte sequence
    • For some types (BEInt, BELong) the methods live in CodecUtil
      • The actual writing is still delegated to the DataOutput and DataInput methods

  19. Binary Format: Basic Types
    Integer types
    • Short: DataOutput#writeShort(short i)
      • Writes a 2-byte integer in little-endian order
    • Int: DataOutput#writeInt(int i)
      • Writes a 4-byte integer in little-endian order
    • Long: DataOutput#writeLong(long i)
      • Writes an 8-byte integer in little-endian order
    • BEInt: CodecUtil#writeBEInt(DataOutput out, int i)
      • Writes a 4-byte integer in big-endian order
    • BELong: CodecUtil#writeBELong(DataOutput out, long i)
      • Writes an 8-byte integer in big-endian order

    public static void writeBEInt(DataOutput out, int i) throws IOException {
      out.writeByte((byte) (i >> 24));
      out.writeByte((byte) (i >> 16));
      out.writeByte((byte) (i >> 8));
      out.writeByte((byte) i);
    }

    public void writeInt(int i) throws IOException {
      writeByte((byte) i);
      writeByte((byte) (i >> 8));
      writeByte((byte) (i >> 16));
      writeByte((byte) (i >> 24));
    }
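
The difference between the two byte orders is easiest to see on a concrete value. The sketch below reproduces the shift patterns of writeInt and writeBEInt on plain byte arrays; `EndianSketch` and its method names are hypothetical, and the sample value 0x3fd76c17 is simply the magic number that appears later in the deck.

```java
// Hypothetical demo of the two byte orders, written against plain byte arrays.
class EndianSketch {
    /** Least significant byte first, as in the writeInt excerpt. */
    static byte[] littleEndian(int i) {
        return new byte[] {(byte) i, (byte) (i >> 8), (byte) (i >> 16), (byte) (i >> 24)};
    }

    /** Most significant byte first, as in the writeBEInt excerpt. */
    static byte[] bigEndian(int i) {
        return new byte[] {(byte) (i >> 24), (byte) (i >> 16), (byte) (i >> 8), (byte) i};
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int value = 0x3fd76c17;
        System.out.println("big-endian:    " + hex(bigEndian(value)));    // 3f d7 6c 17
        System.out.println("little-endian: " + hex(littleEndian(value))); // 17 6c d7 3f
    }
}
```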

  20. Binary Format: Basic Types
    VInt type
    • DataOutput#writeVInt(int i)
    • Writes a 4-byte integer (int) in a variable-length format
    • The most significant bit (MSB) of each byte is used as a flag
      • If the MSB is 1, read the next byte; if it is 0, reading ends with that byte
    • 0 to 127 are represented as-is in 1 byte
      • 0: 0000 0000 (00)
      • 1: 0000 0001 (01)
      • 127: 0111 1111 (7f)
    • 128 to 16383 are represented in 2 bytes (little-endian)
      • 128: 1000 0000 0000 0001 (80 01)
      • 129: 1000 0001 0000 0001 (81 01)
      • 16383: 1111 1111 0111 1111 (ff 7f)
    • 16384 and above are represented in 3 or more bytes (little-endian)
      • 16384: 1000 0000 1000 0000 0000 0001 (80 80 01)

    public final void writeVInt(int i) throws IOException {
      while ((i & ~0x7F) != 0) {
        writeByte((byte) ((i & 0x7F) | 0x80));
        i >>>= 7;
      }
      writeByte((byte) i);
    }

    • The original 4-byte integer is processed 7 bits at a time
      1. Take the bitwise AND with 0111 1111 (7f), keeping only the low 7 bits
      2. Take the bitwise OR with 1000 0000 (80), setting the MSB to 1
      3. Write the resulting byte with writeByte()
      4. Unsigned right shift by 7 bits (move on to the next 7 bits)
      5. Repeat until there is nothing left to write
    • The slides by @moco_beta (https://speakerdeck.com/mocobeta/…) explain this clearly
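
The byte sequences listed on this slide can be checked with a standalone encoder and the matching decoder. The encoding loop follows the writeVInt excerpt; the class `VIntSketch`, the use of `ByteArrayOutputStream`, and the decoder are illustrative additions rather than Lucene code.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical standalone VInt codec mirroring the writeVInt loop.
class VIntSketch {
    static byte[] encode(int i) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80); // low 7 bits, MSB set: more bytes follow
            i >>>= 7;
        }
        out.write(i); // last byte, MSB clear
        return out.toByteArray();
    }

    /** Reads one VInt from the start of the array. */
    static int decode(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift; // groups arrive least significant first
            if ((b & 0x80) == 0) {
                break;
            }
            shift += 7;
        }
        return value;
    }

    public static void main(String[] args) {
        for (int n : new int[] {0, 127, 128, 16383, 16384}) {
            StringBuilder hex = new StringBuilder();
            for (byte b : encode(n)) {
                hex.append(String.format("%02x ", b & 0xFF));
            }
            System.out.println(n + ": " + hex.toString().trim());
        }
    }
}
```

Because the shift is unsigned, a negative int is not rejected; it simply takes the maximum of 5 bytes.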

  21. Binary Format: Basic Types
    VLong type
    • DataOutput#writeVLong(long i)
    • Same as VInt except that it takes a long as its argument

    public final void writeVLong(long i) throws IOException {
      if (i < 0) {
        throw new IllegalArgumentException("cannot write negative vLong (got: " + i + ")");
      }
      writeSignedVLong(i);
    }

    private void writeSignedVLong(long i) throws IOException {
      while ((i & ~0x7FL) != 0L) {
        writeByte((byte) ((i & 0x7FL) | 0x80L));
        i >>>= 7;
      }
      writeByte((byte) i);
    }

  22. Binary Format: Basic Types
    String type
    • DataOutput#writeString(String s)
    • The length is written first, as a VInt
    • The UTF-8 encoded string is then written with writeBytes()

    public void writeString(String s) throws IOException {
      final BytesRef utf8Result = new BytesRef(s);
      writeVInt(utf8Result.length);
      writeBytes(utf8Result.bytes, utf8Result.offset, utf8Result.length);
    }
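
A length-prefixed string in this layout can be produced with the standard library as follows. The sketch assumes that `String.getBytes(StandardCharsets.UTF_8)` is an acceptable stand-in for `BytesRef` on well-formed strings; `WriteStringSketch` is a hypothetical name.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical standalone version of the String layout: VInt byte length, then UTF-8 bytes.
class WriteStringSketch {
    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    static byte[] encode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, utf8.length);     // the prefix counts bytes, not chars
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] encoded = encode("Lucene90PostingsWriterPos");
        System.out.printf("total=%d, length byte=0x%02x%n", encoded.length, encoded[0] & 0xFF);
    }
}
```

For the 25-character ASCII codec name used later in the deck, the prefix is the single byte 0x19, so the encoded form is 26 bytes long.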

  23. Binary Format: Basic Types
    Map<String, String>
    • DataOutput#writeMapOfStrings(Map<String, String> map)
    • The number of entries is written first with writeVInt()
    • Each entry's key and value are then written with writeString() (repeated once per entry)

    public void writeMapOfStrings(Map<String, String> map) throws IOException {
      writeVInt(map.size());
      for (Map.Entry<String, String> entry : map.entrySet()) {
        writeString(entry.getKey());
        writeString(entry.getValue());
      }
    }
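
Combining the VInt and String layouts gives the map layout. The following standalone sketch (hypothetical class `MapOfStringsSketch`, standard library only) encodes a map into a byte array so the resulting bytes can be inspected.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical standalone version of the map layout: VInt entry count, then key/value strings.
class MapOfStringsSketch {
    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeVInt(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    static byte[] encode(Map<String, String> map) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, map.size());
        for (Map.Entry<String, String> entry : map.entrySet()) {
            writeString(out, entry.getKey());
            writeString(out, entry.getValue());
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        Map<String, String> attributes = new LinkedHashMap<>(); // insertion order keeps output stable
        attributes.put("a", "b");
        StringBuilder hex = new StringBuilder();
        for (byte b : encode(attributes)) {
            hex.append(String.format("%02x ", b & 0xFF));
        }
        System.out.println(hex.toString().trim()); // 01 01 61 01 62
    }
}
```

The set layout on the next slide is the same idea with one string per element instead of a key/value pair.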

  24. Binary Format: Basic Types
    Set<String>
    • DataOutput#writeSetOfStrings(Set<String> set)
    • The number of elements is written first with writeVInt()
    • Each element's value is then written with writeString() (repeated once per element)

    public void writeSetOfStrings(Set<String> set) throws IOException {
      writeVInt(set.size());
      for (String value : set) {
        writeString(value);
      }
    }

  25. Binary Format: Index Header
    IndexHeader
    • Every Lucene index file starts with a header written in a common format
    • The header of an index file (IndexHeader) consists of CodecHeader, ObjectID, and ObjectSuffix
    • CodecUtil#writeIndexHeader()
      • CodecUtil#writeHeader() writes the CodecHeader
        • Writes the magic number CODEC_MAGIC (= 0x3fd76c17) as a BEInt
        • Writes the codec name codec as a String
        • Writes the file version version as a BEInt
      • DataOutput#writeBytes() writes the ObjectID
      • Writes the length of SuffixBytes as a fixed-length single byte, then writes SuffixBytes

    public static void writeIndexHeader(
        DataOutput out, String codec, int version, byte[] id, String suffix) throws IOException {
      if (id.length != StringHelper.ID_LENGTH) {
        throw new IllegalArgumentException("Invalid id: " + StringHelper.idToString(id));
      }
      writeHeader(out, codec, version);
      out.writeBytes(id, 0, id.length);
      BytesRef suffixBytes = new BytesRef(suffix);
      if (suffixBytes.length != suffix.length() || suffixBytes.length >= 256) {
        throw new IllegalArgumentException(
            "suffix must be simple ASCII, less than 256 characters in length [got " + suffix + "]");
      }
      out.writeByte((byte) suffixBytes.length);
      out.writeBytes(suffixBytes.bytes, suffixBytes.offset, suffixBytes.length);
    }

    public static void writeHeader(DataOutput out, String codec, int version) throws IOException {
      BytesRef bytes = new BytesRef(codec);
      if (bytes.length != codec.length() || bytes.length >= 128) {
        throw new IllegalArgumentException(
            "codec must be simple ASCII, less than 128 characters in length [got " + codec + "]");
      }
      writeBEInt(out, CODEC_MAGIC);
      out.writeString(codec);
      writeBEInt(out, version);
    }

  26. Binary Format: Index Header
    IndexHeader
    • A look at the header of the _0_Lucene90_0.pos file
      • From address 0x00 to address 0x3c
    • Expressed as YAML, the header looks like this:

    IndexHeader:
      CodecHeader:
        Magic: 1071082519 # BEInt (3f d7 6c 17): magic number
        CodecName: "Lucene90PostingsWriterPos" # String (length: 0x19 = 25): codec name
        Version: 0 # BEInt (00 00 00 00): file version
      ObjectID: [] # Bytes^16
      ObjectSuffix: # the "_Lucene90_0" part that follows "_0" in the file name
        SuffixLength: 10 # Byte (0a)
        SuffixBytes: "Lucene90_0" # Bytes^SuffixLength
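
The address range quoted above (0x00 to 0x3c, i.e. 61 bytes) can be reproduced by laying out the same fields by hand. The sketch below is a standalone approximation of the header layout, not the CodecUtil code: `IndexHeaderSketch` is a hypothetical name, the ObjectID is filled with zeros as a placeholder, and the codec-name length is written as a single byte, which matches a VInt only because the name must be shorter than 128 bytes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical standalone layout of IndexHeader = CodecHeader + ObjectID + ObjectSuffix.
class IndexHeaderSketch {
    static final int CODEC_MAGIC = 0x3fd76c17;
    static final int ID_LENGTH = 16;

    static void writeBEInt(ByteArrayOutputStream out, int i) {
        out.write(i >> 24);
        out.write(i >> 16);
        out.write(i >> 8);
        out.write(i);
    }

    static byte[] header(String codec, int version, byte[] id, String suffix) {
        byte[] codecBytes = codec.getBytes(StandardCharsets.US_ASCII);
        byte[] suffixBytes = suffix.getBytes(StandardCharsets.US_ASCII);
        if (codecBytes.length >= 128 || suffixBytes.length >= 256 || id.length != ID_LENGTH) {
            throw new IllegalArgumentException("codec, suffix, or id has an unsupported length");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeBEInt(out, CODEC_MAGIC);                // Magic
        out.write(codecBytes.length);                // String length (one-byte VInt since < 128)
        out.write(codecBytes, 0, codecBytes.length); // CodecName
        writeBEInt(out, version);                    // Version
        out.write(id, 0, id.length);                 // ObjectID
        out.write(suffixBytes.length);               // SuffixLength
        out.write(suffixBytes, 0, suffixBytes.length); // SuffixBytes
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] h = header("Lucene90PostingsWriterPos", 0, new byte[ID_LENGTH], "Lucene90_0");
        System.out.printf("header length = %d bytes (0x00 to 0x%02x)%n", h.length, h.length - 1);
    }
}
```

The field sizes add up as 4 + 1 + 25 + 4 + 16 + 1 + 10 = 61 bytes.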

  27. Binary Format: Footer
    CodecFooter
    • Every index file ends with a footer (CodecFooter) written in a common format
    • CodecUtil#writeFooter()
      • Writes the magic number FOOTER_MAGIC (= ~0x3fd76c17) as a BEInt
      • Writes 0 as a BEInt (the fixed AlgorithmID)
      • Writes the CRC32 checksum as a BELong (see the java.util.zip.CRC32 class for details)

    public static void writeFooter(IndexOutput out) throws IOException {
      writeBEInt(out, FOOTER_MAGIC);
      writeBEInt(out, 0);
      writeCRC(out);
    }

  28. Binary Format: Footer
    CodecFooter
    • A look at the footer of the _0_Lucene90_0.pos file
      • From address 0x42 to address 0x51
    • Expressed as YAML, it looks like this:

    CodecFooter:
      Magic: 3223884776 # BEInt (c0 28 93 e8)
      AlgorithmID: 0 # BEInt (00 00 00 00)
      Checksum: 1437576107 # BELong (00 00 00 00 55 af ab ab)
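
The 16-byte footer (4 + 4 + 8 bytes, matching the address range 0x42 to 0x51) can be sketched with `java.util.zip.CRC32`. This is a standalone approximation under one stated assumption: the checksum is computed over every byte that precedes it, including the footer magic and the AlgorithmID. `CodecFooterSketch` is a hypothetical name.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.CRC32;

// Hypothetical standalone layout of CodecFooter = magic + algorithm id + CRC32 checksum.
class CodecFooterSketch {
    static final int FOOTER_MAGIC = ~0x3fd76c17; // 0xc02893e8

    static void writeBEInt(ByteArrayOutputStream out, int i) {
        out.write(i >> 24);
        out.write(i >> 16);
        out.write(i >> 8);
        out.write(i);
    }

    static void writeBELong(ByteArrayOutputStream out, long v) {
        for (int shift = 56; shift >= 0; shift -= 8) {
            out.write((int) (v >>> shift));
        }
    }

    /** Returns body followed by a 16-byte footer. */
    static byte[] appendFooter(byte[] body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(body, 0, body.length);
        writeBEInt(out, FOOTER_MAGIC);
        writeBEInt(out, 0); // AlgorithmID
        CRC32 crc = new CRC32();
        byte[] soFar = out.toByteArray();
        crc.update(soFar, 0, soFar.length); // assumption: covers everything written so far
        writeBELong(out, crc.getValue());   // CRC32 is 32 bits, so the upper 4 bytes are zero
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] file = appendFooter(new byte[] {1, 2, 3});
        System.out.println("footer size = " + (file.length - 3) + " bytes");
    }
}
```

Since a CRC32 value fits in 32 bits, the first four bytes of the BELong are always zero, as in the `00 00 00 00 55 af ab ab` bytes shown above.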

  29. Analyzing Index File Formats
    Other file formats
    • The analysis results for the files of each format are collected in the following blog post (work in progress)
      https://stop-the-world.hatenablog.com/draft/entry/NXbtOo1V1X261E94FKRPr5eNXmM

  30. License
    • The quoted Lucene source code is licensed under the Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0).