
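The implementation of the SegmentInfos class was outlined in Lucene-2.2.0 Source Code Reading and Study (17); the end of that article contains a short but very useful summary.

But what does SegmentInfos actually do, and what output can we observe from it? A good way in is to follow a few concrete values through the class's key methods, such as segmentName, version and gen. They all hold real values, so how are those values written out, and where do they end up? That is what this article looks at.

First, some background:

In the earlier article Lucene-2.2.0 Source Code Reading and Study (4), we built a small example that indexed some txt files in a given directory and then ran a simple keyword search as a test.

That article did not go into the internal mechanics; it only served as a lead-in to studying tokenization. We will not explain how Document and Field are built here either, since that topic is complex enough to deserve its own study. What matters now is what the indexing process produces, that is, which files it creates. Knowing which files are created during indexing helps us trace where the index-maintenance information held by SegmentInfos ends up.

In Lucene-2.2.0 Source Code Reading and Study (4), running the test program generated many files in the index directory (E:\Lucene\myindex in the test program). There were many because a fair number of txt files were indexed; with only one or two small txt files, very few index files are produced, possibly only 3 or 4. As shown in the figure:

The segments.gen and segments_f files in the figure should look familiar by now. Let's see how they are generated and what information they hold.

Generating the segments_N and segments.gen files

1. First, the segments_N file:

Before a file can be written to a directory on disk, an output stream generally has to be created first. The SegmentInfos class has a member method write(), which contains: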

IndexOutput output = directory.createOutput(segmentFileName);

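This creates an output stream, output, in the index directory directory for the file named segmentFileName.

The name segmentFileName is not hard-coded. It is derived from the generation of the index currently in the directory, and the first line of write() obtains it: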

String segmentFileName = getNextSegmentFileName();

getNextSegmentFileName() is a member method of SegmentInfos, and understanding it helps with the rest of the trace:

public String getNextSegmentFileName() {
  long nextGeneration;

  if (generation == -1) {
    nextGeneration = 1;
  } else {
    nextGeneration = generation+1;
  }
  return IndexFileNames.fileNameFromGeneration(IndexFileNames.SEGMENTS,"",nextGeneration);
}

This method returns the name of the segments file that is about to be written. The final return statement calls fileNameFromGeneration() in the IndexFileNames class. That method matters too, because the names of the generation-stamped files that maintain the index in the directory are all built through it.

Now look at the implementation of IndexFileNames:

package org.apache.lucene.index;

// This class manages the naming of index files
final class IndexFileNames {

  /** Name of the index segments file */
  static final String SEGMENTS = "segments";

  /** Name of the generation reference file */
  static final String SEGMENTS_GEN = "segments.gen";

  /** Name of the index deletable file (only used in pre-lockless indices) */
  static final String DELETABLE = "deletable";

  /** Extension of the norms file */
  static final String NORMS_EXTENSION = "nrm";

  /** Extension of the compound file */
  static final String COMPOUND_FILE_EXTENSION = "cfs";

  /** Extension of deletes files */
  static final String DELETES_EXTENSION = "del";

  /** Extension of plain norms */
  static final String PLAIN_NORMS_EXTENSION = "f";

  /** Extension of separate norms */
  static final String SEPARATE_NORMS_EXTENSION = "s";

  /**
   * List of the file extensions of all Lucene index files
   */
  static final String INDEX_EXTENSIONS[] = new String[] {
      "cfs", "fnm", "fdx", "fdt", "tii", "tis", "frq", "prx", "del",
      "tvx", "tvd", "tvf", "gen", "nrm"
  };

  /** File extensions that are added to a compound index file */
  static final String[] INDEX_EXTENSIONS_IN_COMPOUND_FILE = new String[] {
      "fnm", "fdx", "fdt", "tii", "tis", "frq", "prx",
      "tvx", "tvd", "tvf", "nrm"
  };

  /** File extensions of old-style index files */
  static final String COMPOUND_EXTENSIONS[] = new String[] {
    "fnm", "frq", "prx", "fdx", "fdt", "tii", "tis"
  };

  /** File extensions for term vector support */
  static final String VECTOR_EXTENSIONS[] = new String[] {
    "tvx", "tvd", "tvf"
  };

  /**
   * Computes the full file name from the base name (the part without the
   * suffix, e.g. "segments" in segments.gen), the extension and the generation.
   */
  static final String fileNameFromGeneration(String base, String extension, long gen) {
    if (gen == SegmentInfo.NO) {
      return null;
    } else if (gen == SegmentInfo.WITHOUT_GEN) {
      return base + extension;
    } else {
      return base + "_" + Long.toString(gen, Character.MAX_RADIX) + extension;
    }
  }
}

fileNameFromGeneration builds a file name from the base passed in (for example segments), the extension and gen, and returns it.

In SegmentInfos, getNextSegmentFileName() calls fileNameFromGeneration like this:

return IndexFileNames.fileNameFromGeneration(IndexFileNames.SEGMENTS,"",nextGeneration);

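The first argument is "segments", the second is "", and the third is a gen (a long). If nextGeneration is 5, for instance, fileNameFromGeneration() returns the segments file name segments_5. Because gen is rendered in base 36 (Character.MAX_RADIX), generation 15 gives segments_f, the file seen in the index directory earlier.

With the segments_N file name in hand, an output stream can be created and the required information written to that file.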
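To make the naming rule concrete, here is a minimal standalone sketch that copies the logic of getNextSegmentFileName() and fileNameFromGeneration() into a small class. The class name SegmentFileNameDemo and the helper nextSegmentFileName are made up for illustration, and the constants NO = -1 and WITHOUT_GEN = 0 are assumed to match the values in SegmentInfo.

```java
public class SegmentFileNameDemo {
    // Assumed to mirror SegmentInfo.NO and SegmentInfo.WITHOUT_GEN.
    static final long NO = -1;
    static final long WITHOUT_GEN = 0;

    // Same branching as IndexFileNames.fileNameFromGeneration().
    static String fileNameFromGeneration(String base, String extension, long gen) {
        if (gen == NO) {
            return null;
        } else if (gen == WITHOUT_GEN) {
            return base + extension;
        } else {
            // Character.MAX_RADIX is 36, so digits run 0-9 and then a-z.
            return base + "_" + Long.toString(gen, Character.MAX_RADIX) + extension;
        }
    }

    // Same rule as SegmentInfos.getNextSegmentFileName(), with the
    // current generation passed in instead of read from a field.
    static String nextSegmentFileName(long generation) {
        long nextGeneration = (generation == -1) ? 1 : generation + 1;
        return fileNameFromGeneration("segments", "", nextGeneration);
    }

    public static void main(String[] args) {
        System.out.println(nextSegmentFileName(-1)); // segments_1
        System.out.println(nextSegmentFileName(4));  // segments_5
        System.out.println(nextSegmentFileName(14)); // segments_f
        System.out.println(nextSegmentFileName(35)); // segments_10
    }
}
```

The third call shows where a name like segments_f comes from: the next generation is 15, and 15 in base 36 is f.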

2. Next, the segments.gen file:

Looking closely, the write method of SegmentInfos is precisely what writes both the segments_N file and the segments.gen file.

Right after writing segments_N, it handles segments.gen:

output = directory.createOutput(IndexFileNames.SEGMENTS_GEN);

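An index directory holds a single segments.gen file, which records the current generation, so its name never changes and there is no need to compute it. The constant IndexFileNames.SEGMENTS_GEN ("segments.gen") is passed directly to create the output stream, and the data is written into the index directory.

What the segments_N and segments.gen files store

The same write method of SegmentInfos shows what goes into each of the two files.

■ The segments_N file

As shown below: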

      output.writeInt(CURRENT_FORMAT); // write FORMAT
      output.writeLong(++version); // every write changes the index
      output.writeInt(counter); // write counter
      output.writeInt(size()); // write infos
      for (int i = 0; i < size(); i++) {
        info(i).write(output);
      }

(1) CURRENT_FORMAT

CURRENT_FORMAT is a member of the SegmentInfos class:

/* This must always point to the most recent file format. */
private static final int CURRENT_FORMAT = FORMAT_SINGLE_NORM_FILE;

The value of CURRENT_FORMAT above is that of FORMAT_SINGLE_NORM_FILE, namely -3:

/** This format adds a "hasSingleNormFile" flag into each segment info.
   * See <a href="http://issues.apache.org/jira/browse/LUCENE-756">LUCENE-756</a> for details.
   */
public static final int FORMAT_SINGLE_NORM_FILE = -3;

(2) version

version is a member of SegmentInfos. Its initial value is taken from the system clock, and each write() increments it (the ++version above):

/**
   * counts how often the index has been changed by adding or deleting docs.
   * starting with the current time in milliseconds forces to create unique version numbers.
   */
private long version = System.currentTimeMillis();

(3) counter

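counter is used to name new segments, that is, the per-segment names such as _0, _1 and so on that prefix each segment's files. It is not the N in segments_N; as shown earlier, that N comes from generation.

counter is also a member of SegmentInfos, initialized to 0: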

public int counter = 0;

In read(), the format value is read from the existing segments_N file in the index directory, and counter is then assigned depending on format:

      int format = input.readInt();
      if(format < 0){     // file contains explicit format info
        // check that it is a format we can understand
        if (format < CURRENT_FORMAT)
          throw new CorruptIndexException("Unknown format version: " + format);
        version = input.readLong(); // read version
        counter = input.readInt(); // read counter
      }
      else{     // file is in old format without explicit format info
        counter = format;
      }
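The write order of the segments_N header and the format check in read() can be tried out together in a small round trip. The sketch below is only an approximation: java.io.DataOutputStream and DataInputStream stand in for Lucene's IndexOutput and IndexInput, the class SegmentsHeaderDemo and its method names are invented for this example, and an IllegalStateException replaces CorruptIndexException so that the code has no Lucene dependency.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class SegmentsHeaderDemo {
    // -3 is the value of FORMAT_SINGLE_NORM_FILE quoted in the article.
    static final int CURRENT_FORMAT = -3;

    // Writes the header fields in the same order as SegmentInfos.write():
    // format (int), version (long), counter (int), segment count (int).
    static byte[] writeHeader(long version, int counter, int segmentCount) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(CURRENT_FORMAT);
            out.writeLong(version);
            out.writeInt(counter);
            out.writeInt(segmentCount);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Mirrors the branch in read(); returns {format, version, counter}.
    static long[] readHeader(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int format = in.readInt();
            long version = 0;
            int counter;
            if (format < 0) {
                // Explicit format info: reject formats newer than we know.
                if (format < CURRENT_FORMAT) {
                    throw new IllegalStateException("Unknown format version: " + format);
                }
                version = in.readLong();
                counter = in.readInt();
            } else {
                // Old format: the first int is the counter itself.
                counter = format;
            }
            return new long[] { format, version, counter };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] header = writeHeader(System.currentTimeMillis(), 7, 2);
        long[] fields = readHeader(header);
        System.out.println("bytes=" + header.length + " format=" + fields[0]
            + " version=" + fields[1] + " counter=" + fields[2]);
    }
}
```

Negative format numbers are what let the reader tell the two layouts apart: a non-negative first int can only be an old-style counter, while a negative one announces an explicit format.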

(4) size()

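size() is the number of elements in SegmentInfos. A SegmentInfos holds multiple SegmentInfo objects; note that the SegmentInfos class extends Vector.

(5) info(i)

info() is defined as follows:

public final SegmentInfo info(int i) {
  return (SegmentInfo) elementAt(i);
}

So SegmentInfos is a container of SegmentInfo objects: it holds the SegmentInfo entries of the current index directory so they can be managed and maintained together.

Here, info(i).write(output); in turn calls the write() method of the SegmentInfo class to append information to the output stream output. SegmentInfo.write() looks like this: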

/**
   * Save this segment's info.
   */
void write(IndexOutput output)
    throws IOException {
  output.writeString(name);
  output.writeInt(docCount);
  output.writeLong(delGen);
  output.writeByte((byte) (hasSingleNormFile ? 1:0));
  if (normGen == null) {
    output.writeInt(NO);
  } else {
    output.writeInt(normGen.length);
    for(int j = 0; j < normGen.length; j++) {
      output.writeLong(normGen[j]);
    }
  }
  output.writeByte(isCompoundFile);
}

As this shows, the details of each SegmentInfo are written as well: name, docCount, delGen, (byte)(hasSingleNormFile ? 1:0), either NO or normGen.length followed by each normGen[j], and isCompoundFile.
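The field order of one SegmentInfo record can be sketched as a write followed by a read-back. This is an illustration under stated assumptions, not Lucene's actual byte layout: writeUTF/readUTF from java.io stand in for IndexOutput.writeString, whose string encoding is different, NO = -1 is assumed, and the class SegmentInfoRecordDemo, its methods and the sample values are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class SegmentInfoRecordDemo {
    static final int NO = -1; // assumed to mirror SegmentInfo.NO

    // Writes the fields in the same order as SegmentInfo.write().
    static byte[] write(String name, int docCount, long delGen,
                        boolean hasSingleNormFile, long[] normGen, byte isCompoundFile) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(name); // stand-in for writeString(name)
            out.writeInt(docCount);
            out.writeLong(delGen);
            out.writeByte(hasSingleNormFile ? 1 : 0);
            if (normGen == null) {
                out.writeInt(NO); // no per-field norm generations
            } else {
                out.writeInt(normGen.length);
                for (long gen : normGen) {
                    out.writeLong(gen);
                }
            }
            out.writeByte(isCompoundFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the record back in the same order and formats it as text.
    static String describe(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String name = in.readUTF();
            int docCount = in.readInt();
            long delGen = in.readLong();
            boolean single = in.readByte() == 1;
            int numNormGen = in.readInt();
            long[] normGen = (numNormGen == NO) ? null : new long[numNormGen];
            for (int j = 0; normGen != null && j < normGen.length; j++) {
                normGen[j] = in.readLong();
            }
            byte compound = in.readByte();
            return name + " docs=" + docCount + " delGen=" + delGen
                + " singleNorm=" + single + " normGen=" + Arrays.toString(normGen)
                + " compound=" + compound;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(write("_0", 10, -1L, true, null, (byte) 1)));
        System.out.println(describe(write("_1", 3, 2L, false, new long[] {1L, 2L}, (byte) -1)));
    }
}
```

The point to notice is that the normGen count doubles as a presence flag: a reader that sees NO there knows that no normGen values follow.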

■ The segments.gen file

From the write() method of SegmentInfos:

        output.writeInt(FORMAT_LOCKLESS);
        output.writeLong(generation);
        output.writeLong(generation);

segments.gen contains only two kinds of information: FORMAT_LOCKLESS and generation, with generation written twice in a row.

segments.gen tracks exactly the N in segments_N, so the only data associated with it is generation, plus a format marker indicating lockless commits:

/** This format adds details used for lockless commits. It differs
   * slightly from the previous format in that file names
   * are never re-used (write once). Instead, each file is
   * written to the next generation. For example,
   * segments_1, segments_2, etc. This allows us to not use
   * a commit lock. See <a
   * href="http://lucene.apache.org/java/docs/fileformats.html">file
   * formats</a> for details.
   */
public static final int FORMAT_LOCKLESS = -2;
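The three writes that make up segments.gen can be reproduced in a few lines. In the sketch below, writeGen follows the write order shown above, while readGen is an assumption about how a reader could use the duplicated generation: it accepts the value only if the format marker matches and both copies agree. The class SegmentsGenDemo and both method names are made up, and java.io streams again stand in for Lucene's IndexOutput and IndexInput.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class SegmentsGenDemo {
    static final int FORMAT_LOCKLESS = -2; // value quoted in the article

    // Same three writes as in SegmentInfos.write(): format, then generation twice.
    static byte[] writeGen(long generation) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(FORMAT_LOCKLESS);
            out.writeLong(generation);
            out.writeLong(generation);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Hypothetical reader: returns the generation, or -1 if the content
    // is truncated, has another format, or the two copies disagree.
    static long readGen(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            if (in.readInt() != FORMAT_LOCKLESS) {
                return -1;
            }
            long gen0 = in.readLong();
            long gen1 = in.readLong();
            return (gen0 == gen1) ? gen0 : -1;
        } catch (IOException e) {
            return -1; // includes EOFException on a partially written file
        }
    }

    public static void main(String[] args) {
        byte[] gen = writeGen(15);
        System.out.println("bytes=" + gen.length + " generation=" + readGen(gen));
    }
}
```

Under this reading, the second copy of generation acts as a cheap consistency check: a file caught in the middle of a write would fail the comparison instead of yielding a wrong N.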

To summarize:

We now know what the segments_N and segments.gen files record.

The segments_N file is closely tied to the SegmentInfo class, which is what we will study next.

Lucene-2.2.0 Source Code Reading and Study (18)
