The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
Google
ABSTRACT
We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.
While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points.
The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.
In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.
1. INTRODUCTION
We have designed and implemented the Google File System (GFS) to meet the rapidly growing demands of Google's data processing needs. GFS shares many of the same goals as previous distributed file systems such as performance, scalability, reliability, and availability. However, its design has been driven by key observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system design assumptions. We have reexamined traditional choices and explored radically different points in the design space.
First, component failures are the norm rather than the exception. The file system consists of hundreds or even thousands of storage machines built from inexpensive commodity parts and is accessed by a comparable number of client machines. The quantity and quality of the components virtually guarantee that some are not functional at any given time and some will not recover from their current failures. We have seen problems caused by application bugs, operating system bugs, human errors, and the failures of disks, memory, connectors, networking, and power supplies. Therefore, constant monitoring, error detection, fault tolerance, and automatic recovery must be integral to the system.
Second, files are huge by traditional standards. Multi-GB files are common. Each file typically contains many application objects such as web documents. When we are regularly working with fast growing data sets of many TBs comprising billions of objects, it is unwieldy to manage billions of approximately KB-sized files even when the file system could support it. As a result, design assumptions and parameters such as I/O operation and block sizes have to be revisited.
Third, most files are mutated by appending new data rather than overwriting existing data. Random writes within a file are practically non-existent. Once written, the files are only read, and often only sequentially. A variety of data share these characteristics. Some may constitute large repositories that data analysis programs scan through. Some may be data streams continuously generated by running applications. Some may be archival data. Some may be intermediate results produced on one machine and processed on another, whether simultaneously or later in time. Given this access pattern on huge files, appending becomes the focus of performance optimization and atomicity guarantees, while caching data blocks in the client loses its appeal.
Fourth, co-designing the applications and the file system API benefits the overall system by increasing our flexibility. For example, we have relaxed GFS's consistency model to vastly simplify the file system without imposing an onerous burden on the applications. We have also introduced an atomic append operation so that multiple clients can append concurrently to a file without extra synchronization between them. These will be discussed in more detail later in the paper.
Multiple GFS clusters are currently deployed for different purposes. The largest ones have over 1000 storage nodes, over 300 TB of disk storage, and are heavily accessed by hundreds of clients on distinct machines on a continuous basis.
2. DESIGN OVERVIEW
2.1 Assumptions
In designing a file system for our needs, we have been guided by assumptions that offer both challenges and opportunities. We alluded to some key observations earlier and now lay out our assumptions in more detail.
• The system is built from many inexpensive commodity components that often fail. It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis.
• The system stores a modest number of large files. We expect a few million files, each typically 100 MB or larger in size. Multi-GB files are the common case and should be managed efficiently. Small files must be supported, but we need not optimize for them.
• The workloads primarily consist of two kinds of reads: large streaming reads and small random reads. In large streaming reads, individual operations typically read hundreds of KBs, more commonly 1 MB or more. Successive operations from the same client often read through a contiguous region of a file. A small random read typically reads a few KBs at some arbitrary offset. Performance-conscious applications often batch and sort their small reads to advance steadily through the file rather than go back and forth (a small sketch of this pattern follows the list).
• The workloads also have many large, sequential writes that append data to files. Typical operation sizes are similar to those for reads. Once written, files are seldom modified again. Small writes at arbitrary positions in a file are supported but do not have to be efficient.
• The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. Our files are often used as producer-consumer queues or for many-way merging. Hundreds of producers, running one per machine, will concurrently append to a file. Atomicity with minimal synchronization overhead is essential. The file may be read later, or a consumer may be reading through the file simultaneously.
• High sustained bandwidth is more important than low latency. Most of our target applications place a premium on processing data in bulk at a high rate, while few have stringent response time requirements for an individual read or write.
2.2 Interface
GFS provides a familiar file system interface, though it does not implement a standard API such as POSIX. Files are organized hierarchically in directories and identified by pathnames. We support the usual operations to create, delete, open, close, read, and write files.
Moreover, GFS has snapshot and record append operations. Snapshot creates a copy of a file or a directory tree at low cost. Record append allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client's append. It is useful for implementing multi-way merge results and producer-consumer queues that many clients can simultaneously append to without additional locking. We have found these types of files to be invaluable in building large distributed applications. Snapshot and record append are discussed further in Sections 3.4 and 3.3 respectively.
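Taken together, the interface amounts to a small client-facing API. The paper does not publish concrete signatures, so the stub below is only a plausible shape for the operations just listed, not GFS's actual client library.

```python
class GFSClient:
    """Hypothetical method signatures for the operations named above."""
    def create(self, path: str) -> None: ...
    def delete(self, path: str) -> None: ...
    def open(self, path: str, mode: str) -> "Handle": ...
    def close(self, handle: "Handle") -> None: ...
    def read(self, handle: "Handle", offset: int, length: int) -> bytes: ...
    def write(self, handle: "Handle", offset: int, data: bytes) -> None: ...
    def snapshot(self, src: str, dst: str) -> None: ...  # low-cost copy
    def record_append(self, handle: "Handle", data: bytes) -> int:
        """Atomically append one record; returns the offset GFS chose."""
```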
2.3 Architecture
A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients, as shown in Figure 1. Each of these is typically a commodity Linux machine running a user-level server process. It is easy to run both a chunkserver and a client on the same machine, as long as machine resources permit and the lower reliability caused by running possibly flaky application code is acceptable.
Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally unique 64 bit chunk handle assigned by the master at the time of chunk creation. Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range. For reliability, each chunk is replicated on multiple chunkservers. By default, we store three replicas, though users can designate different replication levels for different regions of the file namespace.
The master maintains all file system metadata. This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks. It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers. The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state.
GFS client code linked into each application implements the file system API and communicates with the master and chunkservers to read or write data on behalf of the application. Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunkservers. We do not provide the POSIX API and therefore need not hook into the Linux vnode layer.
Neither the client nor the chunkserver caches file data. Client caches offer little benefit because most applications stream through huge files or have working sets too large to be cached. Not having them simplifies the client and the overall system by eliminating cache coherence issues. (Clients do cache metadata, however.) Chunkservers need not cache file data because chunks are stored as local files and so Linux's buffer cache already keeps frequently accessed data in memory.
2.4 Single Master
Having a single master vastly simplifies our design and enables the master to make sophisticated chunk placement and replication decisions using global knowledge. However, we must minimize its involvement in reads and writes so that it does not become a bottleneck. Clients never read and write file data through the master. Instead, a client asks the master which chunkservers it should contact. It caches this information for a limited time and interacts with the chunkservers directly for many subsequent operations.
Let us explain the interactions for a simple read with reference to Figure 1. First, using the fixed chunk size, the client translates the file name and byte offset specified by the application into a chunk index within the file. Then, it sends the master a request containing the file name and chunk index. The master replies with the corresponding chunk handle and locations of the replicas. The client caches this information using the file name and chunk index as the key.
The client then sends a request to one of the replicas, most likely the closest one. The request specifies the chunk handle and a byte range within that chunk. Further reads of the same chunk require no more client-master interaction until the cached information expires or the file is reopened. In fact, the client typically asks for multiple chunks in the same request and the master can also include the information for chunks immediately following those requested. This extra information sidesteps several future client-master interactions at practically no extra cost.
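The read path described above reduces to a few lines of client logic. This is a minimal sketch, assuming a read that falls within a single chunk; `master.lookup`, `closest`, and `replica.read_range` are hypothetical stand-ins, not GFS's real RPC interface.

```python
CHUNK_SIZE = 64 * 2**20  # the fixed 64 MB chunk size (Section 2.5)

class ReadPath:
    def __init__(self, master):
        self.master = master
        self.cache = {}  # (file_name, chunk_index) -> (chunk_handle, replicas)

    def read(self, file_name, offset, length):
        chunk_index = offset // CHUNK_SIZE        # translate (name, offset)
        key = (file_name, chunk_index)
        if key not in self.cache:                 # one master round trip,
            self.cache[key] = self.master.lookup(file_name, chunk_index)
        chunk_handle, replicas = self.cache[key]  # then served from cache
        replica = closest(replicas)               # most likely the closest one
        return replica.read_range(chunk_handle, offset % CHUNK_SIZE, length)
```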
2.5 Chunk Size
Chunk size is one of the key design parameters. We have chosen 64 MB, which is much larger than typical file system block sizes. Each chunk replica is stored as a plain Linux file on a chunkserver and is extended only as needed. Lazy space allocation avoids wasting space due to internal fragmentation, perhaps the greatest objection against such a large chunk size.
A large chunk size offers several important advantages. First, it reduces clients' need to interact with the master because reads and writes on the same chunk require only one initial request to the master for chunk location information. The reduction is especially significant for our workloads because applications mostly read and write large files sequentially. Even for small random reads, the client can comfortably cache all the chunk location information for a multi-TB working set. Second, since on a large chunk, a client is more likely to perform many operations on a given chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time. Third, it reduces the size of the metadata stored on the master. This allows us to keep the metadata in memory, which in turn brings other advantages that we will discuss in Section 2.6.1.
On the other hand, a large chunk size, even with lazy space allocation, has its disadvantages. A small file consists of a small number of chunks, perhaps just one. The chunkservers storing those chunks may become hot spots if many clients are accessing the same file. In practice, hot spots have not been a major issue because our applications mostly read large multi-chunk files sequentially.
However, hot spots did develop when GFS was first used by a batch-queue system: an executable was written to GFS as a single-chunk file and then started on hundreds of machines at the same time. The few chunkservers storing this executable were overloaded by hundreds of simultaneous requests. We fixed this problem by storing such executables with a higher replication factor and by making the batch queue system stagger application start times. A potential long-term solution is to allow clients to read data from other clients in such situations.
2.6 Metadata
The master stores three major types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk's replicas. All metadata is kept in the master's memory. The first two types (namespaces and file-to-chunk mapping) are also kept persistent by logging mutations to an operation log stored on the master's local disk and replicated on remote machines. Using a log allows us to update the master state simply, reliably, and without risking inconsistencies in the event of a master crash. The master does not store chunk location information persistently. Instead, it asks each chunkserver about its chunks at master startup and whenever a chunkserver joins the cluster.
2.6.1 In-Memory Data Structures
Since metadata is stored in memory, master operations are fast. Furthermore, it is easy and efficient for the master to periodically scan through its entire state in the background. This periodic scanning is used to implement chunk garbage collection, re-replication in the presence of chunkserver failures, and chunk migration to balance load and disk space usage across chunkservers. Sections 4.3 and 4.4 will discuss these activities further.
One potential concern for this memory-only approach is that the number of chunks and hence the capacity of the whole system is limited by how much memory the master has. This is not a serious limitation in practice. The master maintains less than 64 bytes of metadata for each 64 MB chunk. Most chunks are full because most files contain many chunks, only the last of which may be partially filled. Similarly, the file namespace data typically requires less than 64 bytes per file because it stores file names compactly using prefix compression.
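A quick back-of-the-envelope check makes the claim concrete. Assuming full 64 MB chunks and the 64-byte-per-chunk upper bound, the 300 TB cluster cited in Section 1 needs only a few hundred megabytes of chunk metadata:

```python
CHUNK_SIZE = 64 * 2**20            # 64 MB per chunk
META_PER_CHUNK = 64                # < 64 bytes of metadata per chunk
cluster_storage = 300 * 2**40      # 300 TB, the largest deployment cited

chunks = cluster_storage // CHUNK_SIZE   # ~4.9 million chunks
chunk_meta = chunks * META_PER_CHUNK     # ~300 MB of master memory
print(f"{chunks:,} chunks -> {chunk_meta / 2**20:.0f} MB of metadata")
```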
If necessary to support even larger file systems, the cost of adding extra memory to the master is a small price to pay for the simplicity, reliability, performance, and flexibility we gain by storing the metadata in memory.
2.6.2 Chunk Locations
The master does not keep a persistent record of which chunkservers have a replica of a given chunk. It simply polls chunkservers for that information at startup. The master can keep itself up-to-date thereafter because it controls all chunk placement and monitors chunkserver status with regular HeartBeat messages.
We initially attempted to keep chunk location information persistently at the master, but we decided that it was much simpler to request the data from chunkservers at startup, and periodically thereafter. This eliminated the problem of keeping the master and chunkservers in sync as chunkservers join and leave the cluster, change names, fail, restart, and so on. In a cluster with hundreds of servers, these events happen all too often.
Another way to understand this design decision is to realize that a chunkserver has the final word over what chunks it does or does not have on its own disks. There is no point in trying to maintain a consistent view of this information on the master because errors on a chunkserver may cause chunks to vanish spontaneously (e.g., a disk may go bad and be disabled) or an operator may rename a chunkserver.
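A toy model of this "chunkserver has the final word" rule: the master only ever rebuilds its location map from what chunkservers report, and forgets the replicas of servers that stop heartbeating. All names below are invented for illustration.

```python
class LocationMap:
    def __init__(self):
        self.locations = {}  # chunk_handle -> set of chunkserver ids

    def on_report(self, server_id, chunk_handles):
        # Sent at master startup and whenever a chunkserver joins:
        # the server's own disks are the authority on what it holds.
        for handle in chunk_handles:
            self.locations.setdefault(handle, set()).add(server_id)

    def on_server_lost(self, server_id):
        # No persistence, no sync protocol: a dead or renamed server's
        # replicas simply vanish until it reports again.
        for replicas in self.locations.values():
            replicas.discard(server_id)
```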
2.6.3 Operation Log
The operation log contains a historical record of critical metadata changes. It is central to GFS. Not only is it the only persistent record of metadata, but it also serves as a logical time line that defines the order of concurrent operations. Files and chunks, as well as their versions (see Section 4.5), are all uniquely and eternally identified by the logical times at which they were created.
Since the operation log is critical, we must store it reliably and not make changes visible to clients until metadata changes are made persistent. Otherwise, we effectively lose the whole file system or recent client operations even if the chunks themselves survive. Therefore, we replicate it on multiple remote machines and respond to a client operation only after flushing the corresponding log record to disk both locally and remotely. The master batches several log records together before flushing thereby reducing the impact of flushing and replication on overall system throughput.
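The flush-before-reply rule plus batching is essentially group commit on a replicated write-ahead log. A minimal sketch, assuming each remote replica exposes a `replicate` RPC (an invented name):

```python
import os

class OperationLog:
    def __init__(self, path, remote_replicas):
        self.log = open(path, "ab")
        self.remotes = remote_replicas       # assumed to expose .replicate()
        self.pending = []                    # (record_bytes, reply_callback)

    def append(self, record, reply):
        self.pending.append((record, reply))  # buffer; the client waits

    def flush(self):
        batch, self.pending = self.pending, []
        data = b"".join(record for record, _ in batch)
        self.log.write(data)
        self.log.flush()
        os.fsync(self.log.fileno())           # durable on local disk...
        for replica in self.remotes:
            replica.replicate(data)           # ...and on remote disks
        for _, reply in batch:
            reply()                           # only now answer the clients
```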
The master recovers its file system state by replaying the operation log. To minimize startup time, we must keep the log small. The master checkpoints its state whenever the log grows beyond a certain size so that it can recover by loading the latest checkpoint from local disk and replaying only the limited number of log records after that. The checkpoint is in a compact B-tree like form that can be directly mapped into memory and used for namespace lookup without extra parsing. This further speeds up recovery and improves availability.
Because building a checkpoint can take a while, the master's internal state is structured in such a way that a new checkpoint can be created without delaying incoming mutations. The master switches to a new log file and creates the new checkpoint in a separate thread. The new checkpoint includes all mutations before the switch. It can be created in a minute or so for a cluster with a few million files. When completed, it is written to disk both locally and remotely.
Recovery needs only the latest complete checkpoint and subsequent log files. Older checkpoints and log files can be freely deleted, though we keep a few around to guard against catastrophes. A failure during checkpointing does not affect correctness because the recovery code detects and skips incomplete checkpoints.
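Recovery therefore reduces to "newest complete checkpoint plus log tail". A sketch under obvious assumptions; `is_complete`, `load_checkpoint`, `empty_state`, `read_log_after`, and `apply_mutation` are hypothetical helpers:

```python
def recover(checkpoints, read_log_after):
    """Rebuild master state from the latest complete checkpoint and the
    limited number of log records written after it."""
    complete = [cp for cp in checkpoints if is_complete(cp)]  # skip torn ones
    latest = max(complete, default=None)
    state = load_checkpoint(latest) if latest else empty_state()
    for record in read_log_after(latest):   # only the log tail is replayed
        apply_mutation(state, record)
    return state
```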
2.7 Consistency Model
GFS has a relaxed consistency model that supports our highly distributed applications well but remains relatively simple and efficient to implement. We now discuss GFS's guarantees and what they mean to applications. We also highlight how GFS maintains these guarantees but leave the details to other parts of the paper.
2.7.1 Guarantees by GFS
The state of a file region after a data mutation depends on the type of mutation, whether it succeeds or fails, and whether there are concurrent mutations. Table 1 summarizes the result. A file region is consistent if all clients will always see the same data, regardless of which replicas they read from. A region is defined after a file data mutation if it is consistent and clients will see what the mutation writes in its entirety. When a mutation succeeds without interference from concurrent writers, the affected region is defined (and by implication consistent): all clients will always see what the mutation has written. Concurrent successful mutations leave the region undefined but consistent: all clients see the same data, but it may not reflect what any one mutation has written. Typically, it consists of mingled fragments from multiple mutations. A failed mutation makes the region inconsistent (hence also undefined): different clients may see different data at different times. We describe below how our applications can distinguish defined regions from undefined regions. The applications do not need to further distinguish between different kinds of undefined regions.
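These guarantees form a small decision table (Table 1 in the paper); encoding it as a function makes the cases explicit. This is a restatement of the prose above, not GFS code:

```python
def region_state(op: str, concurrent: bool, success: bool) -> str:
    """State of the mutated file region; op is 'write' or 'record_append'."""
    if not success:
        return "inconsistent (and hence undefined)"
    if op == "write":
        return "consistent but undefined" if concurrent else "defined"
    # Record append succeeds atomically even under concurrency, but may
    # leave padding or duplicate fragments around the record (Section 3.3).
    return "defined, interspersed with inconsistent"
```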
Data mutations may be writes or record appends. A write causes data to be written at an application-specified file offset. A record append causes data (the "record") to be appended atomically at least once even in the presence of concurrent mutations, but at an offset of GFS's choosing (Section 3.3). (In contrast, a "regular" append is merely a write at an offset that the client believes to be the current end of file.) The offset is returned to the client and marks the beginning of a defined region that contains the record. In addition, GFS may insert padding or record duplicates in between. They occupy regions considered to be inconsistent and are typically dwarfed by the amount of user data.
After a sequence of successful mutations, the mutated file region is guaranteed to be defined and contain the data written by the last mutation. GFS achieves this by (a) applying mutations to a chunk in the same order on all its replicas (Section 3.1), and (b) using chunk version numbers to detect any replica that has become stale because it has missed mutations while its chunkserver was down (Section 4.5). Stale replicas will never be involved in a mutation or given to clients asking the master for chunk locations. They are garbage collected at the earliest opportunity.
Since clients cache chunk locations, they may read from a stale replica before that information is refreshed. This window is limited by the cache entry's timeout and the next open of the file, which purges from the cache all chunk information for that file. Moreover, as most of our files are append-only, a stale replica usually returns a premature end of chunk rather than outdated data. When a reader retries and contacts the master, it will immediately get current chunk locations.
Long after a successful mutation, component failures can of course still corrupt or destroy data. GFS identifies failed chunkservers by regular handshakes between master and all chunkservers and detects data corruption by checksumming (Section 5.2). Once a problem surfaces, the data is restored from valid replicas as soon as possible (Section 4.3). A chunk is lost irreversibly only if all its replicas are lost before GFS can react, typically within minutes. Even in this case, it becomes unavailable, not corrupted: applications receive clear errors rather than corrupt data.
2.7.2 Implications for Applications
GFS applications can accommodate the relaxed consistency model with a few simple techniques already needed for other purposes: relying on appends rather than overwrites, checkpointing, and writing self-validating, self-identifying records.
Practically all our applications mutate files by appending rather than overwriting. In one typical use, a writer generates a file from beginning to end. It atomically renames the file to a permanent name after writing all the data, or periodically checkpoints how much has been successfully written. Checkpoints may also include application-level checksums. Readers verify and process only the file region up to the last checkpoint, which is known to be in the defined state. Regardless of consistency and concurrency issues, this approach has served us well. Appending is far more efficient and more resilient to application failures than random writes. Checkpointing allows writers to restart incrementally and keeps readers from processing successfully written file data that is still incomplete from the application's perspective.
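A reader that honors the writer's checkpoint might look like the loop below. This is a sketch only; `last_checkpointed_offset` and `read_range` are invented names for illustration.

```python
import time

def follow(reader, checkpoint, process, poll_secs=1.0):
    """Consume a file a writer is still appending to, processing only
    the prefix the writer has checkpointed as fully written (the
    region known to be in the defined state)."""
    done = 0
    while True:
        limit = checkpoint.last_checkpointed_offset()
        if limit > done:
            process(reader.read_range(done, limit))  # defined prefix only
            done = limit
        else:
            time.sleep(poll_secs)  # bytes past the checkpoint may still be
                                   # incomplete from the app's perspective
```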
In the other typical use, many writers concurrently append to a file for merged results or as a producer-consumer queue. Record append's append-at-least-once semantics preserves each writer's output. Readers deal with the occasional padding and duplicates as follows. Each record prepared by the writer contains extra information like checksums so that its validity can be verified. A reader can identify and discard extra padding and record fragments using the checksums. If it cannot tolerate the occasional duplicates (e.g., if they would trigger non-idempotent operations), it can filter them out using unique identifiers in the records, which are often needed anyway to name corresponding application entities such as web documents. These functionalities for record I/O (except duplicate removal) are in library code shared by our applications and applicable to other file interface implementations at Google. With that, the same sequence of records, plus rare duplicates, is always delivered to the record reader.
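Self-validating, self-identifying records are straightforward to sketch. The framing below (magic marker, record id, length, CRC) is invented for illustration and is not GFS's actual record format:

```python
import struct, zlib

HEADER = struct.Struct("<IQII")   # magic, record_id, payload_len, crc32
MAGIC = 0x5245434F                # arbitrary marker for this sketch

def encode(record_id, payload):
    return HEADER.pack(MAGIC, record_id, len(payload),
                       zlib.crc32(payload)) + payload

def scan(buf):
    """Yield (record_id, payload), skipping padding and torn fragments
    via the checksum and dropping the rare duplicates left behind by
    at-least-once record append."""
    seen, pos = set(), 0
    while pos + HEADER.size <= len(buf):
        magic, rid, length, crc = HEADER.unpack_from(buf, pos)
        payload = buf[pos + HEADER.size : pos + HEADER.size + length]
        if magic != MAGIC or len(payload) != length or zlib.crc32(payload) != crc:
            pos += 1              # padding or a fragment: resynchronize
            continue
        pos += HEADER.size + length
        if rid not in seen:       # duplicate from a retried append
            seen.add(rid)
            yield rid, payload
```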