AlphaSort: A Cache-Sensitive Parallel External Sort
Chris Nyberg, Tom Barclay, Zarka Cvetanovic, Jim Gray, Dave Lomet

Abstract A new sort algorithm, called AlphaSort, demonstrates that commodity processors and disks can handle commercial batch workloads. Using commodity processors, memory, and arrays of SCSI disks, AlphaSort runs the industry-standard sort benchmark in seven seconds. This beats the best published record on a 32-CPU 32-disk Hypercube by 8:1. On another benchmark, AlphaSort sorted more than a gigabyte in a minute.

AlphaSort is a cache-sensitive memory-intensive sort algorithm. We argue that modern architectures require algorithm designers to re-examine their use of the memory hierarchy. AlphaSort uses clustered data structures to get good cache locality. It uses file striping to get high disk bandwidth. It uses QuickSort to generate runs and uses replacement-selection to merge the runs. It uses shared memory multiprocessors to break the sort into subsort chores.

Because startup times are becoming a significant part of the total time, we propose two new benchmarks:

(1) MinuteSort: how much can you sort in a minute, and

(2) PennySort: how much can you sort for a penny.

An abridged version of this paper appeared in the ACM SIGMOD'94 Proceedings.


This paper appeared in VLDB Journal 4(4): 603-627 (1995)
Copyright 1995 by the VLDB endowment. Copying without fee is permitted provided that the copies are not made or distributed for direct commercial advantage and credit for the source is given. Abstracting with credit is permitted. For other copying of articles, write to the chairperson of the Publication Board. To copy otherwise or republish, requires a fee and/or specific permission. http://www.vldb.org/

This work was sponsored by Digital Equipment Corporation.

Authors' Addresses:


Chris Nyberg, Ordinal Technology Corp., 20 Crestview Dr., Orinda, CA 94563 chris@ordinal.com

Tom Barclay, Microsoft Corp., One Microsoft Way, Redmond, WA 98052 tbarclay@microsoft.com

Zarka Cvetanovic, Digital, 60 Codman Hill Rd, Boxborough, MA 01717 zarka@danger.enet.dec.com

Jim Gray, 310 Filbert St., San Francisco, CA 94133 gray@crl.com

David Lomet, Microsoft Corp., One Microsoft Way, Redmond, WA 98052 lomet@microsoft.com

1. Introduction

In 1985, an informal group of 25 database experts from a dozen companies and universities defined three basic benchmarks to measure the transaction processing performance of computer systems.



• DebitCredit: a market basket of database reads and writes, terminal IO, and transaction commits to measure on-line transaction processing (OLTP) performance. This benchmark evolved to become the TPC-A transactions-per-second and dollars-per-transaction-per-second metrics [12].

• Scan: copy a thousand 100-byte records from disk to disk with transaction protection. This simple mini-batch transaction measures the ability of a file system or database system to pump data through a user application.

• Sort: a disk-to-disk sort of one million 100-byte records. This has become the standard test of batch and utility performance in the database community [3, 4, 6, 7, 9, 11, 13, 18, 21, 22]. Sort tests the processor's, IO subsystem's, and operating system's ability to move data.

DebitCredit is a simple interactive transaction. Scan is a mini-batch transaction. Sort is an IO-intensive batch transaction. Together they cover a broad spectrum of basic commercial operations.


2. The sort benchmark and prior work on sort

The Datamation article [1] defined the sort benchmark as:

• Input is a disk-resident file of a million 100-byte records.

• Records have 10-byte key fields and can't be compressed.

• The input record keys are in random order.

• The output file must be a permutation of the input file sorted in key ascending order.


The performance metric is the elapsed time of the following seven steps:

(1) launch the sort program.

(2) open the input file and create the output file.

(3) read the input file.

(4) sort the records in key-ascending order.

(5) write the output file.

(6) close the files.

(7) terminate the program.


The implementation may use all the "mean tricks" typical of operating-system utilities: it can access the files via low-level interfaces, it can use undocumented interfaces, and it can use as many disks and processors, and as much memory, as it likes. Sort's price-performance metric normalizes variations in software and hardware configuration. The basic idea is to compute the 5-year cost of the hardware and software, and then prorate that cost over the elapsed time of the sort [1, 12]. A one-minute sort on a machine with a 5-year cost of a million dollars would cost 38 cents ($0.38).
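
For concreteness, here is a minimal sketch of the proration arithmetic in C (a straight 5-year proration per [1, 12]; the dollar figure is only illustrative):

#include <stdio.h>

/* Prorate a system's 5-year cost over the elapsed time of one sort run.
   This sketches the Datamation price metric described above: the cost is
   charged linearly over 5 years of seconds. */
static double dollars_per_sort(double system_cost_dollars, double elapsed_seconds)
{
    const double five_years_in_seconds = 5.0 * 365.0 * 24.0 * 3600.0;
    return system_cost_dollars * (elapsed_seconds / five_years_in_seconds);
}

int main(void)
{
    /* A one-minute sort on a machine with a 5-year cost of $1,000,000
       comes out to about $0.38, as noted above. */
    printf("$%.2f per sort\n", dollars_per_sort(1e6, 60.0));
    return 0;
}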
In 1985, as reported by Tsukerman, typical systems needed 15 minutes to perform this sort benchmark [1, 6, 21]. As a super-computer response to Tsukerman's efforts, Peter Weinberger of AT&T wrote a program to read a disk file into memory, sort it using replacement-selection as the records arrived, and then write the sorted data to a file [22]. This code postulated 8-byte keys, a natural size for the Cray, and made some other simplifications. The disks transferred at 8 MB/s, so one might guess that it took 12.5 seconds to read and 12.5 seconds to write, for a grand total of 25 seconds. However, there was about one second of overhead in setup, file creation, and file access. The result, 26 seconds, stood as the unofficial sort speed record for seven years. It is much faster than the subsequently reported Hypercube and hardware sorters.
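
Weinberger's code itself is not published in detail; the following is a minimal sketch of the in-memory selection idea, assuming a binary min-heap on 8-byte keys in place of his selection tree and omitting the record I/O. It is illustrative, not his program.

#include <stddef.h>

/* Sketch of the one-pass idea: as records arrive from disk they are pushed
 * into a min-heap ordered on an 8-byte key; once the input is read, popping
 * the heap yields records in ascending key order for the output file.
 * The array a[] is assumed to be pre-allocated to hold every record. */
typedef struct { unsigned long long key; void *rec; } Entry;
typedef struct { Entry *a; size_t n; } Heap;

static void heap_push(Heap *h, Entry e)
{
    size_t i = h->n++;
    while (i > 0 && h->a[(i - 1) / 2].key > e.key) {   /* sift up */
        h->a[i] = h->a[(i - 1) / 2];
        i = (i - 1) / 2;
    }
    h->a[i] = e;
}

static Entry heap_pop(Heap *h)                  /* caller ensures h->n > 0 */
{
    Entry top = h->a[0], last = h->a[--h->n];
    size_t i = 0;
    for (;;) {                                  /* sift the last entry down */
        size_t c = 2 * i + 1;
        if (c >= h->n) break;
        if (c + 1 < h->n && h->a[c + 1].key < h->a[c].key) c++;
        if (last.key <= h->a[c].key) break;
        h->a[i] = h->a[c];
        i = c;
    }
    h->a[i] = last;
    return top;
}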

Table 1: Published sort performance on the Datamation 100 MB benchmark in chronological order. Extrapolations marked by (*). Prices are estimated.

System                  Seconds  $/sort(*)  Cost M$*  CPUs  Disks  Reference
Tandem                    3600    4.61        2         2     2    [1, 21]
Beck                      6000    1.92         .1       4     4    [7]
Tsukerman + Tandem         980    1.25         .2       3     6    [20]
Weinberger + Cray           26    1.25        7.5       1     1    [22]
Kitsuregawa                320*   0.41         .2       1+    1    [15]
Baugsto                    180    0.23         .2      16    16    [4]
Graefe + Sequent            83    0.27         .5       8     4    [11]
Baugsto                     40    0.26        1        100   100   [4]
DeWitt + Intel iPSC/2       58    0.37        1.0      32    32    [9]
DEC Alpha AXP 7000           9.1  0.022        .4       1    16    1993
DEC Alpha AXP 4000           8.2  0.011        .2       2    14    1993
DEC Alpha AXP 7000           7    0.014        .5       3    28    1993

Since 1986, most sorting effort has focused on multiprocessor sorting, using either shared-memory or partitioned-data designs. DeWitt, Naughton, and Schneider's effort on an Intel Hypercube produced the fastest reported time: 58.3 seconds using 32 processors, 32 disks, and 224 MB of memory [9]. Baugsto, Greispland, and Kamberbeek mentioned a 40-second sort on a 100-processor, 100-disk system [4]. These parallel systems stripe the input and output data across all the disks (30 in the Hypercube case). They read the disks in parallel, perform a preliminary sort of the data at each source, and partition it into equal-sized parts. Each reader-sorter sends each partition to its target processor. Each target processor merges the many incoming streams into a sorted run that is stored on the local disk. The resulting output file is striped across the 30 disks. The Hypercube sort was about twice as slow as Weinberger's Cray sort, but it had better price-performance, since the machine is about seven times cheaper.
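
The per-target merge can be sketched as follows. This is an illustrative reconstruction of the scheme just described, not the Hypercube code: each incoming sorted stream is modeled as an in-memory array of 100-byte records, the 10-byte key prefix is compared directly, and a linear scan picks the next record (a production merge would use a tournament tree).

#include <stddef.h>
#include <string.h>

/* One stream of records arriving from a reader-sorter, already in key order. */
typedef struct {
    const char (*recs)[100];   /* sorted 100-byte records          */
    size_t      count;         /* number of records in this stream */
    size_t      pos;           /* next record to consume           */
} Stream;

/* Repeatedly emit the record with the smallest 10-byte key among all
   non-exhausted streams, producing one sorted run for the local disk. */
static void merge_streams(Stream *streams, int nstreams,
                          void (*emit)(const char rec[100]))
{
    for (;;) {
        int best = -1;
        for (int i = 0; i < nstreams; i++) {
            if (streams[i].pos == streams[i].count)
                continue;                               /* stream exhausted */
            if (best < 0 ||
                memcmp(streams[i].recs[streams[i].pos],
                       streams[best].recs[streams[best].pos], 10) < 0)
                best = i;
        }
        if (best < 0)
            break;                                      /* all streams done */
        emit(streams[best].recs[streams[best].pos++]);  /* append to output run */
    }
}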
Table 1 and Graph 2 show that, prior to AlphaSort, sophisticated hardware-software combinations were slower than a brute-force one-pass memory-intensive sort. Until now, a Cray Y-MP super-computer with a gigabyte of memory, a fast disk, and fast processors was the clear winner. But the Cray approach was expensive.


Graph 2: The performance and price-performance trends of sorting displayed in chronological order. Until now, the Cray sort was fastest but the parallel sorts had the best price-performance.
Weinberger's Cray-based sort used a fast processor, a fast parallel-transfer disk, and lots of fast memory. AlphaSort's approach is similar, but it uses commodity products to achieve better price-performance: fast microprocessors, commodity memory, and commodity disks. It uses file striping to exploit parallel disks, and it breaks the sorting task into subtasks to utilize multiprocessors. Using these techniques, AlphaSort beats the Cray Y-MP in two dimensions: it is about 4 times faster and about 100 times less expensive.
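
To illustrate the striping idea, here is a sketch of mapping a logical file offset to a disk and an offset on that disk; the round-robin layout and the 64 KB stripe unit are arbitrary choices for the example, not AlphaSort's actual parameters.

/* Map a logical byte offset in a striped file to a (disk, offset) pair.
   Fixed-size stripe units are laid out round-robin across ndisks disks. */
#define STRIPE_UNIT (64 * 1024)        /* illustrative stripe unit size */

typedef struct { int disk; long long offset; } DiskAddr;

static DiskAddr stripe_map(long long logical_offset, int ndisks)
{
    long long unit   = logical_offset / STRIPE_UNIT;    /* which stripe unit   */
    long long within = logical_offset % STRIPE_UNIT;    /* offset inside unit  */
    DiskAddr a;
    a.disk   = (int)(unit % ndisks);                    /* round-robin disk    */
    a.offset = (unit / ndisks) * STRIPE_UNIT + within;  /* offset on that disk */
    return a;
}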
