Copyright | (c) 2010 Jasper Van der Jeugt (c) 2010 - 2011 Simon Meier |
---|---|
License | BSD3-style (see LICENSE) |
Maintainer | Simon Meier <[email protected]> |
Portability | GHC |
Safe Haskell | Trustworthy |
Language | Haskell98 |
Builders are used to efficiently construct sequences of bytes from smaller parts. Typically, such a construction is part of the implementation of an encoding, i.e., a function for converting Haskell values to sequences of bytes. Examples of encodings are the generation of the sequence of bytes representing an HTML document to be sent in an HTTP response by a web application or the serialization of a Haskell value using a fixed binary format.
For an efficient implementation of an encoding, it is important that (a) little time is spent on converting the Haskell values to the resulting sequence of bytes and (b) the representation of the resulting sequence is such that it can be consumed efficiently. Builders support (a) by providing an O(1) concatenation operation and efficient implementations of basic encodings for Chars, Ints, and other standard Haskell values. They support (b) by providing their result as a lazy ByteString, which is internally just a linked list of pointers to chunks of consecutive raw memory. Lazy ByteStrings can be efficiently consumed by functions that write them to a file or send them over a network socket. Note that each chunk boundary incurs expensive extra work (e.g., a system call) that must be amortized over the work spent on consuming the chunk body. Builders therefore take special care to ensure that the average chunk size is large enough. The precise meaning of large enough is application dependent. The current implementation is tuned for an average chunk size between 4 kB and 32 kB, which should suit most applications.
As a simple example of an encoding implementation, we show how to efficiently convert the following representation of mixed-data tables to a UTF-8 encoded Comma-Separated-Values (CSV) table.
    data Cell = StringC String
              | IntC Int
              deriving( Eq, Ord, Show )

    type Row   = [Cell]
    type Table = [Row]
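We use the following imports. The operator (<>) is used below as shorthand for mappend; it is exported by Data.Monoid since base-4.5 and by the Prelude since base-4.11, so a local definition is only needed on older versions of base.

    import qualified Data.ByteString.Lazy as L
    import Data.ByteString.Builder
    import Data.Monoid
    import Data.Foldable (foldMap)
    import Data.List (intersperse)

    -- Only for a base older than 4.5, where Data.Monoid does not export (<>).
    -- On newer versions this definition clashes with the exported operator.
    --
    -- infixr 4 <>
    -- (<>) :: Monoid m => m -> m -> m
    -- (<>) = mappend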
CSV is a character-based representation of tables. For maximal modularity, we could first render Tables as Strings and then encode this String using some Unicode character encoding. However, this sacrifices performance due to the intermediate String representation being built and thrown away right afterwards. We get rid of this intermediate String representation by fixing the character encoding to UTF-8 and using Builders to convert Tables directly to UTF-8 encoded CSV tables represented as lazy ByteStrings.
    encodeUtf8CSV :: Table -> L.ByteString
    encodeUtf8CSV = toLazyByteString . renderTable

    renderTable :: Table -> Builder
    renderTable rs = mconcat [renderRow r <> charUtf8 '\n' | r <- rs]

    renderRow :: Row -> Builder
    renderRow []     = mempty
    renderRow (c:cs) =
        renderCell c <> mconcat [ charUtf8 ',' <> renderCell c' | c' <- cs ]

    renderCell :: Cell -> Builder
    renderCell (StringC cs) = renderString cs
    renderCell (IntC i)     = intDec i

    renderString :: String -> Builder
    renderString cs = charUtf8 '"' <> foldMap escape cs <> charUtf8 '"'
      where
        escape '\\' = charUtf8 '\\' <> charUtf8 '\\'
        escape '\"' = charUtf8 '\\' <> charUtf8 '\"'
        escape c    = charUtf8 c
Note that the ASCII encoding is a subset of the UTF-8 encoding, which is why we can use the optimized function intDec to encode an Int as a decimal number with UTF-8 encoded digits. Using intDec is more efficient than stringUtf8 . show, as it avoids constructing an intermediate String. Avoiding this intermediate data structure significantly improves performance because encoding Cells is the core operation for rendering CSV tables. See Data.ByteString.Builder.Prim for further information on how to improve the performance of renderString.
We demonstrate our UTF-8 CSV encoding function on the following table.
    strings :: [String]
    strings = ["hello", "\"1\"", "λ-wörld"]

    table :: Table
    table = [map StringC strings, map IntC [-3..3]]
The expression encodeUtf8CSV table results in the following lazy ByteString.

    Chunk "\"hello\",\"\\\"1\\\"\",\"\206\187-w\195\182rld\"\n-3,-2,-1,0,1,2,3\n" Empty
We can clearly see that we are converting to a binary format. The 'λ' and 'ö' characters, which have a Unicode codepoint above 127, are expanded to their corresponding UTF-8 multi-byte representation.
We use the criterion library (http://hackage.haskell.org/package/criterion) to benchmark the efficiency of our encoding function on the following table.

    import Criterion.Main     -- add this import to the ones above

    maxiTable :: Table
    maxiTable = take 1000 $ cycle table

    main :: IO ()
    main = defaultMain
      [ bench "encodeUtf8CSV maxiTable (original)" $
          whnf (L.length . encodeUtf8CSV) maxiTable
      ]
On a 2.20 GHz Core2 Duo running 32-bit Linux, the above code takes 1 ms to generate the 22,500-byte lazy ByteString. Looking again at the definitions above, we see that we took care to avoid intermediate data structures, as otherwise we would sacrifice performance. For example, the following (arguably simpler) definition of renderRow is about 20% slower.

    renderRow :: Row -> Builder
    renderRow = mconcat . intersperse (charUtf8 ',') . map renderCell
Similarly, using O(n) concatenations like ++ or the equivalent concat operations on strict and lazy ByteStrings should be avoided. The following definition of renderString is also about 20% slower.

    renderString :: String -> Builder
    renderString cs = stringUtf8 $ "\"" ++ concatMap escape cs ++ "\""
      where
        escape '\\' = "\\\\"    -- a backslash becomes two backslashes
        escape '\"' = "\\\""    -- a double quote becomes backslash, quote
        escape c    = return c  -- any other character is kept as-is
Apart from removing intermediate data-structures, encodings can be optimized further by fine-tuning their execution parameters using the functions in Data.ByteString.Builder.Extra and their "inner loops" using the functions in Data.ByteString.Builder.Prim.
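As a rough illustration of tuning execution parameters, the following sketch runs a Builder with two different allocation strategies from Data.ByteString.Builder.Extra. The names toLazyShort, toLazyLong, and sample, as well as the 128-byte first buffer, are assumptions made for this example; suitable sizes depend on the application. The code relies on (<>) being exported by the Prelude (base-4.11 or later).

```haskell
import qualified Data.ByteString.Lazy as L
import Data.ByteString.Builder (Builder, charUtf8, intDec)
import Data.ByteString.Builder.Extra
  ( defaultChunkSize, safeStrategy, smallChunkSize
  , toLazyByteStringWith, untrimmedStrategy )

-- | Runner for outputs expected to be short. Assumed sizes: a 128-byte first
-- buffer, then buffers of smallChunkSize. safeStrategy trims chunks whose
-- buffer is filled less than half, so little memory is wasted.
toLazyShort :: Builder -> L.ByteString
toLazyShort = toLazyByteStringWith (safeStrategy 128 smallChunkSize) L.empty

-- | Runner for outputs whose chunks are discarded right after they are
-- generated. untrimmedStrategy never trims, which avoids copying.
toLazyLong :: Builder -> L.ByteString
toLazyLong =
  toLazyByteStringWith (untrimmedStrategy smallChunkSize defaultChunkSize) L.empty

-- | Hypothetical payload: the numbers 1..1000, one per line.
sample :: Builder
sample = foldMap (\n -> intDec n <> charUtf8 '\n') [1 .. 1000 :: Int]
```

Both runners produce the same bytes for sample; only the chunking of the resulting lazy ByteString may differ.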
Builders denote sequences of bytes. They are Monoids where mempty is the zero-length sequence and mappend is concatenation, which runs in O(1).
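The Monoid structure can be used directly to assemble larger Builders from smaller ones. The sketch below is only an illustration; commaSeparated and render are hypothetical helper names, and (<>) is assumed to come from the Prelude (base-4.11 or later).

```haskell
import qualified Data.ByteString.Lazy as L
import Data.ByteString.Builder (Builder, charUtf8, intDec, toLazyByteString)

-- | Hypothetical helper: render Ints as a comma-separated sequence.
-- The empty list maps to mempty, the zero-length sequence.
commaSeparated :: [Int] -> Builder
commaSeparated []     = mempty
commaSeparated (x:xs) = intDec x <> foldMap (\y -> charUtf8 ',' <> intDec y) xs

-- | Run the Builder to obtain the bytes it denotes.
render :: [Int] -> L.ByteString
render = toLazyByteString . commaSeparated
```

For instance, render [1,2,3] yields the five bytes of "1,2,3", and render [] yields the empty lazy ByteString.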
Internally, Builders are buffer-filling functions. They are executed by a driver that provides them with an actual buffer to fill. Once called with a buffer, a Builder fills it and returns a signal to the driver telling it that it is either done, has filled the current buffer, or wants to directly insert a reference to a chunk of memory. In the last two cases, the Builder also returns a continuation Builder that the driver can call to fill the next buffer. Here, we provide the two drivers that satisfy almost all use cases. See Data.ByteString.Builder.Extra for information about fine-tuning them.
toLazyByteString :: Builder -> ByteString
Execute a Builder and return the generated chunks as a lazy ByteString. The work is performed lazily, i.e., only when a chunk of the lazy ByteString is forced.
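To make the chunked, lazy result more concrete, the following sketch inspects the chunk sizes of a generated lazy ByteString and takes a short prefix of a much longer output. The names digits, chunkSizes, and firstBytes are hypothetical, and the actual chunk sizes depend on the library version, so none are asserted here.

```haskell
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L
import Data.ByteString.Builder (Builder, intDec, toLazyByteString)

-- | Hypothetical payload: the decimal digits of 1..n, concatenated.
digits :: Int -> Builder
digits n = foldMap intDec [1 .. n]

-- | Sizes of the strict chunks that make up the result.
chunkSizes :: Int -> [Int]
chunkSizes = map S.length . L.toChunks . toLazyByteString . digits

-- | The first k bytes of a long output. Because chunks are generated on
-- demand, only the chunks covering the prefix need to be produced.
firstBytes :: Int -> L.ByteString
firstBytes k = L.take (fromIntegral k) (toLazyByteString (digits 1000000))
```

Here firstBytes 10 yields the bytes of "1234567891", and the sum of chunkSizes n equals the total number of digits written.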
hPutBuilder :: Handle -> Builder -> IO ()
Output a Builder to a Handle. The Builder is executed directly on the buffer of the Handle. If the buffer is too small (or not present), then it is replaced with a large enough buffer.
It is recommended that the Handle is set to binary and BlockBuffering mode. See hSetBinaryMode and hSetBuffering.
This function is more efficient than hPut . toLazyByteString because in many cases no buffer allocation has to be done. Moreover, the results of several executions of short Builders are concatenated in the Handle's buffer, therefore avoiding unnecessary buffer flushes.
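The recommended Handle setup could look roughly like the following sketch. The helpers prepare and writeLines are hypothetical names introduced for this example; withBinaryFile already opens the file in binary mode, so the explicit hSetBinaryMode call is shown only to mirror the recommendation above.

```haskell
import System.IO
  ( BufferMode (BlockBuffering), Handle, IOMode (WriteMode)
  , hSetBinaryMode, hSetBuffering, withBinaryFile )
import Data.ByteString.Builder (charUtf8, hPutBuilder, stringUtf8)

-- | Put a Handle into binary mode with block buffering.
prepare :: Handle -> IO ()
prepare h = do
  hSetBinaryMode h True
  hSetBuffering h (BlockBuffering Nothing)

-- | Hypothetical helper: write each line UTF-8 encoded, followed by '\n'.
-- Every short Builder is executed directly on the Handle's buffer.
writeLines :: FilePath -> [String] -> IO ()
writeLines path ls =
  withBinaryFile path WriteMode $ \h -> do
    prepare h
    mapM_ (\l -> hPutBuilder h (stringUtf8 l <> charUtf8 '\n')) ls
```

Calling writeLines "out.txt" ["ab", "c"] writes the five bytes of "ab\nc\n" to the file; the file name is, of course, only an example.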
byteString :: ByteString -> Builder
Create a Builder denoting the same sequence of bytes as a strict ByteString. The Builder inserts large ByteStrings directly, but copies small ones to ensure that the generated chunks are large on average.
lazyByteString :: ByteString -> Builder
Create a Builder denoting the same sequence of bytes as a lazy ByteString. The Builder inserts large chunks of the lazy ByteString directly, but copies small ones to ensure that the generated chunks are large on average.
shortByteString :: ShortByteString -> Builder
Construct a Builder that copies the ShortByteString.
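Already-encoded data in any of the three ByteString representations can be spliced into a Builder. The sketch below shows one possible combination; envelope and render are hypothetical names, and the split into tag, body, and trailer is an assumption made purely for illustration.

```haskell
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L
import qualified Data.ByteString.Short as Short
import Data.ByteString.Builder
  (Builder, byteString, lazyByteString, shortByteString, toLazyByteString)

-- | Hypothetical message layout: a short tag, a lazy body, a strict trailer.
envelope :: Short.ShortByteString -> L.ByteString -> S.ByteString -> Builder
envelope tag body trailer =
  shortByteString tag <> lazyByteString body <> byteString trailer

-- | Run the Builder; the output is the plain concatenation of the parts.
render :: Short.ShortByteString -> L.ByteString -> S.ByteString -> L.ByteString
render tag body trailer = toLazyByteString (envelope tag body trailer)
```

Whether a given part is copied or inserted by reference is decided by the Builder and does not affect the resulting bytes.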
int8 :: Int8 -> Builder
Encode a single signed byte as-is.
word8 :: Word8 -> Builder
Encode a single unsigned byte as-is.
int16BE :: Int16 -> Builder
Encode an Int16 in big endian format.
int32BE :: Int32 -> Builder
Encode an Int32 in big endian format.
int64BE :: Int64 -> Builder
Encode an Int64 in big endian format.
word16BE :: Word16 -> Builder
Encode a Word16 in big endian format.
word32BE :: Word32 -> Builder
Encode a Word32 in big endian format.
word64BE :: Word64 -> Builder
Encode a Word64 in big endian format.
floatBE :: Float -> Builder
Encode a Float in big endian format.
doubleBE :: Double -> Builder
Encode a Double in big endian format.
int16LE :: Int16 -> Builder
Encode an Int16 in little endian format.
int32LE :: Int32 -> Builder
Encode an Int32 in little endian format.
int64LE :: Int64 -> Builder
Encode an Int64 in little endian format.
word16LE :: Word16 -> Builder
Encode a Word16 in little endian format.
word32LE :: Word32 -> Builder
Encode a Word32 in little endian format.
word64LE :: Word64 -> Builder
Encode a Word64 in little endian format.
floatLE :: Float -> Builder
Encode a Float in little endian format.
doubleLE :: Double -> Builder
Encode a Double in little endian format.
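The difference between the big endian and little endian encoders is easiest to see on concrete bytes. The following sketch encodes a made-up record header both ways; headerBE, headerLE, bytes, and oneBE are hypothetical names, and the header layout itself is an assumption for illustration only.

```haskell
import qualified Data.ByteString.Lazy as L
import Data.Word (Word16, Word32, Word8)
import Data.ByteString.Builder
  ( Builder, floatBE, toLazyByteString
  , word16BE, word16LE, word32BE, word32LE, word8 )

-- | Hypothetical header: a tag byte, a 16-bit version, a 32-bit length.
headerBE, headerLE :: Word8 -> Word16 -> Word32 -> Builder
headerBE tag ver len = word8 tag <> word16BE ver <> word32BE len
headerLE tag ver len = word8 tag <> word16LE ver <> word32LE len

-- | The raw bytes denoted by a Builder.
bytes :: Builder -> [Word8]
bytes = L.unpack . toLazyByteString

-- | The IEEE single-precision bit pattern of 1.0, most significant byte first.
oneBE :: [Word8]
oneBE = bytes (floatBE 1.0)
```

With these definitions, bytes (headerBE 0x01 0x0203 0x04050607) is [1,2,3,4,5,6,7], while bytes (headerLE 0x01 0x0203 0x04050607) is [1,3,2,7,6,5,4]; oneBE is [63,128,0,0].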
Conversion from Char and String into Builders in various encodings.
The ASCII encoding is a 7-bit encoding. The Char7 encoding implemented here works by truncating the Unicode codepoint to 7 bits, prefixing it with a leading 0, and encoding the resulting 8 bits as a single byte. For the codepoints 0-127 this corresponds to the ASCII encoding.
char7 :: Char -> Builder
Char7 encode a Char.
string7 :: String -> Builder
Char7 encode a String.
The ISO/IEC 8859-1 encoding is an 8-bit encoding often known as Latin-1. The Char8 encoding implemented here works by truncating the Unicode codepoint to 8 bits and encoding them as a single byte. For the codepoints 0-255 this corresponds to the ISO/IEC 8859-1 encoding.
char8 :: Char -> Builder
Char8 encode a Char.
string8 :: String -> Builder
Char8 encode a String.
The UTF-8 encoding can encode all Unicode codepoints. We recommend using it always for encoding Chars and Strings unless an application really requires another encoding.
charUtf8 :: Char -> Builder
UTF-8 encode a Char.
stringUtf8 :: String -> Builder
UTF-8 encode a String.
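The truncating behaviour of Char7 and Char8 can be compared with UTF-8 on a single non-ASCII character. This sketch encodes 'λ' (U+03BB, written as '\955' to keep the source ASCII-only) with all three encoders; encodeWith and the lambda* names are hypothetical.

```haskell
import qualified Data.ByteString.Lazy as L
import Data.Word (Word8)
import Data.ByteString.Builder (Builder, char7, char8, charUtf8, toLazyByteString)

-- | The bytes produced by a character encoder for one Char.
encodeWith :: (Char -> Builder) -> Char -> [Word8]
encodeWith enc = L.unpack . toLazyByteString . enc

-- 'λ' is U+03BB, i.e. codepoint 955.
lambda7, lambda8, lambdaUtf8 :: [Word8]
lambda7    = encodeWith char7    '\955'  -- low 7 bits of 0x3BB: 0x3B
lambda8    = encodeWith char8    '\955'  -- low 8 bits of 0x3BB: 0xBB
lambdaUtf8 = encodeWith charUtf8 '\955'  -- two-byte sequence 0xCE 0xBB
```

Only lambdaUtf8 preserves the character; lambda7 and lambda8 each hold a single byte (59 and 187) from which the original codepoint cannot be recovered.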
Formatting of numbers as ASCII text.
Note that you can also use these functions for the ISO/IEC 8859-1 and UTF-8 encodings, as the ASCII encoding is equivalent on the codepoints 0-127.
Decimal encoding of numbers using ASCII encoded characters.
int8Dec :: Int8 -> Builder
Decimal encoding of an Int8 using the ASCII digits. For example:

    toLazyByteString (int8Dec 42)   = "42"
    toLazyByteString (int8Dec (-1)) = "-1"

int16Dec :: Int16 -> Builder
Decimal encoding of an Int16 using the ASCII digits.
int32Dec :: Int32 -> Builder
Decimal encoding of an Int32 using the ASCII digits.
int64Dec :: Int64 -> Builder
Decimal encoding of an Int64 using the ASCII digits.
intDec :: Int -> Builder
Decimal encoding of an Int using the ASCII digits.
integerDec :: Integer -> Builder
Decimal encoding of an Integer using the ASCII digits.
word8Dec :: Word8 -> Builder
Decimal encoding of a Word8 using the ASCII digits.
word16Dec :: Word16 -> Builder
Decimal encoding of a Word16 using the ASCII digits.
word32Dec :: Word32 -> Builder
Decimal encoding of a Word32 using the ASCII digits.
word64Dec :: Word64 -> Builder
Decimal encoding of a Word64 using the ASCII digits.
wordDec :: Word -> Builder
Decimal encoding of a Word using the ASCII digits.
floatDec :: Float -> Builder
Currently slow. Decimal encoding of an IEEE Float.
doubleDec :: Double -> Builder
Currently slow. Decimal encoding of an IEEE Double.
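For integers, the decimal encoders are meant as a direct replacement for going through show. The sketch below checks on a range of values that both routes denote the same bytes; viaShow, direct, and agree are hypothetical names used only for this comparison.

```haskell
import qualified Data.ByteString.Lazy as L
import Data.ByteString.Builder (intDec, string7, toLazyByteString)

-- | Decimal encoding via an intermediate String.
viaShow :: Int -> L.ByteString
viaShow = toLazyByteString . string7 . show

-- | Decimal encoding written directly by the Builder.
direct :: Int -> L.ByteString
direct = toLazyByteString . intDec

-- | True if both encodings agree on every given value.
agree :: [Int] -> Bool
agree = all (\n -> viaShow n == direct n)
```

A call such as agree [-1000 .. 1000] is expected to be True; the two functions differ only in how the bytes are produced, not in the bytes themselves.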
Encoding positive integers as hexadecimal numbers using lower-case ASCII characters. The shortest possible representation is used. For example,

    >>> toLazyByteString (word16Hex 0x0a10)
    Chunk "a10" Empty

Note that there is no support for using upper-case characters. Please contact the maintainer if your application cannot work without hexadecimal encodings that use upper-case characters.
word8Hex :: Word8 -> Builder
Shortest hexadecimal encoding of a Word8 using lower-case characters.
word16Hex :: Word16 -> Builder
Shortest hexadecimal encoding of a Word16 using lower-case characters.
word32Hex :: Word32 -> Builder
Shortest hexadecimal encoding of a Word32 using lower-case characters.
word64Hex :: Word64 -> Builder
Shortest hexadecimal encoding of a Word64 using lower-case characters.
wordHex :: Word -> Builder
Shortest hexadecimal encoding of a Word using lower-case characters.
int8HexFixed :: Int8 -> Builder
Encode an Int8 using 2 nibbles (hexadecimal digits).
int16HexFixed :: Int16 -> Builder
Encode an Int16 using 4 nibbles.
int32HexFixed :: Int32 -> Builder
Encode an Int32 using 8 nibbles.
int64HexFixed :: Int64 -> Builder
Encode an Int64 using 16 nibbles.
word8HexFixed :: Word8 -> Builder
Encode a Word8 using 2 nibbles (hexadecimal digits).
word16HexFixed :: Word16 -> Builder
Encode a Word16 using 4 nibbles.
word32HexFixed :: Word32 -> Builder
Encode a Word32 using 8 nibbles.
word64HexFixed :: Word64 -> Builder
Encode a Word64 using 16 nibbles.
floatHexFixed :: Float -> Builder
Encode an IEEE Float using 8 nibbles.
doubleHexFixed :: Double -> Builder
Encode an IEEE Double using 16 nibbles.
byteStringHex :: ByteString -> Builder
Encode each byte of a ByteString using its fixed-width hex encoding.
lazyByteStringHex :: ByteString -> Builder
Encode each byte of a lazy ByteString using its fixed-width hex encoding.
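The shortest and fixed-width hexadecimal encoders can be contrasted on a few small inputs. In this sketch, hex and the four example values are hypothetical names; the inputs are chosen so that the leading-zero behaviour is visible.

```haskell
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy.Char8 as LC
import Data.ByteString.Builder
  ( Builder, byteStringHex, int8HexFixed, toLazyByteString
  , word16Hex, word16HexFixed )

-- | The ASCII text produced by a Builder, as a String for easy inspection.
hex :: Builder -> String
hex = LC.unpack . toLazyByteString

shortest, fixed, negative, dump :: String
shortest = hex (word16Hex 0x0a10)                    -- "a10": leading zero dropped
fixed    = hex (word16HexFixed 0x0a10)               -- "0a10": always 4 nibbles
negative = hex (int8HexFixed (-1))                   -- "ff": two's complement byte
dump     = hex (byteStringHex (S.pack [0, 255, 16])) -- "00ff10": 2 nibbles per byte
```

The fixed-width variants are the ones to use when the output must be parsed back by position, since every value then occupies a known number of characters.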
https://downloads.haskell.org/~ghc/8.8.3/docs/html/libraries/bytestring-0.10.10.0/Data-ByteString-Builder.html