正确性:它应当返回被测试文件的正确的字符数、单词数和行数。
速度(真实世界的时间):与wc的执行时间相比是快是慢?
最大常驻内存量:最多使用多少内存?内存使用量是常量还是线性的,或者是其他?
1stupid :: FilePath -> IO (Int, Int, Int)
2stupid fp = do
3 contents <- readFile fp
4 return (length s, length (words s), length (lines s))
很不错&#xff0c;这段代码能正常运行&#xff0c;并且能获得与wc相同的结果——如果你愿意等的话。而我在测试大文件时开始不耐烦&#xff08;它需要几分钟的时间&#xff09;&#xff0c;但在小文件&#xff08;90MB&#xff09;上的测试结果如下&#xff1a;
1import Data.List2import Data.Char34simpleFold :: FilePath -> IO (Int, Int, Int)5simpleFold fp &#61; do6 countFile <$> readFile fp78countFile :: String -> (Int, Int, Int)9countFile s &#61;
10 let (cs, ws, ls, _) &#61; foldl&#39; go (0, 0, 0, False) s
11 in (cs, ws, ls)
12 where
13 go :: (Int, Int, Int, Bool) -> Char -> (Int, Int, Int, Bool)
14 go (cs, ws, ls, wasSpace) c &#61;
15 let addLine | c &#61;&#61; &#39;\n&#39; &#61; 1
16 | otherwise &#61; 0
17 addWord | wasSpace &#61; 0
18 | isSpace c &#61; 1
19 | otherwise &#61; 0
20 in (cs &#43; 1, ws &#43; addWord, ls &#43; addLine, isSpace c)
结果这个版本遇到了更严重的问题&#xff01;
程序运行花了几分钟&#xff0c;内存占用迅速超过了3GB&#xff01;为什么会这样呢&#xff1f;我们使用的是严格版本的foldl&#39;&#xff08;后面的撇号 &#39; 表示它是严格的&#xff09;&#xff0c;但它只在“Weak Head Normal Form”&#xff08;WHNF&#xff09;中是严格的&#xff0c;也就是说&#xff0c;它在元组累加器中是严格的&#xff0c;但在实际的值中不是严格的&#xff01;
这很讨厌&#xff0c;因为这意味着我们构造了一大堆巨大的加法操作&#xff0c;但只有在整个文件遍历结束后才进行求值&#xff01;有时候&#xff0c;懒惰求值就会像这样偷偷地给我们挖坑。如果不注意&#xff0c;这种内存泄漏很容易就会搞垮你的Web服务器。
1{-# LANGUAGE BangPatterns #-}23...4 go :: (Int, Int, Int, Bool) -> Char -> (Int, Int, Int, Bool)5 go (!cs, !ws, !ls, !wasSpace) c &#61;6 let addLine | c &#61;&#61; &#39;\n&#39; &#61; 17 | otherwise &#61; 08 addWord | wasSpace &#61; 09 | isSpace c &#61; 1
10 | otherwise &#61; 0
11 in (cs &#43; 1, ws &#43; addWord, ls &#43; addLine, isSpace c)
这一点小改动带来了近乎疯狂的性能提升。新的性能数据如下&#xff1a;
90MB测试文件
1import Data.Char2import qualified Data.ByteString.Lazy.Char8 as BS34simpleFold :: FilePath -> IO (Int, Int, Int)5simpleFold fp &#61; do6 simpleFoldCountFile <$> BS.readFile fp78simpleFoldCountFile :: BS.ByteString -> (Int, Int, Int)9simpleFoldCountFile s &#61;
10 let (cs, ws, ls, _) &#61; BS.foldl&#39; go (0, 0, 0, False) s
11 in (cs, ws, ls)
12 where
13 go :: (Int, Int, Int, Bool) -> Char -> (Int, Int, Int, Bool)
14 go (!cs, !ws, !ls, !wasSpace) c &#61;
15 let addLine | c &#61;&#61; &#39;\n&#39; &#61; 1
16 | otherwise &#61; 0
17 addWord | wasSpace &#61; 0
18 | isSpace c &#61; 1
19 | otherwise &#61; 0
20 in (cs &#43; 1, ws &#43; addWord, ls &#43; addLine, isSpace c)
这一点小改动将运行时间缩短到了将近一半&#xff01;
90MB测试文件
1data CharType &#61; IsSpace | NotSpace
2 deriving Show
3
4data Flux &#61;
5 Flux !CharType
6 {-# UNPACK #-} !Int
7 !CharType
8 | Unknown
9 deriving Show
这些类型只有在统计单词数时才需要。
1instance Semigroup Flux where
2 Unknown <> x &#61; x
3 x <> Unknown &#61; x
4 Flux l n NotSpace <> Flux NotSpace n&#39; r &#61; Flux l (n &#43; n&#39; - 1) r
5 Flux l n _ <> Flux _ n&#39; r &#61; Flux l (n &#43; n&#39;) r
6
7instance Monoid Flux where
8 mempty &#61; Unknown
这里的Unknown构造函数表示Monoidal幺元&#xff0c;实际上我们可以不用它&#xff0c;而是用Maybe将Semigroupo提升为Monoid&#xff0c;但Maybe会给半群添加操作带来不必要的懒惰性&#xff01;所以为了简单起见&#xff0c;我只是将其定义为类型的一部分。
1flux :: Char -> Flux
2flux c | isSpace c &#61; Flux IsSpace 0 IsSpace
3 | otherwise &#61; Flux NotSpace 1 NotSpace
这很简单&#xff0c;非空格字符统计为“单词”&#xff0c;所谓单词就是以非空格开始并结束&#xff0c;所谓空白&#xff0c;就是一个长度为零的单词&#xff0c;两侧被空格字符包围。
1>>> foldMap flux "testing one two three"
2Flux NotSpace 4 NotSpace
3
4>>> foldMap flux "testing on" <> foldMap flux "e two three"
5Flux NotSpace 4 NotSpace
6
7>>> foldMap flux "testing one " <> foldMap flux " two three"
8Flux NotSpace 4 NotSpace
似乎能正常工作&#xff01;
1data Counts &#61;2 Counts { charCount :: {-# UNPACK #-} !Int34 , wordCount :: !Flux5 , lineCount :: {-# UNPACK #-} !Int6 }7 deriving (Show)89instance Semigroup Counts where
10 (Counts a b c) <> (Counts a&#39; b&#39; c&#39;) &#61; Counts (a &#43; a&#39;) (b <> b&#39;) (c &#43; c&#39;)
11
12instance Monoid Counts where
13 mempty &#61; Counts 0 mempty 0
没问题&#xff01;类似地&#xff0c;我们需要将单个字符变成Counts对象&#xff1a;
1countChar :: Char -> Counts
2countChar c &#61;
3 Counts { charCount &#61; 1
4 , wordCount &#61; flux c
5 , lineCount &#61; if (c &#61;&#61; &#39;\n&#39;) then 1 else 0
6 }
尝试一下&#xff1a;
1>>> foldMap countChar "one two\nthree"
2Counts {charCount &#61; 13, wordCount &#61; Flux NotSpace 3 NotSpace, lineCount &#61; 1}
看起来不错&#xff01;你可以用喜欢的内容来证实这个幺半群是正确的。
1module MonoidBSFold where23import Data.Char4import qualified Data.ByteString.Lazy.Char8 as BS56monoidBSFold :: FilePath -> IO Counts7monoidBSFold paths &#61; monoidFoldFile <$> BS.readFile fp89monoidFoldFile :: BS.ByteString -> Counts
10monoidFoldFile &#61; BS.foldl&#39; (\a b -> a <> countChar b) mempty
我们将一部分复杂的内容移动到了Counts类型中&#xff0c;这样能大幅简化实现。一般来说这样做很好&#xff0c;因为测试单一数据类型比测试每个使用fold的地方要容易得多。
1monoidBSFold :: FilePath -> IO Counts
2monoidBSFold paths &#61; monoidBSFoldFile <$> BS.readFile fp
3{-# INLINE monoidBSFold #-}
4
5monoidBSFoldFile :: BS.ByteString -> Counts
6monoidBSFoldFile &#61; BS.foldl&#39; (\a b -> a <> countChar b) mempty
7{-# INLINE monoidBSFoldFile #-}
8
我还给countChar和flux函数添加了INLINE。我们来看看有没有效果&#xff1a;
90MB测试文件
543MB测试文件
1import Types2import Control.Monad3import Data.Traversable4import Data.Bits5import GHC.Conc (numCapabilities)6import Control.Concurrent.Async7import Data.Foldable8import System.IO9import System.Posix.Files
10import qualified Data.ByteString.Lazy.Char8 as BL
11import Data.ByteString.Internal (c2w)
12import GHC.IO.Handle
13
14multiCoreCount :: FilePath -> IO Counts
15multiCoreCount fp &#61; do
16 putStrLn ("Using available cores: " <> show numCapabilities)
17 size <- fromIntegral . fileSize <$> getFileStatus fp
18 let chunkSize &#61; fromIntegral (size &#96;div&#96; numCapabilities)
19 fold <$!> (forConcurrently [0..numCapabilities-1] $ \n -> do
20 -- Take all remaining bytes on the last capability due to integer division anomolies
21 let limiter &#61; if n &#61;&#61; numCapabilities - 1
22 then id
23 else BL.take (fromIntegral chunkSize)
24 let offset &#61; fromIntegral (n * chunkSize)
25 fileHandle <- openBinaryFile fp ReadMode
26 hSeek fileHandle AbsoluteSeek offset
27 countBytes . limiter <$!> BL.hGetContents fileHandle)
28{-# INLINE handleSplitUTF #-}
29
30countBytes :: BL.ByteString -> Counts
31countBytes &#61; BL.foldl&#39; (\a b -> a <> countChar b) mempty
32{-# INLINE countBytes #-}
33
这里涉及了很多东西&#xff0c;我尽量详细地解释一下。
543MB测试文件
输入可以是ASCII或UTF-8编码。当然还有其他流行的编码方式&#xff0c;但根据我有限的经验&#xff0c;绝大部分现代文本文件都采用两者之一。实际上&#xff0c;有许多网站都在致力于让UTF-8成为唯一的编码格式。
我们仅把ASCII中的空格和换行当做空格和换行处理&#xff1b;MONGOLIAN VOWEL SEPARATOR等字符就不考虑了。
1import Data.Bits2import Data.ByteString.Internal (c2w)3countByte :: Char -> Counts4countByte c &#61;5 Counts {6 -- Only count bytes at the START of a codepoint, not continuation bytes7 charCount &#61; if (bitAt 7 && not (bitAt 6)) then 0 else 18 , wordCount &#61; flux c9 , lineCount &#61; if (c &#61;&#61; &#39;\n&#39;) then 1 else 0
10 }
11 where
12 bitAt &#61; testBit (c2w c)
13{-# INLINE countByte #-}
这样就好了&#xff01;现在我们可以处理UTF-8和ASCII了&#xff0c;我们甚至都不需要知道处理的是什么编码&#xff0c;就能永远给出正确的结果。
543MB文件
1module Streaming where23import Types4import Data.Traversable5import GHC.Conc (numCapabilities)6import System.IO (openFile, IOMode(..))7import qualified Streamly as S8import qualified Streamly.Data.String as S9import qualified Streamly.Prelude as S
10import qualified Streamly.Internal.Memory.Array as A
11import qualified Streamly.Internal.FileSystem.Handle as FH
12
13streamingBytestream :: FilePath -> IO Counts
14streamingBytestream fp &#61; do
15 src <- openFile fp ReadMode
16 S.foldl&#39; mappend mempty
17 $ S.aheadly
18 $ S.maxThreads numCapabilities
19 $ S.mapM countBytes
20 $ FH.toStreamArraysOf 1024000 src
21 where
22 countBytes &#61;
23 S.foldl&#39; (\acc c -> acc <> countByte c) mempty
24 . S.decodeChar8
25 . A.toStream
26
27{-# INLINE streamingBytestream #-}
注意&#xff1a;这里用的streamly版本7.10是直接从Github代码库中获得的&#xff0c;很可能它很快就会被发不到hackage上。这段代码还使用了几个内部模块&#xff0c;我希望看到&#xff0c;像这段代码中的用例能够证明&#xff0c;这些模块应该暴露出来。
1FH.toStreamArraysOf 1024000 src
这一段从文件描述符中读取字节块放到Byte数组的流中。使用Byte数组可以比使用Lazy ByteString等更快&#xff01;每1MB文件内容我们会使用一个单独的数组&#xff0c;这一点你可以根据情况调整。
1S.mapM countBytes
这里使用mapM在数组上运行countBytes函数&#xff1b;countBytes本身会根据数组创建流&#xff0c;然后使用我们的幺半群字节计数器来运行流fold&#xff1a;
1countBytes &#61;
2 S.foldl&#39; (\acc c -> acc <> countByte c) mempty
3 . S.decodeChar8
4 . A.toStream
接下来&#xff0c;我们告诉streamly在数组上并行运行map&#xff0c;从而实现让每个线程处理一个1MB的文件块。这里将线程的数量限制在了核心数量。一旦读入所有数据&#xff0c;就可以立即进行处理&#xff0c;我们的统计代码没有任何阻塞的理由&#xff0c;所以增加更多的线程只会给调度器带来额外的负担而已。
1S.maxThreads numCapabilities
1S.aheadly
此时我们已经统计了1MB的输入块&#xff0c;但我们依然需要累加所有输入块。这一点可以在另一个流fold中通过mappend实现&#xff1a;
1S.foldl&#39; mappend mempty
就这些&#xff01;来看看效果吧&#xff01;
这四个Python项目&#xff0c;让你很快读懂Python&#xff01;
https://edu.csdn.net/topic/python115?utm_source&#61;csdn_bw
热 文 推 荐
☞
点击阅读原文&#xff0c;参与中国开发者现状调查问卷&#xff01;