A word of caution: This post features code from my first and so far only parser I’ve ever written. I was able to achieve what I wanted by reading attoparsec’s documentation, which is by no means written in the manner of a tutorial. I let the types and my understanding of the various typeclasses it uses guide me. I make no claims that this is The Best Way® to do things. The parser I made is performant enough for my needs but I have done no formal benchmarks nor have I put time into optimizing it further.
Attoparsec is a Haskell library for creating parser combinators. It is inspired by the older Parsec library and is designed with performance and efficiency in mind. Brian O’Sullivan is the creator. Brian writes some very high-quality libraries, all of which are a joy to use.
I decided to write this because I found very few tutorial-like resources for writing parsers with Attoparsec. The documentation is good once you know a bit of what you’re doing, but there wasn’t anything to get started. I dug into the source code of another one of Brian O’Sullivan’s projects, Aeson
I needed a parser for a project I’m working on, HollaBack. This service receives emails with a specified date as the mailbox, parses them and bounces the email back to you at the desired time. I got the idea from FollowUp.cc. Since their format seemed as good as any, I decided to make the parser for their date format.
The date format can either be a relative time or a specific date/time. A relative date/time looks like (quantity)(time keywword). Time keywords are mi, h, d, w, mo, and y. Some examples are:
Specific date/times look like:
These should be pretty self explanatory.
data DayOfWeek = Monday |
Tuesday |
Wednesday |
Thursday |
Friday |
Saturday |
Sunday deriving (Show, Eq)
data Date = Date Month Int deriving (Show, Eq)
data DateTimeSpec = RelativeDateTime TimeUnit |
SpecificDateTime Date TimeOfDay |
SpecificWeekdayTime DayOfWeek TimeOfDay |
SpecificWeekday DayOfWeek |
SpecificTime TimeOfDay deriving (Show, Eq)
data TimeUnit = TimeUnit Integer TimeKeyword deriving (Show, Eq)
data TimeKeyword = Minutes |
Hours |
Days |
Weeks |
Months |
Years deriving (Show, Eq)
import Control.Applicative ((<*>),
(*>),
(<$>),
(<|>),
pure)
import qualified Data.Attoparsec.Text as A
import qualified Data.Attoparsec.Combinator as AC
import Data.Attoparsec.Text (Parser)
import Data.Text (Text)
Parsers can be written almost entirely in terms of functions from Control.Applicative. Try LYAH for a refresher on Applicative.
The first step is to define signatures for combinators for all the major data types:
timeUnit :: Parser TimeUnit
timeKeyword :: Parser TimeKeyword
day :: Parser DayOfWeek
time :: Parser TimeOfDay
date :: Parser Date
Parsing discrete values is the simplest. Lets start with day
day :: Parser DayOfWeek
day = monday <|>
tuesday <|>
-- ...
where monday = A.stringCI "monday" *> pure Monday
We want to match the string “monday” so we use stringCI, which does a full string, case insenstive match. stringCI has a type Text -> Parser Text. We use *> which discards the result of the first action (Parser Text), and returns the second. pure lifts the value Monday into the functor.
Luckily for us, Parser has an instance for Alternaative which is a “monoid on applicative functors”. <|> is an associative binary operation. in the case of a paaarser, if the first parse fails, the next parser is used. So when we chain together these parsers with <|>, it will try them sequentially until one is successful.
Adding multiple aliases for each day is easy:
stringChoices :: [Text] -> Parser Text
stringChoices = AC.choice . map A.stringCI
-- ...
where monday = stringChoices ["monday", "mon"] *> pure Monday
For compound types like TimeUnit, the best way is to use Applicative’s sequential application function, <*>. We’ll write a parser for each component of TimeUnit, one for the quantity and the other for the TimeKeyword.
timeKeyword :: Parser TimeKeyword
timeKeyword = minutes <|>
hours <|>
-- ..
where minutes = A.stringCI "mi" *> pure Minutes
hours = A.stringCI "h" *> pure Hours
-- ..
We can think of the TimeUnit constructor as a function that consumes arguments (I know this isn’t exactly how it works in Haskell, but I’ll describe it thusly for brevity). Thus, the type is: TimeUnit :: Integer -> TimeKeyword -> TimeUnit. Our parser will then look like this:
timeUnit :: Parser TimeUnit
timeUnit = TimeUnit <$> integer
<*> timeKeyword
where integer = toInteger <$> A.decimal
Decimal is capable of parsing floating point numbers as well, but because we use fromIntegral, the parser will fail if it is given a floating point quantity, which is exactly what we want.
I’m really quite impressed by the ease of use for parsing simple grammars with Attoparsec. The Applicative/Alternative instances make the parsers read like BNF grammar notation and make parsing complex types a simple matter of creating smaller parsers and then composing them.
RememberTheMilk offers a bookmarklet to quickly add a task based on the page of you are currently on. For whatever reason, they do not fill in the URL for the task, only using the title of the page as the title of the task. Digging into the bookmarklet code, this was really simple to fix.
Drag the box below into your bookmark bar to install the bookarklet.
Depagination is a word I probably made up. When working on Web.GooglePlus I noticed that a lot of resources exposed in the Google+ API in their response return a “page token” to the next page, like so:
{
"kind": "plus#peopleFeed",
"selfLink": string,
"title": string,
"nextPageToken": string,
"items": [
people Resource
]
}
Pagination is nice for an end user but it is typically not very useful to the user of a library. If I want to perform a search for people on Google+ with Web.GooglePlus, my application’s code shouldn’t have to carry the burden of traversing pages of data. It should consume as many results as it needs until the results run out or some applicaiton logic decides it has had enough. That’s where depagination with Enumerators comes in.
The trick to depagination where the next page token is included alongside the current page is that you must maintain state to know when to stop enumerating and to have the token to get the next page.
Data.Enumerator.List provides unfoldM which has the type:
unfoldM :: Monad m => (s -> m (Maybe (a, s))) -> s -> Enumerator a m b
unfoldM takes a function which, given a state, either produces a result and the modified state or returns Nothing, thus terminating the Enumerator’s stream.
For my specific case, unfoldM wasn’t quite what I needed. It would yield a list of results at a time as a single Chunk per page. Instead, it would make more sense for each chunk to be a list of results, say [Person] or [Activity] in terms of Google+. For this, I needed to make a slight modification to unfoldM:
-- Exactly the same as unfoldM but takes the result of the stateful function
-- and uses it as the chunks, rather than a Chunks with a singleton list
unfoldListM :: Monad m => (s -> m (Maybe ([a], s)))
-> s
-> Enumerator a m b
unfoldListM f = checkContinue1 $ \loop s k -> do
fs <- lift (f s)
case fs of
Nothing -> continue k
Just (as, s') -> k (Chunks as) >>== loop s'
With that out of the way, I was able to create a generic depaginator for any paginated resource in the Google+ API. First, the relevant types:
type PageToken = Text
data DepaginationState = FirstPage |
MorePages PageToken |
NoMorePages
simpleDepaginator :: Monad m => (DepaginationState -> m (Maybe ([a], DepaginationState)))
-> Enumerator a m b
simpleDepaginator depaginate = unfoldListM depaginate FirstPage
simpleDepaginator sets up the depagination by initializing the state to being on the first page. The first page is a special case where there is no previous page token. It takes the step as an argument which fetches the next page and mutates the state. Here’s what the generic depagination step looks like:
paginatedState :: (a, Maybe PageToken)
-> (a, DepaginationState)
paginatedState (results, token) = (results, maybe NoMorePages MorePages token)
simpleDepaginationStep :: FromJSON a => Integer
-> Ascii
-> Query
-> DepaginationState
-> GooglePlusM (Maybe ([a], DepaginationState))
simpleDepaginationStep perPage pth params FirstPage = (return . fmap paginatedState) =<< simpleGetFirstPage perPage pth params
simpleDepaginationStep perPage pth params (MorePages tok) = (return . fmap paginatedState) =<< simpleGetPage perPage (Just tok) pth params
simpleDepaginationStep _ _ _ NoMorePages = return Nothing
Simple as can be. All 3 cases of the state are handled. In the first two, there is more data to get and thus an HTTP GET is performed. If we are on page 2 or higher, it includes the token. If there are no more pages, the depagination step returns Nothing, which terminates the stream of data.
paginatedState simply looks at the presence or absence of the current request’s page token to determine if the enumerator should continue requesting pages or not.
The process of coming up with this abstraction results in a nice way to yield results of the API calls before having gotten all the pages. This is the inherant benefit of designing for the Enumerator interface. The application using the library does not necessarily have to hold all results in memory, nor wait for them all to be fetched to deal with them, nor concern itself with the intricacies of paginating the requests.
Also, using enumerators makes defining non-enumerator interfaces dead simple. You just consume the enumerator:
import qualified Data.Enumerator.List as EL
enumActivities :: PersonID -- ^ Feed owner ID
-> ActivityCollection -- ^ Indicates what type of feed to retrieve
-> Maybe Integer -- ^ Page size. Should be between 1 and 100. Defualt 20
-> Enumerator Activity GooglePlusM b
enumActivities pid coll perPage = simpleDepaginator depaginate
where depaginate = -- ...
getActivities :: PersonID -- ^ Feed owner ID
-> ActivityCollection -- ^ Indicates what type of feed to retrieve
-> GooglePlusM [Activity]
getActivities pid coll = run_ $ enumActivities pid coll (Just 100) $ EL.consume
getActivities will then consume pages of 100 activities at a time (the maximum) and will not yield a result until it hits the end. If the user wants to do something quick and dirty, just slurping the entire resource without having to worry about enumerators is an attractive option.
The main fault I can see with using unfoldM is that it can be lossy and can send some false signals if you use it on an API which may change its data format at any time. For instance, I’ve coded my datatypes against the Google+ specs and I’m fairly confident I’ve gotten all the fields right. This includes handling fields which may possibly be absent such as email addresses, URLs, etc. However, because unfoldM is terminated with a Nothing, a fatal parse error to the consumer is indistinguishable from hitting the last page, it terminates the stream and says nothing more. If the resource you are consuming is more error prone than most, it may be a good idea to roll your own unfoldM which uses Either to distinguish between normal termination and termination due to error.
Overall I really like this pattern and will probably use it on any future web API projects I do that require consuming a paginated resource.
I should preface this post by saying that may only be particular to Rails 2.3.X. Unfortunately, all of the major projects at work use Rails 2 so I haven’t had much work with large datasets using the latest version of ActiveRecord
We use find_in_batches and find_each a lot at work for oneoff scripts and large scale data transforms. A good example may be modifying several thousand products in the database. If you were to simply execute a Product.all.each could very easily bring your production server to its knees and make you feel quite foolish. Instead, you would do something like:
Product.some_scope.find_each do |product|
# do stuff with each product
end
You can read the documentation for the details, but the gist is that Rails will handle the pagination on the backend, fetching 1000 records at a time.
find_in_batches is the generic case for this and will yield in each batch rather than each product in the batch.
These methods allow you to work with a large number of records without having to load all of them into memory at once. Production servers do not enjoy swapping.
Some time ago I had learned a lesson that I recently forgot. ActiveRecord in 2.3.X implements a query cache which is pretty much a Hash keyed on the SQL. This behavior is ideal for handling web requests. It is reasonable behavior in the context of a single web request to produce the same data when given the same query. If you load some data in the beginning of handling a request, for most applications it is perfectly sane (and desirable) to just cache that result for the duration of the request in case the same query is made.
However, most use cases for find_in_batches seem to be for maintenance and administation scripts that do not coincide with a user request. It was a bit of a surpise to me to find out that the query cache is still used in find_in_batches and find_each. Because of the paginating that these methods do, you will most likely not execute the same query twice unless you do the same find_each or find_in_batches in the same request/script. Also, because the need for these methods is predicated on trying to prevent large datasets from exhausting memory, if any methods should skip the ActiveRecord query cache it would be these two.
ActiveRecord::Base has an instance method called uncached which takes a block. ANy code executed in this block will do so without writing to or reading from the query cache. You could use it like:
Product.uncached do
Product.find_each do |product|
# do stuff with each product
end
end
I’m not sure how I feel about Rails omitting this optimization. On one hand, I have a hard time imaginging a case where you‘d use either one of these two methods where the query cache would do anything but eat up memory for no reason. On the other hand, the Rails guys could easily argue that it is not Rails’ business how you use find_in_batches and therefor it should not try to optimize the code for you. That is a certainly defensible stance. A mention of being wary of the query cache when using this method would probably satisfy everyone.
Last week I received a much-anticipated replacement for my old development notebook (Toshiba Satellite Pro L300) for a new Lenovo T420s. It has been a good 4 years or so since I’ve upgraded. My reasons for ditching the Toshiba were:
After much deliberation, I decided to go with the T420s. Some factors influencing my decision: 1. Good reputation for strong build quality and quality design (internal roll cage, excellent keyboard, etc). 2. Reports of great linux support. 3. A price tag competitive with all the shiny, crapware-encumbered, consumer-grade notebooks from the other manufacturers. 4. Very compelling hardware specs. Having a quad core i5 on a development notebook is quite an advantage. 5. The T420s model is surprisingly lightweight and thin.
Please excuse the terrible photos that I took on my cell phone.

The box acquired quite a few labels coming from China.

The 64-watt AC Adapter and Battery

The T420s is quite sleek
One of the dilemmas I came across when researching this laptop is that it uses the “new” (read nonstandard) 7mm height, 2.5" hard drive dimensions. Most SSD hard drives these days seem to be 9.5mm, with the exception of the older Intel X–25Ms, which are hard to come by and overly expensive. Crucial’s C300 was confirmed to be moddable to 7mm in height by removing the top panel and removing the 2.5mm plastic spacer. The problem with the C300 is that it is the previous generation of hard drives. I could not find confirmation that the newer Crucial M4 could be modded in a similar manner.
Operating on innuendo and pictures of the product indicating that there was indeed some sort of spacer, I purchased a 64GB Crucial M4 Model CT064M4SSD2. To perfomr this mod, I also needed some electrical tape (I promise, this isn’t as ghetto-rigged as it sounds), and 4 M2x3mm part number 10124 screws from LaptopScrews.com. I needed these shorter screws so that the panel can be re-affixed to the rest of the drive in the absence of the spacer which necessitated the long screws.

M4 with the top panel removed showing the spacer

Without the spacer. Note how the PCB sticks out

Inside of the panel taped. Scuffs were already there, I think

M4 after being modded and closed back up
Unlike the Crucial C300, there was no warranty sticker covering part of the spacer. This means that as long as you’re careful about your screws and remove any incriminating evidence, you should be able to RMA this drive if something goes wrong.
Regarding the electrical tape: on 128GB and higher versions of the drive, there appear to be some flash memory chips on the side of the PCB facing the panel. For these models, you should be able to apply smaller pieces of electrical tape on the chips rather than the panel, since they will stick out from the board. Be sure you’re using electrical tape as it is designed not to conduct electricity. I used the tape as a precaution since the board naturally sticks up on one side and the top panel is so thin, there is very little clearance from the PCB. The last thing I wanted was to short out the hard drive by touching it with the back panel. The PCB is very smooth with no pins sticking out on the panel’s side, so I didn’t have to tape the entire thing down.
Note that when you close the drive back up, the panel might be very slightly distended near the middle of the drive. This didn’t seem to cause any problems. It may be due to the additional room required for 2/3 layers of electrical tape. It still fit perfectly into the laptop so I wasn’t too concerned about it.
I don’t really have anything special to say here. Everything worked as normal and all of the Fn keys seem to work well under Arch. The laptop stays quite cool. I am also quite happy to have discovered that Lenovo made the air intake and exhaust on the sides of the chassis instead of putting the intake on the bottom, which seems to be the single most boneheaded and pervasive mistake common to most laptops.

Installing Arch Linux

Having a good old fashioned laptop party with my girlfriend afterwards