MichaelXavier

blog $ takeWhile safeForWork thoughts_

ProjectsGitHubResumeEmailAll PostsRSS

  • Writing a Small Parser with Attoparsec

    January 20, 2012

    A word of caution: This post features code from my first and so far only parser I’ve ever written. I was able to achieve what I wanted by reading attoparsec’s documentation, which is by no means written in the manner of a tutorial. I let the types and my understanding of the various typeclasses it uses guide me. I make no claims that this is The Best Way® to do things. The parser I made is performant enough for my needs but I have done no formal benchmarks nor have I put time into optimizing it further.

    Attoparsec

    Attoparsec is a Haskell library for creating parser combinators. It is inspired by the older Parsec library and is designed with performance and efficiency in mind. Brian O’Sullivan is the creator. Brian writes some very high-quality libraries, all of which are a joy to use.

    I decided to write this because I found very few tutorial-like resources for writing parsers with Attoparsec. The documentation is good once you know a bit of what you’re doing, but there wasn’t anything to get started. I dug into the source code of another one of Brian O’Sullivan’s projects, Aeson

    The Problem

    I needed a parser for a project I’m working on, HollaBack. This service receives emails with a specified date as the mailbox, parses them and bounces the email back to you at the desired time. I got the idea from FollowUp.cc. Since their format seemed as good as any, I decided to make the parser for their date format.

    The Format

    The date format can either be a relative time or a specific date/time. A relative date/time looks like (quantity)(time keywword). Time keywords are mi, h, d, w, mo, and y. Some examples are:

    • 2d
    • 3mo
    • 45mi

    Specific date/times look like:

    • jan5
    • march14–2pm
    • friday–5am
    • sunday
    • 8am

    The Types

    These should be pretty self explanatory.

    data DayOfWeek = Monday    |
    Tuesday |
    Wednesday |
    Thursday |
    Friday |
    Saturday |
    Sunday deriving (Show, Eq)

    data Date = Date Month Int deriving (Show, Eq)

    data DateTimeSpec = RelativeDateTime TimeUnit |
    SpecificDateTime Date TimeOfDay |
    SpecificWeekdayTime DayOfWeek TimeOfDay |
    SpecificWeekday DayOfWeek |
    SpecificTime TimeOfDay deriving (Show, Eq)

    data TimeUnit = TimeUnit Integer TimeKeyword deriving (Show, Eq)

    data TimeKeyword = Minutes |
    Hours |
    Days |
    Weeks |
    Months |
    Years deriving (Show, Eq)

    The Imports

    import Control.Applicative ((<*>),
    (*>),
    (<$>),
    (<|>),
    pure)
    import qualified Data.Attoparsec.Text as A
    import qualified Data.Attoparsec.Combinator as AC
    import Data.Attoparsec.Text (Parser)
    import Data.Text (Text)

    Writing Parsers

    Parsers can be written almost entirely in terms of functions from Control.Applicative. Try LYAH for a refresher on Applicative.

    The first step is to define signatures for combinators for all the major data types:

    timeUnit :: Parser TimeUnit
    timeKeyword :: Parser TimeKeyword
    day :: Parser DayOfWeek
    time :: Parser TimeOfDay
    date :: Parser Date

    Parsing Discrete Values

    Parsing discrete values is the simplest. Lets start with day

      day :: Parser DayOfWeek
    day = monday <|>
    tuesday <|>
    -- ...
    where monday = A.stringCI "monday" *> pure Monday

    We want to match the string “monday” so we use stringCI, which does a full string, case insenstive match. stringCI has a type Text -> Parser Text. We use *> which discards the result of the first action (Parser Text), and returns the second. pure lifts the value Monday into the functor.

    Luckily for us, Parser has an instance for Alternaative which is a “monoid on applicative functors”. <|> is an associative binary operation. in the case of a paaarser, if the first parse fails, the next parser is used. So when we chain together these parsers with <|>, it will try them sequentially until one is successful.

    Adding multiple aliases for each day is easy:

        stringChoices :: [Text] -> Parser Text
    stringChoices = AC.choice . map A.stringCI

    -- ...
    where monday = stringChoices ["monday", "mon"] *> pure Monday

    Combining Parsers

    For compound types like TimeUnit, the best way is to use Applicative’s sequential application function, <*>. We’ll write a parser for each component of TimeUnit, one for the quantity and the other for the TimeKeyword.

        timeKeyword :: Parser TimeKeyword
    timeKeyword = minutes <|>
    hours <|>
    -- ..
    where minutes = A.stringCI "mi" *> pure Minutes
    hours = A.stringCI "h" *> pure Hours
    -- ..

    We can think of the TimeUnit constructor as a function that consumes arguments (I know this isn’t exactly how it works in Haskell, but I’ll describe it thusly for brevity). Thus, the type is: TimeUnit :: Integer -> TimeKeyword -> TimeUnit. Our parser will then look like this:

        timeUnit :: Parser TimeUnit
    timeUnit = TimeUnit <$> integer
    <*> timeKeyword
    where integer = toInteger <$> A.decimal

    Decimal is capable of parsing floating point numbers as well, but because we use fromIntegral, the parser will fail if it is given a floating point quantity, which is exactly what we want.

    That Was Much Easier Than Expected

    I’m really quite impressed by the ease of use for parsing simple grammars with Attoparsec. The Applicative/Alternative instances make the parsers read like BNF grammar notation and make parsing complex types a simple matter of creating smaller parsers and then composing them.

  • RememberTheMilk URL Task Bookmarklet

    December 17, 2011

    RememberTheMilk offers a bookmarklet to quickly add a task based on the page of you are currently on. For whatever reason, they do not fill in the URL for the task, only using the title of the page as the title of the task. Digging into the bookmarklet code, this was really simple to fix.

    Drag the box below into your bookmark bar to install the bookarklet.

    Add to RTM!

  • Depaginating Web Resources with Enumerators

    October 16, 2011

    Depagination Defined

    Depagination is a word I probably made up. When working on Web.GooglePlus I noticed that a lot of resources exposed in the Google+ API in their response return a “page token” to the next page, like so:

    {
    "kind": "plus#peopleFeed",
    "selfLink": string,
    "title": string,
    "nextPageToken": string,
    "items": [
    people Resource
    ]
    }

    Pagination is nice for an end user but it is typically not very useful to the user of a library. If I want to perform a search for people on Google+ with Web.GooglePlus, my application’s code shouldn’t have to carry the burden of traversing pages of data. It should consume as many results as it needs until the results run out or some applicaiton logic decides it has had enough. That’s where depagination with Enumerators comes in.

    Enter unfoldM

    The trick to depagination where the next page token is included alongside the current page is that you must maintain state to know when to stop enumerating and to have the token to get the next page.

    Data.Enumerator.List provides unfoldM which has the type:

    unfoldM :: Monad m => (s -> m (Maybe (a, s))) -> s -> Enumerator a m b

    unfoldM takes a function which, given a state, either produces a result and the modified state or returns Nothing, thus terminating the Enumerator’s stream.

    For my specific case, unfoldM wasn’t quite what I needed. It would yield a list of results at a time as a single Chunk per page. Instead, it would make more sense for each chunk to be a list of results, say [Person] or [Activity] in terms of Google+. For this, I needed to make a slight modification to unfoldM:

    -- Exactly the same as unfoldM but takes the result of the stateful function
    -- and uses it as the chunks, rather than a Chunks with a singleton list
    unfoldListM :: Monad m => (s -> m (Maybe ([a], s)))
    -> s
    -> Enumerator a m b
    unfoldListM f = checkContinue1 $ \loop s k -> do
    fs <- lift (f s)
    case fs of
    Nothing -> continue k
    Just (as, s') -> k (Chunks as) >>== loop s'

    With that out of the way, I was able to create a generic depaginator for any paginated resource in the Google+ API. First, the relevant types:

    type PageToken             = Text
    data DepaginationState = FirstPage |
    MorePages PageToken |
    NoMorePages

    simpleDepaginator :: Monad m => (DepaginationState -> m (Maybe ([a], DepaginationState)))
    -> Enumerator a m b
    simpleDepaginator depaginate = unfoldListM depaginate FirstPage

    simpleDepaginator sets up the depagination by initializing the state to being on the first page. The first page is a special case where there is no previous page token. It takes the step as an argument which fetches the next page and mutates the state. Here’s what the generic depagination step looks like:

    paginatedState :: (a, Maybe PageToken)
    -> (a, DepaginationState)
    paginatedState (results, token) = (results, maybe NoMorePages MorePages token)

    simpleDepaginationStep :: FromJSON a => Integer
    -> Ascii
    -> Query
    -> DepaginationState
    -> GooglePlusM (Maybe ([a], DepaginationState))
    simpleDepaginationStep perPage pth params FirstPage = (return . fmap paginatedState) =<< simpleGetFirstPage perPage pth params
    simpleDepaginationStep perPage pth params (MorePages tok) = (return . fmap paginatedState) =<< simpleGetPage perPage (Just tok) pth params
    simpleDepaginationStep _ _ _ NoMorePages = return Nothing

    Simple as can be. All 3 cases of the state are handled. In the first two, there is more data to get and thus an HTTP GET is performed. If we are on page 2 or higher, it includes the token. If there are no more pages, the depagination step returns Nothing, which terminates the stream of data.

    paginatedState simply looks at the presence or absence of the current request’s page token to determine if the enumerator should continue requesting pages or not.

    Cost/Benefit of this approach

    The process of coming up with this abstraction results in a nice way to yield results of the API calls before having gotten all the pages. This is the inherant benefit of designing for the Enumerator interface. The application using the library does not necessarily have to hold all results in memory, nor wait for them all to be fetched to deal with them, nor concern itself with the intricacies of paginating the requests.

    Also, using enumerators makes defining non-enumerator interfaces dead simple. You just consume the enumerator:

    import qualified Data.Enumerator.List as EL

    enumActivities :: PersonID -- ^ Feed owner ID
    -> ActivityCollection -- ^ Indicates what type of feed to retrieve
    -> Maybe Integer -- ^ Page size. Should be between 1 and 100. Defualt 20
    -> Enumerator Activity GooglePlusM b
    enumActivities pid coll perPage = simpleDepaginator depaginate
    where depaginate = -- ...


    getActivities :: PersonID -- ^ Feed owner ID
    -> ActivityCollection -- ^ Indicates what type of feed to retrieve
    -> GooglePlusM [Activity]
    getActivities pid coll = run_ $ enumActivities pid coll (Just 100) $ EL.consume

    getActivities will then consume pages of 100 activities at a time (the maximum) and will not yield a result until it hits the end. If the user wants to do something quick and dirty, just slurping the entire resource without having to worry about enumerators is an attractive option.

    The main fault I can see with using unfoldM is that it can be lossy and can send some false signals if you use it on an API which may change its data format at any time. For instance, I’ve coded my datatypes against the Google+ specs and I’m fairly confident I’ve gotten all the fields right. This includes handling fields which may possibly be absent such as email addresses, URLs, etc. However, because unfoldM is terminated with a Nothing, a fatal parse error to the consumer is indistinguishable from hitting the last page, it terminates the stream and says nothing more. If the resource you are consuming is more error prone than most, it may be a good idea to roll your own unfoldM which uses Either to distinguish between normal termination and termination due to error.

    Overall I really like this pattern and will probably use it on any future web API projects I do that require consuming a paginated resource.

  • find_in_batches And the Query Cache

    August 25, 2011

    I should preface this post by saying that may only be particular to Rails 2.3.X. Unfortunately, all of the major projects at work use Rails 2 so I haven’t had much work with large datasets using the latest version of ActiveRecord

    Use Case for find_in_batches and find_each

    We use find_in_batches and find_each a lot at work for oneoff scripts and large scale data transforms. A good example may be modifying several thousand products in the database. If you were to simply execute a Product.all.each could very easily bring your production server to its knees and make you feel quite foolish. Instead, you would do something like:

      Product.some_scope.find_each do |product|
    # do stuff with each product
    end

    You can read the documentation for the details, but the gist is that Rails will handle the pagination on the backend, fetching 1000 records at a time.

    find_in_batches is the generic case for this and will yield in each batch rather than each product in the batch.

    These methods allow you to work with a large number of records without having to load all of them into memory at once. Production servers do not enjoy swapping.

    Unreasonable Default?

    Some time ago I had learned a lesson that I recently forgot. ActiveRecord in 2.3.X implements a query cache which is pretty much a Hash keyed on the SQL. This behavior is ideal for handling web requests. It is reasonable behavior in the context of a single web request to produce the same data when given the same query. If you load some data in the beginning of handling a request, for most applications it is perfectly sane (and desirable) to just cache that result for the duration of the request in case the same query is made.

    However, most use cases for find_in_batches seem to be for maintenance and administation scripts that do not coincide with a user request. It was a bit of a surpise to me to find out that the query cache is still used in find_in_batches and find_each. Because of the paginating that these methods do, you will most likely not execute the same query twice unless you do the same find_each or find_in_batches in the same request/script. Also, because the need for these methods is predicated on trying to prevent large datasets from exhausting memory, if any methods should skip the ActiveRecord query cache it would be these two.

    Solution

    ActiveRecord::Base has an instance method called uncached which takes a block. ANy code executed in this block will do so without writing to or reading from the query cache. You could use it like:

      Product.uncached do
    Product.find_each do |product|
    # do stuff with each product
    end
    end

    I’m not sure how I feel about Rails omitting this optimization. On one hand, I have a hard time imaginging a case where you‘d use either one of these two methods where the query cache would do anything but eat up memory for no reason. On the other hand, the Rails guys could easily argue that it is not Rails’ business how you use find_in_batches and therefor it should not try to optimize the code for you. That is a certainly defensible stance. A mention of being wary of the query cache when using this method would probably satisfy everyone.

  • Lenovo T420s with 7mm Crucial M4 Mod

    August 25, 2011

    Last week I received a much-anticipated replacement for my old development notebook (Toshiba Satellite Pro L300) for a new Lenovo T420s. It has been a good 4 years or so since I’ve upgraded. My reasons for ditching the Toshiba were:

    1. Maxed out the memory at 2GB. Operating with > 10 Chrome tabs (as I often do) and working on Rails apps or compiling Haskell code pushed me into a thrashing state constantly.
    2. Battery life had diminished to a pathetic 10–15 minutes.
    3. Abysmal hardware support under Linux: screen brightness controls didn’t work, fans kicked in quite late and stayed on seemingly forever. In fact, most Fn keys were just non functional. Volume control also didn’t work.
    4. While performing fairly well for several years, this wasn’t even a very good laptop when I bought it.
    5. Aside from the worn out battery, it weighed almost 6 pounds which discouraged me from taking it places.

    After much deliberation, I decided to go with the T420s. Some factors influencing my decision: 1. Good reputation for strong build quality and quality design (internal roll cage, excellent keyboard, etc). 2. Reports of great linux support. 3. A price tag competitive with all the shiny, crapware-encumbered, consumer-grade notebooks from the other manufacturers. 4. Very compelling hardware specs. Having a quad core i5 on a development notebook is quite an advantage. 5. The T420s model is surprisingly lightweight and thin.

    Unboxing

    Please excuse the terrible photos that I took on my cell phone.

    The box acquired quite a few labels coming from China.

    The box acquired quite a few labels coming from China.

    The 64-watt AC Adapter and Battery

    The 64-watt AC Adapter and Battery

    The T420s is quite sleek

    The T420s is quite sleek

    7mm Crucial M4 Mod

    One of the dilemmas I came across when researching this laptop is that it uses the “new” (read nonstandard) 7mm height, 2.5" hard drive dimensions. Most SSD hard drives these days seem to be 9.5mm, with the exception of the older Intel X–25Ms, which are hard to come by and overly expensive. Crucial’s C300 was confirmed to be moddable to 7mm in height by removing the top panel and removing the 2.5mm plastic spacer. The problem with the C300 is that it is the previous generation of hard drives. I could not find confirmation that the newer Crucial M4 could be modded in a similar manner.

    Operating on innuendo and pictures of the product indicating that there was indeed some sort of spacer, I purchased a 64GB Crucial M4 Model CT064M4SSD2. To perfomr this mod, I also needed some electrical tape (I promise, this isn’t as ghetto-rigged as it sounds), and 4 M2x3mm part number 10124 screws from LaptopScrews.com. I needed these shorter screws so that the panel can be re-affixed to the rest of the drive in the absence of the spacer which necessitated the long screws.

    Steps to Mod the M4

    1. Remove the top panel carefully by unscrewing the 4 long screws. Don’t lose them in case you have to RMA the drive.
    2. Remove the spacer and set aside.
    3. Place tape on the inside of the top panel on all the sides except the one with the SATA/power connectors. I put down about 3 layers on each of the sides.
    4. Carefully line up the top panel to the screw holes on the drive and use your new screws to screw it in.
    5. Mount the drive in the hard drive caddy, and then mount that in the laptop.
    M4 with the top panel removed showing the spacer

    M4 with the top panel removed showing the spacer

    Without the spacer. Note how the PCB sticks out

    Without the spacer. Note how the PCB sticks out

    Inside of the panel taped. Scuffs were already there, I think

    Inside of the panel taped. Scuffs were already there, I think

    M4 after being modded and closed back up

    M4 after being modded and closed back up

    Notes About the Mod

    Unlike the Crucial C300, there was no warranty sticker covering part of the spacer. This means that as long as you’re careful about your screws and remove any incriminating evidence, you should be able to RMA this drive if something goes wrong.

    Regarding the electrical tape: on 128GB and higher versions of the drive, there appear to be some flash memory chips on the side of the PCB facing the panel. For these models, you should be able to apply smaller pieces of electrical tape on the chips rather than the panel, since they will stick out from the board. Be sure you’re using electrical tape as it is designed not to conduct electricity. I used the tape as a precaution since the board naturally sticks up on one side and the top panel is so thin, there is very little clearance from the PCB. The last thing I wanted was to short out the hard drive by touching it with the back panel. The PCB is very smooth with no pins sticking out on the panel’s side, so I didn’t have to tape the entire thing down.

    Note that when you close the drive back up, the panel might be very slightly distended near the middle of the drive. This didn’t seem to cause any problems. It may be due to the additional room required for 2/3 layers of electrical tape. It still fit perfectly into the laptop so I wasn’t too concerned about it.

    Installing Arch Linux

    I don’t really have anything special to say here. Everything worked as normal and all of the Fn keys seem to work well under Arch. The laptop stays quite cool. I am also quite happy to have discovered that Lenovo made the air intake and exhaust on the sides of the chassis instead of putting the intake on the bottom, which seems to be the single most boneheaded and pervasive mistake common to most laptops.

    Installing Arch Linux

    Installing Arch Linux

    Having a good old fashioned laptop party with my girlfriend afterwards

    Having a good old fashioned laptop party with my girlfriend afterwards

Copyright © 2010-2011 Michael Xavier
Site generated with hakyll.