Tracing Lexer memory leaks


One of the problems I’ve been struggling with for a while now is the presence of pesky memory leaks in Visual Haskell. Hs2lib has one convention, It doesn’t free any memory, and so you’re responsible for freeing all memory.

As far as I knew, I was freeing any and all pointers that I had. It should not be leaking, but yet it was. So I decided to get to the root of this problem. I wrote a simple application that uses the Lexer classes of Visual Haskell and would emulate a user scrolling by feeding it lines of a Haskell file one at a time.

Using the Debug Diagnostics Tools I was able to track the application and make a full memory dump every few seconds in order to track the progression of the leak. The results were rather surprising:

WARNING – DebugDiag was not able to locate debug symbols for HsLexer.dll, so the reported function name(s) may not be accurate.

HsLexer.dll is responsible for 11.95 MBytes worth of outstanding allocations. The following are the top 2 memory consuming functions:

HsLexer!HsEnd+1343ac4: 8.20 MBytes worth of outstanding allocations.

HsLexer!HsEnd+150ae89: 2.00 MBytes worth of outstanding allocations.

This was detected in LexerLeakTest.exe__PID__4660__Date__07_04_2011__Time_03_20_30PM__18__Leak Dump – Private Bytes.dmp

So according to this tool, my little program was leaking quite extensively and not surprisingly, it was all coming from my inside my Haskell program. Unfortunately, GHC/GCC can not produce the proper symbols (.pdb files) for any of the Microsoft debugging tools to understand. So while we know conclusively that the program is leaking, we don’t know where.

Hs2lib-Debug

This is where the new version of Hs2lib comes into play. The idea is to track any and all allocations and de-allocations made during the run of the program, in essence a simple profiler.

For this reason we now have Debug versions of the of modules. Through the magic of CPPHS, Cabal and custom preprocessors we get the usual “release” modules and “debug” modules which write the allocations to a file.

I’ll skip over the implementation of that, but the idea is to override all allocation functions with a custom one. The structure which is being written out to disk (the current version uses the rather slow show/read instances generated by GHC. I will be replacing these in the future) is:

<br />
data MemAlloc = MemAlloc { memFun   :: Caller<br />
                         , memStack :: Stack<br />
                         , memStart :: Address<br />
                         , memStop  :: Maybe Address<br />
                         , memSize  :: Maybe MemSize<br />
                         , memTime  :: String<br />
                         }<br />
              | MemFree { memStack :: Stack<br />
                        , memStart :: Address<br />
                        , memSize  :: Maybe MemSize<br />
                        , memTime  :: String<br />
                        }<br />
              deriving (Show, Read)<br />

The allocations get written out to a file called “Memory.dump”.

A piece of one such file is:

<br />
MemAlloc { memFun = Record<br />
         , memStack = Then "WinDll\\Lib\\/NativeMapping_Base.cpphs:242(peekCWString)"<br />
                     (Then "HsLexer.hs:85(fromNative)"<br />
                     (Then "HsLexer.hs:83(lexSourceStringWithExtA)"<br />
                      Empty))<br />
         , memStart = 43309024<br />
         , memStop = Just 43309026<br />
         , memSize = Just 2<br />
         , memTime = "1312779866"<br />
         }<br />
MemFree { memStack = Then "WinDll\\Lib\\/NativeMapping_Base.cpphs:246(freeCWString)"<br />
                    (Then "HsLexer.hs:85(fromNative)"<br />
                    (Then "HsLexer.hs:83(lexSourceStringWithExtA)"<br />
                     Empty))<br />
        , memStart = 43309024<br />
        , memSize = Just 2<br />
        , memTime = "1312779866"<br />
        }<br />

From this you can see that not only does it record allocation, but I’ve also implemented a simple artificial stack, that shows us where the allocation is/originated. The current implementation is rather simplistic. I will look into expanding this later, but for now it suits my needs.

For those wondering, this tracker is not enabled by default. To enable it just pass the “- -debug” flag to the call to Hs2lib. Compiling with this flag instructs the library to use the Debug version of the libraries, and changes the code generators so that they also add extra information to the generated code.

Since de-allocations also have to be tracked, using this flag also exposes a free function which should be used to free pointers. If the CallingConvention  used is stdcall then a function “freeS” is exported, if ccall then “freeC” is exported. The reason for this is because these functions are statically defined. They aren’t being generated by the tool but are instead part of the library part of the tool.

Performing analysis

Once we have a .dump file, the next step is to analyze this information. This is where the new tool Hs2lib-debug comes in. This tool replays the allocations in a pseudo heap. If all goes well, at the end the heap should be empty. If it has any entries it means we’ have a leak.

Invoking it is quite easy, just pass it as an argument the folder which contains the dump file:

Hs2lib-debug.exe -v .\MemDumps

and that’s all.

Running this on the dump file created from the lexer program returned:

*** Program starting up…

*** Reading file ‘.\MemDumps\Memory.dump’…

*** Found 31890 record(s).

*** Analyzing [********************] 100.00%

*** Found 1135 outstanding allocation(s).

1135 unfreed references found originating from HsLexer.hs:85(fromNative)\;\;\WinDll\Lib\/NativeMapping_Base.cpphs:242(peekCWString)\;

*** Cleaning up….

Done.

The messy output for the stack is a as follows: Entries in the path are separated by a ;, instead of the usual \ character.

The function the profiler pointed out is:

<br />
lexSourceStringWithExtA :: StablePtr (IORef [ExtensionFlag]) -&gt; CWString -&gt; IO ((StatelessParseResultPtr (Located Token)))<br />
lexSourceStringWithExtA a1 a2 =<br />
  do let st = newStack (__FILE__ ++ ":" ++ (show __LINE__) ++ "(lexSourceStringWithExtA)")<br />
     a1p &lt;- fromNative (pushStack st (__FILE__ ++ ":" ++ (show __LINE__) ++ "(fromNative)")) a1<br />
     a2p &lt;- fromNative (pushStack st (__FILE__ ++ ":" ++ (show __LINE__) ++ "(fromNative)")) a2<br />
     res &lt;- lexSourceStringWithExt a1p a2p<br />
     toNative st res<br />

And the exact line is

<br />
a2p &lt;- fromNative (pushStack st (__FILE__ ++ ":" ++ (show __LINE__) ++ "(fromNative)")) a2<br />

This is interesting for a couple of reasons., The profiler is saying that the pointer associated with the CWString (which is defined as a Ptr CWchar) is never freed. But why not.

The answer lies in the C# Marshaller and types being used. We are currently Marshalling C# strings using the String datatype. Strings in C# are immutable, so once the marshaller creates a wchar_t* from the string, it never worries about it again. They are strictly an in parameter.

There are two ways to solve this:

  • C# does have mutable strings, using the StringBuffer type. Using StringBuffer has the benefit that it already is implemented as a char pointer. The marshaller simply passes the pointer to the Haskell function and upon. After the function returns and the GC determines that the StringBuffer is no longer in use, it should free the memory (at least in theory).
  • Make the library just free any CWString it dereferences.

For now I’ve chosen the second approach, for no other reason other than it requires the least amount of change in existing code. In the future I’ll adopt the approach that Hs2lib will free any arguments being passed to it. I don’t know if this is the convention usually used. If someone has something against this approach I would love to hear it.

Update : There’s somewhat of a big gotcha that I’ve recently discovered. We have to remember that the String type in .NET is immutable. So when the marshaller sends it to out Haskell code, the CWString we get there is a copy of the original. We have to free this. When GC is performed in C# it won’t affect the the CWString, which is a copy.

The problem however is that when we free it in the Haskell code we can’t use freeCWString. The pointer was not allocated with C (msvcrt.dll)’s alloc. There are three ways (that I know of) to solve this.

  • use char* in your C# code instead of String when calling a Haskell function. You then have the pointer to free when you call returns, or initialize the pointer using fixed.
  • import CoTaskMemFree in Haskell and free the pointer in Haskell
  • use StringBuilder instead of String. I’m not entirely sure about this one, but the idea is that since StringBuilder is implemented as a native pointer, the Marshaller just passes this pointer to your Haskell code (which can also update it btw). When GC is performed after the call returns, the StringBuilder should be freed.

I’ve once again updated the library to not free any pointers. This prevents a nasty heap corruption. To not get memory leaks in c# you are free to choose between solution 1 & 3 presented here.

Results

With this new code in place, I once again run the profiled lexer to generate a dump. This time when I analyze the result however I get

*** Program starting up…

*** Reading file ‘.\MemDumps\Memory.dump’…

*** Found 33027 record(s).

*** Analyzing [********************] 100.00%

*** Found 0 outstanding allocation(s).

Congratulations, No memory leak(s) detected.

*** Cleaning up….

And so the memory leaks are fixed. It’s worth noting that the analyzers can be quite slow. It uses a flat LinkedList like structure and lists to do the analysis. I will in future versions be replacing these with a Tree like structure and arrays respectively.

MSBuild Part 1


I took some time this week from working on my Thesis to work on MSBuild again, I know it’s slow going, but unfortunately this project is not my highest priority task atm (I thank you all for your patience). The result is the following

The new compile and project pipeline for a .cabal file now looks as follows:

[.cabal file] <-> [IDE] <-> [Cabal2MSBuild] <-> [MSBuild]

The execution looks like

  1. You double click on a .cabal file which launches visual studio
  2. The IDE then invokes Cabal2MSBuild on the .cabal file to get a .hsproj file
  3. This .hsproj file is use strictly internally. It’s only purpose is to instruct the IDE which files to include, references, Author etc. Any changes done to these fields will be also directly written to the .cabal file via Cabal.
  4. When you want to build a project, a MSBuild task is run, this task even though defined in the .hsproj files will use no information from this file. It calls cabal-install on the original .cabal file.
  5. Errors and warnings are captured and translated into something the IDE can understand.

This is a much simpler and stabler approach than the first one which was to deeply and directly integrate .cabal files inside the IDE. This requires a immense amount of work, And I never could get it to work 100%.

Having finished my MSBuild book (which was really the only *complete and coherent* source of information on MSBuild) I was able to create the Cabal2MSBuild tool. This tool will create a .hsproj file from a .cabal file which contains all information that was present in the .cabal file but also a full listing of which files to include in the project. References etc. the last line of this file is an important line

<Import Project="Haskell.Cabal.targets"/>

This line imports a set of predefined Cabal specific tasks for MSBuild. These tasks as they’re called wrap calls to the cabal-install tool. The exposed functionalities are

Build, Run, Configure, Deploy, Clean and Update. (As a side note, this Target file can be used by any msbuild compatible build system. So this means using Team Foundation Server as a source control server and having continuous builds going on should be possible).

These Tasks are all exposed by a custom Task file “Cabal.MSBuild.Tasks.dll”

(The end of this post will contain the full project and Task files)

To show that this all works, here’s a example output

Phyx>cabal2msbuild Cabal2MSBuild.cabal Test.proj
Done.

Phyx>msbuild /target:Configure /property:CabalProjectFile="Cabal2MSBuild.cabal"
Test.proj
Microsoft (R) Build Engine Version 4.0.30319.1
[Microsoft .NET Framework, Version 4.0.30319.1]
Copyright (C) Microsoft Corporation 2007. All rights reserved.

Build started 10/12/2010 12:32:50.
Project "C:\Users\Phyx\Documents\Haskell\Cabal2MSBuild\Test.proj" on node 1 (Configure target(s)).
Configure:
  VSH2010 Configuring project file...
  VSH2010 Using Cabal 'C:\Users\Phyx\AppData\Roaming\cabal\bin\Cabal.exe'
  Running Cabal...
TSKCONFIGURE : warning : The package list for 'hackage.haskell.org' is 29 days old. [C:\Users\Phyx\Documents\Haskell\Cabal2MSBuild\Test.proj]
  Run 'cabal update' to get the latest list of available packages.
TSKCONFIGURE : warning : Cabal2MSBuild.cabal: Unknown fields: other-modules (line 17) [C:\Users\Phyx\Documents\Haskell\Cabal2MSBuild\Test.proj]
  Fields allowed in this section:
  name, version, cabal-version, build-type, license, license-file,
  copyright, maintainer, build-depends, stability, homepage,
  package-url, bug-reports, synopsis, description, category, author,
  tested-with, data-files, data-dir, extra-source-files,
  extra-tmp-files
TSKCONFIGURE : warning : Cabal2MSBuild.cabal: Unknown fields: other-modules (line 17) [C:\Users\Phyx\Documents\Haskell\Cabal2MSBuild\Test.proj]
  Fields allowed in this section:
  name, version, cabal-version, build-type, license, license-file,
  copyright, maintainer, build-depends, stability, homepage,
  package-url, bug-reports, synopsis, description, category, author,
  tested-with, data-files, data-dir, extra-source-files,
  extra-tmp-files
TSKCONFIGURE : warning : Cabal2MSBuild.cabal: Unknown fields: other-modules (line 17) [C:\Users\Phyx\Documents\Haskell\Cabal2MSBuild\Test.proj]
  Fields allowed in this section:
  name, version, cabal-version, build-type, license, license-file,
  copyright, maintainer, build-depends, stability, homepage,
  package-url, bug-reports, synopsis, description, category, author,
  tested-with, data-files, data-dir, extra-source-files,
  extra-tmp-files
TSKCONFIGURE : warning : This package indirectly depends on multiple versions of the same [C:\Users\Phyx\Documents\Haskell\Cabal2MSBuild\Test.proj]
  package. This is highly likely to cause a compile failure.
  package ghc-6.12.1 requires Cabal-1.8.0.2
  package bin-package-db-0.0.0.0 requires Cabal-1.8.0.2
  package Cabal2MSBuild-0.2.2 requires Cabal-1.9.2
  VSH2010 Configuring done.
Done Building Project "C:\Users\Phyx\Documents\Haskell\Cabal2MSBuild\Test.proj"
 (Configure target(s)).


Build succeeded.

"C:\Users\Phyx\Documents\Haskell\Cabal2MSBuild\Test.proj" (Configure target) (1) ->
(Configure target) ->
  TSKCONFIGURE : warning : The package list for 'hackage.haskell.org' is 29 days old. [C:\Users\Phyx\Documents\Haskell\Cabal2MSBuild\Test.proj]
  TSKCONFIGURE : warning : Cabal2MSBuild.cabal: Unknown fields: other-modules (line 17) [C:\Users\Phyx\Documents\Haskell\Cabal2MSBuild\Test.proj]
  TSKCONFIGURE : warning : Cabal2MSBuild.cabal: Unknown fields: other-modules (line 17) [C:\Users\Phyx\Documents\Haskell\Cabal2MSBuild\Test.proj]
  TSKCONFIGURE : warning : Cabal2MSBuild.cabal: Unknown fields: other-modules (line 17) [C:\Users\Phyx\Documents\Haskell\Cabal2MSBuild\Test.proj]
  TSKCONFIGURE : warning : This package indirectly depends on multiple versions of the same [C:\Users\Phyx\Documents\Haskell\Cabal2MSBuild\Test.proj]

    5 Warning(s)
    0 Error(s)

Time Elapsed 00:00:28.65

This exposes a few problems. While it works, the feedback is inaccurate. The problem is cabal-install has no set format for warnings and errors. They’re all freeform so MSBuild will only recognize the first line and this is due to the “warning: “ prefix of the lines. Ideally we want the entire warning since that’s what we’re going to report back to the user.

This same problem exists for GHC as well, however GHC is a bit more structured, mostly errors are pretty printed and have some kind of structure to them, which you can parse by looking at the indentation of text.

As I’m trying to modify as little as possible (It’ll be easier to maintain in the future and takes the burden of maintenance of me) I’m writing a set of parser to augment MSBuild’s build in parsers to support messages generated by cabal-install and GHC.

The current .hsproj file is a valid project file, and though I haven’t integrated it into the IDE yet, when I do, It should just work, since there is full support for MSBuild project files already there (again, less to maintain, and no reflection hacks this time).

Eventually the Cabal2MSBuild tool will be available on hackage and the build tasks on codeplex since hackage doesn’t allow binary dependencies.

And now, as promised the full files, note however that both of these files are work in progress.

Appendix A: Target file

<?xml version="1.0" encoding="UTF-8"?>

<!–

***********************************************************************************************

Haskell.Cabal.targets

WARNING:  DO NOT MODIFY this file unless you are knowledgeable about MSBuild and have

          created a backup copy.  Incorrect changes to this file will make it

          impossible to load or build your projects from the command-line or the IDE.

This file defines the steps in the standard build process for Haskell Cabal projects.

Copyright (C) Tamar Christina. All rights reserved.

***********************************************************************************************

–>

<Project DefaultTargets="Build" InitialTargets="Configure" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">

 
    <UsingTask AssemblyFile="Cabal.MSBuild.Tasks.dll" TaskName="TskRun"       />

    <UsingTask AssemblyFile="Cabal.MSBuild.Tasks.dll" TaskName="TskBuild"     />

    <UsingTask AssemblyFile="Cabal.MSBuild.Tasks.dll" TaskName="TskConfigure" />

    <UsingTask AssemblyFile="Cabal.MSBuild.Tasks.dll" TaskName="TskDeploy"    />

    <UsingTask AssemblyFile="Cabal.MSBuild.Tasks.dll" TaskName="TskClean"     />

    <UsingTask AssemblyFile="Cabal.MSBuild.Tasks.dll" TaskName="TskUpdate"    />

    <Target Name="Clean">

        <TskClean CabalFile="$(CabalProjectFile)" />

    </Target>

   
    <Target Name="Run">

        <TskRun CabalFile="$(CabalProjectFile)" />

    </Target>

   
    <Target Name="Deploy">

        <TskDeploy CabalFile="$(CabalProjectFile)" Target="$(CabalDeployedFile)" />

    </Target>

   
    <Target Name="Configure">

        <TskConfigure CabalFile="$(CabalProjectFile)" User="$(CabalUserConfigure)" />

    </Target>

   
    <Target Name="Build">

        <TskBuild CabalFile="$(CabalProjectFile)" />

    </Target>   
   
    <Target Name="Update">

        <TskUpdate/>

    </Target>

   
    <Target Name="Make">

        <TskConfigure CabalFile="$(CabalProjectFile)" User="$(CabalUserConfigure)" />

        <TskBuild CabalFile="$(CabalProjectFile)" />       
    </Target>

    <Target Name="MakeRun">

        <TskConfigure CabalFile="$(CabalProjectFile)" User="$(CabalUserConfigure)" />

        <TskBuild CabalFile="$(CabalProjectFile)" />

        <TskRun CabalFile="$(CabalProjectFile)" />

    </Target>

</Project>

Appendix B: hsproj file

<?xml version="1.0" encoding="UTF-8"?>

<Project DefaultTargets="Build" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">

<PropertyGroup>

<Configuration Condition=" ‘$(Configuration)’ == ” ">Build</Configuration>

<SchemaVersion>2.0</SchemaVersion>

<ProjectGuid>{99999999-9999-9999-9999-999999999999}</ProjectGuid>

<RootNamespace>Hs2lib</RootNamespace>

<AssemblyName>Hs2lib</AssemblyName>

<EnableUnmanagedDebugging>false</EnableUnmanagedDebugging>

<License>BSD3</License>

<LicenseFile></LicenseFile>

<Maintainer>Tamar Christina &lt;…@zhox.com&gt;</Maintainer>

<Author>Tamar Christina &lt;…@zhox.com&gt;</Author>

<Stability>experimental</Stability>

<Homepage>http://www.zhox.com/projects/haskell/hs2lib</Homepage>

<PkgUrl></PkgUrl>

<BugReports></BugReports>

<Synopsis>A Library and Preprocessor that makes it easier to create shared libs (note: only tested on windows) from Haskell programs.</Synopsis>

<Description>The supplied PreProcessor can be run over any existing source and would generate FFI code for every function marked to be exported by the special notation documented inside the package. It then proceeds to compile this generated code into a windows DLL.

The Library contains some helper code that’s commonly needed to convert between types, and contains the code for the typeclasses the PreProcessor uses in the generated code to keep things clean.

It will always generated the required C types for use when calling the dll, but it will also generate the C# unsafe code if requested.

Read http://www.zhox.com/projects/haskell/hs2lib.pdf

Current Restrictions:

– Does not automatically resolve missing datatype declarations using hackage. Future releases will search library code for the types you need to resolve this but currently you’ll get a missing instance error.

– You cannot export functions which have the same name (even if they’re in different modules because 1 big hsc file is generated at the moment, no conflict resolutions)

– You cannot export datatypes with the same name, same restriction as above.

– Does not support automatic instance generation for infix constructors yet

</Description>

<Category>Development</Category>

<DataDir></DataDir>

</PropertyGroup>

<PropertyGroup Condition=" ‘$(Configuration)’ == ‘GHCi’ ">

<DebugSymbols>true</DebugSymbols>

<OutputPath>bin\Debug</OutputPath>

<OutputType>HI</OutputType>

</PropertyGroup>

<PropertyGroup Condition=" ‘$(Configuration)’ == ‘Build’ ">

<DebugSymbols>false</DebugSymbols>

<OutputPath>bin\Build</OutputPath>

<OutputType>EXE</OutputType>

</PropertyGroup>

<ItemGroup>

<Content Include="$(DataDir)\Templates\main.template-unix.c"/>

<Content Include="$(DataDir)\Templates\main.template-win.c"/>

<Content Include="$(DataDir)\Includes\Tuples.h"/>

<Content Include="$(DataDir)\Includes\Instances.h"/>

</ItemGroup>

<ItemGroup>

<Compile Include="$(DataDir)\WinDll\*.hs"/>

<Compile Include="$(DataDir)\*.hs"/>

<Compile Include="$(DataDir)\Includes\*.h"/>

</ItemGroup>

<ItemGroup>

<Reference Include="base"/>

<Reference Include="syb"/>

</ItemGroup>

<ItemGroup>

<!–

<None Include="*.Cabal">

    <Type>Library</Type>

    <Exposed>True</Exposed>

</None>

–>

<Compile Include="WinDll\Lib\Converter.*" />

<Compile Include="WinDll\Lib\NativeMapping.*" />

<Compile Include="WinDll\Lib\Tuples.*" />

<Compile Include="WinDll\Structs\Types.*" />

<Compile Include="WinDll\Lib\InstancesTypes.*" />

</ItemGroup>

<!–

<ItemGroup Type="Executable" ExeName="" Name="Hs2lib" ModulePath="Hs2lib.hs"/>

–>

<Import Project="Haskell.Cabal.targets"/>

</Project>

File vs In-Memory buffer type checking


I’ve recently been looking into how to make the type checking part faster. The type checker part consists of two parts just like most of the code in Visual Haskell 2010, A Haskell side and a C# side.

This post is about optimizations on the C# side.

the setup currently is as follows:

Every 500ms after the user has finished typing, a call is made to typecheck() which does a few  things,

  • Finds the current module information (cached)
  • Generates a temporary file which has the same content of the In-Memory buffer
  • Makes a call to the Haskell world
  • Interprets the results

The part I thought I could optimize is that I didn’t want to have to write the buffer out to a file only to have GHC read it back in. Under normal circumstances the files will probably be in disk cache and not even written out to begin with, but we would still have to make a couple of kernel calls (about 6).

It’s not like the new approach wouldn’t have drawbacks, for one thing you’d have the serialization drawback, you suddenly have to serialize large strings from and to the Haskell world, but above that the GHC Api doesn’t even support type checking of In-Memory buffers.

So the first part is to edit the GHC Api and add this functionality.

Modifying GHC

I won’t explain this in detail, but I’ll just outline the “gist” of it.

Whenever you want to tell GHC want to inspect, It does so by inspecting all the “Targets” you’ve specified. An example of this would be

target <- guessTarget file Nothing
addTarget target

which means, we first ask it to guess what the string is (a module, or filename) and it’ll construct the appropriate Target for us. In order to support In-Memory buffers I added a new TargetId

| TargetVirtualFile StringBuffer FilePath (Maybe Phase)
  — ^ This entry loads the data from the provided buffer but acts like
  –   it was loaded from a file. The path is then used to inform GHC where
  –   to look for things like Interface files.
  –   This Target is useful when you want to prevent a client to write it’s
  –   in-memory buffer out to a file just to have GHC read it back in.

This takes the string buffer, and the original filename, the filename is only used to discover properties about the buffer (the most important of which is what type of input it is, hspp, hsc, hs or lhs)  and along with this a new function was create the facilitate the create of a virtual target.

— | Create a new virtual Target that will be used instead of reading the module from disk.
–   However object generation will be set to False. Since the File on disk would could
–   and is most likely not the same as the content passed to the function
virtualTarget :: GhcMonad m => StringBuffer -> FilePath -> m Target
virtualTarget content file
   = return (Target (TargetVirtualFile content file Nothing) False Nothing)

so for it’s all pretty straight forward, the rest I won’t explain since they have no bearing on how to use this new code, but what I’ve done was basically trace the workings of the load2 function, and where appropriate added some new code to instead of reading a file into a buffer, to just use the buffer passed on to virtualTarget.

The problem is, in runPhase a system process is called to deLit a Literate Haskell file which is a bit of a problem since it means that I would have to have the file on disk anyway. I decided not to worry about that for now but instead to focus on implementing the change for normal Haskell files first so I can run some benchmarking code.

Testing

Now for the actual testing

I created a new DLL file which contains the compiled code from two Haskell functions

getModInfo :: Bool -> String -> String -> String -> IO (ApiResults ModuleInfo)

and

getModStringInfo :: Bool -> String -> String -> String -> String -> IO (ApiResults ModuleInfo)

They both do the same amount of work, only the latter passes along as a target a VirtualTarget instead of a normal File target. But first it’s time to see if the changes actually did something.

This first test just asks for all the information it can get from a file named “VsxParser.hs”

Phyx@PHYX-PC /c/Users/Phyx/Documents/VisualHaskell2010/Parser
$ ghcii.sh VsxParser
GHCi, version 6.13.20100521: http://www.haskell.org/ghc/
Ok, modules loaded: VsxParser.
Prelude VsxParser> getModInfo True "VsxParser.hs" "VsxParser" ""
Success: ModuleInfo {functions = (70,[FunType {_Tspan = VsxParser.hs:256:1-8, _Tname = "showPpr’", _type = Just "Bool -> a -> String", _inline = (0,[])},FunType {_Tspan = VsxParser.hs:252:1-7, _Tname = "mkSized", _type = Just "[a] -> (Int,[a])", _inline = (0,[])},FunType {_Tspan = VsxParser.hs:244:1-17, _Tname = "configureDynFlags", _type = Just "DynFlags -> DynFlags", _inline = (0,[])},FunType {_Tspan = VsxParser.hs:237:1-19, _Tname = "createLocatedErrors", _type = Just "ErrMsg -> [ErrorMessage]", _inline = (0,[])},FunType {_Tspan = VsxParser.hs:230:1-13, _Tname = "processErrors", _type = Just "SourceError -> IO (ApiResults a)", _inline = (0,[])},FunType {_Tspan = VsxParser.hs:69:1-4, _Tname = "main", _type =
(content goes on and on and on)

the content will go on for many more pages so I won’t post that

now, We call getModStringInfo asking for information about the same file, but this time we pass along a much smaller and completely different buffer than the one of the file on disk. namely a module with just 1 function “foo = 5”

Prelude VsxParser> getModStringInfo True "module VsxParser where\nfoo = 5\n" "VsxParser.hs" "VsxParser" ""
Success: ModuleInfo {functions = (1,[FunType {_Tspan = VsxParser.hs:2:1-3, _Tname = "foo", _type = Just "Integer", _inline = (0,[])}]), imports = (1,[Import {_Ispan = Implicit import declaration, _Iname = "Prelude", _pkg = Nothing, _source= False, _qual = False, _as = Nothing, _hiding = (0,[])}]), types = (0,[]), warnings = (0,[])}

As can be seen it used the content of the buffer and not the file to perform the analysis. So far so good. The new virtualTarget seems to be working just fine. Now comes the part you’ve all been waiting for, the numbers!

Benchmarking

First the setup, The two methods above as already mentioned were compiled to a static DLL. The tests are written in C# and the full code for it is as follows (you can skip this if it doesn’t interest you). Also my laptop harddrive is just 5400rpm so please keep this in mind 🙂

 static void Main(string[] args)
        {
            Console.WriteLine("Running Benchmarks....");

            int length = 500;
            List<Double> stringB = new List<double>();
            List<Double> fileB = new List<double>();
            string content = File.ReadAllText("VsxParser.hs");

            Console.Write("Press [enter] to start In-Memory buffer test");
            Console.ReadLine();

            Console.WriteLine("Running String tests...");
            unsafe
            {
                for (int i = 0; i < length; i++)
                {
                    DateTime start = DateTime.Now;
                    WinDll.Parser.getModInfoByString(true, content, "VsxParser.hs", "VsxParser", "");
                    stringB.Add(DateTime.Now.Subtract(start).TotalMilliseconds);
                }
            }

            Console.Write("Press [enter] to start File based test");
            Console.ReadLine();

            Console.WriteLine("Running File tests...");
            unsafe
            {
                for (int i = 0; i < length; i++)
                {
                    DateTime start = DateTime.Now;
                    string path = Path.ChangeExtension(System.IO.Path.GetTempFileName(), Path.GetExtension("VsxParser.hs"));
                    File.WriteAllText(path, content);
                    WinDll.Parser.getModuleInfoByPath(true, "VsxParser.hs", "VsxParser", "");
                    File.Delete(path);
                    fileB.Add(DateTime.Now.Subtract(start).TotalMilliseconds);
                }
            }
            Console.WriteLine("Writing results...");

            StreamWriter fs1 = File.CreateText("stringB.csv");
            for (int i = 0; i < length; i++)
            {
                fs1.Write(stringB[i].ToString());
                if (i < length - 1)
                    fs1.Write(", ");
            }
            fs1.Write(fs1.NewLine);
            fs1.Close();

            fs1 = File.CreateText("fileB.csv");
            for (int i = 0; i < length; i++)
            {
                fs1.Write(fileB[i].ToString());
                if (i < length - 1)
                    fs1.Write(", ");
            }
            fs1.Write(fs1.NewLine);
            fs1.Close();

            Console.WriteLine("Done.");
        } 

 

Aside from this, I’ve also used the Windows performance monitor to monitor haddrive activity. Both methods are ran 500times and all the times are measured in the milliseconds range.

Results

first off, the intuition was right, Windows observed a 100% disk cache hit for the duration of the test. So the files were always in cache. So the expected difference shouldn’t be that big (or should it?)

  String buffer File Buffer
Overview Results_Normal_s1 Results_Normal_s2
Performance clustering Results_Normal_b1 Results_Normal_b2
Geometric mean 636.594594942071087ms 666.617770612978724ms
Variance 1436.0ms 1336.56ms
Mean Deviation 26.6618ms 27.1434ms

The two charts reveal that they both performed in about the same ranges on average with no real noticeable difference. You have to remember that these numbers are in milliseconds (ms). The difference in Geometric means is almost negligible 30.02317567090764ms

Results_Normal_merge

Viewing both charts together we see a significant overlap which means on average the one doesn’t perform all that better from the other on normal disk usage.

But what happens when we have abnormal disk usage? What if the drive was so busy that the disk cache starts missing. Would the File based approach still perform “good enough” ?

To simulate high disk usage I used a Microsoft tool called SQLIO, it’ official use is to find disk performance capacity, but it does so by stressing the drive, so it works fine for us. (get it at http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=9a8b005b-84e4-4f24-8d65-cb53442d9e19)

This is ran from a batch file in two modes read and write for about 36sec per mode. which should cover the length of 1 test. The batch file used is

sqlio -kW -s10 -frandom -o8 -b8 -LS -Fparam.txt
sqlio -kW -s36 -frandom -o8 -b64 -LS -Fparam.txt
sqlio -kW -s36 -frandom -o8 -b128 -LS -Fparam.txt
sqlio -kW -s36 -frandom -o8 -b256 -LS -Fparam.txt

sqlio -kW -s36 -fsequential -o8 -b8 -LS -Fparam.txt
sqlio -kW -s36 -fsequential -o8 -b64 -LS -Fparam.txt
sqlio -kW -s36 -fsequential -o8 -b128 -LS -Fparam.txt
sqlio -kW -s36 -fsequential -o8 -b256 -LS -Fparam.txt

sqlio -kR -s36 -frandom -o8 -b8 -LS -Fparam.txt
sqlio -kR -s36 -frandom -o8 -b64 -LS -Fparam.txt
sqlio -kR -s36 -frandom -o8 -b128 -LS -Fparam.txt
sqlio -kR -s36 -frandom -o8 -b256 -LS -Fparam.txt

sqlio -kR -s36 -fsequential -o8 -b8 -LS -Fparam.txt
sqlio -kR -s36 -fsequential -o8 -b64 -LS -Fparam.txt
sqlio -kR -s36 -fsequential -o8 -b128 -LS -Fparam.txt
sqlio -kR -s36 -fsequential -o8 -b256 -LS -Fparam.txt

Running SQLIO and the benchmarks again the first thing that catches my eyes is that

disk cache performance has dropped from 100% to

95.4854

Looking at the dataset I see a lot of completely missed caches. Also remember that the SQLIO is also reading data,

so the disk cache activity also represents it’s reads and not just those of the benchmark as before. So let’s analyze the numbers like before

  String buffer File Buffer
Overview Results_Normal_s1x Results_Normal_s2x
Performance clustering Results_Normal_b1x Results_Normal_b2x
Geometric mean 883.339857882210200ms 904.505721026752581ms
Variance 2.26107*10^6ms 3.81543*10^6ms
Mean Deviation 449.206ms 611.77ms

These results are quite surprising since the difference in mean now is just 21.16586314454238ms. So under heavy load they start to converge to the same performance level.

Results_Normal_mergex

So while both have some significant outliers,  both perform on average in the same category. It’s sad and hard to admit, but Unless I made a mistake when implementing the VirtualFile (which very well might be true) having a String vs File buffer doesn’t really make a big difference mostly due to the OS’s management of I/O and the disk caching going on.

In fact under heavy loads Windows started to use more and more “Fast Reads” and “Fast Writes” These require much less kernel and user mode calls than a full I/O call, so even that advantage was negated somewhat under heavy load, while the overhead of marshalling the string hasn’t changed, which might explain the smaller difference in Geometric mean in the second benchmark.

So the conclusion of the story is is, for now, there’s nothing to gain from the calls on the C# side, but maybe there is in the processing of the result but that’s saved for another time :). Up next, optimizing the Haskell Code to gather more complete module information and do it faster!.

Context sensitive lexing


This is something I haven’t seen in other Haskell IDEs before but which to me would be useful:

Context sensitive lexing, as in the lexer wil treat certain tokens differently based on information defined globally, e.g LANGUAGE Pragmas.

But first a quick recap of how lexing is done in visual haskell 2010:

  • The IDE will ask me to color text one line at a time
  • Everytime I want to color a line I make a call to HsLexer.dll which is a binding to the GHC Api, which calls the GHC lexer directly.
  • Multiline comments are handles in a local state and are never passed to the lexer because since I’m lexing one line at a time, I won’t be able to find the boundaries of the comment blocks like that, so instead I just keep track of the comment tokens {- and –} and identify blocks using a local algorithm that mimics the matching done by GHC.
    Using that I was always able to color GHC Pragmas a different color than normal comments, the reason for this is that they have special meaning, so I’m depicting them as such.

The original code for lexing on the Haskell side was

— @@ Export
— | perform lexical analysis on the input string.
lexSourceString :: String -> IO (StatelessParseResult [Located Token])
lexSourceString source = 
do
   buffer <- stringToStringBuffer source
   let srcLoc  = mkSrcLoc (mkFastString "internal:string") 1 1
   let dynFlag = defaultDynFlags
   let result  = lexTokenStream buffer srcLoc dynFlag
   return $ convert result

pretty straight forward, I won’t really be explaining what everything does here, but what’s important is that we need to somehow add the LANGUAGE pragma entries into the dynFlag value above.

To that end, I created a new function

— @@ Export
— | perform lexical analysis on the input string and taking in a list of extensions to use in a newline seperated format
lexSourceStringWithExt :: String -> String -> IO (StatelessParseResult [Located Token])
lexSourceStringWithExt source exts = 
do
   buffer <- stringToStringBuffer source
   let srcLoc  = mkSrcLoc (mkFastString "internal:string") 1 1
   let dynFlag = defaultDynFlags
   let flagx   = flags dynFlag
   let result  = lexTokenStream buffer srcLoc (dynFlag { flags = flagx ++ configureFlags (lines exts) })
   return $ convert result

which gets the list of Pragmas to enable in a newline \n delimited format. The reason for this is that WinDll currently does not support Lists marshalling properly. It’ll be there in the final version at which point I would have rewritten these parts as well. But until then this would suffice.

the function seen above

configureFlags :: [String] -> [DynFlag]

is used to convert from the list of strings to a list of recognized DynFlag that effect lexing.

Now on to the C# side, Information I already had was the location of the multi comment sections, so all I needed to do was, on any change filter out those sections which I already know to be a Pragma (I know this because I color them differently remember)

But since the code that tracks sections is generic I did not want to hardcode this, so instead I created the following event and abstract methods

public delegate void UpdateDirtySections(object sender, Entry[] sections);
public event UpdateDirtySections DirtyChange;

/// <summary>
/// Raise the dirty section events by filtering the list with dirty spans to reflect
/// only those spans that are not the DEFAULT span
/// </summary>
protected abstract void notifyDirty();

/// <summary>
/// A redirect code for raising the internal event
/// </summary>
/// <param name="list"></param>
internal void raiseNotifyDirty(Entry[] list)
{
    if (DirtyChange != null)
        DirtyChange(this, list);
}

and the specific implementation of  notifyDirty for the CommentTracker is

protected override void notifyDirty()
{
    Entry[] sections = (Entry[])list.Where(x => x.isClosed && !(x.tag is CommentTag)).ToArray();
    base.raiseNotifyDirty(sections);
}

Meaning we only want those entries that are Not the normal CommentTag and that are closed, i.e. having both the start and end values filled in. (the comment tracking algorithm tracks also unclosed comment blocks, It needs to in order to do proper matching as comments get broken or introduced)

The only thing left now is to make subscribe to this event from the Tagger that produces syntax highlighting and react to it. My specific implementation does two things, It keeps track of the current collection of pragmas and the previous collection.

then it makes a call to checkNewHLE to see whether we have introduces or removed a valid syntax pragma. If this is the case, it asks for the entire file to be re-colored.

This call to checkNewHLE is important, since when the user is modifying an already existing pragma tag,

for instance adding TypeFamilies  into the pragmas {-# LANGUAGE TemplateHaskell #-} we get notified for every keypress the user makes, but untill the whole keyword TypeFamilies has been types there’s no point in re-coloring the whole file.

The result of this can be seen below and I find it very cool to be frank 😀

What it looks like with no pragmas

image

now look at what happens when we enable TemplateHaskell  and TypeFamilies

image

notice how with the extensions enabled “family” and “[|” , “|]” now behave like different keywords, this should be usefull to notify the programmer when he’s using certain features. For instance, with TypeFamilies enabled line 6 would no longer be valid because “family” is now a keyword.

Finding the current buffer’s filename


I was recently faced with the problem that in order for me to be able to send a file off to GHC for type checking and parsing (not in that order) I would need to know the full filename.

But the problem is, the only thing I have if a ITextBuffer object. Luckily, almost every object in Visual Studio 2010 has a “Properties” well, property.

So after looking around I found out that this collection contains the ITextDocument object i so desperately need. But ran into one problem. This is a dictionary so logically I would need the key of that object.

The irritation here was that the Key for this object seems to be an type, but How would I create a ITextDocument type? just using ITextDocument as a type isn’t correct, and because I just have the interface, I can’t call GetType() on it. Now I was stuck, having no idea how to construct the key.

Fortunately I realize that I would only need to look this up once, when my Tagger is initialized. So I decided to just do a linear lookup in the dictionary and select the first matching type.

It’s arguably not the way it should be done, But should be fine for my purposes, the code ended up looking like

/// <summary>
/// Finds the first value with the specified type inside the property bag.
/// This is used because I don’t know how to get the Visual Studio instantiated
/// types out of the bag. So I’m doing runtime matching. It would only be done once
/// per buffer so shouldn’t be too bad.
/// </summary>
/// <typeparam name="T">Type of the result</typeparam>
/// <param name="buffer">buffer to look in</param>
/// <returns>Object of the requested type</returns>
/// <exception cref="InvalidOperationException">Gets thrown if the type is not found inside the property dictionary</exception>
public static T getPropertyFromBuffer<T>(ITextBuffer buffer)
{
    foreach (var item in buffer.Properties.PropertyList)
    {
        if (item.Value is T)
            return (T)item.Value;
    }
    throw new InvalidOperationException("The specified type could not be found inside the property bag");
}

So at runtime it uses the generic type T to do lookups, a simple use of this would be

this.document = Utils.EditorUtils.getPropertyFromBuffer<ITextDocument>(this.buffer);

and that’s how I lookup my ITextDocument object 🙂

Working around ghc’s lexer’s layout rule


While implementing coloring for Haskell files I noticed that lines with more closing braces (either ‘)’ or ‘}’) were not being colored.

After doing some digging around I found out the following:

?parseLine("{")->tag

cStatelessParseResultSOk

?parseLine("}")->tag

cStatelessParseResultSFailed

and

?parseLine("{-")->tag

cStatelessParseResultSOk

?parseLine("-}")->tag

cStatelessParseResultSFailed

 

So apparently they were throwing lexical errors, but why?

After contacting Mr. Simon Marlow I was told that this is the handling of GHC’s layout rule. to quote

“You’re probably encountering the lexer’s handling of the Haskell “layout” rule.  When the lexer sees a ‘}’ token, it pops the current layout stack, and if the layout stack is empty then this is a lexical error.”

This left me with 3 choices

  1. Use a custom lexer much like the original visual haskell did
  2. Replace all {, },( and ) with 1 whitespace character so that they won’t be colored, but the rest of the input will, but the positions would be preserved.
  3. left pad the input with enough opening braces to have the lexer succeed in parsing then adjust the ranges.

Option 1 was the least maintainable, since I would have to keep updating the lexer everytime the one in ghc changes. So I didn’t want to do this.

Option 2 was a possibility, one which I tried out before, But I noticed that having the braces colored really did help.

Option 3 was then chosen by process of elimination. It turned out to not be that much work at all.

private int prepareLine(ref string str)
{
    int round= 0, brace = 0;

    for (int i = 0; i < str.Length; i++)
    {
        switch (str[i])
        {
            case ‘}’:
                if(i==0 || !(str[i-1]==’-‘))
                    brace++;
                break;
            case ‘)’:
                round++;
                break;
            default:
                break;
        }
    }

    if(round > 0)
        str = str.PadLeft(str.Length + round, ‘(‘);

    if (brace > 0)
        str = str.PadLeft(str.Length + brace, ‘{‘);

    return round + brace;
}

 

is the full implementation. Now I know what you’re thinking, By doing this I’ll create more opening than closing braces. So a balanced line like (Int) becomes unbalanced ((Int). However this is not a problem, Since for my coloring braces carry no semantics. I don’t care what they mean (as in, when interpreted) all I care about is what they are (as in the token type).

With that in place, the only other code needed is to skip the first n number of tokens returned from the lexer, where n is the result of calling the prepareLine function.

And that’s all, Now we have perfect line coloring everywhere 🙂

image

Changing default settings


The visual studio editor has a bunch of build in settings you can turn on and off per editor.

Since I’m writing a language service for Haskell, I would like to enable replacing tabs with spaces, set the amount of spaces to 4 and turn on line numbering

After a bit of searching I found a blog post from Noah Ric a developer at Microsoft about disabling zooming in a document window.

http://blogs.msdn.com/noahric/archive/2010/03/18/disabling-mouse-wheel-zoom-through-ieditoroptions.aspx

So using this as a basis I created the following:

[Export(typeof(IWpfTextViewCreationListener))]
[ContentType("haskell")]
[TextViewRole(PredefinedTextViewRoles.Zoomable)]
internal class ViewCreationListener : IWpfTextViewCreationListener
{
    public void TextViewCreated(IWpfTextView textView)
    {
        textView.Options.SetOptionValue(DefaultWpfViewOptions.EnableHighlightCurrentLineId, false);
        textView.Options.SetOptionValue(DefaultOptions.TabSizeOptionId, 4);
        textView.Options.SetOptionValue(DefaultOptions.ConvertTabsToSpacesOptionId, true);
        textView.Options.SetOptionValue(DefaultTextViewHostOptions.LineNumberMarginId, true);
    }
}

 

(sorry no syntax highlighting as I can’t find a theme on here that won’t break it)

Notice the “PredefinedTextViewRoles.Zoomable” , I wanted to use PredefinedTextViewRoles.Document but when doing this the IDE would randomly throw Exceptions saying that the Editor wasn’t fully created yet. Which is odd since Visual Studio is doing all the initializations.

The set of values you can change this way are listed here: http://msdn.microsoft.com/en-us/library/microsoft.visualstudio.text.editor.ieditoroptions_members(v=VS.100).aspx

For more things you change take a look at http://msdn.microsoft.com/en-us/library/ee818135.aspx