Collapsible regions.–Video Update #2


So here’s the second video update, it shows some of the current progress up till now, while it might not look significantly different but there’s a big difference under the hood. A lot has been rewritten and optimized in anticipations of new features coming soon, like intellisense and ghost typing (coming in the next video, due in a few days). Click the link below to see the video

Collapsible Regions

That’s it for today, and now… back to my thesis project.

Advertisements

Another screen capture


Just a quick snapshot of what I’ve been working on, Still doing work on typchecking and code discovery which is a bit slow atm, and having to finish this is taking longer than I expected.

errors

But then it’s off to cabal support after this.

File vs In-Memory buffer type checking


I’ve recently been looking into how to make the type checking part faster. The type checker part consists of two parts just like most of the code in Visual Haskell 2010, A Haskell side and a C# side.

This post is about optimizations on the C# side.

the setup currently is as follows:

Every 500ms after the user has finished typing, a call is made to typecheck() which does a few  things,

  • Finds the current module information (cached)
  • Generates a temporary file which has the same content of the In-Memory buffer
  • Makes a call to the Haskell world
  • Interprets the results

The part I thought I could optimize is that I didn’t want to have to write the buffer out to a file only to have GHC read it back in. Under normal circumstances the files will probably be in disk cache and not even written out to begin with, but we would still have to make a couple of kernel calls (about 6).

It’s not like the new approach wouldn’t have drawbacks, for one thing you’d have the serialization drawback, you suddenly have to serialize large strings from and to the Haskell world, but above that the GHC Api doesn’t even support type checking of In-Memory buffers.

So the first part is to edit the GHC Api and add this functionality.

Modifying GHC

I won’t explain this in detail, but I’ll just outline the “gist” of it.

Whenever you want to tell GHC want to inspect, It does so by inspecting all the “Targets” you’ve specified. An example of this would be

target <- guessTarget file Nothing
addTarget target

which means, we first ask it to guess what the string is (a module, or filename) and it’ll construct the appropriate Target for us. In order to support In-Memory buffers I added a new TargetId

| TargetVirtualFile StringBuffer FilePath (Maybe Phase)
  — ^ This entry loads the data from the provided buffer but acts like
  –   it was loaded from a file. The path is then used to inform GHC where
  –   to look for things like Interface files.
  –   This Target is useful when you want to prevent a client to write it’s
  –   in-memory buffer out to a file just to have GHC read it back in.

This takes the string buffer, and the original filename, the filename is only used to discover properties about the buffer (the most important of which is what type of input it is, hspp, hsc, hs or lhs)  and along with this a new function was create the facilitate the create of a virtual target.

— | Create a new virtual Target that will be used instead of reading the module from disk.
–   However object generation will be set to False. Since the File on disk would could
–   and is most likely not the same as the content passed to the function
virtualTarget :: GhcMonad m => StringBuffer -> FilePath -> m Target
virtualTarget content file
   = return (Target (TargetVirtualFile content file Nothing) False Nothing)

so for it’s all pretty straight forward, the rest I won’t explain since they have no bearing on how to use this new code, but what I’ve done was basically trace the workings of the load2 function, and where appropriate added some new code to instead of reading a file into a buffer, to just use the buffer passed on to virtualTarget.

The problem is, in runPhase a system process is called to deLit a Literate Haskell file which is a bit of a problem since it means that I would have to have the file on disk anyway. I decided not to worry about that for now but instead to focus on implementing the change for normal Haskell files first so I can run some benchmarking code.

Testing

Now for the actual testing

I created a new DLL file which contains the compiled code from two Haskell functions

getModInfo :: Bool -> String -> String -> String -> IO (ApiResults ModuleInfo)

and

getModStringInfo :: Bool -> String -> String -> String -> String -> IO (ApiResults ModuleInfo)

They both do the same amount of work, only the latter passes along as a target a VirtualTarget instead of a normal File target. But first it’s time to see if the changes actually did something.

This first test just asks for all the information it can get from a file named “VsxParser.hs”

Phyx@PHYX-PC /c/Users/Phyx/Documents/VisualHaskell2010/Parser
$ ghcii.sh VsxParser
GHCi, version 6.13.20100521: http://www.haskell.org/ghc/
Ok, modules loaded: VsxParser.
Prelude VsxParser> getModInfo True "VsxParser.hs" "VsxParser" ""
Success: ModuleInfo {functions = (70,[FunType {_Tspan = VsxParser.hs:256:1-8, _Tname = "showPpr’", _type = Just "Bool -> a -> String", _inline = (0,[])},FunType {_Tspan = VsxParser.hs:252:1-7, _Tname = "mkSized", _type = Just "[a] -> (Int,[a])", _inline = (0,[])},FunType {_Tspan = VsxParser.hs:244:1-17, _Tname = "configureDynFlags", _type = Just "DynFlags -> DynFlags", _inline = (0,[])},FunType {_Tspan = VsxParser.hs:237:1-19, _Tname = "createLocatedErrors", _type = Just "ErrMsg -> [ErrorMessage]", _inline = (0,[])},FunType {_Tspan = VsxParser.hs:230:1-13, _Tname = "processErrors", _type = Just "SourceError -> IO (ApiResults a)", _inline = (0,[])},FunType {_Tspan = VsxParser.hs:69:1-4, _Tname = "main", _type =
(content goes on and on and on)

the content will go on for many more pages so I won’t post that

now, We call getModStringInfo asking for information about the same file, but this time we pass along a much smaller and completely different buffer than the one of the file on disk. namely a module with just 1 function “foo = 5”

Prelude VsxParser> getModStringInfo True "module VsxParser where\nfoo = 5\n" "VsxParser.hs" "VsxParser" ""
Success: ModuleInfo {functions = (1,[FunType {_Tspan = VsxParser.hs:2:1-3, _Tname = "foo", _type = Just "Integer", _inline = (0,[])}]), imports = (1,[Import {_Ispan = Implicit import declaration, _Iname = "Prelude", _pkg = Nothing, _source= False, _qual = False, _as = Nothing, _hiding = (0,[])}]), types = (0,[]), warnings = (0,[])}

As can be seen it used the content of the buffer and not the file to perform the analysis. So far so good. The new virtualTarget seems to be working just fine. Now comes the part you’ve all been waiting for, the numbers!

Benchmarking

First the setup, The two methods above as already mentioned were compiled to a static DLL. The tests are written in C# and the full code for it is as follows (you can skip this if it doesn’t interest you). Also my laptop harddrive is just 5400rpm so please keep this in mind 🙂

 static void Main(string[] args)
        {
            Console.WriteLine("Running Benchmarks....");

            int length = 500;
            List<Double> stringB = new List<double>();
            List<Double> fileB = new List<double>();
            string content = File.ReadAllText("VsxParser.hs");

            Console.Write("Press [enter] to start In-Memory buffer test");
            Console.ReadLine();

            Console.WriteLine("Running String tests...");
            unsafe
            {
                for (int i = 0; i < length; i++)
                {
                    DateTime start = DateTime.Now;
                    WinDll.Parser.getModInfoByString(true, content, "VsxParser.hs", "VsxParser", "");
                    stringB.Add(DateTime.Now.Subtract(start).TotalMilliseconds);
                }
            }

            Console.Write("Press [enter] to start File based test");
            Console.ReadLine();

            Console.WriteLine("Running File tests...");
            unsafe
            {
                for (int i = 0; i < length; i++)
                {
                    DateTime start = DateTime.Now;
                    string path = Path.ChangeExtension(System.IO.Path.GetTempFileName(), Path.GetExtension("VsxParser.hs"));
                    File.WriteAllText(path, content);
                    WinDll.Parser.getModuleInfoByPath(true, "VsxParser.hs", "VsxParser", "");
                    File.Delete(path);
                    fileB.Add(DateTime.Now.Subtract(start).TotalMilliseconds);
                }
            }
            Console.WriteLine("Writing results...");

            StreamWriter fs1 = File.CreateText("stringB.csv");
            for (int i = 0; i < length; i++)
            {
                fs1.Write(stringB[i].ToString());
                if (i < length - 1)
                    fs1.Write(", ");
            }
            fs1.Write(fs1.NewLine);
            fs1.Close();

            fs1 = File.CreateText("fileB.csv");
            for (int i = 0; i < length; i++)
            {
                fs1.Write(fileB[i].ToString());
                if (i < length - 1)
                    fs1.Write(", ");
            }
            fs1.Write(fs1.NewLine);
            fs1.Close();

            Console.WriteLine("Done.");
        } 

 

Aside from this, I’ve also used the Windows performance monitor to monitor haddrive activity. Both methods are ran 500times and all the times are measured in the milliseconds range.

Results

first off, the intuition was right, Windows observed a 100% disk cache hit for the duration of the test. So the files were always in cache. So the expected difference shouldn’t be that big (or should it?)

  String buffer File Buffer
Overview Results_Normal_s1 Results_Normal_s2
Performance clustering Results_Normal_b1 Results_Normal_b2
Geometric mean 636.594594942071087ms 666.617770612978724ms
Variance 1436.0ms 1336.56ms
Mean Deviation 26.6618ms 27.1434ms

The two charts reveal that they both performed in about the same ranges on average with no real noticeable difference. You have to remember that these numbers are in milliseconds (ms). The difference in Geometric means is almost negligible 30.02317567090764ms

Results_Normal_merge

Viewing both charts together we see a significant overlap which means on average the one doesn’t perform all that better from the other on normal disk usage.

But what happens when we have abnormal disk usage? What if the drive was so busy that the disk cache starts missing. Would the File based approach still perform “good enough” ?

To simulate high disk usage I used a Microsoft tool called SQLIO, it’ official use is to find disk performance capacity, but it does so by stressing the drive, so it works fine for us. (get it at http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=9a8b005b-84e4-4f24-8d65-cb53442d9e19)

This is ran from a batch file in two modes read and write for about 36sec per mode. which should cover the length of 1 test. The batch file used is

sqlio -kW -s10 -frandom -o8 -b8 -LS -Fparam.txt
sqlio -kW -s36 -frandom -o8 -b64 -LS -Fparam.txt
sqlio -kW -s36 -frandom -o8 -b128 -LS -Fparam.txt
sqlio -kW -s36 -frandom -o8 -b256 -LS -Fparam.txt

sqlio -kW -s36 -fsequential -o8 -b8 -LS -Fparam.txt
sqlio -kW -s36 -fsequential -o8 -b64 -LS -Fparam.txt
sqlio -kW -s36 -fsequential -o8 -b128 -LS -Fparam.txt
sqlio -kW -s36 -fsequential -o8 -b256 -LS -Fparam.txt

sqlio -kR -s36 -frandom -o8 -b8 -LS -Fparam.txt
sqlio -kR -s36 -frandom -o8 -b64 -LS -Fparam.txt
sqlio -kR -s36 -frandom -o8 -b128 -LS -Fparam.txt
sqlio -kR -s36 -frandom -o8 -b256 -LS -Fparam.txt

sqlio -kR -s36 -fsequential -o8 -b8 -LS -Fparam.txt
sqlio -kR -s36 -fsequential -o8 -b64 -LS -Fparam.txt
sqlio -kR -s36 -fsequential -o8 -b128 -LS -Fparam.txt
sqlio -kR -s36 -fsequential -o8 -b256 -LS -Fparam.txt

Running SQLIO and the benchmarks again the first thing that catches my eyes is that

disk cache performance has dropped from 100% to

95.4854

Looking at the dataset I see a lot of completely missed caches. Also remember that the SQLIO is also reading data,

so the disk cache activity also represents it’s reads and not just those of the benchmark as before. So let’s analyze the numbers like before

  String buffer File Buffer
Overview Results_Normal_s1x Results_Normal_s2x
Performance clustering Results_Normal_b1x Results_Normal_b2x
Geometric mean 883.339857882210200ms 904.505721026752581ms
Variance 2.26107*10^6ms 3.81543*10^6ms
Mean Deviation 449.206ms 611.77ms

These results are quite surprising since the difference in mean now is just 21.16586314454238ms. So under heavy load they start to converge to the same performance level.

Results_Normal_mergex

So while both have some significant outliers,  both perform on average in the same category. It’s sad and hard to admit, but Unless I made a mistake when implementing the VirtualFile (which very well might be true) having a String vs File buffer doesn’t really make a big difference mostly due to the OS’s management of I/O and the disk caching going on.

In fact under heavy loads Windows started to use more and more “Fast Reads” and “Fast Writes” These require much less kernel and user mode calls than a full I/O call, so even that advantage was negated somewhat under heavy load, while the overhead of marshalling the string hasn’t changed, which might explain the smaller difference in Geometric mean in the second benchmark.

So the conclusion of the story is is, for now, there’s nothing to gain from the calls on the C# side, but maybe there is in the processing of the result but that’s saved for another time :). Up next, optimizing the Haskell Code to gather more complete module information and do it faster!.

Partial typechecking


Ok  so I just finished partial type checking, It now reports back type errors and information about a module whenever it can.

I also uploaded a video to show the most common way to navigate a file with that information. But do note a few things

  • The “look” is not final. e.g. it’s missing theming on the dropdown among others, and icons to show the type of item. These will be added next but wanted to get this out.
  • The delay at the initial check is artificially imposed. So I can debug. It usually has the information before the file loads.
  • And I have a cold, So forgive my stuffy voice 🙂

http://www.screencast.com/t/NmM1NTBkMDQ

I think I’ll make my self imposed end of July deadline. Like I mentioned before, the first version aims to match features with the original visual haskell.

Context sensitive lexing


This is something I haven’t seen in other Haskell IDEs before but which to me would be useful:

Context sensitive lexing, as in the lexer wil treat certain tokens differently based on information defined globally, e.g LANGUAGE Pragmas.

But first a quick recap of how lexing is done in visual haskell 2010:

  • The IDE will ask me to color text one line at a time
  • Everytime I want to color a line I make a call to HsLexer.dll which is a binding to the GHC Api, which calls the GHC lexer directly.
  • Multiline comments are handles in a local state and are never passed to the lexer because since I’m lexing one line at a time, I won’t be able to find the boundaries of the comment blocks like that, so instead I just keep track of the comment tokens {- and –} and identify blocks using a local algorithm that mimics the matching done by GHC.
    Using that I was always able to color GHC Pragmas a different color than normal comments, the reason for this is that they have special meaning, so I’m depicting them as such.

The original code for lexing on the Haskell side was

— @@ Export
— | perform lexical analysis on the input string.
lexSourceString :: String -> IO (StatelessParseResult [Located Token])
lexSourceString source = 
do
   buffer <- stringToStringBuffer source
   let srcLoc  = mkSrcLoc (mkFastString "internal:string") 1 1
   let dynFlag = defaultDynFlags
   let result  = lexTokenStream buffer srcLoc dynFlag
   return $ convert result

pretty straight forward, I won’t really be explaining what everything does here, but what’s important is that we need to somehow add the LANGUAGE pragma entries into the dynFlag value above.

To that end, I created a new function

— @@ Export
— | perform lexical analysis on the input string and taking in a list of extensions to use in a newline seperated format
lexSourceStringWithExt :: String -> String -> IO (StatelessParseResult [Located Token])
lexSourceStringWithExt source exts = 
do
   buffer <- stringToStringBuffer source
   let srcLoc  = mkSrcLoc (mkFastString "internal:string") 1 1
   let dynFlag = defaultDynFlags
   let flagx   = flags dynFlag
   let result  = lexTokenStream buffer srcLoc (dynFlag { flags = flagx ++ configureFlags (lines exts) })
   return $ convert result

which gets the list of Pragmas to enable in a newline \n delimited format. The reason for this is that WinDll currently does not support Lists marshalling properly. It’ll be there in the final version at which point I would have rewritten these parts as well. But until then this would suffice.

the function seen above

configureFlags :: [String] -> [DynFlag]

is used to convert from the list of strings to a list of recognized DynFlag that effect lexing.

Now on to the C# side, Information I already had was the location of the multi comment sections, so all I needed to do was, on any change filter out those sections which I already know to be a Pragma (I know this because I color them differently remember)

But since the code that tracks sections is generic I did not want to hardcode this, so instead I created the following event and abstract methods

public delegate void UpdateDirtySections(object sender, Entry[] sections);
public event UpdateDirtySections DirtyChange;

/// <summary>
/// Raise the dirty section events by filtering the list with dirty spans to reflect
/// only those spans that are not the DEFAULT span
/// </summary>
protected abstract void notifyDirty();

/// <summary>
/// A redirect code for raising the internal event
/// </summary>
/// <param name="list"></param>
internal void raiseNotifyDirty(Entry[] list)
{
    if (DirtyChange != null)
        DirtyChange(this, list);
}

and the specific implementation of  notifyDirty for the CommentTracker is

protected override void notifyDirty()
{
    Entry[] sections = (Entry[])list.Where(x => x.isClosed && !(x.tag is CommentTag)).ToArray();
    base.raiseNotifyDirty(sections);
}

Meaning we only want those entries that are Not the normal CommentTag and that are closed, i.e. having both the start and end values filled in. (the comment tracking algorithm tracks also unclosed comment blocks, It needs to in order to do proper matching as comments get broken or introduced)

The only thing left now is to make subscribe to this event from the Tagger that produces syntax highlighting and react to it. My specific implementation does two things, It keeps track of the current collection of pragmas and the previous collection.

then it makes a call to checkNewHLE to see whether we have introduces or removed a valid syntax pragma. If this is the case, it asks for the entire file to be re-colored.

This call to checkNewHLE is important, since when the user is modifying an already existing pragma tag,

for instance adding TypeFamilies  into the pragmas {-# LANGUAGE TemplateHaskell #-} we get notified for every keypress the user makes, but untill the whole keyword TypeFamilies has been types there’s no point in re-coloring the whole file.

The result of this can be seen below and I find it very cool to be frank 😀

What it looks like with no pragmas

image

now look at what happens when we enable TemplateHaskell  and TypeFamilies

image

notice how with the extensions enabled “family” and “[|” , “|]” now behave like different keywords, this should be usefull to notify the programmer when he’s using certain features. For instance, with TypeFamilies enabled line 6 would no longer be valid because “family” is now a keyword.

Finding the current buffer’s filename


I was recently faced with the problem that in order for me to be able to send a file off to GHC for type checking and parsing (not in that order) I would need to know the full filename.

But the problem is, the only thing I have if a ITextBuffer object. Luckily, almost every object in Visual Studio 2010 has a “Properties” well, property.

So after looking around I found out that this collection contains the ITextDocument object i so desperately need. But ran into one problem. This is a dictionary so logically I would need the key of that object.

The irritation here was that the Key for this object seems to be an type, but How would I create a ITextDocument type? just using ITextDocument as a type isn’t correct, and because I just have the interface, I can’t call GetType() on it. Now I was stuck, having no idea how to construct the key.

Fortunately I realize that I would only need to look this up once, when my Tagger is initialized. So I decided to just do a linear lookup in the dictionary and select the first matching type.

It’s arguably not the way it should be done, But should be fine for my purposes, the code ended up looking like

/// <summary>
/// Finds the first value with the specified type inside the property bag.
/// This is used because I don’t know how to get the Visual Studio instantiated
/// types out of the bag. So I’m doing runtime matching. It would only be done once
/// per buffer so shouldn’t be too bad.
/// </summary>
/// <typeparam name="T">Type of the result</typeparam>
/// <param name="buffer">buffer to look in</param>
/// <returns>Object of the requested type</returns>
/// <exception cref="InvalidOperationException">Gets thrown if the type is not found inside the property dictionary</exception>
public static T getPropertyFromBuffer<T>(ITextBuffer buffer)
{
    foreach (var item in buffer.Properties.PropertyList)
    {
        if (item.Value is T)
            return (T)item.Value;
    }
    throw new InvalidOperationException("The specified type could not be found inside the property bag");
}

So at runtime it uses the generic type T to do lookups, a simple use of this would be

this.document = Utils.EditorUtils.getPropertyFromBuffer<ITextDocument>(this.buffer);

and that’s how I lookup my ITextDocument object 🙂

Working around ghc’s lexer’s layout rule


While implementing coloring for Haskell files I noticed that lines with more closing braces (either ‘)’ or ‘}’) were not being colored.

After doing some digging around I found out the following:

?parseLine("{")->tag

cStatelessParseResultSOk

?parseLine("}")->tag

cStatelessParseResultSFailed

and

?parseLine("{-")->tag

cStatelessParseResultSOk

?parseLine("-}")->tag

cStatelessParseResultSFailed

 

So apparently they were throwing lexical errors, but why?

After contacting Mr. Simon Marlow I was told that this is the handling of GHC’s layout rule. to quote

“You’re probably encountering the lexer’s handling of the Haskell “layout” rule.  When the lexer sees a ‘}’ token, it pops the current layout stack, and if the layout stack is empty then this is a lexical error.”

This left me with 3 choices

  1. Use a custom lexer much like the original visual haskell did
  2. Replace all {, },( and ) with 1 whitespace character so that they won’t be colored, but the rest of the input will, but the positions would be preserved.
  3. left pad the input with enough opening braces to have the lexer succeed in parsing then adjust the ranges.

Option 1 was the least maintainable, since I would have to keep updating the lexer everytime the one in ghc changes. So I didn’t want to do this.

Option 2 was a possibility, one which I tried out before, But I noticed that having the braces colored really did help.

Option 3 was then chosen by process of elimination. It turned out to not be that much work at all.

private int prepareLine(ref string str)
{
    int round= 0, brace = 0;

    for (int i = 0; i < str.Length; i++)
    {
        switch (str[i])
        {
            case ‘}’:
                if(i==0 || !(str[i-1]==’-‘))
                    brace++;
                break;
            case ‘)’:
                round++;
                break;
            default:
                break;
        }
    }

    if(round > 0)
        str = str.PadLeft(str.Length + round, ‘(‘);

    if (brace > 0)
        str = str.PadLeft(str.Length + brace, ‘{‘);

    return round + brace;
}

 

is the full implementation. Now I know what you’re thinking, By doing this I’ll create more opening than closing braces. So a balanced line like (Int) becomes unbalanced ((Int). However this is not a problem, Since for my coloring braces carry no semantics. I don’t care what they mean (as in, when interpreted) all I care about is what they are (as in the token type).

With that in place, the only other code needed is to skip the first n number of tokens returned from the lexer, where n is the result of calling the prepareLine function.

And that’s all, Now we have perfect line coloring everywhere 🙂

image