Tuesday, March 03, 2009

Darren on Flex: Typed Literals, Paving The Way For Truly Embedded DSLs

What are they and what do they have to do with embedded DSLs?

After undertaking several Fit(Library) framework development projects and listening to various speakers on DSLs at SPA2008 (DSLs clearly being the hype for 2008, or at least at that conference), I started thinking about what DSLs exist, what support we have for them and what support we have for new ones. We are all familiar with the concept of primitive types and String literals, right? Coming from a Java background, I find things don't get much simpler than that. Those who are now feeling the urge to say something like, "Language X does not have primitive types, everything is an object..." please, pipe down, it doesn't really matter. The important thing to note is that there are types whose values you can define without using the new operator (switch the keyword, operator or construct to suit your language of choice).

String Literal Abuse...

Strings are used to express eeeevverything... well, almost everything: SQL statements, XML fragments/documents and Regular Expressions. Even regular language expressions are made executable by the infamous eval('<string-embedded-language-expressions>') method, where supported. There may be more types expressible by strings, but we will focus on these for now; if you know of any other interesting cases, I'd be interested to learn of them.

... Invites String Literal Pain!

The pain suffered as a result of abusing string literals often goes unnoticed and can be derived, in part, from the fact that they are commonly used to embed full-blown languages, as in the highlighted cases of SQL and XML. While coding in this manner, i.e. string-embedding, a benefit of formal languages becomes less obvious: these languages have their own set of rules, i.e. their formal grammar, that apply as a matter of well-formedness. In fact we completely lose the benefit (real, potential or otherwise) of the well-formedness of these formal languages. In other words, the earliest possible point in time that one can determine for sure that some embedded SQL is correct, or in fact wrong, is when your code hits reality, i.e. at run-time. For some poor sods (or fools, depending on your p.o.v.), that could be when a livid customer is ranting and raving about some dodgy transaction that takes too long or doesn't yield the... or when some XML document that is spat out chokes some downstream system, which always has the effect of upsetting a lot of people.

Let's consider escaping: how annoying does it get when one arrives at (worse, has to write) a ridiculously long regular expression string, let's say in Java? There are so many meta-characters frequently used to control the semantics of a RegExp that are escaped with a '\' (backslash), and thus must be doubly escaped in Java. Particularly awkward is managing the namespace overlap between Java string escape sequences and RegExp escape sequences, none more troublesome than '\n', which means insert new-line to Java and match new-line to the RegExp engine, where the former takes precedence unless the backslash is itself escaped. The comprehensibility of escaped backslashes can be argued, i.e. you know you are writing Java so you know what '\\\\' means, but what I can almost guarantee is that as the number of escaped backslashes increases, the more likely you are to go boss-eyed, recount them n times and ultimately write the wrong RegExp before getting it right.
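
To make the double-escaping concrete, here is a minimal sketch in Python rather than Java (an assumption of convenience: Python's ordinary string literals suffer the same overlap, and its raw-string prefix r"..." shows what one layer of escaping looks like). The path value is purely illustrative.

```python
import re

# One layer of escaping (raw string): only the regex engine reads the backslashes.
raw = re.compile(r"\\")
# Two layers (ordinary string): the literal and the regex each consume one level,
# so matching ONE backslash takes FOUR in the source.
escaped = re.compile("\\\\")
assert raw.pattern == escaped.pattern

path = r"C:\temp\new"  # illustrative value containing two backslashes
print(len(raw.findall(path)))  # 2

# Namespace overlap: "\b" is a backspace character to the string literal,
# while r"\b" reaches the regex engine intact as a word boundary.
print(re.search("\bcat", "a cat"))           # None
print(re.search(r"\bcat", "a cat").group())  # cat
```

The raw string does not remove the regex layer of escaping; it only stops the host language from adding a second one on top.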

Ruby's Enhanced String Notation

Just as an aside, while we are discussing the use of strings, Ruby is quite extraordinary w.r.t. how one instantiates strings. I have read that Ruby is in some ways designed to mimic the Linux shell, e.g. in its creation of strings: ' (single-quote) strings are different to " (double-quote) strings. As in the Linux shell, the double-quote notation expands referenced variables; one can inject values from locally (lexically) scoped variables into the literal string value using Ruby's #{} notation when embedded in double-quoted strings. For example:
a = 2
puts "the variable a = #{a}"
This yields the variable a = 2.
# ...
puts 'the variable a = #{a}'
And this yields the variable a = #{a} verbatim, with no substitution. There will be plenty of programmers out there that are familiar with C's printf()-family of procedures, which features in many post-C languages. String.format() (along with PrintStream.printf()) was introduced in Java 5, leveraging the introduction of varargs. I was once on a Python project too and grew extremely fond of the overloaded % String operator; coming from Java, I was quite blown away by that back then.
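
For anyone who has not met it, this is roughly what the Python % operator looks like in use. It is only a sketch; the values and names are made up for illustration.

```python
a = 2

# Applied to a string, % performs printf-style formatting.
line = "the variable a = %d" % a
print(line)  # the variable a = 2

# Several values are passed as a tuple, named values as a dict.
pair = "%s costs %.2f" % ("coffee", 2.5)
named = "%(name)s is %(age)d" % {"name": "Ada", "age": 36}
print(pair)   # coffee costs 2.50
print(named)  # Ada is 36
```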

Using Strings Responsibly Might Mean Not Using Them At All

So it seems that some language authors have felt the pain and reacted appropriately by introducing real types for these things that might otherwise be embedded in strings. Here are a few cases that I know about; if I've missed any, which I am sure I have, please let me know.

RegExp - JavaScript, ActionScript, Ruby, Python

The common syntax for these languages, with the exception of Python, is to use the '/' (forward-slash) as the RegExp delimiter, in exactly the same way single-quotes or double-quotes are used to delimit strings. For example:
/regular expression/
JavaScript and ActionScript behave almost the same (in my experience, correct me if I am wrong), which suggests to me this may be a feature of ECMAScript; I am not clued up on ECMAScript directly, so I can only speculate. By 'behave' I mean in reference to the flags you can specify to alter the way the RegExp processes the target string. For example:
/regular expression/gimsx;
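
Python, the exception noted above, has no RegExp literal: the pattern is still a string (usually a raw one) handed to re.compile, and the flags are passed as constants rather than trailing letters. A rough equivalent, with a made-up pattern and input, might look like this:

```python
import re

# re.I, re.M and re.X correspond to the i, m and x flags; there is no g flag,
# since methods such as findall already return every match.
pattern = re.compile(r"""
    ^item      # the line must start with "item" (case-insensitive via re.I)
    \s+        # one or more whitespace characters
    (\d+)      # capture the number that follows
    """, re.I | re.M | re.X)

text = "Item 1\nitem 22\nother 3"
print(pattern.findall(text))  # ['1', '22']
```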

XML Notation - ActionScript

ActionScript overloads the < and > operators to be used as XML delimiters (obvious choice, no?).
var tagname:String = "item"; 
var attributename:String = "id"; 
var attributevalue:String = "5"; 
var content:String = "Chicken"; 
var x:XML = <{tagname} {attributename}={attributevalue}>{content}</{tagname}>; 
trace(x.toXMLString())
    // Output: <item id="5">Chicken</item>
This was borrowed from the Flex 3 Developer Guide. As you can see, you can reference variables inline to construct any part of the XML fragment.
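
For contrast, here is a sketch of the same fragment built in a language without XML literals, using Python's standard xml.etree.ElementTree module. The variable names mirror the ActionScript ones; nothing here comes from the Flex guide.

```python
import xml.etree.ElementTree as ET

tagname = "item"
attributename = "id"
attributevalue = "5"
content = "Chicken"

# The element is assembled through API calls instead of inline markup.
element = ET.Element(tagname, {attributename: attributevalue})
element.text = content

print(ET.tostring(element, encoding="unicode"))  # <item id="5">Chicken</item>
```

The result is the same and still well-formed by construction, but the shape of the XML is no longer visible in the code.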

Binary & Bit Notation - Erlang

Erlang has the tidiest notation for bits and bytes I've ever seen... excuse me if this is a blatant show of ignorance:
Binary = <<98,105,110,97,114,121>>.
Each comma-separated value represents a byte and so must fall in the inclusive range 0-255. This example shows a 6-byte binary.
Binary = <<"binary">>.
This happens to be valid Erlang as the patterns match: Binary is already bound, so the second = is a pattern match against the first value rather than a reassignment. This second example shows a 6-byte binary represented as a string (ironically), as all byte values represent printable ASCII character codes. The first example is matched as the values specified are the corresponding ASCII character codes for binary. I believe Erlang will prefer to display the string form when possible.
B = 98.
I = 105.
N = 110.
A = 97.
R = 114.
Y = 121.
Binary = <<B,I,N,A,R,Y>>.
Erlang also allows you to reference variables/constants/terms (whatever they are called in Erlang) inline. The bit notation is only a minor variation on this theme.
MangledBinary = <<B:4,I:4,N:4,A:4,R:4,Y:4>>.
The :4 says to take the 4 least significant bits, in this case chopping each byte in half, which incidentally still yields a printable binary, but now only 3 bytes long (<<")á)">>).
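
The arithmetic behind that 3-byte result can be checked by hand. Below is a sketch in Python (which has no bit syntax, so the masking and shifting are spelled out) that packs the low 4 bits of each byte in pairs:

```python
values = b"binary"  # bytes 98, 105, 110, 97, 114, 121

# Keep the 4 least significant bits of each byte.
nibbles = [v & 0x0F for v in values]
print(nibbles)  # [2, 9, 14, 1, 2, 9]

# Pack consecutive nibbles in pairs, the first of each pair in the high half.
packed = bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))
print(list(packed))  # [41, 225, 41]
```

41 is ')' in ASCII and 225 is 'á' in Latin-1, which is where the printable form comes from.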

So The Part Relevant to DSLs

LINQ-to-SQL - C#

C#'s LINQ-to-SQL, coupled with LINQ Expression (comprehension) syntax, is the closest thing I've seen to an embedded DSL that is strongly typed. That is, you can express a literal and assign it inline to a specialized type, relevant to its domain of use.
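// Requires: using System.Linq;
enum Food { Fruit, Veg }

// An implicitly typed array of anonymous objects.
var fruitAndVeg = new[] {
    new { Type = Food.Fruit, Name = "Apple" },
    new { Type = Food.Fruit, Name = "Banana" },
    new { Type = Food.Fruit, Name = "Cherry" },
    new { Type = Food.Veg, Name = "Artichoke" },
    new { Type = Food.Veg, Name = "Asparagus" },
    new { Type = Food.Veg, Name = "Broccoli" },
    new { Type = Food.Veg, Name = "Carrot" }
};

// The query expression is checked by the compiler, not parsed from a string.
var names =
    from food in fruitAndVeg
    where food.Type == Food.Veg
    where food.Name.StartsWith("A")
    select food.Name;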
This yields Artichoke and Asparagus. While this is not SQL-92 (nor does it actually link to SQL [read: database, although it can]), you can clearly see the resemblance and can therefore immediately identify the comprehension syntax as a query language.
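
Python's list comprehensions are a close cousin of this syntax. As a rough comparison only (the Food enum and the tuple layout are my own stand-ins for the C# anonymous types), the same query could be written as:

```python
from enum import Enum


class Food(Enum):
    FRUIT = "fruit"
    VEG = "veg"


fruit_and_veg = [
    (Food.FRUIT, "Apple"),
    (Food.FRUIT, "Banana"),
    (Food.FRUIT, "Cherry"),
    (Food.VEG, "Artichoke"),
    (Food.VEG, "Asparagus"),
    (Food.VEG, "Broccoli"),
    (Food.VEG, "Carrot"),
]

# "select name from fruit_and_veg where kind is VEG and name starts with A"
names = [name for kind, name in fruit_and_veg
         if kind is Food.VEG and name.startswith("A")]
print(names)  # ['Artichoke', 'Asparagus']
```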

Make the Tools Do The Work!

For those that appreciate type-safety (i.e. those that use [and like] strongly-typed languages), compile-time support for these typed literals would come as standard. Typed literals would also permit enhanced IDE support: literal validation, syntax highlighting and code completion (if appropriate). These are not ideas of my own, rather a very brief account of my experience with Visual Studio and LINQ with its comprehension syntax. The problem is that the implementation of this is quite involved and limited to querying IEnumerable-type things. AFAIK, there is no general mechanism for providing such a feature, a feature which essentially allows you to extend the syntax of your embedding language, e.g. C#. Many text editors provide a mechanism to specify information about your language so that they can usefully provide syntax highlighting. However, what we are getting at with typed literals goes beyond this and really needs to extend the compiler and/or the IDE; I envisage some sort of plugin, which either or both of the compiler and IDE can understand, that has some sort of Type -> Expression syntax mapping. But once we've done all this, how far away would we be from the effort to do it all using Bison, Flex, JavaCC, YACC or whatever other parser/compiler-generator? I'm willing to bet it would be a similar effort, but what we gain by embedding/inlining, aside from the features and support that the IDE would typically provide, is:
  • The opportunity for value injection, as per Ruby strings and ActionScript XML fragments.
  • The opportunity to simplify tasks that are typically more painful or less compact in the main language e.g. C# and LINQ.
  • The opportunity to do all the above where you need it, in the code i.e. not in some external file, thus incurring no maintenance penalty.
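
To give a feel for the Type -> Expression syntax mapping, here is a very rough sketch in Python. Everything in it is hypothetical: the PARSERS registry, the literal() helper and the kind names are invented for illustration, and because it lives in a library rather than in the compiler it can only fail when the module loads, not at compile time.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical registry: literal kind -> function that parses (and so validates) the text.
PARSERS = {
    "regexp": re.compile,
    "xml": ET.fromstring,
}


def literal(kind, text):
    """Parse text as the given kind of literal, failing as soon as it is defined."""
    try:
        return PARSERS[kind](text)
    except (re.error, ET.ParseError) as exc:
        raise ValueError(f"malformed {kind} literal: {exc}") from exc


# Defined at module level, so a malformed literal is reported when the module
# loads rather than when the code path that uses it finally runs.
WORD = literal("regexp", r"\b\w+\b")
ITEM = literal("xml", '<item id="5">Chicken</item>')

print(WORD.findall("typed literals"))  # ['typed', 'literals']
print(ITEM.get("id"))                  # 5
```

A real plugin would hook the same kind of mapping into the compiler and the IDE, which is where the validation, highlighting and completion would come from.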
Well, that's about all I have to say/muse on the subject. With closures and dynamic constructs becoming more popular, I wait to see what other developments materialise next in the language space. Cheers.

4 comments:

Jonas Bandi said...

Interesting post, thanks!

Two things coming to mind while reading:
Groovy also offers the embedding of Regexp and a very DSLish approach to process XML.
http://groovy.codehaus.org/Regular+Expressions
http://groovy.codehaus.org/Reading+XML+using+Groovy%27s+XmlSlurper

As far as I understand .NET, your C# example actually refers to plain LINQ.
LINQ is a language feature. LINQ-to-SQL is a LINQ Provider, a pluggable library that adapts LINQ features for an SQL-based data store.
(By the way LINQ-to-SQL is merged with LINQ-to-Entities and will not be further developed by MS)

Darren Bishop said...

Hey fella. Yeah I know about LINQ's Provider design structure. You are correct about the non-SQL-ness of the LINQ example, but I do say as much in the text following the example.

Thanks for the Groovy links, I've never looked at Groovy before.

- The RegExp stuff is pretty cool; I wish some of this stuff was just sorted by bog-standard Java - as far as I'm concerned there is no good reason why not... especially given that it's just a case of existing operators (reserved characters) being overloaded.

- The XML stuff is also quite nice - although it's not as simple as E4X, which is the closest thing to support for inline XPath. I do like the multi-line string notation; I think Python has this. Again, not sure why Java doesn't.

Back to LINQ: will LINQ-to-Entities be maintained by MS?

Cheers.

Jonas Bandi said...

Concerning MS's strategy for LINQ-to-SQL, there was quite a buzz in the community.

Here is a good summary:
Is LINQ to SQL Truly Dead?