lua-patterns/readme.md
2017-04-18 19:21:48 +02:00

Lua string patterns in Rust

Lua string patterns are a powerful yet lightweight alternative to full regular expressions. They are not regexps, since there is no alternation (the | operator), but this is not usually a problem. In fact, full regexps become too powerful and power can be dangerous or just plain confusing. This is why OpenBSD's httpd has Lua patterns. The decision to use % as the escape rather than the traditional \ is refreshing. In the Rust context, lua-patterns is a very lightweight dependency, if you don't need the full power of the regex crate.

This library reuses the original source from Lua 5.2 - only 400 lines of battle-tested C. I originally did this for a similar project to bring these patterns to C++.

More information can be found on the Lua wiki.

I've organized the Rust interface much like the original Lua library, with 'match', 'gmatch' and 'gsub', but made these methods of a LuaPattern struct. This is for two main reasons:

  • although string patterns are not compiled, they can be validated upfront
  • after a match, the struct contains the results

extern crate lua_patterns;
use lua_patterns::LuaPattern;

let mut m = LuaPattern::new("one");
let text = "hello one two";
assert!(m.matches(text));
let r = m.range();
assert_eq!(r.start, 6);
assert_eq!(r.end, 9);

This is not in itself impressive, since it can be done with the string find method, but once we start using patterns it gets more exciting, especially with captures:

let mut m = LuaPattern::new("(%a+) one");
let text = " hello one two";
assert!(m.matches(text));
assert_eq!(m.capture(0),1..10); // "hello one"
assert_eq!(m.capture(1),1..6); // "hello"

Lua patterns (like regexps) are not anchored by default, so this finds the first match and works from there. The 0 capture always exists (the full match) and here the 1 capture just picks up the first word.
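
For a purely literal pattern, the same unanchored 'first match' range can be had from the standard library alone. This is a minimal sketch for comparison; find_range is a hypothetical helper name, not part of lua-patterns:

```rust
use std::ops::Range;

/// Byte range of the first occurrence of `needle` in `text`, if any.
/// `find_range` is a hypothetical name used only for this sketch.
fn find_range(text: &str, needle: &str) -> Option<Range<usize>> {
    // str::find returns the byte offset where the first match starts
    text.find(needle).map(|start| start..start + needle.len())
}
```

Calling find_range("hello one two", "one") gives the same 6..9 range as the first example, which is why the literal case is not the interesting one.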

There is an obvious limitation: "%a" refers specifically to a single byte representing a letter according to the C locale. Lua people will often look for 'sequence of non-spaces' ("%S+"), etc - that is, identify maybe-UTF-8 sequences using surrounding punctuation or spaces.
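
To see what the byte-oriented limitation means in practice, here is a small std-only sketch. It assumes that "%a" in the C locale accepts exactly the ASCII letters, which is what u8::is_ascii_alphabetic tests; both function names are made up for the illustration:

```rust
/// Count the bytes of `text` that a byte-oriented letter class would accept.
/// Assumption: in the C locale this means ASCII letters only.
fn ascii_letter_bytes(text: &str) -> usize {
    text.bytes().filter(|b| b.is_ascii_alphabetic()).count()
}

/// Count the characters of `text` that Unicode considers alphabetic.
fn letter_chars(text: &str) -> usize {
    text.chars().filter(|c| c.is_alphabetic()).count()
}
```

An accented letter is encoded as two UTF-8 bytes, neither of which is an ASCII letter, so the two counts disagree on text like "h\u{e9}llo" and a "%a+" run would stop short at the accent.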

If you want your captures as strings, then there are several options. Grab them as a vector (it will be empty if the match fails):

let v = m.captures(text);
assert_eq!(v, &["hello one","hello"]);

This will create a vector - you can avoid excessive allocations with capture_into:

let mut v = Vec::new();
if m.capture_into(text,&mut v) {
    assert_eq!(v, &["hello one","hello"]);
}

Imagine that this is happening in a loop - the vector is only allocated the first time it is filled, and thereafter there are no allocations. It's a convenient method if you are checking text against several patterns, and is actually more ergonomic than using Lua's string.match. (Personally I prefer to use those marvelous things called "if statements" rather than elaborate regular expressions.)
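
The allocation argument can be checked with plain std. This sketch assumes a 'fill this vector' function clears and refills its argument the way capture_into is described above; fields_into is a hypothetical stand-in that splits on commas rather than matching a pattern:

```rust
/// Refill `out` with the non-empty comma-separated fields of `line`.
/// Hypothetical stand-in for a capture_into-style API: returns true if
/// anything was found, and reuses whatever capacity `out` already has.
fn fields_into<'a>(line: &'a str, out: &mut Vec<&'a str>) -> bool {
    // clear() drops the old items but keeps the allocated buffer
    out.clear();
    out.extend(line.split(',').map(str::trim).filter(|f| !f.is_empty()));
    !out.is_empty()
}
```

Calling this in a loop with one vector declared outside the loop only grows the buffer when a line has more fields than any line seen before.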

The gmatch method creates an iterator over all matches.

let mut m = LuaPattern::new("%S+");
let split: Vec<_> = m.gmatch("dog  cat leopard wolf  ").collect();
assert_eq!(split,&["dog","cat","leopard","wolf"]);

Each item is a single string slice: if the pattern has no captures, you get the full match, otherwise you get the first capture. So "(%S+)" would give you the same result.
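
For this particular pattern the standard library already has a close equivalent, which makes a handy cross-check. A minimal sketch (non_space_runs is a made-up name; note that split_whitespace uses the Unicode definition of whitespace, whereas "%S" follows the C locale, so the two only agree for certain on ASCII input):

```rust
/// Collect the maximal runs of non-whitespace characters in `text`.
/// Hypothetical helper: a std-only counterpart to gmatch with "%S+".
fn non_space_runs(text: &str) -> Vec<&str> {
    // split_whitespace skips leading, trailing and repeated whitespace
    text.split_whitespace().collect()
}
```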

Text substitution is an old favourite of mine, so here's gsub:

let mut m = LuaPattern::new("%$(%S+)");
let res = m.gsub("hello $dolly you're so $fine",
    |cc| cc.get(1).to_uppercase()
);
assert_eq!(res,"hello DOLLY you're so FINE");

The closure is passed a Closures object and accesses the captures through its get method; the closure itself returns a String.

In Lua, string.gsub has three forms:

  • using a closure, like here
  • using a replacement string referencing captures, like "%1-%2"
  • using a table - i.e. a map

The first is more general, and the other cases can be implemented in a straightforward way using it (although I am thinking of implementing the second case as a convenient shortcut.) For maps, you usually want to handle the 'not found' case in some special way:

use std::collections::HashMap;

let mut map = HashMap::new();
// updating old lines for the 21st Century
map.insert("dolly", "baby");
map.insert("fine", "cool");
map.insert("good-looking", "pretty");

let mut m = LuaPattern::new("%$%((.-)%)");
let res = m.gsub("hello $(dolly) you're so $(fine) and $(good-looking)",
    |cc| map.get(cc.get(1)).unwrap_or(&"?").to_string()
);
assert_eq!(res,"hello baby you're so cool and pretty");

(The ".-" pattern means 'match as little as possible' - often called 'lazy' matching.)
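
To make the lazy matching and the 'not found' handling concrete, here is a hand-rolled std-only sketch of the same "$(name)" expansion. It is not how the crate's gsub works internally; expand_vars is a hypothetical helper written just for this illustration:

```rust
use std::collections::HashMap;

/// Expand every "$(name)" in `text` using `map`; unknown names become "?".
/// Hypothetical std-only helper, not the crate's gsub.
fn expand_vars(text: &str, map: &HashMap<&str, &str>) -> String {
    let mut out = String::with_capacity(text.len());
    let mut rest = text;
    while let Some(start) = rest.find("$(") {
        let body = &rest[start + 2..];
        // Stopping at the FIRST ')' is the 'as little as possible' behaviour
        // of ".-"; a greedy match would run on to the last ')' instead.
        let len = match body.find(')') {
            Some(len) => len,
            None => break, // unterminated "$(" - copy the remainder unchanged
        };
        out.push_str(&rest[..start]);
        // The 'not found' case is handled in exactly one place
        out.push_str(map.get(&body[..len]).copied().unwrap_or("?"));
        rest = &body[len + 1..];
    }
    out.push_str(rest);
    out
}
```

With a greedy capture, "$(dolly) you're so $(fine)" would be treated as one variable named "dolly) you're so $(fine", which is why the lazy form matters here.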

For the replacement case, this is equivalent to the replacement string "%1:'%2',":

let mut m = LuaPattern::new("(%S+)%s*=%s*([^;]+);");
let res = m.gsub("alpha=bonzo; beta=felix;",
    |cc| format!("{}:'{}',", cc.get(1), cc.get(2))
);
assert_eq!(res, "alpha:'bonzo', beta:'felix',");
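
A replacement string could be layered on top of the closure form by expanding the "%N" references against the captures. The following is only a sketch of that idea under simplifying assumptions, not something the crate provides: expand_template is a hypothetical name, captures arrive as a plain slice with the full match at index 0, and a reference to a missing capture expands to nothing rather than being an error:

```rust
/// Expand "%0".."%9" in `template` from `caps`, where caps[0] is the full
/// match. Hypothetical helper; simplified compared to Lua's own rules.
fn expand_template(template: &str, caps: &[&str]) -> String {
    let mut out = String::with_capacity(template.len());
    let mut chars = template.chars();
    while let Some(c) = chars.next() {
        if c != '%' {
            out.push(c);
            continue;
        }
        match chars.next() {
            // "%N" inserts capture N; a missing capture expands to nothing
            Some(d) if d.is_ascii_digit() => {
                let idx = (d as u8 - b'0') as usize;
                out.push_str(caps.get(idx).copied().unwrap_or(""));
            }
            // "%%" gives a literal '%'; other escaped characters are copied
            Some(other) => out.push(other),
            // a lone trailing '%' is kept as-is
            None => out.push('%'),
        }
    }
    out
}
```

Inside a gsub closure one would gather the captures into a slice and return the expanded string.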

Having a byte-oriented pattern matcher can be useful. For instance, this is basically the old strings utility - we read all of a 'binary' file into a vector of bytes, and then use gmatch_bytes to iterate over all &[u8] matches corresponding to two or more adjacent ASCII letters:

let mut words = LuaPattern::new("%a%a+");
for w in words.gmatch_bytes(&buf) {
    println!("{}",std::str::from_utf8(w).unwrap());
}
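
For comparison, here is what that pattern selects, written with slices only. It assumes "%a" means ASCII letters (the C locale), and ascii_words is a made-up name; the buffer itself could come from std::fs::read, which returns the whole file as a Vec<u8>:

```rust
/// Collect every maximal run of two or more adjacent ASCII letters in `buf`.
/// Hypothetical std-only counterpart to gmatch_bytes with "%a%a+".
fn ascii_words(buf: &[u8]) -> Vec<&[u8]> {
    // Splitting on every non-letter byte leaves the letter runs (and empty
    // slices between adjacent separators); keep runs of length 2 or more.
    buf.split(|b| !b.is_ascii_alphabetic())
        .filter(|run| run.len() >= 2)
        .collect()
}
```

The pattern version is shorter and easier to change, which is rather the point of having a byte-oriented matcher.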

The pattern itself may be arbitrary bytes - Lua 'string' matching does not care about embedded nul bytes:

let patt = &[0xDE,0x00,b'+',0xBE];
let bytes = &[0xFF,0xEE,0x0,0xDE,0x0,0x0,0xBE,0x0,0x0];

let mut m = LuaPattern::from_bytes(patt);
assert!(m.matches_bytes(bytes));
assert_eq!(&bytes[m.capture(0)], &[0xDE,0x00,0x00,0xBE]);