## Lua string patterns in Rust

[Lua string patterns](https://www.lua.org/pil/20.2.html) are a powerful
yet lightweight alternative to full regular expressions. They are not
regexps, since there is no alternation (the `|` operator), but this
is not usually a problem. In fact, full regexps become _too powerful_ and
power can be dangerous or just plain confusing.
This is why OpenBSD's httpd has [Lua patterns](http://man.openbsd.org/patterns.7).
The decision to use `%` as the escape rather than the traditional `\` is refreshing.
In the Rust context, `lua-patterns` is a very lightweight dependency, if you
don't need the full power of the `regex` crate.

This library reuses the original source from Lua 5.2 - only
400 lines of battle-tested C. I originally did this for a similar project to bring
[these patterns to C++](https://github.com/stevedonovan/rx-cpp).

More information can be found on [the Lua wiki](http://lua-users.org/wiki/PatternsTutorial).

I've organized the Rust interface much like the original Lua library
('match', 'gmatch' and 'gsub'), but made these methods of a `LuaPattern` struct.
This is for two main reasons:

  - although string patterns are not compiled, they can be validated upfront
  - after a match, the struct contains the results

```rust
extern crate lua_patterns;
use lua_patterns::LuaPattern;

let mut m = LuaPattern::new("one");
let text = "hello one two";
assert!(m.matches(text));
// byte range of the match within `text`
let r = m.range();
assert_eq!(r.start, 6);
assert_eq!(r.end, 9);
```

This is not in itself impressive, since it can be done with the string `find`
method, but once we start using patterns it gets more exciting, especially
with _captures_:

```rust
let mut m = LuaPattern::new("(%a+) one");
let text = " hello one two";
assert!(m.matches(text));
assert_eq!(m.capture(0),1..10); // "hello one"
assert_eq!(m.capture(1),1..6); // "hello"
```

Lua patterns (like regexps) are not anchored by default, so this finds
the first match and works from there.
The 0 capture always exists (the full match) and here the 1 capture just picks up
the first word.

> There is an obvious limitation: "%a" refers specifically to a single byte
> representing a letter according to the C locale. Lua people will often
> look for 'sequence of non-spaces' ("%S+"), etc - that is, identify maybe-UTF-8
> sequences using surrounding punctuation or spaces.

If you want your captures as strings, then there are several options. Grab them
as a vector (it will be empty if the match fails.)

```rust
let v = m.captures(text);
assert_eq!(v, &["hello one","hello"]);
```

This will create a vector - you can avoid excessive allocations with `capture_into`:

```rust
let mut v = Vec::new();
if m.capture_into(text,&mut v) {
    assert_eq!(v, &["hello one","hello"]);
}
```

Imagine that this is happening in a loop - the vector is only allocated the first
time it is filled, and thereafter there are no allocations. It's a convenient
method if you are checking text against several patterns, and is actually
more ergonomic than using Lua's `string.match`. (Personally I prefer to use those
marvelous things called "if statements" rather than elaborate regular expressions.)

The `gmatch` method creates an iterator over all matches.

```rust
let mut m = LuaPattern::new("%S+");
let split: Vec<_> = m.gmatch("dog  cat leopard wolf  ").collect();
assert_eq!(split,&["dog","cat","leopard","wolf"]);
```

A single string is returned for each match; if the pattern has no captures, you get
the full match, otherwise you get the first capture. So "(%S+)" would give you the
same result.

Text substitution is an old favourite of mine, so here's `gsub`:

```rust
let mut m = LuaPattern::new("%$(%S+)");
let res = m.gsub("hello $dolly you're so $fine",
    |cc| cc.get(1).to_uppercase()
);
assert_eq!(res,"hello DOLLY you're so FINE");
```

The closure is passed a `Captures` object and the captures are accessed
using the `get` method; the closure itself returns a `String`.
In Lua, `string.gsub` has three forms:

  - using a closure, like here
  - using a replacement string referencing captures, like "%1-%2"
  - using a table - i.e. a map

The first is more general, and the other cases can be implemented in a
straightforward way using it (although I am thinking of implementing the second case
as a convenient shortcut.)

For maps, you usually want to handle the 'not found' case in some special way:

```rust
use std::collections::HashMap;

let mut map = HashMap::new();
// updating old lines for the 21st Century
map.insert("dolly", "baby");
map.insert("fine", "cool");
map.insert("good-looking", "pretty");

let mut m = LuaPattern::new("%$%((.-)%)");
let res = m.gsub("hello $(dolly) you're so $(fine) and $(good-looking)",
    // fall back to "?" when the name is not in the map
    |cc| map.get(cc.get(1)).unwrap_or(&"?").to_string()
);
assert_eq!(res,"hello baby you're so cool and pretty");
```

(The ".-" pattern means 'match as little as possible' - often called 'lazy' matching.)

For the replacement case, this is equivalent to a replace string "%1:'%2',":

```rust
let mut m = LuaPattern::new("(%S+)%s*=%s*([^;]+);");
let res = m.gsub("alpha=bonzo; beta=felix;",
    |cc| format!("{}:'{}',", cc.get(1), cc.get(2))
);
assert_eq!(res, "alpha:'bonzo', beta:'felix',");
```

Having a byte-oriented pattern matcher can be useful. For instance, this
is basically the old `strings` utility - we read all of a 'binary' file into
a vector of bytes, and then use `gmatch_bytes` to iterate over all `&[u8]`
matches corresponding to two or more adjacent ASCII letters:

```rust
// `buf` is the Vec<u8> holding the contents of the file
let mut words = LuaPattern::new("%a%a+");
for w in words.gmatch_bytes(&buf) {
    println!("{}",std::str::from_utf8(w).unwrap());
}
```

The pattern itself may be arbitrary bytes - Lua 'string' matching does not care
about embedded nul bytes:

```rust
let patt = &[0xDE,0x00,b'+',0xBE];
let bytes = &[0xFF,0xEE,0x0,0xDE,0x0,0x0,0xBE,0x0,0x0];

let mut m = LuaPattern::from_bytes(patt);
assert!(m.matches_bytes(bytes));
assert_eq!(&bytes[m.capture(0)], &[0xDE,0x00,0x00,0xBE]);
```
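
Coming back to the second `gsub` form (a replacement string such as "%1-%2"): the text above says it can be built on top of the closure form. The following is only a rough sketch of how that expansion step might look, using nothing but the standard library. The function name `expand_template` and the slice-of-captures argument are made up for illustration and are not part of the crate; the sketch is also more lenient than Lua, which raises an error for an invalid capture index.

```rust
/// Expand a Lua-style replacement template such as "%1:'%2',".
/// `caps[0]` stands for the full match and `caps[n]` for capture n.
/// (Hypothetical helper for illustration, not part of lua-patterns.)
fn expand_template(template: &str, caps: &[&str]) -> String {
    let mut out = String::with_capacity(template.len());
    let mut chars = template.chars();
    while let Some(c) = chars.next() {
        if c != '%' {
            out.push(c);
            continue;
        }
        match chars.next() {
            // "%0" to "%9" refer to captures; an out-of-range index
            // expands to nothing in this sketch
            Some(d) if d.is_ascii_digit() => {
                let idx = d.to_digit(10).unwrap() as usize;
                if let Some(s) = caps.get(idx) {
                    out.push_str(s);
                }
            }
            // "%%" (or any other escaped character) is emitted literally
            Some(other) => out.push(other),
            // a trailing lone '%' is kept as-is
            None => out.push('%'),
        }
    }
    out
}

fn main() {
    let caps = ["alpha=bonzo;", "alpha", "bonzo"];
    assert_eq!(expand_template("%1:'%2',", &caps), "alpha:'bonzo',");
    assert_eq!(expand_template("100%% of %0", &caps), "100% of alpha=bonzo;");
}
```

Inside a `gsub` closure, one could presumably gather `cc.get(0)`, `cc.get(1)` and so on into a small slice and return the result of such a helper, which keeps the closure form as the single general mechanism.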