## Lua string patterns in Rust
[Lua string patterns](https://www.lua.org/pil/20.2.html) are a powerful
yet lightweight alternative to full regular expressions. They are not
regexps, since there is no alternation (the `|` operator), but this
is not usually a problem. In fact, full regexps become _too powerful_ and
power can be dangerous or just plain confusing.
This is why OpenBSD's httpd has [Lua patterns](http://man.openbsd.org/patterns.7).
The decision to use `%` as the escape rather than the traditional `\` is refreshing.
In the Rust context, `lua-patterns` is a very lightweight dependency, if you
don't need the full power of the `regex` crate.
This library reuses the original source from Lua 5.2 - only
400 lines of battle-tested C. I originally did this for a similar project to bring
[these patterns to C++](https://github.com/stevedonovan/rx-cpp).
More information can be found on [the Lua wiki](http://lua-users.org/wiki/PatternsTutorial).
The cool thing is that Lua is a 300KB download, if you want to test patterns out
without going through Rust.
I've organized the Rust interface much like the original Lua library ('match',
'gmatch' and 'gsub'), but made these methods of a `LuaPattern` struct. This is
for two main reasons:
- although string patterns are not compiled, they can be validated upfront
- after a match, the struct contains the results
```rust
extern crate lua_patterns;
use lua_patterns::LuaPattern;
let mut m = LuaPattern::new("one");
let text = "hello one two";
assert!(m.matches(text));
let r = m.range();
assert_eq!(r.start, 6);
assert_eq!(r.end, 9);
```
This is not in itself impressive, since it can be done with the string `find`
method. (`new` will panic if you feed it a bad pattern, so use `new_try` if
you want more control.)
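For comparison, here is roughly what that literal search looks like with nothing but the standard library. This is only a sketch: `find_range` is a made-up helper name, not part of `lua-patterns`.

```rust
use std::ops::Range;

/// Hypothetical helper: byte range of the first occurrence of `needle` in `text`.
fn find_range(text: &str, needle: &str) -> Option<Range<usize>> {
    // `str::find` returns the byte offset of the first match, if any.
    text.find(needle).map(|start| start..start + needle.len())
}
```

With the text from above, `find_range("hello one two", "one")` gives `Some(6..9)`, the same range that `m.range()` reported.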
Once we start using patterns it gets more exciting, especially
with _captures_:
```rust
let mut m = LuaPattern::new("(%a+) one");
let text = " hello one two";
assert!(m.matches(text));
assert_eq!(m.capture(0),1..10); // "hello one"
assert_eq!(m.capture(1),1..6); // "hello"
```
Lua patterns (like regexps) are not anchored by default, so this finds
the first match and works from there. The 0 capture always exists
(the full match) and here the 1 capture just picks up the first word.
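To make 'not anchored' concrete, here is a hand-rolled sketch of the search for this one pattern, using only the standard library. It illustrates the idea and is an assumption about the general shape of the algorithm, not the Lua matcher itself; `match_word_one_at` and `find_unanchored` are invented names.

```rust
use std::ops::Range;

/// Anchored attempt at "(%a+) one": one or more ASCII letters followed by the
/// literal " one", starting exactly at byte offset `start`.
/// Returns (full match, first capture) as byte ranges.
fn match_word_one_at(text: &[u8], start: usize) -> Option<(Range<usize>, Range<usize>)> {
    let mut end = start;
    while end < text.len() && text[end].is_ascii_alphabetic() {
        end += 1;
    }
    if end == start {
        return None; // "%a+" needs at least one letter
    }
    // The literal starts with a space, which is never a letter, so the
    // greedy run of letters never has to give anything back here.
    let tail = b" one";
    if text[end..].starts_with(tail) {
        Some((start..end + tail.len(), start..end))
    } else {
        None
    }
}

/// Unanchored search: try the anchored match at every offset, first hit wins.
fn find_unanchored(text: &str) -> Option<(Range<usize>, Range<usize>)> {
    let bytes = text.as_bytes();
    (0..bytes.len()).find_map(|start| match_word_one_at(bytes, start))
}
```

For `" hello one two"` the attempt at offset 0 fails on the leading space and the attempt at offset 1 succeeds, giving `(1..10, 1..6)` as in the assertions above.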
> There is an obvious limitation: "%a" refers specifically to a single byte
> representing a letter according to the C locale. Lua people will often
> look for 'sequence of non-spaces' ("%S+"), etc - that is, identify maybe-UTF-8
> sequences using surrounding punctuation or spaces.

If you want your captures as strings, then there are several options. If there's
just one, then `match_maybe` is useful:
```rust
let mut m = LuaPattern::new("OK%s+(%d+)");
let res = m.match_maybe("and that's OK 400 to you");
assert_eq!(res, Some("400"));
```
You can also grab all the captures as a vector (it will be empty if the match fails):
```rust
let mut m = LuaPattern::new("(%a+) one");
let text = " hello one two";
let v = m.captures(text);
assert_eq!(v, &["hello one","hello"]);
```
This allocates a new vector on each call. You can avoid excessive allocations with `capture_into`:
```rust
let mut v = Vec::new();
if m.capture_into(text,&mut v) {
    assert_eq!(v, &["hello one","hello"]);
}
```
Imagine that this is happening in a loop - the vector is only allocated the first
time it is filled, and thereafter there are no allocations. It's a convenient
method if you are checking text against several patterns, and is actually
more ergonomic than using Lua's `string.match`. (Personally I prefer
to use those marvelous things called "if statements" rather than elaborate
regular expressions.)
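The allocation argument rests on an ordinary `Vec` property: `clear` drops the elements but keeps the capacity. Here is a small standard-library sketch of that reuse pattern; `fill_words` is a hypothetical stand-in that splits on whitespace, and it says nothing about how `capture_into` is actually implemented.

```rust
/// Hypothetical stand-in for a "fill this vector" API: collects the
/// whitespace-separated words of `text` into `out`, reusing its storage.
/// Returns false if nothing was found.
fn fill_words<'a>(text: &'a str, out: &mut Vec<&'a str>) -> bool {
    out.clear(); // removes old entries, keeps the allocated capacity
    out.extend(text.split_whitespace());
    !out.is_empty()
}
```

Called in a loop with the same vector, this only allocates when a line has more words than any line seen before.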
The `gmatch` method creates an iterator over all matched strings.
```rust
let mut m = LuaPattern::new("%S+");
let split: Vec<_> = m.gmatch("dog cat leopard wolf ").collect();
assert_eq!(split,&["dog","cat","leopard","wolf"]);
```
A single string is returned for each match; if the pattern has no captures, you get
the full match, otherwise you get the first capture. So "(%S+)" would give you the same result.
A more general version is `gmatch_captures`, which creates a _streaming_ iterator
over captures. You have to be a little careful with this one; in particular, you
will get nonsense if you try to `collect` the returned captures themselves, so don't
try to keep these values.
It is fine, however, to collect from an expression involving the `get` method!
```rust
let mut m = lua_patterns::LuaPattern::new("(%S)%S+");
let split: Vec<_> = m.gmatch_captures("dog cat leopard wolf")
    .map(|cc| cc.get(1)).collect();
assert_eq!(split,&["d","c","l","w"]);
```
Text substitution is an old favourite of mine, so here's `gsub_with`:
```rust
let mut m = LuaPattern::new("%$(%S+)");
let res = m.gsub_with("hello $dolly you're so $fine",
|cc| cc.get(1).to_uppercase()
);
assert_eq!(res,"hello DOLLY you're so FINE");
```
The closure is passed a `Captures` object and the captures are accessed
using the `get` method; the closure returns a `String`.
The second form of `gsub` is convenient when you have a replacement
string, which may contain capture references. (To add a literal "%", escape
it like so: "%%".)
```rust
let mut m = LuaPattern::new("%s+");
let res = m.gsub("hello dolly you're so fine","");
assert_eq!(res, "hellodollyyou'resofine");
let mut m = LuaPattern::new("(%S+)%s*=%s*(%S+);%s*");
let res = m.gsub("a=2; b=3; c = 4;", "'%2':%1 ");
assert_eq!(res, "'2':a '3':b '4':c ");
```
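One way to picture what a replacement string does: each `%1`, `%2` and so on is swapped for the corresponding capture, and `%%` becomes a single `%`. The sketch below does just that for one match using the standard library; `expand_template` is an invented name, and how it treats out-of-range or unknown escapes is my assumption, not necessarily what the crate does.

```rust
/// Hypothetical helper: expand "%0".."%9" and "%%" in `template`.
/// `captures[0]` is the full match, `captures[1]` the first capture, etc.
fn expand_template(template: &str, captures: &[&str]) -> String {
    let mut out = String::new();
    let mut chars = template.chars();
    while let Some(c) = chars.next() {
        if c != '%' {
            out.push(c);
            continue;
        }
        match chars.next() {
            Some('%') => out.push('%'), // "%%" is a literal percent sign
            Some(d) if d.is_ascii_digit() => {
                let idx = d.to_digit(10).unwrap() as usize;
                // Assumption: a reference to a missing capture expands to nothing.
                if let Some(cap) = captures.get(idx) {
                    out.push_str(cap);
                }
            }
            // Assumption: any other escape is copied through unchanged.
            Some(other) => {
                out.push('%');
                out.push(other);
            }
            None => out.push('%'),
        }
    }
    out
}
```

For the first match in the example above, the captures are `"a"` and `"2"`, so the template `"'%2':%1 "` expands to `"'2':a "`.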
The third form of `string.gsub` in Lua does lookup with a table - that is, a map.
But for maps you really want to handle the 'not found' case in some special way:
```rust
use std::collections::HashMap;

let mut map = HashMap::new();
// updating old lines for the 21st Century
map.insert("dolly", "baby");
map.insert("fine", "cool");
map.insert("good-looking", "pretty");
let mut m = LuaPattern::new("%$%((.-)%)");
let res = m.gsub_with("hello $(dolly) you're so $(fine) and $(good-looking)",
|cc| map.get(cc.get(1)).unwrap_or(&"?").to_string()
);
assert_eq!(res,"hello baby you're so cool and pretty");
```
(The ".-" pattern means 'match as little as possible' - often called 'lazy'
matching.)
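The difference from the greedy ".*" is easiest to see side by side. This standard-library sketch imitates both for a single delimited span; `lazy_between` and `greedy_between` are illustrative names, not crate functions.

```rust
/// Like "open(.-)close": content up to the FIRST `close` after `open`.
fn lazy_between<'a>(text: &'a str, open: &str, close: &str) -> Option<&'a str> {
    let start = text.find(open)? + open.len();
    let len = text[start..].find(close)?;
    Some(&text[start..start + len])
}

/// Like "open(.*)close": content up to the LAST `close` after `open`.
fn greedy_between<'a>(text: &'a str, open: &str, close: &str) -> Option<&'a str> {
    let start = text.find(open)? + open.len();
    let len = text[start..].rfind(close)?;
    Some(&text[start..start + len])
}
```

On `"$(dolly) and $(fine)"` the lazy version returns `"dolly"`, while the greedy one runs on to the last parenthesis and returns `"dolly) and $(fine"`.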
A closure can also do the work of a replacement string; the following is
equivalent to `gsub` with the replacement string "%1:'%2',":
```rust
let mut m = LuaPattern::new("(%S+)%s*=%s*([^;]+);");
let res = m.gsub_with("alpha=bonzo; beta=felix;",
|cc| format!("{}:'{}',", cc.get(1), cc.get(2))
);
assert_eq!(res, "alpha:'bonzo', beta:'felix',");
```
Having a byte-oriented pattern matcher can be useful. For instance, this
is basically the old `strings` utility - we read all of a 'binary' file into
a vector of bytes, and then use `gmatch_bytes` to iterate over all `&[u8]`
matches corresponding to two or more adjacent ASCII letters:
```rust
let mut words = LuaPattern::new("%a%a+");
for w in words.gmatch_bytes(&buf) {
    println!("{}",std::str::from_utf8(w).unwrap());
}
```
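If it helps to see what "%a%a+" selects, the same runs can be picked out by hand with slice methods. A sketch assuming `%a` means an ASCII letter, as in the C locale; `letter_runs` is a hypothetical helper.

```rust
/// Hypothetical helper: all maximal runs of two or more ASCII letters in `buf`.
fn letter_runs(buf: &[u8]) -> Vec<&[u8]> {
    buf.split(|b| !b.is_ascii_alphabetic()) // cut at every non-letter byte
        .filter(|run| run.len() >= 2) // "%a%a+" needs at least two letters
        .collect()
}
```

Since every run is pure ASCII, passing it to `std::str::from_utf8` cannot fail, which is why the `unwrap` in the loop above is safe for this particular pattern.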
The pattern itself may be arbitrary bytes - Lua 'string' matching does
not care about embedded nul bytes:
```rust
let patt = &[0xDE,0x00,b'+',0xBE];
let bytes = &[0xFF,0xEE,0x0,0xDE,0x0,0x0,0xBE,0x0,0x0];
let mut m = LuaPattern::from_bytes(patt);
assert!(m.matches_bytes(bytes));
assert_eq!(&bytes[m.capture(0)], &[0xDE,0x00,0x00,0xBE]);
```
The problem here is that it's not obvious when our 'arbitrary' bytes
include one of the special matching characters like `$` (which is 0x24)
and so on. Hence there is `LuaPatternBuilder`:
```rust
let bytes = &[0xFF,0xEE,0x0,0xDE,0x24,0x24,0xBE,0x0,0x0];
let patt = LuaPatternBuilder::new()
.bytes_as_hex("DE24") // less tedious than a byte slice
.text("+") // unescaped
.bytes(&[0xBE]) // byte slice
.build();
let mut m = LuaPattern::from_bytes(&patt);
// picks up "DE2424BE"
```
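The idea behind such a builder can be sketched without the crate: any raw byte that happens to be a pattern special character gets a `%` in front of it. This is my guess at the approach, not the actual `LuaPatternBuilder` source, and `escape_bytes` and `hex_to_bytes` are invented names.

```rust
/// Lua's pattern special characters.
const SPECIAL: &[u8] = b"^$()%.[]*+-?";

/// Hypothetical helper: copy `raw`, putting `%` before each special byte.
fn escape_bytes(raw: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(raw.len());
    for &b in raw {
        if SPECIAL.contains(&b) {
            out.push(b'%');
        }
        out.push(b);
    }
    out
}

/// Hypothetical helper: parse pairs of hex digits ("DE24") into bytes.
/// Returns None on odd length or any non-hex character.
fn hex_to_bytes(hex: &str) -> Option<Vec<u8>> {
    if hex.len() % 2 != 0 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    (0..hex.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&hex[i..i + 2], 16).ok())
        .collect()
}
```

Here `escape_bytes(&[0xDE, 0x24])` yields the bytes `DE 25 24`, so the `$` is matched literally instead of acting as an anchor.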
> Static verification: this version attempts to verify string patterns. If you
> want to handle errors yourself, use `new_try` and `from_bytes_try`; otherwise the
> constructors panic. If a match panics after successful verification, it is a
> __BUG__ - please report the offending pattern.