Regular expressions that work “everywhere”

johndcook.com

90 points by ColinWright 3 days ago


rtpg - 12 hours ago

Emacs in particular I suffer so much from basically guessing what needs to be escaped or not. I know `rx` exists[0] as an alternative but it's not really fun to use.

Even beyond the regex syntax itself, you often also start running into encoding problems when trying to actually use them. Typing the regex in a shell? Make sure to escape stuff properly. Regex in Python? Make sure it's a raw string. Etc etc etc
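To make the raw-string point concrete, here is a minimal sketch in Python. The sample text and variable names are made up for illustration; the only behaviour relied on is that `"\b"` in an ordinary string literal is a backspace character, while `r"\b"` reaches the regex engine as a word boundary.

```python
import re

text = "a word here"

# Ordinary string literal: "\b" becomes a backspace (0x08) before the
# regex engine ever sees it, so this searches for backspace+word+backspace.
plain = re.search("\bword\b", text)

# Raw string literal: the backslash survives, so \b is a word boundary.
raw = re.search(r"\bword\b", text)

print(plain)        # no match object
print(raw.group())  # the matched word
```

The same pattern text gives two different results depending only on how the host language quotes it, which is the encoding problem described above.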

It's a modern miracle we're at least within rhyming distance of how to write regexes in most tools.

[0]: https://www.gnu.org/software/emacs/manual/html_node/elisp/Rx...

JdeBP - 16 hours ago

The author is circling around, but not quite reaching, a statement that POSIX Basic Regular Expressions work everywhere, with the caveat that not everyone has caught up with version 8 of the Single Unix Specification, which has slightly changed BREs.

agnishom - 16 hours ago

A while ago, we wrote a paper about finding regexes which match the same way in both the greedy semantics and the leftmost maximal semantics.

https://par.nsf.gov/servlets/purl/10534654

MathMonkeyMan - 17 hours ago

I've always been a stickler for being specific about which regex language your thing accepts, and whether it is to match any substring, or a prefix, or a suffix, or the whole thing, or a line, or a substring of a line, or whatever.

Here are some of the [more popular][1] ones, and then there are PCRE and Python.

It took me a while to learn that some of the older ones you see in e.g. grep are [specified by POSIX][2].

[1]: https://cppreference.com/cpp/regex#Regular_expression_gramma...

[2]: https://pubs.opengroup.org/onlinepubs/009696899/basedefs/xbd...
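As an illustration of why the "substring, prefix, or whole thing" question matters, here is a small sketch using Python's `re` module, where the three behaviours are separate functions. The pattern and input are arbitrary examples, not taken from the article.

```python
import re

pattern = r"\d+"
text = "abc 123 def"

anywhere = re.search(pattern, text)      # match any substring
prefix = re.match(pattern, text)         # match only at the start
whole = re.fullmatch(pattern, text)      # match the entire string
whole_ok = re.fullmatch(pattern, "123")  # entire string is digits

print(anywhere.group(), prefix, whole, whole_ok.group())
```

A tool that documents only "takes a regex" leaves the reader guessing which of these three it does.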

mrbluecoat - 5 hours ago

> the following features work everywhere. YMMV.

Amusing pair of statements.

tonyg - 10 hours ago

That's one of the reasons RFC 9485, "I-Regexp: An Interoperable Regular Expression Format", is important.

https://datatracker.ietf.org/doc/html/rfc9485

dekdrop - 3 hours ago

I want to share Russ Cox's webpage on regexp https://swtch.com/~rsc/regexp/

I find it good reading.

chasil - 2 hours ago

Microsoft FINDSTR.EXE supports a subset of these regular expressions.

It does not support the + repetition operator.

ok_dad - 14 hours ago

Go stdlib regexp package does not support back references, as it uses the RE2 engine. You can use them in replace but not matching.

codetiger - 13 hours ago

I built my Rust library for JSONLogic and use bindings for other languages after similar frustrations with Rule engines, template engines and IFTTT engines. https://github.com/GoPlasmatic/datalogic-rs

myroon5 - 14 hours ago

JSON schema's docs also have a recommended regular expression subset:

https://json-schema.org/understanding-json-schema/reference/...

quotemstr - 16 hours ago

It drives me nuts when a developer documents something or other as being a "regex" but doesn't mention which dialect of regular expression he's talking about. This habit is particularly common in the Rust, JavaScript, and Python communities, which seem to forget that their language's regular expression language isn't universal.

pmarreck - 13 hours ago

I've become a fan of whatever PCRE2 understands

gilrain - 4 hours ago

We must find a way to return to SNOBOL/PITBOL. It’s so elegant and effective in Ada (where it’s in the standard library).

https://en.wikipedia.org/wiki/SNOBOL

> In the 1980s and 1990s, its use faded as newer languages such as AWK and Perl made string manipulation by means of regular expressions fashionable. SNOBOL4 patterns include a way to express BNF grammars, which are equivalent to context-free grammars and more powerful than regular expressions. The "regular expressions" in current versions of AWK and Perl are in fact extensions of regular expressions in the traditional sense, but regular expressions, unlike SNOBOL4 patterns, are not recursive, which gives a distinct computational advantage to SNOBOL4 patterns.

K0IN - 6 hours ago

So my favorite regex (.*?) works? Phew.

galaxyLogic - 12 hours ago

2 RegExp problems:

1. You can not compose a bigger regexp out of smaller ones

2. A regexp can not "call" other regexps

LoganDark - 16 hours ago

> the special characters . * ^ $

These already do not work in many tools which require those special characters to be escaped to have any meaning. An easy example is GNU grep, sed, etc. which use BRE ("Basic Regular Expressions") by default. The article mentions GNU coreutils but does not explain that `-E` is required to fix that behavior.

jonstewart - 15 hours ago

Then there’s not just the issue of whether the engine supports a particular syntactical feature but the issue of matching semantics. Perl/PCRE’s semantics are far different from POSIX’s, and some implementations have different semantics altogether (and quite reasonably).
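A rough sketch of the difference, in Python: its `re` module uses Perl-style backtracking, where the first alternative that succeeds wins, whereas POSIX asks for the leftmost-longest match. The `leftmost_longest` helper below is hypothetical and brute-force, written only to emulate the POSIX rule for small inputs. It is not how any real engine works.

```python
import re

def leftmost_longest(pattern, text):
    """Emulate POSIX leftmost-longest matching by brute force (illustrative only)."""
    compiled = re.compile(pattern)
    for start in range(len(text) + 1):            # leftmost start first
        for end in range(len(text), start - 1, -1):  # longest end first
            if compiled.fullmatch(text, start, end):
                return text[start:end]
    return None

# Perl-style: the first alternative that matches is taken.
perl_style = re.search(r"a|ab", "ab").group()

# POSIX-style: the longest match at the leftmost position is taken.
posix_style = leftmost_longest(r"a|ab", "ab")

print(perl_style, posix_style)
```

The same pattern on the same input matches `a` under one rule and `ab` under the other, even though both engines accept the syntax.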

semanticc - an hour ago

> So for my definition of “everywhere,” with the caveats mentioned above, the following features work everywhere. YMMV.

  .
  ^, $
  […], [^…]
  *
  \w, \W, \s, \S
  \1 - \9 backreferences
  \b \B
  ? + 
  | alternation
  {n,m} for counting matches
  (...) capturing

Except that these don't work in macOS/BSD sed (even with -E flag):

- \w, \W, \s, \S - need to use POSIX classes instead: [[:alnum:]_], [^[:alnum:]_], [[:space:]], [^[:space:]] (\w also matches the underscore, hence the extra _)

- \b - need to use [[:<:]] (word start) and [[:>:]] (word end) instead

- \B - (not a word start/end) no alternatives
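For what it's worth, the shorthand-to-POSIX substitution can be mechanised. The sketch below is a hypothetical Python helper (`to_posix_classes` and its mapping are my own names, not part of any tool) that rewrites `\w`, `\W`, `\s`, `\S` into bracket expressions, treating `\w` as `[[:alnum:]_]` since `\w` includes the underscore. It assumes the shorthands appear outside of `[...]`; a pattern like `[\w-]` would need real parsing.

```python
import re

# Assumed mapping from Perl-style shorthands to POSIX bracket expressions.
POSIX_EQUIVALENTS = {
    "w": "[[:alnum:]_]",
    "W": "[^[:alnum:]_]",
    "s": "[[:space:]]",
    "S": "[^[:space:]]",
}

def to_posix_classes(pattern: str) -> str:
    """Rewrite \\w, \\W, \\s, \\S; leave every other escape untouched."""
    def swap(m):
        # m.group(1) is the character after the backslash.
        return POSIX_EQUIVALENTS.get(m.group(1), m.group(0))
    # Consuming escapes pairwise means an escaped backslash followed by
    # a literal "w" is not mistaken for \w.
    return re.sub(r"\\(.)", swap, pattern)

converted = to_posix_classes(r"\w+\s\d")
print(converted)
```

Escapes with no entry in the mapping, such as `\d` here, pass through unchanged and still need separate handling for a given sed.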

Resonix - 3 days ago

why I built this