'Don't parse markup languages with Regex' is an annoying trollpost and it should die... right?

ChubakPDP11+TakeWithGrainOfSalt@programming.dev · edit-2 3 months ago

'Don't parse markup languages with Regex' is an annoying trollpost and it should die... right?

RoadieRich@midwest.social · 3 months ago

The things is, if you try to parse html with a regex, you will always be able to construct valid html that will break it, no matter how complex your regex becomes.

ryannathans · 3 months ago

Maybe you shouldn’t use regex to parse html?

TehPers@beehaw.org · edit-2 3 months ago

This advice mostly applies to people who are less experienced and less familiar with just how complex HTML can be. As for other languages - if you’re doing regex on markdown, you’ll probably be fine (but you should verify if you’re writing something for the general case that must not fail). But in HTML’s case:

You have nested languages (CSS and JS)
You have tag-specific rules (img and link end in />, but div must end in a separate closing tag)
Browsers use error correction to try to make sense of invalid HTML, like inserting missing tags. Many websites rely on this behavior.

If you’re trying to use Regex to parse a specific website’s HTML, you’ll be able to get what you want eventually, but as a general HTML parser, there will always be some website that breaks your assumptions.

DaleGribble88@programming.dev · 3 months ago

HTML parsers scare me. I already knew it was a big job, but this blog post sealed the deal that HTML, err… the web’s interpretation of HTML(?), is one heck of a mess.
https://jakearchibald.com/2023/against-self-closing-tags-in-html/

mcmodknower@programming.dev · 3 months ago

How many combinations and levels of italics, bold and ~~strikethrough~~, combined with escaped chars like * can your program handle?

How many combinations and levels of *italics*, **bold** and ~~strikethrough~~, combined with escaped chars like \* can your program handle?

ChubakPDP11+TakeWithGrainOfSalt@programming.dev · 3 months ago

So I’ll answer your question and ask a question back from anyone who can help me.

RE the nesting, I was under the impression that they can’t be combined when I made it. Then I read CommonMark’s specs and it seems like it’s possible. It would be miserable to do this with a syntax-directed translation. I used ASDL to write up a tree, and added some features to asdl(1) so they would be handled more properly. I am not sure whether I should use a parser generator for this, but the nesting can be handled by Lex’s start conditions — if I fail to do so, I may use a PEG generator.

Now my question.

I think nesting and recursion are a good case for using a push-down parser here — I will still try and find a solution before I use an LR parser.

I avoid using Yacc because I honestly have no clue how to use it with a language like Markdown.

So my thinking is, I would just use a starting condition stack with Lex (I use Flex). It’s pretty simple. Let’s use a linked list so there are no limits.


struct stack { int state; stuct stack *next,  }

struct stack *top, *bottom;

void push ...

int pop ...

(I usually use typedefs though).

So now we have a psuedo-pushdown parser. What are these called?

I am still a beginner at this but one thing that worries me is, how would I handle the tree with this method?

With Yacc or PEG parser generators, it is easy to assign a value to a reduction/closure. But with this method, Flex won’t allow me. Unless I use too many variables.

I think I may use peg(1). I can even do the same stack thingy with PEG.

Any help is welcome.

Phrodo_00@lemmy.world · 3 months ago

HTML is not even a tree (XHTML is. XML is a type 2 grammar). SGML languages like HTML are more similar to Tree-adjoining grammars.

For example <b>This<i>is perfectly</b>valid</i> html.

ChubakPDP11+TakeWithGrainOfSalt@programming.dev · 3 months ago

btw this is the ASDL I wrote:

%{

#include "mukette.h"


static Arena *ast_scratch = NULL;


#define ALLOC(size) arena_alloc(ast_scratch, size)


%}

md_linkage = Hyperlink(md_compound? name, string url)
	  | Image(string? alt, string path)
	  ;


md_inline = Italic(md_compound italic)
	 | Bold(md_compound bold)
	 | BoldItalic(md_compound bold_italic)
	 | Strikethrough(md_compound strike_through)
	 | InlineCode(string inline_code)
	 | Linkage(md_linkage linkage)
	 | RegularText(string regular_text)
	 ;


md_header_level = H1 | H2 | H3 | H4 | H5 | H6 ;


md_line = Header(md_compound text, md_header_level level)
	| Indented(md_compound text, usize num_indents)
	| LinkReference(identifier name, string url, string title)
	| HorizontalLine
	;


md_compound = (md_inline* compound) ;


md_table_row = (md_compound cells, size num_cell) ;


md_table = (md_table_row* rows, size num_rows) ;


md_ol_item = (int bullet_num, md_list_item item) ;


md_ul_item = (char bullet_char, md_list_item item) ;


md_list_item = TextItem(string text)
	    | OrderedItem(md_ol_item ordered_item)
	    | UnorderedItem(md_ul_item unordered_item)
	    | NestedList(md_list nested_list)
	    ;



md_list = (md_list_item* items) ;



md_block = Pargraph(md_compound* paragraph)
	| BlockQuote(md_compound* block_quote)2
	| CodeListing(identifier? label, string code)
	| Table(md_table table)
	| List(md_list list)
	| Line(md_line line)
	;


markdown = (md_block* blocks) ;


%%


static inline void init_tree_scratch(void) { ast_scratch = arena_init(AST_ARENA_SIZE); }
static inline void free_tree_scratch(void) { arena_free(ast_scratch); }

I had an easier time parsing ASDL with Yacc. I still can’t tell whether a grammar is LR, LL or RE, but I can tell that Markdown is not CFG.

I just updated ASDL: https://github.com/Chubek/ZephyrASDL

Apologies if I am too late on the documetnation. I am still trying to improve it by using it myself. I also wish to add an Ocaml target.

mcmodknower@programming.dev · 2 months ago

md_inline and md_compound use each other, and not only at the end xor the beginning of the rule, making this a non-type 3 grammar.

Sorry for the late response, i wanted to do a better response but don’t have the time for that currently.

ChubakPDP11+TakeWithGrainOfSalt@programming.dev · 2 months ago

Thanks. I actually have a parse-related question which I will post somewhere soon (as in 2-3 minute).

Corbin@programming.dev · 3 months ago

[HTML and Markdown] are not grammatically Type 2 (Chomsky-wise, Context-Free); rather, they are Type 3 (Chomsky-wise, Regular).

This is at least half-wrong, in that HTML is clearly not regular. The proof is simple: HTML structurally embeds a language of balanced parentheses (a Dyck language), and such languages are context-free and not regular. I don’t know Markdown well and there are several flavors; it’s quite possible that some flavors are regular. However, traditional Markdown embeds HTML, so if HTML is not regular than neither is Markdown.

I once did a syntax-directed translation of Markdown to HTML in AWK!

Sure. The original Markdown implementation was in Perl and operated similarly. However, note that this doesn’t imply that either language is regular, only that a translation is possible assuming the input is valid Markdown. Punting on recognition means that invalid parse trees have undefined translations, presumably at least sometimes generating invalid HTML.

ChubakPDP11+TakeWithGrainOfSalt@programming.dev · 3 months ago

I remember reading something about recursive languages in Aho’s Automata Book. I will have to check it again. So Regular Languages can’t be recursive? Is that what ‘recursive’ language even means? I have to read the book again.

Thanks for your help man.

Corbin@programming.dev · 3 months ago

The definition of recursive language can be read on Wikipedia too. Quoting from that page:

All regular, context-free and context-sensitive languages are recursive.

But not all recursive languages are regular. As for “recursive”, it’s a historical term that might not be sufficiently descriptive; today, I’d usually call such languages decidable, to emphasize that they are languages which we can write (always-halting) computer programs to detect.

ChubakPDP11+TakeWithGrainOfSalt@programming.dev · 3 months ago

Thanks! You know last night I fed a model several books and asked it a lot of stuff. This did not come up.

What i do is I glance at the book, then ask the model to explain it to me. I call the model ‘Urkel the Wiz Kid’.

I fed both CS books and CE books. Like Sipsers, A Quantative Approach, Automata of Aho, Harris and Harris, etc.

fubo@lemmy.world · 3 months ago

I’m confused. To me, “building a tree” and “parsing” are the same thing. If you end up with a tree representation of the structure of your document, the thing you did to get there is parsing.

RonSijm@programming.dev · 3 months ago

I suppose “parsing” is a more generic thing. If you “build a tree” you did some form of “parsing” - yes

However when you do “parsing” like “find all links in an html doc” - you’re parsing, but not necessarily building a tree.

Static_Rocket@lemmy.world · edit-2 3 months ago

I mean, you shouldn’t really parse any language (markup, script or otherwise) with regex. The point is there are other tools for the job that get you closer to what the actual interpreter is expecting to see. It’s really easy to botch a regex and accidentally create new syntax matches.

Regex is fine for noncritical parsing. You’ll often see it used for text editor / IDE language syntax highlighting, but you should also have noticed by now how often that tends to break down with more extreme combinations of syntax.

I do agree that lexers should still be preferred for manual manipulation of existing langues, but do whatever you want. It’s not like any of us can stop you.

solrize@lemmy.world · 3 months ago

It wouldn’t occur to me to use a parser generator to make an html parser, or (say) a Lisp reader. But I wouldn’t use regexes either. For large HTML docs I generally use a SAX parser like expat, then maintain a stack in the application of tag nesting at the current point. For smaller docs I use a DOM parser, same idea but it builds a tree for you. In each case the bottom level is basically a hand coded state machine, scanning the HTML input and emitting tag events or text strings.

Markdown is possibly simpler, but I haven’t had to with process it so far.

some_guy@lemmy.sdf.org · 3 months ago

I’ve been automating tasks with HTML pages for years and years. I built two scripts for harvesting media and progressing through galleries at the start of the pan when we first had all that downtime. I kept refining them over time and they worked very well. If you know enough of a scripting language, you can absolutely use regex to take pages apart and do this. In the end, I was grabbing titles for different files and renaming them when I saved them to disk. It’s just trial and error to get what ultimately works.

I suspect that you’re doing things that are likely a bit more advanced than what I just described. Either way, cheers! I like this theme.

FizzyOrange@programming.dev · 3 months ago

Not really. You should avoid regexes for almost everything really. They’re too error prone. The only situations where they really shine are where it doesn’t matter too much if they’re wrong. For example interactive editing/filtering/searching. Another place I use them is for triaging error messages.

Using them in a parser for anything more than maybe matching literals in the lexer is insane. (Assuming you care about your code actually working.)