FilterPlugin

Substitute and extract information from content by using regular expressions

Description

This plugin lets you substitute and extract information from content by using regular expressions. There are three different types of new functions:
  1. FORMATLIST: manipulate a list of items; it is highly configurable to define what constitutes a list and how to extract items from it
  2. SUBST, STARTSUBST/STOPSUBST: substitute a pattern in a chunk of text
  3. EXTRACT, STARTEXTRACT/STOPEXTRACT: extract a pattern from a text
While the START-STOP versions of SUBST and EXTRACT work on inline text, the normal versions process a source topic before including it in the current one.

Syntax Rules

DECODE

Syntax: %DECODE{"..." type="..."}%

Reverses an encoding applied via VarENCODE.

  • type="...": type of decoding, can be url (default), entity, html, quote or none
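
As a rough illustration of what decoding means here, the following Python sketch models three of the documented types. It is not the plugin's Perl code: the function name `decode` is hypothetical, and the mapping of url to percent-decoding and entity to HTML entity decoding is an assumption based on the type names. The html and quote types are left out because their exact rules are not described above.

```python
import html
from urllib.parse import unquote


def decode(text, kind="url"):
    """Hypothetical stand-in for %DECODE%, covering url, entity and none only."""
    if kind == "url":
        # Assumption: url reverses percent-encoding, e.g. "%20" becomes " "
        return unquote(text)
    if kind == "entity":
        # Assumption: entity reverses HTML entities, e.g. "&lt;" becomes "<"
        return html.unescape(text)
    if kind == "none":
        # Documented as a valid type; modelled here as a pass-through
        return text
    raise ValueError(f"type {kind!r} is not modelled in this sketch")


print(decode("a%20b%26c"))            # a b&c
print(decode("&lt;b&gt;", "entity"))  # <b>
```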

SUBST

Syntax: %SUBST{topic="..." ...}%

Inserts a topic after processing its content.

  • topic="...": name of the topic text to be processed
  • rev="...": revision of the topic to be processed (defaults to latest version)
  • text="...": text to be processed (takes precedence over topic)
  • pattern="...": pattern to be extracted or substituted
  • format="...": format expression or pattern substitute
  • header="...": header string prepended to output
  • footer="...": footer string appended to output
  • limit="<n>": maximum number of occurrences to extract or substitute, counted from the start of the text (defaults to 100000 aka all hits)
  • skip="<n>": skip the first n occurrences
  • exclude="...": skip occurrences that match this regular expression
  • include="...": skip occurrences that don't match this regular expression
  • sort="on,off,alpha,num": order of the formatted items (default "off")
  • expand="on,off": toggle expansion of markup before filtering (defaults to on)
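
The interplay of pattern, format, skip, limit, include and exclude can be sketched in Python as follows. This is a minimal model under stated assumptions, not the plugin's implementation: the names `subst` and `expand` are hypothetical, the include/exclude filters are assumed to be applied before skip and limit are counted, and multi-line matching is assumed. Text that does not match is kept, as the change history says SUBST should do.

```python
import re


def expand(fmt, match):
    """Expand $1, $2, ... and the escape tokens of a format string."""
    def group(m):
        n = int(m.group(1))
        return (match.group(n) or "") if n <= match.re.groups else ""

    out = re.sub(r"\$(\d+)", group, fmt)
    # $nop must be handled before $n, which is a prefix of it
    for token, value in (("$nop", ""), ("$n", "\n"), ("$percnt", "%"), ("$dollar", "$")):
        out = out.replace(token, value)
    return out


def subst(text, pattern, fmt, skip=0, limit=100000, include=None, exclude=None):
    """Replace matches of pattern by the expanded format; keep everything else."""
    out, pos, seen, done = [], 0, 0, 0
    for m in re.finditer(pattern, text, re.MULTILINE):
        hit = m.group(0)
        if exclude and re.search(exclude, hit):
            continue  # filtered out, left untouched in the output
        if include and not re.search(include, hit):
            continue
        seen += 1
        if seen <= skip:
            continue  # one of the first n occurrences
        if done >= limit:
            break
        out.append(text[pos:m.start()])  # non-matching text before the hit
        out.append(expand(fmt, m))
        pos = m.end()
        done += 1
    out.append(text[pos:])  # non-matching tail
    return "".join(out)


print(subst("a1 b2 c3", r"([a-z])(\d)", "$2$1"))                   # 1a 2b 3c
print(subst("a1 b2 c3", r"([a-z])(\d)", "$2$1", skip=1, limit=1))  # a1 2b c3
```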

STARTSUBST, STOPSUBST

Syntax:
%STARTSUBST{...}% 
... 
%STOPSUBST%

Substitutes text given inline. See SUBST for the parameters.

EXTRACT

Syntax: %EXTRACT{topic="..." ...}%

Extracts text from a topic. See SUBST for the parameters.

STARTEXTRACT, STOPEXTRACT

Syntax:
%STARTEXTRACT{...}% 
... 
%STOPEXTRACT%

Extracts content given inline. See SUBST for the parameters.

FORMATLIST

Syntax: %FORMATLIST{"<list>" ...}%

Formats a list of items. The <list> argument is split into items using a split expression; each item is matched against a pattern and then formatted using a format string, with a separator string placed between items; the result is prepended with a header and appended with a footer in case the list is not empty.
  • <list>: the list
  • tokenize="...": regex to tokenize the list before splitting it up; tokens are inserted back again after the split stage has been passed
  • split="...": the split expression (default ",")
  • replace="key1=value1,key2=value2, ...": this allows to preprocess each list item by replacing the given keys with their value
  • pattern="...": pattern applied to each item (default "\s(.*)\s")
  • format="...": the format string for each item (default "$1")
  • header="...": header string
  • footer="...": footer string
  • separator="...": string to be inserted between list items
  • lastseparator="...": string separating the last item from the rest of the list
  • null="...": the format string to render the empty list
  • hideempty="on,off": when set to "on", empty list items will not be added to the result (empty in the sense of ''); set this to "off" to still add them (default "on")
  • limit="...": max number of items to be taken out of the list (default "-1")
  • skip="...": number of list items to skip, not adding them to the result
  • sort="on,off,alpha,num" order of the formatted items (default "off")
  • reverse="on,off": reverse the order of the list
  • rotate="...": rotate the list right by a positive number, or rotate left by a negative number of items
  • unique="on,off": remove duplicates from the list
  • exclude="...": remove list items that match this regular expression
  • include="...": remove list items that don't match this regular expression
  • casesensitive="on,off": boolean switch to enable or disable case sensitive matching of exclude and include filters (default "on")
  • selection="...": regular expression that a list item must match to be "selected"; if this matches the $marker is inserted
  • marker="...": string to be inserted when the selection regex matches; this will be inserted at the position $marker as indicated in format
  • map="key1=value1,key2=value2, ...": this establishes a key-value hash available via the $map() variable. (see also the replace parameter for means to preprocess list items automatically.)

The pattern string groups matching substrings in the list item, which you can refer to by using $1, $2, … in the format string. Any format string (format, header, footer) may contain the format tokens:

  • $percnt
  • $nop
  • $dollar
  • $n

Furthermore, the following variables are available:

  • $index: expands to the index within the (filtered) list
  • $pos: expands to the position within the unfiltered list (include and exclude not applied)
  • $hits: expands to the total number of matched list elements
  • $count: expands to the total number of elements in the list
  • $marker: is set if the selection regular expression matches the current item
  • $map(key): returns the value for "key" as specified in the map argument
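
The split, pattern, format and separator stages can be modelled with a short Python sketch. It only illustrates the pipeline and covers a subset of the parameters; it is not the plugin's Perl code. The names `format_list` and `expand` are hypothetical. Items are trimmed up front and the default pattern is simplified to "(.*)", $index is assumed to be 1-based, and $count is assumed to be the number of items before skip and limit are applied. Sorting happens before formatting, as noted in the change history.

```python
import re


def expand(fmt, groups, index, count):
    """Expand $1.., $index, $count and the escape tokens of a format string."""
    out = re.sub(r"\$(\d+)", lambda m: groups.get(int(m.group(1)), ""), fmt)
    out = out.replace("$index", str(index)).replace("$count", str(count))
    # $nop must be handled before $n, which is a prefix of it
    for token, value in (("$nop", ""), ("$n", "\n"), ("$percnt", "%"), ("$dollar", "$")):
        out = out.replace(token, value)
    return out


def format_list(text, split=",", pattern="(.*)", fmt="$1", header="", footer="",
                separator="", null="", skip=0, limit=-1, sort="off",
                reverse=False, unique=False):
    # Split and drop empty items (models hideempty="on")
    items = [i.strip() for i in re.split(split, text) if i.strip()]
    if unique:
        items = list(dict.fromkeys(items))  # keeps first occurrences, in order
    if sort in ("on", "alpha"):
        items.sort()
    elif sort == "num":
        items.sort(key=float)
    if reverse:
        items.reverse()
    count = len(items)
    items = items[skip:]
    if limit >= 0:
        items = items[:limit]
    rows = []
    for item in items:
        m = re.search(pattern, item)
        if not m:
            continue
        groups = {n: g or "" for n, g in enumerate(m.groups(), 1)}
        rows.append(expand(fmt, groups, len(rows) + 1, count))
    if not rows:
        return null  # header and footer are only added to non-empty results
    return header + separator.join(rows) + footer


print(format_list("banana, apple ,cherry, apple", sort="alpha", unique=True,
                  fmt="$index/$count: $1", separator="; ", header="[", footer="]"))
# [1/3: apple; 2/3: banana; 3/3: cherry]
```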

MAKEINDEX

Syntax: %MAKEINDEX{"<list>" ...}%

Formats a list into a multi-column index like in MediaWiki's category topics. MAKEINDEX inserts capitals as headlines for groups of sorted items. It tries to balance all columns equally and keeps track of breaks to prevent "schusterkinder", that is, to avoid isolated headlines at the bottom of a column.

parameters:
  • <list>: the list of items
  • split="...": the split expression to separate the <list> into items (default ",")
  • pattern="...": pattern applied to each item (default "(.*)")
  • cols="...": maximum number of columns to split the list into; defaults to automatic, that is, the number of columns is determined by colwidth and colgap; in general it is better to specify colwidth and colgap rather than hard-coding the number of columns, as this lets the viewport of the browser/device decide on the number of columns dynamically based on the available space
  • colwidth="...": maximum width of a column, defaults to 18em
  • colgap="...": size of gap between columns, defaults to 2em
  • format="...": format of each list item (default "$item")
  • group="...": format string to prepend to index groups, defaults to <h3 $anchor>$group</h3>
  • sort="on,off,alpha,num,nocase": sort the list (default "on")
  • unique="on/off": remove duplicates (default "off")
  • exclude="...": pattern to check against items in the list to be excluded
  • include="...": pattern to check against items in the list to be included
  • casesensitive="on,off": boolean switch to enable or disable case sensitive matching of exclude and include filters (default "on")
  • reverse="on/off": reverse the list (default "off")
  • header="...": format string to prepend to the result
  • footer="...": format string to be appended to the result
  • transliterate="on/off/<mapping>": influences the way sorting and grouping is handled: either a boolean switch to enable/disable decoding Unicode characters into their nearest Latin character (using CPAN:Text::Unidecode), or a custom mapping list "<source1>=<target1>, <source2>=<target2>, ..." to map a source string to a given target string (default "on")
  • hideempty="on,off": boolean flag to disable any output in case the list is empty

Like in FORMATLIST, the format parameter can make use of the $1, $2, … variables to match the groupings defined in the pattern argument (like in pattern="(.*);(.*);(.*)"). The first matched grouping $1 will be used as the $item to sort the list and is optionally transliterated.

In addition header and footer might contain the $anchors variable which will expand to a navigation to jump to the groups within the index.
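
The grouping step can be pictured with a small Python sketch. It only models sorting and grouping by initial character; the name `make_index` is hypothetical, and transliteration, the group and format strings, and column balancing (which the change history says is handled by css3 multicolumn) are left out. Grouping by the first character as-is, without folding case, is an assumption.

```python
import re
from itertools import groupby


def make_index(text, split=",", unique=False, reverse=False):
    """Return (headline, items) pairs for a sorted list, one pair per initial."""
    items = [i.strip() for i in re.split(split, text) if i.strip()]
    if unique:
        items = list(dict.fromkeys(items))  # drop duplicates, keep order
    items.sort(reverse=reverse)
    # groupby works here because equal initials are adjacent after sorting
    return [(initial, list(group))
            for initial, group in groupby(items, key=lambda s: s[0])]


for initial, group in make_index("Beta, alpha, Alpha, Bravo, Alpha", unique=True):
    print(initial, group)
# A ['Alpha']
# B ['Beta', 'Bravo']
# a ['alpha']
```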

Examples

EXTRACT Example 1: convert table into text

One of the uses of this plugin is to extract data from tables, which is useful for creating "database-like" wiki applications where data is stored in Foswiki tables. While it is certainly possible to do that without this plugin, the plugin makes these requests easier to create and maintain. Note, however, that best practice is to store database-like information using DataForms, so that you don't need to parse the format of the data repeatedly to extract its records.

The table:
|  *Pos*  |  *Description*  |  *Hours*  |
|  1 | onsite troubleshooting |  3 |
|  2 | normalizing data to new format |  10 |
|  3 | testing server performance |  5 |

You type:

%EXTRACT{topic="%TOPIC%" expand="off" 
  pattern="^\|\s\s(.*?)\s*\|\s*(.*?)\s*\|\s*(.*?)\s*\|" 
  format="   * it took $3 hours $2$n"
  skip="1"
}%

Expected result (simulated):

  • it took 3 hours onsite troubleshooting
  • it took 10 hours normalizing data to new format
  • it took 5 hours testing server performance
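
The same extraction can be reproduced outside the wiki with a few lines of Python, which may help when debugging a pattern. The raw table markup below is an assumption (the topic only shows the rendered table); it uses two leading spaces in the first cell because the pattern requires them. The helper name `extract_rows` is hypothetical, and multi-line matching is assumed.

```python
import re

# Assumed raw markup of the table above
TABLE = """\
|  *Pos*  |  *Description*  |  *Hours*  |
|  1 | onsite troubleshooting |  3 |
|  2 | normalizing data to new format |  10 |
|  3 | testing server performance |  5 |
"""

# The pattern from the example, unchanged
PATTERN = r"^\|\s\s(.*?)\s*\|\s*(.*?)\s*\|\s*(.*?)\s*\|"


def extract_rows(text, skip=1):
    """Format every matching row after the first `skip` hits (the header row)."""
    matches = list(re.finditer(PATTERN, text, re.MULTILINE))
    return "".join(
        f"   * it took {m.group(3)} hours {m.group(2)}\n" for m in matches[skip:]
    )


print(extract_rows(TABLE), end="")
#    * it took 3 hours onsite troubleshooting
#    * it took 10 hours normalizing data to new format
#    * it took 5 hours testing server performance
```

Because the pattern is applied to the whole topic, any other table row that starts with a pipe and two whitespace characters is matched as well.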

Actual result (this site):

  • it took 3 hours onsite troubleshooting
  • it took 10 hours normalizing data to new format
  • it took 5 hours testing server performance

(On this site the pattern also matches entries of this topic's Change History further down the page, producing 23 additional bullets of the form "it took hours <change description>" with an empty hours field.)

EXTRACT Example 2: convert text into table

Use CSS tags to format text comments as tabular data (e.g., to allow sorting).

The comments:
This is the first comment.
-- Michael Daum on 22 Aug 2005

This is the second comment.
-- Michael Daum on 22 Aug 2005

You type:

%EXTRACT{
   topic="%TOPIC%" expand="off"
   pattern=".div class=\"text\">.*?[\r\n]+(.*?)[\r\n]+(?:.*?[\r\n]+)+?-- (.*?) on (.*?)[\r\n]+"
   format="| $3 | $2 | $1 ... |$n" header="|*Date*|*Author*|*Headline*|$n"
}%

Expected result (simulated):

| *Date* | *Author* | *Headline* |
| 22 Aug 2005 | Michael Daum | This is the first comment. … |
| 22 Aug 2005 | Michael Daum | This is the second comment. … |

Actual result (this site):

| *Date* | *Author* | *Headline* |
| 22 Aug 2005 | Michael Daum | This is the first comment. … |
| 22 Aug 2005 | Michael Daum | This is the second comment. … |

MAKEINDEX Example 1: creating an index from a chunk of text

Compare with Philosophy articles needing attention

A B C D E F H I K L M N O P R S T U V W m p t

MAKEINDEX Example 2: creating an index for a search result

A

AdminGroup
09 Mar 2018 - 21:10 - AdminUser
AdminUser
27 Feb 2018 - 19:59 - UnknownUser
AdminUserLeftBar
27 Feb 2018 - 19:59 - UnknownUser
AnnikaWith
03 Mar 2017 - 16:47 - RegistrationAgent

F

FamilieBereichGroup
13 May 2017 - 18:04 - LarsWith

G

GroupTemplate
27 Feb 2018 - 20:00 - UnknownUser
GroupViewTemplate
27 Feb 2018 - 19:59 - UnknownUser

L

LarsWith
13 Mar 2017 - 21:17 - LarsWith
LarsWithLeftBar
13 Mar 2018 - 09:33 - LarsWith

M

MalinaWith
05 Mar 2017 - 12:50 - RegistrationAgent
MarenUndLarsWith
28 Oct 2015 - 20:00 - RegistrationAgent
MarenWith
05 Jun 2016 - 08:12 - RegistrationAgent
MortenWith
07 Oct 2016 - 15:07 - RegistrationAgent

N

NobodyGroup
27 Feb 2018 - 20:00 - UnknownUser

P

ProjectContributor
27 Feb 2018 - 19:59 - UnknownUser

R

RegistrationAgent
27 Feb 2018 - 20:00 - UnknownUser

S

SitePreferences
09 Mar 2018 - 21:15 - AdminUser

T

TammoWith
13 May 2017 - 17:43 - RegistrationAgent

U

UnknownUser
27 Feb 2018 - 19:59 - UnknownUser
UserHomepageHeader
27 Feb 2018 - 20:00 - UnknownUser
UserList
27 Feb 2018 - 20:00 - UnknownUser
UserListByDateJoined
27 Feb 2018 - 20:00 - UnknownUser
UserListByLocation
27 Feb 2018 - 20:00 - UnknownUser
UserListHeader
27 Feb 2018 - 20:00 - UnknownUser
UserRegistration
09 Mar 2017 - 17:17 - LarsWith

W

WebAtom
27 Feb 2018 - 19:59 - UnknownUser
WebChanges
27 Feb 2018 - 19:59 - UnknownUser
WebCreateNewTopic
27 Feb 2018 - 19:59 - UnknownUser
Main
27 Feb 2018 - 19:59 - UnknownUser

Installation Instructions

You do not need to install anything in the browser to use this extension. The following instructions are for the administrator who installs the extension on the server.

Open configure, and open the "Extensions" section. "Extensions Operation and Maintenance" Tab → "Install, Update or Remove extensions" Tab. Click the "Search for Extensions" button. Enter part of the extension name or description and press search. Select the desired extension(s) and click install. If an extension is already installed, it will not show up in the search results.

You can also install from the shell by running the extension installer as the web server user: (Be sure to run as the webserver user, not as root!)
cd /path/to/foswiki
perl tools/extension_installer <NameOfExtension> install

If you have any problems, or if the extension isn't available in configure, then you can still install manually from the command-line. See https://foswiki.org/Support/ManuallyInstallingExtensions for more help.

Dependencies

| Name | Version | Description |
| Text::Unidecode | >=1.27 | Optional |

Change History

19 Jan 2024: added rotate parameter to %FORMATLIST
23 Jun 2023: added $pos to %FORMATLIST
29 Apr 2022: added casesensitive parameter to MAKEINDEX and FORMATLIST; new macro DECODE; fixed a couple of perl gotchas initializing variables
19 Oct 2020: added hideempty parameter to MAKEINDEX; fixed expand in SUBST and EXTRACT
25 Oct 2018: added rev param to %SUBST and %EXTRACT
08 Oct 2018: added colwidth and colgap to %MAKEINDEX; fixed numerical sorting of lists
01 Jun 2018: improved sorting of lists, i.e. with numeric values
05 Mar 2018: css fixes for MAKEINDEX
30 Aug 2017: rewrite MAKEINDEX from using tables to css3 multicolumn
05 Sep 2016: added $hits to FORMATLIST to distinguish it from $count and $index
29 Apr 2016: don't fallback to unidecode if an explicit mapping is given; don't use Foswiki's internal anchor creator as it does not support unicode
20 Apr 2016: added transliterate parameter, including custom mappings; upgraded Text::Unidecode fallback shipped with this plugin
31 Aug 2015: fixing deprecated unescaped left brace in regexes
17 Jul 2015: fixed compatibility with Foswiki-2.x
10 Apr 2014: transliterate/normalize unicode strings before sorting them in MAKETEXT
19 Jun 2012: added lastseparator (by Foswiki:Main/OliverKrueger); fixed paging when using together with include and exclude parameters
15 May 2012: fixed paging through lists in FORMATLIST
05 May 2012: fixed lists not being processed properly before iterating over them in FORMATLIST and MAKEINDEX
19 Apr 2012: modernized plugin by using a proper OO-core; fixed processing of tokenize properly; added replace parameter for FORMATLIST; fixed the plugin calling Foswiki::Func::expandCommonVariables() itself unnecessarily
10 Jan 2012: fixed filtering zero; fixed counting list items without formatting them; added hideempty parameter to enable/disable rendering empty list items
29 Sep 2011: fixed SUBST macro topic param processing embedded META
25 Aug 2011: fixed perl rookie error initializing defaults
14 Jul 2011: fixed parsing zero values in lists (by Grzegorz Marszalek)
06 Apr 2011: fixed SUBST removing everything after the last match
23 Jul 2010: fixed wrapper for non-official api call to getAnchorName on foswiki-1.1
07 Jun 2010: fixed expanding standard escapes ($n, $percent, …); improved examples in docu
12 Feb 2010: ease tokenize; forward compatibility for newer foswikis
17 Nov 2009: added tokenize pattern for FORMATLIST; fixed potential deep recursion in SUBST/EXTRACT
14 Sep 2009: added include counterpart to already existing exclude params; fixed SUBST not to forget about the non-matching tail of a char sequence
17 Apr 2009: converted to foswiki, added numerical sorting to MAKETEXT
08 Oct 2008: added $anchors to MAKEINDEX (by Dirk Zimoch); added nocase option to FORMATLIST (by Dirk Zimoch); fixed null/empty string match in FORMATLIST
20 Aug 2008: added selection and marker to FORMATLIST, similar in use as VarWEBLIST
03 Jul 2008: sorting a list before, not after, formatting it in FORMATLIST
08 May 2008: added 'text' parameter to SUBST and EXTRACT; fixed SUBST as it was pretty useless before
07 Dec 2007: added MAKEINDEX, added lazy compilation
14 Sep 2007: added sorting for EXTRACT and SUBST
02 May 2007: using registerTagHandler() as far as possible; enhanced parameters to EXTRACT and SUBST
05 Feb 2007: fixed escapes in format strings; added better default value for max number of hits to prevent deep recursions on bad regular expressions
22 Jan 2007: fixed SUBST, added skip parameter to FORMATLIST
18 Dec 2006: using registerTagHandler for FORMATLIST
13 Oct 2006: fixed limit parameter in FORMATLIST
31 Aug 2006: added NO_PREFS_IN_TOPIC
15 Aug 2006: added use strict; and fixed revealed errors
14 Feb 2006: moved in FORMATLIST from the Foswiki:Extensions/NatSkinPlugin; added escape variables to format strings
06 Dec 2005: fixed SUBST not to cut off the rest of the text
09 Nov 2005: fixed deep recursion using expand="on"
22 Aug 2005: Initial version; added expand toggle

PackageForm

Author Michael Daum
Version 7.10
Release 21 Jan 2024
Description Substitute and extract information from content by using regular expressions
Repository https://github.com/foswiki/FilterPlugin
Copyright © 2005-2024 Michael Daum
License GPL (GNU General Public License)
Home Foswiki:Extensions/FilterPlugin
Support Foswiki:Support/FilterPlugin
Topic revision: r1 - 21 Jan 2024, UnknownUser