Text analysis: split a document into separate lines

For a text analysis I want to split a large document into separate lines. Then I want to calculate, for example, the average number of words per line. An example of the document is the following:

text = {"
1.1 English Language Proficiency Policy
Since all activities (lectures, seminars, laboratories) at Humber \
are conducted in English, it is essential that all students possess \
strong English:

*writing skills;
*comprehension skills;
*speaking skills.

This allows students to cope with the rigours of the academic \
curriculum and to successfully complete any workplace components of \
the program including co-operative education. Therefore, if your \
first language is not English, or if your previous education has been \
conducted in another language, you will be required to demonstrate \
proficiency in English by undertaking and submitting the results of \
one of the following at the level relative to the program(s) to which \
you apply.

Please note that scores for the standardized English language tests:

may vary by program:
are only valid for a 24-month period from the date of testing.

1.1.2 For any program level

a minimum of three years of full-time study at the secondary school \
level in an English language school system in a country where English \
is considered the primary language (i.e. the primary language of \
instruction and evaluation is English), with acceptable grades in all \
English courses.

a minimum of one full year of successful study in an accredited \
university degree program, an accredited college degree program or \
graduate-level studies at either the university or college level, or
two full years of successful study in an accredited college diploma \
program in a country and in a postsecondary institution where English \
is the primary language of instruction.
"}

All lines that start with a digit and end with an end-of-line character without a “.” character must be deleted.
Not all lines have an end-of-line (“\n”) character.
The end of a line can be recognized by a word boundary followed by a ‘.’ and then a whitespace character, or by a ‘;’ or ‘:’.
Characters like “(,),*” must be deleted.

The desired output is:

{“since all activities lectures seminars laboratories at Humber are
conducted in English”, “it is essential that all students possess
strong English, “writing skills”,”comprehension skills”,”speaking
skills”,”This allows students to cope with the rigours of the academic
curriculum and to successfully complete any workplace components of
the program including co-operative education”, “Therefore if your
first language is not English or if your previous education has been
conducted in another language you will be required to demonstrate
proficiency in English by undertaking and submitting the results of
one of the following at the level relative to the programs to which
you apply”, “Please note that scores for the standardized English
language tests,may vary by program”, “are only valid for a 24-month
period from the date of testing”, “a minimum of three years of
full-time study at the secondary school level in an English language
school system in a country where English is considered the primary
language i.e. the primary language of instruction and evaluation is
English”, with acceptable grades in all English courses”, “a minimum
of one full year of successful study in an accredited university
degree program an accredited college degree program or graduate-level
studies at either the university or college level or two full years of
successful study in an accredited college diploma program in a country
and in a postsecondary institution where English is the primary
language of instruction”}

I tried many options with RegularExpression, but none of them gave the desired output. Does anyone have a suggestion?

=================

These kinds of things are trivially achieved with any of the standard text processing tools (e.g. perl, awk).
– Dr. belisarius
Oct 8 ’14 at 16:33

@belisarius True, but such tools, however useful, are not readily available on PCs. Personally, I think this is justification enough to have access to a Linux box. Could the question be expanded to “since I’ve sworn my soul to Microsoft, can I do this in M instead of having to search for and install gawk?”
– bobthechemist
Oct 8 ’14 at 16:37

As @belisarius says, there are other tools for this, but if you prefer to do it using Mathematica I suggest avoiding regular expressions and using the standard pattern elements instead. Please see (6998) for a very brief overview. (Presumably if you were already skilled with regular expressions you would be using a different tool and not having this problem.)
– Mr.Wizard♦
Oct 8 ’14 at 16:37

@bobthechemist I’ve used gawk-like tools for really big projects, and they are a nightmare too. But for quick and dirty tasks they are heaven.
– Dr. belisarius
Oct 8 ’14 at 16:58

=================

2 Answers

=================

Here is a rough attempt at implementing what you describe, primarily using StringSplit.

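Fold[
  Flatten @ StringSplit[##] &,
  (* remove commas and asterisks before splitting *)
  StringReplace[text[[1]], "," | "*" :> ""],
  {
    (* lines starting with a digit: consumed as delimiters, so they are dropped *)
    StartOfLine ~~ Whitespace ... ~~ DigitCharacter ~~ Except["\n"] .. ~~ "\n",
    (* sentence or clause end: ".", ";" or ":" followed by whitespace *)
    WordBoundary ~~ ("." | ";" | ":") ~~ Whitespace,
    (* any remaining newline *)
    "\n"
  }
] // StringTrim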

{“Since all activities (lectures seminars laboratories) at Humber are conducted in \
English it is essential that all students possess strong English”, “writing skills”, \
“comprehension skills”, “speaking skills”, “This allows students to cope with the rigours \
of the academic curriculum and to successfully complete any workplace components of the \
program including co-operative education”, “Therefore if your first language is not \
English or if your previous education has been conducted in another language you will be \
required to demonstrate proficiency in English by undertaking and submitting the results \
of one of the following at the level relative to the program(s) to which you apply”, \
“Please note that scores for the standardized English language tests”, “may vary by \
program”, “are only valid for a 24-month period from the date of testing”, “a minimum of \
three years of full-time study at the secondary school level in an English language \
school system in a country where English is considered the primary language (i.e”, “the \
primary language of instruction and evaluation is English) with acceptable grades in all \
English courses”, “a minimum of one full year of successful study in an accredited \
university degree program an accredited college degree program or graduate-level studies \
at either the university or college level or”, “two full years of successful study in an \
accredited college diploma program in a country and in a postsecondary institution where \
English is the primary language of instruction”}

Ugly, I must admit, but perhaps it gives you some idea of where to start.

If you run into any specific problems please let me know.
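
For anyone puzzled by the Fold construction, here is a minimal sketch of the same idea on a short made-up string. The name sample and its contents are assumptions for illustration and are not taken from the question. Each delimiter in the list is applied in turn to every piece left by the previous split:

```mathematica
(* hypothetical sample text, not from the question *)
sample = "First point; second point: third. Fourth one";

(* FoldList keeps the intermediate lists; Fold would return only the last one.
   Flatten keeps the pieces in one flat list after each split. *)
steps = FoldList[Flatten @ StringSplit[##] &, sample, {";", ":", "."}];

(* trim the whitespace left around each final piece *)
StringTrim @ Last[steps]
(* {"First point", "second point", "third", "Fourth one"} *)
```

The order of the delimiters matters when one pattern depends on text that another consumes. In the code above, the digit-line pattern needs the trailing "\n", so it has to run before the plain "\n" split removes the newlines.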

Thanks, it works fine!!
– Michiel van Mens
Oct 9 ’14 at 7:55

@Mr.Wizard, thank you! You’re fantastic as always!!!
– Rod
Apr 25 at 11:08

=================

Try this:

StringSplit[
  StringReplace[First@text,
    Shortest[(x : DigitCharacter) ~~ __ ~~ "\n"] | ")" | "(" | "*" -> ""],
  "." | ";" | "," | ":"]

which gives the following somewhat imperfect output:

{”
Since all activities lectures”, ” seminars”, ” laboratories at \ Humber are conducted in English”, ” it is essential that all students
\ possess strong English”, ”
writing skills”, ” comprehension skills”, ” speaking skills”, ”
This allows students to cope with the rigours of the academic \ curriculum and to successfully complete any workplace components of \
the program including co-operative education”, ” Therefore”, ” if \
your first language is not English”, ” or if your previous education \
has been conducted in another language”, ” you will be required to \
demonstrate proficiency in English by undertaking and submitting the \
results of one of the following at the level relative to the programs
\ to which you apply”, ”
Please note that scores for the standardized English language \ tests”, ”
may vary by program”, ” are only valid for a
a minimum of three years of full-time study at the secondary \ school level in an English language school system in a country where \
English is considered the primary language i”, “e”, ” the primary \
language of instruction and evaluation is English”, ” with acceptable
\ grades in all English courses”, ”
a minimum of one full year of successful study in an accredited \ university degree program”, ” an accredited college degree program or
\ graduate-level studies at either the university or college level”, ”
or two full years of successful study in an accredited college \
diploma program in a country and in a postsecondary institution where
\ English is the primary language of instruction”, ” “}

I’m not totally clear on what you want, as your requirements seem to contradict your example output, so hopefully you can modify this to do what you need.
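
To tidy this up and get the average number of words the question asks about, something along these lines may help. This is only a sketch: pieces is a hypothetical stand-in for the list returned by the StringSplit call above, shortened here for illustration.

```mathematica
(* hypothetical stand-in for the StringSplit result, with the same kinds of stray whitespace *)
pieces = {"\nwriting skills", " comprehension skills", " This allows students to cope", " "};

(* trim surrounding whitespace, then drop pieces that are now empty *)
clean = DeleteCases[StringTrim[pieces], ""];

(* StringSplit with a single argument splits at whitespace, so Length counts words *)
wordCounts = Length /@ StringSplit /@ clean;   (* {2, 2, 5} *)

N @ Mean[wordCounts]   (* 3. *)
```

Counting whitespace-separated tokens treats hyphenated terms such as "co-operative" as one word; adjust the split pattern if a different definition of a word is needed.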