Extract content from wiki pages - page numbers of changed pages
Last updated at 9:48 am UTC on 30 June 2018
This page shows how to retrieve the content of this wiki's 'changes' page (the complete list of changes) and how to query the result for the page numbers of the pages that have changed since a particular date.
Steps
1. Get page content of http://wiki.squeak.org/squeak/completeChanges
2. Get the collection of dates on which changes occurred.
3. Get all lines before a certain date
4. Extract the page numbers
Summary (all together as a script)
1. Get page content
Evaluate the following code in a Workspace
| url pageSource contentAfterH2Header |
url := 'http://wiki.squeak.org/squeak/completeChanges' asUrl.
pageSource := url retrieveContents contents.
contentAfterH2Header := (pageSource splitBy: '</h2>') second.
contentAfterH2Header inspect
This opens an Inspector window on the result, which is a ByteString.

SequenceableCollection implements #beginsWith:. ByteString is a subclass of String, which in turn descends from SequenceableCollection, so a ByteString inherits this method.
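As a quick check of how #beginsWith: behaves on such strings, the following sketch can be evaluated in a Workspace. The two sample lines are made up for illustration and are not taken from the wiki page.

```smalltalk
| headerLine itemLine results |
"Hypothetical sample lines, shaped like the lines used in the steps below"
headerLine := '<h3>31 March 2017</h3>'.
itemLine := '<li><a href="/squeak/1">Some page</a></li>'.
"Collect both answers in an array: one for the header line, one for the list item"
results := {headerLine beginsWith: '<h3>'. itemLine beginsWith: '<h3>'}.
results inspect
```

Only the header line should answer true, which is what makes this test usable for picking out the date lines.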
2. Find out about dates on which changes occurred.
The inspector window of the previous section might be used to get a list of all dates where changes occurred.
Paste
(self lines select: [:aLine | aLine beginsWith: '<h3>']) inspect
into the code box of the Inspector object and execute it. The result is an array of all change dates.
Use this array to pick the date up to which you want the changes.
Poke into the array with e.g.
self at: 250
to get at the data
<h3>31 March 2017</h3>
3. Get all lines before a certain date
Then go back to the first Inspector, the one on the ByteString that holds all lines, and evaluate
|copy |
copy := true.
(self lines select: [:aLine | (aLine beginsWith: '<h3>31 March 2017</h3>') ifTrue: [copy := false].
copy ])
inspect
This opens a new Inspector on all the lines that come before the 31 March 2017 header.
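The flag-based select: can be tried in isolation on a small collection. This is only a sketch: the sample lines are invented, and they assume, as the text implies, that newer dates come first on the page.

```smalltalk
| sampleLines copy linesBeforeTheDate |
"Made-up sample data, newest date first"
sampleLines := #(
	'<h3>2 April 2017</h3>'
	'<li><a href="/squeak/11">Page A</a></li>'
	'<h3>31 March 2017</h3>'
	'<li><a href="/squeak/22">Page B</a></li>').
copy := true.
"Once the chosen header line is seen, copy becomes false and every later line is rejected"
linesBeforeTheDate := sampleLines select: [:aLine |
	(aLine beginsWith: '<h3>31 March 2017</h3>') ifTrue: [copy := false].
	copy].
linesBeforeTheDate inspect
```

The block works because select: visits the elements in order, so the flag acts as a switch that turns copying off at the chosen header and never turns it back on.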
In this Inspector window we look for the lines that begin with
<li><a href="/squeak/
to get at the page numbers of the changed wiki pages.
So we need to put the following code snippet into the evaluation pane of the Inspector window
(self select: [:aLine | aLine beginsWith: '<li><a href="/squeak/']) inspect
This gives us another Inspector window, this time with only the lines that actually reference a changed page, that is, the entries with change dates later than 31 March 2017. The object shown is an array in which each element is one such line.

4. Extract the page numbers

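Evaluate the following in the evaluation pane of that last Inspector window. For each line it answers the text between the first pair of double quotes, i.e. the href value.
(self collect: [:aLine | (aLine splitBy: '"') second]) inspect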
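The expression above answers the whole href value, not just the number. A possible follow-up, sketched here on made-up sample lines, keeps only the part after the last slash; it assumes that the page number is the last path segment of the href.

```smalltalk
| sampleLines hrefs pageNumbers |
"Made-up sample lines, shaped like the <li> lines selected above"
sampleLines := #(
	'<li><a href="/squeak/11">Page A</a></li>'
	'<li><a href="/squeak/22">Page B</a></li>').
"Same extraction as above: the text between the first pair of double quotes"
hrefs := sampleLines collect: [:aLine | (aLine splitBy: '"') second].
"Assumption: the page number is whatever follows the last $/ in the href"
pageNumbers := hrefs collect: [:aHref | aHref copyAfterLast: $/].
pageNumbers inspect
```

The page numbers stay strings here; send asNumber to each element if integers are needed.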
Summary (all together as a script)
| url pageSource contentAfterH2Header linesWithDates aDateString copy
linesBeforeTheDate pageReferences |
url := 'http://wiki.squeak.org/squeak/completeChanges' asUrl.
pageSource := url retrieveContents contents.
contentAfterH2Header := (pageSource splitBy: '</h2>') second.
linesWithDates := contentAfterH2Header lines select: [:aLine | aLine beginsWith: '<h3>'].
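"chooseFrom:values: answers the chosen line itself; chooseFrom: alone answers only its index"
aDateString := UIManager default chooseFrom: linesWithDates values: linesWithDates.
aDateString ifNil: [^ nil]. "nothing was chosen"
copy := true.
linesBeforeTheDate := contentAfterH2Header lines select:
[:aLine | (aLine beginsWith: aDateString) ifTrue: [copy := false].
copy ].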
pageReferences := linesBeforeTheDate select: [:aLine | aLine beginsWith: '<li><a href=' ].
pageReferences := pageReferences collect: [:aLine | (aLine splitBy: '"') second].
pageReferences inspect.