Sunday, 05 February 2012

“Web Pages, Text Types, and Linguistic Features: Some Issues” by Marina Santini

Summary Essay on “Web Pages, Text Types, and Linguistic Features: Some Issues” by Marina Santini

In her journal article “Web Pages, Text Types, and Linguistic Features: Some Issues”, Marina Santini argues that the web page should be considered a new type of document. She suggests that web pages are more complex than paper documents. A single web page can contain several texts with different communicative functions. For instance, one page can be divided into parts that are organized by links. Various blocks of text scattered around the main document, such as navigation buttons, menus, ads, and search bars, act as links connecting one document to others. Unlike a paper document, a web page can lose its specific linguistic and textual characteristics because of its visual structure. The author therefore investigates text typology on web pages on the basis of linguistic features, and more specifically text types.

Santini chooses two well-established studies, Biber’s (Multidimensional Analysis, 2004) and Werlich’s (1976), to learn whether the text types they propose are suitable for and applicable to web pages. Biber’s Multidimensional Analysis relies on an inductive statistical approach based on factor analysis and cluster analysis, and it focuses only on linguistic features (lexical, morphological, and syntactic classes). The analysis results in four dimensions: personal involved narration, persuasive-argumentative discourse, advice, and abstract-technical discourse. Werlich, by contrast, distinguishes five text types: narration, description, exposition, argumentation, and instruction. The author also adds two broad text types to her study: nominal vs. verbal. NLP tools were used to convert the web pages from HTML into ASCII format.

Furthermore, Santini finds that more than 50% of the web pages in her study fall under the nominal text type. This means that web pages offer fewer linguistic features, which makes them less likely to fit the established text types. She identifies six issues behind this poor fit: text elements coded as images, headings, lists, proper nouns, tabular text, and mixed text. The first issue arises when text elements of a web page are coded as images embedded in the HTML page and are lost when the page is converted into its ASCII version (text without pictures). Santini acknowledges that a solution to this problem is hard to find. The second issue occurs when the tools fail to detect a heading because it is written inside a sentence rather than as an independent unit. Adding HTML heading tags (<h#>) can solve this issue. As for lists, the issue lies in stylometric measurements such as average sentence length, because list items are by nature semantically incomplete and do not end with punctuation. The solution is to add the <li> tag as an artificial sentence boundary. The fourth issue concerns proper nouns, which can be found on almost every web page, usually in the form of lists of names or personal details. Unfortunately, the NLP tools are of no use in this case. The fifth issue is tabular text: the tabular structure is quite difficult to analyze from a linguistic standpoint. The last issue is mixed text: strings of text that surround the main body of a web page, are semantically separate from it, and provide only additional information to the reader. Such a page contains at least three text types: a comment (the main article), an informational list (the headlines on the right side), and an index (the items on the left). The author concludes that these issues do not have easy solutions.
The textuality of web pages also does not make it any easier for automatic extraction applications such as NLP tools to interpret them. The author acknowledges that the same problems may arise when other, similar tools are used for similar studies. Thus, further discussion and investigation are needed.
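
To make the heading and list issues more concrete, here is a minimal sketch in Python of how <h#> and <li> tags could be treated as artificial sentence boundaries when an HTML page is converted to plain text. It is not the tooling Santini used; the names `BoundaryAwareExtractor`, `extract_units`, `average_sentence_length`, and the `SAMPLE` page are all hypothetical and only illustrate the idea.

```python
from html.parser import HTMLParser
import re

# Assumption: headings and list items should each count as their own unit,
# following the <h#> and <li> fixes described in the essay.
BOUNDARY_TAGS = frozenset({"h1", "h2", "h3", "h4", "h5", "h6", "li"})


class BoundaryAwareExtractor(HTMLParser):
    """Collect visible text, closing a unit at each boundary tag."""

    def __init__(self, boundary_tags=BOUNDARY_TAGS):
        super().__init__()
        self.boundary_tags = boundary_tags
        self.units = []    # finished units: headings, list items, sentences
        self._buffer = []  # text nodes collected for the unit in progress

    def handle_data(self, data):
        self._buffer.append(data)

    def handle_starttag(self, tag, attrs):
        # <img> carries no text node, so text coded as an image is lost here.
        if tag in self.boundary_tags:
            self._flush()

    def handle_endtag(self, tag):
        if tag in self.boundary_tags:
            self._flush()

    def _flush(self):
        # Normalize whitespace, then split prose on sentence-final punctuation.
        text = " ".join(" ".join(self._buffer).split())
        self._buffer = []
        if text:
            self.units.extend(s for s in re.split(r"(?<=[.!?])\s+", text) if s)

    def close(self):
        super().close()
        self._flush()


def extract_units(html, boundary_tags=BOUNDARY_TAGS):
    """Return the list of sentence-like units found in an HTML string."""
    parser = BoundaryAwareExtractor(boundary_tags)
    parser.feed(html)
    parser.close()
    return parser.units


def average_sentence_length(units):
    """Mean number of whitespace-separated words per unit."""
    if not units:
        return 0.0
    return sum(len(u.split()) for u in units) / len(units)


# Hypothetical page fragment: a heading, a short list, and a paragraph.
SAMPLE = (
    "<h2>Latest news</h2>"
    "<ul><li>Local team wins</li><li>Rain expected</li></ul>"
    "<p>The match ended late. Fans stayed.</p>"
)

if __name__ == "__main__":
    with_tags = extract_units(SAMPLE)
    without_tags = extract_units(SAMPLE, boundary_tags=frozenset())
    print(with_tags, average_sentence_length(with_tags))
    print(without_tags, average_sentence_length(without_tags))
```

When the boundary tags are ignored, the heading and both list items run into the first sentence of the paragraph, so the average sentence length is inflated. With the boundaries in place, each heading and list item is counted as a separate short unit. Text that exists only inside an image never reaches the extractor in either case, which matches the first issue above.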

wynne ert, February 2012

