Sources

I program proficiently in R, Python, Julia and the Linux shell for text processing, data extraction, data analysis and statistical modelling. More importantly, these skills have allowed me to work as a data scientist: I build on existing computational models to design new algorithms for practical research purposes.

SOFTWARE, DATABASES & CORPORA DEVELOPED

Software & Tools

○ the automatic converter of discourse corpora into discourse dependency structures;

○ the analyzer of discourse distance;

○ the analyzer of discourse complexity;

○ the annotation tool for combined PDTB-RST annotation;

○ the toolkit for visualizing discourse networks;

○ the analyzer of diachronic frequency for lemmas and n-grams;

○ the text pre-processing toolkit (Linux shell and Python versions);

○ the analyzer of relative entropy for historical changes in lemmas and n-grams;
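As an illustration of the last item, relative entropy (Kullback-Leibler divergence) between the lemma distributions of two periods can be sketched as follows. This is a minimal sketch, not the released analyzer: the function name `relative_entropy`, the add-alpha smoothing and the toy token lists are all assumptions made for the example.

```python
import math
from collections import Counter


def relative_entropy(earlier_tokens, later_tokens, alpha=1.0):
    """Return D(later || earlier) in bits over the joint vocabulary.

    Hypothetical helper: add-alpha smoothing is an assumption chosen so that
    items unseen in one period do not yield a zero probability.
    """
    earlier = Counter(earlier_tokens)
    later = Counter(later_tokens)
    vocab = set(earlier) | set(later)

    # Smoothed totals for each period.
    n_earlier = sum(earlier.values()) + alpha * len(vocab)
    n_later = sum(later.values()) + alpha * len(vocab)

    total = 0.0
    for item in vocab:
        p = (later[item] + alpha) / n_later      # later-period probability
        q = (earlier[item] + alpha) / n_earlier  # earlier-period probability
        total += p * math.log2(p / q)
    return total


# Toy lemma lists (invented for the example) standing in for two periods.
period_1 = ["thou", "thou", "you", "say"]
period_2 = ["you", "you", "you", "say"]
divergence = relative_entropy(period_1, period_2)
```

The same function works for n-grams if each n-gram is passed as a tuple or a joined string; a value of zero means the two periods have identical smoothed distributions.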

Databases

○ the database of historical frequencies for English compounds (9828 words);

○ the database of historical frequencies for discourse connectives (words, two/three-word phrases);

○ the database of lexical psychological properties in L2;

○ the database of historical concreteness and imageability in English;

○ the database of sentimental properties of onomatopoeia in multiple languages (28 languages);

○ the database of native Chinese readers’ perception of sentence boundaries;

○ the database of eye movements on coherent and incoherent discourse in four languages;

Corpora

○ the corpus of English hyphenated compounds;

○ the balanced corpus of discourse dependency in multiple languages (11 languages);

○ the corpus of Chinese “run-on” sentences: annotated at the syntactic, semantic and discourse levels;

○ the corpus of event graphs for Chinese discourse;

○ the corpus of Chinese topic chains: syntactic, textual and anaphora annotations;

○ the corpus of Chinese “fine translation” of literary works: morphological and syntactic annotations.

The other databases, programming scripts and preprints are available at: