Sources

I program in R, Python, Julia and the Linux shell for text processing, data extraction, data analysis and statistical modeling. More importantly, these skills have allowed me to work as a data scientist: I draw on existing computational models to design new algorithms for practical research purposes.

SOFTWARE, DATABASES & CORPORA DEVELOPED


Software & Tools

the automatic converter that derives discourse dependencies from discourse corpora;

the analyzer of discourse distance;

the analyzer of discourse complexity;

the annotation tool for combined PDTB-RST annotation;

the toolkit for visualizing discourse networks;

the analyzer of diachronic frequency for lemmas and n-grams;

the text pre-processing toolkit (Linux shell and Python versions);

the analyzer of relative entropy for historical changes in lemmas and n-grams;


Databases

the database of historical frequencies for English compounds (9828 words);

the database of historical frequencies for discourse connectives (words, two/three-word phrases);

the database of lexical psychological properties in L2;

the database of historical concreteness and imageability in English;

the database of sentiment properties of onomatopoeia in multiple languages (28 languages);

the database of native Chinese readers’ perception of sentence boundaries;

the database of eye movements on coherent and incoherent discourse in four languages;


Corpora

the corpus of English hyphenated compounds;

the balanced corpus of discourse dependency in multiple languages (11 languages);

the corpus of Chinese “run-on” sentences: annotated at the syntactic, semantic and discourse levels;

the corpus of event graphs for Chinese discourse;

the corpus of Chinese topic chains: syntactic, textual and anaphora annotations;

the corpus of Chinese “fine translation” of literary works: morphological and syntactic annotations.


The code used in the published papers is available at: https://github.com/fivehills/

Other resources, including datasets and code, are available at OSF:



Other databases, scripts and preprints are available at: