
nlphose: Commandline Tools For NLP On Tweets (and other data sources)

Ashish Patil
Jul 5, 2021


I started working on this project a few months ago. It aims to provide command-line utilities that can be piped together: a streaming source of text, such as twint output or log files, feeds a pipeline of tools that each perform an NLP task on the text.

The project builds on the basic shell concept of pipes, where the output of one task is fed as input to the next. Each command-line script in the project expects single-line JSON as input, containing the text to be processed along with the output of earlier processing steps, and adds its own attributes to the data. Because the scripts are simple Python programs built on various NLP libraries, the whole system is easy to modify.
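The single-line JSON convention described above can be sketched as a minimal stage. This is not code from the project: the script, the enrich function, and the word_count attribute are hypothetical, and the only assumption taken from the examples in this post is that each record carries a "text" field.

```python
#!/usr/bin/env python3
"""Hypothetical pipeline stage: adds a word count to each JSON record."""
import json
import sys


def enrich(record):
    """Add one illustrative attribute; a real stage would call an NLP library here."""
    text = record.get("text") or ""
    record["word_count"] = len(text.split())
    return record


def run(lines, out):
    """Read one JSON object per line, enrich it, and write one JSON line back."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed input instead of stopping the pipeline
        if not isinstance(record, dict):
            continue  # only JSON objects can carry attributes
        out.write(json.dumps(enrich(record)) + "\n")
        out.flush()  # flush per record so the next stage is not left waiting


# Only start reading when input is piped in, not when run from a bare terminal.
if __name__ == "__main__" and not sys.stdin.isatty():
    run(sys.stdin, sys.stdout)
```

Saved as an executable file, such a stage could sit anywhere in a pipeline after the step that produces JSON, since it passes through every attribute it receives and only adds its own.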

Here are some examples of what can be done.

Get works of art mentioned in positive tweets about Netflix (AFINN score above 5):

twint -s netflix | ./twint2json.py | ./senti.py | ./entity.py | jq 'if (.afinn_score) > 5 then .entities | .[] | select(.label == "WORK_OF_ART") | .entity else empty end'

Get people mentioned in positive tweets about the Premier League, together with the tweet text:

twint -s premierleague | ./twint2json.py | ./senti.py | ./entity.py | jq 'if (.afinn_score) > 5 then . as $parent | .entities | .[] | select((.label == "PERSON") and .entity != "Netflix") | [$parent.text, .entity] else empty end'
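For readers less familiar with jq, the logic of the two filters above can be approximated in Python. This is only a sketch: the field names (afinn_score, entities, label, entity, text) are taken from the jq expressions, while the function name and its parameters are hypothetical.

```python
def matching_entities(record, label, threshold=5, exclude=()):
    """Return entity names with the given label, but only for records
    whose afinn_score is above the threshold."""
    score = record.get("afinn_score")
    if score is None or score <= threshold:
        return []  # mirrors the jq "else empty" branch
    return [
        ent.get("entity")
        for ent in (record.get("entities") or [])
        if ent.get("label") == label and ent.get("entity") not in exclude
    ]


def people_with_text(record):
    """Pair each matching PERSON entity with the tweet text, like the second filter."""
    people = matching_entities(record, "PERSON", exclude=("Netflix",))
    return [[record.get("text"), name] for name in people]
```

Calling matching_entities(record, "WORK_OF_ART") on each parsed line corresponds roughly to the first filter, and people_with_text(record) to the second.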

The pv tool can be inserted into the pipeline to monitor processing speed, showing the rate of incoming and outgoing messages.

For more details, please visit the project's GitHub page: https://github.com/code2k13/nlphose