Experimenting with Twitter data using TensorFlow

Goga Patarkatsishvili
· Oct 3, 2018

The most alluring part of machine learning and artificial intelligence is that they give us the chance to rethink everything. Using them, you can see everything from a completely different point of view, and I think this act of 'rethinking' is sometimes the key to new inventions.

I had such a moment a few days ago; it was really interesting, and I decided to write about the experiment.

I created my Twitter account approximately 7 years ago, but I didn't use it very often, and after a few months I connected it to one of my websites and installed an 'auto-retweet' bot on it. After that, a lot changed: I didn't have time for my website, I removed the bot, and since I'm not really fond of social networks, my Twitter account sat paused for a few years.

Several months ago, I decided to start using Twitter again. I wanted to share my publications and my public repos (and maybe not only mine) with other people, so that is how I rejoined Twitter. Of course, because of my account's past activity, my followers had completely different interests, and to be honest, that affected my motivation really badly.

I decided to find the kind of users who might find me interesting too.

When you are just starting out on Twitter and don't have many posts (of course I deleted the older ones :D) or followers, the chance is high that the people you follow won't follow you back, and you need to post genuinely interesting content for a long time to present yourself as well as possible.

Since we have mentioned 'adding followers': before I get to the main part of this story, I want to point to Twitter's rules and limitations, because the penalties on Twitter for unethical actions are very serious.

The important ones are Twitter's rules and the API rate limits, both documented on Twitter's help and developer pages.


So, our goal is to build a classification model from Twitter user data (we will generate the data later) that predicts a user's future action: will this user follow us back or not when we send him/her a follow request? To put it simply, it should help us find people who are likely to follow us back.

First, I needed a Python package, because talking to the Twitter API directly isn't a good idea and takes much more time. As we know, Python is 'batteries included', and I very easily found a great package that has all the functionality I needed for this project and, what's more important, great documentation as well.

Here is the link: github.com/bear/python-twitter

To use this package, first install it with pip:

$ pip install python-twitter

After that, we need API credentials from Twitter; you can create them very easily in Twitter's developer portal.

If it seems a little confusing, you can find detailed instructions with screenshots here: python-twitter.readthedocs.io/en/latest/get..

Okay, now we can connect to the Twitter API:
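The original snippet was lost in formatting, but with python-twitter the connection step looks roughly like this (the credential strings are placeholders):

```python
import twitter

# Credentials come from the Twitter developer portal (placeholders here).
api = twitter.Api(consumer_key='CONSUMER_KEY',
                  consumer_secret='CONSUMER_SECRET',
                  access_token_key='ACCESS_TOKEN_KEY',
                  access_token_secret='ACCESS_TOKEN_SECRET',
                  sleep_on_rate_limit=True)  # wait out rate limits instead of failing

print(api.VerifyCredentials().screen_name)  # sanity check: prints your handle
```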


Now it's time to start data mining. We need to collect enough data to train our model, and there are two ways to get it. The first, which is harder and which I did because I had no choice, is to follow interesting people and record whether they follow us back. The second, for those who already have enough friends and followers, is to use that existing data to train the model.

In my case the second option wouldn't work, so I had to choose the harder one. For the first scenario I needed to find popular people on Twitter who mostly post about programming/machine learning, because their followers would also be a good fit for me, so I started collecting data about those kinds of users.

First, I choose a 'Twitter rockstar': a user who always posts about programming or ML, and then I get the list of his/her followers:
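A minimal sketch of that step; the handle below is a hypothetical example, and GetFollowerIDs handles the pagination for us:

```python
# Hypothetical example account; any prolific programming/ML poster works.
rockstar = 'some_ml_rockstar'

# Returns a list of numeric user IDs following that account.
follower_ids = api.GetFollowerIDs(screen_name=rockstar)
print(len(follower_ids))
```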

After that, I try to get detailed information about each of them.
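A sketch of that lookup step; UsersLookup hydrates up to 100 IDs per call into full user objects:

```python
users = []
for i in range(0, len(follower_ids), 100):
    # UsersLookup accepts at most 100 IDs per request.
    users.extend(api.UsersLookup(user_id=follower_ids[i:i + 100]))
```

The package I'm using returns this kind of data for each user: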

*(screenshot: the user object returned by the API)*

There is really interesting information here that I can use for training my model, but I don't want to follow all of them; I want to follow only people who share my interests and who show some activity.

Both the Twitter API and our package make it very easy to follow users:
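For reference, a sketch of that one-liner:

```python
# Sends a follow request (or follows directly if the account is public).
api.CreateFriendship(user_id=user.id)
```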

Still, you have to be careful so that your actions don't look like spam and don't break Twitter's rules.

So, I use a simple if condition to filter out people who are inactive or who don't have enough friends, because I want to store maximally filtered data; a sketch follows below.
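My exact cutoffs didn't survive the formatting, so the thresholds here are illustrative assumptions, as is the save_user helper:

```python
for user in users:
    # Illustrative thresholds: skip inactive or barely-connected accounts.
    if user.statuses_count > 50 and user.friends_count > 20 \
            and not user.protected:
        save_user(user)  # hypothetical helper that writes the row to our table
```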

I decided to save the user information in a relational database, and I use SQLite, because it's absolutely enough for us and I didn't want to waste time on such things. On the application side I use only the sqlite3 package to connect to the database, without any ORM; I wanted it to be as simple as possible.
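A minimal sketch of the storage layer; the table and column names mirror the features discussed later in the post, but the exact schema is my assumption:

```python
import sqlite3

conn = sqlite3.connect('twitter.db')
conn.execute('''CREATE TABLE IF NOT EXISTS users_data (
                    user_id INTEGER PRIMARY KEY,
                    favourites_count INTEGER,
                    followers_count INTEGER,
                    friends_count INTEGER,
                    statuses_count INTEGER,
                    created_at TEXT,
                    days_after_last_post INTEGER,
                    followed_back INTEGER)''')
conn.commit()
```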

For the second scenario we use almost the same code, but instead of collecting data from other users' followers, we collect data about our own followers and followings and save all of it in our table.

First, we get info about our followers and followings:
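A sketch of that step; GetFollowers and GetFriends return full user objects for the authenticated account:

```python
followers = api.GetFollowers()  # people who follow us
friends = api.GetFriends()      # people we follow ('followings')

follower_ids = {u.id for u in followers}
for user in friends:
    # Label: 1 if someone we follow also follows us back, else 0.
    followed_back = 1 if user.id in follower_ids else 0
```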

And finally, we save everything into our table, as in the previous example.

So, I already have the information that I needed.

*(screenshot: collected rows in the users_data table)*


I want to start training my model, and for that I need to choose its features. We have already seen what the result from the Twitter API looks like. I decided to use only a few of the fields for training: created_at, favourites_count, followers_count, friends_count, statuses_count.

However, I was not satisfied with the training process, and I think it's easy to guess why: it's hard to tell whether a user still actively uses their Twitter account from this data alone; maybe he/she has lots of posts, but all of them were posted a year ago. So I decided to remove the created_at and statuses_count columns from the feature list and add information about the last posting date instead. You can get that information like this:
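A sketch of that lookup; the most recent status carries a created_at string:

```python
# Fetch only the most recent tweet; an empty timeline returns [].
timeline = api.GetUserTimeline(user_id=user.id, count=1)
last_post_date = timeline[0].created_at if timeline else None
# e.g. 'Mon Oct 01 19:23:45 +0000 2018'
```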

I decided to convert this string into an integer and store how many days have passed since the user's last tweet. I wrote a small function for it:
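The original function got lost in formatting; a minimal reconstruction, assuming Twitter's standard created_at format:

```python
from datetime import datetime, timezone

def days_since_last_post(created_at):
    """Convert Twitter's created_at string into days elapsed since that tweet."""
    # Twitter's timestamp format, e.g. 'Mon Oct 01 19:23:45 +0000 2018'
    posted = datetime.strptime(created_at, '%a %b %d %H:%M:%S %z %Y')
    return (datetime.now(timezone.utc) - posted).days
```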

And I can already use this function in my main loop:
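Sketched, with the function name from the reconstruction above:

```python
for user in users:
    timeline = api.GetUserTimeline(user_id=user.id, count=1)
    days = days_since_last_post(timeline[0].created_at) if timeline else None
    # ...store `days` in the days_after_last_post column with the other fields
```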

Collecting data for your model is not an easy job; it takes at least a few days and sometimes more than a week, and during that time a lot changes for the users whose data is already saved in our db. So I wrote scripts to keep my data up to date (for example: friends list, followers list, etc.), which I can run before I start building the model.

Finally, my data looks like this:

*(screenshot: the final dataset)*

I don't know which IDE or editor you use; I almost always choose IntelliJ products, and in this case I'm using PyCharm. I use the same IDE to connect to the SQLite db, and it makes it very easy to export our data to CSV.

*(screenshot: PyCharm's export-to-CSV dialog)*

I check the option to include the header as the first row.


Train, Evaluate, Predict

And now the most interesting part. For our last step we can use Google Colab, which I think is one of the most useful Google products. First, I import the necessary packages.
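The import cell didn't survive; given the steps below, it would have looked roughly like this (TensorFlow 1.x, which was current in 2018):

```python
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
```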

Next, we need to upload our csv file, which is really easy in Colab: you just need two lines of code, plus an authentication code.
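One common way to do it is Colab's upload helper:

```python
from google.colab import files

uploaded = files.upload()  # opens a file picker in the notebook
```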

After we upload the file, we can open it with pandas and see what it looks like.
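Sketched, assuming the exported file is named users_data.csv (a guess based on the table name):

```python
df = pd.read_csv('users_data.csv')
df.head()
```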

*(screenshot: the first rows of the DataFrame)*

As you may already know, Twitter has private accounts, which means you cannot see a user's posts until they give you permission; in those cases you cannot see the user's last post. In my opinion, the days_after_last_post feature is really important, so I decided not to train on users for whom this value is missing. We can apply this filter in SQL like this:

select ud.* from users_data ud where ud.days_after_last_post is not null;

Or you can apply the filter right here, in Colab:
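The pandas equivalent of that SQL filter, using the same column name:

```python
df = df[df['days_after_last_post'].notnull()]
```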

We can also see how many of them follow us back using pandas:
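One way to count, assuming the label column is called followed_back (an assumed name):

```python
df['followed_back'].value_counts()  # counts of follow-back vs. no follow-back
```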

*(screenshot: label counts from pandas)*

Now we can create variables for our features and for the label:
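A sketch using the four features settled on above (the label column name is an assumption):

```python
FEATURES = ['favourites_count', 'followers_count',
            'friends_count', 'days_after_last_post']
LABEL = 'followed_back'

X = df[FEATURES]
y = df[LABEL]
```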

Now that we have our features and the target, we will use scikit-learn to split the data for convenience. scikit-learn has a train_test_split function for splitting a pandas DataFrame into a training and a testing set.

We pass the target and features as parameters to the train_test_split function and it returns randomly shuffled data: 30% becomes the test set and 70% the training set.
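```python
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)  # 70% train / 30% test
```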

Now I will call the numeric_column function for each feature, because luckily all of my features are numeric, and build the list of feature columns:
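Since every feature is numeric, the list is a one-liner:

```python
feature_columns = [tf.feature_column.numeric_column(f) for f in FEATURES]
```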

Input Function, Testing and Evaluation

We need an input function that returns a tuple containing two elements: the features and the list of labels. We use this input function for our model because, when training, we have to pass the features and the labels to it.

But of course we don't have to write it by hand; there is already a function for this, named pandas_input_fn, and we will use it:
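A sketch with the TF 1.x helper (the batch size is an illustrative assumption; eval_fn is the name referenced later):

```python
# Wraps a DataFrame into an input_fn the estimator can consume.
train_fn = tf.estimator.inputs.pandas_input_fn(
    x=X_train, y=y_train, batch_size=32, num_epochs=None, shuffle=True)

eval_fn = tf.estimator.inputs.pandas_input_fn(
    x=X_test, y=y_test, num_epochs=1, shuffle=False)
```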

And now it's time to instantiate the model:
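The post doesn't say which estimator was used; a LinearClassifier is the simplest fit for binary classification over numeric columns, so I'll assume that here:

```python
model = tf.estimator.LinearClassifier(feature_columns=feature_columns)
```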

We are ready to train our model:
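```python
model.train(input_fn=train_fn, steps=1000)  # step count is illustrative
```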

Having trained it, we can now evaluate our model. To evaluate, we pass eval_fn as a parameter to the evaluate function:
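```python
results = model.evaluate(input_fn=eval_fn)
print(results['accuracy'])
```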

*(screenshot: evaluation output showing 80%+ accuracy)*

You can see on the screen that in my case accuracy was 80%+, which I think is normal for this kind of data: I only had info about 1400 users and only four features, and people's decisions don't always depend on those features. So the accuracy is satisfying for me right now, and I can use this model to make predictions on real data.

Prediction

*(screenshot: prediction output)*

Finally, we can get the list of users who, according to our model's prediction, will answer our follow request.
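A sketch of that final step, assuming a hypothetical DataFrame `candidates` of not-yet-followed users with the same feature columns plus a user_id column:

```python
pred_fn = tf.estimator.inputs.pandas_input_fn(
    x=candidates[FEATURES], shuffle=False)

predictions = model.predict(input_fn=pred_fn)
likely = [uid for uid, p in zip(candidates['user_id'], predictions)
          if p['class_ids'][0] == 1]  # predicted to follow back
```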

Of course, you can train your model further to get better accuracy: use different features or change whatever you want, because this experiment is just for fun and for spending more time with TensorFlow and other libraries.

Thank you for taking the time to read this story :)