TSP algorithms: 2-opt, 3-opt in Python

The 2-opt and 3-opt algorithms are used to obtain approximate solutions to the travelling salesman problem. In the previous post I explained what the TSP is and included an implementation of the Christofides algorithm.

# 2-opt algorithm

The 2-opt algorithm is one of the most basic and widely used heuristics for obtaining an approximate solution to the TSP. 2-opt starts with a random initial tour and improves it incrementally by exchanging two edges in the tour for two other edges.

2-opt move

In each step, the 2-opt algorithm deletes two edges (u_1, u_2) and (v_1, v_2), where u_1, u_2, v_1, v_2 are distinct (note that the two edges must not be incident to the same node), thus creating two subtours, and reconnects the tour with the edges (u_1, v_1) and (u_2, v_2) if the replacement reduces the length of the tour. The gain of a single 2-opt move can be expressed as \Delta_{ij} = c(u_1, u_2) + c(v_1, v_2) - (c(u_1, v_1) + c(u_2, v_2)).

The algorithm ends in a local optimum, in which no further improving step can be found. One full 2-opt pass has time complexity \mathcal{O}(n^2): for each broken edge e we need to check, in the worst case, (n-2) other edges that could be broken to improve the tour, which is \mathcal{O}(n); checking this for all n edges gives n \cdot \mathcal{O}(n) = \mathcal{O}(n^2).
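To make the move concrete, here is a minimal 2-opt sketch (not the exact code from my repository); it assumes a symmetric distance matrix dist indexed as dist[u][v] and a tour given as a list of node indices, treated as a closed cycle:

def two_opt(tour, dist):
    """Improve the tour with 2-opt exchanges until no improving move is left."""
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # these two edges share the node tour[0]
                u1, u2 = tour[i], tour[i + 1]
                v1, v2 = tour[j], tour[(j + 1) % n]
                # gain of replacing (u1, u2) and (v1, v2) with (u1, v1) and (u2, v2)
                delta = dist[u1][u2] + dist[v1][v2] - (dist[u1][v1] + dist[u2][v2])
                if delta > 1e-10:
                    # reversing the segment between the two cuts performs the reconnection
                    tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
                    improved = True
    return tour

The tour is treated as a cycle (the edge from the last node back to the first is implicit), and dist can be a plain nested list or a NumPy matrix.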

# 3-opt algorithm

The 3-opt algorithm works in a similar fashion. Instead of removing two edges, it removes three, breaking the tour into three separate sub-tours that can be reconnected in 8 possible ways (including the original tour), as shown in the following figure.

Let's assume we have a tour that we split into three subtours: A = (4,5), B = (6,1), C = (2,3). There are 8 possible ways to reconnect the original tour ABC:

a) ABC,  b) A'BC,  c) ABC',  d) A'BC',  e) A'B'C,  f) AB'C,  g) AB'C',  h) A'B'C' (Note: the symbol ' denotes a reversed segment).

 

3opt

As you can see, there exist only four pure 3-opt moves, namely d) A'BC', e) A'B'C, g) AB'C' and h) A'B'C'; a) is the original tour and the rest are 2-opt moves (one of the three broken edges is reinserted back).

We choose the best of all 8 cases, in particular through one function:

def get_solution_cost_change(graph, route, case, i, j, k):
    A, B, C, D, E, F = route[i - 1], route[i], route[j - 1], route[j], route[k - 1], route[k % len(route)]

    if case == OptCase.opt_case_1:
        # first case is the current solution ABC
        return 0
    elif case == OptCase.opt_case_2:
        # second case is the case A'BC
        return graph[A, B] + graph[E, F] - (graph[B, F] + graph[A, E])
    elif case == OptCase.opt_case_3:
        # ABC'
        return graph[C, D] + graph[E, F] - (graph[D, F] + graph[C, E])
    elif case == OptCase.opt_case_4:
        # A'BC'
        return graph[A, B] + graph[C, D] + graph[E, F] - (graph[A, D] + graph[B, F] + graph[E, C])
    elif case == OptCase.opt_case_5:
        # A'B'C
        return graph[A, B] + graph[C, D] + graph[E, F] - (graph[C, F] + graph[B, D] + graph[E, A])
    elif case == OptCase.opt_case_6:
        # AB'C
        return graph[B, A] + graph[D, C] - (graph[C, A] + graph[B, D])
    elif case == OptCase.opt_case_7:
        # AB'C'
        return graph[A, B] + graph[C, D] + graph[E, F] - (graph[B, E] + graph[D, F] + graph[C, A])
    elif case == OptCase.opt_case_8:
        # A'B'C'
        return graph[A, B] + graph[C, D] + graph[E, F] - (graph[A, D] + graph[C, F] + graph[B, E])
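
The function above refers to an OptCase enum and indexes graph with tuples (which suggests a NumPy distance matrix); neither is shown here. If you want to run the snippet standalone, a minimal hypothetical definition of the enum could look like this (the real one lives in the repository):

from enum import Enum

class OptCase(Enum):
    # the eight ways of reconnecting the segments A, B and C ( ' = reversed segment)
    opt_case_1 = "ABC"
    opt_case_2 = "A'BC"
    opt_case_3 = "ABC'"
    opt_case_4 = "A'BC'"
    opt_case_5 = "A'B'C"
    opt_case_6 = "AB'C"
    opt_case_7 = "AB'C'"
    opt_case_8 = "A'B'C'"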

Then, after we know which case of reconnecting the tour is the best, we need to reverse the segments accordingly:

if (i - 1) < (k % len(route)):
    first_segment = route[k % len(route):] + route[:i]
else:
    first_segment = route[k % len(route):i]
second_segment = route[i:j]
third_segment = route[j:k]

if case == OptCase.opt_case_1:
    # first case is the current solution ABC
    solution = first_segment + second_segment + third_segment
elif case == OptCase.opt_case_2:
    # A'BC
    solution = list(reversed(first_segment)) + second_segment + third_segment
elif case == OptCase.opt_case_3:
    # ABC'
    solution = first_segment + second_segment + list(reversed(third_segment))
elif case == OptCase.opt_case_4:
    # A'BC'
    solution = list(reversed(first_segment)) + second_segment + list(reversed(third_segment))
elif case == OptCase.opt_case_5:
    # A'B'C
    solution = list(reversed(first_segment)) + list(reversed(second_segment)) + third_segment
elif case == OptCase.opt_case_6:
    # AB'C
    solution = first_segment + list(reversed(second_segment)) + third_segment
elif case == OptCase.opt_case_7:
    # AB'C'
    solution = first_segment + list(reversed(second_segment)) + list(reversed(third_segment))
elif case == OptCase.opt_case_8:
    # A'B'C'
    solution = list(reversed(first_segment)) + list(reversed(second_segment)) + list(reversed(third_segment))
return solution

After that, we repeat the process until no further improvement can be made.
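Putting it together, the outer loop iterates over all valid (i, j, k) triples, evaluates every reconnection case and applies the best improving one, until a whole pass makes no change. A condensed sketch; it assumes the segment-reversal snippet above is wrapped in a helper called reverse_segments(route, case, i, j, k) (that name is my assumption, not necessarily the one used in the repository):

def three_opt(route, graph):
    improved = True
    while improved:
        improved = False
        n = len(route)
        # every (i, j, k) with 1 <= i < j < k <= n splits the tour into three segments
        for i in range(1, n - 1):
            for j in range(i + 1, n):
                for k in range(j + 1, n + 1):
                    best_case, best_gain = None, 0
                    for case in OptCase:
                        gain = get_solution_cost_change(graph, route, case, i, j, k)
                        if gain > best_gain:
                            best_case, best_gain = case, gain
                    if best_case is not None:
                        route = reverse_segments(route, best_case, i, j, k)
                        improved = True
    return route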

The whole code can be found in Python on my GitHub page. If you see any improvements that can be made, feel free to post them in the comments or open a PR directly!

 

 

PS: It is not in the scope of this post to describe 3-opt fully. It is meant to provide the implementation and at least a little bit of description of how the code works. See [2] for more information on this topic.

Source:

[1] https://www.sciencedirect.com/science/article/abs/pii/S0377221799002842

[2] https://tsp-basics.blogspot.com/

Christofides algorithm in Python

The travelling salesman problem (TSP) is a famous mathematical problem first mentioned back in 1832.

Basically, it asks the following question: given a list of cities and the distances between them, what is the shortest route that visits them all and returns to the original city?

The mathematical formulation was laid down by W. R. Hamilton and by the British mathematician Thomas Kirkman.

Christofides algorithm

The Christofides algorithm is an approximation algorithm for the travelling salesman problem. Beware, this algorithm works only on instances where the distances form a metric space. Therefore, it works only if the instance obeys the triangle inequality and the distance matrix is symmetric.

The Christofides algorithm has the best approximation ratio that has been proven so far: the solution it provides is guaranteed to be within 3/2 of the optimal solution. The proof can be read in various other sources.

The first step of the Christofides algorithm is to find a minimum spanning tree (MST).

#1 Minimum spanning tree

Given an undirected connected graph G = (V, E), a spanning tree S of G is a subgraph of G that connects all vertices together without cycles; a minimum spanning tree is a spanning tree with the minimum possible total edge weight.

The two best-known simple algorithms that can find an MST are Kruskal's and Prim's. I implemented the latter, but if you want to contribute to this open source project, feel free to make a pull request for Kruskal's. My Prim implementation can be found here on GitHub.
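For reference, here is a compact, generic Prim sketch using a binary heap (an illustration, not necessarily the exact version from the repository); it assumes the graph is given as a dict mapping each node to a dict of {neighbour: weight}:

import heapq

def prim_mst(graph, start):
    """Return the edges of a minimum spanning tree of a connected weighted graph."""
    visited = {start}
    # candidate edges (weight, u, v) leaving the visited set
    candidates = [(w, start, v) for v, w in graph[start].items()]
    heapq.heapify(candidates)
    mst_edges = []
    while candidates and len(visited) < len(graph):
        weight, u, v = heapq.heappop(candidates)
        if v in visited:
            continue
        visited.add(v)
        mst_edges.append((u, v, weight))
        for nxt, w in graph[v].items():
            if nxt not in visited:
                heapq.heappush(candidates, (w, v, nxt))
    return mst_edges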

#2 Get odd degree vertices

From my point of view, this step is pretty clear: find the vertices of odd degree in the MST and save them into a list.

#3 From odd degree vertices, make a minimum-weight perfect matching

The third step of the Christofides algorithm is to find a minimum-weight perfect matching. A matching M in a graph G is a set of edges without common vertices. A perfect matching is a matching which matches all vertices of the graph. That also means that every vertex of the graph is incident to exactly one edge of the matching M.

Only a few algorithms that can find a minimum-weight perfect matching have been published. One of them is Edmonds' original Blossom algorithm [1]. Multiple improvements have been made to it, such as Blossom IV by Cook and Rohe and Blossom V (as far as I know, the latest one) by Kolmogorov [2].

I tried to read the papers on those algorithms, but after long study I decided that I can use the algorithm already available in the networkx library, called max_weight_matching. How? By simply negating the signs in the adjacency matrix. Note that this procedure works only if we are searching for a maximum cardinality matching! Otherwise, the edges with negative weights would simply be ignored by the algorithm. Another approach would be to subtract each weight from a very large number.
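A sketch of steps #2 and #3 using networkx; the variable names mst (the spanning tree from step #1) and graph (the complete weighted graph) are my assumptions about the surrounding code, not the exact names used in the repository:

import networkx as nx

# step 2: vertices with odd degree in the minimum spanning tree
odd_vertices = [v for v, degree in mst.degree() if degree % 2 == 1]

# step 3: build the induced subgraph with negated weights and ask for a
# maximum-weight matching of maximum cardinality, which corresponds to a
# minimum-weight perfect matching on the original weights
subgraph = nx.Graph()
for u in odd_vertices:
    for v in odd_vertices:
        if u != v:
            subgraph.add_edge(u, v, weight=-graph[u][v]['weight'])

matching = nx.max_weight_matching(subgraph, maxcardinality=True)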

#4 Add edges from #3 into the minimum spanning tree

The fourth step is to unite the matching M and the minimum spanning tree MST to form an Eulerian multigraph. Multigraph means that there may be repeated edges. NetworkX provides a nice data structure called MultiGraph.

#5 Create Euler tour

Again, there is no need to implement an algorithm for finding an Euler tour from scratch. Fortunately, networkx provides a method for that.
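Continuing the same sketch, steps #4 and #5 merge the MST and the matching into a MultiGraph and let networkx produce the Eulerian circuit (variable names follow the sketch above):

# step 4: unite the MST and the matching into an Eulerian multigraph
multigraph = nx.MultiGraph(mst)
for u, v in matching:
    multigraph.add_edge(u, v, weight=graph[u][v]['weight'])

# step 5: every vertex now has even degree, so an Euler tour exists
euler_tour = [u for u, v in nx.eulerian_circuit(multigraph)]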

#6 Remove repeated vertices

The last step in the algorithm is to remove the repeated vertices that were included in the Euler tour. If path is a list, then the vertices may be removed this way:

path = list(dict.fromkeys(path).keys())
path.append(starting_node)

The whole code can be read on my GitHub.


Sources:

[1]:  https://stanford.edu/~rezab/classes/cme323/S16/projects_reports/shoemaker_vare.pdf

[2]: https://pub.ist.ac.at/~vnk/papers/blossom5.pdf

Django with Docker and Gitlab Runner example

In the second post of this series (check out the first one!) we will talk about the technologies that will be used in our day-to-day development process. A friend of mine wanted to participate in this project, so I decided to set up my old laptop as a full-time development server that will be running the newest version of my application.

 

I would like to make a CI/CD pipeline that will look something like this:

  1. Run pip freeze >> requirements.txt only if you added/removed/changed versions of any libraries in your virtual environment,
  2. Push new commit into gitlab with some changes,
  3. Gitlab runner should automatically run tests against the changes,
  4. If the tests pass successfully, gitlab runner should build the docker image,
  5. Gitlab runner should deploy the new version to the staging environment.

 

#1 Export all python libraries with their versions defined to the requirements.txt file

Let's say we added/removed/modified the libraries we use in the current project. For example, if I update the Django library to a newer version in my virtual environment and use a function from that new version, my colleagues who don't have the updated version of the library will be unable to run the project. We can solve this issue with the help of the pip freeze >> requirements.txt command.

This may be a reason why the development server won't work after a commit. We need to avoid this!

 

The pip freeze command does one simple thing: it lists all libraries installed in the current virtual environment. The >> symbol in Linux appends the output of the previous command to a file (use > if you want to overwrite it); in our case the file is called requirements.txt. Then we can push the updated file into git.

 

It is really easy to install all libraries from the requirements.txt file. Just use the following command: pip install -r requirements.txt. We will use this command in the Dockerfile, as you will see in a second.

 

#2 Create docker file

Docker is, more or less, a relatively new container technology. It is beyond the scope of this post, so if you don't know what Docker is and would like to learn more, just click here.

 

# Official Python image
FROM python:3.6

# create root directory for project, set the working directory and move all files
RUN mkdir /cryptoapp
WORKDIR /cryptoapp
ADD . /cryptoapp

# Web server will listen to this port
EXPOSE 8000

# Install all libraries we saved to requirements.txt file
RUN pip install -r requirements.txt

The FROM instruction tells Docker which image should be used as the base image. In this particular case, we specified Python 3.6. Don't forget to specify the version you've been using on your development machine.

The RUN instruction will execute any commands in a new layer on top of the current image and commit the results. That means every RUN command creates a new image layer that is used in subsequent commands.

The ADD instruction copies files from the local folder (in this case the whole folder where the Dockerfile is located) into the filesystem of the image.

The EXPOSE instruction informs Docker that the container will listen on port 8000.


Now let's move on to creating the docker-compose.yml file. My pick for the database is PostgreSQL, so the first service will be db with image: postgres. The second service is django. The most important part is everything after "command": python manage.py runserver 0.0.0.0:8000 starts the Django development server.

 

version: "2"
services:
  db:
    image: postgres
  django:
    build: .
    restart: always
    container_name: cryptoapp
    command: python manage.py runserver 0.0.0.0:8000
    ports:
      - "8000:8000"
    depends_on:
      - db
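
One thing worth mentioning: for Django to actually reach the postgres container, its settings.py has to point at the db service. A minimal sketch; the service name db comes from the compose file above, while the credentials are placeholders (in a real project they belong in environment variables), and you also need a PostgreSQL driver such as psycopg2 in requirements.txt:

# settings.py (fragment) - placeholder values, adjust to your project
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'postgres',   # default database of the postgres image
        'USER': 'postgres',
        'PASSWORD': 'postgres',
        'HOST': 'db',         # the service name from docker-compose.yml
        'PORT': 5432,
    }
}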
#3 Set up gitlab runner

The last step to achieve our goal is to set up the GitLab runner. The pipeline has its configuration stored in a file called ".gitlab-ci.yml". Let's create the basic config (as you can see, the docker-compose command is used, so you need to include the docker-compose Python package in your requirements.txt):

stages:
  - test
  - build
  - deploy

test:
  stage: test
  script: echo "Testing the app"

build:
  stage: build
  script:
  - echo "Building the app"
  - docker-compose build

deploy_staging:
  stage: deploy
  only:
  - master
  script:
  - echo "Deploying the app"
  - docker-compose up -d

As you can see, we have 3 separate stages.

The first one is just a testing phase, currently a dummy (the script just prints out the "Testing the app" string). The second one is the build phase, where we build the Docker image.

The third phase is where the magic happens: we bring the Docker containers up, and our server should be ready, with the database running as well.

 

#4 Summary

To sum this up, we now have a working CI/CD pipeline for our project. After every commit to GitLab, this pipeline will start. All steps that the GitLab runner performs are recorded in the .gitlab-ci.yml file.

01: Crypto dev info: Idea & Choosing technologies

#1 – A Short Introductory to Cryptocurrency

Bitcoin has skyrocketed over the past few years, reaching nearly $20,000 in value in late 2017. Today it sits at a steady $7,000, but the value of digital assets has gone up significantly over the past few years. Cryptocurrency was invented by a person or team we know only as Satoshi Nakamoto. The Wikipedia article on Satoshi Nakamoto lists 13 different people who have been suggested as the creator of Bitcoin.

 

Not long into the creation of Bitcoin, Satoshi had vanished from the project and was never to be seen or heard from again. Since the creation of Bitcoin, cryptocurrency has surged well above anyone’s original expectations and since then, a lot of new digital assets have been created as well. We’ll discuss the top three digital currencies at this point in time.

 

#2 – The Top Three Most Famous Cryptocurrencies

 

#2.1 – Ethereum

Designed by Vitalik Buterin in 2013, Ethereum has risen quickly among the ranks of digital coins. In 2016, a failed DAO project resulted in Ethereum being separated into two different blockchains. There is now Ethereum Classic and Ethereum. Ethereum is still maintained by Vitalik Buterin but Ethereum Classic doesn’t seem to be going anywhere anytime soon with stagnant development.

 

 #2.2 – XRP

XRP was released in 2012, but today the coin is officially owned by Ripple. Ripple currently focuses more on payment processing services and financial institutions. If you want more information on Ripple, you can go to the following blog: https://xrphodor.wordpress.com/ or you can go to the official Ripple website.

 

#2.3 – Bitcoin Cash

In 2017, a separate blockchain was created when a small team of developers wanted to increase the blocksize of Bitcoin and they wanted to change the initial code. As a result, a hard fork occurred and this split the coin into two different entities. Now, there’s the original Bitcoin and Bitcoin Cash.

 

#3 – An Overview of The Current Market Evaluation and Progress

For the evaluation of the market and cryptocurrency's total assets, we can use market cap. Market cap is simply the total number of coins in circulation multiplied by the price of each coin. Each individual coin has its own value, and we can use market cap to evaluate it.

 

For example: if Bitcoin is currently valued at $7,376.41 (this changes daily) and there are 17,249,375 coins in circulation, then the total market cap of BTC is valued at $127,238,394,412.

 

As of 4/9/2018, the total market cap was at $240,129,298,509 which is a significant change from January of 2018 which had a total market cap of $800,000,000,000. The price change can be attributed to the fundamentals of FUD, which are Fear, Uncertainty and Doubt. This is still a huge price jump from 2013 though, because the total market cap back then was only $1,000,000,000.

 

#4 – Market Manipulation and ICO Frauds

However, with the success and rise of cryptocurrency, there are bound to be some projects that are questionable and some that are downright scams. There have been a lot of promising projects that turned out to be either not worth their salt or outright scams (Bitconnect comes to mind).

 

My application doesn’t have a name yet but my main mission is to provide development information for all crypto coins. Think of it as a general Wiki for everything cryptocurrency based with information such as how many commits are made, the number of contributors, the number of releases and more. For contributors, that doesn’t just mean the inventor, that means how many software engineers are on the project and how many people are actively working on it at the time being.

 

The main benefit of this project will be that investors can look up information on a coin, learn a little about its history and decide whether or not it is a good investment for them. With any investment opportunity you should always do your due diligence, but this project will serve as a nice starting point.

 

If you have any new ideas or features you’d like to see in this project, please let me know and I will discuss these ideas with my team to see if we can bring them to life.

 

#5 – The Technology Involved

For the next few posts on this blog, I want to discuss the technology and fundamentals that'll be going into this project. This should provide useful information to developers and like-minded people who may be interested in the project. The backend will be programmed in Python with the Django framework. The UI is nice and clean, which was made possible thanks to proper HTML formatting, clean CSS, clean JS and the Bootstrap framework.

First programming job: useful list of things to know

As you probably don't know, I have been working as a software engineer for a short period of time.

My first job was related to manual testing of one bigger product, and it laid a good foundation for my future. In that job I learned where bugs are mostly found, whether they're easy to fix, what their priority is and why they exist in the first place.

Moving on, I was employed as a software engineer. But back in the day I was wondering, what do I have to know to get a job? There are so many languages, frameworks and principles that I didn't know at that time! (and I still don't).

Here is a list of items that are useful to know before you start your first programming job. Don't worry, you will probably land a job anyway, but this is a recommendation from my point of view.

#0 Work discipline

Programming is a hard job. You need to deliver regularly and finish the tasks assigned to you. How much time you have for each task is up to the management of the company you're working for.

On the other hand, nobody will stand over your shoulder to check whether you're currently on Facebook or really working hard. That means you can slack off now and then, but the timetable should be followed very strictly.

To be a programmer, you should also be a team player. Don't be the lazy dog that won't lift a finger without an order. You're the new guy, so you should take on as many tasks as you can; that gives you the opportunity to ask questions and learn a lot from your colleagues (big thanks to my former and current colleagues, they're great).

#1 Programming language used to develop the product you're working on

First of all, you should pick the type of development you want to do. Don't worry, the transition between the various types is not that hard, except that you need to learn new stuff.

I don't want to start language and paradigm wars, so I will just point out the most-used languages according to the Stack Overflow Developer Survey (SODS) from 2018, divided into simple categories.

Front-end stack: HTML, CSS, Javascript

Back-end stack: Java, C#, Python

Databases: SQL

Mobile development: Android (Java), Swift

Data science: R, Python

In my opinion, it is useful for every IT specialist to know their way around one scripting language that helps them fix problems, analyze logs, analyze defects, write short programs, etc. I can recommend Python for one simple reason: there are over 135,119 packages available via the Python Package Index (PyPI).

#2 Version Control System

Version control systems are a category of software that helps development teams all over the world to manage the changes done to source code over time. It records every change and developers can revert their mistakes.

According to the Stack Overflow Developer Survey [SODS], the winner is obvious. Git is a popular open-source VCS originally created by Linus Torvalds.

  1. Git 87.2%
  2. Subversion 16.1%
  3. Other (Mercurial, Team Foundation Version Control)

Read at least a few chapters from an entry-level Git book to cover the basics. Dealing with the source code of a whole project is risky, and you don't want to break something in your first few days (because everybody did :)). You should be comfortable with basic merges, pull / push requests, and understand commit / add and fetch.

#3 OOP / Design principles

Nowadays, most languages are OOP oriented. If that's the case and you've chosen an OOP language, you should know what abstraction, encapsulation, composition, inheritance, delegation and polymorphism are.

Let's move on to design patterns:

From my point of view, if you're starting your first job you don't have to know explicitly how to use design patterns, but you should be able to recognize some of them in the code. At every job interview I was asked about a couple of them (my personal favorite is Observer, and I was asked just about that one; luck, I guess).

#4 Code style

A programming style is a set of guidelines for writing the source code of your application. The general idea is that a common programming style helps developers with code readability and understandability.

Imagine reading code:

public 
static void 
main(String[] args) 



    {a = 123;
       b = 231;

                }

instead of:

public static void main(String[] args) {
    a = 123;
    b = 231;
}

Every language has its own conventions, like the Google Java Style Guide or PEP 8 for Python.

#5 Basic database knowledge

Almost every application works with some kind of database. The most-used database language is SQL. The basic syntax and commands will be sufficient; you can look up everything else on the fly.

You should understand what a query, insert, select, delete and foreign/primary keys are, etc., just enough to know your way around if you need to fix something in the DB.

#6 Basic algorithms and data structures

Every university that I know of teaches some basic course on algorithms and data structures. You don't need to solve the Travelling Salesman or Shortest Path problem, because you will hardly ever need it; if you do, you will look it up later. However, you should be able to understand when to use which structure for your data.

Python dictionaries and tuples, R data frames and Java maps are an important part of the programming world. Knowing them well gives you multiple ways of solving your problems; it is then up to you to choose the right one!
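For example, checking whether an element is present in a dictionary (or set) takes constant time on average, while the same check on a list scans it linearly:

user_ids = list(range(1_000_000))
user_index = {uid: True for uid in user_ids}

# O(n): may scan the whole list
print(999_999 in user_ids)

# O(1) on average: a hash-table lookup
print(999_999 in user_index)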

#7 Terminal commands knowledge

Everybody loves open source. Everybody loves things that are free. Linux is strong in the server field. Depending on the project, you may need to ssh into a remote machine and perform various operations. Commands like ssh, vim, tar, scp, sudo, grep, find and others should be familiar to you.

Those commands are available to you on Mac machines and on Windows (through terminal emulators like cmder) as well. The logs generated by applications are pretty big, and standard text editors have trouble loading them (I tried Sublime and Atom). The Vim editor is blazingly fast and is the right tool for this job.

 

This list is just the view of a 'recent' newcomer to the world of programming jobs. Don't take it too seriously. Also, this article isn't meant to explain all the items, just to list them so you know they exist.

Sources:

  1. https://insights.stackoverflow.com/survey/2018/

Too many open files

Okay, so your boss calls you at 8 p.m. to tell you that the application has stopped. You log in to your machine and you don't know what has happened. Then, while checking the logs, you find this culprit:

xxxx xxx Too many open files.

#1 Understanding the problem

If you are not familiar with Linux, you may be confused. Everything in Linux is considered a file, even your keyboard. To access those files, file descriptors are used.
A file descriptor is a number that uniquely identifies an open file within an operating system. In a Unix-like system, a file descriptor can refer to any file type named in a file system, for example:

  1. Regular file
  2. Unix domain sockets
  3. Named pipes
  4. Anonymous pipes
  5. Network sockets

#2 Possible causes

  • Processes running on your OS are leaking resources, which means that a process opened too many streams that were never closed
  • Your process is leaking resources
  • Too many unclosed connections were opened on your server
  • The limit on the number of open FDs is too low

To get the process ID of your application:

ps aux | grep yourAppName

To check limits set to your process:

cat /proc/processID/limits

When you encounter this problem, you should immediately look at the output of the lsof command:

man lsof

In the absence of any options, lsof lists all open files belonging to
all active processes.

If any list request option is specified, other list requests must be
specifically requested - e.g., if -U is specified for the listing of
UNIX socket files, NFS files won't be listed unless -N is also specified;
or if a user list is specified with the -u option, UNIX domain
socket files, belonging to users not in the list, won't be listed
unless the -U option is also specified.

From the lsof output you can safely tell which application is causing your problems and find the ID of that process.

To check how many FDs were opened by your process, you can use the following command:

lsof -a -p {processID} | wc -l

Then, if you need to identify every file descriptor opened by your application:

lsof -a -p {processID}

To check the user limit on FDs:

ulimit -n

This returns the maximum number of opened file descriptors.
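If you prefer checking the same limit from inside a running Python process, the standard library's resource module (available on Unix-like systems) exposes it:

import resource

# soft and hard limits on the number of open file descriptors
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit:", soft, "hard limit:", hard)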

More information on limits is stored in the following file: /etc/security/limits.conf

As you can read in the first two commented lines:

# This file sets the resource limits for the users logged in via PAM.
# It does not affect resource limits of the system services.

More about limits.conf can be found here.

#3 Fixing

Basically, there are only 2 causes of this problem, but the second one is hard to find when dealing with big projects.

#3.1 FD limit is too low

If your application is running on a web server, there may be more connections opened to your server than your Linux distro allows. You can increase the hard and soft limits in the following way:

User limit: 

To change the user limits, edit the config file:

vi /etc/security/limits.conf

System wide limit: 

sysctl -w fs.file-max=100000
echo "fs.file-max = 100000" >> /etc/sysctl.conf

and then relog the user.

#3.2 App is leaking resources (not closing streams)

If this is the problem, you need to close the streams opened by your app. If you encountered this problem in an application written in Java, your resources may be leaking. You can use the (no longer new at the time of writing) try-with-resources form of try, which automatically closes all streams that may cause problems in the future.
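The same idea exists in Python: a context manager (the with statement) closes the resource even when an exception is raised, so file descriptors don't pile up. A small illustration (the path is just a placeholder):

# the descriptor leaks if an exception is raised before f.close() runs
def leaky_read(path):
    f = open(path)
    data = f.read()
    f.close()
    return data

# the with statement guarantees the file is closed on exit
def safe_read(path):
    with open(path) as f:
        return f.read()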

PS:

  1. Check out more information about hard limit and soft limit settings in Linux. Good answers are found here and here

If you have any questions or any tips what to look for when this problem happens, feel free to leave a comment down below!

Sources:

https://ss64.com/bash/ulimit.html

https://linux.die.net/man/8/lsof

https://www.cyberciti.biz/faq/linux-find-all-file-descriptors-used-by-a-process/

https://www.cyberciti.biz/faq/linux-increase-the-maximum-number-of-open-files/

Feature Selection process

#1 Motivation for use of feature selection methods

Feature selection process

Nowadays, we live in a world where smart devices produce huge amounts of data. In real-world applications we encounter a big issue known as the curse of dimensionality: we can easily find datasets with a high number of features without an adequate number of observation points. That can (and probably will) cause over-fitting of the model.

To implement data mining methods effectively, feature selection methods are often used in the preprocessing stage.

#2 What is feature selection?

Feature selection, also known as variable selection, is the process of selecting a subset of relevant features for use in the subsequent model. It is used to speed up the learning process and to improve model interpretability.

Note: Irrelevant features decrease prediction accuracy and simultaneously increase learning time (due to high dimensionality).

#3 Feature selection method approaches

FS methods are commonly divided into 3 main approaches:

  1. Filter methods have the smallest computational complexity. They select the subset of features regardless of the model (classifier). The importance of a variable is calculated by various statistical tests relating it to the output variable. State-of-the-art methods: Relief, mRMR. Filter methods can be further divided into 2 categories: univariate and multivariate FS methods. Univariate methods consider each feature in isolation, while multivariate techniques also take feature dependency into consideration. A minimal scikit-learn sketch of the filter approach follows this list.
  2. Wrapper methods try to select a subset of features and train a model using them. After the model is trained, some conclusions are drawn and, based on them, we either remove features from or add features to that subset. Wrapper methods are really computationally expensive, because the problem is basically an exhaustive search over all subsets. The final subset is chosen as the subset with the highest achieved score.
  3. Embedded methods try to combine the best parts of filter and wrapper methods. They perform feature selection and classification simultaneously.
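As an example of the filter approach, scikit-learn's univariate selectors score each feature against the target independently of any model. A small sketch on synthetic data (the choice of f_classif and k=10 is arbitrary here):

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# synthetic dataset with many irrelevant features
X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=5, random_state=0)

# keep the 10 features with the highest ANOVA F-score with respect to the target
selector = SelectKBest(score_func=f_classif, k=10)
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)      # (200, 50) -> (200, 10)
print(selector.get_support(indices=True))  # indices of the selected features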


Python ML classifiers: AdaBoost

Nowadays, there are a lot of classifiers that perform very similarly and use more or less the same system resources. How do you decide which of these classifiers you should use? Is it AdaBoost, Naive Bayes or something else?

Well, the most obvious answer is that you have to try all of them and then pick the best one! (It varies with your dataset though).

In my next post I will describe Naive-Bayes classifier and I will provide a working example using sci-kit learn framework in python.


 

#1 AdaBoost: Short description

AdaBoost is short for Adaptive Boosting. Boosting is one of the most powerful learning ideas, originally designed for classification problems.

Consider a two-class problem, with the output variable coded as Y \in \{-1, 1\}. Given a vector of predictor variables X, a classifier C(X) returns a prediction of one of the two values \{-1, 1\}. The error is defined as follows:

Err = \frac{1}{N} \sum_{i=1}^{N} I(y_i \neq C(x_i))

Boosting is an iterative procedure that combines the results of many weak classifiers (C_m(x), m = 1, 2, \dots, M) on modified data to produce a powerful classifier. Starting with the unweighted training sample, AdaBoost builds a first weak classifier C_1(x), for example a decision stump, that produces class labels.

A weak classifier is defined as a classifier whose error rate is only slightly better than random guessing. If an observation point is misclassified, its weight is increased. Subsequently, the next classifier is built using the new weights. The predictions from all of the classifiers are then combined by weighted majority voting, producing the final prediction:

C(x) = sign(\sum_{m=1}^{M}{a_m C_m(x)})

where a_1, a_2, \dots, a_M are weights computed by the boosting algorithm to give more influence to the better classifiers in the sequence. Each boosting step consists of applying weights w_1, w_2, ..., w_N to the training data points (x_i, y_i), i = 1, 2, ..., N.

 


#2 AdaBoost implementation in sci-kit learn

There exist a lot of different AdaBoost variants, but I have chosen AdaBoost SAMME, which supports multi-class classification and is already implemented in scikit-learn.

The AdaBoostClassifier class actually doesn't require any parameters to run (it has only optional ones), and they are:

  • base_estimator: the default estimator is a DecisionTreeClassifier with max_depth=1 (I called it a decision stump in the previous section)
  • n_estimators: Mease and Wyner (2009) say that 1000 estimators should be enough. I believe that sometimes you do not need that many estimators; personally, I used 500 estimators in my latest simulations.
  • learning_rate: I haven't tried tuning this parameter, so the default is one. There is a minor trade-off between the number of estimators and the learning rate. This parameter lowers the contribution of each classifier.
  • algorithm: SAMME.R is the default
  • random_state: set it to a fixed value if you want reproducible results

The actual code is pretty simple:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import AdaBoostClassifier

X_full, y_full = make_classification(n_features=2, n_redundant=0, n_informative=2,
                                     random_state=1, n_clusters_per_class=1)
X_train, X_test, y_train, y_test = train_test_split(X_full, y_full, train_size=0.8)
scaler = StandardScaler().fit(X_train)
clf = AdaBoostClassifier(n_estimators=100)
clf.fit(scaler.transform(X_train), y_train)
score = clf.score(scaler.transform(X_test), y_test)
print('RESULTS AdaBoost: ' + str(score))

As you can see, you don't need to tune many parameters, in comparison with, for example, an SVM (thus there is no need to use GridSearchCV).

 

 

RNetLogo: 1. Linking R with NetLogo

Agent-based modeling (ABM) is a simulation approach that studies social systems from the complex adaptive systems perspective. Usage of ABM is mostly limited to research, although I have already seen some private companies using it for profit.

NetLogo simulation

The most popular simulation platform is NetLogo, which uses a functional programming language called Logo. Unfortunately, NetLogo itself doesn't support parallel computing, which is a shame, because ABM is resource intensive. Actually, there is one workaround that I know of.

Prerequisites:

  • R: statistical programming language that can be obtained at https://www.r-project.org/. I recommend downloading the IDE called RStudio
  • NetLogo
  • R library "parallel": provides parallel computing in R
  • R library "RNetLogo": provides an interface for controlling NetLogo from R

Note: One more reason why I use the RNetLogo library is the appearance of NetLogo's plots and statistical output (they are not suitable for publication).

library("RNetLogo") 
nl.path <- "C:/Program Files/NetLogo 5.3.1/app/" # the path to NetLogo.jar
NLStart(nl.path, gui=FALSE) # true - if you want to run NetLogo with GUI
model.path <- "C:/Program Files/NetLogo 5.3.1/app/models/model.nlogo" # path to your .nlogo model
# set up parameters and initialize model
NLCommand("set parameter1", toString(1.00))
NLCommand("set parameter2", toString(2.00))
NLCommand("setup")
# and run model for 10 000 ticks.
nOfTicks <- 10000
NLCommand(nOfTicks, "go")

That was a basic example without parallel runs. In the next example I'll provide working code for running NetLogo in parallel.

library(parallel)
processors <- detectCores() # number of cores you want to use
cl <- makeCluster(processors)

gui <- FALSE # run the parallel instances without GUI
nl.path <- "C:/Program Files/NetLogo 5.3.1/app/" # the path to NetLogo.jar
model.path <- "C:/Program Files/NetLogo 5.3.1/app/models/model.nlogo" # path to your .nlogo model

# function that starts NetLogo and loads the model
preprocessing <- function(dummy, gui, nl.path, model.path) {
  library("RNetLogo")
  NLStart(nl.path, gui=gui)
  NLLoadModel(model.path)
}

simulationFunction <- function(id) {
  NLCommand("set param1", toString(1))
  NLCommand("setup")
  NLDoCommand(10000, "go")
}

# function that quits NetLogo after the simulation
postprocessing <- function(x) {
  NLQuit()
}

# start NetLogo on every worker and load the model
invisible(parLapply(cl, 1:processors, preprocessing, gui=gui, nl.path=nl.path, model.path=model.path))

# run one simulation per worker
result.par <- parSapply(cl, 1:processors, simulationFunction)

# quit NetLogo on every worker and stop the cluster
invisible(parLapply(cl, 1:processors, postprocessing))

stopCluster(cl)

Note: I hope the comments are informative enough and you won't have any problems running your model. If you have any questions, please write them below in the comments; I will answer as soon as possible.