Monday, February 15, 2016

How to store passwords

Hash function

Let's do some modeling - we will start with storing password problem in our Identity Management System.

First of all I want to start with a statement: we don't want to store plaintext user passwords. We want to know that provided password is valid. Basic idea behind this concept is to create a function to transform user password to less vulnerable and irreversible hash.

At this point I think we are all aware that we should do this that way and you probably know about existence of hash functions. Choosing a proper algorithm to provide such hash is just not so simple.
We can start with some basic model and we will define some rules to our hash algorithm later on.
PasswordHashProvider - basic interface to provide one or more hash algorithms should looks like this:

public interface PasswordHashProvider {
    String encode(String password, UUID salt);

    default boolean verify(String password, UUID salt, String hash) {
        return encode(password, salt).equals(hash);
    }
}

With this simple interface we can model our requirements and implement few tests.
First of all we want our function to provide some properties like:
  • it is quick to compute the hash value for any given message
  • it is infeasible to generate a message from its hash
  • it is infeasible to modify a message without changing the hash
  • it is infeasible to find two different messages with the same hash.
To validate first property: "it is quick to compute the hash value for any given message" we can generate multiple passwords and fail if our computations last a little bit too long. Lets write test for this:

@Timeout(value = 100, unit = TimeUnit.MILLISECONDS)
def "should generate hash during 100 milliseconds or faster"() {
 given:
  String simplePassword = "PASSWORD"
 when:
  String result = passwordHashProvider.encode(simplePassword, UUID.randomUUID())
 then:
  !result.isEmpty()
}


Second property of hash function - "infeasible to generate a message from its hash" is quite interesting from mathematical perspective. First of all this kind of function should be irreversible (I mean there should be no function f^-1 that f(f^-1(x)) = x ). This property is very easy to break- I mean every reversible function should be injective. To break injective easily we can assign same result for multiple values. This is the case for our password hash function - you can login with different password you defined but it should be infeasible to find the second one.

To express third property: "infeasible to modify a message without changing the hash" I wrote simple test:

def "should generate different hash for same passwords and different salt"() {
 given:
  List hashes = []
  Integer countSum = 0
  String password = "PASSWORD"
 when:
  (1..ITERATIONS).each {
   String hash = passwordHashProvider.encode(password, UUID.randomUUID())
   hashes.add(hash)
   countSum += hashes.count(hash)
  }
 then:
  countSum == ITERATIONS
}
Ok - I will explain one trick used here to validate this property. We want to check if change of salt will change the password hash. Salt is just some known value that is accessible during encryption of passwords to improve their quality and complicate hacker life because password rainbow tables (tables with common hash/password combinations) does not work for passwords with salt properly added during encoding.

Second trick of this test is question: how to verify that our hashes are unique? Simple idea behind this is to sum count of our hash among other hashes and after that check if our sum is equal to number of iterations. In case we provide only one hash for all values we will get sum of arithmetic progression. In our case for lets say 100 password hashes we will get sum of 5050. Therefore this sum should be from range 100 - 5050.

I can't find a way to test property "it is infeasible to find two different messages with the same hash". We have to assume that this property should be reported by security researchers if we use one of well known algorithms.


Well known algorithms:

MD5

This algorithm fails the last property: "it is infeasible to find two different messages with the same hash". You can find details searching for phrase: "MD5 is not collision resistant".

Bcrypt

Default password hash algorithm for BSD and other systems including some Linux distributions.

Scrypt

What a silly name! What should I put in google to find java implementation? Java scrypt? :)
This algorithm characterizes with some good properties like being  quite heavy memory intensive (which is good, because it saves us from farms of nvidia cores ready to break our computation intensive algorithms - as a humanity we can't scale memory without huge costs (in early 2016) ).

PBKDF2

After wikipedia: "PBKDF2 (Password-Based Key Derivation Function 2) is a key derivation function that is part of RSA LaboratoriesPublic-Key Cryptography Standards (PKCS) series, specifically PKCS #5 v2.0, also published as Internet Engineering Task Force's RFC 2898."

Argon2


To sum up

Lots of algorithms to hash passwords exists but  which one is best or just good enough? Of course Argon2 is quite interesting. We can define how hard it can be to cpu, memory and we can also define degree of parallelism. For now I can't find JVM implementation of this algorithm - only bindings to C lib.

PBKDF2 seems reasonable - it has java native support (javax.crypto package). It has quite interesting properties and is widely implemented. It has some troubles like being cpu rather than memory intensive (which in cloud environments is also crucial for us). Lets start with PBKDF2 and move to Argon2 someday (the day I will find the JVM implementation with Apache2 compatible licence).

Friday, February 12, 2016

Spring sleuth

Correlation Id

When you start with microservices the natural question comes to mind: how to debug all this stuff. Unfortunately some of problems appear only in production environments. How to trace them effectively? Solution is not always so simple, but correlation id should help you with debugging such things. I will show you logs and everything should become clear right now:

2016-02-12 19:48:41.116 [trace=6d9ed076-bb14-42e9-b537-3ae1e85441b0,span=6d9ed076-bb14-42e9-b537-3ae1e85441b0] [o-auto-1-exec-3] o.s.cloud.sleuth.util.ExceptionUtils    : Trace client warn: thread http-nio-auto-1-exec-3 tried to start a new Span with parent MilliSpan(begin=1455302921115, end=0, name=null, traceId=6d9ed076-bb14-42e9-b537-3ae1e85441b0, parents=[], spanId=6d9ed076-bb14-42e9-b537-3ae1e85441b0, remote=true, exportable=true, annotations={}, processId=null, timelineAnnotations=[]), but there is already a currentSpan MilliSpan(begin=1455302921054, end=0, name=http/api/user/januszkowalski, traceId=6d9ed076-bb14-42e9-b537-3ae1e85441b0, parents=[], spanId=6d9ed076-bb14-42e9-b537-3ae1e85441b0, remote=false, exportable=false, annotations={}, processId=null, timelineAnnotations=[])
2016-02-12 19:48:41.119 [trace=6d9ed076-bb14-42e9-b537-3ae1e85441b0,span=17a6a6c2-dba3-4ac1-a23b-2deb6371c08b] [o-auto-1-exec-3] o.s.cloud.sleuth.log.Slf4jSpanListener  : Starting span: MilliSpan(begin=1455302921119, end=0, name=http/api/user/januszkowalski, traceId=6d9ed076-bb14-42e9-b537-3ae1e85441b0, parents=[6d9ed076-bb14-42e9-b537-3ae1e85441b0], spanId=17a6a6c2-dba3-4ac1-a23b-2deb6371c08b, remote=false, exportable=true, annotations={}, processId=null, timelineAnnotations=[])


First idea is to add unique id to every request and trace this id between another microservices interactions. This trick gives us possibility to trace our logs with this uid. Quite simple, but how to make this without rewriting this feaature from scratch?

In Approdorix project I started with Spring sleuth. It was really simple, just one line in application.properties:

logging.pattern.console=%d{yyyy-MM-dd HH:mm:ss.SSS} [trace=%X{X-Trace-Id:-},span=%X{X-Span-Id:-}] [%15.15t] %-40.40logger{39}: %m%n

Observable

Why you should consider Couchbase as a KV or document database?

Tomek Nurkiewicz gave me access to his fresh book: "Reactive Programming with RxJava". I think that this is my chance to learn "something new" by designing fully reactive application. 

Evolution - from callbacks to RX

As you probably know node.js and javascript in general is single threaded. You can ask question - how it is possible that most nowadays Single Page Applications do a lot of things under the hood? They can send few request to backend and redraw user interface in responsive timing. Of course there is a possibility to run multiple IO operations at the same time in background but our app should not actively wait and check IO operations statuses.
Key to the "success" or rather a way how they do this in javascript is callbacks. After our operation ends (with success or error) our app gets informed instantly. 
Have you heard about callback hell? It all starts in javascript. For many years a lot of fresh js adepts create a lot of async code that tends to move to the right - I mean literally moves to the right because every callback moves their code 4 spaces right.
Now you can't enter modern JS without hearing about solutions to avoid callback hell (I mean promises and $q).

I was quite familiar with $q and promises in javascript when I found out CompletableFuture in java8. I like those new mechanism in java8 especially mixed with lambdas.
If you don't know much about RxJava you can imagine that RxJava is just CompletableFuture on steroids.

I found in Tomeks book a chapter named: "Couchbase client API" but there is no content behind this marker in current version. I can imagine that this chapter will be extremely interesting and especially if Tomek will compare RxJava driver with other db drivers - those blocking ones.

What is so special about relation of Couchbase with RxJava? Couchbase is first database I know to provide native RxJava driver. What does it mean? You can query your database asynchronously and be informed just after our data reach client side document by document until we get complete result or we end with any error. Lets look how we can write this in code:

    @RequestMapping(value = "/{login}", method = RequestMethod.GET)
    public DeferredResult<User> findByLogin(@PathVariable String login) {
        DeferredResult<User> deferred = new DeferredResult<>();
        userService
                .get(login)
                .subscribe(deferred::setResult,
                        e -> {
                            logger.error("Can't find user by login", e);
                            throw new UserDoesNotExist();
                        });
        return deferred;
    }

We have few mechanisms in this small piece of code:

DeferredResult - this is Spring way to defer returning of result (from Spring MVC 3.2). This mechanism benefit from servlet 3 and its ability to async request processing. This is quite important because it is very easy to transform RxJava into blocking and we should do such trick to avoid blocking any thread.

Observable<User> - this is RxJava stuff. UserService.get(login) method returns Observable<User>. We can subscribe on events from this observable and react to emitted events. If we get healthy user we can return our user as a result with deferred::setResult. If we get some nasty error we can put message in log and throw runtime error. With all those tricks during the time our database process our query no threads on our application server waits for result. Just after they will come our app will process them with new threads.

I wonder if I can develop entire API in this manner - I mean reactive way. I will get back to this post if I encounter any problems with this approach. I think that Tomeks book will help me to stay on course and avoid me from making mistakes that transforms RxJava into blocking one.

Approdorix

The idea behind the Approdorix.com

My blog looks dead for some time, but a lot of other stuff was done under the hood (this is called the life). Finally I found the brilliant idea to write entire cycle of blogposts about creating open source project. My idea is to build some general microservices to fulfill most general needs of future applications.

Is this possible? 

I must admit that I am afraid about the limited usability of a general models, but let's consider some bounded contexts where we can apply general domain models without bringing much risk to the project.

For example in Identity Management domain there is a lot of stuff you can do without any changes in almost every application. Users can register, login, logout, edit some basic data, use other authorization providers like: facebook, gmail or twitter. Things gets even more complicated if you consider microservices architecture. It sounds like a lot of stuff that has to be done even before you actually provide any real business value. I think that Identity Management done right can be shared between the future projects and therefore IDM is good bounded context to start with.
You can find github repository here:
https://github.com/piotrpietrzak/approdorix-users

A friend of mine told me about some general model of accounting pattern described in Martin's Fowler paper. After reading this I think that it can bring some value to implement general accounting microservice to provide financial consequences of business events. I wonder if it is possible to implement "general accounting" module that can with very small adjustments work in any domain. 

Start small and grow only if we have to

Of course I do it for fun and therefore I do not want to just do it, but I want to learn something new, discover the way of doing things right and write about it.

My plan is to run those microservices with all dependencies on AWS with Kubernetes. I think this assumption is quite useful for our open source project end users because they can start small and grow even if they don't want to spend a lot of $ on infrastructure. I think I will start with couchbase as a document database and kafka as message broker. Why? I will explain in following blog posts.