Category Archives: Computer Science

try catch

ref –

When a single, unconditional catch clause is used, the catch block is entered when any exception is thrown. For example, when the exception occurs in the following code, control transfers to the catch clause.
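The original snippet did not survive the import, so here is a minimal sketch of the pattern being described (the function and month list are illustrative, not the post's original code):

```javascript
// A lookup that throws when given a bad month number.
function getMonthName(mo) {
  const months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"];
  if (months[mo - 1] !== undefined) {
    return months[mo - 1];
  }
  throw new Error("InvalidMonthNo"); // control transfers to the catch clause
}

let monthName;
try {
  monthName = getMonthName(13); // throws, so the assignment never happens
} catch (e) {
  monthName = "unknown";        // e holds the thrown Error object
}
```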

The catch block specifies an identifier (e in the example above) that holds the value specified by the throw statement. The catch block is unique in that JavaScript creates this identifier when the catch block is entered and adds it to the current scope; the identifier lasts only for the duration of the catch block. After the catch block finishes executing, the identifier is no longer available.


Note that the finally clause executes regardless of whether an exception is thrown. Also, if an exception is thrown, the statements in the finally clause execute even if no catch clause handles the exception.

You can use the finally clause to make your script fail gracefully when an exception occurs; for example, to do general cleanup, you may need to release a resource that your script has tied up.
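Here's a hedged sketch of that cleanup pattern; openFile/closeFile are hypothetical stand-ins for a real resource API, not actual functions:

```javascript
const log = [];
function openFile(name) { log.push("open " + name); return { name }; }
function closeFile(f) { log.push("close " + f.name); }

function writeTo(name, shouldFail) {
  const f = openFile(name);            // tie up a resource
  try {
    if (shouldFail) throw new Error("write failed");
    return "ok";
  } finally {
    closeFile(f); // runs whether or not the try block threw
  }
}
```

Even when `writeTo` throws, the file is still closed before the exception propagates to the caller.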

General situations

Use of try…catch blocks is generally discouraged, and this doesn’t depend on the language you use.

The main reason for this is the cost of catch blocks. Another reason is that when you wrap many statements in a single try…catch block, the catch block can’t tell you exactly which statement caused the problem.

It’s better to use techniques like validation or if…else blocks to reduce the probability of an exception (error) happening. For example, when you want to work with a number taken from the user, instead of using try…catch, you can validate it first:
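A minimal sketch of that validation approach (the `half` function and its behavior on bad input are my own example, not from the post):

```javascript
// Validate user input up front instead of relying on try…catch.
function half(input) {
  const n = Number(input);
  if (Number.isNaN(n)) {   // validation, not exception handling
    return null;           // the caller decides how to recover
  }
  return n / 2;
}
```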

deeply nested objects where a null can pop up at any level

to perform this “get” with 100% safety one would have to write a bit of verbose code:
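Something like the following, where `config` and its nested shape are made-up names for illustration:

```javascript
const config = { person: { address: { street: "Main St" } } };

// Every level must be checked before descending another level.
let street;
if (config && config.person && config.person.address) {
  street = config.person.address.street;
} else {
  street = undefined;
}
```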

Now imagine the variable names are realistic and longer, and you quickly get a very ugly if statement.

I tend to use try/catch/finally to simplify this when I don’t care about any “else” scenarios and just want the object or not:
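A sketch of that shortcut (names are hypothetical): any null or undefined level throws a TypeError, which short-circuits the whole lookup.

```javascript
// Let the TypeError from a missing level land us in the catch block.
function getStreet(config) {
  try {
    return config.person.address.street;
  } catch (e) {
    return undefined; // any missing level ends up here
  }
}
```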

Hash Table (Linear Probing)

ref –

Linear Probing is an Open Addressing scheme for hash tables.

The “open” in “open addressing” tells us the index (aka. address) at which an object will be stored in the hash table is not completely determined by its hash code. Instead, the index may vary depending on what’s already in the hash table.

The “closed” in “closed hashing” refers to the fact that we never leave the hash table; every object is stored directly at an index in the hash table’s internal array. Note that this is only possible by using some sort of open addressing strategy. This explains why “closed hashing” and “open addressing” are synonyms.

Best case – O(1).
For insertion, it hashes to an empty element. We insert. done.
For search, it hashes to its initial location. We do a check. Keys match. done.
Deletion – Searches for the element with the key. We find it. We remove it. The next element is empty so we’re done.

Worst case – all are O(n), because the worst possible case scenario is that we have to iterate the whole array after the initial collision.


Insertion for linear probing involves 2 steps:

1) hash to a location. If the location is empty, done.
2) If the location is occupied, we probe for an empty space.

Probing involves going step by step down the cells, trying to find an empty cell. Once found, we place the object there.
Note that we do not wrap around to the beginning of the array. If no empty cell is found, we return -1. An alternative is to grow the array and try inserting again.
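The two steps can be sketched like so; the hash function and the `{ key, value }` cell layout are assumptions I'm making for the example:

```javascript
// A simple string hash reduced to the table size.
function hash(key, size) {
  let h = 0;
  for (const ch of key) h = (h * 31 + ch.charCodeAt(0)) % size;
  return h;
}

function insert(table, key, value) {
  let i = hash(key, table.length);   // 1) hash to a location
  while (i < table.length) {         // 2) probe forward; no wrap-around
    if (table[i] === undefined) {
      table[i] = { key, value };
      return i;                      // index where the object landed
    }
    i++;
  }
  return -1;                         // no empty cell found
}
```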

Retrieving the object

Searching for a cell with a certain key involves starting at the hashed index:

1) If the keys match, then return the cell. If no match, keep going forward (probe) and looking for a match on our key.
2) If an empty cell is found, the key is not in the table. If the key did exist, it would have been placed in that empty cell, and not any further along in the later cells. This is because during insertion, when a value hashes to an occupied index, it naturally probes until it finds an empty spot for its object. Thus, when searching, hitting an empty spot means it’s the end of the line for any item hashed to an earlier location.
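A sketch of that search; `start` is the key's hashed index, and the `{ key, value }` cell layout is the same assumption as in the insertion sketch:

```javascript
function search(table, key, start) {
  for (let i = start; i < table.length; i++) {
    if (table[i] === undefined) return -1; // empty cell: key is not in the table
    if (table[i].key === key) return i;    // keys match
  }
  return -1;                               // ran off the end of the array
}
```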

Deleting an Object

Deleting an item starts the same way as searching. When we have successfully found the cell to delete, we:

1) remove the cell so the array element references null
2) Here’s the important part. Once the element is removed in 1), we need to find a replacement down the line. The reason why we need to find a replacement is shown below.

Say we want to insert key “job”. It gets hashed to index 6, then we move it to the next empty space, which is index 9.

The next operation is that we want to remove key “last name”. So last name gets hashed to index 6. We go down the array until we find “last name” at index 8. We remove it.

Now, say we want to search for key “job”. It would get hashed to index 6. Then it would keep going and then it would hit index 8 on an empty cell. It would stop and say key “job” does not exist. However, clearly, job exists down the line at index 9.

Solution on what to do for the empty space after deletion

When deleting, we remove the cell, then:

1) go down the list to find a replacement for this empty cell
2) The replacement must have a hashed index equal to or smaller than that of the recently deleted item.

This is because all items are inserted at or after their initial hashed index. Thus, we can fill the empty cell with an element whose hashed index is equal or smaller, because the search for that element arrives from the left.

But why? take a look at an example:

Say we want to delete key “last”, which was initially hashed to index 6 and probed to index 7. From the previous example, we know we can’t just leave it like this.

We must go down the line and find a replacement. The notion is that we must find a replacement whose hash is <= our own. But what if we don’t do this? What will happen? Say we simply take the next available replacement, which in our case is key “name” (initially hashed to index 8), and bring it over to index 7. Now make a search for key “name”: its initial hash index would be 8, and it wouldn’t find anything there. Even if data does get placed at index 8 later, it wouldn’t be the key “name” we’re looking for. So let’s see what happens if we instead iterate the array for a cell with a hash <= our own — picking an element with hash <= the removed element’s hash.

So after removing key “last”, we can’t just grab any element to replace it. We must find an element with a hash <= our hash. Since we’ve removed “last” with hash 6, we iterate the array to find such an element. We go down the list: “name” has hash 8, nope. “job” has hash 8, nope. “age” has hash 6 — perfect! So we move the element with key “age” and hash 6 to index 7. Don’t forget that we must do this continuously until we hit an empty slot or the end of the array. Now, when you try to search for “age”, it hashes you to index 6, where you collide with key “first”. Then you probe. At index 7, we find the element with key “age”. Done!
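The delete-plus-backfill steps can be sketched as follows. To keep the example short I assume each cell stores its `home` (hashed) index alongside the key; the check `entry.home <= hole` is the general form of the “hash <= removed element’s hash” rule described above.

```javascript
function removeKey(table, index) {
  table[index] = undefined;            // 1) empty the slot
  let hole = index;
  for (let j = index + 1; j < table.length; j++) {
    const entry = table[j];
    if (entry === undefined) break;    // an empty cell ends the cluster
    if (entry.home <= hole) {          // 2) eligible replacement found
      table[hole] = entry;             // move it back into the hole
      table[j] = undefined;
      hole = j;                        // keep filling the new hole
    }
  }
}
```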


Functions that satisfy Use Case scenarios should be public.
Any other functions that are used by the public functions are to be private.

Thus, insertion, retrieving index, deleting, and printing should all be public.

Thus, the public functions are:


All the other functions are private:


One issue to note is that a private function’s this refers to the global object (or undefined if using strict mode). You need to connect its this context to our object. Some standard ways are to use call/apply. But you can also define a private variable _self that references the this context.

Once declared, you can freely use it in any private function.
In our loadFactor example, we use scope to access the private _self, which references the correct this context.
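A sketch of the pattern (the HashTable constructor, loadFactor, and isCrowded names are hypothetical):

```javascript
function HashTable(size) {
  const _self = this;            // capture the instance for private helpers
  this.table = new Array(size);

  // Private: called as a bare function, its own `this` would be the
  // global object (or undefined in strict mode), so it uses _self.
  function loadFactor() {
    let used = 0;
    for (const cell of _self.table) {
      if (cell !== undefined) used++;
    }
    return used / _self.table.length;
  }

  // Public: exposed on the instance, delegates to the private helper.
  this.isCrowded = function () { return loadFactor() > 0.75; };
}
```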

shellsort (js)

Shellsort is essentially insertion sort, except that it uses an array of gap values to decide how many steps to jump.

In insertion sort, we always use 1 step: when we backtrack, we go back 1 step. But in shellsort, we have an array that decides how many steps to jump when backtracking. Each jump size is called a “gap”. The final pass always uses a gap of 1.

In our example, we use gaps of 5, 3, 1. This means we will sort via insertion in passes where we jump 5 steps, then 3 steps, then 1 step. So for the gap-5 pass, we start at index 5 of the array and do the insertion sort, but instead of using 1 step when going backwards, we use 5 steps.

first pass, 5 steps

We start off at index 5, process the insertion sort, and when we’re done, we move down the array one index at a time.

The insertion sort itself will use 5 steps when shifting backwards.

Everything else is processed the same way as Insertion Sort. We get the current valueToInsert. It needs to satisfy a few points:

1) is the index greater than or equal to the gap?
2) is the previous gapped element larger than the valueToInsert?

If so, we assign the previous gapped element to the current one.
If not, we just put the valueToInsert back to the current element.

Given the Insertion Code with the gap element like so:

The gap we use is 5. So we need to do an Insertion pass where the gap is 5.

We start at index 5. The value at that index is 11. 11 is the value we need to check and find a proper place to insert.

The Insertion sort happens when we have j point to i.

1) j>= gap, 5 >= 5 ? yes
2) previous gapped element > valueToInsert, 6 > 11 ? NO

dataStore[5] = 11, we move down the array.


The next index is 6. dataStore[6] is 8. The previous gapped element, dataStore[6-5], is 0.

1) j >= gap, 6 >= 5? yes
2) previous gapped element > valueToInsert, 0 > 8, NO

dataStore[6] = 8, we move down the array


The next index is 7. dataStore[7] is 0. Its previous gapped element dataStore[7-5] is 2.

1) j >= gap, 7 >= 5? yes
2) previous gapped element > valueToInsert, 2 > 0 yes

When both requirements are true, we exchange the element values, basically storing the previous value into the current slot. This is because the previous value is larger than the current value.

We continue on the insertion sort by looking back at -gap elements. Namely, j-= gap. Hence our j is now at index 2, dataStore[2] = 2.
We check the requirements again:

1) j >= gap, 2 >= 5 NO

So we jump out of the loop and assign the valueToInsert at dataStore[2]. This perfectly switches the numbers at dataStore[2], and dataStore[7].


We move onto the next index at 8. dataStore[8] is 5.
Previous gapped element is dataStore[8-5] = 9

1) j >= gap, 8 >= 5, yes
2) previous gapped element > valueToInsert, 9 > 5 yes

Hence we need to copy the previous element to the current one. We then backtrack gap steps into index 3

1) j >= gap, 3 >= 5, NO
therefore dataStore[3] = 5.

We keep going down the array.


We process the last element at index 9, dataStore[9] is 4.

dataStore[9] is 4
dataStore[9-5] is 3

1) j >= gap, 9 >= 5? yes
2) previous gapped element > valueToInsert, 3 > 4, NO.

So we jump out of the loop
dataStore[9] = 4

next gap pass, 3 steps

We iterate from index gap to the end of the array, doing the insertion sort with gap-sized steps while processing each element one by one.

index 3

Hence, our next pass involves gap 3. We start off at index 3.
We see that dataStore[3] is 9. valueToInsert = 9.
Previous gap element is dataStore[0] is 6.

is index 3 >= 3? yes
6 > 9 (valueToInsert)? NO, move on to next index


index 4

We are now on index 4.
dataStore[4] = 3, valueToInsert = 3
dataStore[4-3] = 0

4 >= 3 yes
0 > 3 (valueToInsert) NO, move on to next index

index 5

We are now on index 5.
dataStore[5] is 11, valueToInsert = 11
dataStore[5-3] is 2

5 >= 3 yes
2 > 11 (valueToInsert) NO, move on to next index


index 6

We are now on index 6.
dataStore[6] is 8, valueToInsert = 8
dataStore[6-3] is 9.

6 >= 3 yes
9 > 8 (valueToInsert) YES, so we copy the value from the previous gapped element to the current one

Thus, dataStore[6] = 9.

We go back 1 gap step so now j is at index 3.
dataStore[3] is 9
dataStore[3-3] is 6

3 >= 3 yes
6 > 8 (valueToInsert)? NO, so we exit the loop and, at current index j, assign dataStore[3] = 8 (valueToInsert)

move forward down the array


index 7

We are now at index 7.
dataStore[7] is 0. valueToInsert is 0.
dataStore[7-3] is 3

7>=3 yes
3 > 0 yes

Let’s copy previous element to current so now
dataStore[7] is 3.

j moves a “gap” step back to index 4

dataStore[4] is 3
dataStore[1] is 0

4 >= 3, YES
0 > 0 NO

dataStore[4] is 0.

Move forward


index 8

dataStore[8] is 5 (valueToInsert)
dataStore[8-3] is 11

8 >= 3 yes
11 > 5 yes

So we need to copy the previous value over to current.

dataStore[8] = 11

Go back 1 “gap” step to index 5

dataStore[5] is 11
dataStore[5-3] is 2

5 >= 3 yes
2 > 5 (valueToInsert) NO

dataStore[5] = 5

move on…


Index 9

Finally, we are index 9.

dataStore[9] is 4 (valueToInsert)
dataStore[9-3] is 8

9 >= 3 yes
8 > 4 yes

so we copy the previous gap element to current
dataStore[9] = 8

we go back one gap step to index 6

dataStore[6] is 8
dataStore[6-3] is 9

6 >= 3, YES
9 > 4 (valueToInsert), YES

copy the previous gap element to current
dataStore[6] = 9

go back “gap” elements

dataStore[3] is 9
dataStore[0] is 6

3 >= 3, YES
6 > 4 (valueToInsert) YES

copy the previous gap element to current
dataStore[3] = 6

go back “gap” elements

index is now 0
0 >= 3 NO

so dataStore[0] = 4 (valueToInsert)

Done for gap 3 pass


Full Source
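The original listing didn’t survive the import, so here’s a sketch of the shellsort walked through above, using the same gaps (5, 3, 1) and the same variable names as the walkthrough:

```javascript
function shellsort(dataStore) {
  const gaps = [5, 3, 1];
  for (const gap of gaps) {
    // One insertion-sort pass where the step size is `gap`.
    for (let i = gap; i < dataStore.length; i++) {
      const valueToInsert = dataStore[i];
      let j = i;
      // Shift larger gapped elements down while both conditions hold:
      // 1) j >= gap, and 2) previous gapped element > valueToInsert.
      while (j >= gap && dataStore[j - gap] > valueToInsert) {
        dataStore[j] = dataStore[j - gap];
        j -= gap;
      }
      dataStore[j] = valueToInsert;
    }
  }
  return dataStore;
}
```

Running it on the walkthrough’s array, [6, 0, 2, 9, 3, 11, 8, 0, 5, 4], produces the fully sorted result after the final gap-1 pass.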

Circular List in JS (prototype pattern, constructor pattern)

Prototype Pattern

– Using prototypes to create classes, all instances of that class will share the same functions.

– Can’t have private utility functions. All properties and functionalities can be seen by caller.

Constructor Pattern with scoped private variables

By using scoped variables, we can have privacy, and can create get/set functionalities for public access.

However, it is not very practical, as I need to create get/set functions for all the private variables.
Then, using them in my functions can be awkward, since assigning references must go through the set functions,
and reading values back means calling the get functions.

Insertion Sort

Insertion sort is a simple sorting algorithm that works the way we sort playing cards in our hands.

Given, 8 5 3 12 2

For the first pass, we always start at index 1. So we see 5 and store it in variable valueToInsert.

There are 2 conditions we must satisfy:

1) 1 > 0, yes. index > 0
2) 8 > 5, yes. previous element is larger than valueToInsert

Both are satisfied, so we copy the previous element (8) to the current location where 5 is.

We move 1 step back and hit index 0.

We evaluate the 2 situations again:

1) 0 > 0, no. We stop.

If we reach index 0, that means we’ve reached the end, and we store 5 there.

Essentially what we’re doing is taking the current number and going backwards to see where to insert it. We do this by comparing the previous element and seeing if it’s bigger than our valueToInsert. If it’s larger, we push it down and keep going. If it’s smaller, we stop and insert our number. If we reach the end at index 0, we insert our number there.


Now we have 5 8 3 12 2.

We work our next pass which is index 2 with value 3.

We store 3 and start searching for the insertion point. We evaluate the 2 points:

1) is index > 0? 2 > 0, yes.
2) We look at the previous element 8. is 8 > 3? yes, so we move 8 to where 3 is.

Move backwards 1 step.

1) is index > 0? 1 > 0, yes.
2) Then we look at previous element 5, is 5 > 3? yes. so we move 5 to where 8 is.

1) is index > 0? 0 > 0, no. We’ve reached the end. Hence we put 3 at index 0.

This leaves us with 3 5 8 12 2.


Then we work on value 12.

1) is index > 0? 3 > 0, yes.
2) Previous element 8 > 12? no. We stop, and move on.

Finally, our last value is 2. Apply the same steps as above and we’ll see that 2 gets inserted all the way at the beginning, because it is smaller than all the other numbers.


full source
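The original listing didn’t survive, so here’s a sketch of the insertion sort walked through above:

```javascript
function insertionSort(arr) {
  for (let i = 1; i < arr.length; i++) {
    const valueToInsert = arr[i];
    let j = i;
    // Shift larger elements one step right until we find the insertion point:
    // 1) index > 0, and 2) previous element > valueToInsert.
    while (j > 0 && arr[j - 1] > valueToInsert) {
      arr[j] = arr[j - 1];
      j--;
    }
    arr[j] = valueToInsert;
  }
  return arr;
}
```

Running it on the walkthrough’s array, [8, 5, 3, 12, 2], yields [2, 3, 5, 8, 12].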

Time Complexity

The best case input is an array that is already sorted. In this case insertion sort has a linear running time (i.e., O(n)). During each iteration, the first remaining element of the input is only compared with the right-most element of the sorted subsection of the array.

The simplest worst case input is an array sorted in reverse order. The set of all worst case inputs consists of all arrays where each element is the smallest or second-smallest of the elements before it. In these cases every iteration of the inner loop will scan and shift the entire sorted subsection of the array before inserting the next element. This gives insertion sort a quadratic running time (i.e., O(n²)).

The average case is also quadratic, which makes insertion sort impractical for sorting large arrays.

However, insertion sort is one of the fastest algorithms for sorting very small arrays, even faster than quicksort; indeed, good quicksort implementations use insertion sort for arrays smaller than a certain threshold, also when arising as subproblems; the exact threshold must be determined experimentally and depends on the machine, but is commonly around ten.

Selection sort

Worst-case performance: O(n²) comparisons, O(n) swaps
Best-case performance: O(n²) comparisons, O(n) swaps

The idea is that we latch onto the first element and deem it being the minimum.
We then go through the rest of the array, and look for anything that is smaller. If so, we swap. This is 1 pass.

At this point, after the 1st pass, we deem first element as minimum. Then we move on
to the next element and call it the minimum. We do another pass and look in the leftover array
to see if there’s anything smaller than our minimum. If so, swap.

We keep doing this, pulling minimum numbers from the array and pushing it towards the front, until
all the elements are sorted from smallest to largest.


So there’s 2 loops: outer and inner.

Outer simply goes through the array.

The inner loop goes through the remaining elements AFTER the outer index and compares each one to the outer element.

If the inner element is smaller than the outer element, we swap. Essentially, we are bringing the smallest element to the front.


Given an array, we make passes on it where:

– we select the starting value to be the minimum.
– we then make (length of array – 1) passes over it. (through all of our passes, the last element will be automatically sorted, so no need to do anything on it)
– For every pass, we need to do the very important action of pulling the minimum from the array and swapping it with the first value.


…continuing on in the same fashion…

Eventually all the smallest numbers will have been swapped to the front, resulting in the list being sorted.

full source
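The original listing didn’t survive, so here’s a sketch of the selection sort described above: length − 1 passes, each pulling the minimum of the remaining elements to the front.

```javascript
function selectionSort(arr) {
  for (let outer = 0; outer < arr.length - 1; outer++) {
    let min = outer;                 // deem the first remaining element the minimum
    for (let inner = outer + 1; inner < arr.length; inner++) {
      if (arr[inner] < arr[min]) min = inner; // found something smaller
    }
    if (min !== outer) {
      [arr[outer], arr[min]] = [arr[min], arr[outer]]; // swap minimum to the front
    }
  }
  return arr;
}
```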

Stack and Heap

The stack is the memory set aside as scratch space for a thread of execution.

When a function is called, a block is reserved on the top of the stack for local variables, parameters, and some bookkeeping data.

When that function returns (finishes execution), the block becomes unused (all the variables get popped) and can be used the next time a function is called.

The stack is always reserved in a LIFO (last in first out) order; the most recently reserved block is always the next block to be freed. This makes it really simple to keep track of the stack; freeing a block from the stack is nothing more than adjusting one pointer.

The heap is memory set aside for dynamic allocation. Unlike the stack, there’s no enforced pattern to the allocation and deallocation of blocks from the heap; you can allocate a block at any time and free it at any time. This makes it much more complex to keep track of which parts of the heap are allocated or free at any given time; there are many custom heap allocators available to tune heap performance for different usage patterns.

Each thread gets a stack, while there’s typically only one heap for the application (although it isn’t uncommon to have multiple heaps for different types of allocation).

The stack is attached to a thread, so when the thread exits the stack is reclaimed. The heap is typically allocated at application startup by the runtime, and is reclaimed when the application (technically process) exits.

The size of the stack is set when a thread is created. The size of the heap is set on application startup, but can grow as space is needed (the allocator requests more memory from the operating system).

The stack is faster because the access pattern makes it trivial to allocate and deallocate memory from it (a pointer/integer is simply incremented or decremented), while the heap has much more complex bookkeeping involved in an allocation or deallocation. Also, each byte in the stack tends to be reused very frequently which means it tends to be mapped to the processor’s cache, making it very fast.

Another performance hit for the heap is that the heap, being mostly a global resource, typically has to be multi-threading safe, i.e. each allocation and deallocation needs to be – typically – synchronized with “all” other heap accesses in the program.

Hash table – Separate Chaining, Open Hashing

ref –

Open hashing – in this strategy, none of the objects are actually stored in the hash table’s array; instead once an object is hashed, it is stored in a list which is separate from the hash table’s internal array. “open” refers to the freedom we get by leaving the hash table, and using a separate list. By the way, “separate list” hints at why open hashing is also known as “separate chaining”.

The most frequently used general purpose implementation of an associative array is with a hash table: an array combined with a hash function that maps each key to a separate “bucket” of the array. The basic idea behind a hash table is that accessing an element of an array via its index is a simple, constant-time operation. Therefore, the average overhead of an operation for a hash table is only the computation of the key’s hash, combined with accessing the corresponding bucket within the array. As such, hash tables usually perform in O(1) time, and outperform alternatives in most situations.

Hash tables need to be able to handle collisions: when the hash function maps two different keys to the same bucket of the array. The two most widespread approaches to this problem are separate chaining and open addressing.

Average | Worst Case

Space O(n) | O(n)
Search O(1) | O(n)
Insert O(1) | O(n)
Delete O(1) | O(n)

The idea of hashing is to distribute the entries (key/value pairs) across an array of buckets. Given a key, the algorithm computes an index that suggests where the entry can be found:

index = f(key, array_size)

hash = hashfunc(key) // where hash is some number
index = hash % array_size // in order to fit the hash into the array size, it is reduced to an index using modulo operator

In this method, the hash is independent of the array size, and it is then reduced to an index (a number between 0 and array_size − 1) using the modulo operator (%) .

Choosing a hash function

Table size and range finding

The hash functions introduced in The Art of Hashing were designed to return a value in the full unsigned range of an integer. For a 32-bit integer, this means that the hash functions will return a value in the range [0..4,294,967,296). Because it is extremely likely that your table will be smaller than this, it is possible that the hash value may exceed the boundaries of the array.

The solution to this problem is to force the range down so that it fits the table size.

For example, if the table size is 888, and we get 8,403,958, how do we fit this value within the table?

A table size should not be chosen randomly because most of the collision resolution methods require that certain conditions be met for the table size or they will not work correctly. Most of the time, this required size is either a power of two, or a prime number.

Why a power of two? Because a table size that is a power of two lets us use bitwise operations, which offer performance benefits on some implementations. Forcing a value into the range of a power of two can be performed quickly with a masking operation.

For example, to force the range of any value into eight bits, you simply use the bitwise AND operation on a mask of 0xff (hexadecimal for 255):

0x8a AND 0xff = 0x8a

from hex to digit:

Note that 8 bits = 1 byte, and 8 bits give 2^8 = 256 possible representations (memory addresses).

0x8a -> 1000(8) 1010(a) -> 1000 1010 binary

binary to decimal is 1 × 128 + 0 × 64 + 0 × 32 + 0 × 16 + 1 × 8 + 0 × 4 + 1 × 2 + 0 × 1 = 128 + 8 + 2 = 138.

from digit to hex:

Thus, if we get a value of 138, we force it into the index range of an 8-bit (size 256) array by ANDing it with 0xff.

In code, you get the parameter 138. Converted to binary it is 1000 1010, which in hex is 0x8a.
Then you apply the AND bit op, which gives you 0x8a AND 0xff = 0x8a.

So 138 fits within 256. But what if you get some larger number like 888?
888 in hex is 0x378.

0x378 AND 0x0ff = 0x78 (120 decimal)

We prepend a 0 to 0xff because we’re dealing with 3 hex places. We apply the AND and get 0x78. Thus, if you get hash value 888, it would give you index 120.

table[hash(key) & 0xff]
This is a fast operation, but it only works with powers of two. If the table size is not a power of two, the remainder of division can be used to force the value into a desired range with the remainder operator. Note that this is slightly different than masking because while the mask was the upper value that you will allow, the divisor must be one larger than the upper value to include it in the range. This operation is also slower in theory than masking (in practice, most compilers will optimize both into the same machine code):

table[hash(key) % 256]
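A quick sketch comparing the two range-reduction techniques on the hash value 888 from the example above; for a table size of 256 (a power of two), both produce the same index.

```javascript
const hashValue = 888;

// Masking: works only when the table size is a power of two.
const maskedIndex = hashValue & 0xff;   // keep the low 8 bits

// Modulo: works for any table size; the divisor (256) is one larger
// than the largest index it can produce (255).
const modIndex = hashValue % 256;
```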
When it comes to hash tables, the most recommended table size is any prime number.

This recommendation is made because hashing in general is misunderstood, and poor hash functions require an extra mixing step of division by a prime to resemble a uniform distribution.

Another reason that a prime table size is recommended is because several of the collision resolution methods require it to work. In reality, this is a generalization and is actually false (a power of two with odd step sizes will typically work just as well for most collision resolution strategies), but not many people consider the alternatives and in the world of hash tables, prime rules.

Advantages in using Hash Table

The main advantage of hash tables over other table data structures is speed. This advantage is more apparent when the number of entries is large. Hash tables are particularly efficient when the maximum number of entries can be predicted in advance, so that the bucket array can be allocated once with the optimum size and never resized.

If the set of key-value pairs is fixed and known ahead of time (so insertions and deletions are not allowed), one may reduce the average lookup cost by:
1) a careful choice of the hash function,
2) bucket table size,
3) internal data structures.

In particular, one may be able to devise a hash function that is collision-free, or even perfect. In this case the keys need not be stored in the table.

Open Addressing

Another popular technique is open addressing:

  1. at each index of our list we store one and only one key-value pair
  2. when trying to store a pair at index x, if there’s already a key-value pair, try to store our new pair at x + 1
    if x + 1 is taken, try x + 2 and so on…
  3. When retrieving an element, hash the key and see if the element at that position (x) matches our key. If not, try to access the element at position x + 1. Rinse and repeat until you get to the end of the list, or when you find an empty index — that means our element is not in the hash table

Separate chaining with linked lists

Chained hash tables with linked lists are popular because they require only basic data structures with simple algorithms, and can use simple hash functions that are unsuitable for other methods.

The cost of a table operation is that of scanning the entries of the selected bucket for the desired key. If the distribution of keys is sufficiently uniform, the average cost of a lookup depends only on the average number of keys per bucket—that is, it is roughly proportional to the load factor.

For this reason, chained hash tables remain effective even when the number of table entries n is much higher than the number of slots. For example, a chained hash table with 1000 slots and 10,000 stored keys (load factor 10) is five to ten times slower than a 10,000-slot table (load factor 1); but still 1000 times faster than a plain sequential list.

For separate-chaining, the worst-case scenario is when all entries are inserted into the same bucket, in which case the hash table is ineffective and the cost is that of searching the bucket data structure. If the latter is a linear list, the lookup procedure may have to scan all its entries, so the worst-case cost is proportional to the number n of entries in the table.

The bucket chains are often searched sequentially using the order the entries were added to the bucket. If the load factor is large and some keys are more likely to come up than others, then rearranging the chain with a move-to-front heuristic may be effective. More sophisticated data structures, such as balanced search trees, are worth considering only if the load factor is large (about 10 or more), or if the hash distribution is likely to be very non-uniform, or if one must guarantee good performance even in a worst-case scenario. However, using a larger table and/or a better hash function may be even more effective in those cases.

Backstory on Unicode and UTF8

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

All you need to know about memory is that it’s one large array. But one large array containing what? The array contains bytes. In computer organization, people don’t use the term “index” to refer to the array locations. Instead, we use the term “address”.

“address” and “index” mean the same, so if you’re getting confused, just think of “address” as “index”.

Each address stores one element of the memory “array”. Each element is typically one byte.


We’ve defined a word to mean 32 bits. This is the same as 4 bytes. Integers, single-precision floating point numbers, and MIPS instructions are all 32 bits long. How can we store these values into memory? After all, each memory address can store a single byte, not 4 bytes.

The answer is simple. We split the 32 bit quantity into 4 bytes. For example, suppose we have a 32 bit quantity, written as 0x90AB12CD, which is hexadecimal. Since each hex digit is 4 bits, we need 8 hex digits to represent the 32 bit value.

So, the 4 bytes are: 90, AB, 12, CD where each byte requires 2 hex digits.

It turns out there are two ways to store this in memory.

Big Endian

In big endian, you store the most significant byte in the smallest address. Here’s how it would look:
Address Value

1000 90
1001 AB
1002 12
1003 CD

Little Endian

In little endian, you store the least significant byte in the smallest address. Here’s how it would look:

Address Value
1000 CD
1001 12
1002 AB
1003 90
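The two layouts above can be reproduced in JavaScript with a DataView, which takes an explicit byte-order flag; the helper function here is my own illustration.

```javascript
// Write a 32-bit value with the given byte order and return its bytes
// as hex strings, lowest address first.
function bytesOf(value, littleEndian) {
  const buf = new ArrayBuffer(4);
  new DataView(buf).setUint32(0, value, littleEndian);
  return Array.from(new Uint8Array(buf))
    .map(b => b.toString(16).toUpperCase().padStart(2, "0"));
}

const big = bytesOf(0x90AB12CD, false);   // most significant byte first
const little = bytesOf(0x90AB12CD, true); // least significant byte first
```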

some backstory…
The earliest idea for Unicode encoding, which led to the myth about the two bytes, was, hey, let’s just store those numbers in two bytes each. So Hello becomes
0048 0065 006C 006C 006F

However, some computers are big-endian and some are little-endian, so there are two ways to store Unicode: big-endian mode and little-endian mode. How do we know which mode a string is stored in? A BOM (Unicode Byte Order Mark) signifies this. If your computer is little-endian and the incoming Unicode word’s BOM signifies big-endian, then you have to swap the bytes so that it’s little-endian, and vice versa. Also, not every Unicode string in the wild has a byte order mark at the beginning.

So people were forced to come up with the bizarre convention of storing a FE FF (2 bytes) at the beginning of every Unicode string.

For a while it seemed like that might be good enough, but programmers were complaining. “Look at all those zeros!” they said, since they were Americans and they were looking at English text which rarely used code points above U+00FF.

Because this scheme used two bytes per character, it doubled the amount of storage strings took. Then there were the existing ANSI and DBCS character sets, and who was going to convert them all?
For these reasons alone, most people decided to ignore Unicode for several years, and in the meantime things got worse.

Thus was invented the brilliant concept of UTF-8. UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, or 4 bytes (the original design allowed up to 6, but no valid code point needs more than 4).

This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don’t even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet.

Thus, 3 “popular” ways of encoding Unicode:

UTF-8
UTF-16 (16 bits, big-endian)
UTF-16 (16 bits, little-endian)
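These encodings are easy to inspect with Python’s built-in codecs; a minimal sketch encoding the same “Hello” string each way:

```python
import codecs

s = 'Hello'

# UTF-8: each of these code points fits in one byte, identical to ASCII.
print(s.encode('utf-8').hex())      # 48656c6c6f

# UTF-16 big-endian: most significant byte of each 16-bit unit first.
print(s.encode('utf-16-be').hex())  # 00480065006c006c006f

# UTF-16 little-endian: least significant byte of each unit first.
print(s.encode('utf-16-le').hex())  # 480065006c006c006f00

# The BOM constants: FE FF marks big-endian, FF FE marks little-endian.
print(codecs.BOM_UTF16_BE.hex())    # feff
print(codecs.BOM_UTF16_LE.hex())    # fffe
```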


The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Unicode is a character set

Unicode assigns every character in existence a unique number called a code point.

In other words, a letter maps to something called a code point

As Joel would explain it, every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639

This magic number is called a code point.

The U+ means “Unicode”
The numbers are hexadecimal.

The English letter A would be U+0041.

Why is it U+0041?

0041 in hex is 0 * (16^3) + 0 * (16^2) + 4 * (16^1) + 1 * (16^0) = 0 + 0 + 64 + 1 = 65.
65 is represented as A.
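Python exposes this mapping directly through ord() and chr(); a quick check:

```python
# ord() returns a character's code point; chr() maps a code point back.
print(ord('A'))       # 65
print(hex(ord('A')))  # 0x41
print(chr(0x41))      # A

# '\u0041' is the string escape for code point U+0041.
print('\u0041')       # A
```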


Unicode was a brave effort to create a single character set that included every reasonable writing system on the planet and some make-believe ones like Klingon, too.

Unicode comprises 1,114,112 code points in the range 0x0 to 0x10FFFF. The Unicode code space is divided into seventeen planes (the basic multilingual plane, and 16 supplementary planes), each with 65,536 (= 2^16) code points. Thus the total size of the Unicode code space is 17 × 65,536 = 1,114,112.

10FFFF (hex)

1 0 15 15 15 15 (convert each hex digit to its value: 0-9, A(10) - F(15))

0001 0000 1111 1111 1111 1111 (convert each hex digit to 4 binary bits)

1114111 (convert the binary to decimal)

Since code point 0 counts as well, the total is 1,114,112 representations of characters.
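The arithmetic above can be checked directly; a quick sketch in Python:

```python
# 0x10FFFF is the highest code point; code points start at 0,
# so the size of the code space is one more than the highest value.
highest = 0x10FFFF
print(highest)        # 1114111
print(highest + 1)    # 1114112

# Equivalently: seventeen planes of 2**16 code points each.
print(17 * 2**16)     # 1114112
```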

In fact, Unicode has a different way of thinking about characters, and you have to understand the Unicode way of thinking of things or nothing will make sense.

Until now, we’ve assumed that a letter maps to some bits which you can store on disk or in memory:

A -> 0100 0001

In Unicode, a letter maps to something called a code point which is still just a theoretical concept. How that code point is represented in memory or on disk is a whole nuther story.

In Unicode, the letter A is a platonic ideal. It’s just floating in heaven:


This platonic A is different than B, and different from a, but the same as A and A and A. The idea that A in a Times New Roman font is the same character as the A in a Helvetica font, but different from “a” in lower case, does not seem very controversial, but in some languages just figuring out what a letter is can cause controversy. Is the German letter ß a real letter or just a fancy way of writing ss? If a letter’s shape changes at the end of the word, is that a different letter? Hebrew says yes, Arabic says no. Anyway, the smart people at the Unicode consortium have been figuring this out for the last decade or so, accompanied by a great deal of highly political debate, and you don’t have to worry about it. They’ve figured it all out already.

Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639. This magic number is called a code point. The U+ means “Unicode” and the numbers are hexadecimal. U+0639 is the Arabic letter Ain. The English letter A would be U+0041. You can find them all using the charmap utility on Windows 2000/XP or visiting the Unicode web site.

There is no real limit on the number of letters that Unicode can define and in fact they have gone beyond 65,536 so not every unicode letter can really be squeezed into two bytes, but that was a myth anyway.

OK, so say we have the string:

Hello
which, in Unicode, corresponds to these five code points:

U+0048 U+0065 U+006C U+006C U+006F.

Just a bunch of code points. Numbers, really. We haven’t yet said anything about how to store this in memory or represent it in an email message.
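A quick way to see those code points, sketched in Python:

```python
s = 'Hello'

# Format each character's code point in the U+XXXX style used above.
print([f'U+{ord(c):04X}' for c in s])
# ['U+0048', 'U+0065', 'U+006C', 'U+006C', 'U+006F']
```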


That’s where encodings come in.

Number Code to Binary

A is given code point U+0041:

0041 in hex is 0 * (16^3) + 0 * (16^2) + 4 * (16^1) + 1 * (16^0) = 0 + 0 + 64 + 1 = 65.
65 is represented as A.

A code point is any of the numerical values that make up the code space.
Code point U+0041 has the decimal value 65 and maps to the Unicode letter A. Hence, how do we represent decimal 65 in binary?

Bi-nary means two.

Binary is represented by bits.
1 Bit binary represents 0 or 1. 2 ^ (1 bit) = 2 combinations
2 bits binary represents 00, 01, 10, 11, 4 combinations. 2 ^ (2 bits) = 4 combinations

Code points represent text in computers, telecommunications equipment, and other devices.
The code point maps the character “A” to the number 65

How do we convert this to binary?

Division Way

(the proper way is to divide by 2, and use the remainder as bit)

65 / 2 = 32 R1 binary bits: [1]
32 / 2 = 16 R0 binary bits: [1 0 ]
16 / 2 = 8 R0 binary bits: [1 0 0]
8 / 2 = 4 R0 binary bits: [1 0 0 0]
4 / 2 = 2 R0 binary bits: [1 0 0 0 0]
2 / 2 = 1 R0 binary bits: [1 0 0 0 0 0]
1 / 2 = 0 R1 binary bits: [1 0 0 0 0 0 1]

So as you can see the bits are 1 0 0 0 0 0 1 (the remainders read from the last division back to the first; here the pattern happens to read the same in both directions)
So in 8 bit format we have 0 1 0 0 0 0 0 1

That is the binary conversion from decimal point 65.
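The division method can be sketched as a small Python function (to_binary is a name made up here):

```python
def to_binary(n):
    """Repeatedly divide by 2; the remainders, read from the last
    division back to the first, are the bits of n."""
    if n == 0:
        return '0'
    bits = []
    while n > 0:
        bits.append(str(n % 2))   # remainder is the next least significant bit
        n //= 2
    return ''.join(reversed(bits))

print(to_binary(65))  # 1000001
```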

Layout visual way
First, we must lay out the binary and show what decimal it represents

The 0th bit is represented by 2 ^ 0 = 1
The 1st bit is represented by 2 ^ 1 = 2
The 2nd bit is represented by 2 ^ 2 = 4
The 3rd bit is represented by 2 ^ 3 = 8
The 4th bit is represented by 2 ^ 4 = 16
The 5th bit is represented by 2 ^ 5 = 32
The 6th bit is represented by 2 ^ 6 = 64
…so on.

8 bits
0 0 0 0 0 0 0 0

We lay them out and try to see which of these decimals add up to 65. The trick is to find the largest power of two that is less than or equal to 65. In our case, it’s the 6th bit (64). Thus, at the 6th bit we mark a 1.

0 1 0 0 0 0 0 0

65 – 64 = 1.

Then we try to find a bit whose value is less than or equal to 1.

Obviously, the 0th bit (1) fits perfectly.

0 1 0 0 0 0 0 1

64 + 1 = 65 Thus it matches the 65 ASCII code.


We see that the 6th bit, 2^6, is 64. Then all we need is 1, which is 2^0.
Hence, in the binary, we mark 1 for the 6th bit and the 0th bit to represent the decimal values we need:

0 * (2^7) + 1 * (2^6) + 0 * (2^5) + 0 * (2^4) + 0 * (2^3) + 0 * (2^2) + 0 * (2^1) + 1 * (2^0) = 65

Make sure you use only 0 or 1 because we are dealing with binary.


Finally, at the binary level, simply write the binary needed to represent the numbers:
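The layout method above (find the largest power of two that fits, mark that bit, subtract, repeat) can be sketched as a small Python function (greedy_bits is a name made up here):

```python
def greedy_bits(n, width=8):
    """Greedy method: take the largest power of two <= n,
    mark that bit, subtract it, and repeat until n is 0."""
    bits = ['0'] * width
    for i in range(width - 1, -1, -1):   # from the (width-1)th bit down to the 0th
        if 2 ** i <= n:
            bits[width - 1 - i] = '1'    # mark the i-th bit (leftmost is highest)
            n -= 2 ** i
    return ''.join(bits)

print(greedy_bits(65))  # 01000001
```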



2) Binary to Hex

First, a hex digit has 16 code points: 0-15

where 0 - 9 take on the numbers 0 - 9

and 10 - 15 take on A, B, C, D, E, F

Thus a hex digit has 16 combinations: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F

This is represented by 4 binary bits because 2^4 = 16

Hence we need to break our binary bits into hex, or literally into 4 bit binary

Thus 01000001 becomes 0100 0001, where each 4 bit group represents 1 hex digit
Let’s convert it to hex

0100 = 0 * (2^3) + 1 * (2^2) + 0 * (2^1) + 0 * (2^0) = 4 hex
0001 = 0 * (2^3) + 0 * (2^2) + 0 * (2^1) + 1 * (2^0) = 1 hex

(if you get 1111, that’s 15 -> F)

Thus, ASCII character ‘A’ is code point 65, binary 01000001, or hex 0x41.

And if we are to check in the unicode table, We see that indeed A is mapped to decimal 65, with a hex of 0x41 for UTF8

and a hex of 0x0041 for UTF 16.
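The nibble-by-nibble conversion can be sketched in Python:

```python
b = '01000001'   # binary for decimal 65

# Split into 4-bit groups (nibbles); each nibble is one hex digit.
nibbles = [b[i:i + 4] for i in range(0, len(b), 4)]
digits = [hex(int(nib, 2))[2:] for nib in nibbles]

print(nibbles)          # ['0100', '0001']
print(''.join(digits))  # 41
print(hex(65))          # 0x41
```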



A character in UTF-8 can be from 1 to 4 bytes long. UTF-8 can represent any character in the Unicode standard. UTF-8 is backwards compatible with ASCII. UTF-8 is the preferred encoding for e-mail and web pages.

The first 128 characters of Unicode (which correspond one-to-one with ASCII) are encoded using a single octet with the same binary value as ASCII, making valid ASCII text valid UTF-8-encoded Unicode as well.

Unicode is a character set. UTF-8 is an encoding.

UTF-8 is defined to encode code points (any of the numerical values that make up the code space that contains a symbol) in one to four bytes.

For example, the first 128 code points are given to ASCII symbols

0 -> 0000
0 -> 0000

7 -> 0111
F -> 1111

0000 0000 (0, i.e. hex 00)
0111 1111 (127, i.e. hex 7F)

The next 1,920 characters need two bytes, covering the remainder of almost all
Latin alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana, and N’Ko, as well as Combining Diacritical Marks.

0111 1111 (127)
1920 code points for Arabic, Hebrew, most European scripts (most notably excluding Georgian)
[110]11111 [10]111111 (2047, the largest two-byte value)

(2^10) + (2^9) + (2^8) + (2^7) + (2^6) + (2^5) + (2^4) + (2^3) + (2^2) + (2^1) + (2^0) = 2047

where [110] means 2 bytes total
[10] means to continue

we can continue storing other languages from 2048 and on by using the starting bits to represent how many bytes total,
and for each next byte, [10] to mean it will continue:

i.e. if a character uses 3 bytes, the first byte starts with 1110 to indicate 3 bytes total, and each subsequent byte starts with 10 to mark it as a continuation byte:
[1110]xxxx [10]xxxxxx [10]xxxxxx

[110]11111 [10]111111 (up to 2047)
for the rest of the BMP:
[1110]xxxx [10]xxxxxx [10]xxxxxx (up to 65535, i.e. 2^16 - 1)


1 byte: Standard ASCII
2 bytes: Arabic, Hebrew, most European scripts (most notably excluding Georgian)
3 bytes: rest of the BMP
4 bytes: All Unicode characters
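The byte counts described above can be observed directly; a sketch in Python (the sample characters are illustrative picks from each range):

```python
# One example from each UTF-8 length class:
print(len('A'.encode('utf-8')))    # 1 byte:  U+0041, ASCII
print(len('é'.encode('utf-8')))    # 2 bytes: U+00E9
print(len('€'.encode('utf-8')))    # 3 bytes: U+20AC, inside the BMP
print(len('😀'.encode('utf-8')))   # 4 bytes: U+1F600, outside the BMP

# The leading bits of the first byte give the sequence length,
# and every continuation byte starts with 10.
euro = '€'.encode('utf-8')
print(format(euro[0], '08b'))  # 11100010 (1110 means 3 bytes total)
print(format(euro[1], '08b'))  # 10000010 (10 means continuation)
```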

UTF 16