
Data in the Cloud

What the reader will learn:
      That there are many types of data storage available, meeting different needs
      That the web and cloud provide special challenges to database users but also provide opportunities to design data-centric solutions that are not just relational
      Some hands-on experience of using NoSQL databases and understanding of the business needs addressed by each solution
      An appreciation of the other data-centric uses the cloud can help with, such as backup and recovery and disaster recovery

6.1                Historic Review of Database Storage Methods

These days we tend to think only of computer-based databases, but databases have been with us for centuries. The ancient Egyptians stored data on stone tablets in columns and rows. More recently, libraries used a card index system to allow users to retrieve an ISBN number from an indexed drawer full of cards. The key attributes of a database could be seen as:
      They contain data (though not necessarily information).
      They allow the data to be read and can allow data to be updated, deleted and inserted.
      There are often mechanisms to (rapidly) search for data items.
In the modern era, the first universally accepted standard model was that of the hierarchical database. These stored data in a tree-like hierarchy. When querying the data, you navigated down the hierarchy to get to the record you needed (see Fig. 6.1).
Users of current relational databases may recognise the idea as very similar to that of a B-tree index, which is usually the default form of index storage in current Relational Database Management Systems (RDBMS). This is not accidental. The speed with which you get to an item of data is potentially much faster than with other search algorithms and is always a known, manageable number of data reads.
R. Hill et al., Guide to Cloud Computing: Principles and Practice, Computer Communications and Networks, DOI 10.1007/978-1-4471-4603-2_6, © Springer-Verlag London 2013
Fig. 6.1 The hierarchical data storage model

6.2              Relational Is the New Hoover

Rapid though certain queries can be in hierarchical databases, they do have some significant downsides. Especially from the perspective of today, when we have extremely flexible RDBMS available, the rigidity of the hierarchy means that accessing data in any way other than through the tree structure is very resource costly and slow.
Unlike today, at the time when these databases were in common use, hard disk space was very expensive. The only way to provide some alternative routes through the data was effectively to duplicate the data within different trees. This was very costly.
Ad hoc queries of the type we take for granted now were not easy either. In the example in Fig. 6.1, we could probably already guess that we would have a product, buyer and salesman table in a relational schema for the same data. Once stored, we can select our required data at will, by forming joins between tables.
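An ad hoc join of this kind can be sketched with SQLite. The table and column names (salesman, buyer, product) are assumptions based on Fig. 6.1, not a schema taken from the text:

```python
import sqlite3

# In-memory database with a schema assumed from Fig. 6.1.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE salesman (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE buyer    (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product  (id INTEGER PRIMARY KEY, name TEXT,
                           salesman_id INTEGER REFERENCES salesman(id),
                           buyer_id    INTEGER REFERENCES buyer(id));
    INSERT INTO salesman VALUES (1, 'Jones');
    INSERT INTO buyer    VALUES (1, 'Smith');
    INSERT INTO product  VALUES (1, 'Widget', 1, 1);
""")

# An ad hoc question -- which salesman sold which product to which buyer?
# The join is declared at query time; no fixed navigation path is needed.
rows = conn.execute("""
    SELECT s.name, p.name, b.name
    FROM product p
    JOIN salesman s ON s.id = p.salesman_id
    JOIN buyer    b ON b.id = p.buyer_id
""").fetchall()
print(rows)  # [('Jones', 'Widget', 'Smith')]
```

In a hierarchical database, answering the same question from the "wrong end" of the tree would mean walking every branch.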
 At a time when hard disk drives were expensive and managers wanting to understand the data in the system would often have to wait days, or even weeks, for the query to be coded, punched and run, it is not surprising that Codd’s seminal works on the relational model in the 1970s caused such an upheaval in the database world.
SQL, the standardised language which comes with relational databases, is, provided the database is designed properly, extremely flexible in what it allows a user to query. The query language itself, whilst it can challenge a newcomer, is a far cry from the difficult alternative of writing your own data access mechanisms in a programming language. Its near-English nature means that many managers became free of the constraints of having to use programmers to get the data they needed to understand their businesses better.
Note: It is the change in the way which we store the data that changes the way we can access it. No one storage model is perfect. Hierarchical still works well if you only ever query data using a known tree structure. Other models have coexisted with relational over the last couple of decades. Object and object relational are perhaps still the most common, for example. It is the change in storage brought about by web and cloud technologies that have now generated a number of alternative 'NoSQL' databases that we explore later in this chapter.

6.3             Database as a Service

In previous chapters, we have described a variety of cloud service models. One that is important in terms of this chapter is the idea of databases being available as a service.
Cloud computing provides opportunities for efficient data management where users define, update and query a database which is running in the cloud environment. As with the other 'as-a-service' offerings, it has the familiar characteristics of pay-per-use and an on-demand model. Instead of acquiring the necessary resources for deploying a database management system, companies may prefer to employ DaaS, mainly for flexibility and cost reasons.
Many organisations have invested heavily in client–server-based database systems over the last two decades. Names like Oracle, SQL Server and DB2 are in everyday parlance for most corporate data professionals. However, some heavy investment has been made to open up the pay-for-use approach to storing data. The big players include Amazon RDS and Microsoft Azure, although most vendors now include cloud versions of their RDBMS. There are new players in the smaller-scale database field too, such as Amazon's

6.4             Data Storage in the Cloud

The cloud has not only enabled database management systems to become pay-as-you-go; it has also provided organisations with the opportunity to store their raw data, or perhaps copies of it, off-site. By storing multiple copies of the data on several servers, these sites have built-in 'redundancy'; that is, the objects you store in them are replicated. This means a single point of failure will not cause the loss of any data overall. Organisations can use services such as Amazon's S3, or more specialist backup and disaster recovery vendors like Backup Technology, to keep copies of data in the cloud, rather than having an expensive second, standby server on-site.
6.5             Backup or Disaster Recovery?
It is to be hoped that the computer-using reader will understand the concept of backing up their valuable data! Hard drives do go wrong, and when they do, they can destroy information. For many organisations, this information is vital to their livelihood. Losing it would not be a minor inconvenience but could actually put them out of business.
In database systems, it is normal for database administrators (DBAs) to be responsible for ensuring no data is ever lost. The difference between backup and DR can be seen as a matter of scale. We can describe a backup strategy as an approach to ensuring no data is permanently lost. At its simplest, this means taking copies of the changes that are made to the database over a certain period (a week, for example) and storing those changes externally to the database. Should data go missing, as a result of some disk failure, for example, a recovery mechanism comes into play to allow the DBA to 'replay' the changes to the database since a point in time when the entire database was consistent. Depending upon how much data is lost and which parts of the database are affected, this can be a relatively simple process or can mean retrieving the last full backup copy of the database and then replaying all changes since that backup.
Backup and recovery normally works on the premise that the server and DBMS are up and running and that what needs repairing are holes in the database itself. Disaster recovery is focused on how to keep a business's critical systems available when calamities happen: earthquakes, fires, floods, accidental deletion of tables or other user mistakes; things that can affect the very fabric of our IT systems.
DR often requires the duplication of systems so that if the master system goes down, a parallel system at a different site kicks in, allowing users to continue using the system. This is, of course, costly, as it means duplicating expensive IT equipment. The speed with which a system becomes available again is measured as the Mean Time to Recovery (MTTR). Better MTTR will normally require more expensive solutions. Parallel systems connected by high-speed networks mean that the system outage time can be measured in seconds but do mean virtually doubling the hardware and maintenance costs of an organisation's IT systems.
Less costly alternatives exist. A standby server can, for example, be separate from the active system but has a complete backup of the database stored on it and has changes applied to it on a regular basis, often overnight.
Now cloud brings other alternatives and is causing organisations to rethink their backup and DR strategies. Traditional backup and recovery methods have largely been based around magnetic tape as a medium for storing data. The advantage of this medium is that it is relatively cheap. The disadvantage is that it is slow to access and can be quite labour intensive. Systems where the user simply clicks a button to send their backup information to one of several servers somewhere in the cloud are fast pressuring the traditional approach.
 However, there are issues, real and perceived, about the security of storing your data in the cloud. Many decision-makers are just more comfortable with knowing where their data is physically stored and putting in place a security mechanism to protect that data. Putting data in the cloud, they fear, makes it open to hacking.
 Of course, data on your own server is not necessarily always safe. Encrypting data before it is passed into the backup environment is one way to add extra layers of protection around the data. True though these facts may be, to the frustration of the providers of disk in the cloud, the perception is still sometimes that you are safer having control of your own data on your own servers. We cover security and governance in the cloud in more detail in Chap.  10 .
6.6               If You Only Have a Hammer – Or Why Relational May Not Always Be the Right Answer

Relational databases (RDBMS) have dramatically changed the way we deal with organisational data. They allow nonprogrammers to access data relatively easily on an ad hoc basis. They are also very safe, and it is a given that the data you look at when you query the database is correct, valid and current. This is no accident. RDBMS are designed around the ACID characteristics; that is, data held in them have:
      Atomicity . All changes must be ‘all or nothing’ to allow transactions to work (see below).
      Consistency. All data must always follow the rules as defined in the schema.
      Isolation. Any data being affected by one transaction must not be used by another transaction until the first is complete.
      Durability.  Once a transaction is committed, it should become permanent and unalterable by machine failure of any type.
The business need addressed by an RDBMS is often based around the concept of a transaction as opposed to a process. As a simple example, if I want to transfer £100 from my account to Richard's to settle a debt, we can see two basic processes:
1.   Reduce my account by £100
2.   Increase Richard’s account by £100
 Once both processes have completed, the transaction is complete, and we can say the changes have been committed to the database. However, should the database crash after step (1), but before step (2), the transaction is not complete. If we were only dealing with successful processes, then my account would be £100 lighter, but Richard would still think that I owed him £100.
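This all-or-nothing behaviour can be sketched with SQLite's transaction support. The account names and the simulated crash are illustrative only:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (owner TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES ('me', 500), ('Richard', 0)")
conn.commit()

def transfer(conn, amount, source, target, fail_between=False):
    """Move money between accounts as a single all-or-nothing transaction."""
    try:
        conn.execute("UPDATE account SET balance = balance - ? WHERE owner = ?",
                     (amount, source))
        if fail_between:
            # Simulate a crash after step (1) but before step (2).
            raise RuntimeError("crash between the two processes")
        conn.execute("UPDATE account SET balance = balance + ? WHERE owner = ?",
                     (amount, target))
        conn.commit()    # both steps are made permanent together
    except RuntimeError:
        conn.rollback()  # atomicity: step (1) is undone, no money vanishes

transfer(conn, 100, "me", "Richard", fail_between=True)
balances = dict(conn.execute("SELECT owner, balance FROM account"))
print(balances)  # {'me': 500, 'Richard': 0}
```

Because the crash happens inside the transaction, the rollback leaves both balances untouched; neither party loses the £100.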
 Both in the world of work and on university courses, the ACID model is often taken for granted as a requirement for any database. We can lose sight of the fact that sometimes our data needs are not transactionally focused, and therefore, sometimes, the relational model isn’t the most appropriate for what we need to do with the data we are storing. This statement may come as a shock to people for whom a database is a relational database!

6.7          Business Drivers for the Adoption of Different Data Models

Relational databases have been at the centre of most business systems for decades. For large organisations, with large volumes of data, they are essential. However, as web-driven systems began to expand, it became clear that the relational model is not good at everything. Google eventually took the decision to write their own database system to support their global requirement for information searching, for example.
The story behind BigTable is well documented, especially by Chang et al. (2008). Its basic data model is not relational, as that model did not allow Google to provide rapid text-based searches across the vast volumes of data stored on the web. It is rather a sparse, distributed, multidimensional database. Although it is Google's own product and not openly available, other such databases do exist. HBase, for example, is an open source database that claims to have a data model similar to that of BigTable.
Why would Google feel it necessary to put so much effort and resource into creating their own database? Simply because relational databases, with their large processing overhead in terms of maintaining the ACID attributes of the data they store and their reliance on potentially processor-hungry joins, are not the right tool for the task Google had before them: quickly finding relevant data from terabytes of unstructured data (web content) which may be stored across thousands of geographically disparate nodes. In short, they found that relational does not scale well for this type of data.

6.8             You Can’t Have Everything

DBAs have long had to balance a number of competing business drivers. Availability (the fact that the database and systems using it should ideally be always ready to accept work) is one such driver, for example. A bit like grandma's apple pie, everyone wants their database to be five 9s (99.999%) available. But that sort of availability costs both money and time. Moreover, particularly in an ACID environment, built-in redundancy (having multiple copies of the data in case one node fails) is also likely to slow performance (another key driver for a DBA) as more writes are required to keep the system consistent.
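To put 'five 9s' in context, the downtime budget at each availability level is simple arithmetic:

```python
# Minutes in a (non-leap) year.
minutes_per_year = 365 * 24 * 60  # 525,600

# Downtime budget implied by 3, 4 and 5 nines of availability.
downtime = {nines: minutes_per_year * (1 - availability)
            for nines, availability in [(3, 0.999), (4, 0.9999), (5, 0.99999)]}

for nines, minutes in downtime.items():
    print(f"{nines} nines allows about {minutes:.1f} minutes of downtime per year")
```

Five 9s leaves roughly five minutes of downtime per year, which is why that last decimal place is so expensive to deliver.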
 As we have alluded to already, some of today’s data processing tasks involve data which is duplicated (for redundancy and performance reasons) across potentially many different nodes in the web. This can be when RDBMS start to struggle, especially with large volumes of such distributed data.
The DBA of such an RDBMS may start by examining ways that he or she can speed retrieval up by using secondary indexing. The problem is that whilst this may speed up a read operation, it is also likely to slow every write operation. Further, they may find that the RDBMS, in its quest to maintain ACID properties, struggles to keep every node consistent.
 There have been some interesting papers published which attempt to justify the NoSQL approach by referring to CAP theorem. Brewer  (2012) makes the point that there is a three-way pull when designing distributed systems in general and for the web in particular. The three forces competing are the needs for the data to be:
      Consistent (all nodes holding a copy of a data item agree on its value)
      Available (every request to read or write data receives a response)
      Partition tolerant (coping with breaks in physical network connections)
Most traditional RDBMS would guarantee that all the values in all our nodes are identical before allowing another user to read them. But as we have seen, that comes at a significant cost in terms of performance.

6.9               Basically Available, Soft State, Eventually Consistent (BASE)

 BASE is the NoSQL operating premise, in the same way that traditional transactionally focused databases use ACID. You will note that we move from a world of certainty in terms of data consistency to a world where all we are promised is that all copies of the data will, at some point, be the same.
Even here there is no single approach. The eventual state of consistency can be provided as a result of a read-repair, where any outdated data is refreshed with the latest version of the data as a result of the system detecting stale data during a read operation. Another approach is that of weak consistency. In this case, the read operation will return the first value found, not checking for staleness. Any stale nodes discovered are simply marked for updating at some stage in the future. This is a performance-focused approach but has the associated risk that data retrieved may not be the most current.
Again, some people used to traditional RDBMS environments will find it hard to handle the last sentence! In some applications, however, it is not critical for every user to have identical data at all times. Think, for example, of a discussion board application:
 In a thread, let’s say 100 answers exist as a result of a user, A, asking a question. User B decides to add comment 101. A user C, logging on to review the threads after user B has committed their comment, should expect to see 101 answers. But if one node (node N) fails to get rapidly updated and user C retrieves their data from N, they will only see 100 answers. But does this really matter? User C is in the same position as if they had viewed the thread a minute ago. It’s easy to say it does matter, but if we give you the option of free access to an uncertain number of answers or charge you £10 to access a guaranteed up-to-date list, you may start to think it matters less!
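The read-repair approach described above can be sketched in a few lines. The three-replica list and version numbers are purely illustrative:

```python
# Three replicas of one data item, each tagged with a version number.
# Replica 1 missed the latest update and is stale.
nodes = [{"version": 2, "value": "new"},
         {"version": 1, "value": "old"},
         {"version": 2, "value": "new"}]

def read_with_repair(nodes):
    """Return the newest value seen, repairing any stale replica on the way."""
    newest = max(nodes, key=lambda n: n["version"])
    for node in nodes:
        if node["version"] < newest["version"]:
            node.update(newest)  # the read itself refreshes the outdated copy
    return newest["value"]

print(read_with_repair(nodes))  # new
print(nodes[1])                 # {'version': 2, 'value': 'new'}
```

A weakly consistent read would instead return whichever value it found first and merely flag the stale replica for later repair.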
6.10             So What Alternative Ways to Store Data Are There?
As we have seen, the ACID requirements that have guided relational database designers for decades are not sacrosanct. Moreover, implementing the procedures needed to ensure that the ACID properties can be guaranteed is a large overhead on the DBMS—an overhead that costs in terms of performance.
There may, for example, be occasions when the database user would be willing to forgo the guarantee that every user sees the same version of the data (consistency), in exchange for a rule that everyone can see some version of the data immediately. Techniques used to ensure consistency in RDBMS environments typically include storing old values in an undo area and preventing other users seeing data by locking rows, or even tables, whilst transactions happen. For the locked-out user, this seems like the database is performing poorly or, worse, is not functioning properly.

6.11          Column Oriented

We have become very used to thinking of rows of data. Many of the database professionals' standard tools force this way of looking at data upon us. Not only do tables hold rows, but data can be exported as CSV (comma-separated value) or fixed-length text files. For many applications, this model is entirely appropriate.
However, in some circumstances, the majority of activity against the data happens in a column-focused way. Often in data analysis, for example, we will be taking averages, looking for high and low values or looking for trends in one particular column.
In those circumstances, it is rather inefficient (in terms of disk reads and CPU usage) to have to find each row, locate the column concerned, check if it contains data and then read the value. Many high-end RDBMS have invented methods for speeding up this read-and-search process but, nonetheless, it is wasteful if all you want to do is read the same attribute from each row. We should remember that reading from disk is a slow process, and limiting disk reads is an important aspect of any DBMS optimisation process.
The column-oriented approach improves read efficiency (though not necessarily write efficiency) by not having any row identifier and by packing data to allow compression schemes to reduce the volume of disk storage required. This allows for more rapid analysis of those column values but does mean that full tuples (or records) have to be recreated by using positional information to look up the other data items from a row, and that is inefficient. Column based is therefore better suited to implementations where analysis of one column is needed.
Moreover, because the best performance, in terms of compression, will happen when similar values are adjacent to each other, the values of a column should ideally be known at the outset, so they can be written to disk in the most efficient way. One insert could result in many rewrites as the data is shuffled around to accommodate the new value, making low data volatility essential.
 Examples include Ingres Vectorwise, Apache HBase and Cassandra. We will be using the latter as an example of the type for you to experiment with later in this chapter.
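The difference between the two layouts can be illustrated in miniature. The product and price data here are invented for the example:

```python
# Row-oriented layout: whole records stored together. Averaging one
# attribute means visiting every record and picking out a single field.
rows = [("Widget", 10.0), ("Gadget", 20.0), ("Gizmo", 30.0)]
avg_row = sum(price for _, price in rows) / len(rows)

# Column-oriented layout: each attribute stored contiguously. The same
# average scans one packed array and touches nothing else.
columns = {"name": ["Widget", "Gadget", "Gizmo"],
           "price": [10.0, 20.0, 30.0]}
avg_col = sum(columns["price"]) / len(columns["price"])
assert avg_row == avg_col == 20.0

# The cost moves to record reconstruction: rebuilding a full tuple needs
# a positional lookup in every column.
record = tuple(columns[name][1] for name in ("name", "price"))
print(record)  # ('Gadget', 20.0)
```

On disk the effect is much larger than in memory: the column scan reads only the blocks holding prices, while the row scan drags every attribute of every record through the buffer cache.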

6.12          Document Oriented

Relational database design techniques all work on the assumption that the data you wish to store is predefinable and will always contain the same type of data. But in life, we sometimes collect data which is less well structured.
 Take as an example the following contact information from a CRM database:
FirstName="George", Surname="Smith",
Pets=[{Type:"Dog", Name:"Flint", Age:5}, {Type:"Cat", Name:"Jade", Age:7}],
Tel="0111223344"

FirstName="Mary", Surname="Songo",
Pets=[{Type:"Dog", Name:"Ajax", Age:2}],
Mob="077777777", Tel="0211225555"
Both of these are entirely realistic data about the customer, but the data items are not identical. Whilst a relational database would have a schema which included columns to store all potential data items, document-oriented databases have the flexibility to allow you to store this sort of semi-structured data in a schema-less environment.
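A sketch of how a document store might hold the two contacts above, here emulated with an in-memory list of Python dictionaries rather than a real document database:

```python
# Each document carries only the fields it actually has -- Mary has a
# mobile number, George does not -- with no schema to pad out.
collection = [
    {"FirstName": "George", "Surname": "Smith",
     "Pets": [{"Type": "Dog", "Name": "Flint", "Age": 5},
              {"Type": "Cat", "Name": "Jade", "Age": 7}],
     "Tel": "0111223344"},
    {"FirstName": "Mary", "Surname": "Songo",
     "Pets": [{"Type": "Dog", "Name": "Ajax", "Age": 2}],
     "Mob": "077777777", "Tel": "0211225555"},
]

# Queries navigate whatever structure each document happens to have.
dog_owners = [d["Surname"] for d in collection
              if any(p["Type"] == "Dog" for p in d.get("Pets", []))]
print(dog_owners)  # ['Smith', 'Songo']
```

A real document database such as MongoDB stores and queries documents of this shape natively; the point here is simply that no column needs to exist for a field a given customer lacks.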
Many of the document-centric databases don't allow data to be locked in the way that is required for atomic transactions in RDBMS. Since locking is a significant performance overhead, this enables them to claim performance advantages over traditional databases. The downside, of course, is that some applications absolutely require secure transactional locking mechanisms. Document oriented is probably not the right vehicle for highly structured data in a transaction-dependent environment, as occurs in many OLTP systems.
 The performance advantages that accrue from document-centric implementations mean that they are often used when there are large volumes of semi-structured data, such as web page content, comment storage and event logging. They may well also support sharding (spreading a table’s rows) of data across multiple nodes, again as a performance device.
 Examples include CouchDb and MongoDB. We will be using the latter as an example of the type for you to experiment with later in this chapter.

6.13          Key–Value Stores (K–V Store)

I n some situations, the database user needs are even more basic than either of the two examples above. Sometimes you just want to store some information against a particular key (or index). If you always know the key value when you want to get data out from the store and you don’t need the rule enforcement that comes with an RDBMS, then a K–V store will probably be the fastest option. See the section about MapReduce in Chap.  5 .
 Because of their light footprints, such databases can often be built into applications, removing the need for any DBMS.
As with all the data storage options mentioned here, the real driver for the move away from RDBMS as the single answer to all database needs has been the web and latterly the cloud. The enormous amounts of data for applications like Facebook and Twitter will be stored in data centres across many different geographic locations.
 Horizontal scaling is the process of having that data spread across many nodes, and many of these newer, NoSQL-type database systems are designed from the start to be able to cope with this physical aspect of the data store. RDBMS on the other hand, with query responses typically created by joining data from two or more tables, can struggle with managing distributed data that is spread across many nodes.
 Examples include Voldemort and Redis.
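A key-value store stripped to its essentials can be sketched as a thin wrapper over a hash table. The class and key names are illustrative, not any particular product's API:

```python
# A key-value store reduced to its essentials: put and get against a
# known key, with no schema, joins or consistency rules to enforce.
class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value               # last write wins

    def get(self, key, default=None):
        return self._data.get(key, default)   # O(1) hash lookup

store = KVStore()
store.put("session:42", {"user": "george", "basket": ["ISBN-0-13-110362-8"]})
print(store.get("session:42")["user"])  # george
```

Products such as Redis add persistence, replication and richer value types, but the programming model remains this simple: if you always know the key, retrieval is a single lookup rather than a query plan.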
6.14            When to Use Which Type of Data Storage?
The first point to make is that the various types of database need not be mutually exclusive. This is not, as some websites argue, a great battle to the death for the future of databases. Relational databases certainly handle transactional data well, and, for the foreseeable future, especially given the huge levels of investment in RDBMS made by most organisations, they will remain the most frequently used type of database.
When data professionals are looking at querying large datasets that may be distributed across the web, however, some other tools have entered their armoury. In undertaking the tutorials later in this chapter, you will be able to gauge for yourself what the relative strengths and weaknesses are. Every application will be different, but as a rule of thumb, we can say:
1.   Column-based databases allow for rapid location and return of data from one particular attribute. They are potentially very slow at writing, however, since data may need to be shuffled around to allow a new data item to be inserted. As a rough guide, then, traditional transactionally orientated applications will probably fare better in an RDBMS. Column based will probably thrive in areas where speed of access to non-volatile data is important, for example, in some decision support applications. You only need to review marketing material from commercial contenders, like Ingres Vectorwise, to see that business analytics is seen as the key market and speed of data access the main product differentiator.
2.   Document-centric databases are good where the data is difficult to structure. Web pages and blog entries are two oft-quoted examples. Unlike RDBMS, which impose structure by their very nature, document-centric databases allow free-form data to be stored. The onus is then on the data retriever to make sense of the data that is stored.
3.   If you do not need large and complex data structures and can always access your data using a known key, then key–value stores have a performance advantage over most RDBMS. Oracle has a feature within their RDBMS that allows you to define a table as an index-organised table (IOT), and this works in a similar way. However, you do still have the overhead of consistency checking, and these IOTs are often just a small part of a larger schema. RDBMS have a reputation for poor scaling in distributed systems, and this is where key–value stores can have a distinct advantage.

6.15         Summary

In the previous chapters of Part Two, we have looked at cloud from the application perspective. In this chapter, we have seen that there are many different data storage methods available for a database. The decision as to which is the right one is driven by the task at hand. How we estimate the potential benefits of a move to cloud is covered in more detail in Chap. 8.
 We have also seen how cloud can help with typical database tasks such as backup and recovery and disaster recovery whilst reviewing the business drivers that would come into play for using the cloud for these tasks.
This chapter is different to the others in that the following tutorials are longer. Many readers will have had some experience of the relational model and SQL already, but we felt that readers need to appreciate the different approaches on offer. In the following tutorials, therefore, we concentrate on two of the more modern approaches—one column based and one document based.

6.16          Further Reading

The tutorials that follow provide pointers to more resources about Cassandra and MongoDB.
"Transforming relational database into HBase: a case study" by Chongxin Li is an interesting discussion of the complexities of mapping the relational model to HBase. Chongxin Li: Transforming relational database into HBase: a case study. 2010 IEEE International Conference on Software Engineering and Service Sciences (ICSESS), pp. 683–687 (16–18 July 2010). doi: 10.1109/ICSESS.2010.5552465
The BigTable paper referred to earlier in this chapter is: Chang, F., Dean, J., Ghemawat, S., et al.: Bigtable: a distributed storage system for structured data. ACM Transactions on Computer Systems 26(2) (2008).

6.17         Tutorials

 In this section, we ask you to try examples of NoSQL databases. The business case for these trials is given below, and sample datasets with further work can be downloaded from our book website.

6.18         BookCo

In this case study, we are looking at data from a chain of booksellers called BookCo. You are the chief information officer of this company that sells books and other related products via the Internet. You have many information needs to address, but you feel you need to assess some new tools which may help with some of those tasks. BookCo has already invested heavily in a leading client/server database solution, and that will remain at the centre of most transactional business needs. However, there are two areas that you feel are worthy of further investigation:
1.   One task that has been problematic in the past has been the statistical analysis of the sales data. In particular, you need to increase the speed at which the analysts get data back on trends and averages. You decide to investigate a column-based approach, as you have heard that this can dramatically increase query performance.
2.   You capture communications from customers from a variety of sources. As the BookCo chain is international, the different sources may well be generated anywhere in the world. You need to investigate if it would be useful to allow this data to be stored using a document-centric data store, rather than an RDBMS.

6.19           The Column-Based Approach

 There are a number of column-based solutions available. To get a feel for the approach, you have decided to give Cassandra a try. It is an open source solution, so there are no costs for this experiment.

6.20          Cassandra Tutorial

6.20.1        Installation and Configuration

 The tutorial demonstrates the use of Cassandra on Ubuntu Linux. Cassandra will also run on other Linux servers and can be installed on Windows.
This guide was created using Ubuntu Linux 11.10. As at April 2012, the current stable release is 1.0.9. You are expected to know basic Unix commands in order to install the system. If you are unsure, there are many web-based Linux command guides. If you don't mind being called 'clueless', this is a good starting point: http://
The first step is to download the latest version of Cassandra (http://cassandra. ). Make sure it is the latest stable version. Once you have the download page in your browser:
      Select to download the latest bin version. As at April 2012, the current version is apache-cassandra-1.0.9-bin.tar.gz
      Save the binary to a preferred location, for example, your desktop.
      Open a Linux terminal window and navigate to the folder where the file was saved.
      Decompress the .gz file using the following process.
#Assuming that the file was saved in the folder Desktop
$cd Desktop
$tar xzf apache-cassandra-1.0.9-bin.tar.gz
#where 1.0.9 is the version and should be changed to the version you have downloaded
Cassandra needs you to install the JRE dependency if you don't already have it.
Use the following command:
  $sudo apt-get install openjdk-6-jre
#wait for the download and installation process to finish
Moreover, you need to create some directories for Cassandra to store data and logs, and grant appropriate permissions on them. This is a once-only task. Open a terminal session and perform the following operations. (Replace USERNAME with your OS authentication username.)
 $sudo mkdir -p /var/log/cassandra
 $sudo chown -R USERNAME /var/log/cassandra
 $sudo mkdir -p /var/lib/cassandra
 $sudo chown -R USERNAME /var/lib/cassandra
 The next step is to run the server. This has to be done through a terminal session and must be repeated each time you reboot your OS.
Navigate again to the folder into which you extracted Cassandra, and then into the bin directory.
 To start the server, use the 'cassandra' command:
 $cd apache-cassandra-1.0.9
 $cd bin
 $./cassandra

 Note: It is important not to close this terminal.
Now that the server is running, the client interface should be started. Open a new terminal session and execute the commands below, navigating to the bin folder as before:
 $cd apache-cassandra-1.0.9
 $cd bin
 $./cassandra-cli
Welcome to the Cassandra CLI. Type 'help;' or '?' for help.
 Type 'quit;' or 'exit;' to quit.
 Once the client interface is open, we need to connect to the local server you started earlier by using the connect command. By default, the port is 9160, and notice that commands should end with a semicolon ';'. By default, the system will connect to the 'Test' cluster.
 [default@unknown] connect localhost/9160;
 Connected to: "Test Cluster" on localhost/9160
To recap, we have downloaded and installed Cassandra. We then started the Cassandra server in a terminal session. In a separate session, we then started the client. Finally, we used the client to connect to our default 'Test Cluster' using port 9160.
 Note: There are alternative graphical user interface (GUI) tools you could use with Cassandra instead of the Cassandra CLI. However, the CLI makes the technical underpinning far more visible and therefore gives the new user a better insight into how Cassandra works.

6.20.2        Data Model and Types

Before we get our hands dirty with our new toy, we need to understand some of Cassandra's basic concepts.
 A  keyspace in Cassandra can be seen as the equivalent to a database schema in a relational DBMS, and it is a group of related column families. With an attribute called placement_strategy, you can determine how you will distribute replicas around the nodes you will be using.
 A column family is very broadly equivalent to a table in an RDBMS. Each such family is stored in a separate file. The file itself is sorted into key order. It is therefore important, in terms of reducing disk reads and therefore improving performance, that columns which are regularly accessed together should be kept together, within the same column family, and that data types within a family should ideally be the same.
A column is the smallest unit of storage in Cassandra. A standard column is composed of a tuple of a unique name (key), a value and a timestamp. The key identifies a row in a column family. The timestamp corresponds to the last update time.
 A  supercolumn is a list of columns used to store multiple items of a similar type.
Supercolumns do not have timestamps.
 Put another way, the building blocks we will be reviewing in Cassandra are:
      A column, which is a key–value pair.
      A supercolumn, which is a collection of associated columns that are sorted.
      A column family, which is a container for sorted columns. These columns can be referenced through a row id.
 Examples may help clarify the difference between standard and supercolumns. This is an example of a standard column in use:
 "example" :
 column : "emailaddress"  value : ""  timestamp: 123456
 And here is an example of a supercolumn in use:
 name: "homeaddress"
 value: {
   street: {name: "street", value: "Oxford", timestamp: 12345},
   postcode: {name: "postcode", value: "SO234", timestamp: 123456}
 }
 Note how there are a variety of values in the supercolumn but one, and only one, value in the column.
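The column and supercolumn structures above can be sketched in plain Python. This is a toy model for illustration only (the email value is invented, and real Cassandra stores timestamps in microseconds):

```python
import time

# A standard column: a (name, value, timestamp) triple.
# The timestamp records the last update time.
def make_column(name, value):
    return {"name": name, "value": value, "timestamp": time.time()}

email = make_column("emailaddress", "someone@example.com")

# A supercolumn: a named collection of subcolumns,
# with no timestamp of its own.
homeaddress = {
    "name": "homeaddress",
    "value": {
        "street": make_column("street", "Oxford"),
        "postcode": make_column("postcode", "SO234"),
    },
}

# One value in the standard column; several nested values in the supercolumn.
print(email["value"])
print(sorted(homeaddress["value"].keys()))
```

Note how the standard column carries its own timestamp while the supercolumn is only a container for timestamped subcolumns.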
 Now we have Cassandra installed, we should get used to the basic commands. One of the first useful commands to learn is the show command. If we get into the situation where we forget which cluster we are working on, we can ask Cassandra by using the 'show cluster name;' command.
 And if we need to know which keyspaces there are, we again use show:
 [default@unknown] show cluster name;
 Test Cluster
 [default@unknown] show keyspaces;
 Keyspace: system:
 ....other names would appear here

6.20.3        Working with Keyspaces

 The output of the last command displays various information about the ‘ system ’ keyspace. This is used mainly for system tasks and is not dissimilar to the system tablespace in Oracle, for example. Once a new keyspace is created, it would also be displayed as an output of the command.
 Initially, a new keyspace (analogous to a relational database) has to be created using the create keyspace KeyspaceName [with att1="value1"] [and att2="value2"]; command, where KeyspaceName corresponds to the name you want to give to the keyspace. Two attributes can be added to a keyspace:
1.   The replication factor
2.   The replica placement strategy
 Until we get a feel for Cassandra, however, these are not required as we are using Cassandra locally instead of on many machines. But don’t lose sight of the fact that one of the main reasons that Cassandra was invented was to be able to cope with large volumes of data in a distributed, replicated environment.
 After creation, to connect to a keyspace, we use the use Keyspacename command, where Keyspacename is the name given to the created keyspace.
 For the BookCo sales records, we are going to create a keyspace with the name sales. When doing this, we will see that Cassandra will respond with a unique UUID number:
 [default@unknown] create keyspace sales;
 89478600-c772-11e0-0000-242d50cf1fbf #UUID
 Waiting for schema agreement…
 … schemas agree across the cluster
 [default@unknown] use sales;
 Authenticated to keyspace: sales

Working with Column Families

 To create a column family (analogous to a relational table), we should provide a name and a comparator. The comparator is required by Cassandra because it needs to know the order in which to store data, and the comparator is responsible for this sorting of the columns within the column family. Once any new column is added, it is compared with others and is sorted accordingly—during the insert. This is important as it is not possible to sort results during a query. One way of dealing with the lightweight query facility is to have multiple column families, one for each expected query return type.
Columns in Cassandra are sorted based on their name. So the comparator can be used to tell Cassandra how to order columns inside their row and, by extension, how the columns will be sorted when they are returned. The comparator can also be used for subcolumns (of supercolumns). These are the comparators that are available out of the box:
 BytesType: The default type, which sorts based on bytes. If validation is not required, choose this instead of UTF8Type.
 AsciiType: Sorts using the character encoding.
 UTF8Type: Sorts based on a string that uses UTF-8 as the character encoder. It has the advantage of allowing data validation.
 LongType: Sorts based on 64-bit long numbers.
 TimeUUIDType: Sorts based on a 16-byte timestamp.
Advanced users can, if this list doesn't cover an application's needs, write their own comparators using Java. The comparator is a required element for a valid column family. However, the other column metadata is optional:
      Data validation. Remember, validation is a costly overhead, but we may need, for some applications, to test the data being entered. The types are the same as in the table above but are used for validation purposes instead of sorting. Predefinition is not required, but if you are aware of the columns that need validation, such rules should be implemented at family creation time. The 'schema' is flexible, so new columns can be added later if required.
      Index keys. This is another optional attribute. With it, you can create index keys which allow search queries against the indexed values. If no indexes exist, querying based on certain values will not be possible. An example of creating a primary index key will be demonstrated later in these notes. More than one column can be indexed in a column family, and these are called secondary indexes. Secondary indexes are recommended for columns with low cardinality because rewriting the indexes is a very costly overhead, and so, where possible, we avoid indexing volatile data.
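A short Python sketch may make the comparator's job concrete: columns are placed into sorted position as they are written, so a read simply returns them in stored order. This is an illustrative toy under that one assumption, not Cassandra's real storage engine:

```python
import bisect

class ToyColumnFamily:
    """Keeps column names sorted at insert time, as a comparator does."""
    def __init__(self):
        self._names = []    # always held in sorted (comparator) order
        self._values = {}

    def insert(self, name, value):
        if name not in self._values:
            # the 'comparator' decides the position during the write
            bisect.insort(self._names, name)
        self._values[name] = value

    def columns(self):
        # no sorting happens here: reads return the stored order
        return [(n, self._values[n]) for n in self._names]

cf = ToyColumnFamily()
for name in ("seller", "book_id", "price"):
    cf.insert(name, "placeholder")
print([n for n, _ in cf.columns()])  # ['book_id', 'price', 'seller']
```

Because ordering is fixed at write time, a different desired ordering means a different column family, which is why the text suggests one family per expected query return type.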
 The command for creating a column family is
 create column family ColumnFamilyName with comparator = ‘Type’ and column_metadata = [ {column_name: ColumnName, validation_class: ‘Type’, index_type: KEYS }, …];
 Note: When referring to types, look at the types table and add them without the single quotes. As mentioned before, the  column_metadata is optional.
For the BookCo sales, we should create a 'sale' standard column family, using UTF8Type as the comparator and again as the validation class, without indexing any key at the moment.
 [default@sales] create column family sale with comparator = UTF8Type
 … AND column_metadata =
 … [ {column_name: sale_id, validation_class: UTF8Type},
 … {column_name: book_id, validation_class: UTF8Type},
 … {column_name: seller, validation_class: UTF8Type},
 … {column_name: buyer, validation_class: UTF8Type},
 … {column_name: price, validation_class: UTF8Type},
 … {column_name: payment_method, validation_class: UTF8Type},
 … {column_name: date, validation_class: UTF8Type}];
 Waiting for schema agreement…
 … schemas agree across the cluster
 Now we are going to create another column family for novel ratings and recommendations.
 To do so, we should create a new supercolumn column family using the
 create column family ColumnFamilyName with comparator = 'Type' AND column_type='Super' AND subcomparator= 'Type' [AND column_metadata = [{column_name: ColumnName, validation_class: 'Type', index_type: KEYS}, …]; command.
Everything after the subcomparator attribute could have been omitted.
 You should notice the 'column_type=Super' attribute and the 'subcomparator=Type' attribute, which refers to the comparator used for the subcolumns of this type of column family.
 [default@sales] create column family Novelrating with comparator = 'UTF8Type'
 … AND column_type='Super' AND subcomparator= 'UTF8Type' AND column_metadata =
 … [{column_name : author, validation_class : UTF8Type},
 … {column_name : comment, validation_class : UTF8Type},
 … {column_name : rate, validation_class : UTF8Type}];
 c12796d9-c8f3-11e0-af3e-e700f669bcfc
 Waiting for schema agreement…
 … schemas agree across the cluster
 A very important operation on column families is Update. We may need to update a column family with specified attribute(s) and value(s) or even add index keys (see previous paragraphs).
To do so, we use the update column family ColumnFamilyName with comparator = 'Type' AND column_metadata = [{column_name: ColumnName, validation_class: 'Type', index_type: KEYS}, …]; command. This is similar to the Create command but with the changed attributes and values.
 If, for example, we want to create an index key on the sale_id, follow this example of how to use the update command:
 [default@sales] update column family sale with comparator = 'UTF8Type' and default_validation_class=UTF8Type and key_validation_class=UTF8Type and column_metadata=
 [ {column_name: sale_id, validation_class: UTF8Type, index_type: KEYS},
 {column_name: book_id, validation_class: UTF8Type},
 {column_name: seller, validation_class: UTF8Type},
 {column_name: buyer, validation_class: UTF8Type},
 {column_name: price, validation_class: UTF8Type},
 {column_name: payment_method, validation_class: UTF8Type},
 {column_name: date, validation_class: UTF8Type}];
 Waiting for schema agreement…
 … schemas agree across the cluster

6.20.4        Working with Columns

Inserting and Updating Values

As we pointed out above, the column is the base storage unit in Cassandra. We will find ourselves inserting to, deleting from, updating and creating columns extensively as we build our applications.
To add or update data in a column, use the set ColumnFamilyName['key']['ColumnName'] = 'value'; command.
 We mentioned that a column also has a timestamp associated with it. This is added or updated automatically by Cassandra.
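The role of that timestamp can be illustrated with a small Python sketch (a simplified, hypothetical model): every write carries a timestamp, and when two writes target the same column, the newer one wins (last-write-wins):

```python
store = {}

def set_column(row_key, column, value, timestamp):
    """Keep a write only if it is newer than what is already stored."""
    current = store.get((row_key, column))
    if current is None or timestamp > current[1]:
        store[(row_key, column)] = (value, timestamp)

set_column("sale1", "price", "15", timestamp=200)
set_column("sale1", "price", "12", timestamp=100)  # older write: discarded
print(store[("sale1", "price")][0])  # '15'
```

This is why the timestamp corresponds to the last update time: it is the tiebreaker Cassandra uses when reconciling competing writes.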
 The key refers to the row key. An example of this command being used is below. We are going to create a new entry for sale1:
 [default@sales] set sale['sale1']['sale_id'] = '1';
 Value inserted.
 [default@sales] set sale['sale1']['book_id'] = '123';
 Value inserted.
 [default@sales] set sale['sale1']['seller'] = 'mike';
 Value inserted.
 [default@sales] set sale['sale1']['buyer'] = 'john';
 Value inserted.
 [default@sales] set sale['sale1']['price'] = '15';
 Value inserted.
 [default@sales] set sale['sale1']['payment_method'] = 'PayPal';
 Value inserted.
 [default@sales] set sale['sale1']['date'] = '10Aug2011';
 Value inserted.
 The column family, as it stands at the moment, is shown in the figure below.
 Now we will populate the sale column family with more data. You can use these examples, but do try to add more of your own:
 #Creating a new row for sale 2
 set sale['sale2']['sale_id'] = '2';
 set sale['sale2']['book_id'] = '143';
 set sale['sale2']['seller'] = 'peter';
 set sale['sale2']['buyer'] = 'john';
 set sale['sale2']['price'] = '25';
 set sale['sale2']['payment_method'] = 'PayPal';
 set sale['sale2']['date'] = '12Aug2011';
 #Creating a new row for sale 3
 set sale['sale3']['sale_id'] = '3';
 set sale['sale3']['book_id'] = '144';
 set sale['sale3']['seller'] = 'paul';
 set sale['sale3']['buyer'] = 'peter';
 set sale['sale3']['price'] = '43';
 set sale['sale3']['payment_method'] = 'Visa';
 set sale['sale3']['date'] = '13Aug2011';
 #Creating a new row for sale 4
 set sale['sale4']['sale_id'] = '4';
 set sale['sale4']['book_id'] = '127';
 set sale['sale4']['seller'] = 'josh';
 set sale['sale4']['buyer'] = 'alex';
 set sale['sale4']['price'] = '45';
 set sale['sale4']['payment_method'] = 'Visa';
 set sale['sale4']['date'] = '12Aug2011';
 To add data in a supercolumn and its subcolumns, use the set ColumnFamilyName['key']['superColumnName']['col'] = 'value'; command. The key refers to the row key. The example here adds ratings for a novel called 'Alpha Mike'.
 [default@sales] set Novelrating['Alpha Mike']['entry1']['author'] = 'Peter Pond';
 Value inserted.
 [default@sales] set Novelrating['Alpha Mike']['entry1']['comment'] = 'Alpha Mike is a very good novel';
 Value inserted.
 [default@sales] set Novelrating['Alpha Mike']['entry1']['rate'] = '5/5';
 Value inserted.
 [default@sales] set Novelrating['Alpha Mike']['entry2']['author'] = 'Jason Stream';
 Value inserted.
 [default@sales] set Novelrating['Alpha Mike']['entry2']['comment'] = 'Good storyline';
 Value inserted.
 [default@sales] set Novelrating['Alpha Mike']['entry2']['rate'] = '4/5';
 Value inserted.
 Now we will populate the novelrating column family with more data. You can use these examples, but do try to add more of your own:
 #Some more recommendations for Alpha Mike.
 set Novelrating['Alpha Mike']['entry3']['author'] = 'Phil Nastis';
 set Novelrating['Alpha Mike']['entry3']['comment'] = 'Excellent thriller';
 set Novelrating['Alpha Mike']['entry3']['rate'] =
 #Creating a new row for a novel called Catch 22 and adding 2 entries for recommendations
 set Novelrating['Catch 22']['entry1']['author'] = 'Heller';
 set Novelrating['Catch 22']['entry1']['comment'] = 'Brilliant book';
 set Novelrating['Catch 22']['entry1']['rate'] = '5/5';
 set Novelrating['Catch 22']['entry2']['author'] = 'Heller';
 set Novelrating['Catch 22']['entry2']['comment'] = 'Dated, but still good';
 set Novelrating['Catch 22']['entry2']['rate'] = '4/5';
Reading Data from Column Values

 There are different ways to retrieve data from an existing keyspace. Initially, we can use the count ColumnFamilyName['key']; command to count the number of columns stored under the key.
 To do the same thing for the subcolumns of a supercolumn, use the count ColumnFamilyName['key']['superColumnName']; command.
 Note: 'sale1' is the row key and not a column. In the second command, 'Alpha Mike' is the key and the supercolumn is 'entry1'.
[default@sales] count sale[‘sale1’];
7 columns
[default@sales] count Novelrating['Alpha Mike']['entry1'];
 3 columns
 #Try to retrieve a key that does not exist
 [default@sales] count sale[‘sale10’];
 0 columns
 To retrieve all the information about 'sale1', use the get ColumnFamilyName['key']; command. Notice that the records have an added 'timestamp' value that is automatically added or updated when the record is added or updated:
 [default@sales] get sale['sale1'];
 => (column=book_id, value=123, timestamp=1313532897200000)
 => (column=buyer, value=john, timestamp=1313532920840000)
 => (column=date, value=10Aug2011, timestamp=1313532967504000)
 => (column=payment_method, value=PayPal, timestamp=1313532956880000)
 => (column=price, value=15, timestamp=1313532931944000)
 => (column=sale_id, value=1, timestamp=1313532878948000)
 => (column=seller, value=mike, timestamp=1313532909544000)
 Returned 7 results.
 Now we could try to retrieve the information from the supercolumns we have created previously. To retrieve all the information, the command is the same as above. Try the command for yourself to see the full output:
 [default@sales] get Novelrating['Alpha Mike'];
 => (super_column=entry1,
 (column=author, value=Peter Pond, timestamp=1313601352679000)
 (column=comment, value=Alpha Mike is a very good novel, timestamp=1313601369614000)
 (column=rate, value=5/5, timestamp=1313601388350000))
 => …. and so on …..
We can also retrieve only the value of a column by using the get ColumnFamilyName['key']['columnName']; command.
 To retrieve a value from a subcolumn of a supercolumn, we use the get ColumnFamilyName['key']['superColumnName']['columnName']; command. Examples are
 [default@sales] get sale['sale1']['book_id'];
 => (column=book_id, value=123, timestamp=1313532897200000)
 [default@sales] get Novelrating['Alpha Mike']['entry1']['rate'];
 => (column=rate, value=5/5, timestamp=1313601388350000)
 We can put conditions on our queries, just like a WHERE clause in an SQL command. However, in order to use this properly, we need to add indexes to support specific queries. For example, to retrieve data based on the index created for 'sale_id', we would use the get CF where column = 'value' [and column = 'value' and …] command.
To help understand why an index key is important, let us initially retrieve data based on book_id, which was not previously indexed.
 [default@sales] get sale where book_id = ‘143’;
 No indexed columns present in index clause with operator EQ.
 The system responds that the  book_id column is not indexed. Now try with the  sale_id column which was indexed previously during the update command demonstration:
 [default@sales] get sale where sale_id = ‘1’;
 RowKey: sale1
 => (column=book_id, value=123, timestamp=1313532897200000)
 ...........
 1 Row Returned.
To allow the retrieval of book information using the book_id, we can update the column family to add a secondary index, as shown below. We will also add an index on the price column:
 [default@sales] update column family sale with comparator = 'UTF8Type' and column_metadata=
 … [ {column_name: sale_id, validation_class: UTF8Type, index_type: KEYS},
 … {column_name: book_id, validation_class: UTF8Type, index_type: KEYS},
 … {column_name: seller, validation_class: UTF8Type},
 … {column_name: buyer, validation_class: UTF8Type},
 … {column_name: price, validation_class: UTF8Type, index_type: KEYS},
 … {column_name: payment_method, validation_class: UTF8Type},
 … {column_name: date, validation_class: UTF8Type}];
 b6cee09c-c9a6-11e0-af3e-e700f669bcfc
 Waiting for schema agreement…
 … schemas agree across the cluster
Now we can try to retrieve information based on the book_id again. Then, we will use price and then combine both. Just as with SQL, with the price column we can use comparison operators such as greater than or smaller than:
 [default@sales] get sale where book_id='143';
 RowKey: sale2
 => (column=book_id, value=143, timestamp=1313673572425000)
 ...........
 1 Row Returned.
 [default@sales] get sale where price = 15;
 RowKey: sale1
 => (column=book_id, value=123, timestamp=1313532897200000)
 ...........
 1 Row Returned.
 [default@sales] get sale where book_id='143' and price >
 0 Rows Returned.
 [default@sales] get sale where book_id='143' and price <
 RowKey: sale2
 => (column=book_id, value=143, timestamp=1313673572425000)
 ...........
 1 Row Returned.
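Conceptually, a secondary index is just a reverse mapping from a column value to the row keys that hold it. The Python sketch below (invented data, not Cassandra internals) shows both why indexed equality lookups are cheap and why indexing a frequently updated column is costly: every write must also rewrite the index.

```python
from collections import defaultdict

rows = {
    "sale1": {"book_id": "123", "price": "15"},
    "sale2": {"book_id": "143", "price": "25"},
}

# Build a secondary index on book_id: value -> set of row keys.
book_index = defaultdict(set)
for row_key, columns in rows.items():
    book_index[columns["book_id"]].add(row_key)

print(book_index["143"])  # {'sale2'}

def set_book_id(row_key, new_value):
    """Updating an indexed column means updating the index too."""
    book_index[rows[row_key]["book_id"]].discard(row_key)
    rows[row_key]["book_id"] = new_value
    book_index[new_value].add(row_key)

set_book_id("sale2", "200")
print(book_index["200"])  # {'sale2'}
```

The extra maintenance on every write is the reason the text recommends secondary indexes only for low-cardinality, non-volatile columns.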
 Another way to retrieve all the rows in a column family is the list  ColumnFamilyName; command:
  [default@sales] list Novelrating;
Before you try this, imagine what the output might be like to make sure you understand the data being stored. Once you run it, did you get the data back you expected?
 List can also be used with a LIMIT clause to control the number of records that will be retrieved. Notice from the returned results that rows come back in an arbitrary order rather than sorted.
 [default@sales] list sale LIMIT 2;
 RowKey: sale3
 => (column=book_id, value=144, timestamp=1313673579083000)
 ................
 RowKey: sale2
 => (column=book_id, value=143, timestamp=1313673572425000)
 ................
 2 Rows Returned.

Deleting Rows and Columns

In keeping our data current, we may need to delete a column from a row. For example, to delete the 'date' column from 'sale1', we would use the del ColumnFamilyName['key']['col']; command.
 To delete the entire row of 'sale1', we would use the del ColumnFamilyName['key']; command.
 To delete a supercolumn, we would use the del ColumnFamilyName['key']['super']; command.
 To delete a subcolumn, we would use the del ColumnFamilyName['key']['super']['col']; command.
 Examples of all these are given below:
 [default@sales] del sale['sale1']['date'];
 column removed.
 [default@sales] get sale['sale1']['date'];
 Value was not found
 [default@sales] del sale['sale1'];
 row removed.
 [default@sales] get sale['sale1'];
 Returned 0 results.
 [default@sales] del Novelrating['Alpha Mike']['entry1']['rate'];
 column removed.
 [default@sales] del Novelrating['Alpha Mike']['entry1'];
 column removed.
 [default@sales] get Novelrating['Alpha Mike']['entry1'];
 Returned 0 results.

Drop Column Families and Keyspace

 To drop a column family, we use the drop column family ColumnFamilyName; command.
To drop a Cassandra keyspace, we use the drop keyspace KeyspaceName; command.
 [default@sales] drop column family sale;
 ade3bc44-236f-11e0-8410-56547f39a44b
 #To drop the keyspace follow the example
 [default@sales] drop keyspace sales;
 Waiting for schema agreement…
 … schemas agree across the cluster

6.20.5        Shutdown

 At the end of our busy day, we may need to bring down the Cassandra client. To do this, we simply use the quit; or exit; command from the client terminal session.
 If we also want to shut down the Cassandra server, we press Ctrl+C in its terminal.

6.20.6         Using a Command-Line Script

You can build scripts which create column families and insert data. This is very useful in the prototyping phase, when we typically need to keep recreating elements. In the example below, we are going to create a new column family called tsale in the sales keyspace; don't forget to have your authentication as the first line.
 Open a new editor page and add this into it:
 use sales;
 create column family tsale with comparator = 'UTF8Type' and default_validation_class=UTF8Type and key_validation_class=UTF8Type and column_metadata=
 [ {column_name: sale_id, validation_class: UTF8Type, index_type: KEYS},
 {column_name: book_id, validation_class: UTF8Type},
 {column_name: Novel, validation_class: UTF8Type},
 {column_name: buyer, validation_class: UTF8Type},
 {column_name: price, validation_class: UTF8Type},
 {column_name: payment_method, validation_class: UTF8Type},
 {column_name: date, validation_class: UTF8Type}];
 set tsale['sale1']['sale_id'] = '1';
 set tsale['sale1']['book_id'] = '123';
 set tsale['sale1']['Novel'] = 'mike';
 set tsale['sale1']['buyer'] = 'john';
 set tsale['sale1']['price'] = '15';
 set tsale['sale1']['payment_method'] = 'PayPal';
 set tsale['sale1']['date'] = '10Aug2011';
 Now save this as cass.txt into your Cassandra bin directory.
 Open a new terminal session and cd to the Cassandra bin directory. Do not start the interactive cassandra-cli at this stage. Instead, type
  ./cassandra-cli -host localhost -port 9160 -f cass.txt
 This will, if it can, run the content of the script file called cass.txt that you have just created.

6.20.7        Useful Extra Resources

 We have only just got started with Cassandra in these notes. We hope that there is enough here to give you an understanding of the key principles and to help identify how Cassandra is different from an RDBMS.
 If you wish to find out more, some useful resources include:
      This book's website (link). There are a bigger dataset and some more detailed queries to give you practice in using the product.
      Hewitt, E. (2010) Cassandra: The Definitive Guide, published by O'Reilly Media.
      Apache Cassandra (2011) Official Site. Last accessed on 1 Sept 2011.
      Apache Wiki (2011) Wiki Frontpage. Last accessed on 1 Sept 2011.
      DataStax (2011) Official Portal. Last accessed on 1 Sept 2011.

6.20.8         The Document-Based Approach

There are a number of document-based solutions available. To get a feel for the approach, you have decided to give MongoDB a try. It is an open source solution, so there are no costs for this experiment.

6.21          MongoDB Tutorial

6.21.1        Installation and Configuration

This guide was made and tested with Ubuntu Linux 11.10. Several Linux systems include package managers that can help to install MongoDB. At the time of writing, packages are available for Ubuntu, Debian, CentOS, Fedora and Red Hat. It is expected that basic Unix commands are known. If you are unsure, there are many web-based Linux command guides. Follow these steps:
1.   Go to the MongoDB downloads page and select the Linux 32-bit or Linux 64-bit product version.
2.   Save the binary to your preferred location.
3.   Open a terminal window and navigate to the folder where the file was saved.
4.   Decompress the .tgz file using the following process:
 #Assuming that the file was saved in folder Desktop
 $cd Desktop
 $tar zxf mongodb-linux-i686-2.0.4.tgz
 #where 2.0.4 is the current version
 $cd mongodb-linux-i686-2.0.4
 Important tip: By default, MongoDB stores data in the /data/db directory. This has to be created manually, but only once. To do so, in a terminal type:
 $sudo mkdir -p /data/db/
 #We have to grant access to the directory for our username
 #Where USERNAME is the operating system username
 $sudo chown USERNAME /data/db

 The next step is to run the server. This has to be done through a terminal window and is essential each time you reboot your OS:
 #In a terminal, navigate to the previously extracted mongodb folder. For this example it is on the Desktop
 $cd Desktop
 $cd mongodb-linux-i686-2.0.4
 #Remember that the version number may vary
 #Navigate to the bin directory and start the mongod server
 $cd bin
 $./mongod
 You should see a series of feedback lines. If you receive any errors indicating that the binary cannot be executed properly, you probably downloaded a 64-bit version which is not supported by your OS, or vice versa. Repeat the process by downloading the correct version.
 Do not close the current terminal window. Finally, in a new terminal window, start the MongoDB shell:
 #Navigate again to the bin directory and start the shell
 $cd Desktop/mongodb-linux-i686-2.0.4/bin
 $./mongo
By default, MongoDB connects to the 'test' database each time. This will be explained in the next section.

6.21.2          Documents, Data Types and Basic Commands

To recap, MongoDB, as a document-oriented storage system, replaces tables, rows and columns with documents. Each document includes one or more key–value pairs. Each document is part of a collection, and more than one collection can exist in a database. A collection can be seen as similar to a table in an RDBMS.
 Below is an example of the key–value pairs that constitute documents. A document can have any structure, as we can see in the examples below.
 {"key" : "value"}
 {"name" : "George"}
 #The previous example shows a single key "name" with the value of "George". A document may include multiple key/value pairs, as below
 {"name" : "George", "surname": "James"}
 {"name" : "John", "surname": "King", "email": "example@"}
 #Example documents of a collection with different fields
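Since MongoDB documents are essentially JSON objects, plain Python dictionaries model them directly. The sketch below (invented sample data) shows a 'collection' of documents with differing fields and a lossless JSON round-trip:

```python
import json

# Two documents in the same collection need not share the same fields.
collection = [
    {"name": "George", "surname": "James"},
    {"name": "John", "surname": "King", "email": "john@example.org"},
]

# A document serialises to JSON and back without loss.
restored = json.loads(json.dumps(collection[1]))
print(restored == collection[1])   # True
print("email" in collection[0])    # False: schema-free, so this is allowed
```

The absence of a fixed schema is exactly what lets the second document carry an email field the first one lacks.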

6.21.3        Data Types

Many data types can be part of a document in a MongoDB database, and some of them are listed below:
 Boolean: {"x" : true}. A boolean value can be either true or false.
 Null: {"x" : null}. It can be used to express a null value.
 Number: {"x" : 3.14} or {"x" : 3}.
 String: {"x" : "George"}. Any string can be used with this type.
 Date: {"x" : new Date()}. Dates are stored as milliseconds.
 Array: {"x" : ["1","2","3"]}. Lists of values can be stored as an array.
 Embedded document: {"x": {"name" : "George"}}. Documents can contain embedded documents in them, a document inside another document.

6.21.4         Embedding and Referencing

By embedding a document, we mean that a document is stored within another document, as explained in the table above. Referencing refers to storing the data in a separate collection (say collection A) and referencing it in another collection (say collection B). However, the MongoDB manual suggests that embedding documents inside a document is the preferred of the two approaches. The use of references requires extra queries to retrieve data from other collections, slowing the query.
 Any of these data types can be used in a document. As there is no schema in a document, data types do not have to be predefined. Examples will be shown later.
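The trade-off between embedding and referencing can be sketched in Python (hypothetical customer/address data): the embedded form answers a question with one lookup, while the referenced form needs a second lookup into the other collection.

```python
# Embedded: the address document lives inside the customer document.
customer_embedded = {
    "name": "George",
    "address": {"street": "Oxford", "postcode": "SO234"},
}
street = customer_embedded["address"]["street"]          # one lookup

# Referenced: the address lives in its own collection; the customer
# stores only a reference, so reading it costs an extra query.
addresses = {"addr1": {"street": "Oxford", "postcode": "SO234"}}
customer_ref = {"name": "George", "address_id": "addr1"}
street_via_ref = addresses[customer_ref["address_id"]]["street"]  # two lookups

print(street == street_via_ref)  # True
```

Both forms yield the same answer; the difference is the number of round trips needed to assemble it, which is why the manual favours embedding.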
Having started the database server and the shell, we can now start using the database. But first, here are some general navigational tips to start with:
1.   Commands related to collections have the syntax db.COLLECTION_NAME.COMMAND(). The db variable is the currently used database.
2.   To switch between databases, we use the command use Name_of_Database.
3.   To see which database is currently in use, execute the db command.
4.   The database is automatically created upon the creation of the first document.
5.   To identify which collections are in the database, we can use db.getCollectionNames().
6.   To see a list of useful database commands, type 'help' in the shell.
 Review examples of these commands below and then execute them for yourself in your terminal session. First of all, switch to the 'products' database, which will be used for our case study.
 Note: In all the examples, commands typed in the shell are indicated by '>'. Responses from the system do not include that sign.
 cloud@project:~/Desktop/mongodb-linux-i686-2.0.4/bin$ ./mongo
 MongoDB shell version: 2.0.4
 connecting to: test
 > db
 test
 > use products
 switched to db products
 > db
 products
 > db.getCollectionNames();
 [ ]
 Note: The last command returns an empty array as at the moment no collection exists in the database.
As with most databases, there are four basic types of operation we will need to complete. They are often described as CRUDing: create, read, update and delete. We will use MongoDB to do each of these tasks.

Create

T his operation adds a new document to a collection. To enter some new components in the database, we can create a local variable (one for each new document) and then insert it in the collection.
A s the shell is using JavaScript objects, we are going to use a variable to represent the document. In this example, we need to create two documents corresponding to two books that the company sells.
To create a variable, use this syntax:
book1 = {"isbn" : "123", "type" : "fictional", "price" : "50", "description" : "A fictional book"}
In the figure below, the shell returns the details of the variable once it is created.
For this example, create the two variables as shown below. Then, use the db.COLLECTION_NAME.insert(variable) command to store each one as a document in a collection.
Note: Data can also be inserted directly through the shell without defining the document as a variable. This will be covered later in the more advanced command section. (Note: Depending on your terminal settings, when you paste this in, you may only see the end of the line in the terminal.)
# Creating variable for book1
> book1 = {"isbn" : "123", "type" : "fictional", "price" : "50", "description" : "A fictional book"}
# System response indicating that the variable has been created
{
	"isbn" : "123",
	"type" : "fictional",
	"price" : "50",
	"description" : "A fictional book"
}
# Creating variable for book2
> book2 = {"isbn" : "124", "type" : "tech", "price" : "40", "description" : "Book for tech", "pages" : "150"}
# System response indicating that the variable has been created
{
	"isbn" : "124",
	"type" : "tech",
	"price" : "40",
	"description" : "Book for tech",
	"pages" : "150"
}
# Inserting the variables as documents in a collection. The collection will be automatically created once the first insert occurs
 > db.products.insert(book1);
 > db.products.insert(book2);
At the moment, we have created two documents in the 'products' collection, which was created as a result of the first insert. You may have noticed that the schema-free nature of MongoDB allows us to use different data elements in each document: the first document does not have the 'pages' element which exists in the second document, for example.

Read

To start with, let's use the count operation to see how many documents are in a collection. The command is db.COLLECTION_NAME.count(). To retrieve the documents themselves, we can use find(). This command returns all the documents in a collection. db.COLLECTION_NAME.findOne() returns only one document from a collection.
We can also filter our results by specifying selectors as parameters to find. A parameter of the form {key : value} means 'where key == value'. This will be shown in more depth later in this tutorial.
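The selector semantics can be sketched outside MongoDB. The matches() helper below is hypothetical, not part of any MongoDB API; it simply mimics how an equality selector such as find({"isbn" : "123"}) tests each document:

```javascript
// A minimal sketch of MongoDB's equality selector: a document matches
// when every key/value pair in the selector equals the document's value.
function matches(doc, selector) {
  return Object.keys(selector).every(key => doc[key] === selector[key]);
}

const docs = [
  { isbn: "123", type: "fictional" },
  { isbn: "124", type: "tech" }
];

// Equivalent in spirit to find({"isbn" : "123"})
const result = docs.filter(d => matches(d, { isbn: "123" }));
console.log(result.length); // 1
```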
# Counting the number of documents in a collection
> db.products.count();
# Retrieving all the documents in a collection
> db.products.find();
{ "_id" : ObjectId("4e2c2a88ae1a6bc450311480"), "isbn" : "123", "type" : "fictional", "price" : "50", "description" : "A fictional book" }
{ "_id" : ObjectId("4e2c2a99ae1a6bc450311481"), "isbn" : "124", "type" : "tech", "price" : "40", "description" : "Book for tech", "pages" : "150" }
# Retrieving the first document in a collection. Alternatively, we can use parameters to specify the result; we will see this later.
> db.products.findOne();
{
	"_id" : ObjectId("4e2c2a88ae1a6bc450311480"),
	"isbn" : "123",
	"type" : "fictional",
	"price" : "50",
	"description" : "A fictional book"
}
# Retrieving the document in the collection where isbn equals 123
> db.products.find({"isbn" : "123"});
{ "_id" : ObjectId("4e2c2a88ae1a6bc450311480"), "isbn" : "123", "type" : "fictional", "price" : "50", "description" : "A fictional book" }
As you can see, the document has an added field, _id. This is automatically generated by the system once a document is inserted into the database to ensure that each document has a unique identity, similar to a primary key in relational databases. If we wish, we can define that object id manually, but for the moment we will make do with the system-generated version.

Update

Of course, we may need to update the product details in our database at some time. For example, the company buyer may wish to add stock quantity information for a book. Using the update operation, we need to pass two parameters:
(i)       The document to update
(ii)     The changes to be made
The command is db.COLLECTION_NAME.update({documentToChange}, newVariable).
 Initially, we have to use an object as before. We can use the same one (book1) or a new one which will include the same information.
Review the example below and execute the same commands for yourself. Then, try some changes of your own.
# Here we add a new element to the document, with the key 'quantity' and value '3'
> book1.quantity = 3
# Now we have to update the document which has isbn 123 with the updated object:
# db.COLLECTION_NAME.update({parameter1}, parameter2)
> db.products.update({"isbn" : "123"}, book1)
# Use the find command to query the changed document. In this example we are using a selector to filter our results; further explanation will be given later.
> db.products.find({"isbn" : "123"});
{
	"_id" : ObjectId("4e2c2a88ae1a6bc450311480"),
	"isbn" : "123",
	"type" : "fictional",
	"price" : "50",
	"description" : "A fictional book",
	"quantity" : 3
}
# The document has been successfully updated

Delete

We may need to remove documents from a collection. For this, we use db.COLLECTION_NAME.remove(). Parameters may be passed to specify the document to be deleted, or (CAUTION!) the command can be run without parameters to remove all the documents within a collection. We can also use the drop operation to remove a collection itself.
 Follow the sequence of the commands:
# Removing the document with isbn 123
> db.products.remove({"isbn" : "123"});
# Query the collection to confirm that the document has been removed
> db.products.findOne({"isbn" : "123"});
null
# As there is no entry for a document with isbn '123', the shell returns a null value.
# Now check the collections that are currently in the database and then drop "products".
> db.getCollectionNames();
[ "products", "system.indexes" ]
> db.products.drop();
# System confirmation of the action
true
> db.getCollectionNames();
[ "system.indexes" ]

6.21.5         Advanced Commands and Queries

This section contains examples of more advanced usage of the MongoDB command set. From the case study, we will try to implement these requirements:
Each product has its own ISBN, description, author and other characteristics, as well as a number of available sellers. Moreover, each book may be associated with a number of tags for easier searching. User reviews and comments are available. Finally, users should be able to find a book based on its characteristics, prices and other filtering options.

6.21.6        More CRUDing

We have dropped the collection, so a new one has to be created. Moreover, we need to create at least 20 new documents to represent books. This tutorial gives ten examples that can be used, but it is recommended that you create your own documents in order to familiarise yourself with the concept.
At this stage, data will be inserted directly in the shell without defining the document as an object. You may still create objects and insert them into the 'product' collection if you wish.
An example of how the different data types are used is shown below. Then, ten documents are listed after the code box.
{
 "isbn" : 123, # Number
 "title" : "Book 1", # String
 "author" : "Mike",
 "coathor" : "Jessy",
 "description" : "A fictional book",
 "pages" : 150,
 "publicationdate" : new Date('10-06-2010'), # Date
 "publisher" : { "name" : "Willeyy", "location" : "UK" }, # Embedded Document
 "tag-category" : ["fictional", "science"], # Array
 "sellers" : [ # Array of Documents
  {"sellername" : "bookstore", "price" : 50},
  {"sellername" : "seller4523", "price" : 60} ],
 "comments" : [
  {"email" : "", "comment" : "test test"} ]
}
# Now insert it in the collection following the instructions below. The insert command is:
> db.product.insert({"isbn" : 123, "title" : "Book 1", "author" : "Mike", "coathor" : "Jessy", "description" : "A fictional book", "pages" : 150, "publicationdate" : new Date('10-06-2010'), "publisher" : { "name" : "Springer", "location" : "UK" }, "tag-category" : ["fictional", "science"], "sellers" : [{"sellername" : "bookstore", "price" : 50}, {"sellername" : "seller4523", "price" : 60}], "comments" : [{"email" : "", "comment" : "test test"}]});

6.21.7        Sample Data Set

Insert these data sets into your MongoDB collection. You should then create at least another ten similar documents for yourself. We are using the book ISBN as the primary key.
# Book 1
> db.product.insert({"isbn" : 123, "title" : "Book 1", "author" : "Mike", "coathor" : "Jessy", "description" : "A fictional book", "pages" : 150, "publicationdate" : new Date('10-06-2010'), "publisher" : { "name" : "Willeyy", "location" : "UK" }, "tag-category" : ["fictional", "science"], "sellers" : [{"sellername" : "bookstore", "price" : 50}, {"sellername" : "seller4523", "price" : 60}], "comments" : [{"email" : "test@example.com", "comment" : "test test"}]});
# Book 2
> db.product.insert({"isbn" : 124, "title" : "Book 2", "author" : "George", "description" : "Comedy book about", "pages" : 120, "publicationdate" : new Date('12-06-2010'), "publisher" : { "name" : "Wil", "location" : "USA" }, "tag-category" : ["comedy", "family"], "sellers" : [{"sellername" : "bookstore", "price" : 50}], "comments" : [{"email" : "", "comment" : "test test"}]});
# Book 3
> db.product.insert({"isbn" : 125, "title" : "Book 3", "author" : ["Tony", "James"], "description" : "geographic", "pages" : 80, "publicationdate" : new Date('14-07-2010'), "publisher" : { "name" : "Wil", "location" : "USA" }, "tag-category" : ["geographic"], "sellers" : [{"sellername" : "John", "price" : 45}, {"sellername" : "Joe", "price" : 55}], "comments" : [{"email" : "", "comment" : "test test"}, {"email" : "", "comment" : "very good book"}]});
# Book 4
> db.product.insert({"isbn" : 128, "title" : "Book 4", "author" : ["Mike", "Jack"], "description" : "A physics book", "pages" : 150, "publicationdate" : new Date('10-06-2010'), "publisher" : { "name" : "Mike", "location" : "CHINA" }, "tag-category" : ["physics", "science"], "sellers" : [{"sellername" : "Mikee", "price" : 30}, {"sellername" : "jackk", "price" : 10}]});
# Book 5
> db.product.insert({"isbn" : 133, "title" : "Book 5", "author" : "Bob", "description" : "chemistry", "pages" : 100, "publicationdate" : new Date('2006-05-01'), "publisher" : { "name" : "Jason", "location" : "India" }, "tag-category" : ["chemistry"], "sellers" : [{"sellername" : "take", "price" : 80}]});
# Book 6
> db.product.insert({"isbn" : 137, "title" : "Book 6", "author" : ["Chuck", "Paul"], "description" : "Mathematics Computer Science", "pages" : 430, "publicationdate" : new Date('3-04-2001'), "publisher" : { "name" : "Willey", "location" : "UK" }, "tag-category" : ["mathematics", "computers", "science"], "sellers" : [{"sellername" : "pricess", "price" : 20}], "comments" : [{"email" : "", "comment" : "not good"}, {"email" : "", "comment" : "this book is great"}]});
# Book 7
> db.product.insert({"isbn" : 136, "title" : "Book 7", "author" : "George", "description" : "Crime", "pages" : 120, "publicationdate" : new Date('20-05-2000'), "publisher" : { "name" : "Willey", "location" : "UK" }, "tag-category" : ["crime", "action"], "sellers" : [{"sellername" : "bookstore3", "price" : 40}, {"sellername" : "book1231", "price" : 50}], "comments" : [{"email" : "test1@example.com", "comment" : "good book"}]});
# Book 8
> db.product.insert({"isbn" : 138, "title" : "Book 8", "author" : ["Tony", "Jack"], "description" : "Drama", "pages" : 80, "publicationdate" : new Date('24-07-2010'), "publisher" : { "name" : "John", "location" : "Canada" }, "tag-category" : ["drama", "thriller"], "sellers" : [{"sellername" : "John", "price" : 55}, {"sellername" : "Joesw", "price" : 65}], "comments" : [{"email" : "", "comment" : "test is good"}, {"email" : "", "comment" : "very good book"}]});
# Book 9
> db.product.insert({"isbn" : 140, "title" : "Book 9", "author" : ["Mike", "George"], "description" : "Architect", "pages" : 150, "publicationdate" : new Date('10-06-2010'), "publisher" : { "name" : "Take", "location" : "CHINA" }, "tag-category" : ["architecture", "science"], "sellers" : [{"sellername" : "Mike", "price" : 30}, {"sellername" : "jack", "price" : 10}]});
# Book 10
> db.product.insert({"isbn" : 141, "title" : "Book 10", "author" : "Bob", "description" : "English", "pages" : 100, "publicationdate" : new Date('2006-05-01'), "publisher" : { "name" : "Mikes", "location" : "Sweden" }, "tag-category" : ["English", "language"], "sellers" : [{"sellername" : "Paul", "price" : 60}]});

6.21.8        More on Deleting Documents

It is important to understand that once data is removed, there is no undo of the action or recovery of the removed documents.
An example we could use from the case study is that the books of the author "Tony" have been suspended and are not available for sale any more. This example will delete any documents from the product collection where the author is "Tony". If you have inspected the data set above, you should have noticed that two documents include this author: books 3 and 8.
> db.product.count();
10
> db.product.remove({"author" : "Tony"});
> db.product.count();
8
At the beginning of the process, there were ten documents in the collection. After removing those where the value of author was "Tony", only eight documents were left in the collection. Note that the authors were listed inside an array, and yet the query mechanism still matched them efficiently.
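This array-matching behaviour can be sketched in plain JavaScript. The fieldMatches() helper below is hypothetical, not MongoDB code; it shows how a selector such as {"author" : "Tony"} can match both a plain string field and any element of an array field:

```javascript
// Sketch of how {"author" : "Tony"} matches a plain string field
// as well as any element inside an array field.
function fieldMatches(value, wanted) {
  return Array.isArray(value) ? value.includes(wanted) : value === wanted;
}

const books = [
  { isbn: 125, author: ["Tony", "James"] }, // array of authors
  { isbn: 133, author: "Bob" },             // single author
  { isbn: 138, author: ["Tony", "Jack"] }
];

// Equivalent in spirit to remove({"author" : "Tony"}):
// keep only the books that do NOT match.
const remaining = books.filter(b => !fieldMatches(b.author, "Tony"));
console.log(remaining.map(b => b.isbn)); // [ 133 ]
```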
Over to you: To familiarise yourself with the process, use any other criteria to remove documents. Once finished, reinsert the removed documents.

6.21.9        More on Updating Documents

A document that is stored in the database can be changed at any time. As previously explained, the update method requires two parameters: the document to change and the new values. There are several ways to replace a document, and these are explained in this section:
      The simplest way to update a document was illustrated in the basic commands. That method is useful when changes need to be made to the schema of the document.
      The second way is to use modifiers in cases where only specific parts of the document will be updated. This section presents the most important modifiers.

6.21.10       The Modifiers $set/$unset
These are used to set the value of a key; if the key does not exist, it is created automatically. $unset is used to delete a given field. Let us see some examples:
db.COLLECTION_NAME.update({documentToChange}, {"$set" : {values...}})
db.COLLECTION_NAME.update({documentToChange}, {"$unset" : {fieldToRemove : 1}})
Note: The number 1 against fieldToRemove is required as a positive action.
When a book is sold, appropriate actions should be taken. Consider the need for an update where one stock item is sold. For this, we can use $set. Moreover, $set can be used to change the type of a value, for instance, changing a string to an array. We will use $unset to delete the comments field, just for demonstration purposes.
> db.product.find({"isbn" : 128});
{ "_id" : ObjectId("4e302e849d72be29a7d749e9"), "isbn" : 128, "title" : "Book 4", "author" : [ "Mike", "Jack" ], "description" : "A physics book", "pages" : 150, "publicationdate" : ISODate("1970-01-01T00:00:00Z"), "publisher" : { "name" : "Mike", "location" : "CHINA" }, "tag-category" : [ "physics", "science" ], "sellers" : [ { "sellername" : "Mikee", "price" : 30 }, { "sellername" : "jackk", "price" : 10 } ] }
# The document with isbn 128 has two sellers at the beginning
# Once a book is sold, the associated entry should be removed as illustrated below
> db.product.update({"isbn" : 128}, {"$set" : {"sellers" : [ { "sellername" : "Mikee", "price" : 30 } ]}});
> db.product.find({"isbn" : 128});
{ "isbn" : 128, "_id" : ObjectId("4e302e849d72be29a7d749e9"), "author" : [ "Mike", "Jack" ], "description" : "A physics book", "pages" : 150, "publicationdate" : ISODate("1970-01-01T00:00:00Z"), "publisher" : { "name" : "Mike", "location" : "CHINA" }, "sellers" : [ { "sellername" : "Mikee", "price" : 30 } ], "tag-category" : [ "physics", "science" ], "title" : "Book 4" }
# The query returns the updated and correct document
# Now we will remove the comments field from another document with the $unset command. First retrieve the document to see that the comments field exists, and then use the $unset command
> db.product.find({"isbn" : 124})
{ "isbn" : 124, "title" : "Book 2", "author" : "George", "description" : "Comedy book about", "pages" : 120, "publicationdate" : ISODate("1970-01-01T00:00:00Z"), "publisher" : { "name" : "Wil", "location" : "USA" }, "tag-category" : [ "comedy", "family" ], "sellers" : [ { "sellername" : "bookstore", "price" : 50 } ], "comments" : [ { "email" : "", "comment" : "test test" } ] }
> db.product.update({"isbn" : 124}, {"$unset" : {"comments" : 1}});
> db.product.find({"isbn" : 124})
{ "isbn" : 124, "_id" : ObjectId("4e30801e9889f6cdcf761296"), "author" : "George", "description" : "Comedy book about", "pages" : 120, "publicationdate" : ISODate("1970-01-01T00:00:00Z"), "publisher" : { "name" : "Wil", "location" : "USA" }, "sellers" : [ { "sellername" : "bookstore", "price" : 50 } ], "tag-category" : [ "comedy", "family" ], "title" : "Book 2" }
# The comments field was completely removed from the document
Try to update at least two more documents by modifying (adding/removing) the different sellers or changing the names of author(s) or others.

$inc

It is similar to $set, but it is strictly used only for numbers. It increases an existing number by adding a given number to it. It is regularly used for analytics, such as updating the visitor counter of a website. The case study does not refer to these issues, but a suitable example could be to change the pages of a book:
# Using the update query to increment the pages of a particular book. The pages were 150 (see the previous query) and are now incremented by 205
> db.product.update({"isbn" : 128}, {"$inc" : {"pages" : 205}});
> db.product.find({"isbn" : 128});
{ "isbn" : 128, "_id" : ObjectId("4e302e849d72be29a7d749e9"), "author" : [ "Mike", "Jack" ], "description" : "A physics book", "pages" : 355, "publicationdate" : ISODate("1970-01-01T00:00:00Z"), "publisher" : { "name" : "Mike", "location" : "CHINA" }, "sellers" : [ { "sellername" : "Mikee", "price" : 30 } ], "tag-category" : [ "physics", "science" ], "title" : "Book 4" }
# The update was successful and the pages are currently 355

6.21.11        Querying Documents

The find method is used to search for documents in a collection. It is possible to define more than one key/value pair as a parameter to search on. For instance, a customer wants to find all the books that are in the 'science' category, and then the one among them which was written by George. Moreover, we can specify which keys will be returned. For this example, we will return only the title, ISBN and author(s) of the book(s).
The command we will use is db.COLLECTION_NAME.find({condition});. For specifying which keys to return, use the extended command:
db.COLLECTION_NAME.find({condition}, {key : 1});
The number 1 is a true/false setting used to define that a key should be returned; the number 0 is used to define that a key should not be returned.
# In this query we want to find science books and present only their isbn, author and title
> db.product.find({"tag-category" : "science"}, {"isbn" : 1, "author" : 1, "title" : 1, "_id" : 0});
{ "isbn" : 123, "title" : "Book 1", "author" : "Mike" }
{ "isbn" : 128, "title" : "Book 4", "author" : [ "Mike", "Jack" ] }
{ "isbn" : 137, "title" : "Book 6", "author" : [ "Chuck", "Paul" ] }
{ "isbn" : 140, "title" : "Book 9", "author" : [ "Mike", "George" ] }
# Then we use another condition to query the books that are in the science category and also written by George. In this example we don't want to return the "_id" key.
> db.product.find({"tag-category" : "science", "author" : "George"}, {"isbn" : 1, "author" : 1, "title" : 1, "_id" : 0});
{ "isbn" : 140, "title" : "Book 9", "author" : [ "Mike", "George" ] }

Query Criteria and Operators

Operator   Symbol   Condition
$lt        <        less than
$lte       <=       less than or equal
$gt        >        greater than
$gte       >=       greater than or equal
$ne        !=       not equal
$in                 matches any value in a given list
$or                 one or other condition
Using the query criteria and operators (or combinations of them) in the table above, examples are presented below to demonstrate their usage. All the examples relate to the case study, where customers use the search engine with different criteria to find their books.
# Using $gte and $lte to look for books with a page count greater than or equal to 20 and less than or equal to 80, also specifying what to return
> db.product.find({"pages" : {"$gte" : 20, "$lte" : 80}}, {"isbn" : 1, "_id" : 0});
{ "isbn" : 125 }
{ "isbn" : 138 }
# Create a variable with a date and use it to find the books published before that date
> thedate = new Date("01/01/2009");
> db.product.find({"publicationdate" : {"$lt" : thedate}}, {"isbn" : 1, "_id" : 0});
# Using $in to match books whose page count is one of the listed values
> db.product.find({"pages" : {"$in" : [80, 100]}}, {"isbn" : 1, "_id" : 0});
{ "isbn" : 125 }
{ "isbn" : 133 }
{ "isbn" : 138 }
{ "isbn" : 141 }
# Using $or to find books that are written by Tony or are in the science category
> db.product.find({"$or" : [{"author" : "Tony"}, {"tag-category" : "science"}]}, {"isbn" : 1, "_id" : 0});
{ "isbn" : 123 }
{ "isbn" : 125 }
{ "isbn" : 128 }
{ "isbn" : 137 }
{ "isbn" : 138 }
{ "isbn" : 140 }
Limits, Skips and Sorting
To extend the query options, this section presents limits, skips and sorting on the returned results of a query.
 Limit sets the maximum number of results that will be returned.
 Skip skips the given number of results and then returns the rest. If the collection has fewer documents than the given number, nothing will be returned.
 Sort sorts documents on a given key in ascending (1) or descending (−1) order.
 See the examples below and then try some similar queries for yourself:
 #Requiring all results (therefore the {}) from the collection and limit to 3
>  nd({},{“isbn”:1, “_id”:0}).limit(3);  { “isbn” : 123 }
 { “isbn” : 128 }
 { “isbn” : 140 }
 #Requiring all results from the collection and skipping the fi  rst 5 while returning the rest in the collection
 >  nd({},{“isbn”:1, “_id”:0}).skip(5);
 { “isbn” : 137 }

{ "isbn" : 136 }
{ "isbn" : 141 }
{ "isbn" : 125 }
{ "isbn" : 138 }
# Requiring all results (hence the {}) from the collection, sorted by number of pages descending
> db.product.find({}, {"title" : 1, "pages" : 1, "_id" : 0}).sort({pages : -1});
{ "title" : "Book 6", "pages" : 430 }
{ "title" : "Book 1", "pages" : 150 }
{ "title" : "Book 4", "pages" : 150 }
{ "title" : "Book 9", "pages" : 150 }
.........
{ "title" : "Book 3", "pages" : 80 }
{ "title" : "Book 8", "pages" : 80 }
# Finally, we will combine all the query options together
> db.product.find({}, {"title" : 1, "pages" : 1, "_id" : 0}).sort({pages : -1}).limit(3).skip(2);
{ "title" : "Book 4", "pages" : 150 }
{ "title" : "Book 9", "pages" : 150 }
{ "title" : "Book 2", "pages" : 120 }
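Chaining skip and limit in this way is the usual recipe for paginating results. The arithmetic can be sketched in plain JavaScript (the page() helper and the sample list below are illustrative, not MongoDB code):

```javascript
// Page n of results with pageSize items per page: drop the earlier pages,
// then cap the count -- mirroring .skip((n - 1) * pageSize).limit(pageSize).
function page(results, n, pageSize) {
  const start = (n - 1) * pageSize;
  return results.slice(start, start + pageSize);
}

const isbns = [123, 124, 125, 128, 133, 136, 137, 138, 140, 141];
console.log(page(isbns, 2, 4)); // [ 133, 136, 137, 138 ]
```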
To shut down the MongoDB client, use the 'exit' command.
To shut down the MongoDB server, use the Ctrl + C keys.

Useful Extra Resources

We have only just got started with MongoDB in these notes. We hope that there is enough here to give you an understanding of the key principles and to help identify how MongoDB is different from an RDBMS.
If you wish to find out more, some useful resources include:
      Chodorow, K., Dirolf, M.: MongoDB: The Definitive Guide: Powerful and Scalable Data Storage. O'Reilly Media, Sebastopol (2010)
      MongoDB: MongoDB Official Website (2011). Accessed 1 Sept 2011
      MongoDB (2011b) Quickstart Unix. Available at

6.22         Review Questions

 The answers to these questions can be found in the text of this chapter.
      What does ACID stand for, and how sacrosanct are each of the elements in NoSQL databases?
      What was Cassandra designed for in terms of the types of data it stores?
      What is meant by horizontal scaling?
      How might the cloud be used as part of a disaster recovery strategy?
      What types of data are ideal for MongoDB, and what for Cassandra?

6.23           Group Work Research Activities

 These activities require you to research beyond the contents of the book and can be tackled individually or as a discussion group.

6.24          Discussion Topic 1

Once you have become conversant with one or other of the databases in the tutorials of this chapter, attempt to draw up a SWOT analysis to see what strengths, weaknesses, opportunities and threats there may be from adopting the database in an organisation.

6.25          Discussion Topic 2

Try to think of useful criteria for comparing different types of database to help you decide which might be the most appropriate for a given application. Two examples of criteria might be scalability and performance, which we have already referred to in the text above.
Acknowledgements   Much of the tutorial material was initially created by Antonis Petrou as an output from his MSc dissertation at Sheffield Hallam University. He was an excellent student to supervise and achieved an excellent final mark for his work. I am grateful for his permission to use portions of his work. Peter Lake


References

Hewitt, E.: Cassandra: The Definitive Guide. O'Reilly Media, Sebastopol (2010). Print ISBN 978-1-4493-9041-9, Ebook ISBN 978-1-4493-9042-6
Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., Gruber, R.E.: Bigtable: a distributed storage system for structured data. ACM Trans. Comput. Syst. 26(2), Article 4 (2008). doi:10.1145/1365815.1365816
Brewer, E.: Pushing the CAP: strategies for consistency and availability. Computer 45(2), 23–29 (2012)