Since the IP geographic data operates on ranges of IP addresses, we cannot simply join the log data to the IP data (using an RDBMS or MapReduce), and we cannot use a simple key/value cache to look up the IP data. Currently the entire IP2Location DB-5 data set consumes 1+ GB of memory when loaded into a Java object array. While we can fit this in memory on the current infrastructure, we will have issues if the data grows significantly beyond its current size. Therefore we have to look at alternatives for looking up the data without loading it all into memory.

We cannot use a middleware or shared database solution, since these would quickly become overwhelmed by requests from our cluster. Likewise, we cannot afford to take a network hit for every look-up, and we currently cannot batch look-up requests to reduce network hits. We need a shared-nothing architecture, and therefore copying a Java embedded database locally, pulling our small data close to our big data, seems the best approach. This page evaluates several Java embedded databases as a fit for this use case.

This test is Hadoop/Cascading specific in that it uses the standard set of jar files included in the class path of Apache Hadoop 0.20.2 and Cascading 1.2. For example, we used the hsqldb-1.8.0.10.jar bundled with Hadoop. Leveraging an embedded DB which is not only thread-safe but also concurrent will allow re-use of the in-process memory cache across multiple threads via the Hadoop Task JVM Reuse feature.

Databases Evaluated

Each of the databases was set up, configured, and queried using identical SQL scripts and JDBC code, except where specifically noted in the test results section. We also require the exclusive use of a JDBC interface to the database, so we can easily swap DBs based on performance and stability.

The following JDBC code utilizes prepared statements and bind variables to select the first matching row:

```java
PreparedStatement ps = conn.prepareStatement()
```

The look-up test iterates for each test execution, each time generating a random target IP:

```java
Long random = (long)(Math.random() * Integer.MAX_VALUE)
```

then testing the retrieval of the fields and verifying the result.

The test was run on the following platform configuration:

Hardware Overview:
Capacity: 500.11 GB (500,107,862,016 bytes)
Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03-307, mixed mode)

Modifications to Tests

Some tests had to be modified, either due to time constraints (the test could not run in a reasonable time) or because the test crashed or hung. For the in-memory baseline, we had to run the test with the JVM parameter -Xmx2G to ensure all the data could be loaded into memory. Since we are not using a DB, the query logic is simply to iterate through the array to find the first row with IP_TO >= the target IP. Please note: we cannot use a standard binary search algorithm, since we do not know the specific value we are searching for a priori. We considered writing a binary search variant that finds the first element >= the target IP; however, we could not find a library which does this, and we did not have the time to write and test one ourselves.

Test results for the in-memory baseline and each of the embedded DBs follow:
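The in-memory baseline look-up — find the first row whose IP_TO is >= the target IP — can be sketched as a linear scan, alongside the "first element >= target" binary-search variant discussed above. This is an illustrative sketch only: the `long[]` representation (IP_TO values sorted ascending) and the method names are assumptions, not the original test code.

```java
// Sketch of the in-memory look-up, assuming the ranges are stored as a
// long[] of IP_TO values sorted in ascending order (representation assumed).
public class IpLookup {

    // Baseline: linear scan for the index of the first row with IP_TO >= target.
    static int linearScan(long[] ipTo, long target) {
        for (int i = 0; i < ipTo.length; i++) {
            if (ipTo[i] >= target) {
                return i;
            }
        }
        return -1; // target is above every range
    }

    // Binary-search variant: index of the first element >= target
    // (a "lower bound"), avoiding a full scan of the array.
    static int lowerBound(long[] ipTo, long target) {
        int lo = 0, hi = ipTo.length; // search window is [lo, hi)
        while (lo < hi) {
            int mid = (lo + hi) >>> 1; // unsigned shift avoids overflow
            if (ipTo[mid] >= target) {
                hi = mid;      // mid may be the answer; keep it in the window
            } else {
                lo = mid + 1;  // everything at or below mid is too small
            }
        }
        return lo < ipTo.length ? lo : -1;
    }

    public static void main(String[] args) {
        long[] ipTo = {100L, 200L, 300L, 400L};
        for (long target : new long[]{0L, 150L, 300L, 999L}) {
            System.out.println(target + ": scan=" + linearScan(ipTo, target)
                    + " lowerBound=" + lowerBound(ipTo, target));
        }
    }
}
```

As an aside: when the IP_TO values are unique (as range end-points are), `java.util.Arrays.binarySearch` can serve the same purpose — a hit is the first element >= the key, and a miss returns `(-(insertion point) - 1)`, from which the same index can be recovered.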
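The look-up test loop described above (generate a random target IP, fetch, verify) might look like the following sketch. The iteration count and the `lookup` stand-in are hypothetical; only the random-IP expression comes from the original text.

```java
// Sketch of the look-up test driver. ITERATIONS and lookup() are assumed
// placeholders; the original harness's values are not known.
public class LookupDriver {
    static final int ITERATIONS = 1000; // assumed iteration count

    // Hypothetical stand-in for the real row fetch; a real test would
    // query the in-memory array or the embedded DB here.
    static long lookup(long ip) {
        return ip;
    }

    public static void main(String[] args) {
        for (int i = 0; i < ITERATIONS; i++) {
            // Random target in [0, Integer.MAX_VALUE), as in the original snippet.
            long random = (long) (Math.random() * Integer.MAX_VALUE);
            long result = lookup(random);
            // Verify the result is in the expected range.
            if (result < 0 || result >= Integer.MAX_VALUE) {
                throw new AssertionError("look-up returned out-of-range value");
            }
        }
        System.out.println("completed " + ITERATIONS + " look-ups");
    }
}
```

Note that `(long)(Math.random() * Integer.MAX_VALUE)` yields values in [0, 2^31 - 1), i.e. only the lower half of the IPv4 address space (up to 127.255.255.255), which is worth keeping in mind when interpreting the benchmark.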