nunojob:~ dscape/08$ echo The Black Sheep

Posts tagged ‘Database’

IBM DB2 Express-C for Mac

DB2 for Mac

It's official. The FREE version of DB2 is available for download for Mac.

No more excuses along the lines of "I don't want yet another virtual machine just to run this, so I won't even try it."

I know I'm biased, since I'm part of the DB2 team. The analysis below is heavily influenced by my day-to-day work, but what I write here is my personal opinion.

IBM doesn't build DB2 for people like us, who run some nice little sites with a few thousand hits a day (with luck). They build it to support solutions at a gigantic scale, some of them with heavyweight XML standards from government, financial, and health-care organizations that transact enormous amounts of information every day. These companies not only have to mine the data but also query it very intensively. I'm talking about the largest American companies, and I'm not just reciting this: I heard it from DBAs at Merrill Lynch, Barclays, the UN, Morgan Stanley, and others. What do they have in common? They all use DB2 and are interested in using the product's XML features.

By the way, nobody believes you can get good performance out of XML, right? Well, IBM has smart people (like me, lol) working on making that possible. I'll leave this link to whet your appetite. Of course the performance won't match SQL, but compared with the XML parsers you've been using… hehe. Try it. :P

Given the typical DB2 customer I've just described, it's easy to see that it isn't built to be sold to José or Joaquim, or even to Josefina's small business. That is exactly why the Express-C version is free for everyone. Its limits are a maximum of 16 GB of RAM and 4 processors on the machine.

If this sounds reasonable:

DB2 for Mac Download

Afterwards, tell me how it went, and if you need a few tips you can always get in touch.

Footnote: For those interested, if you are developing something with an unusual XML standard, there's a good chance that standard is supported by IBM, and it can be looked up here.

Mondrian Multidimensional K-Anonymity in Ruby

Article: Mondrian Multidimensional K-Anonymity

Lame Ruby Implementation:

# ==================================================================================
# anonymization: group.rb
# ==================================================================================
ENVIRONMENT = 'release' # set to 'debug' to load ruby-debug and print trace output

require 'set'
require 'rubygems'
require 'ruby-debug' if ENVIRONMENT == 'debug'

# ==================================================================================
# class group
#
# usage:
#  require 'group'
#
#  g = Group.new <quasi_ids>, <filename>
#  g.anonymize <k>
#
# example:
#
# lefevre.db
# 
#     0             2   <-- quasi_ids
#
#   |age|  sex  | zipc | disease      |
#---+---+-------+------+--------------+--
# 0 | 25  Male    53711 Flu           |
# 1 | 25  Female  53712 Hepatitis     |
# 2 | 26  Male    53711 Bronchitis    |
# 3 | 27  Male    53710 Broken_Arm    |
# 4 | 27  Female  53712 AIDS          |
# 5 | 28  Male    53711 Hang_Nail     |
#---+---+-------+------+--------------+--
#
# irb
#  >> require 'group'
#  >> g = Group.new [0,2], 'lefevre.db'
#  >> g.anonymize 2, 'degen'
# ==================================================================================
class Group
  # create a setter method for @tuples, @filename
  # so that g.tuples = x works
  attr_writer :tuples, :filename
  
  @@debug = { 'best_attribute' => ENVIRONMENT == 'debug',
              'intersection'   => ENVIRONMENT == 'debug',
              'split'          => ENVIRONMENT == 'debug',
              'ordering'       => ENVIRONMENT == 'debug',
              'vars'           => ENVIRONMENT == 'debug',
              'args'           => ENVIRONMENT == 'debug'
           }
  # ================================================================================
  # to create a new group with Group.new
  # ================================================================================
  # needs to remove the full_ids from the read.
  def initialize(quasi_ids, filename, depth=0, available_ids=nil)
    # if no valid attributes are given, the quasi-identifiers are used;
    # clone them so later deletes do not also mutate quasi_ids
    available_ids = quasi_ids.clone if available_ids.nil?
    
    # initialize the instance vars
    @tuples = []
    @quasi_ids = quasi_ids
    @available_ids = available_ids
    @depth = depth

    # '*wc' serves as a wildcard so that no file is read on recursion
    @filename = (filename == '*wc') ? nil : filename
    
    if @@debug['args'] and @depth == 0
      debug_puts "args : file => #{@filename}"
      debug_puts "args : k => #{@k}"
      debug_puts "args : quasi_ids => #{@quasi_ids.to_s}"
    end
              
    
    # run the read and backup procedures
    read
  end
  
  # ================================================================================
  # anonymization
  # ================================================================================
  def anonymize(k, heuristic='degen', partial_order=[])
    
    if @@debug['vars']
      #debug_puts "dvars : @tuples #{@tuples}" 
      debug_puts "dvars : @available_ids #{@available_ids},"
      debug_puts "dvars : @depth #{@depth}"
    end

    # stop case
    if isnt_splittable? k
      debug_puts "dsplit: no split available for k-level #{k} with size" +
                 " #{@tuples.size}" if @@debug['split']

      # sort and generalize remaining attributes
      @available_ids.each do |attribute|
        sort attribute
        generalize attribute
      end

      # exit
      return
    end

    # where and in what attribute should we split
    # these functions have a heavy effect on the usefulness of the information
    # for the k-anonymity table
    split_attribute  = find_split_attribute @available_ids, heuristic, partial_order
    split_pos        = find_split_position split_attribute

    # create the groups for the 
    # recursion
    group1 = Group.new @quasi_ids, '*wc', @depth + 1, @available_ids.clone
    group2 = Group.new @quasi_ids, '*wc', @depth + 1, @available_ids.clone

    # split at the given position
    split split_pos, group1, group2

    if split_groups_satisfy_k_anonymity?(k,group1,group2)
  
      debug_puts "dsplit: no more split available with attribute" + 
          " #{split_attribute} (g1: #{group1.size}, g2: #{group2.size})" if @@debug['split']

      # generalize by split_attribute and then remove it from the available
      # attributes array
      generalize split_attribute
      @available_ids.delete split_attribute

      # anonymize remaining available attributes
      anonymize k, heuristic, partial_order

    else # splitting successful
      debug_puts "dsplit: splitting on attribute #{split_attribute} at" +
                 " position #{split_pos} of #{@tuples.size}" if @@debug['split']
      
      # assign the two groups to this instance
      @group1 = group1
      @group2 = group2
      
      group1.anonymize k, heuristic, partial_order
      group2.anonymize k, heuristic, partial_order
      
      #@tuples = []
    end
  end

  # ================================================================================
  # io and backup related 
  # ================================================================================
  # read @tuples from @filename
  def read
    unless @filename.nil?
      f = File.open @filename
      f.each_line do |line|
        @tuples << line.rstrip.split("\t\t")
      end
      f.close
    end
  end
  

  # reset the instance so it can be reused
  def reset
    # restore the attribute list from the quasi-identifiers
    @available_ids = @quasi_ids.clone
    @tuples = []
    read
  end

  # ================================================================================
  # overrides
  # ================================================================================
  # number of tuples
  def size
    @tuples.size
  end

  # ================================================================================
  # aux
  # ================================================================================
  # to_s
  
  def to_s
    
    str = ""
    
    unless @tuples.empty?
      @tuples.each do |line| 
        @tuples[0].size.times { |i| str << line[i].to_s + "\t\t"}
        str << "\n"
      end
    end

    str
  end

  # prints a yaml representation of the internal object
  def to_y
    require 'yaml'
    puts to_yaml
  end

  private
  
  def debug_puts(message)
    # indent the message by two spaces per recursion level
    indent = '  ' * @depth
    puts indent + message
  end

  # ================================================================================
  # aux for anonymization
  # ================================================================================
  # finds the attribute with the largest range. According to LeFevre this is a good
  # heuristic to find the attribute to split on
  def find_split_attribute(attributes_list, heuristic, partial_order)

    debug_puts "dorder: choosing from" + 
               " #{attributes_list.to_s}" if @@debug['ordering']

    best_attrib = -1
    best_attrib_count = 0.0

    attributes_list = find_minimal_elements partial_order, attributes_list

    debug_puts "dorder: minimal list is" +
               " #{attributes_list.to_s}" if @@debug['ordering']

    attributes_list.each do |attribute|
      values = @tuples.map{|t| t[attribute]}.to_set
  
      # degen heuristic: split on the attribute with the most degeneracy,
      # i.e. the highest average number of tuples per distinct value
      if heuristic == 'degen'
        degeneracy = @tuples.size.to_f / values.size.to_f
        if degeneracy > best_attrib_count or best_attrib == -1
          best_attrib = attribute
          best_attrib_count = degeneracy
        end
      elsif heuristic == 'single'
        if values.size < best_attrib_count or best_attrib == -1
          best_attrib = attribute
          best_attrib_count = values.size
        end
      else #default
        if values.size > best_attrib_count
          best_attrib = attribute
          best_attrib_count = values.size
        end
      end
    end

    debug_puts "dbest : best attribute is #{best_attrib} with" + 
               " count #{best_attrib_count}" if @@debug['best_attribute']
    
    return best_attrib
  end
  
  #  returns the position of the leftmost or rightmost median element.
  #  used to split in lhs and rhs 
  def find_split_position(attribute_id)
    sort attribute_id
    
    median_pos = @tuples.size / 2
    median = @tuples[median_pos][attribute_id]
    
    split_pos_high = median_pos
    split_pos_low  = median_pos
    
    # split point correspond to highest index that has median value
    split_pos_high += 1 while (@tuples.size >= split_pos_high + 2) and
                              (@tuples[split_pos_high + 1][attribute_id] == median)
      
    high_smaller_group_size = 
            [split_pos_high + 1, @tuples.size - split_pos_high - 1].min

    # split point correspond to lowest index that has median value
    split_pos_low -= 1 while (split_pos_low > 1) and
                              (@tuples[split_pos_low - 1][attribute_id] == median)
    
    low_smaller_group_size = 
            [split_pos_low, @tuples.size - split_pos_low].min
    
    # choose the one with the largest group
    if high_smaller_group_size > low_smaller_group_size
      split_pos = split_pos_high
    else
      split_pos = split_pos_low - 1
    end
    
    return split_pos
  end
  
  # finds minimal elements from the list of the given attribute list according to
  # partial order specified in partial_order. partial_order contains all complete chains.
  def find_minimal_elements(partial_order, possible_elements)
    
    if partial_order.empty?
      debug_puts "dorder: no ordering specified" if @@debug['ordering']
      
      return possible_elements
    end

    # choose all possible_elements that arent in partial_order
    # those are minimal
    minimal_list = possible_elements.select { |element| !partial_order.flatten.member?(element) }
    
    # haskell goodies ^^
    # restrict partial_order to values in possible_elements
    restricted_partial_order = partial_order.map { |l| l.select { |element| possible_elements.member?(element) } }
    
    if @@debug['ordering']
      debug_puts "dorder: possible_elements list is" + 
                 " #{possible_elements.to_s}"
      debug_puts "dorder: partial_order list is" +
                 " #{partial_order.to_s}" 
      debug_puts "dorder: restricted_partial_order is" + 
                 " #{restricted_partial_order.to_s}"
    end

   non_zero_chains = restricted_partial_order.select { |chain| not chain.empty? }

   non_zero_chains.each do |c|
     candidate = c[0]
     
     minimal = !restricted_partial_order.any? do |chain|
        chain.member?(candidate) and chain[0] != candidate
     end
     
     if minimal and not minimal_list.member?(candidate)
       minimal_list << candidate
     end
   end

   return minimal_list
  end

  # replaces the attribute value with a [min, max] generalization that covers
  # all tuples. Expects tuples to be sorted by attribute.
  def generalize(attribute)
    min_val = @tuples[0][attribute]
    max_val = @tuples[-1][attribute]
    
    unless min_val == max_val
      @tuples.each do |t|
        t[attribute] = [min_val, max_val]
      end
    end
    
  end

  def split(split_pos, group1, group2)
    group1.tuples = @tuples[0..split_pos]
    group2.tuples = @tuples[split_pos+1..@tuples.size]
  end

  def sort(attribute)
    @tuples = @tuples.sort_by { |t| t[attribute] }
  end
  
  # ================================================================================
  # verbose conditions
  # ================================================================================
  def isnt_splittable?(k)
    k < 2 or group_cant_be_split_for_level?(k) or no_split_attributes_are_available?
  end
  
  def group_cant_be_split_for_level?(k)
    @tuples.size < 2*k
  end
  
  def no_split_attributes_are_available?
    @available_ids.empty?
  end
  
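  # NOTE: despite its name, this returns true when the split would VIOLATE
  # k-anonymity, i.e. when either resulting group has fewer than k tuples
  def split_groups_satisfy_k_anonymity?(k,group1,group2)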
    group1.size < k or group2.size < k
  end
end

# hack on array to display lists correctly
class Array
  def to_s
    "[" + self.join(',') + "]"
  end
end
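
To sanity-check the output of an anonymization run, it helps to verify the k-anonymity property directly: every combination of quasi-identifier values must be shared by at least k rows. The snippet below is a minimal sketch of that check. It is not part of group.rb: the helper `k_anonymous?` and the constants `RAW_ROWS` and `GENERALIZED_ROWS` are hypothetical names, and the generalized table is a hand-made example in the `[min, max]` style that `generalize` produces, not actual output of the class above.

```ruby
# Hypothetical helper (not part of group.rb): true when every combination of
# quasi-identifier values is shared by at least k rows.
def k_anonymous?(rows, quasi_ids, k)
  # group rows into equivalence classes keyed by their quasi-identifier values
  classes = rows.group_by { |row| quasi_ids.map { |i| row[i] } }
  # every equivalence class must contain at least k rows
  classes.values.all? { |members| members.size >= k }
end

# the sample table from the header comment: age, sex, zipcode, disease
RAW_ROWS = [
  ['25', 'Male',   '53711', 'Flu'],
  ['25', 'Female', '53712', 'Hepatitis'],
  ['26', 'Male',   '53711', 'Bronchitis'],
  ['27', 'Male',   '53710', 'Broken_Arm'],
  ['27', 'Female', '53712', 'AIDS'],
  ['28', 'Male',   '53711', 'Hang_Nail']
]

# hand-made generalization (an assumption for illustration): columns 0 and 2
# are replaced by [min, max] ranges shared by three rows each
GENERALIZED_ROWS = [
  [['25', '26'], 'Male',   ['53711', '53712'], 'Flu'],
  [['25', '26'], 'Female', ['53711', '53712'], 'Hepatitis'],
  [['25', '26'], 'Male',   ['53711', '53712'], 'Bronchitis'],
  [['27', '28'], 'Male',   ['53710', '53712'], 'Broken_Arm'],
  [['27', '28'], 'Female', ['53710', '53712'], 'AIDS'],
  [['27', '28'], 'Male',   ['53710', '53712'], 'Hang_Nail']
]

puts k_anonymous?(RAW_ROWS, [0, 2], 2)          # false: every (age, zipcode) pair is unique
puts k_anonymous?(GENERALIZED_ROWS, [0, 2], 2)  # true: two classes of three rows each
```

The check only looks at the quasi-identifier columns (0 and 2 here), so the sensitive disease column plays no part in it.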

Ruby on Rails vs Java

Software Engineer at Critical Software?

Last Monday I attended a job interview at Critical Software. I had some trouble finding the Tecmaia facilities and got there 30 minutes late! (I know, I need a GPS.)

They started the interview with some general questions about my background and the work I did at Mobicomp. Then came a more technical part, where I had to answer (as if it were an exam) questions about object-oriented design, databases, threading, Linux, C++, XML, UML, and MySQL. The final part of the interview assessed my psychological strengths, and I was able to speak for myself and tell them what I like to do.

This is the second interview I've had since the summer break. The first was at Edigma.com, but that one did not go so well. It's a shame, as I feel I would be a very good addition to that team. I know the responsibility was not entirely mine, as the interview was very badly run: it had no structure, they didn't take a single note about what I said, and they didn't have a script to follow. I could continue this list, as I feel strongly disappointed with them. Don't get me wrong, what they do is cool, but the recruiting process is not.
