TITLE: Learning Julia
DATE: 2021-04-10
AUTHOR: John L. Godlee
====================================================================


Julia is a programming language I've been wanting to learn for a
while. I've been encountering larger and larger datasets during my
PhD, both with terrestrial LiDAR data and huge species-by-site
diversity matrices. Running everything in R is becoming restrictive
when I want to crunch lots of data within a reasonable amount of
time.

Julia promises that it looks like Python, feels like Lisp, and runs
like C, supposedly straddling that trade-off between speed,
expressiveness, and generalisability.

I watched this Youtube tutorial to get me started on the basic
syntax, which is quite pleasant to read, and seems familiar to me
having a background in R with some Python and shell-scripting. I
also found that the official Julia documentation is pretty useful
for understanding some of the finer points in more detail.

 [Youtube tutorial]: https://www.youtube.com/watch?v=8h8rQyEpiZA
 [official Julia documentation]: https://docs.julialang.org/en/v1/

As my first practice I aimed to calculate the Shannon and Simpson
diversity indices for a huge species by site matrix.

I created the matrix in R and wrote it to a .csv file:

   mat <- matrix(sample(0:1000, 10^7, replace = TRUE), nrow =
100000)

   # Write matrix for use in Julia
   write.csv(mat, "mat.csv", row.names = FALSE)

Then in R I used the {vegan} package to calculate the Shannon and
Simpson diversity indices, and the {microbenchmark} package to time
how long it took:

   library(vegan)

   div <- function(x) {
     shannon <- diversity(x, "shannon")
     simpson <- diversity(x, "simpson")
     return(list(shannon, simpson))
   }

   microbenchmark(
     div(mat),
     times = 100)

The mean time to complete was 2167 milliseconds.

In Julia the overhead is a bit larger, just because I'm writing my
own functions for the diversity indices:

   # Packages
   using CSV
   using BenchmarkTools
   using Tables

   # Import matrix from .csv
   mat = CSV.File("mat.csv") |> Tables.matrix

   # Define Shannon function
   function shannon(x)
       xno = [i for i=x if i != 0]
       N = sum(xno)
       p = xno / N
       return -sum(p .* log.(p))
   end

   # Define Simpson function
   function simpson(x)
       N = sum(x)
       p = x / N
       return sum(p.^2)
   end

   # Iterate over columns
   function testfunc()
       shanout = []
       simpout = []
       for col in eachcol(mat)
           push!(shanout, shannon(col))
           push!(simpout, simpson(col))
       end
   end

   # Benchmark
   @benchmark testfunc()

The median time to complete was only 521 ms.

The main issue I'm having at the moment which is tripping me up a
lot is the way Julia handles reassigning variables. While in R I
could do something like this:

   x <- c(1,2,3)
   y <- x
   y[1] <- 10
   y != x

and have the last line evaluate as true, in Julia unless I use
copy() when reassigning the variable, y will continue to equal x:

   x = [1,2,3]
   y = x
   y[1] = 10
   y != x

   y = copy(x)
   y[1] = 100
   y != x