Contents

Contents

Learning Julia the Hard Way

This course is primarily designed for data scientists working in breeding and genetics. Others who want to use Julia for different purposes may also find it useful. We will explore the development cycle of a Julia project.

Preface

One can lift heavy weights as if they are light (in Chinese, 举重若轻), to demonstrate capability. One can also lift light weights as if they were heavy (in Chinese, 举轻若重). The latter is sometimes more meaningful, because:

What you will learn

The main goal of this course is to guide you through setting up a Julia simulation project, using packages from the JuliaBnG ecosystem.

I will also show you how to:

Note that these configurations may not be optimal. Feedback and suggestions are welcome.

The primary objective of this tutorial is to ensure you can run the simulation package on your own computer.

Conventions

Please feel free to click the links, especially the annotations.

Materials

Introduction

The task

The project we are addressing today is the design of a breeding program for the Fluffy sheep breeder. The breeding task is straightforward: you can select any number of parents and assign arbitrary weights to the two traits. Breeding will likely continue, but careful planning is required to rank among the top shepherds.

Setting up the environment

We will run simulations to test our breeding programs. This requires writing some scripts and tools in Julia.

Julia should already be installed on your laptop, but we will double-check to ensure the environment is ready for intensive tasks.

1. Git

It is recommended to have git installed and configured.

2. Terminal

We will test and run the project in a terminal. Linux and macOS users can use the built-in terminal. Windows users can use Windows Terminal and WSL2, if possible. Your terminal should be able to execute the julia and git commands.

3. Julia

Once your terminal is ready, regardless of your operating system, refer to Julia installation to ensure julia is correctly installed.

4. Julia editor

The Julia editor mentioned here is not the one you use to write code. It is the editor invoked by the @edit macro in the Julia REPL. Click here to learn how to inspect the implementation of functions.

5. Working directory

It is good practice to save your projects in a dedicated working directory, e.g., ~/projects. For example, our project, named flfSheep, will be saved in ~/projects/flfSheep.

The project

Best practices

Let's discuss how a simulation is conducted. Stabilizing Selection Out-of-Africa provides an example that many programmers dream of:

Sounds good? Let's reconsider.

A typical breeding simulation may consist of the following steps:

  1. Generate a historical population to create linkage disequilibrium (LD).

  2. Create a bottleneck to mimic the founder effect.

  3. Selection over several generations.

  4. Selection with updated breeding schemes, e.g., switching EBV from PBLUP to GBLUP.

One might think these steps could be standardized so that parameters are listed in a GUI. However, what if we change the nuclear population structure—for example, from a single large population to one consisting of many small flocks? What if we change the mating rules? Such scenarios are numerous.

Breeding practices vary by species, and practices for the same species may vary by country. The art of breeding, which changes across locations and time, makes script-based simulation more realistic. While established simulation procedures can be sophisticated, changing the simulation logic or methods can be difficult.

To be or not to be, that is the question.

This famous line is just one of many. Organizing these lines results in scripts of various genres and styles. A script is more versatile and powerful than a GUI.

Project analysis

We can take some figures from the Fluffy sheep breeder webpage:

It is likely that wool length has higher heritability than curliness.

These are my observations; please let me know if you observe different values.

Our goal is to develop a Julia simulation package to identify a breeding program that achieves the highest possible score.

Development of the package

Before we start, ensure your environment is set up as described in the previous section. We will work within the Julia REPL until the project is complete.

1. Create the project

In your terminal, navigate to the directory where you want to create the project:

cd ~/projects  # or your preferred project directory
julia          # to enter the Julia REPL

Press ] to enter package mode, and execute the following commands:

generate flfSheep

# Press <backspace> to exit package mode and return to the REPL
cd("flfSheep")  # Enter the project directory

# Press `]` to enter package mode again
activate .  # Activate the project we just created

# "(flfSheep) pkg>" will be displayed as the prompt
# Type `status` to view the project's status

# Press `<backspace>` to exit package mode
using flfSheep  # Load the project

Click the following link to view the project structure.

You will see a greet function in the src/flfSheep.jl file. Below is an example of what you might see in your terminal:

In the meantime, you may want to read about the almighty tab.

At this stage, you can use your favorite editor to modify the files in the project directory.

2. Add JuliaBnG packages

In the REPL, press ] to enter package mode, and run the following command:

add BnGStructs           # from JuliaBnG
add DataFrames
add FisherWright         # from JuliaBnG
add Random
add RelationshipMatrices # from JuliaBnG
add StatsBase

# Breeding.jl is not yet in the General registry.
# Add it using its URL:
add https://github.com/JuliaBnG/Breeding.jl  # from JuliaBnG

Click the next link to see what has changed.

3. Write a test function

Refer to the hello world project for detailed explanations and to ensure you have followed these steps correctly.

Create a file named src/tstBreeding.jl with the following content:

"""
  tstFlf()

Test script to identify an effective breeding program for the fluffy sheep.
"""
function tstFlf()
   @info "Starting the test function" 
end

In src/flfSheep.jl, replace the line:

greet() = print("Hello World!")

with:

using BnGStructs
using Breeding
using DataFrames
using FisherWright
using Random
using RelationshipMatrices
using StatsBase

include("tstBreeding.jl")

export tstFlf

At this point, although flfSheep doesn't yet perform complex tasks, you have successfully created a package and added its dependencies. This is your "Hello World" package. From here, you can begin developing your own breeding simulations—perhaps the "Hamlet" or "Macbeth" of breeding programs.

The rest of the tutorial is contained within the tstBreeding.jl file, as shown in the next section.

The simulation

I have provided a sample simulation program here with explanations. You can replace the contents of src/tstBreeding.jl in your project with this sample and modify it as needed.

"""
    founder_genotype(nid)

Generate genotypes and a linkage map for `nid` sheep.
"""
function founder_genotype(nid)
    @info "Return genotypes and a pedigree"

    # Generate a historical population using the `FisherWright` package
    ne = nid + 50   # Effective population size
    ng = 2000       # Number of generations of random mating
    mr = 1.0        # Mutation rate per Morgan per meiosis
    sheep = Sheep(ne) # A Sheep struct that defines name, population size, chromosomes, and M

    mts, cbp = fisher_wright(
        ne,  # This function generates LD among SNPs
        ng,  # at mutation-drift equilibrium
        sheep.chromosome, # Chromosome lengths downloaded from NCBI
        mr;
        M = sheep.M,  # 10⁸ bp per Morgan
    )

    hxy, lmp = muts2bitarray(mts, cbp; flip = true)

    # Sample an exclusive 20k SNP chip and 5k QTL. Sample 150 IDs.
    # - First, sample 150 IDs
    maf = 0.0
    hxy, lmp = sampleID(hxy, lmp, maf, nid)

    # - Sample chip SNPs and QTLs
    nchp, nqtl = 20_000, 5_000
    chip = SNPSet("chip", nchp; maf = 0.1, exclusive = true)
    qtl = SNPSet("qtl", nqtl; maf = 0.0, exclusive = true)
    hxy, lmp = sampleLoci(hxy, lmp, chip, qtl)
    lms = sum_map(sheep.chromosome)

    return hxy, lmp, lms
end

"""
    traits()

Define the traits of interest and their genetic VCV matrix.
"""
function traits()
    curliness = Trait("cn"; h² = 0.64, μ = 100.0, σₐ = 8)
    wlength = Trait("wl"; h² = 0.36, μ = 100.0, σₐ = 6)
    trts = [curliness, wlength]
    V = [ # Genetic VCV matrix for these two traits
        64.0 -30
        -30 36  # r = -5/8; adjust -30 for a different correlation.
    ]
    return trts, V
end

"""
    pedigree(prt, trts)

Create an empty pedigree to be updated later.
"""
function pedigree(prt, trts, nram)
    nid = size(prt, 1)
    newe = nid - nram
    ped = DataFrame(
        id = Int32.(1:nid),
        sire = prt[:, 1],
        dam = prt[:, 2],
        generation = Int8.(0),
        sex = shuffle([zeros(Int8, newe); ones(Int8, nram)]),
        tmi = 0.0,
    )
    
    for pre in ("tbv_", "ft_", "ebv_")
        for t in trts
            ped[!, pre * t.name] .= 0.0
        end
    end
    ped
end

"""
    founder_pedigree(hxy, lmp, nram, trts, V)
    
Create the pedigree for the founder generation (G0) using genotypes 
and the linkage map. QTL effects are sampled based on the genetic 
VCV matrix `V`, and TBV and phenotypes are calculated.
"""
function founder_pedigree(hxy, lmp, nram, trts, V)
    eqtl(hxy, lmp, V, trts)  # Sample QTL effects for the two traits.
    nid = size(hxy, 2) ÷ 2
    ped = pedigree(zeros(Int32, nid, 2), trts, nram)
    tbv!(ped, hxy, lmp, trts)
    phenotype!(ped, trts)
    return ped
end

"""
    parents(ped, nram, newe)

Select parents from the most recent generation of the pedigree to 
breed the next generation. Sires and dams are selected based on 
their Total Merit Index (TMI) in descending order.

Different breeding schemes can use alternative selection strategies 
by implementing a new `parents` function.
"""
function parents(ped, nram, newe)
    # Filter the pedigree for the most recent generation
    tpd = filter(row -> row.generation == ped.generation[end], ped)
    # Select rams and ewes separately 
    spd = groupby(tpd, :sex)

    cnd = spd[(sex = 1,)]
    idx = sortperm(cnd.tmi, rev=true)[1:nram]
    rams = cnd.id[idx]

    cnd = spd[(sex = 0,)]
    idx = sortperm(cnd.tmi, rev=true)[1:newe]
    ewes = cnd.id[idx]

    nid = size(tpd, 1)

    # Produce the same number of offspring as the previous generation
    sires = StatsBase.sample(rams, nid; replace = true)
    dams = StatsBase.sample(ewes, nid; replace = true)
    sortslices([sires dams], dims = 1)
end

"""
    breeding_program(hxy, lmp, ped, lms, trts, V, nram, scenario; ngen = 5)

Run a breeding program to update the pedigree and select parents for 
future generations. Multi-trait BLUP is used to update EBVs, 
and TMI is calculated accordingly.
"""
function breeding_program(hxy, lmp, ped, lms, trts, V, nram, scenario; ngen = 5)
    for gen in 1:ngen
        @info "Generation $gen"

        # Inverse of the numerator relationship matrix
        Ai = Ainv(ped)

        # Update EBV using multi-trait BLUP
        mtpblup(ped, trts, Ai, V)

        # Update TMI
        ped.tmi = scenario.w * ped.ebv_cn + (1 - scenario.w) * ped.ebv_wl

        # Select parents based on TMI
        prt = parents(ped, scenario.nram, scenario.newe)

        # Placeholder for offspring genotypes
        off = falses(size(hxy, 1), size(prt, 1) * 2)

        # Simulate genotypes for offspring
        gene_drop(hxy, off, prt, lms, lmp.pos)

        # Temporary pedigree for the next generation
        tpd = pedigree(prt, trts, nram)

        # Calculate TBV for offspring
        tbv!(tpd, off, lmp, trts)
        
        # Calculate phenotypes for offspring
        phenotype!(tpd, trts)

        # Update generation number for offspring
        tpd.generation .= gen

        # Update ID for offspring
        tpd.id .+= ped.id[end]

        # Update pedigree with offspring
        append!(ped, tpd)

        # Update genotypes with offspring
        hxy = hcat(hxy, off) 
    end

    return ped
end

"""
    report(ped)

Report the average score and inbreeding coefficient for the final 
generation of the pedigree.
"""
function report(ped)
    @info "Final score and inbreeding coefficient for the last generation:"
    lg = groupby(ped, :generation)[(generation = ped.generation[end],)]
    @info "  - Score: ", round(mean(lg.tbv_cn) + mean(lg.tbv_wl), digits = 2)
    ic = 0.0
    A = nrm(ped)
    for id in lg.id
        ic += A[id, id] - 1
    end
    ic /= size(lg, 1)
    @info "  - Inbreeding coefficient: ", round(ic, digits = 3)
end

"""
    tstFlf()

Test program to identify an optimal breeding scheme for Fluffy sheep. 
This function can use brute-force loops to find a plan that 
maximizes the score on https://sheep-breeder-2.vercel.app/.
"""
function tstFlf()
    @info "Starting test..."

    @info "  - Generating genotypes, linkage map, traits, and G0 pedigree"
    nram, newe = 50, 100
    hxy, lmp, lms = founder_genotype(nram + newe)

    trts, V = traits()
    ped = founder_pedigree(hxy, lmp, nram, trts, V)

    scenario = (nram = 25, newe = 50, w = 0.5)

    # Maintain a copy of original data for testing different scenarios
    cxy, cmp, cpd = copy(hxy), copy(lmp), copy(ped)

    @info "Scenario: " scenario
    tpd = breeding_program(cxy, cmp, cpd, lms, trts, V, nram, scenario)
    report(tpd)
end

This simulation is lightning fast. I have provided the basic framework; to tackle the problem set introduced at the beginning, you might loop through various scenarios (e.g., varying nram from 10 to 40) to find the optimal result.

Final remarks

I hope you enjoyed this tutorial. Below are a few additional topics I would like to address.

AI in Development

The template for this tutorial was generated with AI assistance. Starting from a single prompt for a two-column webpage, I refined it through several iterations to reach this final format. You can download the complete "source code" for this tutorial from the JuliaBnG GitHub page. For example:

git clone https://github.com/JuliaBnG/nadas-course

AI tools are becoming indispensable for modern workflows. My arguments for using AI are:

First:

The AI tools that have significantly impacted the computer science workforce must be effective to some degree. They are also easily accessible.

Second:

AI acts as a multiplier. Its effectiveness depends on the user's knowledge. The more you know, the more effectively you can prompt and utilize AI.

Third:

AI excels at generating and explaining command-line operations and scripts. It is good practice to become familiar with these tools.

Furthermore:

Markdown is becoming a universal "programming language" for documentation. Julia offers the unique advantage of being as easy to write as Python while delivering the performance of C or Fortran.

Simulation packages

Not all of my simulation code will be open-sourced. The primary reason is that organizing and maintaining these packages requires substantial resources and funding.


This tutorial was written in February 2026.

by Dr. Xijiang Yu

Last updated: February 26, 2026

Select a highlighted term to view annotations.

title: "Git Installation"

Windows

Download the installer from https://git-scm.com/.

Linux

The git command-line tools are typically installed by default. If they are not installed, you can use your system's package manager to install them:

# For systems using `dnf`
sudo dnf install git

# For systems using `apt`
sudo apt install git

# For systems using `pacman`
sudo pacman -S git

# For systems using `zypper`
sudo zypper install git

macOS

Download the installer from https://git-scm.com/.

Verification

Run git --version to verify the installation.

Git Configuration

Configuration

The most direct way to configure git is to save the following content into your ~/.gitconfig file.

[user]
	name = Xijiang Yu
	email = xijiang@users.noreply.github.com
[github]
	user = xijiang
[core]
	editor = emacsclient -ct
[alias]
	ss = status -s
	lg = log --color --graph --pretty=format:'%Cred%h%Creset -%C(yellow)%d%Creset %s %Cgreen(%cr) %C(bold blue)<%an>%Creset' --abbrev-commit --branches
	tg = log --date-order --graph --tags --simplify-by-decoration --pretty=format:'%ai %h %d'
[init]
	defaultBranch = main
[credential]
	helper = cache
[filter "lfs"]
	clean = git-lfs clean -- %f
	smudge = git-lfs smudge -- %f
	process = git-lfs filter-process
	required = true
[filter "remove-notebook-output"]
	clean = "jupyter nbconvert --ClearOutputPreprocessor.enabled=True --to=notebook --stdin --stdout --log-level=ERROR"

Notes

Using the Command Line

Alternatively, you can use the following commands to configure git. Remember to substitute your own details for the placeholders.

git config --global user.name "Xijiang Yu"
git config --global user.email "xijiang@users.noreply.github.com"
git config --global github.user "xijiang"
git config --global core.editor "emacsclient -ct"
git config --global alias.ss "status -s"
git config --global alias.lg "log --color --graph --pretty=format:'%Cred%h%Creset -%C(yellow)%d%Creset %s %Cgreen(%cr) %C(bold blue)<%an>%Creset' --abbrev-commit --branches"
git config --global alias.tg "log --date-order --graph --tags --simplify-by-decoration --pretty=format:'%ai %h %d'"
git config --global init.defaultBranch "main"
git config --global credential.helper "cache"
git config --global filter.lfs.clean "git-lfs clean -- %f"
git config --global filter.lfs.smudge "git-lfs smudge -- %f"
git config --global filter.lfs.process "git-lfs filter-process"
git config --global filter.lfs.required true
git config --global filter.remove-notebook-output.clean "jupyter nbconvert --ClearOutputPreprocessor.enabled=True --to=notebook --stdin --stdout --log-level=ERROR"

Julia Installation

The recommended way to install Julia is to run the following command in your terminal:

curl -fsSL https://install.julialang.org | sh

This will install the latest stable version of Julia, as well as the juliaup tool. Start Julia from the command-line by typing julia. This will launch the Julia REPL (Read-Eval-Print Loop). You should see something similar to this:

Press ] to enter package manager mode:

add Revise

Then press Ctrl-D to exit the Julia REPL.

Create (or modify) a file at ~/.julia/config/startup.jl with the following content:

using Revise

Subsequently, Revise will be loaded automatically whenever you start Julia.

Revise is an essential tool for Julia development. It allows you to modify your code and observe the changes immediately without needing to restart the session. We will use it throughout this course.

Julia Editor

You can set the default editor for Julia in your terminal using the following command:

export JULIA_EDITOR="code"

Then, start the Julia REPL:

julia> @edit sin(1)

This will open the source code of the sin function in your editor (e.g., VS Code).

When running Julia on a remote server, it is better to use a terminal-based editor such as emacsclient -nw, emacsclient -t, vim, or nano. This will display the highlighted source code within your terminal session.

To make this setting persistent, add the export command to your shell configuration file (e.g., ~/.bashrc or ~/.zshrc).

Project Structure

The project directory flfSheep will have the following initial structure:

flfSheep/
├── Project.toml
└── src/
    └── flfSheep.jl

The Project.toml file will contain configuration similar to this:

name = "flfSheep"
uuid = "..."
authors = ["Your Name <[EMAIL_ADDRESS]>"]
version = "0.1.0"
...

The src/flfSheep.jl file serves as the main entry point for the module:

module flfSheep

greet() = print("Hello World!")

end # module flfSheep

Next Steps

Within the flfSheep directory, you can initialize version control using git:

git init
git add Project.toml src/flfSheep.jl
git commit -m "Initial commit"

The Almighty Tab

The Tab key in the Julia REPL is extremely versatile. It is used for code completion, typing Unicode characters, and more. For example, typing \flf and pressing the Tab key will complete the text to flfSheep. If you type flf, press Tab to get flfSheep, then type .g and press Tab again, it will complete to flfSheep.greet.

To type a Unicode character, type \ followed by the name of the character, then press the Tab key. For example, typing \mu and pressing Tab will result in μ.

Emojis start with \: followed by their name. For example, typing \:sheep: and pressing Tab will produce 🐑.

Typing \^2 then Tab gives you ².

Typing \sqrt then Tab gives you .

Familiarity with $\LaTeX$ makes typing mathematical symbols even easier.

Changes After Adding Packages

Once the packages are added, Project.toml will be updated to include the dependencies and compatibility requirements:

name = "flfSheep"
uuid = "866d409a-f53d-443e-8628-6d86ebd581db"
version = "0.1.0"
authors = ["Xijiang Yu <xijiang@users.noreply.github.com>"]

[deps]
BnGStructs = "764f5799-4fa8-443f-b488-b719b6ff37e3"
Breeding = "cb565f4f-111b-40c2-8768-3bafadb18772"
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
FisherWright = "a555c6d9-8fd4-4ecd-87f6-904b8d8d9cbe"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
RelationshipMatrices = "d4d45922-7dca-4708-ac5a-a6dbb0551829"

[compat]
BnGStructs = "0.2.2"
Breeding = "0.1.0"
DataFrames = "1.8.1"
FisherWright = "0.1.6"
Random = "1.11.0"
RelationshipMatrices = "0.1.4"

Note that if you intend to register your package in the Julia General Registry, additional configuration will be required. This is beyond the scope of this course, but can be easily managed with the assistance of AI tools like GitHub Copilot.

A new Manifest.toml file will also be created in the project directory.

echo Manifest.toml >> .gitignore
git add .gitignore

The command above ensures that Manifest.toml is not tracked by Git. Since Manifest.toml is automatically generated by Julia and can change frequently based on the local environment, it is typically excluded from version control.

A Hello World Project

The contents of the src/flfSheep.jl file:

module flfSheep

using BnGStructs
using Breeding
using FisherWright
using RelationshipMatrices

include("tstBreeding.jl")
# More files can be included here. 
# Pay attention to the order of function definitions if using Julia 1.12 or later.

export tstFlf # Expose the function

end # module flfSheep

The contents of the src/tstBreeding.jl file:

"""
    tstFlf()

A test program to identify an effective breeding scheme for Fluffy sheep.
"""
function tstFlf()
    @info "Starting test..."
end

The text within the triple quotes constitutes the docstring for the tstFlf function.

In your REPL:

flfSheep.tstFlf()
# or, if you have "export tstFlf" in flfSheep.jl:
tstFlf()

This will display the information from the tstFlf function. The REPL recognizes the function because of the Revise.jl package we activated earlier. It is a highly recommended tool.

In your REPL:

?tstFlf # Type '?' to enter help mode

This will display the function's docstring. This is particularly useful as your collection of functions grows, and these docstrings can also be automatically incorporated into your manual later.

Annotations for the Main Program

This file contains a large number of functions. In a production project, it is better to organize them into separate files. They are consolidated here for easier reference and explanation.

Function tstFlf

Let's examine the last function first; it serves as the main driver for the full simulation. The code is largely self-explanatory:

  1. Generate the genotypes, linkage map, and pedigree DataFrame for generation 0.

  2. Define the traits and the genetic VCV matrix.

  3. Define the simulation scenarios.

  4. Create a copy of generation 0 and execute the test breeding program.

Function founder_genotype

This function generates a historical population with $N_e = 200$ over 2,000 generations. Approximately 120,000 segregating SNPs across 26 autosomes are obtained.

Remarks

Function traits

Configure trait information and the genetic VCV matrix.

Functions pedigree and founder_pedigree

These functions are self-explanatory.

Function parents

This function selects parents based on specified criteria. Different breeding programs can implement their own parent selection logic to create pairs according to their specific schemes.

Function breeding_program

The main breeding program function. In this example, it simulates selection over five consecutive generations.

Remark

The Ainv function is particularly noteworthy: it is exceptionally fast and capable of handling massive pedigrees that often cause other programs to fail.

Function report

Generates a summary of the pedigree data.