How To Generate Music With AI | A Step-by-Step Guide

Music generation code in Python



Technology is changing rapidly, breaking through its past limitations and producing new ideas that benefit people. Listing every innovation would take too long, but one recent development stands out: Artificial Intelligence has shown creative potential that can help the entertainment industry in many ways, particularly in the domain of audio and music.

With the help of Deep Learning techniques, AI can generate music and create compelling compositions. Among global creative trends, AI-generated music is gaining a lot of attention. Music generation using AI involves creating new compositions with algorithms that learn and mimic the patterns of existing music. This article discusses state-of-the-art models and AI music generation techniques, gives a brief overview of music generation code in Python, and ends with a GitHub link for music generation.

So, let’s get started!

AI music generation techniques

The field of Artificial Intelligence that deals with processing and comprehending audio inputs, such as speech and music, is referred to as the Audio Domain. This field includes a broad range of tasks & applications, such as:

Music Classification

The ability to classify music based on genre, artist, and other attributes, which is used in music streaming and recommendation services.

Music Generation

The ability to generate music, which is used in applications such as music composition and generation of background music.

Audio Segmentation

The ability to segment audio into different segments, such as speech, music, and silence, which is used in applications such as speech recognition and audio editing.

Audio Enhancement

Improving the quality of the audio signal, removing noise, improving speech intelligibility, etc.

AI techniques such as Deep Learning, Machine Learning, Natural Language Processing, Computer Vision, and Signal Processing are used to solve these tasks. The combination of these AI music generation techniques with large amounts of data and computational power has led to a significant improvement in the performance of audio-related AI tasks in recent years.

How do computers understand Audio Data?

AI processes audio by first converting the analogue audio signal into a digital representation which is then processed by the algorithm. This is referred to as digitization. Once the audio has been converted to digital format, it can be analyzed using techniques such as Signal Processing, Natural Language Processing, and Machine Learning.
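To make the digitization step concrete, here is a minimal sketch that uses only the Python standard library. It samples a pure tone at discrete time steps and quantizes each sample to a signed 16-bit integer. The function name `digitize_sine` and the chosen values (a 440 Hz tone at an 8 kHz sample rate) are illustrative assumptions, not part of any library.

```python
import math

def digitize_sine(freq_hz, sample_rate_hz, duration_s, bit_depth=16):
    """Sample a sine wave and quantize it to signed integers (PCM-style)."""
    n_samples = int(sample_rate_hz * duration_s)
    max_int = 2 ** (bit_depth - 1) - 1  # 32767 for 16-bit audio
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                            # sampling: discrete time steps
        amplitude = math.sin(2 * math.pi * freq_hz * t)   # continuous value in [-1, 1]
        samples.append(round(amplitude * max_int))        # quantization: nearest integer level
    return samples

pcm = digitize_sine(freq_hz=440, sample_rate_hz=8000, duration_s=0.01)
print(len(pcm), pcm[:5])
```

The resulting list of integers is the kind of digital representation that later processing steps work on.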


Key Features For Training Audio Model

Some data features and transformations that are important in speech and audio processing are Mel-frequency cepstral coefficients (MFCCs), Gammatone-frequency cepstral coefficients (GFCCs), Linear-frequency cepstral coefficients (LFCCs), Bark-frequency cepstral coefficients (BFCCs), Power-normalized cepstral coefficients (PNCCs), the spectrum, the cepstrum, the spectrogram, and more.

We can use some of these features directly and extract features from others, like spectrum, to train a machine learning model.
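As an illustration of extracting features from a spectrum, the sketch below computes a magnitude spectrogram by hand with a plain discrete Fourier transform. It uses only the standard library so that each step stays visible; in practice an audio library would normally do this work. The function names `magnitude_spectrum` and `spectrogram` and the 1 kHz test tone are assumptions made for this example.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Return the magnitudes of the first N/2 + 1 DFT bins of a Hann-windowed frame."""
    n = len(frame)
    # A Hann window reduces spectral leakage at the frame edges.
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        bin_value = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                        for i, x in enumerate(windowed))
        spectrum.append(abs(bin_value))
    return spectrum

def spectrogram(signal, frame_size, hop):
    """Slice the signal into overlapping frames and stack their spectra."""
    return [magnitude_spectrum(signal[start:start + frame_size])
            for start in range(0, len(signal) - frame_size + 1, hop)]

sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * i / sample_rate) for i in range(512)]
spec = spectrogram(tone, frame_size=128, hop=64)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), peak_bin * sample_rate / 128)
```

Cepstral features such as MFCCs are derived from this kind of spectrogram by applying a mel filterbank, taking the logarithm, and then applying a discrete cosine transform.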

Let’s dive into an overview of the GitHub code for music generation using AI with Python.

Music generation code in Python

Let’s get started by writing the initial code: checking which GPU has been assigned and importing the required libraries.

!nvidia-smi -L

from google.colab import drive
drive.mount('/content/gdrive')  # mount Google Drive so the samples can be saved there

!pip install git+

import jukebox
import torch as t
import librosa
import os
from IPython.display import Audio
from jukebox.make_models import make_vqvae, make_prior, MODELS, make_model
from jukebox.hparams import Hyperparams, setup_hparams
from jukebox.sample import sample_single_window, _sample, \
                           sample_partial_window, upsample, \
                           load_prompts
from jukebox.utils.dist_utils import setup_dist_from_mpi
from jukebox.utils.torch_utils import empty_cache
rank, local_rank, device = setup_dist_from_mpi()

Choosing the lyrical model. 

We will go with ‘5b_lyrics’. You can try ‘1b_lyrics’ as well.

model = '5b_lyrics' # or '5b' or '1b_lyrics'
hps = Hyperparams()
hps.sr = 44100  # audio sample rate in Hz
hps.n_samples = 3 if model in ('5b', '5b_lyrics') else 8
# Specifies the directory to save the sample in.
# We set this to the Google Drive mount point.
hps.name = '/content/gdrive/My Drive/samples'
chunk_size = 16 if model in ('5b', '5b_lyrics') else 32
max_batch_size = 3 if model in ('5b', '5b_lyrics') else 16
hps.levels = 3
hps.hop_fraction = [.5,.5,.125]

vqvae, *priors = MODELS[model]
vqvae = make_vqvae(setup_hparams(vqvae, dict(sample_length = 1048576)), device)
top_prior = make_prior(setup_hparams(priors[-1], dict()), vqvae, device)

There are two modes for sampling, ‘ancestral’ and ‘primed’. ‘Ancestral’ creates songs from scratch based on artist and genre conditioning. ‘Primed’ creates songs that continue from an audio sample you provide. Choose only ONE of the following modes. I will go with ‘ancestral’.

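# Selecting Mode (use only ONE of the two options below)

# Option 1: the default mode of operation.
# Creates songs based on artist and genre conditioning.
mode = 'ancestral'
codes_file = None
audio_file = None
prompt_length_in_seconds = None

# Option 2: prime song creation using an arbitrary audio sample.
# Uncomment these lines to use it instead of option 1.
# mode = 'primed'
# codes_file = None
# # Specify an audio file here.
# audio_file = '/content/gdrive/My Drive/primer.wav'
# # Specify how many seconds of audio to prime on (12 is an example value).
# prompt_length_in_seconds = 12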
Collecting the mode settings into a Hyperparams object.

sample_hps = Hyperparams(dict(mode=mode, codes_file=codes_file, audio_file=audio_file, prompt_length_in_seconds=prompt_length_in_seconds))

Specifying the output audio length.

Specifying the sample length of the given audio.
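Both steps come down to simple arithmetic: the requested duration in seconds is converted to a number of audio samples and then rounded down to a whole multiple of the number of raw samples that one top-level token covers. The sketch below shows that calculation on its own. The helper name `aligned_sample_length` and the value 128 are assumptions for illustration; in the notebook the real factor would come from `top_prior.raw_to_tokens`, and the result is what `hps.sample_length` needs to hold.

```python
def aligned_sample_length(length_in_seconds, sample_rate, raw_to_tokens):
    """Round a duration down to a whole number of top-level tokens, in audio samples."""
    n_raw = int(length_in_seconds * sample_rate)
    return (n_raw // raw_to_tokens) * raw_to_tokens

# 60 seconds at 44.1 kHz; 128 raw samples per top-level token is an assumed value.
length = aligned_sample_length(60, 44100, 128)
print(length, length / 44100)
```

Rounding down means the generated audio can be a fraction of a second shorter than the duration requested.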

# Note: Metas can contain different prompts per sample.
# By default, all samples use the same prompt.
metas = [dict(artist = "Rick Astley",
            genre = "Pop",
            total_length = hps.sample_length,
            offset = 0,
            lyrics = """YOUR SONG LYRICS HERE!""",  # placeholder lyrics
            ),
          ] * hps.n_samples
labels = [None, None, top_prior.labeller.get_batch_labels(metas, 'cuda')]

The sampling temperature is a hyperparameter that controls the randomness or “creativity” of the model’s output. A lower temperature results in more predictable, conservative output, while a higher temperature produces more varied or “creative” output. The temperature is not learned during training; it is chosen when generating new audio samples and can be adjusted between runs. Here we keep the value between 0 and 1, just below 1.

sampling_temperature = .98

lower_batch_size = 16
max_batch_size = 3 if model in ('5b', '5b_lyrics') else 16
lower_level_chunk_size = 32
chunk_size = 16 if model in ('5b', '5b_lyrics') else 32
sampling_kwargs = [dict(temp=.99, fp16=True, max_batch_size=lower_batch_size,
                        chunk_size=lower_level_chunk_size),
                    dict(temp=0.99, fp16=True, max_batch_size=lower_batch_size,
                         chunk_size=lower_level_chunk_size),
                    dict(temp=sampling_temperature, fp16=True, 
                         max_batch_size=max_batch_size, chunk_size=chunk_size)]
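The effect of the sampling temperature set above can be sketched with the standard softmax-with-temperature technique. This is a generic illustration rather than Jukebox's internal code: the function name `sample_with_temperature` and the example logits are assumptions.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Draw one index from softmax(logits / temperature)."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)                           # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return index, probs

logits = [2.0, 1.0, 0.1]
_, sharp = sample_with_temperature(logits, temperature=0.5)
_, flat = sample_with_temperature(logits, temperature=0.98)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

With the lower temperature, more of the probability mass sits on the most likely option, which is why low values give more conservative output.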

This step generates the level 2 sample, the coarsest of the three levels (level 2, level 1, and level 0), and places it in Google Drive. Later on, we will upsample it through level 1 to create the level 0 sound. Level 0 is the sample that we actually need.
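To see why the levels differ so much in quality and generation time, it helps to count how many tokens each level needs for the same stretch of audio. The sketch below assumes that one token covers 8, 32, and 128 raw audio samples at levels 0, 1, and 2 respectively; these factors, and the names `ASSUMED_RAW_TO_TOKENS` and `tokens_per_level`, are assumptions for illustration and are not read from the model.

```python
# Assumed raw-audio-samples-per-token for each level (level 0 is the finest).
ASSUMED_RAW_TO_TOKENS = {0: 8, 1: 32, 2: 128}

def tokens_per_level(n_audio_samples, raw_to_tokens=ASSUMED_RAW_TO_TOKENS):
    """Number of tokens needed at each level to cover n_audio_samples."""
    return {level: n_audio_samples // factor
            for level, factor in raw_to_tokens.items()}

counts = tokens_per_level(1048576)  # the sample_length passed to make_vqvae earlier
for level in sorted(counts, reverse=True):
    print(f"level {level}: {counts[level]} tokens")
```

Under these assumptions, level 0 needs 16 times as many tokens as level 2 for the same audio, which is consistent with level 2 being quick but coarse and level 0 being slow but detailed.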

if sample_hps.mode == 'ancestral':
  zs = [t.zeros(hps.n_samples,0,dtype=t.long, device='cuda') for _ in range(len(priors))]
  zs = _sample(zs, labels, sampling_kwargs, [None, None, top_prior], [2], hps)
elif sample_hps.mode == 'upsample':
  assert sample_hps.codes_file is not None
  # Load codes.
  data = t.load(sample_hps.codes_file, map_location='cpu')
  zs = [z.cuda() for z in data['zs']]
  assert zs[-1].shape[0] == hps.n_samples, f"Expected bs = {hps.n_samples}, got {zs[-1].shape[0]}"
  del data
  print('Falling through to the upsample step later in the notebook.')
elif sample_hps.mode == 'primed':
  assert sample_hps.audio_file is not None
  audio_files = sample_hps.audio_file.split(',')
  duration = (int(sample_hps.prompt_length_in_seconds*hps.sr)//top_prior.raw_to_tokens)*top_prior.raw_to_tokens
  x = load_prompts(audio_files, duration, hps)
  zs = top_prior.encode(x, start_level=0, end_level=len(priors), bs_chunks=x.shape[0])
  zs = _sample(zs, labels, sampling_kwargs, [None, None, top_prior], [2], hps)
else:
  raise ValueError(f'Unknown sample mode {sample_hps.mode}.')

Listen to the level 2 audio sample. This is the lowest-quality sample that is generated, but it is handy if you want to adjust the parameters and regenerate the sample.


Now that the level 2 sample is done, let’s upsample it to create the level 1 and level 0 samples.

# Set this False if you are on a local machine that has enough memory (this allows you to do the
# lyrics alignment visualization during the upsampling stage). For a hosted runtime, 
# we'll need to go ahead and delete the top_prior if you are using the 5b_lyrics model.
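if True:
  del top_prior
  empty_cache()      # free the GPU memory held by the top-level prior
  top_prior = None   # keep the name defined for the upsample call below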
upsamplers = [make_prior(setup_hparams(prior, dict()), vqvae, 'cpu') for prior in priors[:-1]]
labels[:2] = [prior.labeller.get_batch_labels(metas, 'cuda') for prior in upsamplers]

Generating the level 0 sound can take up to 12 hours. Keep the Colab tab open and active rather than leaving it in the background; otherwise the runtime can disconnect and the processing will be terminated.

zs = upsample(zs, labels, sampling_kwargs, [*upsamplers, top_prior], hps)

Once level 0 is generated, you can listen to it in Google Colab.


Git Repo Link:


That’s a wrap guys!!

Hopefully, this article has given you an overview of AI music generation techniques and how they can be used to generate music, compositions, and lyrics. You can use this Python code to generate music, or visit my GitHub link for music generation.

In conclusion, Artificial Intelligence is a rapidly evolving field that has the potential to revolutionize various industries. However, it’s important to remember that while AI has the potential to bring about many benefits, it also poses certain ethical and societal challenges that must be addressed. As we continue to push the boundaries of what’s possible with AI, we must work together to ensure that the technology is developed and used in a way that is fair, ethical, and beneficial for all.

