June 1, 2022 Published by Rachel Bittner, Sr. Research Scientist
Introducing Basic Pitch, Spotify’s free open source tool for converting audio into MIDI. Basic Pitch uses machine learning to transcribe the musical notes in a recording. Drop a recording of almost any instrument, including your voice, then get back a MIDI version, just like that. Unlike similar ML models, Basic Pitch is not only versatile and accurate, but also fast and computationally lightweight. It was built for artists and producers who want an easy way to turn their recorded ideas into MIDI, a standard for representing notes used in digital music production.
Basic Pitch is not your average MIDI converter
For the past 40 years, musicians have been using computers to compose, produce, and perform music, everywhere from bedrooms to concert halls. Most of this computer-based music uses a digital standard called MIDI (pronounced “MID-ee”). MIDI acts like sheet music for computers — it describes what notes are played and when — in a format that’s easy to edit. Did a note sound weird in that chord you played? Change it with a click.
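For the curious, MIDI's notes-as-data idea can be sketched in a few lines of Python. This is purely an illustration of the concept (not the actual MIDI file format, which stores binary event messages): a note is just a pitch number plus timing, so editing means changing a number.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class NoteEvent:
    pitch: int       # MIDI note number: 60 = middle C (C4)
    start: float     # seconds
    duration: float  # seconds

# A C-major chord written as MIDI-style note events.
chord = [NoteEvent(60, 0.0, 1.0),   # C4
         NoteEvent(64, 0.0, 1.0),   # E4
         NoteEvent(67, 0.0, 1.0)]   # G4

# "Did a note sound weird?" Editing is just changing a number:
# lower the E to E-flat, turning the chord minor.
chord[1] = replace(chord[1], pitch=63)

print([n.pitch for n in chord])  # [60, 63, 67]
```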
While MIDI is used by nearly all modern musicians, creating compositions from scratch with MIDI can be a challenge. Usually, musicians have to produce MIDI notes using some sort of computer interface, like a MIDI keyboard, or by typing the notes into their software by hand. This is because live performances on real instruments are typically difficult for a computer to interpret: once a performance has been recorded, the individual notes that were played are tricky to separate and identify.
This is a real problem for musicians who primarily sing their ideas, but aren’t familiar with piano keyboards or complex music software. For other musicians, having to compose on a MIDI keyboard or manually assembling an entire MIDI score note by note, mouse click by mouse click, can be creatively constraining and tedious.
Mighty, speedy, and more expressive MIDI
To solve this problem, researchers at Spotify’s Audio Intelligence Lab teamed up with our friends at Soundtrap to build Basic Pitch — a machine learning model that turns a variety of instrumental performances into MIDI.
While other note-detection systems have existed for years, Basic Pitch offers a number of advantages:
- Polyphonic + instrument-agnostic: Unlike most other note-detection algorithms, Basic Pitch can track multiple notes at a time and across various instruments, including piano, guitar, and ocarina. Many systems limit users to only monophonic output (one note at a time, like a single vocal melody), or are built for only one kind of instrument.
- Pitch bend detection: Instruments, like guitar or the human voice, allow for more expressiveness through pitch bending: vibrato, glissando, bends, slides, etc. However, this valuable information is often lost when turning audio into MIDI. Basic Pitch supports this right out of the box.
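As a rough illustration of what "supporting pitch bends" means at the MIDI level (a sketch, not Basic Pitch's code): MIDI encodes bends as 14-bit pitch-bend values, interpreted relative to a configurable bend range. The helper below assumes the common default of ±2 semitones.

```python
def cents_to_pitch_bend(cents: float, bend_range_semitones: float = 2.0) -> int:
    """Map a pitch offset in cents to a 14-bit MIDI pitch-bend value.

    8192 means "no bend"; 0 and 16383 are the extremes of the bend
    range (commonly +/-2 semitones, i.e. +/-200 cents).
    """
    semitones = cents / 100.0
    value = 8192 + round(semitones / bend_range_semitones * 8192)
    return max(0, min(16383, value))  # clamp to the 14-bit range

print(cents_to_pitch_bend(0))     # 8192: no bend
print(cents_to_pitch_bend(100))   # 12288: bend up one semitone
print(cents_to_pitch_bend(-200))  # 0: full bend down
```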
- Speed: Basic Pitch is light on resources, and is able to run faster than real time on most modern computers (Bittner et al. 2022).
By combining these properties, Basic Pitch lets you take input from a variety of instruments and easily turn it into MIDI output, with a high degree of nuance and accuracy. The MIDI output can then be imported into a digital audio workstation for further adjustments.
Basic Pitch gives musicians and audio producers access to the power and flexibility of MIDI, whether they own specialized MIDI gear or not. So now they can capture their ideas whenever inspiration strikes and get a head start on their compositions using the instrument of their choice, whether that’s guitar, flugelhorn, or their own voice.
Easy peasy! Well…
Does better always have to mean bigger?
To build Basic Pitch, we trained a neural network to predict MIDI note events given audio input. In general, it’s hard to make systems that are both accurate and efficient. Usually in ML, to make things more accurate, the easiest way is to add more data and make our models bigger.
When we look around at popular ML models today, we see a tendency toward computationally extreme solutions. Think OpenAI’s Jukebox with its billions of parameters, or DALL-E and its 12 billion parameters, or Meta’s OPT-175B, a 175-billion-parameter large language model. Large models with targeted use cases can deliver very good results: a model trained on a dataset with tons of piano audio will be effective at recognizing piano input.
But we wanted a model that could work with input from a variety of instruments and polyphonic recordings to create a tool that’s useful for piano virtuosos as well as shower crooners. Did that automatically mean building a mega model? Or could we build a model that did more (recognize notes played by a much wider range of instruments) but that was also lighter than, and just as accurate as, a heavier, more power-hungry solution?
That was the challenge we set for ourselves. But to accomplish that, we would have to address a number of transcription problems specific to music.
Music, musicians, and machine learning
Since we set out to build this tool for musicians, not just researchers, we knew speed was important. No matter how impressive your ML model, no one enjoys waiting around for the results, especially if they’re in the middle of doing something creative. Inspiration doesn’t like progress bars.
Reducing the model size would help with speed. But the goal of creating a fast, light, and accurate ML model presented a few challenges that are specific to music audio:
- When music is polyphonic, lots of the sounds overlap in both time and frequency. For example, if a piano plays C3 and C4 at the same time, the notes share a lot of similar harmonic frequency content. This makes it hard to “untangle”.
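The numbers make this concrete. In equal temperament, C4 is exactly one octave above C3, so every harmonic of C4 lands on a harmonic of C3, and their spectra really do tangle together (illustrative arithmetic, not Basic Pitch's code):

```python
C3 = 440.0 * 2 ** ((48 - 69) / 12)   # MIDI note 48: C3, about 130.81 Hz
C4 = 2 * C3                          # one octave up: C4, about 261.63 Hz

c3_partials = [round(C3 * h, 1) for h in range(1, 17)]  # first 16 harmonics
c4_partials = [round(C4 * h, 1) for h in range(1, 9)]   # first 8 harmonics

shared = [f for f in c4_partials if f in c3_partials]
print(f"{len(shared)} of {len(c4_partials)} C4 partials coincide with C3 partials")
# → 8 of 8: the entire C4 spectrum hides inside the C3 spectrum
```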
- It’s not always easy to know when to group pitches into a single note, or when to break them into multiple notes. In the example clip below, listen to the way the singer scoops around different notes in “sooooong” and “aboooout”, etc. How many notes are there in the word “song”? Two, three…five-ish? Depending on who you are, you might hear it differently.
- It’s hard to make an ML system that transcribes any instrument well. There are some really good models out there for instruments like the piano. Even making a model that transcribes any type of piano well is difficult. (Think about all the different sound qualities of different piano recordings — a Steinway grand in a concert hall, an out-of-tune upright in a honky-tonk, etc.) It’s even harder to make a model that works well on any instrument, from a kazoo to a soprano opera singer.
So how did we do?
An ML model with a lightweight footprint
In the end, we made the Basic Pitch model lightweight by combining tricks drawn from past research by Spotify’s Audio Intelligence Lab and other researchers, including:
- Using a harmonic constant-Q transform as input (Bittner et al. 2017, Balhar 2018)
- Jointly modeling the onsets, frames (Kelz et al. 2016, Hawthorne et al. 2018), and multipitch information (Bittner et al. 2017)
- Just using fewer layers and fewer parameters!
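To give a flavor of the first trick, here's a simplified sketch of the harmonic-stacking idea behind a harmonic constant-Q transform (an illustration only, not Basic Pitch's actual implementation). In a log-frequency representation like the CQT, the h-th harmonic of any note sits a fixed number of bins above its fundamental, so shifting copies of the spectrogram by `bins_per_octave * log2(h)` bins aligns each note's harmonics across channels — which lets a small network see harmonic structure without learning it from scratch.

```python
import math

def harmonic_stack(cqt, bins_per_octave=36, harmonics=(1, 2, 3, 4, 5)):
    """Stack shifted copies of a log-frequency spectrogram so the h-th
    harmonic of every note lines up with that note's fundamental bin.

    `cqt` is a list of frames, each a list of magnitudes over
    log-spaced frequency bins; returns one shifted copy per harmonic.
    """
    stacked = []
    for h in harmonics:
        # In log-frequency, harmonic h sits log2(h) octaves above the
        # fundamental: a fixed shift of bins_per_octave * log2(h) bins.
        shift = round(bins_per_octave * math.log2(h))
        channel = [frame[shift:] + [0.0] * shift for frame in cqt]
        stacked.append(channel)
    return stacked

# Toy check: a note with energy at bin 10 (fundamental) and bin 46
# (2nd harmonic, one octave = 36 bins above) lines up at bin 10 in
# both output channels.
frame = [0.0] * 120
frame[10] = 1.0
frame[46] = 0.5
channels = harmonic_stack([frame], harmonics=(1, 2))
print(channels[0][0][10], channels[1][0][10])  # 1.0 0.5
```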
In our experiments, the shallow architecture and the tricks above resulted in a lightweight model with a high level of accuracy.
Compared to other AI systems, which can require massive amounts of energy-intensive processing to run, Basic Pitch is positively svelte, at <20 MB peak memory and <17K parameters. (Yes: 17,000, not 17 billion!)
Greater versatility, comparable accuracy
State-of-the-art systems are typically built for one specific instrument type, but Basic Pitch is both accurate and versatile. As seen in our experimental results above, Basic Pitch performed well at detecting notes from various instrument types, including vocal performances (the Molina dataset in the first column), which can be especially tricky to get right.
Try the demo at basicpitch.io
Basic Pitch is so simple, it even works in your web browser: head on over to basicpitch.io to try it out without downloading anything. For those interested in more details:
- We’ve open sourced the Basic Pitch model on GitHub for the Pythonistas.
- We presented a paper on our model at ICASSP 2022.
- Check out a video of our ICASSP presentation.
Made for creators, shared with everyone
Originally, we set out to develop Basic Pitch exclusively for Soundtrap and its community of artists and audio producers. But we wanted to share it with even more creators, whether they were musicians, developers, or both. So we open sourced the model and made the tool available online for anyone to use.
Right now, Basic Pitch gives musicians a great starting point for transcriptions, instead of having to write all their musical ideas down manually from scratch or buying extra MIDI hardware. This library comes with a standard open source license and can be used for any purpose, including integrating it with other music production tools. We can also imagine how the model could be integrated into real-time systems — for example, allowing a live performance to be automatically accompanied by other MIDI instruments that “react” to what the performer is playing.
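As a toy illustration of that “reacting” idea (hypothetical code, not part of Basic Pitch), an accompaniment system could map each note the transcriber detects to a harmony note at a fixed interval:

```python
def accompany(note: int, interval: int = 7) -> int:
    """Map a detected MIDI note to an accompaniment note a fixed
    interval above it (7 semitones = a perfect fifth)."""
    return min(note + interval, 127)  # stay inside the MIDI note range

# A detected melody (MIDI note numbers) and its generated fifth harmony.
melody = [60, 62, 64, 65, 67]
harmony = [accompany(n) for n in melody]
print(harmony)  # [67, 69, 71, 72, 74]
```

A real-time version would consume note events as they stream out of the model, but the mapping from detected notes to "reacting" instruments would look much the same.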
We also wanted to share our approach for building lightweight models so other ML researchers might be inspired to do the same. For certain use cases, the large, computationally heavy, and energy-hungry models make sense. Sometimes they’re the best and only solution. But we don’t think this should be the only approach — and certainly not the default one. If you go in thinking you can create a leaner ML model, you probably can (and should!).
Now that Basic Pitch is out there for music creators, software engineers, and researchers to use, develop, and build upon, we can’t wait to see what everyone does with it. No matter how much we test a model in our experiments, there’s still nothing like seeing how it performs out in the wild, with real-world use cases. In this initial version of Basic Pitch, we expect to discover many areas for improvement, along with new possibilities for how it could be used.
Want to help us improve Basic Pitch or have an idea for building something great with it? Let us know in the repo!
Basic Pitch was built by:
Rachel Bittner, Juanjo Bosch, Vincent Degroote, Brian Dombrowski, Simon Durand, Sebastian Ewert, Gabriel Meseguer Brocal, Nicola Montecchio, Adam Rackis, David Rubinstein, Ching Sung, Scott Sheffield, Peter Sobot, Daniel Stoller, Charae Tongg, and Jan Van Balen
Balhar, Jiří, and Jan Hajič, Jr. “Melody Extraction Using a Harmonic Convolutional Neural Network.” (2018).
Bittner, Rachel, Juan José Bosch, David Rubinstein, Gabriel Meseguer-Brocal, and Sebastian Ewert. “A Lightweight Instrument-Agnostic Model for Polyphonic Note Transcription and Multipitch Estimation.” 47th International Conference on Acoustics, Speech and Signal Processing (May 2022).
Bittner, Rachel, Brian McFee, Justin Salamon, Peter Li, and Juan P. Bello. “Deep Salience Representations for f0 Estimation in Polyphonic Music.” 18th International Society for Music Information Retrieval Conference (October 2017).
Hawthorne, Curtis, Erich Elsen, Jialin Song, Adam Roberts, Ian Simon, Colin Raffel, Jesse Engel, Sageev Oore, and Douglas Eck. “Onsets and Frames: Dual-Objective Piano Transcription.” (2018).
Kelz, Rainer, Matthias Dorfer, Filip Korzeniowski, Sebastian Böck, Andreas Arzt, and Gerhard Widmer. “On the Potential of Simple Framewise Approaches to Piano Transcription.” (December 2016).
Tags: engineering leadership, machine learning, tech research