Saturday, 26 August 2017

Trading Strategies with L1 Filtering


This Best Practices Guide is a manual to help developers obtain the best performance from NVIDIA CUDA GPUs. It presents established parallelization and optimization techniques and explains coding metaphors and idioms that can greatly simplify programming for CUDA-capable GPU architectures. While the contents can be used as a reference manual, be aware that some topics are revisited in different contexts as various programming and configuration topics are explored. As a result, it is recommended that first-time readers proceed through the guide sequentially. This approach will greatly improve your understanding of effective programming practices and enable you to make the best use of the guide for later reference.

The discussions in this guide all use the C programming language, so you should be comfortable reading C code. This guide refers to and relies on several other documents that you should have available for reference, all of which are available at no cost from the CUDA developer website (nvidiacuda-download). The following documents are especially important resources: the CUDA Installation Guide, the CUDA C Programming Guide, and the CUDA Toolkit Reference Manual. In particular, the optimization section of this guide assumes that you have already successfully downloaded and installed the CUDA Toolkit (if not, please refer to the relevant CUDA Installation Guide for your platform) and that you have some familiarity with the CUDA C programming language and environment (if not, please refer to the CUDA C Programming Guide).

This guide introduces the Assess, Parallelize, Optimize, Deploy (APOD) design cycle for applications, with the goal of helping application developers rapidly identify the portions of their code that would most readily benefit from GPU acceleration, rapidly realize that benefit, and begin leveraging the resulting speedups in production as early as possible. APOD is a cyclical process: initial speedups can be achieved, tested, and deployed with only minimal initial investment of time, at which point the cycle can begin again by identifying further optimization opportunities, seeing additional speedups, and then deploying the still faster versions of the application into production.

For an existing project, the first step is to assess the application to locate the parts of the code that are responsible for the bulk of the execution time. Armed with this knowledge, the developer can evaluate these bottlenecks for parallelization and start to investigate GPU acceleration. By understanding the end users' requirements and constraints and by applying Amdahl's and Gustafson's laws, the developer can determine the upper bound of performance improvement from acceleration of the identified portions of the application. Having identified the hotspots and having done the basic exercises to set goals and expectations, the developer needs to parallelize the code. Depending on the original code, this can be as simple as calling into an existing GPU-optimized library such as cuBLAS, cuFFT, or Thrust,
or it could be as simple as adding a few preprocessor directives as hints to a parallelizing compiler. On the other hand, some application designs will require a certain amount of refactoring to expose their inherent parallelism. As even future CPU architectures will require exposing this parallelism in order to improve or simply maintain the performance of sequential applications, the CUDA family of parallel programming languages (CUDA C/C++, CUDA Fortran, etc.) aims to make the expression of this parallelism as simple as possible, while at the same time enabling operation on CUDA-capable GPUs designed for maximum parallel throughput.

After each round of application parallelization is complete, the developer can move on to optimizing the implementation to improve performance. Since there are many possible optimizations that can be considered, having a good understanding of the needs of the application can help make the process as smooth as possible. However, as with APOD as a whole, program optimization is an iterative process (identify an opportunity for optimization, apply and test the optimization, verify the speedup achieved, and repeat), meaning that it is not necessary for a programmer to spend large amounts of time memorizing the bulk of all possible optimization strategies before seeing good speedups. Instead, strategies can be applied incrementally as they are learned. Optimizations can be applied at various levels, from overlapping data transfers with computation all the way down to fine-tuning floating-point operation sequences. The available profiling tools are invaluable for guiding this process, as they can help suggest a next-best course of action for the developer's optimization efforts and provide references into the relevant portions of the optimization section of this guide.

Having completed the GPU acceleration of one or more components of the application, it is possible to compare the outcome with the original expectation. Recall that the initial assess step allowed the developer to determine an upper bound for the potential speedup attainable by accelerating given hotspots. Before tackling other hotspots to improve the total speedup, the developer should consider taking the partially parallelized implementation and carrying it through to production. This is important for a number of reasons; for example, it allows the user to profit from their investment as early as possible (the speedup may be partial but is still valuable), and it minimizes risk for the developer and the user by providing an evolutionary rather than revolutionary set of changes to the application.

In this guide, specific recommendations are made regarding the design and implementation of CUDA C code. These recommendations are categorized by priority, which is a blend of the effect of the recommendation and its scope. Actions that present substantial improvements for most CUDA applications have the highest priority, while small optimizations that affect only very specific situations are given a lower priority.
Before implementing lower priority recommendations, it is good practice to make sure all higher priority recommendations that are relevant have already been applied. This approach will tend to provide the best results for the time invested and will avoid the trap of premature optimization. The criteria of benefit and scope for establishing priority will vary depending on the nature of the program. In this guide, they represent a typical case. Your code might reflect different priority factors. Regardless of this possibility, it is good practice to verify that no higher-priority recommendations have been overlooked before undertaking lower-priority items.

Note: Code samples throughout the guide omit error checking for conciseness. Production code should, however, systematically check the error code returned by each API call and check for failures in kernel launches by calling cudaGetLastError().

From supercomputers to mobile phones, modern processors increasingly rely on parallelism to provide performance. The core computational unit, which includes control, arithmetic, registers, and typically some cache, is replicated some number of times and connected to memory via a network. As a result, all modern processors require parallel code in order to achieve good utilization of their computational power. While processors are evolving to expose more fine-grained parallelism to the programmer, many existing applications have evolved either as serial codes or as coarse-grained parallel codes (for example, where the data are decomposed into regions processed in parallel, with sub-regions shared using MPI). In order to profit from any modern processor architecture, GPUs included, the first steps are to assess the application to identify the hotspots, determine whether they can be parallelized, and understand the relevant workloads both now and in the future.

CUDA programming involves running code on two different platforms concurrently: a host system with one or more CPUs and one or more CUDA-enabled NVIDIA GPU devices. While NVIDIA GPUs are frequently associated with graphics, they are also powerful arithmetic engines capable of running thousands of lightweight threads in parallel. This capability makes them well suited to computations that can leverage parallel execution. However, the device is based on a distinctly different design from the host system, and it is important to understand those differences and how they determine the performance of CUDA applications in order to use CUDA effectively. The primary differences are in the threading model and in the separate physical memories:

Threading resources: Execution pipelines on host systems can support a limited number of concurrent threads. Servers that have four hex-core processors today can run only 24 threads concurrently (or 48 if the CPUs support Hyper-Threading). By comparison, the smallest executable unit of parallelism on a CUDA device comprises 32 threads (termed a warp of threads). Modern NVIDIA GPUs can support up to 1536 active threads concurrently per multiprocessor (see Features and Specifications of the CUDA C Programming Guide). On GPUs with 16 multiprocessors, this leads to more than 24,000 concurrently active threads.
Threads: Threads on a CPU are generally heavyweight entities. The operating system must swap threads on and off CPU execution channels to provide multithreading capability. Context switches (when two threads are swapped) are therefore slow and expensive. By comparison, threads on GPUs are extremely lightweight. In a typical system, thousands of threads are queued up for work (in warps of 32 threads each). If the GPU must wait on one warp of threads, it simply begins executing work on another. Because separate registers are allocated to all active threads, no swapping of registers or other state needs to occur when switching among GPU threads. Resources stay allocated to each thread until it completes its execution. In short, CPU cores are designed to minimize latency for one or two threads at a time each, whereas GPUs are designed to handle a large number of concurrent, lightweight threads in order to maximize throughput.

RAM: The host system and the device each have their own distinct attached physical memories. As the host and device memories are separated by the PCI Express (PCIe) bus, items in host memory must occasionally be communicated across the bus to device memory or vice versa, as described in What Runs on a CUDA-Enabled Device.

These are the primary hardware differences between CPU hosts and GPU devices with respect to parallel programming. Other differences are discussed as they arise elsewhere in this document. Applications composed with these differences in mind can treat the host and device together as a cohesive heterogeneous system wherein each processing unit is leveraged to do the kind of work it does best: sequential work on the host and parallel work on the device.

The following issues should be considered when determining which parts of an application to run on the device: The device is especially well suited to computations that can be run on numerous data elements simultaneously in parallel. This typically involves arithmetic on large data sets (such as matrices) where the same operation can be performed across thousands, if not millions, of elements at the same time. This is a requirement for good performance on CUDA: the software must use a large number (generally thousands or tens of thousands) of concurrent threads. The support for running numerous threads in parallel derives from CUDA's use of the lightweight threading model described above. For best performance, there should be some coherence in memory access by adjacent threads running on the device. Certain memory access patterns enable the hardware to coalesce groups of reads or writes of multiple data items into one operation. Data that cannot be laid out so as to enable coalescing, or that do not have enough locality to use the L1 or texture caches effectively, will tend to see lesser speedups when used in computations on CUDA. To use CUDA, data values must be transferred from the host to the device along the PCI Express (PCIe) bus. These transfers are costly in terms of performance and should be minimized. (See Data Transfer Between Host and Device.)
This cost has several ramifications: The complexity of operations should justify the cost of moving data to and from the device. Code that transfers data for brief use by a small number of threads will see little or no performance benefit. The ideal scenario is one in which many threads perform a substantial amount of work. For example, transferring two matrices to the device to perform a matrix addition and then transferring the results back to the host will not realize much performance benefit. The issue here is the number of operations performed per data element transferred. For the preceding procedure, assuming matrices of size NxN, there are N² operations (additions) and 3N² elements transferred, so the ratio of operations to elements transferred is 1:3 or O(1). Performance benefits can be more readily achieved when this ratio is higher. For example, a matrix multiplication of the same matrices requires N³ operations (multiply-add), so the ratio of operations to elements transferred is O(N), in which case the larger the matrix, the greater the performance improvement. The types of operations are an additional factor, as additions have different complexity profiles than, for example, trigonometric functions. It is important to include the overhead of transferring data to and from the device when determining whether operations should be performed on the host or on the device.

Data should be kept on the device as long as possible. Because transfers should be minimized, programs that run multiple kernels on the same data should favor leaving the data on the device between kernel calls, rather than transferring intermediate results to the host and then sending them back to the device for subsequent calculations. So, in the previous example, had the two matrices to be added already been on the device as a result of some previous calculation, or if the results of the addition would be used in some subsequent calculation, the matrix addition should be performed locally on the device. This approach should be used even if one of the steps in a sequence of calculations could be performed faster on the host. Even a relatively slow kernel may be advantageous if it avoids one or more PCIe transfers. Data Transfer Between Host and Device provides further details, including the measurements of bandwidth between the host and the device versus within the device proper.

Global memory accesses are cached in both L1 and L2 by default on devices of compute capability 2.x, and only in L2 by default on devices of higher compute capabilities, though some allow opt-in to caching in L1 as well via compilation flags. Local memory is cached in both L1 and L2 on devices of compute capability 2.x and 3.x; devices of compute capability 5.x cache locals only in L2. In the case of texture access, if a texture reference is bound to a linear array in global memory, then device code can write to the underlying array. Texture references that are bound to CUDA arrays can be written to via surface-write operations by binding a surface to the same underlying CUDA array storage.
Reading from a texture while writing to its underlying global memory array in the same kernel launch should be avoided because the texture caches are read-only and are not invalidated when the associated global memory is modified.

Perhaps the single most important performance consideration in programming for CUDA-capable GPU architectures is the coalescing of global memory accesses. Global memory loads and stores by threads of a warp are coalesced by the device into as few as one transaction when certain access requirements are met.

Note: High Priority: Ensure global memory accesses are coalesced whenever possible.

The access requirements for coalescing depend on the compute capability of the device and are documented in the CUDA C Programming Guide. For devices of compute capability 2.x, the requirements can be summarized quite easily: the concurrent accesses of the threads of a warp will coalesce into a number of transactions equal to the number of cache lines necessary to service all of the threads of the warp. By default, all accesses are cached through L1, which has 128-byte lines. For scattered access patterns, to reduce overfetch, it can sometimes be useful to cache only in L2, which caches shorter 32-byte segments (see the CUDA C Programming Guide). For devices of compute capability 3.x, accesses to global memory are cached only in L2; L1 is reserved for local memory accesses. Some devices of compute capability 3.5, 3.7, or 5.2 allow opt-in caching of globals in L1 as well. Accessing memory in a coalesced way is even more important when ECC is turned on. Scattered accesses increase ECC memory transfer overhead, especially when writing data to global memory.

Coalescing concepts are illustrated in the following simple examples. These examples assume compute capability 2.x. They assume that accesses are cached through L1, which is the default behavior on those devices, and that accesses are for 4-byte words, unless otherwise noted. The first and simplest case of coalescing can be achieved by any CUDA-enabled device: the k-th thread accesses the k-th word in a cache line. Not all threads need to participate. For example, if the threads of a warp access adjacent 4-byte words (for example, adjacent float values), a single 128-byte L1 cache line, and therefore a single coalesced transaction, will service that memory access. Such a pattern is shown in Figure 3. Figure 3. Coalesced access - all threads access one cache line. This access pattern results in a single 128-byte L1 transaction, indicated by the red rectangle. If some words of the line had not been requested by any thread (such as if several threads had accessed the same word or if some threads did not participate in the access), all data in the cache line are fetched anyway. Furthermore, if accesses by the threads of the warp had been permuted within this segment, still only one 128-byte L1 transaction would be performed by a device with compute capability 2.x.
If sequential threads in a warp access memory that is sequential but not aligned with the cache lines, two 128-byte L1 cache lines will be requested, as shown in Figure 4. Figure 4. Unaligned sequential addresses that fit into two 128-byte L1 cache lines. For non-caching transactions (that is, those that bypass L1 and use only the L2 cache), a similar effect is seen, except at the level of the 32-byte L2 segments. In Figure 5 we see an example of this: the same access pattern from Figure 4 is used, but now L1 caching is disabled, so five 32-byte L2 segments are needed to satisfy the request. Figure 5. Misaligned sequential addresses that fall within five 32-byte L2 cache segments.

Memory allocated through the CUDA Runtime API, such as via cudaMalloc(), is guaranteed to be aligned to at least 256 bytes. Therefore, choosing sensible thread block sizes, such as multiples of the warp size (that is, 32 on current GPUs), facilitates memory accesses by warps that are aligned to cache lines. (Consider what would happen to the memory addresses accessed by the second, third, and subsequent thread blocks if the thread block size were not a multiple of warp size, for example.)

It is easy and informative to explore the ramifications of misaligned accesses using a simple copy kernel, such as the one in A Copy Kernel That Illustrates Misaligned Accesses. In that kernel, data are copied from the input array idata to the output array odata, both of which exist in global memory. The kernel is executed within a loop in host code that varies the parameter offset from 0 to 32. (Figure 4 and Figure 5 correspond to misalignments in the cases of caching and non-caching memory accesses, respectively.) The effective bandwidth for the copy with various offsets on an NVIDIA Tesla M2090 (compute capability 2.0, with ECC turned on, as it is by default) is shown in Figure 6. Figure 6. Performance of the offsetCopy kernel.

For the NVIDIA Tesla M2090, global memory accesses with no offset or with offsets that are multiples of 32 words result in a single L1 cache line transaction or 4 L2 cache segment loads (for non-L1-caching loads). The achieved bandwidth is approximately 130 GB/s. Otherwise, either two L1 cache lines (caching mode) or four to five L2 cache segments (non-caching mode) are loaded per warp, resulting in approximately 4/5 of the memory throughput achieved without offsets. An interesting point is that we might expect the caching case to perform worse than the non-caching case for this sample, given that each warp in the caching case fetches twice as many bytes as it requires, whereas in the non-caching case, only 5/4 as many bytes as required are fetched per warp. In this particular example, that effect is not apparent, however, because adjacent warps reuse the cache lines their neighbors fetched. So while the impact is still evident in the case of caching loads, it is not as great as we might have expected. It would have been more so if adjacent warps had not exhibited such a high degree of reuse of the over-fetched cache lines.
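For reference, the misaligned-copy kernel discussed above might look like the following sketch (a reconstruction along the lines of the guide's offsetCopy listing, which is not reproduced in this excerpt; names are assumed):

__global__ void offsetCopy(float *odata, float *idata, int offset)
{
    // Each thread copies one element, shifted by the host-supplied offset;
    // an offset that is not a multiple of 32 breaks cache-line alignment.
    int xid = blockIdx.x * blockDim.x + threadIdx.x + offset;
    odata[xid] = idata[xid];
}

The host code would launch this kernel in a loop, varying offset from 0 to 32 and timing each run to measure the effective bandwidth at each misalignment.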
As seen above, in the case of misaligned sequential accesses, the caches of compute capability 2.x devices help considerably to achieve reasonable performance. It may be different with non-unit-strided accesses, however, and this is a pattern that occurs frequently when dealing with multidimensional data or matrices. For this reason, ensuring that as much as possible of the data in each cache line fetched is actually used is an important part of performance optimization of memory accesses on these devices. To illustrate the effect of strided access on effective bandwidth, see the strideCopy() kernel in A Kernel to Illustrate Non-Unit Stride Data Copy (sketched below), which copies data with a stride of stride elements between threads from idata to odata. Figure 7 illustrates such a situation; in this case, threads within a warp access words in memory with a stride of 2. This action leads to a load of two L1 cache lines (or eight L2 cache segments in non-caching mode) per warp on the Tesla M2090 (compute capability 2.0). Figure 7. Adjacent threads accessing memory with a stride of 2. A stride of 2 results in a 50% load/store efficiency, since half the elements in the transaction are not used and represent wasted bandwidth. As the stride increases, the effective bandwidth decreases until the point where 32 cache lines are loaded for the 32 threads in a warp, as indicated in Figure 8. Figure 8. Performance of the strideCopy kernel. As illustrated in Figure 8, non-unit-stride global memory accesses should be avoided whenever possible. One method for doing so utilizes shared memory, which is discussed in the next section.

Because it is on-chip, shared memory has much higher bandwidth and lower latency than local and global memory, provided there are no bank conflicts between the threads, as detailed in the following section. To achieve high memory bandwidth for concurrent accesses, shared memory is divided into equally sized memory modules (banks) that can be accessed simultaneously. Therefore, any memory load or store of n addresses that spans n distinct memory banks can be serviced simultaneously, yielding an effective bandwidth that is n times as high as the bandwidth of a single bank. However, if multiple addresses of a memory request map to the same memory bank, the accesses are serialized. The hardware splits a memory request that has bank conflicts into as many separate conflict-free requests as necessary, decreasing the effective bandwidth by a factor equal to the number of separate memory requests. The one exception is when multiple threads in a warp address the same shared memory location, resulting in a broadcast. Devices of compute capability 2.x and higher have the additional ability to multicast shared memory accesses (that is, to send copies of the same value to several threads of the warp). To minimize bank conflicts, it is important to understand how memory addresses map to memory banks and how to optimally schedule memory requests.
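Returning to the strided copy discussed above, the kernel might look like this sketch (a reconstruction of the strideCopy-style listing referenced in the text; names are assumed):

__global__ void strideCopy(float *odata, float *idata, int stride)
{
    // Adjacent threads access elements that are 'stride' apart, so for
    // stride > 1 each warp touches more cache lines than it actually uses.
    int xid = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    odata[xid] = idata[xid];
}

With stride equal to 2, half of every fetched cache line goes unused; as the stride grows toward 32, each thread in the warp ends up touching its own cache line.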
Compute Capability 2.x: On devices of compute capability 2.x, each bank has a bandwidth of 32 bits every two clock cycles, and successive 32-bit words are assigned to successive banks. The warp size is 32 threads and the number of banks is also 32, so bank conflicts can occur between any threads in the warp. See Compute Capability 2.x in the CUDA C Programming Guide for further details.

Compute Capability 3.x: On devices of compute capability 3.x, each bank has a bandwidth of 64 bits every clock cycle (*). There are two different banking modes: either successive 32-bit words (in 32-bit mode) or successive 64-bit words (64-bit mode) are assigned to successive banks. The warp size is 32 threads and the number of banks is also 32, so bank conflicts can occur between any threads in the warp. See Compute Capability 3.x in the CUDA C Programming Guide for further details. Note: (*) However, devices of compute capability 3.x typically have lower clock frequencies than devices of compute capability 2.x for improved power efficiency.

Shared memory enables cooperation between threads in a block. When multiple threads in a block use the same data from global memory, shared memory can be used to access the data from global memory only once. Shared memory can also be used to avoid uncoalesced memory accesses by loading and storing data in a coalesced pattern from global memory and then reordering it in shared memory. Aside from memory bank conflicts, there is no penalty for non-sequential or unaligned accesses by a warp in shared memory.

The use of shared memory is illustrated via the simple example of a matrix multiplication C = AB for the case with A of dimension Mxw, B of dimension wxN, and C of dimension MxN. To keep the kernels simple, M and N are multiples of 32, and w is 32 for devices of compute capability 2.0 or higher. A natural decomposition of the problem is to use a block and tile size of wxw threads. Therefore, in terms of wxw tiles, A is a column matrix, B is a row matrix, and C is their outer product; see Figure 9. A grid of N/w by M/w blocks is launched, where each thread block calculates the elements of a different tile in C from a single tile of A and a single tile of B. Figure 9. Block-column matrix multiplied by block-row matrix. Block-column matrix (A) multiplied by block-row matrix (B) with resulting product matrix (C).

To do this, the simpleMultiply kernel (Unoptimized Matrix Multiplication, sketched below) calculates the output elements of a tile of matrix C. In that kernel, a, b, and c are pointers to global memory for the matrices A, B, and C, respectively; blockDim.x, blockDim.y, and TILE_DIM are all equal to w. Each thread in the wxw-thread block calculates one element in a tile of C. row and col are the row and column of the element in C being calculated by a particular thread. The for loop over i multiplies a row of A by a column of B, which is then written to C. The effective bandwidth of this kernel is only 6.6 GB/s on an NVIDIA Tesla K20X (with ECC off).
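The unoptimized kernel described above might look like the following sketch (a reconstruction of the simpleMultiply-style listing; the exact code is not reproduced in this excerpt):

#define TILE_DIM 32   // equal to w and to blockDim.x and blockDim.y

__global__ void simpleMultiply(float *a, float *b, float *c, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    // Multiply one row of the A tile by one column of the B tile.
    for (int i = 0; i < TILE_DIM; i++) {
        sum += a[row * TILE_DIM + i] * b[i * N + col];
    }
    c[row * N + col] = sum;
}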
To analyze performance, it is necessary to consider how warps access global memory in the for loop. Each warp of threads calculates one row of a tile of C, which depends on a single row of A and an entire tile of B, as illustrated in Figure 10. Figure 10. Computing a row of a tile. Computing a row of a tile in C using one row of A and an entire tile of B.

For each iteration i of the for loop, the threads in a warp read a row of the B tile, which is a sequential and coalesced access for all compute capabilities. However, for each iteration i, all threads in a warp read the same value from global memory for matrix A, as the index row*TILE_DIM+i is constant within a warp. Even though such an access requires only 1 transaction on devices of compute capability 2.0 or higher, there is wasted bandwidth in the transaction, because only one 4-byte word out of the 32 words in the cache line is used. We can reuse this cache line in subsequent iterations of the loop, and we would eventually utilize all 32 words; however, when many warps execute on the same multiprocessor simultaneously, as is generally the case, the cache line may easily be evicted from the cache between iterations i and i+1.

The performance on a device of any compute capability can be improved by reading a tile of A into shared memory, as shown in Using Shared Memory to Improve the Global Memory Load Efficiency in Matrix Multiplication (sketched below). In that version, each element in a tile of A is read from global memory only once, in a fully coalesced fashion (with no wasted bandwidth), into shared memory. Within each iteration of the for loop, a value in shared memory is broadcast to all threads in a warp. No __syncthreads() synchronization barrier call is needed after reading the tile of A into shared memory, because only threads within the warp that write the data into shared memory read that data. (Note: in lieu of __syncthreads(), the shared array may need to be marked as volatile for correctness on devices of compute capability 2.0 or higher; see the NVIDIA Fermi Compatibility Guide.) This kernel has an effective bandwidth of 7.8 GB/s on an NVIDIA Tesla K20X. This illustrates the use of shared memory as a user-managed cache when the hardware L1 cache eviction policy does not match up well with the needs of the application or when L1 cache is not used for reads from global memory.

A further improvement can be made to how this version deals with matrix B. In calculating each of the rows of a tile of matrix C, the entire tile of B is read. The repeated reading of the B tile can be eliminated by reading it into shared memory once (Improvement by Reading Additional Data into Shared Memory).
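The shared-memory version for the A tile described above might look like the following sketch (a reconstruction of the coalescedMultiply-style listing; names are assumed, and TILE_DIM is 32 as in the previous sketch):

__global__ void coalescedMultiply(float *a, float *b, float *c, int N)
{
    __shared__ float aTile[TILE_DIM][TILE_DIM];

    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    // Each warp loads one row of the A tile in a fully coalesced fashion.
    aTile[threadIdx.y][threadIdx.x] = a[row * TILE_DIM + threadIdx.x];
    for (int i = 0; i < TILE_DIM; i++) {
        // Reads of aTile are broadcast within the warp; no __syncthreads()
        // is needed because each warp reads only the data it wrote itself.
        sum += aTile[threadIdx.y][i] * b[i * N + col];
    }
    c[row * N + col] = sum;
}

The further improvement described next stages the B tile in shared memory as well, which does require a __syncthreads() call because warps then read data written by other warps.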
Improvement by Reading Additional Data into Shared Memory: note that in this version, a __syncthreads() call is required after reading the B tile, because a warp reads data from shared memory that were written to shared memory by different warps. The effective bandwidth of this routine is 14.9 GB/s on an NVIDIA Tesla K20X. Note that the performance improvement is not due to improved coalescing in either case, but to avoiding redundant transfers from global memory. The results of the various optimizations are summarized in Table 2 (Performance Improvements Optimizing C = AB Matrix Multiplication, NVIDIA Tesla K20X). As can be seen from this table, judicious use of shared memory can dramatically improve performance. The examples in this section have illustrated three reasons to use shared memory: to enable coalesced accesses to global memory, especially to avoid large strides (for general matrices, strides are much larger than 32); to eliminate (or reduce) redundant loads from global memory; and to avoid wasted bandwidth.

Local memory is so named because its scope is local to the thread, not because of its physical location. In fact, local memory is off-chip. Hence, access to local memory is as expensive as access to global memory. In other words, the term local in the name does not imply faster access. Local memory is used only to hold automatic variables. This is done by the nvcc compiler when it determines that there is insufficient register space to hold the variable. Automatic variables that are likely to be placed in local memory are large structures or arrays that would consume too much register space and arrays that the compiler determines may be indexed dynamically. Inspection of the PTX assembly code (obtained by compiling with the -ptx or -keep command-line options to nvcc) reveals whether a variable has been placed in local memory during the first compilation phases. If it has, it will be declared using the .local mnemonic and accessed using the ld.local and st.local mnemonics. If it has not, subsequent compilation phases might still decide otherwise, if they find the variable consumes too much register space for the targeted architecture. There is no way to check this for a specific variable, but the compiler reports total local memory usage per kernel (lmem) when run with the --ptxas-options=-v option.

The read-only texture memory space is cached. Therefore, a texture fetch costs one device memory read only on a cache miss; otherwise, it just costs one read from the texture cache. The texture cache is optimized for 2D spatial locality, so threads of the same warp that read texture addresses that are close together will achieve best performance. Texture memory is also designed for streaming fetches with a constant latency; that is, a cache hit reduces DRAM bandwidth demand, but not fetch latency. In certain addressing situations, reading device memory through texture fetching can be an advantageous alternative to reading device memory from global or constant memory. If textures are fetched using tex1D(), tex2D(), or tex3D() rather than tex1Dfetch(), the hardware provides other capabilities that might be useful for some applications such as image processing, as shown in Table 4.
Table 4. Useful Features for tex1D(), tex2D(), and tex3D() Fetches. Feature: automatic handling of boundary cases (1). Caveat: can be used only with normalized texture coordinates. (1) The automatic handling of boundary cases in the bottom row of Table 4 refers to how a texture coordinate is resolved when it falls outside the valid addressing range. There are two options: clamp and wrap. If x is the coordinate and N is the number of texels for a one-dimensional texture, then with clamp, x is replaced by 0 if x < 0 and by 1-1/N if 1 ≤ x. With wrap, x is replaced by frac(x), where frac(x) = x - floor(x) and floor(x) returns the largest integer less than or equal to x. So, in clamp mode where N = 1, an x of 1.3 is clamped to 1.0, whereas in wrap mode it is converted to 0.3.

Within a kernel call, the texture cache is not kept coherent with respect to global memory writes, so texture fetches from addresses that have been written via global stores in the same kernel call return undefined data. That is, a thread can safely read a memory location via texture if the location has been updated by a previous kernel call or memory copy, but not if it has been previously updated by the same thread or another thread within the same kernel call.

There is a total of 64 KB constant memory on a device. The constant memory space is cached. As a result, a read from constant memory costs one memory read from device memory only on a cache miss; otherwise, it just costs one read from the constant cache. Accesses to different addresses by threads within a warp are serialized, thus the cost scales linearly with the number of unique addresses read by all threads within a warp. As such, the constant cache is best when threads in the same warp access only a few distinct locations. If all threads of a warp access the same location, then constant memory can be as fast as a register access.

Generally, accessing a register consumes zero extra clock cycles per instruction, but delays may occur due to register read-after-write dependencies and register memory bank conflicts. The latency of read-after-write dependencies is approximately 24 cycles, but this latency is completely hidden on multiprocessors that have sufficient concurrent warps of threads. For devices of compute capability 2.0, which have 32 CUDA cores per multiprocessor, as many as 768 threads (24 warps) might be required to completely hide latency, and so on for devices of higher compute capabilities. The compiler and hardware thread scheduler will schedule instructions as optimally as possible to avoid register memory bank conflicts. They achieve the best results when the number of threads per block is a multiple of 64. Other than following this rule, an application has no direct control over these bank conflicts. In particular, there is no register-related reason to pack data into float4 or int4 types.

Register pressure occurs when there are not enough registers available for a given task. Even though each multiprocessor contains thousands of 32-bit registers (see Features and Technical Specifications of the CUDA C Programming Guide), these are partitioned among concurrent threads. To prevent the compiler from allocating too many registers, use the -maxrregcount=N compiler command-line option (see nvcc) or the __launch_bounds__ kernel definition qualifier (see Execution Configuration of the CUDA C Programming Guide) to control the maximum number of registers allocated per thread.
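As a minimal sketch of the two mechanisms just mentioned (the kernel body and the specific limits are illustrative assumptions, not values recommended by the guide):

// Per-kernel: ask the compiler to limit register usage so that blocks of up to
// 256 threads can run with at least 4 blocks resident per multiprocessor.
__global__ void __launch_bounds__(256, 4) scaleKernel(float *data, float factor)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] *= factor;
}

// Per-file: alternatively, cap registers for every kernel in the compilation unit:
//   nvcc -maxrregcount=32 kernels.cu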
Device memory allocation and de-allocation via cudaMalloc() and cudaFree() are expensive operations, so device memory should be reused and/or sub-allocated by the application wherever possible to minimize the impact of allocations on overall performance.

One of the keys to good performance is to keep the multiprocessors on the device as busy as possible. A device in which work is poorly balanced across the multiprocessors will deliver suboptimal performance. Hence, it is important to design your application to use threads and blocks in a way that maximizes hardware utilization and to limit practices that impede the free distribution of work. A key concept in this effort is occupancy, which is explained in the following sections. Hardware utilization can also be improved in some cases by designing your application so that multiple, independent kernels can execute at the same time. Multiple kernels executing at the same time is known as concurrent kernel execution. Concurrent kernel execution is described below. Another important concept is the management of system resources allocated for a particular task. How to manage this resource utilization is discussed in the final sections of this chapter.

Thread instructions are executed sequentially in CUDA, and, as a result, executing other warps when one warp is paused or stalled is the only way to hide latencies and keep the hardware busy. Some metric related to the number of active warps on a multiprocessor is therefore important in determining how effectively the hardware is kept busy. This metric is occupancy. Occupancy is the ratio of the number of active warps per multiprocessor to the maximum number of possible active warps. (To determine the latter number, see the deviceQuery CUDA Sample or refer to Compute Capabilities in the CUDA C Programming Guide.) Another way to view occupancy is the percentage of the hardware's ability to process warps that is actively in use. Higher occupancy does not always equate to higher performance; there is a point above which additional occupancy does not improve performance. However, low occupancy always interferes with the ability to hide memory latency, resulting in performance degradation.

One of several factors that determine occupancy is register availability. Register storage enables threads to keep local variables nearby for low-latency access. However, the set of registers (known as the register file) is a limited commodity that all threads resident on a multiprocessor must share. Registers are allocated to an entire block all at once. So, if each thread block uses many registers, the number of thread blocks that can be resident on a multiprocessor is reduced, thereby lowering the occupancy of the multiprocessor. The maximum number of registers per thread can be set manually at compilation time per-file using the -maxrregcount option or per-kernel using the __launch_bounds__ qualifier (see Register Pressure).

For purposes of calculating occupancy, the number of registers used by each thread is one of the key factors. For example, devices with compute capability 1.1 have 8,192 32-bit registers per multiprocessor and can have a maximum of 768 simultaneous threads resident (24 warps x 32 threads per warp). This means that in one of these devices, for a multiprocessor to have 100% occupancy, each thread can use at most 10 registers. However, this approach of determining how register count affects occupancy does not take into account the register allocation granularity.
For example, on a device of compute capability 1.1, a kernel with 128-thread blocks using 12 registers per thread results in an occupancy of 83% with 5 active 128-thread blocks per multiprocessor, whereas a kernel with 256-thread blocks using the same 12 registers per thread results in an occupancy of 66% because only two 256-thread blocks can reside on a multiprocessor. Furthermore, register allocations are rounded up to the nearest 256 registers per block on devices with compute capability 1.1. The number of registers available, the maximum number of simultaneous threads resident on each multiprocessor, and the register allocation granularity vary over different compute capabilities. Because of these nuances in register allocation and the fact that a multiprocessor's shared memory is also partitioned between resident thread blocks, the exact relationship between register usage and occupancy can be difficult to determine. The --ptxas-options=-v option of nvcc details the number of registers used per thread for each kernel. See Hardware Multithreading of the CUDA C Programming Guide for the register allocation formulas for devices of various compute capabilities and Features and Technical Specifications of the CUDA C Programming Guide for the total number of registers available on those devices. Alternatively, NVIDIA provides an occupancy calculator in the form of an Excel spreadsheet that enables developers to hone in on the optimal balance and to test different possible scenarios more easily. This spreadsheet, shown in Figure 11, is called CUDA_Occupancy_Calculator.xls and is located in the tools subdirectory of the CUDA Toolkit installation. Figure 11. Using the CUDA Occupancy Calculator to project GPU multiprocessor occupancy. In addition to the calculator spreadsheet, occupancy can be determined using the NVIDIA Visual Profiler's Achieved Occupancy metric. The Visual Profiler also calculates occupancy as part of the Multiprocessor stage of application analysis.

As described in Asynchronous and Overlapping Transfers with Computation, CUDA streams can be used to overlap kernel execution with data transfers. On devices that are capable of concurrent kernel execution, streams can also be used to execute multiple kernels simultaneously to more fully take advantage of the device's multiprocessors. Whether a device has this capability is indicated by the concurrentKernels field of the cudaDeviceProp structure (or listed in the output of the deviceQuery CUDA Sample). Non-default streams (streams other than stream 0) are required for concurrent execution because kernel calls that use the default stream begin only after all preceding calls on the device (in any stream) have completed, and no operation on the device (in any stream) commences until they are finished. The example sketched at the end of this passage illustrates the basic technique. Because kernel1 and kernel2 are executed in different, non-default streams, a capable device can execute the kernels at the same time.

CUDA work occurs within a process space for a particular GPU known as a context. The context encapsulates kernel launches and memory allocations for that GPU as well as supporting constructs such as the page tables. The context is explicit in the CUDA Driver API but is entirely implicit in the CUDA Runtime API, which creates and manages contexts automatically. With the CUDA Driver API, a CUDA application process can potentially create more than one context for a given GPU.
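The concurrent-kernel example referred to above might look like the following sketch (a minimal reconstruction; the kernel bodies and sizes are illustrative assumptions):

#include <cuda_runtime.h>

__global__ void kernel1(float *d) { d[threadIdx.x] += 1.0f; }
__global__ void kernel2(float *d) { d[threadIdx.x] *= 2.0f; }

int main()
{
    float *data1, *data2;
    cudaMalloc(&data1, 256 * sizeof(float));
    cudaMalloc(&data2, 256 * sizeof(float));

    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);

    // Launched in different non-default streams, so a device whose
    // concurrentKernels property is set may run them at the same time.
    kernel1<<<1, 256, 0, stream1>>>(data1);
    kernel2<<<1, 256, 0, stream2>>>(data2);

    cudaStreamSynchronize(stream1);
    cudaStreamSynchronize(stream2);

    cudaStreamDestroy(stream1);
    cudaStreamDestroy(stream2);
    cudaFree(data1);
    cudaFree(data2);
    return 0;
}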
If multiple CUDA application processes access the same GPU concurrently, this almost always implies multiple contexts, since a context is tied to a particular host process unless CUDA Multi-Process Service is in use. While multiple contexts (and their associated resources such as global memory allocations) can be allocated concurrently on a given GPU, only one of these contexts can execute work at any given moment on that GPU; contexts sharing the same GPU are time-sliced. Creating additional contexts incurs memory overhead for per-context data and time overhead for context switching. Furthermore, the need for context switching can reduce utilization when work from several contexts could otherwise execute concurrently (see also Concurrent Kernel Execution). Therefore, it is best to avoid multiple contexts per GPU within the same CUDA application. To assist with this, the CUDA Driver API provides methods to access and manage a special context on each GPU called the primary context. These are the same contexts used implicitly by the CUDA Runtime when there is not already a current context for a thread.

Note: NVIDIA-SMI can be used to configure a GPU for exclusive compute mode, which limits the number of threads and/or processes that can have simultaneous contexts on a particular GPU to one.

Note: Medium Priority: To hide latency arising from register dependencies, maintain sufficient numbers of active threads per multiprocessor (i.e., sufficient occupancy).

Register dependencies arise when an instruction uses a result stored in a register written by an instruction before it. The latency on current CUDA-enabled GPUs is approximately 24 cycles, so threads must wait 24 cycles before using an arithmetic result. However, this latency can be completely hidden by the execution of threads in other warps. See Registers for details.

Note: Medium Priority: The number of threads per block should be a multiple of 32 threads, because this provides optimal computing efficiency and facilitates coalescing.

The dimension and size of blocks per grid and the dimension and size of threads per block are both important factors. The multidimensional aspect of these parameters allows easier mapping of multidimensional problems to CUDA and does not play a role in performance. As a result, this section discusses size but not dimension. Latency hiding and occupancy depend on the number of active warps per multiprocessor, which is implicitly determined by the execution parameters along with resource (register and shared memory) constraints. Choosing execution parameters is a matter of striking a balance between latency hiding (occupancy) and resource utilization. Choosing the execution configuration parameters should be done in tandem; however, there are certain heuristics that apply to each parameter individually. When choosing the first execution configuration parameter, the number of blocks per grid, or grid size, the primary concern is keeping the entire GPU busy. The number of blocks in a grid should be larger than the number of multiprocessors so that all multiprocessors have at least one block to execute. Furthermore, there should be multiple active blocks per multiprocessor so that blocks that are not waiting for a __syncthreads() can keep the hardware busy. This recommendation is subject to resource availability; therefore, it should be determined in the context of the second execution parameter, the number of threads per block, or block size, as well as shared memory usage.
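A minimal sketch of how an execution configuration is typically derived (the kernel name, the pointer d_data, and the sizes are illustrative assumptions, not recommendations from the guide):

__global__ void myKernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= 2.0f;   // guard: the grid may slightly overshoot n
}

// Host side:
int n = 1 << 20;                 // number of elements to process
int threadsPerBlock = 256;       // a multiple of the warp size (32)
// Enough blocks to cover all n elements; for large n this is far more blocks
// than multiprocessors, which helps keep the whole GPU busy.
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, n);   // d_data: device pointer from cudaMalloc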
To scale to future devices, the number of blocks per kernel launch should be in the thousands. When choosing the block size, it is important to remember that multiple concurrent blocks can reside on a multiprocessor, so occupancy is not determined by block size alone. In particular, a larger block size does not imply a higher occupancy. For example, on a device of compute capability 1.1 or lower, a kernel with a maximum block size of 512 threads results in an occupancy of 66 percent because the maximum number of threads per multiprocessor on such a device is 768. Hence, only a single block can be active per multiprocessor. However, a kernel with 256 threads per block on such a device can result in 100 percent occupancy with three resident active blocks. As mentioned in Occupancy, higher occupancy does not always equate to better performance. For example, improving occupancy from 66 percent to 100 percent generally does not translate to a similar increase in performance. A lower occupancy kernel will have more registers available per thread than a higher occupancy kernel, which may result in less register spilling to local memory. Typically, once an occupancy of 50 percent has been reached, additional increases in occupancy do not translate into improved performance. It is in some cases possible to fully cover latency with even fewer warps, notably via instruction-level parallelism (ILP); for discussion, see nvidiacontentGTC-2010pdfs2238GTC2010.pdf.

There are many such factors involved in selecting block size, and inevitably some experimentation is required. However, a few rules of thumb should be followed: Threads per block should be a multiple of warp size to avoid wasting computation on under-populated warps and to facilitate coalescing. A minimum of 64 threads per block should be used, and only if there are multiple concurrent blocks per multiprocessor. Between 128 and 256 threads per block is a better choice and a good initial range for experimentation with different block sizes. Use several (3 to 4) smaller thread blocks rather than one large thread block per multiprocessor if latency affects performance. This is particularly beneficial to kernels that frequently call __syncthreads(). Note that when a thread block allocates more registers than are available on a multiprocessor, the kernel launch fails, as it will when too much shared memory or too many threads are requested.

Shared memory can be helpful in several situations, such as helping to coalesce or eliminate redundant access to global memory. However, it also can act as a constraint on occupancy. In many cases, the amount of shared memory required by a kernel is related to the block size that was chosen, but the mapping of threads to shared memory elements does not need to be one-to-one. For example, it may be desirable to use a 32x32 element shared memory array in a kernel, but because the maximum number of threads per block is 512, it is not possible to launch a kernel with 32x32 threads per block. In such cases, kernels with 32x16 or 32x8 threads can be launched with each thread processing two or four elements, respectively, of the shared memory array (a sketch of the two-elements-per-thread case follows). The approach of using a single thread to process multiple elements of a shared memory array can be beneficial even if limits such as threads per block are not an issue. This is because some operations common to each element can be performed by the thread once, amortizing the cost over the number of shared memory elements processed by the thread.
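As a minimal sketch of the two-elements-per-thread arrangement mentioned above (the kernel, the scale-by-two operation, and the names are illustrative assumptions):

#define TILE_DIM 32

// Launched with dim3 block(TILE_DIM, TILE_DIM / 2), i.e., 32x16 threads,
// so each thread handles two rows of the 32x32 shared memory tile.
__global__ void processTile(float *odata, const float *idata, int width)
{
    __shared__ float tile[TILE_DIM][TILE_DIM];
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;

    // With blockDim.y == 16, each thread loads two elements, TILE_DIM/2 rows apart.
    for (int j = 0; j < TILE_DIM; j += blockDim.y) {
        tile[threadIdx.y + j][threadIdx.x] = idata[(y + j) * width + x];
    }
    __syncthreads();
    for (int j = 0; j < TILE_DIM; j += blockDim.y) {
        odata[(y + j) * width + x] = 2.0f * tile[threadIdx.y + j][threadIdx.x];
    }
}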
A useful technique to determine the sensitivity of performance to occupancy is through experimentation with the amount of dynamically allocated shared memory, as specified in the third parameter of the execution configuration. By simply increasing this parameter (without modifying the kernel), it is possible to effectively reduce the occupancy of the kernel and measure its effect on performance. As mentioned in the previous section, once an occupancy of more than 50 percent has been reached, it generally does not pay to optimize parameters to obtain higher occupancy ratios. The previous technique can be used to determine whether such a plateau has been reached.

Awareness of how instructions are executed often permits low-level optimizations that can be useful, especially in code that is run frequently (the so-called hot spot in a program). Best practices suggest that this optimization be performed after all higher-level optimizations have been completed. Single-precision floats provide the best performance, and their use is highly encouraged. The throughput of individual arithmetic operations is detailed in the CUDA C Programming Guide.

Note: Low Priority: Use shift operations to avoid expensive division and modulo calculations.

Integer division and modulo operations are particularly costly and should be avoided or replaced with bitwise operations whenever possible: if n is a power of 2, (i/n) is equivalent to (i >> log2(n)) and (i % n) is equivalent to (i & (n-1)). The compiler will perform these conversions if n is a literal. (For further information, refer to Performance Guidelines in the CUDA C Programming Guide.) The reciprocal square root should always be invoked explicitly as rsqrtf() for single precision and rsqrt() for double precision. The compiler optimizes 1.0f/sqrtf(x) into rsqrtf() only when this does not violate IEEE-754 semantics.

Note: Low Priority: Avoid automatic conversion of doubles to floats.

The compiler must on occasion insert conversion instructions, introducing additional execution cycles. This is the case for: functions operating on char or short whose operands generally need to be converted to an int; and double-precision floating-point constants (defined without any type suffix) used as input to single-precision floating-point computations. The latter case can be avoided by using single-precision floating-point constants, defined with an f suffix such as 3.141592653589793f, 1.0f, or 0.5f. This suffix has accuracy implications in addition to its ramifications on performance. The effects on accuracy are discussed in Promotions to Doubles and Truncations to Floats. Note that this distinction is particularly important to performance on devices of compute capability 2.x. For single-precision code, use of the float type and the single-precision math functions is highly recommended. When compiling for devices without native double-precision support, such as devices of compute capability 1.2 and earlier, each double-precision floating-point variable is converted to single-precision floating-point format (but retains its size of 64 bits) and double-precision arithmetic is demoted to single-precision arithmetic. It should also be noted that the CUDA math library's complementary error function, erfcf(), is particularly fast with full single-precision accuracy.

For some fractional exponents, exponentiation can be accelerated significantly compared to the use of pow() by using square roots, cube roots, and their inverses.
For those exponentiations where the exponent is not exactly representable as a floating-point number, such as 1/3, this can also provide much more accurate results, as use of pow() magnifies the initial representational error. The formulas in the table below are valid for x > 0 and x != -0, that is, signbit(x) == 0. Table 5. Formulae for exponentiation by small fractions. Note: Medium Priority: Use the fast math library whenever speed trumps precision. Two types of runtime math operations are supported. They can be distinguished by their names: some have names with prepended double underscores, whereas others do not (e.g., __functionName() versus functionName()). Functions following the __functionName() naming convention map directly to the hardware level. They are faster but provide somewhat lower accuracy (e.g., __sinf(x) and __expf(x)). Functions following the functionName() naming convention are slower but have higher accuracy (e.g., sinf(x) and expf(x)). The throughput of __sinf(x), __cosf(x), and __expf(x) is much greater than that of sinf(x), cosf(x), and expf(x). The latter become even more expensive (about an order of magnitude slower) if the magnitude of the argument x needs to be reduced. Moreover, in such cases, the argument-reduction code uses local memory, which can affect performance even more because of the high latency of local memory. More details are available in the CUDA C Programming Guide. Note also that whenever sine and cosine of the same argument are computed, the sincos family of instructions should be used to optimize performance: __sincosf() for single-precision fast math (see next paragraph), sincosf() for regular single precision, and sincos() for double precision. The -use_fast_math compiler option of nvcc coerces every functionName() call to the equivalent __functionName() call. This switch should be used whenever accuracy is a lesser priority than performance, which is frequently the case with transcendental functions. Note that this switch is effective only on single-precision floating point. Note: Medium Priority: Prefer faster, more specialized math functions over slower, more general ones when possible. For small integer powers (e.g., x^2 or x^3), explicit multiplication is almost certainly faster than the use of general exponentiation routines such as pow(). While compiler optimization improvements continually seek to narrow this gap, explicit multiplication (or the use of an equivalent purpose-built inline function or macro) can have a significant advantage. This advantage is increased when several powers of the same base are needed (e.g., where both x^2 and x^5 are calculated in close proximity), as this aids the compiler in its common sub-expression elimination (CSE) optimization; a brief sketch follows this paragraph. For exponentiation using base 2 or 10, use the functions exp2() or exp2f() and exp10() or exp10f() rather than the functions pow() or powf(). Both pow() and powf() are heavy-weight functions in terms of register pressure and instruction count due to the numerous special cases arising in general exponentiation and the difficulty of achieving good accuracy across the entire ranges of the base and the exponent. The functions exp2(), exp2f(), exp10(), and exp10f(), on the other hand, are similar to exp() and expf() in terms of performance, and can be as much as ten times faster than their pow()/powf() equivalents. For exponentiation with an exponent of 1/3, use the cbrt() or cbrtf() function rather than the generic exponentiation functions pow() or powf(), as the former are significantly faster than the latter.
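To make the small-integer-power advice concrete, here is a minimal sketch (the function and variable names are hypothetical); it simply contrasts explicit multiplication and exp2f() with the general pow() path.

```cuda
__device__ float polyTerm(float x)
{
    // Prefer explicit multiplication for small integer powers.
    float x2 = x * x;            // instead of powf(x, 2.0f)
    float x5 = x2 * x2 * x;      // reuses x2, aiding CSE; instead of powf(x, 5.0f)

    // Prefer the specialized base-2 exponential over the general routine.
    float p  = exp2f(x);         // instead of powf(2.0f, x)

    return x2 + x5 + p;
}
```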
Likewise, for exponentiation with an exponent of -1/3, use rcbrt() or rcbrtf(). Replace sin(π*<expr>) with sinpi(<expr>), cos(π*<expr>) with cospi(<expr>), and sincos(π*<expr>) with sincospi(<expr>). This is advantageous with regard to both accuracy and performance. As a particular example, to evaluate the sine function in degrees instead of radians, use sinpi(x/180.0). Similarly, the single-precision functions sinpif(), cospif(), and sincospif() should replace calls to sinf(), cosf(), and sincosf() when the function argument is of the form π*<expr>. (The performance advantage sinpi() has over sin() is due to simplified argument reduction; the accuracy advantage is because sinpi() multiplies by π only implicitly, effectively using an infinitely precise mathematical π rather than a single- or double-precision approximation thereof.) By default, the nvcc compiler generates IEEE-compliant code for devices of compute capability 2.x, but it also provides options to generate code that is somewhat less accurate but faster and closer to the code generated for earlier devices: -ftz=true (denormalized numbers are flushed to zero), -prec-div=false (less precise division), and -prec-sqrt=false (less precise square root). Another, more aggressive, option is -use_fast_math, which coerces every functionName() call to the equivalent __functionName() call. This makes the code run faster at the cost of diminished precision and accuracy. See Math Libraries. Note: High Priority: Minimize the use of global memory. Prefer shared memory access where possible. Memory instructions include any instruction that reads from or writes to shared, local, or global memory. When accessing uncached local or global memory, there are 400 to 600 clock cycles of memory latency. For example, an assignment that reads an operand from global memory has a high throughput, but, crucially, there is a latency of 400 to 600 clock cycles to read the data from global memory. Much of this global memory latency can be hidden by the thread scheduler if there are sufficient independent arithmetic instructions that can be issued while waiting for the global memory access to complete. However, it is best to avoid accessing global memory whenever possible. Note: High Priority: Avoid different execution paths within the same warp. Any flow control instruction (if, switch, do, for, while) can significantly affect the instruction throughput by causing threads of the same warp to diverge, that is, to follow different execution paths. If this happens, the different execution paths must be serialized, since all of the threads of a warp share a program counter; this increases the total number of instructions executed for this warp. When all the different execution paths have completed, the threads converge back to the same execution path. To obtain best performance in cases where the control flow depends on the thread ID, the controlling condition should be written so as to minimize the number of divergent warps. This is possible because the distribution of the warps across the block is deterministic, as mentioned in SIMT Architecture of the CUDA C Programming Guide. A trivial example is when the controlling condition depends only on (threadIdx / WSIZE), where WSIZE is the warp size; in this case, no warp diverges because the controlling condition is perfectly aligned with the warps. Note: Low Priority: Make it easy for the compiler to use branch predication in lieu of loops or control statements.
Sometimes, the compiler may unroll loops or optimize out if or switch statements by using branch predication instead. In these cases, no warp can ever diverge. The programmer can also control loop unrolling using #pragma unroll; for more information on this pragma, refer to the CUDA C Programming Guide. When using branch predication, none of the instructions whose execution depends on the controlling condition is skipped. Instead, each such instruction is associated with a per-thread condition code or predicate that is set to true or false according to the controlling condition. Although each of these instructions is scheduled for execution, only the instructions with a true predicate are actually executed. Instructions with a false predicate do not write results, and they also do not evaluate addresses or read operands. The compiler replaces a branch instruction with predicated instructions only if the number of instructions controlled by the branch condition is less than or equal to a certain threshold: if the compiler determines that the condition is likely to produce many divergent warps, this threshold is 7; otherwise it is 4. Note: Low-Medium Priority: Use signed integers rather than unsigned integers as loop counters. In the C language standard, unsigned integer overflow semantics are well defined, whereas signed integer overflow causes undefined results. Therefore, the compiler can optimize more aggressively with signed arithmetic than it can with unsigned arithmetic. This is of particular note with loop counters: since it is common for loop counters to have values that are always positive, it may be tempting to declare the counters as unsigned. For slightly better performance, however, they should instead be declared as signed. For example, consider a loop that indexes an array with the sub-expression stride*i. This sub-expression could overflow a 32-bit integer, so if i is declared as unsigned, the overflow semantics prevent the compiler from using some optimizations that might otherwise have applied, such as strength reduction. If instead i is declared as signed, where the overflow semantics are undefined, the compiler has more leeway to use these optimizations. Note: High Priority: Avoid the use of __syncthreads() inside divergent code. Synchronizing threads inside potentially divergent code (e.g., a loop over an input array) can cause unanticipated errors. Care must be taken to ensure that all threads are converged at the point where __syncthreads() is called; a hedged sketch of how to do this properly for 1D blocks follows this paragraph. The loop should be carefully written to have the same number of iterations for each thread, avoiding divergence (imax is the number of elements rounded up to a multiple of the block size), with guards added inside the loop to prevent out-of-bounds accesses; at the point of the __syncthreads(), all threads are then converged. Similar care must be taken when invoking __syncthreads() from a device function called from potentially divergent code. A straightforward method of solving this issue is to call the device function from non-divergent code and pass a thread_active flag as a parameter to the device function. This thread_active flag would be used to indicate which threads should participate in the computation inside the device function, allowing all threads to participate in the __syncthreads(). Having completed the GPU acceleration of one or more components of the application, it is possible to compare the outcome with the original expectation.
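The 1D-block pattern described above can be sketched as follows; this is a hedged reconstruction rather than the guide's original listing, and the array name, the element count N, and the per-element work are placeholders.

```cuda
__global__ void processArray(float *data, int N)
{
    // Round N up to a multiple of the block size so every thread runs the
    // same number of loop iterations and no thread skips a __syncthreads().
    int imax = blockDim.x * ((N + blockDim.x - 1) / blockDim.x);

    for (int i = threadIdx.x; i < imax; i += blockDim.x) {
        // Guard the actual work so out-of-bounds threads do nothing,
        // but still reach the barrier below.
        if (i < N) {
            data[i] *= 2.0f;          // placeholder per-element work
        }
        __syncthreads();              // all threads are converged here
        if (i < N) {
            data[i] += 1.0f;          // placeholder work after the barrier
        }
    }
}
```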
Recall that the initial assess step allowed the developer to determine an upper bound for the potential speedup attainable by accelerating given hotspots. Before tackling other hotspots to improve the total speedup, the developer should consider taking the partially parallelized implementation and carrying it through to production. This is important for a number of reasons; for example, it allows the user to profit from their investment as early as possible (the speedup may be partial but is still valuable), and it minimizes risk for the developer and the user by providing an evolutionary rather than revolutionary set of changes to the application. With each generation of NVIDIA processors, new features are added to the GPU that CUDA can leverage. Consequently, it is important to understand the characteristics of the architecture. Programmers should be aware of two version numbers: the first is the compute capability, and the second is the version number of the CUDA Runtime and CUDA Driver APIs. The compute capability describes the features of the hardware and reflects the set of instructions supported by the device as well as other specifications, such as the maximum number of threads per block and the number of registers per multiprocessor. Higher compute capability versions are supersets of lower (that is, earlier) versions, so they are backward compatible. The compute capability of the GPU in the device can be queried programmatically as illustrated in the deviceQuery CUDA Sample; the output for that program is shown in Figure 12. This information is obtained by calling cudaGetDeviceProperties() and accessing the information in the structure it returns (a hedged sketch of such a query follows this paragraph). Figure 12. Sample CUDA configuration data reported by deviceQuery. The major and minor revision numbers of the compute capability are shown on the third and fourth lines of Figure 12; Device 0 of this system has compute capability 1.1. More details about the compute capabilities of various GPUs are in CUDA-Enabled GPUs and Compute Capabilities of the CUDA C Programming Guide. In particular, developers should note the number of multiprocessors on the device, the number of registers and the amount of memory available, and any special capabilities of the device. Certain hardware features are not described by the compute capability. For example, the ability to overlap kernel execution with asynchronous data transfers between the host and the device is available on most but not all GPUs with compute capability 1.1. In such cases, call cudaGetDeviceProperties() to determine whether the device is capable of a certain feature. For example, the asyncEngineCount field of the device property structure indicates whether overlapping kernel execution and data transfers is possible (and, if so, how many concurrent transfers are possible); likewise, the canMapHostMemory field indicates whether zero-copy data transfers can be performed. The CUDA Driver API and the CUDA Runtime are two of the programming interfaces to CUDA. Their version numbers enable developers to check the features associated with these APIs and decide whether an application requires a newer (later) version than the one currently installed. This is important because the CUDA Driver API is backward compatible but not forward compatible, meaning that applications, plug-ins, and libraries (including the CUDA Runtime) compiled against a particular version of the Driver API will continue to work on subsequent (later) driver releases.
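As a minimal sketch of querying these fields (error handling trimmed; the variable names are arbitrary), one might write:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("Device 0: %s, compute capability %d.%d\n",
           prop.name, prop.major, prop.minor);
    printf("Async engines: %d (overlap of kernels and copies: %s)\n",
           prop.asyncEngineCount, prop.asyncEngineCount > 0 ? "yes" : "no");
    printf("Can map host memory (zero-copy): %s\n",
           prop.canMapHostMemory ? "yes" : "no");
    return 0;
}
```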
However, applications, plug-ins, and libraries (including the CUDA Runtime) compiled against a particular version of the Driver API may not work on earlier versions of the driver, as illustrated in Figure 13. Figure 13. Compatibility of CUDA versions. When in doubt about the compute capability of the hardware that will be present at runtime, it is best to assume a compute capability of 2.0 as defined in the CUDA C Programming Guide section on Technical and Feature Specifications. To target specific versions of NVIDIA hardware and CUDA software, use the -arch, -code, and -gencode options of nvcc. Code that uses the warp shuffle operation, for example, must be compiled with -arch=sm_30 (or a higher compute capability). See Building for Maximum Compatibility for further discussion of the flags used for building code for multiple generations of CUDA-capable devices simultaneously. The host runtime component of the CUDA software environment can be used only by host functions. It provides functions to handle the following: device management, context management, memory management, code module management, execution control, texture reference management, and interoperability with OpenGL and Direct3D. As compared to the lower-level CUDA Driver API, the CUDA Runtime greatly eases device management by providing implicit initialization, context management, and device code module management. The C/C++ host code generated by nvcc utilizes the CUDA Runtime, so applications that link to this code will depend on the CUDA Runtime; similarly, any code that uses the cuBLAS, cuFFT, and other CUDA Toolkit libraries will also depend on the CUDA Runtime, which is used internally by these libraries. The functions that make up the CUDA Runtime API are explained in the CUDA Toolkit Reference Manual. The CUDA Runtime handles kernel loading and setting up kernel parameters and launch configuration before the kernel is launched. The implicit driver version checking, code initialization, CUDA context management, CUDA module management (cubin-to-function mapping), kernel configuration, and parameter passing are all performed by the CUDA Runtime. It comprises two principal parts: a C-style function interface (cuda_runtime_api.h) and C++-style convenience wrappers (cuda_runtime.h) built on top of the C-style functions. For more information on the Runtime API, refer to CUDA C Runtime of the CUDA C Programming Guide. When deploying a CUDA application, it is often desirable to ensure that the application will continue to function properly even if the target machine does not have a CUDA-capable GPU and/or a sufficient version of the NVIDIA Driver installed. (Developers targeting a single machine with a known configuration may choose to skip this section.) Detecting a CUDA-Capable GPU When an application will be deployed to target machines of arbitrary/unknown configuration, the application should explicitly test for the existence of a CUDA-capable GPU in order to take appropriate action when no such device is available. The cudaGetDeviceCount() function can be used to query for the number of available devices. Like all CUDA Runtime API functions, this function will fail gracefully and return cudaErrorNoDevice to the application if there is no CUDA-capable GPU, or cudaErrorInsufficientDriver if there is not an appropriate version of the NVIDIA Driver installed. If cudaGetDeviceCount() reports an error, the application should fall back to an alternative code path, as sketched below.
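Here is a minimal sketch of that fallback test; the cpuFallback() routine is hypothetical and stands in for whatever non-GPU code path the application provides.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical CPU-only code path.
void cpuFallback() { printf("Running CPU fallback path.\n"); }

int main()
{
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);

    if (err != cudaSuccess || deviceCount == 0) {
        // Covers cudaErrorNoDevice and cudaErrorInsufficientDriver alike.
        printf("No usable CUDA device (%s); falling back.\n",
               cudaGetErrorString(err));
        cpuFallback();
        return 0;
    }

    printf("Found %d CUDA-capable device(s).\n", deviceCount);
    // ... proceed with the GPU code path ...
    return 0;
}
```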
A system with multiple GPUs may contain GPUs of different hardware versions and capabilities. When using multiple GPUs from the same application, it is recommended to use GPUs of the same type, rather than mixing hardware generations. The cudaChooseDevice() function can be used to select the device that most closely matches a desired set of features. Detecting Hardware and Software Configuration When an application depends on the availability of certain hardware or software capabilities to enable certain functionality, the CUDA API can be queried for details about the configuration of the available device and for the installed software versions. The cudaGetDeviceProperties() function reports various features of the available devices, including the CUDA Compute Capability of the device (see also the Compute Capabilities section of the CUDA C Programming Guide). See CUDA Runtime and Driver API Version for details on how to query the available CUDA software API versions. All CUDA Runtime API calls return an error code of type cudaError_t; the return value will be equal to cudaSuccess if no errors have occurred. (The exceptions to this are kernel launches, which return void, and cudaGetErrorString(), which returns a character string describing the cudaError_t code that was passed into it.) The CUDA Toolkit libraries (cuBLAS, cuFFT, etc.) likewise return their own sets of error codes. Since some CUDA API calls and all kernel launches are asynchronous with respect to the host code, errors may be reported to the host asynchronously as well; often this occurs the next time the host and device synchronize with each other, such as during a call to cudaMemcpy() or to cudaDeviceSynchronize(). Always check the error return values on all CUDA API functions, even for functions that are not expected to fail, as this will allow the application to detect and recover from errors as soon as possible should they occur; a minimal checking pattern is sketched after this paragraph. Applications that do not check for CUDA API errors could at times run to completion without having noticed that the data calculated by the GPU is incomplete, invalid, or uninitialized. Note: The CUDA Toolkit Samples provide several helper functions for error checking with the various CUDA APIs; these helper functions are located in the samples/common/inc/helper_cuda.h file in the CUDA Toolkit. Each generation of CUDA-capable device has an associated compute capability version that indicates the feature set supported by the device (see CUDA Compute Capability). One or more compute capability versions can be specified to the nvcc compiler while building a file; compiling for the native compute capability for the target GPU(s) of the application is important to ensure that application kernels achieve the best possible performance and are able to use the features that are available on a given generation of GPU. When an application is built for multiple compute capabilities simultaneously (using several instances of the -gencode flag to nvcc), the binaries for the specified compute capabilities are combined into the executable, and the CUDA Driver selects the most appropriate binary at runtime according to the compute capability of the present device. If an appropriate native binary (cubin) is not available, but the intermediate PTX code (which targets an abstract virtual instruction set and is used for forward-compatibility) is available, then the kernel will be compiled Just In Time (JIT) (see Compiler JIT Cache Management Tools) from the PTX to the native cubin for the device.
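The error-checking advice above is often wrapped in a small helper; the following is a hedged sketch (the checkCuda name is hypothetical and is not the helper shipped in helper_cuda.h).

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA Runtime call fails.
#define checkCuda(call)                                                     \
    do {                                                                    \
        cudaError_t err_ = (call);                                          \
        if (err_ != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                     \
                    cudaGetErrorString(err_), __FILE__, __LINE__);          \
            exit(EXIT_FAILURE);                                             \
        }                                                                   \
    } while (0)

// Usage sketch:
//   checkCuda(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice));
//   kernel<<<grid, block>>>(args);
//   checkCuda(cudaGetLastError());        // catch launch-configuration errors
//   checkCuda(cudaDeviceSynchronize());   // catch asynchronous execution errors
```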
If the PTX is also not available, then the kernel launch will fail. Alternatively, the nvcc command-line option -arch=sm_XX can be used as a shorthand equivalent to a more explicit set of -gencode command-line options. However, while the -arch=sm_XX option does result in inclusion of a PTX back-end target by default (due to the code=compute_XX target it implies), it can only specify a single target cubin architecture at a time, and it is not possible to use multiple -arch options on the same nvcc command line, which is why builds targeting several architectures use -gencode explicitly. CUDA applications are built against the CUDA Runtime library, which handles device, memory, and kernel management. Unlike the CUDA Driver, the CUDA Runtime guarantees neither forward nor backward binary compatibility across versions. It is therefore best to redistribute the CUDA Runtime library with the application when using dynamic linking, or else to statically link against the CUDA Runtime. This will ensure that the executable will be able to run even if the user does not have the same CUDA Toolkit installed that the application was built against. Note: When statically linking to the CUDA Runtime, multiple versions of the runtime can peaceably coexist in the same application process simultaneously; for example, if an application uses one version of the CUDA Runtime, and a plugin to that application is statically linked to a different version, that is perfectly acceptable as long as the installed NVIDIA Driver is sufficient for both. Statically-linked CUDA Runtime The easiest option is to statically link against the CUDA Runtime. This is the default if using nvcc to link in CUDA 5.5 and later. Static linking makes the executable slightly larger, but it ensures that the correct version of runtime library functions are included in the application binary without requiring separate redistribution of the CUDA Runtime library. Dynamically-linked CUDA Runtime If static linking against the CUDA Runtime is impractical for some reason, then a dynamically-linked version of the CUDA Runtime library is also available. (This was the default and only option provided in CUDA versions 5.0 and earlier.) To use dynamic linking with the CUDA Runtime when using nvcc from CUDA 5.5 or later to link the application, add the --cudart=shared flag to the link command line; otherwise the statically-linked CUDA Runtime library is used by default. After the application is dynamically linked against the CUDA Runtime, this version of the runtime library should be bundled with the application. It can be copied into the same directory as the application executable or into a subdirectory of that installation path. Other CUDA Libraries Although the CUDA Runtime provides the option of static linking, the other libraries included in the CUDA Toolkit (cuBLAS, cuFFT, etc.) are available only in dynamically-linked form. As with the dynamically-linked version of the CUDA Runtime library, these libraries should be bundled with the application executable when distributing that application. The CUDA Toolkit's End-User License Agreement (EULA) allows for redistribution of many of the CUDA libraries under certain terms and conditions. This allows applications that depend on these libraries to redistribute the exact versions of the libraries against which they were built and tested, thereby avoiding any trouble for end users who might have a different version of the CUDA Toolkit (or perhaps none at all) installed on their machines.
Please refer to the EULA for details. Note: This does not apply to the NVIDIA Driver; the end user must still download and install an NVIDIA Driver appropriate to their GPU(s) and operating system. When redistributing the dynamically-linked versions of one or more CUDA libraries, it is important to identify the exact files that need to be redistributed. The following examples use the cuBLAS library from CUDA Toolkit 5.5 as an illustration: In a shared library on Linux, there is a string field called the SONAME that indicates the binary compatibility level of the library. The SONAME of the library against which the application was built must match the filename of the library that is redistributed with the application. For example, in the standard CUDA Toolkit installation, the files libcublas.so and libcublas.so.5.5 are both symlinks pointing to a specific build of cuBLAS, which is named like libcublas.so.5.5.x, where x is the build number (e.g., libcublas.so.5.5.17). However, the SONAME of this library is given as libcublas.so.5.5. Because of this, even if -lcublas (with no version number specified) is used when linking the application, the SONAME found at link time implies that libcublas.so.5.5 is the name of the file that the dynamic loader will look for when loading the application, and therefore it must be the name of the file (or a symlink to the same) that is redistributed with the application. The ldd tool is useful for identifying the exact filenames of the libraries that the application expects to find at runtime, as well as the path, if any, of the copy of that library that the dynamic loader would select when loading the application given the current library search path. In a shared library on Mac OS X, there is a field called the install name that indicates the expected installation path and filename of the library; the CUDA libraries also use this filename to indicate binary compatibility. The value of this field is propagated into an application built against the library and is used to locate the library of the correct version at runtime. For example, if the install name of the cuBLAS library is given as @rpath/libcublas.5.5.dylib, then the library is version 5.5 and the copy of this library redistributed with the application must be named libcublas.5.5.dylib, even though only -lcublas (with no version number specified) is used at link time. Furthermore, this file should be installed into the @rpath of the application; see Where to Install Redistributed CUDA Libraries. To view a library's install name, use the otool -L command. The binary compatibility version of the CUDA libraries on Windows is indicated as part of the filename. For example, a 64-bit application linked to cuBLAS 5.5 will look for cublas64_55.dll at runtime, so this is the file that should be redistributed with that application, even though cublas.lib is the file that the application is linked against. For 32-bit applications, the file would be cublas32_55.dll. To verify the exact DLL filename that the application expects to find at runtime, use the dumpbin tool from the Visual Studio command prompt. Once the correct library files are identified for redistribution, they must be configured for installation into a location where the application will be able to find them. On Windows, if the CUDA Runtime or other dynamically-linked CUDA Toolkit library is placed in the same directory as the executable, Windows will locate it automatically.
On Linux and Mac, the -rpath linker option should be used to instruct the executable to search its local path for these libraries before searching the system paths. Note: It may be necessary to adjust the value of -ccbin to reflect the location of your Visual Studio installation. To specify an alternate path where the libraries will be distributed, use similar linker options: for Linux and Mac, the -rpath option is used as before; for Windows, the /DELAY option is used, which requires that the application call SetDllDirectory() before the first call to any CUDA API function in order to specify the directory containing the CUDA DLLs. Note: For Windows 8, SetDefaultDllDirectories() and AddDllDirectory() should be used instead of SetDllDirectory(). Please see the MSDN documentation for these routines for more information. The NVIDIA System Management Interface (nvidia-smi) is a command-line utility that aids in the management and monitoring of NVIDIA GPU devices. This utility allows administrators to query GPU device state and, with the appropriate privileges, permits administrators to modify GPU device state. nvidia-smi is targeted at Tesla and certain Quadro GPUs, though limited support is also available on other NVIDIA GPUs. nvidia-smi ships with NVIDIA GPU display drivers on Linux, and with 64-bit Windows Server 2008 R2 and Windows 7. nvidia-smi can output queried information as XML or as human-readable plain text, either to standard output or to a file. See the nvidia-smi documentation for details. Please note that new versions of nvidia-smi are not guaranteed to be backward-compatible with previous versions. The state that can be queried includes: ECC error counts (both correctable single-bit and detectable double-bit errors are reported, for both the current boot cycle and the lifetime of the GPU); GPU utilization (current utilization rates are reported for both the compute resources of the GPU and the memory interface); active compute processes (the list of active processes running on the GPU, along with the corresponding process name/ID and allocated GPU memory); clocks and performance state (maximum and current clock rates for several important clock domains, as well as the current GPU performance state, or pstate); temperature and fan speed (the current GPU core temperature, along with fan speeds for products with active cooling); power management (the current board power draw and power limits, for products that report these measurements); and identification (various dynamic and static information, including board serial numbers, PCI device IDs, VBIOS/Inforom version numbers, and product names). The state that can be modified includes: ECC mode (enable and disable ECC reporting); ECC reset (clear single-bit and double-bit ECC error counts); compute mode (indicate whether compute processes can run on the GPU and whether they run exclusively or concurrently with other compute processes); persistence mode (indicate whether the NVIDIA driver stays loaded when no applications are connected to the GPU; it is best to enable this option in most circumstances); and GPU reset (reinitialize the GPU hardware and software state via a secondary bus reset). The NVIDIA Management Library (NVML) is a C-based interface that provides direct access to the queries and commands exposed via nvidia-smi, intended as a platform for building third-party system management applications; a brief sketch of calling its C API follows this paragraph.
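NVML's C API can be called directly for the same kinds of queries; the following is a brief sketch under the assumption that the NVML header and stub library from the deployment kit are installed (link with the NVML library, e.g. -lnvidia-ml on Linux), and it shows only a couple of the query functions.

```cuda
#include <cstdio>
#include <nvml.h>

int main()
{
    if (nvmlInit() != NVML_SUCCESS) return 1;

    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS) {
        unsigned int temp = 0;
        nvmlUtilization_t util = {0, 0};

        // Core temperature and utilization, as also reported by nvidia-smi.
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp);
        nvmlDeviceGetUtilizationRates(dev, &util);

        printf("GPU 0: %u C, %u%% GPU util, %u%% memory util\n",
               temp, util.gpu, util.memory);
    }

    nvmlShutdown();
    return 0;
}
```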
The NVML API is available on the NVIDIA developer website as part of the Tesla Deployment Kit through a single header file and is accompanied by PDF documentation, stub libraries, and sample applications; see developer.nvidia.com/tesla-deployment-kit. Each new version of NVML is backward-compatible. An additional set of Perl and Python bindings is provided for the NVML API. These bindings expose the same features as the C-based interface and also provide backwards compatibility. The Perl bindings are provided via CPAN and the Python bindings via PyPI. All of these products (nvidia-smi, NVML, and the NVML language bindings) are updated with each new CUDA release and provide roughly the same functionality. Managing your GPU cluster will help achieve maximum GPU utilization and help you and your users extract the best possible performance. Many of the industry's most popular cluster management tools now support CUDA GPUs via NVML. For a listing of some of these tools, see developer.nvidia.com/cluster-management. Any PTX device code loaded by an application at runtime is compiled further to binary code by the device driver. This is called just-in-time compilation (JIT). Just-in-time compilation increases application load time but allows applications to benefit from the latest compiler improvements. It is also the only way for applications to run on devices that did not exist at the time the application was compiled. When JIT compilation of PTX device code is used, the NVIDIA driver caches the resulting binary code on disk. Some aspects of this behavior, such as cache location and maximum cache size, can be controlled via environment variables; see Just in Time Compilation of the CUDA C Programming Guide. It is possible to rearrange the collection of installed CUDA devices that will be visible to and enumerated by a CUDA application prior to the start of that application by way of the CUDA_VISIBLE_DEVICES environment variable. Devices to be made visible to the application should be included as a comma-separated list in terms of the system-wide list of enumerable devices. For example, to use only devices 0 and 2 from the system-wide list of devices, set CUDA_VISIBLE_DEVICES=0,2 before launching the application. The application will then enumerate these devices as device 0 and device 1, respectively. This appendix contains a summary of the recommendations for optimization that are explained in this document. Performance optimization revolves around three basic strategies: maximizing parallel execution, optimizing memory usage to achieve maximum memory bandwidth, and optimizing instruction usage to achieve maximum instruction throughput. Maximizing parallel execution starts with structuring the algorithm in a way that exposes as much data parallelism as possible. Once the parallelism of the algorithm has been exposed, it needs to be mapped to the hardware as efficiently as possible. This is done by carefully choosing the execution configuration of each kernel launch. The application should also maximize parallel execution at a higher level by explicitly exposing concurrent execution on the device through streams, as well as maximizing concurrent execution between the host and the device. Optimizing memory usage starts with minimizing data transfers between the host and the device because those transfers have much lower bandwidth than internal device data transfers. Kernel access to global memory also should be minimized by maximizing the use of shared memory on the device.
Sometimes, the best optimization might even be to avoid any data transfer in the first place by simply recomputing the data whenever it is needed. The effective bandwidth can vary by an order of magnitude depending on the access pattern for each type of memory. The next step in optimizing memory usage is therefore to organize memory accesses according to the optimal memory access patterns. This optimization is especially important for global memory accesses, because the latency of access costs hundreds of clock cycles. Shared memory accesses, in counterpoint, are usually worth optimizing only when there exists a high degree of bank conflicts. As for optimizing instruction usage, the use of arithmetic instructions that have low throughput should be avoided. This suggests trading precision for speed when it does not affect the end result, such as using intrinsics instead of regular functions or single precision instead of double precision. Finally, particular attention must be paid to control flow instructions due to the SIMT (single instruction, multiple thread) nature of the device. The NVIDIA nvcc compiler driver converts .cu files into C for the host system and CUDA assembly or binary instructions for the device. It supports a number of command-line parameters, of which the following are especially useful for optimization and related best practices: -maxrregcount=N specifies the maximum number of registers kernels can use at a per-file level (see Register Pressure; see also the __launch_bounds__ qualifier discussed in Execution Configuration of the CUDA C Programming Guide to control the number of registers used on a per-kernel basis); --ptxas-options=-v or -Xptxas=-v lists per-kernel register, shared, and constant memory usage; -ftz=true (denormalized numbers are flushed to zero), -prec-div=false (less precise division), and -prec-sqrt=false (less precise square root); and -use_fast_math, which coerces every functionName() call to the equivalent __functionName() call, making the code run faster at the cost of diminished precision and accuracy. See Math Libraries.
Recently, Quantopian's Chief Investment Officer, Jonathan Larkin, shared an industry insider's overview of the professional quant equity workflow. This workflow comprises distinct stages: (1) Universe Definition, (2) Alpha Discovery, (3) Alpha Combination, (4) Portfolio Construction and (5) Trading. This notebook focuses on stage 3: Alpha Combination. At this stage, Machine Learning is an intuitive choice, as we have abstracted the problem to such a degree that it is now a classic classification (or regression) problem, which ML is very good at solving, yielding an alpha combination that is predictive.
In this notebook we will show how to: construct alphas in pipeline, preprocess alphas so that they can be fed into scikit-learn classifiers, train an ML classifier on these alphas with good performance on hold-out data, and analyze which alphas were most predictive. In addition, this notebook discusses pointers on how to improve the accuracy, a small Machine Learning competition, and resources to further your understanding of Machine Learning. Note, however, that this workflow is still a bit rough around the edges. We are working on improving it and adding better educational materials. This serves as a sneak peek for the curious and adventurous. And here is a library of commonly used alpha factors. You can copy and paste some (or all) of these into the above notebook to see if you can improve the accuracy of the classification. Ultimately, we will include these in a library you can just import. Stunning, no other word for it. It will take me a while to absorb it all since I am still currently working through various books. However, your work will make this entire area vastly easier for me to comprehend and put into practice. One initial question. Exercise: "It is also common to run cross-validation on the training data and tweak the parameters based on that score; testing should only be done rarely." I am aware of having made serious errors in this respect in the past. I still can't quite think my way through this properly. I understand the whole concept of cross-validation on training data and so forth, but wonder whether you could expand your thoughts a little on when to actually resort to the test data. We will be coming up with countless different models using a combination of the factors.
I'm just not quite clear on when to use any of those models on the test data. Accuracy on test set: 67.26%. As you are well aware, predictions on price alone rarely exceed much above the 50% random level. This therefore seems more than a "small edge in the market (or at least the test set)". It will be interesting to see, as we implement models going forward, whether this sort of accuracy can be maintained on fresh data. But it certainly looks promising, even if it has not yet been put into an actual system for back testing. Interesting to see the low ranking of momentum. Anyway, thanks to the whole team for this. Cross-validation: A common method is to do model development and comparison without ever touching the hold-out set. Once you have something you think works really well, you put it to the final test to see how much you fooled yourself. A good mind-set is that you lose something every time you test on your hold-out set. Accuracy: I'm a bit suspicious of that myself. Initially I got more like 53%, which seems much more reasonable. I think this is due to having overlapping time periods: the factors themselves are computed over overlapping windows, and so are the returns to be predicted. Thus, probably just by forward-projecting you get pretty good predictability. Then again, the classifier does not have access to any forward returns, so it can't just forward-fill (except the first 4 days). I can post another version where I subsample the data to only train and predict once every week to avoid the overlap. This brings accuracy down significantly (i.e., to 53%). The ultimate test, as always, will be how well this predicts on future data (i.e., testing the classifier in a month from now). Anthony: The high accuracy was indeed driven by the overlap from the training-day returns into the testing-day returns. I updated the NB now to leave a 4-day delay between train and test (as we have 5-day returns) and the accuracy is down to a more reasonable 53.3%.
You Q folks are doing great work, I'm sure, but for me, it's a bit like trying to eat a nice steak in one bite (and perhaps trying to take in a whole potato, as well). My understanding is that you are interested in "coming up with an alpha combination that is predictive" (in 'combination' you presumably include interactions, up to some effective polynomial order). Presumably, you are trying to predict forward daily returns, based on daily prices (open or close, take your pick). There is a vector of Y's, which consists of the daily returns of N stocks (not per unit wall-clock time, since early closes/weekends/holidays are ignored). So, the name of the game is to find the set of M X's that, in combination, best predict the Y's (on any timescale; I just need predictability to trade). The X's consist of a set of "commonly used factors" (fundamentals, technical indicators, and their combinations, etc.; whatever one dreams up). So, if I follow, there are returns for N stocks (Y), and the supposition is that M factors (X) can be used to predict those returns. So, in effect, are you creating N independent models? Are you finding Y_n = f_n(X_n), where n corresponds to the nth stock? Or is it that you are building a model that best predicts the overall return (R), where the model consists of the M factors and the weightings of the N stocks in the portfolio? Something like R = f(X, W), where R is the scalar overall return of my optimized portfolio, X is a vector of M factors, and W is a vector of N stock weights? Or maybe you are doing something else altogether. On a separate note, I can't tell from your notebook whether you are computing the feature importance for a single period, or whether these features would be considered persistent over a long time span (e.g., back to 2002 or whatever). It would seem the question to answer would be: are there any factors that are consistently predictive over many market cycles (even if their relative importance varies with time)? In other words, which factors are total junk and which ones are the gems? Should I include the age/gender/nationality/favorite food of the CEO? Or do these factors never have any predictive power whatsoever, and just over-complicate the model and impute a risk of over-fitting? What is the chart of "Feature importances" telling us? If it is not obvious from my comments here and elsewhere, I'm interested in the overall workflow you folks are putting together, but I'm finding it challenging to get the bigger picture. Your efforts seem to be based on an established intellectual edifice that is foreign to me. Jonathan Larkin provides a decent 30,000-foot view at blog.quantopian.com/a-professional-quant-equity-workflow. We have the Q500US & Q1500US (along with the more general make_us_equity_universe), and then alphalens, which is a manual tool for analyzing individual factors. Followed by your work above, to combine the factors. Etc. At a level lower than 30,000 feet, but above the ground level of individual lines of code (let's say 10,000 feet), there is a kind of overall point-in-time optimization problem that is being solved, which includes the universe selection, and the factors and their interactions. There's also the issue of how the universe selection interacts with the factors, which I don't think is addressed in the workflow (at least not as a factor to be automatically optimized).
In the end, it would seem that as an output of the factor combination step, I should have a model of how to forecast returns for every stock independently, no? Is that where we are headed? Then the portfolio construction step would be to sort out how to apply those forecasts and construct portfolio weights (i.e., how much of my portfolio should be in XYZ at any given point in time)? Perhaps you could provide a sense for the overall optimization problem that is being solved (across the entire workflow) and how your factor combination step fits in. Presumably, your "library of commonly used alpha factors" was run through alphalens and somehow vetted? Each one was shown to have some predictive power individually, and now you are combining them in a fashion that accounts for interactions ("a non-linear combination of features"). Sorry, as I said, I'm missing the 10,000-foot view. What are you doing? To provide some background, I am familiar with the so-called "response surface methodology" used in design and process engineering. My fuzzy picture is that you are building a multi-dimensional response surface that can be used for optimization, subject to constraints? If so, what order of model are you constructing (first-order, second-order)? Is the ML model effectively the response surface? Or has it already done the optimization? Again, I'm obviously confused. Anthony: Yes, this normalization works for all classifiers. "I should have a model of how to forecast returns for every stock independently, no?" With "independently", do you refer to the fact that we are ranking the stocks here and that the rank depends on the other stocks in the universe? The idea is: we have factors, and we combine them in such a way as to be predictive of future relative returns. When we can predict that, we just need to go long the stocks that the model predicts to go up, and short the ones that it predicts to go down. The other stuff is just icing, and there are many ways to skin the cat. Not sure that makes things any clearer, but I'll take what I can get. I'll have to take another pass through your workbook, to see if I can understand it. All of this ranking talk, I don't understand. For example, if I assign a value of 1-10 to foods that I like, 10 being the highest, then by default, I can rank them. It is the value assignment that would seem to matter. I guess you are saying that the underlying assumption is that there is a monotonic relationship between the values and their utility for trading. So, if I rank them, I can pick the top/bottom 10 or whatever. Still, it is not the ranking, but the assignment of values in such a way that the monotonicity holds, right? The ranking is just turning the crank, so to speak.
Under your section "Run the pipeline" you end up with a table of numbers, resulting from results.head(). How did you derive the table? I see factors as column labels (except for "Returns"), and then for each day there is a list of securities (presumably every security in the universe), and integers within the table (except for the "Returns" column). How are you determining the integers? How are the returns calculated? Are the returns used to determine the integers? The integers are ranks, but is the ranking by security (row-wise), but not normalized? Or by factor (column-wise)? Or something else? Are you ranking the relative predictive power of each factor, relative to other factors, by security? For a single security, I read across, and for a single factor, I read down? Also, I don't understand: if a factor doesn't directly predict the 5-day return, then how are you applying it? For example, WorkingCapitalToAssets doesn't tell me, if I have a stock price for XYZ, what the price will be 5 days later. Wouldn't all of my factors need to be formulated to actually forecast prices? If you could fill in some missing pieces on how you are doing the forecasting and relative ranking, it might help (me at least). What is the recipe (in words, mostly, since I don't have the patience to unravel Python/Q API code, which I can do later, once I understand the big picture)? The integers are the rank of that factor of that security on that day relative to the rest of the universe. Returns are calculated over a 5-day window. The computation of returns is not influenced by the factor rank, or vice versa. For example, on day 1 we have four stocks in our universe. 1. On that day the WorkingCapitalToAssets values of the stocks are 0.2, 0.3, 0.1, 0.5. 2. We rank them, which gives 1, 2, 0, 3. 3. We then normalize, so that it becomes 0.25, 0.75, 0, 1. 4. Assume 5-day returns from day 2 to day 6 of each stock of 0.01, 0.05, -0.01, 0.02. 5. We then binarize the returns by a median split, which gives -1, 1, -1, 1. 6. We then feed this into a classifier, which will try to find a way to predict the labels from the input. In this case, we could use the decision rule: if normalized_ranked_factor_of_stock_i > 0.5, predict 1; else predict -1. In this case, it would get 100% of the training data correct. Intuitively, a high rank in WorkingCapitalToAssets is predictive of a stock going up more than the other stocks in its universe. If we were to act on our prediction on day 2, we would go long stocks 2 and 4, short stocks 1 and 3, and exit the positions on day 6. This would be a profitable trade. (A compact symbolic restatement of these steps follows this paragraph.)
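Written out symbolically, and using notation of my own rather than Quantopian's, the labeling and the hypothetical decision rule above are, for stock i on a given day with 5-day forward return R_i and normalized factor rank x_i:

$$
y_i = \begin{cases} +1, & R_i \ge \operatorname{median}(R_1,\dots,R_N) \\ -1, & \text{otherwise} \end{cases}
\qquad
\hat{y}_i = \begin{cases} +1, & x_i > 0.5 \\ -1, & \text{otherwise} \end{cases}
$$

With the example values above (x = 0.25, 0.75, 0, 1 and R = 0.01, 0.05, -0.01, 0.02), the predicted labels match the true labels for all four stocks, which is the 100% training accuracy mentioned.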
I feel compelled to understand the ranking of factors better (versus using factors directly, which in my experience is the typical thing to do, e.g. in science/engineering). I found this (on page 16, slide 15): the rank vector R = (R(1), R(2), ..., R(N)) is a maximal invariant statistic under monotone transforms; that is, any statistic unaffected by monotone transforms is a function of the rank vector. I'm not sure what all this means yet, but it is not clear that any information is lost, under certain assumptions (e.g. as I discuss above, ranking is equivalent to an axis offset and normalization/scaling, for equi-spaced (uniformly sampled) factor values). Maybe if one assumes a monotonic relationship between the raw factors (X) and the responses (Y), then no information is lost in converting to ranks of the factors (X_rank)? But it reduces the dimensionality, since I can now work with discrete integer values (levels) instead of continuous float values (or I don't need to interpolate the missing data)? But the risk is that if the monotonic relationship does not hold, then I've thrown a monkey wrench into the whole thing (e.g. if my response Y peaks at extreme values of X and then rolls off)? Maybe one has to show that the monotonic relationship actually holds, prior to ranking? Just guessing. Anybody understand the ranking?

Small update: I fixed the breaking change by replacing np.percentile() with np.nanpercentile(). This is due to the updated numpy (see the quantopian post soon-upgrade-to-pandas-0-dot-18). Unfortunately, the shape of the resulting data also changed, and I'm not sure why that is the case. It also negatively affected the out-of-sample performance of the classifier, as you can see. I will try to get to the bottom of why the two would be different, but others are free to explore too and see if they can get a better score.

I am new to Quantopian and spotted that you have been using machine learning for trading strategies. I am not yet familiar with the environment of Quantopian and would like to see if you could help give some insight into using ML in the Quantopian development environment. I myself have more than 30 years of experience in proprietary trading (a discretionary approach, not a systematic one), while my partner is more into the quantitative approach. We are looking to collaborate with someone with more AI expertise and experience in algo development (someone like you, we believe). We hope our domain knowledge in proprietary trading could also be good for your personal use. Just drop us a message. We look forward to communicating with you further.

I'm still working my way through your notebook above.
In your results.head(), what is the Returns column? Is it the trailing returns? Is that why you then have to "shift factor ranks to align with future returns n_fwd_days days in the future"? I suggest simply doing the alignment within pipeline, so that one can directly read across the rows as factor-response data. Is this possible? "Find the top and bottom 30 percentile stocks by their returns." Are you setting up for the ML to predict which stocks fall into the two buckets, the top 30th percentile and the bottom 30th percentile? For example, if you code 1 for the top bucket and -1 for the bottom bucket (and 0 for not in a bucket), then a given stock XYZ could take on values (-1, 0, 1)? In other words, if my ML ends up predicting a 1 for XYZ, then I'd expect its return in 5 days to be positive, and better than the mass of stocks falling between the 30th and 70th percentiles? Another question: presumably this is all aimed toward writing glorious, scalable long-short algos for your Point72 buddies. What if all the stocks go up and I have nothing to short? Is there an implicit assumption here that I'll end up with a basket of stocks to long and a basket to short? I don't see anything in the analyses that would imply a market-neutrality constraint. Or is it in there, somehow? And how would one handle categorical factors (e.g. color of the CEO's sports car, the exchange on which a stock is listed, etc.)? Can this be extended to handle multiple responses (Y's)? You only do returns, but wouldn't one want a model for other stuff, too? The Quantopian contest and pyfolio are concerned with lots of responses, which all have some importance, right? It seems that in the end, to do constrained optimization, you need more than one response (e.g. maximize overall return under the constraint that SR > 1.0, for which you'd need to predict the return and its variance).

Grant, those are great questions.

"In your results.head(), what is the Returns column? Is it the trailing returns?"

Yes, trailing 5-day returns.

"Is that why you then have to 'shift factor ranks to align with future returns n_fwd_days days in the future'? I suggest simply doing the alignment within pipeline, so that one can directly read across the rows as factor-response data. Is this possible?"

Yes, that would be ideal, but it is not possible currently. What we would need is a .shift() pipeline operator. It's on the feature wish list (a manual pandas approach is sketched below).

"Are you setting up for the ML to predict which stocks fall into the two buckets, the top 30th percentile and the bottom 30th percentile? For example, if you code 1 for the top bucket and -1 for the bottom bucket (and 0 for not in a bucket), then a given stock XYZ could take on values (-1, 0, 1)? In other words, if my ML ends up predicting a 1 for XYZ, then I'd expect its return in 5 days to be positive, and better than the mass of stocks falling between the 30th and 70th percentiles?"

Yes, that's exactly the right understanding. But note that there could be a case where you correctly predict 1 but the 5-day forward return is not positive in absolute terms. For example, it could be that the market is tanking and all stocks are negative, just this one less than the others (i.e. it's still in the top 30th percentile).
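Until such a pipeline .shift() operator exists, the alignment has to be done on the pipeline output itself. A minimal pandas sketch of that idea follows; the toy DataFrame, the column names, and the n_fwd_days value are placeholders rather than the notebook's exact code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_fwd_days = 5

# Toy stand-in for the pipeline output: a (date, security) MultiIndex with one
# factor-rank column and a trailing n_fwd_days "Returns" column.
dates = pd.date_range("2016-01-04", periods=15, freq="B")
securities = ["AAA", "BBB"]
idx = pd.MultiIndex.from_product([dates, securities], names=["date", "security"])
results = pd.DataFrame(
    {"factor_rank": rng.random(len(idx)),
     "Returns": rng.normal(0.0, 0.01, len(idx))},
    index=idx,
)

# Shift each security's ranks forward by n_fwd_days so the rank observed on day t
# lands on the same row as the trailing return that ends n_fwd_days later, i.e.
# the return realized after the rank was known.
shifted_ranks = results["factor_rank"].groupby(level="security").shift(n_fwd_days)

aligned = pd.DataFrame({"factor_rank": shifted_ranks,
                        "fwd_return": results["Returns"]}).dropna()
print(aligned.head())
```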
"Another question: presumably this is all aimed toward writing glorious, scalable long-short algos for your Point72 buddies. What if all the stocks go up and I have nothing to short? Is there an implicit assumption here that I'll end up with a basket of stocks to long and a basket to short? I don't see anything in the analyses that would imply a market-neutrality constraint. Or is it in there, somehow?"

As we train the model to predict relative stock movements, it's very unlikely not to have shortable stocks. As I said above, they can still go up for this to be profitable, just less than the others. So a very simple next step would be to short the N stocks the classifier is most certain are -1's and long the N stocks the classifier is most certain are 1's. That way I'm always market neutral (a code sketch of this selection appears below).

"And how would one handle categorical factors (e.g. color of the CEO's sports car, the exchange on which a stock is listed, etc.)?"

That's a great question, and not one with a single obvious answer. One idea with e.g. sectors would be to train a separate classifier for each sector independently. But there are also other methods to include categorical data in machine learning. This is really where someone can add value by being creative.

"Can this be extended to handle multiple responses (Y's)? You only do returns, but wouldn't one want a model for other stuff, too? The Quantopian contest and pyfolio are concerned with lots of responses, which all have some importance, right? It seems that in the end, to do constrained optimization, you need more than one response (e.g. maximize overall return under the constraint that SR > 1.0, for which you'd need to predict the return and its variance)."

Technically, those ML algorithms exist, but I don't think they would be useful here. That's not to say that those other things you mention are not important, but there are other ways to include them. The ML prediction part is just one piece of the trading algorithm. The prediction just gives me a long and a short book. Do I just equal-weight each stock in each book? Probably not, so that's where a risk model would come in handy. For example, I could do inverse-vol weighting, but then you also want to reduce exposure to certain risk factors, etc. Or "maximize overall return under the constraint that SR > 1.0" could be included at this level as well. You could also train a separate classifier to predict vol and feed that into the portfolio optimization, but my hunch is that historical vol will be good enough in that case.

It's important to realize that we do not attempt to predict market movements here and are not interested in them. A market-neutral strategy tries to make money on relative price movements.
A couple of examples with simplified math:

1) long book returns: +5%, short book returns: -5% -> profit: 5% + (-1)×(-5%) = 10%
2) long book returns: +15%, short book returns: +5% -> profit: 15% + (-1)×(5%) = 10% (bull market)
3) long book returns: -5%, short book returns: -15% -> profit: -5% + (-1)×(-15%) = 10% (bear market)

As you can see, we're making a profit on the spread between the long and short books. We don't actually care about the total movement, only about the difference between them. The classifier is trained in that manner too: it only tries to predict returns relative to others, not absolute returns. As such, the classifier should never predict that all stocks go down relative to each other, as that's impossible, even in 2008. In each of these examples, the classifier could have correctly predicted 1s for the long book and -1s for the short book, since the names in the long book are relatively better.

Sorry, I must be missing something. Say I decide to do one of these long-short thingys with only two stocks, ABC and XYZ. So, no matter what, I'll go long ABC and short XYZ (or vice versa). What if ABC and XYZ are highly correlated? If my model says that both will go down, shouldn't I short both, and then go long an ETF as an overall market hedge? I guess I don't get the concept of going long-short if my model predicts everything (or almost everything) will go in a certain direction. Does it all work out somehow in the limit of a relatively large number of volatile stocks, over many market cycles? The whole scheme you guys are cooking up would seem to require certain assumptions about how the market works, no?

Grant, in your case of ABC and XYZ, think about it as a pair trade. Correlation is a fine thing, so long as a) they aren't perfectly correlated at 1.0 and b) you know which one to long and which one to short. It's OK if your long-short model predicts everything to move in the same direction. The important part is that your model has to have different predictions for each stock (or at least predictions for a top book and a bottom book). Look again at Thomas's examples:

1) long book returns: +5%, short book returns: -5% -> profit: 5% + (-1)×(-5%) = 10%
2) long book returns: +15%, short book returns: +5% -> profit: 15% + (-1)×(5%) = 10% (bull market)
3) long book returns: -5%, short book returns: -15% -> profit: -5% + (-1)×(-15%) = 10% (bear market)

Summarizing: if your model predicts every equity to behave exactly the same, it's useless. If your model accurately predicts differentiation between equities, then it has alpha.
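To make the "long the N most confident 1's, short the N most confident -1's" selection Thomas describes concrete, here is a small scikit-learn sketch. The AdaBoost classifier, the random features, and the equal 1/(2N) weights are all placeholders; they only illustrate the mechanics of turning class probabilities into dollar-neutral books.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)

# Placeholder training data: normalized factor ranks and median-split labels.
X_train = rng.random((200, 5))
y_train = np.where(rng.random(200) > 0.5, 1, -1)
X_today = rng.random((200, 5))            # today's factor ranks, same universe

clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# Probability that each stock belongs to the +1 (relative outperformer) class.
p_up = clf.predict_proba(X_today)[:, list(clf.classes_).index(1)]

N = 20
longs = np.argsort(p_up)[-N:]             # most confident outperformers
shorts = np.argsort(p_up)[:N]             # most confident underperformers

# Equal-weight, dollar-neutral books; the portfolio construction step would refine this.
weights = np.zeros(len(p_up))
weights[longs] = 1.0 / (2 * N)
weights[shorts] = -1.0 / (2 * N)
```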
The great thing about maintaining these long and short baskets is that, irrespective of the market, you can make money. As Thomas illustrated above, it is all about the spread. Though I understand what you are saying: if you know things are going down or going up, why not just do one leg with a hedging instrument? The answer is that by going long and short based on our predictive scheme, we are completely isolating the performance of the algorithm to just our ability to predict. As long as we are confident in our model (in this case our ML classifier), we don't care what the market is doing. We don't want to bet on the market - we want to bet on our model. As far as correlation goes, let's take those two stocks and extrapolate to 200 stocks. Now we have two baskets (long and short), each with a beta of approximately 1.0 (assuming here that on average 100 stocks would have a beta about equal to the market's). Now, by putting equal amounts of the portfolio value in each leg, we get 1.0 + (-1)×(1.0) = 0, for a low-beta algorithm. It is this property that is so great, because it removes the need for a hedging instrument entirely.

Thanks Dan, James - just trying to understand the assumptions here, how the thing actually works (and why it might not work under certain conditions). Taking Thomas's example again, assume that my forecasting is 100% accurate. Then I could do:

1) long book returns: +5%, short book returns: -5% -> profit: 10%
2) long book returns: +15%, short book returns: +5% -> profit: 10% (bull market), or 20% if all long
3) long book returns: -5%, short book returns: -15% -> profit: 10% (bear market), or 20% if all short

Why wouldn't I do this? Instead of a blended profit of 10%, I end up with 17% (assuming scenarios 1, 2, and 3 are equally probable).
Is the implicit assumption here that I can forecast individual stocks but not the entire market (which would be nice, since then I could write a much simpler algo that would just go either long or short in SPY)? I'll never get the additional 7%? There's too much risk in going all long or all short, since my ability to forecast the market stinks? I'd agree with Dan's statement "If your model predicts every equity to behave exactly the same, it's useless," except that if my model is actually predictive, then I'd just chuck the market-neutrality constraint and go either long or short, but not both equal-weight. Perhaps the confusion lies in the fact that you have an implicit market-neutrality constraint that, as James points out, one hopes can be met if enough stocks of the right flavor are included in the long-short portfolio. Ranking and then going long-short (top-bottom percentiles) equal-weight will satisfy this constraint under certain assumptions. So, if I play my cards right, I'll break even, worst case (on average, I'll make no profit), although I have to make enough to recover my Vegas flight and hotel. On top of that, there is some magical uncorrelated profit (the so-called "alpha"). By the way, since "beta" is so undesirable, if Q makes any money this way, I'd be glad to take it off your hands. Jonathan L. suggests that there may be a deeper understanding: "Today I'll add substance to that philosophy by giving you a detailed tour of the investment process for a popular and deep area of the quantitative investment world: cross-sectional equity investing, also known as equity statistical arbitrage or equity market neutral investing. This approach to equity investing involves holding hundreds of stocks long and short in a tightly risk-controlled portfolio with the goal of capturing transient market anomalies while exhibiting little to no correlation to market direction or other major risk factors." Perhaps there is a really good (not highly mathematical/theoretical), short tutorial out there that fills in the gaps (and shows that it actually can work)? It sounds wonderful, but it is hard to picture, particularly in a statistical sense. Why is it called "equity statistical arbitrage"? How would I know that I'm doing "arbitrage", i.e. printing money, and not something else? If beta is zero, is it automatically "arbitrage" by definition? Or do I have to show some form of point-in-time cointegration for the long-short baskets?

"Is the implicit assumption here that I can forecast individual stocks but not the entire market (which would be nice, since then I could write a much simpler algo that would just go either long or short in SPY)? I'll never get the additional 7%? There's too much risk in going all long or all short, since my ability to forecast the market stinks?"

Yes, that's exactly right (no snarkiness intended).
This usually tells you important information, especially when working with unbalanced datasets (not the case here) or more than 2 classes. By the way, it's worth studying which method gets you better results in trading: predicting relative movement, or predicting the probability of going up/down and going long/short on percentiles of those predictions. Because of volatility clustering, the usual machine learning statistics often don't tell you the whole truth about classifier robustness in a real trading environment. Mikko M shared a notebook.

Hey Mikko - I'm lurking in this thread as I'm new to Quantopian. But I saw you mention Random Forests and was curious whether you happen to have an example of that you'd be willing to share. I'm trying to find a decent example of running a list of vectors through a simple RF with Quantopian. Thus far I have only worked with ANNs and SVMs (outside of Quantopian) and I'm not sure what the appropriate dot file format is, appropriate first settings, etc. Maybe you (or someone else) can help out? Any pointers would be welcome. Thanks in advance. :-)

Just in case you guys look to take an RF model to the IDE, I want to remind you that random number generation is not currently supported in live trading (per the live trading guidelines).

Jamie: My current approach with SVMs (outside of Quantopian) is to produce the model externally and then use it to parse during live trading, which does not use much processing at all. I would draw the same separation if I ended up using Quantopian for SVMs, ANNs, or RF models. Thanks for the pointer.

Sorry, you're right, I wasn't clear at all. Currently, live trading only supports deterministic algorithms. As a result, if you want to live trade with an algorithm that depends on random number generation, you have to specify a seed. For example, with the RandomForestClassifier in sklearn, there's a random_state parameter that you can use to specify the seed. This will make the algo deterministic. This is only necessary in live trading. Let me know if this helps.
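As a concrete illustration of Jamie's point, fixing the seed is a one-argument change; the data here are placeholders and only the random_state usage matters.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X = rng.random((500, 8))                        # placeholder factor values
y = np.where(rng.random(500) > 0.5, 1, -1)      # placeholder labels

# With random_state fixed, the fitted forest (and therefore the algorithm's
# behavior in live trading) is reproducible across runs.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
```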
Why is working capital to assets the most predictive feature? This does not make any sense to me. But I see a more basic issue with this ML approach. ML works best when features are spatial, not temporal, as in character and picture recognition. If you have a picture with the features changing in time, ML will produce random results. As someone mentioned, long/short makes no sense and IMO "market neutral" is a misnomer. Funds learned that the hard way during the 2008 crash. When everything goes down, your spatial ML classifier will fail to predict it. Random profits from several years will vanish into thin air in less than a month. Numerai is already doing this and the results are not good, based on the payouts for best predictions. That's not even mentioning data snooping. I would like to see a good explanation as to why ML works with time series and time-based features. Until then I will trade SPY long, and I am doing 12% this year with a simple strategy and limited risk.

First off, this is a great intro. But I have two comments/questions: 1. It can be argued that access to tools like sklearn has only been widespread for a few years (sklearn didn't even have its initial git commit until 2011). If we see alpha from using ML techniques in backtests, couldn't it be argued that the alpha only exists because most market participants didn't have access to ML at that time, and that in the current era, when anyone with a laptop can do this analysis, we can expect it to rapidly decay? 2. The post briefly mentions using cross-validation, and then links to the sklearn cross-validation tools, but I would like to point out that those tools are unusable for time series. The sklearn cross-validation suite just picks random samples for training and random samples for testing, but it should be obvious that that introduces look-ahead bias.

Brom - my 2 cents: 1) I have been developing trading strategies for the better part of the past decade, and let me assure you that having access to ML tools/libs doesn't mean one is able to produce and then maintain an edge. Running simulations in Mathematica or now Quantopian/Quantconnect is one thing - integrating various ML techniques into a viable trading strategy is quite another. The devil lurks in the details. 2) How is randomized testing associated with look-ahead bias? Please explain how this is obvious. IMO randomization reduces curve fitting and especially market-cycle dependencies.
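One standard way to address the look-ahead concern Brom raises is a walk-forward split, where each fold trains only on observations that precede its test window. A minimal sketch with scikit-learn's TimeSeriesSplit follows; the data and the choice of classifier are placeholders, not the notebook's setup.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(1)
X = rng.random((600, 5))                        # placeholder features, ordered in time
y = np.where(rng.random(600) > 0.5, 1, -1)

# Each fold trains on an initial segment and tests on the segment that follows,
# so no future observations leak into the training set.
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    clf = AdaBoostClassifier(n_estimators=50, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    print(accuracy_score(y[test_idx], clf.predict(X[test_idx])))
```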
Nice work here on ML, but I also have doubts about the effectiveness of the method. As someone mentioned already, ML does a good job when features are in the space domain. In the time domain, ML does not appear to do a good job. In fact, ML was popular in the 1980s, but traders abandoned it for much simpler algos as soon as they understood the problems. As far as cross-validation goes, there is an issue because it assumes i.i.d. data. But the most important issue with ML is that each time you update your data, the classifier must run again, and that means data snooping is inevitable due to the reuse of old data. Unless you are going to use ML only once for a single prediction (which is not practical, of course, but is what is done with text and face recognition), then each time it runs with updated data, the probability of a Type I error is higher. This must be understood, and ML is immediately disqualified unless there is a sound method for determining the significance of results that can compensate for the bias. I have written a post about Numerai. It is interesting to see how more than half of the predictions are worse than random. It is also important to realize that better ML metrics (logloss, accuracy, etc.) do not necessarily mean better predictions. Most top performers in terms of logloss on Numerai have probably achieved that via "boosting" based on knowledge of the public leaderboard. On the other hand, good predictions can be generated by random participants. Because of that, Numerai takes many predictions into account (controlling capital) in an attempt to average out the noise. This is smart, but at the same time risky, as the probability of ruin is not zero and at some point ruin will occur; we just do not know when, hopefully in a long time and after Elon Musk has already colonized Mars. :) Conclusion: ML for markets can be an exercise in futility. I have spent many, many hours testing different ML algos: logistic regression (with and without L1, L2 and k-fold validation), random forests (with and without k-fold validation), gradient boosted trees (with and without k-fold validation), decision trees (with and without k-fold validation), SVM with RBF and sigmoid kernels, and stochastic gradient descent. None of the above worked for me unless I already knew which feature would work. But if I know that, ML is not really required; I can use a simple scan. I do not discount the possibility that it can work well for someone else and that I was not doing it right. But that was my experience.

"This must be understood, and ML is immediately disqualified unless there is a sound method for determining the significance of results that can compensate for the bias."

I absolutely agree with the above. The fitness method is absolutely the most important aspect when working with machine learning. I am also not sure that using notebooks to find results in a general way, without time-series analysis, is a good idea at all. The accuracy of the classifier might tell you that everything is better than great, but the actual trading results can be catastrophic due to volatility clustering, which this kind of analysis does not take into account. I have done a lot of work with genetic algos in the past (some discussion in this old thread: forexfactoryshowthread.phpt167720 - please note that this was before the current ML/deep-learning hype), and my observation, based on lots of studies of actual trading results, is that the best out-of-sample results (it's very easy to get good in-sample results) come from a very simple fitness method: normalized returns / maximum drawdown (and you might want to include some correlation analysis if you are trading multiple methods), as sketched below. But you always want to compute the fitness score from real market data so that volatility clustering is taken into account. Does ML work? That is like asking whether it is possible to find trading methods at all. IMHO a computer can find methods as well as a human can, but you had better be sure that you know what your requirements are, as the computer will find you the methods you have asked for, not the ones that you want.
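Mikko doesn't spell out the exact normalization of his fitness score, so the sketch below is an assumption-laden reading of "normalized returns / maximum drawdown": annualized mean return divided by the worst peak-to-trough drawdown, computed on (placeholder) realized daily returns.

```python
import numpy as np

def max_drawdown(returns):
    """Largest peak-to-trough drop of the cumulative return curve."""
    equity = np.cumprod(1.0 + np.asarray(returns))
    running_peak = np.maximum.accumulate(equity)
    return np.max(1.0 - equity / running_peak)

def fitness(returns, periods_per_year=252):
    """Annualized mean return divided by maximum drawdown (one possible reading)."""
    ann_return = np.mean(returns) * periods_per_year
    dd = max_drawdown(returns)
    return ann_return / dd if dd > 0 else np.inf

# Placeholder daily strategy returns
rng = np.random.default_rng(0)
daily = rng.normal(0.0005, 0.01, size=252)
print(fitness(daily))
```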
One way to think about this is to suppose that everything is static, going back to the Stone Age (like physics). Then I can assume a model Y = f(X1, X2), and the problem becomes finding the unknown function f. Say I'd like to know how long it takes to bring a water/ice mix to a boil on my stove top, with a constraint on the total mass of water in the pot. So X1 could be the mass ratio of water to ice, and X2 could be the setting on my burner, 1-10. And Y would be the time it takes for the water to come to a boil (I could get fancy and use a thermometer, or just look for bubbles). I could control the experiment by stirring the pot in a certain way, etc. I would then hire a graduate student, or more likely an unpaid middle-schooler, to conduct N measurements to find the relationship f. If I know what I'm doing, I'll find f and I'll be able to predict Y, the time to boiling, just fine (most likely, with some finagling, I can get a model in the form of a multivariate polynomial in X1 and X2, over some range of X1 and X2). For the stock market, one would hope to apply the same approach, but of course controlled measurements are not possible (and would probably be illegal if they were), unless X1 and X2 naturally vary in time over the parameter space of interest (which of course they do). Then I could say 'Aha! I'll just let the market do the work for me,' and once I have a sufficiently large data set of (X1, X2, Y), I'll figure out Y = f(X1, X2). If changes are slow, I can simply update Y = f(X1, X2) as new data become available, making incremental adjustments. The adjustments don't really play into my short-term forecasts; they just keep the model in line with reality over the long term. This is not unlike the water/ice boiling model construction, where I might have some long-term aging effect (e.g. gradual deterioration of my heating element) that I can compensate for by repeating the measurements on an appropriate time scale. One problem, it would seem, is that I'd like to trade at least weekly, if not more frequently, but there are only 252 trading days in a year. And I only have one noisy data point per day (if I'm following, the proposed workflow is based on pipeline and daily OHLCV bars). But then I'd like a model that is Y = f(X1, X2, ..., XN), where N is a large number, and I'd like to include non-linearity and interactions up to some order. My intuition is that unless the signal-to-noise is really good and the market mechanisms are stable over many years, I won't have enough data to trade frequently (which I'd like to do on an effectively continuous basis, to have tight control over the portfolio return and to increase the Sharpe ratio). My intuition is that basing the workflow on daily OHLCV bars doesn't make fundamental sense when a historical database of minute OHLCV bars is available. It doesn't mean that the input data would be greater in dimension (i.e. one could still use 252 data points per year), but they could be significantly better quality in terms of signal-to-noise (e.g. one could compute daily factor values using minute OHLCV bars, or summary statistics derived from minute OHLCV bars). For some reason, this basic point has not gotten traction. I'm befuddled. Am I wrong in my intuition? Whether I use ML or not, I'll be much better off with greatly improved signal-to-noise.
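For what it's worth, the second-order response surface Grant describes is easy to fit empirically; the sketch below uses synthetic boiling-time data (the coefficients and noise level are invented) and scikit-learn's PolynomialFeatures to include the squared terms and the X1·X2 interaction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Synthetic "experiment": X1 = water-to-ice mass ratio, X2 = burner setting (1-10),
# Y = minutes to boil, generated from an arbitrary smooth surface plus noise.
X = np.column_stack([rng.uniform(0.1, 1.0, 200), rng.uniform(1, 10, 200)])
y = 30 - 10 * X[:, 0] - 2.5 * X[:, 1] + 0.4 * X[:, 0] * X[:, 1] + rng.normal(0, 0.5, 200)

# Second-order response surface: intercept, X1, X2, X1^2, X2^2 and the X1*X2 term.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[0.5, 7.0]]))   # predicted time to boil at a new setting
```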
Grant, if you look at the code in the notebook you'll notice that the factors used for prediction are quite long-term ones derived from both fundamental and price data. For these kinds of factors I don't think there is much sense in using minutely data to aggregate them, but I might be wrong. If we were regressing price data, then the question would be whether data aggregated from 1-minute bars (or an even lower timeframe) would give us better results than using daily data. I don't think we can answer this question without actually implementing both methods and checking the results.

We aren't actually using daily data, unless the signal-to-noise of a single trade in a given day is the same as that of the mean price for the day, for example. The mean must be a lot less noisy, since it would be the average of 390 individual trades; the noise will be knocked down by a factor of 1/sqrt(390). And the mean would be a better representation of the price for the day; a single trade of unknown volume could be pretty wacky.

Re: stationarity. That's certainly the assumption within the window of the training data. But if you don't assume some form of stationarity over a limited time period, how would you do predictions of any nature in the first place? What the classifier is trying to learn is that if a stock has this pattern of factors (lower vol, higher earnings quality, etc.), it will do better than the others in its universe. I would not expect these patterns to hold indefinitely. One way to get around that is to retrain the classifier, e.g. every week.

Brom: Cross-validation is a way to see how robust your classifier is, and it allows you to tweak hyper-parameters. It also gives you some guidance on what to expect for the hold-out. It's true that we mix future and past data there, but that's why there is the hold-out set with only future data. That is the only number we should really care about, as I write in the NB. Cross-validation is just a tool to get more out of our training data.

Suminda: I have a notebook almost ready to go that shows this.

Michael Harris: I do know of systems not unlike the one I present here (albeit more complex) that have shown very competitive performance over long periods of time. I do agree that it's all about the factors (garbage in, garbage out), and any system that does not allow you to innovate at that level will be limiting. The ML or factor aggregation step, however, can turn many weak signals into a stronger one, especially if they are uncorrelated. But it cannot create signal where there is none. I don't think that the well-known factors I used here would be expected to carry much alpha. That's where the Quantopian data sets get interesting.

"In my mind, the most general approach would be to input a universe of 500 (or 1500) stocks, point-in-time, and then output a corresponding vector of 500 weights, normalized as a portfolio weight vector (e.g. ready for order_target_percent). Or does this happen at the next step of portfolio construction? And if so, how do the two fit together?"

That's exactly right. Transforming the long and short baskets into portfolio weights is part of the portfolio construction step. Given the baskets (just binary long/short signals or probabilities), it would run a portfolio optimization to e.g. (i) minimize volatility, (ii) minimize exposure to certain risk factors, (iii) take the previous portfolio into account to reduce turnover, and (iv) take transaction and borrow costs into account, etc., to determine the weight vector for order_target_percent(). We haven't discussed this step at all yet, but it is in the works.

Mikko: Thanks, I love your contributions. I didn't know about the classification report, which looks very useful. I do like the sklearn pipeline approach; it makes the fitting and prediction steps much more concise. I also agree that a classifier should be evaluated in a broader context and over longer periods of time. My next NB will show how to do that.
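For readers who haven't used it, the sklearn Pipeline Thomas mentions chains preprocessing and the classifier into a single fit/predict object. The sketch below is generic: the StandardScaler step and the AdaBoost classifier are illustrative choices, not the notebook's exact stages.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.random((500, 5))                   # placeholder factor values
y_train = np.where(rng.random(500) > 0.5, 1, -1)
X_test = rng.random((100, 5))

# The pipeline fits the scaler on the training data only and reuses those
# statistics when transforming the test data, avoiding subtle leakage.
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", AdaBoostClassifier(n_estimators=50, random_state=0)),
])
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```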
"ML is good at telling you what IS." This is interesting. The problem is that "what is" is always relative to some metric. In some domains (for example, relativity physics) the metric is constant over a wide range of parameters (in the case of relativity it is the space-time metric, which breaks down only at the quantum level). In the market, the metric changes all the time depending on conditions. "What is" changes constantly, and this happens naturally due to liquidity constraints, among other things.

"...before the current ML/deep-learning hype." The hype is there because the industry must morph to maintain its profitability. There is always a new promise. In the 1990s the industry offered chart patterns and simple indicators. Most traders were ruined. Now the industry offers GP, machine learning, and promises of "deep learning." Many more will get ruined. Then the industry will morph into something different. I expect home HFT to be available soon for a few dollars a month. There are good applications for ML and deep learning in other domains where features are well defined and there is no risk of ruin. I remember someone telling a story in a forum that their ML identified a pattern in stolen credit cards: the thief goes to buy gas immediately after stealing the card, with high probability. You can then tell the gas stations to be careful in accepting credit cards. Here the application limits losses. Face recognition and the aversion of a crime may limit losses of human lives. But asking ML to generate pure alpha in the markets is too much. Maybe the best use is for risk management, not for alpha. AMZN, with a P/E of about 205 (TTM) (reminiscent of the dot-com bubble, by the way), is moving the tech market. The classifier could easily get fooled into thinking that gains in some stocks are due to certain factors when in fact they are due to index correlation. I do not know how to solve this problem (de-correlate the data), and I am not even willing to try when a simple mean-reversion algo that I use, based on a simple formula out of a college text in probability theory, is performing quite well YTD while ML algos are struggling to stay in the black. Please do not take me wrong, I find high value in Quantopian, and I actually use it now for backtest sanity checks. I think this platform is quite valuable for traders, and everyone should use it and contribute. I have doubts about ML and long/short systems with the designation "market neutral." There is no such thing, and if there is, it is only temporary, and the risks are much larger than those of directional algos.
One should then be skeptical about why some people are in favor of these market-neutral strategies. Note that this market is notorious for squeezing out shorts and has done it repeatedly in the last 7 years. Why would anyone want to go short individual stocks when the risk is so high? I could think of some reasons, but this is not the place to discuss them.

Disclaimer - I run a trading blog and have had the dubious pleasure of seeing several thousand people come and quickly go (i.e. wipe out) over the past eight-plus years. And quite frankly, the shake-out rate is equally high among regular traders and highly intelligent ones with programming or even quant/math backgrounds. Intelligence and system complexity are not guarantors of success. Otherwise we'd see a lot fewer white papers on trading algos and more yachts parked over here in the Mediterranean. :-) Machine learning is just that. The wiki describes it as 'a subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence.' Says nothing in there about predicting future market activity. Which, just like the future, you CANNOT. The best you can do is to look for discrepancies, to parse for subtle repetitive patterns, to develop a small edge based on market inefficiencies. You work with the factors and markets you KNOW and NOT with the ones YOU THINK YOU KNOW or THINK/WISH WILL HAPPEN. Most of you guys have probably forgotten more about mathematics and machine learning than I will ever manage to grasp in my life. I'm literally standing on the shoulders of giants here, and I'm more an ML groupie than a quant. But over the past years I have nevertheless managed to hold my own and even help many others develop and maintain a profitable edge. And for the sole reason that I DO NOT EVER ATTEMPT TO PREDICT THE FUTURE. Having spent the better part of the last decade developing trading systems (for myself and clients), I assure you that anything that may appear to be predictable will quickly cease to be. I cannot tell you how many systems I have seen thrive for a few weeks or even months and then nosedive into oblivion. It's probably the one reliable pattern I am confident about when it comes to trading systems. Part of the problem is that markets are cyclical and that the nature of those cycles is not easily determined. The best indicator is your P&L curve, and one of the most promising techniques for improving my systems has been to know when to turn them off and then on again. There are many very profitable systems out there - you just need to figure out when they work and when they don't. That is where I believe the rubber meets the road. Now, machine learning is extremely fascinating to me, but in my mind it is also a double-edged sword. I do not believe that complexity equals success - quite the contrary, actually. Give me a simple system with simple (human-observable) rules any day, instead of having to run three parsers and decision trees in order to produce an entry or exit. The ML systems I am interested in building are simple by design, but their patterns require ML techniques to be unveiled. I would not feel comfortable trading via a black box that spits out rules I cannot understand. FWIW, this is one of the better discussions I have seen on this topic, and it reflects the high caliber of the people who participate here.

One cannot know the true intentions of a manager like Numerai. The concept is smart and the manager appears also to be very smart.
But I have no idea what he is doing in the background. Numerous tests I have done on the training and tournament data show that they are i.i.d., despite claims to the contrary, and that the highest possible accuracy without any clever boosting is on the order of 3%, with logloss higher than 0.6900. Of course, some participants can lower logloss via boosting and get a top placement, but that does not increase prediction accuracy. Now, it is possible to profit with a 3% edge if there is a decent payoff ratio and a large number of low-cost transactions, or a high payoff ratio with a lower number of transactions at moderate cost. The problem with this ensemble approach (expectation) is the possibility of ruin in the time domain. It is gambling in essence. He may have found a way to minimize the risk of ruin, but it is still there, it is finite, and as a result ruin will occur at some point (in this sense, ruin even takes place when a cumulative stop-loss is reached, after which trading stops). However, these problems are not particular to ML but present in all trading methods. ML has additional issues with high data snooping, selection bias and overfitting. Note that all of these are in play when one uses ML. In essence, then, the probability of success is very low.

Brom: "On point two: if you train your ML model on data from 2016, then test it on data from 2015 and get a high accuracy score, it seems like you shouldn't really trust it, since you wouldn't have had data from 2016 if you were trading in 2015." Obviously not. The better approach would be to train your model on data from 2008, 2010, 2014, etc. and then test it on 2016. Most likely this will break your model (completely different market cycles), and so your real task begins, which is to figure out the conditions under which your model functions within your defined parameters. As a side note, and I concede that this may sound horribly unscientific to an audience of quants: the root of all evil in system development may be the attempt to produce systems that work continuously. There may be some exceptions to this rule, e.g. the time-insensitive systems discussed further above, but I suspect that many models fail because they are based on the assumption that market patterns/fractals/cycles are continuous. I think our time may be much better spent modeling what sometimes works well, as opposed to what always works.

"Transforming the long and short baskets into portfolio weights is part of the portfolio construction step. Given the baskets (just binary long/short signals or probabilities), it would run a portfolio optimization..."

My point is that the workflow being discussed and engineered is a kind of system with interfaces. There is a step-by-step flow of inputs and outputs (for a high-level sketch, see the Quantopian blog post a-professional-quant-equity-workflow). As a side note, I would encourage the whole financial industry to stop using the term "alpha" and use "factor" instead, in this context. For one thing, it presupposes that after the universe definition step you've magically found factors that have been expunged of any overall market influences and can print money in an uncorrelated fashion. Can "alpha", as it is conventionally defined, even be calculated accurately before the backtest step anyway? Save the buzzwords for the marketing guys. Back to my point: as I understand it, you are working to put together a factor combination step using ML that will be general and applicable to the vast majority of users.
You want a kind of module (like the Q500US and Q1500US universes) that is broad-brush and configurable, and that will provide the most general set of outputs to the next step in the workflow, which, as you describe, is another optimization. So my sense is that if your factor combination step takes in N securities, it ought to spit out N securities as a normalized portfolio weight vector (e.g. a series of calls to order_target_percent(stock, weights[stock]) would update my portfolio). There would be no baskets of longs and shorts; the output of the factor combination step could be all long or all short, for example. The next step in the workflow, portfolio construction, would operate on the portfolio weight vector, as you say: run a portfolio optimization to e.g. (i) minimize volatility, (ii) minimize exposure to certain risk factors, (iii) take the previous portfolio into account to reduce turnover, and (iv) take transaction and borrow costs into account, etc. At this step, for example, you would apply the market-neutral constraint, which is a risk factor to be managed. If you impose the market-neutral constraint by design earlier in the workflow, then one can't apply it in the portfolio construction step, where it would seem to reside. Your factor combination step, I think, should be an unconstrained maximization of the overall portfolio return, point-in-time (which may be a problem, since there is no constraint on the new portfolio vector bearing any resemblance to the old one; this is handled in the OLMAR algo by minimizing the square of the Euclidean distance between the old and new portfolio vectors, subject to an overall forecasted-return inequality constraint). In the factor combination step, are you effectively finding the point-in-time overall return response surface and then finding its peak, with no constraints in the optimization (under a stationarity assumption, I gather)? Is the idea that you can sort of patch things up in the next step of portfolio construction, by re-normalizing the portfolio vector output from the factor combination step? I'm wondering if you can actually approach things as you are. Your combination and optimization steps may be better thought of as one step. You want to solve a constrained global optimization problem, I think, to control the overall portfolio return versus time. Shouldn't you be combining the factors to maximize the forecast return for the next portfolio update, subject to a set of equality and inequality constraints? I don't understand how you can break things up and still get the right answer. EDIT: Maybe what you should be passing to the portfolio optimization step is effectively a point-in-time response surface model for the overall return (the "mega-alpha")? Is that what you are doing? Then you could use the model in the constrained portfolio optimization problem. This would seem to make sense.

Here is a version with cross-validation implemented. Note that with cross-validation your training error is very close to your test error, so you can tune your classifier and your final results will be similar to your training results. With the RandomForest classifier there was almost 99% training accuracy, but the test accuracy was much closer to 50%. Now, with cross-validation, the RandomForest training error is more realistic. Additionally, I have implemented another performance metric, Youden's J-statistic, which ranges from -1 to 1. A J-statistic of 0 says your classifier is useless. I ran it for AdaBoost, NB, and RandomForest, giving: -0.04, -0.06, -0.03.
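Youden's J is just sensitivity plus specificity minus one, so it can be computed from a confusion matrix in a few lines; the labels below are placeholders chosen only to exercise the function.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def youdens_j(y_true, y_pred):
    """Youden's J = sensitivity + specificity - 1, ranging from -1 to 1."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[-1, 1]).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0

# Placeholder true labels and predictions
y_true = np.array([-1, -1, 1, 1, 1, -1, 1, -1])
y_pred = np.array([-1, 1, 1, -1, 1, -1, 1, 1])
print(youdens_j(y_true, y_pred))   # 0.25 for this toy example
```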
In general, for the portfolio optimization step, I think one wants, at a minimum, the projected absolute return of each stock in the universe, using multiple factors combined to make the forecast. It also seems like it would be handy to have a measure of the expected variance of the forecast return. Additionally, projected stock-to-stock pairwise return correlations might be useful. In the end, it is something like minimizing the rotation angle of the portfolio vector in its N-dimensional space (since there are N stocks in the universe), subject to a minimum Sharpe ratio constraint (e.g. SR > 1.0) and a constraint that the beta of the portfolio be within a range about zero (or not, if one wants a strong beta tilt, or long-only, or short-only). If the output of the alpha combination step is just baskets of long and short stocks, with no other information, it seems like the optimization step won't be performed optimally. But admittedly, I'm still trying to get my head around this whole thing.

Peter: Thanks, those are great contributions. I updated the original NB with your fixes to the factor library. It's also validating to see that the cross-validation gives similarly poor results to the hold-out, suggesting it does not overfit. I think the next step would be to use less common factors to try to improve accuracy.

Guy: I updated the original NB with a timer for running the pipeline; it took around 3 minutes.

Grant: You can certainly experiment with different classifiers, like a polynomial model. In general, these give poor out-of-sample performance due to overfitting. Beta-to-SPY is a great factor to include as well. Your comments, in general, are valid ways to experiment with the workflow. You shouldn't understand this as "the one and only way to do algorithmic trading on Q", but rather as "here is a starting point that might be a useful template to extend from".
You might have missed it, but Grant talked about regression (similar to a probit model, maybe) and portfolio optimization, not classification. You said that "in general, these give poor out-of-sample performance due to overfitting," and I agree. But this also holds for machine learning classification. It is not entirely clear how classification is better than either simple regression or probit regression for large long/short portfolios, in the absence of a study. At the end of the day, classification is also an optimization method. I would start by comparing the results of machine learning to a simple cross-sectional momentum model, i.e. buy the strong performers and short the weak performers. If the excess alpha is not significant, then this may be overkill, but nevertheless a good analytical exercise. Data snooping is hard to overcome, and it is a serious issue in machine learning. As the number of factors and securities increases, it becomes difficult to assess the significance of the results due to the multiple-comparisons problem.

"You can certainly experiment with different classifiers, like a polynomial model. In general, these give poor out-of-sample performance due to overfitting."

I think over-fitting can be avoided by excluding terms not supported by the fit statistics. As I recall, one ends up with p-values for each polynomial term, and one accepts or rejects terms based on their p-values. For example, if I have only two points (factor-response pairs) and I try to fit a quadratic to them, the quadratic terms should be flagged as over-fit; I can only fit a straight line (or a constant). Given that one can manage the over-fitting problem, is there a reason this approach would still be a bad idea, compared with ML or something else? The stationarity issue would seem to be the sticky one. If all of my factor-response data represent one unchanging underlying market going back 20 years, then I have a shot at sorting things out. I gather, though, that we are perhaps dealing with transient stationarity. Is this correct? Don't I also need to know which factors are valid, point-in-time? Perhaps this was the point made by others above, that a static ML model (or any static model, for that matter) might not work. Do I need to know the validity of each factor as a function of time? Normally, one would want to know whether a given factor-response is "in control" over the look-back period, prior to attempting a model.

Here is a version of the notebook that uses sector codes. The sector codes are then binarized (a one-hot encoding sketch appears below) so that they can be used in an algorithm that cannot handle categorical data, SVM for example. There was no improvement in the training performance, so I didn't implement the categories in the test.

Gentlemen: the efficacy or otherwise of ML can only be tested by applying the predictions to past data and looking at the profitability of the system. Note my extreme scepticism of backtesting after many years of experience; however, it is all we have. The predictive capability of backtesting as to the future profitability of a given system is probably rather feeble. As we all know, however, given a 50/50 win-loss ratio it is quite possible to make a substantial profit from long-term trend following. CTAs usually make do with 40/60 against. Trend following is not what Q wants to do. Right. But the rest of us may have an interest in this long-standing apparent anomaly.
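Peter's notebook isn't reproduced here, but "binarizing" sector codes is usually just one-hot encoding; a minimal pandas sketch follows, with made-up ticker symbols and sector codes.

```python
import pandas as pd

# Illustrative sector codes for a handful of made-up securities.
sectors = pd.Series({"AAA": 311, "BBB": 102, "CCC": 311, "DDD": 206},
                    name="sector_code")

# One-hot encode the categorical codes so that an SVM (or any classifier that
# expects numeric features) can use them alongside the factor ranks.
sector_dummies = pd.get_dummies(sectors, prefix="sector")
print(sector_dummies)
```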
Peter: Interesting. It seems to be doing a little bit better (but probably not significantly). Personally, I'd be interested to see how it does on the hold-out. Sectors are intuitively an important thing to consider. An alternative approach would be to train a classifier for each sector.

I wonder whether it is better to start from fundamental analysis rather than end with it. Testing each and every factor is to begin with fundamental analysis and to let ML decide which of those factors has predictive ability. Michael Harris and others have pointed out the dangers of over-optimisation and of repeatedly testing new ML versions over the same data. This is the paradox of back testing: ultimately it is useless. The determinants of price in the LONG term must surely be obvious: economic growth at the macro level and earnings growth at the individual corporate level, coupled with a strong balance sheet. In the long term these factors will win out, and these companies will enter into and remain in the relevant big-cap stock indices until their success falters and they begin the inevitable decline which is the fate of all things. Anything else is mere noise. So perhaps ML ought to be secondary to one's judgement of what makes markets tick. Unfortunately, of course, this is hardly a sexy approach for a hedge fund.

Michael: Certainly that's true, although we have quite a bit of hold-out data here, which dampens the severity. The other benefit is that every day we accumulate new hold-out data. As with algorithms, the proof is in paper-trading.

Anthony: I definitely think intuition and statistics should complement each other. For example, a quality-earnings factor has some intuitive appeal, so I include it in the model. Then, when training the classifier, I find an inverted-U-shaped relationship to how it relates to returns. That could lead to the insight that obviously bad earnings are bad, but perhaps overly positive reports are suspicious too and might try to mask a more fundamental problem. You seem to have gained a lot of intuition in markets over the years, which is highly valuable, so I encourage you to compile this knowledge into factors which more ML and data-science people can play around with and try to optimally employ and combine.
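The alternative Thomas mentions earlier in this exchange, training a separate classifier per sector, could be prototyped roughly as follows. The synthetic data, column names and model choice here are assumptions for illustration only.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "factor1": rng.normal(size=300),
    "factor2": rng.normal(size=300),
    "sector":  rng.choice([101, 102, 103], size=300),
    "label":   rng.integers(0, 2, size=300),   # up/down forward return
})

# Fit one model per sector, keyed by sector code
models = {}
for sector, grp in df.groupby("sector"):
    X, y = grp[["factor1", "factor2"]], grp["label"]
    models[sector] = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# At prediction time, route each stock to the model for its sector
preds = {sector: models[sector].predict(grp[["factor1", "factor2"]])
         for sector, grp in df.groupby("sector")}
print({sector: p[:5] for sector, p in preds.items()})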
My intuition is that if the conditions are right for ML to work, then multivariate polynomial regression for forecasting returns will also work, if applied correctly. If the underlying system is stable in time, all of my factor-response pairs are stationary, and all is copasetic, then I should be able to find an empirically derived multi-factor response surface. There are areas of science and engineering where this works just fine: even if I don't have a scientific model (e.g. from Maxwell's equations in E&M), with the right experimental data sets I can build a perfectly valid, predictive multi-factor empirical model. By analogy, in finance, if the underlying system is stable, then it should be predictable, given the right data sets.

Nice summary. It should be used as a guide for any future work. The clash is between unique hypotheses versus hypotheses suggested by the data. I still do not understand whether the objective is asset allocation or short-term anomaly arbitrage. Using fundamental factors for the latter makes no sense. Can anyone clarify the above? For example, the probability of AMZN being selected as a longer-term holding based on fundamental factors is quite low. On the other hand, as you said, "Unfortunately of course this is hardly a sexy approach for a hedge fund", except in the case where they act as your sell side, testing market strength with a small investment while leveraging the faded trade with a bigger investment.

"We have quite a bit of hold-out data here which dampens the severity." This is true, i.e. a large hold-out increases the power of the test and minimizes Type I error. However, what is important is the actual sample of predictions and not the hold-out length. If that is a sufficient sample then there is a chance. But a large hold-out limits the exposure of ML to a variety of market conditions. In my book I recommend doing away with the hold-out and using two distinct universes, one for learning and the other for testing, both spanning the same timeframe. The test universe should be chosen in advance and never changed, to minimize data-snooping. This certainly decreases the probability of p-hacking, but never to zero. There is always the possibility of a random fit. What are the objectives of Point72 in terms of trading timeframes: intraday, short-term, medium-term or longer-term? That can make a huge difference in the approach.

Michael: What I never understood about the two-universe hold-out method is how correlations do not invalidate it.
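Grant's idea of a polynomial regression whose terms are accepted or rejected on their p-values can be sketched with statsmodels along the following lines. The data here are synthetic and single-factor, and the 0.05 cutoff is an arbitrary choice for illustration, not a recommendation from the thread.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
factor = rng.normal(size=200)
# Synthetic response: truly linear in the factor, plus noise
ret = 0.5 * factor + rng.normal(scale=1.0, size=200)

# Candidate polynomial terms up to degree 3, plus a constant
X = sm.add_constant(np.column_stack([factor, factor**2, factor**3]))
fit = sm.OLS(ret, X).fit()
print(fit.pvalues)   # the quadratic and cubic terms should look insignificant here

# Keep only terms whose p-values pass the chosen threshold, always keeping the constant
keep = [i for i, p in enumerate(fit.pvalues) if i == 0 or p < 0.05]
reduced = sm.OLS(ret, X[:, keep]).fit()
print(reduced.params)

This is exactly the situation Grant describes: terms the data cannot support get flagged by their p-values and are dropped before the model is used out of sample.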
If correlations are high and we are in a bear market, the classifier will easily predict all stocks to go down and get a high score on the other 50%. It is also much more complex to find an equal split across industries, etc. It also just seems more detached from what we're actually trying to measure (performance going forward, rather than performance if we knew half of the universe's returns).

I agree with Anthony's point of view. Fundamental data on a day-to-day basis has very little predictive power. And, as expressed by Grant, a multivariate polynomial regression is the same thing as using multiple factors. All real-life trading decisions have to be taken at the right edge of a chart. You can't decide backwards. Looking at past data can only be useful to validate a concept, to show that indeed using such and such trading methods would have produced something over a particular data set over a particular period in time. The main idea has always been to extend into the future what has been found to work in the past. However, if I manipulate the data in the past just to show good results (over-optimize), I should not expect the market to comply with my better trading methodology going forward. That is the problem: at the right edge of a chart, the future remains this stochastic sea of stock variances generating this buzzing cloud of price variations where a quasi-random-walk infrastructure might prevail, and it might be quite independent of its past. Maybe what best describes what I am saying is the following chart: in the beginning, fundamental data is a lesser part of price variations, while over the long term it is what will have dominated the scene. Without some positive fundamental data to support a stock over the interval, a stock might not even survive to appreciate the notion of long term.

One problem I see here is that you are trying to show that everything works on real data from the get-go. Your methodology would benefit from synthesizing stock data with known characteristics. Then you can see if the known inputs generate the expected output. Then you can take the next step and say, "If the real-world input data have characteristics similar to my synthesized data, then my tool will work", and proceed to try real data. As Michael Harris touches on above, you first need to prove "the integrity of the ML process" and then give it a go on real-world data. For example, if you were developing an ML system to recognize faces, you might use some 3D rendering software to synthesize a set of faces for training, and then see if you could extract known features, before moving on to applying your ML system to actual faces.
Is this kind of thing done in the quant world?

Here's an example of someone doing something kind of similar to what you're describing in terms of using synthetic data series: jonathankinlay201406developing-trading-strategies-with-genetic-programming. Specifically the paragraphs that read: "One of the challenges I devised was to create data sets in which real and synthetic stock series were mixed together and given to the system to evaluate. To the human eye (or an analyst's spreadsheet), the synthetic series were indistinguishable from the real thing. But, in fact, I had planted some patterns within the processes of the synthetic stocks that made them perform differently from their real-life counterparts. Some of the patterns I created were quite simple, such as introducing a drift component. But other patterns were more nuanced, for example, using a fractal Brownian motion generator to induce long memory in the stock volatility process. It was when I saw the system detect and exploit the patterns buried deep within the synthetic series to create sensible, profitable strategies that I began to pay attention. A short time thereafter Haftan and I joined forces to create what became the Proteom Fund." It should be noted that in this case he deliberately throws in patterns that are different from the patterns in the actual stock data, rather than making them as similar as possible. Figured you still might be interested though.

I just shared that as an example somewhat along the same lines of what Grant was describing in terms of using synthetic data. I can't speak to the effectiveness of the author's use of the synthetic data; the topic is far beyond my capabilities, so I can't comment personally. However, I've read some of his research via his blog for a while (jonathankinlay) and he seems to know his way around a trading model, so I tend to take notice whenever he writes about something. Obviously, as you pointed out, one can never know without access to his historical returns, which I also don't have. I recommend you read some more of his writings to decide for yourself.

The high Sharpe ratios in excess of 5 indicate a high probability of over-fitting to noise. The inclusion of a drift component in the synthetic data is possibly responsible for that. And thanks, but no: I will not be spending any time in a blog like that. The author gives the answer about the quality of his work himself: "Even so, given the enormous number of models evaluated, there remains a significant risk of over-fitting." Note that the approach followed in that article is quite different from the sound ML approach of Quantopian, with models based on classifiers of economically sound factors. It is hard to incorporate these factors in synthetic data, and for this reason such data are useless for the purposes of the modeling approach here. I disagree with some posters who have questioned the integrity of the ML process followed by Quantopian. This is a valid process based on classifying features that have economic value. Synthetic data and a two-universe hold-out are not required. All that is needed is forward testing of about three months, and then go live if it works.

For weekly trading, that's only 12 trades (sets of orders). Doesn't seem like enough data.

It is 12 times the number of stocks, and it's not unreasonable to assume a universe of more than 200 stocks, so at least 2400 data points, which is not nothing. The larger the number of data points and trading days, the more confident you can obviously be.
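To make Grant's synthetic-data suggestion (and the planted-drift example from the Kinlay quote above) concrete, here is a minimal sketch that generates a pure random walk alongside an otherwise similar walk with a small drift deliberately planted in it. The drift size, volatility and seed are arbitrary choices; the point is only that a pipeline which cannot recover a pattern it is known to contain probably should not be trusted on real prices either.

import numpy as np

rng = np.random.default_rng(42)
n_days = 2000
daily_vol = 0.01

# Synthetic "stock" with no structure at all: a pure geometric random walk
noise_only = 100 * np.exp(np.cumsum(rng.normal(0.0, daily_vol, n_days)))

# Synthetic "stock" with a planted pattern: the same kind of walk plus a small drift
drift = 0.0005   # roughly 13% a year, deliberately planted
with_drift = 100 * np.exp(np.cumsum(rng.normal(drift, daily_vol, n_days)))

# A sanity check any detection pipeline should pass: the planted drift is recoverable
print("mean daily log-return, noise only :", np.diff(np.log(noise_only)).mean())
print("mean daily log-return, with drift :", np.diff(np.log(with_drift)).mean())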
But I would add another point here which I don't think has been mentioned. If, after three months of live trading, your backtest closely tracks your real account activity, then you can be fairly confident that you are on the right track. If you made a bunch of money in those three months but your backtest doesn't track your live activity well, you need to rethink what it is you are doing.
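That "does the backtest track the live account" check can be made slightly more quantitative. The sketch below uses hypothetical daily return series (not real data) and compares their correlation and tracking error over roughly three months.

import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
days = pd.bdate_range("2017-05-01", periods=63)           # about three months of trading days
backtest = pd.Series(rng.normal(0.0005, 0.01, len(days)), index=days)
live = backtest + rng.normal(0.0, 0.002, len(days))       # live returns, assumed close to backtest

corr = backtest.corr(live)
tracking_error = (live - backtest).std() * np.sqrt(252)   # annualized
print(f"correlation: {corr:.2f}, annualized tracking error: {tracking_error:.2%}")

# A low correlation or a large tracking error after three months would suggest the live
# implementation is not doing what the backtest assumes, regardless of realized profit.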
