Artificial Intelligence
What is artificial intelligence?
Artificial intelligence leverages computers and machines to mimic the problem-solving and decision-making capabilities of the human mind. AI combines computer science and robust datasets to enable problem-solving. AI developers take the models that data scientists create and make them into deployable models that can be used in applications.
The diagram below shows how machine learning and deep learning fit into the realm of AI.
Why is this important for hybrid cloud developers?
Integrating AI and machine learning technologies with cloud environments is an increasingly common scenario, driven by the use of microservices and the need to scale rapidly. Developers face the challenge of not only building machine learning applications, but also ensuring that they run well in production in cloud-native and hybrid cloud environments.
Solution sketch
When developing AI-powered services and applications that run in cloud environments, there are several development areas to consider, including:
- Data selection
- Data preprocessing
- Data visualization
- Model development
- Model deployment
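As a rough illustration of how these areas connect, the following minimal sketch walks a toy dataset through selection, preprocessing, model development, and a stand-in for deployment, using only the Python standard library. The data values and all function names are hypothetical, the visualization step is omitted, and a real project would use the libraries listed later on this page rather than hand-written code.

```python
import json
import statistics

# Hypothetical toy data: (feature, target) pairs; None marks a missing reading.
RAW_ROWS = [(1.0, 12.0), (2.0, 19.5), (None, 30.0), (3.0, 31.0), (4.0, 41.5), (5.0, 49.0)]


def select_data(rows):
    """Data selection: keep only rows with no missing values."""
    return [(x, y) for x, y in rows if x is not None and y is not None]


def preprocess(rows):
    """Data preprocessing: center the feature on its mean."""
    mean_x = statistics.fmean(x for x, _ in rows)
    return [(x - mean_x, y) for x, y in rows], mean_x


def fit_line(rows):
    """Model development: ordinary least squares for y = slope * x + intercept."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def export_model(model, mean_x):
    """Model deployment (stand-in): serialize parameters for a service to load."""
    return json.dumps({"mean_x": mean_x, **model})


def predict(serialized_model, x):
    """What a deployed service might do: load the parameters and score one input."""
    params = json.loads(serialized_model)
    return params["slope"] * (x - params["mean_x"]) + params["intercept"]


selected = select_data(RAW_ROWS)
centered, feature_mean = preprocess(selected)
artifact = export_model(fit_line(centered), feature_mean)
print(predict(artifact, 6.0))
```

Serializing the parameters to JSON is only a placeholder for a real deployment step; it shows that the preprocessing state (here, the feature mean) has to travel with the model so that new inputs are transformed the same way at prediction time.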
Limitations
Overfitting: Overfitting occurs when a statistical model fits its training data too closely, capturing noise along with the underlying pattern. When this happens, the model cannot perform accurately on unseen data, which defeats its purpose.
Underfitting: Underfitting occurs when a statistical model is not complex enough to capture the pattern in its training data. When this happens, the model cannot perform accurately on either the training data or unseen data, which defeats its purpose.
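One way to see both limitations is to fit polynomials of increasing degree to the same noisy data and compare the error on the training points with the error on held-out points. The sketch below uses NumPy's `numpy.polynomial.Polynomial.fit`; the synthetic sine-shaped data, the noise level, and the chosen degrees (1, 3, and 12) are assumptions made purely for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)


def noisy_sine(x):
    """Hypothetical ground truth (a sine wave) plus Gaussian noise."""
    return np.sin(np.pi * x) + rng.normal(scale=0.2, size=x.shape)


# A small training set and a denser held-out set drawn from the same process.
x_train = np.linspace(-1.0, 1.0, 20)
y_train = noisy_sine(x_train)
x_test = np.linspace(-0.95, 0.95, 200)
y_test = noisy_sine(x_test)


def train_test_mse(degree):
    """Fit a polynomial of the given degree and return (train MSE, test MSE)."""
    model = Polynomial.fit(x_train, y_train, degree)
    train_mse = float(np.mean((model(x_train) - y_train) ** 2))
    test_mse = float(np.mean((model(x_test) - y_test) ** 2))
    return train_mse, test_mse


results = {degree: train_test_mse(degree) for degree in (1, 3, 12)}
for degree, (train_mse, test_mse) in results.items():
    print(f"degree={degree:2d}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```

The degree-1 model is too simple for a sine-shaped target, so its error stays high on both sets, which is the underfitting case. Training error can only go down as the degree rises, so a gap that opens between a very low training error and a higher test error at degree 12 would be the signature of overfitting. The exact numbers depend on the random seed and the noise level.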
Key open source projects
Open datasets
Source | APIs | Description |
---|---|---|
Project CodeNet | N/A | Large-scale dataset with approximately 14 million code samples, each of which is an intended solution to one of 4000 coding problems |
UC Irvine Machine Learning Repository | N/A | Collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. |
Kaggle datasets | N/A | Over 50,000 public datasets and 400,000 public notebooks |
Registry of Open Data on AWS | N/A | Discover and share datasets that are available via AWS resources. |
Awesome Public Datasets | N/A | A high quality list of topic-centric public data sources |
World Bank Open Data | World Bank Data Open API | Free and open access to global development data |
WHO Open Data | GHO OData API | World Health Organization’s gateway to health-related statistics for its 194 Member States |
Google Public Data Explorer | N/A | Large datasets that are easy to explore, visualize and communicate |
U.S. Census Bureau | Microdata API | The leading source of quality data about the United States’ people and economy |
Data.gov | Data.gov CKAN API | Data, tools, and resources to conduct research, develop web and mobile applications, design data visualizations, and more |
Yelp Open Dataset | N/A | Subset of Yelp’s businesses, reviews, and user data for use in personal, educational, and academic purposes |
UNICEF | N/A | World’s leading source of data on children with databases of hundreds of internationally valid and comparable indicators |
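For the sources that expose an API, data can be pulled programmatically rather than downloaded by hand. The sketch below targets the World Bank Data Open API using only the Python standard library. It assumes the v2 URL layout (`/country/{code}/indicator/{code}`) and a JSON response shaped as a two-element list of paging metadata followed by records with `date` and `value` fields; check both against the current API documentation. The helper names are hypothetical, and error responses are not handled.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.worldbank.org/v2"


def build_indicator_url(country, indicator, per_page=50):
    """Build the request URL for one indicator in one country, asking for JSON."""
    query = urlencode({"format": "json", "per_page": per_page})
    return f"{BASE_URL}/country/{country}/indicator/{indicator}?{query}"


def parse_observations(payload):
    """Turn the assumed [metadata, records] response into (year, value) pairs.

    Records whose value is null are skipped.
    """
    _metadata, records = payload
    return [(r["date"], r["value"]) for r in records if r["value"] is not None]


def fetch_observations(country, indicator):
    """Download and parse one indicator series (requires network access)."""
    with urlopen(build_indicator_url(country, indicator), timeout=30) as response:
        return parse_observations(json.load(response))


# Example call (not executed here): total population for Brazil.
# observations = fetch_observations("BR", "SP.POP.TOTL")
print(build_indicator_url("BR", "SP.POP.TOTL"))
```

Keeping URL construction and response parsing separate from the network call makes the logic easy to test offline with a saved response.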
Data preprocessing - Wrangling, Cleaning, Analyzing, Computing
Name | Description | GitHub Repo | Get Started guide |
---|---|---|---|
Pandas | A fast, powerful, flexible and easy to use open source data analysis and manipulation tool built on Python | Source | Guide |
scikit-learn - preprocessing | The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators. | Source | Guide |
NumPy | An open source project that enables numerical computing with Python | Source | Guide |
BeautifulSoup | A Python library designed for quick turnaround projects like screen-scraping | Source | Guide |
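As a small illustration of how these tools fit together, the sketch below uses pandas to fill a missing value and `StandardScaler` from `sklearn.preprocessing` to rescale each column to zero mean and unit variance. The column names and values are made up for the example, and median imputation is just one reasonable choice among several.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data with one missing age.
raw = pd.DataFrame(
    {
        "age": [25.0, 32.0, np.nan, 47.0],
        "income": [40000.0, 52000.0, 61000.0, 83000.0],
    }
)

# Cleaning: replace each missing value with its column's median.
clean = raw.fillna(raw.median())

# Scaling: transform every column to zero mean and unit variance so that
# features measured on very different scales contribute comparably.
scaler = StandardScaler()
scaled = scaler.fit_transform(clean)

print(clean)
print(scaled.round(3))
```

In practice the scaler should be fitted on the training split only and then reused, via `scaler.transform`, on validation and production data so that all inputs are rescaled with the same statistics.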
Data visualization
Name | Description | GitHub Repo | Get Started guide |
---|---|---|---|
Matplotlib | Comprehensive library for creating static, animated, and interactive visualizations in Python | Source | Guide |
Seaborn | High-level interface for drawing attractive and informative statistical graphics | Source | Guide |
Plotly | A Python graphing library that makes interactive, publication-quality graphs | Source | Guide |
Bokeh | Python library for creating interactive visualizations for modern web browsers | Source | Guide |
Model development
Name | Description | GitHub Repo | Get Started guide |
---|---|---|---|
Scikit-Learn | Simple and efficient tools for predictive data analysis | Source | Guide |
XGBoost | Optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. | Source | Guide |
TensorFlow | The core open source library to help you develop and train ML models | Source | TensorFlow Get Started |
Keras | A deep learning API written in Python, that enables fast experimentation | Source | Keras Get Started |
PyTorch | A machine learning framework that accelerates the path from research prototyping to production deployment | Source | PyTorch Get Started |
PySpark | PySpark is an interface for Apache Spark in Python. PySpark supports a machine learning library known as MLlib that is used for model training. | Source | Guide |
NLTK | A platform for building Python programs to work with human language data | Source | Guide |
Gensim | Python library for processing raw, unstructured digital texts | Source | Guide |
Statsmodels | Python module that provides classes and functions for the estimation of many different statistical models | Source | Guide |
Model deployment
Name | Description | GitHub Repo | Get Started guide |
---|---|---|---|
Kubeflow | The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. | Source | Guide |
TensorFlow Extended | TensorFlow Extended is an end-to-end platform for preparing data, training, validating, and deploying models in large production environments. | Source | Guide |
MLflow | MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. | Source | Guide |
Cloud comparison
Category | IBM Cloud | GCP | AWS | Azure |
---|---|---|---|---|
SaaS Platforms | Cloud Pak for Data | Vertex AI | SageMaker | Azure ML |