4lqedz posted on 2024-10-3 19:07:12

From Novice to Master: A Beginner's Guide to Pandas


    <div style="color: black; text-align: left; margin-bottom: 10px;">
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Selected from Medium</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;">Author: Rudolf Höhn. Compiled by Synced (机器之心)</strong></p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;">Contributors: Li Shimeng, Zhang Qian</strong></p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">In this article, the author starts with an introduction to Pandas and then walks step by step through its current state of development, memory optimization, and more. It is a best-practices tutorial, suitable both for readers who have used Pandas and for beginners who have not but want to get started.</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">By the end of this article, you should have discovered one or more new ways of coding with pandas.</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">This article covers the following topics:</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The state of Pandas development; memory optimization; indexing; method chaining; random tips.</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">While reading, I recommend looking up the docstring of every function you do not know. A quick Google search and a few seconds with the Pandas documentation will make the read more pleasant.</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;">Definition and status of Pandas</strong></p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;">What is Pandas?</strong></p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Pandas is "an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language". In short, it provides the data abstractions called DataFrame and Series (for those who use Panel: it has been deprecated), manages indexes for fast data access, performs analysis and transformation operations, and can even plot (using the matplotlib backend).</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">At the time of writing, the latest Pandas release was v0.25.0 (https://github.com/pandas-dev/pandas/releases/tag/v0.25.0).</p>

      <div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/5b8309bbdda94e989671d264f7370546~noop.image?_iz=58558&amp;from=article.pc_detail&amp;lk3s=953192f4&amp;x-expires=1728117804&amp;x-signature=jiPDpIaDWndH6bA%2BFzRyupt6oPQ%3D" style="width: 50%; margin-bottom: 20px;"></div>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Pandas is gradually moving toward version 1.0, and to get there it has changed many details that people were used to. Marc Garcia, one of the Pandas core developers, gave a very interesting talk, "Towards Pandas 1.0".</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Talk: https://www.youtube.com/watch?v=hK6o_TDXXN8</p>

      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">In one sentence: Pandas v1.0 mainly improves stability (e.g., for time series) and removes unused parts of the code base (e.g., SparseDataFrame).</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;">Data</strong></p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Let's get started! We use "suicide rates per country from 1985 to 2016" as a toy dataset. It is simple enough, yet sufficient to get hands-on with Pandas.</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Dataset: https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Before diving into the code, if you want to reproduce the results, first run the code below to prepare the data so that the column names and types are correct.</p>
      <pre>
import os

import numpy as np
import pandas as pd

# to download https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016
data_path = 'path/to/folder/'
df = (pd.read_csv(filepath_or_buffer=os.path.join(data_path, 'master.csv'))
      # The GDP header in the CSV carries surrounding spaces
      .rename(columns={'suicides/100k pop': 'suicides_per_100k',
                       ' gdp_for_year ($) ': 'gdp_year',
                       'gdp_per_capita ($)': 'gdp_capita',
                       'country-year': 'country_year'})
      # Strip the thousands separators before casting to integers
      .assign(gdp_year=lambda _df: _df['gdp_year'].str.replace(',', '').astype(np.int64))
     )
      </pre>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Tip: if you read a large file, pass chunksize=N to read_csv (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html); it then returns an iterator that yields DataFrame objects.</p>
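      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The chunksize tip can be sketched as follows. This is a minimal illustration, not part of the original article: the CSV text and its column names are made up, and an in-memory buffer stands in for a large file on disk.</p>

```python
import io

import pandas as pd

# Hypothetical stand-in for a large CSV file: 10 data rows built in memory.
csv_text = "country,year,suicides_no\n" + "\n".join(
    f"C{i % 3},{1985 + i},{i}" for i in range(10)
)

chunk_sizes = []
total = 0
# With chunksize set, read_csv returns an iterator that yields DataFrames
# of at most 4 rows each instead of one large DataFrame.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    chunk_sizes.append(len(chunk))
    total += int(chunk["suicides_no"].sum())

print(chunk_sizes, total)
```

      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Each chunk is an ordinary DataFrame, so aggregates can be accumulated without ever holding the whole file in memory.</p>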
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Here is a short description of the dataset:</p>
      <div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/fee5271742e94cb7bcb10a11c6b63f44~noop.image?_iz=58558&amp;from=article.pc_detail&amp;lk3s=953192f4&amp;x-expires=1728117804&amp;x-signature=NWPpPMtDhFUkN4twOtBBjeYQTvU%3D" style="width: 50%; margin-bottom: 20px;"></div>
      <pre>
&gt;&gt;&gt; df.columns
Index([...], dtype='object')
      </pre>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">There are 101 countries, years from 1985 to 2016, two sexes, six generations, and six age groups. A few methods help to obtain this information:</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">unique() and nunique() return the unique values in a column (or the number of unique values);</p>
      <pre>
&gt;&gt;&gt; df['generation'].unique()
array([...], dtype=object)
&gt;&gt;&gt; df['country'].nunique()
101
      </pre>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">describe() outputs different statistics for each numeric column (e.g., min, max, mean, count); with include='all', it additionally reports, for each object column, the number of unique elements and the most frequent element together with its count;</p>
      <div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/71736b47566c4914a42824f084f40607~noop.image?_iz=58558&amp;from=article.pc_detail&amp;lk3s=953192f4&amp;x-expires=1728117804&amp;x-signature=SqU8UkhoAvo6hEthY2yNPDOEWwM%3D" style="width: 50%; margin-bottom: 20px;"></div>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">head() and tail() let you preview a small part of the DataFrame.</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">With these methods, you can quickly get to know the tabular file you are analyzing.</p>
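      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">As a rough illustration of these inspection methods, the sketch below runs them on a tiny made-up DataFrame (the values are invented and are not taken from the Kaggle dataset).</p>

```python
import pandas as pd

# Toy data with invented values, only to show the shape of each result.
toy = pd.DataFrame({
    "country": ["Albania", "Albania", "Poland", "Poland"],
    "sex": ["male", "female", "male", "female"],
    "suicides_no": [21, 14, 30, 12],
})

unique_countries = toy["country"].unique()   # array of distinct values
n_countries = toy["country"].nunique()       # how many distinct values

# include="all" adds the rows unique/top/freq for object columns,
# next to the numeric statistics (mean, std, min, ...).
summary = toy.describe(include="all")
print(summary.loc[["count", "unique", "top", "freq"], ["country", "sex"]])

# head() and tail() return the first and last rows.
print(toy.head(2))
print(toy.tail(1))
```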
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;">Memory optimization</strong></p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Before processing the data, an important step is to understand it and to choose the right type for each column of the DataFrame.</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Internally, Pandas stores a DataFrame as numpy arrays of different types (e.g., one float64 matrix, one int32 matrix).</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">There are two ways to reduce memory consumption drastically.</p>
      <pre>
import pandas as pd


def mem_usage(df: pd.DataFrame) -&gt; str:
    """This method styles the memory usage of a DataFrame to be readable as MB.

    Parameters
    ----------
    df: pd.DataFrame
        Data frame to measure.

    Returns
    -------
    str
        Complete memory usage as a string formatted for MB.
    """
    return f'{df.memory_usage(deep=True).sum() / 1024 ** 2:3.2f} MB'


def convert_df(df: pd.DataFrame, deep_copy: bool = True) -&gt; pd.DataFrame:
    """Automatically converts columns that are worth storing as ``categorical`` dtype.

    Parameters
    ----------
    df: pd.DataFrame
        Data frame to convert.
    deep_copy: bool
        Whether or not to perform a deep copy of the original data frame.

    Returns
    -------
    pd.DataFrame
        Optimized copy of the input data frame.
    """
    # Convert a column when fewer than half of its values are unique
    return df.copy(deep=deep_copy).astype({
        col: 'category' for col in df.columns
        if df[col].nunique() / df[col].shape[0] &lt; 0.5})
      </pre>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Pandas provides a method called memory_usage() that analyzes the memory consumption of a DataFrame. In the code, deep=True ensures that the actual system usage is taken into account.</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">memory_usage(): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.memory_usage.html</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Understanding column types (https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-dtypes) matters. It can save up to 90% of memory usage through two simple steps: know which types your DataFrame uses; know which types it could use to reduce memory (e.g., for a price column with values between 0 and 59 and one decimal place, float64 may incur unnecessary memory overhead).</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Besides downsizing numeric types (int32 instead of int64), Pandas offers a categorical type: https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html</p>
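      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">To make the numeric downsizing concrete, here is a small sketch. The price and quantity columns are hypothetical, and float32 and int8 are simply one reasonable choice for values in these ranges, not a recommendation from the original article.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical columns: prices between 0 and 59 with one decimal,
# quantities between 0 and 99.
prices = pd.DataFrame({
    "price": np.round(np.linspace(0, 59, 1000), 1),      # float64
    "quantity": np.arange(1000, dtype=np.int64) % 100,   # int64
})

before = prices.memory_usage(deep=True).sum()

# float32 keeps one decimal place accurately in this range,
# and int8 can hold every value from 0 to 99.
smaller = prices.astype({"price": np.float32, "quantity": np.int8})
after = smaller.memory_usage(deep=True).sum()

print(f"before: {before} bytes, after: {after} bytes")
```

      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Check the value range before downcasting: an integer type that is too small cannot represent the data correctly.</p>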

      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">If you are an R developer, you will recognize it as the same idea as the factor type.</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The categorical type replaces repeated values with an index and stores the actual values elsewhere. The textbook example is countries. Instead of storing the same strings "Switzerland" or "Poland" many times, why not simply replace them with 0 and 1 and store the mapping in a dictionary?</p>
      <pre>
categorical_dict = {0: 'Switzerland', 1: 'Poland'}
      </pre>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Pandas does almost exactly this, while adding all the methods needed to actually use the type and still display the country names.</p>
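      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The mapping idea can be inspected directly on a categorical Series. The sketch below uses a made-up Series; note that pandas sorts the inferred categories, so the integer codes here differ from the dictionary in the text.</p>

```python
import pandas as pd

countries = pd.Series(
    ["Switzerland", "Poland", "Switzerland", "Poland", "Switzerland"]
)
as_category = countries.astype("category")

# The distinct labels are stored once ...
categories = list(as_category.cat.categories)
# ... and every row only stores a small integer code pointing at a label.
codes = list(as_category.cat.codes)

print(categories)
print(codes)
# The Series still displays and compares as country names.
print(as_category.tolist())
```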
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Back to convert_df(): it automatically converts a column to category if fewer than 50% of its values are unique. The threshold is arbitrary, but since casting types in a DataFrame means moving data between numpy arrays, the gain has to outweigh the cost.</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Let's see what happens with our data.</p>
      <pre>
&gt;&gt;&gt; mem_usage(df)
10.28 MB
&gt;&gt;&gt; mem_usage(df.set_index(['country', 'year', 'sex', 'age']))
5.00 MB
&gt;&gt;&gt; mem_usage(convert_df(df))
1.40 MB
&gt;&gt;&gt; mem_usage(convert_df(df.set_index(['country', 'year', 'sex', 'age'])))
1.40 MB
      </pre>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">With the "smart" converter, the DataFrame uses almost 10 times less memory (7.34 times, to be exact).</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;">Indexing</strong></p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Pandas is powerful, but it comes at a cost. When you load a DataFrame, it creates an index and stores the data in numpy arrays. What does that mean? Once the DataFrame is loaded, you can access the data quickly, provided you manage the index properly.</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">There are mainly two ways to access data: by index and by query. Depending on the situation, only one of them may be available. In most cases, however, the index (and multi-index) is the best choice. Consider the following example:</p>
      <pre>
&gt;&gt;&gt; %%time
&gt;&gt;&gt; df.query('country == "Albania" and year == 1987 and sex == "male" and age == "25-34 years"')
CPU times: user 7.27 ms, sys: 751 µs, total: 8.02 ms
# ==================
&gt;&gt;&gt; %%time
&gt;&gt;&gt; mi_df.loc['Albania', 1987, 'male', '25-34 years']
CPU times: user 459 µs, sys: 1 µs, total: 460 µs
      </pre>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">What? A 20x speed-up?</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">You will ask yourself how long it takes to create this multi-index:</p>
      <pre>
%%time
mi_df = df.set_index(['country', 'year', 'sex', 'age'])
CPU times: user 10.8 ms, sys: 2.2 ms, total: 13 ms
      </pre>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Building the index takes about 1.5 times as long as one query access. If you only need to retrieve data once (which is rarely the case), a query is the right method. Otherwise, stick with the index; your CPU will thank you.</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">.set_index(drop=False) keeps the columns used as the new index instead of dropping them.</p>
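      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The two access styles and the effect of drop=False can be compared on a small made-up DataFrame. This sketch only shows that both styles return the same data; it does not reproduce the timings above.</p>

```python
import pandas as pd

# Toy data with invented values.
toy = pd.DataFrame({
    "country": ["Albania", "Albania", "Poland", "Poland"],
    "year": [1987, 1988, 1987, 1988],
    "suicides_no": [21, 17, 30, 28],
})

# Access by query: scans the columns named in the expression.
by_query = toy.query('country == "Albania" and year == 1987')

# Access by index: build the MultiIndex once, then look rows up with .loc.
mi_toy = toy.set_index(["country", "year"])
by_index = mi_toy.loc[("Albania", 1987)]

# drop=False keeps 'country' and 'year' as regular columns as well.
kept = toy.set_index(["country", "year"], drop=False)

print(by_query)
print(by_index)
print(list(kept.columns))
```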
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The .loc[]/.iloc[] methods perform very well for reading a DataFrame, but not for modifying it. If you need to build one by hand (e.g., in a loop), consider another data structure (e.g., a dictionary or a list) and create the DataFrame once all the data is ready. Otherwise, for each new row in the DataFrame, Pandas updates the index, and that index is not a simple hash map.</p>
      <pre>
&gt;&gt;&gt; (pd.DataFrame({'a': range(2), 'b': range(2)}, index=['a', 'a'])
...  .loc['a'])
   a  b
a  0  0
a  1  1
      </pre>
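      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">A minimal sketch of the "collect first, build once" advice, with invented values: rows are gathered in a plain Python list and the DataFrame is created a single time at the end.</p>

```python
import pandas as pd

# Collect rows in a cheap Python structure inside the loop ...
rows = []
for year in range(1985, 1990):
    rows.append({"year": year, "suicides_no": (year - 1985) * 10})

# ... and create the DataFrame (and its index) only once.
built = pd.DataFrame(rows)
print(built)
```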
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Because of this, an unsorted index can reduce performance. To check whether an index is sorted and to sort it, there are mainly two methods:</p>
      <pre>
&gt;&gt;&gt; %%time
&gt;&gt;&gt; mi_df.sort_index()
CPU times: user 34.8 ms, sys: 1.63 ms, total: 36.5 ms
&gt;&gt;&gt; mi_df.index.is_monotonic_increasing  # spelled is_monotonic in the original; that alias was removed in pandas 2.0
True
      </pre>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">For more details, see:</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Pandas advanced indexing user guide: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html; the indexing code in the Pandas library: https://github.com/pandas-dev/pandas/blob/master/pandas/core/indexing.py.</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;">Method chaining</strong></p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Method chaining with DataFrames means chaining several methods that each return a DataFrame, and which are therefore all methods of the DataFrame class. In the current version of Pandas, the reason to use method chaining is to avoid storing intermediate variables and to avoid code like the following:</p>
      <pre>
import numpy as np
import pandas as pd

# Sample values: the original list contents were lost in extraction
df = pd.DataFrame({'a_column': [1, -999, -999],
                   'powerless_column': [2, 3, 4],
                   'int_column': [1, 1, -1]})
df['a_column'] = df['a_column'].replace(-999, np.nan)
df['power_column'] = df['powerless_column'] ** 2
df['real_column'] = df['int_column'].astype(np.float64)
df = df.apply(lambda _df: _df.replace(4, np.nan))
df = df.dropna(how='all')
      </pre>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Replace it with the following chain:</p>
      <pre>
# Same sample values as above
df = (pd.DataFrame({'a_column': [1, -999, -999],
                    'powerless_column': [2, 3, 4],
                    'int_column': [1, 1, -1]})
      .assign(a_column=lambda _df: _df['a_column'].replace(-999, np.nan))
      .assign(power_column=lambda _df: _df['powerless_column'] ** 2)
      .assign(real_column=lambda _df: _df['int_column'].astype(np.float64))
      .apply(lambda _df: _df.replace(4, np.nan))
      .dropna(how='all')
     )
      </pre>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Honestly, the second snippet is both nicer and more concise.</p>
      <div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/25a633d5ce1c4040b016365fa6f27f18~noop.image?_iz=58558&amp;from=article.pc_detail&amp;lk3s=953192f4&amp;x-expires=1728117804&amp;x-signature=CqSmfak8jOcCPcTdiYok4qZU1R0%3D" style="width: 50%; margin-bottom: 20px;"></div>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The method-chaining toolbox consists of different methods (e.g., apply, assign, loc, query, pipe, groupby, and agg) whose output is a DataFrame or Series object (or a DataFrameGroupBy).</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The best way to understand them is to use them. Let's start with a simple example:</p>
      <pre>
(df
 .groupby('age')
 .agg({'generation': 'unique'})
 .rename(columns={'generation': 'unique_generation'})
# Recommended from v0.25
# .agg(unique_generation=('generation', 'unique'))
)
      </pre>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">A simple chain to get the unique generation labels for each age range</p>
      <div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/4b23703a9d6e4676a2115a82990a55ea~noop.image?_iz=58558&amp;from=article.pc_detail&amp;lk3s=953192f4&amp;x-expires=1728117804&amp;x-signature=Q1uP%2B%2FwBuhzF9KF5tvuUxqRXS2c%3D" style="width: 50%; margin-bottom: 20px;"></div>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">In the resulting DataFrame, the "age" column is the index.</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Besides learning that "Generation X" spans three age groups, let's break down the chain. The first step groups the rows by age group. This returns a DataFrameGroupBy object, on which each group is then aggregated by selecting its unique generation labels.</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Here the aggregation method is "unique", but agg also accepts any (anonymous) function.</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Version 0.25 of Pandas introduced a new way to use agg:</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">https://dev.pandas.io/whatsnew/v0.25.0.html#groupby-aggregation-with-relabeling</p>
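      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The sketch below contrasts the dictionary-plus-rename style with the named aggregation available from v0.25 on a small made-up DataFrame. Both are expected to produce the same result; the data values are invented.</p>

```python
import pandas as pd

# Toy data with invented values.
toy = pd.DataFrame({
    "age": ["15-24 years", "15-24 years", "35-54 years", "35-54 years"],
    "suicides_per_100k": [1.5, 2.5, 4.0, 6.0],
})

# Older style: aggregate with a dict, then rename the output column.
old_style = (toy
             .groupby("age")
             .agg({"suicides_per_100k": "sum"})
             .rename(columns={"suicides_per_100k": "suicides_sum"}))

# Named aggregation (v0.25+): output_name=(input_column, function).
new_style = (toy
             .groupby("age")
             .agg(suicides_sum=("suicides_per_100k", "sum")))

print(new_style)
```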
      <pre>
(df
 .groupby(['country', 'year'])
 .agg({'suicides_per_100k': 'sum'})
 .rename(columns={'suicides_per_100k': 'suicides_sum'})
# Recommended from v0.25
# .agg(suicides_sum=('suicides_per_100k', 'sum'))
 .sort_values('suicides_sum', ascending=False)
 .head(10)
)
      </pre>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The ten country and year pairs with the highest suicide rates, using sort_values and head</p>
      <pre>
(df
 .groupby(['country', 'year'])
 .agg({'suicides_per_100k': 'sum'})
 .rename(columns={'suicides_per_100k': 'suicides_sum'})
# Recommended from v0.25
# .agg(suicides_sum=('suicides_per_100k', 'sum'))
 .nlargest(10, columns='suicides_sum')
)
      </pre>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The ten country and year pairs with the highest suicide rates, using nlargest</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">In both examples the output is the same: a DataFrame with a two-level MultiIndex (country and year) and a new column suicides_sum holding the 10 largest values in sorted order.</p>
      <div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/1496a2e11949409ab9487e1d664c50e5~noop.image?_iz=58558&amp;from=article.pc_detail&amp;lk3s=953192f4&amp;x-expires=1728117804&amp;x-signature=itrLmpaEd4P28bV%2BSb2c6F%2FdaV0%3D" style="width: 50%; margin-bottom: 20px;"></div>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The "country" and "year" columns are the index.</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">nlargest(10) is more efficient than sort_values(ascending=False).head(10).</p>
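      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">A quick check on invented values that the two forms select the same rows. The sketch avoids ties, where the ordering of equal values could differ between the two approaches.</p>

```python
import pandas as pd

# Toy data with invented, tie-free values.
toy = pd.DataFrame(
    {"suicides_sum": [5.0, 40.0, 12.5, 33.0, 8.0]},
    index=["A", "B", "C", "D", "E"],
)

top_sorted = toy.sort_values("suicides_sum", ascending=False).head(3)
top_nlargest = toy.nlargest(3, columns="suicides_sum")

print(top_nlargest)
```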
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Another interesting method is unstack (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.unstack.html), which pivots a level of the index into the columns.</p>
      <pre>
(mi_df
 .loc[('Switzerland', 2000)]
 .unstack('sex')
 [['suicides_no', 'population']]
)
      </pre>
      <div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/1eab98b127234dde8f5a1c612c5bf143~noop.image?_iz=58558&amp;from=article.pc_detail&amp;lk3s=953192f4&amp;x-expires=1728117804&amp;x-signature=k9nHZxGFRP09KbvSlHJa9K9mAT0%3D" style="width: 50%; margin-bottom: 20px;"></div>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">"age" is the index; the columns "suicides_no" and "population" each have a second column level, "sex".</p>
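      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">To see what unstack does to the shape of the data, here is a small sketch on an invented two-level index (age, sex). The inner level moves from the rows into a second column level.</p>

```python
import pandas as pd

# Toy data with invented values, indexed by (age, sex).
toy = pd.DataFrame({
    "age": ["15-24 years", "15-24 years", "25-34 years", "25-34 years"],
    "sex": ["female", "male", "female", "male"],
    "suicides_no": [3, 9, 4, 11],
}).set_index(["age", "sex"])

# 'sex' leaves the row index and becomes the second level of the columns.
wide = toy.unstack("sex")

print(wide)
print(list(wide.columns))
```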
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The next method, pipe, is one of the most versatile. It lets you pipe operations (as in a shell script) and thereby perform more operations within a chain.</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">A simple but powerful use of pipe is to log different pieces of information.</p>
      <pre>
def log_head(df, head_count=10):
    print(df.head(head_count))
    return df


def log_columns(df):
    print(df.columns)
    return df


def log_shape(df):
    print(f'shape = {df.shape}')
    return df
      </pre>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Different logging functions to use with pipe.</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">For example, suppose we want to verify that the country_year column is consistent with the year column:</p>
      <pre>
import re

(df
 .assign(valid_cy=lambda _serie: _serie.apply(
     # Split 'country_year' just before the four-digit year and compare
     lambda _row: re.split(r'(?=\d{4})', _row['country_year'])[1] == str(_row['year']),
     axis=1))
 .query('valid_cy == False')
 .pipe(log_shape)
)
      </pre>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">A pipeline to validate the year in the "country_year" column.</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The output of the pipeline is a DataFrame, but it also prints to standard output (console/REPL):</p>
      <pre>
shape = (0, 13)
      </pre>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">You can also use several pipes in one chain.</p>
      <pre>
(df
 .pipe(log_shape)
 .query('sex == "female"')
 .groupby(['year', 'country'])
 .agg({'suicides_per_100k': 'sum'})
 .pipe(log_shape)
 .rename(columns={'suicides_per_100k': 'sum_suicides_per_100k_female'})
# Recommended from v0.25
# .agg(sum_suicides_per_100k_female=('suicides_per_100k', 'sum'))
 .nlargest(n=10, columns=['sum_suicides_per_100k_female'])
)
      </pre>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Countries and years with the highest female suicide numbers.</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The resulting DataFrame looks like this:</p>
      <div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/823f58c1edfa4fe88fbedc378deffbc0~noop.image?_iz=58558&amp;from=article.pc_detail&amp;lk3s=953192f4&amp;x-expires=1728117804&amp;x-signature=qPEFJs3IzxQPrpyg1R4DC5Z6fu4%3D" style="width: 50%; margin-bottom: 20px;"></div>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The index is "year" and "country".</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The standard output shows:</p>
      <pre>
shape = (27820, 12)
shape = (2321, 1)
      </pre>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Besides logging to the console, pipe can also apply functions directly to the columns of a DataFrame.</p>
      <pre>
from sklearn.preprocessing import MinMaxScaler


def norm_df(df, columns):
    # fit_transform returns an (n, 1) array; ravel() flattens it to one column
    return df.assign(**{col: MinMaxScaler().fit_transform(df[[col]].values.astype(float)).ravel()
                        for col in columns})


for sex in ['male', 'female']:
    print(sex)
    print(
        df
        .query(f'sex == "{sex}"')
        .groupby(['country'])
        .agg({'suicides_per_100k': 'sum', 'gdp_year': 'mean'})
        .rename(columns={'suicides_per_100k': 'suicides_per_100k_sum',
                         'gdp_year': 'gdp_year_mean'})
        # Recommended in v0.25
        # .agg(suicides_per_100k_sum=('suicides_per_100k', 'sum'),
        #      gdp_year_mean=('gdp_year', 'mean'))
        .pipe(norm_df, columns=['suicides_per_100k_sum', 'gdp_year_mean'])
        .corr(method='spearman')
    )
    print('\n')
      </pre>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Is the number of suicides related to a drop in GDP? Is it related to sex?</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The code above prints the following to the console:</p>
      <pre>
male
                       suicides_per_100k_sum  gdp_year_mean
suicides_per_100k_sum               1.000000       0.421218
gdp_year_mean                       0.421218       1.000000

female
                       suicides_per_100k_sum  gdp_year_mean
suicides_per_100k_sum               1.000000       0.452343
gdp_year_mean                       0.452343       1.000000
      </pre>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Let's look more closely at the code. norm_df() takes as input a DataFrame and a list of columns to scale with min-max scaling. Using a dictionary comprehension, it creates a dictionary {column_name: method, …}, which is then unpacked as the arguments of assign() (column_name=method, …).</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">In this particular case, min-max scaling does not change the output of the correlation (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html), because Spearman correlation depends only on ranks; it is used here purely for the sake of the example.</p>
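      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">That the scaling leaves the Spearman correlation unchanged can be verified on a few invented values. In this sketch, a plain-pandas min_max helper (a hypothetical stand-in for sklearn's MinMaxScaler) is applied through pipe.</p>

```python
import pandas as pd

# Toy data with invented values; the column names mirror the example above.
toy = pd.DataFrame({
    "suicides_per_100k_sum": [10.0, 25.0, 7.0, 40.0, 18.0],
    "gdp_year_mean": [2.0e9, 9.0e9, 4.0e9, 6.0e9, 3.0e9],
})


def min_max(df: pd.DataFrame) -> pd.DataFrame:
    """Scale every column to [0, 1] (hypothetical helper, not sklearn)."""
    return (df - df.min()) / (df.max() - df.min())


raw_corr = toy.corr(method="spearman")
scaled_corr = toy.pipe(min_max).corr(method="spearman")

# Min-max scaling is strictly increasing, so the ranks, and with them
# the Spearman coefficients, stay the same.
print(raw_corr)
print(scaled_corr)
```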

      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">In the (distant?) future, lazy evaluation may arrive in method chains, so investing in chaining now may be a good idea.</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;">Final (random) tips</strong></p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The following tips are very useful but did not fit into any of the previous sections:</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">itertuples() iterates over the rows of a DataFrame far more efficiently than iterrows();</p>
      <pre>
&gt;&gt;&gt; %%time
&gt;&gt;&gt; for row in df.iterrows(): continue
CPU times: user 1.97 s, sys: 17.3 ms, total: 1.99 s
&gt;&gt;&gt; for tup in df.itertuples(): continue
CPU times: user 55.9 ms, sys: 2.85 ms, total: 58.8 ms
      </pre>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Note: tup is a namedtuple</p>
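      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Because each item from itertuples() is a namedtuple, fields are read as attributes, while iterrows() yields (index, Series) pairs. A small sketch on invented data:</p>

```python
import pandas as pd

toy = pd.DataFrame({"country": ["Albania", "Poland"], "year": [1987, 1988]})

# itertuples(): one namedtuple per row; the index is exposed as .Index.
from_tuples = [(tup.Index, tup.country, tup.year) for tup in toy.itertuples()]

# iterrows(): one (index, Series) pair per row, which is slower because
# a Series object has to be built for every row.
from_rows = [(idx, row["country"], row["year"]) for idx, row in toy.iterrows()]

print(from_tuples)
```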
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">join() uses merge() under the hood; in a Jupyter notebook, writing %%time at the start of a code cell measures its run time; the UInt8 class (https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html#support-for-integer-na) supports missing (NaN) values alongside integers;</p>
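      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The nullable integer tip, sketched on invented values: without an explicit dtype a missing value forces the column to float64, while the "UInt8" extension dtype keeps it an integer column.</p>

```python
import pandas as pd

# Default inference: the missing value turns the column into float64.
plain = pd.Series([1, 2, None])

# Nullable extension dtype: integers are kept and the gap is marked missing.
nullable = pd.Series([1, 2, None], dtype="UInt8")

print(plain.dtype)
print(nullable.dtype)
print(nullable.isna().sum())
```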
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Remember that any intensive I/O (for example, unrolling a large CSV dump) performs better with low-level methods (use Python's core functions as much as possible).</p>
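      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">One possible reading of "low-level methods" is to stream the file with Python's built-in csv module and aggregate row by row, instead of loading everything into a DataFrame first. The sketch below assumes a hypothetical two-column CSV (country, suicides_no) and uses an in-memory string in place of a real file; whether it is actually faster depends on the workload, so measure before committing to it.</p>

```python
import csv
import io

# Stand-in for a large file on disk; in practice this would be open(path, newline="").
raw = io.StringIO("country,suicides_no\nAlbania,21\nAustria,15\nAlbania,4\n")

totals = {}
for row in csv.DictReader(raw):
    # Each row is a plain dict of strings, so only one row is held in memory at a time.
    country = row["country"]
    totals[country] = totals.get(country, 0) + int(row["suicides_no"])

print(totals)
```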
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">There are also some useful methods and data structures that this article does not cover, all of which are well worth the time to understand:</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Pivot tables:</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot.html</p>

      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Time series and date functionality:</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html</p>

      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Plotting:</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html</p>

      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;">Summary</strong></p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">I hope this short article has given you a better understanding of how Pandas works under the hood and of where the library currently stands. It also showed different tools for optimizing DataFrame memory and for analyzing data quickly. Hopefully the concepts of indexing and lookups are clearer to you now. Finally, you can try writing longer chains with method chaining.</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Here are some notebooks:</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">https://github.com/unit8co/medium-pandas-wan</p>

      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">In addition to all the code from this article, they include timing measurements comparing the performance of the simple-index DataFrame (df) and the multi-index DataFrame (mi_df).</p>
      <div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/f897edcda6a14683bbe18a6fa19e1c4e~noop.image?_iz=58558&amp;from=article.pc_detail&amp;lk3s=953192f4&amp;x-expires=1728117804&amp;x-signature=cpZTcA%2Fa71JUXmyuizIEiybi0Ck%3D" style="width: 50%; margin-bottom: 20px;"></div>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Practice makes perfect, so keep honing your skills and help us build a better world.</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">PS: Sometimes pure NumPy is faster.</p>
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Original article: https://medium.com/unit8-machine-learning-publication/from-pandas-wan-to-pandas-master-4860cf0ce442</p>
    </div>



