README.rst 9.6 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236
  1. ==================================
  2. A Guide to Masked Arrays in NumPy
  3. ==================================
  4. .. Contents::
  5. See http://www.scipy.org/scipy/numpy/wiki/MaskedArray (dead link)
  6. for updates of this document.
  7. History
  8. -------
  9. As a regular user of MaskedArray, I (Pierre G.F. Gerard-Marchant) became
  10. increasingly frustrated with the subclassing of masked arrays (even if
  11. I can only blame my inexperience). I needed to develop a class of arrays
  12. that could store some additional information along with numerical values,
  13. while keeping the possibility for missing data (picture storing a series
  14. of dates along with measurements, what would later become the `TimeSeries
  15. Scikit <http://projects.scipy.org/scipy/scikits/wiki/TimeSeries>`__
  16. (dead link).
  17. I started to implement such a class, but then quickly realized that
  18. any additional information disappeared when processing these subarrays
  19. (for example, adding a constant value to a subarray would erase its
  20. dates). I ended up writing the equivalent of *numpy.core.ma* for my
  21. particular class, ufuncs included. Everything went fine until I needed to
  22. subclass my new class, when more problems showed up: some attributes of
  23. the new subclass were lost during processing. I identified the culprit as
  24. MaskedArray, which returns masked ndarrays when I expected masked
  25. arrays of my class. I was preparing myself to rewrite *numpy.core.ma*
  26. when I forced myself to learn how to subclass ndarrays. As I became more
  27. familiar with the *__new__* and *__array_finalize__* methods,
  28. I started to wonder why masked arrays were objects, and not ndarrays,
  29. and whether it wouldn't be more convenient for subclassing if they did
  30. behave like regular ndarrays.
  31. The new *maskedarray* is what I eventually come up with. The
  32. main differences with the initial *numpy.core.ma* package are
  33. that MaskedArray is now a subclass of *ndarray* and that the
  34. *_data* section can now be any subclass of *ndarray*. Apart from a
  35. couple of issues listed below, the behavior of the new MaskedArray
  36. class reproduces the old one. Initially the *maskedarray*
  37. implementation was marginally slower than *numpy.ma* in some areas,
  38. but work is underway to speed it up; the expectation is that it can be
  39. made substantially faster than the present *numpy.ma*.
  40. Note that if the subclass has some special methods and
  41. attributes, they are not propagated to the masked version:
  42. this would require a modification of the *__getattribute__*
  43. method (first trying *ndarray.__getattribute__*, then trying
  44. *self._data.__getattribute__* if an exception is raised in the first
  45. place), which really slows things down.
  46. Main differences
  47. ----------------
  48. * The *_data* part of the masked array can be any subclass of ndarray (but not recarray, cf below).
  49. * *fill_value* is now a property, not a function.
  50. * in the majority of cases, the mask is forced to *nomask* when no value is actually masked. A notable exception is when a masked array (with no masked values) has just been unpickled.
  51. * I got rid of the *share_mask* flag, I never understood its purpose.
  52. * *put*, *putmask* and *take* now mimic the ndarray methods, to avoid unpleasant surprises. Moreover, *put* and *putmask* both update the mask when needed. * if *a* is a masked array, *bool(a)* raises a *ValueError*, as it does with ndarrays.
  53. * in the same way, the comparison of two masked arrays is a masked array, not a boolean
  54. * *filled(a)* returns an array of the same subclass as *a._data*, and no test is performed on whether it is contiguous or not.
  55. * the mask is always printed, even if it's *nomask*, which makes things easy (for me at least) to remember that a masked array is used.
  56. * *cumsum* works as if the *_data* array was filled with 0. The mask is preserved, but not updated.
  57. * *cumprod* works as if the *_data* array was filled with 1. The mask is preserved, but not updated.
  58. New features
  59. ------------
  60. This list is non-exhaustive...
  61. * the *mr_* function mimics *r_* for masked arrays.
  62. * the *anom* method returns the anomalies (deviations from the average)
  63. Using the new package with numpy.core.ma
  64. ----------------------------------------
  65. I tried to make sure that the new package can understand old masked
  66. arrays. Unfortunately, there's no upward compatibility.
  67. For example:
  68. >>> import numpy.core.ma as old_ma
  69. >>> import maskedarray as new_ma
  70. >>> x = old_ma.array([1,2,3,4,5], mask=[0,0,1,0,0])
  71. >>> x
  72. array(data =
  73. [ 1 2 999999 4 5],
  74. mask =
  75. [False False True False False],
  76. fill_value=999999)
  77. >>> y = new_ma.array([1,2,3,4,5], mask=[0,0,1,0,0])
  78. >>> y
  79. array(data = [1 2 -- 4 5],
  80. mask = [False False True False False],
  81. fill_value=999999)
  82. >>> x==y
  83. array(data =
  84. [True True True True True],
  85. mask =
  86. [False False True False False],
  87. fill_value=?)
  88. >>> old_ma.getmask(x) == new_ma.getmask(x)
  89. array([True, True, True, True, True])
  90. >>> old_ma.getmask(y) == new_ma.getmask(y)
  91. array([True, True, False, True, True])
  92. >>> old_ma.getmask(y)
  93. False
  94. Using maskedarray with matplotlib
  95. ---------------------------------
  96. Starting with matplotlib 0.91.2, the masked array importing will work with
  97. the maskedarray branch) as well as with earlier versions.
  98. By default matplotlib still uses numpy.ma, but there is an rcParams setting
  99. that you can use to select maskedarray instead. In the matplotlibrc file
  100. you will find::
  101. #maskedarray : False # True to use external maskedarray module
  102. # instead of numpy.ma; this is a temporary #
  103. setting for testing maskedarray.
  104. Uncomment and set to True to select maskedarray everywhere.
  105. Alternatively, you can test a script with maskedarray by using a
  106. command-line option, e.g.::
  107. python simple_plot.py --maskedarray
  108. Masked records
  109. --------------
  110. Like *numpy.core.ma*, the *ndarray*-based implementation
  111. of MaskedArray is limited when working with records: you can
  112. mask any record of the array, but not a field in a record. If you
  113. need this feature, you may want to give the *mrecords* package
  114. a try (available in the *maskedarray* directory in the scipy
  115. sandbox). This module defines a new class, *MaskedRecord*. An
  116. instance of this class accepts a *recarray* as data, and uses two
  117. masks: the *fieldmask* has as many entries as records in the array,
  118. each entry with the same fields as a record, but of boolean types:
  119. they indicate whether the field is masked or not; a record entry
  120. is flagged as masked in the *mask* array if all the fields are
  121. masked. A few examples in the file should give you an idea of what
  122. can be done. Note that *mrecords* is still experimental...
  123. Optimizing maskedarray
  124. ----------------------
  125. Should masked arrays be filled before processing or not?
  126. --------------------------------------------------------
  127. In the current implementation, most operations on masked arrays involve
  128. the following steps:
  129. * the input arrays are filled
  130. * the operation is performed on the filled arrays
  131. * the mask is set for the results, from the combination of the input masks and the mask corresponding to the domain of the operation.
  132. For example, consider the division of two masked arrays::
  133. import numpy
  134. import maskedarray as ma
  135. x = ma.array([1,2,3,4],mask=[1,0,0,0], dtype=numpy.float_)
  136. y = ma.array([-1,0,1,2], mask=[0,0,0,1], dtype=numpy.float_)
  137. The division of x by y is then computed as::
  138. d1 = x.filled(0) # d1 = array([0., 2., 3., 4.])
  139. d2 = y.filled(1) # array([-1., 0., 1., 1.])
  140. m = ma.mask_or(ma.getmask(x), ma.getmask(y)) # m =
  141. array([True,False,False,True])
  142. dm = ma.divide.domain(d1,d2) # array([False, True, False, False])
  143. result = (d1/d2).view(MaskedArray) # masked_array([-0. inf, 3., 4.])
  144. result._mask = logical_or(m, dm)
  145. Note that a division by zero takes place. To avoid it, we can consider
  146. to fill the input arrays, taking the domain mask into account, so that::
  147. d1 = x._data.copy() # d1 = array([1., 2., 3., 4.])
  148. d2 = y._data.copy() # array([-1., 0., 1., 2.])
  149. dm = ma.divide.domain(d1,d2) # array([False, True, False, False])
  150. numpy.putmask(d2, dm, 1) # d2 = array([-1., 1., 1., 2.])
  151. m = ma.mask_or(ma.getmask(x), ma.getmask(y)) # m =
  152. array([True,False,False,True])
  153. result = (d1/d2).view(MaskedArray) # masked_array([-1. 0., 3., 2.])
  154. result._mask = logical_or(m, dm)
  155. Note that the *.copy()* is required to avoid updating the inputs with
  156. *putmask*. The *.filled()* method also involves a *.copy()*.
  157. A third possibility consists in avoid filling the arrays::
  158. d1 = x._data # d1 = array([1., 2., 3., 4.])
  159. d2 = y._data # array([-1., 0., 1., 2.])
  160. dm = ma.divide.domain(d1,d2) # array([False, True, False, False])
  161. m = ma.mask_or(ma.getmask(x), ma.getmask(y)) # m =
  162. array([True,False,False,True])
  163. result = (d1/d2).view(MaskedArray) # masked_array([-1. inf, 3., 2.])
  164. result._mask = logical_or(m, dm)
  165. Note that here again the division by zero takes place.
  166. A quick benchmark gives the following results:
  167. * *numpy.ma.divide* : 2.69 ms per loop
  168. * classical division : 2.21 ms per loop
  169. * division w/ prefilling : 2.34 ms per loop
  170. * division w/o filling : 1.55 ms per loop
  171. So, is it worth filling the arrays beforehand ? Yes, if we are interested
  172. in avoiding floating-point exceptions that may fill the result with infs
  173. and nans. No, if we are only interested into speed...
  174. Thanks
  175. ------
  176. I'd like to thank Paul Dubois, Travis Oliphant and Sasha for the
  177. original masked array package: without you, I would never have started
  178. that (it might be argued that I shouldn't have anyway, but that's
  179. another story...). I also wish to extend these thanks to Reggie Dugard
  180. and Eric Firing for their suggestions and numerous improvements.
  181. Revision notes
  182. --------------
  183. * 08/25/2007 : Creation of this page
  184. * 01/23/2007 : The package has been moved to the SciPy sandbox, and is regularly updated: please check out your SVN version!