[15:29:37] serviceops, Continuous-Integration-Infrastructure (shipyard), Release-Engineering-Team (Kanban): Remove graphite data for nodepool - https://phabricator.wikimedia.org/T215172 (hashar) p:Triage→Normal
[15:36:19] serviceops, Continuous-Integration-Infrastructure (shipyard): Remove graphite data for nodepool - https://phabricator.wikimedia.org/T215172 (hashar)
[16:33:24] serviceops, Continuous-Integration-Infrastructure (shipyard): Remove graphite data for nodepool - https://phabricator.wikimedia.org/T215172 (greg) Doesn't the data just fall out after a while?
[16:55:06] serviceops, Operations, ops-codfw, User-jijiki: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214813 (Papaul) a:Papaul→RobH Can you please update this disk with which disk failed? Thanks
[17:05:14] serviceops, Operations, ops-codfw, User-jijiki: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214813 (RobH) a:RobH→Papaul Ok, here are the full commands (so you can also run in future as needed): ` robh@thumbor2002:~$ cat /proc/mdstat Personalities : [raid1] md2 :...
[17:16:02] serviceops, Operations, ops-codfw, User-jijiki: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214813 (RobH) In checking dc spares tracking, it shows 11 500GB SATA disks in codfw spare hardware. If this isn't right, please update task and update the tracking sheet. Thanks!
[17:48:23] serviceops, Operations, ops-codfw, User-jijiki: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214813 (Papaul) Disk with serial number WMAYP0E607DT has been replaced. Server can not find boot device. Server can not boot to OS after disk replacement.
[17:57:49] serviceops, DBA, Phabricator, Release-Engineering-Team, and 2 others: Improve privilege separation for phabricator's config files and mysql credentials - https://phabricator.wikimedia.org/T146055 (Dzahn)
[18:19:03] serviceops, Operations, ops-codfw, User-jijiki: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214813 (Papaul) I put back the bad disk and boot the system and the system boot into OS with no problem. it looks like what @jcrespo and other mentioned on IRC the grub is installed o...
[18:28:43] serviceops, Operations, ops-codfw, User-jijiki: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214813 (jijiki) Ack, I will do it tomorrow, thank you @Papaul !
[19:20:19] serviceops, Gerrit, Icinga, Operations, and 2 others: improve Gerrit monitoring (was: Investigate why icinga did not report high cpu/load for gerrit) - https://phabricator.wikimedia.org/T215033 (Dzahn)
[19:55:13] serviceops, Gerrit, Icinga, Operations, and 2 others: improve Gerrit monitoring (was: Investigate why icinga did not report high cpu/load for gerrit) - https://phabricator.wikimedia.org/T215033 (Paladox) So we just need a http check that checks the website (without checking if the ssl cert is val...
[19:57:43] serviceops, Gerrit, Icinga, Operations, and 2 others: improve Gerrit monitoring (was: Investigate why icinga did not report high cpu/load for gerrit) - https://phabricator.wikimedia.org/T215033 (CDanis) >>! In T215033#4925462, @Paladox wrote: > So we just need a http check that checks the website...
[20:10:34] serviceops, Gerrit, Icinga, Operations, and 2 others: improve Gerrit monitoring (was: Investigate why icinga did not report high cpu/load for gerrit) - https://phabricator.wikimedia.org/T215033 (Dzahn) Indeed, i would say merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/487901 should...
[20:12:00] serviceops, Gerrit, Icinga, Operations, and 2 others: improve Gerrit monitoring (was: Investigate why icinga did not report high cpu/load for gerrit) - https://phabricator.wikimedia.org/T215033 (Dzahn) >>! In T215033#4925518, @Dzahn wrote: > healthcheck plugin you mentioned. Maybe in a separate t...
[20:12:15] serviceops, Gerrit, Icinga, Operations, and 2 others: improve Gerrit monitoring (was: Investigate why icinga did not report high cpu/load for gerrit) - https://phabricator.wikimedia.org/T215033 (Paladox) Already have T214326 for the health check plugin :)