[00:41:16] serviceops, Operations: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2250.codfw.wmnet'] ` and were **ALL** successful.
[00:48:29] serviceops, Operations: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2252.codfw.wmnet'] ` and were **ALL** successful.
[00:57:30] serviceops, Operations: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2254.codfw.wmnet'] ` and were **ALL** successful.
[01:14:58] serviceops, Operations: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2253.codfw.wmnet'] ` and were **ALL** successful.
[02:39:45] serviceops, Operations: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (Dzahn)
[05:22:26] serviceops, Product-Infrastructure-Team-Backlog, Core Platform Team Workboards (Clinic Duty Team): PCS internal request rates tripled on 2019-11-19 - https://phabricator.wikimedia.org/T238832 (Mholloway) The sudden increase in internal request traffic corresponds precisely with the deployment of http...
[05:25:20] serviceops, Page Content Service, Product-Infrastructure-Team-Backlog (Kanban): Mobileapps flapping on scb2005 since 2019-11-26 0:00 UTC - https://phabricator.wikimedia.org/T239344 (Mholloway) Upon further investigation, i believe this is being caused by {T238832}.
[08:23:16] serviceops, Analytics-Kanban, Better Use Of Data, Event-Platform, and 8 others: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 (akosiaris) >>! In T236386#5706868, @Ottomata wrote: > @akosiaris I merged and applied https://gerrit.wikimedia.org/r/c/opera...
[08:42:47] _joe_: akosiaris cdanis (and maybe others) FYI https://landing.google.com/sre/resources/practicesandprocesses/art-of-slos/
[08:43:58] oh, nice!
[08:44:00] thanks!
[08:44:37] <_joe_> that's the kind of workshop we should give to product people indeed :P
[08:49:13] serviceops, Gerrit, Release-Engineering-Team (Development services): Rename operations/debs/poolcounter-prometheus-exporter to match other Prometheus repositories - https://phabricator.wikimedia.org/T239688 (hashar)
[08:53:08] serviceops, Gerrit, Release-Engineering-Team (Development services): Rename operations/debs/poolcounter-prometheus-exporter to match other Prometheus repositories - https://phabricator.wikimedia.org/T239688 (Joe) Open→Declined Except that's the name of the software we're packaging https://gi...
[08:57:49] "In Google's case, we came to the conclusion that if we aimed to be slightly more reliable than the top consumer ISPs, our users would be substantially more likely to attribute random errors to failures at their ISP. " ..whether or not the isp was actually to blame, I guess :-/
[08:57:55] (from the slo slide deck)
[10:22:01] serviceops, Gerrit, Release-Engineering-Team (Development services): Rename operations/debs/poolcounter-prometheus-exporter to match other Prometheus repositories - https://phabricator.wikimedia.org/T239688 (hashar) Declined→Open I don't see why we could not use a different name for the Debi...
[10:22:42] serviceops, Gerrit, Release-Engineering-Team (Development services): Rename operations/debs/poolcounter-prometheus-exporter to match other Prometheus repositories - https://phabricator.wikimedia.org/T239688 (hashar)
[14:24:24] serviceops, Analytics-Kanban, Better Use Of Data, Event-Platform, and 8 others: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 (Ottomata)
[16:34:27] serviceops, Analytics-Kanban, Better Use Of Data, Event-Platform, and 8 others: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 (Ottomata) Sigh, it turns out SRE wants us to not rewrite paths. So if we use path based routing, the app needs to handle wh...
[16:42:55] serviceops, Analytics-Kanban, Better Use Of Data, Event-Platform, and 8 others: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 (Ottomata) How about: - events-logging.wikimedia.org/v1/events - events-analytics.wikimedia.org/v1/events ? Is events-logg...
[18:23:34] serviceops, Page Content Service, Product-Infrastructure-Team-Backlog: Mobileapps flapping on scb2005 since 2019-11-26 0:00 UTC - https://phabricator.wikimedia.org/T239344 (Mholloway)
[18:24:39] serviceops, Page Content Service, Product-Infrastructure-Team-Backlog: Mobileapps flapping since 2019-11-26 0:00 UTC - https://phabricator.wikimedia.org/T239344 (Mholloway)
[19:10:19] serviceops, Operations: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2256.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201912031909_dzahn_88754_mw225...
[19:11:23] serviceops, Operations: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2257.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201912031911_dzahn_89118_mw225...
[19:14:09] serviceops, Operations: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2258.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201912031913_dzahn_89640_mw225...
[19:14:49] serviceops, Operations: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2259.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201912031914_dzahn_89825_mw225...
[20:23:32] serviceops, Operations: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2256.codfw.wmnet'] ` and were **ALL** successful.
[20:24:02] serviceops, Operations: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2257.codfw.wmnet'] ` and were **ALL** successful.
[20:26:45] serviceops, Operations: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2258.codfw.wmnet'] ` and were **ALL** successful.
[20:47:41] serviceops, Operations: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2259.codfw.wmnet'] ` Of which those **FAILED**: ` ['mw2259.codfw.wmnet'] `
[21:19:14] serviceops, Continuous-Integration-Infrastructure, Developer-Wishlist (2017), Patch-For-Review, and 3 others: Relocate CI generated docs and coverage reports - https://phabricator.wikimedia.org/T137890 (Krinkle)
[21:35:34] serviceops, DC-Ops, Operations, ops-codfw: mw2259 and mgmt down - https://phabricator.wikimedia.org/T239758 (Dzahn)
[21:36:16] serviceops, DC-Ops, Operations, ops-codfw: mw2259 and mgmt down - https://phabricator.wikimedia.org/T239758 (Dzahn) @Papaul Could you take a look at this onsite?
[21:36:37] serviceops, DC-Ops, Operations, ops-codfw: mw2259 and mgmt down - https://phabricator.wikimedia.org/T239758 (Dzahn) p:Triage→Normal
[21:58:27] serviceops, DC-Ops, Operations, ops-codfw: mw2259 and mgmt down - https://phabricator.wikimedia.org/T239758 (Dzahn)
[21:59:57] serviceops, DC-Ops, Operations, ops-codfw: mw2259 and mgmt down - https://phabricator.wikimedia.org/T239758 (Dzahn) Also: I noticed in Icinga there is no "mw2259.mgmt" (while for example mw2247.mgmt exists). It's simply not there: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_str...
[22:02:08] serviceops, Operations: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (Dzahn)
[22:17:43] serviceops, DC-Ops, Operations, ops-codfw: mw2259 down and mgmt does not exist? - https://phabricator.wikimedia.org/T239758 (Dzahn)
[23:42:45] serviceops, Operations, WMF-Legal, Patch-For-Review: Move old transparency report pages to historical URLs and setup redirect - https://phabricator.wikimedia.org/T230638 (JbuattiWMF) Hi @Aklapper, just to confirm what Prateek said, we would need help with both but #1 is more immediate because we...
[23:44:47] serviceops, Release-Engineering-Team, Patch-For-Review: switch prod Phabricator from phab1003 to phab1001 - https://phabricator.wikimedia.org/T238956 (Dzahn)
[23:45:15] serviceops, Release-Engineering-Team, Patch-For-Review: switch prod Phabricator from phab1003 to phab1001 - https://phabricator.wikimedia.org/T238956 (Dzahn) a:Dzahn This is happening today, starting in about 15 minutes.