[00:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do Evening SWAT (Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181121T0000). [00:00:05] stephanebisson, kostajh, Zoranzoki21, and MaxSem: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:05] Deploy window No deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181121T0000) [00:00:12] Hi, [00:00:16] Hii [00:00:16] I can SWAT [00:00:25] hello [00:00:28] here [00:00:37] Night time :D [00:00:49] Operations, monitoring, Performance-Team (Radar), Release-Engineering-Team (Watching / External), goodfirstbug: Increase "check_legal_html" coverage to group0 wikis - https://phabricator.wikimedia.org/T208284 (greg) [00:01:49] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team (Backlog): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (greg) [00:03:13] Operations, Parsoid: parsoid-rt repeated failures on ruthenium (parsoid::testing) - https://phabricator.wikimedia.org/T209758 (Arlolra) > `TypeError: The header content contains invalid characters` I did the expedient thing and just made the redirect url safe and then restarted the server. Can someone...
[00:05:03] (PS6) Sbisson: Enable RCPatrol and add some rights on srwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/472744 (https://phabricator.wikimedia.org/T209250) (owner: Zoranzoki21) [00:05:14] (PS1) Dzahn: racktables: fix path to php.ini on stretch [puppet] - https://gerrit.wikimedia.org/r/475024 (https://phabricator.wikimedia.org/T210008) [00:06:25] (CR) Dzahn: [C: 2] racktables: fix path to php.ini on stretch [puppet] - https://gerrit.wikimedia.org/r/475024 (https://phabricator.wikimedia.org/T210008) (owner: Dzahn) [00:06:29] Operations, Release Pipeline, Core Platform Team Backlog (Watching / External), Release-Engineering-Team (Watching / External), Services (watching): Revisit the logging work done on Q1 2017-2018 for the standard pod setup - https://phabricator.wikimedia.org/T207200 (greg) [00:07:58] !log sbisson@deploy1001 Synchronized php-1.33.0-wmf.4/extensions/GrowthExperiments/includes/Specials/SpecialWelcomeSurvey.php: gerrit:474946 WelcomeSurvey: indicate that the special page does write (duration: 00m 47s) [00:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:08:32] Operations, Release Pipeline, Core Platform Team Backlog (Watching / External), Release-Engineering-Team (Watching / External), Services (watching): Track and install additional npm packages for all service container images - https://phabricator.wikimedia.org/T205911 (greg) [00:12:17] Everything is ok with SWAT? [00:13:27] Zoranzoki21: yes, your patches are next [00:13:43] stephanebisson: ok [00:13:44] Zoranzoki21: Are all your patches testable?
[00:14:04] stephanebisson: yes [00:14:04] (PS1) Alex Monk: deployment-prep: Try changing redis_lock entries to memc hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/475025 (https://phabricator.wikimedia.org/T210030) [00:14:10] (CR) Sbisson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/472744 (https://phabricator.wikimedia.org/T209250) (owner: Zoranzoki21) [00:15:12] (Merged) jenkins-bot: Enable RCPatrol and add some rights on srwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/472744 (https://phabricator.wikimedia.org/T209250) (owner: Zoranzoki21) [00:16:23] stephanebisson: mwdebug1002? [00:16:59] Operations, Release-Engineering-Team (Backlog): Keyholder phab repo duplicate work - https://phabricator.wikimedia.org/T203003 (greg) next steps here? [00:17:05] Operations, Icinga, monitoring: Icinga passive checks go awal and downtime stops working - https://phabricator.wikimedia.org/T196336 (Dzahn) [00:18:00] Zoranzoki21: yes, pardon me I'm dealing with other patches at the same time [00:18:28] stephanebisson: Can I test now? [00:20:03] Operations: build grafana package for stretch - https://phabricator.wikimedia.org/T210034 (Dzahn) [00:20:14] Zoranzoki21: yes, please test on mwdebug1002 [00:20:23] Operations: build grafana package for stretch - https://phabricator.wikimedia.org/T210034 (Dzahn) p:Triage>Normal [00:20:41] Operations: build grafana package for stretch - https://phabricator.wikimedia.org/T210034 (Dzahn) [00:20:43] stephanebisson: #workingon [00:20:44] Operations, Patch-For-Review: upgrade krypton (webserver_misc_apps) to stretch - https://phabricator.wikimedia.org/T210008 (Dzahn) [00:21:43] stephanebisson: ok is [00:21:55] Zoranzoki21: Also, when you have time, 474967 doesn't rebase cleanly. Can you rebase it manually? [00:22:44] stephanebisson: ok. Will now [00:23:06] Zoranzoki21: 472744 is ok, I can deploy it? [00:23:13] stephanebisson: ok.
Yes [00:23:31] for 474967 I have to create new patch again [00:24:38] (CR) Zoranzoki21: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/474967 (https://phabricator.wikimedia.org/T209251) (owner: Zoranzoki21) [00:24:48] (CR) jerkins-bot: [V: -1] Disable FlaggedRevs, enable RC patrol and add rights on srwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/474967 (https://phabricator.wikimedia.org/T209251) (owner: Zoranzoki21) [00:24:50] (CR) jenkins-bot: Enable RCPatrol and add some rights on srwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/472744 (https://phabricator.wikimedia.org/T209250) (owner: Zoranzoki21) [00:24:56] !log sbisson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: gerrit:472744 Enable RCPatrol and add some rights on srwikibooks (duration: 00m 46s) [00:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:25:27] (PS2) Zoranzoki21: Enable suppressredirect on srwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/474976 (https://phabricator.wikimedia.org/T210000) [00:25:38] stephanebisson: ok. Now https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/474976/ [00:26:35] (CR) Sbisson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/474976 (https://phabricator.wikimedia.org/T210000) (owner: Zoranzoki21) [00:28:24] (Merged) jenkins-bot: Enable suppressredirect on srwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/474976 (https://phabricator.wikimedia.org/T210000) (owner: Zoranzoki21) [00:29:19] stephanebisson: mwdebug1002? [00:29:57] Zoranzoki21: yes, you can test now [00:30:40] stephanebisson: ok. Yes.
LGTM [00:31:59] Operations, Scap, Patch-For-Review, Release-Engineering-Team (Watching / External), and 2 others: Deployment git server can't supply ORES hosts in parallel - https://phabricator.wikimedia.org/T191842 (greg) [00:32:15] Zoranzoki21: What do you want to do with 474967? [00:32:28] !log sbisson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: gerrit:474976 Enable suppressredirect on srwiki (duration: 00m 47s) [00:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:32:33] I will make new PS [00:32:45] We have ~25 minutes [00:32:55] ok [00:33:16] (PS2) Sbisson: Enable SVGs in page in group1, rest of group0 [mediawiki-config] - https://gerrit.wikimedia.org/r/475005 (https://phabricator.wikimedia.org/T208899) (owner: MaxSem) [00:33:24] MaxSem: your patch is next [00:34:01] ready [00:34:19] (CR) Sbisson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/475005 (https://phabricator.wikimedia.org/T208899) (owner: MaxSem) [00:34:58] Operations, Scap, Release-Engineering-Team (Backlog): find a way to systematically update the deployment server name across all repos - https://phabricator.wikimedia.org/T197470 (greg) [00:35:22] (Merged) jenkins-bot: Enable SVGs in page in group1, rest of group0 [mediawiki-config] - https://gerrit.wikimedia.org/r/475005 (https://phabricator.wikimedia.org/T208899) (owner: MaxSem) [00:37:02] MaxSem: Your patch is on mwdebug1002. Can you test?
[00:37:10] (CR) jenkins-bot: Enable suppressredirect on srwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/474976 (https://phabricator.wikimedia.org/T210000) (owner: Zoranzoki21) [00:37:12] (CR) jenkins-bot: Enable SVGs in page in group1, rest of group0 [mediawiki-config] - https://gerrit.wikimedia.org/r/475005 (https://phabricator.wikimedia.org/T208899) (owner: MaxSem) [00:37:24] stephanebisson: tested, works [00:38:06] Operations: upgrade people.wm.org (rutherfordium) to stretch - https://phabricator.wikimedia.org/T210036 (Dzahn) [00:39:03] !log sbisson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: gerrit:475005 Enable SVGs in page in group1, rest of group0 (duration: 00m 46s) [00:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:39:38] Thanks stephanebisson [00:39:52] MaxSem: No prob [00:40:20] Zoranzoki21: Do you want to bring your other patch into this SWAT window? [00:40:40] stephanebisson: Yes, I working on [00:42:05] stephanebisson: Patch coming [00:42:47] (PS2) Zoranzoki21: Disable FlaggedRevs, enable RC patrol and add rights on srwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/474967 (https://phabricator.wikimedia.org/T209251) [00:43:08] Should work now [00:43:57] (CR) Sbisson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/474967 (https://phabricator.wikimedia.org/T209251) (owner: Zoranzoki21) [00:44:57] (Merged) jenkins-bot: Disable FlaggedRevs, enable RC patrol and add rights on srwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/474967 (https://phabricator.wikimedia.org/T209251) (owner: Zoranzoki21) [00:46:02] Zoranzoki21: you can test on mwdebug1002 [00:46:23] sure [00:46:48] Operations: upgrade install servers to stretch - https://phabricator.wikimedia.org/T210038 (Dzahn) [00:47:03] Operations: upgrade install servers to stretch - https://phabricator.wikimedia.org/T210038 (Dzahn) p:Triage>Normal
[00:47:16] stephanebisson: Looks good [00:48:26] !log sbisson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: gerrit:474967 Disable FlaggedRevs, enable RC patrol and add rights on srwikinews (duration: 00m 47s) [00:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:48:33] Zoranzoki21: done [00:48:44] And that concludes SWAT for now [00:49:08] stephanebisson: Looks good. Thanks! [00:50:00] (CR) jenkins-bot: Disable FlaggedRevs, enable RC patrol and add rights on srwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/474967 (https://phabricator.wikimedia.org/T209251) (owner: Zoranzoki21) [01:14:08] Operations: upgrade people.wm.org (rutherfordium) to stretch - https://phabricator.wikimedia.org/T210036 (Dzahn) I am suggesting we create a new VM called people1001 , copy data over and then delete rutherfordium any concerns? should i keep using element names ? [01:14:39] Operations: upgrade people.wm.org (rutherfordium) to stretch - https://phabricator.wikimedia.org/T210036 (Dzahn) p:Triage>Normal [01:17:15] Operations, ops-eqiad, DC-Ops: kubestage1001.mgmt down or flapping - https://phabricator.wikimedia.org/T209112 (Dzahn) Open>Resolved a:Dzahn thanks! icinga says it's been up for 11 days now.
tentatively closing as resolved [01:20:35] (PS1) Dzahn: wikistats: remove jessie/php5 support [puppet] - https://gerrit.wikimedia.org/r/475031 [01:28:07] (PS1) Dzahn: phragile: add stretch/PHP7 support [puppet] - https://gerrit.wikimedia.org/r/475032 [01:28:40] (CR) jerkins-bot: [V: -1] phragile: add stretch/PHP7 support [puppet] - https://gerrit.wikimedia.org/r/475032 (owner: Dzahn) [01:30:08] (PS2) Dzahn: phragile: add stretch/PHP7 support [puppet] - https://gerrit.wikimedia.org/r/475032 [01:30:38] (CR) jerkins-bot: [V: -1] phragile: add stretch/PHP7 support [puppet] - https://gerrit.wikimedia.org/r/475032 (owner: Dzahn) [01:37:51] (PS1) Dzahn: peopleweb: add stretch/PHP7 support [puppet] - https://gerrit.wikimedia.org/r/475033 (https://phabricator.wikimedia.org/T210036) [01:38:41] (CR) jerkins-bot: [V: -1] peopleweb: add stretch/PHP7 support [puppet] - https://gerrit.wikimedia.org/r/475033 (https://phabricator.wikimedia.org/T210036) (owner: Dzahn) [01:45:16] (PS3) Dzahn: phragile: add stretch/PHP7 support [puppet] - https://gerrit.wikimedia.org/r/475032 [01:47:43] (PS2) Dzahn: peopleweb: add stretch/PHP7 support [puppet] - https://gerrit.wikimedia.org/r/475033 (https://phabricator.wikimedia.org/T210036) [01:55:49] (Abandoned) Zoranzoki21: IS.php: Cosmetic changes [mediawiki-config] - https://gerrit.wikimedia.org/r/474478 (owner: Zoranzoki21) [03:30:23] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 811.01 seconds [03:35:33] (PS1) BryanDavis: deployment-prep: remove stale redis config [puppet] - https://gerrit.wikimedia.org/r/475038 (https://phabricator.wikimedia.org/T210030) [03:47:09] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation=list https://grafana.wikimedia.org/dashboard/db/kubernetes-api [03:49:25] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [04:16:11] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 265.14 seconds [04:32:45] (PS18) Mathew.onipe: elasticsearch: cookbook for multi-cluster services rolling restart [cookbooks] - https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) [04:32:53] (PS23) Mathew.onipe: elasticsearch_cluster: multi-cluster/multi-instance support [software/spicerack] - https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) [04:32:55] (CR) jerkins-bot: [V: -1] elasticsearch: cookbook for multi-cluster services rolling restart [cookbooks] - https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) (owner: Mathew.onipe) [04:34:01] (PS19) Mathew.onipe: elasticsearch: cookbook for multi-cluster services rolling restart [cookbooks] - https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) [06:29:25] PROBLEM - puppet last run on mw1278 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/etc/logrotate.d/mediawiki_apache] [06:30:29] !log Drop schema change on db1103:3312 and db1105:3312 - T86339 [06:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:33] T86339: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339 [06:31:41] (PS1) Marostegui: db-eqiad.php: Depool db1122 [mediawiki-config] - https://gerrit.wikimedia.org/r/475042 (https://phabricator.wikimedia.org/T86339) [06:35:12] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1122 [mediawiki-config] - https://gerrit.wikimedia.org/r/475042 (https://phabricator.wikimedia.org/T86339) (owner: Marostegui) [06:37:04] (Merged) jenkins-bot: db-eqiad.php: Depool db1122 [mediawiki-config] - https://gerrit.wikimedia.org/r/475042 (https://phabricator.wikimedia.org/T86339) (owner: Marostegui) [06:38:17] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1122 - T86339 (duration: 00m 51s) [06:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:20] T86339: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339 [06:39:22] !log Deploy schema change on db1122 - T86339 [06:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:39:33] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1122" [mediawiki-config] - https://gerrit.wikimedia.org/r/475043 [06:41:42] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1122" [mediawiki-config] - https://gerrit.wikimedia.org/r/475043 (owner: Marostegui) [06:42:12] (CR) jenkins-bot: db-eqiad.php: Depool db1122 [mediawiki-config] - https://gerrit.wikimedia.org/r/475042 (https://phabricator.wikimedia.org/T86339) (owner: Marostegui) [06:42:44] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1122" [mediawiki-config] - https://gerrit.wikimedia.org/r/475043 (owner: Marostegui)
[06:42:57] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1122" [mediawiki-config] - https://gerrit.wikimedia.org/r/475043 (owner: Marostegui) [06:43:41] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1122 - T86339 (duration: 00m 46s) [06:43:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:44] T86339: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339 [06:44:00] (PS1) Marostegui: db-eqiad.php: Depool db1074 [mediawiki-config] - https://gerrit.wikimedia.org/r/475044 (https://phabricator.wikimedia.org/T86339) [06:45:45] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1074 [mediawiki-config] - https://gerrit.wikimedia.org/r/475044 (https://phabricator.wikimedia.org/T86339) (owner: Marostegui) [06:46:48] (Merged) jenkins-bot: db-eqiad.php: Depool db1074 [mediawiki-config] - https://gerrit.wikimedia.org/r/475044 (https://phabricator.wikimedia.org/T86339) (owner: Marostegui) [06:48:12] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1074 - T86339 (duration: 00m 46s) [06:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:34] !log Deploy schema change on db1074 (sanitarium master) - T86339 [06:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:48] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - https://gerrit.wikimedia.org/r/475045 [06:50:22] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - https://gerrit.wikimedia.org/r/475045 (owner: Marostegui) [06:51:24] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - https://gerrit.wikimedia.org/r/475045 (owner: Marostegui) [06:53:21] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1074 - T86339 (duration: 00m 46s) [06:53:23] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:24] T86339: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339 [06:54:29] (PS1) Marostegui: db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/475046 (https://phabricator.wikimedia.org/T86339) [06:59:59] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/475046 (https://phabricator.wikimedia.org/T86339) (owner: Marostegui) [07:00:11] RECOVERY - puppet last run on mw1278 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [07:01:02] (Merged) jenkins-bot: db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/475046 (https://phabricator.wikimedia.org/T86339) (owner: Marostegui) [07:01:04] (CR) jenkins-bot: db-eqiad.php: Depool db1074 [mediawiki-config] - https://gerrit.wikimedia.org/r/475044 (https://phabricator.wikimedia.org/T86339) (owner: Marostegui) [07:01:06] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - https://gerrit.wikimedia.org/r/475045 (owner: Marostegui) [07:01:23] (CR) jenkins-bot: db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/475046 (https://phabricator.wikimedia.org/T86339) (owner: Marostegui) [07:02:01] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1076 - T86339 (duration: 00m 46s) [07:02:03] !log Deploy schema change on db1076 - T86339 [07:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:04] T86339: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339 [07:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:55] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - https://gerrit.wikimedia.org/r/475047 [07:03:58] (CR) Marostegui: [C:
2] Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - https://gerrit.wikimedia.org/r/475047 (owner: Marostegui) [07:05:00] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - https://gerrit.wikimedia.org/r/475047 (owner: Marostegui) [07:05:57] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1076 - T86339 (duration: 00m 46s) [07:05:59] !log Deploy schema change on db1066 (s2 master) - T86339 [07:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:11] !log Drop foundationwiki.petition_data from s3 master (db1075) with replication - T208979 [07:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:15] T208979: Drop the petition_data table from production - https://phabricator.wikimedia.org/T208979 [07:11:54] (CR) Mathew.onipe: elasticsearch: cookbook for multi-cluster services rolling restart (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) (owner: Mathew.onipe) [07:12:25] (CR) Mathew.onipe: elasticsearch_cluster: multi-cluster/multi-instance support (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: Mathew.onipe) [07:15:01] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - https://gerrit.wikimedia.org/r/475047 (owner: Marostegui) [07:15:29] (PS1) Vgutierrez: certcentral: Provide bare minimum icinga checks [puppet] - https://gerrit.wikimedia.org/r/475049 (https://phabricator.wikimedia.org/T207294) [07:15:58] (CR) jerkins-bot: [V: -1] certcentral: Provide bare minimum icinga checks [puppet] - https://gerrit.wikimedia.org/r/475049 (https://phabricator.wikimedia.org/T207294) (owner: Vgutierrez) [07:17:50] (PS2) Vgutierrez: certcentral:
Provide bare minimum icinga checks [puppet] - https://gerrit.wikimedia.org/r/475049 (https://phabricator.wikimedia.org/T207294) [07:19:31] !log Deploy schema change on db2051 (s4 codfw master) with replication - T86339 [07:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:34] T86339: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339 [07:21:35] (PS3) Vgutierrez: certcentral: Provide bare minimum icinga checks [puppet] - https://gerrit.wikimedia.org/r/475049 (https://phabricator.wikimedia.org/T207294) [07:22:03] npre_command != nrpe_command... E_TOOEARLY [07:24:07] PROBLEM - Device not healthy -SMART- on db2044 is CRITICAL: cluster=mysql device=cciss,2 instance=db2044:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2044var-datasource=codfw%2520prometheus%252Fops [07:25:05] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2044 is CRITICAL: cluster=mysql device=cciss,2 instance=db2044:9100 job=node site=codfw Marostegui T208323 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2044var-datasource=codfw%2520prometheus%252Fops [07:25:48] Operations, DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (Marostegui) db2044 came up with predictive failure today: ` root@db2044:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380264FFFB0) Port Name: 1I... [07:26:03] Operations, DBA: Predictive failures on disk S.M.A.R.T.
status - https://phabricator.wikimedia.org/T208323 (Marostegui) [07:27:14] (PS4) Vgutierrez: certcentral: Provide bare minimum icinga checks [puppet] - https://gerrit.wikimedia.org/r/475049 (https://phabricator.wikimedia.org/T207294) [07:30:00] (PS1) Marostegui: db-eqiad.php: Depool db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/475050 (https://phabricator.wikimedia.org/T86339) [07:31:05] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/475050 (https://phabricator.wikimedia.org/T86339) (owner: Marostegui) [07:32:04] (PS5) Vgutierrez: certcentral: Provide bare minimum icinga checks [puppet] - https://gerrit.wikimedia.org/r/475049 (https://phabricator.wikimedia.org/T207294) [07:32:07] (Merged) jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/475050 (https://phabricator.wikimedia.org/T86339) (owner: Marostegui) [07:32:36] !log Deploy schema change on s4 eqiad hosts - T86339 [07:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:39] T86339: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339 [07:32:54] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - https://gerrit.wikimedia.org/r/475051 [07:33:07] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1084 - T86339 (duration: 00m 46s) [07:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:56] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - https://gerrit.wikimedia.org/r/475051 (owner: Marostegui) [07:34:45] PROBLEM - HP RAID on db2044 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Failed: 1I:1:3 - Controller: OK - Battery/Capacitor: OK [07:34:52] Operations, ops-codfw: Degraded
RAID on db2044 - https://phabricator.wikimedia.org/T210049 (ops-monitoring-bot) [07:35:22] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - https://gerrit.wikimedia.org/r/475051 (owner: Marostegui) [07:36:20] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1084 - T86339 (duration: 00m 46s) [07:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:53] Operations, ops-codfw, DBA: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T210049 (Marostegui) p:Triage>Normal a:Papaul @Papaul you have any disks to replace this? Even if it is not a new one? Thanks [07:37:20] Operations, DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (Marostegui) db2044's disk finally failed {T210049} [07:37:30] Operations, DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (Marostegui) [07:37:45] (CR) Vgutierrez: "pcc looks happy https://puppet-compiler.wmflabs.org/compiler1002/13630/" [puppet] - https://gerrit.wikimedia.org/r/475049 (https://phabricator.wikimedia.org/T207294) (owner: Vgutierrez) [07:38:43] (PS1) Marostegui: db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/475053 (https://phabricator.wikimedia.org/T86339) [07:39:10] ACKNOWLEDGEMENT - HP RAID on db2044 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Failed: 1I:1:3 - Controller: OK - Battery/Capacitor: OK Marostegui T210049 - The acknowledgement expires at: 2018-11-27 07:38:52.
[07:39:48] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/475053 (https://phabricator.wikimedia.org/T86339) (owner: Marostegui) [07:40:39] (CR) jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/475050 (https://phabricator.wikimedia.org/T86339) (owner: Marostegui) [07:40:40] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - https://gerrit.wikimedia.org/r/475051 (owner: Marostegui) [07:41:13] (Merged) jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/475053 (https://phabricator.wikimedia.org/T86339) (owner: Marostegui) [07:41:27] (CR) jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/475053 (https://phabricator.wikimedia.org/T86339) (owner: Marostegui) [07:42:12] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - https://gerrit.wikimedia.org/r/475055 [07:42:14] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1091 - T86339 (duration: 00m 46s) [07:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:17] T86339: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339 [07:44:30] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - https://gerrit.wikimedia.org/r/475055 (owner: Marostegui) [07:45:33] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - https://gerrit.wikimedia.org/r/475055 (owner: Marostegui) [07:46:36] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1091 - T86339 (duration: 00m 45s) [07:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:19] (PS1) Giuseppe Lavagetto: Initial debianization [debs/prometheus-php-fpm-exporter] -
https://gerrit.wikimedia.org/r/475058 (https://phabricator.wikimedia.org/T209573) [07:50:51] !log Deploy schema change on s8 codfw master (db2045) with replication - T86339 [07:50:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:54] T86339: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339 [07:53:19] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - https://gerrit.wikimedia.org/r/475055 (owner: Marostegui) [08:11:52] Operations: build grafana package for stretch - https://phabricator.wikimedia.org/T210034 (fgiunchedi) The grafana package is imported from upstream so likely we'll have to update `aptrepo` to do the right thing and then import the package into reprepro (i.e. without (re)building from upstream) [08:23:11] (PS1) Marostegui: db-eqiad.php: Depool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/475061 (https://phabricator.wikimedia.org/T86339) [08:24:57] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/475061 (https://phabricator.wikimedia.org/T86339) (owner: Marostegui) [08:26:55] (Merged) jenkins-bot: db-eqiad.php: Depool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/475061 (https://phabricator.wikimedia.org/T86339) (owner: Marostegui) [08:27:55] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1109" [mediawiki-config] - https://gerrit.wikimedia.org/r/475062 [08:28:08] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1109 - T86339 (duration: 00m 46s) [08:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:12] T86339: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339 [08:30:17] !log Deploy schema changes on s8 eqiad hosts - T86339 [08:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:41] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1109" [mediawiki-config] - https://gerrit.wikimedia.org/r/475062 (owner: Marostegui) [08:32:00] (CR) jenkins-bot: db-eqiad.php: Depool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/475061 (https://phabricator.wikimedia.org/T86339) (owner: Marostegui) [08:33:01] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1109" [mediawiki-config] - https://gerrit.wikimedia.org/r/475062 (owner: Marostegui) [08:33:22] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1109 - T86339 (duration: 00m 46s) [08:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:25] T86339: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339 [08:35:23] (PS1) Marostegui: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/475063 (https://phabricator.wikimedia.org/T86339) [08:37:35] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/475063 (https://phabricator.wikimedia.org/T86339) (owner: Marostegui) [08:38:24] (CR) Filippo Giunchedi: "See inline" (4 comments) [debs/prometheus-php-fpm-exporter] - https://gerrit.wikimedia.org/r/475058 (https://phabricator.wikimedia.org/T209573) (owner: Giuseppe Lavagetto) [08:39:47] (Merged) jenkins-bot: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/475063 (https://phabricator.wikimedia.org/T86339) (owner: Marostegui) [08:40:37] (CR) Marostegui: [C: 1] mariadb: depooling db1113 [mediawiki-config] - https://gerrit.wikimedia.org/r/474883 (https://phabricator.wikimedia.org/T85757) (owner: Banyek) [08:40:45] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1104 - T86339 (duration: 00m 45s) [08:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:48] 
T86339: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339 [08:40:50] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - https://gerrit.wikimedia.org/r/475064 [08:42:00] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - https://gerrit.wikimedia.org/r/475064 (owner: Marostegui) [08:43:02] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - https://gerrit.wikimedia.org/r/475064 (owner: Marostegui) [08:43:59] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1104 - T86339 (duration: 00m 45s) [08:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:27] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1109" [mediawiki-config] - https://gerrit.wikimedia.org/r/475062 (owner: Marostegui) [08:44:29] (CR) jenkins-bot: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/475063 (https://phabricator.wikimedia.org/T86339) (owner: Marostegui) [08:44:31] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - https://gerrit.wikimedia.org/r/475064 (owner: Marostegui) [08:48:25] !log Deploy schema change on s7 codfw master - T86339 [08:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:28] T86339: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339 [09:01:14] !log depooling db1113 due schema change (T85757) [09:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:18] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [09:01:24] (CR) Banyek: [C: 2] mariadb: depooling db1113 [mediawiki-config] - https://gerrit.wikimedia.org/r/474883 (https://phabricator.wikimedia.org/T85757) (owner: Banyek) [09:01:41] (CR) Banyek: [V: 2 C: 2] 
mariadb: depooling db1113 [mediawiki-config] - https://gerrit.wikimedia.org/r/474883 (https://phabricator.wikimedia.org/T85757) (owner: Banyek) [09:09:04] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T85757: depool db1113 (duration: 00m 46s) [09:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:07] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [09:09:41] (CR) jenkins-bot: mariadb: depooling db1113 [mediawiki-config] - https://gerrit.wikimedia.org/r/474883 (https://phabricator.wikimedia.org/T85757) (owner: Banyek) [09:21:04] !log repooling db1113 after schema change (T85757) [09:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:07] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [09:23:29] (PS1) Banyek: Revert "mariadb: depooling db1113" [mediawiki-config] - https://gerrit.wikimedia.org/r/475067 [09:24:55] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:26:11] (CR) Banyek: [C: 2] Revert "mariadb: depooling db1113" [mediawiki-config] - https://gerrit.wikimedia.org/r/475067 (owner: Banyek) [09:27:03] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 276 bytes in 0.038 second response time [09:27:11] (PS1) Elukey: Refresh analytics_deploy labs key [labs/private] - https://gerrit.wikimedia.org/r/475068 [09:27:17] (Merged) jenkins-bot: Revert "mariadb: depooling db1113" [mediawiki-config] - https://gerrit.wikimedia.org/r/475067 (owner: Banyek) [09:27:22] (CR) Elukey: [V: 2 C: 2] Refresh analytics_deploy labs key [labs/private] - https://gerrit.wikimedia.org/r/475068 (owner: Elukey) [09:29:07] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T85757: repool db1113 (duration: 00m 46s) [09:29:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:11] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [09:31:53] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:34:32] (CR) jenkins-bot: Revert "mariadb: depooling db1113" [mediawiki-config] - https://gerrit.wikimedia.org/r/475067 (owner: Banyek) [09:37:26] (PS1) Banyek: mariadb: depool db1098 [mediawiki-config] - https://gerrit.wikimedia.org/r/475069 (https://phabricator.wikimedia.org/T85757) [09:37:39] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.296 second response time [09:39:28] (PS1) Banyek: mariadb: depool db1096 [mediawiki-config] - https://gerrit.wikimedia.org/r/475070 (https://phabricator.wikimedia.org/T85757) [09:42:50] (PS1) Filippo Giunchedi: hieradata: add new ms-be hosts [puppet] - https://gerrit.wikimedia.org/r/475071 (https://phabricator.wikimedia.org/T209395) [09:43:52] (CR) Filippo Giunchedi: [C: 2] hieradata: add new ms-be hosts [puppet] - https://gerrit.wikimedia.org/r/475071 (https://phabricator.wikimedia.org/T209395) (owner: Filippo Giunchedi) [09:44:41] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:45:06] (CR) Marostegui: [C: -1] mariadb: depool db1098 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/475069 (https://phabricator.wikimedia.org/T85757) (owner: Banyek) [09:45:35] (CR) Marostegui: [C: -1] mariadb: depool db1096 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/475070 (https://phabricator.wikimedia.org/T85757) (owner: Banyek) [09:46:38] (PS2) Banyek: mariadb: depool db1096 [mediawiki-config] - https://gerrit.wikimedia.org/r/475070 (https://phabricator.wikimedia.org/T85757) [09:48:55] (PS2) Banyek: mariadb: depool db1098 [mediawiki-config] - https://gerrit.wikimedia.org/r/475069 
(https://phabricator.wikimedia.org/T85757) [09:49:09] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 2.645 second response time [09:49:42] (CR) Marostegui: [C: 1] mariadb: depool db1096 [mediawiki-config] - https://gerrit.wikimedia.org/r/475070 (https://phabricator.wikimedia.org/T85757) (owner: Banyek) [09:49:48] !log restarted pdfrender on scb1003 [09:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:02] (CR) Marostegui: [C: 1] mariadb: depool db1098 [mediawiki-config] - https://gerrit.wikimedia.org/r/475069 (https://phabricator.wikimedia.org/T85757) (owner: Banyek) [09:52:54] !log depooling db1098:3316 due schema change (T85757) [09:52:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:59] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [09:53:02] (CR) Banyek: [C: 2] mariadb: depool db1098 [mediawiki-config] - https://gerrit.wikimedia.org/r/475069 (https://phabricator.wikimedia.org/T85757) (owner: Banyek) [09:53:49] (CR) Giuseppe Lavagetto: Initial debianization (4 comments) [debs/prometheus-php-fpm-exporter] - https://gerrit.wikimedia.org/r/475058 (https://phabricator.wikimedia.org/T209573) (owner: Giuseppe Lavagetto) [09:54:32] (PS2) Giuseppe Lavagetto: Initial debianization [debs/prometheus-php-fpm-exporter] - https://gerrit.wikimedia.org/r/475058 (https://phabricator.wikimedia.org/T209573) [09:54:34] (PS1) Giuseppe Lavagetto: Unvendorize wherever possible [debs/prometheus-php-fpm-exporter] - https://gerrit.wikimedia.org/r/475072 [09:54:44] (Merged) jenkins-bot: mariadb: depool db1098 [mediawiki-config] - https://gerrit.wikimedia.org/r/475069 (https://phabricator.wikimedia.org/T85757) (owner: Banyek) [09:54:59] Operations, cloud-services-team (Kanban): Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 (aborrero) [09:56:08] 
(CR) Giuseppe Lavagetto: [V: 2 C: 2] Unvendorize wherever possible [debs/prometheus-php-fpm-exporter] - https://gerrit.wikimedia.org/r/475072 (owner: Giuseppe Lavagetto) [09:57:14] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T85757: depool db1098:3316 (duration: 00m 46s) [09:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:26] (CR) jenkins-bot: mariadb: depool db1098 [mediawiki-config] - https://gerrit.wikimedia.org/r/475069 (https://phabricator.wikimedia.org/T85757) (owner: Banyek) [09:59:36] (PS2) Alexandros Kosiaris: kubernetes: Move runtime-config to hiera [puppet] - https://gerrit.wikimedia.org/r/474127 [10:04:47] Operations, Operations-Software-Development: Develop and deploy at least three Netbox reports to assist with data correctness and consistency - https://phabricator.wikimedia.org/T205899 (Volans) I went ahead and created the repo for the reports at: https://gerrit.wikimedia.org/r/admin/projects/operations... [10:04:53] PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100% [10:07:12] godog: any WIP going on there? 
^^^ [10:08:22] (PS1) Banyek: Revert "mariadb: depool db1098" [mediawiki-config] - https://gerrit.wikimedia.org/r/475074 [10:08:26] (PS1) Giuseppe Lavagetto: Revert "Unvendorize wherever possible" [debs/prometheus-php-fpm-exporter] - https://gerrit.wikimedia.org/r/475075 [10:10:44] (CR) Banyek: [C: 2] Revert "mariadb: depool db1098" [mediawiki-config] - https://gerrit.wikimedia.org/r/475074 (owner: Banyek) [10:11:03] Operations, Operations-Software-Development: Develop and deploy at least three Netbox reports to assist with data correctness and consistency - https://phabricator.wikimedia.org/T205899 (Volans) [10:11:06] !log repooling db1098 after schema change (T85757) [10:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:09] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [10:12:24] (Merged) jenkins-bot: Revert "mariadb: depool db1098" [mediawiki-config] - https://gerrit.wikimedia.org/r/475074 (owner: Banyek) [10:12:35] (PS1) Elukey: profile::analytics::refinery::job:data_purge: remove unused items [puppet] - https://gerrit.wikimedia.org/r/475077 (https://phabricator.wikimedia.org/T172532) [10:13:01] <_joe_> volans: there is a ticket about that IIRC [10:13:11] <_joe_> that host is not in rotation [10:13:22] right T209921 [10:13:23] T209921: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 [10:13:35] I'll update the task [10:13:51] !log banyek@deploy1001 sync-file aborted: T85757: depool db1098:3316 (duration: 00m 03s) [10:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:20] Operations, ops-codfw: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (Volans) ms-be2047 reported down by Icinga since few minutes, unable to ssh, black screen at the console so far. 
[10:14:52] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T85757: repool db1098:3316 (duration: 00m 46s) [10:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:08] (PS3) Alexandros Kosiaris: kubernetes: Move runtime-config to hiera [puppet] - https://gerrit.wikimedia.org/r/474127 [10:15:37] what we know is the replication catched up [10:15:57] I like Jaime's yesterday comment about remove s2 and give it's resources to s3 [10:16:12] nothere [10:16:18] banyek: it caught up because it still has trx=2 [10:16:28] yes [10:16:29] (CR) Elukey: [C: 2] "https://puppet-compiler.wmflabs.org/compiler1002/13633/an-coord1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/475077 (https://phabricator.wikimedia.org/T172532) (owner: Elukey) [10:16:40] banyek: if you like the idea of removing s2, go for it I would say [10:17:07] I'll look around what it takes [10:17:33] I am pretty sure it wouldn't be a walk in a park [10:18:45] I don't think it will be too hard, but take a look and get familiarized with it, and the backups config [10:19:59] 👍 [10:20:24] I finish the schema change on db1096 first as it's patch is prepared [10:20:26] volans: yeah perhaps downtime expired, T209921 [10:20:26] T209921: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 [10:20:37] I'll downtime it again [10:21:34] godog: ack,thx, I didn't reboot it [10:21:56] given I didn't know if you wanted to try to debug something with pa.paul maybe [10:21:59] (CR) Alexandros Kosiaris: [C: 2] "Does the expected thing on https://puppet-compiler.wmflabs.org/compiler1002/13632/argon.eqiad.wmnet/, noop in tools" [puppet] - https://gerrit.wikimedia.org/r/474127 (owner: Alexandros Kosiaris) [10:22:25] Operations, Traffic, Patch-For-Review: ATS: log inspection at runtime - https://phabricator.wikimedia.org/T204225 (ema) >>! In T204225#4761225, @ema wrote: > 1. 
trafficserver closes its open logpipes upon logging.yaml config reload s/closes/unlinks/. Bug filed upstream: https://github.com/apache/tra... [10:22:46] (PS4) Alexandros Kosiaris: kubernetes: Move runtime-config to hiera [puppet] - https://gerrit.wikimedia.org/r/474127 [10:22:47] volans: yup thanks, indeed papaul was looking into it yesterday with dell [10:23:01] Operations, DBA, Availability (MediaWiki-MultiDC), Performance-Team (Radar): Investigate solutions for MySQL connection pooling - https://phabricator.wikimedia.org/T196378 (jcrespo) > Could you elaborate on how this would work a bit more? I will install a proxy on each master pointing to the mas... [10:23:10] (CR) Alexandros Kosiaris: [V: 2 C: 2] kubernetes: Move runtime-config to hiera [puppet] - https://gerrit.wikimedia.org/r/474127 (owner: Alexandros Kosiaris) [10:23:25] !log depooling db1096:3316 due schema change (T85757) [10:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:28] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [10:23:59] (CR) Banyek: [C: 2] mariadb: depool db1096 [mediawiki-config] - https://gerrit.wikimedia.org/r/475070 (https://phabricator.wikimedia.org/T85757) (owner: Banyek) [10:24:07] Operations, DBA, Availability (MediaWiki-MultiDC), Performance-Team (Radar): Investigate solutions for MySQL connection pooling - https://phabricator.wikimedia.org/T196378 (jcrespo) Also in terms of prioritization, I asked if I should put this on top of other things and the answer was no due to u... 
[10:24:17] (CR) jenkins-bot: Revert "mariadb: depool db1098" [mediawiki-config] - https://gerrit.wikimedia.org/r/475074 (owner: Banyek) [10:25:06] (Merged) jenkins-bot: mariadb: depool db1096 [mediawiki-config] - https://gerrit.wikimedia.org/r/475070 (https://phabricator.wikimedia.org/T85757) (owner: Banyek) [10:26:26] (PS3) Giuseppe Lavagetto: Initial debianization [debs/prometheus-php-fpm-exporter] - https://gerrit.wikimedia.org/r/475058 (https://phabricator.wikimedia.org/T209573) [10:26:26] !log initial weight for new ms-be2* hosts (all but ms-be2047) - T209395 [10:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:30] T209395: rack/setup/install new ms-be servers ms-be204[4-9] ,ms-be2050 - https://phabricator.wikimedia.org/T209395 [10:26:45] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T85757: depool db1096:3316 (duration: 00m 45s) [10:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:34] (PS1) Elukey: profile::analytics::refinery::job::data_purge: add purge for EL [puppet] - https://gerrit.wikimedia.org/r/475078 (https://phabricator.wikimedia.org/T206542) [10:30:06] Operations, Operations-Software-Development: Develop and deploy at least three Netbox reports to assist with data correctness and consistency - https://phabricator.wikimedia.org/T205899 (Volans) @crusnov for the puppettization I think we could go with a simple git clone and setting netbox config accordin... 
[10:30:22] !log stop and upgrade db2095 [10:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:49] (CR) jenkins-bot: mariadb: depool db1096 [mediawiki-config] - https://gerrit.wikimedia.org/r/475070 (https://phabricator.wikimedia.org/T85757) (owner: Banyek) [10:37:01] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [10:37:41] akosiaris: ^ [10:39:33] (PS1) Banyek: Revert "mariadb: depool db1096" [mediawiki-config] - https://gerrit.wikimedia.org/r/475079 [10:41:33] I did merge ... [10:41:42] (CR) Banyek: [C: 2] Revert "mariadb: depool db1096" [mediawiki-config] - https://gerrit.wikimedia.org/r/475079 (owner: Banyek) [10:41:54] meh PEBKAC I guess [10:42:14] !log repooling db1096:3316 after schema change (T85757) [10:42:16] ah yeah it's waiting on the yes/no prompt... sigh [10:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:17] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [10:42:27] (PS4) Giuseppe Lavagetto: Initial debianization [debs/prometheus-php-fpm-exporter] - https://gerrit.wikimedia.org/r/475058 (https://phabricator.wikimedia.org/T209573) [10:42:46] (Merged) jenkins-bot: Revert "mariadb: depool db1096" [mediawiki-config] - https://gerrit.wikimedia.org/r/475079 (owner: Banyek) [10:42:51] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. 
[10:43:38] it happens even in the best of families [10:44:12] :D [10:44:57] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T85757: repool db1096:3316 (duration: 00m 46s) [10:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:08] (PS5) Giuseppe Lavagetto: Initial debianization [debs/prometheus-php-fpm-exporter] - https://gerrit.wikimedia.org/r/475058 (https://phabricator.wikimedia.org/T209573) [10:45:29] during scap I got this: [10:46:32] ```10:44:29 Check 'Check endpoints for mw1265.eqiad.wmnet' failed: /wiki/{title} (Main Page) is CRITICAL: Test Main Page returned the unexpected status 503 (expecting: 200); /wiki/{title} (Special Version) is CRITICAL: Test Special Version returned the unexpected status 503 (expecting: 200); /w/api.php (Main Page pageprops) is CRITICAL: Test Main Page pageprops returned the unexpected status 503 (expecting: 200) [10:46:32] ``` [10:46:55] the process finished successfully, at the end, but this is the first time I see somehting like thos [10:46:56] (CR) Filippo Giunchedi: [C: 1] "Nice!" 
[debs/prometheus-php-fpm-exporter] - https://gerrit.wikimedia.org/r/475058 (https://phabricator.wikimedia.org/T209573) (owner: Giuseppe Lavagetto) [10:47:01] this [10:47:41] so one of the checks failed [10:47:59] it seems like, yes [10:48:03] the others were good [10:48:07] check if mw1265 is under maintenance, if not report an issue [10:48:33] it is most likely unrelated to the deploy [10:48:36] 👍 [10:48:43] but errors should be reported to get them fixed [10:48:57] (CR) jenkins-bot: Revert "mariadb: depool db1096" [mediawiki-config] - https://gerrit.wikimedia.org/r/475079 (owner: Banyek) [10:49:17] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation=list https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:50:18] (CR) Giuseppe Lavagetto: [V: 2 C: 2] Initial debianization [debs/prometheus-php-fpm-exporter] - https://gerrit.wikimedia.org/r/475058 (https://phabricator.wikimedia.org/T209573) (owner: Giuseppe Lavagetto) [10:50:20] (CR) DCausse: [C: 1] elasticsearch_cluster: multi-cluster/multi-instance support (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: Mathew.onipe) [10:50:25] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:51:03] <_joe_> !log uploading prometheus-php-fpm-exporter to stretch-wikimedia main, T209573 [10:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:08] T209573: Gather metrics from php-fpm - https://phabricator.wikimedia.org/T209573 [10:53:21] _joe_: I was about to comment on the CR! 
:) For the next upload you might want to use stretch-wikimedia in the distro field instead of unstable [10:54:06] <_joe_> ema: yes ofc, I noticed while uploading :/ [10:54:21] <_joe_> but I deemed that not enough to rebuild again [10:54:32] yes of course [10:55:49] !log stop and upgrade db2074 [10:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:19] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation=list https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:56:55] (PS2) Alexandros Kosiaris: ores: Change configs to celery4 ones [puppet] - https://gerrit.wikimedia.org/r/474694 (https://phabricator.wikimedia.org/T209587) (owner: Ladsgroup) [10:57:01] <_joe_> ema: also fun fact, that package will fail to build on unstable :/ [10:57:29] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:57:39] not sure if related to banyek's report, but there was a spike in fatals some minutes ago [10:57:55] _joe_: broken deps? [10:58:12] I think it is Fatal error: entire web request took longer than 60 seconds and timed out in /srv/mediawiki/php-1.33.0-wmf.4/includes/parser/Preprocessor_Hash.php on line 184 [10:58:17] <_joe_> ema: yes, the current version of the package requires a library at an old version [10:58:26] <_joe_> jynus: that means something is slow [10:58:52] plus Fatal error: entire web request took longer than 200 seconds and timed out [10:59:45] <_joe_> and that's from the jobqueue [11:00:33] who shall I assign this, which group? [11:00:33] after the fatals, the is a higher baseline of that [11:00:36] !log disable puppet on ores2* ores1* for gradual rollout of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/474694/1/modules/ores/manifests/web.pp [11:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:50] traffic? 
[11:01:21] <_joe_> some query being particularly slow for some time? or something else? I see two cpu spikes on the API [11:01:22] also the database got some long running queries [11:01:29] <_joe_> so probably not the db [11:01:33] <_joe_> https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-instance=All&from=now-3h&to=now [11:01:40] Fatal error: entire web request took longer than 60 seconds and timed out in /srv/mediawiki/php-1.33.0-wmf.4/includes/libs/rdbms/database/DatabaseMysqli.php on line 46 [11:02:18] banyek: operations probably [11:03:05] tx [11:05:12] Operations: During scap sync-file error on one endpiont (mw1265) - https://phabricator.wikimedia.org/T210067 (Banyek) [11:05:56] it is also that time of the day were we get those spikes of QPS [11:05:57] (CR) DCausse: "lgtm, mostly nitpicks" (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) (owner: Mathew.onipe) [11:08:41] (CR) Filippo Giunchedi: [C: 2] mw_rc_irc: remove diamond::collector resource and collector script [puppet] - https://gerrit.wikimedia.org/r/475010 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite) [11:09:05] (CR) Filippo Giunchedi: [C: 1] mw_rc_irc: ensure diamond::collector absent [puppet] - https://gerrit.wikimedia.org/r/475009 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite) [11:10:11] (CR) Filippo Giunchedi: [C: -1] "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/469250 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite) [11:11:20] (CR) Filippo Giunchedi: [C: 1] jenkins: ship syslogs tagged 'jenkins' to ELK [puppet] - https://gerrit.wikimedia.org/r/474990 (https://phabricator.wikimedia.org/T143733) (owner: Herron) [11:12:40] (CR) Filippo Giunchedi: [C: 1] phabricator: ship apache error logs to ELK [puppet] - 
https://gerrit.wikimedia.org/r/474988 (https://phabricator.wikimedia.org/T141895) (owner: Herron) [11:13:05] (CR) Filippo Giunchedi: [C: 1] rsyslog: ship logs with tag 'icinga' to ELK [puppet] - https://gerrit.wikimedia.org/r/474982 (https://phabricator.wikimedia.org/T7) (owner: Herron) [11:13:52] (CR) Alexandros Kosiaris: [C: 2] ores: Change configs to celery4 ones [puppet] - https://gerrit.wikimedia.org/r/474694 (https://phabricator.wikimedia.org/T209587) (owner: Ladsgroup) [11:14:03] (CR) Filippo Giunchedi: [C: 2] Remove Diamond from restbase servers [puppet] - https://gerrit.wikimedia.org/r/474930 (https://phabricator.wikimedia.org/T183454) (owner: Muehlenhoff) [11:22:19] Operations, Continuous-Integration-Infrastructure (shipyard), Kubernetes: debianize docker-registry 2.7.0-rc0 and upload in stretch-wikimedia - https://phabricator.wikimedia.org/T210071 (fselles) p:Triage>Normal [11:22:21] (CR) Filippo Giunchedi: [C: 1] "LGTM, added traffic folks too" [puppet] - https://gerrit.wikimedia.org/r/474940 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite) [11:22:34] (CR) Filippo Giunchedi: [C: 2] Disable Diamond on Graphite hosts [puppet] - https://gerrit.wikimedia.org/r/474922 (https://phabricator.wikimedia.org/T183454) (owner: Muehlenhoff) [11:24:13] Operations: During scap sync-file error on one endpoint (mw1265) - https://phabricator.wikimedia.org/T210067 (Aklapper) [11:25:21] (CR) Banyek: [C: 1] "regarding to @Muehlenhoff's comment this is LGTM" [puppet] - https://gerrit.wikimedia.org/r/467264 (https://phabricator.wikimedia.org/T183454) (owner: Muehlenhoff) [11:38:09] Operations, cloud-services-team (Kanban): Netbox: Usage guidelines for WMCS - https://phabricator.wikimedia.org/T208576 (GTirloni) @RobH @ayounsi @faidon thanks for your comments, this was really informative! 
Speaking for myself only, it feels more like a documentation request than tooling (although the... [11:39:46] (PS1) Alexandros Kosiaris: Followup fix for ores: Change configs to celery4 [puppet] - https://gerrit.wikimedia.org/r/475084 (https://phabricator.wikimedia.org/T209587) [11:40:54] (CR) Alexandros Kosiaris: [C: 2] Followup fix for ores: Change configs to celery4 [puppet] - https://gerrit.wikimedia.org/r/475084 (https://phabricator.wikimedia.org/T209587) (owner: Alexandros Kosiaris) [11:41:48] (PS2) Alexandros Kosiaris: Followup fix for ores: Change configs to celery4 [puppet] - https://gerrit.wikimedia.org/r/475084 (https://phabricator.wikimedia.org/T209587) [11:43:10] (CR) Ladsgroup: [C: 1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/475084 (https://phabricator.wikimedia.org/T209587) (owner: Alexandros Kosiaris) [11:45:06] (CR) Alexandros Kosiaris: [C: 2] Followup fix for ores: Change configs to celery4 [puppet] - https://gerrit.wikimedia.org/r/475084 (https://phabricator.wikimedia.org/T209587) (owner: Alexandros Kosiaris) [11:53:42] !log running schema change on dbstore1002 (T85757) [11:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:46] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [11:59:06] Operations, Continuous-Integration-Infrastructure (shipyard), Kubernetes: set up a test node with new version, redis as cache, a new swift container and export metrics over graphana - https://phabricator.wikimedia.org/T210076 (fselles) p:Triage>Normal [12:04:34] !log stop and upgrade db2075 [12:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:08] !log running schema change on dbstore1001 (T85757) [12:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:11] T85757: Dropping user.user_options on wmf databases - 
https://phabricator.wikimedia.org/T85757 [12:08:24] !log running schema change on dbstore1001:3316 (T85757) [12:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:40] !log stop and upgrade db2076 [12:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:38] (CR) Mathew.onipe: elasticsearch_cluster: multi-cluster/multi-instance support (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: Mathew.onipe) [12:52:22] (CR) Mathew.onipe: elasticsearch: cookbook for multi-cluster services rolling restart (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) (owner: Mathew.onipe) [12:53:22] (PS20) Mathew.onipe: elasticsearch: cookbook for multi-cluster services rolling restart [cookbooks] - https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) [12:59:45] (CR) Ema: [C: 1] "A couple of minor nitpicks, LGTM otherwise. pcc change catalog seems correct." 
(2 comments) [puppet] - https://gerrit.wikimedia.org/r/475049 (https://phabricator.wikimedia.org/T207294) (owner: Vgutierrez) [13:05:13] !log remove BGP session to 2603 on cr4-ulsfo [13:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:52] Operations, DBA, User-Banyek: BBU Fail on dbstore2002 - https://phabricator.wikimedia.org/T208320 (Banyek) [13:18:37] Operations, ops-codfw, Patch-For-Review, User-fgiunchedi: rack/setup/install new ms-be servers ms-be204[4-9] ,ms-be2050 - https://phabricator.wikimedia.org/T209395 (fgiunchedi) [13:19:32] (CR) Alex Monk: [C: 1] certcentral: Provide bare minimum icinga checks [puppet] - https://gerrit.wikimedia.org/r/475049 (https://phabricator.wikimedia.org/T207294) (owner: Vgutierrez) [13:22:42] Operations, DBA, User-Banyek: BBU Fail on dbstore2002 - https://phabricator.wikimedia.org/T208320 (Banyek) I prepare a patch to remove s2 instance, and give it's resources to s3 to see how it works. 
[13:23:04] (03PS1) 10Banyek: mariadb: remove section s2 from dbstore2002 [puppet] - 10https://gerrit.wikimedia.org/r/475089 (https://phabricator.wikimedia.org/T208320) [13:24:11] (03PS2) 10Filippo Giunchedi: jenkins: ship syslogs tagged 'jenkins' to ELK [puppet] - 10https://gerrit.wikimedia.org/r/474990 (https://phabricator.wikimedia.org/T143733) (owner: 10Herron) [13:25:00] (03CR) 10Filippo Giunchedi: [C: 032] jenkins: ship syslogs tagged 'jenkins' to ELK [puppet] - 10https://gerrit.wikimedia.org/r/474990 (https://phabricator.wikimedia.org/T143733) (owner: 10Herron) [13:26:23] (03CR) 10DCausse: elasticsearch: cookbook for multi-cluster services rolling restart (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) (owner: 10Mathew.onipe) [13:27:49] (03PS1) 10Banyek: mariadb: depool db1085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475090 (https://phabricator.wikimedia.org/T85757) [13:33:56] (03CR) 10Banyek: "I am also not sure if I have to do anything on tendril" [puppet] - 10https://gerrit.wikimedia.org/r/475089 (https://phabricator.wikimedia.org/T208320) (owner: 10Banyek) [13:39:52] !log stop and upgrade db2077 [13:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:58] (03PS21) 10Mathew.onipe: elasticsearch: cookbook for multi-cluster services rolling restart [cookbooks] - 10https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) [13:43:19] (03CR) 10Mathew.onipe: elasticsearch: cookbook for multi-cluster services rolling restart (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) (owner: 10Mathew.onipe) [13:45:46] (03CR) 10DCausse: [C: 031] "nice!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) (owner: 10Mathew.onipe) [13:46:16] (03PS2) 10Andrew Bogott: deployment-prep: remove stale redis config [puppet] - 10https://gerrit.wikimedia.org/r/475038 (https://phabricator.wikimedia.org/T210030) (owner: 10BryanDavis) [13:47:23] (03CR) 10Andrew Bogott: [C: 032] deployment-prep: remove stale redis config [puppet] - 10https://gerrit.wikimedia.org/r/475038 (https://phabricator.wikimedia.org/T210030) (owner: 10BryanDavis) [13:48:15] (03CR) 10Marostegui: [C: 04-1] mariadb: depool db1085 (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475090 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [13:52:47] (03PS2) 10Banyek: mariadb: depool db1085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475090 (https://phabricator.wikimedia.org/T85757) [13:53:03] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:53:11] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:55:06] 10Operations, 10cloud-services-team (Kanban): Netbox: Usage guidelines for WMCS - https://phabricator.wikimedia.org/T208576 (10aborrero) I agree on several things worth noting: * we don't intend on doing config management with netbox * we don't intend on adding VM/instance information to netbox, nor are consid... [13:55:11] (03CR) 10Marostegui: [C: 031] mariadb: depool db1085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475090 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [13:55:48] (03CR) 10Jcrespo: [C: 04-1] "Let's not give s3 more resources than s4. 
Increase its buffer pool, yes, but leave some available for filesystem cache (which is a bit low" [puppet] - 10https://gerrit.wikimedia.org/r/475089 (https://phabricator.wikimedia.org/T208320) (owner: 10Banyek) [13:56:23] (03CR) 10Banyek: "this makes sense. Let me adjust this" [puppet] - 10https://gerrit.wikimedia.org/r/475089 (https://phabricator.wikimedia.org/T208320) (owner: 10Banyek) [13:56:41] (03CR) 10Marostegui: [C: 04-1] "I think you have to remove it too from the backups config, otherwise a backup will be attempted." [puppet] - 10https://gerrit.wikimedia.org/r/475089 (https://phabricator.wikimedia.org/T208320) (owner: 10Banyek) [13:57:47] (03CR) 10Jcrespo: [C: 04-1] "See proposal below." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/475089 (https://phabricator.wikimedia.org/T208320) (owner: 10Banyek) [13:57:57] (03CR) 10Banyek: "I was searching for it, and I didn't found it in the backups config, but I look for it again" [puppet] - 10https://gerrit.wikimedia.org/r/475089 (https://phabricator.wikimedia.org/T208320) (owner: 10Banyek) [13:59:01] (03CR) 10Jcrespo: [C: 04-1] "Manuel, as banyek said, it already uses dbstore2001 for that." 
[puppet] - 10https://gerrit.wikimedia.org/r/475089 (https://phabricator.wikimedia.org/T208320) (owner: 10Banyek) [13:59:20] (03PS2) 10Banyek: mariadb: remove section s2 from dbstore2002 [puppet] - 10https://gerrit.wikimedia.org/r/475089 (https://phabricator.wikimedia.org/T208320) [14:07:14] (03PS1) 10Mathew.onipe: maps: added use_proxy flag to set proxy [puppet] - 10https://gerrit.wikimedia.org/r/475092 (https://phabricator.wikimedia.org/T209570) [14:07:16] (03PS1) 10Mathew.onipe: profile::maps::osm_master: change osmupdater and osmimporter auth method to peer [puppet] - 10https://gerrit.wikimedia.org/r/475093 (https://phabricator.wikimedia.org/T206639) [14:07:18] (03CR) 10Marostegui: "> I was searching for it, and I didn't found it in the backups" [puppet] - 10https://gerrit.wikimedia.org/r/475089 (https://phabricator.wikimedia.org/T208320) (owner: 10Banyek) [14:08:40] (03Abandoned) 10Mathew.onipe: osm::master: update parameters for osm::planet_sync [puppet] - 10https://gerrit.wikimedia.org/r/474968 (owner: 10Mathew.onipe) [14:09:58] (03PS1) 10Giuseppe Lavagetto: mediawiki: add prometheus exporter for php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/475094 (https://phabricator.wikimedia.org/T209573) [14:18:03] arturo: hey, where is the screenshot in https://phabricator.wikimedia.org/T208576#4765743 from? 
[14:22:19] !log depooling db1085 due schema change (T85757) [14:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:23] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [14:22:57] (03CR) 10Banyek: [C: 032] mariadb: depool db1085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475090 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [14:24:00] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T210049 (10Papaul) @Marostegui I have 2 more new disks [14:24:04] (03Merged) 10jenkins-bot: mariadb: depool db1085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475090 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [14:24:43] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T210049 (10Marostegui) Let's try one here! Thanks! [14:25:08] (03PS2) 10Giuseppe Lavagetto: mediawiki: add prometheus exporter for php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/475094 (https://phabricator.wikimedia.org/T209573) [14:25:25] (03CR) 10jenkins-bot: mariadb: depool db1085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475090 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [14:27:07] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T85757: depool db1085 (duration: 00m 46s) [14:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:31] !log stopping replication on db1085 (T85757) [14:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:34] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [14:34:24] (03CR) 10Jcrespo: [C: 031] "I think this is ok, but this deploy is quite complex + requires an instance restart, pupppet disabled, etc.- as you may be busy tommorrow " [puppet] - 10https://gerrit.wikimedia.org/r/475089 
(https://phabricator.wikimedia.org/T208320) (owner: 10Banyek) [14:35:46] 10Operations, 10cloud-services-team (Kanban): Netbox: Usage guidelines for WMCS - https://phabricator.wikimedia.org/T208576 (10faidon) The "cluster" feature is under the "virtualization" module; it's meant to be used to track where VMs run ("Physical devices may be associated with clusters as hosts. This allow... [14:35:53] (03CR) 10Banyek: "Yes, neither today or tomorrow are good. What about Friday? Or we just postpone it until Monday?" [puppet] - 10https://gerrit.wikimedia.org/r/475089 (https://phabricator.wikimedia.org/T208320) (owner: 10Banyek) [14:39:04] paravoid: from our netbox, I created and then deleted the cluster. Should be in the changelogs [14:40:01] yeah, I saw that [14:40:18] please use a test instance for stuff like that in the future, too confusing for everyone otherwise :) [14:40:32] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/13635/mw1289.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/475094 (https://phabricator.wikimedia.org/T209573) (owner: 10Giuseppe Lavagetto) [14:40:35] we had af-netbox.wmflabs.org with test data, not sure if it's still in a working condition (if not, we should fix it) [14:41:56] cool, I didnt know [14:42:38] 10Operations, 10ops-codfw: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (10Papaul) The IDRAC indicate that that the system health is critical. I have contacted the Dell engineer who's working on the case. 
{F27269118} [14:45:09] (03CR) 10Jcrespo: [C: 031] "Don't worry, as I said, I can take care of this tomorrow on my own, I can gather logs if you are interested on the process- but that way I" [puppet] - 10https://gerrit.wikimedia.org/r/475089 (https://phabricator.wikimedia.org/T208320) (owner: 10Banyek) [14:46:29] (03CR) 10Banyek: "Thanks, I appreciate this, I let this done by you" [puppet] - 10https://gerrit.wikimedia.org/r/475089 (https://phabricator.wikimedia.org/T208320) (owner: 10Banyek) [14:46:52] (03PS1) 10Giuseppe Lavagetto: prometheus::php_fpm_explorer: fix whitespace in template [puppet] - 10https://gerrit.wikimedia.org/r/475099 [14:48:03] (03CR) 10Giuseppe Lavagetto: [C: 032] prometheus::php_fpm_explorer: fix whitespace in template [puppet] - 10https://gerrit.wikimedia.org/r/475099 (owner: 10Giuseppe Lavagetto) [14:53:14] (03PS6) 10Vgutierrez: certcentral: Provide bare minimum icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/475049 (https://phabricator.wikimedia.org/T207294) [14:53:34] (03CR) 10Alexandros Kosiaris: "Alternate approach to" [puppet] - 10https://gerrit.wikimedia.org/r/465411 (owner: 10Giuseppe Lavagetto) [14:53:47] (03CR) 10Alexandros Kosiaris: "Alternate approach to" [puppet] - 10https://gerrit.wikimedia.org/r/463469 (https://phabricator.wikimedia.org/T204907) (owner: 10Alexandros Kosiaris) [14:54:18] (03PS7) 10Vgutierrez: certcentral: Provide bare minimum icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/475049 (https://phabricator.wikimedia.org/T207294) [14:57:34] (03CR) 10Faidon Liambotis: [C: 04-1] Add example systemd service file (035 comments) [software/keyholder] - 10https://gerrit.wikimedia.org/r/473270 (owner: 10Thcipriani) [14:57:56] (03CR) 10Faidon Liambotis: [C: 032] Move public key read_bytes inside try [software/keyholder] - 10https://gerrit.wikimedia.org/r/473147 (owner: 10Thcipriani) [14:58:03] (03CR) 10Faidon Liambotis: [C: 032] Fix keyholder script permissions [software/keyholder] - 
10https://gerrit.wikimedia.org/r/473146 (owner: 10Thcipriani) [14:58:37] (03Merged) 10jenkins-bot: Fix keyholder script permissions [software/keyholder] - 10https://gerrit.wikimedia.org/r/473146 (owner: 10Thcipriani) [14:58:39] (03Merged) 10jenkins-bot: Move public key read_bytes inside try [software/keyholder] - 10https://gerrit.wikimedia.org/r/473147 (owner: 10Thcipriani) [14:59:19] PROBLEM - Check systemd state on mw1261 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:00:18] !log restarting replication on db1085 (T85757) [15:00:21] (03CR) 10Faidon Liambotis: [C: 04-1] "LGTM, apart from the comment in the commit message :)" (031 comment) [software/keyholder] - 10https://gerrit.wikimedia.org/r/473272 (owner: 10Thcipriani) [15:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:23] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [15:00:51] (03PS8) 10Vgutierrez: certcentral: Provide bare minimum icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/475049 (https://phabricator.wikimedia.org/T207294) [15:01:46] (03PS1) 10Giuseppe Lavagetto: prometheus::php-fpm-exporter: fix syntax [puppet] - 10https://gerrit.wikimedia.org/r/475100 [15:02:19] (03CR) 10Giuseppe Lavagetto: [C: 032] prometheus::php-fpm-exporter: fix syntax [puppet] - 10https://gerrit.wikimedia.org/r/475100 (owner: 10Giuseppe Lavagetto) [15:03:49] !log repooling db1085 after schema change (T85757) [15:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:53] (03CR) 10Vgutierrez: "PS8 pcc seems happy" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/475049 (https://phabricator.wikimedia.org/T207294) (owner: 10Vgutierrez) [15:04:23] (03PS1) 10Banyek: Revert "mariadb: depool db1085" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475101 [15:05:49] RECOVERY - Check systemd state on mw1261 is OK: OK - running: The system is fully 
operational [15:12:25] (03CR) 10Banyek: [C: 032] Revert "mariadb: depool db1085" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475101 (owner: 10Banyek) [15:13:27] (03Merged) 10jenkins-bot: Revert "mariadb: depool db1085" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475101 (owner: 10Banyek) [15:14:16] (03PS1) 10Filippo Giunchedi: logstash: rename 'severity' syslog field if present [puppet] - 10https://gerrit.wikimedia.org/r/475104 (https://phabricator.wikimedia.org/T143733) [15:14:38] (03CR) 10jenkins-bot: Revert "mariadb: depool db1085" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475101 (owner: 10Banyek) [15:15:30] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T85757: repool db1085 (duration: 00m 46s) [15:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:33] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [15:15:40] 10Operations, 10Patch-For-Review: Netbox: explore NAPALM integration - https://phabricator.wikimedia.org/T205898 (10faidon) I think we have consensus on the NAPALM stuff :) >>! In T205898#4758699, @Volans wrote: > This is an interesting idea, I think we could investigate that a bit and have ideas on how to us... [15:15:43] !log stop and upgrade db2073 [15:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:07] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:18:13] PROBLEM - Nginx local proxy to apache on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:22:26] (03PS1) 10Bmansurov: Labs: enable the reader trust survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475105 (https://phabricator.wikimedia.org/T209882) [15:23:02] (03CR) 10Volans: "Almost ready!" 
(0318 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: 10Mathew.onipe) [15:27:25] (03PS2) 10Bmansurov: Labs: enable the reader trust survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475105 (https://phabricator.wikimedia.org/T209882) [15:27:56] 10Operations, 10Parsoid: parsoid-rt repeated failures on ruthenium (parsoid::testing) - https://phabricator.wikimedia.org/T209758 (10ssastry) >>! In T209758#4764385, @Arlolra wrote: >> `TypeError: The header content contains invalid characters` > > I did the expedient thing and just made the redirect url safe... [15:28:32] (03CR) 10jerkins-bot: [V: 04-1] Labs: enable the reader trust survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475105 (https://phabricator.wikimedia.org/T209882) (owner: 10Bmansurov) [15:30:20] (03PS2) 10Filippo Giunchedi: logstash: rename 'severity' syslog field if present [puppet] - 10https://gerrit.wikimedia.org/r/475104 (https://phabricator.wikimedia.org/T143733) [15:30:23] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 76416 bytes in 0.198 second response time [15:30:25] RECOVERY - Nginx local proxy to apache on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 622 bytes in 0.056 second response time [15:34:23] (03PS3) 10Filippo Giunchedi: logstash: rename 'severity' syslog field if present [puppet] - 10https://gerrit.wikimedia.org/r/475104 (https://phabricator.wikimedia.org/T143733) [15:35:26] (03PS2) 10Alexandros Kosiaris: Remove LVS IP assignments for ocg [dns] - 10https://gerrit.wikimedia.org/r/473728 [15:46:17] PROBLEM - MariaDB Slave IO: s4 on db2073 is CRITICAL: CRITICAL slave_io_state could not connect [15:46:17] PROBLEM - MariaDB read only s4 on db2073 is CRITICAL: Could not connect to localhost:3306 [15:46:33] PROBLEM - MariaDB Slave Lag: s4 on db2073 is CRITICAL: CRITICAL slave_sql_lag could not connect [15:46:33] PROBLEM - MariaDB Slave SQL: s4 on db2073 
is CRITICAL: CRITICAL slave_sql_state could not connect [15:46:33] PROBLEM - mysqld processes on db2073 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [15:47:09] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T210049 (10Papaul) a:05Papaul>03Marostegui Disk replaced [15:48:48] 10Operations, 10Traffic, 10Wikidata, 10wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (10BBlack) Thanks for the data and the patch! We'll dig into the DNS patch next week and get it merged in so we're serving wikiba.se from our DNS... [15:52:20] (03CR) 10Giuseppe Lavagetto: [C: 031] "I think leaving some trace of OCG around our codebase is important as a warning for the future generations. If you want to concede to ico" [dns] - 10https://gerrit.wikimedia.org/r/473728 (owner: 10Alexandros Kosiaris) [15:53:14] (03PS1) 10Ema: WIP: build -dbgsym packages [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/475108 [15:54:34] (03CR) 10Gehel: "Looks good! @Volans comments should be addressed, but I have nothing to add!" (036 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: 10Mathew.onipe) [15:54:36] (03CR) 10Alexandros Kosiaris: [C: 032] "in the interest of not being an iconolater, here it goes" [dns] - 10https://gerrit.wikimedia.org/r/473728 (owner: 10Alexandros Kosiaris) [15:57:30] (03CR) 10Nuria: [C: 031] profile::analytics::refinery::job::data_purge: add purge for EL [puppet] - 10https://gerrit.wikimedia.org/r/475078 (https://phabricator.wikimedia.org/T206542) (owner: 10Elukey) [15:59:31] (03CR) 10Nuria: "Nice, next step here would be using marcel drop script to refactor all deletes to happen via the same piece of code. Step by step." 
[puppet] - 10https://gerrit.wikimedia.org/r/475077 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [16:01:13] 10Operations, 10ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T210090 (10ops-monitoring-bot) [16:01:47] (03PS2) 10Elukey: profile::analytics::refinery::job::data_purge: add purge for EL [puppet] - 10https://gerrit.wikimedia.org/r/475078 (https://phabricator.wikimedia.org/T206542) [16:02:51] (03CR) 10Elukey: [C: 032] profile::analytics::refinery::job::data_purge: add purge for EL [puppet] - 10https://gerrit.wikimedia.org/r/475078 (https://phabricator.wikimedia.org/T206542) (owner: 10Elukey) [16:05:33] PROBLEM - Host ms-be2047.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:07:57] RECOVERY - Memory correctable errors -EDAC- on wtp2020 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1var-server=wtp2020var-datasource=codfw%2520prometheus%252Fops [16:15:54] (03PS3) 10Cwhite: nginx: use latest commit [puppet] - 10https://gerrit.wikimedia.org/r/474940 (https://phabricator.wikimedia.org/T183454) [16:16:04] (03PS4) 10Filippo Giunchedi: logstash: rename 'severity' syslog field if present [puppet] - 10https://gerrit.wikimedia.org/r/475104 (https://phabricator.wikimedia.org/T143733) [16:16:06] (03PS1) 10Filippo Giunchedi: logstash: add 'level' normalization rules [puppet] - 10https://gerrit.wikimedia.org/r/475110 (https://phabricator.wikimedia.org/T143733) [16:17:04] (03CR) 10Cwhite: [C: 032] nginx: use latest commit [puppet] - 10https://gerrit.wikimedia.org/r/474940 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [16:20:59] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (10colewhite) [16:21:22] RECOVERY - mysqld processes on db2073 is OK: PROCS OK: 1 process with command name mysqld [16:22:22] RECOVERY - MariaDB Slave SQL: s4 on db2073 is OK: OK slave_sql_state 
Slave_SQL_Running: Yes [16:23:52] apparently I took too much time to reboot that [16:23:55] more than 2 hours [16:24:40] (03PS2) 10Thcipriani: PEP 328, multi-line imports [software/keyholder] - 10https://gerrit.wikimedia.org/r/473272 [16:25:03] (03CR) 10Faidon Liambotis: [C: 032] PEP 328, multi-line imports [software/keyholder] - 10https://gerrit.wikimedia.org/r/473272 (owner: 10Thcipriani) [16:25:29] !log stop and upgrade db2063 [16:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:38] (03Merged) 10jenkins-bot: PEP 328, multi-line imports [software/keyholder] - 10https://gerrit.wikimedia.org/r/473272 (owner: 10Thcipriani) [16:26:28] RECOVERY - MariaDB Slave Lag: s4 on db2073 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [16:27:39] thcipriani: :) sorry! :) [16:28:02] paravoid: heh, no worries, it was already a pedantic patch. Good to be pedantic AND correct :) [16:28:56] RECOVERY - MariaDB Slave IO: s4 on db2073 is OK: OK slave_io_state Slave_IO_Running: Yes [16:30:50] RECOVERY - MariaDB read only s4 on db2073 is OK: Version 10.1.37-MariaDB, Uptime 617s, read_only: True, 90.72 QPS, connection latency: 0.003574s, query latency: 0.000693s [16:31:05] (03CR) 10Volans: "Replies to @gehel comments" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: 10Mathew.onipe) [16:36:07] (03PS1) 10Elukey: hadoop: correct mapred_site_extra_properties variable name [puppet/cdh] - 10https://gerrit.wikimedia.org/r/475113 [16:37:17] (03PS2) 10Elukey: hadoop: correct mapred_site_extra_properties variable name [puppet/cdh] - 10https://gerrit.wikimedia.org/r/475113 [16:37:39] !log deploy patch T209794 [16:37:39] (03CR) 10Elukey: [C: 032] hadoop: correct mapred_site_extra_properties variable name [puppet/cdh] - 10https://gerrit.wikimedia.org/r/475113 (owner: 10Elukey) [16:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:41] 
(03CR) 10Elukey: [V: 032 C: 032] hadoop: correct mapred_site_extra_properties variable name [puppet/cdh] - 10https://gerrit.wikimedia.org/r/475113 (owner: 10Elukey) [16:40:02] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T210049 (10Marostegui) a:05Marostegui>03Papaul @Papaul disk failed, can you pull out and pull in again? [16:40:04] (03PS1) 10Elukey: profile::hadoop::common: fix mapred_site_extra_properties variable [puppet] - 10https://gerrit.wikimedia.org/r/475115 [16:40:24] 10Operations, 10ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T210090 (10Banyek) @papaul can we ask for a replaacement? [16:40:28] 10Operations, 10ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T210090 (10Marostegui) 05Open>03declined Duplicate of T210049 [16:40:43] !log stop and upgrade db2066 [16:40:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:51] (03PS1) 10Elukey: hadoop::defaults: fix unwanted variable rename [puppet/cdh] - 10https://gerrit.wikimedia.org/r/475116 [16:45:15] (03CR) 10Elukey: [V: 032 C: 032] hadoop::defaults: fix unwanted variable rename [puppet/cdh] - 10https://gerrit.wikimedia.org/r/475116 (owner: 10Elukey) [16:45:57] (03PS2) 10Elukey: profile::hadoop::common: fix mapred_site_extra_properties variable [puppet] - 10https://gerrit.wikimedia.org/r/475115 [16:48:32] RECOVERY - Host ms-be2047.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.97 ms [16:48:47] 10Operations, 10Traffic, 10Wikimedia-Incident: Add maint-announce@ to Equinix's recipient list for eqsin incidents - https://phabricator.wikimedia.org/T207140 (10RobH) There has been no notices from eq singapore to test this, and last note was from Vivian's update on Nov 16th. I emailed a reply in just now... 
[16:50:43] (03PS3) 10Elukey: profile::hadoop::common: fix mapred_site_extra_properties variable [puppet] - 10https://gerrit.wikimedia.org/r/475115 [16:53:52] (03CR) 10Volans: [C: 04-1] "Mostly good, few minor comments inline." (038 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) (owner: 10Mathew.onipe) [16:53:58] (03PS4) 10Elukey: profile::hadoop::common: fix mapred_site_extra_properties variable [puppet] - 10https://gerrit.wikimedia.org/r/475115 [16:55:16] (03CR) 10Elukey: [C: 032] profile::hadoop::common: fix mapred_site_extra_properties variable [puppet] - 10https://gerrit.wikimedia.org/r/475115 (owner: 10Elukey) [16:56:54] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T210049 (10Papaul) a:05Papaul>03Marostegui Done [17:00:54] PROBLEM - Check systemd state on scb1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.16: Connection reset by peer [17:00:58] PROBLEM - mathoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.16: Connection reset by peer [17:01:02] PROBLEM - apertium apy on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:01:02] PROBLEM - eventstreams on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:01:12] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:01:36] PROBLEM - SSH on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:01:42] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve all events on January 15) timed out before a response was received: /{domain}/v1/feed/availability (Retrieve feed content availability from \wikipedia.org\) timed out before a response was received [17:02:02] RECOVERY - eventstreams on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1043 bytes in 0.002 second response time [17:02:42] 
PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [17:03:06] RECOVERY - Check systemd state on scb1001 is OK: OK - running: The system is fully operational [17:03:40] smells like OOM party [17:03:48] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [17:04:08] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [17:04:15] !log stop and upgrade db2080 [17:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:32] so ssh not working for me, going to check serial [17:06:00] RECOVERY - SSH on scb1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0) [17:06:27] (03PS3) 10Bmansurov: Labs: enable the reader trust survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475105 (https://phabricator.wikimedia.org/T209882) [17:06:30] 10Operations, 10ops-eqsin, 10Traffic: cp5001 unreachable since 2018-07-14 17:49:21 - https://phabricator.wikimedia.org/T199675 (10RobH) > Hi Rob, > Good day to you. > I am replying on behalf of my colleague Marco, as you have spoken earlier. > > I have created a new case number on the issue with the serve... [17:06:34] RECOVERY - apertium apy on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 5996 bytes in 0.003 second response time [17:07:00] PROBLEM - graphoid endpoints health on scb1001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received [17:07:36] PROBLEM - Check systemd state on scb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[17:07:56] mobrovac: --^ [17:08:21] https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&var-server=scb1001&var-datasource=eqiad%20prometheus%2Fops&from=now-24h&to=now-1m [17:08:26] 10Operations, 10cloud-services-team (Kanban): Netbox: Usage guidelines for WMCS - https://phabricator.wikimedia.org/T208576 (10aborrero) It seems all the approaches we considered have been discarded. Are we missing any other option? [17:08:52] RECOVERY - mathoid endpoints health on scb1001 is OK: All endpoints are healthy [17:09:05] mobrovac: memory usage pattern is not very nice recently.. known? [17:09:05] https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&panelId=4&fullscreen&var-server=scb1001&var-datasource=eqiad%20prometheus%2Fops&from=now-7d&to=now-1m [17:13:38] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.171 second response time [17:19:24] RECOVERY - graphoid endpoints health on scb1001 is OK: All endpoints are healthy [17:20:16] !log stop and upgrade db2081 [17:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:00] !log manually started systemd-journald.service on scb1001 after OOM [17:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:14] RECOVERY - Check systemd state on scb1001 is OK: OK - running: The system is fully operational [17:24:24] (03PS1) 10Bstorm: sonofgridengine: cronrunners need hba [puppet] - 10https://gerrit.wikimedia.org/r/475118 (https://phabricator.wikimedia.org/T200557) [17:25:24] going to meetings, load and memory usage seems fine on scb1001, if anybody has time please check as well :) [17:25:37] (03CR) 10Bstorm: [C: 032] sonofgridengine: cronrunners need hba [puppet] - 10https://gerrit.wikimedia.org/r/475118 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [17:33:15] (03CR) 10Arturo Borrero Gonzalez: "I would like to see a +1 from some other person other than me before merging :-)" 
[dns] - 10https://gerrit.wikimedia.org/r/460320 (https://phabricator.wikimedia.org/T202886) (owner: 10Arturo Borrero Gonzalez) [17:41:51] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10kchapman) Reminder that there is an IRC meeting today (Wednesday November 21st) at 11pm PST(November 22nd... [17:45:21] (03PS1) 10Dzahn: add people1001.eqiad.wmnet to replace rutherfordium [dns] - 10https://gerrit.wikimedia.org/r/475123 (https://phabricator.wikimedia.org/T210036) [17:46:45] (03PS1) 10Bstorm: sonofgridengine: This may be all that is needed for now on bastions [puppet] - 10https://gerrit.wikimedia.org/r/475124 (https://phabricator.wikimedia.org/T209627) [17:51:42] (03CR) 10Bstorm: [C: 032] sonofgridengine: This may be all that is needed for now on bastions [puppet] - 10https://gerrit.wikimedia.org/r/475124 (https://phabricator.wikimedia.org/T209627) (owner: 10Bstorm) [17:56:12] meep - Can anyone here tell me if the beta commons db lock will be gone in the near future? ;-) [17:56:40] RECOVERY - Device not healthy -SMART- on db2044 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2044var-datasource=codfw%2520prometheus%252Fops [18:01:21] 10Operations, 10SRE-Access-Requests: access request for Jeena Huneidi (deployment, conint-admins, contint-docker) - https://phabricator.wikimedia.org/T210027 (10RobH) p:05Triage>03Normal [18:01:51] 10Operations, 10SRE-Access-Requests: access request for Jeena Huneidi (deployment, conint-admins, contint-docker) - https://phabricator.wikimedia.org/T210027 (10RobH) [18:02:45] 10Operations, 10SRE-Access-Requests: access request for Jeena Huneidi (deployment, conint-admins, contint-docker) - https://phabricator.wikimedia.org/T210027 (10RobH) Please note that this is a sudo level request, and has to be approved during the weekly SRE meetings. Our next meeting is on Monday, 2018-11-26... [18:06:13] 10Operations, 10Analytics, 10SRE-Access-Requests: Allow access to Data Lake/Hive for Niharika - https://phabricator.wikimedia.org/T210022 (10RobH) p:05Triage>03Normal [18:08:35] (03CR) 10Dzahn: [C: 032] add people1001.eqiad.wmnet to replace rutherfordium [dns] - 10https://gerrit.wikimedia.org/r/475123 (https://phabricator.wikimedia.org/T210036) (owner: 10Dzahn) [18:08:56] 10Operations, 10Analytics, 10SRE-Access-Requests: Allow access to Data Lake/Hive for Niharika - https://phabricator.wikimedia.org/T210022 (10RobH) a:03Niharika @niharika: Please review https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Groups It seems that Hive has both the public and priva... [18:09:10] 10Operations, 10Analytics, 10SRE-Access-Requests: Allow access to Data Lake/Hive for Niharika - https://phabricator.wikimedia.org/T210022 (10RobH) [18:18:28] 10Operations, 10Sentry, 10vm-requests: Procure hardware for Sentry - https://phabricator.wikimedia.org/T93138 (10Dzahn) Do you still request a Ganeti VM? 
[18:19:24] 10Operations, 10vm-requests, 10Patch-For-Review: upgrade people.wm.org (rutherfordium) to stretch - https://phabricator.wikimedia.org/T210036 (10Dzahn) [18:27:36] 10Operations, 10vm-requests, 10Patch-For-Review: upgrade people.wm.org (rutherfordium) to stretch - https://phabricator.wikimedia.org/T210036 (10Dzahn) [18:28:35] (03PS2) 10GTirloni: labpuppetmaster: Resolve .wmflabs addresses [puppet] - 10https://gerrit.wikimedia.org/r/474923 (https://phabricator.wikimedia.org/T177959) [18:30:18] (03CR) 10GTirloni: [C: 032] labpuppetmaster: Resolve .wmflabs addresses [puppet] - 10https://gerrit.wikimedia.org/r/474923 (https://phabricator.wikimedia.org/T177959) (owner: 10GTirloni) [18:30:37] robh: re that LDAP request, should I generally have them merged in the future? what's the best process since they are officially separate processes? [18:30:46] 10Operations, 10Analytics, 10SRE-Access-Requests: Allow access to Data Lake/Hive for Niharika - https://phabricator.wikimedia.org/T210022 (10Niharika) @RobH Currently I only need access to Eventlogging data for TemplateWizard (As mentioned in task). I don't know if it's public or private - @Milimetric can pe... [18:31:15] greg-g: they are independent, just any ldap entry requires listing in admins module [18:31:31] i could add them in ldap only section today [18:31:35] and then on monday move out of that to shell section [18:32:39] 10Operations, 10Analytics, 10SRE-Access-Requests: Allow access to Data Lake/Hive for Niharika - https://phabricator.wikimedia.org/T210022 (10DannyH) Yes, I approve. [18:37:55] basically can do in either order, but the easiest one is shell first [18:38:52] 10Operations, 10vm-requests, 10Patch-For-Review: upgrade people.wm.org (rutherfordium) to stretch - https://phabricator.wikimedia.org/T210036 (10Dzahn) ` [ganeti1003:~] $ sudo makevm This is an interactive script to make it easier to create a Ganeti VM. Please see https://wikitech.wikimedia.org/wiki/Ganeti#C... 
[18:39:23] gtirloni: is this change about labspuppetmaster100[12] or puppetmasters that run in VPSes? [18:40:15] paravoid: labpuppetmaster, and it hasn't worked because I edit the wrong file probably [18:40:53] the prod ones? [18:41:09] yes [18:41:23] I don't think we should be doing that :( [18:41:52] should i respond on the task or would you prefer discussing it here? [18:44:30] I gotta go, so I just responded, sorry! [18:44:39] (03PS1) 10GTirloni: Revert "labpuppetmaster: Resolve .wmflabs addresses" [puppet] - 10https://gerrit.wikimedia.org/r/475129 (https://phabricator.wikimedia.org/T177959) [18:45:20] (03CR) 10GTirloni: [C: 032] Revert "labpuppetmaster: Resolve .wmflabs addresses" [puppet] - 10https://gerrit.wikimedia.org/r/475129 (https://phabricator.wikimedia.org/T177959) (owner: 10GTirloni) [18:45:58] paravoid: thanks, i've reverted the change [19:03:35] robh: meh, just merge them timing-wise [19:04:16] 10Operations, 10Analytics, 10SRE-Access-Requests: Allow access to Data Lake/Hive for Niharika - https://phabricator.wikimedia.org/T210022 (10RobH) a:05Niharika>03Milimetric @Milimetric: I'm assigning to you for feedback on if @nikarika needs the private-data version or not. Please advise and unassign yo... [19:06:08] (03CR) 10Dzahn: [C: 031] add redirects of various zh-yue projects to yue [puppet] - 10https://gerrit.wikimedia.org/r/474901 (https://phabricator.wikimedia.org/T209693) (owner: 10ArielGlenn) [19:06:43] 10Operations, 10Analytics, 10SRE-Access-Requests: Allow access to Data Lake/Hive for Niharika - https://phabricator.wikimedia.org/T210022 (10Milimetric) Private. Niharika would benefit from being a part of analytics-privatedata-users, including access to data before it's sanitized. 
[19:06:49] 10Operations, 10Analytics, 10SRE-Access-Requests: Allow access to Data Lake/Hive for Niharika - https://phabricator.wikimedia.org/T210022 (10Milimetric) a:05Milimetric>03None [19:10:48] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 41.26 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6fullscreenorgId=1 [19:11:58] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 88.17 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6fullscreenorgId=1 [19:18:28] 10Operations, 10Analytics, 10SRE-Access-Requests: Allow access to Data Lake/Hive for Niharika - https://phabricator.wikimedia.org/T210022 (10RobH) >>! In T210022#4766563, @Milimetric wrote: > Private. Niharika would benefit from being a part of analytics-privatedata-users, including access to data before it... [19:28:54] (03PS5) 10Cwhite: initial commit [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/471298 (https://phabricator.wikimedia.org/T208066) [19:29:13] (03CR) 10Cwhite: "CS 5 is after running black." [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/471298 (https://phabricator.wikimedia.org/T208066) (owner: 10Cwhite) [19:33:49] (03CR) 10BryanDavis: "I have no idea where/how this config is applied, but the host name changes seem reasonable." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/475025 (https://phabricator.wikimedia.org/T210030) (owner: 10Alex Monk) [19:36:14] (03PS1) 10Bstorm: sonofgridengine: correct hba manifest for this grid variant [puppet] - 10https://gerrit.wikimedia.org/r/475140 (https://phabricator.wikimedia.org/T200557) [19:37:43] (03CR) 10Bstorm: [C: 032] sonofgridengine: correct hba manifest for this grid variant [puppet] - 10https://gerrit.wikimedia.org/r/475140 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [19:44:22] (03CR) 10Gehel: [C: 04-1] "Looks like there are unexpected changes on labsdb1006: https://puppet-compiler.wmflabs.org/compiler1002/13643/labsdb1006.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/475092 (https://phabricator.wikimedia.org/T209570) (owner: 10Mathew.onipe) [19:48:30] PROBLEM - Check systemd state on ms-be2034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:49:04] !log milimetric@deploy1001 Started deploy [analytics/aqs/deploy@e114d99]: Fixing sorting bug on top endpoints [19:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:38] !log milimetric@deploy1001 Finished deploy [analytics/aqs/deploy@e114d99]: Fixing sorting bug on top endpoints (duration: 05m 34s) [19:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:28] PROBLEM - puppet last run on mw1295 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:09:32] (03CR) 10Hashar: [C: 031] logstash: add 'level' normalization rules [puppet] - 10https://gerrit.wikimedia.org/r/475110 (https://phabricator.wikimedia.org/T143733) (owner: 10Filippo Giunchedi) [20:10:42] (03CR) 10Hashar: [C: 031] "Seems legit :)" [puppet] - 10https://gerrit.wikimedia.org/r/475104 (https://phabricator.wikimedia.org/T143733) (owner: 10Filippo Giunchedi) [20:16:43] 10Operations, 10Parsoid: parsoid-rt repeated failures on ruthenium (parsoid::testing) - https://phabricator.wikimedia.org/T209758 (10Arlolra) 05Open>03Resolved a:03Arlolra Thanks [20:20:29] 10Operations, 10ops-eqsin, 10Traffic: cp5001 unreachable since 2018-07-14 17:49:21 - https://phabricator.wikimedia.org/T199675 (10RobH) So Dell wants us to update the bios and return this to service to see if the error happens again. I'll flash the bios, and attempt to run memtest remotely and see if that w... [20:20:40] 10Operations, 10ops-codfw: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (10Papaul) I removed CPU1 and moved CPU2 in CPU1 slot boot the server with only with CPU2 and had the same CPU problem reported on CPU1 I requested that the parts below been sent to me. - Motherboard - H730P... 
[20:23:47] 10Operations, 10ops-eqiad, 10DC-Ops: icinga1001 mysterious reboots - https://phabricator.wikimedia.org/T210108 (10colewhite) [20:25:03] (03PS1) 10Rush: logsteralarms: better alerting logic [puppet] - 10https://gerrit.wikimedia.org/r/475145 (https://phabricator.wikimedia.org/T208611) [20:27:13] (03CR) 10Rush: [C: 032] logsteralarms: better alerting logic [puppet] - 10https://gerrit.wikimedia.org/r/475145 (https://phabricator.wikimedia.org/T208611) (owner: 10Rush) [20:35:18] RECOVERY - puppet last run on mw1295 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [20:38:40] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Kubernetes: set up a test node with new version, Redis as cache, a new Swift container and export metrics over Fraphana - https://phabricator.wikimedia.org/T210076 (10hashar) [20:39:01] 10Operations, 10Scoring-platform-team (Current), 10User-Ladsgroup: Spec out migrating ORES to kubernetes - https://phabricator.wikimedia.org/T210109 (10Ladsgroup) [20:39:04] PROBLEM - puppet last run on puppetmaster2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:53:06] 10Operations, 10ops-eqsin, 10Traffic: cp5001 unreachable since 2018-07-14 17:49:21 - https://phabricator.wikimedia.org/T199675 (10RobH) The latency in pushing things to the mgmt network is pretty high, but it is working. 
Updated the idrac firmware to 2.60 from 2.50, now updating bios from 2.5.4 to 2.8.0 [20:55:49] !log cp5001 reboot for firmware update [20:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:04] PROBLEM - Host cp5001 is DOWN: PING CRITICAL - Packet loss = 100% [20:56:14] yes icinga, im aware ;D [21:00:44] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5001_v4, cp5001_v6 [21:01:14] PROBLEM - IPsec on cp1086 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp5001_v4, cp5001_v6 [21:01:23] (03PS1) 10Alex Monk: network::constants: Include cloud private range in all_networks [puppet] - 10https://gerrit.wikimedia.org/r/475150 [21:01:24] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5001_v4, cp5001_v6 [21:01:24] PROBLEM - IPsec on cp2025 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5001_v4, cp5001_v6 [21:01:26] PROBLEM - IPsec on cp1084 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp5001_v4, cp5001_v6 [21:01:26] PROBLEM - IPsec on cp1088 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp5001_v4, cp5001_v6 [21:01:28] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5001_v4, cp5001_v6 [21:01:32] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5001_v4, cp5001_v6 [21:01:34] PROBLEM - IPsec on cp1090 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp5001_v4, cp5001_v6 [21:01:46] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5001_v4, cp5001_v6 [21:01:46] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5001_v4, cp5001_v6 [21:01:52] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5001_v4, cp5001_v6 [21:01:54] PROBLEM - IPsec on cp1080 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp5001_v4, cp5001_v6 [21:02:02] PROBLEM - IPsec on cp1076 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: 
cp5001_v4, cp5001_v6 [21:02:10] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5001_v4, cp5001_v6 [21:02:10] PROBLEM - IPsec on cp2018 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5001_v4, cp5001_v6 [21:02:12] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5001_v4, cp5001_v6 [21:02:16] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5001_v4, cp5001_v6 [21:02:18] PROBLEM - IPsec on cp1082 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp5001_v4, cp5001_v6 [21:02:20] PROBLEM - IPsec on cp1078 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp5001_v4, cp5001_v6 [21:03:46] RECOVERY - Host cp5001 is UP: PING OK - Packet loss = 0%, RTA = 247.43 ms [21:03:48] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 64 ESP OK [21:03:52] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 64 ESP OK [21:03:52] RECOVERY - IPsec on cp1090 is OK: Strongswan OK - 72 ESP OK [21:04:04] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 64 ESP OK [21:04:04] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 64 ESP OK [21:04:12] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 64 ESP OK [21:04:12] RECOVERY - IPsec on cp1080 is OK: Strongswan OK - 72 ESP OK [21:04:12] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 64 ESP OK [21:04:20] RECOVERY - IPsec on cp1076 is OK: Strongswan OK - 72 ESP OK [21:04:30] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 64 ESP OK [21:04:30] RECOVERY - IPsec on cp2018 is OK: Strongswan OK - 64 ESP OK [21:04:30] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 64 ESP OK [21:04:36] RECOVERY - IPsec on cp1082 is OK: Strongswan OK - 72 ESP OK [21:04:36] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 64 ESP OK [21:04:38] RECOVERY - IPsec on cp1078 is OK: Strongswan OK - 72 ESP OK [21:04:40] RECOVERY - IPsec on cp1086 is OK: Strongswan OK - 72 ESP OK [21:04:52] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 64 ESP OK [21:04:52] RECOVERY - 
IPsec on cp2025 is OK: Strongswan OK - 64 ESP OK [21:04:52] RECOVERY - IPsec on cp1084 is OK: Strongswan OK - 72 ESP OK [21:04:52] RECOVERY - IPsec on cp1088 is OK: Strongswan OK - 72 ESP OK [21:06:55] (03CR) 10Alex Monk: "It's probably worth noting that 10/8 is already on this list." [puppet] - 10https://gerrit.wikimedia.org/r/475150 (owner: 10Alex Monk) [21:07:31] 10Operations, 10ops-eqsin, 10Traffic: cp5001 unreachable since 2018-07-14 17:49:21 - https://phabricator.wikimedia.org/T199675 (10RobH) Bios updated, now running memtest86+ via Dell diagnostics boot option entry. [21:07:34] PROBLEM - Host cp5001 is DOWN: PING CRITICAL - Packet loss = 100% [21:09:52] RECOVERY - puppet last run on puppetmaster2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:13:22] PROBLEM - IPsec on cp1080 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp5001_v4, cp5001_v6 [21:13:24] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5001_v4, cp5001_v6 [21:13:24] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5001_v4, cp5001_v6 [21:13:30] PROBLEM - IPsec on cp1076 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp5001_v4, cp5001_v6 [21:13:40] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5001_v4, cp5001_v6 [21:13:40] PROBLEM - IPsec on cp2018 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5001_v4, cp5001_v6 [21:13:42] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5001_v4, cp5001_v6 [21:13:44] PROBLEM - IPsec on cp1082 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp5001_v4, cp5001_v6 [21:13:46] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5001_v4, cp5001_v6 [21:13:48] PROBLEM - IPsec on cp1078 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp5001_v4, cp5001_v6 [21:13:50] PROBLEM - IPsec on cp1086 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp5001_v4, 
cp5001_v6 [21:14:02] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5001_v4, cp5001_v6 [21:14:02] PROBLEM - IPsec on cp1084 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp5001_v4, cp5001_v6 [21:14:02] PROBLEM - IPsec on cp2025 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5001_v4, cp5001_v6 [21:14:02] PROBLEM - IPsec on cp1088 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp5001_v4, cp5001_v6 [21:14:08] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5001_v4, cp5001_v6 [21:14:10] PROBLEM - IPsec on cp1090 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp5001_v4, cp5001_v6 [21:14:12] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5001_v4, cp5001_v6 [21:14:23] ok, thats due to cp5001 [21:14:24] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5001_v4, cp5001_v6 [21:14:24] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5001_v4, cp5001_v6 [21:14:35] and there isnt an easy way to maint mode a host and avoid ipsec spam [21:14:38] but its ok. [21:16:43] !log cp5001 is offline running hardware tests after firmware updates to see if memory error still exists. 
ref: T199675 [21:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:47] T199675: cp5001 unreachable since 2018-07-14 17:49:21 - https://phabricator.wikimedia.org/T199675 [21:28:24] (03PS1) 10Dzahn: install_server: add people1001 to DHCP/partman [puppet] - 10https://gerrit.wikimedia.org/r/475154 (https://phabricator.wikimedia.org/T210036) [21:36:16] PROBLEM - HHVM jobrunner on mw1294 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [21:37:26] RECOVERY - HHVM jobrunner on mw1294 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [21:38:16] (03PS1) 10Alex Monk: deployment-prep: move lists of cache nodes out of labs.yaml hiera [puppet] - 10https://gerrit.wikimedia.org/r/475225 [21:39:21] (03CR) 10Dzahn: [C: 032] install_server: add people1001 to DHCP/partman [puppet] - 10https://gerrit.wikimedia.org/r/475154 (https://phabricator.wikimedia.org/T210036) (owner: 10Dzahn) [21:39:53] (03PS1) 10Hashar: Revert "gerrit: Set log level for com.google.gerrit.server.plugins.PluginLoader to ERROR" [puppet] - 10https://gerrit.wikimedia.org/r/475226 [21:40:09] (03PS2) 10Hashar: Revert "gerrit: Set log level for com.google.gerrit.server.plugins.PluginLoader to ERROR" [puppet] - 10https://gerrit.wikimedia.org/r/475226 [21:41:21] (03PS3) 10Dzahn: peopleweb: add stretch/PHP7 support [puppet] - 10https://gerrit.wikimedia.org/r/475033 (https://phabricator.wikimedia.org/T210036) [21:42:07] (03CR) 10Faidon Liambotis: [C: 04-1] cloudvps: eqiad1: add cloudinstances2b virtual router FQDNs (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/460320 (https://phabricator.wikimedia.org/T202886) (owner: 10Arturo Borrero Gonzalez) [21:42:19] (03CR) 10Faidon Liambotis: [C: 032] cloudvps: eqiad1: add cloudinstances2b virtual router FQDNs [dns] - 10https://gerrit.wikimedia.org/r/460320 (https://phabricator.wikimedia.org/T202886) (owner: 10Arturo Borrero Gonzalez) [21:42:26] (03PS5) 
10Faidon Liambotis: cloudvps: eqiad1: add cloudinstances2b virtual router FQDNs [dns] - 10https://gerrit.wikimedia.org/r/460320 (https://phabricator.wikimedia.org/T202886) (owner: 10Arturo Borrero Gonzalez) [21:42:34] (03CR) 10Hashar: "Spotted via a comment in log4j.xml.erb (well done Paladox). Seems the log were being filled with INFO messages:" [puppet] - 10https://gerrit.wikimedia.org/r/475226 (owner: 10Hashar) [21:43:36] (03CR) 10Paladox: [C: 031] "+1, PolyGerrit plugins are indeed supported from 2.15+ and it should be working (as it shows the logo) :)" [puppet] - 10https://gerrit.wikimedia.org/r/475226 (owner: 10Hashar) [21:51:17] (03CR) 10Hashar: Revert "gerrit: Set log level for com.google.gerrit.server.plugins.PluginLoader to ERROR" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/475226 (owner: 10Hashar) [21:51:23] (03PS1) 10Alex Monk: deployment-prep: Clean up from cache-text04 -> cache-text05 migration [puppet] - 10https://gerrit.wikimedia.org/r/475227 [21:51:31] (03PS3) 10Hashar: Revert "gerrit: Set log level for com.google.gerrit.server.plugins.PluginLoader to ERROR" [puppet] - 10https://gerrit.wikimedia.org/r/475226 [21:55:56] (03CR) 10Dzahn: [C: 032] peopleweb: add stretch/PHP7 support [puppet] - 10https://gerrit.wikimedia.org/r/475033 (https://phabricator.wikimedia.org/T210036) (owner: 10Dzahn) [21:58:08] 10Operations, 10ops-codfw, 10Services (watching): rack/setup/install restbase201[3-8].codfw.wmnet - https://phabricator.wikimedia.org/T209615 (10Papaul) [22:04:05] (03PS2) 10Alex Monk: deployment-prep: Clean up from cache-text04 -> cache-text05 migration [puppet] - 10https://gerrit.wikimedia.org/r/475227 [22:04:23] (03PS2) 10Dzahn: install_server: add people1001 to DHCP/partman [puppet] - 10https://gerrit.wikimedia.org/r/475154 (https://phabricator.wikimedia.org/T210036) [22:28:55] (03CR) 10Dzahn: [C: 032] DNS: Add production and mgmt DNS entries for sessionstore200[1-3] [dns] - 10https://gerrit.wikimedia.org/r/474771 
(https://phabricator.wikimedia.org/T209389) (owner: 10Papaul) [22:29:12] (03PS2) 10Dzahn: DNS: Add production and mgmt DNS entries for sessionstore200[1-3] [dns] - 10https://gerrit.wikimedia.org/r/474771 (https://phabricator.wikimedia.org/T209389) (owner: 10Papaul) [22:40:49] (03PS1) 10Dzahn: peopleweb: add role to people1001 [puppet] - 10https://gerrit.wikimedia.org/r/475228 (https://phabricator.wikimedia.org/T210036) [22:41:28] (03PS2) 10Dzahn: peopleweb: add role to people1001 [puppet] - 10https://gerrit.wikimedia.org/r/475228 (https://phabricator.wikimedia.org/T210036) [22:43:09] (03CR) 10Dzahn: [C: 032] peopleweb: add role to people1001 [puppet] - 10https://gerrit.wikimedia.org/r/475228 (https://phabricator.wikimedia.org/T210036) (owner: 10Dzahn) [22:51:39] (03PS1) 10Mforns: Add RefineMonitor to EventLoggingSanitization analytics refinery job [puppet] - 10https://gerrit.wikimedia.org/r/475231 (https://phabricator.wikimedia.org/T202429) [22:54:22] (03PS1) 10Dzahn: peopleweb: set httpd MPM to prefork explicitly [puppet] - 10https://gerrit.wikimedia.org/r/475232 (https://phabricator.wikimedia.org/T210036) [22:56:04] (03PS2) 10Dzahn: peopleweb: set httpd MPM to prefork explicitly [puppet] - 10https://gerrit.wikimedia.org/r/475232 (https://phabricator.wikimedia.org/T210036) [22:57:00] (03CR) 10Dzahn: [C: 032] peopleweb: set httpd MPM to prefork explicitly [puppet] - 10https://gerrit.wikimedia.org/r/475232 (https://phabricator.wikimedia.org/T210036) (owner: 10Dzahn) [22:57:13] (03PS3) 10Dzahn: peopleweb: set httpd MPM to prefork explicitly [puppet] - 10https://gerrit.wikimedia.org/r/475232 (https://phabricator.wikimedia.org/T210036) [22:58:21] (03PS1) 10Bstorm: toolforge: add qpdf, unpaper, and pngquant to exec nodes [puppet] - 10https://gerrit.wikimedia.org/r/475233 (https://phabricator.wikimedia.org/T204422) [23:02:08] (03PS1) 10Dzahn: switch people.eqiad from rutherfordium to people1001 [dns] - 10https://gerrit.wikimedia.org/r/475234 [23:03:47] (03PS1) 10Dzahn: 
remove rutherfordium.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/475235 (https://phabricator.wikimedia.org/T210036) [23:07:18] (03PS1) 10Dzahn: cache/trafficserver: replace rutherfordium with people1001, backend and director [puppet] - 10https://gerrit.wikimedia.org/r/475236 (https://phabricator.wikimedia.org/T210036) [23:07:50] (03CR) 10jerkins-bot: [V: 04-1] cache/trafficserver: replace rutherfordium with people1001, backend and director [puppet] - 10https://gerrit.wikimedia.org/r/475236 (https://phabricator.wikimedia.org/T210036) (owner: 10Dzahn) [23:10:05] (03PS1) 10Dzahn: remove rutherfordium from site, netboot, DHCP [puppet] - 10https://gerrit.wikimedia.org/r/475237 (https://phabricator.wikimedia.org/T210036) [23:10:36] (03CR) 10jerkins-bot: [V: 04-1] remove rutherfordium from site, netboot, DHCP [puppet] - 10https://gerrit.wikimedia.org/r/475237 (https://phabricator.wikimedia.org/T210036) (owner: 10Dzahn) [23:13:43] (03PS1) 10Dzahn: peopleweb: allow rsync of /home from rutherfordium to people1001 [puppet] - 10https://gerrit.wikimedia.org/r/475238 (https://phabricator.wikimedia.org/T210036) [23:14:23] (03CR) 10Dzahn: [C: 032] peopleweb: allow rsync of /home from rutherfordium to people1001 [puppet] - 10https://gerrit.wikimedia.org/r/475238 (https://phabricator.wikimedia.org/T210036) (owner: 10Dzahn) [23:18:32] (03PS1) 10Smalyshev: Enable dumping RDF data for debugging purposes [puppet] - 10https://gerrit.wikimedia.org/r/475241 (https://phabricator.wikimedia.org/T210044) [23:19:18] PROBLEM - Check systemd state on rutherfordium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[23:19:25] ^ me [23:19:46] (03CR) 10jerkins-bot: [V: 04-1] Enable dumping RDF data for debugging purposes [puppet] - 10https://gerrit.wikimedia.org/r/475241 (https://phabricator.wikimedia.org/T210044) (owner: 10Smalyshev) [23:22:53] (03PS1) 10Dzahn: peopleweb: add mapped IPv6 to people1001 [puppet] - 10https://gerrit.wikimedia.org/r/475242 (https://phabricator.wikimedia.org/T210036) [23:23:32] (03CR) 10Dzahn: [C: 032] peopleweb: add mapped IPv6 to people1001 [puppet] - 10https://gerrit.wikimedia.org/r/475242 (https://phabricator.wikimedia.org/T210036) (owner: 10Dzahn) [23:23:43] (03PS2) 10Dzahn: peopleweb: add mapped IPv6 to people1001 [puppet] - 10https://gerrit.wikimedia.org/r/475242 (https://phabricator.wikimedia.org/T210036) [23:24:48] (03PS1) 10Smalyshev: Enable dumping RDF on test & internal [puppet] - 10https://gerrit.wikimedia.org/r/475243 (https://phabricator.wikimedia.org/T210044) [23:25:20] (03PS2) 10Smalyshev: Enable dumping RDF data for debugging purposes [puppet] - 10https://gerrit.wikimedia.org/r/475241 (https://phabricator.wikimedia.org/T210044) [23:26:09] (03CR) 10jerkins-bot: [V: 04-1] Enable dumping RDF data for debugging purposes [puppet] - 10https://gerrit.wikimedia.org/r/475241 (https://phabricator.wikimedia.org/T210044) (owner: 10Smalyshev) [23:27:27] (03PS3) 10Smalyshev: Enable dumping RDF data for debugging purposes [puppet] - 10https://gerrit.wikimedia.org/r/475241 (https://phabricator.wikimedia.org/T210044) [23:28:45] (03PS1) 10Dzahn: add IPv6 records for people1001.eqiad.wmnet. [dns] - 10https://gerrit.wikimedia.org/r/475245 (https://phabricator.wikimedia.org/T210036) [23:29:47] (03CR) 10Dzahn: [C: 032] add IPv6 records for people1001.eqiad.wmnet. 
[dns] - 10https://gerrit.wikimedia.org/r/475245 (https://phabricator.wikimedia.org/T210036) (owner: 10Dzahn) [23:30:54] (03PS2) 10Smalyshev: Enable dumping RDF on test & internal [puppet] - 10https://gerrit.wikimedia.org/r/475243 (https://phabricator.wikimedia.org/T210044) [23:31:48] RECOVERY - Check systemd state on rutherfordium is OK: OK - running: The system is fully operational [23:32:53] (03CR) 10Smalyshev: "This should be merged when we're ready for dumps test (probably Monday 26th)" [puppet] - 10https://gerrit.wikimedia.org/r/475243 (https://phabricator.wikimedia.org/T210044) (owner: 10Smalyshev) [23:34:23] !log rsyncing /home from rutherfordium.eqiad to people1001.eqiad (people.wikimedia.org) T210036 [23:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:27] T210036: upgrade people.wm.org (rutherfordium) to stretch - https://phabricator.wikimedia.org/T210036 [23:37:36] PROBLEM - Check systemd state on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [23:37:40] PROBLEM - configured eth on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [23:37:58] PROBLEM - MD RAID on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [23:38:10] PROBLEM - Disk space on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [23:38:16] PROBLEM - dhclient process on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [23:38:32] PROBLEM - DPKG on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [23:39:26] RECOVERY - dhclient process on notebook1004 is OK: PROCS OK: 0 processes with command name dhclient [23:39:40] RECOVERY - DPKG on notebook1004 is OK: All packages OK [23:39:47] (03PS1) 10Dzahn: Revert "peopleweb: allow rsync of /home from rutherfordium to people1001" [puppet] - 10https://gerrit.wikimedia.org/r/475249 
[23:39:54] RECOVERY - Check systemd state on notebook1004 is OK: OK - running: The system is fully operational [23:39:58] RECOVERY - configured eth on notebook1004 is OK: OK - interfaces up [23:40:16] RECOVERY - MD RAID on notebook1004 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [23:40:28] RECOVERY - Disk space on notebook1004 is OK: DISK OK