[00:00:04] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate Evening SWAT (Max 8 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180119T0000).
[00:00:05] kaldari and ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:04:49] \o
[00:05:11] i guess i can deploy
[00:05:26] (CR) EBernhardson: [C: 2] Switch wiktionary sister search on enwiki to title only [mediawiki-config] - https://gerrit.wikimedia.org/r/405206 (https://phabricator.wikimedia.org/T185250) (owner: EBernhardson)
[00:06:51] (Merged) jenkins-bot: Switch wiktionary sister search on enwiki to title only [mediawiki-config] - https://gerrit.wikimedia.org/r/405206 (https://phabricator.wikimedia.org/T185250) (owner: EBernhardson)
[00:07:09] (CR) jenkins-bot: Switch wiktionary sister search on enwiki to title only [mediawiki-config] - https://gerrit.wikimedia.org/r/405206 (https://phabricator.wikimedia.org/T185250) (owner: EBernhardson)
[00:07:39] Operations, DBA, Release-Engineering-Team, cloud-services-team, wikitech.wikimedia.org: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805#3911639 (bd808)
[00:09:50] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: T185250 Switch wiktionary sister search on enwiki to title only (step 1) (duration: 00m 57s)
[00:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:04] T185250: Investigate irrelevant sister project search results on Wikipedia - https://phabricator.wikimedia.org/T185250
[00:11:29] kaldari: you around for your deploy?
[00:11:36] yes
[00:11:44] ready
[00:11:46] !log ebernhardson@tin Synchronized wmf-config/CirrusSearch-common.php: T185250 Switch wiktionary sister search on enwiki to title only (step 2) (duration: 00m 56s)
[00:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:08] (PS3) EBernhardson: Removing unused citizendium from $wgRelatedSitesPrefixes... [mediawiki-config] - https://gerrit.wikimedia.org/r/405049 (https://phabricator.wikimedia.org/T185246) (owner: Kaldari)
[00:12:13] (CR) EBernhardson: [C: 2] Removing unused citizendium from $wgRelatedSitesPrefixes... [mediawiki-config] - https://gerrit.wikimedia.org/r/405049 (https://phabricator.wikimedia.org/T185246) (owner: Kaldari)
[00:12:54] ebernhardson: This one is testable, so let me know when
[00:14:18] (Merged) jenkins-bot: Removing unused citizendium from $wgRelatedSitesPrefixes... [mediawiki-config] - https://gerrit.wikimedia.org/r/405049 (https://phabricator.wikimedia.org/T185246) (owner: Kaldari)
[00:14:43] kaldari: it's on mw1002
[00:14:49] mwdebug1002
[00:14:50] looking...
[00:16:31] ebernhardson: Looks good.
[00:16:37] sync away
[00:16:55] (CR) jenkins-bot: Removing unused citizendium from $wgRelatedSitesPrefixes... [mediawiki-config] - https://gerrit.wikimedia.org/r/405049 (https://phabricator.wikimedia.org/T185246) (owner: Kaldari)
[00:18:54] * ebernhardson looks to have broke the !log sending
[00:19:30] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: T185246: Removing unused citizendium from $wgRelatedSitesPrefixes (duration: 00m 56s)
[00:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:44] T185246: Remove Citizendium from $wgRelatedSitesPrefixes - https://phabricator.wikimedia.org/T185246
[00:19:52] kaldari: ok should be all set
[00:20:54] bd808 testing dologmsg from tin
[00:26:53] thanks!
[00:35:07] (PS1) Dzahn: rename ms-be3001.mgmt to bast3003.mgmt [dns] - https://gerrit.wikimedia.org/r/405223 (https://phabricator.wikimedia.org/T184936)
[00:39:52] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds
[00:40:36] (PS1) Dzahn: assign 91.198.174.115 to bast3003 [dns] - https://gerrit.wikimedia.org/r/405224 (https://phabricator.wikimedia.org/T184936)
[00:42:33] (PS1) Dzahn: add IPv6 for bast3003 [dns] - https://gerrit.wikimedia.org/r/405225 (https://phabricator.wikimedia.org/T184936)
[00:43:33] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 578 bytes in 0.017 second response time
[00:49:52] (PS1) Dzahn: add bast3003 to site and network constants [puppet] - https://gerrit.wikimedia.org/r/405226 (https://phabricator.wikimedia.org/T184936)
[00:50:26] (CR) jerkins-bot: [V: -1] add bast3003 to site and network constants [puppet] - https://gerrit.wikimedia.org/r/405226 (https://phabricator.wikimedia.org/T184936) (owner: Dzahn)
[00:52:08] (CR) Dzahn: "sorry jerkins, but there is currently no replacement for "declares interface::add_ip6_mapped"" [puppet] - https://gerrit.wikimedia.org/r/405226 (https://phabricator.wikimedia.org/T184936) (owner: Dzahn)
[00:59:20] (PS1) Dzahn: DHCP: add bast3003 [puppet] - https://gerrit.wikimedia.org/r/405227 (https://phabricator.wikimedia.org/T184936)
[01:04:14] (PS1) Dzahn: bast3002->bast3003 in DHCP,network constants,smokeping [puppet] - https://gerrit.wikimedia.org/r/405229 (https://phabricator.wikimedia.org/T184936)
[01:05:49] (PS1) Dzahn: bast3002->bast3003 as prometheus node, rm from site [puppet] - https://gerrit.wikimedia.org/r/405230 (https://phabricator.wikimedia.org/T184936)
[01:06:04] sounds like those hosts aren't lasting long
[01:06:08] I remember when it was hooft.esams
[01:07:10] hooft became 3002
[01:07:13] oh
[01:07:17] so what
was 3001?
[01:07:56] https://phabricator.wikimedia.org/T159480
[01:08:01] Possibly not fully decommed? :P
[01:08:18] (PS1) Dzahn: prometheus.svc.esams.wmnet: bast3002->bast3003 [dns] - https://gerrit.wikimedia.org/r/405231 (https://phabricator.wikimedia.org/T184936)
[01:08:27] (CR) Dzahn: "should go together with https://gerrit.wikimedia.org/r/405231" [puppet] - https://gerrit.wikimedia.org/r/405230 (https://phabricator.wikimedia.org/T184936) (owner: Dzahn)
[01:08:52] oh, wait
[01:09:07] no hooft -> 3001, prometheus -> 3002?
[01:09:33] huh: https://phabricator.wikimedia.org/T156506
[01:09:34] oh, hi
[01:09:36] weird
[01:09:38] hey
[01:09:44] yea, i stole another "misc" from esams
[01:09:47] and now doing it again
[01:14:16] (PS1) Dzahn: decom bast3002, keep mgmt [dns] - https://gerrit.wikimedia.org/r/405232 (https://phabricator.wikimedia.org/T184936)
[01:19:36] Operations, ops-esams: bast3002 sdb broken - https://phabricator.wikimedia.org/T169035#3911750 (Dzahn) > isn't it easier to simply install/designate one of the other machines as bastion until we refresh the misc cluster soon? See subtask, i would just take the first of the recently decom'ed swift boxes...
[01:20:18] !log attempting to restore home_pmtpa from bacula to bast1001
[01:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:51] TimStarling: That's....old. What ya hunting for?
[01:21:22] bd808: Also for dologmsg from tin: `scap log [foo]`
[01:21:35] paravoid was asking me some questions about things that happened in 2008
[01:21:58] version controlled squid configuration would be good if I can find it in there
[01:22:33] no_justification: *nod* I was just poking because of the failed !log from the scap
[02:27:52] When did we start versioning it?
It couldn't have been back then
[02:28:02] I know when we moved to git we jettisoned the SVN history
[02:30:42] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6480
[02:31:07] no_justification: between 08-09 based on tims comment here i would guess https://phabricator.wikimedia.org/T115937
[02:31:42] RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 3099155 keys, up 5 minutes 5 seconds - replication_delay is 1
[02:37:34] Operations, TemplateStyles, Traffic, Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3911821 (Tgr) Sorry for taking so long on this :( Somehow it managed to be #2 on my todo list for a very long time, with various other...
[02:39:40] Operations, Domains, Traffic, WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3911827 (Volker_E)
[03:21:21] !log on bast1001: restarting bacula-fd with master key decryption enabled, restarting restore job
[03:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:24:22] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 759.81 seconds
[03:56:32] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 130.32 seconds
[04:35:51] Operations, Ops-Access-Requests: Requesting access to stat1004, stat1005, stat1006 for mneisler - https://phabricator.wikimedia.org/T184838#3911885 (JKatzWMF) Approved. I hope my approval counts as Megan is supporting @chelsyx and I am Chelsy's manager.
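Editor's aside, not part of the channel traffic: the PROBLEM/RECOVERY pairs above (the rdb2005 redis check, the dbstore1002 s1 lag) can be matched up to see how long each alert stayed open. This is a minimal sketch only. It assumes the log has been split into one message per line, that all timestamps fall on the same day, and that alert lines follow the "PROBLEM - <check> is CRITICAL:" / "RECOVERY - <check> is OK:" shape seen in this log. The function name `alert_durations` is made up for illustration.

```python
import re

# Shape inferred from the alert lines in this log; other states
# (WARNING, UNKNOWN, ACKNOWLEDGEMENT) are deliberately not handled.
ALERT_RE = re.compile(
    r"\[(\d{2}):(\d{2}):(\d{2})\] (PROBLEM|RECOVERY) - (.+?) is (?:CRITICAL|OK):"
)

def alert_durations(lines):
    """Return (check name, seconds open) for each PROBLEM that later recovers."""
    open_alerts = {}   # check name -> time of first PROBLEM, in seconds
    durations = []
    for line in lines:
        match = ALERT_RE.match(line)
        if match is None:
            continue
        hours, minutes, seconds, kind, check = match.groups()
        stamp = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
        if kind == "PROBLEM":
            # Keep the earliest PROBLEM if the alert repeats before recovering.
            open_alerts.setdefault(check, stamp)
        elif check in open_alerts:
            durations.append((check, stamp - open_alerts.pop(check)))
    return durations
```

A RECOVERY with no earlier PROBLEM in the input (such as an alert that opened before the log starts) is simply skipped.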
[05:24:32] PROBLEM - Check Varnish expiry mailbox lag on cp4026 is CRITICAL: CRITICAL: expiry mailbox lag is 2016208
[06:14:20] (PS1) Marostegui: db-eqiad.php: Depool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/405245 (https://phabricator.wikimedia.org/T174569)
[06:17:26] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/405245 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[06:18:59] (Merged) jenkins-bot: db-eqiad.php: Depool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/405245 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[06:19:13] (CR) jenkins-bot: db-eqiad.php: Depool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/405245 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[06:20:36] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1109 - T174569 (duration: 00m 57s)
[06:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:49] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[06:22:39] !log Deploy schema change on db1109 - T174569
[06:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:23:22] (PS1) Marostegui: mariadb: Move db2034 from s1 to x1 master [puppet] - https://gerrit.wikimedia.org/r/405246 (https://phabricator.wikimedia.org/T184888)
[06:24:33] (PS1) Marostegui: db-eqiad.php: Depool db1099:3311 [mediawiki-config] - https://gerrit.wikimedia.org/r/405247 (https://phabricator.wikimedia.org/T162807)
[06:26:56] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1099:3311 [mediawiki-config] - https://gerrit.wikimedia.org/r/405247 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[06:28:28] (Merged) jenkins-bot: db-eqiad.php: Depool db1099:3311 [mediawiki-config] - https://gerrit.wikimedia.org/r/405247
(https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[06:28:42] (CR) jenkins-bot: db-eqiad.php: Depool db1099:3311 [mediawiki-config] - https://gerrit.wikimedia.org/r/405247 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[06:29:47] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1099:3311 - T162807 (duration: 00m 56s)
[06:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:58] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[06:30:18] (CR) Marostegui: "https://puppet-compiler.wmflabs.org/compiler02/9789/" [puppet] - https://gerrit.wikimedia.org/r/405246 (https://phabricator.wikimedia.org/T184888) (owner: Marostegui)
[06:31:22] !log Stop replication in sync db1089 and db1099:3311 - T162807
[06:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:43] (CR) Marostegui: [C: 2] mariadb: Move db2034 from s1 to x1 master [puppet] - https://gerrit.wikimedia.org/r/405246 (https://phabricator.wikimedia.org/T184888) (owner: Marostegui)
[06:58:12] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1099:3311" [mediawiki-config] - https://gerrit.wikimedia.org/r/405248
[07:00:53] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1099:3311" [mediawiki-config] - https://gerrit.wikimedia.org/r/405248 (owner: Marostegui)
[07:01:55] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1099:3311" [mediawiki-config] - https://gerrit.wikimedia.org/r/405248 (owner: Marostegui)
[07:02:10] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1099:3311" [mediawiki-config] - https://gerrit.wikimedia.org/r/405248 (owner: Marostegui)
[07:03:00] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1099:3311 - T162807 (duration: 00m 55s)
[07:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:13] T162807: Run pt-table-checksum on
s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[07:11:33] !log Stop x1 on dbstore2002 to copy its content to db2034 - T184888
[07:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:47] T184888: Replace codfw x1 master (db2033) (WAS: Failed BBU on db2033 (x1 master)) - https://phabricator.wikimedia.org/T184888
[08:11:40] (CR) Muehlenhoff: [C: 1] rename ms-be3001.mgmt to bast3003.mgmt [dns] - https://gerrit.wikimedia.org/r/405223 (https://phabricator.wikimedia.org/T184936) (owner: Dzahn)
[08:12:21] (PS2) Muehlenhoff: Remove firejail config for now-unused ffmpeg2theora [puppet] - https://gerrit.wikimedia.org/r/403212 (https://phabricator.wikimedia.org/T181591) (owner: Brion VIBBER)
[08:14:39] (CR) Muehlenhoff: [C: 2] Remove firejail config for now-unused ffmpeg2theora [puppet] - https://gerrit.wikimedia.org/r/403212 (https://phabricator.wikimedia.org/T181591) (owner: Brion VIBBER)
[08:18:34] (PS48) TerraCodes: $wmf* -> $wmg* [mediawiki-config] - https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956)
[08:25:12] Operations, DBA, Goal, Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3912051 (jcrespo)
[08:30:30] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 180 probes of 289 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[08:34:48] (PS1) Jcrespo: mariadb: Reimage db2036 with stretch [puppet] - https://gerrit.wikimedia.org/r/405253
[08:36:08] (CR) Jcrespo: [C: 2] mariadb: Reimage db2036 with stretch [puppet] - https://gerrit.wikimedia.org/r/405253 (owner: Jcrespo)
[08:43:41] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the
unexpected status 503 (expecting: 200)
[08:44:40] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy
[09:00:20] (CR) Marostegui: [C: 2] s1,x1.hosts: Move db2034 from s1 to x1 [software] - https://gerrit.wikimedia.org/r/405255 (https://phabricator.wikimedia.org/T184888) (owner: Marostegui)
[09:00:45] (PS1) Urbanecm: Allow bureaucrats@mr.wiki to grant&revoke accountcreator [mediawiki-config] - https://gerrit.wikimedia.org/r/405256 (https://phabricator.wikimedia.org/T184553)
[09:01:02] (Merged) jenkins-bot: s1,x1.hosts: Move db2034 from s1 to x1 [software] - https://gerrit.wikimedia.org/r/405255 (https://phabricator.wikimedia.org/T184888) (owner: Marostegui)
[09:03:18] (PS1) Marostegui: db-eqiad.php: Depool db1105:3311 [mediawiki-config] - https://gerrit.wikimedia.org/r/405257 (https://phabricator.wikimedia.org/T162807)
[09:06:10] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1105:3311 [mediawiki-config] - https://gerrit.wikimedia.org/r/405257 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[09:07:39] (Merged) jenkins-bot: db-eqiad.php: Depool db1105:3311 [mediawiki-config] - https://gerrit.wikimedia.org/r/405257 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[09:07:48] (CR) jenkins-bot: db-eqiad.php: Depool db1105:3311 [mediawiki-config] - https://gerrit.wikimedia.org/r/405257 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[09:08:50] !log start cassandra-a on restbase1015 - T184100
[09:08:51] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1105:3311 - T162807 (duration: 00m 57s)
[09:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:03] T184100: Reprovision legacy Cassandra nodes into new cluster - https://phabricator.wikimedia.org/T184100
[09:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:13] T162807: Run
pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[09:10:10] !log Stop replication in sync db1089 and db1105:3311 - T162807
[09:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:30] Operations, DBA: db2036 storage issues? (mysql crashed, installer issues) - https://phabricator.wikimedia.org/T185294#3912108 (jcrespo)
[09:15:00] Operations, DBA: db2036 storage issues? (mysql crashed, installer issues) - https://phabricator.wikimedia.org/T185294#3912119 (jcrespo) Open>stalled
[09:21:14] Operations, DBA: db2036 storage issues? (mysql crashed, installer issues) - https://phabricator.wikimedia.org/T185294#3912120 (jcrespo) These are the latest logs from the hw: ``` 12 Repaired Drive Array 01/16/2018 15:31 01/16/2018 15:31 1 Internal Storage Enclosure Device Failure (Bay 6, Box 1, Port 1I...
[09:25:02] Operations, DBA: db2036 storage issues? (mysql crashed, installer issues) - https://phabricator.wikimedia.org/T185294#3912124 (Marostegui) This server got a disk replaced a few days ago: T184836
[09:25:35] Operations, Continuous-Integration-Infrastructure, HHVM: HHVM 3.18.5+dfsg-1+wmf3 changes parse_url causing unit tests to fail - https://phabricator.wikimedia.org/T185024#3912128 (MoritzMuehlenhoff) Stretch packages have also been uploaded in the mean time.
[09:40:59] (PS1) Jon Harald Søby: Add 3 namespaces to wawiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/405258 (https://phabricator.wikimedia.org/T185289)
[09:41:09] (CR) jerkins-bot: [V: -1] Add 3 namespaces to wawiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/405258 (https://phabricator.wikimedia.org/T185289) (owner: Jon Harald Søby)
[09:43:36] !upgrade remaining mw servers in codfw to HHVM 3.18.7 (and upgraded nginx for T164456 where applicable)
[09:43:36] T164456: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456
[09:43:56] (CR) Jon Harald Søby: "When this is merged, the swatter should run the following script:" [mediawiki-config] - https://gerrit.wikimedia.org/r/405258 (https://phabricator.wikimedia.org/T185289) (owner: Jon Harald Søby)
[09:46:46] RECOVERY - Check systemd state on ms-be2023 is OK: OK - running: The system is fully operational
[09:52:06] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1105:3311" [mediawiki-config] - https://gerrit.wikimedia.org/r/405259
[09:54:51] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1105:3311" [mediawiki-config] - https://gerrit.wikimedia.org/r/405259 (owner: Marostegui)
[09:56:53] !log cp4026 restart varnish-be because of mbox lag
[09:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:15] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1105:3311" [mediawiki-config] - https://gerrit.wikimedia.org/r/405259 (owner: Marostegui)
[09:57:17] (Abandoned) Mark Bergsma: Proof of concept for hashing thumbs to original's hash key [puppet] - https://gerrit.wikimedia.org/r/29805 (owner: Mark Bergsma)
[09:57:26] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1105:3311" [mediawiki-config] - https://gerrit.wikimedia.org/r/405259 (owner: Marostegui)
[09:58:35] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1105:3311 - T162807 (duration: 00m 54s)
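Editor's aside, not part of the channel traffic: the depool/repool cycles in this log all end in a "Synchronized" entry of the same shape, so they are easy to pull out mechanically. The sketch below is only an illustration. The pattern is inferred from the "!log <user>@<host> Synchronized <path>: <message> (duration: MMm SSs)" lines visible here, not taken from scap itself, and the function name `parse_sync` is hypothetical.

```python
import re

# Inferred from the "Synchronized" lines in this log; not an official format.
SYNC_RE = re.compile(
    r"!log (?P<user>\S+)@(?P<host>\S+) Synchronized (?P<path>\S+): "
    r"(?P<message>.*) \(duration: (?P<minutes>\d+)m (?P<seconds>\d+)s\)"
)

def parse_sync(line):
    """Return the fields of one sync entry as a dict, or None if the line is not one."""
    match = SYNC_RE.search(line)
    if match is None:
        return None
    return {
        "user": match.group("user"),
        "host": match.group("host"),
        "path": match.group("path"),
        "message": match.group("message"),
        # Total duration in seconds, e.g. "00m 54s" becomes 54.
        "duration_s": int(match.group("minutes")) * 60 + int(match.group("seconds")),
    }
```

Plain `!log` entries without "Synchronized" (for example the manual "Stop replication in sync" notes) return None, which keeps hand-written notes apart from config syncs.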
[09:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:45] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[10:02:54] !log restarting es2001
[10:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:32] RECOVERY - Check Varnish expiry mailbox lag on cp4026 is OK: OK: expiry mailbox lag is 0
[10:09:18] Operations, media-storage: xfs_db blocked / timeout on ms-be2023 - https://phabricator.wikimedia.org/T185298#3912207 (fgiunchedi)
[10:10:59] Operations, ops-codfw, DBA: db2036 storage issues? (mysql crashed, installer issues) - https://phabricator.wikimedia.org/T185294#3912217 (Marostegui) Probably we should try to upgrade BIOS, raid controller etc...
[10:19:05] !log stop mariadb at db2018 to clone it away
[10:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:29] (CR) Mark Bergsma: Support per-service-IP BGP MED values (3 comments) [debs/pybal] - https://gerrit.wikimedia.org/r/393097 (https://phabricator.wikimedia.org/T165764) (owner: Mark Bergsma)
[10:32:49] Operations, MediaWiki-Configuration, User-Joe, discovery-system: Prepare conftool for safely editing mediawiki-config values - https://phabricator.wikimedia.org/T185080#3912245 (Joe)
[10:37:30] (CR) Jcrespo: [C: 2] mariadb: Enable notifications on es2001-4 and set default behaviour [puppet] - https://gerrit.wikimedia.org/r/404768 (owner: Jcrespo)
[10:38:43] (PS2) Jcrespo: mariadb: Enable notifications on es2001-4 and set default behaviour [puppet] - https://gerrit.wikimedia.org/r/404768
[10:44:58] (PS1) Elukey: profile::hadoop::worker: install spark2 package after Hive config [puppet] - https://gerrit.wikimedia.org/r/405263 (https://phabricator.wikimedia.org/T166248)
[10:52:23] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler02/9790/analytics1040.eqiad.wmnet/ looks good" [puppet] -
https://gerrit.wikimedia.org/r/405263 (https://phabricator.wikimedia.org/T166248) (owner: Elukey)
[10:53:17] !log updated tor packages on apt.wikimedia.org to 0.3.2.9-1~d80
[10:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:40] (CR) Elukey: [C: 2] profile::hadoop::worker: install spark2 package after Hive config [puppet] - https://gerrit.wikimedia.org/r/405263 (https://phabricator.wikimedia.org/T166248) (owner: Elukey)
[10:55:43] (PS2) Elukey: profile::hadoop::worker: install spark2 package after Hive config [puppet] - https://gerrit.wikimedia.org/r/405263 (https://phabricator.wikimedia.org/T166248)
[10:55:46] !log restarting es2002
[10:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:22] (PS1) Ladsgroup: labs: Change some configs of wikibase to the new repo handling [mediawiki-config] - https://gerrit.wikimedia.org/r/405266 (https://phabricator.wikimedia.org/T185200)
[11:00:29] (PS1) Marostegui: db-eqiad.php: Depool db1065 [mediawiki-config] - https://gerrit.wikimedia.org/r/405267 (https://phabricator.wikimedia.org/T162807)
[11:02:31] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1065 [mediawiki-config] - https://gerrit.wikimedia.org/r/405267 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[11:04:14] (Merged) jenkins-bot: db-eqiad.php: Depool db1065 [mediawiki-config] - https://gerrit.wikimedia.org/r/405267 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[11:05:53] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1065 - T162807 (duration: 00m 56s)
[11:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:06] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[11:06:52] (CR) jenkins-bot: db-eqiad.php: Depool db1065 [mediawiki-config] - https://gerrit.wikimedia.org/r/405267
(https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[11:07:23] (PS1) Jcrespo: mariadb: Decommission old codfw masters [mediawiki-config] - https://gerrit.wikimedia.org/r/405270 (https://phabricator.wikimedia.org/T184090)
[11:07:47] (PS2) Jcrespo: mariadb: Decommission old codfw masters [mediawiki-config] - https://gerrit.wikimedia.org/r/405270 (https://phabricator.wikimedia.org/T184090)
[11:08:05] (CR) Marostegui: [C: 1] mariadb: Decommission old codfw masters [mediawiki-config] - https://gerrit.wikimedia.org/r/405270 (https://phabricator.wikimedia.org/T184090) (owner: Jcrespo)
[11:08:36] jynus: actually, the commit says db2028 too, but I don't see it being removed
[11:08:47] yeah, I think I removed it already
[11:08:57] so technically it removes its 0 references
[11:09:01] :)
[11:09:03] which was the idea of the patch
[11:09:29] unless you tell me there is references to db2028
[11:09:46] No no, there are not
[11:10:00] but I think I cleaned up s6 beforehand, then I realized it was easier to do all at once
[11:10:29] yep, the +1 is correct then :)
[11:11:48] did you deploy your change already?
[11:12:08] it was the 12:05 one, right?
[11:12:15] (CR) Jcrespo: [C: 2] mariadb: Decommission old codfw masters [mediawiki-config] - https://gerrit.wikimedia.org/r/405270 (https://phabricator.wikimedia.org/T184090) (owner: Jcrespo)
[11:14:15] (Merged) jenkins-bot: mariadb: Decommission old codfw masters [mediawiki-config] - https://gerrit.wikimedia.org/r/405270 (https://phabricator.wikimedia.org/T184090) (owner: Jcrespo)
[11:14:41] yeah mine is deployed
[11:16:55] (CR) jenkins-bot: mariadb: Decommission old codfw masters [mediawiki-config] - https://gerrit.wikimedia.org/r/405270 (https://phabricator.wikimedia.org/T184090) (owner: Jcrespo)
[11:18:50] !log jynus@tin Synchronized wmf-config/db-codfw.php: Decommission old codfw masters (duration: 00m 56s)
[11:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:58] !log upgrading tor on radium to 0.3.2.9
[11:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:51] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Decommission old codfw masters (duration: 00m 55s)
[11:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:22] (CR) Ladsgroup: [C: 2] labs: Change some configs of wikibase to the new repo handling [mediawiki-config] - https://gerrit.wikimedia.org/r/405266 (https://phabricator.wikimedia.org/T185200) (owner: Ladsgroup)
[11:32:02] PROBLEM - cassandra-c CQL 10.64.48.140:9042 on restbase1015 is CRITICAL: connect to address 10.64.48.140 and port 9042: Connection refused
[11:32:13] expected ^
[11:32:27] (Merged) jenkins-bot: labs: Change some configs of wikibase to the new repo handling [mediawiki-config] - https://gerrit.wikimedia.org/r/405266 (https://phabricator.wikimedia.org/T185200) (owner: Ladsgroup)
[11:32:38] Operations, media-storage: ms-be2023 unresponsive while rebuilding one disk - https://phabricator.wikimedia.org/T185306#3912379 (fgiunchedi)
[11:32:41] (CR) jenkins-bot: labs:
Change some configs of wikibase to the new repo handling [mediawiki-config] - https://gerrit.wikimedia.org/r/405266 (https://phabricator.wikimedia.org/T185200) (owner: Ladsgroup)
[11:33:02] Operations, hardware-requests: Replacement hardware for cumin masters - https://phabricator.wikimedia.org/T178392#3912394 (MoritzMuehlenhoff) The hardware configuration from T181419 seems perfectly fine for Cumin masters. I don't have a good estimate how much cheaper a single CPU/32 GB machine would be c...
[11:33:22] Operations, media-storage: ms-be2023 unresponsive while rebuilding one disk - https://phabricator.wikimedia.org/T185306#3912395 (fgiunchedi)
[11:33:36] rebased it tin, not going to deploy as it's labs only
[11:41:00] (PS1) Jcrespo: mariadb: Decommission old codfw masters [puppet] - https://gerrit.wikimedia.org/r/405273 (https://phabricator.wikimedia.org/T184090)
[11:41:46] (CR) Jcrespo: [C: -1] "We will wait a bit until the db2018 cloning to db2036 finishes." [puppet] - https://gerrit.wikimedia.org/r/405273 (https://phabricator.wikimedia.org/T184090) (owner: Jcrespo)
[11:46:57] !log installing sensible-utils security update
[11:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:48:14] (PS3) Jcrespo: Add Proxysql creation debian package script [software] - https://gerrit.wikimedia.org/r/404153
[11:48:16] (PS1) Jcrespo: mariadb: Decommission old codfw masters [software] - https://gerrit.wikimedia.org/r/405274
[11:48:18] (PS1) Jcrespo: mariadb: Remove references to labsdb1001 and labsdb1003 [software] - https://gerrit.wikimedia.org/r/405275 (https://phabricator.wikimedia.org/T184832)
[11:50:32] PROBLEM - HHVM rendering on mw2211 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:50:47] (PS1) Jcrespo: mariadb: Remove references to db1031 [software] - https://gerrit.wikimedia.org/r/405276 (https://phabricator.wikimedia.org/T184054)
[11:51:22] RECOVERY - HHVM rendering on
mw2211 is OK: HTTP OK: HTTP/1.1 200 OK - 76660 bytes in 0.304 second response time
[11:52:58] (CR) Marostegui: [C: 1] mariadb: Remove references to labsdb1001 and labsdb1003 [software] - https://gerrit.wikimedia.org/r/405275 (https://phabricator.wikimedia.org/T184832) (owner: Jcrespo)
[11:55:21] (CR) Jcrespo: [C: 2] mariadb: Remove references to db1031 [software] - https://gerrit.wikimedia.org/r/405276 (https://phabricator.wikimedia.org/T184054) (owner: Jcrespo)
[11:55:27] (PS2) Jcrespo: mariadb: Remove references to db1031 [software] - https://gerrit.wikimedia.org/r/405276 (https://phabricator.wikimedia.org/T184054)
[11:57:03] PROBLEM - DPKG on mw2208 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[11:58:11] mw2208 is me
[11:58:12] PROBLEM - HHVM rendering on mw2208 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time
[11:58:13] PROBLEM - Nginx local proxy to apache on mw2208 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.150 second response time
[11:58:30] forgot to downtime that one
[11:59:13] RECOVERY - HHVM rendering on mw2208 is OK: HTTP OK: HTTP/1.1 200 OK - 76660 bytes in 1.136 second response time
[11:59:22] RECOVERY - Nginx local proxy to apache on mw2208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 5.751 second response time
[11:59:34] (PS9) Mark Bergsma: Support per-service-IP BGP MED values [debs/pybal] - https://gerrit.wikimedia.org/r/393097 (https://phabricator.wikimedia.org/T165764)
[12:00:03] RECOVERY - DPKG on mw2208 is OK: All packages OK
[12:01:50] (PS10) Mark Bergsma: Support per-service-IP BGP MED values [debs/pybal] - https://gerrit.wikimedia.org/r/393097 (https://phabricator.wikimedia.org/T165764)
[12:03:32] !log installing imagemagick security updates
[12:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:57] (PS1) Jcrespo: mariadb: Decommission db1031
[puppet] - https://gerrit.wikimedia.org/r/405278 (https://phabricator.wikimedia.org/T184054)
[12:09:59] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - https://gerrit.wikimedia.org/r/405279
[12:10:06] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - https://gerrit.wikimedia.org/r/405279
[12:12:38] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - https://gerrit.wikimedia.org/r/405279 (owner: Marostegui)
[12:13:56] (CR) Jcrespo: [C: 2] mariadb: Decommission db1031 [puppet] - https://gerrit.wikimedia.org/r/405278 (https://phabricator.wikimedia.org/T184054) (owner: Jcrespo)
[12:14:02] (PS2) Jcrespo: mariadb: Decommission db1031 [puppet] - https://gerrit.wikimedia.org/r/405278 (https://phabricator.wikimedia.org/T184054)
[12:14:05] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - https://gerrit.wikimedia.org/r/405279 (owner: Marostegui)
[12:15:14] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1065 - T162807 (duration: 00m 56s)
[12:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:15:25] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[12:15:26] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - https://gerrit.wikimedia.org/r/405279 (owner: Marostegui)
[12:17:32] PROBLEM - HHVM rendering on mw2221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:18:22] RECOVERY - HHVM rendering on mw2221 is OK: HTTP OK: HTTP/1.1 200 OK - 76660 bytes in 0.305 second response time
[12:25:42] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1031.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1031.eqiad.wmnet (110 Connection
timed out) [12:26:26] 10Operations, 10ops-eqiad, 10DBA: Decommission db1029 and db1031 - https://phabricator.wikimedia.org/T184054#3912507 (10jcrespo) [12:28:33] ACKNOWLEDGEMENT - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1031.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1031.eqiad.wmnet (110 Connection timed out) Jcrespo about to migrate its master [12:30:21] 10Operations, 10ops-eqiad, 10DBA: Decommission db1029 and db1031 - https://phabricator.wikimedia.org/T184054#3912510 (10jcrespo) a:05jcrespo>03Cmjohnson Chris, these 2 are ready to be unracked, or whatever their end-of-life handling is (after being wiped). [12:35:42] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 13 probes of 290 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [12:37:33] PROBLEM - puppet last run on kubestagetcd1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:43:43] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [12:50:27] (03CR) 10Jcrespo: [C: 032] mariadb: Decommission old codfw masters [software] - 10https://gerrit.wikimedia.org/r/405274 (owner: 10Jcrespo) [12:50:34] (03PS2) 10Jcrespo: mariadb: Decommission old codfw masters [software] - 10https://gerrit.wikimedia.org/r/405274 [12:50:40] (03CR) 10Jcrespo: [V: 032 C: 032] mariadb: Decommission old codfw masters [software] - 10https://gerrit.wikimedia.org/r/405274 (owner: 10Jcrespo) [12:50:43] (03PS1) 10Ladsgroup: Revert "labs: Change some configs of wikibase to the new repo handling" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405283 [12:50:47] (03CR) 10Ladsgroup: [C: 032] Revert "labs: Change some configs of wikibase to the new repo handling" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405283 (owner: 10Ladsgroup) [12:51:03] (03PS2) 10Jcrespo: mariadb: Remove references to labsdb1001 and labsdb1003 [software] - 10https://gerrit.wikimedia.org/r/405275 (https://phabricator.wikimedia.org/T184832) [12:51:19] (03CR) 10Jcrespo: [V: 032 C: 032] mariadb: Remove references to labsdb1001 and labsdb1003 [software] - 10https://gerrit.wikimedia.org/r/405275 (https://phabricator.wikimedia.org/T184832) (owner: 10Jcrespo) [12:54:05] (03Merged) 10jenkins-bot: Revert "labs: Change some configs of wikibase to the new repo handling" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405283 (owner: 10Ladsgroup) [12:55:27] (03CR) 10jenkins-bot: Revert "labs: Change some configs of wikibase to the new repo handling" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405283 (owner: 10Ladsgroup) [13:07:33] RECOVERY - puppet last run on kubestagetcd1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:07:46] 10Operations, 10Citoid, 10VisualEditor, 10Services (watching), 10User-mobrovac: Wiley requests for DOI and some other 
publishers don't work in production - https://phabricator.wikimedia.org/T165105#3257087 (10The_RedBurn) Since there's "some other publishers" in the title, here are two examples of non wo... [13:12:11] 10Operations, 10Analytics-Kanban, 10monitoring, 10netops, 10User-Elukey: Pull netflow data in realtime from Kafka via Tranquillity/Spark - https://phabricator.wikimedia.org/T181036#3912575 (10faidon) Thanks for working on this task, very much appreciated! My idea was for "tag" to be used for our differe... [13:28:41] (03CR) 10Mark Bergsma: [C: 032] Support per-service-IP BGP MED values [debs/pybal] - 10https://gerrit.wikimedia.org/r/393097 (https://phabricator.wikimedia.org/T165764) (owner: 10Mark Bergsma) [13:29:08] (03Merged) 10jenkins-bot: Support per-service-IP BGP MED values [debs/pybal] - 10https://gerrit.wikimedia.org/r/393097 (https://phabricator.wikimedia.org/T165764) (owner: 10Mark Bergsma) [13:37:16] 10Operations, 10media-storage: xfs_db blocked / timeout on ms-be2023 - https://phabricator.wikimedia.org/T185298#3912589 (10fgiunchedi) [13:37:49] (03CR) 10Alexandros Kosiaris: [C: 032] Include scaffold for service-checker helm tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/405016 (owner: 10Dduvall) [13:38:41] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] "Looks fine to me and a simple `helm test` does what this is supposed to make it do given the last of the container image for now" [deployment-charts] - 10https://gerrit.wikimedia.org/r/405016 (owner: 10Dduvall) [13:46:12] (03PS1) 10Mark Bergsma: Improve and clarify error handling of DNS lookups [debs/pybal] - 10https://gerrit.wikimedia.org/r/405284 [13:49:37] (03PS3) 10Alexandros Kosiaris: admin: Use the debian staff group for ops [puppet] - 10https://gerrit.wikimedia.org/r/331602 [13:50:18] 10Operations, 10LDAP, 10Release-Engineering-Team (Watching / External): Create 'releng' LDAP group - https://phabricator.wikimedia.org/T183507#3912615 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [13:52:31] 
(03CR) 10Alexandros Kosiaris: [C: 031] "With some newly acquired knowledge on my part of the admin module, here's the new version. The fact the `staff` group is used is rather hi" [puppet] - 10https://gerrit.wikimedia.org/r/331602 (owner: 10Alexandros Kosiaris) [13:55:46] 10Operations, 10LDAP, 10Release-Engineering-Team (Watching / External): Create 'releng' LDAP group - https://phabricator.wikimedia.org/T183507#3912637 (10MoritzMuehlenhoff) 05Open>03Resolved Done, see below. @greg, please also add a description of the purpose (and which privileges this group entails) to... [13:59:19] (03PS1) 10Muehlenhoff: Cover cn=releng in offboarding script [puppet] - 10https://gerrit.wikimedia.org/r/405286 (https://phabricator.wikimedia.org/T183507) [13:59:42] (03CR) 10jerkins-bot: [V: 04-1] Cover cn=releng in offboarding script [puppet] - 10https://gerrit.wikimedia.org/r/405286 (https://phabricator.wikimedia.org/T183507) (owner: 10Muehlenhoff) [14:01:16] (03PS2) 10Muehlenhoff: Cover cn=releng in offboarding script [puppet] - 10https://gerrit.wikimedia.org/r/405286 (https://phabricator.wikimedia.org/T183507) [14:07:28] (03CR) 10Muehlenhoff: [C: 032] Cover cn=releng in offboarding script [puppet] - 10https://gerrit.wikimedia.org/r/405286 (https://phabricator.wikimedia.org/T183507) (owner: 10Muehlenhoff) [14:28:54] 10Operations, 10ops-codfw: mw2140 unresponsive, mgmt not accessible - https://phabricator.wikimedia.org/T184788#3912707 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff>03elukey [14:30:56] 10Operations, 10ops-codfw: mw2140 unresponsive, mgmt not accessible - https://phabricator.wikimedia.org/T184788#3896291 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['mw2140.codfw.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/201... 
[14:31:11] 10Operations, 10ops-codfw: mw2140 unresponsive, mgmt not accessible - https://phabricator.wikimedia.org/T184788#3912713 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw2140.codfw.wmnet'] ``` Of which those **FAILED**: ``` ['mw2140.codfw.wmnet'] ``` [14:31:28] !log installing krb5 updates from jessie point release [14:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:19] 10Operations, 10ops-codfw: mw2140 unresponsive, mgmt not accessible - https://phabricator.wikimedia.org/T184788#3912716 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['mw2140.codfw.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/201... [14:39:35] (03PS1) 10Jcrespo: mariadb: Increase db1075 connection rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405287 [14:40:21] (03CR) 10Jcrespo: [C: 031] "https://logstash.wikimedia.org/goto/50f235c972242f548b804b87f5497ea1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405287 (owner: 10Jcrespo) [14:47:32] (03PS3) 10Alexandros Kosiaris: Create module for docker-pkg software [puppet] - 10https://gerrit.wikimedia.org/r/388075 [14:48:52] !log Running migrateArchiveText.php on testwiki (T184629) [14:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:06] T184629: Run maintenance/migrateArchiveText.php on all wikis - https://phabricator.wikimedia.org/T184629 [14:49:14] Hmm, apparently that was pointless. 
[14:55:38] (03PS4) 10Alexandros Kosiaris: Create module for docker-pkg software [puppet] - 10https://gerrit.wikimedia.org/r/388075 [14:57:32] RECOVERY - cassandra-c CQL 10.64.48.140:9042 on restbase1015 is OK: TCP OK - 0.001 second response time on 10.64.48.140 port 9042 [15:01:30] (03PS1) 10Elukey: Update mw2140 MAC address after mainboard replacement [puppet] - 10https://gerrit.wikimedia.org/r/405291 (https://phabricator.wikimedia.org/T184788) [15:02:08] moritzm: if you have time --^ [15:03:32] having a look [15:07:18] (03CR) 10Muehlenhoff: Update mw2140 MAC address after mainboard replacement (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/405291 (https://phabricator.wikimedia.org/T184788) (owner: 10Elukey) [15:07:59] (03CR) 10Alexandros Kosiaris: [C: 032] "Some months after the initial upload, I 've amended the change, run PCC against it for contint1001 and boron and the results are a noop at" [puppet] - 10https://gerrit.wikimedia.org/r/388075 (owner: 10Alexandros Kosiaris) [15:10:07] (03CR) 10Alexandros Kosiaris: [C: 031] "+1 but let's see if we can get service-checker built for stretch first" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/405205 (https://phabricator.wikimedia.org/T184220) (owner: 10Dduvall) [15:19:20] (03PS1) 10Mark Bergsma: Fix RuntimeError in Server.merge [debs/pybal] - 10https://gerrit.wikimedia.org/r/405295 [15:19:22] (03PS1) 10Mark Bergsma: Test all branches of the Server constructor [debs/pybal] - 10https://gerrit.wikimedia.org/r/405296 [15:19:24] (03PS1) 10Mark Bergsma: Add unit testing of monitor creation/loading [debs/pybal] - 10https://gerrit.wikimedia.org/r/405297 [15:19:26] (03PS1) 10Mark Bergsma: Stop pybal on all failures to properly parse 'monitors' [debs/pybal] - 10https://gerrit.wikimedia.org/r/405298 [15:19:36] ema: ^ ;) [15:23:34] !log bootstrap cassandra-a on restbase2010 - T184100 [15:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:45] T184100: 
Reprovision legacy Cassandra nodes into new cluster - https://phabricator.wikimedia.org/T184100 [15:24:51] !log Running migrateArchiveText.php on metawiki (T184629) [15:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:03] T184629: Run maintenance/migrateArchiveText.php on all wikis - https://phabricator.wikimedia.org/T184629 [15:32:54] (03CR) 10Jcrespo: [C: 031] mariadb: Decommission old codfw masters [puppet] - 10https://gerrit.wikimedia.org/r/405273 (https://phabricator.wikimedia.org/T184090) (owner: 10Jcrespo) [15:32:59] (03PS2) 10Jcrespo: mariadb: Decommission old codfw masters [puppet] - 10https://gerrit.wikimedia.org/r/405273 (https://phabricator.wikimedia.org/T184090) [15:35:59] 10Operations, 10ops-codfw, 10Patch-For-Review: mw2140 unresponsive, mgmt not accessible - https://phabricator.wikimedia.org/T184788#3912809 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw2140.codfw.wmnet'] ``` Of which those **FAILED**: ``` ['mw2140.codfw.wmnet'] ``` [15:37:33] (03CR) 10Marostegui: "Commit message says db1075 which is the master, but the actual change is db1072. 
I assume you meant db1072, but double checking" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405287 (owner: 10Jcrespo) [15:38:54] !log Running migrateArchiveText.php on all wikis that need it (T184629) [15:39:03] (03PS2) 10Jcrespo: mariadb: Increase db1075 connection rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405287 [15:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:05] (03PS1) 10Jcrespo: mariadb: Pool db1067 while db1089 is depooled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405300 [15:39:06] T184629: Run maintenance/migrateArchiveText.php on all wikis - https://phabricator.wikimedia.org/T184629 [15:39:31] (03CR) 10Jcrespo: [C: 032] mariadb: Increase db1075 connection rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405287 (owner: 10Jcrespo) [15:40:26] (03CR) 10Jcrespo: [C: 032] mariadb: Pool db1067 while db1089 is depooled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405300 (owner: 10Jcrespo) [15:41:03] (03CR) 10jenkins-bot: mariadb: Increase db1075 connection rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405287 (owner: 10Jcrespo) [15:42:10] (03PS1) 10Giuseppe Lavagetto: cli.tool: drop the "find" interface [software/conftool] - 10https://gerrit.wikimedia.org/r/405301 [15:42:12] (03PS1) 10Giuseppe Lavagetto: Add preemptive validation. 
[software/conftool] - 10https://gerrit.wikimedia.org/r/405302 (https://phabricator.wikimedia.org/T185080) [15:42:14] (03PS1) 10Giuseppe Lavagetto: Refactor conftool.action [software/conftool] - 10https://gerrit.wikimedia.org/r/405303 [15:42:29] (03CR) 10Jcrespo: [C: 032] mariadb: Decommission old codfw masters [puppet] - 10https://gerrit.wikimedia.org/r/405273 (https://phabricator.wikimedia.org/T184090) (owner: 10Jcrespo) [15:43:04] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Tune s1 and s3 database weights (duration: 00m 57s) [15:43:13] (03PS2) 10Elukey: Update mw2140 MAC address after mainboard replacement [puppet] - 10https://gerrit.wikimedia.org/r/405291 (https://phabricator.wikimedia.org/T184788) [15:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:36] (03PS4) 10Alexandros Kosiaris: grafana: Add migration script from proxy to LDAP auth [puppet] - 10https://gerrit.wikimedia.org/r/404651 (https://phabricator.wikimedia.org/T170150) [15:43:38] (03PS8) 10Alexandros Kosiaris: grafana: Enable grafana LDAP in production [puppet] - 10https://gerrit.wikimedia.org/r/404321 (https://phabricator.wikimedia.org/T170150) [15:47:01] 10Operations, 10Analytics, 10Code-Stewardship-Reviews, 10Wikimedia-IRC-RC-Server: IRC RecentChanges feed: code stewardship request - https://phabricator.wikimedia.org/T185319#3912831 (10faidon) [15:50:14] (03PS3) 10Elukey: Update mw2140 MAC address after mainboard replacement [puppet] - 10https://gerrit.wikimedia.org/r/405291 (https://phabricator.wikimedia.org/T184788) [15:50:34] jynus: And I used screen this time ;) [15:51:45] jynus: I am going to repool db1089, I am done with it [15:52:34] (03PS1) 10Marostegui: db-eqiad.php: Slowly pool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405305 [15:52:35] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
[15:54:17] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly pool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405305 (owner: 10Marostegui) [15:55:42] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly pool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405305 (owner: 10Marostegui) [15:56:23] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1109" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405307 [15:56:25] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1109" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405307 [15:56:52] jynus: is "PROBLEM - Unmerged changes on repository puppet on puppetmaster1001" due to your change? (Just asking, no rush :) [15:57:02] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1089 - T162807 (duration: 00m 56s) [15:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:14] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [15:57:39] I am at "Merge these changes? (yes/no)?" [15:58:21] "No changes to merge." [15:58:43] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1109" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405307 (owner: 10Marostegui) [15:58:44] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. 
[16:00:08] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1109" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405307 (owner: 10Marostegui) [16:00:34] (03PS1) 10Mark Bergsma: Fix Server testInitialize test case [debs/pybal] - 10https://gerrit.wikimedia.org/r/405308 [16:01:16] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1109 - T174569 (duration: 00m 56s) [16:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:27] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [16:02:46] (03PS2) 10Mark Bergsma: Fix Server testInitialize test case [debs/pybal] - 10https://gerrit.wikimedia.org/r/405308 [16:03:07] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405309 [16:06:01] (03PS8) 10Rush: rabbitmq: handling users and initial setup [puppet] - 10https://gerrit.wikimedia.org/r/403202 [16:06:22] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405309 (owner: 10Marostegui) [16:06:37] (03CR) 10jerkins-bot: [V: 04-1] rabbitmq: handling users and initial setup [puppet] - 10https://gerrit.wikimedia.org/r/403202 (owner: 10Rush) [16:08:11] (03PS1) 10Filippo Giunchedi: restbase: reprovision restbase200[789] [puppet] - 10https://gerrit.wikimedia.org/r/405312 (https://phabricator.wikimedia.org/T184100) [16:08:15] (03PS9) 10Rush: rabbitmq: handling users and initial setup [puppet] - 10https://gerrit.wikimedia.org/r/403202 [16:08:29] (03CR) 10jerkins-bot: [V: 04-1] restbase: reprovision restbase200[789] [puppet] - 10https://gerrit.wikimedia.org/r/405312 (https://phabricator.wikimedia.org/T184100) (owner: 10Filippo Giunchedi) [16:09:28] (03PS2) 10Filippo Giunchedi: restbase: reprovision restbase200[789] [puppet] - 10https://gerrit.wikimedia.org/r/405312 (https://phabricator.wikimedia.org/T184100) [16:09:37] (03Merged) 
10jenkins-bot: db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405309 (owner: 10Marostegui) [16:10:54] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1089 - T162807 (duration: 00m 56s) [16:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:08] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [16:11:17] (03PS10) 10Rush: rabbitmq: handling users and initial setup [puppet] - 10https://gerrit.wikimedia.org/r/403202 [16:11:27] (03CR) 10Elukey: [C: 032] Update mw2140 MAC address after mainboard replacement [puppet] - 10https://gerrit.wikimedia.org/r/405291 (https://phabricator.wikimedia.org/T184788) (owner: 10Elukey) [16:11:33] (03PS4) 10Elukey: Update mw2140 MAC address after mainboard replacement [puppet] - 10https://gerrit.wikimedia.org/r/405291 (https://phabricator.wikimedia.org/T184788) [16:11:54] (03CR) 10Elukey: [V: 032 C: 032] Update mw2140 MAC address after mainboard replacement [puppet] - 10https://gerrit.wikimedia.org/r/405291 (https://phabricator.wikimedia.org/T184788) (owner: 10Elukey) [16:13:10] (03PS11) 10Rush: rabbitmq: handling users and initial setup [puppet] - 10https://gerrit.wikimedia.org/r/403202 [16:13:50] (03CR) 10Eevans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/405312 (https://phabricator.wikimedia.org/T184100) (owner: 10Filippo Giunchedi) [16:15:01] 10Operations, 10ops-codfw, 10Patch-For-Review: mw2140 unresponsive, mgmt not accessible - https://phabricator.wikimedia.org/T184788#3912975 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['mw2140.codfw.wmnet'] ``` The log can be found in `/var/lo... [16:15:59] (03CR) 10Ema: [C: 031] "Nice!" 
[debs/pybal] - 10https://gerrit.wikimedia.org/r/405308 (owner: 10Mark Bergsma) [16:17:38] (03CR) 10Filippo Giunchedi: [C: 032] restbase: reprovision restbase200[789] [puppet] - 10https://gerrit.wikimedia.org/r/405312 (https://phabricator.wikimedia.org/T184100) (owner: 10Filippo Giunchedi) [16:17:42] (03PS3) 10Filippo Giunchedi: restbase: reprovision restbase200[789] [puppet] - 10https://gerrit.wikimedia.org/r/405312 (https://phabricator.wikimedia.org/T184100) [16:20:26] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405313 [16:23:47] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Fully repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405313 (owner: 10Marostegui) [16:24:53] (03PS1) 10Volans: CHANGELOG: add changelogs for release v2.0.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/405314 [16:25:46] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405313 (owner: 10Marostegui) [16:27:06] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully repool db1089 and depool db1067 (duration: 00m 56s) [16:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:35] (03CR) 10Volans: [C: 032] CHANGELOG: add changelogs for release v2.0.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/405314 (owner: 10Volans) [16:29:27] PROBLEM - puppet last run on restbase2009 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 4 minutes ago with 4 failures. Failed resources (up to 3 shown): Package[restbase/deploy],Service[cassandra-a],Service[cassandra-b],Service[cassandra-c] [16:30:27] PROBLEM - puppet last run on restbase2008 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 5 minutes ago with 4 failures. 
Failed resources (up to 3 shown): Package[restbase/deploy],Service[cassandra-a],Service[cassandra-b],Service[cassandra-c] [16:31:09] that's me ^ known [16:31:38] PROBLEM - cassandra-b CQL 10.192.16.177:9042 on restbase2007 is CRITICAL: connect to address 10.192.16.177 and port 9042: Connection refused [16:31:38] PROBLEM - cassandra-a SSL 10.192.48.54:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [16:31:38] PROBLEM - cassandra-a service on restbase2008 is CRITICAL: NRPE: Command check_cassandra-a-state not defined [16:32:17] PROBLEM - puppet last run on restbase2007 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 7 minutes ago with 4 failures. Failed resources (up to 3 shown): Package[restbase/deploy],Service[cassandra-a],Service[cassandra-b],Service[cassandra-c] [16:32:29] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v2.0.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/405314 (owner: 10Volans) [16:33:27] PROBLEM - cassandra-b SSL 10.192.16.177:7001 on restbase2007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [16:34:05] (03CR) 10jenkins-bot: CHANGELOG: add changelogs for release v2.0.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/405314 (owner: 10Volans) [16:35:07] PROBLEM - cassandra-b service on restbase2007 is CRITICAL: NRPE: Command check_cassandra-b-state not defined [16:36:27] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/331602 (owner: 10Alexandros Kosiaris) [16:36:47] PROBLEM - cassandra-c CQL 10.192.16.178:9042 on restbase2007 is CRITICAL: connect to address 10.192.16.178 and port 9042: Connection refused [16:38:28] PROBLEM - cassandra-c SSL 10.192.16.178:7001 on restbase2007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [16:40:17] PROBLEM - cassandra-c service on restbase2007 is CRITICAL: NRPE: Command check_cassandra-c-state not 
defined [16:41:57] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: NRPE: Command check_endpoints_restbase not defined [16:44:45] (03PS1) 10Jcrespo: mariadb: Reenable db2036 notifications after maintenance [puppet] - 10https://gerrit.wikimedia.org/r/405318 (https://phabricator.wikimedia.org/T185294) [16:45:18] PROBLEM - cassandra-a CQL 10.192.16.176:9042 on restbase2007 is CRITICAL: connect to address 10.192.16.176 and port 9042: Connection refused [16:46:17] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:50:47] (03PS1) 10Volans: Upstream release v2.0.0 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/405320 [16:51:17] PROBLEM - puppet last run on netmon1002 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 3 minutes ago with 5 failures. Failed resources (up to 3 shown): Exec[create_user-replication@netmon2001],Exec[create_user-netbox@netmon2001],Exec[create_user-netbox@localhost],Exec[create_user-prometheus@localhost] [16:51:18] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational [17:01:22] (03CR) 10Jcrespo: [C: 032] mariadb: Reenable db2036 notifications after maintenance [puppet] - 10https://gerrit.wikimedia.org/r/405318 (https://phabricator.wikimedia.org/T185294) (owner: 10Jcrespo) [17:04:26] PROBLEM - HHVM rendering on mw2140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:04:48] being reimaged --^ [17:08:00] (03CR) 10Volans: [C: 032] Upstream release v2.0.0 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/405320 (owner: 10Volans) [17:08:44] 10Operations, 10monitoring, 10Patch-For-Review: Netbox: postgres cannot be restarted w/ current config - https://phabricator.wikimedia.org/T184634#3913162 (10ayounsi) Alex found the issue! The data was in /var/lib/postgres/9.6 (default location). The restart made postgres use the "proper" location (set by p... 
[17:09:18] 10Operations, 10monitoring, 10Patch-For-Review: Netbox: postgres cannot be restarted w/ current config - https://phabricator.wikimedia.org/T184634#3913164 (10ayounsi) [17:10:52] (03Merged) 10jenkins-bot: Upstream release v2.0.0 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/405320 (owner: 10Volans) [17:11:16] RECOVERY - puppet last run on netmon1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:11:19] !log stopping mariadb on db2016,17,18,19,23,28&29 T184090 [17:11:19] 10Operations, 10monitoring, 10Patch-For-Review: Netbox: postgres cannot be restarted w/ current config - https://phabricator.wikimedia.org/T184634#3913165 (10Volans) Nice! So I guess that our puppetization is not correct and should restart Postgres after the first configuration change to ensure that the new... [17:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:30] T184090: Decommission db2016, db2017, db2018, db2019, db2023, db2028, db2029 - https://phabricator.wikimedia.org/T184090 [17:19:08] 10Operations, 10ops-codfw, 10netops: rack spare switches in c1-codfw - https://phabricator.wikimedia.org/T185336#3913206 (10RobH) p:05Triage>03Normal [17:19:36] RECOVERY - HHVM rendering on mw2140 is OK: HTTP OK: HTTP/1.1 200 OK - 76671 bytes in 8.432 second response time [17:26:51] 10Operations, 10ops-codfw, 10Patch-For-Review: mw2140 unresponsive, mgmt not accessible - https://phabricator.wikimedia.org/T184788#3913225 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw2140.codfw.wmnet'] ``` and were **ALL** successful. [17:29:08] (03CR) 10Ema: [C: 031] "One nit, looks good otherwise." (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/405284 (owner: 10Mark Bergsma) [17:30:39] 10Operations, 10ops-codfw, 10Patch-For-Review: mw2140 unresponsive, mgmt not accessible - https://phabricator.wikimedia.org/T184788#3913229 (10elukey) 05Open>03Resolved Pooled and working correctly, closing! 
[17:31:26] (03CR) 10Ema: Fix RuntimeError in Server.merge (032 comments) [debs/pybal] - 10https://gerrit.wikimedia.org/r/405295 (owner: 10Mark Bergsma) [17:31:58] 10Operations, 10ops-eqiad, 10netops: rack spare switches in c1-eqiad - https://phabricator.wikimedia.org/T185337#3913231 (10RobH) p:05Triage>03Normal [17:33:10] (03CR) 10Ema: [C: 031] Test all branches of the Server constructor [debs/pybal] - 10https://gerrit.wikimedia.org/r/405296 (owner: 10Mark Bergsma) [17:33:47] (03CR) 10Ema: [C: 031] Add unit testing of monitor creation/loading [debs/pybal] - 10https://gerrit.wikimedia.org/r/405297 (owner: 10Mark Bergsma) [17:40:23] (03CR) 10Ema: Stop pybal on all failures to properly parse 'monitors' (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/405298 (owner: 10Mark Bergsma) [17:42:34] 10Operations, 10ops-codfw, 10DBA: Decommission db2016, db2017, db2018, db2019, db2023, db2028, db2029 - https://phabricator.wikimedia.org/T184090#3913257 (10jcrespo) a:05jcrespo>03Papaul Papaul, these 7 old hosts are ready to go, and we should make room for others. 
[17:43:45] (03PS1) 10Anomie: Remove duplicate 'hiwiktionary' in s3.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405337 [17:44:13] PROBLEM - Juniper alarms on mr1-eqiad is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.154.199 [17:44:18] 10Operations, 10Cloud-VPS, 10DNS, 10Beta-Cluster-reproducible: Create some mechanism for instances in projects to modify the project Designate records - https://phabricator.wikimedia.org/T184245#3913265 (10ema) [17:45:53] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.199 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [17:47:19] (03CR) 10Reedy: [C: 032] Remove duplicate 'hiwiktionary' in s3.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405337 (owner: 10Anomie) [17:48:50] (03Merged) 10jenkins-bot: Remove duplicate 'hiwiktionary' in s3.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405337 (owner: 10Anomie) [17:48:54] 10Operations, 10Analytics-Kanban, 10monitoring, 10netops, 10User-Elukey: Pull netflow data in realtime from Kafka via Tranquillity/Spark - https://phabricator.wikimedia.org/T181036#3913277 (10elukey) Thanks for the explanation! I have another question: is netflow data ending up in the `neflow` kafka to... 
[17:49:53] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0 [17:50:03] RECOVERY - Juniper alarms on mr1-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [17:50:24] !log reedy@tin Synchronized dblists/s3.dblist: alphasort and remove dupes (duration: 01m 01s) [17:50:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:52] 10Operations, 10ops-eqiad, 10DBA: Decommission db1029 and db1031 - https://phabricator.wikimedia.org/T184054#3913281 (10jcrespo) [17:52:34] (03PS1) 10Jcrespo: mariadb: Exclude labsdb1001,2,3 from megacli policy check [puppet] - 10https://gerrit.wikimedia.org/r/405338 (https://phabricator.wikimedia.org/T142807) [17:52:40] (03CR) 10Dzahn: [C: 031] "looks to me that paladox did what was requested in the earlier reviewer comments. still supports trusty and should (per bd808)" [puppet] - 10https://gerrit.wikimedia.org/r/405208 (https://phabricator.wikimedia.org/T180377) (owner: 10Paladox) [17:53:53] 10Operations, 10Analytics-Kanban, 10monitoring, 10netops, 10User-Elukey: Pull netflow data in realtime from Kafka via Tranquillity/Spark - https://phabricator.wikimedia.org/T181036#3777449 (10Ottomata) HMMM. If this is JSON data, and the schema is consistent, we could use JSONRefine to build the table,... [17:54:23] (03PS2) 10Jcrespo: mariadb: Exclude labsdb1001,2,3 from megacli policy check [puppet] - 10https://gerrit.wikimedia.org/r/405338 (https://phabricator.wikimedia.org/T142807) [17:55:36] !log labcontrol1001:~# ip addr del 208.80.154.94/32 dev eth0 [17:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:29] 10Operations, 10Analytics-Kanban, 10monitoring, 10netops, 10User-Elukey: Pull netflow data in realtime from Kafka via Tranquillity/Spark - https://phabricator.wikimedia.org/T181036#3913296 (10elukey) >>! In T181036#3913291, @Ottomata wrote: > HMMM. 
If this is JSON data, and the schema is consistent, we... [17:58:30] !log labcontrol1002:~# ip addr del 208.80.154.102/32 dev eth0 [17:58:39] !log labcontrol1002:~# ip addr del 208.80.154.12/32 dev eth0 [17:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:06] (03PS3) 10Jcrespo: mariadb: Exclude labsdb1001,2,3 from megacli policy check [puppet] - 10https://gerrit.wikimedia.org/r/405338 (https://phabricator.wikimedia.org/T142807) [18:07:06] 10Operations, 10ops-codfw, 10DBA: Decommission db2016, db2017, db2018, db2019, db2023, db2028, db2029 - https://phabricator.wikimedia.org/T184090#3913305 (10Papaul) a:05Papaul>03jcrespo @jcrespo thanks. Can you please do the steps below and assign the task back to me. Thanks Disable puppet on host Rem... [18:07:31] (03PS4) 10Jcrespo: mariadb: Exclude labsdb1001,2,3 from megacli policy check [puppet] - 10https://gerrit.wikimedia.org/r/405338 (https://phabricator.wikimedia.org/T142807) [18:10:16] 10Operations, 10ops-codfw, 10DBA: Decommission db2016, db2017, db2018, db2019, db2023, db2028, db2029 - https://phabricator.wikimedia.org/T184090#3913309 (10RobH) a:05jcrespo>03RobH Please note that we shoudl do those steps, not Jaime, since he cannot disable the switch port (which has to be done at the... [18:13:27] 10Operations, 10ops-codfw, 10DBA: Decommission db2016, db2017, db2018, db2019, db2023, db2028, db2029 - https://phabricator.wikimedia.org/T184090#3913318 (10jcrespo) Note I have not a problem to do those if told, but specially disabling puppet should be done just before literally shutting down the servers an... 
[18:19:55] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decommission db2016, db2017, db2018, db2019, db2023, db2028, db2029 - https://phabricator.wikimedia.org/T184090#3913329 (10RobH) [18:20:21] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests: Decommission db1029 and db1031 - https://phabricator.wikimedia.org/T184054#3913331 (10jcrespo) [18:21:42] (03PS5) 10Jcrespo: mariadb: Exclude labsdb1001,2,3 from megacli policy check [puppet] - 10https://gerrit.wikimedia.org/r/405338 (https://phabricator.wikimedia.org/T142807) [18:22:27] (03PS6) 10Jcrespo: mariadb: Exclude labsdb1001,2,3 from megacli policy check [puppet] - 10https://gerrit.wikimedia.org/r/405338 (https://phabricator.wikimedia.org/T142807) [18:25:06] (03PS1) 10RobH: Decommission db2016, db2017, db2018, db2019, db2023, db2028, db2029 [puppet] - 10https://gerrit.wikimedia.org/r/405340 (https://phabricator.wikimedia.org/T184090) [18:29:32] (03PS1) 10RobH: removing dns entries for db20(1[6-9]|2[389] [dns] - 10https://gerrit.wikimedia.org/r/405342 (https://phabricator.wikimedia.org/T184090) [18:31:05] (03CR) 10Jcrespo: [C: 032] mariadb: Exclude labsdb1001,2,3 from megacli policy check [puppet] - 10https://gerrit.wikimedia.org/r/405338 (https://phabricator.wikimedia.org/T142807) (owner: 10Jcrespo) [18:31:38] (03PS1) 10Jcrespo: Revert "mariadb: Exclude labsdb1001,2,3 from megacli policy check" [puppet] - 10https://gerrit.wikimedia.org/r/405343 [18:32:19] (03PS1) 10Bearloga: Make Shiny Server a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/405344 (https://phabricator.wikimedia.org/T183984) [18:32:56] (03CR) 10Jcrespo: [C: 04-2] "Deploy once T184832 has been resolved" [puppet] - 10https://gerrit.wikimedia.org/r/405343 (owner: 10Jcrespo) [18:37:25] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db2016, db2017, db2018, db2019, db2023, db2028, db2029 - https://phabricator.wikimedia.org/T184090#3913366 (10RobH) Switch ports for later removal (once 
they are unracked): ge-6/0/0 - db2016 ge-6/0/1 - db2017 ge-6/0/... [18:39:09] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db2016, db2017, db2018, db2019, db2023, db2028, db2029 - https://phabricator.wikimedia.org/T184090#3913368 (10RobH) [18:39:51] (03PS2) 10RobH: Decommission db2016, db2017, db2018, db2019, db2023, db2028, db2029 [puppet] - 10https://gerrit.wikimedia.org/r/405340 (https://phabricator.wikimedia.org/T184090) [18:39:57] (03CR) 10RobH: [C: 032] Decommission db2016, db2017, db2018, db2019, db2023, db2028, db2029 [puppet] - 10https://gerrit.wikimedia.org/r/405340 (https://phabricator.wikimedia.org/T184090) (owner: 10RobH) [18:41:35] (03CR) 10RobH: [C: 032] removing dns entries for db20(1[6-9]|2[389] [dns] - 10https://gerrit.wikimedia.org/r/405342 (https://phabricator.wikimedia.org/T184090) (owner: 10RobH) [18:44:42] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db2016, db2017, db2018, db2019, db2023, db2028, db2029 - https://phabricator.wikimedia.org/T184090#3913381 (10RobH) [18:45:15] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db2016, db2017, db2018, db2019, db2023, db2028, db2029 - https://phabricator.wikimedia.org/T184090#3872109 (10RobH) a:05RobH>03Papaul Ok, these are all ready to have disks wiped, unracked, and racktables updated.... 
[18:45:26] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decommission db2016, db2017, db2018, db2019, db2023, db2028, db2029 - https://phabricator.wikimedia.org/T184090#3913384 (10RobH) [18:58:27] !log bootstrapping restbase2010-b - T184100 [18:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:41] T184100: Reprovision legacy Cassandra nodes into new cluster - https://phabricator.wikimedia.org/T184100 [19:19:14] (03PS2) 10Bearloga: Make Shiny Server a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/405344 (https://phabricator.wikimedia.org/T183984) [19:30:19] 10Operations, 10Analytics, 10Code-Stewardship-Reviews, 10Tools, 10Wikimedia-IRC-RC-Server: IRC RecentChanges feed: code stewardship request - https://phabricator.wikimedia.org/T185319#3913473 (10bd808) [19:31:52] 10Operations, 10Analytics-Kanban, 10monitoring, 10netops, 10User-Elukey: Pull netflow data in realtime from Kafka via Tranquillity/Spark - https://phabricator.wikimedia.org/T181036#3913475 (10ayounsi) Empty logs was due to BGP being disabled between pmacct and the router. It's now back: ``` {"tag": 0, "... 
[19:41:16] (03CR) 10jenkins-bot: mariadb: Pool db1067 while db1089 is depooled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405300 (owner: 10Jcrespo) [19:41:19] (03CR) 10jenkins-bot: db-eqiad.php: Slowly pool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405305 (owner: 10Marostegui) [19:41:20] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1109" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405307 (owner: 10Marostegui) [19:41:22] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405309 (owner: 10Marostegui) [19:41:24] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405313 (owner: 10Marostegui) [19:41:26] (03CR) 10jenkins-bot: Remove duplicate 'hiwiktionary' in s3.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405337 (owner: 10Anomie) [19:43:00] !log ms-be3003 - power up via mgmt to check if still connected and usable as temp bastion (T184936) [19:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:11] T184936: install/designate other machine as esams bastion - https://phabricator.wikimedia.org/T184936 [19:43:35] (03PS3) 10Andrew Bogott: Make Shiny Server a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/405344 (https://phabricator.wikimedia.org/T183984) (owner: 10Bearloga) [19:44:14] (03CR) 10Andrew Bogott: [C: 032] Make Shiny Server a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/405344 (https://phabricator.wikimedia.org/T183984) (owner: 10Bearloga) [19:44:37] andrewbogott: thank you very much! 
[19:44:54] sure thing [19:47:13] (03PS2) 10Zoranzoki21: Add 3 namespaces to wawiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405258 (https://phabricator.wikimedia.org/T185289) (owner: 10Jon Harald Søby) [19:49:12] (03PS3) 10Andrew Bogott: role::labs::mediawiki_vagrant: Warn if not on Jessie [puppet] - 10https://gerrit.wikimedia.org/r/405203 (https://phabricator.wikimedia.org/T180377) [19:50:57] (03CR) 10Paladox: "Fixed in https://gerrit.wikimedia.org/r/#/c/405208/" [puppet] - 10https://gerrit.wikimedia.org/r/405203 (https://phabricator.wikimedia.org/T180377) (owner: 10Andrew Bogott) [19:51:37] (03CR) 10Andrew Bogott: [C: 032] role::labs::mediawiki_vagrant: Warn if not on Jessie [puppet] - 10https://gerrit.wikimedia.org/r/405203 (https://phabricator.wikimedia.org/T180377) (owner: 10Andrew Bogott) [19:52:46] (03PS8) 10Paladox: lxc: Fix support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/405208 (https://phabricator.wikimedia.org/T180377) [19:58:12] (03PS1) 10Andrew Bogott: role::labs::mediawiki_vagrant: rephrase logic to os_version [puppet] - 10https://gerrit.wikimedia.org/r/405356 [19:58:36] (03CR) 10jerkins-bot: [V: 04-1] role::labs::mediawiki_vagrant: rephrase logic to os_version [puppet] - 10https://gerrit.wikimedia.org/r/405356 (owner: 10Andrew Bogott) [20:00:33] PROBLEM - Host ms-be3003 is DOWN: PING CRITICAL - Packet loss = 100% [20:00:41] (03PS2) 10Andrew Bogott: role::labs::mediawiki_vagrant: rephrase logic to os_version [puppet] - 10https://gerrit.wikimedia.org/r/405356 [20:01:38] (03PS2) 10Dzahn: rename ms-be3003.mgmt to bast3003.mgmt [dns] - 10https://gerrit.wikimedia.org/r/405223 (https://phabricator.wikimedia.org/T184936) [20:02:14] (03CR) 10Andrew Bogott: [C: 032] role::labs::mediawiki_vagrant: rephrase logic to os_version [puppet] - 10https://gerrit.wikimedia.org/r/405356 (owner: 10Andrew Bogott) [20:02:22] wait, why did you get added to Icinga, ms-be3003 [20:02:26] shouldnt [20:04:33] ACKNOWLEDGEMENT - Host ms-be3003 is 
DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn decom / renaming [20:06:13] (03PS1) 10Andrew Bogott: role::labs::mediawiki_vagrant: another try at excluding stretch [puppet] - 10https://gerrit.wikimedia.org/r/405358 [20:06:52] (03CR) 10Andrew Bogott: [C: 032] role::labs::mediawiki_vagrant: another try at excluding stretch [puppet] - 10https://gerrit.wikimedia.org/r/405358 (owner: 10Andrew Bogott) [20:14:27] (03PS1) 10Andrew Bogott: role::labs::mediawiki_vagrant: one last attempt to understand os_version [puppet] - 10https://gerrit.wikimedia.org/r/405361 [20:15:24] (03CR) 10Andrew Bogott: [C: 032] role::labs::mediawiki_vagrant: one last attempt to understand os_version [puppet] - 10https://gerrit.wikimedia.org/r/405361 (owner: 10Andrew Bogott) [20:32:09] (03CR) 10BryanDavis: [C: 031] "drain_queue could probably use a config file if more options are added." [puppet] - 10https://gerrit.wikimedia.org/r/403202 (owner: 10Rush) [20:33:04] 10Operations, 10Puppet: os_version strict distro check doesn't work - https://phabricator.wikimedia.org/T185345#3913626 (10Andrew) [20:34:58] (03CR) 10Dzahn: [C: 032] "using 3003 instead of 3001 since 3001 is already disconnected" [dns] - 10https://gerrit.wikimedia.org/r/405223 (https://phabricator.wikimedia.org/T184936) (owner: 10Dzahn) [20:35:53] (03PS3) 10Dzahn: rename ms-be3003.mgmt to bast3003.mgmt [dns] - 10https://gerrit.wikimedia.org/r/405223 (https://phabricator.wikimedia.org/T184936) [20:35:55] (03PS1) 10Rush: openstack: nova-network and neutron nova::common split [puppet] - 10https://gerrit.wikimedia.org/r/405366 (https://phabricator.wikimedia.org/T171494) [20:36:25] (03CR) 10jerkins-bot: [V: 04-1] openstack: nova-network and neutron nova::common split [puppet] - 10https://gerrit.wikimedia.org/r/405366 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [20:39:40] (03PS3) 10Paladox: Gerrit: Add attribution to background image [puppet] - 10https://gerrit.wikimedia.org/r/404777 
(https://phabricator.wikimedia.org/T184778) [20:42:56] andrewbogott: i wonder if there is a difference between "!os_version('debian jessie')" and "os_version(debian != jessie)" [20:43:12] there is! os_version(debian != jessie) doesn't compile :) [20:43:19] heh, fair :) [20:43:23] but in either case the clause is simply ignored on trusty [20:43:36] yea, weird, just saw your ticket [20:48:21] == jessie should work [20:51:33] (03Draft1) 10Paladox: Gerrit: Fix performance issues with new login ui [puppet] - 10https://gerrit.wikimedia.org/r/405368 [20:51:35] (03Draft2) 10Paladox: Gerrit: Fix performance issues with new login ui [puppet] - 10https://gerrit.wikimedia.org/r/405368 [20:58:03] (03PS3) 10Paladox: Gerrit: Fix performance issues with new login ui [puppet] - 10https://gerrit.wikimedia.org/r/405368 [21:02:06] (03PS2) 10Dzahn: assign 91.198.174.115 to bast3003 [dns] - 10https://gerrit.wikimedia.org/r/405224 (https://phabricator.wikimedia.org/T184936) [21:06:28] (03CR) 10Dzahn: [C: 032] assign 91.198.174.115 to bast3003 [dns] - 10https://gerrit.wikimedia.org/r/405224 (https://phabricator.wikimedia.org/T184936) (owner: 10Dzahn) [21:11:29] (03PS2) 10Dzahn: DHCP: add 24:B6:FD:F6:17:3A as bast3003 [puppet] - 10https://gerrit.wikimedia.org/r/405227 (https://phabricator.wikimedia.org/T184936) [21:14:18] (03PS3) 10Dzahn: DHCP: add 24:B6:FD:F6:17:3A as bast3003 [puppet] - 10https://gerrit.wikimedia.org/r/405227 (https://phabricator.wikimedia.org/T184936) [21:16:06] (03CR) 10Dzahn: [C: 032] DHCP: add 24:B6:FD:F6:17:3A as bast3003 [puppet] - 10https://gerrit.wikimedia.org/r/405227 (https://phabricator.wikimedia.org/T184936) (owner: 10Dzahn) [21:22:00] 10Operations, 10ops-codfw, 10DC-Ops, 10netops: setup wifi in codfw - https://phabricator.wikimedia.org/T86541#3913726 (10ayounsi) a:05ayounsi>03Papaul [21:23:16] (03PS4) 10Paladox: Gerrit: Add attribution to background image [puppet] - 10https://gerrit.wikimedia.org/r/404777 (https://phabricator.wikimedia.org/T184778) 
[21:24:15] (03PS5) 10Dzahn: Gerrit: Add attribution to background image [puppet] - 10https://gerrit.wikimedia.org/r/404777 (https://phabricator.wikimedia.org/T184778) (owner: 10Paladox) [21:25:03] 10Operations, 10netops, 10Patch-For-Review: Evaluate NetBox as a Racktables replacement & IPAM - https://phabricator.wikimedia.org/T170144#3913735 (10ayounsi) [21:25:10] 10Puppet, 10Beta-Cluster-Infrastructure, 10Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#3913736 (10ayounsi) [21:25:13] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-netbox, looks like it thinks its a prod box - https://phabricator.wikimedia.org/T184242#3913733 (10ayounsi) 05Open>03Resolved Deleted [21:25:39] (03CR) 10Dzahn: [C: 032] Gerrit: Add attribution to background image [puppet] - 10https://gerrit.wikimedia.org/r/404777 (https://phabricator.wikimedia.org/T184778) (owner: 10Paladox) [21:25:44] mutante thanks :) [21:26:10] no problem [21:28:25] !log bootstrapping restbase2010-c - T184100 [21:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:37] T184100: Reprovision legacy Cassandra nodes into new cluster - https://phabricator.wikimedia.org/T184100 [21:31:34] 10Puppet, 10Beta-Cluster-Infrastructure, 10ORES, 10Scoring-platform-team (Current), 10User-Ladsgroup: Puppet broken on deployment-ores01 due to missing hieradata - https://phabricator.wikimedia.org/T184478#3913751 (10Krenair) Looks like its fixed now? Wanna mark this as resolved? 
[21:33:12] (03PS1) 10Andrew Bogott: openstack horizon: rough in manifests for source deploy of Horizon 'ocata' [puppet] - 10https://gerrit.wikimedia.org/r/405373 (https://phabricator.wikimedia.org/T168470) [21:33:34] (03CR) 10jerkins-bot: [V: 04-1] openstack horizon: rough in manifests for source deploy of Horizon 'ocata' [puppet] - 10https://gerrit.wikimedia.org/r/405373 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [21:33:40] (03PS4) 10Paladox: Gerrit: Fix performance issues with new login ui [puppet] - 10https://gerrit.wikimedia.org/r/405368 [21:34:40] (03PS2) 10Andrew Bogott: openstack horizon: rough in manifests for source deploy of Horizon 'ocata' [puppet] - 10https://gerrit.wikimedia.org/r/405373 (https://phabricator.wikimedia.org/T168470) [21:35:11] (03CR) 10jerkins-bot: [V: 04-1] openstack horizon: rough in manifests for source deploy of Horizon 'ocata' [puppet] - 10https://gerrit.wikimedia.org/r/405373 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [21:36:34] 10Puppet, 10Beta-Cluster-Infrastructure, 10Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#3913758 (10Krenair) [21:36:37] 10Puppet, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: Puppet broken on deployment-kafka-jumbo-[12] due to version of a package being missing - https://phabricator.wikimedia.org/T184240#3913756 (10Krenair) 05Open>03Resolved a:03Krenair [21:38:39] 10Puppet, 10Beta-Cluster-Infrastructure, 10Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#3913763 (10Krenair) [21:38:42] 10Puppet, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: Puppet broken on deployment-sentry01 - https://phabricator.wikimedia.org/T173554#3913761 (10Krenair) 05Open>03Resolved This took a frankly ridiculous amount of time to solve considering how simple the problem and patch was. 
[21:39:06] (03PS5) 10Paladox: Gerrit: Fix performance issues with new login ui [puppet] - 10https://gerrit.wikimedia.org/r/405368 [21:39:22] (03PS6) 10Paladox: Gerrit: Fix performance issues with new login ui [puppet] - 10https://gerrit.wikimedia.org/r/405368 [22:38:13] 10Operations, 10Analytics-Data-Quality, 10Traffic: Vet reliability of the response_size field for data analysis purposes - https://phabricator.wikimedia.org/T185350#3913867 (10Nuria) [22:53:03] !log Ran (time foreachwikiindblist flow.dblist extensions/Flow/maintenance/FlowFixInconsistentBoards.php --force) 2>&1|tee --append ~/FlowFixInconsistentBoards_all_2018-01-19_actual_force.txt [22:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:03] PROBLEM - Check systemd state on elastic1021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:56:39] 10Operations, 10Analytics-Data-Quality, 10Traffic: Vet reliability of the response_size field for data analysis purposes - https://phabricator.wikimedia.org/T185350#3913932 (10Tbayer) And here is a detailed list of the queries from the first and third client in T185350#3913824 (169 and 174 of them, respectiv... 
[22:58:08] 10Operations, 10Ops-Access-Requests: Requesting access to bast1001, stat1005, stat1006 for risler - https://phabricator.wikimedia.org/T185356#3913933 (10Ramsey-WMF) [23:02:05] RECOVERY - Check systemd state on elastic1021 is OK: OK - running: The system is fully operational [23:02:22] PROBLEM - DNS ms-be3003.mgmt on ms-be3003.mgmt is CRITICAL: Domain ms-be3003.mgmt.esams.wmnet was not found by the server [23:07:01] (03PS1) 10Chico Venancio: Add Chicocvenancio's key for Cloud Services [labs/private] - 10https://gerrit.wikimedia.org/r/405376 (https://phabricator.wikimedia.org/T185273) [23:25:07] 10Operations, 10LDAP, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Create 'releng' LDAP group - https://phabricator.wikimedia.org/T183507#3913987 (10greg) >>! In T183507#3912637, @MoritzMuehlenhoff wrote: > Done, see below. @greg, please also add a description of the purpose (and w...