[00:00:05] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a Evening SWAT (Max 8 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180112T0000).
[00:00:05] Lucas_WMDE: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:21] howdy! o/
[00:05:24] hello…?
[00:06:21] hello :)
[00:06:23] I can SWAT
[00:06:31] hi! :)
[00:06:54] https://gerrit.wikimedia.org/r/#/c/403736/
[00:07:04] it’s a small CSS fix I’d like to backport to Wikidata
[00:07:25] yep, okie doke, is this needed on wmf.16/15 or both?
[00:07:29] https://tools.wmflabs.org/versions/
[00:08:14] ah
[00:08:24] I was just confused for a second while Group 1 was on .15 again
[00:08:31] even though I looked into that task earlier today :)
[00:08:39] .15 then, for wikidatawiki
[00:09:09] k, I can do both, too. Hopefully train is fixed/can move forward soon.
[00:09:29] okay, that would be great
[00:09:36] * thcipriani does
[00:09:38] the patch should apply cleanly, the file hasn’t been touched in a while
[00:10:22] oh, wait, lemme check something…
[00:10:37] k
[00:10:50] (PS24) Paladox: Update gerrit login display [puppet] - https://gerrit.wikimedia.org/r/402665 (https://phabricator.wikimedia.org/T184778)
[00:11:22] FWIW I was able to cherry pick in the gerrit interface for both and it looks right https://gerrit.wikimedia.org/r/#/c/403846/1 and https://gerrit.wikimedia.org/r/#/c/403845/1
[00:11:26] https://gerrit.wikimedia.org/r/#/c/402864/ isn’t in wmf.15 if I’m not mistaken… perhaps you could backport that as well?
[00:11:56] sure
[00:12:22] does that one need to go out before the css update?
[00:12:31] okay thanks, I don’t think anything else from the logs needs backporting
[00:12:40] doesn’t really matter I think
[00:12:42] k
[00:13:22] https://phabricator.wikimedia.org/T183992 might happen again, but it should be a rare condition and we didn’t backport the fix last time already :)
[00:13:56] (^ is a “ConstraintParameterException” log error, just in case someone’s looking through the logs later)
[00:15:24] * thcipriani waits for jenkins
[00:16:22] (PS4) Dzahn: Include php5 packages on canary hosts [puppet] - https://gerrit.wikimedia.org/r/391045 (owner: Hoo man)
[00:26:16] Operations, Commons, Wikimedia-SVG-rendering, media-storage, Patch-For-Review: Install Noto fonts on scaling servers for SVG rendering - https://phabricator.wikimedia.org/T184664#3896000 (kaldari) Yeah, I'm not actually sure what is required to make fonts available to librsvg. There's also th...
[00:26:26] (CR) Dzahn: [C: 1] "not sure, shouldn't canary hosts be exactly like actual prod hosts on the other hand" [puppet] - https://gerrit.wikimedia.org/r/391045 (owner: Hoo man)
[00:29:22] ...gettin' closer :)
[00:30:26] * Lucas_WMDE eagerly watches Zuul ;)
[00:34:13] Lucas_WMDE: I've got the css changes pulled down on mwdebug1002 if you want to check those out.
[00:34:21] okay, I’ll try it out
[00:34:47] yup, seems to work!
[00:35:47] ok, I'll go live with that one
[00:35:49] thcipriani: does that also include the dependency change? I’m not seeing that error either but that might also be random
[00:35:56] ok thanks :)
[00:36:16] the extension.json change? No that didn't include that change yet.
[00:37:48] yeah, that one
[00:38:26] I don’t have a way to reproduce the error that this fixes, unfortunately
[00:38:31] so I can’t really say whether the fix works or not
[00:40:11] !log thcipriani@tin Synchronized php-1.31.0-wmf.16/extensions/WikibaseQualityConstraints/modules/ui/ConstraintReportGroup.less: SWAT: [[gerrit:403846|Do not hide default [Expand] link]] (duration: 01m 24s)
[00:40:24] there's wmf.16
[00:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:09] !log thcipriani@tin Synchronized php-1.31.0-wmf.15/extensions/WikibaseQualityConstraints/modules/ui/ConstraintReportGroup.less: SWAT: [[gerrit:403845|Do not hide default [Expand] link]] (duration: 01m 22s)
[00:43:12] Lucas_WMDE: ^ there's wmf.15 so css fix should be live
[00:43:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:21] thcipriani: yup, CSS fix works without the debug header as well now :)
[00:44:23] thank you!
[00:44:27] yw :)
[00:44:55] Lucas_WMDE: I pulled the extensions.json fix over to mwdebug1002 anything you can/want to check there before sync?
[00:45:05] *extension
[00:47:00] thcipriani: no, I just tried again but couldn’t reproduce the error even without the fix :)
[00:47:07] it’s probably something spurious
[00:47:45] or something that only happens occasionally, depending on the order in which RL modules are loaded, not sure
[00:47:51] ok, I'll go ahead and sync
[00:47:57] ok
[00:48:06] I’m reasonably sure it won’t break anything at least :D
[00:49:55] that's good :)
[00:51:06] !log thcipriani@tin Synchronized php-1.31.0-wmf.15/extensions/WikibaseQualityConstraints/extension.json: SWAT: [[gerrit:403850|Declare dependency on jquery.makeCollapsible]] (duration: 01m 21s)
[00:51:16] ^ Lucas_WMDE live now
[00:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:52:09] alright great
[00:53:07] I think that’s everything from my side, thank you for the SWAT :)
[00:55:23] absolutely, yw :)
[00:59:14] (PS27) Aaron Schulz: [WIP] Add mcrouter module and mcrouter_wancache profile [puppet] - https://gerrit.wikimedia.org/r/392221
[01:10:37] (PS28) Aaron Schulz: [WIP] Add mcrouter module and mcrouter_wancache profile [puppet] - https://gerrit.wikimedia.org/r/392221
[01:27:18] Operations, MediaWiki-Shell, Wikimedia-General-or-Unknown, Security: Securing external binaries run by MediaWiki - https://phabricator.wikimedia.org/T172584#3896064 (Legoktm)
[01:35:24] PROBLEM - Host cp3048 is DOWN: PING CRITICAL - Packet loss = 100%
[01:36:44] RECOVERY - Host cp3048 is UP: PING WARNING - Packet loss = 93%, RTA = 83.77 ms
[02:11:53] PROBLEM - puppet last run on scb1002 is CRITICAL: CRITICAL: Puppet has 9 failures. Last run 6 minutes ago with 9 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[ores/deploy],Exec[chown /srv/deployment/ores for deploy-service],Exec[remove_uwsgi_initd]
[02:21:04] PROBLEM - Check health of redis instance on 6381 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 1515723660 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 3078639 keys, up 2 minutes 3 seconds - replication_delay is 1515723660
[02:22:04] RECOVERY - Check health of redis instance on 6381 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 3052062 keys, up 3 minutes 4 seconds - replication_delay is 0
[02:36:48] RECOVERY - puppet last run on scb1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[03:27:07] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 829.15 seconds
[03:55:46] Operations, Commons, Wikimedia-SVG-rendering, media-storage, and 2 others: Install Noto fonts on scaling servers for SVG rendering - https://phabricator.wikimedia.org/T184664#3896104 (Shizhao)
[03:58:08] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 181.79 seconds
[05:10:48] (PS1) Andrew Bogott: puppet_compiler: puppetdb::app no longer takes a heap_size arg [puppet] - https://gerrit.wikimedia.org/r/403883
[06:19:34] (PS1) Marostegui: db-eqiad.php: Depool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/403884 (https://phabricator.wikimedia.org/T174569)
[06:21:16] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/403884 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[06:22:40] (Merged) jenkins-bot: db-eqiad.php: Depool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/403884 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[06:22:50] (CR) jenkins-bot: db-eqiad.php: Depool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/403884 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[06:24:31] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1100 - T174569 (duration: 01m 22s)
[06:24:42] !log Deploy schema change on db1100 - T174569
[06:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:45] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[06:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:27] PROBLEM - Nginx local proxy to apache on mw2129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:37:17] RECOVERY - Nginx local proxy to apache on mw2129 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.201 second response time
[06:40:47] (PS1) Marostegui: db-eqiad.php: Depool db1105:3311,db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/403886 (https://phabricator.wikimedia.org/T162807)
[06:42:28] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1105:3311,db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/403886 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[06:44:01] (Merged) jenkins-bot: db-eqiad.php: Depool db1105:3311,db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/403886 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[06:46:08] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1105:3311, db1105:3312 - T162807 T184256 (duration: 01m 22s)
[06:46:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:21] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[06:46:25] !log Update mariadb and kernel on db1105 - T184256
[06:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:45] (CR) jenkins-bot: db-eqiad.php: Depool db1105:3311,db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/403886 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[07:01:05] !log Stop replication in sync db1089 db1105:3311 - T162807
[07:01:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:17] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[07:29:27] PROBLEM - HHVM rendering on mw2134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:30:17] RECOVERY - HHVM rendering on mw2134 is OK: HTTP OK: HTTP/1.1 200 OK - 80732 bytes in 0.369 second response time
[07:36:46] (PS1) Marostegui: db-eqiad.php: Repool db1105:3312 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/403887
[07:39:26] (CR) Marostegui: [C: 2] db-eqiad.php: Repool db1105:3312 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/403887 (owner: Marostegui)
[07:41:01] (Merged) jenkins-bot: db-eqiad.php: Repool db1105:3312 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/403887 (owner: Marostegui)
[07:41:11] (CR) jenkins-bot: db-eqiad.php: Repool db1105:3312 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/403887 (owner: Marostegui)
[07:42:46] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1105:3312 with low weight (duration: 01m 22s)
[07:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:50] !log reboot video scalers in codfw for kernel security update
[08:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:18] !log reboot remaining API servers in codfw for kernel security update (along with update to HHVM 3.18.6)
[08:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:29] !log forced remount of /mnt/hdfs on stat1005 after OOM
[09:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:41] apergos: --^ :)
[09:03:10] thank you elukey
[09:03:54] (PS1) Marostegui: db-eqiad.php: Increase weight db1105:3311,3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/403892
[09:04:32] !log Upgrade kernel on db1100
[09:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:26] (CR) Marostegui: [C: 2] db-eqiad.php: Increase weight db1105:3311,3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/403892 (owner: Marostegui)
[09:07:40] !log reboot analytics1063->65 for kernel updates
[09:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:54] (Merged) jenkins-bot: db-eqiad.php: Increase weight db1105:3311,3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/403892 (owner: Marostegui)
[09:08:05] (CR) jenkins-bot: db-eqiad.php: Increase weight db1105:3311,3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/403892 (owner: Marostegui)
[09:08:07] RECOVERY - Disk space on ms-be2023 is OK: DISK OK
[09:09:40] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase weight for db1105:3311 and db1105:3312 (duration: 01m 23s)
[09:09:47] (CR) Giuseppe Lavagetto: [C: 2] base::resolving: explicitly pass arguments [puppet] - https://gerrit.wikimedia.org/r/403440 (owner: Giuseppe Lavagetto)
[09:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:03] (PS5) Giuseppe Lavagetto: base::resolving: explicitly pass arguments [puppet] - https://gerrit.wikimedia.org/r/403440
[09:11:30] !log reboot ms-be2023 - sdn failed and raid controller isn't happy
[09:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:28] PROBLEM - HHVM rendering on mw1296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:13:18] RECOVERY - HHVM rendering on mw1296 is OK: HTTP OK: HTTP/1.1 200 OK - 80730 bytes in 0.116 second response time
[09:13:29] (PS1) Marostegui: db-eqiad.php: Repool db1100 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/403894
[09:16:23] (CR) Marostegui: [C: 2] db-eqiad.php: Repool db1100 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/403894 (owner: Marostegui)
[09:17:49] (Merged) jenkins-bot: db-eqiad.php: Repool db1100 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/403894 (owner: Marostegui)
[09:18:01] (CR) jenkins-bot: db-eqiad.php: Repool db1100 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/403894 (owner: Marostegui)
[09:18:53] (CR) Jcrespo: "To not confuse cloud people, this commit should be called "add GlobalPreferences to maintain-view", nothing to do with sanitarium" [puppet] - https://gerrit.wikimedia.org/r/403833 (https://phabricator.wikimedia.org/T184666) (owner: MaxSem)
[09:19:14] (CR) Jcrespo: "*maintain-views" [puppet] - https://gerrit.wikimedia.org/r/403833 (https://phabricator.wikimedia.org/T184666) (owner: MaxSem)
[09:19:33] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase weight for db1100 (duration: 01m 22s)
[09:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:05] Operations, ops-codfw: ms-be2023 fails to (re)boot - https://phabricator.wikimedia.org/T184785#3896225 (fgiunchedi)
[09:29:37] (CR) Elukey: "Seems a no-op (config wise) from https://puppet-compiler.wmflabs.org/compiler02/9716/" [puppet/cdh] - https://gerrit.wikimedia.org/r/403701 (https://phabricator.wikimedia.org/T166248) (owner: Elukey)
[09:32:42] ACKNOWLEDGEMENT - HP RAID on ms-be2023 is CRITICAL: CRITICAL: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3 - Failed: 2I:2:4 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T184787
[09:32:46] Operations, ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T184787#3896254 (ops-monitoring-bot)
[09:43:43] (PS1) Marostegui: db-eqiad.php: db1105:331{1,2}, db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/403895
[09:45:37] (CR) Marostegui: [C: 2] db-eqiad.php: db1105:331{1,2}, db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/403895 (owner: Marostegui)
[09:47:10] (Merged) jenkins-bot: db-eqiad.php: db1105:331{1,2}, db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/403895 (owner: Marostegui)
[09:47:21] (CR) jenkins-bot: db-eqiad.php: db1105:331{1,2}, db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/403895 (owner: Marostegui)
[09:49:00] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase weight for db1100, db1105:3311, db1105:3312 (duration: 01m 23s)
[09:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:51] Operations, ops-codfw: ms-be2023 fails to (re)boot - https://phabricator.wikimedia.org/T184785#3896271 (fgiunchedi) Open>Resolved a:fgiunchedi Turns out, eventually the machine booted. I've upgraded the raid firmware for good measure.
[09:51:18] Operations, ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T184787#3896275 (fgiunchedi) a:Papaul @papaul please replace sdn! thanks
[10:02:10] (PS1) Marostegui: db-eqiad.php: Fully repool db1100 and db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/403897
[10:02:35] !log jmm@puppetmaster1001 conftool action : set/pooled=inactive; selector: mw2140.codfw.wmnet
[10:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:26] Operations, Pybal, Traffic, Patch-For-Review: pybal's "can-depool" logic only takes downServers into account - https://phabricator.wikimedia.org/T184715#3893173 (Joe) I don't think this is entirely correct. Pybal has 3 states: # `enabled/disabled`: the logical state in etcd, so `pooled=yes` or...
[10:05:15] Operations, ops-codfw: mw2140 unresponsive, mgmt not accessible - https://phabricator.wikimedia.org/T184788#3896291 (MoritzMuehlenhoff)
[10:05:42] (CR) Marostegui: [C: 2] db-eqiad.php: Fully repool db1100 and db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/403897 (owner: Marostegui)
[10:06:22] ACKNOWLEDGEMENT - Host mw2140 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T184788
[10:07:25] (Merged) jenkins-bot: db-eqiad.php: Fully repool db1100 and db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/403897 (owner: Marostegui)
[10:07:34] (CR) jenkins-bot: db-eqiad.php: Fully repool db1100 and db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/403897 (owner: Marostegui)
[10:09:07] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully repool db1082 and db1100 (duration: 01m 22s)
[10:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:15] !log upload scap 3.7.5-1 - T184774
[10:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:27] T184774: SCAP: Upload debian package version 3.7.5-1 - https://phabricator.wikimedia.org/T184774
[10:12:41] ACKNOWLEDGEMENT - HP RAID on ms-be2023 is CRITICAL: CRITICAL: Slot 3: Failed: 2I:2:4 - OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T184789
[10:12:46] Operations, ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T184789#3896316 (ops-monitoring-bot)
[10:15:07] (PS2) Filippo Giunchedi: Upgrade scap package to 3.7.5-1 [puppet] - https://gerrit.wikimedia.org/r/403775 (https://phabricator.wikimedia.org/T184774) (owner: 20after4)
[10:17:03] (CR) 20after4: [C: 1] "I'm around to help test after upgrade if you'd like me to." [puppet] - https://gerrit.wikimedia.org/r/403775 (https://phabricator.wikimedia.org/T184774) (owner: 20after4)
[10:17:10] (PS1) Marostegui: db-eqiad.php: Increase weight db1105:3311 [mediawiki-config] - https://gerrit.wikimedia.org/r/403898
[10:18:04] (CR) Filippo Giunchedi: [C: 2] Upgrade scap package to 3.7.5-1 [puppet] - https://gerrit.wikimedia.org/r/403775 (https://phabricator.wikimedia.org/T184774) (owner: 20after4)
[10:18:45] twentyafterfour: awesome! I'm forcing a puppet run on tin now
[10:19:15] (CR) Marostegui: [C: 2] db-eqiad.php: Increase weight db1105:3311 [mediawiki-config] - https://gerrit.wikimedia.org/r/403898 (owner: Marostegui)
[10:19:58] (PS1) Jcrespo: mariadb: Kill almost all wikiuser queries, including replica control [software] - https://gerrit.wikimedia.org/r/403899 (https://phabricator.wikimedia.org/T180918)
[10:20:48] (Merged) jenkins-bot: db-eqiad.php: Increase weight db1105:3311 [mediawiki-config] - https://gerrit.wikimedia.org/r/403898 (owner: Marostegui)
[10:20:56] godog: looks like it upgraded
[10:20:59] (CR) jenkins-bot: db-eqiad.php: Increase weight db1105:3311 [mediawiki-config] - https://gerrit.wikimedia.org/r/403898 (owner: Marostegui)
[10:22:15] indeed
[10:22:16] * twentyafterfour will deploy to phab2001 to test
[10:22:17] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase weight db1105:3311 (duration: 01m 13s)
[10:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:57] Operations, ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T184787#3896339 (fgiunchedi)
[10:23:59] Operations, ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T184789#3896341 (fgiunchedi)
[10:24:40] !log reboot job runners in codfw for kernel security update (along with update to HHVM 3.18.6)
[10:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:06] E: Version '3.7.5-1' for 'scap' was not found
[10:25:36] godog: doesn't puppet run `apt-get update`? ^
[10:25:40] twentyafterfour: no
[10:25:56] * twentyafterfour runs apt-get update
[10:26:33] <_joe_> twentyafterfour: our script that runs puppet automatically *does* it before puppet runs IIRC
[10:26:42] that'd be /usr/local/sbin/puppet-run
[10:27:34] (PS2) Jcrespo: mariadb: Kill almost all wikiuser queries, including replica control [software] - https://gerrit.wikimedia.org/r/403899 (https://phabricator.wikimedia.org/T180918)
[10:28:56] yeah, apt-get update is run with every puppet run
[10:29:34] !log twentyafterfour@tin Started deploy [phabricator/deployment@61f1099]: (no justification provided)
[10:29:41] !log twentyafterfour@tin Finished deploy [phabricator/deployment@61f1099]: (no justification provided) (duration: 00m 07s)
[10:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:02] !log twentyafterfour@tin Started deploy [phabricator/deployment@61f1099]: (no justification provided)
[10:30:08] Operations, Continuous-Integration-Infrastructure (shipyard), Patch-For-Review, User-Joe: Unify production and CI docker image build process - https://phabricator.wikimedia.org/T177276#3896360 (hashar)
[10:30:11] Operations, Continuous-Integration-Infrastructure (shipyard), Patch-For-Review, Release-Engineering-Team (Kanban): npm 1.4.21 can't use a http proxy - https://phabricator.wikimedia.org/T183569#3896358 (hashar) Open>Resolved Awesome. That fixed the issue I had locally as well as when runni...
[10:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:25] Received disconnect from 10.192.32.147: 2: Too many authentication failures for phab-deploy from 10.64.0.196 port 38486 ssh2
[10:31:13] Operations, Continuous-Integration-Infrastructure (shipyard), Patch-For-Review, Release-Engineering-Team (Kanban), User-Joe: Unify production and CI docker image build process - https://phabricator.wikimedia.org/T177276#3896364 (hashar) a:Joe>hashar **Status update** Containers left...
[10:31:14] hmm well scap appears to be working, though my phab-deploy ssh isn't
[10:33:27] !log reboot analytics1066->69 for kernel updates
[10:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:51] !log twentyafterfour@tin Finished deploy [phabricator/deployment@61f1099]: (no justification provided) (duration: 03m 49s)
[10:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:27] (PS2) Jcrespo: dblist: Promote db1055 to be the x1 eqiad master instead of db1031 [software] - https://gerrit.wikimedia.org/r/403679 (https://phabricator.wikimedia.org/T183469)
[10:38:29] (PS3) Jcrespo: mariadb: Kill almost all wikiuser queries, including replica control [software] - https://gerrit.wikimedia.org/r/403899 (https://phabricator.wikimedia.org/T180918)
[10:41:10] (CR) Marostegui: [C: 1] mariadb: Kill almost all wikiuser queries, including replica control [software] - https://gerrit.wikimedia.org/r/403899 (https://phabricator.wikimedia.org/T180918) (owner: Jcrespo)
[10:43:33] (PS1) Marostegui: db-eqiad.php: Repool db1105:3311, db1066 [mediawiki-config] - https://gerrit.wikimedia.org/r/403900
[10:44:31] (PS2) Marostegui: db-eqiad.php: Repool db1105:3311, db1066 [mediawiki-config] - https://gerrit.wikimedia.org/r/403900
[10:46:03] (CR) Marostegui: [C: 2] db-eqiad.php: Repool db1105:3311, db1066 [mediawiki-config] - https://gerrit.wikimedia.org/r/403900 (owner: Marostegui)
[10:47:33] (Merged) jenkins-bot: db-eqiad.php: Repool db1105:3311, db1066 [mediawiki-config] - https://gerrit.wikimedia.org/r/403900 (owner: Marostegui)
[10:47:43] (CR) jenkins-bot: db-eqiad.php: Repool db1105:3311, db1066 [mediawiki-config] - https://gerrit.wikimedia.org/r/403900 (owner: Marostegui)
[10:49:19] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully repool db1105:3311 and slowly repool db1066 (duration: 01m 13s)
[10:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:47] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2007.codfw.wmnet
[10:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:39] RECOVERY - Restbase root url on xenon is OK: HTTP OK: HTTP/1.1 200 - 15723 bytes in 0.024 second response time
[11:00:09] RECOVERY - Restbase root url on cerium is OK: HTTP OK: HTTP/1.1 200 - 15723 bytes in 0.010 second response time
[11:03:28] RECOVERY - Restbase root url on restbase-test2001 is OK: HTTP OK: HTTP/1.1 200 - 15723 bytes in 0.086 second response time
[11:04:58] RECOVERY - Restbase root url on restbase-test2002 is OK: HTTP OK: HTTP/1.1 200 - 15723 bytes in 0.099 second response time
[11:06:39] RECOVERY - Restbase root url on restbase-test2003 is OK: HTTP OK: HTTP/1.1 200 - 15723 bytes in 0.089 second response time
[11:07:18] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2001.codfw.wmnet
[11:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:13] (PS1) Marostegui: db-eqiad.php: Fully repool db1066 [mediawiki-config] - https://gerrit.wikimedia.org/r/403903
[11:16:52] (CR) Marostegui: [C: 2] db-eqiad.php: Fully repool db1066 [mediawiki-config] - https://gerrit.wikimedia.org/r/403903 (owner: Marostegui)
[11:18:13] Operations, Pybal, Traffic, Patch-For-Review: pybal's "can-depool" logic only takes downServers into account - https://phabricator.wikimedia.org/T184715#3896429 (ema) >>! In T184715#3896284, @Joe wrote: > `pooled/not pooled`: the status of the server in the ipvs pool As we've found out today, `p...
[11:18:24] (Merged) jenkins-bot: db-eqiad.php: Fully repool db1066 [mediawiki-config] - https://gerrit.wikimedia.org/r/403903 (owner: Marostegui)
[11:18:34] (CR) jenkins-bot: db-eqiad.php: Fully repool db1066 [mediawiki-config] - https://gerrit.wikimedia.org/r/403903 (owner: Marostegui)
[11:19:48] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully repool db1066 (duration: 01m 12s)
[11:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:03] (PS2) Jcrespo: mariadb: Promote db1055 to be the x1 eqiad master instead of db1031 [mediawiki-config] - https://gerrit.wikimedia.org/r/403454 (https://phabricator.wikimedia.org/T183469)
[11:24:05] (PS1) Jcrespo: mariadb: Depool db2035 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/403905 (https://phabricator.wikimedia.org/T176243)
[11:24:15] (PS2) Jcrespo: mariadb: Depool db2035 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/403905 (https://phabricator.wikimedia.org/T176243)
[11:41:53] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2001.codfw.wmnet
[11:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:08] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2002.codfw.wmnet
[11:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:05:13] Operations, wikidiff2, Patch-For-Review, User-Addshore, and 2 others: Update and use php-wikidiff2 1.5.1 & MovedParagraphDetectionCutoff in production - https://phabricator.wikimedia.org/T177891#3896508 (Addshore)
[12:05:26] Operations, wikidiff2, Patch-For-Review, User-Addshore, and 2 others: Update and use php-wikidiff2 1.5.1 & MovedParagraphDetectionCutoff in production - https://phabricator.wikimedia.org/T177891#3674105 (Addshore) p:Triage>Normal
[12:09:00] Operations, Commons, Wikimedia-SVG-rendering, media-storage, and 2 others: Install Noto fonts on scaling servers for SVG rendering - https://phabricator.wikimedia.org/T184664#3896523 (Liuxinyu970226) @Shizhao By adding that tag, would you please explain how this is suitable to write into Tech News?
[12:14:39] Operations, monitoring: Configure puppetdb to export metrics via Prometheus JMX Agent - https://phabricator.wikimedia.org/T184796#3896530 (elukey)
[12:14:44] (CR) Jcrespo: [C: 2] mariadb: Depool db2035 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/403905 (https://phabricator.wikimedia.org/T176243) (owner: Jcrespo)
[12:14:52] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2002.codfw.wmnet
[12:14:53] (PS2) Faidon Liambotis: Remove utils/expanderb.rb [puppet] - https://gerrit.wikimedia.org/r/403697
[12:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:15:08] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2003.codfw.wmnet
[12:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:15:25] (CR) Faidon Liambotis: [C: 2] "Per Hashar and Giuseppe." [puppet] - https://gerrit.wikimedia.org/r/403697 (owner: Faidon Liambotis)
[12:16:04] Operations, monitoring: Configure puppetdb to export metrics via Prometheus JMX Agent - https://phabricator.wikimedia.org/T184796#3896556 (elukey) Current status on nitrogen (no jvm metrics displayed since they should already be ok): ``` elukey@nitrogen:~$ curl http://10.64.32.199:9400/metrics -s | grep...
[12:16:12] (Merged) jenkins-bot: mariadb: Depool db2035 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/403905 (https://phabricator.wikimedia.org/T176243) (owner: Jcrespo)
[12:16:55] (CR) jenkins-bot: mariadb: Depool db2035 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/403905 (https://phabricator.wikimedia.org/T176243) (owner: Jcrespo)
[12:19:35] Operations, monitoring: Configure puppetdb to export metrics via Prometheus JMX Agent - https://phabricator.wikimedia.org/T184796#3896563 (fgiunchedi) The rates we can get rid of since we have the total counts, the rest LGTM, thanks @elukey !
[12:23:31] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2035 for maintenance (duration: 01m 13s)
[12:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:18] !log stop db2035 replication for maintenance
[12:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:02] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2003.codfw.wmnet
[12:37:10] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2004.codfw.wmnet
[12:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:34] Operations, DBA: Move mariadb_maintenance away from terbium/wasat (mediawiki_maintenance) - https://phabricator.wikimedia.org/T184797#3896577 (jcrespo)
[12:42:58] Operations, Puppet, DBA: Move mariadb_maintenance away from terbium/wasat (mediawiki_maintenance) - https://phabricator.wikimedia.org/T184797#3896587 (jcrespo)
[12:53:10] (PS1) Filippo Giunchedi: debian: start after blazegraph [debs/prometheus-blazegraph-exporter] - https://gerrit.wikimedia.org/r/403915 (https://phabricator.wikimedia.org/T184434)
[12:53:30] Operations, cloud-services-team (Kanban): labsdb1001 crashed - storage issue - https://phabricator.wikimedia.org/T179464#3896597 (Marostegui) labsdb1001 is no longer available, not even in read_only. storage has definitely given up.
[12:54:34] (PS2) Filippo Giunchedi: debian: start after blazegraph [debs/prometheus-blazegraph-exporter] - https://gerrit.wikimedia.org/r/403915 (https://phabricator.wikimedia.org/T184434)
[12:55:19] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2004.codfw.wmnet
[12:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:58] (CR) Muehlenhoff: [C: 1] "Looks good to me" [debs/prometheus-blazegraph-exporter] - https://gerrit.wikimedia.org/r/403915 (https://phabricator.wikimedia.org/T184434) (owner: Filippo Giunchedi)
[12:57:05] Operations, cloud-services-team (Kanban): labsdb1001 crashed - storage issue - https://phabricator.wikimedia.org/T179464#3896602 (jcrespo) Followup T142807, there is nothing else we can do.
[12:59:10] (03PS2) 10Jcrespo: mariadb: Promote db1055 to be the x1 eqiad master instead of db1031 [puppet] - 10https://gerrit.wikimedia.org/r/403678 (https://phabricator.wikimedia.org/T183469) [12:59:12] (03PS1) 10Jcrespo: mariadb: Move db2041 mariadb socket away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/403917 (https://phabricator.wikimedia.org/T148507) [13:02:45] (03CR) 10Jcrespo: "Gergő: Don't worry too much, people with access to root@ receive that, and we will ping you if it creates cronspam, it was just a heads up" [puppet] - 10https://gerrit.wikimedia.org/r/395694 (https://phabricator.wikimedia.org/T181107) (owner: 10Gergő Tisza) [13:04:02] (03PS1) 10Jcrespo: Revert "Add cron job for purging ReadingLists data" [puppet] - 10https://gerrit.wikimedia.org/r/403919 [13:04:43] (03CR) 10Jcrespo: "Actually, it is not theoretical:" [puppet] - 10https://gerrit.wikimedia.org/r/403919 (owner: 10Jcrespo) [13:08:39] (03PS2) 10Jcrespo: Revert "Add cron job for purging ReadingLists data" [puppet] - 10https://gerrit.wikimedia.org/r/403919 [13:10:57] (03CR) 10Jcrespo: [C: 032] Revert "Add cron job for purging ReadingLists data" [puppet] - 10https://gerrit.wikimedia.org/r/403919 (owner: 10Jcrespo) [13:11:03] (03PS3) 10Jcrespo: Revert "Add cron job for purging ReadingLists data" [puppet] - 10https://gerrit.wikimedia.org/r/403919 [13:24:01] !log upgrade and restart db2041 [13:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:42] (03CR) 10Jcrespo: [C: 032] mariadb: Move db2041 mariadb socket away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/403917 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [13:24:48] (03PS2) 10Jcrespo: mariadb: Move db2041 mariadb socket away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/403917 (https://phabricator.wikimedia.org/T148507) [13:27:19] 10Operations, 10Packaging, 10Scap, 10Patch-For-Review, 10Release: SCAP: Upload debian package version 3.7.5-1 - 
https://phabricator.wikimedia.org/T184774#3896652 (10mmodell) 05Open>03Resolved a:03mmodell Thanks @fgiunchedi [13:27:34] 10Operations, 10Packaging, 10Scap, 10Patch-For-Review, 10Release: SCAP: Upload debian package version 3.7.5-1 - https://phabricator.wikimedia.org/T184774#3896655 (10mmodell) a:05mmodell>03fgiunchedi [13:27:41] (03PS10) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: add targetted upgrades script [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) [13:40:06] (03CR) 10Gehel: [C: 031] "This will fix the immediate issue. I still think we should have a second look into the exporter itself. There is no reason it couldn't sta" [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/403915 (https://phabricator.wikimedia.org/T184434) (owner: 10Filippo Giunchedi) [13:48:03] 10Operations, 10monitoring: Configure puppetdb to export metrics via Prometheus JMX Agent - https://phabricator.wikimedia.org/T184796#3896703 (10elukey) Since rates and other things like stdev are Mbean's attributes I cannot easily blacklist them, but sole rewrite rules are needed (in which we can explicitly s... 
[13:58:13] (03CR) 10Faidon Liambotis: [C: 04-1] apt: unattended-upgrades: add targetted upgrades script (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [14:10:06] (03PS11) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: add targetted upgrades script [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) [14:22:14] (03PS3) 10Ema: Use up-and-enabled servers in can-depool logic [debs/pybal] - 10https://gerrit.wikimedia.org/r/403677 (https://phabricator.wikimedia.org/T184715) [14:23:39] (03CR) 10jerkins-bot: [V: 04-1] Use up-and-enabled servers in can-depool logic [debs/pybal] - 10https://gerrit.wikimedia.org/r/403677 (https://phabricator.wikimedia.org/T184715) (owner: 10Ema) [14:27:46] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2005.codfw.wmnet [14:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:26] 10Operations, 10Prometheus-metrics-monitoring, 10User-Elukey: Create prometheus nutcracker exporter - https://phabricator.wikimedia.org/T155129#3896784 (10elukey) 05Open>03Resolved [14:31:57] (03PS2) 10Andrew Bogott: puppet_compiler: puppetdb::app no longer takes a heap_size arg [puppet] - 10https://gerrit.wikimedia.org/r/403883 [14:32:41] (03CR) 10Andrew Bogott: [C: 032] puppet_compiler: puppetdb::app no longer takes a heap_size arg [puppet] - 10https://gerrit.wikimedia.org/r/403883 (owner: 10Andrew Bogott) [14:33:43] (03PS4) 10Ema: Use up-and-enabled servers in can-depool logic [debs/pybal] - 10https://gerrit.wikimedia.org/r/403677 (https://phabricator.wikimedia.org/T184715) [14:52:47] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10User-Joe: Unify production and CI docker image build process - https://phabricator.wikimedia.org/T177276#3896838 (10hashar) 05Open>03Resolved I 
have migrated all the remaining... [14:53:11] (03PS1) 10Elukey: site.pp: add mw1338->48 [puppet] - 10https://gerrit.wikimedia.org/r/403928 (https://phabricator.wikimedia.org/T165519) [14:56:03] <_joe_> elukey: are you reimaging those? [14:56:49] (03PS2) 10Elukey: site.pp: add mw1338->48 [puppet] - 10https://gerrit.wikimedia.org/r/403928 (https://phabricator.wikimedia.org/T165519) [14:56:51] _joe_ nono only prepping the code reviews for monday [14:58:17] <_joe_> ok, shout out when you want to start [14:58:47] (03PS3) 10Elukey: site.pp: add mw1338->48 [puppet] - 10https://gerrit.wikimedia.org/r/403928 (https://phabricator.wikimedia.org/T165519) [14:58:55] _joe_ if you want to check --^ [14:59:10] I added 1 vs and 10 apis, but IIRC we were not sure about the vs [14:59:15] so I can amend that [14:59:20] (with 11 apis) [14:59:37] <_joe_> remember videoscalers are stretch already [15:00:13] yep, I've set host entries accordingly [15:00:24] 10Operations, 10DBA, 10Release-Engineering-Team, 10cloud-services-team: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805#3896851 (10jcrespo) [15:02:03] 10Operations, 10DBA, 10Release-Engineering-Team, 10cloud-services-team: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805#3896861 (10jcrespo) Only adding #releng and #wmcs in case they can think of a reason not to move them (groupX reasons) or not to put labswiki there. 
[15:15:57] (03PS3) 10Jcrespo: mariadb: Promote db1055 to be the x1 eqiad master instead of db1031 [puppet] - 10https://gerrit.wikimedia.org/r/403678 (https://phabricator.wikimedia.org/T183469) [15:15:59] (03PS1) 10Jcrespo: mariadb: Move db2049 socket away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/403931 (https://phabricator.wikimedia.org/T148507) [15:20:59] (03PS4) 10Giuseppe Lavagetto: wmflib: simplify the role() function, convert to the new API [puppet] - 10https://gerrit.wikimedia.org/r/402345 [15:23:44] (03PS2) 10Jcrespo: mariadb: Move db2049 socket away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/403931 (https://phabricator.wikimedia.org/T148507) [15:36:26] PROBLEM - HHVM rendering on mw2237 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:37:16] RECOVERY - HHVM rendering on mw2237 is OK: HTTP OK: HTTP/1.1 200 OK - 80631 bytes in 0.333 second response time [15:39:35] !log upgrade and restart db2049 [15:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:07] (03CR) 10Jcrespo: [C: 032] mariadb: Move db2049 socket away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/403931 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [15:44:25] arturo: what do you mean by "does not behave as one would expect"? [15:44:26] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml] [15:44:36] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2005.codfw.wmnet [15:44:45] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2006.codfw.wmnet [15:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:48] (03PS2) 10Gehel: prometheus blazegraph exporter should not fail when blazegraph is down [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/403933 (https://phabricator.wikimedia.org/T184434) [15:59:26] PROBLEM - DPKG on restbase1008 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:00:08] that's me [16:00:59] (03PS1) 10Marostegui: db-eqiad.php: Repool db1089 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403936 [16:03:26] RECOVERY - DPKG on restbase1008 is OK: All packages OK [16:04:01] (03CR) 10Filippo Giunchedi: [C: 032] debian: start after blazegraph [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/403915 (https://phabricator.wikimedia.org/T184434) (owner: 10Filippo Giunchedi) [16:09:26] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:09:44] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM!" 
[debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/403933 (https://phabricator.wikimedia.org/T184434) (owner: 10Gehel) [16:11:35] (03PS3) 10Gehel: prometheus blazegraph exporter should not fail when blazegraph is down [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/403933 (https://phabricator.wikimedia.org/T184434) [16:12:29] (03CR) 10Gehel: [C: 032] prometheus blazegraph exporter should not fail when blazegraph is down [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/403933 (https://phabricator.wikimedia.org/T184434) (owner: 10Gehel) [16:15:20] !log upgrade and restart db2056 [16:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:44] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2006.codfw.wmnet [16:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:07] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1007.eqiad.wmnet [16:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:03] (03PS1) 10Lucas Werkmeister (WMDE): Remove $wgWBQualityConstraintsIncludeDetailInApi setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403939 (https://phabricator.wikimedia.org/T180614) [16:29:31] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1089 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403936 (owner: 10Marostegui) [16:31:04] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1089 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403936 (owner: 10Marostegui) [16:31:17] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1089 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403936 (owner: 10Marostegui) [16:32:36] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1089 (duration: 01m 12s) [16:32:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:48] 10Operations, 10Mail: Disavow emails from wikipedia.com - https://phabricator.wikimedia.org/T184230#3897019 (10faidon) p:05Triage>03Normal a:03herron [16:34:14] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1007.eqiad.wmnet [16:34:22] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1008.eqiad.wmnet [16:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:25] PROBLEM - DPKG on restbase1014 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:39:13] ggrrr forgot downtime for that last host [16:42:16] RECOVERY - DPKG on restbase1014 is OK: All packages OK [16:44:52] (03PS4) 10Jcrespo: mariadb: Promote db1055 to be the x1 eqiad master instead of db1031 [puppet] - 10https://gerrit.wikimedia.org/r/403678 (https://phabricator.wikimedia.org/T183469) [16:44:53] (03PS1) 10Jcrespo: mariadb: Move db2063 socket away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/403942 (https://phabricator.wikimedia.org/T148507) [16:45:14] (03PS2) 10Jcrespo: mariadb: Move db2063 socket away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/403942 (https://phabricator.wikimedia.org/T148507) [16:46:55] !log upgrade and restart db2063 [16:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:51] (03CR) 10Jcrespo: [C: 032] mariadb: Move db2063 socket away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/403942 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [16:52:24] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1008.eqiad.wmnet [16:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:37] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1009.eqiad.wmnet 
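Note on the conftool entries above: the restbase hosts are being cycled one at a time, with each `set/pooled=no` followed later by a `set/pooled=yes` for the same host. A minimal sketch, assuming only the message format visible in this log, of replaying those entries to see which hosts are still depooled at a given point. The function name `depooled_hosts` and the regular expression are illustrative and are not part of conftool itself.

```python
import re

# Pattern matches the "!log" text as it appears in this channel log.
# It is an assumption drawn from these lines, not a conftool API.
CONFTOOL_RE = re.compile(
    r"conftool action : set/pooled=(?P<state>yes|no); selector: name=(?P<host>\S+)"
)


def depooled_hosts(log_lines):
    """Replay conftool log entries in order; return hosts left depooled."""
    depooled = set()
    for line in log_lines:
        match = CONFTOOL_RE.search(line)
        if match is None:
            continue  # not a conftool entry
        if match.group("state") == "no":
            depooled.add(match.group("host"))
        else:
            depooled.discard(match.group("host"))
    return depooled


# Four entries copied from the log above.
sample = [
    "[16:34:14] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1007.eqiad.wmnet",
    "[16:34:22] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1008.eqiad.wmnet",
    "[16:52:24] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1008.eqiad.wmnet",
    "[16:52:37] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1009.eqiad.wmnet",
]

print(depooled_hosts(sample))
```

With these four entries only restbase1009.eqiad.wmnet remains depooled, which matches the state of the rolling change at 16:52:37.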
[16:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:55] (03PS1) 10Marostegui: db-eqiad.php: Increase weight db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403945 [17:01:53] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403945 (owner: 10Marostegui) [17:03:28] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403945 (owner: 10Marostegui) [17:03:40] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403945 (owner: 10Marostegui) [17:04:57] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase weight for db1089 (duration: 01m 12s) [17:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:15] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1009.eqiad.wmnet [17:11:24] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1010.eqiad.wmnet [17:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:26] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403956 [17:24:03] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Fully repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403956 (owner: 10Marostegui) [17:25:52] (03CR) 10Chad: [C: 032] "Considering it already says "Wikimedia public key list" I don't think we're changing any branding here." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/403569 (https://phabricator.wikimedia.org/T181018) (owner: 10Krinkle) [17:26:59] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403956 (owner: 10Marostegui) [17:27:09] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403956 (owner: 10Marostegui) [17:28:20] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully repool db1089 (duration: 01m 09s) [17:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:46] 10Operations, 10Datasets-General-or-Unknown, 10Dumps-Generation, 10hardware-requests: Give misc dump crons their own host - https://phabricator.wikimedia.org/T181936#3897267 (10demon) >>! In T181936#3890096, @ArielGlenn wrote: > I'm adding @Nikerabbit, @demon and @hoo because they will be the main benefici... [17:30:09] (03Merged) 10jenkins-bot: keys: Simplify and update keys.html styling to match other simple pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403569 (https://phabricator.wikimedia.org/T181018) (owner: 10Krinkle) [17:30:19] (03CR) 10jenkins-bot: keys: Simplify and update keys.html styling to match other simple pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403569 (https://phabricator.wikimedia.org/T181018) (owner: 10Krinkle) [17:31:43] (03CR) 10Chad: [C: 032] Add wikidata and mediawiki.org to $wgLocalVirtualHosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392999 (https://phabricator.wikimedia.org/T117302) (owner: 10TerraCodes) [17:31:49] !log demon@tin Synchronized docroot/mediawiki/: prettier keys page (duration: 01m 13s) [17:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:52] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1010.eqiad.wmnet [17:34:01] !log filippo@puppetmaster1001 conftool action : set/pooled=no; 
selector: name=restbase1012.eqiad.wmnet [17:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:31] (03Merged) 10jenkins-bot: Add wikidata and mediawiki.org to $wgLocalVirtualHosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392999 (https://phabricator.wikimedia.org/T117302) (owner: 10TerraCodes) [17:37:12] (03CR) 10jenkins-bot: Add wikidata and mediawiki.org to $wgLocalVirtualHosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392999 (https://phabricator.wikimedia.org/T117302) (owner: 10TerraCodes) [17:41:58] !log upgrade and restart db2064 [17:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:50] (03PS1) 10Jcrespo: mariadb: Move db2064 socket away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/403962 (https://phabricator.wikimedia.org/T148507) [17:43:37] (03CR) 10Jcrespo: [C: 032] mariadb: Move db2064 socket away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/403962 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [17:58:59] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1012.eqiad.wmnet [17:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:37] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1014.eqiad.wmnet [17:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:06] (03CR) 10Aaron Schulz: [C: 031] mariadb: Kill almost all wikiuser queries, including replica control [software] - 10https://gerrit.wikimedia.org/r/403899 (https://phabricator.wikimedia.org/T180918) (owner: 10Jcrespo) [18:15:57] (03PS1) 10Cmjohnson: Adding mgmt dns for lvs1013-16 [dns] - 10https://gerrit.wikimedia.org/r/403971 (https://phabricator.wikimedia.org/T184293) [18:17:24] PROBLEM - HHVM rendering on mw1275 is CRITICAL: HTTP CRITICAL: 
HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [18:17:44] PROBLEM - Apache HTTP on mw1332 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [18:18:24] RECOVERY - HHVM rendering on mw1275 is OK: HTTP OK: HTTP/1.1 200 OK - 80718 bytes in 0.105 second response time [18:18:44] RECOVERY - Apache HTTP on mw1332 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.023 second response time [18:19:14] PROBLEM - Nginx local proxy to apache on mw1289 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.006 second response time [18:20:14] RECOVERY - Nginx local proxy to apache on mw1289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.035 second response time [18:20:36] (03CR) 10Cmjohnson: [C: 032] Adding mgmt dns for lvs1013-16 [dns] - 10https://gerrit.wikimedia.org/r/403971 (https://phabricator.wikimedia.org/T184293) (owner: 10Cmjohnson) [18:22:18] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#3897465 (10Cmjohnson) [18:26:43] 10Operations, 10Cloud-VPS, 10Traffic, 10netops, 10cloud-services-team (Kanban): Evaluate the possibility to add Juniper images to Openstack - https://phabricator.wikimedia.org/T180179#3897475 (10chasemp) 05Open>03stalled p:05Normal>03Low @ayounsi and I talked about this a bit and the use case is... [18:28:10] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1014.eqiad.wmnet [18:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:29] (03CR) 10Rush: "few notes and I didn't test it yet but idea seems sane to me." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [18:30:33] (03CR) 10Rush: "I think some portions of https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Attended_package_upgrades are stale w/ this version" [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [18:30:38] (03PS12) 10Rush: apt: unattended-upgrades: add targetted upgrades script [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [18:32:17] !log upgrade and restart db2088 [18:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:29] bblack: yt? [18:33:11] (03PS1) 10Krinkle: Remove unused PhpAutoPrepend.php file for now. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403973 (https://phabricator.wikimedia.org/T180183) [18:33:13] (03PS1) 10Krinkle: Initial profiler for Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403974 (https://phabricator.wikimedia.org/T180183) [18:33:31] (03CR) 10Jcrespo: [C: 032] mariadb: Kill almost all wikiuser queries, including replica control [software] - 10https://gerrit.wikimedia.org/r/403899 (https://phabricator.wikimedia.org/T180918) (owner: 10Jcrespo) [18:33:36] (03PS4) 10Jcrespo: mariadb: Kill almost all wikiuser queries, including replica control [software] - 10https://gerrit.wikimedia.org/r/403899 (https://phabricator.wikimedia.org/T180918) [18:35:43] (03CR) 10Jcrespo: [C: 032] "I will be deploying slowly to production, as this is the typical thing we never realized that could break the persistance logic for some o" [software] - 10https://gerrit.wikimedia.org/r/403899 (https://phabricator.wikimedia.org/T180918) (owner: 10Jcrespo) [18:37:22] (03Merged) 10jenkins-bot: mariadb: Kill almost all wikiuser queries, including replica control [software] - 
10https://gerrit.wikimedia.org/r/403899 (https://phabricator.wikimedia.org/T180918) (owner: 10Jcrespo) [18:45:28] (03PS1) 10Jcrespo: admin: Add skip-slave-start alias for jynus [puppet] - 10https://gerrit.wikimedia.org/r/403976 [18:46:07] PROBLEM - Freshness of OCSP Stapling files on cp3007 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2016-rsa-unified.ocsp is more than 259500 secs old! [18:49:15] robh: do you happen to know if some of those could be due to old, no longer in use certs? [18:49:53] uh, the unified certs are in use [18:49:55] bblack: ^ [18:50:53] just stating it's over 72 hours old [18:51:33] (03CR) 10Jcrespo: [C: 032] admin: Add skip-slave-start alias for jynus [puppet] - 10https://gerrit.wikimedia.org/r/403976 (owner: 10Jcrespo) [18:52:19] (03CR) 10Marostegui: "I will copy this one to my .bashrc too :)" [puppet] - 10https://gerrit.wikimedia.org/r/403976 (owner: 10Jcrespo) [18:54:53] robh: looking.... it's just that one? [18:55:04] bblack: it is all [18:55:09] others are warnings [18:55:13] ok [18:55:26] yeah, ok, it's 2016 [18:55:35] old cert? [18:55:37] so it's an old cert sitting on the hosts? [18:55:42] basically, failed to clean out state from the second certs, yes [18:55:42] jynus called it ;D [18:55:54] err, the second set of new certs :) [18:55:59] I can clean it up shortly [18:56:06] ok, sorry to disturb you, but last thing I wanted is certs slowly failing over the weekend [18:56:17] PROBLEM - Freshness of OCSP Stapling files on cp1068 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2016-rsa-unified.ocsp is more than 259500 secs old!
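Note on the alert above: the 259500 second threshold is 72 hours plus 5 minutes (72 × 3600 + 300), which is why the check is described as "over 72 hours old". A minimal sketch of a file-age check of that kind, assuming it compares the file's modification time against the threshold. The names `is_stale` and `check_file` are hypothetical, and this is not the actual monitoring plugin used on the cache hosts.

```python
import os
import time

# Threshold copied from the alert text: 259500 s = 72 h + 5 min.
MAX_AGE_SECS = 259500


def is_stale(mtime, now, max_age=MAX_AGE_SECS):
    """Return True when a file modified at `mtime` is older than `max_age` seconds."""
    return (now - mtime) > max_age


def check_file(path, max_age=MAX_AGE_SECS):
    """Hypothetical helper: build an alert-style status line for one file."""
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return f"UNKNOWN: File {path} does not exist"
    if is_stale(mtime, time.time(), max_age):
        return f"CRITICAL: File {path} is more than {int(max_age)} secs old!"
    return "OK"
```

A check like this cannot tell an expired, unused certificate from a live one whose OCSP responder is failing. Both show up as a stale file, which is why the question about in-use certs in the conversation matters.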
[18:56:19] I'm not sorry to disturb you, it's my job this week ;D [18:56:29] it makes sense, they expired on the 3rd [18:56:48] bblack: it is ok, ack them during the weekend works too, but you may not want to do that if it checks several certs at the same time [18:56:53] the OCSP responders at digicert probably kept working for ~10d after expiry, then we got out 72h, then these [18:57:50] one last question- assume we get that for an in use cert [18:58:08] or is it impossible unless they are expired? [18:58:34] it's not impossible, and it is usually a serious issue [18:58:38] (03PS1) 10Dzahn: mw-maintenance: move mariadb maintenance to tendril [puppet] - 10https://gerrit.wikimedia.org/r/403978 (https://phabricator.wikimedia.org/T184797) [18:58:51] it could mean that one of the certificate vendors has a failing OCSP service (or we're failing to contact it properly) [18:59:07] RECOVERY - Freshness of OCSP Stapling files on cp3007 is OK: OK [18:59:11] ok, so we were right on worrying [18:59:14] I just did the cleanup [18:59:17] RECOVERY - Freshness of OCSP Stapling files on cp1068 is OK: OK [18:59:20] they should all recover now [18:59:33] thanks [19:00:19] the general info on this is at: https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [19:00:32] thanks [19:00:43] TL;DR is if we see a bunch of failures for just globalsign or just digicert, there's a hieradata edit to make to switch all datacenters to the remaining good vendor. [19:00:57] PROBLEM - puppet last run on cp4025 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures.
Failed resources (up to 3 shown): Exec[digicert-2016-ecdsa-unified-create-ocsp],Exec[digicert-2016-rsa-unified-create-ocsp] [19:01:25] oh, my cleanup needs a puppet change first, heh [19:01:25] thanks, even if it is there, a refresh is welcome [19:02:46] (03PS1) 10BBlack: Stop distributing the 2016 digicert unified certs [puppet] - 10https://gerrit.wikimedia.org/r/403979 [19:03:08] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[digicert-2016-ecdsa-unified-create-ocsp],Exec[digicert-2016-rsa-unified-create-ocsp] [19:03:17] PROBLEM - puppet last run on cp1062 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[digicert-2016-ecdsa-unified-create-ocsp],Exec[digicert-2016-rsa-unified-create-ocsp] [19:03:18] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[digicert-2016-ecdsa-unified-create-ocsp],Exec[digicert-2016-rsa-unified-create-ocsp] [19:04:06] (03CR) 10BBlack: [C: 032] Stop distributing the 2016 digicert unified certs [puppet] - 10https://gerrit.wikimedia.org/r/403979 (owner: 10BBlack) [19:04:07] PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[digicert-2016-ecdsa-unified-create-ocsp],Exec[digicert-2016-rsa-unified-create-ocsp] [19:04:17] PROBLEM - puppet last run on cp1099 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[digicert-2016-ecdsa-unified-create-ocsp],Exec[digicert-2016-rsa-unified-create-ocsp] [19:04:17] PROBLEM - puppet last run on cp2012 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[digicert-2016-ecdsa-unified-create-ocsp],Exec[digicert-2016-rsa-unified-create-ocsp] [19:05:11] the puppetfails are because I didn't merge the above before doing the cumin-based cleanup [19:05:57] PROBLEM - puppet last run on cp4028 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[digicert-2016-ecdsa-unified-create-ocsp],Exec[digicert-2016-rsa-unified-create-ocsp] [19:06:07] PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[digicert-2016-ecdsa-unified-create-ocsp],Exec[digicert-2016-rsa-unified-create-ocsp] [19:06:08] PROBLEM - puppet last run on cp2005 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[digicert-2016-ecdsa-unified-create-ocsp],Exec[digicert-2016-rsa-unified-create-ocsp] [19:06:29] understandably- take your time, last thing we want is to make an outage out of a clean up [19:06:45] :-) [19:06:47] PROBLEM - puppet last run on cp1049 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[digicert-2016-ecdsa-unified-create-ocsp],Exec[digicert-2016-rsa-unified-create-ocsp] [19:06:57] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[digicert-2016-ecdsa-unified-create-ocsp],Exec[digicert-2016-rsa-unified-create-ocsp] [19:08:07] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [19:08:17] RECOVERY - puppet last run on cp1062 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [19:08:18] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [19:08:29] !log upgrade and restart db2091 [19:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:56] oh hey bblack [19:09:05] !log leftover cruft from expired digicert-2016 certs all cleaned up now :) [19:09:07] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:17] RECOVERY - puppet last run on cp1099 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:09:17] RECOVERY - puppet last run on cp2012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:09:32] Krenair: hi [19:09:54] any movement on secure redir service? [19:10:02] nope :/ [19:10:09] ok, figured [19:10:29] btw I made https://gerrit.wikimedia.org/r/#/c/403326/ [19:10:39] but, we have a new hire starting in ~1 month to take over doing a lot of related things, so that may help! 
:) [19:10:50] I heard :) [19:10:57] RECOVERY - puppet last run on cp4025 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:10:57] RECOVERY - puppet last run on cp4028 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:11:07] RECOVERY - puppet last run on cp1066 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:11:08] RECOVERY - puppet last run on cp2005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:11:47] RECOVERY - puppet last run on cp1049 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:11:57] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:13:53] Krenair: hmmm is the agreement URL error blocking all our LE renewals now? I thought they gave ample time on switching usuall [19:13:56] *usually [19:17:15] not sure [19:17:20] doubt it [19:17:27] I encountered this on a new cert, so [19:18:17] if it's not now, it might be later [19:18:26] hmmm ok [19:20:20] (CR) BBlack: [C: 1] "Overview of agreement changes here: https://community.letsencrypt.org/t/updating-our-subscriber-agreement-to-v1-2-on-november-15-2017/4560" [puppet] - https://gerrit.wikimedia.org/r/403326 (owner: Alex Monk) [19:27:28] Operations, Commons, Wikimedia-SVG-rendering, media-storage, and 2 others: Install Noto fonts on scaling servers for SVG rendering - https://phabricator.wikimedia.org/T184664#3897629 (Dzahn) @kaldari in "git log fc-list" i found a reference to "RT ticket #810". I then used phab advanced search w...
[19:30:15] Operations: update svg font list - https://phabricator.wikimedia.org/T79424#3897641 (Dzahn) [19:31:21] mutante: Oh cool, so it looks like someone needs to run "fc-list :fontformat=TrueType" from the command line on one of the scaling servers and then paste that list into operations/mediawiki-config/fc-list [19:33:05] kaldari: https://phabricator.wikimedia.org/P6582 :) [19:33:16] that's on a jessie-imagescaler [19:33:57] Operations, Commons, Wikimedia-SVG-rendering, media-storage, and 2 others: Install Noto fonts on scaling servers for SVG rendering - https://phabricator.wikimedia.org/T184664#3897664 (Dzahn) {P6582} [19:35:15] Thanks. I can handle updating the file if you like. [19:36:05] kaldari: i started to copy it but now that you offered it, yes please :) [19:36:13] NP [19:40:06] mutante: The output for Padauk looks weird. Any idea on that one? [19:41:03] kaldari: yea, that probably broke because i did copy/paste, let me upload the raw file a different way [19:41:11] Thanks [19:42:34] Operations, Commons, Wikimedia-SVG-rendering, media-storage, and 2 others: Install Noto fonts on scaling servers for SVG rendering - https://phabricator.wikimedia.org/T184664#3897673 (Dzahn) {F12516254} [19:43:30] kaldari: how's this https://phab.wmfusercontent.org/file/data/dt7evb3nzfpqhxkyobju/PHID-FILE-dszx3emoni4xzy4bdnat/fc-list [19:43:53] better :) [19:46:36] (CR) Krinkle: [C: 1] modules/webperf: handle oversamples differently than regular samples [puppet] - https://gerrit.wikimedia.org/r/402867 (https://phabricator.wikimedia.org/T181413) (owner: Imarlier) [19:46:50] (PS2) Krinkle: webperf: Handle oversamples differently than regular samples [puppet] - https://gerrit.wikimedia.org/r/402867 (https://phabricator.wikimedia.org/T181413) (owner: Imarlier) [19:49:44] (CR) Dzahn: [C: 1] "in compiler output you can see it removes resources from terbium/wasat and adds them on dbmonitor*" [puppet] - 
https://gerrit.wikimedia.org/r/403978 (https://phabricator.wikimedia.org/T184797) (owner: Dzahn) [20:00:01] (PS1) Kaldari: Updating fonts list and sorting it [mediawiki-config] - https://gerrit.wikimedia.org/r/403984 (https://phabricator.wikimedia.org/T184664) [20:00:12] Operations: bz work - https://phabricator.wikimedia.org/T79460#3897697 (Dzahn) [20:01:18] Operations: bz work - https://phabricator.wikimedia.org/T79464#3897702 (Dzahn) [20:01:46] Operations: bz work - https://phabricator.wikimedia.org/T79460#862972 (Dzahn) [20:01:48] Operations: bz work - https://phabricator.wikimedia.org/T79464#863012 (Dzahn) [20:03:40] Operations: Give Neil shell access - https://phabricator.wikimedia.org/T79518#3897716 (Dzahn) [20:04:26] Operations: Upgrade shell access to 'mortals' for 'reedy' - https://phabricator.wikimedia.org/T79394#3897720 (Dzahn) [20:06:03] (CR) Dzahn: [C: 1] "the source is mw1293 (jessie imagescaler)" [mediawiki-config] - https://gerrit.wikimedia.org/r/403984 (https://phabricator.wikimedia.org/T184664) (owner: Kaldari) [20:06:07] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 44.95, 36.08, 29.89 [20:07:29] !log mw1227 - high load: hhvm-dump-debug > /root/hhvm-dump-debug-2017012.log | Backtrace saved as /tmp/hhvm.2203.bt. 
[20:07:34] !log mw1227 hhvm-restart [20:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:16] (PS6) Tjones: Updates to enable transliteration for crhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/396282 (https://phabricator.wikimedia.org/T23582) [20:08:31] (CR) jerkins-bot: [V: -1] Updates to enable transliteration for crhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/396282 (https://phabricator.wikimedia.org/T23582) (owner: Tjones) [20:08:33] (PS4) Tjones: Updates to enable short URLs for transliteration for crhwiki production [puppet] - https://gerrit.wikimedia.org/r/398832 (https://phabricator.wikimedia.org/T23582) (owner: Gehel) [20:08:48] (CR) jerkins-bot: [V: -1] Updates to enable short URLs for transliteration for crhwiki production [puppet] - https://gerrit.wikimedia.org/r/398832 (https://phabricator.wikimedia.org/T23582) (owner: Gehel) [20:15:07] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 13.48, 18.87, 23.89 [20:21:54] (PS2) Krinkle: Initial profiler for Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/403974 (https://phabricator.wikimedia.org/T180183) [20:30:26] (CR) Jcrespo: "Thanks, I didn't expect you do do this, but this is very helpful." [puppet] - https://gerrit.wikimedia.org/r/403978 (https://phabricator.wikimedia.org/T184797) (owner: Dzahn) [20:32:03] (CR) Jcrespo: "Also, we have to do an intermediate commit forcing the "absent" on terbium/wasat for cleanup purposes." 
[puppet] - https://gerrit.wikimedia.org/r/403978 (https://phabricator.wikimedia.org/T184797) (owner: Dzahn) [20:39:47] (PS9) Tjones: Updates to enable short URLs for transliteration for crhwiki - beta [puppet] - https://gerrit.wikimedia.org/r/396283 (https://phabricator.wikimedia.org/T23582) [20:53:58] (PS5) Tjones: Updates to enable short URLs for transliteration for crhwiki production [puppet] - https://gerrit.wikimedia.org/r/398832 (https://phabricator.wikimedia.org/T23582) (owner: Gehel) [21:02:13] (PS7) Tjones: Updates to enable transliteration for crhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/396282 (https://phabricator.wikimedia.org/T23582) [21:14:43] (PS6) Tjones: Updates to enable short URLs for transliteration for crhwiki production [puppet] - https://gerrit.wikimedia.org/r/398832 (https://phabricator.wikimedia.org/T23582) (owner: Gehel) [21:18:34] (PS7) Tjones: Updates to enable short URLs for transliteration for crhwiki production [puppet] - https://gerrit.wikimedia.org/r/398832 (https://phabricator.wikimedia.org/T23582) (owner: Gehel) [21:33:55] (PS1) Krinkle: webperf: Re-use expected result by reference to simplify fixture [puppet] - https://gerrit.wikimedia.org/r/404045 [21:33:57] (PS1) Krinkle: webperf: Introduce 'templates' in test fixture and use for mwload [puppet] - https://gerrit.wikimedia.org/r/404046 [22:00:17] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [22:00:57] Operations, DBA, hardware-requests, Goal: reclaim and return all cisco servers - https://phabricator.wikimedia.org/T128821#3897891 (bd808) [22:05:27] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 3.09 ms [22:09:32] Operations, Discourse, Developer-Relations (Jan-Mar-2018): Setup reply via email in discourse-mediawiki.wmflabs.org - https://phabricator.wikimedia.org/T184592#3897920 (Qgil) Maybe not so simple. 
I got a Wikimedia.org address created and I have introduced the credentials in the admin interface after... [22:20:34] (CR) Dzahn: [C: 1] "cool, sounds all good, i'll wait with merging until after weekend and when you're around. re: forcing the absent i was planning to manuall" [puppet] - https://gerrit.wikimedia.org/r/403978 (https://phabricator.wikimedia.org/T184797) (owner: Dzahn) [22:22:03] Operations, DC-Ops, cloud-services-team (Kanban): Decommission labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T184832#3897940 (chasemp) p:Triage>Normal [22:30:48] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [22:32:47] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [22:34:39] (CR) Dzahn: DHCP: switch from jessie to stretch as default installer (2 comments) [puppet] - https://gerrit.wikimedia.org/r/399826 (https://phabricator.wikimedia.org/T182215) (owner: Dzahn) [22:35:57] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 1.58 ms [22:37:28] (PS2) Dzahn: DHCP: switch from jessie to stretch as default installer [puppet] - https://gerrit.wikimedia.org/r/399826 (https://phabricator.wikimedia.org/T182215) [22:37:58] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.86 ms [22:39:17] Operations, Domains, Research, Traffic, Patch-For-Review: Create subdomain for Research landing page - https://phabricator.wikimedia.org/T183916#3897978 (bmansurov) @Krenair and @Dzahn to answer your earlier questions, please see T107389 for more info. [22:39:57] (PS1) Dzahn: DHCP: rm rm linux-host-entries.ttyS0-9600 (cameras) [puppet] - https://gerrit.wikimedia.org/r/404053 [22:44:07] Operations, Domains, Research, Traffic, Patch-For-Review: Create subdomain for Research landing page - https://phabricator.wikimedia.org/T183916#3897982 (Dzahn) @bmansurov thanks for that link and detailed explanation. 
ping me when it's time to launch. [22:53:03] (PS1) Dzahn: DHCP: switch to using http to serve installer [puppet] - https://gerrit.wikimedia.org/r/404054 (https://phabricator.wikimedia.org/T182215) [22:53:47] (PS2) Dzahn: DHCP: switch to http to serve installer [puppet] - https://gerrit.wikimedia.org/r/404054 (https://phabricator.wikimedia.org/T182215) [22:54:47] Operations, Domains, Research, Traffic, Patch-For-Review: Create subdomain for Research landing page - https://phabricator.wikimedia.org/T183916#3898050 (bmansurov) @Dzahn OK, thanks! [23:01:08] (PS3) Dzahn: DHCP: switch all to http to serve installer [puppet] - https://gerrit.wikimedia.org/r/404054 (https://phabricator.wikimedia.org/T182215) [23:07:08] PROBLEM - HP RAID on db2036 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:6 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK [23:07:10] ACKNOWLEDGEMENT - HP RAID on db2036 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:6 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T184836 [23:07:13] Operations, ops-codfw: Degraded RAID on db2036 - https://phabricator.wikimedia.org/T184836#3898087 (ops-monitoring-bot) [23:08:23] Operations, Commons, Wikimedia-SVG-rendering, media-storage, and 2 others: Install Noto fonts on scaling servers for SVG rendering - https://phabricator.wikimedia.org/T184664#3898092 (kaldari) Open>Resolved a:kaldari I tested and verified that the new Noto fonts are working on the sca... 
[23:13:27] (PS4) Dzahn: DHCP: switch all to http to serve installer [puppet] - https://gerrit.wikimedia.org/r/404054 (https://phabricator.wikimedia.org/T182215) [23:14:37] Operations, DBA, hardware-requests, Goal: reclaim and return all cisco servers - https://phabricator.wikimedia.org/T128821#3898111 (bd808) [23:19:53] (CR) Subramanya Sastry: [C: 1] Updates to enable transliteration for crhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/396282 (https://phabricator.wikimedia.org/T23582) (owner: Tjones) [23:20:00] (CR) Subramanya Sastry: [C: 1] Updates to enable short URLs for transliteration for crhwiki - beta [puppet] - https://gerrit.wikimedia.org/r/396283 (https://phabricator.wikimedia.org/T23582) (owner: Tjones) [23:20:10] (CR) Subramanya Sastry: [C: 1] Updates to enable short URLs for transliteration for crhwiki production [puppet] - https://gerrit.wikimedia.org/r/398832 (https://phabricator.wikimedia.org/T23582) (owner: Gehel) [23:32:19] (PS3) Dzahn: DHCP: switch from jessie to stretch as default installer [puppet] - https://gerrit.wikimedia.org/r/399826 (https://phabricator.wikimedia.org/T182215) [23:35:50] (CR) Dzahn: [C: -1] "broken rebase again because there are constant changes to this.. :/" [puppet] - https://gerrit.wikimedia.org/r/399826 (https://phabricator.wikimedia.org/T182215) (owner: Dzahn) [23:44:24] Operations, ops-codfw, DBA: Degraded RAID on db2036 - https://phabricator.wikimedia.org/T184836#3898173 (Peachey88) [23:44:27] Operations, Ops-Access-Requests: Requesting access to stat1004, stat1005, stat1006 for mneisler - https://phabricator.wikimedia.org/T184838#3898174 (MNeisler)