[00:04:13] !log let icinga own /var/log/icinga on einsteinium, restart icinga [00:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:09:23] addshore , any reason you don't properly log tasks to SAL? [00:09:43] arseny92: ? *checks* [00:10:27] aah arseny92 your patch (the abuse filter change) was never synced, thus no automatic SAL entry [00:10:28] !log restarted icinga-wm, now there is /var/log/icinga/irc.log, it should talk now, but doesnt [00:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:10:46] arseny92: it was never actually deployed [00:10:59] For stashbot to reply on merged / synced tasks , the task id need to be in the log line when posted. I mean the Uploadwizard log entry [00:11:39] they all appear for me https://usercontent.irccloud-cdn.com/file/Nk6Ho4cd/ [00:12:28] oh wait, the task id...? (I don't usually add the task ID, only the gerrit patch link). [00:14:01] the task id is needed for stashbot to reply on tasks to indicate the change was synced and is on sal [00:14:47] see the 10/24 patches for example when I and tyler had an IS/CS dance [00:16:24] addshore see? [00:16:31] arseny92: added to my notes for next time! [00:17:22] and arseny92 the notes @ https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Full_deployment do include a phab ID but don't say what that results in / why it should be done, might be worth a poke. [00:17:45] going to bed now though! 
:) [00:19:08] 06Operations, 10Traffic: reimage cp1047 - https://phabricator.wikimedia.org/T148723#2743594 (10BBlack) 05Open>03Resolved a:03BBlack [00:26:19] (03Abandoned) 10Dereckson: Enable Flow personal talk opt-in Beta Feature on el.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307788 (https://phabricator.wikimedia.org/T144384) (owner: 10Dereckson) [00:28:46] addshore , poked that page ;) [00:30:36] (03PS1) 10Cenarium: Remove 'validate' from enwiki reviewers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318018 [00:35:56] RECOVERY - TEST [00:36:07] icinga-wm: -.- [00:36:37] godog: what else did you do :) [00:37:10] nothing strange really, started it manually from the shell [00:37:14] RECOVERY - SELF [00:37:34] hmm, ok, i used /etc/init.d/ircecho [00:38:40] RECOVERY - TEST [00:38:52] ok now started with systemctl, looks like it is working [00:38:58] cool [00:40:24] ACKNOWLEDGEMENT - NTP on mw2098 is CRITICAL: NTP CRITICAL: Offset unknown daniel_zahn TEST ACK [00:40:45] bblack: ^ i got the button when i logged in with Chrome and new session [00:43:08] (03CR) 10Dereckson: "I've run again ./createTxtFileSymlinks.sh, we still need this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309743 (owner: 10Dereckson) [00:47:11] (03PS2) 10Dereckson: Update noc.wikimedia.org dblist files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309743 [00:49:39] (03PS1) 10Dereckson: Add missing configuration files in noc. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318027 [00:50:17] ^ scheduled for 2016-10-26 evening SWAT [00:54:40] (03PS1) 10Dzahn: icinga: let icinga own /var/log/icinga [puppet] - 10https://gerrit.wikimedia.org/r/318030 [00:55:03] (03PS2) 10Dereckson: Improve dblist name coherence [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309186 [00:55:40] (03CR) 10Dereckson: "PS2: fixed directory issue, noc. handled too per previous comment." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/309186 (owner: 10Dereckson) [00:56:24] (03PS2) 10Dzahn: icinga: let icinga own /var/log/icinga [puppet] - 10https://gerrit.wikimedia.org/r/318030 [01:08:20] (03CR) 10Filippo Giunchedi: "LGTM, but Alex should chime in in case we're missing sth" [puppet] - 10https://gerrit.wikimedia.org/r/318030 (owner: 10Dzahn) [01:09:16] (03CR) 10Dzahn: "2757 on neon, 2755 on einsteinium, fwiw" [puppet] - 10https://gerrit.wikimedia.org/r/318030 (owner: 10Dzahn) [01:13:28] !log palladium - shutdown -h now [01:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:15:02] is anyone doing anything on 1017? I would like to use it to try to figure out what's wrong with zero in general [01:15:31] my understanding is that it doesn't affect prod traffic [01:15:44] unless explicitly requested [01:19:38] yurik, that's right [01:19:48] you can check the list of people logged into that server [01:20:54] Krenair, do i do scap pull to re-sync it back to original? [01:20:58] after i'm done [01:21:03] yes [01:21:30] Krenair, is there a way to do git pull from mw1017? [01:22:21] I'd touch the scap lock on the deployment server, get the repository in the state you want it, and scap pull on the target hosts [01:22:40] then put the repository back how you found it and release the lock [01:24:10] Krenair, i was hoping to avoid changing depl server :( [01:24:34] yurik, yeah well the git repository doesn't get synchronised to non-deployment servers [01:24:43] you can edit the files manually without git [01:24:51] might have to sudo as the right user but you can do it [01:25:26] sudo -u mwdeploy bash seems to work [01:26:38] i might need to scp php files [01:26:44] should be easy enough [01:26:46] thanks for your help [01:27:08] yurik, you know how to make your traffic hit the right server? [01:27:21] the chrome extension? 
[01:27:54] that's what i have been using for the testing [01:27:59] yes [01:28:05] ok [01:28:12] ACKNOWLEDGEMENT - salt-minion processes on phab2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion daniel_zahn puppet currently deactivated - role needs fixes [01:31:10] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure: labstore1003 - RAID fail - https://phabricator.wikimedia.org/T149156#2743689 (10Dzahn) [01:32:05] ACKNOWLEDGEMENT - MegaRAID on labstore1003 is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) daniel_zahn https://phabricator.wikimedia.org/T149156 [01:33:15] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2743702 (10Dzahn) [01:34:55] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2021137 (10Dzahn) palladium down , potassium reimaged, count down to **9** with neon to follow soon [01:36:40] !log lead - (formerly gerrit) - shutdown -h now (T147905) [01:36:41] T147905: investigate lead hardware issue - https://phabricator.wikimedia.org/T147905 [01:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:39:40] (03PS1) 10Dzahn: remove lead.wikimedia.org, keep lead.mgmt.eqiad [dns] - 10https://gerrit.wikimedia.org/r/318033 (https://phabricator.wikimedia.org/T147905) [01:41:30] (03PS1) 10Dzahn: remove palladium.eqiad, keep palladium.mgmt.eqiad [dns] - 10https://gerrit.wikimedia.org/r/318034 (https://phabricator.wikimedia.org/T147320) [01:45:59] ema: bblack: what does it mean when you have several "pass" entries in an x-cache header? for example, "cp1055 pass, cp1065 pass"? Is that, pass on the in-memory cache layer followed by pass on the disk cache layer? [01:46:17] also hi, and thx in advance! [01:58:11] all okay yurik? 
[01:58:37] Krenair, yep, all's good, i haven't started actually yet - still researching on my own machine. I might need to do it tomorrow [01:58:49] ok [01:58:50] zero config is nasty :( [01:59:11] i wonder who wrote all that crap [01:59:31] oh, wait, that was me [01:59:36] darn :( [01:59:39] (03PS1) 10Dzahn: decom lead [puppet] - 10https://gerrit.wikimedia.org/r/318035 [01:59:56] heh [02:16:34] (03PS1) 10BryanDavis: wikitech: Set wgMWOAuthCentralWiki = false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318036 (https://phabricator.wikimedia.org/T149150) [02:18:08] AndyRussG: yes, that would be the frontend cache and then the consistently hashed backend cache both doing hit-for-pass based on the VCL rules. [02:26:58] (03CR) 10Gergő Tisza: [C: 031] wikitech: Set wgMWOAuthCentralWiki = false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318036 (https://phabricator.wikimedia.org/T149150) (owner: 10BryanDavis) [02:27:49] bd808: huh interesting! thx!!! [02:28:07] running out for a bit, I'll see any backscroll tho :) bye! 
[02:28:47] !log l10nupdate@tin scap sync-l10n completed (1.28.0-wmf.22) (duration: 09m 38s) [02:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:55:00] !log l10nupdate@tin scap sync-l10n completed (1.28.0-wmf.23) (duration: 10m 13s) [02:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:56:24] RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 0.75 seconds [02:59:28] (03PS3) 10Kaldari: Create patroller usergroup for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317824 (https://phabricator.wikimedia.org/T149019) (owner: 10Cenarium) [03:00:28] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Oct 26 03:00:28 UTC 2016 (duration 5m 28s) [03:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:19:55] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:24:04] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 759.33 seconds [03:27:45] (03CR) 10Arlolra: [C: 031] Parsoid: Use Scap3 for config-file deploys [puppet] - 10https://gerrit.wikimedia.org/r/315069 (https://phabricator.wikimedia.org/T144596) (owner: 10Mobrovac) [03:32:14] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 205.08 seconds [03:32:54] PROBLEM - traffic-pool service on cp1047 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive [03:44:05] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave]BR [03:44:24] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, 
unused: 0BRxe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]BR [03:47:24] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [03:51:51] bblack: you awake? [04:06:54] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:07:44] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [04:14:14] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [04:14:45] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [04:34:56] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave]BR [04:35:14] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]BR [05:03:54] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [05:04:34] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [05:49:22] (TeliaSonera announced maintenance --^) [06:34:55] !log repooled mw2098 (was previously down for hardware check) [06:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:45:32] (03PS1) 10Yurik: LABS: Enable Tabular data on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318046 (https://phabricator.wikimedia.org/T148745) [06:46:07] (03CR) 10jenkins-bot: [V: 
04-1] LABS: Enable Tabular data on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318046 (https://phabricator.wikimedia.org/T148745) (owner: 10Yurik) [06:46:25] 06Operations, 06Services (next), 15User-mobrovac: Investigate better protection modes for electron render service (xvfb setuid) - https://phabricator.wikimedia.org/T143336#2743972 (10Gilles) Afaik --noprofile removes a lot of protections we usually have in place. Eg. this is the profile currently used for th... [06:46:31] (03PS1) 10Yurik: Remove obsolete config values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318047 [06:47:35] (03PS2) 10Yurik: LABS: Enable Tabular data on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318046 (https://phabricator.wikimedia.org/T148745) [06:56:23] (03CR) 10Giuseppe Lavagetto: [C: 032] esams: introduce svc records for swift [dns] - 10https://gerrit.wikimedia.org/r/318010 (https://phabricator.wikimedia.org/T149098) (owner: 10Filippo Giunchedi) [06:58:43] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "You need to define" [puppet] - 10https://gerrit.wikimedia.org/r/316954 (owner: 10Mobrovac) [07:00:24] RECOVERY - mediawiki-installation DSH group on mw2098 is OK: OK [07:10:42] !log rebooting nescio for kernel update [07:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:12:25] (03PS1) 10Giuseppe Lavagetto: docker::registry: puppetization for production [puppet] - 10https://gerrit.wikimedia.org/r/318050 (https://phabricator.wikimedia.org/T148966) [07:13:05] PROBLEM - Host 91.198.174.106 is DOWN: PING CRITICAL - Packet loss = 100% [07:13:31] (03CR) 10jenkins-bot: [V: 04-1] docker::registry: puppetization for production [puppet] - 10https://gerrit.wikimedia.org/r/318050 (https://phabricator.wikimedia.org/T148966) (owner: 10Giuseppe Lavagetto) [07:13:38] 06Operations, 10Ops-Access-Requests, 10Analytics, 06Discovery, 06Discovery-Analysis: Pivot access for Discovery's Analysis team - 
https://phabricator.wikimedia.org/T149144#2744012 (10Peachey88) [07:14:33] !log Deploying ALTER table s4 commonswiki.templatelinks - db2051 - T149079 [07:14:34] T149079: codfw: Fix S4 commonswiki.templatelinks partitions - https://phabricator.wikimedia.org/T149079 [07:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:14:54] RECOVERY - Host 91.198.174.106 is UP: PING OK - Packet loss = 0%, RTA = 84.03 ms [07:20:26] !log rebooting oxygen for kernel update [07:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:22:05] ACKNOWLEDGEMENT - MD RAID on oxygen is CRITICAL: Return code of 255 is out of bounds nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T149167 [07:22:08] 06Operations, 10ops-eqiad: Degraded RAID on oxygen - https://phabricator.wikimedia.org/T149167#2744018 (10ops-monitoring-bot) [07:28:18] 06Operations, 10ops-eqiad, 10Analytics-Cluster: Degraded RAID on oxygen - https://phabricator.wikimedia.org/T149167#2744023 (10Peachey88) kafkatee cluster best I can tell [07:37:41] mmmm mstat on oxygen seems fine [07:38:00] 06Operations, 10ops-eqiad: Degraded RAID on oxygen - https://phabricator.wikimedia.org/T149167#2744028 (10elukey) [07:41:28] 06Operations, 10ops-eqiad: Degraded RAID on oxygen - https://phabricator.wikimedia.org/T149167#2744031 (10elukey) p:05Triage>03Normal @Volans adding you to double check this alert, it seems a false positive but I might be wrong. [07:41:50] ah snap just seen the reboot [07:42:02] lol [07:43:18] 06Operations, 10ops-eqiad: Degraded RAID on oxygen - https://phabricator.wikimedia.org/T149167#2744033 (10elukey) p:05Normal>03Low The host was rebooted right before the alarm was fired: ``` !log rebooting oxygen for kernel update elukey@oxygen:~$ uptime 07:42:10 up 20 min, 1 user, load aver... 
[07:45:35] !log rebooting mc* servers in codfw for kernel update [07:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:47:03] !log bounced ntp on oxygen (stuck in XFAC state) [07:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:50:48] 06Operations, 10ops-eqiad, 10Dumps-Generation: Degraded RAID on dataset1001 - https://phabricator.wikimedia.org/T148715#2744036 (10ArielGlenn) Host looks good to me now. Can I just delete the icinga comment and close this? [07:51:30] 06Operations, 10Datasets-General-or-Unknown: reinstall snapshot1001.eqiad.wmnet with RAID - https://phabricator.wikimedia.org/T140439#2744039 (10ArielGlenn) [07:51:32] 06Operations, 10Datasets-General-or-Unknown, 10hardware-requests: reallocate snapshot1001 for use as canary/testbed for dumps - https://phabricator.wikimedia.org/T144728#2744037 (10ArielGlenn) 05Open>03Resolved [07:54:57] (03PS1) 10Elukey: Restore mc2009/mc2010 to standard settings [puppet] - 10https://gerrit.wikimedia.org/r/318051 [08:02:56] PROBLEM - puppet last run on cp3015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:06:18] (03CR) 10Elukey: "Extra paranoid pcc:" [puppet] - 10https://gerrit.wikimedia.org/r/318051 (owner: 10Elukey) [08:06:51] !log Stopping replication on db2058 - using it to clone another host - T146261 [08:06:52] T146261: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261 [08:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:09:08] (03CR) 10Thiemo Mättig (WMDE): "I don't think this is the right place for a discussion like this." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/317840 (https://phabricator.wikimedia.org/T142940) (owner: 10Thiemo Mättig (WMDE)) [08:13:02] (03CR) 10Giuseppe Lavagetto: [C: 031] Restore mc2009/mc2010 to standard settings [puppet] - 10https://gerrit.wikimedia.org/r/318051 (owner: 10Elukey) [08:13:39] (03CR) 10Elukey: [C: 032] Restore mc2009/mc2010 to standard settings [puppet] - 10https://gerrit.wikimedia.org/r/318051 (owner: 10Elukey) [08:18:27] (03PS1) 10R4q3NWnUx2CEhVyr: Allocate only the needed size for the format structure array [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/318053 [08:25:10] !log elastic@eqiad reindexing enwiki (take 3) with BM25 from wasat.codfw.wmnet T147508 (logs in ~dcausse/bm25_reindex/cirrus_log) [08:25:11] T147508: BM25: initial limited release into production - https://phabricator.wikimedia.org/T147508 [08:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:29:52] !log downgraded memcached on mc2009 to the Debian Jessie version (was part of a performance experiment) [08:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:30:33] Hey guys, I wonder if there is a downloadable list of Properties that also includes their categories? I only found the property browser from Hay's tools which provides me with a json string containing some information, but I really only need the Property ID and its topmost category, like Generic, Person, etc. 
[08:31:17] RECOVERY - puppet last run on cp3015 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [08:43:23] !log increasing the AQS cassandra system_auth keyspace replication from 1 to 6 (and running nodetool-{a,b} repair system_auth on all nodes) [08:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:47:25] 06Operations, 10Parsoid: wtp2019.codfw.wmnet is down - https://phabricator.wikimedia.org/T149110#2744082 (10akosiaris) [08:47:27] 06Operations, 10ops-codfw, 06DC-Ops: wtp2019 issues an uncorrectable memory error - https://phabricator.wikimedia.org/T148710#2744085 (10akosiaris) [09:02:57] 06Operations, 13Patch-For-Review: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562#2744109 (10ArielGlenn) [09:02:59] 06Operations, 10Datasets-General-or-Unknown: reinstall snapshot1001.eqiad.wmnet with RAID - https://phabricator.wikimedia.org/T140439#2744107 (10ArielGlenn) 05Open>03Resolved Given that this is a testbed host, we can afford to experiment. I've installed with HW RAID using the H200. We'll see how that goes. [09:03:17] (03PS1) 10Gehel: elasticsearch - enable garbage collection logs on relforge servers [puppet] - 10https://gerrit.wikimedia.org/r/318055 (https://phabricator.wikimedia.org/T134853) [09:04:15] (03CR) 10Gehel: Gerrit: Enable logging for jvm gc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/317582 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [09:04:23] <_joe_> incoming... 
[09:04:44] (03PS2) 10Giuseppe Lavagetto: docker::registry: puppetization for production [puppet] - 10https://gerrit.wikimedia.org/r/318050 (https://phabricator.wikimedia.org/T148966) [09:04:46] (03PS1) 10Giuseppe Lavagetto: docker::registry: add support for swift storage backend [puppet] - 10https://gerrit.wikimedia.org/r/318056 [09:04:48] (03PS1) 10Giuseppe Lavagetto: docker::registry: move hiera lookups to the role [puppet] - 10https://gerrit.wikimedia.org/r/318057 [09:04:50] (03PS1) 10Giuseppe Lavagetto: docker::registry: drop setcap [puppet] - 10https://gerrit.wikimedia.org/r/318058 [09:04:52] (03PS1) 10Giuseppe Lavagetto: docker::registry: move htpasswd file to /etc/nginx [puppet] - 10https://gerrit.wikimedia.org/r/318059 [09:04:54] (03PS1) 10Giuseppe Lavagetto: docker::registry: separate nginx config from the main one [puppet] - 10https://gerrit.wikimedia.org/r/318060 [09:04:56] (03PS1) 10Giuseppe Lavagetto: docker::registry::web: listen on ipv6 as well [puppet] - 10https://gerrit.wikimedia.org/r/318061 [09:04:58] (03PS1) 10Giuseppe Lavagetto: docker::web: allow defining multiple build servers [puppet] - 10https://gerrit.wikimedia.org/r/318062 [09:05:00] (03PS1) 10Giuseppe Lavagetto: docker::registry::web: allow using puppet certs [puppet] - 10https://gerrit.wikimedia.org/r/318063 [09:05:02] (03PS1) 10Giuseppe Lavagetto: docker::registry: allow passing configurations [puppet] - 10https://gerrit.wikimedia.org/r/318064 [09:05:04] (03PS1) 10Giuseppe Lavagetto: docker::registry: drop http host setting [puppet] - 10https://gerrit.wikimedia.org/r/318065 [09:05:22] <_joe_> brb [09:07:39] (03CR) 10jenkins-bot: [V: 04-1] docker::registry: puppetization for production [puppet] - 10https://gerrit.wikimedia.org/r/318050 (https://phabricator.wikimedia.org/T148966) (owner: 10Giuseppe Lavagetto) [09:23:55] PROBLEM - Disk space on ms-be2011 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdi1 is not accessible: Input/output error [09:25:33] (03PS1) 10Paladox: Gerrit: Adding 
option -XX:+PrintGCApplicationStoppedTime to gc logging [puppet] - 10https://gerrit.wikimedia.org/r/318067 (https://phabricator.wikimedia.org/T148478) [09:25:44] (03PS1) 10R4q3NWnUx2CEhVyr: Recent versions of librdkafka allow to negociate API versions [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/318068 [09:25:56] (03CR) 10Paladox: Gerrit: Enable logging for jvm gc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/317582 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [09:30:49] PROBLEM - puppet last run on ms-be2011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdi1] [09:37:19] PROBLEM - MegaRAID on ms-be2011 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) [09:38:08] (03CR) 10Gehel: [C: 031] "LGTM and simple enough" [puppet] - 10https://gerrit.wikimedia.org/r/318067 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [09:38:38] (03PS3) 10Giuseppe Lavagetto: docker::registry: puppetization for production [puppet] - 10https://gerrit.wikimedia.org/r/318050 (https://phabricator.wikimedia.org/T148966) [09:38:49] (03PS1) 10ArielGlenn: create dumps testbed role for snapshot1001 [puppet] - 10https://gerrit.wikimedia.org/r/318069 [09:39:41] (03PS2) 10ArielGlenn: create dumps testbed role for snapshot1001 [puppet] - 10https://gerrit.wikimedia.org/r/318069 (https://phabricator.wikimedia.org/T149171) [09:43:59] (03CR) 10Gilles: "Should I schedule this for a SWAT? It's been sitting for a while." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/301128 (owner: 10Jforrester) [09:47:28] (03PS2) 10Giuseppe Lavagetto: docker::registry: add support for swift storage backend [puppet] - 10https://gerrit.wikimedia.org/r/318056 [09:47:47] (03CR) 10Giuseppe Lavagetto: [C: 032] "Tested on toollabs: noop" [puppet] - 10https://gerrit.wikimedia.org/r/318056 (owner: 10Giuseppe Lavagetto) [09:52:49] !log starting schema change (imagelinks) on s1 T139090 [09:52:50] T139090: Deploy I2b042685 to all databases - https://phabricator.wikimedia.org/T139090 [09:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:54:00] (03PS1) 10Thiemo Mättig (WMDE): Enable Wikibase #statements parser function on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318071 (https://phabricator.wikimedia.org/T142940) [09:54:57] (03CR) 10Giuseppe Lavagetto: [V: 032] docker::registry: add support for swift storage backend [puppet] - 10https://gerrit.wikimedia.org/r/318056 (owner: 10Giuseppe Lavagetto) [09:55:07] (03PS2) 10Thiemo Mättig (WMDE): Enable Wikibase #statements parser function on all test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317840 (https://phabricator.wikimedia.org/T142940) [10:01:30] RECOVERY - Disk space on ms-be2011 is OK: DISK OK [10:01:59] (03PS1) 10Elukey: Force Content-type for files without extensions (noc.w.o) [puppet] - 10https://gerrit.wikimedia.org/r/318074 (https://phabricator.wikimedia.org/T146421) [10:02:15] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment: Build Kubernetes for production use - https://phabricator.wikimedia.org/T148968#2744248 (10Joe) p:05Triage>03Normal [10:02:51] !log rebooting codfw lvs primaries (lvs200[1-3]) [10:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:07:34] (03PS2) 10Giuseppe Lavagetto: docker::registry: move hiera lookups to the role [puppet] - 10https://gerrit.wikimedia.org/r/318057 [10:09:29] (03PS2) 
10BBlack: eqiad recdns IP fix: remove old from DNS [dns] - 10https://gerrit.wikimedia.org/r/315928 (https://phabricator.wikimedia.org/T143915) [10:09:53] (03CR) 10BBlack: [C: 032] eqiad recdns IP fix: remove old from DNS [dns] - 10https://gerrit.wikimedia.org/r/315928 (https://phabricator.wikimedia.org/T143915) (owner: 10BBlack) [10:11:00] (03PS3) 10Giuseppe Lavagetto: docker::registry: move hiera lookups to the role [puppet] - 10https://gerrit.wikimedia.org/r/318057 [10:13:03] !log rebooting ulsfo lvs primaries (lvs400[12]) [10:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:16:04] (03CR) 10Giuseppe Lavagetto: [C: 032] docker::registry: move hiera lookups to the role [puppet] - 10https://gerrit.wikimedia.org/r/318057 (owner: 10Giuseppe Lavagetto) [10:16:38] (03CR) 10Giuseppe Lavagetto: [C: 032] "Discussed with yuvi yesterday" [puppet] - 10https://gerrit.wikimedia.org/r/318058 (owner: 10Giuseppe Lavagetto) [10:16:44] (03PS2) 10Giuseppe Lavagetto: docker::registry: drop setcap [puppet] - 10https://gerrit.wikimedia.org/r/318058 [10:19:45] !log rebooting esams lvs primaries (lvs300[12]) [10:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:19:58] Anybody online who can provide me with information about fetching categories for properties, like displayed in https://www.wikidata.org/wiki/Wikidata:List_of_properties ? [10:23:57] (03CR) 10Thiemo Mättig (WMDE): "Note: As part of the Wikidata birthday we will announce this feature on November, 1st. 
We (Lea) believe such an announcement does not make" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318071 (https://phabricator.wikimedia.org/T142940) (owner: 10Thiemo Mättig (WMDE)) [10:25:35] (03PS2) 10Thiemo Mättig (WMDE): Enable Wikibase #statements parser function on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318071 (https://phabricator.wikimedia.org/T142940) [10:25:44] (03PS3) 10Thiemo Mättig (WMDE): Enable Wikibase #statements parser function on all test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317840 (https://phabricator.wikimedia.org/T142940) [10:32:54] AlexK: Have you tried in #wikidata ? I doubt you will find help about such a specific thing in this channel [10:33:35] (03PS2) 10BBlack: eqiad recdns IP fix: remove old from LVS [puppet] - 10https://gerrit.wikimedia.org/r/315931 (https://phabricator.wikimedia.org/T143915) [10:33:48] I will try there, thanks :) [10:35:08] (03CR) 10BBlack: [C: 032] eqiad recdns IP fix: remove old from LVS [puppet] - 10https://gerrit.wikimedia.org/r/315931 (https://phabricator.wikimedia.org/T143915) (owner: 10BBlack) [10:35:43] (03PS2) 10Giuseppe Lavagetto: docker::registry: move htpasswd file to /etc/nginx [puppet] - 10https://gerrit.wikimedia.org/r/318059 [10:38:10] (03CR) 10Giuseppe Lavagetto: [C: 032] "Works well in toollabs" [puppet] - 10https://gerrit.wikimedia.org/r/318059 (owner: 10Giuseppe Lavagetto) [10:40:25] !log rebooting eqiad lvs primaries (lvs100[1-3]) [10:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:45:07] RECOVERY - traffic-pool service on cp1047 is OK: OK - traffic-pool is active [10:47:07] 06Operations, 10Analytics, 10Analytics-Cluster, 06Research-and-Data, and 2 others: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2744294 (10elukey) [10:47:15] 06Operations, 10Analytics, 10Analytics-Cluster, 06Research-and-Data, and 2 others: GPU upgrade for stats machine - 
https://phabricator.wikimedia.org/T148843#2734568 (10elukey) p:05Triage>03Normal [10:48:57] 06Operations, 10Analytics, 10Analytics-Cluster, 06Research-and-Data, and 2 others: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2734568 (10elukey) @DarTar would it make sense to add initially one GPU card only to stat1003 (research data cruncher) and see how it goes, rather tha... [10:51:50] 06Operations, 10ops-eqiad: decom titanium - https://phabricator.wikimedia.org/T145666#2744311 (10elukey) [10:55:31] 06Operations, 10Traffic, 10netops, 13Patch-For-Review: Fix static IP fallbacks to Pybal LVS routes - https://phabricator.wikimedia.org/T143915#2744318 (10BBlack) The recdns case is fully-fixed now (the old/bad IP no longer present anywhere or functional). [10:56:22] 06Operations, 10hardware-requests: Decommission analytics1026 and analytics1015 - https://phabricator.wikimedia.org/T147313#2744320 (10elukey) [10:57:05] hello [11:04:15] (03PS1) 10Giuseppe Lavagetto: terbium/wasat: add noc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/318075 [11:04:17] (03PS1) 10Giuseppe Lavagetto: cache::misc: switch noc.w.o to terbium [puppet] - 10https://gerrit.wikimedia.org/r/318076 [11:04:19] (03PS1) 10Giuseppe Lavagetto: mediawiki: decommission mw1152 [puppet] - 10https://gerrit.wikimedia.org/r/318077 [11:05:01] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] terbium/wasat: add noc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/318075 (owner: 10Giuseppe Lavagetto) [11:06:26] (03PS1) 10Muehlenhoff: Switch back to tin as deployment server [dns] - 10https://gerrit.wikimedia.org/r/318078 [11:08:35] (03PS1) 10Muehlenhoff: Switch back to tin as deployment server [puppet] - 10https://gerrit.wikimedia.org/r/318079 [11:10:28] (03CR) 10Hashar: [C: 031] Switch back to tin as deployment server [dns] - 10https://gerrit.wikimedia.org/r/318078 (owner: 10Muehlenhoff) [11:10:52] (03CR) 10Hashar: [C: 031] Switch back to tin as deployment server [puppet] - 
10https://gerrit.wikimedia.org/r/318079 (owner: 10Muehlenhoff) [11:10:56] (03CR) 10Muehlenhoff: [C: 032] Switch back to tin as deployment server [dns] - 10https://gerrit.wikimedia.org/r/318078 (owner: 10Muehlenhoff) [11:12:56] (03CR) 10Muehlenhoff: [C: 032] Switch back to tin as deployment server [puppet] - 10https://gerrit.wikimedia.org/r/318079 (owner: 10Muehlenhoff) [11:23:35] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:37:00] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team: Trebuchet targets for test/testrepo are out of date - https://phabricator.wikimedia.org/T149180#2744454 (10hashar) [11:41:42] 06Operations, 06Operations-Software-Development, 07HHVM, 13Patch-For-Review: Upgrade all mw* servers to debian jessie - https://phabricator.wikimedia.org/T143536#2744532 (10MoritzMuehlenhoff) [11:41:45] 06Operations, 06Release-Engineering-Team, 07HHVM, 13Patch-For-Review, 06Services (doing): Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2744530 (10MoritzMuehlenhoff) 05Open>03Resolved This is now complete. [11:43:02] (03PS2) 10Giuseppe Lavagetto: cache::misc: switch noc.w.o to terbium [puppet] - 10https://gerrit.wikimedia.org/r/318076 [11:46:17] (03CR) 10Paladox: "@Muehlenhoff would you be able to review this please?" 
[debs/gerrit] - 10https://gerrit.wikimedia.org/r/317654 (https://phabricator.wikimedia.org/T143089) (owner: 10Chad) [11:50:12] (03CR) 10Giuseppe Lavagetto: [C: 032] cache::misc: switch noc.w.o to terbium [puppet] - 10https://gerrit.wikimedia.org/r/318076 (owner: 10Giuseppe Lavagetto) [11:52:03] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [12:00:59] (03PS2) 10BBlack: LVS: move ocg to low-traffic set [puppet] - 10https://gerrit.wikimedia.org/r/316920 (https://phabricator.wikimedia.org/T143915) [12:02:29] (03PS1) 10BBlack: revdns: document LVS traffic classes [dns] - 10https://gerrit.wikimedia.org/r/318081 (https://phabricator.wikimedia.org/T143915) [12:03:29] (03CR) 10BBlack: [C: 032] revdns: document LVS traffic classes [dns] - 10https://gerrit.wikimedia.org/r/318081 (https://phabricator.wikimedia.org/T143915) (owner: 10BBlack) [12:04:27] (03CR) 10BBlack: [C: 032] LVS: move ocg to low-traffic set [puppet] - 10https://gerrit.wikimedia.org/r/316920 (https://phabricator.wikimedia.org/T143915) (owner: 10BBlack) [12:05:09] !log moving ocg LVS from high-traffic2 -> low-traffic - T143915 [12:05:10] T143915: Fix static IP fallbacks to Pybal LVS routes - https://phabricator.wikimedia.org/T143915 [12:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:10:04] 06Operations, 10ops-codfw, 10hardware-requests: decomission db2015 - https://phabricator.wikimedia.org/T149102#2744591 (10mark) Approved. 
[12:12:22] 06Operations, 10ops-eqiad, 15User-Joe: Decommission mw1152 - https://phabricator.wikimedia.org/T149185#2744592 (10Joe) [12:13:16] (03CR) 10Jcrespo: [C: 031] "+1 about the mariadb grants" [puppet] - 10https://gerrit.wikimedia.org/r/318077 (owner: 10Giuseppe Lavagetto) [12:14:23] (03PS2) 10Giuseppe Lavagetto: mediawiki: decommission mw1152 [puppet] - 10https://gerrit.wikimedia.org/r/318077 (https://phabricator.wikimedia.org/T149185) [12:15:04] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] mediawiki: decommission mw1152 [puppet] - 10https://gerrit.wikimedia.org/r/318077 (https://phabricator.wikimedia.org/T149185) (owner: 10Giuseppe Lavagetto) [12:15:23] nice! ---^ [12:15:30] (03PS2) 10BBlack: LVS: move git-ssh to high-traffic2 set [puppet] - 10https://gerrit.wikimedia.org/r/316921 (https://phabricator.wikimedia.org/T143915) [12:17:29] (03CR) 10BBlack: [C: 032] LVS: move git-ssh to high-traffic2 set [puppet] - 10https://gerrit.wikimedia.org/r/316921 (https://phabricator.wikimedia.org/T143915) (owner: 10BBlack) [12:18:20] 06Operations, 10ops-eqiad, 13Patch-For-Review, 15User-Joe: Decommission mw1152 - https://phabricator.wikimedia.org/T149185#2744622 (10Joe) [12:19:40] !log moving git-ssh LVS from low-traffic -> high-traffic2 - T143915 [12:19:41] T143915: Fix static IP fallbacks to Pybal LVS routes - https://phabricator.wikimedia.org/T143915 [12:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:20:01] <_joe_> !log turned off mw1152, removed salt/puppet data, T149185 [12:20:02] T149185: Decommission mw1152 - https://phabricator.wikimedia.org/T149185 [12:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:28:11] dbstores may lag for a bit while the schema change is happening [12:28:29] you can ignore those alerts here on irc [12:31:04] !log rebooting mira for kernel update [12:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:35:04] 
06Operations, 10Traffic, 10netops, 13Patch-For-Review: Fix static IP fallbacks to Pybal LVS routes - https://phabricator.wikimedia.org/T143915#2744631 (10BBlack) ocg and git-ssh are fixed as well! [12:36:04] (03PS3) 10ArielGlenn: create dumps testbed role for snapshot1001 [puppet] - 10https://gerrit.wikimedia.org/r/318069 (https://phabricator.wikimedia.org/T149171) [12:36:29] !log Deploy schema change s5 dewiki.revision only codfw - T148967 [12:36:30] T148967: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967 [12:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:38:26] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/317790 (owner: 10Ema) [12:38:31] (03CR) 10ArielGlenn: [C: 032] create dumps testbed role for snapshot1001 [puppet] - 10https://gerrit.wikimedia.org/r/318069 (https://phabricator.wikimedia.org/T149171) (owner: 10ArielGlenn) [12:40:19] (03PS1) 10BBlack: LVS: document subnets in balancer assignment [puppet] - 10https://gerrit.wikimedia.org/r/318083 (https://phabricator.wikimedia.org/T143915) [12:42:24] (03CR) 10BBlack: [C: 032] LVS: document subnets in balancer assignment [puppet] - 10https://gerrit.wikimedia.org/r/318083 (https://phabricator.wikimedia.org/T143915) (owner: 10BBlack) [12:43:03] 06Operations, 10Traffic, 10netops, 13Patch-For-Review: Fix static IP fallbacks to Pybal LVS routes - https://phabricator.wikimedia.org/T143915#2744637 (10BBlack) 05Open>03Resolved a:03BBlack [12:45:33] PROBLEM - DPKG on snapshot1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:45:50] (03PS1) 10ArielGlenn: add snapshot1001 to dsh group for installs [puppet] - 10https://gerrit.wikimedia.org/r/318084 [12:46:23] PROBLEM - salt-minion processes on snapshot1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:46:43] PROBLEM - dhclient process on snapshot1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:47:03] PROBLEM - puppet last run on snapshot1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:47:03] PROBLEM - configured eth on snapshot1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:47:13] PROBLEM - Disk space on snapshot1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:47:20] (03CR) 10ArielGlenn: [C: 032] add snapshot1001 to dsh group for installs [puppet] - 10https://gerrit.wikimedia.org/r/318084 (owner: 10ArielGlenn) [12:48:53] RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 21 minutes ago with 0 failures [12:49:14] RECOVERY - configured eth on snapshot1001 is OK: OK - interfaces up [12:49:14] RECOVERY - Disk space on snapshot1001 is OK: DISK OK [12:49:23] RECOVERY - salt-minion processes on snapshot1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:49:43] RECOVERY - dhclient process on snapshot1001 is OK: PROCS OK: 0 processes with command name dhclient [12:50:12] (03PS1) 10ArielGlenn: add snapshot1001 to list of nfs exports for dataset1001 [puppet] - 10https://gerrit.wikimedia.org/r/318085 [12:50:33] RECOVERY - DPKG on snapshot1001 is OK: All packages OK [12:51:26] (03CR) 10ArielGlenn: [C: 032] add snapshot1001 to list of nfs exports for dataset1001 [puppet] - 10https://gerrit.wikimedia.org/r/318085 (owner: 10ArielGlenn) [12:51:34] PROBLEM - HHVM rendering on mw1189 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.009 second response time [12:52:45] RECOVERY - HHVM rendering on mw1189 is OK: HTTP OK: HTTP/1.1 200 OK - 71240 bytes in 1.123 second response time [12:54:26] (03CR) 10Muehlenhoff: [C: 04-1] "The size and checksum of the 2.12.5 gerrit.war as downloaded from https://www.gerritcodereview.com/releases/2.12.md#2.12.5 is different fr" [debs/gerrit] - 
https://gerrit.wikimedia.org/r/317654 (https://phabricator.wikimedia.org/T143089) (owner: Chad)
[13:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161026T1300).
[13:00:05] bd808 and yurik: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[13:01:37] oh already
[13:01:40] jouncebot: next
[13:01:40] In 4 hour(s) and 58 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161026T1800)
[13:01:45] * bd808 yawns
[13:01:45] jouncebot: now
[13:01:45] For the next 0 hour(s) and 58 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161026T1300)
[13:01:59] ouch it is that early!!! :(
[13:02:20] bd808: want me to handle the grunt work for you ?
[13:02:24] here
[13:02:26] ?
[13:02:32] hashar: that would be awesome
[13:02:50] (PS1) ArielGlenn: clean up instruactions for adding new snapshot host [puppet] - https://gerrit.wikimedia.org/r/318086
[13:03:04] (CR) Hashar: [C: 2] "It is SWAT time" [mediawiki-config] - https://gerrit.wikimedia.org/r/318036 (https://phabricator.wikimedia.org/T149150) (owner: BryanDavis)
[13:03:19] yurik: bd808 we should find a way to insert some sneak deployment slot that is not so early for you
[13:03:31] (Merged) jenkins-bot: wikitech: Set wgMWOAuthCentralWiki = false [mediawiki-config] - https://gerrit.wikimedia.org/r/318036 (https://phabricator.wikimedia.org/T149150) (owner: BryanDavis)
[13:03:36] note I have no idea how to test that patch above
[13:03:40] will push it on mw1099
[13:03:43] hashar, its 9am, not that bad
[13:04:02] 07:00 for me
[13:04:11] ah not "so" terrible so
[13:04:23] primary deployment server is tin now
[13:04:34] technically i have been getting up around 11 ever since i joined wmf :)
[13:05:00] (PS2) ArielGlenn: clean up instructions for adding new snapshot host [puppet] - https://gerrit.wikimedia.org/r/318086
[13:05:04] bd808: patch is on mw1099 now
[13:05:43] hashar: ok. all we can test there is that it doesn't break normal wikis
[13:05:51] (it won't)
[13:05:59] (famous last words)
[13:06:02] twist: it will? :D
[13:06:43] (CR) ArielGlenn: [C: 2] clean up instructions for adding new snapshot host [puppet] - https://gerrit.wikimedia.org/r/318086 (owner: ArielGlenn)
[13:07:17] oh that is just for silver/wikitech right ?
[13:07:24] yeah
[13:07:41] I can test it on labtestweb2001.wikimedia.org if you want to stage it there
[13:07:47] !log rebooting labtest hosts for kernel update
[13:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:08:56] sure
[13:09:15] (yet another host I never heard of)
[13:09:58] bd808: I ran scap pull on labtestweb2001
[13:10:07] its kind of mw1099 for wikitech
[13:10:25] hashar: \o/ works
[13:10:33] though rsync yields bunch of error not being able to delete some old wmf branches
[13:10:34] awesome!
[13:10:37] thx for the testing :}
[13:11:49] !log hashar@tin Synchronized wmf-config/CommonSettings.php: wikitech: Set wgMWOAuthCentralWiki = false - T149150 (duration: 00m 47s)
[13:11:50] T149150: OAuth api access on wikitech fails with consumed nonce error - https://phabricator.wikimedia.org/T149150
[13:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:12:15] * hashar gives some fresh brewed coffee and french croissants to bd808
[13:12:15] happy breakfast
[13:13:15] yurik: is there an order to deploy your changes?
[13:13:19] I have CR+2 the one for mw
[13:14:18] hashar, nah
[13:14:43] for labs only changes ( https://gerrit.wikimedia.org/r/#/c/318046/ ) we can get them deployed pretty much at anytime :D
[13:15:50] (PS3) Hashar: LABS: Enable Tabular data on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/318046 (https://phabricator.wikimedia.org/T148745) (owner: Yurik)
[13:15:52] (PS2) Hashar: Remove obsolete config values [mediawiki-config] - https://gerrit.wikimedia.org/r/318047 (owner: Yurik)
[13:16:10] (CR) Hashar: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/318047 (owner: Yurik)
[13:16:15] (CR) Hashar: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/318046 (https://phabricator.wikimedia.org/T148745) (owner: Yurik)
[13:16:43] yurik: I am going to pull of that on mw1099
[13:16:44] (Merged) jenkins-bot: LABS: Enable Tabular data on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/318046 (https://phabricator.wikimedia.org/T148745) (owner: Yurik)
[13:16:46] (Merged) jenkins-bot: Remove obsolete config values [mediawiki-config] - https://gerrit.wikimedia.org/r/318047 (owner: Yurik)
[13:16:55] ok
[13:16:59] tell me when
[13:17:10] and which ones :)
[13:17:17] or all at once
[13:17:24] should be ok i guess
[13:18:54] yurik: I have pulled on mw1099 all four changes :D
[13:19:06] thx, testing...
[13:20:33] PROBLEM - Apache HTTP on mw1231 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.017 second response time
[13:21:07] hashar, all's good
[13:21:33] RECOVERY - Apache HTTP on mw1231 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.037 second response time
[13:22:43] syncing Kartographer
[13:23:33] !log hashar@tin Synchronized php-1.28.0-wmf.23/extensions/Kartographer: T149145: Fix empty groups params T149154: Fix external links (duration: 00m 57s)
[13:23:35] T149154: Kartographer no longer displays external data credits - https://phabricator.wikimedia.org/T149154
[13:23:35] T149145: Snapshot service not working with empty groups param - https://phabricator.wikimedia.org/T149145
[13:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:24:31] !log hashar@tin Synchronized wmf-config/CommonSettings-labs.php: LABS: Enable Tabular data on Commons - T148745 (duration: 00m 45s)
[13:24:32] T148745: Epic: Enable data namespace with tabular support on Commons - https://phabricator.wikimedia.org/T148745
[13:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:25:37] !log hashar@tin Synchronized wmf-config/CommonSettings.php: Remove obsolete config values (duration: 00m 46s)
[13:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:25:47] yurik: all patches pushed to the whole fleet!
[13:26:00] thanks :)
[13:26:12] * yurik ponders what he can break next
[13:28:09] bah
[13:28:19] mw2098 spurts bunch of Notice: Undefined variable: wmgWatchlistDefault in /srv/mediawiki/wmf-config/CommonSettings.php on line 1871
[13:28:25] !log mw2098 spurts bunch of Notice: Undefined variable: wmgWatchlistDefault in /srv/mediawiki/wmf-config/CommonSettings.php on line 1871
[13:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:29:24] moritzm: paravoid: mind if I scap pull on mw2098 ? Looks like it mediawiki-config might be inconsistent
[13:29:47] I know nothing about this server
[13:30:07] ah I messed up output of "last"
[13:30:07] 2098 failed to come back after a reboot for the new kernel, mgmt was unreachable
[13:30:15] you connected to it yesterday
[13:30:16] if it is back, then scap pull is fine
[13:30:33] I logged in yesterday to see why puppet wasn't running
[13:30:35] !log mw2098: scap pull . It failed yesterday on reboot and is back in pull
[13:30:37] pool
[13:30:38] bah
[13:30:40] apart from that, I don't know its status
[13:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:30:42] * apergos doublechecks that is the host but I think so
[13:30:57] paravoid: yeah all set.
Sorry I thought you were still connected to it [13:31:32] https://phabricator.wikimedia.org/T148719 yep it was ac powercycled by papaul yesterday [13:31:49] I guess it lacked a bunch of update [13:31:56] the rsync is taking age [13:32:05] busy developers :-) [13:32:08] ;D [13:32:19] that server might well have failed to receive the latest mw version [13:32:24] I can imagine [13:32:43] luckily it only complains about a single missing variable [13:32:46] so probably not much harm [13:33:25] I wonder if it is not back in the dsh list [13:35:07] 06Operations, 10ops-eqiad, 13Patch-For-Review, 15User-Joe: Decommission mw1152 - https://phabricator.wikimedia.org/T149185#2744757 (10Joe) [13:36:24] mw2098 is in the tin /etc/dsh/group/mediawiki-installation [13:36:37] ok [13:36:50] and a "scap pull" on the host fixed the inconsistency in the config [13:36:51] all set [13:36:53] good [13:37:28] !log mw2098 is all set now after I ran "scap pull". It is properly in tin:/etc/dsh/group/mediawiki-installation [13:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:37:52] hashar , the watchlist default for cswiki was deployed two days ago, how come it only now errors? [13:38:24] arseny92: that server had an issue yesterday [13:38:47] and somehow during the scap sync I did earlier, that server ended up with some wrong config [13:38:57] anyway it is all fixed now [13:39:15] !log European SWAT deploy completed [13:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:40:47] ending in config be in a state in the middle of my is/cs dance with tyler? [13:43:27] like the server got the wrong version of the files on scap? [13:47:28] (03PS1) 10Giuseppe Lavagetto: Remove entries for decommissioned appservers, including mw1152 [dns] - 10https://gerrit.wikimedia.org/r/318089 (https://phabricator.wikimedia.org/T149185) [13:48:05] <_joe_> moritzm/elukey: care to take a look? 
^^ [13:48:14] <_joe_> careful proofreading is needed [13:48:16] * elukey looks [13:49:33] (03CR) 10Chad: "Yeah, I built it manually since I also had to build its-phabricator. I can use the upstream if you'd prefer (also includes the non-phab pl" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/317654 (https://phabricator.wikimedia.org/T143089) (owner: 10Chad) [13:51:55] (03CR) 10ArielGlenn: [C: 031] "Left that out! Good catch, Gehel." [puppet] - 10https://gerrit.wikimedia.org/r/318067 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [13:52:34] (03CR) 10Elukey: [C: 031] Remove entries for decommissioned appservers, including mw1152 [dns] - 10https://gerrit.wikimedia.org/r/318089 (https://phabricator.wikimedia.org/T149185) (owner: 10Giuseppe Lavagetto) [13:54:22] (03CR) 10Chad: [C: 031] "Looks fine, feel free to merge & deploy whenever (it'll restart gerrit, obvs)" [puppet] - 10https://gerrit.wikimedia.org/r/318067 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [13:54:25] (03CR) 10Muehlenhoff: "It's fine, I just wanted to exclude a mixup/error. Will build the package in a bit." [debs/gerrit] - 10https://gerrit.wikimedia.org/r/317654 (https://phabricator.wikimedia.org/T143089) (owner: 10Chad) [13:54:29] (03CR) 10Giuseppe Lavagetto: [C: 032] Remove entries for decommissioned appservers, including mw1152 [dns] - 10https://gerrit.wikimedia.org/r/318089 (https://phabricator.wikimedia.org/T149185) (owner: 10Giuseppe Lavagetto) [13:55:44] ostriches: ah since you're around, you want I should merge and we get the restart done? 
[13:56:27] Might as well, I'll only be around about 4 more hours :P [13:56:37] all righty then [13:57:06] (03PS2) 10ArielGlenn: Gerrit: Adding option -XX:+PrintGCApplicationStoppedTime to gc logging [puppet] - 10https://gerrit.wikimedia.org/r/318067 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [13:58:09] 06Operations, 10ops-eqiad, 13Patch-For-Review, 15User-Joe: Decommission mw1152 - https://phabricator.wikimedia.org/T149185#2744857 (10Joe) [13:58:19] 06Operations, 10ops-eqiad, 13Patch-For-Review, 15User-Joe: Decommission mw1152 - https://phabricator.wikimedia.org/T149185#2744592 (10Joe) p:05Triage>03Low a:05Joe>03None [13:58:39] 06Operations, 10ops-eqiad, 13Patch-For-Review, 15User-Joe: Decommission mw1152 - https://phabricator.wikimedia.org/T149185#2744592 (10Joe) @Cmjohnson you can decommission this server whenever you see fit. [13:58:40] (03CR) 10ArielGlenn: [C: 032] Gerrit: Adding option -XX:+PrintGCApplicationStoppedTime to gc logging [puppet] - 10https://gerrit.wikimedia.org/r/318067 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [13:59:43] (03CR) 10Muehlenhoff: [C: 032] gerrit (2.12.5-wmf.1) jessie-wikimedia; urgency=low [debs/gerrit] - 10https://gerrit.wikimedia.org/r/317654 (https://phabricator.wikimedia.org/T143089) (owner: 10Chad) [14:00:22] ostriches: puppet run kicking off now [14:02:05] bah [14:02:10] Gerrit 503 are apparently cached somehow : [14:02:15] ( [14:02:30] ostriches: hi could you restart grrrit-wm please [14:02:41] I'm not near a pc so I carnt do it :) [14:03:04] hashar: No they aren't.... [14:03:26] it's not behind varnish & it definitely doesn't cache them itself. [14:03:28] I just had the issue and had to force clear my browser cache [14:03:33] :) [14:03:49] then I havent investigated :D [14:06:23] godog: re https://gerrit.wikimedia.org/r/#/c/314029/3/modules/role/manifests/grafana/base.pp it doesnt look like git clone has a default directory..? 
[14:10:25] !log demon@tin Synchronized w: replacing wiki.phtml with a symlink (duration: 00m 47s) [14:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:11:08] godog, addshore: Yeah git::clone needs a destination, there is no default "clone to here" with it.... [14:11:37] ostriches: thanks! :) [14:11:48] # === Required parameters [14:11:48] # $+directory+:: path to clone the repository into. [14:12:04] Everything else is optional beyond that. [14:12:09] So your bare minimum is: [14:12:26] git::clone { 'foo': directory => '/bar' } [14:13:08] (git clone assumes the $title is from Gerrit, unless you override that behavior using $source or $origin) [14:18:57] PROBLEM - HHVM rendering on mw1206 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.007 second response time [14:19:17] PROBLEM - Apache HTTP on mw1206 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.012 second response time [14:19:30] !log rolling reboot of ocg cluster for kernel update [14:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:19:57] RECOVERY - HHVM rendering on mw1206 is OK: HTTP OK: HTTP/1.1 200 OK - 71223 bytes in 0.152 second response time [14:20:17] RECOVERY - Apache HTTP on mw1206 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.036 second response time [14:22:54] <_joe_> uh what was that [14:23:23] 06Operations, 10Cassandra, 10hardware-requests, 06Services (blocked), 07Wikimedia-Incident: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2744913 (10mark) We've been able to find some H710 controllers, which we can swap for the H310s. That should allow these 3 box... 
[14:24:29] !log cache_upload - start rolling downtimed reboots for kernel update (~4 hours to completion) [14:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:27:55] 06Operations, 10netops: Migrate links from cr1-eqiad/cr2-eqiad fpc 5 to fpc 3 - https://phabricator.wikimedia.org/T149196#2744953 (10mark) [14:29:37] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[generate_varnishkafka_webrequest_gmond_pyconf] [14:30:17] PROBLEM - Keyholder SSH agent on mira is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [14:30:17] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:30:57] PROBLEM - restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:30:58] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:31:18] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [14:31:47] RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy [14:33:07] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [14:34:07] !log rearmed keyholder on mira after reboot [14:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:34:28] RECOVERY - Keyholder SSH agent on mira is OK: OK: Keyholder is armed with all configured keys. 
[14:41:18] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 35, down: 2, dormant: 0, excluded: 0, unused: 0BRxe-0/0/1: down - Core: cr1-ulsfo:xe-1/2/0 (Telia, IC-313592, 51ms) {#11372} [10Gbps wave]BRxe-0/0/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-314534, 29ms) {#11375} [10Gbps wave]BR [14:41:27] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/2/0: down - Core: cr1-eqord:xe-0/0/1 (Telia, IC-313592, 51ms) {#1502} [10Gbps wave]BR [14:41:27] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/1: down - Core: cr1-eqord:xe-0/0/0 (Telia, IC-314534, 24ms) {#10694} [10Gbps wave]BR [14:46:26] 06Operations, 06Services (next), 15User-mobrovac: Investigate better protection modes for electron render service (xvfb setuid) - https://phabricator.wikimedia.org/T143336#2744988 (10GWicke) I added `--noprofile` to get around this error: `libudev: udev_monitor_new_from_netlink_fd: error getting socket: Ope... 
[14:48:10] 06Operations, 10netops: Migrate links from cr1-eqiad/cr2-eqiad fpc 5 to fpc 3 - https://phabricator.wikimedia.org/T149196#2744991 (10mark) p:05Triage>03High [14:50:57] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [14:51:07] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 [14:51:07] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [14:53:14] 06Operations, 10hardware-requests: codfw/eqiad: 2x systems for prometheus - https://phabricator.wikimedia.org/T148513#2744995 (10mark) Prometheus might be much better off with SSDs, in which case I assume we wouldn't be using these misc spares? [14:57:17] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [14:57:37] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:00:59] 06Operations, 10ops-codfw, 10hardware-requests: decomission db2015 - https://phabricator.wikimedia.org/T149102#2745058 (10RobH) a:05mark>03RobH [15:01:08] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 635 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3070123 keys, up 217 days 7 hours - replication_delay is 635 [15:02:17] PROBLEM - confd service on cp3038 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive [15:03:08] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3055340 keys, up 217 days 7 hours - replication_delay is 0 [15:04:27] RECOVERY - confd service on cp3038 is OK: OK - confd is active [15:04:47] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the 
critical threshold [1000.0] [15:08:08] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:10:17] grrrit-wm: ping [15:11:25] I think it's dead [15:14:00] jouncebot: hi [15:14:04] jouncebot: next [15:14:04] In 2 hour(s) and 45 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161026T1800) [15:16:17] !log restarting grrrit-wm [15:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:22:04] (03Abandoned) 10BBlack: revert potential event pipe breakage from 1.11.4 [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/317811 (owner: 10BBlack) [15:22:38] (03Abandoned) 10BBlack: add 3x post-1.11.4 bugfixes [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/317812 (owner: 10BBlack) [15:22:54] (03Abandoned) 10BBlack: nginx (1.11.4-1+wmf4) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/317813 (owner: 10BBlack) [15:22:59] (03Abandoned) 10BBlack: control: back to openssl-1.0.2 [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/317821 (owner: 10BBlack) [15:23:02] (03Abandoned) 10BBlack: nginx (1.11.4-1+wmf5) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/317822 (owner: 10BBlack) [15:23:59] (03PS3) 10MarcoAurelio: Enable Extension:ShortURL on bd.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311656 (https://phabricator.wikimedia.org/T146014) [15:24:08] 06Operations, 06Services (next), 15User-mobrovac: Investigate better protection modes for electron render service (xvfb setuid) - https://phabricator.wikimedia.org/T143336#2745129 (10GWicke) Looking into this some more, I followed the hints [in this discussion](https://l3net.wordpress.com/projects/firejail/)... 
[15:24:58] (03PS4) 10MarcoAurelio: Enable Extension:ShortURL on bd.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311656 (https://phabricator.wikimedia.org/T146014) [15:26:18] (03PS12) 10Gehel: Maps - cleanup postgres user creation [puppet] - 10https://gerrit.wikimedia.org/r/315271 (https://phabricator.wikimedia.org/T147194) [15:26:27] PROBLEM - NTP on mc2008 is CRITICAL: NTP CRITICAL: Offset unknown [15:27:43] !log cp1054 reboot for kernel update [15:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:28:37] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [15:29:47] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:30:57] RECOVERY - MegaRAID on labstore1003 is OK: OK: optimal, 5 logical, 34 physical [15:33:28] 06Operations, 13Patch-For-Review: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562#2745166 (10ArielGlenn) [15:37:02] !log cache_text - start rolling downtimed reboots for kernel update (~3 hours to completion) [15:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:38:00] (03PS1) 10BryanDavis: wikitech: Re-enable OAuth management interfaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318110 (https://phabricator.wikimedia.org/T149150) [15:39:50] jouncebot: refresh [15:39:52] I refreshed my knowledge about deployments. [15:40:17] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:42:37] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp4006_v4, cp4006_v6 [15:42:47] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp4006_v4, cp4006_v6 [15:43:27] the ipsec alerts will probably come and go (or just stay for a couple hours) [15:43:28] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 144 not-conn: cp3031_v4, cp3031_v6, cp4006_v4, cp4006_v6 [15:43:28] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 144 not-conn: cp3031_v4, cp3031_v6, cp4006_v4, cp4006_v6 [15:43:28] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 144 not-conn: cp3031_v4, cp3031_v6, cp4006_v4, cp4006_v6 [15:43:47] with 2x clusters rolling through reboots, too often they'll overlap failing checked from two sets of reboots, basically [15:43:56] sorry! [15:44:13] till T148976 is implemented! [15:44:13] T148976: Strongswan Icinga check: do not report issues about depooled hosts - https://phabricator.wikimedia.org/T148976 [15:44:16] probably only on the kafka hosts, though [15:44:20] so not too many [15:46:07] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 148 ESP OK [15:46:37] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 148 ESP OK [15:46:37] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 148 ESP OK [15:46:38] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 148 ESP OK [15:46:58] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 148 ESP OK [15:52:04] addshore: doh! ok I'll comment on the review [15:53:19] 06Operations, 10Cassandra, 10hardware-requests, 06Services (blocked), 07Wikimedia-Incident: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2745222 (10GWicke) @mark: That sounds great, thank you! Do you need anything else from us for the disk procurement? 
[15:57:05] !log Restored cr1-eqiad:ae4 [15:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:58:10] !log Reenabling cr1-eqiad:ae4 [15:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:59:33] !log Disabling cr1-eqiad:ae4; VRRP conflict [15:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:02:48] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp4015_v4, cp4015_v6 [16:02:48] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp4015_v4, cp4015_v6 [16:02:57] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp4015_v4, cp4015_v6 [16:02:57] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp4015_v4, cp4015_v6 [16:02:58] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp4015_v4, cp4015_v6 [16:03:27] PROBLEM - traffic-pool service on cp3035 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is failed [16:07:07] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 148 ESP OK [16:07:07] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 148 ESP OK [16:07:17] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 148 ESP OK [16:07:38] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 148 ESP OK [16:07:38] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 148 ESP OK [16:07:38] RECOVERY - traffic-pool service on cp3035 is OK: OK - traffic-pool is active [16:08:17] PROBLEM - traffic-pool service on cp4015 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is failed [16:08:37] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [16:08:59] those traffic-pool failures on startup are basically Raft Internal Error issues (we don't really see them if they happen on shutdown and leave things pooled, but 
we do see them on startup) [16:09:51] (03PS1) 10Gehel: elasticsearch - mount elasticsearch data partition with noatime [puppet] - 10https://gerrit.wikimedia.org/r/318117 [16:10:27] RECOVERY - traffic-pool service on cp4015 is OK: OK - traffic-pool is active [16:11:05] (03CR) 10Gehel: "I'm uncomfortable playing with mounted partitions. I'd welcome an in depth review and some pointers on the things that can go wrong here!" [puppet] - 10https://gerrit.wikimedia.org/r/318117 (owner: 10Gehel) [16:11:48] 06Operations, 10ops-codfw, 06DC-Ops: wtp2019 issues an uncorrectable memory error - https://phabricator.wikimedia.org/T148710#2745283 (10Papaul) 05Open>03Resolved Memory replacement complete. System is back up. * Drain the system flea power * Replaced the memory * Clean the SBE logs * Update the BIOS fro... [16:13:28] RECOVERY - Host wtp2019 is UP: PING OK - Packet loss = 0%, RTA = 36.54 ms [16:16:48] 06Operations, 10ops-codfw, 10DBA: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2745287 (10Papaul) @RobH I think we are okay on using the disks from the decommissioned es servers. Please see below for disk information Dell ST3600057SS 3.5: SAS 15K [16:16:53] 06Operations, 10ops-eqiad, 10netops: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2745288 (10mark) p:05Unbreak!>03High @cmjohnson: "Unbreak Now!" (UBN) is reserved only for critical emergency, "don't go home until this is fixed" kind of things.... [16:17:59] 06Operations, 10ops-codfw, 10DBA: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2745290 (10RobH) I agree, the 15k and the larger size typically means they can replace smaller capacity disks without issues. Since they are larger, they'll likely be re-added to the raid array and only ma... 
[16:18:18] 07Puppet, 06Labs: Puppet parser, puppet API, and inline docs - https://phabricator.wikimedia.org/T148479#2745291 (10Andrew) 05Open>03Resolved Yep, upgraded labcontrol1001 to 3.8.5 and now everything is fine. [16:29:24] 06Operations, 10netops: Migrate links from cr1-eqiad/cr2-eqiad fpc 5 to fpc 3 - https://phabricator.wikimedia.org/T149196#2745351 (10mark) I've added new ports for the row A-D uplinks to the aggregated links (desc no-mon), so they can be moved one by one. No other ports (transit/transport/etc) yet. [16:29:25] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2014_v4, cp2014_v6 [16:29:30] 06Operations, 10ops-codfw, 10DBA: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2745352 (10jcrespo) I am cool with this, this worked last time we tried. [16:30:05] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2014_v4, cp2014_v6 [16:30:25] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 56 ESP OK [16:30:35] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp4010_v4, cp4010_v6 [16:30:41] 06Operations, 10ops-codfw, 06DC-Ops, 10Parsoid: wtp2019 issues an uncorrectable memory error - https://phabricator.wikimedia.org/T148710#2745354 (10Arlolra) @akosiaris Before repooling, can I deploy the commit the rest of the cluster is on now? 
Also, would it have been prudent to also tag this w/ #parsoid [16:30:45] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp4010_v4, cp4010_v6 [16:30:57] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp4010_v4, cp4010_v6 [16:31:15] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 54 ESP OK [16:31:25] PROBLEM - NTP on cp3032 is CRITICAL: NTP CRITICAL: Offset unknown [16:31:35] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp4010_v4, cp4010_v6 [16:31:45] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp4010_v4, cp4010_v6 [16:32:35] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 144 not-conn: cp3044_v4, cp3044_v6, cp4010_v4, cp4010_v6 [16:33:37] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:33:38] jynus: modules/mediawiki/manifests/maintenance/jobqueue_stats.pp . Looks like the script itself sends the data to statsd, rather than having something read its output like I thought. I suppose I could just add a --report option to getLagTimes.php too. 
[16:34:33] not sure if job queue is a great place, but nothing to object to try it [16:35:05] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 148 ESP OK [16:35:18] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 148 ESP OK [16:35:45] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 148 ESP OK [16:36:05] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 148 ESP OK [16:36:09] !log stopping db2011 to replace disks T149099 [16:36:10] T149099: db2011 disk media errors - https://phabricator.wikimedia.org/T149099 [16:36:15] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 148 ESP OK [16:36:15] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 148 ESP OK [16:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:36:33] (03PS1) 10Muehlenhoff: Update to 4.4.27 [debs/linux44] - 10https://gerrit.wikimedia.org/r/318123 [16:38:29] jynus: oh, it would be separate chron, I was just trying to find that example of something similar. [16:38:50] ok ok [16:39:12] maybe setup a terbium maintenance script there [16:39:33] with a slightly modified version of that [16:39:36] PROBLEM - haproxy failover on dbproxy1007 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [16:39:38] !log restarted ntp on mc2008 (stuck in XFAC state) [16:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:39:48] dbproxy1007? 
[16:40:01] 06Operations, 10Analytics, 06Discovery, 06Discovery-Analysis, 10LDAP-Access-Requests: Pivot access for Discovery's Analysis team - https://phabricator.wikimedia.org/T149144#2745374 (10Krenair) I think pivot access just relies on wmf/nda grouping in LDAP, not production shell [16:40:24] of course [16:40:25] PROBLEM - haproxy failover on dbproxy1002 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [16:40:27] it is db2011 [16:40:34] s2, I am dumb [16:40:43] sorry [16:40:46] m2, I mean [16:40:50] all scheduled [16:43:01] papaul, db2011 is shutting down right now [16:43:05] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [16:45:46] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:46:21] not sure if it is 500 or 5xx [16:46:25] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp3041_v4, cp3041_v6 [16:46:34] one is high, the other has a spike [16:46:45] RECOVERY - NTP on cp3032 is OK: NTP OK: Offset 0.0002358555794 secs [16:46:45] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp3041_v4, cp3041_v6 [16:46:55] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp3041_v4, cp3041_v6 [16:46:55] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp3041_v4, cp3041_v6 [16:47:05] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp3041_v4, cp3041_v6 [16:47:15] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 144 not-conn: cp3041_v4, cp3041_v6, cp3048_v4, cp3048_v6 [16:47:22] jynus: see it [16:47:42] thank you! 
[16:49:47] !log Shutting down cr1-eqiad:xe-5/1/[0-3] (part of aggregated links to rows A-D switches) [16:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:50:26] 06Operations, 10Ops-Access-Requests: Requesting access researchers, statistics-users, analytics-users, statistics-privatedata-users, analytics-privatedata-users and bastiononly for Zareen - https://phabricator.wikimedia.org/T149211#2745424 (10Zareenf) [16:51:01] jynus: I made/assigned https://phabricator.wikimedia.org/T149210 to myself since the only thing I'd need from anyone is a puppet merge after I do the code changes. [16:51:54] 06Operations, 10Ops-Access-Requests: Requesting access researchers, statistics-users, analytics-users, statistics-privatedata-users, analytics-privatedata-users and bastiononly for Zareen - https://phabricator.wikimedia.org/T149211#2745424 (10Krenair) bastiononly doesn't exist. Please re-check each of your gr... [16:52:04] I would really like to help, and would do so eventually, I just cannot promise when at this time [16:52:35] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 878.11 seconds [16:53:35] 06Operations, 10Ops-Access-Requests: Requesting access researchers, statistics-users, analytics-users, statistics-privatedata-users, and analytics-privatedata-users for Zareen - https://phabricator.wikimedia.org/T149211#2745445 (10Zareenf) [16:53:55] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 148 ESP OK [16:53:55] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 148 ESP OK [16:53:56] the s1 lag is me, I logged it and warned about it [16:54:12] 06Operations, 10ops-codfw, 10DBA: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2745447 (10Papaul) Disk replacement complete [16:54:16] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 148 ESP OK [16:54:22] the haproxies should recover now [16:54:25] RECOVERY - haproxy failover on dbproxy1002 is OK: OK 
check_failover servers up 2 down 0 [16:54:28] !log Chris moved cr1-eqiad:xe-5/1/[0-3] to xe-3/1/[0-3] [16:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:54:35] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 148 ESP OK [16:54:35] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 148 ESP OK [16:54:35] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 148 ESP OK [16:54:45] RECOVERY - haproxy failover on dbproxy1007 is OK: OK check_failover servers up 2 down 0 [16:55:18] 06Operations, 10Ops-Access-Requests: Requesting access researchers, statistics-users, analytics-users, statistics-privatedata-users, and analytics-privatedata-users for Zareen - https://phabricator.wikimedia.org/T149211#2745448 (10Krenair) I also don't think you need both researchers and statistics-users simul... [16:56:56] and db2011 is back up [16:58:32] !log Chris moved cr1-eqiad:xe-5/2/1 to xe-3/0/3 [16:58:35] RECOVERY - NTP on mc2008 is OK: NTP OK: Offset -0.002177357674 secs [16:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:58:55] !log Shutting down cr1-eqiad:xe-5/0/[0-2] (part of aggregated links to rows A-C switches) [16:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:59:45] (03PS3) 10DCausse: [cirrus] Activate BM25 on top 10 wikis: Step 3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315299 (https://phabricator.wikimedia.org/T147508) [17:02:07] ACKNOWLEDGEMENT - MegaRAID on db2011 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T149212 [17:02:10] 06Operations, 10ops-codfw: Degraded RAID on db2011 - https://phabricator.wikimedia.org/T149212#2745456 (10ops-monitoring-bot) [17:03:08] 06Operations, 10ops-codfw, 10DBA: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2745461 (10jcrespo) 05stalled>03Open a:03jcrespo The disks are unconfigured, they need 
to be put into the RAID still: ``` Raw Size: 558.911 GB [0x45dd2fb0 Sectors] Non Coerced Size: 558.411 GB [0x45c... [17:03:45] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp3036_v4, cp3036_v6 [17:04:06] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp3036_v4, cp3036_v6 [17:04:15] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp3036_v4, cp3036_v6 [17:04:25] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp3036_v4, cp3036_v6 [17:04:45] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp3036_v4, cp3036_v6 [17:05:13] (03PS1) 10Volans: Icinga: raid_handler improve failure detection [puppet] - 10https://gerrit.wikimedia.org/r/318128 (https://phabricator.wikimedia.org/T142085) [17:05:35] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 148 ESP OK [17:06:19] 06Operations, 10ops-eqiad: Degraded RAID on oxygen - https://phabricator.wikimedia.org/T149167#2745476 (10Volans) 05Open>03Invalid @elukey thanks for adding me, it is indeed a false positive, I guess the host was not put in downtime on Icinga before the reboot. The RAID handler script was called with a new... 
[17:06:32] 06Operations, 10ops-codfw, 10DBA: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2745478 (10jcrespo) p:05Triage>03Normal [17:06:49] (03PS1) 10Filippo Giunchedi: swift: don't track connection to backend services [puppet] - 10https://gerrit.wikimedia.org/r/318129 [17:08:35] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 148 ESP OK [17:08:45] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 148 ESP OK [17:08:55] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 148 ESP OK [17:09:26] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 148 ESP OK [17:09:52] 06Operations, 10Analytics, 06Discovery, 06Discovery-Analysis, 10LDAP-Access-Requests: Pivot access for Discovery's Analysis team - https://phabricator.wikimedia.org/T149144#2745487 (10elukey) [17:10:25] ACKNOWLEDGEMENT - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1921.77 seconds Jcrespo Applying schema change T139090 [17:13:27] (03CR) 10Muehlenhoff: [C: 031] "The increased conntrack sizes were caused by T136094 hitting after the recent reboots (I fixed the settings). With correct settings we usu" [puppet] - 10https://gerrit.wikimedia.org/r/318129 (owner: 10Filippo Giunchedi) [17:14:10] 06Operations, 10ops-eqiad: Degraded RAID on oxygen - https://phabricator.wikimedia.org/T149167#2744018 (10MoritzMuehlenhoff) It was put in downtime, Icinga didn't alert for the actual reboot. 
[17:15:56] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:19:50] (03CR) 10Muehlenhoff: [C: 032] Update to 4.4.27 [debs/linux44] - 10https://gerrit.wikimedia.org/r/318123 (owner: 10Muehlenhoff) [17:20:06] (03PS2) 10Filippo Giunchedi: swift: don't track connection to backend services [puppet] - 10https://gerrit.wikimedia.org/r/318129 [17:22:06] (03PS2) 10Muehlenhoff: Assign debdeploy grain for url_downloader via the role [puppet] - 10https://gerrit.wikimedia.org/r/317806 [17:23:45] 06Operations, 10ops-eqiad, 10netops: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2745562 (10Cmjohnson) [17:24:28] 06Operations, 10ops-eqiad, 10netops: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2724967 (10Cmjohnson) @faidon all 8 switches are accessible via serial and connected to mgmt. [17:24:40] 06Operations, 10ops-eqiad, 10netops: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2745564 (10Cmjohnson) [17:24:44] (03CR) 10Filippo Giunchedi: [C: 032] swift: don't track connection to backend services [puppet] - 10https://gerrit.wikimedia.org/r/318129 (owner: 10Filippo Giunchedi) [17:24:45] 06Operations, 10ops-codfw, 06DC-Ops, 10Parsoid: wtp2019 issues an uncorrectable memory error - https://phabricator.wikimedia.org/T148710#2745565 (10akosiaris) @arlorla yes of course, feel free to. In fact, thanks for doing it. 
[17:25:56] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [17:27:25] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 211, down: 5, dormant: 0, excluded: 0, unused: 0BRae4.1020: down - Subnet private1-d-eqiadBRae4.32767: down - BRae4.1023: down - Subnet analytics1-d-eqiadBRxe-5/2/1: down - BRae4.1004: down - Subnet public1-d-eqiadBR [17:27:46] (03CR) 10Muehlenhoff: [C: 032] Assign debdeploy grain for url_downloader via the role [puppet] - 10https://gerrit.wikimedia.org/r/317806 (owner: 10Muehlenhoff) [17:27:51] (03PS3) 10Muehlenhoff: Assign debdeploy grain for url_downloader via the role [puppet] - 10https://gerrit.wikimedia.org/r/317806 [17:28:10] 06Operations, 10netops: Migrate links from cr1-eqiad/cr2-eqiad fpc 5 to fpc 3 - https://phabricator.wikimedia.org/T149196#2745581 (10mark) cr1-eqiad's row A-D links have been moved to xe-3/0/[0-3] and xe-3/1/[0-3] respectively. All port descriptions should be correct, and the old port configs have been cleaned... 
[17:28:43] (03CR) 10Muehlenhoff: [V: 032] Assign debdeploy grain for url_downloader via the role [puppet] - 10https://gerrit.wikimedia.org/r/317806 (owner: 10Muehlenhoff) [17:29:26] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [17:30:25] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3046_v4, cp3046_v6 [17:31:36] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 70 ESP OK [17:32:47] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp4016_v4, cp4016_v6 [17:32:55] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp4016_v4, cp4016_v6 [17:33:15] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 144 not-conn: cp4007_v4, cp4007_v6, cp4016_v4, cp4016_v6 [17:33:15] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 144 not-conn: cp4007_v4, cp4007_v6, cp4016_v4, cp4016_v6 [17:33:35] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 144 not-conn: cp4007_v4, cp4007_v6, cp4016_v4, cp4016_v6 [17:33:35] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 144 not-conn: cp4007_v4, cp4007_v6, cp4016_v4, cp4016_v6 [17:34:17] (03CR) 10BryanDavis: "I see that Chase gave this a +1, but I think I would have been against it if I had seen the patch. This package is one that we manage ours" [puppet] - 10https://gerrit.wikimedia.org/r/310710 (https://phabricator.wikimedia.org/T115348) (owner: 10Paladox) [17:35:24] moritzm: should this be reverted now? 
https://gerrit.wikimedia.org/r/#/c/315727/ [17:35:43] (03PS1) 10Volans: wmf-auto-reimage: support mutiple conftool roles [puppet] - 10https://gerrit.wikimedia.org/r/318131 (https://phabricator.wikimedia.org/T149216) [17:35:55] PROBLEM - NTP on cp4018 is CRITICAL: NTP CRITICAL: Offset unknown [17:36:15] arlolra: already done, a change for that was merged earlier the day [17:37:15] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 148 ESP OK [17:37:15] I'm trying to debug whether an ORES request is being cached in varnish--the application response includes: X-Cache:"cp1061 miss, cp2012 miss, cp4002 miss, cp4001 hit/2" [17:37:24] moritzm: that's in parsoid deploy repo ... I don't see the commit [17:37:35] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 148 ESP OK [17:37:35] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 148 ESP OK [17:37:36] https://github.com/wikimedia/mediawiki-services-parsoid-deploy/commits/master [17:37:45] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 148 ESP OK [17:37:55] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 148 ESP OK [17:37:55] This is worrying me though, because the application should have responded with Cache-Control: "no-store, no-cache, max-age=0" [17:38:13] also strange because ganglia reports that cp4001 is down: https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Misc%2520Web%2520caching%2520cluster%2520ulsfo&tab=m&vn=&hide-hf=false [17:38:54] (03PS1) 10Chad: CI firewall: remove lead from the ferm rule, it doesn't exist [puppet] - 10https://gerrit.wikimedia.org/r/318132 [17:38:55] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [17:39:33] arlolra: ah, sorry, I misread and thought this were for the change in Hiera. 
yes, that change should be reverted as well [17:39:47] thanks [17:39:51] 06Operations, 10Traffic: cp1066.mgmt.eqiad.wmnet is unreachable - https://phabricator.wikimedia.org/T149217#2745596 (10ema) [17:40:14] 06Operations, 10Traffic: cp1066.mgmt.eqiad.wmnet is unreachable - https://phabricator.wikimedia.org/T149217#2745622 (10ema) p:05Triage>03Normal [17:40:16] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3037_v4, cp3037_v6 [17:41:14] (03CR) 10Dzahn: [C: 032] "kind of duplicate of https://gerrit.wikimedia.org/r/#/c/318035/ and some more decom there, but yea, rebasing the other one" [puppet] - 10https://gerrit.wikimedia.org/r/318132 (owner: 10Chad) [17:41:26] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 148 ESP OK [17:41:26] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 70 ESP OK [17:41:52] (03PS2) 10Filippo Giunchedi: esams: introduce svc records for swift [dns] - 10https://gerrit.wikimedia.org/r/318010 (https://phabricator.wikimedia.org/T149098) [17:44:51] (03CR) 10Filippo Giunchedi: [C: 032] esams: introduce svc records for swift [dns] - 10https://gerrit.wikimedia.org/r/318010 (https://phabricator.wikimedia.org/T149098) (owner: 10Filippo Giunchedi) [17:45:37] 06Operations, 10Analytics, 10Analytics-Cluster, 06Research-and-Data, and 2 others: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2745636 (10ellery) @elukey I have a slight preference for stat1004 since it has access to HDFS [17:47:08] 06Operations, 10ops-eqiad: Degraded RAID on oxygen - https://phabricator.wikimedia.org/T149167#2745641 (10Volans) @MoritzMuehlenhoff probably due to the fact that `icinga-downtime` put in dowtime only the host and not the related services. I'll follow up with Ops to check if we want to change it's behaviour.... 
[17:54:49] so the final upload reboot came back online at ~17:41, start time was ~14:24, so ~3h17m [17:55:25] minimum global overall hitrate was 91% in small dips, it's already back up 94% and climbing now [17:55:45] (was ~97.4% before the reboots started) [17:55:54] 06Operations, 10ops-eqiad: Degraded RAID on oxygen - https://phabricator.wikimedia.org/T149167#2744018 (10Dzahn) >>! In T149167#2745641, @Volans wrote: > @MoritzMuehlenhoff probably due to the fact that `icinga-downtime` put in dowtime only the host and not the related services. I'll follow up with Ops to chec... [17:56:26] and swift loadavg and iowait seems to be recovering now on a similar curve [17:56:42] heh wrong channel, but this channel's fine too :) [17:57:35] PROBLEM - NTP on cp4007 is CRITICAL: NTP CRITICAL: Offset unknown [17:57:47] yup that was >2x load on swift in terms of qps heh [17:58:18] brb [17:59:45] 2x load from what? [18:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161026T1800). Please do the needful. [18:00:05] bd808: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:15] o/ [18:00:21] apergos: from all the upload caches rebooting (which wipes their cache storage) [18:00:28] ahhhh [18:00:29] If I'm the only one I can just take care of it [18:00:59] so I should have that in mind too when this script is running [18:01:03] gtk [18:01:26] yeah it will be interesting to look at stats, if your scan-rate is high [18:01:48] well also, it will be a walk through all media used on all the projects, not just the popular stuff [18:01:56] it probably won't impact frontend caching behavior at all since we have the N-hit-wonder protection there, but it may churn up backend caching a little. 
[18:01:59] 06Operations: Reconsider/check naming of 'privatedata' shell groups compared to their theoretically non-sensitive counterparts - https://phabricator.wikimedia.org/T149222#2745739 (10AlexMonk-WMF) [18:02:04] oh right, 2 is it? or 3? [18:02:13] (03CR) 10BryanDavis: [C: 032] wikitech: Re-enable OAuth management interfaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318110 (https://phabricator.wikimedia.org/T149150) (owner: 10BryanDavis) [18:02:42] (03Merged) 10jenkins-bot: wikitech: Re-enable OAuth management interfaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318110 (https://phabricator.wikimedia.org/T149150) (owner: 10BryanDavis) [18:03:08] apergos: it's currently 4-hit-wonder (meaning only when an object is accessed for the 5th time in a given cache's region of the globe is it allowed to enter frontend cached) [18:03:15] ah 4 [18:03:43] back end will take a hit for sure, hm [18:03:44] (03PS2) 10Dzahn: decom lead (ex-gerrit) [puppet] - 10https://gerrit.wikimedia.org/r/318035 (https://phabricator.wikimedia.org/T147905) [18:03:45] we shall see [18:03:51] (03PS3) 10Dzahn: decom lead (ex-gerrit) [puppet] - 10https://gerrit.wikimedia.org/r/318035 (https://phabricator.wikimedia.org/T147905) [18:03:53] staged on labtestweb2001.wikimedia.org. testing there [18:04:04] !log cache_text - finished rolling downtimed reboots for kernel update [18:04:05] but the larger backends have no such protection, they'll gleefully evict more-useful objects as you scan down your list heh [18:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:04:29] but the frontends will protect against any really bad fallout from that, and you'd have to achieve a pretty high scan rate to do anything to it in general [18:04:54] 06Operations, 10ops-eqiad, 10netops: Decommission psw1-eqiad - https://phabricator.wikimedia.org/T149224#2745777 (10mark) [18:05:49] I'll let you know when the first scan kicks off [18:05:54] thanks! 
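(Editor's note: the "4-hit-wonder" admission rule described in the exchange above can be illustrated with a small sketch. This is not the actual Varnish/VCL configuration used on the cache hosts; the class name `NHitWonderFilter`, the method `should_admit`, and the unbounded in-memory counter are all hypothetical simplifications. It only shows the stated behaviour: an object is allowed into the frontend cache on its 5th access, so a one-pass scan over rarely requested media never displaces popular objects there.)

```python
from collections import Counter


class NHitWonderFilter:
    """Hypothetical admission filter: admit an object only after it has
    been requested more than `threshold` times (illustration only)."""

    def __init__(self, threshold=4):
        self.threshold = threshold
        # Assumption: an unbounded per-key counter. A real cache would
        # bound or expire this state.
        self.seen = Counter()

    def should_admit(self, key):
        """Record one access to `key`; return True once the access count
        exceeds the threshold (the 5th access when threshold is 4)."""
        self.seen[key] += 1
        return self.seen[key] > self.threshold


# A popular object is admitted on its 5th access.
popular = NHitWonderFilter(threshold=4)
popular_results = [popular.should_admit("Example.jpg") for _ in range(5)]

# A single pass over many distinct objects admits nothing.
scan = NHitWonderFilter(threshold=4)
scan_admitted = sum(scan.should_admit(f"img{i}.jpg") for i in range(1000))
```

Under these assumptions a scan that touches each file once is filtered out at the frontend, which matches the expectation in the discussion that only the larger backend caches, which have no such filter, would see churn.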
[18:06:06] thanks for the info! [18:07:07] 06Operations: Rethink/clarify/document use of 'analytics' vs. 'statistics' in group names - https://phabricator.wikimedia.org/T149225#2745794 (10AlexMonk-WMF) [18:07:17] 06Operations, 10netops: cr1-eqiad:ae4 is disabled due to VRRP issue - https://phabricator.wikimedia.org/T149226#2745808 (10mark) [18:07:41] looks good on labtestwiki2001. now I'll check that nothing goes goofy for other wikis on mw1099 [18:08:17] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 250.64 seconds [18:08:25] PROBLEM - HHVM rendering on mw1234 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.004 second response time [18:08:42] 06Operations, 10ops-eqiad, 10netops: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2745822 (10mark) [18:08:44] 06Operations, 10netops: cr1-eqiad:ae4 is disabled due to VRRP issue - https://phabricator.wikimedia.org/T149226#2745821 (10mark) [18:09:25] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 71216 bytes in 0.186 second response time [18:10:16] !log bd808@tin Synchronized wmf-config/CommonSettings.php: wikitech: Re-enable OAuth management interfaces T149150 (duration: 00m 46s) [18:10:17] T149150: OAuth api access on wikitech fails with consumed nonce error - https://phabricator.wikimedia.org/T149150 [18:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:11:29] done with swat. 
thanks for playing folks [18:11:38] I'll swat some moar [18:12:42] !log Disabling cr1-eqiad:xe-5/2/0 [18:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:14:56] 06Operations: Rename stat100x machines to have misc element names - https://phabricator.wikimedia.org/T149228#2745858 (10AlexMonk-WMF) [18:15:05] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 2, unused: 0BRxe-5/2/0: down - Core: cr1-eqiad:xe-5/2/0 {#1983} [10Gbps DF]BR [18:16:30] 06Operations, 10ops-eqiad, 10Dumps-Generation: Degraded RAID on dataset1001 - https://phabricator.wikimedia.org/T148715#2730900 (10Volans) @ArielGlenn thanks for noticing it, it shouldn't persist after the service is back to OK. You can perfectly delete the comment on Icinga. I've opened T149229 for more inv... [18:17:45] RECOVERY - NTP on cp4007 is OK: NTP OK: Offset 0.0006181001663 secs [18:18:50] 06Operations, 10ops-eqiad, 10Dumps-Generation: Degraded RAID on dataset1001 - https://phabricator.wikimedia.org/T148715#2745894 (10ArielGlenn) 05Open>03Resolved a:03ArielGlenn Thanks! 
Comment gone, closing :-) [18:19:27] (03CR) 10BBlack: [C: 031] wmf-auto-reimage: support mutiple conftool roles [puppet] - 10https://gerrit.wikimedia.org/r/318131 (https://phabricator.wikimedia.org/T149216) (owner: 10Volans) [18:19:35] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 2, unused: 0 [18:19:38] !log Chris is moving cr1-eqiad and cr2-eqiad xe-5/2/0 to xe-3/2/0 (both sides) [18:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:20:50] !log maxsem@tin Synchronized php-1.28.0-wmf.23/extensions/GeoData/: https://gerrit.wikimedia.org/r/#/c/318138/ (duration: 00m 47s) [18:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:21:47] (03PS3) 10Madhuvishy: tools proxy: Add health check and icinga monitoring [puppet] - 10https://gerrit.wikimedia.org/r/314707 (https://phabricator.wikimedia.org/T143638) [18:27:10] !log Chris is moving cr1-eqiad and cr2-eqiad xe-5/3/0 to xe-3/3/0 (both sides) [18:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:28:26] 06Operations, 10Analytics, 10Analytics-Cluster, 06Research-and-Data, and 2 others: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2745920 (10Ottomata) Would stat1002 be acceptable? It is more of a local ‘compute’ node (more storage and RAM) than stat1004. 
[18:31:31] (CR) Dzahn: [C: 2] decom lead (ex-gerrit) [puppet] - https://gerrit.wikimedia.org/r/318035 (https://phabricator.wikimedia.org/T147905) (owner: Dzahn) [18:31:35] RECOVERY - NTP on cp4018 is OK: NTP OK: Offset -0.0005474984646 secs [18:31:44] !log Disabling BGP session to AS6461 on cr1-eqiad, preparing for port migration [18:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:32:30] (PS1) Urbanecm: Account creation throttle exception for WIkipedia Editathon at Ohio State University on 2016-11-02 [mediawiki-config] - https://gerrit.wikimedia.org/r/318141 (https://phabricator.wikimedia.org/T149200) [18:35:15] (PS1) Urbanecm: Fix a typo (ramge -> range) [mediawiki-config] - https://gerrit.wikimedia.org/r/318142 [18:35:45] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:35:51] (PS2) Urbanecm: Fix a typo (ramge -> range) [mediawiki-config] - https://gerrit.wikimedia.org/r/318142 (https://phabricator.wikimedia.org/T146600) [18:37:05] Should https://gerrit.wikimedia.org/r/318142 be scheduled to a window or could it be deployed now? It's a simple typo (ramge -> range) in throttle.png, the event is in December... [18:37:29] Operations, Analytics, Analytics-Cluster, Research-and-Data, and 2 others: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2745932 (ellery) Yes, I was operating under the assumption that stat1004 was the local "compute" node and that stat1002 is more or less reserved for...
[18:38:57] !log Chris moved cr1-eqiad:xe-5/3/1 to xe-3/3/1 [18:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:39:06] !log Reactivated BGP to AS6461 on cr1-eqiad [18:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:42:31] (PS3) Hoo man: Enable Wikibase #statements parser function on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/318071 (https://phabricator.wikimedia.org/T142940) (owner: Thiemo Mättig (WMDE)) [18:43:52] (CR) Hoo man: Enable Wikibase #statements parser function on beta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/318071 (https://phabricator.wikimedia.org/T142940) (owner: Thiemo Mättig (WMDE)) [18:44:22] (PS4) Hoo man: Enable Wikibase #statements parser function on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/318071 (https://phabricator.wikimedia.org/T142940) (owner: Thiemo Mättig (WMDE)) [18:44:42] (CR) Hoo man: "Removed the superfluous repo setting." [mediawiki-config] - https://gerrit.wikimedia.org/r/318071 (https://phabricator.wikimedia.org/T142940) (owner: Thiemo Mättig (WMDE)) [18:45:31] (PS2) Urbanecm: Account creation throttle exception for WIkipedia Editathon at Ohio State University on 2016-11-02 [mediawiki-config] - https://gerrit.wikimedia.org/r/318141 (https://phabricator.wikimedia.org/T149200) [18:45:39] Operations, ops-codfw, DC-Ops, Parsoid: wtp2019 issues an uncorrectable memory error - https://phabricator.wikimedia.org/T148710#2745958 (Arlolra) > @arlorla yes of course, feel free to. In fact, thanks for doing it. No problem. {{done}} Is there a better way of checking if a host is pooled th... [18:45:53] (CR) Hoo man: [C: 2] "Beta only."
[mediawiki-config] - https://gerrit.wikimedia.org/r/318071 (https://phabricator.wikimedia.org/T142940) (owner: Thiemo Mättig (WMDE)) [18:46:24] (Merged) jenkins-bot: Enable Wikibase #statements parser function on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/318071 (https://phabricator.wikimedia.org/T142940) (owner: Thiemo Mättig (WMDE)) [18:47:49] !log hoo@tin Synchronized wmf-config/Wikibase-labs.php: For consistency (duration: 00m 47s) [18:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:48:09] hoo: Should https://gerrit.wikimedia.org/r/318142 be scheduled to a window or could it be deployed now? It's a simple typo (ramge -> range) in throttle.png, the event is in December... [18:49:59] (PS2) Volans: wmf-auto-reimage: support mutiple conftool roles [puppet] - https://gerrit.wikimedia.org/r/318131 (https://phabricator.wikimedia.org/T149216) [18:50:11] Urbanecm: hm… put it up for swat [18:50:14] well :/ [18:50:15] PROBLEM - Host psw1-eqiad.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [18:55:29] Let's see if we can prepare a test for that. [18:55:56] Operations, netops: Migrate links from cr1-eqiad/cr2-eqiad fpc 5 to fpc 3 - https://phabricator.wikimedia.org/T149196#2745997 (mark) All ports on cr1-eqiad FPC5 have been moved to FPC3, except for the uplink to pfw1-eqiad, which we need to schedule downtime for with Fundraising. @Jgreen Let's schedule s... [18:57:40] Glaisher: https://phabricator.wikimedia.org/T149232 [18:58:25] (PS1) Filippo Giunchedi: swift: add lvs configuration for esams [puppet] - https://gerrit.wikimedia.org/r/318145 (https://phabricator.wikimedia.org/T149098) [18:59:15] PROBLEM - puppet last run on elastic1042 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [18:59:24] Operations, Ops-Access-Requests, Continuous-Integration-Infrastructure, Release-Engineering-Team: Requesting access to contint for - https://phabricator.wikimedia.org/T149233#2746004 (hashar) [19:00:04] ostriches: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161026T1900). Please do the needful. [19:01:11] Operations, Ops-Access-Requests, Continuous-Integration-Infrastructure, Release-Engineering-Team: Requesting access to contint for niedzielski - https://phabricator.wikimedia.org/T149233#2746004 (hashar) [19:03:25] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [19:05:22] Operations, Ops-Access-Requests: Requesting access researchers, statistics-users, analytics-users, statistics-privatedata-users, and analytics-privatedata-users for Zareen - https://phabricator.wikimedia.org/T149211#2746026 (Zareenf) @Krenair it doesn't seem like statistics-users adds any additional acce... [19:05:44] Operations, Ops-Access-Requests: Requesting access researchers, analytics-users, statistics-privatedata-users, and analytics-privatedata-users for Zareen - https://phabricator.wikimedia.org/T149211#2746027 (Zareenf) [19:06:51] 1) ThrottleTest::testThrottlingExceptionsKeys [19:06:51] Invalid parameter in a throttle rule detected: ramge [19:06:52] Failed asserting that an array contains 'ramge'. [19:06:54] good [19:10:45] PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [19:11:52] (PS1) Hoo man: Use $wgWBClientSettings to set Wikibase client settings [mediawiki-config] - https://gerrit.wikimedia.org/r/318147 [19:12:23] (CR) Hoo man: [C: 2] Use $wgWBClientSettings to set Wikibase client settings [mediawiki-config] - https://gerrit.wikimedia.org/r/318147 (owner: Hoo man) [19:12:26] (PS1) Filippo Giunchedi: hieradata: add swift user for docker_registry [puppet] - https://gerrit.wikimedia.org/r/318148 (https://phabricator.wikimedia.org/T149098) [19:12:38] (PS1) Dereckson: Tests for throttle rules [mediawiki-config] - https://gerrit.wikimedia.org/r/318149 [19:12:51] (CR) Hoo man: [C: -1] Enable Wikibase #statements parser function on all test wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/317840 (https://phabricator.wikimedia.org/T142940) (owner: Thiemo Mättig (WMDE)) [19:12:55] (Merged) jenkins-bot: Use $wgWBClientSettings to set Wikibase client settings [mediawiki-config] - https://gerrit.wikimedia.org/r/318147 (owner: Hoo man) [19:13:19] (CR) jenkins-bot: [V: -1] Tests for throttle rules [mediawiki-config] - https://gerrit.wikimedia.org/r/318149 (owner: Dereckson) [19:13:45] https://integration.wikimedia.org/ci/job/operations-mw-config-phpunit/9938/console 19:13:01 Invalid parameter in a throttle rule detected: ramge [19:13:49] nice [19:14:00] !log hoo@tin Synchronized wmf-config/Wikibase-labs.php: For consistency (duration: 00m 45s) [19:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:20:09] * twentyafterfour will be deploying the train today in ostriches' place [19:22:55] Operations, ops-codfw: RAID degraded on ms-be2011 - https://phabricator.wikimedia.org/T149234#2746039 (fgiunchedi) [19:23:16] ACKNOWLEDGEMENT - MegaRAID on ms-be2011 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) Filippo Giunchedi T149234 [19:23:16] ACKNOWLEDGEMENT - puppet last
run on ms-be2011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 24 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdi1] Filippo Giunchedi T149234 [19:25:20] Operations, ops-codfw, DC-Ops, Parsoid: wtp2019 issues an uncorrectable memory error - https://phabricator.wikimedia.org/T148710#2746051 (akosiaris) >>! In T148710#2745958, @Arlolra wrote: >> @arlorla yes of course, feel free to. In fact, thanks for doing it. > > No problem. {{done}} > > Is th... [19:25:31] (PS2) Dzahn: remove lead.wikimedia.org, keep lead.mgmt.eqiad [dns] - https://gerrit.wikimedia.org/r/318033 (https://phabricator.wikimedia.org/T147905) [19:26:07] (CR) Dzahn: [C: 2] remove lead.wikimedia.org, keep lead.mgmt.eqiad [dns] - https://gerrit.wikimedia.org/r/318033 (https://phabricator.wikimedia.org/T147905) (owner: Dzahn) [19:26:15] RECOVERY - puppet last run on elastic1042 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [19:26:39] Operations, ops-codfw, DC-Ops, Parsoid: wtp2019 issues an uncorrectable memory error - https://phabricator.wikimedia.org/T148710#2746052 (Arlolra) > But, better in what way ? From the command line. [19:30:15] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:30:44] (CR) Filippo Giunchedi: [C: -1] "Minor adjustment, LGTM otherwise" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/315648 (https://phabricator.wikimedia.org/T147918) (owner: Gilles) [19:33:12] (PS2) Dereckson: Tests for throttle rules [mediawiki-config] - https://gerrit.wikimedia.org/r/318149 [19:35:23] (CR) Dereckson: "The new test failed (as expected) in PS1 when against master, and passed (as expected) when rebased against the ramge → range fix commit."
[mediawiki-config] - https://gerrit.wikimedia.org/r/318149 (owner: Dereckson) [19:35:25] PROBLEM - Apache HTTP on mw1286 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.010 second response time [19:35:35] PROBLEM - HHVM rendering on mw1286 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.013 second response time [19:35:42] Operations, ops-eqiad, Patch-For-Review: investigate lead hardware issue - https://phabricator.wikimedia.org/T147905#2746097 (Dzahn) @Robh @cmjohnson lead has been removed from puppet/install/DNS and shutdown. mgmt DNS has been kept. I am now moving the ticket to ops-eqiad to follow-up on it with the... [19:36:08] Operations, ops-eqiad, netops: Decommission psw1-eqiad - https://phabricator.wikimedia.org/T149224#2746100 (Cmjohnson) Open>Resolved a:Cmjohnson psw1-eqiad has been reset to factory settings, removed from rack and placed in storage. Racktables has been updated. [19:36:35] RECOVERY - Apache HTTP on mw1286 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.021 second response time [19:36:35] RECOVERY - HHVM rendering on mw1286 is OK: HTTP OK: HTTP/1.1 200 OK - 71244 bytes in 0.081 second response time [19:37:03] Operations, ops-eqiad: investigate lead hardware issue - https://phabricator.wikimedia.org/T147905#2746103 (Dzahn) [19:37:17] Operations, ops-eqiad: investigate lead hardware issue - https://phabricator.wikimedia.org/T147905#2707885 (Dzahn) a:Dzahn>Cmjohnson [19:38:26] RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [19:42:26] twentyafterfour did we deploy the train yet?
[19:43:15] no audephone [19:43:25] ok [19:43:51] Operations, Analytics-Kanban, Discovery, Discovery-Analysis (Current work), and 2 others: Can't install R package Boom (& bsts) on stat1002 (but can on stat1003) - https://phabricator.wikimedia.org/T147682#2746113 (mpopov) @Ottomata: Any luck? [19:45:05] Operations, Analytics-Kanban, Discovery, Discovery-Analysis (Current work), and 2 others: Can't install R package Boom (& bsts) on stat1002 (but can on stat1003) - https://phabricator.wikimedia.org/T147682#2746114 (Ottomata) No, sorry, I haven't had time yet :/ not sure when I'll get to this. W... [19:45:30] audephone: no but I'm about to [19:45:35] any reason to hold it up? [19:45:54] twentyafterfour: I opened two blockers [19:46:05] matanya: looking at that now [19:46:49] matanya: there is only one at https://phabricator.wikimedia.org/T147517 [19:47:28] oh, MaxSem already fixed the other [19:47:31] neat [19:49:02] (Draft2) Urbanecm: [throttle] Remove old rules [mediawiki-config] - https://gerrit.wikimedia.org/r/318157 [19:49:31] Twentyafterfour no blockers or such [19:50:02] I just prefer to be around some after the deploy in case of any unlikely issues with wikidata [19:51:40] (PS3) Urbanecm: [throttle] Rule for Wikipedia Editathon at Ohio State University on 2016-11-02 [mediawiki-config] - https://gerrit.wikimedia.org/r/318141 (https://phabricator.wikimedia.org/T149200) [19:51:56] I think I should revert https://gerrit.wikimedia.org/r/#/c/316226/ [19:52:09] to resolve T149232 [19:52:09] T149232: [bug] globalblocking fatal in 1.28.0-wmf.23 - - https://phabricator.wikimedia.org/T149232 [19:52:11] twentyafterfour: wait, I've a fix, let me test it on mw1017 [19:52:26] Dereckson: ok, standing by [19:53:35] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [19:54:52] (PS7) Hashar: zuul: migrate server only settings out of merger [puppet] -
https://gerrit.wikimedia.org/r/309299 [19:54:59] matanya: could you check your test case link on mw1017? https://www.mediawiki.org/w/api.php?bgip=127.0.4.4&format=json&action=query&maxlag=5&bgprop=address&list=globalblocks&bglimit=1 [19:55:11] Is that the expected output? [19:55:25] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:55:40] that doesn't look right to me? "Exception Caught: Call to a member function buildLike() on a non-object (null)" [19:55:53] Dereckson: how can a specific host ? [19:56:02] twentyafterfour: on mw1017? [19:56:31] matanya: with a special header, that could be handled by one of the extension offered by https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug [19:56:56] (CR) Hashar: "Rebased recompiled https://puppet-compiler.wmflabs.org/4485/ Still only changes the zuul-merger on scandium :]" [puppet] - https://gerrit.wikimedia.org/r/309299 (owner: Hashar) [19:57:12] Dereckson: i don't have access to wmnet [19:57:37] matanya: you don't need that, you can ask the target server in any request with a special header [19:57:40] (CR) Urbanecm: Show changes from last 14 days in watchlist in cswiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/316295 (https://phabricator.wikimedia.org/T148327) (owner: Urbanecm) [19:58:03] matanya: you install the extension, pick mw1017 in the list, and toggle the button to on, and you're there [19:58:14] Dereckson: sorry wikimedia-debug was disabled. with it enabled I get "{"batchcomplete":"","query":{"globalblocks":[]}}" [19:58:48] matanya: is that the expected result or should some blocks appear?
^ [19:59:14] (I imagine there is no block for 127.0.4.4 so that looks good) [19:59:25] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [19:59:30] if there are blocks you should see list [19:59:33] if not, [] [20:00:06] would you have an IP with some blocks to list? [20:00:39] checking [20:01:40] https://www.mediawiki.org/w/api.php?bgip=180.169.19.170&format=json&action=query&maxlag=5&bgprop=address&list=globalblocks&bglimit=1 [20:01:44] {"batchcomplete":"","query":{"globalblocks":[{"address":"180.169.19.170","anononly":""}]}} [20:01:45] https://www.mediawiki.org/w/api.php?bgip=71.80.171.128&format=json&action=query&maxlag=5&bgprop=address&list=globalblocks&bglimit=1 [20:02:05] yes, fixed [20:02:26] sweet. [20:02:31] okay I scap pull on mw1017 to restore genuine (bug) state [20:02:49] Dereckson: where's the patch, I'll +2, merge and deploy with the train [20:02:55] https://gerrit.wikimedia.org/r/#/c/318158/ [20:03:15] PROBLEM - HHVM rendering on mw1283 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.031 second response time [20:03:35] PROBLEM - Apache HTTP on mw1283 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.009 second response time [20:04:15] RECOVERY - HHVM rendering on mw1283 is OK: HTTP OK: HTTP/1.1 200 OK - 71223 bytes in 0.139 second response time [20:04:35] RECOVERY - Apache HTTP on mw1283 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.044 second response time [20:10:25] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:14:47] !log starting Parsoid deploy [20:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:18:18] (CR) Jforrester: "> Should I schedule this for a SWAT? It's been sitting for a while."
[mediawiki-config] - https://gerrit.wikimedia.org/r/301128 (owner: Jforrester) [20:20:10] (PS4) Jforrester: Test setting gallery config differently on Beta Cluster enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/301128 [20:23:35] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [20:24:13] Operations, Ops-Access-Requests, Continuous-Integration-Infrastructure, Release-Engineering-Team: Requesting access to contint for niedzielski - https://phabricator.wikimedia.org/T149233#2746281 (Legoktm) +1 [20:25:56] Operations, Ops-Access-Requests, Continuous-Integration-Infrastructure, Release-Engineering-Team: Requesting access to contint for niedzielski - https://phabricator.wikimedia.org/T149233#2746004 (greg) Approved from my side. [20:28:42] !log twentyafterfour@tin Synchronized php-1.28.0-wmf.23/extensions/GlobalBlocking/: Deploy fix for T149232 to unblock the train refs T147517 (duration: 00m 51s) [20:28:44] T149232: [bug] globalblocking fatal in 1.28.0-wmf.23 - - https://phabricator.wikimedia.org/T149232 [20:28:44] T147517: MW-1.28.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T147517 [20:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:30:30] and https://www.mediawiki.org/w/api.php?bgip=180.169.19.170&format=json&action=query&maxlag=5&bgprop=address&list=globalblocks&bglimit=1 still works, good [20:30:55] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [20:31:03] !log updated Parsoid to version ede4353 [20:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:31:22] !log reverting Parsoid to version 63f1e151 [20:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:31:45] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently
enabled, last run 42 seconds ago with 0 failures [20:35:22] (Merged) jenkins-bot: group1 wikis to 1.28.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/318174 (owner: 20after4) [20:36:01] Operations, Analytics-Kanban, Performance-Team, Reading-Admin, Traffic: Preliminary Design document for A/B testing - https://phabricator.wikimedia.org/T143694#2746331 (Nuria) @BBlack Would you be so kind as to look at our latest proposal to bucket users on doc : https://docs.google.com/docum... [20:36:47] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.28.0-wmf.23 [20:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:41:07] !log RESTBase deploy e835f9b8 - staging [20:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:42:57] twentyafterfour: we should really have morebots change the T*** to a phab link [20:43:10] will open a task for that [20:43:11] matanya: hmmm ? [20:43:11] !log reverted Parsoid to version 63f1e151 [20:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:44:09] Dereckson: https://wikitech.wikimedia.org/wiki/Server_Admin_Log in SAL is not linked to phab [20:44:12] matanya: oh well yes, makes sense, for another project, I've a decorator trying to parse T[0-9][0-9] to offer phab links [20:44:25] only on the cool one on tools it does [20:44:37] Operations, Analytics-Kanban, Performance-Team, Reading-Admin, Traffic: Preliminary Design document for A/B testing - https://phabricator.wikimedia.org/T143694#2746336 (ellery) The pseudo code does not quite match the current text description of the Double Bucket proposal. [20:44:58] That easy to parse, and not evil as SHA-1 hashes.
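[editor's note] The idea discussed above, turning bare T-numbers in log messages into Phabricator links, can be sketched roughly as follows. This is only an illustration and not the code of morebots, stashbot or the decorator mentioned: `PHAB_BASE` is assumed from the task URLs that appear in this log, and `link_task_ids` is a hypothetical helper name.

```python
import re

# Assumed URL prefix, copied from the task links that appear in this log.
PHAB_BASE = "https://phabricator.wikimedia.org/"

# "T" followed by digits, as a whole word. The lookbehind skips IDs that are
# already part of a URL or of a longer word (preceded by "/" or a word char).
TASK_ID = re.compile(r"(?<![/\w])T[0-9]+\b")


def link_task_ids(message: str) -> str:
    """Replace each bare task ID with its Phabricator URL (hypothetical helper)."""
    return TASK_ID.sub(lambda match: PHAB_BASE + match.group(0), message)


if __name__ == "__main__":
    print(link_task_ids("Deploy fix for T149232 to unblock the train refs T147517"))
```

The lookbehind is a design choice for this sketch: without it, a message that already contains a full task URL would be rewritten a second time.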
[20:45:02] Dereckson: https://tools.wmflabs.org/sal/production [20:45:08] this one has it [20:45:25] and gerrit too [20:45:51] you can try to parse commit messages too (would be useful for Parsoid, Central Notice related mesasges for example) [20:46:12] commit hashes, but that needs a white list of ascii words [20:46:20] i just use the one on tools :) [20:46:50] !log RESTBase deploy e835f9b8 - canary on restbase1007 [20:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:48:31] Words like for added and ed25519 validates [0-9a-f] [20:49:01] !log RESTBase deploy e835f9b8 [20:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:49:14] that's why phabricator prefixes commit hashes with rCALLSIGN [20:49:20] Dereckson: ^ [20:49:28] er nope [20:49:51] (PS1) Jcrespo: mariadb: set secure_file_priv to /dev/null [puppet] - https://gerrit.wikimedia.org/r/318175 [20:49:54] it was because for SVN you needed to disambiguate r1 r2 r3 r4 r5 [20:50:07] across repositories. [20:50:21] !log syncing /var/lib/jenkins from gallium to contint1001 .
rsync server spawned on gallium in a term, contint1001 using rsync --bwlimit=5m --delete --info=progress2 -az rsync://gallium.wikimedia.org/jenkins /var/lib/jenkins [20:50:23] mutante: ^ :] [20:50:26] ok but it does fix the regex problem ;) [20:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:50:39] hashar: ah, you are already doing it :) [20:52:03] twentyafterfour: they could become a past artefact: when you create a new repo, it's an optional field now [20:52:10] (CR) Jcrespo: [C: 2] mariadb: set secure_file_priv to /dev/null [puppet] - https://gerrit.wikimedia.org/r/318175 (owner: Jcrespo) [20:52:35] when omitted, it assigns a numeric value for path purposes [20:53:10] Operations, Cassandra, Services (blocked): SSL handshake errors - https://phabricator.wikimedia.org/T148654#2746342 (Eevans) This seems to have stopped as of about 2016-10-21T10:49. I'll go ahead and close this issue, but if anyone knows what this was about, I'm still curious. [20:53:15] which is unfortunate because we lose memorable repository urls [20:53:30] Operations, Cassandra, Services (blocked): SSL handshake errors - https://phabricator.wikimedia.org/T148654#2746343 (Eevans) Open>Resolved a:Eevans [20:53:36] (not that callsigns are memorable when we've got > 1000 of them and all are short random strings) [20:54:43] Yes, callsigns are cleary really useful to refer to a repository when you've only 100 repos with logic names. [20:54:56] !log restarting mariadb on db2011 to test configuration change [20:55:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:58:32] twentyafterfour Dereckson thanks both, i'd call it a day for now, group1 seems sane from my POV [20:59:05] matanya: yep, I've been watching kibana and everything seems to be normal :) [20:59:25] thanks for the quick resolution.
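[editor's note] The "regex problem" raised earlier in this exchange (ordinary words such as "added" or "ed25519" consist only of [0-9a-f] characters, so a naive commit-hash pattern also matches them) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the length bounds, the `ORDINARY_WORDS` list and the `candidate_hashes` helper are hypothetical and do not come from any bot named in the log.

```python
import re

# Naive pattern: any lowercase hex run of 5 to 40 characters standing alone.
# The length bounds are an assumption chosen for this illustration.
HEX_TOKEN = re.compile(r"\b[0-9a-f]{5,40}\b")

# Illustrative word list: tokens that are valid hex but are not commit hashes.
ORDINARY_WORDS = {"added", "faded", "decade", "ed25519"}


def candidate_hashes(message: str) -> list:
    """Return hex-looking tokens that are not known ordinary words (hypothetical helper)."""
    return [token for token in HEX_TOKEN.findall(message) if token not in ORDINARY_WORDS]


if __name__ == "__main__":
    print(candidate_hashes("added ed25519 validation, updated Parsoid to version ede4353"))
```

A prefix such as the rCALLSIGN form mentioned in the chat sidesteps the word list entirely, because the pattern can then require the prefix instead of guessing from the hex characters alone.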
[21:00:11] matanya: thanks to have reported that issue [21:00:28] sure, that is my weekly fatal-fun :) [21:01:43] matanya: Dereckson: thank you :] [21:02:37] !log restarting mysql and rebooting db1035 [21:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:02:57] twentyafterfour: spoke to soon [21:03:00] found a fatal [21:03:14] matanya ? [21:03:18] https://meta.wikimedia.org/wiki/Special:CentralNoticeBanners/edit/strategicplan2 [21:03:55] hu ? [21:04:07] CentralNotice isn't pulled by the train [21:04:24] created : https://phabricator.wikimedia.org/T149240 [21:05:16] that should be a 404 instead of a fatal [21:05:42] (PS2) Dzahn: remove palladium.eqiad, keep palladium.mgmt.eqiad [dns] - https://gerrit.wikimedia.org/r/318034 (https://phabricator.wikimedia.org/T147320) [21:06:29] I don't think it is wmf.23 specific [21:06:32] twentyafterfour: with /edit/, the intent could bo to create it [21:06:39] indeed [21:07:07] (if the exception is also thrown to view, 404 yes) [21:07:20] https://meta.wikimedia.org/wiki/Sdfsdffdgsdgfsdgfsd is served with a 404 code for example [21:08:02] !log gallium: stopped rsync server [21:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:08:45] PROBLEM - MariaDB Slave Lag: s1 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 635.28 seconds [21:08:55] matanya: probably not, they have a wmf_deploy branch and they ask when they found that opportune to rebase wmfxx against their wmf_deploy [21:09:24] (CR) Dzahn: [C: 2] remove palladium.eqiad, keep palladium.mgmt.eqiad [dns] - https://gerrit.wikimedia.org/r/318034 (https://phabricator.wikimedia.org/T147320) (owner: Dzahn) [21:10:42] "strategicplan2" legitimately does not exist as far as I can see [21:10:45] and the /edit/ uri does not seem to be used for creating them.
[21:11:28] ah, wait, let me get files from palladium anyways [21:21:02] (CR) Dereckson: Show changes from last 14 days in watchlist in cswiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/316295 (https://phabricator.wikimedia.org/T148327) (owner: Urbanecm) [21:29:40] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/tendril/lib/config.php] [21:33:32] (PS1) Filippo Giunchedi: site: remove explicit role prometheus::node_exporter [puppet] - https://gerrit.wikimedia.org/r/318203 [21:39:15] (PS2) Filippo Giunchedi: site: remove explicit role prometheus::node_exporter [puppet] - https://gerrit.wikimedia.org/r/318203 [21:39:17] (PS1) Filippo Giunchedi: role::logstash::elasticsearch: include ::standard [puppet] - https://gerrit.wikimedia.org/r/318205 [21:39:18] twentyafterfour: flow fatal as wellhttps://www.mediawiki.org/w/index.php?action=history&offset=20160126191716&title=Talk%3AMediaWiki [21:39:26] twentyafterfour: https://www.mediawiki.org/w/index.php?action=history&offset=20160126191716&title=Talk%3AMediaWiki [21:39:58] ACKNOWLEDGEMENT - MariaDB Slave Lag: s1 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2436.50 seconds Jcrespo running schema change (imagelinks) [21:41:07] twentyafterfour: created https://phabricator.wikimedia.org/T149251 [21:43:05] (CR) Filippo Giunchedi: "Trying to debug why logstash100[456] didn't get node_exporter role via standard, this is not it though: https://puppet-compiler.wmflabs.or" [puppet] - https://gerrit.wikimedia.org/r/318205 (owner: Filippo Giunchedi) [21:47:29] matanya: ok, I'm not sure if it's related to wmf.23 but I'll poke it a little [21:53:04] (PS2) Filippo Giunchedi: role::logstash::elasticsearch: include base::firewall [puppet] - https://gerrit.wikimedia.org/r/318205 [21:53:41] Operations, Ops-Access-Requests: Requesting
access researchers, analytics-users, statistics-privatedata-users, and analytics-privatedata-users for Zareen - https://phabricator.wikimedia.org/T149211#2746640 (JKatzWMF) @Krenair it is hard for me to figure out what access groups are required for what, even... [21:55:20] (PS1) BBlack: cp1008: disable do_ocsp_int while experimenting with nginx packages [puppet] - https://gerrit.wikimedia.org/r/318210 [21:55:31] (CR) BBlack: [C: 2 V: 2] cp1008: disable do_ocsp_int while experimenting with nginx packages [puppet] - https://gerrit.wikimedia.org/r/318210 (owner: BBlack) [21:56:12] (CR) Filippo Giunchedi: "Looks like ferm wasn't being updated because role::logstash::elasticsearch doesn't include role::logstash and thus base::firewall" [puppet] - https://gerrit.wikimedia.org/r/318205 (owner: Filippo Giunchedi) [21:57:10] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [22:08:29] Operations, Ops-Access-Requests: Requesting access researchers, analytics-users, statistics-privatedata-users, and analytics-privatedata-users for Zareen - https://phabricator.wikimedia.org/T149211#2746664 (Krenair) Yeah, this is a big mess. What I know for sure is that researchers will certainly give ac...
[22:21:42] Operations, netops, Prometheus-metrics-monitoring: Firewall rules production/labs for prometheus-node-exporter - https://phabricator.wikimedia.org/T149253#2746698 (fgiunchedi) [22:25:36] twentyafterfour: https://phabricator.wikimedia.org/T149254 [22:26:17] Operations, Labs, netops, Prometheus-metrics-monitoring: Firewall rules production/labs for prometheus-node-exporter - https://phabricator.wikimedia.org/T149253#2746728 (Krenair) [22:36:55] Operations, Cassandra, Services (blocked): SSL handshake errors - https://phabricator.wikimedia.org/T148654#2746764 (fgiunchedi) I _think_ https://gerrit.wikimedia.org/r/#/c/316906/ might have been related [22:37:45] matanya: I don't see anything since the branch cut that would cause that error. [22:38:17] twentyafterfour: SMalyshev is looking into it [22:42:39] (PS1) Dzahn: remove gallium from site.pp, installserver [puppet] - https://gerrit.wikimedia.org/r/318216 (https://phabricator.wikimedia.org/T95757) [22:44:14] (PS2) Filippo Giunchedi: Introduce mtail module [puppet] - https://gerrit.wikimedia.org/r/316543 (https://phabricator.wikimedia.org/T147923) [22:45:12] (PS1) Dzahn: contint: remove gallium conditional from contint::master_dir [puppet] - https://gerrit.wikimedia.org/r/318217 (https://phabricator.wikimedia.org/T95757) [22:47:37] twentyafterfour: this is a fatal as well : https://www.mediawiki.org/w/index.php?title=Special%3ALog&type=rights&user=k.4%3Blinux.ariesa.aroesa%40hotwail.com.opera&page=User%3A27-10-2016&year=28&month=-1&tagfilter=k.4%3Blinux.ariesa.aroesa%40hotwail.com.opera&subtype= [22:47:40] (PS1) Dzahn: nodepool: switch gallium to contint1001 [puppet] - https://gerrit.wikimedia.org/r/318218 (https://phabricator.wikimedia.org/T95757) [22:48:36] seems related to https://gerrit.wikimedia.org/r/#/c/315998/ [22:49:40] (Abandoned) Dzahn: nodepool: switch gallium to contint1001 [puppet] - https://gerrit.wikimedia.org/r/318218
(https://phabricator.wikimedia.org/T95757) (owner: 10Dzahn) [22:49:54] matanya: you can raise concern on https://phabricator.wikimedia.org/rMWfdce245e9fe5da3da5d869561b4bbcf0232e9b5e [22:50:36] TimestampException isn't raised in that commit though [22:51:01] i am just creating the ticket, will try to debug after [22:51:49] (03PS2) 10Hashar: nodepool: point to Jenkins on contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/313599 (https://phabricator.wikimedia.org/T95757) [22:52:41] (03PS2) 10Dzahn: contint: remove python-requests [puppet] - 10https://gerrit.wikimedia.org/r/317923 (https://phabricator.wikimedia.org/T51846) (owner: 10Hashar) [22:56:20] PROBLEM - Apache HTTP on mw1277 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.005 second response time [22:56:40] PROBLEM - HHVM rendering on mw1277 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.009 second response time [22:57:20] RECOVERY - Apache HTTP on mw1277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.072 second response time [22:57:21] I got 503 on the he.wiki [22:57:42] RECOVERY - HHVM rendering on mw1277 is OK: HTTP OK: HTTP/1.1 200 OK - 71227 bytes in 0.244 second response time [22:57:59] oh, i see above, probably coincidence [23:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161026T2300). Please do the needful. [23:00:05] Dereckson and James_F: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:17] Yeah yeah. 
[23:02:44] (03PS2) 10Filippo Giunchedi: prometheus: upgrade to new config syntax [puppet] - 10https://gerrit.wikimedia.org/r/317880 (https://phabricator.wikimedia.org/T147207) [23:03:29] James_F: don't sound too excited or anything [23:03:45] * James_F grins. [23:03:56] greg-g: It's a Beta-Cluster-only config change. Not the most exciting. [23:04:24] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: upgrade to new config syntax [puppet] - 10https://gerrit.wikimedia.org/r/317880 (https://phabricator.wikimedia.org/T147207) (owner: 10Filippo Giunchedi) [23:04:41] fair 'nough [23:05:26] I might add in a patch at the end if I figure out how to test it by then [23:05:39] Hello. I can SWAT if the train is done. [23:05:44] twentyafterfour: we're okay now? [23:06:30] Dereckson: train has been done for a while [23:06:35] ok [23:06:48] there are some spurious fatals matanya found but they aren't critical as far as I can see [23:08:40] (03CR) 10Dzahn: [C: 032] contint: remove python-requests [puppet] - 10https://gerrit.wikimedia.org/r/317923 (https://phabricator.wikimedia.org/T51846) (owner: 10Hashar) [23:09:27] (03PS3) 10Dzahn: contint: remove python-requests [puppet] - 10https://gerrit.wikimedia.org/r/317923 (https://phabricator.wikimedia.org/T51846) (owner: 10Hashar) [23:10:53] (03PS5) 10Dereckson: Test setting gallery config differently on Beta Cluster enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301128 (owner: 10Jforrester) [23:11:54] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301128 (owner: 10Jforrester) [23:12:01] (03Abandoned) 1020after4: Add conduit_token to the .arcrc on nodepool slaves [puppet] - 10https://gerrit.wikimedia.org/r/298097 (owner: 1020after4) [23:12:22] (03Merged) 10jenkins-bot: Test setting gallery config differently on Beta Cluster enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301128 (owner: 10Jforrester) [23:12:25] (03CR) 10Yuvipanda: [C: 031] tools proxy: Add health check 
and icinga monitoring [puppet] - 10https://gerrit.wikimedia.org/r/314707 (https://phabricator.wikimedia.org/T143638) (owner: 10Madhuvishy) [23:12:39] James_F: live on mw1099 [23:15:20] RECOVERY - MariaDB Slave Lag: s1 on db1047 is OK: OK slave_sql_lag Replication lag: 0.52 seconds [23:15:28] * James_F checks [23:16:39] It didn't break anything in prod through CS apparently. [23:16:47] Dereckson: Yeah, LGTM. [23:17:21] ok, syncing [23:17:21] (03PS4) 10Madhuvishy: tools proxy: Add health check and icinga monitoring [puppet] - 10https://gerrit.wikimedia.org/r/314707 (https://phabricator.wikimedia.org/T143638) [23:18:01] !log dereckson@tin Synchronized wmf-config/InitialiseSettings-labs.php: Test setting gallery config differently on Beta Cluster enwiki (T141349, 1/2, no-op in prod) (duration: 00m 49s) [23:18:02] T141349: Change the default mode of <gallery> tags to mode=packed on English Wikipedia - https://phabricator.wikimedia.org/T141349 [23:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:19:23] !log dereckson@tin Synchronized wmf-config/CommonSettings.php: Test setting gallery config differently on Beta Cluster enwiki (T141349, 2/2) (duration: 00m 45s) [23:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:20:06] (03CR) 10Yuvipanda: [C: 031] "I like it! Needs slight care when applying to tools, since i think we need to change the variable names for the passwords and secrets in h" [puppet] - 10https://gerrit.wikimedia.org/r/318060 (owner: 10Giuseppe Lavagetto) [23:20:38] 500 Undefined variable: wmgGalleryOptions in /srv/mediawiki/wmf-config/CommonSettings.php on line 464 [23:20:58] (03CR) 10Madhuvishy: [C: 032] tools proxy: Add health check and icinga monitoring [puppet] - 10https://gerrit.wikimedia.org/r/314707 (https://phabricator.wikimedia.org/T143638) (owner: 10Madhuvishy) [23:21:02] Hmm. [23:21:03] Lots of them?
[23:21:17] (03CR) 10Yuvipanda: [C: 04-1] "Minor nit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/318061 (owner: 10Giuseppe Lavagetto) [23:21:31] nope [23:21:31] hiiii. can i have an extra late change swatted? [23:21:39] https://gerrit.wikimedia.org/r/318221 [23:21:44] Dereckson: OK, I'll update the task. [23:21:46] MatmaRex: yes, you can [23:21:52] James_F: I'm fixing CS [23:22:07] CS? [23:22:13] CommonSettings.php [23:22:18] Oh. [23:22:22] (03CR) 10Yuvipanda: [C: 031] "Note that 'ip' can also be a CIDR. That isn't useful in labs, but might be in prod?" [puppet] - 10https://gerrit.wikimedia.org/r/318062 (owner: 10Giuseppe Lavagetto) [23:22:23] In what way? [23:22:29] !log dereckson@tin Synchronized wmf-config/CommonSettings.php: Fix for current Undefined variable: wmgGalleryOptions issue (duration: 00m 48s) [23:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:22:37] James_F: commenting out the variable call. Could you submit a follow-up change, either defining the array in InitialiseSettings.php with the default value or checking with isset()? [23:22:56] There's already a follow-up which isn't going out until the community agrees. [23:22:57] (03CR) 10Yuvipanda: [C: 031] docker::registry::web: allow using puppet certs [puppet] - 10https://gerrit.wikimedia.org/r/318063 (owner: 10Giuseppe Lavagetto) [23:23:19] The whole point of the isset() is to not have to change InitialiseSettings, right? [23:23:26] right, but you didn't use isset [23:23:37] I… didn't? [23:23:38] so if you add isset to CommonSettings.php, that will fix the issue, yes [23:23:46] no, a straight if ( $wmg ) [23:23:54] Bah. [23:23:54] https://gerrit.wikimedia.org/r/#/c/301128/5/wmf-config/CommonSettings.php [23:23:59] Fix that instead? [23:24:04] * Dereckson nods [23:24:09] Sorry. [23:25:32] (03CR) 10Dzahn: "note this doesn't actually remove python-requests and it's still pulled in by salt packages.. 
but yea, contint itself won't need it" [puppet] - 10https://gerrit.wikimedia.org/r/317923 (https://phabricator.wikimedia.org/T51846) (owner: 10Hashar) [23:27:35] (03CR) 10Yuvipanda: "minor nit, nbd." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/318064 (owner: 10Giuseppe Lavagetto) [23:28:13] (03CR) 10Yuvipanda: [C: 031] "This *should* be the case." [puppet] - 10https://gerrit.wikimedia.org/r/318065 (owner: 10Giuseppe Lavagetto) [23:28:47] James_F: so you're preparing a change to add isset? [23:28:50] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:29:05] (03PS1) 10Jforrester: For $wmgGalleryOptions, use isset() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318223 [23:29:05] Dereckson: ^ [23:29:23] (03PS3) 10Dzahn: contint: add phpdbg for code coverage [puppet] - 10https://gerrit.wikimedia.org/r/314563 (https://phabricator.wikimedia.org/T147778) (owner: 10Hashar) [23:29:32] (03CR) 10Dereckson: [C: 031] For $wmgGalleryOptions, use isset() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318223 (owner: 10Jforrester) [23:29:36] yep looks good to me [23:29:44] (03CR) 10Yuvipanda: "I still don't like the hiera() calls in profiles vs having them be parameters... 
(lgtm otherwise)" [puppet] - 10https://gerrit.wikimedia.org/r/318050 (https://phabricator.wikimedia.org/T148966) (owner: 10Giuseppe Lavagetto) [23:30:52] (03CR) 10Dereckson: "Follow-up: I5897bd8e997ab41f6ae45aac846605bb4a2c5f01" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301128 (owner: 10Jforrester) [23:31:12] (03PS2) 10Jforrester: For $wmgGalleryOptions, use isset() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318223 [23:31:22] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318223 (owner: 10Jforrester) [23:31:51] (03Merged) 10jenkins-bot: For $wmgGalleryOptions, use isset() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318223 (owner: 10Jforrester) [23:32:12] Fix live on mw1099 [23:32:51] Dereckson: i scheduled https://gerrit.wikimedia.org/r/#/c/318222/ , when you're done. thanks. [23:33:45] MatmaRex: https://gerrit.wikimedia.org/r/#/c/318221/ has been self-merged [23:34:12] indeed [23:34:19] it is a very simple change, and it is fixing a glaring bug [23:34:37] i think i am justified in self-merging it and having it deployed [23:34:45] ok [23:39:40] (03PS1) 10Madhuvishy: dynamicproxy: Fix health check endpoint location [puppet] - 10https://gerrit.wikimedia.org/r/318226 (https://phabricator.wikimedia.org/T143638) [23:39:49] James_F: so logs are good on mw1099 [23:40:02] Dereckson: OK, please push it. [23:40:48] (03CR) 10Madhuvishy: [C: 032] dynamicproxy: Fix health check endpoint location [puppet] - 10https://gerrit.wikimedia.org/r/318226 (https://phabricator.wikimedia.org/T143638) (owner: 10Madhuvishy) [23:41:36] !log dereckson@tin Synchronized wmf-config/CommonSettings.php: For $wmgGalleryOptions, use isset() ([[Gerrit:318223]]) (duration: 00m 45s) [23:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:42:06] logs still look good, we're done [23:42:28] tgr: would your change be ready? 
[23:43:30] PROBLEM - Apache HTTP on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:43:33] Dereckson: it's ready but couldn't find a way to test it on beta/mw1099, I'll just have to test it in production [23:43:40] PROBLEM - HHVM rendering on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:44:13] https://gerrit.wikimedia.org/r/#/c/318219/ [23:45:12] MatmaRex: live on mw1099 [23:45:51] Dereckson: thanks, looking [23:46:05] !log tools reenabled puppet across proxy hosts. /.well-known/healthz now live on tools-proxy T143638 [23:46:06] T143638: Setup a simple service that pages when it is unreachable - https://phabricator.wikimedia.org/T143638 [23:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:47:44] Dereckson: confirmed fixed [23:48:25] tgr: we can live with that: the change is simple, and impact seems limited to a special page (involved in login process yes) [23:48:43] and you tested in on labs with pwb so? [23:48:50] tested it [23:48:55] Dereckson: on second thought, it will be more complicated; I can take over [23:49:14] (wmf.22 needs it too and it will probably conflict with the security patch there) [23:49:44] I tested it and got a bunch of errors which I'm pretty sure are not caused by the patch :/ [23:49:47] tgr: okay, I sync MatmaRex's one and two changes for noc. 
and I'll ping you when I'm done [23:49:51] thanks [23:52:10] !log dereckson@tin Synchronized php-1.28.0-wmf.23/extensions/UploadWizard/resources/details/uw.DateDetailsWidget.js: Unbreak Flickr uploads (T149259) (duration: 00m 48s) [23:52:11] T149259: "TypeError: this.upload.deedChoser is undefined" when uploading from Flickr - https://phabricator.wikimedia.org/T149259 [23:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:52:30] (03PS3) 10Dereckson: Update noc.wikimedia.org dblist files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309743 [23:52:51] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309743 (owner: 10Dereckson) [23:53:17] (03Merged) 10jenkins-bot: Update noc.wikimedia.org dblist files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309743 (owner: 10Dereckson) [23:53:31] (03PS2) 10Dereckson: Add missing configuration files in noc. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318027 [23:53:37] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318027 (owner: 10Dereckson) [23:54:06] (03Merged) 10jenkins-bot: Add missing configuration files in noc. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318027 (owner: 10Dereckson) [23:54:20] PROBLEM - HHVM rendering on mw1230 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.003 second response time [23:55:16] noc. changes live on mw1099 [23:55:20] RECOVERY - HHVM rendering on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 71227 bytes in 0.164 second response time [23:55:26] thanks [23:55:52] ah the extension doesn't send the header for noc.? [23:57:05] MatmaRex: you're welcome :) [23:57:39] !log dereckson@tin Synchronized docroot/noc/conf/: Update noc.wikimedia.org dblist and config files (duration: 00m 45s) [23:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:59:01] * Dereckson purges noc.