[00:06:34] 10Operations, 10DBA, 10MediaWiki-extensions-ClickTracking: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982#4086825 (10Krinkle) 05Resolved>03Open It seems the `click_tracking_events` table w... [00:06:54] 10Operations, 10DBA, 10MediaWiki-extensions-ClickTracking: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982#4086828 (10Krinkle) [00:25:22] (03PS2) 10Dzahn: bastionhost: add MOTD warning of imminent bast1001 shutdown [puppet] - 10https://gerrit.wikimedia.org/r/422339 (https://phabricator.wikimedia.org/T186623) [00:25:38] (03CR) 10jerkins-bot: [V: 04-1] bastionhost: add MOTD warning of imminent bast1001 shutdown [puppet] - 10https://gerrit.wikimedia.org/r/422339 (https://phabricator.wikimedia.org/T186623) (owner: 10Dzahn) [00:31:24] (03PS3) 10Dzahn: bastionhost: add MOTD warning of imminent bast1001 shutdown [puppet] - 10https://gerrit.wikimedia.org/r/422339 (https://phabricator.wikimedia.org/T186623) [00:42:08] 10Operations, 10ops-ulsfo, 10Traffic: setup bast4002/WMF7218 - https://phabricator.wikimedia.org/T179050#4086871 (10Dzahn) Was just working on the Bastion related Wikitech pages due to Bast1001 being replaced and i noticed we have 2 bastions in ULSFO, 4001 and 4002. stalled? [00:45:29] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4086872 (10ayounsi) [01:11:47] (03CR) 10Dzahn: [C: 032] bastionhost: add MOTD warning of imminent bast1001 shutdown [puppet] - 10https://gerrit.wikimedia.org/r/422339 (https://phabricator.wikimedia.org/T186623) (owner: 10Dzahn) [02:28:57] !log l10nupdate@deploy1001 scap sync-l10n completed (1.31.0-wmf.26) (duration: 13m 33s) [02:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:32:46] (03PS2) 10Dzahn: site: enable mapped IPv6 on bromine/vega [puppet] - 10https://gerrit.wikimedia.org/r/420143 [02:33:16] (03CR) 10jerkins-bot: [V: 04-1] site: enable mapped IPv6 on bromine/vega [puppet] - 10https://gerrit.wikimedia.org/r/420143 (owner: 10Dzahn) [02:35:49] is it currently possible to add interface::add_ip6_mapped { 'main': } without getting dislikes from wmf-style? [02:36:15] it doesn't like node-level anymore but also not role (as before) [02:37:03] (03CR) 10Dzahn: "22:35 < mutante> is it currently possible to add interface::add_ip6_mapped { 'main': } without getting dislikes from wmf-style?" 
[puppet] - 10https://gerrit.wikimedia.org/r/420143 (owner: 10Dzahn) [03:16:41] (03PS1) 10KartikMistry: Update ssh public key for Kartik Mistry [puppet] - 10https://gerrit.wikimedia.org/r/422361 [03:25:53] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 843.70 seconds [03:50:15] 10Operations, 10Cloud-Services, 10hardware-requests, 10Labs-Sprint-101, and 2 others: Kill off virt1000 - https://phabricator.wikimedia.org/T102005#4086985 (10Krinkle) [03:50:27] (03PS1) 10Krinkle: Remove outdated references to virt1000 from db-eqiad.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422362 (https://phabricator.wikimedia.org/T102005) [03:50:51] (03CR) 10jerkins-bot: [V: 04-1] Remove outdated references to virt1000 from db-eqiad.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422362 (https://phabricator.wikimedia.org/T102005) (owner: 10Krinkle) [03:51:10] (03PS2) 10Krinkle: Remove outdated references to virt1000 from db-eqiad.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422362 (https://phabricator.wikimedia.org/T102005) [04:01:03] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 168.87 seconds [05:34:23] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 [05:35:14] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0 [05:44:59] (03CR) 10Muehlenhoff: [C: 032] Update symbols for 1.1.0h [debs/openssl11] - 10https://gerrit.wikimedia.org/r/422177 (owner: 10Muehlenhoff) [06:31:03] (03PS1) 10Jcrespo: mariadb backups: Rotate to latest as soon as they finished [puppet] - 10https://gerrit.wikimedia.org/r/422368 (https://phabricator.wikimedia.org/T189384) [06:31:16] (03PS2) 10Jcrespo: mariadb backups: Rotate to latest as soon as they finished [puppet] - 10https://gerrit.wikimedia.org/r/422368 (https://phabricator.wikimedia.org/T189384) [06:31:18] (03CR) 10jerkins-bot: [V: 04-1] mariadb backups: Rotate to latest as soon as they finished [puppet] - 10https://gerrit.wikimedia.org/r/422368 (https://phabricator.wikimedia.org/T189384) (owner: 10Jcrespo) [06:31:48] (03CR) 10jerkins-bot: [V: 04-1] mariadb backups: Rotate to latest as soon as they finished [puppet] - 10https://gerrit.wikimedia.org/r/422368 (https://phabricator.wikimedia.org/T189384) (owner: 10Jcrespo) [06:32:20] (03CR) 10Muehlenhoff: [C: 032] Update ssh public key for Kartik Mistry [puppet] - 10https://gerrit.wikimedia.org/r/422361 (owner: 10KartikMistry) [06:48:18] (03CR) 10Muehlenhoff: [C: 04-1] Update kafka java.security file with Java 8 u162 changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/421891 (https://phabricator.wikimedia.org/T190400) (owner: 10Ottomata) [06:51:29] !log installing remaining ICU security updates [06:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:02] (03PS3) 10Jcrespo: mariadb backups: Rotate to latest as soon as they finished [puppet] - 10https://gerrit.wikimedia.org/r/422368 (https://phabricator.wikimedia.org/T189384) [07:03:39] (03CR) 10Jcrespo: [C: 032] mariadb backups: Rotate to latest as soon as they finished [puppet] - 10https://gerrit.wikimedia.org/r/422368 (https://phabricator.wikimedia.org/T189384) (owner: 10Jcrespo) [07:08:51] (03PS1) 10Muehlenhoff: Update Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/422369 [07:10:39] (03PS2) 10Muehlenhoff: 
Update Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/422369 [07:17:43] (03CR) 10Muehlenhoff: [C: 032] Update Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/422369 (owner: 10Muehlenhoff) [07:27:10] (03PS1) 10Jcrespo: mariadb backups: Start backups earlier [puppet] - 10https://gerrit.wikimedia.org/r/422370 (https://phabricator.wikimedia.org/T189384) [07:29:17] (03CR) 10Jcrespo: [C: 032] mariadb backups: Start backups earlier [puppet] - 10https://gerrit.wikimedia.org/r/422370 (https://phabricator.wikimedia.org/T189384) (owner: 10Jcrespo) [07:49:33] !log uploaded openssl 1.0.2o to apt.wikimedia.org/jessie-wikimedia [07:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:05] (03CR) 10Elukey: Update kafka java.security file with Java 8 u162 changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/421891 (https://phabricator.wikimedia.org/T190400) (owner: 10Ottomata) [07:55:36] moritzm: thanks for --^ !! [07:57:57] sure :-) [08:07:11] I see the l10n upate ran overnight on deploy1001, now to see if it wrked properly or not, and tbh no idea how to check that [08:15:33] !log reboot labstore1001 for T189115 [08:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:40] (03PS1) 10Volans: Puppetboard: disable listing of static files [puppet] - 10https://gerrit.wikimedia.org/r/422371 [08:17:16] !log reboot labstore1002 for T189115 [08:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:23] (03CR) 10Elukey: [C: 031] Puppetboard: disable listing of static files [puppet] - 10https://gerrit.wikimedia.org/r/422371 (owner: 10Volans) [08:18:23] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [08:18:34] !log reboot labstore2001 for T189115 [08:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:39] hello ulsfo [08:19:10] (03CR) 10Filippo Giunchedi: "Thanks for taking care of this!" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [08:19:24] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [08:19:36] (03CR) 10Volans: [C: 032] Puppetboard: disable listing of static files [puppet] - 10https://gerrit.wikimedia.org/r/422371 (owner: 10Volans) [08:19:57] so it seems a single spike that now is gone, ints from codfw caches [08:20:34] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend&from=now-3h&to=now [08:20:38] ema --^ [08:21:52] (03PS1) 10Volans: wmf-auto-reimage: fix retcodes in sequential mode [puppet] - 10https://gerrit.wikimedia.org/r/422372 [08:22:49] hey, looking [08:24:49] <3 [08:24:56] seems gone now, was only a fyi [08:25:30] !log add more weight to ms-be204[0-3] - T189633 [08:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:35] T189633: rack/setup/install ms-be204[0-3] - https://phabricator.wikimedia.org/T189633 [08:25:49] !log reboot labstore200[2,3,4] for T189115 [08:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:23] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [08:26:24] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [08:31:13] (03CR) 10Volans: [C: 032] wmf-auto-reimage: fix retcodes in sequential mode [puppet] - 10https://gerrit.wikimedia.org/r/422372 (owner: 10Volans) [08:42:23] (03PS1) 10Giuseppe Lavagetto: conftool: strawman for a db-server object schema for mwconfig [puppet] - 10https://gerrit.wikimedia.org/r/422373 [08:42:42] (03PS1) 10Giuseppe Lavagetto: Manage slave databases load/presence via etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422374 [08:43:52] (03CR) 10jerkins-bot: [V: 04-1] Manage slave databases load/presence via etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422374 (owner: 10Giuseppe Lavagetto) [08:47:33] PROBLEM - Host scb2005 is DOWN: PING CRITICAL - Packet loss = 100% [08:49:13] RECOVERY - Host scb2005 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms [09:08:13] PROBLEM - Disk space on elastic1019 is CRITICAL: DISK CRITICAL - free space: /srv 61574 MB (12% inode=99%) [09:15:14] RECOVERY - Disk space on elastic1019 is OK: DISK OK [09:22:32] (03PS3) 10Filippo Giunchedi: nagios_common: switch to check_prometheus_metric Python implementation [puppet] - 10https://gerrit.wikimedia.org/r/413142 (https://phabricator.wikimedia.org/T181410) [09:22:47] (03PS4) 10Ema: WIP: VCL: improve handling of uncacheable responses [puppet] - 10https://gerrit.wikimedia.org/r/421542 (https://phabricator.wikimedia.org/T180712) [09:24:33] PROBLEM - Host labstore2003 is DOWN: PING CRITICAL - Packet loss = 100% [09:25:25] !log disable puppet on icinga servers before merging 
https://gerrit.wikimedia.org/r/c/413142/ [09:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:13] PROBLEM - Host labstore2004 is DOWN: PING CRITICAL - Packet loss = 100% [09:27:58] (03PS1) 10ArielGlenn: Revert "switch deployment server from tin to deploy1001" [puppet] - 10https://gerrit.wikimedia.org/r/422376 [09:28:06] (03PS2) 10ArielGlenn: Revert "switch deployment server from tin to deploy1001" [puppet] - 10https://gerrit.wikimedia.org/r/422376 [09:28:15] (03PS1) 1020after4: Revert "switch deployment server from tin to deploy1001" [puppet] - 10https://gerrit.wikimedia.org/r/422377 (https://phabricator.wikimedia.org/T190909) [09:28:30] actually no I'm looking into why tegmen has a completely different cpu profile than einsteinium [09:28:31] (03Abandoned) 1020after4: Revert "switch deployment server from tin to deploy1001" [puppet] - 10https://gerrit.wikimedia.org/r/422377 (https://phabricator.wikimedia.org/T190909) (owner: 1020after4) [09:28:55] oops sorry [09:28:56] (03CR) 1020after4: [C: 031] "see T190909" [puppet] - 10https://gerrit.wikimedia.org/r/422376 (owner: 10ArielGlenn) [09:29:02] well i'll merge mine [09:29:12] no prob [09:29:31] where is jenkins [09:30:33] now it decides to be slow? [09:30:53] 10Operations, 10Packaging, 10Scap, 10Patch-For-Review: Install git-lfs client (at least on scap targets & masters) - https://phabricator.wikimedia.org/T180628#4087305 (10mmodell) [09:30:56] hah [09:30:56] 10Operations: replace tin (new hardware) - https://phabricator.wikimedia.org/T185275#4087301 (10mmodell) 05Resolved>03Open see {T190909} and this patch: [[ https://gerrit.wikimedia.org/r/#/c/422376/ | Revert "switch deployment server from tin to deploy1001" ]] [09:31:46] (03CR) 10ArielGlenn: [C: 032] Revert "switch deployment server from tin to deploy1001" [puppet] - 10https://gerrit.wikimedia.org/r/422376 (owner: 10ArielGlenn) [09:32:33] sweet! icinga restarted by puppet on each puppet run due to file ownership change )o) [09:32:48] lol [09:33:20] twentyafterfour: where does puppet have to run for that change to take effect? [09:33:23] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [09:34:16] looking (reindex still in progress fwiw) ^ [09:34:33] apergos: deploy1001 and tin [09:34:39] you're not asleep [09:34:42] thank you however [09:34:43] Of course not [09:35:12] Technically everywhere, eventually, but those two kinda matter a bit more :) [09:35:49] there's a big ole motd on tin too [09:36:29] I'm running on those two now, we'll wait the 40 minutes or whatever for it to go around everywhere else, mail should be sent with an update [09:36:43] and someone oughta test on tin after that 40 minutes [09:38:18] ah that took care of the motd, nice [09:38:20] I'll reply to the wikitech thread to note that it's switched back to tin [09:38:30] would you revert the DNS change too? [09:38:40] was the dns change merged? 
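The deployment service name being discussed here is a DNS alias, so once the CNAME revert lands, whether it really points back at tin can be sanity-checked from any production host. A minimal sketch of such a check (a hypothetical helper written for illustration, not an existing WMF script; the hostnames are the ones mentioned in the discussion and only resolve inside the production network):

```python
# Hypothetical convenience check: compare what the deployment.eqiad.wmnet
# service alias resolves to against tin's own addresses. Only meaningful
# from a host that can resolve internal .wmnet names.
import socket

def resolve(host):
    return sorted({ai[4][0] for ai in socket.getaddrinfo(host, None)})

alias = resolve('deployment.eqiad.wmnet')
tin = resolve('tin.eqiad.wmnet')
print('deployment ->', alias)
print('tin        ->', tin)
print('alias points at tin:', alias == tin)
```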
[09:38:59] uh, chad beat me to the email [09:39:04] ok good for that [09:39:26] apergos: I think the dns change was merged, according to the email thread anyway [09:39:38] ah [09:39:39] I didn't think of DNS [09:39:39] "We also just switched the DNS service name for deployment.eqiad/codfw (thanks Andrew Bogott!)" [09:42:20] (03PS1) 10ArielGlenn: Revert "Change cname for deployment.eqiad.wmnet and deployment.codfw.wmnet" [dns] - 10https://gerrit.wikimedia.org/r/422380 [09:43:59] (03CR) 10ArielGlenn: [C: 032] Revert "Change cname for deployment.eqiad.wmnet and deployment.codfw.wmnet" [dns] - 10https://gerrit.wikimedia.org/r/422380 (owner: 10ArielGlenn) [09:46:43] RECOVERY - Host labstore2003 is UP: PING OK - Packet loss = 0%, RTA = 36.08 ms [09:47:50] deployed and live [09:48:00] anything else we may have overlooked? [09:48:37] except actually testing tin in a while [09:49:10] !jouncebot: next [09:49:28] 10Operations, 10Icinga, 10monitoring: icinga restarted on each puppet run on standby server - https://phabricator.wikimedia.org/T190912#4087341 (10fgiunchedi) [09:49:45] jouncebot: next [09:49:45] In 3 hour(s) and 10 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180328T1300) [09:50:07] apergos: nope I think that's it [09:50:33] RECOVERY - Host labstore2004 is UP: PING OK - Packet loss = 0%, RTA = 37.02 ms [09:52:04] care to try a test in 15 mins or so? [09:53:15] apergos: ok [09:53:23] thanks [09:53:42] (03CR) 10Filippo Giunchedi: [C: 032] nagios_common: switch to check_prometheus_metric Python implementation [puppet] - 10https://gerrit.wikimedia.org/r/413142 (https://phabricator.wikimedia.org/T181410) (owner: 10Filippo Giunchedi) [09:53:48] (03PS4) 10Filippo Giunchedi: nagios_common: switch to check_prometheus_metric Python implementation [puppet] - 10https://gerrit.wikimedia.org/r/413142 (https://phabricator.wikimedia.org/T181410) [09:55:20] dammit [09:55:33] l10update will run over there in two minutes. 
before puppet's run everywhere [09:56:00] oh well it will just break [09:56:27] s/two/four/ but you get the idea [10:04:24] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [10:05:58] !log upgrade and restart db2093 [10:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:24] PROBLEM - Host labstore2004 is DOWN: PING CRITICAL - Packet loss = 100% [10:11:44] PROBLEM - Host labstore2003 is DOWN: PING CRITICAL - Packet loss = 100% [10:13:33] akosiaris: since I see you are logged in on deploy1001, no deploys from there, it's back to tin [10:16:14] twentyafterfour: testing time, if you would do the honors [10:16:27] (03PS13) 10Muehlenhoff: Allow to selectively run time servers on Chrony [puppet] - 10https://gerrit.wikimedia.org/r/393581 (https://phabricator.wikimedia.org/T177742) [10:16:33] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [10:18:03] RECOVERY - Host labstore2003 is UP: PING OK - Packet loss = 0%, RTA = 36.24 ms [10:20:43] RECOVERY - Host labstore2004 is UP: PING OK - Packet loss = 0%, RTA = 36.10 ms [10:21:07] (03CR) 10Muehlenhoff: "https://puppet-compiler.wmflabs.org/compiler03/10706/" [puppet] - 10https://gerrit.wikimedia.org/r/393581 (https://phabricator.wikimedia.org/T177742) (owner: 10Muehlenhoff) [10:23:33] (03CR) 10Filippo Giunchedi: [C: 031] Allow to selectively run time servers on Chrony [puppet] - 10https://gerrit.wikimedia.org/r/393581 (https://phabricator.wikimedia.org/T177742) (owner: 10Muehlenhoff) [10:26:22] (03PS1) 10Vgutierrez: mtail: Add varnish_resourceloader_resp in varnishrls [puppet] - 10https://gerrit.wikimedia.org/r/422381 (https://phabricator.wikimedia.org/T184942) [10:26:34] 10Operations: replace tin (new hardware) - https://phabricator.wikimedia.org/T185275#3911588 (10ArielGlenn) A summary of things as I understand them: - deploy1001 to php7 is needed for git-lfs, which is needed for ORES. - icu collation order with libicu57 (default with php7) is different than with libicu52 (w... 
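The libicu point in the task summary above is worth unpacking: collation sort keys are produced by whatever libicu the runtime links against, so moving to php7 with libicu57 can silently change orderings that were generated under libicu52. A small sketch of the idea, assuming the PyICU bindings are installed (the sample words are arbitrary):

```python
# Sketch (assumes PyICU is installed): sort keys come from the system libicu,
# so an ICU upgrade (e.g. libicu52 -> libicu57 as part of the php7 move) can
# change the keys, and data sorted or stored with the old keys needs rebuilding.
import icu

collator = icu.Collator.createInstance(icu.Locale('de'))
print('ICU version:', icu.ICU_VERSION)
for word in ['Arm', 'Ärger', 'Zebra']:
    print(word, collator.getSortKey(word).hex())
```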
[10:27:18] !log reload icinga on einsteinium after https://gerrit.wikimedia.org/r/c/413142 [10:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:05] (03PS3) 10Vgutierrez: mtail: Provide ttfb histogram for varnishbackend [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) [10:33:22] (03PS14) 10Muehlenhoff: Allow to selectively run time servers on Chrony [puppet] - 10https://gerrit.wikimedia.org/r/393581 (https://phabricator.wikimedia.org/T177742) [10:34:05] (03CR) 10Muehlenhoff: [C: 032] Allow to selectively run time servers on Chrony [puppet] - 10https://gerrit.wikimedia.org/r/393581 (https://phabricator.wikimedia.org/T177742) (owner: 10Muehlenhoff) [10:34:52] (03CR) 10Vgutierrez: mtail: Provide ttfb histogram for varnishbackend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [10:37:54] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4087564 (10Vgutierrez) >>! In T184942#4083351, @Krinkle wrote: > The following counters are currently reported to StatsD from `ReqURL ^/w/load.php` ([va... [10:45:38] 10Operations, 10Icinga, 10monitoring: icinga restarted on each puppet run on standby server - https://phabricator.wikimedia.org/T190912#4087584 (10fgiunchedi) The effect is also clear on host dashboards {F16328381} [10:58:50] 10Operations, 10Puppet: Puppet: enable reports to puppetdb - https://phabricator.wikimedia.org/T190918#4087592 (10Volans) p:05Triage>03Normal [10:59:43] !log performing a few minutes live test of reporting Puppet reports to puppetdb too on puppetmaster1001 - T190918 [10:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:49] T190918: Puppet: enable reports to puppetdb - https://phabricator.wikimedia.org/T190918 [11:02:01] (03PS3) 10Mark Bergsma: Fix testRepool test case for previously-down-but-pooled [debs/pybal] - 10https://gerrit.wikimedia.org/r/421051 [11:02:03] (03PS3) 10Mark Bergsma: Fix StubLVSService to use a set instead of a dict for .servers [debs/pybal] - 10https://gerrit.wikimedia.org/r/421052 [11:02:05] (03PS4) 10Mark Bergsma: Introduce server.is_pooled and make server.pooled usage more consistent [debs/pybal] - 10https://gerrit.wikimedia.org/r/421053 [11:03:12] (03CR) 10Mark Bergsma: [C: 032] Fix testRepool test case for previously-down-but-pooled [debs/pybal] - 10https://gerrit.wikimedia.org/r/421051 (owner: 10Mark Bergsma) [11:03:33] (03CR) 10Mark Bergsma: [C: 032] Fix StubLVSService to use a set instead of a dict for .servers [debs/pybal] - 10https://gerrit.wikimedia.org/r/421052 (owner: 10Mark Bergsma) [11:03:42] (03Merged) 10jenkins-bot: Fix testRepool test case for previously-down-but-pooled [debs/pybal] - 10https://gerrit.wikimedia.org/r/421051 (owner: 10Mark Bergsma) [11:04:03] (03Merged) 10jenkins-bot: Fix StubLVSService to use a set instead of a dict for .servers [debs/pybal] - 10https://gerrit.wikimedia.org/r/421052 (owner: 10Mark Bergsma) [11:04:39] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [11:06:06] (03PS1) 10Filippo Giunchedi: icinga: preserve ownership when purging resources [puppet] - 10https://gerrit.wikimedia.org/r/422384 
(https://phabricator.wikimedia.org/T190912) [11:06:48] volans: ^ [11:18:15] (03CR) 10Vgutierrez: mtail: Provide ttfb histogram for varnishbackend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [11:21:24] (03PS13) 10Rduran: Add port of osc_host.sh [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/419725 [11:23:33] (03PS1) 10Jcrespo: haproxy: Remove older templates (haproxy<1.7) [puppet] - 10https://gerrit.wikimedia.org/r/422386 (https://phabricator.wikimedia.org/T183249) [11:24:35] (03PS1) 10Muehlenhoff: Switch time server on dns5001 to Chrony [puppet] - 10https://gerrit.wikimedia.org/r/422387 (https://phabricator.wikimedia.org/T177742) [11:31:49] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [11:34:09] (03PS6) 10Arturo Borrero Gonzalez: [WIP] wmcs: introduce hiera support for multiple labmon servers [puppet] - 10https://gerrit.wikimedia.org/r/422126 (https://phabricator.wikimedia.org/T189871) [11:34:43] (03CR) 10jerkins-bot: [V: 04-1] [WIP] wmcs: introduce hiera support for multiple labmon servers [puppet] - 10https://gerrit.wikimedia.org/r/422126 (https://phabricator.wikimedia.org/T189871) (owner: 10Arturo Borrero Gonzalez) [11:35:14] (03CR) 10Vgutierrez: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/422387 (https://phabricator.wikimedia.org/T177742) (owner: 10Muehlenhoff) [11:36:54] (03PS1) 10Arturo Borrero Gonzalez: wmcs: monitoring: rsync whisper files between mon servers [puppet] - 10https://gerrit.wikimedia.org/r/422389 (https://phabricator.wikimedia.org/T190512) [11:37:34] (03CR) 10jerkins-bot: [V: 04-1] wmcs: monitoring: rsync whisper files between mon servers [puppet] - 10https://gerrit.wikimedia.org/r/422389 (https://phabricator.wikimedia.org/T190512) (owner: 10Arturo Borrero Gonzalez) [11:39:21] (03PS7) 10Arturo Borrero Gonzalez: [WIP] wmcs: introduce hiera support for multiple labmon servers [puppet] - 10https://gerrit.wikimedia.org/r/422126 (https://phabricator.wikimedia.org/T189871) [11:41:23] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for cron [puppet] - 10https://gerrit.wikimedia.org/r/422391 (https://phabricator.wikimedia.org/T135991) [11:46:22] (03PS2) 10Arturo Borrero Gonzalez: wmcs: monitoring: rsync whisper files between mon servers [puppet] - 10https://gerrit.wikimedia.org/r/422389 (https://phabricator.wikimedia.org/T190512) [11:46:42] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/422384 (https://phabricator.wikimedia.org/T190912) (owner: 10Filippo Giunchedi) [11:47:00] (03CR) 10jerkins-bot: [V: 04-1] wmcs: monitoring: rsync whisper files between mon servers [puppet] - 10https://gerrit.wikimedia.org/r/422389 (https://phabricator.wikimedia.org/T190512) (owner: 10Arturo Borrero Gonzalez) [11:48:05] (03CR) 10Muehlenhoff: "Looks fine to me in general. Is this intended to be uploaded to apt.wikimedia.org or are you aiming to upload this to Debian? 
If it's the " [debs/dynomite] - 10https://gerrit.wikimedia.org/r/421447 (owner: 10Aaron Schulz) [11:48:08] !log twentyafterfour@tin Synchronized README: test deploy from tin.eqiad.wmnet (duration: 03m 35s) [11:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:25] (03PS3) 10Arturo Borrero Gonzalez: wmcs: monitoring: rsync whisper files between mon servers [puppet] - 10https://gerrit.wikimedia.org/r/422389 (https://phabricator.wikimedia.org/T190512) [11:49:02] !log twentyafterfour@tin Started scap: test running full scap sync from tin [11:49:03] (03CR) 10jerkins-bot: [V: 04-1] wmcs: monitoring: rsync whisper files between mon servers [puppet] - 10https://gerrit.wikimedia.org/r/422389 (https://phabricator.wikimedia.org/T190512) (owner: 10Arturo Borrero Gonzalez) [11:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:56] (03PS4) 10Arturo Borrero Gonzalez: wmcs: monitoring: rsync whisper files between mon servers [puppet] - 10https://gerrit.wikimedia.org/r/422389 (https://phabricator.wikimedia.org/T190512) [11:52:36] (03CR) 10jerkins-bot: [V: 04-1] wmcs: monitoring: rsync whisper files between mon servers [puppet] - 10https://gerrit.wikimedia.org/r/422389 (https://phabricator.wikimedia.org/T190512) (owner: 10Arturo Borrero Gonzalez) [11:54:39] (03PS5) 10Arturo Borrero Gonzalez: wmcs: monitoring: rsync whisper files between mon servers [puppet] - 10https://gerrit.wikimedia.org/r/422389 (https://phabricator.wikimedia.org/T190512) [12:06:58] (03PS8) 10Arturo Borrero Gonzalez: [WIP] wmcs: introduce hiera support for multiple labmon servers [puppet] - 10https://gerrit.wikimedia.org/r/422126 (https://phabricator.wikimedia.org/T189871) [12:07:00] (03PS6) 10Arturo Borrero Gonzalez: wmcs: monitoring: rsync whisper files between mon servers [puppet] - 10https://gerrit.wikimedia.org/r/422389 (https://phabricator.wikimedia.org/T190512) [12:07:19] (03PS9) 10Arturo Borrero Gonzalez: wmcs: introduce hiera support for multiple labmon servers [puppet] - 10https://gerrit.wikimedia.org/r/422126 (https://phabricator.wikimedia.org/T189871) [12:07:26] (03CR) 10jerkins-bot: [V: 04-1] wmcs: monitoring: rsync whisper files between mon servers [puppet] - 10https://gerrit.wikimedia.org/r/422389 (https://phabricator.wikimedia.org/T190512) (owner: 10Arturo Borrero Gonzalez) [12:07:59] (03CR) 10jerkins-bot: [V: 04-1] wmcs: introduce hiera support for multiple labmon servers [puppet] - 10https://gerrit.wikimedia.org/r/422126 (https://phabricator.wikimedia.org/T189871) (owner: 10Arturo Borrero Gonzalez) [12:09:22] (03PS7) 10Arturo Borrero Gonzalez: wmcs: monitoring: rsync whisper files between mon servers [puppet] - 10https://gerrit.wikimedia.org/r/422389 (https://phabricator.wikimedia.org/T190512) [12:16:19] PROBLEM - HHVM jobrunner on mw1293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:16:39] PROBLEM - HHVM jobrunner on mw1294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:17:00] PROBLEM - HHVM jobrunner on mw1307 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:18:09] PROBLEM - HHVM jobrunner on mw1296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:18:49] PROBLEM - HHVM jobrunner on mw1338 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:18:49] PROBLEM - HHVM jobrunner on mw1295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:18:59] PROBLEM - HHVM jobrunner on mw1318 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:24:56] whaaattt [12:26:34] so all videoscalers: 
https://grafana.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=videoscaler&var-instance=All&from=now-3h&to=now [12:26:39] load is skyrocketing [12:26:57] so hhvm is up but the health checks are not getting through [12:27:16] I'm on one now and there's a pile of stuff apparently running (mw1293) [12:27:19] PROBLEM - High CPU load on API appserver on mw1282 is CRITICAL: CRITICAL - load average: 61.77, 25.67, 16.13 [12:27:46] yeah I think that somebody triggered a massive re-encode or something similar [12:28:19] RECOVERY - High CPU load on API appserver on mw1282 is OK: OK - load average: 37.50, 25.94, 16.87 [12:28:49] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 48.14, 26.98, 18.56 [12:29:33] 10Operations, 10ops-ulsfo, 10Traffic: setup bast4002/WMF7218 - https://phabricator.wikimedia.org/T179050#4087858 (10BBlack) Was stalled on my lack of time dealing with the prometheus switchover and then switching peoples' SSH configs, otherwise it's ready for switchover. [12:29:49] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 25.76, 24.35, 18.18 [12:29:52] (03PS8) 10Rduran: Create tests skeleton [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/420746 [12:29:54] (03PS8) 10Rduran: Refactor and test the main OSC run method [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/421340 [12:31:21] so https://grafana.wikimedia.org/dashboard/db/jobqueue-eventbus?orgId=1&var-site=eqiad&var-type=webVideoTranscode&var-type=webVideoTranscodePrioritized [12:31:24] apergos: --^ [12:31:30] enqueue rate spiked a lot [12:31:35] https://commons.wikimedia.org/wiki/Special:NewFiles?user=&mediatype%5B%5D=UNKNOWN&mediatype%5B%5D=AUDIO&mediatype%5B%5D=VIDEO&mediatype%5B%5D=MULTIMEDIA&mediatype%5B%5D=ARCHIVE&start=&end=&wpFormIdentifier=specialnewimages&limit=50&offset= [12:31:43] check that out [12:32:41] English: Please subscribe to my channel and my vlog channel! I make new videos here every Wednesday and make vlogs during my majestical daily life. from the description of one of the files [12:32:53] this seems like spam/self adv/needs block [12:32:55] lol [12:32:56] ye [12:33:01] advertising [12:33:03] so iiuc now is change prop that grabs jobs from kafka and then sends to the video scalers [12:33:07] let's see who's in commons channel [12:33:30] so there is also (possibly) and issue with sending too many jobs to the videoscalers fleet [12:33:40] mobrovac,Pchelolo hello :) [12:35:07] !log twentyafterfour@tin Finished scap: test running full scap sync from tin (duration: 46m 05s) [12:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:23] well, I wouldnt say thats a bad neccesairly, better have problems now than later on [12:37:12] Wiki13: ? [12:37:43] yeah these are the source off the transcodes most likely [12:37:55] each one of these has a pile of transcodes per file [12:38:08] so checking metrics for each host (like https://grafana.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats?orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=videoscaler&var-instance=mw1293) it seems that something is not ok in our config, namely too many HHVM threads used [12:38:36] this prevents health checks of course, but things are processing.. 
not sure how changeprop reacts when the cluster is overloaded [12:39:57] (03CR) 10Jcrespo: [C: 032] "https://puppet-compiler.wmflabs.org/compiler03/10708/" [puppet] - 10https://gerrit.wikimedia.org/r/422386 (https://phabricator.wikimedia.org/T183249) (owner: 10Jcrespo) [12:39:59] https://commons.wikimedia.org/w/index.php?title=Special:Contributions/MennasDosbin&offset=&limit=100&target=MennasDosbin [12:40:10] looks like we will get no joy from commons admins [12:40:40] I can tag them as speedy if you guys wantr [12:40:51] maybe then it gets picked up [12:41:00] at this point the jobs are already queued [12:41:20] okay, so that wouldnt make any difference then [12:41:36] no, it's more about stopping more of it [12:42:54] I've deployed a change to profile/mariadb/proxy/master, I should be the only one using that [12:43:12] I heard you need Commons admin [12:43:13] sup [12:43:46] hey revi [12:43:54] I pinged him ^^ [12:44:00] we just have a spike of video transcode jobs [12:44:05] turns out they are likely all from this: [12:44:11] https://commons.wikimedia.org/w/index.php?title=Special:Contributions/MennasDosbin&offset=&limit=100&target=MennasDosbin [12:44:11] summoned during setting up new router lol [12:44:12] hmm [12:44:16] oh! [12:44:52] anyways the job queued now are queued but it would be nice to head off any more of that (have a look at a few of the descriptions) [12:45:02] uh-uh [12:45:03] yeah [12:45:07] basically I'm just lobbing it over the wall to you fols [12:45:08] not that 'in scope' [12:45:10] (03PS1) 10BBlack: eqsin: turn-up HK + PH + JP [dns] - 10https://gerrit.wikimedia.org/r/422394 (https://phabricator.wikimedia.org/T189252) [12:45:16] we gotta clean up on our side [12:45:30] (03PS1) 10BBlack: eqsin: turn-up India [dns] - 10https://gerrit.wikimedia.org/r/422395 (https://phabricator.wikimedia.org/T189252) [12:45:33] (03PS1) 10BBlack: eqsin: turn-up BD, LK, NP, PK [dns] - 10https://gerrit.wikimedia.org/r/422396 (https://phabricator.wikimedia.org/T189252) [12:46:13] so what exactly do you need from me? [12:46:29] (just to be clear - I'm still setting up my new internet so I may be out of connect for awhile) [12:46:30] (03CR) 10Filippo Giunchedi: [C: 032] icinga: preserve ownership when purging resources [puppet] - 10https://gerrit.wikimedia.org/r/422384 (https://phabricator.wikimedia.org/T190912) (owner: 10Filippo Giunchedi) [12:46:33] right [12:46:45] (03PS2) 10Filippo Giunchedi: icinga: preserve ownership when purging resources [puppet] - 10https://gerrit.wikimedia.org/r/422384 (https://phabricator.wikimedia.org/T190912) [12:46:50] so we set, in videoscalers.yaml, thread_count: 15 [12:47:11] that is exactly how busy hhvm is right now on each scaler, so possibly it is a misconfig from our side [12:47:29] revi: yeah I hear you [12:47:56] it would be nice not to get another flood of those, whether that means communication with the user or whatever else [12:48:09] RECOVERY - HHVM jobrunner on mw1338 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [12:48:09] RECOVERY - HHVM jobrunner on mw1294 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [12:48:19] RECOVERY - HHVM jobrunner on mw1307 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [12:48:19] RECOVERY - HHVM jobrunner on mw1318 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.003 second response time [12:48:28] elukey: so how does 15 wind up nailing us against the wall? 
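The arithmetic behind that question is spelled out just below: jobrunner is allowed to launch more concurrent transcode runners per host than HHVM has worker threads configured, so a deep enough queue fills every thread and the Icinga health check cannot get a slot. A back-of-the-envelope sketch using the numbers quoted in this incident (15 HHVM threads; 17 + 12 configured runners on mw1293):

```python
# Back-of-the-envelope capacity check with the numbers from this incident:
# 15 HHVM worker threads per videoscaler vs. 17 webVideoTranscode and
# 12 webVideoTranscodePrioritized jobrunner slots.
hhvm_threads = 15
runners = {'webVideoTranscode': 17, 'webVideoTranscodePrioritized': 12}

max_concurrent_jobs = sum(runners.values())
print(f'runner slots: {max_concurrent_jobs}, hhvm threads: {hhvm_threads}')

# Under a sustained burst every HHVM thread is busy encoding, leaving no free
# thread to answer the Icinga HTTP probe -- hence the "HHVM jobrunner ...
# Socket timeout" alerts even though work keeps progressing.
threads_left_for_checks = max(0, hhvm_threads - max_concurrent_jobs)
print('threads left for health checks under full load:', threads_left_for_checks)
```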
[12:48:29] RECOVERY - HHVM jobrunner on mw1293 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [12:48:40] (03CR) 10Alexandros Kosiaris: [C: 032] ci: Add kubernetes deployment classes to CI [puppet] - 10https://gerrit.wikimedia.org/r/422100 (https://phabricator.wikimedia.org/T184924) (owner: 10Alexandros Kosiaris) [12:48:47] (03PS3) 10Alexandros Kosiaris: ci: Add kubernetes deployment classes to CI [puppet] - 10https://gerrit.wikimedia.org/r/422100 (https://phabricator.wikimedia.org/T184924) [12:49:09] RECOVERY - HHVM jobrunner on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [12:49:26] apergos: OK, but I think comment from (WMF) would be more trustworthy (or authoritative) than random volunteer admin commenting :P [12:49:34] apergos: so 15 is the number of available HHVM theads that we configure, but we also do some calculations to establish the number of jobrunner "runners" [12:49:46] revi: I dropped a note about it in the admin channel [12:50:05] oh [12:50:06] saw it now [12:50:19] RECOVERY - HHVM jobrunner on mw1296 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [12:50:21] elukey: ok, I didn't see any ridiculous number of them though when looking [12:50:25] in our case, for example, on mw1293 (/etc/jobrunner/jobrunner.conf) we have: 17 runners for transcode, 12 for transcode_prioritized [12:50:40] hmm, deleting it do stop transcoding? [12:50:41] uh huh [12:50:53] apergos: https://grafana.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats?panelId=17&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=videoscaler&var-instance=mw1338&from=now-3h&to=now [12:50:56] revi: nope, just deterrence [12:51:02] oh. [12:51:22] it seems that we are getting through them [12:51:27] since queueing is decreasing [12:51:36] we maxed out there? ohdear [12:51:54] yeah, no more hhvm threads == no health checks passing == alarms [12:51:56] wasn’t transcoding queue ‘fixed’ a while ago? [12:51:59] https://commons.wikimedia.org/wiki/File:%22Body_Massage%22_-_Jenna_Marbles.webm that was the latest one in the list, only two left to complete [12:52:29] * apergos spotchecks a coupe others [12:52:31] zhuyifei1999_: it seems a problem of having videoscalers running too many processes at once, that's it [12:52:33] I think proper course for these images are 7-days Deletion Requests since it seem to be scope stuff [12:52:40] k [12:52:54] zhuyifei1999_: I'm not sure if I should just go raid with delete button lol [12:53:05] one one, another has a few left [12:53:06] meh [12:53:09] *one done [12:53:25] meanwhile new router, 3x speed yay [12:53:37] nice [12:53:39] revi: ask jcb to do that, nobody will complain ;) [12:53:55] zhuyifei1999_: I don't want to put myself into Commons Drama Season (x) [12:54:19] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/srv/deployment-charts] [12:55:05] apergos: nproc on those scalers is 40 or 48, I think that 15 hhvm threads is a bit low :) [12:55:18] but I don't remember if that value was intended or not [12:55:32] 44 files in 10 minutes, all needing 16 trnscodes each, each one of those lasting anywhere from 4 to 30 minutes depending on the size [12:55:37] that would do it [12:55:58] *6 not 16 [12:55:59] I recall uploading 20~around video last December [12:56:06] yeah but in theory we shouldn't allow this amount of stress on all the scalers, they took too much work at once [12:56:41] and I was kinda wondering wtf is wrong with the speed meh [12:56:48] contint1001 puppet issues is me, fixing [12:56:54] ok [12:57:44] if mw were smart it would queue the transcodes for a file serially: do each size one after another instead of all in parallel [12:57:58] so that other jobs can run if need be [12:58:10] then you'll say, suppose the other threads are idle [12:58:14] revi, zhuyifei1999_: https://commons.wikimedia.org/wiki/Commons:Deletion_requests/Files_uploaded_by_MennasDosbin I just nominated all of them [12:58:17] great [12:58:24] you just saved my time [12:58:28] thanks [12:58:32] Wiki13: thanks! [12:58:40] thanks for wrangling that [12:58:58] I'll remember to kill them by next week [12:59:08] with priority [12:59:11] ^^ [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European Mid-day SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180328T1300). [13:00:04] Amir1, RoanKattouw, and tgr: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] o/ [13:00:12] I'll SWAT today [13:00:19] cool [13:00:29] twentyafterfour did a full test scap which apparently ran ok [13:00:38] so you should be good to go, I'm here just in case [13:01:15] elukey: how were we on cpu on those boxes? [13:01:19] (03PS3) 10Catrope: Enable Translate extension in amwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422181 (https://phabricator.wikimedia.org/T180879) (owner: 10Ladsgroup) [13:01:37] (03CR) 10Catrope: [C: 032] Enable Translate extension in amwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422181 (https://phabricator.wikimedia.org/T180879) (owner: 10Ladsgroup) [13:01:48] so I'm out, have a nice day (and goodnight!) [13:01:56] thanks again [13:02:07] Wiki13 did the messy thing :) [13:02:25] apergos: https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&var-server=mw1293&var-datasource=eqiad%20prometheus%2Fops [13:02:35] hehe :P [13:02:36] RoanKattouw: Don't forget to run the creating database main. 
script [13:02:46] (03CR) 10Filippo Giunchedi: mtail: Provide ttfb histogram for varnishbackend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [13:02:48] (03Merged) 10jenkins-bot: Enable Translate extension in amwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422181 (https://phabricator.wikimedia.org/T180879) (owner: 10Ladsgroup) [13:02:52] (03CR) 10jenkins-bot: Enable Translate extension in amwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422181 (https://phabricator.wikimedia.org/T180879) (owner: 10Ladsgroup) [13:02:56] that initial spike isn't too cheery, can they handle much more in the way of work? [13:03:01] https://phabricator.wikimedia.org/T180879#3916960 [13:03:03] yup [13:03:04] elukey: [13:03:12] I forgot and brought down the whole wiki last time [13:03:28] I am so not ready for bringing down the wikis today [13:03:35] let's try to avoid that, shall we [13:03:51] we shall :D [13:04:00] 👍 [13:04:33] Amir1: On mwdebug1002, please test [13:04:40] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/deployment-charts] [13:05:03] apergos: we might possibly need to tune down the number of runners via profile::mediawiki::jobrunner [13:05:13] hmm [13:05:14] RoanKattouw: https://am.wikimedia.org/wiki/%D5%8D%D5%BA%D5%A1%D5%BD%D5%A1%D6%80%D5%AF%D5%B8%D5%B2:%D5%8F%D5%A1%D6%80%D5%A2%D5%A5%D6%80%D5%A1%D5%AF says Translate is there [13:05:19] let's move forward [13:06:02] OK [13:07:15] (03CR) 10Catrope: [C: 032] Enable Flow on euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421606 (https://phabricator.wikimedia.org/T190500) (owner: 10Urbanecm) [13:07:47] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable Translate extension on amwikimedia (T180879) (duration: 01m 22s) [13:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:53] T180879: Install translate extension in amwikimedia - https://phabricator.wikimedia.org/T180879 [13:08:29] (03Merged) 10jenkins-bot: Enable Flow on euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421606 (https://phabricator.wikimedia.org/T190500) (owner: 10Urbanecm) [13:08:57] (03PS1) 10Alexandros Kosiaris: Add profile::kubernetes::deployment_server::git_* [puppet] - 10https://gerrit.wikimedia.org/r/422399 (https://phabricator.wikimedia.org/T184924) [13:09:40] PROBLEM - puppet last run on pc1006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:12:22] (03CR) 10jenkins-bot: Enable Flow on euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421606 (https://phabricator.wikimedia.org/T190500) (owner: 10Urbanecm) [13:13:00] (03PS2) 10Alexandros Kosiaris: Add profile::kubernetes::deployment_server::git_* [puppet] - 10https://gerrit.wikimedia.org/r/422399 (https://phabricator.wikimedia.org/T184924) [13:15:04] (03CR) 10Alexandros Kosiaris: [C: 032] Add profile::kubernetes::deployment_server::git_* [puppet] - 10https://gerrit.wikimedia.org/r/422399 (https://phabricator.wikimedia.org/T184924) (owner: 10Alexandros Kosiaris) [13:18:29] !log catrope@tin Synchronized dblists/flow.dblist: Enable Flow on euwiki (T190500) (duration: 01m 17s) [13:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:35] T190500: Enable Extension:StructuredDiscussions in Basque Wikipedia - https://phabricator.wikimedia.org/T190500 [13:19:19] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [13:24:10] (03PS1) 10Imarlier: wmf-config/InitialiseSettings.php: Enable oversample for additional countries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422401 (https://phabricator.wikimedia.org/T189252) [13:27:04] (03PS1) 10Elukey: role::mediawiki::videoscaler: reduce the number of available runners [puppet] - 10https://gerrit.wikimedia.org/r/422402 [13:29:39] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:29:59] RECOVERY - Long running screen/tmux on labstore2003 is OK: OK: No SCREEN or tmux processes detected. [13:30:08] (03PS4) 10Vgutierrez: mtail: Provide ttfb histogram for varnishbackend [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) [13:30:49] tgr: Are you here for your SWAT patches? [13:31:00] RoanKattouw: present [13:31:20] twentyafterfour: Did mediawiki.org disappear from group0 somehow? 
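One way to answer that kind of question from the deployment host is to look the wikis up in wikiversions.json. A sketch, assuming the usual layout of that file (a flat dbname → "php-1.31.0-wmf.NN" map kept under /srv/mediawiki-staging):

```python
# Sketch of checking which branch a wiki is pinned to, assuming
# wikiversions.json is a flat {"dbname": "php-1.31.0-wmf.NN", ...} map
# on the deployment host.
import json

with open('/srv/mediawiki-staging/wikiversions.json') as f:
    versions = json.load(f)

for dbname in ('testwiki', 'mediawikiwiki'):
    print(dbname, versions.get(dbname, 'not found'))
```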
[13:31:26] testwiki has wmf.27 but mw.org has 26 [13:31:43] (03PS2) 10Catrope: Enable Wikidata description override on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420227 (https://phabricator.wikimedia.org/T184000) (owner: 10Gergő Tisza) [13:31:47] (03CR) 10Catrope: [C: 032] Enable Wikidata description override on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420227 (https://phabricator.wikimedia.org/T184000) (owner: 10Gergő Tisza) [13:31:52] (03CR) 10Vgutierrez: mtail: Provide ttfb histogram for varnishbackend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [13:33:12] (03Merged) 10jenkins-bot: Enable Wikidata description override on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420227 (https://phabricator.wikimedia.org/T184000) (owner: 10Gergő Tisza) [13:34:49] (03CR) 10jenkins-bot: Enable Wikidata description override on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420227 (https://phabricator.wikimedia.org/T184000) (owner: 10Gergő Tisza) [13:35:02] !log upgrade mariadb client on sarin, neodymium, terbium and wasat [13:35:05] (03CR) 10Elukey: "changes from the pcc perspective: https://puppet-compiler.wmflabs.org/compiler03/10715/" [puppet] - 10https://gerrit.wikimedia.org/r/422402 (owner: 10Elukey) [13:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:07] !log catrope@tin Synchronized php-1.31.0-wmf.27/extensions/Echo/modules/nojs/mw.echo.badge.less: Prevent FOUC when loading notification badges (duration: 01m 20s) [13:36:08] tgr: Wikidata description override is on mwdebug1002, please test [13:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:15] (03CR) 10BBlack: [C: 032] eqsin: turn-up HK + PH + JP [dns] - 10https://gerrit.wikimedia.org/r/422394 (https://phabricator.wikimedia.org/T189252) (owner: 10BBlack) [13:36:36] RoanKattouw: group0 didn't go out last night [13:36:50] I intend to fix that but Greg said to wait for the train window today [13:38:01] OK [13:38:24] was this before or after switch to deploy1001? 
just for my info [13:38:26] (03PS5) 10Ema: WIP: VCL: improve handling of uncacheable responses [puppet] - 10https://gerrit.wikimedia.org/r/421542 (https://phabricator.wikimedia.org/T180712) [13:38:27] RoanKattouw: I can see that the feature is enabled, I can't test more than that without writing content (which is probably a bad idea while it's only on one server) [13:38:51] apergos: after [13:38:57] ok, thanks [13:39:40] RECOVERY - puppet last run on pc1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:39:47] tgr: you can always test other things are not broken, too :-) [13:40:10] well, the site works on mwdebug1002 [13:40:29] for me that would be enough on that context [13:40:29] OK rolling out then [13:40:31] I can't think of anything more specific that would be broken by this [13:41:45] (03PS2) 10Catrope: Enable TemplateStyle on all Wikivoyages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422246 (https://phabricator.wikimedia.org/T189838) (owner: 10Gergő Tisza) [13:41:48] (03CR) 10Catrope: [C: 032] Enable TemplateStyle on all Wikivoyages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422246 (https://phabricator.wikimedia.org/T189838) (owner: 10Gergő Tisza) [13:41:55] a) it can be seen it got enabled correctly (you will be amazed the times where that doesn't work), b) the site is still up c) related funcionality still works [13:42:06] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable Wikidata description override on enwik (T184000) (duration: 01m 18s) [13:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:12] T184000: Magic word on English WP to override display of Wikidata short description - https://phabricator.wikimedia.org/T184000 [13:43:09] (03Merged) 10jenkins-bot: Enable TemplateStyle on all Wikivoyages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422246 (https://phabricator.wikimedia.org/T189838) (owner: 10Gergő Tisza) [13:43:24] (03CR) 10jenkins-bot: Enable TemplateStyle on all Wikivoyages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422246 (https://phabricator.wikimedia.org/T189838) (owner: 10Gergő Tisza) [13:44:05] (03CR) 10Elukey: [C: 032] role::mediawiki::videoscaler: reduce the number of available runners [puppet] - 10https://gerrit.wikimedia.org/r/422402 (owner: 10Elukey) [13:44:48] (03CR) 10Ottomata: Update kafka java.security file with Java 8 u162 changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/421891 (https://phabricator.wikimedia.org/T190400) (owner: 10Ottomata) [13:45:01] tgr: TemplateStyles on Wikivoyage is now on mwdebug1002, please test (to the extent practical) [13:47:58] RoanKattouw: again, the only thing that I can test is that it's enabled and the site is up, and those pass [13:48:43] (03PS2) 10Ottomata: Update kafka java.security file with Java 8 u162 changes [puppet] - 10https://gerrit.wikimedia.org/r/421891 (https://phabricator.wikimedia.org/T190400) [13:49:11] (03CR) 10Ottomata: "Hm, btw, I wonder if my certpath.disableAlgorithms has some redundancy in it. 
Some of the default disabledAlgorthims are also listed in m" [puppet] - 10https://gerrit.wikimedia.org/r/421891 (https://phabricator.wikimedia.org/T190400) (owner: 10Ottomata) [13:49:32] OK, deploying [13:50:08] (03CR) 10Ottomata: Update kafka java.security file with Java 8 u162 changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/421891 (https://phabricator.wikimedia.org/T190400) (owner: 10Ottomata) [13:51:04] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable TemplateStyles on all Wikivoyages (T189838) (duration: 01m 17s) [13:51:08] !log reduced number of jobrunner runners on the videoscalers after the last burst of jobs that maxed out the cluster [13:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:11] T189838: Create and deploy configuration change to enable TemplateStyles on Wikivoyages on 2018-03-28 - https://phabricator.wikimedia.org/T189838 [13:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:06] I thought I would be able to test it in Special:ExpandTemplates but it seems like that ignores the default content model of the title and always assumes wikitext [13:52:25] not sure if that's a bug or I was just trying to use it for something it wasn't meant for [13:59:16] 10Operations, 10Puppet, 10Goal, 10Patch-For-Review: Modernize Puppet Configuration Management (2017-18 Q3 Goal) - https://phabricator.wikimedia.org/T184561#4088092 (10fgiunchedi) [13:59:19] 10Operations, 10Puppet, 10Patch-For-Review, 10User-fgiunchedi: Upgrade Puppet Master Infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T184562#4088090 (10fgiunchedi) 05Open>03Resolved This is completed, added documentation on pooling/depooling frontend/backend at https://wikitech.wik... [14:10:48] (03PS1) 10Ottomata: Replicate everything except change-prop and internal topics from main to jumbo [puppet] - 10https://gerrit.wikimedia.org/r/422408 (https://phabricator.wikimedia.org/T189464) [14:11:45] (03CR) 10Ottomata: [C: 032] Replicate everything except change-prop and internal topics from main to jumbo [puppet] - 10https://gerrit.wikimedia.org/r/422408 (https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata) [14:14:57] RoanKattouw: thanks! both changes seem to work fine [14:17:11] 10Operations, 10Puppet, 10Patch-For-Review, 10User-fgiunchedi: Upgrade Puppet Master Infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T184562#4088144 (10fgiunchedi) [14:17:17] 10Operations, 10Puppet, 10Patch-For-Review: Failover puppet ca service from eqiad to codfw - https://phabricator.wikimedia.org/T189891#4088141 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi This is complete. Added documentation to https://wikitech.wikimedia.org/wiki/Puppet#Puppet_CA [14:18:39] (03PS3) 10Ppchelko: Disable redis queue for cirrusSearch jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416992 (https://phabricator.wikimedia.org/T189137) [14:18:57] 10Operations, 10Puppet, 10Goal, 10Patch-For-Review: Modernize Puppet Configuration Management (2017-18 Q3 Goal) - https://phabricator.wikimedia.org/T184561#4088146 (10fgiunchedi) [14:20:16] 10Operations, 10ops-codfw, 10Traffic: cp2006, cp2010, cp2017: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4088153 (10ema) The memory error situation when it comes to codfw cache hosts is pretty bad. Besides cp2006, cp2010, and cp2017 (found rebooting), I've now checked SEL and the... 
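Checking the SEL for this class of error, as described in the task update above, is easy to script per host. A rough sketch that assumes ipmitool is installed locally and that the controller logs these events with "Memory"/"Uncorrectable" wording (the match pattern is a guess, not a verified format):

```python
# Rough sketch: scan the System Event Log for uncorrectable memory events.
# Assumes ipmitool is installed and can talk to the local BMC; the substring
# match is an assumption about how these controllers word the entries.
import subprocess

sel = subprocess.run(['ipmitool', 'sel', 'list'],
                     capture_output=True, text=True, check=True).stdout

hits = [line for line in sel.splitlines()
        if 'Memory' in line and 'Uncorrectable' in line]
print(f'{len(hits)} uncorrectable memory events')
for line in hits:
    print(line)
```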
[14:21:29] (03CR) 10DCausse: Disable redis queue for cirrusSearch jobs for test wikis. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416992 (https://phabricator.wikimedia.org/T189137) (owner: 10Ppchelko) [14:24:34] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: icinga restarted on each puppet run on standby server - https://phabricator.wikimedia.org/T190912#4088161 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi The change above fixed the problem so we're down to one restart per hour driven by `sync_... [14:25:21] (03CR) 10Mobrovac: "LGTM modulo the duplicate line David pointed out." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416992 (https://phabricator.wikimedia.org/T189137) (owner: 10Ppchelko) [14:27:35] (03PS2) 10Jcrespo: mariadb: Update socket location of misc services (m1, m2, m5) [puppet] - 10https://gerrit.wikimedia.org/r/413167 (https://phabricator.wikimedia.org/T183470) [14:27:37] (03PS4) 10Ppchelko: Disable redis queue for cirrusSearch jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416992 (https://phabricator.wikimedia.org/T189137) [14:27:50] (03CR) 10Ppchelko: "Removed the duplicate line" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416992 (https://phabricator.wikimedia.org/T189137) (owner: 10Ppchelko) [14:28:01] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Update socket location of misc services (m1, m2, m5) [puppet] - 10https://gerrit.wikimedia.org/r/413167 (https://phabricator.wikimedia.org/T183470) (owner: 10Jcrespo) [14:28:07] (03Abandoned) 10Jcrespo: mariadb: Update socket location of misc services (m1, m2, m5) [puppet] - 10https://gerrit.wikimedia.org/r/413167 (https://phabricator.wikimedia.org/T183470) (owner: 10Jcrespo) [14:28:43] 10Operations, 10ops-codfw, 10Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4088168 (10ema) [14:30:14] (03CR) 10DCausse: [C: 031] Disable redis queue for cirrusSearch jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416992 (https://phabricator.wikimedia.org/T189137) (owner: 10Ppchelko) [14:31:08] 10Operations, 10ops-codfw, 10Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4088172 (10faidon) These seem to be under warranty for another 2 months, so we should hurry up. 7 out of 22 identical hosts having memory errors so... [14:33:35] (03PS3) 10Jcrespo: phabricator/mariadb: Update database configuration for stretch/10.1 [puppet] - 10https://gerrit.wikimedia.org/r/377693 (https://phabricator.wikimedia.org/T175679) [14:33:41] 10Operations, 10wikidiff2, 10WMDE-QWERTY-Team-Board: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717#4088178 (10thiemowmde) [14:33:45] (03CR) 10Mobrovac: [C: 031] Disable redis queue for cirrusSearch jobs for test wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/416992 (https://phabricator.wikimedia.org/T189137) (owner: 10Ppchelko) [14:33:47] (03CR) 10jerkins-bot: [V: 04-1] phabricator/mariadb: Update database configuration for stretch/10.1 [puppet] - 10https://gerrit.wikimedia.org/r/377693 (https://phabricator.wikimedia.org/T175679) (owner: 10Jcrespo) [14:34:36] 10Operations, 10wikidiff2, 10WMDE-QWERTY-Team-Board: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717#4088182 (10Lea_WMDE) >So, if I understand this right, the wikidiff extension needs additional changes beyond what is currently deployed on production and bet... [14:36:52] (03PS4) 10Jcrespo: phabricator/mariadb: Update database configuration for stretch/10.1 [puppet] - 10https://gerrit.wikimedia.org/r/377693 (https://phabricator.wikimedia.org/T175679) [14:37:20] (03PS1) 10Gergő Tisza: Enable TemplateStyles on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422414 (https://phabricator.wikimedia.org/T190910) [14:37:57] heads-up, i'll take over deploy1001 for 10 mins or so [14:39:03] mobrovac: in case you didn't know already, deployment server is back to tin [14:39:21] ah? wow [14:39:22] kk [14:39:24] thnx godog [14:39:53] np mobrovac, deployment.eqiad.wmnet cname does the right thing fwiw [14:40:27] i know but i don't like to use it because of offending keys :P [14:41:11] 10Operations, 10DC-Ops, 10Traffic, 10monitoring, and 2 others: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177#4088202 (10BBlack) See updates in T190540 , quite a few codfw hosts have SEL entries for uncorrectable ECC errors that went by unnoticed (but we tend to notice on r... [14:41:22] uh, why is wikiversions.json modified locally? [14:41:43] for testwikis [14:42:46] RoanKattouw: know anything about ^ ? [14:43:06] /srv/mediawiki-staging/wikiversions.json [14:43:20] (03CR) 10Jcrespo: "Manuel: Most of these changes have been done on other commits already, but the ones pending should be interesting to merge, maybe." [puppet] - 10https://gerrit.wikimedia.org/r/377693 (https://phabricator.wikimedia.org/T175679) (owner: 10Jcrespo) [14:44:20] (03Abandoned) 10Jcrespo: Revert "mariadb: Redo mariadb::backup class into role/profile style" [puppet] - 10https://gerrit.wikimedia.org/r/410131 (owner: 10Jcrespo) [14:46:56] ok, i'll just proceed, it won't interfere with what i want to do [14:48:28] (03CR) 10Mobrovac: [C: 032] Disable redis queue for cirrusSearch jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416992 (https://phabricator.wikimedia.org/T189137) (owner: 10Ppchelko) [14:48:35] (03PS5) 10Mobrovac: Disable redis queue for cirrusSearch jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416992 (https://phabricator.wikimedia.org/T189137) (owner: 10Ppchelko) [14:49:35] (03PS6) 10Jcrespo: [WIP]Remove $::mw_primary variable from puppet [puppet] - 10https://gerrit.wikimedia.org/r/345346 (https://phabricator.wikimedia.org/T156924) [14:49:57] (03CR) 10Filippo Giunchedi: [C: 031] Switch time server on dns5001 to Chrony [puppet] - 10https://gerrit.wikimedia.org/r/422387 (https://phabricator.wikimedia.org/T177742) (owner: 10Muehlenhoff) [14:50:41] (03CR) 10jenkins-bot: Disable redis queue for cirrusSearch jobs for test wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/416992 (https://phabricator.wikimedia.org/T189137) (owner: 10Ppchelko) [14:52:29] (03CR) 10Jcrespo: "Joe, Volans: with the primary master setup being (complete?), maybe you have suggestion on how to complete the script ('# TODO: get the pr" [puppet] - 10https://gerrit.wikimedia.org/r/345346 (https://phabricator.wikimedia.org/T156924) (owner: 10Jcrespo) [14:52:39] PROBLEM - Kafka main-eqiad consumer group lag for kafka-mirror-main-eqiad_to_jumbo-eqiad on kafkamon1001 is CRITICAL: CRITICAL: Group is in an error state. Worst Lag: eqiad.mediawiki.job.wikibase-addUsagesForPage/p0 - lag:480 offset:4578205881 [14:53:03] (03CR) 10Ema: mtail: Provide ttfb histogram for varnishbackend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [14:53:57] (03CR) 10Alexandros Kosiaris: [C: 04-1] "IIRC, we 've settled on having them at the node level for now." [puppet] - 10https://gerrit.wikimedia.org/r/420143 (owner: 10Dzahn) [14:54:02] !log ppchelko@tin Started deploy [cpjobqueue/deploy@c84880a]: Switch CirrusSearch jobs to kafka for test wikis [14:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:45] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@c84880a]: Switch CirrusSearch jobs to kafka for test wikis (duration: 00m 44s) [14:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:18] (03CR) 10Filippo Giunchedi: [C: 031] "I was a little on the fence initially because cron not running is potentially harmful, though a failed restart should (!) trigger the "fai" [puppet] - 10https://gerrit.wikimedia.org/r/422391 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:56:15] !log mobrovac@tin Synchronized wmf-config/InitialiseSettings.php: Disable redis queue for cirrusSearch jobs for test wikis, file 1/2 - T189137 (duration: 01m 17s) [14:56:15] (03CR) 10Vgutierrez: mtail: Provide ttfb histogram for varnishbackend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [14:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:21] T189137: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137 [14:56:51] (03CR) 10Jcrespo: [C: 04-2] "Is thjs still relevant or should it be abandoned?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399792 (https://phabricator.wikimedia.org/T134476) (owner: 10Jcrespo) [14:57:19] mobrovac: https://grafana.wikimedia.org/dashboard/db/kafka-consumer-lag?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=main-eqiad&var-topic=All&var-consumer_group=change-prop-wikibase-addUsagesForPage [14:57:23] expected? 
[14:57:27] (I saw the new alarms firing) [14:57:50] ah went down, the alarms are probably too sensitive [14:58:02] mobrovac: No idea, ask twentyafterfour [14:58:05] !log mobrovac@tin Synchronized wmf-config/jobqueue.php: Disable redis queue for cirrusSearch jobs for test wikis, file 2/2 - T189137 (duration: 01m 17s) [14:58:09] elukey: known [14:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:28] mobrovac: sure, let's discuss how to tune the alarms when you are less busy :) [15:00:05] !log stopping nodepool on labnodepool1001 [15:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:27] !log stopping nova-fullstack on labnet1001 for T189115 [15:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:52] ok, i'm done with tin [15:02:09] !log rebooting labservices1001 and labcontrol1001 for T189115 [15:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:03] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decommission db2030 - https://phabricator.wikimedia.org/T187768#4088271 (10Papaul) switch port information asw-b6-codfw ge-6/0/13 [15:07:38] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decommission db2030 - https://phabricator.wikimedia.org/T187768#4088272 (10Papaul) [15:07:39] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad dropped message count in last 30m on einsteinium is CRITICAL: CRITICAL - scalar(sum(increase(kafka_tools_MirrorMaker_MirrorMaker_numDroppedMessages{mirror_name=main-eqiad_to_jumbo-eqiad} [30m]))): 1434.3123149425287 10.0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad%2520prometheus%252Fops&var-mirror_name=main-eqiad_to_jumbo-eqiad [15:07:54] !log restarting nova-network on labnet1001 in case it's upset by the rabbit outage [15:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:30] !log restarting nova-fullstack on labnet1001 [15:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:40] so the Kafka mirror maker alarm's graph is wrong, since the issue is main -> jumbo [15:09:23] https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker-new-consumer?refresh=5m&orgId=1 [15:09:38] the alarm needs to be updated [15:09:45] buuut it seems that mirror maker is doing a ton more work [15:09:55] possibly due to the new events flowing in? [15:10:28] which new events? [15:10:51] mobrovac: I saw Disable redis queue for cirrusSearch jobs and I thought there were changes :) [15:11:11] that only switched on testwiki though, should be tiny # of events [15:11:22] oh no no, these events have been there for a while, we are just processing them now instead of ignoring them [15:11:46] ottomata: hello :) [15:11:47] https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker-new-consumer?refresh=5m&orgId=1 [15:11:59] oh yeahhhhh [15:12:32] ottomata: might be related to https://grafana.wikimedia.org/dashboard/db/kafka-consumer-lag?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=main-eqiad&var-topic=All&var-consumer_group=change-prop-wikibase-addUsagesForPage ? [15:12:50] elukey: are you talking about the extra volume? [15:12:54] yeah [15:12:58] i re-added job topics just now [15:13:02] logged it in analytics ! :) [15:13:02] ahhhh [15:13:09] so it's good!
[15:13:14] !log restarting nodepool on labnodepool1001 (cleanup from T189115) [15:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:23] ottomata: scalar(sum(increase(kafka_tools_MirrorMaker_MirrorMaker_numDroppedMessages{mirror_name=main-eqiad_to_jumbo-eqiad} [30m]))): 1 etc.. [15:13:27] argh nope [15:13:35] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad dropped message count in last 30m [15:13:38] etc.. [15:13:40] interesting! [15:13:43] an alert! [15:13:45] there are some icinga alerts firing [15:13:48] oh that [15:13:49] huhhh! [15:13:54] cool well the alerts work! that's cool [15:14:13] dropped 5 messages interesting indeed [15:14:19] a couple of comments though: 1) we'd need to update the dashboard link with the new-consumer stuff [15:14:33] 2) this one also fired [15:14:34] PROBLEM - Kafka main-eqiad consumer group lag for kafka-mirror-main-eqiad_to_jumbo-eqiad on kafkamon1001 is CRITICAL: CRITICAL: Group is in an error state. Worst Lag: eqiad.mediawiki.job.wikibase-addUsagesForPage/p0 - lag:480 offset:4578205881 [15:14:56] that is a bit confusing for people if they don't know what that is :D [15:15:32] hm, the consumer lag i don't think should have fired for this [15:15:34] it was too short [15:15:37] i should figure out how to adjust that [15:15:40] After a chat with Marko it seems to be a known issue, so we might need to tune the alarms to be less sensitive [15:15:43] yeah [15:15:47] ok :) [15:15:53] but the dropped messages one isn't good [15:15:57] that one I want to be very sensitive [15:16:05] that shouldn't happen [15:16:22] (03PS1) 10Arturo Borrero Gonzalez: labs: monitoring: fix permissions of /var/log/graphite [puppet] - 10https://gerrit.wikimedia.org/r/422417 (https://phabricator.wikimedia.org/T189871) [15:16:28] how can this happen? [15:16:34] (03PS3) 10Andrew Bogott: keystone-paste.ini: Remove deprecated extension filters [puppet] - 10https://gerrit.wikimedia.org/r/422352 (https://phabricator.wikimedia.org/T187954) [15:17:01] (03CR) 10Arturo Borrero Gonzalez: [C: 032] labs: monitoring: fix permissions of /var/log/graphite [puppet] - 10https://gerrit.wikimedia.org/r/422417 (https://phabricator.wikimedia.org/T189871) (owner: 10Arturo Borrero Gonzalez) [15:17:25] well, i'm not totally sure what that metric is [15:17:36] i'd hope that it would not update committed offsets for that partition [15:17:41] and it would just reconsume or something [15:19:12] ottomata: ack [15:19:32] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decommission db2030 - https://phabricator.wikimedia.org/T187768#4088298 (10Papaul) [15:21:47] (03Abandoned) 10Arturo Borrero Gonzalez: wmcs: introduce hiera support for multiple labmon servers [puppet] - 10https://gerrit.wikimedia.org/r/422126 (https://phabricator.wikimedia.org/T189871) (owner: 10Arturo Borrero Gonzalez) [15:22:09] PROBLEM - Kafka main-eqiad consumer group lag for kafka-mirror-main-eqiad_to_jumbo-eqiad on kafkamon1001 is CRITICAL: CRITICAL: Group is in an error state. Worst Lag: eqiad.mediawiki.job.htmlCacheUpdate/p0 - lag:1069 offset:887305497 [15:24:22] ok, looking at that lag alert... [15:24:25] cool that it works though!
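For context on the dropped-message alert quoted above: it is an Icinga check built on the Prometheus expression shown in the alert text (the increase of kafka_tools_MirrorMaker_MirrorMaker_numDroppedMessages over 30 minutes). The sketch below is illustrative only — the Prometheus URL and the warning/critical thresholds are assumptions, not the production values — but it shows how such an expression can be evaluated against the Prometheus query API and mapped to Nagios-style exit codes:

```python
#!/usr/bin/env python3
# Illustrative sketch only; the Prometheus URL and thresholds are assumptions.
import math
import sys
import requests

PROMETHEUS = "http://prometheus.example.org/ops"  # hypothetical endpoint
QUERY = ('scalar(sum(increase('
         'kafka_tools_MirrorMaker_MirrorMaker_numDroppedMessages'
         '{mirror_name="main-eqiad_to_jumbo-eqiad"}[30m])))')
WARN, CRIT = 10.0, 1000.0  # illustrative thresholds

def main():
    r = requests.get(f"{PROMETHEUS}/api/v1/query",
                     params={"query": QUERY}, timeout=10)
    r.raise_for_status()
    # a scalar() query returns a [timestamp, "value"] pair
    value = float(r.json()["data"]["result"][1])
    if math.isnan(value):
        print("UNKNOWN - no data for dropped message count")
        return 3
    if value >= CRIT:
        print(f"CRITICAL - {value:.1f} messages dropped in the last 30m")
        return 2
    if value >= WARN:
        print(f"WARNING - {value:.1f} messages dropped in the last 30m")
        return 1
    print(f"OK - {value:.1f} messages dropped in the last 30m")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```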
[15:24:42] (03PS1) 10Imarlier: wmf-config: Enable oversampling for remaining countries in Asia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422419 (https://phabricator.wikimedia.org/T189252) [15:25:08] hmmm, interesting [15:25:19] (03CR) 10BBlack: [C: 031] "Please, with all haste, we're waiting on the IN data :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422401 (https://phabricator.wikimedia.org/T189252) (owner: 10Imarlier) [15:25:36] (03PS8) 10Arturo Borrero Gonzalez: wmcs: monitoring: rsync whisper files between mon servers [puppet] - 10https://gerrit.wikimedia.org/r/422389 (https://phabricator.wikimedia.org/T190512) [15:26:04] elukey: so that alert is coming directly from burrow [15:26:09] (03CR) 10jerkins-bot: [V: 04-1] wmf-config: Enable oversampling for remaining countries in Asia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422419 (https://phabricator.wikimedia.org/T189252) (owner: 10Imarlier) [15:26:12] (03CR) 10Filippo Giunchedi: "While reviewing this it occurred to me that for less data loss upon unplanned failover you can send metrics to the slave via carbon-c-rela" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/422389 (https://phabricator.wikimedia.org/T190512) (owner: 10Arturo Borrero Gonzalez) [15:27:14] ottomata: yeah but it needs to be less sensitive since there are known lag that we shouldn't alert on [15:27:19] (03CR) 10Imarlier: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422419 (https://phabricator.wikimedia.org/T189252) (owner: 10Imarlier) [15:27:20] i actually dont' think i can make it less sensitive, [15:27:27] this is something burrow would email for [15:27:36] its using the /status endpoint [15:27:39] and burrow decides [15:28:13] there is a critical lag threshold in this check, but all it does is add extra alerting if burrow says things are warning, but we want to do critical earlier [15:28:16] so [15:28:33] i might need to switch the lag alert to be prometheus based instead of using the burrow checker [15:28:47] burrow -> prometheus -> icinga [15:28:50] rather than burrow -> icinga [15:29:11] ottomata: sure but we can tell to the nagios monitor to wait for say 3/4 times with X time between them before alerting no? [15:29:32] if burrow clears itself in the meantime, no alert fired [15:29:33] bblack: If you want https://gerrit.wikimedia.org/r/422401 to be deployed Right Now (ish), I'd be happy to do that (cc marlier) [15:30:16] can we do that? [15:30:46] oh! [15:30:49] # $retries [15:30:49] # Defaults to 3. The number of times a service will be retried before [15:30:49] # notifying [15:30:50] we can! [15:31:06] let's try that [15:31:18] defaults to 3 though [15:31:19] hm [15:31:33] there must be also a time between retries [15:31:39] the default should be very low [15:31:56] yeah, wonder how often nrpe checks get run [15:31:58] marlier: RoanKattouw above offers to shove the first update now, if you're ok with how it looks presently [15:32:23] ottomata: iirc it was a minute, there are some defaults [15:32:32] RoanKattouw: bblack: works for me, if you don't mind [15:32:36] hm, ok, then let's set retries to 30? 
[15:32:47] oh retry_interval [15:32:48] hm [15:32:49] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad dropped message count in last 30m on einsteinium is OK: OK - scalar(sum(increase(kafka_tools_MirrorMaker_MirrorMaker_numDroppedMessages{mirror_name=main-eqiad_to_jumbo-eqiad} [30m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad%2520prometheus%252Fops&var-mirror_name=main-eqiad_to_jumbo-eqiad [15:32:51] ottomata: yeah [15:32:51] ooo ok [15:33:15] (03CR) 10Filippo Giunchedi: [C: 031] mtail: Add varnish_resourceloader_resp in varnishrls [puppet] - 10https://gerrit.wikimedia.org/r/422381 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [15:33:16] Alright, on its way [15:33:23] thanks! [15:33:26] (03CR) 10Catrope: [C: 032] wmf-config/InitialiseSettings.php: Enable oversample for additional countries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422401 (https://phabricator.wikimedia.org/T189252) (owner: 10Imarlier) [15:33:30] (03PS2) 10Catrope: wmf-config/InitialiseSettings.php: Enable oversample for additional countries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422401 (https://phabricator.wikimedia.org/T189252) (owner: 10Imarlier) [15:33:35] (03CR) 10Catrope: [C: 032] wmf-config/InitialiseSettings.php: Enable oversample for additional countries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422401 (https://phabricator.wikimedia.org/T189252) (owner: 10Imarlier) [15:33:49] ottomata: let's also set the contact group for 'analytics' for the moment to avoid false positives in here that might confuse people [15:34:27] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4088342 (10ayounsi) [15:34:50] (03Merged) 10jenkins-bot: wmf-config/InitialiseSettings.php: Enable oversample for additional countries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422401 (https://phabricator.wikimedia.org/T189252) (owner: 10Imarlier) [15:35:26] hmm ok [15:37:25] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable oversampling for IN, GU, MP in preparation for eqsin (T189252) (duration: 01m 18s) [15:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:31] T189252: Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252 [15:37:49] (03CR) 10Filippo Giunchedi: mtail: Provide ttfb histogram for varnishbackend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [15:38:09] Alright there you go [15:38:20] (03CR) 10jenkins-bot: wmf-config/InitialiseSettings.php: Enable oversample for additional countries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422401 (https://phabricator.wikimedia.org/T189252) (owner: 10Imarlier) [15:38:36] Brilliant, thanks RoanKattouw! 
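To make the lag-alert discussion above easier to follow: the consumer-group lag check does not compare a raw number against a threshold — it asks Burrow's consumer status endpoint for its verdict, which is why the check itself cannot simply be made "less sensitive"; the knobs that help are the Icinga service's retries and retry_interval, as worked out above. A minimal sketch of such a Burrow-backed check follows; the Burrow host, port and API path are assumptions (the path and field names differ between Burrow releases), not the production configuration:

```python
#!/usr/bin/env python3
# Minimal sketch, not the production checker; host, port and API path are assumptions.
import sys
import requests

BURROW = "http://kafkamon1001.example.org:8000"  # hypothetical
CLUSTER = "main-eqiad"
GROUP = "kafka-mirror-main-eqiad_to_jumbo-eqiad"

def main():
    # Burrow (pre-1.0) exposes /v2/kafka/<cluster>/consumer/<group>/status;
    # newer releases use a /v3 path with slightly different response fields.
    url = f"{BURROW}/v2/kafka/{CLUSTER}/consumer/{GROUP}/status"
    body = requests.get(url, timeout=10).json()
    state = body.get("status", {}).get("status", "UNKNOWN")
    if state in ("ERR", "STOP", "STALL"):
        print(f"CRITICAL: consumer group {GROUP} is in state {state}")
        return 2
    if state == "WARN":
        print(f"WARNING: consumer group {GROUP} is in state {state}")
        return 1
    if state == "OK":
        print(f"OK: consumer group {GROUP} is in state {state}")
        return 0
    print(f"UNKNOWN: consumer group {GROUP} reported state {state}")
    return 3

if __name__ == "__main__":
    sys.exit(main())
```

On the Icinga side, retries and retry_interval control how many consecutive soft failures are needed before a notification goes out, so raising the retry count lets short-lived lag spikes like the htmlCacheUpdate one above clear on their own before anyone is paged.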
[15:38:53] (03PS1) 10Ottomata: Alert on lag in last 30 minutes, alert mirror maker lag for analytics [puppet] - 10https://gerrit.wikimedia.org/r/422424 (https://phabricator.wikimedia.org/T189611) [15:39:16] lmk if you need more deployed, I'm eating dinner but am still pingable [15:39:35] (03CR) 10Ottomata: [C: 032] Alert on lag in last 30 minutes, alert mirror maker lag for analytics [puppet] - 10https://gerrit.wikimedia.org/r/422424 (https://phabricator.wikimedia.org/T189611) (owner: 10Ottomata) [15:39:52] 10Operations, 10ops-codfw, 10Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4088357 (10RobH) [15:41:19] PROBLEM - Kafka main-eqiad consumer group lag for kafka-mirror-main-eqiad_to_jumbo-eqiad on kafkamon1001 is CRITICAL: CRITICAL: Group is in an error state. Worst Lag: eqiad.mediawiki.job.htmlCacheUpdate/p0 - lag:1 offset:887358124 [15:43:44] 10Operations, 10ops-codfw, 10Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4088365 (10RobH) [15:51:43] (03PS1) 10Jrdnch: Update to glibc >=2.19 [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/422425 (https://phabricator.wikimedia.org/T186250) [15:56:13] 10Operations, 10ops-codfw, 10Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4088416 (10RobH) [15:56:58] (03CR) 10Bstorm: "Just a note: I am refactoring this to more correctly match standards as well as make the linter happier." [puppet] - 10https://gerrit.wikimedia.org/r/422199 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [16:05:43] (03CR) 10Arturo Borrero Gonzalez: [C: 032] dynamicproxy: run logrotate hourly [puppet] - 10https://gerrit.wikimedia.org/r/422197 (https://phabricator.wikimedia.org/T190218) (owner: 10Bstorm) [16:05:53] (03CR) 10Arturo Borrero Gonzalez: [C: 031] dynamicproxy: run logrotate hourly [puppet] - 10https://gerrit.wikimedia.org/r/422197 (https://phabricator.wikimedia.org/T190218) (owner: 10Bstorm) [16:07:41] 10Operations, 10ops-codfw, 10Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4088458 (10RobH) a:03Papaul After reviewing with traffic team, we're goign to test memory in all of these. I've updated the task description with... 
[16:08:22] (03PS2) 10BBlack: eqsin: turn-up BD, LK, NP, PK [dns] - 10https://gerrit.wikimedia.org/r/422396 (https://phabricator.wikimedia.org/T189252) [16:08:24] (03PS2) 10BBlack: eqsin: turn-up India [dns] - 10https://gerrit.wikimedia.org/r/422395 (https://phabricator.wikimedia.org/T189252) [16:12:49] (03CR) 10Rush: [C: 031] "the only thing I'm not sure of is if there are specific package logrotate directives that are counting on daily runs to do the right thing" [puppet] - 10https://gerrit.wikimedia.org/r/422197 (https://phabricator.wikimedia.org/T190218) (owner: 10Bstorm) [16:13:07] 10Operations, 10ops-codfw, 10DC-Ops, 10hardware-requests: Decommission restbase-test200[123] - https://phabricator.wikimedia.org/T187447#4088465 (10Papaul) switch port information asw-b5-codfw restbase-test2001 ge-5/0/19 restbase-test2002 ge-5/0/16 restbase-test2003 ge-5/0/20 [16:15:08] (03CR) 10Rush: [C: 031] "Part of me says we should put this under teh toollabs modules as it will make sense on execs and k8s workers too but good next step if we " [puppet] - 10https://gerrit.wikimedia.org/r/422186 (https://phabricator.wikimedia.org/T190185) (owner: 10Bstorm) [16:15:32] (03CR) 10Arturo Borrero Gonzalez: ">" [puppet] - 10https://gerrit.wikimedia.org/r/422389 (https://phabricator.wikimedia.org/T190512) (owner: 10Arturo Borrero Gonzalez) [16:24:08] (03PS1) 10Chad: WIP: Initial crappy implementation of Github repo creation [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/422429 [16:26:22] (03CR) 10Rush: [C: 04-1] wmcs: monitoring: rsync whisper files between mon servers (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/422389 (https://phabricator.wikimedia.org/T190512) (owner: 10Arturo Borrero Gonzalez) [16:26:25] (03PS1) 10Ottomata: Be more lenient about MirrorMaker numDroppedMessages alert [puppet] - 10https://gerrit.wikimedia.org/r/422430 (https://phabricator.wikimedia.org/T189611) [16:27:34] (03CR) 10Ottomata: [C: 032] Be more lenient about MirrorMaker numDroppedMessages alert [puppet] - 10https://gerrit.wikimedia.org/r/422430 (https://phabricator.wikimedia.org/T189611) (owner: 10Ottomata) [16:28:07] (03CR) 10Alexandros Kosiaris: [C: 032] lttoolbox: Update to latest upstream release [debs/contenttranslation/lttoolbox] - 10https://gerrit.wikimedia.org/r/419346 (https://phabricator.wikimedia.org/T189075) (owner: 10KartikMistry) [16:28:55] (03CR) 10Bstorm: "I looked through it. I was surprised to find that there wasn't any. 
You can tell it to rotate hourly, but it won't actually do it unless" [puppet] - 10https://gerrit.wikimedia.org/r/422197 (https://phabricator.wikimedia.org/T190218) (owner: 10Bstorm) [16:29:34] (03PS1) 10Ottomata: Increase MirrorMaker main -> jumbo heap size [puppet] - 10https://gerrit.wikimedia.org/r/422431 (https://phabricator.wikimedia.org/T189464) [16:29:58] (03PS2) 10Bstorm: dynamicproxy: run logrotate hourly [puppet] - 10https://gerrit.wikimedia.org/r/422197 (https://phabricator.wikimedia.org/T190218) [16:30:07] (03PS5) 10Vgutierrez: mtail: Provide ttfb histogram for varnishbackend [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) [16:30:14] (03CR) 10Ottomata: [C: 032] Increase MirrorMaker main -> jumbo heap size [puppet] - 10https://gerrit.wikimedia.org/r/422431 (https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata) [16:31:47] (03CR) 10Vgutierrez: mtail: Provide ttfb histogram for varnishbackend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [16:32:41] (03PS3) 10Bstorm: dynamicproxy: run logrotate hourly [puppet] - 10https://gerrit.wikimedia.org/r/422197 (https://phabricator.wikimedia.org/T190218) [16:33:11] (03PS2) 10Vgutierrez: mtail: Add varnish_resourceloader_resp in varnishrls [puppet] - 10https://gerrit.wikimedia.org/r/422381 (https://phabricator.wikimedia.org/T184942) [16:33:36] 10Operations, 10ops-codfw, 10Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4088571 (10Papaul) All those systems are running outdated IDRAC and BIOS version. I will like to update the IDRAC and BIOS first before running the... [16:33:48] (03CR) 10Bstorm: [C: 032] dynamicproxy: run logrotate hourly [puppet] - 10https://gerrit.wikimedia.org/r/422197 (https://phabricator.wikimedia.org/T190218) (owner: 10Bstorm) [16:33:50] (03CR) 10Vgutierrez: [C: 032] mtail: Add varnish_resourceloader_resp in varnishrls [puppet] - 10https://gerrit.wikimedia.org/r/422381 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [16:37:00] (03PS4) 10Andrew Bogott: keystone-paste.ini: Remove deprecated extension filters [puppet] - 10https://gerrit.wikimedia.org/r/422352 (https://phabricator.wikimedia.org/T187954) [16:37:02] (03PS1) 10Andrew Bogott: nova.conf: use entry point name for scheduler_driver [puppet] - 10https://gerrit.wikimedia.org/r/422432 [16:37:04] (03PS1) 10Andrew Bogott: nova.conf: remove memcached setting [puppet] - 10https://gerrit.wikimedia.org/r/422433 (https://phabricator.wikimedia.org/T187954) [16:37:09] (03CR) 10Arturo Borrero Gonzalez: wmcs: monitoring: rsync whisper files between mon servers (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/422389 (https://phabricator.wikimedia.org/T190512) (owner: 10Arturo Borrero Gonzalez) [16:37:37] (03CR) 10jerkins-bot: [V: 04-1] nova.conf: use entry point name for scheduler_driver [puppet] - 10https://gerrit.wikimedia.org/r/422432 (owner: 10Andrew Bogott) [16:37:41] !log T189075 upload lttoolbox_3.4.0~r84331-1+wmf1 to apt.wikimedia.org/jessie-wikimedia/main [16:37:43] (03PS3) 10Vgutierrez: mtail: Add varnish_resourceloader_resp in varnishrls [puppet] - 10https://gerrit.wikimedia.org/r/422381 (https://phabricator.wikimedia.org/T184942) [16:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:49] T189075: Package apertium-separable and dependencies - https://phabricator.wikimedia.org/T189075 
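On the logrotate point discussed above: logrotate only acts when it is invoked, so an hourly directive in a config has no effect unless logrotate itself is run at least hourly (for example from /etc/cron.hourly or a systemd timer). A small sanity-check sketch, assuming Debian-style paths chosen here purely for illustration:

```python
#!/usr/bin/env python3
# Sketch, assuming Debian-style paths: warn if a config asks for 'hourly'
# rotation while no hourly invocation of logrotate appears to be scheduled.
import glob
import os

CONFIGS = ["/etc/logrotate.conf"] + sorted(glob.glob("/etc/logrotate.d/*"))
HOURLY_RUNNERS = ["/etc/cron.hourly/logrotate"]  # a systemd timer would also count

def wants_hourly(path):
    try:
        with open(path) as f:
            # ignore comments, look for a bare 'hourly' directive
            return any(line.split("#", 1)[0].strip() == "hourly" for line in f)
    except OSError:
        return False

hourly_configs = [p for p in CONFIGS if wants_hourly(p)]
has_hourly_run = any(os.path.exists(p) for p in HOURLY_RUNNERS)

if hourly_configs and not has_hourly_run:
    print("WARNING: 'hourly' requested but logrotate is not scheduled hourly:")
    for p in hourly_configs:
        print("  ", p)
else:
    print("OK: hourly configs:", len(hourly_configs),
          "- hourly invocation present:", has_hourly_run)
```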
[16:40:44] (03CR) 10Elukey: "Thanks a lot for this work! Tested on a Jessie system with glibc 2.19, works fine. Left a comment for the documentation :)" (031 comment) [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/422425 (https://phabricator.wikimedia.org/T186250) (owner: 10Jrdnch) [16:46:27] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium] - 10https://gerrit.wikimedia.org/r/419351 (https://phabricator.wikimedia.org/T189075) (owner: 10KartikMistry) [16:46:32] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-separable] - 10https://gerrit.wikimedia.org/r/421808 (https://phabricator.wikimedia.org/T189075) (owner: 10KartikMistry) [16:46:34] (03PS2) 10Andrew Bogott: nova.conf: use entry point name for scheduler_driver [puppet] - 10https://gerrit.wikimedia.org/r/422432 (https://phabricator.wikimedia.org/T187954) [16:46:36] (03PS2) 10Andrew Bogott: nova.conf: remove memcached setting [puppet] - 10https://gerrit.wikimedia.org/r/422433 (https://phabricator.wikimedia.org/T187954) [16:46:38] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-fra-cat] - 10https://gerrit.wikimedia.org/r/421859 (https://phabricator.wikimedia.org/T189076) (owner: 10KartikMistry) [16:46:42] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-lex-tools] - 10https://gerrit.wikimedia.org/r/419356 (https://phabricator.wikimedia.org/T189075) (owner: 10KartikMistry) [16:46:46] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-cat] - 10https://gerrit.wikimedia.org/r/421825 (https://phabricator.wikimedia.org/T189076) (owner: 10KartikMistry) [16:46:50] (03CR) 10Filippo Giunchedi: mtail: Provide ttfb histogram for varnishbackend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [16:46:52] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-fra] - 10https://gerrit.wikimedia.org/r/421813 (https://phabricator.wikimedia.org/T189076) (owner: 10KartikMistry) [16:47:00] (03CR) 10jerkins-bot: [V: 04-1] apertium-separable: Initial Debian packaging [debs/contenttranslation/apertium-separable] - 10https://gerrit.wikimedia.org/r/421808 (https://phabricator.wikimedia.org/T189075) (owner: 10KartikMistry) [16:47:39] (03CR) 10jerkins-bot: [V: 04-1] apertium-fra-cat: New upstream release [debs/contenttranslation/apertium-fra-cat] - 10https://gerrit.wikimedia.org/r/421859 (https://phabricator.wikimedia.org/T189076) (owner: 10KartikMistry) [16:47:42] (03CR) 10jerkins-bot: [V: 04-1] apertium-lex-tools: New upstream release [debs/contenttranslation/apertium-lex-tools] - 10https://gerrit.wikimedia.org/r/419356 (https://phabricator.wikimedia.org/T189075) (owner: 10KartikMistry) [16:47:56] (03CR) 10jerkins-bot: [V: 04-1] apertium-cat: New upstream release [debs/contenttranslation/apertium-cat] - 10https://gerrit.wikimedia.org/r/421825 (https://phabricator.wikimedia.org/T189076) (owner: 10KartikMistry) [16:48:00] (03CR) 10jerkins-bot: [V: 04-1] apertium-fra: New upstream release [debs/contenttranslation/apertium-fra] - 10https://gerrit.wikimedia.org/r/421813 (https://phabricator.wikimedia.org/T189076) (owner: 10KartikMistry) [16:48:02] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [16:54:01] PROBLEM - Host 
cp2002 is DOWN: PING CRITICAL - Packet loss = 100% [16:54:11] PROBLEM - IPsec on kafka-jumbo1005 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2002_v4, cp2002_v6 [16:54:11] PROBLEM - IPsec on cp5004 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:54:11] PROBLEM - IPsec on cp5002 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:54:11] PROBLEM - IPsec on cp5003 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:54:11] PROBLEM - IPsec on cp5001 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:54:11] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:54:11] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2002_v4, cp2002_v6 [16:54:12] PROBLEM - IPsec on cp4024 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:54:12] PROBLEM - IPsec on cp4021 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:54:13] PROBLEM - IPsec on cp4022 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:54:13] PROBLEM - IPsec on cp4025 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:54:14] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:58:03] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4088630 (10ayounsi) [16:58:21] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2002_v4, cp2002_v6 [16:58:22] PROBLEM - IPsec on kafka-jumbo1003 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2002_v4, cp2002_v6 [16:58:22] PROBLEM - IPsec on kafka-jumbo1002 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2002_v4, cp2002_v6 [16:58:22] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:58:31] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:58:31] PROBLEM - IPsec on cp4023 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:58:32] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2002_v4, cp2002_v6 [16:58:32] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2002_v4, cp2002_v6 [16:58:32] PROBLEM - IPsec on cp4026 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:58:41] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:58:41] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:58:41] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:58:41] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:58:41] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:58:42] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:58:51] PROBLEM - IPsec on kafka-jumbo1006 is CRITICAL: Strongswan CRITICAL - ok: 133 no-child-sa: cp3007_v6 not-conn: cp2002_v4, cp2002_v6 [16:58:51] PROBLEM - IPsec on kafka-jumbo1001 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2002_v4, cp2002_v6 
[16:58:51] PROBLEM - IPsec on kafka-jumbo1004 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2002_v4, cp2002_v6 [16:58:51] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:58:51] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:58:52] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:58:52] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:58:52] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:58:52] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2002_v4, cp2002_v6 [16:59:02] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:59:48] hello cp2001 [16:59:51] err 2002 [17:00:05] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180328T1700). [17:00:05] No GERRIT patches in the queue for this window AFAICS. [17:00:51] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [17:00:52] RECOVERY - Host cp2002 is UP: PING WARNING - Packet loss = 37%, RTA = 36.07 ms [17:01:01] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 66 ESP OK [17:01:01] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 136 ESP OK [17:01:02] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 66 ESP OK [17:01:11] RECOVERY - IPsec on cp5001 is OK: Strongswan OK - 66 ESP OK [17:01:11] RECOVERY - IPsec on cp5002 is OK: Strongswan OK - 66 ESP OK [17:01:11] RECOVERY - IPsec on kafka-jumbo1005 is OK: Strongswan OK - 136 ESP OK [17:01:21] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 136 ESP OK [17:01:21] RECOVERY - IPsec on cp5003 is OK: Strongswan OK - 66 ESP OK [17:01:21] RECOVERY - IPsec on cp5004 is OK: Strongswan OK - 66 ESP OK [17:01:21] RECOVERY - IPsec on cp4021 is OK: Strongswan OK - 66 ESP OK [17:01:22] RECOVERY - IPsec on cp4022 is OK: Strongswan OK - 66 ESP OK [17:01:22] RECOVERY - IPsec on cp4024 is OK: Strongswan OK - 66 ESP OK [17:01:22] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 66 ESP OK [17:01:22] RECOVERY - IPsec on cp4025 is OK: Strongswan OK - 66 ESP OK [17:01:24] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 66 ESP OK [17:01:24] RECOVERY - IPsec on kafka1023 is OK: Strongswan OK - 136 ESP OK [17:01:24] RECOVERY - IPsec on cp5005 is OK: Strongswan OK - 66 ESP OK [17:01:24] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 136 ESP OK [17:01:25] RECOVERY - IPsec on kafka-jumbo1003 is OK: Strongswan OK - 136 ESP OK [17:01:25] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 66 ESP OK [17:01:41] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 136 ESP OK [17:01:41] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 136 ESP OK [17:01:41] RECOVERY - IPsec on cp4026 is OK: Strongswan OK - 66 ESP OK [17:01:42] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 66 ESP OK [17:01:42] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 66 ESP OK [17:01:42] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 66 ESP OK [17:01:42] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 66 ESP OK [17:01:42] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 66 ESP OK [17:01:42] 
RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 66 ESP OK [17:01:51] RECOVERY - IPsec on kafka-jumbo1006 is OK: Strongswan OK - 136 ESP OK [17:01:51] RECOVERY - IPsec on kafka-jumbo1001 is OK: Strongswan OK - 136 ESP OK [17:01:52] RECOVERY - IPsec on kafka-jumbo1004 is OK: Strongswan OK - 136 ESP OK [17:01:52] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 66 ESP OK [17:01:52] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 66 ESP OK [17:01:52] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 66 ESP OK [17:02:01] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 66 ESP OK [17:02:01] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 66 ESP OK [17:02:31] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad dropped message count in last 30m on einsteinium is CRITICAL: CRITICAL - scalar(sum(increase(kafka_tools_MirrorMaker_MirrorMaker_numDroppedMessages{mirror_name=main-eqiad_to_jumbo-eqiad} [30m]))): 2097.0972689655173 1000.0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad%2520prometheus%252Fops&var-mirror_name=main-eqiad_to_jumbo-eqiad [17:02:31] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 66 ESP OK [17:03:31] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad dropped message count in last 30m on einsteinium is OK: OK - scalar(sum(increase(kafka_tools_MirrorMaker_MirrorMaker_numDroppedMessages{mirror_name=main-eqiad_to_jumbo-eqiad} [30m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad%2520prometheus%252Fops&var-mirror_name=main-eqiad_to_jumbo-eqiad [17:06:44] (03PS6) 10Vgutierrez: mtail: Provide ttfb histogram for varnishbackend [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) [17:06:54] hm [17:09:31] PROBLEM - Host cp2003 is DOWN: PING CRITICAL - Packet loss = 100% [17:16:11] RECOVERY - Host cp2003 is UP: PING OK - Packet loss = 0%, RTA = 36.14 ms [17:23:25] (03PS2) 10Jrdnch: Update to glibc >=2.19 [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/422425 (https://phabricator.wikimedia.org/T186250) [17:25:12] PROBLEM - Host cp2003 is DOWN: PING CRITICAL - Packet loss = 100% [17:28:47] Hi ops team - I'm about to deploy analytics-refinery (scheduled hadoop jobs conf) [17:29:07] No impact whatsoever on mediawiki side [17:29:25] (03PS9) 10Elukey: coal: be smarter about consuming from Kafka [puppet] - 10https://gerrit.wikimedia.org/r/421933 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier) [17:30:06] jouncebot: next [17:30:06] In 0 hour(s) and 29 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180328T1800) [17:30:14] (03PS3) 10Jrdnch: Update to glibc >=2.19 [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/422425 (https://phabricator.wikimedia.org/T186250) [17:30:19] joal: FYI --^ (useful) [17:30:43] (03CR) 10Elukey: [C: 032] coal: be smarter about consuming from Kafka [puppet] - 10https://gerrit.wikimedia.org/r/421933 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier) [17:31:48] elukey: I'll ask you more on this tomorrow I think :) [17:32:57] !log joal@tin Started deploy [analytics/refinery@7135d44]: Regular weekly analytics deploy - Scheduled hadoop jobs updates [17:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:17] !log joal@tin Finished deploy [analytics/refinery@7135d44]: Regular weekly analytics deploy - Scheduled hadoop jobs updates (duration: 05m 21s) [17:38:22] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:47] (03PS2) 10Andrew Bogott: toolforge: Add wikimedia.org to the CSP allowed list [puppet] - 10https://gerrit.wikimedia.org/r/422064 (https://phabricator.wikimedia.org/T130748) (owner: 10BryanDavis) [17:39:24] (03CR) 10Andrew Bogott: [C: 032] toolforge: Add wikimedia.org to the CSP allowed list [puppet] - 10https://gerrit.wikimedia.org/r/422064 (https://phabricator.wikimedia.org/T130748) (owner: 10BryanDavis) [17:41:03] PROBLEM - etcd request latencies on chlorine is CRITICAL: CRITICAL - scalar( sum(rate(etcd_request_latencies_summary_sum{ job=k8s-api,instance=10.64.0.45:6443}[5m]))/ sum(rate(etcd_request_latencies_summary_count{ job=k8s-api,instance=10.64.0.45:6443}[5m]))): 116328.3594890511 = 50000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:41:10] 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4088764 (10Tgr) [17:41:54] RECOVERY - etcd request latencies on chlorine is OK: OK - scalar( sum(rate(etcd_request_latencies_summary_sum{ job=k8s-api,instance=10.64.0.45:6443}[5m]))/ sum(rate(etcd_request_latencies_summary_count{ job=k8s-api,instance=10.64.0.45:6443}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:43:35] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests: Decommission db1011 - https://phabricator.wikimedia.org/T184703#4088779 (10Cmjohnson) [17:48:44] 10Operations, 10monitoring, 10Patch-For-Review, 10Services (watching): Add Reading Infrastructure engineers to contacts for RI-maintained services - https://phabricator.wikimedia.org/T189524#4088796 (10Mholloway) 05Open>03Resolved Ah, no need to worry about reading lists, then. Sorry for the partially... 
[17:49:39] 10Operations, 10monitoring, 10Patch-For-Review, 10Services (watching): Add Reading Infrastructure engineers to contacts for RI-maintained services - https://phabricator.wikimedia.org/T189524#4088802 (10Mholloway) [17:50:01] 10Operations, 10monitoring, 10Patch-For-Review, 10Services (watching): Add Reading Infrastructure engineers to contacts for mobileapps - https://phabricator.wikimedia.org/T189524#4044468 (10Mholloway) [17:52:37] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review: Decommission mw1196 - https://phabricator.wikimedia.org/T170441#4088811 (10Cmjohnson) [17:52:51] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review: Decommission mw1196 - https://phabricator.wikimedia.org/T170441#3431403 (10Cmjohnson) 05Open>03Resolved [17:54:03] PROBLEM - Host cp2006 is DOWN: PING CRITICAL - Packet loss = 100% [17:55:00] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests, and 3 others: Decommission restbase-test environment - https://phabricator.wikimedia.org/T186755#4088822 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson This was done awhile ago resolving [17:56:53] PROBLEM - Request latencies on neon is CRITICAL: CRITICAL - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.40:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.40:6443}[5m]))): 36140855.404015064 = 100000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:57:33] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [17:57:34] PROBLEM - IPsec on kafka-jumbo1005 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [17:57:43] PROBLEM - IPsec on kafka-jumbo1002 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [17:57:43] PROBLEM - IPsec on cp3010 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp2006_v4, cp2006_v6 [17:57:43] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [17:57:43] PROBLEM - IPsec on kafka-jumbo1003 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [17:57:53] RECOVERY - Request latencies on neon is OK: OK - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.40:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.40:6443}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:57:53] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [17:57:53] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [17:57:54] PROBLEM - IPsec on cp1058 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2006_v4, cp2006_v6 [17:57:54] PROBLEM - IPsec on cp1051 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2006_v4, cp2006_v6 [17:58:04] PROBLEM - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp2006_v4, cp2006_v6 [17:58:13] PROBLEM - IPsec on kafka-jumbo1006 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [17:58:14] PROBLEM - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2006_v4, cp2006_v6 [17:58:14] PROBLEM - IPsec on kafka-jumbo1004 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [17:58:14] PROBLEM - 
IPsec on kafka-jumbo1001 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [17:58:23] PROBLEM - IPsec on cp1045 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2006_v4, cp2006_v6 [17:58:23] PROBLEM - IPsec on kafka1023 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [17:58:23] PROBLEM - IPsec on cp3008 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp2006_v4, cp2006_v6 [17:58:23] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180328T1800) [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:00:34] twentyafterfour: now? [18:01:06] greg-g: cool [18:03:13] 10Operations, 10ops-eqiad, 10hardware-requests: decom iridium - https://phabricator.wikimedia.org/T172487#4088853 (10Cmjohnson) [18:03:53] !log deploying 1.31.0-wmf.27 to group0. group1 in an hour. See T183966 for blockers. [18:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:59] T183966: 1.31.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T183966 [18:06:07] (03PS1) 1020after4: group0 wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422448 [18:06:09] (03CR) 1020after4: [C: 032] group0 wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422448 (owner: 1020after4) [18:07:27] (03Merged) 10jenkins-bot: group0 wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422448 (owner: 1020after4) [18:08:03] RECOVERY - Host cp2006 is UP: PING WARNING - Packet loss = 61%, RTA = 36.10 ms [18:08:13] RECOVERY - IPsec on cp3007 is OK: Strongswan OK - 40 ESP OK [18:08:13] RECOVERY - IPsec on kafka-jumbo1006 is OK: Strongswan OK - 136 ESP OK [18:08:13] RECOVERY - IPsec on cp1061 is OK: Strongswan OK - 14 ESP OK [18:08:14] RECOVERY - IPsec on kafka-jumbo1004 is OK: Strongswan OK - 136 ESP OK [18:08:14] RECOVERY - IPsec on kafka-jumbo1001 is OK: Strongswan OK - 136 ESP OK [18:08:23] RECOVERY - IPsec on cp1045 is OK: Strongswan OK - 14 ESP OK [18:08:23] RECOVERY - IPsec on kafka1023 is OK: Strongswan OK - 136 ESP OK [18:08:23] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 136 ESP OK [18:08:24] RECOVERY - IPsec on cp3008 is OK: Strongswan OK - 40 ESP OK [18:08:33] (03CR) 10jenkins-bot: group0 wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422448 (owner: 1020after4) [18:08:33] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 136 ESP OK [18:08:34] RECOVERY - IPsec on kafka-jumbo1005 is OK: Strongswan OK - 136 ESP OK [18:08:43] RECOVERY - IPsec on kafka-jumbo1002 is OK: Strongswan OK - 136 ESP OK [18:08:43] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 136 ESP OK [18:08:44] RECOVERY - IPsec on kafka-jumbo1003 is OK: Strongswan OK - 136 ESP OK [18:08:44] RECOVERY - IPsec on cp3010 is OK: Strongswan OK - 40 ESP OK [18:08:53] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 136 ESP OK [18:08:54] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 136 ESP OK [18:08:54] RECOVERY - IPsec on cp1058 is OK: Strongswan OK - 14 ESP OK [18:08:54] RECOVERY - IPsec on cp1051 is OK: Strongswan OK - 14 ESP OK [18:09:13] PROBLEM - Host cp2009 is DOWN: PING CRITICAL - Packet loss = 100% [18:09:41] (03PS17) 10ArielGlenn: Store all dataset/dumps mirrors info in one hiera structure, and use it [puppet] - 10https://gerrit.wikimedia.org/r/419390 
(https://phabricator.wikimedia.org/T189657) [18:10:03] PROBLEM - Host cp2006 is DOWN: PING CRITICAL - Packet loss = 100% [18:10:42] (03CR) 10ArielGlenn: [C: 032] Store all dataset/dumps mirrors info in one hiera structure, and use it [puppet] - 10https://gerrit.wikimedia.org/r/419390 (https://phabricator.wikimedia.org/T189657) (owner: 10ArielGlenn) [18:11:42] (03Abandoned) 10Sbisson: kartotherian/tilerator: set Last-Modified header [puppet] - 10https://gerrit.wikimedia.org/r/421522 (https://phabricator.wikimedia.org/T187300) (owner: 10Sbisson) [18:12:13] !log upgrading restbase-dev1004-a to cassandra 3.11.2 (canary) -- T178905 [18:12:16] !log twentyafterfour@tin rebuilt and synchronized wikiversions files: group0 wikis to 1.31.0-wmf.27 [18:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:20] T178905: Evaluate new upstream Cassandra release: 3.11.2 - https://phabricator.wikimedia.org/T178905 [18:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:13] PROBLEM - IPsec on kafka-jumbo1006 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [18:15:14] PROBLEM - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2006_v4, cp2006_v6 [18:15:14] PROBLEM - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp2006_v4, cp2006_v6 [18:15:15] PROBLEM - IPsec on kafka-jumbo1004 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [18:15:15] PROBLEM - IPsec on kafka-jumbo1001 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [18:15:23] PROBLEM - IPsec on cp1045 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2006_v4, cp2006_v6 [18:15:23] PROBLEM - IPsec on kafka1023 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [18:15:24] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 133 no-child-sa: cp3040_v6 not-conn: cp2006_v4, cp2006_v6 [18:15:33] PROBLEM - IPsec on cp3008 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp2006_v4, cp2006_v6 [18:15:34] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [18:15:34] PROBLEM - IPsec on kafka-jumbo1005 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [18:15:43] PROBLEM - IPsec on kafka-jumbo1002 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [18:15:44] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [18:15:44] PROBLEM - IPsec on kafka-jumbo1003 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [18:15:53] PROBLEM - IPsec on cp3010 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp2006_v4, cp2006_v6 [18:15:54] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [18:15:54] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [18:15:54] PROBLEM - IPsec on cp1058 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2006_v4, cp2006_v6 [18:16:03] PROBLEM - IPsec on cp1051 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2006_v4, cp2006_v6 [18:16:23] RECOVERY - Host cp2009 is UP: PING OK - Packet loss = 0%, RTA = 36.94 ms [18:16:48] anything we need to be worried about wrt the train? 
^^ [18:17:12] cc mutante XioNoX ^ [18:17:17] !log upgrading restbase-dev1004-b to cassandra 3.11.2 (canary) -- T178905 [18:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:23] T178905: Evaluate new upstream Cassandra release: 3.11.2 - https://phabricator.wikimedia.org/T178905 [18:18:00] (03PS1) 10Cmjohnson: Removing mgmt dns db1030 [dns] - 10https://gerrit.wikimedia.org/r/422451 (https://phabricator.wikimedia.org/T184397) [18:18:27] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns db1030 [dns] - 10https://gerrit.wikimedia.org/r/422451 (https://phabricator.wikimedia.org/T184397) (owner: 10Cmjohnson) [18:18:31] greg-g: I'm pretty sure ipsec is unrelated to the train [18:19:25] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1030 - https://phabricator.wikimedia.org/T184397#4088932 (10Cmjohnson) [18:19:27] bblack: ^ [18:20:31] twentyafterfour: yeah, it's at a different layer, just making sure nothing larger is going on :) [18:21:43] (03PS1) 10Cmjohnson: Removing mgmt dns for db1001 [dns] - 10https://gerrit.wikimedia.org/r/422452 (https://phabricator.wikimedia.org/T190262) [18:22:26] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns for db1001 [dns] - 10https://gerrit.wikimedia.org/r/422452 (https://phabricator.wikimedia.org/T190262) (owner: 10Cmjohnson) [18:23:46] (03PS2) 10MusikAnimal: Enable PageAssessments on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421080 (https://phabricator.wikimedia.org/T184969) [18:25:00] (03PS1) 10ArielGlenn: clean up internal rsync client list for dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/422454 [18:25:46] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1001 - https://phabricator.wikimedia.org/T190262#4088955 (10Cmjohnson) [18:25:57] greg-g: the various cp2NNNN / ipsec alerts are of no operational consequence you should worry about. papaul's doing some hardware reboots in codfw to investigate memory issues. unfortunately at least the ipsec-spam part of it is relatively-unavoidable. [18:26:04] (03PS1) 10Cmjohnson: Removing mgmt dns for db1011 [dns] - 10https://gerrit.wikimedia.org/r/422455 (https://phabricator.wikimedia.org/T184703) [18:26:29] greg-g: (sorry for the noise!) 
[18:26:55] (03PS2) 10Cmjohnson: Removing mgmt dns for db1011 [dns] - 10https://gerrit.wikimedia.org/r/422455 (https://phabricator.wikimedia.org/T184703) [18:27:26] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns for db1011 [dns] - 10https://gerrit.wikimedia.org/r/422455 (https://phabricator.wikimedia.org/T184703) (owner: 10Cmjohnson) [18:28:21] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1011 - https://phabricator.wikimedia.org/T184703#4088963 (10Cmjohnson) [18:28:24] PROBLEM - IPsec on kafka1023 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [18:30:00] (03PS1) 10Cmjohnson: Removing mgmt dns from db1016 [dns] - 10https://gerrit.wikimedia.org/r/422459 (https://phabricator.wikimedia.org/T190179) [18:31:51] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns from db1016 [dns] - 10https://gerrit.wikimedia.org/r/422459 (https://phabricator.wikimedia.org/T190179) (owner: 10Cmjohnson) [18:35:28] bblack: s'ok, I just like double checking :) [18:36:43] (03PS3) 10BBlack: eqsin: turn-up BD, LK, NP, PK [dns] - 10https://gerrit.wikimedia.org/r/422396 (https://phabricator.wikimedia.org/T189252) [18:37:34] (03CR) 10BBlack: [C: 032] eqsin: turn-up BD, LK, NP, PK [dns] - 10https://gerrit.wikimedia.org/r/422396 (https://phabricator.wikimedia.org/T189252) (owner: 10BBlack) [18:39:51] (03PS3) 10Bstorm: wiki replicas: refactor and record grants and set user [puppet] - 10https://gerrit.wikimedia.org/r/422199 (https://phabricator.wikimedia.org/T181650) [18:43:05] 10Operations, 10ops-codfw, 10Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4088993 (10Papaul) Cp2003 result {F16367386} [18:43:33] RECOVERY - Host cp2003 is UP: PING OK - Packet loss = 0%, RTA = 36.06 ms [18:44:20] 10Operations, 10Deployments, 10Beta-Cluster-reproducible, 10HHVM, and 2 others: Switch mwscript from Zend PHP5 to default php alternative (e.g. HHVM or PHP7) - https://phabricator.wikimedia.org/T146285#4089001 (10mmodell) [18:52:33] !log upgrading restbase-dev1005-{a,b} to cassandra 3.11.2 -- T178905 [18:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:38] T178905: Evaluate new upstream Cassandra release: 3.11.2 - https://phabricator.wikimedia.org/T178905 [18:54:45] 10Operations, 10Ops-Access-Requests: Requesting access to stats machines for Lucas Werkmeister - https://phabricator.wikimedia.org/T190415#4089049 (10RobH) [18:55:24] 10Operations, 10Ops-Access-Requests: Requesting access to stats machines for Lucas Werkmeister - https://phabricator.wikimedia.org/T190415#4072956 (10RobH) @Lucas_Werkmeister_WMDE: I'll go ahead and prepare the patchsets, however we're still lacking a WMF staff sponsorship on this request. Is there a particul... [18:56:05] (03PS1) 10Rush: openstack: add nbd kernel module to compute nodes [puppet] - 10https://gerrit.wikimedia.org/r/422465 [18:58:03] (03CR) 10Rush: [C: 032] openstack: add nbd kernel module to compute nodes [puppet] - 10https://gerrit.wikimedia.org/r/422465 (owner: 10Rush) [19:00:04] twentyafterfour: How many deployers does it take to do MediaWiki train deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180328T1900). [19:00:04] No GERRIT patches in the queue for this window AFAICS. 
[19:01:15] 10Operations, 10Ops-Access-Requests: Requesting access to stats machines for Lucas Werkmeister - https://phabricator.wikimedia.org/T190415#4072956 (10Nuria) Can you be a bit more explicit on your request? >I want to run long-running queries, e. g. to analyze usage of the WikibaseQualityConstraints extension Y... [19:02:16] jouncebot: one deployer and a whole posse of bots [19:02:51] !log restore elasticsearch eqiad disk high/low watermarks to 75/80% with all large reindexes complete [19:02:52] (03PS1) 1020after4: group1 wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422466 [19:02:54] (03CR) 1020after4: [C: 032] group1 wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422466 (owner: 1020after4) [19:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:30] * twentyafterfour imagines that jouncebot is listening. [19:03:43] learning even [19:04:12] (03Merged) 10jenkins-bot: group1 wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422466 (owner: 1020after4) [19:04:46] (03PS1) 10Ottomata: check_kafka_consumer_log - STOP != alert, just bursty topics [puppet] - 10https://gerrit.wikimedia.org/r/422467 (https://phabricator.wikimedia.org/T189611) [19:05:14] (03CR) 10jerkins-bot: [V: 04-1] check_kafka_consumer_log - STOP != alert, just bursty topics [puppet] - 10https://gerrit.wikimedia.org/r/422467 (https://phabricator.wikimedia.org/T189611) (owner: 10Ottomata) [19:05:23] !log twentyafterfour@tin rebuilt and synchronized wikiversions files: group1 wikis to 1.31.0-wmf.27 [19:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:58] (03PS1) 10Chad: Adding zuul for building [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/422469 [19:06:00] (03CR) 10Chad: [C: 032] Adding zuul for building [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/422469 (owner: 10Chad) [19:06:21] (03CR) 10BBlack: wmf-config: Enable oversampling for remaining countries in Asia (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422419 (https://phabricator.wikimedia.org/T189252) (owner: 10Imarlier) [19:06:41] !log twentyafterfour@tin Synchronized php: group1 wikis to 1.31.0-wmf.27 (duration: 01m 17s) [19:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:16] (03CR) 10jenkins-bot: group1 wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422466 (owner: 1020after4) [19:09:10] !log milimetric@tin Started deploy [analytics/refinery@c22fd1e]: (no justification provided) [19:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:24] !log milimetric@tin Started deploy [analytics/refinery@c22fd1e]: Fixing python import bug [19:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:30] (03PS2) 10Ottomata: check_kafka_consumer_log - STOP != alert, just bursty topics [puppet] - 10https://gerrit.wikimedia.org/r/422467 (https://phabricator.wikimedia.org/T189611) [19:10:34] (03CR) 10Ottomata: [C: 032] check_kafka_consumer_log - STOP != alert, just bursty topics [puppet] - 10https://gerrit.wikimedia.org/r/422467 (https://phabricator.wikimedia.org/T189611) (owner: 10Ottomata) [19:12:12] !log milimetric@tin Finished deploy [analytics/refinery@c22fd1e]: Fixing python import bug (duration: 02m 48s) [19:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[19:17:42] I'm seeing quite a few "[{exception_id}] {exception_url} Wikimedia\Rdbms\DBExpectedError from line 924 of /srv/mediawiki/php-1.31.0-wmf.27/includes/libs/rdbms/database/DatabaseMysqlBase.php: Replication wait failed: Lost connection to MySQL server during query (10.64.48.172) [19:17:43] 15 [19:17:46] " [19:18:23] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:18:32] 30+ in the last 5 minutes which is quite a bit more than the error rate prior to the train [19:18:36] and then there is that [19:18:40] ^ [19:19:06] rolling back to wmf.26 [19:20:04] !log Rolling back to wmf.26 due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query" [19:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:19] (03PS1) 1020after4: group1 wikis to 1.31.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422472 [19:20:21] (03CR) 1020after4: [C: 032] group1 wikis to 1.31.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422472 (owner: 1020after4) [19:21:47] (03Merged) 10jenkins-bot: group1 wikis to 1.31.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422472 (owner: 1020after4) [19:22:05] (03CR) 10jenkins-bot: group1 wikis to 1.31.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422472 (owner: 1020after4) [19:22:49] (03PS1) 10Ottomata: Increase main -> jumbo MirrorMaker num.streams to 12 [puppet] - 10https://gerrit.wikimedia.org/r/422473 (https://phabricator.wikimedia.org/T189464) [19:22:50] !log twentyafterfour@tin rebuilt and synchronized wikiversions files: group1 wikis to 1.31.0-wmf.26 [19:22:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:35] any dbas around to help me figure out what's wrong with wmf.27? [19:23:47] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:24:04] ^ those are all the same "Replication wait failed: lost connection to MySQL server during query" error [19:24:08] !log twentyafterfour@tin Synchronized php: group1 wikis to 1.31.0-wmf.26 (duration: 01m 17s) [19:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:09] twentyafterfour: they're out (time of day && holidays) [19:27:19] (03CR) 10Ottomata: [C: 032] Increase main -> jumbo MirrorMaker num.streams to 12 [puppet] - 10https://gerrit.wikimedia.org/r/422473 (https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata) [19:28:54] greg-g: great [19:29:09] (03PS1) 10Rush: openstack: neutron l3-agent custom iptables behavior [puppet] - 10https://gerrit.wikimedia.org/r/422474 (https://phabricator.wikimedia.org/T167357) [19:29:43] (03CR) 10jerkins-bot: [V: 04-1] openstack: neutron l3-agent custom iptables behavior [puppet] - 10https://gerrit.wikimedia.org/r/422474 (https://phabricator.wikimedia.org/T167357) (owner: 10Rush) [19:29:44] looks like the errors are all from RefreshLinksJob: /srv/mediawiki/php-1.31.0-wmf.27/includes/jobqueue/jobs/RefreshLinksJob.php [19:30:16] twentyafterfour: Where in tendril does it show the queries which are causing the fatals that show up in logstash? I'm trying to get the hang of tendril. 
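[Editor's aside on the error quoted at 19:17] The "Replication wait failed: Lost connection to MySQL server during query" fatal comes from the commit-then-wait-for-replication pattern: MediaWiki commits a write transaction, then blocks on a replica until it has caught up, which on MariaDB is the `SELECT MASTER_GTID_WAIT(...)` slow query that shows up later in this log at 22:13. The sketch below is not the actual Wikimedia\Rdbms code; host names, credentials and the 10-second timeout are hypothetical stand-ins, and only the use of `MASTER_GTID_WAIT()` and the replica IP 10.64.48.172 are taken from the log itself.

```php
<?php
// Minimal sketch of the commit-then-wait pattern behind the error above.
// NOT the real DatabaseMysqlBase/LoadBalancer code; hosts and credentials
// are placeholders.

// 1. Write on the master inside a short transaction.
$master = new mysqli( 'db-master.example.net', 'wikiuser', 'secret', 'enwiki' );
$master->begin_transaction();
$master->query( "UPDATE page SET page_touched = NOW() WHERE page_id = 12345" );
$master->commit();

// 2. Record the master's replication position (GTID on MariaDB).
$pos = $master->query( "SELECT @@gtid_binlog_pos AS pos" )->fetch_assoc()['pos'];

// 3. Block on a replica until it has applied that position, or give up
//    after 10 seconds. MASTER_GTID_WAIT() returns 0 on success, -1 on
//    timeout and NULL on error; if the replica connection drops while the
//    wait is still running, the client instead sees the quoted
//    "Lost connection to MySQL server during query" failure.
$replica = new mysqli( '10.64.48.172', 'wikiuser', 'secret', 'enwiki' );
$result = $replica->query(
	"SELECT MASTER_GTID_WAIT('" . $replica->real_escape_string( $pos ) . "', 10) AS waited"
);
var_dump( $result->fetch_assoc() );
```

As the discussion below notes, the wait failing is the symptom rather than the cause: the interesting question is why the replica fell behind or dropped the connection in the first place.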
[19:30:23] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:30:46] Niharika: I'm not sure, I'm coming at this from the opposite direction - from kibana [19:30:57] twentyafterfour: Ah, okay, makes sense. [19:31:14] I followed the php stack traces back to RefreshLinksJob line 258 [19:32:32] (03PS2) 10Rush: openstack: neutron l3-agent custom iptables behavior [puppet] - 10https://gerrit.wikimedia.org/r/422474 (https://phabricator.wikimedia.org/T167357) [19:32:53] commitAndWaitForReplication [19:33:21] (03CR) 10jerkins-bot: [V: 04-1] openstack: neutron l3-agent custom iptables behavior [puppet] - 10https://gerrit.wikimedia.org/r/422474 (https://phabricator.wikimedia.org/T167357) (owner: 10Rush) [19:33:38] so refreshlinks opens a transaction for "runForTitle" and then the commitAndWaitForReplication times out [19:36:45] 10Operations, 10Traffic, 10Patch-For-Review: Investigate Chrony as a replacement for ISC ntpd - https://phabricator.wikimedia.org/T177742#4089162 (10BBlack) Looking at ntp::chrony now as I've noticed the above dns5001 switch. There seems to be nothing in there about local peering, or about clock consistency... [19:38:02] (03CR) 10BBlack: [C: 04-1] "https://phabricator.wikimedia.org/T177742#4089162 ?" [puppet] - 10https://gerrit.wikimedia.org/r/422387 (https://phabricator.wikimedia.org/T177742) (owner: 10Muehlenhoff) [19:38:19] (03PS3) 10Rush: openstack: neutron l3-agent custom iptables behavior [puppet] - 10https://gerrit.wikimedia.org/r/422474 (https://phabricator.wikimedia.org/T167357) [19:38:53] (03CR) 10jerkins-bot: [V: 04-1] openstack: neutron l3-agent custom iptables behavior [puppet] - 10https://gerrit.wikimedia.org/r/422474 (https://phabricator.wikimedia.org/T167357) (owner: 10Rush) [19:42:43] niharika: I just went to the server by finding it's ip .. there I can see that there was a big jump in implicit temporary tables [19:42:47] https://tendril.wikimedia.org/host/view/db1109.eqiad.wmnet/3306 [19:43:09] twentyafterfour: And how did you pick the server? [19:43:42] twentyafterfour: And how do we see which queries caused that jump? [19:43:45] In tendril. [19:44:03] Niharika: the error message mentions 10.64.48.172 so I searched the page to find the db server with that IP. as for what query caused that jump, I'm not sure [19:44:08] the php code is opaque [19:44:22] (03PS1) 10Dzahn: install_server: set deploy1001 to use jessie [puppet] - 10https://gerrit.wikimedia.org/r/422479 (https://phabricator.wikimedia.org/T175288) [19:44:24] (03PS4) 10Rush: openstack: neutron l3-agent custom iptables behavior [puppet] - 10https://gerrit.wikimedia.org/r/422474 (https://phabricator.wikimedia.org/T167357) [19:45:08] I'd have expected tendril to be able to show me queries for a given timestamp but that doesn't seem so. [19:45:21] nope [19:45:29] it has a slow query log [19:45:36] https://tendril.wikimedia.org/report/slow_queries?host=family%3Adb1109&hours=1 [19:45:46] Yeah but that's not very useful, is it? [19:45:52] For cases like these. 
[19:45:53] not really [19:46:33] 11k implicit temp tables is pretty extreme (with the baseline < 3k) [19:47:03] but I can't figure out what query is involved or anything else that might help pinpoint the cause [19:48:56] (03PS2) 10Dzahn: install_server: set deploy1001 to use jessie [puppet] - 10https://gerrit.wikimedia.org/r/422479 (https://phabricator.wikimedia.org/T175288) [19:49:24] (03PS3) 10Dzahn: install_server: set deploy1001 to use jessie [puppet] - 10https://gerrit.wikimedia.org/r/422479 (https://phabricator.wikimedia.org/T175288) [19:49:40] (03CR) 10Rush: [C: 032] openstack: neutron l3-agent custom iptables behavior [puppet] - 10https://gerrit.wikimedia.org/r/422474 (https://phabricator.wikimedia.org/T167357) (owner: 10Rush) [19:50:03] (03CR) 10Dzahn: [C: 032] install_server: set deploy1001 to use jessie [puppet] - 10https://gerrit.wikimedia.org/r/422479 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [19:50:09] (03PS4) 10Dzahn: install_server: set deploy1001 to use jessie [puppet] - 10https://gerrit.wikimedia.org/r/422479 (https://phabricator.wikimedia.org/T175288) [19:52:41] twentyafterfour: https://tendril.wikimedia.org/report/slow_queries_checksum?checksum=2d03f574e8b789aec61dc623f2b45ad2&host=family%3Adb1109&user=&schema=&hours=1%2F32 [19:52:46] This might be the one? [19:53:10] But it's not useful much. [19:54:29] !log deploy1001 - schedule downtime for reinstall with jessie, reinstalling (T175288) [19:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:35] T175288: setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288 [20:00:04] Niharika: likely yes [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: Dear deployers, time to do the Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180328T2000). [20:00:05] No GERRIT patches in the queue for this window AFAICS. [20:00:28] Niharika: it does look like locking code is where the problem surfaced [20:05:19] Niharika: I created a task https://phabricator.wikimedia.org/T190960 [20:05:47] greg-g: should this be high or ubn? 
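[Editor's aside on the implicit-temporary-table spike above] Tendril's graph is fed by the server-side `Created_tmp_*` status counters, and the usual way to pin the spike on a specific statement is to `EXPLAIN` candidates from the slow query log and look for "Using temporary". A rough sketch follows; the host, credentials and the example query are placeholders, not the actual offender.

```php
<?php
// Sketch of chasing an implicit-temp-table spike like the one on db1109.
// Hypothetical host/credentials; the EXPLAINed query is just an example.
$db = new mysqli( 'db1109.example.net', 'wikiuser', 'secret', 'enwiki' );

// 1. The counters behind the "implicit temporary tables" graph:
//    Created_tmp_tables / Created_tmp_disk_tables increase once for every
//    statement that needed an internal temporary table.
$status = $db->query( "SHOW GLOBAL STATUS LIKE 'Created_tmp%'" );
while ( $row = $status->fetch_assoc() ) {
	printf( "%s = %s\n", $row['Variable_name'], $row['Value'] );
}

// 2. For a candidate statement (e.g. pulled from the slow query log),
//    EXPLAIN shows whether it is an offender: "Using temporary" in the
//    Extra column means it builds an implicit temp table on every run.
$explain = $db->query(
	"EXPLAIN SELECT pl_namespace, COUNT(*) FROM pagelinks GROUP BY pl_namespace"
);
while ( $row = $explain->fetch_assoc() ) {
	echo $row['Extra'] . "\n";
}
```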
[20:06:02] it's definitely a critical train blocker but it's not an outage [20:06:14] UBN as it's blocking the train [20:09:36] I pinged AaronSchulz on the task since there's no DBAs around [20:09:36] !log mlitn@tin Started deploy [3d2png/deploy@c447488]: Updating 3d2png [20:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:02] !log mlitn@tin Finished deploy [3d2png/deploy@c447488]: Updating 3d2png (duration: 02m 26s) [20:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:08] (03PS1) 10Rush: openstack: bootstrapping neutron l3 agent for floating ip [puppet] - 10https://gerrit.wikimedia.org/r/422489 (https://phabricator.wikimedia.org/T188266) [20:22:40] (03PS2) 10Rush: openstack: bootstrapping neutron l3 agent for floating ip [puppet] - 10https://gerrit.wikimedia.org/r/422489 (https://phabricator.wikimedia.org/T188266) [20:24:23] (03CR) 10Rush: [C: 032] openstack: bootstrapping neutron l3 agent for floating ip [puppet] - 10https://gerrit.wikimedia.org/r/422489 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [20:44:42] !log bsitzmann@tin Started deploy [mobileapps/deploy@6a0d877]: Update mobileapps to a5833a0 [20:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:18] !log bsitzmann@tin Finished deploy [mobileapps/deploy@6a0d877]: Update mobileapps to a5833a0 (duration: 05m 36s) [20:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:41] (03CR) 10Chad: [V: 032 C: 032] Adding zuul for building [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/422469 (owner: 10Chad) [20:55:49] (03PS1) 10Chad: Use stable-2.14 for zuul [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/422550 [20:55:51] (03CR) 10Chad: [C: 032] Use stable-2.14 for zuul [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/422550 (owner: 10Chad) [20:56:21] (03CR) 10Paladox: [C: 031] Use stable-2.14 for zuul [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/422550 (owner: 10Chad) [21:00:53] jouncebot: next [21:01:07] I swear I never get that command right [21:01:10] In 1 hour(s) and 58 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180328T2300) [21:01:21] slow bots get beaten [21:02:15] when skynet comes to kill us all, it will be due to this throwaway remark of mine in a publically logged channel [21:04:35] (03PS1) 10Dzahn: Revert "mwscript: Detect php across distros" [puppet] - 10https://gerrit.wikimedia.org/r/422554 [21:05:51] (03PS1) 10Rush: openstack: add manual bridges for linux bridge agent [puppet] - 10https://gerrit.wikimedia.org/r/422555 (https://phabricator.wikimedia.org/T188266) [21:06:23] (03CR) 10jerkins-bot: [V: 04-1] openstack: add manual bridges for linux bridge agent [puppet] - 10https://gerrit.wikimedia.org/r/422555 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [21:06:27] (03CR) 10Chad: [V: 032 C: 032] Use stable-2.14 for zuul [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/422550 (owner: 10Chad) [21:07:08] (03PS2) 10Rush: openstack: add manual bridges for linux bridge agent [puppet] - 10https://gerrit.wikimedia.org/r/422555 (https://phabricator.wikimedia.org/T188266) [21:07:32] (03CR) 10jerkins-bot: [V: 04-1] openstack: add manual bridges for linux bridge agent [puppet] - 10https://gerrit.wikimedia.org/r/422555 
(https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [21:08:40] Krinkle: Do you remember why we have 404.html for secure.wm.o and why it can't use 404.php like the other wikis? [21:08:47] I can't find any other users of 404.html [21:09:28] (03PS3) 10Rush: openstack: add manual bridges for linux bridge agent [puppet] - 10https://gerrit.wikimedia.org/r/422555 (https://phabricator.wikimedia.org/T188266) [21:09:51] (03CR) 10jerkins-bot: [V: 04-1] openstack: add manual bridges for linux bridge agent [puppet] - 10https://gerrit.wikimedia.org/r/422555 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [21:11:19] (03PS4) 10Rush: openstack: add manual bridges for linux bridge agent [puppet] - 10https://gerrit.wikimedia.org/r/422555 (https://phabricator.wikimedia.org/T188266) [21:11:42] (03CR) 10jerkins-bot: [V: 04-1] openstack: add manual bridges for linux bridge agent [puppet] - 10https://gerrit.wikimedia.org/r/422555 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [21:12:21] no_justification: https://secure.wikimedia.org/404.hml vs https://meta.wikimedia.org/404.hml [21:12:28] I suppose the main difference is that secure isn't a wiki. [21:12:39] (03PS1) 10Chad: Forgot to add zuul to custom_plugins [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/422557 [21:12:41] (03CR) 10Chad: [C: 032] Forgot to add zuul to custom_plugins [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/422557 (owner: 10Chad) [21:12:43] (03CR) 10Chad: [V: 032 C: 032] Forgot to add zuul to custom_plugins [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/422557 (owner: 10Chad) [21:12:54] (03PS5) 10Rush: openstack: add manual bridges for linux bridge agent [puppet] - 10https://gerrit.wikimedia.org/r/422555 (https://phabricator.wikimedia.org/T188266) [21:13:03] no_justification: and unlike other non-wiki domains on apaches (like www.wikimedia, www.wikipedia etc) there isn [21:13:14] isn't an obvious wiki for /wiki/ to redirect to [21:13:15] (yet) [21:13:25] Given the "Try /wiki/X" is part of the 404.php thing [21:13:42] Meh, fair nuff. [21:13:43] If we make /wiki/ redirect on secure thehn I'd be +1 for killing it [21:13:49] https://phabricator.wikimedia.org/T113114 [21:14:27] It also used to be used on bits [21:14:33] But yeah I thnk now it's just secure [21:14:45] (03PS6) 10Rush: openstack: add manual bridges for linux bridge agent [puppet] - 10https://gerrit.wikimedia.org/r/422555 (https://phabricator.wikimedia.org/T188266) [21:15:20] I wanna remove the symlink and put it straight in the secure docroot if nothing else uses it [21:15:37] I can't find any pointers to it outside the apache config for secure-only [21:15:43] no_justification: Hm.. that may be tricky from Apache config perspective. It's in the default right? 
We'd need another fallback [21:15:51] Unless we want to inverse it and make 404.php the fallback [21:15:58] (03CR) 10Rush: [C: 032] openstack: add manual bridges for linux bridge agent [puppet] - 10https://gerrit.wikimedia.org/r/422555 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [21:15:59] also https://wikitech.wikimedia.org/d4 should use it [21:16:21] and https://noc.wikimedia.org/1313 should arguably use 404.html [21:16:23] but doesn't right now [21:16:30] Sorry [21:17:03] wikitech shouldn't as it's becoming a normal wiki [21:20:11] (03PS2) 10Dzahn: Revert "mwscript: Detect php across distros" [puppet] - 10https://gerrit.wikimedia.org/r/422554 [21:22:25] (03CR) 10Dzahn: [C: 032] "we are currently back on tin, so everything is php5 like before and deploy1001 is reinstalled with jessie for now.." [puppet] - 10https://gerrit.wikimedia.org/r/422554 (owner: 10Dzahn) [21:23:42] no_justification: Yeah, I meant wikitech should use 404.php [21:23:45] rght now it's apache default [21:37:59] (03CR) 10Dzahn: "@Alex that's exactly what i thought and did first, but then the stylecheck voted me down.. and it didn't in the past.. which is what made " [puppet] - 10https://gerrit.wikimedia.org/r/420143 (owner: 10Dzahn) [21:42:22] !log getting the train back on track, group1 wikis to 1.31.0-wmf.27 [21:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:40] (03PS1) 1020after4: group1 wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422563 [21:42:42] (03CR) 1020after4: [C: 032] group1 wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422563 (owner: 1020after4) [21:44:09] (03Merged) 10jenkins-bot: group1 wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422563 (owner: 1020after4) [21:44:13] (03PS1) 10Rush: openstack: neutron bridge set default to undef instead of '' [puppet] - 10https://gerrit.wikimedia.org/r/422564 [21:46:40] (03CR) 10Rush: [C: 032] openstack: neutron bridge set default to undef instead of '' [puppet] - 10https://gerrit.wikimedia.org/r/422564 (owner: 10Rush) [21:46:45] (03PS2) 10Rush: openstack: neutron bridge set default to undef instead of '' [puppet] - 10https://gerrit.wikimedia.org/r/422564 [21:48:27] (03CR) 10jenkins-bot: group1 wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422563 (owner: 1020after4) [21:49:14] (03PS1) 10Rush: Revert "openstack: neutron bridge set default to undef instead of ''" [puppet] - 10https://gerrit.wikimedia.org/r/422568 [21:49:38] (03CR) 10Rush: [C: 032] Revert "openstack: neutron bridge set default to undef instead of ''" [puppet] - 10https://gerrit.wikimedia.org/r/422568 (owner: 10Rush) [21:49:47] (03CR) 10Rush: [V: 032 C: 032] Revert "openstack: neutron bridge set default to undef instead of ''" [puppet] - 10https://gerrit.wikimedia.org/r/422568 (owner: 10Rush) [21:50:29] 10Operations, 10Beta-Cluster-Infrastructure, 10User-Ladsgroup: Remove uca-fa from beta cluster - https://phabricator.wikimedia.org/T190965#4089423 (10Ladsgroup) p:05Triage>03High [21:52:08] (03PS3) 10Dzahn: site: enable mapped IPv6 on bromine/vega [puppet] - 10https://gerrit.wikimedia.org/r/420143 [21:52:39] (03CR) 10jerkins-bot: [V: 04-1] site: enable mapped IPv6 on bromine/vega [puppet] - 10https://gerrit.wikimedia.org/r/420143 (owner: 10Dzahn) [21:52:41] (03PS1) 10Ladsgroup: labs: Change category collataion of fawiki back to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422570 
(https://phabricator.wikimedia.org/T190965) [21:52:46] (03PS1) 10Chad: Apache: Move all private wikis to a single vhost block [puppet] - 10https://gerrit.wikimedia.org/r/422571 [21:52:52] Krinkle: ^^^ <3 [21:53:14] !log deploy1001 - revoking old puppet certs and signing new ones [21:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:57] (03PS2) 10Chad: Apache: Move all private wikis to a single vhost block [puppet] - 10https://gerrit.wikimedia.org/r/422571 [21:54:03] PROBLEM - nutcracker process on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:03] PROBLEM - nutcracker port on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:04] PROBLEM - Check whether ferm is active by checking the default input chain on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:13] PROBLEM - DPKG on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:13] PROBLEM - Confd template for /etc/dsh/group/jobrunner on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:13] PROBLEM - configured eth on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:14] PROBLEM - Check size of conntrack table on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:23] PROBLEM - Unmerged changes on repository mediawiki_config on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:23] PROBLEM - MD RAID on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:24] PROBLEM - Confd template for /etc/dsh/group/cassandra on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:43] PROBLEM - Confd template for /etc/dsh/group/ores on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:43] PROBLEM - Confd template for /etc/dsh/group/maps on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:43] PROBLEM - Confd template for /etc/dsh/group/zotero-translation-server on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:44] PROBLEM - Confd template for /etc/dsh/group/mediawiki-installation on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:44] PROBLEM - Disk space on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:44] PROBLEM - Keyholder SSH agent on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:47] (03CR) 10Ladsgroup: [C: 032] labs: Change category collataion of fawiki back to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422570 (https://phabricator.wikimedia.org/T190965) (owner: 10Ladsgroup) [21:54:53] PROBLEM - confd service on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:53] PROBLEM - dhclient process on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:53] PROBLEM - Confd template for /etc/dsh/group/parsoid on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:54] PROBLEM - Check systemd state on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:54] PROBLEM - Confd template for /etc/dsh/group/zotero-translators on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:56:01] (03Merged) 10jenkins-bot: labs: Change category collataion of fawiki back to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422570 (https://phabricator.wikimedia.org/T190965) (owner: 10Ladsgroup) [21:56:12] woo [21:56:33] PROBLEM - puppet last run on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:57:25] group1 wikis to 1.31.0-wmf.27 is merged but not 
rebased on tin, is it okay if I rebase tin anyway? [21:57:31] twentyafterfour: ^ [21:57:56] Amir1: still mid-deploy [21:58:24] I'm waiting on jenkins to merge https://gerrit.wikimedia.org/r/#/c/422565/' [21:58:37] (03CR) 10jenkins-bot: labs: Change category collataion of fawiki back to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422570 (https://phabricator.wikimedia.org/T190965) (owner: 10Ladsgroup) [21:58:46] okay, I thought it's ended (the deployment calendar was like it) [21:59:01] train got delayed today [21:59:02] just keep in mind that is mine labs: Change category collataion of fawiki back to default [21:59:09] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:59:43] don't be surprised if you see changes related to that. sorry for interrupting [21:59:48] it's labs only change [21:59:54] Amir1: ok [22:00:22] Thanks [22:01:42] (03CR) 10Krinkle: [C: 031] "LGTM. Confirmed all the same hosts are still in there (order changed slightly)" [puppet] - 10https://gerrit.wikimedia.org/r/422571 (owner: 10Chad) [22:03:43] !log syncing https://gerrit.wikimedia.org/r/#/c/422565/ refs T190960 T183966 [22:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:50] T183966: 1.31.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T183966 [22:03:50] T190960: 1.31.0-wmf.27 rolled back due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query" - https://phabricator.wikimedia.org/T190960 [22:05:00] Amir1: would you like me to sync https://gerrit.wikimedia.org/r/#/c/422570/ ? [22:05:29] !log twentyafterfour@tin Synchronized php-1.31.0-wmf.27/includes/: sync https://gerrit.wikimedia.org/r/#/c/422565/ (duration: 02m 15s) [22:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:19] PROBLEM - Check the NTP synchronisation status of timesyncd on deploy1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
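[Editor's aside on the 404.html / 404.php thread from around 21:15] The "Try /wiki/X" behaviour that makes 404.php unsuitable for secure.wikimedia.org is just a hint pointing at a wiki page that might exist. The sketch below is a simplified stand-in for the real 404.php in mediawiki-config, not a copy of it; it only assumes the host and path that the web server passes in, and illustrates why a domain with no wiki behind it (secure) falls back to a static 404.html instead.

```php
<?php
// Simplified "Try /wiki/X" 404 handler; illustrative only, not the real
// mediawiki-config 404.php.
$host = isset( $_SERVER['HTTP_HOST'] ) ? $_SERVER['HTTP_HOST'] : 'meta.wikimedia.org';
$path = isset( $_SERVER['REQUEST_URI'] ) ? $_SERVER['REQUEST_URI'] : '/';

http_response_code( 404 );
header( 'Content-Type: text/html; charset=utf-8' );

// Offer a /wiki/<title> guess when the request looks like a bare page name.
// A non-wiki domain has nowhere sensible to point this hint, which is why
// secure.wikimedia.org keeps a plain 404.html instead.
$title = trim( $path, '/' );
$hint = '';
if ( $title !== '' && strpos( $title, 'wiki/' ) !== 0 ) {
	$hint = '<p>Try <a href="https://' . htmlspecialchars( $host ) . '/wiki/' .
		htmlspecialchars( rawurlencode( $title ) ) . '">/wiki/' .
		htmlspecialchars( $title ) . '</a></p>';
}
echo "<!DOCTYPE html><html><body><h1>Not Found</h1>{$hint}</body></html>";
```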
[22:08:49] !log rolling forward group1 to 1.31.0-wmf.27 refs T183966 T190960 [22:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:55] T183966: 1.31.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T183966 [22:08:55] T190960: 1.31.0-wmf.27 rolled back due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query" - https://phabricator.wikimedia.org/T190960 [22:09:21] !log twentyafterfour@tin rebuilt and synchronized wikiversions files: sync https://gerrit.wikimedia.org/r/#/c/422563/ group1 wikis to 1.31.0-wmf.27 refs T183966 T190960 [22:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:10] (03PS4) 10Dzahn: site: enable mapped IPv6 on bromine/vega [puppet] - 10https://gerrit.wikimedia.org/r/420143 [22:10:38] (03CR) 10jerkins-bot: [V: 04-1] site: enable mapped IPv6 on bromine/vega [puppet] - 10https://gerrit.wikimedia.org/r/420143 (owner: 10Dzahn) [22:10:49] now I'm seeing [{exception_id}] {exception_url} Wikimedia\Rdbms\DBTransactionSizeError from line 1293 of /srv/mediawiki/php-1.31.0-wmf.26/includes/libs/rdbms/loadbalancer/LoadBalancer.php: Transaction spent 6.4027202129364 second(s) in writes, exceeding the limit of 3 [22:10:57] greg-g AaronSchulz ^ [22:12:25] rolling back again [22:12:31] twentyafterfour: It doesn't need sync [22:12:38] AFAIK [22:13:01] (03PS1) 1020after4: group1 wikis to 1.31.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422576 [22:13:03] (03CR) 1020after4: [C: 032] group1 wikis to 1.31.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422576 (owner: 1020after4) [22:13:37] twentyafterfour: :( [22:13:56] !log deploy of 1.31.0-wmf.27 resulted in a lot of SlowTimer errors for SlowTimer [10000ms] at runtime/ext_mysql: slow query: SELECT MASTER_GTID_WAIT(...) [22:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:22] (03Merged) 10jenkins-bot: group1 wikis to 1.31.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422576 (owner: 1020after4) [22:15:40] (03PS5) 10Dzahn: site: enable mapped IPv6 on bromine/vega [puppet] - 10https://gerrit.wikimedia.org/r/420143 [22:15:41] these errors are the symptom not the cause [22:16:09] (03CR) 10jerkins-bot: [V: 04-1] site: enable mapped IPv6 on bromine/vega [puppet] - 10https://gerrit.wikimedia.org/r/420143 (owner: 10Dzahn) [22:16:12] !log twentyafterfour@tin rebuilt and synchronized wikiversions files: group1 wikis to 1.31.0-wmf.26 [22:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:19] something else is causing database slowness [22:16:29] probably in unrelated part of the codebase [22:16:42] sorry for the icinga alerts that shouldnt have been here [22:16:57] no reason to worry about deploy1001. i got it [22:17:09] mutante: I'm also getting a bunch of scap errors from deploy1001: Permission denied (publickey,keyboard-interactive). [22:17:28] twentyafterfour: why would that be if we are back to tin? [22:17:30] !log twentyafterfour@tin Synchronized php: group1 wikis to 1.31.0-wmf.26 (duration: 01m 18s) [22:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:41] because scap is still trying to connect to the co-masters and the ssh keys changed? 
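[Editor's aside on the DBTransactionSizeError quoted at 22:10] That error is a guard rail rather than a database failure: the write phase of a transaction is timed, and a commit that spent longer writing than the configured limit (3 seconds in the message) is reported. The class below is only a sketch of that idea, not the real LoadBalancer.php logic; the 3-second default mirrors the error message and everything else is illustrative.

```php
<?php
// Sketch of a transaction write-time guard, similar in spirit to the check
// behind DBTransactionSizeError. Not the actual MediaWiki implementation.
class WriteTimeGuard {
	private $db;
	private $limit;
	private $writeSeconds = 0.0;

	public function __construct( mysqli $db, $limitSeconds = 3.0 ) {
		$this->db = $db;
		$this->limit = $limitSeconds;
	}

	public function write( $sql ) {
		// Accumulate wall-clock time spent in write statements.
		$start = microtime( true );
		$this->db->query( $sql );
		$this->writeSeconds += microtime( true ) - $start;
	}

	public function commit() {
		$this->db->commit();
		if ( $this->writeSeconds > $this->limit ) {
			throw new RuntimeException( sprintf(
				'Transaction spent %.3f second(s) in writes, exceeding the limit of %.0f',
				$this->writeSeconds, $this->limit
			) );
		}
	}
}
```

Long-running jobs avoid tripping this kind of limit by batching their writes and committing (and waiting for replication) between batches, which is exactly the commitAndWaitForReplication path discussed earlier.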
[22:17:55] oh, that,yes [22:18:02] it's still on the puppet run [22:18:04] or the scap user isn't authorized to connect to deploy1001 anymore for whatever reason [22:18:15] ok no big deal scap continues and ignores the errors [22:18:18] it's setting up the things right now [22:18:21] (03CR) 10jenkins-bot: group1 wikis to 1.31.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422576 (owner: 1020after4) [22:18:48] it is just kind of slow, i was hoping to have it done earlier [22:21:59] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [22:22:54] wow, that looks bad [22:23:10] twentyafterfour: cc ^ [22:23:27] but looks like it dropped again [22:23:56] heh, you haven't seen how it's when it's really bad :P [22:23:59] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [22:24:11] pheew. nice to see that follow-up [22:24:59] that alert was a bit delayed :) [22:25:02] MaxSem: oh, I zoomed out and found the all time high score: 74k :D [22:25:29] yeah icinga is slow to report fatals per minute. both times I spotted the problem in fatalmonitor before icinga-wm could alert [22:25:44] we might want to increase the sensitivity of that alert? [22:26:59] sounds like a good idea [22:27:04] (03CR) 10Chad: "Should probably alphabetize them tbh" [puppet] - 10https://gerrit.wikimedia.org/r/422571 (owner: 10Chad) [22:27:22] not just icinga config, also also how long it takes to be in graphite and to get enough data to diff it afaict [22:27:36] also also also [22:27:38] but yea, sure possible [22:27:42] :P [22:28:08] the usual worry is signal/noise of course, but we can always experiment [22:28:21] yea. "let icinga check graphite" also can have downsides [22:34:10] twentyafterfour: is your deployment done? I had a little window:) [22:35:05] (03PS3) 10Bstorm: toolforge: Add tmpreaper with a custom config to web nodes [puppet] - 10https://gerrit.wikimedia.org/r/422186 (https://phabricator.wikimedia.org/T190185) [22:35:08] yeah :( we're rolled back [22:35:11] MaxSem: yes the train is rolled back and probably not going to be resolved too soon [22:35:40] @jouncebot: reload [22:35:44] (03CR) 10Bstorm: [C: 032] toolforge: Add tmpreaper with a custom config to web nodes [puppet] - 10https://gerrit.wikimedia.org/r/422186 (https://phabricator.wikimedia.org/T190185) (owner: 10Bstorm) [22:35:51] jouncebot: refresh [22:35:52] I refreshed my knowledge about deployments. [22:35:57] @jouncebot: last [22:36:02] jouncebot: last [22:36:11] bleh [22:37:05] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4089633 (10Krinkle) Thanks @Vgutierrez ! [22:38:26] jouncebot: now [22:38:26] No deployments scheduled for the next 0 hour(s) and 21 minute(s) [22:41:46] so, just a little warning [22:41:54] deploy1001 is still not done with an initial puppet run [22:42:07] but it's a scap host [22:42:22] not the deployment server for sure, just a host like others [22:42:37] so scap will say it cant connect to it.. but then continue [22:42:51] i am watching it finish the install thoguh.. 
so soon it should be fixed [22:43:05] (03CR) 10Madhuvishy: [C: 031] clean up internal rsync client list for dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/422454 (owner: 10ArielGlenn) [22:43:07] hopefully within 21 min [22:43:28] bleh [22:46:33] if not i can remove it from the dsh group too [22:49:24] (03CR) 10Dzahn: [C: 031] "+1 but DBAs have to approve" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422362 (https://phabricator.wikimedia.org/T102005) (owner: 10Krinkle) [22:50:52] (03CR) 10Dzahn: "just wanted to let you know recently i removed another use-case of this module in mw-deployment" [puppet] - 10https://gerrit.wikimedia.org/r/391849 (https://phabricator.wikimedia.org/T162070) (owner: 10Jcrespo) [22:51:30] (03CR) 10Dzahn: [C: 032] "also see https://gerrit.wikimedia.org/r/391849" [puppet] - 10https://gerrit.wikimedia.org/r/421197 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [22:53:50] 10Operations, 10Wikimedia-Apache-configuration, 10Performance-Team (Radar): VirtualHost for mod_status breaks debugging Apache/MediaWiki from localhost - https://phabricator.wikimedia.org/T190111#4089711 (10Krinkle) Forgot to say: The aforementioned workaround is not actually a workaround (sorry). The hostna... [22:55:39] 10Operations, 10hardware-requests, 10Release-Engineering-Team (Watching / External): eqiad: replacement tin/deployment server - https://phabricator.wikimedia.org/T174452#4089715 (10Dzahn) [22:55:48] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#4089714 (10Dzahn) 05Resolved>03Open [22:57:03] (03PS4) 10Bstorm: wiki replicas: refactor and record grants and set user [puppet] - 10https://gerrit.wikimedia.org/r/422199 (https://phabricator.wikimedia.org/T181650) [22:58:03] (03PS1) 10EBernhardson: Configure next Cirrus AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422585 (https://phabricator.wikimedia.org/T187148) [22:58:15] (03PS1) 10Bstorm: wiki replicas: trying moving hieradata around [labs/private] - 10https://gerrit.wikimedia.org/r/422586 [22:58:19] 10Operations, 10monitoring, 10Patch-For-Review, 10Services (watching): Add Reading Infrastructure engineers to contacts for mobileapps - https://phabricator.wikimedia.org/T189524#4089727 (10Dzahn) I think the only thing left would have been to test if you can also execute commands like "schedule downtime"... [22:58:41] (03CR) 10Bstorm: [V: 032 C: 032] wiki replicas: trying moving hieradata around [labs/private] - 10https://gerrit.wikimedia.org/r/422586 (owner: 10Bstorm) [23:00:04] MaxSem: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Can't make it to SWAT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180328T2300). [23:00:04] musikanimal and ebernhardson: A patch you scheduled for Can't make it to SWAT is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180328T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. 
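[Editor's aside on the 22:25–22:28 discussion of alert sensitivity and "let icinga check graphite"] A Graphite-backed check has to wait for the metric to land in Graphite and then for enough datapoints to evaluate a percentage-over-threshold rule, which is why it reacts later than watching fatalmonitor directly. The sketch below shows the general shape of such a check; the Graphite host and metric path are placeholders (the log does not name the exact metric behind the mediawiki-graphite-alerts dashboard), while the 50.0 threshold and "70% of data above" wording mirror the alert text seen earlier.

```php
<?php
// Sketch of a Graphite-backed threshold check; hostname and metric name are
// hypothetical, thresholds mirror the alert quoted in the log.
$url = 'https://graphite.example.org/render?' . http_build_query( [
	'target' => 'MediaWiki.errors.fatal.rate',   // hypothetical metric path
	'from'   => '-10min',
	'format' => 'json',
] );
$series = json_decode( file_get_contents( $url ), true )[0]['datapoints'];

// Drop empty buckets: the newest datapoints are often still null, one of the
// reasons this style of check lags behind the live error stream.
$values = array_filter( array_column( $series, 0 ), 'is_numeric' );
$above  = count( array_filter( $values, function ( $v ) { return $v > 50; } ) );
$pct    = $values ? 100 * $above / count( $values ) : 0;

if ( $pct >= 70 ) {
	echo "CRITICAL: {$pct}% of data above the critical threshold [50.0]\n";
} elseif ( $pct > 0 ) {
	echo "WARNING: some datapoints above the threshold\n";
} else {
	echo "OK\n";
}
```

Tightening the alert means trading the percentage, window length and threshold against noise from short bursts, which is the signal-to-noise concern raised in the conversation above.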
[23:00:31] I'm here [23:00:48] RECOVERY - IPsec on kafka-jumbo1002 is OK: Strongswan OK - 136 ESP OK [23:00:48] RECOVERY - Host cp2006 is UP: PING OK - Packet loss = 0%, RTA = 36.14 ms [23:00:48] RECOVERY - IPsec on cp3007 is OK: Strongswan OK - 40 ESP OK [23:00:49] RECOVERY - IPsec on kafka-jumbo1003 is OK: Strongswan OK - 136 ESP OK [23:00:49] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 136 ESP OK [23:00:49] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 136 ESP OK [23:00:59] \o [23:00:59] RECOVERY - IPsec on cp3008 is OK: Strongswan OK - 40 ESP OK [23:00:59] RECOVERY - IPsec on cp1061 is OK: Strongswan OK - 14 ESP OK [23:01:03] these are due to cp2006 coming back [23:01:08] RECOVERY - IPsec on cp1045 is OK: Strongswan OK - 14 ESP OK [23:01:18] RECOVERY - IPsec on kafka-jumbo1006 is OK: Strongswan OK - 136 ESP OK [23:01:19] RECOVERY - IPsec on cp3010 is OK: Strongswan OK - 40 ESP OK [23:01:19] RECOVERY - IPsec on kafka-jumbo1004 is OK: Strongswan OK - 136 ESP OK [23:01:19] RECOVERY - IPsec on kafka-jumbo1001 is OK: Strongswan OK - 136 ESP OK [23:01:19] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 136 ESP OK [23:01:28] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 136 ESP OK [23:01:28] RECOVERY - IPsec on kafka1023 is OK: Strongswan OK - 136 ESP OK [23:01:38] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 136 ESP OK [23:01:48] RECOVERY - IPsec on cp1051 is OK: Strongswan OK - 14 ESP OK [23:01:48] RECOVERY - IPsec on cp1058 is OK: Strongswan OK - 14 ESP OK [23:01:48] RECOVERY - IPsec on kafka-jumbo1005 is OK: Strongswan OK - 136 ESP OK [23:01:51] (03CR) 10EBernhardson: [C: 032] Enable PageAssessments on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421080 (https://phabricator.wikimedia.org/T184969) (owner: 10MusikAnimal) [23:03:00] (03Merged) 10jenkins-bot: Enable PageAssessments on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421080 (https://phabricator.wikimedia.org/T184969) (owner: 10MusikAnimal) [23:03:14] (03CR) 10jenkins-bot: Enable PageAssessments on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421080 (https://phabricator.wikimedia.org/T184969) (owner: 10MusikAnimal) [23:03:27] MaxSem: deploy1001 is cloning mw-config as we speak.. i can take it out of scap list.. just unfortunate timing.. so close but not yet [23:03:39] musikanimal: you're up on mwdebug1001 [23:03:45] mutante: we could delay like 30 minutes i suppose? [23:04:31] ebernhardson: let's say 15 and i prepare a patch to remove it in case we need it? [23:04:38] ok [23:06:29] (03PS1) 10Dzahn: remove deploy1001 from dsh hosts and scap masters [puppet] - 10https://gerrit.wikimedia.org/r/422587 [23:08:12] last time it was 12,000 seconds.. meh. ok [23:10:33] (03CR) 10Dzahn: [C: 032] remove deploy1001 from dsh hosts and scap masters [puppet] - 10https://gerrit.wikimedia.org/r/422587 (owner: 10Dzahn) [23:11:33] really? wow..thats 3.5 hours [23:11:47] yea, but i started hours ago too [23:12:01] and of course it is adding the needed user and keyholder about 5 seconds after i merged , lol [23:12:07] lol [23:12:48] ebernhardson: would you be able to sync a single host to the rest? [23:13:18] !log created PageAssessments tables on trwiki [23:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:43] musikanimal: test again [23:14:41] ebernhardson: we are good to sync! [23:15:13] mutante: hmm, i think that's how it works? deployment syncs to canarys, then app servers sync from canarys? 
[23:16:02] ebernhardson: ok, so you should just do what you were plannning to do [23:16:09] and not be blocked by me [23:16:11] it's removed on tin [23:16:31] it can be re-added and synced later [23:16:44] mutante: ok [23:17:18] RECOVERY - Confd template for /etc/dsh/group/jobrunner on deploy1001 is OK: No errors detected [23:17:18] RECOVERY - configured eth on deploy1001 is OK: OK - interfaces up [23:17:18] RECOVERY - DPKG on deploy1001 is OK: All packages OK [23:17:18] RECOVERY - MD RAID on deploy1001 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [23:17:19] RECOVERY - Check size of conntrack table on deploy1001 is OK: OK: nf_conntrack is 0 % full [23:17:25] hahaha, this is hilarious [23:17:29] RECOVERY - Unmerged changes on repository mediawiki_config on deploy1001 is OK: No changes to merge. [23:17:29] RECOVERY - Confd template for /etc/dsh/group/ores on deploy1001 is OK: No errors detected [23:17:29] the timing parts [23:17:38] RECOVERY - Confd template for /etc/dsh/group/zotero-translation-server on deploy1001 is OK: No errors detected [23:17:38] RECOVERY - Disk space on deploy1001 is OK: DISK OK [23:17:38] RECOVERY - Confd template for /etc/dsh/group/mediawiki-installation on deploy1001 is OK: No errors detected [23:17:39] RECOVERY - Confd template for /etc/dsh/group/maps on deploy1001 is OK: No errors detected [23:17:39] RECOVERY - confd service on deploy1001 is OK: OK - confd is active [23:17:48] RECOVERY - dhclient process on deploy1001 is OK: PROCS OK: 0 processes with command name dhclient [23:17:48] RECOVERY - Confd template for /etc/dsh/group/parsoid on deploy1001 is OK: No errors detected [23:17:58] RECOVERY - Check systemd state on deploy1001 is OK: OK - running: The system is fully operational [23:17:58] RECOVERY - Confd template for /etc/dsh/group/zotero-translators on deploy1001 is OK: No errors detected [23:17:59] RECOVERY - nutcracker process on deploy1001 is OK: PROCS OK: 1 process with UID = 114 (nutcracker), command name nutcracker [23:17:59] RECOVERY - nutcracker port on deploy1001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [23:18:08] RECOVERY - Check whether ferm is active by checking the default input chain on deploy1001 is OK: OK ferm input default policy is set [23:18:08] RECOVERY - Confd template for /etc/dsh/group/cassandra on deploy1001 is OK: No errors detected [23:18:37] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: T184969: Enable PageAssessments on trwiki (duration: 01m 09s) [23:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:43] T184969: Deploy PageAssessments to Turkish Wikipedia - https://phabricator.wikimedia.org/T184969 [23:18:47] musikanimal: ^ please test [23:19:03] (03PS1) 10Dzahn: Revert "remove deploy1001 from dsh hosts and scap masters" [puppet] - 10https://gerrit.wikimedia.org/r/422588 [23:19:06] (03CR) 10EBernhardson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422585 (https://phabricator.wikimedia.org/T187148) (owner: 10EBernhardson) [23:19:08] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1001 is OK: Files ownership is ok. [23:22:51] (03PS1) 10Jdlrobson: Rollout VirtualPageViews (final stage) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422589 (https://phabricator.wikimedia.org/T189906) [23:23:45] ebernhardson: looks good :) [23:24:02] blank as it should be https://tr.wikipedia.org/wiki/%C3%96zel:PageAssessments [23:24:28] musikanimal: great! 
[23:25:25] (03PS2) 10EBernhardson: Configure next Cirrus AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422585 (https://phabricator.wikimedia.org/T187148) [23:25:37] (03CR) 10EBernhardson: [C: 032] Configure next Cirrus AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422585 (https://phabricator.wikimedia.org/T187148) (owner: 10EBernhardson) [23:26:29] RECOVERY - puppet last run on deploy1001 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [23:26:52] (03Merged) 10jenkins-bot: Configure next Cirrus AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422585 (https://phabricator.wikimedia.org/T187148) (owner: 10EBernhardson) [23:27:16] thank you! [23:28:39] (03CR) 10jenkins-bot: Configure next Cirrus AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422585 (https://phabricator.wikimedia.org/T187148) (owner: 10EBernhardson) [23:28:47] (03PS3) 10Madhuvishy: nfsclient: Setup dumps mounts from new servers [puppet] - 10https://gerrit.wikimedia.org/r/403767 (https://phabricator.wikimedia.org/T188643) [23:29:12] (03CR) 10jerkins-bot: [V: 04-1] nfsclient: Setup dumps mounts from new servers [puppet] - 10https://gerrit.wikimedia.org/r/403767 (https://phabricator.wikimedia.org/T188643) (owner: 10Madhuvishy) [23:33:13] (03PS4) 10Madhuvishy: nfsclient: Setup dumps mounts from new servers [puppet] - 10https://gerrit.wikimedia.org/r/403767 (https://phabricator.wikimedia.org/T188643) [23:37:24] RECOVERY - Check the NTP synchronisation status of timesyncd on deploy1001 is OK: OK: synced at Wed 2018-03-28 23:37:16 UTC. [23:38:17] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: T187148: Configure next Cirrus AB test (duration: 01m 16s) [23:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:23] T187148: Evaluate features provided by `query_explorer` functionality of ltr plugin - https://phabricator.wikimedia.org/T187148 [23:39:01] SWAT complete [23:40:39] ok, re-adding deploy1001 [23:40:42] (03CR) 10Dzahn: [C: 032] Revert "remove deploy1001 from dsh hosts and scap masters" [puppet] - 10https://gerrit.wikimedia.org/r/422588 (owner: 10Dzahn) [23:42:20] eh.. puppet run is broken on tin.. what [23:44:51] no, it's not, i was just confused [23:58:40] 10Operations: build new version of mcrouter package - https://phabricator.wikimedia.org/T190979#4089852 (10Dzahn)