[00:00:04] <jouncebot>	 addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170216T0000). Please do the needful.
[00:00:04] <jouncebot>	 Krinkle: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:16] <Krinkle>	 o/
[00:00:23] <MaxSem>	 \m/
[00:00:55] <MaxSem>	 doing
[00:04:05] <MaxSem>	 Krinkle, pulled on mwdebug1002
[00:04:15] <icinga-wm>	 RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[00:04:19] <Krinkle>	 MaxSem: OK. verifying now..
[00:05:40] <Krinkle>	 MaxSem: Doesn't appear to be applied.
[00:06:50] <Krinkle>	 MaxSem: Hm.. let me try again
[00:06:55] <wikibugs>	 (03PS4) 10Dzahn: redirects.dat - split non-canonical to separate section [puppet] - 10https://gerrit.wikimedia.org/r/292785 (https://phabricator.wikimedia.org/T133548) (owner: 10BBlack)
[00:07:52] <wikibugs>	 (03CR) 10Dzahn: [C: 031] "needed manual rebase - done" [puppet] - 10https://gerrit.wikimedia.org/r/292785 (https://phabricator.wikimedia.org/T133548) (owner: 10BBlack)
[00:08:00] <MaxSem>	 Krinkle, the server FS has the right version. Caching?
[00:08:08] <MaxSem>	 Wrong MW version?
[00:08:12] <Krinkle>	 Ah, need to test on group0 only 
[00:08:15] <Krinkle>	 yeah, 1min
[00:09:13] <thcipriani>	 yeah, I just got a backport out for wmf.12 late in the day. Wanted to make sure errors cleared. It's now late, so wmf.12 still on group0 only.
[00:09:58] <MaxSem>	 thcipriani, will train finish tomorrow?
[00:10:48] <Krinkle>	 MaxSem: OK. Good. It's verified and works as expected.
[00:10:49] <thcipriani>	 MaxSem: planning on it. I'll move it forward in my morning to group1 and then push to group2 in the normal window
[00:10:51] <Krinkle>	 (verified on test and test2)
[00:12:24] <logmsgbot>	 !log maxsem@tin Synchronized php-1.29.0-wmf.12/extensions/Gadgets: https://gerrit.wikimedia.org/r/#/c/338004/ (duration: 00m 42s)
[00:12:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:29] <MaxSem>	 Krinkle, ^
[00:15:10] <Krinkle>	 MaxSem: Thanks
[00:15:17] <MaxSem>	 :)
[00:15:31] <Krinkle>	 thcipriani: Any issues outstanding blocking the roll out?
[00:15:34] <Krinkle>	 Or did we get them all
[00:15:45] <icinga-wm>	 RECOVERY - puppet last run on db1021 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[00:15:52] <Krinkle>	 There were a few about rdbms stuff. I've merged them in master. Haent' kept track of which have/haven't been backported
[00:16:23] <thcipriani>	 Krinkle: no outstanding issues, backported and deployed the last one, looks like there haven't been new errors.
[00:16:28] <Krinkle>	 okay
[00:17:18] <thcipriani>	 porting the instanceof solution to wmf.12 caused it to blow up in wmf.11 when I moved it forward for one of them. Hopefully the version change in the key will ensure that doesn't happen when I roll forward in the morning.
[00:17:57] <mutante>	 jouncebot: now
[00:17:57] <jouncebot>	 For the next 0 hour(s) and 42 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170216T0000)
[00:19:14] <MaxSem>	 mutante, I'm done with SWAT
[00:19:35] <mutante>	 ok :) i was just going to use mwdebug1001 to test something
[00:19:43] <mutante>	 and then revert to before
[00:27:46] <wikibugs>	 (03CR) 10Dzahn: [C: 031] "alright, i tried to actually test this" [puppet] - 10https://gerrit.wikimedia.org/r/292785 (https://phabricator.wikimedia.org/T133548) (owner: 10BBlack)
[00:28:37] <wikibugs>	 (03CR) 10Dzahn: [C: 031] "that is "testing 209 urls on 1 servers" btw" [puppet] - 10https://gerrit.wikimedia.org/r/292785 (https://phabricator.wikimedia.org/T133548) (owner: 10BBlack)
[00:29:12] <wikibugs>	 (03PS4) 10Dzahn: partman: delete raid1-lvm-ext4 recipe [puppet] - 10https://gerrit.wikimedia.org/r/337532 (https://phabricator.wikimedia.org/T156955)
[00:33:19] <wikibugs>	 (03CR) 10Dzahn: "what about "Requires=network.target". you don't use that but the "working example" has it." [puppet] - 10https://gerrit.wikimedia.org/r/338022 (owner: 10Paladox)
[00:35:42] <wikibugs>	 (03CR) 10Dzahn: "does "before apache" work? do both services come up after a reboot of the machine?" [puppet] - 10https://gerrit.wikimedia.org/r/338022 (owner: 10Paladox)
[00:48:35] <icinga-wm>	 PROBLEM - puppet last run on db1079 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:53:33] <wikibugs>	 (03PS1) 10Jcrespo: Repool db1082 with low load after crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338037 (https://phabricator.wikimedia.org/T158188)
[00:57:15] <wikibugs>	 (03CR) 10Jcrespo: [C: 04-2] "See: I566e46bdbdca7fbe5 When a server crashes, its BP is not dumped properly." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337860 (owner: 10Jcrespo)
[00:57:30] <wikibugs>	 (03CR) 10Jcrespo: [V: 032 C: 032] Repool db1082 with low load after crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338037 (https://phabricator.wikimedia.org/T158188) (owner: 10Jcrespo)
[00:57:49] <wikibugs>	 (03CR) 10jenkins-bot: Repool db1082 with low load after crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338037 (https://phabricator.wikimedia.org/T158188) (owner: 10Jcrespo)
[00:59:16] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1082 with low load (duration: 00m 41s)
[00:59:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:04] <jouncebot>	 twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170216T0100).
[01:03:55] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[01:10:55] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds
[01:13:55] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: No anomaly detected
[01:16:35] <icinga-wm>	 RECOVERY - puppet last run on db1079 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[01:35:36] <wikibugs>	 06Operations, 10fundraising-tech-ops: set up SSL cert monitoring for benefactorevents.wm.o - https://phabricator.wikimedia.org/T156850#3031904 (10Dzahn) a:03Dzahn
[01:35:41] <wikibugs>	 06Operations, 10fundraising-tech-ops: set up SSL cert monitoring for benefactorevents.wm.o - https://phabricator.wikimedia.org/T156850#2987822 (10Dzahn) 05Open>03Resolved
[01:36:27] <wikibugs>	 (03CR) 10Dzahn: "17:13 < icinga-wm> PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on conti" [puppet] - 10https://gerrit.wikimedia.org/r/337552 (https://phabricator.wikimedia.org/T70113) (owner: 10Hashar)
[02:31:56] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[02:33:06] <logmsgbot>	 !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.11) (duration: 11m 46s)
[02:33:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:37:55] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: No anomaly detected
[02:40:55] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 16 data above and 0 below the confidence bounds
[02:42:29] <niedzielski>	 o/ does anyone know if renaming a Wikimedia GitHub repo would break mirroring? I believe GitHub redirects old URL usages but I wondered if anyone knew
[02:46:55] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 16 data above and 0 below the confidence bounds
[03:05:19] <logmsgbot>	 !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.12) (duration: 14m 27s)
[03:05:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:06:25] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp3040 is CRITICAL: CRITICAL: expiry mailbox lag is 28355
[03:06:55] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: No anomaly detected
[03:07:25] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp3040 is OK: OK: expiry mailbox lag is 8
[03:11:01] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Feb 16 03:11:01 UTC 2017 (duration 5m 42s)
[03:11:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:15:23] <wikibugs>	 (03CR) 10Krinkle: "Yeah, if invoking clean --keep-static, we shouldn't remove the branch pointer probably. Rather, that would be done when later invoked anot" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336901 (owner: 10Chad)
[03:33:05] <icinga-wm>	 PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.219 second response time
[03:46:50] <wikibugs>	 (03PS1) 10Krinkle: navtiming: Make tests easier to extend [puppet] - 10https://gerrit.wikimedia.org/r/338044
[03:47:39] <wikibugs>	 (03CR) 10Krinkle: "Perhaps you'd like to rebase on I699c61e3ae20e which would make it easier to add the json objects below their ua-string equivalents. It wo" [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) (owner: 10Nuria)
[03:48:54] <wikibugs>	 (03PS2) 10Krinkle: navtiming: Make tests easier to extend [puppet] - 10https://gerrit.wikimedia.org/r/338044
[03:49:09] <wikibugs>	 (03PS2) 10Krinkle: Enable wgEnableWANCacheReaper in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335704 (owner: 10Aaron Schulz)
[03:55:18] <wikibugs>	 (03CR) 10Krinkle: [C: 032] Enable wgEnableWANCacheReaper in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335704 (owner: 10Aaron Schulz)
[03:56:45] <wikibugs>	 (03Merged) 10jenkins-bot: Enable wgEnableWANCacheReaper in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335704 (owner: 10Aaron Schulz)
[03:56:53] <wikibugs>	 (03CR) 10jenkins-bot: Enable wgEnableWANCacheReaper in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335704 (owner: 10Aaron Schulz)
[03:57:55] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 0 below the confidence bounds
[04:00:05] <icinga-wm>	 RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.212 second response time
[04:00:55] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 0 below the confidence bounds
[04:15:55] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 29 data above and 0 below the confidence bounds
[05:03:55] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[05:04:45] <icinga-wm>	 PROBLEM - puppet last run on db1065 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:11:55] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 0 below the confidence bounds
[05:16:55] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 0 below the confidence bounds
[05:32:45] <icinga-wm>	 RECOVERY - puppet last run on db1065 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[05:39:36] <icinga-wm>	 PROBLEM - puppet last run on db1052 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:50:25] <icinga-wm>	 PROBLEM - Disk space on labnet1001 is CRITICAL: DISK CRITICAL - free space: / 1420 MB (3% inode=93%)
[06:07:35] <icinga-wm>	 RECOVERY - puppet last run on db1052 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:10:55] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: No anomaly detected
[06:28:55] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 0 below the confidence bounds
[06:32:55] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 18 data above and 0 below the confidence bounds
[07:10:25] <icinga-wm>	 PROBLEM - Disk space on labnet1001 is CRITICAL: DISK CRITICAL - free space: / 1418 MB (3% inode=93%)
[07:11:55] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 28 data above and 2 below the confidence bounds
[07:12:55] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: No anomaly detected
[07:27:41] <wikibugs>	 06Operations, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-InterwikiSorting, 10Wikidata, and 4 others: Deploy InterwikiSorting extension to beta - https://phabricator.wikimedia.org/T155995#2960964 (10Nikerabbit) This broke the compact language links based on comment T153900#3011037. I'm submitting...
[07:29:35] <icinga-wm>	 PROBLEM - puppet last run on mc1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:29:55] <icinga-wm>	 PROBLEM - Disk space on elastic1029 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 60567 MB (12% inode=99%)
[07:33:51] <wikibugs>	 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#3032318 (10Marostegui) Thanks @Papaul! I will get that ready!
[07:34:25] <icinga-wm>	 PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:34:45] <icinga-wm>	 PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed
[07:37:36] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Increase load db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338067 (https://phabricator.wikimedia.org/T158188)
[07:39:45] <icinga-wm>	 RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active
[07:39:49] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase load db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338067 (https://phabricator.wikimedia.org/T158188) (owner: 10Marostegui)
[07:40:19] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 04-1] "Please hold that for now. I'll be doing an exhaustive review of all privileged LDAP groups soon (T129788), if it's all fine I'll merge aft" [puppet] - 10https://gerrit.wikimedia.org/r/333024 (owner: 10Addshore)
[07:40:25] <icinga-wm>	 RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational
[07:41:32] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Increase load db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338067 (https://phabricator.wikimedia.org/T158188) (owner: 10Marostegui)
[07:41:35] <icinga-wm>	 PROBLEM - Disk space on elastic1028 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 62222 MB (12% inode=99%)
[07:41:41] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Increase load db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338067 (https://phabricator.wikimedia.org/T158188) (owner: 10Marostegui)
[07:41:43] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 04-1] "Should be based on the recently merged generic solution added in 336420 (and consequently only enabled for 2001 initially)." [puppet] - 10https://gerrit.wikimedia.org/r/336408 (https://phabricator.wikimedia.org/T157429) (owner: 10Hashar)
[07:43:12] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase load db1082 - T158188 (duration: 00m 42s)
[07:43:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:17] <stashbot>	 T158188: db1082 MySQL crashed - https://phabricator.wikimedia.org/T158188
[07:46:31] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338068
[07:46:55] <icinga-wm>	 RECOVERY - Disk space on elastic1029 is OK: DISK OK
[07:48:19] <wikibugs>	 (03PS2) 10Marostegui: Revert "db-codfw.php: Depool db2060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338068
[07:50:40] <wikibugs>	 (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338068 (owner: 10Marostegui)
[07:51:58] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338068 (owner: 10Marostegui)
[07:52:06] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338068 (owner: 10Marostegui)
[07:54:12] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2060 - T156161 (duration: 00m 44s)
[07:54:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:17] <stashbot>	 T156161: db2060 not accessible - https://phabricator.wikimedia.org/T156161
[07:54:48] <wikibugs>	 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: db2060 not accessible - https://phabricator.wikimedia.org/T156161#3032346 (10Marostegui) I have repooled the server.
[07:56:35] <icinga-wm>	 RECOVERY - puppet last run on mc1022 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[07:57:55] <icinga-wm>	 PROBLEM - Disk space on elastic1029 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 62153 MB (12% inode=99%)
[08:10:04] <wikibugs>	 06Operations, 10Wikimedia-Logstash: Get 5xx logs into kibana/logstash - https://phabricator.wikimedia.org/T149451#3032362 (10fgiunchedi) >>! In T149451#2864911, @Ottomata wrote: > We could set up a special varnishkafka instance for this, if that makes sense.  But, hm, I think using kafkatee would be better!  k...
[08:13:55] <icinga-wm>	 RECOVERY - Disk space on elastic1029 is OK: DISK OK
[08:18:45] <icinga-wm>	 PROBLEM - puppet last run on db1021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:19:35] <icinga-wm>	 RECOVERY - Disk space on elastic1028 is OK: DISK OK
[08:30:41] <wikibugs>	 06Operations: Separate dc ops group in pwstore - https://phabricator.wikimedia.org/T158285#3032373 (10MoritzMuehlenhoff)
[08:31:36] <wikibugs>	 06Operations, 10ops-eqiad, 10DBA: Replace BBU for db1060 - https://phabricator.wikimedia.org/T158194#3032388 (10Marostegui) @Cmjohnson were you able to find a replacement BBU in the end? Thanks!
[08:32:55] <wikibugs>	 06Operations, 10ops-eqiad: Degraded RAID on db1060 - https://phabricator.wikimedia.org/T158193#3032396 (10Marostegui) 05Open>03stalled Wait for this to happen before we replace any disks on this task: https://phabricator.wikimedia.org/T158194
[08:37:42] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Restore db1082 original load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338078 (https://phabricator.wikimedia.org/T158188)
[08:39:43] <godog>	 !log roll-restart jobrunner in codfw/eqiad to pick up fluorine -> mwlog1001 redis change - T123728
[08:39:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:48] <stashbot>	 T123728: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728
[08:41:36] <wikibugs>	 06Operations, 10Scap, 13Patch-For-Review, 06Release-Engineering-Team (Long-Lived-Branches): Make git 2.2.0+ (preferably 2.8.x) available - https://phabricator.wikimedia.org/T140927#3032403 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi This is completed, `use_experimental` can be removed once deploy...
[08:42:17] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore db1082 original load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338078 (https://phabricator.wikimedia.org/T158188) (owner: 10Marostegui)
[08:43:49] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Restore db1082 original load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338078 (https://phabricator.wikimedia.org/T158188) (owner: 10Marostegui)
[08:44:03] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Restore db1082 original load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338078 (https://phabricator.wikimedia.org/T158188) (owner: 10Marostegui)
[08:44:25] <icinga-wm>	 PROBLEM - Check systemd state on mw2161 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:44:54] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore origina db1082 weight - T158188 (duration: 00m 41s)
[08:44:55] <icinga-wm>	 PROBLEM - Check systemd state on mw2153 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:44:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:58] <stashbot>	 T158188: db1082 MySQL crashed - https://phabricator.wikimedia.org/T158188
[08:45:04] <wikibugs>	 (03PS1) 10Muehlenhoff: Add one more LDAP user [puppet] - 10https://gerrit.wikimedia.org/r/338082
[08:45:45] <icinga-wm>	 RECOVERY - puppet last run on db1021 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[08:46:14] <wikibugs>	 (03CR) 10Muehlenhoff: [V: 032 C: 032] Add one more LDAP user [puppet] - 10https://gerrit.wikimedia.org/r/338082 (owner: 10Muehlenhoff)
[08:46:25] <icinga-wm>	 RECOVERY - Check systemd state on mw2161 is OK: OK - running: The system is fully operational
[08:47:55] <icinga-wm>	 RECOVERY - Check systemd state on mw2153 is OK: OK - running: The system is fully operational
[08:49:08] <godog>	 the systemd fail was some jobrunners not restarting in the salt run in codfw, fixed and now doing qeqiad
[08:49:35] <icinga-wm>	 PROBLEM - Check systemd state on mw2249 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:50:35] <icinga-wm>	 RECOVERY - Check systemd state on mw2249 is OK: OK - running: The system is fully operational
[08:50:54] <wikibugs>	 06Operations, 10DBA, 13Patch-For-Review: db1082 MySQL crashed - https://phabricator.wikimedia.org/T158188#3032411 (10Marostegui)
[08:50:57] <wikibugs>	 06Operations, 10DBA, 13Patch-For-Review: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#3032410 (10Marostegui)
[08:51:17] <wikibugs>	 06Operations, 10DBA, 13Patch-For-Review: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2633433 (10Marostegui) I have added the subtask of the last crash of this server, so we can have some tracking as it's been twice already.
[08:52:19] <wikibugs>	 06Operations, 10DBA, 13Patch-For-Review: db1082 MySQL crashed - https://phabricator.wikimedia.org/T158188#3029269 (10Marostegui) I will close this ticket after restoring the original weight for this server.  Also added a parent task, which is the first crash this server had back in September (T145533). It wi...
[08:52:33] <wikibugs>	 06Operations, 10DBA, 13Patch-For-Review: db1082 MySQL crashed - https://phabricator.wikimedia.org/T158188#3032416 (10Marostegui) 05Open>03Resolved a:03Marostegui
[08:52:35] <wikibugs>	 06Operations, 10DBA, 13Patch-For-Review: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2633433 (10Marostegui)
[08:52:38] <wikibugs>	 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3032419 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1051.eqiad.wmnet'] ``` The...
[08:55:25] <icinga-wm>	 PROBLEM - Check systemd state on mw2248 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:55:35] <icinga-wm>	 PROBLEM - puppet last run on ms-be1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:55:35] <icinga-wm>	 PROBLEM - Check systemd state on mw2157 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:56:35] <icinga-wm>	 RECOVERY - Check systemd state on mw2157 is OK: OK - running: The system is fully operational
[08:57:35] <icinga-wm>	 PROBLEM - Check systemd state on mw2158 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:58:45] <icinga-wm>	 PROBLEM - Check systemd state on mw2250 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:59:23] <godog>	 ah I get it, jobrunner gets broken pipe via salt it looks like
[09:00:35] <icinga-wm>	 PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:00:35] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:00:35] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:00:35] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:00:35] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:00:36] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:00:36] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:00:37] <icinga-wm>	 RECOVERY - Check systemd state on mw2158 is OK: OK - running: The system is fully operational
[09:00:45] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:00:55] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:00:55] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:00:55] <icinga-wm>	 PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:00:55] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:00:55] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:00:56] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:01:04] <marostegui>	 I will check that
[09:01:05] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:01:05] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:01:05] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:01:05] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:01:05] <icinga-wm>	 PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:01:15] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:01:25] <icinga-wm>	 RECOVERY - Check systemd state on mw2248 is OK: OK - running: The system is fully operational
[09:01:45] <icinga-wm>	 RECOVERY - Check systemd state on mw2250 is OK: OK - running: The system is fully operational
[09:02:02] <volans>	 marostegui: tons of show slave status from nagios
[09:02:37] <marostegui>	 yep, it is kinda hang
[09:02:40] <marostegui>	 and I think I know why
[09:03:12] <marostegui>	 should be goodn ow
[09:03:13] <marostegui>	 now
[09:03:15] <volans>	 marostegui: also m3 replica is broken
[09:03:25] <icinga-wm>	 RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[09:03:25] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[09:03:25] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[09:03:25] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[09:03:25] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[09:03:26] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[09:03:26] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[09:03:27] <icinga-wm>	 PROBLEM - Check systemd state on mw2155 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:03:37] <volans>	 nice! what did you do? :D
[09:03:45] <icinga-wm>	 RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave
[09:03:45] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[09:03:45] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[09:03:45] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[09:03:45] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[09:03:46] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[09:03:53] <marostegui>	 yep, because of this: https://phabricator.wikimedia.org/T154485
[09:03:55] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[09:03:55] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[09:03:55] <icinga-wm>	 RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[09:03:55] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave
[09:03:55] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[09:04:05] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[09:04:20] <marostegui>	 well, because I killed it
[09:04:23] <volans>	 marostegui: ok, are you taking care of m3 replica?
[09:04:30] <marostegui>	 yep :)
[09:04:31] <marostegui>	 thanks
[09:04:42] <volans>	 great, just to not step on each other toes ;)
[09:04:50] <volans>	 thank you, sir!
[09:04:56] <icinga-wm>	 PROBLEM - Check systemd state on mw2159 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:05:03] <marostegui>	 no, thank you for jumping in!
[09:05:55] <icinga-wm>	 PROBLEM - Check systemd state on mw2154 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:06:35] <icinga-wm>	 PROBLEM - Check systemd state on mw2156 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:12:55] <icinga-wm>	 PROBLEM - Check systemd state on mw2160 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:13:11] <godog>	 sorry about the spam
[09:14:25] <icinga-wm>	 PROBLEM - Check systemd state on mw2161 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:14:55] <icinga-wm>	 PROBLEM - Check systemd state on mw2153 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:15:18] <wikibugs>	 (03CR) 10Hashar: "Yeah it is flapping :(   Posting details on T70113" [puppet] - 10https://gerrit.wikimedia.org/r/337552 (https://phabricator.wikimedia.org/T70113) (owner: 10Hashar)
[09:16:35] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[09:16:35] <icinga-wm>	 PROBLEM - Check systemd state on mw2247 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:17:55] <icinga-wm>	 RECOVERY - Check systemd state on mw2160 is OK: OK - running: The system is fully operational
[09:19:35] <icinga-wm>	 PROBLEM - Check systemd state on mw2249 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:19:45] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 21 probes of 264 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[09:21:29] <elukey>	 godog: need some help in restarting?
[09:22:12] <godog>	 elukey: thanks! I've switched to stop + start and things should be recovering soon
[09:22:23] <elukey>	 super, let me know otherwise
[09:23:35] <icinga-wm>	 RECOVERY - puppet last run on ms-be1019 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[09:24:45] <wikibugs>	 06Operations, 06Analytics-Kanban, 15User-Elukey: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3032454 (10elukey) Details from cp1058:  ``` --  VCL_call       BACKEND_FETCH --  VCL_return     fetch --  FetchError     no backend connection --  Timestamp      Beresp: 148721...
[09:24:45] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 14 probes of 264 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[09:25:35] <icinga-wm>	 RECOVERY - Check systemd state on mw2249 is OK: OK - running: The system is fully operational
[09:25:48] <wikibugs>	 (03PS1) 10Marostegui: db-codfw.php: Depool db2070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338083 (https://phabricator.wikimedia.org/T156478)
[09:26:37] <icinga-wm>	 RECOVERY - Check systemd state on mw2247 is OK: OK - running: The system is fully operational
[09:27:36] <icinga-wm>	 PROBLEM - Check systemd state on mw2158 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:27:56] <icinga-wm>	 RECOVERY - Check systemd state on mw2153 is OK: OK - running: The system is fully operational
[09:27:56] <icinga-wm>	 RECOVERY - Check systemd state on mw2159 is OK: OK - running: The system is fully operational
[09:28:56] <icinga-wm>	 RECOVERY - Check systemd state on mw2161 is OK: OK - running: The system is fully operational
[09:29:06] <icinga-wm>	 PROBLEM - Check systemd state on mw2250 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:29:36] <icinga-wm>	 RECOVERY - Check systemd state on mw2156 is OK: OK - running: The system is fully operational
[09:29:56] <icinga-wm>	 PROBLEM - puppet last run on elastic1051 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 8 minutes ago with 3 failures. Failed resources (up to 3 shown): File_line[login.defs-SYS_GID_MAX],File_line[login.defs-SYS_UID_MAX],File[/usr/share/elasticsearch/lib/json-simple.jar]
[09:30:26] <icinga-wm>	 RECOVERY - Check systemd state on mw2155 is OK: OK - running: The system is fully operational
[09:30:56] <icinga-wm>	 RECOVERY - Check systemd state on mw2154 is OK: OK - running: The system is fully operational
[09:31:06] <icinga-wm>	 RECOVERY - Check systemd state on mw2250 is OK: OK - running: The system is fully operational
[09:31:16] <icinga-wm>	 PROBLEM - Check systemd state on elastic1051 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:31:24] <wikibugs>	 (03PS20) 10Volans: Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588)
[09:31:30] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338083 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui)
[09:31:36] <icinga-wm>	 PROBLEM - Check systemd state on mw2162 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:31:36] <icinga-wm>	 RECOVERY - Check systemd state on mw2158 is OK: OK - running: The system is fully operational
[09:32:00] <wikibugs>	 06Operations: 'systemctl restart jobrunner' broken via salt - https://phabricator.wikimedia.org/T158288#3032457 (10fgiunchedi)
[09:32:16] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS on elastic1051 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[09:32:51] <wikibugs>	 (03Merged) 10jenkins-bot: db-codfw.php: Depool db2070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338083 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui)
[09:33:08] <wikibugs>	 (03CR) 10jenkins-bot: db-codfw.php: Depool db2070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338083 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui)
[09:33:26] <icinga-wm>	 PROBLEM - Check systemd state on mw2155 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:33:29] <volans>	 godog: let's try it with cumin then ;)
[09:33:56] <icinga-wm>	 PROBLEM - Check systemd state on mw2159 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:33:57] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2070 - T156478 (duration: 00m 41s)
[09:34:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:01] <stashbot>	 T156478: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478
[09:34:07] <wikibugs>	 06Operations: 'systemctl restart jobrunner' broken via salt - https://phabricator.wikimedia.org/T158288#3032473 (10fgiunchedi) Updated https://wikitech.wikimedia.org/wiki/Service_restarts#Application_servers_.28also_image.2Fvideo_scalers_and_job_runners.29 with a disclaimer about stop/start
[09:34:10] <godog>	 volans: sure! how do I do that?
[09:34:19] <godog>	 eqiad is still to go
[09:34:51] <volans>	 godog: can wait next week? not yet deployed but will be by EOW hopefully 
[09:35:11] <volans>	 what I meant was to try the specific case with cumin, to see if has the same issue or not
[09:35:38] <godog>	 ah ok, yeah this specific roll-restart can't wait but we can try next week another one for sure
[09:35:56] <icinga-wm>	 PROBLEM - Check systemd state on mw2154 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:36:09] <volans>	 great, thanks
[09:36:36] <icinga-wm>	 PROBLEM - Check systemd state on mw2156 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:36:56] <icinga-wm>	 RECOVERY - Check systemd state on mw2159 is OK: OK - running: The system is fully operational
[09:38:06] <elukey>	 volans: mc1019 it then waiting for cumin to be ready \o/
[09:38:36] <icinga-wm>	 RECOVERY - Check systemd state on mw2162 is OK: OK - running: The system is fully operational
[09:38:36] <icinga-wm>	 RECOVERY - Check systemd state on mw2156 is OK: OK - running: The system is fully operational
[09:38:56] <icinga-wm>	 RECOVERY - Check systemd state on mw2154 is OK: OK - running: The system is fully operational
[09:39:10] <moritzm>	 !log installing libgc security updates on trusty systems
[09:39:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:45] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: Initial import with the first version (037 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans)
[09:40:26] <icinga-wm>	 RECOVERY - Check systemd state on mw2155 is OK: OK - running: The system is fully operational
[09:43:56] <icinga-wm>	 PROBLEM - Check systemd state on mw2161 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:43:56] <icinga-wm>	 PROBLEM - Check systemd state on mw2160 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:44:56] <icinga-wm>	 PROBLEM - Check systemd state on mw2153 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:45:56] <icinga-wm>	 RECOVERY - Check systemd state on mw2153 is OK: OK - running: The system is fully operational
[09:46:36] <icinga-wm>	 PROBLEM - Check systemd state on mw2247 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:46:56] <icinga-wm>	 RECOVERY - Check systemd state on mw2161 is OK: OK - running: The system is fully operational
[09:46:57] <icinga-wm>	 RECOVERY - Check systemd state on mw2160 is OK: OK - running: The system is fully operational
[09:47:36] <icinga-wm>	 RECOVERY - Check systemd state on mw2247 is OK: OK - running: The system is fully operational
[09:49:36] <icinga-wm>	 PROBLEM - Check systemd state on mw2249 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:50:27] <godog>	 nevermind now I get it, puppet is also trying to stop 'jobrunner', I'm looking into it
[09:50:41] <wikibugs>	 (03PS1) 10Marostegui: dns: Change db2070 IP [dns] - 10https://gerrit.wikimedia.org/r/338087 (https://phabricator.wikimedia.org/T156478)
[09:52:06] <wikibugs>	 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#3032508 (10Marostegui) @Papaul please review the DNS changes: https://gerrit.wikimedia.org/r/#/c/338087/
[09:52:16] <icinga-wm>	 RECOVERY - puppet last run on elastic1051 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[09:52:36] <icinga-wm>	 RECOVERY - Check systemd state on mw2249 is OK: OK - running: The system is fully operational
[09:53:21] <wikibugs>	 (03PS8) 10Volans: Initial import with the first version [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588)
[09:54:02] <wikibugs>	 (03PS1) 10Marostegui: db-codfw,db-eqiad.php: Change db2070 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338088 (https://phabricator.wikimedia.org/T156478)
[09:54:20] <godog>	 sorry there will be a little bit more spam
[09:54:21] <wikibugs>	 (03CR) 10Volans: "Thanks for the replies. See inline" (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans)
[09:54:27] <wikibugs>	 (03CR) 10Marostegui: [C: 04-1] "Wait for the server to be off." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338088 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui)
[09:55:59] <wikibugs>	 06Operations: Unclean stop of jobrunner service via puppet - https://phabricator.wikimedia.org/T158288#3032511 (10fgiunchedi)
[09:57:36] <icinga-wm>	 PROBLEM - Check systemd state on mw2158 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:58:26] <hashar>	 Icinga will keep flapping on an alarm for contint1001 :  Work requests waiting in Zuul Gearman server 
[09:58:36] <icinga-wm>	 RECOVERY - Check systemd state on mw2158 is OK: OK - running: The system is fully operational
[09:58:53] <hashar>	 we have enabled yesterday night with mutante.   It is a bug/unhandled corner case  in the check_graphite .  Will fix it this afternoon
[09:59:00] <hashar>	 details are on https://phabricator.wikimedia.org/T70113#3032514
[10:01:36] <icinga-wm>	 PROBLEM - Check systemd state on mw2162 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:02:36] <icinga-wm>	 RECOVERY - Check systemd state on mw2162 is OK: OK - running: The system is fully operational
[10:02:43] <elukey>	 restarted --^
[10:04:26] <icinga-wm>	 PROBLEM - Check systemd state on mw2155 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:04:32] <godog>	 elukey: sadly that's not the problem, it is unclean 'stop' by puppet
[10:04:56] <icinga-wm>	 PROBLEM - Check systemd state on mw2159 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:05:26] <icinga-wm>	 RECOVERY - Check systemd state on mw2155 is OK: OK - running: The system is fully operational
[10:05:44] <godog>	 I've "fixed" it by doing systemctl reset-failed jobrunner
[10:05:56] <icinga-wm>	 RECOVERY - Check systemd state on mw2159 is OK: OK - running: The system is fully operational
[10:07:16] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS on elastic1051 is OK: SSL OK - Certificate elastic1051.eqiad.wmnet valid until 2022-02-15 10:05:51 +0000 (expires in 1824 days)
[10:07:16] <icinga-wm>	 RECOVERY - Check systemd state on elastic1051 is OK: OK - running: The system is fully operational
[10:07:32] <wikibugs>	 06Operations: Unclean stop of jobrunner service via puppet - https://phabricator.wikimedia.org/T158288#3032525 (10fgiunchedi) The cure for the moment is to 'systemctl reset-failed jobrunner' to restore non-degraded systemd state
[10:07:36] <icinga-wm>	 PROBLEM - Check systemd state on mw2162 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:07:41] <elukey>	 reading the task
[10:09:24] <elukey>	 godog: one thing that it is not clear to me - why puppet tries to stop the jobrunner?
[10:09:38] <godog>	 I suspect because this is codfw
[10:10:16] <icinga-wm>	 PROBLEM - puppet last run on elastic1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:11:16] <icinga-wm>	 RECOVERY - puppet last run on elastic1051 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[10:13:39] <elukey>	 godog: ahhhh
[10:13:56] <icinga-wm>	 PROBLEM - Check systemd state on mw2153 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:13:57] <icinga-wm>	 PROBLEM - Check systemd state on mw2160 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:14:56] <icinga-wm>	 PROBLEM - Check systemd state on mw2161 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:15:16] <godog>	 shush
[10:15:42] <logmsgbot>	 !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic10(48|49|50|51|52).codfw.wmnet
[10:15:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:56] <icinga-wm>	 RECOVERY - Check systemd state on mw2153 is OK: OK - running: The system is fully operational
[10:15:56] <icinga-wm>	 RECOVERY - Check systemd state on mw2160 is OK: OK - running: The system is fully operational
[10:16:36] <icinga-wm>	 RECOVERY - Check systemd state on mw2162 is OK: OK - running: The system is fully operational
[10:16:56] <icinga-wm>	 RECOVERY - Check systemd state on mw2161 is OK: OK - running: The system is fully operational
[10:17:18] <logmsgbot>	 !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic10(48|49|50|51|52).eqiad.wmnet
[10:17:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:39] <godog>	 !log roll-restart hhvm in eqiad to pick up fluorine -> mwlog1001 changes - T123728
[10:18:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:43] <stashbot>	 T123728: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728
[10:34:07] <wikibugs>	 (03CR) 10Ema: [C: 031] Only add the Diamond collector if ISC dhcpd is used [puppet] - 10https://gerrit.wikimedia.org/r/337009 (https://phabricator.wikimedia.org/T157794) (owner: 10Muehlenhoff)
[10:36:16] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1303 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[10:37:16] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1303 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.001 second response time
[10:40:22] <wikibugs>	 06Operations, 10DBA, 13Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3032574 (10Marostegui)
[10:40:25] <wikibugs>	 06Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 13Patch-For-Review: db1019: Decommission - https://phabricator.wikimedia.org/T146265#3032572 (10Marostegui) 05Open>03Resolved I believe this is done
[10:46:35] <wikibugs>	 06Operations, 06Analytics-Kanban, 10Traffic, 15User-Elukey: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3032576 (10elukey)
[10:46:46] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1299 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[10:47:46] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.001 second response time
[10:54:28] <wikibugs>	 06Operations, 06Analytics-Kanban, 10Traffic, 15User-Elukey: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3032616 (10elukey) ``` elukey@oxygen:/srv/log/webrequest$ grep piwik archive/5xx.json-20170216 | jq -r '[.http_status,.dt]| @csv' | awk -F":" '{print $1}'| sort | u...
[10:58:36] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1306 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[10:59:36] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1306 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.001 second response time
[11:05:54] <hashar>	 I don't understand opensource
[11:06:23] <hashar>	 graphite-web is a python based renderer which has implementation for bunch of functions such as sumSeries()
[11:06:44] <hashar>	 and there is another standalone project graphite-api  which is python based as well and seems to just have reimplemented everything
[11:07:40] <hashar>	 err  graphite-api is a fork of graphite-web bah
[11:08:40] <wikibugs>	 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3032661 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1051.eqiad.wmnet'] ```  Of which those **FAILED**: ``` set(['elastic1051.eqi...
[11:08:50] <wikibugs>	 (03CR) 10DCausse: Update elasticsearch module for es5 compatability (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/333969 (https://phabricator.wikimedia.org/T155578) (owner: 10EBernhardson)
[11:09:04] <godog>	 that's correct, IIRC to have sth easier to deploy than graphite-web
[11:10:36] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1164 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[11:11:36] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1164 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.002 second response time
[11:17:12] <_joe_>	 godog: why do you need to restart hhvm to make it pick up the new log destination?
[11:17:32] <_joe_>	 is it a setting in hhvm itself?
[11:19:06] <godog>	 _joe_: not a setting in hhvm itself, in this case it is the redis address for the profiler, looks like fluorine was still getting some redis traffic
[11:19:15] <godog>	 to answer your question, "I don't know"
[11:19:43] <_joe_>	 godog: uhm the profiler in fact has to do with hhvm itself
[11:20:06] <icinga-wm>	 PROBLEM - puppet last run on ocg1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:20:46] <godog>	 some of the traffic did switch yesterday after I did sync-file though, some didn't
[11:29:46] <icinga-wm>	 PROBLEM - puppet last run on db1089 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:42:49] <wikibugs>	 (03PS5) 10Muehlenhoff: Add account validation script / cron job [puppet] - 10https://gerrit.wikimedia.org/r/337032 (https://phabricator.wikimedia.org/T142836)
[11:49:06] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1300 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[11:49:06] <icinga-wm>	 RECOVERY - puppet last run on ocg1002 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[11:50:06] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1300 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.002 second response time
[11:52:46] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1166 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[11:53:46] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1166 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.004 second response time
[11:56:46] <icinga-wm>	 RECOVERY - puppet last run on db1089 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[12:01:23] <wikibugs>	 (03CR) 10Filippo Giunchedi: "A couple of comments on cleanup and one nit, the rest LGTM!" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/337032 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff)
[12:11:36] <wikibugs>	 (03PS2) 10Ladsgroup: gerrit: Make blue buttons look like OOUI [puppet] - 10https://gerrit.wikimedia.org/r/337397 (https://phabricator.wikimedia.org/T158298)
[12:11:38] <wikibugs>	 (03CR) 10Ladsgroup: "@Chad: Added in the phab card" [puppet] - 10https://gerrit.wikimedia.org/r/337397 (https://phabricator.wikimedia.org/T158298) (owner: 10Ladsgroup)
[12:21:11] <wikibugs>	 (03PS6) 10Muehlenhoff: Add account validation script / cron job [puppet] - 10https://gerrit.wikimedia.org/r/337032 (https://phabricator.wikimedia.org/T142836)
[12:25:32] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add account validation script / cron job [puppet] - 10https://gerrit.wikimedia.org/r/337032 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff)
[12:28:57] * moritzm shakes fist at pointless "E302 expected 2 blank lines, found 1" CI test
[12:29:35] <wikibugs>	 (03PS7) 10Muehlenhoff: Add account validation script / cron job [puppet] - 10https://gerrit.wikimedia.org/r/337032 (https://phabricator.wikimedia.org/T142836)
[12:30:02] <volans>	 moritzm: why you're angry at PEP8? :-P
[12:34:28] <volans>	 do you mind if I review it? :)
[12:34:55] <wikibugs>	 (03PS1) 10Hashar: check_graphite anomaly option to set minimum upper band [puppet] - 10https://gerrit.wikimedia.org/r/338095 (https://phabricator.wikimedia.org/T70113)
[12:34:56] <Reedy>	 volans: "How dare you review my code!?"
[12:35:10] <volans>	 :D
[12:35:42] <wikibugs>	 (03CR) 10Paladox: "I haven't tested a reboot." [puppet] - 10https://gerrit.wikimedia.org/r/338022 (owner: 10Paladox)
[12:39:46] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1162 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[12:40:46] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1162 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.002 second response time
[12:49:59] <wikibugs>	 (03CR) 10Hashar: "Added as reviewer editors of the check_graphite script.  There are a few details on T70113  and a summary in the commit message." [puppet] - 10https://gerrit.wikimedia.org/r/338095 (https://phabricator.wikimedia.org/T70113) (owner: 10Hashar)
[12:57:52] <wikibugs>	 06Operations, 06Labs, 10wikitech.wikimedia.org: Expand list of people who can create new Labs project - https://phabricator.wikimedia.org/T101688#3032860 (10scfc) 05Open>03Resolved >>! In T101688#1390474, @Legoktm wrote: > Do we currently have an issue with projects not being created in a timely manner?...
[12:58:10] <wikibugs>	 06Operations, 06Labs, 10wikitech.wikimedia.org: Expand list of people who can create new Labs project - https://phabricator.wikimedia.org/T101688#3032862 (10scfc) 05Resolved>03declined
[12:59:10] <wikibugs>	 (03PS1) 10Muehlenhoff: Update to 1.1.0e [debs/openssl11] - 10https://gerrit.wikimedia.org/r/338096
[13:08:16] <wikibugs>	 (03CR) 10Volans: "Nice!" (0313 comments) [puppet] - 10https://gerrit.wikimedia.org/r/337032 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff)
[13:09:00] <wikibugs>	 (03CR) 10Volans: "I forgot to add one :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/337032 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff)
[13:15:19] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 031] "Few nitpicks on the README, but LGTM overall. Good job!" (034 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans)
[13:16:56] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 11747
[13:19:52] <marostegui>	 !log Shutdown db2070 for maintenance - T156478
[13:19:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:58] <stashbot>	 T156478: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478
[13:21:36] <wikibugs>	 (03PS9) 10Volans: Initial import with the first version [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588)
[13:21:52] <wikibugs>	 (03CR) 10Volans: "Nitpicks addressed ;)" [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans)
[13:23:56] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 298
[13:24:18] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-codfw,db-eqiad.php: Change db2070 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338088 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui)
[13:25:23] <wikibugs>	 06Operations, 06Analytics-Kanban, 10netops: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3032910 (10elukey) After running `tcpdump ip6` on a couple of hosts I realized that the puppet agent contacts puppetmaster1001 via IPv6. I added a special term called `puppet` to `analyt...
[13:25:48] <wikibugs>	 (03Merged) 10jenkins-bot: db-codfw,db-eqiad.php: Change db2070 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338088 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui)
[13:26:36] <icinga-wm>	 PROBLEM - HHVM rendering on mw1266 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50371 bytes in 3.135 second response time
[13:26:49] <wikibugs>	 (03CR) 10jenkins-bot: db-codfw,db-eqiad.php: Change db2070 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338088 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui)
[13:27:31] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Change db2070 IP as it goes to another rack - T156478 (duration: 00m 56s)
[13:27:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:35] <stashbot>	 T156478: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478
[13:27:36] <icinga-wm>	 RECOVERY - HHVM rendering on mw1266 is OK: HTTP OK: HTTP/1.1 200 OK - 72624 bytes in 0.093 second response time
[13:28:24] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-codfw.php: Change db2070 IP as it goes to another rack - T156478 (duration: 00m 41s)
[13:28:27] <wikibugs>	 06Operations, 06Labs, 10wikitech.wikimedia.org: wikitech regularly looses session directly after login - https://phabricator.wikimedia.org/T118395#3032915 (10scfc) 05Open>03Invalid I cannot reproduce this.  Please reopen if the problem reoccurs.
[13:28:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:49] <wikibugs>	 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#3032918 (10Marostegui) @Papaul db2070 off, mediawiki files changed with its new IP. If you review the DNS patch I will push it too.
[13:34:46] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1305 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[13:35:46] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1305 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.001 second response time
[13:39:36] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1304 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[13:40:36] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1304 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.001 second response time
[13:42:55] <wikibugs>	 (03PS8) 10Muehlenhoff: Add account validation script / cron job [puppet] - 10https://gerrit.wikimedia.org/r/337032 (https://phabricator.wikimedia.org/T142836)
[13:43:56] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add account validation script / cron job [puppet] - 10https://gerrit.wikimedia.org/r/337032 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff)
[13:49:50] <wikibugs>	 (03PS9) 10Muehlenhoff: Add account validation script / cron job [puppet] - 10https://gerrit.wikimedia.org/r/337032 (https://phabricator.wikimedia.org/T142836)
[14:00:04] <jouncebot>	 addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170216T1400).
[14:01:18] <hashar>	 nothing for the swat :)
[14:04:32] <wikibugs>	 06Operations, 10ops-eqiad, 13Patch-For-Review: Degraded RAID on relforge1001 - https://phabricator.wikimedia.org/T156663#3032969 (10Gehel) Relforge1001 is being drained right now, it should be ready in a few hours. Do you need to shut it down? Or is it a hot plug switch? In any case, just ping me before doin...
[14:12:26] <wikibugs>	 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 13Patch-For-Review, 07Wikimedia-Multiple-active-datacenters: Check the size of every cluster in codfw to see if it matches eqiad's capacity - https://phabricator.wikimedia.org/T156023#3033014 (10elukey) Done a quick check to see how much the mw2* hos...
[14:17:45] <wikibugs>	 (03PS2) 10Hashar: Support Jenkins install from 'experimental' component [puppet] - 10https://gerrit.wikimedia.org/r/336408 (https://phabricator.wikimedia.org/T157429)
[14:18:02] <wikibugs>	 (03CR) 10Hashar: "Done and rebased :)" [puppet] - 10https://gerrit.wikimedia.org/r/336408 (https://phabricator.wikimedia.org/T157429) (owner: 10Hashar)
[14:20:56] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 21042
[14:22:56] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0
[14:26:46] <wikibugs>	 (03PS1) 10Elukey: Move codfw appserver conftool-data to codfw.yaml [puppet] - 10https://gerrit.wikimedia.org/r/338108 (https://phabricator.wikimedia.org/T156023)
[14:27:04] <moritzm>	 !log uploaded openssl 1.1.0e to apt.wikimedia.org
[14:27:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:27:18] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Update to 1.1.0e [debs/openssl11] - 10https://gerrit.wikimedia.org/r/338096 (owner: 10Muehlenhoff)
[14:28:24] <zeljkof>	 hashar: this was an easy swat ;)
[14:29:41] * Nemo_bis is always available to propose fillers for any swat which felt too empty
[14:29:43] <wikibugs>	 (03PS2) 10Hashar: contint: remove /srv/ssd [puppet] - 10https://gerrit.wikimedia.org/r/337286
[14:33:10] <hashar>	 Nemo_bis: i got enough with my own patches :D
[14:35:14] <wikibugs>	 (03PS4) 10Hashar: labstore: check should search for exact mount match [puppet] - 10https://gerrit.wikimedia.org/r/333230 (https://phabricator.wikimedia.org/T155820)
[14:35:52] <wikibugs>	 (03CR) 10Hashar: [C: 031] "This has been cherry picked on the CI master for close to a month and fix the issue at end." [puppet] - 10https://gerrit.wikimedia.org/r/333230 (https://phabricator.wikimedia.org/T155820) (owner: 10Hashar)
[14:36:24] <wikibugs>	 (03PS4) 10Hashar: Gemfile: add xmlrpc for ruby 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/332981
[14:38:21] <wikibugs>	 (03Abandoned) 10Hashar: (WIP) zuul-merger instances (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/336803 (owner: 10Hashar)
[14:39:48] <wikibugs>	 (03PS2) 10Hashar: zuul: use a proper require for the merger class [puppet] - 10https://gerrit.wikimedia.org/r/337008
[14:40:05] <wikibugs>	 (03CR) 10Hashar: [C: 031] "rebased/cherry picked to tip of production" [puppet] - 10https://gerrit.wikimedia.org/r/337008 (owner: 10Hashar)
[14:44:26] <wikibugs>	 (03PS4) 10Muehlenhoff: Only add the Diamond collector if ISC dhcpd is used [puppet] - 10https://gerrit.wikimedia.org/r/337009 (https://phabricator.wikimedia.org/T157794)
[14:45:46] <icinga-wm>	 PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:45:52] <paravoid>	 moritzm: dhcpd? :)
[14:46:01] <paravoid>	 either I am having a stroke or you are :P
[14:46:40] <moritzm>	 oh, all those legacy ISC code bases sound alike :) will amend the commit message
[14:46:50] <paravoid>	 heheh
[14:47:40] <paravoid>	 moritzm: also in the comment in timesyncd.pp
[14:48:40] <moritzm>	 thanks, fixed
[14:48:47] <wikibugs>	 (03PS5) 10Muehlenhoff: Only add the Diamond collector if ISC ntpd is used [puppet] - 10https://gerrit.wikimedia.org/r/337009 (https://phabricator.wikimedia.org/T157794)
[14:52:26] <wikibugs>	 (03CR) 10Volans: [C: 032] "Thanks everyone for the reviews, comments and feedbacks, really appreciated given the size of it in a single change!" [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans)
[14:53:18] <wikibugs>	 (03Merged) 10jenkins-bot: Initial import with the first version [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans)
[14:54:36] <icinga-wm>	 PROBLEM - puppet last run on mc1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:58:07] <wikibugs>	 06Operations, 10ops-eqiad, 13Patch-For-Review: Degraded RAID on relforge1001 - https://phabricator.wikimedia.org/T156663#3033111 (10Cmjohnson) it's a hot swap disk. I will update the task once it swapped so you can rebuild the raid.
[14:58:09] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Only add the Diamond collector if ISC ntpd is used [puppet] - 10https://gerrit.wikimedia.org/r/337009 (https://phabricator.wikimedia.org/T157794) (owner: 10Muehlenhoff)
[15:00:11] <wikibugs>	 (03PS1) 10Filippo Giunchedi: udp2log: mirror traffic via udpmirror.py [puppet] - 10https://gerrit.wikimedia.org/r/338119 (https://phabricator.wikimedia.org/T123728)
[15:01:38] <wikibugs>	 06Operations, 10ops-eqiad, 13Patch-For-Review: Degraded RAID on relforge1001 - https://phabricator.wikimedia.org/T156663#3033142 (10Gehel) I'll actually just reimage the machine (it is due for a reimage), but same result.
[15:02:35] <wikibugs>	 06Operations, 13Patch-For-Review: Evaluate use of systemd-timesyncd on jessie for clock synchronisation - https://phabricator.wikimedia.org/T150257#3033156 (10MoritzMuehlenhoff) With the merge of https://gerrit.wikimedia.org/r/#/c/337009/ the installation of ISC ntpd is now prevented on stretch.
[15:13:47] <icinga-wm>	 RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[15:15:32] <paravoid>	 moritzm: oops, I had some comments
[15:15:34] <paravoid>	 I'll post them anyway
[15:15:39] <wikibugs>	 (03CR) 10Faidon Liambotis: Only add the Diamond collector if ISC ntpd is used (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/337009 (https://phabricator.wikimedia.org/T157794) (owner: 10Muehlenhoff)
[15:17:08] <paravoid>	 moritzm: also in addition to those comments and semi-relatedly to your change... we'll need to change our ntp *server* classes to also disable timesyncd
[15:17:51] <paravoid>	 I think at this point the thing we should do is rethink all of it a little bit -- perhaps add an "ensure" parameter to all of the ntp client, ntp server and timesyncd classes
[15:18:18] <paravoid>	 that would do the right thing (enable or disable ntp and systemd-timesyncd, add or remove the diamond collector, add or remove the monitoring check etc.)
[15:18:46] <paravoid>	 and then say class { 'ntp::server': ensure => present } class { 'timesyncd': ensure => absent }
[15:18:53] <paravoid>	 I can give it a stab at some point
[15:19:28] <moritzm>	 I need to look at what the collector does on servers, not sure
[15:20:01] <moritzm>	 I can address your comments in a followup patch later on, first need to proceed with the hhvm upload
[15:20:14] <paravoid>	 yes, not urgent
[15:21:30] <wikibugs>	 (03PS1) 10Ema: varnish: tune check_varnish_expiry_mailbox_lag alerting thresholds [puppet] - 10https://gerrit.wikimedia.org/r/338123 (https://phabricator.wikimedia.org/T145661)
[15:22:36] <icinga-wm>	 RECOVERY - puppet last run on mc1016 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[15:24:46] <wikibugs>	 (03CR) 10BBlack: [C: 031] varnish: tune check_varnish_expiry_mailbox_lag alerting thresholds [puppet] - 10https://gerrit.wikimedia.org/r/338123 (https://phabricator.wikimedia.org/T145661) (owner: 10Ema)
[15:26:26] <icinga-wm>	 PROBLEM - Disk space on elastic1024 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 63808 MB (12% inode=99%)
[15:27:47] <dcausse>	 ^ should be transient, a full reindex is in progress
[15:28:59] <wikibugs>	 (03PS1) 10Volans: TravisCI: force dependency upgrade [software/cumin] - 10https://gerrit.wikimedia.org/r/338125 (https://phabricator.wikimedia.org/T154588)
[15:29:13] <wikibugs>	 (03CR) 10Ema: [C: 032] varnish: tune check_varnish_expiry_mailbox_lag alerting thresholds [puppet] - 10https://gerrit.wikimedia.org/r/338123 (https://phabricator.wikimedia.org/T145661) (owner: 10Ema)
[15:31:46] <wikibugs>	 (03CR) 10Volans: [C: 032] TravisCI: force dependency upgrade [software/cumin] - 10https://gerrit.wikimedia.org/r/338125 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans)
[15:32:24] <wikibugs>	 (03Merged) 10jenkins-bot: TravisCI: force dependency upgrade [software/cumin] - 10https://gerrit.wikimedia.org/r/338125 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans)
[15:34:07] <wikibugs>	 (03PS2) 10Filippo Giunchedi: udp2log: mirror traffic via udpmirror.py [puppet] - 10https://gerrit.wikimedia.org/r/338119 (https://phabricator.wikimedia.org/T123728)
[15:37:55] <wikibugs>	 (03PS1) 10Volans: Update TravisCI and Coveralls URLs [software/cumin] - 10https://gerrit.wikimedia.org/r/338127 (https://phabricator.wikimedia.org/T154588)
[15:39:20] <wikibugs>	 (03CR) 10Volans: [C: 032] Update TravisCI and Coveralls URLs [software/cumin] - 10https://gerrit.wikimedia.org/r/338127 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans)
[15:40:10] <wikibugs>	 (03Merged) 10jenkins-bot: Update TravisCI and Coveralls URLs [software/cumin] - 10https://gerrit.wikimedia.org/r/338127 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans)
[15:40:31] <icinga-wm>	 PROBLEM - Disk space on elastic1024 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 64248 MB (12% inode=99%)
[15:44:10] <icinga-wm>	 ACKNOWLEDGEMENT - Disk space on elastic1024 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 57770 MB (11% inode=99%): Gehel lots of reindex going on, shards are already leaving elastic1024, situation should be back to normal soon - The acknowledgement expires at: 2017-02-17 20:43:31.
[15:44:46] <papaul>	 marostegui: hello are you ready for me?
[15:47:13] <marostegui>	 papaul: hi!
[15:47:36] <marostegui>	 papaul: yes, the server is off, so you can move it now if you like, if you don't mind reviewing the dns patch, I can get it deployed too now
[15:48:30] <wikibugs>	 (03PS1) 10Urbanecm: [throttle] New rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338128 (https://phabricator.wikimedia.org/T158312)
[15:48:52] <icinga-wm>	 PROBLEM - puppet last run on elastic1041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:50:30] <wikibugs>	 (03PS3) 10Filippo Giunchedi: udp2log: mirror traffic via udpmirror.py [puppet] - 10https://gerrit.wikimedia.org/r/338119 (https://phabricator.wikimedia.org/T123728)
[15:51:31] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] udp2log: mirror traffic via udpmirror.py [puppet] - 10https://gerrit.wikimedia.org/r/338119 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi)
[15:53:09] <moritzm>	 !log upgrading mwdebug1001 to HHVM 3.12.14
[15:53:10] <wikibugs>	 (03PS1) 10Reedy: Remove empty conditionals for wikis from flaggedrevs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338129
[15:53:12] <wikibugs>	 (03PS1) 10Reedy: Add a few newlines to standardise spacing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338130
[15:53:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:37] <wikibugs>	 (03CR) 10Papaul: [C: 032] dns: Change db2070 IP [dns] - 10https://gerrit.wikimedia.org/r/338087 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui)
[15:54:04] <marostegui>	 papaul: you deploy or I do it?
[15:54:48] <wikibugs>	 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review, 15User-Elukey: Reclaim/Decommission old codfw mc2001->mc2016 hosts - https://phabricator.wikimedia.org/T157675#3033317 (10elukey)
[15:56:55] <wikibugs>	 (03PS4) 10Filippo Giunchedi: udp2log: mirror traffic via udpmirror.py [puppet] - 10https://gerrit.wikimedia.org/r/338119 (https://phabricator.wikimedia.org/T123728)
[16:00:08] <wikibugs>	 (03PS1) 10Filippo Giunchedi: Revert "hieradata: temporarily remove prometheus100[34] from prometheus_hosts" [puppet] - 10https://gerrit.wikimedia.org/r/338131 (https://phabricator.wikimedia.org/T152504)
[16:01:11] <moritzm>	 !log upgrading mwdebug1002 to HHVM 3.12.14
[16:01:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:02:05] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] Revert "hieradata: temporarily remove prometheus100[34] from prometheus_hosts" [puppet] - 10https://gerrit.wikimedia.org/r/338131 (https://phabricator.wikimedia.org/T152504) (owner: 10Filippo Giunchedi)
[16:05:43] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] Increase the concurrent threads of large mariadb servers [puppet] - 10https://gerrit.wikimedia.org/r/337840 (https://phabricator.wikimedia.org/T150474) (owner: 10Jcrespo)
[16:06:15] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] "https://puppet-compiler.wmflabs.org/5493/ Will deploy in a hot way, slowly, in number order." [puppet] - 10https://gerrit.wikimedia.org/r/337840 (https://phabricator.wikimedia.org/T150474) (owner: 10Jcrespo)
[16:09:04] <wikibugs>	 (03PS3) 10Jcrespo: Increase the concurrent threads of large mariadb servers [puppet] - 10https://gerrit.wikimedia.org/r/337840 (https://phabricator.wikimedia.org/T150474)
[16:09:27] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] Increase the concurrent threads of large mariadb servers [puppet] - 10https://gerrit.wikimedia.org/r/337840 (https://phabricator.wikimedia.org/T150474) (owner: 10Jcrespo)
[16:09:45] <icinga-wm>	 PROBLEM - puppet last run on mw1182 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:15:07] <wikibugs>	 06Operations, 10ops-codfw, 10DBA: codfw: switch ports clean up - https://phabricator.wikimedia.org/T158246#3033362 (10Papaul)
[16:15:55] <icinga-wm>	 RECOVERY - puppet last run on elastic1041 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[16:24:32] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "-1 from ema per the regex not covering up to 2099" [puppet] - 10https://gerrit.wikimedia.org/r/337893 (https://phabricator.wikimedia.org/T152882) (owner: 10Dzahn)
[16:25:58] <jynus>	 !log SET GLOBAL thread_pool_size=64; on db1074's mariadb
[16:26:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:26:04] <wikibugs>	 06Operations, 10ops-codfw, 10DBA: codfw: switch ports clean up - https://phabricator.wikimedia.org/T158246#3033445 (10Marostegui) Hey @RobH  To clarify things, db2070 has been moved from row D to row C (as @Papaul updated on the original task description). Thanks for helping out!
[16:26:42] <wikibugs>	 (03PS2) 10Dzahn: adjust wikimania regex for mobile hosts, cover 2002-2019 [puppet] - 10https://gerrit.wikimedia.org/r/337893 (https://phabricator.wikimedia.org/T152882)
[16:27:45] <wikibugs>	 (03PS3) 10Dzahn: adjust wikimania regex for mobile hosts, cover 2002-2099 [puppet] - 10https://gerrit.wikimedia.org/r/337893 (https://phabricator.wikimedia.org/T152882)
[16:28:07] <wikibugs>	 (03PS3) 10Dzahn: zuul: use a proper require for the merger class [puppet] - 10https://gerrit.wikimedia.org/r/337008 (owner: 10Hashar)
[16:28:57] <wikibugs>	 (03PS4) 10Jcrespo: phabricator database: Move templates to the role [puppet] - 10https://gerrit.wikimedia.org/r/337827
[16:30:23] <moritzm>	 !log uploaded HHVM 3.12.14 to apt.wikimedia.org
[16:30:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:31:45] <icinga-wm>	 PROBLEM - puppet last run on analytics1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:33:39] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] "https://puppet-compiler.wmflabs.org/5494/" [puppet] - 10https://gerrit.wikimedia.org/r/337827 (owner: 10Jcrespo)
[16:35:27] <wikibugs>	 (03CR) 10Dzahn: [C: 032] zuul: use a proper require for the merger class [puppet] - 10https://gerrit.wikimedia.org/r/337008 (owner: 10Hashar)
[16:37:10] <wikibugs>	 (03PS5) 10Jcrespo: phabricator database: Move templates to the role [puppet] - 10https://gerrit.wikimedia.org/r/337827
[16:38:01] <wikibugs>	 06Operations, 10ops-codfw, 10DBA: codfw: switch ports clean up - https://phabricator.wikimedia.org/T158246#3033481 (10RobH)
[16:38:24] <wikibugs>	 (03CR) 10Filippo Giunchedi: tlsproxy: add nginx_bootstrap define (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/333247 (owner: 10Filippo Giunchedi)
[16:38:25] <wikibugs>	 06Operations, 10ops-codfw, 10DBA: codfw: switch ports clean up - https://phabricator.wikimedia.org/T158246#3031005 (10RobH)
[16:38:40] <wikibugs>	 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#3033483 (10Marostegui) db2070: - DNS updated - network/interfaces changed - mediawiki files changed - MySQL up and replication up  Pending: port configuration   Once the...
[16:38:49] <icinga-wm>	 RECOVERY - puppet last run on mw1182 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[16:39:24] <wikibugs>	 06Operations, 10ops-codfw, 10DBA: codfw: switch ports clean up - https://phabricator.wikimedia.org/T158246#3031005 (10RobH) Ok, the new port is setup in row c.  Please assign this back to me once db2070 is moved!
[16:39:39] <icinga-wm>	 PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:40:32] <wikibugs>	 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#3033489 (10Marostegui) Oh, I saw that @RobH already changed the port and the server is replicating fine! :)
[16:40:53] <wikibugs>	 (03PS4) 10Filippo Giunchedi: tlsproxy: add nginx_bootstrap define [puppet] - 10https://gerrit.wikimedia.org/r/333247
[16:40:55] <wikibugs>	 (03PS11) 10Filippo Giunchedi: swift: terminate https with nginx [puppet] - 10https://gerrit.wikimedia.org/r/310549 (https://phabricator.wikimedia.org/T127455)
[16:40:57] <wikibugs>	 (03PS2) 10Jcrespo: Remove the templates dir, not needed anymore [puppet] - 10https://gerrit.wikimedia.org/r/337837
[16:41:01] <wikibugs>	 06Operations, 10ops-codfw, 10DBA: codfw: switch ports clean up - https://phabricator.wikimedia.org/T158246#3033504 (10Marostegui) a:05Papaul>03RobH The server has been already moved to row C
[16:41:32] <wikibugs>	 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#3033509 (10Marostegui) a:03Marostegui Claiming this task to do the last checks, repool the server etc before closing it.
[16:41:41] <wikibugs>	 (03CR) 10Filippo Giunchedi: swift: terminate https with nginx (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/310549 (https://phabricator.wikimedia.org/T127455) (owner: 10Filippo Giunchedi)
[16:42:48] <wikibugs>	 06Operations, 10ops-codfw, 10DBA: codfw: switch ports clean up - https://phabricator.wikimedia.org/T158246#3033512 (10RobH) >>! In T158246#3033504, @Marostegui wrote: > The server has been already moved to row C  When?  I just setup (as in when I put in my comment) that the port wasn't allocated or enabled,...
[16:43:00] <wikibugs>	 (03CR) 10Jcrespo: [C: 031] "This is ready to deploy, no blockers. This should fix the error: "Warning: Setting templatedir is deprecated. See http://links.puppetlabs." [puppet] - 10https://gerrit.wikimedia.org/r/337837 (owner: 10Jcrespo)
[16:44:53] <wikibugs>	 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#3033543 (10RobH)
[16:44:56] <wikibugs>	 06Operations, 10ops-codfw, 10DBA: codfw: switch ports clean up - https://phabricator.wikimedia.org/T158246#3033541 (10RobH) 05Open>03Resolved
[16:45:26] <marostegui>	 \o/ ^
[16:49:19] <wikibugs>	 (03CR) 10Jcrespo: [C: 031] admin: basic .vimrc for hashar [puppet] - 10https://gerrit.wikimedia.org/r/337014 (owner: 10Hashar)
[16:50:18] <wikibugs>	 (03CR) 10Ema: [C: 031] adjust wikimania regex for mobile hosts, cover 2002-2099 [puppet] - 10https://gerrit.wikimedia.org/r/337893 (https://phabricator.wikimedia.org/T152882) (owner: 10Dzahn)
[16:52:39] <wikibugs>	 (03PS1) 10Filippo Giunchedi: scap: upgrade to 3.5.2-1 [puppet] - 10https://gerrit.wikimedia.org/r/338138 (https://phabricator.wikimedia.org/T127762)
[16:54:14] <godog>	 I can be around for the first 15min of puppet swat, anyone else?
[16:54:34] <wikibugs>	 (03CR) 10Jcrespo: [C: 031] Gemfile: add xmlrpc for ruby 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/332981 (owner: 10Hashar)
[16:55:03] <jynus>	 I am +1 the ones I can deploy
[16:55:12] <jynus>	 ^godog
[16:55:25] <godog>	 nice, thanks jynus 
[17:00:04] <jouncebot>	 godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170216T1700). Please do the needful.
[17:00:04] <jouncebot>	 hashar: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[17:00:11] <hashar>	 o / 
[17:00:16] <wikibugs>	 (03CR) 10Jcrespo: [C: 031] systemd: allow isequal to match programname in/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/337411 (owner: 10Hashar)
[17:00:17] <hashar>	 but I am only there for a few :(
[17:00:35] <jynus>	 there is one I do not want to deploy alone
[17:00:40] <icinga-wm>	 RECOVERY - puppet last run on analytics1034 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[17:00:42] <jynus>	 the one changing systemd
[17:00:49] <jynus>	 not systemd
[17:01:04] <hashar>	 yeah I guess it can cause random issue eventually. I only tested it via rspec/cataog compilation
[17:01:05] <jynus>	 the one declaring initv
[17:01:11] <hashar>	 might want a whole run of the puppet compiler
[17:01:19] <hashar>	 oh
[17:01:30] <icinga-wm>	 RECOVERY - Disk space on elastic1024 is OK: DISK OK
[17:01:39] <hashar>	 https://gerrit.wikimedia.org/r/#/c/336978/ contint: git-daemon service is 'sysvinit'
[17:01:48] <jynus>	 need help to test it live
[17:01:49] <hashar>	 found that one when we provisioned a new zuul::merger on contint2001
[17:01:55] <hashar>	 the service did not come up
[17:02:00] <jynus>	 can we do it now?
[17:02:09] <hashar>	 sure
[17:02:18] <jynus>	 let's start with that one
[17:02:23] <jynus>	 the others are mostly trivial
[17:02:34] <hashar>	 on the first puppet run the service was not started and systemd was showing up as active (exited)  https://phabricator.wikimedia.org/T157785
[17:02:46] <hashar>	 we can stop puppet on contint1001
[17:02:47] <hashar>	 merge
[17:02:48] <jynus>	 basically, if it kills conting, you can help me
[17:02:51] <hashar>	 run puppet on contint2001
[17:02:54] <hashar>	 and see what happens
[17:03:01] <jynus>	 that is ok to me
[17:03:15] <jynus>	 I am ok to merge it directly
[17:03:23] <jynus>	 as long as you are on the machine 
[17:03:28] <jynus>	 checking it
[17:03:34] <jynus>	 and restaring it, etc.
[17:03:36] <hashar>	 !log stopped puppet on contint1001 for https://gerrit.wikimedia.org/r/#/c/336978/
[17:03:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:03:46] <hashar>	 ready
[17:03:53] <hashar>	 I am on both
[17:03:54] <wikibugs>	 (03PS2) 10Jcrespo: contint: git-daemon service is 'sysvinit' [puppet] - 10https://gerrit.wikimedia.org/r/336978 (https://phabricator.wikimedia.org/T157785) (owner: 10Hashar)
[17:04:10] <wikibugs>	 (03CR) 10Jcrespo: [V: 032 C: 032] contint: git-daemon service is 'sysvinit' [puppet] - 10https://gerrit.wikimedia.org/r/336978 (https://phabricator.wikimedia.org/T157785) (owner: 10Hashar)
[17:04:29] <hashar>	 will run puppet on contint2001,  check zuul-merger is still happily managed by systemd
[17:05:29] <hashar>	 Invalid service provider 'sysvinit'
[17:05:32] <hashar>	 ...
[17:05:40] <jynus>	 really?
[17:05:44] <jynus>	 do I revert?
[17:05:47] <hashar>	 Error: Failed to apply catalog: Parameter provider failed on Service[git-daemon]: Invalid service provider 'sysvinit' at /etc/puppet/modules/contint/manifests/zuul/git_daemon.pp:32
[17:05:53] <wikibugs>	 (03PS1) 10Jcrespo: Revert "contint: git-daemon service is 'sysvinit'" [puppet] - 10https://gerrit.wikimedia.org/r/338140
[17:06:02] <hashar>	 guess I used the wrong doc bah :(
[17:06:06] <wikibugs>	 (03CR) 10Jcrespo: [V: 032 C: 032] Revert "contint: git-daemon service is 'sysvinit'" [puppet] - 10https://gerrit.wikimedia.org/r/338140 (owner: 10Jcrespo)
[17:06:38] <jynus>	 well, my gut feeling was good
[17:06:39] <jynus>	 it seems
[17:06:40] <hashar>	 guess I will redo it later on  sorry 
[17:06:42] <jynus>	 :-)
[17:07:01] <hashar>	 at least the service is still running
[17:07:16] <jynus>	 I wanted to be here, it is not dangerous
[17:07:20] <jynus>	 but you know
[17:07:40] <icinga-wm>	 RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[17:07:41] <jynus>	 let me see if there was some other non-trivial
[17:08:02] <hashar>	 !log reenable puppet on contint1001
[17:08:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:08:33] <jynus>	 I am not very involved with 331239
[17:08:40] <hashar>	 puppet ok on both hosts and zuul-merger are running
[17:08:43] <jynus>	 I will deploy it, but can be tested?
[17:09:01] <hashar>	 yeah rebase it
[17:09:02] <jynus>	 right away?
[17:09:11] <hashar>	 if CI job rake-jessie says SUCCESS
[17:09:13] <hashar>	 it is fine to merge :)
[17:09:14] <wikibugs>	 (03PS11) 10Jcrespo: puppet parse validate from rake [puppet] - 10https://gerrit.wikimedia.org/r/331239 (https://phabricator.wikimedia.org/T154915) (owner: 10Hashar)
[17:09:31] <jynus>	 oh, it includes its own modifications?
[17:09:32] <hashar>	 the idea is to run puppet parser validate /    hiera syntax check and  erb templates from rake
[17:09:37] <jynus>	 I didn't know that
[17:09:39] <hashar>	 so one can locally just   rake syntax
[17:09:50] <hashar>	 and happen to run locally exactly what CI does
[17:10:09] <jynus>	 lets wait for that
[17:10:17] <jynus>	 let me see what else we have
[17:10:22] <hashar>	 that also has the side effect of letting me remove the Jenkins jobs pplint-HEAD and erblint-HEAD   that are something like  find . -name*.pp | xargs puppet parser validate
[17:10:33] <hashar>	 that is slow, does not support ignores  and not reproducible locally 
[17:11:08] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] puppet parse validate from rake [puppet] - 10https://gerrit.wikimedia.org/r/331239 (https://phabricator.wikimedia.org/T154915) (owner: 10Hashar)
[17:11:13] <hashar>	 \O/
[17:11:14] <jynus>	 ok
[17:11:31] <jynus>	 as you own CI, you break it, you fix it!
[17:11:34] <jynus>	 :-)
[17:11:39] <hashar>	 I will update https://wikitech.wikimedia.org/wiki/Puppet_coding  later tonight :}
[17:11:44] <jynus>	 thanks
[17:12:01] <hashar>	 oh CI is more like:  folks use it. Sometime abuse it and we try to fix it up
[17:12:07] <jynus>	 he he
[17:12:11] <jynus>	 no
[17:12:13] <hashar>	 80% of the maintenance is done by ops via puppet anyway
[17:12:14] <jynus>	 I am ok with that
[17:12:17] <jynus>	 I am more like
[17:12:26] <jynus>	 "syntax error on the new function"
[17:12:42] <jynus>	 I am not worried about new rule is too strict
[17:13:04] <hashar>	 should be fine. Our puppet manifests are reasonably nice nowadays
[17:13:12] <hashar>	 the last man standing is the evil  import realm.pp
[17:13:19] <jynus>	 look, it is my job to be pesimistic :-)
[17:13:35] <jynus>	 specially with codebase I do not normaly touch
[17:13:42] <hashar>	 understandable
[17:13:47] <jynus>	 I have deployed now
[17:15:21] <hashar>	 will be fine
[17:15:57] <jynus>	 I will chose one server to test rsyslog.conf.erb
[17:16:46] <volans>	 it's already live?
[17:16:58] <jynus>	 nope
[17:17:08] <jynus>	 I am seeing all use them, right?
[17:17:16] <jynus>	 mw, eventlogging
[17:17:27] <hashar>	 feel free to skip that one
[17:17:34] <hashar>	 if it is has too much potential impact
[17:17:35] <jynus>	 no
[17:17:50] <jynus>	 we can just do an escalated deploy
[17:17:58] <jynus>	 maybe leave it for later
[17:18:39] <jynus>	 let's do the "easy ones"
[17:18:53] <hashar>	 can use that one https://gerrit.wikimedia.org/r/#/c/337289/
[17:18:59] <hashar>	 changes the jenkins default file
[17:19:27] <wikibugs>	 (03PS5) 10Jcrespo: jenkins: sync default file with upstream 1.651.3 [puppet] - 10https://gerrit.wikimedia.org/r/337289 (owner: 10Hashar)
[17:19:30] <jynus>	 yes, that was easy
[17:19:39] <hashar>	 I left some extended comment on https://gerrit.wikimedia.org/r/#/c/337289/1/modules/jenkins/files/etc_default_jenkins
[17:20:01] <hashar>	 a gotcha is I set:   PREFIX=/$NAME    that would make the web service to use  /jenkins/  as base path
[17:20:13] <hashar>	 but that setting is not passed to the command line;  it is hardcoded to --prefix=ci/
[17:21:24] <jynus>	 but
[17:21:44] <jynus>	 is NAME set?
[17:21:56] <jynus>	 oh, yes
[17:21:58] <hashar>	 yeah at the top
[17:21:58] <jynus>	 sorry
[17:22:03] <hashar>	 :}
[17:22:12] <jynus>	 I didn't want to overwrite /
[17:22:24] <hashar>	 the more eyes the better. I think I wrote that one sunday evening
[17:22:47] <hashar>	 I will have to move out after that one
[17:22:53] <jynus>	 why not changing the execution line?
[17:23:03] <jynus>	 so you can be 100% upstream
[17:23:08] <hashar>	 that is done later on in another patchset
[17:23:13] <hashar>	 which makes the default an erb template
[17:23:23] <hashar>	 I wanted to have small incremental changes
[17:23:23] <jynus>	 no, I mean what it calls this
[17:23:31] <jynus>	 ok ok
[17:23:39] <jynus>	 as long as you promise to do it
[17:23:43] <hashar>	 ultimately jenkins will end up being managed by systemd
[17:24:01] <hashar>	 and the default file content fully generated from hiera /   jenkins::service::config   or something like that
[17:24:15] <hashar>	 I am not sure where to head. But it seems to me hiera is easier to handle than some bash like script
[17:24:26] <hashar>	 but yeah, baby steps essentially :}
[17:24:38] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] jenkins: sync default file with upstream 1.651.3 [puppet] - 10https://gerrit.wikimedia.org/r/337289 (owner: 10Hashar)
[17:24:58] <hashar>	 running puppet on contint2001
[17:25:12] <jynus>	 wait
[17:25:13] <hashar>	 and disabled it on cont1001
[17:25:19] <jynus>	 I am deploying still
[17:26:38] <hashar>	 restartedjenkins on contint2001
[17:27:23] <hashar>	 looks god
[17:27:25] <jynus>	 everthing ok?
[17:27:27] <hashar>	 doing same on contint1001
[17:28:19] <hashar>	 --webroot=/var/run/jenkins/war --httpPort=8080 --ajp13Port=-1 --prefix=/ci --accessLoggerClassName=winstone.accesslog.SimpleAccessLogger --simpleAccessLogger.format=combined --simpleAccessLogger.file=/var/log/jenkins/access.log
[17:28:23] <hashar>	 which is good :)
[17:28:25] <hashar>	 \O/
[17:28:57] <hashar>	 thanks a  ton !
[17:29:20] <jynus>	 what is the strategy for the mount points
[17:29:39] <hashar>	 that one is empty
[17:29:45] <hashar>	 it is a leftover
[17:29:54] <hashar>	 but I gotta escape so we can skip it for now
[17:30:02] <hashar>	 337014 admin: basic .vimrc for hashar
[17:30:03] <hashar>	 332981 Gemfile: add xmlrpc for ruby 2.4
[17:30:08] <hashar>	 are easy / no impact on prod
[17:30:19] <jynus>	 oh, I missread it
[17:30:20] <hashar>	 and I think that will be good enough for today swat :}
[17:30:41] <jynus>	 I compared it as adding /srv/ssd
[17:30:52] <jynus>	 it is all good
[17:30:56] <hashar>	 yeah it is from when we had ssd
[17:31:03] <wikibugs>	 (03PS3) 10Jcrespo: contint: remove /srv/ssd [puppet] - 10https://gerrit.wikimedia.org/r/337286 (owner: 10Hashar)
[17:31:14] <hashar>	 the last user was zuul-merger on scandium. But that got phased out :}
[17:31:30] <jynus>	 I can also deploy the user dir change, no problem
[17:31:36] <hashar>	 neat
[17:32:25] <wikibugs>	 (03CR) 10Jcrespo: [V: 032 C: 032] contint: remove /srv/ssd [puppet] - 10https://gerrit.wikimedia.org/r/337286 (owner: 10Hashar)
[17:33:04] <wikibugs>	 (03PS2) 10Jcrespo: admin: basic .vimrc for hashar [puppet] - 10https://gerrit.wikimedia.org/r/337014 (owner: 10Hashar)
[17:33:15] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] admin: basic .vimrc for hashar [puppet] - 10https://gerrit.wikimedia.org/r/337014 (owner: 10Hashar)
[17:33:43] <wikibugs>	 (03CR) 10Jcrespo: [V: 032 C: 032] admin: basic .vimrc for hashar [puppet] - 10https://gerrit.wikimedia.org/r/337014 (owner: 10Hashar)
[17:34:37] <hashar>	 jynus: thx.  I gotta run out now sorry :/   
[17:34:43] <jynus>	 bye!
[17:36:50] <wikibugs>	 (03PS5) 10Jcrespo: Gemfile: add xmlrpc for ruby 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/332981 (owner: 10Hashar)
[17:39:23] <wikibugs>	 (03CR) 10Jcrespo: [V: 032 C: 032] Gemfile: add xmlrpc for ruby 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/332981 (owner: 10Hashar)
[17:43:00] <jynus>	 puppet swat is done, but evil swatter rejected my CR :-( https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=1530651&oldid=1530254
[17:46:09] <RainbowSprinkles>	 Aww, mine didn't get carried over from the missed puppetswat
[17:46:12] <RainbowSprinkles>	 Next week I guess
[17:46:49] <jynus>	 RainbowSprinkles, which one?
[17:47:31] <RainbowSprinkles>	 https://gerrit.wikimedia.org/r/#/c/332707/ - updates a bunch of links in (mostly HTML pages of sorts) to use HTTPS instead of protocol-relative URLs
[17:47:51] <RainbowSprinkles>	 It's all internal-to-WMF links, so we know HTTPS exists and isn't going away
[17:48:36] <jynus>	 risk-wise I would be ok to deploy that
[17:48:43] <jynus>	 but I am not sure I agree with it
[17:50:00] <jynus>	 maybe if there was a better reason e.g. we want to hardcode https in case of X or something
[17:50:25] <jynus>	 or, it breaks X, Y and Z
[17:51:09] <RainbowSprinkles>	 Eh, not so much a reason other than being pedantic and consistent.
[17:51:37] <RainbowSprinkles>	 ie: If I were writing this file today, I wouldn't have used protocol-relative URLs
[17:52:17] <bblack>	 it's better to just use https: everywhere, so we're not relying on sts-preload to save us
[17:52:36] <jynus>	 bblack, if you are ok with it, I will deploy it
[17:52:47] <jynus>	 I just didn't see a strong reason to do it
[17:52:49] <bblack>	 the only caveat, and the reason we don't make a simple policy announcement of https:// -on-everything, is that some internal-only stuff still doesn't speak https
[17:53:12] <RainbowSprinkles>	 bblack: Indeed, this isn't touching anything like that though
[17:53:24] <RainbowSprinkles>	 This is all links to wikis or other known-https stuff
[17:53:28] <bblack>	 right
[17:53:29] <RainbowSprinkles>	 Links to meta, wmfwiki, etc
[17:53:40] <icinga-wm>	 PROBLEM - puppet last run on elastic1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:54:17] <jynus>	 ok, for me context like that would matter, being in the commit message would have eased me up
[17:54:35] <bblack>	 jynus: basically if you use an http:// link to get somewhere, there's a chance for a mitm to hijack the redirect to https and do Bad Things.  Initial access via https is better.
[17:54:57] <jynus>	 so, in general we prefer to harcode https
[17:55:05] <bblack>	 HSTS and STS-preload are designed to minimize that risk (browser internally translates http to https because it knows we're on the https-only list, basically)
[17:55:10] <jynus>	 unless that is not availalbe, right?
[17:55:26] <jynus>	 which is not the case here
[17:55:40] <bblack>	 right, HSTS only works after their first (un-hacked) visit, and STS-preload isn't there in all possible user agents, just modern widespread browsers (FF, Chrome, IE11)
[17:55:53] <RainbowSprinkles>	 Assuming you aren't benefiting from HSTS/STS, you can avoid a MITM on the mixed content operations/puppet/modules/publichtml/templates/index.html.erb
[17:56:02] <jynus>	 ok, that makes sense
[17:56:03] <RainbowSprinkles>	 Which loads some images from upload.wm.o
[17:56:15] <jynus>	 let me have a quick look at all the domains changed
[17:56:26] <jynus>	 in case there is one odd
[17:56:39] <wikibugs>	 (03PS4) 10Jcrespo: Swap from protocol-relative urls to https everywhere [puppet] - 10https://gerrit.wikimedia.org/r/332707 (owner: 10Chad)
[17:56:54] <wikibugs>	 (03CR) 10Nemo bis: "I guess the commit message might as well claim to address T54253. :)" [puppet] - 10https://gerrit.wikimedia.org/r/332707 (owner: 10Chad)
[17:57:55] <RainbowSprinkles>	 FWIW, discovering protocol relative URLs was *awesome* in the transition period before we supported HTTPS-by-default-for-everyone :)
[17:58:09] <RainbowSprinkles>	 (I remember finding that in an RFC and being like WTF NO WAY THAT ROCKS)
[17:58:09] <jynus>	 ha ha
[17:58:42] <jynus>	 I have to go to a meeting
[17:58:48] <RainbowSprinkles>	 Nemo_bis: #til about T54253
[17:58:49] <stashbot>	 T54253: Protocol-relative URLs are poorly supported or unsupported by a number of HTTP clients - https://phabricator.wikimedia.org/T54253
[17:59:00] <jynus>	 can you amend that, which is probably a good suggestion
[17:59:10] <jynus>	 and I will deploy in 30 minutes or so
[17:59:18] <jynus>	 only the commit message
[17:59:33] <RainbowSprinkles>	 I'll amend the commit message, yeah one min
[18:00:04] <jouncebot>	 gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170216T1800).
[18:00:23] <halfak>	 I've got a deployment of ORES.  It should be easy. 
[18:00:40] <wikibugs>	 (03PS5) 10Chad: Swap from protocol-relative urls to https everywhere [puppet] - 10https://gerrit.wikimedia.org/r/332707 (https://phabricator.wikimedia.org/T54253)
[18:00:50] <bd808>	 halfak: DO NOT JINX YOURSELF! ;)
[18:01:08] <halfak>	 Good point.  I'm sure there'll be problems
[18:01:08] <RainbowSprinkles>	 "None of these URLs will ever go back to being non-https. Also, per the linked task, not all clients behave well with protocol-relative URLs, so avoiding them except when absolutely necessary is good for them."
[18:01:10] <greg-g>	 first rule of deployments: nothing is easy :)
[18:01:15] * halfak looks both ways -- shifty-eyed
[18:02:14] <RainbowSprinkles>	 jynus: Amended. I'll be around in ~30 when you're back. Thanks
[18:03:41] <halfak>	 !log deploying ores:e9bbda3
[18:03:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:03:48] <logmsgbot>	 !log halfak@tin Started deploy [ores/deploy@e9bbda3]: (no justification provided)
[18:03:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:05:34] <greg-g>	 halfak: scap now auto-logs for all deploys ^^
[18:05:40] <greg-g>	 the start and end
[18:05:59] <halfak>	 Thanks greg-g.  I'll remove that from our deploy script :) 
[18:06:12] <halfak>	 Canary looks good.  Moving forward
[18:06:55] <halfak>	 Oh.  Looks like we're still restarting the service. 
[18:09:23] <halfak>	 OK Confirmed canary moving forward now
[18:15:34] <RainbowSprinkles>	 halfak: Also, you can include a message in what you're doing and that's what'll be in IRC/SAL instead of (no justification provided)
[18:15:43] <RainbowSprinkles>	 `scap deploy "My awesome message is here"`
[18:15:55] <halfak>	 Gotcha.  Will add that to the deploy script. 
[18:15:55] <wikibugs>	 (03PS10) 10Muehlenhoff: Add account validation script / cron job [puppet] - 10https://gerrit.wikimedia.org/r/337032 (https://phabricator.wikimedia.org/T142836)
[18:16:07] <halfak>	 Do you think a phab task link is a good message?
[18:16:16] <RainbowSprinkles>	 That's a great message
[18:16:26] <halfak>	 OK will do. :) 
[18:16:27] <RainbowSprinkles>	 Anything that provides context to someone reading later is a good message :)
[18:16:38] <halfak>	 Was thinking the phab task would be perfect for that :) 
[18:16:39] <RainbowSprinkles>	 Phab tasks, gerrit changes
[18:16:40] <RainbowSprinkles>	 Etc
[18:17:37] <RainbowSprinkles>	 ¡log halfak@tin Started deploy [ores/deploy@e9bbda3]: T1234
[18:17:37] <stashbot>	 T1234: Restrict Bugzilla access to read-only - https://phabricator.wikimedia.org/T1234
[18:17:39] <RainbowSprinkles>	 eg ^
[18:17:56] <greg-g>	 also, phab task, eg just doing "scap deploy "rollout for T12345" makes stashbot mention the deploy ont he task, a la https://phabricator.wikimedia.org/T155527#3029942
[18:17:56] <stashbot>	 T12345: Create "annotation" namespace on Hebrew Wikisource - https://phabricator.wikimedia.org/T12345
[18:17:58] <halfak>	 Oh nice.  It know phab task shape
[18:18:13] <RainbowSprinkles>	 Oh yeah, stashbot too
[18:18:14] <RainbowSprinkles>	 :D
[18:20:00] <icinga-wm>	 PROBLEM - puppet last run on db1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:20:50] * halfak waits for "promote and restart"
[18:20:53] <wikibugs>	 06Operations, 06Performance-Team, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Prepare a reasonably performant warmup tool for MediaWiki caches (memcached/apc) - https://phabricator.wikimedia.org/T156922#3033929 (10Krinkle) Next steps: * [ ] Put node-warmup script in...
[18:20:59] <halfak>	 Deploy script is updated. 
[18:21:15] <halfak>	 RainbowSprinkles, I don't suppose I could go edit past messages to associate the task, could I?
[18:22:00] <icinga-wm>	 PROBLEM - puppet last run on db1062 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:22:18] <wikibugs>	 (03CR) 10VolkerE: gerrit: Make blue buttons look like OOUI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/337397 (https://phabricator.wikimedia.org/T158298) (owner: 10Ladsgroup)
[18:22:20] <RainbowSprinkles>	 halfak: You could edit the entry on the SAL on wikitech, but it wouldn't update the logstash store of it
[18:22:40] <icinga-wm>	 RECOVERY - puppet last run on elastic1023 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[18:22:41] <RainbowSprinkles>	 (or if we still have the twitter bridge, it wouldn't edit that)
[18:22:56] <halfak>	 RainbowSprinkles, gotcha.  Will leave it for now. 
[18:23:04] <halfak>	 and make sure to do it next time
[18:23:52] <wikibugs>	 (03CR) 10Muehlenhoff: [V: 032 C: 032] Add account validation script / cron job (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/337032 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff)
[18:27:10] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[18:27:50] <paladox>	 hashar ^^
[18:27:56] <paladox>	 oh he's not online
[18:28:23] <paladox>	 anyways there dosent look like any tests running on https://integration.wikimedia.org/zuul/ (by that i mean nodepool dosent seem to be working)
[18:29:12] <halfak>	 ores deploy successful
[18:29:16] <halfak>	 \o/
[18:29:24] <wikibugs>	 (03PS1) 10Muehlenhoff: Drop require_package for python-ldap [puppet] - 10https://gerrit.wikimedia.org/r/338150
[18:29:50] <icinga-wm>	 PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:30:28] <wikibugs>	 (03CR) 10Muehlenhoff: [V: 032 C: 032] Drop require_package for python-ldap [puppet] - 10https://gerrit.wikimedia.org/r/338150 (owner: 10Muehlenhoff)
[18:31:18] <volans>	 moritzm: what was the problem with the ldap dependency?
[18:31:34] <volans>	 just out of curiosity given I suggested to add tehm
[18:32:40] <moritzm>	 Duplicated declaration, see commit message
[18:32:49] <wikibugs>	 (03CR) 10Jcrespo: [C: 031] "They are all wiki sites, upload, tools and wikimedia portal." [puppet] - 10https://gerrit.wikimedia.org/r/332707 (https://phabricator.wikimedia.org/T54253) (owner: 10Chad)
[18:32:56] <wikibugs>	 (03CR) 10Chad: "So, the master cleanup bit needs two passes, as the localization cache files are owned by a different user. Sucks." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336730 (https://phabricator.wikimedia.org/T73313) (owner: 10Chad)
[18:33:10] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: No anomaly detected
[18:33:16] <moritzm>	 utils.pp could also be cleaned up to use require_package, but I rather wanted to resolve thd puppet failure quickly
[18:33:30] <icinga-wm>	 PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py]
[18:33:49] <wikibugs>	 (03PS6) 10Jcrespo: Swap from protocol-relative urls to https everywhere [puppet] - 10https://gerrit.wikimedia.org/r/332707 (https://phabricator.wikimedia.org/T54253) (owner: 10Chad)
[18:33:50] <icinga-wm>	 RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[18:35:08] <volans>	 sure, no problem, I though that require_package was ok with multiple declarations, and I guess that the error is due because utils.pp uses package()
[18:35:18] <moritzm>	 yeah, that's the problem
[18:35:31] <volans>	 :)
[18:36:35] <wikibugs>	 (03PS2) 10Jcrespo: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337860
[18:36:47] <wikibugs>	 (03Abandoned) 10Jcrespo: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337860 (owner: 10Jcrespo)
[18:38:48] <wikibugs>	 (03CR) 10Dzahn: [C: 031] udp2log: mirror traffic via udpmirror.py [puppet] - 10https://gerrit.wikimedia.org/r/338119 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi)
[18:44:38] <wikibugs>	 (03CR) 10Dzahn: "10:42 <mutante> - stop puppet" [puppet] - 10https://gerrit.wikimedia.org/r/338022 (owner: 10Paladox)
[18:46:37] <wikibugs>	 06Operations, 06Performance-Team, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Prepare a reasonably performant warmup tool for MediaWiki caches (memcached/apc) - https://phabricator.wikimedia.org/T156922#3034051 (10Volans) >>! In T156922#3033929, @Krinkle wrote: >  *...
[18:47:18] <wikibugs>	 06Operations, 06Operations-Software-Development, 07HHVM, 13Patch-For-Review: Upgrade all mw* servers to debian jessie - https://phabricator.wikimedia.org/T143536#2571031 (10Volans) What is the status of `terbium`? From the summary it appears to have been upgraded but the host is still a `trusty`.
[18:48:00] <icinga-wm>	 RECOVERY - puppet last run on db1009 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[18:48:30] <icinga-wm>	 RECOVERY - Disk space on labnet1001 is OK: DISK OK
[18:50:00] <icinga-wm>	 RECOVERY - puppet last run on db1062 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[18:50:44] <chasemp>	 !log stop noodepool to reset state on pool
[18:50:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:52:45] <chasemp>	 !log clean out nodepool instances
[18:52:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:53:04] <wikibugs>	 (03CR) 10Jcrespo: [V: 032 C: 032] Swap from protocol-relative urls to https everywhere [puppet] - 10https://gerrit.wikimedia.org/r/332707 (https://phabricator.wikimedia.org/T54253) (owner: 10Chad)
[18:53:50] <icinga-wm>	 PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d
[18:54:50] <icinga-wm>	 RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d
[19:00:04] <jouncebot>	 addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170216T1900). Please do the needful.
[19:02:20] <icinga-wm>	 RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[19:06:10] <icinga-wm>	 PROBLEM - puppet last run on cp1067 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:07:45] <chasemp>	 !log bump up nodepool allocated fixed ips set (I think it exhausted them errantly somehow?)
[19:07:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:16] <wikibugs>	 (03PS1) 10Volans: Add .gitreview file for Gerrit [software/cumin] - 10https://gerrit.wikimedia.org/r/338153 (https://phabricator.wikimedia.org/T154588)
[19:09:37] <wikibugs>	 (03CR) 10Chad: [C: 031] "Minor comment inline, but ok as-is." (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/338153 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans)
[19:10:39] <wikibugs>	 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 13Patch-For-Review: Cluster Access for Nithum Thain - https://phabricator.wikimedia.org/T157724#3034161 (10Nithum) Hi Rob, could you change the ssh public key:   ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDHQ6oDkb1WXmbizF6PX4hIELg7azLCcAaNiIl2ytjKTv7Dcun...
[19:11:07] <volans>	 RainbowSprinkles: oh nice! didn't know about it and yes I'm using at least another branch
[19:11:16] <RainbowSprinkles>	 New-ish feature :)
[19:11:25] <RainbowSprinkles>	 Lots of repos don't use it yet
[19:11:35] <volans>	 so just use track instead of defaultbranch?
[19:11:40] <RainbowSprinkles>	 Yeah
[19:11:51] <volans>	 so no need to change it in other branches
[19:11:52] <RainbowSprinkles>	 Benefit means when you make a new branch you don't need to update gitreview file
[19:11:52] <volans>	 nice!
[19:11:54] <RainbowSprinkles>	 Yep
[19:12:10] <volans>	 thanks for the review then, changing it immediately :D
[19:12:10] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 18 data above and 0 below the confidence bounds
[19:12:21] <RainbowSprinkles>	 volans: We started using it on MW core + extensions so when we do our weekly branches we didn't have to do 100 dummy edits and commits
[19:12:22] <gehel>	 !log restarting kartotherian / tilerator on maps-test*
[19:12:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:12:41] <RainbowSprinkles>	 Saved tons of round-trips
[19:13:09] <godog>	 thcipriani / RainbowSprinkles / thcipriani I'm ok to push the new scap version shortly btw, maybe after swat if that's still on
[19:13:21] <RainbowSprinkles>	 jouncebot: next
[19:13:21] <jouncebot>	 In 0 hour(s) and 46 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170216T2000)
[19:13:21] <wikibugs>	 (03PS2) 10Volans: Add .gitreview file for Gerrit [software/cumin] - 10https://gerrit.wikimedia.org/r/338153 (https://phabricator.wikimedia.org/T154588)
[19:13:49] <chasemp>	 !log clean out /var/log/ on labnet1001 as it filled up
[19:13:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:10] <RainbowSprinkles>	 godog: Nothing was on swat for today
[19:14:16] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "compiler says" [puppet] - 10https://gerrit.wikimedia.org/r/337388 (owner: 10Hashar)
[19:14:24] <RainbowSprinkles>	 We could go ahead now, train doesn't start for another 45m
[19:14:57] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "compiler says: "Error: Must pass http_port to Class[Contint::Proxy_jenkins] at /mnt/jenkins-workspace/puppet-compiler/5495/change/src/modu" [puppet] - 10https://gerrit.wikimedia.org/r/337388 (owner: 10Hashar)
[19:15:29] <godog>	 https://fat.gfycat.com/DefinitiveSomeKingbird.webm
[19:15:54] <volans>	 awesome! :D
[19:16:02] <RainbowSprinkles>	 I want one of those
[19:16:06] <RainbowSprinkles>	 No real reason, just seems cool
[19:16:23] <godog>	 that's how trains mate, I'm told
[19:18:46] <godog>	 ok going ahead with reprepro and the puppet patch
[19:19:16] <volans>	 something wrong in Zuul? seems there are a lot of waiting checks
[19:19:20] <RainbowSprinkles>	 godog: My favorite train meme is the photo on: https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys
[19:19:26] <RainbowSprinkles>	 volans: See #-releng
[19:19:36] <RainbowSprinkles>	 Nodepool is wonky, think it's slowly catching up
[19:20:10] <volans>	 RainbowSprinkles: ok, thanks, I was not there
[19:20:30] <RainbowSprinkles>	 yw. Yeah, it's backed up but known. Hopefully unwinding its backlog now...
[19:21:56] <wikibugs>	 (03CR) 10Dzahn: [C: 031] Remove the templates dir, not needed anymore [puppet] - 10https://gerrit.wikimedia.org/r/337837 (owner: 10Jcrespo)
[19:22:21] <wikibugs>	 (03PS2) 10Filippo Giunchedi: scap: upgrade to 3.5.2-1 [puppet] - 10https://gerrit.wikimedia.org/r/338138 (https://phabricator.wikimedia.org/T127762)
[19:23:30] <wikibugs>	 (03CR) 10Chad: "Commit message nit, but otherwise ok" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/337837 (owner: 10Jcrespo)
[19:23:50] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 032 C: 032] scap: upgrade to 3.5.2-1 [puppet] - 10https://gerrit.wikimedia.org/r/338138 (https://phabricator.wikimedia.org/T127762) (owner: 10Filippo Giunchedi)
[19:24:08] <wikibugs>	 (03CR) 10Chad: Remove the templates dir, not needed anymore (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/337837 (owner: 10Jcrespo)
[19:25:07] <wikibugs>	 06Operations, 10Wikimedia-Logstash: Get 5xx logs into kibana/logstash - https://phabricator.wikimedia.org/T149451#3034223 (10Ottomata) It should!  But I haven’t tried it.  General options:   -C | -P | -L       Mode: Consume, Produce or metadata List   -G <group-id>      Mode: High-level KafkaConsumer (Kafka 0....
[19:25:10] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: No anomaly detected
[19:25:27] <mutante>	 ^:)
[19:25:45] <RainbowSprinkles>	 jynus: Oh, I didn't see you merge my change a bit ago re: https links. Thx!
[19:32:18] <wikibugs>	 (03PS1) 10Jgreen: rename backup4001 to frbackup4001 for clarity [dns] - 10https://gerrit.wikimedia.org/r/338156
[19:32:50] <icinga-wm>	 PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 627 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3719929 keys, up 108 days 11 hours - replication_delay is 627
[19:33:10] <icinga-wm>	 PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 648 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3719921 keys, up 108 days 11 hours - replication_delay is 648
[19:33:23] <wikibugs>	 (03CR) 10Jgreen: [C: 032] rename backup4001 to frbackup4001 for clarity [dns] - 10https://gerrit.wikimedia.org/r/338156 (owner: 10Jgreen)
[19:35:10] <icinga-wm>	 RECOVERY - puppet last run on cp1067 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[19:35:36] <godog>	 RainbowSprinkles: 3.5.2-1 is on tin btw
[19:37:50] <icinga-wm>	 RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3695310 keys, up 108 days 11 hours - replication_delay is 0
[19:38:20] <RainbowSprinkles>	 godog: Confirmed, lgtm
[19:39:10] <icinga-wm>	 RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3695104 keys, up 108 days 11 hours - replication_delay is 0
[20:00:04] <jouncebot>	 thcipriani: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170216T2000).
[20:00:32] * thcipriani does.
[20:04:59] <wikibugs>	 06Operations, 06Labs, 06Release-Engineering-Team: contintcloud project thinks it is using 206 fixed-ip quota errantly - https://phabricator.wikimedia.org/T158350#3034394 (10chasemp)
[20:05:04] <wikibugs>	 06Operations, 06Labs, 06Release-Engineering-Team: contintcloud project thinks it is using 206 fixed-ip quota errantly - https://phabricator.wikimedia.org/T158350#3034406 (10chasemp) p:05Triage>03High
[20:06:00] <wikibugs>	 06Operations, 06Labs, 06Release-Engineering-Team: contintcloud project thinks it is using 206 fixed-ip quota errantly - https://phabricator.wikimedia.org/T158350#3034394 (10chasemp) a:03Andrew currently nodepool is going along fine except the quota is clearly wrong.  I don't yet understand why the current...
[20:10:53] <wikibugs>	 (03PS1) 10Thcipriani: group1 wikis to 1.29.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338161
[20:10:55] <wikibugs>	 (03CR) 10Thcipriani: [C: 032] group1 wikis to 1.29.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338161 (owner: 10Thcipriani)
[20:13:16] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.29.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338161 (owner: 10Thcipriani)
[20:13:45] <logmsgbot>	 !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.29.0-wmf.12
[20:13:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:32] <wikibugs>	 (03CR) 10jenkins-bot: group1 wikis to 1.29.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338161 (owner: 10Thcipriani)
[20:37:34] <wikibugs>	 (03CR) 10Volans: [C: 032] Add .gitreview file for Gerrit (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/338153 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans)
[20:38:25] <wikibugs>	 (03Draft1) 10Paladox: Phabricator: Fix systemd phd.service file [puppet] - 10https://gerrit.wikimedia.org/r/338163
[20:38:31] <wikibugs>	 (03PS2) 10Paladox: Phabricator: Fix systemd phd.service file [puppet] - 10https://gerrit.wikimedia.org/r/338163
[20:39:02] <wikibugs>	 (03CR) 10Dzahn: "can confirm this from tests done on labs instance" [puppet] - 10https://gerrit.wikimedia.org/r/338163 (owner: 10Paladox)
[20:39:06] <wikibugs>	 (03Merged) 10jenkins-bot: Add .gitreview file for Gerrit [software/cumin] - 10https://gerrit.wikimedia.org/r/338153 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans)
[20:39:37] <wikibugs>	 (03CR) 10Dzahn: "i went to phab2001 to check the status there, and" [puppet] - 10https://gerrit.wikimedia.org/r/338163 (owner: 10Paladox)
[20:40:07] <wikibugs>	 (03CR) 10Paladox: "> i went to phab2001 to check the status there, and" [puppet] - 10https://gerrit.wikimedia.org/r/338163 (owner: 10Paladox)
[20:40:42] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] "Some (mostly minor) comments, see inline." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/333969 (https://phabricator.wikimedia.org/T155578) (owner: 10EBernhardson)
[20:44:10] <icinga-wm>	 PROBLEM - Juniper alarms on asw-ulsfo.mgmt.ulsfo.wmnet is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms
[20:45:10] <icinga-wm>	 PROBLEM - Host ripe-atlas-ulsfo is DOWN: PING CRITICAL - Packet loss = 100%
[20:47:32] <wikibugs>	 06Operations, 10ops-ulsfo, 10netops: lvs4002 power supply failure - https://phabricator.wikimedia.org/T151273#3034504 (10RobH) Ok, I took a redundant supply from cp4007 and installed it into lvs4002  power supply 2 slot.  Less than a minute later, the system killed the new power supply.  Record:      1022 Da...
[20:47:39] <mutante>	 joFeb 16 20:38:11 phab2001 systemd[1]: [/etc/systemd/system/phd.service:5] Unknown lvalue 'User' in section 'Unit'
[20:48:03] <wikibugs>	 (03PS3) 10Paladox: Phabricator: Fix systemd phd.service file [puppet] - 10https://gerrit.wikimedia.org/r/338163
[20:48:12] <mutante>	 i didn't mean to paste that, but yes ^ 
[20:48:28] <mutante>	 that is on phab2001 . iridium is ok, not jessie
[20:49:10] <wikibugs>	 06Operations, 10ops-eqiad, 06Services (watching): Degraded RAID on restbase-dev1001 - https://phabricator.wikimedia.org/T157425#3034511 (10Eevans) Is there an ETA on this?   We have some testing as a part of T156199 that could benefit from this environment; Having some idea would help with planning these tasks.
[20:52:22] <wikibugs>	 (03PS4) 10Paladox: Phabricator: Fix systemd phd.service file [puppet] - 10https://gerrit.wikimedia.org/r/338163
[21:06:08] <wikibugs>	 (03PS1) 10Thcipriani: all wikis to 1.29.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338168
[21:06:10] <wikibugs>	 (03CR) 10Thcipriani: [C: 032] all wikis to 1.29.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338168 (owner: 10Thcipriani)
[21:07:27] <wikibugs>	 (03Merged) 10jenkins-bot: all wikis to 1.29.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338168 (owner: 10Thcipriani)
[21:07:36] <wikibugs>	 (03CR) 10jenkins-bot: all wikis to 1.29.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338168 (owner: 10Thcipriani)
[21:08:02] <logmsgbot>	 !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.29.0-wmf.12
[21:08:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:18:23] <SMalyshev>	 anybody knows whether mw* hosts are time-synced? I get edits on test.wikidata.org which are 20 secs in the past
[21:18:56] <twentyafterfour>	 SMalyshev: they _should_ be time synced
[21:19:56] <SMalyshev>	 ok maybe my local clock is broken then...
[21:30:10] <wikibugs>	 (03CR) 1020after4: [C: 031] Phabricator: Fix systemd phd.service file [puppet] - 10https://gerrit.wikimedia.org/r/338163 (owner: 10Paladox)
[21:39:52] <Reedy>	 !log Deleted around 9500 pre 2013 captchas
[21:39:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:39:58] <Reedy>	 !log make that 2017
[21:40:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:43:56] <wikibugs>	 (03PS1) 10MaxSem: Tabular data license CC0-1.0+ -> CC0-1.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338208 (https://phabricator.wikimedia.org/T154075)
[21:51:26] <Zppix>	 (note: this is just a joke i know this is a serious channel but i think 1 message wont hurt) Reedy, time traveling while doing tasks is discouraged and can cause confusion please avoid time traveling until your are done with the task at hand :P
[21:51:54] <Reedy>	 Zppix: The funny thing is the captchas were actually from 2014
[21:51:55] <Reedy>	 *2013
[21:51:56] <Reedy>	 ffs
[21:52:03] <Reedy>	 Hence my slip up
[21:52:35] <wikibugs>	 (03CR) 10Dzahn: "much better, but on reboot it fails because mysql is not started first" [puppet] - 10https://gerrit.wikimedia.org/r/338163 (owner: 10Paladox)
[21:53:47] <Zppix>	 Reedy i figured as much, but until then time traveling privelges are suspended
[21:57:19] <wikibugs>	 (03PS7) 10Dzahn: jenkins: support variable prefix setting [puppet] - 10https://gerrit.wikimedia.org/r/337307 (owner: 10Hashar)
[21:58:49] <MaxSem>	 wowwwww
[21:58:58] <MaxSem>	 fatalmonitor looks boringly clean!
[21:59:48] <Zppix>	 MaxSem want me to fix that for you?
[22:00:04] <jouncebot>	 MaxSem and jgirault: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170216T2200). Please do the needful.
[22:00:38] <MaxSem>	 I'mv gonna try to fix that right now!
[22:00:51] <wikibugs>	 (03PS5) 10Paladox: Phabricator: Fix systemd phd.service file [puppet] - 10https://gerrit.wikimedia.org/r/338163
[22:04:01] <mutante>	 !log phab2001 - start/stop phd, testing gerrit 338163
[22:04:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:04:51] <wikibugs>	 (03CR) 10Dzahn: [C: 032] "works on phab2001 and it was tested on labs that services come back after reboot now, thank you for this fix" [puppet] - 10https://gerrit.wikimedia.org/r/338163 (owner: 10Paladox)
[22:08:01] <wikibugs>	 (03CR) 10Dzahn: ". correction.. still an issue on reboot, needs follow-up fix, but this was not wrong, it was needed too" [puppet] - 10https://gerrit.wikimedia.org/r/338163 (owner: 10Paladox)
[22:09:29] <wikibugs>	 (03PS4) 10Dzahn: adjust wikimania regex for mobile hosts, cover 2002-2099 [puppet] - 10https://gerrit.wikimedia.org/r/337893 (https://phabricator.wikimedia.org/T152882)
[22:11:01] <wikibugs>	 (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/5497/contint1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/337307 (owner: 10Hashar)
[22:11:19] <wikibugs>	 (03PS8) 10Dzahn: jenkins: support variable prefix setting [puppet] - 10https://gerrit.wikimedia.org/r/337307 (owner: 10Hashar)
[22:16:45] <wikibugs>	 (03PS3) 10Smalyshev: [DNM] [WIP] Allow SPARQL endpoint to be queries via POST [puppet] - 10https://gerrit.wikimedia.org/r/243883 (https://phabricator.wikimedia.org/T112151)
[22:17:58] <wikibugs>	 (03PS4) 10Smalyshev: [DNM] [WIP] Allow SPARQL endpoint to be queries via POST [puppet] - 10https://gerrit.wikimedia.org/r/243883 (https://phabricator.wikimedia.org/T112151)
[22:18:05] <MaxSem>	 dear Zuul, how much sacrifice do you need?
[22:18:23] <wikibugs>	 (03CR) 10Smalyshev: [C: 04-1] "Not to be deployed until Blazegraph patch for X-BIGDATA-READ-ONLY support is merged." [puppet] - 10https://gerrit.wikimedia.org/r/243883 (https://phabricator.wikimedia.org/T112151) (owner: 10Smalyshev)
[22:18:37] <wikibugs>	 (03CR) 10Smalyshev: [C: 04-1] [DNM] [WIP] Allow SPARQL endpoint to be queries via POST [puppet] - 10https://gerrit.wikimedia.org/r/243883 (https://phabricator.wikimedia.org/T112151) (owner: 10Smalyshev)
[22:19:42] <hashar>	 mutante: if you are around? I am willing to remove the Zuul gearman icinga probe. It is useless after all
[22:20:00] <hashar>	 mutante: I should just use a threshold instead of the anomaly detector as you suggested yesterday
[22:20:03] <Zppix>	 MaxSem here have my lamb see if the zuul gods will be appeased
[22:20:07] <mutante>	 hashar: i am around, and i am testing the latest merge on contint2001
[22:20:14] <mutante>	 while i stopped puppet on 1001 for a moment
[22:20:20] <hashar>	 ahh
[22:20:25] <MaxSem>	 IT WORKED
[22:20:26] <hashar>	 yeah that was my morning hack
[22:20:30] <Zppix>	 MaxSem your welcome
[22:20:48] <hashar>	 I probably should document it a bit more. But the idea is to have a minimum threshold for the anomaly detection
[22:20:58] <mutante>	 hashar: there is a problem with https://gerrit.wikimedia.org/r/#/c/337307/8/modules/contint/templates/apache/proxy_jenkins.erb
[22:21:07] <mutante>	 there are no new lines
[22:21:12] <mutante>	   7 ProxyPass       /ci http://localhost:8080/ciProxyPassReverse    /ci http://localhost:8080/ciProxyRequests       Of    f
[22:21:37] <mutante>	 it ends up on a single line in /etc/apache2/jenkins_proxy
[22:23:00] <paladox>	 mutante i had that same problem
[22:23:06] <wikibugs>	 (03CR) 10Dzahn: "somehow there are missing new lines in the resulting /etc/apache2/jenkins_proxy" [puppet] - 10https://gerrit.wikimedia.org/r/337307 (owner: 10Hashar)
[22:23:14] <paladox>	 remove the - - lines from <%= @prefix -%>
[22:23:20] <paladox>	 hashar ^^
[22:23:33] <hashar>	 :(
[22:23:56] <paladox>	 hashar should be fixable by removing the - <%= @prefix -%> -> <%= @prefix %>
[22:23:58] <wikibugs>	 (03CR) 10MaxSem: [C: 032] Tabular data license CC0-1.0+ -> CC0-1.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338208 (https://phabricator.wikimedia.org/T154075) (owner: 10MaxSem)
[22:24:42] <hashar>	 why do I always get those .erb things wrong :(
[22:24:47] <paladox>	 i had that problem on the logstash change for gerrit.
[22:25:16] <mutante>	 hashar: re: icinga check. all up to you, we can remove it or ACK it a little longer if we see a chance to fix it later
[22:26:55] <wikibugs>	 (03Merged) 10jenkins-bot: Tabular data license CC0-1.0+ -> CC0-1.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338208 (https://phabricator.wikimedia.org/T154075) (owner: 10MaxSem)
[22:27:11] <wikibugs>	 (03CR) 10jenkins-bot: Tabular data license CC0-1.0+ -> CC0-1.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338208 (https://phabricator.wikimedia.org/T154075) (owner: 10MaxSem)
[22:27:55] <wikibugs>	 (03PS1) 10Hashar: contint: keeping trailing new line in proxy_jenkins [puppet] - 10https://gerrit.wikimedia.org/r/338274
[22:28:04] <hashar>	 paladox: mutante: https://gerrit.wikimedia.org/r/338274   should keep the newlines 
[22:28:51] <hashar>	 mutante: I will just drop the icinga check. Preparing a patch for that. If you want the details https://phabricator.wikimedia.org/T70113#3034630
[22:29:06] <hashar>	 the anomaly band closely follow the raising metrics, and thus there is no anomaly :]
[22:29:35] <mutante>	 ah! yea
[22:29:47] <wikibugs>	 (03CR) 10Paladox: contint: keeping trailing new line in proxy_jenkins (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/338274 (owner: 10Hashar)
[22:29:49] <paladox>	 thanks
[22:29:51] <mutante>	 so then let's try with a fixed "max number of jobs" 
[22:29:52] <hashar>	 I would need a better way, most probably just a threshold
[22:30:03] <hashar>	 I asked releng team for some feedback about it
[22:30:15] <hashar>	 so I guess we will come back with a better patch :]
[22:30:38] <mutante>	 ok!
[22:30:59] <hashar>	 but maybe that is just due to the holtWintersConfidenceBand being hard set to a delta=5
[22:31:21] <mutante>	 alright, let's do the template fix for now
[22:31:25] <hashar>	 so I will have to put a bit more thoughts in it
[22:31:26] <mutante>	 looks like there are some more lines with that
[22:31:33] <logmsgbot>	 !log maxsem@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/338208/ (duration: 00m 53s)
[22:31:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:31:46] <hashar>	 ahr
[22:31:50] <hashar>	 paladox found some more cases
[22:32:00] <paladox>	 Yep
[22:32:09] <hashar>	 ah no
[22:32:13] <mutante>	 but some of them are NOT supposed to be new lines
[22:32:14] <mutante>	 @paladox
[22:32:15] <hashar>	 because they are inside a line
[22:32:20] <paladox>	 Yep
[22:32:21] <hashar>	 so we actually dont want newlines in the others
[22:32:25] <mutante>	 line 14 is right though
[22:32:36] <logmsgbot>	 !log maxsem@tin Synchronized php-1.29.0-wmf.12/extensions/JsonConfig/: https://gerrit.wikimedia.org/r/#/c/338013/ (duration: 00m 42s)
[22:32:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:32:48] <hashar>	 <Proxy http://localhost:8080<%= @prefix -%>*>
[22:32:54] <hashar>	 there is trailing  *>
[22:32:55] <mutante>	 line 14 should be changed, line 40-44 should stay
[22:33:15] <wikibugs>	 (03PS5) 10Zppix: adjust wikimania regex for mobile hosts, cover 2002-2099 [puppet] - 10https://gerrit.wikimedia.org/r/337893 (https://phabricator.wikimedia.org/T152882) (owner: 10Dzahn)
[22:33:18] <hashar>	 used to be   <Proxy http://localhost:8080/ci*>
[22:33:32] <mutante>	 oh!
[22:33:32] <hashar>	 so we need to keep the  *> on the same line dont we?
[22:33:39] <logmsgbot>	 !log maxsem@tin Started scap: Update messages for https://gerrit.wikimedia.org/r/#/c/338013/
[22:33:39] <mutante>	 you are right, yes
[22:33:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:33:53] <hashar>	 eventually I will have some rspec tests and will probably add some for the template
[22:34:02] <wikibugs>	 (03CR) 10Dzahn: [C: 032] contint: keeping trailing new line in proxy_jenkins [puppet] - 10https://gerrit.wikimedia.org/r/338274 (owner: 10Hashar)
[22:34:12] <hashar>	 paladox: thanks for the hint!
[22:34:19] <paladox>	 Your welcome :)
[22:34:37] <hashar>	 for the icinga alarm, you are right lets ack it for a week ?
[22:34:50] <hashar>	 making sure it stay acknowledged on recovery
[22:35:00] <icinga-wm>	 PROBLEM - Disk space on tin is CRITICAL: DISK CRITICAL - free space: / 582 MB (1% inode=78%)
[22:35:05] <hashar>	 will revisit it and come with a proper fix for the anomaly check next week
[22:35:12] <mutante>	 ok, so the remaining diff is:
[22:35:13] <mutante>	 -PREFIX=/$NAME
[22:35:13] <mutante>	 +PREFIX=/ci
[22:35:21] <mutante>	 -JENKINS_ARGS="--webroot=/var/run/jenkins/war --httpPort=$HTTP_PORT --ajp13Port=$AJP_PORT --prefix=/ci $JENKINS_ACCESSLOG_ENABLE"
[22:35:24] <mutante>	 +JENKINS_ARGS="--webroot=/var/run/jenkins/war --httpPort=$HTTP_PORT --ajp13Port=$AJP_PORT --prefix=$PREFIX $JENKINS_ACCESSLOG_ENABLE"
[22:35:27] <mutante>	 looks good to me
[22:35:30] <hashar>	 yeah
[22:35:43] <mutante>	 now i am enabling puppet on contin1001 
[22:35:45] <mutante>	 to apply it there
[22:35:46] <hashar>	 that was a source of confusion. Tripped on it earlier today during the puppet swat
[22:36:04] <hashar>	 surely setting PREFIX to a wrong value was confusing, but that is just because before PREFIX was not used
[22:36:08] <hashar>	 \o/
[22:36:26] <mutante>	 ok, it's done
[22:36:27] <hashar>	 I dont know whether puppet reload apache, i think it does
[22:36:32] <hashar>	 anyway that is a noop for ci itself
[22:36:40] <mutante>	 hmm. it did not
[22:36:43] <hashar>	 only can break the https://integration.wikimedia.org/
[22:37:08] <mutante>	 i restarted apache to be sure
[22:37:08] <hashar>	 and the various sub pages the proxy to jenkins https://integration.wikimedia.org/ci/  or the proxy to zuul https://integration.wikimedia.org/zuul/status.json
[22:37:10] <mutante>	 done
[22:37:29] <hashar>	 looks fine :)
[22:37:42] <mutante>	 great
[22:37:57] <hashar>	 for the context all that serie of patches has two goals:
[22:38:05] <hashar>	 hook jenkins behind systemd
[22:38:10] <icinga-wm>	 PROBLEM - Check systemd state on restbase2010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[22:38:10] <icinga-wm>	 PROBLEM - Check systemd state on restbase2009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[22:38:10] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.192.16.186:9042 on restbase2010 is CRITICAL: connect to address 10.192.16.186 and port 9042: Connection refused
[22:38:11] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.192.16.186:7001 on restbase2010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[22:38:11] <icinga-wm>	 PROBLEM - cassandra-c service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[22:38:20] <icinga-wm>	 PROBLEM - cassandra-a service on restbase2010 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[22:38:21] <hashar>	 and eventually let us have multiple jenkins instances on a single host
[22:38:28] <mutante>	 "making sure it stays acknowledeged after recovery" does not work
[22:38:34] <mutante>	 ACK is always "until next status change"
[22:38:38] <hashar>	 ah
[22:38:40] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.56 and port 9042: Connection refused
[22:38:40] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.192.48.56:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[22:38:43] <hashar>	 so I guess lets disable it
[22:38:43] <mutante>	 but we can "schedule downtime"
[22:38:47] <mutante>	 with a similar effect
[22:38:47] <hashar>	 oh
[22:39:03] <mutante>	 i will do that now.. downtime until next week
[22:39:03] <hashar>	 guess we can consider it down for a week so :]
[22:39:54] <wikibugs>	 (03CR) 10Ladsgroup: gerrit: Make blue buttons look like OOUI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/337397 (https://phabricator.wikimedia.org/T158298) (owner: 10Ladsgroup)
[22:40:21] <mutante>	 This service has been scheduled for fixed downtime from 2017-02-16 22:39:31 to 2017-02-23 00:39:31. Notifications for the service will not be sent out during that time period.
[22:41:40] <hashar>	 awesome. and sorry for all the trouble
[22:41:50] <hashar>	 I should have been more careful and actually play test the command on my local machine
[22:42:03] <mutante>	 no problem at all
[22:42:05] <hashar>	 I did that this morning, even retrievied the raw metrics from statsd,  played with them all locally
[22:42:15] <mutante>	 more work = more things to break
[22:42:15] <hashar>	 I think I was expecting things to work all magically
[22:42:53] <mutante>	 one step closer to multiple jenkins on one host :)
[22:42:59] <hashar>	 yeah hopefully
[22:43:19] <hashar>	 but I am distracting you, you might want to look at the restbase alarms above
[22:43:25] <mutante>	 paladox: gerrit and button color sounds like something for you :)
[22:43:52] <paladox>	 mutante yep, i've applied it on gerrit-test3. but i doint notice a different
[22:43:57] <paladox>	 difference
[22:45:05] <wikibugs>	 06Operations, 10Deployment-Systems, 06Release-Engineering-Team: tin.eqiad.wmnet / partition is full - https://phabricator.wikimedia.org/T158358#3034831 (10hashar)
[22:45:29] <wikibugs>	 (03CR) 10VolkerE: "In general I'd recommend not start aligning this tool with WMUI style guide, as it would go far beyond colors, if we want to do it right™." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/337397 (https://phabricator.wikimedia.org/T158298) (owner: 10Ladsgroup)
[22:46:00] <icinga-wm>	 RECOVERY - Disk space on tin is OK: DISK OK
[22:46:04] <mutante>	 !log tin - apt-get clean - 4.6G avail (T158359)
[22:46:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:46:56] <wikibugs>	 06Operations, 10Deployment-Systems, 06Release-Engineering-Team: tin.eqiad.wmnet / partition is full - https://phabricator.wikimedia.org/T158358#3034851 (10hashar) ``` $ du -h /var/lib/l10nupdate/caches/ 1.5G    /var/lib/l10nupdate/caches/cache-1.29.0-wmf.2 1.5G    /var/lib/l10nupdate/caches/cache-1.29.0-wmf....
[22:47:05] <hashar>	 oh nice 
[22:47:29] <hashar>	 RainbowSprinkles: twentyafterfour: Reedy: arent we supposed to clean the old l10nupdate caches ? 
[22:47:52] <twentyafterfour>	 hashar: yes
[22:47:52] <Reedy>	 probably
[22:47:58] <Reedy>	 Doesn't scap do it?
[22:48:00] <bd808>	 there was once a cron to do that
[22:48:05] <hashar>	 the one from 1.29.0-wmf.1 is from November 10th
[22:48:08] <hashar>	 and it is still on tin :(
[22:48:11] <mutante>	 it's been this multiple times :p
[22:48:16] <hashar>	 https://phabricator.wikimedia.org/T158358#3034851
[22:48:17] <wikibugs>	 (03CR) 10Paladox: "Actually it works on reboot now. Just that i had puppet disabled. After enabling it, rebooting works :)" [puppet] - 10https://gerrit.wikimedia.org/r/338163 (owner: 10Paladox)
[22:48:19] <mutante>	 i thought we have a cron that deletes it
[22:48:20] <bd808>	 Reedy: no, /var/lib/l10nupdate is the l10nupdate cron job
[22:48:25] <mutante>	 because it happened before
[22:48:54] <Reedy>	 Scap should just delete those when it deletes the /srv/mediawiki-staging branches
[22:49:17] <bd808>	 scap shouldn't have to know about l10nupdate unless we fold all that crap into scap
[22:49:22] <wikibugs>	 (03CR) 10Dzahn: "because... /var/run/phd/pid was owned by root, and once that was deleted and puppet re-created it, and "phd" user could own the pid file.." [puppet] - 10https://gerrit.wikimedia.org/r/338163 (owner: 10Paladox)
[22:49:28] <bd808>	 we really should just stop doing l10nupdate
[22:49:32] <Reedy>	 lol
[22:49:40] <Reedy>	 Can't we do it as as a scap plugin that hooks into that function?
[22:49:48] <mutante>	 i expected it's a cron that runs  find .. and deletes older than X
[22:49:53] <bd808>	 its usefulness is pretty low with the weekly branch cadence
[22:50:09] <Reedy>	 Was there a task created about it yet?
[22:50:51] <bd808>	 T130317
[22:50:51] <stashbot>	 T130317: setup automatic deletion of old l10nupdate  - https://phabricator.wikimedia.org/T130317
[22:50:56] <wikibugs>	 06Operations, 10Deployment-Systems, 06Release-Engineering-Team: tin.eqiad.wmnet / partition is full - https://phabricator.wikimedia.org/T158358#3034858 (10hashar) From IRC supposedly we had a cron job to garbage collect the old caches. ``` $ sudo -u l10nupdate -s crontab -l 0 2 * * * /usr/local/bin/l10nupdat...
[22:51:00] <bd808>	 and T133913
[22:51:00] <stashbot>	 T133913: Completely port l10nupdate to scap - https://phabricator.wikimedia.org/T133913
[22:51:05] <RainbowSprinkles>	 scap plugin WIP for cleaning up old branches
[22:51:07] <RainbowSprinkles>	 `scap clean`
[22:51:13] <RainbowSprinkles>	 New features up for review
[22:51:16] <bd808>	 and T119747
[22:51:17] <stashbot>	 T119747: deleteMediaWiki should delete /var/lib/l10nupdate/caches/cache-$wmgVersionNumber - https://phabricator.wikimedia.org/T119747
[22:51:20] <RainbowSprinkles>	 It's on the train docs
[22:51:23] <Reedy>	 I was meaning for turning it off?
[22:51:24] <RainbowSprinkles>	 Oh, those caches
[22:51:30] <RainbowSprinkles>	 Bleh, I can add to scap clean
[22:51:32] <hashar>	 bd808: mind copy pasting those tasks to https://phabricator.wikimedia.org/T158358 ? :]
[22:51:34] * RainbowSprinkles makes note
[22:51:54] <RainbowSprinkles>	 https://gerrit.wikimedia.org/r/#/c/336730/ https://gerrit.wikimedia.org/r/#/c/336901/
[22:52:01] <RainbowSprinkles>	 ^ reviews welcome kthnxbai
[22:52:10] <wikibugs>	 06Operations, 10Deployment-Systems, 06Release-Engineering-Team: tin.eqiad.wmnet / partition is full - https://phabricator.wikimedia.org/T158358#3034831 (10Dzahn) also see:  T130317, T133913, T119747
[22:52:20] <RainbowSprinkles>	 T119747 is outdated
[22:52:21] <hashar>	 :]
[22:52:26] <RainbowSprinkles>	 deleteMediaWiki was dumb so I killed it
[22:52:28] <wikibugs>	 06Operations, 10Deployment-Systems, 06Release-Engineering-Team: tin.eqiad.wmnet / partition is full - https://phabricator.wikimedia.org/T158358#3034871 (10bd808)
[22:52:34] <RainbowSprinkles>	 T119747 should be about scap clean now
[22:52:39] <Reedy>	 heh
[22:52:40] <icinga-wm>	 PROBLEM - puppet last run on mw1218 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:52:47] <wikibugs>	 06Operations, 10Deployment-Systems, 06Release-Engineering-Team: tin.eqiad.wmnet / partition is full - https://phabricator.wikimedia.org/T158358#3034872 (10Dzahn) I ran "apt-get clean" on tin which freed another 2G or so
[22:52:52] <RainbowSprinkles>	 But yes, see ./scap/plugins/clean.py in mediawiki-config
[22:52:54] <hashar>	 neat
[22:52:54] <thcipriani>	 hrm, scap clean for 1.29.0-wmf.5 had some...errors
[22:52:55] <RainbowSprinkles>	 If you want to contribute
[22:52:58] <RainbowSprinkles>	 Or review those patches
[22:53:09] <RainbowSprinkles>	 thcipriani: Yes, I know
[22:53:10] <hashar>	 so I am not touching anything since it is close to midnight here
[22:53:11] <RainbowSprinkles>	 See final comment on https://gerrit.wikimedia.org/r/#/c/336730/
[22:53:12] <hashar>	 I dont want to explode something
[22:53:18] <RainbowSprinkles>	 Err, wait, that hasn't landed
[22:53:20] <RainbowSprinkles>	 What errors?
[22:53:23] <RainbowSprinkles>	 *angry face*
[22:54:03] <thcipriani>	 perm errors for masters for deleting l10n dirs
[22:54:14] <thcipriani>	 then some other random ones
[22:54:17] * thcipriani makes a paste
[22:54:29] <RainbowSprinkles>	 Ah yes
[22:54:30] <RainbowSprinkles>	 Ok
[22:54:32] <RainbowSprinkles>	 Known
[22:54:43] <RainbowSprinkles>	 (I hate that permission discrepancy)
[22:54:54] <RainbowSprinkles>	 Same problem as on that comment I made in the gerrit change
[22:55:05] <RainbowSprinkles>	 deploy masters /srv/mediawiki-staging/ need 2 passes
[22:55:10] <RainbowSprinkles>	 One for l10n, one for everything else
[22:55:25] <RainbowSprinkles>	 (again, screw that discrepancy)
[22:55:52] <hashar>	 I like the idea of removing l10nupdate
[22:55:56] <RainbowSprinkles>	 +10000
[22:55:58] <RainbowSprinkles>	 write an rfc
[22:56:00] <icinga-wm>	 PROBLEM - puppet last run on mc1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:56:00] <wikibugs>	 06Operations, 10Deployment-Systems, 06Release-Engineering-Team: tin.eqiad.wmnet / partition is full - https://phabricator.wikimedia.org/T158358#3034874 (10Dzahn)
[22:56:03] <hashar>	 now that we deploy once per week, it is probably less useful than it used to be 
[22:56:06] <hashar>	 and probably
[22:56:11] <RainbowSprinkles>	 Indeed
[22:56:14] <RainbowSprinkles>	 Plus it's always broken
[22:56:16] <RainbowSprinkles>	 There's race conditions
[22:56:25] <mutante>	 did you delete stuff already?
[22:56:26] <RainbowSprinkles>	 Easy to overwrite to prior non-auto msgs
[22:56:27] <hashar>	 we could get the l10n bot to refresh translation once per week instead of on a daily basis accross a thousand of repos
[22:56:28] <mutante>	 21G free
[22:56:33] <thcipriani>	 https://phabricator.wikimedia.org/P4942
[22:56:33] <wikibugs>	 06Operations, 10Deployment-Systems, 06Release-Engineering-Team: tin.eqiad.wmnet / partition is full - https://phabricator.wikimedia.org/T158358#3034831 (10Reedy) I killed the 1.28 l10nupdate cache folders, and the 1.29 ones < .10
[22:56:38] <RainbowSprinkles>	 hashar: It refreshes anyway with train
[22:56:41] <mutante>	 ah 
[22:56:45] <RainbowSprinkles>	 Then subsequent scaps overwrite the messages
[22:56:45] <thcipriani>	 also a bunch of stuff in the .git directory
[22:56:49] <RainbowSprinkles>	 until l10nupdate comes along
[22:56:53] <RainbowSprinkles>	 And reverts
[22:56:57] <Reedy>	 yeah, once a week is pointless with the train
[22:56:58] <RainbowSprinkles>	 Again and again and again they fight
[22:57:04] <Reedy>	 So awesome.
[22:57:04] <RainbowSprinkles>	 Manual scaps v l10nupdate
[22:57:16] <wikibugs>	 06Operations, 10Deployment-Systems, 06Release-Engineering-Team: tin.eqiad.wmnet / partition is full - https://phabricator.wikimedia.org/T158358#3034878 (10Dzahn) Makes me think how mira is doing.
[22:57:20] <Reedy>	 RainbowSprinkles: That's gotta be the best reason to just disable it
[22:58:08] <logmsgbot>	 !log maxsem@tin Finished scap: Update messages for https://gerrit.wikimedia.org/r/#/c/338013/ (duration: 24m 29s)
[22:58:09] <RainbowSprinkles>	 Reedy: I've been meaning to write an RfC on it but haven't had the spare time
[22:58:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:01:00] <icinga-wm>	 PROBLEM - puppet last run on francium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:02:58] <logmsgbot>	 !log maxsem@tin Started scap: Another time, just ot make sure some files synched cuz lat time there were some mid-air collisions
[23:03:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:03:10] <icinga-wm>	 RECOVERY - Check systemd state on restbase2009 is OK: OK - running: The system is fully operational
[23:03:10] <icinga-wm>	 RECOVERY - cassandra-c service on restbase2009 is OK: OK - cassandra-c is active
[23:04:40] <icinga-wm>	 RECOVERY - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is OK: TCP OK - 0.036 second response time on 10.192.48.56 port 9042
[23:04:41] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.192.48.56:7001 on restbase2009 is OK: SSL OK - Certificate restbase2009-c valid until 2017-09-12 15:36:12 +0000 (expires in 207 days)
[23:05:10] <icinga-wm>	 RECOVERY - Check systemd state on restbase2010 is OK: OK - running: The system is fully operational
[23:05:20] <icinga-wm>	 RECOVERY - cassandra-a service on restbase2010 is OK: OK - cassandra-a is active
[23:06:10] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.192.16.186:9042 on restbase2010 is OK: TCP OK - 0.036 second response time on 10.192.16.186 port 9042
[23:06:10] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.192.16.186:7001 on restbase2010 is OK: SSL OK - Certificate restbase2010-a valid until 2017-11-17 00:54:24 +0000 (expires in 273 days)
[23:08:36] <wikibugs>	 (03PS6) 10Hashar: jenkins: allow changing the web service TCP port [puppet] - 10https://gerrit.wikimedia.org/r/337388
[23:08:57] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] jenkins: allow changing the web service TCP port [puppet] - 10https://gerrit.wikimedia.org/r/337388 (owner: 10Hashar)
[23:09:14] <wikibugs>	 (03CR) 10Hashar: "I passed http_port to the 'jenkins' class to have the daemon listen on that port.  But I forgot to pass http_port for the Apache proxy par" [puppet] - 10https://gerrit.wikimedia.org/r/337388 (owner: 10Hashar)
[23:09:36] <wikibugs>	 (03PS5) 10Hashar: jenkins: support umask via service default [puppet] - 10https://gerrit.wikimedia.org/r/337377
[23:14:32] <wikibugs>	 (03CR) 10Hashar: "https://puppet-compiler.wmflabs.org/5498/ pass :]" [puppet] - 10https://gerrit.wikimedia.org/r/337377 (owner: 10Hashar)
[23:15:27] <wikibugs>	 (03PS1) 10ArielGlenn: write results from getlastpageid and getlastrevid to stdout, not stderr [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/338280
[23:15:29] <wikibugs>	 (03PS1) 10ArielGlenn: update .gitignore with the binaries for the new utilities [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/338281
[23:15:31] <wikibugs>	 (03PS1) 10ArielGlenn: script to check whether page range of bz2 checkpoint file is correct [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/338282
[23:15:38] <apergos>	 yeah, I was hoarding, sorry
[23:17:18] <wikibugs>	 (03CR) 10Dzahn: [C: 032] adjust wikimania regex for mobile hosts, cover 2002-2099 [puppet] - 10https://gerrit.wikimedia.org/r/337893 (https://phabricator.wikimedia.org/T152882) (owner: 10Dzahn)
[23:17:26] <wikibugs>	 (03PS6) 10Dzahn: adjust wikimania regex for mobile hosts, cover 2002-2099 [puppet] - 10https://gerrit.wikimedia.org/r/337893 (https://phabricator.wikimedia.org/T152882)
[23:18:43] <logmsgbot>	 !log maxsem@tin Finished scap: Another time, just ot make sure some files synched cuz lat time there were some mid-air collisions (duration: 15m 44s)
[23:18:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:20:10] <icinga-wm>	 PROBLEM - puppet last run on ms-be3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:20:40] <icinga-wm>	 RECOVERY - puppet last run on mw1218 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[23:21:50] <icinga-wm>	 PROBLEM - puppet last run on db1076 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:21:58] <mutante>	 hashar: alright, we can do the umask change if you are still here to confirm it :)
[23:22:08] <hashar>	 yeah
[23:22:15] <paladox>	 im getting a ton of errors on the labs phabricator instance
[23:22:18] <paladox>	 with stuff like
[23:22:19] <paladox>	 Feb 16 23:16:46 phabricator systemd[1]: [/lib/systemd/system/keyholder-proxy.service:12] Unknown lvalue 'ExecPre' in section 'Service'
[23:22:24] <wikibugs>	 (03PS6) 10Dzahn: jenkins: support umask via service default [puppet] - 10https://gerrit.wikimedia.org/r/337377 (owner: 10Hashar)
[23:22:44] <paladox>	 Feb 16 23:18:28 phabricator nslcd[2298]: [adea3d] <group/member="Debian-exim"> ldap_start_tls_s() failed (uri=ldap://ldap-labs.eqiad.wikimedia.org:389): Can't contact LDAP server: Connection timed out
[23:22:45] <paladox>	 and ^^
[23:22:46] <paladox>	 mutante ^^
[23:22:47] <mutante>	 paladox: ? _after_ all the testing you just did?
[23:22:57] <mutante>	 oh, LDAP server
[23:22:58] <paladox>	 not realted to any of the stuff we did
[23:22:58] <hashar>	 mutante: though I would rather not restart JEnkins now to double check :]
[23:24:00] <icinga-wm>	 RECOVERY - puppet last run on mc1030 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[23:24:04] <mutante>	 hashar: better another time?
[23:24:12] <hashar>	 I mean
[23:24:15] <hashar>	 the change can land
[23:24:33] <hashar>	 just have to verify that sourcing  /etc/default/jenkins has  UMASK properly set
[23:24:48] <hashar>	 then the init.d will catch it
[23:24:51] <hashar>	 and pass --umask
[23:24:58] <hashar>	 can verify tomorrow
[23:25:00] <mutante>	 paladox: is this on all instances or just one?
[23:25:11] <paladox>	 mutante only seen it on one
[23:25:15] * paladox checks the others
[23:25:32] <mutante>	 hashar: ok, i can land it, but i won't be here tomorrow
[23:26:04] <mutante>	 so .. maybe better to do that together then?
[23:26:10] <icinga-wm>	 PROBLEM - puppet last run on mw1219 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:26:29] <hashar>	 we can land the change, and I will restart jenkins tomorrow to triple confirm
[23:26:33] <hashar>	 but I am not worrying :]
[23:26:35] <mutante>	 ok
[23:26:37] <hashar>	 but I am not worried
[23:26:44] <paladox>	 i doint see it on the other instances
[23:26:44] <wikibugs>	 (03CR) 10Dzahn: [C: 032] jenkins: support umask via service default [puppet] - 10https://gerrit.wikimedia.org/r/337377 (owner: 10Hashar)
[23:26:51] <mutante>	 paladox: firewalling?
[23:26:59] <mutante>	 paladox: any changes?
[23:26:59] <hashar>	 I am afraid of having to fix up jenkins at 1am :]
[23:27:17] <paladox>	 Maybe, only change i did was change vcs ip to floating ip.
[23:27:20] <paladox>	 for ssh.
[23:27:33] <hashar>	 mutante: thanks for all the patches review this week :]
[23:27:43] <mutante>	 paladox: probably related to getting the new IP and firewalling
[23:27:55] <mutante>	 paladox: try to connect to it manually with telnet or nc
[23:28:02] <paladox>	 ok
[23:28:11] <mutante>	 hashar: yw :)
[23:28:17] <hashar>	 oh
[23:28:24] <paladox>	 telnet ldap://ldap-labs.eqiad.wikimedia.org 389
[23:28:24] <hashar>	 I can actually test on cont2001 :)
[23:28:43] <mutante>	 yes to both :)
[23:28:53] <paladox>	 telnet ldap-labs.eqiad.wikimedia.org 389
[23:28:53] <paladox>	 Trying 208.80.154.79...
[23:28:53] <paladox>	 telnet: Unable to connect to remote host: No route to host
[23:28:56] <mutante>	 paladox: well, without the protocol
[23:29:00] <icinga-wm>	 RECOVERY - puppet last run on francium is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[23:29:24] <mutante>	 -# UMASK=027
[23:29:24] <mutante>	 +UMASK=0002
[23:29:31] <hashar>	 --umask=0002
[23:29:34] <hashar>	 on contint2001 :)
[23:29:44] <mutante>	 applied on 1001
[23:30:16] <hashar>	 so should be fine
[23:30:23] <hashar>	 I will restart Jenkins on contint1001 tomorrow
[23:30:56] <hashar>	 and the logrotate for /var/log/jenkins/access.log works!
[23:32:08] <wikibugs>	 (03CR) 10JGirault: gerrit: Make blue buttons look like OOUI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/337397 (https://phabricator.wikimedia.org/T158298) (owner: 10Ladsgroup)
[23:32:21] <mutante>	 hashar: :) ok, nice
[23:32:30] <hashar>	 enjoy your week-end! :]
[23:32:34] <hashar>	 I am heading to bed
[23:32:39] <mutante>	 you too, it will be long over here
[23:32:49] <mutante>	 "president's day" .. great timing for that, heh
[23:33:06] <mutante>	 good night hashar, bye
[23:34:09] <hashar>	 :]
[23:41:57] <wikibugs>	 (03PS1) 10RobH: update nithum's ssh pub key [puppet] - 10https://gerrit.wikimedia.org/r/338291
[23:43:00] <icinga-wm>	 PROBLEM - puppet last run on mc1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:43:16] <apergos>	 Amir1: due to other work, your html fixes will go out tomorrow evening rather than today
[23:43:29] <apergos>	 I just wound up my regular work for the night (almost 2 am)
[23:43:35] <apergos>	 sorry for the delay
[23:45:36] <wikibugs>	 (03CR) 10RobH: [C: 032] update nithum's ssh pub key [puppet] - 10https://gerrit.wikimedia.org/r/338291 (owner: 10RobH)
[23:48:11] <icinga-wm>	 RECOVERY - puppet last run on ms-be3001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[23:49:35] <wikibugs>	 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 13Patch-For-Review: Cluster Access for Nithum Thain - https://phabricator.wikimedia.org/T157724#3035128 (10RobH) The new key is now live, it can take up to 30 minutes for all affected hosts to call in for the change.
[23:50:50] <icinga-wm>	 RECOVERY - puppet last run on db1076 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[23:54:10] <icinga-wm>	 RECOVERY - puppet last run on mw1219 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures